Reinforcement Learning
The state is a function of the history, and there are three different kinds of state:
- Environment State: decides what the next state will be; it is invisible to the agent
- Agent State: the internal state used by the agent's algorithm
- Information State or Markov State: a state $s_t$ satisfying $\mathbb{P}[s_{t+1} \mid s_t] = \mathbb{P}[s_{t+1} \mid s_1, \dots, s_t]$
Markov means the future is independent of the past given the present.
Two things to remember:
- the environment state is Markov
- the complete history is Markov
Observability
Full observability
the agent directly observes the environment state
environment state = agent state = information state
$\Longrightarrow$ this is a Markov Decision Process (MDP)
This is the case where the full state and action spaces and the dynamics are known; treating it as a dynamic programming problem, we can solve it exactly.
Partial observability
POMDP: the agent observes only part of the environment, so it must build its own state representation, for example:
- complete history
- belief of the environment state: $S_t^a = ( \mathbb{P}[S_t^e = S^1], \mathbb{P}[S_t^e = S^2], \dots, \mathbb{P}[S_t^e = S^n] )$, where $n$ is the number of distinct environment states.
- RNN: $S_t^a = \sigma( S_{t-1}^a W_s + O_t W_o )$, where $O_t$ is the observation at time $t$.
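A minimal NumPy sketch of the RNN-style agent-state update above; the dimensions, weight matrices, and the sigmoid nonlinearity are illustrative assumptions, not a specific architecture from the lectures.

```python
import numpy as np

def rnn_agent_state(prev_state, obs, W_s, W_o):
    """One step of the agent-state recurrence: S_t^a = sigma(S_{t-1}^a W_s + O_t W_o)."""
    pre_activation = prev_state @ W_s + obs @ W_o
    return 1.0 / (1.0 + np.exp(-pre_activation))  # elementwise sigmoid

# toy dimensions (assumed): 4-dim agent state, 3-dim observation
rng = np.random.default_rng(0)
W_s = rng.normal(size=(4, 4))
W_o = rng.normal(size=(3, 4))
s_a = np.zeros(4)                # initial agent state S_0^a
o_t = np.array([1.0, 0.0, 0.5])  # current observation O_t
s_a = rnn_agent_state(s_a, o_t, W_s, W_o)
```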
RL Agent
policy: a map from state space to action space
- deterministic policy: $a=\pi(s)$
- stochastic policy: $\pi(a \mid s) = \mathbb{P}[a \mid s]$
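As a concrete illustration of the two policy types, here is a minimal tabular sketch; the state/action indexing and the probability table are assumptions made for the example.

```python
import numpy as np

# deterministic policy a = pi(s): a plain lookup table over toy state ids
pi_det = {0: 1, 1: 0, 2: 1}

def act_deterministic(state):
    return pi_det[state]

# stochastic policy pi(a|s) = P[a|s]: each row is a distribution over 2 actions
pi_stoch = np.array([[0.9, 0.1],
                     [0.2, 0.8],
                     [0.5, 0.5]])

def act_stochastic(state, rng=np.random.default_rng(0)):
    return rng.choice(len(pi_stoch[state]), p=pi_stoch[state])
```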
Tips
prediction: evaluate the value function of a given policy
control: find the optimal policy
Notes from David Silver and Eric Xing
policy: a mapping from the state space to the action space
reward function: maps each state or each state-action pair to a real number, the reward
value function: the value of a state is the total expected reward starting from that state
The goal of RL is to find a policy that maximizes the value function. Methods for solving RL problems fall into three broad families (or stages):
- DP (dynamic programming), e.g. the classic value iteration and policy iteration algorithms
- MC (Monte Carlo)
- TD (temporal difference), e.g. Q-learning and SARSA
Dynamic Programming Method
The core of dynamic programming is to find the recurrence relation for the update, which here is the Bellman equation.
- policy iteration: use policy evaluation to compute the state-value function, then use policy improvement to select a better policy. \(v(s) = \sum_{s'} p(s'|s,\pi(s)) [r(s,\pi(s),s') + \gamma v(s')]\) \(\pi(s) = \arg\max_a \sum_{s'} p(s'|s,a)[r(s,a,s') + \gamma v(s')]\)
- value iteration: policy iteration sweeps over all states many times and does a lot of redundant computation; value iteration folds the improvement step into the value update. \(v(s) = \max_a \sum_{s'} p(s'|s,a) [r(s,a,s') + \gamma v(s')]\) \(\pi(s) = \arg\max_a \sum_{s'} p(s'|s,a)[r(s,a,s') + \gamma v(s')]\) Comparing pages 92 and 96 of Sutton's Intro to RL, both policy evaluation and policy improvement use the current action to reach the next state and estimate the next-step value function (policy improvement then selects the optimal policy based on it); both can be seen as iteratively approximating the Bellman equation. A sketch of value iteration follows this list.
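Below is a minimal sketch of tabular value iteration following the equations above, assuming the transition probabilities p(s'|s,a) and rewards r(s,a,s') are given as NumPy arrays; the array layout and tolerance are illustrative choices.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Tabular value iteration.

    P[s, a, s1] = p(s1|s, a), R[s, a, s1] = r(s, a, s1).
    Returns the value function v and the greedy policy pi.
    """
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)
    while True:
        # q(s, a) = sum_{s'} p(s'|s,a) * [r(s,a,s') + gamma * v(s')]
        q = np.einsum('sat,sat->sa', P, R + gamma * v[None, None, :])
        v_new = q.max(axis=1)          # v(s) = max_a q(s, a)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    pi = q.argmax(axis=1)              # pi(s) = argmax_a q(s, a)
    return v, pi
```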
Monte Carlo Method
A major prerequisite for Monte Carlo methods is that the task is episodic, i.e. finite: as long as the agent keeps acting, the episode is guaranteed to terminate and yield a return.
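A sketch of first-visit Monte Carlo prediction under this episodic assumption; `sample_episode` and the `(state, reward)` episode format are hypothetical stand-ins for whatever environment interface is used.

```python
from collections import defaultdict

def mc_prediction(sample_episode, policy, gamma=1.0, n_episodes=1000):
    """First-visit Monte Carlo estimate of v_pi for an episodic task.

    sample_episode(policy) is assumed to return one finished episode as a
    list of (state, reward) pairs, where the reward follows the state.
    """
    returns = defaultdict(list)
    for _ in range(n_episodes):
        episode = sample_episode(policy)
        # compute the return G_t backwards through the episode
        G, Gs = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            Gs[t] = G
        # record the return only at each state's first visit
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns[s].append(Gs[t])
    return {s: sum(g) / len(g) for s, g in returns.items()}
```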
Temporal Difference Method
To summarize the characteristics of the DP and MC methods:
DP
- requires a model of the environment, namely the state-transition probabilities $P_{sa}$
- the state-value estimates are bootstrapped: updating the value of the current state relies on the current estimates of other states' values
MC
- model-free
- the state-value estimates are independent of each other
- restricted to episodic tasks
We would like a method that is model-free and, at the same time, not restricted to episodic tasks.
Temporal difference learning comes in two flavors.
- Q-learning, off-policy learning (can follow any behavior policy): compares the current estimate against the best possible next state-action value. \(Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]\)
- SARSA, on-policy learning (follows the policy it is learning): compares the current estimate against the value of the next state-action pair actually taken. \(Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ]\)
The difference between the two is explained well here.
- In SARSA, the same policy is used to generate the next action; in Q-learning, the action for the next state is not drawn from a fixed policy but is chosen by the maximization $\max_a Q(s_{t+1}, a)$. In the formulas, SARSA uses the actually selected $Q(s_{t+1}, a_{t+1})$, while Q-learning uses $\max_a Q(s_{t+1}, a)$.
- In this sense Q-learning targets the optimal (greedy) action values directly, while SARSA learns the values of the exploratory policy it actually follows, i.e. an approximate solution. A sketch of both update rules follows this list.
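A minimal tabular sketch of the two update rules, assuming Q is a NumPy array indexed by (state, action); the step size and discount values are illustrative.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # off-policy: bootstrap from the best possible next action, max_a Q(s', a)
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # on-policy: bootstrap from the action the current policy actually took
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
```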
Policy Gradient
In policy gradient methods, the policy objective function takes three forms (a REINFORCE sketch follows the list):
- start value (episodic environments)
- average value (continuing environments)
- average reward per time-step
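As a sketch of how such an objective is optimized, here is a minimal REINFORCE update for a tabular softmax policy using the log-derivative trick; the parameterization, episode format, and step size are assumptions, not a method prescribed in the notes above.

```python
import numpy as np

def softmax_policy(theta, s):
    """pi(a|s) for a tabular softmax parameterization theta[s, a]."""
    prefs = theta[s] - theta[s].max()   # subtract max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One REINFORCE step: move theta along G_t * grad log pi(a_t|s_t).

    episode is a list of (state, action, reward) triples from one rollout.
    """
    G = 0.0
    for s, a, r in reversed(episode):
        G = r + gamma * G                 # return from time t
        probs = softmax_policy(theta, s)
        grad_log = -probs                 # d log pi(a|s) / d theta[s, b] = 1[b=a] - pi(b|s)
        grad_log[a] += 1.0
        theta[s] += alpha * G * grad_log
    return theta
```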
Useful Materials for Reinforcement Learning
- Papers/Posters:
- Sutton Reinforcement Learning: An Introduction
- Policy Gradient Methods for Reinforcement Learning with Function Approximation
- DQN: Playing Atari with Deep Reinforcement Learning
- Sutton’s tutorial on NIPS 2015: Introduction to Reinforcement Learning with Function Approximation
- David’s tutorial on ICML 2016: Reinforcement Learning
- Course Materials:
- Eric Xing’s slides, lecture 27 and 28
- David Silver’s slides
- log derivative tricks
- Blogs:
- Andrej Karpathy’s Pong Example
- Difference between on- and off-policy: Comparing Reinforcement Learning Algorithms
- Natasha’s Sequence Tutor blog