Reinforcement Learning

State is a function of history, and we have three different kinds of state

  1. Environment State: it decides what is the next state; and it is invisible to agent
  2. Agent State: it is used by the algorithm
  3. Information State or Markov State: mathematical definition $\mathbb{P}[s_{t+1} \mid s_t]$

Markov means future is independent of current given past.

Two things to remember:

  • environment state is Markov
  • history is Markov


Fully observability

agent is aware of the whole environment

environment state = agent state = information state

$\Longrightarrow$ this is a Markov Decision Process (MDP)

This is the case where we know all the state/action and all other things, and if we treat this as a dynamic processing, we can come up with a closed-form solution.

Partial observability

POMDP: agent can observe only part of the environment, it will build up its own state presentation.

  • complete history
  • belief of environment state: $S_t^a = ( \mathbb{P}[S_t^e =S^1], \mathbb{P}[S_t^e =S^2], …, \mathbb{P}[S_t^e =S^n] ) $, where n is the number of different state.
  • RNN: $S_t^a = \sigma( S_{t-1}^a w_s + O_t W_o ) $, where $O_t$ is the environment state at time t.

RL Agent

policy: a map from state space to action space

  • deterministic policy: $a=\pi(s)$
  • stochastic policy: $\pi(a s) = \mathbb{P}[a \mid s]$


prediction: get the value function

control: get the policy

David Silver 和 Eric Xing 笔记

policy: 是状态空间到动作空间的映射

reward function: 将每一个状态或者状态-动作对,映射到一个实数,这个实数就是reward

value function: 一个状态的value就是,从这个状态开始的total expected reward

RL的目标是为了找到一个policy,能够最大化value function。 解决RL问题可以分三大类或者三个阶段:

  1. DP,比如value iteration和policy iteration是更加经典的算法
  2. MC
  3. TD,比如Q-learning和Sarsa

Dynamic Process Method (动态规划法)

动态规划的核心是找出动态转移方程,而在这里就是Bellman Equation

  1. policy iteration (策略迭代) 使用policy evaluation获得状态函数值, 然后根据policy improvement选择更优的策略。 \(v(s) = \sum_{s'} p(s'|s,\pi(s)) [r(s,\pi(s),s') + \gamma v(s')]\) \(\pi(s) = argmax_a \sum_{s'} p(s'|s,a)[r(s,a,s') + \gamma v(s')]\)
  2. value iteration (值迭代) policy iteration需要把所有的状态扫描多次,冗余的计算太多。 \(v(s) = max_a \sum_{s'} p(s'|s,a)) [r(s,a,s') + \gamma v(s')]\) \(\pi(s) = argmax_a \sum_{s'} p(s'|s,a)[r(s,a,s') + \gamma v(s')]\) 通过Sutton Intro to RL 里面的92页和96页对比可以知道, 在policy evaluation和policy improvement中都有根据当前的action, 得到下一个状态, 从而估计下一步的value function。 (policy improvement会依据这个选择最优策略) 可以理解是使用迭代的方式来逼近Bellman Equation。

Monte Carlo Method (蒙特卡洛法)

使用蒙特卡洛的一个大前提是任务都是episodic或者是finite的。 也就是我只要往前走,肯定能得到结果。

Temporal Difference Method (时间差分法)



  • 需要环境模型,即为状态转移概率 $P_{sa}$
  • 状态值函数的估计是自举的(bootstraping),当前状态函数的更新依赖于已知的其他状态值函数


  • model-free
  • 状态值的估计相互独立
  • episodic task

希望能够model-free model,同时也不限制episodic task。

Temporal Difference learning可以分为两种。

  1. Q-learnig,off-policy learning (can follow any policy) 比较current state和best next possible state。 \(Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_{t+1} + \gamma \underset{a}{max} Q(s_{t+1}, a) - Q(s_t, a_t) ]\)
  2. SARSA, on-policy learning (follows the policy that is learning) 比较current state和actual next state。 \(Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ]\)


  1. 在SARSA中,我们使用同样的policy来产生下一个action; 而在Q-learning中,当前state产生的action不是确定的,也就是没有固定的policy,而是根据optimize的结果选取下一个action $\underset{a}{max} Q(s_{t+1}, a)$。 如果对应到公式中,那么SARSA使用的$Q(s_{t+1}, a_{t+1})$是固定的,而Q-learning中则用的是$\underset{a}{max} Q(s_{t+1}, a)$。
  2. Q-learning算exact solution,而SARSA找到的是approximate solution。

Policy Gradient

RL有三种形式的policy function

  1. episodic
  2. continuing
  3. average reward per timestep

Useful Materials for Reinforcement Learning