Reinforcement Learning

State is a function of history, and we have three different kinds of state

  1. Environment State: it decides what is the next state; and it is invisible to agent
  2. Agent State: it is used by the algorithm
  3. Information State or Markov State: mathematical definition $\mathbb{P}[s_{t+1} \mid s_t]$

Markov means future is independent of current given past.

Two things to remember:

  • environment state is Markov
  • history is Markov


Fully observability

agent is aware of the whole environment

environment state = agent state = information state

$\Longrightarrow$ this is a Markov Decision Process (MDP)

This is the case where we know all the state/action and all other things, and if we treat this as a dynamic processing, we can come up with a closed-form solution.

Partial observability

POMDP: agent can observe only part of the environment, it will build up its own state presentation.

  • complete history
  • belief of environment state: $S_t^a = ( \mathbb{P}[S_t^e =S^1], \mathbb{P}[S_t^e =S^2], …, \mathbb{P}[S_t^e =S^n] ) $, where n is the number of different state.
  • RNN: $S_t^a = \sigma( S_{t-1}^a w_s + O_t W_o ) $, where $O_t$ is the environment state at time t.

RL Agent

policy: a map from state space to action space

  • deterministic policy: $a=\pi(s)$
  • stochastic policy: $\pi(a s) = \mathbb{P}[a \mid s]$


prediction: get the value function

control: get the policy

David Silver 和 Eric Xing 笔记

policy: 是状态空间到动作空间的映射

reward function: 将每一个状态或者状态-动作对,映射到一个实数,这个实数就是reward

value function: 一个状态的value就是,从这个状态开始的total expected reward

RL的目标是为了找到一个policy,能够最大化value function。 解决RL问题可以分三大类或者三个阶段:

  1. DP,比如value iteration和policy iteration是更加经典的算法
  2. MC
  3. TD,比如Q-learning和Sarsa

Dynamic Process Method (动态规划法)

动态规划的核心是找出动态转移方程,而在这里就是Bellman Equation

  1. policy iteration (策略迭代) 使用policy evaluation获得状态函数值, 然后根据policy improvement选择更优的策略。
  2. value iteration (值迭代) policy iteration需要把所有的状态扫描多次,冗余的计算太多。 通过Sutton Intro to RL 里面的92页和96页对比可以知道, 在policy evaluation和policy improvement中都有根据当前的action, 得到下一个状态, 从而估计下一步的value function。 (policy improvement会依据这个选择最优策略) 可以理解是使用迭代的方式来逼近Bellman Equation。

Monte Carlo Method (蒙特卡洛法)

使用蒙特卡洛的一个大前提是任务都是episodic或者是finite的。 也就是我只要往前走,肯定能得到结果。

Temporal Difference Method (时间差分法)



  • 需要环境模型,即为状态转移概率 $P_{sa}$
  • 状态值函数的估计是自举的(bootstraping),当前状态函数的更新依赖于已知的其他状态值函数


  • model-free
  • 状态值的估计相互独立
  • episodic task

希望能够model-free model,同时也不限制episodic task。

Temporal Difference learning可以分为两种。

  1. Q-learnig,off-policy learning (can follow any policy) 比较current state和best next possible state。
  2. SARSA, on-policy learning (follows the policy that is learning) 比较current state和actual next state。


  1. 在SARSA中,我们使用同样的policy来产生下一个action; 而在Q-learning中,当前state产生的action不是确定的,也就是没有固定的policy,而是根据optimize的结果选取下一个action $\underset{a}{max} Q(s_{t+1}, a)$。 如果对应到公式中,那么SARSA使用的$Q(s_{t+1}, a_{t+1})$是固定的,而Q-learning中则用的是$\underset{a}{max} Q(s_{t+1}, a)$。
  2. Q-learning算exact solution,而SARSA找到的是approximate solution。

Policy Gradient

RL有三种形式的policy function

  1. episodic
  2. continuing
  3. average reward per timestep

Useful Materials for Reinforcement Learning