Reinforcement Learning

State is a function of the history, and there are three different kinds of state:

  1. Environment State: the environment's private representation, which determines the next observation; it is usually invisible to the agent
  2. Agent State: the information the agent's algorithm uses to pick its next action
  3. Information State (Markov State): a state $s_t$ is Markov if and only if $\mathbb{P}[s_{t+1} \mid s_t] = \mathbb{P}[s_{t+1} \mid s_1, \dots, s_t]$

Markov means the future is independent of the past given the present.

Two things to remember:

  • environment state is Markov
  • history is Markov

Observability

Full observability

the agent directly observes the full environment state

environment state = agent state = information state

$\Longrightarrow$ this is a Markov Decision Process (MDP)

This is the case where we know the full state and action spaces and the dynamics, so if we treat it as a dynamic-programming problem we can compute an exact solution.

Partial observability

POMDP: the agent observes only part of the environment, so it must build its own state representation, for example:

  • complete history
  • belief of the environment state: $S_t^a = ( \mathbb{P}[S_t^e = S^1], \mathbb{P}[S_t^e = S^2], \dots, \mathbb{P}[S_t^e = S^n] )$, where $n$ is the number of distinct states
  • RNN: $S_t^a = \sigma( S_{t-1}^a W_s + O_t W_o )$, where $O_t$ is the observation at time $t$ (see the sketch below)
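A minimal sketch (Python/NumPy) of the recurrent agent-state update above; the dimensions, the weight names `W_s` and `W_o`, and the sigmoid nonlinearity are illustrative assumptions matching the formula, not a prescribed implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes: agent state of dimension d_s, observation of dimension d_o.
d_s, d_o = 8, 4
rng = np.random.default_rng(0)
W_s = rng.normal(scale=0.1, size=(d_s, d_s))  # recurrent weights
W_o = rng.normal(scale=0.1, size=(d_o, d_s))  # observation weights

def update_agent_state(s_prev, o_t):
    """One recurrent agent-state update: S_t^a = sigma(S_{t-1}^a W_s + O_t W_o)."""
    return sigmoid(s_prev @ W_s + o_t @ W_o)

s_a = np.zeros(d_s)           # initial agent state
o_t = rng.normal(size=d_o)    # one observation O_t
s_a = update_agent_state(s_a, o_t)
```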

RL Agent

policy: a map from state space to action space

  • deterministic policy: $a=\pi(s)$
  • stochastic policy: $\pi(a \mid s) = \mathbb{P}[a \mid s]$ (a toy example follows this list)
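A toy illustration (Python/NumPy, made-up numbers) of the two kinds of policy on a small finite state/action space:

```python
import numpy as np

n_actions = 2

# Deterministic policy a = pi(s): one fixed action per state (3 states here).
pi_det = np.array([0, 1, 1])

# Stochastic policy pi(a|s) = P[a|s]: one distribution over actions per state.
pi_stoch = np.array([[0.9, 0.1],
                     [0.5, 0.5],
                     [0.2, 0.8]])   # each row sums to 1

rng = np.random.default_rng(0)
s = 1
a_det = pi_det[s]                               # always the same action in state s
a_stoch = rng.choice(n_actions, p=pi_stoch[s])  # action sampled from pi(.|s)
```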

Tips

prediction: evaluate a given policy, i.e. estimate its value function

control: find the optimal policy

Notes from David Silver and Eric Xing

policy: a map from the state space to the action space

reward function: maps each state or state-action pair to a real number, the reward

value function: the value of a state is the total expected reward starting from that state

The goal of RL is to find a policy that maximizes the value function. Methods for solving RL problems fall into three broad classes (or stages):

  1. DP; the classic algorithms here are value iteration and policy iteration
  2. MC
  3. TD, e.g. Q-learning and SARSA

Dynamic Programming Method

The core of dynamic programming is finding the recurrence (transition) equation, which here is the Bellman equation.

  1. policy iteration: use policy evaluation to compute the state-value function, then use policy improvement to pick a better policy. \(v(s) = \sum_{s'} p(s'|s,\pi(s)) [r(s,\pi(s),s') + \gamma v(s')]\) \(\pi(s) = \arg\max_a \sum_{s'} p(s'|s,a)[r(s,a,s') + \gamma v(s')]\)
  2. value iteration: policy iteration has to sweep over all states several times, which involves a lot of redundant computation; value iteration avoids this by folding the improvement step into a single max backup. \(v(s) = \max_a \sum_{s'} p(s'|s,a) [r(s,a,s') + \gamma v(s')]\) \(\pi(s) = \arg\max_a \sum_{s'} p(s'|s,a)[r(s,a,s') + \gamma v(s')]\) Comparing pages 92 and 96 of Sutton's Intro to RL shows that both policy evaluation and policy improvement take the current action, look at the resulting next states, and use them to estimate the next-step value function (policy improvement then uses this estimate to pick the better policy). Both can be understood as iteratively approaching the Bellman equation; a code sketch follows this list.
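A minimal value-iteration sketch for a tabular MDP, assuming the dynamics are given as arrays `P[s, a, s2]` and `R[s, a, s2]` (the array layout and names are illustrative assumptions, not from the source):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Value iteration for a tabular MDP.
    P[s, a, s2] = p(s2 | s, a), R[s, a, s2] = r(s, a, s2).
    Returns the value function v and the greedy policy pi."""
    n_states = P.shape[0]
    v = np.zeros(n_states)
    while True:
        # Bellman optimality backup: v(s) = max_a sum_{s2} p(s2|s,a) [r + gamma v(s2)]
        q = (P * (R + gamma * v)).sum(axis=2)   # q[s, a]
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    return v, q.argmax(axis=1)                  # pi(s) = argmax_a q(s, a)
```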

Monte Carlo Method

A major prerequisite for Monte Carlo methods is that the task is episodic, i.e. finite: as long as we keep stepping forward, the episode is guaranteed to terminate and yield a return.
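A sketch of every-visit Monte Carlo prediction for a fixed policy, assuming episodes are supplied as lists of (state, reward) pairs; this interface is made up for illustration.

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=1.0):
    """Every-visit Monte Carlo evaluation of the policy that generated `episodes`.
    Each episode is a list of (state, reward) pairs, where `reward` is received
    after leaving `state`; a state's value is the average return observed from it."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:
        G = 0.0
        # Walk backwards so G accumulates the discounted return from each step.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_sum[state] += G
            returns_cnt[state] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}

# Example: two short episodes over states "A" and "B".
v = mc_prediction([[("A", 0.0), ("B", 1.0)], [("B", 0.0), ("B", 1.0)]])
```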

Temporal Difference Method

To summarize the characteristics of the DP and MC methods:

DP

  • requires a model of the environment, i.e. the state-transition probabilities $P_{sa}$
  • the value-function estimates are bootstrapped: updating the value of the current state relies on the current estimates of other states' values

MC

  • model-free
  • the state-value estimates are independent of each other
  • restricted to episodic tasks

We would like a method that is model-free and, at the same time, not restricted to episodic tasks.

Temporal Difference learning comes in two flavors:

  1. Q-learning, off-policy learning (it can follow any policy): compares the current estimate against the best possible next state. \(Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]\)
  2. SARSA, on-policy learning (it follows the policy it is learning): compares the current estimate against the actual next state and action taken. \(Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ]\) Both updates are shown as code after this list.
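The two update rules written side by side as code, using a hypothetical tabular Q array and illustrative values for `alpha` and `gamma`:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy target: the best possible next action, max_a' Q(s', a').
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy target: the action actually taken next, Q(s', a').
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

# Example with a tiny tabular Q function (4 states, 2 actions).
Q = np.zeros((4, 2))
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
```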

The difference between the two is explained well here.

  1. In SARSA we use the same policy to generate the next action, whereas in Q-learning the action taken from the next state is not fixed by the policy being followed; the next action is instead the one chosen by the maximization $\max_a Q(s_{t+1}, a)$. In the formulas, SARSA uses the fixed $Q(s_{t+1}, a_{t+1})$, while Q-learning uses $\max_a Q(s_{t+1}, a)$.
  2. Q-learning aims for the exact (optimal) solution, whereas SARSA finds an approximate solution.

Policy Gradient

In RL, the policy objective function takes three forms (written out after this list):

  1. episodic
  2. continuing
  3. average reward per timestep
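Following David Silver's policy-gradient lecture (the notation below, including the stationary distribution $d^{\pi_\theta}(s)$ of the Markov chain induced by $\pi_\theta$, is assumed from there), these correspond to:

  • start value (episodic environments): $J_1(\theta) = V^{\pi_\theta}(s_1)$
  • average value (continuing environments): $J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s) V^{\pi_\theta}(s)$
  • average reward per time-step: $J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s) \mathcal{R}_s^a$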

Useful Materials for Reinforcement Learning