Reinforcement Learning

State is a function of history, and there are three different kinds of state:

1. Environment State: it determines what happens next (next observation and reward); it is usually not visible to the agent
2. Agent State: the internal representation actually used by the agent's algorithm
3. Information State or Markov State: mathematical definition, a state is Markov iff $\mathbb{P}[s_{t+1} \mid s_t] = \mathbb{P}[s_{t+1} \mid s_1, \ldots, s_t]$

Markov means the future is independent of the past given the present.

Two things to remember:

• environment state is Markov
• history is Markov

Observability

Full observability

the agent directly observes the environment state

environment state = agent state = information state

$\Longrightarrow$ this is a Markov Decision Process (MDP)

This is the case where the states, actions, transition dynamics, and rewards are all known; treating it as a dynamic-programming problem, we can solve it exactly.

Partial observability

POMDP: the agent can observe only part of the environment, so it builds up its own state representation, for example:

• the complete history
• a belief over the environment state: $S_t^a = ( \mathbb{P}[S_t^e = S^1], \mathbb{P}[S_t^e = S^2], \ldots, \mathbb{P}[S_t^e = S^n] )$, where n is the number of distinct states
• RNN: $S_t^a = \sigma( S_{t-1}^a W_s + O_t W_o )$, where $O_t$ is the observation at time t (both updates are sketched in code below)
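A minimal numpy sketch of these last two constructions, assuming a small discrete POMDP; the array shapes and the weight names `W_s`, `W_o` are illustrative, not from the lecture:

```python
import numpy as np

def belief_update(belief, obs_likelihood, transition):
    """Bayes-filter update of the belief over n environment states.

    belief:         shape (n,)    current P[S_t^e = S^i]
    obs_likelihood: shape (n,)    P[O_t | S_t^e = S^i] for the observation just received
    transition:     shape (n, n)  P[S_t^e = S^j | S_{t-1}^e = S^i]
    """
    predicted = belief @ transition        # propagate the belief through the dynamics
    updated = predicted * obs_likelihood   # weight by how well each state explains O_t
    return updated / updated.sum()         # renormalise to a probability vector

def rnn_state_update(prev_state, obs, W_s, W_o):
    """Recurrent agent state: S_t^a = sigma(S_{t-1}^a W_s + O_t W_o)."""
    return np.tanh(prev_state @ W_s + obs @ W_o)   # tanh stands in for the nonlinearity sigma
```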

RL Agent

policy: a map from state space to action space

• deterministic policy: $a=\pi(s)$
• stochastic policy: $\pi(a \mid s) = \mathbb{P}[a \mid s]$
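As a toy illustration (the tables below are made-up numbers, not from the notes), the two kinds of policy over a small discrete state/action space could look like:

```python
import numpy as np

n_states, n_actions = 4, 2
rng = np.random.default_rng(0)

# Deterministic policy a = pi(s): a simple lookup table from state to action.
pi_det = np.array([0, 1, 1, 0])
def act_deterministic(s):
    return pi_det[s]

# Stochastic policy pi(a|s) = P[a|s]: a row-stochastic table we sample from.
pi_sto = np.array([[0.9, 0.1],
                   [0.2, 0.8],
                   [0.5, 0.5],
                   [0.7, 0.3]])
def act_stochastic(s):
    return rng.choice(n_actions, p=pi_sto[s])
```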

Tips

prediction: evaluate a given policy, i.e. estimate its value function

control: find the optimal policy

Notes from David Silver and Eric Xing

policy: a mapping from the state space to the action space

reward function: maps each state or state-action pair to a real number, the reward

value function: the value of a state is the total expected reward starting from that state

The goal of RL is to find a policy that maximizes the value function. Approaches to solving RL problems fall into three broad categories (or stages):

1. DP, with value iteration and policy iteration as the classic algorithms
2. MC
3. TD, e.g. Q-learning and Sarsa

Dynamic Programming Methods

1. Policy iteration: use policy evaluation to obtain the state-value function, then use policy improvement to select a better policy. $v(s) = \sum_{s'} p(s'|s,\pi(s)) [r(s,\pi(s),s') + \gamma v(s')]$ $\pi(s) = \arg\max_a \sum_{s'} p(s'|s,a)[r(s,a,s') + \gamma v(s')]$
2. Value iteration: policy iteration sweeps over all states many times, with a lot of redundant computation; value iteration folds the improvement step into the backup. $v(s) = \max_a \sum_{s'} p(s'|s,a) [r(s,a,s') + \gamma v(s')]$ $\pi(s) = \arg\max_a \sum_{s'} p(s'|s,a)[r(s,a,s') + \gamma v(s')]$ Comparing pages 92 and 96 of Sutton's Intro to RL shows that both policy evaluation and policy improvement take the current action, obtain the next state, and from it estimate the next-step value function (policy improvement then picks the optimal policy accordingly). Both can be viewed as iteratively approximating the Bellman equation (see the sketch below).
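A minimal tabular value-iteration sketch of the backup above, assuming the model is given as per-action transition and reward arrays (the variable names and shapes are illustrative):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """P[a][s, s'] = p(s'|s,a), R[a][s, s'] = r(s,a,s'); both assumed known (model-based DP)."""
    n_actions, n_states = len(P), P[0].shape[0]
    v = np.zeros(n_states)
    while True:
        # q[a, s] = sum_{s'} p(s'|s,a) [ r(s,a,s') + gamma * v(s') ]
        q = np.array([(P[a] * (R[a] + gamma * v[None, :])).sum(axis=1)
                      for a in range(n_actions)])
        v_new = q.max(axis=0)              # Bellman optimality backup: v(s) = max_a q(a, s)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    policy = q.argmax(axis=0)              # greedy policy: pi(s) = argmax_a q(a, s)
    return v, policy
```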

Temporal Difference Methods

DP

• requires a model of the environment, i.e., the state-transition probabilities $P_{sa}$
• the state-value estimates are bootstrapped: updating the current state's value relies on the already-estimated values of other states

MC

• model-free
• the state-value estimates are independent of one another (a short MC prediction sketch follows)
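A first-visit Monte Carlo prediction sketch, assuming a hypothetical `sample_episode()` helper that runs the current policy to termination and returns a list of (state, reward) pairs:

```python
from collections import defaultdict

def mc_prediction(sample_episode, n_episodes=1000, gamma=0.9):
    """Model-free: each state's value is the average of full returns observed from it."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for _ in range(n_episodes):
        episode = sample_episode()            # assumed to return [(s_0, r_1), (s_1, r_2), ...]
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G                 # accumulate the return backwards
            if s not in (step[0] for step in episode[:t]):   # first-visit check
                returns_sum[s] += G
                returns_cnt[s] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```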

Temporal Difference learning comes in two flavors.

1. Q-learning, off-policy learning (can follow any behavior policy): compares the current state with the best possible next state. $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]$
2. SARSA, on-policy learning (follows the policy that is being learned): compares the current state with the actual next state. $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ]$ (both update rules are sketched in code after the comparison below)

1. In SARSA, the same policy is used to generate the next action; in Q-learning, the action taken from the current state is not produced by a fixed policy, the next action is instead chosen by the maximization $\max_a Q(s_{t+1}, a)$. In the formulas, SARSA uses the actual $Q(s_{t+1}, a_{t+1})$, whereas Q-learning uses $\max_a Q(s_{t+1}, a)$.
2. Q-learning targets the exact (optimal) solution, whereas SARSA finds an approximate solution, the value of the policy it actually follows.
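A minimal tabular sketch of the two update rules, assuming `Q` is an (n_states, n_actions) array; the function names and default step sizes are illustrative:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the best next action, max_a' Q(s', a')."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action a' the behavior policy actually takes."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
```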