CS294-112 Deep Reinforcement Learning Notes

uvuvwevwev

I'm not very comfortable with the LaTeX support here, so for now these notes are maintained on GitHub Pages.

Assignments on GitHub

hw1
hw2

What is "reinforcement learning", and why?

$\text{maximize } r$

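The problem, explained: suppose you are playing "Earth Online" (real life as a game). How do you win at life?

We are constrained by reality: $s \in \mathbb{S}$
- And often we cannot know the full truth of the matter: $o \subset s \text{ or } o \sim s$
What we can do is limited: $a \in \mathbb{A}$
We can judge for ourselves how well things went: $s, a, s' \Rightarrow_{\tau, b} r$
We can interact with the environment! $s, a \Rightarrow_{\pi, p} s'$

Similar to how humans learn
- "If it tastes good, eat more of it": $\text{if } r^\star \uparrow, \text{ then } \uparrow p(s\vert_{r \approx r^\star})$
One possible path toward general AI (GAI)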

Linear Quadratic Regulator (LQR)

An optimal design algorithm
- Control theory
World model: $\mathbf{x}_{t+1} = f(\mathbf{x}_t, \mathbf{u}_t) = \mathbf{F}_t \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix} + \mathbf{f}_t$

The state may include higher-order terms $\dot{x}, \ddot{x}, \cdots$
Cost: $c(\mathbf{x}_t, \mathbf{u}_t) = \frac{1}{2} \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix}^T \mathbf{C}_t \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix} + \begin{bmatrix} \mathbf{x}_t \\ \mathbf{u}_t \end{bmatrix}^T \mathbf{c}_t$
Objective: $\min_\tau \sum_t c(\mathbf{x}_t, \mathbf{u}_t)$
- Note that each $\mathbf{x}_t$ is obtained by recursively substituting $f$
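
A minimal sketch of how this objective can be solved, assuming time-invariant $\mathbf{F}, \mathbf{f}, \mathbf{C}, \mathbf{c}$ and no cost on the final state. It uses the standard backward recursion over a quadratic value function followed by a forward rollout. The double-integrator system and all function names (`lqr`, `total_cost`) are illustrative assumptions, not taken from the course code.

```python
import numpy as np

def lqr(F, f, C, c, x0, T):
    """Backward value recursion, then a forward rollout with u_t = K_t x_t + k_t.

    Assumes time-invariant F, f, C, c and zero cost on the final state x_T.
    """
    n = x0.shape[0]
    V = np.zeros((n, n))  # quadratic part of the value function at t+1
    v = np.zeros(n)       # linear part of the value function at t+1
    gains = []
    for _ in range(T):    # walks backward from t = T-1 down to t = 0
        Q = C + F.T @ V @ F
        q = c + F.T @ V @ f + F.T @ v
        Q_xx, Q_xu = Q[:n, :n], Q[:n, n:]
        Q_ux, Q_uu = Q[n:, :n], Q[n:, n:]
        q_x, q_u = q[:n], q[n:]
        K = -np.linalg.solve(Q_uu, Q_ux)
        k = -np.linalg.solve(Q_uu, q_u)
        V = Q_xx + Q_xu @ K + K.T @ Q_ux + K.T @ Q_uu @ K
        v = q_x + Q_xu @ k + K.T @ q_u + K.T @ Q_uu @ k
        gains.append((K, k))
    gains.reverse()       # gains[t] now belongs to time step t

    xs, us = [x0], []
    for K, k in gains:
        u = K @ xs[-1] + k
        us.append(u)
        xs.append(F @ np.concatenate([xs[-1], u]) + f)
    return np.array(xs), np.array(us)

def total_cost(F, f, C, c, x0, us):
    """Sum of c(x_t, u_t), with x_t found by recursively substituting f."""
    x, cost = x0, 0.0
    for u in us:
        z = np.concatenate([x, u])
        cost += 0.5 * z @ C @ z + z @ c
        x = F @ z + f
    return cost

# Hypothetical double integrator: state = (position, velocity), control = acceleration.
dt = 0.1
F = np.array([[1.0, dt, 0.0], [0.0, 1.0, dt]])
f = np.zeros(2)
C = np.diag([1.0, 0.1, 0.01])
c = np.zeros(3)
x0 = np.array([1.0, 0.0])

xs, us = lqr(F, f, C, c, x0, T=50)
best = total_cost(F, f, C, c, x0, us)
```

Because the total cost is a convex quadratic in the controls here, any perturbation of `us` should not lower `total_cost`, which gives a quick sanity check on the recursion.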

Iterative LQR: if the system is not linear, use a Taylor expansion
- Yes, expand around every point on the trajectory
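
Written out, the expansion could look like the following, where $\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t$ denote the current nominal trajectory (a notation assumed here, not used elsewhere in these notes). The dynamics are expanded to first order and the cost to second order:

$f(\mathbf{x}_t, \mathbf{u}_t) \approx f(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) + \nabla_{\mathbf{x}_t, \mathbf{u}_t} f(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \begin{bmatrix} \mathbf{x}_t - \hat{\mathbf{x}}_t \\ \mathbf{u}_t - \hat{\mathbf{u}}_t \end{bmatrix}$

$c(\mathbf{x}_t, \mathbf{u}_t) \approx c(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) + \nabla_{\mathbf{x}_t, \mathbf{u}_t} c(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t)^T \begin{bmatrix} \mathbf{x}_t - \hat{\mathbf{x}}_t \\ \mathbf{u}_t - \hat{\mathbf{u}}_t \end{bmatrix} + \frac{1}{2} \begin{bmatrix} \mathbf{x}_t - \hat{\mathbf{x}}_t \\ \mathbf{u}_t - \hat{\mathbf{u}}_t \end{bmatrix}^T \nabla^2_{\mathbf{x}_t, \mathbf{u}_t} c(\hat{\mathbf{x}}_t, \hat{\mathbf{u}}_t) \begin{bmatrix} \mathbf{x}_t - \hat{\mathbf{x}}_t \\ \mathbf{u}_t - \hat{\mathbf{u}}_t \end{bmatrix}$

LQR is then run on the deviations from the nominal trajectory, with $\mathbf{F}_t = \nabla_{\mathbf{x}_t, \mathbf{u}_t} f$, $\mathbf{C}_t = \nabla^2_{\mathbf{x}_t, \mathbf{u}_t} c$ and $\mathbf{c}_t = \nabla_{\mathbf{x}_t, \mathbf{u}_t} c$, and the process repeats around the new trajectory.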

Learn about the world

Version 1.0

Collect data $\mathcal{D} = \{ (\mathbf{x}, \mathbf{u}, \mathbf{x}')_i \}$
Learn the world model from $\mathcal{D}$: $f \leftarrow \arg\min_f \sum_i \lVert f(\mathbf{x}_i, \mathbf{u}_i) - \mathbf{x}'_i \rVert^2$
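
As a rough illustration of the fitting step, the sketch below assumes the world model is linear, so the minimization reduces to ordinary least squares. The system `F_true`, `f_true` and the helper `fit_linear_model` are hypothetical and exist only to generate and fit synthetic data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" linear world, used only to synthesize the dataset D.
F_true = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.1]])
f_true = np.array([0.0, -0.05])

X = rng.standard_normal((200, 2))       # states x_i
U = rng.standard_normal((200, 1))       # controls u_i
X_next = np.hstack([X, U]) @ F_true.T + f_true   # next states x'_i

def fit_linear_model(X, U, X_next):
    """Least-squares fit of x' = F [x; u] + f over the whole dataset."""
    Z = np.hstack([X, U, np.ones((len(X), 1))])  # the extra column learns the offset f
    W, *_ = np.linalg.lstsq(Z, X_next, rcond=None)
    return W[:-1].T, W[-1]

F_hat, f_hat = fit_linear_model(X, U, X_next)
```

With noise-free data from a truly linear system, `F_hat` and `f_hat` should match the generating parameters up to numerical precision. A neural network or another regressor could stand in for the linear fit when the dynamics are more complex.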

What goes wrong with Version 1.0?

Think of Pavlov's dog...

$\mathcal{D}_0 := \{ (\text{ring bell and food}), \cdots \}$
$f_{\mathcal{D}_0} = (\text{bell} = \text{food})$
$\pi_{\mathcal{D}_0} = (\text{if bell, then go to front door for food})$

A model led astray by its training set!

$\mathcal{D}$ is not representative of $\mathbb{S}$, because the states it covers are only a subset: $\mathbf{x}_{\mathcal{D}} \subset \mathbb{S}$

Version 2.0 - DAgger

Data Aggregation

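1. Initialize $\pi_0$
2. Run $\pi_0$ to collect data $\mathcal{D}$
3. Learn $f$ from $\mathcal{D}$
4. Update $\pi$ to $\pi_i$
5. Run $\pi_i$ to collect data $\mathcal{D}_i$
6. [DAgger] $\mathcal{D} := \mathcal{D} \cup \mathcal{D}_i$
7. Go back to step 3 and repeat until convergence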
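
The loop above can be sketched on a toy problem. Everything in this example is an assumption made for illustration: a scalar, mildly nonlinear "real world" (`true_step`), a linear model fitted by least squares, and a policy that picks the control the model predicts will bring the next state to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_step(x, u):
    # Hypothetical real world: mildly nonlinear and unknown to the learner.
    return x + 0.1 * np.sin(x) + u

def rollout(policy, x0=1.0, steps=20):
    """Run a policy in the real world and record (x, u, x') transitions."""
    data, x = [], x0
    for _ in range(steps):
        u = policy(x)
        x_next = true_step(x, u)
        data.append((x, u, x_next))
        x = x_next
    return data

def fit_model(data):
    """Least-squares fit of the linear model x' = a*x + b*u + d."""
    arr = np.array(data)
    Z = np.column_stack([arr[:, 0], arr[:, 1], np.ones(len(arr))])
    (a, b, d), *_ = np.linalg.lstsq(Z, arr[:, 2], rcond=None)
    return a, b, d

def policy_from_model(a, b, d):
    # Choose u so that the model predicts the next state to be 0.
    return lambda x: -(a * x + d) / b

def random_policy(x):
    return rng.uniform(-1.0, 1.0)

D = rollout(random_policy)              # run pi_0 to collect D
for i in range(5):
    a, b, d = fit_model(D)              # learn f from D
    pi_i = policy_from_model(a, b, d)   # update pi to pi_i
    D_i = rollout(pi_i)                 # run pi_i to collect D_i
    D = D + D_i                         # aggregate: D := D u D_i
```

Each pass refits the model on all data gathered so far, so states visited by the current policy gradually become part of the training set. A fixed number of passes stands in for a convergence check here.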

Be the master of yourself

Markov Decision Process

$\pi(s) := \arg \max_a \left\{ \sum_{s'} P_a(s, s') ( R_a(s, s') + \gamma V(s') ) \right\}$

$V(s) := \sum_{s'} P_{\pi(s)}(s, s') ( R_{\pi(s)}(s, s') + \gamma V(s') )$

from Wikipedia
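
One common way to solve these two equations is value iteration, which folds the policy update into the value update by taking a max over actions. The sketch below applies it to a hypothetical two-state MDP. The array layout `P[a, s, s']`, `R[a, s, s']` and the example numbers are assumptions for illustration.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10, max_iters=10_000):
    """P[a, s, s'] are transition probabilities, R[a, s, s'] are rewards."""
    n_states = P.shape[1]
    V = np.zeros(n_states)
    for _ in range(max_iters):
        # Q[a, s] = sum_{s'} P_a(s, s') * (R_a(s, s') + gamma * V(s'))
        Q = np.einsum("ast,ast->as", P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=0)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    return V, Q.argmax(axis=0)

# Hypothetical MDP: action 0 stays put, action 1 moves to the other state.
# Arriving in state 1 pays reward 1, arriving in state 0 pays nothing.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],   # action 1: move
])
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0

V, pi = value_iteration(P, R, gamma=0.9)
```

In this example the best plan is to move to state 1 and stay there, so both states should be worth $1 / (1 - 0.9) = 10$.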

Evaluation measures

Policy performance: $\eta(\pi) = \mathbb{E} [ \sum_t \gamma^t r_t ]$
State value: $V_\pi(s) = \mathbb{E} [ \sum_t \gamma^t r_t \mid s_0 = s ]$
State-action value: $Q_\pi(s, a) = \mathbb{E} [ \sum_t \gamma^t r_t \mid s_0 = s, a_0 = a ]$

$\gamma$ is the discount factor
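
All three quantities are expectations of the same discounted sum, so a single sampled trajectory contributes one such sum. A minimal sketch of that inner computation follows. The helper name `discounted_return` is an assumption.

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * r_t for one trajectory's reward sequence."""
    total = 0.0
    # Work backward so each step is a single multiply-add:
    # G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total

example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```

Averaging this value over many trajectories sampled under $\pi$ gives a Monte Carlo estimate of $\eta(\pi)$. Fixing the first state, or the first state and action, gives estimates of $V_\pi$ and $Q_\pi$.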

Policy Gradient

$\text{maximize } \mathbb{E} [ R | \pi_\theta ]$

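Basic idea

Make good trajectories more likely
Make good actions more likely
Improve the bad actions

In the score-function identity $\nabla_\theta \mathbb{E}_x [ f(x) ] = \mathbb{E}_x [ \nabla_\theta \log{p(x | \theta)} f(x) ]$, replace $f(x)$ with the task objective we need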

$\hat{g} := \nabla_\theta \mathbb{E}_\tau [ R(\tau) ] = \mathbb{E}_\tau [ \nabla_\theta \log{p(\tau | \theta)} R(\tau) ]$

$f \rightarrow R$
$x \rightarrow \tau := (s_0, a_0, r_0, s_1, a_1, r_1, \cdots, s_{T-1}, a_{T-1}, r_{T-1}, s_T)$
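
To see the estimator in action, the sketch below checks it numerically on the simplest possible case: a one-step "trajectory" (a three-armed bandit) with a softmax policy, where the exact gradient is available in closed form. The bandit setup, the reward values, and the function names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def score_function_gradient(theta, rewards, n_samples):
    """Monte Carlo estimate of E[ grad_theta log p(a | theta) * R(a) ]."""
    p = softmax(theta)
    actions = rng.choice(len(theta), size=n_samples, p=p)
    # For a softmax policy, grad_theta log p(a) = onehot(a) - p.
    grad_log_p = np.eye(len(theta))[actions] - p
    returns = rewards[actions]
    return (grad_log_p * returns[:, None]).mean(axis=0)

def exact_gradient(theta, rewards):
    """Closed-form gradient of E[R] = sum_a p(a) * R(a) for a softmax policy."""
    p = softmax(theta)
    return p * (rewards - p @ rewards)

theta = np.zeros(3)
rewards = np.array([1.0, 2.0, 3.0])   # hypothetical fixed reward per arm
g_hat = score_function_gradient(theta, rewards, n_samples=200_000)
g_exact = exact_gradient(theta, rewards)
```

The sampled estimate `g_hat` should approach `g_exact` as the number of samples grows. Its positive component on the best arm shows how the update makes high-reward choices more likely.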