- Efficient online RL with offline pretraining
- Offline pretraining without actions (and/or rewards, only passive static data)

Keys to the success of large-scale ML systems:

- big, high-capacity models
- large datasets

Comparison

prediction (idealized assumptions) | decision making (real-world deployment of ML systems has feedback issues) |
---|---|
i.i.d. data | each decision can change future inputs |
ground-truth supervision | high-level supervision (e.g. a goal) |
objective is to predict the right label | objective is to accomplish the task |

use RL as a universal approach to ML

RL can consume data in a fundamentally different way from conventional maximum-likelihood / supervised learning systems

cheap, uncurated data (e.g. from past interaction, from the Internet) -> dynamics

limited amount of human supervision -> task / reward function

train the best possible initial model

RL is

- a framework for learning-based decision making
- an active, online learning algorithm for control

Problem: real-world decision-making problems are difficult to make fully active and online

Solution:

use a large dataset of diverse (but possibly low-quality) behavior for offline (data-driven) RL pretraining

- human-defined skills
- goal-conditioned RL
- self-supervised skill discovery

- learning downstream tasks very efficiently, via:
  - online RL fine-tuning
  - offline RL
  - supervised learning

## Efficient online RL with offline pretraining

Distributional shift: a discrepancy between the states and actions seen in the training data and those encountered in the real world

$Q(s,a) \leftarrow \underbrace{r(s,a) + \mathbb{E}_{a^\prime\sim\pi_{new}(a^\prime \mid s^\prime)}{[Q(s^\prime,a^\prime)]}}_{y(s,a)}$

objective:

$\min_{Q}{\mathbb{E}_{(s,a)\sim\pi_{\beta}(s,a)}{\left[(Q(s,a)-y(s,a))^2\right]}}$
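The backup and objective above can be sketched for a tabular Q-function as follows. This is a minimal illustration, not code from the talk: the function names `bellman_targets` and `td_loss`, the array layout, and the toy numbers are all assumptions, and the sketch omits a discount factor because the backup in these notes has none.

```python
import numpy as np

def bellman_targets(Q, pi_new, r, s_next):
    """y(s, a) = r(s, a) + E_{a' ~ pi_new(. | s')} Q(s', a') for a batch.

    Q:      (S, A) table of Q-values
    pi_new: (S, A) table, each row is a distribution over actions
    r:      (N,) rewards; s_next: (N,) next-state indices
    """
    v_next = (pi_new[s_next] * Q[s_next]).sum(axis=1)  # expectation over a'
    return r + v_next

def td_loss(Q, s, a, y):
    """Mean squared Bellman error over (s, a) pairs drawn from the dataset."""
    return np.mean((Q[s, a] - y) ** 2)

# Hypothetical toy batch of two transitions (s, a, r, s').
Q = np.array([[1.0, 2.0], [0.0, 4.0]])
pi_new = np.array([[0.5, 0.5], [0.0, 1.0]])
s, a = np.array([0, 1]), np.array([1, 0])
r, s_next = np.array([1.0, 0.0]), np.array([1, 0])

y = bellman_targets(Q, pi_new, r, s_next)
loss = td_loss(Q, s, a, y)
```

Note that the loss only ever evaluates $Q$ at dataset pairs $(s,a)$, while the target evaluates it at actions drawn from $\pi_{new}$; nothing in the objective constrains the latter.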

expect good accuracy when $\pi_{\beta}{(a \mid s)}=\pi_{new}{(a \mid s)}$

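accuracy degrades when they differ, and it is even worse when $\pi_{new}$ is chosen to maximize the learned Q-values:

$\pi_{new} = \arg\max_{\pi}{\mathbb{E}_{a\sim\pi(a \mid s)}{[Q(s,a)]}}$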
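A small simulation makes the problem with maximizing a learned Q-function concrete. It is a simplified sketch under stated assumptions: every action is truly worth 0, and the learned estimate differs from the truth by independent unit Gaussian noise (in offline RL the errors come from out-of-distribution actions rather than i.i.d. noise). All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 1000, 10

q_true = np.zeros((n_states, n_actions))               # every action is worth 0
q_hat = q_true + rng.normal(0.0, 1.0, q_true.shape)    # noisy learned estimate

greedy = q_hat.argmax(axis=1)                          # pi_new = argmax of learned Q
rows = np.arange(n_states)
believed = q_hat[rows, greedy].mean()                  # value the learner expects
actual = q_true[rows, greedy].mean()                   # value it actually gets
```

Here `believed` comes out around 1.5 while `actual` is exactly 0: the greedy policy selects precisely the actions whose values were overestimated.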

Adversarial examples: optimizing the input of a neural network with respect to its output can fool the network; maximizing $Q$ over actions has the same effect

Conservative Q-learning (CQL): push down the Q-values in places where the learned function overestimates

$$\newcommand{\states}{\mathcal{S}}
\newcommand{\bs}{\mathbf{s}}
\newcommand{\st}{\bs_t}
\newcommand{\hatbehavior}{\hat{\pi}_\beta}
\newcommand{\ba}{\mathbf{a}}
\newcommand{\bellman}{\mathcal{B}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\policy}{\pi}
\begin{multline}
\hat{Q}^\pi = \arg\min_{Q} \textcolor{red}{\max_{\mu}}~~ \alpha \left(\E_{\bs \sim \mathcal{D}, \ba \sim \textcolor{red}{\mu(\ba \mid \bs)}}\left[Q(\bs, \ba)\right] - \E_{\bs \sim \mathcal{D}, \ba \sim \hatbehavior(\ba \mid \bs)}\left[Q(\bs, \ba)\right] \right)\\
+ \frac{1}{2}~ \E_{\bs, \ba, \bs' \sim \mathcal{D}}\left[\left(Q(\bs, \ba) - \hat{\bellman}^{\policy_k} \hat{Q}^{k} (\bs, \ba) \right)^2 \right] + \textcolor{red}{\mathcal{R}(\mu)} ~~~ \left(\text{CQL}(\mathcal{R})\right).
\end{multline}$$

one can show that $\hat{Q}^\pi \le Q^\pi$ for large enough $\alpha$
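For discrete actions, the CQL objective can be sketched in a few lines. This assumes one particular choice of regularizer that the notes do not specify: taking $\mathcal{R}(\mu)$ to be the entropy of $\mu$, the inner maximization over $\mu$ has the closed form $\log\sum_{a}\exp Q(s,a)$ (the variant the CQL paper calls CQL($\mathcal{H}$)). The function names, the array layout, and the toy batch are illustrative assumptions.

```python
import numpy as np

def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def cql_loss(q_values, actions, targets, alpha):
    """CQL loss with an entropy regularizer, for discrete actions.

    q_values: (N, A) Q(s_i, .) for each dataset state
    actions:  (N,) dataset actions a_i
    targets:  (N,) Bellman targets for (s_i, a_i), treated as constants
    """
    idx = np.arange(len(actions))
    q_data = q_values[idx, actions]
    push_down = logsumexp(q_values, axis=1).mean()  # soft maximum over all actions
    push_up = q_data.mean()                         # Q at actions seen in the data
    td = 0.5 * np.mean((q_data - targets) ** 2)     # standard Bellman error term
    return alpha * (push_down - push_up) + td

# Hypothetical batch: two states, two actions.
q = np.array([[1.0, 3.0], [2.0, 0.0]])
loss = cql_loss(q, actions=np.array([0, 0]), targets=np.array([1.0, 2.0]), alpha=1.0)
```

The penalty term is never negative, since the log-sum-exp over actions is at least as large as the Q-value of any single action; minimizing it lowers Q-values at unseen actions relative to those in the data.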

CQL performance drops sharply when fine-tuning starts:

it underestimates too much during offline training, then wastes a lot of effort recalibrating the value function once online fine-tuning starts

Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning (2023)

with a one-line change to CQL, provably efficient online fine-tuning from an offline initialization
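These notes do not spell out the one-line change. As I recall it from the Cal-QL paper, the Q-values that CQL pushes down are first clipped from below by a reference value $V^{\mu}(s)$, for example the Monte Carlo return of the behavior policy, so the learned Q-function is not driven below that reference. The sketch below rests on that reading; the function names and toy numbers are hypothetical.

```python
import numpy as np

def cql_regularizer(q_policy, q_data):
    """Sample-based CQL penalty: push down Q at policy actions, push up Q at data actions."""
    return q_policy.mean() - q_data.mean()

def calql_regularizer(q_policy, q_data, v_ref):
    """Assumed Cal-QL variant: clip the pushed-down term from below by v_ref.

    q_policy: (N,) Q(s_i, a) for actions sampled from the learned policy
    q_data:   (N,) Q(s_i, a_i) for dataset actions
    v_ref:    (N,) reference values, e.g. Monte Carlo returns of the behavior policy
    """
    return np.maximum(q_policy, v_ref).mean() - q_data.mean()

# Hypothetical values: the first Q estimate is already far below its reference.
q_policy = np.array([-5.0, 2.0])
q_data = np.array([1.0, 1.0])
v_ref = np.array([0.0, 0.0])

cql = cql_regularizer(q_policy, q_data)
calql = calql_regularizer(q_policy, q_data, v_ref)
```

Where `q_policy` is already below `v_ref`, the clipped term is constant in $Q$, so the penalty stops pushing that value further down.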

## Offline pretraining without actions (and/or rewards, only passive static data)

representation learning