Reinforcement Learning Pretraining for Reinforcement Learning Finetuning

I took these notes while watching the talk "Reinforcement Learning Pretraining for Reinforcement Learning Finetuning" on YouTube.

Keys to the success of large-scale ML systems:

  • big models
  • large and high-capacity datasets


prediction (ideal, assumptions) | decision making (real-world deployment of ML systems has feedback issues)
i.i.d. data | each decision can change future inputs
ground-truth supervision | high-level supervision (e.g. a goal)
objective is to predict the right label | objective is to accomplish the task

use RL as a universal approach to ML

RL can consume data in a fundamentally different way from conventional maximum likelihood / supervised L systems

cheap, uncurated data (e.g. from past interaction, from the Internet) → dynamics
limited amount of human supervision → task / reward function

train the best possible initial model

RL is

  • framework for L-based decision making
  • active, online L algo for control

Problem: real-world decision-making problems can rarely be fully active and online

use large dataset of diverse (but possibly low-qual) behavior for offline (data-driven) RL pretraining

  • human-defined skills
  • goal-conditioned RL
  • self-supervised skill discovery
  • learning downstream task very efficiently
    • online RL fine-tuning
    • offline RL
    • supervised L

Efficient online RL with offline pretraining

Distributional shift: discrepancies between the states and actions seen in the training data and those encountered in the real world

Q(s,a) \leftarrow \underbrace{r(s,a) + \mathbb{E}_{a^\prime\sim\pi_{new}(a^\prime \mid s^\prime)}{[Q(s^\prime,a^\prime)]}}_{y(s,a)}


expect good accuracy when \pi_{\beta}(a \mid s) = \pi_{new}(a \mid s), where \pi_{\beta} is the behavior policy that collected the training data

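even worse when the new policy is chosen to maximize the Q-values, because it then seeks out the actions whose values are overestimated:
\pi_{new} = \arg\max_{\pi}{\mathbb{E}_{a\sim\pi(a \mid s)}{[Q(s,a)]}}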
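
A minimal numerical sketch of why the argmax makes things worse. The setup is an assumption made for illustration, not something from the talk: every action has true value 0, and the learned Q-values carry zero-mean Gaussian error. The function name `greedy_value_bias` and its parameters are hypothetical.

```python
import random

def greedy_value_bias(num_actions=10, noise_std=1.0, trials=2000, seed=0):
    """Average learned Q-value of the greedy action when every true Q-value is 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Learned Q = true Q (0 here, by assumption) + zero-mean estimation error.
        q_hat = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        # pi_new = argmax_a Q(s, a) picks whichever action is most overestimated.
        total += max(q_hat)
    return total / trials

bias = greedy_value_bias()
print(f"mean Q-hat of the greedy action: {bias:.2f} (true value is 0)")
```

With a single action there is nothing to maximize over and the average error stays near zero; the more actions the policy can choose from, the larger the positive bias.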

Analogy with adversarial examples: optimizing the input of a neural net w.r.t. its output fools the network, and maximizing Q over actions does the same to the learned Q-function

Conservative Q-learning (CQL): push down the Q-values in places where the learned function overestimates

\begin{multline} \hat{Q}^\pi = \arg\min_{Q} \textcolor{red}{\max_{\mu}}~~ \alpha \left(\mathbb{E}_{\mathbf{s} \sim \mathcal{D}, \mathbf{a} \sim \textcolor{red}{\mu(\mathbf{a} \mid \mathbf{s})}}\left[Q(\mathbf{s}, \mathbf{a})\right] - \mathbb{E}_{\mathbf{s} \sim \mathcal{D}, \mathbf{a} \sim \hat{\pi}_\beta(\mathbf{a} \mid \mathbf{s})}\left[Q(\mathbf{s}, \mathbf{a})\right] \right)\\ + \frac{1}{2}~ \mathbb{E}_{\mathbf{s}, \mathbf{a}, \mathbf{s}' \sim \mathcal{D}}\left[\left(Q(\mathbf{s}, \mathbf{a}) - \hat{\mathcal{B}}^{\pi_k} \hat{Q}^{k} (\mathbf{s}, \mathbf{a}) \right)^2 \right] + \textcolor{red}{\mathcal{R}(\mu)} ~~~ \left(\text{CQL}(\mathcal{R})\right). \end{multline}

can show \hat{Q}^\pi \le Q^\pi for large enough \alpha
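
A sketch of the conservative term for discrete actions. It assumes \mathcal{R}(\mu) is the entropy of \mu, in which case the inner \max_{\mu} has the closed form \log\sum_{a}\exp Q(s,a). The helper names (`logsumexp`, `cql_h_penalty`) and the toy Q-values are hypothetical, and the Bellman error term is left out.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def cql_h_penalty(q_rows, data_actions, alpha=1.0):
    """alpha * mean over dataset states of [logsumexp_a Q(s, a) - Q(s, a_data)].

    q_rows[i] holds Q(s_i, a) for every discrete action a;
    data_actions[i] is the action actually taken in s_i in the dataset.
    """
    total = 0.0
    for q_row, a in zip(q_rows, data_actions):
        # Push down a soft maximum over all actions, push up the dataset action.
        total += logsumexp(q_row) - q_row[a]
    return alpha * total / len(q_rows)

# Same dataset action (index 0); the second Q-table inflates an unseen action.
modest = cql_h_penalty([[1.0, 0.0, 0.0]], [0])
inflated = cql_h_penalty([[1.0, 5.0, 0.0]], [0])
print(modest, inflated)
```

The penalty grows when an action that is absent from the data gets a high Q-value, which is what pushes the overestimated values down during training.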

CQL performance drops sharply when fine-tuning starts:
it underestimates too much during offline training, then wastes a lot of effort recalibrating the value function once online fine-tuning begins

Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning (2023)

with a one-line change to CQL, provably efficient online fine-tuning from an offline initialization
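
A rough sketch of that change, as I understand the Cal-QL paper: the Q-values being pushed down are clipped from below by a reference value V^{\mu}(s), for example the Monte Carlo return of the behavior policy, so the learned Q is never pushed below it. The function names and the toy numbers are hypothetical, and only the conservative term is shown.

```python
def mean(values):
    return sum(values) / len(values)

def cql_pushdown(q_policy, q_data):
    """CQL conservative term: E_{a~pi}[Q(s, a)] - E_{a~data}[Q(s, a)]."""
    return mean(q_policy) - mean(q_data)

def calql_pushdown(q_policy, q_data, v_ref):
    """Cal-QL variant (assumed form): replace Q(s, a) by max(Q(s, a), V_ref(s))
    in the pushed-down term only."""
    clipped = [max(q, v) for q, v in zip(q_policy, v_ref)]
    return mean(clipped) - mean(q_data)

q_policy = [-5.0, 2.0]   # Q at actions sampled from the current policy
q_data = [1.0, 1.0]      # Q at the dataset actions
v_ref = [0.0, 0.0]       # reference values for the same states

print(cql_pushdown(q_policy, q_data))           # -2.5
print(calql_pushdown(q_policy, q_data, v_ref))  # 0.0
```

In the first state Q is already below the reference value, so the clipped term is constant there and gives no gradient that would push Q down further.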

Offline pretraining without actions (and/or rewards; only passive, static data)

representation learning