I took these notes while watching Reinforcement Learning Pretraining for Reinforcement Learning Finetuning - YouTube.
Keys to the success of large-scale ML systems:
- big models
- large and high-capacity datasets
Comparison
| prediction (idealized assumptions) | decision making (real-world deployment of ML systems, with feedback) |
| --- | --- |
| i.i.d. data | each decision can change future inputs |
| ground-truth supervision | high-level supervision (e.g. a goal) |
| objective is to predict the right label | objective is to accomplish the task |
use RL as a universal approach to ML
RL can consume data in a fundamentally different way from conventional maximum likelihood / supervised learning systems
cheap, uncurated data (e.g. from past interaction, from the Internet) → dynamics
limited amount of human supervision → task / reward function
train the best possible initial model
RL is
- a framework for learning-based decision making
- an active, online learning algorithm for control
Problem: real-world decision-making problems are hard to make fully active and online
Solution:
use a large dataset of diverse (but possibly low-quality) behavior for offline (data-driven) RL pretraining (see the sketch after this list)
- human-defined skills
- goal-conditioned RL
- self-supervised skill discovery
- learning downstream tasks very efficiently
  - online RL fine-tuning
  - offline RL
  - supervised learning
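A minimal sketch of this recipe, under my own assumptions rather than anything shown in the talk: `agent`, `dataset`, and `env` are hypothetical objects assumed to expose the methods used below.

```python
# Hypothetical sketch: offline RL pretraining on a large static dataset,
# followed by online RL fine-tuning that keeps adding new experience.
def pretrain_then_finetune(agent, dataset, env,
                           offline_steps=500_000, online_steps=50_000, batch_size=256):
    # Phase 1: offline (data-driven) RL pretraining on diverse, possibly low-quality data.
    for _ in range(offline_steps):
        agent.update(dataset.sample(batch_size))

    # Phase 2: online RL fine-tuning; new experience is appended to the same buffer.
    obs = env.reset()
    for _ in range(online_steps):
        action = agent.act(obs)
        next_obs, reward, done, info = env.step(action)
        dataset.add(obs, action, reward, next_obs, done)
        agent.update(dataset.sample(batch_size))
        obs = env.reset() if done else next_obs
    return agent
```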
Efficient online RL with offline pretraining
Distributional shift: discrepancy between the states and actions seen in the training data and those encountered in the real world
Q(s,a) \leftarrow \underbrace{r(s,a) + \mathbb{E}_{a^\prime\sim\pi(a^\prime \mid s^\prime)}{[Q(s^\prime,a^\prime)]}}_{y(s,a)}
objective:
\min_{Q}{\mathbb{E}_{(s,a)\sim\pi_{\beta}{(s,a)}}{[(Q(s,a)-y(s,a))^2]}}
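A PyTorch-style sketch of the backup and squared-error objective above (a discount factor gamma is added, as is standard). `q_net`, `target_q_net`, and `policy` are assumed modules, with `policy(s)` assumed to return a distribution; none of these names come from the talk.

```python
import torch
import torch.nn.functional as F

# Sketch of y(s,a) = r + E_{a'~pi(.|s')}[Q(s',a')] and E[(Q(s,a) - y(s,a))^2].
def q_regression_loss(q_net, target_q_net, policy, batch, gamma=0.99):
    s, a, r, s_next, done = batch            # tensors sampled from the behavior data pi_beta
    with torch.no_grad():
        a_next = policy(s_next).sample()     # a' ~ pi(a' | s')
        y = r + gamma * (1.0 - done) * target_q_net(s_next, a_next)
    return F.mse_loss(q_net(s, a), y)        # squared regression error against the target y(s,a)
```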
expect good accuracy when \pi_{\beta}{(a \mid s)}=\pi_{new}{(a \mid s)}
even worse when
\pi_{new} = \arg\max_{\pi}{\mathbb{E}_{a\sim\pi(a \mid s)}{[Q(s,a)]}}
Adversarial examples: optimize the input of a neural net w.r.t. its output to fool the network
Conservative Q-learning (CQL): push down the Q-values in places where the learned function overestimates
can show \hat{Q}^\pi \le Q^\pi for large enough \alpha
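A hedged sketch of one CQL-style conservative regularizer (the E_{a~\pi}[Q] - E_{data}[Q] variant, not necessarily the exact form in the talk), reusing `q_regression_loss` from the sketch above; \alpha scales the conservatism.

```python
def cql_loss(q_net, target_q_net, policy, batch, alpha=1.0, gamma=0.99):
    s, a, r, s_next, done = batch
    bellman_error = q_regression_loss(q_net, target_q_net, policy, batch, gamma)

    # Push Q down on actions the learned policy prefers (where overestimation
    # gets exploited) and push it up on the dataset actions.
    pi_actions = policy(s).sample()
    conservative_term = (q_net(s, pi_actions) - q_net(s, a)).mean()

    return bellman_error + alpha * conservative_term
```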
CQL performance drops sharply when fine-tuning starts:
it underestimates too much during offline training, then wastes a lot of effort recalibrating the value function once online fine-tuning starts
Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning (2023)
with a one-line change to CQL, provably efficient online fine-tuning from offline initialization
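My reading of the "one-line change" (an assumption based on the calibration idea, not a verified reimplementation): lower-bound the pushed-down Q-values by a reference value, e.g. the Monte-Carlo return of the behavior policy. Reuses `q_regression_loss` from the earlier sketch; `mc_return` is the assumed reference value.

```python
import torch

def cal_ql_loss(q_net, target_q_net, policy, batch, mc_return, alpha=1.0, gamma=0.99):
    s, a, r, s_next, done = batch
    bellman_error = q_regression_loss(q_net, target_q_net, policy, batch, gamma)

    pi_actions = policy(s).sample()
    # Change vs. CQL above: clip the values being pushed down at the reference
    # value, so the offline Q-function stays calibrated instead of over-pessimistic.
    pushed_down = torch.maximum(q_net(s, pi_actions), mc_return)
    conservative_term = (pushed_down - q_net(s, a)).mean()

    return bellman_error + alpha * conservative_term
```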
Offline pretraining without actions (and/or without rewards; only passive, static data)
representation learning