I finished homeworks 1-4 of CS285 three years ago, but not hw5. The content of hw5 has changed since the 2020 version:
You will first implement an exploration method called random network distillation (RND) and collect data using this exploration procedure, then perform offline training on the data collected via RND using conservative Q-learning (CQL), Advantage Weighted Actor Critic (AWAC), and Implicit Q-Learning (IQL). You will also explore variants of exploration bonuses, where a bonus is added alongside the actual environment reward to encourage exploration.
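To make the RND idea concrete, here is a minimal sketch, not the assignment's actual implementation: a fixed, randomly initialized target network produces features, and a predictor network is trained to match them. States the predictor has seen often yield low error, so the prediction error itself serves as the exploration bonus. For simplicity the two networks here are linear maps in numpy; the hw5 version uses MLPs in PyTorch.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, FEAT_DIM = 4, 16

# Target network: randomly initialized and *never* trained.
W_target = rng.normal(size=(OBS_DIM, FEAT_DIM))

# Predictor network: trained to imitate the target's output.
W_pred = np.zeros((OBS_DIM, FEAT_DIM))

def rnd_bonus(obs):
    """Exploration bonus = predictor's mean squared error vs. the target."""
    err = obs @ W_pred - obs @ W_target
    return float(np.mean(err ** 2))

def rnd_update(obs, lr=0.01):
    """One SGD step on 0.5 * ||predictor(obs) - target(obs)||^2."""
    global W_pred
    err = obs @ W_pred - obs @ W_target   # shape (FEAT_DIM,)
    W_pred -= lr * np.outer(obs, err)     # gradient w.r.t. W_pred

obs = rng.normal(size=OBS_DIM)
before = rnd_bonus(obs)
for _ in range(200):        # "visit" this state many times
    rnd_update(obs)
after = rnd_bonus(obs)      # bonus shrinks for frequently visited states
```

During exploration, the agent would be trained on `reward + beta * rnd_bonus(obs)` for some bonus weight, so novel states (high predictor error) are actively sought out.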
Some auxiliary requirements:
- The questions will require you to perform multiple runs of offline RL training, which can take quite a long time, since you are asked to analyze the empirical significance of specific hyperparameters and thus sweep over them.
- Furthermore, depending on your implementation, you may find it necessary to tweak some of the parameters, such as learning rates or exploration schedules, which can also be very time-consuming.