Skip to content

Benchmarks

We adapt several methods from TRL for continual learning, and include the commands to run the algorithms end-to-end using AIF-Gen data. We integrate with WANDB and DeepSpeed for multi-GPU support.

Scripts are available here

Basic Usage

First, you'll need to sync additional dependencies for running the benchmarks. Assuming you did the uv install, you can simply issue:

uv sync --group benchmarks

Each algorithm (aside from DPO) involves first training a reward model, then training an RL agent with respect to the learned model.

Reward Model Training

Under construction.

Agent Training

Under construction.

Evaluation

Under construction.

Algorithms

PPO

Under construction.

Reference: https://arxiv.org/abs/1707.06347

DPO

Under construction.

Reference: https://arxiv.org/abs/2305.18290

COPR

Under construction.

Reference: https://arxiv.org/abs/2402.14228

CPPO

Under construction.

Reference: https://openreview.net/forum?id=86zAUE80pP

EWC

Under construction.

Reference: https://arxiv.org/abs/1612.00796