# OREO: Offline REasoning Optimization

Source code for [Offline Reinforcement Learning for LLM Multi-Step Reasoning](https://arxiv.org/abs/2412.16145)

Model: [Policy](https://huggingface.co/jwhj/Qwen2.5-Math-1.5B-OREO) | [Value](https://huggingface.co/jwhj/Qwen2.5-Math-1.5B-OREO-Value)

<img src="https://raw.githubusercontent.com/jwhj/OREO/refs/heads/main/OREO.png" alt="Overview of the OREO method" width="50%" />


# Installation

This repo is based on [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), and installation follows a similar process. We recommend using Docker to set up the environment.

First, build the Docker image
```bash
cd dockerfile
docker build -t [IMAGE_NAME] .
```

Start a Docker container
```bash
docker run -itd --ipc host --gpus all [IMAGE_NAME] bash
```

Attach to the container
```bash
docker exec -it [CONTAINER_ID] /bin/bash
```

Install the current repo
```bash
cd [PATH_TO_THIS_REPO]
pip install -e .
```

As the data collection process involves randomness, we will publish the training data used in our experiments in the near future.

# Reproduction
## Training
You may need to adjust the following command-line options in the scripts below:
- `--train_file` specifies the path of training data in OREO experiments.
- `--dataset` specifies the path of training data in SFT experiments.
- `--save_path` specifies the path to save the model.
- `--pretrain` specifies the path to load the pretrained model. In OREO experiments, this should be the path to the SFT model.

### Math Reasoning

Supervised fine-tuning
```bash
cd example/scripts
bash train_oreo_sft.sh
```

OREO training
```bash
cd example/scripts
bash train_oreo.sh
```

To train the `DeepSeekMath-7B-Instruct` model,
```bash
cd example/scripts
bash train_oreo_deepseek-math.sh
```
Note that `DeepSeekMath-7B-Instruct` has already been supervised fine-tuned, so there is no separate SFT phase here.

### ALFWorld

Supervised fine-tuning
```bash
cd example/scripts
bash train_oreo_alfworld_sft.sh
```

OREO training
```bash
cd example/scripts
bash train_oreo_alfworld.sh
```

## Evaluation
### Math Reasoning

Make sure you have `antlr4-python3-runtime==4.11.0` installed.
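If the package is missing, it can be installed with pip (this exact version pin matters: `sympy`'s LaTeX parser, commonly used for checking math answers, requires a matching `antlr4-python3-runtime` version):

```bash
pip install antlr4-python3-runtime==4.11.0
```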

For Qwen-based models
```bash
cd example/scripts
python ../scratch/run_qwen.py --model [PATH_TO_YOUR_MODEL] --save [SAVE_GENERATED_RESULTS_JSONL]
```

For DeepSeekMath-based models
```bash
cd example/scripts
python ../scratch/run_qwen.py --model [PATH_TO_YOUR_MODEL] --no_bos --save [SAVE_GENERATED_RESULTS_JSONL]
```
Note the `--no_bos` option here.

### ALFWorld

This part requires [ALFWorld](https://github.com/alfworld/alfworld) to be installed.

First, start a vLLM server
```bash
python -m vllm.entrypoints.openai.api_server --model [PATH_TO_YOUR_MODEL]
```
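Before launching the evaluation, you can verify the server is up by querying its OpenAI-compatible completions endpoint. This is a sanity-check sketch assuming vLLM's default port 8000; the `model` field must match the path you passed to `--model`:

```bash
# Send a tiny completion request to confirm the server responds
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "[PATH_TO_YOUR_MODEL]", "prompt": "Hello", "max_tokens": 8}'
```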

Then run evaluation with
```bash
cd example/scripts
python ../scratch/run_alfworld_async.py --model [PATH_TO_YOUR_MODEL] --save_dir [SAVE_GENERATED_TRAJS]
```
You can use `--split eval_in_distribution` for seen environments.

## Reference
```bibtex
@misc{Wang2024OfflineRL,
  title={Offline Reinforcement Learning for LLM Multi-Step Reasoning},
  author={Huaijie Wang and Shibo Hao and Hanze Dong and Shenao Zhang and Yilin Bao and Ziran Yang and Yi Wu},
  year={2024},
  url={https://api.semanticscholar.org/CorpusID:274965107}
}
```