# OREO: Offline REasoning Optimization

Source code for [Offline Reinforcement Learning for LLM Multi-Step Reasoning](https://arxiv.org/abs/2412.16145)

Model: [Policy](https://huggingface.co/jwhj/Qwen2.5-Math-1.5B-OREO) | [Value](https://huggingface.co/jwhj/Qwen2.5-Math-1.5B-OREO-Value)

<img src="https://raw.githubusercontent.com/jwhj/OREO/refs/heads/main/OREO.png" alt="Overview of the OREO method" width="50%" />


# Installation

This repo is based on [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), and installation follows a similar process. We recommend using Docker to set up the environment.

First, build the Docker image
```bash
cd dockerfile
docker build -t [IMAGE_NAME] .
```

Start a Docker container
```bash
docker run -itd --ipc host --gpus all [IMAGE_NAME] bash
```

Attach to the container
```bash
docker exec -it [CONTAINER_ID] /bin/bash
```

Install the current repo
```bash
cd [PATH_TO_THIS_REPO]
pip install -e .
```

As the data collection process involves randomness, we will publish the training data used in our experiments in the near future.

# Reproduction
## Training
You may need to adjust the following command-line options in the scripts below:
- `--train_file` specifies the path of training data in OREO experiments.
- `--dataset` specifies the path of training data in SFT experiments.
- `--save_path` specifies the path to save the model.
- `--pretrain` specifies the path to load the pretrained model. In OREO experiments, this should be the path to the SFT model.

### Math Reasoning

Supervised fine-tuning
```bash
cd example/scripts
bash train_oreo_sft.sh
```

OREO training
```bash
cd example/scripts
bash train_oreo.sh
```

To train the `DeepSeekMath-7B-Instruct` model,
```bash
cd example/scripts
bash train_oreo_deepseek-math.sh
```
Note that `DeepSeekMath-7B-Instruct` has already been supervised fine-tuned, so there is no separate SFT phase here.

### ALFWorld

Supervised fine-tuning
```bash
cd example/scripts
bash train_oreo_alfworld_sft.sh
```

OREO training
```bash
cd example/scripts
bash train_oreo_alfworld.sh
```

## Evaluation
### Math Reasoning

Make sure you have `antlr4-python3-runtime==4.11.0` installed.
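If the package is missing, it can be installed with pip (this exact version pin matters: `sympy`'s LaTeX parser, commonly used for checking math answers, requires a matching `antlr4-python3-runtime` version):

```bash
pip install antlr4-python3-runtime==4.11.0
```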

For Qwen-based models
```bash
cd example/scripts
python ../scratch/run_qwen.py --model [PATH_TO_YOUR_MODEL] --save [SAVE_GENERATED_RESULTS_JSONL]
```

For DeepSeekMath-based models
```bash
cd example/scripts
python ../scratch/run_qwen.py --model [PATH_TO_YOUR_MODEL] --no_bos --save [SAVE_GENERATED_RESULTS_JSONL]
```
Note the `--no_bos` option here.

### ALFWorld

This part requires [ALFWorld](https://github.com/alfworld/alfworld) to be installed.

First, start a vLLM server
```bash
python -m vllm.entrypoints.openai.api_server --model [PATH_TO_YOUR_MODEL]
```
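Before launching the evaluation, you can verify the server is up by querying its OpenAI-compatible completions endpoint. This is a sanity-check sketch assuming vLLM's default port 8000; the `model` field must match the path you passed to `--model`:

```bash
# Send a tiny completion request to confirm the server responds
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "[PATH_TO_YOUR_MODEL]", "prompt": "Hello", "max_tokens": 8}'
```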

Then run evaluation with
```bash
cd example/scripts
python ../scratch/run_alfworld_async.py --model [PATH_TO_YOUR_MODEL] --save_dir [SAVE_GENERATED_TRAJS]
```
You can use `--split eval_in_distribution` for seen environments.

## Reference
```bibtex
@misc{Wang2024OfflineRL,
  title={Offline Reinforcement Learning for LLM Multi-Step Reasoning},
  author={Huaijie Wang and Shibo Hao and Hanze Dong and Shenao Zhang and Yilin Bao and Ziran Yang and Yi Wu},
  year={2024},
  url={https://api.semanticscholar.org/CorpusID:274965107}
}
```