xssstory committed on
Commit ba54c34 · verified · 1 Parent(s): 62a617e

Update README.md

Files changed (1)
  1. README.md +84 -81
README.md CHANGED
@@ -1,7 +1,5 @@
  ---
- license: apache-2.0
- base_model:
- - Qwen/Qwen3-8B
  ---
  <h1 align="center">
  <em>AReaL</em>: Ant Reasoning Reinforcement Learning for LLMs
@@ -12,20 +10,20 @@ base_model:
  </p>

- AReaL (Ant Reasoning RL) is an open-sourced **fully asynchronous reinforcement learning training system** for large reasoning models developed at **the RL Lab, Ant Research**, built upon the open-source project [RealHF](https://github.com/openpsi-project/ReaLHF). We fully commit to open-source by opening training details, data and infra required to reproduce the results along with the model itself. AReaL aims to help everyone build their own AI agents easily and affordably. Our team loves milk tea as it is delicious, customizable, and affordable. We hope you all enjoy our project just like how you enjoy a real-world milk tea (cheers).

  **AReaL Highlights**

- + 🔥 **[NEW] Asynchronous RL:** With algorithm-system co-design, AReaL supports fully asynchronous RL for **the fastest training**! Experimental support for multi-turn agentic RL is also provided.
- + 🛠️ **Open & Reproducible**: We will continuously release _all code, datasets, and training recipes_ for RL training LLMs.
- + 🚀 **Scalability**: AReaL can seamlessly adapt to different computational resource settings, ranging from 1 single node to 1K GPUs.
- + 🔪 **Cutting-Edge Performances:** AReaL can produce models with cutting-edge reasoning capabilities in math and coding. We are also actively working on agentic tasks.

  ## News

  **[2025/06/03] (v0.3, boba²)** We release **boba²** (double-boba) for fully asynchronous RL training, which achieves a **2.77x speedup while obtaining on-par or even better training performance** compared to synchronous systems. Moreover, asynchronous RL makes it extremely easy to set up multi-turn agentic RL training! Check out [our v0.3 overview blog](/blog/AReaL_v0_3.md) and the [research paper](https://arxiv.org/pdf/2505.24298).

- **[2025/03/31] (v0.2, Boba)** Here comes our next milestone release - Boba! Please call it A-ReaL-Boba! This release includes much accelerated training with SGLang support and SOTA 7B and 32B models on math reasoning. Check our [v0.2 technical blog](/blog/AReaL_v0_2.md).

  **[2025/02/24] (v0.1)** Our initial release includes reproducible results for 1.5B and 7B LRMs. Check our [v0.1 technical blog](/blog/AReaL_v0_1.md).

@@ -33,19 +31,21 @@ AReaL (Ant Reasoning RL) is an open-sourced **fully asynchronous reinforcement l

  In our AReaL-boba² (A-ReaL-double-boba) release, we highlight the top 3 most important features:

- + A fully asynchronous RL training pipeline with **system and RL algorithm co-design**, achieving [over 2.77x speedup](https://github.com/inclusionAI/AReaL/tree/main/benchmark) without any performance drop.
- + SOTA coding models, i.e., a 14B model with a **69.1 score on LCB-v5**. [Reproducible results](https://inclusionai.github.io/AReaL/references/reproduce.html) with fully open-sourced datasets are also provided.
- + Experimental support for **multi-turn** agentic RL training.

- For the complete system design and training details, please check [our v0.3 blog](/blog/AReaL_v0_3.md) and our [research paper](about:blank) for a more comprehensive presentation of our system design.

  ### Overview of Asynchronous RL Training

- During the synchronous RL training process, a generation step must wait until the longest sequence completes within the batch of LLM outputs. Due to the varying output lengths for LRM, a synchronous RL system suffers from massive GPU idle time, leading to training inefficiency. Some recent works ([DeepCoder](https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51), [Intellect](https://www.primeintellect.ai/blog/intellect-2)) propose to overlap a single training step with a single generation step to accelerate training. However, the largest bottleneck remains unchanged: the samples within a batch are still from the same model version, leading to waiting and GPU idle time.

  **Synchronous vs One-step Overlap RL**

- *Fig.1. Left: Execution timeline of a synchronous RL training. Right: Execution timeline of one-step overlap RL system.*

  AReaL adopts a fully asynchronous RL training framework that completely decouples generation from training. In AReaL, LLM generation runs in a streaming manner, with each rollout worker continuously producing outputs without waiting. Meanwhile, trainer workers perform parallel model updates upon receiving training batches.

@@ -53,9 +53,9 @@ AReaL adopts a fully asynchronous RL training framework that completely decouple

  *Fig 2. Execution timeline of our fully asynchronous RL system.*

- AReaL follows a system-algorithm co-design principle: on the system side, AReaL efficiently syncs up model parameters and carefully controls the staleness of each training sample; on the algorithm side, AReaL improves the objective of PPO to make async-RL stable.

- We compare the scalability of **asynchronous RL** training based on our AReaL-boba² system with **classical synchronous RL** training (we adopt the fastest open-sourced system veRL, main branch on 05/07/2025) across different model sizes and different numbers of H800 GPUs. AReaL demonstrates the much improved scaling capabilities w.r.t. training throughput. This is also partially due to that AReaL decouples training and generation, leading to much fewer GPU memory fragments. (Check the [benchmark directory](/benchmark) for detailed benchmark guide.)

  **Scaling Comparison**

@@ -63,48 +63,45 @@ We compare the scalability of **asynchronous RL** training based on our AReaL-bo

  ### SOTA Code Generation Model by AReaL-boba²

- We use **Qwen3** as our base model. After asynchronous RL training, we achieve SOTA results on LiveCodeBench, Codeforce, and CodeContests benchmarks.

- | **Model (8B)** | **LiveCodeBench v5**<br/>**(2024.10-2025.2)** | **Codeforce** | **CodeContests** |
  | :---: | :---: | :---: | :---: |
  | Qwen3-8B | 58.8 | 1879/96.7% | 31.4 |
  | DeepSeek-R1-0528-Qwen3-8B | 58.4 | 1945/97.3% | 31.0 |
- | AReaL-boba²-8B-Open | 62.0 | 1933/97.2% | **41.4** |
- | AReaL-boba²-8B | **63.0** | **1962/97.5%** | 40.8 |

- | **Model (14B)** | **LiveCodeBench v5**<br/>**(2024.10-2025.2)** | **Codeforce** | **CodeContests** |
  | :---: | :---: | :---: | :---: |
  | Qwen3-14B | 65.4 | 1978/97.7% | 38.3 |
  | DeepCoder-14B-Preview | 60.6 | 1936/95.3% | 40.1 |
- | AReaL-boba²-14B-Open | 67.3 | 1990/97.8% | **46.2** |
- | AReal-boba²-14B | **69.1** | **2044/98.2%** | 46.1 |

- | **Larger Models** | **LiveCodeBench v5**<br/>**(2024.10-2025.2)** | **Codeforce** | **Codecontest** |
  | :---: | :---: | :---: | :---: |
  | Qwen3-235B | 70.7 | 2056 | - |
  | DeepSeek-R1 | 64.3 | 2029 | - |
  | OpenAI-o3-mini (Medium) | 66.3 | 2036 | - |

- *Table 1: Coding Task Performance Comparison. AReaL-boba²-8B/14B-Open denotes training results on open-sourced dataAReaL-boba²-8B/14B models are trained with an additional small amount of internal data could achieve SOTA performance on LiveCodeBench, Codeforce & CodeContests*

- We highlight the tutorials about the following key features for synchronous training in AReaL-boba²:

  + [Streaming generation and reward computation](https://inclusionai.github.io/AReaL/developer/rollout/rollout_worker.html)
  + [Interruptible rollout](https://inclusionai.github.io/AReaL/developer/rollout/gserver.html)
  + [Data staleness control with the rollout controller](https://inclusionai.github.io/AReaL/developer/rollout/gserver.html)
  + [The adoption of decoupled PPO loss](https://inclusionai.github.io/AReaL/customization/algorithm.html)

- We provide a step-by-step [code walkthrough](https://inclusionai.github.io/AReaL/developer/overview.html) and [customization guide](https://inclusionai.github.io/AReaL/customization/dataset.html) for these features and recommend users to walk through the corresponding documentation.
-
  ### RL Training for Multi-turn Agent

- AReaL-boba² allows you to independently customize the [dataset](https://inclusionai.github.io/AReaL/customization/dataset.html), [rollout behavior](https://inclusionai.github.io/AReaL/customization/agent.html), and the [training algorithm](https://inclusionai.github.io/AReaL/customization/algorithm.html), without the need to modify the heavy system-level code.

  In particular, we show a simple example to develop a multi-turn math agent for RL training. Please see the learning curve below and reference the [step-by-step guide](https://inclusionai.github.io/AReaL/customization/agent.html) if you want to implement your own agentic RL project.

  **Multi-turn Agent Learning Curve**

-
  ## Getting Started

  ### Quick Start
@@ -112,30 +109,12 @@ In particular, we show a simple example to develop a multi-turn math agent for R
  Train Qwen3 1.7B locally:

  ```bash
- bash examples/env/scripts/setup-pip-deps.sh
- python3 training/main_async_ppo.py \
- n_nodes=1 n_gpus_per_node=8 \
- allocation_mode=sglang.d4p1m1+d2p2m1 \
- cluster.fileroot=/storage/testing/experiments \
- actor.type._class=qwen3 \
- actor.path=Qwen/Qwen3-1.7B \
- ref.type._class=qwen3 \
- ref.path=Qwen/Qwen3-1.7B \
- dataset.path=/path/to/dataset/boba_106k_0319.jsonl \
- dataset.train_bs_n_seqs=32 \
- group_size=8 \
- ppo.gen.max_new_tokens=4096 \
- ppo.ppo_n_minibatches=4 \
- actor_train.mb_spec.max_tokens_per_mb=32768 \
- actor_inf.mb_spec.max_tokens_per_mb=32768 \
- max_concurrent_rollouts=16 \
- max_head_offpolicyness=4
  ```

  Evaluation:

  ```bash
- bash examples/env/scripts/setup-eval-pip-deps.sh
  cd evaluation
  # Evaluate the model
  python eval_and_aggregate.py \
@@ -148,49 +127,72 @@ python eval_and_aggregate.py \
  --temperature 1.0
  ```

- ### Resources

- + [Installation](https://inclusionai.github.io/AReaL/tutorial/installation.html)
- + [Quickstart](https://inclusionai.github.io/AReaL/tutorial/quickstart.html)
- + [Code Walkthrough](https://inclusionai.github.io/AReaL/developer/overview.html)
- + **Customization Guide**
-   - [Dataset](https://inclusionai.github.io/AReaL/customization/dataset.html)
-   - [Rollout Behavior (Agentic RL)](https://inclusionai.github.io/AReaL/customization/agent.html)
-   - [Training Algorithm](https://inclusionai.github.io/AReaL/customization/algorithm.html)
  + [Contributing](https://inclusionai.github.io/AReaL/contrib.html)

  ## Future Plan

- AReaL is under active development. We plan to have minor releases in a weekly manner and major releases in a monthly manner. Community engagements and contributions are extremely welcomed. We are also **hiring interns and full-timers** with open positions in both the US and China.

  For the research and development plan already in place, please see the following list:

  ### System Development

- - [x] Support for SGLang.
- - [x] RL training with coding problems.
- - [x] Asynchronous generation and RL training.
- - [ ] Optimizations for distributed training: expert parallel for MOE and zero-bubble pipelining.
- - [ ] RL for vision-language models (VLM).
- - [x] Multi-turn agentic RL.
- - [ ] Function calling and tool use.

  ### Algorithm Development

- - [x] RL training receipes for 1.5B and 7B models.
- - [x] A complete RL training receipe for 32B models.
- - [ ] Sample-efficient multi-task RL algorithms.
- - [ ] Agentic capabilities with end-to-end RL.
- - [ ] Stable RL training for larger MOE models.

  ## Acknowledgement
- We would like to remark that major contributors are from the RL Lab at Ant Research and the Institute for Interdisciplinary Information Sciences, Tsinghua University.

  Our team has also received invaluable assistance from the Data Intelligence Lab at Ant Research for data support and from the Super Computing Technology (SCT) team at Ant Group, particularly in the realm of large-scale cluster operations and maintenance.

- We also appreciate all the pioneering works from the community, particularly the [ReaLHF](https://github.com/openpsi-project/ReaLHF) project from OpenPsi Inc. and those other projects, including but not limited to, [DeepScaleR](https://github.com/agentica-project/deepscaler), [Open-Reasoner-Zero](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main), [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), [VeRL](https://github.com/volcengine/verl), [SGLang](https://github.com/sgl-project/sglang), [QwQ](https://github.com/QwenLM/QwQ), [Light-R1](https://github.com/Qihoo360/Light-R1) and [DAPO](https://github.com/BytedTsinghua-SIA/DAPO).

  ## Citation

  ```bibtex
  @inproceedings{mei2025real,
  author = {Mei, Zhiyu and Fu, Wei and Li, Kaiwei and Wang, Guangju and Zhang, Huanchen and Wu, Yi},
@@ -203,12 +205,13 @@ We also appreciate all the pioneering works from the community, particularly the
  ```

  ```bibtex
- @misc{areal2025,
- author = {RL Lab, Ant Research},
- title = {AReaL: Ant Reasoning RL},
- year = {2025},
- publisher = {GitHub},
- journal = {GitHub repository},
- howpublished = {\url{https://github.com/inclusionAI/AReaL}},
  }
- ```
 
  ---
+ {}
  ---
  <h1 align="center">
  <em>AReaL</em>: Ant Reasoning Reinforcement Learning for LLMs

  </p>

+ AReaL (Ant Reasoning RL) is an open-source **fully asynchronous reinforcement learning training system** for large reasoning models developed at **the RL Lab, Ant Research**. Built upon the open-source project [RealHF](https://github.com/openpsi-project/ReaLHF), we are fully committed to open-source by providing training details, data, and infrastructure required to reproduce results along with the model itself. AReaL aims to help everyone build their own AI agents easily and affordably. Our team loves milk tea because it's delicious, customizable, and affordable. We hope you enjoy our project just like how you enjoy real-world milk tea (cheers).

  **AReaL Highlights**

+ + 🔥 <span style="color: red; font-weight: bold;">**[NEW] Asynchronous RL:**</span> With algorithm-system co-design, AReaL supports fully asynchronous RL for **the fastest training**! Experimental support for multi-turn agentic RL is also provided.
+ + 🛠️ **Open & Reproducible**: We continuously release _all code, datasets, and training recipes_ for RL training of LLMs.
+ + 🚀 **Scalability**: AReaL can seamlessly adapt to different computational resource settings, ranging from a single node to 1K GPUs.
+ + 🔪 **Cutting-Edge Performance:** AReaL can produce models with cutting-edge reasoning capabilities in math and coding. We are also actively working on agentic tasks.

  ## News

  **[2025/06/03] (v0.3, boba²)** We release **boba²** (double-boba) for fully asynchronous RL training, which achieves a **2.77x speedup while obtaining on-par or even better training performance** compared to synchronous systems. Moreover, asynchronous RL makes it extremely easy to set up multi-turn agentic RL training! Check out [our v0.3 overview blog](/blog/AReaL_v0_3.md) and the [research paper](https://arxiv.org/pdf/2505.24298).

+ **[2025/03/31] (v0.2, Boba)** Here comes our next milestone release - Boba! Please call it A-ReaL-Boba! This release includes much faster training with SGLang support and SOTA 7B and 32B models on math reasoning. Check our [v0.2 technical blog](/blog/AReaL_v0_2.md).

  **[2025/02/24] (v0.1)** Our initial release includes reproducible results for 1.5B and 7B LRMs. Check our [v0.1 technical blog](/blog/AReaL_v0_1.md).

  In our AReaL-boba² (A-ReaL-double-boba) release, we highlight the top 3 most important features:

+ + A fully asynchronous RL training pipeline with **system and RL algorithm co-design**, achieving over 2.77x speedup without any performance drop. Check the [benchmark scripts and instructions here](https://github.com/inclusionAI/AReaL/tree/main/benchmark/verl_v0_3_0_post1_76084d3).

+ + SOTA coding models, i.e., a 14B model with a **69.1 score on LCB-v5**. To reproduce, check the [configs](https://github.com/inclusionAI/AReaL/tree/main/examples/configs/v0.3-qwen3-code) and [instructions](https://inclusionai.github.io/AReaL/references/reproduce.html).
+
+ + Experimental support for **multi-turn** agentic RL training. Check our [complete example](https://inclusionai.github.io/AReaL/customization/agent.html).
+
+ For the complete system design and more training details, please check [our v0.3 blog](/blog/AReaL_v0_3.md) and our [research paper](https://arxiv.org/pdf/2505.24298).

  ### Overview of Asynchronous RL Training

+ During the synchronous RL training process, a generation step must wait until the longest sequence completes within the batch of LLM outputs. Due to the varying output lengths for LRMs, a synchronous RL system suffers from massive GPU idle time, leading to training inefficiency. Some recent works ([DeepCoder](https://pretty-radio-b75.notion.site/DeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51), [Intellect](https://www.primeintellect.ai/blog/intellect-2)) propose overlapping a single training step with a single generation step to accelerate training. However, the largest bottleneck remains unchanged: the samples within a batch are still from the same model version, leading to waiting and GPU idle time.

  **Synchronous vs One-step Overlap RL**

+ *Fig.1. Left: Execution timeline of synchronous RL training. Right: Execution timeline of one-step overlap RL system.*

  AReaL adopts a fully asynchronous RL training framework that completely decouples generation from training. In AReaL, LLM generation runs in a streaming manner, with each rollout worker continuously producing outputs without waiting. Meanwhile, trainer workers perform parallel model updates upon receiving training batches.
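
This decoupling can be pictured as a producer-consumer pipeline: rollout workers keep streaming finished sequences into a shared buffer while the trainer pulls a batch and runs an update as soon as enough samples are available. Below is a minimal, illustrative Python sketch of that pattern (thread- and queue-based, with made-up names such as `rollout_worker` and `trainer`); it is not AReaL's actual worker code.

```python
# Minimal sketch of decoupled generation/training (illustration, not AReaL code).
import queue
import random
import threading
import time

sample_buffer: "queue.Queue[dict]" = queue.Queue(maxsize=256)

def rollout_worker(worker_id: int, n_samples: int) -> None:
    """Streams generated samples into the shared buffer without waiting for training."""
    for i in range(n_samples):
        time.sleep(random.uniform(0.0, 0.02))  # stand-in for variable-length generation
        sample_buffer.put({"worker": worker_id, "sample_id": i})

def trainer(batch_size: int, n_updates: int) -> None:
    """Pulls a batch and performs a model update as soon as enough samples arrive."""
    for step in range(n_updates):
        batch = [sample_buffer.get() for _ in range(batch_size)]
        print(f"update {step}: batch of {len(batch)} samples")  # model update would go here

workers = [threading.Thread(target=rollout_worker, args=(w, 8)) for w in range(4)]
threads = workers + [threading.Thread(target=trainer, args=(4, 8))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```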

  *Fig 2. Execution timeline of our fully asynchronous RL system.*

+ AReaL follows a system-algorithm co-design principle: on the system side, AReaL efficiently syncs model parameters and carefully controls the staleness of each training sample; on the algorithm side, AReaL improves the objective of PPO to make async-RL stable.
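
On the algorithm side, the key idea is the decoupled PPO objective described in the research paper: the behavior policy that generated a (possibly stale) sample is separated from a recent proximal policy used as the clipping anchor. The snippet below is a hedged, simplified reading of that objective for illustration only; it omits token masking, KL regularization, and everything else in AReaL's real training code.

```python
# Simplified sketch of a decoupled PPO objective (illustration, not AReaL's code).
import torch

def decoupled_ppo_loss(
    logp_new: torch.Tensor,    # log-probs of sampled tokens under the policy being optimized
    logp_prox: torch.Tensor,   # log-probs under a recent "proximal" policy (clipping anchor)
    logp_behav: torch.Tensor,  # log-probs under the behavior policy that generated the data
    advantages: torch.Tensor,
    clip_eps: float = 0.2,
) -> torch.Tensor:
    # Clip against the proximal policy rather than the stale behavior policy.
    ratio = torch.exp(logp_new - logp_prox)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    # Reweight by how far the proximal policy has drifted from the behavior policy.
    behav_weight = torch.exp(logp_prox - logp_behav).detach()
    return -(behav_weight * surrogate).mean()

# Toy usage with random per-token log-probs and advantages.
logp_new = torch.randn(16, requires_grad=True)
loss = decoupled_ppo_loss(logp_new, torch.randn(16), torch.randn(16), torch.randn(16))
print(float(loss))
loss.backward()
```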

+ We compare the scalability of **asynchronous RL** training based on our AReaL-boba² system with **classical synchronous RL** training (we adopt the fastest open-source system veRL, main branch on 05/07/2025) across different model sizes and different numbers of H800 GPUs. AReaL demonstrates much improved scaling capabilities with respect to training throughput. This is also partially because AReaL decouples training and generation, leading to much less GPU memory fragmentation.

  **Scaling Comparison**

  ### SOTA Code Generation Model by AReaL-boba²

+ We use **Qwen3** as our base model. After asynchronous RL training, we achieve SOTA results on LiveCodeBench, Codeforces, and CodeContests benchmarks.

+ | **Model (8B)** | **LiveCodeBench v5**<br/>**(2024.10-2025.2)** | **Codeforces** | **CodeContests** |
  | :---: | :---: | :---: | :---: |
  | Qwen3-8B | 58.8 | 1879/96.7% | 31.4 |
  | DeepSeek-R1-0528-Qwen3-8B | 58.4 | 1945/97.3% | 31.0 |
+ | [🤗 AReaL-boba²-8B-Open](https://huggingface.co/inclusionAI/AReaL-boba-2-8B-subset) | 62.0 | 1933/97.2% | **41.4** |
+ | [🤗 AReaL-boba²-8B](https://huggingface.co/inclusionAI/AReaL-boba-2-8B) | **63.0** | **1962/97.5%** | 40.8 |

+ | **Model (14B)** | **LiveCodeBench v5**<br/>**(2024.10-2025.2)** | **Codeforces** | **CodeContests** |
  | :---: | :---: | :---: | :---: |
  | Qwen3-14B | 65.4 | 1978/97.7% | 38.3 |
  | DeepCoder-14B-Preview | 60.6 | 1936/95.3% | 40.1 |
+ | [🤗 AReaL-boba²-14B-Open](https://huggingface.co/inclusionAI/AReaL-boba-2-14B-subset) | 67.3 | 1990/97.8% | **46.2** |
+ | [🤗 AReaL-boba²-14B](https://huggingface.co/inclusionAI/AReaL-boba-2-14B) | **69.1** | **2044/98.2%** | 46.1 |

+ | **Larger Models** | **LiveCodeBench v5**<br/>**(2024.10-2025.2)** | **Codeforces** | **CodeContests** |
  | :---: | :---: | :---: | :---: |
  | Qwen3-235B | 70.7 | 2056 | - |
  | DeepSeek-R1 | 64.3 | 2029 | - |
  | OpenAI-o3-mini (Medium) | 66.3 | 2036 | - |

+ *Table 1: Coding Task Performance Comparison. AReaL-boba²-8B/14B-Open denotes training results on open-source data. AReaL-boba²-8B/14B models are trained with an additional small amount of internal data and achieve SOTA performance on LiveCodeBench, Codeforces & CodeContests.*

+ We highlight the [tutorials](https://inclusionai.github.io/AReaL/customization/dataset.html) and [code walkthroughs](https://inclusionai.github.io/AReaL/developer/overview.html) about the following key features for asynchronous training:

  + [Streaming generation and reward computation](https://inclusionai.github.io/AReaL/developer/rollout/rollout_worker.html)
  + [Interruptible rollout](https://inclusionai.github.io/AReaL/developer/rollout/gserver.html)
  + [Data staleness control with the rollout controller](https://inclusionai.github.io/AReaL/developer/rollout/gserver.html)
  + [The adoption of decoupled PPO loss](https://inclusionai.github.io/AReaL/customization/algorithm.html)
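
To make the staleness-control item above concrete, here is a toy sketch of the kind of admission rule a rollout controller can apply: a sample only enters a training batch if the policy version that generated it lags the current weights by at most a threshold, mirroring the `max_head_offpolicyness` setting that appears in the training command earlier in this diff. Class and method names here are illustrative, not AReaL's actual API.

```python
# Illustrative staleness filter for async RL samples (not AReaL's actual API).
from dataclasses import dataclass

@dataclass
class Sample:
    tokens: list
    policy_version: int  # version of the weights that generated this sample

class StalenessController:
    def __init__(self, max_head_offpolicyness: int = 4):
        self.max_head_offpolicyness = max_head_offpolicyness
        self.current_version = 0

    def on_weights_updated(self) -> None:
        self.current_version += 1

    def admit(self, sample: Sample) -> bool:
        staleness = self.current_version - sample.policy_version
        return staleness <= self.max_head_offpolicyness

ctrl = StalenessController(max_head_offpolicyness=4)
for _ in range(6):
    ctrl.on_weights_updated()                                # six parameter updates so far
print(ctrl.admit(Sample([1, 2, 3], policy_version=3)))       # True: only 3 versions behind
print(ctrl.admit(Sample([4, 5, 6], policy_version=1)))       # False: 5 versions behind
```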

  ### RL Training for Multi-turn Agent

+ AReaL-boba² allows you to independently customize the [dataset](https://inclusionai.github.io/AReaL/customization/dataset.html), [rollout behavior](https://inclusionai.github.io/AReaL/customization/agent.html), and the [training algorithm](https://inclusionai.github.io/AReaL/customization/algorithm.html), without needing to modify the heavy system-level code.

  In particular, we show a simple example to develop a multi-turn math agent for RL training. Please see the learning curve below and reference the [step-by-step guide](https://inclusionai.github.io/AReaL/customization/agent.html) if you want to implement your own agentic RL project.
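
To give a feel for what such a rollout can look like, here is a toy, self-contained sketch of a multi-turn math agent loop: generate an answer, score it, and feed an error hint back for another attempt. The `generate`/`verify` callables and the class names are hypothetical stand-ins; the linked step-by-step guide documents AReaL's actual agent interface.

```python
# Toy multi-turn math agent rollout (hypothetical interface, not AReaL's API).
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Turn:
    prompt: str
    response: str
    reward: float

@dataclass
class Trajectory:
    turns: List[Turn] = field(default_factory=list)

def multi_turn_math_rollout(
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],
    question: str,
    max_turns: int = 3,
) -> Trajectory:
    """Roll out up to `max_turns` attempts, feeding back a hint after a wrong answer."""
    traj = Trajectory()
    prompt = question
    for _ in range(max_turns):
        answer = generate(prompt)                      # call into the generation server
        reward = 1.0 if verify(question, answer) else 0.0
        traj.turns.append(Turn(prompt, answer, reward))
        if reward > 0:
            break
        prompt = f"{question}\nYour previous answer was incorrect. Please try again."
    return traj

# Toy stand-ins for generation and verification.
traj = multi_turn_math_rollout(
    generate=lambda p: "42",
    verify=lambda q, a: a.strip() == "42",
    question="What is 6 x 7?",
)
print(len(traj.turns), traj.turns[-1].reward)  # 1 1.0
```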

  **Multi-turn Agent Learning Curve**

  ## Getting Started

  ### Quick Start

  Train Qwen3 1.7B locally:

  ```bash
+ bash examples/run_async_ppo.sh
  ```

  Evaluation:

  ```bash
  cd evaluation
  # Evaluate the model
  python eval_and_aggregate.py \

  --temperature 1.0
  ```

+ ## Resources

+ + [Documentation](https://inclusionai.github.io/AReaL/)
  + [Contributing](https://inclusionai.github.io/AReaL/contrib.html)

+ ### Quickstart
+
+ + [Installation](https://inclusionai.github.io/AReaL/tutorial/installation.html)
+ + [Example: Improving the math capability of Qwen3 with PPO](https://inclusionai.github.io/AReaL/tutorial/quickstart.html)
+
+ ### Benchmark and Reproduction
+
+ + **Reproduce boba² Code Models**
+   - 🤗 **Model weights**: [8B-code](https://huggingface.co/inclusionAI/AReaL-boba-2-8B), [14B-code](https://huggingface.co/inclusionAI/AReaL-boba-2-14B), [8B-code-open](https://huggingface.co/inclusionAI/AReaL-boba-2-8B-subset), [14B-code-open](https://huggingface.co/inclusionAI/AReaL-boba-2-14B-subset)
+   - [Evaluation Guide](https://inclusionai.github.io/AReaL/tutorial/eval.html)
+   - [Training configs](https://github.com/inclusionAI/AReaL/tree/main/examples/configs/v0.3-qwen3-code) and [instructions](https://inclusionai.github.io/AReaL/references/reproduce.html)
+ + [Scripts for Benchmark Training Throughput](https://github.com/inclusionAI/AReaL/tree/main/benchmark/verl_v0_3_0_post1_76084d3)
+
+ ### Customization Guide
+
+ - [Use your own dataset](https://inclusionai.github.io/AReaL/customization/dataset.html)
+ - [Modifying the reward function and rollout behavior (multi-turn agentic RL)](https://inclusionai.github.io/AReaL/customization/agent.html)
+ - [Modifying PPO to GRPO](https://inclusionai.github.io/AReaL/customization/algorithm.html#grouped-advantage-normalization)
+ - [Developing the decoupled PPO loss](https://inclusionai.github.io/AReaL/customization/algorithm.html#the-decoupled-ppo-loss)
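
For readers curious about the PPO-to-GRPO change linked just above, its core is grouped advantage normalization: the rewards of the multiple responses sampled for the same prompt are normalized within that group. The snippet below is a minimal sketch of that computation, assuming a flat reward tensor ordered by prompt; it is not AReaL's implementation.

```python
# Sketch of grouped advantage normalization (GRPO-style), illustrative only.
import torch

def grouped_advantages(rewards: torch.Tensor, group_size: int, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (num_prompts * group_size,), grouped consecutively by prompt."""
    grouped = rewards.view(-1, group_size)
    mean = grouped.mean(dim=1, keepdim=True)
    std = grouped.std(dim=1, keepdim=True)
    return ((grouped - mean) / (std + eps)).view(-1)

# Two prompts, four sampled responses each.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0])
print(grouped_advantages(rewards, group_size=4))
```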
+
+ ### System Code Walkthrough
+
+ + [Trainer](https://inclusionai.github.io/AReaL/developer/trainer/model_worker.html)
+ + [Model Backend and Algorithm Interface](https://inclusionai.github.io/AReaL/developer/trainer/algo_interface.html)
+ + [Rollout Controller](https://inclusionai.github.io/AReaL/developer/rollout/gserver.html)
+ + [Streaming generation and reward computation](https://inclusionai.github.io/AReaL/developer/rollout/rollout_worker.html)
+
  ## Future Plan

+ AReaL is under active development. We plan to have minor releases weekly and major releases monthly. Community engagement and contributions are extremely welcome. We are also **hiring interns and full-time employees** with open positions in both the US and China.

  For the research and development plan already in place, please see the following list:

  ### System Development

+ - [x] Support for SGLang
+ - [x] RL training with coding problems
+ - [x] Asynchronous generation and RL training
+ - [ ] Optimizations for distributed training: expert parallel for MOE and zero-bubble pipelining
+ - [ ] RL for vision-language models (VLM)
+ - [x] Multi-turn agentic RL
+ - [ ] Function calling and tool use

  ### Algorithm Development

+ - [x] RL training recipes for 1.5B and 7B models
+ - [x] A complete RL training recipe for 32B models
+ - [ ] Sample-efficient multi-task RL algorithms
+ - [ ] Agentic capabilities with end-to-end RL
+ - [ ] Stable RL training for larger MOE models

  ## Acknowledgement
+
+ We would like to note that major contributors are from the RL Lab at Ant Research and the Institute for Interdisciplinary Information Sciences, Tsinghua University.

  Our team has also received invaluable assistance from the Data Intelligence Lab at Ant Research for data support and from the Super Computing Technology (SCT) team at Ant Group, particularly in the realm of large-scale cluster operations and maintenance.

+ We also appreciate all the pioneering works from the community, particularly the [ReaLHF](https://github.com/openpsi-project/ReaLHF) project from OpenPsi Inc. and other projects, including but not limited to [DeepScaleR](https://github.com/agentica-project/deepscaler), [Open-Reasoner-Zero](https://github.com/Open-Reasoner-Zero/Open-Reasoner-Zero/tree/main), [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF), [VeRL](https://github.com/volcengine/verl), [SGLang](https://github.com/sgl-project/sglang), [QwQ](https://github.com/QwenLM/QwQ), [Light-R1](https://github.com/Qihoo360/Light-R1) and [DAPO](https://github.com/BytedTsinghua-SIA/DAPO).

  ## Citation
+
  ```bibtex
  @inproceedings{mei2025real,
  author = {Mei, Zhiyu and Fu, Wei and Li, Kaiwei and Wang, Guangju and Zhang, Huanchen and Wu, Yi},

  ```

  ```bibtex
+ @misc{fu2025areal,
+ title={AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning},
+ author={Wei Fu and Jiaxuan Gao and Xujie Shen and Chen Zhu and Zhiyu Mei and Chuyi He and Shusheng Xu and Guo Wei and Jun Mei and Jiashu Wang and Tongkai Yang and Binhang Yuan and Yi Wu},
+ year={2025},
+ eprint={2505.24298},
+ archivePrefix={arXiv},
+ primaryClass={cs.LG},
+ url={https://arxiv.org/abs/2505.24298},
  }
+ ```