I wrote a long-form study roadmap of modern RL post-training infrastructure.
→ Read the full survey at chenggong-zhang.github.io/RL_infra
What follows here is the table of contents and the case for why a mathematician should care. The body is on the other page.
What’s in the survey
The article is structured as a single long-form blog post. Section headings:
- Why this matters to a theorist — the case for caring about systems
- The skeleton — the five-stage cycle and the three pillars every framework instantiates
- The central tension — why training and inference fight on the same hardware
- Six engineering primitives — the engineered solutions:
- ① The hybrid engine (训推一体, unified training and inference)
- ② Memory choreography
- ③ Zero-copy weight synchronization
- ④ Four update_weights_from_* paths, one verb
- ⑤ RadixAttention × GRPO — algebraic composition
- ⑥ Async training and the staleness tradeoff
- The layer beneath: CUDA, Triton, TileLang — kernel-level DSLs
- The training backbone: Megatron-LM — 5D parallelism (TP × PP × DP × EP × CP)
- Quantization and the numerical-alignment problem — FP8, INT4, MXFP4; the MoE routing divergence
- Multi-turn agentic RL — unifying VLM and LLM from first principles — the rollout loop, BaseInteractionEnv, the dummy-messages + delta tokens trick (bounded context growth), multimodal tensor merge (O(n²) → O(n))
- Engineering case study — Miles’ DeepSeek-V3 RL pipeline — the 3-module decoupling and the 5-stage pipeline (run_deepseek.py walkthrough)
- Recent advances from the SGLang RL team — INT4 QAT (Kimi K2-Thinking style), unified VLM/LLM multi-turn, Rollout Router Replay, full-flow FP8, speculative decoding in RL
- Reading real code — verl’s fit() loop and AReaL’s async pattern
- The framework landscape — 9 frameworks on 5 axes
- Beyond chat — multi-turn agents, embodied AI, world models, the TPU detour
- The scaling chain — what breaks at each cluster tier
- Pitfalls (踩坑录) — six production failure modes from Chenyang’s tutorial
- A reading list for the theoretically inclined
Eight SVG diagrams
The survey ships eight diagrams. Each links through to its section on the survey site — click any image to read the surrounding analysis.
Why this might be useful to a mathematician
You’re reading a paper and wondering:
- “Why does GRPO actually work better than PPO at scale?” — Because group sampling enables prefix-cache reuse multiplicatively, not because the variance reduction math is fundamentally cleaner. The system advantage is doing the heavy lifting.
- “Is on-policy in the empirical evaluation really on-policy?” — In every framework with partial rollout, no. The off-policy ratio is a hyperparameter you should be checking.
- “How can an MoE model train stably in FP8?” — Only with explicit routing replay (R3). Without it, two precision regimes diverge on top-k routing decisions and your gradient becomes noise. Miles’ contribution is precisely this invariant.
- “What’s actually different about cosmos-rl’s diffusion RL?” — Standard PPO doesn’t apply to a diffusion model; cosmos-rl invented DDRL (replaces KL with reward + standard diffusion loss) and ships custom 6D parallelism for 100K-token video sequences.
- “Why does AReaL claim 2.77×?” — One keyword: async_op=True on every NCCL broadcast, paired with memory-bounded bucketing of the queued operations.
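The FP8 routing question above can be made concrete with a toy router. This is a minimal sketch, not Miles’ implementation: the quantize helper is a crude stand-in for low-precision arithmetic, and the router rows, function names, and the replay argument are all hypothetical. It shows only that two precision regimes can pick different top-k experts for the same token, and that reusing the recorded rollout decision removes the disagreement.

```python
def quantize(value, step=0.125):
    """Snap a number to a coarse grid, a crude stand-in for low-precision storage."""
    return round(value / step) * step


def router_logits(x, expert_weights, low_precision=False):
    """One logit per expert: the dot product of the token with that expert's router row."""
    logits = []
    for row in expert_weights:
        if low_precision:
            row = [quantize(w) for w in row]
        logits.append(sum(xi * wi for xi, wi in zip(x, row)))
    return logits


def top_k(logits, k):
    """Indices of the k largest logits, returned in ascending index order."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return sorted(ranked[:k])


def route(x, expert_weights, k, low_precision=False, replay=None):
    """Pick experts; if a recorded decision is supplied, reuse it instead of recomputing."""
    if replay is not None:
        return list(replay)
    return top_k(router_logits(x, expert_weights, low_precision), k)


token = [1.0, 1.0]
router = [[0.56, 0.56], [0.44, 0.66], [0.90, 0.90]]  # hypothetical router rows, 3 experts

rollout_experts = route(token, router, k=2, low_precision=True)       # low-precision rollout
train_experts = route(token, router, k=2)                             # full-precision training
replayed_experts = route(token, router, k=2, replay=rollout_experts)  # training with replay

print(rollout_experts, train_experts, replayed_experts)
```

In this toy case the low-precision pass and the full-precision pass agree on one expert and disagree on the other, so without replay the training step would push gradients through an expert the rollout never used.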
The survey answers each of these in its body. The point isn’t the specific answer; it’s that engineering choices are answering questions you’d otherwise ask of the math.
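As one illustration of an engineering choice standing in for a mathematical answer, the prefix-cache claim in the GRPO question can be checked with back-of-the-envelope arithmetic. The sketch below is idealized and rests on assumptions not taken from the survey: every response in a group shares exactly the same prompt, the cache never evicts, and only prompt prefill is counted (decode cost is ignored). The function name and the batch numbers are hypothetical.

```python
def prefill_tokens(num_prompts, group_size, prompt_len, share_prefix):
    """Prompt tokens that must be prefilled for one rollout batch."""
    if share_prefix:
        # Each distinct prompt is prefilled once; its KV cache serves the whole group.
        return num_prompts * prompt_len
    # Without sharing, every sampled response re-runs prefill on its prompt.
    return num_prompts * group_size * prompt_len


# Hypothetical batch: 64 prompts, 8 samples per prompt, 1024-token prompts.
baseline = prefill_tokens(64, 8, 1024, share_prefix=False)
shared = prefill_tokens(64, 8, 1024, share_prefix=True)

print(baseline, shared, baseline // shared)
```

Under these assumptions the prefill saving equals the group size, which is the sense in which the reuse scales multiplicatively with the number of samples per prompt.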
Source repos surveyed
19 repos read end-to-end. Frameworks first, then the layer beneath, then the rollout targets.
RL frameworks (9):
- 1. verl-project/verl — the hybrid-engine reference implementation
- 2. radixark/miles — bit-identical FP8/INT4 + R3 routing replay
- 3. THUDM/slime — on-policy distillation, GLM-5.1
- 4. NovaSky-AI/SkyRL — multi-turn agents, SA-SWE-32B
- 5. nvidia-cosmos/cosmos-rl — physical AI, diffusion world models
- 6. RLinf/RLinf — M2Flow scheduler, embodied + agentic
- 7. sgl-project/sglang — the inference substrate
- 8. inclusionAI/AReaL — fully-async RL
- 9. OpenRLHF/OpenRLHF — the OG framework
The layer beneath (4):
- 10. NVIDIA/Megatron-LM — 5D parallelism training backbone
- 11. triton-lang/triton — Python DSL for GPU kernels
- 12. tile-ai/tilelang — newer TVM-based DSL with Z3 verification
- 13. NVIDIA/cuda-python — Python bindings to CUDA proper
Adjacent infrastructure (3):
- 14. vllm-project/vllm — SGLang’s primary competitor
- 15. NVIDIA/TensorRT — the compilation framework
- 16. NVIDIA/TensorRT-LLM — optimized inference (not RL-friendly)
Rollout targets (3):
- 17. black-forest-labs/flux · flux2 — image diffusion
- 18. Wan-Video/Wan2.2 — MoE video model
- 19. Cambrian-MLLM TPU blog — why TPU-native RL is unsolved