I wrote a long-form survey of modern RL post-training infrastructure for mathematicians and RL theorists who want to read the source code of verl, SGLang, Megatron-LM, and Miles and recognize what’s mathematically interesting about the engineering choices.
→ Read the full survey at chenggong-zhang.github.io/RL_infra/
The premise
If you’re a theorist reading the DeepSeek-V3 report or studying GRPO, the engineering choices in production RL systems are quietly encoding mathematical decisions about your algorithm — which approximations are bounded, which biases are corrected, which invariants hold by construction. A claim about GRPO’s variance reduction holds only as long as the prefix cache, the importance-sampling correction, and the memory-handoff contract all behave as advertised. When they don’t, the loss curve still looks fine and the eval curve mysteriously degrades.
The survey gives you the vocabulary to read the source code without that hazard.
What’s in it
- The skeleton — the five-stage rollout cycle and the three pillars every framework instantiates
- Six engineering primitives — hybrid engine (训推一体, unified training and inference), memory choreography, zero-copy weight sync, the four update_weights_from_* paths, RadixAttention × GRPO, async rollout + staleness corrections
- The layer beneath — CUDA Python, Triton, TileLang — when PyTorch’s abstractions aren’t enough
- The training backbone — Megatron-LM’s 5D parallelism (TP × PP × DP × EP × CP)
- Quantization — FP8, INT4, MXFP4; the MoE routing divergence problem (the deep reason Miles’ R3 exists)
- Multi-turn agentic RL — unifying VLM and LLM from first principles; the dummy-messages + delta-tokens trick; multimodal tensor merge in O(n)
- Engineering case study — Miles’ DeepSeek-V3 RL pipeline, the 5-stage entry script and the async main loop
- Recent advances — INT4 QAT, full-flow FP8, Rollout Router Replay, speculative decoding in RL
- The framework landscape — nine production frameworks (verl, Miles, SLIME, SkyRL, cosmos-rl, RLinf, AReaL, OpenRLHF, SGLang) compared on five axes
- Pitfalls — the six production failure modes the field has paid for with years of debugging
Why this might be useful to a mathematician
You’re reading a paper and wondering:
- “Why does GRPO actually work better than PPO at scale?” — Largely because sampling a group of completions per prompt lets the prefix cache reuse the shared prompt across the whole group, not because the variance-reduction math is fundamentally cleaner. The system advantage is doing the heavy lifting.
- “Is on-policy in the empirical evaluation really on-policy?” — In every framework with partial rollout, no. The off-policy ratio is a hyperparameter you should be checking.
- “How can an MoE model train stably in FP8?” — Only with explicit routing replay (R3). Without it, two precision regimes diverge on top-k routing decisions and your gradient becomes noise. Miles’ contribution is precisely this invariant.
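To make the second point concrete, here is a minimal sketch of the standard clipped importance-sampling correction and a cheap staleness diagnostic. The function names are hypothetical (this is not verl’s or any framework’s actual API); the point is that when partial rollout makes your data slightly off-policy, the ratio exp(logp_new − logp_old) drifts from 1, and the fraction of clipped tokens is a number worth logging.

```python
import torch

def clipped_is_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO/GRPO-style clipped importance-sampling objective (token level).

    logp_new: log-probs under the current policy.
    logp_old: log-probs under the (possibly stale) rollout policy.
    Clipping the ratio to [1-eps, 1+eps] bounds the bias that
    off-policy drift injects into the policy gradient.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()

def offpolicy_fraction(logp_new, logp_old, eps=0.2):
    """Fraction of tokens whose importance ratio falls outside the
    clip range — a crude but useful 'how off-policy are we?' gauge."""
    ratio = torch.exp(logp_new - logp_old)
    return ((ratio < 1 - eps) | (ratio > 1 + eps)).float().mean()
```

When rollouts are fully on-policy the two log-prob tensors coincide, every ratio is exactly 1, and `offpolicy_fraction` returns 0; with partial rollout it won’t, and that gap is the hyperparameter the survey says you should be checking.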
The survey answers each of these in its body. The point isn’t the specific answer; it’s that engineering choices are answering questions you’d otherwise ask of the math.
Acknowledgements
The connective analysis — especially “what’s mathematically interesting” — synthesizes Chenyang Zhao’s Awesome-ML-SYS-Tutorial with primary-source reading of 19 production repos. The Chinese-language original is the canonical reference for the field; this survey is a complement, not a substitute.