Survey · ~65 min read

RL Infrastructure for the Mathematically Inclined

A field survey of reinforcement-learning post-training systems, written for RL theorists who want to read the source code of verl, SGLang, Megatron, and Triton and recognize what is mathematically interesting about the engineering choices.

I wrote a long-form study roadmap of modern RL post-training infrastructure.

→ Read the full survey at chenggong-zhang.github.io/RL_infra

What follows here is the table of contents and the case for why a mathematician should care. The body is on the other page.

What’s in the survey

The article is structured as a single long-form blog post. Section headings:

  • Why this matters to a theorist — the case for caring about systems
  • The skeleton — the five-stage cycle and the three pillars every framework instantiates
  • The central tension — why training and inference fight on the same hardware
  • Six engineering primitives — the engineered solutions:
    • ① The hybrid engine (训推一体, "training and inference in one")
    • ② Memory choreography
    • ③ Zero-copy weight synchronization
    • ④ Four update_weights_from_* paths, one verb
    • ⑤ RadixAttention × GRPO — algebraic composition
    • ⑥ Async training and the staleness tradeoff
  • The layer beneath: CUDA, Triton, TileLang — kernel-level DSLs
  • The training backbone: Megatron-LM — 5D parallelism (TP × PP × DP × EP × CP)
  • Quantization and the numerical-alignment problem — FP8, INT4, MXFP4; the MoE routing divergence
  • Multi-turn agentic RL — unifying VLM and LLM from first principles — the rollout loop, BaseInteractionEnv, the dummy-messages + delta tokens trick (bounded context growth), multimodal tensor merge (O(n²) → O(n))
  • Engineering case study — Miles’ DeepSeek-V3 RL pipeline — the 3-module decoupling and the 5-stage pipeline (run_deepseek.py walkthrough)
  • Recent advances from the SGLang RL team — INT4 QAT (Kimi K2-Thinking style), unified VLM/LLM multi-turn, Rollout Router Replay, full-flow FP8, speculative decoding in RL
  • Reading real code — verl’s fit() loop and AReaL’s async pattern
  • The framework landscape — 9 frameworks on 5 axes
  • Beyond chat — multi-turn agents, embodied AI, world models, the TPU detour
  • The scaling chain — what breaks at each cluster tier
  • Pitfalls (踩坑录, "a record of pitfalls"): six production failure modes from Chenyang’s tutorial
  • A reading list for the theoretically inclined

Eight SVG diagrams

The survey ships eight diagrams. Each links through to its section on the survey site — click any image to read the surrounding analysis.

The five-stage rollout cycle: Generate → Score → Filter → Train → Sync weights → repeat.
Figure 1. The skeleton — the five-stage loop every framework instantiates.
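As a rough illustration of the five stages named in Figure 1, the loop can be sketched in plain Python. Everything here is a toy stand-in rather than any framework's code: the function names, the "even number is correct" reward, and the rule of dropping groups whose rewards are all identical are assumptions made for the example.

```python
import random

def generate(policy_version, prompts, group_size, rng):
    """Stage 1: sample group_size completions per prompt (toy: random digits)."""
    return [
        {"prompt": p, "completion": rng.randint(0, 9), "policy_version": policy_version}
        for p in prompts
        for _ in range(group_size)
    ]

def score(samples):
    """Stage 2: attach a reward (toy rule: even completions count as correct)."""
    for s in samples:
        s["reward"] = 1.0 if s["completion"] % 2 == 0 else 0.0
    return samples

def filter_groups(samples):
    """Stage 3: drop prompts whose group has identical rewards (assumed filter rule)."""
    by_prompt = {}
    for s in samples:
        by_prompt.setdefault(s["prompt"], []).append(s)
    kept = []
    for group in by_prompt.values():
        if len({s["reward"] for s in group}) > 1:
            kept.extend(group)
    return kept

def train(trainer_version, batch):
    """Stage 4: one optimizer step (toy: only the version counter changes)."""
    return trainer_version + 1

def sync_weights(trainer_version):
    """Stage 5: the rollout engine adopts the trainer's current weights."""
    return trainer_version

def run(num_iterations=3, seed=0):
    rng = random.Random(seed)
    trainer_version = rollout_version = 0
    history = []
    for _ in range(num_iterations):
        samples = generate(rollout_version, ["p0", "p1"], group_size=4, rng=rng)
        batch = filter_groups(score(samples))
        trainer_version = train(trainer_version, batch)
        rollout_version = sync_weights(trainer_version)
        history.append((len(samples), len(batch), rollout_version))
    return history
```

Because the sync happens at the end of every iteration, each batch in this sketch is generated by the version that is about to be trained, which is the strictly on-policy case.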
Three pillars: training engine (Megatron, FSDP, DeepSpeed), inference engine (SGLang, vLLM, TRT-LLM), orchestrator (Ray, custom NCCL, torchrun).
Figure 2. Three pillars — Training engine + Inference engine + Orchestrator. Pick a row from each column to name a framework.
Timeline of memory ownership swapping between inference and training, with release_memory_occupation and resume_memory_occupation marking the transitions.
Figure 3. Memory choreography — one GPU, two roles, over time. Two function calls encode the entire concurrency contract.
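The contract in Figure 3 can be modeled with a toy allocator. The two method names release_memory_occupation and resume_memory_occupation come from the figure; the classes ToyGPU and ToyInferenceEngine, the gigabyte figures, and the ownership bookkeeping are hypothetical and only illustrate the ordering constraint.

```python
class ToyGPU:
    """Tracks how many GB each role currently holds on one device."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.allocations = {}

    def used(self):
        return sum(self.allocations.values())

    def allocate(self, owner, gb):
        if self.used() + gb > self.capacity_gb:
            raise MemoryError(f"{owner} needs {gb} GB, only "
                              f"{self.capacity_gb - self.used()} GB free")
        self.allocations[owner] = self.allocations.get(owner, 0) + gb

    def free(self, owner):
        self.allocations.pop(owner, None)


class ToyInferenceEngine:
    """Stand-in for a rollout engine that can hand its memory back and reclaim it."""

    def __init__(self, gpu, footprint_gb):
        self.gpu = gpu
        self.footprint_gb = footprint_gb
        self.resident = False
        self.resume_memory_occupation()

    def release_memory_occupation(self):
        self.gpu.free("inference")
        self.resident = False

    def resume_memory_occupation(self):
        self.gpu.allocate("inference", self.footprint_gb)
        self.resident = True

    def generate(self):
        if not self.resident:
            raise RuntimeError("engine memory was released; resume first")
        return "rollouts"


def train_step(gpu, footprint_gb):
    """Training claims its memory, runs, and always gives the memory back."""
    gpu.allocate("training", footprint_gb)
    try:
        return "updated weights"
    finally:
        gpu.free("training")


def one_cycle(engine, gpu, train_gb):
    """Rollout, release, train, resume: the ordering the figure describes."""
    rollouts = engine.generate()
    engine.release_memory_occupation()
    weights = train_step(gpu, train_gb)
    engine.resume_memory_occupation()
    return rollouts, weights
```

With an 80 GB device, a 60 GB engine, and a 70 GB training step, calling train_step before the release raises MemoryError, while one_cycle succeeds.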
The four update_weights paths: from_tensor (same process, via ZMQ), from_disk (checkpoint on shared FS), from_distributed (NCCL broadcast across ranks), from_ipc (CUDA IPC same host, different processes).
Figure 4. Four update_weights_from_* paths — one verb, four transport topologies. The same operation lifted to four categories of physical configuration.
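One way to read Figure 4 is as a dispatch on where the trainer and the rollout engine sit relative to each other. The sketch below encodes that reading. The four path names follow the figure, but the Placement fields and the priority order in choose_update_path are assumptions made for illustration, not a rule taken from any framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Placement:
    """Hypothetical description of where trainer and rollout engine live."""
    same_process: bool
    same_host: bool
    has_process_group: bool  # a collective group spans trainer and rollout ranks

def choose_update_path(p: Placement) -> str:
    # Cheapest transport first: shared address space, then same-host IPC,
    # then a collective broadcast, and a checkpoint on shared storage last.
    if p.same_process:
        return "update_weights_from_tensor"
    if p.same_host:
        return "update_weights_from_ipc"
    if p.has_process_group:
        return "update_weights_from_distributed"
    return "update_weights_from_disk"
```

For example, choose_update_path(Placement(False, False, True)) selects the broadcast path, and a placement with none of the three properties falls back to disk.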
Without prefix cache: 4 separate prefills of the prompt P. With RadixAttention: one shared prefix node with lock_ref=4, four tail completions branching from it.
Figure 5. RadixAttention × GRPO — hash cache vs radix tree for group rollouts. Algorithmic sharing pattern meets data-structure sharing mechanism; savings multiply.
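The sharing in Figure 5 can be checked with a small counting exercise. The sketch uses a token-level trie instead of a compressed radix tree and counts how many tokens need fresh KV entries. The lock_ref field name is taken from the figure; the PrefixTree class, its methods, and the token values are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.lock_ref = 0  # number of in-flight sequences passing through this node


class PrefixTree:
    """Token-level trie: a simplified stand-in for a radix tree over KV cache entries."""

    def __init__(self):
        self.root = Node()
        self.computed_tokens = 0  # tokens whose KV entries had to be computed

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            child = node.children.get(t)
            if child is None:
                child = Node()
                node.children[t] = child
                self.computed_tokens += 1  # cache miss: compute this token
            child.lock_ref += 1
            node = child
        return node

    def walk(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children[t]
        return node


prompt = [1, 2, 3, 4, 5]
tails = [[10, 11], [20, 21], [30], [40, 41, 42]]  # one GRPO group of size 4

tree = PrefixTree()
for tail in tails:
    tree.insert(prompt + tail)

# Cost if every sample computed its own copy of the prompt.
without_sharing = sum(len(prompt) + len(tail) for tail in tails)
with_sharing = tree.computed_tokens
prefix_node = tree.walk(prompt)
```

Here without_sharing is 28 and with_sharing is 13. The difference of 15 equals (G − 1)·|P| for group size G = 4 and prompt length |P| = 5, and the last prompt node ends with lock_ref equal to 4 and four children.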
Six tiers from 8 GPUs to 10K+, each with what breaks (red), what dominates cost (gray), and what primitive saves you (green).
Figure 6. The scaling chain — what breaks at each tier from 8 GPUs to 10K+, and which primitive saves you there.
The multi-turn rollout loop: prepare inputs, SGLang generate, concat assistant tokens (loss_mask=1), tool call / env interaction, encode observation (loss_mask=0), check termination, loop or finalize.
Figure 7. Multi-turn agentic RL — the turn loop with the loss-mask 1/0 split, sample ↔ weight-sync handoff between rollout and training.
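The loss-mask split in Figure 7 is easy to state in code. The sketch below follows the order of steps in the figure; the policy and environment are scripted toy callables and every name is hypothetical. Only assistant tokens receive loss_mask = 1, so the gradient never flows through prompt tokens or tool observations.

```python
def append_segment(token_ids, loss_mask, new_tokens, trainable):
    """Concatenate a segment and mark whether its tokens contribute to the loss."""
    token_ids.extend(new_tokens)
    loss_mask.extend([1 if trainable else 0] * len(new_tokens))


def rollout(prompt_tokens, policy, env, max_turns):
    token_ids = list(prompt_tokens)
    loss_mask = [0] * len(token_ids)  # the prompt is never trained on
    for _ in range(max_turns):
        assistant_tokens = policy(token_ids)            # generate
        append_segment(token_ids, loss_mask, assistant_tokens, trainable=True)
        observation_tokens, done = env(assistant_tokens)  # tool call / env step
        append_segment(token_ids, loss_mask, observation_tokens, trainable=False)
        if done:                                        # check termination
            break
    return token_ids, loss_mask


# Scripted stand-ins so the example is deterministic.
policy_script = iter([[11, 12], [13]])
env_script = iter([([21, 22, 23], False), ([], True)])

tokens, mask = rollout(
    prompt_tokens=[1, 2, 3],
    policy=lambda context: next(policy_script),
    env=lambda action: next(env_script),
    max_turns=4,
)
```

The result is tokens = [1, 2, 3, 11, 12, 21, 22, 23, 13] with mask = [0, 0, 0, 1, 1, 0, 0, 0, 1].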
Miles three-module architecture (Training, Data Buffer, Rollout) plus the 5-stage DeepSeek-V3 pipeline: download, FP8→BF16, HF→Megatron, rsync, ray job submit.
Figure 8. Miles · DeepSeek-V3 — three-module decoupling (Training · Data Buffer · Rollout) plus the 5-stage entry pipeline.

Why this might be useful to a mathematician

You’re reading a paper and wondering:

  • “Why does GRPO actually work better than PPO at scale?” — Because group sampling enables prefix-cache reuse multiplicatively, not because the variance reduction math is fundamentally cleaner. The system advantage is doing the heavy lifting.
  • “Is on-policy in the empirical evaluation really on-policy?” — In every framework with partial rollout, no. The off-policy ratio is a hyperparameter you should be checking.
  • “How can an MoE model train stably in FP8?” — Only with explicit routing replay (R3). Without it, two precision regimes diverge on top-k routing decisions and your gradient becomes noise. Miles’ contribution is precisely this invariant.
  • “What’s actually different about cosmos-rl’s diffusion RL?” — Standard PPO doesn’t apply to a diffusion model; cosmos-rl invented DDRL (replaces KL with reward + standard diffusion loss) and ships custom 6D parallelism for 100K-token video sequences.
  • “Why does AReaL claim 2.77×?” — One keyword: async_op=True on every NCCL broadcast, paired with memory-bounded bucketing of the queued operations.
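The FP8 routing question above can be made concrete with a two-expert toy. This is a sketch under loud assumptions: a uniform rounding grid stands in for low-precision arithmetic (real FP8 is not a uniform grid), and all function names and numbers are hypothetical. It shows the rollout side and the training side picking different experts for the same token, and a replayed routing decision removing the mismatch.

```python
def quantize(values, step):
    """Round each value to a coarse uniform grid (stand-in for low precision)."""
    return [round(v / step) * step for v in values]


def router_logits(hidden, expert_weights):
    """One logit per expert: dot product of the hidden state with that expert's row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in expert_weights]


def top_k(logits, k):
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def route(hidden, expert_weights, k, replay=None):
    """Use the recorded rollout decision when given, otherwise recompute it."""
    if replay is not None:
        return list(replay)
    return top_k(router_logits(hidden, expert_weights), k)


hidden = [0.4, 0.6]
experts = [[1.0, 0.0], [0.0, 0.9]]

# Rollout engine: low-precision activations feed the router.
rollout_choice = route(quantize(hidden, 0.25), experts, k=1)
# Trainer: full-precision activations, routing recomputed from scratch.
train_choice = route(hidden, experts, k=1)
# Trainer with replay: reuse the expert indices recorded during rollout.
replayed_choice = route(hidden, experts, k=1, replay=rollout_choice)
```

In this toy, rollout_choice is [0] and train_choice is [1], so without replay the trainer would differentiate through an expert the sampled token never visited; replayed_choice matches the rollout by construction.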

The survey answers each of these in its body. The point isn’t the specific answer; it’s that engineering choices are answering questions you’d otherwise ask of the math.

Source repos surveyed

19 repos read end-to-end. Frameworks first, then the layer beneath, then the rollout targets.

  • RL frameworks (9)
  • The layer beneath (4)
  • Adjacent infrastructure (3)
  • Rollout targets (3)


Read the full survey →