RL Infrastructure for the Mathematically Inclined
A field survey of reinforcement-learning post-training systems — six engineering primitives, three kernel-level DSLs, the training backbone, the quantization problem, three production case studies (Miles · DeepSeek V4 · slime OPD), six recurring design patterns, a researcher's checklist, and open questions at the systems–theory boundary. Written for mathematicians and RL theorists who want to read the source code of verl, SGLang, Megatron-LM, Triton and recognize what's mathematically interesting about the engineering choices.
Why this matters to a theorist
If you're a mathematician thinking about reinforcement learning, you might assume infrastructure is "the boring part that runs your algorithm." This survey argues otherwise. The engineering choices in modern RL post-training systems encode mathematical decisions — which approximations are bounded, which biases are corrected, which invariants hold by construction. You can't read DeepSeek-R1's training story without knowing how Miles' routing-replay makes it numerically possible; you can't compare on-policy and off-policy results without understanding how AReaL's asynchronous broadcasts shift the bias term in your gradient estimator.
The engineering is, in many cases, the assumption set that your theorem is implicitly invoking. A claim about GRPO's variance reduction holds only as long as the prefix cache, the importance-sampling correction, and the memory-handoff contract all behave as advertised. When they don't, the loss curve still looks fine and the eval curve mysteriously degrades — the field has paid for this exact failure pattern with years of debugging.
This survey covers what is mathematically interesting about the engineering. I won't try to teach you the algorithms; the field has good papers and reading lists for that. I'll try instead to give you the structural picture: the five-stage cycle every framework instantiates, the three pillars they all stitch together, the six primitives that resolve the central tension, and the three kernel-level DSLs that make any of it run at a competitive speed. Where the engineering encodes a mathematical fact — an invariant, a composition rule, a bias correction — I'll flag it.
A large number of RL conclusions derived from papers are based on RL infrastructure that may be extremely flawed. — paraphrasing Chenyang Zhao, whose Awesome-ML-SYS-Tutorial is the most-cited source in this survey
The mathematician's instinct here is the right one: if you can't verify the infrastructure, treat the results as conjecture. The good news is that the infrastructure is open source, and the patterns are surprisingly elegant once you have a vocabulary. The aim of this survey is to give you the vocabulary.
How to read this survey
The most useful thing I can tell a theorist before they start is that this material has a particular structure that rewards a particular reading mindset. The skeleton (next section) and the central tension are the only two ideas you cannot skip; everything else is variation. If you have an hour, read those two and the six engineering primitives and stop — you'll have the structural picture. If you have a weekend, read everything in order. If you came here looking for a specific framework, jump to the framework landscape and use its links into the rest of the survey as a vocabulary glossary.
Three reading mindsets reward different kinds of attention.
The structural reading
Treat each engineering primitive as an algebraic statement: here is the invariant the primitive preserves; here is the operation that establishes it. The hybrid engine establishes mutual exclusion on GPU ownership. The four update_weights_from_* paths are a functor lifted across four topology categories. R3 establishes a fixed-point invariant on MoE routing decisions across precision regimes. RadixAttention × GRPO is a categorical product whose savings multiply. Read this way and the survey looks like a small algebra textbook with concrete examples — which is what it is.
The computational reading
Treat each primitive as an asymptotic statement: here is what gets cheaper and by how much, and here is what the next bottleneck is. The hybrid engine takes GPU utilization from 45–55% to 85–90%. Handle-tuple weight sync drops a 50ms-per-tensor cost to sub-millisecond. RadixAttention turns a 4× prefill cost into 1× plus 4× decode. Partial rollout turns 18s of idle into 2× throughput at the price of measurable off-policy bias. Read this way and the scaling chain section (which we'll get to) becomes the survey's organizing principle.
The empirical reading
Treat each primitive as an assumption set under which a paper's claim holds: here is what is invisibly assumed when the paper says "trained with GRPO". The variance reduction of group sampling assumes the prefix cache is reused. The on-policy interpretation of PPO/GRPO assumes the off-policy ratio is bounded. The MoE training stability claim assumes routing replay or unified precision. Read this way and the pitfalls section and the researcher's checklist (toward the end) become essential reading.
The three mindsets are complementary, not exclusive. The point of distinguishing them is that the same paragraph in this survey carries three kinds of content. A senior systems engineer might dwell on the asymptotics; a category-theorist might dwell on the invariants; an empirical-RL researcher might dwell on the assumption set. The page tries to surface all three, but you decide which to weight.
Why theorists usually skip this material — and what they miss
The honest reason most theoretically-trained RL people skip infrastructure is that the field's papers do most of the engineering exposition in appendices that read like release notes. "We use GRPO with group size 8" hides what's mathematically interesting (why 8, what the prefix cache buys, what the staleness correction does); "Training is done on a 128-H100 cluster" hides the parallelism choices. The body of the paper foregrounds the algorithm and treats the infrastructure as an implementation detail, when in fact the infrastructure is often where the empirical story is decided.
What you miss by skipping: you miss the fact that the engineering choices are themselves a contribution to the algorithm's behavior. Two papers can both claim "GRPO + Megatron" and report different numbers, and the difference lives in choices that the body never describes. Once you internalize the survey's vocabulary, the engineering exposition starts answering questions you'd previously have asked of the math. That's the conversion the survey aims for.
A useful default: assume the systems choices materially shaped any empirical RL result, until the paper proves otherwise.
The skeleton: a five-stage cycle on three pillars
Every framework you'll encounter — Miles, SLIME, verl, SkyRL, RLinf, cosmos-rl, AReaL, OpenRLHF, even SGLang's own RL adopters — instantiates the same skeleton. One step of RL training is a five-stage loop: the model generates completions, those are scored by a reward function, low-quality ones are filtered, the survivors train the policy, and the new policy is synced back to the inference engine. Repeat.
Around this cycle live three components: a training engine that computes gradients and holds the optimizer; an inference engine that runs the policy to generate rollouts; and an orchestrator that coordinates them across GPUs and nodes. A framework is, to a first approximation, a particular choice for each of these three slots — plus a data layer (reward hub, filter hub, data buffer) and an algorithm layer (GRPO, PPO, DAPO, OPD) that sit underneath.
The genius of the modern stack is that these three pillars are pluggable, with stable interfaces between them. The framework code looks single-threaded but runs SPMD across hundreds of GPUs. The driver writes engine.update_weights_from_distributed(...) and never sees the NCCL topology underneath. This decoupling is what makes a framework a framework and not a one-off training script.
The central tension: two halves that fight
Pretraining an LLM is a static dataset plus a training loop. Inference is a model plus a request stream. Both are well-understood; the right tools are Megatron-LM and vLLM (or SGLang) respectively. RL post-training is both at once, and the two halves want opposite things from the same hardware.
| Concern | Training prefers | Rollout prefers |
|---|---|---|
| Parallelism | High TP / PP — split the model | High DP — split the batch |
| Memory | Optimizer states + gradients + activations | Weights + KV cache + CUDA graphs |
| Throughput unit | Tokens/step (compute-bound) | Trajectories/sec (memory-bound) |
| Goal | Update one global policy | Generate N independent samples |
If you give half your fleet to training and half to inference and let them run in parallel, the on-policy constraint forces them to alternate anyway — half the fleet is always idle. If you force identical parallelism on both, one side bottlenecks. If you give them different parallelism, every phase transition requires resharding the model. Every engineering primitive that follows is a way out of this trap.
This is, mathematically, a resource-conflict problem with a Pareto frontier. The pragmatic solution — colocate both halves on the same GPUs and serialize them in time — turns the conflict into a sequencing problem. That sequencing problem is what the next six sections solve.
Six engineering primitives worth knowing
Each subsection covers one primitive. For each: the invariant it preserves, the implementation, and what's mathematically interesting about it.
① The hybrid engine — 训推一体
The answer to the opposing-preferences problem is a pattern called 训推一体 ("integrated train-inference"), formalized in verl's HybridFlow paper. Put both halves on the same GPUs; run them serially; swap memory between phases.
Mathematically, this is mutual exclusion: at any moment the GPU is in exactly one of two states (training-mode or inference-mode), and a controlled transition between them happens once per cycle. The cost is the transition itself. The benefit is that every GPU is doing useful work at all times — peak utilization moves from 45-55% (disaggregated) to 85-90% (colocated). Every modern framework — Miles, SLIME, verl, SkyRL, RLinf, cosmos-rl — uses some form of this pattern. The disagreement is on how to make the transition cheap.
Up to about 64 GPUs, full colocation (all roles share one resource pool) wins. Above that, split colocation (actor+ref on one pool, critic+reward on another) wins, because the parallelism preferences of the four roles start to diverge. Pure disaggregation only becomes economic above ~1024 GPUs, when rollout volume can amortize a separate inference cluster.
② Memory choreography
The hybrid engine works because two functions encode the entire concurrency contract: release_memory_occupation and resume_memory_occupation. They pause the inference engine, hand the GPU to the trainer, then restore it. The implementation is granular — three independent memory pools (KV cache, weights, CUDA graphs) can be released independently.
Here's the actual implementation in SGLang. The mathematical interest is in the assert is_fully_idle() at the top: the function literally cannot run if any request is in flight. The concurrency contract is checked structurally, not by convention. Each if ... in tags branch is a separate pool with its own pause semantics — granularity is the point.
def release_memory_occupation(self, recv_req):
    assert self.is_fully_idle(), \
        "release_memory_occupation should be called only when server is idle."
    tags = recv_req.tags or GPU_MEMORY_ALL_TYPES
    for tag in tags:
        self.offload_tags.add(tag)
    if GPU_MEMORY_TYPE_KV_CACHE in tags:
        self.memory_saver_adapter.pause(GPU_MEMORY_TYPE_KV_CACHE)
        self.flush_cache()
    if GPU_MEMORY_TYPE_WEIGHTS in tags:
        self.stashed_model_static_state = _export_static_state(
            self.tp_worker.model_runner.model
        )
        torch.distributed.barrier(self.tp_cpu_group)
        self.memory_saver_adapter.pause(GPU_MEMORY_TYPE_WEIGHTS)
    if GPU_MEMORY_TYPE_CUDA_GRAPH in tags:
        self.memory_saver_adapter.pause(GPU_MEMORY_TYPE_CUDA_GRAPH)
    torch.get_device_module().synchronize()
    return ReleaseMemoryOccupationReqOutput()
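To see how the two verbs compose at the framework level, here is a minimal driver-side sketch of one colocated RL step. It is illustrative, not verl or Miles source: engine stands for an SGLang engine handle, and trainer.onload, trainer.offload, and trainer.named_weight_handles are placeholder names for the training-side half of the contract.

GPU_MEMORY_TYPE_KV_CACHE = "kv_cache"
GPU_MEMORY_TYPE_WEIGHTS = "weights"
GPU_MEMORY_TYPE_CUDA_GRAPH = "cuda_graph"

def one_rl_step(engine, trainer, prompts):
    # Rollout phase: the inference engine owns the GPU.
    trajectories = engine.generate(prompts)

    # Hand the GPU to the trainer: pause all three pools while the engine is idle.
    engine.release_memory_occupation(
        tags=[GPU_MEMORY_TYPE_KV_CACHE, GPU_MEMORY_TYPE_WEIGHTS, GPU_MEMORY_TYPE_CUDA_GRAPH]
    )
    trainer.onload()                       # placeholder: optimizer states and weights back onto the GPU
    metrics = trainer.train(trajectories)  # the gradient step
    trainer.offload()                      # placeholder: give the memory back

    # Hand the GPU back and install the new policy before the next rollout.
    engine.resume_memory_occupation()
    engine.update_weights_from_tensor(trainer.named_weight_handles())  # see primitive ③
    return metrics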
③ Zero-copy weight synchronization
After each training step the trainer has a new policy θ_{t+1}; the inference engine still has θ_t. Naively, you serialize the new weights, send them, deserialize. On a 70B model that's tens of milliseconds per tensor, multiplied over the thousands of weight tensors in the checkpoint — minutes per training step, untenable.
The trick is to never move the tensor data. The trainer process has the new weights in GPU memory; the inference process is either on the same host or shares a NCCL group with the trainer. The trainer sends only a handle tuple — pointer, stride, offset, CUDA IPC descriptor — under a kilobyte. The inference process reconstructs a Python tensor object that points to the same physical GPU memory.
What's mathematically interesting here is that the abstraction is functorial: a Python Tensor is a thin wrapper around an underlying memory descriptor, and reconstructing the wrapper is independent of moving the bytes. The trainer and inference engine end up with two different objects denoting the same memory. The synchronization cost drops from moving O(model bytes) over a wire to passing O(number of tensors) handles of under a kilobyte each. Sub-millisecond instead of minutes.
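A rough way to see the orders of magnitude, using PyTorch's own CUDA-IPC reduction machinery (torch.multiprocessing.reductions.reduce_tensor, which same-host zero-copy sharing is built on). A sketch, assuming a CUDA device is available; the exact handle size varies by PyTorch version but stays near a kilobyte.

import pickle
import torch
from torch.multiprocessing.reductions import reduce_tensor

# One "weight" tensor living on the GPU in the trainer process.
w = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)

# reduce_tensor returns (rebuild_fn, args); for CUDA tensors, args carries the
# CUDA IPC handle plus sizes, strides, and offsets: metadata only, no tensor bytes.
rebuild_fn, args = reduce_tensor(w)

print(f"tensor payload: {w.numel() * w.element_size() / 1e6:.1f} MB")   # ~33.6 MB
print(f"handle tuple:   {len(pickle.dumps(args))} bytes")               # roughly a kilobyte, size-independent

# In a different process on the same host, rebuild_fn(*args) yields a tensor
# aliasing the same physical GPU memory; that reconstruction is the zero-copy path.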
④ Four update_weights_from_* paths, one verb
The trainer might hold the new weights in any of four physical configurations: same process as the inference engine, on a shared disk, on remote NCCL ranks, or in a sibling CUDA-IPC process. SGLang exposes four implementations behind one verb. The framework above writes engine.update_weights_from_*(...); the topology underneath dictates which method.
The pattern is functor-like: the same operation lifted to different categories of topology. What matters mathematically is that all four methods share flush_cache_after_weight_update. The RadixCache (next subsection) must be invalidated because old prefix entries reference the old policy — keeping them would mean the inference engine is serving generations based on a model that no longer exists. Cache invalidation is the inherent consistency obligation; the four transports differ in how they get the bytes to the inference engine, but they all owe the same invariant downstream.
class SchedulerUpdateWeightsMixin:

    def update_weights_from_disk(self, recv_req):
        success, message = self.tp_worker.update_weights_from_disk(recv_req)
        if success and self.draft_worker is not None:
            success, message = self.draft_worker.update_weights_from_disk(recv_req)
        if success:
            self.flush_cache_after_weight_update(recv_req)
        return UpdateWeightFromDiskReqOutput(success, message, 0)

    def update_weights_from_distributed(self, recv_req):
        success, message = self.tp_worker.update_weights_from_distributed(recv_req)
        if success:
            self.flush_cache_after_weight_update(recv_req)
        return UpdateWeightsFromDistributedReqOutput(success, message)

    def update_weights_from_tensor(self, recv_req):
        worker = self.draft_worker or self.tp_worker
        success, message = worker.update_weights_from_tensor(recv_req)
        if success:
            self.flush_cache_after_weight_update(recv_req)
        torch.distributed.barrier(group=self.tp_cpu_group)
        return UpdateWeightsFromTensorReqOutput(success, message)

    def update_weights_from_ipc(self, recv_req):
        success, message = self.tp_worker.update_weights_from_ipc(recv_req)
        if success and self.draft_worker is not None:
            success, message = self.draft_worker.update_weights_from_ipc(recv_req)
        if success:
            self.flush_cache_after_weight_update(recv_req)
        torch.distributed.barrier(group=self.tp_cpu_group)
        return UpdateWeightsFromIPCReqOutput(success, message)
An underrated systems constraint that this verb hides: NCCL process groups are static. Their participant set is fixed at creation; adding a new inference node mid-training means destroying and recreating the group. RDMA's point-to-point model treats each connection independently — a new peer establishes a fresh Queue Pair without disturbing existing links. This single fact explains why colocated frameworks lean on NCCL (via init_weights_update_group) while disaggregated systems prefer RDMA or shared disk. Elasticity is incompatible with NCCL.
⑤ RadixAttention × GRPO — algebraic composition
This is the example I would lead with if I were trying to convince a mathematician that engineering can be beautiful. GRPO generates N completions per prompt and normalizes rewards within the group as a baseline; the algorithmic motivation is variance reduction. The system-level consequence is that all N completions share the prompt prefix.
RadixAttention exploits this. The shared prefix is a single node in a radix tree; each completion branches off as a child. Reference counting (inc_lock_ref / dec_lock_ref) means a node with lock_ref > 0 cannot be evicted. Use-after-free is structurally unrepresentable.
class RadixCache(BasePrefixCache, KVCacheEventMixin):
    def match_prefix(self, params: MatchPrefixParams) -> MatchResult: ...
    def insert(self, params: InsertParams) -> InsertResult: ...
    def inc_lock_ref(self, node: TreeNode) -> IncLockRefResult: ...
    def dec_lock_ref(self, ...) -> DecLockRefResult: ...
    def evict(self, params: EvictParams) -> EvictResult: ...

    @property
    def evictable_size(self): ...
Neither was designed for the other. GRPO chose group sampling for variance reduction. RadixAttention chose a tree for cache reuse. Their composition multiplies savings — and reference counting means cache lifetime is automatic.
This kind of accidental-but-elegant composition recurs throughout the field. A mathematician will recognize the structure: the algorithm exposes a sharing pattern (here, a common prefix); the data structure exposes a sharing mechanism (here, a radix tree); their composition is the categorical product. The right notation makes the saving obvious — the wrong notation makes the saving impossible to express.
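A back-of-envelope way to see the multiplication, with illustrative numbers rather than measurements: the prompt prefix is prefilled (and its KV cache materialized) once per group instead of once per completion.

def group_prefill_tokens(prompt_tokens, group_size, share_prefix):
    # Tokens that must be prefilled, and whose KV cache must be stored,
    # to serve one GRPO group of completions for a single prompt.
    return prompt_tokens * (1 if share_prefix else group_size)

P, N = 2048, 8                                           # prompt length, GRPO group size
print(group_prefill_tokens(P, N, share_prefix=False))    # 16384 tokens of prefill + KV
print(group_prefill_tokens(P, N, share_prefix=True))     #  2048: an N-fold saving on the prefix;
                                                         #  decode cost is unchanged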
⑥ Async training and the staleness tradeoff
Strict on-policy RL says: wait for every trajectory to finish under policy θ_t before running the gradient step. In practice this is fatal at scale. If 10% of prompts produce long trajectories that need ~20s of decode while the other 90% finish in ~2s, then 90% of GPUs sit idle for ~18s every iteration. Over 1000 iterations: ~5 GPU-hours per GPU wasted.
The pragmatic solution is partial rollout: train on the 128 trajectories that finished, let the rest continue under (now stale) θt. Throughput rises 2–4×; the off-policy ratio rises with it. There are two paths to bounding the resulting bias.
Mathematical correction. Miles applies Truncated Importance Sampling (TIS) and Masked Importance Sampling (MIS): the gradient contribution of a sample generated under θ_{t-k} is reweighted by the importance ratio π_{θ_t}(a|s) / π_{θ_{t-k}}(a|s), with truncation or masking to control variance. The importance ratio restores unbiasedness; the truncation then trades a small, bounded bias back in to keep the variance finite — a classical Radon-Nikodym story.
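A minimal sketch of what that correction looks like per token, assuming the trainer holds both its own recomputed logprobs and the rollout engine's logprobs for the same sampled tokens; the truncation threshold c and the surrogate form are illustrative, not Miles' exact loss.

import torch

def tis_policy_gradient_loss(logp_train, logp_rollout, advantages, loss_mask, c=2.0):
    # Per-token Radon-Nikodym term between the current and the stale policy.
    ratio = torch.exp(logp_train - logp_rollout)
    ratio = torch.clamp(ratio, max=c)                       # truncation: bounded variance, small bias
    # REINFORCE-style surrogate: the (detached) truncated ratio reweights each token's gradient.
    per_token = -(ratio.detach() * advantages * logp_train)
    return (per_token * loss_mask).sum() / loss_mask.sum().clamp(min=1)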
Operational correction. Kimi K1.5 bounds staleness instead: context-length checkpointing, dynamic batch reordering, prefix-maximizing scheduling. The off-policy ratio never grows large enough to require explicit correction, but the bound is enforced operationally rather than mathematically.
The architectural extreme is AReaL's fully-async design: NCCL broadcasts are launched with async_op=True, bucketed by memory budget. The trainer never blocks on weight sync. The off-policy bias is whatever falls out of how long inference takes to catch up — call it "implicit staleness." AReaL bets that with the right bucketing, the implicit staleness stays small enough to skip explicit corrections. They claim 2.77× speedup over synchronous baselines.
The layer beneath: CUDA, Triton, and TileLang
So far the survey has stayed at the framework level. But the throughput of all the primitives above depends on the speed of the underlying kernels — attention, GEMM, normalization, softmax. A 2× speedup on an attention kernel is a 2× speedup on rollout, which is 2× on the whole RL loop. Below PyTorch, there is a hierarchy of languages for writing GPU kernels. Theorists tend to underestimate how much of the field's progress lives at this layer.
CUDA Python — the bottom of the stack
NVIDIA/cuda-python is the metapackage that exposes CUDA from Python. It has multiple subpackages: cuda.bindings (low-level bindings to the CUDA driver, runtime, NVRTC, NVVM), cuda.core (Pythonic access to CUDA Runtime and JIT compilation), numba.cuda (a SIMT DSL that compiles a restricted subset of Python to CUDA kernels), and newer DSLs cuda.tile (NumPy-like syntax for the CUDA Tile programming model) and cuda.coop (block-wide and warp-wide primitives). If you need to write a kernel from scratch in Python with full control, this is the layer.
The reason this exists, beyond pedagogy, is that PyTorch's abstractions occasionally aren't enough. When you need NVSHMEM for one-sided RDMA, or NVML for fine-grained device queries, or a custom reduction over a non-standard layout, you reach for cuda.bindings. For RL infra specifically, NCCL group management and CUDA IPC (Section ③) live here.
Triton — the Python DSL that became universal
Triton is the Python DSL for writing GPU kernels that became the de facto standard the moment FlashAttention shipped in it. The pitch from the original MAPL 2019 paper is simple: write code that's higher-productivity than CUDA but more flexible than fixed-shape DSLs, with autotuning across block sizes. The compiler handles memory coalescing, shared-memory allocation, and tensor-core dispatch.
python/tutorials/01-vector-add.py

@triton.jit
def add_kernel(x_ptr,  # *Pointer* to first input vector.
               y_ptr,  # *Pointer* to second input vector.
               output_ptr,  # *Pointer* to output vector.
               n_elements,
               BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)
What's worth noticing: the indexing is at the block level, not the thread level. tl.arange(0, BLOCK_SIZE) denotes a whole block of indices; mask handles out-of-bounds without manual loops. This is the "tiled" model — write block-wise math, the compiler vectorizes within the block. vLLM's PagedAttention, SGLang's attention kernels, the FlashAttention family, and most of the actually-fast layers in production frameworks are Triton kernels.
TileLang — the new DSL with theorem-prover integration
TileLang (open-sourced Jan 2025) is a newer Python DSL on top of TVM. Its pitch is similar to Triton's — productivity above CUDA, flexibility above fixed DSLs — but the design choices differ. TileLang exposes more explicit memory tier control (shared, fragment, register) and integrates the Z3 theorem prover into TVM's arith analyzer for SMT-based symbolic reasoning and automatic correctness verification. Backends include NVIDIA via CUTLASS CuTe DSL, AMD MI300X (with async copy), Apple Metal, Huawei AscendC, and WebGPU.
examples/deepseek_mla/example_mla_decode.py

@tilelang.jit(out_idx=[4], pass_configs={...})
def flashattn(batch, heads, kv_head_num, seqlen_kv, dim, pe_dim, ...):
    @T.prim_func
    def main_split(Q, Q_pe, KV, K_pe, Output):
        with T.Kernel(batch, heads // min(block_H, kv_group_num), num_split, threads=256) as (bid, hid, bz):
            Q_shared = T.alloc_shared([block_H, dim], dtype)
            S_shared = T.alloc_shared([block_H, block_N], dtype)
            KV_shared = T.alloc_shared([block_N, dim], dtype)
            acc_s = T.alloc_fragment([block_H, block_N], accum_dtype)
            acc_o = T.alloc_fragment([block_H, dim], accum_dtype)
            scores_max = T.alloc_fragment([block_H], accum_dtype)
            logsum = T.alloc_fragment([block_H], accum_dtype)
            # ... ~50 lines of tiled FlashAttention math ...
The TileLang authors claim performance parity with hand-written FlashMLA on H100 in 80 lines of Python. The Z3 integration is the unusual choice — it lets the compiler prove arithmetic invariants symbolically (e.g., that an index expression stays in bounds for all parameter values, that a tiling is a valid partition). For a mathematician, this is the most interesting compiler design move in the GPU DSL space.
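To make the Z3 point concrete, here is the flavor of obligation such an integration can discharge, written directly against the z3-solver Python package rather than TileLang's internal arith analyzer: the block-tiled index expression from the Triton example above never escapes the buffer, for all parameter values.

from z3 import Ints, And, Implies, Not, Solver, unsat

pid, i, B, n = Ints("pid i B n")
offset = pid * B + i                       # the block-tiled index expression

claim = Implies(
    And(B > 0, n > 0,
        pid >= 0,                          # program id is non-negative
        i >= 0, i < B,                     # lane index within the block
        offset < n),                       # the kernel's own mask
    And(offset >= 0, offset < n),          # then the access is in bounds
)

s = Solver()
s.add(Not(claim))                          # search for a counterexample
assert s.check() == unsat                  # none exists: the access is safe for all pid, i, B, n
print("bounds invariant verified")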
Which one to learn?
If you only learn one, learn Triton — it's where the field's research code lives. If you care about correctness verification or AMD/Apple/Huawei portability, look at TileLang. Reach for raw cuda.bindings only when you need a CUDA capability the DSLs don't expose (NVSHMEM, certain NCCL primitives, CUDA IPC manipulation). The framework-level RL code (Section ③) lives entirely in pure PyTorch + cuda-python bindings; the inference engines underneath (SGLang, vLLM) are where Triton kernels do the heavy lifting.
- triton-lang.org — tutorials (vector add, fused softmax, matmul, FlashAttention)
- srush/Triton-Puzzles — Sasha Rush's puzzles, runnable without a GPU
- tile-ai/tilelang-puzzles — 10 progressively harder puzzles for TileLang
- tilelang/examples/deepseek_mla — MLA decode reference implementation
- cuda.core docs · cuda.tile docs
The training backbone: Megatron-LM
Megatron-LM is the training engine sitting under almost every RL framework in this survey. The repository contains two parts: Megatron Core (a composable library of GPU-optimized building blocks — kernels, parallelism strategies, mixed-precision support) and Megatron-LM proper (reference training scripts using the core). Performance numbers from NVIDIA's benchmarks: 462B-parameter models trained on 6144 H100 GPUs, reaching 47% Model FLOP Utilization. For comparison, naïve PyTorch DDP on a 70B model typically achieves 30-35% MFU.
Five-dimensional parallelism
The reason Megatron exists, and the reason every serious LLM training run uses it, is its parallelism. Modern Megatron supports five orthogonal dimensions:
- TP (Tensor Parallel) — split each layer's weight matrices across GPUs. Within-machine, NVLink-bound. Default for the highest-bandwidth domain.
- PP (Pipeline Parallel) — split layers across GPUs. Cross-machine, InfiniBand-bound. Sparse communication, tolerates slower links.
- DP (Data Parallel) — replicate the model, split the batch. Tolerant of slow communication. Outer loop of the parallelism nest.
- EP (Expert Parallel) — split MoE experts across GPUs. Necessary above ~30B-parameter MoE models. Required for DeepSeek-V3, Qwen3-MoE, GPT-OSS.
- CP (Context Parallel) — split the sequence dimension across GPUs. Needed for long-context training (32K+ tokens) and for video models (100K+ tokens).
The product space is enormous: TP × PP × DP × EP × CP. Choosing the right point in this grid for a given model and cluster is a small art. Megatron's defaults are good, but the framework above (verl, slime, etc.) typically overrides them for the RL-specific case where rollout and training have conflicting preferences.
The communication intensity decreases as you move right through TP → CP → EP → DP → PP. Place the densest communication on the fastest hardware (NVLink within a node) and the sparsest on the slowest (cross-rack InfiniBand). This is the structuring principle of every distributed training topology.
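A toy sanity check of the grid constraint, not Megatron's actual argument validation: the product of the dimensions has to tile the cluster, and the placement heuristic above says the dense dimensions should stay inside one NVLink domain. Treating EP as a factor carved out of the DP group is one common convention, assumed here.

def check_parallel_layout(world_size, tp, pp, cp, dp, ep, gpus_per_node=8):
    assert tp * pp * cp * dp == world_size, "TP * PP * CP * DP must tile the cluster"
    assert dp % ep == 0, "expert parallelism carved out of the data-parallel group (assumed convention)"
    # Densest traffic on the fastest fabric: keep TP (and CP) inside one NVLink node.
    assert tp * cp <= gpus_per_node, "TP x CP should not spill across the NVLink domain"
    return {"tp": tp, "pp": pp, "cp": cp, "dp": dp, "ep": ep}

layout = check_parallel_layout(world_size=128, tp=4, pp=4, cp=2, dp=4, ep=4)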
Megatron's own RL module
Less well-known: Megatron-LM ships its own RL training module at megatron/rl/. It includes agent/, inference/, server/, rl_utils.py, and sequence_packing_utils.py — an integrated RL framework, sitting alongside the verl / slime / Miles ecosystem. This is partly redundant with those frameworks but useful when you want first-party NVIDIA support and tight Megatron-Core integration. The post_training/ directory handles quantization, distillation, and pruning, which the next section will cover.
What's mathematically interesting
The communication overlapping flags (--overlap-grad-reduce, --overlap-param-gather, --tp-comm-overlap) are an exercise in hiding latency behind compute — a kind of staggered evaluation. Pipeline parallelism's "1F1B" schedule (one forward, one backward, interleaved) recovers most of the bubble in a fully-synchronous pipeline; the math of why this works is straightforward but writing it down clarifies why Megatron's MFU stays high even with PP.
The dynamic context parallelism shipped in January 2026 (1.48× speedup for variable-length sequences) is a nice example of treating CP size as a runtime variable rather than a static config. The "right" CP size depends on the sequence length of the current batch — adapting it dynamically is essentially online load balancing.
- NVIDIA/Megatron-LM — repo root
- Megatron Core docs — parallelism guide, mixed precision, quickstart
- megatron/rl/ — Megatron's own RL module
- megatron/post_training/ — quantization and distillation paths
- Megatron-LM paper (arXiv 1909.08053) — the original tensor-parallel architecture
- Megatron-Bridge — bidirectional HuggingFace ↔ Megatron checkpoint conversion
Quantization and the numerical-alignment problem
If you're reading the Miles paper or the DeepSeek-V3 technical report and wondering why so much engineering effort goes into making FP8 training work correctly, this section is for you. Quantization is where the abstract algorithm meets the concrete number system, and getting it wrong silently corrupts the policy.
The numerical formats
Three families are worth knowing:
- FP8 (E4M3 and E5M2) — IEEE-754-style 8-bit floats with two variants: E4M3 has 4 exponent bits and 3 mantissa bits (used for forward activations and weights); E5M2 has 5/2 (used for gradients, which need the wider dynamic range). H100-and-newer hardware has native FP8 tensor cores. The two variants exist because forward and backward have different statistical profiles; see the sketch after this list.
- INT4 / INT8 — fixed-point integers. INT4 weight-only quantization (W4A16) is the workhorse for inference of large models (Llama-70B fits on a single H100 at INT4). Quantization-aware training in INT4 is hard; post-training quantization with GPTQ / AWQ is more common.
- MXFP4 / MXFP8 / NVFP4 — block-scaled formats. Each block of 16 or 32 values shares a single scale; the per-value representation is small (4 or 8 bits) but the dynamic range is large. NVFP4 is NVIDIA Blackwell-specific (the B200 and beyond); MXFP4 is the OCP standard.
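To make the E4M3/E5M2 trade-off concrete: PyTorch (2.1 or later) exposes both FP8 dtypes, so the range-versus-precision trade reads straight off torch.finfo. A small illustrative check:

import torch

for name, dtype in [("E4M3", torch.float8_e4m3fn), ("E5M2", torch.float8_e5m2)]:
    fi = torch.finfo(dtype)
    print(f"{name}: max={fi.max:>8.0f}  smallest normal={fi.tiny:.2e}  eps={fi.eps}")

# E4M3: max=448,   eps=0.125  -> finer steps, narrow range (activations, weights)
# E5M2: max=57344, eps=0.25   -> coarser steps, wide range  (gradients)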
QAT vs PTQ
Two strategies for converting a model to a lower-precision format:
- Quantization-Aware Training (QAT) — simulate the quantization during training, so the optimizer learns weights that are robust to the lower precision. Expensive but produces the highest-quality models. Miles' INT4-QAT pipeline is the most aggressive example in this survey.
- Post-Training Quantization (PTQ) — train in full precision, quantize at the end. Algorithms include GPTQ (one-shot, second-order weight rounding), AWQ (activation-aware weight quantization), GGUF (llama.cpp's format), and ModelOpt (NVIDIA's toolkit). Cheap but lossy.
For RL specifically, the question is whether the inference engine can be in a lower precision than the trainer without breaking learning. The answer turns out to be "yes, but only with care."
The MoE routing divergence problem
This is the deep reason Miles exists. In a Mixture-of-Experts model, each token is routed to k experts based on the output of a small gating network. The gating decision is a top-k over the experts' affinity scores. Under floating-point arithmetic, the affinity scores are computed in a specific precision. If the inference engine computes them in FP8 and the trainer in BF16, two scores that are equal in BF16 can be unequal in FP8 (or vice versa), and the top-k decision can flip. The token routes to a different expert at inference than it did at training time. The gradient signal becomes random noise with respect to which expert actually generated the token.
This is the "BERT-era unsolved bug" that Chenyang's tutorial flags — the existence of a numerical-precision-induced routing divergence has been known for years. The brute-force fix is to make inference and training share the same precision and the same kernels for the routing computation. That's what Miles' Unified FP8 Pipeline does. The cleverer fix is to replay the routing decision: record which experts inference picked, force training to use the same picks. This is R3 — Rollout Routing Replay, Miles' signature contribution. It guarantees the routing is identical regardless of precision.
The mathematical invariant: routing(x, θ, fp8) ≡ routing(x, θ, bf16) by replay, not by approximation. Without it, every gradient update in an MoE model is partially noise.
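A schematic sketch of the difference between re-deriving the routing and replaying it, in plain PyTorch rather than Miles' or SGLang's kernels; routed_experts stands for the per-token top-k indices the rollout engine recorded, and the gate and expert modules are placeholders.

import torch

def moe_layer(hidden, gate, experts, routed_experts=None, top_k=2):
    scores = gate(hidden)                                  # [tokens, num_experts] affinity scores
    if routed_experts is None:
        topk_idx = scores.topk(top_k, dim=-1).indices      # re-derive: precision-sensitive
    else:
        topk_idx = routed_experts                          # R3: replay the rollout's decision
    weights = torch.softmax(torch.gather(scores, -1, topk_idx), dim=-1)  # grads still flow via scores
    out = torch.zeros_like(hidden)
    for slot in range(top_k):
        for e, expert in enumerate(experts):
            mask = topk_idx[:, slot] == e
            if mask.any():
                out[mask] += weights[mask, slot, None] * expert(hidden[mask])
    return out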
For non-MoE models the problem is milder but still present: numerical drift in logprobs accumulates batch-to-batch. The safe practice (every framework in this survey follows it) is to never use the inference engine's logprobs for loss computation. Always recompute logprobs with the training engine, even at the cost of an extra forward pass. The inference engine generates the tokens; the trainer recomputes their probabilities.
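What "recompute with the training engine" means operationally, sketched for an HF-style causal LM interface (model(input_ids).logits); the production frameworks do the same thing through Megatron's forward pass instead.

import torch
import torch.nn.functional as F

def recompute_logprobs(model, input_ids):
    # input_ids: [batch, seq_len], prompt + tokens exactly as sampled by the inference engine.
    logits = model(input_ids).logits                        # [batch, seq_len, vocab]
    logprobs = F.log_softmax(logits[:, :-1].float(), dim=-1)
    targets = input_ids[:, 1:].unsqueeze(-1)                # next-token alignment
    return torch.gather(logprobs, -1, targets).squeeze(-1)  # [batch, seq_len - 1]; these feed the loss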
- GPTQ paper · AWQ paper — the standard PTQ algorithms
- NVIDIA ModelOpt — production toolkit for FP8/INT4/MXFP4 conversion
- OCP MX format spec — the standard behind MXFP4 / MXFP8
- Megatron's post_training/ — checkpointing and conversion paths
- Miles docs — R3 + unified FP8 pipeline writeups (the canonical RL-quantization references)
Multi-turn agentic RL — unifying VLM and LLM from first principles
Up to this point the survey has implicitly assumed a single-turn setting: one prompt, one completion, one reward. The 2026 frontier is multi-turn. A model is no longer a chatbot but a thinking machine embedded in an environment loop — it emits an action, the environment responds with an observation (possibly multimodal), the model reads the observation and emits the next action, and the trajectory grows. Computer Use agents, embodied robotics, and tool-augmented reasoning all live in this regime.
The mathematician's instinct here is right: a multi-turn setting is just a Markov decision process with episodes. The engineering question is then narrow — how do you implement the trajectory generation cleanly enough that VLM and LLM share one code path? The slime + Miles answer is what I'd call the first-principles answer: any multi-turn training is just custom sampling and interaction logic. Decouple the rollout function from the environment; let the user supply both.
The turn loop
Each turn of the loop has four distinct phases: (a) the actor generates a response under the current context and sampling parameters; (b) the environment steps on the response and returns an observation; (c) the observation is encoded into a fresh delta of tokens and appended to the context with loss_mask = 0 — this is what tells the trainer "don't compute loss against the environment's words"; (d) any new multimodal payload is appended to two parallel buffers, one for inference, one for training. Termination is whichever fires first: max_turns, a token budget, or env.step() returning done=True.
# Pseudocode: custom multi-turn rollout.generate
async def generate(args, sample, sampling_params):
    env = load_env_module(args.rollout_interaction_env_path).build_env(sample=sample, args=args)
    max_turns = args.max_turns
    sample.tokens, image_data, mm_train_buffer = init_from_prompt(sample, state)
    for _ in range(max_turns):
        # (a) Actor generation — assistant tokens
        response_text, new_tokens, new_logprobs, finish_reason = sglang_generate(
            url=url, input_ids=sample.tokens,
            sampling_params=sampling_params, image_data=image_data
        )
        append(sample, new_tokens, new_logprobs, loss_mask_val=1)
        # (b) Env step
        observation, done, _ = env.step(response_text)
        if done: break
        # (c) Process & append observation tokens
        user_msg = env.format_observation(observation)
        obs_ids, obs_image_data, obs_mm_inputs, obs_mm_train = encode_observation_delta(
            user_msg, tokenizer=state.tokenizer, processor=state.processor,
            tools=sample.metadata.get("tools")
        )
        append(sample, obs_ids, [0.0] * len(obs_ids), loss_mask_val=0)
        # (d) Multimodal state update — TWO parallel buffers
        image_data += obs_image_data  # inference-side
        if obs_mm_train:
            mm_train_buffer.append(obs_mm_train)  # training-side
    return sample
The BaseInteractionEnv interface is intentionally minimal: reset(), step(response_text) → (observation, done, info), and format_observation(observation) → message. No assumptions about action grammars, no coupling to dataset format. "How the environment parses an action, executes a tool, returns an observation" is entirely the user's call. This is the decoupling the field needed to support Computer Use, embodied robotics, and tool-augmented reasoning under one framework.
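A toy environment against that surface, to make the decoupling concrete; the class below is illustrative (a calculator "tool"), and only the three method names and the step signature come from the text.

class CalculatorEnv:
    """Toy BaseInteractionEnv-shaped environment: the model calls a calculator tool."""

    def reset(self):
        self.turns = 0

    def step(self, response_text):
        self.turns += 1
        try:
            expr = response_text.split("<calc>")[1].split("</calc>")[0]   # parse the action
            observation = str(eval(expr, {"__builtins__": {}}))           # execute the "tool"
        except Exception:
            observation = "error: no <calc>...</calc> block found"
        done = "final answer" in response_text.lower() or self.turns >= 8
        return observation, done, {}

    def format_observation(self, observation):
        return {"role": "user", "content": f"Tool result: {observation}"}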
Two engineering tricks worth knowing
Two implementation details from the slime team's writeup deserve highlighting because they capture exactly the kind of "look beneath the API" reasoning the engineering rewards.
Dummy messages + delta tokens — bounded context growth
The naive way to encode an observation back into the context is tokenizer.apply_chat_template([obs_message], tools=...). The problem: chat templates auto-prepend a system prompt and tool-use instructions every time. If you do this each turn, the system prompt is duplicated into the context T times across T turns — a linearly growing waste of the token budget, a quadratically growing waste of the compute that re-processes it, and (even though these tokens are loss-masked) their presence shifts the actor's behavior distribution.
The trick: encode twice, take only the difference. Encode a fixed DUMMY_MESSAGES base alone to get the preamble token count; encode DUMMY_MESSAGES + [obs_message] together; slice off the preamble length. What you append is the clean observation delta only — system prompt and tool preamble appear exactly once across the whole trajectory.
dummy = apply_chat_template(DUMMY_MESSAGES, tools=tools, add_generation_prompt=False)
full = apply_chat_template(DUMMY_MESSAGES + [obs_msg], tools=tools, add_generation_prompt=True)
trim = len(encode(dummy))
obs_ids = encode(full)[trim:] # delta tokens only
Mathematically this is just set difference on token sequences; what's interesting is that the chat template API doesn't expose a primitive for "encode just this message under this preamble," so the user implements the difference operation manually. The trick is widely needed but rarely surfaced.
Multimodal tensor merge — O(n²) → O(n)
Each turn that adds an observation also produces a dict of tensors for the training side (vision features, audio features, whatever the VLM processor emits). The trainer wants one consolidated tensor per key over the whole trajectory. Naïvely concatenating each turn with torch.cat is O(n²): each call allocates a new output buffer and copies all existing data plus the new turn's increment.
The clean answer: buffer-then-merge. Append each turn's tensor dict to a Python list (O(1) per turn); at trajectory finalization, traverse the list once and call torch.cat exactly once per key. Total work drops from O(n²) to O(n), and you avoid the peak-memory transient where both the old and new concatenated tensors are simultaneously resident.
This too is a kind of math-as-engineering: the same input-output behavior, two different complexity profiles, distinguished only by where the allocation boundary sits. For a 32-turn rollout with 100K-token VLM context, the asymptotic difference is the difference between training and OOM.
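The two strategies side by side, as a minimal sketch; mm_turns stands for the list of per-turn tensor dicts described above.

import torch

def merge_naive(mm_turns):
    merged = {}
    for turn in mm_turns:                        # O(n^2): every cat re-copies all earlier turns
        for key, t in turn.items():
            merged[key] = t if key not in merged else torch.cat([merged[key], t])
    return merged

def merge_buffered(mm_turns):
    buffers = {}
    for turn in mm_turns:                        # O(1) per turn: only collect references
        for key, t in turn.items():
            buffers.setdefault(key, []).append(t)
    return {key: torch.cat(ts) for key, ts in buffers.items()}   # exactly one cat per key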
Engineering case study — Miles' DeepSeek-V3 RL pipeline
Miles is the most concrete case study of how the primitives above compose into a production pipeline. Built on slime, powered by Megatron-LM + SGLang, orchestrated by Ray. The repo's scripts/run_deepseek.py is a single Typer command that takes a DeepSeek-V3 model from HuggingFace and runs full GRPO training on AIME-2024 or GSM8K. Reading it teaches you what the actual engineering glue looks like.
The five-stage pipeline
What makes run_deepseek.py instructive is that each stage checks whether its work is already done before re-running. The pipeline is resumable on every reboot — a property the field undervalues until it's debugging at 2am with a corrupted checkpoint.
- ① Download. hf download deepseek-ai/DeepSeek-V3 for the model, hf_download_dataset for the training set (DAPO-math-17k + AIME-2024, or GSM8K). Skip if already present.
- ② FP8 → BF16 cast. DeepSeek-V3 ships in FP8 on HuggingFace; training needs BF16 master weights (a QAT/master-weight prerequisite from the quantization section). tools/fp8_cast_bf16.py handles it. Skip if model.safetensors.index.json exists.
- ③ HF → Megatron distributed format. Run torchrun convert_hf_to_torch_dist.py with the right PP/EP/TP sizes for the model. On multi-node this becomes exec_command_all_ray_node(...) — Ray fans the conversion command out across all nodes with {{master_addr}}/{{node_rank}} substitution. Skip if latest_checkpointed_iteration.txt reads "release".
- ④ Rsync to local node storage. Cross-node shared FS is too slow for the hot path. Every Ray node rsyncs the converted checkpoint to its local NVMe in parallel.
- ⑤ Ray job submit. Builds the giant train_args string (rollout, optimizer, GRPO, wandb, perf, eval, SGLang, misc args), kills any stale processes, starts ray start --head, then ray job submit -- python3 train.py with the full runtime environment JSON.
The training main loop
Once Ray has the job, train.py is a tight async loop. Read it as the runtime of all the primitives this survey covered:
async def train(args):
    pgs = create_placement_groups(args)  # TP/PP/EP-aware GPU groups
    init_tracking(args)
    rollout_manager = create_rollout_manager(args, pgs["rollout"])  # SGLang
    actor_model, critic_model = await create_training_models(args, pgs, ...)  # Megatron
    await actor_model.update_weights()  # initial sync → SGLang

    for rollout_id in range(args.start_rollout_id, args.num_rollout):
        rollout_data_ref = await rollout_manager.generate.remote(rollout_id)  # Phase A-B
        await actor_model.train(rollout_id, rollout_data_ref)  # Phase D
        if rollout_id % args.save_interval == 0:
            await actor_model.save_model(...)
        await actor_model.update_weights()  # Phase E
        if rollout_id % args.eval_interval == 0:
            await rollout_manager.eval.remote(rollout_id)
This is the call graph from Section "Reading real code," made concrete. Two Ray actor groups (training, rollout) bound to placement groups; an outer async loop alternating generate, train, update_weights. Every primitive — the hybrid engine, the memory choreography, the zero-copy weight sync, the four update_weights_from_* paths, the RadixAttention prefix cache — is invoked along this loop without the user-facing code ever spelling them out explicitly.
What's mathematically interesting is the layering. The top-level loop is sequential and easy to reason about as a fixed-point iteration on the policy parameters. The middle layer (Ray, placement groups) is concurrent but explicitly scoped. The bottom layer (CUDA kernels, NCCL collectives, ZMQ messages) is asynchronous but bounded by clean contracts. Each layer adds a strictly smaller amount of nondeterminism than the one beneath it — a kind of algebraic abstraction ladder that lets a theorist reason about convergence without unfolding the systems mess.
Miles is not a trainer — it is a runtime phase machine
If you read the source carefully, the most useful reframe is this: the top-level loop is not algorithm-driven, it is phase-driven. There is no PPO/GRPO logic in train.py at all. There is only a sequence of phases — rollout, training, sync, eval — and the code's job is to hand the GPU between them cleanly. The actual math (advantages, KL, importance weights) lives inside the actor, far below the top-level loop. From a theorist's angle this matters because it tells you where to look for bugs: a wrong policy gradient is a bug in the actor; a wrong rollout-train mismatch is a bug in the phase machine.
The phases are precise:
- Rollout phase: SGLang owns weights, KV cache, CUDA graphs. The trainer's optimizer is offloaded to CPU.
- Training phase: The trainer owns weights, gradients, activations, optimizer state. SGLang's KV cache and CUDA graphs are released.
- Sync phase: The trainer's new weights are pushed into SGLang. Generation is paused; the cache is flushed; the weight version is bumped.
- Eval / save phase: Periodic side effects that don't affect the policy gradient loop.
Two object models run in parallel. The mathematical model is θ_t → sample τ ~ π_θ → compute reward and advantage → θ_{t+1} → install into rollout. The engineering model is GPU ownership, KV cache ownership, weight version, Ray actor liveness, NCCL group state, offload/onload state. Miles' value is keeping these two models aligned — when the math says "the policy just updated," the engineering says "the inference engine now serves the new weights, and any prior in-flight requests have been retracted." If those two statements ever drift apart, the gradient becomes noise and the loss curve still looks fine.
Six engineering invariants Miles maintains across the loop
These are the invariants you can verify in the source code. They are also the right checklist if you are reading any other framework in this survey and want to know what to look for. Each one is one line of plain English, then one line of why it matters.
- Logical rank order maps stably to physical GPU order. Ray bundles, Megatron ranks, and SGLang engine ranks must agree on which physical GPU is "rank 5." If they disagree, you get silent NCCL hangs or weights written into the wrong worker — symptoms that are hours to diagnose. Miles handles this in placement_group.py by reading back actual node IP and GPU IDs from each Ray bundle and sorting deterministically.
- GPU ownership switches explicitly, never by accident. Colocated mode never lets training and rollout both touch GPU memory at the same time. The transitions go through rollout_manager.offload_*, actor_model.onload, and SGLang's release_memory_occupation / resume_memory_occupation. The invariant — at any moment, exactly one engine has the live GPU memory — is what makes colocation safe.
- Weight version is monotonically increasing. Every sync bumps weight_version and propagates it to every SGLang engine. The trainer can later assert that the engine's reported version equals what it just sent. If they ever differ, a weight update silently failed somewhere and you want to know now, not after 10000 training steps.
- Weight updates pause generation and flush the cache first. Before a sync, Miles calls pause_generation on every engine and then flush_cache. The point is to make sure no in-flight request continues decoding with old prefix-cache KVs that belong to the old policy. This is the rule SGLang's own scheduler also enforces; Miles just makes it visible at the top level.
- Loss-relevant logprobs are recomputed on the trainer, not trusted from inference. Miles keeps the rollout logprobs (it needs them for TIS and mismatch checks) but the actual policy-gradient term uses logprobs computed by the training engine on the same tokens. This is the BERT-era numerical-drift discipline made concrete: the inference engine generates; the trainer scores.
- MoE routing is replayed, not re-derived. When R3 is on, SGLang returns the expert choices it made during generation; the trainer reuses those choices during the forward pass, regardless of what FP8 numerics would have decided locally. The mathematical statement is that the gradient is taken with respect to the routed graph the rollout used — which is the only way the gradient is a meaningful signal about that rollout's actions.
Two more invariants Miles enforces when partial rollout is on: tokens generated under an older weight version are loss-masked to zero by default (so only fresh tokens contribute to the gradient), and off-policy ratio is monitored as a first-class metric via TIS, ESS, and a "rollout-train mismatch" probe (so staleness is something you see, not something you discover via mysterious eval degradation three days in).
How a sample flows from rollout to gradient
The full data path is worth tracing once, because it shows where the abstractions are doing real work and where they're just plumbing. A prompt enters the system, gets sampled N=8 times by SGLang (this is GRPO's group), each completion is scored, the survivors become a training batch. Concretely:
- The RolloutManager takes a prompt and submits N parallel generation requests to SGLang. Each request returns tokens, logprobs, finish_reason — and if R3 is on, also routed_experts (a tensor of shape [response_len, num_layers, top_k] recording which experts processed each token at each MoE layer).
- The reward function scores each completion. For GRPO, the rewards within the N-completion group are then group-normalized: subtract the group mean (and optionally divide by group std). The result is each sample's advantage relative to its sibling samples from the same prompt.
- Optionally, a dynamic sampling filter drops groups whose rewards are all equal (DAPO-style "if every sample got reward 1.0 or every sample got 0.0, the gradient is zero — don't bother training on it"). Miles uses check_reward_nonzero_std as the default filter (see the sketch below).
- Surviving samples are packed into a training batch. Every sample carries a loss mask — assistant tokens are 1, observation/system tokens are 0. If partial rollout is on, old-version tokens are also 0. The loss mask is a first-class field; the trainer does not "infer" which tokens to score.
- The batch is split across data-parallel ranks. If balance_data is on, it splits to equalize tokens per rank, not samples per rank — a small detail that matters a lot when response lengths have long tails.
- Each trainer rank recomputes logprobs on its slice and computes the policy-gradient loss. The recomputed logprobs are the ones that flow into the loss; the rollout logprobs only show up in TIS and in mismatch monitoring.
The reason to walk through this is that every field in the training batch has a job. tokens and response_lengths are the raw text. rewards and advantages drive the loss. loss_masks control which positions count. rollout_log_probs enable TIS. rollout_routed_experts enable R3. weight_versions let the trainer detect stale samples. teacher_log_probs are reserved for OPD. The batch is a small algebra of fields, each one corresponding to a distinct correctness concern.
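A minimal sketch of the group normalization and the all-equal-reward filter from items ② and ③ of the walkthrough above. check_reward_nonzero_std is Miles' name for the filter; the one-line body here is an assumption about its behavior, not the actual implementation.

import torch

def grpo_advantages(rewards, eps=1e-6):          # rewards: [group_size] for one prompt
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def check_reward_nonzero_std(rewards):           # assumed behavior of Miles' default filter
    return rewards.std() > 0                     # drop groups where every sample scored the same

groups = [torch.tensor([1.0, 0.0, 1.0, 0.0]), torch.tensor([1.0, 1.0, 1.0, 1.0])]
batches = [grpo_advantages(r) for r in groups if check_reward_nonzero_std(r)]   # second group dropped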
Two paths for weight sync — and why the choice matters
Miles picks between two weight-sync paths at config time, and the choice has both performance and reliability implications.
Colocated path (UpdateWeightFromTensor): when the trainer and the SGLang engine share the same physical GPUs, the trainer assembles the new weights into one flattened bucket per layer, serializes the tensor descriptors (pointer + stride + offset + CUDA IPC handle, ~1KB total) using MultiprocessingSerializer, gathers them to the lead rank via Gloo, and hands them to SGLang via Ray IPC. SGLang reconstructs Python tensor objects that point to the same physical GPU memory the trainer just wrote. Zero actual tensor data crosses any wire. The reason this works is that two processes on the same host can share CUDA memory through IPC handles — the cost is sub-millisecond per update, regardless of model size.
Distributed path (UpdateWeightFromDistributed): when the trainer and rollout live on different physical GPUs (or across nodes), Miles creates a NCCL group whose participants are {trainer rank 0} ∪ {all rollout engine ranks}. The trainer sends metadata (parameter names, shapes, dtypes) over Ray, and the actual tensor data over NCCL broadcast from rank 0. This is the classical "split the control plane and the data plane" pattern: small metadata flows through a flexible RPC layer; large tensors flow through dedicated high-bandwidth collectives. Miles also serializes the broadcasts behind a Ray lock — concurrent broadcasts can deadlock NCCL, and the lock is cheap insurance.
Both paths share a third invariant: flush the cache after the update. Old prefix-cache entries reference the old policy; serving them under the new policy is a silent correctness bug. SGLang's mixin handles this with flush_cache_after_weight_update at the end of every transport.
Staleness corrections — what Miles measures during training
Even on a colocated setup, the rollout policy and the trainer policy are not the same. The trainer is one step ahead — it generated the sample under θ_t but is computing the loss against θ_{t+1} after the parameter update. For PPO this is fine; for fully-async RL it can drift. Miles tracks staleness with three explicit quantities:
- TIS (truncated importance sampling). Compute ratio = exp(log_prob_train - log_prob_rollout) per token; clip it into [lo, hi]; multiply into the policy-gradient term. This re-weights stale samples so the gradient is unbiased again, modulo the variance the truncation adds.
- ESS (effective sample size). ESS = (Σ w)² / Σ w² over the importance weights. A small ESS means most of your samples are effectively being ignored — useful as a single number to monitor over training (sketched after this list).
- Rollout-train mismatch metric. The mean and max of |log_prob_train - log_prob_rollout| per token. If this drifts upward, the rollout has gotten too stale; you want to know before the loss diverges.
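The sketch referenced above: all three quantities computed from per-token logprobs of the same sampled tokens under the trainer and the rollout engine. Shapes and clipping bounds are illustrative.

import torch

def staleness_metrics(logp_train, logp_rollout, loss_mask, lo=0.5, hi=2.0):
    m = loss_mask.bool()
    delta = (logp_train - logp_rollout)[m]              # per-token logprob mismatch
    w = torch.exp(delta)                                 # importance weights
    ess = w.sum() ** 2 / (w * w).sum().clamp(min=1e-12)
    return {
        "tis_weight_mean": torch.clamp(w, lo, hi).mean().item(),
        "ess_fraction": (ess / m.sum().clamp(min=1)).item(),   # 1.0 means effectively on-policy
        "mismatch_mean": delta.abs().mean().item(),
        "mismatch_max": delta.abs().max().item(),
    }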
The principle is the same one that runs through this whole survey: the system tells you it is lying about being on-policy, and quantifies how much. That is the right design for a system whose inputs are partially stale by construction.
What Miles actually solves — seven brittle problems made systematic
The most useful summary of Miles is not "it implements GRPO." It is that Miles takes seven specific failure modes that wreck large-scale MoE RL training and turns each one into an engineering invariant with a corresponding code path:
- Train and rollout sharing GPUs without OOM → colocate + per-pool offload + SGLang memory-saver.
- Pushing new weights into rollout fast → colocated tensor bucket + distributed NCCL broadcast + P2P paths.
- GRPO rollouts not duplicating work → group sampling + dynamic filter + SGLang prefix cache.
- Long-tail rollouts not stalling training → partial rollout + buffer recycling + loss masking of old tokens.
- Async / off-policy not silently corrupting the gradient → rollout logprobs + TIS + ESS + mismatch metric.
- MoE under low precision not destabilizing → R3 routed-experts replay + unified precision pipeline.
- Multi-day jobs being recoverable → fault tolerance + weight-version checks + restartable rollout engines.
Each one is small in code volume but large in production consequence. The seven together are why Miles is a useful case study and not just another RL framework: it is one of the cleanest places in the open-source RL ecosystem to see what "production-grade" actually means, at the line-of-code level.
Recommended reading order for Miles' source
If you want to walk the source yourself, this order goes from outside to inside, from runtime phases to algorithm internals. Each file builds on the last; jumping straight to the middle usually wastes a day.
- train.py — establish the five phases (rollout, train, sync, save, eval). Skim, don't memorize.
- miles/ray/placement_group.py — see how colocate / split-colocate is actually expressed in Ray bundles.
- miles/ray/rollout.py — the RolloutManager, including reward normalization, DP split, and what fields end up in training data.
- miles/rollout/sglang_rollout.py — the default per-sample generation function; this is where dynamic sampling and partial rollout actually live.
- miles/backends/sglang_utils/sglang_engine.py — the HTTP wrapper around SGLang and the twelve control endpoints (memory, weights, generation pause).
- miles/backends/megatron_utils/actor.py — Megatron actor init, the train_actor flow (recompute logprob, R3 replay, advantage, train, backup).
- miles/backends/megatron_utils/update_weight/update_weight_from_tensor.py — the colocated path's flattened-bucket trick.
- miles/backends/megatron_utils/update_weight/update_weight_from_distributed/broadcast.py — the distributed NCCL path with the deadlock-prevention lock.
- miles/backends/training_utils/loss.py — the algorithm seam: response-aligned logprobs, all advantage estimators, TIS, ESS.
- scripts/run_deepseek.py — finally the production recipe, where all of the above gets wired up for an actual DeepSeek-V3 run.
By the time you reach run_deepseek.py, you are reading it not as a script but as a witness: every flag in the giant train_args string lights up a specific invariant or path you have already seen. That recognition is the whole point of the exercise.
- radixark/miles — the repo
- scripts/run_deepseek.py — the 5-stage entry point
- train.py — the async main loop
- tools/ — fp8_cast_bf16, convert_hf_to_torch_dist
- slime — the lightweight RL framework Miles is built on top of
- SGLang RL team: VLM multi-turn writeup — the canonical reference for the rollout design
Recent advances from the SGLang RL team
The slime + Miles + SGLang community has shipped a cluster of advances in late 2025 / early 2026 that are worth knowing collectively because they extend the survey's six primitives along five different axes. I list them with one-paragraph reads.
INT4 QAT full-flow training
Inspired by Kimi K2-Thinking's W4A16 QAT recipe, slime now runs an end-to-end INT4 quantization-aware training pipeline. The training side keeps BF16 master weights but inserts fake quantization (quant-dequant) into the forward pass — the model "sees" INT4 noise and learns weights robust to it. The backward pass uses the Straight-Through Estimator (STE): the round function's derivative is treated as 1, letting gradients flow through to the unquantized master weights. At inference time, SGLang loads true W4A16 weights with the Marlin kernel. Net effect: a 1TB-class model (Kimi K2 scale) fits its rollout on a single H200 (141 GB), eliminating cross-machine communication overhead. Technical writeup.
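To make the STE mechanics concrete, here is a minimal PyTorch sketch of fake quantization with a straight-through backward. It illustrates the idea, not slime's code; the symmetric per-channel INT4 scheme and the small clamp are my own simplifying assumptions.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Quant-dequant in the forward pass, identity gradient in the backward pass (STE)."""

    @staticmethod
    def forward(ctx, w: torch.Tensor) -> torch.Tensor:
        # Per-output-channel scale so the channel max maps to the INT4 extreme (7).
        scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
        q = torch.clamp(torch.round(w / scale), -8, 7)   # INT4 grid: [-8, 7]
        return q * scale                                  # dequantized values the model 'sees'

    @staticmethod
    def backward(ctx, grad_out: torch.Tensor) -> torch.Tensor:
        return grad_out  # straight-through: pretend round() has derivative 1


class QATLinear(torch.nn.Linear):
    """Linear layer whose forward sees INT4 noise while master weights stay BF16/FP32."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.linear(x, FakeQuantSTE.apply(self.weight), self.bias)
```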
Unified VLM/LLM multi-turn
The first-principles design described in the previous section. One rollout function, two domains, full decoupling between sampling logic and environment. Blog.
Rollout Router Replay (R3) for MoE stability
Already covered in the quantization section as Miles' signature contribution. Captures expert-routing decisions during SGLang inference, replays them during Megatron training. Makes MoE RL stable under low-precision routing.
Full-flow FP8 training and sampling
The follow-on to R3. "Unified FP8: Moving Beyond Mixed Precision for Stable and Accelerated MoE RL" walks through hardware foundations, scale selection, and MoE experiment results. The headline: FP8 inference + FP8 training + R3-style routing replay gives bit-identical numerics and ~2× rollout throughput on H100/H200.
Speculative decoding in RL
Standard practice for serving, novel in RL training. A small draft model proposes tokens; the policy verifies them in parallel. Net effect: 25%+ rollout speedup with no accuracy compromise. slime docs.
The cluster of advances above traces one through-line: the bottleneck of RL training is the rollout, not the gradient update. Every primitive optimizes rollout throughput or stability. The R3 / FP8 work makes the rollout faster and correct on MoE. The QAT work shrinks the rollout's memory. The multi-turn work expands what counts as a rollout. The speculative work decodes faster.
For a theorist this matters because the empirical claims in the field's papers (DeepSeek-R1, GLM-5.1, K2-Thinking, Doubao-1.5-pro) are produced under these specific infrastructure choices. If your work depends on understanding why those models behave as they do, the choices above are the assumption set you're implicitly invoking.
Case study — DeepSeek V4's post-training infrastructure
If Miles is the case study of "train one policy with RL," DeepSeek V4's post-training is the case study of the opposite bet: train many domain experts with RL separately, then merge them into one student via multi-teacher On-Policy Distillation (OPD). The algorithmic choice — distillation instead of RL as the final-stage merging primitive — shapes a different set of infrastructure problems, and DeepSeek's V4 report (Sections 5.1.2 and 5.2) reads as a clinic on what those problems are and how their team solves them. The system extends the same primitives Miles uses (hybrid engine, FP-aware QAT, fault-tolerant rollout) but adds two genuinely new pieces: efficient multi-teacher scheduling for OPD at trillion-parameter scale, and a production-grade sandbox platform (DSec) for the agentic-AI rollout side. Below is the reading of Section 5 a theorist should take away.
Multi-teacher OPD — the merging objective
The V4 team trains specialist models in math, coding, reasoning, world knowledge, and other domains as separate post-training runs, then distills all of them into one unified student. The objective is a weighted sum of reverse KL divergences against each teacher, computed on trajectories sampled from the student:
ℒ_OPD(M_S) = Σ_{i=1}^{L} w_i · D_KL(M_S ∥ M_{T_i}) — V4 §5.1.2. L teachers, weights w_i, trajectories drawn from M_S to preserve the on-policy property.
Two design choices deserve a theorist's attention. Reverse KL on student trajectories is what makes "on-policy distillation" on-policy: the student samples its own actions and the teachers score them, so the gradient is taken against the distribution the student actually inhabits. The alternative — forward KL with teacher trajectories — would give a Behavior Cloning-style loss that ignores the student's own failure modes. Selective alignment per task emerges from the formulation: the w_i · D_KL terms downweight teachers whose distributions are far from the student's current trajectory, so the math expert dominates math contexts and the coding expert dominates coding contexts automatically. The student converges to a policy that chooses which expert to imitate per context — without an explicit gating network.
The infrastructure twist is in how to compute that KL term. Prior practice approximates D_KL with a per-token estimate: at each position, treat log(M_T(a)/M_S(a)) for the sampled action as the per-token advantage and reuse the RL framework's PPO/GRPO loss machinery. Cheap, but high-variance — the per-token ratio swings wildly across positions and the gradient is noisy. V4 instead computes the full-vocabulary reverse KL at every position, summing across all |V| ≈ 100k+ tokens. The gradient is lower-variance and faithful to the teacher's full distribution, but the compute and memory cost is what the rest of Section 5.2 exists to make tractable.
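The contrast between the two estimators is easy to state in code. A minimal sketch, with shapes [batch, seq, vocab] for logits and [batch, seq] for sampled token ids; this is illustrative, not V4's implementation.

```python
import torch
import torch.nn.functional as F

def per_token_kl_estimate(student_logits, teacher_logits, sampled_ids):
    """High-variance estimator: the log-ratio of the sampled token only,
    reused as a per-token 'advantage' inside the PPO/GRPO loss machinery."""
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    idx = sampled_ids.unsqueeze(-1)
    # log(M_T(a)/M_S(a)) for the action actually sampled at each position
    return (t_logp.gather(-1, idx) - s_logp.gather(-1, idx)).squeeze(-1)

def full_vocab_reverse_kl(student_logits, teacher_logits):
    """Exact per-position reverse KL, summed over the whole vocabulary:
    KL(M_S || M_T) = sum_v M_S(v) * (log M_S(v) - log M_T(v))."""
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    return (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)
```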
Efficient teacher scheduling — the hardest piece
The challenge V4 had to solve: more than ten teacher models, each potentially trillion-parameter scale, contributing to a single student training step. The naive setup — materialize all teachers' full logits over the full vocabulary at every position — is prohibitive, even spooled to disk (think hundreds of GB of logits per mini-batch). V4's framework solves it through four composed engineering moves:
- Offload all teacher weights to centralized distributed storage, load on demand with ZeRO-like parameter sharding. Teachers live in shared storage, not in GPU memory.
- Cache only the last-layer hidden states in a centralized buffer during the teacher forward pass — not the full logits. The logit dimension |V| collapses; the hidden dimension d is ~10× smaller.
- Reconstruct full logits on demand via the prediction head module at training time. Negligible recomputation, no logit-materialization memory burden.
- Order training samples by teacher index during data dispatching, so each teacher head is loaded exactly once per mini-batch and at most one head resides in device memory at any moment. Parameter loading and offloading proceeds asynchronously, off the critical path.
And — closing the loop with this survey's TileLang section — V4 reports that "the exact KL divergences between teacher and student logits are computed using a specialized TileLang kernel, which accelerates the computation and curtails dynamic memory allocation." The hidden-state-cache + on-demand prediction-head trick is what makes the algorithm fit in memory; the TileLang kernel is what makes the KL computation fast. The whole subsection is a microcosm of how the survey's separate primitives compose: distributed storage offload, parameter sharding, asynchronous I/O, a custom DSL kernel — all stacked to make one mathematical objective economical.
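A sketch of the hidden-state-cache trick, assuming a HuggingFace-style teacher that exposes hidden states; the centralized buffer, sharded storage, and asynchronous loading are all elided here.

```python
import torch

@torch.no_grad()
def cache_teacher_hidden_states(teacher, input_ids: torch.Tensor) -> torch.Tensor:
    """Teacher forward pass that keeps only the final-layer hidden states (dim d),
    not the [seq, |V|] logits. The |V|-sized dimension never materializes here."""
    out = teacher(input_ids, output_hidden_states=True)
    return out.hidden_states[-1]            # [batch, seq, d]

def reconstruct_logits(hidden: torch.Tensor, lm_head: torch.nn.Module) -> torch.Tensor:
    """At training time, re-apply the teacher's prediction head to the cached
    hidden states to recover full-vocabulary logits on demand."""
    return lm_head(hidden)                   # [batch, seq, |V|]
```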
FP4 (MXFP4) QAT — lossless FP4→FP8 dequant
V4 applies MXFP4 quantization-aware training (FP4 weights with block-shared exponents, the OCP standard) to two components: MoE expert weights and the QK path in the indexer of Compressed Sparse Attention. The trick worth flagging is what they call lossless FP4→FP8 dequantization:
FP8 (E4M3) has 2 more exponent bits than FP4 (E2M1). As long as the ratio between max and min scale factors of the FP4 sub-blocks (1×32 tiles) within each FP8 quantization block (128×128 tiles) doesn't exceed a threshold, the fine-grained scale information is fully absorbed by the FP8 dynamic range.
The mathematical statement: under a bounded-scale-ratio condition (empirically satisfied by their weights), the composition FP32 → FP4 → FP8 preserves the FP4 scale-block information exactly when re-expressed in FP8. This means the existing FP8 training framework is reused without modification — the QAT pipeline plugs in via Straight-Through Estimator on the FP8 backward, and the entire framework's FP8 numerics stack remains intact. For deployment, native FP4 quantized weights are used during rollout instead of simulated quantization, so model behavior during sampling is bit-consistent with online inference. Where the survey's quantization section discussed FP8 and INT4 separately, V4 demonstrates a clean composition: MXFP4 weights flowing through FP8 compute paths, with neither the training framework nor the inference framework needing to know about the FP4 layer underneath.
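A hedged sketch of what checking the bounded-scale-ratio condition might look like on one 128×128 weight tile. The E2M1 maximum of 6.0 is the FP4 format constant; the threshold value and the exact tiling here are assumptions, not V4's published numbers.

```python
import torch

def fp4_scales_absorbable(w_block: torch.Tensor, threshold: float) -> bool:
    """Check whether the FP4 sub-block scales inside one FP8 quantization block
    stay within the ratio that FP8's extra exponent range can absorb.
    w_block: a [128, 128] weight tile; FP4 sub-blocks are 1x32 tiles inside it."""
    sub_blocks = w_block.reshape(128, 4, 32)                  # 4 sub-blocks of 32 per row
    fp4_scales = sub_blocks.abs().amax(dim=-1) / 6.0          # E2M1 max normal value is 6.0
    ratio = fp4_scales.max() / fp4_scales.min().clamp(min=1e-12)
    return bool(ratio <= threshold)   # if true, FP4 -> FP8 re-expression is exact
```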
Token-granular Write-Ahead Log — fault-tolerant rollout
This is the most mathematically interesting piece of V4's infrastructure, and it ties to the survey's design patterns directly. The problem: in a cluster-wide preemptive scheduler, any rollout request can be interrupted at any token by hardware failure or by preemption for a higher-priority task. The naive recovery — restart preempted requests from scratch — is the kind of thing that looks fine but is mathematically wrong.
Regenerating unfinished requests from scratch introduces length bias. Shorter responses are more likely to survive interruption, so regenerating from scratch makes the model more prone to producing shorter sequences whenever an interruption occurs. The bias is a survivorship artifact of the recovery policy, not the policy gradient. — DeepSeek V4 §5.2.3
For a probabilist this is the kind of subtle bias they should hear once and recognize forever. The fix is a token-granular Write-Ahead Log (WAL): every new token is immediately appended to a persistent log; preemption pauses the inference engine and persists the in-flight KV cache; resumption replays the WAL + cached KV to continue decoding. Even on fatal hardware failure, the WAL's tokens are enough to re-run prefill and reconstruct the KV cache from a clean start without restarting generation. The mathematical claim is that the distribution of output sequences is the same whether or not preemption occurred — the WAL preserves on-policy statistics under arbitrary interruption.
An equivalent solution V4 considers and rejects: a batch-invariant, deterministic inference stack with seeded PRNGs would also let interrupted runs be replayed exactly. Mathematically equivalent; engineering-wise prohibitive (full re-decoding cost instead of WAL-replay cost). The WAL is the right answer at this scale, and it generalizes to any rollout system that has to be preemption-safe.
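A minimal sketch of the pattern, not DeepSeek's system: the log is a line-buffered JSONL file per request, and engine.prefill / engine.decode_step are hypothetical stand-ins for whatever the inference engine actually exposes.

```python
import json
import os

class TokenWAL:
    """Token-granular write-ahead log: one append-only JSONL file per rollout request."""

    def __init__(self, path: str):
        self.path = path
        self._f = open(path, "a", buffering=1)   # line-buffered append

    def append(self, token_id: int) -> None:
        # Persist every generated token before it counts as 'emitted'.
        # (fsync per token is overkill for a sketch; batching is fine in practice.)
        self._f.write(json.dumps(token_id) + "\n")
        os.fsync(self._f.fileno())

    def replay(self) -> list[int]:
        # On resumption (or after a fatal failure) the logged tokens are enough to
        # re-run prefill, rebuild the KV cache, and continue decoding.
        with open(self.path) as f:
            return [json.loads(line) for line in f]


def resume_generation(engine, prompt_ids: list[int], wal: TokenWAL, max_new_tokens: int):
    """Continue an interrupted rollout from its WAL instead of regenerating from
    scratch (which would over-sample short survivors)."""
    generated = wal.replay()
    state = engine.prefill(prompt_ids + generated)   # hypothetical engine API
    while len(generated) < max_new_tokens and not state.finished:
        tok = engine.decode_step(state)              # hypothetical engine API
        wal.append(tok)
        generated.append(tok)
    return generated
```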
This is a sixth design pattern, complementing the five from the next section: persistent log + replay = correctness under preemption. I'll formalize it there.
Million-token RL — metadata vs heavy data separation
V4 supports million-token context windows. The corresponding rollout infrastructure has to handle trajectories where a single sample's per-token fields (logprobs, masks, multimodal payloads) easily exceed gigabytes. The team's solution is to decompose rollout data into two streams: lightweight metadata (lengths, sample IDs, reward scalars) loaded eagerly for global shuffling and packing layout decisions; and heavy per-token fields loaded lazily via a shared-memory data loader (intra-node deduplication) and released immediately upon consumption at mini-batch granularity. The number of on-device mini-batches is determined dynamically based on workload to trade compute throughput against I/O overlap.
For a theorist this is "buffer-then-fold" (pattern 5 in the next section) at the data-pipeline layer rather than the tensor layer — same structural argument, different granularity.
DeepSeek Elastic Compute (DSec) — agentic-AI sandbox platform
The agentic-RL frontier (SWE-bench, web research, tool-use trained policies) needs the rollout engine to call code execution, not just to generate tokens. V4 builds this as a separate platform: DSec, a Rust-based production sandbox system that manages "hundreds of thousands of concurrent sandbox instances" per cluster. Four execution substrates behind one unified Python SDK:
- Function Call — stateless invocations dispatched to a pre-warmed container pool, no cold start
- Container — Docker-compatible, EROFS on-demand image loading
- microVM — Firecracker, for security-sensitive high-density
- fullVM — QEMU, for arbitrary guest OSes
Built on the 3FS distributed filesystem and a custom RPC protocol. Crucially, sandbox lifecycles coordinate with GPU training schedules — preemption and checkpoint-based resumption are first-class. Each sandbox maintains a globally-ordered trajectory log that serves three purposes: fast-forward replay (when training is preempted, cached results for completed commands are replayed on resumption to skip non-idempotent re-execution); fine-grained provenance (every state change is traceable to its command); deterministic replay (any historical session reproduces from its log).
The unifying observation: V4 has now applied the WAL-replay pattern twice — once at token granularity for LLM rollout (Section 5.2.3), once at command granularity for sandbox state (Section 5.2.5). The same correctness argument carries over: persistent ordered logs let you preempt at any boundary and resume without changing the output distribution. The pattern generalizes; the granularity changes with the workload.
What's interesting about V4 as a case study
The honest summary: V4's contribution is not a new RL algorithm but a new merging algorithm. The team trains specialists with RL (covered in Section 5.1.1 with GRPO), then uses OPD to consolidate them. The infrastructure innovations — multi-teacher scheduling, WAL fault-tolerance, DSec sandboxes — exist because OPD-as-merger creates problems that pure RL doesn't. For a theorist reading V4 alongside Miles, the right framing is that both papers solve the same skeleton (rollout cycle, three pillars, hybrid engine) but assemble different upper layers on top.
- DeepSeek V4 technical report — Section 5.1.2 (OPD objective) and Section 5.2 (post-training infrastructure)
- Gu et al. 2024 — MiniLLM: the original on-policy distillation formulation
- Lu & Lab 2025 — Thinking Machines OPD blog: a sharper case for OPD over RLHF
- slime · examples/on_policy_distillation — the open-source reference with two teacher modes (sglang external, megatron in-process)
- slime OPD README — flags: --use-opd, --opd-type, --opd-kl-coef, --opd-teacher-load
Engineering design patterns — the algebraic view
Six patterns recur across every framework in the survey. Once you see them named, the rest of the page reads as variations on a small set of themes. I lay them out here as a synthesis of what the previous sections have already shown. The sixth pattern is the one V4 just demonstrated above — I add it explicitly at the end.
1. Functor — one verb, many topologies
The four update_weights_from_* methods are the canonical example. There is one operation ("write the new policy into the inference engine") lifted across four different categories of physical configuration (same process / shared disk / NCCL group / CUDA IPC). The interface is fixed; the implementation per category is what differs. The same pattern appears in SkyRL's three-backend abstraction (vLLM / SGLang / OpenAI API) and in verl's training-backend choice (FSDP / FSDP2 / Megatron).
The mathematical statement: an operation lifted to a category of topologies, with a consistency obligation downstream (the flush_cache_after_weight_update contract). For a theorist, this is the cleanest example of "the abstraction does mathematical work" in the survey — the verb is the same, the meaning is preserved, but the cost varies by category.
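A sketch of the pattern in Python typing terms. The method names follow the update_weights_from_* family, but the signatures, the trainer methods, and the flush_cache call are simplified stand-ins rather than SGLang's exact API.

```python
from typing import Iterable, Protocol

class WeightSink(Protocol):
    """One verb ('write the new policy into the inference engine'), several transports."""
    def update_weights_from_tensor(self, named_tensors: Iterable) -> None: ...        # same process / CUDA IPC
    def update_weights_from_disk(self, checkpoint_path: str) -> None: ...             # shared filesystem
    def update_weights_from_distributed(self, names, dtypes, shapes, group) -> None: ...  # NCCL broadcast
    def flush_cache(self) -> None: ...                                                # the downstream obligation

def sync_policy(trainer, sink: WeightSink, topology: str) -> None:
    """The implementation per topology differs; the meaning and the consistency
    obligation (flush the prefix/KV cache after the weights land) do not."""
    if topology == "colocated":
        sink.update_weights_from_tensor(trainer.named_parameters())
    elif topology == "disk":
        sink.update_weights_from_disk(trainer.save_checkpoint())
    else:
        sink.update_weights_from_distributed(*trainer.broadcast_specs())
    sink.flush_cache()
```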
2. Categorical product — multiplicative composition
RadixAttention × GRPO is the example I lead with elsewhere because it's the most beautiful. An algorithmic choice (group sampling, for variance reduction) exposes a sharing pattern (a common prefix). A data structure (the prefix tree) exposes a sharing mechanism (reference-counted nodes). Their composition multiplies savings — 4× prefill cost becomes 1× plus tails — and reference counting makes cache lifetime automatic.
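A back-of-envelope example, with made-up but representative numbers: a GRPO group of N = 8 completions over a 2,000-token prompt with 500-token responses costs 8 × 2,000 = 16,000 prompt-token prefill passes naively, but only 2,000 with the shared prefix cached once, plus the 8 × 500 decode tails either way. The prefill term drops by a factor of N, and that saving compounds with whatever variance reduction the algorithmic choice of N buys.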
The pattern recurs elsewhere. The hybrid engine × Megatron 5D parallelism: colocation chooses one GPU per role at a time, while Megatron handles the parallelism within a role; their composition is what makes 462B-parameter training feasible. R3 routing replay × FP8 inference: routing replay ensures expert selection is identical, FP8 makes it economical; together they make MoE RL stable. Whenever you see "the system advantage is doing the heavy lifting," you're looking at a categorical product.
3. Mutual exclusion as serialization
The hybrid engine (训推一体, "training and inference as one") is the central instance. The GPU is in exactly one of two states at any moment (training-mode or inference-mode); transitions are explicit; concurrency contracts are checked structurally with assert is_fully_idle(). This is the engineering version of a state machine with two states and well-defined transitions — but its real significance is that shared state with mutable ownership has no race conditions when ownership is exclusive. Concurrent programming's worst class of bugs is structurally unavailable.
The same pattern shows up in subtler places. RadixAttention's inc_lock_ref / dec_lock_ref establishes mutual exclusion between active readers and the eviction policy — a node with lock_ref > 0 simply cannot be evicted. Use-after-free is unrepresentable. The structural argument is the same; the state machine is just per-node instead of global.
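A per-node sketch of that contract, simplified from the description above rather than copied from SGLang's source; the point is that the evictor structurally cannot touch a pinned node.

```python
from dataclasses import dataclass, field

@dataclass
class RadixNode:
    kv_block: object | None = None            # the cached KV memory this node owns
    lock_ref: int = 0                          # number of active requests pinning it
    children: dict = field(default_factory=dict)

    def inc_lock_ref(self) -> None:
        self.lock_ref += 1                     # reader enters: eviction becomes impossible

    def dec_lock_ref(self) -> None:
        self.lock_ref -= 1                     # last reader leaves: node becomes evictable

def evict(candidates: list[RadixNode], blocks_needed: int) -> int:
    """Free leaf nodes until enough blocks are reclaimed. Pinned nodes are
    skipped by construction, so use-after-free is unrepresentable."""
    freed = 0
    for node in candidates:
        if freed >= blocks_needed:
            break
        if node.lock_ref == 0 and not node.children:
            node.kv_block = None
            freed += 1
    return freed
```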
4. Identity invariant under reparameterization
This is the deepest pattern, and Miles' R3 is the canonical case. The mathematical statement: routing(x, θ, fp8) ≡ routing(x, θ, bf16), enforced by replay rather than by hoping numerics agree. Two computations that should give the same answer but might not, made to give the same answer by recording one and replaying the other.
The QAT pipeline does the same thing differently — fake quantization in the forward pass establishes loss(θ, bf16) ≈ loss(θ, int4) by inserting the int4 noise into the training distribution. STE (Straight-Through Estimator) makes the backward pass behave as if int4 weren't there. Both are identity invariants under a reparameterization (precision regime, in this case) maintained by explicit engineering rather than mathematical equality.
For a theorist this is the most interesting pattern because it appears whenever an engineering shortcut (lower precision, async update, partial rollout) creates a mathematical inconsistency that has to be closed by another mechanism. The shortcut + the fix together are the contribution; neither alone is.
5. Buffer-then-fold — bounding asymptotic complexity
The multimodal tensor merge in multi-turn rollout (Section above) is the clean example. Naïve concatenation per turn: O(n²). Buffer-then-fold: O(n). The same input-output behavior; two different complexity profiles distinguished only by where the allocation boundary sits.
The pattern recurs in AReaL's _PendingWeightUpdateBucket (queue NCCL broadcasts in memory-bounded buckets, fire them at the end), in verl's DataProto.union (merge fields lazily, materialize the full batch once at dispatch time), and in slime's bucket-based weight sync (avoid OOM by streaming updates in slices). The unifying claim: do not perform an O(n) operation inside an O(n) loop unless you must.
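The two complexity profiles in miniature; a generic sketch rather than any framework's merge routine.

```python
import torch

def merge_multimodal_naive(turns: list[torch.Tensor]) -> torch.Tensor:
    """O(n^2): re-concatenate the growing tensor every turn."""
    merged = turns[0]
    for t in turns[1:]:
        merged = torch.cat([merged, t])   # copies everything accumulated so far
    return merged

def merge_multimodal_buffered(turns: list[torch.Tensor]) -> torch.Tensor:
    """O(n): buffer references during the loop, fold once at the end.
    Same output, different complexity profile."""
    buffer = []
    for t in turns:
        buffer.append(t)                  # O(1) per turn
    return torch.cat(buffer)              # single allocation at the boundary
```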
6. Persistent log + replay — correctness under preemption
The pattern V4 demonstrates with the token-granular Write-Ahead Log. The setup: a long-running stateful computation (a rollout, an agent session, a database transaction) can be interrupted at any moment. The naive recovery — restart from the most recent persistent checkpoint — has subtle correctness pitfalls. Restarting an in-flight LLM generation, as we saw, introduces length bias: shorter responses are over-represented in the survivor population, distorting the empirical distribution of trajectories the policy gradient sees.
The pattern's claim: maintain an append-only log of every operation; on resumption, replay the log to reconstruct the pre-interruption state. The result is that the post-interruption execution is indistinguishable from a no-interruption execution at the observable boundary (the sampled trajectory distribution in V4's case; the database state in a transactional setting). Token-granular WAL is the LLM-rollout instance; command-granular trajectory logs are DSec's agentic-sandbox instance; the same correctness argument applies in both.
Mathematically, the pattern is a fixed-point claim: the function from input prompt to output trajectory is the same with or without preemption, when preemption is bracketed by WAL persistence + replay. This is a stronger guarantee than "results are approximately the same" — it's that they're distributionally identical. For a probabilist, the recognition is that recovery policies are part of the data-generating process, and the wrong recovery policy is a hidden experimental variable.
The synthesis
Once you have these six patterns, reading a new RL framework becomes pattern-matching. The first time you see RLinf's M2Flow scheduler, you ask: "what's the functor here, and what's the consistency obligation downstream?" The first time you see cosmos-rl's async reward microservice, you ask: "what's the mutual exclusion contract, and what's the buffer-then-fold strategy?" The first time you see a fault-tolerant rollout pipeline, you ask: "what's logged, what's replayed, and what's the distributional-equivalence claim?" The first time you read a new framework's train.py, the six patterns are the lens.
This is why the survey insists on names. Naming the patterns turns "what's going on in this code" into "which of the six patterns is this." The latter is a search; the former is reading.
Reading real code: verl's fit() loop
Everything above has been concept. If you want to see how it ties together in a real framework, the cleanest reference is verl's ray_trainer.py. One PPO/GRPO iteration is roughly 30 lines of driver code; every phase is a marked_timer block; every cross-module call is a Ray dispatch to SPMD worker groups. The batch is a DataProto that accumulates fields phase by phase via .union(...).
with marked_timer("step", timing_raw):
with marked_timer("gen", timing_raw, color="red"):
combined_gen_output = self.async_rollout_manager.generate_sequences(combined_gen_batch)
self.checkpoint_manager.sleep_replicas()
batch = batch.repeat(repeat_times=self.config.actor_rollout_ref.rollout.n, interleave=True)
batch = batch.union(gen_batch_output)
with marked_timer("reward", timing_raw, color="yellow"):
if self.use_rm and "rm_scores" not in batch.batch.keys():
batch_reward = self._compute_reward_colocate(batch)
batch = batch.union(batch_reward)
reward_tensor, reward_extra_infos_dict = extract_reward(batch)
with marked_timer("old_log_prob", timing_raw, color="blue"):
old_log_prob, old_log_prob_mfu = self._compute_old_log_prob(batch)
batch = batch.union(old_log_prob)
if self.use_reference_policy:
with marked_timer(str(Role.RefPolicy), timing_raw, color="olive"):
ref_log_prob = self._compute_ref_log_prob(batch)
batch = batch.union(ref_log_prob)
if self.use_critic:
with marked_timer("values", timing_raw, color="cyan"):
values = self._compute_values(batch)
batch = batch.union(values)
with marked_timer("adv", timing_raw, color="brown"):
batch = compute_advantage(batch, adv_estimator=self.config.algorithm.adv_estimator, ...)
if self.use_critic:
with marked_timer("update_critic", timing_raw, color="pink"):
critic_output = self._update_critic(batch)
with marked_timer("update_actor", timing_raw, color="red"):
actor_output = self._update_actor(batch)
The colors aren't decorative — they map to phases you'd see in verl's wandb traces. Red is generation, yellow is reward, blue is old_log_prob, etc. The driver code is short because every method call is a Ray dispatch to SPMD worker groups, which expand to hundreds of GPUs internally. Reading this once is worth more than reading 20 RL papers — it makes the abstract loop in Figure 1 fully concrete.
The contrast worth looking at is AReaL's async path. The single keyword that distinguishes "synchronous" from "fully async" RL is async_op=True on the NCCL broadcasts. AReaL launches them, queues them in memory-bounded buckets, and continues without waiting; OpenRLHF's broadcast_to_vllm blocks until every engine acknowledges. Two designs, one keyword apart, with substantially different scaling properties:
class _PendingWeightUpdateBucket:
    handles: list[dist.Work]           # work handles returned by async_op=True broadcasts
    futures: list[torch.cuda.Event]
    tensors: list[torch.Tensor]

# During weight sync (trainer side):
for bucket in buckets:
    dist.broadcast(bucket.tensor, src=0, group=dp_group, async_op=True)
    # continue iterating — don't wait

# Inference side:
future = rollout_engine.update_weights_from_distributed(meta, param_specs, async_op=True)
# Inference keeps running until it explicitly awaits the future
The framework landscape
Nine frameworks, all instantiations of the skeleton in Section 2. They differ on five axes: training backend, inference backend, orchestrator, placement policy, and target domain. Once you know the five axes, you can place every framework on a 5-dimensional map and the surface differences (DAPO vs PRIME vs GSPO vs OPD) collapse into local choices made within the same architectural envelope.
| Framework | Training | Inference | Orchestration | Domain | Distinctive bet |
|---|---|---|---|---|---|
| Miles | Megatron (plugin) | SGLang | Ray | Frontier MoE | FP8/INT4 bit-identical · R3 routing replay |
| SLIME | Megatron | SGLang | Ray | Reasoning + agentic (GLM-5.1) | On-policy distillation · math/science graders |
| SGLang | — | (itself) | — | Substrate | update_weights_from_* · RadixAttention |
| SkyRL | FSDP / Megatron | vLLM / SGLang | Ray | Multi-turn agents | skyrl-gym + skyrl-agent + Tinker |
| cosmos-rl | PyTorch + 6D parallel | vLLM / diffusers | Custom NCCL | Physical AI | WFM RL (DDRL) · FP8/MXFP4 |
| RLinf | FSDP + Megatron | SGLang + vLLM | Ray + M2Flow | Embodied + agentic | Macro→Micro flow: 2.43× from scheduling |
| verl | FSDP / Megatron | vLLM / SGLang | Ray (HybridFlow) | Frontier LLM RL | 3D-HybridEngine · DAPO/PRIME/GSPO recipes |
| OpenRLHF | DeepSpeed | vLLM | Centralized Ray | RLHF baseline | The OG · structurally synchronous |
| AReaL | FSDP / Megatron | SGLang | Ray + async futures | Fully-async RL | _PendingWeightUpdateBucket · 2.77× speedup |
If you cluster these by their bet, three families emerge. LLM purists (Miles, SLIME, verl) make a tight Megatron + SGLang + Ray bet and innovate on numerics and algorithms. Agent platforms (SkyRL, RLinf) abstract environments and trade backend specificity for flexibility. The domain extender (cosmos-rl) brings RL to non-text modalities — diffusion video, robotics — and rebuilds orchestration around hardware. SGLang sits underneath all of them as substrate; AReaL and OpenRLHF are performance specialists that differ mainly in their async choice.
For a theorist, the useful exercise is to ask, for a given research result, which framework it was trained on and whether the framework's choices preserve the assumptions the result needs. A claim about variance reduction under GRPO needs the prefix cache to behave; a claim about MoE expert specialization needs the routing-replay invariant; a claim about off-policy correction needs the IS weights to actually be computed. The framework matters more than the algorithm name suggests.
Beyond chat LLMs
All nine frameworks above were designed for chat LLMs. The 2026 frontier is wider — multi-turn agents that call tools, vision-language-action policies for robots, diffusion world models that generate video. These targets push different requirements onto the framework, and frameworks that can't accommodate them get left behind.
Multi-turn agents
Single-turn RL: one prompt → one completion → one reward. Easy. Multi-turn RL: the model emits a tool call, gets a tool response, emits another action, gets another response — across 10+ turns. Reward arrives at the end (task success) or per-turn (intermediate signals). The rollout is a trajectory, not a completion. New systems requirements: KV cache reuse across turns (SGLang's open_session()), variable-length trajectories (mandates partial rollout), and a clean environment abstraction so the agent can call real tools. SkyRL with skyrl-gym + skyrl-agent (which trained the SA-SWE-32B SWE-bench model) and SLIME with its concrete examples (tau-bench, retool, search-r1) are the leaders.
Embodied AI and VLA
Vision-Language-Action models (π₀ / π₀.₅ from Physical Intelligence, OpenVLA, NVIDIA GR00T-N1.5) take an image plus a language instruction and output continuous action sequences for robots. The action space is no longer discrete tokens but continuous joint angles. RL algorithms shift to SAC, DAPO, SAC-Flow (a flow-matching policy variant). The simulator-in-the-loop becomes a non-negotiable part of the framework — RLinf abstracts ManiSkill, IsaacLab, Habitat, LIBERO, RoboTwin, CALVIN, MetaWorld behind a single envs/ wrapper.
World foundation models and diffusion
Cosmos-Predict, SANA, Stable Diffusion 3, Wan2.2 (Alibaba's 27B-total/14B-active MoE video model), FLUX/FLUX.2 (Black Forest Labs) — these are diffusion-based generators. Each "rollout" is 50+ denoising steps, not one autoregressive pass; cost-per-sample dominates everything. Long-context handling becomes critical (video tokens reach 100K+), which is why cosmos-rl needs 6D parallelism. The algorithm shifts too: PPO/GRPO doesn't directly apply to diffusion models. Cosmos-rl's DDRL (Data-Regularized DRL) replaces the KL term with reward maximization plus standard diffusion training loss. RLinf as of Feb 2026 supports RL fine-tuning of VLA on Wan world models — closing the loop where Wan simulates rollouts for embodied agents.
The TPU detour
If you're forced onto Google TPUs, almost none of this works. The Cambrian-MLLM TPU training blog documents two years of training multimodal models on TPUs and reaches three uncomfortable conclusions. First, "dynamic shapes are the enemy" — every shape change triggers XLA recompilation. Variable-length generation as we know it on GPU is essentially banned. Second, "arbitrary SPMD sharding proved impractical" — Ray-based hybrid placement doesn't translate. Third, silent library incompatibilities hide everywhere: F.scaled_dot_product_attention and torch.utils.checkpoint fail without errors on TorchXLA.
The conclusion: a TPU-native RL framework would look fundamentally different. Static graph compilation, fixed shape contracts, no Ray, no SGLang. No one has built this yet. vLLM has plugin support for TPUs in inference, but the full RL-on-TPU story is open. This is a real research gap.
The scaling chain
The architecture is elegant because each primitive is designed to address the bottleneck that appears at the next scale tier. Walking up the chain:
| Tier | What breaks first | What saves you |
|---|---|---|
| ~8 GPUs (single node) | KV + train state collide → OOM | release_memory_occupation + colocate |
| ~64 GPUs (small cluster) | GRPO group prefills duplicate work | RadixAttention prefix cache |
| ~256 GPUs (medium) | Naïve param copy = 50ms × thousands of tensors | Handle-tuple zero-copy via CUDA IPC + ZMQ |
| ~1024 GPUs (large) | Long-tail rollouts — 90% GPUs idle 18s/iter | Partial rollout + TIS/MIS staleness corrections |
| ~4096 GPUs (frontier) | Single-controller becomes CPU bottleneck | SPMD multi-controller; M2Flow dynamic scheduling |
| ~10000+ GPUs | NCCL groups can't be resized | RDMA point-to-point + disaggregated rollout pools |
The meta-property is that each primitive is a local optimization. You don't pay for RDMA until you need elasticity. You don't pay for partial rollout until tail latency dominates. The stack is "pay for what you need," which is what makes it scale — the opposite design (monolithic optimization tuned to one tier) wins at that tier and dies at the next.
Pitfalls — 踩坑录 (a log of pits stepped in), lessons paid in pain
Chenyang's tutorial catalogs the production failure modes the field has paid years of debugging on. Six are worth knowing for anyone trying to interpret RL results:
1. Training-inference numerical drift
Inference kernels fuse operations to maximize throughput; the fusion depends on batch shape. Same model, same input, different batch size = slightly different logits. Invisible at the token level, fatal at the logprob level. Never trust inference-engine logprobs for loss computation.
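If you want to see the drift rather than take it on faith, the check is cheap to instrument. A minimal monitoring sketch with hypothetical tensor names, not any framework's API; shapes are [batch, seq, vocab] for trainer logits and [batch, seq] for the rest.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def logprob_drift(trainer_logits: torch.Tensor,
                  inference_logprobs: torch.Tensor,
                  token_ids: torch.Tensor,
                  mask: torch.Tensor) -> dict:
    """Recompute logprobs on the training engine and compare them against the
    logprobs the inference engine reported for the same sampled tokens."""
    trainer_logprobs = F.log_softmax(trainer_logits.float(), dim=-1)
    trainer_lp = trainer_logprobs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    diff = (trainer_lp - inference_logprobs) * mask      # zero out padding positions
    denom = mask.sum().clamp(min=1)
    return {
        "mean_abs_diff": diff.abs().sum().item() / denom.item(),
        "max_abs_diff": diff.abs().max().item(),
    }
```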
2. Handle-tuple deserialization segfaults
verl's update_weights_from_tensor requires monkey_patch_torch_reductions() to register CUDA IPC handle deserializers. Missing this call: silent segfault, intermittent.
3. NCCL hangs under mixed inference backends
OpenRLHF + SGLang integration: silent deadlocks, no error, the run just stalls. Mixing distributed backends (DeepSpeed + Ray + SGLang's own dist) creates fragile NCCL group management.
4. Memory choreography under colocate
Megatron's CPU offload is imperfect — KV cache and model parameters contend for the same address ranges. slime's bucket-based weight update exists specifically to avoid OOM on large MoE models. Test the memory hand-off at full scale early.
5. Off-policy ratio drift
Without explicit IS correction or operational staleness bounds, the off-policy ratio grows monotonically until the policy is training on data from a fundamentally different policy. Training curve looks fine. Eval curve mysteriously degrades. Monitor off-policy ratio as a first-class metric.
6. Mixing precisions across train and infer
BF16 training + FP8 inference + FP32 optimizer state = three numerical regimes interacting via weight sync. MoE routing diverges between regimes (the problem R3 solves). If you can't make precision uniform, instrument routing decisions per-expert. The bug is invisible at the loss level.
A researcher's checklist — what to verify before trusting an RL result
If the empirical content of the previous sections distilled to anything actionable, it would be a short list of questions you should be able to answer about any RL paper before you treat its numbers as evidence for its theoretical claims. I list nine. They are intentionally ordered by how often they materially affect results — the first three matter more than the last three.
1. Which framework was the training run on?
The most important single question, and the one papers most often answer vaguely. "We used GRPO" tells you the algorithm. The framework choice — Miles vs verl vs SLIME vs OpenRLHF — tells you which set of engineering primitives shaped the run. Routinely, two groups training "the same algorithm" report different numbers because their frameworks make different choices about staleness, prefix caching, weight sync, and precision alignment.
2. Inference engine and rollout topology?
SGLang vs vLLM matters because of the RadixAttention prefix-cache difference (especially for GRPO group sampling). Colocated vs disaggregated rollout matters because of the weight-sync path. "vLLM with disaggregated rollout" and "SGLang with hybrid engine" can produce noticeably different sample-efficiency curves under the same algorithm and the same model. A paper that does not name the inference engine has not described its experimental setup.
3. Precision regime, and what enforces train-infer consistency?
BF16 train + BF16 infer is the safe case. FP8 inference + BF16 training is the dangerous case for MoE models without R3-style routing replay. INT4 inference + BF16 training requires QAT. The question to ask: in what precision are the logprobs that drive the loss computed, and where do they come from? If the answer is "FP8 inference engine" without further qualification, you should expect numerical drift. If the answer is "recomputed in BF16 on the training engine," good — that's the safe pattern.
4. On-policy or async? Off-policy ratio cap?
Strict on-policy RL is rare at any scale. If the framework uses partial rollout (most do, above 256 GPUs), the result is technically off-policy, and there should be either an explicit importance-sampling correction (Miles' TIS/MIS) or an explicit staleness bound (Kimi K1.5's curriculum scheduling, AReaL's bucket size). The question: is the off-policy ratio reported, capped, or even monitored? If none of those, the on-policy claim is unverified.
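What "reported, capped, or monitored" means in practice is a few lines of bookkeeping. A generic sketch of truncated importance weights plus their effective sample size, not Miles' exact implementation; the clip value is a made-up default.

```python
import torch

def truncated_is_weights(trainer_logprobs: torch.Tensor,
                         rollout_logprobs: torch.Tensor,
                         clip: float = 2.0) -> tuple[torch.Tensor, float]:
    """Per-token truncated importance weights and their effective sample size (ESS)."""
    ratio = torch.exp(trainer_logprobs - rollout_logprobs)  # pi_train / pi_rollout
    w = ratio.clamp(max=clip)                               # truncate, don't renormalize
    ess = (w.sum() ** 2 / (w ** 2).sum().clamp(min=1e-12)).item()
    return w, ess   # ESS close to the token count means 'effectively on-policy'
```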
5. Was logprob recomputed on the trainer or trusted from inference?
The "BERT-era unsolved bug" — inference kernels fuse operations differently from training kernels, producing different logprobs for the same input. Every well-engineered framework recomputes logprobs on the training engine. If a paper trains on inference-engine logprobs, the loss has a numerical bias that grows with training. Worth checking in the framework's source even if the paper doesn't say.
6. For MoE models: is routing replay enabled?
The MoE-specific version of question 3. Without R3 (or equivalent), the gradient signal has a noise floor from routing divergence between inference and training. DeepSeek-V3, Qwen3-MoE, GPT-OSS, Mixtral — all need this. A paper training one of these without naming the routing-replay mechanism is implicitly trusting that the precision regimes agree exactly, which they typically don't.
7. Group size N for GRPO?
GRPO's variance reduction scales with group size. The prefix-cache savings also scale with group size (more completions sharing a prefix). The optimal N depends on the cluster, the model, and the task. Papers often report N=8 or N=16 with no ablation. If you're trying to reproduce or compare against a result, the group size is part of the experimental setup, not a hyperparameter.
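For reference, the group-relative advantage that makes N part of the experimental setup, as a textbook sketch (DeepSeekMath-style) rather than any specific framework's code.

```python
import torch

def grpo_group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize each reward against its own group of N completions for the same
    prompt. rewards: [num_prompts, N]; larger N gives a less noisy baseline."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True).clamp(min=eps)
    return (rewards - mean) / std
```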
8. For multi-turn experiments: token budget, max_turns, truncation policy?
Multi-turn trajectories have long tails. A 32-turn rollout with no per-turn limit gives different results from a 32-turn rollout with a 4096-token per-turn cap. Whether observation tokens are loss-masked (they should be) is part of the setup. Whether the chat-template preamble is repeated each turn (it shouldn't be — see the dummy-messages trick) is part of the setup. Papers comparing agents trained under different multi-turn settings often aren't comparing the same thing.
9. Compute budget per training step, and where time is spent?
The least-asked but most-revealing question. A training run reports "256 H100-days for the full schedule" and the breakdown between rollout, reward, and gradient update is typically 60% / 5% / 35%. If a paper compares two algorithms but doesn't break down where the time went, the comparison may be an artifact of one method exploiting prefix caching better, not of the algorithm being inherently faster. The wandb-style timing breakdown (gen / reward / old_log_prob / values / adv / update_critic / update_actor) that verl's marked_timer blocks produce is what you actually want to see in a paper's appendix.
A paper that doesn't answer the first three of these has not described its experiment. A paper that doesn't answer the first six is not reproducible from its text alone. The 8th and 9th are a luxury — but they are where the most interesting comparisons live.
None of this is to claim that papers should be ignored unless they answer all nine questions. It is to say that the empirical comparisons you make in your head while reading the field's papers should always include "and what is implicitly assumed about questions 1 through 6." The discipline of working through this checklist is what the survey's vocabulary is for.
A 19-repo reading plan — how to do this yourself
This survey distilled 19 source repos into a few thousand words. If you want to do the same exercise yourself — read the actual code, not just my summary — here is the order I would recommend, and the template I would use for each repo. The two together turn what looks like an overwhelming corpus into a manageable sequence of small case studies.
Read in five groups, not in their listed order
The natural impulse is to start with the most familiar repo on the list. That is the wrong move. The list groups by topic — image diffusion, RL framework, inference engine, kernel DSL — and reading them in topic order means you spend a week on image generation before you have any of the RL skeleton in your head. The better order is the one this survey is built on:
- The RL skeleton itself. Start with Miles, slime, verl, SGLang, and Megatron-LM together. These five repos cover the full generate → score → filter → train → sync path. Miles and slime give you a clean five-phase loop in train.py. verl gives you the most complete PPO/GRPO driver. SGLang gives you the rollout-and-weight-sync substrate. Megatron-LM gives you the training backbone and, surprisingly, its own native RL path in megatron/rl/. By the end of group 1 you can read any RL framework's train.py in fifteen minutes.
- Comparison across RL framework designs. Then read AReaL, OpenRLHF, SkyRL, and RLinf. Now you're not learning the skeleton — you're learning the variations. AReaL shows fully-async; OpenRLHF shows the classical synchronous Ray + DeepSpeed + vLLM design; SkyRL specializes in long-horizon agents; RLinf adds the M2Flow scheduler for embodied and agentic workloads. The point of group 2 is to see where the same skeleton can bend.
- Inference and compiler substrate. Next read vLLM, TensorRT-LLM, TensorRT, Triton, TileLang. These are not RL frameworks. They are the layer beneath. The goal is to understand why prefix caches, paged attention, CUDA graphs, kernel DSLs, quantization, and MoE communication kernels determine rollout economics. After group 3, you will read RL papers differently — you'll see that "we used SGLang" or "we used vLLM" is a substantive experimental claim, not a deployment detail.
- Rollout targets beyond chat. Then FLUX, FLUX.2, Wan2.2. These are not RL frameworks either — they are the targets RL frameworks might want to fine-tune. FLUX.2's reference-token KV-cache trick and Wan2.2's video-diffusion MoE are both useful systems case studies in their own right. The point is to understand what changes when the target is not an autoregressive LLM.
- Failure modes and meta-reading. Finish with the Cambrian TPU blog and Chenyang Zhao's Awesome-ML-SYS-Tutorial. Cambrian tells you why almost none of the GPU-RL stack ports to TPUs. The tutorial tells you which papers' results to distrust until you have verified the infrastructure. These two are the reality check at the end of a long reading binge.
Five groups, nineteen repos. If you spend a real week on group 1 and then a few days on each of groups 2 through 5, you finish with a working mental model of the field's plumbing. Without group 1, the others read like trivia. With group 1, they read like commentary on a structure you already understand.
A ten-question template for each repo
I have used the same template for every case study in this survey. It is short enough to apply in an evening, structured enough to make the reading comparable across repos. The questions are deliberately ordered from outside to inside, so you build context before you dive into specifics.
| # | Question | Where to look |
|---|---|---|
| 1 | What is this repo, really? Is it a target model, a framework, a backbone, an inference engine, or a kernel substrate? | README + top-level package layout |
| 2 | Directory map — what lives where? README, docs, examples, scripts, tests, src. | One-shot ls -R with note-taking |
| 3 | Entry scripts — what's the training entry, the inference entry, the benchmark entry, the deployment entry? | scripts/ and the README's "Quick Start" |
| 4 | Core objects — name the trainer, rollout manager, scheduler, engine, worker, cache, model runner. | The __init__.py exports of the top-level package |
| 5 | One sample's path — trace a single prompt/request from input through rollout, reward, loss, update, sync. | The main loop file; trace the variable names |
| 6 | Engineering invariants — memory ownership, weight freshness, cache invalidation, logprob recompute, precision consistency. | The asserts and concurrency contracts |
| 7 | Parallelism and resources — TP/PP/DP/EP/CP, FSDP, DeepSpeed, Ray placement, NCCL/RDMA, offload. | The config files; the placement-group code |
| 8 | Known failure modes — OOM, staleness, routing divergence, NCCL hangs, cache staleness, dynamic shapes, precision drift. | Issues, CHANGELOG, comments tagged "TODO" or "WARN" |
| 9 | Mapping to this survey's primitives — which of the six engineering primitives does it implement, and which does it skip? | Cross-reference with the design patterns section above |
| 10 | The ten files most worth reading — in walking order, outside to inside. | Your own notes from steps 3-6 |
This template is what the Miles case study in this survey looks like under the hood. The deep dive into Miles is just questions 1-10 answered for one repo. Once you have the answers for one repo, the answers for the next repo are easier to find — you know what to look for.
The point of a template is not that it teaches you the answer. The point is that it makes the unknown answers visible. Two pages of "I don't know yet, but I know where to look" beats ten pages of disorganized notes from skimming.
Why this matters more than any single answer
The honest reason to learn this corpus is not that you will read all nineteen repos. The reason is that you will have to read one of them, sooner than you expect, because something in your own work depends on it. When that happens, knowing the five-group structure and the ten-question template is what turns "I have to read a giant unfamiliar codebase" into "I have a routine." A theorist who has practiced this routine is hard to surprise.
The corpus is a vocabulary, not a curriculum. The five groups are the rough order in which the vocabulary becomes useful. The ten questions are how you make any single new repo legible. Together they are the most portable thing in this survey — they outlast any specific framework or paper, because the structure of the field changes more slowly than its surface.
Open questions at the systems–theory boundary
A short list of unresolved questions where a mathematically-minded RL person could plausibly contribute. These are not "research suggestions" so much as the cluster of problems the survey's content gestures at without solving.
TPU-native RL infrastructure
The Cambrian-MLLM blog establishes that almost none of the GPU-RL stack ports to TPUs cleanly. Static shapes, XLA compilation, the absence of NCCL, the awkwardness of Ray on TPU pods — all of these forbid the patterns the survey covers. The question is whether a different set of patterns can deliver the same outcome on TPU hardware: tile-based parallelism, AOT-compiled rollout pipelines, JAX-native weight sync. No framework today fills this gap. Building one is at least a year of senior engineering, but the design space is genuinely open.
Formal verification of weight-sync correctness
TileLang's Z3 theorem-prover integration into the TVM arithmetic analyzer is the first credible attempt to bring SMT-style verification into the GPU DSL. The weight-sync contract — that the inference engine's Python tensors point to the same physical memory the trainer wrote, with no aliasing or stride mismatches — is the kind of contract that would benefit from a machine-checked proof. Current implementations rely on careful code review plus runtime asserts. A formal treatment is missing.
Tight off-policy bounds under partial rollout
TIS and MIS truncate or mask importance weights to control variance, but the bias-variance tradeoff is hand-tuned. The theoretical question: under what conditions on the staleness distribution does partial rollout converge to the same fixed point as strict on-policy training, with how much added variance? Miles' TIS/MIS implementation is a starting point; a tight analysis would let frameworks set the truncation threshold automatically rather than as a hyperparameter.
MoE routing under quantization — beyond R3
R3 (Rollout Routing Replay) is a fix, not a theorem. The deeper question: under what conditions on the routing computation does a precision-induced top-k flip materially affect downstream learning? If the answer is "always," R3 is permanent. If the answer is "only when the expert affinities are within ε of each other," there might be a quantization scheme that preserves routing without explicit replay. This is the kind of question a numerical analyst could answer.
Theoretical basis for DDRL
cosmos-rl's Data-Regularized DRL replaces the KL term in standard PPO with a reward maximization objective plus the standard diffusion training loss. The empirical results are good; the theoretical basis is light. What's the corresponding policy-improvement guarantee? Under what conditions does the diffusion loss act as an implicit KL regularizer? The video-generation RL frontier needs this analysis to mature.
Multi-agent RL as a category of interacting policies
RLinf's multi-agent support (rStar2, WideSeek-R1) and the broader multi-agent RL literature lack a clean systems abstraction. The category-theoretic framing — agents as objects, message-passing as morphisms — is intuitive but doesn't yet correspond to a framework primitive. A mathematically clean abstraction here would have systems consequences.
The diffusion–autoregressive bridge
cosmos-rl supports both diffusion world models (Cosmos-Predict, SANA, SD3) and autoregressive LLMs (Qwen, LLaMA, DeepSeek) — but they go through fundamentally different inference paths (the diffusers backend vs vLLM). The two paradigms increasingly need to interact: video generation conditioned on language, robot policies that output both continuous actions and language explanations. A unified RL framework that treats both as first-class is an open systems problem.
Each of these is a real research direction, not a textbook exercise. The field's progress is increasingly going to depend on people who can think across the systems–theory boundary — which is the audience this survey is written for.
A reading list for the theoretically inclined
Ordered not by sequence but by what you want to understand. Pick a row.
| If you want to understand... | Read |
|---|---|
| The mental model behind every modern RL framework | HybridFlow paper (verl/EuroSys 2025) |
| The connective tissue between the abstract algorithm and the engineering | Chenyang Zhao's tutorial (in Chinese; the canonical reference) |
| What makes inference engines fast | PagedAttention paper (vLLM/SOSP 2023) · SGLang paper · LMSys RadixAttention announcement |
| Why GRPO works (and where group size matters) | DeepSeekMath / GRPO paper |
| What MoE routing replay actually preserves | Miles docs on R3 and the FP8 pipeline |
| How to write a fast GPU kernel without learning CUDA proper | Triton tutorials · Sasha Rush's Triton Puzzles |
| What 5D parallelism looks like in practice | Megatron Core parallelism guide |
| One framework's source code, end to end | verl's ray_trainer.py |
| The mathematician's case for caring about systems | (this survey) |
The 19 source repos surveyed
verl · miles · slime · SkyRL · cosmos-rl · RLinf · SGLang · vLLM · AReaL · OpenRLHF · Megatron-LM · Triton · TileLang · cuda-python · TensorRT · TensorRT-LLM · FLUX · FLUX2 · Wan2.2 · Cambrian TPU blog