RL Infrastructure for the Mathematically Inclined
A field survey of reinforcement-learning post-training systems — six engineering primitives, three kernel-level DSLs, the training backbone, the quantization problem, seven production case studies (multi-turn agentic · slime · Miles · verl · DeepSeek V4 · SGLang internals · vLLM block-paging), six recurring design patterns, a researcher's checklist, a 19-repo reading plan, and open questions at the systems–theory boundary. Written for mathematicians and RL theorists who want to read the source code of verl, SGLang, vLLM, Megatron-LM, Triton and recognize what's mathematically interesting about the engineering choices.
Why this matters to a theorist
If you're a mathematician thinking about reinforcement learning, you might assume infrastructure is "the boring part that runs your algorithm." This survey argues otherwise. The engineering choices in modern RL post-training systems encode mathematical decisions — which approximations are bounded, which biases are corrected, which invariants hold by construction. You can't read DeepSeek-R1's training story without knowing how Miles' routing-replay makes it numerically possible; you can't compare on-policy and off-policy results without understanding how AReaL's asynchronous broadcasts shift the bias term in your gradient estimator.
The engineering is, in many cases, the assumption set that your theorem is implicitly invoking. A claim about GRPO's variance reduction holds only as long as the prefix cache, the importance-sampling correction, and the memory-handoff contract all behave as advertised. When they don't, the loss curve still looks fine and the eval curve mysteriously degrades — the field has paid years of debugging in this exact pattern.
This survey covers what is mathematically interesting about the engineering. I won't try to teach you the algorithms; the field has good papers and reading lists for that. I'll try instead to give you the structural picture: the five-stage cycle every framework instantiates, the three pillars they all stitch together, the six primitives that resolve the central tension, and the three kernel-level DSLs that make any of it run at a competitive speed. Where the engineering encodes a mathematical fact — an invariant, a composition rule, a bias correction — I'll flag it.
A large number of RL conclusions derived from papers are based on RL infrastructure that may be extremely flawed. — paraphrasing Chenyang Zhao, whose Awesome-ML-SYS-Tutorial is the most-cited source in this survey
The mathematician's instinct here is the right one: if you can't verify the infrastructure, treat the results as conjecture. The good news is that the infrastructure is open source, and the patterns are surprisingly elegant once you have a vocabulary. The aim of this survey is to give you the vocabulary.
How to read this survey
The most useful thing I can tell a theorist before they start is that this material has a particular structure that rewards a particular reading mindset. The skeleton (next section) and the central tension are the only two ideas you cannot skip; everything else is variation. If you have an hour, read those two and the six engineering primitives and stop — you'll have the structural picture. If you have a weekend, read everything in order. If you came here looking for a specific framework, jump to the framework landscape and use its links into the rest of the survey as a vocabulary glossary.
Three reading mindsets reward different kinds of attention.
The structural reading
Treat each engineering primitive as an algebraic statement: here is the invariant the primitive preserves; here is the operation that establishes it. The hybrid engine establishes mutual exclusion on GPU ownership. The four update_weights_from_* paths are a functor lifted across four topology categories. R3 establishes a fixed-point invariant on MoE routing decisions across precision regimes. RadixAttention × GRPO is a categorical product whose savings multiply. Read this way and the survey looks like a small algebra textbook with concrete examples — which is what it is.
The computational reading
Treat each primitive as an asymptotic statement: here is what gets cheaper and by how much, and here is what the next bottleneck is. The hybrid engine takes GPU utilization from 45–55% to 85–90%. Handle-tuple weight sync drops a 50ms-per-tensor cost to sub-millisecond. RadixAttention turns a 4× prefill cost into 1× plus 4× decode. Partial rollout turns 18s of idle into 2× throughput at the price of measurable off-policy bias. Read this way and the scaling chain section (which we'll get to) becomes the survey's organizing principle.
The empirical reading
Treat each primitive as an assumption set under which a paper's claim holds: here is what is invisibly assumed when the paper says "trained with GRPO". The variance reduction of group sampling assumes the prefix cache is reused. The on-policy interpretation of PPO/GRPO assumes the off-policy ratio is bounded. The MoE training stability claim assumes routing replay or unified precision. Read this way and the pitfalls section and the researcher's checklist (toward the end) become essential reading.
The three mindsets are complementary, not exclusive. The point of distinguishing them is that the same paragraph in this survey carries three kinds of content. A senior systems engineer might dwell on the asymptotics; a category-theorist might dwell on the invariants; an empirical-RL researcher might dwell on the assumption set. The page tries to surface all three, but you decide which to weight.
Why theorists usually skip this material — and what they miss
The honest reason most theoretically-trained RL people skip infrastructure is that the field's papers do most of the engineering exposition in appendices that read like release notes. "We use GRPO with group size 8" hides what's mathematically interesting (why 8, what the prefix cache buys, what the staleness correction does); "Training is done on a 128-H100 cluster" hides the parallelism choices. The body of the paper foregrounds the algorithm and treats the infrastructure as an implementation detail, when in fact the infrastructure is often where the empirical story is decided.
What you miss by skipping: you miss the fact that the engineering choices are themselves a contribution to the algorithm's behavior. Two papers can both claim "GRPO + Megatron" and report different numbers, and the difference lives in choices that the body never describes. Once you internalize the survey's vocabulary, the engineering exposition starts answering questions you'd previously have asked of the math. That's the conversion the survey aims for.
A useful default: assume the systems choices materially shaped any empirical RL result, until the paper proves otherwise.
The skeleton: a five-stage cycle on three pillars
Every framework you'll encounter — Miles, SLIME, verl, SkyRL, RLinf, cosmos-rl, AReaL, OpenRLHF, even SGLang's own RL adopters — instantiates the same skeleton. One step of RL training is a five-stage loop: the model generates completions, those are scored by a reward function, low-quality ones are filtered, the survivors train the policy, and the new policy is synced back to the inference engine. Repeat.
Around this cycle live three components: a training engine that computes gradients and holds the optimizer; an inference engine that runs the policy to generate rollouts; and an orchestrator that coordinates them across GPUs and nodes. A framework is, to a first approximation, a particular choice for each of these three slots — plus a data layer (reward hub, filter hub, data buffer) and an algorithm layer (GRPO, PPO, DAPO, OPD) that sit underneath.
The genius of the modern stack is that these three pillars are pluggable, with stable interfaces between them. The framework code looks single-threaded but runs SPMD across hundreds of GPUs. The driver writes engine.update_weights_from_distributed(...) and never sees the NCCL topology underneath. This decoupling is what makes a framework a framework and not a one-off training script.
The central tension: two halves that fight
Pretraining an LLM is a static dataset plus a training loop. Inference is a model plus a request stream. Both are well-understood; the right tools are Megatron-LM and vLLM (or SGLang) respectively. RL post-training is both at once, and the two halves want opposite things from the same hardware.
| Concern | Training prefers | Rollout prefers |
|---|---|---|
| Parallelism | High TP / PP — split the model | High DP — split the batch |
| Memory | Optimizer states + gradients + activations | Weights + KV cache + CUDA graphs |
| Throughput unit | Tokens/step (compute-bound) | Trajectories/sec (memory-bound) |
| Goal | Update one global policy | Generate N independent samples |
If you give half your fleet to training and half to inference and let them run in parallel, the on-policy constraint forces them to alternate anyway — half the fleet is always idle. If you force identical parallelism on both, one side bottlenecks. If you give them different parallelism, every phase transition requires resharding the model. Every engineering primitive that follows is a way out of this trap.
This is, mathematically, a resource-conflict problem with a Pareto frontier. The pragmatic solution — colocate both halves on the same GPUs and serialize them in time — turns the conflict into a sequencing problem. That sequencing problem is what the next six sections solve.
Six engineering primitives worth knowing
Each subsection covers one primitive. For each: the invariant it preserves, the implementation, and what's mathematically interesting about it.
① The hybrid engine — 训推一体
The answer to the opposing-preferences problem is a pattern called 训推一体 ("integrated train-inference"), formalized in verl's HybridFlow paper. Put both halves on the same GPUs; run them serially; swap memory between phases.
Mathematically, this is mutual exclusion: at any moment the GPU is in exactly one of two states (training-mode or inference-mode), and a controlled transition between them happens once per cycle. The cost is the transition itself. The benefit is that every GPU is doing useful work at all times — peak utilization moves from 45-55% (disaggregated) to 85-90% (colocated). Every modern framework — Miles, SLIME, verl, SkyRL, RLinf, cosmos-rl — uses some form of this pattern. The disagreement is on how to make the transition cheap.
Up to about 64 GPUs, full colocation (all roles share one resource pool) wins. Above that, split colocation (actor+ref on one pool, critic+reward on another) wins, because the parallelism preferences of the four roles start to diverge. Pure disaggregation only becomes economic above ~1024 GPUs, when rollout volume can amortize a separate inference cluster.
② Memory choreography
The hybrid engine works because two functions encode the entire concurrency contract: release_memory_occupation and resume_memory_occupation. They pause the inference engine, hand the GPU to the trainer, then restore it. The implementation is granular — three independent memory pools (KV cache, weights, CUDA graphs) can be released independently.
Here's the actual implementation in SGLang. The mathematical interest is in the assert is_fully_idle() at the top: the function literally cannot run if any request is in flight. The concurrency contract is checked structurally, not by convention. Each if tag in tags branch is a separate pool with its own pause semantics — granularity is the point.
def release_memory_occupation(self, recv_req):
assert self.is_fully_idle(), \
"release_memory_occupation should be called only when server is idle."
tags = recv_req.tags or GPU_MEMORY_ALL_TYPES
for tag in tags:
self.offload_tags.add(tag)
if GPU_MEMORY_TYPE_KV_CACHE in tags:
self.memory_saver_adapter.pause(GPU_MEMORY_TYPE_KV_CACHE)
self.flush_cache()
if GPU_MEMORY_TYPE_WEIGHTS in tags:
self.stashed_model_static_state = _export_static_state(
self.tp_worker.model_runner.model
)
torch.distributed.barrier(self.tp_cpu_group)
self.memory_saver_adapter.pause(GPU_MEMORY_TYPE_WEIGHTS)
if GPU_MEMORY_TYPE_CUDA_GRAPH in tags:
self.memory_saver_adapter.pause(GPU_MEMORY_TYPE_CUDA_GRAPH)
torch.get_device_module().synchronize()
return ReleaseMemoryOccupationReqOutput()
③ Zero-copy weight synchronization
After each training step the trainer has a new policy θt+1; the inference engine still has θt. Naively, you serialize the new weights, send them, deserialize. On a 70B model that's tens of milliseconds per parameter, with thousands of parameters per layer — minutes per training step, untenable.
The trick is to never move the tensor data. The trainer process has the new weights in GPU memory; the inference process is either on the same host or shares a NCCL group with the trainer. The trainer sends only a handle tuple — pointer, stride, offset, CUDA IPC descriptor — under a kilobyte. The inference process reconstructs a Python tensor object that points to the same physical GPU memory.
What's mathematically interesting here is that the abstraction is functorial: a Python Tensor is a thin wrapper around an underlying memory descriptor, and reconstructing the wrapper is independent of moving the bytes. The trainer and inference engine end up with two different objects denoting the same memory. The synchronization cost goes from O(parameters × byte_throughput) to O(parameters × pointer_size). Sub-millisecond instead of minutes.
④ Four update_weights_from_* paths, one verb
The trainer might hold the new weights in any of four physical configurations: same process as the inference engine, on a shared disk, on remote NCCL ranks, or in a sibling CUDA-IPC process. SGLang exposes four implementations behind one verb. The framework above writes engine.update_weights_from_*(...); the topology underneath dictates which method.
The pattern is functor-like: the same operation lifted to different categories of topology. What matters mathematically is that all four methods share flush_cache_after_weight_update. The RadixCache (next subsection) must be invalidated because old prefix entries reference the old policy — keeping them would mean the inference engine is serving generations based on a model that no longer exists. Cache invalidation is the inherent consistency obligation; the four transports differ in how they get the bytes to the inference engine, but they all owe the same invariant downstream.
class SchedulerUpdateWeightsMixin:
def update_weights_from_disk(self, recv_req):
success, message = self.tp_worker.update_weights_from_disk(recv_req)
if success and self.draft_worker is not None:
success, message = self.draft_worker.update_weights_from_disk(recv_req)
if success:
self.flush_cache_after_weight_update(recv_req)
return UpdateWeightFromDiskReqOutput(success, message, 0)
def update_weights_from_distributed(self, recv_req):
success, message = self.tp_worker.update_weights_from_distributed(recv_req)
if success:
self.flush_cache_after_weight_update(recv_req)
return UpdateWeightsFromDistributedReqOutput(success, message)
def update_weights_from_tensor(self, recv_req):
worker = self.draft_worker or self.tp_worker
success, message = worker.update_weights_from_tensor(recv_req)
if success:
self.flush_cache_after_weight_update(recv_req)
torch.distributed.barrier(group=self.tp_cpu_group)
return UpdateWeightsFromTensorReqOutput(success, message)
def update_weights_from_ipc(self, recv_req):
success, message = self.tp_worker.update_weights_from_ipc(recv_req)
if success and self.draft_worker is not None:
success, message = self.draft_worker.update_weights_from_ipc(recv_req)
if success:
self.flush_cache_after_weight_update(recv_req)
torch.distributed.barrier(group=self.tp_cpu_group)
return UpdateWeightsFromIPCReqOutput(success, message)
An underrated systems constraint that this verb hides: NCCL process groups are static. Their participant set is fixed at creation; adding a new inference node mid-training means destroying and recreating the group. RDMA's point-to-point model treats each connection independently — a new peer establishes a fresh Queue Pair without disturbing existing links. This single fact explains why colocated frameworks lean on NCCL (via init_weights_update_group) while disaggregated systems prefer RDMA or shared disk. Elasticity is incompatible with NCCL.
⑤ RadixAttention × GRPO — algebraic composition
This is the example I would lead with if I were trying to convince a mathematician that engineering can be beautiful. GRPO generates N completions per prompt and normalizes rewards within the group as a baseline; the algorithmic motivation is variance reduction. The system-level consequence is that all N completions share the prompt prefix.
RadixAttention exploits this. The shared prefix is a single node in a radix tree; each completion branches off as a child. Reference counting (inc_lock_ref / dec_lock_ref) means a node with lock_ref > 0 cannot be evicted. Use-after-free is structurally unrepresentable.
class RadixCache(BasePrefixCache, KVCacheEventMixin):
def match_prefix(self, params: MatchPrefixParams) -> MatchResult: ...
def insert(self, params: InsertParams) -> InsertResult: ...
def inc_lock_ref(self, node: TreeNode) -> IncLockRefResult: ...
def dec_lock_ref(self, ...) -> DecLockRefResult: ...
def evict(self, params: EvictParams) -> EvictResult: ...
@property
def evictable_size(self): ...
Neither was designed for the other. GRPO chose group sampling for variance reduction. RadixAttention chose a tree for cache reuse. Their composition multiplies savings — and reference counting means cache lifetime is automatic.
This kind of accidental-but-elegant composition recurs throughout the field. A mathematician will recognize the structure: the algorithm exposes a sharing pattern (here, a common prefix); the data structure exposes a sharing mechanism (here, a radix tree); their composition is the categorical product. The right notation makes the saving obvious — the wrong notation makes the saving impossible to express.
⑥ Async training and the staleness tradeoff
Strict on-policy RL says: wait for every trajectory to finish under policy θt before running the gradient step. In practice this is fatal at scale. If 10% of prompts produce long-context trajectories (2K tokens of decode) and 90% are short (512 tokens), the short ones finish in ~2s but the long ones take ~20s. 90% of GPUs idle for 18s every iteration. Over 1000 iterations: ~5 GPU-hours per GPU wasted.
The pragmatic solution is partial rollout: train on the 128 trajectories that finished, let the rest continue under (now stale) θt. Throughput rises 2–4×; the off-policy ratio rises with it. There are two paths to bounding the resulting bias.
Mathematical correction. Miles applies Truncated Importance Sampling (TIS) and Masked Importance Sampling (MIS): the gradient update of a sample generated under θt-k is reweighted by the importance ratio πθt(a|s) / πθt-k(a|s), with truncation or masking to control variance. The estimator becomes unbiased again at the cost of variance the truncation introduces — a classical Radon-Nikodym story.
Operational correction. Kimi K1.5 bounds staleness instead: context-length checkpointing, dynamic batch reordering, prefix-maximizing scheduling. The off-policy ratio never grows large enough to require explicit correction, but the bound is enforced operationally rather than mathematically.
The architectural extreme is AReaL's fully-async design: NCCL broadcasts are launched with async_op=True, bucketed by memory budget. The trainer never blocks on weight sync. The off-policy bias is whatever falls out of how long inference takes to catch up — call it "implicit staleness." AReaL bets that with the right bucketing, the implicit staleness stays small enough to skip explicit corrections. They claim 2.77× speedup over synchronous baselines.
The layer beneath: CUDA, Triton, and TileLang
So far the survey has stayed at the framework level. But the throughput of all the primitives above depends on the speed of the underlying kernels — attention, GEMM, normalization, softmax. A 2× speedup on an attention kernel is a 2× speedup on rollout, which is 2× on the whole RL loop. Below PyTorch, there is a hierarchy of languages for writing GPU kernels. Theorists tend to underestimate how much of the field's progress lives at this layer.
CUDA Python — the bottom of the stack
NVIDIA/cuda-python is the metapackage that exposes CUDA from Python. It has multiple subpackages: cuda.bindings (low-level bindings to the CUDA driver, runtime, NVRTC, NVVM), cuda.core (Pythonic access to CUDA Runtime and JIT compilation), numba.cuda (a SIMT DSL that compiles a restricted subset of Python to CUDA kernels), and newer DSLs cuda.tile (NumPy-like syntax for the CUDA Tile programming model) and cuda.coop (block-wide and warp-wide primitives). If you need to write a kernel from scratch in Python with full control, this is the layer.
The reason this exists, beyond pedagogy, is that PyTorch's abstractions occasionally aren't enough. When you need NVSHMEM for one-sided RDMA, or NVML for fine-grained device queries, or a custom reduction over a non-standard layout, you reach for cuda.bindings. For RL infra specifically, NCCL group management and CUDA IPC (Section ③) live here.
Triton — the Python DSL that became universal
Triton is the Python DSL for writing GPU kernels that became the de facto standard the moment FlashAttention shipped in it. The pitch from the original MAPL 2019 paper is simple: write code that's higher-productivity than CUDA but more flexible than fixed-shape DSLs, with autotuning across block sizes. The compiler handles memory coalescing, shared-memory allocation, and tensor-core dispatch.
python/tutorials/01-vector-add.py@triton.jit
def add_kernel(x_ptr, # *Pointer* to first input vector.
y_ptr, # *Pointer* to second input vector.
output_ptr, # *Pointer* to output vector.
n_elements,
BLOCK_SIZE: tl.constexpr):
pid = tl.program_id(axis=0)
block_start = pid * BLOCK_SIZE
offsets = block_start + tl.arange(0, BLOCK_SIZE)
mask = offsets < n_elements
x = tl.load(x_ptr + offsets, mask=mask)
y = tl.load(y_ptr + offsets, mask=mask)
tl.add(x_ptr + offsets, y, mask=mask)
What's worth noticing: the indexing is at the block level, not the thread level. tl.arange(0, BLOCK_SIZE) denotes a whole block of indices; mask handles out-of-bounds without manual loops. This is the "tiled" model — write block-wise math, the compiler vectorizes within the block. vLLM's PagedAttention, SGLang's attention kernels, the FlashAttention family, and most of the actually-fast layers in production frameworks are Triton kernels.
TileLang — the new DSL with theorem-prover integration
TileLang (open-sourced Jan 2025) is a newer Python DSL on top of TVM. Its pitch is similar to Triton's — productivity above CUDA, flexibility above fixed DSLs — but the design choices differ. TileLang exposes more explicit memory tier control (shared, fragment, register) and integrates the Z3 theorem prover into TVM's arith analyzer for SMT-based symbolic reasoning and automatic correctness verification. Backends include NVIDIA via CUTLASS CuTe DSL, AMD MI300X (with async copy), Apple Metal, Huawei AscendC, and WebGPU.
examples/deepseek_mla/example_mla_decode.py@tilelang.jit(out_idx=[4], pass_configs={...})
def flashattn(batch, heads, kv_head_num, seqlen_kv, dim, pe_dim, ...):
@T.prim_func
def main_split(Q, Q_pe, KV, K_pe, Output):
with T.Kernel(batch, heads // min(block_H, kv_group_num), num_split, threads=256) as (bid, hid, bz):
Q_shared = T.alloc_shared([block_H, dim], dtype)
S_shared = T.alloc_shared([block_H, block_N], dtype)
KV_shared = T.alloc_shared([block_N, dim], dtype)
acc_s = T.alloc_fragment([block_H, block_N], accum_dtype)
acc_o = T.alloc_fragment([block_H, dim], accum_dtype)
scores_max = T.alloc_fragment([block_H], accum_dtype)
logsum = T.alloc_fragment([block_H], accum_dtype)
# ... ~50 lines of tiled FlashAttention math ...
The TileLang authors claim performance parity with hand-written FlashMLA on H100 in 80 lines of Python. The Z3 integration is the unusual choice — it lets the compiler prove arithmetic invariants symbolically (e.g., that an index expression stays in bounds for all parameter values, that a tiling is a valid partition). For a mathematician, this is the most interesting compiler design move in the GPU DSL space.
Which one to learn?
If you only learn one, learn Triton — it's where the field's research code lives. If you care about correctness verification or AMD/Apple/Huawei portability, look at TileLang. Reach for raw cuda.bindings only when you need a CUDA capability the DSLs don't expose (NVSHMEM, certain NCCL primitives, CUDA IPC manipulation). The framework-level RL code (Section ③) lives entirely in pure PyTorch + cuda-python bindings; the inference engines underneath (SGLang, vLLM) are where Triton kernels do the heavy lifting.
- triton-lang.org — tutorials (vector add, fused softmax, matmul, FlashAttention)
- srush/Triton-Puzzles — Sasha Rush's puzzles, runnable without a GPU
- tile-ai/tilelang-puzzles — 10 progressively harder puzzles for TileLang
- tilelang/examples/deepseek_mla — MLA decode reference implementation
- cuda.core docs · cuda.tile docs
The training backbone: Megatron-LM
Megatron-LM is the training engine sitting under almost every RL framework in this survey. The repository contains two parts: Megatron Core (a composable library of GPU-optimized building blocks — kernels, parallelism strategies, mixed-precision support) and Megatron-LM proper (reference training scripts using the core). Performance numbers from NVIDIA's benchmarks: 462B-parameter models trained on 6144 H100 GPUs, reaching 47% Model FLOP Utilization. For comparison, naïve PyTorch DDP on a 70B model typically achieves 30-35% MFU.
From the survey's point of view, Megatron plays two distinct roles at once. As a substrate, it is the engine that Miles, slime, and verl wrap to do the actual gradient work — recomputing logprobs, computing the GRPO loss, running backward, stepping the optimizer, and materializing rollout weights. As a native RL framework, it ships its own complete first-party RL stack at megatron/rl/, with train_rl.py, rl_utils.py, sequence_packing_utils.py, and an SGLang-server bridge. Knowing both roles is what lets you read the verl/slime/Miles code without losing track of which abstraction owns which invariant.
megatron/rl/. Bottom: 5D parallelism, each axis with its concrete RL meaning.Role 1: substrate for Miles / slime / verl
When a framework above wraps Megatron, it asks Megatron for four things, in this order:
- Recompute logprobs for the rollout tokens. The inference engine produced its own logprobs but they cannot be trusted for the loss — different precision, different kernels, sometimes different operator order. The trainer must recompute them in the training precision, with the training kernels, on the training parallel layout. This is one forward pass per microbatch, no backward.
- Compute the loss. Given recomputed logprobs πθ, rollout logprobs πold, advantages Â, and a KL reference πref, build the GRPO objective and a few diagnostic scalars. The next subsection has the explicit formula.
- Backward + optimizer step. Standard Megatron, but with a quirk: at RL scale, the parallelism layout chosen for backward (favoring DP for throughput) is usually not the same one that's fastest for the rollout forward (favoring TP for low latency). The framework above arbitrates.
- Materialize weights for rollout sync. After the optimizer step, the new policy has to flow back into SGLang. The trainer exposes a method (parameter iteration with DTensor → full_tensor() bucketing, or an offload-to-disk path) that yields named tensors in a form the inference engine can ingest. Primitive ③ covers this.
Three of the four jobs above are exactly what a normal pretraining loop does. The one that distinguishes RL is #1, the recompute. Every RL framework eats this cost because the alternative — trusting inference-engine logprobs — produces silent training-time drift that the gradient cannot detect.
Role 2: Megatron's own RL stack
The directory megatron/rl/ is a complete RL framework in its own right, hidden inside what is nominally a pretraining library. The headline files are worth knowing by name:
train_rl.py— the top-level entrypoint. Builds the model with the standard Megatronpretrain()harness, then swaps the forward-step function for an RL-aware one and the data iterator for a rollout-driven one. The training loop is identical to pretraining at the outer level; only the inner step differs.megatron/rl/rl_utils.py— the GRPO loss, the truncated-importance-sampling clip, the KL penalty, the entropy bonus, and the precision-mismatch diagnostics. This is the file to read if you want to see what a GRPO step actually computes, with all corrections inline rather than spread across a framework.megatron/rl/sequence_packing_utils.py— the packing logic that lets variable-length rollouts share a microbatch without attention leaking across sequence boundaries. Produces aPackedSeqParamsobject holdingcu_seqlens(cumulative lengths) which the Transformer Engine attention kernels honor as a hard mask. The packing prevents O(n²) attention waste on the tail of short rollouts.megatron/rl/agent/·inference/·server/— the three modules that bridge Megatron to an external sampler.server/exposes aMegatronLocalhandle that the SGLang RL backend uses for parameter transfer andverify_model_weights_swap()checks.inference/hosts the rollout client.agent/hosts the environment interface.verify_model_weights_swap()— a small but load-bearing invariant. After a sync, it asserts that the inference engine and trainer hold the same weights up to a tolerance, by sampling parameters and comparing reductions. Failures here are the canonical staleness bug. Without this check, a missed sync drifts silently until rewards crater.
The GRPO loss as Megatron writes it
GRPO drops PPO's critic and replaces it with a group-relative baseline: for a prompt, sample G completions, compute the reward for each, normalize within the group, and use the normalized reward as the advantage. The Megatron implementation builds the loss as:
L_GRPO = -E[ min( ρ · Â, clip(ρ, 1-ε, 1+ε) · Â ) ] # truncated importance sampling
+ β · KL( π_θ || π_ref ) # reference penalty
- α · H( π_θ ) # entropy bonus
where ρ = exp( π_θ(a|s) - π_old(a|s) ),
 = group-normalized reward within G samples for the same prompt,
π_old = logprobs at sampling time (from the rollout engine, NOT trusted as numerics
but trusted as the policy that generated the action),
π_θ = recomputed logprobs from the trainer in the training precision.
The mathematically delicate term is ρ. In on-policy PPO/GRPO, ρ should be ≈ 1 because the same model generates and trains. In practice it is not, for three reasons: (a) partial rollouts replay older policy actions, (b) numerical drift between FP8 inference and BF16 training, (c) MoE routing divergence when low-precision routing flips top-k picks. Megatron's rl_utils.py exposes mismatch metrics that compute |ρ - 1| moments per batch — when those grow beyond a threshold, the training is no longer on-policy in the empirical sense and the gradient estimator's bias has eaten the trust region.
The mismatch metric is the empirical answer to "is my training on-policy?". The math says ρ should be 1; the metric tells you what it actually is. Every production GRPO run watches this number.
Sequence packing — why attention leakage matters
Rollouts vary in length wildly: a math problem may finish in 200 tokens, a long-horizon code task in 8000. A naïve batch pads to the longest sequence and burns FLOPs on padding. Packing concatenates several rollouts end-to-end into a single packed sequence and tells the attention kernel where the boundaries are. Done wrong — without the boundary mask — tokens at position 250 in rollout B attend to tokens at position 200 in rollout A, and the gradient learns to mix unrelated trajectories. PackedSeqParams with cu_seqlens = [0, 200, 4500, 4700, …] is the data structure that Transformer Engine's flash attention uses to enforce the boundary. The packing transformation is not just a memory optimization; it is a correctness primitive. The bug it prevents (cross-sequence attention) is the kind that produces a coherent-looking loss curve and a quietly-broken policy.
The four-layer Megatron ↔ SGLang interface
When a framework above (slime, Miles) connects Megatron to SGLang, the interface decomposes into four layers, each enforcing one invariant:
- Process layer. Ray spawns a Megatron worker group and an SGLang scheduler group. They live on the same GPUs (sleep-mode time-sharing) or different GPUs (disaggregated). Invariant: every Megatron rank knows which SGLang rank it talks to, and vice versa.
- Memory layer.
release_memory_occupation()on SGLang frees the engine's KV cache before backward;resume_memory_occupation()reclaims it after. Invariant: at any instant, exactly one of {trainer activations, inference KV} occupies the shared GPU memory pool. Primitive ② made this concrete. - Weight layer.
update_weights_from_distributed()(or its IPC / disk variant) streams parameters from trainer to engine, bucket by bucket, NCCL-overlapped. Invariant:verify_model_weights_swap()must pass before the next rollout begins. - Token layer. SGLang generates, Megatron recomputes logprobs, the framework above stitches the tuples. Invariant: the
π_oldstored on disk for the loss is the engine's logprob; theπ_θused for the loss is always the recompute.
Bugs at each layer have different symptoms: process-layer bugs hang; memory-layer bugs OOM; weight-layer bugs drift; token-layer bugs corrupt gradients. The right invariant to assert at each layer is what makes the four layers cleanly decomposable.
5D parallelism in the RL setting
The product space TP × PP × DP × EP × CP is huge. Below is how each axis interacts specifically with RL — what makes RL different from pretraining is that the same model has to run in both rollout and training modes, with different optimal layouts.
- TP (Tensor Parallel) — split each layer's weight matrices across GPUs. Within-machine, NVLink-bound. RL setting: TP is the dimension most often shared between rollout and training (because the rollout engine is also TP-sharded). Sleep-mode hybrid engines (verl, slime) typically use identical TP for trainer and engine, which makes the weight handoff a no-op in layout.
- PP (Pipeline Parallel) — split layers across GPUs. Cross-machine, InfiniBand-bound. RL setting: PP is rarely matched between trainer and engine — SGLang prefers TP-only for low decode latency, while the trainer benefits from PP for the backward. The mismatch is handled at the weight-sync layer with explicit gather and scatter steps.
- DP (Data Parallel) — replicate the model, split the batch. RL setting: DP scales the trainer cheaply but does nothing for the rollout (the rollout is bottlenecked by single-sample latency, not per-batch throughput). High-DP layouts therefore over-provision the trainer relative to the engine, which is fine if rollout dominates wall time but wasteful otherwise.
- EP (Expert Parallel) — split MoE experts across GPUs. Required for DeepSeek-V3, Qwen3-MoE, GPT-OSS. RL setting: EP is where the R3 routing-replay invariant lives. The trainer's EP layout and the engine's EP layout must agree on which expert handles which token, or the gradient signal corrupts. Miles' contribution is the EP-aware replay.
- CP (Context Parallel) — split the sequence dimension across GPUs. RL setting: CP is what makes V4's million-token RL possible. The 2026 dynamic-CP work (1.48× on variable-length) is precisely the RL setting — rollouts have wildly variable lengths, so a static CP wastes capacity on short sequences.
Communication intensity falls along TP > CP > EP > DP > PP. Place the densest communication on the fastest hardware (NVLink within a node) and the sparsest on the slowest (cross-rack InfiniBand). This is the structure principle of every distributed training topology — Megatron, vLLM, SGLang, DeepSpeed, all the same.
Six engineering invariants Megatron enforces
- Logprob recompute — never use inference-engine logprobs for the loss; recompute them in the trainer's precision and kernels.
- Mismatch metrics — track |ρ - 1| moments per batch; if they grow, the training is no longer empirically on-policy.
- Sequence-packing boundary mask —
cu_seqlensis mandatory whenever variable-length rollouts share a microbatch. - Weight-swap verification —
verify_model_weights_swap()must pass before each new rollout. - EP routing agreement — trainer's expert assignment must match the engine's, either by shared routing or by R3 replay.
- Communication-overlap flags —
--overlap-grad-reduce,--overlap-param-gather,--tp-comm-overlapare not optional at scale; they hide the all-reduce behind compute and are why 47% MFU is achievable.
What's mathematically interesting
The communication overlapping flags are an exercise in hiding latency behind compute — a kind of staggered evaluation. Pipeline parallelism's "1F1B" schedule (one forward, one backward, interleaved) recovers most of the bubble in a fully-synchronous pipeline; the math of why this works is straightforward but writing it down clarifies why Megatron's MFU stays high even with PP.
The dynamic context parallelism shipped in January 2026 (1.48× speedup for variable-length sequences) is a nice example of treating CP size as a runtime variable rather than a static config. The "right" CP size depends on the sequence length of the current batch — adapting it dynamically is essentially online load balancing.
The GRPO loss as written in rl_utils.py is the cleanest place in any open codebase to see all three "correction" terms (truncated importance sampling, KL, entropy) composed together with diagnostic metrics. Reading it answers a question that papers leave ambiguous: which corrections are mandatory and which are heuristics. The Megatron version's answer: the truncation clip is mandatory at scale (without it, off-policy ρ explodes); the KL is mandatory for reference anchoring (without it, the policy walks away from coherence); the entropy bonus is the tunable knob.
Recommended reading order
If you want to actually read the Megatron RL code rather than treat it as an opaque substrate, the cleanest path:
megatron/rl/train_rl.py— top-level entrypoint. See how it differs frompretrain.py: same outer loop, different forward-step function.megatron/rl/rl_utils.py— the GRPO loss with all corrections inline. The single best file in any codebase for the "what does GRPO actually compute" question.megatron/rl/sequence_packing_utils.py— the packing transformation. Short file, easy to misread; the point is the boundary mask, not the packing itself.megatron/rl/server/— theMegatronLocalbridge. Read it after you've read slime's or Miles's framework code so you know what's calling it.megatron/post_training/— the quantization paths covered in the next section. Read it after you've seen the FP8 / INT4 discussion below.
- NVIDIA/Megatron-LM — repo root
- Megatron Core docs — parallelism guide, mixed precision, quickstart
megatron/rl/— Megatron's own RL module (the native RL stack)rl_utils.py— the GRPO loss with all correctionssequence_packing_utils.py— the packing logicmegatron/post_training/— quantization and distillation paths- Megatron-LM paper (arxiv 1909.08053) — the original tensor-parallel architecture
- Megatron-Bridge — bidirectional HuggingFace ↔ Megatron checkpoint conversion
Quantization and the numerical-alignment problem
If you're reading the Miles paper or the DeepSeek-V3 technical report and wondering why so much engineering effort goes into making FP8 training work correctly, this section is for you. Quantization is where the abstract algorithm meets the concrete number system, and getting it wrong silently corrupts the policy.
The numerical formats
Three families are worth knowing:
- FP8 (E4M3 and E5M2) — IEEE-754-style 8-bit floats with two variants: E4M3 has 4 exponent bits and 3 mantissa bits (used for forward activations, weights); E5M2 has 5/2 (used for gradients, with their wider dynamic range). H100-and-newer hardware has native FP8 tensor cores. The two variants exist because forward and backward have different statistical profiles.
- INT4 / INT8 — fixed-point integers. INT4 weight-only quantization (W4A16) is the workhorse for inference of large models (Llama-70B fits on a single H100 at INT4). Quantization-aware training in INT4 is hard; post-training quantization with GPTQ / AWQ is more common.
- MXFP4 / MXFP8 / NVFP4 — block-scaled formats. Each block of 16 or 32 values shares a single scale; the per-value representation is small (4 or 8 bits) but the dynamic range is large. NVFP4 is NVIDIA Blackwell-specific (the B200 and beyond); MXFP4 is the OCP standard.
QAT vs PTQ
Two strategies for converting a model to a lower-precision format:
- Quantization-Aware Training (QAT) — simulate the quantization during training, so the optimizer learns weights that are robust to the lower precision. Expensive but produces the highest-quality models. Miles' INT4-QAT pipeline is the most aggressive example in this survey.
- Post-Training Quantization (PTQ) — train in full precision, quantize at the end. Algorithms include GPTQ (one-shot, second-order weight rounding), AWQ (activation-aware weight quantization), GGUF (llama.cpp's format), and ModelOpt (NVIDIA's toolkit). Cheap but lossy.
For RL specifically, the question is whether the inference engine can be in a lower precision than the trainer without breaking learning. The answer turns out to be "yes, but only with care."
The MoE routing divergence problem
This is the deep reason Miles exists. In a Mixture-of-Experts model, each token is routed to k experts based on the output of a small gating network. The gating decision is a top-k over k experts' affinity scores. Under floating-point arithmetic, the affinity scores are computed in a specific precision. If the inference engine computes them in FP8 and the trainer in BF16, two scores that are equal in BF16 can be unequal in FP8 (or vice versa), and the top-k decision can flip. The token routes to a different expert at inference than it did at training time. The gradient signal becomes random noise with respect to which expert actually generated the token.
This is the "BERT-era unsolved bug" that Chenyang's tutorial flags — the existence of a numerical-precision-induced routing divergence has been known for years. The brute-force fix is to make inference and training share the same precision and the same kernels for the routing computation. That's what Miles' Unified FP8 Pipeline does. The cleverer fix is to replay the routing decision: record which experts inference picked, force training to use the same picks. This is R3 — Rollout Routing Replay, Miles' signature contribution. It guarantees the routing is identical regardless of precision.
The mathematical invariant: routing(x, θ, fp8) ≡ routing(x, θ, bf16) by replay, not by approximation. Without it, every gradient update in an MoE model is partially noise.
For non-MoE models the problem is milder but still present: numerical drift in logprobs accumulates batch-to-batch. The safe practice (every framework in this survey follows it) is to never use the inference engine's logprobs for loss computation. Always recompute logprobs with the training engine, even at the cost of an extra forward pass. The inference engine generates the tokens; the trainer recomputes their probabilities.
- GPTQ paper · AWQ paper — the standard PTQ algorithms
- NVIDIA ModelOpt — production toolkit for FP8/INT4/MXFP4 conversion
- OCP MX format spec — the standard behind MXFP4 / MXFP8
- Megatron's
post_training/— checkpointing and conversion paths - Miles docs — R3 + unified FP8 pipeline writeups (the canonical RL-quantization references)
Multi-turn agentic RL — unifying VLM and LLM from first principles
Up to this point the survey has implicitly assumed a single-turn setting: one prompt, one completion, one reward. The 2026 frontier is multi-turn. A model is no longer a chatbot but a thinking machine embedded in an environment loop — it emits an action, the environment responds with an observation (possibly multimodal), the model reads the observation and emits the next action, and the trajectory grows. Computer Use agents, embodied robotics, and tool-augmented reasoning all live in this regime.
The mathematician's instinct here is right: a multi-turn setting is just a Markov decision process with episodes. The engineering question is then narrow — how do you implement the trajectory generation cleanly enough that VLM and LLM share one code path? The slime + Miles answer is what I'd call the first-principles answer: any multi-turn training is just custom sampling and interaction logic. Decouple the rollout function from the environment; let the user supply both.
The turn loop
Each turn of the loop has four distinct phases: (a) the actor generates a response under the current context and sampling parameters; (b) the environment steps on the response and returns an observation; (c) the observation is encoded into a fresh delta of tokens and appended to the context with loss_mask = 0 — this is what tells the trainer "don't compute loss against the environment's words"; (d) any new multimodal payload is appended to two parallel buffers, one for inference, one for training. Termination is whichever fires first: max_turns, a token budget, or env.step() returning done=True.
# Pseudocode: custom multi-turn rollout.generate
async def generate(args, sample, sampling_params):
env = load_env_module(args.rollout_interaction_env_path).build_env(sample=sample, args=args)
max_turns = args.max_turns
sample.tokens, image_data, mm_train_buffer = init_from_prompt(sample, state)
for _ in range(max_turns):
# (a) Actor generation — assistant tokens
response_text, new_tokens, new_logprobs, finish_reason = sglang_generate(
url=url, input_ids=sample.tokens,
sampling_params=sampling_params, image_data=image_data
)
append(sample, new_tokens, new_logprobs, loss_mask_val=1)
# (b) Env step
observation, done, _ = env.step(response_text)
if done: break
# (c) Process & append observation tokens
user_msg = env.format_observation(observation)
obs_ids, obs_image_data, obs_mm_inputs, obs_mm_train = encode_observation_delta(
user_msg, tokenizer=state.tokenizer, processor=state.processor,
tools=sample.metadata.get("tools")
)
append(sample, obs_ids, [0.0] * len(obs_ids), loss_mask_val=0)
# (d) Multimodal state update — TWO parallel buffers
image_data += obs_image_data # inference-side
if obs_mm_train:
mm_train_buffer.append(obs_mm_train) # training-side
return sample
The BaseInteractionEnv interface is intentionally minimal: reset(), step(response_text) → (observation, done, info), and format_observation(observation) → message. No assumptions about action grammars, no coupling to dataset format. "How the environment parses an action, executes a tool, returns an observation" is entirely the user's call. This is the decoupling the field needed to support Computer Use, embodied robotics, and tool-augmented reasoning under one framework.
Two engineering tricks worth knowing
Two implementation details from the slime team's writeup deserve highlighting because they capture exactly the kind of "look beneath the API" reasoning the engineering rewards.
Dummy messages + delta tokens — bounded context growth
The naive way to encode an observation back into the context is tokenizer.apply_chat_template([obs_message], tools=...). The problem: chat templates auto-prepend a system prompt and tool-use instructions every time. If you do this each turn, the system prompt is duplicated into the context T times across T turns — quadratic context growth, partial waste of the token budget, and (even though these tokens are loss-masked) their presence shifts the actor's behavior distribution.
The trick: encode twice, take only the difference. Encode a fixed DUMMY_MESSAGES base alone to get the preamble token count; encode DUMMY_MESSAGES + [obs_message] together; slice off the preamble length. What you append is the clean observation delta only — system prompt and tool preamble appear exactly once across the whole trajectory.
dummy = apply_chat_template(DUMMY_MESSAGES, tools=tools, add_generation_prompt=False)
full = apply_chat_template(DUMMY_MESSAGES + [obs_msg], tools=tools, add_generation_prompt=True)
trim = len(encode(dummy))
obs_ids = encode(full)[trim:] # delta tokens only
Mathematically this is just set difference on token sequences; what's interesting is that the chat template API doesn't expose a primitive for "encode just this message under this preamble," so the user implements the difference operation manually. The trick is widely needed but rarely surfaced.
Multimodal tensor merge — O(n²) → O(n)
Each turn that adds an observation also produces a dict of tensors for the training side (vision features, audio features, whatever the VLM processor emits). The trainer wants one consolidated tensor per key over the whole trajectory. Naïvely concatenating each turn with torch.cat is O(n²): each call allocates a new output buffer and copies all existing data plus the new turn's increment.
The clean answer: buffer-then-merge. Append each turn's tensor dict to a Python list (O(1) per turn); at trajectory finalization, traverse the list once and call torch.cat exactly once per key. Total work drops from O(n²) to O(n), and you avoid the peak-memory transient where both the old and new concatenated tensors are simultaneously resident.
This too is a kind of math-as-engineering: the same input-output behavior, two different complexity profiles, distinguished only by where the allocation boundary sits. For a 32-turn rollout with 100K-token VLM context, the asymptotic difference is the difference between training and OOM.
Engineering case study — slime, the clean upstream framework
Before Miles became Miles, there was slime — the SGLang-native post-training framework that Miles forked. slime is the RL framework behind GLM-5.1, GLM-5, GLM-4.7, GLM-4.6, and GLM-4.5. Where Miles bets on MoE production hardening (low precision, R3 routing replay, fault tolerance, weight-version checks), slime bets on something different: clean interface boundaries. The smallest framework that gets Megatron, SGLang, and Ray to cooperate, with extension points for everything else.
Reading slime alongside Miles is the cleanest way to learn what's essential in an RL framework and what's production glue. Miles teaches you what large-scale stability costs; slime teaches you what the framework actually has to do. If you only read one case study to understand the post-training skeleton, read this one. If you only deploy one framework at trillion-parameter scale, deploy Miles.
Same five phases, different controller style
slime's train.py runs the same five-phase loop as Miles — rollout → train → save → sync → eval — but the controller idiom is different. Miles uses asyncio with await on every cross-actor call. slime uses Ray ObjectRefs with synchronous ray.get. The trade is straightforward:
| Style | Reads like | Pro | Con |
|---|---|---|---|
| Miles · asyncio | concurrent Python | natural overlap of independent tasks | steeper if you've never read async code |
| slime · ray.get | a script, one phase at a time | reads top-to-bottom, debuggable | harder to overlap critic and actor training |
Both work. The synchronous controller is easier to read end-to-end — which is part of why slime is a better starting point for a theorist. The async controller is easier to extend with overlapping computation — which is part of why Miles forked when it needed more concurrency knobs.
The one place slime does use refs to overlap work is critic-actor coupling. When a critic is present, slime calls critic.async_train() first (returning Ray refs), then passes those refs as external_data=value_refs into actor.async_train(). The actor only needs values when it computes the policy loss, so this overlap is essentially free — the critic computes while the actor sets up.
The three-module architecture, made visible
slime's README states the architecture in three boxes: training (Megatron) → data buffer → rollout (SGLang + router). The training module reads from the buffer; the rollout module writes to it; the buffer manages prompts, samples, and custom data generation.
This is the same skeleton every framework in this survey instantiates. What's different about slime is that the abstraction is visible in the directory layout: slime/ray/ for orchestration, slime/backends/megatron_utils/ for training, slime/backends/sglang_utils/ for rollout, slime/rollout/ for the rollout-function library. The user-supplied path flags (--rollout-function-path, --custom-generate-function-path, and friends) make the extension points equally visible.
OPD as a first-class feature — two teacher modes
This is slime's most distinctive contribution to the open-source ecosystem, and the reason the V4 case study (next section) can reference a clean reference implementation. slime ships On-Policy Distillation as a built-in mode, with two teacher placements:
--opd-type sglang: the teacher runs as an external SGLang server. The teacher receives the student's sampled tokens, setsmax_new_tokens=0(no generation), and returnsinput_token_logprobs— i.e. the teacher's score of the student's trajectory. Use this when the teacher has a different architecture from the student, or is too large to fit alongside the student in training memory.--opd-type megatron: the teacher is loaded directly into Megatron as a backup model tag (alongsideactor,ref,old_actor). The teacher's logprobs are computed in the training forward pass. Use this when the teacher and student have the same architecture and you can afford the memory.
The mathematically interesting choice is how OPD enters the loss. slime treats it as an additive KL penalty on the advantage, not a separate estimator:
advantage' = advantage − λ · ( log πstudent(a|s) − log πteacher(a|s) ) — slime'sapply_opd_kl_to_advantages()inbackends/megatron_utils/loss.py
Two facts fall out. First, if the task reward is zero, the second term dominates and OPD becomes pure teacher imitation via reverse KL. Second, if the task reward is non-zero, OPD acts as a regularizer that pulls the student toward the teacher's distribution while still letting RL signal drive learning. This is what slime's README means by "OPD is orthogonal to advantage estimators" — you can layer it on top of GRPO, PPO, REINFORCE++, or any other advantage estimator by modifying the advantage rather than the loss-function shape.
Once you see this, DeepSeek V4's full-vocabulary multi-teacher OPD (the next case study) reads as the same idea scaled up: ten or more teachers instead of one, full-vocabulary KL instead of token-level KL, with engineering for trillion-parameter teacher scheduling. The objective is the same family; the engineering complexity is much larger.
The custom-generate escape hatch
slime's most important design choice is what it deliberately does not include. There is no built-in multi-turn agent loop, no hardcoded tool-use environment, no special-cased web-search rollout, no VLM observation encoder baked into the core. Instead, slime exposes extension points and gets out of the way:
--rollout-function-path— replace the entire rollout function--custom-generate-function-path— replace just the per-samplegenerate()--custom-reward-post-process-path— custom reward normalization--custom-convert-samples-to-train-data-path— custom train-batch shaping--buffer-filter-path— custom dynamic-sampling filter
The default generate() in slime/rollout/sglang_rollout.py handles single-turn generation with optional multimodal inputs. To do multi-turn agents, you write a generate() that loops turns. To do tool-use, the same. To do VLM observations, the same. The contract is small and stable: take a sample, return a populated sample with tokens, logprobs, and optional routed-experts.
This is slime's design philosophy: the framework guarantees the data contract; the user owns the environment. It's the opposite of building a universal agent framework that tries to anticipate every tool integration. Instead it gives you the small piece of plumbing that lets your own environment plug in. For research that touches new domains often (search, code execution, browser automation, robots), this is the better trade.
ServerGroup — the SGLang serving abstraction
slime introduces an abstraction Miles doesn't expose as cleanly: a ServerGroup is a set of homogeneous SGLang engines, and a RolloutServer can contain multiple ServerGroups. This matters for prefill/decode disaggregation, where prefill runs on one group of engines (with different SGLang flags optimized for compute-heavy prefill) and decode runs on another (optimized for memory-bound decode). Worker types are explicit: "regular", "prefill", "decode", or "placeholder".
If you've ever wondered how a research framework handles serving topology more complex than "one SGLang engine per rollout GPU," ServerGroup is a good reference. The same abstraction supports encoder-only servers (for reward models, embeddings, or anything that doesn't need autoregressive decode) by selecting a different SGLang entry point. By default, slime turns on DeepGEMM JIT precompile, fast warmup, memory-saver CUDA graph, and metrics scraping — all sensible defaults for training-side rollout where you want determinism and visibility.
Routing replay as explicit stages
R3 (routing replay, the invariant that lets MoE training stay numerically aligned with rollout) is in slime too, but the staging is explicit. An environment variable ROUTING_REPLAY_STAGE takes four values:
fallthrough— normal MoE routing, ignore replay recordsrecord— record routing decisions but use them tooreplay_forward— replace routing decisions with replay records in forwardreplay_backward— replace routing decisions in backward
The training loop sets replay_backward during actor.train_actor() so that gradients flow through the experts the rollout actually used. This is the same fixed-point invariant Miles enforces — routing(x, θ, rollout) ≡ routing(x, θ, training) by replay — but slime exposes the state machine as a small environment-controlled enum. Reading the source, you can pinpoint where each stage takes effect.
slime vs Miles — what each is good for
A side-by-side comparison helps locate each framework in your mental map.
| Dimension | slime | Miles |
|---|---|---|
| Positioning | Clean upstream framework | Production fork with hardening |
| Controller style | Synchronous Ray ObjectRefs | asyncio |
| Training backend | Megatron-LM | Megatron + experimental FSDP |
| Weight-sync paths | tensor (colocated) + distributed broadcast | + P2P + LoRA variants + quant variants |
| OPD | First-class with two teacher modes | Supported, foregrounded less |
| Routing replay | Env-var stage machine, 4 stages | Generalized replay manager system |
| Rollout extensibility | Strong — many --custom-*-path flags | Inherited from slime, with more knobs |
| Fault tolerance | Rollout health monitor + recovery | Heavier recovery, version checks, retries |
| SGLang integration | PD disagg, encoder-only, metrics | + Miles router, low-precision integration |
| Best for | Reading and learning, GLM-scale RL | DeepSeek-V3-scale MoE production runs |
One sentence summary: slime is the cleanest open-source framework to read end-to-end; Miles is the heaviest production-ready system to deploy at trillion-parameter scale. Read slime first if you want to understand how an RL framework works. Read Miles after if you want to understand what it costs to make one survive.
What slime actually solves — six interface contracts
The honest summary of slime's contribution is not an algorithm and not a system optimization. It's six interface contracts, each defined narrowly enough to be reusable:
- Sample contract — what fields a rollout sample must carry: tokens, response_lengths, rewards, loss_masks, rollout_log_probs, plus optional rollout_routed_experts (R3), teacher_log_probs (OPD), multimodal_train_inputs (VLM).
- Train-data conversion contract — how samples become a training batch, with per-DP-rank split that can token-balance (not sample-balance) when response lengths have long tails.
- Recompute contract — what the trainer recomputes (current logprobs, ref logprobs, optional teacher logprobs) versus what it trusts from rollout (rollout logprobs, routed experts).
- Weight-sync contract — pause generation, flush cache, sync, resume; weight version is monotonic; cache invalidation always follows sync.
- Custom-generate contract — user-supplied
generate(args, sample, sampling_params)can replace the default and plug straight into the rollout function. - OPD contract — teacher logprobs are a first-class sample field; the OPD KL penalty modifies advantage rather than replacing the loss-function shape.
These six contracts are why slime is portable. Adding a new estimator, a new environment, a new teacher mode, a new multimodal input type — none of these requires changing the framework core. Each requires implementing a function that satisfies one of these six contracts. The framework is the contracts; everything else is the user's.
Recommended reading order for slime's source
Following the same outside-in pattern as the Miles reading order. Each file builds on the last; jumping straight to the middle usually costs a day.
README.md— confirm the three-module architecture and skimexamples/to see the supported recipes.train.py— the synchronous Ray controller version of the five-phase loop.slime/ray/placement_group.py— Ray bundle layout, colocate vs split, critic-reuses-actor-pool.slime/ray/rollout.py—ServerGroup,RolloutServer,RolloutManager, train-data conversion, group reward normalization.slime/rollout/sglang_rollout.py— default per-sample generate, dynamic filtering, partial rollout, multimodal prompt handling, custom-function dispatch.slime/backends/sglang_utils/sglang_engine.py— SGLang HTTP wrapper, PD disaggregation, memory and weight endpoints.slime/backends/megatron_utils/actor.py— actor / ref / teacher / old_actor backup tags,train_actor(),fill_routing_replay().slime/backends/megatron_utils/update_weight/update_weight_from_tensor.py— colocated tensor path with FlattenedTensorBucket.slime/backends/megatron_utils/update_weight/update_weight_from_distributed.py— distributed NCCL path with separate dense/expert (TP + EP) gathers.slime/backends/megatron_utils/loss.py— response-aligned logprobs, advantage estimators,apply_opd_kl_to_advantages(), TIS.examples/on_policy_distillation/README.md+slime/rollout/on_policy_distillation.py— the two-teacher-mode OPD reference, with the four key flags.
By step 11 you have a working mental model of the cleanest open-source RL post-training framework. From there, reading Miles is incremental: you're learning what was hardened, not what is. And reading DeepSeek V4's report (next case study) becomes a comparison rather than a deep dive — you already know what OPD does, so V4 reads as "what changes when you do this with ten teachers at trillion-parameter scale."
- THUDM/slime — the repo
train.py— the synchronous five-phase loopslime/ray/rollout.py—ServerGroup/RolloutManager/ train-data conversionexamples/on_policy_distillation/— Qwen3-8B student aligning to Qwen3-32B teacher, runnable- OPD README — flags
--use-opd,--opd-type,--opd-kl-coef,--opd-teacher-load backends/megatron_utils/loss.py— whereapply_opd_kl_to_advantages()lives
Engineering case study — Miles' DeepSeek-V3 RL pipeline
If slime is the clean upstream framework, Miles is the production fork that hardened it for DeepSeek-V3-scale MoE training. Miles is built on slime, powered by Megatron-LM + SGLang, orchestrated by Ray. The repo's scripts/run_deepseek.py is a single Typer command that takes a DeepSeek-V3 model from HuggingFace and runs full GRPO training on AIME-2024 or GSM8K. Reading Miles is reading what slime needed to look like when it had to survive a week-long trillion-parameter run.
The five-stage pipeline
What makes run_deepseek.py instructive is that each stage tests for already done before re-running. The pipeline is resumable on every reboot — a property the field undervalues until it's debugging at 2am with a corrupted checkpoint.
- ① Download.
hf download deepseek-ai/DeepSeek-V3for the model,hf_download_datasetfor the training set (DAPO-math-17k + AIME-2024, or GSM8K). Skip if already present. - ② FP8 → BF16 cast. DeepSeek-V3 ships in FP8 on HuggingFace; training needs BF16 master weights (a QAT/master-weight prerequisite from the quantization section).
tools/fp8_cast_bf16.pyhandles it. Skip ifmodel.safetensors.index.jsonexists. - ③ HF → Megatron distributed format. Run
torchrun convert_hf_to_torch_dist.pywith the right PP/EP/TP sizes for the model. On multi-node this becomesexec_command_all_ray_node(...)— Ray fans the conversion command out across all nodes with{{master_addr}}/{{node_rank}}substitution. Skip iflatest_checkpointed_iteration.txtreads"release". - ④ Rsync to local node storage. Cross-node shared FS is too slow for the hot path. Every Ray node rsyncs the converted checkpoint to its local NVMe in parallel.
- ⑤ Ray job submit. Builds the giant
train_argsstring (rollout, optimizer, GRPO, wandb, perf, eval, SGLang, misc args), kills any stale processes, startsray start --head, thenray job submit -- python3 train.pywith the full runtime environment JSON.
The training main loop
Once Ray has the job, train.py is a tight async loop. Read it as the runtime of all the primitives this survey covered:
async def train(args):
pgs = create_placement_groups(args) # TP/PP/EP-aware GPU groups
init_tracking(args)
rollout_manager = create_rollout_manager(args, pgs["rollout"]) # SGLang
actor_model, critic_model = await create_training_models(args, pgs, ...) # Megatron
await actor_model.update_weights() # initial sync → SGLang
for rollout_id in range(args.start_rollout_id, args.num_rollout):
rollout_data_ref = await rollout_manager.generate.remote(rollout_id) # Phase A-B
await actor_model.train(rollout_id, rollout_data_ref) # Phase D
if rollout_id % args.save_interval == 0:
await actor_model.save_model(...)
await actor_model.update_weights() # Phase E
if rollout_id % args.eval_interval == 0:
await rollout_manager.eval.remote(rollout_id)
This is the call graph from Section "Reading real code," made concrete. Two Ray actor groups (training, rollout) bound to placement groups; an outer async loop alternating generate, train, update_weights. Every primitive — the hybrid engine, the memory choreography, the zero-copy weight sync, the four update_weights_from_* paths, the RadixAttention prefix cache — is invoked along this loop without the user-facing code ever spelling them out explicitly.
What's mathematically interesting is the layering. The top-level loop is sequential and easy to reason about as a fixed-point iteration on the policy parameters. The middle layer (Ray, placement groups) is concurrent but explicitly scoped. The bottom layer (CUDA kernels, NCCL collectives, ZMQ messages) is asynchronous but bounded by clean contracts. Each layer adds a strictly smaller amount of nondeterminism than the one beneath it — a kind of algebraic abstraction ladder that lets a theorist reason about convergence without unfolding the systems mess.
Miles is not a trainer — it is a runtime phase machine
If you read the source carefully, the most useful reframe is this: the top-level loop is not algorithm-driven, it is phase-driven. There is no PPO/GRPO logic in train.py at all. There is only a sequence of phases — rollout, training, sync, eval — and the code's job is to hand the GPU between them cleanly. The actual math (advantages, KL, importance weights) lives inside the actor, far below the top-level loop. From a theorist's angle this matters because it tells you where to look for bugs: a wrong policy gradient is a bug in the actor; a wrong rollout-train mismatch is a bug in the phase machine.
The phases are precise:
- Rollout phase: SGLang owns weights, KV cache, CUDA graphs. The trainer's optimizer is offloaded to CPU.
- Training phase: The trainer owns weights, gradients, activations, optimizer state. SGLang's KV cache and CUDA graphs are released.
- Sync phase: The trainer's new weights are pushed into SGLang. Generation is paused; the cache is flushed; the weight version is bumped.
- Eval / save phase: Periodic side effects that don't affect the policy gradient loop.
Two object models run in parallel. The mathematical model is θ_t → sample τ ~ π_θ → compute reward and advantage → θ_{t+1} → install into rollout. The engineering model is GPU ownership, KV cache ownership, weight version, Ray actor liveness, NCCL group state, offload/onload state. Miles' value is keeping these two models aligned — when the math says "the policy just updated," the engineering says "the inference engine now serves the new weights, and any prior in-flight requests have been retracted." If those two statements ever drift apart, the gradient becomes noise and the loss curve still looks fine.
Six engineering invariants Miles maintains across the loop
These are the invariants you can verify in the source code. They are also the right checklist if you are reading any other framework in this survey and want to know what to look for. Each one is one line of plain English, then one line of why it matters.
- Logical rank order maps stably to physical GPU order. Ray bundles, Megatron ranks, and SGLang engine ranks must agree on which physical GPU is "rank 5." If they disagree, you get silent NCCL hangs or weights written into the wrong worker — symptoms that are hours to diagnose. Miles handles this in
placement_group.pyby reading back actual node IP and GPU IDs from each Ray bundle and sorting deterministically. - GPU ownership switches explicitly, never by accident. Colocated mode never lets training and rollout both touch GPU memory at the same time. The transitions go through
rollout_manager.offload_*,actor_model.onload, and SGLang'srelease_memory_occupation/resume_memory_occupation. The invariant — at any moment, exactly one engine has the live GPU memory — is what makes colocation safe. - Weight version is monotonically increasing. Every sync bumps
weight_versionand propagates it to every SGLang engine. The trainer can later assert that the engine's reported version equals what it just sent. If they ever differ, a weight update silently failed somewhere and you want to know now, not after 10000 training steps. - Weight updates pause generation and flush the cache first. Before a sync, Miles calls
pause_generationon every engine and thenflush_cache. The point is to make sure no in-flight request continues decoding with old prefix-cache KVs that belong to the old policy. This is the rule SGLang's own scheduler also enforces; Miles just makes it visible at the top level. - Loss-relevant logprobs are recomputed on the trainer, not trusted from inference. Miles keeps the rollout logprobs (it needs them for TIS and mismatch checks) but the actual policy-gradient term uses logprobs computed by the training engine on the same tokens. This is the BERT-era numerical-drift discipline made concrete: the inference engine generates; the trainer scores.
- MoE routing is replayed, not re-derived. When R3 is on, SGLang returns the expert choices it made during generation; the trainer reuses those choices during the forward pass, regardless of what FP8 numerics would have decided locally. The mathematical statement is that the gradient is taken with respect to the routed graph the rollout used — which is the only way the gradient is a meaningful signal about that rollout's actions.
Two more invariants Miles enforces when partial rollout is on: tokens generated under an older weight version are loss-masked to zero by default (so only fresh tokens contribute to the gradient), and off-policy ratio is monitored as a first-class metric via TIS, ESS, and a "rollout-train mismatch" probe (so staleness is something you see, not something you discover via mysterious eval degradation three days in).
How a sample flows from rollout to gradient
The full data path is worth tracing once, because it shows where the abstractions are doing real work and where they're just plumbing. A prompt enters the system, gets sampled N=8 times by SGLang (this is GRPO's group), each completion is scored, the survivors become a training batch. Concretely:
- The
RolloutManagertakes a prompt and submits N parallel generation requests to SGLang. Each request returnstokens, logprobs, finish_reason— and if R3 is on, alsorouted_experts(a tensor of shape [response_len, num_layers, top_k] recording which experts processed each token at each MoE layer). - The reward function scores each completion. For GRPO, the rewards within the N-completion group are then group-normalized: subtract the group mean (and optionally divide by group std). The result is each sample's advantage relative to its sibling samples from the same prompt.
- Optionally, a dynamic sampling filter drops groups whose rewards are all equal (DAPO-style "if every sample got reward 1.0 or every sample got 0.0, the gradient is zero — don't bother training on it"). Miles uses
check_reward_nonzero_stdas the default filter. - Surviving samples are packed into a training batch. Every sample carries a loss mask — assistant tokens are 1, observation/system tokens are 0. If partial rollout is on, old-version tokens are also 0. The loss mask is a first-class field; the trainer does not "infer" which tokens to score.
- The batch is split across data-parallel ranks. If
balance_datais on, it splits to equalize tokens per rank, not samples per rank — a small detail that matters a lot when response lengths have long tails. - Each trainer rank recomputes logprobs on its slice and computes the policy-gradient loss. The recomputed logprobs are the ones that flow into the loss; the rollout logprobs only show up in TIS and in mismatch monitoring.
The reason to walk through this is that every field in the training batch has a job. tokens and response_lengths are the raw text. rewards and advantages drive the loss. loss_masks control which positions count. rollout_log_probs enable TIS. rollout_routed_experts enable R3. weight_versions let the trainer detect stale samples. teacher_log_probs are reserved for OPD. The batch is a small algebra of fields, each one corresponding to a distinct correctness concern.
Two paths for weight sync — and why the choice matters
Miles picks between two weight-sync paths at config time, and the choice has both performance and reliability implications.
Colocated path (UpdateWeightFromTensor): when the trainer and the SGLang engine share the same physical GPUs, the trainer assembles the new weights into one flattened bucket per layer, serializes the tensor descriptors (pointer + stride + offset + CUDA IPC handle, ~1KB total) using MultiprocessingSerializer, gathers them to the lead rank via Gloo, and hands them to SGLang via Ray IPC. SGLang reconstructs Python tensor objects that point to the same physical GPU memory the trainer just wrote. Zero actual tensor data crosses any wire. The reason this works is that two processes on the same host can share CUDA memory through IPC handles — the cost is sub-millisecond per update, regardless of model size.
Distributed path (UpdateWeightFromDistributed): when the trainer and rollout live on different physical GPUs (or across nodes), Miles creates a NCCL group whose participants are {trainer rank 0} ∪ {all rollout engine ranks}. The trainer sends metadata (parameter names, shapes, dtypes) over Ray, and the actual tensor data over NCCL broadcast from rank 0. This is the classical "split the control plane and the data plane" pattern: small metadata flows through a flexible RPC layer; large tensors flow through dedicated high-bandwidth collectives. Miles also serializes the broadcasts behind a Ray lock — concurrent broadcasts can deadlock NCCL, and the lock is cheap insurance.
Both paths share a third invariant: flush the cache after the update. Old prefix-cache entries reference the old policy; serving them under the new policy is a silent correctness bug. SGLang's mixin handles this with flush_cache_after_weight_update at the end of every transport.
Staleness corrections — what Miles measures during training
Even on a colocated setup, the rollout policy and the trainer policy are not the same. The trainer is one step ahead — it generated the sample under θt but is computing the loss against θt+1 after the parameter update. For PPO this is fine; for fully-async RL it can drift. Miles tracks staleness with three explicit quantities:
- TIS (truncated importance sampling). Compute
ratio = exp(log_prob_train - log_prob_rollout)per token; clip it into[lo, hi]; multiply into the policy-gradient term. This re-weights stale samples so the gradient is unbiased again, modulo the variance the truncation adds. - ESS (effective sample size).
ESS = (Σ w)² / Σ w²over the importance weights. A small ESS means most of your samples are effectively being ignored — useful as a single number to monitor over training. - Rollout-train mismatch metric. The mean and max of
|log_prob_train - log_prob_rollout|per token. If this drifts upward, the rollout has gotten too stale; you want to know before the loss diverges.
The principle is the same one that runs through this whole survey: the system tells you it is lying about being on-policy, and quantifies how much. That is the right design for a system whose inputs are partially stale by construction.
What Miles actually solves — seven brittle problems made systematic
The most useful summary of Miles is not "it implements GRPO." It is that Miles takes seven specific failure modes that wreck large-scale MoE RL training and turns each one into an engineering invariant with a corresponding code path:
- Train and rollout sharing GPUs without OOM → colocate + per-pool offload + SGLang memory-saver.
- Pushing new weights into rollout fast → colocated tensor bucket + distributed NCCL broadcast + P2P paths.
- GRPO rollouts not duplicating work → group sampling + dynamic filter + SGLang prefix cache.
- Long-tail rollouts not stalling training → partial rollout + buffer recycling + loss masking of old tokens.
- Async / off-policy not silently corrupting the gradient → rollout logprobs + TIS + ESS + mismatch metric.
- MoE under low precision not destabilizing → R3 routed-experts replay + unified precision pipeline.
- Multi-day jobs being recoverable → fault tolerance + weight-version checks + restartable rollout engines.
Each one is small in code volume but large in production consequence. The seven together are why Miles is a useful case study and not just another RL framework: it is one of the cleanest places in the open-source RL ecosystem to see what "production-grade" actually means, at the line-of-code level.
Recommended reading order for Miles' source
If you want to walk the source yourself, this order goes from outside to inside, from runtime phases to algorithm internals. Each file builds on the last; jumping straight to the middle usually wastes a day.
train.py— establish the five phases (rollout, train, sync, save, eval). Skim, don't memorize.miles/ray/placement_group.py— see how colocate / split-colocate is actually expressed in Ray bundles.miles/ray/rollout.py— theRolloutManager, including reward normalization, DP split, and what fields end up in training data.miles/rollout/sglang_rollout.py— the default per-sample generation function; this is where dynamic sampling and partial rollout actually live.miles/backends/sglang_utils/sglang_engine.py— the HTTP wrapper around SGLang and the twelve control endpoints (memory, weights, generation pause).miles/backends/megatron_utils/actor.py— Megatron actor init, the train_actor flow (recompute logprob, R3 replay, advantage, train, backup).miles/backends/megatron_utils/update_weight/update_weight_from_tensor.py— the colocated path's flattened-bucket trick.miles/backends/megatron_utils/update_weight/update_weight_from_distributed/broadcast.py— the distributed NCCL path with the deadlock-prevention lock.miles/backends/training_utils/loss.py— the algorithm seam: response-aligned logprobs, all advantage estimators, TIS, ESS.scripts/run_deepseek.py— finally the production recipe, where all of the above gets wired up for an actual DeepSeek-V3 run.
By the time you reach run_deepseek.py, you are reading it not as a script but as a witness: every flag in the giant train_args string lights up a specific invariant or path you have already seen. That recognition is the whole point of the exercise.
- radixark/miles — the repo
scripts/run_deepseek.py— the 5-stage entry pointtrain.py— the async main looptools/— fp8_cast_bf16, convert_hf_to_torch_dist- slime — the lightweight RL framework Miles is built on top of
- SGLang RL team: VLM multi-turn writeup — the canonical reference for the rollout design
Engineering case study — verl, the HybridFlow programming model
The previous two case studies — slime and Miles — both treat the framework as "the Megatron + SGLang + Ray stack made cooperate." verl is a different bet. It is the open-source implementation of the HybridFlow paper, originally from ByteDance Seed, now community-maintained. Its design question is one level higher than slime's or Miles's: given that RL post-training is a particular kind of dataflow program, what's the cleanest way to express it so that algorithms and backends can vary independently? The answer verl proposes — a single-process controller composed with multi-process worker groups — is the most influential framework abstraction in the field today.
Reading verl alongside slime and Miles is the cleanest way to see what is a framework, and what is a stack. slime and Miles teach you the Megatron + SGLang + Ray loop. verl teaches you why you'd want to be able to write that loop without committing to those specific backends.
Control flow vs computation flow — the central bet
The HybridFlow paper distinguishes two kinds of dataflow in an RL system. The control flow is the high-level algorithm: rollout, compute log probs, compute advantages, update actor, update critic, sync weights. The computation flow is the low-level work: neural-network forward and backward, optimizer step, sampling, KV cache management. Most RL frameworks (slime and Miles included) interleave both flows — the algorithm and the engine code live close together.
verl separates them. The control flow runs in one Python process, the single controller. The computation flow runs in Ray worker groups, each backed by FSDP / Megatron / vLLM / SGLang / HF as configured. Between them, a small protocol — DataProto — flows from controller to workers and back, accumulating fields each time it passes through.
The engineering payoff of this split is concrete: changing the training backend doesn't change the algorithm code, and changing the algorithm doesn't change the backend code. Swap FSDP for Megatron — same PPO loop. Swap GRPO for DAPO — same worker classes. The single piece of code that connects them, the @register decorator on each worker method, declares how DataProto should be split, executed, and collected.
The PPO/GRPO loop as it actually looks
HybridFlow's docs give the canonical pseudo-code. Here is what a verl PPO/GRPO iteration looks like at the controller level — note that the entire loop reads as if it ran in a single process:
for prompt in dataloader:
batch = DataProto.from_prompt(prompt)
gen = actor_rollout_ref_wg.generate_sequences(batch) # SGLang/vLLM/HF rollout
batch = batch.union(gen)
old_lp = actor_rollout_ref_wg.compute_log_prob(batch) # training-engine recompute
ref_lp = actor_rollout_ref_wg.compute_ref_log_prob(batch) # frozen reference policy
batch = batch.union(old_lp).union(ref_lp)
values = critic_wg.compute_values(batch) # value function
rewards = reward_wg.compute_scores(batch) # reward model or rule
batch = batch.union(values).union(rewards)
batch = compute_advantage(batch) # controller-side, no remote
batch = apply_kl_penalty(batch) # controller-side reward shaping
actor_rollout_ref_wg.update_actor(batch)
critic_wg.update_critic(batch)
checkpoint_engine.update_weights(...) # trainer → rollout sync
Three things to notice. First, every wg.method(batch) looks like a plain Python call. Under the hood, the @register decorator splits batch across the worker group's data-parallel ranks, makes a Ray remote call to each worker, collects the results, and concatenates them back into one DataProto — but the controller never sees any of that. Second, compute_advantage and apply_kl_penalty run on the controller itself, with no remote call: advantage estimation is cheap, and putting it on the controller keeps it close to the algorithm's mathematical statement. Third, the loop reads like the algorithm in a paper. You can show this to an RL theorist and they will recognize PPO immediately, even though it's running across hundreds of GPUs.
DataProto — the data envelope
Every batch that flows between the controller and the worker groups is a DataProto. It has three fields:
batch— a TensorDict of batch-aligned tensors:input_ids,attention_mask,response_mask,old_log_probs,ref_log_probs,values,rewards,advantages, etc.non_tensor_batch— a dict of NumPy arrays for object-like fields:uid(used by GRPO for group-baseline calculation),data_source,reward_model,extra_info,request_id.meta_info— a Python dict for batch-global state: temperature, group size, KL coefficient, etc.
The key operation is .union(). It merges another DataProto's fields into the current one, asserting that any overlapping keys agree. This is what makes the controller code read as a sequence of attachments to the same batch:
batch = batch.union(gen) # adds: tokens, response_lengths, rollout_log_probs
batch = batch.union(old_lp) # adds: old_log_probs
batch = batch.union(ref_lp) # adds: ref_log_probs
batch = batch.union(values) # adds: values
batch = batch.union(rewards) # adds: token_level_scores
# ... compute_advantage on controller ...
batch.batch["advantages"] = ...
batch.batch["returns"] = ...
Mathematically, this is exactly the picture of attaching random variables to the same probability space, one at a time, with each new variable computed from the previous ones. The batch grows; the sample identity stays fixed. For a theorist this is the friendliest possible data structure — the engineering object matches the algebraic object.
WorkerGroup dispatch — one verb, many workers
The piece that makes the controller pseudo-code work is the @register decorator on each remote worker method. It declares a dispatch mode: how DataProto should be split before remote execution, and how results should be collected after. The common modes:
| Dispatch mode | Split | Collect | Used for |
|---|---|---|---|
DP_COMPUTE_PROTO | chunk batch along DP world size | concat back | per-sample work — log_prob, generate, train |
ONE_TO_ALL | send full batch to every worker | take rank-0 result | per-worker config / setup |
ALL_TO_ALL | broadcast as-is | broadcast-style result | weight-sync collectives |
This is the verl version of the functor pattern from the design-patterns section — one operation lifted to a category of worker topologies, with a contract about how data enters and exits. The controller writes actor_rollout_ref_wg.compute_log_prob(batch); the framework knows what "compute_log_prob across 64 ranks" actually means.
RayPPOTrainer — where the algorithm lives
The trainer class in verl/trainer/ppo/ray_trainer.py holds the loop above. Two things in its constructor matter:
First, self.hybrid_engine = config.actor_rollout_ref.hybrid_engine and immediately assert self.hybrid_engine, "Currently, only support hybrid engine". The framework is committed to the colocated-rollout pattern — actor training and actor rollout share GPUs within each worker, and weight sync between them is internal to the worker. This is the same commitment Miles and slime made, just expressed as an enforced invariant rather than a config branch.
Second, the compute_advantage() method is controller-side and supports multiple estimators. The GRPO path is especially interesting because of how it identifies groups:
advantages, returns = core_algos.compute_grpo_outcome_advantage(
token_level_rewards = data.batch["token_level_rewards"],
response_mask = grpo_calculation_mask,
index = data.non_tensor_batch["uid"], # ← group identity
norm_adv_by_std_in_grpo = norm_adv_by_std_in_grpo,
)
The group identity for GRPO baseline normalization comes from non_tensor_batch["uid"], an explicit field. If a prompt produces N completions, all N samples must share a uid. This is the data invariant that makes group normalization mathematically correct, and it's a first-class field in the DataProto schema rather than something the framework infers from sample ordering. Reading this is a small "click" moment — you see exactly how GRPO's mathematics translates to a data contract.
The KL penalty in apply_kl_penalty() is similar: it computes kl = old_log_probs − ref_log_prob, subtracts β · kl from the token-level scores to get token_level_rewards, and an adaptive KL controller updates β based on current KL. All of this runs on the controller. The training engine just sees a final token_level_rewards tensor with the KL shaping already applied.
Two-level worker abstraction
verl's worker code lives in verl/workers/engine_workers.py. There are two key classes:
TrainingWorker is a backend-agnostic training engine wrapper. It creates the actual engine via EngineRegistry.new(...), configured by model_type, backend, model_config, optimizer_config, and checkpoint_config. Same worker API, different backend underneath. The remote methods are minimal: train_mini_batch, train_batch, infer_batch, save_checkpoint, load_checkpoint, to(device), set_loss_fn, reset.
ActorRolloutRefWorker is the role-fused worker. Its role can be actor, rollout, ref, actor_rollout, or actor_rollout_ref. The default actor_rollout_ref mode means a single worker class holds: a reference model (frozen), an actor model (trained), and a rollout engine (SGLang or vLLM) — three roles per worker process. This is the concrete instantiation of the hybrid engine: training and rollout share GPUs by virtue of being instances of the same class. Its public methods include compute_ref_log_prob, compute_log_prob, update_actor, update_weights, save_checkpoint, load_checkpoint.
One small but instructive detail: compute_log_prob and update_actor are decorated with _with_routing_replay_flag(enabled=True), but compute_ref_log_prob uses enabled=False. R3 routing replay flows through actor compute and update; the frozen reference policy does not participate in replay. Reading this is enough to learn the invariant: the policy you're updating must replay the routing the rollout used; the reference policy is independent and replays nothing.
SGLang ServerAdapter — and the FSDP/DTensor collective invariant
verl's SGLang adapter lives in verl/workers/rollout/sglang_rollout/sglang_rollout.py. It handles rank mapping (including prefill/decode disaggregation), memory release/resume, and weight updates. Two details deserve attention.
Sleep level 1 vs 2. The release() method supports two granularities. sleep_level = 1 releases only the KV cache and keeps base weights alive (used when the next sync is a LoRA-adapter update — no need to refresh base weights). sleep_level = 2 releases both KV cache and weights (used when the next sync is a full-model update). This is finer-grained than slime's binary release/resume, and it matters in production for LoRA-RL pipelines.
The FSDP/DTensor collective invariant. The update_weights() method walks a generator of named tensors from the training engine. Crucially, the code notes that every rank must walk the generator, even though only TP-rank-0 actually sends the HTTP update to SGLang. The reason: each call to DTensor.full_tensor() inside the generator performs an all-gather across the FSDP group. If some ranks skip the walk, the collective hangs. This is the kind of invariant that the controller-level pseudo-code happily hides — the algorithm doesn't care how weights are materialized — but the worker code must enforce it strictly.
For an FP8 rollout, verl's adapter applies SGLangFP8QuantizerHelper to convert BF16 weights to FP8 in-flight before sending them to SGLang. After every successful update, flush_cache() invalidates the prefix cache (same consistency obligation as slime and Miles).
CheckpointEngine — another functor
verl exposes two paths for weight sync at a higher level than slime/Miles do. Inside ActorRolloutRefWorker.update_weights() there's a branch:
mode = "naive"— the synchronous colocated path. Pull per-tensor params from the actor engine; push them into the SGLang adapter; flush cache. This is what slime/Miles call the colocated tensor path.mode != "naive"— the checkpoint-engine path. Pull per-tensor params and hand them to a configuredCheckpointEngineinstance, which is responsible for transporting them. The engine is selected viaCheckpointEngineRegistryand may be backed by NCCL broadcast, async streaming, disk, or RDMA.
This is the same "one verb, many transports" pattern that the design-patterns section identified in SGLang's four update_weights_from_* methods — except here it's lifted one level higher, into the framework's own checkpoint-engine abstraction. The choice of transport becomes a config option, not a code branch in the algorithm.
Six engineering invariants verl maintains
- Control flow and computation flow are different programs. Algorithm logic runs in one Python process; computation runs in Ray worker groups. Changing one should not require changing the other. This is the framework's central design statement, expressed structurally rather than as a comment.
- Every field on a batch is aligned to the same sample dimension. DataProto's
union()asserts size compatibility; non_tensor_batch and meta_info are kept structurally separate from batch tensors. The mathematical sample identity is preserved across the entire pipeline. - GRPO group identity travels in
non_tensor_batch["uid"]. Group baseline normalization works only because the framework refuses to infer groups from sample ordering. Theuidfield is required, explicit, and immutable through the loop. - Logprobs for the loss are recomputed by the training engine. The rollout engine produces sampled tokens; the actor worker recomputes
old_log_probsthrough its training-engineinfer_batch. This is the same invariant slime and Miles enforce, expressed here as a worker-method-name convention. - Weight sync invalidates the prefix cache. SGLang ServerAdapter's
update_weights()ends withflush_cache(). The cache cannot survive across weight versions. - FSDP/DTensor walks are collective. Every rank walks the weight generator during sync, even though only one rank communicates with the inference engine. Skipping the walk on a single rank deadlocks the all-gather.
verl vs slime vs Miles — three points on the same axis
| Dimension | slime | Miles | verl |
|---|---|---|---|
| Core bet | clean upstream skeleton | production hardening | framework abstraction |
| Controller style | synchronous Ray ray.get | asyncio | single-process pseudo-code |
| Data envelope | Sample → train_data dict | same as slime + more fields | DataProto (TensorDict + non_tensor + meta) |
| Worker abstraction | RayTrainGroup over Megatron | same as slime, with extra production knobs | WorkerGroup + EngineRegistry, pluggable backends |
| Training backends | Megatron | Megatron, experimental FSDP | FSDP, FSDP2, Megatron, TorchTitan, VeOmni |
| Rollout backends | SGLang | SGLang | vLLM, SGLang, HF Transformers |
| Weight sync | tensor + distributed | + P2P + LoRA + quant variants | naive + checkpoint engine (registry) |
| GRPO group identity | positional inside batch | positional inside batch | explicit uid field |
| Algorithm reach | GRPO, PPO, OPD | + R3, FP8, INT4, more | PPO, GRPO, DAPO, PRIME, GSPO, RLOO, REINFORCE++, more recipes |
| Best read for | "how does an RL framework work?" | "what does production cost?" | "how do I design a framework that's portable across backends?" |
The summary sentence: slime teaches you the loop, Miles teaches you the production cost of the loop, verl teaches you how to abstract the loop so the backends can vary. Read them in this order if you want to understand the design space; read them in any order if you just want to ship one.
What verl actually solves
The honest summary of verl's contribution is not a new algorithm or a new system optimization. It's a programming model for RL post-training that decouples four things:
- Algorithm from training backend. Same PPO/GRPO loop runs on FSDP or Megatron. The choice is a config flag, not a fork.
- Algorithm from rollout backend. Same loop runs on SGLang, vLLM, or HF. Worker classes pick which one at construction time.
- Algorithm from placement. Resource pools and placement groups are constructed in
TaskRunner; the controller code doesn't see them. - Algorithm from transport. Weight sync goes through CheckpointEngine; the algorithm doesn't know whether it's NCCL, RDMA, IPC, or disk.
For a theorist this is the cleanest statement of "the framework is the contracts." You can write a new advantage estimator without learning Megatron. You can swap to vLLM without touching the algorithm. You can move from colocated to disaggregated rollout by changing one config and one checkpoint-engine backend. The decoupling is the contribution.
Recommended reading order for verl's source
This order goes from concept to implementation, with the HybridFlow design doc up front because verl is the kind of framework that becomes legible once you know its design philosophy.
README.md— confirm verl's identity: HybridFlow open-source, FSDP/FSDP2/Megatron + vLLM/SGLang, 3D-HybridEngine.docs/hybrid_flow.rst— the single most important file in the repo to read first. Control flow vs computation flow; single controller; WorkerGroup dispatch; the design motivation in the team's own words.verl/trainer/main_ppo.py— Ray init,TaskRunner, role-worker mapping, resource pools, dataset construction, trainer entry.verl/protocol.py—DataProto: batch / non_tensor_batch / meta_info, union/pop/select/repeat/pad, serialization.verl/trainer/ppo/ray_trainer.py—RayPPOTrainer:compute_advantage(with the GRPO uid invariant),apply_kl_penalty, worker construction.verl/workers/engine_workers.py—TrainingWorkerandActorRolloutRefWorker: register dispatch modes, three-role fusion,update_weightswith naive vs checkpoint-engine modes.verl/workers/rollout/sglang_rollout/sglang_rollout.py—ServerAdapter: rank mapping, sleep levels, FSDP/DTensor collective walk, FP8/LoRA bucket paths.examples/andverl-recipe/— once the framework makes sense, the recipes (GRPO on Qwen3, DAPO, multi-turn tool calling, VLM RL, LoRA RL) are quick to read.
By step 6 you can read any RL framework's train.py as variations on a theme. By step 7 you've seen how the framework abstraction touches the hottest piece of the system (weight sync). By step 8 you've translated all this into actually-runnable production setups.
- verl-project/verl — the repo
- HybridFlow paper — the design rationale
- verl.readthedocs.io — official docs, including the HybridFlow design page
verl/trainer/ppo/ray_trainer.py— RayPPOTrainerverl/protocol.py— DataProtoverl/workers/engine_workers.py— TrainingWorker, ActorRolloutRefWorker- verl-recipe — DAPO, PRIME, GSPO, DrGRPO, SPPO recipes
Case study — DeepSeek V4's post-training infrastructure
If Miles is the case study of "train one policy with RL," DeepSeek V4's post-training is the case study of the opposite bet: train many domain experts with RL separately, then merge them into one student via multi-teacher On-Policy Distillation (OPD). The algorithmic choice — distillation instead of RL as the final-stage merging primitive — shapes a different set of infrastructure problems, and DeepSeek's V4 report (Sections 5.1.2 and 5.2) reads as a clinic on what those problems are and how their team solves them. The system extends the same primitives Miles uses (hybrid engine, FP-aware QAT, fault-tolerant rollout) but adds two genuinely new pieces: efficient multi-teacher scheduling for OPD at trillion-parameter scale, and a production-grade sandbox platform (DSec) for the agentic-AI rollout side. Below is the reading of Section 5 a theorist should take away.
Multi-teacher OPD — the merging objective
The V4 team trains specialist models in math, coding, reasoning, world-knowledge etc. as separate post-training runs, then distills all of them into one unified student. The objective is a weighted sum of reverse KL divergences against each teacher, computed on trajectories sampled from the student:
ℒOPD(MS) = Σi=1..L wi · DKL(MS ∥ MTi) — V4 §5.1.2. L teachers, wi weights, trajectories drawn from MS to preserve the on-policy property.
Two design choices deserve a theorist's attention. Reverse KL on student trajectories is what makes "on-policy distillation" on-policy: the student samples its own actions and the teachers score them, so the gradient is taken against the distribution the student actually inhabits. The alternative — forward KL with teacher trajectories — would give a Behavior Cloning-style loss that ignores the student's own failure modes. Selective alignment per task emerges from the formulation: wi · DKL downweights teachers whose distribution is far from the student's current trajectory, so the math expert dominates math contexts and the coding expert dominates coding contexts automatically. The student converges to a policy that chooses which expert to imitate per context — without an explicit gating network.
The infrastructure twist is in how to compute that KL term. Prior practice approximates DKL with a per-token estimate: at each position, treat log(MT(a)/MS(a)) for the sampled action as the per-token advantage and reuse the RL framework's PPO/GRPO loss machinery. Cheap, but high-variance — the per-token ratio swings wildly across positions and the gradient is noisy. V4 instead computes the full-vocabulary reverse KL at every position, summing across all |V| ≈ 100k+ tokens. The gradient is lower-variance and faithful to the teacher's full distribution, but the compute and memory cost is what the rest of Section 5.2 exists to make tractable.
Efficient teacher scheduling — the hardest piece
The challenge V4 had to solve: more than ten teacher models, each potentially trillion-parameter scale, contributing to a single student training step. The naive setup — materialize all teachers' full logits over the full vocabulary at every position — is prohibitive, even spooled to disk (think hundreds of GB of logits per mini-batch). V4's framework solves it through four composed engineering moves:
- Offload all teacher weights to centralized distributed storage, load on demand with ZeRO-like parameter sharding. Teachers live in shared storage, not in GPU memory.
- Cache only the last-layer hidden states in a centralized buffer during the teacher forward pass — not the full logits. The logit dimension |V| collapses; the hidden dimension d is ~10× smaller.
- Reconstruct full logits on demand via the prediction head module at training time. Negligible recomputation, no logit-materialization memory burden.
- Order training samples by teacher index during data dispatching, so each teacher head is loaded exactly once per mini-batch and at most one head resides in device memory at any moment. Parameter loading and offloading proceeds asynchronously, off the critical path.
And — closing the loop with this survey's TileLang section — V4 reports that "the exact KL divergences between teacher and student logits are computed using a specialized TileLang kernel, which accelerates the computation and curtails dynamic memory allocation." The hidden-state-cache + on-demand prediction-head trick is what makes the algorithm fit in memory; the TileLang kernel is what makes the KL computation fast. The whole subsection is a microcosm of how the survey's separate primitives compose: distributed storage offload, parameter sharding, asynchronous I/O, a custom DSL kernel — all stacked to make one mathematical objective economical.
FP4 (MXFP4) QAT — lossless FP4→FP8 dequant
V4 applies MXFP4 quantization-aware training (FP4 weights with block-shared exponents, the OCP standard) to two components: MoE expert weights and the QK path in the indexer of Compressed Sparse Attention. The trick worth flagging is what they call lossless FP4→FP8 dequantization:
FP8 (E4M3) has 2 more exponent bits than FP4 (E2M1). As long as the ratio between max and min scale factors of the FP4 sub-blocks (1×32 tiles) within each FP8 quantization block (128×128 tiles) doesn't exceed a threshold, the fine-grained scale information is fully absorbed by the FP8 dynamic range.
The mathematical statement: under a bounded-scale-ratio condition (empirically satisfied by their weights), the composition FP32 → FP4 → FP8 preserves the FP4 scale-block information exactly when re-expressed in FP8. This means the existing FP8 training framework is reused without modification — the QAT pipeline plugs in via Straight-Through Estimator on the FP8 backward, and the entire framework's FP8 numerics stack remains intact. For deployment, native FP4 quantized weights are used during rollout instead of simulated quantization, so model behavior during sampling is bit-consistent with online inference. Where the survey's quantization section discussed FP8 and INT4 separately, V4 demonstrates a clean composition: MXFP4 weights flowing through FP8 compute paths, with neither the training framework nor the inference framework needing to know about the FP4 layer underneath.
Token-granular Write-Ahead Log — fault-tolerant rollout
This is the most mathematically interesting piece of V4's infrastructure, and it ties to the survey's design patterns directly. The problem: in a cluster-wide preemptive scheduler, any rollout request can be interrupted at any token by hardware failure or by preemption for a higher-priority task. The naive recovery — restart preempted requests from scratch — is the kind of thing that looks fine but is mathematically wrong.
Regenerating unfinished requests from scratch introduces length bias. Shorter responses are more likely to survive interruption, so regenerating from scratch makes the model more prone to producing shorter sequences whenever an interruption occurs. The bias is a survivorship artifact of the recovery policy, not the policy gradient. — DeepSeek V4 §5.2.3
For a probabilist this is the kind of subtle bias they should hear once and recognize forever. The fix is a token-granular Write-Ahead Log (WAL): every new token is immediately appended to a persistent log; preemption pauses the inference engine and persists the in-flight KV cache; resumption replays the WAL + cached KV to continue decoding. Even on fatal hardware failure, the WAL's tokens are enough to re-run prefill and reconstruct the KV cache from a clean start without restarting generation. The mathematical claim is that the distribution of output sequences is the same whether or not preemption occurred — the WAL preserves on-policy statistics under arbitrary interruption.
An equivalent solution V4 considers and rejects: a batch-invariant, deterministic inference stack with seeded PRNGs would also let interrupted runs be replayed exactly. Mathematically equivalent; engineering-wise prohibitive (full re-decoding cost instead of WAL-replay cost). The WAL is the right answer at this scale, and it generalizes to any rollout system that has to be preemption-safe.
This is a sixth design pattern, complementing the five from the next section: persistent log + replay = correctness under preemption. I'll formalize it there.
Million-token RL — metadata vs heavy data separation
V4 supports million-token context windows. The corresponding rollout infrastructure has to handle trajectories where a single sample's per-token fields (logprobs, masks, multimodal payloads) easily exceed gigabytes. The team's solution is to decompose rollout data into two streams: lightweight metadata (lengths, sample IDs, reward scalars) loaded eagerly for global shuffling and packing layout decisions; and heavy per-token fields loaded lazily via a shared-memory data loader (intra-node deduplication) and released immediately upon consumption at mini-batch granularity. The number of on-device mini-batches is determined dynamically based on workload to trade compute throughput against I/O overlap.
For a theorist this is "buffer-then-fold" (pattern 5 in the next section) at the data-pipeline layer rather than the tensor layer — same structural argument, different granularity.
DeepSeek Elastic Compute (DSec) — agentic-AI sandbox platform
The agentic-RL frontier (SWE-bench, web research, tool-use trained policies) needs the rollout engine to call code execution, not just to generate tokens. V4 builds this as a separate platform: DSec, a Rust-based production sandbox system that manages "hundreds of thousands of concurrent sandbox instances" per cluster. Four execution substrates behind one unified Python SDK:
- Function Call — stateless invocations dispatched to a pre-warmed container pool, no cold start
- Container — Docker-compatible, EROFS on-demand image loading
- microVM — Firecracker, for security-sensitive high-density
- fullVM — QEMU, for arbitrary guest OSes
Built on the 3FS distributed filesystem and a custom RPC protocol. Crucially, sandbox lifecycles coordinate with GPU training schedules — preemption and checkpoint-based resumption are first-class. Each sandbox maintains a globally-ordered trajectory log that serves three purposes: fast-forward replay (when training is preempted, cached results for completed commands are replayed on resumption to skip non-idempotent re-execution); fine-grained provenance (every state change is traceable to its command); deterministic replay (any historical session reproduces from its log).
The unifying observation: V4 has now applied the WAL-replay pattern twice — once at token granularity for LLM rollout (Section 5.2.3), once at command granularity for sandbox state (Section 5.2.5). The same correctness argument carries over: persistent ordered logs let you preempt at any boundary and resume without changing the output distribution. The pattern generalizes; the granularity changes with the workload.
What's interesting about V4 as a case study
The honest summary: V4's contribution is not a new RL algorithm but a new merging algorithm. The team trains specialists with RL (covered in Section 5.1.1 with GRPO), then uses OPD to consolidate them. The infrastructure innovations — multi-teacher scheduling, WAL fault-tolerance, DSec sandboxes — exist because OPD-as-merger creates problems that pure RL doesn't. For a theorist reading V4 alongside Miles, the right framing is that both papers solve the same skeleton (rollout cycle, three pillars, hybrid engine) but assemble different upper layers on top.
- DeepSeek V4 technical report — Section 5.1.2 (OPD objective) and Section 5.2 (post-training infrastructure)
- Gu et al. 2024 — MiniLLM: the original on-policy distillation formulation
- Lu & Lab 2025 — Thinking Machines OPD blog: a sharper case for OPD over RLHF
- slime · examples/on_policy_distillation — the open-source reference with two teacher modes (sglang external, megatron in-process)
- slime OPD README — flags:
--use-opd,--opd-type,--opd-kl-coef,--opd-teacher-load
Engineering case study — SGLang, the inference substrate
Every framework above (verl, slime, Miles) treats SGLang as a black box: feed it prompts, get back tokens; push new weights, expect the next sample to use them. This case study opens the box. The point is to make every interaction the trainer has with SGLang legible — what API it calls, what invariant that API guarantees, and what bug appears when the invariant breaks. Once those are clear, the SGLang ↔ Megatron interface section of any RL framework reads as the small, declarative shim it actually is, rather than as opaque glue.
The two-way contract
From the trainer's perspective, SGLang exposes a small control-plane API and a large data-plane behavior. The control plane is the easy part to read; the data plane is where the engineering lives. The control commands the trainer issues, in approximately the order an RL step uses them:
generate(prompts, sampling_params)— the rollout call. Returns a stream of(token_id, logprob, finish_reason)per sample. The sampling params includetemperature,top_p,top_k,max_new_tokens, and crucially areturn_logprobflag.update_weights_from_*— one of four flavors:_disk,_distributed,_tensor,_ipc. Primitive ④ covered the four paths. The trainer picks one based on whether the engine shares GPUs (IPC), shares an NCCL group (distributed), shares a filesystem (disk), or shares neither (tensor over HTTP).release_memory_occupation()— tells the engine to free its KV cache and weight buffers before the trainer's backward pass needs the GPU. The engine assertsis_fully_idle()first: no in-flight generations, no pending requests. If the assert fails, the framework above hangs intentionally — better hang than corrupt.resume_memory_occupation()— the symmetric reclaim. Re-allocates the KV pool and (if weights were also released) re-uploads them via the appropriateupdate_weights_from_*path.pause_scheduler()/continue_scheduler()— soft-stop the scheduler without freeing memory. Cheaper than release/resume; used when the pause is brief (waiting for a sync barrier) and the cost of re-uploading weights would dominate.flush_cache()— invalidate the RadixCache after a weight update. New weights, new logprobs, so old KV entries are stale. The flush is the boundary between two policy versions.
The data plane is what flows back up: tokens and their logprobs per sample (used as πold in the loss, never as πθ), finish_reason per sample (length cap, EOS, stop string), and — for MoE rollouts — routed_experts per token (the R3 record that lets the trainer replay routing).
Scheduler — the central state machine
SGLang's Scheduler is the loop that decides, every tick, which requests get tokens this step. It has two mixin behaviors that together implement everything interesting:
SchedulePolicy— picks which requests advance. The default is LPM (Longest Prefix Match), which prefers requests whose prefixes are already cached. In an RL setting with group sampling (G samples per prompt), LPM is what makes the G samples share the prompt's KV automatically. There is also in-batch prefix caching: within one decode batch, requests that share a prefix with a request being processed this tick reuse the same KV blocks without going through the cache lookup. The two work together: LPM at request admission, in-batch at the tick level.RadixCache + MemoryPool— owns the KV cache as a radix tree (Primitive ⑤) and the underlying physical block pool. The radix tree hasinc_lock_ref(node)/dec_lock_ref(node)primitives — the lock ref is what prevents an in-use prefix from being evicted under cache pressure. Without it, a long-running rollout could lose its prefix in the middle of generation. With it, the prefix is pinned for the rollout's lifetime and unpinned on completion.
The radix cache also has a namespace mechanism — the extra_key field on each prefix tree node. For multi-tenant RL (multiple training runs sharing one SGLang cluster), extra_key separates each run's tree, preventing cross-tenant prefix collisions. Within one RL run, extra_key can encode the policy version, which makes the flush_cache on weight update a pure namespace-rotation rather than a full eviction.
TpModelWorker — the parallel bridge
Between the single-threaded Scheduler and the TP-sharded GPU work sits TpModelWorker. It is the dispatch boundary: requests selected by the Scheduler are scattered to TP ranks; per-rank outputs are gathered back. The bridge owns the all-reduce barriers, the NCCL groups, and the lock-step invariant that all TP ranks see the same sequence of requests in the same order. A TP rank that drifts (e.g., because a kernel returned slightly different timings) causes the next all-reduce to hang. The Scheduler's strict ordering is what prevents this.
For RL, the TpModelWorker is also the layer where update_weights_from_distributed hooks in. The trainer's parameter stream arrives as NCCL all-gather operations across the cross-process group that spans trainer ranks and engine ranks. TpModelWorker consumes each parameter bucket as it arrives, writes it into the model's local shard, and proceeds. The bucket size is tuned so that NCCL overlap hides the all-gather latency behind the previous bucket's write.
ModelRunner and RoutedExpertsCapturer
ModelRunner owns the actual forward pass: it allocates KV blocks, runs the attention kernel, runs the FFN (or the MoE routing + experts), and emits logits. For MoE models, ModelRunner embeds a RoutedExpertsCapturer — a small hook that records, per token, which top-k experts the gating selected. This record is the R3 ingredient. Without the capturer in the inference path, R3 cannot replay; the trainer would have to re-run routing in its own precision, which is exactly the precision-divergence problem Miles introduced R3 to solve.
The capturer is off by default (it has memory and bandwidth cost) and turned on by the framework above via a config flag when MoE-RL with low-precision inference is the workload. The capture format is a compact (token_idx, expert_ids[k]) stream that the trainer reads alongside (token, logprob).
Sleep mode — the memory-choreography hookpoint
SGLang's sleep mode is the engine's side of the hybrid-engine handshake. Three sleep levels, in increasing aggression:
- Level 0: only the KV pool is dropped. The model weights and the CUDA graphs stay in VRAM. Recovery is fastest (just re-allocate the KV pool) but the freed memory is modest (KV is usually 20-40% of total).
- Level 1: KV pool + weights. The model itself leaves VRAM, offloaded to pinned host memory. Recovery requires re-uploading weights — but if the trainer is going to push new weights anyway after backward, this is essentially free. The frees are large (most of the engine's VRAM).
- Level 2: KV pool + weights + CUDA graphs. Even the captured CUDA graphs (which encode kernel launch sequences) are dropped. Recovery is slowest (re-trace + re-capture, hundreds of milliseconds) but the frees are maximal. Used when the trainer needs every byte it can get during backward.
The release_memory_occupation() command takes a level argument and the framework above picks based on its backward-pass memory profile. verl's sleep_level=1 default is the right choice for most setups; sleep_level=2 is only worth it for models that wouldn't otherwise fit.
The four update_weights_from_* paths from SGLang's side
Primitive ④ covered these from the trainer's perspective. From SGLang's side, each path becomes a method on the engine with different memory-and-latency tradeoffs:
| Path | What SGLang does | Latency | Memory overhead |
|---|---|---|---|
_disk | Read each shard from a shared filesystem; load into the local model. | Slowest (seconds-to-minutes) | Low (streaming) |
_distributed | Receive each parameter bucket via NCCL all-gather; write into local shard. | Fast (ms-to-sec) | Bucket-sized buffer |
_tensor | Receive each parameter over HTTP / gRPC; write into local shard. | Slow (network-bound) | Per-tensor buffer |
_ipc | Map the trainer's CUDA tensor by IPC handle; copy device-to-device. | Fastest (μs-to-ms) | Zero-copy if same-rank |
The same verb "update weights" has four implementations differing only in transport. This is the canonical example of pattern 1, the functor: one operation lifted across four topologies.
Six engineering invariants SGLang enforces
- Idle-before-release —
release_memory_occupation()assertsis_fully_idle()before freeing. No in-flight generations. - Lock-ref for in-use prefixes — every active rollout pins its prefix with
inc_lock_ref; the radix tree's evictor skips locked nodes. - TP lock-step — all TP ranks see the same request stream in the same order, or NCCL hangs by design.
- Flush on weight update — every
update_weights_from_*is followed byflush_cache(); new weights ≠ old KV. - RoutedExpertsCapturer for MoE-RL — when MoE inference runs in low precision, the routed-expert record must be emitted, or R3 has nothing to replay.
- Logprob is engine logprob, not trainer logprob — the data plane returns the engine's logprob as πold. The trainer's recompute is the one that goes into the loss as πθ.
What's interesting about SGLang as a case study
SGLang's design choice is to be the inference engine that knows it's being driven by a trainer. vLLM, by contrast, was designed first as a serving engine and then bolted RL hooks onto an existing scheduler. The two engines have converged in their RL surface (both expose sleep, wake, update_weights, reset_prefix_cache), but the internals diverge: SGLang's radix tree vs vLLM's block-hash map, SGLang's mixin Scheduler vs vLLM's EngineCore. The next case study walks through vLLM's choices.
- sgl-project/sglang — repo root
srt/managers/— Scheduler, SchedulePolicy, MemoryPoolsrt/mem_cache/— RadixCache and the lock-ref primitivessrt/model_executor/— ModelRunner and the RoutedExpertsCapturer hook- Awesome-ML-SYS-Tutorial — the canonical RL-side reference for SGLang internals
Engineering case study — vLLM, the block-paging contrast
vLLM is SGLang's primary competitor and the engine of choice in roughly half of all RL-framework configurations (OpenRLHF defaults to vLLM; verl supports both; SkyRL supports vLLM via an abstraction). Reading vLLM after SGLang is the cleanest way to see which design choices are essential to "be an inference engine" and which are stylistic. They converge on the same RL hooks but reach them by different internal paths. The most consequential divergence is the KV cache representation: SGLang uses a radix tree; vLLM uses block-paging with a block-hash map. The two structures are isomorphic in expressivity but very different to maintain.
PagedAttention and block-paging
vLLM's foundational paper is PagedAttention (the eponymous SOSP 2023 paper). The idea is virtual memory for KV cache: instead of allocating one contiguous tensor per sequence, allocate fixed-size blocks (typically 16 tokens each) from a global pool. A sequence is a list of block pointers; the attention kernel knows how to traverse the block list per request. The win is that KV memory fragmentation drops from ~60% in a contiguous-allocation engine to near zero — block boundaries align to allocation boundaries.
The data structure that owns this in vLLM v1 is BlockPool: a fixed-size array of blocks with reference counts. Allocation is O(1) from a free list; deallocation is O(1) by ref-count decrement. The block-hash design is what makes prefix caching work: BlockHashToBlockMap hashes the (parent_hash, token_ids[0:block_size]) tuple to find existing blocks. A new sequence whose first 64 tokens match an existing prefix hits four blocks of the prefix's hashes and shares them.
The contrast with SGLang's radix tree is the central diagram of inference-engine literature: both structures answer the question "which blocks already hold this prefix?" but with different update semantics. Radix tree is the right answer when prefixes are long and overlapping; block-hash is the right answer when prefixes are bursty and block-aligned. In RL, group sampling with G samples per prompt is exactly the bursty case — every sample shares the same prompt prefix, all of length divisible by 16 after tokenization. Both structures handle it well; the differences appear in edge cases like cache rotation under partial-rollout resumption.
EngineCore and the inner loop
vLLM's EngineCore is the equivalent of SGLang's Scheduler. The inner loop, simplified:
while True:
requests = scheduler.schedule() # pick which requests get tokens this step
if not requests:
wait_or_yield()
continue
token_progress = model_executor.run(requests)
scheduler.update(token_progress) # advance state, free finished blocks
output_processor.emit(token_progress) # stream tokens out
The Scheduler is token-progress-based rather than request-based. Each tick, the scheduler asks "how many tokens of forward progress are budgeted?" and packs as many requests as fit in that budget. This is unlike SGLang's request-list scheduling and is what lets vLLM keep continuous batching very full even when request lengths are uneven.
The KVCacheManager.allocate_slots() method is where the block-paging lives. It has a five-zone layout for a given request: (1) cached prefix blocks (already in BlockPool, ref-count++), (2) prefix blocks to allocate (new blocks for the rest of the prompt), (3) decode blocks (for generation, allocated lazily as decoding proceeds), (4) speculation blocks (when speculative decoding is enabled), (5) encoder blocks (for VLM image features). The five zones share one BlockPool with a unified ref-count protocol — when a request finishes, every block it touched gets a decrement, and any block whose count reaches zero returns to the free list.
Sleep / wake — vLLM's hybrid-engine hooks
vLLM acquired sleep / wake hooks later than SGLang did, but they now match in API. The implementation differs: vLLM uses a custom CuMemAllocator that wraps CUDA's virtual-memory APIs, letting the GPU allocator release physical memory while keeping virtual addresses reserved. On wake, the same virtual addresses get new physical pages backed — no relocation, no pointer fixups in the model. SGLang's release goes through PyTorch's caching allocator and re-binds on resume; the net behavior is the same but the bookkeeping is different. Both engines support sleep level 0/1/2 with the same semantics covered in the SGLang section.
RLHF-specific hooks
Three vLLM API methods exist primarily for RL workloads:
reset_prefix_cache()— invalidate the BlockHashToBlockMap. Called after every weight update for the same reason SGLang flushes its RadixCache: new weights, new logprobs, old KV is stale.reset_encoder_cache()— invalidate the VLM image-feature cache. For multimodal RL where the policy changes the vision encoder, the cached features become stale alongside the LM weights.pause_scheduler(mode)— soft-pause without freeing memory. Two modes:'idle'(let in-flight finish, refuse new) and'flush'(drop in-flight, immediate pause). The flush mode is what the trainer uses when a sync barrier needs to be hit on a deadline and finishing the in-flight tail would overshoot.
vLLM also supports the four update_weights_from_* paths (disk, distributed, tensor, ipc) under the same names as SGLang — the convergence on this verb is no accident, both engines adopted it because the trainer doesn't want to know which engine it's driving.
Where vLLM and SGLang diverge for RL
For a theorist deciding which engine to drive, the operational differences cluster into four areas:
- Prefix cache structure. vLLM's block-hash is faster to update under bursty group-sampling workloads (constant-time per block); SGLang's radix tree is more compact under deeply-shared prefix trees with many branching points.
- Scheduler granularity. vLLM schedules by token budget per tick, SGLang by request list. Token-budget scheduling produces flatter latency profiles under load; request-list scheduling is more predictable per-request.
- MoE routing capture. SGLang has the first-party
RoutedExpertsCapturerfor R3; vLLM exposes a hook but requires the framework above to wire it. Miles ships an internal vLLM fork that includes the equivalent; OpenRLHF currently does not. - Multimodal. vLLM's encoder cache (
reset_encoder_cache) is a more mature surface than SGLang's; SGLang's lead is in language-only workloads.
Seven engineering invariants vLLM enforces
- Block ref-count balance — every block allocated must eventually be freed; the free list must be consistent at all times. A leak shows up as gradual VRAM growth across requests.
- Hash collision handling — if two distinct prefixes hash to the same key, the BlockHashToBlockMap stores both and the lookup probes; collisions are rare but the structure is robust to them.
- Token-budget admission — every tick budgets a token count and never overshoots; the scheduler refuses requests that would exceed the budget rather than queuing them silently.
- Encoder-cache invalidation — for VLM RL,
reset_encoder_cachemust be called alongsidereset_prefix_cacheon every weight update. - Pause modes are exclusive — only one of
idle/flushis active at a time; the trainer commits before issuing. - Virtual address stability across sleep/wake —
CuMemAllocatorguarantees the same virtual addresses after wake, so the model's parameter pointers don't move. - Engine logprob, not trainer logprob — same invariant as SGLang: the engine returns its logprob as πold, never to be used as πθ.
What's interesting about vLLM as a case study
vLLM's bet is that virtual-memory primitives — paging, ref-counts, hash maps — are the right granularity for the engine's bookkeeping. SGLang's bet is that radix trees and locks are. Both bets work in production; the choice has more to do with which research group you're collaborating with than with any objective superiority. For a theorist, the value of reading both case studies side by side is that it inoculates against the temptation to treat the inference engine as a single opaque thing — there are at least two coherent designs for the same problem, with different operational tradeoffs.
- vllm-project/vllm — repo root
- PagedAttention paper (SOSP 2023) — the foundational design
vllm/v1/— the v1 architecture (EngineCore, BlockPool, KVCacheManager)- vLLM distributed serving docs — TP/PP configuration, the engine's parallel layout
- vLLM RLHF docs — the
sleep,wake,update_weights,reset_prefix_cacheAPIs
Recent advances from the SGLang RL team
The slime + Miles + SGLang community has shipped a cluster of advances in late 2025 / early 2026 that are worth knowing collectively because they extend the survey's six primitives along five different axes. I list them with one-paragraph reads.
INT4 QAT full-flow training
Inspired by Kimi K2-Thinking's W4A16 QAT recipe, slime now runs an end-to-end INT4 quantization-aware training pipeline. The training side keeps BF16 master weights but inserts fake quantization (quant-dequant) into the forward pass — the model "sees" INT4 noise and learns weights robust to it. The backward pass uses Straight-Through Estimator (STE): the round function's derivative is set to 1, letting gradient flow through the unquantized weights. At inference time, SGLang loads true W4A16 weights with the Marlin kernel. Net effect: a 1TB-class model (Kimi K2 scale) fits the rollout in a single H200 (141GB), eliminating cross-machine communication overhead. Technical writeup.
Unified VLM/LLM multi-turn
The first-principles design described in the previous section. One rollout function, two domains, full decoupling between sampling logic and environment. Blog.
Rollout Router Replay (R3) for MoE stability
Already covered as Miles' signature contribution to the quantization section. Captures expert-routing decisions during SGLang inference, replays them during Megatron training. Makes MoE RL stable under low-precision routing.
Full-flow FP8 training and sampling
The follow-on to R3. "Unified FP8: Moving Beyond Mixed Precision for Stable and Accelerated MoE RL" walks through hardware foundations, scale selection, and MoE experiment results. The headline: FP8 inference + FP8 training + R3-style routing replay gives bit-identical numerics and ~2× rollout throughput on H100/H200.
Speculative decoding in RL
Standard practice for serving, novel in RL training. A small draft model proposes tokens; the policy verifies them in parallel. Net effect: 25%+ rollout speedup with no accuracy compromise. slime docs.
The cluster of advances above traces one through-line: the bottleneck of RL training is the rollout, not the gradient update. Every primitive optimizes rollout throughput or stability. The R3 / FP8 work makes the rollout faster and correct on MoE. The QAT work shrinks the rollout's memory. The multi-turn work expands what counts as a rollout. The speculative work decodes faster.
For a theorist this matters because the empirical claims in the field's papers (DeepSeek-R1, GLM-5.1, K2-Thinking, Doubao-1.5-pro) are produced under these specific infrastructure choices. If your work depends on understanding why those models behave as they do, the choices above are the assumption set you're implicitly invoking.
Engineering design patterns — the algebraic view
Six patterns recur across every framework in the survey. Once you see them named, the rest of the page reads as variations on a small set of themes. I lay them out here as a synthesis of what the previous sections have already shown. The sixth pattern is the one V4 just demonstrated above — I add it explicitly at the end.
1. Functor — one verb, many topologies
The four update_weights_from_* methods are the canonical example. There is one operation ("write the new policy into the inference engine") lifted across four different categories of physical configuration (same process / shared disk / NCCL group / CUDA IPC). The interface is fixed; the implementation per category is what differs. The same pattern appears in SkyRL's three-backend abstraction (vLLM / SGLang / OpenAI API) and in verl's training-backend choice (FSDP / FSDP2 / Megatron).
The mathematical statement: an operation lifted to a category of topologies, with a consistency obligation downstream (the flush_cache_after_weight_update contract). For a theorist, this is the cleanest example of "the abstraction does mathematical work" in the survey — the verb is the same, the meaning is preserved, but the cost varies by category.
2. Categorical product — multiplicative composition
RadixAttention × GRPO is the example I lead with elsewhere because it's the most beautiful. An algorithmic choice (group sampling, for variance reduction) exposes a sharing pattern (a common prefix). A data structure (the prefix tree) exposes a sharing mechanism (reference-counted nodes). Their composition multiplies savings — 4× prefill cost becomes 1× plus tails — and reference counting makes cache lifetime automatic.
The pattern recurs elsewhere. The hybrid engine × Megatron 5D parallelism: colocation chooses one GPU per role at a time, while Megatron handles the parallelism within a role; their composition is what makes 462B-parameter training feasible. R3 routing replay × FP8 inference: routing replay ensures expert selection is identical, FP8 makes it economical; together they make MoE RL stable. Whenever you see "the system advantage is doing the heavy lifting," you're looking at a categorical product.
3. Mutual exclusion as serialization
The hybrid engine (训推一体) is the central instance. The GPU is in exactly one of two states at any moment (training-mode or inference-mode); transitions are explicit; concurrency contracts are checked structurally with assert is_fully_idle(). This is the engineering version of a state machine with two states and well-defined transitions — but its real significance is that shared state with mutable ownership has no race conditions when ownership is exclusive. Concurrent programming's worst class of bugs is structurally unavailable.
The same pattern shows up in subtler places. RadixAttention's inc_lock_ref / dec_lock_ref establishes mutual exclusion between active readers and the eviction policy — a node with lock_ref > 0 simply cannot be evicted. Use-after-free is unrepresentable. The structural argument is the same; the state machine is just per-node instead of global.
4. Identity invariant under reparameterization
This is the deepest pattern, and Miles' R3 is the canonical case. The mathematical statement: routing(x, θ, fp8) ≡ routing(x, θ, bf16), enforced by replay rather than by hoping numerics agree. Two computations that should give the same answer but might not, made to give the same answer by recording one and replaying the other.
The QAT pipeline does the same thing differently — fake quantization in the forward pass establishes loss(θ, bf16) ≈ loss(θ, int4) by inserting the int4 noise into the training distribution. STE (Straight-Through Estimator) makes the backward pass behave as if int4 weren't there. Both are identity invariants under a reparameterization (precision regime, in this case) maintained by explicit engineering rather than mathematical equality.
For a theorist this is the most interesting pattern because it appears whenever an engineering shortcut (lower precision, async update, partial rollout) creates a mathematical inconsistency that has to be closed by another mechanism. The shortcut + the fix together are the contribution; neither alone is.
5. Buffer-then-fold — bounding asymptotic complexity
The multimodal tensor merge in multi-turn rollout (Section above) is the clean example. Naïve concatenation per turn: O(n²). Buffer-then-fold: O(n). The same input-output behavior; two different complexity profiles distinguished only by where the allocation boundary sits.
The pattern recurs in AReaL's _PendingWeightUpdateBucket (queue NCCL broadcasts in memory-bounded buckets, fire them at the end), in verl's DataProto.union (merge fields lazily, materialize the full batch once at dispatch time), and in slime's bucket-based weight sync (avoid OOM by streaming updates in slices). The unifying claim: do not perform an O(n) operation inside an O(n) loop unless you must.
6. Persistent log + replay — correctness under preemption
The pattern V4 demonstrates with the token-granular Write-Ahead Log. The setup: a long-running stateful computation (a rollout, an agent session, a database transaction) can be interrupted at any moment. The naive recovery — restart from the most recent persistent checkpoint — has subtle correctness pitfalls. Restarting an in-flight LLM generation, as we saw, introduces length bias: shorter responses are over-represented in the survivor population, distorting the empirical distribution of trajectories the policy gradient sees.
The pattern's claim: maintain an append-only log of every operation; on resumption, replay the log to reconstruct the pre-interruption state. The result is that the post-interruption execution is indistinguishable from a no-interruption execution at the observable boundary (the sampled trajectory distribution in V4's case; the database state in a transactional setting). Token-granular WAL is the LLM-rollout instance; command-granular trajectory logs are DSec's agentic-sandbox instance; the same correctness argument applies in both.
Mathematically, the pattern is a fixed-point claim: the function from input prompt to output trajectory is the same with or without preemption, when preemption is bracketed by WAL persistence + replay. This is a stronger guarantee than "results are approximately the same" — it's that they're distributionally identical. For a probabilist, the recognition is that recovery policies are part of the data-generating process, and the wrong recovery policy is a hidden experimental variable.
The synthesis
Once you have these six patterns, reading a new RL framework becomes pattern-matching. The first time you see RLinf's M2Flow scheduler, you ask: "what's the functor here, and what's the consistency obligation downstream?" The first time you see cosmos-rl's async reward microservice, you ask: "what's the mutual exclusion contract, and what's the buffer-then-fold strategy?" The first time you see a fault-tolerant rollout pipeline, you ask: "what's logged, what's replayed, and what's the distributional-equivalence claim?" The first time you read a new framework's train.py, the six patterns are the lens.
This is why the survey insists on names. Naming the patterns turns "what's going on in this code" into "which of the six patterns is this." The latter is a search; the former is reading.
Reading real code: verl's fit() loop
Everything above has been concept. If you want to see how it ties together in a real framework, the cleanest reference is verl's ray_trainer.py. One PPO/GRPO iteration is roughly 30 lines of driver code; every phase is a marked_timer block; every cross-module call is a Ray dispatch to SPMD worker groups. The batch: DataProto accumulates fields phase by phase via .union(...).
with marked_timer("step", timing_raw):
with marked_timer("gen", timing_raw, color="red"):
combined_gen_output = self.async_rollout_manager.generate_sequences(combined_gen_batch)
self.checkpoint_manager.sleep_replicas()
batch = batch.repeat(repeat_times=self.config.actor_rollout_ref.rollout.n, interleave=True)
batch = batch.union(gen_batch_output)
with marked_timer("reward", timing_raw, color="yellow"):
if self.use_rm and "rm_scores" not in batch.batch.keys():
batch_reward = self._compute_reward_colocate(batch)
batch = batch.union(batch_reward)
reward_tensor, reward_extra_infos_dict = extract_reward(batch)
with marked_timer("old_log_prob", timing_raw, color="blue"):
old_log_prob, old_log_prob_mfu = self._compute_old_log_prob(batch)
batch = batch.union(old_log_prob)
if self.use_reference_policy:
with marked_timer(str(Role.RefPolicy), timing_raw, color="olive"):
ref_log_prob = self._compute_ref_log_prob(batch)
batch = batch.union(ref_log_prob)
if self.use_critic:
with marked_timer("values", timing_raw, color="cyan"):
values = self._compute_values(batch)
batch = batch.union(values)
with marked_timer("adv", timing_raw, color="brown"):
batch = compute_advantage(batch, adv_estimator=self.config.algorithm.adv_estimator, ...)
if self.use_critic:
with marked_timer("update_critic", timing_raw, color="pink"):
critic_output = self._update_critic(batch)
with marked_timer("update_actor", timing_raw, color="red"):
actor_output = self._update_actor(batch)
The colors aren't decorative — they map to phases you'd see in verl's wandb traces. Red is generation, yellow is reward, blue is old_log_prob, etc. The driver code is short because every method call is a Ray dispatch to SPMD worker groups, which expand to hundreds of GPUs internally. Reading this once is worth more than reading 20 RL papers — it makes the abstract loop in Figure 1 fully concrete.
The contrast worth looking at is AReaL's async path. The single keyword that distinguishes "synchronous" from "fully async" RL is async_op=True on the NCCL broadcasts. AReaL launches them, queues them in memory-bounded buckets, and continues without waiting; OpenRLHF's broadcast_to_vllm blocks until every engine acknowledges. Two designs, one keyword apart, with substantially different scaling properties:
class _PendingWeightUpdateBucket:
handles: list[dist.broadcast] # async_op=True handles
futures: list[torch.cuda.Event]
tensors: list[torch.Tensor]
# During weight sync:
for bucket in buckets:
dist.broadcast(bucket.tensor, src=0, group=dp_group, async_op=True)
# continue iterating — don't wait
# Inference side:
future = rollout_engine.update_weights_from_distributed(meta, param_specs, async_op=True)
# Inference keeps running until it explicitly awaits the future
The framework landscape
Nine frameworks, all instantiations of the skeleton in Section 2. They differ on five axes: training backend, inference backend, orchestrator, placement policy, and target domain. Once you know the five axes, you can place every framework on a 5-dimensional map and the surface differences (DAPO vs PRIME vs GSPO vs OPD) collapse into local choices made within the same architectural envelope.
| Framework | Training | Inference | Orchestration | Domain | Distinctive bet |
|---|---|---|---|---|---|
| Miles | Megatron (plugin) | SGLang | Ray | Frontier MoE | FP8/INT4 bit-identical · R3 routing replay |
| SLIME | Megatron | SGLang | Ray | Reasoning + agentic (GLM-5.1) | On-policy distillation · math/science graders |
| SGLang | — | (itself) | — | Substrate | update_weights_from_* · RadixAttention |
| SkyRL | FSDP / Megatron | vLLM / SGLang | Ray | Multi-turn agents | skyrl-gym + skyrl-agent + Tinker |
| cosmos-rl | PyTorch + 6D parallel | vLLM / diffusers | Custom NCCL | Physical AI | WFM RL (DDRL) · FP8/MXFP4 |
| RLinf | FSDP + Megatron | SGLang + vLLM | Ray + M2Flow | Embodied + agentic | Macro→Micro flow: 2.43× from scheduling |
| verl | FSDP / Megatron | vLLM / SGLang | Ray (HybridFlow) | Frontier LLM RL | 3D-HybridEngine · DAPO/PRIME/GSPO recipes |
| OpenRLHF | DeepSpeed | vLLM | Centralized Ray | RLHF baseline | The OG · structurally synchronous |
| AReaL | FSDP / Megatron | SGLang | Ray + async futures | Fully-async RL | _PendingWeightUpdateBucket · 2.77× speedup |
If you cluster these by their bet, three families emerge. LLM purists (Miles, SLIME, verl) make a tight Megatron + SGLang + Ray bet and innovate on numerics and algorithms. Agent platforms (SkyRL, RLinf) abstract environments and trade backend specificity for flexibility. The domain extender (cosmos-rl) brings RL to non-text modalities — diffusion video, robotics — and rebuilds orchestration around hardware. SGLang sits underneath all of them as substrate; AReaL and OpenRLHF are performance specialists that differ mainly in their async choice.
For a theorist, the useful exercise is to ask, for a given research result, which framework it was trained on and whether the framework's choices preserve the assumptions the result needs. A claim about variance reduction under GRPO needs the prefix cache to behave; a claim about MoE expert specialization needs the routing-replay invariant; a claim about off-policy correction needs the IS weights to actually be computed. The framework matters more than the algorithm name suggests.
Beyond chat LLMs
All nine frameworks above were designed for chat LLMs. The 2026 frontier is wider — multi-turn agents that call tools, vision-language-action policies for robots, diffusion world models that generate video. These targets push different requirements onto the framework, and frameworks that can't accommodate them get left behind.
Multi-turn agents
Single-turn RL: one prompt → one completion → one reward. Easy. Multi-turn RL: the model emits a tool call, gets a tool response, emits another action, gets another response — across 10+ turns. Reward arrives at the end (task success) or per-turn (intermediate signals). The rollout is a trajectory, not a completion. New systems requirements: KV cache reuse across turns (SGLang's open_session()), variable-length trajectories (mandates partial rollout), and a clean environment abstraction so the agent can call real tools. SkyRL with skyrl-gym + skyrl-agent (which trained the SA-SWE-32B SWE-bench model) and SLIME with its concrete examples (tau-bench, retool, search-r1) are the leaders.
Embodied AI and VLA
Vision-Language-Action models (π₀ / π₀.₅ from Physical Intelligence, OpenVLA, NVIDIA GR00T-N1.5) take an image plus a language instruction and output continuous action sequences for robots. The action space is no longer discrete tokens but continuous joint angles. RL algorithms shift to SAC, DAPO, SAC-Flow (a flow-matching policy variant). The simulator-in-the-loop becomes a non-negotiable part of the framework — RLinf abstracts ManiSkill, IsaacLab, Habitat, LIBERO, RoboTwin, CALVIN, MetaWorld behind a single envs/ wrapper.
World foundation models and diffusion
Cosmos-Predict, SANA, Stable Diffusion 3, Wan2.2 (Alibaba's 27B-total/14B-active MoE video model), FLUX/FLUX.2 (Black Forest Labs) — these are diffusion-based generators. Each "rollout" is 50+ denoising steps, not one autoregressive pass; cost-per-sample dominates everything. Long-context handling becomes critical (video tokens reach 100K+), which is why cosmos-rl needs 6D parallelism. The algorithm shifts too: PPO/GRPO doesn't directly apply to diffusion models. Cosmos-rl's DDRL (Data-Regularized DRL) replaces the KL term with reward maximization plus standard diffusion training loss. RLinf as of Feb 2026 supports RL fine-tuning of VLA on Wan world models — closing the loop where Wan simulates rollouts for embodied agents.
The TPU detour
If you're forced onto Google TPUs, almost none of this works. The Cambrian-MLLM TPU training blog documents two years of training multimodal models on TPUs and reaches three uncomfortable conclusions. First, "dynamic shapes are the enemy" — every shape change triggers XLA recompilation. Variable-length generation as we know it on GPU is essentially banned. Second, "arbitrary SPMD sharding proved impractical" — Ray-based hybrid placement doesn't translate. Third, silent library incompatibilities hide everywhere: F.scaled_dot_product_attention and torch.utils.checkpoint fail without errors on TorchXLA.
The conclusion: a TPU-native RL framework would look fundamentally different. Static graph compilation, fixed shape contracts, no Ray, no SGLang. No one has built this yet. vLLM has plugin support for TPUs in inference, but the full RL-on-TPU story is open. This is a real research gap.
The scaling chain
The architecture is elegant because each primitive is designed to address the bottleneck that appears at the next scale tier. Walking up the chain:
| Tier | What breaks first | What saves you |
|---|---|---|
| ~8 GPUs (single node) | KV + train state collide → OOM | release_memory_occupation + colocate |
| ~64 GPUs (small cluster) | GRPO group prefills duplicate work | RadixAttention prefix cache |
| ~256 GPUs (medium) | Naïve param copy = 50ms × thousands of tensors | Handle-tuple zero-copy via CUDA IPC + ZMQ |
| ~1024 GPUs (large) | Long-tail rollouts — 90% GPUs idle 18s/iter | Partial rollout + TIS/MIS staleness corrections |
| ~4096 GPUs (frontier) | Single-controller becomes CPU bottleneck | SPMD multi-controller; M2Flow dynamic scheduling |
| ~10000+ GPUs | NCCL groups can't be resized | RDMA point-to-point + disaggregated rollout pools |
The meta-property is that each primitive is a local optimization. You don't pay for RDMA until you need elasticity. You don't pay for partial rollout until tail latency dominates. The stack is "pay for what you need," which is what makes it scale — the opposite design (monolithic optimization tuned to one tier) wins at that tier and dies at the next.
Pitfalls — 踩坑录, lessons paid in pain
Chenyang's tutorial catalogs the production failure modes the field has paid years of debugging on. Six are worth knowing for anyone trying to interpret RL results:
1. Training-inference numerical drift
Inference kernels fuse operations to maximize throughput; the fusion depends on batch shape. Same model, same input, different batch size = slightly different logits. Invisible at the token level, fatal at the logprob level. Never trust inference-engine logprobs for loss computation.
2. Handle-tuple deserialization segfaults
verl's update_weights_from_tensor requires monkey_patch_torch_reductions() to register CUDA IPC handle deserializers. Missing this call: silent segfault, intermittent.
3. NCCL hangs under mixed inference backends
OpenRLHF + SGLang integration: silent deadlocks, no error, the run just stalls. Mixing distributed backends (DeepSpeed + Ray + SGLang's own dist) creates fragile NCCL group management.
4. Memory choreography under colocate
Megatron's CPU offload is imperfect — KV cache and model parameters contend for the same address ranges. slime's bucket-based weight update exists specifically to avoid OOM on large MoE models. Test the memory hand-off at full scale early.
5. Off-policy ratio drift
Without explicit IS correction or operational staleness bounds, the off-policy ratio grows monotonically until the policy is training on data from a fundamentally different policy. Training curve looks fine. Eval curve mysteriously degrades. Monitor off-policy ratio as a first-class metric.
6. Mixing precisions across train and infer
BF16 training + FP8 inference + FP32 optimizer state = three numerical regimes interacting via weight sync. MoE routing diverges between regimes (the problem R3 solves). If you can't make precision uniform, instrument routing decisions per-expert. The bug is invisible at the loss level.
A researcher's checklist — what to verify before trusting an RL result
If the empirical content of the previous sections distilled to anything actionable, it would be a short list of questions you should be able to answer about any RL paper before you treat its numbers as evidence for its theoretical claims. I list nine. They are intentionally ordered by how often they materially affect results — the first three matter more than the last three.
1. Which framework was the training run on?
The most important single question, and the one papers most often answer vaguely. "We used GRPO" tells you the algorithm. The framework choice — Miles vs verl vs SLIME vs OpenRLHF — tells you which set of engineering primitives shaped the run. Routinely, two groups training "the same algorithm" report different numbers because their frameworks make different choices about staleness, prefix caching, weight sync, and precision alignment.
2. Inference engine and rollout topology?
SGLang vs vLLM matters because of the RadixAttention prefix-cache difference (especially for GRPO group sampling). Colocated vs disaggregated rollout matters because of the weight-sync path. "vLLM with disaggregated rollout" and "SGLang with hybrid engine" can produce noticeably different sample-efficiency curves under the same algorithm and the same model. A paper that does not name the inference engine has not described its experimental setup.
3. Precision regime, and what enforces train-infer consistency?
BF16 train + BF16 infer is the safe case. FP8 inference + BF16 training is the dangerous case for MoE models without R3-style routing replay. INT4 inference + BF16 training requires QAT. The question to ask: in what precision are the logprobs that drive the loss computed, and where do they come from? If the answer is "FP8 inference engine" without further qualification, you should expect numerical drift. If the answer is "recomputed in BF16 on the training engine," good — that's the safe pattern.
4. On-policy or async? Off-policy ratio cap?
Strict on-policy RL is rare at any scale. If the framework uses partial rollout (most do, above 256 GPUs), the result is technically off-policy, and there should be either an explicit importance-sampling correction (Miles' TIS/MIS) or an explicit staleness bound (Kimi K1.5's curriculum scheduling, AReaL's bucket size). The question: is the off-policy ratio reported, capped, or even monitored? If none of those, the on-policy claim is unverified.
5. Was logprob recomputed on the trainer or trusted from inference?
The "BERT-era unsolved bug" — inference kernels fuse operations differently from training kernels, producing different logprobs for the same input. Every well-engineered framework recomputes logprobs on the training engine. If a paper trains on inference-engine logprobs, the loss has a numerical bias that grows with training. Worth checking in the framework's source even if the paper doesn't say.
6. For MoE models: is routing replay enabled?
The MoE-specific version of question 3. Without R3 (or equivalent), the gradient signal has a noise floor from routing divergence between inference and training. DeepSeek-V3, Qwen3-MoE, GPT-OSS, Mixtral — all need this. A paper training one of these without naming the routing-replay mechanism is implicitly trusting that the precision regimes agree exactly, which they typically don't.
7. Group size N for GRPO?
GRPO's variance reduction scales with group size. The prefix-cache savings also scale with group size (more completions sharing a prefix). The optimal N depends on the cluster, the model, and the task. Papers often report N=8 or N=16 with no ablation. If you're trying to reproduce or compare against a result, the group size is part of the experimental setup, not a hyperparameter.
8. For multi-turn experiments: token budget, max_turns, truncation policy?
Multi-turn trajectories have long tails. A 32-turn rollout with no per-turn limit gives different results from a 32-turn rollout with a 4096-token per-turn cap. Whether observation tokens are loss-masked (they should be) is part of the setup. Whether the chat-template preamble is repeated each turn (it shouldn't be — see the dummy-messages trick) is part of the setup. Papers comparing agents trained under different multi-turn settings often aren't comparing the same thing.
9. Compute budget per training step, and where time is spent?
The least-asked but most-revealing question. A training run reports "256 H100-days for the full schedule" and the breakdown between rollout, reward, and gradient update is typically 60% / 5% / 35%. If a paper compares two algorithms but doesn't break down where the time went, the comparison may be artifact of one method exploiting prefix caching better, not of the algorithm being inherently faster. The wandb-style timing breakdown (gen / reward / old_log_prob / values / adv / update_critic / update_actor) that verl's marked_timer blocks produce is what you actually want to see in a paper's appendix.
A paper that doesn't answer the first three of these has not described its experiment. A paper that doesn't answer the first six is not reproducible from its text alone. The 8th and 9th are luxury — but where the most interesting comparisons live.
None of this is to claim that papers should be ignored unless they answer all nine questions. It is to say that the empirical comparisons you make in your head while reading the field's papers should always include "and what is implicitly assumed about questions 1 through 6." The discipline of working through this checklist is what the survey's vocabulary is for.
A 19-repo reading plan — how to do this yourself
This survey distilled 19 source repos into a few thousand words. If you want to do the same exercise yourself — read the actual code, not just my summary — here is the order I would recommend, and the template I would use for each repo. The two together turn what looks like an overwhelming corpus into a manageable sequence of small case studies.
Read in five groups, not in their listed order
The natural impulse is to start with the most familiar repo on the list. That is the wrong move. The list groups by topic — image diffusion, RL framework, inference engine, kernel DSL — and reading them in topic order means you spend a week on image generation before you have any of the RL skeleton in your head. The better order is the one this survey is built on:
- The RL skeleton itself. Start with slime, Miles, verl, SGLang, and Megatron-LM together. These five repos cover the full generate → score → filter → train → sync path. Three of them have full case studies above as a head start: slime is the clean upstream framework, Miles is its production-hardened fork, verl is the alternative framework abstraction (HybridFlow). SGLang gives you the rollout-and-weight-sync substrate; Megatron-LM gives you the training backbone and, surprisingly, its own native RL path in
megatron/rl/. Read the case studies first, then dive into the source of one of these five. By the end of group 1 you can read any RL framework'strain.pyin fifteen minutes. - Comparison across RL framework designs. Then read AReaL, OpenRLHF, SkyRL, and RLinf. Now you're not learning the skeleton — you're learning the variations. AReaL shows fully-async; OpenRLHF shows the classical synchronous Ray + DeepSpeed + vLLM design; SkyRL specializes in long-horizon agents; RLinf adds the M2Flow scheduler for embodied and agentic workloads. The point of group 2 is to see where the same skeleton can bend.
- Inference and compiler substrate. Next read vLLM, TensorRT-LLM, TensorRT, Triton, TileLang. These are not RL frameworks. They are the layer beneath. The goal is to understand why prefix caches, paged attention, CUDA graphs, kernel DSLs, quantization, and MoE communication kernels determine rollout economics. After group 3, you will read RL papers differently — you'll see that "we used SGLang" or "we used vLLM" is a substantive experimental claim, not a deployment detail.
- Rollout targets beyond chat. Then FLUX, FLUX.2, Wan2.2. These are not RL frameworks either — they are the targets RL frameworks might want to fine-tune. FLUX.2's reference-token KV-cache trick and Wan2.2's video-diffusion MoE both are useful systems case studies in their own right. The point is to understand what changes when the target is not an autoregressive LLM.
- Failure modes and meta-reading. Finish with the Cambrian TPU blog and Chenyang Zhao's Awesome-ML-SYS-Tutorial. Cambrian tells you why almost none of the GPU-RL stack ports to TPUs. The tutorial tells you which papers' results to distrust until you have verified the infrastructure. These two are the reality check at the end of a long reading binge.
Five groups, nineteen repos. If you spend a real week on group 1 and then a few days on each of groups 2 through 5, you finish with a working mental model of the field's plumbing. Without group 1, the others read like trivia. With group 1, they read like commentary on a structure you already understand.
A ten-question template for each repo
I have used the same template for every case study in this survey. It is short enough to apply in an evening, structured enough to make the reading comparable across repos. The questions are deliberately ordered from outside to inside, so you build context before you dive into specifics.
| # | Question | Where to look |
|---|---|---|
| 1 | What is this repo, really? Is it a target model, a framework, a backbone, an inference engine, or a kernel substrate? | README + top-level package layout |
| 2 | Directory map — what lives where? README, docs, examples, scripts, tests, src. | One-shot ls -R with note-taking |
| 3 | Entry scripts — what's the training entry, the inference entry, the benchmark entry, the deployment entry? | scripts/ and the README's "Quick Start" |
| 4 | Core objects — name the trainer, rollout manager, scheduler, engine, worker, cache, model runner. | The __init__.py exports of the top-level package |
| 5 | One sample's path — trace a single prompt/request from input through rollout, reward, loss, update, sync. | The main loop file; trace the variable names |
| 6 | Engineering invariants — memory ownership, weight freshness, cache invalidation, logprob recompute, precision consistency. | The asserts and concurrency contracts |
| 7 | Parallelism and resources — TP/PP/DP/EP/CP, FSDP, DeepSpeed, Ray placement, NCCL/RDMA, offload. | The config files; the placement-group code |
| 8 | Known failure modes — OOM, staleness, routing divergence, NCCL hangs, cache staleness, dynamic shapes, precision drift. | Issues, CHANGELOG, comments tagged "TODO" or "WARN" |
| 9 | Mapping to this survey's primitives — which of the six engineering primitives does it implement, and which does it skip? | Cross-reference with the design patterns section above |
| 10 | The ten files most worth reading — in walking order, outside to inside. | Your own notes from steps 3-6 |
This template is what the Miles case study in this survey looks like under the hood. The deep dive into Miles is just questions 1-10 answered for one repo. Once you have the answers for one repo, the answers for the next repo are easier to find — you know what to look for.
The point of a template is not that it teaches you the answer. The point is that it makes the unknown answers visible. Two pages of "I don't know yet, but I know where to look" beats ten pages of disorganized notes from skimming.
Why this matters more than any single answer
The honest reason to learn this corpus is not that you will read all nineteen repos. The reason is that you will have to read one of them, sooner than you expect, because something in your own work depends on it. When that happens, knowing the five-group structure and the ten-question template is what turns "I have to read a giant unfamiliar codebase" into "I have a routine." A theorist who has practiced this routine is hard to surprise.
The corpus is a vocabulary, not a curriculum. The five groups are the rough order in which the vocabulary becomes useful. The ten questions are how you make any single new repo legible. Together they are the most portable thing in this survey — they outlast any specific framework or paper, because the structure of the field changes more slowly than its surface.
Open questions at the systems–theory boundary
A short list of unresolved questions where a mathematically-minded RL person could plausibly contribute. These are not "research suggestions" so much as the cluster of problems the survey's content gestures at without solving.
TPU-native RL infrastructure
The Cambrian-MLLM blog establishes that almost none of the GPU-RL stack ports to TPUs cleanly. Static shapes, XLA compilation, the absence of NCCL, the awkwardness of Ray on TPU pods — all of these forbid the patterns the survey covers. The question is whether a different set of patterns can deliver the same outcome on TPU hardware: tile-based parallelism, AOT-compiled rollout pipelines, JAX-native weight sync. No framework today fills this gap. Building one is at least a year of senior engineering, but the design space is genuinely open.
Formal verification of weight-sync correctness
TileLang's Z3 theorem-prover integration into the TVM arithmetic analyzer is the first credible attempt to bring SMT-style verification into the GPU DSL. The weight-sync contract — that the inference engine's Python tensors point to the same physical memory the trainer wrote, with no aliasing or stride mismatches — is the kind of contract that would benefit from a machine-checked proof. Current implementations rely on careful code review plus runtime asserts. A formal treatment is missing.
Tight off-policy bounds under partial rollout
TIS and MIS truncate or mask importance weights to control variance, but the bias-variance tradeoff is hand-tuned. The theoretical question: under what conditions on the staleness distribution does partial rollout converge to the same fixed point as strict on-policy training, with how much added variance? Miles' TIS/MIS implementation is a starting point; a tight analysis would let frameworks set the truncation threshold automatically rather than as a hyperparameter.
MoE routing under quantization — beyond R3
R3 (Rollout Routing Replay) is a fix, not a theorem. The deeper question: under what conditions on the routing computation does a precision-induced top-k flip materially affect downstream learning? If the answer is "always," R3 is permanent. If the answer is "only when the expert affinities are within ε of each other," there might be a quantization scheme that preserves routing without explicit replay. This is the kind of question a numerical analyst could answer.
Theoretical basis for DDRL
cosmos-rl's Data-Regularized DRL replaces the KL term in standard PPO with a reward maximization objective plus the standard diffusion training loss. The empirical results are good; the theoretical basis is light. What's the corresponding policy-improvement guarantee? Under what conditions does the diffusion loss act as an implicit KL regularizer? The video-generation RL frontier needs this analysis to mature.
Multi-agent RL as a category of interacting policies
RLinf's multi-agent support (rStar2, WideSeek-R1) and the broader multi-agent RL literature lack a clean systems abstraction. The category-theoretic framing — agents as objects, message-passing as morphisms — is intuitive but doesn't yet correspond to a framework primitive. A mathematically clean abstraction here would have systems consequences.
The diffusion–autoregressive bridge
cosmos-rl supports both diffusion world models (Cosmos-Predict, SANA, SD3) and autoregressive LLMs (Qwen, LLaMA, DeepSeek) — but they go through fundamentally different inference paths (the diffusers backend vs vLLM). The two paradigms increasingly need to interact: video generation conditioned on language, robot policies that output both continuous actions and language explanations. A unified RL framework that treats both as first-class is an open systems problem.
Each of these is a real research direction, not a textbook exercise. The field's progress is increasingly going to depend on people who can think across the systems–theory boundary — which is the audience this survey is written for.
A reading list for the theoretically inclined
Ordered not by sequence but by what you want to understand. Pick a row.
| If you want to understand... | Read |
|---|---|
| The mental model behind every modern RL framework | HybridFlow paper (verl/EuroSys 2025) |
| The connective tissue between the abstract algorithm and the engineering | Chenyang Zhao's tutorial (中文, the canonical reference) |
| What makes inference engines fast | PagedAttention paper (vLLM/SOSP 2023) · SGLang paper · LMSys RadixAttention announcement |
| Why GRPO works (and where group size matters) | DeepSeekMath / GRPO paper |
| What MoE routing replay actually preserves | Miles docs on R3 and the FP8 pipeline |
| How to write a fast GPU kernel without learning CUDA proper | Triton tutorials · Sasha Rush's Triton Puzzles |
| What 5D parallelism looks like in practice | Megatron Core parallelism guide |
| One framework's source code, end to end | verl's ray_trainer.py |
| The mathematician's case for caring about systems | (this survey) |
The 19 source repos surveyed
verl · miles · slime · SkyRL · cosmos-rl · RLinf · SGLang · vLLM · AReaL · OpenRLHF · Megatron-LM · Triton · TileLang · cuda-python · TensorRT · TensorRT-LLM · FLUX · FLUX2 · Wan2.2 · Cambrian TPU blog