
On-Policy Distillation

How on-policy distillation combines the reliability of on-policy training with the dense reward signal of distillation — achieving frontier performance at a fraction of RL cost.

LLMs are capable of expert performance in focused domains, a result of several capabilities stacked together: perception of input, knowledge retrieval, plan selection, and reliable execution. This requires a stack of training approaches, which we can divide into three broad stages:

  • Pre-training teaches general capacities such as language use, broad reasoning, and world knowledge.
  • Mid-training imparts domain knowledge, such as code, medical databases, or internal company documents.
  • Post-training elicits targeted behavior, such as instruction following, reasoning through math problems, or chat.

Smaller models with stronger training often outperform larger, generalist models in their trained domains of expertise. There are many benefits to using smaller models: they can be deployed locally for privacy or security considerations, can continuously train and get updated more easily, and save on inference costs. Taking advantage of these requires picking the right approach for the later stages of training.

Approaches to post-training a "student" model can be divided into two kinds:

  • On-policy training samples rollouts from the student model itself, and assigns them some reward.
  • Off-policy training relies on target outputs from some external source that the student learns to imitate.

For example, we may wish to train a compact model to solve math questions. We can do on-policy training via reinforcement learning, by grading each student rollout on whether it solves the question. This grading can be done by a human, or by a "teacher" model that reliably gets the correct answer.

The strength of on-policy training is that by training on samples from itself, the student learns to avoid mistakes in a more direct way. But RL has a major downside: it provides very sparse feedback, teaching a fixed number of bits per training episode regardless of the number of tokens used. The student learns that a wrong answer is wrong and updates away from producing the rollout it tried. But it doesn't learn where exactly the mistake was made, whether it got the order of operations wrong or erred in the arithmetic itself. This sparsity of feedback makes RL inefficient for many applications.
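To make the sparsity concrete, here is a minimal, hypothetical sketch of outcome-based credit assignment for a single rollout (the helper signature is ours, not from any particular library): one scalar grade multiplies the log-probability of every token the student produced, so the update cannot tell which step actually caused the failure.

```python
import torch

def sparse_rl_loss(student_logprobs: torch.Tensor, is_correct: bool) -> torch.Tensor:
    """REINFORCE-style loss for one rollout with a single pass/fail reward.

    student_logprobs: [seq_len] log-probs of the tokens the student itself sampled.
    The same scalar reward scales every token's log-prob, so a 10,000-token
    rollout still carries only on the order of one bit of feedback.
    """
    reward = 1.0 if is_correct else -1.0
    return -(reward * student_logprobs.sum())
```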

Off-policy training is often done with supervised fine-tuning (SFT): training on a curated set of task-specific labeled examples. The source of these labeled examples can be a teacher model that is proven to perform well on the task at hand.

We can use a mechanism called distillation: training the student to match the output distribution of a teacher model. We train on teacher trajectories: the complete sequence of generated tokens, including intermediate thinking steps. We can use the teacher's full next-token distribution at each step (often called "logit distillation") or just the sampled sequences themselves. In practice, sampling sequences provides an unbiased estimate of the teacher's distribution and arrives at the same objective. The student updates towards each token in the sequence in proportion to how unlikely it was to generate that token itself.
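As a minimal sketch of this sequence-level, off-policy distillation (tensor shapes and names are our own, not the exact training code): given teacher-sampled token ids and the student's logits at the same positions, the loss is ordinary next-token cross-entropy, and the per-token values measure how surprising each teacher token was to the student.

```python
import torch
import torch.nn.functional as F

def off_policy_distill_loss(student_logits: torch.Tensor, teacher_tokens: torch.Tensor):
    """Sequence-level distillation on a teacher-sampled trajectory.

    student_logits: [seq_len, vocab] student predictions at each position
    teacher_tokens: [seq_len] token ids the teacher actually generated
    The per-token loss is -log pi_student(teacher token), largest on tokens
    the student was unlikely to produce itself.
    """
    per_token_nll = F.cross_entropy(student_logits, teacher_tokens, reduction="none")
    return per_token_nll.mean(), per_token_nll  # scalar loss, per-token "surprise"
```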

The drawback of off-policy training is that the student learns in contexts frequented by the teacher, not ones the student itself will often find itself in. This can cause compounding error: if the student makes an early mistake that the teacher never makes, it finds itself diverging ever farther from the states it observed in training. This problem becomes particularly acute when we care about the student's performance on long sequences. To avoid this divergence, the student must learn to recover from its own mistakes.

If you're learning to play chess, on-policy RL is analogous to playing games with no coaching. The feedback of winning or losing a match is tied directly to your own play, but is received only once per match and doesn't tell you which moves contributed most to the outcome. Off-policy distillation is analogous to watching a grandmaster playing — you observe extremely strong chess moves, but they are played in board states that a novice player will rarely find themselves in.

We want to combine the on-policy relevance of RL with the dense reward signal of distillation. For learning chess, this would be a teacher that grades each of your own moves on a scale from "blunder" to "brilliant". For LLM post-training, it's on-policy distillation.

On-policy distillation — best of both worlds

The core idea of on-policy distillation is to sample trajectories from the student model and use a high-performing teacher to grade each token of each trajectory. Returning to our math example above, on-policy distillation would score each step of the solution — punishing the mistakes that caused the student to arrive at the wrong answer while reinforcing the ones that were executed correctly.

| Method | Sampling | Reward signal |
|------------------------|------------|---------------|
| Supervised fine-tuning | off-policy | dense |
| Reinforcement learning | on-policy | sparse |
| On-policy distillation | on-policy | dense |

Loss function: reverse KL

On-policy distillation can use a variety of loss functions for grading the student's trajectories. For simplicity, we choose the per-token reverse KL — the divergence between the student's \(\pi_\theta\) and teacher's \(\pi_\text{teacher}\) distributions for each token, conditioned on the same prior trajectory:

\[
\text{KL}\Bigl(\pi_\theta \,\lVert\, \pi_\text{teacher}\Bigr) = \mathbb{E}_{x \sim \pi_\theta} \Bigl[ \log \pi_\theta(x_{t+1} \mid x_{1..t}) - \log \pi_\text{teacher}(x_{t+1} \mid x_{1..t}) \Bigr]
\]

We train the student to minimize this reverse KL, which pushes it to approximate the teacher's behavior in every state the student finds itself in. When the student behaves identically to the teacher, the reverse KL is zero. For simplicity, we use a discount factor of zero: at any given timestep, the student only optimizes the immediate next token, with no consideration for future tokens.
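In code, a minimal sketch of this estimator (shapes and names are our own): given the log-probabilities both models assign to the tokens the student sampled, the per-token reverse KL is simply their difference, and its negation can serve as the per-token reward.

```python
import torch

def per_token_reverse_kl(student_logprobs: torch.Tensor,
                         teacher_logprobs: torch.Tensor) -> torch.Tensor:
    """Monte-Carlo estimate of the per-token reverse KL on a student-sampled trajectory.

    Both inputs are log-probs of the *student's own* sampled tokens, shape [seq_len].
    Non-negative in expectation; zero when the student already matches the teacher.
    """
    return student_logprobs - teacher_logprobs
```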

Reverse KL has natural synergy with RL, which generally optimizes a form of sequence-level reverse KL induced by the reward model. However, unlike most reward models in practice, the reverse KL is "unhackable" in the sense that low KL always corresponds to a high probability of desirable behavior from the teacher model's point of view. Two other useful properties of reverse KL are that it is "mode seeking" — it learns one specific behavior (the teacher's) instead of spreading its distribution across several suboptimal options — and it reduces exposure bias.

This approach offers significant compute savings. Since it doesn't require a rollout to finish sampling to calculate the reward, we can use shorter or partial rollouts for training. Querying the teacher's log probabilities also requires just a single forward pass from the larger model, while the trajectories are generated by the smaller and cheaper student.
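Putting the pieces together, a hedged sketch of one training step (the `sample` and `logprobs` interfaces are placeholder wrappers, not any specific library's API): the student generates possibly-partial rollouts, the teacher is queried once for its log-probs, and the negative per-token reverse KL is used as a detached reward with a discount of zero.

```python
import torch

def on_policy_distillation_step(student, teacher, prompts, optimizer, max_tokens=256):
    """One illustrative update; `student.sample`/`.logprobs` and `teacher.logprobs`
    are assumed wrappers returning token ids and per-token log-probabilities."""
    # 1. The cheap student generates (possibly partial) rollouts.
    tokens = student.sample(prompts, max_tokens=max_tokens)       # [batch, seq]
    # 2. A single teacher forward pass gives its log-probs on those same tokens.
    with torch.no_grad():
        teacher_lp = teacher.logprobs(tokens)                     # [batch, seq]
    # 3. Student log-probs of its own tokens, this time with gradients.
    student_lp = student.logprobs(tokens)                         # [batch, seq]
    # Per-token reward = negative reverse KL (detached); with discount 0 each
    # token is reinforced only by its own reward, policy-gradient style.
    reward = -(student_lp.detach() - teacher_lp)
    loss = -(reward * student_lp).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```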

Distillation for reasoning

We use distillation to train mathematical reasoning in the Qwen3-8B-Base model, using Qwen3-32B as a teacher model.

Off-policy distillation

All experiments start with mid-training in the form of off-policy distillation — supervised fine-tuning on a dataset of teacher-generated examples. The dataset used for mathematical reasoning is OpenThoughts-3, a collection of reasoning prompts and responses generated by QwQ-32B (a reasoning model similar to Qwen3-32B).

Training the student (Qwen3-8B-Base) on 400k prompts with full fine-tuning achieves a score of 60% on AIME'24, a benchmark of math problems. We can also train with LoRA, though it lags behind full fine-tuning when training on high-volume datasets. In all cases, we see performance increase log-linearly — the initial performance gains are cheap, but later gains become costly.

We can treat the model fine-tuned on 400k prompts as a checkpoint before trying various post-training approaches to increase its performance. We can compare the effort it would take to raise the score on the AIME'24 benchmark from 60% to 70%.

Reinforcement learning

The Qwen3 technical report reaches performance of 67.6% on the benchmark using 17,920 GPU hours of RL on top of a similar SFT initialization. It is hard to compare this directly to the cost of distillation, but given some reasonable assumptions about the SFT training stack, this is similar to the cost of training on 2M off-policy distillation prompts.

| Method | AIME'24 | GPQA-Diamond | GPU Hours |
|---------------------------|---------|--------------|------------|
| Off-policy distillation | 55.0% | 55.6% | Unreported |
| + Reinforcement learning | 67.6% | 61.3% | 17,920 |
| + On-policy distillation | 74.4% | 63.3% | 1,800 |

The Qwen team also reports reaching a higher score of 74.4% on AIME'24 at one-tenth the cost of RL with on-policy distillation.

On-policy distillation

Starting from the 400k SFT checkpoint, on-policy distillation achieves an AIME'24 of 70% in about 150 steps.

Comparing compute costs across methods is nontrivial, as the ratio of training vs sampling vs log-prob computation cost varies significantly depending on implementation. Below, we compute the cost in terms of FLOPs which penalizes methods that can be effectively parallelized on GPUs.

| Method | AIME'24 | Teacher FLOPs | Student FLOPs | Compute efficiency vs. SFT-2M |
|---------------------------|----------------------|---------------|---------------|-------------------------------|
| Initialization: SFT-400K | 60% | 8.5 × 10²⁰ | 3.8 × 10²⁰ | – |
| SFT-2M (extrapolated) | ~70% (extrapolated) | 3.4 × 10²¹ | 1.5 × 10²¹ | 1× |
| Reinforcement learning | 68% | – | – | ≈1× |
| On-policy distillation | 70% | 8.4 × 10¹⁹ | 8.2 × 10¹⁹ | 9-30× |

We find a baseline cost reduction of 9x when the SFT dataset is given, or is amortized across many training runs. Since this computation can be cheaply parallelized across GPUs, the cost reduction in GPU hours is closer to 18x.

However, we often want to train a small model on a new task for which no off-policy distillation dataset is available. If we include the full cost of the teacher model in off-policy distillation, the total cost reduction is about 30x.

Distillation for personalization

In addition to training small models to high performance on common tasks, another use-case for distillation is personalization. Examples include adhering to a particular tone in conversation and format of output, or capabilities like tool use and cost budgeting. We often want to train this behavior in combination with new domain knowledge.

Training an internal assistant

A common objective for a custom model is to act as an assistant: to possess expert knowledge in some field and, in addition, reliable assistant-like behavior. Our example is an internal company assistant, for which we have two desiderata:

  1. The model is knowledgeable about the domain (company documents).
  2. The model exhibits strong post-training behavior, i.e. instruction following.

Training on new knowledge degrades learned behavior

We start with Qwen3-8B, rather than the base model. Qwen3-8B is post-trained on useful skills for an assistant, such as instruction following and reasoning with RL. Prior research has shown that such reinforcement learning only trains small subnetworks of the original model, and thus can be fragile when the network is further trained on a large amount of data.

In order to reduce catastrophic forgetting, a common approach in mid-training is to mix in "background data" from the original model's pretraining distribution. We take Tulu3 prompts — a broad chat and instruction-following dataset — and re-sample them with Qwen3-8B in order to act as chat background data.

This "on-policy" background data sampled by Qwen3-8B acts as a forwards KL regularizer, reinforcing the model's original behavior throughout mid-training. We then fine-tune Qwen3-8B on different mixes of our internal documents and chat data. Increasing the proportion of document data straightforwardly improves the model's knowledge. However, although mixing in at least 30% of chat data helps preserve most instruction-following ability, there is no weighting which maintains the original performance on IF-eval.

On-policy distillation recovers post-training behavior

Next, we seek to restore instruction-following behavior after fine-tuning on internal documents. Instead of using expensive RL, we run on-policy distillation with the earlier version of the model, Qwen3-8B, as the teacher on Tulu3 prompts.

The use of an earlier version of the model as a teacher to "re-invoke" capabilities lost during fine-tuning makes on-policy distillation very promising for continuous learning. We could alternate between phases of fine-tuning on new data and distillation to recover behavior to allow our model to learn and stay up-to-date on knowledge over time.

After fine-tuning on a 70-30 mix of internal document data and chat data, on-policy distillation recovers nearly full performance on IF-eval without losing any knowledge; we also observe some positive transfer between chat capabilities and the model's "knowledge" performance.

| Model | Internal QA Eval (Knowledge) | IF-eval (Chat) |
|-----------------------------|------------------------------|----------------|
| Qwen3-8B | 18% | 85% |
| + midtrain (100%) | 43% | 45% |
| + midtrain (70%) | 36% | 79% |
| + midtrain (70%) + distill | 41% | 83% |

Dense supervision greatly improves compute efficiency

Reinforcement learning and on-policy distillation both learn via reverse KL, pruning the space of actions present in the base policy. The difference is in the density of reward. Reinforcement learning only teaches \(O(1)\) bits per episode. In contrast, distillation teaches \(O(N)\) bits per episode, where \(N\) is the number of tokens.
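As a purely illustrative back-of-envelope (the episode length here is a stand-in, not a measured figure): for a reasoning trace of \(N = 10{,}000\) tokens,

\[
\underbrace{O(1)\ \text{bits}}_{\text{RL: one pass/fail signal per episode}}
\quad\text{vs.}\quad
\underbrace{O(N)\ \text{bits}}_{\text{distillation: one graded token per step}}
\;\Rightarrow\; \text{roughly } 10{,}000\times \text{ denser feedback per episode.}
\]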

We ran an experiment for making a direct comparison between the two:

  1. Start with Qwen3-8B-Base (no additional SFT).
  2. Run RL on DeepMath. We use a LoRA rank of 128. The resultant model is the teacher for distillation.
  3. On-policy distill from the RL-trained model (2) back into the base model (1).

We see that distillation reaches the teacher's level of performance approximately 7-10x faster than RL with matched model architecture (LoRA rank 128). The reverse KL decreases to near-zero and the AIME score is recovered in under 10 gradient steps, while RL took 70 steps to reach that level.

Cumulatively, the reduction in compute required is on the order of 50-100x:

  • While RL requires training at approximately the evaluation context, distillation learns reasonably at shorter context lengths, as there is no sharp cutoff in reward between a trajectory which has finished sampling and a trajectory that continues.
  • When the SFT initialization is strong, on-policy distillation works effectively with much smaller batch sizes since it provides significantly more bits per episode, thereby reducing gradient noise.

RL searches in the space of semantic strategies

We have seen that on-policy distillation can replicate the learning provided by RL with far fewer training steps. One interpretation of this result is that, unlike pre-training, RL doesn't spend a lot of compute on the gradient steps themselves. We should think of RL as spending most of its compute on search — rolling out a policy and assigning credit — rather than on making updates.

Pre-training via stochastic gradient descent explores a high-dimensional parameter space. Pre-training requires a vast amount of information and is very difficult to distill, in part because the parameter space is somewhat unique to each network. The gradient steps required for pre-training are extremely computationally expensive and time-consuming.

In contrast, we should think of RL as exploring the space of semantic strategies. At every step, RL tries a small modification of some strategy it has found in the past. Rather than exploring in the parameter space, it "stumbles" onto new strategies by chance, sampling from the policy it already has.

Once a good strategy is found, distillation serves as a shortcut for learning it: on-policy distillation does not need to model the intermediate strategies during the curriculum of RL, but rather only the final strategy learned. If we are only interested in the final strategy (common in production use-cases), we need not spend the compute to model all the intermediate ones.

Consider an analogy: in science research, we spend a lot of time and resources looking for answers and exploring new ideas. Once a result is discovered, it is much simpler to teach it to others by expressing it in natural language. We can contrast this to intuitive physical skills such as playing a sport. They are much harder to teach to others, since the knowledge exists in an innate language (e.g., muscle memory) that is only readily understood by ourselves. Sports are only learned with repeated practice.

On-policy learning as a tool for continual learning

In the section on distillation for personalization, we explored the ability of on-policy distillation to re-introduce specialized trained behaviors into the model. This generalizes to a broader set of continual learning tasks, which require acquiring new knowledge without degrading prior capabilities.

What happens when we run SFT on a dataset of a model's own samples? We see that any practical learning rate that's greater than zero leads to a degradation of performance on the instruction-following evaluation!

A possible explanation is that while the KL divergence is zero in expectation, every finite batch exhibits a slightly different distribution in practice. Training on these finite batches produces non-zero gradient updates, which gradually move the updated model's policy away from its original state. Over time, training on one's own samples thus becomes off-policy training, with the same error accumulation and divergence over long sequences.
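A toy illustration of this effect (three-token vocabulary, synthetic numbers): the empirical distribution of a finite batch of the model's own samples differs from its true distribution, so the cross-entropy gradient on that batch is non-zero and each step nudges the policy away from itself.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])                 # model's current next-token distribution

# A finite batch of the model's own samples; its empirical distribution differs from p.
batch = rng.choice(len(p), size=32, p=p)
q = np.bincount(batch, minlength=len(p)) / len(batch)

# For softmax logits, the cross-entropy gradient w.r.t. the logits is (p - q):
# zero only if the batch exactly matches p, which a finite batch almost never does.
grad = p - q
print("empirical batch distribution:", q)
print("gradient on the logits:      ", grad)
```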

On-policy distillation always trains on-policy, and because the teacher stays fixed, the student converges on the teacher's desirable behavior rather than drifting away from it, as SFT on its own samples does. This makes on-policy distillation a very promising tool for continual learning.

Conclusion

We have explored the application of on-policy distillation for applications such as training a small model for math reasoning or a continuously-learning assistant. We compared on-policy distillation to two other approaches to post-training: off-policy distillation, and on-policy RL. We find that on-policy distillation combines the best of both worlds: the reliable performance of on-policy training, with the cost-efficiency of a dense reward signal.

Post-training is a crucial part of reaching frontier model capabilities. By leveraging on-policy sampling from the student with dense supervision from a teacher, the on-policy distillation recipe reaches those capabilities at a fraction of the cost of frontier high-compute RL runs.