...

How to build custom reasoning agents with a fraction of the compute

Key Highlights:

  • Researchers at JD.com and several academic institutions introduced Reinforcement Learning with Self-Distillation (RLSD), which pairs verifiable rewards with token-level feedback from a self-teacher.
  • The verifiable reward sets the direction of each update, while the self-teacher, run with privileged context, distributes credit or blame across individual tokens, avoiding both GRPO’s sparse feedback and OPSD’s information leakage.
  • Trained on Qwen3-VL-8B, RLSD reached 56.18% average accuracy across five visual reasoning benchmarks, beating the base model by 4.69% and standard RLVR by 2.32%, and converged roughly twice as fast as GRPO.
  • The only overhead over a standard GRPO pipeline is one extra forward pass per response, and integration into frameworks such as veRL or EasyR1 takes tens of lines of code.
  • Enterprises can use proprietary data (compliance manuals, internal documentation, verified code) as privileged context without an external teacher and without sending data outside the network.

Training AI reasoning models demands resources that most enterprise teams do not have. Engineering teams are often forced to choose between distilling knowledge from large, expensive models or relying on reinforcement learning techniques that provide only sparse feedback.

Researchers at JD.com and several academic institutions recently introduced a new training paradigm that sidesteps this dilemma. The technique, called Reinforcement Learning with Verifiable Rewards and Self-Distillation (RLSD), combines the reliable outcome signal of reinforcement learning with the granular feedback of self-distillation. Experiments indicate that models trained with RLSD outperform those built on classic distillation and reinforcement learning algorithms. For enterprise teams, the approach lowers the technical and financial barriers to building custom reasoning models tailored to specific business logic.

The problem with training reasoning models

The standard method for training reasoning models is Reinforcement Learning with Verifiable Rewards (RLVR). In this paradigm, the model learns through trial and error, guided by a final outcome from its environment: an automated verifier checks whether the model’s answer is right or wrong and provides a binary reward, such as 0 or 1.

Figure: Reinforcement learning with verifiable rewards (RLVR)

RLVR suffers from sparse and uniform feedback. “Standard GRPO has a signal density problem,” Chenxu Yang, co-author of the paper, told VentureBeat. “A multi-thousand-token reasoning trace gets a single binary reward, and every token inside that trace receives identical credit, whether it’s a pivotal logical step or a throwaway phrase.” Consequently, the model never learns which intermediate steps led to its success or failure.

On-Policy Distillation (OPD) takes a different approach. Instead of waiting for a final outcome, developers pair a smaller student model with a larger, more capable teacher model. For each training example, the student compares its response to the teacher’s token by token, which gives it granular feedback on the entire reasoning chain and response-generation process.

The drawback is cost. Deploying and running a separate, massive teacher model alongside the student throughout training incurs heavy computational overhead. “You have to keep a larger teacher model resident throughout training, which roughly doubles your GPU footprint,” Yang said. Furthermore, the teacher and student models must share the exact same vocabulary structure, which, according to Yang, “quietly rules out most cross-architecture, cross-modality, or multilingual setups that enterprises actually run.”

Figure: On-policy distillation (OPD)
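To make the signal-density contrast concrete, here is a minimal PyTorch sketch of the two credit schemes described above. It is illustrative only, not code from the paper; the function names, tensor shapes, and the use of a per-token reverse KL for the distillation signal are assumptions.

```python
import torch
import torch.nn.functional as F

def grpo_token_signal(reward: float, group_mean: float, group_std: float,
                      seq_len: int) -> torch.Tensor:
    """GRPO-style credit: one normalized, sequence-level advantage,
    copied unchanged onto every token of the reasoning trace."""
    advantage = (reward - group_mean) / (group_std + 1e-6)
    return torch.full((seq_len,), advantage)  # identical credit per token

def opd_token_signal(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor) -> torch.Tensor:
    """OPD-style credit: a per-token divergence between the student's and
    the external teacher's next-token distributions (dense feedback, but
    it requires keeping the large teacher in memory)."""
    student_logp = F.log_softmax(student_logits, dim=-1)  # [seq_len, vocab]
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)  # [seq_len, vocab]
    # KL(student || teacher) at each position
    return (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
```

Under the first scheme, a pivotal deduction and a throwaway phrase receive the same scalar; under the second, every position gets its own signal, but only at the cost of keeping the larger teacher resident throughout training.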
The promise and failure of self-distillation

On-Policy Self-Distillation (OPSD) emerged as a solution designed to overcome the shortcomings of both approaches. In OPSD, the same model plays the role of student and teacher. During training, the student receives a standard prompt while the teacher receives privileged information, such as a verified, step-by-step answer key. This better-informed teacher version of the model then evaluates the student, providing token-by-token feedback as the student tries to solve the problem using only the standard prompt.

OPSD appears to be the perfect compromise for an enterprise budget. It delivers the granular, step-by-step guidance of OPD, and because it eliminates the need for an external teacher model, it operates with the computational efficiency and low cost of RLVR, requiring only one extra forward pass for the teacher.

However, the researchers found that OPSD suffers from a phenomenon called “privileged information leakage.”

“The objective is structurally ill-posed,” Yang said. “There’s an irreducible mutual-information gap that the student can never close… When self-distillation is set up as distribution matching, the student is asked to imitate the teacher’s full output distribution under privileged context.”

Figure: On-policy self-distillation (OPSD)

Because the teacher evaluates the student against a hidden answer key, the training objective forces the student to learn the teacher’s exact phrasing or steps instead of the underlying reasoning logic. As a result, the student starts hallucinating references to an invisible solution that it will not have access to in a real-world deployment. In practice, OPSD models show a rapid spike in performance early in training, but their reasoning capabilities soon plateau and progressively degrade.

Decoupling direction from magnitude with RLSD

The researchers behind RLSD realized that the signals governing how a model updates its parameters have fundamentally asymmetric requirements. The signal dictating the direction of the update (whether to reinforce or penalize a behavior) can be sparse, but it must be perfectly reliable, because pointing the model in the wrong direction damages its reasoning policy. The signal dictating the magnitude of the update (how much relative credit or blame a specific step deserves) benefits from being extremely dense, enabling fine-grained, step-by-step corrections.

RLSD builds on this principle by decoupling the update direction from the update magnitude. The verifiable environmental feedback of the RLVR signal strictly determines the direction of learning: the model receives overall reinforcement only if the final answer is objectively correct.

Figure: Reinforcement learning with self-distillation (RLSD) (source: arXiv)

The self-teacher is stripped of its power to dictate what the model should generate. Instead, its token-by-token assessment is repurposed to determine the magnitude of the update, distributing the total credit or blame across the individual steps of the model’s reasoning path.

This changes how the model learns compared with classic OPSD. In standard OPSD, the training objective acts like behavioral cloning: the model is forced to copy the exact wording and phrasing of the teacher, which causes the student to hallucinate and leak references to data it does not have. Instead of forcing the model to copy a hidden solution, RLSD provides a natural and virtually cost-free source of per-token credit information.

“The intuition: we’re not teaching the model to reason like the teacher,” Yang said. “We’re telling the model, on the path it chose, which of its own tokens were actually doing the work. The model’s exploration distribution stays its own. Only the credit allocation gets sharpened.”

If a specific deduction strongly supports the correct outcome, it receives a higher score. If it is just a useless filler word, it receives a baseline score.
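A minimal sketch of how that decoupling could look in code, reconstructed from the description above rather than taken from the paper; the way the teacher scores are normalized into weights is an assumption.

```python
import torch

def rlsd_token_credit(reward: float, group_mean: float, group_std: float,
                      teacher_token_logprobs: torch.Tensor) -> torch.Tensor:
    """Illustrative RLSD-style credit assignment.

    Direction: a sequence-level advantage from the verifiable reward decides
    whether the whole trace is reinforced or penalized.
    Magnitude: per-token scores from the self-teacher (the same model run
    with privileged context) decide how that credit or blame is spread
    across the student's own tokens.
    """
    # Sparse but reliable: the sign comes only from the verified outcome.
    advantage = (reward - group_mean) / (group_std + 1e-6)

    # Dense but used only as weights: how strongly the teacher endorses
    # each token the student actually generated.
    weights = torch.softmax(teacher_token_logprobs, dim=-1)
    weights = weights * weights.numel()  # rescale so the average weight is 1

    return advantage * weights  # pivotal tokens get more credit or blame
```

The teacher never supplies a target distribution to imitate; it only reweights credit on tokens the student already produced, which is why the leakage problem of OPSD does not arise.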
RLSD eliminates the need to train complex auxiliary reward networks, manually annotate step-by-step data, or maintain massive external teacher models.

Putting RLSD to the test

To test RLSD, the researchers trained the open-weight Qwen3-VL-8B vision-language model and evaluated it on several visual reasoning benchmarks: MMMU for college-level multi-discipline questions, MathVista, MathVision, WeMath, and ZeroBench, a stress-test benchmark explicitly designed to be nearly impossible for current frontier models. They compared the RLSD model against the base model with no post-training, standard RLVR via the GRPO algorithm, standard OPSD, and a hybrid combination of the two.

RLSD significantly outperformed every other method, achieving the highest average accuracy of 56.18% across all five benchmarks. It beat the base model by 4.69% and outperformed standard RLVR by 2.32%. The gains were most pronounced in complex mathematical reasoning tasks, where RLSD outperformed standard RLVR by 3.91% on the MathVision benchmark.

Figure: RLSD outperforms other techniques on key benchmarks (source: arXiv)

Beyond accuracy, the framework offers substantial efficiency gains. “Concretely, RLSD at 200 training steps already beats GRPO trained for 400 steps, so roughly 2x convergence speedup,” Yang said. “Cost-wise, the only overhead beyond a normal GRPO pipeline is one extra forward pass per response to grab teacher logits. Compared to rollout generation… that’s basically free.”

Unlike OPSD, whose performance spiked and then collapsed due to information leakage, RLSD maintained long-term training stability and converged on a higher performance ceiling than the standard methods.

The qualitative findings show how the model changes its learning behavior. In a complex visual counting task, for example, standard RLVR looks at the final correct answer and gives the entire paragraph of reasoning tokens the same reward. RLSD instead applied rewards surgically to the specific subtraction steps that solved the problem, while actively down-weighting generic filler text such as “Looking at the image, I see…”.

In another example, the model performed an incorrect derivation based on a bar chart. Instead of labeling the whole response a failure, RLSD concentrated the heaviest penalty on the exact point where the model misread a relationship from the chart and remained neutral on the rest of the logical setup, recognizing that the initial framework was valid.

This is particularly important for messy, real-world enterprise use cases. If a model makes a mistake analyzing a 50-page quarterly earnings report, developers do not want it to unlearn its entire analytical framework; they just want it to fix the specific assumption it got wrong. RLSD lets the model learn exactly which logical leaps are valuable and which are flawed, token by token. And because it does so by repurposing the model itself, it delivers granular reasoning supervision while keeping training costs reasonable.

How enterprises can get started

For data engineers and AI orchestration teams, integrating RLSD is straightforward, but it requires the right setup. The most critical requirement is a verifiable reward signal, such as code compilers, math checkers, SQL execution, or schema validators. “Tasks without verifiable reward (open-ended dialogue, brand-voice writing) belong in preference-based pipelines,” Yang said.
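As a concrete example of a verifiable reward signal, here is a hypothetical math checker (the function name, regex, and tolerance are illustrative, not from the article): it extracts the model’s final number and returns the binary outcome that sets the update direction in both RLVR and RLSD. Checkers built on code compilers, SQL execution, or schema validators would return the same kind of 0/1 signal.

```python
import re

def verifiable_math_reward(model_output: str, ground_truth: float,
                           tol: float = 1e-6) -> float:
    """Return 1.0 if the last number in the model's output matches the
    verified answer within tolerance, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    return 1.0 if abs(float(numbers[-1]) - ground_truth) <= tol else 0.0
```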
RLSD is highly flexible, however, regarding the privileged information it requires. While OPSD structurally requires full intermediate reasoning traces, forcing enterprises either to pay annotators or to distill from a frontier model, RLSD does not.

“If you have full verified reasoning traces, great, RLSD will use them,” Yang said. “If all you have is the ground-truth final answer, that also works… OPSD doesn’t have this flexibility.”

Integrating the technique into existing open-source multi-modality RL frameworks such as veRL or EasyR1 is lightweight. According to Yang, it requires no framework rewrite and slots into the standard stack; the code change amounts to a few tens of lines to adjust the GRPO objective and sync the teacher with the student.

Looking ahead, RLSD offers a powerful way for enterprises to get more out of their existing internal assets.

“The proprietary data enterprises hold inside their perimeter (compliance manuals, internal documentation, historical tickets, verified code snippets) is essentially free privileged information,” Yang concluded. “RLSD lets enterprises feed this kind of data straight in as privileged context, which sharpens the learning signal on smaller models without needing an external teacher and without sending anything outside the network.”
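To illustrate what feeding privileged context to the self-teacher might look like in practice, here is a hypothetical prompt-construction helper. The template, field names, and formatting are assumptions for illustration; they are not taken from the paper or from veRL/EasyR1.

```python
def build_prompts(question: str, privileged_context: str | None = None) -> tuple[str, str]:
    """Build the student prompt and the self-teacher prompt.

    The student sees only the task. The teacher is the same model, but its
    prompt additionally carries privileged information (a verified answer,
    a reasoning trace, or internal reference material), which it uses only
    to score the student's tokens, never as text the student must copy.
    """
    student_prompt = f"Question: {question}\nAnswer step by step."

    teacher_prompt = student_prompt
    if privileged_context:
        teacher_prompt = (
            "Reference material (not shown to the student):\n"
            f"{privileged_context}\n\n" + student_prompt
        )
    return student_prompt, teacher_prompt
```

If only the ground-truth final answer is available, privileged_context can be as small as that single value; richer material such as internal documentation or verified code simply gives the teacher more to ground its per-token scores in.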




