
High-Entropy Token Selection in Reinforcement Learning with Verifiable Rewards (RLVR) Improves Accuracy and Reduces Training Cost for LLMs


Large Language Models (LLMs) generate step-by-step responses known as Chains-of-Thought (CoTs), in which every token contributes to a coherent and logical narrative. To improve the quality of reasoning, various reinforcement learning techniques have been employed. These methods allow the model to learn from feedback mechanisms by aligning generated outputs with correctness criteria. As LLMs grow in complexity and capacity, researchers have begun probing the internal structure of token generation to discern patterns that enhance or limit performance. One area gaining attention is the token entropy distribution, a measure of uncertainty in token prediction, which is now being linked to a model's ability to make meaningful logical decisions during reasoning.

A core issue in training reasoning models with reinforcement learning is that all output tokens are treated equally. When models are optimized using reinforcement learning with verifiable rewards (RLVR), the update traditionally includes every token in the generated sequence, regardless of its functional role. This uniform treatment fails to distinguish tokens that drive critical reasoning shifts from those that merely extend existing linguistic structures. As a result, a large portion of the training signal may be spent on tokens that contribute little to the model's reasoning capabilities. Without prioritizing the few tokens that play decisive roles in navigating different logic paths, these methods miss opportunities for focused and effective optimization.

Most RLVR frameworks, including Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and Dynamic sAmpling Policy Optimization (DAPO), work by scoring entire sequences of token outputs against reward functions that assess correctness. PPO stabilizes policy updates through a clipped objective function. GRPO improves on this by estimating advantage values from groups of sampled responses rather than a separate value network. DAPO adds further refinements, such as the clip-higher mechanism and overlong reward shaping. None of these methods, however, account for token-level entropy or distinguish the importance of individual tokens in the reasoning chain; they apply uniform gradient updates across the board.
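To make the shared structure of these objectives concrete, the sketch below shows a per-token clipped surrogate with GRPO-style group-normalized advantages and an asymmetric upper clip in the spirit of DAPO's clip-higher. It is a minimal illustration under assumed tensor names and clipping values, not any of the official implementations.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # Group-relative advantage: normalize each response's verifiable reward
    # against the mean and std of its sampled group (no value network needed).
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def clipped_token_objective(logp_new: torch.Tensor,
                            logp_old: torch.Tensor,
                            advantages: torch.Tensor,
                            eps_low: float = 0.2,
                            eps_high: float = 0.28) -> torch.Tensor:
    # PPO-style clipped surrogate applied per token; the larger upper bound
    # (eps_high > eps_low) mirrors the "clip-higher" idea attributed to DAPO.
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * advantages
    clipped = ratio.clamp(1.0 - eps_low, 1.0 + eps_high) * advantages
    return torch.minimum(unclipped, clipped).mean()
```

In this uniform setup, every generated token contributes equally to the averaged objective, which is precisely the behavior the new method revisits.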

In an attempt to refine how RLVR training shapes LLM reasoning, researchers from Alibaba Inc. and Tsinghua University introduced a new method centered on token entropy patterns. They observed that in the CoT sequences generated by Qwen3 models, a small subset of tokens, roughly 20%, display significantly higher entropy. These tokens, labeled "forking tokens," often correspond to moments where the model must decide between multiple reasoning paths. The remaining 80% of tokens typically exhibit low entropy and act as extensions of prior statements. By restricting policy gradient updates solely to these high-entropy tokens, the research team was able not only to maintain but, in many cases, improve performance on challenging reasoning benchmarks.
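A minimal sketch of this selective update is shown below, assuming a per-token loss and a precomputed boolean mask over the roughly 20% highest-entropy positions; both names are illustrative rather than taken from the paper's code.

```python
import torch

def masked_policy_loss(per_token_loss: torch.Tensor,
                       forking_mask: torch.Tensor) -> torch.Tensor:
    # Zero out the loss on low-entropy tokens so that only the high-entropy
    # "forking" tokens contribute gradients to the policy update.
    masked = per_token_loss * forking_mask.float()
    # Average over the number of updated tokens to keep the loss scale stable.
    return masked.sum() / forking_mask.float().sum().clamp(min=1.0)
```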

To quantify token entropy, the researchers used the standard entropy formula computed over the probability distribution of possible token choices at each generation step. They found that over half of all generated tokens had entropy values below 0.01, indicating near-deterministic behavior, and only 20% exceeded an entropy of 0.672, marking them as the decision-making hubs within CoTs. High-entropy tokens often include logical operators and connective words such as "assume," "since," or "thus," which introduce new conditions or transitions in logic. In contrast, low-entropy tokens tend to be predictable symbols, suffixes, or code fragments. Through controlled experiments, it became clear that manipulating the entropy of these forking tokens directly influenced the model's reasoning performance, whereas altering low-entropy tokens had little effect.
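For illustration, the per-position entropy and the top-20% selection could be computed as in the sketch below. The function names and the keep ratio are assumptions for this example; the reported thresholds (0.01 and 0.672) are empirical observations from the paper, not parameters of the code.

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    # Entropy H_t = -sum_j p_{t,j} * log p_{t,j} over the vocabulary at step t.
    # logits: (seq_len, vocab_size) raw scores from the policy model.
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)

def high_entropy_mask(entropy: torch.Tensor, keep_ratio: float = 0.2) -> torch.Tensor:
    # Boolean mask that keeps the top `keep_ratio` fraction of tokens by entropy.
    k = max(1, int(keep_ratio * entropy.numel()))
    threshold = torch.topk(entropy, k).values.min()
    return entropy >= threshold
```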

The research team ran extensive experiments across three model sizes: Qwen3-8B, Qwen3-14B, and Qwen3-32B. When training only on the top 20% high-entropy tokens, the Qwen3-32B model achieved 63.5 on AIME'24 and 56.7 on AIME'25, both setting new performance marks for models under 600B parameters. Furthermore, increasing the maximum response length from 20k to 29k tokens raised the AIME'24 score to 68.1. In comparison, training on the bottom 80% of low-entropy tokens caused performance to drop sharply. The Qwen3-14B model showed gains of +4.79 on AIME'25 and +5.21 on AIME'24, while Qwen3-8B maintained competitive results relative to full-token training. An ablation study further confirmed the importance of the 20% threshold: lowering the fraction to 10% omitted essential decision points, while raising it to 50% or 100% diluted the effect by including too many low-entropy tokens, reducing entropy diversity and hindering exploration.

In essence, the research offers a new direction for improving the reasoning abilities of language models by identifying and selectively training on the minority of tokens that disproportionately contribute to reasoning success. It avoids inefficient training and instead proposes a scalable approach that aligns reinforcement learning objectives with the actual decision-making moments in token sequences. The success of the strategy lies in using entropy as a guide to separate useful tokens from filler.

Several key takeaways from the research include:

  • Around 20% of tokens exhibit high entropy and serve as forking points that direct reasoning paths.
  • Training only on these high-entropy tokens delivers performance equal to or better than training on the full token set.
  • Qwen3-32B achieved 63.5 on AIME'24 and 56.7 on AIME'25, outperforming larger models trained conventionally.
  • Extending the maximum response length from 20k to 29k tokens further pushed the AIME'24 score to 68.1.
  • Training on the remaining 80% of low-entropy tokens led to sharp performance degradation.
  • Keeping the 20% threshold for high-entropy tokens strikes the best balance between exploration and performance.
  • Larger models gain more from this strategy, thanks to their greater capacity to benefit from enhanced exploration.
  • The approach scales well and could guide more efficient training of next-generation reasoning models.

In conclusion, this research rethinks the application of reinforcement learning to language models by introducing a focus on token-level entropy. By optimizing only the minority of tokens that influence reasoning paths, the method improves performance while reducing computational overhead. It provides a practical roadmap for future efforts to strengthen reasoning in LLMs without unnecessary complexity.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 98k+ ML SubReddit and subscribe to our Newsletter.


Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Materials Science, he is exploring new advancements and creating opportunities to contribute.
