In natural language processing (NLP), RL techniques such as reinforcement learning from human feedback (RLHF) have been used to improve model outputs by optimizing responses based on feedback signals. A specific variant, reinforcement learning with verifiable rewards (RLVR), extends this approach by using automatic signals, such as mathematical correctness or syntactic features, as feedback, enabling large-scale tuning of language models. RLVR is especially interesting because it promises to enhance models' reasoning abilities without requiring extensive human supervision. This intersection of automated feedback and reasoning tasks forms an exciting area of research, where developers aim to uncover how models can learn to reason mathematically, logically, or structurally with limited supervision.
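To make the idea concrete, a verifiable reward is usually just a programmatic check on the model's final answer. The snippet below is a minimal sketch, not the paper's code: the function names and the exact-match check on a \boxed{} answer are illustrative assumptions about how such a rule-based reward might look.

```python
import re


def extract_boxed_answer(response: str) -> str | None:
    # Pull the contents of the last \boxed{...} expression, if any.
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return matches[-1].strip() if matches else None


def verifiable_reward(response: str, reference_answer: str) -> float:
    # Reward 1.0 when the extracted answer matches the reference string, else 0.0.
    predicted = extract_boxed_answer(response)
    return 1.0 if predicted is not None and predicted == reference_answer.strip() else 0.0
```

Because the check is fully automatic, rewards like this can be computed at scale during RL training without any human in the loop.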
A persistent challenge in machine learning is building models that can reason effectively under minimal or imperfect supervision. In tasks like mathematical problem-solving, where the correct answer may not be immediately available, researchers grapple with how to guide a model's learning. Models typically learn from ground-truth data, but it is impractical to label vast datasets with perfect accuracy, particularly in reasoning tasks that require understanding complex structures like proofs or programmatic steps. Consequently, there is an open question about whether models can learn to reason when they are exposed to noisy, misleading, or even incorrect signals during training. This question matters because models that rely too heavily on perfect feedback may not generalize well when such supervision is unavailable, limiting their utility in real-world scenarios.
Several existing techniques aim to enhance models' reasoning abilities through reinforcement learning (RL), with RLVR being a key focus. Traditionally, RLVR has used "ground truth" labels, correct answers verified by humans or automated tools, to provide rewards during training. Some approaches have relaxed this requirement by using majority-vote labels or simple format-based heuristics, such as rewarding answers that follow a specific output style. Other methods have experimented with random rewards, offering positive signals without considering the correctness of the answer. These methods aim to explore whether models can learn even with minimal guidance, but they mostly focus on specific models, such as Qwen, raising concerns about generalizability across different architectures. The sketch after this paragraph illustrates how these reward variants differ.
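The reward variants differ only in what signal they return for a sampled response. The following is an illustrative sketch, not the paper's implementation, of how ground-truth, majority-vote, format, random, and incorrect-label rewards could be written as simple functions; `extract_boxed_answer` is the same hypothetical parser as in the earlier snippet.

```python
import random
import re
from collections import Counter


def extract_boxed_answer(response):
    # Same hypothetical parser as in the earlier sketch.
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return matches[-1].strip() if matches else None


def ground_truth_reward(response, gold_answer):
    # Reward only answers matching the verified gold label.
    return 1.0 if extract_boxed_answer(response) == gold_answer else 0.0


def majority_vote_reward(response, sampled_answers):
    # Use the most frequent answer across the model's own samples as a pseudo-label.
    pseudo_label, _ = Counter(sampled_answers).most_common(1)[0]
    return 1.0 if extract_boxed_answer(response) == pseudo_label else 0.0


def format_reward(response):
    # Reward any response containing a boxed expression, regardless of correctness.
    return 1.0 if extract_boxed_answer(response) is not None else 0.0


def random_reward(response, p=0.5):
    # Positive signal with probability p, independent of the answer.
    return 1.0 if random.random() < p else 0.0


def incorrect_reward(response, wrong_answer):
    # Deliberately reward a specific wrong label.
    return 1.0 if extract_boxed_answer(response) == wrong_answer else 0.0
```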
Researchers from the University of Washington, the Allen Institute for AI, and UC Berkeley investigate this question by testing various reward signals on Qwen2.5-Math, a family of large language models fine-tuned for mathematical reasoning. They evaluated ground-truth rewards, majority-vote rewards, format rewards based on boxed expressions, random rewards, and incorrect rewards. Remarkably, they observed that even completely spurious signals, like random rewards and rewards for wrong answers, could lead to substantial performance gains in Qwen models. For example, training Qwen2.5-Math-7B on MATH-500 with ground-truth rewards yielded a 28.8% improvement, while using incorrect labels resulted in a 24.6% gain. Random rewards still produced a 21.4% boost, and format rewards led to a 16.4% improvement. Majority-vote rewards provided a 26.5% accuracy gain. These improvements were not limited to a single model; Qwen2.5-Math-1.5B also showed strong gains: format rewards boosted accuracy by 17.6%, and incorrect labels by 24.4%. However, the same reward strategies failed to deliver comparable benefits on other model families, such as Llama3 and OLMo2, which showed minimal or negative changes when trained with spurious rewards. For instance, Llama3.1-8B saw performance drops of up to 8.5% under certain spurious signals, highlighting the model-specific nature of the observed improvements.
The research team's approach involved using RLVR training to fine-tune models with these varied reward signals, replacing ground-truth supervision with heuristic or randomized feedback. They found that Qwen models, even without access to correct answers, could still learn to produce high-quality reasoning outputs. A key insight was that Qwen models tended to exhibit a distinct behavior called "code reasoning": generating math solutions structured like code, particularly in Python-like formats, regardless of whether the reward signal was meaningful. This code reasoning tendency became more frequent over training, rising from 66.7% to over 90% in Qwen2.5-Math-7B when trained with spurious rewards. Answers that included code reasoning showed higher accuracy rates, often around 64%, compared with just 29% for answers without such reasoning patterns. These patterns emerged consistently, suggesting that spurious rewards may unlock latent capabilities learned during pretraining rather than introducing new reasoning skills.
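One simple way to quantify this kind of "code reasoning" behavior is to flag responses that contain Python-like solution code and compare accuracy across the two groups. The sketch below is a rough, assumed heuristic for that analysis, not the authors' measurement script; the marker strings and bucket names are illustrative.

```python
def uses_code_reasoning(response: str) -> bool:
    # Heuristic: flag responses that embed Python-like solution code.
    markers = ("```python", "def ", "print(", "import ")
    return any(marker in response for marker in markers)


def accuracy_by_reasoning_style(responses, correctness):
    # Split responses by reasoning style and compare accuracy in each bucket.
    buckets = {True: [], False: []}
    for response, is_correct in zip(responses, correctness):
        buckets[uses_code_reasoning(response)].append(is_correct)
    return {
        "code_reasoning_acc": sum(buckets[True]) / max(len(buckets[True]), 1),
        "plain_reasoning_acc": sum(buckets[False]) / max(len(buckets[False]), 1),
    }
```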
Performance data underscored the surprising robustness of Qwen models. Gains from random rewards (21.4% on MATH-500) and incorrect labels (24.6%) nearly matched the ground-truth reward gain of 28.8%. Similar trends appeared across tasks, such as AMC, where format, wrong, and random rewards produced around an 18% improvement, only slightly lower than the 25% improvement from ground-truth or majority-vote rewards. Even on AIME2024, spurious rewards like format (+13.0%), incorrect (+8.7%), and random (+6.3%) led to meaningful gains, though the advantage of ground-truth labels (+12.8%) remained evident, particularly on AIME2025 questions created after the models' pretraining cutoffs.
Several key takeaways from the research include:
- Qwen2.5-Math-7B gained 28.8% accuracy on MATH-500 with ground-truth rewards, but also 24.6% with incorrect rewards, 21.4% with random rewards, 16.4% with format rewards, and 26.5% with majority-vote rewards.
- Code reasoning patterns emerged in Qwen models, increasing from 66.7% to over 90% of responses under RLVR, and answers with such patterns reached 64% accuracy versus 29% without them.
- Non-Qwen models, such as Llama3 and OLMo2, did not show similar improvements, with Llama3.1-8B experiencing performance drops of up to 8.5% under spurious rewards.
- Gains from spurious signals often appeared within 50 training steps, suggesting rapid elicitation of reasoning abilities.
- The research warns that RLVR studies should avoid generalizing results based on Qwen models alone, as the effectiveness of spurious rewards is not universal.
In conclusion, these findings suggest that while Qwen models can leverage spurious signals to improve performance, the same is not true for other model families. Non-Qwen models, such as Llama3 and OLMo2, showed flat or negative performance changes when trained with spurious signals. The research emphasizes the importance of validating RLVR methods on diverse models rather than relying solely on Qwen-centric results, as many recent papers have done.
Check out the Paper, Official Release, and GitHub Page. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.