Introduction: Reinforcement Learning Progress via Chain-of-Thought Prompting
LLMs have shown excellent progress on complex reasoning tasks through CoT prompting combined with large-scale reinforcement learning (RL). Models like DeepSeek-R1-Zero have demonstrated strong reasoning capabilities by applying RL directly to base models. Similarly, methods such as SimpleRL and Open-Reasoner-Zero show improvements in smaller models like the Qwen series. However, achieving success across different base model families remains a challenge. Moreover, applying R1-Zero-style training to base models such as the Llama series runs into difficulty, raising a fundamental question about the underlying factors that lead different base models to behave inconsistently during reinforcement learning.
Limitations of RL Scaling on Llama Models
Large-scale RL has driven advances in models like OpenAI's o1 and o3 and DeepSeek's R1 on competition-level mathematics problems, motivating the exploration of RL on smaller models with fewer than 100B parameters. However, these efforts are largely limited to the Qwen model family, and replicating the results on families such as Llama is difficult. The lack of transparency in pre-training pipelines has made it hard to understand how pre-training influences RL scaling. This has prompted unconventional studies, which found that one-shot prompting improves reasoning in Qwen but offers little benefit in Llama. Efforts to curate high-quality mathematical pre-training corpora through projects like OpenWebMath, MathPile, InfiMM-Web-Math, and FineMath have made progress but remain limited in scale, at under 100B tokens.

Exploring Mid-Training with the Stable-then-Decay Strategy
Researchers from Shanghai Jiao Tong University investigate how mid-training strategies shape RL dynamics, focusing on Qwen and Llama. The study presents several insights: First, high-quality mathematical corpora such as MegaMath-Web-Pro boost both base model and RL outcomes. Second, using QA-style data, especially examples with long CoT reasoning, further enhances RL results. Third, long CoT introduces verbosity and instability into RL training. Finally, applying scaling during mid-training results in stronger downstream RL performance. The researchers introduce a two-stage mid-training strategy called Stable-then-Decay, in which base models are first trained on 200B tokens, followed by 20B tokens across three CoT-focused branches, resulting in the OctoThinker models that show strong RL compatibility.
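The sketch below shows one way a Stable-then-Decay schedule like this could be expressed in code. The 200B/20B token budgets, the MegaMath-Web-Pro corpus, and the long/short/hybrid branch names come from the description above; the learning-rate values, decay shape, and data-mix weights are illustrative assumptions, not the paper's actual hyperparameters.

```python
# Minimal sketch of a Stable-then-Decay mid-training schedule.
# Token budgets and branch names follow the article; LR values,
# decay shape, and data-mix weights are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class MidTrainStage:
    name: str
    token_budget: float   # tokens to consume in this stage
    data_mix: dict        # corpus name -> sampling weight
    peak_lr: float
    final_lr: float       # LR reached at the end of the stage

def lr_at(stage: MidTrainStage, tokens_seen: float) -> float:
    """Linearly interpolate from peak_lr to final_lr over the stage."""
    frac = min(tokens_seen / stage.token_budget, 1.0)
    return stage.peak_lr + frac * (stage.final_lr - stage.peak_lr)

# Stage 1: stable stage, ~200B tokens at a constant LR on a high-quality math corpus.
stable = MidTrainStage(
    name="stable",
    token_budget=200e9,
    data_mix={"megamath-web-pro": 1.0},
    peak_lr=3e-5, final_lr=3e-5,          # assumed constant LR
)

# Stage 2: decay stage, ~20B tokens for each of three CoT-focused branches.
branches = {
    flavor: MidTrainStage(
        name=f"decay-{flavor}",
        token_budget=20e9,
        data_mix={"megamath-web-pro": 0.5, f"{flavor}-cot-qa": 0.5},  # assumed mix
        peak_lr=3e-5, final_lr=3e-6,      # assumed decay range
    )
    for flavor in ("long", "short", "hybrid")
}

# Example: LR halfway through the long-CoT decay branch.
print(lr_at(branches["long"], 10e9))
```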
RL Configuration and Benchmark Evaluation
The researchers use the MATH8K dataset for RL training prompts. The configuration includes a global training batch size of 128, 16 rollout responses per query, and a PPO mini-batch size of 64, with experiments conducted on Llama-3.2-3B-Base and Qwen2.5-3B-Base models. For evaluation, few-shot prompting is used for base language models and zero-shot prompting for RL-tuned models across indicator tasks, including GSM8K, MATH500, OlympiadBench, and AMC23. During RL training, Qwen models exhibit increasing response lengths that remain reasonable throughout, while Llama displays abnormal behavior, with average response lengths escalating to 4,096 tokens. Evaluation further reveals that RL-tuned Qwen2.5-3B achieves improvements across benchmarks, while Llama-3.2-3B shows only marginal gains.
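As a rough, framework-agnostic sketch, the setup above might be captured in a config like the following. The dataset, model names, batch sizes, and rollout count are taken from the article; the field names, KL coefficient, response-length cap, and few-shot count are assumptions for illustration only.

```python
# Hedged, framework-agnostic sketch of the RL and evaluation setup.
rl_config = {
    "prompt_dataset": "MATH8K",                   # RL training prompts
    "actor_models": ["Llama-3.2-3B-Base", "Qwen2.5-3B-Base"],
    "algorithm": "PPO",
    "global_train_batch_size": 128,               # prompts per RL step
    "rollouts_per_prompt": 16,                    # sampled responses per query
    "ppo_mini_batch_size": 64,
    "max_response_length": 4096,                  # assumed cap; Llama's average escalates to it
    "kl_coef": 1e-3,                              # assumed regularization strength
}

eval_config = {
    # Indicator benchmarks named in the article.
    "benchmarks": ["GSM8K", "MATH500", "OlympiadBench", "AMC23"],
    # Few-shot prompting for base models, zero-shot for RL-tuned models.
    "num_shots": {"base": 4, "rl_tuned": 0},      # 4-shot is an assumed value
}
```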
OctoThinker Outperforms Llama in RL Compatibility
Each OctoThinker branch demonstrates a 10%-20% improvement over the original Llama base model, and consistent gains over the stable-stage model across all sizes, when evaluated on 13 mathematical benchmarks. The OctoThinker-Zero families reveal diverse thinking behaviors during RL scaling, with strong performance from the OctoThinker-Long variant. When comparing three 3B-scale base models during RL training, OctoThinker-Long-3B outperforms the original Llama-3.2-3B model and reaches performance parity with Qwen2.5-3B, a model known for strong reasoning capabilities and extensive pre-training. The hybrid and short branches show slightly lower performance, especially on challenging benchmarks.
Conclusion and Future Work: Toward RL-Ready Foundation Models
This paper investigates why base models such as Llama and Qwen exhibit divergent behaviors during RL for reasoning, showing that mid-training plays a major role in RL scalability. The two-stage mid-training strategy transforms Llama into a foundation model better suited for RL, resulting in the OctoThinker models. Future research directions include:
- Curating higher-quality mathematical corpora to improve mid-training.
- Creating RL-friendly base models using open recipes, without distillation from long CoT reasoning models.
- Separating the QA format and content to understand their contributions individually.
- Expanding the OctoThinker family with new branches, such as tool-integrated reasoning.
Check out the Paper, Hugging Face Page, and GitHub Page. All credit for this research goes to the researchers of this project.
Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a tech enthusiast, he delves into the practical applications of AI, with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.