
Xiaomi Released MiMo-7B: A Compact Language Model that Outperforms Larger Models in Mathematical and Code Reasoning via Rigorous Pre-Training and Reinforcement Learning


With rising demand for AI systems that can handle tasks involving multi-step logic, mathematical proofs, and software development, researchers have turned their attention toward improving models' reasoning ability. This capability, once believed to be exclusive to human intelligence, is now actively being pursued in smaller-scale models to make them more efficient and widely deployable. As reasoning-based tasks continue to grow in relevance, spanning academic problem-solving, automated theorem proving, algorithm design, and complex software debugging, language models are expected to become more than general-purpose conversational agents. They are being pushed to become domain-specific problem solvers that can assist professionals and researchers alike.

One challenge in building reasoning-focused models is achieving strong, simultaneous performance in mathematics and programming while keeping the model relatively small. The best results in these domains are typically achieved by models with roughly 32 billion parameters or more. These large models are often used because smaller ones struggle with generalization and reward optimization in reinforcement learning tasks, particularly in code-based problem-solving. Sparse reward feedback, limited high-quality data, and weak base model architectures make it difficult to develop compact yet powerful models. Moreover, the data used to train these models is not always curated with reasoning in mind, often resulting in training inefficiencies and limited gains in problem-solving ability.

To address reasoning challenges, several models, including OpenAI's o-series, DeepSeek R1, and Claude 3.7, have been released, leveraging large parameter counts and sophisticated reinforcement learning strategies. These models employ techniques such as step-by-step planning and backtracking to strengthen reasoning, particularly in algorithmic thinking and math-related tasks. However, they depend heavily on post-training stages and underplay the importance of high-quality pre-training data. Many also rely on fixed, template-based reward systems that are prone to reward hacking. Code generation benchmarks often show that these models perform inconsistently on challenging tasks because of shallow pre-training foundations and ineffective reward signal modeling during fine-tuning.

A research team from Xiaomi introduced the MiMo-7B family of language models with a focused approach to overcoming these obstacles. The innovation lies in treating both pre-training and post-training as equally critical phases for developing reasoning capabilities. The base model, MiMo-7B-Base, was trained from scratch on a dataset of 25 trillion tokens. This dataset was built with a three-stage mixture strategy that progressively increased the share of mathematical and programming content. An additional multiple-token prediction (MTP) objective was introduced during pre-training to improve both performance and inference speed. For post-training, the team curated a dataset of 130,000 verifiable math and programming problems, each tagged with a difficulty score. Reinforcement learning was then applied using a difficulty-driven reward framework, allowing more nuanced and effective feedback during training. This resulted in two major variants: MiMo-7B-RL and MiMo-7B-RL-Zero.
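The paper does not reproduce its training code, but the idea behind a multiple-token prediction objective can be illustrated with a minimal sketch: alongside the usual next-token head, an extra head predicts a token further ahead, and its loss is added with a small weight. The module names, the single extra head, and the 0.3 weight below are illustrative assumptions, not MiMo's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHeads(nn.Module):
    """Illustrative multi-token prediction: in addition to the standard
    next-token head, a second head predicts the token two positions ahead."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.next_token = nn.Linear(hidden_size, vocab_size)  # predicts token t+1
        self.plus_two = nn.Linear(hidden_size, vocab_size)    # predicts token t+2

    def forward(self, hidden: torch.Tensor, labels: torch.Tensor,
                mtp_weight: float = 0.3) -> torch.Tensor:
        # hidden: (batch, seq, hidden_size); labels: (batch, seq) token ids
        logits_1 = self.next_token(hidden[:, :-1])   # targets are labels[:, 1:]
        logits_2 = self.plus_two(hidden[:, :-2])     # targets are labels[:, 2:]
        loss_1 = F.cross_entropy(logits_1.flatten(0, 1), labels[:, 1:].flatten())
        loss_2 = F.cross_entropy(logits_2.flatten(0, 1), labels[:, 2:].flatten())
        return loss_1 + mtp_weight * loss_2          # combined pre-training loss
```

At inference time, the extra head can also be reused for speculative-style decoding, which is how such an objective can improve generation speed as well as quality.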

The pre-training methodology began by extracting reasoning-heavy content from web pages, academic papers, and books using a custom HTML extraction tool designed to preserve math equations and code snippets. Unlike generic pipelines, this extractor retained structural elements critical to problem-solving domains. The team also enhanced its PDF parsing tools to interpret scientific and programming content accurately. To prevent data duplication, global deduplication was applied using URL-based and MinHash techniques. The training corpus was filtered with small language models fine-tuned to tag content quality, replacing outdated heuristic-based filters that often removed valuable reasoning examples. High-quality synthetic reasoning data generated by stronger models was also added in the final stage of training. This three-stage approach produced a training mix with 70% math and code data in stage two and an additional 10% synthetic content in stage three. The maximum context length was extended from 8,192 to 32,768 tokens, ensuring the model could handle long-form reasoning problems.
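Xiaomi does not publish its deduplication code, but MinHash-based near-duplicate detection of the kind described here can be sketched in a few lines. The shingle size, number of hash "permutations", and similarity threshold below are illustrative choices, not values reported in the paper.

```python
import hashlib

NUM_PERM = 64   # number of hash permutations (illustrative)
SHINGLE = 5     # word-shingle size (illustrative)

def shingles(text: str, k: int = SHINGLE) -> set[str]:
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(text: str, num_perm: int = NUM_PERM) -> list[int]:
    # One salted hash per "permutation"; keep the minimum hash value for each.
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return sig

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    # The fraction of matching signature slots approximates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Document pairs whose estimated similarity exceeds a threshold (e.g. 0.8)
# would be treated as near-duplicates, and one copy dropped from the corpus.
```

In a production pipeline these signatures are usually bucketed with locality-sensitive hashing so that only candidate pairs, not all pairs, are compared.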

In the reinforcement learning stage, the research team engineered a seamless rollout engine to accelerate training and validation. This infrastructure included asynchronous reward computation and early-termination mechanisms to reduce GPU idle time, resulting in 2.29 times faster training and 1.96 times faster validation. The model's policy was optimized using fine-grained rewards derived from the difficulty of test cases, addressing the sparse-reward issue in programming benchmarks. Data re-sampling strategies were introduced to maintain training stability and improve rollout sampling efficiency. Together, these techniques enabled the MiMo-7B variants to learn effectively, even from cold-start states where no pre-fine-tuned initialization is available.
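As a rough illustration of what a difficulty-driven, test-level reward looks like in code (the exact weighting scheme is not taken from the paper; the per-test weights below are hypothetical), each passed test contributes partial credit in proportion to its difficulty instead of a single pass/fail signal:

```python
def difficulty_driven_reward(passed: list[bool], difficulty: list[float]) -> float:
    """Dense reward for one code rollout: every passed test case contributes
    credit proportional to its difficulty weight, rather than the rollout
    receiving 0/1 only when all tests pass.

    passed[i]     -- whether the generated program passed test case i
    difficulty[i] -- a positive weight, larger for harder test cases
    """
    total = sum(difficulty)
    if total == 0:
        return 0.0
    earned = sum(w for ok, w in zip(passed, difficulty) if ok)
    return earned / total  # in [0, 1]; reaches 1.0 only when every test passes

# Example: a rollout that solves the easy tests but fails the hardest one
# still earns partial credit, giving the policy a usable gradient signal.
print(difficulty_driven_reward([True, True, False], [1.0, 1.0, 3.0]))  # 0.4
```

The point of such shaping is that, on hard programming problems, most early rollouts fail at least one test, and a graded signal distinguishes "almost correct" from "completely wrong".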

Performance evaluation showed that MiMo-7B-Base achieved a score of 75.2 on the Big-Bench Hard (BBH) benchmark, surpassing other open-source 7B models. It also performed well on SuperGPQA, which includes graduate-level reasoning questions. The post-trained MiMo-7B-RL scored 55.4 on the AIME 2025 benchmark, surpassing OpenAI's o1-mini by 4.7 points. On code generation tasks, it outperformed much larger models such as DeepSeek-R1-Zero-32B and Qwen2.5-32B-RL-Zero on both LiveCodeBench v5 and v6. These results demonstrate that a properly optimized 7B model can rival, and even outperform, models with more than four times as many parameters.

The MiMo-7B project is a concrete demonstration of how pre-training, data quality, and reinforcement learning infrastructure jointly determine the final reasoning capability of a language model. By rethinking the pipeline from data extraction to reward computation, the Xiaomi research team produced compact yet powerful models suitable for real-world applications in mathematics, coding, and logic. Their approach highlights the untapped potential of small models and challenges the assumption that size alone determines intelligence or versatility.

Key Takeaways from the Research on MiMo-7B:

  1. MiMo-7B was trained on a massive dataset of 25 trillion tokens, targeting reasoning tasks through structured data mixtures.
  2. 130,000 math and code problems were used in RL training, each annotated with a difficulty score to enable effective reward shaping.
  3. Three-stage pre-training raised math and coding content to 70%, followed by 10% synthetic problem-solving data.
  4. A seamless rollout engine increased RL training speed by 2.29 times and validation speed by 1.96 times.
  5. MiMo-7B-RL achieved 55.4 on AIME 2025, outperforming OpenAI o1-mini by 4.7 points.
  6. MiMo-7B models are publicly available and include all checkpoints: base, SFT, and RL variants.
  7. The model's success shows that small, well-designed models can rival or exceed the performance of 32B models on reasoning tasks.

Check out the Paper and GitHub Page. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don't forget to join our 90k+ ML SubReddit.



Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Materials Science, he is exploring new advancements and creating opportunities to contribute.
