Ai2 said Molmo 2 improves on its earlier models despite its compact size. | Source: Ai2
The Allen Institute for AI, also known as Ai2, last week released Molmo 2, its latest multimodal model suite capable of precise spatial and temporal understanding of video, image, and multi-image sets. Building on the first Molmo platform, Molmo 2 has advanced capabilities in video pointing, multi-frame reasoning, and object tracking.
Molmo 2 is an 8B-parameter model that surpasses last year’s 72B-parameter Molmo in accuracy, temporal understanding, and pixel-level grounding. Ai2 said it also bests proprietary models like Gemini 3 on key emerging skills like video tracking.
When it comes to image and multi-image reasoning, Ai2 claimed the Molmo 2 4B variant outperforms open models such as Qwen 3-VL-8B while using fewer parameters. Skills like these help the model, and any application or system built on top of it, to understand what is happening, where it is happening, and what it means.
Molmo 2 is also trained on far less data than comparable models: 9.19 million videos compared with 72.5 million for Meta’s PerceptionLM.
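To put the two video counts in proportion, the comparison can be worked out directly. The snippet below is only an illustrative calculation on the figures quoted here; the constant and function names are made up for this example and do not come from Ai2.

```python
# Video counts as quoted in the article; names are illustrative only.
MOLMO2_VIDEOS = 9.19e6
PERCEPTIONLM_VIDEOS = 72.5e6


def data_fraction(smaller: float, larger: float) -> float:
    """Return the smaller corpus size as a fraction of the larger one."""
    return smaller / larger


fraction = data_fraction(MOLMO2_VIDEOS, PERCEPTIONLM_VIDEOS)
ratio = PERCEPTIONLM_VIDEOS / MOLMO2_VIDEOS

print(f"Molmo 2 uses {fraction:.1%} as many videos as PerceptionLM")
print(f"PerceptionLM uses about {ratio:.1f}x more videos")
```

By these numbers, Molmo 2 was trained on roughly one-eighth as many videos.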
“With a fraction of the data, Molmo 2 surpasses many frontier models on key video understanding tasks,” said Ali Farhadi, the CEO of Ai2. “We’re excited to see the immense impact this model will have on the AI landscape, adding another piece to our fully open model ecosystem.”
Ai2 is a Seattle-based nonprofit AI research institute with the mission of building AI to solve the world’s biggest problems. Founded in 2014 by the late Microsoft co-founder Paul G. Allen, Ai2 said it develops foundational AI research and new applications through large-scale open models, open data, robotics, conservation platforms, and more.
Molmo 2 offers new capabilities
Deep video understanding is key to building models that can understand and act on sensor streams for robotics. However, most models today either lack video understanding capabilities or are locked behind proprietary systems without transparency into the data. Ai2 said it is giving researchers access to advanced video grounding, tracking, and multi-frame reasoning, all with open weights and data.
Molmo 2 can identify exactly where and when events occur, track multiple objects through complex scenes, and connect actions to frame-level timelines. The company said these capabilities support safer automation, more accurate real-world systems, and open research the global community can inspect, reproduce, and build upon.
Ai2 listed key capabilities:
- Frame-level spatial and temporal grounding: Molmo 2 goes beyond description. It returns precise pixel coordinates, object positions, and timestamps for events across a video.
- Robust multi-object tracking and counting: The model maintains consistent object identities across occlusions, scene changes, and long clips, enabling applications in robotics, inspection, transportation, and industry.
- Dense long-form video captioning and anomaly detection: Molmo 2 produces highly detailed, searchable descriptions and flags unusual events in long sequences.
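To make the grounding and tracking capabilities more concrete, here is a minimal sketch of how an application might represent and consume such output. The article does not describe Molmo 2's actual output format, so the `GroundedPoint` record, its fields, and both helper functions are hypothetical and not part of any Ai2 API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundedPoint:
    """Hypothetical grounding record: which object, where, and when."""
    object_id: int      # identity expected to stay constant across frames
    timestamp_s: float  # time of the frame, in seconds
    x: float            # pixel column
    y: float            # pixel row


def build_tracks(points):
    """Group points by object identity and order each track by time."""
    tracks = defaultdict(list)
    for point in points:
        tracks[point.object_id].append(point)
    return {
        object_id: sorted(track, key=lambda p: p.timestamp_s)
        for object_id, track in tracks.items()
    }


def count_objects(points):
    """Count distinct object identities seen anywhere in the clip."""
    return len({p.object_id for p in points})


# Made-up sample data: object 2 has no point at 0.5 s, as if briefly occluded.
points = [
    GroundedPoint(1, 0.0, 120.0, 80.0),
    GroundedPoint(2, 0.0, 300.0, 95.0),
    GroundedPoint(1, 0.5, 128.0, 82.0),
    GroundedPoint(2, 1.5, 310.0, 97.0),
]

tracks = build_tracks(points)
for object_id, track in tracks.items():
    times = [p.timestamp_s for p in track]
    print(f"object {object_id}: seen at {times}")
print(f"distinct objects: {count_objects(points)}")
```

Keying every point to a stable object identity is what lets counting and tracking survive gaps: object 2 is still counted once even though it is missing from a frame.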
Molmo 2 delivers on major open-weight benchmarks, says Ai2
Molmo 2 delivers results on major open-weight benchmarks and is on par with leading proprietary systems on real-world video tasks. The model matches leading open-weight performance on short-video understanding benchmarks such as MVBench, MotionQA, and NextQA.
It offers improvements in video grounding accuracy, often doubling or tripling the scores of earlier open models and surpassing proprietary APIs on several pointing and counting tasks, Ai2 claimed. The model also offers tracking results across multi-domain benchmarks, outperforming strong open baselines and several commercial closed models.
In addition, Molmo 2 features image and multi-image reasoning that rivals or exceeds larger open-weight systems despite using fewer parameters. Ai2 asserted that human preference evaluations showed that Molmo 2 is on par with or better than several proprietary systems on real-world video QA and captioning tasks.
Ai2 offers open data and recipes
For transparency and reproducibility, all the training sources for Molmo 2 are provided in the technical report. Ai2 is also releasing a set of nine new open datasets used to train Molmo 2, totaling more than 9 million multimodal examples across dense video captions, long-form QA, grounding, tracking, and multi-image reasoning.
The captioning corpus alone spans more than 100,000 videos with detailed descriptions that average more than 900 words each. The data mix covers video pointing, multi-object tracking, synthetic grounding, and long-video reasoning. Together, they form one of the most complete open video data collections available today, claimed Ai2.
Molmo 2 comes in three main variants: Molmo 2 (4B), Molmo 2 (8B), and Molmo 2-O (7B), which uses Ai2’s fully open Olmo backbone for the complete end-to-end model flow. Versions tuned specifically for pointing and tracking are also available.
All models, datasets, and evaluation tools are now publicly available on GitHub, Hugging Face, and the Ai2 Playground for interactive testing. The company plans to release the training code soon.


