IBM has released Granite-Docling-258M, an open-source (Apache-2.0) vision-language model designed specifically for end-to-end document conversion. The model targets layout-faithful extraction (tables, code, equations, lists, captions, and reading order), emitting a structured, machine-readable representation rather than lossy Markdown. It is available on Hugging Face with a live demo and an MLX build for Apple Silicon.
What’s new compared to SmolDocling?
Granite-Docling is the production-ready successor to SmolDocling-256M. IBM replaced the earlier backbone with a Granite 165M language model and upgraded the vision encoder to SigLIP2 (base, patch16-512) while retaining the Idefics3-style connector (pixel-shuffle projector). The resulting model has 258M parameters and shows consistent accuracy gains across layout analysis, full-page OCR, code, equations, and tables (see metrics below). IBM also addressed instability failure modes observed in the preview model (e.g., repetitive token loops).
Architecture and training pipeline
- Backbone: Idefics3-derived stack with SigLIP2 vision encoder → pixel-shuffle connector → Granite 165M LLM (see the loading sketch after this list).
- Training framework: nanoVLM (a lightweight, pure-PyTorch VLM training toolkit).
- Representation: Outputs DocTags, an IBM-authored markup designed for unambiguous document structure (elements + coordinates + relationships), which downstream tools convert to Markdown/HTML/JSON.
- Compute: Trained on IBM’s Blue Vela H100 cluster.
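As a rough illustration of what that stack looks like at inference time, the model can be loaded like other Idefics3-style VLMs through Hugging Face Transformers. The snippet below is a minimal sketch: the checkpoint id `ibm-granite/granite-docling-258M` and the conversion prompt are assumptions based on the Hugging Face listing, so check the model card for exact usage.

```python
# Minimal sketch: loading Granite-Docling as an Idefics3-style VLM via Transformers.
# Assumed: checkpoint id and prompt wording (see the model card for the exact form).
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

model_id = "ibm-granite/granite-docling-258M"   # assumed Hugging Face checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

image = Image.open("page.png")  # a rendered document page
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Convert this page to docling."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=4096)
# Decode only the newly generated tokens; keep tag tokens so the DocTags markup survives.
doctags = processor.batch_decode(generated[:, inputs["input_ids"].shape[1]:],
                                 skip_special_tokens=False)[0]
print(doctags)  # DocTags markup for the page
```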
Quantified improvements (Granite-Docling-258M vs. SmolDocling-256M preview)
Evaluated with docling-eval, LMMS-Eval, and task-specific datasets:
- Layout: mAP 0.27 vs. 0.23; F1 0.86 vs. 0.85.
- Full-page OCR: F1 0.84 vs. 0.80; lower edit distance.
- Code recognition: F1 0.988 vs. 0.915; edit distance 0.013 vs. 0.114.
- Equation recognition: F1 0.968 vs. 0.947.
- Table recognition (FinTabNet @ 150 dpi): TEDS-structure 0.97 vs. 0.82; TEDS with content 0.96 vs. 0.76.
- Other benchmarks: MMStar 0.30 vs. 0.17; OCRBench 500 vs. 338.
- Stability: “avoids infinite loops more effectively” (a production-oriented fix).
Multilingual support
Granite-Docling adds experimental support for Japanese, Arabic, and Chinese. IBM marks this as early-stage; English remains the primary target.
How the DocTags pathway changes document AI
Typical OCR-to-Markdown pipelines lose structural information and complicate downstream retrieval-augmented generation (RAG). Granite-Docling emits DocTags, a compact, LLM-friendly structural grammar, which Docling converts into Markdown/HTML/JSON (as sketched below). This preserves table topology, inline/floating math, code blocks, captions, and reading order with explicit coordinates, improving index quality and grounding for RAG and analytics.
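To make that pathway concrete, the sketch below shows one plausible way to turn a DocTags string into Markdown, HTML, or a JSON-style dict using docling-core document types. The class and method names (`DocTagsDocument.from_doctags_and_image_pairs`, `DoclingDocument.load_from_doctags`) follow the published model-card examples and should be checked against the installed docling-core version.

```python
# Sketch, under the assumptions noted above: converting Granite-Docling's
# DocTags output into structured formats with docling-core.
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from PIL import Image

doctags = "<doctag>...</doctag>"      # DocTags emitted by Granite-Docling for one page
page_image = Image.open("page.png")   # the page image the tags were generated from

doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [page_image])
doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="sample")

markdown = doc.export_to_markdown()   # readable, but flattens some structure
html = doc.export_to_html()           # keeps more layout
as_dict = doc.export_to_dict()        # full structured representation
```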
Inference and integration
- Docling integration (recommended): The docling CLI/SDK automatically pulls Granite-Docling and converts PDFs, office documents, and images to multiple formats. IBM positions the model as a component within Docling pipelines rather than a general-purpose VLM (see the sketch after this list).
- Runtimes: Works with Transformers, vLLM, ONNX, and MLX; a dedicated MLX build is optimized for Apple Silicon. A Hugging Face Space provides an interactive demo (ZeroGPU).
- License: Apache-2.0.
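For the Docling route, a minimal Python sketch follows. `DocumentConverter` is the standard docling entry point; selecting Granite-Docling as the VLM backend may require additional pipeline options not shown here, and the file path is illustrative.

```python
# Minimal sketch of the Docling SDK path (default settings assumed; enabling the
# Granite-Docling VLM pipeline may require extra pipeline options).
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")       # PDF, office document, or image
print(result.document.export_to_markdown())    # or export_to_html() / export_to_dict()
```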
Why Granite-Docling?
For enterprise document AI, small VLMs that preserve structure reduce inference cost and pipeline complexity. Granite-Docling replaces several single-purpose models (layout, OCR, table, code, equation) with a single component that emits a richer intermediate representation, improving downstream retrieval and conversion fidelity. The measured gains in TEDS for tables, F1 for code/equations, and reduced instability make it a practical upgrade from SmolDocling for production workflows.
Summary
Granite-Docling-258M marks a significant advance in compact, structure-preserving document AI. By combining IBM’s Granite backbone, the SigLIP2 vision encoder, and the nanoVLM training framework, it delivers enterprise-ready performance across tables, equations, code, and multilingual text while remaining lightweight and open-source under Apache 2.0. With measurable gains over its SmolDocling predecessor and seamless integration into Docling pipelines, Granite-Docling provides a practical foundation for document conversion and RAG workflows where precision and reliability matter.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable to a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.