
IBM AI Releases Granite-Docling-258M: An Open-Source, Enterprise-Ready Document AI Model


IBM has released Granite-Docling-258M, an open-source (Apache-2.0) vision-language model designed specifically for end-to-end document conversion. The model targets layout-faithful extraction (tables, code, equations, lists, captions, and reading order), emitting a structured, machine-readable representation rather than lossy Markdown. It is available on Hugging Face with a live demo and an MLX build for Apple Silicon.

What’s new compared to SmolDocling?

Granite-Docling is the production-ready successor to SmolDocling-256M. IBM replaced the earlier backbone with a Granite 165M language model and upgraded the vision encoder to SigLIP2 (base, patch16-512), while retaining the Idefics3-style connector (a pixel-shuffle projector). The resulting model has 258M parameters and shows consistent accuracy gains across layout analysis, full-page OCR, code, equations, and tables (see metrics below). IBM also addressed instability failure modes observed in the preview model (e.g., repetitive token loops).

Architecture and training pipeline

  • Backbone: Idefics3-derived stack with a SigLIP2 vision encoder → pixel-shuffle connector → Granite 165M LLM.
  • Training framework: nanoVLM (a lightweight, pure-PyTorch VLM training toolkit).
  • Representation: Outputs DocTags, an IBM-authored markup designed for unambiguous document structure (elements + coordinates + relationships), which downstream tools convert to Markdown/HTML/JSON.
  • Compute: Trained on IBM’s Blue Vela H100 cluster.
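The pixel-shuffle connector in the stack above compresses the visual token grid before it reaches the language model: neighboring patch tokens are merged, trading spatial resolution for channel depth so the LLM sees fewer, richer tokens. A minimal pure-Python sketch of the idea (the grid size and merge ratio below are illustrative assumptions, not the model's actual configuration):

```python
def pixel_shuffle(grid, r):
    """Merge each r x r neighborhood of tokens into one token whose feature
    vector concatenates the r*r constituent features.
    `grid` is an H x W list-of-lists of feature vectors (lists of floats).
    Token count shrinks by a factor of r*r; feature dim grows by r*r."""
    h, w = len(grid), len(grid[0])
    assert h % r == 0 and w % r == 0, "grid must tile evenly"
    out = []
    for i in range(0, h, r):
        row = []
        for j in range(0, w, r):
            merged = []
            for di in range(r):
                for dj in range(r):
                    merged.extend(grid[i + di][j + dj])
            row.append(merged)
        out.append(row)
    return out

# Toy 4x4 grid of 2-dim tokens, merged with ratio 2:
grid = [[[float(i), float(j)] for j in range(4)] for i in range(4)]
compressed = pixel_shuffle(grid, 2)
# 16 tokens of dim 2 become 4 tokens of dim 8
```

In the real connector this happens on the encoder's patch embeddings (followed by a projection into the LLM's embedding space); the sketch only shows the token-count/feature-dim trade.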

Quantified improvements (Granite-Docling-258M vs. SmolDocling-256M preview)

Evaluated with docling-eval, LMMS-Eval, and task-specific datasets:

  • Layout: mAP 0.27 vs. 0.23; F1 0.86 vs. 0.85.
  • Full-page OCR: F1 0.84 vs. 0.80; lower edit distance.
  • Code recognition: F1 0.988 vs. 0.915; edit distance 0.013 vs. 0.114.
  • Equation recognition: F1 0.968 vs. 0.947.
  • Table recognition (FinTabNet @ 150 dpi): TEDS (structure) 0.97 vs. 0.82; TEDS (with content) 0.96 vs. 0.76.
  • Other benchmarks: MMStar 0.30 vs. 0.17; OCRBench 500 vs. 338.
  • Stability: avoids infinite loops more effectively (a production-oriented fix).
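The edit-distance figures above are normalized string distances between predicted and reference text, where 0.0 is a perfect match. A self-contained sketch of that metric as commonly defined (the exact normalization used by docling-eval is an assumption here):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    """0.0 = exact match, 1.0 = completely different."""
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))

# A near-miss OCR output: one substituted character out of eleven.
d = normalized_edit_distance("def main():", "def majn():")
```

On this toy pair the distance is 1/11 ≈ 0.09; the reported 0.013 for code recognition means roughly one wrong character per 75 on average.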

Multilingual support

Granite-Docling adds experimental support for Japanese, Arabic, and Chinese. IBM marks this as early-stage; English remains the primary target.

How the DocTags pathway changes document AI

Typical OCR-to-Markdown pipelines lose structural information and complicate downstream retrieval-augmented generation (RAG). Granite-Docling instead emits DocTags, a compact, LLM-friendly structural grammar, which Docling converts into Markdown/HTML/JSON. This preserves table topology, inline/floating math, code blocks, captions, and reading order with explicit coordinates, improving index quality and grounding for RAG and analytics.
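The real DocTags grammar is specified in the Docling project (including coordinate tokens and table markup). As a rough illustration of why a tagged intermediate form converts cleanly to multiple output formats, here is a toy converter over a simplified, made-up tag set; it is not the actual DocTags vocabulary:

```python
import re

# Hypothetical tag set loosely inspired by DocTags, for illustration only.
TOY_DOC = (
    "<section_header>Results</section_header>"
    "<text>Accuracy improved on all tasks.</text>"
    "<code>print('hello')</code>"
)

# One rendering rule per element type; an HTML or JSON target would simply
# swap this table, since the structure itself is unambiguous.
RULES = {
    "section_header": "## {}",
    "text": "{}",
    "code": "    {}",   # Markdown indented code block
}

def toy_doctags_to_markdown(doc: str) -> str:
    """Map each <tag>body</tag> element to a Markdown block, preserving order."""
    blocks = []
    for tag, body in re.findall(r"<(\w+)>(.*?)</\1>", doc, flags=re.S):
        blocks.append(RULES[tag].format(body))
    return "\n\n".join(blocks)

md = toy_doctags_to_markdown(TOY_DOC)
```

Because every element keeps its type (and, in the real format, its page coordinates), nothing forces an early, lossy commitment to one output syntax.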

Inference and integration

  • Docling integration (recommended): The docling CLI/SDK automatically pulls Granite-Docling and converts PDFs, office documents, and images into multiple formats. IBM positions the model as a component within Docling pipelines rather than a general-purpose VLM.
  • Runtimes: Works with Transformers, vLLM, ONNX, and MLX; a dedicated MLX build is optimized for Apple Silicon. A Hugging Face Space provides an interactive demo (ZeroGPU).
  • License: Apache-2.0.

Why Granite-Docling?

For enterprise document AI, small VLMs that preserve structure reduce inference cost and pipeline complexity. Granite-Docling replaces several single-purpose models (layout, OCR, table, code, equation) with a single component that emits a richer intermediate representation, improving downstream retrieval and conversion fidelity. The measured gains in TEDS for tables, F1 for code and equations, and reduced instability make it a practical upgrade from SmolDocling for production workflows.


Summary

Granite-Docling-258M marks a significant advance in compact, structure-preserving document AI. By combining IBM’s Granite backbone, the SigLIP2 vision encoder, and the nanoVLM training framework, it delivers enterprise-ready performance across tables, equations, code, and multilingual text, all while remaining lightweight and open-source under Apache 2.0. With measurable gains over its SmolDocling predecessor and seamless integration into Docling pipelines, Granite-Docling provides a practical foundation for document conversion and RAG workflows where precision and reliability are essential.




Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.
