
Maia 200: The AI accelerator built for inference


Today, we’re proud to introduce Maia 200, a breakthrough inference accelerator engineered to dramatically improve the economics of AI token generation. Maia 200 is an AI inference powerhouse: an accelerator built on TSMC’s 3nm process with native FP8/FP4 tensor cores, a redesigned memory system with 216GB of HBM3e at 7 TB/s and 272MB of on-chip SRAM, plus data movement engines that keep massive models fed, fast and highly utilized. This makes Maia 200 the most performant first-party silicon from any hyperscaler, with three times the FP4 performance of the third-generation Amazon Trainium and FP8 performance above Google’s seventh-generation TPU. Maia 200 is also the most efficient inference system Microsoft has ever deployed, with 30% better performance per dollar than the latest generation of hardware in our fleet today.

Maia 200 is part of our heterogeneous AI infrastructure and will serve multiple models, including the latest GPT-5.2 models from OpenAI, bringing a performance-per-dollar advantage to Microsoft Foundry and Microsoft 365 Copilot. The Microsoft Superintelligence team will use Maia 200 for synthetic data generation and reinforcement learning to improve next-generation in-house models. For synthetic data pipeline use cases, Maia 200’s distinctive design helps accelerate the rate at which high-quality, domain-specific data can be generated and filtered, feeding downstream training with fresher, more targeted signals.

Maia 200 is deployed in our US Central datacenter region near Des Moines, Iowa, with the US West 3 datacenter region near Phoenix, Arizona, coming next and more regions to follow. Maia 200 integrates seamlessly with Azure, and we’re previewing the Maia SDK, a complete set of tools to build and optimize models for Maia 200. It includes a full set of capabilities, including PyTorch integration, a Triton compiler and optimized kernel library, and access to Maia’s low-level programming language. This gives developers fine-grained control when needed while enabling easy model porting across heterogeneous hardware accelerators.
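The Maia-specific pieces of the SDK aren’t documented in this post, but the Triton path means kernels can be written in standard Triton and retargeted by the compiler. As a hedged sketch, the snippet below is an ordinary Triton vector-add kernel of the kind that path consumes; nothing in it is Maia-specific, and the commented usage assumes a generic GPU device visible to PyTorch and Triton.

```python
# A minimal, standard Triton kernel -- illustrative only; not Maia-specific code.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements            # guard the tail of the array
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    # 1D launch grid: enough program instances to cover every element once.
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

# Usage (assumes an accelerator device visible to PyTorch/Triton):
# x = torch.randn(4096, device="cuda"); y = torch.randn(4096, device="cuda")
# print(torch.allclose(add(x, y), x + y))
```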


Engineered for AI inference

Fabricated on TSMC’s cutting-edge 3-nanometer process, each Maia 200 chip contains over 140 billion transistors and is tailored for large-scale AI workloads while also delivering efficient performance per dollar. On both fronts, Maia 200 is built to excel. It is designed for the latest models using low-precision compute, with each Maia 200 chip delivering over 10 petaFLOPS of 4-bit (FP4) and over 5 petaFLOPS of 8-bit (FP8) performance, all within a 750W SoC TDP envelope. In practical terms, Maia 200 can effortlessly run today’s largest models, with plenty of headroom for even bigger models in the future.
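Dividing those peak figures by the TDP gives a rough sense of compute per watt; this is simple arithmetic on the numbers above, not an additional claim about delivered efficiency:

\[
\frac{10\ \mathrm{PFLOPS}\ (\mathrm{FP4})}{750\ \mathrm{W}} \approx 13.3\ \mathrm{TFLOPS/W},
\qquad
\frac{5\ \mathrm{PFLOPS}\ (\mathrm{FP8})}{750\ \mathrm{W}} \approx 6.7\ \mathrm{TFLOPS/W}.
\]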

A close-up of the Maia 200 AI accelerator chip.

Crucially, FLOPS aren’t the only ingredient for faster AI. Feeding data is equally important. Maia 200 attacks this bottleneck with a redesigned memory subsystem, centered on narrow-precision datatypes, a specialized DMA engine, on-die SRAM and a specialized NoC fabric for high-bandwidth data movement, increasing token throughput.
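To see why data movement matters as much as raw compute, here is a quick roofline-style calculation using only the headline figures quoted earlier in this post (7 TB/s of HBM3e bandwidth, 216GB of capacity, 5 and 10 petaFLOPS at FP8 and FP4). It is illustrative arithmetic, not a statement about how any particular model performs on Maia 200.

```python
# Back-of-the-envelope roofline figures from the specs quoted in this post.
# Illustrative arithmetic only -- real utilization depends on model, batching and kernels.

HBM_BANDWIDTH = 7e12        # bytes/s (7 TB/s)
HBM_CAPACITY = 216e9        # bytes   (216 GB)
FP8_FLOPS = 5e15            # FLOP/s  (5 petaFLOPS, FP8)
FP4_FLOPS = 10e15           # FLOP/s  (10 petaFLOPS, FP4)

# Arithmetic intensity needed to keep the tensor cores busy rather than
# waiting on HBM (the "ridge point" of a simple roofline model).
ridge_fp8 = FP8_FLOPS / HBM_BANDWIDTH   # ~714 FLOPs per byte moved
ridge_fp4 = FP4_FLOPS / HBM_BANDWIDTH   # ~1429 FLOPs per byte moved

# Memory-bound floor: streaming the full 216 GB of HBM once takes ~31 ms,
# a rough lower bound on per-token latency for a batch-1 decode whose
# weights and KV cache fill the whole memory.
full_sweep_s = HBM_CAPACITY / HBM_BANDWIDTH

print(f"FP8 ridge point: {ridge_fp8:.0f} FLOPs/byte")
print(f"FP4 ridge point: {ridge_fp4:.0f} FLOPs/byte")
print(f"Full-HBM sweep:  {full_sweep_s * 1e3:.1f} ms per pass")
```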

A table with the title “Industry-leading capability” shows peak specifications for Azure Maia 200, AWS Trainium 3 and Google TPU v7.

Optimized AI systems

At the systems level, Maia 200 introduces a novel, two-tier scale-up network design built on standard Ethernet. A custom transport layer and tightly integrated NIC unlock performance, strong reliability and significant cost advantages without relying on proprietary fabrics.

Each accelerator exposes:

  • 2.8 TB/s of bidirectional, dedicated scale-up bandwidth
  • Predictable, high-performance collective operations across clusters of up to 6,144 accelerators (an idealized collective-timing sketch follows this list)
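To put the per-accelerator bandwidth figure in context, here is a back-of-the-envelope estimate of an idealized ring all-reduce. The split of the 2.8 TB/s bidirectional figure into 1.4 TB/s per direction, the 1 GiB buffer and the 64-accelerator group size are assumptions for illustration; latency and protocol overheads are ignored, so this is a lower bound, not a prediction.

```python
# Rough, idealized estimate of ring all-reduce time on the scale-up fabric.
# Assumptions (not Microsoft's figures): 1.4 TB/s per direction, perfect link
# utilization, zero latency and protocol overhead.

def ring_allreduce_seconds(bytes_per_rank: float,
                           num_accelerators: int,
                           per_direction_bw: float = 1.4e12) -> float:
    """Classic ring all-reduce: each rank moves 2*(N-1)/N of the buffer."""
    n = num_accelerators
    traffic = 2 * (n - 1) / n * bytes_per_rank   # bytes sent per rank
    return traffic / per_direction_bw

# Example: all-reducing a 1 GiB buffer across a 64-accelerator group.
t = ring_allreduce_seconds(bytes_per_rank=2**30, num_accelerators=64)
print(f"Idealized all-reduce time: {t * 1e3:.2f} ms")
```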

This architecture delivers scalable performance for dense inference clusters while reducing power usage and overall TCO across Azure’s global fleet.

Within each tray, four Maia accelerators are fully connected with direct, non-switched links, keeping high-bandwidth communication local for optimal inference efficiency. The same communication protocols are used for intra-rack and inter-rack networking using the Maia AI transport protocol, enabling seamless scaling across nodes, racks and clusters of accelerators with minimal network hops. This unified fabric simplifies programming, improves workload flexibility and reduces stranded capacity while maintaining consistent performance and cost efficiency at cloud scale.
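As a loose illustration of that locality, the sketch below classifies communication between two accelerators by whether they share a tray (four accelerators per tray, as described above) or must cross the scale-up fabric. The rank numbering and the two-way classification are simplifications for illustration, not a description of the actual routing logic.

```python
# Illustrative model of the layout described above: four accelerators per tray
# on direct, non-switched links; anything beyond the tray goes over the
# Ethernet-based scale-up fabric. Tray size 4 is from the post; the rest is
# a simplification for illustration.

TRAY_SIZE = 4

def comm_path(rank_a: int, rank_b: int) -> str:
    """Classify how two accelerators (by global index) would reach each other."""
    if rank_a == rank_b:
        return "local (same accelerator)"
    if rank_a // TRAY_SIZE == rank_b // TRAY_SIZE:
        return "direct tray link (non-switched)"
    return "scale-up fabric (Maia AI transport over Ethernet)"

# Example: ranks 1 and 3 share a tray; ranks 1 and 9 cross the fabric.
print(comm_path(1, 3))   # direct tray link (non-switched)
print(comm_path(1, 9))   # scale-up fabric (Maia AI transport over Ethernet)
```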

A top-down view of the Maia 200 server blade.

A cloud-native development approach

A core principle of Microsoft’s silicon development programs is to validate as much of the end-to-end system as possible ahead of final silicon availability.

A sophisticated pre-silicon environment guided the Maia 200 architecture from its earliest stages, modeling the computation and communication patterns of LLMs with high fidelity. This early co-development environment enabled us to optimize silicon, networking and system software as a unified whole, long before first silicon.

We also designed Maia 200 for fast, seamless availability in the datacenter from the start, building out early validation of some of the most complex system components, including the backend network and our second-generation, closed-loop, liquid-cooling Heat Exchanger Unit. Native integration with the Azure control plane delivers security, telemetry, diagnostics and management capabilities at both the chip and rack levels, maximizing reliability and uptime for production-critical AI workloads.

As a result of these investments, AI models were running on Maia 200 silicon within days of the first packaged parts arriving. Time from first silicon to first datacenter rack deployment was reduced to less than half that of comparable AI infrastructure programs. And this end-to-end approach, from chip to software to datacenter, translates directly into higher utilization, faster time to production and sustained improvements in performance per dollar and per watt at cloud scale.

A view of the Maia 200 rack and the HXU cooling unit.

Join the Maia SDK preview

The era of large-scale AI is just beginning, and infrastructure will define what’s possible. Our Maia AI accelerator program is designed to be multi-generational. As we deploy Maia 200 across our global infrastructure, we’re already designing for future generations and expect each generation to continually set new benchmarks for what’s possible and deliver ever greater performance and efficiency for the most important AI workloads.

Today, we’re inviting developers, AI startups and academics to begin exploring early model and workload optimization with the new Maia 200 software development kit (SDK). The SDK includes a Triton compiler, support for PyTorch, low-level programming in NPL and a Maia simulator and cost calculator to optimize for efficiencies earlier in the code lifecycle. Join the preview here.
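As a generic illustration of the kind of estimate a cost calculator enables, the standard cost-per-token formula is shown below; the variable names are placeholders and this is not the SDK’s actual cost model:

\[
\text{cost per } 10^{6} \text{ tokens} \;=\; \frac{\text{accelerator price } (\$/\mathrm{hour})}{\text{throughput } (\text{tokens}/\mathrm{s}) \times 3600\ \mathrm{s/hour}} \times 10^{6}.
\]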

Get more images, video and resources on our Maia 200 website and read more details.

Scott Guthrie is responsible for hyperscale cloud computing solutions and services including Azure, Microsoft’s cloud computing platform, generative AI solutions, data platforms, and information and cybersecurity. These platforms and services help organizations worldwide solve urgent challenges and drive long-term transformation.
