Microsoft is contributing new standards across power, cooling, sustainability, security, networking, and fleet resiliency to advance innovation.
In the transition from building computing infrastructure for cloud scale to building cloud and AI infrastructure for frontier scale, the world of computing has experienced tectonic shifts in innovation. Throughout this journey, Microsoft has shared its learnings and best practices, optimizing our cloud infrastructure stack in cross-industry forums such as the Open Compute Project (OCP) Foundation.
Today, we see that the next phase of cloud infrastructure innovation is poised to be the most consequential period of transformation yet. In just the last year, Microsoft has added more than 2 gigawatts of new capacity and launched the world's most powerful AI datacenter, which delivers 10x the performance of today's fastest supercomputer. Yet, this is only the beginning.
Delivering AI infrastructure at the highest performance and lowest cost requires a systems approach, with optimizations across the stack to drive quality, speed, and resiliency at a level that can provide a consistent experience to our customers. In the quest to deliver resilient, sustainable, secure, and broadly scalable technology that handles the breadth of AI workloads, we are embarking on an ambitious new journey: one not just of redefining infrastructure innovation at every layer of execution from silicon to systems, but one of tightly integrated industry alignment on standards that offer a model for global interoperability and standardization.
At this year's OCP Global Summit, Microsoft is contributing new standards across power, cooling, sustainability, security, networking, and fleet resiliency to further advance innovation in the industry.
Redefining power distribution for the AI era
As AI workloads scale globally, hyperscale datacenters are experiencing unprecedented power density and distribution challenges.
Last year, at the OCP Global Summit, we partnered with Meta and Google on the development of Mt. Diablo, a disaggregated power architecture. This year, we are building on this innovation with the next step in our full-stack transformation of datacenter power systems: solid-state transformers. Solid-state transformers simplify the power chain with new conversion technologies and protection schemes that can accommodate future rack voltage requirements.
Training large models across thousands of GPUs also introduces variable and intense power draw patterns that can strain the grid, the utility, and traditional power delivery systems. These fluctuations not only risk hardware reliability and operational efficiency but also create challenges for capacity planning and sustainability goals.
Together with key industry partners, Microsoft is leading a power stabilization initiative to address this challenge. In a recently published paper with OpenAI and NVIDIA, "Power Stabilization for AI Training Datacenters," we describe how full-stack innovations spanning rack-level hardware, firmware orchestration, predictive telemetry, and facility integration can smooth power spikes, reduce power overshoot by 40%, and mitigate operational risk and costs to enable predictable, scalable power delivery for AI training clusters.
This year, at the OCP Global Summit, Microsoft is joining forces with industry partners to launch a dedicated power stabilization workgroup. Our goal is to foster open collaboration across hyperscalers and hardware partners, sharing our learnings from full-stack innovation and inviting the community to co-develop new methodologies that address the unique power challenges of AI training datacenters. By building on the insights from our recently published white paper, we aim to accelerate industry-wide adoption of resilient, scalable power delivery solutions for the next generation of AI infrastructure. Read more about our power stabilization efforts.
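The mechanisms in the paper span hardware, firmware, and the facility, but the core idea behind smoothing training power spikes can be illustrated with a toy ramp-rate limiter. The function name, numbers, and single-rack framing below are illustrative assumptions for this sketch, not details from the paper:

```python
# Illustrative ramp-rate limiter: caps how quickly a rack's power
# setpoint may rise or fall between telemetry intervals, smoothing
# the sharp draw swings that synchronized AI training steps produce.

def smooth_power(draws_watts, max_ramp_watts):
    """Return a setpoint trace that follows `draws_watts` but never
    changes by more than `max_ramp_watts` per interval."""
    setpoints = [draws_watts[0]]
    for demand in draws_watts[1:]:
        prev = setpoints[-1]
        # Clamp the change in demand to the allowed ramp rate.
        step = max(-max_ramp_watts, min(max_ramp_watts, demand - prev))
        setpoints.append(prev + step)
    return setpoints

# A training job alternating between compute bursts and sync stalls:
raw = [100, 900, 100, 900, 900, 100]
print(smooth_power(raw, max_ramp_watts=300))  # → [100, 400, 100, 400, 700, 400]
```

In practice, capping the slew rate like this trades a little burst performance for a draw profile the utility and backup systems can plan around; the production approach layers predictive telemetry on top so the limit rarely binds.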
Cooling innovations for resiliency
As the power profile for AI infrastructure changes, we are also continuing to rearchitect our cooling infrastructure to support evolving needs around energy consumption, space optimization, and overall datacenter sustainability. Diverse cooling solutions must be implemented to support the scale of our expansion: as we work to build new AI-scale datacenters, we are also utilizing Heat Exchanger Unit (HXU)-based liquid cooling to rapidly deploy new AI capacity within our existing air-cooled datacenter footprint.
Microsoft's next-generation HXU is an upcoming OCP contribution that enables liquid cooling for high-performance AI systems in air-cooled datacenters, supporting global scalability and rapid deployment. The modular HXU design delivers 2X the performance of current models and maintains >99.9% cooling service availability for AI workloads. No datacenter modifications are required, allowing seamless integration and expansion. Learn more about the next-generation HXU here.
Meanwhile, we are continuing to innovate across multiple layers of the stack to address changes in power and heat dissipation: utilizing facility water cooling at datacenter scale, circulating liquid in closed loops from server to chiller, and exploring on-chip cooling innovations like microfluidics to efficiently remove heat directly from the silicon.
Unified networking solutions for growing infrastructure demands
Scaling hundreds of thousands of GPUs to operate as a single, coherent system poses significant challenges: rack-scale interconnects must deliver low-latency, high-bandwidth fabrics that are both efficient and interoperable. As AI workloads grow exponentially and infrastructure demands intensify, we are exploring networking optimizations that can support these needs. To that end, we have developed solutions leveraging scale-up, scale-out, and Wide Area Network (WAN) technologies to enable large-scale distributed training.
We partner closely with standards bodies, like the Ultra Ethernet Consortium (UEC) and UALink, focused on innovation in networking technologies for this critical element of AI systems. We are also driving forward adoption of Ethernet for scale-up networking across the ecosystem and are excited to see the Ethernet for Scale-Up Networking (ESUN) workstream launch under the OCP Networking Project. We look forward to promoting adoption of cutting-edge networking solutions and enabling a multi-vendor ecosystem based on open standards.
Security, sustainability, and quality: Fundamental pillars for resilient AI operations
Defense in depth: Trust at every layer
Our comprehensive approach to scaling AI systems responsibly includes embedding trust and security into every layer of our platform. This year, we are introducing new security contributions that build on our existing body of work in hardware security and introduce new protocols uniquely fit to support the scientific breakthroughs that have been accelerated by the introduction of AI:
- Building on previous years' contributions and Microsoft's collaboration with AMD, Google, and NVIDIA, we have further enhanced Caliptra, our open-source silicon root of trust. The introduction of Caliptra 2.1 extends the hardware root of trust to a full security subsystem. Learn more about Caliptra 2.1 here.
- We have also added Adams Bridge 2.0 to Caliptra, extending support for quantum-resilient cryptographic algorithms in the root of trust.
- Finally, we are contributing OCP Layered Open-source Cryptographic Key Management (L.O.C.K.), a key management block for storage devices that secures media encryption keys in hardware. L.O.C.K. was developed through collaboration between Google, Kioxia, Microsoft, Samsung, and Solidigm.
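A silicon root of trust anchors security by measuring each boot stage before handing it control. As a conceptual illustration only (not Caliptra's actual design: the stage names, payloads, and simple hash fold below are invented for this sketch), a measured-boot chain can be modeled like this:

```python
import hashlib

# Toy measured-boot chain: each firmware stage is hashed into a running
# measurement before control transfers to it, so any tampered stage
# changes the final digest. Real roots of trust use signed manifests
# and hardware registers; this only shows the chaining idea.

def extend(measurement: bytes, stage_image: bytes) -> bytes:
    """Fold the next stage's hash into the running measurement."""
    stage_digest = hashlib.sha384(stage_image).digest()
    return hashlib.sha384(measurement + stage_digest).digest()

stages = [b"rom-loader", b"firmware", b"os-bootloader"]  # hypothetical stages
measurement = b"\x00" * 48  # reset value
for image in stages:
    measurement = extend(measurement, image)

# The final digest attests the entire chain: a verifier recomputing it
# from known-good images detects any single modified stage.
print(measurement.hex())
```

The key property is order-sensitivity: because each stage's digest is folded into the accumulated value, swapping, replacing, or omitting any stage yields a different final measurement.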
Advancing datacenter-scale sustainability
Sustainability continues to be a major area of opportunity for industry collaboration and standardization through communities such as the Open Compute Project. Working collaboratively as an ecosystem of hyperscalers and hardware partners is one catalyst to address the need for sustainable datacenter infrastructure that can effectively scale as compute demands continue to evolve. This year, we are pleased to continue our collaborations as part of OCP's Sustainability workgroup across areas such as carbon reporting, accounting, and circularity:
- Announced at this year's Global Summit, we are partnering with AWS, Google, and Meta to fund the Product Category Rule initiative under the OCP Sustainability workgroup, with the goal of standardizing carbon measurement methodology for devices and datacenter equipment.
- Together with Google, Meta, OCP, Schneider Electric, and the iMasons Climate Accord, we are establishing the Embodied Carbon Disclosure Base Specification to provide a common framework for reporting the carbon impact of datacenter equipment.
- Microsoft is advancing the adoption of waste heat reuse (WHR). In partnership with the NetZero Innovation Hub, NREL, and EU and US collaborators, Microsoft has published heat reuse reference designs and is developing an economic modeling tool that gives datacenter operators and waste heat offtakers the cost of developing WHR infrastructure based on conditions such as the size and capacity of the WHR system, season, location, and the WHR mandates and subsidies in place. These region-specific solutions help operators convert excess heat into usable energy, meeting regulatory requirements and unlocking new capacity, especially in regions like Europe where heat reuse is becoming mandatory.
- We have developed an open methodology for Life Cycle Assessment (LCA) at scale across large IT hardware fleets to drive toward a "gold standard" in sustainable cloud infrastructure.
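The accounting these initiatives standardize combines embodied carbon (manufacturing, amortized over service life) with operational carbon (energy use times grid intensity). As a toy illustration of that split, with every number below an invented placeholder rather than Microsoft data or methodology:

```python
# Toy carbon-accounting sketch: amortize a device's embodied carbon
# over its service life and add operational emissions from energy use.
# All figures are illustrative placeholders.

def annual_footprint_kg(embodied_kg: float, lifetime_years: float,
                        kwh_per_year: float, grid_kg_per_kwh: float) -> float:
    """Per-year footprint = amortized embodied carbon + operational carbon."""
    embodied_per_year = embodied_kg / lifetime_years
    operational = kwh_per_year * grid_kg_per_kwh
    return embodied_per_year + operational

# e.g. 1,200 kg embodied over 6 years, 3,000 kWh/yr on a 0.4 kg/kWh grid:
print(annual_footprint_kg(1200, 6, 3000, 0.4))  # → 1400.0
```

Even this simple split shows why both disclosure standards matter: extending hardware lifetime shrinks the embodied term, while cleaner grids and efficiency shrink the operational one, and neither can be compared across vendors without a shared measurement methodology.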
Rethinking node management: Fleet operational resiliency for the frontier era
As AI infrastructure scales at an unprecedented pace, Microsoft is investing in standardizing how diverse compute nodes are deployed, updated, monitored, and serviced across hyperscale datacenters. In collaboration with AMD, Arm, Google, Intel, Meta, and NVIDIA, we are driving a series of Open Compute Project (OCP) contributions focused on streamlining fleet operations, unifying firmware management and manageability interfaces, and enhancing diagnostics, debug, and RAS (Reliability, Availability, and Serviceability) capabilities. This standardized approach to lifecycle management lays the foundation for consistent, scalable node operations during this period of rapid expansion. Read more about our approach to resilient fleet operations.
Paving the way for frontier-scale AI computing
As we enter a new era of frontier-scale AI development, Microsoft takes pride in leading the advancement of standards that will drive the future of globally deployable AI supercomputing. Our commitment is reflected in our active role in shaping the ecosystem that enables scalable, secure, and reliable AI infrastructure across the globe. We invite attendees of this year's OCP Global Summit to connect with Microsoft at booth #B53 to discover our latest cloud hardware demonstrations. These demonstrations showcase our ongoing collaborations with partners throughout the OCP community, highlighting innovations that support the evolution of AI and cloud technologies.
Connect with Microsoft at the OCP Global Summit 2025 and beyond

