Meta trained one of its AI models, called Llama 3, in 2024 and published the results in a widely covered paper. During a 54-day period of pre-training, Llama 3 experienced 466 job interruptions, 419 of which were unexpected. Upon further investigation, Meta learned that 78% of those hiccups were caused by hardware issues such as GPU and host component failures.
Hardware issues like these don't just cause job interruptions. They can also lead to silent data corruption (SDC), causing unwanted data loss or inaccuracies that often go undetected for extended periods.
While Meta's pre-training interruptions were unexpected, they shouldn't be entirely surprising. AI models like Llama 3 have massive processing demands that require colossal computing clusters. For training alone, AI workloads can require hundreds of thousands of nodes and associated GPUs working in unison for weeks or months at a time.
The intensity and scale of AI processing and switching create a tremendous amount of heat, voltage fluctuations and noise, all of which place unprecedented stress on computational hardware. The GPUs and underlying silicon can degrade more rapidly than they would under normal (or what used to be normal) conditions. Performance and reliability wane accordingly.
This is especially true for sub-5 nm process technologies, where silicon degradation and faulty behavior are observed upon manufacturing and in the field.
But what can be done about it? How can unanticipated interruptions and SDC be mitigated? And how can chip design teams ensure optimal performance and reliability as the industry pushes forward with newer, bigger AI workloads that demand even more processing capacity and scale?
Ensuring silicon reliability, availability and serviceability (RAS)
Certain AI players like Meta have established monitoring and diagnostics capabilities to improve the availability and reliability of their computing environments. But with processing demands, hardware failures and SDC issues on the rise, there is a distinct need for test and telemetry capabilities at deeper levels: all the way down to the silicon and multi-die packages within each XPU/GPU, as well as the interconnects that bring them together.
The key is silicon lifecycle management (SLM) solutions that help ensure end-to-end RAS, from design and manufacturing to bring-up and in-field operation.
With better visibility, monitoring, and diagnostics at the silicon level, design teams can:
- Gain telemetry-based insights into why chips are failing or why SDC is occurring.
- Identify voltage or timing degradation, overheating, and mechanical failures in silicon components, multi-die packages, and high-speed interconnects.
- Conduct more precise thermal and power characterization for AI workloads.
- Detect, characterize, and resolve radiation, voltage noise, and mechanism failures that can lead to undetected bit flips and SDC.
- Improve silicon yield, quality, and in-field RAS.
- Implement reliability-focused techniques, like triple modular redundancy and dual-core lockstep, during the register-transfer level (RTL) design phase to mitigate SDC.
- Establish an accurate pre-silicon aging simulation methodology to detect sensitive or vulnerable circuits and replace them with aging-resilient circuits.
- Improve outlier detection on reliability models, which helps minimize in-field SDC.
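To make the two redundancy techniques in the list concrete, here is a minimal behavioral sketch in Python. It is not RTL and not taken from any Synopsys flow; the function names and the 8-bit example word are assumptions chosen for illustration. It shows how a bitwise majority voter masks a single corrupted copy, while a lockstep comparison only detects the mismatch.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote across three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)


def lockstep_mismatch(primary: int, shadow: int) -> bool:
    """Dual-core lockstep check: flag any disagreement between two cores."""
    return primary != shadow


def flip_bit(word: int, bit: int) -> int:
    """Model a single bit flip by inverting one bit of the word."""
    return word ^ (1 << bit)


# Hypothetical 8-bit result and a copy corrupted by one bit flip.
golden = 0b1011_0010
upset = flip_bit(golden, 5)

# TMR masks the fault: two healthy copies outvote the corrupted one.
voted = tmr_vote(golden, upset, golden)

# Lockstep only detects the fault; recovery is left to the system.
detected = lockstep_mismatch(golden, upset)
```

The trade-off is the usual one: triple redundancy costs roughly three times the logic but corrects a single fault in place, whereas lockstep costs roughly twice the logic and needs a separate recovery path once a mismatch is flagged.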
Silicon lifecycle management (SLM) solutions help ensure end-to-end reliability, availability, and serviceability. Source: Synopsys
An SLM design example
SLM IP and analytics solutions help improve silicon health and provide operational metrics at each phase of the system lifecycle. This includes environmental monitoring for understanding and optimizing silicon performance based on the operating environment of the system; structural monitoring to identify performance variations from design to in-field operation; and functional monitoring to track the health and anomalies of critical system functions.
Below are the key features and capabilities that SLM IP provides:
- Process, voltage and temperature monitors
  - Help ensure optimal operation while maximizing performance, power, and reliability.
  - Highly accurate and distributed monitoring throughout the die, enabling thermal management via frequency throttling.
- Path margin monitors
  - Measure timing margin of 1000+ synthetic and functional paths (in-test and in-field).
  - Enable silicon performance optimization based on actual margins.
  - Automated path selection, IP insertion, and scan generation.
- Clock and delay monitors
  - Measure the delay between the edges of multiple signals.
  - Check the quality of the clock duty cycle.
  - Monitor memory read access time with built-in self-test (BIST).
  - Characterize digital delay lines.
- UCIe monitor, test and repair
  - Monitor signal integrity of die-to-die UCIe lane(s).
  - Generate algorithmic BIST patterns to detect interconnect fault types, including lane-to-lane crosstalk.
  - Perform cumulative lane repair with redundancy allocation (upon manufacturing and in-field).
- High-speed access and test
  - Enable testing over functional interfaces (PCIe, USB and SPI).
  - For in-field operation as well as wafer sort, final test, and system-level test.
  - Can be used in conjunction with automated test equipment.
  - Help conduct in-field remote diagnoses and lower-cost test via reduced pin count.
- HBM external test and repair
  - Comprehensive, silicon-proven DRAM stack test, repair and diagnostics engine.
  - Support third-party HBM DRAM stack providers.
  - Provide high-performance die-to-die interconnect test and repair support.
  - Operate in conjunction with HBM PHY and support a range of HBM protocols and configurations.
- SLM hierarchical subsystem
  - Automated hierarchical SLM and test manageability solution for system-on-chips (SoCs).
  - Automated integration and access of all IP/cores with in-system scheduling.
  - Pre-validated, ready ATE patterns with pattern porting.
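As an illustration of the first item, thermal management via frequency throttling driven by distributed temperature monitors, the sketch below shows one simple way such a policy could work. Everything in it is an assumption made for the example: the frequency steps, the two temperature thresholds, the sensor readings, and the function name are hypothetical and do not describe any specific monitor IP or firmware.

```python
from typing import List, Sequence

# Hypothetical values chosen for illustration only.
FREQ_STEPS_MHZ = [1800, 1500, 1200, 900]  # ordered fastest to slowest
THROTTLE_AT_C = 95.0  # step down at or above this temperature
RELEASE_AT_C = 85.0   # step back up at or below this temperature


def next_freq_index(current: int, sensor_temps_c: Sequence[float]) -> int:
    """Return the next index into FREQ_STEPS_MHZ.

    The hottest sensor on the die drives the decision. The gap between
    the two thresholds adds hysteresis so the clock does not oscillate.
    """
    hottest = max(sensor_temps_c)
    if hottest >= THROTTLE_AT_C and current < len(FREQ_STEPS_MHZ) - 1:
        return current + 1  # one step slower
    if hottest <= RELEASE_AT_C and current > 0:
        return current - 1  # one step faster
    return current


# Replay five sampling intervals of readings from three on-die sensors.
samples = [
    [70.0, 72.5, 71.0],
    [88.0, 96.0, 90.0],
    [91.0, 97.5, 93.0],
    [86.0, 90.0, 88.0],
    [80.0, 84.0, 82.0],
]
idx = 0
history: List[int] = []
for temps in samples:
    idx = next_freq_index(idx, temps)
    history.append(FREQ_STEPS_MHZ[idx])
```

Taking the maximum across sensors is the reason distributed monitoring matters here: a single hotspot is enough to trigger throttling even when the rest of the die is cool.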
Silicon test and telemetry in the age of AI
With the scale and processing demands of AI devices and workloads on the rise, system reliability, silicon health and SDC issues are becoming more widespread. While there is no single solution or antidote for avoiding these issues, deeper and more comprehensive test, repair, and telemetry at the silicon level can help mitigate them. The ability to detect or predict in-field chip degradation is particularly valuable, enabling corrective action before sudden or catastrophic system failures occur.
Delivering end-to-end visibility through RAS, silicon test, repair, and telemetry will be increasingly important as we move toward the age of AI.
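One simple way to picture degradation prediction is to fit a trend to periodic timing-margin readings and estimate when the margin will reach a minimum acceptable value. The Python sketch below does this with an ordinary least-squares line. The linear aging model, the sample data, the threshold, and the function names are all assumptions for illustration; real aging behavior is often nonlinear, and production analytics would use more robust models.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_until_threshold(
    hours: Sequence[float],
    margins_ps: Sequence[float],
    threshold_ps: float,
) -> Optional[float]:
    """Estimate hours after the last sample until margin hits the threshold.

    Returns None when the fitted margin is flat or improving.
    """
    slope, intercept = fit_line(hours, margins_ps)
    if slope >= 0:
        return None
    crossing = (threshold_ps - intercept) / slope
    return max(0.0, crossing - hours[-1])


# Hypothetical in-field readings: margin in picoseconds versus operating hours.
hours = [0.0, 1000.0, 2000.0, 3000.0]
margins_ps = [50.0, 48.0, 46.0, 44.0]

remaining = hours_until_threshold(hours, margins_ps, threshold_ps=40.0)
```

With these made-up readings the margin shrinks by 2 ps per 1,000 hours, so the estimate comes out to roughly 2,000 more hours before the 40 ps threshold is reached, which is the kind of lead time that allows corrective action before a failure.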
Shankar Krishnamoorthy is chief product development officer at Synopsys.
Krishna Adusumalli is an R&D engineer at Synopsys.
Jyotika Athavale is architecture engineering director at Synopsys.
Yervant Zorian is chief architect at Synopsys.
The post Addressing hardware failures and silent data corruption in AI chips appeared first on EDN.