
NVIDIA Issues Hotfix for GPU Driver’s Overheating Issue


Yesterday NVIDIA rushed out a critical hotfix to contain the fallout from a prior driver release that had triggered alarm across AI and gaming communities by causing systems to falsely report safe GPU temperatures, even as cooling demands quietly climbed toward potentially critical levels.

In NVIDIA’s official post about the hotfix release, the issue appears only third in the list of acknowledged fixes, where it is cited as ‘GPU monitoring utilities may stop reporting the GPU temperature after PC wakes from sleep’.

Shortly after the affected Game Ready driver 576.02 was rolled out, a pinned thread on the Stable Diffusion subreddit, titled Read to Save Your GPU!, became a resource for anecdotal issues and user-reported updates regarding the new driver. From these, and other reports around the web, a rough timeline of emergent problems can be established.

The first Reddit report of the bug seems to have occurred late Friday afternoon UTC, on the ZephyrusG14 subreddit, where the user fricy81 cited a post at NVIDIA forums (archived):

A user at NVIDIA forums finds issues after the 576.02 update. Source: https://www.nvidia.com/en-us/geforce/forums/game-ready-drivers/13/563010/geforce-grd-57602-feedback-thread-released-41625/3524072/


The user at NVIDIA forums reported that after installing the driver update, tools like MSI Afterburner and in-game monitors such as the one in Call of Duty (which generally access native system readings, much as Task Manager’s GPU panel does in Windows) stopped updating GPU temperature readings, freezing at around 35-36°C.

Restarting the monitoring software had no effect, the user stated, and only a full system reboot would restore accurate readings. Tools like HWInfo and NVIDIA’s own monitoring app continued to report temperatures correctly. The user emphasized that the issue occurred during normal use, not just after waking the system from sleep.
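The symptom described above (a reading that stays frozen at one value while the GPU is busy) can be checked for with a small amount of logic. The following Python sketch is illustrative only: the class name `StaleReadingDetector`, the window size, and the load threshold are assumptions, not values from NVIDIA or from the forum reports. The optional `read_gpu_temp_c` helper calls the `nvidia-smi` command-line tool; the source does not say whether `nvidia-smi` was affected by the bug, so it is offered as one possible second opinion rather than a guaranteed one.

```python
import subprocess
from collections import deque


def read_gpu_temp_c():
    """Read the first GPU's temperature in °C via nvidia-smi (must be on PATH)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.splitlines()[0])


class StaleReadingDetector:
    """Flags a temperature feed that stops changing while the GPU is under load.

    window and min_load_pct are hypothetical tuning values for illustration.
    """

    def __init__(self, window=30, min_load_pct=50.0):
        self.window = window
        self.min_load_pct = min_load_pct
        self._samples = deque(maxlen=window)

    def add(self, temp_c, load_pct):
        """Record one sample; return True if the feed looks frozen."""
        self._samples.append((temp_c, load_pct))
        if len(self._samples) < self.window:
            return False  # not enough history yet
        distinct_temps = {t for t, _ in self._samples}
        busy = all(load >= self.min_load_pct for _, load in self._samples)
        # An identical reading across the whole window is unremarkable at
        # idle, but suspicious when the GPU has been busy throughout.
        return len(distinct_temps) == 1 and busy
```

A monitoring loop would call `add()` once per polling interval with the temperature reported by the tool under suspicion and a utilization figure from any available source, and warn the user when it returns `True`.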

User feedback across various forums highlighted a general disruption of normal fan curve behavior and an alteration of core thermal regulation, resulting in graphics processing units idling at unexpectedly high temperatures, and alarmingly overheating under what would usually be considered standard operational loads, as detailed in this comment:

‘I could tell something was off. The weather outside was probably around 55°F / 12°C, but I was cooking alive in my room. My window was open, and yet I couldn’t feel any difference. All the fans were running at max, and temps looked fine at first – around 68°C to 72°C after gaming for a while.

‘At first, that seemed normal – until the next morning, when I realized those aren’t idle temps, and the fans were still [kicking].

‘I had done some AI overclocking after fixing a few things recently, so I wasn’t sure if the values had just spiked too high. It’s happened once before after installing ASUS AI Suite 3 – the BIOS settings wouldn’t even work properly because of it.

‘Anyway, I went forward and rolled again to an older driver for now.’

Sub-Optimal

The official release PDF for the 576.02 driver update offers some clues about changes that may have contributed to the new issues. In section 5.5, NVIDIA acknowledges that GPU temperature can be reported incorrectly on NVIDIA Optimus systems, specifically showing zero degrees when no applications are running.

Section 5.5 of the official 576.02 update notes addresses temperature-monitoring issues that seem to have affected a wider number of systems than the Optimus system. Source: https://us.download.nvidia.com/Windows/576.02/576.02-win11-win10-release-notes.pdf


The release notes state:

5.5 GPU Temperature Reported Incorrectly on Optimus Systems

5.5.1 Issue

On Optimus systems, temperature-reporting tools such as Speccy or GPU-Z report that the NVIDIA GPU temperature is zero when no applications are running.

5.5.2 Explanation

On Optimus systems, when the NVIDIA GPU is not being used then it is put into a low-power state. This causes temperature-reporting tools to return incorrect values. Waking up the GPU to query the temperature would result in meaningless measurements because the GPU temperature change[s] as a result.

These tools will report accurate temperatures only when the GPU is awake and running.

NVIDIA Optimus is a GPU switching technology that toggles between integrated and discrete graphics based on application demands, in order to automatically balance performance against power consumption and conserve battery life. For tasks such as gaming or HD video playback, Optimus activates the discrete GPU for better performance; during lighter activities such as web browsing, it reverts to integrated (onboard) graphics.

The update appears to have extended a behavior previously limited to Optimus systems, allowing the affected GPU to enter a low-power state while idle, even when not hosted on an Optimus system, in turn disrupting temperature reporting in third-party tools.

Risk Adjustment

In most scenarios, it’s fair to say that the graphics card’s VBIOS would likely have prevented permanent GPU damage. VBIOS enforces thermal and power limits at the firmware level, independently of the driver.

Therefore, even if a driver were to cause improper fan behavior or misreport temperatures, the VBIOS should still throttle performance, ramp up fan activity, or else shut down the GPU to prevent hardware failure.

That doesn’t mean the risk was trivial: sustained high temperatures can degrade performance over time or stress adjacent components. Furthermore, absent a common understanding that an updated driver caused the problem (not least in systems where drivers update ‘silently’), an issue of this nature could mislead a large proportion of affected users, who may attempt remedies for non-existent problems, and even potentially damage their systems by applying irrelevant ‘fixes’.

The errant behavior caused by update 576.02 was particularly alarming for those engaged in artificial intelligence workflows, where high-performance hardware is routinely pushed to its thermal limits for extended periods.

The problematic 576.02 driver inspired a broader rash of complaints after its release in mid-April, despite initial reports that it offered some useful performance improvements. Notwithstanding the availability of the hotfix, and the extent of disruption that 576.02 seems to have caused, at the time of writing it remains available for download* at NVIDIA’s website.

Afterglow

In terms of the fallout from the faulty update, numerous types of damage and/or inconvenience have been reported: user Frankie_T9000 reported that his GPU crashed on boot due to heat buildup under the faulty update, and only stabilized after undervolting. He commented ‘looks like its not permanently harmed but need to repaste asap (I have pads coming wednesday) suspect the old thermal paste was aged more by the heat buildup so im putting new paste pads.’

Yesterday another user in the same thread stated: ‘Im using a custom fan curve wit msi afterburner, and it kept showing that my gpu temps were constantly at 27°C, so the fans didn't turn on, which led to overheating issues. I thought it was a me issue but after installing the previous driver it all worked out fine again. Also, the temps arent displayed correctly in taskmanager.’
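The failure mode in that last report follows directly from how a software fan curve works: the fan duty is computed from the reported temperature, so a reading stuck below the curve’s first point keeps the fans off no matter how hot the GPU actually is. The Python sketch below illustrates this with piecewise-linear interpolation; the curve points in `CURVE` are hypothetical and are not taken from MSI Afterburner or from the user’s configuration.

```python
from bisect import bisect_right

# Hypothetical fan curve: (temperature in °C, fan duty in %).
CURVE = [(40, 0), (60, 40), (75, 70), (85, 100)]


def fan_duty(temp_c, curve=CURVE):
    """Return the fan duty (%) for a reported temperature.

    Linear interpolation between curve points, clamped at both ends.
    """
    temps = [t for t, _ in curve]
    if temp_c <= temps[0]:
        return float(curve[0][1])
    if temp_c >= temps[-1]:
        return float(curve[-1][1])
    i = bisect_right(temps, temp_c)  # index of first point above temp_c
    (t0, d0), (t1, d1) = curve[i - 1], curve[i]
    return d0 + (d1 - d0) * (temp_c - t0) / (t1 - t0)


# The curve only ever sees the reported value, not the real one.
reported_c, actual_c = 27, 80
print(fan_duty(reported_c))  # duty the fans actually receive
print(fan_duty(actual_c))    # duty they would receive with a correct reading
```

With these illustrative points, the frozen 27°C reading yields a duty of 0%, whereas a correct 80°C reading would have yielded 85%.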

Though NVIDIA (as it states consistently in each hotfix release) generally provides hotfixes for particular video games or platforms, the risk of heat damage to or around a GPU is higher for AI practitioners than for gamers, since intensive machine learning processes such as training or sustained inference place a GPU under consistent long-term load. A game, by contrast, is likely to trigger such load only periodically: it may ‘spike’ into high utilization for a boss battle or a particularly demanding map section, but is otherwise designed as a compromise between GPU exploitation and system stability.

 

* Archive: https://archive.ph/ylVR1

First published Tuesday, April 22, 2025
