It’s summer, heat is real, IT is having a hard time.
I’ve heard many of us lost some hard disks (DO BACKUPS), but this post is about graphics cards.
As you may already know, I’ve got a pretty old desktop setup, but ecology is also about not replacing every 6 months tools that still work.
Let’s debrief the past weeks of confusion about a very weird issue that occurred more and more often, but not systematically. See below.
Symptoms were rather simple : sometimes, the screen just went black and the PC hard rebooted.
Rarely, some graphical glitches showed up just before that.
They started as weird artifacts, then multiplied until the screen was unreadable, and at that very moment, it went dark.
I quickly understood that it happened mostly when the PC was working hard, typically when a LoL game was loading (enjoy uncapped FPS on splash-screens in 2019), or when loading a bloated Web application.
As you surely guessed, that cost me MANY LP (League Points) in ranked lately, so I demoted hard. But that’s a detail.
As one of the hard disks (RAID) was making a slightly weird noise, I suspected it at first. So I installed Smartmontools and ran some quick checks, at least to verify the S.M.A.R.T. data.
Everything looked OK, so I moved on to memory.
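For the record, here is a minimal sketch of how that kind of quick health check could be scripted. It assumes smartctl (from Smartmontools) is on the PATH and that the disk is an ATA one, for which `smartctl -H` prints an "overall-health self-assessment test result" line. The function names and the device path are mine, for illustration only.

```python
import subprocess


def smart_health_passed(smartctl_output: str) -> bool:
    """Return True if the smartctl -H text reports a passing overall health."""
    for line in smartctl_output.splitlines():
        # ATA disks report e.g.:
        # "SMART overall-health self-assessment test result: PASSED"
        if "overall-health self-assessment test result" in line:
            return line.strip().endswith("PASSED")
    # No health line found: treat it as "not confirmed healthy".
    return False


def check_disk(device: str) -> bool:
    """Run `smartctl -H` on a device (hypothetical path such as /dev/sda)."""
    # smartctl encodes status bits in its exit code, so a non-zero code
    # is not treated as an exception here (no check=True).
    result = subprocess.run(
        ["smartctl", "-H", device], capture_output=True, text=True
    )
    return smart_health_passed(result.stdout)
```

A "PASSED" here only means the disk does not consider itself failing; it does not rule out everything, so it is a first filter and nothing more.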
The unpredictable and hard-to-reproduce nature of the trouble made me suspect the memory sticks.
For this, I used the incredible Memtest tool, but some false positives cost me hours of extensive testing.
As their latest version is compatible with UEFI only, I had to fall back on their previous “stable” one, which is pretty buggy (why on Earth would someone remove regular legacy BIOS support ??).
So, once the 10th test had been manually removed from the test suite, I ended up concluding that it was not a memory-related issue. That was not really good news, as I didn’t have any more leads…
Are you mad ?
Well, it may look like it, but I’m not.
Windows (for once) helped me a lot on this : after each crash/reboot, its “Find and repair errors” tool indicated what had happened, and it was… BSODs !
Whut ? Didn’t you see this ? It’s pretty explicit…
Actually, the blue screens were never displayed by the system, maybe ’cause the graphics card had been at fault since the beginning ?
The Windows reporting tool indicated two kinds of BSOD reasons :
BCCode : 0x116 [VIDEO_TDR_FAILURE]
BCCode : 0x0C5 [DRIVER_CORRUPTED_EXPOOL] (If I remember correctly)
As 0x116 stood out, I decided to pursue the investigation on the graphics card.
I landed on this page, and I have to admit, the new Micro$oft documentation is pretty clear and straightforward (when their website is not down, though).
So it looks like the GPU couldn’t do its job in time for whatever reason (Disclaimer : I’m not a hardware expert at all), and the kernel ended up raising those fatal errors.
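If you want to translate such "BCCode" lines into something readable, a tiny lookup is enough. This is only a sketch: the table contains just the two codes from my reports, with the names used in Microsoft's bug check reference (0x116 is listed there as VIDEO_TDR_FAILURE), and the function name is made up.

```python
import re

# Only the two bug checks seen in my reports; extend as needed.
BUG_CHECKS = {
    0x116: "VIDEO_TDR_FAILURE",
    0xC5: "DRIVER_CORRUPTED_EXPOOL",
}


def describe_bccode(line: str) -> str:
    """Turn a 'BCCode : 0x116' style line into 'code name'."""
    match = re.search(r"BCCode\s*:\s*((?:0x)?[0-9a-fA-F]+)", line)
    if match is None:
        raise ValueError(f"no BCCode found in: {line!r}")
    # int() with base 16 accepts an optional '0x' prefix.
    code = int(match.group(1), 16)
    name = BUG_CHECKS.get(code, "UNKNOWN")
    return f"0x{code:X} {name}"


print(describe_bccode("BCCode : 0x116"))
print(describe_bccode("BCCode : 0x0C5"))
```

Leading zeros do not matter, so "0x0C5" and "0xC5" resolve to the same entry.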
Not thanks to AMD
This hardware is actually driven by an AMD software called AMD Settings on your disk, which replaces the old ATI CCC (Catalyst Control Center) from back in the day.
Actually, what a coincidence, a brand new one (Adrenalin) has been available for some weeks now !
The client is pretty cool, the information is relatively clear, and, once you know that advanced hardware tweaks are under Gaming > Global Settings > Global OverDrive, you’re good to go.
By the way : if you wanna disable the Enhanced V-Sync feature that may break your in-game experience, it’s under Gaming > Global Settings > Wait for Vertical Refresh (pretty hidden option, in my honest opinion).
Looking at those graphs, I quickly got it : the GPU was not cooling as it should, and the temperature could easily reach ~70°C in game.
At this point, I wondered whether the lack of cooling could be the source of the issues, and… it definitely looks like it.
So my final workaround was to lower the GPU and GPU memory clock frequencies back to 1000MHz, as well as to unlock the fan speed control, allowing me to raise it drastically (it has been manually turned up to ~70%).
Still, how the hell is a graphics card that is sold as over-clockable already over-clocked by default ?
“Power to the users !”
Something else I discovered recently : when your AMD driver crashes (yeah, that happens too…), you may encounter at reboot something like “Default settings have been restored […] due to […] WattMan […] crash”.
This also means that your previous hardware settings have been reset.
So… I’d advise you to export your settings to JSON (Preferences > Export Settings...), so that you can import them back right when needed, before running a resource-hungry program.
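Since the export is plain JSON, you could also compare two exports to see whether a reset sneaked in. Below is a rough sketch that works on any JSON document. I have not studied the actual layout of the exported file, so the keys used in the example (`gpu`, `clock_mhz`, `fan_percent`) and the values are purely hypothetical.

```python
import json


def flatten(value, prefix=""):
    """Flatten nested dicts/lists into {"a.b[0]": leaf_value} pairs."""
    items = {}
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            items.update(flatten(child, path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            items.update(flatten(child, f"{prefix}[{index}]"))
    else:
        items[prefix] = value
    return items


def changed_settings(exported, current):
    """Return {path: (old, new)} for every leaf that differs."""
    before, after = flatten(exported), flatten(current)
    return {
        path: (before.get(path), after.get(path))
        for path in sorted(before.keys() | after.keys())
        if before.get(path) != after.get(path)
    }


def load_settings(path):
    """Load one exported settings file (the path is up to you)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical content, NOT the real export schema.
saved = {"gpu": {"clock_mhz": 1000, "fan_percent": 70}}
after_crash = {"gpu": {"clock_mhz": 1050, "fan_percent": 70}}
print(changed_settings(saved, after_crash))
```

An empty result means both exports hold the same values, so nothing was reset between them.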
Addendum (for a little laugh)
Before getting to the conclusion, one question for you : do you know any software in 2019 that, once its self-update is finished, opens your browser to load a specific (metrics) page ?
https://subscriptions.amd.com/driverinstalled/index.html?pt=radeon&VID1=XXXX&DID1=YYYY&PID1=AMD Radeon Graphics Processor&SSVID1=ZZZZ&SSID1=TTTT&os=YOUR_OS&osbit=64&cpu=YOUR_CPU_MODEL_PARTLY_URL_ENCODED
That’s a pretty strange way of performing analytics, isn’t it ?
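To see at a glance what such a URL carries, you can split its query string with the Python standard library. This is just a sketch using the anonymized URL above (placeholders included); the function name is mine.

```python
from urllib.parse import parse_qs, urlsplit

# Anonymized URL from above, placeholders kept as-is.
URL = (
    "https://subscriptions.amd.com/driverinstalled/index.html"
    "?pt=radeon&VID1=XXXX&DID1=YYYY&PID1=AMD Radeon Graphics Processor"
    "&SSVID1=ZZZZ&SSID1=TTTT&os=YOUR_OS&osbit=64"
    "&cpu=YOUR_CPU_MODEL_PARTLY_URL_ENCODED"
)


def reported_fields(url: str) -> dict:
    """Return the query parameters of a URL as a flat {name: value} dict."""
    query = urlsplit(url).query
    # parse_qs returns lists of values; each name appears once here.
    return {name: values[0] for name, values in parse_qs(query).items()}


for name, value in reported_fields(URL).items():
    print(f"{name} = {value}")
```

That makes nine fields about your hardware and OS handed over through a simple page load.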
TL;DR : Don’t trust AMD’s Radeon Software, which is apparently unable to determine the GPU fan speed required for proper cooling.
I’m looking forward to reading any feedback that you would like to share with the world.