Monitor sometimes becomes black while crunching GPUGRID
| Author | Message |
|---|---|
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
On the same machine with 2 GTX 980 Ti cards, on which I have been crunching GPUGRID for 2 1/2 years on Windows XP, I recently installed Windows 10. For 3 days I crunched LHC tasks, which don't use the GPU. Today I started crunching GPUGRID, and every few hours the monitor suddenly goes black. From the look of it, GPUGRID crunching stops at that moment (the PC emanates less heat), but there is still some warm air coming out of the side, so I guess that the LHC tasks carry on (BTW, while crunching only LHC over the past days, this problem never happened). All I can do is push the off button and reboot. The Windows event log under System shows the warning "the graphic driver nvlddmkm does no longer react and was restored". This entry shows up between 50 and 60 times within about 4 minutes, around the time the monitor went black (probably at some point the driver could no longer be restored, or whatever). Under "Details" it shows event ID 4101, and under event data "nvlddmkm". Does anyone know about this problem and could give me advice on how to solve it? |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
Driver crashes are a common problem when a GPU is overheated or overclocked. Check ventilation and dust bunnies: if overclocked, knock it back a couple of notches. |
|
Joined: 20 Apr 15 Posts: 285 Credit: 1,102,216,607 RAC: 0
|
You can install TThrottle and record the GPU temps in order to find out. After a crash, you can reboot and check the temperature graphs of the last 24 hours. In TThrottle you may also set a maximum temperature at which the PC shuts down automatically before the GPU gets damaged; I normally set it to 85°C. https://efmer.com/download-tthrottle/ In addition to that, MSI Afterburner keeps my GPU temps constantly below 70°C. Both measures together have saved my GPUs a couple of times. If the GPU temperature is too high, you could try renewing the thermal grease. If that is not successful, there may be a broken heat pipe. But if the GPU temperatures are OK, that unfortunately does not mean anything, as the temperatures of the memory chips and voltage regulators don't show up in the graphs. There simply are no sensors there. A thermogram of the board could then reveal the cause of the error. https://www.guru3d.com/articles-pages/msi-geforce-gtx-970-gaming-review,9.html I would love to see HCF1 protein folding and interaction simulations to help my little boy... someday. |
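For anyone who prefers a script to TThrottle, the log-and-limit idea from the post above can be sketched in Python. This is only a sketch under assumptions: it relies on `nvidia-smi` being on the PATH, the function names are made up for illustration, the 30-second polling interval is arbitrary, and it reads the GPU core temperature only, so it has the same blind spot for memory chips and voltage regulators. It stops with a warning at the limit instead of shutting the PC down.

```python
import subprocess
import time

SHUTDOWN_LIMIT_C = 85  # the limit quoted in the post; adjust to taste


def parse_temps(csv_text):
    """Parse 'index, temperature' lines into {gpu_index: temp_c}."""
    temps = {}
    for line in csv_text.strip().splitlines():
        index, temp = (field.strip() for field in line.split(","))
        temps[int(index)] = int(temp)
    return temps


def read_temps():
    """Ask nvidia-smi for the core temperature of every GPU."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_temps(result.stdout)


def over_limit(temps, limit_c=SHUTDOWN_LIMIT_C):
    """Return the indices of GPUs at or above the limit."""
    return sorted(i for i, t in temps.items() if t >= limit_c)


if __name__ == "__main__":
    # Hypothetical polling loop: one log line every 30 seconds.
    while True:
        temps = read_temps()
        print(time.strftime("%H:%M:%S"), temps, flush=True)
        hot = over_limit(temps)
        if hot:
            print("GPU(s) at or above limit:", hot)
            break
        time.sleep(30)
```

Redirecting the output to a file gives a temperature log that survives a crash and can be read after the reboot.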
|
Joined: 10 Nov 13 Posts: 101 Credit: 15,773,211,122 RAC: 0
|
If you don't find this problem is heat related it could be that you are not using a good driver. You mentioned that you installed Windows 10. That implies that you did a clean or new install rather than an upgrade. You did not state that you installed the current video drivers directly from Nvidia. If you didn't do so I would suggest downloading those first and then doing a clean install of the video drivers. Use DDU (Display Driver Uninstaller) first. It will also set Windows 10 so it won't install the Win 10 default drivers. Then install the current Nvidia drivers. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
"Driver crashes are a common problem when a GPU is overheated or overclocked." Heat and/or overclocking should not be the problem. As before in Windows XP, the GPU temp is around 61/62 °C, with the clock around the default value. I would rather guess that it has to do with what is described here: "If you don't find this problem is heat related it could be that you are not using a good driver. You mentioned that you installed Windows 10. That implies that you did a clean or new install rather than an upgrade." I used the driver that originally came with the new install of Windows 10; it was version 388.. The driver I have now downloaded from NVIDIA is 398.36. Installation worked without problems, so I restarted BOINC / GPUGRID and will see what happens (even the tasks which I interrupted for the new driver installation continued normally). |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
"... The driver I now downloaded from NVIDIA is 398.36. Installation worked without problems, so I restarted BOINC / GPUGRID and will see what happens (even the tasks which I interrupted for the new driver installation were continued normally)" The problem still exists, despite the new driver :-((( Again, I looked at the Windows event log (System), and it shows the warning cited above ("the graphic driver nvlddmkm does no longer react and was restored") many times from 10:19 a.m. on, at exactly 4-second intervals, until 10:52, the time I pushed the "off" button. Does anyone have any idea what the reason could be? What can I do to get GPUGRID to work properly? |
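The timing pattern described above (one warning every 4 seconds for a stretch of time) can be checked with a few lines of Python once the timestamps of the event ID 4101 entries have been exported from the event log. A minimal sketch, assuming the timestamps are already available as `datetime` objects; the function names, the 10-second gap threshold and the sample data are all hypothetical.

```python
from datetime import datetime, timedelta


def find_bursts(timestamps, max_gap=timedelta(seconds=10)):
    """Group event times into bursts; a gap larger than max_gap starts a new burst."""
    bursts = []
    for ts in sorted(timestamps):
        if bursts and ts - bursts[-1][-1] <= max_gap:
            bursts[-1].append(ts)
        else:
            bursts.append([ts])
    return bursts


def summarize(burst):
    """Return (first, last, count, typical gap in seconds) for one burst.

    The typical gap is the upper median of the gaps, or None for a single event.
    """
    gaps = [(b - a).total_seconds() for a, b in zip(burst, burst[1:])]
    typical = sorted(gaps)[len(gaps) // 2] if gaps else None
    return burst[0], burst[-1], len(burst), typical


if __name__ == "__main__":
    # Made-up sample: 60 warnings, 4 seconds apart, on an arbitrary date.
    start = datetime(2018, 7, 1, 10, 19, 0)
    sample = [start + timedelta(seconds=4 * i) for i in range(60)]
    for burst in find_bursts(sample):
        first, last, count, gap = summarize(burst)
        print(f"{first:%H:%M:%S} to {last:%H:%M:%S}: {count} events, typical gap {gap} s")
```

A single long burst with a constant gap would suggest the driver is being reset over and over on a timer, rather than failing at random moments.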
|
Joined: 20 Apr 15 Posts: 285 Credit: 1,102,216,607 RAC: 0
|
As I wrote: "in case the GPU temperatures are OK, that unfortunately does not mean anything as the memory chips or voltage regulators temps don't show up". I would take the 980 Ti GPUs out one after another to see if the problem is related to a particular card. If the system works properly with only one GPU installed, then you know. You may also want to move the 980 Tis into another PC to see if the error moves along with one card or the other. |
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
"The problem still exists, despite of the new driver :-(((" In Windows, the only proper way to install a new driver is to first uninstall the old driver. And it has to be a clean uninstall using Display Driver Uninstaller (DDU), to get rid of all the traces of the old one (choose the option to reboot into Safe Mode). https://www.wagnardsoft.com/forums/viewtopic.php?f=5&t=1174&sid=38069867de013db1e7c3bd469b98c82a You might think that Nvidia would do that themselves, but they don't. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
"In Windows, the only proper way to install a new driver is to first uninstall the old driver. And it has to be a clean uninstall using Display Driver Uninstaller (DDU), to get rid of all the traces of the old one (choose the option to reboot into Safe Mode)." That's exactly what I did, anyway. I am now trying various methods to "delimit" the problem. Right now, I am crunching SETI@home tasks, so I'll see whether the problem also occurs there. If so, then I might install Folding@Home, which works with OpenCL (in contrast to GPUGRID and SETI, both of which work with CUDA). If the problem persists in both of the above cases, then I will revert to Windows XP (on the same machine, with dual boot), which I have used for the past 2 1/2 years, and see if the problem also occurs there. In case it does, then I am afraid that JoergF may be right in assuming that there may be some kind of hardware failure :-((( |
|
Joined: 18 Jun 12 Posts: 297 Credit: 3,572,627,986 RAC: 0
|
I had the same problem, it was a power saving issue with the BIOS and Windows 10 |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
"I had the same problem, it was a power saving issue with the BIOS and Windows 10" How did you resolve it? |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
"I had the same problem, it was a power saving issue with the BIOS and Windows 10" The strange thing is, though, that the problem did NOT occur within the first 2-3 days after the installation of Windows 10, but only after I started GPUGRID crunching (before that, I was only crunching LHC tasks on the CPU). Still, your reply to Zoltan's question "How did you resolve it?" would be very interesting. |
|
Joined: 8 May 18 Posts: 190 Credit: 104,426,808 RAC: 0
|
On my Windows 10 PC SETI@home uses opencl_nvidia_SoG. On GPUGRID the app uses cuda80 and all tasks fail. On my Linux box SETI@home uses opencl_nvidia_sah. Einstein@home uses FGRPopenclK-nvidia. GPUGRID used cuda80 and it worked on Linux. GPU boards are GTX 1050 Ti on Windows and GTX 750 Ti on Linux. Tullio |
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
"the strange thing is, though, that the problem did NOT occur within the first 2-3 days after the installation of Windows 10; but only after I started GPUGRID crunching (before, I was only crunching LHC tasks via the CPU)." Windows was always very erratic for me when I was running it with a screensaver, for example. Have you disabled all screen savers, sleep and power-down features? Also, in the BIOS, I would disable the various power control modes. They are known to be problematic. But I like JoergF's idea of removing the cards one at a time. I think you could just unplug the PCIe power cables one at a time to do a simple test. |
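The Windows side of the "disable sleep and power-down features" suggestion can be scripted with the built-in `powercfg` tool. A minimal sketch, assuming a Windows machine on AC power; the function names and the `--apply` switch are made up for illustration, and by default the script only prints the commands. It does not touch screen savers or BIOS power settings, which still have to be changed by hand.

```python
import subprocess
import sys

# AC-power timeouts to disable; for `powercfg /change`, 0 means "never".
TIMEOUT_SETTINGS = (
    "monitor-timeout-ac",
    "standby-timeout-ac",
    "hibernate-timeout-ac",
)


def powercfg_commands(settings=TIMEOUT_SETTINGS):
    """Build one `powercfg /change <setting> 0` command per timeout."""
    return [["powercfg", "/change", name, "0"] for name in settings]


def apply_commands(commands, dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical switch: pass --apply to actually change the settings.
    apply_commands(powercfg_commands(), dry_run="--apply" not in sys.argv)
```

Running it without arguments shows what would be changed, so the commands can be reviewed first.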
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
"I think you could just unplug the PCIe power cables one at a time to do a simple test." This is a very bad idea. |
|
Joined: 20 Apr 15 Posts: 285 Credit: 1,102,216,607 RAC: 0
|
"I think you could just unplug the PCIe power cables one at a time to do a simple test." "This is a very bad idea." I agree. For a test, you should remove the card completely, because if you leave one in the mainboard slot without the additional 6/8-pin supply, it will not be powered properly and (likely) the PC will not power up. If you are lucky, you'll hear the common BIOS beep codes and no more, but it could also result in further hardware damage, as some components, e.g. gates, will be operated in undefined states or even oscillate, resulting in local excess current. DON'T try that. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
"On my Windows 10 PC SETI@home uses opencl_nvidia_SoG. On GPUGRID the app uses cuda80 and all tasks fail." I just noticed that SETI has tasks with opencl_nvidia_SoG as well as with cuda42 and cuda50. (I also tried to test Einstein, but besides GPU tasks it downloads CPU tasks as well, which fill up all 12 of my CPU cores, which I don't want to happen. I am sure this can be controlled somehow, but I haven't found out how yet.) |
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
"For a test, you should remove the card completely, because if you leave one in the mainboard slot without the additional 6/8-pin supply, it will not be powered properly and (likely) the PC will not power up. If you are lucky, you'll hear the common BIOS beep codes and no more, but it could also result in further hardware damage, as some components, e.g. gates, will be operated in undefined states or even oscillate, resulting in local excess current. DON'T try that." My experience is that the card won't power up at all, and won't draw much power for anything. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
My understanding is that NVidia cards are designed to power up using the 75W available from the PCIe slot, detect that the additional power cables are unconnected, and refuse to move out of a protective low-power state. The card is thus in a different state from total removal. Possibly safe, but not very informative. |
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
I don't think that any of the signal inputs are left "floating", if that is the concern. They will all be tied to a supply voltage, and clamped in a known state. Nvidia would not leave that situation unprotected by any means. |