System crash when entering sleep while GPU task running
| Author | Message |
|---|---|
|
Joined: 14 Oct 11 Posts: 31 Credit: 81,420,504 RAC: 0
|
About once every week or two, when I put my PC to sleep at the end of the day, it instead hangs for a few minutes with one (of 2) monitors on and blank, and then shuts off. I believe this is Windows 10 writing a "bug check" (crash dump) due to the kernel having crashed, as seen in the Event Log.

After watching it for a while, I'm starting to suspect that this is related to the GPU tasks. These always come back in BOINC as "Error during computation" after the reboot. (And the last 3 or so have been PABLO_v3.) Here is one task that reported failure: http://www.gpugrid.net/result.php?resultid=20531108

This failed with the infamous: "The simulation has become unstable. Terminating to avoid lock-up (1)". Unfortunately I can't tell if this message was written before or after the reboot.

NOTES: no overclocking; last recorded temp seems reasonable at 67 °C. I'm not running a bleeding-edge driver (399.07), but I note this was also occurring with an older driver (391.35).

Now putting my developer's hat on (although I'm not a systems developer), my hunch is that some kernel memory corruption has occurred, and when Windows comes to checkpoint its processes it barfs.

This is also causing data corruption outside of BOINC. A number of recently written files end up "zero-padded", i.e. their content is replaced with 0x00 bytes. (Probably something to do with SSD / write caching.) I've had to manually repair my git repo a number of times now, for example. I'm starting to worry that other files might have been zapped that I just haven't noticed yet.

I realise it's hard to separate cause and effect from the symptoms. For example, could the crash be due to something else, which then zeros out some of GPUGrid's files, and that's why IT fails? From a small sample of less than a dozen such crashes, I can only say that it never occurred when I didn't run BOINC (i.e. before this winter).

If it would help, I can send the resulting crash dump to a developer (230 MB 7-Zipped). PM me if interested. |
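One way to check for the "zero-padded" files described above is to scan recently modified files for content that consists only of 0x00 bytes. The following is a minimal sketch, not a definitive tool: the function names, the 24-hour window, and the choice to flag only files that are entirely zero (rather than partially zeroed ones) are all assumptions made for illustration.

```python
import os
import time
from pathlib import Path

CHUNK = 1 << 20  # read 1 MiB at a time so large files are not loaded whole


def is_all_zero(path: Path) -> bool:
    """Return True if the file is non-empty and every byte is 0x00."""
    if path.stat().st_size == 0:
        return False  # an empty file is not "zero-padded"
    with path.open("rb") as fh:
        while chunk := fh.read(CHUNK):
            if chunk.count(0) != len(chunk):
                return False
    return True


def find_zeroed_files(root, max_age_hours=24.0):
    """Yield files under root, modified within max_age_hours, that are all zeros.

    The 24-hour default is an arbitrary illustrative choice.
    """
    cutoff = time.time() - max_age_hours * 3600
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = Path(dirpath) / name
            try:
                if path.stat().st_mtime >= cutoff and is_all_zero(path):
                    yield path
            except OSError:
                continue  # unreadable or vanished file; skip it
```

Running `list(find_zeroed_files("path/to/repo"))` after a crash would list candidates to inspect by hand. A file that was only partly overwritten with zeros would not be caught by this check.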
|
Joined: 26 Feb 14 Posts: 211 Credit: 4,496,324,562 RAC: 0
|
Quick question for you. Did you tell BOINC to stop processing when you exit, or did you just exit BOINC and leave the task crunching? It sounds to me as if you are exiting BOINC but leaving the task crunching. Then, when you put the machine to sleep, the task is force-quit, which causes the kernel to crash, and you end up with the error while computing. Just a thought. Z
|
|
Joined: 14 Oct 11 Posts: 31 Credit: 81,420,504 RAC: 0
|
> Did you tell BOINC to stop processing when you exit or did you just exit BOINC and leave the task crunching? Neither, I leave both BOINC and the tasks running (so that my PCs can wake on a timer and pre-heat my office before I arrive). |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
> About once every week or two, when I put my PC to sleep at end of the day it instead hangs for a few minutes with one (of 2) monitors on and blank, and then shuts off. I believe this is Windows 10 writing a "bug check" (crash dump) due to the kernel having crashed, as seen in the Event Log.

GPUGrid tasks don't tolerate suspending. Even stopping a task could take a minute or so. I always exit BOINC with the science apps stopped as well, then watch the MSI Afterburner tray monitors (GPU usage and temperature) go down. After that I restart / turn off my PC. You should do the same to avoid such errors (until this bug gets fixed, i.e. forever).

When your PC is turned on (by a timer or by you) from a complete shutdown, BOINC will continue the tasks in it. (You don't need to suspend them with the OS.) If you have a password-protected user account, it won't log on automatically at startup unless you set Windows not to ask for a username and password at startup. You can specify which user account to log on at startup by the following method: press Windows key + R, type control userpasswords2 and press [Enter] or click [OK], then uncheck the checkbox, click [OK], type your username and password (twice), then click [OK]. If your PC is connected to a domain, you should specify your username as domain\username.

I also recommend turning off the "fast startup" option in power management. |
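The manual routine described above (exit BOINC, watch GPU usage fall, then power down) could in principle be scripted. The sketch below is only an illustration under several assumptions: that `boinccmd` and `nvidia-smi` are both on the PATH, that `boinccmd` can authenticate to the local client (for example by being run from the BOINC data directory), and that a reading at or below 5% for three consecutive polls counts as "idle". The function names and thresholds are hypothetical.

```python
import subprocess
import time


def gpu_utilization_percent():
    """Read the first GPU's utilization (%) from nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(out.splitlines()[0].strip())


def wait_until_idle(read_util=gpu_utilization_percent, threshold=5,
                    stable_reads=3, poll_seconds=2.0, timeout_seconds=180.0,
                    sleep=time.sleep):
    """Return True once utilization stays at or below threshold for
    stable_reads consecutive polls; return False if the timeout elapses."""
    calm = 0
    waited = 0.0
    while waited <= timeout_seconds:
        calm = calm + 1 if read_util() <= threshold else 0
        if calm >= stable_reads:
            return True
        sleep(poll_seconds)
        waited += poll_seconds
    return False


def quit_boinc_and_wait():
    """Ask the BOINC client to exit, then wait for the GPU to go idle."""
    subprocess.run(["boinccmd", "--quit"], check=True)
    return wait_until_idle()
```

Only when `quit_boinc_and_wait()` returns True would it be reasonable to go on to shut down or restart. The reader and sleep functions are injectable so the loop can be exercised without a GPU.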
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
> GPUGrid tasks don't tolerate suspending. Even stopping a task could take a minute or so.

This is exactly the procedure that I use and it works. Never shut down the PC until the GPUGrid app stops. Avoid sleep and hibernate. Also turn off write caching on the BOINC drive. If you take these steps your errors should disappear. Listen to Zoltan when he gives GPUGrid advice. :-) |
|
Joined: 2 Jul 16 Posts: 338 Credit: 7,987,341,558 RAC: 259
|
I 3rd that. Shutting down the OS with BOINC running can result in computation errors, especially for GPU tasks. |
|
Joined: 14 Oct 11 Posts: 31 Credit: 81,420,504 RAC: 0
|
Thanks for the tips, folks. Note that I'm really interested in sleeping rather than shutting off the computer. Another interesting data point is that my 2nd Win 7 PC running short tasks (GTX 660) has had no such problems this winter.

> GPUGrid tasks don't tolerate suspending. Even stopping a task could take a minute or so.

Right, that's good to know. I think my next step will be to try a middle-ground approach: I'll use "Snooze GPU" on the tasks and wait a few minutes for them to idle properly before putting my PC to sleep. I can watch for the GPU RAM dropping in Process Explorer to confirm. I'd be willing to bet (at least a few fillér) that this will improve it. I'm guessing the OS is too impatient when suspending the apps and doesn't let GPUGrid checkpoint safely. Worth a shot anyway.

> Also turn off write caching on the BOINC drive.

That would be safer, yes. However, it would hurt my code compile times, and even if I segregated BOINC to a separate SSD, I'm not sure the main drive would be saved from a hard kernel crash, so I think I'd need to do it on both of my SSDs. Fingers crossed suspending GPUGrid will work! I'll report back.

(FYI: I even went with an SSD specifically because they claimed its super-capacitors would protect data "in flight" during a sudden power outage. Has never bloody worked. I can say that with confidence, since some power company workmen repeatedly shut the power off to my building without notice a few weeks back, creating more git-repo repair jobs for me. Fun times.) |
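The "snooze, then wait for the app to exit" plan above could be sketched as follows. This is an illustration under stated assumptions, not a tested recipe: it assumes Windows (for `tasklist`), that `boinccmd` is on the PATH and can authenticate to the local client, that `boinccmd --set_gpu_mode never <seconds>` behaves like the Manager's "Snooze GPU", and that the GPUGrid executable name starts with `acemd` (based on the `acemd-922-80.exe` name reported later in this thread). All function names are hypothetical.

```python
import csv
import io
import subprocess
import time


def image_names(tasklist_csv):
    """Extract lower-cased image names from `tasklist /FO CSV /NH` output."""
    rows = csv.reader(io.StringIO(tasklist_csv))
    return {row[0].lower() for row in rows if row}


def list_processes():
    """Return the raw CSV process list (Windows only)."""
    return subprocess.run(["tasklist", "/FO", "CSV", "/NH"],
                          capture_output=True, text=True, check=True).stdout


def gpu_app_running(prefix="acemd", lister=list_processes):
    """True if any process image name starts with prefix (assumed app name)."""
    return any(name.startswith(prefix) for name in image_names(lister()))


def snooze_gpu_and_wait(snooze_seconds=3600, poll_seconds=5, max_polls=60):
    """Suspend GPU work for snooze_seconds, then poll until the app is gone.

    Returns True if the app exited within max_polls polls, else False.
    """
    subprocess.run(
        ["boinccmd", "--set_gpu_mode", "never", str(snooze_seconds)],
        check=True,
    )
    for _ in range(max_polls):
        if not gpu_app_running():
            return True
        time.sleep(poll_seconds)
    return False
```

Checking for the process is a proxy for watching GPU RAM in Process Explorer; it does not prove the task wrote a clean checkpoint.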
|
Joined: 14 Oct 11 Posts: 31 Credit: 81,420,504 RAC: 0
|
Clarification: I say 'checkpoint' in the sense of Windows recording the state of the processes, rather than the BOINC term for tasks saving their progress. I've just realised that could be ambiguous. (Hmm, now do processes need to be checkpointed for sleep, or is that just for hibernate, I wonder...) |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
> (FYI: you know I even went with a SSD specifically because they claimed its super-capacitors would protect data "in flight" during a sudden power outage. Has never bloody worked. I can say that with confidence, since some power company workmen repeatedly shut the power off to my building without notice a few weeks back. Creating more git-repo repair jobs for me. Fun times.)

That feature can't save the data in the OS's write cache. That's why disabling write caching is recommended, or a good UPS. (Is it a Samsung PRO SSD?) These SSDs have a relatively large DDR RAM cache, so you should try disabling write caching in the OS and check how much it hurts code compilation times. Perhaps you lose less time than the time spent repairing things. |
|
Joined: 14 Oct 11 Posts: 31 Credit: 81,420,504 RAC: 0
|
Yeah, had a UPS. It died, so I was hoping this SSD would suffice instead of more lead-acid batteries.

> That feature can't save the data from the write cache of the OS.

Indeed. I guess I'd hoped that the "safer" level of write caching would be good enough with the SSD's capacitors. But I suppose my hopes for this complex pipeline involving multiple vendors were too high. :)

> so you should give a try disabling write caching in the OS

Promising result! After disabling all write caching, a full Java compile goes from 48 s to 52 s. But then: I've got System/Data partitions on my Samsung EVO 960 (non-Pro, I think) and a small Intel SSD as a temp drive. I've already configured it so the bulk of writes during a compile go to the Intel, so I only need to disable write caching on the Samsung. Looks like this will protect my data better, keep my 48 s compile times, and hopefully still avoid a new UPS. Win win win. (I've got a similar amount of C++, which takes ~9 min to build, but I think that would turn out similarly.)

> Perhaps you lose less time than the time spent repairing things.

True. I guess my bigger concern is that as compile times increase I become more likely to lose focus, flick over and check the news, and if something catches my interest, "Poof!", 15 minutes are gone. :-D |
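For the compile-time figures quoted above (48 s with write caching, 52 s without), the trade-off against repair time can be put into numbers. This is a back-of-the-envelope sketch only: the 30-minute repair time is a hypothetical value chosen for illustration and does not come from the thread.

```python
def overhead(before_s, after_s):
    """Return (extra seconds per build, slowdown in percent)."""
    extra = after_s - before_s
    return extra, 100.0 * extra / before_s


def break_even_compiles(repair_minutes, extra_s):
    """Number of builds whose added time equals one repair session."""
    return repair_minutes * 60.0 / extra_s


extra, pct = overhead(48, 52)               # figures from the post above
print(f"extra per build: {extra} s ({pct:.1f}% slower)")

# HYPOTHETICAL: assume one corrupted-repo repair costs 30 minutes.
print(f"break-even: {break_even_compiles(30, extra):.0f} builds per repair")
```

With these inputs the slowdown is about 8.3%, and a single assumed 30-minute repair costs as much time as 450 slower builds.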
|
Joined: 14 Oct 11 Posts: 31 Credit: 81,420,504 RAC: 0
|
A quick update: I've since had two such hard shutdowns out of 8 attempts at sleeping. I've not observed any zeroed-out files, which is the main thing... so far so good with write caching disabled on the data drive. (I found one file corrupt, but Eclipse regenerated it clean before I could see what was wrong. This may have just been truncated though, which would be expected & fine. I think Eclipse auto-saves this every 5 minutes, so it could have just been bad luck with the timing.) Also one time GPUGrid resumed without computation error, which I think is a first. However today it still had a computation error so I guess the "Snooze GPU" plan must be flawed (which I find odd since the 'acemd-922-80.exe' application did completely exit before I put it to sleep last night). *shrug* This is not so important anyway. |
©2025 Universitat Pompeu Fabra