Errors resuming after power outage
| Author | Message |
|---|---|
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
My computer recently restarted unexpectedly. It may have been a brief power outage, though I am not 100% sure. When it restarted and BOINC tried to load tasks, problems occurred with the GPUGrid tasks. Loading each task resulted in a TDR and then a task failure, for all 6 of my in-progress tasks.

They all resulted in:
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: -52 (0xffffffffffffffcc) Unknown error number
Validate state: Invalid

And they all had the following at the bottom of their stderr.txt:
SWAN : FATAL Unable to load module .mshake_kernel.cu. (702)

Can anything be done so that GPUGrid GPU tasks can be restarted and resumed in this scenario?

e13s16_e1s33f90-NOELIA_1mgx1-2-4-RND0021_0 http://www.gpugrid.net/result.php?resultid=13982550
e26s10_e20s4f232-SDOERR_villinpub2-0-1-RND0381_3 http://www.gpugrid.net/result.php?resultid=13983070
e15s46_e1s400f24-NOELIA_1mgx2-1-4-RND5323_0 http://www.gpugrid.net/result.php?resultid=13983199
2Mgx471-NOELIA_INSP-11-12-RND1315_0 http://www.gpugrid.net/result.php?resultid=13983283
e12s13_e4s36f65-NOELIA_1mgx1-3-4-RND7924_0 http://www.gpugrid.net/result.php?resultid=13983393
e15s46_e1s386f84-NOELIA_1mgx1-1-4-RND3709_0 http://www.gpugrid.net/result.php?resultid=13983801

Note: On this computer, I load my 3 GPUs with 2 tasks per GPU. |
|
Joined: 5 Dec 12 Posts: 84 Credit: 1,663,883,415 RAC: 0
|
Seconded. My neighborhood is a little, uh, neglected. During warm weather, the AC drain becomes too much on the system and the whole block shuts off. I just lost about six hours of crunching yesterday due to periodic power outages. Would hate for this to be a regular issue all summer. |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
MJH: Any chance you might look at this problem? |
|
Joined: 25 Mar 12 Posts: 103 Credit: 14,948,929,771 RAC: 13
|
I have two validation errors after an abrupt power-off of a host. The WUs resumed from checkpoints and completed, but ended in validation errors (two-GPU host). It was my fault: I unplugged it unintentionally while tinkering around. How dumb! Two of those 255 Kpoints ones, it hurts! Just reporting. |
|
Joined: 28 Mar 09 Posts: 490 Credit: 11,731,645,728 RAC: 51
|
I had the same problem 2 days ago. The WUs either crash immediately, or they continue normally and then you get the validation error when they upload. Crashing immediately is not a new problem, but the validation error is. |
|
Joined: 21 Feb 10 Posts: 16 Credit: 841,395,284 RAC: 0
|
I just had a WU that gave a Computation error after 23 hours of crunching because of a power failure. It happens to me every now and then, especially during rainy seasons when thunder causes the power in my house to trip. Over the years, I've probably lost 30-40 half-completed WUs this way. It is unfortunate that GPUGrid doesn't resume from the last checkpoint and instead errors out, so everything is lost. All the other projects I run, like P95 or WGC, simply resume from the last saved checkpoint after a power failure. Is this something that the developers can improve on? |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Duane, somewhere in the PC settings you see "disc caching" - this should be unchecked. |
|
Joined: 21 Feb 10 Posts: 16 Credit: 841,395,284 RAC: 0
|
Duane, thanks for the suggestion. I checked in Device Manager under the drive > Policies and found that the Write Caching box is already unchecked, yet I still lost the WU after the power outage. But I don't think data corruption of the checkpoint is the issue. This is the report I see for the WU:

SWAN : FATAL Unable to load module .mshake_kernel.cu. (719)

It seems that after the power failure and reboot it hits some kernel error? It is the exact same problem that the starter of this thread reported. Yet the next WU in the queue starts crunching fine after that.

<core_client_version>7.6.33</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -52 (0xffffffcc)
</message>
<stderr_txt>
# GPU [GeForce GTX 960] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 960
# ECC : Disabled
# Global mem : 2048MB
# Capability : 5.2
# PCI ID : 0000:28:00.0
# Device clock : 1291MHz
# Memory clock : 3600MHz
# Memory width : 128bit
# Driver version : r381_64 : 38165
# GPU 0 : 59C
# GPU 0 : 60C
# GPU [GeForce GTX 960] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 960
# ECC : Disabled
# Global mem : 2048MB
# Capability : 5.2
# PCI ID : 0000:28:00.0
# Device clock : 1291MHz
# Memory clock : 3600MHz
# Memory width : 128bit
# Driver version : r381_64 : 38165
# GPU 0 : 57C
Can't acquire lockfile - exiting
No heartbeat from core client for 30 sec - exiting
# GPU [GeForce GTX 960] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 960
# ECC : Disabled
# Global mem : 2048MB
# Capability : 5.2
# PCI ID : 0000:28:00.0
# Device clock : 1291MHz
# Memory clock : 3600MHz
# Memory width : 128bit
# Driver version : r381_64 : 38165
# GPU 0 : 58C
# GPU [GeForce GTX 960] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 960
# ECC : Disabled
# Global mem : 2048MB
# Capability : 5.2
# PCI ID : 0000:28:00.0
# Device clock : 1291MHz
# Memory clock : 3600MHz
# Memory width : 128bit
# Driver version : r381_64 : 38165
# GPU 0 : 57C
# GPU 0 : 58C
# GPU 0 : 59C
# GPU 0 : 60C
# GPU 0 : 61C
# GPU [GeForce GTX 960] Platform [Windows] Rev [3212] VERSION [80]
# SWAN Device 0 :
# Name : GeForce GTX 960
# ECC : Disabled
# Global mem : 2048MB
# Capability : 5.2
# PCI ID : 0000:28:00.0
# Device clock : 1291MHz
# Memory clock : 3600MHz
# Memory width : 128bit
# Driver version : r381_64 : 38165
SWAN : FATAL Unable to load module .mshake_kernel.cu. (719)
</stderr_txt>
]]> |
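For anyone comparing their own failed results against the ones in this thread, here is a minimal Python sketch for reading the two numbers that keep showing up. It shows that the 64-bit form `0xffffffffffffffcc` and the 32-bit form `0xffffffcc` are the same two's-complement value, -52, and it pulls the number in parentheses out of the `SWAN : FATAL` line. The helper names `to_signed` and `swan_error` are made up for this illustration. The lookup table rests on an assumption that nobody in the thread has confirmed: that SWAN prints a CUDA driver API result code there. If so, 702 and 719 correspond to the `cuda.h` values `CUDA_ERROR_LAUNCH_TIMEOUT` and `CUDA_ERROR_LAUNCH_FAILED`.

```python
import re

# Assumption: the number SWAN prints in parentheses is a CUDA driver API
# result code (CUresult). These two names are the cuda.h entries for
# 702 and 719; the thread itself does not confirm this mapping.
CUDA_RESULT_NAMES = {
    702: "CUDA_ERROR_LAUNCH_TIMEOUT",
    719: "CUDA_ERROR_LAUNCH_FAILED",
}

# Matches lines such as:
#   SWAN : FATAL Unable to load module .mshake_kernel.cu. (719)
SWAN_FATAL = re.compile(r"SWAN : FATAL (?P<what>.*?)\s*\((?P<code>\d+)\)")


def to_signed(hex_text):
    """Read a hex literal as a two's-complement integer.

    The width is taken from the number of hex digits
    (8 digits = 32 bits, 16 digits = 64 bits).
    """
    digits = hex_text[2:] if hex_text.lower().startswith("0x") else hex_text
    bits = 4 * len(digits)
    value = int(digits, 16)
    if value >= 1 << (bits - 1):  # top bit set, so the value is negative
        value -= 1 << bits
    return value


def swan_error(stderr_text):
    """Return (message, code, name) for the last SWAN FATAL line, or None."""
    matches = list(SWAN_FATAL.finditer(stderr_text))
    if not matches:
        return None
    last = matches[-1]
    code = int(last.group("code"))
    return last.group("what"), code, CUDA_RESULT_NAMES.get(code, "unknown")


if __name__ == "__main__":
    print(to_signed("0xffffffffffffffcc"))  # exit status from the first post
    print(to_signed("0xffffffcc"))          # exit code from the last post
    print(swan_error("SWAN : FATAL Unable to load module .mshake_kernel.cu. (702)"))
```

If that reading of the codes is right, the 702 in the first post is a launch timeout, which would fit the TDR reported there, while the 719 in the last post is a more general launch failure.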
©2025 Universitat Pompeu Fabra