Unable to load module .mshake_kernel.cu. (702)
ritterm · Joined: 31 Jul 09 · Posts: 88 · Credit: 244,413,897 · RAC: 0
My normally rock solid GTX570 host kicked out a couple of compute errors on short tasks:

- glumetx5-NOELIA_SH2-13-50-RND5814 (others seemed to have problems with this one)
- prolysx8-NOELIA_SH2-13-50-RND2399_0

Both stderr outputs include:

    SWAN : FATAL Unable to load module .mshake_kernel.cu. (702)

Both occurrences resulted in a driver crash and system reboot.

Possibly related question/issue: are those GPU temps in the stderr output? Could that be part of the problem? I checked other successful tasks and have seen higher values than those in the recently crashed tasks.
ritterm · Joined: 31 Jul 09 · Posts: 88 · Credit: 244,413,897 · RAC: 0
My normally rock solid GTX570 host kicked out a couple of compute errors on short tasks...

...probably because the GPU suffered a partial failure. Since this happened, the host would run fine under little or no load, but as soon as the GPU got stressed running any BOINC tasks I threw at it, the machine would eventually crash and reboot. The fan was getting a little noisy and there were signs of some kind of oily liquid on the enclosure.

Fortunately, it was still under warranty and EVGA sent me a refurbished GTX 570 under RMA. A virtually painless process -- thanks, EVGA. Maybe I should wait until the replacement GPU runs a few tasks successfully, but everything looks good so far.
Joined: 5 Dec 12 · Posts: 84 · Credit: 1,663,883,415 · RAC: 0
Well, I'm getting it on long tasks too. This was from a brand new GTX 970 SSC from EVGA.

| Field | Value |
|---|---|
| Name | 20mgx1069-NOELIA_20MG2-14-50-RND0261_0 |
| Workunit | 10253503 |
| Created | 4 Nov 2014, 3:29:32 UTC |
| Sent | 4 Nov 2014, 4:29:42 UTC |
| Received | 4 Nov 2014, 14:45:46 UTC |
| Server state | Over |
| Outcome | Computation error |
| Client state | Compute error |
| Exit status | -52 (0xffffffffffffffcc) Unknown error number |
| Computer ID | 140554 |
| Report deadline | 9 Nov 2014, 4:29:42 UTC |
| Run time | 21,616.47 |
| CPU time | 4,273.91 |
| Validate state | Invalid |
| Credit | 0.00 |
| Application version | Long runs (8-12 hours on fastest card) v8.47 (cuda65) |

Stderr output:

    <core_client_version>7.2.42</core_client_version>
    <![CDATA[
    <message>
    (unknown error) - exit code -52 (0xffffffcc)
    </message>
    <stderr_txt>
    # GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]
    # SWAN Device 0 :
    # Name : GeForce GTX 970
    # ECC : Disabled
    # Global mem : 4095MB
    # Capability : 5.2
    # PCI ID : 0000:01:00.0
    # Device clock : 1342MHz
    # Memory clock : 3505MHz
    # Memory width : 256bit
    # Driver version : r344_32 : 34448
    # GPU 0 : 56C
    # GPU 0 : 57C
    # GPU 0 : 58C
    # GPU 0 : 59C
    # GPU 0 : 60C
    # BOINC suspending at user request (exit)
    # GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]
    # SWAN Device 0 :
    # Name : GeForce GTX 970
    # ECC : Disabled
    # Global mem : 4095MB
    # Capability : 5.2
    # PCI ID : 0000:01:00.0
    # Device clock : 1342MHz
    # Memory clock : 3505MHz
    # Memory width : 256bit
    # Driver version : r344_32 : 34448
    # GPU 0 : 46C
    # GPU 0 : 47C
    # GPU 0 : 50C
    # GPU 0 : 51C
    # GPU 0 : 54C
    # GPU 0 : 55C
    # GPU 0 : 56C
    # GPU 0 : 57C
    # BOINC suspending at user request (exit)
    # GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]
    # SWAN Device 0 :
    # Name : GeForce GTX 970
    # ECC : Disabled
    # Global mem : 4095MB
    # Capability : 5.2
    # PCI ID : 0000:01:00.0
    # Device clock : 1342MHz
    # Memory clock : 3505MHz
    # Memory width : 256bit
    # Driver version : r344_32 : 34448
    # GPU 0 : 45C
    # GPU 0 : 46C
    # GPU 0 : 48C
    # GPU 0 : 50C
    # GPU 0 : 52C
    # GPU 0 : 54C
    # GPU 0 : 55C
    # GPU 0 : 56C
    # GPU 0 : 57C
    # GPU 0 : 58C
    # GPU 0 : 59C
    # GPU 0 : 60C
    # GPU 0 : 61C
    # GPU 0 : 62C
    # GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]
    # SWAN Device 0 :
    # Name : GeForce GTX 970
    # ECC : Disabled
    # Global mem : 4095MB
    # Capability : 5.2
    # PCI ID : 0000:01:00.0
    # Device clock : 1342MHz
    # Memory clock : 3505MHz
    # Memory width : 256bit
    # Driver version : r344_32 : 34448
    SWAN : FATAL Unable to load module .mshake_kernel.cu. (999)
    </stderr_txt>
    ]]>
Joined: 5 Dec 12 · Posts: 84 · Credit: 1,663,883,415 · RAC: 0
Now I've had five "Unable to load module" crashes in a day. Some crashed a few seconds into their run, some after many hours of computation. The last ones caused a blue screen and restart.

I replaced my old GTX 460 with a 1200 W PSU and two 970s to make a big impact with BOINC GPU projects, but the frequent crashes are erasing much of my gains.
Joined: 17 Aug 08 · Posts: 2705 · Credit: 1,311,122,549 · RAC: 0
Dayle, I was getting roughly similar problems: sometimes "the simulation has become unstable", sometimes "failed to load *.cu", along with system crashes and blue screens.

I had bought a new GTX 970 and it ran fine for a week. I then added my previous GTX 660 Ti back into the case, running two big GPUs for the first time. I've got both cards "eco-tuned", running power-limited, yet it seems the increased case temperatures pushed my overclocked CPU over the stability boundary. Since I lowered the CPU clock speed a notch there have been no more failures. Granted, that's only been 1.5 days so far, but it's still a record.

Moral: maybe the heat output from those GPUs is also stressing some other component of your system too much.

MrS
Scanning for our furry friends since Jan 2002
Joined: 5 Dec 12 · Posts: 84 · Credit: 1,663,883,415 · RAC: 0
That was a very interesting idea, so I went ahead and looked at the work unit logs for all five crashes. The ambient temperature in the room fluctuates depending on the time of day, but here is each GPU's temperature whenever one or the other failed (all numbers in °C):

1. 64 & 58
2. 71 & 77
3. 58 & 63
4. 58 & 46
5. 71 & 77

77 degrees is much hotter than I'm hoping they'd run at, and I wonder if you're right. If so, it's time for a new case. I've got both the right and left panels of my tower taken off, plus a HEPA filter in the room to keep dust from getting in, but maybe my airflow isn't directed enough? Still, that doesn't seem to be the whole problem, because they're crashing at much lower temperatures too.
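A minimal sketch for doing the same log check automatically (an addition, not from the original thread): it pulls the per-GPU temperature readings out of a saved stderr output, assuming the `# GPU n : XXC` line format shown in the dumps above. The file name is just a placeholder.

```python
import re
import sys
from collections import defaultdict

# Matches the temperature lines the GPUGrid app writes to stderr, e.g. "# GPU 0 : 58C".
TEMP_RE = re.compile(r"#\s*GPU\s+(\d+)\s*:\s*(\d+)C")

def summarize(path):
    temps = defaultdict(list)
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = TEMP_RE.search(line)
            if m:
                temps[int(m.group(1))].append(int(m.group(2)))
    for gpu, readings in sorted(temps.items()):
        print(f"GPU {gpu}: last {readings[-1]}C, max {max(readings)}C "
              f"({len(readings)} readings)")

if __name__ == "__main__":
    # Usage: python temps_from_stderr.py stderr.txt
    summarize(sys.argv[1])
```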
Joined: 5 Dec 11 · Posts: 147 · Credit: 69,970,684 · RAC: 0
That was a very interesting idea, so I went ahead and looked at the work unit logs for all five crashes.

Those are your GPU temps, and they are within the range for your GPU, but only just. IIRC your GPU thermal-throttles at 80°C. It may be worth either reducing your clocks or employing a more aggressive fan profile.

What Apes was referring to was CPU temps. If your GPU is dumping enough hot air into your case, it could be making your CPU unstable. Check those temps and adjust accordingly.
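As a companion to the advice above, here is a short sketch (an addition, not from the thread) that reads the live GPU temperature next to the driver's own slowdown threshold, so you can see how close to throttling each card is. It assumes the nvidia-ml-py package (`pip install nvidia-ml-py`); on Windows, monitoring tools report the same numbers.

```python
# Sketch: compare the live GPU temperature with the driver's slowdown
# (thermal-throttle) threshold via NVML. Assumes the nvidia-ml-py package.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        slowdown = pynvml.nvmlDeviceGetTemperatureThreshold(
            handle, pynvml.NVML_TEMPERATURE_THRESHOLD_SLOWDOWN)
        print(f"GPU {i}: {temp}C now, slowdown threshold {slowdown}C")
finally:
    pynvml.nvmlShutdown()
```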
Joined: 5 Dec 12 · Posts: 84 · Credit: 1,663,883,415 · RAC: 0
Okay, done. I've just recovered from another crash. The CPU is at about 53 °C. I'm mystified. I've let it run a little longer, and we're down to 52 °C.

I did manage to see the blue screen for a split second, but it went away too quickly to take a photo. It said something like "IRQL not less or equal"; the internet says that's usually a driver issue. As I have the latest GPU drivers, latest motherboard drivers, etc., I am running "WhoCrashed" on my system and waiting for another crash. Hopefully this is related to the Unable to Load Module issue.
Joined: 17 Aug 08 · Posts: 2705 · Credit: 1,311,122,549 · RAC: 0
Hopefully this is related to the Unable to Load Module issue.

Very probably. Your temperatures seem fine. The blue screen message you got can be a software (driver) problem. Did you already try a clean driver install of the current 344.75? I think that message can also mean a general hardware failure. The PSU could be another candidate, but a new 1.2 kW unit sounds good. Is it, by any chance, a cheap Chinese model?

MrS
Scanning for our furry friends since Jan 2002
Joined: 18 Oct 13 · Posts: 53 · Credit: 406,647,419 · RAC: 0
It is NOT a driver problem. The previous NVIDIA driver crashed as well.

See also http://www.gpugrid.net/forum_thread.php?id=3932

Regards
Joined: 5 Dec 12 · Posts: 84 · Credit: 1,663,883,415 · RAC: 0
Hi Mr. S. I don't know if it's a cheap Chinese model; it's the Platimax 1200 W, which is discontinued. I picked up the last one Fry's had in their system, then special-ordered the cables, because some tosser returned theirs and kept all the cords.

I've attached my GTX 970s to a new motherboard that I was able to afford during a Black Friday sale. I'll post elsewhere about that, because I'm still not getting the speed I'm expecting. Anyway: same drivers, same GPUs, same PSU, but better fans and motherboard. No more errors. If anyone is still getting this error, I hope that helps narrow down your issue.
Joined: 5 Dec 12 · Posts: 84 · Credit: 1,663,883,415 · RAC: 0
Huh. Well, after a few years this error is back, and it swallowed 21 hours' worth of Maxwell crunching. Two years ago it was happening on an older motherboard, with different drivers, running different tasks, and on a different OS.

https://www.gpugrid.net/result.php?resultid=15094951

Various PC temps still look fine.

| Field | Value |
|---|---|
| Name | 2d9wR8-SDOERR_opm996-0-1-RND7930_0 |
| Workunit | 11595346 |
| Created | 9 May 2016, 9:59:05 UTC |
| Sent | 10 May 2016, 7:14:49 UTC |
| Received | 11 May 2016, 7:15:14 UTC |
| Server state | Over |
| Outcome | Computation error |
| Client state | Compute error |
| Exit status | -52 (0xffffffffffffffcc) Unknown error number |
| Computer ID | 191317 |
| Report deadline | 15 May 2016, 7:14:49 UTC |
| Run time | 78,815.07 |
| CPU time | 28,671.50 |
| Validate state | Invalid |
| Credit | 0.00 |
| Application version | Long runs (8-12 hours on fastest card) v8.48 (cuda65) |

Stderr output:

    <core_client_version>7.6.22</core_client_version>
    <![CDATA[
    <message>
    (unknown error) - exit code -52 (0xffffffcc)
    </message>
    <stderr_txt>
    # GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]
    # SWAN Device 0 :
    # Name : GeForce GTX 970
    # ECC : Disabled
    # Global mem : 4095MB
    # Capability : 5.2
    # PCI ID : 0000:02:00.0
    # Device clock : 1342MHz
    # Memory clock : 3505MHz
    # Memory width : 256bit
    # Driver version : r364_69 : 36510
    # GPU 0 : 66C
    # GPU 1 : 72C
    # GPU 0 : 67C
    # GPU 1 : 73C
    # GPU 0 : 68C
    # GPU 1 : 74C
    # GPU 0 : 69C
    # GPU 0 : 70C
    # GPU 0 : 71C
    # GPU 0 : 72C
    # GPU 1 : 75C
    # GPU 0 : 73C
    # GPU 0 : 74C
    # GPU 0 : 75C
    # GPU 1 : 76C
    # GPU 0 : 76C
    # GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]
    # SWAN Device 0 :
    # Name : GeForce GTX 970
    # ECC : Disabled
    # Global mem : 4095MB
    # Capability : 5.2
    # PCI ID : 0000:02:00.0
    # Device clock : 1342MHz
    # Memory clock : 3505MHz
    # Memory width : 256bit
    # Driver version : r364_69 : 36510
    SWAN : FATAL Unable to load module .mshake_kernel.cu. (719)
    </stderr_txt>
    ]]>
skgiven · Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0
    # GPU 1 : 76C

76°C is too hot! Use NVIDIA Inspector to Prioritize Temperature and set it to 69°C.

FAQ's
HOW TO:
- Opt out of Beta Tests
- Ask for Help
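NVIDIA Inspector's temperature target has no obvious one-line scripting equivalent; as a rough stand-in (an assumption, not the method skgiven names), the board power limit can be capped through NVML, which indirectly brings temperatures down. This needs admin/root and the nvidia-ml-py package, and the 80% figure is only an example.

```python
# Rough stand-in for a temperature target: cap the board power limit via NVML.
# Needs admin/root; assumes nvidia-ml-py. The 0.80 factor is an arbitrary example.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # first GPU
    lo, hi = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)  # milliwatts
    current = pynvml.nvmlDeviceGetPowerManagementLimit(handle)            # milliwatts
    target = max(lo, int(hi * 0.80))
    print(f"power limit: {current / 1000:.0f} W -> {target / 1000:.0f} W")
    pynvml.nvmlDeviceSetPowerManagementLimit(handle, target)
finally:
    pynvml.nvmlShutdown()
```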
Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0
From my experience, GPUs can run at 70-85°C with no problem, so long as the clocks are stable. See whether removing any GPU overclocks entirely fixes the issue or not.
skgiven · Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0
The issue isn't the GPU core temperature itself; it's the heat the GPU generates, which raises the ambient temperature inside the card's enclosure and the computer chassis in general. Sometimes that causes failures when the GDDR heats up too much, for example; sometimes system memory becomes too hot, sometimes other components such as the disk drives do. Generally, when those temps go over 50°C they can cause problems.

FAQ's
HOW TO:
- Opt out of Beta Tests
- Ask for Help
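To put numbers on those "other components", here is a sketch (an addition, assuming the psutil package) that dumps whatever motherboard, CPU and drive temperature sensors the OS exposes. Note that psutil only provides this on Linux/FreeBSD, so on Windows a hardware-monitor utility is the practical route.

```python
# Sketch: list motherboard/CPU/drive temperature sensors via psutil.
# psutil.sensors_temperatures() is only implemented on Linux/FreeBSD.
import psutil

sensors = getattr(psutil, "sensors_temperatures", lambda: {})()
if not sensors:
    print("No sensor data exposed on this platform; use a hardware monitor instead.")
for chip, readings in sensors.items():
    for r in readings:
        label = r.label or chip
        print(f"{chip}/{label}: {r.current:.0f}C (high={r.high}, critical={r.critical})")
```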
caffeineyellow5 · Joined: 30 Jul 14 · Posts: 225 · Credit: 2,658,976,345 · RAC: 0
I just had this error on a fairly new system. It is not new by technological standards, but it was bought brand new from the store less than 6 months ago. I find it interesting that the GPU core topped out at 58°C on this task and it still had this issue. The card itself has gone to 66°C recently with no problem, and when it was doing long tasks it would flatten out at around 59-61°C. Being a GT 730 2GB card, I have it running only short tasks, like my laptop is doing now. (I set my fast cards to do only long tasks as well, as I think that is the polite thing to do, so weaker cards can get short ones to run.)

AFAIK, this PC is not in an area that is hot or cold; it maintains a steady(ish) air temperature, although it is next to a door and can get bursts of cooler air as people come in and out the front door during this fall/winter weather. It certainly hasn't been hot here recently.

This is the only errored task in the PC's history, and it has been a very stable system for its total uptime. I'll keep an eye on it and see if this becomes a pattern. I'll also have to check the CPU temps to see if they remain steady or go through spikes. I don't think heat is the issue, though, unless the card is just faulty. It has done 2 tasks successfully since this error. I also see an extra task in the In Progress list that is not on the system, so I know there will be another error task on the list that will read Timed Out after the 12th.

https://www.gpugrid.net/result.php?resultid=15586143

1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!"
Ephesians 6:18-20, please ;-)
http://tbc-pa.org
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
I just had this error on a fairly new system.

Here's an excerpt from that task's stderr.txt:

    # GPU 0 : 58C
    # GPU [GeForce GT 730] Platform [Windows] Rev [3212] VERSION [65]
    # SWAN Device 0 :
    # Name : GeForce GT 730

Note the missing

    # BOINC suspending at user request (exit)

(or similar) message explaining the reason for the task shutdown between line 1 and line 2. This is the sign of a dirty task shutdown. Its cause is unknown, but it could be a dirty system shutdown, or an unattended (automatic) driver update by Windows Update or NVIDIA Update.
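This check can be automated. Below is a small sketch (an addition, not from the thread) that scans a saved stderr.txt for startup headers that are not preceded by a "# BOINC suspending ..." line, i.e. the dirty shutdowns described above; the file name is a placeholder.

```python
# Flags "dirty" task restarts in a GPUGrid stderr dump: any "# GPU [...]"
# startup header (after the first) whose preceding line is not a
# "# BOINC suspending ..." message, as described in the post above.
import sys

def dirty_restarts(path):
    with open(path, encoding="utf-8", errors="replace") as f:
        lines = [ln.strip() for ln in f if ln.strip()]
    flagged = []
    seen_first_header = False
    for i, line in enumerate(lines):
        if line.startswith("# GPU ["):
            if seen_first_header and "suspending" not in lines[i - 1].lower():
                flagged.append((i + 1, lines[i - 1]))
            seen_first_header = True
    return flagged

if __name__ == "__main__":
    # Usage: python check_restarts.py stderr.txt
    for lineno, prev in dirty_restarts(sys.argv[1]):
        print(f"dirty restart at line {lineno}; previous line: {prev}")
```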
caffeineyellow5 · Joined: 30 Jul 14 · Posts: 225 · Credit: 2,658,976,345 · RAC: 0
Perhaps this was a power loss. We have had 2 in the past few weeks. I just think this is the first time I have seen this particular error, and when I looked it up, it brought me to this thread.
Joined: 2 Oct 17 · Posts: 2 · Credit: 22,213,625 · RAC: 0
In case this hasn't been resolved: I've also run into GPUGRID tasks erroring out, and my solution was to increase the virtual memory size. See here.
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
In case this hasn't been resolved: I've also run into GPUGRID tasks erroring out, and my solution was to increase the virtual memory size. See here.

I don't think increasing the virtual memory size can fix such problems (except perhaps indirectly, by accident). Your PC has 32GB of RAM; I can't imagine that even running 12 CPU tasks simultaneously would exhaust 32GB (+1GB virtual). If it does, then one of the apps you run has a memory leak, and it will run out even if you set +4GB or +8GB of virtual memory.

These

    SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1965.
    SWAN : FATAL Unable to load module .mshake_kernel.cu. (719)

errors are the result of the GPUGrid task getting suspended too frequently and/or too many times (or a failing GPU).

EDIT: SLI is not recommended for GPU crunching. You should try turning it off for a test period (even remove the SLI bridge).