WU not completing
| Author | Message |
|---|---|
|
Joined: 2 Sep 12 Posts: 16 Credit: 609,890,687 RAC: 0
|
I've been running GPUGRID for almost 3 years with no problems. I run this project on two rigs (the two with the best GPU cards that I have). My computer, 133536, is running tasks but they never end... The card is a GTX 660, BOINC is 7.4.42, and the driver is 350.12. Is there a log file that I can check for errors? I'm not seeing any errors in the event log. The monitor for the video card shows the increased GPU activity that I expect to see, but the corresponding rise in temperature isn't there. It's at 33 degrees C, which is its normal idle temperature. I've uninstalled and reinstalled the video driver. It ran one task, then started doing it again. It's running cool enough, and I have a nice Gold power supply on it which is maintaining power just fine. Thanks for any help! |
|
Joined: 2 Sep 12 Posts: 16 Credit: 609,890,687 RAC: 0
|
Ok, an update. I found the log in the task link for the error. Does this mean that there's trouble with the driver or with the card?

Stderr output

    <core_client_version>7.4.42</core_client_version>
    <![CDATA[
    <message>
    aborted by user
    </message>
    <stderr_txt>
    # GPU [GeForce GTX 660] Platform [Windows] Rev [3212] VERSION [65]
    # SWAN Device 0 :
    # Name : GeForce GTX 660
    # ECC : Disabled
    # Global mem : 2047MB
    # Capability : 3.0
    # PCI ID : 0000:01:00.0
    # Device clock : 1058MHz
    # Memory clock : 3004MHz
    # Memory width : 192bit
    # Driver version : r349_00 : 35012
    # GPU 0 : 48C
    # GPU 0 : 52C
    # GPU 0 : 55C
    # GPU 0 : 56C
    # GPU 0 : 57C
    # GPU 0 : 58C
    # GPU 0 : 59C
    # GPU 0 : 60C
    # GPU 0 : 61C
    SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1965.
    # SWAN swan_assert 0
    # GPU [GeForce GTX 660] Platform [Windows] Rev [3212] VERSION [65]
    # SWAN Device 0 :
    # Name : GeForce GTX 660
    # ECC : Disabled
    # Global mem : 2047MB
    # Capability : 3.0
    # PCI ID : 0000:01:00.0
    # Device clock : 1058MHz
    # Memory clock : 3004MHz
    # Memory width : 192bit
    # Driver version : r349_00 : 35012
    # GPU 0 : 47C
    SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1965.
    # SWAN swan_assert 0
    # GPU [GeForce GTX 660] Platform [Windows] Rev [3212] VERSION [65]
    # SWAN Device 0 :
    # Name : GeForce GTX 660
    # ECC : Disabled
    # Global mem : 2047MB
    # Capability : 3.0
    # PCI ID : 0000:01:00.0
    # Device clock : 1058MHz
    # Memory clock : 3004MHz
    # Memory width : 192bit
    # Driver version : r349_00 : 35012
    # GPU 0 : 34C
    SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1965.
    # SWAN swan_assert 0
    # GPU [GeForce GTX 660] Platform [Windows] Rev [3212] VERSION [65]
    # SWAN Device 0 :
    # Name : GeForce GTX 660
    # ECC : Disabled
    # Global mem : 2047MB
    # Capability : 3.0
    # PCI ID : 0000:01:00.0
    # Device clock : 1058MHz
    # Memory clock : 3004MHz
    # Memory width : 192bit
    # Driver version : r349_00 : 35012
    # GPU 0 : 33C
    # GPU 0 : 34C
    # GPU 0 : 35C
    # GPU 0 : 36C
    SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1965.
    # SWAN swan_assert 0
    # GPU [GeForce GTX 660] Platform [Windows] Rev [3212] VERSION [65]
    # SWAN Device 0 :
    # Name : GeForce GTX 660
    # ECC : Disabled
    # Global mem : 2047MB
    # Capability : 3.0
    # PCI ID : 0000:01:00.0
    # Device clock : 1058MHz
    # Memory clock : 3004MHz
    # Memory width : 192bit
    # Driver version : r349_00 : 35012
    # GPU 0 : 34C
    </stderr_txt>
    ]]>

Thanks. |
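For anyone wanting to check a task's stderr for the same symptoms, here is a rough Python sketch. It assumes the stderr text has been copied from the task page into a string. The function name and the pattern list are my own, built only from the messages quoted in this thread; they are not part of any GPUGRID or BOINC tool.

```python
import re
from collections import Counter

# Markers taken from the stderr excerpts quoted in this thread (not an official list).
PATTERNS = {
    "cuda_driver_error": re.compile(r"SWAN : FATAL : Cuda driver error (\d+)"),
    "unstable_simulation": re.compile(r"The simulation has become unstable"),
    # Each application start prints a banner line like "# GPU [name] Platform [...]".
    "app_start": re.compile(r"# GPU \[[^\]]+\] Platform \["),
}

def summarize_stderr(text: str) -> Counter:
    """Count how often each known marker appears in a task's stderr text."""
    counts = Counter()
    for name, pattern in PATTERNS.items():
        counts[name] = len(pattern.findall(text))
    return counts

# Shortened sample in the same shape as the log above.
sample = (
    "# GPU [GeForce GTX 660] Platform [Windows] Rev [3212] VERSION [65] "
    "# GPU 0 : 61C SWAN : FATAL : Cuda driver error 700 in file "
    "'swanlibnv2.cpp' in line 1965. # SWAN swan_assert 0 "
    "# GPU [GeForce GTX 660] Platform [Windows] Rev [3212] VERSION [65] "
    "# GPU 0 : 34C"
)
summary = summarize_stderr(sample)
print(dict(summary))
```

Several application starts, each followed by a fatal error, suggest the task keeps restarting and failing instead of making progress.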
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
On your XP/GTX660 system you had a problem with this WU: e1s4_2-GERARD_FXCXCL12_LIG_11675311-0-1-RND5477_1. The stderr output says "aborted by user" and "SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1965." I also have a problem with a GERARD_FXCXCL12 WU on an XP system. The WU has run for 60h on a GTX670. It should have taken less than a day and should have finished several days ago. My WU has not checkpointed, so there is probably a design fault. While it's at 98.00% complete, it is progressing at a very slow rate (0.01% every few minutes), and if it reaches 100% it might continue to run without actually completing... If you experience this again, see if the WU has been checkpointing. |
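To put the slow rate in perspective, here is a back-of-the-envelope sketch in Python. The 3 minutes per 0.01% step is an assumed stand-in for "every few minutes", not a measured figure.

```python
def remaining_hours(percent_done: float, step_percent: float, minutes_per_step: float) -> float:
    """Hours left if progress keeps advancing by step_percent every minutes_per_step."""
    remaining_steps = (100.0 - percent_done) / step_percent
    return remaining_steps * minutes_per_step / 60.0

# 98.00% done, 0.01% per step, assumed 3 minutes per step.
hours = remaining_hours(98.00, 0.01, 3.0)
print(round(hours, 1))
```

Under that assumption the last 2% would need roughly ten more hours, on top of the 60 hours already spent.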
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 2
|
"I also have a problem with a GERARD_FXCXCL12 WU on an XP system. The WU has run for 60h on a GTX670. It should have taken less than a day and should have finished several days ago. My WU has not check-pointed, so there is probably a design fault. While it's at 98.00% complete it is progressing at a very slow rate (0.01% every few minutes) and if it reaches 100% it might continue to run without actually completing..."

I have had several like this since upgrading to the cuda65 driver v347.88 on my Windows XP machines. Another symptom is that CPU usage drops to zero (0.0000 CPUs on the monitoring tool I use). I find that suspending the task for a few seconds, then allowing it to restart, resumes normal progress and allows the task to complete and validate. |
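The suspend-and-resume workaround can also be scripted with BOINC's `boinccmd` tool, whose `--task URL task_name {suspend|resume|abort}` form acts on a single task. This is only a sketch: the helper names and the 10-second pause are my own choices, the project URL in the example is an assumption, and `boinccmd` must be on the PATH and able to authenticate with the running client.

```python
import subprocess
import time
from typing import Callable, List

def task_command(project_url: str, task_name: str, op: str) -> List[str]:
    """Build the boinccmd argument list for a single-task operation."""
    if op not in ("suspend", "resume"):
        raise ValueError(f"unsupported operation: {op}")
    return ["boinccmd", "--task", project_url, task_name, op]

def bounce_task(project_url: str, task_name: str,
                pause_seconds: float = 10.0,
                run: Callable = subprocess.run) -> None:
    """Suspend a task, wait briefly, then resume it."""
    run(task_command(project_url, task_name, "suspend"), check=True)
    time.sleep(pause_seconds)
    run(task_command(project_url, task_name, "resume"), check=True)

# Show the command that would be issued, without running it.
# The URL is an assumed project URL; the task name is the one quoted earlier in the thread.
example = task_command(
    "http://www.gpugrid.net/",
    "e1s4_2-GERARD_FXCXCL12_LIG_11675311-0-1-RND5477_1",
    "suspend",
)
print(" ".join(example))
```

Calling `bounce_task(...)` with the real project URL and task name from the BOINC Manager would do the same thing as clicking Suspend and then Resume by hand.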
|
Joined: 2 Sep 12 Posts: 16 Credit: 609,890,687 RAC: 0
|
I'll give that a go, and report back. Should I consider downgrading my driver? Now that you mention it... I did start having problems after upgrading. Thanks! |
|
Joined: 2 Sep 12 Posts: 16 Credit: 609,890,687 RAC: 0
|
Ok, here's an update. I uninstalled the driver and related software (version 350.12). I rebooted in safe mode and used DDU v15.0.0.1 to do a more thorough (so I was told) uninstall. I then rebooted back into Windows XP and installed driver version 344.75 (the previous version that my system was working on). It has now run two short-run tasks and one long-run task successfully. Previously I was only able to run a single task before problems appeared. So far, it looks like it's working properly. I'll update if there are any changes. -MichaeMac |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 2
|
Same here. 344.75 seems to be much better - haven't had a task stall since I downgraded (either here or at other projects), and it's new enough to run cuda65 - which was the reason for upgrading in the first place. |
|
Joined: 28 May 12 Posts: 63 Credit: 714,535,121 RAC: 0
|
e1s748_2-NOELIA_l690330-0-3-RND3494: This is a SHORT run that ran for 20 hours and was indicated at less than 0.5% complete, with almost 5000 hours to complete. Most SHORT runs on my GTX760 take 3-4 hours to complete. This is the WORST of the errors I have seen from GPUGrid. I am MERELY informing you that this is not a UNIQUE problem for one user. To me, there is a fault in some of the work units. |
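As a sanity check on those figures, a simple linear extrapolation (my own sketch, which is not necessarily how BOINC computes its estimate) gives the same order of magnitude. The post only says "less than 0.5%", so 0.5% and 0.4% are tried as example values.

```python
def projected_total_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Total run time if the task keeps its average pace so far."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done

# 20 hours elapsed; the two fractions are illustrative guesses for "less than 0.5%".
total_at_half_percent = projected_total_hours(20.0, 0.005)
total_at_point_four = projected_total_hours(20.0, 0.004)
print(round(total_at_half_percent), round(total_at_point_four))
```

Either way the projection lands in the thousands of hours, against the 3-4 hours a short run normally takes.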
|
Joined: 28 May 12 Posts: 63 Credit: 714,535,121 RAC: 0
|
OOPS, this is on a GTX550, not a GTX760 |
|
Joined: 17 Feb 13 Posts: 181 Credit: 144,871,276 RAC: 0
|
Hi, GPUGrid Folks: Could someone please tell me whether the two recent failures are because of a problem with my PC or the server? Thank you.

Workunit 10930521
Name: e13s4_e11s9f92-GERARD_FXCXCL12_LIG_23157812-0-1-RND0535_0
Created: 15 May 2015 | 15:21:14 UTC
Sent: 16 May 2015 | 8:33:20 UTC
Received: 17 May 2015 | 4:36:41 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: -97 (0xffffffffffffff9f) Unknown error number
Computer ID: 214484
Report deadline: 21 May 2015 | 8:33:20 UTC
Run time: 52,871.45
CPU time: 10,006.15
Validate state: Invalid
Credit: 0.00
Application version: Long runs (8-12 hours on fastest card) v8.47 (cuda65)

Stderr output

    <core_client_version>7.4.42</core_client_version>
    <![CDATA[
    <message>
    (unknown error) - exit code -97 (0xffffff9f)
    </message>
    <stderr_txt>
    # GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3212] VERSION [65]
    # SWAN Device 0 :
    # Name : GeForce GTX 660 Ti
    # ECC : Disabled
    # Global mem : 2048MB
    # Capability : 3.0
    # PCI ID : 0000:04:00.0
    # Device clock : 1110MHz
    # Memory clock : 3304MHz
    # Memory width : 192bit
    # Driver version : r349_00 : 35012
    # GPU 0 : 74C
    # GPU 1 : 63C
    # GPU 1 : 64C
    # GPU 1 : 66C
    # GPU 1 : 67C
    # GPU 0 : 75C
    # GPU 1 : 68C
    # GPU 0 : 76C
    # BOINC suspending at user request (exit)
    # GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3212] VERSION [65]
    # SWAN Device 0 :
    # Name : GeForce GTX 660 Ti
    # ECC : Disabled
    # Global mem : 2048MB
    # Capability : 3.0
    # PCI ID : 0000:04:00.0
    # Device clock : 1110MHz
    # Memory clock : 3304MHz
    # Memory width : 192bit
    # Driver version : r349_00 : 35012
    # GPU 0 : 50C
    # GPU 1 : 49C
    # GPU 0 : 56C
    # GPU 1 : 54C
    # GPU 0 : 60C
    # GPU 1 : 57C
    # GPU 0 : 64C
    # GPU 1 : 61C
    # GPU 0 : 66C
    # GPU 1 : 62C
    # GPU 0 : 68C
    # GPU 0 : 69C
    # GPU 1 : 63C
    # GPU 0 : 70C
    # GPU 1 : 64C
    # GPU 0 : 71C
    # GPU 1 : 65C
    # GPU 0 : 72C
    # GPU 1 : 66C
    # GPU 0 : 73C
    # GPU 0 : 74C
    # GPU 1 : 67C
    # GPU 0 : 75C
    # GPU 1 : 68C
    # GPU 0 : 76C
    # BOINC suspending at user request (exit)
    # GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3212] VERSION [65]
    # SWAN Device 0 :
    # Name : GeForce GTX 660 Ti
    # ECC : Disabled
    # Global mem : 2048MB
    # Capability : 3.0
    # PCI ID : 0000:04:00.0
    # Device clock : 1110MHz
    # Memory clock : 3304MHz
    # Memory width : 192bit
    # Driver version : r349_00 : 35012
    # GPU 0 : 56C
    # GPU 1 : 51C
    # GPU 0 : 62C
    # GPU 1 : 57C
    # GPU 0 : 65C
    # GPU 1 : 61C
    # GPU 0 : 67C
    # GPU 1 : 63C
    # GPU 0 : 69C
    # GPU 1 : 64C
    # GPU 0 : 71C
    # GPU 0 : 72C
    # GPU 1 : 66C
    # GPU 0 : 73C
    # GPU 0 : 74C
    # GPU 0 : 75C
    # GPU 1 : 67C
    # GPU 1 : 68C
    # GPU 0 : 76C
    # The simulation has become unstable. Terminating to avoid lock-up (1)
    # Attempting restart (step 14195000)
    # GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3212] VERSION [65]
    # SWAN Device 0 :
    # Name : GeForce GTX 660 Ti
    # ECC : Disabled
    # Global mem : 2048MB
    # Capability : 3.0
    # PCI ID : 0000:04:00.0
    # Device clock : 1110MHz
    # Memory clock : 3304MHz
    # Memory width : 192bit
    # Driver version : r349_00 : 35012
    # The simulation has become unstable. Terminating to avoid lock-up (1)
    </stderr_txt>
    ]]>
|
|
Joined: 17 Feb 13 Posts: 181 Credit: 144,871,276 RAC: 0
|
Second failure

Workunit 10932386
Name: e17s7_e16s4f41-GERARD_FXCXCL12_LIG_15494362-0-1-RND8820_0
Created: 16 May 2015 | 22:06:23 UTC
Sent: 17 May 2015 | 3:40:23 UTC
Received: 17 May 2015 | 6:28:16 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: -97 (0xffffffffffffff9f) Unknown error number
Computer ID: 214484
Report deadline: 22 May 2015 | 3:40:23 UTC
Run time: 6,888.55
CPU time: 1,263.70
Validate state: Invalid
Credit: 0.00
Application version: Long runs (8-12 hours on fastest card) v8.47 (cuda65)

Stderr output

    <core_client_version>7.4.42</core_client_version>
    <![CDATA[
    <message>
    (unknown error) - exit code -97 (0xffffff9f)
    </message>
    <stderr_txt>
    # GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3212] VERSION [65]
    # SWAN Device 0 :
    # Name : GeForce GTX 660 Ti
    # ECC : Disabled
    # Global mem : 2048MB
    # Capability : 3.0
    # PCI ID : 0000:04:00.0
    # Device clock : 1110MHz
    # Memory clock : 3304MHz
    # Memory width : 192bit
    # Driver version : r349_00 : 35012
    # GPU 0 : 75C
    # GPU 1 : 63C
    # GPU 1 : 66C
    # GPU 1 : 67C
    # GPU 1 : 68C
    # The simulation has become unstable. Terminating to avoid lock-up (1)
    # Attempting restart (step 1770000)
    # GPU [GeForce GTX 660 Ti] Platform [Windows] Rev [3212] VERSION [65]
    # SWAN Device 0 :
    # Name : GeForce GTX 660 Ti
    # ECC : Disabled
    # Global mem : 2048MB
    # Capability : 3.0
    # PCI ID : 0000:04:00.0
    # Device clock : 1110MHz
    # Memory clock : 3304MHz
    # Memory width : 192bit
    # Driver version : r349_00 : 35012
    # The simulation has become unstable. Terminating to avoid lock-up (1)
    </stderr_txt>
    ]]>
|
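A side note on the two hex values in these reports: `0xffffff9f` and `0xffffffffffffff9f` are the same exit code, -97, written as 32-bit and 64-bit two's complement. The Python sketch below (helper names are mine) only shows the conversion; it does not explain what -97 means to the application.

```python
def as_unsigned_hex(code: int, bits: int) -> str:
    """Render a signed exit code as a two's-complement hex string of the given width."""
    mask = (1 << bits) - 1
    width = bits // 4  # hex digits needed for this bit width
    return "0x" + format(code & mask, f"0{width}x")

def as_signed(value: int, bits: int) -> int:
    """Interpret an unsigned value (0 <= value < 2**bits) as a two's-complement integer."""
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

print(as_unsigned_hex(-97, 32))
print(as_unsigned_hex(-97, 64))
print(as_signed(0xFFFFFF9F, 32))
```

Both hex forms decode back to -97, so the two task pages are reporting one and the same error.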
|
Joined: 5 Jan 09 Posts: 670 Credit: 2,498,095,550 RAC: 0
|
It looks like your card is overclocked; ease down the OC or increase power. The clue is "The simulation has become unstable". |
|
Joined: 28 May 12 Posts: 63 Credit: 714,535,121 RAC: 0
|
I just aborted a short-run Noelia that was stuck at 0.88% with an indicated time to completion of over 25,000 hours. I saw NOTHING updated in the log file. This SAME card is running a long-run Gerard normally: the percentage is incrementing upwards and the estimated run time is dropping every few seconds. This is a GTX550Ti running on a Xeon 2620 under Ubuntu Linux 15.04, with all updates installed and Nvidia drivers 246.59. This is not an overclocked card; it is at stock settings, with no options in Nvidia Settings. (I did have the fan control and overclock options at one time, but even then I could NOT overclock, only REDUCE the clock.) The PS is rated at 1050W and the CPU is running at 51C. |
|
Joined: 17 Feb 13 Posts: 181 Credit: 144,871,276 RAC: 0
|
Hi, everyone: ALL three of my recent failures have 'GERARD' in common in the WU name. Two failures occurred, one on each of my two GTX 660 Ti devices, and one today on one of my 650 Ti devices. None of my cards or CPUs (AMD FX-8350 for the 660 Ti and AMD Phenom 1090T for the 650 Ti) is overclocked. Any suggestions? Thanks, John |
|
Joined: 17 Feb 13 Posts: 181 Credit: 144,871,276 RAC: 0
|
Issue found & fixed: too many CPU tasks running with GPUGrid. |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
"I just aborted a short run Noelia that was stuck at 0.88% with indicated runtime completion in over 25,000 hours."

Robert: What is the exact make and model of your GPU devices that are leading to "Simulation has become unstable"? |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
"Issue found & fixed: too many CPU tasks running with GPUGrid."

I don't think that can be an actual cause. I may be wrong, but I'd be interested in knowing how you came to that conclusion. |
|
Joined: 18 Oct 13 Posts: 53 Credit: 406,647,419 RAC: 0
|
Here are my results; all the aborts have the same error code. No overclocking etc.

BOINC: 7.4.42
NVIDIA Driver: 347.88
9 May 2015 - 21 May

e1s468_4-NOELIA_ETQ_bound-1-2-RND6921
e3s206_e1s24f53-NOELIA_l6903301-2-3-RND6031
e1s958_8-NOELIA_ETQ_bound-0-2-RND8280
e3s78_e1s24f24-NOELIA_l6903301-2-3-RND0130
e2s136_e1s99f53-NOELIA_l6903301-2-3-RND5303
e2s499_e1s122f52-NOELIA_l6903301-0-3-RND2621

-97 (0xffffffffffffff9f) Unknown error number

    The simulation has become unstable. Terminating to avoid lock-up (1)
    # Attempting restart (step 1050000)
    # GPU [GeForce GTX 760] Platform [Windows] Rev [3212] VERSION [65]
    # SWAN Device 0 :
    # Name : GeForce GTX 760
    # ECC : Disabled
    # Global mem : 2048MB
    # Capability : 3.0
    # PCI ID : 0000:01:00.0
    # Device clock : 1071MHz
    # Memory clock : 3004MHz
    # Memory width : 256bit
    # Driver version : r346_00 : 34788
    # The simulation has become unstable. Terminating to avoid lock-up (1)
|
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Killersocke: What is the exact make and model of the GPU that is giving you problems? The reason I ask is so that we can determine whether the GPU is factory-overclocked. If it is, the next step would be to use a tool like PrecisionX to apply a negative GPU clock offset, so that the clock matches the reference clock... and then retest. |
|
Joined: 18 Oct 13 Posts: 53 Credit: 406,647,419 RAC: 0
|
Hi Jacob,

NVIDIA System Information report created on: 05/21/2015 15:34:56
System name: Killersocke

[Display]
Operating system: Windows 8.1 Pro, 64-bit
DirectX version: 11.0
GPU processor: GeForce GTX 760 GPU GK104
Driver version: 347.88
Direct3D API version: 11.2
Direct3D feature level: 11_0
CUDA cores: 1152
Core clock: 1006 MHz
Memory data rate: 6008 MHz
Memory interface: 256-bit
Memory bandwidth: 192.26 GB/s
Total available graphics memory: 4096 MB
Dedicated video memory: 2048 MB GDDR5
System video memory: 0 MB
Shared system memory: 2048 MB
Video BIOS version: 80.04.BF.00.06
IRQ: 16
Bus: PCI Express x16 Gen3
Device ID: 10DE 1187 84721043
Part number: 2004 0010

[Components]
nvui.dll 8.17.13.4788 NVIDIA User Experience Driver Component
nvxdsync.exe 8.17.13.4788 NVIDIA User Experience Driver Component
nvxdplcy.dll 8.17.13.4788 NVIDIA User Experience Driver Component
nvxdbat.dll 8.17.13.4788 NVIDIA User Experience Driver Component
nvxdapix.dll 8.17.13.4788 NVIDIA User Experience Driver Component
NVCPL.DLL 8.17.13.4788 NVIDIA User Experience Driver Component
nvCplUIR.dll 8.1.800.0 NVIDIA Control Panel
nvCplUI.exe 8.1.800.0 NVIDIA Control Panel
nvWSSR.dll 6.14.13.4788 NVIDIA Workstation Server
nvWSS.dll 6.14.13.4788 NVIDIA Workstation Server
nvViTvSR.dll 6.14.13.4788 NVIDIA Video Server
nvViTvS.dll 6.14.13.4788 NVIDIA Video Server
nvDispSR.dll 6.14.13.4788 NVIDIA Display Server
NVMCTRAY.DLL 8.17.13.4788 NVIDIA Media Center Library
nvDispS.dll 6.14.13.4788 NVIDIA Display Server
PhysX 09.14.0702 NVIDIA PhysX
NVCUDA.DLL 8.17.13.4788 NVIDIA CUDA 7.0.29 driver
nvGameSR.dll 6.14.13.4788 NVIDIA 3D Settings Server
nvGameS.dll 6.14.13.4788 NVIDIA 3D Settings Server
|
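The negative-offset idea suggested earlier in the thread can be worked through with the core clock (Kerntakt) of 1006 MHz from this report. The 980 MHz reference base clock for a GTX 760 is my assumption, taken from NVIDIA's published specification and not from this thread, so treat the result as illustrative.

```python
def offset_to_reference(reported_mhz: int, reference_mhz: int) -> int:
    """Clock offset (MHz) that brings the card to the reference clock.

    A negative result means the card runs above reference and should be slowed down.
    """
    return reference_mhz - reported_mhz

REPORTED_CORE_MHZ = 1006       # core clock from the system report in this post
ASSUMED_REFERENCE_MHZ = 980    # assumption: NVIDIA reference base clock for the GTX 760

offset = offset_to_reference(REPORTED_CORE_MHZ, ASSUMED_REFERENCE_MHZ)
print(f"{offset:+d} MHz")
```

If that assumption holds, the card is factory-overclocked by a small margin, and an offset of about that size in a tool like PrecisionX would bring it to reference for a retest.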
©2026 Universitat Pompeu Fabra