Message boards : Graphics cards (GPUs) : Errors

PhilTheNet | Joined: 24 Sep 14 | Posts: 1 | Credit: 57,101,016 | RAC: 0
Hello, all my WUs errored out this morning (the project had been running smoothly on this computer for 6 months):

Stderr output
<core_client_version>7.4.42</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -44 (0xffffffd4)
</message>
]]>

Does anyone have the same problem?
Phil
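As a side note on that number: the hex value in parentheses is just the same exit status viewed as an unsigned 32-bit integer, and its low byte (0xd4 = 212) is the value some reports later in the thread show. A minimal Python sketch of the conversion, using nothing beyond standard two's-complement masking:

```python
# Minimal sketch: show how BOINC's signed exit code -44 maps to the
# unsigned 32-bit value 0xffffffd4 and to the low byte 0xd4 = 212.
def exit_code_views(code: int) -> dict:
    return {
        "signed": code,
        "unsigned_32bit_hex": hex(code & 0xFFFFFFFF),  # two's-complement view
        "low_byte": code & 0xFF,                        # what some tools report
    }

print(exit_code_views(-44))
# {'signed': -44, 'unsigned_32bit_hex': '0xffffffd4', 'low_byte': 212}
```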

dskagcommunity | Joined: 28 Apr 11 | Posts: 463 | Credit: 958,266,958 | RAC: 41
Here too..

<core_client_version>7.4.36</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -44 (0xffffffd4)
</message>
]]>

I thought this was because we had a power outage and one machine recovered with the wrong system time (+8 days). I corrected that and still ran into these errors, on both graphics cards in my only long-running 24/7 crunching machine. Voltage and clocks are the same as before; in fact I changed them to a more conservative setting (I was already running safe clocks and voltages), but it didn't help. I still think the fault is somewhere on my side on this machine, but I don't know where. Einstein runs fine, but that doesn't say anything. On the other hand, my crunching partner (wingman) errored out too, so perhaps an erroneous batch is on its way as well. I will try again in one week.

DSKAG Austria Research Team: http://www.research.dskag.at

Joined: 26 Mar 14 | Posts: 101 | Credit: 0 | RAC: 0
Can you report which WUs you had the problems with? Yesterday I released a whole new batch of simulations and, although they should be fine, a possibility of corruption exists. Thanks a lot, guys!

Joined: 14 Oct 11 | Posts: 31 | Credit: 81,420,504 | RAC: 0
All my WUs since yesterday have failed (both long and short queues): http://www.gpugrid.net/results.php?userid=81842

dskagcommunity | Joined: 28 Apr 11 | Posts: 463 | Credit: 958,266,958 | RAC: 41
https://www.gpugrid.net/workunit.php?wuid=11087902, as an example.

DSKAG Austria Research Team: http://www.research.dskag.at

Joined: 28 Jun 10 | Posts: 1 | Credit: 31,454,680 | RAC: 0

Joined: 28 Jul 12 | Posts: 819 | Credit: 1,591,285,971 | RAC: 0
They seem to fall into two groups:
The cards are giving an "unstable" error, probably just overclocking/overheating.

Joined: 26 Aug 11 | Posts: 100 | Credit: 2,863,609,686 | RAC: 356
After a load of "error units" on my main system, I updated the drivers and started crunching again OK.

Retvari Zoltan | Joined: 20 Jan 09 | Posts: 2380 | Credit: 16,897,957,044 | RAC: 0
I've received a couple of reissued tasks which had previously failed on other hosts with "(unknown error) - exit code -44 (0xffffffd4)". All of them completed successfully on my host. For example:
http://www.gpugrid.net/workunit.php?wuid=11081985
http://www.gpugrid.net/workunit.php?wuid=11087547
http://www.gpugrid.net/workunit.php?wuid=11088391
http://www.gpugrid.net/workunit.php?wuid=11088528

dskagcommunity | Joined: 28 Apr 11 | Posts: 463 | Credit: 958,266,958 | RAC: 41
> They seem to fall into two groups:

Is it still recommended to use driver 334.xx or newer, or is this info outdated? 335.28 was a very stable driver version.

But OK, the info that I'm not the only one was important for me too; it is not because of our power outage (just unlucky that it started on the same day), since the other machine is failing on shorts too. I'm retiring from GPUGrid for the moment and switching fully to Einstein, because it seems the 570/580 are getting too old to finish all the long units within 24 hours, missing by some minutes, including download/upload over HSDPA, which is not a very stable line at times (outdoor modem crashing, router crashing, bad USB cable connection, etc.). The extra-long units needed more than 2 days, and sometimes something hung because of that long duration, which is bad on unattended machines. "Extra Long Queue" +1.

I think I'll come back at the end of November with a single 970 or 980 at home over the winter and attack the 1B mark again ^^ Bye-bye worldwide GPUGrid place 43 ^^ In Austria I still have double the points of the nearly inactive place 2, so that should be a secure enough place #1 until the end of November. :)

DSKAG Austria Research Team: http://www.research.dskag.at

Retvari Zoltan | Joined: 20 Jan 09 | Posts: 2380 | Credit: 16,897,957,044 | RAC: 0
> They seem to fall into two groups:
>
> Is it still recommended to use driver 334.xx or newer, or is this info outdated? 335.28 was a very stable driver version.

These drivers are a bit old, as your hosts are using the CUDA 6.0 client; however, they should work fine. In my experience the latest driver (353.30) is stable, but I don't have GTX 5xx cards.

> But OK, the info that I'm not the only one was important for me too; it is not because of our power outage (just unlucky that it started on the same day), since the other machine is failing on shorts too...

I've looked into the stderr output of your tasks, and I came to the conclusion that your tasks on host 150780 are failing because its GPU can't take such high clock frequencies. (You have probably reduced the memory clock already, but this or the GPU clock still has to be reduced.)

GPU: GeForce GTX 570
Device clock : 1500MHz (default: 1464MHz)
Memory clock : 1700MHz (default: 1900MHz)

Task 14375512:
# Simulation unstable. Flag 9 value 129
# Simulation unstable. Flag 10 value 129
# The simulation has become unstable. Terminating to avoid lock-up
# The simulation has become unstable. Terminating to avoid lock-up (2)
# Attempting restart (step 1875000)
...
# Simulation unstable. Flag 10 value 129
# The simulation has become unstable. Terminating to avoid lock-up
# The simulation has become unstable. Terminating to avoid lock-up (2)
# Attempting restart (step 1885000)

Task 14373443:
ERROR: file force.cpp line 513: TCL evaluation of [calcforces]
17:24:33 (3980): called boinc_finish

On your other host (117426), the GTX 580 and the GTX 560 Ti are definitely overheating (sometimes reaching 90°C), so it is a miracle that the tasks on this card don't have "simulation became unstable" messages.

Task 14367948:
<core_client_version>7.4.36</core_client_version>
<![CDATA[
<stderr_txt>
# GPU [GeForce GTX 580] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 0 :
# Name : GeForce GTX 580
# ECC : Disabled
# Global mem : 3071MB
# Capability : 2.0
# PCI ID : 0000:01:00.0
# Device clock : 1520MHz
# Memory clock : 1700MHz
# Memory width : 384bit
# Driver version : r334_00 : 33528
# GPU 0 : 69C
# GPU 1 : 89C
# GPU 0 : 71C
# GPU 1 : 90C
# GPU 0 : 73C
# GPU 0 : 74C
# GPU 0 : 75C
# GPU 0 : 76C
# GPU 0 : 77C
# GPU 0 : 78C
# GPU 0 : 79C
# GPU 0 : 80C
# GPU 0 : 81C
# GPU 0 : 82C
# Time per step (avg over 3125000 steps): 8.338 ms
# Approximate elapsed time for entire WU: 26054.703 s
12:38:20 (4052): called boinc_finish
</stderr_txt>
]]>

> ... and I'm retiring from GPUGrid for the moment and switching fully to Einstein, because it seems the 570/580 are getting too old to finish all the long units within 24 hours...

You are right about the GTX 5xx series getting old, as two newer GPU generations have been developed in the meantime. However, they should still work here, and since Einstein@home works on them, this suggests that the power outage corrupted some files of the GPUGrid project or the driver on your host. You can eliminate these factors by resetting (or removing and re-attaching) the GPUGrid project on your host, and by reinstalling / upgrading your drivers.
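As an aside to the analysis above: if you want to watch for the overheating/overclocking pattern yourself, here is a minimal Python sketch that polls nvidia-smi for temperature and clock readings. It assumes nvidia-smi is installed with the driver and on the PATH; the 80°C warning threshold is arbitrary, and some query fields may not be reported on older GeForce cards such as the GTX 5xx series.

```python
import subprocess
import time

# Minimal sketch: poll nvidia-smi for temperature and clock readings and warn
# when a card runs hot. Assumes nvidia-smi is on the PATH; field support can
# vary on older GeForce cards (e.g. GTX 5xx).
QUERY = "index,name,temperature.gpu,clocks.sm,clocks.mem"
TEMP_WARN_C = 80  # arbitrary threshold for illustration

def read_gpus():
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        text=True,
    )
    return [line.split(", ") for line in out.strip().splitlines()]

if __name__ == "__main__":
    while True:
        for idx, name, temp, sm_clk, mem_clk in read_gpus():
            flag = "  <-- running hot" if int(temp) >= TEMP_WARN_C else ""
            print(f"GPU {idx} {name}: {temp}C, SM {sm_clk} MHz, mem {mem_clk} MHz{flag}")
        time.sleep(30)
```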

Retvari Zoltan | Joined: 20 Jan 09 | Posts: 2380 | Credit: 16,897,957,044 | RAC: 0
My host finished a previously failed task again. It had failed three times on other hosts before:

Host 204947: Task 14385448, "# The simulation has become unstable. Terminating to avoid lock-up (1)" (task list: all failed)
Host 194523: Task 14399453, "(unknown error) - exit code -44 (0xffffffd4)" (task list: all failed)
Host 163989: Task 14399778, "process exited with code 212 (0xd4, -44)" (task list: all failed)

Joined: 16 May 13 | Posts: 41 | Credit: 145,731,947 | RAC: 0

Joined: 4 Aug 14 | Posts: 266 | Credit: 2,219,935,054 | RAC: 0
I have had 176 tasks fail between the 22nd and the 24th, and have removed these cards for now. The failures are on 4 identical cards I have successfully used on GPUGrid for almost a year. I have 1 remaining card (a different model) that is still able to run tasks. It appears that something has changed in the last 'batch'. The four cards are:
https://www.gpugrid.net/results.php?hostid=181299
https://www.gpugrid.net/results.php?hostid=181300
https://www.gpugrid.net/results.php?hostid=180572
https://www.gpugrid.net/results.php?hostid=180015

Joined: 6 Jan 15 | Posts: 76 | Credit: 25,499,534,331 | RAC: 0
Got some WUs with many attempts. My host can't take these at all; they failed at an early stage.

https://www.gpugrid.net/workunit.php?wuid=11092476 (created 20 Jul 2015 | 18:29:55 UTC)
Exit status -97 (0xffffffffffffff9f) Unknown error number
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 40000)
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]

11092127 (created 20 Jul 2015 | 18:17:40 UTC)
Exit status -97 (0xffffffffffffff9f) Unknown error number
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 760000)
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]

11091745 (created 20 Jul 2015 | 18:04:57 UTC)
Exit status -97 (0xffffffffffffff9f) Unknown error number
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 600000)
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]

11098488 (created 23 Jul 2015 | 18:13:11 UTC)
Exit status -98 (0xffffffffffffff9e) Unknown error number
ERROR: file force.cpp line 513: TCL evaluation of [calcforces]
16:05:42 (7636): called boinc_finish

11091450 (created 20 Jul 2015 | 17:55:40 UTC)
Exit status -97 (0xffffffffffffff9f) Unknown error number
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 560000)
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]

The last ones completed and are valid, but some hosts had problems with these:
11091018 (created 20 Jul 2015 | 17:42:09 UTC)
11090161 (created 20 Jul 2015 | 17:15:09 UTC)
11090182 (created 20 Jul 2015 | 17:15:53 UTC)
11089720 (created 20 Jul 2015 | 17:00:31 UTC)
11089778 (created 20 Jul 2015 | 17:02:31 UTC)
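As an editorial aside: when sorting through a pile of failures like the ones listed above, a quick tally of the error signatures quoted in this thread can show whether a host is mostly hitting the unstable-simulation (-97) pattern or the -44 errors. A rough Python sketch; the stderr_dumps folder and one-file-per-task layout are assumptions, not anything GPUGrid or BOINC provides.

```python
from collections import Counter
from pathlib import Path

# Sketch only: scan a folder of saved stderr outputs (one text file per task,
# path is hypothetical) and tally the failure signatures seen in this thread.
SIGNATURES = {
    "unstable simulation": "The simulation has become unstable",
    "exit code -44": "exit code -44 (0xffffffd4)",
    "tcl calcforces error": "TCL evaluation of [calcforces]",
}

def triage(stderr_dir: str) -> Counter:
    counts = Counter()
    for path in Path(stderr_dir).glob("*.txt"):
        text = path.read_text(errors="ignore")
        for label, needle in SIGNATURES.items():
            if needle in text:
                counts[label] += 1
    return counts

if __name__ == "__main__":
    for label, n in triage("stderr_dumps").most_common():
        print(f"{label}: {n} task(s)")
```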

Joined: 6 Jan 15 | Posts: 76 | Credit: 25,499,534,331 | RAC: 0
Update: https://www.gpugrid.net/workunit.php?wuid=11092831 (created 20 Jul 2015 | 18:43:49 UTC)

Managed to complete on the first attempt, with the same settings and drivers as before. Temperature limit at 73°C. This should be from the same batch, I think, but no error, even at a slightly higher clock and despite being suspended a few times.

Joined: 25 Sep 13 | Posts: 293 | Credit: 1,897,601,978 | RAC: 0
> Managed to complete on the first attempt, with the same settings and drivers as before. Temperature limit at 73°C.

(User hardware error) unstable simulations (the -97 message) on GERALD WUs have been an issue for me in a hot, soaking-humid environment over the last fortnight. I have 7 more days of forecast 95°F heat and 75°F+ dewpoint (humidity) to contend with, with the GPU at ~50°C. Yesterday a WU flipped 30k sec in at 1503 MHz; the WU before failed two hours in. With a -1 MHz core offset, a following WU completed without error; then, after 7 hours, a WU failed just now. Offset -1 again. I will try one more long WU and will switch to short NOELIAs if another GERALD fails. Are there any other GM204 owners running at 1.5 GHz in hot/humid conditions? DMM reading = 1.212 V. The dewpoint is currently 78°F, tropical-rainforest humidity levels; even if a sea breeze comes through, the air is so saturated it makes no difference.

> This should be from the same batch, I think, but no error, even at a slightly higher clock and despite being suspended a few times.

GERALD_FXCXCL12_LIG tolerates a bin or two (13/26 MHz) less than NOELIA and GIANNI on my 970. Each GPU is independent of the next. A streak of 100 straight valid WUs at under 80°F ambient can become a failed WU every 5 with the same overclock at 90°F+ ambient. NOELIA_467x shorts and ETQ have yet to fail in similar conditions on my GPU(s). Expect an unstable sim (-97 error) or a CUDA error with overclocking in hot ambient and/or very humid (dewpoint of 70°F or more) conditions, even with core temperature readings below 50°C. The ACEMD app is extremely demanding even at 70% core usage (the WDDM bottleneck). Crunching with out-of-the-box clocks or the GPU's reference boost offers a lower chance of the CUDA errors and unstable sims that end in a -97 message. When the ACEMD app is doing its job, overclocked water- or air-cooled systems without summer air conditioning are more prone to errors; hot and/or humid environments are a nemesis to ACEMD stability when the GPU is even mildly overclocked.

Retvari Zoltan | Joined: 20 Jan 09 | Posts: 2380 | Credit: 16,897,957,044 | RAC: 0
My host saved a workunit again: e1s17_8-GERARD_FXCXCL12_LIG_6121521-0-1-RND4507. It was the last (7th) attempt to crunch it.

To avoid errors, please update your NVidia drivers to the latest one (v353.62):
http://www.geforce.com
http://www.nvidia.com

Could the staff please check whether there is a correlation between the failing tasks and the assigned application version? I think the CUDA 6.0 application has been more prone to errors lately.
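A small aside on checking the driver before updating: the installed version can be read from nvidia-smi, which ships with the driver. A minimal Python sketch; the 353.62 figure is simply the version quoted in the post above and is specific to mid-2015 drivers.

```python
import subprocess

# Sketch: print the installed NVIDIA driver version as reported by nvidia-smi.
SUGGESTED = "353.62"  # the version suggested in the post above

def installed_driver_version() -> str:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    # One line per GPU; all GPUs share the same driver, so take the first.
    return out.strip().splitlines()[0]

if __name__ == "__main__":
    version = installed_driver_version()
    print(f"Installed driver: {version} (thread suggests {SUGGESTED} or newer)")
```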