
Message boards : Graphics cards (GPUs) : Errors

Author Message
PhilTheNet
Joined: 24 Sep 14
Posts: 1
Credit: 55,372,792
RAC: 0
Message 41538 - Posted: 21 Jul 2015 | 5:50:00 UTC

Hello,
all my WUs errored this morning (the project had been running smoothly on this computer for 6 months):

Stderr output
<core_client_version>7.4.42</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -44 (0xffffffd4)
</message>
]]>

Does anyone else have the same problem?

Phil
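[Editorial note] The -44, the hex 0xffffffd4, and the "code 212 (0xd4)" that appears later in this thread are the same exit status viewed three ways: as a signed integer, as its unsigned 32-bit two's-complement form, and as the low byte alone. A minimal sketch (the helper name is made up for illustration):

```python
# One exit status, three renderings seen in BOINC logs:
# signed (-44), unsigned 32-bit hex (0xffffffd4),
# and the low byte as an unsigned value (212 = 0xd4).
def exit_code_views(code: int):
    unsigned32 = code & 0xFFFFFFFF  # two's-complement 32-bit view
    low_byte = code & 0xFF          # what wait()-style status reporting keeps
    return hex(unsigned32), low_byte

print(exit_code_views(-44))  # ('0xffffffd4', 212)
```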

dskagcommunity
Joined: 28 Apr 11
Posts: 456
Credit: 815,576,358
RAC: 0
Message 41540 - Posted: 21 Jul 2015 | 11:18:10 UTC
Last modified: 21 Jul 2015 | 11:25:39 UTC

Here too..

<core_client_version>7.4.36</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -44 (0xffffffd4)
</message>
]]>

I thought this was because we had a power outage and one machine recovered with the wrong system time (+8 days O.o). I corrected that and then ran into these errors, on both graphics cards in my only long-running 24/7 crunching machine. Voltage and clocks are the same as before. I even changed them to a more conservative setting (I was already running safe clocks and voltages), but it didn't help.

I still think it is a fault on my side on this machine, but I don't know where. Einstein runs fine, but that doesn't prove anything. On the other hand, my wingman errored out too, so perhaps it's also an erroneous batch on its way. I will try again in one week.
____________
DSKAG Austria Research Team: http://www.research.dskag.at



Crunching for my deceased Dog who had "good" Braincancer..

Gerard
Volunteer moderator
Project developer
Project scientist
Joined: 26 Mar 14
Posts: 101
Credit: 0
RAC: 0
Message 41541 - Posted: 21 Jul 2015 | 12:51:23 UTC
Last modified: 21 Jul 2015 | 12:51:45 UTC

Can you report which WUs you had the problems with? Yesterday I released a whole new batch of simulations and, although they should be fine, a possibility of corruption exists. Thanks a lot, guys!

lukeu
Joined: 14 Oct 11
Posts: 26
Credit: 61,215,084
RAC: 20,008
Message 41545 - Posted: 22 Jul 2015 | 3:58:07 UTC - in response to Message 41541.

All my WUs since yesterday have failed (both the long and short queues):

http://www.gpugrid.net/results.php?userid=81842

dskagcommunity
Joined: 28 Apr 11
Posts: 456
Credit: 815,576,358
RAC: 0
Message 41547 - Posted: 22 Jul 2015 | 10:06:10 UTC

https://www.gpugrid.net/workunit.php?wuid=11087902 as example

hawker
Joined: 28 Jun 10
Posts: 1
Credit: 31,454,680
RAC: 0
Message 41548 - Posted: 22 Jul 2015 | 15:07:55 UTC

Errors. All.

https://www.gpugrid.net/results.php?userid=62470&offset=0&show_names=0&state=5&appid=

Jim1348
Joined: 28 Jul 12
Posts: 696
Credit: 1,371,999,968
RAC: 30
Message 41549 - Posted: 22 Jul 2015 | 15:39:46 UTC
Last modified: 22 Jul 2015 | 15:58:49 UTC

They seem to fall into two groups:

- The drivers are too old (e.g., 335.28), which gives the exit code -44 error.
- The cards are giving an "unstable" error, probably just overclocking/overheating.

Those problems should be easy to fix, though I don't know whether XP has recent enough drivers to work.
https://www.gpugrid.net/results.php?hostid=223541&offset=0&show_names=1&state=0&appid=
https://www.gpugrid.net/results.php?hostid=194224&offset=0&show_names=1&state=0&appid=

TheFiend
Joined: 26 Aug 11
Posts: 99
Credit: 2,229,642,229
RAC: 477,699
Message 41550 - Posted: 22 Jul 2015 | 18:17:21 UTC

After a load of "error units" on my main system, I updated the drivers and started crunching again OK.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2058
Credit: 15,019,398,669
RAC: 5,333,772
Message 41552 - Posted: 22 Jul 2015 | 22:46:24 UTC

I've received a couple of reissued tasks which had previously failed on other hosts with "(unknown error) - exit code -44 (0xffffffd4)".
All of them completed successfully on my host.
For example:
http://www.gpugrid.net/workunit.php?wuid=11081985
http://www.gpugrid.net/workunit.php?wuid=11087547
http://www.gpugrid.net/workunit.php?wuid=11088391
http://www.gpugrid.net/workunit.php?wuid=11088528

dskagcommunity
Joined: 28 Apr 11
Posts: 456
Credit: 815,576,358
RAC: 0
Message 41553 - Posted: 23 Jul 2015 | 9:20:29 UTC - in response to Message 41549.
Last modified: 23 Jul 2015 | 9:37:20 UTC

They seem to fall into two groups:
The drivers are too old (e.g., 335.28) which gives the exit code -44 error


Is it still recommended to use driver 334.xx or newer, or is that info outdated? 335.28 was a very stable driver version.

But OK, the info that I'm not the only one was important to me too; it is not because of our power outage (it just unluckily started on the same day), since the other machine is failing on shorts too. I'm retiring from GPUGrid for the moment and switching fully to Einstein, because it seems the 570/580 are getting too old to finish all the long units within 24h; they miss by a few minutes, including download/upload over HSDPA, which is not always a stable line (outdoor modem crashing, router crashing, bad USB wire connection, etc.). The extremely long units needed more than 2 days, and that long duration sometimes hung something up, which is bad on unattended machines. "Extra Long Queue" +1. I think I'll come back at the end of November with a single 970 or 980 at home over the winter and attack the 1B mark again ^^

Bye-bye, worldwide GPUGrid place 43 ^^ In Austria I still have double the points of the nearly inactive place 2, so that should be a secure enough place #1 until the end of November. :)

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2058
Credit: 15,019,398,669
RAC: 5,333,772
Message 41554 - Posted: 23 Jul 2015 | 12:23:28 UTC - in response to Message 41553.
Last modified: 23 Jul 2015 | 12:27:23 UTC

They seem to fall into two groups:
The drivers are too old (e.g., 335.28) which gives the exit code -44 error

Is it still recommended to use driver 334.xx or newer, or is that info outdated? 335.28 was a very stable driver version.

These drivers are a bit old, as your hosts are using the CUDA 6.0 client; however, they should work fine.
In my experience the latest driver (353.30) is stable, though I don't have GTX 5xx cards.

But OK, the info that I'm not the only one was important to me too; it is not because of our power outage (it just unluckily started on the same day), the other machine failing on shorts too...

I've looked into the stderr output of your tasks, and I came to the conclusion that the tasks on host 150780 are failing because its GPU can't take such high clock frequencies. (You have probably reduced the memory clock already, but this, or the GPU clock, still has to be reduced further.)
GPU: GeForce GTX 570, device clock: 1500MHz (default: 1464MHz), memory clock: 1700MHz (default: 1900MHz)

Task 14375512:
# Simulation unstable. Flag 9 value 129
# Simulation unstable. Flag 10 value 129
# The simulation has become unstable. Terminating to avoid lock-up
# The simulation has become unstable. Terminating to avoid lock-up (2)
# Attempting restart (step 1875000)
...
# Simulation unstable. Flag 10 value 129
# The simulation has become unstable. Terminating to avoid lock-up
# The simulation has become unstable. Terminating to avoid lock-up (2)
# Attempting restart (step 1885000)
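[Editorial note] For anyone triaging many of these stderr dumps, the restart points that acemd logs can be pulled out mechanically. A minimal sketch, using an abbreviated copy of the excerpt above; the regex is an assumption about the log format, not project tooling:

```python
import re

# Abbreviated acemd stderr excerpt (as logged in the task above)
log = ("# Simulation unstable. Flag 9 value 129 "
       "# The simulation has become unstable. Terminating to avoid lock-up "
       "# Attempting restart (step 1875000) "
       "# Attempting restart (step 1885000)")

# Each automatic recovery logs the step it restarted from
steps = [int(s) for s in re.findall(r"Attempting restart \(step (\d+)\)", log)]
print(steps)  # [1875000, 1885000]
```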

Task 14373443:
ERROR: file force.cpp line 513: TCL evaluation of [calcforces]
17:24:33 (3980): called boinc_finish

On your other host (117426), the GTX 580 and the GTX 560 Ti are definitely overheating (sometimes reaching 90°C), so it is a miracle that the tasks on these cards don't have "simulation became unstable" messages.
Task 14367948:
<core_client_version>7.4.36</core_client_version>
<![CDATA[
<stderr_txt>
# GPU [GeForce GTX 580] Platform [Windows] Rev [3301M] VERSION [60]
# SWAN Device 0 :
# Name : GeForce GTX 580
# ECC : Disabled
# Global mem : 3071MB
# Capability : 2.0
# PCI ID : 0000:01:00.0
# Device clock : 1520MHz
# Memory clock : 1700MHz
# Memory width : 384bit
# Driver version : r334_00 : 33528
# GPU 0 : 69C
# GPU 1 : 89C
# GPU 0 : 71C
# GPU 1 : 90C
# GPU 0 : 73C
# GPU 0 : 74C
# GPU 0 : 75C
# GPU 0 : 76C
# GPU 0 : 77C
# GPU 0 : 78C
# GPU 0 : 79C
# GPU 0 : 80C
# GPU 0 : 81C
# GPU 0 : 82C
# Time per step (avg over 3125000 steps): 8.338 ms
# Approximate elapsed time for entire WU: 26054.703 s
12:38:20 (4052): called boinc_finish
</stderr_txt>
]]>
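[Editorial note] The temperature samples in a dump like this can be summarized per GPU with a few lines of code. A minimal sketch using an abbreviated sample; the parsing pattern is an assumption from the log lines above, not project tooling:

```python
import re

# Abbreviated stderr sample; acemd logs temperature lines like these
stderr = """\
# GPU 0 : 69C
# GPU 1 : 89C
# GPU 0 : 71C
# GPU 1 : 90C
# GPU 0 : 73C
"""

# Collect the per-GPU temperature samples, then report each GPU's peak
temps = {}
for gpu, t in re.findall(r"# GPU (\d+) : (\d+)C", stderr):
    temps.setdefault(int(gpu), []).append(int(t))

peaks = {gpu: max(ts) for gpu, ts in temps.items()}
print(peaks)  # {0: 73, 1: 90}
```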


... and I'm retiring from GPUGrid for the moment and switching fully to Einstein because it seems the 570/580 are getting too old for all the long units within 24h...

You are right about the GTX 5xx series getting old, as two newer GPU generations have been developed in the meantime. However, they should still work here, and since Einstein@home works on them, this suggests that the power outage corrupted some files of the GPUGrid project or the driver on your host. You can eliminate these factors by resetting (or removing and re-attaching) the GPUGrid project on your host and reinstalling/upgrading your drivers.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2058
Credit: 15,019,398,669
RAC: 5,333,772
Message 41555 - Posted: 23 Jul 2015 | 12:43:28 UTC

My host finished a previously failed task again.
It had failed three times on other hosts before:
Host 204947: Task 14385448 "# The simulation has become unstable. Terminating to avoid lock-up (1)" Tasklist (all failed)
Host 194523: Task 14399453 "(unknown error) - exit code -44 (0xffffffd4)" Tasklist (all failed)
Host 163989: Task 14399778 "process exited with code 212 (0xd4, -44)" Tasklist (all failed)

bormolino
Joined: 16 May 13
Posts: 28
Credit: 33,322,365
RAC: 0
Message 41558 - Posted: 25 Jul 2015 | 21:49:44 UTC
Last modified: 25 Jul 2015 | 21:50:09 UTC

All work units failed with a computation error...

https://www.gpugrid.net/results.php?hostid=182555

rod4x4
Joined: 4 Aug 14
Posts: 106
Credit: 1,629,251,769
RAC: 657,636
Message 41559 - Posted: 26 Jul 2015 | 2:07:06 UTC

I have had 176 tasks fail between the 22nd and the 24th, and have removed these cards for now.
The failures are on 4 identical cards I have used successfully on GPUgrid for almost a year.
I have 1 remaining card (a different model) that is still able to run tasks.

It appears that something has changed in the last 'batch'.

The four cards are:
https://www.gpugrid.net/results.php?hostid=181299
https://www.gpugrid.net/results.php?hostid=181300
https://www.gpugrid.net/results.php?hostid=180572
https://www.gpugrid.net/results.php?hostid=180015

Gunde
Joined: 6 Jan 15
Posts: 35
Credit: 4,904,547,176
RAC: 1,053
Message 41560 - Posted: 26 Jul 2015 | 19:18:39 UTC
Last modified: 26 Jul 2015 | 19:19:21 UTC

Got some WUs with many attempts. My host can't take these at all; they failed at an early stage.

https://www.gpugrid.net/workunit.php?wuid=11092476
created 20 Jul 2015 | 18:29:55 UTC
Exit status -97 (0xffffffffffffff9f) Unknown error number
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 40000)
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]

11092127
created 20 Jul 2015 | 18:17:40 UTC
Exit status -97 (0xffffffffffffff9f) Unknown error number
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 760000)
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]

11091745
created 20 Jul 2015 | 18:04:57 UTC
Exit status -97 (0xffffffffffffff9f) Unknown error number
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 600000)
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]

11098488
created 23 Jul 2015 | 18:13:11 UTC
Exit status -98 (0xffffffffffffff9e) Unknown error number
ERROR: file force.cpp line 513: TCL evaluation of [calcforces]
16:05:42 (7636): called boinc_finish

11091450
created 20 Jul 2015 | 17:55:40 UTC
Exit status -97 (0xffffffffffffff9f) Unknown error number
# The simulation has become unstable. Terminating to avoid lock-up (1)
# Attempting restart (step 560000)
# GPU [GeForce GTX 970] Platform [Windows] Rev [3212] VERSION [65]

These last ones completed and validated, but some hosts had problems with them:

11091018
created 20 Jul 2015 | 17:42:09 UTC

11090161
created 20 Jul 2015 | 17:15:09 UTC

11090182
created 20 Jul 2015 | 17:15:53 UTC

11089720
created 20 Jul 2015 | 17:00:31 UTC

11089778
created 20 Jul 2015 | 17:02:31 UTC

Gunde
Joined: 6 Jan 15
Posts: 35
Credit: 4,904,547,176
RAC: 1,053
Message 41561 - Posted: 27 Jul 2015 | 2:00:02 UTC
Last modified: 27 Jul 2015 | 2:04:20 UTC

Update:
https://www.gpugrid.net/workunit.php?wuid=11092831
created 20 Jul 2015 | 18:43:49 UTC

Managed to complete the first task, with the same settings and drivers as before. Temp limit at 73°C.
This should be in the same batch, I think, but no errors, even at a slightly higher clock and suspended a few times.

eXaPower
Joined: 25 Sep 13
Posts: 280
Credit: 1,449,568,667
RAC: 108
Message 41581 - Posted: 28 Jul 2015 | 15:56:50 UTC

Managed to complete the first task, with the same settings and drivers as before. Temp limit at 73°C.

(User hardware error) Unstable simulation (-97 message): GERALD has been an issue for me in a hot, soaking-humid environment during the last fortnight. I have 7 more days of forecast 95F heat and 75F+ dewpoint (humidity) to contend with. The GPU runs at ~50C.

Yesterday a WU flipped 30k/sec at 1503MHz. The WU before it failed two hours in. With a -1MHz core offset the following WU completed without error; then, after 7 hours, a WU failed just now. Offset -1 again. I will try one more long WU, and will switch to short NOELIAs if another GERALD fails. Are there any other GM204 owners at 1.5GHz in hot/humid conditions? DMM reading = 1.212V. The dewpoint is currently at 78F, tropical-rainforest humidity levels. Even if a sea breeze happens, the air is so saturated it makes no difference.

This should be in the same batch, I think, but no errors, even at a slightly higher clock and suspended a few times.

GERALD_FXCXCL12_LIG tolerates a bin or two (13/26MHz) less overclock than NOELIA and GIANNI on my 970. Every GPU is independent of the next: a streak of 100 straight valid WUs in sub-80F ambient can become a failed WU every 5 with the same overclock in 90F+ ambient. NOELIA_467x shorts and ETQ have yet to fail in similar conditions on my GPU(s).

Expect an unstable sim (-97 error) or a CUDA error when overclocking in hot ambient and/or very humid (dewpoint 70F+) conditions, even with core temperature readings under 50C. The ACEMD app is extremely demanding even at 70% core usage (the WDDM bottleneck). Crunching at out-of-the-box clocks or the GPU's reference boost gives a lower chance of the CUDA errors and unstable sims that end in a -97 message. When the ACEMD app is doing its job, overclocked watercooled or air-cooled systems without summer air conditioning are more prone to errors. Hot and/or humid environments are a nemesis to ACEMD stability when the GPU is even mildly overclocked.

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2058
Credit: 15,019,398,669
RAC: 5,333,772
Message 41614 - Posted: 3 Aug 2015 | 9:13:31 UTC

My host saved a workunit again:
e1s17_8-GERARD_FXCXCL12_LIG_6121521-0-1-RND4507
It was the last (7th) attempt to crunch it.

To avoid errors, please update your NVidia drivers to the latest one (v353.62).

http://www.geforce.com
http://www.nvidia.com

Could the staff please check whether there's a correlation between the failing tasks and the assigned application version? I think the CUDA 6.0 application has been more prone to errors lately.
