All Tasks Failed

Author	Message
Paul Raney	Message 24577 - Posted: 24 Apr 2012, 14:03:10 UTC
On one of my computers, every task has started to fail. I just restarted the system. Is there any way to get a new task now? Any ideas on what happened? This system has been running fine for weeks.
http://www.gpugrid.net/show_host_detail.php?hostid=117970
Thank you

nenym	Message 24578 - Posted: 24 Apr 2012, 16:39:49 UTC - in response to Message 24577.
Quote: "is there any way to get a new task now?"
I know a somewhat odd way, which affects the statistics of your host:
- detach from GPUGRID
- change the hostname
- reboot
- attach to GPUGRID
As far as I remember, on other attached projects only the hostname changes. Expect possible problems if the host is connected to a LAN.

Paul Raney	Message 24581 - Posted: 25 Apr 2012, 0:54:51 UTC - in response to Message 24578.
Thanks for the hack. I want to keep my stats, so I will just let the machine idle for a day or so. Did anyone else have this issue?

Paul Raney	Message 24583 - Posted: 25 Apr 2012, 13:55:08 UTC - in response to Message 24581.
It looks like tasks continue to fail. Does anyone have any ideas of what might be wrong with this host? Thx

Stoneageman	Message 24585 - Posted: 25 Apr 2012, 16:32:09 UTC
The original clock rate was 1.88 GHz. Now it's 1.46 GHz and still failing. Is this the same card? Try underclocking the memory by 20%. Does it run other projects OK?

Paul Raney	Message 24592 - Posted: 26 Apr 2012, 12:17:04 UTC - in response to Message 24585.
It is the same card in the same computer. I lowered the clock rate to see if that would correct the condition. That computer is down now. It should be running again this weekend. We will see if power was an issue. Thx

Paul Raney	Message 24595 - Posted: 26 Apr 2012, 13:55:43 UTC - in response to Message 24592.
Same problem on a different host. Could the 275.33 drivers be the issue? I have a different host with 285 drivers and it appears to be working fine. Any help is appreciated.
http://www.gpugrid.net/show_host_detail.php?hostid=119703

Paul Raney	Message 24606 - Posted: 28 Apr 2012, 3:20:40 UTC - in response to Message 24595.
The hosts appear to be working correctly again. Were the work units bad?

RichF	Message 24743 - Posted: 5 May 2012, 15:04:40 UTC - in response to Message 24606.
All my WUs have been failing for the past couple of days, too. Is this a widespread problem, and how can we fix it? Thanks.

Old man	Message 24746 - Posted: 5 May 2012, 16:05:16 UTC - Last modified: 5 May 2012, 16:05:52 UTC
Tasks have failed here as well.

Name: 9px10-MJHARVEY_MJHXA1-8-30-RND0616_5
Workunit: 3395291
Created: 5 May 2012 \| 11:53:47 UTC
Sent: 5 May 2012 \| 15:23:02 UTC
Received: 5 May 2012 \| 15:26:05 UTC
Server state: Completed
Outcome: Computation error
Client state: Computation error
Exit status: 98 (0x62)
Computer ID: 123486
Report deadline: 10 May 2012 \| 15:23:02 UTC
Run time: 2.70
CPU time: 0.80
Validate state: Not validated
Credit: 0.00
Application version: ACEMD2: GPU molecular dynamics v6.16 (cuda31)

Stderr output
<core_client_version>6.12.34</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using device 0
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 470"
# Clock rate: 1.21 GHz
# Total amount of global memory: 1275658240 bytes
# Number of multiprocessors: 14
# Number of cores: 112
# Device 1: "GeForce GTX 260"
# Clock rate: 1.30 GHz
# Total amount of global memory: 891748352 bytes
# Number of multiprocessors: 27
# Number of cores: 216
MDIO: read error for file "input.coor", byte number 4: number of atoms (-45219840) != (47792) expected
ERROR: Unable to read bincoordfile
called boinc_finish
</stderr_txt>
]]>

Workunit details:
name: 9px10-MJHARVEY_MJHXA1-8-30-RND0616
application: ACEMD2: GPU molecular dynamics
created: 4 May 2012 \| 14:27:28 UTC
minimum quorum: 1
initial replication: 1
max # of error/total/success tasks: 7, 10, 6

| Task | Computer | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Credit | Application |
|---|---|---|---|---|---|---|---|---|
| 5326942 | 124335 | 4 May 2012 \| 17:49:33 UTC | 4 May 2012 \| 17:54:12 UTC | Error while downloading | 0.00 | 0.00 | --- | ACEMD2: GPU molecular dynamics v6.16 (cuda31) |
| 5327658 | 112695 | 4 May 2012 \| 20:20:30 UTC | 4 May 2012 \| 21:24:15 UTC | Error while computing | 2.07 | 0.41 | --- | ACEMD2: GPU molecular dynamics v6.16 (cuda31) |
| 5328368 | 124628 | 5 May 2012 \| 2:08:11 UTC | 5 May 2012 \| 2:14:55 UTC | Error while computing | 7.75 | 0.00 | --- | ACEMD2: GPU molecular dynamics v6.16 (cuda31) |
| 5329342 | 105945 | 5 May 2012 \| 5:41:17 UTC | 5 May 2012 \| 5:48:23 UTC | Error while computing | 3.67 | 0.81 | --- | ACEMD2: GPU molecular dynamics v6.16 (cuda31) |
| 5329857 | 102639 | 5 May 2012 \| 11:26:08 UTC | 5 May 2012 \| 11:53:44 UTC | Error while computing | 2.15 | 0.53 | --- | ACEMD2: GPU molecular dynamics v6.16 (cuda31) |
| 5330904 | 123486 | 5 May 2012 \| 15:23:02 UTC | 5 May 2012 \| 15:26:05 UTC | Error while computing | 2.70 | 0.80 | --- | ACEMD2: GPU molecular dynamics v6.16 (cuda31) |
| 5331534 | --- | --- | --- | Unsent | --- | --- | --- | |

As you can see, all the other hosts have also failed to run this task :-(
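Editor's sketch: the `input.coor` read error quoted in this post reports an atom count that does not match the expected one. One way to sanity-check such a file by hand could look like the Python below. It assumes a NAMD-style binary coordinate layout (a 4-byte little-endian integer atom count followed by three 8-byte doubles per atom); that layout is an assumption and is not confirmed for ACEMD2 anywhere in this thread, and `check_coor_header` is a hypothetical helper name.

```python
import os
import struct


def check_coor_header(path, expected_atoms):
    """Compare the atom count stored in a binary coordinate file with the expected one.

    Assumed layout (NAMD-style, NOT confirmed for ACEMD2): a 4-byte
    little-endian signed integer N, followed by 3 * N 8-byte doubles.
    Returns a tuple (ok, atom_count_read).
    """
    with open(path, "rb") as f:
        header = f.read(4)
    if len(header) < 4:
        return False, None  # too short to hold a header at all
    (count,) = struct.unpack("<i", header)
    if count < 0:
        return False, count  # a negative count can never be valid
    # Under the assumed layout the file size is fully determined by N.
    size_ok = os.path.getsize(path) == 4 + 24 * count
    return (count == expected_atoms and size_ok), count
```

A call such as `check_coor_header("input.coor", 47792)` would return `(False, <count>)` for a file whose header is damaged in the way the error message describes, which points at a bad input file rather than at the GPU.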

RichF	Message 24747 - Posted: 5 May 2012, 17:19:49 UTC - in response to Message 24743.
Here is the error message I've been getting. Any help would be appreciated.

Stderr output
<core_client_version>6.12.34</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
# Using device 1
SWAN: FATAL : Unable to enumerate devices
Assertion failed: 0, file swanlib_nv.c, line 390
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
</stderr_txt>
]]>

Mika_at_home	Message 24750 - Posted: 5 May 2012, 22:12:28 UTC
I have also had failed workunits this week. Of the last five, three have failed. The first failed on Wednesday, the next on Friday, and the latest tonight. For all of them there are messages like these in the BOINC log:

5.5.2012 23:56:56 GPUGRID Computation for task 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1 finished
5.5.2012 23:56:56 GPUGRID Output file 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1_1 for task 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1 absent
5.5.2012 23:56:56 GPUGRID Output file 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1_2 for task 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1 absent
5.5.2012 23:56:56 GPUGRID Output file 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1_3 for task 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1 absent

The following ACEMD2 workunit failed on Friday: 1x21-MJHARVEY_MJHXA1-8-30-RND8065_0

Stderr output
<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
- exit code -99 (0xffffff9d)
</message>
<stderr_txt>
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 560 Ti"
# Clock rate: 1.46 GHz
# Total amount of global memory: 1341849600 bytes
# Number of multiprocessors: 14
# Number of cores: 112
MDIO: cannot open file "restart.coor"
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 560 Ti"
# Clock rate: 1.46 GHz
# Total amount of global memory: 1341849600 bytes
# Number of multiprocessors: 14
# Number of cores: 112
# Using device 0

I also run Einstein@home, with about 2 million credit points, and their WUs have never failed. My graphics card is a Gigabyte GTX 560 Ti 448, which runs at the reference clock speed of 723 MHz; temperatures are between 70 and 75 C. I have lowered the fan speed with MSI Afterburner. It has been running GPUGRID workunits for about a week now. So should I suspect my computer as the cause of these failures?
Thank you
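Editor's sketch: the "Output file ... absent" lines quoted in this post can be collected per task with a few lines of Python, which makes it easier to see which tasks ended without producing results. The regular expression is based only on the log lines shown here; the function name `absent_outputs` is hypothetical, and other BOINC client versions may word the message differently.

```python
import re
from collections import defaultdict

# Pattern taken from the log lines quoted in the post above; other client
# versions may phrase this message differently (assumption).
ABSENT_RE = re.compile(r"Output file (\S+) for task (\S+) absent")


def absent_outputs(log_lines):
    """Map each task name to the list of output files the client reported as absent."""
    missing = defaultdict(list)
    for line in log_lines:
        match = ABSENT_RE.search(line)
        if match:
            output_file, task = match.groups()
            missing[task].append(output_file)
    return dict(missing)
```

Feeding it the lines of a saved copy of the client's message log (for example via `open(path).read().splitlines()`) gives a dictionary keyed by task name.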

NeoMetal*	Message 24769 - Posted: 7 May 2012, 0:42:13 UTC
I got a failed WU because of:
MDIO: cannot open file "output.restart.coor"
This is the first time I've ever seen that. The WU completed fine but errored when it tried to upload. No antivirus or backup is running; this is just a basic Win 7 install for crunching. This sucks: 21 hours wasted on a most likely valid WU because of a locked or disappearing file.
I see Mika_at_home has a similar error in his post above:
MDIO: cannot open file "restart.coor"
Is this happening to anyone else? There seems to be a rash of errors recently. Could this be something that needs fixing?

Stderr output
<core_client_version>7.0.25</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 560 Ti"
# Clock rate: 1.90 GHz
# Total amount of global memory: 1073741824 bytes
# Number of multiprocessors: 8
# Number of cores: 64
MDIO: cannot open file "output.restart.coor"
ERROR: get_Dvec() element 0 (b)
called boinc_finish
</stderr_txt>
]]>
NM*

Retvari Zoltan	Message 24770 - Posted: 7 May 2012, 1:04:59 UTC
The MDIO: cannot open file "output.restart.coor" message is not a real error; it appears in every task, even in the successful ones. Your real error message is ERROR: get_Dvec() element 0 (b), and I think that such an error cannot be caused by the upload, nor by "a locked or disappearing file". This error happened while the WU was being processed, probably near its completion, which is why it seems to have been caused by the upload.
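Editor's sketch: following the point made in this post, a small Python filter can skip the `output.restart.coor` message that is described as appearing in every task and report only the lines that look like genuine failures. The list of benign messages contains just the one message named in this post; treating `ERROR:`, `FATAL` and other `MDIO:` lines as real errors is an assumption, and `real_errors` is a hypothetical name.

```python
# Only the message explicitly described as harmless in the post above.
BENIGN_MESSAGES = ('MDIO: cannot open file "output.restart.coor"',)


def real_errors(stderr_text):
    """Return stderr lines that look like genuine errors, skipping known-benign ones.

    Assumes one message per line, as in the raw stderr output of a task.
    """
    errors = []
    for raw_line in stderr_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and device-info comments
        if line in BENIGN_MESSAGES:
            continue  # reported for successful tasks too
        if line.startswith(("ERROR:", "MDIO:")) or "FATAL" in line:
            errors.append(line)
    return errors
```

Applied to the stderr output quoted by NeoMetal*, this leaves only the `ERROR: get_Dvec() element 0 (b)` line.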

Paul Raney	Message 24771 - Posted: 7 May 2012, 2:49:32 UTC - in response to Message 24770.
Several of my work units failed very near the end of the calculation process. Any ideas on why? The clock rate has been reduced to see if that will correct the issue. Thank you

skgiven (Volunteer moderator, Volunteer tester)	Message 24781 - Posted: 7 May 2012, 14:13:47 UTC - in response to Message 24750.
Mika_at_home, it seems that your tasks are getting suspended and resumed many times. I think there is more chance of failures when running this way. I suggest you configure BOINC Manager to allow GPU tasks to run while the system is in use.
All of the MJHARVEY tasks that failed on your system also failed on at least one other system, and some failed repeatedly on many systems, suggesting an issue with the tasks:
errors: Too many errors (may have bug)
Sometimes these issues are very difficult to track down, as they only rarely appear on some combinations of operating system, driver and GPU. In the above 'Too many errors' case, however, the problem seems independent of GPU, driver and operating system, and my guess is that it was a badly built task:
MDIO: read error for file "input.coor", byte number 4: number of atoms (-45219840) != (47525) expected
ERROR: Unable to read bincoordfile
I would be more concerned about the tasks that fail after 10K sec than about those that fail after 2 sec.
Paul Raney, as different task types are failing on your system, it's more likely that the issue is a setup one (GPU clock, overuse of the CPU, interference from another program...). 'Energies have become nan' is often symptomatic of a GPU issue with the clock, voltage or temperatures (but may also be linked to overuse of the CPU).
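Editor's sketch: the reasoning in this post (a task that fails on several unrelated hosts is more likely a bad task than a bad host) can be written down as a simple heuristic. The rows below are the computer IDs and statuses from the workunit table quoted earlier in the thread, with the status text translated to English; the threshold of three hosts and both function names are hypothetical choices for illustration, not a project rule.

```python
def failing_hosts(results):
    """Return the set of host IDs that reported any kind of error for a workunit.

    `results` is an iterable of (host_id, status) pairs.
    """
    return {host for host, status in results if status.startswith("Error")}


def looks_like_bad_workunit(results, min_hosts=3):
    """Heuristic: errors on at least `min_hosts` distinct hosts suggest a bad task.

    The default threshold is an arbitrary illustrative value (assumption).
    """
    return len(failing_hosts(results)) >= min_hosts


# Rows from the workunit 9px10-MJHARVEY_MJHXA1-8-30-RND0616 table in this thread.
wu_results = [
    (124335, "Error while downloading"),
    (112695, "Error while computing"),
    (124628, "Error while computing"),
    (105945, "Error while computing"),
    (102639, "Error while computing"),
    (123486, "Error while computing"),
]
```

With these rows, `looks_like_bad_workunit(wu_results)` is true because six different hosts reported errors, whereas a single failing host would not trip the heuristic and would point back at that host's setup.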

wiyosaya	Message 24784 - Posted: 7 May 2012, 16:51:42 UTC
Running on Windows, one thing that I have done is to turn off the BOINC screen saver. After doing so, I have rarely had any GPUGrid WUs report a computation error. On most PCs these days with LCD monitors, screen savers are only eye candy, as LCD monitors do not suffer from burn-in as tube-based monitors did.
To elaborate a bit further, I set my screen saver to NONE on the two machines where I currently run GPUGrid. I am bringing a third machine online in the next week or so, and I will turn off the screen saver on that one, too.

Mika_at_home	Message 24798 - Posted: 8 May 2012, 11:50:51 UTC - in response to Message 24781.
skgiven, thanks for your analysis and advice. I have now completed one ACEMD2 workunit with the GPU task always running. It didn't cause any problems, at least with web browsing and e-mail use. I also changed my screensaver to a simpler standard Windows screensaver. Now I will get my Einstein GPU WUs completed quicker, too. :)
-Mika

lohphat	Message 24859 - Posted: 10 May 2012, 4:57:56 UTC
All my GPUGRID WUs are failing. I even replaced my 9800 GTX with a GTX 680 and the WUs fail in less than 5 seconds. I suspect the Nvidia driver. 301.10 is the only driver for the GTX 680.

5pot	Message 24860 - Posted: 10 May 2012, 5:25:24 UTC
GTX 680 support for Windows has not been released yet. They are currently working on it. The Linux version was just released for beta; when the Windows version is released, it will be in beta as well.