All Tasks Failed
| Author | Message |
|---|---|
|
Joined: 26 Dec 10 Posts: 115 Credit: 416,576,946 RAC: 0
|
On one of my computers, every task started to fail. I just restarted the system. Is there any way to get a new task now? Any ideas on what happened? This system has been running fine for weeks. http://www.gpugrid.net/show_host_detail.php?hostid=117970 Thank you. |
nenym Joined: 31 Mar 09 Posts: 137 Credit: 1,429,587,071 RAC: 0
|
"is there any way to get a new task now?" I know a somewhat strange way, which affects the statistics of your host: detach from GPUGRID, change the hostname, reboot, then attach to GPUGRID again. Other attached projects will only see the hostname change (as far as I remember). You may run into problems if the host is connected to a LAN. |
|
Joined: 26 Dec 10 Posts: 115 Credit: 416,576,946 RAC: 0
|
Thanks for the hack. I want to keep my stats so I will just let the machine idle for a day or so. Did anyone else have this issue? |
|
Joined: 26 Dec 10 Posts: 115 Credit: 416,576,946 RAC: 0
|
It looks like tasks continue to fail. Does anyone have any ideas of what might be wrong with this host? Thx |
Stoneageman Joined: 25 May 09 Posts: 224 Credit: 34,057,374,498 RAC: 0
|
The original clock rate was 1.88 GHz. Now it's 1.46 GHz and still failing. Is this the same card? Try underclocking the memory by 20%. Does it run other projects OK? |
|
Joined: 26 Dec 10 Posts: 115 Credit: 416,576,946 RAC: 0
|
It is the same card in the same computer. I lowered the clock rate to see if that would correct the condition. That computer is down now. It should be running again this weekend. We will see if power was an issue. thx |
|
Joined: 26 Dec 10 Posts: 115 Credit: 416,576,946 RAC: 0
|
Same problem on a different host. Could the 275.33 drivers be the issue? I have a different host with 285 drivers and it appears to be working fine. Any help is appreciated. http://www.gpugrid.net/show_host_detail.php?hostid=119703 |
|
Joined: 26 Dec 10 Posts: 115 Credit: 416,576,946 RAC: 0
|
The hosts appear to be working correctly again. Were the work units bad? |
|
Joined: 6 Jan 09 Posts: 7 Credit: 5,741,255 RAC: 0
|
All my WUs have been failing for the past couple of days, too. Is this a widespread problem, and how can we fix it? Thanks. |
|
Joined: 24 Jan 09 Posts: 42 Credit: 16,676,387 RAC: 0
|
Tasks failed here as well.

Name: 9px10-MJHARVEY_MJHXA1-8-30-RND0616_5
Workunit: 3395291
Created: 5 May 2012 | 11:53:47 UTC
Sent: 5 May 2012 | 15:23:02 UTC
Received: 5 May 2012 | 15:26:05 UTC
Server state: Over
Outcome: Computation error
Client state: Computation error
Exit status: 98 (0x62)
Computer ID: 123486
Report deadline: 10 May 2012 | 15:23:02 UTC
Run time: 2.70
CPU time: 0.80
Validate state: Invalid
Credit: 0.00
Application version: ACEMD2: GPU molecular dynamics v6.16 (cuda31)

    Stderr output
    <core_client_version>6.12.34</core_client_version>
    <![CDATA[
    <message>
     - exit code 98 (0x62)
    </message>
    <stderr_txt>
    # Using device 0
    # There are 2 devices supporting CUDA
    # Device 0: "GeForce GTX 470"
    # Clock rate: 1.21 GHz
    # Total amount of global memory: 1275658240 bytes
    # Number of multiprocessors: 14
    # Number of cores: 112
    # Device 1: "GeForce GTX 260"
    # Clock rate: 1.30 GHz
    # Total amount of global memory: 891748352 bytes
    # Number of multiprocessors: 27
    # Number of cores: 216
    MDIO: read error for file "input.coor", byte number 4: number of atoms (-45219840) != (47792) expected
    ERROR: Unable to read bincoordfile
    called boinc_finish
    </stderr_txt>
    ]]>

Workunit details:
Name: 9px10-MJHARVEY_MJHXA1-8-30-RND0616
Application: ACEMD2: GPU molecular dynamics
Created: 4 May 2012 | 14:27:28 UTC
Minimum quorum: 1
Initial replication: 1
Max # of error/total/success tasks: 7, 10, 6

Tasks (task ID, computer, sent, time reported or deadline, status, run time in seconds, CPU time in seconds, credit, application):
5326942, computer 124335, sent 4 May 2012 | 17:49:33 UTC, reported 4 May 2012 | 17:54:12 UTC, Error while downloading, 0.00, 0.00, ---, ACEMD2: GPU molecular dynamics v6.16 (cuda31)
5327658, computer 112695, sent 4 May 2012 | 20:20:30 UTC, reported 4 May 2012 | 21:24:15 UTC, Error while computing, 2.07, 0.41, ---, ACEMD2: GPU molecular dynamics v6.16 (cuda31)
5328368, computer 124628, sent 5 May 2012 | 2:08:11 UTC, reported 5 May 2012 | 2:14:55 UTC, Error while computing, 7.75, 0.00, ---, ACEMD2: GPU molecular dynamics v6.16 (cuda31)
5329342, computer 105945, sent 5 May 2012 | 5:41:17 UTC, reported 5 May 2012 | 5:48:23 UTC, Error while computing, 3.67, 0.81, ---, ACEMD2: GPU molecular dynamics v6.16 (cuda31)
5329857, computer 102639, sent 5 May 2012 | 11:26:08 UTC, reported 5 May 2012 | 11:53:44 UTC, Error while computing, 2.15, 0.53, ---, ACEMD2: GPU molecular dynamics v6.16 (cuda31)
5330904, computer 123486, sent 5 May 2012 | 15:23:02 UTC, reported 5 May 2012 | 15:26:05 UTC, Error while computing, 2.70, 0.80, ---, ACEMD2: GPU molecular dynamics v6.16 (cuda31)
5331534, Unsent

As you can see, everyone else has also failed to run this task :-( |
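For anyone who wants to do the same triage on their own workunit page, here is a minimal Python sketch. The task and computer IDs are copied from the listing in the post above; the function names and the `min_hosts` threshold are made up for illustration and are not part of BOINC or GPUGRID.

```python
from collections import Counter

# (task id, computer id, status) copied from the workunit listing above.
results = [
    (5326942, 124335, "download error"),
    (5327658, 112695, "compute error"),
    (5328368, 124628, "compute error"),
    (5329342, 105945, "compute error"),
    (5329857, 102639, "compute error"),
    (5330904, 123486, "compute error"),
]


def summarize(results):
    """Count outcomes and collect the distinct computers that reported an error."""
    outcomes = Counter(status for _, _, status in results)
    failed_hosts = {host for _, host, status in results if status.endswith("error")}
    return outcomes, failed_hosts


def looks_like_bad_workunit(results, min_hosts=3):
    """Heuristic only: every result failed, and on several different computers.

    min_hosts is an arbitrary assumption, not a BOINC rule.
    """
    _, failed_hosts = summarize(results)
    all_failed = all(status.endswith("error") for _, _, status in results)
    return all_failed and len(failed_hosts) >= min_hosts


outcomes, failed_hosts = summarize(results)
print(dict(outcomes))
print(len(failed_hosts), "distinct computers failed")
print("suspect workunit:", looks_like_bad_workunit(results))
```

If the same workunit fails on many unrelated computers, the workunit is a more likely culprit than any single host.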
|
Joined: 6 Jan 09 Posts: 7 Credit: 5,741,255 RAC: 0
|
Here is the error message I've been getting. Any help would be appreciated.

    Stderr output
    <core_client_version>6.12.34</core_client_version>
    <![CDATA[
    <message>
    The system cannot find the path specified. (0x3) - exit code 3 (0x3)
    </message>
    <stderr_txt>
    # Using device 1
    SWAN: FATAL : Unable to enumerate devices
    Assertion failed: 0, file swanlib_nv.c, line 390

    This application has requested the Runtime to terminate it in an unusual way.
    Please contact the application's support team for more information.
    </stderr_txt>
    ]]> |
|
Joined: 16 Apr 12 Posts: 2 Credit: 297,794 RAC: 0
|
I have also had failed workunits this week. Three of the last five have failed: the first on Wednesday, the next on Friday, and the latest tonight. For all of them there are messages like these in the BOINC log:

    5.5.2012 23:56:56 GPUGRID Computation for task 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1 finished
    5.5.2012 23:56:56 GPUGRID Output file 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1_1 for task 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1 absent
    5.5.2012 23:56:56 GPUGRID Output file 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1_2 for task 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1 absent
    5.5.2012 23:56:56 GPUGRID Output file 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1_3 for task 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1 absent

The following ACEMD2 workunit failed on Friday: 1x21-MJHARVEY_MJHXA1-8-30-RND8065_0

    Stderr output
    <core_client_version>6.10.58</core_client_version>
    <![CDATA[
    <message>
     - exit code -99 (0xffffff9d)
    </message>
    <stderr_txt>
    # Using device 0
    # There is 1 device supporting CUDA
    # Device 0: "GeForce GTX 560 Ti"
    # Clock rate: 1.46 GHz
    # Total amount of global memory: 1341849600 bytes
    # Number of multiprocessors: 14
    # Number of cores: 112
    MDIO: cannot open file "restart.coor"
    # Using device 0
    # There is 1 device supporting CUDA
    # Device 0: "GeForce GTX 560 Ti"
    # Clock rate: 1.46 GHz
    # Total amount of global memory: 1341849600 bytes
    # Number of multiprocessors: 14
    # Number of cores: 112
    # Using device 0

I also run Einstein@home, with about 2 million credit points, and their WUs have never failed. My graphics card is a Gigabyte GTX 560Ti 448, which runs at the reference clock speed of 723 MHz, and temps are between 70 and 75 C. I have lowered the fan speed with MSI Afterburner. It has been running GPUGRID workunits for about a week now. So should I suspect my computer as the cause of these failures? Thank you |
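A side note on reading the exit codes quoted in this thread: the hex value in parentheses is the same number shown as an unsigned 32-bit value, so `-99` and `0xffffff9d` are one and the same code. A small Python sketch of the conversion (the helper names are made up for illustration):

```python
def exit_code_hex(code, bits=32):
    """Return the unsigned two's-complement hex form of a signed exit code."""
    return hex(code & ((1 << bits) - 1))


def signed_from_hex(text, bits=32):
    """Convert a hex string such as '0xffffff9d' back to a signed integer."""
    value = int(text, 16)
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


print(exit_code_hex(-99))             # 0xffffff9d
print(exit_code_hex(98))              # 0x62
print(signed_from_hex("0xffffff9d"))  # -99
```

This only explains the notation. What a particular code means is up to the application and is not something this sketch can tell you.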
|
Joined: 30 Mar 11 Posts: 1 Credit: 1,005,009 RAC: 0
|
I got a failed WU because of: MDIO: cannot open file "output.restart.coor"

First time I've ever seen that. The WU completed fine but errored when it tried to upload. No antivirus or backup running. Just a basic Win 7 install for crunching. This sucks: 21 hours wasted on a most likely valid WU because of a locked or disappearing file. I see Mika_at_home has a similar error in his post above: MDIO: cannot open file "restart.coor". Is this happening to anyone else? Seems like a rash of errors recently. Could this be something that needs fixing?

    Stderr output
    <core_client_version>7.0.25</core_client_version>
    <![CDATA[
    <message>
     - exit code 98 (0x62)
    </message>
    <stderr_txt>
    # Using device 0
    # There is 1 device supporting CUDA
    # Device 0: "GeForce GTX 560 Ti"
    # Clock rate: 1.90 GHz
    # Total amount of global memory: 1073741824 bytes
    # Number of multiprocessors: 8
    # Number of cores: 64
    MDIO: cannot open file "output.restart.coor"
    ERROR: get_Dvec() element 0 (b)
    called boinc_finish
    </stderr_txt>
    ]]>

NM* |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
The MDIO: cannot open file "output.restart.coor" message is not a real error; it appears in every task, even in the successful ones. Your real error message is ERROR: get_Dvec() element 0 (b), and I think that such an error cannot be caused by the upload, nor by "a locked or disappearing file". This error happened while the WU was being processed, probably near its completion, which is why it looks as if it was caused by the upload. |
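To make that distinction easier when skimming a long stderr dump, here is a rough Python sketch that skips the harmless message and keeps the lines that look like real failures. The pattern lists are assumptions built only from messages quoted in this thread (the benign one is taken from the reply above), so treat them as a starting point and not as a complete list of ACEMD messages.

```python
import re

# Assumption taken from the reply above: this line also appears in successful tasks.
BENIGN = [re.compile(r'^MDIO: cannot open file "output\.restart\.coor"$')]

# Assumption: prefixes of the fatal messages quoted elsewhere in this thread.
FATAL = [
    re.compile(r"^ERROR:"),
    re.compile(r"^SWAN: FATAL"),
    re.compile(r"^MDIO: read error"),
]


def real_errors(stderr_text):
    """Return the stderr lines that look like genuine failures."""
    hits = []
    for raw in stderr_text.splitlines():
        line = raw.strip()
        if any(p.search(line) for p in BENIGN):
            continue  # known-harmless message, skip it
        if any(p.search(line) for p in FATAL):
            hits.append(line)
    return hits


sample = """\
# Using device 0
# Device 0: "GeForce GTX 560 Ti"
MDIO: cannot open file "output.restart.coor"
ERROR: get_Dvec() element 0 (b)
called boinc_finish
"""
print(real_errors(sample))  # ['ERROR: get_Dvec() element 0 (b)']
```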
|
Joined: 26 Dec 10 Posts: 115 Credit: 416,576,946 RAC: 0
|
Several of my work units failed very near the end of the calculation process. Any ideas on why? The clock rate has been reduced to see if that will correct the issue. thank you |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
Mika_at_home, it seems that your tasks are getting suspended and resumed many times. I think there is more chance of failures when running this way. I suggest you configure Boinc Manager to allow GPU tasks to run when the system is in use.

All of the MJHARVEY tasks that failed on your system failed on at least one more system, and some repeatedly failed on many systems, suggesting an issue with the tasks themselves. The workunits show the error: Too many errors (may have bug)

Sometimes these issues are very difficult to track down, as they only rarely appear on some combinations of operating system/driver/GPU, but in the above 'Too many errors' case the problem seems independent of GPU, driver and operating system, and my guess is that it was a badly built task:

MDIO: read error for file "input.coor", byte number 4: number of atoms (-45219840) != (47525) expected
ERROR: Unable to read bincoordfile

I would be more concerned about tasks that fail after 10,000 seconds than about those that fail after 2 seconds.

Paul Raney, as different task types are failing on your system, it's more likely that the issue is a setup one (GPU clock, overuse of CPU, interference from another program...). 'Energies have become nan' is often symptomatic of a GPU issue with the clock, voltage or temps (but may also be linked to overuse of the CPU).

FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help |
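The "number of atoms" message above suggests the coordinate file starts with an atom count that did not match what the application expected. As an illustration only, the Python sketch below assumes a NAMD-style binary coordinate layout (a 4-byte little-endian integer atom count followed by three 8-byte doubles per atom). Whether ACEMD's `input.coor` uses exactly this layout is an assumption, and `check_bincoor` is a made-up helper, not part of any GPUGRID tool.

```python
import struct


def check_bincoor(data, expected_atoms):
    """Check the header and total size of an in-memory coordinate file.

    Assumed layout (not confirmed for ACEMD): int32 atom count,
    then 3 float64 values (x, y, z) per atom, little-endian.
    """
    if len(data) < 4:
        return "file too short to contain an atom count"
    (natoms,) = struct.unpack("<i", data[:4])
    if natoms != expected_atoms:
        return f"number of atoms ({natoms}) != ({expected_atoms}) expected"
    expected_size = 4 + 24 * natoms
    if len(data) != expected_size:
        return f"size is {len(data)} bytes, expected {expected_size}"
    return "ok"


# Two atoms with dummy coordinates, then the same payload with a corrupt header.
good = struct.pack("<i", 2) + struct.pack("<6d", 0.0, 1.0, 2.0, 3.0, 4.0, 5.0)
bad = struct.pack("<i", -45219840) + good[4:]

print(check_bincoor(good, 2))  # ok
print(check_bincoor(bad, 2))   # number of atoms (-45219840) != (2) expected
```

A nonsensical count such as a large negative number is consistent with a corrupt or wrongly built input file, which would fail the same way on every host regardless of GPU or driver.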
|
Joined: 22 Nov 09 Posts: 114 Credit: 589,114,683 RAC: 0
|
Running on Windows, one thing that I have done is to turn off the BOINC screen saver. After doing so, I have rarely had any GPUGrid WUs report a computation error. On most PCs these days, screen savers are only eye candy, as LCD monitors do not suffer from burn-in the way tube-based monitors did. To elaborate a bit further, I set my screen saver to NONE on the two machines where I currently run GPUGrid. I am bringing a third machine online in the next week or so, and I will turn off the screen saver on that one, too. |
|
Joined: 16 Apr 12 Posts: 2 Credit: 297,794 RAC: 0
|
skgiven, thanks for your analysis and advice. I have now completed one ACEMD2 workunit with GPU tasks set to run always. It didn't cause any problems, at least with web browsing and e-mail use. I also changed my screensaver to a simpler standard Windows screensaver. Now I will get my Einstein GPU WUs completed quicker, too. :) -Mika |
|
Joined: 21 Jan 10 Posts: 46 Credit: 1,388,234,528 RAC: 0
|
All my GPUGRID WUs are failing. I even replaced my 9800 GTX with a GTX 680 and the WUs fail in less than 5 seconds. I suspect the Nvidia driver. 301.10 is the only driver for the GTX 680. |
|
Joined: 8 Mar 12 Posts: 411 Credit: 2,083,882,218 RAC: 0
|
GTX 680 support for Windows has not been released yet. They are currently working on it. The Linux version was just released as a beta; when the Windows version is released, it will be a beta as well. |
©2025 Universitat Pompeu Fabra