All Tasks Failed

Paul Raney
Joined: 26 Dec 10
Posts: 115
Credit: 416,576,946
RAC: 0
Level: Gln
Message 24577 - Posted: 24 Apr 2012, 14:03:10 UTC

On one of my computers, every task has started to fail. I just restarted the system. Is there any way to get a new task now? Any ideas on what happened? This system had been running fine for weeks.

http://www.gpugrid.net/show_host_detail.php?hostid=117970

thank you

nenym
Joined: 31 Mar 09
Posts: 137
Credit: 1,429,587,071
RAC: 0
Level: Met
Message 24578 - Posted: 24 Apr 2012, 16:39:49 UTC - in response to Message 24577.  

"is there any way to get a new task now?"
I know a somewhat odd way, but it affects your host's statistics:
- detach from GPUGRID
- change the hostname
- reboot
- attach to GPUGRID
For your other attached projects, only the hostname changes (as far as I remember).
If the host is connected to a LAN, the new hostname may cause problems there.
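
The detach and attach steps above can also be run from the command line with BOINC's `boinccmd` tool. What follows is only a sketch: it assumes `boinccmd` is on the PATH, `ACCOUNT_KEY` is a placeholder for your own account key, and the helper function names are made up for illustration. The hostname change and the reboot still have to be done by hand between the two commands.

```python
import shlex

PROJECT_URL = "http://www.gpugrid.net/"   # project URL as used in this thread
ACCOUNT_KEY = "YOUR_ACCOUNT_KEY"          # placeholder, not a real key


def detach_command(url=PROJECT_URL):
    """Build the boinccmd call that detaches this host from a project."""
    return ["boinccmd", "--project", url, "detach"]


def attach_command(url=PROJECT_URL, key=ACCOUNT_KEY):
    """Build the boinccmd call that attaches this host to a project."""
    return ["boinccmd", "--project_attach", url, key]


if __name__ == "__main__":
    # Only print the commands; run them yourself once you are sure.
    print(shlex.join(detach_command()))
    print("# ...change the hostname and reboot here...")
    print(shlex.join(attach_command()))
```

The commands are only printed, not executed, because detaching discards any work in progress for the project.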

Paul Raney
Joined: 26 Dec 10
Posts: 115
Credit: 416,576,946
RAC: 0
Level: Gln
Message 24581 - Posted: 25 Apr 2012, 0:54:51 UTC - in response to Message 24578.  

Thanks for the hack. I want to keep my stats so I will just let the machine idle for a day or so.

Did anyone else have this issue?

Paul Raney
Joined: 26 Dec 10
Posts: 115
Credit: 416,576,946
RAC: 0
Level: Gln
Message 24583 - Posted: 25 Apr 2012, 13:55:08 UTC - in response to Message 24581.  

It looks like tasks continue to fail. Does anyone have any ideas of what might be wrong with this host?

Thx

Stoneageman
Joined: 25 May 09
Posts: 224
Credit: 34,057,374,498
RAC: 0
Level: Trp
Message 24585 - Posted: 25 Apr 2012, 16:32:09 UTC

The original clock rate was 1.88 GHz. Now it's 1.46 GHz and still failing. Is this the same card? Try underclocking the memory by 20%. Does it run other projects OK?

Paul Raney
Joined: 26 Dec 10
Posts: 115
Credit: 416,576,946
RAC: 0
Level: Gln
Message 24592 - Posted: 26 Apr 2012, 12:17:04 UTC - in response to Message 24585.  

It is the same card in the same computer. I lowered the clock rate to see if that would correct the condition.

That computer is down now. It should be running again this weekend. We will see if power was an issue.

thx

Paul Raney
Joined: 26 Dec 10
Posts: 115
Credit: 416,576,946
RAC: 0
Level: Gln
Message 24595 - Posted: 26 Apr 2012, 13:55:43 UTC - in response to Message 24592.  

Same problem on a different host. Could the 275.33 drivers be the issue? I have a different host with 285 drivers and it appears to be working fine.

any help is appreciated

http://www.gpugrid.net/show_host_detail.php?hostid=119703

Paul Raney
Joined: 26 Dec 10
Posts: 115
Credit: 416,576,946
RAC: 0
Level: Gln
Message 24606 - Posted: 28 Apr 2012, 3:20:40 UTC - in response to Message 24595.  

The hosts appear to be working correctly again. Were the work units bad?

RichF
Joined: 6 Jan 09
Posts: 7
Credit: 5,741,255
RAC: 0
Level: Ser
Message 24743 - Posted: 5 May 2012, 15:04:40 UTC - in response to Message 24606.  

All my WUs have been failing for the past couple of days, too. Is this a widespread problem, and how can we fix it? Thanks.

Old man
Joined: 24 Jan 09
Posts: 42
Credit: 16,676,387
RAC: 0
Level: Pro
Message 24746 - Posted: 5 May 2012, 16:05:16 UTC
Last modified: 5 May 2012, 16:05:52 UTC

Tasks have failed here too.

Name: 9px10-MJHARVEY_MJHXA1-8-30-RND0616_5
Workunit: 3395291
Created: 5 May 2012 | 11:53:47 UTC
Sent: 5 May 2012 | 15:23:02 UTC
Received: 5 May 2012 | 15:26:05 UTC
Server state: Over
Outcome: Computation error
Client state: Computation error
Exit status: 98 (0x62)
Computer ID: 123486
Report deadline: 10 May 2012 | 15:23:02 UTC
Run time: 2.70
CPU time: 0.80
Validate state: Not validated
Credit: 0.00
Application version: ACEMD2: GPU molecular dynamics v6.16 (cuda31)
Stderr output

<core_client_version>6.12.34</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using device 0
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 470"
# Clock rate: 1.21 GHz
# Total amount of global memory: 1275658240 bytes
# Number of multiprocessors: 14
# Number of cores: 112
# Device 1: "GeForce GTX 260"
# Clock rate: 1.30 GHz
# Total amount of global memory: 891748352 bytes
# Number of multiprocessors: 27
# Number of cores: 216
MDIO: read error for file "input.coor", byte number 4: number of atoms (-45219840) != (47792) expected
ERROR: Unable to read bincoordfile

called boinc_finish

</stderr_txt>
]]>

Name: 9px10-MJHARVEY_MJHXA1-8-30-RND0616
Application: ACEMD2: GPU molecular dynamics
Created: 4 May 2012 | 14:27:28 UTC
Minimum quorum: 1
Initial replication: 1
Max # of error/total/success tasks: 7, 10, 6

Task     Computer  Sent                       Time reported or deadline  Status                   Run time (sec)  CPU time (sec)  Credit  Application
5326942  124335    4 May 2012 | 17:49:33 UTC  4 May 2012 | 17:54:12 UTC  Error while downloading  0.00            0.00            ---     ACEMD2: GPU molecular dynamics v6.16 (cuda31)
5327658  112695    4 May 2012 | 20:20:30 UTC  4 May 2012 | 21:24:15 UTC  Error while computing    2.07            0.41            ---     ACEMD2: GPU molecular dynamics v6.16 (cuda31)
5328368  124628    5 May 2012 | 2:08:11 UTC   5 May 2012 | 2:14:55 UTC   Error while computing    7.75            0.00            ---     ACEMD2: GPU molecular dynamics v6.16 (cuda31)
5329342  105945    5 May 2012 | 5:41:17 UTC   5 May 2012 | 5:48:23 UTC   Error while computing    3.67            0.81            ---     ACEMD2: GPU molecular dynamics v6.16 (cuda31)
5329857  102639    5 May 2012 | 11:26:08 UTC  5 May 2012 | 11:53:44 UTC  Error while computing    2.15            0.53            ---     ACEMD2: GPU molecular dynamics v6.16 (cuda31)
5330904  123486    5 May 2012 | 15:23:02 UTC  5 May 2012 | 15:26:05 UTC  Error while computing    2.70            0.80            ---     ACEMD2: GPU molecular dynamics v6.16 (cuda31)
5331534  ---       ---                        ---                        Unsent                   ---             ---             ---

As you can see, everyone else has failed to run this task too :-(
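
One way to read a workunit table like the one above is to tally the status of every returned task: if hosts with different computer IDs all fail, the workunit itself is the likelier culprit. A minimal sketch in Python, with the rows typed in by hand and the statuses given in English. The function names and the rule of thumb are illustrative assumptions, not something the GPUGRID server is known to do.

```python
from collections import Counter

# (task ID, computer ID, status in English) taken from the table above
RESULTS = [
    (5326942, 124335, "Error while downloading"),
    (5327658, 112695, "Error while computing"),
    (5328368, 124628, "Error while computing"),
    (5329342, 105945, "Error while computing"),
    (5329857, 102639, "Error while computing"),
    (5330904, 123486, "Error while computing"),
]


def summarize(results):
    """Count statuses and collect the distinct hosts among returned tasks."""
    statuses = Counter(status for _, _, status in results)
    hosts = {host for _, host, _ in results}
    return statuses, hosts


def looks_task_side(results):
    """Rule of thumb: every task failed, on at least two different hosts."""
    statuses, hosts = summarize(results)
    all_failed = all(status.startswith("Error") for status in statuses)
    return all_failed and len(hosts) >= 2


if __name__ == "__main__":
    statuses, hosts = summarize(RESULTS)
    print(dict(statuses), "on", len(hosts), "hosts")
    print("task-side problem likely:", looks_task_side(RESULTS))
```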

RichF
Joined: 6 Jan 09
Posts: 7
Credit: 5,741,255
RAC: 0
Level: Ser
Message 24747 - Posted: 5 May 2012, 17:19:49 UTC - in response to Message 24743.  

Here is the error message I've been getting. Any help would be appreciated.


Stderr output
<core_client_version>6.12.34</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
# Using device 1
SWAN: FATAL : Unable to enumerate devices
Assertion failed: 0, file swanlib_nv.c, line 390

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

</stderr_txt>
]]>

Mika_at_home
Joined: 16 Apr 12
Posts: 2
Credit: 297,794
RAC: 0
Message 24750 - Posted: 5 May 2012, 22:12:28 UTC

I have also had failed workunits this week. Of the last five, three have failed: the first on Wednesday, the next on Friday and the latest tonight. For all of them there are messages like these in the BOINC log:

5.5.2012 23:56:56 GPUGRID Computation for task 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1 finished
5.5.2012 23:56:56 GPUGRID Output file 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1_1 for task 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1 absent
5.5.2012 23:56:56 GPUGRID Output file 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1_2 for task 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1 absent
5.5.2012 23:56:56 GPUGRID Output file 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1_3 for task 12px17-MJHARVEY_MJHXA1-4-30-RND6604_1 absent

The following ACEMD2 workunit failed on Friday:

1x21-MJHARVEY_MJHXA1-8-30-RND8065_0

Stderr output

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
- exit code -99 (0xffffff9d)
</message>
<stderr_txt>
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 560 Ti"
# Clock rate: 1.46 GHz
# Total amount of global memory: 1341849600 bytes
# Number of multiprocessors: 14
# Number of cores: 112
MDIO: cannot open file "restart.coor"
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 560 Ti"
# Clock rate: 1.46 GHz
# Total amount of global memory: 1341849600 bytes
# Number of multiprocessors: 14
# Number of cores: 112
# Using device 0

I also run Einstein@home, with about 2 million credit points so far, and their WUs have never failed. My graphics card is a Gigabyte GTX 560Ti 448 which runs at the reference clock speed of 723 MHz, and temps are between 70 and 75 C. I have lowered the fan speed with MSI Afterburner. The card has been running GPUGRID workunits for about a week now. So should I suspect my computer as the cause of these failures?
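
A side note on the stderr output above: the exit code is printed twice, as -99 and as 0xffffff9d. These are the same value, because the hexadecimal form is the 32-bit two's-complement representation of the negative number. A small Python sketch of the conversion (the function names are just for illustration):

```python
def to_hex32(code):
    """Show a signed exit code as an unsigned 32-bit hex string."""
    return hex(code & 0xFFFFFFFF)


def from_hex32(text):
    """Convert a 32-bit hex string back to a signed integer."""
    value = int(text, 16)
    return value - (1 << 32) if value & 0x80000000 else value


if __name__ == "__main__":
    print(to_hex32(-99))              # exit code of the failed ACEMD2 task
    print(to_hex32(98))               # the other exit code seen in this thread
    print(from_hex32("0xffffff9d"))   # back to the signed form
```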

Thank you

NeoMetal*
Joined: 30 Mar 11
Posts: 1
Credit: 1,005,009
RAC: 0
Level: Ala
Message 24769 - Posted: 7 May 2012, 0:42:13 UTC

I got a failed WU because of: MDIO: cannot open file "output.restart.coor"
First time I've ever seen that. The WU completed fine but errored when it tried to upload. No antivirus or backup software is running; it's just a basic Win 7 install for crunching. This sucks: 21 hours wasted on a most likely valid WU because of a locked or disappearing file.

I see Mika_at_home has a similar error in his post above: MDIO: cannot open file "restart.coor". Is this happening to anyone else? Seems like a rash of errors recently. Could this be something needing fixing?

Stderr output

<core_client_version>7.0.25</core_client_version>
<![CDATA[
<message>
 - exit code 98 (0x62)
</message>
<stderr_txt>
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 560 Ti"
# Clock rate: 1.90 GHz
# Total amount of global memory:                 1073741824 bytes
# Number of multiprocessors:                     8
# Number of cores:                               64
MDIO: cannot open file "output.restart.coor"
ERROR:  get_Dvec() element 0 (b) 
called boinc_finish

</stderr_txt>
]]>



NM*

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Level: Trp
Message 24770 - Posted: 7 May 2012, 1:04:59 UTC

The MDIO: cannot open file "output.restart.coor" message is not a real error; it appears in every task, even the successful ones.
Your real error message is ERROR: get_Dvec() element 0 (b), and I think that such an error cannot be caused by the upload, nor by "a locked or disappearing file". This error happened while the WU was being processed, probably near its completion, which is why it seems to have been caused by the upload.
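
To separate the harmless notice from the actual failure when reading a stderr dump, the lines can be filtered mechanically. A minimal sketch, assuming (per the explanation above) that the `MDIO: cannot open file "output.restart.coor"` line can be ignored and that lines starting with `ERROR:`, `MDIO:` or `SWAN:` are the ones worth reading. The list of benign messages is an assumption and would need extending as more cases turn up.

```python
# Messages treated as harmless; an assumption based on the post above.
BENIGN = (
    'MDIO: cannot open file "output.restart.coor"',
)


def real_errors(stderr_text):
    """Return the stderr lines that look like genuine errors."""
    found = []
    for line in stderr_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the device-info comments
        if any(line.startswith(prefix) for prefix in BENIGN):
            continue  # known harmless notice
        if line.startswith(("ERROR:", "MDIO:", "SWAN:")):
            found.append(line)
    return found


SAMPLE = '''# Using device 0
MDIO: cannot open file "output.restart.coor"
ERROR:  get_Dvec() element 0 (b)
called boinc_finish
'''

if __name__ == "__main__":
    for line in real_errors(SAMPLE):
        print(line)
```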

Paul Raney
Joined: 26 Dec 10
Posts: 115
Credit: 416,576,946
RAC: 0
Level: Gln
Message 24771 - Posted: 7 May 2012, 2:49:32 UTC - in response to Message 24770.  

Several of my work units failed very near the end of the calculation process. Any ideas on why? The clock rate has been reduced to see if that will correct the issue.

thank you

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level: His
Message 24781 - Posted: 7 May 2012, 14:13:47 UTC - in response to Message 24750.  

Mika_at_home, it seems that your tasks are getting suspended and resumed many times. I think there is more chance of failures when running this way. I suggest you configure BOINC Manager to allow GPU tasks to run when the system is in use.

All of the MJHARVEY tasks that failed on your system also failed on at least one other system, and some failed repeatedly on many systems, which suggests an issue with the tasks themselves; their workunits are marked with the error "Too many errors (may have bug)".
Sometimes these issues are very difficult to track down, as they only rarely appear on some combinations of operating system, driver and GPU. In the 'Too many errors' case above, however, the problem seems independent of GPU, driver and operating system, and my guess is that it was a badly built task:
MDIO: read error for file "input.coor", byte number 4: number of atoms (-45219840) != (47525) expected
ERROR: Unable to read bincoordfile
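
The message above compares the atom count stored at the start of input.coor with the count the simulation expects. A minimal sketch of the same kind of check in Python, assuming the file follows the NAMD-style binary coordinate layout (a 4-byte integer atom count followed by three 8-byte doubles per atom) and is little-endian. Whether ACEMD's input.coor uses exactly this layout is an assumption here, and the function name is hypothetical.

```python
import os
import struct
import tempfile


def check_bincoor(path, expected_atoms):
    """Compare the header atom count and the file size with expectations."""
    with open(path, "rb") as fh:
        header = fh.read(4)
    if len(header) < 4:
        return False, "file too short to hold an atom count"
    (n_atoms,) = struct.unpack("<i", header)  # signed 32-bit, little-endian
    if n_atoms != expected_atoms:
        return False, f"number of atoms ({n_atoms}) != ({expected_atoms}) expected"
    expected_size = 4 + 24 * n_atoms  # header plus x, y, z doubles per atom
    actual_size = os.path.getsize(path)
    if actual_size != expected_size:
        return False, f"file size {actual_size} != {expected_size} expected"
    return True, "header and size look consistent"


if __name__ == "__main__":
    # Build a tiny well-formed file with 2 atoms to exercise the check.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".coor") as tmp:
        tmp.write(struct.pack("<i", 2))
        tmp.write(struct.pack("<6d", 0.0, 1.0, 2.0, 3.0, 4.0, 5.0))
    print(check_bincoor(tmp.name, 2))
    print(check_bincoor(tmp.name, 47525))
    os.remove(tmp.name)
```

A wildly wrong count such as the negative one in the log is what a check like this reports when the first four bytes are not a valid header at all, which fits a damaged or wrongly built input file better than a problem on the host.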


I would be more concerned about the tasks that fail after 10,000 seconds than about those that fail after 2 seconds.

Paul Raney, as different task types are failing on your system it's more likely that the issue is a setup one (GPU clock, overuse of CPU, interference from another program...). 'Energies have become nan' is often symptomatic of a GPU issue with the clock, voltage or temps (but may also be linked to overuse of the CPU).

wiyosaya
Joined: 22 Nov 09
Posts: 114
Credit: 589,114,683
RAC: 0
Level: Lys
Message 24784 - Posted: 7 May 2012, 16:51:42 UTC

Running on Windows, one thing that I have done is to turn off the BOINC screen saver. After doing so, I have rarely had any GPUGrid WUs report computation error.

On most PCs these days, screen savers are only eye candy, as LCD monitors do not suffer from burn-in the way tube-based monitors did.

To elaborate a bit further, I set my screen saver to NONE on the two machines where I currently run GPUGrid. I am bringing a third machine on line in the next week or so, and I will also turn off the screen saver on that one, too.

Mika_at_home
Joined: 16 Apr 12
Posts: 2
Credit: 297,794
RAC: 0
Message 24798 - Posted: 8 May 2012, 11:50:51 UTC - in response to Message 24781.  

skgiven, thanks for your analysis and advice. I have now completed one ACEMD2 workunit with GPU tasks set to run always. It didn't cause any problems, at least with web browsing and e-mail use. I also changed my screensaver to a simpler standard Windows screensaver. Now I will get my Einstein GPU WUs completed quicker, too. :)

-Mika

lohphat
Joined: 21 Jan 10
Posts: 46
Credit: 1,388,234,528
RAC: 0
Level: Met
Message 24859 - Posted: 10 May 2012, 4:57:56 UTC

All my GPUGRID WUs are failing. I even replaced my 9800 GTX with a GTX 680 and the WUs fail in less than 5 seconds.

I suspect the Nvidia driver. 301.10 is the only driver for the GTX 680.

5pot
Joined: 8 Mar 12
Posts: 411
Credit: 2,083,882,218
RAC: 0
Level: Phe
Message 24860 - Posted: 10 May 2012, 5:25:24 UTC

GTX 680 support for Windows has not been released yet; they are currently working on it. The Linux version was just released as a beta, and when the Windows version is released it will be a beta as well.

©2025 Universitat Pompeu Fabra