failing tasks lately

Erich56

Message 52174 - Posted: 3 Jul 2019, 16:56:32 UTC
Last modified: 3 Jul 2019, 16:59:27 UTC

This afternoon, I had 4 tasks in a row which failed after a few seconds; see here: http://www.gpugrid.net/results.php?userid=125700&offset=0&show_names=1&state=0&appid=

-97 (0xffffffffffffff9f) Unknown error number

The simulation has become unstable. Terminating to avoid lock-up


I've never had that before, and I didn't change anything in my settings.
Does anyone else experience the same problem?
For now, I have stopped downloading new tasks.
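Side note on that log line: the long hex value is just the signed exit code printed as an unsigned 64-bit number, so -97 and 0xffffffffffffff9f are the same value. A minimal Python sketch of the conversion, purely for illustration:

def to_signed64(value):
    # Map an unsigned 64-bit integer onto the signed (two's-complement) range.
    return value - 2**64 if value >= 2**63 else value

print(to_signed64(0xffffffffffffff9f))  # prints -97, the exit code from the log above

The "Unknown error number" text presumably just means BOINC has no description for that particular code.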
Richard Haselgrove
Message 52176 - Posted: 3 Jul 2019, 18:48:47 UTC

I've had three failed tasks over the last two days, but all the others have run normally. All the failed tasks had PABLO_V3_p27_sj403_IDP in their name.

But I'm currently uploading e10s21_e4s18p1f211-PABLO_V3_p27_sj403_IDP-0-2-RND5679_0 - which fits that name pattern, but has run normally. By the time you read this, it will probably have reported and you can read the outcome for yourselves. If it's valid, I think you can assume that Pablo has found the problem and corrected it.
Erich56
Message 52177 - Posted: 3 Jul 2019, 19:33:05 UTC

Yes, part of the PABLO_V3_p27_sj403_IDP series seems to be erroneous.
Within the past few days some of them have worked well here, but others have not, as can be seen.
The server status page shows an error rate of 56.37% for them, which is rather high, isn't it?

I'll switch off my air conditioning overnight and try to download the next task tomorrow morning.
Erich56
Message 52179 - Posted: 4 Jul 2019, 4:46:06 UTC - in response to Message 52177.  

> The server status page shows an error rate of 56.37% for them, which is rather high, isn't it?

Overnight, the failure rate has risen to 57.98%.

The remaining tasks from this series should be cancelled from the queue.
Erich56
Message 52182 - Posted: 4 Jul 2019, 15:41:34 UTC - in response to Message 52179.  

> > The server status page shows an error rate of 56.37% for them, which is rather high, isn't it?
>
> Overnight, the failure rate has risen to 57.98%.
>
> The remaining tasks from this series should be cancelled from the queue.

Meanwhile, the failure rate has passed the 60% mark. It's 60.12%, to be exact.

And these faulty tasks are still in the download queue. WHY?
Richard Haselgrove
Message 52189 - Posted: 5 Jul 2019, 16:33:40 UTC

I thought we'd got rid of these, but I've just sent back e15s24_e1s258p1f302-PABLO_V3_p27_sj403_IDP-0-2-RND4645_0 (note the _0 replication). I was the first victim, since the job was only created at 11:25:23 UTC today; seven more to go.
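For reference, the trailing _0 is the replication number that BOINC appends to each copy of a workunit it sends out. A quick, illustrative way to pull the batch name and replication number out of a task name like the one above (the other numeric fields are left untouched here):

def parse_result_name(name):
    # e.g. "e15s24_e1s258p1f302-PABLO_V3_p27_sj403_IDP-0-2-RND4645_0"
    fields = name.split("-")
    batch = fields[1]                                 # "PABLO_V3_p27_sj403_IDP"
    rnd_tag, replication = fields[-1].rsplit("_", 1)
    return batch, rnd_tag, int(replication)           # replication 0 = first copy issued

print(parse_result_name("e15s24_e1s258p1f302-PABLO_V3_p27_sj403_IDP-0-2-RND4645_0"))
# -> ('PABLO_V3_p27_sj403_IDP', 'RND4645', 0)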
Erich56
Message 52194 - Posted: 5 Jul 2019, 19:55:48 UTC

The failure rate is now close to 64%, so it's still climbing.
From the look of it, none of the tasks from this series are succeeding.

Can anyone from the GPUGRID people explain the rationale behind leaving these faulty tasks in the download queue?
mmonnin
Message 52195 - Posted: 5 Jul 2019, 21:55:24 UTC - in response to Message 52194.  

> The failure rate is now close to 64%, so it's still climbing.
> From the look of it, none of the tasks from this series are succeeding.
>
> Can anyone from the GPUGRID people explain the rationale behind leaving these faulty tasks in the download queue?


A holiday. Some admins won't cancel tasks like that even when they are around. Some will just let them error out the maximum number of times.
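For context: a workunit is re-issued to new hosts until one copy validates or it has accumulated its maximum number of error results, at which point the server gives up on it. A rough sketch of that behaviour (illustrative only, not actual BOINC server code; the limit of 7 is just an assumed value for this sketch):

MAX_ERROR_RESULTS = 7   # assumed per-workunit error limit

def dispatch_workunit(run_replication):
    # Re-issue one workunit until a copy succeeds or the error limit is reached.
    errors = 0
    while errors < MAX_ERROR_RESULTS:
        if run_replication():        # True = task validated, False = compute error
            return "completed"
        errors += 1                  # each failure burns one replication (_0, _1, ...)
    return "too many errors"         # the workunit is retired without a usable result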
Erich56
Message 52197 - Posted: 6 Jul 2019, 4:44:48 UTC - in response to Message 52195.  

> Some will just let them error out the maximum number of times.

The bad thing is that once a host has had more than 2 or 3 such faulty tasks in a row, it is considered unreliable and will no longer receive tasks for the next 24 hours.
So the host is penalized for something that is not its fault.

What surprises me even more is that the GPUGRID people don't seem to care :-(
Erich56
Message 52204 - Posted: 7 Jul 2019, 5:04:44 UTC

The failure rate has now passed the 70% mark. Great!
Erich56
Message 52208 - Posted: 8 Jul 2019, 18:50:22 UTC

Meanwhile, the failure rate has passed the 75% mark. It is now 75.18%, to be exact.
And still, these faulty tasks are in the download queue.
Does anybody understand this?
Keith Myers
Message 52210 - Posted: 9 Jul 2019, 4:34:10 UTC

If you are so unhappy running the available Windows tasks, just stop getting any work. Problem solved. You are happy now.

I don't have any issues with the project and I haven't had any normal work since February when the Linux app was decommissioned.

I trust Toni will eventually figure out the new wrapper apps and we will get work again. Don't PANIC!
Erich56
Message 52211 - Posted: 9 Jul 2019, 5:04:46 UTC - in response to Message 52210.  

> If you are so unhappy running the available Windows tasks, just stop getting any work. Problem solved. You are happy now.
>
> I don't have any issues with the project and I haven't had any normal work since February when the Linux app was decommissioned.
>
> I trust Toni will eventually figure out the new wrapper apps and we will get work again. Don't PANIC!

The question isn't whether or not I am unhappy; the question is rather what makes sense and what doesn't.
Don't you think the only real solution to the problem would be to simply withdraw the remaining tasks of this faulty series from the download queue?
Or can you explain the rationale for leaving them in the download queue?
In a few more weeks, when all these tasks have been used up, the error rate will be 100%. How does this serve the project?

As I explained before: once a host happens to download such faulty tasks 2 or 3 times in a row, it is blocked for 24 hours. So what sense does this make?
Richard Haselgrove
Message 52215 - Posted: 9 Jul 2019, 13:17:11 UTC

So far as I can tell from my account pages, my machines are processing GPUGrid tasks just fine and at the normal rate.

It's just one sub-type which is failing, and it's only wasting a few seconds when it does so. For some people on metered internet connections, there might be an additional cost, but I think it's unlikely that many people are running a high-bandwidth project that way.

The rationale for letting them time out naturally? It saves staff time, better spent doing the analysis and debugging behind the scenes. Let them get on with that, and I'm sure the research will be re-run when they find and solve the problem.

BTW, "No, it doesn't work" is a valid research outcome.
Redirect Left
Message 52216 - Posted: 9 Jul 2019, 15:12:21 UTC

My machine has also failed numerous GPUGrid tasks lately, running on 2 GTX 1070 cards (individual, not SLI'd).

The failed ones usually have PABLO or NOELIA in their names.

Here are four examples of recent failures on my machine; hopefully you can determine from the output what needs to be resolved.

http://www.gpugrid.net/result.php?resultid=7412820
http://www.gpugrid.net/result.php?resultid=21094782
http://www.gpugrid.net/result.php?resultid=7412829
http://www.gpugrid.net/result.php?resultid=21075338

I'll be skipping GPUGrid tasks from now on until this is resolved, as it is wasting CPU/GPU time that I can use for other projects on this machine. I'll check back on these forums for updates, though, so I know when to restart GPUGRID tasks.
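As a quick, hedged way to see which batches dominate such errors, the errored-results page can be scanned for batch names. This is a rough sketch only; it assumes the standard BOINC results listing with task names visible (show_names=1) and uses the third-party requests package:

import re
from collections import Counter
import requests  # third-party: pip install requests

# Count how often each PABLO_/NOELIA_ batch name appears on an errored-results page.
URL = ("http://www.gpugrid.net/results.php"
       "?userid=93721&offset=0&show_names=1&state=5&appid=")

html = requests.get(URL, timeout=30).text
batches = re.findall(r"(?:PABLO|NOELIA)_[A-Za-z0-9_]+", html)
print(Counter(batches).most_common())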
Retvari Zoltan
Message 52217 - Posted: 9 Jul 2019, 22:28:06 UTC - in response to Message 52216.  
Last modified: 9 Jul 2019, 22:31:17 UTC

http://www.gpugrid.net/result.php?resultid=7412820 This WU is from 2013.
http://www.gpugrid.net/result.php?resultid=21094782 This WU is from the present bad batch. It took 6 seconds to error out.
http://www.gpugrid.net/result.php?resultid=7412829 This WU is from 2013.
http://www.gpugrid.net/result.php?resultid=21075338 This WU is from the present bad batch. It took 5 seconds to error out.

http://www.gpugrid.net/result.php?resultid=21094816 This WU is from the present bad batch. It took 6 seconds to error out.

> I'll be skipping GPUGrid tasks from now on until this is resolved, as it is wasting CPU/GPU time that I can use for other projects on this machine.
The 3 recent errors wasted 17 seconds on your host in the past 4 days, so there's no reason to panic (even though your host didn't receive work for 3 days).

> I'll check back on these forums for updates, though, so I know when to restart GPUGRID tasks.
The project is running fine apart from this one bad batch, so you can do that right away.

The number of resends may increase as this bad batch runs out, and that could get a host "blacklisted" for 24 hours, but it takes many failing workunits in a row (so it is unlikely to happen, as the maximum number of daily workunits is only reduced by 1 after each error).
The maximum daily number of "Long runs (8-12 hours on fastest card) 9.22 windows_intelx86 (cuda80)" tasks for your host is currently 28, so this host would have to be extremely unlucky to receive 28 bad workunits in a row and get "banned" for 24 hours.
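To put a number on that, here is a rough model of the per-host daily limit behaviour described above (illustrative only, not the actual BOINC scheduler code; the error branch follows the description above, while the recovery rule on valid results is an assumption for this sketch):

DAILY_CAP = 28   # the current "max tasks per day" for the host discussed above

def update_daily_limit(limit, task_succeeded):
    if task_succeeded:
        return min(limit * 2, DAILY_CAP)   # assumed recovery after a valid result
    return max(limit - 1, 1)               # per the post: reduced by 1 after each error

Starting from 28, only a long unbroken run of errors would exhaust the day's allowance, which is why a 24-hour "ban" from this one bad batch is unlikely.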
Redirect Left
Message 52218 - Posted: 9 Jul 2019, 23:09:11 UTC - in response to Message 52217.  

Oops, my bad: I sorted the tasks by 'errored' and mixed up the ones I pasted.

The results in their entirety are below: 10 errored ones in all, but only 4 of them recent; nothing else has errored (or at least nothing is showing there) apart from one in 2015, with the other 5 back in 2013.
http://www.gpugrid.net/results.php?userid=93721&offset=0&show_names=0&state=5&appid=

On your advice I'll restart fetching GPUGrid tasks, and hopefully the coin tosses go my way and it picks up a wide enough spread of tasks not to get itself blacklisted. It's interesting that it is allowed up to 28, given it only ever stores 4, and that is only if 2 are running actively on the GPUs with 2 spare. But I guess that is down to the work-buffer settings in BOINC.
Retvari Zoltan
Message 52342 - Posted: 24 Jul 2019, 8:42:59 UTC
Last modified: 24 Jul 2019, 8:46:04 UTC

There are two more 'bad' batches at the moment in the 'long' queue:
PABLO_V4_UCB_p27_isolated_005_salt_ID
PABLO_V4_UCB_p27_sj403_short_005_salt_ID

Don't be surprised if tasks from these two batches fail on your host after a couple of seconds; there's nothing wrong with your host.
The safety check in these batches is too sensitive, so it decides that "the simulation has become unstable" when it probably hasn't.
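For the curious, such a safety check in an MD engine typically aborts the run when energies or velocities become non-finite or blow past a threshold. This is a generic illustration of the pattern only, not the actual ACEMD check, and the thresholds here are made up:

import math

ENERGY_LIMIT = 1.0e6   # made-up threshold for this sketch
SPEED_LIMIT = 1.0e3    # made-up threshold for this sketch

def simulation_is_unstable(potential_energy, max_atom_speed):
    # Flag the run if anything is NaN/inf or exceeds the thresholds.
    if not (math.isfinite(potential_energy) and math.isfinite(max_atom_speed)):
        return True
    return potential_energy > ENERGY_LIMIT or max_atom_speed > SPEED_LIMIT

A threshold set too tight flags perfectly healthy runs, which is what "too sensitive" means here.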
Erich56
Message 52385 - Posted: 7 Aug 2019, 12:10:27 UTC

Any idea why all tasks downloaded within the last few hours fail immediately?
gemini8
Message 52386 - Posted: 7 Aug 2019, 12:51:31 UTC - in response to Message 52385.  

> Any idea why all tasks downloaded within the last few hours fail immediately?

No idea, but it's the same for others.

I'm using Win7 Pro; work units crash at once:
Stderr output
<core_client_version>7.10.2</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -44 (0xffffffd4)</message>
]]>

07.08.2019 14:17:11 | GPUGRID | Sending scheduler request: To fetch work.
07.08.2019 14:17:11 | GPUGRID | Requesting new tasks for NVIDIA GPU
07.08.2019 14:17:13 | GPUGRID | Scheduler request completed: got 1 new tasks
07.08.2019 14:17:15 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-LICENSE
07.08.2019 14:17:15 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-COPYRIGHT
07.08.2019 14:17:17 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-LICENSE
07.08.2019 14:17:17 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-COPYRIGHT
07.08.2019 14:17:17 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-coor_file
07.08.2019 14:17:17 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-vel_file
07.08.2019 14:17:18 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-vel_file
07.08.2019 14:17:18 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-idx_file
07.08.2019 14:17:19 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-idx_file
07.08.2019 14:17:19 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-pdb_file
07.08.2019 14:17:21 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-coor_file
07.08.2019 14:17:21 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-psf_file
07.08.2019 14:17:30 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-pdb_file
07.08.2019 14:17:30 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-par_file
07.08.2019 14:17:33 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-par_file
07.08.2019 14:17:33 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-conf_file_enc
07.08.2019 14:17:34 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-conf_file_enc
07.08.2019 14:17:34 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-metainp_file
07.08.2019 14:17:35 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-metainp_file
07.08.2019 14:17:35 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-hills_file
07.08.2019 14:17:36 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-hills_file
07.08.2019 14:17:36 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-xsc_file
07.08.2019 14:17:37 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-xsc_file
07.08.2019 14:17:37 | GPUGRID | Started download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-prmtop_file
07.08.2019 14:17:38 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-psf_file
07.08.2019 14:17:38 | GPUGRID | Finished download of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-prmtop_file
07.08.2019 14:19:22 | GPUGRID | Starting task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4
07.08.2019 14:19:29 | GPUGRID | Computation for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 finished
07.08.2019 14:19:29 | GPUGRID | Output file e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_0 for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 absent
07.08.2019 14:19:29 | GPUGRID | Output file e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_1 for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 absent
07.08.2019 14:19:29 | GPUGRID | Output file e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_2 for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 absent
07.08.2019 14:19:29 | GPUGRID | Output file e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_3 for task e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4 absent
07.08.2019 14:19:37 | GPUGRID | Started upload of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_7
07.08.2019 14:19:39 | GPUGRID | Finished upload of e14s18_e8s70p1f46-PABLO_V4_UCB_p27_sj403_005_salt_IDP-0-2-RND1985_4_7


Another member of our team has the same problem on Win10.
I'd really like to compare this with Linux, but I haven't got any work units on my Debian machine for weeks.
- - - - - - - - - -
Greetings, Jens