Bad batch of WU's

Message boards : Number crunching : Bad batch of WU's


Betting Slip

Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Level: Phe
Message 49293 - Posted: 17 Apr 2018, 8:16:50 UTC - in response to Message 49291.  
Last modified: 17 Apr 2018, 8:53:36 UTC

Why are these work units still in the queue? Is anyone running this program? Where are the moderators? This place is like an airplane on autopilot; it seems like some of these projects have no enthusiasm left.


A similar post of mine about this project's engagement with its contributors:


http://www.gpugrid.net/forum_thread.php?id=4585#47369

Another one:

http://www.gpugrid.net/forum_thread.php?id=4368&nowrap=true#48039
flashawk

Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Level: Arg
Message 49294 - Posted: 17 Apr 2018, 8:32:49 UTC - in response to Message 49293.  

I'm under the impression they think we're all getting paid for crunching. I will never do data mining, ever! I'll bet a lot of the dedicated crunchers who have been here for a while are now miners for hire.
STFC9F22

Joined: 10 Nov 17
Posts: 7
Credit: 154,876,594
RAC: 0
Level: Ile
Message 49295 - Posted: 17 Apr 2018, 9:25:23 UTC - in response to Message 49278.  


If you follow through your account (name link at top of page) / computer / workunit / task, you should be able to see something like workunit 13443713 - that one was well worth aborting.


Richard, thank you for your post. I received a single further failing task this morning and, following the link as you advised, it seems to have originated in yesterday's bad batch at 11:17 (UTC).

Following the links for all ten of my failed tasks (seven arising from the 13 April bad batch, around 17:50 UTC, and three from the 16 April bad batch, around 11:20 UTC), I notice they now all show exactly eight 'Error while computing' failures, so perhaps there is some automatic mechanism whereby they are pulled after eight failures on different computers?

I also notice on the Server Status page that PABLO_p27_wild_0_sj403_ID currently shows 740 successes and a 92.33% error rate which, if my maths is correct, suggests roughly 8,900 failures out of some 9,650 tasks issued, potentially resulting in a large number of computers having been temporarily locked out. Frustrating though that is for us donors, it doesn't appear to have created a (GPU) processing backlog, which perhaps indicates that the processing resource offered by donors currently far exceeds the requirements of the available work.
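As a sanity check, the arithmetic can be run in a couple of lines (a sketch; it assumes the Server Status "error rate" means failures divided by total tasks issued for the batch):

```python
# Rough check of the PABLO_p27_wild numbers from the Server Status page.
# Assumption: error rate = failures / (failures + successes).
successes = 740
error_rate = 0.9233

total_issued = successes / (1 - error_rate)  # tasks sent out altogether
failures = total_issued - successes

print(round(total_issued))  # ~9648 tasks issued
print(round(failures))      # ~8908 of them failed
```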
Profile Tuna Ertemalp

Joined: 28 Mar 15
Posts: 46
Credit: 1,547,496,701
RAC: 0
Level: His
Message 49299 - Posted: 17 Apr 2018, 15:36:32 UTC - in response to Message 49288.  
Last modified: 17 Apr 2018, 15:36:43 UTC

So, one by one, my seven hosts will be jailed, wasting 12x 1080Ti and 2x TitanX... Something is very non-ideal with that picture. Given that bad batches like this happen with non-negligible frequency, there should be a way to unblock in bulk the machines that were blocked or limited due to a bad batch, methinks.


Case in point: When my hosts are fully utilized, I would see "State: In progress (28)" under my account's Tasks page (I have a custom config file that tells BOINC that GPUGRID tasks are each 1 CPU + 0.5 GPU, and that works well for these cards, I found, so 14 cards = 28 tasks). Last night I saw it at 26, then 22, went to sleep, then this morning at 14, now it is 12.
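For anyone wanting the same 1 CPU + 0.5 GPU split, the usual mechanism is a BOINC `app_config.xml` in the GPUGRID project directory. A sketch along these lines (the app names here are assumptions and may not match the project's current application names):

```xml
<!-- app_config.xml in the GPUGRID project directory (app names assumed) -->
<app_config>
  <app>
    <name>acemdlong</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>   <!-- two tasks share one GPU -->
      <cpu_usage>1.0</cpu_usage>   <!-- one full CPU core per task -->
    </gpu_versions>
  </app>
  <app>
    <name>acemdshort</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```

BOINC rereads this file on "Options / Read config files" or a client restart.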

For instance, when one of my single TitanX machines (http://www.gpugrid.net/results.php?hostid=205349) that has NOTHING ELSE going on in BOINC contacts GPUGRID, it gets:

4/17/2018 8:23:54 AM | GPUGRID | Sending scheduler request: To fetch work.
4/17/2018 8:23:54 AM | GPUGRID | Requesting new tasks for CPU and NVIDIA GPU
4/17/2018 8:23:56 AM | GPUGRID | Scheduler request completed: got 0 new tasks
4/17/2018 8:23:56 AM | GPUGRID | No tasks sent

Quite ironic when the Server Status says "Tasks ready to send 34,375"...

:(
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 318
Level: Trp
Message 49300 - Posted: 17 Apr 2018, 15:52:56 UTC - in response to Message 49295.  

Following the links for all ten of my failed tasks (seven arising from the 13 April bad batch, around 17:50 UTC, and three from the 16 April bad batch, around 11:20 UTC), I notice they now all show exactly eight 'Error while computing' failures, so perhaps there is some automatic mechanism whereby they are pulled after eight failures on different computers?

Yes, on the workunit page, you should see a red banner above the task list saying

Too many errors (may have bug)

Once that appears (at this project, after 8 failures), no more are sent out.
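For reference, that threshold is a standard server-side BOINC setting on each workunit. In a workunit input template it looks something like this (the field names are BOINC's; apart from the 8, the values are illustrative, not GPUGRID's actual configuration):

```xml
<workunit>
  <!-- excerpt from a BOINC workunit input template (illustrative values) -->
  <max_error_results>8</max_error_results>     <!-- triggers "Too many errors (may have bug)" -->
  <max_total_results>16</max_total_results>    <!-- hard cap on instances ever issued -->
  <max_success_results>8</max_success_results> <!-- stop issuing once enough succeed -->
</workunit>
```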
flashawk

Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Level: Arg
Message 49301 - Posted: 17 Apr 2018, 16:47:50 UTC - in response to Message 49299.  

Quite ironic when the Server Status says "Tasks ready to send 34,375"...

:(


Those are Quantum Chemistry WU's for your CPU.

STFC9F22

Joined: 10 Nov 17
Posts: 7
Credit: 154,876,594
RAC: 0
Level: Ile
Message 49302 - Posted: 17 Apr 2018, 16:53:12 UTC - in response to Message 49299.  



For instance, when one of my single TitanX machines (http://www.gpugrid.net/results.php?hostid=205349) that has NOTHING ELSE going on in BOINC contacts GPUGRID, it gets:

4/17/2018 8:23:54 AM | GPUGRID | Sending scheduler request: To fetch work.
4/17/2018 8:23:54 AM | GPUGRID | Requesting new tasks for CPU and NVIDIA GPU
4/17/2018 8:23:56 AM | GPUGRID | Scheduler request completed: got 0 new tasks
4/17/2018 8:23:56 AM | GPUGRID | No tasks sent

Quite ironic when the Server Status says "Tasks ready to send 34,375"...

:(


Hi Tuna,

The 'Tasks ready to send' figure on the Server Status page is the total of all task types ready to send. There is a table beneath it showing a breakdown of 'Tasks by application' (although, for some reason, the totals always differ by one). You should see that almost all, if not all, of the unsent tasks are currently Quantum Chemistry tasks, which do not run on GPUs or on Windows; they run on Linux CPUs only. The unsent 'Short runs' and 'Long runs' figures show the work available for GPUs.

I cannot remember the exact wording, but when my own PC was temporarily jailed, requests for new work in the event log reported the reason as the daily quota (3) having been exceeded. As the log you have posted above does not say this, I suspect you are not locked out; the reason you are not receiving tasks is simply that, most of the time, there are currently no GPU tasks available to send. As I understand it, tasks are only sent in response to requests from the client, so it is a matter of luck whether tasks are available when your PC makes its requests.
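The quota mechanics behind the "jailing" are usually described as a per-host daily quota that shrinks on errors and grows back on successes, with a floor of one task per day. A toy model of that scheme (the doubling/halving rule and the cap of 32 are assumptions for illustration, not GPUGRID's actual settings):

```python
# Illustrative model of a BOINC-style per-host daily quota.
# Assumptions: a success doubles the quota up to a project cap,
# an error halves it, never dropping below 1 task per day.
CAP = 32  # assumed project-wide daily_result_quota

def update_quota(quota: int, success: bool) -> int:
    if success:
        return min(quota * 2, CAP)
    return max(quota // 2, 1)

quota = CAP
for _ in range(5):  # five tasks from a bad batch, all erroring out
    quota = update_quota(quota, False)

print(quota)  # 1: the host is effectively "jailed" until the daily reset
```

One successful task afterwards starts growing the quota back, which would explain mmonnin's observation below that the lockout lasted less than two days.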
mmonnin

Joined: 2 Jul 16
Posts: 338
Credit: 7,987,341,558
RAC: 193
Level: Tyr
Message 49303 - Posted: 17 Apr 2018, 17:02:29 UTC

I had 8 fail on the 13th at 18:38:35 UTC and received 4 more on the 15th at 6:17:47 UTC. So if they were jailed it was less than 2 days.
Profile Tuna Ertemalp

Joined: 28 Mar 15
Posts: 46
Credit: 1,547,496,701
RAC: 0
Level: His
Message 49304 - Posted: 17 Apr 2018, 17:03:25 UTC - in response to Message 49302.  

Ooops. Yup. I didn't scroll down far enough, I guess... :)
STFC9F22

Joined: 10 Nov 17
Posts: 7
Credit: 154,876,594
RAC: 0
Level: Ile
Message 49307 - Posted: 17 Apr 2018, 22:44:28 UTC - in response to Message 49303.  

I had 8 fail on the 13th at 18:38:35 UTC and received 4 more on the 15th at 6:17:47 UTC. So if they were jailed it was less than 2 days.


Yes; since in these circumstances the event log refers to a daily quota being exceeded, I guess the lockout only lasts for a day, or the remaining part of one.
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Level: Trp
Message 49320 - Posted: 19 Apr 2018, 5:40:30 UTC - in response to Message 49284.  

Richard wrote on April 16th:
All the failed / failing tasks over the last four days have had the exact string

PABLO_p27_wild

in their name.

These faulty tasks are still in the queue; I got one this morning:

http://gpugrid.net/result.php?resultid=17472190

again, as before: ERROR: file mdioload.cpp line 81: Unable to read bincoordfile
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Level: Trp
Message 49321 - Posted: 19 Apr 2018, 6:43:45 UTC

Just now, the next faulty PABLO_p27_wild task :-(((
The fourth one within 6 hours, which is really annoying.

What I don't understand: don't the people at GPUGRID monitor what's happening?
This specific problem has been known for 6 days now, and these faulty tasks are still in the queue.
Not nice at all :-(((
STFC9F22

Joined: 10 Nov 17
Posts: 7
Credit: 154,876,594
RAC: 0
Level: Ile
Message 49323 - Posted: 19 Apr 2018, 8:38:14 UTC - in response to Message 49320.  


These faulty tasks are still in the queue; I got one this morning:

http://gpugrid.net/result.php?resultid=17472190



Hi Erich,

As Richard confirmed earlier in the thread, there is a mechanism whereby tasks are automatically withdrawn after eight failures, and it seems this is being relied on to remove the faulty batches (for example http://www.gpugrid.net/workunit.php?wuid=13443881).

If you look at the history of the work unit that you picked up (http://gpugrid.net/workunit.php?wuid=13443681), you can see that it originated in the bad batch released on 13 April. After it failed for you, it was reissued to flashawk (who aborted it) and is currently reissued to an anonymous user, but it still requires three further failures to trigger automatic removal.

I guess that because there is currently so little GPU work available, these remaining bad units are still a significant proportion of the available work. I agree that it is disappointing, and seems a little disrespectful to donors, that no action has been taken to remove them proactively.
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Level: Trp
Message 49325 - Posted: 19 Apr 2018, 10:26:25 UTC

@STFC9F22, thank you for your explanations and insights.

You are right, the way this problem is being handled by GPUGRID is a little disappointing for us donors, particularly since a donor is punished by not getting any more tasks for a certain time span because the host is considered "unreliable".
Exactly this happened to me last Friday with one of my hosts, and I think it's not okay at all.
The mechanism of "host punishment" should definitely be suspended in cases where the cause of the problem is a faulty task.
flashawk

Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Level: Arg
Message 49327 - Posted: 19 Apr 2018, 18:29:53 UTC - in response to Message 49323.  
Last modified: 19 Apr 2018, 18:30:43 UTC

Hi Erich,

As Richard confirmed earlier in the thread there is a mechanism whereby tasks are automatically withdrawn after eight failures and it seems this is being relied on to remove the faulty batches (for example http://www.gpugrid.net/workunit.php?wuid=13443881).


Most of us are, and have been, aware of this for some time. The problem is that when a computer gets too many errors, it's put on a blacklist for a time; I know that right after I get an error, I can't download any WU's for a while.

We shouldn't be taking these hits. It's as though he loaded up these last WU's, packed his bags and left.
PappaLitto

Joined: 21 Mar 16
Posts: 513
Credit: 4,673,458,277
RAC: 0
Level: Arg
Message 49328 - Posted: 19 Apr 2018, 18:36:07 UTC - in response to Message 49327.  

It's as though he loaded up these last WU's, packed his bags and left.

You know he might be on vacation. Scientists have real lives too.
Profile robertmiles

Joined: 16 Apr 09
Posts: 503
Credit: 769,991,668
RAC: 0
Level: Glu
Message 49329 - Posted: 19 Apr 2018, 18:45:26 UTC - in response to Message 49327.  
Last modified: 19 Apr 2018, 18:47:41 UTC

Hi Erich,

As Richard confirmed earlier in the thread there is a mechanism whereby tasks are automatically withdrawn after eight failures and it seems this is being relied on to remove the faulty batches (for example http://www.gpugrid.net/workunit.php?wuid=13443881).


Most of us are, and have been, aware of this for some time. The problem is that when a computer gets too many errors, it's put on a blacklist for a time; I know that right after I get an error, I can't download any WU's for a while.

We shouldn't be taking these hits. It's as though he loaded up these last WU's, packed his bags and left.

When tasks are withdrawn after eight failures, they should also no longer be counted as failures for the eight computers that ran them.

While this is being fixed, compute errors from 2015 and earlier should also be removed from computers' failure lists, even for workunits that never had a successful task.
flashawk

Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Level: Arg
Message 49331 - Posted: 19 Apr 2018, 20:31:20 UTC - in response to Message 49329.  

I have 8 failed WU's from 2013 through 2014 that won't go away; how do I get those removed? I've asked twice and got no response. Are there mods here anymore?
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Level: Trp
Message 49345 - Posted: 20 Apr 2018, 15:55:57 UTC

I just noticed that many more of these faulty PABLO_p27_wild tasks are being distributed, although they have an error rate of 88% (which means that nearly 9 out of 10 are bad).
Can anyone explain what sense this makes?
flashawk

Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Level: Arg
Message 49346 - Posted: 20 Apr 2018, 19:59:35 UTC - in response to Message 49345.  

I just noticed that many more of these faulty PABLO_p27_wild tasks are being distributed, although they have an error rate of 88% (which means that nearly 9 out of 10 are bad).
Can anyone explain what sense this makes?


I have 4 at over 50% right now and they seem fine; they may have been reworked.

©2025 Universitat Pompeu Fabra