Bad batch of WU's

flashawk
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Message 49260 - Posted: 13 Apr 2018, 21:13:40 UTC
Last modified: 13 Apr 2018, 21:15:01 UTC

I just shot up to 21 errors. I was watching a group of WUs start up, and all of them pushed my three 1080s to 2100 MHz and then failed. They are failing on everyone's computers; they are Adria's WUs.
WPrion
Joined: 30 Apr 13
Posts: 106
Credit: 3,805,237,860
RAC: 65
Message 49261 - Posted: 13 Apr 2018, 21:25:54 UTC - in response to Message 49260.  
Last modified: 13 Apr 2018, 21:28:24 UTC

I just had two Pablos crash as soon as they started.

e175s2_e44s10p0f2-PABLO_p27_wild_0_sj403-0
e174s112_e63s8p1f198-PABLO_p27_wild_0_sj403_IDP-0

Update - I just checked my tasks list. I've had seven bad Pablos today.

flashawk
Message 49262 - Posted: 13 Apr 2018, 21:31:02 UTC - in response to Message 49261.  

Strange, both their WUs are crashing. You have 7 from today; I was poking around and everyone is getting errors on long WUs.
Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 49263 - Posted: 13 Apr 2018, 22:45:41 UTC - in response to Message 49262.  
Last modified: 13 Apr 2018, 22:46:06 UTC

All tasks error out immediately with:
ERROR: file mdioload.cpp line 81: Unable to read bincoordfile
This should be fixed by the staff ASAP.
flashawk
Message 49265 - Posted: 14 Apr 2018, 14:57:47 UTC

Any word yet? Has anyone gotten any WUs yet, and did they fail? Where are the moderators? I haven't seen any mods since I've been back. Also, has anyone who crunches for CPDN heard anything about them? Their forum and everything else has been down for almost 2 weeks now: no WUs, not a word. Their main page is up, but there's no mention of what happened. I know there are some people here that crunch for them, so I'm just wondering if they might have heard something.
Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Message 49266 - Posted: 14 Apr 2018, 15:33:20 UTC - in response to Message 49265.  

There was a batch of bad PABLO tasks created between about 17:30 and 18:00 UTC yesterday afternoon. I've watched some crash, and I've aborted some others (after checking that they had failed on other machines first). But there are good tasks created before and after.
3de64piB5uZAS6SUNt1GFDU9dRhY
Joined: 20 Apr 15
Posts: 285
Credit: 1,102,216,607
RAC: 0
Message 49267 - Posted: 14 Apr 2018, 20:12:00 UTC

I had quite a few today as well... scared me to death. Frankly, I suspected my 4-month-young ASUS ROG GTX 1070 of being defective and was (figuratively) about to throw it out of the window... when I stumbled across the same error:

ERROR: file mdioload.cpp line 81: Unable to read bincoordfile

Saved by the bell :)
I would love to see HCF1 protein folding and interaction simulations to help my little boy... someday.
STFC9F22
Joined: 10 Nov 17
Posts: 7
Credit: 154,876,594
RAC: 0
Message 49277 - Posted: 16 Apr 2018, 12:02:37 UTC

It seems another bad batch of PABLO_p27_wild_0_sj403_ID has just been released.

On 13 April around 18:10 I received seven of these which all failed after about 6 seconds (the event log reporting three absent files), and was then locked out, according to the event log, due to exceeding my daily quota.

I have just (16 April at 11:34) received two more tasks failing in the same manner, but have temporarily set the Project to 'No New Tasks' to avoid being locked out again.

The files shown as absent in the event log are:
16/04/2018 12:33:53 | GPUGRID | Output file e174s447_e62s29p1f212-PABLO_p27_wild_0_sj403_IDP-0-2-RND7636_1_1 for task e174s447_e62s29p1f212-PABLO_p27_wild_0_sj403_IDP-0-2-RND7636_1 absent
16/04/2018 12:33:53 | GPUGRID | Output file e174s447_e62s29p1f212-PABLO_p27_wild_0_sj403_IDP-0-2-RND7636_1_2 for task e174s447_e62s29p1f212-PABLO_p27_wild_0_sj403_IDP-0-2-RND7636_1 absent
16/04/2018 12:33:53 | GPUGRID | Output file e174s447_e62s29p1f212-PABLO_p27_wild_0_sj403_IDP-0-2-RND7636_1_3 for task e174s447_e62s29p1f212-PABLO_p27_wild_0_sj403_IDP-0-2-RND7636_1 absent

- although as these are output files I guess it might be that this is a symptom of the failure rather than the cause.
Richard Haselgrove
Message 49278 - Posted: 16 Apr 2018, 13:03:00 UTC - in response to Message 49277.  
Last modified: 16 Apr 2018, 13:07:17 UTC

- although as these are output files I guess it might be that this is a symptom of the failure rather than the cause.

Yes, those are symptoms, not causes.

If you follow through your account (name link at top of page) / computer / workunit / task, you should be able to see something like workunit 13443713 - that one was well worth aborting.

And if you look at one of the errored tasks, the real cause:

ERROR: file mdioload.cpp line 81: Unable to read bincoordfile

That was the earlier batch. Today's are possibly similar, but we need to see one to be sure. Your computers are hidden, and we don't have the 'find task by name' tool here, so we'll have to ask you to look it up for us.

Edit - thanks for the heads up, I've got one of those too. WU 13451812 is indeed the same as before, created 16 Apr 2018 | 11:22:43 UTC. That can go in the bit-bucket with the others.
Richard Haselgrove
Message 49279 - Posted: 16 Apr 2018, 13:26:21 UTC

I've just been sent another from today's bad batch: e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-2-RND8766. The files downloaded were:

16/04/2018 14:16:44 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-LICENSE
16/04/2018 14:16:44 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-COPYRIGHT
16/04/2018 14:16:46 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-coor_file
16/04/2018 14:16:46 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-vel_file
16/04/2018 14:16:47 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-idx_file
16/04/2018 14:16:47 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-pdb_file
16/04/2018 14:16:49 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-psf_file
16/04/2018 14:16:52 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-par_file
16/04/2018 14:16:54 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-conf_file_enc
16/04/2018 14:16:55 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-metainp_file
16/04/2018 14:16:55 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-hills_file
16/04/2018 14:16:56 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-xsc_file
16/04/2018 14:16:56 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-prmtop_file

For comparison, I'm working on an older one, resent this morning but created on 11 April: e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-2-RND9196. Those files were called:

16/04/2018 09:06:22 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-LICENSE
16/04/2018 09:06:22 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-COPYRIGHT
16/04/2018 09:06:23 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-0-2-RND9196_1
16/04/2018 09:06:23 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-0-2-RND9196_2
16/04/2018 09:06:24 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-0-2-RND9196_3
16/04/2018 09:06:24 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-pdb_file
16/04/2018 09:06:25 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-psf_file
16/04/2018 09:06:25 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-par_file
16/04/2018 09:06:26 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-conf_file_enc
16/04/2018 09:06:27 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-metainp_file
16/04/2018 09:06:27 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-0-2-RND9196_7
16/04/2018 09:06:28 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-0-2-RND9196_10
16/04/2018 09:06:28 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-prmtop_file

Quite a difference.
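
The difference between the two download lists can be made explicit with a short script. This is only a sketch: it assumes the event-log lines look exactly like the ones quoted above, it uses a shortened copy of each log, and it treats any file tag containing "-RND" as one that embeds an earlier task name (that rule is my guess, not something the project has confirmed). The names download_tags and split_tags are made up for this example.

```python
import re

# Matches the "Started download of <file>" event-log lines quoted in this post.
DOWNLOAD_RE = re.compile(r"Started download of (\S+)")


def download_tags(log_text, wu_prefix):
    """Return the tag of every downloaded file, with the shared workunit prefix removed."""
    names = DOWNLOAD_RE.findall(log_text)
    return [n[len(wu_prefix):] if n.startswith(wu_prefix) else n for n in names]


def split_tags(tags):
    """Separate generic tags ('coor_file') from tags that embed an earlier task name.

    Assumption: an embedded task name is recognised by the '-RND' marker.
    """
    chained = {t for t in tags if "-RND" in t}
    return set(tags) - chained, chained


# Shortened copies of the two logs above (a few lines from each).
BAD_PREFIX = "e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-"
BAD_LOG = """\
16/04/2018 14:16:44 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-LICENSE
16/04/2018 14:16:46 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-coor_file
16/04/2018 14:16:46 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-vel_file
16/04/2018 14:16:47 | GPUGRID | Started download of e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-pdb_file
"""

GOOD_PREFIX = "e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-"
GOOD_LOG = """\
16/04/2018 09:06:22 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-LICENSE
16/04/2018 09:06:23 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-0-2-RND9196_1
16/04/2018 09:06:23 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-0-2-RND9196_2
16/04/2018 09:06:24 | GPUGRID | Started download of e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-pdb_file
"""

bad_generic, bad_chained = split_tags(download_tags(BAD_LOG, BAD_PREFIX))
good_generic, good_chained = split_tags(download_tags(GOOD_LOG, GOOD_PREFIX))

print("generic tags only in the bad task:", sorted(bad_generic - good_generic))
print("tags embedding an earlier task name - bad:", len(bad_chained), "good:", len(good_chained))
```

Pasting the full logs into BAD_LOG and GOOD_LOG gives the complete comparison. The script only shows how the file naming differs; it says nothing about why the bad tasks fail.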
robertmiles
Joined: 16 Apr 09
Posts: 503
Credit: 769,991,668
RAC: 0
Message 49280 - Posted: 16 Apr 2018, 14:58:28 UTC

Two of my recent PABLO tasks gave Error while computing, with this error message in the stderr file:

ERROR: file mdioload.cpp line 81: Unable to read bincoordfile

Could you check if this is due to a missing file that should have been sent with the task?
flashawk
Message 49281 - Posted: 16 Apr 2018, 15:04:02 UTC

I just got 10 more bad Pablos, which brings me to 34 total. The server is going to give me the boot for too many errors in such a short amount of time. I hope they figure this out soon.

Richard, do you have any idea what's going on over at CPDN? Everything has been down for 2 weeks or so and I was curious when they might be back up, thanks.
Richard Haselgrove
Message 49283 - Posted: 16 Apr 2018, 15:50:55 UTC - in response to Message 49281.  

Richard, do you have any idea what's going on over at CPDN? Everything has been down for 2 weeks or so and I was curious when they might be back up, thanks.

I received the same emails as have been quoted on the BOINC message board at CPDN project going offline this afternoon, but I've had no more specific news than that. Better to consolidate all the news that we do get in that thread, I think.
Richard Haselgrove
Message 49284 - Posted: 16 Apr 2018, 16:01:21 UTC - in response to Message 49280.  
Last modified: 16 Apr 2018, 16:08:58 UTC

Could you check if this is due to a missing file that should have been sent with the task?

All the failed / failing tasks over the last four days have had the exact string

PABLO_p27_wild

in their name. I can see that I've completed at least one successfully with that string: also, at least one other with PABLO_p27_O43806_wild

I'll go and search the message logs to see what I can find, but I think any completely missing files would show up as a problem at the download stage, and never get as far as attempting to run. I think it's more likely that the contents are badly formatted in some way, and it won't be possible to compare good and bad after the event.

Edit - well, e173s16_e149s4p1f23-PABLO_p27_wild_0_sj403_IDP-1-2-RND2043 had file names with the workunit name embedded, like the second example in my earlier comparison. I think that Pablo, or whoever is submitting the work on Pablo's behalf, might be using the wrong script/template when preparing the workunits.
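
For anyone who wants to pick these out of a task list before they run, here is a minimal sketch of the exact-string check. The first two task names are from this thread; the third is made up to show that a PABLO_p27_O43806_wild name does not match. The function name flag_suspects is hypothetical, and, as noted above, the string alone does not prove a task is bad, so treat the result as a list of tasks to check, not to abort blindly.

```python
MARKER = "PABLO_p27_wild"


def flag_suspects(task_names, marker=MARKER):
    """Return the task names that contain the marker as one contiguous string."""
    return [name for name in task_names if marker in name]


tasks = [
    # From this thread: one of today's bad batch.
    "e174s49_e48s25p1f219-PABLO_p27_wild_0_sj403_IDP-0-2-RND8766",
    # From this thread: the older task used for comparison.
    "e82s5_e80s15p1f298-PABLO_p27_W60A_W76A_0_IDP-1-2-RND9196",
    # Made-up name, only to show that this variant is not matched.
    "e1s1_e1s1p0f1-PABLO_p27_O43806_wild_0_IDP-0-2-RND0000",
]

for name in flag_suspects(tasks):
    print("check before running:", name)
```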
mmonnin
Joined: 2 Jul 16
Posts: 338
Credit: 7,987,341,558
RAC: 259
Message 49286 - Posted: 16 Apr 2018, 17:25:07 UTC

2nd batch of bad tasks today. 8 on the 13th and 4 more today.
Tuna Ertemalp
Joined: 28 Mar 15
Posts: 46
Credit: 1,547,496,701
RAC: 0
Message 49288 - Posted: 16 Apr 2018, 17:43:52 UTC

So, one by one, my seven hosts will be jailed, wasting 12x 1080Ti and 2x TitanX... Something is very non-ideal with that picture. Given that bad batches like this are happening with some relevant, non-ignorable frequency, there should be a way to unblock in bulk the machines that were blocked/limited due to bad-batch issues, methinks.
flashawk
Message 49289 - Posted: 16 Apr 2018, 18:31:20 UTC

Why don't they know what's going on? Don't they monitor their project?

Thanks Richard, sorry to bother you, I didn't think to check the BOINC forums. I hadn't heard anything on the CPDN forum, even before they went down.
robertmiles
Message 49290 - Posted: 16 Apr 2018, 19:10:17 UTC - in response to Message 49284.  

Could you check if this is due to a missing file that should have been sent with the task?

All the failed / failing tasks over the last four days have had the exact string

PABLO_p27_wild

in their name. I can see that I've completed at least one successfully with that string: also, at least one other with PABLO_p27_O43806_wild

I'll go and search the message logs to see what I can find, but I think any completely missing files would show up as a problem at the download stage, and never get as far as attempting to run. I think it's more likely that the contents are badly formatted in some way, and it won't be possible to compare good and bad after the event.

Edit - well, e173s16_e149s4p1f23-PABLO_p27_wild_0_sj403_IDP-1-2-RND2043 had file names with the workunit name embedded, like the second example in my earlier comparison. I think that Pablo, or whoever is submitting the work on Pablo's behalf, might be using the wrong script/template when preparing the workunits.

I'd expect the download stage to fail if the file was missing on the server, but only if the name of the file was included on the list of files sent with the task to tell the client what files to download before starting the task.

If the name of the file was missing from that list, I'd expect download stage to download all the files on the list, report success for that stage, and the problem to become visible only when the application tries to open the file.
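
The two cases can be written down as a toy model. This is only an illustration of the reasoning, not how the BOINC client or the GPUGRID application is actually implemented; failure_stage and the file names a.dat, b.dat and c.dat are all made up.

```python
def failure_stage(listed, on_server, opened_by_app):
    """Toy model: report where a task would first fail, and on which files.

    listed        - files named in the list sent with the task
    on_server     - files that actually exist on the server
    opened_by_app - files the application tries to open when it runs
    """
    # Case 1: a listed file is missing on the server, so the download stage fails.
    missing = [f for f in listed if f not in on_server]
    if missing:
        return "download", missing
    # Case 2: every listed file downloads fine, but the application needs a
    # file that was never on the list, so the problem only shows up at run time.
    unlisted = [f for f in opened_by_app if f not in listed]
    if unlisted:
        return "run", unlisted
    return "ok", []


# Listed file missing on the server: fails at download.
print(failure_stage(["a.dat", "b.dat"], {"a.dat"}, ["a.dat", "b.dat"]))

# File never listed: download succeeds, fails when the application opens it.
print(failure_stage(["a.dat"], {"a.dat", "c.dat"}, ["a.dat", "c.dat"]))
```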
flashawk
Message 49291 - Posted: 17 Apr 2018, 7:16:01 UTC

Why are these work units still in the queue? Is anyone running this program? Where are the moderators? This place is like an airplane on autopilot; it seems like some of these projects have no more enthusiasm.
©2025 Universitat Pompeu Fabra