BAD PABLO_p53 WUs

Message boards : Number crunching : BAD PABLO_p53 WUs

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level
Tyr
Message 46695 - Posted: 20 Mar 2017, 20:42:47 UTC

So far I've had 23 of these bad PABLO_p53 WUs today. Think maybe they should be cancelled?
Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Message 46696 - Posted: 20 Mar 2017, 21:32:08 UTC - in response to Message 46695.  

Bedrich Hajek
Joined: 28 Mar 09
Posts: 490
Credit: 11,731,645,728
RAC: 42
Level
Trp
Message 46697 - Posted: 20 Mar 2017, 21:43:59 UTC

I also had 2 bad units from this bunch.

The problem with canceling these units is that the error will stay with you forever, but if you let it run its course until it gets 8 errors and becomes a "too many errors (may have a bug)" unit, it should disappear in time.



Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Level
His
Message 46698 - Posted: 20 Mar 2017, 21:47:27 UTC - in response to Message 46697.  
Last modified: 20 Mar 2017, 21:47:40 UTC

The problem with canceling these units is that the error will stay with you forever, but if you let it run its course until it gets 8 errors and becomes a "too many errors (may have a bug)" unit, it should disappear in time.

That will eliminate the work unit, but I think Beyond is saying that they should cancel the entire series until they fix it. That will save a lot of people some time, though fortunately they error out quickly.
koschi
Joined: 14 Aug 08
Posts: 127
Credit: 913,858,161
RAC: 11
Level
Glu
Message 46699 - Posted: 20 Mar 2017, 22:43:27 UTC

Same here under Linux:

ERROR: file mdioload.cpp line 81: Unable to read bincoordfile
Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Level
Trp
Message 46701 - Posted: 21 Mar 2017, 1:42:13 UTC

My main host got its daily quota of long workunits reduced to 1 because it had too many failures (caused by this bad batch).
Luckily there are short runs (and one other long run), so my main host is not completely shut out of this project.
This is really annoying.
This batch was working fine previously.
WPrion
Joined: 30 Apr 13
Posts: 106
Credit: 3,805,237,860
RAC: 40
Level
Arg
Message 46702 - Posted: 21 Mar 2017, 4:20:17 UTC - in response to Message 46698.  

That will eliminate the work unit, but I think Beyond is saying that they should cancel the entire series until they fix it. That will save a lot of people some time, though fortunately they error out quickly.


They do error quickly, but it kicked me into my daily quota limit right at the beginning of a new day. $#%@^!
Erich56
Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Level
Trp
Message 46703 - Posted: 21 Mar 2017, 4:23:57 UTC

also bad here:
for a few hours now, BOINC has not downloaded any new tasks, telling me that "the computer has finished a daily quota of 3 tasks" :-(((

This means that no new task can be downloaded before March 22, right?
Erich56
Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Level
Trp
Message 46704 - Posted: 21 Mar 2017, 7:27:43 UTC - in response to Message 46703.  

also bad here:
for a few hours now, BOINC has not downloaded any new tasks, telling me that "the computer has finished a daily quota of 3 tasks" :-(((

This means that no new tasks can be downloaded before March 22, right?


The incident early this morning shows that the policy of the daily quota should be revisited quickly.
In this specific case, it results in total nonsense.

No idea how many people (many, I think) are now unable to download any new tasks for the whole of March 21. Rather a bad thing.
Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Level
Trp
Message 46705 - Posted: 21 Mar 2017, 9:49:54 UTC
Last modified: 21 Mar 2017, 9:50:23 UTC

This is a generic error: all long workunits failed overnight on all of my hosts too, so all of my hosts are processing short runs now, but the short queue has already run dry.

The incident early this morning shows that the policy of the daily quota should be revisited quickly.
In this specific case, it results in total nonsense.
The daily quota would decrease to 1 in every case if the project supplies only failing workunits; there's no problem with that. The problem is the waaay too high ratio of bad workunits in the queue.
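A minimal sketch of the quota behaviour described above, assuming a policy in which each failed task lowers a host's daily quota by one (never below 1) and each successful task doubles it, up to a configured maximum. The function names, the starting value and the exact adjustment rule are assumptions for illustration only; the thread does not state how the GPUGRID server is actually configured.

```python
def update_quota(quota: int, succeeded: bool, configured_max: int) -> int:
    """Return a host's daily quota after it reports one task (assumed policy)."""
    if succeeded:
        # Assumption: a success doubles the quota, capped at the configured maximum.
        return min(quota * 2, configured_max)
    # Assumption: a failure lowers the quota by one, but never below 1.
    return max(quota - 1, 1)


def simulate(outcomes: list[bool], configured_max: int) -> int:
    """Apply a sequence of task outcomes, starting from the configured maximum."""
    quota = configured_max
    for succeeded in outcomes:
        quota = update_quota(quota, succeeded, configured_max)
    return quota


if __name__ == "__main__":
    # A host that receives only failing workunits ends up at the floor of 1.
    print(simulate([False] * 23, configured_max=8))
    # One good result afterwards starts the recovery.
    print(simulate([False] * 23 + [True], configured_max=8))
```

Under these assumptions the quota bottoms out at 1 for any host that is fed only failing workunits, however healthy the host is, so the trigger is the share of bad workunits in the queue and not the quota rule itself.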
Lluis
Joined: 22 Feb 14
Posts: 26
Credit: 672,639,304
RAC: 0
Level
Lys
Message 46706 - Posted: 21 Mar 2017, 10:52:52 UTC - in response to Message 46705.  

Since yesterday I have had 10 Pablo long work units fail with an "unknown" error, and now I don't have any work units to process.
Does anyone have an idea of what to do? Any advice (other than processing short units)?
Erich56
Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Level
Trp
Message 46707 - Posted: 21 Mar 2017, 11:32:47 UTC - in response to Message 46705.  

... so all of my hosts are processing short runs now, but the short queue has already run dry.

I switched to short runs in the early morning when, according to the Project Status Page, some were still available.
However, the download of those was refused again, citing the "daily quota of 3 tasks" :-(
Matt
Joined: 11 Jan 13
Posts: 216
Credit: 846,538,252
RAC: 0
Level
Glu
Message 46708 - Posted: 21 Mar 2017, 11:54:32 UTC

I've had 9 Pablos fail on me. I'm only receiving short runs now. Returning to Einstein until this is sorted.
Betting Slip
Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Level
Phe
Message 46709 - Posted: 21 Mar 2017, 11:54:40 UTC - in response to Message 46706.  

Since yesterday I have had 10 Pablo long work units fail with an "unknown" error, and now I don't have any work units to process.
Does anyone have an idea of what to do? Any advice (other than processing short units)?


There is nothing you can do; just wait 24 hrs. There is no problem with the "Daily Quota"; there is a massive problem with faulty WUs being dumped into the queue.

The system does not appear to be monitored to any great effect; if it were, somebody would notice and cancel the WUs before this problem occurred.
Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Level
Trp
Message 46711 - Posted: 21 Mar 2017, 12:09:04 UTC - in response to Message 46709.  

The system does not appear to be monitored to any great effect; if it were, somebody would notice and cancel the WUs before this problem occurred.
It seems that the workunits went wrong at a certain point, but it wasn't clear that it would affect every batch. It took a couple of hours for the error to spread widely; now the situation is clear. It's very easy to be wise in retrospect.
BTW I've sent a notification email to a member of the staff.
Erich56
Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Level
Trp
Message 46714 - Posted: 21 Mar 2017, 12:42:22 UTC - in response to Message 46709.  

There is nothing you can do; just wait 24 hrs.

The bad thing about this is that the GPUGRID people most probably cannot "reset" this 24-hour lock.
I guess quite a number of crunchers would be pleased if this were possible.
Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 261
Level
Trp
Message 46716 - Posted: 21 Mar 2017, 12:50:43 UTC - in response to Message 46714.  

There is nothing you can do; just wait 24 hrs.

The bad thing about this is that the GPUGRID people most probably cannot "reset" this 24-hour lock.
I guess quite a number of crunchers would be pleased if this were possible.

They have to cure the underlying problem first - that's the priority when something like this happens.
Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Level

Message 46718 - Posted: 21 Mar 2017, 13:15:49 UTC

Oh crappity crap. I think I messed up the adaptive. Will check now. Thanks for pointing it out!
Erich56
Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Level
Trp
Message 46719 - Posted: 21 Mar 2017, 14:13:04 UTC - in response to Message 46716.  

They have to cure the underlying problem first - that's the priority when something like this happens.

That's clear to me: first the problem must be fixed, and then, if possible, some kind of "reset" should be made so that all crunchers whose downloads were stopped can download new tasks again.

Although I am afraid that this will not be possible.
Stefan
Project administrator
Project developer
Project tester
Project scientist
Joined: 5 Mar 13
Posts: 348
Credit: 0
RAC: 0
Level

Message 46720 - Posted: 21 Mar 2017, 14:21:55 UTC

Well, these broken tasks will have to run their course. But they crash on start, so they should be gone very quickly now. I fixed the bugs and we will restart the adaptives in a moment.