Message boards :
Number crunching :
BAD PABLO_p53 WUs
Message board moderation
| Author | Message |
|---|---|
BeyondSend message Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
So far I've had 23 of these bad PABLO_p53 WUs today. Think maybe they should be cancelled? |
|
Send message Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
|
|
Send message Joined: 28 Mar 09 Posts: 490 Credit: 11,731,645,728 RAC: 42 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
I also had 2 bad units from this bunch. The problem with canceling these units is that the error will stay with you forever, but if you let it run its course until it get 8 errors and becomes a "too many errors (may have a bug)" unit, it should in time disappear. |
|
Send message Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
The problem with canceling these units is that the error will stay with you forever, but if you let it run its course until it get 8 errors and becomes a "too many errors (may have a bug)" unit, it should in time disappear. That will eliminate the work unit, but I think Beyond is saying that they should cancel the entire series until they fix it. That will save a lot of people some time, though fortunately they error out quickly. |
koschiSend message Joined: 14 Aug 08 Posts: 127 Credit: 913,858,161 RAC: 11 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
Same here under Linux: ERROR: file mdioload.cpp line 81: Unable to read bincoordfile |
Retvari ZoltanSend message Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
My main host got its daily quota of long workunits reduced to 1 because it had too many failures (caused by this bad batch). Luckily there are short runs (and one other long run), so my main host is not completely shut off of this project. This is really annoying. This batch was working fine previously. |
|
Send message Joined: 30 Apr 13 Posts: 106 Credit: 3,805,237,860 RAC: 40 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
That will eliminate the work unit, but I think Beyond is saying that they should cancel the entire series until they fix it. That will save a lot of people some time, though fortunately they error out quickly. They do error quickly, but it kicked me into my daily quota limit right at the beginning of a new day. $#%@^! |
|
Send message Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
also bad here: for a few hours now, BOINC has not downloaded any new tasks, telling me that "the computer has finished a daily quota of 3 tasks" :-((( This means that no new task can be downloaded before March 22, right? |
|
Send message Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
also bad here: This means that no new tasks can be downloaded before March 22, right? The incident early this morning shows that the policy of the daily quota should be revisited quickly. In the specific case, it results in total nonsense. No idea how many people (I think: many) now are not able to download any new tasks for whole March 21. Rather bad thing. |
Retvari ZoltanSend message Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
This is a generic error, all long workunits failed on all of my hosts too overnight, so all of my hosts are processing short runs now, but the short queue is ran dry already. The incident early this morning shows that the policy of the daily quota should be revisited quickly.The daily quota would decrease to 1 in every case if the project supplies only failing workunits, there's no problem with that. The problem is the waaay too high ratio of bad workunits in the queue. |
|
Send message Joined: 22 Feb 14 Posts: 26 Credit: 672,639,304 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
Since yesterday I have 10 Pablo long work units with an "unknown" error, and now I don't have any work unit to process. Anyone has an idea of what to do? Any advice (other than process short units)? |
|
Send message Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
... so all of my hosts are processing short runs now, but the short queue is ran dry already. I switched to short runs in the early morning when, according to the Project Status Page, some were still available. However, the download of those was again refused referring to the "daily quota of 3 tasks" :-( |
|
Send message Joined: 11 Jan 13 Posts: 216 Credit: 846,538,252 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
I've had 9 Pablos fail on me. I'm only receiving short runs now. Returning to Einstein until this is sorted. |
|
Send message Joined: 5 Jan 09 Posts: 670 Credit: 2,498,095,550 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
Since yesterday I have 10 Pablo long work units with an "unknown" error, and now I don't have any work unit to process. There is nothing you can do just wait 24hrs. There is not a problem with the Daily Quota" there is a massive problem with the dumping into the queue of faulty WU's. The system does not appear to be monitored to any great effect, if it was somebody would notice and cancel the WU's before this problem occured. |
Retvari ZoltanSend message Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
The system does not appear to be monitored to any great effect, if it was somebody would notice and cancel the WU's before this problem occured.It seems that the workunits gone wrong at a certain point, but it wasn't clear that it would affect every batch. It took a couple of hours while the error spread out wide, now the situation is clear. It's very easy to be wise retrospectively. BTW I've sent a notification email to a member of the staff. |
|
Send message Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
There is nothing you can do just wait 24hrs. the bad thing on this is that the GPUGRID people most probably cannot "reset" this 24 hours-lock. I guess quite a number of crunchers would be pleased if this was possible. |
|
Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 261 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
There is nothing you can do just wait 24hrs. They have to cure the underlying problem first - that's the priority when something like this happens. |
|
Send message Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 Level ![]() Scientific publications ![]() |
Oh crappity crap. I think I messed up the adaptive. Will check now. Thanks for pointing it out! |
|
Send message Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1 Level ![]() Scientific publications ![]() ![]() ![]() ![]() ![]() ![]() ![]() ![]()
|
They have to cure the underlying problem first - that's the priority when something like this happens. that's clear to me - first the problem must be fixed, and then, if possible, some kind of "reset" should be made so that all crunchers for which the downloads were stopped could download new tasks again. Although I am afraid that this will not be possible. |
|
Send message Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 Level ![]() Scientific publications ![]() |
Well these broken tasks will have to run their course. But they crash on start so they should be gone very quickly now. I fixed the bugs and we will restart the adaptives in a moment |
©2025 Universitat Pompeu Fabra