BAD PABLO_p53 WUs

Author	Message
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0	Message 46695 - Posted: 20 Mar 2017, 20:42:47 UTC So far I've had 23 of these bad PABLO_p53 WUs today. Think maybe they should be cancelled?

Jim1348 Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0	Message 46696 - Posted: 20 Mar 2017, 21:32:08 UTC - in response to Message 46695. Same for me. http://www.gpugrid.net/forum_thread.php?id=4513&nowrap=true#46692

Bedrich Hajek Joined: 28 Mar 09 Posts: 490 Credit: 11,850,145,728 RAC: 1,066	Message 46697 - Posted: 20 Mar 2017, 21:43:59 UTC I also had 2 bad units from this bunch. The problem with canceling these units is that the error will stay with you forever, but if you let it run its course until it gets 8 errors and becomes a "too many errors (may have a bug)" unit, it should in time disappear.

Jim1348 Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0	Message 46698 - Posted: 20 Mar 2017, 21:47:27 UTC - in response to Message 46697. Last modified: 20 Mar 2017, 21:47:40 UTC "The problem with canceling these units is that the error will stay with you forever, but if you let it run its course until it gets 8 errors and becomes a 'too many errors (may have a bug)' unit, it should in time disappear." That will eliminate the work unit, but I think Beyond is saying that they should cancel the entire series until they fix it. That will save a lot of people some time, though fortunately they error out quickly.

koschi Joined: 14 Aug 08 Posts: 127 Credit: 973,858,161 RAC: 601	Message 46699 - Posted: 20 Mar 2017, 22:43:27 UTC Same here under Linux: ERROR: file mdioload.cpp line 81: Unable to read bincoordfile

Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0	Message 46701 - Posted: 21 Mar 2017, 1:42:13 UTC My main host got its daily quota of long workunits reduced to 1 because it had too many failures (caused by this bad batch). Luckily there are short runs (and one other long run), so my main host is not completely shut out of this project. This is really annoying. This batch was working fine previously.

WPrion Joined: 30 Apr 13 Posts: 109 Credit: 3,977,737,860 RAC: 2,035	Message 46702 - Posted: 21 Mar 2017, 4:20:17 UTC - in response to Message 46698. "That will eliminate the work unit, but I think Beyond is saying that they should cancel the entire series until they fix it. That will save a lot of people some time, though fortunately they error out quickly." They do error quickly, but it kicked me into my daily quota limit right at the beginning of a new day. $#%@^!

Erich56 Joined: 1 Jan 15 Posts: 1171 Credit: 12,662,148,501 RAC: 3,588	Message 46703 - Posted: 21 Mar 2017, 4:23:57 UTC Also bad here: for a few hours now, BOINC has not downloaded any new tasks, telling me that "the computer has finished a daily quota of 3 tasks" :-((( This means that no new tasks can be downloaded before March 22, right?

Erich56 Joined: 1 Jan 15 Posts: 1171 Credit: 12,662,148,501 RAC: 3,588	Message 46704 - Posted: 21 Mar 2017, 7:27:43 UTC - in response to Message 46703. "Also bad here: for a few hours now, BOINC has not downloaded any new tasks, telling me that 'the computer has finished a daily quota of 3 tasks' :-((( This means that no new tasks can be downloaded before March 22, right?" The incident early this morning shows that the policy of the daily quota should be revisited quickly. In this specific case, it results in total nonsense. No idea how many people (I think: many) are now unable to download any new tasks for the whole of March 21. Rather bad thing.

Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0	Message 46705 - Posted: 21 Mar 2017, 9:49:54 UTC Last modified: 21 Mar 2017, 9:50:23 UTC This is a generic error: all long workunits failed on all of my hosts overnight too, so all of my hosts are processing short runs now, but the short queue has already run dry. "The incident early this morning shows that the policy of the daily quota should be revisited quickly. In this specific case, it results in total nonsense." The daily quota would decrease to 1 in every case if the project supplies only failing workunits; there's no problem with that. The problem is the waaay too high ratio of bad workunits in the queue.

Lluis Joined: 22 Feb 14 Posts: 26 Credit: 672,639,304 RAC: 0	Message 46706 - Posted: 21 Mar 2017, 10:52:52 UTC - in response to Message 46705. Since yesterday I have had 10 Pablo long work units with an "unknown" error, and now I don't have any work unit to process. Does anyone have an idea of what to do? Any advice (other than processing short units)?

Erich56 Joined: 1 Jan 15 Posts: 1171 Credit: 12,662,148,501 RAC: 3,588	Message 46707 - Posted: 21 Mar 2017, 11:32:47 UTC - in response to Message 46705. "... so all of my hosts are processing short runs now, but the short queue has already run dry." I switched to short runs in the early morning when, according to the Project Status Page, some were still available. However, the download of those was again refused, referring to the "daily quota of 3 tasks" :-(

Matt Joined: 11 Jan 13 Posts: 216 Credit: 846,538,252 RAC: 0	Message 46708 - Posted: 21 Mar 2017, 11:54:32 UTC I've had 9 Pablos fail on me. I'm only receiving short runs now. Returning to Einstein until this is sorted.

Betting Slip Joined: 5 Jan 09 Posts: 670 Credit: 2,498,095,550 RAC: 0	Message 46709 - Posted: 21 Mar 2017, 11:54:40 UTC - in response to Message 46706. "Since yesterday I have had 10 Pablo long work units with an 'unknown' error, and now I don't have any work unit to process. Does anyone have an idea of what to do? Any advice (other than processing short units)?" There is nothing you can do, just wait 24 hrs. There is not a problem with the "Daily Quota"; there is a massive problem with the dumping of faulty WUs into the queue. The system does not appear to be monitored to any great effect; if it was, somebody would notice and cancel the WUs before this problem occurred.

Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0	Message 46711 - Posted: 21 Mar 2017, 12:09:04 UTC - in response to Message 46709. "The system does not appear to be monitored to any great effect; if it was, somebody would notice and cancel the WUs before this problem occurred." It seems that the workunits went wrong at a certain point, but it wasn't clear that it would affect every batch. It took a couple of hours for the error to spread widely; now the situation is clear. It's very easy to be wise in retrospect. BTW I've sent a notification email to a member of the staff.

Erich56 Joined: 1 Jan 15 Posts: 1171 Credit: 12,662,148,501 RAC: 3,588	Message 46714 - Posted: 21 Mar 2017, 12:42:22 UTC - in response to Message 46709. "There is nothing you can do, just wait 24 hrs." The bad thing about this is that the GPUGRID people most probably cannot "reset" this 24-hour lock. I guess quite a number of crunchers would be pleased if this was possible.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 46716 - Posted: 21 Mar 2017, 12:50:43 UTC - in response to Message 46714. "There is nothing you can do, just wait 24 hrs." "The bad thing about this is that the GPUGRID people most probably cannot 'reset' this 24-hour lock. I guess quite a number of crunchers would be pleased if this was possible." They have to cure the underlying problem first - that's the priority when something like this happens.

Stefan Project administrator Project developer Project tester Project scientist Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0	Message 46718 - Posted: 21 Mar 2017, 13:15:49 UTC Oh crappity crap. I think I messed up the adaptive. Will check now. Thanks for pointing it out!

Erich56 Joined: 1 Jan 15 Posts: 1171 Credit: 12,662,148,501 RAC: 3,588	Message 46719 - Posted: 21 Mar 2017, 14:13:04 UTC - in response to Message 46716. "They have to cure the underlying problem first - that's the priority when something like this happens." That's clear to me - first the problem must be fixed, and then, if possible, some kind of "reset" should be made so that all crunchers for whom the downloads were stopped could download new tasks again. Although I am afraid that this will not be possible.

Stefan Project administrator Project developer Project tester Project scientist Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0	Message 46720 - Posted: 21 Mar 2017, 14:21:55 UTC Well, these broken tasks will have to run their course. But they crash on start, so they should be gone very quickly now. I fixed the bugs and we will restart the adaptives in a moment.