BAD PABLO_p53 WUs

Author	Message
Retvari Zoltan (Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0)
Message 46745 - Posted: 22 Mar 2017, 12:22:34 UTC

To my surprise, the faulty / working ratio is much better than I expected. I ran a test with my dummy host again, and only 18 of 48 workunits were faulty. I've received some of the new (working) workunits on my live hosts too, so the daily quota will be recovered in a couple of days.

Erich56 (Joined: 1 Jan 15 · Posts: 1168 · Credit: 12,317,898,501 · RAC: 25,299)
Message 46746 - Posted: 22 Mar 2017, 12:24:56 UTC - in response to Message 46745.

[quote]... so the daily quota will be recovered in a couple of days.[/quote]

still it's a shame that there is no other mechanism in place for cases like the present one :-(

Retvari Zoltan
Message 46747 - Posted: 22 Mar 2017, 12:26:17 UTC - in response to Message 46744.

[quote][quote]You are likely to be suffering from a quota of one long task per day: if you allow short tasks in your preferences, it is possible (but rare) to get short tasks allocated[/quote]

that's what BOINC is showing me:

22/03/2017 13:12:42 | GPUGRID | No tasks are available for Short runs (2-3 hours on fastest card)
22/03/2017 13:12:42 | GPUGRID | No tasks are available for Long runs (8-12 hours on fastest card)
22/03/2017 13:12:42 | GPUGRID | This computer has finished a daily quota of 1 tasks

So I doubt that could get short runs. (your assumption is correct: I should be suffering on a long runs quota only, since no short runs were selected when the "accident" happened).[/quote]

The short queue is empty, and the scheduler won't send you from the long queue, because of the host's decreased daily quota. You should wait for a couple of hours.
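Editor's sketch: to spot the quota message in a saved BOINC event log, something like the following could work. It assumes the "date time | project | message" layout seen in the lines quoted above; the function name `quota_hits` and the regular expressions are illustrative, not part of BOINC.

```python
import re

# Layout assumed from the quoted excerpt: "date time | project | message".
LOG_LINE = re.compile(r"^(?P<stamp>\S+ \S+) \| (?P<project>[^|]+?) \| (?P<message>.*)$")
# Wording taken from the quoted log line; other client versions may phrase it differently.
QUOTA = re.compile(r"finished a daily quota of (?P<n>\d+) tasks?")


def quota_hits(lines):
    """Return (project, quota) pairs for every line reporting an exhausted daily quota."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if not m:
            continue  # not an event-log line in the assumed layout
        q = QUOTA.search(m.group("message"))
        if q:
            hits.append((m.group("project").strip(), int(q.group("n"))))
    return hits


sample = [
    "22/03/2017 13:12:42 | GPUGRID | No tasks are available for Short runs (2-3 hours on fastest card)",
    "22/03/2017 13:12:42 | GPUGRID | No tasks are available for Long runs (8-12 hours on fastest card)",
    "22/03/2017 13:12:42 | GPUGRID | This computer has finished a daily quota of 1 tasks",
]
print(quota_hits(sample))
```

A non-empty result means the host has hit its daily quota and, as the post says, the only remedy is to wait.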

Retvari Zoltan
Message 46748 - Posted: 22 Mar 2017, 12:50:53 UTC - in response to Message 46746.

[quote][quote]... so the daily quota will be recovered in a couple of days.[/quote]

still it's a shame that there is no other mechanism in place for cases like the present one :-([/quote]

You can't prepare a system for every abnormal situation. BTW, you'll still receive workunits while your daily quota is below its maximum. The only important factor is that a host should not receive many faulty workunits in a row, because that will "blacklist" the host for a day. This is a pretty good automatic mechanism for minimizing the effects of a faulty host, as such a host would exhaust the queues in a very short time if nothing limited the work assigned to it. Too bad that this generic error, combined with this self-defense, got all of our hosts blacklisted, but there's no defense against this self-defense. I've realized that we are the "device" that could keep the project running in such regrettable situations.
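Editor's sketch: the thread does not spell out the exact quota rule, so the following is only a toy model of the self-defense described above. It assumes that each failed task lowers a host's daily quota by one (never below 1) and that each valid task doubles it up to a ceiling; both rules, the ceiling of 50, and the class name `HostQuota` are assumptions for illustration, not taken from the BOINC server code.

```python
class HostQuota:
    """Toy model of a per-host daily task quota (illustrative only)."""

    def __init__(self, max_quota=50):  # 50 is an arbitrary illustrative ceiling
        self.max_quota = max_quota
        self.quota = max_quota

    def report_error(self):
        # Assumed rule: each failed task lowers the quota by one, never below 1.
        self.quota = max(1, self.quota - 1)

    def report_success(self):
        # Assumed rule: each valid task doubles the quota, capped at the ceiling.
        self.quota = min(self.max_quota, self.quota * 2)


host = HostQuota()
for _ in range(60):          # a long run of faulty workunits
    host.report_error()
print(host.quota)            # the host is down to one task per day

successes = 0
while host.quota < host.max_quota:
    host.report_success()    # good workunits rebuild the quota
    successes += 1
print(successes)
```

Under these assumptions a healthy host recovers after a handful of good results, while a host that fails every task stays limited to one task per day, which is why a batch of faulty workunits affected every host at once.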

WPrion (Joined: 30 Apr 13 · Posts: 106 · Credit: 3,814,987,860 · RAC: 78)
Message 46749 - Posted: 22 Mar 2017, 12:56:02 UTC

When this is all over there should be a publication badge for participation in faulty Pablo WUs ;-)

Retvari Zoltan
Message 46750 - Posted: 22 Mar 2017, 13:12:20 UTC - in response to Message 46749.

[quote]When this is all over there should be a publication badge for participation in faulty Pablo WUs ;-)[/quote]

Indeed. This should be a special one, with special design. I think of a crashed bug. :)

Richard Haselgrove (Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0)
Message 46751 - Posted: 22 Mar 2017, 13:35:28 UTC - in response to Message 46747.

[quote]The short queue is empty, and the scheduler won't send you from the long queue, because of the host's decreased daily quota. You should wait for a couple of hours.[/quote]

Sometimes you get a working long task, sometimes you get a faulty long task, sometimes you get a short task - it's very much the luck of the draw at the moment. I've had all three outcomes within the last hour.

Erich56
Message 46752 - Posted: 22 Mar 2017, 13:53:18 UTC - in response to Message 46751.

[quote]... sometimes you get a faulty long task[/quote]

this leads me to repeat my question: why were/are the faulty ones not eliminated from the queue?

Richard Haselgrove
Message 46753 - Posted: 22 Mar 2017, 14:13:09 UTC - in response to Message 46752.

[quote]why were/are the faulty ones not eliminated from the queue?[/quote]

My guess - and it is only a guess - is that the currently-available staff are all biochemical researchers, rather than specialist database repairers. BOINC server code provides tools for researchers to submit jobs directly, but identifying faulty (and only faulty) workunits for cancellation is a tricky business. We've had cases in the past when batches of tasks have been cancelled en bloc, including tasks in the middle of an apparently viable run. That caused even more vociferous complaints (of wasted electricity) than the current forced diversion of BOINC resources to other backup projects.

Amateur meddling in technical matters (anything outside your personal professional skill) can cause more problems than it's worth. Stefan has owned up to making a mistake in preparing the workunit parameters: he has corrected that error, but he seems to have decided - wisely, in my opinion - not to risk dabbling in areas where he doesn't feel comfortable about his own level of expertise.

Erich56
Message 46754 - Posted: 22 Mar 2017, 14:29:14 UTC

@Richard: what you are saying sounds logical

Ken Florian (Joined: 4 May 12 · Posts: 56 · Credit: 1,832,989,878 · RAC: 0)
Message 46755 - Posted: 22 Mar 2017, 20:24:35 UTC

Though I once posted some good numbers to the project, I've been away for a while and lost track of how BOINC ought to work. I still do not have new tasks after my own set of failed tasks. Is there anything I need to do to "clear my name" so that I get tasks?

Richard Haselgrove
Message 46756 - Posted: 22 Mar 2017, 22:50:18 UTC

I've just picked up a 4th replication from workunit e34s5_e17s62p0f449-PABLO_p53_mut_7_DIS-0-1-RND8386. From the PABLO_p53 and the _4 at the end of the task name, I assumed the worst - but it's running just fine. Don't assume that every failure - even multiple failures - comes from a faulty workunit.

As to what to do about it - just allow/encourage your computer to request work once each day. Perhaps you will be lucky and get a good one at the next attempt, or you may end up with several more days' wait. It'll work out in the end.

Retvari Zoltan
Message 46757 - Posted: 23 Mar 2017, 8:12:20 UTC - in response to Message 46756.

[quote]I've just picked up a 4th replication from workunit e34s5_e17s62p0f449-PABLO_p53_mut_7_DIS-0-1-RND8386. From the PABLO_p53 and the _4 at the end of the task name, I assumed the worst - but it's running just fine. Don't assume that every failure - even multiple failures - comes from a faulty workunit.[/quote]

If the message

ERROR: file mdioload.cpp line 81: Unable to read bincoordfile

appears in the stderr.txt output of many of the previous tasks, then it's a faulty task. The one you've received has failed 4 times, for 3 different reasons (but none of them is the one above):

1st & 3rd:

<message>
process exited with code 201 (0xc9, -55)
</message>
<stderr_txt>
# Unable to initialise. Check permissions on /dev/nvidia* (err=100)
</stderr_txt>

2nd (that's the most mysterious):

<message>
process exited with code 212 (0xd4, -44)
</message>
<stderr_txt>
</stderr_txt>

4th:

<message>
(unknown error) - exit code -80 (0xffffffb0)
</message>
<stderr_txt>
...
# Access violation : progress made, try to restart
called boinc_finish
</stderr_txt>

BTW things are now back to normal (almost); some faulty workunits are still floating around.
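Editor's sketch: the rule of thumb above can be written as a small check over the stderr output of a workunit's earlier tasks. The marker string is taken from the post; the function name `looks_faulty` and the 50% threshold standing in for "many" are assumptions made for illustration.

```python
# Marker quoted in the post as the signature of the faulty PABLO_p53 batch.
FAULTY_MARKER = "Unable to read bincoordfile"


def looks_faulty(stderr_outputs, min_share=0.5):
    """Guess whether a workunit is faulty from its previous tasks' stderr texts.

    min_share is a hypothetical cut-off for "many": the share of earlier
    tasks that must show the marker before the workunit is blamed.
    """
    if not stderr_outputs:
        return False  # no history, nothing to judge by
    hits = sum(FAULTY_MARKER in text for text in stderr_outputs)
    return hits / len(stderr_outputs) >= min_share


# The four earlier failures listed in the post: none shows the marker.
previous = [
    "# Unable to initialise. Check permissions on /dev/nvidia* (err=100)",
    "",
    "# Unable to initialise. Check permissions on /dev/nvidia* (err=100)",
    "# Access violation : progress made, try to restart\ncalled boinc_finish",
]
print(looks_faulty(previous))
```

For the workunit discussed here the check comes out negative, matching the observation that its failures came from the hosts rather than from the workunit itself.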

PappaLitto (Joined: 21 Mar 16 · Posts: 513 · Credit: 4,673,458,277 · RAC: 0)
Message 46760 - Posted: 23 Mar 2017, 18:07:10 UTC

Has the problem been fixed?

Retvari Zoltan
Message 46761 - Posted: 23 Mar 2017, 18:54:14 UTC - in response to Message 46760.

[quote]Has the problem been fixed?[/quote]

Yes. There still could be some faulty workunits in the long queue, but those are not threatening the daily quota.

Bedrich Hajek (Joined: 28 Mar 09 · Posts: 490 · Credit: 11,739,145,728 · RAC: 826)
Message 46799 - Posted: 31 Mar 2017, 10:54:38 UTC

These error units are starting to disappear from the tasks pages. Soon, they will be all gone, nothing more than a memory. Good bye!!!!

Richard Haselgrove
Message 46803 - Posted: 31 Mar 2017, 17:32:53 UTC

Trouble is, I'm starting to see a new bad batch, like

e1s2_ubiquitin_50ns_1-ADRIA_FOLDGREED10_crystal_ss_contacts_50_ubiquitin_1-0-1-RND7532

I've seen failures for each of contacts_20, contacts_50, and contacts_100.

Jim1348 (Joined: 28 Jul 12 · Posts: 819 · Credit: 1,591,285,971 · RAC: 0)
Message 46804 - Posted: 31 Mar 2017, 17:54:16 UTC - in response to Message 46803.

I just got one an hour ago that failed after two seconds:

e1s2_ubiquitin_20ns_1-ADRIA_FOLDGREED10_crystal_ss_contacts_20_ubiquitin_6-0-1-RND9359

Richard Haselgrove
Message 46805 - Posted: 31 Mar 2017, 21:30:23 UTC

e1s9_ubiquitin_100ns_8-ADRIA_FOLDGREED10_crystal_ss_contacts_100_ubiquitin_4-0-2-RND2702 is running OK, so they're not all bad.

Loohi (Joined: 27 Aug 16 · Posts: 16 · Credit: 43,745,875 · RAC: 0)
Message 46806 - Posted: 1 Apr 2017, 3:58:45 UTC

Same here: 6 broken Adria WUs out of 8 in 12 hours so far, all failing immediately.