Message boards : Number crunching : BAD PABLO_p53 WUs
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0

To my surprise, the faulty/working ratio is much better than I expected. I ran a test with my dummy host again, and only 18 of 48 workunits were faulty. I've received some of the new (working) workunits on my live hosts too, so the daily quota will recover in a couple of days.
Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1

> ... so the daily quota will be recovered in a couple of days.

Still, it's a shame that there is no other mechanism in place for cases like the present one :-(
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0

The short queue is empty, and the scheduler won't send you from the long queue, because of the host's decreased daily quota. You should wait for a couple of hours.

You are likely to be suffering from a quota of one long task per day; if you allow short tasks in your preferences, it is possible (but rare) to get short tasks allocated.
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0

You can't prepare a system for every abnormal situation. BTW, you will still receive workunits while your daily quota is lower than its maximum. The important factor is that a host should not receive many faulty workunits in a row, because that will "blacklist" the host for a day. This is a pretty good automatic mechanism for minimizing the effects of a faulty host, as such a host would exhaust the queues in a very short time if nothing limited the work assigned to it. Too bad that this generic error, combined with this self-defense, got all of our hosts blacklisted, and there's no defense against this self-defense. I've realized that we are the "device" that can put this project into such regrettable situations.
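The quota mechanism described above can be sketched as a small model: repeated errors shrink a host's daily quota (effectively "blacklisting" it for the day), while valid results restore it, which is why the quota "recovers in a couple of days". This is an editorial illustration, not the actual GPUGRID/BOINC server code; the real scheduler's rules and constants vary by version, and `MAX_QUOTA` here is an assumed figure.

```python
MAX_QUOTA = 32  # hypothetical per-host daily cap


class HostQuota:
    """Toy model of a per-host daily work quota (not real BOINC code)."""

    def __init__(self):
        self.quota = MAX_QUOTA

    def on_error_result(self):
        # Each failed result halves the quota, but never below 1, so a
        # host returning only errors is throttled to roughly one task
        # per day rather than draining the whole queue.
        self.quota = max(1, self.quota // 2)

    def on_valid_result(self):
        # Each good result doubles the quota back toward the cap.
        self.quota = min(MAX_QUOTA, self.quota * 2)


host = HostQuota()
for _ in range(10):      # a burst of faulty workunits in a row
    host.on_error_result()
print(host.quota)        # throttled down to 1

for _ in range(3):       # three good results afterwards
    host.on_valid_result()
print(host.quota)        # recovering: 8
```

Under this model a single bad batch drives every affected host to the floor, exactly the situation complained about in this thread, and only a trickle of good results can lift the quota again.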
Joined: 30 Apr 13 · Posts: 106 · Credit: 3,805,237,860 · RAC: 40

When this is all over there should be a publication badge for participation in faulty Pablo WUs ;-)
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0

> When this is all over there should be a publication badge for participation in faulty Pablo WUs ;-)

Indeed. It should be a special one, with a special design. I'm thinking of a crashed bug. :)
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261

> The short queue is empty, and the scheduler won't send you from the long queue, because of the host's decreased daily quota. You should wait for a couple of hours.

Sometimes you get a working long task, sometimes you get a faulty long task, sometimes you get a short task - it's very much the luck of the draw at the moment. I've had all three outcomes within the last hour.
Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1

> ... sometimes you get a faulty long task

This leads me to repeat my question: why were/are the faulty ones not eliminated from the queue?
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261

> why were/are the faulty ones not eliminated from the queue?

My guess - and it is only a guess - is that the currently available staff are all biochemical researchers, rather than specialist database repairers. The BOINC server code provides tools for researchers to submit jobs directly, but identifying faulty (and only faulty) workunits for cancellation is a tricky business. We've had cases in the past where batches of tasks were cancelled en bloc, including tasks in the middle of an apparently viable run. That caused even more vociferous complaints (of wasted electricity) than the current forced diversion of BOINC resources to other backup projects.

Amateur meddling in technical matters (anything outside your personal professional skill) can cause more problems than it's worth. Stefan has owned up to making a mistake in preparing the workunit parameters: he has corrected that error, but he seems to have decided - wisely, in my opinion - not to risk dabbling in areas where he doesn't feel comfortable about his own level of expertise.
Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1

@Richard: what you are saying sounds logical.
Joined: 4 May 12 · Posts: 56 · Credit: 1,832,989,878 · RAC: 0

Though I once posted some good numbers to the project, I've been away for a while and have lost track of how BOINC ought to work. I still do not have new tasks after my own set of failed tasks. Is there anything I need to do to "clear my name" so that I get tasks?
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261

I've just picked up a 4th replication from workunit e34s5_e17s62p0f449-PABLO_p53_mut_7_DIS-0-1-RND8386. From the PABLO_p53 and the _4 at the end of the task name, I assumed the worst - but it's running just fine. Don't assume that every failure - even multiple failures - comes from a faulty workunit.

As to what to do about it - just allow/encourage your computer to request work once each day. Perhaps you will be lucky and get a good one at the next attempt, or you may end up with several more days' wait. It'll work out in the end.
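The naming convention being read off here (the batch tag in the middle of the name, and the replication number appended after the workunit's RND suffix, with _0 as the first attempt) can be illustrated with a small parser. This is a hypothetical helper written for this thread, not part of any BOINC or GPUGRID tooling.

```python
import re

# A task name is the workunit name (ending in "-RND<digits>") plus an
# optional "_<n>" replication suffix; "_4" means the fifth attempt.
TASK_RE = re.compile(r"^(?P<wu>.+-RND\d+)(?:_(?P<rep>\d+))?$")


def parse_task_name(task_name):
    """Split a GPUGRID-style task name into workunit, batch, replication."""
    m = TASK_RE.match(task_name)
    if not m:
        raise ValueError(f"unrecognized task name: {task_name}")
    wu = m.group("wu")
    rep = int(m.group("rep") or 0)
    # The batch tag is the second dash-separated field of the name.
    batch = wu.split("-")[1] if "-" in wu else ""
    return {"workunit": wu, "batch": batch, "replication": rep}


info = parse_task_name(
    "e34s5_e17s62p0f449-PABLO_p53_mut_7_DIS-0-1-RND8386_4")
print(info["batch"])        # PABLO_p53_mut_7_DIS
print(info["replication"])  # 4
```

A replication number of 4 means four earlier attempts already failed or timed out, which is why a high suffix on a PABLO_p53 task looked suspicious even though the workunit itself turned out to be fine.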
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0

> I've just picked up a 4th replication from workunit e34s5_e17s62p0f449-PABLO_p53_mut_7_DIS-0-1-RND8386. From the PABLO_p53 and the _4 at the end of the task name, I assumed the worst - but it's running just fine. Don't assume that every failure - even multiple failures - comes from a faulty workunit.

If the "ERROR: file mdioload.cpp line 81: Unable to read bincoordfile" message appears in many of the previous tasks' stderr.txt output files, then it's a faulty workunit. The one you've received has failed 4 times, for 3 different reasons (but none of them is the one above):

1st & 3rd:
<message> process exited with code 201 (0xc9, -55) </message>
<stderr_txt> # Unable to initialise. Check permissions on /dev/nvidia* (err=100) </stderr_txt>

2nd (that's the most mysterious):
<message> process exited with code 212 (0xd4, -44) </message>
<stderr_txt> </stderr_txt>

4th:
<message> (unknown error) - exit code -80 (0xffffffb0) </message>
<stderr_txt> ... # Access violation : progress made, try to restart called boinc_finish </stderr_txt>

BTW, things are now (almost) back to normal; some faulty workunits are still floating around.
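Checking your own past tasks for the error signature quoted above can be done with a short scan over stderr output files. This is a sketch under assumed conditions: `find_faulty_outputs` is a hypothetical helper, and the slots path shown depends entirely on your BOINC client installation.

```python
from pathlib import Path

# The signature Zoltan identifies as marking a faulty PABLO_p53 workunit.
SIGNATURE = "Unable to read bincoordfile"


def find_faulty_outputs(root):
    """Return every stderr.txt under `root` containing the signature."""
    hits = []
    for path in Path(root).rglob("stderr.txt"):
        try:
            text = path.read_text(errors="replace")
        except OSError:
            continue  # unreadable file; skip it rather than abort the scan
        if SIGNATURE in text:
            hits.append(path)
    return hits


# Hypothetical location of a Linux client's slot directories; adjust
# to wherever your installation keeps task output.
slots = Path("/var/lib/boinc-client/slots")
if slots.is_dir():
    for p in find_faulty_outputs(slots):
        print(p)
```

Matching on the exact error string keeps false positives out: as the post notes, other failure modes (driver permissions, access violations) produce different stderr content and do not indicate a faulty workunit.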
Joined: 21 Mar 16 · Posts: 513 · Credit: 4,673,458,277 · RAC: 0

Has the problem been fixed?
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0

> Has the problem been fixed?

Yes. There could still be some faulty workunits in the long queue, but those are not threatening the daily quota.
Joined: 28 Mar 09 · Posts: 490 · Credit: 11,731,645,728 · RAC: 42

These error units are starting to disappear from the tasks pages. Soon they will all be gone, nothing more than a memory. Good bye!!!!
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261

Trouble is, I'm starting to see a new bad batch, like e1s2_ubiquitin_50ns_1-ADRIA_FOLDGREED10_crystal_ss_contacts_50_ubiquitin_1-0-1-RND7532. I've seen failures for each of contacts_20, contacts_50, and contacts_100.
Joined: 28 Jul 12 · Posts: 819 · Credit: 1,591,285,971 · RAC: 0

I just got one an hour ago that failed after two seconds: e1s2_ubiquitin_20ns_1-ADRIA_FOLDGREED10_crystal_ss_contacts_20_ubiquitin_6-0-1-RND9359
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261

e1s9_ubiquitin_100ns_8-ADRIA_FOLDGREED10_crystal_ss_contacts_100_ubiquitin_4-0-2-RND2702 is running OK, so they're not all bad.
Joined: 27 Aug 16 · Posts: 16 · Credit: 43,745,875 · RAC: 0

Same here: 6 broken ADRIA WUs out of 8 in the last 12 hours, all failing immediately.
©2025 Universitat Pompeu Fabra