BAD PABLO_p53 WUs

Message boards : Number crunching : BAD PABLO_p53 WUs

Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 46745 - Posted: 22 Mar 2017, 12:22:34 UTC

To my surprise, the faulty / working ratio is much better than I expected.
I ran a test with my dummy host again, and only 18 of 48 workunits were faulty.
I've received some of the new (working) workunits on my live hosts too, so the daily quota will be recovered in a couple of days.
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Message 46746 - Posted: 22 Mar 2017, 12:24:56 UTC - in response to Message 46745.  

... so the daily quota will be recovered in a couple of days.

still it's a shame that there is no other mechanism in place for cases like the present one :-(
Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 46747 - Posted: 22 Mar 2017, 12:26:17 UTC - in response to Message 46744.  

You are likely to be suffering from a quota of one long task per day: if you allow short tasks in your preferences, it is possible (but rare) to get short tasks allocated

that's what BOINC is showing me:

22/03/2017 13:12:42 | GPUGRID | No tasks are available for Short runs (2-3 hours on fastest card)
22/03/2017 13:12:42 | GPUGRID | No tasks are available for Long runs (8-12 hours on fastest card)
22/03/2017 13:12:42 | GPUGRID | This computer has finished a daily quota of 1 tasks

So I doubt that I could get short runs.
(your assumption is correct: I should be suffering from a long-runs quota only, since no short runs were selected when the "accident" happened).
The short queue is empty, and the scheduler won't send you work from the long queue, because of the host's decreased daily quota. You should wait for a couple of hours.
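
For anyone who wants to check their own event log for the same condition, here is a minimal Python sketch. It assumes the `timestamp | project | message` layout of the three lines quoted above; the function names are made up for illustration and are not part of any BOINC tool.

```python
import re
from typing import NamedTuple, Optional


class LogLine(NamedTuple):
    timestamp: str
    project: str
    message: str


def parse_log_line(line: str) -> Optional[LogLine]:
    """Split one 'timestamp | project | message' line; None if it doesn't fit."""
    parts = [p.strip() for p in line.split("|", 2)]
    if len(parts) != 3:
        return None
    return LogLine(*parts)


# Pattern taken from the wording of the quoted log message.
QUOTA_RE = re.compile(r"finished a daily quota of (\d+) tasks?")


def daily_quota_hit(lines) -> Optional[int]:
    """Return the quota named in the first matching line, or None."""
    for line in lines:
        parsed = parse_log_line(line)
        if parsed is None:
            continue
        match = QUOTA_RE.search(parsed.message)
        if match:
            return int(match.group(1))
    return None


log = [
    "22/03/2017 13:12:42 | GPUGRID | No tasks are available for Short runs (2-3 hours on fastest card)",
    "22/03/2017 13:12:42 | GPUGRID | No tasks are available for Long runs (8-12 hours on fastest card)",
    "22/03/2017 13:12:42 | GPUGRID | This computer has finished a daily quota of 1 tasks",
]
print(daily_quota_hit(log))
```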
Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 46748 - Posted: 22 Mar 2017, 12:50:53 UTC - in response to Message 46746.  

... so the daily quota will be recovered in a couple of days.

still it's a shame that there is no other mechanism in place for cases like the present one :-(
You can't prepare a system for every abnormal situation. BTW you'll still receive workunits while your daily quota is below its maximum. The only important factor is that a host should not receive many faulty workunits in a row, because that will "blacklist" the host for a day. This is a pretty good automatic mechanism for minimizing the effects of a faulty host, as such a host would exhaust the queues in a very short time if nothing limited the work assigned to it. Too bad that this generic error, combined with this self-defense, got all of our hosts blacklisted, but there's no defense against this self-defense. I've realized that we are the "device" that could keep this project running in such regrettable situations.
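
To make the self-defense mechanism above a bit more concrete, here is a toy model in Python. The specific rule (subtract one from the host's daily quota per failed task, double it per successful task, never drop below 1) and the ceiling of 50 are assumptions chosen for illustration; they are not figures from this thread, and the real scheduler may behave differently.

```python
from dataclasses import dataclass


@dataclass
class HostQuota:
    ceiling: int = 50  # assumed project-wide maximum tasks per day
    quota: int = 50    # current per-day allowance for this host

    def report(self, success: bool) -> int:
        """Update the quota after one returned task and return the new value."""
        if success:
            # Assumed recovery rule: double, capped at the ceiling.
            self.quota = min(self.ceiling, self.quota * 2)
        else:
            # Assumed penalty rule: minus one, never below one task per day.
            self.quota = max(1, self.quota - 1)
        return self.quota


def simulate(outcomes, ceiling: int = 50):
    """Return the quota after each outcome (True = success, False = error)."""
    host = HostQuota(ceiling=ceiling, quota=ceiling)
    return [host.report(ok) for ok in outcomes]


# A bad batch: 60 failures in a row, followed by six good tasks.
history = simulate([False] * 60 + [True] * 6)
print(history[59])   # 1
print(history[60:])  # [2, 4, 8, 16, 32, 50]
```

Under these assumed rules a run of faulty workunits drives every host down to one task per day, whether or not the host itself is at fault, which is the situation described above.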
WPrion

Joined: 30 Apr 13
Posts: 106
Credit: 3,805,237,860
RAC: 40
Message 46749 - Posted: 22 Mar 2017, 12:56:02 UTC

When this is all over there should be a publication badge for participation in faulty Pablo WUs ;-)
Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 46750 - Posted: 22 Mar 2017, 13:12:20 UTC - in response to Message 46749.  

When this is all over there should be a publication badge for participation in faulty Pablo WUs ;-)
Indeed. This should be a special one, with a special design. I'm thinking of a crashed bug. :)
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 261
Message 46751 - Posted: 22 Mar 2017, 13:35:28 UTC - in response to Message 46747.  

The short queue is empty, and the scheduler won't send you work from the long queue, because of the host's decreased daily quota. You should wait for a couple of hours.

Sometimes you get a working long task, sometimes you get a faulty long task, sometimes you get a short task - it's very much the luck of the draw at the moment. I've had all three outcomes within the last hour.
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Message 46752 - Posted: 22 Mar 2017, 13:53:18 UTC - in response to Message 46751.  

... sometimes you get a faulty long task

This leads me to repeat my question: why were/are the faulty ones not eliminated from the queue?
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 261
Message 46753 - Posted: 22 Mar 2017, 14:13:09 UTC - in response to Message 46752.  

why were/are the faulty ones not eliminated from the queue?

My guess - and it is only a guess - is that the currently-available staff are all biochemical researchers, rather than specialist database repairers. BOINC server code provides tools for researchers to submit jobs directly, but identifying faulty (and only faulty) workunits for cancellation is a tricky business. We've had cases in the past when batches of tasks have been cancelled en bloc, including tasks in the middle of an apparently viable run. That caused even more vociferous complaints (of wasted electricity) than the current forced diversion of BOINC resources to other backup projects.

Amateur meddling in technical matters (anything outside your personal professional skill) can cause more problems than it's worth. Stefan has owned up to making a mistake in preparing the workunit parameters: he has corrected that error, but he seems to have decided - wisely, in my opinion - not to risk dabbling in areas where he doesn't feel comfortable about his own level of expertise.
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Message 46754 - Posted: 22 Mar 2017, 14:29:14 UTC

@Richard: what you are saying sounds logical
Ken Florian

Joined: 4 May 12
Posts: 56
Credit: 1,832,989,878
RAC: 0
Message 46755 - Posted: 22 Mar 2017, 20:24:35 UTC

Though I once posted some good numbers to the project, I've been away for a while and lost track of how BOINC ought to work.

I still do not have new tasks after my own set of failed tasks.

Is there anything I need to do to "clear my name" so that I get tasks?
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 261
Message 46756 - Posted: 22 Mar 2017, 22:50:18 UTC

I've just picked up a 4th replication from workunit e34s5_e17s62p0f449-PABLO_p53_mut_7_DIS-0-1-RND8386. From the PABLO_p53 and the _4 at the end of the task name, I assumed the worst - but it's running just fine. Don't assume that every failure - even multiple failures - comes from a faulty workunit.

As to what to do about it - just allow/encourage your computer to request work once each day. Perhaps you will be lucky and get a good one at the next attempt, or you may end up with several more days' wait. It'll work out in the end.
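
The way the task name is read above (a batch tag such as PABLO_p53 inside the workunit name, plus a trailing `_4`) can be sketched in a few lines of Python. That a task name is simply the workunit name followed by `_N` is inferred from this post; the helper names are hypothetical.

```python
def split_task_name(task_name: str):
    """Split 'workunitname_N' into (workunit_name, N)."""
    workunit, sep, suffix = task_name.rpartition("_")
    if not sep or not suffix.isdigit():
        raise ValueError(f"no numeric suffix in {task_name!r}")
    return workunit, int(suffix)


def batch_tag(workunit: str, known_tags=("PABLO_p53",)):
    """Return the first known batch tag found in the workunit name, else None."""
    for tag in known_tags:
        if tag in workunit:
            return tag
    return None


task = "e34s5_e17s62p0f449-PABLO_p53_mut_7_DIS-0-1-RND8386_4"
wu, index = split_task_name(task)
print(wu, index, batch_tag(wu))
```

As the post says, a matching tag and a high suffix are only a hint: this particular task ran fine.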
Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 46757 - Posted: 23 Mar 2017, 8:12:20 UTC - in response to Message 46756.  

I've just picked up a 4th replication from workunit e34s5_e17s62p0f449-PABLO_p53_mut_7_DIS-0-1-RND8386. From the PABLO_p53 and the _4 at the end of the task name, I assumed the worst - but it's running just fine. Don't assume that every failure - even multiple failures - comes from a faulty workunit.
If the message
ERROR: file mdioload.cpp line 81: Unable to read bincoordfile
appears in the stderr.txt output of many of the previous tasks, then it's a faulty workunit.
The one you've received has failed 4 times, for 3 different reasons (but none of them is the one above):

1st & 3rd:
<message>
process exited with code 201 (0xc9, -55)
</message>
<stderr_txt>
# Unable to initialise. Check permissions on /dev/nvidia* (err=100)

</stderr_txt>

2nd (that's the most mysterious):
<message>
process exited with code 212 (0xd4, -44)
</message>
<stderr_txt>

</stderr_txt>

4th:
<message>
(unknown error) - exit code -80 (0xffffffb0)
</message>
<stderr_txt>
...
# Access violation : progress made, try to restart
called boinc_finish

</stderr_txt>
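
A side note on the numbers in parentheses in the messages above: they appear to be the same exit code written in hexadecimal and reinterpreted as a signed value. A short Python sketch of that arithmetic follows; the 8-bit and 32-bit widths are my reading of the figures, not something the messages state.

```python
def as_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def as_unsigned_hex(value: int, bits: int) -> str:
    """Hex form of the low `bits` bits of value."""
    return hex(value & ((1 << bits) - 1))


for code in (201, 212):
    # Exit codes above 127 wrap to a negative number as a signed byte.
    print(code, as_unsigned_hex(code, 8), as_signed(code, 8))

# The negative code from the 4th failure, shown as a 32-bit pattern.
print(-80, as_unsigned_hex(-80, 32))
```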

BTW things are now (almost) back to normal; some faulty workunits are still floating around.
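
The rule of thumb from the start of this post can be written down as a small Python check. The error string is the one quoted above; the 50% threshold standing in for "many" and the function names are assumptions for illustration only.

```python
SIGNATURE = "ERROR: file mdioload.cpp line 81: Unable to read bincoordfile"


def count_signature(stderr_texts, signature: str = SIGNATURE) -> int:
    """Count how many stderr outputs contain the signature line."""
    return sum(1 for text in stderr_texts if signature in text)


def looks_like_faulty_workunit(stderr_texts, min_fraction: float = 0.5) -> bool:
    """True if the signature shows up in at least min_fraction of prior tasks."""
    texts = list(stderr_texts)
    if not texts:
        return False
    return count_signature(texts) / len(texts) >= min_fraction


# stderr of the four earlier failures listed in this post.
previous = [
    "# Unable to initialise. Check permissions on /dev/nvidia* (err=100)",
    "",
    "# Unable to initialise. Check permissions on /dev/nvidia* (err=100)",
    "# Access violation : progress made, try to restart\ncalled boinc_finish",
]
print(looks_like_faulty_workunit(previous))
```

None of the four earlier failures carries the signature, so the check does not flag this workunit.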
PappaLitto

Joined: 21 Mar 16
Posts: 513
Credit: 4,673,458,277
RAC: 0
Message 46760 - Posted: 23 Mar 2017, 18:07:10 UTC

Has the problem been fixed?
Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 46761 - Posted: 23 Mar 2017, 18:54:14 UTC - in response to Message 46760.  

Has the problem been fixed?
Yes.
There still could be some faulty workunits in the long queue, but those are not threatening the daily quota.
Bedrich Hajek

Joined: 28 Mar 09
Posts: 490
Credit: 11,731,645,728
RAC: 42
Message 46799 - Posted: 31 Mar 2017, 10:54:38 UTC

These error units are starting to disappear from the tasks pages. Soon they will all be gone, nothing more than a memory.

Goodbye!!!!
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 261
Message 46803 - Posted: 31 Mar 2017, 17:32:53 UTC

Trouble is, I'm starting to see a new bad batch, like

e1s2_ubiquitin_50ns_1-ADRIA_FOLDGREED10_crystal_ss_contacts_50_ubiquitin_1-0-1-RND7532

I've seen failures for each of contacts_20, contacts_50, and contacts_100
Jim1348

Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 46804 - Posted: 31 Mar 2017, 17:54:16 UTC - in response to Message 46803.  

I just got one an hour ago that failed after two seconds:
e1s2_ubiquitin_20ns_1-ADRIA_FOLDGREED10_crystal_ss_contacts_20_ubiquitin_6-0-1-RND9359

Loohi

Joined: 27 Aug 16
Posts: 16
Credit: 43,745,875
RAC: 0
Message 46806 - Posted: 1 Apr 2017, 3:58:45 UTC

Same here, 6 broken Adria WUs out of 8 in 12 hours so far. Failing immediately.

©2025 Universitat Pompeu Fabra