Remaining (Estimated) time is unusually high; duration correction factor unusually large

Message boards : Number crunching : Remaining (Estimated) time is unusually high; duration correction factor unusually large

TJ · Joined: 26 Jun 09 · Posts: 815 · Credit: 1,470,385,294 · RAC: 0
Message 32611 - Posted: 2 Sep 2013, 0:22:34 UTC - in response to Message 32609.  
Last modified: 2 Sep 2013, 0:22:52 UTC

The opposite happens too. I now have a NOELIA_KLEBEbeta-2-3... that has done 9% in 1 hour, yet claims it will finish in 5 minutes. And a HARVEY_TEST that should take about 15 minutes is estimated to run for 6h53m52s.

If I have watched correctly, it all started with a HARVEY_TEST that took around 1 minute but carried an estimate of 12 hours.
Greetings from TJ
ID: 32611
Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261
Message 32618 - Posted: 2 Sep 2013, 8:04:06 UTC - in response to Message 32610.  

Each task is sent with an estimate. You can even view that estimate in the task properties.

There are a number of components which go towards calculating that estimate. After playing around with those Beta tasks yesterday, I've now been given a re-sent NATHAN_KIDKIXc22 from the long queue (WU 4743036), so we can see what will happen when all this is over.

From <workunit> in client_state:
    <name>I6R6-NATHAN_KIDKIXc22_6-12-50-RND1527</name>
    <app_name>acemdlong</app_name>
    <version_num>803</version_num>
    <rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>

From <app_version> in client_state:
    <app_name>acemdlong</app_name>
    <version_num>803</version_num>
    <platform>windows_intelx86</platform>
    <avg_ncpus>0.666596</avg_ncpus>
    <max_ncpus>0.666596</max_ncpus>
    <flops>142541780304.165830</flops>
    <plan_class>cuda55</plan_class>

From <project> in client_state:
    <duration_correction_factor>19.676844</duration_correction_factor>

It's the local BOINC client on your machine that puts all those figures into the calculator.

Size: 5,000,000,000,000,000 (5 PetaFpops, 5 quadrillion calculations)
Speed: 142,541,780,304 (142.5 GigaFlops)
DCF: 19.67

Put those together, and my calculator gets 690,213 seconds - 192 hours or 8 days.
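That arithmetic can be checked directly with a few lines of Python, using the three values from the client_state extracts above:

```python
# Re-derive the client's runtime estimate from the three values above.
rsc_fpops_est = 5_000_000_000_000_000        # <rsc_fpops_est>, 5 PetaFpops
flops = 142_541_780_304.165830               # <flops>, 142.5 GigaFlops
dcf = 19.676844                              # <duration_correction_factor>

estimate_seconds = rsc_fpops_est / flops * dcf
# roughly 690,000 seconds, i.e. about 192 hours / 8 days
print(f"{estimate_seconds:,.0f} s = {estimate_seconds / 3600:.0f} h")
```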

28% of the way through the task (in 2.5 hours), BOINC is still estimating 174 hours - over a week - to go: BOINC is very slow to switch from 'estimate' to 'experience' as a task is running.

We're going to get a lot of panic (and possibly even aborted tasks) from inexperienced users before all this unwinds.
ID: 32618
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
Message 32626 - Posted: 2 Sep 2013, 10:06:17 UTC

The BOINC manager suspends a WU with a normal estimated run time whenever it receives a fresh WU with an overestimated run time (my personal high score is 2878(!) hours). This makes my batch programs think they are stuck (and actually they are; it's intentional, to give priority to the task with the overestimated run time).
This is really annoying.
ID: 32626
Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261
Message 32627 - Posted: 2 Sep 2013, 10:21:36 UTC - in response to Message 32626.  

The BOINC manager suspends a WU with a normal estimated run time whenever it receives a fresh WU with an overestimated run time (my personal high score is 2878(!) hours). This makes my batch programs think they are stuck (and actually they are; it's intentional, to give priority to the task with the overestimated run time).
This is really annoying.

It depends which version of the BOINC client you run.

I'm testing new BOINC versions as they come out, too - that rig is currently one step behind, on v7.2.10

The behaviour of 'stopping the current task, and starting a later one' when in High Priority was acknowledged as a bug, and has now been corrected. BOINC is hoping to promote v7.2.xx to 'recommended' status soon - that should cure your annoyance.
ID: 32627
Jacob Klein · Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0
Message 32630 - Posted: 2 Sep 2013, 12:05:34 UTC - in response to Message 32626.  

This is really annoying.


I agree, it is annoying.
I reported it as soon as I spotted it, over a week ago.
Hopefully the admins take it a bit more seriously.
ID: 32630
Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261
Message 32634 - Posted: 2 Sep 2013, 13:37:08 UTC - in response to Message 32630.  

This is really annoying.

I agree, it is annoying.
I reported it as soon as I spotted it, over a week ago.
Hopefully the admins take it a bit more seriously.

The problem Retvari Zoltan is annoyed about - BOINC suspending one task and running a different one when high priority is needed - isn't something the project admins can solve (except by getting the estimates right so EDF isn't needed, obviously).

Decisions about which task from the cache to run next are taken locally by the BOINC core client. v6.10.60 is getting quite old now - and yes, this bug has been around that long. It was fixed in v7.0.14:

EDF policy says we should run the ones with earliest deadlines.

Note: this is how it used to be (as designed by John McLeod). I attempted to improve it, and got it wrong.
[DA]
ID: 32634
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
Message 32636 - Posted: 2 Sep 2013, 15:55:53 UTC - in response to Message 32634.  

The problem Retvari Zoltan is annoyed about - BOINC suspending one task and running a different one when high priority is needed - isn't something the project admins can solve (except by getting the estimates right so EDF isn't needed, obviously).

That's why I've posted about my annoyance here.
This overestimation misleads the BOINC manager in another way: it won't ask for new work, since it thinks that there is enough work in its queue.

Decisions about which task from the cache to run next are taken locally by the BOINC core client. v6.10.60 is getting quite old now - and yes, this bug has been around that long. It was fixed in v7.0.14:

There is another annoying bug, and an annoying GUI change, which keep me from upgrading from v6.10.60:
The bug is in the calculation of the required CPU percentage for GPU tasks: it can change from below 0.5 to over 0.5, and on a dual-GPU system that change results in a fluctuation of 1 CPU thread. v6.10.60 underestimates the required CPU percentage for Kepler-based cards (0.04%), so the number of available CPUs won't fluctuate; for once, a bug that comes in handy.
The annoying GUI change is the omitted "messages" tab (actually it's been relocated to a submenu).
ID: 32636
Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261
Message 32639 - Posted: 2 Sep 2013, 20:07:10 UTC

I've just picked up a new Beta task - 7238202

New (to me) task type MJHARVEY_TESTSIN, and new cuda55 application version 8.06

The task has the same 5 PetaFpops estimate we're used to seeing, and - it being a new app and all - the speed estimate was 295 GigaFlops, lower than previously on this host. DCF is still high, so BOINC calculated the runtime as 87 hours. It looks like it's turning into a 100-minute job... (but it still leapt into action in High Priority as soon as I let my AV complete the download).
ID: 32639
ExtraTerrestrial Apes · Volunteer moderator · Volunteer tester · Joined: 17 Aug 08 · Posts: 2705 · Credit: 1,311,122,549 · RAC: 0
Message 32640 - Posted: 2 Sep 2013, 20:16:12 UTC - in response to Message 32636.  

You could configure the CPU percentage yourself with an app_config file (which later BOINC versions support). I could send you mine, if you're interested. And the messages tab.. well, it's annoying, but it's only really needed when there are problems, which fortunately isn't all that often for me.
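For reference, a minimal app_config.xml along those lines might look like this (the acemdlong app name is taken from the client_state extract quoted earlier in the thread; the 0.5 CPU figure is just an illustrative value, not a recommendation):

```xml
<!-- Place in the GPUGrid project directory; then use
     Advanced -> Read config files in the Manager, or restart BOINC. -->
<app_config>
    <app>
        <name>acemdlong</name>
        <gpu_versions>
            <gpu_usage>1.0</gpu_usage>   <!-- one task per GPU -->
            <cpu_usage>0.5</cpu_usage>   <!-- pin the CPU fraction yourself -->
        </gpu_versions>
    </app>
</app_config>
```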

Anyway.. let's wait for MJH to work through these posts!

MrS
Scanning for our furry friends since Jan 2002
ID: 32640
Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261
Message 32644 - Posted: 2 Sep 2013, 22:24:25 UTC - in response to Message 32639.  

I've just picked up a new Beta task - 7238202

Completed in under 2 hours, and awarded 150,000 credits. This is getting silly.
ID: 32644
Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261
Message 32658 - Posted: 3 Sep 2013, 14:07:59 UTC

Just a heads-up: estimated times for the current v8.10 (cuda55) Beta are unusually low - it is likely that many runs (especially of the full-length production NOELIA_KLEBEbeta tasks being processed through the Beta queue) will fail with EXIT_TIME_LIMIT_EXCEEDED - after about an hour, on my GTX 670.

If you are a BOINC user experienced and competent enough to edit client_state.xml - and taking all the standard safety warnings as read - you can avoid this by increasing <rsc_fpops_bound> for all GPUGrid Beta tasks. A couple of orders of magnitude should do it, maybe three for luck.
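As an illustration of that edit, here is a hypothetical helper operating on the file's text (a sketch only: stop BOINC first, keep a backup of client_state.xml, and note that a real edit should touch just the Beta tasks, not every bound in the file):

```python
import re

def boost_fpops_bound(client_state_text, factor=1000):
    """Multiply every <rsc_fpops_bound> value by `factor`.
    Illustrative sketch; BOINC must not be running while the file is edited."""
    def bump(m):
        return "<rsc_fpops_bound>%.6f</rsc_fpops_bound>" % (float(m.group(1)) * factor)
    return re.sub(r"<rsc_fpops_bound>([0-9.]+)</rsc_fpops_bound>", bump, client_state_text)

# The bound from the NATHAN_KIDKIXc22 workunit quoted earlier in the thread:
line = "<rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>"
print(boost_fpops_bound(line, 1000))
```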
ID: 32658
Jacob Klein · Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0
Message 32659 - Posted: 3 Sep 2013, 14:42:25 UTC - in response to Message 32658.  

Yup, I just had some tasks fail because of that poor server estimation.
This is getting very ridiculous.

Just a heads-up: estimated times for the current v8.10 (cuda55) Beta are unusually low - it is likely that many runs (especially of the full-length production NOELIA_KLEBEbeta tasks being processed through the Beta queue) will fail with EXIT_TIME_LIMIT_EXCEEDED - after about an hour, on my GTX 670.

If you are a BOINC user experienced and competent enough to edit client_state.xml - and taking all the standard safety warnings as read - you can avoid this by increasing <rsc_fpops_bound> for all GPUGrid Beta tasks. A couple of orders of magnitude should do it, maybe three for luck.

ID: 32659
MJH · Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0
Message 32670 - Posted: 4 Sep 2013, 9:15:44 UTC
Last modified: 4 Sep 2013, 9:37:48 UTC

This problem should be confined to the beta queue; it is a side-effect of having issued a series of short-running WUs with the same fpops estimate as the normal longer-running ones.

Please let me know if you start to see this problem on the important acemdshort and acemdlong queues. There's no reason why it should be happening there (any more than usual), but the client is full of surprises.

MJH
ID: 32670
juan BFP · Joined: 11 Dec 11 · Posts: 21 · Credit: 145,887,858 · RAC: 0
Message 32671 - Posted: 4 Sep 2013, 9:26:40 UTC
Last modified: 4 Sep 2013, 9:29:52 UTC

All the WUs actually running on this host have an estimate of +/- 130 hrs! (normal time to crunch a WU = 8-9 hrs)

http://www.gpugrid.net/results.php?hostid=157835&offset=0&show_names=0&state=1&appid=

So actually most of them are crunching in high priority mode.
ID: 32671
Jim1348 · Joined: 28 Jul 12 · Posts: 819 · Credit: 1,591,285,971 · RAC: 0
Message 32672 - Posted: 4 Sep 2013, 9:37:29 UTC - in response to Message 32670.  
Last modified: 4 Sep 2013, 9:39:50 UTC

Please let me know if you start to see this problem on the important acemdshort and acemdlong queues. There's no reason why it should be happening there (any more than usual), but the client is full of surprises.

MJH

I have tried 3 betas in the long queue, and two have failed at almost exactly the same running times. One was a Noelia, and the other was a Harvey. (On a GTX 650 Ti under Win7 64-bit and BOINC 7.2.11 x64)


    8.10 ACEMD beta version (cuda55) 109nx46-NOELIA_KLEBEbeta-0-3-RND7143_2 01:58:09 Reported: Computation error (197,)

    8.10 ACEMD beta version (cuda55) 66-MJHARVEY_TEST10-42-50-RND0504_0 01:58:02 Reported: Computation error (197,)



The Noelia was also reported as failed by four other people thus far, whereas the Harvey was completed successfully by someone running ACEMD beta version v8.05 (cuda42).

ID: 32672
MJH · Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0
Message 32673 - Posted: 4 Sep 2013, 9:40:43 UTC - in response to Message 32672.  


I have tried 3 betas in the long queue


I'm not sure what you mean by that. Those are WUs from the acemdbeta queue, run by the beta application. The acemdlong queue isn't involved.

MJH
ID: 32673
Jim1348 · Joined: 28 Jul 12 · Posts: 819 · Credit: 1,591,285,971 · RAC: 0
Message 32674 - Posted: 4 Sep 2013, 10:02:56 UTC - in response to Message 32673.  

Sorry. I thought you were asking about the betas. No problems with the longs thus far. All seven that I have received under CUDA 5.5 have been completed successfully.

    8.03 Long runs (cuda55) 063ppx239-NOELIA_FRAG063pp-2-4-RND4537_0 20:57:08
    8.03 Long runs (cuda55) 041px89-NOELIA_FRAG041p-3-4-RND5262_0 17:41:48
    8.03 Long runs (cuda55) 041px89-NOELIA_FRAG041p-2-4-RND5262_0 17:39:49
    8.03 Long runs (cuda55) 063ppx290-NOELIA_FRAG063pp-1-4-RND3152_0 20:59:32
    8.03 Long runs (cuda55) I35R7-NATHAN_KIDKIXc22_6-8-50-RND8566_0
    8.02 Long runs (cuda55) I50R6-NATHAN_KIDKIXc22_6-3-50-RND0333_0 17:48:16
    8.00 Long runs (cuda55) I81R8-NATHAN_KIDKIXc22_6-4-50-RND0944_0 17:44:35

ID: 32674
Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261
Message 32676 - Posted: 4 Sep 2013, 10:12:44 UTC - in response to Message 32671.  

All the WUs actually running on this host have an estimate of +/- 130 hrs! (normal time to crunch a WU = 8-9 hrs)

http://www.gpugrid.net/results.php?hostid=157835&offset=0&show_names=0&state=1&appid=

So actually most of them are crunching in high priority mode.

That's DCF in action. It will work itself down eventually, but it may take 20 - 30 tasks with proper <rsc_fpops_est> values to get there.

Unless DCF has already reached over 90 - the normalisation process is even slower in those extreme cases.
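The '20 - 30 tasks' figure follows from the client's deliberately cautious DCF update rule. A simplified sketch of that rule (modelled on PROJECT::update_duration_correction_factor in the BOINC client source; the exact thresholds are from memory, so treat the numbers as approximate):

```python
def update_dcf(dcf, elapsed, est_uncorrected):
    """One finished task's effect on the duration correction factor."""
    raw_ratio = elapsed / est_uncorrected          # actual vs server estimate
    adj_ratio = elapsed / (est_uncorrected * dcf)  # actual vs client estimate
    if adj_ratio > 1.1:
        dcf = raw_ratio                      # ran long: correct upwards at once
    elif adj_ratio < 0.1:
        dcf = 0.99 * dcf + 0.01 * raw_ratio  # ran far short: creep down by 1%
    else:
        dcf = 0.9 * dcf + 0.1 * raw_ratio    # ran somewhat short: 10% steps
    return min(max(dcf, 0.01), 100.0)

# Starting from the DCF quoted earlier, feed in tasks whose runtime
# exactly matches the server estimate (raw ratio = 1.0):
dcf = 19.676844
for _ in range(30):
    dcf = update_dcf(dcf, elapsed=35077.0, est_uncorrected=35077.0)
print(round(dcf, 2))   # still well above 1.0 after 30 tasks
```

Because a grossly inflated DCF keeps the adjusted ratio below 0.1, only the slow 1%-per-task branch applies at first, which is why extreme values unwind so slowly.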
ID: 32676
Jacob Klein · Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0
Message 32677 - Posted: 4 Sep 2013, 10:14:54 UTC - in response to Message 32670.  
Last modified: 4 Sep 2013, 10:19:04 UTC

MJH,

Thanks for finding the cause of the problem.
Has the problem been fixed, such that new tasks (including beta!) are sent with appropriate fpops estimates?

Once the problem has been fixed at the server, a user who wants to reset the Duration Correction Factor immediately (instead of waiting for it to adjust down over the course of a few days/weeks) could follow these steps:

- Exit BOINC, including stopping running tasks
- Open client_state.xml within the data directory
- Search for the <project_name>GPUGRID</project_name> element
- Find the <duration_correction_factor> element within that project element
- Change the line to read: <duration_correction_factor>1</duration_correction_factor>
- Restart BOINC
- Monitor the value in the UI by viewing the GPUGrid project properties, to make sure it stays between roughly 0.6 and 1.6.
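Scripted, those steps might look like this (a hypothetical sketch; the path is yours to fill in, and the usual caveats apply: BOINC stopped, file backed up first):

```python
import xml.etree.ElementTree as ET

def reset_gpugrid_dcf(path):
    """Set <duration_correction_factor> back to 1 inside the GPUGRID
    <project> block only, leaving other projects untouched.
    Sketch only: stop BOINC and back up client_state.xml first."""
    tree = ET.parse(path)
    for proj in tree.getroot().iter("project"):
        if proj.findtext("project_name") == "GPUGRID":
            proj.find("duration_correction_factor").text = "1.000000"
    tree.write(path)
```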

.... I just need to know when the problem has been fixed by the server, so I can begin testing the solution on my client (which has luckily not been full of too many surprises).

Has it been fixed?

This problem should be confined to the beta queue and is a side-effect of having issued a series of short running WUs with the same fpops estimate as normal longer-running ones.

Please let me know if you start to see this problem on the important acemdshort and acemdlong queues. There's no reason why it should be happening there (any more than usual), but the client is full of surprises.

MJH
ID: 32677
Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261
Message 32679 - Posted: 4 Sep 2013, 10:31:54 UTC - in response to Message 32673.  


I have tried 3 betas in the long queue


I'm not sure what you mean by that. Those are WUs from the acemdbeta queue, run by the beta application. The acemdlong queue isn't involved.

MJH

I think he means long tasks, like those NOELIA_KLEBE jobs, being processed through the Beta queue currently alongside your quick test pieces.

The problem is that if you run a succession of short test units with full-size <rsc_fpops_est> values, the BOINC server comes to believe your host is insanely fast - it thinks my GTX 670 can complete ACEMD beta version 8.11 tasks (bottom of the linked list) at 79.2 TeraFlops. When BOINC then attempts any reasonably long job (a bit over an hour, in my case), the client thinks something has gone wrong and aborts it for taking too long.

There's nothing the user can do to overcome that, except 'inoculate' each individual task as it is received, with a big (100x or 1000x) increase to <rsc_fpops_bound>.
ID: 32679


©2025 Universitat Pompeu Fabra