Remaining (Estimated) time is unusually high; duration correction factor unusually large

Message boards : Number crunching : Remaining (Estimated) time is unusually high; duration correction factor unusually large

TJ · Joined: 26 Jun 09 · Posts: 815 · Credit: 1,470,385,294 · RAC: 0
Message 32611 - Posted: 2 Sep 2013, 0:22:34 UTC - in response to Message 32609.  
Last modified: 2 Sep 2013, 0:22:52 UTC

The opposite happens too. I now have a NOELIA_KLEBEbeta-2-3... that has done 9% in 1 hour, yet claims it will finish in 5 minutes. And a HARVEY_TEST that should take about 15 minutes is estimated to run for 6h53m52s.

If I have watched correctly, it all started with a HARVEY_TEST that took around 1 minute but carried an estimate of 12 hours.
Greetings from TJ
ID: 32611
Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261
Message 32618 - Posted: 2 Sep 2013, 8:04:06 UTC - in response to Message 32610.  

Each task is sent with an estimate. You can even view that estimate in the task properties.

There are a number of components which go towards calculating that estimate. After playing around with those Beta tasks yesterday, I've now been given a re-sent NATHAN_KIDKIXc22 from the long queue (WU 4743036), so we can see what will happen when all this is over.

From <workunit> in client_state:
    <name>I6R6-NATHAN_KIDKIXc22_6-12-50-RND1527</name>
    <app_name>acemdlong</app_name>
    <version_num>803</version_num>
    <rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>

From <app_version> in client_state:
    <app_name>acemdlong</app_name>
    <version_num>803</version_num>
    <platform>windows_intelx86</platform>
    <avg_ncpus>0.666596</avg_ncpus>
    <max_ncpus>0.666596</max_ncpus>
    <flops>142541780304.165830</flops>
    <plan_class>cuda55</plan_class>

From <project> in client_state:
    <duration_correction_factor>19.676844</duration_correction_factor>

It's the local BOINC client on your machine that puts all those figures into the calculator.

Size: 5,000,000,000,000,000 (5 PetaFpops, 5 quadrillion calculations)
Speed: 142,541,780,304 (142.5 GigaFlops)
DCF: 19.67

Put those together, and my calculator gets 690,213 seconds - 192 hours or 8 days.
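That arithmetic can be checked directly with a few lines of Python, using the three values from the client_state extracts above:

```python
# Re-derive the client's runtime estimate from the three values above.
rsc_fpops_est = 5_000_000_000_000_000        # <rsc_fpops_est>, 5 PetaFpops
flops = 142_541_780_304.165830               # <flops>, 142.5 GigaFlops
dcf = 19.676844                              # <duration_correction_factor>

estimate_seconds = rsc_fpops_est / flops * dcf
# roughly 690,000 seconds, i.e. about 192 hours / 8 days
print(f"{estimate_seconds:,.0f} s = {estimate_seconds / 3600:.0f} h")
```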

28% of the way through the task (in 2.5 hours), BOINC is still estimating 174 hours - over a week - to go: BOINC is very slow to switch from 'estimate' to 'experience' as a task is running.

We're going to get a lot of panic (and possibly even aborted tasks) from inexperienced users before all this unwinds.
ID: 32618
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
Message 32626 - Posted: 2 Sep 2013, 10:06:17 UTC

The BOINC manager suspends a WU with a normal estimated run time whenever it receives a fresh WU with an overestimated run time (my personal high score is 2878(!) hours). This makes my batch programs think they are stuck (and actually they are; it's intentional, to give priority to the task with the overestimated run time).
This is really annoying.
ID: 32626
Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261
Message 32627 - Posted: 2 Sep 2013, 10:21:36 UTC - in response to Message 32626.  

The BOINC manager suspends a WU with a normal estimated run time whenever it receives a fresh WU with an overestimated run time (my personal high score is 2878(!) hours). This makes my batch programs think they are stuck (and actually they are; it's intentional, to give priority to the task with the overestimated run time).
This is really annoying.

It depends which version of the BOINC client you run.

I'm testing new BOINC versions as they come out, too - that rig is currently one step behind, on v7.2.10

The behaviour of 'stopping the current task, and starting a later one' when in High Priority was acknowledged as a bug, and has now been corrected. BOINC is hoping to promote v7.2.xx to 'recommended' status soon - that should cure your annoyance.
ID: 32627
Jacob Klein · Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0
Message 32630 - Posted: 2 Sep 2013, 12:05:34 UTC - in response to Message 32626.  

This is really annoying.


I agree, it is annoying.
I reported it as soon as I spotted it, over a week ago.
Hopefully the admins take it a bit more seriously.
ID: 32630
Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261
Message 32634 - Posted: 2 Sep 2013, 13:37:08 UTC - in response to Message 32630.  

This is really annoying.

I agree, it is annoying.
I reported it as soon as I spotted it, over a week ago.
Hopefully the admins take it a bit more seriously.

The problem Retvari Zoltan is annoyed about - BOINC suspending one task and running a different one when high priority is needed - isn't something the project admins can solve (except by getting the estimates right so EDF isn't needed, obviously).

Decisions about which task from the cache to run next are taken locally by the BOINC core client. v6.10.60 is getting quite old now - and yes, this bug has been around that long. It was fixed in v7.0.14:

EDF policy says we should run the ones with earliest deadlines.

Note: this is how it used to be (as designed by John McLeod). I attempted to improve it, and got it wrong.
[DA]
ID: 32634
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
Message 32636 - Posted: 2 Sep 2013, 15:55:53 UTC - in response to Message 32634.  

The problem Retvari Zoltan is annoyed about - BOINC suspending one task and running a different one when high priority is needed - isn't something the project admins can solve (except by getting the estimates right so EDF isn't needed, obviously).

That's why I've posted about my annoyance here.
This overestimation misleads the BOINC manager in another way: it won't ask for new work, since it thinks that there is enough work in its queue.

Decisions about which task from the cache to run next are taken locally by the BOINC core client. v6.10.60 is getting quite old now - and yes, this bug has been around that long. It was fixed in v7.0.14:

There is another annoying bug, and an annoying GUI change, which keep me from upgrading from v6.10.60:
The bug is in the calculation of the required CPU percentage for GPU tasks: it can change from below 0.5 to over 0.5, and on a dual-GPU system that change results in a fluctuation of 1 CPU thread. v6.10.60 underestimates the required CPU percentage for Kepler-based cards (0.04%), so the number of available CPUs won't fluctuate; for once, a bug that comes in handy.
The annoying GUI change is the omitted "messages" tab (actually it's been relocated to a submenu).
ID: 32636
Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261
Message 32639 - Posted: 2 Sep 2013, 20:07:10 UTC

I've just picked up a new Beta task - 7238202

New (to me) task type MJHARVEY_TESTSIN, and new cuda55 application version 8.06

The task has the same 5 PetaFpops estimate we're used to seeing, and - it being a new app and all - the speed estimate was 295 GigaFlops, lower than previously on this host. DCF is still high, so BOINC calculated the runtime as 87 hours. It looks like it's turning into a 100-minute job... (but it still leapt into action in High Priority as soon as I let my AV complete the download).
ID: 32639
ExtraTerrestrial Apes · Volunteer moderator · Volunteer tester · Joined: 17 Aug 08 · Posts: 2705 · Credit: 1,311,122,549 · RAC: 0
Message 32640 - Posted: 2 Sep 2013, 20:16:12 UTC - in response to Message 32636.  

You could configure the CPU percentage yourself with an app_config file (which later BOINC versions support). I could send you mine, if you're interested. And the messages tab.. well, it's annoying, but it's only really needed when there are problems, which fortunately isn't all that often for me.
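For reference, a minimal app_config.xml along those lines might look like this (the acemdlong app name is taken from the client_state extract quoted earlier in the thread; the 0.5 CPU figure is just an illustrative value, not a recommendation):

```xml
<!-- Place in the GPUGrid project directory; then use
     Advanced -> Read config files in the Manager, or restart BOINC. -->
<app_config>
    <app>
        <name>acemdlong</name>
        <gpu_versions>
            <gpu_usage>1.0</gpu_usage>   <!-- one task per GPU -->
            <cpu_usage>0.5</cpu_usage>   <!-- pin the CPU fraction yourself -->
        </gpu_versions>
    </app>
</app_config>
```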

Anyway.. let's wait for MJH to work through these posts!

MrS
Scanning for our furry friends since Jan 2002
ID: 32640
Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261
Message 32644 - Posted: 2 Sep 2013, 22:24:25 UTC - in response to Message 32639.  

I've just picked up a new Beta task - 7238202

Completed in under 2 hours, and awarded 150,000 credits. This is getting silly.
ID: 32644
Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261
Message 32658 - Posted: 3 Sep 2013, 14:07:59 UTC

Just a heads-up: estimated times for the current v8.10 (cuda55) Beta are unusually low - it is likely that many runs (especially of the full-length production NOELIA_KLEBEbeta tasks being processed through the Beta queue) will fail with EXIT_TIME_LIMIT_EXCEEDED - after about an hour, on my GTX 670.

If you are a BOINC user experienced and competent enough to edit client_state.xml - and taking all the standard safety warnings as read - you can avoid this by increasing <rsc_fpops_bound> for all GPUGrid Beta tasks. A couple of orders of magnitude should do it, maybe three for luck.
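As an illustration of that edit, here is a hypothetical helper operating on the file's text (a sketch only: stop BOINC first, keep a backup of client_state.xml, and note that a real edit should touch just the Beta tasks, not every bound in the file):

```python
import re

def boost_fpops_bound(client_state_text, factor=1000):
    """Multiply every <rsc_fpops_bound> value by `factor`.
    Illustrative sketch; BOINC must not be running while the file is edited."""
    def bump(m):
        return "<rsc_fpops_bound>%.6f</rsc_fpops_bound>" % (float(m.group(1)) * factor)
    return re.sub(r"<rsc_fpops_bound>([0-9.]+)</rsc_fpops_bound>", bump, client_state_text)

# The bound from the NATHAN_KIDKIXc22 workunit quoted earlier in the thread:
line = "<rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>"
print(boost_fpops_bound(line, 1000))
```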
ID: 32658
Jacob Klein · Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0
Message 32659 - Posted: 3 Sep 2013, 14:42:25 UTC - in response to Message 32658.  

Yup, I just had some tasks fail because of that poor server estimation.
This is getting very ridiculous.

Just a heads-up: estimated times for the current v8.10 (cuda55) Beta are unusually low - it is likely that many runs (especially of the full-length production NOELIA_KLEBEbeta tasks being processed through the Beta queue) will fail with EXIT_TIME_LIMIT_EXCEEDED - after about an hour, on my GTX 670.

If you are a BOINC user experienced and competent enough to edit client_state.xml - and taking all the standard safety warnings as read - you can avoid this by increasing <rsc_fpops_bound> for all GPUGrid Beta tasks. A couple of orders of magnitude should do it, maybe three for luck.

ID: 32659
MJH · Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0
Message 32670 - Posted: 4 Sep 2013, 9:15:44 UTC
Last modified: 4 Sep 2013, 9:37:48 UTC

This problem should be confined to the beta queue; it is a side-effect of having issued a series of short-running WUs with the same fpops estimate as the normal longer-running ones.

Please let me know if you start to see this problem on the important acemdshort and acemdlong queues. There's no reason why it should be happening there (any more than usual), but the client is full of surprises.

MJH
ID: 32670
juan BFP · Joined: 11 Dec 11 · Posts: 21 · Credit: 145,887,858 · RAC: 0
Message 32671 - Posted: 4 Sep 2013, 9:26:40 UTC
Last modified: 4 Sep 2013, 9:29:52 UTC

All the WUs actually running on this host have an estimate of +/- 130 hrs! (normal time to crunch a WU = 8-9 hrs)

http://www.gpugrid.net/results.php?hostid=157835&offset=0&show_names=0&state=1&appid=

So actually most of them are crunching in high priority mode.
ID: 32671
Jim1348 · Joined: 28 Jul 12 · Posts: 819 · Credit: 1,591,285,971 · RAC: 0
Message 32672 - Posted: 4 Sep 2013, 9:37:29 UTC - in response to Message 32670.  
Last modified: 4 Sep 2013, 9:39:50 UTC

Please let me know if you start to see this problem on the important acemdshort and acemdlong queues. There's no reason why it should be happening there (any more than usual), but the client is full of surprises.

MJH

I have tried 3 betas in the long queue, and two have failed at almost exactly the same running times. One was a Noelia, and the other was a Harvey. (On a GTX 650 Ti under Win7 64-bit and BOINC 7.2.11 x64)


    8.10 ACEMD beta version (cuda55) 109nx46-NOELIA_KLEBEbeta-0-3-RND7143_2 01:58:09 Reported: Computation error (197,)

    8.10 ACEMD beta version (cuda55) 66-MJHARVEY_TEST10-42-50-RND0504_0 01:58:02 Reported: Computation error (197,)



The Noelia was also reported as failed by four other people thus far, whereas the Harvey was completed successfully by someone running ACEMD beta version v8.05 (cuda42).

ID: 32672
MJH · Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0
Message 32673 - Posted: 4 Sep 2013, 9:40:43 UTC - in response to Message 32672.  


I have tried 3 betas in the long queue


I'm not sure what you mean by that. Those are WUs from the acemdbeta queue, run by the beta application. The acemdlong queue isn't involved.

MJH
ID: 32673
Jim1348 · Joined: 28 Jul 12 · Posts: 819 · Credit: 1,591,285,971 · RAC: 0
Message 32674 - Posted: 4 Sep 2013, 10:02:56 UTC - in response to Message 32673.  

Sorry. I thought you were asking about the betas. No problems with the longs thus far. All seven that I have received under CUDA 5.5 have been completed successfully.

    8.03 Long runs (cuda55) 063ppx239-NOELIA_FRAG063pp-2-4-RND4537_0 20:57:08
    8.03 Long runs (cuda55) 041px89-NOELIA_FRAG041p-3-4-RND5262_0 17:41:48
    8.03 Long runs (cuda55) 041px89-NOELIA_FRAG041p-2-4-RND5262_0 17:39:49
    8.03 Long runs (cuda55) 063ppx290-NOELIA_FRAG063pp-1-4-RND3152_0 20:59:32
    8.03 Long runs (cuda55) I35R7-NATHAN_KIDKIXc22_6-8-50-RND8566_0
    8.02 Long runs (cuda55) I50R6-NATHAN_KIDKIXc22_6-3-50-RND0333_0 17:48:16
    8.00 Long runs (cuda55) I81R8-NATHAN_KIDKIXc22_6-4-50-RND0944_0 17:44:35

ID: 32674
Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261
Message 32676 - Posted: 4 Sep 2013, 10:12:44 UTC - in response to Message 32671.  

All the WUs actually running on this host have an estimate of +/- 130 hrs! (normal time to crunch a WU = 8-9 hrs)

http://www.gpugrid.net/results.php?hostid=157835&offset=0&show_names=0&state=1&appid=

So actually most of them are crunching in high priority mode.

That's DCF in action. It will work itself down eventually, but it may take 20 - 30 tasks with proper <rsc_fpops_est> values to get there.

Unless DCF has already reached over 90 - the normalisation process is even slower in those extreme cases.
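The '20 - 30 tasks' figure follows from the client's deliberately cautious DCF update rule. A simplified sketch of that rule (modelled on PROJECT::update_duration_correction_factor in the BOINC client source; the exact thresholds are from memory, so treat the numbers as approximate):

```python
def update_dcf(dcf, elapsed, est_uncorrected):
    """One finished task's effect on the duration correction factor."""
    raw_ratio = elapsed / est_uncorrected          # actual vs server estimate
    adj_ratio = elapsed / (est_uncorrected * dcf)  # actual vs client estimate
    if adj_ratio > 1.1:
        dcf = raw_ratio                      # ran long: correct upwards at once
    elif adj_ratio < 0.1:
        dcf = 0.99 * dcf + 0.01 * raw_ratio  # ran far short: creep down by 1%
    else:
        dcf = 0.9 * dcf + 0.1 * raw_ratio    # ran somewhat short: 10% steps
    return min(max(dcf, 0.01), 100.0)

# Starting from the DCF quoted earlier, feed in tasks whose runtime
# exactly matches the server estimate (raw ratio = 1.0):
dcf = 19.676844
for _ in range(30):
    dcf = update_dcf(dcf, elapsed=35077.0, est_uncorrected=35077.0)
print(round(dcf, 2))   # still well above 1.0 after 30 tasks
```

Because a grossly inflated DCF keeps the adjusted ratio below 0.1, only the slow 1%-per-task branch applies at first, which is why extreme values unwind so slowly.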
ID: 32676
Jacob Klein · Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0
Message 32677 - Posted: 4 Sep 2013, 10:14:54 UTC - in response to Message 32670.  
Last modified: 4 Sep 2013, 10:19:04 UTC

MJH,

Thanks for finding the cause of the problem.
Has the problem been fixed, such that new tasks (including beta!) are sent with appropriate fpops estimates?

Once the problem has been fixed at the server, a user who wants to reset the Duration Correction Factor immediately (instead of waiting for it to adjust down over the course of a few days/weeks) could follow these steps:

- Exit BOINC, including stopping running tasks
- Open client_state.xml within the data directory
- Search for the <project_name>GPUGRID</project_name> element
- Find the <duration_correction_factor> element within that project element
- Change the line to read: <duration_correction_factor>1</duration_correction_factor>
- Restart BOINC
- Monitor the value in the UI by viewing the GPUGrid project properties, to make sure it stays between roughly 0.6 and 1.6.
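Scripted, those steps might look like this (a hypothetical sketch; the path is yours to fill in, and the usual caveats apply: BOINC stopped, file backed up first):

```python
import xml.etree.ElementTree as ET

def reset_gpugrid_dcf(path):
    """Set <duration_correction_factor> back to 1 inside the GPUGRID
    <project> block only, leaving other projects untouched.
    Sketch only: stop BOINC and back up client_state.xml first."""
    tree = ET.parse(path)
    for proj in tree.getroot().iter("project"):
        if proj.findtext("project_name") == "GPUGRID":
            proj.find("duration_correction_factor").text = "1.000000"
    tree.write(path)
```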

.... I just need to know when the problem has been fixed by the server, so I can begin testing the solution on my client (which has luckily not been full of too many surprises).

Has it been fixed?

This problem should be confined to the beta queue and is a side-effect of having issued a series of short running WUs with the same fpops estimate as normal longer-running ones.

Please let me know if you start to see this problem on the important acemdshort and acemdlong queues. There's no reason why it should be happening there (any more than usual), but the client is full of surprises.

MJH
ID: 32677
Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261
Message 32679 - Posted: 4 Sep 2013, 10:31:54 UTC - in response to Message 32673.  


I have tried 3 betas in the long queue


I'm not sure what you mean by that. Those are WUs from the acemdbeta queue, run by the beta application. The acemdlong queue isn't involved.

MJH

I think he means long tasks, like those NOELIA_KLEBE jobs, being processed through the Beta queue currently alongside your quick test pieces.

The problem is that if you run a succession of short test units with full-size <rsc_fpops_est> values, the BOINC server comes to believe your host is insanely fast - it thinks my GTX 670 can complete ACEMD beta version 8.11 tasks (bottom of the linked list) at 79.2 TeraFlops. When BOINC then attempts any reasonably long job (a bit over an hour, in my case), the client thinks something has gone wrong and aborts it for taking too long.

There's nothing the user can do to overcome that, except 'inoculate' each individual task as it is received, with a big (100x or 1000x) increase to <rsc_fpops_bound>.
ID: 32679


©2025 Universitat Pompeu Fabra