Message boards : Number crunching : Remaining (Estimated) time is unusually high; duration correction factor unusually large
Joined: 26 Jun 09 · Posts: 815 · Credit: 1,470,385,294 · RAC: 0
The opposite happens too. I now have a NOELIA_KLEBEbeta-2-3... that has done 9% in 1 hour, but in 5 minutes it will supposedly be finished...? And a HARVEY_TEST that will take about 15 minutes is expected to run 6h53m52s. If I have observed correctly, it all started with a HARVEY_TEST that took around 1 minute but had an estimate of 12 hours.

Greetings from TJ
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261
Each task is sent with an estimate. You can even view that estimate in the task properties. There are a number of components which go towards calculating that estimate. After playing around with those Beta tasks yesterday, I've now been given a re-sent NATHAN_KIDKIXc22 from the long queue (WU 4743036), so we can see what will happen when all this is over.

From <workunit> in client_state:

    <name>I6R6-NATHAN_KIDKIXc22_6-12-50-RND1527</name>
    <app_name>acemdlong</app_name>
    <version_num>803</version_num>
    <rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>

From <app_version> in client_state:

    <app_name>acemdlong</app_name>
    <version_num>803</version_num>
    <platform>windows_intelx86</platform>
    <avg_ncpus>0.666596</avg_ncpus>
    <max_ncpus>0.666596</max_ncpus>
    <flops>142541780304.165830</flops>
    <plan_class>cuda55</plan_class>

From <project> in client_state:

    <duration_correction_factor>19.676844</duration_correction_factor>

It's the local BOINC client on your machine that puts all those figures into the calculator.

- Size: 5,000,000,000,000,000 (5 PetaFpops, 5 quadrillion calculations)
- Speed: 142,541,780,304 (142.5 GigaFlops)
- DCF: 19.67

Put those together, and my calculator gets 690,213 seconds - 192 hours, or 8 days. 28% of the way through the task (in 2.5 hours), BOINC is still estimating 174 hours - over a week - to go: BOINC is very slow to switch from 'estimate' to 'experience' as a task is running. We're going to get a lot of panic (and possibly even aborted tasks) from inexperienced users before all this unwinds.
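(As a quick sanity check of the arithmetic described above - estimate = rsc_fpops_est / flops × DCF - the calculation can be reproduced in a few lines. The numbers below are copied straight from the client_state excerpts quoted in this post.)

```python
# Sketch of the runtime-estimate arithmetic described above:
#   estimated_seconds = rsc_fpops_est / flops * duration_correction_factor
# Values copied from the <workunit>, <app_version> and <project> excerpts.
rsc_fpops_est = 5_000_000_000_000_000      # 5 PetaFpops
flops = 142_541_780_304.165830             # 142.5 GigaFlops speed estimate
dcf = 19.676844                            # duration correction factor

estimated_seconds = rsc_fpops_est / flops * dcf
print(f"{estimated_seconds:,.0f} s = {estimated_seconds / 3600:.0f} h "
      f"= {estimated_seconds / 86400:.1f} days")
# -> about 690,213 s: roughly 192 hours, or 8 days, matching the figure above
```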
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
The BOINC manager suspends a WU which has a normal estimated run time when it receives a fresh WU with an overestimated run time (my personal high score is 2878(!) hours), which makes my batch programs think that they are stuck (actually they are; it's intentional, to give priority to the task with the overestimated run time). This is really annoying.
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261
> The BOINC manager suspends a WU which has a normal estimated run time when it receives a fresh WU with an overestimated run time (my personal high score is 2878(!) hours), which makes my batch programs think that they are stuck (actually they are; it's intentional, to give priority to the task with the overestimated run time).

It depends which version of the BOINC client you run. I'm testing new BOINC versions as they come out, too - that rig is currently one step behind, on v7.2.10. The behaviour of 'stopping the current task, and starting a later one' when in High Priority was acknowledged to have been a bug, and has been corrected now. BOINC is hoping to promote v7.2.xx to 'recommended' status soon - that should cure your annoyance.
Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0
> This is really annoying.

I agree, it is annoying. I reported it as soon as I spotted it, over a week ago. Hopefully the admins will take it a bit more seriously.
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261
> This is really annoying.

The problem Retvari Zoltan is annoyed about - BOINC suspending one task and running a different one when high priority is needed - isn't something the project admins can solve (except by getting the estimates right so EDF isn't needed, obviously). Decisions about which task from the cache to run next are taken locally by the BOINC core client. v6.10.60 is getting quite old now - and yes, this bug has been around that long. It was fixed in v7.0.14:

> EDF policy says we should run the ones with earliest deadlines.
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
> The problem Retvari Zoltan is annoyed about - BOINC suspending one task and running a different one when high priority is needed - isn't something the project admins can solve (except by getting the estimates right so EDF isn't needed, obviously).

That's why I've posted about my annoyance here. This overestimation misleads the BOINC manager in another way: it won't ask for new work, since it thinks that there is enough work in its queue.

> Decisions about which task from the cache to run next are taken locally by the BOINC core client. v6.10.60 is getting quite old now - and yes, this bug has been around that long. It was fixed in v7.0.14:

There is another annoying bug and an annoying GUI change which keep me from upgrading from v6.10.60. The bug is in the calculation of the required CPU percentage for GPU tasks: it can change from below 0.5 to over 0.5, and on a dual-GPU system that change results in a one-CPU-thread fluctuation. v6.10.60 underestimates the required CPU percentage for Kepler-based cards (0.04%), so the number of available CPUs doesn't fluctuate - this bug comes in handy. The annoying GUI change is the omitted "Messages" tab (actually it has been relocated to a submenu).
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261
I've just picked up a new Beta task - 7238202. New (to me) task type MJHARVEY_TESTSIN, and new cuda55 application version 8.06. The task has the same 5 PetaFpops estimate as we're used to seeing, and - what with it being a new app and all - the speed estimate was 295 GigaFlops, lower than previously on this host. DCF is still high, so BOINC calculated the runtime as 87 hours. Looks like it's turning into a 100-minute job... (but it still leapt into action in High Priority as soon as I let my AV complete the download).
Joined: 17 Aug 08 · Posts: 2705 · Credit: 1,311,122,549 · RAC: 0
You could configure the CPU percentage yourself with an app_config file (which the later BOINC versions support). I could send you mine, if you're interested. And the Messages tab... well, it's annoying, but it's only really needed when there are problems - which, fortunately, isn't all that often for me. Anyway, let's wait for MJH to work through these posts!

MrS

Scanning for our furry friends since Jan 2002
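(For reference, an app_config.xml along the lines below - placed in the GPUGrid project folder inside the BOINC data directory - is one way of pinning that CPU reservation. This is only a sketch: the app name matches the acemdlong entries quoted earlier in the thread, and the cpu_usage value is a placeholder to set to taste.)

```xml
<!-- Sketch of an app_config.xml for the acemdlong application.
     gpu_usage 1.0 = one task per GPU; cpu_usage is the CPU fraction
     BOINC reserves per task (placeholder value, adjust as needed). -->
<app_config>
  <app>
    <name>acemdlong</name>
    <gpu_versions>
      <gpu_usage>1.0</gpu_usage>
      <cpu_usage>0.5</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```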
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261
> I've just picked up a new Beta task - 7238202.

Completed in under 2 hours, and awarded 150,000 credits. This is getting silly.
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261
Just a heads-up: estimated times for the current v8.10 (cuda55) Beta are unusually low - it is likely that many runs (especially of the full-length production NOELIA_KLEBEbeta tasks being processed through the Beta queue) will fail with EXIT_TIME_LIMIT_EXCEEDED - after about an hour, on my GTX 670. If you are a BOINC user experienced and competent enough to edit client_state.xml - and taking all the standard safety warnings as read - you can avoid this by increasing <rsc_fpops_bound> for all GPUGrid Beta tasks. A couple of orders of magnitude should do it, maybe three for luck. |
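(To illustrate the kind of edit being suggested - purely as a hypothetical sketch, not an official tool - something like the following would multiply <rsc_fpops_bound> by 1000 for matching workunits in client_state.xml. It assumes the <workunit> layout quoted earlier in the thread; stop the BOINC client completely before touching the file, and keep a backup.)

```python
# Hypothetical sketch: bump <rsc_fpops_bound> for selected workunits in
# client_state.xml. Assumes the <workunit> layout quoted earlier in the
# thread. Stop the BOINC client first and keep a backup of the file.
import re
import shutil

STATE_FILE = "client_state.xml"      # path inside your BOINC data directory
NAME_TAG = "MJHARVEY_TEST"           # substring of the workunit names to fix
FACTOR = 1000                        # "a couple of orders of magnitude, maybe three"

shutil.copyfile(STATE_FILE, STATE_FILE + ".bak")   # keep a backup

with open(STATE_FILE) as f:
    text = f.read()

def bump(match):
    block = match.group(0)
    if NAME_TAG not in block:
        return block                 # leave other workunits untouched
    return re.sub(
        r"<rsc_fpops_bound>([\d.eE+]+)</rsc_fpops_bound>",
        lambda m: "<rsc_fpops_bound>%f</rsc_fpops_bound>" % (float(m.group(1)) * FACTOR),
        block,
    )

text = re.sub(r"<workunit>.*?</workunit>", bump, text, flags=re.DOTALL)

with open(STATE_FILE, "w") as f:
    f.write(text)
```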
Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0
Yup, I just had some tasks fail because of that poor server estimation. This is getting very ridiculous.

> Just a heads-up: estimated times for the current v8.10 (cuda55) Beta are unusually low - it is likely that many runs (especially of the full-length production NOELIA_KLEBEbeta tasks being processed through the Beta queue) will fail with EXIT_TIME_LIMIT_EXCEEDED - after about an hour, on my GTX 670.
MJH · Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0
This problem should be confined to the beta queue; it's a side-effect of having issued a series of short-running WUs with the same fpops estimate as the normal, longer-running ones. Please let me know if you start to see this problem on the important acemdshort and acemdlong queues. There's no reason why it should be happening there (any more than usual), but the client is full of surprises.

MJH
Joined: 11 Dec 11 · Posts: 21 · Credit: 145,887,858 · RAC: 0
All the WUs currently running on this host have an estimate of +/- 130 hrs! (Normal time to crunch a WU is 8-9 hrs.)

http://www.gpugrid.net/results.php?hostid=157835&offset=0&show_names=0&state=1&appid=

So most of them are currently crunching in high-priority mode.
Joined: 28 Jul 12 · Posts: 819 · Credit: 1,591,285,971 · RAC: 0
> Please let me know if you start to see this problem on the important acemdshort and acemdlong queues. There's no reason why it should be happening there (any more than usual), but the client is full of surprises.

I have tried 3 betas in the long queue, and two have failed at almost exactly the same running times. One was a Noelia, and the other was a Harvey. (On a GTX 650 Ti under Win7 64-bit and BOINC 7.2.11 x64.)

    8.10 ACEMD beta version (cuda55)  66-MJHARVEY_TEST10-42-50-RND0504_0  01:58:02  Reported: Computation error (197,)
MJH · Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0
I'm not sure what you mean by that. Those are WUs from the acemdbeta queue, run by the beta application. The acemdlong queue isn't involved.

MJH
Joined: 28 Jul 12 · Posts: 819 · Credit: 1,591,285,971 · RAC: 0
Sorry, I thought you were asking about the betas. No problems with the longs thus far; all seven that I have received under CUDA 5.5 have been completed successfully.

    8.03 Long runs (cuda55)  041px89-NOELIA_FRAG041p-3-4-RND5262_0      17:41:48
    8.03 Long runs (cuda55)  041px89-NOELIA_FRAG041p-2-4-RND5262_0      17:39:49
    8.03 Long runs (cuda55)  063ppx290-NOELIA_FRAG063pp-1-4-RND3152_0   20:59:32
    8.03 Long runs (cuda55)  I35R7-NATHAN_KIDKIXc22_6-8-50-RND8566_0
    8.02 Long runs (cuda55)  I50R6-NATHAN_KIDKIXc22_6-3-50-RND0333_0    17:48:16
    8.00 Long runs (cuda55)  I81R8-NATHAN_KIDKIXc22_6-4-50-RND0944_0    17:44:35
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261
> All the WUs currently running on this host have an estimate of +/- 130 hrs! (Normal time to crunch a WU is 8-9 hrs.)

That's DCF in action. It will work itself down eventually, but may take 20 - 30 tasks with a proper <rsc_fpops_est> to get there - unless DCF has already reached over 90; the normalisation process is slower in those extreme cases.
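(To put that '20 - 30 tasks' remark in context, here is a toy model - not the actual BOINC client code - of a correction factor that is only nudged part of the way towards the observed runtime/estimate ratio after each task that finishes faster than expected. The 10% step per task is an assumed weight chosen for illustration, and the even slower regime for extreme DCF values mentioned above is left out.)

```python
# Toy model (not the BOINC source): after each task that finishes faster
# than its estimate, nudge DCF a fraction of the way towards the observed
# runtime/estimate ratio. The 0.1 weight is an assumption for illustration.
def update_dcf(dcf, runtime_s, est_uncorrected_s, weight=0.1):
    raw_ratio = runtime_s / est_uncorrected_s   # what DCF "should" be
    if raw_ratio > dcf:                         # underestimated: jump straight up
        return raw_ratio
    return (1 - weight) * dcf + weight * raw_ratio  # overestimated: creep down

dcf = 19.67                   # starting point quoted earlier in the thread
runtime_s = 8.5 * 3600        # a typical 8-9 hour long-queue task
est_uncorrected_s = 9 * 3600  # estimate the task would get with DCF = 1

for task in range(1, 31):
    dcf = update_dcf(dcf, runtime_s, est_uncorrected_s)
    if task % 10 == 0:
        print(f"after {task} tasks: DCF = {dcf:.2f}")
# -> roughly 7.5 after 10 tasks, 3.2 after 20, 1.7 after 30 (toy numbers)
```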
Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0
MJH, thanks for finding the cause of the problem. Has the problem been fixed, such that new tasks (including beta!) are sent with appropriate fpops estimates?

Once the problem has been fixed at the server, a user who wants to reset the Duration Correction Factor immediately (instead of waiting for it to adjust down over the course of a few days/weeks) could follow these steps:

- Exit BOINC, including stopping running tasks
- Open client_state.xml within the data directory
- Search for the <project_name>GPUGRID</project_name> element
- Find the <duration_correction_factor> element within that project element
- Change the line to read: <duration_correction_factor>1</duration_correction_factor>
- Restart BOINC
- Monitor the value in the UI by viewing the GPUGrid project properties, to make sure it hopefully stays between 0.6 and 1.6

I just need to know when the problem has been fixed by the server, so I can begin testing the solution on my client (which has luckily not been full of too many surprises). Has it been fixed?

> This problem should be confined to the beta queue; it's a side-effect of having issued a series of short-running WUs with the same fpops estimate as the normal, longer-running ones.
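(For anyone who would rather script those steps, here is a rough equivalent of the manual edit above - again a hypothetical sketch, to be run only with BOINC stopped and with a backup kept. It assumes the <project> / <project_name>GPUGRID</project_name> / <duration_correction_factor> layout described in the steps.)

```python
# Sketch of the DCF reset described above; assumes the <project> block
# layout from this thread. Stop BOINC first and keep a backup.
import re
import shutil

STATE_FILE = "client_state.xml"   # inside the BOINC data directory
shutil.copyfile(STATE_FILE, STATE_FILE + ".bak")

with open(STATE_FILE) as f:
    text = f.read()

def reset(match):
    block = match.group(0)
    if "<project_name>GPUGRID</project_name>" not in block:
        return block              # leave other projects alone
    return re.sub(r"<duration_correction_factor>[^<]+</duration_correction_factor>",
                  "<duration_correction_factor>1.000000</duration_correction_factor>",
                  block)

with open(STATE_FILE, "w") as f:
    f.write(re.sub(r"<project>.*?</project>", reset, text, flags=re.DOTALL))
```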
Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 261
I think he means long tasks, like those NOELIA_KLEBE jobs, being processed through the Beta queue currently alongside your quick test pieces. The problem is that if you run a succession of short test units with full-size <rsc_fpops_est> values, the BOINC server thinks your host is insanely fast - it thinks my GTX 670 can complete ACEMD beta version 8.11 tasks (bottom of linked list) at 79.2 TeraFlops. When BOINC attempts any reasonably long job (a bit over an hour, in my case), the client thinks something has gone wrong and aborts it for taking too long. There's nothing the user can do to overcome that problem, except 'inoculate' each individual task as received with a big (100x or 1000x) increase to <rsc_fpops_bound>.