Remaining (Estimated) time is unusually high; duration correction factor unusually large

Message boards : Number crunching : Remaining (Estimated) time is unusually high; duration correction factor unusually large
Jacob Klein

Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 32263 - Posted: 25 Aug 2013, 13:25:59 UTC
Last modified: 25 Aug 2013, 13:27:50 UTC

Problem: Remaining (Estimated) time is unusually high; duration correction factor unusually large

Basically, recently I've noticed that my GPUGrid tasks are saying they're going to take a long long time (140 hours), when they normally complete in about 1-6 hours.

I've tracked it down to the duration correction factor. This is usually between 1 and 2 on my system, but something is making it unusually large, right now it is above 9.

Is the problem in:
- the new beta acemd application?
- the plan classes for that application?
- the MJHARVEY TEST 10/12/13 tasks I've recently processed?
- something else?

Is there any way I can further diagnose this issue?
We need to identify the problem, and correct it if possible.

Thanks,
Jacob
ID: 32263
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 32405 - Posted: 28 Aug 2013, 19:09:06 UTC - in response to Message 32263.  

It could easily be caused by a wrong estimate in some WU. But since no one else replied, it doesn't seem to be a persistent issue.

MrS
Scanning for our furry friends since Jan 2002
ID: 32405
Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 32409 - Posted: 28 Aug 2013, 19:25:16 UTC

There was at least 1 other user that reported it also, in the beta news thread.

All I can do is report the problem, with as much info as I can provide, like a good tester should. I wish it was properly investigated and fixed by the administrators.
ID: 32409
skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 32421 - Posted: 28 Aug 2013, 21:18:10 UTC - in response to Message 32409.  

I think the run time estimates are app based, so when a new batch of different WUs runs, it takes time for their estimated run times to auto-correct (on everyone's systems).

I'm not aware of what's been done server side, but there have been lots of WU changes and app changes recently. Some very short WUs turned up in the Long queue and in the Beta app test, and different WUs appeared in the Short queue.

On my W7 system the remaining runtime of a Short NATHAN_baxbimx WU is still being significantly over-estimated - 40% done after 80min but remaining is >8h (though it's falling fast).
A NATHAN_baxbimy WU on my 650TiBoost (Linux) system is closer to the mark - 25% done after 50min, estimated remaining is 3h 50min (but falling rapidly).

I think the estimates were worse yesterday and probably for the long runs too. So it's auto-correcting, but just slowly.

IIRC an alternative to the existing correction factor equation was proposed some time ago by a very able party.
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help
ID: 32421
nanoprobe
Joined: 26 Feb 12
Posts: 184
Credit: 222,376,233
RAC: 0
Message 32432 - Posted: 29 Aug 2013, 0:25:28 UTC - in response to Message 32409.  

There was at least 1 other user that reported it also, in the beta news thread.

All I can do is report the problem, with as much info as I can provide, like a good tester should. I wish it was properly investigated and fixed by the administrators.

You could go into the client state xml file and edit the dcf for GPUGrid tasks to whatever suits you. Problem solved.
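A rough sketch of what nanoprobe describes (a hypothetical helper, not part of BOINC; stop the client first and keep a backup before touching the real client_state.xml, whose location varies by platform):

```python
# Hypothetical sketch: reset the duration_correction_factor for one
# project inside BOINC's client_state.xml. The file layout shown in
# `sample` is a simplified assumption.
import re

def reset_dcf(state_xml, master_url, new_dcf=1.0):
    """Return state_xml with the DCF of the project at master_url reset."""
    def fix_project(m):
        block = m.group(0)
        if master_url not in block:
            return block  # leave other projects untouched
        return re.sub(
            r"<duration_correction_factor>[^<]*</duration_correction_factor>",
            "<duration_correction_factor>%.6f</duration_correction_factor>" % new_dcf,
            block,
        )
    return re.sub(r"<project>.*?</project>", fix_project, state_xml, flags=re.S)

sample = """<project>
<master_url>http://www.gpugrid.net/</master_url>
<duration_correction_factor>29.880000</duration_correction_factor>
</project>"""
print(reset_dcf(sample, "gpugrid.net"))
```

As Jacob notes in the next post, this only papers over the symptom: the next badly estimated task will skew the factor again.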
ID: 32432
Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 32433 - Posted: 29 Aug 2013, 0:42:18 UTC - in response to Message 32432.  

Until you get more tasks that throw it out of balance.
Problem remains to be solved at the server.
ID: 32433
TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 32436 - Posted: 29 Aug 2013, 1:05:51 UTC

I see this too: the remaining time is not correct. I first saw it after we got a lot of MJH beta tests to optimize the app, and then "normal" WUs again.
I thought it was not a big deal, as the WUs crunch at good speed. However, looking a bit at the system, it could be that because BOINC says it needs 4 more hours to finish, it will not request new work. In reality those 4 hours mean 20 minutes, so it could ask for new work. This could explain why I did not get new work yesterday afternoon, but when I hit the "update" button I got a WU.
If this is true, then it would be nice if it could be solved at the server side.
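TJ's work-fetch theory can be sketched like this (a toy model with assumed names, not actual BOINC client logic):

```python
# Toy model of TJ's point: the client only asks for more work when the
# estimated remaining runtime of cached tasks drops below its buffer.
# An inflated DCF makes a ~20-minute task look like 4 hours, so no
# request goes out until the user hits "update".

def should_request_work(remaining_est_hours, buffer_hours):
    return remaining_est_hours < buffer_hours

true_remaining = 20 / 60          # task really needs ~20 more minutes
dcf = 12.0                        # skewed duration correction factor
inflated = true_remaining * dcf   # client *thinks* 4 hours remain

print(should_request_work(inflated, buffer_hours=1.0))        # → False
print(should_request_work(true_remaining, buffer_hours=1.0))  # → True
```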
Greetings from TJ
ID: 32436
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 32491 - Posted: 29 Aug 2013, 18:49:25 UTC - in response to Message 32433.  

Until you get more tasks that throw it out of balance.
Problem remains to be solved at the server.

Agreed - it's not an option to let everyone manually edit config files. And since yesterday you've convinced me it's not an isolated issue.

MrS
Scanning for our furry friends since Jan 2002
ID: 32491
Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 32586 - Posted: 1 Sep 2013, 12:11:50 UTC - in response to Message 32491.  
Last modified: 1 Sep 2013, 12:16:19 UTC

... And now my Duration Correction Factor is back up to 12!

I think the MJHARVEY beta tasks are throwing it off (by possibly having bad estimates).
Is there a way to conclusively prove this?

Admins:
Is there a way to fix this server-side so it stops happening?
It may even be related to the recent EXIT_TIME_LIMIT_EXCEEDED errors people are getting!
ID: 32586
MJH
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32591 - Posted: 1 Sep 2013, 12:53:22 UTC - in response to Message 32586.  

Setting aside the WUs which are failing with 'max time exceeded' errors - those are a symptom of another, as yet unfixed, problem.
The application reports its progress to the BOINC client regularly, giving the client sufficient information to linearly extrapolate the completion time accurately. If it's not getting it right, then that's a client bug.

I've not heard of "duration correction factor" before, but it looks like the result of over-thinking the problem.

MJH
ID: 32591
Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 32592 - Posted: 1 Sep 2013, 12:58:16 UTC - in response to Message 32591.  
Last modified: 1 Sep 2013, 13:12:54 UTC

I'm fairly certain DCF is very relevant.

The BOINC client "remembers" how much time a task was estimated to take when it received said task. Then, when the task is done, it compares the actual time against the estimated time. It stores that off as a client-side "factor" for the project (viewable in the project properties as the Duration Correction Factor, or DCF), so that in the future, when it gets a task from that project, it can "dynamically adjust" the Remaining time to better reflect the estimate.
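A simplified model of that mechanism might look like the following (not the actual BOINC client code - its update rule is more nuanced - and the up-fast/down-slow asymmetry here is an assumption based on the behaviour described in this thread):

```python
# Simplified DCF model: after each completed task, fold the ratio of
# actual to estimated runtime into a per-project correction factor.
# It jumps up quickly on a bad underestimate and drifts down slowly.

def update_dcf(dcf, estimated_s, actual_s):
    ratio = actual_s / estimated_s
    if ratio > dcf:
        return ratio                   # one bad estimate inflates it at once
    return 0.9 * dcf + 0.1 * ratio     # otherwise decay back gradually

# One badly estimated beta task can skew the factor for a long time:
dcf = 1.2                                       # a normal value
dcf = update_dcf(dcf, estimated_s=600, actual_s=6 * 3600)  # 10 min est, 6 h real
print(round(dcf, 2))  # → 36.0
```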

The problem we're seeing is that DCF is way larger than expected, all of a sudden. And I believe the root cause is that recent tasks may have been sent with very incorrect estimated times (or maybe very incorrect estimated sizes, in FLOPS or FPOPS).

Can you please check into the estimates of recent tasks?
Are the beta ones or the "_TEST" ones possibly very incorrect?


:) Right now my DCF (which is usually between 1.0 and 1.6) is at 29.88, and a long task (which usually completes in 8-16 hours) is currently estimated at 281 hours (~12 days!)

Note: Projects can also choose to "not use" DCF, especially when they have various apps where estimates could be way off, per app. World Community Grid, for instance, has disabled DCF usage. I personally believe GPUGrid should keep using it.
ID: 32592
MJH
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32593 - Posted: 1 Sep 2013, 13:13:14 UTC - in response to Message 32592.  

Well, if this DCF thing is keeping a running average of WU runtimes, then I expect it is simply losing its marbles when it encounters one of the WUs that makes no progress.

Am I correct in thinking that you like DCF because it gives you an estimate of WU runtime *before* the WU has begun executing? Certainly, once the WU is in progress the DCF is no longer relevant because the client has real data to work with. If it is still reporting the DCF-estimate completion time rather than a live extrapolation, that should be reported as a client bug.

MJH
ID: 32593
Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 32594 - Posted: 1 Sep 2013, 13:24:09 UTC - in response to Message 32593.  
Last modified: 1 Sep 2013, 14:17:34 UTC

I'm not certain why it is "losing its marbles" lol, but I don't think it's related to a "no progress" task.

Instead, I think it gets "adjusted" when a task is completed, comparing the "estimated time before starting the task" versus the "actual time after completing the task". The estimated task size is very important in that calculation, and it feels like some tasks went through with incorrect estimated sizes.

I like DCF for several reasons. It accounts for "how BOINC normally runs, possibly fully loaded with CPU tasks", it accounts for computer-specific hardware scenarios that the project doesn't know about, maybe the user plays games while GPU crunching some of the time... all sorts of reasons to use it. I just wish it was app-based and not project-based. But, for GPUGrid, I think it's useful to still use, since you have been able to make accurate estimates for all your apps, in the past, and you seem to have always used the same method of making those estimates regardless of app.

Regarding how it works... Once the WU is in progress, I believe DCF is still used right up until the task finishes. (Not all tasks report linear progress like GPUGrid does.) So, for instance, here where the DCF is very skewed, the "remaining" time may start at 281 hours, but for each clock-second that passes, the remaining time decreases by 30 seconds. Once tasks complete, the DCF gets adjusted for future tasks.

Again, I believe the real issue is that something is wrong in the estimates of recent tasks.

Have you investigated whether
<rsc_fpops_est>
... was and is being set correctly for all recent tasks?
If it's incorrect, it'll be the cause of DCF being improperly adjusted, resulting in bad estimated times people are seeing.

Also, have you investigated whether
<rsc_fpops_bound>
... was and is being set correctly for all recent tasks?
If it's incorrect, it'll be the cause of the "exceeded maximum time" EXIT_TIME_LIMIT_EXCEEDED errors people are seeing.
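To make the roles of the two values concrete, here is a hedged sketch (simplified - the real client also folds in DCF and other corrections) using the numbers that appear elsewhere in this thread, including the inflated cuda42 host speed Richard reports:

```python
# Sketch: <rsc_fpops_est> drives the remaining-time display, while
# <rsc_fpops_bound> sets the point at which the client aborts a task
# with EXIT_TIME_LIMIT_EXCEEDED. If the host's speed is grossly
# overestimated, the bound shrinks and normal tasks get killed.

def estimated_seconds(rsc_fpops_est, device_flops, dcf=1.0):
    return rsc_fpops_est / device_flops * dcf

def hits_time_limit(elapsed_s, rsc_fpops_bound, device_flops):
    return elapsed_s > rsc_fpops_bound / device_flops

est = 5e15        # rsc_fpops_est seen in this thread
bound = 2.5e17    # rsc_fpops_bound seen in this thread
flops = 5.28e13   # the implausible ~52.8 TFLOPS cuda42 speed

print(round(estimated_seconds(est, flops) / 60))  # → 2  (minutes shown)
print(hits_time_limit(3 * 3600, bound, flops))    # → True (3 h task aborted)
```

So a single bad pair of values can plausibly explain both symptoms at once: absurd estimates and "exceeded maximum time" aborts.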
ID: 32594
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 32599 - Posted: 1 Sep 2013, 18:04:59 UTC

I agree with what Jacob says here.

MrS
Scanning for our furry friends since Jan 2002
ID: 32599
Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 318
Message 32603 - Posted: 1 Sep 2013, 19:33:44 UTC

My GTX 670 host 132158 has been running through a variety of Beta tasks.

I recently finished one of the full-length NOELIA_KLEBEbeta tasks being put through the Beta queue; I'm now working through some more MJHARVEY_TEST units.

Runtimes vary, as you would expect: in very round figures, the NOELIA ran for 10 hours, and the test pieces for 100 minutes, 10 minutes, or 1 minute - a ratio of 600:1 between the longest and shortest.

I've checked a few <rsc_fpops_est> and <rsc_fpops_bound> pairs, and every workunit has been identical. The values are:

    <rsc_fpops_est>5000000000000000.000000</rsc_fpops_est>
    <rsc_fpops_bound>250000000000000000.000000</rsc_fpops_bound>

Giving identical estimates to such a wide range of tasks is going, I'm afraid, to end in tears.

BOINC Manager has translated those fpops_est into hours and minutes for me. For the cuda55 plan class, the estimate is roughly right for the NOELIA full task I've just completed - 8 hours 41 minutes. But to get that figure, BOINC has had to apply a DCF of 18.4 - it should be somewhere around 1.0.
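That arithmetic can be sanity-checked under the simple model estimate = rsc_fpops_est / flops × DCF:

```python
# Check: with rsc_fpops_est = 5e15 and the cuda55 <flops> of ~2.945e12,
# the raw estimate is about 28 minutes; a DCF of 18.4 stretches it to
# roughly the 8 h 41 min BOINC Manager displays.

est_fpops = 5e15
flops_cuda55 = 2945205633626.37
dcf = 18.4

raw_s = est_fpops / flops_cuda55
corrected_h = raw_s * dcf / 3600
print(round(raw_s / 60))      # → 28  (raw estimate, minutes)
print(round(corrected_h, 1))  # → 8.7 (DCF-corrected estimate, hours)
```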

On the other hand, one of the test units was allocated under cuda42, and was showing an estimate of 24 minutes. Same size task, different estimates - that can only happen one way, and here it is:

    <app_name>acemdbeta</app_name>
    <version_num>804</version_num>
    <platform>windows_intelx86</platform>
    <avg_ncpus>0.350000</avg_ncpus>
    <max_ncpus>0.571351</max_ncpus>
    <flops>52770755214365.750000</flops>
    <plan_class>cuda42</plan_class>

    <app_name>acemdbeta</app_name>
    <version_num>804</version_num>
    <platform>windows_intelx86</platform>
    <avg_ncpus>0.350000</avg_ncpus>
    <max_ncpus>0.666596</max_ncpus>
    <flops>2945205633626.370100</flops>
    <plan_class>cuda55</plan_class>

Pulling those out into a more legible format, my speeds are supposed to be:

cuda42: 52,770,755,214,365
cuda55: 2,945,205,633,626

That's 3 TeraFlops for cuda55, and over 50 TeraFlops for cuda42 - no, I don't think so. The marketing people at NVidia, as interpreted by BOINC, do give these cards a rating of 2915 GFLOPS peak, but I think we all know that not every flop is usable in the real world - pesky things like PCIe bus transfers, and memory read/writes, get in the way.

The APR (Average Processing Rate) figures for that host are shown on the application details page.

For mainstream processing under cuda42, I'm showing 147 (units: GigaFlops) for both short and long runs. About one-twentieth of theoretical peak feels about right, and matches the new DCF. For the new Beta runs I'm showing crazy speeds here as well - up to (and above) 80 TeraFlops.
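A quick check on that ratio:

```python
# Sanity check: the measured APR of 147 GFLOPS against NVidia's
# 2915 GFLOPS peak rating is indeed roughly one-twentieth.
peak_gflops = 2915
apr_gflops = 147
print(round(peak_gflops / apr_gflops))  # → 20
```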

Guys, I'm sorry, but that's what happens when you send out 1-minute tasks without changing <rsc_fpops_est> from the value you used for 10 hour tasks.
ID: 32603
Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 32605 - Posted: 1 Sep 2013, 20:08:15 UTC - in response to Message 32603.  
Last modified: 1 Sep 2013, 20:10:50 UTC

Yes, a million times over!

Admins:
Regardless of whether we're still sending these tasks out, could we PLEASE make sure the code is correctly setting the values for them? They're causing estimation chaos!
ID: 32605
Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 318
Message 32607 - Posted: 1 Sep 2013, 20:53:50 UTC

Disclaimer first: I speak as a long-term (interested) volunteer on a number of BOINC projects, but I have no first-hand experience of administering a BOINC server. What follows is my personal opinion only, but an opinion informed by seeing similar crises at a number of projects.

My opinion is that the current issues largely arise from the code behind http://boinc.berkeley.edu/trac/wiki/RuntimeEstimation. That code appears to be exquisitely sensitive to outlier values in wu.rsc_fpops_est, and the hypersensitivity has a nasty habit of creeping out of the woodwork and biting sysadmins' ankles at times of stress.

I think there are two reasons for that:

1) It's exactly at times like this - working weekend overtime to chase out a stubborn application bug, with a project full of volunteers clamouring for new (error-free) work - that harassed sysadmins are likely to take obvious shortcuts: put the new test app on the server with the same settings as last time, make the tasks shorter so the results come back sooner. I think we can all recognise, and empathise with, that syndrome.

2) BOINC's operations manual, and systems training courses (if they run any), don't emphasise this enough: there is no shortcut through here - here be dragons, and they will demand a sacrifice if you get it wrong. Unless you run a completely sandboxed test server, these beta apps - it's always beta apps - will screw up your averages and estimates.

This runtime estimation code has been active in the master code base for BOINC servers for over three years, and as I said, I've seen problems like this several times now - the most spectacular was at AQUA, where it spilled over into credit awards too.

But to the best of my knowledge, no post-mortem has ever been carried out to find out why things go so badly wrong. In - again - my personal opinion only, this area of code needs to be re-visited, and re-written according to sound engineering - fault tolerant and error tolerant - principles (the way the original BOINC server code was written).

However, I see no sign that the BOINC core programmers have any plans to do this. Some project administrators have looked into the code, and expressed frustration at having to work with it in its current form - Eric Korpela of SETI and Bernd Machenschalk of Einstein come to mind - but I've not heard of any improvements being migrated back into the master codebase.

There is a workshop "for researchers, scientists and developers with significant experience or interest in BOINC" at Inria near Grenoble, at the end of this month. If anybody reading this has any sympathy with my point of view, that workshop might be a suitable opportunity to raise any concerns you might have.
ID: 32607
Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 32608 - Posted: 1 Sep 2013, 21:05:57 UTC - in response to Message 32607.  

:) I hope we don't have to travel to France for a solution.

Admins:
Can you please solve this issue? We've given tons of information that should point you in the right direction I hope!
ID: 32608
skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 32609 - Posted: 1 Sep 2013, 23:22:50 UTC - in response to Message 32608.  

Many projects have separate apps (and queues) for different task types, as it makes things more manageable.
However at GPUGrid many different WU types are run in the same queue. For example,
I've just run a NOELIA_KLEBbeta WU that took 13.43h and had an estimate of 5,000,000 GFLOPs, and a MJHARVEY_TEST11 that took 2.08h but also had an estimated requirement of 5,000,000 GFLOPs.

I thought the estimated GFLOPs was set on an application basis. Is that correct?
If that's the case, then you cannot run multiple WU types in the same app queue without messing up the estimated runtimes.

PS. In Project Preferences there are 4 apps (queues),
    ACEMD short runs (2-3 hours on fastest card) for CUDA 4.2:
    ACEMD short runs (2-3 hours on fastest card) for CUDA 3.1: - deprecated?
    ACEMD beta:
    ACEMD long runs (8-12 hours on fastest GPU) for CUDA 4.2:


In server status there are 3 apps,

    Short runs (2-3 hours on fastest card)
    ACEMD beta version
    Long runs (8-12 hours on fastest card)


FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help
ID: 32609
Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 32610 - Posted: 1 Sep 2013, 23:47:00 UTC

Each task is sent with an estimate. You can even view that estimate in the task properties. It can be different amongst tasks within the same app.

We need the admins to perform an investigation, and to fix the problem.
ID: 32610