Remaining (Estimated) time is unusually high; duration correction factor unusually large

Author	Message
MJH (Joined: 12 Nov 07, Posts: 696, Credit: 27,266,655, RAC: 0)
Message 32682 - Posted: 4 Sep 2013, 10:43:36 UTC - in response to Message 32679.

> The problem is, that if you run a succession of short test units with full-size <rsc_fpops_est> values, then the BOINC server thinks your host is insanely fast

The server doesn't have any part in it - it's the client making that decision.

Anyway, the last batch of short WUs has been submitted, with no more to follow. Hopefully the client will be as quick to correct itself back as before.

MJH

Richard Haselgrove (Joined: 11 Jul 09, Posts: 1639, Credit: 10,159,968,649, RAC: 0)
Message 32683 - Posted: 4 Sep 2013, 10:57:28 UTC - in response to Message 32682.

>> The problem is, that if you run a succession of short test units with full-size <rsc_fpops_est> values, then the BOINC server thinks your host is insanely fast
>
> The server doesn't have any part in it - it's the client making that decision.
>
> Anyway, the last batch of short WUs has been submitted, with no more to follow. Hopefully the client will be as quick to correct itself back as before.
>
> MJH

I beg to differ. Please have a look at http://boinc.berkeley.edu/trac/wiki/RuntimeEstimation - the problem is the host_app_version table described under 'The New System'. You have that here: the server calculates the effective speed of the host, based on an average of previously completed tasks. You can see - but not alter - those 'effective' speeds as 'Average Processing Rate' on the application details page for each host. That's what I linked for mine: the units are gigaflops.

The server passes that effective flops rating to the client with each work allocation, and yes: it's the client which makes the final decision to abort work with EXIT_TIME_LIMIT_EXCEEDED - but it does so on the basis of data maintained and supplied by the server.
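A minimal sketch of the arithmetic described in the post above may help. It assumes that the client divides the workunit's <rsc_fpops_est> by the server-supplied flops rating to get a runtime estimate, and divides <rsc_fpops_bound> by the same rating to get the abort threshold. Those two formulas, the function names, and all the numbers are assumptions made for illustration; they are not taken from this thread or copied from the BOINC source.

```python
# Sketch only: both formulas are assumptions about the client's arithmetic,
# and every number below is illustrative rather than taken from a real task.

def estimated_runtime(rsc_fpops_est: float, flops: float, dcf: float = 1.0) -> float:
    """Displayed estimate in seconds (assumed: fpops_est / flops, scaled by DCF)."""
    return rsc_fpops_est / flops * dcf


def max_elapsed_time(rsc_fpops_bound: float, flops: float) -> float:
    """Elapsed seconds allowed before an abort (assumed: fpops_bound / flops)."""
    return rsc_fpops_bound / flops


def exceeds_time_limit(elapsed_seconds: float, rsc_fpops_bound: float, flops: float) -> bool:
    """True if a task that has run this long would be aborted under the assumption above."""
    return elapsed_seconds > max_elapsed_time(rsc_fpops_bound, flops)


# Hypothetical workunit and host values.
RSC_FPOPS_EST = 5e15       # hypothetical full-size estimate
RSC_FPOPS_BOUND = 2.5e17   # hypothetical bound, 50x the estimate
REALISTIC_FLOPS = 2e11     # hypothetical sane rating (0.2 TFlops)
INFLATED_FLOPS = 7e13      # hypothetical rating after many short test tasks
EIGHT_HOURS = 8 * 3600

if __name__ == "__main__":
    for label, flops in (("realistic", REALISTIC_FLOPS), ("inflated", INFLATED_FLOPS)):
        limit = max_elapsed_time(RSC_FPOPS_BOUND, flops)
        aborted = exceeds_time_limit(EIGHT_HOURS, RSC_FPOPS_BOUND, flops)
        print(f"{label}: limit {limit:.0f} s, 8-hour task aborted: {aborted}")
```

Under these assumptions the same workunit passes comfortably with the realistic rating and is aborted with the inflated one, even though nothing about the task itself has changed.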

Jeremy Zimmerman (Joined: 13 Apr 13, Posts: 61, Credit: 726,605,417, RAC: 0)
Message 32684 - Posted: 4 Sep 2013, 12:36:36 UTC - in response to Message 32682. Last modified: 4 Sep 2013, 12:41:31 UTC

In the last month, my two machines seem to have stabilized for errors. The last errors were the server-canceled runs on Aug 24 and, before that, Jul 30 - Aug 11 with the NOELIA runs.

I only added the Beta runs to the allowed list a couple of days ago, and last night they started running on one machine. The HARVEYs went through with no problem. The NOELIA_KLEBEbeta did error with 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED after running ~30 of the HARVEYs.
http://www.gpugrid.net/result.php?resultid=7246089

Having reviewed all the comments about rsc_fpops_bound, duration_correction_factor, etc., it seems that with the current setup, one-minute runs and 8-hour runs should not be mixed under the same app, since it is our side which calculates how long a task should take, and our side is limited by the app definitions:

<app_version>
    <app_name>acemdshort</app_name>
    <version_num>800</version_num>
    <flops>741352314880.428590</flops>   (0.74 TFlops)
    <file_name>acemd.800-42.exe</file_name>
<app_version>
    <app_name>acemdshort</app_name>
    <version_num>802</version_num>
    <flops>243678296053.232120</flops>   (0.24 TFlops)
    <file_name>acemd.802-55.exe</file_name>
<app_version>
    <app_name>acemdlong</app_name>
    <version_num>803</version_num>
    <flops>133475073594.177120</flops>   (0.13 TFlops)
    <file_name>acemd.800-55.exe</file_name>
<app_version>
    <app_name>acemdlong</app_name>
    <version_num>803</version_num>
    <flops>185603220803.599580</flops>   (0.19 TFlops)
    <file_name>acemd.800-55.exe</file_name>
<app_version>
    <app_name>acemdbeta</app_name>
    <version_num>811</version_num>
    <flops>69462341903928.672000</flops>   (69.46 TFlops)
    <file_name>acemd.811-55.exe</file_name>

My two machines make this a bit more complicated, in that they both have a GTX680 and a GTX460, so it seems that the estimated time remaining is driven by the GTX460 speed. That seems to work out OK, though. There seems to be no way to track the different cards separately, either locally or on the server, so this is a moot point.

So if the acemdbeta app could be run as at least two different versions with an identical binary (e.g. 5min_811 and 8hr_811, with expected times of 5 minutes and 8 hours), this would at least separate runs such as HARVEY and KLEBE, whose run times differ by orders of magnitude. This might reduce the number of errors due to time-outs.
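To put the figures in the post above side by side, the short sketch below compares the <flops> values quoted there. The values are copied from the post; the dictionary labels, the helper name and the interpretation (that an estimate scales inversely with the flops rating) are assumptions made for illustration.

```python
# <flops> values copied from the app_version entries quoted in the post above.
# The labels are ad hoc; "(a)" and "(b)" distinguish the two acemdlong entries.
FLOPS = {
    "acemdshort 800": 741352314880.428590,
    "acemdshort 802": 243678296053.232120,
    "acemdlong 803 (a)": 133475073594.177120,
    "acemdlong 803 (b)": 185603220803.599580,
    "acemdbeta 811": 69462341903928.672000,
}


def speed_ratio(name: str, reference: str) -> float:
    """How many times faster one app version is rated than another."""
    return FLOPS[name] / FLOPS[reference]


if __name__ == "__main__":
    for name in FLOPS:
        ratio = speed_ratio("acemdbeta 811", name)
        print(f"acemdbeta 811 is rated {ratio:8.1f}x the speed of {name}")
```

On these numbers the beta app version is rated a few hundred times faster than the acemdlong versions on the same host. If estimates and time limits scale inversely with that rating, as assumed here, a full-length task under the beta app would be given a correspondingly smaller time allowance.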

juan BFP (Joined: 11 Dec 11, Posts: 21, Credit: 145,887,858, RAC: 0)
Message 32685 - Posted: 4 Sep 2013, 12:42:39 UTC. Last modified: 4 Sep 2013, 12:43:11 UTC

Thanks all for the explanations; the path explained by Jacob apparently fixes the times.

Now the question remains: is it fixed on the server side? Can I allow new Beta WUs again, or is it better to wait a little longer?

Richard Haselgrove (Joined: 11 Jul 09, Posts: 1639, Credit: 10,159,968,649, RAC: 0)
Message 32687 - Posted: 4 Sep 2013, 13:14:54 UTC - in response to Message 32685.

> Thanks all for the explanations; the path explained by Jacob apparently fixes the times.
>
> Now the question remains: is it fixed on the server side? Can I allow new Beta WUs again, or is it better to wait a little longer?

Having consulted the usual oracle on that, we think it would be wise to wait a little longer.

Although changing DCF will change the displayed estimates for runtime, we don't think it affects the underlying calculations for EXIT_TIME_LIMIT_EXCEEDED. And whatever you change, DCF is re-calculated every time a task exits. If you happen to draw another NOELIA_KLEBEbeta, DCF will go right back up through the roof in one jump when it finishes.

The only solution is a new Beta app installation, with a new set of APRs - and that has to be done on the server.
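The asymmetric behaviour described above (DCF jumping up in one step, then drifting back slowly) can be sketched as follows. This is a simplified model, not the client's actual code: the update rule, the smoothing constant and the function name are all assumptions chosen to reproduce the behaviour reported in this thread.

```python
# Simplified model of a duration correction factor (DCF) update.
# Assumption: an underestimate raises DCF immediately to the observed ratio,
# while an overestimate only pulls DCF down by a fraction per finished task.

def update_dcf(dcf: float, elapsed: float, estimated_uncorrected: float,
               smoothing: float = 0.1) -> float:
    """Return the new DCF after one task finishes (illustrative rule only)."""
    raw_ratio = elapsed / estimated_uncorrected
    if raw_ratio > dcf:
        # Task took longer than the corrected estimate: jump up in one step.
        return raw_ratio
    # Task was quicker than expected: move only part of the way down.
    return dcf * (1.0 - smoothing) + raw_ratio * smoothing


if __name__ == "__main__":
    dcf = 1.0
    # One 8-hour task whose uncorrected estimate was only 77 seconds (hypothetical).
    dcf = update_dcf(dcf, elapsed=8 * 3600, estimated_uncorrected=77.0)
    print(f"after the long task: DCF = {dcf:.1f}")
    # A run of correctly estimated tasks brings DCF back only gradually.
    for n in range(1, 31):
        dcf = update_dcf(dcf, elapsed=100.0, estimated_uncorrected=100.0)
        if n % 10 == 0:
            print(f"after {n} well-estimated tasks: DCF = {dcf:.1f}")
```

In this model a single badly estimated task undoes any manual correction at once, which matches the warning in the post above that editing DCF by hand is only a temporary fix.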

Jacob Klein (Joined: 11 Oct 08, Posts: 1127, Credit: 1,901,927,545, RAC: 0)
Message 32688 - Posted: 4 Sep 2013, 13:18:43 UTC - in response to Message 32687.

Richard is right, as usual.

I'm going to take my computer off the beta queue until a new app and new APRs are in place; too much computing power is being wasted.

MJH: Please let us know when a new beta app (with new APRs) is released on the server.

>> Thanks all for the explanations; the path explained by Jacob apparently fixes the times.
>>
>> Now the question remains: is it fixed on the server side? Can I allow new Beta WUs again, or is it better to wait a little longer?
>
> Having consulted the usual oracle on that, we think it would be wise to wait a little longer.
>
> Although changing DCF will change the displayed estimates for runtime, we don't think it affects the underlying calculations for EXIT_TIME_LIMIT_EXCEEDED. And whatever you change, DCF is re-calculated every time a task exits. If you happen to draw another NOELIA_KLEBEbeta, DCF will go right back up through the roof in one jump when it finishes.
>
> The only solution is a new Beta app installation, with a new set of APRs - and that has to be done on the server.

juan BFP (Joined: 11 Dec 11, Posts: 21, Credit: 145,887,858, RAC: 0)
Message 32689 - Posted: 4 Sep 2013, 13:38:42 UTC - in response to Message 32687.

> Having consulted the usual oracle on that, we think it would be wise to wait a little longer.

<Waiting>1<Waiting> hope not for a long time and thanks for the help.

> Richard is right, as usual.
>
> I'm going to take my computer off the beta queue until a new app and new APRs are in place; too much computing power is being wasted.
>
> Please let us know when a new beta app (with new APRs) is released on the server.

+1

MJH (Joined: 12 Nov 07, Posts: 696, Credit: 27,266,655, RAC: 0)
Message 32692 - Posted: 4 Sep 2013, 13:51:18 UTC - in response to Message 32688.

> Richard is right, as usual.
>
> I'm going to take my computer off the beta queue until a new app and new APRs are in place; too much computing power is being wasted.
>
> MJH: Please let us know when a new beta app (with new APRs) is released on the server.

Jacob - the beta testing is over now. 8.11 is the final revision and is out now on beta and short.

Now, I know these wrong estimates have been a cause of frustration for you, but in fact the WUs haven't been going to waste - they've been doing enough work to help me fix the bugs I was looking at.

MJH

Jacob Klein (Joined: 11 Oct 08, Posts: 1127, Credit: 1,901,927,545, RAC: 0)
Message 32693 - Posted: 4 Sep 2013, 13:57:46 UTC - in response to Message 32692.

MJH:

If people get stuck in a "Maximum Time Exceeded" loop, where they cannot complete their tasks, then surely their time will be wasted with the work, no?

Richard knows what needs to be done in order to get people out of that scenario. It has to do with the APRs on the Application Details page - Average processing rate is way too high for some applications.

As long as the "busted APR" applications aren't used anymore, then I think the problem will go away from a user's perspective, as DCF will eventually work its way back to 0.6-1.6. Since you say the beta is over, I take it we're done issuing tasks on these "busted APR" applications, right? (i.e. no future beta will ever use these applications)

But I'm also hopeful you can do something to prevent the issue from happening in the future - maybe some safeguard that ensures proper fpops estimates.

Just a thought. For your reference, I care more about the "Maximum time exceeded" errors than the "bad estimates" problems. The former cause lost work/time.

MJH (Joined: 12 Nov 07, Posts: 696, Credit: 27,266,655, RAC: 0)
Message 32694 - Posted: 4 Sep 2013, 14:11:43 UTC - in response to Message 32693.

> If people get stuck in a "Maximum Time Exceeded" loop, where they cannot complete their tasks, then surely their time will be wasted with the work, no?

From a BOINC credit perspective, yes. But note that the completed WUs were receiving a generous award, which ought to have been some compensation. Importantly though, from a development perspective they weren't wasted. The failures I was interested in happened very quickly after start-up. If the WU ran long enough for MTE, it had run long enough to accomplish its purpose.

> But I'm also hopeful you can do something to prevent the issue from happening in the future - maybe some safeguard that ensures proper fpops estimates.

Yes, of course! We've not tried this method of live debugging using short WUs before and weren't expecting this unfortunate side-effect. Next time the fpops estimate will be dialled down appropriately.

MJH

Jacob Klein (Joined: 11 Oct 08, Posts: 1127, Credit: 1,901,927,545, RAC: 0)
Message 32695 - Posted: 4 Sep 2013, 14:16:14 UTC - in response to Message 32694. Last modified: 4 Sep 2013, 14:19:53 UTC

>> But I'm also hopeful you can do something to prevent the issue from happening in the future - maybe some safeguard that ensures proper fpops estimates.
>
> Yes, of course! We've not tried this method of live debugging using short WUs before and weren't expecting this unfortunate side-effect. Next time the fpops estimate will be dialled down appropriately.
>
> MJH

Thank you. That is what I/we needed to hear.

We understand it's a bit of a learning experience, since you were trying a new way to weed out errors and move forward. I'm glad you know more about this issue:
- APRs and app versions
- How they affect the estimated fpops
- How the fpops bound ends up affecting Maximum Time Exceeded
- How the client keeps track of estimation using a project-wide variable [Duration Correction Factor (DCF)] to show estimated times in the UI

Next time, I'm sure it'll go much more smoothly :)

Thanks for your responses.

Richard Haselgrove (Joined: 11 Jul 09, Posts: 1639, Credit: 10,159,968,649, RAC: 0)
Message 32698 - Posted: 4 Sep 2013, 14:42:37 UTC

I was just about to suggest that we wait until this urgent fine-tuning of the app was complete, but I see we've reached that point already, if you're happy with v8.11.

So I guess we're left with some MJHARVEY_TEST tasks with, ahem, 'creative' runtime estimates working their way through the system on the Beta queue, and in addition some of the full-length NOELIA_KLEBEbeta tasks. Are you able to monitor the numbers of tasks of each type still remaining in the queue?

The best course of action might be to wait until all MJHARVEY_TEST tasks are complete and out of the way - or even help them along, with a 'cancel if unstarted' purge. Then, once we are sure that no MJHARVEY_TEST tasks are left or in any danger of being re-issued with current estimates, re-deploy the Beta app as version 812 (there doesn't have to be any change in the app itself). That could then be left to clean up the remaining NOELIA_KLEBEbeta tasks with a new app_version detail record, which would automatically be populated with more realistic APRs.

There's nothing you can do at the server end to hasten the correction of DCF - that's entirely managed by the client, at our end. Those who feel confident will adjust their own, others will have to wait, but it'll come right in the end.

One other small point while we're here - I noticed that the CUDA 5.5 app_version referenced the v5.5 cudart and cufft DLLs - correct. But it also referenced the equivalent CUDA 4.2 DLLs. Subject to checking at your end, that sounds like duplication, which might add an unnecessary bandwidth overhead as both sets of files are downloaded by new testers. Might be scope for a bit of a saving there.

MJH (Joined: 12 Nov 07, Posts: 696, Credit: 27,266,655, RAC: 0)
Message 32702 - Posted: 4 Sep 2013, 17:41:03 UTC - in response to Message 32698.

> There's nothing you can do at the server end to hasten the correction of DCF - that's entirely managed by the client, at our end. Those who feel confident will adjust their own, others will have to wait, but it'll come right in the end.

Will disconnecting and re-attaching to the project force a reset?

> But it also referenced the equivalent CUDA 4.2 DLLs.

Yes - 42 and 55 DLLs are both delivered irrespective of the app version. It's a side-effect of our deployment mechanism. Will probably fix it later.

MJH

Jacob Klein (Joined: 11 Oct 08, Posts: 1127, Credit: 1,901,927,545, RAC: 0)
Message 32703 - Posted: 4 Sep 2013, 17:44:05 UTC - in response to Message 32702. Last modified: 4 Sep 2013, 17:44:28 UTC

>> There's nothing you can do at the server end to hasten the correction of DCF - that's entirely managed by the client, at our end. Those who feel confident will adjust their own, others will have to wait, but it'll come right in the end.
>
> Will disconnecting and re-attaching to the project force a reset?

Yes, I believe it will, but the user loses all the local stats for the project, plus any files that had been downloaded. For resetting DCF, I prefer to close BOINC, carefully edit the client_state.xml file, then reopen BOINC.
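The manual edit mentioned above could be scripted along the following lines. This is a sketch under several assumptions: that client_state.xml stores the value in a <duration_correction_factor> element inside each <project> block, that the project is identified by a <master_url> element, and that http://www.gpugrid.net/ is the URL recorded for this project. The helper name is hypothetical. As the post says, BOINC should be closed first, and a backup copy of the file is a sensible precaution.

```python
import re

# Assumed element names; check them against your own client_state.xml first.
DCF_PATTERN = re.compile(
    r"<duration_correction_factor>[^<]*</duration_correction_factor>"
)
PROJECT_PATTERN = re.compile(r"<project>.*?</project>", re.DOTALL)


def reset_dcf(state_text: str, master_url: str, new_value: float = 1.0) -> str:
    """Return state_text with the DCF of one project set to new_value."""
    url_tag = f"<master_url>{master_url}</master_url>"
    new_tag = (
        f"<duration_correction_factor>{new_value:.6f}</duration_correction_factor>"
    )

    def fix_project(match: re.Match) -> str:
        block = match.group(0)
        if url_tag not in block:
            return block  # leave other projects untouched
        return DCF_PATTERN.sub(new_tag, block)

    return PROJECT_PATTERN.sub(fix_project, state_text)


if __name__ == "__main__":
    # Work on a copy of the file while BOINC is not running (path is an example).
    with open("client_state.xml", encoding="utf-8") as fh:
        original = fh.read()
    with open("client_state.xml.new", "w", encoding="utf-8") as fh:
        fh.write(reset_dcf(original, "http://www.gpugrid.net/"))
```

Writing the result to a separate file, rather than overwriting client_state.xml in place, leaves room to compare the two before swapping them. As noted earlier in the thread, a hand-set DCF only affects the displayed estimates and will be recalculated when the next task finishes.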

Jacob Klein (Joined: 11 Oct 08, Posts: 1127, Credit: 1,901,927,545, RAC: 0)
Message 32704 - Posted: 4 Sep 2013, 17:54:50 UTC - in response to Message 32698. Last modified: 4 Sep 2013, 18:12:12 UTC

I just got the following task:
http://www.gpugrid.net/result.php?resultid=7248506

Name: 063px55-NOELIA_KLEBEbeta-2-3-RND9896_0
Created: 4 Sep 2013 | 17:38:22 UTC
Sent: 4 Sep 2013 | 17:41:12 UTC
Application version: ACEMD beta version v8.11 (cuda55)

I believe it is a new task on the beta queue, though I'm not sure if this app version has "busted APR" or not. Can we make sure that any new tasks are not using the "busted APR" app versions?

Richard Haselgrove (Joined: 11 Jul 09, Posts: 1639, Credit: 10,159,968,649, RAC: 0)
Message 32705 - Posted: 4 Sep 2013, 17:59:50 UTC - in response to Message 32702.

>> There's nothing you can do at the server end to hasten the correction of DCF - that's entirely managed by the client, at our end. Those who feel confident will adjust their own, others will have to wait, but it'll come right in the end.
>
> Will disconnecting and re-attaching to the project force a reset?

For DCF - detach/reattach will fix it, as will a simple 'Reset project' from BOINC Manager. Both routes will kill any tasks in progress, and will force a re-download of applications, DLLs and new tasks. People would probably wish to wait for a pause between jobs to do this: set 'No new tasks'; complete, upload and report all current work; and only then reset the project.

For APR - 'Reset project' will do nothing, except kill tasks in progress and force the download of new ones. Detach/re-attach might help, but the BOINC server code in general tries to re-assign the previous HostID to a re-attaching host (if it recognises the IP address, Domain Name, and hardware configuration). If you get the same HostID, you get the APR values and other application details back, too. There are ways of forcing a new HostID, but they involve deliberately invoking BOINC's anti-cheating mechanism by falsifying the RPC sequence number.

Jacob Klein (Joined: 11 Oct 08, Posts: 1127, Credit: 1,901,927,545, RAC: 0)
Message 32706 - Posted: 4 Sep 2013, 18:02:40 UTC - in response to Message 32705.

Richard,

For APR: Does that mean that, to effectively solve the APR problem, the solution is to create a new app version for any app version that had tasks sent with bad estimates? Is that correct?

MJH (Joined: 12 Nov 07, Posts: 696, Credit: 27,266,655, RAC: 0)
Message 32708 - Posted: 4 Sep 2013, 18:04:49 UTC - in response to Message 32704.

The last KLEBE beta WUs are being deleted now.

Richard Haselgrove (Joined: 11 Jul 09, Posts: 1639, Credit: 10,159,968,649, RAC: 0)
Message 32710 - Posted: 4 Sep 2013, 18:38:04 UTC - in response to Message 32706. Last modified: 4 Sep 2013, 18:38:53 UTC

> Richard,
>
> For APR: Does that mean that, to effectively solve the APR problem, the solution is to create a new app version for any app version that had tasks sent with bad estimates? Is that correct?

Yes, that's the only way I know from a user perspective.

There is supposed to be an Application Reset tool on the server operations web-admin page, but I don't know if it can be applied selectively: the code is here
http://boinc.berkeley.edu/trac/browser/boinc-v2/html/ops/app_reset.php
but there's no mention of it on the associated Wiki page
http://boinc.berkeley.edu/trac/wiki/HtmlOps

I'd advise consulting another BOINC server administrator before touching it: Oliver Bock (shown as the most recent contributor to that code page) can be contacted via Einstein@home or the BOINC email lists, and is normally very helpful.

Richard Haselgrove (Joined: 11 Jul 09, Posts: 1639, Credit: 10,159,968,649, RAC: 0)
Message 32711 - Posted: 4 Sep 2013, 18:40:49 UTC - in response to Message 32708.

> The last KLEBE beta WUs are being deleted now.

Will that affect tasks in progress? :P

I've just started one, with suitably modified <rsc_fpops_bound> (of course).