Remaining (Estimated) time is unusually high; duration correction factor unusually large
| Author | Message |
|---|---|
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
While I expect you jest, for the rest of those who might be reading this thread: recently some work was aborted by mistake, both tasks in progress and work that had not yet started. However, it was mentioned that a mechanism to avoid this will be used in the future. http://www.gpugrid.net/forum_thread.php?id=3448 http://www.gpugrid.net/forum_thread.php?id=3446&nowrap=true#32213 - just after starting (and finishing) the following beta, 1-MJHARVEY_TEST18-9-10-RND8069_0 4753024 4 Sep 2013 | 20:21:07 UTC 4 Sep 2013 | 20:24:14 UTC Completed and validated 148.76 9.55 150.00 ACEMD beta version v8.00 (cuda55) So, it's only the NOELIA betas that are being aborted! |
MJH Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0
|
Huh, I thought all the TEST18s were done already. MJH |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
Created: 4 Sep 2013 | 20:18:12 UTC. Sent: 4 Sep 2013 | 20:21:07 UTC. Received: 4 Sep 2013 | 20:24:14 UTC. FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help |
|
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0
|
I reset the project, but the estimated time for a new WU afterwards is still wrong. One that took 1.5 minutes was estimated at 1h45m, and a SANTI SR at 7h40m52s. It will finish faster than that: it has already done 3% in 5 minutes. Greetings from TJ |
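The back-of-the-envelope extrapolation in the post above can be sketched as follows. This is only an illustration that assumes progress is linear in time; the function name `remaining_seconds` is hypothetical and not part of BOINC.

```python
def remaining_seconds(fraction_done: float, elapsed_seconds: float) -> float:
    """Extrapolate the remaining run time, assuming progress is linear in time."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    projected_total = elapsed_seconds / fraction_done
    return projected_total - elapsed_seconds


# 3% done after 5 minutes, as reported in the post above.
left = remaining_seconds(0.03, 5 * 60)
print(f"about {left / 3600:.1f} hours remaining")
```

Under that linear assumption the task would need well under three hours in total, far less than the 7h40m52s the client displayed.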
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
I thought the problem had been solved, but now I don't know. Overnight I had completed the following 5 tasks, and now my Duration Correction Factor is the highest it's ever been, 96. Some new tasks are saying they'll take 848 hours to complete :)

Do any of the following apps have "busted APR" from having tasks with inappropriate <rsc_fpops_est> values?
- ACEMD beta version v8.11 (cuda55)
- ACEMD beta version v8.13 (cuda42)
- ACEMD beta version v8.13 (cuda55)

063px55-NOELIA_KLEBEbeta-2-3-RND9896_0 4752598 4 Sep 2013 | 17:41:12 UTC 6 Sep 2013 | 4:58:59 UTC Completed and validated 106,238.51 11,702.89 119,000.00 ACEMD beta version v8.11 (cuda55)
124-MJHARVEY_CRASH1-0-25-RND3516_1 4754388 5 Sep 2013 | 16:39:18 UTC 6 Sep 2013 | 4:58:21 UTC Completed and validated 17,615.38 7,935.19 18,750.00 ACEMD beta version v8.13 (cuda55)
139-MJHARVEY_CRASH2-1-25-RND6442_0 4756103 6 Sep 2013 | 4:49:11 UTC 6 Sep 2013 | 7:51:44 UTC Completed and validated 10,781.12 10,669.78 18,750.00 ACEMD beta version v8.13 (cuda42)
196-MJHARVEY_CRASH2-1-25-RND1142_1 4756328 6 Sep 2013 | 7:51:00 UTC 6 Sep 2013 | 10:49:48 UTC Completed and validated 10,559.78 10,479.89 18,750.00 ACEMD beta version v8.13 (cuda55)
149-MJHARVEY_CRASH2-1-25-RND2885_0 4756187 6 Sep 2013 | 4:53:45 UTC 6 Sep 2013 | 11:00:47 UTC Completed and validated 21,826.75 5,600.89 18,750.00 ACEMD beta version v8.13 (cuda55) |
MJH Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0
|
All of the CRASH tasks are exact copies of SANTI-MAR4222s. MJH |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Can you answer my question about the apps? |
MJH Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0
|
rsc_fpops_est is the same as when they went out on acemdshort. I neither understand what APR is nor know how to measure it. |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Oh. I only know a little, but I'll share what I know. Richard knows a ton about it.

From what I gather, each app-version that a host runs gets its own "set of statistics". So, if you click my computers, view the details of RacerX, and then click "Application details", you'll see all the app-versions I've run, including statistics on them.

For the app-versions that were sent out with bad <rsc_fpops_est> values, you'll see that my "Average processing rate" (APR) was very jacked up (normal values for me are about 350 for the short-run queue and about 100 for the long-run queue; but if you look at the beta apps on my list there, you'll see values like 4000, 7000, 16000). If the APR is very high, then the result is that the server thinks the card is like 40-160 times faster than it actually is! So, in a sense, we can use this APR to see if apps had been sent with inappropriate <rsc_fpops_est> values.

At the client, when a task with an incorrect <rsc_fpops_est> is processed, the clear indication is that the Duration Correction Factor gets jacked up well. And then the client time estimates get jacked.

I was hoping that we had fixed all this "bad <rsc_fpops_est> values" business (which I've termed "busted-APR apps"), but last night it seems one of the tasks messed it up. I noticed it because DCF was wrong, and running tasks said 400+ hours.

The main question I have: On the beta queue, which app version is the earliest where you believe the <rsc_fpops_est> issue was fully resolved? I'm pretty sure the answer is not "8.11"; I think the 8.11 task may have been the culprit out of the 5 tasks from a couple of posts up. |
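To see why a bad <rsc_fpops_est> inflates APR, here is a deliberately simplified model. It assumes the apparent processing rate for one task is just the workunit's estimated operation count divided by the elapsed time; the real BOINC server averages and normalises in ways this thread does not describe. The function name, the 100 GFLOPS card, and the 10-hour and 1-minute run times are all hypothetical values chosen for illustration.

```python
def apparent_rate_gflops(rsc_fpops_est: float, elapsed_seconds: float) -> float:
    """Apparent speed in GFLOPS if the estimate is taken at face value (simplified)."""
    return rsc_fpops_est / elapsed_seconds / 1e9


# Hypothetical workunit sized for 10 hours on a card that sustains 100 GFLOPS.
FPOPS_EST = 100e9 * 10 * 3600

honest = apparent_rate_gflops(FPOPS_EST, 10 * 3600)  # task really runs 10 hours
doctored = apparent_rate_gflops(FPOPS_EST, 60)       # task really runs 1 minute

print(f"honest task:   {honest:.0f} GFLOPS")
print(f"doctored task: {doctored:.0f} GFLOPS")
print(f"inflation:     {doctored / honest:.0f}x")
```

In this toy case a single short task makes the card look hundreds of times faster than it is, which is the same direction as the 40-160x inflation described in the post.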
|
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0
|
"All of the CRASH tasks are exact copies of SANTI-MAR4222s" That sounds promising. My 660 had big problems with Santi's SR and LR, but so far all the CRASH betas I got (4) finished with good results. Greetings from TJ |
MJH Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0
|
Thanks Jacob. In fact, our value for rsc_fpops_est has never changed - it is set the same for every WU we have ever put out on every queue. Evidently the mechanisms that depend on it being accurate had adjusted to accept our typical WU lengths as correct, but got very confused when I put out a whole batch of very short WUs with the same fpops estimate. MJH |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 261
|
"Oh. I only know a little, but I'll share what I know. Richard knows a ton about it."

That's pretty much it. Some comments:

Host names, as in your "view the details of RacerX", are only visible to the machine owner when logged in to their account. All other users - including the project staff - can only see the 'HostID' number, so it's better to quote (or even link) that.

<rsc_fpops_est> is a property of the workunit, and hence of all tasks (including resends) generated from it. Workunits exist as entities in their own right - there's no such thing as a 'v8.11 workunit' - although the copy that got sent to your machine might well have appeared as a 'v8.11 task'. But another user might have got it as v8.06 or v8.13 - depends which was active at the time the task was allocated to the host in question.

If the test tasks (the current 'CRASH' series) are copies of SANTI-MAR4222, they will be long enough - as I think I've already said somewhere - not to cause any timing problems. A bit of distortion, sure - DCF should rise to maybe 4, but still in single figures, which will clear by itself.

The problem with hugely-distorted runtime estimates arose from the doctored 'TEST' workunits, some of which only ran for one minute while still carrying a <rsc_fpops_est> more appropriate for 10 hours. So long as any of those remain in the system, we could get recurrences - whichever version of the Beta app is deployed at the time a task for the WU is issued to a volunteer.

On my Beta host, it looks as if estimates for v8.11 are thoroughly borked: I suspect they will be for all active participants. If anyone still has any tasks issued with that version, they may have problems running them - and if they get aborted, and re-generated by BOINC (i.e., if there are any problems with the WU or task cancellation on the server), then the later Beta versions may get 'poisoned' too. But for the time being, v8.12 and v8.13 look clean for me.

@ Matt - don't feel bad about not understanding APR. *Nobody* understands APR and everything that lies behind it. Except possibly David Anderson (who wrote it), and we're not even sure about him. Grown men (and women) have wept when they tried to walk the code...

Best to wait and watch, I think, and see if the issues clear themselves up as the Beta queue tasks pass through the system and into oblivion. |
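The client-side effect can be sketched in the same spirit. The model below assumes the client displays `rsc_fpops_est / rate * DCF` as its estimate, and that DCF jumps up at once when a task overruns but drifts down only gradually. The 10% decay step, the 100 GFLOPS card, the 60x poisoned rate, and every name here are assumptions for illustration, not taken from the BOINC source.

```python
HOUR = 3600.0
TRUE_RATE = 100e9                  # hypothetical card: 100 GFLOPS sustained
FPOPS_EST = TRUE_RATE * 10 * HOUR  # workunit that really takes 10 hours


def estimate_seconds(rsc_fpops_est: float, rate_flops: float, dcf: float) -> float:
    """Displayed run-time estimate under the simplified model."""
    return rsc_fpops_est / rate_flops * dcf


def update_dcf(dcf: float, actual_seconds: float, uncorrected_seconds: float) -> float:
    """Assumed rule: jump up immediately on an overrun, decay slowly otherwise."""
    ratio = actual_seconds / uncorrected_seconds
    if ratio > dcf:
        return ratio
    return 0.9 * dcf + 0.1 * ratio


# 1) A poisoned app version: the server believes the card is 60x faster than it is,
#    so a 10-hour task is estimated at 10 minutes before correction.
dcf = 1.0
uncorrected = estimate_seconds(FPOPS_EST, 60 * TRUE_RATE, 1.0)
dcf = update_dcf(dcf, 10 * HOUR, uncorrected)
print(f"DCF after the overrun: {dcf:.0f}")

# 2) The inflated DCF now multiplies estimates for healthy app versions too.
shown = estimate_seconds(FPOPS_EST, TRUE_RATE, dcf)
print(f"healthy 10-hour task now shows {shown / HOUR:.0f} hours")

# 3) Ten well-estimated tasks later, DCF has only partly recovered.
for _ in range(10):
    dcf = update_dcf(dcf, 10 * HOUR, estimate_seconds(FPOPS_EST, TRUE_RATE, 1.0))
print(f"DCF after 10 healthy tasks: {dcf:.1f}")
```

The asymmetry is the point of the sketch: one badly estimated task is enough to push every displayed estimate into the hundreds of hours, while recovery takes many correctly estimated tasks, which fits the advice to wait and watch.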
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Thanks Jacob, Right, I get that. But, when that happens, it "ruins" the APR for the given app-version. So, I guess what I was getting at is: Which beta app-version is the first one that couldn't possibly have been ruined? It's okay if you don't have an answer. Edit: It sounds as if Richard is saying that the task could get reissued into an app-version and poison it, so I guess my question is a bit invalid. Anyway, after processing that 8.11 and seeing DCF/estimates jacked, I (again) closed BOINC, edited my client_state.xml file to reset the DCF, and restarted BOINC. Hopefully I don't get any more tasks that ruin app-versions. Thanks, Jacob |
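For anyone wanting to script the manual reset described above, here is a hedged sketch. It assumes the state file parses as well-formed XML and that each `<project>` block contains a `<master_url>` and a `<duration_correction_factor>` element; check your own file before relying on that layout. The function name `reset_dcf` is hypothetical, and the sample document is a made-up fragment, not a real state file. As in the post, BOINC should be fully closed before the file is edited, and keeping a backup copy is prudent.

```python
import xml.etree.ElementTree as ET


def reset_dcf(state_xml: str, project_url_part: str, new_value: float = 1.0) -> str:
    """Return the XML text with DCF reset for projects whose URL matches."""
    root = ET.fromstring(state_xml)
    for project in root.iter("project"):
        url = project.findtext("master_url", default="")
        dcf = project.find("duration_correction_factor")
        if project_url_part in url and dcf is not None:
            dcf.text = f"{new_value:.6f}"
    return ET.tostring(root, encoding="unicode")


# Made-up fragment standing in for a real client_state.xml.
sample = """<client_state>
<project>
    <master_url>http://www.gpugrid.net/</master_url>
    <duration_correction_factor>96.000000</duration_correction_factor>
</project>
</client_state>"""

fixed = reset_dcf(sample, "gpugrid.net")
print(fixed)
```

The function works on text rather than on a file path so it can be tried on a copy first; reading the real file, writing the result back, and restarting BOINC are left to the reader.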
MJH Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0
|
If I understand what you are saying, it must be 8.12, since that was the first to do only CRASH WUs. MJH |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Thank you Matt and Richard. |
Damaraland Joined: 7 Nov 09 Posts: 152 Credit: 16,181,924 RAC: 0
|
I'm not sure if you still want this info; maybe you could be more precise about what you need.

CUDA: NVIDIA GPU 0: GeForce GTX 260 (driver version 331.65, CUDA version 6.0, compute capability 1.3, 896MB, 818MB available, 912 GFLOPS peak)
ACEMD beta version v8.15 (cuda55), 77-KLAUDE_6429-0-2-RND1641_1: expected to finish in 22h; 83% processed, looking right so far.

But on the other GPU I have a task that is throwing the time estimate off quite badly:

CUDA: NVIDIA GPU 1: GeForce GTX 560 Ti (driver version 331.65, CUDA version 6.0, compute capability 2.1, 1024MB, 847MB available, 1352 GFLOPS peak)
Long runs (8-12 hours on fastest card) v8.14 (cuda55): progress 43%, 7h so far, remaining 58h!!

I don't really care about estimated times; I'm posting this just in case the info is of some help. |
©2025 Universitat Pompeu Fabra