Remaining (Estimated) time is unusually high; duration correction factor unusually large

Message boards : Number crunching : Remaining (Estimated) time is unusually high; duration correction factor unusually large

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 32712 - Posted: 4 Sep 2013, 20:40:12 UTC - in response to Message 32711.  
Last modified: 4 Sep 2013, 20:41:39 UTC

While I expect you jest, for the rest of those who might be reading this thread: some work was recently aborted by mistake, both tasks in progress and work that had not yet started. However, it was mentioned that a mechanism to avoid this will be used in future.

http://www.gpugrid.net/forum_thread.php?id=3448

http://www.gpugrid.net/forum_thread.php?id=3446&nowrap=true#32213

- just after starting (and finishing) the following beta,
Task	Work unit	Sent	Time reported	Status	Run time (sec)	CPU time (sec)	Credit	Application
1-MJHARVEY_TEST18-9-10-RND8069_0	4753024	4 Sep 2013 | 20:21:07 UTC	4 Sep 2013 | 20:24:14 UTC	Completed and validated	148.76	9.55	150.00	ACEMD beta version v8.00 (cuda55)

So, it's only the NOELIA betas that are being aborted!

MJH
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32713 - Posted: 4 Sep 2013, 20:53:00 UTC - in response to Message 32712.  

Huh, I thought all the TEST18s were done already.

MJH

skgiven
Message 32714 - Posted: 4 Sep 2013, 20:58:52 UTC - in response to Message 32713.  

Created 4 Sep 2013 | 20:18:12 UTC
Sent 4 Sep 2013 | 20:21:07 UTC
Received 4 Sep 2013 | 20:24:14 UTC

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 32715 - Posted: 4 Sep 2013, 21:04:29 UTC

I reset the project, but the estimated time for new WUs afterwards is still wrong. One that takes 1.5 minutes was estimated at 1h45m, and a SANTI SR at 7h40m52s. The SANTI will finish much sooner than that: it is already 3% done after 5 minutes.
Greetings from TJ

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 32779 - Posted: 6 Sep 2013, 11:58:15 UTC

I thought the problem had been solved, but now I don't know.
Overnight I had completed the following 5 tasks, and now my Duration Correction Factor is the highest it's ever been, 96. Some new tasks are saying they'll take 848 hours to complete :)

Do any of the following apps have "busted APR" from having tasks with inappropriate <rsc_fpops_est> values?
- ACEMD beta version v8.11 (cuda55)
- ACEMD beta version v8.13 (cuda42)
- ACEMD beta version v8.13 (cuda55)

Task	Work unit	Sent	Time reported	Status	Run time (sec)	CPU time (sec)	Credit	Application
063px55-NOELIA_KLEBEbeta-2-3-RND9896_0	4752598	4 Sep 2013 | 17:41:12 UTC	6 Sep 2013 | 4:58:59 UTC	Completed and validated	106,238.51	11,702.89	119,000.00	ACEMD beta version v8.11 (cuda55)
124-MJHARVEY_CRASH1-0-25-RND3516_1	4754388	5 Sep 2013 | 16:39:18 UTC	6 Sep 2013 | 4:58:21 UTC	Completed and validated	17,615.38	7,935.19	18,750.00	ACEMD beta version v8.13 (cuda55)
139-MJHARVEY_CRASH2-1-25-RND6442_0	4756103	6 Sep 2013 | 4:49:11 UTC	6 Sep 2013 | 7:51:44 UTC	Completed and validated	10,781.12	10,669.78	18,750.00	ACEMD beta version v8.13 (cuda42)
196-MJHARVEY_CRASH2-1-25-RND1142_1	4756328	6 Sep 2013 | 7:51:00 UTC	6 Sep 2013 | 10:49:48 UTC	Completed and validated	10,559.78	10,479.89	18,750.00	ACEMD beta version v8.13 (cuda55)
149-MJHARVEY_CRASH2-1-25-RND2885_0	4756187	6 Sep 2013 | 4:53:45 UTC	6 Sep 2013 | 11:00:47 UTC	Completed and validated	21,826.75	5,600.89	18,750.00	ACEMD beta version v8.13 (cuda55)
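
The 848-hour figure is roughly what one would expect if the client simply multiplies its raw estimate by the DCF. A minimal sketch of that arithmetic, assuming that simple multiplicative model (the function name is made up for illustration, and the real BOINC client may apply further adjustments):

```python
def raw_estimate_hours(displayed_hours: float, dcf: float) -> float:
    """Undo the DCF scaling, assuming displayed = raw * DCF."""
    return displayed_hours / dcf

# Figures from the post above: 848 hours displayed with a DCF of 96.
baseline = raw_estimate_hours(848.0, 96.0)
print(f"{baseline:.1f} hours before DCF scaling")
```

Under that assumption the underlying estimate is a little under 9 hours, a plausible length for a long task, so the inflation would come almost entirely from the DCF.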

MJH
Message 32782 - Posted: 6 Sep 2013, 12:14:16 UTC - in response to Message 32779.  

All of the CRASH tasks are exact copies of SANTI-MAR4222s

MJH

Jacob Klein
Message 32783 - Posted: 6 Sep 2013, 12:15:42 UTC - in response to Message 32782.  

Can you answer my question about the apps?

MJH
Message 32784 - Posted: 6 Sep 2013, 12:23:05 UTC - in response to Message 32783.  

rsc_fpops_est is the same as when they went out on acemdshort.

I neither understand what APR is nor know how to measure it.

Jacob Klein
Message 32785 - Posted: 6 Sep 2013, 12:33:36 UTC - in response to Message 32784.  
Last modified: 6 Sep 2013, 12:38:42 UTC

Oh. I only know a little, but I'll share what I know. Richard knows a ton about it.

From what I gather... each app-version that a host runs, gets its own "set of statistics". So... If you click my computers, and view the details of RacerX, and then click "Application details", you'll see all the app-versions I've run, including statistics on them.

For the app-versions that were sent out with bad <rsc_fpops_est> values, you'll see that my "Average processing rate" (APR) was badly inflated (normal values for me are about 350 for the short-run queue and about 100 for the long-run queue; but if you look at the beta apps on my list there, you'll see values like 4000, 7000, 16000).

If the APR is very high, the server thinks the card is 40-160 times faster than it actually is! So, in a sense, we can use the APR to see whether apps were sent with inappropriate <rsc_fpops_est> values. At the client, when a task with an incorrect <rsc_fpops_est> is processed, the clear sign is that the Duration Correction Factor (DCF) jumps sharply, and then the client's time estimates are thrown off too.
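
To see how a fixed <rsc_fpops_est> can inflate APR, here is a rough sketch. It assumes APR is approximately the workunit's estimated operation count divided by the elapsed time, which is a simplification of whatever averaging the server really does. The fpops value is hypothetical; the two runtimes are taken from tasks listed earlier in this thread.

```python
# Hypothetical operation-count estimate shared by every workunit.
RSC_FPOPS_EST = 5e15

def apparent_gflops(fpops_est: float, elapsed_seconds: float) -> float:
    """Apparent speed in GFLOPS if the estimate is taken at face value."""
    return fpops_est / elapsed_seconds / 1e9

normal = apparent_gflops(RSC_FPOPS_EST, 10_559.78)  # a CRASH2 task
short = apparent_gflops(RSC_FPOPS_EST, 148.76)      # a TEST18 task

inflation = short / normal
print(f"normal task: {normal:.0f} GFLOPS apparent")
print(f"short task:  {short:.0f} GFLOPS apparent")
print(f"the short task makes the card look {inflation:.0f}x faster")
```

The ratio depends only on the two runtimes, not on the made-up fpops value. For these two tasks it is about 71, which falls inside the 40-160 range mentioned above.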

I was hoping that we had fixed all this "bad <rsc_fpops_est> values" business (which I've termed as "busted-APR apps"), but last night it seems one of the tasks messed it up. I noticed it because DCF was wrong, and running tasks said 400+ hours.

The main question I have:
On the beta queue, which app version is the earliest where you believe the <rsc_fpops_est> issue was fully resolved?

I'm pretty sure the answer is not "8.11"; I think the 8.11 task may have been the culprit out of the 5 tasks from a couple posts up.

TJ
Message 32789 - Posted: 6 Sep 2013, 12:52:40 UTC - in response to Message 32782.  

MJH wrote:
"All of the CRASH tasks are exact copies of SANTI-MAR4222s"

That sounds promising. My 660 had big problems with Santi's SR and LR tasks, but so far all 4 CRASH betas I got have finished with good results.
Greetings from TJ

MJH
Message 32792 - Posted: 6 Sep 2013, 13:06:34 UTC - in response to Message 32785.  

Thanks Jacob,

In fact our value for rsc_fpops_est has never changed - it is set the same for every WU we have ever put out on every queue.

Evidently the mechanisms that depend on it being accurate had adjusted to accept our typical WU lengths as correct, but got very confused when I put out a whole batch of very short WUs with the same fpops estimate.

MJH

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 261
Message 32794 - Posted: 6 Sep 2013, 13:19:27 UTC - in response to Message 32785.  

That's pretty much it. Some comments:

Host names, as in your "view the details of RacerX", are only visible to the machine owner when logged in to their account. All other users - including the project staff - can only see the 'HostID' number, so it's better to quote (or even link) that.

<rsc_fpops_est> is a property of the workunit, and hence of all tasks (including resends) generated from it. Workunits exist as entities in their own right - there's no such thing as a 'v8.11 workunit' - although the copy that got sent to your machine might well have appeared as a 'v8.11 task'. But another user might have got it as v8.06 or v8.13 - depends which was active at the time the task was allocated to the host in question.

If the test tasks (the current 'CRASH' series) are copies of SANTI-MAR4222, they will be long enough - as I think I've already said somewhere - not to cause any timing problems. A bit of distortion, sure - DCF should rise to maybe 4, but still in single figures, which will clear by itself.

The problem with hugely-distorted runtime estimates arose from the doctored 'TEST' workunits, some of which only ran for one minute while still carrying a <rsc_fpops_est> more appropriate for 10 hours. So long as any of those remain in the system, we could get recurrences - whichever version of the Beta app is deployed at the time a task for the WU is issued to a volunteer.
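
The size of that mismatch is easy to work out. A sketch, assuming the client's raw estimate is the operation count divided by the speed the server believes the host has (the fpops figure and helper name are hypothetical; only the one-minute and ten-hour durations come from the text above):

```python
def estimated_seconds(fpops_est: float, believed_flops: float) -> float:
    """Raw runtime estimate: claimed work divided by believed speed."""
    return fpops_est / believed_flops

TEN_HOURS = 10 * 3600.0
ONE_MINUTE = 60.0
FPOPS_EST = 5e15  # hypothetical, identical for every workunit

true_flops = FPOPS_EST / TEN_HOURS        # speed implied by a normal task
inflated_flops = FPOPS_EST / ONE_MINUTE   # speed implied by a 1-minute TEST task

# A genuine 10-hour task estimated with the inflated speed:
bad_estimate = estimated_seconds(FPOPS_EST, inflated_flops)
correction_needed = TEN_HOURS / bad_estimate
print(f"estimate {bad_estimate:.0f} s, actual {TEN_HOURS:.0f} s, "
      f"off by a factor of {correction_needed:.0f}")
```

A factor of several hundred is far beyond the single-figure DCF drift described for the CRASH tasks, which is why the short TEST workunits are the ones that matter here.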

On my Beta host, it looks as if estimates for v8.11 are thoroughly borked: I suspect they will be for all active participants. If anyone still has any tasks issued with that version, they may have problems running them - and if they get aborted, and re-generated by BOINC (i.e., if there are any problems with the WU or task cancellation on the server), then the later Beta versions may get 'poisoned' too. But for the time being, v8.12 and v8.13 look clean for me.

@ Matt - don't feel bad about not understanding APR. *Nobody* understands APR and everything that lies behind it. Except possibly David Anderson (who wrote it), and we're not even sure about him. Grown men (and women) have wept when they tried to walk the code...

Best to wait and watch, I think, and see if the issues clear themselves up as the Beta queue tasks pass through the system and into oblivion.

Jacob Klein
Message 32795 - Posted: 6 Sep 2013, 13:19:30 UTC - in response to Message 32792.  
Last modified: 6 Sep 2013, 13:23:15 UTC

Right, I get that.
But, when that happens, it "ruins" the APR for the given app-version.

So, I guess what I was getting at is: Which beta app-version is the first one that couldn't possibly have been ruined? It's okay if you don't have an answer. Edit: It sounds as if Richard is saying that the task could get reissued into an app-version and poison it, so I guess my question is a bit invalid.

Anyway, after processing that 8.11 and seeing DCF/estimates jacked, I (again) closed BOINC, edited my client_state.xml file to reset the DCF, and restarted BOINC. Hopefully I don't get any more tasks that ruin app-versions.
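
For anyone wanting to do the same edit with a script, here is a minimal sketch. It assumes client_state.xml parses as well-formed XML and that each <project> block carries <master_url> and <duration_correction_factor> elements; check your own file before relying on that. The function name is invented for illustration. BOINC must be fully shut down first, as described above, and keeping a backup copy of the file is wise.

```python
import xml.etree.ElementTree as ET

def reset_dcf(state_xml: str, project_url: str, new_dcf: float = 1.0) -> str:
    """Return state_xml with the DCF of the matching project set to new_dcf."""
    root = ET.fromstring(state_xml)
    for project in root.iter("project"):
        url = (project.findtext("master_url") or "").strip()
        # Compare URLs without worrying about a trailing slash.
        if url.rstrip("/") != project_url.rstrip("/"):
            continue
        dcf = project.find("duration_correction_factor")
        if dcf is not None:
            dcf.text = f"{new_dcf:.6f}"
    return ET.tostring(root, encoding="unicode")

# Tiny stand-in for a real client_state.xml (the structure is an assumption).
sample = """<client_state>
<project>
    <master_url>http://www.gpugrid.net/</master_url>
    <duration_correction_factor>96.000000</duration_correction_factor>
</project>
</client_state>"""

updated = reset_dcf(sample, "http://www.gpugrid.net/")
print(updated)
```

To apply it for real, read the file into a string with BOINC stopped, pass it through reset_dcf, and write the result back. Note that re-serialising with ElementTree may change whitespace or drop comments, so compare the output against the backup before restarting.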

Thanks,
Jacob

MJH
Message 32796 - Posted: 6 Sep 2013, 13:24:24 UTC - in response to Message 32795.  


Jacob Klein wrote:
"So, I guess what I was getting at is: Which beta app-version is the first one that couldn't possibly have been ruined? It's okay if you don't have an answer."


If I understand what you are saying, it must be 8.12, since that was the first to do only CRASH WUs

MJH

Jacob Klein
Message 32797 - Posted: 6 Sep 2013, 13:25:38 UTC - in response to Message 32796.  

Thank you Matt and Richard.

Damaraland
Joined: 7 Nov 09
Posts: 152
Credit: 16,181,924
RAC: 0
Message 33944 - Posted: 20 Nov 2013, 19:48:41 UTC

Not very sure if you still want this info. Maybe you could be more precise:

CUDA: NVIDIA GPU 0: GeForce GTX 260 (driver version 331.65, CUDA version 6.0, compute capability 1.3, 896MB, 818MB available, 912 GFLOPS peak)

ACEMD beta version v8.15 (cuda55)
77-KLAUDE_6429-0-2-RND1641_1 Expected to finish in 22h. 83% processed right so far.


But on the other GPU I have a task whose time estimate is badly off:
CUDA: NVIDIA GPU 1: GeForce GTX 560 Ti (driver version 331.65, CUDA version 6.0, compute capability 2.1, 1024MB, 847MB available, 1352 GFLOPS peak)

Long runs (8-12 hours on fastest card) v8.14 (cuda55)
Progress: 43% after 7h so far; estimated remaining: 58h!!

I don't really care about estimated times. Just in case this info is of some help.
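
For comparison with the client's figures, a naive linear extrapolation from progress so far can be sketched as below. It assumes progress advances at a constant rate, which need not hold for every task; the function name is illustrative.

```python
def remaining_from_progress(elapsed: float, fraction_done: float) -> float:
    """Remaining time, in the same unit as elapsed, assuming a constant rate."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed * (1.0 - fraction_done) / fraction_done

# 43% done after 7 hours (the long-run task above).
long_run_hours = remaining_from_progress(7.0, 0.43)
# 3% done after 5 minutes (the SANTI SR task reported earlier in the thread).
santi_minutes = remaining_from_progress(5.0, 0.03)

print(f"long run: about {long_run_hours:.1f} h remaining")
print(f"SANTI SR: about {santi_minutes:.0f} min remaining")
```

On that assumption the long run has roughly 9 hours left rather than 58, and the SANTI SR task would need well under three hours in total rather than 7h40m.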

©2025 Universitat Pompeu Fabra