Strange really big wrong ETAs on workunits
dskagcommunity · Joined: 28 Apr 11 · Posts: 463 · Credit: 958,266,958 · RAC: 41
Recently I've seen my hosts show an ETA on the WUs of, for example, 46 days, so the host does not download more workunits until the current one finishes at 100% (while the systems actually finish the WUs, depending on their complexity, in 1-12 hours). So it always triggers the backup project during the upload/download phase after the 100% point. What can be done to correct this wrong runtime estimation for the GPUs in GPUGrid only?

DSKAG Austria Research Team: http://www.research.dskag.at
Joined: 29 Aug 24 · Posts: 71 · Credit: 3,321,790,989 · RAC: 1,408
They set it the way they did to fix something; there was an earlier thread about this. Eventually, after say 100 tasks, it normalizes. Two things I set are <skip_cpu_benchmarks>1</skip_cpu_benchmarks> (in cc_config.xml) and <fraction_done_exact/> (per app, in app_config.xml). Skipping the benchmark lets it normalize; I was told that running the benchmark resets it back to tens of days. Fraction-done-exact gets the current task's remaining time accurate ASAP, so that you can get a second task. The solution is weak because it only works with one app name. Maybe there's a way to provide criteria for all the apps you're running, but I have not been able to figure that out. Others have mentioned shutting BOINC down and editing client_state's <duration_correction_factor>0.721557</duration_correction_factor>, but I have not tried that. A sketch of the two files follows below.
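For anyone wanting to try this, here is a minimal sketch of the two files, as I understand the BOINC client configuration options: <skip_cpu_benchmarks> is a cc_config.xml option, while <fraction_done_exact/> goes in a per-project app_config.xml. The app name acemd3 is an assumption for illustration; copy the real name from the <app> entries in client_state.xml.

```xml
<!-- cc_config.xml, in the BOINC data directory:
     suppress the periodic automatic CPU benchmarks -->
<cc_config>
  <options>
    <skip_cpu_benchmarks>1</skip_cpu_benchmarks>
  </options>
</cc_config>
```

```xml
<!-- app_config.xml, in the GPUGrid project directory
     (typically projects/www.gpugrid.net/): report exact progress
     so remaining-time estimates converge quickly -->
<app_config>
  <app>
    <!-- app name is an assumption; check client_state.xml for yours -->
    <name>acemd3</name>
    <fraction_done_exact/>
  </app>
</app_config>
```

As far as I know there is no wildcard here: app_config.xml takes one <app> element per application, so covering every app you run means listing each name in its own <app> block.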
dskagcommunity · Joined: 28 Apr 11 · Posts: 463 · Credit: 958,266,958 · RAC: 41
Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891
Definitely use fraction_done_exact in an app_config. But the best and fastest solution is to edit client_state.xml with a text editor, change the DCF in the GPUGrid section of the file to 0.01, and save the file. Depending on the mix of work being done, it will eventually start climbing again and you will have to re-edit the file, but I find I only have to do that every few months at most.
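For reference, a sketch of the relevant fragment of client_state.xml (the URL and values are illustrative; only the <duration_correction_factor> line is the one to change). Stop the BOINC client before editing, since it rewrites this file when it exits:

```xml
<project>
    <master_url>https://www.gpugrid.net/</master_url>
    <!-- ... other project elements ... -->
    <duration_correction_factor>0.010000</duration_correction_factor>
</project>
```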
Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
> Definitely use fraction_done_exact in an app_config.

Some time ago I applied the above-mentioned change in client_state.xml. However, after a couple of days the value fell back to what it was before :-( So I gave up.
Joined: 29 Aug 24 · Posts: 71 · Credit: 3,321,790,989 · RAC: 1,408
It reset with benchmarking turned off? KeithM sent me the link for app_config, which states that you can't even manually benchmark when it's off.
Joined: 21 Feb 20 · Posts: 1116 · Credit: 40,839,470,595 · RAC: 6,423
Benchmarking does not change the DCF value. 0.01 is the minimum value acceptable for DCF in BOINC; if Erich tried setting it lower than that, that's why it didn't stick.
Joined: 1 Jan 15 · Posts: 1166 · Credit: 12,260,898,501 · RAC: 1
No, I did NOT set it lower.
Joined: 29 Aug 24 · Posts: 71 · Credit: 3,321,790,989 · RAC: 1,408
I just ran the CPU benchmark even though I have it turned off, and it ran anyway. My ETA for a future task went from 36d 8h to 89d 19h after the benchmark. Prior to the benchmark my DCF was 0.686320; now it's 1.696240. I wasn't expecting it to run, but thankfully it didn't trash my 3.5 h running task. CPU benchmarking does appear to impact ETAs for GPU tasks, and turning benchmarking off doesn't prevent you from running it manually.
Joined: 29 Aug 24 · Posts: 71 · Credit: 3,321,790,989 · RAC: 1,408
Quick follow-up: the above experiment was on a Win11 core laptop. That one task did finish, thankfully. The new DCF after completion is 1.679309, or 99% of the previous value. Something else is going on: even though my DCF jumped up ~247% and then down 1%, my ETA for the next task is 36d 3h, or ~99.5% of the ETA before benchmarking. DCF does not completely control the GPU task duration estimate.
Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891
> Quick follow-up: the above experiment was on a Win11 core laptop. That one task did finish, thankfully.

Why would you expect otherwise? Only the task itself, the application, and the host's compute performance and loading determine how long a task takes to finish computation. All the DCF does is affect the way BOINC estimates each task's computation time.

BUT, the DCF applies across the ENTIRE project, meaning ONE DCF value applies to ALL task sub-types. So every time you change task sub-types, the client/scheduler combination has to recompute the DCF value. Run a long-running task type, and the next time you run a short-running task type the estimated times are skewed wildly. Follow a run of short-running tasks with the next long-running type, and the DCF is again wildly skewed, in the other direction.

If there were a DCF value applied to EACH sub-type of task, the estimated times would stabilize and be pretty much spot on. But the BOINC server code on GPUGrid does not allow that. So we just have to accept that on projects that use the DCF mechanism in their server code and run many different sub-types of tasks, you will get DCF values that ping-pong back and forth, and estimated completion times will never be correct.

The most a user can do is set the DCF value to the lowest the BOINC code limits allow, which is 0.01. Or get the project admins to run a different server code base: the current BOINC code removed the DCF mechanism and changed the DCF to a static value of 1.0, so projects that run that code do not see gyrations in the estimated times. But it is up to each project which BOINC server code they decide to run and how much they have modified it to suit their needs.

Benchmarking itself does not change the DCF value. It's the variation in task running times among all the varied sub-types that changes the DCF value.
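A rough sketch of the mechanism described above, based on my reading of the BOINC client code (an approximation, not project-confirmed): the client computes roughly

    estimated runtime ≈ (rsc_fpops_est / projected app FLOPS) × DCF

with DCF clamped to the range [0.01, 100]. Because a single DCF multiplies every GPUGrid estimate, a batch whose tasks run, say, 10× longer than their rsc_fpops_est implies drags DCF up toward 10, and the next short-running batch then shows estimates inflated by that same factor. That is the ping-pong behavior reported in this thread.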
Joined: 29 Aug 24 · Posts: 71 · Credit: 3,321,790,989 · RAC: 1,408
That's very helpful. Makes sense. I know that both you and Ian say that DCF isn't impacted by a CPU benchmark but I saw it change mid-task before and after a benchmark. I'll chalk it up as a mystery. I'm fine with everything as it is. |
Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891
There probably is some interaction between running the benchmarks and computing the DCF, but a cursory examination of the BOINC codebase didn't find an intersection between the two. I'd have to really dig into the code and try to find some commonality. Or it could just be that running benchmarks simply changes the FLOPS estimate for the host.
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
My observations on the estimated runtimes are the following:
- Even the minimum DCF value (0.01) yields estimates that are too high for the present ACEMD3 tasks on an RTX 2080 Ti.
- The estimated runtimes are off in the beginning (for ACEMD3 tasks), then they skyrocket to the 30-day range from one task to the next. In reality these tasks last 30 minutes or less. This results in "Tasks won't finish in time" messages.
- The ATMML tasks immediately break the DCF value and push the estimated runtimes for the ACEMD3 tasks to unacceptable levels.

I think the rsc_fpops_est value of different tasks is set to the same value, regardless of their actual amount of floating point operations. So it's no wonder that the estimates have lost their function, and task queueing is not working properly (= not working at all) with GPUGrid tasks in the queue.
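To put illustrative numbers on that hypothesis (these are made up for the arithmetic, not measured): if an ACEMD3 task really takes 0.5 h and an ATMML task 12 h, but both carry the same rsc_fpops_est, then a DCF tuned to the ACEMD3 batch underestimates the ATMML task 24-fold. When the ATMML task completes, the client pulls DCF up by roughly that factor, and the next 0.5 h ACEMD3 task is suddenly estimated at about 12 h. No single correction factor can satisfy both batches, which matches the "immediately breaks the DCF" behavior described above.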
Joined: 13 Dec 17 · Posts: 1419 · Credit: 9,119,446,190 · RAC: 891
Strange that I have never had a single instance of the "tasks won't finish in time" message on my two hosts with a 2080 Ti, running every type of task that GPUGrid offers, in all the time I've run this project. While you have. Too large a cache in your instance, perhaps?
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
The maximum queue length you can set in BOINC Manager is 10+10 days. When a task's runtime estimate says it will run for 30+ days, it cannot be the queue (cache) size causing the "Tasks won't finish in time" message. BTW, my queue length is set to 0.01 or 0.1 days; I set larger values only when I'm debugging. I crunch mainly FAH tasks recently, and some FAH work servers are sometimes very slow from my home (it takes 30-50 minutes to download a new workunit), so my BOINC projects run only during these "outages", hence the short queue. At one such outage I gave GPUGrid a try, but it's still a mess (= it's unreliable at supplying work for my computers, which heat our apartment, so I need to use a different, more reliable source).
Retvari Zoltan · Joined: 20 Jan 09 · Posts: 2380 · Credit: 16,897,957,044 · RAC: 0
I'd like to add that there are two distinct batches in ACEMD3 (ADRIA and ANTONIOM), and when an ADRIA task gets between ANTONIOM tasks, it breaks the duration correction factor of the latter. So I have to adjust it manually in the client_state.xml file on a daily basis if I want to fill up my queue (4 tasks).