ATMML

Message boards : Number crunching : ATMML

Keith Myers
Message 61712 - Posted: 24 Aug 2024, 21:30:21 UTC - in response to Message 61711.  

Probably the decision is because this project depends on fast turnaround and turnover for tasks.

Science can't proceed till the earlier result is returned, validated and then iterated into the next task.

Better to fail fast and send out the next wingman task until the task gets retired at 8 fails.

The deadline is always 5 days to get through 8 tries, which is also why they reward a 50% credit bonus for returning results within 24 hours.
ID: 61712 · Rating: 0

wujj123456
Message 61713 - Posted: 24 Aug 2024, 22:25:16 UTC - in response to Message 61712.  
Last modified: 24 Aug 2024, 22:34:54 UTC

I've seen people misuse this "fail fast" philosophy very often. "Fail fast" makes sense only when it's going to be a failure anyway. Turning a successful result into a failure proactively is the opposite of making progress.

Look at the errors on this host. It took ~20K seconds for my host to finish, but all the prematurely killed results ended up wasting way more compute time and on average suffered another half a day delay before getting a successful return. That's not speeding up but slowing down.

That's why I'm curious what this limit is trying to protect against. Such a limit would make sense only if we know there is a chance that a task can get stuck computing indefinitely. That would generally indicate a bug that needs to be fixed. Even then, given how long the turnaround is after killing an otherwise successful task, the project should have set a floor of a few hours on the limit.

In addition, "science" does not equal this project alone. Wasting hours of compute that could have been used by other projects isn't advancing "science". It's advancing this project at the cost of other science. That would be a bit disrespectful to fellow scientists if it were done intentionally. However, I'd rather assume good intentions here and that this is just a misguided optimization. I know software isn't easy, so the project owner should take a look at the resulting data and try to make more efficient use of available compute by reducing waste, which would also speed up progress for this project.
ID: 61713 · Rating: 0

Keith Myers
Message 61714 - Posted: 25 Aug 2024, 7:40:25 UTC - in response to Message 61713.  

I'm pretty sure the "exceeded elapsed time limit" is not because the project scientists just decided on a whim to utilize it.

It's part of the Boinc code and nothing they have control over. It's present for all projects that use the Boinc code unmodified.

Only the Boinc developers have the knowledge of how that function is implemented.

The project scientist already stated he was surprised by the errors when the exact same task template was used for the Linux tasks and they have not had any issues with elapsed time limit errors.

So it is something specific to Windows. They do not develop for Windows first, because all the tools and software they use are primarily Linux-based, and that is where their expertise is greatest.

Some of the toolchains they use have never had Windows versions, which is why it has taken so long for Windows versions of some of the native Linux apps to appear.
ID: 61714 · Rating: 0

Richard Haselgrove
Message 61716 - Posted: 25 Aug 2024, 11:49:19 UTC - in response to Message 61714.  

I agree - the runtime errors are mainly an issue of the BOINC software, but they are appearing because the GPUGrid teams - admin and research - have over the years failed to fully come to terms with the changes introduced by BOINC around 2010. We are running a very old copy of the BOINC server code here, which includes the beginnings of the 2010 changes, but which makes it very difficult for us to dig our way out of the hole we're in.

But I don't agree that only the BOINC developers understand the code. It's all open-source, and other projects manage to control it reasonably well. The finer points are indeed cloaked in obscure language, but the resulting data is visible on all our machines.

Let's play with a current worked example.

I've just downloaded a new ATMML task on host 508381. That's a Linux machine, with 2 GPUs. They are in fact identical, so for once the 'Coprocessors' line is true. It has completed 52 ATMML tasks so far, so it has had plenty of time to reach a steady state. [BOINC loves steady states - it's the edge cases, like deploying a new app_version, which cause the problems]

My key objective is to see how the runtime estimate was derived, and to see what was done well, and what was done badly. BOINC works out the runtime from the size and the speed of the task. In dimensional terms, that's

{size} / {size per second}

The sizes cancel out, and duration is the inverse of speed.

In the case of my new task, I have:

<rsc_fpops_est>1000000000000000000.000000</rsc_fpops_est> (size)
<flops>698258637765.176392</flops> (speed)

My calculator makes that

Duration 1,432,134 seconds, or about 16.5 days.

But our BOINC clients have a trick up their sleeves for coping with that - it's called the DCF, or duration correction factor. For this machine, it's settled to 0.045052. Popping that into the calculator, that comes down to:

Runtime estimate 64,520 seconds, or 17.92 hours. BOINC Manager displays 17:55:20, and that's about right for these tasks (although they do vary).
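
The arithmetic above can be reproduced in a few lines of Python. This is only a sketch of the calculation as described in this post: the three input values are the ones quoted for this task and host, while the function names are made up for illustration and are not BOINC identifiers.

```python
# Sketch of the runtime estimate described above.
# Function names are illustrative; only the three input values
# (rsc_fpops_est, flops, DCF) come from the worked example.

def raw_duration(rsc_fpops_est: float, flops: float) -> float:
    """Task size divided by speed, in seconds."""
    return rsc_fpops_est / flops

def corrected_estimate(rsc_fpops_est: float, flops: float, dcf: float) -> float:
    """Raw duration scaled by the host's duration correction factor."""
    return raw_duration(rsc_fpops_est, flops) * dcf

def as_hms(seconds: float) -> str:
    """Format a duration as h:mm:ss, dropping fractional seconds."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"

RSC_FPOPS_EST = 1e18
FLOPS = 698258637765.176392
DCF = 0.045052

raw = raw_duration(RSC_FPOPS_EST, FLOPS)
est = corrected_estimate(RSC_FPOPS_EST, FLOPS, DCF)
print(f"raw duration: {int(raw):,} s")
print(f"with DCF:     {int(est):,} s ({as_hms(est)})")
```

Setting DCF to 1.0 in the same sketch shows what the client would display if the task size were the only input.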

CONCLUSION
The task sizes set by the project for this app are unrealistically high, and the runtime estimates only approach sanity through the heavy application of DCF - which should normally hover around 1.

DCF is self-adjusting, but very slowly for these extreme limits. And you have to do the work first, which may not be possible.

Volunteers with knowledge and skill can adjust their own DCFs, but I wouldn't advise it for novices.

@ Steve
That's even more indigestible than the essay I wrote you yesterday. Please don't jump into changing things until they've both sunk in fully: meddling (and in particular, meddling with one task type at a project with multiple applications) can cause even more problems than it cures.

Mull it over, discuss it with your administrators and fellow researchers, and above all - ask as many questions as you need.
ID: 61716 · Rating: 0

Erich56
Message 61721 - Posted: 25 Aug 2024, 19:56:36 UTC

"transient upload error: server out of disk space"

the old problem which has been occurring over the years :-(
Unbelievable that this is still happening.
ID: 61721 · Rating: 0

wujj123456
Message 61724 - Posted: 26 Aug 2024, 0:56:52 UTC - in response to Message 61716.  

So this runtime exceeded failure is actually related to the absurd rsc_fpops_est setting. This number is obviously not accurate. Could the project fix the number to be more reasonable, instead of relying on the client's trial-and-error adjustment, which wastes lots of compute power?
ID: 61724 · Rating: 0

Richard Haselgrove
Message 61730 - Posted: 26 Aug 2024, 7:52:36 UTC - in response to Message 61724.  

wujj123456 wrote: "So this runtime exceeded failure is actually related to the absurd rsc_fpops_est setting. This number is obviously not accurate. Could the project fix the number to be more reasonable, instead of relying on the client's trial-and-error adjustment, which wastes lots of compute power?"

Not so fast, please. The rsc_fpops_est figure is obviously wrong, but that's the result of many years of twiddling knobs without really understanding what they do.

Two flies in that pot of ointment:
- If they reduce rsc_fpops_est by itself, the time limit will reduce, and more tasks will fail.
- There's a second value - rsc_fpops_bound - which actually triggers the failure. In my worked example, that was set to 1,000x the estimate, or several years. That was one of the knobs they twiddled some years ago: the default is 10x. So something else is seriously wrong as well.
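
As a rough illustration of the second point, the sketch below computes the time limit implied by a given rsc_fpops_bound. It assumes the limit is simply rsc_fpops_bound divided by the same flops figure used for the estimate, which is not spelled out in this thread; the function name is invented for the example.

```python
# Sketch: elapsed-time limit implied by rsc_fpops_bound.
# ASSUMPTION (not stated in the thread): limit = rsc_fpops_bound / flops.
# elapsed_time_limit is a made-up helper name, not a BOINC function.

SECONDS_PER_DAY = 86400

def elapsed_time_limit(rsc_fpops_bound: float, flops: float) -> float:
    """Seconds a task may run before being aborted, under the assumption above."""
    return rsc_fpops_bound / flops

RSC_FPOPS_EST = 1e18
FLOPS = 698258637765.176392  # value from the worked example

for multiplier in (10, 1000):
    limit = elapsed_time_limit(multiplier * RSC_FPOPS_EST, FLOPS)
    print(f"{multiplier:>5}x bound: {limit / SECONDS_PER_DAY:,.0f} days")
```

Under that assumption, either multiplier leaves a limit far above the real runtime of these tasks on this host, so an abort would point to one of the inputs being badly off elsewhere.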

Soon after the Windows app was launched, I saw tasks with very high replication numbers, which had failed on multiple machines - up to 7, the limit here. But very few of them were 'time limit exceeded'. The tasks I'm running now have low replication numbers, so we may be over the worst of it.

I repeat my plea to Steve - please take your time to think, discuss, understand what's going on. Don't change anything until you've worked out what it'll do.
ID: 61730 · Rating: 0

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Message 61732 - Posted: 26 Aug 2024, 8:32:33 UTC - in response to Message 61730.  

Thank you for the explanation.

The time limit exceeded error therefore happened because:
- we had a bug in some circulating WUs where certain errors would not trigger a proper error code. The result would then be validated with short runtimes.
- these fast runtime results then skewed the correction factors for the newly released windows app version.

To fix the problem I 10x'ed the rsc_fpops_bound value while leaving the rsc_fpops_est unchanged.

This appears to have worked and hosts that previously had the time limit exceeded errors now do not.
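
A toy model of one way this mechanism could play out, assuming that the time limit is rsc_fpops_bound divided by a speed estimate and that the bogus short results inflated that estimate. The true runtime, the old bound, and the skewed speed below are invented numbers chosen only to show the effect; none of them is taken from a real host.

```python
# Toy model: an inflated speed estimate shrinks the time limit, and a
# 10x larger rsc_fpops_bound compensates.
# ASSUMPTIONS: limit = rsc_fpops_bound / flops_estimate; every number
# below is hypothetical and chosen for illustration only.

def is_aborted(true_runtime_s: float, rsc_fpops_bound: float, flops_estimate: float) -> bool:
    """True if the task would hit the elapsed-time limit before finishing."""
    limit_s = rsc_fpops_bound / flops_estimate
    return true_runtime_s > limit_s

TRUE_RUNTIME_S = 20_000.0  # hypothetical real runtime of a task
OLD_BOUND = 1e21           # hypothetical bound before the change
SKEWED_FLOPS = 1e17        # hypothetical estimate inflated by bogus fast results

before = is_aborted(TRUE_RUNTIME_S, OLD_BOUND, SKEWED_FLOPS)
after = is_aborted(TRUE_RUNTIME_S, 10 * OLD_BOUND, SKEWED_FLOPS)
print(f"aborted with old bound: {before}")
print(f"aborted with 10x bound: {after}")
```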
ID: 61732 · Rating: 0

Richard Haselgrove
Message 61733 - Posted: 26 Aug 2024, 9:05:27 UTC - in response to Message 61732.  

Yes, I found host 591089. That succeeded on its first task, but then failed five in a row on the time limit.

It's had the current one for two days now, so hopefully it'll work. One to watch.
ID: 61733 · Rating: 0

EA6LE
Message 61735 - Posted: 26 Aug 2024, 13:16:48 UTC - in response to Message 61733.  

I found a way to get the Windows WUs to finish without errors.
After you get a WU, go to the Projects tab and select "No new tasks". Be sure you don't have other projects running at the same time. Once it has finished and uploaded, you can allow new tasks again for another WU and repeat.
ID: 61735 · Rating: 0

EA6LE
Message 61739 - Posted: 27 Aug 2024, 22:05:37 UTC - in response to Message 61735.  

WUs starting with "MCL1" are all erroring out on both Windows and Linux.
ID: 61739 · Rating: 0

mmonnin
Message 61740 - Posted: 27 Aug 2024, 22:21:52 UTC

Quite a few of my tasks starting with PTP1B are failing in both OSs
ID: 61740 · Rating: 0

EA6LE
Message 61741 - Posted: 27 Aug 2024, 22:31:58 UTC - in response to Message 61740.  
Last modified: 27 Aug 2024, 22:32:11 UTC

mmonnin wrote: "Quite a few of my tasks starting with PTP1B are failing in both OSs"

Those worked fine for me under Linux. They took a shorter time to finish.
ID: 61741 · Rating: 0

Bedrich Hajek
Message 61742 - Posted: 28 Aug 2024, 0:37:26 UTC

Same issue here:

https://www.gpugrid.net/result.php?resultid=35798577



Tue 27 Aug 2024 08:34:58 PM EDT | GPUGRID | Computation for task MCL1_A23_A35_r0_4-QUICO_ATM_AF_04_Benchmark_MCL1-0-20-RND1911_1 finished
Tue 27 Aug 2024 08:34:58 PM EDT | GPUGRID | Output file MCL1_A23_A35_r0_4-QUICO_ATM_AF_04_Benchmark_MCL1-0-20-RND1911_1_0 for task MCL1_A23_A35_r0_4-QUICO_ATM_AF_04_Benchmark_MCL1-0-20-RND1911_1 absent
Tue 27 Aug 2024 08:34:59 PM EDT | GPUGRID | Started upload of MCL1_A23_A35_r0_4-QUICO_ATM_AF_04_Benchmark_MCL1-0-20-RND1911_1_1


ID: 61742 · Rating: 0

roundup
Message 61743 - Posted: 28 Aug 2024, 3:47:48 UTC - in response to Message 61742.  

That is a bad batch. All units error out after several weeks of trouble-free calculation under Linux. Example:
https://www.gpugrid.net/result.php?resultid=35800268

Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.
[W output_modules.py:45] Warning: CUDA graph capture will lock the batch to the current number of samples (2). Changing this will result in a crash (function )
[W output_modules.py:45] Warning: CUDA graph capture will lock the batch to the current number of samples (2). Changing this will result in a crash (function )
Traceback (most recent call last):
  File "/var/lib/boinc-client/slots/28/bin/rbfe_explicit_sync.py", line 11, in <module>
    rx.scheduleJobs()
  File "/var/lib/boinc-client/slots/28/lib/python3.11/site-packages/sync/atm.py", line 142, in scheduleJobs
    if isample % int(self.config['CHECKPOINT_FREQUENCY']) == 0 or isample == num_samples:
                     ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/boinc-client/slots/28/lib/python3.11/site-packages/configobj/__init__.py", line 554, in __getitem__
    val = dict.__getitem__(self, key)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'CHECKPOINT_FREQUENCY'
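
The traceback shows a dictionary-style lookup failing because the task's configuration has no CHECKPOINT_FREQUENCY entry. As a hedged illustration only (this is not the project's code, and `REQUIRED_KEYS`, `validate_config`, `should_checkpoint`, and the sample values are all hypothetical), a check run before the simulation starts would surface the same problem with a clearer message:

```python
# Illustration only: fail early with a clear message when a required
# configuration key is missing. A plain dict stands in for the ConfigObj
# mapping in the traceback, which delegates to dict.__getitem__.

REQUIRED_KEYS = ("CHECKPOINT_FREQUENCY",)  # hypothetical list of required keys

def validate_config(config: dict) -> None:
    """Raise ValueError naming every required key that is absent."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"missing required config keys: {', '.join(missing)}")

def should_checkpoint(config: dict, isample: int, num_samples: int) -> bool:
    """Same condition as the failing line in the traceback."""
    return isample % int(config["CHECKPOINT_FREQUENCY"]) == 0 or isample == num_samples

bad_config = {"MAX_SAMPLES": "20"}  # hypothetical task config without the key
try:
    validate_config(bad_config)
except ValueError as err:
    print(err)

good_config = {"MAX_SAMPLES": "20", "CHECKPOINT_FREQUENCY": "5"}  # hypothetical
validate_config(good_config)
print(should_checkpoint(good_config, isample=10, num_samples=20))
```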
ID: 61743 · Rating: 0

Richard Haselgrove
Message 61747 - Posted: 28 Aug 2024, 9:31:22 UTC

Yes, I had 90 failed ATMML tasks overnight. The earliest was issued just after 18:00 UTC yesterday, but was created at 27 Aug 2024 | 13:28:15 UTC.

I've switched to helping with the quantum chemistry backlog for the time being.
ID: 61747 · Rating: 0

Billy Ewell 1931
Message 61757 - Posted: 3 Sep 2024, 18:30:20 UTC - in response to Message 61694.  
Last modified: 3 Sep 2024, 18:37:38 UTC

I experienced the same problems over several days when I was suspending GPU processing because of very hot temperatures in Texas; the result was the loss of many hours of processing until I discovered that LTIWS (leave tasks in memory while suspended) was apparently not working. I am suspending ATMML tasks until cooler weather arrives in the fall. BET
ID: 61757 · Rating: 0

Keith Myers
Message 61758 - Posted: 3 Sep 2024, 19:46:19 UTC - in response to Message 61757.  

I can't remember all of the task types that allow suspending or exiting Boinc without erroring out.

The acemd tasks properly checkpoint, but you also can't allow a restarted WU to start again on a different gpu or it will also error out.

Best practice for GPUGrid generally has always been to let all tasks run to completion before exiting Boinc. No guarantees that any task will resume without loss of prior work done or just error out.
ID: 61758 · Rating: 0

Richard Haselgrove
Message 61760 - Posted: 3 Sep 2024, 21:07:48 UTC - in response to Message 61757.  

LTI[M]WS only applies to CPU tasks. GPUs don't have that much spare memory.
ID: 61760 · Rating: 0

Billy Ewell 1931
Message 61761 - Posted: 4 Sep 2024, 17:12:09 UTC - in response to Message 61760.  

I appreciate the informative responses of BOTH Keith and Richard immediately below!!
ID: 61761 · Rating: 0

©2025 Universitat Pompeu Fabra