Message boards :
Number crunching :
ATMML
| Author | Message |
|---|---|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
The decision is probably because this project depends on fast turnaround and turnover for tasks. Science can't proceed until the earlier result is returned, validated and then iterated into the next task. Better to fail fast and send out the next wingman task until the task gets retired at 8 fails. The deadline is always 5 days to get through 8 tries, which is also why they reward a 50% credit bonus for returning results within 24 hours. |
|
Joined: 9 Jun 10 Posts: 19 Credit: 2,233,932,323 RAC: 0
|
I've seen people misuse this "fail fast" philosophy very often. "Fail fast" makes sense only when it's going to be a failure anyway. Proactively turning a successful result into a failure is the opposite of making progress. Look at the errors on this host. It took ~20K seconds for my host to finish, but all the prematurely killed results ended up wasting far more compute time, and on average suffered another half a day of delay before a successful return. That's not speeding up but slowing down. That's why I'm curious what this limit is trying to protect against. Such a limit would make sense only if we knew there was a chance that a task could get stuck computing indefinitely. That would generally indicate a bug that needs to be fixed. Even then, given how long the turnaround would be after killing an otherwise successful task, the project should have set a floor of a few hours for the limit. In addition, "science" does not equal this project alone. Wasting hours of compute that could have been used by other projects isn't advancing "science". It's advancing this project at the cost of other science. That would be a bit disrespectful to fellow scientists if it's done intentionally. However, I'd rather assume good intentions here, and that this is just a misguided optimization. I know software isn't easy, so the project owner should take a look at the resulting data and try to make more efficient use of available compute by reducing waste, which would also speed up progress for this project. |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
I'm pretty sure the "exceeded elapsed time limit" error is not there because the project scientists decided on a whim to use it. It's part of the BOINC code and nothing they have control over. It's present for all projects that use the BOINC code unmodified. Only the BOINC developers know how that function is implemented. The project scientist already stated he was surprised by the errors, since the exact same task template was used for the Linux tasks and they have not had any issues with elapsed time limit errors there. It is something specific to Windows. And they do not develop for Windows first, because all the tools and software they use are primarily Linux-based, and that is where their expertise is greatest. Some of the toolchains they use have never had Windows versions, which is why it has taken so long to produce Windows versions of some of the native Linux apps. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
I agree - the runtime errors are mainly an issue of the BOINC software, but they are appearing because the GPUGrid teams - admin and research - have over the years failed to fully come to terms with the changes introduced by BOINC around 2010. We are running a very old copy of the BOINC server code here, which includes the beginnings of the 2010 changes, but which makes it very difficult for us to dig our way out of the hole we're in.

But I don't agree that only the BOINC developers understand the code. It's all open-source, and other projects manage to control it reasonably well. The finer points are indeed cloaked in obscure language, but the resulting data is visible on all our machines.

Let's play with a current worked example. I've just downloaded a new ATMML task on host 508381. That's a Linux machine, with 2 GPUs. They are in fact identical, so for once the 'Coprocessors' line is true. It has completed 52 ATMML tasks so far, so it has had plenty of time to reach a steady state. [BOINC loves steady states - it's the edge cases, like deploying a new app_version, which cause the problems]

My key objective is to see how the runtime estimate was derived, and to see what was done well, and what was done badly. BOINC works out the runtime from the size and the speed of the task. In dimensional terms, that's {size} / {size per second}. The sizes cancel out, and duration is the inverse of speed.

In the case of my new task, I have:
<rsc_fpops_est>1000000000000000000.000000</rsc_fpops_est> (size)
<flops>698258637765.176392</flops> (speed)

My calculator makes that:
Duration 1,432,134 seconds, or about 16.6 days.

But our BOINC clients have a trick up their sleeves for coping with that - it's called the DCF, or duration correction factor. For this machine, it's settled to 0.045052. Popping that into the calculator, that comes down to:
Runtime estimate 64,520 seconds, or 17.92 hours.

BOINC Manager displays 17:55:20, and that's about right for these tasks (although they do vary).

CONCLUSION
The task sizes set by the project for this app are unrealistically high, and the runtime estimates only approach sanity through the heavy application of DCF - which should normally hover around 1. DCF is self-adjusting, but very slowly for these extreme limits. And you have to do the work first, which may not be possible. Volunteers with knowledge and skill can adjust their own DCFs, but I wouldn't advise it for novices.

@ Steve
That's even more indigestible than the essay I wrote you yesterday. Please don't jump into changing things until they've both sunk in fully: meddling (and in particular, meddling with one task type at a project with multiple applications) can cause even more problems than it cures. Mull it over, discuss it with your administrators and fellow researchers, and above all - ask as many questions as you need. |
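The arithmetic in the worked example above can be reproduced in a few lines. This is only a sketch of the calculation as the post describes it (size divided by speed, then scaled by the DCF); the function names `estimated_runtime` and `as_hms` are made up for illustration and are not part of BOINC.

```python
# Values quoted in the post for the new ATMML task on host 508381.
RSC_FPOPS_EST = 1_000_000_000_000_000_000.0   # <rsc_fpops_est> (size)
FLOPS = 698_258_637_765.176392                # <flops> (speed)
DCF = 0.045052                                # duration correction factor


def estimated_runtime(fpops_est: float, flops: float, dcf: float = 1.0) -> float:
    """Return size / speed, scaled by the duration correction factor, in seconds."""
    return fpops_est / flops * dcf


def as_hms(seconds: float) -> str:
    """Format a duration as H:MM:SS, dropping any fractional second."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


raw = estimated_runtime(RSC_FPOPS_EST, FLOPS)             # no DCF applied
corrected = estimated_runtime(RSC_FPOPS_EST, FLOPS, DCF)  # DCF applied

print(f"raw estimate: {int(raw):,} s ({raw / 86400:.1f} days)")
print(f"with DCF:     {int(corrected):,} s ({as_hms(corrected)})")
```

Truncating rather than rounding the corrected value matches the 64,520 seconds and 17:55:20 quoted in the post.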
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
"transient upload error: server out of disk space" - the old problem, which has kept occurring over the years :-( Unbelievable that this is still happening. |
|
Joined: 9 Jun 10 Posts: 19 Credit: 2,233,932,323 RAC: 0
|
So this runtime exceeded failure is actually related to the absurd rsc_fpops_est setting. This number is obviously not accurate. Could the project fix the number to be more reasonable, instead of relying on the client's trial and error for adjustment, which wastes lots of compute power? |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
"So this runtime exceeded failure is actually related to the absurd rsc_fpops_est setting. This number is obviously not accurate. Could the project fix the number to be more reasonable, instead of relying on the client's trial and error for adjustment, which wastes lots of compute power?"

Not so fast, please. The rsc_fpops_est figure is obviously wrong, but that's the result of many years of twiddling knobs without really understanding what they do.

Two flies in that pot of ointment:
- If they reduce rsc_fpops_est by itself, the time limit will reduce, and more tasks will fail.
- There's a second value - rsc_fpops_bound - which actually triggers the failure. In my worked example, that was set to 1,000x the estimate, or several years. That was one of the knobs they twiddled some years ago: the default is 10x.

So something else is seriously wrong as well. Soon after the Windows app was launched, I saw tasks with very high replication numbers, which had failed on multiple machines - up to 7, the limit here. But very few of them were 'time limit exceeded'. The tasks I'm running now have low replication numbers, so we may be over the worst of it.

I repeat my plea to Steve - please take your time to think, discuss, understand what's going on. Don't change anything until you've worked out what it'll do. |
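The relationship between the two knobs can be sketched as follows. It assumes, as the post implies, that a task is aborted once its elapsed time exceeds rsc_fpops_bound divided by the speed figure, and that the project sets rsc_fpops_bound as a fixed multiple of rsc_fpops_est. The function `elapsed_time_limit`, that coupling, and the 20x reduction are illustrative assumptions, not taken from the BOINC source.

```python
def elapsed_time_limit(fpops_est: float, bound_multiplier: float, flops: float) -> float:
    """Seconds a task may run before being aborted, under the assumed model.

    Assumption: rsc_fpops_bound = bound_multiplier * rsc_fpops_est,
    and the limit is rsc_fpops_bound / flops.
    """
    fpops_bound = bound_multiplier * fpops_est
    return fpops_bound / flops


FLOPS = 698_258_637_765.176392   # <flops> from the worked example
EST = 1e18                       # <rsc_fpops_est> from the worked example

current = elapsed_time_limit(EST, 1000, FLOPS)           # 1,000x, as set by the project
default = elapsed_time_limit(EST, 10, FLOPS)             # 10x, the default per the post
smaller_est = elapsed_time_limit(EST / 20, 1000, FLOPS)  # hypothetical: est cut 20x, multiplier untouched

print(f"1,000x vs 10x multiplier: limit is {current / default:.0f} times longer")
print(f"est reduced 20x on its own: limit is {current / smaller_est:.0f} times shorter")
```

Under this model, lowering rsc_fpops_est without touching the multiplier shortens the limit by the same factor, which is the first fly in the ointment.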
|
Joined: 21 Dec 23 Posts: 51 Credit: 0 RAC: 0 |
Thank you for the explanation. The time limit exceeded error therefore happened because:
- we had a bug in some circulating WUs where certain errors would not trigger a proper error code. The result would then be validated with a short runtime.
- these fast-runtime results then skewed the correction factors for the newly released Windows app version.

To fix the problem I increased the rsc_fpops_bound value 10x while leaving rsc_fpops_est unchanged. This appears to have worked, and hosts that previously had the time limit exceeded errors no longer do. |
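One way to picture the mechanism described above: if the speed figure for a host and app version is derived from how quickly earlier tasks were reported as finished, then results that "finish" in seconds because of an unreported error make the host look far faster than it is, and the time limit shrinks to match. The model below is a deliberately simplified assumption (speed = rsc_fpops_est / mean reported runtime); the real server-side statistics are more involved, and every name and sample runtime here is hypothetical.

```python
from statistics import mean

EST = 1e18                # <rsc_fpops_est> from the worked example earlier in the thread
REAL_RUNTIME = 64_520.0   # seconds, the runtime estimate quoted earlier in the thread


def apparent_flops(fpops_est: float, reported_runtimes: list[float]) -> float:
    """Assumed model: speed inferred from the mean reported runtime."""
    return fpops_est / mean(reported_runtimes)


def time_limit(fpops_est: float, bound_multiplier: float, flops: float) -> float:
    """Assumed model: limit = (multiplier * est) / flops, in seconds."""
    return bound_multiplier * fpops_est / flops


# Hypothetical bogus "successes" that exited early without an error code.
skewed_runtimes = [30.0, 45.0, 15.0]
skewed_flops = apparent_flops(EST, skewed_runtimes)

limit_before = time_limit(EST, 1_000, skewed_flops)    # original bound multiplier
limit_after = time_limit(EST, 10_000, skewed_flops)    # bound raised 10x, est unchanged

print(f"limit with skewed speed: {limit_before:,.0f} s (task needs {REAL_RUNTIME:,.0f} s)")
print(f"limit after 10x bound:   {limit_after:,.0f} s")
```

With these made-up numbers the limit falls below the real runtime until the bound is raised, which is consistent with the fix reported above, though it does not prove that this is exactly what the server computed.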
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
Yes, I found host 591089. That succeeded on its first task, but then failed five in a row on the time limit. It's had the current one for two days now, so hopefully it'll work. One to watch. |
|
Joined: 28 Dec 20 Posts: 7 Credit: 26,500,257,436 RAC: 2
|
I found a way to get the Windows WUs to finish without errors. After you get a WU, go to the Projects tab and select "No new tasks". Be sure you don't have other projects running at the same time. Once it has finished and uploaded, you can allow another WU and repeat. |
|
Joined: 28 Dec 20 Posts: 7 Credit: 26,500,257,436 RAC: 2
|
WUs starting with "MCL1" are all erroring out on both Windows and Linux. |
|
Joined: 2 Jul 16 Posts: 338 Credit: 7,987,341,558 RAC: 259
|
Quite a few of my tasks starting with PTP1B are failing in both OSs |
|
Joined: 28 Dec 20 Posts: 7 Credit: 26,500,257,436 RAC: 2
|
"Quite a few of my tasks starting with PTP1B are failing in both OSs"

Those worked fine for me under Linux. They took less time to finish. |
|
Joined: 28 Mar 09 Posts: 490 Credit: 11,731,645,728 RAC: 69
|
Same issue here: https://www.gpugrid.net/result.php?resultid=35798577
Tue 27 Aug 2024 08:34:58 PM EDT | GPUGRID | Computation for task MCL1_A23_A35_r0_4-QUICO_ATM_AF_04_Benchmark_MCL1-0-20-RND1911_1 finished
Tue 27 Aug 2024 08:34:58 PM EDT | GPUGRID | Output file MCL1_A23_A35_r0_4-QUICO_ATM_AF_04_Benchmark_MCL1-0-20-RND1911_1_0 for task MCL1_A23_A35_r0_4-QUICO_ATM_AF_04_Benchmark_MCL1-0-20-RND1911_1 absent
Tue 27 Aug 2024 08:34:59 PM EDT | GPUGRID | Started upload of MCL1_A23_A35_r0_4-QUICO_ATM_AF_04_Benchmark_MCL1-0-20-RND1911_1_1 |
|
Joined: 11 May 10 Posts: 68 Credit: 12,293,491,875 RAC: 3,176
|
That is a bad batch. All units error out after several weeks of trouble-free calculation under Linux. Example: https://www.gpugrid.net/result.php?resultid=35800268
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.
[W output_modules.py:45] Warning: CUDA graph capture will lock the batch to the current number of samples (2). Changing this will result in a crash (function )
[W output_modules.py:45] Warning: CUDA graph capture will lock the batch to the current number of samples (2). Changing this will result in a crash (function )
Traceback (most recent call last):
  File "/var/lib/boinc-client/slots/28/bin/rbfe_explicit_sync.py", line 11, in <module>
    rx.scheduleJobs()
  File "/var/lib/boinc-client/slots/28/lib/python3.11/site-packages/sync/atm.py", line 142, in scheduleJobs
    if isample % int(self.config['CHECKPOINT_FREQUENCY']) == 0 or isample == num_samples:
                     ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/lib/boinc-client/slots/28/lib/python3.11/site-packages/configobj/__init__.py", line 554, in __getitem__
    val = dict.__getitem__(self, key)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'CHECKPOINT_FREQUENCY' |
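The final line of the traceback is an ordinary Python KeyError: the configuration for these tasks has no CHECKPOINT_FREQUENCY entry, so the lookup fails. The snippet below is a minimal illustration that uses a plain dict in place of the ConfigObj instance seen in the traceback. The config contents, the helper `checkpoint_frequency` and its fallback value are hypothetical, and this is not the project's code or its fix.

```python
# Hypothetical task configuration that lacks the CHECKPOINT_FREQUENCY key.
config = {"MAX_SAMPLES": "20"}


def checkpoint_frequency(cfg: dict, default: int) -> int:
    """Defensive lookup: use `default` when the key is absent (illustrative only)."""
    return int(cfg.get("CHECKPOINT_FREQUENCY", default))


try:
    # Same kind of lookup as the failing line in the traceback.
    frequency = int(config["CHECKPOINT_FREQUENCY"])
    message = f"checkpoint every {frequency} samples"
except KeyError as err:
    message = f"KeyError: {err}"

print(message)
print(checkpoint_frequency(config, default=1))
```

Whether a silent default would be appropriate for real tasks is a question for the project; the point here is only that the batch fails on a missing setting rather than on anything host-specific.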
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
Yes, I had 90 failed ATMML tasks overnight. The earliest was issued just after 18:00 UTC yesterday, but was created at 27 Aug 2024 | 13:28:15 UTC. I've switched to helping with the quantum chemistry backlog for the time being. |
|
Joined: 22 Oct 10 Posts: 42 Credit: 1,752,050,315 RAC: 57
|
I experienced the same problems over several days when I was suspending GPU processing because of very hot temperatures in Texas; the result was the loss of many hours of processing, until I discovered that LTIWS (leave tasks in memory while suspended) was apparently not working. I am suspending ATMML tasks until cooler weather arrives in the fall. BET |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 891
|
I can't remember all of the task types that allow suspending or exiting BOINC without erroring out. The acemd tasks checkpoint properly, but you also can't allow a restarted WU to start again on a different GPU or it will error out as well. Best practice for GPUGrid has generally always been to let all tasks run to completion before exiting BOINC. There are no guarantees that any task will resume without losing prior work, or that it won't just error out. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 428
|
LTI[M]WS only applies to CPU tasks. GPUs don't have that much spare memory. |
|
Joined: 22 Oct 10 Posts: 42 Credit: 1,752,050,315 RAC: 57
|
I appreciate the informative responses of BOTH Keith and Richard immediately below!! |
©2025 Universitat Pompeu Fabra