ATM

Author	Message
Keith Myers · Joined: 13 Dec 17 · Posts: 1424 · Credit: 9,189,946,190 · RAC: 0
Message 60673 - Posted: 22 Aug 2023, 18:04:39 UTC

There appears to be a flaw or weakness in the application right at the end of the finishing-up stage, where the app needs to tar the output file. It often seems to lose track of the filename, or to look for the wrong filename. This might have something to do with the way the host handles access permissions, or with the filesystem being slow to make the output file available. I wonder whether putting a wait loop into the code right before this step would let the filesystem settle down long enough for the file to be accessible for tar'ing.
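
The wait-loop idea suggested above could look roughly like the following. This is only a minimal sketch in Python and not the ATM application's actual code: the function names `wait_for_stable_file` and `tar_output`, the timeout and the polling interval are all assumptions made for illustration. It polls until the output file exists and its size has stopped changing, and only then hands it to `tarfile`.

```python
import os
import tarfile
import time


def wait_for_stable_file(path, timeout=30.0, interval=0.5):
    """Return True once `path` exists and its size is unchanged between two polls.

    Hypothetical helper: the timeout and interval values are illustrative only.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        try:
            size = os.path.getsize(path)
        except OSError:
            size = -1  # file not visible (or not accessible) yet
        if size >= 0 and size == last_size:
            return True
        last_size = size
        time.sleep(interval)
    return False


def tar_output(path, archive_path, timeout=30.0, interval=0.5):
    """Wait for the output file to settle, then pack it into a gzip'ed tar archive."""
    if not wait_for_stable_file(path, timeout, interval):
        raise FileNotFoundError(f"{path} did not become available within {timeout} s")
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(path, arcname=os.path.basename(path))
    return archive_path
```

Whether a delay would actually cure the failures described here is an open question; if the app is looking for the wrong filename, waiting would only turn a fast failure into a slow one.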

Art_Brown · Joined: 3 Jun 09 · Posts: 4 · Credit: 1,086,774,155 · RAC: 0
Message 60674 - Posted: 23 Aug 2023, 19:12:52 UTC

ATMbeta tasks show 100% complete but do not finish, and Task Manager shows 'normal' working CPU loads (5 to 6% in my case) (12 core, 16 logical). I abort the runs and stop accepting new work for a week or so, hoping the next batch finishes and uploads, but it might be an issue with my PC.

Windows 10, 64GB RAM, two GTX-980ti cards.

Can anyone offer suggestions? Thanks.

Two examples:

Computer: Pecan
Project: GPUGRID
Name: tnks2_m07_m5d_2-QUICO_ATM_Mck_GAFF2_2fs-2-5-RND9083_0
Application: ATMbeta 1.09 (cuda1121)
Workunit name: tnks2_m07_m5d_2-QUICO_ATM_Mck_GAFF2_2fs-2-5-RND9083
State: Running
Received: 8/23/2023 6:41:59 AM
Report deadline: 8/28/2023 6:41:59 AM
Estimated app speed: 580.20 GFLOPs/sec
Estimated task size: 1,000,000,000 GFLOPs
Resources: 0.983 CPUs + 1 NVIDIA GPU (device 1)
CPU time at last checkpoint: 00:00:00
CPU time: 04:25:24
Elapsed time: 05:11:46
Estimated time remaining: 00:00:00
Fraction done: 100.000%
Virtual memory size: 6,867.10 MB
Working set size: 761.31 MB
Directory: slots/1
Process ID: 5364
Debug State: 2 - Scheduler: 2

Computer: Pecan
Project: GPUGRID
Name: shp2_m24_m19_1-QUICO_ATM_Mck_Sage_2fs-2-5-RND5546_0
Application: ATMbeta 1.09 (cuda1121)
Workunit name: shp2_m24_m19_1-QUICO_ATM_Mck_Sage_2fs-2-5-RND5546
State: Running
Received: 8/23/2023 6:42:37 AM
Report deadline: 8/28/2023 6:42:36 AM
Estimated app speed: 580.20 GFLOPs/sec
Estimated task size: 1,000,000,000 GFLOPs
Resources: 0.983 CPUs + 1 NVIDIA GPU (device 0)
CPU time at last checkpoint: 00:00:00
CPU time: 04:40:09
Elapsed time: 05:17:34
Estimated time remaining: 00:00:00
Fraction done: 100.000%
Virtual memory size: 7,414.65 MB
Working set size: 1,184.61 MB
Directory: slots/0
Process ID: 20016
Debug State: 2 - Scheduler: 2

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 60675 - Posted: 23 Aug 2023, 21:07:29 UTC - in response to Message 60674.

Yes. That is - sadly - normal for the current tasks in this series.

Ultra-long tasks were split into bite-sized chunks - you can tell which chunk you're running from the task name. They're split into 0-5 to 4-5 (towards the end of the task name: there's no 5-5).

If you get a 0-5, it'll report progress normally, from 0% to 100%. All the others will jump quickly to 100%, and stay there.

Irritating, and it messes up work fetch, but it doesn't ultimately matter. Just let them run: they'll finish eventually, and life will move on.
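
To check which chunk a task belongs to without reading the name by eye, something like the sketch below could be used. It is an illustration only: the regular expression is inferred from the two task names quoted earlier in this thread (a `-<chunk>-<total>-RND<digits>` tail with an optional `_<n>` suffix) and may not cover every naming scheme, and the function names `chunk_of` and `reports_progress_normally` are made up for this example.

```python
import re

# Assumed pattern, inferred from names such as
# tnks2_m07_m5d_2-QUICO_ATM_Mck_GAFF2_2fs-2-5-RND9083_0
CHUNK_RE = re.compile(r"-(\d+)-(\d+)-RND\d+(?:_\d+)?$")


def chunk_of(task_name):
    """Return (chunk_index, chunk_count) parsed from a task name, or None if it does not match."""
    match = CHUNK_RE.search(task_name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def reports_progress_normally(task_name):
    """Per the description above, only chunk 0 reports progress from 0% to 100%."""
    chunk = chunk_of(task_name)
    return chunk is not None and chunk[0] == 0


print(chunk_of("tnks2_m07_m5d_2-QUICO_ATM_Mck_GAFF2_2fs-2-5-RND9083_0"))  # (2, 5)
```

Both example tasks posted above are 2-5 chunks, which fits the behaviour described: they jump to 100% and stay there while still running.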

Art_Brown · Joined: 3 Jun 09 · Posts: 4 · Credit: 1,086,774,155 · RAC: 0
Message 60676 - Posted: 24 Aug 2023, 15:39:38 UTC - in response to Message 60675.

Hi Richard, and thanks for the info.

I stop all processing at 4pm daily, and when the ATMbeta tasks are still running, they don't survive the suspension and restart process. They spontaneously abort. So I'll still have to try new batches from time to time to avoid wasting CPU time for other projects.

Regards,
Art

Keith Myers · Joined: 13 Dec 17 · Posts: 1424 · Credit: 9,189,946,190 · RAC: 0
Message 60677 - Posted: 24 Aug 2023, 18:40:52 UTC - in response to Message 60676.

The ATMbeta tasks cannot be stopped during processing or they will error out.

wujj123456 · Joined: 9 Jun 10 · Posts: 19 · Credit: 2,233,932,323 · RAC: 0
Message 60683 - Posted: 26 Aug 2023, 21:45:17 UTC - in response to Message 60677.

I also pause computing during peak hours, both for cost and to avoid putting more load onto the grid at the wrong time. I have to guess when to stop fetching work from GPUGRID so that each WU finishes before the peak hours begin.

Given that the WU can be split up, it would be nice if each chunk could be smaller, especially when the workload doesn't support checkpoints or suspend/resume. The average time sits at 6 hours now. Are there good reasons it can't be shorter? Like additional bandwidth, or overhead for each chunk?

KAMasud · Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0
Message 60684 - Posted: 27 Aug 2023, 2:04:39 UTC - in response to Message 60638.

> If it hurts when you <do that> then the most obvious solution is to not <do that>. This applies to most things in life.
>
> It’s well known that these tasks do not like to be interrupted. If your power grid is that unstable then it’s probably best for you to crunch something else, or invest in a battery backup to keep the computer running during power outages. These are still classified as Beta after all and that comes with the implication that things will not always work, and you need to accept whatever compromises that comes with it. If you don’t then your other solution could be to just disable Beta processing from your profile and wait for ACEMD3 work.

The sad part is that you are extra smart.

KAMasud · Joined: 27 Jul 11 · Posts: 138 · Credit: 539,953,398 · RAC: 0
Message 60685 - Posted: 27 Aug 2023, 2:10:56 UTC - in response to Message 60683.

> I also pause computing during peak hours, both for cost and to avoid putting more load onto the grid at the wrong time. I have to guess when to stop fetching work from GPUGRID so that each WU finishes before the peak hours begin.
>
> Given that the WU can be split up, it would be nice if each chunk could be smaller, especially when the workload doesn't support checkpoints or suspend/resume. The average time sits at 6 hours now. Are there good reasons it can't be shorter? Like additional bandwidth, or overhead for each chunk?

You cannot predict it, but someone extra smart will come along and post irrelevant BS. 28 WUs, each crunched for eight hours, then errored out. Multiply 28 by 8, then by the power tariff. All wasted.
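
The arithmetic in the post above can be written out as follows. Only the 28 work units and the 8 hours each come from the post; the average power draw and the electricity tariff below are placeholder values chosen purely for illustration and would need to be replaced with real figures for any given host.

```python
def wasted_gpu_hours(failed_tasks, hours_per_task):
    """Total run time spent on tasks that errored out."""
    return failed_tasks * hours_per_task


def wasted_cost(failed_tasks, hours_per_task, avg_power_kw, tariff_per_kwh):
    """Energy cost of the failed tasks: hours x average kW x price per kWh."""
    return wasted_gpu_hours(failed_tasks, hours_per_task) * avg_power_kw * tariff_per_kwh


# From the post: 28 work units, 8 hours each.
hours = wasted_gpu_hours(28, 8)
print(hours)  # 224

# Assumed, NOT from the thread: 0.25 kW average draw, 0.30 currency units per kWh.
cost = wasted_cost(28, 8, avg_power_kw=0.25, tariff_per_kwh=0.30)
print(round(cost, 2))
```

With these placeholder numbers the 224 hours correspond to 56 kWh; the point of the sketch is only that the loss scales linearly with every factor, so shorter chunks or working checkpoints would cut it directly.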

Erich56 · Joined: 1 Jan 15 · Posts: 1171 · Credit: 12,662,148,501 · RAC: 3,588
Message 60686 - Posted: 27 Aug 2023, 10:45:52 UTC

This morning, an ATM task on a GTX980ti errored out after more than 10 hours :-(((

The reason, as often enough, is:

ValueError: Energy is NaN

What a waste of valuable energy! What I don't understand is why, by now, the developer has not gained enough experience with this type of task to eliminate errors like this.

At any rate, if this happens again, I will quit crunching ATM. Electricity rates here tripled last year, and I can no longer afford to waste money.

roundup · Joined: 11 May 10 · Posts: 68 · Credit: 13,357,003,875 · RAC: 9,425
Message 60687 - Posted: 27 Aug 2023, 11:42:37 UTC - in response to Message 60686.

> This morning, an ATM task on a GTX980ti errored out after more than 10 hours :-(((
>
> The reason, as often enough, is:
>
> ValueError: Energy is NaN

Similar here after 11,363.13 seconds on an RTX 2070S for Unit 33589476.

I understand that this can happen in a Beta project. However, I would expect the developer to iron out the most common errors, such as 'Energy is NaN', the progress indication jumping to 100%, the wrong remaining-runtime indication, and RTX4xxx errors on Windows - to name a few.

Ian&Steve C. · Joined: 21 Feb 20 · Posts: 1117 · Credit: 40,876,970,595 · RAC: 0
Message 60688 - Posted: 28 Aug 2023, 2:20:07 UTC - in response to Message 60685.

>> I also pause computing during peak hours, both for cost and to avoid putting more load onto the grid at the wrong time. I have to guess when to stop fetching work from GPUGRID so that each WU finishes before the peak hours begin.
>>
>> Given that the WU can be split up, it would be nice if each chunk could be smaller, especially when the workload doesn't support checkpoints or suspend/resume. The average time sits at 6 hours now. Are there good reasons it can't be shorter? Like additional bandwidth, or overhead for each chunk?
>
> You cannot predict it, but someone extra smart will come along and post irrelevant BS. 28 WUs, each crunched for eight hours, then errored out. Multiply 28 by 8, then by the power tariff. All wasted.

I’m sure doing the same thing over and over and expecting a different result is the solution :)

bluestang · Joined: 13 Apr 15 · Posts: 11 · Credit: 3,003,712,606 · RAC: 0
Message 60689 - Posted: 28 Aug 2023, 3:14:03 UTC - in response to Message 60688.

>>> I also pause computing during peak hours, both for cost and to avoid putting more load onto the grid at the wrong time. I have to guess when to stop fetching work from GPUGRID so that each WU finishes before the peak hours begin.
>>>
>>> Given that the WU can be split up, it would be nice if each chunk could be smaller, especially when the workload doesn't support checkpoints or suspend/resume. The average time sits at 6 hours now. Are there good reasons it can't be shorter? Like additional bandwidth, or overhead for each chunk?
>>
>> You cannot predict it, but someone extra smart will come along and post irrelevant BS. 28 WUs, each crunched for eight hours, then errored out. Multiply 28 by 8, then by the power tariff. All wasted.
>
> I’m sure doing the same thing over and over and expecting a different result is the solution :)

You mean like releasing batch after batch after batch of WUs with the same issues that people have been complaining about for how long now? :)

Ian&Steve C. · Joined: 21 Feb 20 · Posts: 1117 · Credit: 40,876,970,595 · RAC: 0
Message 60690 - Posted: 28 Aug 2023, 5:13:43 UTC - in response to Message 60689.

It’s beta. Accept it or move on.

Erich56 · Joined: 1 Jan 15 · Posts: 1171 · Credit: 12,662,148,501 · RAC: 3,588
Message 60691 - Posted: 28 Aug 2023, 9:06:32 UTC - in response to Message 60690.

> It’s beta. Accept it or move on.

The question, though, is how much longer it will be beta. Isn't the point of a beta that the developer of a tool is working on it in order to eliminate problems?

Here, not much seems to have been done so far. Always the same errors and problems, and none of them has been solved after such a long time :-(

Erich56 · Joined: 1 Jan 15 · Posts: 1171 · Credit: 12,662,148,501 · RAC: 3,588
Message 60692 - Posted: 28 Aug 2023, 11:06:30 UTC

Here is the next one:

https://www.gpugrid.net/result.php?resultid=33593354

It failed after 11,267 seconds, and the output doesn't even say why it failed (unless I am just unable to spot it).

So I will stop crunching ATMs, at least on this machine. Waste of expensive electricity :-(

Keith Myers · Joined: 13 Dec 17 · Posts: 1424 · Credit: 9,189,946,190 · RAC: 0
Message 60693 - Posted: 28 Aug 2023, 21:38:36 UTC - in response to Message 60691.

>> It’s beta. Accept it or move on.
>
> The question, though, is how much longer it will be beta. Isn't the point of a beta that the developer of a tool is working on it in order to eliminate problems?
>
> Here, not much seems to have been done so far. Always the same errors and problems, and none of them has been solved after such a long time :-(

Could be forever . . . it does not matter, as the science still gets done. Either accept failures with a beta app or move on.

Richard Haselgrove · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 0
Message 60694 - Posted: 29 Aug 2023, 8:50:42 UTC - in response to Message 60693.

Judging by the number of tasks which have passed through the system over the past week (and yet more have just been added), it would appear that the scientific part of the project is now operating in 'production' mode, rather than 'beta' mode.

It would be appreciated if the beta wrinkles could be ironed out from the administrative aspects of the app, too.

Erich56 · Joined: 1 Jan 15 · Posts: 1171 · Credit: 12,662,148,501 · RAC: 3,588
Message 60695 - Posted: 29 Aug 2023, 16:05:30 UTC - in response to Message 60694.

> Judging by the number of tasks which have passed through the system over the past week (and yet more have just been added), it would appear that the scientific part of the project is now operating in 'production' mode, rather than 'beta' mode.
>
> It would be appreciated if the beta wrinkles could be ironed out from the administrative aspects of the app, too.

+ 1

bluestang · Joined: 13 Apr 15 · Posts: 11 · Credit: 3,003,712,606 · RAC: 0
Message 60696 - Posted: 29 Aug 2023, 17:30:47 UTC - in response to Message 60690.

> It’s beta. Accept it or move on.

This excuse/reason has been used for too long now. It's getting old and I'm sick of people letting devs/admins of projects slide by with crap apps instead of fixing them like they know they should.

Ian&Steve C. · Joined: 21 Feb 20 · Posts: 1117 · Credit: 40,876,970,595 · RAC: 0
Message 60697 - Posted: 29 Aug 2023, 20:47:15 UTC - in response to Message 60696.

Then don't support the project? And move on?

Quico has said multiple times that he doesn't know how to fix it (the runtime/% and checkpointing). Complaining more won't get it fixed. At this point, it's your own choice to run this or not. If you don't like it, don't do it.