Message boards : News : ATM
| Author | Message |
|---|---|
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 662
|
There appears to be a flaw or weakness in the application right at the end of the finishing up stage where the app needs to tar the output file. Seems it can lose track of the filename often or looks for the incorrect filename. Might have something to do with the way the host is handling access permissions or the slowness of the filesystem in accessing the output file. I wonder if putting a wait loop into the code right before this process would let the filesystem settle down long enough to get access to the file for tar'ing |
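The wait-loop idea above can be sketched as follows. This is only an illustration, not the ATM application's actual code: the function names `wait_for_stable_file` and `tar_output`, the timeout, and the polling interval are all assumptions. It polls until the output file exists and its size has stopped changing, and only then creates the archive.

```python
import os
import tarfile
import time


def wait_for_stable_file(path, timeout=60.0, poll=0.5):
    """Return True once `path` exists and its size is identical on two
    consecutive polls; return False if `timeout` seconds pass first.
    (Hypothetical helper, not part of the real app.)"""
    deadline = time.monotonic() + timeout
    last_size = -1
    while time.monotonic() < deadline:
        try:
            size = os.path.getsize(path)
        except OSError:
            size = -1  # file missing or not accessible yet
        if size >= 0 and size == last_size:
            return True
        last_size = size
        time.sleep(poll)
    return False


def tar_output(path, archive_path, timeout=60.0, poll=0.5):
    """Wait for the filesystem to settle, then tar the output file."""
    if not wait_for_stable_file(path, timeout, poll):
        raise FileNotFoundError(f"{path} not ready after {timeout} s")
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(path, arcname=os.path.basename(path))
```

Whether this would help depends on the real cause. If the app is looking for the wrong filename, no amount of waiting will fix it, and the timeout would at least turn the failure into a clear error message.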
|
Joined: 3 Jun 09 Posts: 4 Credit: 1,086,774,155 RAC: 0
|
ATMbeta tasks show 100% complete but do not finish, while Task Manager shows 'normal' working CPU loads (5 to 6% in my case; 12 core, 16 logical). I abort the runs and stop accepting new work for a week or so, hoping the next batch finishes and uploads, but it might be an issue with my PC: Windows 10, 64 GB RAM, two GTX 980 Ti cards. Can anyone offer suggestions? Thanks.

Two examples:

Computer: Pecan
Project: GPUGRID
Name: tnks2_m07_m5d_2-QUICO_ATM_Mck_GAFF2_2fs-2-5-RND9083_0
Application: ATMbeta 1.09 (cuda1121)
Workunit name: tnks2_m07_m5d_2-QUICO_ATM_Mck_GAFF2_2fs-2-5-RND9083
State: Running
Received: 8/23/2023 6:41:59 AM
Report deadline: 8/28/2023 6:41:59 AM
Estimated app speed: 580.20 GFLOPs/sec
Estimated task size: 1,000,000,000 GFLOPs
Resources: 0.983 CPUs + 1 NVIDIA GPU (device 1)
CPU time at last checkpoint: 00:00:00
CPU time: 04:25:24
Elapsed time: 05:11:46
Estimated time remaining: 00:00:00
Fraction done: 100.000%
Virtual memory size: 6,867.10 MB
Working set size: 761.31 MB
Directory: slots/1
Process ID: 5364
Debug State: 2 - Scheduler: 2

Computer: Pecan
Project: GPUGRID
Name: shp2_m24_m19_1-QUICO_ATM_Mck_Sage_2fs-2-5-RND5546_0
Application: ATMbeta 1.09 (cuda1121)
Workunit name: shp2_m24_m19_1-QUICO_ATM_Mck_Sage_2fs-2-5-RND5546
State: Running
Received: 8/23/2023 6:42:37 AM
Report deadline: 8/28/2023 6:42:36 AM
Estimated app speed: 580.20 GFLOPs/sec
Estimated task size: 1,000,000,000 GFLOPs
Resources: 0.983 CPUs + 1 NVIDIA GPU (device 0)
CPU time at last checkpoint: 00:00:00
CPU time: 04:40:09
Elapsed time: 05:17:34
Estimated time remaining: 00:00:00
Fraction done: 100.000%
Virtual memory size: 7,414.65 MB
Working set size: 1,184.61 MB
Directory: slots/0
Process ID: 20016
Debug State: 2 - Scheduler: 2 |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 318
|
Yes. That is - sadly - normal for the current tasks in this series. Ultra-long tasks were split into bite-sized chunks - you can tell which chunk you're running from the task name. They're split into 0-5 to 4-5 (towards the end of the task name: there's no 5-5). If you get a 0-5, it'll report progress normally, from 0% to 100%. All the others will jump quickly to 100%, and stay there. Irritating, and it messes up work fetch, but it doesn't ultimately matter. Just let them run: they'll finish eventually, and life will move on. |
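For anyone who wants to check this automatically, here is a minimal sketch that pulls the chunk marker out of a task name. The regular expression is an assumption inferred only from the task names quoted in this thread (a `-N-M-RND…` group near the end); `parse_chunk` and `progress_is_reliable` are hypothetical helper names, not part of BOINC or the ATM app.

```python
import re

# Assumed pattern: "...-<chunk>-<total>-RND<digits>..." near the end of the name.
_CHUNK_RE = re.compile(r"-(\d+)-(\d+)-RND\d+")


def parse_chunk(task_name):
    """Return (chunk_index, total_chunks), or None if the name does not match."""
    match = _CHUNK_RE.search(task_name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def progress_is_reliable(task_name):
    """Per the explanation above, only chunk 0 reports progress normally."""
    chunk = parse_chunk(task_name)
    return chunk is not None and chunk[0] == 0


name = "tnks2_m07_m5d_2-QUICO_ATM_Mck_GAFF2_2fs-2-5-RND9083_0"
print(parse_chunk(name))           # (2, 5)
print(progress_is_reliable(name))  # False
```

Both tasks quoted earlier in the thread are 2-5 chunks, which is consistent with them sitting at 100% while still running.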
|
Joined: 3 Jun 09 Posts: 4 Credit: 1,086,774,155 RAC: 0
|
Hi Richard and thanks for the info. I stop all processing at 4pm daily, and when the ATMbeta tasks are still running, they don't survive the suspension and restart process. They spontaneously abort. So I'll still have to try new batches from time to time to avoid wasting CPU time for other projects. Regards, Art |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 662
|
The ATMbeta tasks cannot be stopped during processing or they will error out. |
|
Joined: 9 Jun 10 Posts: 19 Credit: 2,233,932,323 RAC: 0
|
I also pause computing during peak hours, both for cost and so as not to put more load onto the grid at the wrong time. I have to guess when to stop fetching work from GPUGRID so that a WU finishes before the peak hours begin. Given that the work can be split up, it would be nice if each chunk could be smaller, especially when the workload doesn't support checkpoints or suspend/resume. The average time sits at 6 hours now. Are there good reasons it can't be shorter, such as additional bandwidth or per-chunk overhead? |
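The guesswork described above comes down to simple arithmetic, sketched here. The 6-hour runtime is the average mentioned in this post and the 4pm pause is the one mentioned earlier in the thread. The one-hour safety margin and the function names are assumptions for illustration, and actual runtimes vary a lot by GPU.

```python
from datetime import datetime, timedelta


def latest_safe_start(pause_at, expected_runtime, margin=timedelta(hours=1)):
    """Latest moment a task can start and still be expected to finish
    (with `margin` to spare) before computing is paused at `pause_at`."""
    return pause_at - expected_runtime - margin


def should_fetch(now, pause_at, expected_runtime, margin=timedelta(hours=1)):
    """True if a task fetched and started right now should finish in time."""
    return now <= latest_safe_start(pause_at, expected_runtime, margin)


pause = datetime(2023, 8, 23, 16, 0)   # daily pause at 4pm
runtime = timedelta(hours=6)           # average runtime quoted in the post
print(latest_safe_start(pause, runtime))  # 2023-08-23 09:00:00
```

This only covers a single task that starts immediately. Tasks already queued, and slower cards where ATM runs take 10 hours or more, would need a correspondingly earlier cut-off.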
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
If it hurts when you <do that> then the most obvious solution is to not <do that>. This applies to most things in life. The sad part is that you are extra smart. |
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
I also pause computing during peak hours, both for cost and not to put more load onto the grid at the wrong time. I guess when to stop fetching work from gpugrid so that WU finishes before the peak hours begin. You cannot predict but someone extra smart will come and post irrelevant BS. 28 WU each crunched eight hours then errored out. Multiply 28 by 8 then the power tariff. All wasted. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
This morning, an ATM task on a GTX 980 Ti errored out after more than 10 hours :-((( The reason, as so often, is: ValueError: Energy is NaN. What a waste of valuable energy! What I don't understand is why, by now, the developer does not have enough experience with this type of task to eliminate errors like this. At any rate, if this happens again, I will quit crunching ATM. Electricity rates here tripled last year, and I can no longer afford to waste money. |
|
Joined: 11 May 10 Posts: 68 Credit: 12,293,503,875 RAC: 3,253
|
this morning, a ATM on a GTX980ti errored out after more than 10 hours :-((( Similar here after 11,363.13 seconds on an RTX 2070S for Unit 33589476. I understand that this can happen in a beta project. However, I would expect the developer to iron out the most common errors, such as 'Energy is NaN', the progress indication jumping to 100%, the wrong remaining-runtime indication, and RTX 4xxx errors on Windows, to name a few. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 4,772
|
I also pause computing during peak hours, both for cost and not to put more load onto the grid at the wrong time. I guess when to stop fetching work from gpugrid so that WU finishes before the peak hours begin. I’m sure doing the same thing over and over and expecting a different result is the solution :)
|
|
Joined: 13 Apr 15 Posts: 11 Credit: 3,003,712,606 RAC: 2,331
|
I also pause computing during peak hours, both for cost and not to put more load onto the grid at the wrong time. I guess when to stop fetching work from gpugrid so that WU finishes before the peak hours begin. You mean like releasing batch after batch after batch of WUs with the same issues that people have been complaining about for how long now? :) |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 4,772
|
It’s beta. Accept it or move on.
|
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
It’s beta. Accept it or move on. The question, though, is how much longer it will be beta. Isn't the reason for a beta that the developer of a tool is working on it in order to eliminate problems? Here, not much seems to have been done so far. Always the same errors and problems; none of them has been solved after such a long time :-( |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Here is the next one: https://www.gpugrid.net/result.php?resultid=33593354 It failed after 11.267 seconds, and doesn't even say why it failed (unless I am unable to spot it). So I will stop crunching ATMs, at least on this machine. Waste of expensive electricity :-( |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 662
|
It’s beta. Accept it or move on. Could be forever . . . does not matter as the science still gets done. Either accept failures with a beta app or move on. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 318
|
Judging by the number of tasks which have passed through the system over the past week (and yet more have just been added), it would appear that the scientific part of the project is now operating in 'production' mode, rather than 'beta' mode. It would be appreciated if the beta wrinkles could be ironed out from the administrative aspects of the app, too. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Judging by the number of tasks which have passed through the system over the past week (and yet more have just been added), it would appear that the scientific part of the project is now operating in 'production' mode, rather than 'beta' mode. + 1 |
|
Joined: 13 Apr 15 Posts: 11 Credit: 3,003,712,606 RAC: 2,331
|
It’s beta. Accept it or move on. This excuse/reason has been used for too long now. It's getting old and I'm sick of people letting devs/admins of projects slide by with crap apps instead of fixing them like they know they should. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 4,772
|
Then don't support the project, and move on? Quico has said multiple times that he doesn't know how to fix it (the runtime/% and checkpointing). Complaining more won't get it fixed. At this point, it's your own choice to run this or not. If you don't like it, don't do it.
|