ATM

Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 662
Level
Tyr
Message 60673 - Posted: 22 Aug 2023, 18:04:39 UTC

There appears to be a flaw or weakness in the application right at the end of the finishing-up stage, where the app needs to tar the output file.

It often seems to lose track of the filename or look for the wrong one.

That might have something to do with the way the host handles access permissions, or with the filesystem being slow to give access to the output file.

I wonder if putting a wait loop into the code right before this step would let the filesystem settle down long enough for the app to get access to the file for tar'ing.
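
A wait loop of the kind suggested here could look roughly like the following. This is only a sketch of the idea, not the ATM app's actual code: the names `wait_until_stable` and `tar_when_ready`, the timeout and the polling values are all assumptions made for illustration.

```python
import os
import tarfile
import time


def wait_until_stable(path, timeout=60.0, poll=0.5, settle_checks=3):
    """Return True once `path` exists and its size is unchanged for several polls."""
    deadline = time.monotonic() + timeout
    last_size = -1
    stable = 0
    while time.monotonic() < deadline:
        try:
            size = os.path.getsize(path)
        except OSError:
            # File missing or not yet accessible: reset and keep waiting.
            last_size, stable = -1, 0
        else:
            if size == last_size:
                stable += 1
                if stable >= settle_checks:
                    return True
            else:
                last_size, stable = size, 0
        time.sleep(poll)
    return False


def tar_when_ready(src, archive, **wait_kwargs):
    """Wait for `src` to settle, then pack it into a gzip-compressed tar archive."""
    if not wait_until_stable(src, **wait_kwargs):
        raise TimeoutError(f"{src} did not become available in time")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname=os.path.basename(src))
    return archive
```

Polling the file size rather than sleeping for a fixed time keeps the delay short on fast filesystems while still giving slow ones a chance to catch up.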
ID: 60673 · Rating: 0
Art_Brown

Joined: 3 Jun 09
Posts: 4
Credit: 1,086,774,155
RAC: 0
Level
Met
Message 60674 - Posted: 23 Aug 2023, 19:12:52 UTC

ATMbeta tasks show 100% complete but do not finish, and Task Manager shows 'normal' working CPU loads (5 to 6% in my case; 12 cores, 16 logical). I abort the runs and stop accepting new work for a week or so, hoping the next batch finishes and uploads, but it might be an issue with my PC.
Windows 10, 64 GB RAM, two GTX 980 Ti cards
Can anyone offer suggestions?
Thanks.

Two examples:

Computer: Pecan
Project GPUGRID

Name tnks2_m07_m5d_2-QUICO_ATM_Mck_GAFF2_2fs-2-5-RND9083_0

Application ATMbeta 1.09 (cuda1121)
Workunit name tnks2_m07_m5d_2-QUICO_ATM_Mck_GAFF2_2fs-2-5-RND9083
State Running
Received 8/23/2023 6:41:59 AM
Report deadline 8/28/2023 6:41:59 AM
Estimated app speed 580.20 GFLOPs/sec
Estimated task size 1,000,000,000 GFLOPs
Resources 0.983 CPUs + 1 NVIDIA GPU (device 1)
CPU time at last checkpoint 00:00:00
CPU time 04:25:24
Elapsed time 05:11:46
Estimated time remaining 00:00:00
Fraction done 100.000%
Virtual memory size 6,867.10 MB
Working set size 761.31 MB
Directory slots/1
Process ID 5364

Debug State: 2 - Scheduler: 2

Computer: Pecan
Project GPUGRID

Name shp2_m24_m19_1-QUICO_ATM_Mck_Sage_2fs-2-5-RND5546_0

Application ATMbeta 1.09 (cuda1121)
Workunit name shp2_m24_m19_1-QUICO_ATM_Mck_Sage_2fs-2-5-RND5546
State Running
Received 8/23/2023 6:42:37 AM
Report deadline 8/28/2023 6:42:36 AM
Estimated app speed 580.20 GFLOPs/sec
Estimated task size 1,000,000,000 GFLOPs
Resources 0.983 CPUs + 1 NVIDIA GPU (device 0)
CPU time at last checkpoint 00:00:00
CPU time 04:40:09
Elapsed time 05:17:34
Estimated time remaining 00:00:00
Fraction done 100.000%
Virtual memory size 7,414.65 MB
Working set size 1,184.61 MB
Directory slots/0
Process ID 20016

Debug State: 2 - Scheduler: 2
ID: 60674 · Rating: 0
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 318
Level
Trp
Message 60675 - Posted: 23 Aug 2023, 21:07:29 UTC - in response to Message 60674.  

Yes. That is - sadly - normal for the current tasks in this series. Ultra-long tasks were split into bite-sized chunks - you can tell which chunk you're running from the task name. They're split into 0-5 to 4-5 (towards the end of the task name: there's no 5-5).

If you get a 0-5, it'll report progress normally, from 0% to 100%. All the others will jump quickly to 100%, and stay there.

Irritating, and it messes up work fetch, but it doesn't ultimately matter. Just let them run: they'll finish eventually, and life will move on.
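
For anyone who wants to check chunk numbers in bulk, the chunk marker can be pulled out of a task name with a regular expression. A minimal sketch: the pattern is inferred only from the two task names quoted earlier in this thread, so treat it as an assumption rather than a documented GPUGRID naming scheme, and the function names are made up for illustration.

```python
import re

# Pattern inferred from task names such as
# "tnks2_m07_m5d_2-QUICO_ATM_Mck_GAFF2_2fs-2-5-RND9083_0" (an assumption).
_CHUNK_RE = re.compile(r"-(\d+)-(\d+)-RND\d+")


def chunk_of(task_name):
    """Return (chunk_index, chunk_count), or None if the name does not match."""
    m = _CHUNK_RE.search(task_name)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


def reports_progress_normally(task_name):
    """Per the description above, only the first chunk (0-N) shows real progress."""
    chunk = chunk_of(task_name)
    return chunk is not None and chunk[0] == 0
```

Both task names quoted earlier parse as chunk 2 of 5, which fits with them sitting at 100% while still running.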
ID: 60675 · Rating: 0
Art_Brown

Joined: 3 Jun 09
Posts: 4
Credit: 1,086,774,155
RAC: 0
Level
Met
Message 60676 - Posted: 24 Aug 2023, 15:39:38 UTC - in response to Message 60675.  

Hi Richard and thanks for the info.
I stop all processing at 4pm daily, and when the ATMbeta tasks are still running, they don't survive the suspension and restart process. They spontaneously abort. So I'll still have to try new batches from time to time to avoid wasting CPU time for other projects.
Regards, Art
ID: 60676 · Rating: 0
Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 662
Level
Tyr
Message 60677 - Posted: 24 Aug 2023, 18:40:52 UTC - in response to Message 60676.  

The ATMbeta tasks cannot be stopped during processing or they will error out.
ID: 60677 · Rating: 0
wujj123456

Joined: 9 Jun 10
Posts: 19
Credit: 2,233,932,323
RAC: 0
Level
Phe
Message 60683 - Posted: 26 Aug 2023, 21:45:17 UTC - in response to Message 60677.  

I also pause computing during peak hours, both for cost and so as not to put more load on the grid at the wrong time. I have to guess when to stop fetching work from GPUGRID so that the WU finishes before peak hours begin.

Given that the WU can be split up, it would be nice if each chunk could be smaller, especially when the workload doesn't support checkpoints or suspend/resume. The average time sits at 6 hours now. Are there good reasons it can't be shorter? Like additional bandwidth, or overhead for each chunk?
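
Picking a cutoff for work fetch comes down to subtracting the expected runtime plus a safety margin from the time computing is paused. A minimal sketch, assuming the roughly 6-hour average mentioned here; the function names and the one-hour margin are made up for illustration, and real runtimes vary by GPU.

```python
from datetime import datetime, timedelta


def latest_safe_start(pause_at, expected_runtime_h=6.0, margin_h=1.0):
    """Latest time a task can start and still be expected to finish before `pause_at`."""
    return pause_at - timedelta(hours=expected_runtime_h + margin_h)


def safe_to_fetch(now, pause_at, expected_runtime_h=6.0, margin_h=1.0):
    """True if a task fetched at `now` should finish before computing is paused."""
    return now <= latest_safe_start(pause_at, expected_runtime_h, margin_h)
```

With a 16:00 pause, a 6-hour task and a 1-hour margin, the cutoff is 09:00.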
ID: 60683 · Rating: 0
KAMasud

Joined: 27 Jul 11
Posts: 138
Credit: 539,953,398
RAC: 0
Level
Lys
Message 60684 - Posted: 27 Aug 2023, 2:04:39 UTC - in response to Message 60638.  

If it hurts when you <do that> then the most obvious solution is to not <do that>. This applies to most things in life.

It’s well known that these tasks do not like to be interrupted. If your power grid is that unstable then it’s probably best for you to crunch something else, or invest in a battery backup to keep the computer running during power outages.

These are still classified as Beta after all, and that comes with the implication that things will not always work, and you need to accept whatever compromises come with it. If you don’t, then your other solution could be to just disable Beta processing in your profile and wait for ACEMD3 work.



The sad part is that you are extra smart.
ID: 60684 · Rating: 0
KAMasud

Joined: 27 Jul 11
Posts: 138
Credit: 539,953,398
RAC: 0
Level
Lys
Message 60685 - Posted: 27 Aug 2023, 2:10:56 UTC - in response to Message 60683.  

I also pause computing during peak hours, both for cost and so as not to put more load on the grid at the wrong time. I have to guess when to stop fetching work from GPUGRID so that the WU finishes before peak hours begin.

Given that the WU can be split up, it would be nice if each chunk could be smaller, especially when the workload doesn't support checkpoints or suspend/resume. The average time sits at 6 hours now. Are there good reasons it can't be shorter? Like additional bandwidth, or overhead for each chunk?

You cannot predict, but someone extra smart will come and post irrelevant BS. 28 WUs each crunched for eight hours, then errored out. Multiply 28 by 8, then by the power tariff. All wasted.
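
The arithmetic in that last line can be made explicit. The only figures from this post are 28 tasks and 8 hours each; the average power draw and the tariff below are placeholder values, not numbers anyone in the thread reported.

```python
def wasted_cost(tasks, hours_per_task, avg_power_kw, tariff_per_kwh):
    """Cost of failed work: task-hours x average draw (kW) x price per kWh."""
    return tasks * hours_per_task * avg_power_kw * tariff_per_kwh


task_hours = 28 * 8  # 224 task-hours, from the post above

# 0.25 kW and 0.30 per kWh are hypothetical; substitute your own values.
example_cost = wasted_cost(28, 8, avg_power_kw=0.25, tariff_per_kwh=0.30)
```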
ID: 60685 · Rating: 0
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Level
Trp
Message 60686 - Posted: 27 Aug 2023, 10:45:52 UTC

This morning, an ATM task on a GTX 980 Ti errored out after more than 10 hours :-(((

The reason, as often enough, is:
ValueError: Energy is NaN

What a waste of valuable energy! What I don't understand is why, by now, the developer does not have enough experience with this type of task to eliminate errors like this.

At any rate, if this happens again, I will quit crunching ATM. Electricity rates here tripled last year, and I can no longer afford to waste money.
ID: 60686 · Rating: 0
roundup

Joined: 11 May 10
Posts: 68
Credit: 12,293,503,875
RAC: 3,253
Level
Trp
Message 60687 - Posted: 27 Aug 2023, 11:42:37 UTC - in response to Message 60686.  

This morning, an ATM task on a GTX 980 Ti errored out after more than 10 hours :-(((

The reason, as often enough, is:
ValueError: Energy is NaN


Similar here after 11,363.13 seconds on an RTX 2070S for Unit 33589476.
I understand that this can happen in a Beta project. However, I would expect the developer to iron out the most common errors, such as 'Energy is NaN', progress indication jumping to 100%, wrong remaining runtime indication, and RTX4xxx errors on Windows, to name a few.
ID: 60687 · Rating: 0
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 4,772
Level
Trp
Message 60688 - Posted: 28 Aug 2023, 2:20:07 UTC - in response to Message 60685.  

I also pause computing during peak hours, both for cost and so as not to put more load on the grid at the wrong time. I have to guess when to stop fetching work from GPUGRID so that the WU finishes before peak hours begin.

Given that the WU can be split up, it would be nice if each chunk could be smaller, especially when the workload doesn't support checkpoints or suspend/resume. The average time sits at 6 hours now. Are there good reasons it can't be shorter? Like additional bandwidth, or overhead for each chunk?

You cannot predict, but someone extra smart will come and post irrelevant BS. 28 WUs each crunched for eight hours, then errored out. Multiply 28 by 8, then by the power tariff. All wasted.


I’m sure doing the same thing over and over and expecting a different result is the solution :)
ID: 60688 · Rating: 0
bluestang

Joined: 13 Apr 15
Posts: 11
Credit: 3,003,712,606
RAC: 2,331
Level
Arg
Message 60689 - Posted: 28 Aug 2023, 3:14:03 UTC - in response to Message 60688.  

I also pause computing during peak hours, both for cost and so as not to put more load on the grid at the wrong time. I have to guess when to stop fetching work from GPUGRID so that the WU finishes before peak hours begin.

Given that the WU can be split up, it would be nice if each chunk could be smaller, especially when the workload doesn't support checkpoints or suspend/resume. The average time sits at 6 hours now. Are there good reasons it can't be shorter? Like additional bandwidth, or overhead for each chunk?

You cannot predict, but someone extra smart will come and post irrelevant BS. 28 WUs each crunched for eight hours, then errored out. Multiply 28 by 8, then by the power tariff. All wasted.


I’m sure doing the same thing over and over and expecting a different result is the solution :)


You mean like releasing batch after batch after batch of WUs with the same issues that people have been complaining about for how long now? :)
ID: 60689 · Rating: 0
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 4,772
Level
Trp
Message 60690 - Posted: 28 Aug 2023, 5:13:43 UTC - in response to Message 60689.  

It’s beta. Accept it or move on.
ID: 60690 · Rating: 0
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Level
Trp
Message 60691 - Posted: 28 Aug 2023, 9:06:32 UTC - in response to Message 60690.  

It’s beta. Accept it or move on.

The question, though, is how much longer it will be beta.
Isn't the reason for beta that the developer of a tool is working on it in order to eliminate problems?
Here, not much seems to have been done so far. Always the same errors and problems; none of them has been solved after such a long time :-(
ID: 60691 · Rating: 0
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Level
Trp
Message 60692 - Posted: 28 Aug 2023, 11:06:30 UTC

Here is the next one:

https://www.gpugrid.net/result.php?resultid=33593354

It failed after 11.267 seconds, and the output doesn't even say why (unless I'm just not spotting it).

So I will stop crunching ATMs at least on this machine. Waste of expensive electricity :-(
ID: 60692 · Rating: 0
Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 662
Level
Tyr
Message 60693 - Posted: 28 Aug 2023, 21:38:36 UTC - in response to Message 60691.  

It’s beta. Accept it or move on.

The question, though, is how much longer it will be beta.
Isn't the reason for beta that the developer of a tool is working on it in order to eliminate problems?
Here, not much seems to have been done so far. Always the same errors and problems; none of them has been solved after such a long time :-(

Could be forever . . . does not matter as the science still gets done.

Either accept failures with a beta app or move on.
ID: 60693 · Rating: 0
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 318
Level
Trp
Message 60694 - Posted: 29 Aug 2023, 8:50:42 UTC - in response to Message 60693.  

Judging by the number of tasks which have passed through the system over the past week (and yet more have just been added), it would appear that the scientific part of the project is now operating in 'production' mode, rather than 'beta' mode.

It would be appreciated if the beta wrinkles could be ironed out from the administrative aspects of the app, too.
ID: 60694 · Rating: 0
Erich56

Joined: 1 Jan 15
Posts: 1166
Credit: 12,260,898,501
RAC: 1
Level
Trp
Message 60695 - Posted: 29 Aug 2023, 16:05:30 UTC - in response to Message 60694.  

Judging by the number of tasks which have passed through the system over the past week (and yet more have just been added), it would appear that the scientific part of the project is now operating in 'production' mode, rather than 'beta' mode.

It would be appreciated if the beta wrinkles could be ironed out from the administrative aspects of the app, too.

+ 1
ID: 60695 · Rating: 0
bluestang

Joined: 13 Apr 15
Posts: 11
Credit: 3,003,712,606
RAC: 2,331
Level
Arg
Message 60696 - Posted: 29 Aug 2023, 17:30:47 UTC - in response to Message 60690.  

It’s beta. Accept it or move on.


This excuse/reason has been used for too long now. It's getting old and I'm sick of people letting devs/admins of projects slide by with crap apps instead of fixing them like they know they should.
ID: 60696 · Rating: 0
Ian&Steve C.

Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 4,772
Level
Trp
Message 60697 - Posted: 29 Aug 2023, 20:47:15 UTC - in response to Message 60696.  

Then don't support the project? And move on?

Quico has said multiple times that he doesn't know how to fix it (the runtime/% and checkpointing). Complaining more won't get it fixed. At this point, it's your own choice to run this or not. If you don't like it, don't do it.
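
Purely as an illustration of the runtime/% problem: a chunked task can report progress relative to its own chunk if it knows the step at which the chunk starts and ends. The sketch below assumes a hypothetical step counter that keeps counting across chunks. Nothing in this thread confirms that the ATM app counts steps this way or that this is why later chunks jump to 100%; all names are invented.

```python
def chunk_progress(current_step, chunk_start, chunk_end):
    """Fraction done within one chunk, clamped to [0.0, 1.0]."""
    if chunk_end <= chunk_start:
        raise ValueError("chunk_end must be greater than chunk_start")
    fraction = (current_step - chunk_start) / (chunk_end - chunk_start)
    return min(1.0, max(0.0, fraction))


def naive_progress(current_step, steps_per_chunk):
    """Progress measured against one chunk's length, ignoring the chunk offset."""
    return min(1.0, current_step / steps_per_chunk)
```

Under these assumptions, a chunk covering steps 200 to 300 that is at step 250 is half done by `chunk_progress`, while `naive_progress` already saturates at 100%.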
ID: 60697 · Rating: 0

©2025 Universitat Pompeu Fabra