ATM

Bedrich Hajek
Message 60244 - Posted: 31 Mar 2023, 8:29:47 UTC - in response to Message 60243.  
Last modified: 31 Mar 2023, 8:34:51 UTC

My current two ATM betas both have MAX_SAMPLES: +70 - but one started at 71, and the other at 141.

Both are displaying 100% progress. I watched one jump to 100% after about enough time to load the program and complete 1 sample: the other I would expect to finish within half an hour (it's on sample 205).

Edit - yes, it did. I see you've put step information in the task names: these were

PTP1B_20669_2qbr_23466_2-QUICO_ATM_OFF_STEPS-1-5-RND8189_0
PTP1B_23467_23475_4-QUICO_ATM_OFF_STEPS-2-5-RND5806_0
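
As a rough illustration of how the step information in those task names might line up with the starting sample numbers, here is a minimal Python sketch. The regular expression, the function names, and the idea that step k starts at sample 70 * k + 1 are assumptions, not anything published by the project. They are inferred only from the two tasks above, and assume the first task listed is the one that started at sample 71.

```python
import re

# Hypothetical pattern: "...STEPS-<step>-<total>-RND<seed>_<attempt>",
# inferred only from the two task names quoted above.
STEP_PATTERN = re.compile(r"STEPS-(\d+)-(\d+)-RND(\d+)_(\d+)$")

def parse_steps(task_name):
    """Return (step, total_steps) from a task name, or None if no match."""
    match = STEP_PATTERN.search(task_name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def first_sample(step, samples_per_step=70):
    """Assumed first sample number for a step (a guess: step * 70 + 1)."""
    return step * samples_per_step + 1

for name in (
    "PTP1B_20669_2qbr_23466_2-QUICO_ATM_OFF_STEPS-1-5-RND8189_0",
    "PTP1B_23467_23475_4-QUICO_ATM_OFF_STEPS-2-5-RND5806_0",
):
    step, total = parse_steps(name)
    print(name, "-> step", step, "of", total,
          "starting at sample", first_sample(step))
```

If the progress figure were computed from the absolute sample number against a limit of 70, a task starting at sample 71 would show 100% almost immediately. That would fit the behaviour described, but it is speculation.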


My observations are the same. When the units download, the estimated finish time reads 606 days.


https://www.gpugrid.net/results.php?hostid=534811&offset=0&show_names=0&state=0&appid=45


So far in this batch, on a Windows 10 machine, 3 WUs have completed successfully, 1 has errored and 1 is still crunching.


The units all crash on my other computer, which runs Windows 7 and is rather old (13 years). Maybe it's time to retire it from this project, though it still runs well on other projects, like Einstein and FAH.



https://www.gpugrid.net/results.php?hostid=544232&offset=0&show_names=0&state=0&appid=45

Erich56
Message 60245 - Posted: 31 Mar 2023, 12:52:00 UTC

My first ATM beta on Windows10 failed after some 6 hours :-(
https://www.gpugrid.net/result.php?resultid=33393839

Does anyone have an idea what exactly the problem was?

Richard Haselgrove
Message 60246 - Posted: 31 Mar 2023, 13:01:59 UTC - in response to Message 60245.  

Does anyone have an idea what exactly the problem was?

It says

ValueError: Energy is NaN.

A science error (impossible result), rather than a computing error.
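
To illustrate the distinction: a check of this kind usually sits inside the simulation loop. The hardware and software keep running normally, but the computed value is not a valid number, so the program stops on purpose. The sketch below is generic Python, not the actual ATM source, and `check_energy` is a hypothetical name.

```python
import math

def check_energy(energy):
    """Raise if the computed potential energy is not a number."""
    if math.isnan(energy):
        # Deliberate abort: the result is physically meaningless.
        raise ValueError("Energy is NaN.")
    return energy

check_energy(-1234.5)            # a normal value passes through
try:
    check_energy(float("nan"))   # a NaN triggers the error
except ValueError as err:
    print("Task would stop with:", err)
```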

Ian&Steve C.
Message 60247 - Posted: 31 Mar 2023, 13:27:18 UTC - in response to Message 60246.  

Potentially, it could also be due to instability in overclocks, where applicable. I know the ACEMD3 tasks are susceptible to a “particle coordinate is NaN” type error from too much overclocking.

Of course less likely if things are not overclocked, or only mild overclocking. But just expressing the possibility.

Erich56
Message 60248 - Posted: 31 Mar 2023, 18:06:07 UTC - in response to Message 60247.  

Potentially, it could also be due to instability in overclocks, where applicable. I know the ACEMD3 tasks are susceptible to a “particle coordinate is NaN” type error from too much overclocking.

Of course less likely if things are not overclocked, or only mild overclocking. But just expressing the possibility.

Thanks for this thought; it could well be the case. For some time, this old GTX 980 Ti has no longer followed the GPU clock and power target settings, either in the old NVIDIA Inspector or in the newer Afterburner.
Hence, particularly with ATM tasks, I noticed it overclocking from the default 1152 MHz up to 1330 MHz. Not all the time, but often.
I have now experimented and found that I can control the GPU clock by reducing the fan speed, setting the GPU temperature to a fixed value, and ticking "prioritize temperature". The clock now oscillates around 1,100 MHz most of the time.
I will see whether the ATM tasks fail again or not.

kotenok2000
Message 60249 - Posted: 1 Apr 2023, 6:29:10 UTC
Last modified: 1 Apr 2023, 6:30:22 UTC

My ATM beta tasks crash.
http://www.gpugrid.net/result.php?resultid=33398437
Do you know why?

ZUSE
Message 60250 - Posted: 1 Apr 2023, 7:00:18 UTC

Me too.
All ATM tasks!

Graphics card: Tesla P4
CPU: Ryzen 5600G
RAM: 32 GB
OS: Windows 11
The same happens under Linux too.

Keith Myers
Message 60251 - Posted: 1 Apr 2023, 7:16:21 UTC - in response to Message 60249.  

Something in your Windows configuration has a problem running cmd.exe and calling the run.bat file. Windows barfs on the 0x1 exit error.

Same as the other fellow running Windows.

No concrete smoking gun flaw shown.

KAMasud
Message 60254 - Posted: 1 Apr 2023, 13:04:13 UTC

Another possibility is that your system restarted after an update, or that you suspended it.
I have decided to let Intel, Microsoft, Dell and Acer update themselves when they want. It is not our fault if the WU crashes. The onus is on the project's admins to make their WUs stable enough to fall back to the last checkpoint.
Our job is to keep our systems up to date and maintained to run these WUs to the best of our abilities.

Keith Myers
Message 60256 - Posted: 1 Apr 2023, 15:28:48 UTC

The ATM tasks are just like the acemd3 tasks in that they can't be interrupted or restarted without erroring out. Unlike the acemd3 tasks which can be restarted on the same device, the ATM tasks can't be restarted or interrupted at all. They exit immediately if restarted.

STARBASEn
Message 60257 - Posted: 2 Apr 2023, 2:20:00 UTC - in response to Message 60256.  

The ATM tasks are just like the acemd3 tasks in that they can't be interrupted or restarted without erroring out. Unlike the acemd3 tasks which can be restarted on the same device, the ATM tasks can't be restarted or interrupted at all. They exit immediately if restarted.


I agree. I have lost quite a few hours on WUs that were going to complete, because I had to perform reboots and lost them. Is anyone addressing this issue yet?

Keith Myers
Message 60258 - Posted: 2 Apr 2023, 2:39:38 UTC - in response to Message 60257.  

Haven't heard or seen any comments by any of the devs. The acemd3 app hasn't been fixed in two years. And that is an internal application by Acellera.

No reason to expect any change in newest sub-project apps.

Not unless some dev has got a lot of time to dig into this type of bug.

And since almost all of the newer apps depend on external libraries, that falls to those external toolsets and to devs outside of this project.

So probably not going to happen.

ZUSE
Message 60259 - Posted: 2 Apr 2023, 8:26:04 UTC

It is just interesting that acemd3 runs through and ATM does not. Errors appear after a few minutes.

The system was neither restarted during the calculation, nor was there an update.

So the problem lies elsewhere.

Exactly the same on Linux.

Windows and drivers are up to date

Aurum
Message 60260 - Posted: 2 Apr 2023, 8:31:36 UTC - in response to Message 60259.  
Last modified: 2 Apr 2023, 8:31:56 UTC

it is just interesting that acemd3 runs through and ATM does not. errors appear after a few minutes
the system was neither restarted during the calculation nor was there an update
so the problem lies elsewhere
Exactly the same on Linux.
Windows and drivers are up to date

If you're time-slicing with another GPU project, that will cause a fatal "computation error" when BOINC switches between them.

KAMasud
Message 60261 - Posted: 2 Apr 2023, 11:03:21 UTC

Task failed.
ImportError: DLL load failed while importing _openmm: The specified module could not be found.
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

Encountered error while generating package metadata.
__________________

Another one.

ImportError: DLL load failed while importing _openmm: The specified module could not be found.
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

Encountered error while generating package metadata.
_____________________

Third one.

ImportError: DLL load failed while importing _openmm: The specified module could not be found.
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

Encountered error while generating package metadata.
_________________________________

I have eleven failed tasks (proud of the record), all revolving around the same thing.
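
One way to narrow down an error like this is to run the import on its own, inside the same Python environment the task uses, and look at the exception without the surrounding pip output. The sketch below assumes the top-level package is named `openmm` (with `_openmm` being its compiled extension); `try_import` is a hypothetical helper, not part of any project tooling.

```python
import importlib

def try_import(module_name):
    """Try to import a module and return (ok, message)."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as err:
        # A missing module and "DLL load failed" both raise ImportError.
        return False, f"{type(err).__name__}: {err}"
    version = getattr(module, "__version__", "unknown")
    return True, f"{module_name} imported, version {version}"

ok, message = try_import("openmm")
print(message)
```

On Windows, a "DLL load failed" message usually means the extension module itself was found but a library it depends on was not, so this check only confirms the symptom; it does not say which DLL is missing.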

bluestang
Message 60264 - Posted: 2 Apr 2023, 17:55:46 UTC

Beta or not... how can a project send out tasks with this length of runtime and not have checkpointing of some sort?

The amount of resources that are wasted because of this has to be mind boggling.
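
For readers unfamiliar with the term, checkpointing means periodically saving enough state to resume after an interruption instead of starting over. A minimal generic sketch in Python follows. It is not how the ATM app or its libraries work internally, and the state layout (a single `sample` counter) and function names are made up for illustration.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write state to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)  # atomic on the same filesystem

def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return default

def run(path, max_samples, checkpoint_every=10):
    """Resume from the last saved sample and run up to max_samples."""
    state = load_checkpoint(path, {"sample": 0})
    while state["sample"] < max_samples:
        state["sample"] += 1          # stand-in for one unit of real work
        if state["sample"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["sample"]
```

Writing to a temporary file and renaming it means a reboot in the middle of a save leaves the previous checkpoint intact. A real GPU simulation would also have to save positions, velocities and random-number state, which is where most of the difficulty lies.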

STARBASEn
Message 60266 - Posted: 2 Apr 2023, 22:22:50 UTC - in response to Message 60264.  

I concur with bluestang. Some of those likely successful WUs I lost had 20+ hours of work wasted because of a necessary reboot. Project owners should make some sort of fix a priority.

Keith Myers
Message 60267 - Posted: 2 Apr 2023, 23:46:57 UTC

Or just acknowledge you aren't willing to accept the project limitations and move on to other GPU projects that fit your usage conditions.

I have no issue letting work run to completion because I know that I must let all hosts run uninterrupted.

KAMasud
Message 60268 - Posted: 3 Apr 2023, 0:01:08 UTC - in response to Message 60267.  

Or just acknowledge you aren't willing to accept the project limitations and move on to other GPU projects that fit your usage conditions.

I have no issue letting work run to completion because I know that I must let all hosts run uninterrupted.

________________

Do you know what the problem is? Quico has not understood what abouh did at the very start. I am pretty sure that, whatever it is, if he brings it to the thread he will find an answer. There are a lot of people on the thread, you among them, who are willing to help to the best of their ability. Experts.
I have paused all Microsoft updates for five weeks, since they seem to trigger the rest of the updates, like Intel's. Just for these WUs.

Keith Myers
Message 60269 - Posted: 3 Apr 2023, 3:08:20 UTC - in response to Message 60268.  

But abouh's app is different from Quico's. They use different external tools. You can't take the fixes abouh made and apply them to Quico's app.

The Python tasks use the PyTorch libraries and Quico's app uses the AToM libraries.

Plus, the Python app is mostly a CPU app while the ATM app is mostly a GPU app.

They work very differently. Expecting that the app structure and design of the Python app is directly applicable to the AToM app is naive and simplistic.

©2025 Universitat Pompeu Fabra