ATM
Author | Message |
---|---|
Joined: 28 Mar 09 Posts: 490 Credit: 11,731,645,728 RAC: 52,725 |
My current two ATM betas both have MAX_SAMPLES: +70 - but one started at 71, and the other at 141. My observations are the same. When the units download, the estimated finish time reads 606 days. https://www.gpugrid.net/results.php?hostid=534811&offset=0&show_names=0&state=0&appid=45 So far, in this batch, 3 WUs have completed successfully, 1 has errored and 1 is crunching, on a Windows 10 machine. The units all crash on my other computer, which runs Windows 7 and is rather old (13 years). Maybe it's time to retire it from this project, though it still runs well on other projects, like Einstein and FAH. https://www.gpugrid.net/results.php?hostid=544232&offset=0&show_names=0&state=0&appid=45 |
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 960 |
My first ATM beta on Windows 10 failed after some 6 hours :-( https://www.gpugrid.net/result.php?resultid=33393839 Does anyone have an idea what exactly the problem was? |
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 326,008 |
"Does anyone have an idea what exactly the problem was?" It says ValueError: Energy is NaN. That is a science error (an impossible result), rather than a computing error. |
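A guard that produces this kind of message can be sketched in a few lines of Python. This is only an illustration of the idea, not the ATM application's actual code; `check_energy` and `first_bad_step` are hypothetical names.

```python
import math


def check_energy(energy: float) -> float:
    """Return the energy unchanged, or raise if it is not a number.

    Hypothetical helper: the real application may perform this check differently.
    """
    if math.isnan(energy):
        raise ValueError("Energy is NaN")
    return energy


def first_bad_step(energies):
    """Return the index of the first NaN energy, or None if every value is valid."""
    for step, value in enumerate(energies):
        try:
            check_energy(value)
        except ValueError:
            return step
    return None
```

A NaN can come from the science (an unstable system) or from the hardware (a miscalculation on an unstable GPU), which is why the message alone does not settle the cause.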
Joined: 21 Feb 20 Posts: 1114 Credit: 40,838,348,595 RAC: 4,765,598 |
Potentially, it could also be due to instability from overclocking, where applicable. I know the ACEMD3 tasks are susceptible to a “particle coordinate is NaN” type of error from too much overclocking. Of course, this is less likely if things are not overclocked, or only mildly overclocked. I'm just raising the possibility. |
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 960 |
"Potentially, it could also be due to instability from overclocking, where applicable. I know the ACEMD3 tasks are susceptible to a “particle coordinate is NaN” type of error from too much overclocking." Thanks for this thought; it could well be the case. For some time, this old GTX 980 Ti has no longer followed the settings for GPU clock and power target, in the old NVIDIA Inspector as well as in the newer Afterburner. Hence, particularly with ATM tasks, I noticed it overclocking from the default 1152MHz up to 1330MHz. Not all the time, but often. I have now experimented and found that I can control the GPU clock by reducing the fan speed, setting the GPU temperature to a fixed value, and ticking "prioritize temperature". So the clock now oscillates around 1100MHz most of the time. I will see whether the ATM tasks fail again or not. |
Joined: 18 Jul 13 Posts: 79 Credit: 210,528,292 RAC: 163 |
|
Joined: 10 Jun 20 Posts: 7 Credit: 980,066,632 RAC: 0 |
Me too. All ATM tasks! Graphics card: Tesla P4; CPU: Ryzen 5600G; 32GB RAM; Windows 11. It happens under Linux too. |
Joined: 13 Dec 17 Posts: 1416 Credit: 9,119,446,190 RAC: 678,713 |
Something in your Windows configuration has a problem running cmd.exe and calling the run.bat file. Windows barfs on the 0x1 exit error. Same as the other fellow running Windows. No concrete smoking gun flaw shown. |
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0 |
Another possibility is that your system restarted after an update, or that you suspended it. I have decided to let Intel, Microsoft, Dell and Acer update themselves when they want. It is not our fault if the WU crashes. The onus is on the project admins to make their WUs stable enough to fall back to the last checkpoint. Our job is to keep our systems up to date and maintained so that they run these WUs as well as they can. |
Joined: 13 Dec 17 Posts: 1416 Credit: 9,119,446,190 RAC: 678,713 |
The ATM tasks are just like the acemd3 tasks in that they can't be interrupted or restarted without erroring out. Unlike the acemd3 tasks which can be restarted on the same device, the ATM tasks can't be restarted or interrupted at all. They exit immediately if restarted. |
Joined: 17 Feb 09 Posts: 91 Credit: 1,603,303,394 RAC: 0 |
"The ATM tasks are just like the acemd3 tasks in that they can't be interrupted or restarted without erroring out. Unlike the acemd3 tasks which can be restarted on the same device, the ATM tasks can't be restarted or interrupted at all. They exit immediately if restarted." I agree. I have lost quite a few hours on WUs that were going to complete, because I had to perform reboots and lost them. Is anyone addressing this issue yet? |
Joined: 13 Dec 17 Posts: 1416 Credit: 9,119,446,190 RAC: 678,713 |
Haven't heard or seen any comments by any of the devs. The acemd3 app hasn't been fixed in two years. And that is an internal application by Acellera. No reason to expect any change in the newest sub-project apps. Not unless some dev has got a lot of time to dig into this type of bug. And since almost all of the newer apps depend on external libraries, that falls to those external toolsets and devs outside of this project. So probably not going to happen. |
Joined: 10 Jun 20 Posts: 7 Credit: 980,066,632 RAC: 0 |
It is just interesting that acemd3 runs through and ATM does not. Errors appear after a few minutes. The system was neither restarted during the calculation nor was there an update, so the problem lies elsewhere. It is exactly the same on Linux. Windows and the drivers are up to date. |
Joined: 12 Jul 17 Posts: 404 Credit: 17,408,899,587 RAC: 2 |
"It is just interesting that acemd3 runs through and ATM does not. Errors appear after a few minutes." If you're time-slicing with another GPU project, that will cause a fatal "computation error" when BOINC switches between them. |
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0 |
Task failed. ImportError: DLL load failed while importing _openmm: The specified module could not be found. [end of output] note: This error originates from a subprocess, and is likely not a problem with pip. error: metadata-generation-failed Encountered error while generating package metadata. __________________ Another one. ImportError: DLL load failed while importing _openmm: The specified module could not be found. [end of output] note: This error originates from a subprocess, and is likely not a problem with pip. error: metadata-generation-failed Encountered error while generating package metadata. _____________________ Third one. ImportError: DLL load failed while importing _openmm: The specified module could not be found. [end of output] note: This error originates from a subprocess, and is likely not a problem with pip. error: metadata-generation-failed Encountered error while generating package metadata. _________________________________ I have eleven failed tasks (proud of the record setting), all revolving around the same thing. |
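For anyone who wants to narrow down an import failure like this one, a small check along the following lines reports whether a module can be loaded by the Python interpreter running it. It is a generic diagnostic sketch, not part of the ATM application, and `try_import` is a hypothetical helper name. It would need to be run inside the same Python environment the task uses to say anything about that environment.

```python
import importlib


def try_import(module_name):
    """Return (True, "") if the module imports, else (False, error text).

    Hypothetical helper for diagnosing a broken environment.
    """
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        # Covers ModuleNotFoundError as well as "DLL load failed" style errors.
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""


if __name__ == "__main__":
    ok, message = try_import("openmm")
    print("openmm imports cleanly" if ok else message)
```

A "DLL load failed" message usually means the Python package itself was found but a native library it depends on was not, so the error text is worth reading in full.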
Joined: 13 Apr 15 Posts: 11 Credit: 3,003,712,606 RAC: 2,390,740 |
Beta or not... how can a project send out tasks with this length of runtime and not have checkpointing of some sort? The amount of resources wasted because of this has to be mind-boggling. |
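To make the request concrete, checkpointing in its simplest form looks roughly like the Python sketch below: progress is written to disk after each unit of work, and a restarted run resumes from the saved value instead of from zero. This is a generic illustration under stated assumptions, not how the ATM app works or could be fixed; the file layout, the `sample` counter and all function names are hypothetical, and a real molecular dynamics task would also have to save the full simulation state.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically so an interrupted write never corrupts the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomic rename over the old checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"sample": 0}
    with open(path) as handle:
        return json.load(handle)


def run(path, max_samples, stop_after=None):
    """Process samples up to max_samples, resuming from the checkpoint.

    stop_after simulates an interruption (for example a reboot) after that
    many samples in the current run.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["sample"] < max_samples:
        if stop_after is not None and done_this_run >= stop_after:
            break
        state["sample"] += 1  # placeholder for one unit of real work
        done_this_run += 1
        save_checkpoint(path, state)
    return state["sample"]
```

With this pattern, a run interrupted at sample 25 of 70 loses at most the sample in progress when it is restarted.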
Joined: 17 Feb 09 Posts: 91 Credit: 1,603,303,394 RAC: 0 |
I concur with bluestang; some of those likely successful WUs I lost had 20+ hours wasted because of a necessary reboot. Project owners should make some sort of fix a priority. |
Joined: 13 Dec 17 Posts: 1416 Credit: 9,119,446,190 RAC: 678,713 |
Or just acknowledge you aren't willing to accept the project limitations and move on to other GPU projects that fit your usage conditions. I have no issue letting work run to completion because I know that I must let all hosts run uninterrupted. |
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0 |
"Or just acknowledge you aren't willing to accept the project limitations and move on to other GPU projects that fit your usage conditions." ________________ Do you know what the problem is? Quico has not understood what abouh did at the very start. I am pretty sure that, whatever it is, if he brings it to the thread he will find an answer. There are a lot of people on the thread, and you are one of them, who are willing to help to the best of their ability. Experts. I have paused all Microsoft updates for five weeks, since they seem to trigger the rest of the updates, like Intel's. Just for these WUs. |
Joined: 13 Dec 17 Posts: 1416 Credit: 9,119,446,190 RAC: 678,713 |
But abouh's app is different from Quico's. They use different external tools. You can't apply the same fixes that abouh made to Quico's app. The Python tasks use the PyTorch libraries and Quico's app uses the AToM libraries. Plus, the Python app is mostly a CPU app while the ATM app is mostly a GPU app. They work very differently. Expecting that the app structure and design of the Python app is directly applicable to the AToM app is naive and simplistic. |