ATM

Author	Message
Bedrich Hajek Joined: 28 Mar 09 Posts: 490 Credit: 11,850,145,728 RAC: 1,066
Message 60244 - Posted: 31 Mar 2023, 8:29:47 UTC - in response to Message 60243. Last modified: 31 Mar 2023, 8:34:51 UTC
"My current two ATM betas both have MAX_SAMPLES: +70 - but one started at 71, and the other at 141. Both are displaying 100% progress. I watched one jump to 100% after about enough time to load the program and complete 1 sample: the other I would expect to finish within half an hour (it's on sample 205). Edit - yes, it did. I see you've put step information in the task names: these were PTP1B_20669_2qbr_23466_2-QUICO_ATM_OFF_STEPS-1-5-RND8189_0 PTP1B_23467_23475_4-QUICO_ATM_OFF_STEPS-2-5-RND5806_0"
My observations are the same. When the units download, the estimated finish time reads 606 days.
https://www.gpugrid.net/results.php?hostid=534811&offset=0&show_names=0&state=0&appid=45
So far, in this batch, 3 WUs completed successfully, 1 errored and 1 is crunching, on a Windows 10 machine. The units all crash on my other computer, which runs Windows 7 and is rather old, 13 years. Maybe it's time to retire it from this project, though it still runs well on other projects, like Einstein and FAH.
https://www.gpugrid.net/results.php?hostid=544232&offset=0&show_names=0&state=0&appid=45
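The sample numbers quoted above (MAX_SAMPLES of 70, tasks starting at samples 71 and 141) suggest that each task covers its own block of 70 samples. As a rough sketch of why a progress bar might jump straight to 100%, and not a statement about how the ATM app actually computes progress, compare a fraction that accounts for the starting sample with a hypothetical one that ignores it. Both function names are made up for illustration.

```python
def step_progress(current_sample, start_sample, samples_per_task):
    """Fraction of this task's own samples that are done, clamped to [0, 1]."""
    done = current_sample - start_sample
    return max(0.0, min(1.0, done / samples_per_task))


def naive_progress(current_sample, samples_per_task):
    """Hypothetical faulty variant: ignores the sample the task started at."""
    return min(1.0, current_sample / samples_per_task)


# Task that started at sample 141 and is now on sample 205, 70 samples per task.
print(step_progress(205, 141, 70))  # about 0.91
print(naive_progress(205, 70))      # 1.0, i.e. "100%" long before the task ends
```

Under this assumption, any task whose first sample is above 70 would report 100% as soon as its first sample is counted, which matches the behaviour described in the post.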

Erich56 Joined: 1 Jan 15 Posts: 1171 Credit: 12,662,148,501 RAC: 3,588
Message 60245 - Posted: 31 Mar 2023, 12:52:00 UTC
My first ATM beta on Windows 10 failed after some 6 hours :-(
https://www.gpugrid.net/result.php?resultid=33393839
Does anyone have an idea what exactly the problem was?

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0
Message 60246 - Posted: 31 Mar 2023, 13:01:59 UTC - in response to Message 60245.
"anyone an idea what exactly the problem was?"
It says ValueError: Energy is NaN. A science error (impossible result), rather than a computing error.
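For readers unfamiliar with this kind of failure, here is a minimal sketch of how a simulation loop could end up raising such an error. It is an illustration only: the function name `check_energy` is hypothetical and this is not the ATM app's actual code. The idea is that the arithmetic itself does not crash; the program notices that the computed energy is no longer a number and stops deliberately.

```python
import math


def check_energy(energy):
    """Return the energy unchanged, or stop the run if it is not a number."""
    # NaN typically appears when the simulated system has become unphysical
    # (or, as suggested later in the thread, when the hardware miscomputes).
    if math.isnan(energy):
        raise ValueError("Energy is NaN")
    return energy
```

A caller would pass each newly computed energy through this check before recording the sample, so one bad value ends the task instead of silently contaminating the results.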

Ian&Steve C. Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 0
Message 60247 - Posted: 31 Mar 2023, 13:27:18 UTC - in response to Message 60246.
Potentially, it could also be due to instability from overclocking, where applicable. I know the ACEMD3 tasks are susceptible to a “particle coordinate is NaN” type error from too much overclocking. Of course it is less likely if things are not overclocked, or only mildly overclocked. But just expressing the possibility.

Erich56 Joined: 1 Jan 15 Posts: 1171 Credit: 12,662,148,501 RAC: 3,588
Message 60248 - Posted: 31 Mar 2023, 18:06:07 UTC - in response to Message 60247.
"Potentially, it could also be due to instability in overclocks, where applicable. I know the ACEMD3 tasks are susceptible to a “particle coordinate is NaN” type error from too much overclocks. Of course less likely if things are not overclocked, or only mild overclocking. But just expressing the possibility."
Thanks for this thought; it could well be the case. For some time, this old GTX 980 Ti has no longer followed the settings for GPU clock and power target, in the old NVIDIA Inspector as well as in the newer Afterburner. Hence, particularly with ATM tasks, I noticed an overclock from the default 1152 MHz up to 1330 MHz. Not all the time, but many times.
I have now experimented and found out that I can control the GPU clock by reducing the fan speed, setting the GPU temperature at a fixed value and ticking "prioritize temperature". So the clock now oscillates around 1,100 MHz most of the time. I will see whether the ATM tasks fail again or not.

kotenok2000 Joined: 18 Jul 13 Posts: 79 Credit: 241,278,292 RAC: 194
Message 60249 - Posted: 1 Apr 2023, 6:29:10 UTC Last modified: 1 Apr 2023, 6:30:22 UTC
My ATM beta tasks crash.
http://www.gpugrid.net/result.php?resultid=33398437
Do you know why?

ZUSE Joined: 10 Jun 20 Posts: 7 Credit: 980,066,632 RAC: 0
Message 60250 - Posted: 1 Apr 2023, 7:00:18 UTC
Me too. All ATM tasks!
Graphics card: Tesla P4
CPU: Ryzen 5600G
RAM: 32 GB
OS: Windows 11
The same happens under Linux too.

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 0
Message 60251 - Posted: 1 Apr 2023, 7:16:21 UTC - in response to Message 60249.
Something in your Windows configuration has a problem running cmd.exe and calling the run.bat file. Windows barfs on the 0x1 exit error. Same as the other fellow running Windows. No concrete smoking gun flaw shown.

KAMasud Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
Message 60254 - Posted: 1 Apr 2023, 13:04:13 UTC
Another possibility is that your system restarted after an update, or that you suspended the task. I have decided to let Intel, Microsoft, Dell and Acer update themselves when they want. It is not our fault if the WU crashes. The onus is on the project's admins to make their WUs stable enough to fall back to the last checkpoint. Our job is to keep our systems up to date and maintained so they run these WUs to the best of our abilities.

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 0
Message 60256 - Posted: 1 Apr 2023, 15:28:48 UTC
The ATM tasks are just like the acemd3 tasks in that they can't be interrupted or restarted without erroring out. Unlike the acemd3 tasks which can be restarted on the same device, the ATM tasks can't be restarted or interrupted at all. They exit immediately if restarted.
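For context, the checkpoint-and-resume behaviour that volunteers in this thread are asking for can be sketched in a few lines. This is a generic illustration under stated assumptions, not the ATM or acemd3 implementation: the function names, the JSON file and the `do_sample` callback are all hypothetical, and a real molecular dynamics checkpoint would also have to store the full simulation state (positions, velocities, random seeds), which is the hard part.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # a crash mid-write never corrupts the old file


def load_checkpoint(path, default):
    """Return the saved state, or the default if no checkpoint exists yet."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return default


def run_samples(path, max_samples, do_sample):
    """Run samples 1..max_samples, resuming after the last completed one."""
    state = load_checkpoint(path, {"next_sample": 1})
    for sample in range(state["next_sample"], max_samples + 1):
        do_sample(sample)  # hypothetical callback doing the real work
        save_checkpoint(path, {"next_sample": sample + 1})
```

With this pattern, a task interrupted during sample 3 would start again at sample 3 on the next launch instead of exiting or starting from scratch.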

STARBASEn Joined: 17 Feb 09 Posts: 91 Credit: 1,603,303,394 RAC: 0
Message 60257 - Posted: 2 Apr 2023, 2:20:00 UTC - in response to Message 60256.
"The ATM tasks are just like the acemd3 tasks in that they can't be interrupted or restarted without erroring out. Unlike the acemd3 tasks which can be restarted on the same device, the ATM tasks can't be restarted or interrupted at all. They exit immediately if restarted."
I agree. I have lost quite a few hours on WUs that were going to complete, because I had to perform reboots and lost them. Is anyone addressing this issue yet?

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 0
Message 60258 - Posted: 2 Apr 2023, 2:39:38 UTC - in response to Message 60257.
Haven't heard or seen any comments by any of the devs. The acemd3 app hasn't been fixed in two years. And that is an internal application by Acellera. No reason to expect any change in the newest sub-project apps. Not unless some dev has got a lot of time to dig into this type of bug. And since almost all of the newer apps depend on external libraries, that falls to those external toolsets and devs outside of this project. So probably not going to happen.

ZUSE Joined: 10 Jun 20 Posts: 7 Credit: 980,066,632 RAC: 0
Message 60259 - Posted: 2 Apr 2023, 8:26:04 UTC
It is just interesting that acemd3 runs through and ATM does not. Errors appear after a few minutes. The system was neither restarted during the calculation nor was there an update, so the problem lies elsewhere. Exactly the same on Linux. Windows and drivers are up to date.

Aurum Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 32
Message 60260 - Posted: 2 Apr 2023, 8:31:36 UTC - in response to Message 60259. Last modified: 2 Apr 2023, 8:31:56 UTC
"it is just interesting that acemd3 runs through and ATM does not. errors appear after a few minutes the system was neither restarted during the calculation nor was there an update so the problem lies elsewhere Exactly the same on Linux. Windows and drivers are up to date"
If you're time-slicing with another GPU project that will cause a fatal "computation error" when BOINC switches between them.

KAMasud Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
Message 60261 - Posted: 2 Apr 2023, 11:03:21 UTC
Task failed.
ImportError: DLL load failed while importing _openmm: The specified module could not be found.
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
Encountered error while generating package metadata.
__________________
Another one.
ImportError: DLL load failed while importing _openmm: The specified module could not be found.
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
Encountered error while generating package metadata.
_____________________
Third one.
ImportError: DLL load failed while importing _openmm: The specified module could not be found.
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
Encountered error while generating package metadata.
_________________________________
I have eleven failed tasks (proud of setting the record), all revolving around the same thing.

bluestang Joined: 13 Apr 15 Posts: 11 Credit: 3,003,712,606 RAC: 0
Message 60264 - Posted: 2 Apr 2023, 17:55:46 UTC
Beta or not... how can a project send out tasks with this length of runtime and not have checkpointing of some sort? The amount of resources wasted because of this has to be mind-boggling.

STARBASEn Joined: 17 Feb 09 Posts: 91 Credit: 1,603,303,394 RAC: 0
Message 60266 - Posted: 2 Apr 2023, 22:22:50 UTC - in response to Message 60264.
I concur with bluestang. Some of those likely successful WUs I lost had 20+ hours of work wasted because of a necessary reboot. The project owners should make some sort of fix a priority.

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 0
Message 60267 - Posted: 2 Apr 2023, 23:46:57 UTC
Or just acknowledge you aren't willing to accept the project limitations and move on to other GPU projects that fit your usage conditions. I have no issue letting work run to completion because I know that I must let all hosts run uninterrupted.

KAMasud Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
Message 60268 - Posted: 3 Apr 2023, 0:01:08 UTC - in response to Message 60267.
"Or just acknowledge you aren't willing to accept the project limitations and move onto other gpu projects that fit your usage conditions. I have no issue letting work run to completion because I know that I must let all hosts run uninterrupted."
________________
Do you know what the problem is? Quico has not understood what abouh did at the very start. I am pretty sure that, whatever it is, if he brings it to the thread he will find an answer. There are a lot of people on the thread, and you are one of them, who are willing to help to the best of their ability. Experts.
I have paused all Microsoft updates for five weeks, since they seem to trigger the rest of the updates, like Intel's. Just for these WUs.

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 0
Message 60269 - Posted: 3 Apr 2023, 3:08:20 UTC - in response to Message 60268.
But abouh's app is different from Quico's. They use different external tools. You can't apply to Quico's app the same fixes that abouh made to his. The Python tasks use the pytorch libraries and Quico's app uses the AToM libraries. Plus the Python app is mostly a CPU app while the ATM app is mostly a GPU app. They work very differently. Expecting that the app structure and design of the Python app is directly applicable to the AToM app is naive and simplistic.