Message boards : News : ATM
| Author | Message |
|---|---|
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
This WU with 'Jnk1' in its name lasted ten seconds: task 33411216. Edit: now I have a WU with 'thrombin' in its name. It reached 100% in 15 minutes but is still keeping the GPU and CPU busy: task 33413038 |
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
This 'MCL1' has been running steadily for the last hour. But it is showing progress as 100% while the elapsed clock is ticking. Task Manager shows it is busy. task 33412833 |
|
Joined: 26 Dec 13 Posts: 86 Credit: 1,292,358,731 RAC: 0
|
"But it is showing progress as 100%": with all ATM WUs, this is "normal". Perhaps the devs will be able to fix it later, so there is no need to be surprised by this fact in every post -_- |
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
This WU with 'Jnk1' in it, lasted ten seconds. _______________ Completed and validated. No. For some reason, people are aborting, like this WU 'thrombin'. We normally watch the progress report. Instead, check the Task Manager. If there is a heartbeat, let it run. |
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
This 'MCL1' has been running steadily for the last hour. But it is showing progress as 100% while the elapsed clock is ticking. Task Manager shows it is busy. _______________ Completed and validated. Auram? |
|
Joined: 19 Aug 07 Posts: 46 Credit: 45,339,082 RAC: 38
|
When tasks are available how much percentage of CPU do they require, does the CPU usage fluctuate like the other Python tasks? |
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,408,899,587 RAC: 0
|
"When tasks are available how much percentage of CPU do they require, does the CPU usage fluctuate like the other Python tasks?" One CPU is plenty for these tasks. It doesn't need a full GPU, so I run Einstein, Milkyway or OPNG with it. The problem is that if BOINC time-slices it, the ATM WU will fail when it gets restarted, unless it was time-sliced during the final step (zipping up, maybe?) after several hours; then it uploads and reports as valid. The best way to ensure these ATM WUs succeed is to not run a different project, so that BOINC does not switch the GPU away and crash the WU when it restarts. Running 2 ATM WUs per GPU or an ACEMD+ATM is OK since it doesn't switch away. |
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,408,899,587 RAC: 0
|
This 'MCL1' has been running steadily for the last hour. But it is showing progress as 100% while the elapsed clock is ticking. Task Manager shows it is busy. _______________ Yes, the failed WU is on my Rig-11, which is having intermittent failures/reboots due to a MB/GPU issue of unknown origin. I've swapped GPUs several times and the problem stays with the Rig-11 MB, so it's not a bad GPU. If I leave the GPU idle, the CPU runs WUs fine. Einstein and Milkyway don't seem to cause the problem, but Asteroids, GG and maybe OPNG do at random intervals. It might also be the time-slicing that I described in my penultimate reply. It's probably time to scrap the MB. Since most are designed for gamers, they stuff too much junk on them and compromise their reliability. |
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,408,899,587 RAC: 0
|
Looks like all of today's WUs are failing: FileNotFoundError: [Errno 2] No such file or directory: 'CDK2_new_2_edit-1oiy-1h1q_0.xml' It dumbfounds me why they still have it set to fail 7 times. If they fail at the end, then that's several days of compute time wasted. Isn't two failures enough? |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,839,470,595 RAC: 5,269
|
I had two fail in this way, but the rest (20+ or so) are running fine. Certainly not "all" of them.
|
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Strangely enough, about 2 hours ago one of my rigs downloaded 2 ATM tasks while Python tasks were running. The ATM tasks failed after a minute. I checked my settings, and they clearly say: ATM (beta): no. So how come ATMs are being downloaded? |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731
|
I think the beta toggle in preferences is 'sticky' in the scheduler. I've seen something similar: I didn't get Python beta until I set beta in preferences, then unset beta in preferences and still got beta Python tasks. Beta is set again for ATM. Probably only a detach and reattach will fix it. |
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,408,899,587 RAC: 0
|
Strange enough, about 2 hours ago one of my rigs downloaded 2 ATM tasks, while Python tasks were running. I think ATMbeta is controlled by Run test applications? |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
I think ATMbeta is controlled by Run test applications? Oh, this might explain it. While I unchecked "ATM beta", I neglected to uncheck "Run test applications". |
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,408,899,587 RAC: 0
|
This WU had me error out for NaN at 913 seconds. I never overclock my GPUs, and I power-limited this 2080 Ti to 180 W since GPUs are notorious for wasting energy. This NaN error is due to setting the calculation boundaries wrong. https://www.gpugrid.net/workunit.php?wuid=27468777 |
|
Joined: 4 May 17 Posts: 15 Credit: 17,444,875,743 RAC: 240
|
Hello Quico, I hope my interpretation is correct; see https://boinc.berkeley.edu/trac/wiki/WrapperApp. If no task has checkpoint_filename defined, then the job starts over from the first task and breaks at python pip. The task with the run script should define checkpoint_filename. The progress file is changed after each checkpoint, so maybe it is enough to specify progress as the checkpoint_filename. Resume should then work exactly the same as when starting from a checkpoint. As for progress: the formula should be changed as suggested by Richard Haselgrove in http://www.gpugrid.net/forum_thread.php?id=5379&nowrap=true#60206: progress = float(isample - last_sample)/float(num_samples - last_sample) |
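The progress formula quoted above can be sketched in Python as follows. This is only an illustration: the function name fraction_done, the reading of last_sample as the sample index the run started or resumed from, and the clamping to the range 0..1 are assumptions, not the project's actual code.

```python
def fraction_done(isample: int, last_sample: int, num_samples: int) -> float:
    """Fraction of the remaining samples completed in this run.

    isample     -- index of the sample just completed (assumed meaning)
    last_sample -- index the run started or resumed from (assumed meaning)
    num_samples -- total number of samples the task must reach
    """
    remaining = num_samples - last_sample
    if remaining <= 0:
        # Nothing left to compute, so report the task as finished.
        return 1.0
    frac = float(isample - last_sample) / float(remaining)
    # Clamp so a stray index can never report below 0% or above 100%.
    return min(max(frac, 0.0), 1.0)
```

With this form, a task resumed at sample 10 of 20 reports 50% at sample 15 instead of jumping straight to 100%.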
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,408,899,587 RAC: 0
|
File "/var/lib/boinc-client/slots/34/lib/python3.9/site-packages/openmm/app/statedatareporter.py", line 365, in _checkForErrors
    raise ValueError('Energy is NaN. For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan')
ValueError: Energy is NaN.
https://www.gpugrid.net/workunit.php?wuid=27469907
Watched a WU finish, and it spent 6 minutes out of 313 minutes at 100%. No checkpointing. Has it been confirmed that the calculation boundaries are correct and not the cause of the NaN errors? |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
I, too, had such an error after the task had run for 7.885 seconds:
File "Z:\BOINC\slots\3\lib\site-packages\openmm\app\statedatareporter.py", line 365, in _checkForErrors
    raise ValueError('Energy is NaN. For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan')
ValueError: Energy is NaN
https://www.gpugrid.net/result.php?resultid=33436488
No overclocking. |
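The tracebacks above show the error being raised by a check on the reported energy. The kind of check involved can be sketched in plain Python as below. This is a simplified stand-in, not OpenMM's code: check_energy and first_bad_step are hypothetical helper names, and the sketch says nothing about what caused the NaN (overclocking, input boundaries, or anything else).

```python
import math


def check_energy(energy_kj_per_mol: float) -> None:
    """Raise ValueError if the energy is NaN or infinite (hypothetical helper)."""
    if math.isnan(energy_kj_per_mol):
        raise ValueError("Energy is NaN.")
    if math.isinf(energy_kj_per_mol):
        raise ValueError("Energy is infinite.")


def first_bad_step(energies):
    """Return the index of the first non-finite energy, or None if all are finite."""
    for step, energy in enumerate(energies):
        try:
            check_energy(energy)
        except ValueError:
            return step
    return None
```

Once a single step produces a NaN energy, the run cannot recover on its own, which is consistent with the task ending in an error instead of continuing.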
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
Wow! Six minutes is a significant improvement over the hours it was taking before. Just don't give it a kick and abort. |
|
Joined: 4 May 17 Posts: 15 Credit: 17,444,875,743 RAC: 240
|
Too bad, not so simple after all. I wrote the checkpoint tag into job.xml in the project directory under Windows, then suspended and resumed after two samples; again the job started with the first task and died at python pip. |
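For reference, the kind of job.xml edit described here can be sketched in Python. The example job file, the task names setup.bat and run.bat, and the helper add_checkpoint_filename are all hypothetical; only the checkpoint_filename tag and the file name progress come from this thread. As reported above, adding the tag alone did not make resume work, so this shows the edit and is not a fix.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-task wrapper job file; the real job.xml may differ.
EXAMPLE_JOB_XML = """<job_desc>
  <task>
    <application>setup.bat</application>
  </task>
  <task>
    <application>run.bat</application>
  </task>
</job_desc>"""


def add_checkpoint_filename(job_xml: str, application: str, checkpoint_name: str) -> str:
    """Add <checkpoint_filename> to the task that runs the given application."""
    root = ET.fromstring(job_xml)
    for task in root.iter("task"):
        app = task.find("application")
        if app is None or (app.text or "").strip() != application:
            continue  # leave other tasks untouched
        if task.find("checkpoint_filename") is None:
            ET.SubElement(task, "checkpoint_filename").text = checkpoint_name
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(add_checkpoint_filename(EXAMPLE_JOB_XML, "run.bat", "progress"))
```

Only the task that runs the simulation script gets the tag here, matching the suggestion that the task with the run script should define checkpoint_filename.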