ATM

Message boards : News : ATM

KAMasud
Message 60293 - Posted: 6 Apr 2023, 14:49:23 UTC
Last modified: 6 Apr 2023, 15:30:56 UTC

This WU with 'Jnk1' in its name lasted ten seconds.
task 33411216


Edit: Now I have a WU with 'thrombin' in its name. It reached 100% in 15 minutes but is still keeping the GPU and CPU busy.
task 33413038

KAMasud
Message 60294 - Posted: 6 Apr 2023, 20:16:42 UTC

This 'MCL1' has been running steadily for the last hour. But it is showing progress as 100% while the elapsed clock is ticking. Task Manager shows it is busy.
task 33412833

[CSF] Aleksey Belkov
Message 60295 - Posted: 6 Apr 2023, 21:45:11 UTC - in response to Message 60294.  
Last modified: 6 Apr 2023, 21:46:02 UTC

But it is showing progress as 100%

This is "normal" for all ATM WUs.
Perhaps the devs will be able to fix it later.
So there is no need to express surprise about it in every post -_-

KAMasud
Message 60296 - Posted: 6 Apr 2023, 22:52:16 UTC - in response to Message 60293.  
Last modified: 6 Apr 2023, 22:55:29 UTC

This WU with 'Jnk1' in it, lasted ten seconds.
task 33411216


Edit. Now I have a WU with 'thrombin' in its name. Reached 100% in 15 minutes but is still busy with the GPU and CPU.
task 33413038

_______________

Completed and validated.

No. For some reason, people are aborting tasks such as this 'thrombin' WU. We normally watch the progress report; instead, check the Task Manager. If there is a heartbeat, let it run.

KAMasud
Message 60297 - Posted: 6 Apr 2023, 23:23:17 UTC - in response to Message 60294.  

This 'MCL1' has been running steadily for the last hour. But it is showing progress as 100% while the elapsed clock is ticking. Task Manager shows it is busy.
task 33412833

_______________

Completed and validated.
Aurum?

Speedy
Message 60301 - Posted: 11 Apr 2023, 8:51:48 UTC

When tasks are available, what percentage of CPU do they require? Does the CPU usage fluctuate as it does with the other Python tasks?

Aurum
Message 60302 - Posted: 11 Apr 2023, 11:59:50 UTC - in response to Message 60301.  

When tasks are available, what percentage of CPU do they require? Does the CPU usage fluctuate as it does with the other Python tasks?

One CPU is plenty for these tasks. An ATM task doesn't need a full GPU, so I run Einstein, Milkyway or OPNG alongside it. The problem is that if BOINC time-slices it, the ATM WU will fail when it gets restarted.
The exception is when it gets time-sliced during the final step (zipping up, maybe?) after several hours. Then it uploads and reports as valid.
The best way to ensure these ATM WUs succeed is not to run a different project on the same GPU, so that BOINC cannot switch the GPU away and crash the WU when it restarts. Running 2 ATM WUs per GPU, or an ACEMD+ATM pair, is OK since BOINC doesn't switch away.
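
Running two ATM WUs per GPU is normally configured through an app_config.xml file in the project directory. The following is a minimal Python sketch that generates such a file. The application name "ATM" is an assumption and should be checked against the app name your own client reports; build_app_config is a hypothetical helper, not part of BOINC.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, gpu_usage: float, cpu_usage: float) -> str:
    """Return the text of an app_config.xml for a single application.

    app_name is an assumption supplied by the caller; BOINC only applies
    the settings if it matches the project's real application name.
    """
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    # Fraction of one GPU each task claims: 0.5 lets two tasks share a GPU.
    ET.SubElement(gpu_versions, "gpu_usage").text = str(gpu_usage)
    # Number of CPU cores budgeted per task.
    ET.SubElement(gpu_versions, "cpu_usage").text = str(cpu_usage)
    ET.indent(root)  # pretty-print; requires Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_app_config("ATM", gpu_usage=0.5, cpu_usage=1.0))
```

With gpu_usage set to 0.5, the client budgets half a GPU per task, so two tasks of that application can run side by side on one GPU.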

Aurum
Message 60303 - Posted: 11 Apr 2023, 12:08:11 UTC - in response to Message 60297.  
Last modified: 11 Apr 2023, 12:10:05 UTC

This 'MCL1' has been running steadily for the last hour. But it is showing progress as 100% while the elapsed clock is ticking. Task Manager shows it is busy.
task 33412833
_______________
Completed and validated.
Aurum?

Yes, the failed WU was on my Rig-11, which is having intermittent failures/reboots due to a MB/GPU issue of unknown origin. I've swapped GPUs several times and the problem stays with the Rig-11 MB, so it's not a bad GPU. If I leave the GPU idle, the CPU runs WUs fine. Einstein and Milkyway don't seem to cause the problem, but Asteroids, GG and maybe OPNG do at random intervals. It might also be the time-slicing that I described in my penultimate reply.
It's probably time to scrap the MB. Since most are designed for gamers, manufacturers stuff too much junk on them and compromise their reliability.

Aurum
Message 60304 - Posted: 11 Apr 2023, 15:23:17 UTC

Looks like all of today's WUs are failing:
FileNotFoundError: [Errno 2] No such file or directory: 'CDK2_new_2_edit-1oiy-1h1q_0.xml'
It dumbfounds me why they still have it set to fail 7 times. If they fail at the end then that's several days of compute time wasted. Isn't two failures enough?

Ian&Steve C.
Message 60305 - Posted: 11 Apr 2023, 15:27:48 UTC - in response to Message 60304.  

I had two fail in this way, but the rest (20+ or so) are running fine. Certainly not "all" of them.

Erich56
Message 60306 - Posted: 11 Apr 2023, 16:05:14 UTC

Strangely enough, about 2 hours ago one of my rigs downloaded 2 ATM tasks while Python tasks were running.
The ATM tasks failed after a minute.
I checked my settings, and they clearly say:
ATM (beta): no

So how come ATMs are being downloaded?

Keith Myers
Message 60307 - Posted: 11 Apr 2023, 16:20:02 UTC - in response to Message 60306.  

I think the beta toggle in preferences is 'sticky' in the scheduler.

I've seen similar behavior. I didn't get Python beta tasks until I set beta in preferences. I then unset beta in preferences and still got beta Python tasks. I set beta again for ATM.

Probably only a detach and reattach will fix it.

Aurum
Message 60308 - Posted: 12 Apr 2023, 2:08:15 UTC - in response to Message 60306.  

Strangely enough, about 2 hours ago one of my rigs downloaded 2 ATM tasks while Python tasks were running.
The ATM tasks failed after a minute.
I checked my settings, and they clearly say:
ATM (beta): no

So how come ATMs are being downloaded?

I think ATMbeta is controlled by the "Run test applications?" setting.

Erich56
Message 60309 - Posted: 12 Apr 2023, 6:59:28 UTC - in response to Message 60308.  

I think ATMbeta is controlled by the "Run test applications?" setting.

Oh, this might explain it.
While I unchecked "ATM beta", I neglected to uncheck "Run test applications".

Aurum
Message 60310 - Posted: 12 Apr 2023, 15:43:22 UTC
Last modified: 12 Apr 2023, 15:43:37 UTC

This WU errored out on me with a NaN at 913 seconds. I never overclock my GPUs, and I power-limited this 2080 Ti to 180 W since GPUs are notorious for wasting energy. This NaN error is due to the calculation boundaries being set wrong.
https://www.gpugrid.net/workunit.php?wuid=27468777

bibi
Message 60312 - Posted: 12 Apr 2023, 15:56:19 UTC

Hello Quico,
I hope my interpretation is correct.
see https://boinc.berkeley.edu/trac/wiki/WrapperApp

If no task has checkpoint_filename defined, the job starts over from the first task after a restart and breaks at python pip.
The task with the run script should define checkpoint_filename. The progress file is changed after each checkpoint, so maybe it is enough to specify the progress file as checkpoint_filename. Resuming should then work exactly the same as starting from a checkpoint.
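
A quick way to see which wrapper tasks lack a checkpoint_filename is to parse job.xml. The sketch below assumes the usual wrapper layout of a job_desc root containing task elements. The script names in the example are made up for illustration, and tasks_missing_checkpoint is a hypothetical helper. It only reports whether the tag is present; it does not show that resuming actually works.

```python
import xml.etree.ElementTree as ET


def tasks_missing_checkpoint(job_xml: str) -> list[str]:
    """Return the application names of tasks without a checkpoint_filename."""
    root = ET.fromstring(job_xml.strip())
    missing = []
    for task in root.iter("task"):
        checkpoint = (task.findtext("checkpoint_filename") or "").strip()
        if not checkpoint:
            name = (task.findtext("application") or "<unnamed>").strip()
            missing.append(name)
    return missing


# Hypothetical job.xml: the first task has no checkpoint file,
# the second names the progress file as its checkpoint file.
EXAMPLE_JOB = """
<job_desc>
  <task>
    <application>setup.sh</application>
  </task>
  <task>
    <application>run.sh</application>
    <checkpoint_filename>progress</checkpoint_filename>
  </task>
</job_desc>
"""

if __name__ == "__main__":
    print(tasks_missing_checkpoint(EXAMPLE_JOB))
```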

progress
The formula should be changed as suggested by Richard Haselgrove in http://www.gpugrid.net/forum_thread.php?id=5379&nowrap=true#60206:
progress = float(isample - last_sample)/float(num_samples - last_sample)
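
The suggested formula can be sketched as a small function. This assumes that last_sample is the sample count already completed when the current run started and that num_samples is the total target; those meanings are inferred from the variable names, not confirmed by the project. The clamping to the range 0 to 1 and the guard against a zero denominator are additions for illustration.

```python
def progress_fraction(isample: int, last_sample: int, num_samples: int) -> float:
    """Fraction of this run's work that is done, between 0.0 and 1.0."""
    remaining = num_samples - last_sample
    if remaining <= 0:
        # Nothing left to do in this run, so report it as complete.
        return 1.0
    fraction = float(isample - last_sample) / float(remaining)
    # Clamp so an out-of-range sample counter never reports <0% or >100%.
    return min(max(fraction, 0.0), 1.0)


if __name__ == "__main__":
    # A run that resumes at sample 70 of 140 and has now reached sample 105.
    print(progress_fraction(105, 70, 140))
```

Under these assumptions, a task that resumes partway through starts again at 0% for its own share of the work and reaches 100% only at the last sample.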


Aurum
Message 60319 - Posted: 14 Apr 2023, 13:21:34 UTC
Last modified: 14 Apr 2023, 13:39:21 UTC

  File "/var/lib/boinc-client/slots/34/lib/python3.9/site-packages/openmm/app/statedatareporter.py", line 365, in _checkForErrors
    raise ValueError('Energy is NaN.  For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan')
ValueError: Energy is NaN.
https://www.gpugrid.net/workunit.php?wuid=27469907

Watched a WU finish and it spent 6 minutes out of 313 minutes on 100%.
No checkpointing.
Has it been confirmed that the calculation boundaries are correct and not the cause of the NaN errors?
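
The traceback shows the run being stopped by a check on the reported energy. As a stand-alone illustration of that kind of guard (not OpenMM's own code; check_energy is a hypothetical name), a reporter can test each energy value and raise as soon as it is no longer a finite number:

```python
import math


def check_energy(potential_energy: float) -> None:
    """Raise ValueError if the energy value is NaN or infinite."""
    if math.isnan(potential_energy):
        raise ValueError("Energy is NaN.")
    if math.isinf(potential_energy):
        raise ValueError("Energy is infinite.")


if __name__ == "__main__":
    check_energy(-1234.5)  # a finite value passes silently
    try:
        check_energy(float("nan"))
    except ValueError as err:
        print(err)
```

A guard like this only detects the problem; it says nothing about whether the cause is the simulation setup or the hardware.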

Erich56
Message 60320 - Posted: 15 Apr 2023, 8:41:17 UTC - in response to Message 60319.  

I, too, had such an error after the task had run for 7,885 seconds:

File "Z:\BOINC\slots\3\lib\site-packages\openmm\app\statedatareporter.py", line 365, in _checkForErrors
raise ValueError('Energy is NaN. For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan')
ValueError: Energy is NaN


https://www.gpugrid.net/result.php?resultid=33436488

no overclocking.

KAMasud
Message 60321 - Posted: 15 Apr 2023, 11:07:03 UTC - in response to Message 60319.  

  File "/var/lib/boinc-client/slots/34/lib/python3.9/site-packages/openmm/app/statedatareporter.py", line 365, in _checkForErrors
    raise ValueError('Energy is NaN.  For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan')
ValueError: Energy is NaN.
https://www.gpugrid.net/workunit.php?wuid=27469907

Watched a WU finish and it spent 6 minutes out of 313 minutes on 100%.
No checkpointing.
Has it been confirmed that the calculation boundaries are correct and not the cause of the NaN errors?


Wow! Six minutes is a significant improvement over the hours it was taking before. Just don't give it a kick and abort.

bibi
Message 60322 - Posted: 15 Apr 2023, 11:19:45 UTC - in response to Message 60312.  

Too bad, it is not so simple after all. I added the checkpoint tag to the job.xml in the project directory under Windows and, after two samples, did a suspend/resume. Again the job started over with the first task and died at python pip.

©2025 Universitat Pompeu Fabra