ATM

KAMasud
Message 60270 - Posted: 3 Apr 2023, 8:10:32 UTC - in response to Message 60269.  

But abouh's app is different from Quico's. They use different external tools. You can't apply the same fixes that abouh did for Quico's app.

The Python tasks use the PyTorch libraries and Quico's tasks use the AToM libraries.

Plus the Python app is mostly a CPU app while the ATM app is mostly a GPU app.

They work very differently. Expecting that the app structure and design of the Python app is directly applicable to the AToM app is naive and simplistic.

___________________

I am not saying anything, but I agree with the sentiments of some.
Maybe some of us can experiment with the AToM libraries.

Richard Haselgrove
Message 60271 - Posted: 3 Apr 2023, 8:52:44 UTC - in response to Message 60270.  

Having looked into the internal logging of Quico's tasks in some detail because of the progress percentage problem, it's clear that the task goes through the motions of writing a checkpoint normally: 70 times per task for the recent short runs, 341 times per task for the very long ones. That's about once every five minutes on my machines, which would be perfectly acceptable to me.

I would judge the problem to lie at the other end of the process: restarting the task after an interruption. That's more complicated from the programmer's point of view. Not only does the state of the science program's data have to be restored from disk in the proper format, but all the wrapper's counters and timings also have to be re-aligned and restarted.

By all means explore and learn about the tools and libraries used for these tasks, but I suspect you'll have to get down and dirty with the application's code as well. Let us know how you get on.
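
To illustrate the two halves described above (writing a checkpoint versus restoring state and counters on restart), here is a minimal Python sketch. It is not the ATM app's or the BOINC wrapper's actual code: the JSON file format and the names save_checkpoint, load_checkpoint and run are all hypothetical, and the "simulation" is just a step counter.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # a crash mid-write cannot corrupt the old file


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "elapsed": 0.0}
    with open(path) as handle:
        return json.load(handle)


def run(path, total_steps, checkpoint_every, stop_after=None):
    """Advance a toy simulation, resuming its counters from the checkpoint."""
    state = load_checkpoint(path)  # the restart half: restore step AND timing
    while state["step"] < total_steps:
        state["step"] += 1
        state["elapsed"] += 1.0  # stand-in for real elapsed-time accounting
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
        if stop_after is not None and state["step"] >= stop_after:
            break  # simulate the task being paused or interrupted
    return state
```

In this sketch a run interrupted between checkpoints loses only the steps since the last save, and the progress fraction (step / total_steps) and elapsed time pick up from the saved values instead of restarting at zero.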

KAMasud
Message 60273 - Posted: 4 Apr 2023, 13:30:45 UTC

Impressive.

Erich56
Message 60274 - Posted: 4 Apr 2023, 15:39:05 UTC

Does anyone have any idea why this task failed after 5 1/2 hours?
https://www.gpugrid.net/result.php?resultid=33405348

This time there was no overclocking involved, so the cause must have been something else :-(

Keith Myers
Message 60275 - Posted: 4 Apr 2023, 18:29:37 UTC - in response to Message 60274.  

ValueError: Energy is NaN. In other words, the energy came out as "not a number".

An impossible value got the task thrown out. There are a couple of possible reasons:

A misconfigured or "bad" task.

The GPU was running overclocked or hot, which caused math errors.
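
As a rough illustration of why a single bad value stops a task, the sketch below shows how NaN propagates through a sum and how a guard can turn it into a ValueError. This is only an assumed shape for such a check, not the AToM code itself; check_energy and the sample numbers are hypothetical.

```python
import math


def check_energy(energy):
    """Raise ValueError if the energy is NaN, otherwise return it unchanged."""
    if math.isnan(energy):
        raise ValueError("Energy is NaN.")
    return energy


# NaN propagates through arithmetic, so one bad term poisons the whole total.
terms = [-1250.5, float("nan"), 310.2]
total = sum(terms)

try:
    check_energy(total)
except ValueError as error:
    print(f"task rejected: {error}")
```

Because NaN compares unequal to everything, including itself, the check uses math.isnan rather than an equality test.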

[AF>FAH-Addict.net]toTOW
Message 60276 - Posted: 4 Apr 2023, 18:47:37 UTC
Last modified: 4 Apr 2023, 18:54:28 UTC


Keith Myers
Message 60277 - Posted: 4 Apr 2023, 18:54:51 UTC - in response to Message 60276.  

We see this frequently with misconfigured tasks. The researcher does a poor job of updating the task generation template when configuring new tasks.

That wastes time and resources for everyone.

[CSF] Aleksey Belkov
Message 60278 - Posted: 4 Apr 2023, 21:34:20 UTC - in response to Message 60276.  

seems to be failing the same way with missing files

Same here:
https://www.gpugrid.net/result.php?resultid=33406558
But it is the first failure among a dozen successfully completed tasks.

Erich56
Message 60279 - Posted: 5 Apr 2023, 6:43:22 UTC - in response to Message 60277.  

Wastes time and resources for everyone.

Well, as long as a task fails within a few minutes (I had a few of those yesterday), I think it's not that bad.
But the day before yesterday I had one which failed after some 5 1/2 hours, which is not good :-(

KAMasud
Message 60281 - Posted: 5 Apr 2023, 8:47:03 UTC

task 27451592
task 27451185
task 27451971
task 27451763
Completed and validated. No errors as yet.

Richard Haselgrove
Message 60282 - Posted: 5 Apr 2023, 18:46:01 UTC

Thought I'd run a quick test to see if there was any progress on the restart front. Waited until a task had just finished, and let a new one start and run to the first checkpoint: then paused it, and waited while another project borrowed the GPU temporarily.

The test task was p38_2m_2j_5-QUICO_ATM_OFF_STEPS-1-5-RND9265_1. On restart, it started again from zero progress, zero elapsed time, and ran up to the 0.200% point: then it crashed as before. I didn't have any time to rescue any logs from the restart - my BOINC client cleaned and reused the slot for something else before I could catch it.

The website report says it ran for about 40 seconds, and stderr.txt contains the lines

  Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /hdd/boinc-client/slots/2/tmp/pip-req-build-368b4spp
  fatal: unable to access '/home/conda/feedstock_root/build_artifacts/git_1679396317102/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/etc/gitconfig': Permission denied

That doesn't sound very hopeful. It's still a problem.

KAMasud
Message 60283 - Posted: 5 Apr 2023, 19:05:03 UTC - in response to Message 60281.  

task 27451592
task 27451185
task 27451971
task 27451763
Completed and validated. No errors as yet.

_______________________________

task 27451763
task 27451117
task 27452961
Completed and validated. No errors as yet. I dare not even sneeze near them. All updates are off.

KAMasud
Message 60284 - Posted: 5 Apr 2023, 19:35:54 UTC

The "OFF" in the WU name points towards Python. "AToM" also has something to do with Python.
I errored out on one of abouh's WUs because my GPU was updated. Was Python the cause? I do not know. I do not know how to dive under the bonnet, but if you Google "OFF Python" and "AToM Python", there is a relationship.

Richard Haselgrove
Message 60285 - Posted: 5 Apr 2023, 20:44:22 UTC - in response to Message 60284.  

I think 'Python' is a programming language, and 'AToM' is a scientific program written in that language.

Erich56
Message 60286 - Posted: 6 Apr 2023, 4:55:32 UTC - in response to Message 60279.  

Well, as long as a task fails within a few minutes (I had a few of those yesterday), I think it's not that bad.
But the day before yesterday I had one which failed after some 5 1/2 hours, which is not good :-(

What I have noticed lately on my machines: when ATM tasks fail, it is mostly after 60-90 seconds.
And stderr always says:

FileNotFoundError: [Errno 2] No such file or directory: 'thrombin_noH_2-1a-3b_0.xml'
23:18:10 (18772): C:/Windows/system32/cmd.exe exited; CPU time 18.421875

see here:
https://www.gpugrid.net/result.php?resultid=33409106

KAMasud
Message 60287 - Posted: 6 Apr 2023, 7:46:00 UTC - in response to Message 60283.  

task 27451592
task 27451185
task 27451971
task 27451763
Completed and validated. No errors as yet.

_______________________________

task 27451763
task 27451117
task 27452961
Completed and validated. No errors as yet. I dare not even sneeze near them. All updates are off.

task 27452387
task 27452312
task 27452961
task 27452969
Completed and validated.
One task in error:
task 33410323

Richard Haselgrove
Message 60289 - Posted: 6 Apr 2023, 10:44:12 UTC - in response to Message 60286.  

Similar story. MCL1_49_35_4-QUICO_ATM_OFF_STEPS-0-5-RND5875_2 failed in 44 seconds with

FileNotFoundError: [Errno 2] No such file or directory: 'MCL1_pmx-49-35_0.xml'
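
A failure like this could in principle be caught before any computation starts by checking that every expected input file is present in the working directory. The sketch below is only an illustration under that assumption; missing_inputs and the file names in the demo are hypothetical and are not part of the ATM app.

```python
import tempfile
from pathlib import Path


def missing_inputs(workdir, required):
    """Return the names in `required` that are not regular files under `workdir`."""
    base = Path(workdir)
    return [name for name in required if not (base / name).is_file()]


# Demo: a slot directory that holds only one of the two expected files.
with tempfile.TemporaryDirectory() as slot:
    Path(slot, "present_0.xml").write_text("<System/>")
    absent = missing_inputs(slot, ["present_0.xml", "absent_0.xml"])
    print(absent)
```

A check of this kind would only report the problem earlier and more clearly. If the task template names a file that was never bundled with the work unit, the task still cannot run.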

KAMasud
Message 60290 - Posted: 6 Apr 2023, 12:09:09 UTC

Has anyone noticed that WUs with 'Bace' in their name show progress as 100% while the Time Elapsed counter is still ticking? Task Manager shows the task is still busy computing. This goes on for hours on end, and one task ran for up to 24 hours in this state.
A task doing this has not necessarily failed. Check Task Manager first and let it complete. Currently, this task is doing it:
task 33409877
I wish someone would put up a notice that this project is not for people who switch off their computers at night or for other reasons.

Erich56
Message 60291 - Posted: 6 Apr 2023, 12:10:38 UTC - in response to Message 60289.  

Similar story. MCL1_49_35_4-QUICO_ATM_OFF_STEPS-0-5-RND5875_2 failed in 44 seconds with

FileNotFoundError: [Errno 2] No such file or directory: 'MCL1_pmx-49-35_0.xml'

same here, about 1 hour ago:

FileNotFoundError: [Errno 2] No such file or directory: 'MCL1_pmx-30-40_0.xml'

Could such errors, happening this often, point to some kind of sloppy task configuration?

KAMasud
Message 60292 - Posted: 6 Apr 2023, 14:17:30 UTC - in response to Message 60291.  
Last modified: 6 Apr 2023, 14:18:30 UTC

Similar story. MCL1_49_35_4-QUICO_ATM_OFF_STEPS-0-5-RND5875_2 failed in 44 seconds with

FileNotFoundError: [Errno 2] No such file or directory: 'MCL1_pmx-49-35_0.xml'

same here, about 1 hour ago:

FileNotFoundError: [Errno 2] No such file or directory: 'MCL1_pmx-30-40_0.xml'

Could such errors, happening this often, point to some kind of sloppy task configuration?

__________________

Same here. The task with 'MCL1' in its name lasted only 18 seconds.
task 33411408

©2025 Universitat Pompeu Fabra