Message boards : News : ATM
| Author | Message |
|---|---|
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
But abouh's app is different from Quico's. They use different external tools. You can't apply the same fixes that abouh did for Quico's app. ___________________ I am not saying anything but I agree with the sentiments of some. Maybe, some of us can play with AToM libs. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
Having looked into the internal logging of Quico's tasks in some detail because of the progress percentage problem, it's clear that it goes through the motions of writing a checkpoint normally - 70 times per task for the recent short runs, 341 per task for the very long ones. That's about once every five minutes on my machines, which would be perfectly acceptable to me. I would judge the problem to be at the other end of the process - re-starting the task after an interruption. That's more complicated, from the programmer's point of view - not only does the state of the science program's data have to be restored from disk in the proper format, all the wrapper's counters and timings have to be re-aligned and re-started. By all means explore and learn about the tools and libraries used for these tasks, but I suspect you'll have to get down and dirty with the application's code as well. Let us know how you get on. |
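To make the two halves described in this post concrete, here is a minimal sketch of checkpoint writing and restart. It is not the ATM application's or the wrapper's actual code: the file layout, the `sample`/`elapsed` fields and all function names are hypothetical, chosen only to show that a restart has to restore the counters and timings as well as the data.

```python
import json
import os
import tempfile


def write_checkpoint(path, state):
    """Write the state atomically: temp file first, then rename over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())
    # A crash before this line leaves the previous checkpoint untouched.
    os.replace(tmp, path)


def restore_or_start(path, total_samples):
    """Resume from an existing checkpoint, otherwise start from zero."""
    if os.path.exists(path):
        with open(path) as fh:
            # Counters and elapsed time come back together with the data,
            # so progress does not restart from zero.
            return json.load(fh)
    return {"sample": 0, "elapsed": 0.0, "total_samples": total_samples}


def run(path, total_samples, checkpoint_every, stop_after=None):
    """Toy main loop; stop_after simulates an interruption (hypothetical parameter)."""
    state = restore_or_start(path, total_samples)
    done_this_session = 0
    while state["sample"] < state["total_samples"]:
        state["sample"] += 1
        state["elapsed"] += 1.0  # stand-in for the real time spent on one sample
        done_this_session += 1
        if state["sample"] % checkpoint_every == 0:
            write_checkpoint(path, state)
        if stop_after is not None and done_this_session >= stop_after:
            break
    return state
```

In this sketch, a run interrupted after five samples with a checkpoint every two samples resumes from sample 4 on the next start, losing only the work done since the last checkpoint.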
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
Impressive. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
Does anyone have any idea why this task failed after 5 1/2 hours? https://www.gpugrid.net/result.php?resultid=33405348 This time, there was no overclocking involved, so the reason must have been a different one :-( |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731
|
ValueError: Energy is NaN. In other words, not a number. An impossible value got the task thrown out. There are a couple of possible reasons: a misconfigured or "bad" task, or a GPU running overclocked or hot and causing math errors. |
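As an illustration of how a NaN arises and gets a task rejected, here is a small sketch. The `check_energy` function is a hypothetical stand-in rather than the application's real check; only the error text is taken from the message above.

```python
import math


def check_energy(energy: float) -> float:
    """Return the energy unchanged, or raise if it is NaN (hypothetical check)."""
    if math.isnan(energy):
        raise ValueError("Energy is NaN")
    return energy


# One way a NaN can appear: a value overflows to infinity,
# and a later subtraction of two infinities has no defined result.
overflowed = 1e308 * 10               # exceeds the float range, becomes inf
bad_energy = overflowed - overflowed  # inf - inf is NaN
```

Once a NaN is in the calculation it propagates through every later arithmetic step, which is why a single bad value, whether from the input or from a hardware math error, is enough to stop the run.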
|
Joined: 28 Oct 10 Posts: 9 Credit: 25,781,299 RAC: 0
|
All WUs seem to be failing the same way, with missing files: https://www.gpugrid.net/result.php?resultid=33406732 https://www.gpugrid.net/result.php?resultid=33406795 https://www.gpugrid.net/result.php?resultid=33406795 |
|
Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731
|
We see this frequently with misconfigured tasks. The researcher does a poor job of updating the task generation template when configuring new tasks. It wastes time and resources for everyone. |
|
Joined: 26 Dec 13 Posts: 86 Credit: 1,292,358,731 RAC: 0
|
"seems to be failing the same way with missing files" Same here: https://www.gpugrid.net/result.php?resultid=33406558 But this was the first failure among a dozen successfully completed tasks. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
"Wastes time and resources for everyone." Well, as long as a task fails within a few minutes (I had a few of those yesterday), I think it's not that bad. But I had one, the day before yesterday, which failed after some 5-1/2 hours - which is not good :-( |
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
|
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
Thought I'd run a quick test to see if there was any progress on the restart front. Waited until a task had just finished, and let a new one start and run to the first checkpoint: then paused it, and waited while another project borrowed the GPU temporarily. The test task was p38_2m_2j_5-QUICO_ATM_OFF_STEPS-1-5-RND9265_1. On restart, it started again from zero progress, zero elapsed time, and ran up to the 0.200% point: then it crashed as before. I didn't have any time to rescue any logs from the restart - my BOINC client cleaned and reused the slot for something else before I could catch it. The website report says it ran for about 40 seconds, and stderr.txt contains the lines "Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /hdd/boinc-client/slots/2/tmp/pip-req-build-368b4spp" and "fatal: unable to access '/home/conda/feedstock_root/build_artifacts/git_1679396317102/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/etc/gitconfig': Permission denied". That doesn't sound very hopeful. It's still a problem. |
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
task 27451592 _______________________________ task 27451763 task 27451117 task 27452961 Completed and validated. No errors as yet. I dare not even sneeze near them. All updates are off. |
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
This "OFF" in the WU name points towards Python. "AToM" also has something to do with Python. I errored out on one of Abou's WUs because my GPU was updated. Python? I do not know. I do not know how to dive under the bonnet, but if you Google "OFF Python" and "AToM Python", there is a relationship. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
I think 'Python' is a programming language, and 'AToM' is a scientific program written in that language. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
"Well, as long as a task fails within a few minutes (I had a few such ones yesterday), I think it's not that bad." What I have noticed lately on my machines is that when ATM tasks fail, it is mostly after 60-90 seconds, and stderr always says: FileNotFoundError: [Errno 2] No such file or directory: 'thrombin_noH_2-1a-3b_0.xml' 23:18:10 (18772): C:/Windows/system32/cmd.exe exited; CPU time 18.421875 See here: https://www.gpugrid.net/result.php?resultid=33409106 |
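A failure like this could in principle be caught before any computation starts. The following is a minimal sketch of such a pre-flight check, assuming the task knows the list of input files it needs. The function name and the idea of running it at task start are illustrative assumptions, not part of the ATM application.

```python
from pathlib import Path


def missing_inputs(workdir, required):
    """Return the names in `required` that are not present as files in `workdir`."""
    base = Path(workdir)
    return [name for name in required if not (base / name).is_file()]
```

Called at the very start of a task with the expected file names (for example the `.xml` file named in the error above), a non-empty result would let the task report every missing file at once and exit within seconds instead of partway through setup.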
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
Tasks 27451592, 27452387, 27452312, 27452961 and 27452969 completed and validated. One task in error: task 33410323. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
Similar story. MCL1_49_35_4-QUICO_ATM_OFF_STEPS-0-5-RND5875_2 failed in 44 seconds with FileNotFoundError: [Errno 2] No such file or directory: 'MCL1_pmx-49-35_0.xml' |
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
Has anyone noticed that the WUs with 'Bace' in their name show progress as 100% while the Time Elapsed counter is still ticking? Task Manager shows the task is still busy computing. This goes on for hours on end, and one task went up to 24 hours in this state. If a task is doing this, it does not mean the task has failed. Check in the Task Manager first. Let it complete. Currently, this task is doing it: task 33409877. I wish someone would put up a notice that this project is not for people who switch off their computers at night or for some other reason. |
|
Joined: 1 Jan 15 Posts: 1166 Credit: 12,260,898,501 RAC: 1
|
"Similar story. MCL1_49_35_4-QUICO_ATM_OFF_STEPS-0-5-RND5875_2 failed in 44 seconds with" Same here, about 1 hour ago: FileNotFoundError: [Errno 2] No such file or directory: 'MCL1_pmx-30-40_0.xml' Such errors, happening often enough, may point to some kind of sloppy task configuration? |
|
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0
|
Similar story. MCL1_49_35_4-QUICO_ATM_OFF_STEPS-0-5-RND5875_2 failed in 44 seconds with __________________ Same here. The task with 'MCL1' in its name lasted only 18 seconds. task 33411408 |
©2025 Universitat Pompeu Fabra