ATM

Author	Message
Keith Myers Send message Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 0 Level Scientific publications	Message 60197 - Posted: 27 Mar 2023, 4:03:36 UTC - in response to Message 60196. It looks like you got bit by a permission error. PermissionError: [Errno 13] Permission denied: 'r0/Jnk1_new_2-18659-18634_ckpt.xml' Your boinc.service file might be an old version that does not let applications access the .tmp directory or something. ID: 60197 · Rating: 0 · rate: / Reply Quote

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 490 Credit: 11,850,145,728 RAC: 1,066 Level Scientific publications	Message 60198 - Posted: 27 Mar 2023, 7:03:07 UTC - in response to Message 60197. It looks like you got bit by a permission error. PermissionError: [Errno 13] Permission denied: 'r0/Jnk1_new_2-18659-18634_ckpt.xml' Your boinc.service file might be an old version that does not let applications access the .tmp directory or something. The Boinc version is 7.20.7. https://www.gpugrid.net/hosts_user.php?userid=19626 ID: 60198 · Rating: 0 · rate: / Reply Quote

Bedrich Hajek Send message Joined: 28 Mar 09 Posts: 490 Credit: 11,850,145,728 RAC: 1,066 Level Scientific publications	Message 60199 - Posted: 27 Mar 2023, 7:33:49 UTC Another task failed. https://www.gpugrid.net/result.php?resultid=33383003 03/27/2023 3:20:22 AM \| GPUGRID \| Computation for task MCL1_new_2_29_27_OFF_4-QUICO_ATM_OPENFF-0-1-RND5141_5 finished 03/27/2023 3:20:22 AM \| GPUGRID \| Output file MCL1_new_2_29_27_OFF_4-QUICO_ATM_OPENFF-0-1-RND5141_5_0 for task MCL1_new_2_29_27_OFF_4-QUICO_ATM_OPENFF-0-1-RND5141_5 absent ID: 60199 · Rating: 0 · rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0 Level Scientific publications	Message 60200 - Posted: 27 Mar 2023, 8:04:20 UTC - in response to Message 60199. The output file will always be absent if the task fails - it doesn't get as far as writing it. The actual error is in the online report: ValueError: Energy is NaN. ('Not a Number') That's a science problem - not your fault. ID: 60200 · Rating: 0 · rate: / Reply Quote

Quico Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 28 Feb 23 Posts: 35 Credit: 0 RAC: 0 Level Scientific publications	Message 60201 - Posted: 27 Mar 2023, 8:48:51 UTC - in response to Message 60171. Last modified: 27 Mar 2023, 8:49:14 UTC I've seen that you are unhappy with the last batch of runs, since they take too much time. I've been experimenting with dividing the runs into different numbers of steps to find a sweet spot that you're happy with and that doesn't make it madness for me to organize all these runs and re-runs. I'll backtrack to the setting we had before. Apologies for that. I can't answer immediately on the termination question, but it's all open-source and I can look through it. In this case, it's more complicated, because BOINC will talk to the wrapper, and the wrapper will talk to the science app. But the basic idea is that BOINC will send a request to terminate over the API, and wait for the application to close itself down as it sees fit. Actual signals will only be used to force termination in the case of an unconditional quit, such as an operating system closedown. Right, probably the wrapper should send a termination signal to AToM. We have of course access to AToM's sources https://github.com/Gallicchio-Lab/AToM-OpenMM and we can make sure that it checkpoints appropriately when it receives the signal. However, I do not have access to the wrapper. Quico: please advise. I'll ask Raimondas about this and the other things that have been mentioned since he's the one taking care of this issue. ID: 60201 · Rating: 0 · rate: / Reply Quote
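
The checkpoint-on-termination idea discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes the wrapper forwards a SIGTERM to the science app, which the thread does not confirm, and it is not AToM's or the BOINC wrapper's actual code. The names StopRequested, install_handler, run_samples, run_one_sample and write_checkpoint are all hypothetical.

```python
import signal


class StopRequested:
    """Hypothetical flag object that records a termination request."""

    def __init__(self):
        self.flag = False

    def __call__(self, signum, frame):
        # Only set a flag here; the real work happens in the main loop.
        self.flag = True


def install_handler(stop):
    """Register the flag as the SIGTERM handler (must run in the main thread)."""
    signal.signal(signal.SIGTERM, stop)


def run_samples(num_samples, run_one_sample, write_checkpoint, stop):
    """Run samples 1..num_samples; checkpoint and return early if asked to stop.

    run_one_sample and write_checkpoint are caller-supplied callables
    (assumptions standing in for the real simulation and checkpoint code).
    Returns the number of the last completed sample.
    """
    for isample in range(1, num_samples + 1):
        run_one_sample(isample)
        if stop.flag:
            # Finish the current sample, save state, then exit cleanly.
            write_checkpoint(isample)
            return isample
    return num_samples
```

The handler only sets a flag so that the checkpoint is written between samples, when the simulation state is consistent, rather than in the middle of one.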

Quico Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 28 Feb 23 Posts: 35 Credit: 0 RAC: 0 Level Scientific publications	Message 60202 - Posted: 27 Mar 2023, 8:52:22 UTC I've read some people mentioning that the reporter doesn't work or that it goes over 100%. Does it work correctly for someone? ID: 60202 · Rating: 0 · rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0 Level Scientific publications	Message 60203 - Posted: 27 Mar 2023, 9:08:39 UTC - in response to Message 60202. I've read some people mentioning that the reporter doesn't work or that it goes over 100%. Does it work correctly for someone? It varies from task to task - or, I suspect, from batch to batch. I mentioned a specific problem with a JNK1 task - task 33380692 - but it's not a general problem. I suspect that it may have been a specific problem with setting the data that drives the progress %age calculation - the wrong expected 'total number of samples' may have been used. ID: 60203 · Rating: 0 · rate: / Reply Quote

Quico Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 28 Feb 23 Posts: 35 Credit: 0 RAC: 0 Level Scientific publications	Message 60204 - Posted: 27 Mar 2023, 9:37:43 UTC - in response to Message 60203. Last modified: 27 Mar 2023, 9:38:19 UTC I've read some people mentioning that the reporter doesn't work or that it goes over 100%. Does it work correctly for someone? It varies from task to task - or, I suspect, from batch to batch. I mentioned a specific problem with a JNK1 task - task 33380692 - but it's not a general problem. I suspect that it may have been a specific problem with setting the data that drives the progress %age calculation - the wrong expected 'total number of samples' may have been used. This one is a rerun, meaning that 2/3 of the run was previously simulated. Maybe it was expecting to start from 0 samples and, once it saw that we were at 228 from the beginning, it got confused. I'll comment on that. PS: But have other runs been reporting correctly? ID: 60204 · Rating: 0 · rate: / Reply Quote

bibi Send message Joined: 4 May 17 Posts: 15 Credit: 17,759,125,743 RAC: 1,927 Level Scientific publications	Message 60205 - Posted: 27 Mar 2023, 9:43:06 UTC https://www.gpugrid.net/result.php?resultid=33382097 I had to suspend this task at sample 149 and resumed it an hour later, but it started again with the python install step and died. It should restart with sample 149. ID: 60205 · Rating: 0 · rate: / Reply Quote

bibi Send message Joined: 4 May 17 Posts: 15 Credit: 17,759,125,743 RAC: 1,927 Level Scientific publications	Message 60206 - Posted: 27 Mar 2023, 9:47:44 UTC - in response to Message 60204. see post https://www.gpugrid.net/forum_thread.php?id=5379&nowrap=true#60160 it should be progress = float(isample)/float(num_samples) ID: 60206 · Rating: 0 · rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0 Level Scientific publications	Message 60207 - Posted: 27 Mar 2023, 10:22:32 UTC - in response to Message 60206. Or possibly progress = float(isample - last_sample)/float(num_samples - last_sample) if you want a truncated resend to start from 0% - but might that affect paused/resumed tasks as well? ID: 60207 · Rating: 0 · rate: / Reply Quote
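
The two progress formulas proposed above can be put side by side in a small sketch. The names isample, num_samples and last_sample come from the posts; the function name progress_fraction, the default last_sample=0 and the clamping to the range 0 to 1 are assumptions added for illustration, not the project's actual reporter code.

```python
def progress_fraction(isample, num_samples, last_sample=0):
    """Fraction of work done, clamped to [0.0, 1.0].

    With last_sample=0 this is float(isample)/float(num_samples).
    With last_sample set to the sample a resend starts from, it is
    float(isample - last_sample)/float(num_samples - last_sample),
    so a truncated resend starts from 0%.
    """
    total = num_samples - last_sample
    if total <= 0:
        # Nothing left to do (or inconsistent inputs): report complete.
        return 1.0
    done = isample - last_sample
    # Clamp so the reporter can never show less than 0% or more than 100%.
    return min(1.0, max(0.0, float(done) / float(total)))
```

For the rerun mentioned earlier, which began at sample 228, the first form would start at 228 divided by the total number of samples, while the second would start at 0%. Whether a paused and resumed task should also reset to 0% depends on what is passed as last_sample, which is the open question raised in the post.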

Aurum Send message Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 32 Level Scientific publications	Message 60208 - Posted: 27 Mar 2023, 13:49:57 UTC None of my WUs from yesterday completed. Please issue a server abort and eliminate all these defective WUs before releasing a new set. Otherwise defects will keep wasting 8 computers time for days to come. The problem is not the time they take to run. No checkpointing. Fail if suspended and restarted. ID: 60208 · Rating: 0 · rate: / Reply Quote

kksplace Send message Joined: 4 Mar 18 Posts: 53 Credit: 2,857,476,011 RAC: 354 Level Scientific publications	Message 60209 - Posted: 27 Mar 2023, 14:28:18 UTC The problem is not the time they take to run. No checkpointing. Fail if suspended and restarted. I agree with this. I had one error out on a restart two days ago after reaching nearly 100% due to no checkpoints. Not only that, but it then only showed 37 seconds of CPU time, so it doesn't show what really happened. My latest one did complete but showed no checkpoints. Therefore the long run time is more of a risk, because any interruption loses all the work done so far. ID: 60209 · Rating: 0 · rate: / Reply Quote

KAMasud Send message Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0 Level Scientific publications	Message 60210 - Posted: 27 Mar 2023, 16:05:56 UTC - in response to Message 60208. Last modified: 27 Mar 2023, 16:07:19 UTC None of my WUs from yesterday completed. Please issue a server abort and eliminate all these defective WUs before releasing a new set. Otherwise defects will keep wasting 8 computers time for days to come. The problem is not the time they take to run. No checkpointing. Fail if suspended and restarted. ______________ My problem with re-start and suspending is, these WUs are GPU intensive. As soon as one of these WUs pops up, my GPU fans let me know to do maintenance of the cooling system. I have laptops. I cannot take a blower on a running system. Now this WU for example has run for 21 hours and is at 34.5%. task 27440346 Edit. It is still running fine. ID: 60210 · Rating: 0 · rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 0 Level Scientific publications	Message 60211 - Posted: 27 Mar 2023, 16:41:41 UTC - in response to Message 60198. It looks like you got bit by a permission error. PermissionError: [Errno 13] Permission denied: 'r0/Jnk1_new_2-18659-18634_ckpt.xml' Your boinc.service file might be an old version that does not let applications access the .tmp directory or something. The Boinc version is 7.20.7. https://www.gpugrid.net/hosts_user.php?userid=19626 Not your fault, I got a couple of errored tasks that duplicated yours. Just a bad batch of tasks went out. ID: 60211 · Rating: 0 · rate: / Reply Quote

kotenok2000 Send message Joined: 18 Jul 13 Posts: 79 Credit: 241,278,292 RAC: 194 Level Scientific publications	Message 60212 - Posted: 28 Mar 2023, 0:39:31 UTC - in response to Message 60185. Last modified: 28 Mar 2023, 0:57:30 UTC I have problem with cmd. It exits with code 1 in 0 seconds. Boinc version is 7.22.0 from https://github.com/BOINC/boinc/releases/tag/client_release%2F7.22%2F7.22.0 ID: 60212 · Rating: 0 · rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0 Level Scientific publications	Message 60213 - Posted: 28 Mar 2023, 14:55:14 UTC I've got another very curious one. PTP1B_new_20670_2qbs_23466_T4_2A-QUICO_TEST_ATM-0-1-RND0584_1 It started running about 2 hours ago, and says it's passed 60% progress. But now it seems to be making much slower work of it. Looking at the run log, it started with MAX_SAMPLES: 114. The log entries run from 2023-03-20 06:25:58 - INFO - sync_re - Started: sample 1, replica 0 to 2023-03-20 11:09:35 - INFO - sync_re - Finished: sample 114 (duration: 149.1039990450081 s) 2023-03-20 11:09:35 - INFO - sync_re - Finished: ATM simulations (duration: 17016.784924168984 s) Then it appears to start again, this time with MAX_SAMPLES: 341, logging from 2023-03-28 13:25:11 - INFO - sync_re - Started: sample 115, replica 0 (this is roughly when the task started running on my machine) to, so far 2023-03-28 15:45:16 - INFO - sync_re - Finished: sample 142 (duration: 299.707962396089 s) Note that each sample is taking roughly twice as long to complete as the ones before 114 - presumably run on a different machine? The task is another resend, but the logging feels very strange. Is this how it's supposed to look? ID: 60213 · Rating: 0 · rate: / Reply Quote
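
For anyone who wants to compare per-sample timings from a run log like the one quoted above, here is a minimal sketch. It assumes the "Finished: sample N (duration: X s)" line format shown in the post; the function names sample_durations and mean_duration are hypothetical and not part of AToM.

```python
import re

# Matches lines such as:
#   ... - sync_re - Finished: sample 114 (duration: 149.1039990450081 s)
# The summary line "Finished: ATM simulations (...)" is deliberately not matched.
FINISHED = re.compile(r"Finished: sample (\d+) \(duration: ([0-9.]+) s\)")


def sample_durations(log_text):
    """Return {sample_number: duration_in_seconds} parsed from the log text."""
    return {int(m.group(1)): float(m.group(2)) for m in FINISHED.finditer(log_text)}


def mean_duration(durations, first, last):
    """Mean duration of samples numbered first..last inclusive, or None if none found."""
    picked = [d for s, d in durations.items() if first <= s <= last]
    return sum(picked) / len(picked) if picked else None
```

Comparing mean_duration(d, 1, 114) with mean_duration(d, 115, 341) on the full log would show whether the second part of the run really is about twice as slow per sample, as the two quoted lines suggest.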

KAMasud Send message Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0 Level Scientific publications	Message 60214 - Posted: 28 Mar 2023, 15:20:32 UTC - in response to Message 60210. None of my WUs from yesterday completed. Please issue a server abort and eliminate all these defective WUs before releasing a new set. Otherwise defects will keep wasting 8 computers time for days to come. The problem is not the time they take to run. No checkpointing. Fail if suspended and restarted. ______________ My problem with re-start and suspending is, these WUs are GPU intensive. As soon as one of these WUs pops up, my GPU fans let me know to do maintenance of the cooling system. I have laptops. I cannot take a blower on a running system. Now this WU for example has run for 21 hours and is at 34.5%. task 27440346 Edit. It is still running fine. _____________________________ The above-mentioned WU is at 71.8% and has been running now for 1 Day and 20 hours. It is still running fine and as I cannot read log files, you can go over what it has been doing once finished. I have marked no further WUs from GPUgrid. I will re-open after updates, etc which I have forced-paused. ID: 60214 · Rating: 0 · rate: / Reply Quote

Keith Myers Send message Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 0 Level Scientific publications	Message 60215 - Posted: 28 Mar 2023, 17:18:27 UTC Last modified: 28 Mar 2023, 17:20:17 UTC Looked at the errored tasks list on my account this morning and see another slew of badly misconfigured tasks went out. Been seeing a lot of file not found errors. FileNotFoundError: [Errno 2] No such file or directory: 'MCL1_new_2-50-60_0.xml' Thankfully they fail fast and are purged shortly after working through the _7 iteration. ID: 60215 · Rating: 0 · rate: / Reply Quote

KAMasud Send message Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0 Level Scientific publications	Message 60216 - Posted: 28 Mar 2023, 23:51:18 UTC - in response to Message 60214. None of my WUs from yesterday completed. Please issue a server abort and eliminate all these defective WUs before releasing a new set. Otherwise defects will keep wasting 8 computers time for days to come. The problem is not the time they take to run. No checkpointing. Fail if suspended and restarted. ______________ My problem with re-start and suspending is, these WUs are GPU intensive. As soon as one of these WUs pops up, my GPU fans let me know to do maintenance of the cooling system. I have laptops. I cannot take a blower on a running system. Now this WU for example has run for 21 hours and is at 34.5%. task 27440346 Edit. It is still running fine. _____________________________ The above-mentioned WU is at 71.8% and has been running now for 1 Day and 20 hours. It is still running fine and as I cannot read log files, you can go over what it has been doing once finished. I have marked no further WUs from GPUgrid. I will re-open after updates, etc which I have forced-paused. ________________ Completed after two days, four hours and forty minutes. Now there is another problem. One task is showing 100% completed for the last four hours but it is still using the CPU for something. Not the GPU. The elapsed clock is still ticking but the remaining is zero. ID: 60216 · Rating: 0 · rate: / Reply Quote