ATM

Author	Message
Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 0	Message 60218 - Posted: 29 Mar 2023, 1:47:31 UTC - in response to Message 60216.
This task, PTP1B_23471_23468_2_2A-QUICO_TEST_ATM-0-1-RND8957_1, is currently doing the same on this host. It has been at 100% complete for at least an hour now. I know to just leave them alone and they will eventually finish and report as validated.

Bedrich Hajek Joined: 28 Mar 09 Posts: 490 Credit: 11,850,145,728 RAC: 1,066	Message 60219 - Posted: 29 Mar 2023, 7:09:23 UTC
This task reached "100% complete" in about 7 hours, and then ran for an additional 7+ hours before actually finishing.
https://www.gpugrid.net/workunit.php?wuid=27442023
Anybody got that beat??????

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 60220 - Posted: 29 Mar 2023, 7:20:09 UTC - in response to Message 60219. Last modified: 29 Mar 2023, 7:30:44 UTC
> Anybody got that beat??????
The task I reported in Message 60213 (14:55 yesterday) is still running. It was approaching 100% when I went to bed last night, and it's still there this morning. I'll go and check it out after coffee (I can't see the sample numbers remotely).
As soon as I wrote that, it uploaded and reported! Ah well, my other Linux machine has got one in the same state.

KAMasud Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0	Message 60223 - Posted: 29 Mar 2023, 8:23:53 UTC - in response to Message 60216. Last modified: 29 Mar 2023, 8:31:49 UTC
None of my WUs from yesterday completed. Please issue a server abort and eliminate all these defective WUs before releasing a new set. Otherwise defects will keep wasting 8 computers' time for days to come. The problem is not the time they take to run. No checkpointing. Fail if suspended and restarted.
______________
My problem with re-start and suspending is, these WUs are GPU intensive. As soon as one of these WUs pops up, my GPU fans let me know to do maintenance of the cooling system. I have laptops. I cannot take a blower to a running system. Now this WU, for example, has run for 21 hours and is at 34.5%. task 27440346
Edit. It is still running fine.
_____________________________
The above-mentioned WU is at 71.8% and has been running now for 1 day and 20 hours. It is still running fine and, as I cannot read log files, you can go over what it has been doing once finished. I have marked no further WUs from GPUgrid. I will re-open after updates, etc., which I have forced-paused.
________________
Completed after two days, four hours and forty minutes. Now there is another problem. One task has been showing 100% completed for the last four hours but it is still using the CPU for something. Not the GPU. The elapsed clock is still ticking but the remaining time is zero.
_________________
Just woke up. The task was finished. Sent it home. task 27441741

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 60224 - Posted: 29 Mar 2023, 8:31:04 UTC
OK, it's the same story as yesterday. This task: PTP1B_23486_23479_4_2A-QUICO_TEST_ATM-0-1-RND5081_2 downloaded at 15:26:54 UTC yesterday, and started running at about 16:30 UTC. As before, the run.log shows a MAX_SAMPLES: 114, with timings that don't match my machine. The 16:30 run has MAX_SAMPLES: 341, and starts running with sample 115.
The machine downloaded a new task at 3:50:47 UTC: that normally happens around 85 - 90% progress, with an hour to run - but the existing one is still only at sample 308, so maybe three hours to go. And it's another PTP1B_new_ resend, so we may have to go round the cycle again.

Quico Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Joined: 28 Feb 23 Posts: 35 Credit: 0 RAC: 0	Message 60225 - Posted: 29 Mar 2023, 9:22:38 UTC - in response to Message 60224.
> OK, it's the same story as yesterday. This task: PTP1B_23486_23479_4_2A-QUICO_TEST_ATM-0-1-RND5081_2 downloaded at 15:26:54 UTC yesterday, and started running at about 16:30 UTC. As before, the run.log shows a MAX_SAMPLES: 114, with timings that don't match my machine. The 16:30 run has MAX_SAMPLES: 341, and starts running with sample 115.
> The machine downloaded a new task at 3:50:47 UTC: that normally happens around 85 - 90% progress, with an hour to run - but the existing one is still only at sample 308, so maybe three hours to go. And it's another PTP1B_new_ resend, so we may have to go round the cycle again.
I believe it's what I imagined. In the manual division I was doing before, I was splitting some runs into 2 or 3 steps: 114 - 228 - 341 samples. If the job ID has a 2A/3A, it's most probably starting from a previous checkpoint and the progress report is going crazy with it.
I'll pass this on to Raimondas to see if he can take a look at it. Our first priority is to make sure that these job divisions are done automatically, like ACEMD does; that way we can avoid these really long jobs for everyone. Doing this manually makes it really hard to track all the jobs and the resends. So I hope that in the next few days everything goes more smoothly.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 60226 - Posted: 29 Mar 2023, 12:12:55 UTC - in response to Message 60225.
Thanks. Now I know what I'm looking for (and when), I was able to watch the next transition. Task PTP1B_new_20669_2qbr_23472_T1_2A-QUICO_TEST_ATM-0-1-RND5753_3 started with a couple of 0.1% initial steps (as usual), but then jumped to 50.983%. It then moved on by 0.441% every five minutes or so. The run.log shows the same figures as before: a pre-existing run of 114 samples, then the real work starts with sample 115, and should proceed to a max_sample of 341. The progress jumps match the completion of samples 115 - 120. The %age intervals match the formula in Emilio Gallicchio's post 60160 (115/(341-114)), but I can't see where the initial big value of 50.983 comes from.
Also, I don't follow the logic of the resend explanation. Mine is replication _3, so there have been 3 previous attempts - but none of them got beyond the program setup stages: all failed in less than 100 seconds. So who did the first 114 samples?

Ian&Steve C. Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 0	Message 60227 - Posted: 29 Mar 2023, 12:26:14 UTC - in response to Message 60226.
> The %age intervals match the formula in Emilio Gallicchio's post 60160 (115/(341-114)), but I can't see where the initial big value of 50.983 comes from.
115/(341-114) = 0.5066 = 50.66%
Strikingly close. Maybe "BOINC logic" in some form of rounding, but it's pretty clear that the 50% value is coming from this calculation.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 60228 - Posted: 29 Mar 2023, 12:38:37 UTC - in response to Message 60227.
I thought I'd checked that, and got a different answer, but my mouse must have slipped on the calculator buttons. The difference is probably the 0.2% program setup stages - it'll do. Thanks.
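The arithmetic in the last few posts can be checked with a short sketch. It assumes, as the posts suggest, that progress is reported as sample / (MAX_SAMPLES - prior samples); the function and parameter names are hypothetical and are not taken from the ATM application itself.

```python
def reported_fraction(sample: int, max_samples: int, prior_samples: int) -> float:
    """Progress fraction as it appears to be computed in the thread (an assumption):
    the absolute sample number divided by the samples this task must run."""
    return sample / (max_samples - prior_samples)


# Figures from Message 60226: 114 pre-existing samples, work starts at 115, ends at 341.
first = reported_fraction(115, 341, 114)
step = reported_fraction(116, 341, 114) - first

print(f"first report after setup: {first:.3%}")
print(f"increment per sample:     {step:.3%}")
```

The per-sample increment comes out at roughly 0.441%, the step size observed in Message 60226, and the first value is close to the observed jump to about 50%. The small remaining gap to 50.983% is not explained by this sketch.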

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 60229 - Posted: 29 Mar 2023, 14:42:44 UTC
After that, it failed after 3 hours 20 minutes with a 'ValueError: Energy is NaN' error. Never mind - I tried.

kotenok2000 Joined: 18 Jul 13 Posts: 79 Credit: 241,278,292 RAC: 194	Message 60230 - Posted: 29 Mar 2023, 17:59:59 UTC - in response to Message 60229. Last modified: 29 Mar 2023, 18:27:03 UTC
The C:/Windows/system32/cmd.exe command creates a c:\users\frolo\.exe\ folder. On subsequent runs it gives an "A subdirectory or file .exe already exists." error.
C:/Windows/system32/cmd.exe /c call test.bat outputs "The syntax of the command is incorrect."
C:\Windows\system32\cmd.exe /c call test.bat outputs "'test.bat' is not recognized as an internal or external command, operable program or batch file."

Quico Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Joined: 28 Feb 23 Posts: 35 Credit: 0 RAC: 0	Message 60233 - Posted: 30 Mar 2023, 9:51:23 UTC - in response to Message 60226.
> Thanks. Now I know what I'm looking for (and when), I was able to watch the next transition. Task PTP1B_new_20669_2qbr_23472_T1_2A-QUICO_TEST_ATM-0-1-RND5753_3 started with a couple of 0.1% initial steps (as usual), but then jumped to 50.983%. It then moved on by 0.441% every five minutes or so. The run.log shows the same figures as before: a pre-existing run of 114 samples, then the real work starts with sample 115, and should proceed to a max_sample of 341. The progress jumps match the completion of samples 115 - 120. The %age intervals match the formula in Emilio Gallicchio's post 60160 (115/(341-114)), but I can't see where the initial big value of 50.983 comes from.
> Also, I don't follow the logic of the resend explanation. Mine is replication _3, so there have been 3 previous attempts - but none of them got beyond the program setup stages: all failed in less than 100 seconds. So who did the first 114 samples?
The first 114 samples should be calculated by:
T_PTP1B_new_20669_2qbr_23472_1A_3-QUICO_TEST_ATM-0-1-RND2542_0.tar.bz2
I've been doing all the division and resends manually, and we've been simplifying the naming convention for my sake. Now we are testing a multiple_steps protocol, just like in AceMD, which should help ease things and, I hope, mess less with the progress reporter.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 60234 - Posted: 30 Mar 2023, 11:31:27 UTC - in response to Message 60233.
Thanks. Be aware that out here in client-land we can only locate jobs by WU or task ID numbers - it's extremely difficult to find a task by name unless we can follow an ID chain. Newer versions of the BOINC website tools do provide a rudimentary 'search by name' facility, but it requires a full task name - no wildcards or partial matches. And I know your colleagues on this project are very wary about updating the server code. We'll just have to live with it.

Quico Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Joined: 28 Feb 23 Posts: 35 Credit: 0 RAC: 0	Message 60237 - Posted: 30 Mar 2023, 18:08:57 UTC - in response to Message 60234.
Yeah, I'm sorry about that. I'm trying to learn as I go.
I'll be sending (and have already sent) some runs through the ATMbeta app. We tested the multiple_steps code and it seems to work fine. That way, if everything runs smoothly, everything should get 70-sample runs (~13 ns), which should be much shorter for everyone and avoid the drag of the +24h runs.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 60238 - Posted: 30 Mar 2023, 18:22:27 UTC - in response to Message 60237.
Two downloaded, the first has reached 6% with no problems.

KAMasud Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0	Message 60239 - Posted: 30 Mar 2023, 18:43:26 UTC - in response to Message 60237.
> Yeah, I'm sorry about that. I'm trying to learn as I go.
> I'll be sending (and have already sent) some runs through the ATMbeta app. We tested the multiple_steps code and it seems to work fine. That way, if everything runs smoothly, everything should get 70-sample runs (~13 ns), which should be much shorter for everyone and avoid the drag of the +24h runs.
____________________
The issue is unstable tasks, restart problems, and suspend problems. Quite a few of us have done year-plus runs on Climate. 24-hour runs are no problem.

Erich56 Joined: 1 Jan 15 Posts: 1171 Credit: 12,662,148,501 RAC: 3,588	Message 60240 - Posted: 30 Mar 2023, 19:42:00 UTC Last modified: 30 Mar 2023, 20:12:31 UTC
deleted

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 0	Message 60241 - Posted: 30 Mar 2023, 20:11:40 UTC - in response to Message 60237.
I believe I just finished one of these ATMbeta tasks.
https://www.gpugrid.net/result.php?resultid=33393179
It never checkpointed, but it did show correct estimations of time to finish, plus the progress was correct and incremented correctly.

Aurum Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 32	Message 60242 - Posted: 30 Mar 2023, 21:07:11 UTC - in response to Message 60241. Last modified: 30 Mar 2023, 21:07:30 UTC
> I believe I just finished one of these ATMbeta tasks.
> https://www.gpugrid.net/result.php?resultid=33393179
> It never checkpointed, but it did show correct estimations of time to finish, plus the progress was correct and incremented correctly.
Same for me with Linux. Since there's no checkpointing I didn't bother to test suspending.
I think all Windows WUs failed.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 60243 - Posted: 31 Mar 2023, 7:14:27 UTC Last modified: 31 Mar 2023, 8:09:36 UTC
My current two ATM betas both have MAX_SAMPLES: +70 - but one started at 71, and the other at 141. Both are displaying 100% progress. I watched one jump to 100% after about enough time to load the program and complete 1 sample: the other I would expect to finish within half an hour (it's on sample 205). Edit - yes, it did.
I see you've put step information in the task names: these were
PTP1B_20669_2qbr_23466_2-QUICO_ATM_OFF_STEPS-1-5-RND8189_0
PTP1B_23467_23475_4-QUICO_ATM_OFF_STEPS-2-5-RND5806_0
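One possible reading of the immediate jump to 100% is sketched below. It rests on several assumptions that the thread does not confirm: that progress is still computed as sample / (MAX_SAMPLES - prior samples), as discussed around Message 60227; that "+70" means each step ends 70 samples after its starting point; and that the displayed value is capped at 100%. All function names are hypothetical, and the offset-corrected variant is only an illustration, not the project's actual fix.

```python
def displayed_percent(sample: int, max_samples: int, prior_samples: int) -> float:
    """Assumed current behaviour: absolute sample number over this step's length,
    capped at 100%."""
    raw = sample / (max_samples - prior_samples)
    return min(raw, 1.0) * 100


def offset_percent(sample: int, max_samples: int, prior_samples: int) -> float:
    """Hypothetical alternative: count only the samples completed in this step."""
    done = sample - prior_samples
    total = max_samples - prior_samples
    return min(max(done / total, 0.0), 1.0) * 100


# The two tasks from Message 60243: one started at sample 71, the other at 141.
for start in (71, 141):
    prior = start - 1          # samples completed by earlier steps (assumed)
    max_samples = prior + 70   # assumed meaning of "MAX_SAMPLES: +70"
    print(start,
          displayed_percent(start, max_samples, prior),
          round(offset_percent(start, max_samples, prior), 2))
```

Under these assumptions the first completed sample already gives a ratio above 1, so both tasks would show 100% almost immediately, while the offset version would start near 1.4% and climb by the same amount per sample.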