ATM

Message boards : News : ATM

Keith Myers
Message 60218 - Posted: 29 Mar 2023, 1:47:31 UTC - in response to Message 60216.  

This task PTP1B_23471_23468_2_2A-QUICO_TEST_ATM-0-1-RND8957_1 is currently doing the same on this host.

It has been at 100% complete for at least an hour now.

I know to just leave them alone and they will eventually finish and report as validated.

Bedrich Hajek
Message 60219 - Posted: 29 Mar 2023, 7:09:23 UTC

This task reached "100% complete" in about 7 hours, and then ran for more than 7 additional hours before actually finishing.

https://www.gpugrid.net/workunit.php?wuid=27442023


Anybody got that beat??????




Richard Haselgrove
Message 60220 - Posted: 29 Mar 2023, 7:20:09 UTC - in response to Message 60219.  
Last modified: 29 Mar 2023, 7:30:44 UTC

Anybody got that beat??????

The task I reported in Message 60213 (14:55 yesterday) is still running. It was approaching 100% when I went to bed last night, and it's still there this morning. I'll go and check it out after coffee (I can't see the sample numbers remotely).

As soon as I wrote that, it uploaded and reported! Ah well, my other Linux machine has got one in the same state.

KAMasud
Message 60223 - Posted: 29 Mar 2023, 8:23:53 UTC - in response to Message 60216.  
Last modified: 29 Mar 2023, 8:31:49 UTC

None of my WUs from yesterday completed. Please issue a server abort and eliminate all these defective WUs before releasing a new set. Otherwise the defects will keep wasting 8 computers' time for days to come.

The problem is not the time they take to run.
No checkpointing.
Fail if suspended and restarted.

______________

My problem with restarting and suspending is that these WUs are GPU intensive. As soon as one of these WUs pops up, my GPU fans let me know that the cooling system needs maintenance. I have laptops, and I cannot use a blower on a running system.
Now this WU, for example, has run for 21 hours and is at 34.5%.
task 27440346
Edit: It is still running fine.

_____________________________

The above-mentioned WU is at 71.8% and has now been running for 1 day and 20 hours. It is still running fine and, as I cannot read log files, you can go over what it has been doing once it has finished.
I have requested no further WUs from GPUgrid. I will re-open it after the updates etc., which I have force-paused.

________________

Completed after two days, four hours and forty minutes.
Now there is another problem. One task has been showing 100% completed for the last four hours, but it is still using the CPU for something, not the GPU. The elapsed clock is still ticking but the remaining time is zero.

_________________

Just woke up. The task was finished. Sent it home.
task 27441741

Richard Haselgrove
Message 60224 - Posted: 29 Mar 2023, 8:31:04 UTC

OK, it's the same story as yesterday. This task:

PTP1B_23486_23479_4_2A-QUICO_TEST_ATM-0-1-RND5081_2

downloaded at 15:26:54 UTC yesterday, and started running at about 16:30 UTC.

As before, the run.log shows a MAX_SAMPLES: 114, with timings that don't match my machine. The 16:30 run has MAX_SAMPLES: 341, and starts running with sample 115.

The machine downloaded a new task at 3:50:47 UTC: that normally happens around 85 - 90% progress, with an hour to run - but the existing one is still only at sample 308, so maybe three hours to go. And it's another PTP1B_new_ resend, so we may have to go round the cycle again.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Message 60225 - Posted: 29 Mar 2023, 9:22:38 UTC - in response to Message 60224.  

OK, it's the same story as yesterday. This task:

PTP1B_23486_23479_4_2A-QUICO_TEST_ATM-0-1-RND5081_2

downloaded at 15:26:54 UTC yesterday, and started running at about 16:30 UTC.

As before, the run.log shows a MAX_SAMPLES: 114, with timings that don't match my machine. The 16:30 run has MAX_SAMPLES: 341, and starts running with sample 115.

The machine downloaded a new task at 3:50:47 UTC: that normally happens around 85 - 90% progress, with an hour to run - but the existing one is still only at sample 308, so maybe three hours to go. And it's another PTP1B_new_ resend, so we may have to go round the cycle again.


I believe it's what I imagined. In the manual division I was doing before, I was splitting some runs into 2 or 3 steps: 114 - 228 - 341 samples. If the job ID contains 2A or 3A, the task is most probably starting from a previous checkpoint, and the progress report is going crazy with it. I'll pass this on to Raimondas to see if he can take a look at it.

Our first priority is to have these job divisions done automatically, like ACEMD does; that way we can avoid these really long jobs for everyone. Doing this manually makes it really hard to track all the jobs and the resends. So I hope that in the next few days everything goes more smoothly.

Richard Haselgrove
Message 60226 - Posted: 29 Mar 2023, 12:12:55 UTC - in response to Message 60225.  

Thanks. Now I know what I'm looking for (and when), I was able to watch the next transition.

Task PTP1B_new_20669_2qbr_23472_T1_2A-QUICO_TEST_ATM-0-1-RND5753_3 started with a couple of 0.1% initial steps (as usual), but then jumped to 50.983%. It then moved on by 0.441% every five minutes or so.

The run.log shows the same figures as before: a pre-existing run of 114 samples, then the real work starts with sample 115, and should proceed to a max_sample of 341. The progress jumps match the completion of samples 115 - 120.

The %age intervals match the formula in Emilio Gallicchio's post 60160 (115/(341-114)), but I can't see where the initial big value of 50.983 comes from.

Also, I don't follow the logic of the resend explanation. Mine is replication _3, so there have been 3 previous attempts - but none of them got beyond the program setup stages: all failed in less than 100 seconds. So who did the first 114 samples?

Ian&Steve C.
Message 60227 - Posted: 29 Mar 2023, 12:26:14 UTC - in response to Message 60226.  

The %age intervals match the formula in Emilio Gallicchio's post 60160 (115/(341-114)), but I can't see where the initial big value of 50.983 comes from.


115/(341-114) = 0.5066 = 50.66%

strikingly close. maybe "BOINC logic" in some form of rounding. but it's pretty clear that the 50% value is coming from this calculation.
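
The arithmetic in the posts above can be checked with a short Python sketch. It assumes, going only by the figures quoted in this thread, that the progress fraction is computed as the current sample number divided by (MAX_SAMPLES minus the samples done in earlier steps). The function names are hypothetical and this is not the project's actual code.

```python
def reported_fraction(sample, max_samples, prior_samples):
    """Progress as it appears to be reported (an assumption from the thread):
    current sample number over the number of samples in this step, capped at 1."""
    return min(1.0, sample / (max_samples - prior_samples))


def fraction_of_this_step(sample, max_samples, prior_samples):
    """Hypothetical alternative: share of this step's own samples reached so far."""
    return (sample - prior_samples) / (max_samples - prior_samples)


# Figures from the run.log described above: 114 earlier samples, MAX_SAMPLES 341.
first_jump = reported_fraction(115, 341, 114)  # roughly 0.5066
per_sample = 1 / (341 - 114)                   # roughly 0.0044
print(f"{first_jump:.4%} at sample 115, then {per_sample:.3%} per sample")
```

Under the same assumption the reported value would hit 100% at sample 227, with samples 228 to 341 still to run, which would be consistent with tasks sitting at 100% for hours before finishing.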


Richard Haselgrove
Message 60228 - Posted: 29 Mar 2023, 12:38:37 UTC - in response to Message 60227.  

I thought I'd checked that, and got a different answer, but my mouse must have slipped on the calculator buttons.

The difference is probably the 0.2% program setup stages - it'll do. Thanks.

Richard Haselgrove
Message 60229 - Posted: 29 Mar 2023, 14:42:44 UTC

After that, it failed after 3 hours 20 minutes with a 'ValueError: Energy is NaN' error. Never mind - I tried.

kotenok2000
Message 60230 - Posted: 29 Mar 2023, 17:59:59 UTC - in response to Message 60229.  
Last modified: 29 Mar 2023, 18:27:03 UTC

Running C:/Windows/system32/cmd.exe creates a c:\users\frolo\.exe\ folder.
On subsequent runs it gives the error "A subdirectory or file .exe already exists."

C:/Windows/system32/cmd.exe /c call test.bat outputs:
The syntax of the command is incorrect.

C:\Windows\system32\cmd.exe /c call test.bat outputs:
'test.bat' is not recognized as an internal or external command,
operable program or batch file.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Message 60233 - Posted: 30 Mar 2023, 9:51:23 UTC - in response to Message 60226.  

Thanks. Now I know what I'm looking for (and when), I was able to watch the next transition.

Task PTP1B_new_20669_2qbr_23472_T1_2A-QUICO_TEST_ATM-0-1-RND5753_3 started with a couple of 0.1% initial steps (as usual), but then jumped to 50.983%. It then moved on by 0.441% every five minutes or so.

The run.log shows the same figures as before: a pre-existing run of 114 samples, then the real work starts with sample 115, and should proceed to a max_sample of 341. The progress jumps match the completion of samples 115 - 120.

The %age intervals match the formula in Emilio Gallicchio's post 60160 (115/(341-114)), but I can't see where the initial big value of 50.983 comes from.

Also, I don't follow the logic of the resend explanation. Mine is replication _3, so there have been 3 previous attempts - but none of them got beyond the program setup stages: all failed in less than 100 seconds. So who did the first 114 samples?


The first 114 samples should have been calculated by: T_PTP1B_new_20669_2qbr_23472_1A_3-QUICO_TEST_ATM-0-1-RND2542_0.tar.bz2
I've been doing all the division and resends manually, and we've been simplifying the naming convention for my sake. Now we are testing a multiple_steps protocol, just like in AceMD, which should ease things and, I hope, interfere less with the progress reporter.

Richard Haselgrove
Message 60234 - Posted: 30 Mar 2023, 11:31:27 UTC - in response to Message 60233.  

Thanks. Be aware that out here in client-land we can only locate jobs by WU or task ID numbers - it's extremely difficult to find a task by name unless we can follow an ID chain.

Newer versions of the BOINC website tools do provide a rudimentary 'search by name' facility, but it requires a full task name - no wildcards or partial matches. And I know your colleagues on this project are very wary about updating the server code. We'll just have to live with it.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Message 60237 - Posted: 30 Mar 2023, 18:08:57 UTC - in response to Message 60234.  

Yeah, I'm sorry about that. I'm trying to learn as I go.

I'll be sending (and have already sent) some runs through the ATMbeta app. We tested the multiple_steps code and it seems to work fine. If everything runs smoothly, every task should be a 70-sample run (~13 ns), which should be much shorter for everyone and avoid the drag of the 24h+ runs.

Richard Haselgrove
Message 60238 - Posted: 30 Mar 2023, 18:22:27 UTC - in response to Message 60237.  

Two downloaded, the first has reached 6% with no problems.

KAMasud
Message 60239 - Posted: 30 Mar 2023, 18:43:26 UTC - in response to Message 60237.  

Yeah I'm sorry about that. I'm trying to learn as I go.

I'll be sending (and already sent) some runs through the ATMbeta app. We tested the multiple_steps code and it seems to work fine. That way if everything runs smoothly everything should get 70 sample runs(~13ns), which should be much shorter for everyone and avoid the drag of the +24h runs.


____________________

The problems are unstable tasks, restart problems and suspend problems. Quite a few of us have done year-plus runs on Climate. 24-hour runs are no problem.

Erich56
Message 60240 - Posted: 30 Mar 2023, 19:42:00 UTC
Last modified: 30 Mar 2023, 20:12:31 UTC

deleted

Keith Myers
Message 60241 - Posted: 30 Mar 2023, 20:11:40 UTC - in response to Message 60237.  

I believe I just finished one of these ATMbeta tasks.

https://www.gpugrid.net/result.php?resultid=33393179

It never checkpointed, but it did show correct estimates of the time to finish, and the progress was correct and incremented correctly.

Aurum
Message 60242 - Posted: 30 Mar 2023, 21:07:11 UTC - in response to Message 60241.  
Last modified: 30 Mar 2023, 21:07:30 UTC

I believe I just finished one of these ATMbeta tasks.

https://www.gpugrid.net/result.php?resultid=33393179

It never checkpointed but it did show correct estimations of time to finish plus the progress was correct and incremented correctly.

Same for me with Linux. Since there's no checkpointing, I didn't bother to test suspending. I think all Windows WUs failed.

Richard Haselgrove
Message 60243 - Posted: 31 Mar 2023, 7:14:27 UTC
Last modified: 31 Mar 2023, 8:09:36 UTC

My current two ATM betas both have MAX_SAMPLES: +70 - but one started at 71, and the other at 141.

Both are displaying 100% progress. I watched one jump to 100% after about enough time to load the program and complete 1 sample: the other I would expect to finish within half an hour (it's on sample 205).

Edit - yes, it did. I see you've put step information in the task names: these were

PTP1B_20669_2qbr_23466_2-QUICO_ATM_OFF_STEPS-1-5-RND8189_0
PTP1B_23467_23475_4-QUICO_ATM_OFF_STEPS-2-5-RND5806_0
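
One way to read the step information in these names is sketched below in Python. It assumes that the two numbers before RND are a zero-based step index and the total number of steps, and that each step covers 70 samples (the MAX_SAMPLES: +70 above). This reading is inferred from the start samples 71 and 141 reported in this post and is not confirmed by the project; the helper names are hypothetical.

```python
import re

# Assumed layout: <name>-<step>-<total_steps>-RND<number>_<replication>
TASK_NAME = re.compile(r"-(\d+)-(\d+)-RND(\d+)_(\d+)$")


def parse_steps(task_name):
    """Return (step, total_steps, replication) parsed from a task name."""
    match = TASK_NAME.search(task_name)
    if match is None:
        raise ValueError(f"unrecognised task name: {task_name}")
    step, total, _rnd, replication = (int(g) for g in match.groups())
    return step, total, replication


def sample_range(step, samples_per_step=70):
    """First and last sample of a zero-based step (assumes equal-sized steps)."""
    first = step * samples_per_step + 1
    return first, first + samples_per_step - 1


for name in (
    "PTP1B_20669_2qbr_23466_2-QUICO_ATM_OFF_STEPS-1-5-RND8189_0",
    "PTP1B_23467_23475_4-QUICO_ATM_OFF_STEPS-2-5-RND5806_0",
):
    step, total, _ = parse_steps(name)
    print(name, "step", step, "of", total, "samples", sample_range(step))
```

If that reading holds, a task at step 2 would run samples 141 to 210, which fits the task reported above as being on sample 205 shortly before finishing.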

©2025 Universitat Pompeu Fabra