ATM

Message boards : News : ATM

Keith Myers
Message 60197 - Posted: 27 Mar 2023, 4:03:36 UTC - in response to Message 60196.  

It looks like you got bit by a permission error.

PermissionError: [Errno 13] Permission denied: 'r0/Jnk1_new_2-18659-18634_ckpt.xml'

Your boinc.service file might be an old version that does not let applications access the .tmp directory, or something like that.
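
One way to narrow this down is to check whether the user the BOINC client runs as can actually create a file in the directory where the checkpoint is written. A minimal Python sketch; the helper name can_write_in and the directory passed to it are assumptions for illustration, not part of BOINC or the ATM app:

```python
import tempfile


def can_write_in(directory):
    """Return True if a temporary file can be created inside `directory`."""
    try:
        # Creating (and automatically deleting) a real file exercises the
        # same kind of operation the app performs when it writes a checkpoint.
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers PermissionError (Errno 13) as well as a missing directory.
        return False


if __name__ == "__main__":
    # Hypothetical usage: run this as the same user as the BOINC client,
    # from inside the task's slot directory.
    print(can_write_in("."))
```

If this reports False for the directory in the error message, the problem is on the host (service file or directory permissions) rather than in the task itself.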
Bedrich Hajek

Message 60198 - Posted: 27 Mar 2023, 7:03:07 UTC - in response to Message 60197.  

The Boinc version is 7.20.7.

https://www.gpugrid.net/hosts_user.php?userid=19626


Bedrich Hajek

Message 60199 - Posted: 27 Mar 2023, 7:33:49 UTC

Another task failed.

https://www.gpugrid.net/result.php?resultid=33383003

03/27/2023 3:20:22 AM | GPUGRID | Computation for task MCL1_new_2_29_27_OFF_4-QUICO_ATM_OPENFF-0-1-RND5141_5 finished
03/27/2023 3:20:22 AM | GPUGRID | Output file MCL1_new_2_29_27_OFF_4-QUICO_ATM_OPENFF-0-1-RND5141_5_0 for task MCL1_new_2_29_27_OFF_4-QUICO_ATM_OPENFF-0-1-RND5141_5 absent

Richard Haselgrove

Message 60200 - Posted: 27 Mar 2023, 8:04:20 UTC - in response to Message 60199.  

The output file will always be absent if the task fails - it doesn't get as far as writing it. The actual error is in the online report:

ValueError: Energy is NaN.

('Not a Number')

That's a science problem - not your fault.
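
For anyone unfamiliar with NaN: it is the floating-point value produced by undefined operations, and it is not even equal to itself, so code has to test for it explicitly. A minimal Python sketch of such a guard; check_energy is a hypothetical name, and this is not the ATM app's actual code:

```python
import math


def check_energy(energy):
    """Return `energy` unchanged, or raise if it is NaN."""
    if math.isnan(energy):
        raise ValueError("Energy is NaN.")
    return energy


if __name__ == "__main__":
    nan = float("inf") - float("inf")  # an undefined operation yields NaN
    print(nan != nan)                  # NaN never compares equal to itself
    print(check_energy(-1234.5))       # an ordinary value passes through
```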
Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist

Message 60201 - Posted: 27 Mar 2023, 8:48:51 UTC - in response to Message 60171.  
Last modified: 27 Mar 2023, 8:49:14 UTC

I've seen that you are unhappy with the last batch of runs, since they take too much time. I've been experimenting with dividing the runs into different numbers of steps, looking for a sweet spot that you're happy with and that doesn't make it madness for me to organize all these runs and re-runs. I'll go back to the setting we had before. Apologies for that.


I can't answer immediately on the termination question, but it's all open-source and I can look through it. In this case, it's more complicated, because BOINC will talk to the wrapper, and the wrapper will talk to the science app.

But the basic idea is that BOINC will send a request to terminate over the API, and wait for the application to close itself down as it sees fit. Actual signals will only be used to force termination in the case of an unconditional quit, such as an operating system closedown.


Right, probably the wrapper should send a termination signal to AToM.

We have of course access to AToM's sources https://github.com/Gallicchio-Lab/AToM-OpenMM and we can make sure that it checkpoints appropriately when it receives the signal.

However, I do not have access to the wrapper. Quico: please advise.


I'll ask Raimondas about this and the other things that have been mentioned since he's the one taking care of this issue.
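
To make the "checkpoint when the signal arrives" idea concrete, here is a minimal Python sketch. It assumes the wrapper would deliver SIGTERM to the science app, which is exactly the open question above; request_stop, run and save_checkpoint are hypothetical names, not taken from AToM-OpenMM or the BOINC wrapper:

```python
import signal

stop_requested = False


def request_stop(signum, frame):
    """Signal handler: only set a flag; the main loop does the real work."""
    global stop_requested
    stop_requested = True


def run(num_samples, first_sample=1, save_checkpoint=print):
    """Run samples until done or until a stop was requested.

    Returns the number of the last sample that was completed.
    """
    for isample in range(first_sample, num_samples + 1):
        # ... one sample of simulation work would go here ...
        if stop_requested:
            save_checkpoint(isample)  # write state before exiting
            return isample
    return num_samples


if __name__ == "__main__":
    # Assumption: the termination request reaches the app as SIGTERM.
    signal.signal(signal.SIGTERM, request_stop)
    run(3)
```

The handler deliberately does nothing except set a flag, so the checkpoint is always written between samples and never in the middle of one.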
Quico
Message 60202 - Posted: 27 Mar 2023, 8:52:22 UTC

I've read some people mentioning that the reporter doesn't work, or that it goes over 100%. Does it work correctly for anyone?
Richard Haselgrove

Message 60203 - Posted: 27 Mar 2023, 9:08:39 UTC - in response to Message 60202.  

It varies from task to task - or, I suspect, from batch to batch. I mentioned a specific problem with a JNK1 task - task 33380692 - but it's not a general problem.

I suspect that it may have been a specific problem with setting the data that drives the progress %age calculation - the wrong expected 'total number of samples' may have been used.
Quico
Message 60204 - Posted: 27 Mar 2023, 9:37:43 UTC - in response to Message 60203.  
Last modified: 27 Mar 2023, 9:38:19 UTC

This one is a rerun, meaning that 2/3 of the run was previously simulated.
Maybe it was expecting to start from 0 samples and, once it saw that we were at 228 from the beginning, it got confused.

I'll mention that.

PS: But other runs have been reporting correctly?
bibi

Message 60205 - Posted: 27 Mar 2023, 9:43:06 UTC

https://www.gpugrid.net/result.php?resultid=33382097

I had to suspend this task at sample 149 and resumed it an hour later, but it started again with the Python install step and died. It should restart with sample 149.
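
A minimal Python sketch of the behaviour I would expect, assuming the app records the last completed sample in a small file. The file format and the names load_last_sample, save_last_sample and samples_to_run are made up for illustration; the real app keeps its own checkpoint files:

```python
import json
import os


def load_last_sample(path):
    """Return the last completed sample recorded in `path`, or 0 if none."""
    try:
        with open(path) as fh:
            return int(json.load(fh)["last_sample"])
    except (FileNotFoundError, KeyError, TypeError, ValueError):
        # Missing or unreadable checkpoint: start from the beginning.
        return 0


def save_last_sample(path, isample):
    """Record `isample` as completed, replacing the old file atomically."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump({"last_sample": isample}, fh)
    os.replace(tmp, path)  # avoids a half-written file if interrupted


def samples_to_run(path, num_samples):
    """Samples still to be done, continuing after the recorded one."""
    return range(load_last_sample(path) + 1, num_samples + 1)
```

With sample 148 recorded as completed, samples_to_run would begin at 149 instead of going back to the install step.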
bibi

Message 60206 - Posted: 27 Mar 2023, 9:47:44 UTC - in response to Message 60204.  

see post https://www.gpugrid.net/forum_thread.php?id=5379&nowrap=true#60160

it should be
progress = float(isample)/float(num_samples)
Richard Haselgrove

Message 60207 - Posted: 27 Mar 2023, 10:22:32 UTC - in response to Message 60206.  

Or possibly
progress = float(isample - last_sample)/float(num_samples - last_sample)

if you want a truncated resend to start from 0% - but might that affect paused/resumed tasks as well?
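
A minimal Python sketch comparing the two formulas. The numbers are illustrative assumptions (a rerun that starts at sample 228 out of 341), and the function names are made up; this is not the app's actual reporting code:

```python
def progress_absolute(isample, num_samples):
    """Fraction of the whole run that is done, counting earlier segments."""
    return float(isample) / float(num_samples)


def progress_since_resume(isample, num_samples, last_sample):
    """Fraction of this segment that is done, starting from 0 at last_sample."""
    remaining = num_samples - last_sample
    if remaining <= 0:
        return 1.0  # nothing left to do in this segment
    return float(isample - last_sample) / float(remaining)


if __name__ == "__main__":
    num_samples = 341  # assumed total for the whole run
    last_sample = 228  # assumed sample the rerun starts from
    for isample in (228, 285, 341):
        print(isample,
              round(progress_absolute(isample, num_samples), 3),
              round(progress_since_resume(isample, num_samples, last_sample), 3))
```

Both versions stay between 0 and 1 as long as isample runs from last_sample to num_samples. The second one depends on where last_sample comes from: if it is re-read from the latest checkpoint every time the app starts, a paused and resumed task would drop back to 0% too.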
Aurum
Message 60208 - Posted: 27 Mar 2023, 13:49:57 UTC

None of my WUs from yesterday completed. Please issue a server abort and eliminate all these defective WUs before releasing a new set. Otherwise the defective WUs will keep wasting 8 computers' time for days to come.

The problem is not the time they take to run.
No checkpointing.
Fail if suspended and restarted.
kksplace

Message 60209 - Posted: 27 Mar 2023, 14:28:18 UTC

The problem is not the time they take to run.
No checkpointing.
Fail if suspended and restarted


I agree with this. I had one error out on a restart two days ago after reaching nearly 100%, due to no checkpoints. Not only that, but it then only showed 37 seconds of CPU time, so it doesn't show what really happened. My latest one did complete but showed no checkpoints. Therefore the long run time is more of a risk, because there is more chance of an interruption.
KAMasud

Message 60210 - Posted: 27 Mar 2023, 16:05:56 UTC - in response to Message 60208.  
Last modified: 27 Mar 2023, 16:07:19 UTC

My problem with suspending and restarting is that these WUs are GPU-intensive. As soon as one of these WUs pops up, my GPU fans remind me that the cooling system needs maintenance. I have laptops, and I cannot use a blower on a running system.
Now this WU for example has run for 21 hours and is at 34.5%.
task 27440346
Edit. It is still running fine.
Keith Myers
Message 60211 - Posted: 27 Mar 2023, 16:41:41 UTC - in response to Message 60198.  

Not your fault. I got a couple of errored tasks with the same error as yours. A bad batch of tasks just went out.
kotenok2000

Message 60212 - Posted: 28 Mar 2023, 0:39:31 UTC - in response to Message 60185.  
Last modified: 28 Mar 2023, 0:57:30 UTC

I have a problem with cmd: it exits with code 1 in 0 seconds.
Boinc version is 7.22.0 from https://github.com/BOINC/boinc/releases/tag/client_release%2F7.22%2F7.22.0
Richard Haselgrove

Message 60213 - Posted: 28 Mar 2023, 14:55:14 UTC

I've got another very curious one.

PTP1B_new_20670_2qbs_23466_T4_2A-QUICO_TEST_ATM-0-1-RND0584_1

It started running about 2 hours ago, and says it's passed 60% progress. But now it seems to be making much slower work of it.

Looking at the run log, it started with MAX_SAMPLES: 114. The log entries run from

2023-03-20 06:25:58 - INFO - sync_re - Started: sample 1, replica 0
to
2023-03-20 11:09:35 - INFO - sync_re - Finished: sample 114 (duration: 149.1039990450081 s)
2023-03-20 11:09:35 - INFO - sync_re - Finished: ATM simulations (duration: 17016.784924168984 s)

Then it appears to start again, this time with MAX_SAMPLES: 341, logging from

2023-03-28 13:25:11 - INFO - sync_re - Started: sample 115, replica 0
(this is roughly when the task started running on my machine)
to, so far
2023-03-28 15:45:16 - INFO - sync_re - Finished: sample 142 (duration: 299.707962396089 s)

Note that each sample is taking roughly twice as long to complete as the ones before 114 - presumably those were run on a different machine?

The task is another resend, but the logging feels very strange. Is this how it's supposed to look?
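
For anyone who wants to make the same comparison on their own run log, here is a minimal Python sketch that pulls the per-sample durations out of lines like the ones quoted above. The regular expression is my own guess at the log format based on those lines, not something taken from the app's source:

```python
import re

# Matches e.g. "... Finished: sample 114 (duration: 149.1039990450081 s)"
FINISHED = re.compile(r"Finished: sample (\d+) \(duration: ([0-9.]+) s\)")


def sample_durations(lines):
    """Map sample number -> duration in seconds for every matching line."""
    durations = {}
    for line in lines:
        match = FINISHED.search(line)
        if match:
            durations[int(match.group(1))] = float(match.group(2))
    return durations


log = [
    "2023-03-20 11:09:35 - INFO - sync_re - Finished: sample 114 (duration: 149.1039990450081 s)",
    "2023-03-20 11:09:35 - INFO - sync_re - Finished: ATM simulations (duration: 17016.784924168984 s)",
    "2023-03-28 15:45:16 - INFO - sync_re - Finished: sample 142 (duration: 299.707962396089 s)",
]

durations = sample_durations(log)
ratio = durations[142] / durations[114]  # how much slower the later sample is
print(durations, ratio)
```

The "Finished: ATM simulations" line is skipped on purpose, because it reports the total for the earlier segment rather than a single sample.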
KAMasud

Message 60214 - Posted: 28 Mar 2023, 15:20:32 UTC - in response to Message 60210.  

The above-mentioned WU (task 27440346) is at 71.8% and has now been running for 1 day and 20 hours. It is still running fine and, as I cannot read log files, you can go over what it has been doing once it has finished.
I have set GPUGRID to fetch no further WUs. I will re-enable it after the updates etc., which I have force-paused, are done.
Keith Myers
Message 60215 - Posted: 28 Mar 2023, 17:18:27 UTC
Last modified: 28 Mar 2023, 17:20:17 UTC

I looked at the errored tasks list on my account this morning and saw that another slew of badly misconfigured tasks went out.

Been seeing a lot of file not found errors.

FileNotFoundError: [Errno 2] No such file or directory: 'MCL1_new_2-50-60_0.xml'

Thankfully they fail fast and are purged shortly after working through the _7 iteration.
KAMasud

Send message
Joined: 27 Jul 11
Posts: 138
Credit: 539,953,398
RAC: 0
Level
Lys
Scientific publications
watwat
Message 60216 - Posted: 28 Mar 2023, 23:51:18 UTC - in response to Message 60214.  

That WU completed after two days, four hours and forty minutes.
Now there is another problem. One task has been showing 100% completed for the last four hours, but it is still using the CPU for something, not the GPU. The elapsed clock is still ticking, but the remaining time is zero.

©2025 Universitat Pompeu Fabra