ATM
Author | Message |
---|---|
Joined: 13 Dec 17 Posts: 1416 Credit: 9,119,446,190 RAC: 678,713 |
It looks like you got bit by a permission error. PermissionError: [Errno 13] Permission denied: 'r0/Jnk1_new_2-18659-18634_ckpt.xml' Your boinc.service file might be an old version that does not let applications access the .tmp directory or something. |
Joined: 28 Mar 09 Posts: 490 Credit: 11,731,645,728 RAC: 52,725 |
"It looks like you got bit by a permission error." The BOINC version is 7.20.7. https://www.gpugrid.net/hosts_user.php?userid=19626 |
Joined: 28 Mar 09 Posts: 490 Credit: 11,731,645,728 RAC: 52,725 |
Another task failed. https://www.gpugrid.net/result.php?resultid=33383003 03/27/2023 3:20:22 AM | GPUGRID | Computation for task MCL1_new_2_29_27_OFF_4-QUICO_ATM_OPENFF-0-1-RND5141_5 finished 03/27/2023 3:20:22 AM | GPUGRID | Output file MCL1_new_2_29_27_OFF_4-QUICO_ATM_OPENFF-0-1-RND5141_5_0 for task MCL1_new_2_29_27_OFF_4-QUICO_ATM_OPENFF-0-1-RND5141_5 absent |
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 326,008 |
The output file will always be absent if the task fails - it doesn't get as far as writing it. The actual error is in the online report: ValueError: Energy is NaN. ('Not a Number') That's a science problem - not your fault. |
Joined: 28 Feb 23 Posts: 35 Credit: 0 RAC: 0 |
I've seen that you are unhappy with the last batch of runs because they take too much time. I've been experimenting with dividing the runs into different numbers of steps to find a sweet spot that you're happy with and that isn't madness for me to organize, with all these runs and re-runs. I'll go back to the setting we had before. Apologies for that.
I'll ask Raimondas about this and the other things that have been mentioned since he's the one taking care of this issue. |
Joined: 28 Feb 23 Posts: 35 Credit: 0 RAC: 0 |
I've read some people mentioning that the reporter doesn't work or that it goes over 100%. Does it work correctly for someone? |
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 326,008 |
"I've read some people mentioning that the reporter doesn't work or that it goes over 100%. Does it work correctly for someone?" It varies from task to task - or, I suspect, from batch to batch. I mentioned a specific problem with a JNK1 task - task 33380692 - but it's not a general problem. I suspect that it may have been a specific problem with setting the data that drives the progress percentage calculation - the wrong expected 'total number of samples' may have been used. |
Joined: 28 Feb 23 Posts: 35 Credit: 0 RAC: 0 |
"I've read some people mentioning that the reporter doesn't work or that it goes over 100%. Does it work correctly for someone?" This one is a rerun, meaning that 2/3 of the run was previously simulated. Maybe it was expecting to start from 0 samples, and once it saw that we were at 228 from the beginning, it got confused. I'll mention that. PS: But have other runs been reporting correctly? |
Joined: 4 May 17 Posts: 15 Credit: 17,444,875,743 RAC: 222,959 |
https://www.gpugrid.net/result.php?resultid=33382097 I had to suspend this task at sample 149 and resumed it an hour later, but it started again with the Python install step and died. It should restart from sample 149. |
Joined: 4 May 17 Posts: 15 Credit: 17,444,875,743 RAC: 222,959 |
See post https://www.gpugrid.net/forum_thread.php?id=5379&nowrap=true#60160: it should be progress = float(isample)/float(num_samples) |
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 326,008 |
Or possibly progress = float(isample - last_sample)/float(num_samples - last_sample) if you want a truncated resend to start from 0% - but might that affect paused/resumed tasks as well? |
Joined: 12 Jul 17 Posts: 404 Credit: 17,408,899,587 RAC: 2 |
None of my WUs from yesterday completed. Please issue a server abort and eliminate all these defective WUs before releasing a new set. Otherwise defects will keep wasting 8 computers time for days to come. The problem is not the time they take to run. No checkpointing. Fail if suspended and restarted. |
Joined: 4 Mar 18 Posts: 53 Credit: 2,815,476,011 RAC: 0 |
"The problem is not the time they take to run." I agree with this. I had one error out on a restart two days ago after reaching nearly 100%, due to no checkpoints. Not only that, but it then only showed 37 seconds of CPU time, so it doesn’t show what really happened. My latest one did complete but showed no checkpoints. So the long run time is mainly a risk because of potential interruptions. |
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0 |
"None of my WUs from yesterday completed. Please issue a server abort and eliminate all these defective WUs before releasing a new set. Otherwise defects will keep wasting 8 computers time for days to come." My problem with restarting and suspending is that these WUs are GPU-intensive. As soon as one of these WUs pops up, my GPU fans let me know to do maintenance of the cooling system. I have laptops. I cannot use a blower on a running system. This WU, for example, has run for 21 hours and is at 34.5%: task 27440346. Edit: It is still running fine. |
Joined: 13 Dec 17 Posts: 1416 Credit: 9,119,446,190 RAC: 678,713 |
"It looks like you got bit by a permission error." Not your fault; I got a couple of errored tasks that duplicated yours. Just a bad batch of tasks went out. |
Joined: 18 Jul 13 Posts: 79 Credit: 210,528,292 RAC: 163 |
I have a problem with cmd: it exits with code 1 in 0 seconds. The BOINC version is 7.22.0 from https://github.com/BOINC/boinc/releases/tag/client_release%2F7.22%2F7.22.0 |
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 326,008 |
I've got another very curious one. PTP1B_new_20670_2qbs_23466_T4_2A-QUICO_TEST_ATM-0-1-RND0584_1 It started running about 2 hours ago, and says it's passed 60% progress. But now it seems to be making much slower work of it. Looking at the run log, it started with MAX_SAMPLES: 114. The log entries run from 2023-03-20 06:25:58 - INFO - sync_re - Started: sample 1, replica 0 to 2023-03-20 11:09:35 - INFO - sync_re - Finished: sample 114 (duration: 149.1039990450081 s) 2023-03-20 11:09:35 - INFO - sync_re - Finished: ATM simulations (duration: 17016.784924168984 s) Then it appears to start again, this time with MAX_SAMPLES: 341, logging from 2023-03-28 13:25:11 - INFO - sync_re - Started: sample 115, replica 0 (this is roughly when the task started running on my machine) to, so far, 2023-03-28 15:45:16 - INFO - sync_re - Finished: sample 142 (duration: 299.707962396089 s) Note that each sample is taking roughly twice as long to complete as the ones before 114 - presumably run on a different machine? The task is another resend, but the logging feels very strange. Is this how it's supposed to look? |
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0 |
"None of my WUs from yesterday completed. Please issue a server abort and eliminate all these defective WUs before releasing a new set. Otherwise defects will keep wasting 8 computers time for days to come." The above-mentioned WU is at 71.8% and has now been running for 1 day and 20 hours. It is still running fine, and as I cannot read log files, you can go over what it has been doing once it has finished. I have set my machines to accept no further WUs from GPUGRID. I will re-open after the updates etc. that I have force-paused. |
Joined: 13 Dec 17 Posts: 1416 Credit: 9,119,446,190 RAC: 678,713 |
Looked at the errored tasks list on my account this morning and see another slew of badly misconfigured tasks went out. Been seeing a lot of file not found errors. FileNotFoundError: [Errno 2] No such file or directory: 'MCL1_new_2-50-60_0.xml' Thankfully they fail fast and are purged shortly after working through the _7 iteration. |
Joined: 27 Jul 11 Posts: 138 Credit: 539,953,398 RAC: 0 |
"None of my WUs from yesterday completed. Please issue a server abort and eliminate all these defective WUs before releasing a new set. Otherwise defects will keep wasting 8 computers time for days to come." Completed after two days, four hours and forty minutes. Now there is another problem. One task has been showing 100% completed for the last four hours, but it is still using the CPU for something, not the GPU. The elapsed clock is still ticking but the remaining time is zero. |
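
The two progress formulas suggested earlier in this thread can be compared in a short Python sketch. This is only an illustration and not the project's actual reporter code: the function name `progress_fraction`, the default `last_sample=0`, and the clamping to the 0 to 1 range are assumptions, while `isample`, `num_samples` and `last_sample` are the names used in the posts. The sample counts in the usage lines are illustrative.

```python
def progress_fraction(isample, num_samples, last_sample=0):
    """Return task progress as a fraction between 0.0 and 1.0.

    isample     -- cumulative number of samples finished so far
    num_samples -- total number of samples expected for the whole run
    last_sample -- samples already finished before this task was sent out
                   (hypothetical parameter; 0 gives the simpler formula)
    """
    remaining = num_samples - last_sample
    if remaining <= 0:
        # Nothing left to simulate; avoid dividing by zero.
        return 1.0
    done = isample - last_sample
    # Clamp so that a wrong 'total number of samples' cannot report over 100%.
    return min(1.0, max(0.0, float(done) / float(remaining)))


# Simple formula: a resend that begins at sample 228 of 341 starts near 67%.
print(progress_fraction(228, 341))

# Offset formula: the same resend starts from 0% and ends at 100%.
print(progress_fraction(228, 341, last_sample=228))  # 0.0
print(progress_fraction(341, 341, last_sample=228))  # 1.0
```

On the question about paused and resumed tasks: if `last_sample` were re-read at every restart, a resumed task would presumably fall back to 0%. Recording it once, when the task first starts on a host, looks like one way to avoid that, but this is a guess about how the application could handle it.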
©2025 Universitat Pompeu Fabra