
Message boards : News : ATM

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60002 - Posted: 3 Mar 2023 | 10:39:46 UTC

Hello GPUGRID!

You've already noticed that a new app called "ATM" has been deployed with some test runs. We are working on its validation and deployment, so expect more jobs for this app soon. Let me briefly explain what this new app is about.

The ATM application

The name of the new application, ATM, stands for Alchemical Transfer Method, a methodology designed by Emilio Gallicchio et al. for absolute and relative binding affinity predictions. ATM allows us to estimate binding affinities for molecules against a specific protein, measuring how strongly they bind. It falls under the category of alchemical free energy calculation methods, in which unphysical intermediate states are used to estimate the free energy of physical processes (such as protein-ligand binding). The benefits of ATM, compared with other common free energy prediction methods (like the popular FEP), come from its simplicity: it can be used with any force field and does not require a lot of expertise to make it work properly.
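To give the generic picture (a textbook relation, not ATM's specific estimator): for two ligands $L_1$ and $L_2$ binding the same protein, a relative calculation targets the difference of the two binding free energies,

\Delta\Delta G_{\mathrm{bind}} = \Delta G_{\mathrm{bind}}(L_2) - \Delta G_{\mathrm{bind}}(L_1),

and the alchemical trick is to obtain it by simulating unphysical intermediate states that connect the two ligands, rather than simulating the two physical binding events directly.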

Measuring experimental binding affinities between candidate molecules and the targeted protein is one of the first steps in drug discovery projects, but synthesizing molecules and performing experiments is expensive. Having the capacity to perform computational binding affinity predictions, particularly during drug lead optimization, is therefore extremely valuable. We are actively working on testing and validating the ATM method so that we can start applying it to real drug discovery projects as soon as possible. Additionally, since these methods are usually applied to hundreds of molecules, they benefit greatly from the parallelization capabilities of GPUGRID, so if everything goes as expected, this app could send out a lot of work units.

The ATM app is Python-based, similar to the PythonRL application, and ships with its own dedicated Python environment.

Here are the two main references for the ATM method, for both absolute and relative binding affinity predictions:

Absolute binding free energy estimation with ATM: https://arxiv.org/pdf/2101.07894.pdf
Relative binding free energy estimation with ATM: https://pubs.acs.org/doi/10.1021/acs.jcim.1c01129

For now we can only send jobs to Linux machines, but we hope to have a Windows version soon.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60003 - Posted: 3 Mar 2023 | 10:40:19 UTC

I'm brand new to GPUGRID, so apologies in advance if I make some mistakes. I'm looking forward to learning from you all and discussing this app :)

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60005 - Posted: 3 Mar 2023 | 11:50:49 UTC

Welcome!

Let's start with some good news. I picked up one of your test tasks a couple of days ago.

T0_1-QUICO_TEST_ATM-0-1-RND8922_0

It ran right through without raising any red flags, and validated at the end. A good start.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60006 - Posted: 3 Mar 2023 | 13:21:48 UTC - in response to Message 60003.

Thanks for creating an official topic on these types of tasks.

The latest problem observed recently was uploads hanging because the result file was too big. It didn't cause an error, but the file never uploaded because its size exceeded the limit on your Apache server. The only resolution for the user was to abort the transfer and hope the task didn't get marked as an error.

Have you already addressed this issue, either by raising the Apache server's file size limit or by adjusting the tasks so they don't create such large result files?
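For context, if the cap really is enforced by the Apache server sitting in front of the BOINC file upload handler, the usual knob is the LimitRequestBody directive. This is an assumption about the server setup (it isn't visible from the client side), but a sketch would look like:

# in the vhost or <Directory>/<Location> block that serves the file upload handler (illustrative value)
LimitRequestBody 1073741824   # cap request bodies, i.e. uploaded result files, at ~1 GiB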
____________

Greger
Send message
Joined: 6 Jan 15
Posts: 74
Credit: 14,802,941,499
RAC: 24,702,098
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwat
Message 60007 - Posted: 3 Mar 2023 | 20:20:09 UTC
Last modified: 3 Mar 2023 | 21:19:31 UTC

Welcome, and thanks for the info, Quico.

I noticed that in a past batch the upload was halted by the server; the result upload was rejected.
I checked the client_state file and the file was below max_nbytes, but it still wasn't allowed to upload.

Historically, the maximum allowed file size has been 700 MB, and these results have been around 713-730 MB, so something else must control this cap. A change might help, but I don't see where the issue would be.

The event log for TL9_72-RAIMIS_TEST_ATM said:

<nbytes>729766132.000000</nbytes>
<max_nbytes>10000000000.000000</max_nbytes>


https://ibb.co/4pYBfNS
parsing upload result response <data_server_reply> <status>0</status> <file_size>0</file_size
error code -224 (permanent HTTP error)
https://ibb.co/T40gFR9

I will test again on new units, but I would probably face the same issue if the server hasn't changed.

https://boinc.berkeley.edu/trac/wiki/JobTemplates
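For reference, an output (result) template along the lines described on that wiki page looks roughly like this (file names here are illustrative; max_nbytes is the per-file upload cap the client enforces):

<output_template>
    <file_info>
        <name><OUTFILE_0/></name>
        <generated_locally/>
        <upload_when_present/>
        <max_nbytes>1000000000</max_nbytes>
        <url><UPLOAD_URL/></url>
    </file_info>
    <result>
        <file_ref>
            <file_name><OUTFILE_0/></file_name>
            <open_name>output.tar.bz2</open_name>
            <copy_file/>
        </file_ref>
    </result>
</output_template>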

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60009 - Posted: 4 Mar 2023 | 6:05:52 UTC - in response to Message 60007.

Historically, the maximum allowed file size has been 700 MB

Greger, are you sure it was 700 MB?
From what I remember, it was 500 MB.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60011 - Posted: 4 Mar 2023 | 9:10:39 UTC

I have one which is looking a bit poorly. It's 'running' on host 132158 (Linux Mint 21.1, GTX 1660 Super, 64 GB RAM), but it's only showing 3% progress after 18 hours.


(image from remote monitoring on a Windows computer)

Are there any files I can examine, or which would be useful to you for debugging - or should I simply abort it?

Dirk Broer
Send message
Joined: 4 Oct 09
Posts: 2
Credit: 116,659,519
RAC: 260,543
Level
Cys
Scientific publications
watwatwatwat
Message 60012 - Posted: 4 Mar 2023 | 9:50:47 UTC

I am trying to upload one, but can't get it to do the transfer:
Computer: MSI-B550-A-Pro
Project GPUGRID

Name TL9_82-RAIMIS_TEST_ATM-0-1-RND3943_1

Application ATM: Free energy calculations of protein-ligand binding 1.13 (cuda1121)
Workunit name TL9_82-RAIMIS_TEST_ATM-0-1-RND3943
State Uploading
Received 3/1/2023 4:46:17 PM
Report deadline 3/6/2023 4:46:16 PM
Estimated app speed 16.548,99 GFLOPs/sec
Estimated task size 1.000.000.000 GFLOPs
Resources 0,949 CPUs + 1 NVIDIA GPU
CPU time at last checkpoint 00:00:00
CPU time 05:27:34
Elapsed time 05:28:51
Estimated time remaining 00:00:00
Fraction done 100%
Virtual memory size 0,00 MB
Working set size 0,00 MB

Debug State: 4 - Scheduler: 0

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60013 - Posted: 4 Mar 2023 | 10:39:52 UTC

I think mine is a failure. Nothing has been written to stderr.txt since 14:22:59 UTC yesterday, and the final entries are:

+ echo 'Run AToM'
+ CONFIG_FILE=Tyk2_new_2-ejm_49-ejm_50_asyncre.cntl
+ python bin/rbfe_explicit_sync.py Tyk2_new_2-ejm_49-ejm_50_asyncre.cntl
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.

I'm aborting it.

NB a previous user also failed with a task from the same workunit: 27418556

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60022 - Posted: 6 Mar 2023 | 9:51:35 UTC - in response to Message 60013.

Thanks everyone for the replies!

From what I have seen of the single test job I personally sent, one replica finished without issues but the other two blew up (Particle coordinate is NaN). I find this strange because I have seen that error in the preparation stage I run locally, but not during production, and the errors should be different. I'll check a few things locally, since I changed a few things relative to my local runs, and we'll try again, also with different inputs.

Welcome, and thanks for the info, Quico.

I noticed that in a past batch the upload was halted by the server; the result upload was rejected.
I checked the client_state file and the file was below max_nbytes, but it still wasn't allowed to upload.

Historically, the maximum allowed file size has been 700 MB, and these results have been around 713-730 MB, so something else must control this cap. A change might help, but I don't see where the issue would be.

The event log for TL9_72-RAIMIS_TEST_ATM said:
<nbytes>729766132.000000</nbytes>
<max_nbytes>10000000000.000000</max_nbytes>


https://ibb.co/4pYBfNS
parsing upload result response <data_server_reply> <status>0</status> <file_size>0</file_size
error code -224 (permanent HTTP error)
https://ibb.co/T40gFR9

I will test again on new units, but I would probably face the same issue if the server hasn't changed.

https://boinc.berkeley.edu/trac/wiki/JobTemplates


Thanks for this, I'll keep that in mind. For the successful run, the output file size is 498 MB, so it should be right at the limit @Erich56 mentions. That's useful information for when I run bigger systems.

I think mine is a failure. Nothing has been written to stderr.txt since 14:22:59 UTC yesterday, and the final entries are:

+ echo 'Run AToM'
+ CONFIG_FILE=Tyk2_new_2-ejm_49-ejm_50_asyncre.cntl
+ python bin/rbfe_explicit_sync.py Tyk2_new_2-ejm_49-ejm_50_asyncre.cntl
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.

I'm aborting it.

NB a previous user also failed with a task from the same workunit: 27418556


Hmmm, that's weird. It shouldn't soft-lock at that step. Although this warning pops up, the run should keep going without issues. I'll ask around.

gemini8
Send message
Joined: 3 Jul 16
Posts: 31
Credit: 1,268,950,176
RAC: 2,808,530
Level
Met
Scientific publications
watwat
Message 60029 - Posted: 7 Mar 2023 | 11:43:14 UTC

This task didn't want to upload, and GPUGrid wouldn't update either when I aborted the upload.
I only got 24-hour time-outs.
____________
Greetings, Jens

STE\/E
Send message
Joined: 18 Sep 08
Posts: 368
Credit: 315,972,298
RAC: 120,276
Level
Asp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwat
Message 60035 - Posted: 8 Mar 2023 | 12:14:20 UTC

I just aborted one ATM WU (https://www.gpugrid.net/result.php?resultid=33338739) that had been running for over 7 days; it sat at 75% done the whole time. I got another one and it immediately jumped to 75% done. I'll probably just abort it and deselect any new ATM WUs ...
____________
STE\/E

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60036 - Posted: 8 Mar 2023 | 14:24:59 UTC
Last modified: 8 Mar 2023 | 14:40:30 UTC

Some still running, many failing.
Does ATM really just need one CPU?
I think I saw a new 1.1 GB executable DLing. Maybe the failures tried to run on the older version?
What are the VRAM and RAM minimum requirements for ATM?

Server Status shows both ATM and ATMbeta tasks but Tasks shows them all as ATM.
Strange, all my previously completed ATM WUs have vanished from my Tasks list?

Thanks for the papers, I'll read them later.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60037 - Posted: 8 Mar 2023 | 15:23:40 UTC

Three successive errors on host 132158

All with "python: can't open file '/hdd/boinc-client/slots/2/Scripts/rbfe_explicit_sync.py': [Errno 2] No such file or directory"

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60038 - Posted: 8 Mar 2023 | 16:56:45 UTC

I let some computers run out of all other WUs so they were just running 2 ATM WUs. It appears they do only use one CPU each, but that may just be a consequence of specifying a single CPU in the client_state.xml file. Might your ATM project benefit from using multiple CPUs?

<app_version>
<app_name>ATM</app_name>
<version_num>113</version_num>
<platform>x86_64-pc-linux-gnu</platform>
<avg_ncpus>1.000000</avg_ncpus>
<flops>46211986880283.171875</flops>
<plan_class>cuda1121</plan_class>
<api_version>7.7.0</api_version>
nvidia-smi reports that ATM 1.13 WUs use 550 to 568 MB of VRAM, so call it 0.6 GB of VRAM. BOINCtasks reports that all WUs use less than 1.2 GB of RAM. That means my computers could easily run up to 20 ATM WUs simultaneously.

Sadly, GPUGRID does not allow us to control the number of WUs we download, like LHC or WCG do, so we're stuck with the limit of 2 set by the ACEMD project. I never run more than a single Python WU on a computer, so I get two, abort one, and then have to uncheck Python in my GPUGRID preferences just in case ACEMD or ATM WUs materialize. I wonder how many years it's been since GG improved the UI to make it more user-friendly? When you open your Preferences you still get 2 Warnings and 2 Strict Standards notices that have never been fixed.

Please add a link to your applications: https://www.gpugrid.net/apps.php
____________

kksplace
Send message
Joined: 4 Mar 18
Posts: 53
Credit: 1,464,751,749
RAC: 3,464,677
Level
Met
Scientific publications
wat
Message 60039 - Posted: 8 Mar 2023 | 19:52:31 UTC

Is there a way to tell if an ATM WU is progressing? I have had only one succeed so far over the last several weeks. All of the failures so far were one of two types: either a failure to upload (with the transfer then aborted by me) or a simple "Error while computing", which happened very quickly.

However, I now have an ATM WU which has been processing for over seven hours. Looking at the WU properties, it shows the CPU time nearly equal to the elapsed time. The GPU shows processing spikes up to 99%, and the 'down' periods are short.

As others have reported, the Progress shows 75% steadily.

I am inclined to keep letting it compute, but want to know what behavior others have seen on successful ATM WUs.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60041 - Posted: 8 Mar 2023 | 21:02:08 UTC
Last modified: 8 Mar 2023 | 21:03:20 UTC

Let me explain something about the 75%, since it seems many don't understand what's happening here. The 75% is in no way an indication of how much the task has progressed. It is entirely a function of how BOINC interacts with the wrapper when the tasks are set up the way they are.

The wrapper uses a job.xml file to instruct BOINC on the different "subtasks" to perform over the course of a single task from the project. In the <task> element there is an option to add a <weight> argument. This tells BOINC how much "weight", as a percentage of total task completion, that subtask is worth: a weight of 1 equals 1%, and so on. If the weight argument is not defined, each subtask gets equal weight.

In the case of the ATM tasks, the job.xml file has four subtasks and no weights defined. The first three subtasks are just quick extraction and unpacking steps and complete quickly, which is why the tasks jump to 75% straight away. If a task is staying at 75% indefinitely, that's a pretty strong indication it is stuck and probably won't make more progress.

By comparison, the PythonGPU tasks have two subtasks, but the first (extraction) subtask has a weight of 1 and the second (run.py) has a weight of 99, which is why they don't show this behavior. The acemd3 tasks only have one subtask in the file, so no weight is needed and progress is fairly linear.
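For illustration, a wrapper job.xml with weights would look something like this (the subtask names and weights are made up; the real ATM job file will differ):

<job_desc>
    <task>
        <!-- quick unpacking step: worth 1% of the progress bar -->
        <application>bin/tar</application>
        <command_line>xjvf input.tar.bz2</command_line>
        <weight>1</weight>
    </task>
    <task>
        <!-- main run: worth the remaining 99% -->
        <application>bin/bash</application>
        <command_line>run.sh</command_line>
        <weight>99</weight>
    </task>
</job_desc>

With weights like these, the task would sit near 1% during the main run instead of 75%; finer-grained, per-sample progress within the run would still need the main script itself to report its fraction done.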
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60042 - Posted: 8 Mar 2023 | 21:59:23 UTC - in response to Message 60039.

I have one that's running (?) much the same. I think I've found a way to confirm it's still alive.

I looked at the task properties to see which slot directory it was running in (slot 2, in my case). Then I found the relevant directory, and poked about a bit.

I found our usual touchstone (stderr.txt) to be useless - it hadn't been touched in hours. But another file - run.log - is currently active. The most recent entries are current:

2023-03-08 21:55:05 - INFO - sync_re - Started: sample 107, replica 12
2023-03-08 21:55:17 - INFO - sync_re - Finished: sample 107, replica 12
(duration: 12.440164870815352 s)

which seems to suggest that all is well. Perhaps Quico could let us know how many samples to expect in the current batch?
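If anyone else wants to watch theirs, something like this works from a terminal (replace the slot number with the one shown in the task properties, and adjust the BOINC data directory to your install):

# follow the run.log of the task running in slot 2
tail -f /var/lib/boinc-client/slots/2/run.log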

kksplace
Send message
Joined: 4 Mar 18
Posts: 53
Credit: 1,464,751,749
RAC: 3,464,677
Level
Met
Scientific publications
wat
Message 60043 - Posted: 8 Mar 2023 | 22:39:10 UTC - in response to Message 60042.

Thanks for the idea. Sure enough, that file is showing activity (on sample 324, replica 3 for me). OK, just going to sit and wait.

Ian&Steve, thanks for the explanation. Just one thought: what if the fourth item is just "do everything else"? Couldn't that mean going straight from 75% to 100% at some point (assuming it is progressing)?

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60045 - Posted: 9 Mar 2023 | 9:26:00 UTC - in response to Message 60042.

I have one that's running (?) much the same. I think I've found a way to confirm it's still alive.

I looked at the task properties to see which slot directory it was running in (slot 2, in my case). Then I found the relevant directory, and poked about a bit.

I found our usual touchstone (stderr.txt) to be useless - it hadn't been touched in hours. But another file - run.log - is currently active. The most recent entries are current:

2023-03-08 21:55:05 - INFO - sync_re - Started: sample 107, replica 12
2023-03-08 21:55:17 - INFO - sync_re - Finished: sample 107, replica 12
(duration: 12.440164870815352 s)

which seems to suggest that all is well. Perhaps Quico could let us know how many samples to expect in the current batch?


Thanks for this input (and everyone's). At least in the runs I sent recently, we are expecting 341 samples.

I've seen that there were many crashes in the last batch of jobs I sent. I'll check whether there was some issue on my end or the systems just decided to blow up.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60046 - Posted: 9 Mar 2023 | 10:43:01 UTC - in response to Message 60045.

At least in the runs I sent recently we are expecting 341 samples.

Thanks, that's helpful. I've reached sample 266, so I'll be able to predict when it's likely to finish.

But I think you need to reconsider some design decisions. The current task properties (from BOINC Manager) are:



This task will take over 24 hours to run on my GTX 1660 Ti - that's long, even by GPUGrid standards.

BOINC doesn't think it's checkpointed since the beginning, even though checkpoints are listed at the end of each sample in the job.log

BOINC Manager shows that the fraction done is 75.000% - and has displayed that figure, unchanging, since a few minutes into the run.

I'm not seeing any sign of an output file (or I haven't found it yet!), although it's specified in the <result> XML:

<file_ref>
<file_name>T_QUICO_Tyk2_new_2_ejm_47_ejm_55_4-QUICO_TEST_ATM-0-1-RND8906_2_0</file_name>
<open_name>output.tar.bz2</open_name>
<copy_file/>
</file_ref>

More when it finishes.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60047 - Posted: 9 Mar 2023 | 11:40:25 UTC - in response to Message 60046.
Last modified: 9 Mar 2023 | 11:56:09 UTC

At least in the runs I sent recently we are expecting 341 samples.

Thanks, that's helpful. I've reached sample 266, so I'll be able to predict when it's likely to finish.

But I think you need to reconsider some design decisions. The current task properties (from BOINC Manager) are:



This task will take over 24 hours to run on my GTX 1660 Ti - that's long, even by GPUGrid standards.



That's good to know, thanks. Next time I'll prepare them so they run for shorter amounts of time and finish over the next submissions. Is there an approximate runtime you would suggest per task?


I'm not seeing any sign of an output file (or I haven't found it yet!), although it's specified in the <result> XML:

<file_ref>
<file_name>T_QUICO_Tyk2_new_2_ejm_47_ejm_55_4-QUICO_TEST_ATM-0-1-RND8906_2_0</file_name>
<open_name>output.tar.bz2</open_name>
<copy_file/>
</file_ref>

More when it finishes.


Can you see a cntxt_0 folder or several r0-r21 folders? These should be some of the outputs that the run generates, and they are also the ones I'm getting from the successful runs.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60048 - Posted: 9 Mar 2023 | 11:57:55 UTC - in response to Message 60047.

Can you see a cntxt_0 folder or several r0-r21 folders? These should be some of the outputs that the run generates, and they are also the ones I'm getting from the successful runs.

Yes, I have all of those, and they're filling up nicely. I want to catch the final upload archive, and check it for size.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60049 - Posted: 9 Mar 2023 | 14:37:12 UTC - in response to Message 60048.

Can you see a cntxt_0 folder or several r0-r21 folders? These should be some of the outputs that the run generates, and they are also the ones I'm getting from the successful runs.

Yes, I have all of those, and they're filling up nicely. I want to catch the final upload archive, and check it for size.


Ah, I see. From what I've seen, the final upload archive has been around 500 MB for these runs. Taking into account what was mentioned about file sizes at the beginning of the thread, I'll tweak some parameters to avoid heavier files.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60050 - Posted: 9 Mar 2023 | 14:50:08 UTC - in response to Message 60049.

You should also add weights to the <task> elements in the job.xml file that's being used, as well as some kind of progress reporting for the main script. Jumping to 75% at the start and staying there for 12-24 hours until it jumps to 100% at the end is counterintuitive for most users and causes confusion about whether the task is doing anything or not.
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60051 - Posted: 9 Mar 2023 | 14:51:15 UTC - in response to Message 60047.
Last modified: 9 Mar 2023 | 14:53:22 UTC

Next time I'll prepare them so they run for shorter amounts of time and finish over the next submissions. Is there an approximate runtime you would suggest per task?

The sweet spot would be 0.5 to 4 hours. Above 8 hours is starting to drag. Some climate projects take over a week to run. It really depends on your needs, we're here to serve :-) It seems a quicker turn around time while you're tweaking your project would be to your benefit.

It seems it would help you if you created your own BOINC account and ran your WUs the same way we do. Get in the trenches with us and see what we see.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60052 - Posted: 9 Mar 2023 | 16:51:12 UTC
Last modified: 9 Mar 2023 | 17:01:11 UTC

Well, here it is:



BOINC sees that as 500.28 MB (Linux counts in 1000s, BOINC counts in 1024s) - wish me luck!

Edit - phew, it got through. But that's very, very close to the old limit. Task 33344733

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60053 - Posted: 9 Mar 2023 | 18:29:11 UTC - in response to Message 60051.

Next time I'll prepare them so they run for shorter amounts of time and finish over the next submissions. Is there an approximate runtime you would suggest per task?

The sweet spot would be 0.5 to 4 hours. Above 8 hours is starting to drag. Some climate projects take over a week to run. It really depends on your needs, we're here to serve :-) It seems a quicker turn around time while you're tweaking your project would be to your benefit.

It seems it would help you if you created your own BOINC account and ran your WUs the same way we do. Get in the trenches with us and see what we see.


Once the Windows version is live my personal set-up will join the cause and will have more feedback :)

Well, here it is:



BOINC sees that as 500.28 MB (Linux counts in 1000s, BOINC counts in 1024s) - wish me luck!

Edit - phew, it got through. But that's very, very close to the old limit. Task 33344733


Thanks for the insight. I'll make it save frames less frequently to avoid bigger file sizes.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60068 - Posted: 13 Mar 2023 | 16:26:17 UTC

Nothing but errors from the current ATM batch. run.sh is missing, or misnamed/misreferenced.
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60069 - Posted: 13 Mar 2023 | 17:49:34 UTC
Last modified: 13 Mar 2023 | 17:49:46 UTC

I vaguely recall GG had a rule something like a computer can only DL 200 WUs a day. If it's still in place it would be absurd since the overriding rule is that a computer can only hold 2 WUs at a time.
At the rate ATM WUs are failing I could hit that limit, so I halted GG DLs.
Please delete all your WUs until you fix the bug.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60074 - Posted: 14 Mar 2023 | 12:35:21 UTC

Today's tasks are running OK - the run.sh script problem has been cured.

I'm running one that the previous user aborted before it even started - no need for that any more (WU 27426736).

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60075 - Posted: 14 Mar 2023 | 12:51:35 UTC - in response to Message 60074.
Last modified: 14 Mar 2023 | 12:52:47 UTC

I wouldn't say "cured", but the newer tasks seem to be fine. I'm still getting a good number of resends with the same problem. I guess they'll make their way through the meat grinder before defaulting out.

example: http://www.gpugrid.net/result.php?resultid=33357435
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60076 - Posted: 14 Mar 2023 | 14:47:33 UTC - in response to Message 60075.

My point was: if you get one of these, let it run - it may be going to produce useful science. If it's one of the faulty ones, you waste about 20 seconds, and move on.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60084 - Posted: 15 Mar 2023 | 9:28:37 UTC
Last modified: 15 Mar 2023 | 9:30:08 UTC

Quico/GDF, GPU utilization is low so I'd like to test running 3 and 4 ATM WUs simultaneously.
Sadly GG chokes off work at 2 WUs per computer so that's presently impossible.
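For what it's worth, if the server-side limit ever allows it, the usual client-side way to run several tasks per GPU is an app_config.xml in the project directory. A sketch (the app name matches the <app_name>ATM</app_name> seen in client_state.xml; 0.33 would mean three tasks per GPU):

<app_config>
    <app>
        <name>ATM</name>
        <gpu_versions>
            <gpu_usage>0.33</gpu_usage>
            <cpu_usage>1.0</cpu_usage>
        </gpu_versions>
    </app>
</app_config>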

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60085 - Posted: 15 Mar 2023 | 10:12:31 UTC - in response to Message 60076.
Last modified: 15 Mar 2023 | 10:16:48 UTC

Sorry about the run.sh missing issue over the past few days. It slipped past me. There were also a few re-sent tests that crashed, but it should be fixed now.


Is there a way I could delete the failed/crashed files from the server?

We're also trying to find alternatives to avoid the filesize issue. I hope we can find a nice solution in the next few days.

Do the latest runs take less time, making them less of a drag to run? I'm trying to find the sweet spot for most of us.

Thanks everyone!

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60086 - Posted: 15 Mar 2023 | 10:13:40 UTC - in response to Message 60084.

Quico/GDF, GPU utilization is low so I'd like to test running 3 and 4 ATM WUs simultaneously.
Sadly GG chokes off work at 2 WUs per computer so that's presently impossible.


How low is it? That really shouldn't be the case, at least going by the tests we performed internally.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60087 - Posted: 15 Mar 2023 | 10:47:56 UTC - in response to Message 60085.

My host 508381 (GTX 1660 Ti) has finished a couple overnight, in about 9 hours. The last one finished just as I was reading your message, and I saw the upload size - 114 MB. Another failed with 'Energy is NaN', but that's another question.

The size and time figures are comfortable for me, but others will post their own views.

It would be helpful to work on the intermediate progress reports and checkpointing - at the moment, neither is reported to BOINC. This host (Linux Mint 20.3) spends the entire run reporting 75% progress; my other machine (Linux Mint 21.1) is stuck at 3%. Both run exactly the same build of BOINC.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60091 - Posted: 15 Mar 2023 | 11:29:34 UTC - in response to Message 60086.
Last modified: 15 Mar 2023 | 11:45:09 UTC

My observations show the GPU switching from periods of high utilization (~96-98%) to periods of idle (0%). About every minute or two.

I think the current size of the ATM tasks is pretty good: about 4 hours on a 3080 Ti and about 5 hours on a 2080 Ti.

I'll second Richard's comment that you should put some effort into checkpointing and into fixing the completion reporting (add weights to the job.xml file).
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60093 - Posted: 15 Mar 2023 | 14:47:42 UTC - in response to Message 60086.
Last modified: 15 Mar 2023 | 14:53:46 UTC

Quico/GDF, GPU utilization is low so I'd like to test running 3 and 4 ATM WUs simultaneously.
Sadly GG chokes off work at 2 WUs per computer so that's presently impossible.
How low is it? That really shouldn't be the case, at least going by the tests we performed internally.

GPUgrid is set to only DL 2 WUs per computer.

It used to be higher, but since ACEMD WUs take around 12-ish hours and have approximately 50% GPU utilization, a normal BOINC client couldn't really make efficient use of more than 2. The history of setting the limit may have had something to do with DDoS attacks and throttling server access as a defense.

But Python WUs with a very low GPU utilization and ATM with about 25% utilization could run more. I believe it's possible for the work server to designate how many WUs of a given kind based on the client's hardware.

Some use a custom BOINC client that tricks the server into thinking their computer is more than one computer.

I suspect 1080s & 2080s could run 3 and 3080s could run 4 ATM WUs. Be nice to give it a try.

Checkpointing should be high on your to-do list, followed closely by progress reporting. File size is not an issue on the client side, since you already have us download files over a GB, but increasing the limit on your server side would make that problem vanish. Run times have shortened and run fine; a little shorter might be nice, but it's not a priority.

Profile Stephen Uitti
Send message
Joined: 17 Mar 14
Posts: 4
Credit: 77,427,636
RAC: 0
Level
Thr
Scientific publications
watwatwatwatwatwatwatwat
Message 60094 - Posted: 15 Mar 2023 | 14:58:08 UTC

I noticed "Free energy calculations of protein ligand binding" in WUProp. For example, today's time is 0.03 hours. I checked, and I have 68 of these with minimal total time; they all end with "Error while computing". I looked at a recent work unit, 27429650 T_CDK2_new_2_edit_1oiu_26_T2_2A_1-QUICO_TEST_ATM-0-1-RND4575_0
The log has this:

+ python -m pip install git+https://github.com/raimis/AToM-OpenMM.git@5d7eac55295e8c6e777505c3ca7c998f1c85987d
Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /t/boinclib/boinc-client/slots/8/tmp/pip-req-build-3qm67lb1
Running command git rev-parse -q --verify 'sha^5d7eac55295e8c6e777505c3ca7c998f1c85987d'
Running command git fetch -q https://github.com/raimis/AToM-OpenMM.git 5d7eac55295e8c6e777505c3ca7c998f1c85987d
Running command git checkout -q 5d7eac55295e8c6e777505c3ca7c998f1c85987d
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: -4


I'm running Linux Mint 19 (a bit out of date), git is git version 2.17.1
/usr/bin/python is Python 2.7.17 and /usr/bin/python3 is Python 3.6.9 -- this was common until recently
uname -a
Linux berfon 5.4.0-104-generic #118~18.04.1-Ubuntu SMP Thu Mar 3 13:53:15 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
My machine has a gtx-950, so cuda tasks are OK.
It's having an issue writing to /t/boinclib/boinc-client/slots/8/tmp

sudo ls -ld /t/boinclib/boinc-client/slots/8/
drwxrwx--x 2 boinc boinc 4096 Mar 15 10:24 /t/boinclib/boinc-client/slots/8/
So it doesn't look like a permissions issue. The disk drive this is on has over 1 TB space free. It looks to me like git failed, and this is what is happening on all the work units.
My machine is running "New version of ACEMD" routinely.
My GPUGrid preferences are set to run everything. I'm not sure which category this app is in, but it must be one of the beta apps.

I hope this helps.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60095 - Posted: 15 Mar 2023 | 15:23:16 UTC - in response to Message 60093.

GPUgrid is set to only DL 2 WUs per computer.


It's actually 2 per GPU, for up to 8 GPUs: 16 per computer/host.

ACEMD WUs take around 12-ish hours and have approximately 50% GPU utilization


acemd3 has always used nearly 100% utilization with a single task on every GPU I've ever run. If you're only seeing 50%, it sounds like you're hitting some other kind of bottleneck preventing the GPU from working to its full potential.

____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60096 - Posted: 15 Mar 2023 | 17:53:15 UTC

I just started using nvitop for Linux and it gives a very different image of GPU utilization while running ATM: https://github.com/XuehaiPan/nvitop

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60097 - Posted: 15 Mar 2023 | 18:06:14 UTC - in response to Message 60096.
Last modified: 15 Mar 2023 | 18:09:46 UTC

I would probably give more trust to Nvidia's own tools:

watch -n 1 nvidia-smi

or
watch -n 1 nvidia-smi --query-gpu=temperature.gpu,name,pci.bus_id,utilization.gpu,utilization.memory,clocks.current.sm,clocks.current.memory,power.draw,memory.used,pcie.link.gen.current,pcie.link.width.current --format=csv


But you said acemd3 uses 50%, not ATM. Overall I'd agree that ATM is closer to 50% effective, or a little higher. It cycles back and forth between roughly 90 seconds at 95+% and 30 seconds at 0% for the majority of the run.
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60098 - Posted: 15 Mar 2023 | 18:09:25 UTC - in response to Message 60094.
Last modified: 15 Mar 2023 | 18:10:23 UTC

I'm running Linux Mint 19 (a bit out of date)
I just retired my last Linux Mint 19 computer yesterday, and it had been running ATM, ACEMD & Python WUs on a 2080 Ti (12/7.5) fine. BTW, I tried the LM 21.1 upgrade from LM 20.3 and can't do things like open the BOINC folder as admin. I can't see any advantage to 21.1, so I'm going to do a fresh install and revert back to 20.3.

My machine has a gtx-950, so cuda tasks are OK.
Is there a minimum requirement for CUDA and Compute Capability for ATM WUs?
https://www.techpowerup.com/gpu-specs/geforce-gtx-950.c2747 says CUDA 5.2 and https://developer.nvidia.com/cuda-gpus says 5.2.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60099 - Posted: 15 Mar 2023 | 18:14:54 UTC - in response to Message 60098.

Is there a minimum requirement for CUDA and Compute Capability for ATM WUs?
https://www.techpowerup.com/gpu-specs/geforce-gtx-950.c2747 says CUDA 5.2 and https://developer.nvidia.com/cuda-gpus says 5.2.


Very likely the minimum CC is 5.0 (Maxwell), since Kepler cards seem to be erroring out with the message that the card is too old.

All CUDA 11.x apps are supported by CUDA 11.1+ drivers. With CUDA 11.1, Nvidia introduced forward compatibility of minor versions, so as long as you have 450+ drivers you should be able to run any CUDA app up to 11.8. CUDA 12+ will require moving to CUDA 12+ compatible drivers.
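A quick way to check what you have (compare against the 450+ threshold above):

# print the installed driver version and GPU name
nvidia-smi --query-gpu=driver_version,name --format=csv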
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60100 - Posted: 15 Mar 2023 | 18:16:47 UTC - in response to Message 60095.

GPUgrid is set to only DL 2 WUs per computer.

It's actually 2 per GPU, for up to 8 GPUs: 16 per computer/host.
I'm sure you're right; it's been years since I put more than one GPU in a computer.

ACEMD WUs take around 12-ish hours and have approximately 50% GPU utilization
acemd3 has always used nearly 100% utilization with a single task on every GPU I've ever run. If you're only seeing 50%, it sounds like you're hitting some other kind of bottleneck preventing the GPU from working to its full potential.

Let me rephrase that, since it's been a long time since there was a steady flow of ACEMD. I always run 2 ACEMD WUs per GPU with no other GPU projects running. I can't remember what ACEMD utilization was, but I don't recall that they slowed down much when running 2 WUs together.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60101 - Posted: 15 Mar 2023 | 18:19:02 UTC - in response to Message 60100.

maybe not much slower, but also not faster.
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60102 - Posted: 15 Mar 2023 | 18:20:10 UTC - in response to Message 60097.

I would probably give more trust to Nvidia's own tools.

watch -n 1 nvidia-smi

nvitop does that but graphs it.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60103 - Posted: 15 Mar 2023 | 18:22:37 UTC - in response to Message 60101.

maybe not much slower, but also not faster.

But running two has the advantage that the second GG task isn't left sitting idle, waiting for a single ACEMD WU to finish and missing the quick-turnaround bonus, which feels like getting robbed :-) But who's counting?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60104 - Posted: 15 Mar 2023 | 18:26:29 UTC - in response to Message 60103.
Last modified: 15 Mar 2023 | 18:28:30 UTC

Until your 12-hour task turns into two 25-hour tasks when running two at once, and you get robbed anyway: robbed of the bonus on two tasks instead of just one.

You can set your machine not to download excess tasks by setting a smaller cache size or playing with resource share. That way it won't download the second task until the first one is nearly finished. There are lots of options you can tweak to get the desired behavior.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60105 - Posted: 15 Mar 2023 | 21:13:38 UTC
Last modified: 15 Mar 2023 | 21:13:58 UTC

Picked up another ATM task but not holding much hope that it will run correctly based on the previous wingmen output files. Looks like the configuration is not correct again.

Had hope since the task mentions new in the name.

T_CDK2_new_2_edit_26_1h1q_T4_2_1-QUICO_TEST_ATM-0-1-RND2833_2

[Errno 2] No such file or directory

openmm.OpenMMException: Illegal value for DeviceIndex: 1

Guess I will be the next guinea pig.

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 207
Credit: 1,669,151,456
RAC: 719,040
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60106 - Posted: 16 Mar 2023 | 1:28:51 UTC

Does the ATM app work with RTX 4000 series?
____________
Reno, NV
Team: SETI.USA

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60107 - Posted: 16 Mar 2023 | 2:12:42 UTC - in response to Message 60106.

Does the ATM app work with RTX 4000 series?


Maybe. The Python app does, and the ATM is a similar kind of setup. You’ll have to try it and see.

Not sure how much progress the project has made for Windows though.
____________

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60108 - Posted: 16 Mar 2023 | 8:06:10 UTC - in response to Message 60098.

I'm running Linux Mint 19 (a bit out of date)
I just retired my last Linux Mint 19 computer yesterday and it had been running ATM, ACEMD & Python WUs on a 2080 Ti (12/7.5) fine. BTW, I tried the LM 21.1 upgrade from LM 20.3 and can't do things like open BOINC folder as admin. I can't see any advantage to 21.1 so I'm going to do a fresh install and revert back to 20.3.

My machine has a gtx-950, so cuda tasks are OK.
Is there a minimum requirement for CUDA and Compute Capability for ATM WUs?
https://www.techpowerup.com/gpu-specs/geforce-gtx-950.c2747 says CUDA 5.2 and https://developer.nvidia.com/cuda-gpus says 5.2.



Glad to know someone else also has the same problem with Mint 21.1. I will shift to some other flavour.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60111 - Posted: 18 Mar 2023 | 6:30:31 UTC

Got my first ATM Beta. Completed and validated.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60120 - Posted: 20 Mar 2023 | 14:45:24 UTC - in response to Message 60091.

My observations show the GPU switching from periods of high utilization (~96-98%) to periods of idle (0%). About every minute or two.

I think the current size of the ATM tasks is pretty good: about 4 hours on a 3080 Ti and about 5 hours on a 2080 Ti.

I'll second Richard's comment that you should put some effort into checkpointing and into fixing the completion reporting (add weights to the job.xml file).


That sounds like how ATM is intended to work for now. The idle GPU periods correspond to writing coordinates.

Happy to know that the size of the jobs is good!


Picked up another ATM task but not holding much hope that it will run correctly based on the previous wingmen output files. Looks like the configuration is not correct again.

Had hope since the task mentions new in the name.

T_CDK2_new_2_edit_26_1h1q_T4_2_1-QUICO_TEST_ATM-0-1-RND2833_2

[Errno 2] No such file or directory

openmm.OpenMMException: Illegal value for DeviceIndex: 1

Guess I will be the next guinea pig.


I have seen your errors, but I'm not sure why they're happening, since I have several jobs running smoothly right now. I'll ask around.

The "new" tag is a legacy part of the receptor naming on my end.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60121 - Posted: 20 Mar 2023 | 14:46:25 UTC

Another heads-up: it seems that the Windows app will be available soon! That way we'll also be able to look into the progress reporting issue.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60123 - Posted: 20 Mar 2023 | 19:54:13 UTC - in response to Message 60121.

...it seems that the Windows app will be available soon!

That's good news - I'm looking forward to receiving ATM tasks :-)

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 207
Credit: 1,669,151,456
RAC: 719,040
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60126 - Posted: 22 Mar 2023 | 6:52:36 UTC
Last modified: 22 Mar 2023 | 6:53:46 UTC

I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60128 - Posted: 22 Mar 2023 | 11:15:48 UTC - in response to Message 60126.

I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?


As far as I know, we are doing the final tests.
I'll let you know once it's fully ready and I have the green light to send jobs through there.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60129 - Posted: 22 Mar 2023 | 11:32:53 UTC - in response to Message 60126.

I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?


do you have allow beta/test applications checked?
____________

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 207
Credit: 1,669,151,456
RAC: 719,040
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60130 - Posted: 22 Mar 2023 | 14:37:45 UTC - in response to Message 60129.

I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?


do you have allow beta/test applications checked?

Yep. Are you saying that you have received windows tasks for ATM?
____________
Reno, NV
Team: SETI.USA

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60132 - Posted: 22 Mar 2023 | 14:45:55 UTC - in response to Message 60130.

I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?


do you have allow beta/test applications checked?

Yep. Are you saying that you have received windows tasks for ATM?


No, I don't run Windows. I was just asking if you had the beta box selected, because that's necessary.

But looking at the server, some people did get them. Someone else earlier in this thread reported that they received and processed one as well. Very few went out, so unless your system asked while they were available, it would be easy to miss them. You can set up a script to ask for them regularly, since BOINC will stop asking after so many requests with no tasks sent.
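A minimal version of such a script, assuming boinccmd is on the PATH and runs on the same host as the client:

#!/bin/bash
# poke the client every 10 minutes so it keeps asking GPUGRID for work
while true; do
    boinccmd --project https://www.gpugrid.net/ update   # use the exact project URL shown in your BOINC manager
    sleep 600
done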
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60134 - Posted: 22 Mar 2023 | 14:54:54 UTC - in response to Message 60126.

I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?

I've yet to get a Windoze ATMbeta. They've been available for a while this morning and still nothing. That GPU just sits with bated breath.
What's the trick?

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 207
Credit: 1,669,151,456
RAC: 719,040
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60135 - Posted: 22 Mar 2023 | 15:07:47 UTC - in response to Message 60132.

I see that there is a windows app for ATM. But I have never received an app on any of my win machines, even with an updater. And yes, I have all the right project preferences set (everything checked). So, has anyone received an ATM task on a windows machine?


do you have allow beta/test applications checked?

Yep. Are you saying that you have received windows tasks for ATM?


No, I don't run Windows. I was just asking if you had the beta box selected, because that's necessary.

But looking at the server, some people did get them. Someone else earlier in this thread reported that they received and processed one as well. Very few went out, so unless your system asked while they were available, it would be easy to miss them. You can set up a script to ask for them regularly, since BOINC will stop asking after so many requests with no tasks sent.


Yep. As I said, I have an updater script running as well.
____________
Reno, NV
Team: SETI.USA

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60136 - Posted: 22 Mar 2023 | 15:11:24 UTC - in response to Message 60135.

KAMasud got one on his Windows system. Maybe he can share his settings.
____________

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60137 - Posted: 22 Mar 2023 | 15:26:35 UTC
Last modified: 22 Mar 2023 | 15:38:55 UTC

Quico, do you have some cryptic requirements specified for your Win ATMbeta WUs?

I've even had my Win computer set to only request ATMbeta WUs and still got nothing.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60138 - Posted: 23 Mar 2023 | 8:57:10 UTC - in response to Message 60136.

KAMasud got one on his Windows system. maybe he can share his settings.

____________________

Yes, I did get an ATM task. Completed and validated with success. No, I do not have any special settings. The only thing I do is not run any other project with GPU Grid. I have a feeling that they interfere with each other. How? GPU Grid is all over my cores and threads. Lacks discipline. My take on the subject. Admin, sorry.
Even though resources are wasted, I am not after the credits.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60139 - Posted: 23 Mar 2023 | 9:34:34 UTC
Last modified: 23 Mar 2023 | 13:06:14 UTC

I think it's just a matter of very few tests being submitted right now. Once I have the green light from Raimondas, I'll start sending jobs through the Windows app as well.
I have a complete system prepared just for you ;)

PS: You can now check the pre-print of our initial benchmark in the lab with ATM!
https://arxiv.org/abs/2303.11065

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60140 - Posted: 23 Mar 2023 | 12:46:59 UTC

Still no checkpoints. Hopefully this is at the top of your priority list.

BTW, highlight your URL and click URL above and it'll be linkable:
https://arxiv.org/abs/2303.11065

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60141 - Posted: 23 Mar 2023 | 13:08:40 UTC - in response to Message 60140.

Done! Thanks for that.

Progress reporting should be live for the jobs I'll send later today. Please let me know if it works as expected, especially for the jobs with _BACE_ in their job name.

I'll also start sending jobs through Windows today.

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 207
Credit: 1,669,151,456
RAC: 719,040
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60142 - Posted: 23 Mar 2023 | 14:05:09 UTC

There are two different ATM apps on the server stats page, and also on the apps.php page. But in project preferences, there is only one ATM app listed. We need a way to select both/either in our project preferences.
____________
Reno, NV
Team: SETI.USA

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60143 - Posted: 23 Mar 2023 | 16:09:27 UTC

Let it be. It is more fun this way. Never know what you will get next and adjust.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60144 - Posted: 23 Mar 2023 | 16:23:31 UTC
Last modified: 23 Mar 2023 | 16:26:46 UTC

My new WU behaves differently, but I don't think checkpointing is working. It reported the first checkpoint after a minute, and after an hour it has yet to report a second one. Progress is stuck at 0.2, but the time remaining has decreased from 1222 days to 22 days.

The Windoze WUs are all failing.

Pop Piasa
Avatar
Send message
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Level
Gln
Scientific publications
watwat
Message 60145 - Posted: 23 Mar 2023 | 17:11:09 UTC
Last modified: 23 Mar 2023 | 17:24:02 UTC

I have started to get these ATM tasks on my windoze hosts.

All are failing like this:

(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
11:28:53 (11872): wrapper (7.9.26016): starting
11:28:53 (11872): wrapper: running python.exe (bin/conda-unpack)
11:28:54 (11872): python.exe exited; CPU time 0.000000
11:28:54 (11872): wrapper: running Library/usr/bin/tar.exe (xjvf input.tar.bz2)
analyze.sh
cntxt_0/
cntxt_0/PTP1B_new-23486-23479
p-0.dat
p-10.dat
p-11.dat
p-12.dat
p-13.dat
p-14.dat
p-15.dat
p-16.dat
p-17.dat
p-18.dat
p-19.dat
p-1.dat
p-20.dat
p-21.dat
p-2.dat
p-3.dat
p-4.dat
p-5.dat
p-6.dat
p-7.dat
p-8.dat
p-9.dat
PTP1B_new-23486-23479_0.xml
PTP1B_new-23486-23479_asyncre.cntl
PTP1B_new-23486-23479.inpcrd
PTP1B_new-23486-23479.prmtop
r0/
r0/PTP1B_new-23486-23479.dcd
r0/PTP1B_new-23486-23479_ckpt.xml
r0/PTP1B_new-23486-23479.out
r1/
r1/PTP1B_new-23486-23479.dcd
r1/PTP1B_new-23486-23479_ckpt.xml
r1/PTP1B_new-23486-23479.out
r10/
r10/PTP1B_new-23486-23479.dcd
r10/PTP1B_new-23486-23479_ckpt.xml
r10/PTP1B_new-23486-23479.out
r11/
r11/PTP1B_new-23486-23479.dcd
r11/PTP1B_new-23486-23479_ckpt.xml
r11/PTP1B_new-23486-23479.out
r12/
r12/PTP1B_new-23486-23479.dcd
r12/PTP1B_new-23486-23479_ckpt.xml
r12/PTP1B_new-23486-23479.out
r13/
r13/PTP1B_new-23486-23479.dcd
r13/PTP1B_new-23486-23479_ckpt.xml
r13/PTP1B_new-23486-23479.out
r14/
r14/PTP1B_new-23486-23479.dcd
r14/PTP1B_new-23486-23479_ckpt.xml
r14/PTP1B_new-23486-23479.out
r15/
r15/PTP1B_new-23486-23479.dcd
r15/PTP1B_new-23486-23479_ckpt.xml
r15/PTP1B_new-23486-23479.out
r16/
r16/PTP1B_new-23486-23479.dcd
r16/PTP1B_new-23486-23479_ckpt.xml
r16/PTP1B_new-23486-23479.out
r17/
r17/PTP1B_new-23486-23479.dcd
r17/PTP1B_new-23486-23479_ckpt.xml
r17/PTP1B_new-23486-23479.out
r18/
r18/PTP1B_new-23486-23479.dcd
r18/PTP1B_new-23486-23479_ckpt.xml
r18/PTP1B_new-23486-23479.out
r19/
r19/PTP1B_new-23486-23479.dcd
r19/PTP1B_new-23486-23479_ckpt.xml
r19/PTP1B_new-23486-23479.out
r2/
r2/PTP1B_new-23486-23479.dcd
r2/PTP1B_new-23486-23479_ckpt.xml
r2/PTP1B_new-23486-23479.out
r20/
r20/PTP1B_new-23486-23479.dcd
r20/PTP1B_new-23486-23479_ckpt.xml
r20/PTP1B_new-23486-23479.out
r21/
r21/PTP1B_new-23486-23479.dcd
r21/PTP1B_new-23486-23479_ckpt.xml
r21/PTP1B_new-23486-23479.out
r3/
r3/PTP1B_new-23486-23479.dcd
r3/PTP1B_new-23486-23479_ckpt.xml
r3/PTP1B_new-23486-23479.out
r4/
r4/PTP1B_new-23486-23479.dcd
r4/PTP1B_new-23486-23479_ckpt.xml
r4/PTP1B_new-23486-23479.out
r5/
r5/PTP1B_new-23486-23479.dcd
r5/PTP1B_new-23486-23479_ckpt.xml
r5/PTP1B_new-23486-23479.out
r6/
r6/PTP1B_new-23486-23479.dcd
r6/PTP1B_new-23486-23479_ckpt.xml
r6/PTP1B_new-23486-23479.out
r7/
r7/PTP1B_new-23486-23479.dcd
r7/PTP1B_new-23486-23479_ckpt.xml
r7/PTP1B_new-23486-23479.out
r8/
r8/PTP1B_new-23486-23479.dcd
r8/PTP1B_new-23486-23479_ckpt.xml
r8/PTP1B_new-23486-23479.out
r9/
r9/PTP1B_new-23486-23479.dcd
r9/PTP1B_new-23486-23479_ckpt.xml
r9/PTP1B_new-23486-23479.out
Rplots.pdf
run.log
run.sh
uwham_analysis.R
uwham_analysis.Rout
11:29:23 (11872): Library/usr/bin/tar.exe exited; CPU time 0.796875
11:29:23 (11872): wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)
'run.bat' is not recognized as an internal or external command,
operable program or batch file.

11:29:24 (11872): C:/Windows/system32/cmd.exe exited; CPU time 0.000000
11:29:24 (11872): app exit status: 0x1
11:29:24 (11872): called boinc_finish(195)


A script error?

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60146 - Posted: 23 Mar 2023 | 17:46:00 UTC - in response to Message 60145.

I have started to get these ATM tasks on my windoze hosts.

All are failing like this:

(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
11:28:53 (11872): wrapper (7.9.26016): starting
11:28:53 (11872): wrapper: running python.exe (bin/conda-unpack)
11:28:54 (11872): python.exe exited; CPU time 0.000000
11:28:54 (11872): wrapper: running Library/usr/bin/tar.exe (xjvf input.tar.bz2)
[...]
11:29:23 (11872): Library/usr/bin/tar.exe exited; CPU time 0.796875
11:29:23 (11872): wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)
'run.bat' is not recognized as an internal or external command,
operable program or batch file.

11:29:24 (11872): C:/Windows/system32/cmd.exe exited; CPU time 0.000000
11:29:24 (11872): app exit status: 0x1
11:29:24 (11872): called boinc_finish(195)


A script error?


Hmm, I did send those this morning. They probably entered the queue once my Windows app was live and it went looking for run.bat.
If that's the case, expect many crashes incoming :_(

The tests I'm monitoring still seem to be running, so there's still hope

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 207
Credit: 1,669,151,456
RAC: 719,040
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60147 - Posted: 23 Mar 2023 | 19:33:22 UTC

FWIW, this morning my Windows machines started getting ATM tasks. Most of these tasks are erroring out; they have already been issued and failed many times over, so it looks like a problem with the tasks and not the clients running them. They will eventually work their way out of the system. But a few of the Windows tasks I received today are actually working. Here is a successful example:

http://www.gpugrid.net/result.php?resultid=33375372

So there is hope.
____________
Reno, NV
Team: SETI.USA

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60148 - Posted: 23 Mar 2023 | 20:20:25 UTC - in response to Message 60147.

FWIW, this morning my Windows machines started getting ATM tasks. Most of these tasks are erroring out; they have already been issued and failed many times over, so it looks like a problem with the tasks and not the clients running them. They will eventually work their way out of the system. But a few of the Windows tasks I received today are actually working. Here is a successful example:

http://www.gpugrid.net/result.php?resultid=33375372

So there is hope.

--------------
Welcome, Zombie67. If you are looking for more excitement, ClimatePrediction has implemented OpenIFS.

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 12,875,793
RAC: 0
Level
Pro
Scientific publications
wat
Message 60149 - Posted: 23 Mar 2023 | 20:23:04 UTC - in response to Message 60148.

All OpenIFS tasks have already been sent.

Pop Piasa
Avatar
Send message
Joined: 8 Aug 19
Posts: 252
Credit: 458,054,251
RAC: 0
Level
Gln
Scientific publications
watwat
Message 60150 - Posted: 24 Mar 2023 | 1:19:31 UTC - in response to Message 60147.

...But a few of the windows tasks I received today are actually working.


I have one that is working, but I had to add ATM to my app_config file to get it to show the time remaining more accurately, due to what Ian pointed out way upthread.
https://www.gpugrid.net/forum_thread.php?id=5379&nowrap=true#60041
I now see a realistic time remaining.

My current app_config.xml:
<app_config>
<app>
<name>PythonGPU</name>
<max_concurrent>1</max_concurrent>
<fraction_done_exact/>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>1</cpu_usage>
</gpu_versions>
</app>
<app>
<name>acemd3</name>
<max_concurrent>1</max_concurrent>
<fraction_done_exact/>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>1</cpu_usage>
</gpu_versions>
</app>
<app>
<name>ATM</name>
<max_concurrent>1</max_concurrent>
<fraction_done_exact/>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>1</cpu_usage>
</gpu_versions>
</app>
<project_max_concurrent>1</project_max_concurrent>
<report_results_immediately/>
</app_config>


This task ran alongside an F@H task (project 18717) on an RTX 3060 12GB card without any problem, in case anybody is interested.

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 12,875,793
RAC: 0
Level
Pro
Scientific publications
wat
Message 60151 - Posted: 24 Mar 2023 | 2:54:44 UTC - in response to Message 60150.
Last modified: 24 Mar 2023 | 2:55:02 UTC

Why not
<app>
<name>PythonGPU</name>
<max_concurrent>1</max_concurrent>
<fraction_done_exact/>
<gpu_versions>
<gpu_usage>1</gpu_usage>
<cpu_usage>4</cpu_usage>
</gpu_versions>
</app>

?

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 467
Credit: 8,358,971,966
RAC: 9,198,359
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60152 - Posted: 24 Mar 2023 | 9:48:54 UTC

So far, 2 WUs successfully completed, another one running.

https://www.gpugrid.net/workunit.php?wuid=27438037

https://www.gpugrid.net/workunit.php?wuid=27438416

https://www.gpugrid.net/workunit.php?wuid=27438497


kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 12,875,793
RAC: 0
Level
Pro
Scientific publications
wat
Message 60153 - Posted: 24 Mar 2023 | 11:47:30 UTC - in response to Message 60152.
Last modified: 24 Mar 2023 | 12:05:55 UTC

It still can't run run.bat:
http://www.gpugrid.net/result.php?resultid=33377536

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60154 - Posted: 24 Mar 2023 | 12:16:41 UTC
Last modified: 24 Mar 2023 | 12:26:33 UTC

Progress reporting is still not working.

Instead of halting progress at 75%, it now halts at 0.19%. The weights help prevent the task from jumping to 75%, but there is still something missing.

Python tasks are able to jump to about 1% after the extraction phase due to the weights, and then slowly creep up over time as the task progresses - 2%, 3%, 4%, etc., until they hit 100% in a natural and linear way. The ATM tasks do not do this at all. They sit at 0.19% for hours and hours with no indication of when they will complete. Is it 4 hrs? Is it 20 hrs? There's no feedback to the user. When it's done it just jumps to 100% without warning.

That makes it very difficult to tell if a task is stuck or working.

-Edit-

The "BACE" tasks do seem to be reporting progress now, but the earlier tasks from yesterday ("T_p38") do not.
____________

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60155 - Posted: 24 Mar 2023 | 13:37:41 UTC - in response to Message 60154.

Progress reporting is still not working.

Instead of halting progress at 75%, it now halts at 0.19%. The weights help prevent the task from jumping to 75%, but there is still something missing.

Python tasks are able to jump to about 1% after the extraction phase due to the weights, and then slowly creep up over time as the task progresses - 2%, 3%, 4%, etc., until they hit 100% in a natural and linear way. The ATM tasks do not do this at all. They sit at 0.19% for hours and hours with no indication of when they will complete. Is it 4 hrs? Is it 20 hrs? There's no feedback to the user. When it's done it just jumps to 100% without warning.

That makes it very difficult to tell if a task is stuck or working.

-Edit-

The "BACE" tasks do seem to be reporting progress now, but the earlier tasks from yesterday ("T_p38") do not.


The T_p38 tasks were sent before the update, so I guess it makes sense that they don't show progress yet. Is the progress reporting for the BACE runs good, or does it get stuck?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60156 - Posted: 24 Mar 2023 | 13:50:20 UTC - in response to Message 60155.

Yes, BACE looks good.

But something is wrong with CDK2_new. It jumped to 100% but is still running.
____________

Emilio Gallicchio
Send message
Joined: 23 Mar 23
Posts: 4
Credit: 87,500
RAC: 0
Level

Scientific publications
wat
Message 60157 - Posted: 24 Mar 2023 | 13:59:50 UTC - in response to Message 60140.

Hello Quico and everyone. Thank you for trying AToM-OpenMM on GPUGRID.

I am unsure if it is relevant to this issue, but AToM implements full checkpointing. Each replica's status is stored in a .xml file in the replica directory. We usually checkpoint every 10 mins, but this interval can be changed in the control file with the CHECKPOINT_TIME parameter (in seconds). Checkpointing is also triggered by SIGTERM or SIGINT signals sent to the main AToM process.

Launching the AToM job from the same folder reads the checkpoints and should restart the simulation as if it had kept running.
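
For illustration only (this is a sketch, not AToM's actual code): catching those signals in Python and writing one last checkpoint before exiting could look something like this, assuming a write_checkpoint() helper:

import signal
import sys

def write_checkpoint():
    # placeholder: AToM stores each replica's state in r*/<jobname>_ckpt.xml
    pass

def handle_termination(signum, frame):
    # checkpoint once more, then exit cleanly so a later restart can resume
    write_checkpoint()
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_termination)
signal.signal(signal.SIGINT, handle_termination)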

bibi
Send message
Joined: 4 May 17
Posts: 14
Credit: 8,623,824,643
RAC: 1,735,191
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60158 - Posted: 24 Mar 2023 | 14:08:25 UTC
Last modified: 24 Mar 2023 | 14:13:21 UTC

The Python task must tell the BOINC client how many ticks it has to calculate (MAX_SAMPLES = 341 from *_asyncre.cntl, times 22 replicas) and signal the end of each tick.

In addition, the elapsed time starts counting again from 0 after each restart. I don't know what the current situation is.

If the progress indicator is now OK, forget my reply.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60159 - Posted: 24 Mar 2023 | 14:09:29 UTC
Last modified: 24 Mar 2023 | 14:18:59 UTC

The ATM tasks also record that a task has checkpointed in the job.log file in the slot directory (or did so, a few debug iterations ago - see message 60046).

That file can be viewed while a task is running, but not after it's finished. It's written (I think) by the science app, but messages are passed to BOINC by the wrapper: that's probably where the problem is.

Edit: OK, I've downloaded a BACE task (resend _4) and a T_PTP1B_new task (resend _3). I'll watch them when the current pair of Abouh tasks have finished.

Emilio Gallicchio
Send message
Joined: 23 Mar 23
Posts: 4
Credit: 87,500
RAC: 0
Level

Scientific publications
wat
Message 60160 - Posted: 24 Mar 2023 | 15:45:51 UTC - in response to Message 60158.

The GPUGRID version of AToM:

https://github.com/Gallicchio-Lab/AToM-OpenMM/blob/master/sync/atm.py

has this:


# Report progress on GPUGRID
progress = float(isample)/float(num_samples - last_sample)
open("progress", "w").write(str(progress))



which checks out as far as I can tell. last_sample is retrieved from checkpoints upon restart, so the progress % should be tracked correctly across restarts.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60161 - Posted: 24 Mar 2023 | 15:46:40 UTC

OK, the BACE task is running, and after 7 minutes or so, I see:

2023-03-24 15:40:33 - INFO - sync_re - Started: checkpointing
2023-03-24 15:40:49 - INFO - sync_re - Finished: checkpointing (duration: 15.699278543004766 s)
2023-03-24 15:40:49 - INFO - sync_re - Finished: sample 1 (duration: 303.5407383099664 s)

in the run.log file. So checkpointing is happening, but just not being reported through to BOINC.

Progress is 3.582% after eleven minutes.

Emilio Gallicchio
Send message
Joined: 23 Mar 23
Posts: 4
Credit: 87,500
RAC: 0
Level

Scientific publications
wat
Message 60162 - Posted: 24 Mar 2023 | 16:04:08 UTC - in response to Message 60157.

Actually, it is unclear if AToM's GPUGRID version checkpoints after catching termination signals. I'll ask Raimondas. Termination without checkpointing is usually okay, but progress since the last checkpoint would be lost, and the number of samples recorded in the checkpoint file would not reflect the actual number of samples run.

Does anyone know if BOINC sends specific signals to terminate an app? Would the app pass the signal to the main AToM's python process?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60163 - Posted: 24 Mar 2023 | 16:20:44 UTC - in response to Message 60162.

The app seems to be both checkpointing, and updating progress, at the end of each sample. That will make re-alignment after a pause easier, but there's always some over-run, and data lost on restart. It's up to the application itself to record the data point reached, and to be used for the restart, as an integral part of the checkpointing process.

I can't answer immediately on the termination question, but it's all open-source and I can look through it. In this case, it's more complicated, because BOINC will talk to the wrapper, and the wrapper will talk to the science app.

But the basic idea is that BOINC will send a request to terminate over the API, and wait for the application to close itself down as it sees fit. Actual signals will only be used to force termination in the case of an unconditional quit, such as an operating system closedown.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60164 - Posted: 24 Mar 2023 | 16:20:50 UTC
Last modified: 24 Mar 2023 | 16:20:58 UTC

Seriously? Only 14 tasks a day?

GPUGRID 3/24/2023 9:17:44 AM This computer has finished a daily quota of 14 tasks

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60165 - Posted: 24 Mar 2023 | 16:42:27 UTC - in response to Message 60164.

Seriously? Only 14 tasks a day?

The quota adjusts dynamically - it goes up if you report successful tasks, and goes down if you report errors.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60166 - Posted: 24 Mar 2023 | 16:53:12 UTC

The T_PTP1B_new task, on the other hand, is not reporting progress, even though it's logging checkpoints in the run.log.

A file is maintained in the slot folder, called 'boinc_task_state.xml' (it's probably written by the wrapper, though I'm not certain of that).

The current contents are:

<active_task>
<project_master_url>https://www.gpugrid.net/</project_master_url>
<result_name>T_PTP1B_new_23484_23482_T3_2A_1-QUICO_TEST_ATM-0-1-RND3714_3</result_name>
<checkpoint_cpu_time>10.942300</checkpoint_cpu_time>
<checkpoint_elapsed_time>30.176729</checkpoint_elapsed_time>
<fraction_done>0.001996</fraction_done>
<peak_working_set_size>8318976</peak_working_set_size>
<peak_swap_size>16592896</peak_swap_size>
<peak_disk_usage>1318196036</peak_disk_usage>
</active_task>

The <fraction_done> value is reported as the 'progress %' figure - this one is shown as 0.199% by BOINC Manager (which truncates) and 0.200% by other tools (which round).

This task has been running for 43 minutes, and boinc_task_state.xml hasn't been re-written since the first minute.
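
Incidentally, 0.001996 is exactly what you would get if the wrapper derived fraction_done from the job.xml task weights with only the first two sub-tasks (conda-unpack and tar, weight 1 each) finished and the main task (weight 1000) contributing nothing yet. That's an assumption about how the wrapper combines the weights, but the numbers line up:

(1 + 1) / (1 + 1 + 1000) = 2 / 1002 ≈ 0.001996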

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60167 - Posted: 24 Mar 2023 | 20:30:16 UTC


Task 27438680 completed and validated, while the following task failed after a restart:
task 27438865


Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60169 - Posted: 24 Mar 2023 | 20:49:28 UTC

My BACE task 33378091 finished successfully after 5 hours, under Linux Mint 21.1 with a GTX 1660 Super.

Four previous attempts failed, two of them under Windows with a 0xc0000135 error in Python.exe - that's a missing DLL.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60170 - Posted: 24 Mar 2023 | 21:46:07 UTC

Task 27438853
Completed and validated. Short one though.

Emilio Gallicchio
Send message
Joined: 23 Mar 23
Posts: 4
Credit: 87,500
RAC: 0
Level

Scientific publications
wat
Message 60171 - Posted: 25 Mar 2023 | 2:28:51 UTC - in response to Message 60163.


I can't answer immediately on the termination question, but it's all open-source and I can look through it. In this case, it's more complicated, because BOINC will talk to the wrapper, and the wrapper will talk to the science app.

But the basic idea is that BOINC will send a request to terminate over the API, and wait for the application to close itself down as it sees fit. Actual signals will only be used to force termination in the case of an unconditional quit, such as an operating system closedown.


Right, probably the wrapper should send a termination signal to AToM.

We have of course access to AToM's sources https://github.com/Gallicchio-Lab/AToM-OpenMM and we can make sure that it checkpoints appropriately when it receives the signal.

However, I do not have access to the wrapper. Quico: please advise.

Profile Landjunge
Send message
Joined: 2 Nov 08
Posts: 3
Credit: 5,209,144,224
RAC: 4,406,088
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60172 - Posted: 25 Mar 2023 | 9:32:49 UTC
Last modified: 25 Mar 2023 | 9:33:49 UTC

Hi, I have some "new_2" ATMs that have been running for 14h+ now. Should I abort them?
Running Linux with RTX 3070 cards.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60173 - Posted: 25 Mar 2023 | 9:38:33 UTC - in response to Message 60171.
Last modified: 25 Mar 2023 | 9:50:33 UTC

The wrapper you're using at the moment is called "wrapper_26198_x86_64-pc-linux-gnu" (I haven't tried ATM under Windows yet, but can and will do so when I get a moment).

That wrapper name looks as if it was prepared from BOINC code dating to around February 2017. At that time, BOINC was working on versions of the wrapper specifically intended for use with VirtualBox.

BOINC makes pre-compiled versions of the wrapper available for projects to use "as is", but some projects customise the source code to suit their own needs. I don't know which path GPUGrid has taken.

Edit - I had only looked at the file name the first time. In stderr.txt, I see

20:37:54 (115491): wrapper (7.7.26016): starting

That would put the date back to around November 2015, but I guess someone has made some local modifications.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60174 - Posted: 25 Mar 2023 | 9:45:14 UTC - in response to Message 60172.

Hi, i have some "new_2" ATMs that run for 14h+ yet. Should i abort them?

I have one at the moment which has been running for 17.5 hours. The same machine completed one yesterday (task 33374928) which ran for 19 hours. I wouldn't abort it just yet.

Profile Landjunge
Send message
Joined: 2 Nov 08
Posts: 3
Credit: 5,209,144,224
RAC: 4,406,088
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60175 - Posted: 25 Mar 2023 | 9:46:50 UTC - in response to Message 60174.

Hi, i have some "new_2" ATMs that run for 14h+ yet. Should i abort them?

I have one at the moment which has been running for 17.5 hours. The same machine completed one yesterday (task 33374928) which ran for 19 hours. I wouldn't abort it just yet.



Thank you. I will let them keep running =)

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60176 - Posted: 25 Mar 2023 | 11:32:54 UTC - in response to Message 60175.

And completed.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60177 - Posted: 25 Mar 2023 | 13:06:20 UTC - in response to Message 60165.
Last modified: 25 Mar 2023 | 13:53:41 UTC

Seriously? Only 14 tasks a day?

The quota adjusts dynamically - it goes up if you report successful tasks, and goes down if you report errors.

Quico, this behavior is intended to block misconfigured computers. In this case it's your Windows version that fails in seconds and gets resent until it hits a Linux computer or fails 7 times. My Win computer was locked out of GG early yesterday, but all my Linux computers kept contributing until the WUs ran out.
In this example the first 4 failures all went to Win7 & 11 computers and then Linux completed it successfully:
https://www.gpugrid.net/workunit.php?wuid=27438768

And the Win WUs are failing in seconds again with today's tranche.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60183 - Posted: 25 Mar 2023 | 14:27:30 UTC

WUs failing on Linux computers:

+ python -m pip install git+https://github.com/raimis/AToM-OpenMM.git@172e6db924567cd0af1312d33f05b156b53e3d1c
Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /var/lib/boinc-client/slots/36/tmp/pip-req-build-jsq34xa4
fatal: unable to access '/home/conda/feedstock_root/build_artifacts/git_1679396317102/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/etc/gitconfig': Permission denied
error: subprocess-exited-with-error

× git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /var/lib/boinc-client/slots/36/tmp/pip-req-build-jsq34xa4 did not run successfully.
│ exit code: 128
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

https://www.gpugrid.net/result.php?resultid=33379917

Profile Landjunge
Send message
Joined: 2 Nov 08
Posts: 3
Credit: 5,209,144,224
RAC: 4,406,088
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60184 - Posted: 25 Mar 2023 | 14:30:06 UTC

Any ideas why WUs are failing on a Linux Ubuntu machine with a GTX 1070?

<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
14:01:49 (3551): wrapper (7.7.26016): starting
14:02:12 (3551): wrapper (7.7.26016): starting
14:02:12 (3551): wrapper: running bin/python (bin/conda-unpack)
14:02:13 (3551): bin/python exited; CPU time 0.280413
14:02:13 (3551): wrapper: running bin/tar (xjvf input.tar.bz2)
14:02:14 (3551): bin/tar exited; CPU time 0.840912
14:02:14 (3551): wrapper: running bin/bash (run.sh)
+ echo 'Setup environment'
+ source bin/activate
++ _conda_pack_activate
++ local _CONDA_SHELL_FLAVOR
++ '[' -n x ']'
++ _CONDA_SHELL_FLAVOR=bash
++ local script_dir
++ case "$_CONDA_SHELL_FLAVOR" in
+++ dirname bin/activate
++ script_dir=bin
+++ cd bin
+++ pwd
++ local full_path_script_dir=/var/lib/boinc-client/slots/7/bin
+++ dirname /var/lib/boinc-client/slots/7/bin
++ local full_path_env=/var/lib/boinc-client/slots/7
+++ basename /var/lib/boinc-client/slots/7
++ local env_name=7
++ '[' -n '' ']'
++ export CONDA_PREFIX=/var/lib/boinc-client/slots/7
++ CONDA_PREFIX=/var/lib/boinc-client/slots/7
++ export _CONDA_PACK_OLD_PS1=
++ _CONDA_PACK_OLD_PS1=
++ PATH=/var/lib/boinc-client/slots/7/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
++ PS1='(7) '
++ case "$_CONDA_SHELL_FLAVOR" in
++ hash -r
++ local _script_dir=/var/lib/boinc-client/slots/7/etc/conda/activate.d
++ '[' -d /var/lib/boinc-client/slots/7/etc/conda/activate.d ']'
+++ ls -A /var/lib/boinc-client/slots/7/etc/conda/activate.d
++ '[' -n ocl-icd_activate.sh ']'
++ local _path
++ for _path in "$_script_dir"/*.sh
++ . /var/lib/boinc-client/slots/7/etc/conda/activate.d/ocl-icd_activate.sh
+++ conda_ocl_icd_activate
++++ ls /var/lib/boinc-client/slots/7/etc/OpenCL/vendors/
+++ [[ -z ocl-icd-system ]]
+ export PATH=/var/lib/boinc-client/slots/7:/var/lib/boinc-client/slots/7/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ PATH=/var/lib/boinc-client/slots/7:/var/lib/boinc-client/slots/7/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ echo 'Create a temporary directory'
+ export TMP=/var/lib/boinc-client/slots/7/tmp
+ TMP=/var/lib/boinc-client/slots/7/tmp
+ mkdir -p /var/lib/boinc-client/slots/7/tmp
+ echo 'Install AToM'
+ REPO_URL=git+https://github.com/raimis/AToM-OpenMM.git@172e6db924567cd0af1312d33f05b156b53e3d1c
+ python -m pip install git+https://github.com/raimis/AToM-OpenMM.git@172e6db924567cd0af1312d33f05b156b53e3d1c
Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /var/lib/boinc-client/slots/7/tmp/pip-req-build-0qwsbkqo
Running command git rev-parse -q --verify 'sha^172e6db924567cd0af1312d33f05b156b53e3d1c'
Running command git fetch -q https://github.com/raimis/AToM-OpenMM.git 172e6db924567cd0af1312d33f05b156b53e3d1c
Running command git checkout -q 172e6db924567cd0af1312d33f05b156b53e3d1c
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: -4
╰─> [0 lines of output]
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
14:02:22 (3551): bin/bash exited; CPU time 2.696428
14:02:22 (3551): app exit status: 0x1
14:02:22 (3551): called boinc_finish(195)

</stderr_txt>

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60185 - Posted: 25 Mar 2023 | 16:27:51 UTC - in response to Message 60173.

(I haven't tried ATM under Windows yet, but can and will do so when I get a moment).

Just downloaded a BACE task for Windows. There may be trouble ahead...

The job.xml file reads:

<job_desc>
<unzip_input>
<zipfilename>windows_x86_64__cuda1121.zip</zipfilename>
</unzip_input>
<task>
<application>python.exe</application>
<command_line>bin/conda-unpack</command_line>
<weight>1</weight>
</task>
<task>
<application>Library/usr/bin/tar.exe</application>
<command_line>xjvf input.tar.bz2</command_line>
<setenv>PATH=$PWD/Library/usr/bin</setenv>
<weight>1</weight>
</task>
<task>
<application>C:/Windows/system32/cmd.exe</application>
<command_line>/c call run.bat</command_line>
<setenv>CUDA_DEVICE=$GPU_DEVICE_NUM</setenv>
<stdout_filename>run.log</stdout_filename>
<weight>1000</weight>
<fraction_done_filename>progress</fraction_done_filename>
</task>
</job_desc>


1) We had problems with python.exe triggering a missing DLL error. I'll run Dependency Walker over this one, to see what the problem is.

2) It runs a private version of tar.exe: Microsoft included tar as a system utility from Windows 10 onwards - but I'm running Windows 7. The MS utility wouldn't run for me - I'll try this one.

3) I'm not totally convinced of the cmd.exe syntax either, but we'll cross that bridge when we get to it.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60186 - Posted: 25 Mar 2023 | 17:16:04 UTC - in response to Message 60185.
Last modified: 25 Mar 2023 | 17:42:46 UTC

First reports from Dependency Walker:

"Error opening file: The system cannot find the file specified" for
API-MS-WIN-CORE-PATH-L1-1-0.DLL
API-MS-WIN-CORE-WINRT-ERROR-L1-1-0.DLL
API-MS-WIN-CORE-WINRT-L1-1-0.DLL
API-MS-WIN-CORE-WINRT-ROBUFFER-L1-1-0.DLL
API-MS-WIN-CORE-WINRT-STRING-L1-1-0.DLL
DCOMP.DLL
IESHIMS.DLL

The API-MS-WIN group and IESHIMS.DLL usually resolve when delay-load files are loaded during the run. But I can't find DCOMP.DLL in either the unpacked libraries, or the Windows system disk.

DCOMP.DLL seems to be called from MSHTML.DLL, which is a Windows system file. But I still can't find it from there.

Enough for now - my head is spinning!

Edit - DCOMP.DLL is present on my Windows 10 - now Windows 11 - laptop. Another fine example of Microsoft version control.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60188 - Posted: 26 Mar 2023 | 8:24:32 UTC
Last modified: 26 Mar 2023 | 9:21:02 UTC

Just a note of warning: one of my machines is running a JNK1 task - been running for 13 hours.

It's running fine - the run log has reached sample 287, and progress has reached 1.2654867256637168

But that's over 100%, and the BOINC display has reached (and is pegged at) 100% - probably has been for several hours. Ignore it.

Edit: It's reached sample 298. And I've found a [task name].cntl file, which contains the line

MAX_SAMPLES = 341

One reason why this needs fixing: I have my BOINC client set up in such a way that it normally fetches the next task around an hour before the current one is expected to finish. Because this one was (apparently) running so fast, it reached that point over five hours ago - and it's still waiting. Sorry Abouh - your next result will be late!

Freewill
Send message
Joined: 18 Mar 10
Posts: 13
Credit: 6,627,287,394
RAC: 30,274,262
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60189 - Posted: 26 Mar 2023 | 11:48:39 UTC

I also noticed that this latest round of BACE tasks takes much longer to run on my GPUs. Some are hitting > 24 hrs. I am going to stop taking new ones unless the number of samples per task is trimmed down.

[SG] Felix
Send message
Joined: 29 Jan 16
Posts: 11
Credit: 31,098,035
RAC: 0
Level
Val
Scientific publications
watwat
Message 60190 - Posted: 26 Mar 2023 | 12:16:20 UTC

I had this one running for about 8 hours, but then I had to shut down my computer.
Unfortunately, it couldn't restart from the app checkpoint, and since there is no BOINC checkpoint, it crashed and reported no run time.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60191 - Posted: 26 Mar 2023 | 12:31:35 UTC

Forget about a restart; these WUs cannot even survive a suspension. I suspended my computer and this WU collapsed:
task 27438865

[SG] Felix
Send message
Joined: 29 Jan 16
Posts: 11
Credit: 31,098,035
RAC: 0
Level
Val
Scientific publications
watwat
Message 60192 - Posted: 26 Mar 2023 | 13:10:54 UTC

I'm a bit surprised right now. I looked at the resend, and it was successfully completed in just over 2 minutes - how come? That computer has more WUs that were successfully completed in such a short time. Am I doing something wrong?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60193 - Posted: 26 Mar 2023 | 13:44:47 UTC - in response to Message 60189.

I also noticed this latest round of BACE tasks have become much longer to run on my GPUs. Some are hitting > 24 hrs. I am going to stop taking new ones unless the # samples/task is trimmed down.


I agree, the 4-6hr runs are much better.
____________

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 467
Credit: 8,358,971,966
RAC: 9,198,359
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60194 - Posted: 26 Mar 2023 | 23:44:00 UTC

I have a task that reached 100% an hour ago, which means it is supposed to be finished, but it's still running...

https://www.gpugrid.net/workunit.php?wuid=27439822

I don't want to abort it, but this is annoying...

What would be a reasonable amount of time to let it run?

The runtime at posting time is 7 hours and 30 minutes.



Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60195 - Posted: 27 Mar 2023 | 1:25:12 UTC - in response to Message 60194.

My last ATM tasks spent at least a couple of hours at the 100% completion point.

Just let them run and eventually they will turn themselves in for validation.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 467
Credit: 8,358,971,966
RAC: 9,198,359
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60196 - Posted: 27 Mar 2023 | 1:39:49 UTC - in response to Message 60195.

That's a moot point now. It errored out.

https://www.gpugrid.net/result.php?resultid=33381994

I guess this goes with the territory.




Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60197 - Posted: 27 Mar 2023 | 4:03:36 UTC - in response to Message 60196.

It looks like you got bit by a permission error.

PermissionError: [Errno 13] Permission denied: 'r0/Jnk1_new_2-18659-18634_ckpt.xml'

Your boinc.service file might be an old version that does not give applications access to the tmp directory, or something like that.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 467
Credit: 8,358,971,966
RAC: 9,198,359
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60198 - Posted: 27 Mar 2023 | 7:03:07 UTC - in response to Message 60197.

It looks like you got bit by a permission error.

PermissionError: [Errno 13] Permission denied: 'r0/Jnk1_new_2-18659-18634_ckpt.xml'

Your boinc.service file might be an old version that does not let applications access to the .tmp directory or something.



The Boinc version is 7.20.7.

https://www.gpugrid.net/hosts_user.php?userid=19626


Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 467
Credit: 8,358,971,966
RAC: 9,198,359
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60199 - Posted: 27 Mar 2023 | 7:33:49 UTC

Another task failed.

https://www.gpugrid.net/result.php?resultid=33383003

03/27/2023 3:20:22 AM | GPUGRID | Computation for task MCL1_new_2_29_27_OFF_4-QUICO_ATM_OPENFF-0-1-RND5141_5 finished
03/27/2023 3:20:22 AM | GPUGRID | Output file MCL1_new_2_29_27_OFF_4-QUICO_ATM_OPENFF-0-1-RND5141_5_0 for task MCL1_new_2_29_27_OFF_4-QUICO_ATM_OPENFF-0-1-RND5141_5 absent

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60200 - Posted: 27 Mar 2023 | 8:04:20 UTC - in response to Message 60199.

The output file will always be absent if the task fails - it doesn't get as far as writing it. The actual error is in the online report:

ValueError: Energy is NaN.

('Not a Number')

That's a science problem - not your fault.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60201 - Posted: 27 Mar 2023 | 8:48:51 UTC - in response to Message 60171.
Last modified: 27 Mar 2023 | 8:49:14 UTC

I've seen that you are unhappy with the last batch of runs, given how long they take. I've been experimenting with dividing the runs into different steps to find a sweet spot where you're happy and it's not madness for me to organize all these runs and re-runs. I'll backtrack to the previous settings we had before. Apologies for that.


I can't answer immediately on the termination question, but it's all open-source and I can look through it. In this case, it's more complicated, because BOINC will talk to the wrapper, and the wrapper will talk to the science app.

But the basic idea is that BOINC will send a request to terminate over the API, and wait for the application to close itself down as it sees fit. Actual signals will only be used to force termination in the case of an unconditional quit, such as an operating system closedown.


Right, probably the wrapper should send a termination signal to AToM.

We have of course access to AToM's sources https://github.com/Gallicchio-Lab/AToM-OpenMM and we can make sure that it checkpoints appropriately when it receives the signal.

However, I do not have access to the wrapper. Quico: please advise.


I'll ask Raimondas about this and the other things that have been mentioned, since he's the one taking care of this issue.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60202 - Posted: 27 Mar 2023 | 8:52:22 UTC

I've seen some people mentioning that the progress reporting doesn't work, or that it goes over 100%. Does it work correctly for anyone?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60203 - Posted: 27 Mar 2023 | 9:08:39 UTC - in response to Message 60202.

I've read some people mentioning that the reporter doesn't work or that it goes over 100%. Does it work correctly for someone?

It varies from task to task - or, I suspect, from batch to batch. I mentioned a specific problem with a JNK1 task - task 33380692 - but it's not a general problem.

I suspect that it may have been a specific problem with the data that drives the progress percentage calculation - the wrong expected 'total number of samples' may have been used.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60204 - Posted: 27 Mar 2023 | 9:37:43 UTC - in response to Message 60203.
Last modified: 27 Mar 2023 | 9:38:19 UTC

I've read some people mentioning that the reporter doesn't work or that it goes over 100%. Does it work correctly for someone?

It varies from task to task - or, I suspect, from batch to batch. I mentioned a specific problem with a JNK1 task - task 33380692 - but it's not a general problem.

I suspect that it may have been a specific problem with setting the data that drives the progress %age calculation - the wrong expected 'total number of samples' may have been used.


This one is a rerun, meaning that 2/3 of the run was previously simulated.
Maybe it was expecting to start from 0 samples and, once it saw that we were already at sample 228 from the beginning, it got confused.

I'll mention that.

PS: But other runs have been reporting correctly?

bibi
Send message
Joined: 4 May 17
Posts: 14
Credit: 8,623,824,643
RAC: 1,735,191
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60205 - Posted: 27 Mar 2023 | 9:43:06 UTC

https://www.gpugrid.net/result.php?resultid=33382097

I had to suspend this task at sample 149 and resumed it an hour later, but it started again with the Python install step and died. It should have restarted with sample 149.

bibi
Send message
Joined: 4 May 17
Posts: 14
Credit: 8,623,824,643
RAC: 1,735,191
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60206 - Posted: 27 Mar 2023 | 9:47:44 UTC - in response to Message 60204.

see post https://www.gpugrid.net/forum_thread.php?id=5379&nowrap=true#60160

it should be
progress = float(isample)/float(num_samples)

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60207 - Posted: 27 Mar 2023 | 10:22:32 UTC - in response to Message 60206.

Or possibly

progress = float(isample - last_sample)/float(num_samples - last_sample)

if you want a truncated resend to start from 0% - but might that affect paused/resumed tasks as well?
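
To make the difference concrete, here is a throwaway sketch (not project code) comparing the three formulas for the rerun Quico described, where 228 of 341 samples were already done on a previous host:

NUM_SAMPLES = 341   # MAX_SAMPLES from the *_asyncre.cntl file
LAST_SAMPLE = 228   # samples completed before the resend

def current(isample):      # formula in the shipped atm.py
    return isample / (NUM_SAMPLES - LAST_SAMPLE)

def whole_job(isample):    # bibi's suggestion: fraction of the whole job
    return isample / NUM_SAMPLES

def this_resend(isample):  # the variant above: fraction of this resend only
    return (isample - LAST_SAMPLE) / (NUM_SAMPLES - LAST_SAMPLE)

for isample in (229, 285, 341):
    print(isample, round(current(isample), 3),
          round(whole_job(isample), 3), round(this_resend(isample), 3))

# current() passes 1.0 long before the end (3.018 at sample 341),
# whole_job() starts the resend at about 0.67, this_resend() runs 0 -> 1.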

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60208 - Posted: 27 Mar 2023 | 13:49:57 UTC

None of my WUs from yesterday completed. Please issue a server abort and eliminate all these defective WUs before releasing a new set. Otherwise the defects will keep wasting 8 computers' time for days to come.

The problem is not the time they take to run:
No checkpointing.
They fail if suspended and restarted.

kksplace
Send message
Joined: 4 Mar 18
Posts: 53
Credit: 1,464,751,749
RAC: 3,464,677
Level
Met
Scientific publications
wat
Message 60209 - Posted: 27 Mar 2023 | 14:28:18 UTC

The problem is not the time they take to run.
No checkpointing.
Fail if suspended and restarted


I agree with this. I had one error out on a restart two days ago after reaching nearly 100%, due to the lack of checkpoints. Not only that, but it then showed only 37 seconds of CPU time, so it doesn't show what really happened. My latest one did complete but showed no checkpoints. The long run time therefore carries a higher risk from a potential interruption.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60210 - Posted: 27 Mar 2023 | 16:05:56 UTC - in response to Message 60208.
Last modified: 27 Mar 2023 | 16:07:19 UTC

None of my WUs from yesterday completed. Please issue a server abort and eliminate all these defective WUs before releasing a new set. Otherwise defects will keep wasting 8 computers time for days to come.

The problem is not the time they take to run.
No checkpointing.
Fail if suspended and restarted.

______________

My problem with restarting and suspending is that these WUs are GPU intensive. As soon as one of them pops up, my GPU fans let me know it's time to do maintenance on the cooling system. I have laptops; I cannot take a blower to a running system.
Now this WU, for example, has run for 21 hours and is at 34.5%:
task 27440346
Edit: it is still running fine.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60211 - Posted: 27 Mar 2023 | 16:41:41 UTC - in response to Message 60198.

It looks like you got bit by a permission error.

PermissionError: [Errno 13] Permission denied: 'r0/Jnk1_new_2-18659-18634_ckpt.xml'

Your boinc.service file might be an old version that does not let applications access to the .tmp directory or something.



The Boinc version is 7.20.7.

https://www.gpugrid.net/hosts_user.php?userid=19626



Not your fault; I got a couple of errored tasks that duplicated yours. Just a bad batch of tasks went out.

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 12,875,793
RAC: 0
Level
Pro
Scientific publications
wat
Message 60212 - Posted: 28 Mar 2023 | 0:39:31 UTC - in response to Message 60185.
Last modified: 28 Mar 2023 | 0:57:30 UTC

I have a problem with cmd. It exits with code 1 in 0 seconds.
The BOINC version is 7.22.0 from https://github.com/BOINC/boinc/releases/tag/client_release%2F7.22%2F7.22.0

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60213 - Posted: 28 Mar 2023 | 14:55:14 UTC

I've got another very curious one.

PTP1B_new_20670_2qbs_23466_T4_2A-QUICO_TEST_ATM-0-1-RND0584_1

It started running about 2 hours ago, and says it's passed 60% progress. But now it seems to be making much slower work of it.

Looking at the run log, it started with MAX_SAMPLES: 114. The log entries run from

2023-03-20 06:25:58 - INFO - sync_re - Started: sample 1, replica 0
to
2023-03-20 11:09:35 - INFO - sync_re - Finished: sample 114 (duration: 149.1039990450081 s)
2023-03-20 11:09:35 - INFO - sync_re - Finished: ATM simulations (duration: 17016.784924168984 s)

Then it appears to start again, this time with MAX_SAMPLES: 341, logging from

2023-03-28 13:25:11 - INFO - sync_re - Started: sample 115, replica 0
(this is roughly when the task started running on my machine)
to, so far
2023-03-28 15:45:16 - INFO - sync_re - Finished: sample 142 (duration: 299.707962396089 s)

Note that each sample is taking roughly twice as long to complete as the ones before 114 - presumably they were run on a different machine?

The task is another resend, but the logging feels very strange. Is this how it's supposed to look?

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60214 - Posted: 28 Mar 2023 | 15:20:32 UTC - in response to Message 60210.

None of my WUs from yesterday completed. Please issue a server abort and eliminate all these defective WUs before releasing a new set. Otherwise defects will keep wasting 8 computers time for days to come.

The problem is not the time they take to run.
No checkpointing.
Fail if suspended and restarted.

______________

My problem with re-start and suspending is, these WUs are GPU intensive. As soon as one of these WUs pops up, my GPU fans let me know to do maintenance of the cooling system. I have laptops. I cannot take a blower on a running system.
Now this WU for example has run for 21 hours and is at 34.5%.
task 27440346
Edit. It is still running fine.

_____________________________

The above-mentioned WU is at 71.8% and has now been running for 1 day and 20 hours. It is still running fine, and as I cannot read log files, you can go over what it has been doing once it is finished.
I have set GPUGRID to fetch no further WUs. I will re-open it after the updates etc. that I have been holding off.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60215 - Posted: 28 Mar 2023 | 17:18:27 UTC
Last modified: 28 Mar 2023 | 17:20:17 UTC

Looked at the errored tasks list on my account this morning and saw that another slew of badly misconfigured tasks went out.

Been seeing a lot of file not found errors.

FileNotFoundError: [Errno 2] No such file or directory: 'MCL1_new_2-50-60_0.xml'

Thankfully they fail fast and are purged shortly after working through the _7 iteration.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60216 - Posted: 28 Mar 2023 | 23:51:18 UTC - in response to Message 60214.

None of my WUs from yesterday completed. Please issue a server abort and eliminate all these defective WUs before releasing a new set. Otherwise defects will keep wasting 8 computers time for days to come.

The problem is not the time they take to run.
No checkpointing.
Fail if suspended and restarted.

______________

My problem with re-start and suspending is, these WUs are GPU intensive. As soon as one of these WUs pops up, my GPU fans let me know to do maintenance of the cooling system. I have laptops. I cannot take a blower on a running system.
Now this WU for example has run for 21 hours and is at 34.5%.
task 27440346
Edit. It is still running fine.

_____________________________

The above-mentioned WU is at 71.8% and has been running now for 1 Day and 20 hours. It is still running fine and as I cannot read log files, you can go over what it has been doing once finished.
I have marked no further WUs from GPUgrid. I will re-open after updates, etc which I have forced-paused.

________________

Completed after two days, four hours and forty minutes.
Now there is another problem. One task has been showing 100% completed for the last four hours, but it is still using the CPU for something - not the GPU. The elapsed clock is still ticking, but the time remaining is zero.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60218 - Posted: 29 Mar 2023 | 1:47:31 UTC - in response to Message 60216.

This task PTP1B_23471_23468_2_2A-QUICO_TEST_ATM-0-1-RND8957_1 is currently doing the same on this host.

It has been at 100% complete for at least an hour now.

I know to just leave them alone and they will eventually finish and report as validated.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 467
Credit: 8,358,971,966
RAC: 9,198,359
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60219 - Posted: 29 Mar 2023 | 7:09:23 UTC

This task reached "100% complete" in about 7 hours, and then ran for an additional 7+ hours before actually finishing.

https://www.gpugrid.net/workunit.php?wuid=27442023


Anybody got that beat?



Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60220 - Posted: 29 Mar 2023 | 7:20:09 UTC - in response to Message 60219.
Last modified: 29 Mar 2023 | 7:30:44 UTC

Anybody got that beat??????

The task I reported in Message 60213 (14:55 yesterday) is still running. It was approaching 100% when I went to bed last night, and it's still there this morning. I'll go and check it out after coffee (I can't see the sample numbers remotely).

As soon as I wrote that, it uploaded and reported! Ah well, my other Linux machine has got one in the same state.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60223 - Posted: 29 Mar 2023 | 8:23:53 UTC - in response to Message 60216.
Last modified: 29 Mar 2023 | 8:31:49 UTC

None of my WUs from yesterday completed. Please issue a server abort and eliminate all these defective WUs before releasing a new set. Otherwise defects will keep wasting 8 computers time for days to come.

The problem is not the time they take to run.
No checkpointing.
Fail if suspended and restarted.

______________

My problem with re-start and suspending is, these WUs are GPU intensive. As soon as one of these WUs pops up, my GPU fans let me know to do maintenance of the cooling system. I have laptops. I cannot take a blower on a running system.
Now this WU for example has run for 21 hours and is at 34.5%.
task 27440346
Edit. It is still running fine.

_____________________________

The above-mentioned WU is at 71.8% and has been running now for 1 Day and 20 hours. It is still running fine and as I cannot read log files, you can go over what it has been doing once finished.
I have marked no further WUs from GPUgrid. I will re-open after updates, etc which I have forced-paused.

________________

Completed after two days, four hours and forty minutes.
Now there is another problem. One task is showing 100% completed for the last four hours but it is still using the CPU for something. Not the GPU. The elapsed clock is still ticking but the remaining is zero.

_________________

Just woke up. The task was finished. Sent it home.
task 27441741

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60224 - Posted: 29 Mar 2023 | 8:31:04 UTC

OK, it's the same story as yesterday. This task:

PTP1B_23486_23479_4_2A-QUICO_TEST_ATM-0-1-RND5081_2

downloaded at 15:26:54 UTC yesterday, and started running at about 16:30 UTC.

As before, the run.log shows a MAX_SAMPLES: 114, with timings that don't match my machine. The 16:30 run has MAX_SAMPLES: 341, and starts running with sample 115.

The machine downloaded a new task at 3:50:47 UTC: that normally happens around 85 - 90% progress, with an hour to run - but the existing one is still only at sample 308, so maybe three hours to go. And it's another PTP1B_new_ resend, so we may have to go round the cycle again.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60225 - Posted: 29 Mar 2023 | 9:22:38 UTC - in response to Message 60224.

OK, it's the same story as yesterday. This task:

PTP1B_23486_23479_4_2A-QUICO_TEST_ATM-0-1-RND5081_2

downloaded at 15:26:54 UTC yesterday, and started running at about 16:30 UTC.

As before, the run.log shows a MAX_SAMPLES: 114, with timings that don't match my machine. The 16:30 run has MAX_SAMPLES: 341, and starts running with sample 115.

The machine downloaded a new task at 3:50:47 UTC: that normally happens around 85 - 90% progress, with an hour to run - but the existing one is still only at sample 308, so maybe three hours to go. And it's another PTP1B_new_ resend, so we may have to go round the cycle again.


I believe it's what I imagined. With the manual division I was doing before, I was splitting some runs into 2 or 3 steps: 114 - 228 - 341 samples. If the job ID has a 2A/3A, it most probably starts from a previous checkpoint, and the progress report goes crazy with it. I'll pass this on to Raimondas to see if he can take a look at it.

Our first priority is to make these job divisions happen automatically, like ACEMD does; that way we can avoid these really long jobs for everyone. Doing this manually makes it really hard to track all the jobs and the resends. So I hope that in the next few days everything goes more smoothly.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60226 - Posted: 29 Mar 2023 | 12:12:55 UTC - in response to Message 60225.

Thanks. Now I know what I'm looking for (and when), I was able to watch the next transition.

Task PTP1B_new_20669_2qbr_23472_T1_2A-QUICO_TEST_ATM-0-1-RND5753_3 started with a couple of 0.1% initial steps (as usual), but then jumped to 50.983%. It then moved on by 0.441% every five minutes or so.

The run.log shows the same figures as before: a pre-existing run of 114 samples, then the real work starts with sample 115, and should proceed to a max_sample of 341. The progress jumps match the completion of samples 115 - 120.

The %age intervals match the formula in Emilio Gallicchio's post 60160 (115/(341-114)), but I can't see where the initial big value of 50.983 comes from.

Also, I don't follow the logic of the resend explanation. Mine is replication _3, so there have been 3 previous attempts - but none of them got beyond the program setup stages: all failed in less than 100 seconds. So who did the first 114 samples?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60227 - Posted: 29 Mar 2023 | 12:26:14 UTC - in response to Message 60226.

The %age intervals match the formula in Emilio Gallicchio's post 60160 (115/(341-114)), but I can't see where the initial big value of 50.983 comes from.


115/(341-114) = 0.5066 = 50.66%

Strikingly close. Maybe "BOINC logic" applies some form of rounding, but it's pretty clear that the ~50% value is coming from this calculation.
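For illustration, here is a minimal Python sketch of the arithmetic (the variable names are my own assumptions, not the app's actual code):

last_sample = 114   # samples already completed by the previous step of this job
num_samples = 341   # MAX_SAMPLES for the current step
isample = 115       # first sample this task actually computes

# formula the progress reporter appears to be using - reproduces the big initial jump
print(isample / (num_samples - last_sample))                  # 0.5066...

# same formula with the previous step's samples subtracted - matches the ~0.441% steps
print((isample - last_sample) / (num_samples - last_sample))  # 0.0044...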

____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60228 - Posted: 29 Mar 2023 | 12:38:37 UTC - in response to Message 60227.

I thought I'd checked that, and got a different answer, but my mouse must have slipped on the calculator buttons.

The difference is probably the 0.2% program setup stages - it'll do. Thanks.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60229 - Posted: 29 Mar 2023 | 14:42:44 UTC

After that, it failed after 3 hours 20 minutes with a 'ValueError: Energy is NaN' error. Never mind - I tried.

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 12,875,793
RAC: 0
Level
Pro
Scientific publications
wat
Message 60230 - Posted: 29 Mar 2023 | 17:59:59 UTC - in response to Message 60229.
Last modified: 29 Mar 2023 | 18:27:03 UTC

The C:/Windows/system32/cmd.exe command creates a c:\users\frolo\.exe\ folder.
On subsequent runs it gives an "A subdirectory or file .exe already exists." error.

C:/Windows/system32/cmd.exe /c call test.bat outputs
The syntax of the command is incorrect.


C:\Windows\system32\cmd.exe /c call test.bat outputs
'test.bat' is not recognized as an internal or external command,
operable program or batch file.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60233 - Posted: 30 Mar 2023 | 9:51:23 UTC - in response to Message 60226.

Thanks. Now I know what I'm looking for (and when), I was able to watch the next transition.

Task PTP1B_new_20669_2qbr_23472_T1_2A-QUICO_TEST_ATM-0-1-RND5753_3 started with a couple of 0.1% initial steps (as usual), but then jumped to 50.983%. It then moved on by 0.441% every five minutes or so.

The run.log shows the same figures as before: a pre-existing run of 114 samples, then the real work starts with sample 115, and should proceed to a max_sample of 341. The progress jumps match the completion of samples 115 - 120.

The %age intervals match the formula in Emilio Gallicchio's post 60160 (115/(341-114)), but I can't see where the initial big value of 50.983 comes from.

Also, I don't follow the logic of the resend explanation. Mine is replication _3, so there have been 3 previous attempts - but none of them got beyond the program setup stages: all failed in less than 100 seconds. So who did the first 114 samples?


The first 114 samples should have been calculated by: T_PTP1B_new_20669_2qbr_23472_1A_3-QUICO_TEST_ATM-0-1-RND2542_0.tar.bz2
I've been doing all the division and resends manually, and we've been simplifying the naming convention for my sake. Now we are testing a multiple_steps protocol, just like in ACEMD, which should help ease things and, I hope, mess less with the progress reporter.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60234 - Posted: 30 Mar 2023 | 11:31:27 UTC - in response to Message 60233.

Thanks. Be aware that out here in client-land we can only locate jobs by WU or task ID numbers - it's extremely difficult to find a task by name unless we can follow an ID chain.

Newer versions of the BOINC website tools do provide a rudimentary 'search by name' facility, but it requires a full task name - no wildcards or partial matches. And I know your colleagues on this project are very wary about updating the server code. We'll just have to live with it.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60237 - Posted: 30 Mar 2023 | 18:08:57 UTC - in response to Message 60234.

Yeah I'm sorry about that. I'm trying to learn as I go.

I'll be sending (and have already sent) some runs through the ATMbeta app. We tested the multiple_steps code and it seems to work fine. That way, if everything runs smoothly, everything should get 70-sample runs (~13 ns), which should be much shorter for everyone and avoid the drag of the 24h+ runs.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60238 - Posted: 30 Mar 2023 | 18:22:27 UTC - in response to Message 60237.

Two downloaded, the first has reached 6% with no problems.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60239 - Posted: 30 Mar 2023 | 18:43:26 UTC - in response to Message 60237.

Yeah I'm sorry about that. I'm trying to learn as I go.

I'll be sending (and already sent) some runs through the ATMbeta app. We tested the multiple_steps code and it seems to work fine. That way if everything runs smoothly everything should get 70 sample runs(~13ns), which should be much shorter for everyone and avoid the drag of the +24h runs.


____________________

It's the unstable tasks, the restart problems, the suspend problems. Quite a few of us have done year-plus runs on Climate; 24-hour runs are no problem.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60240 - Posted: 30 Mar 2023 | 19:42:00 UTC
Last modified: 30 Mar 2023 | 20:12:31 UTC

deleted

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60241 - Posted: 30 Mar 2023 | 20:11:40 UTC - in response to Message 60237.

I believe I just finished one of these ATMbeta tasks.

https://www.gpugrid.net/result.php?resultid=33393179

It never checkpointed, but it did show correct estimates of time to finish, and the progress was correct and incremented correctly.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60242 - Posted: 30 Mar 2023 | 21:07:11 UTC - in response to Message 60241.
Last modified: 30 Mar 2023 | 21:07:30 UTC

I believe I just finished one of these ATMbeta tasks.

https://www.gpugrid.net/result.php?resultid=33393179

It never checkpointed but it did show correct estimations of time to finish plus the progress was correct and incremented correctly.

Same for me with Linux. Since there's no checkpointing, I didn't bother to test suspending. I think all Windows WUs failed.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60243 - Posted: 31 Mar 2023 | 7:14:27 UTC
Last modified: 31 Mar 2023 | 8:09:36 UTC

My current two ATM betas both have MAX_SAMPLES: +70 - but one started at 71, and the other at 141.

Both are displaying 100% progress. I watched one jump to 100% after about enough time to load the program and complete 1 sample: the other I would expect to finish within half an hour (it's on sample 205).

Edit - yes, it did. I see you've put step information in the task names: these were

PTP1B_20669_2qbr_23466_2-QUICO_ATM_OFF_STEPS-1-5-RND8189_0
PTP1B_23467_23475_4-QUICO_ATM_OFF_STEPS-2-5-RND5806_0

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 467
Credit: 8,358,971,966
RAC: 9,198,359
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60244 - Posted: 31 Mar 2023 | 8:29:47 UTC - in response to Message 60243.
Last modified: 31 Mar 2023 | 8:34:51 UTC

My current two ATM betas both have MAX_SAMPLES: +70 - but one started at 71, and the other at 141.

Both are displaying 100% progress. I watched one jump to 100% after about enough time to load the program and complete 1 sample: the other I would expect to finish within half an hour (it's on sample 205).

Edit - yes, it did. I see you've put step information in the task names: these were

PTP1B_20669_2qbr_23466_2-QUICO_ATM_OFF_STEPS-1-5-RND8189_0
PTP1B_23467_23475_4-QUICO_ATM_OFF_STEPS-2-5-RND5806_0


My observations are the same. When the units download, the estimated finish time reads 606 days.


https://www.gpugrid.net/results.php?hostid=534811&offset=0&show_names=0&state=0&appid=45


So far in this batch, 3 WUs have completed successfully, 1 errored, and 1 is crunching, on a Windows 10 machine.


The units all crash on my other computer, which runs Windows 7 and is rather old (13 years). Maybe it's time to retire it from this project, though it still runs well on other projects, like Einstein and FAH.



https://www.gpugrid.net/results.php?hostid=544232&offset=0&show_names=0&state=0&appid=45

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60245 - Posted: 31 Mar 2023 | 12:52:00 UTC

My first ATM beta on Windows 10 failed after some 6 hours :-(
https://www.gpugrid.net/result.php?resultid=33393839

Does anyone have an idea what exactly the problem was?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60246 - Posted: 31 Mar 2023 | 13:01:59 UTC - in response to Message 60245.

Does anyone have an idea what exactly the problem was?

It says

ValueError: Energy is NaN.

A science error (impossible result), rather than a computing error.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60247 - Posted: 31 Mar 2023 | 13:27:18 UTC - in response to Message 60246.

Potentially, it could also be due to instability from overclocking, where applicable. I know the ACEMD3 tasks are susceptible to a "particle coordinate is NaN" type of error from too much overclocking.

Of course, it's less likely if things are not overclocked, or only mildly overclocked. But I'm just raising the possibility.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60248 - Posted: 31 Mar 2023 | 18:06:07 UTC - in response to Message 60247.

Potentially, it could also be due to instability from overclocking, where applicable. I know the ACEMD3 tasks are susceptible to a "particle coordinate is NaN" type of error from too much overclocking.

Of course, it's less likely if things are not overclocked, or only mildly overclocked. But I'm just raising the possibility.

Thanks for this thought; it could well be the case. For some time, this old GTX 980 Ti has no longer followed the settings for GPU clock and power target, in the old NVIDIA Inspector as well as in the newer Afterburner.
Hence, particularly with ATM tasks, I noticed it overclocking from the default 1152 MHz up to 1330 MHz. Not all the time, but many times.
I experimented and found that I can control the GPU clock by reducing the fan speed: setting the GPU temperature to a fixed value and ticking "prioritize temperature". So the clock now oscillates around 1,100 MHz most of the time.
I will see whether the ATM tasks will now fail again, or not.


kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 12,875,793
RAC: 0
Level
Pro
Scientific publications
wat
Message 60249 - Posted: 1 Apr 2023 | 6:29:10 UTC
Last modified: 1 Apr 2023 | 6:30:22 UTC

My ATM beta tasks crash.
http://www.gpugrid.net/result.php?resultid=33398437
Do you know why?

ZUSE
Avatar
Send message
Joined: 10 Jun 20
Posts: 7
Credit: 417,413,397
RAC: 90,725
Level
Gln
Scientific publications
wat
Message 60250 - Posted: 1 Apr 2023 | 7:00:18 UTC

Me too.
All ATM tasks!

Graphics card: Tesla P4
Ryzen 5600G
32 GB RAM
Windows 11
Under Linux too

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60251 - Posted: 1 Apr 2023 | 7:16:21 UTC - in response to Message 60249.

Something in your Windows configuration has a problem running cmd.exe and calling the run.bat file. Windows barfs on the 0x1 exit error.

Same as the other fellow running Windows.

No concrete smoking gun flaw shown.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60254 - Posted: 1 Apr 2023 | 13:04:13 UTC

Another possibility is that your system restarted after an update, or that you suspended it.
I have decided to let Intel, Microsoft, Dell and Acer update themselves whenever they want. It's not our fault if the WU crashes; the onus is on the project admins to make their WUs stable enough to fall back to the last checkpoint.
Our job is to keep our systems up to date and maintained, to run these WUs to the best of our abilities.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60256 - Posted: 1 Apr 2023 | 15:28:48 UTC

The ATM tasks are like the acemd3 tasks in that they can't be interrupted and moved without erroring out. But unlike the acemd3 tasks, which can at least be restarted on the same device, the ATM tasks can't be restarted or interrupted at all. They exit immediately if restarted.

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 91
Credit: 1,603,303,394
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 60257 - Posted: 2 Apr 2023 | 2:20:00 UTC - in response to Message 60256.

The ATM tasks are like the acemd3 tasks in that they can't be interrupted and moved without erroring out. But unlike the acemd3 tasks, which can at least be restarted on the same device, the ATM tasks can't be restarted or interrupted at all. They exit immediately if restarted.


I agree. I have lost quite a few hours on WUs that were going to complete, because I had to perform reboots and lost them. Is anyone addressing this issue yet?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60258 - Posted: 2 Apr 2023 | 2:39:38 UTC - in response to Message 60257.

I haven't heard or seen any comments from any of the devs. The acemd3 app hasn't been fixed in two years, and that is an internal application by Acellera.

There's no reason to expect any change in the newest sub-project apps.

Not unless some dev has a lot of time to dig into this type of bug.

And since almost all of the newer apps depend on external libraries, that falls to those external toolsets and devs outside of this project.

So it's probably not going to happen.

ZUSE
Avatar
Send message
Joined: 10 Jun 20
Posts: 7
Credit: 417,413,397
RAC: 90,725
Level
Gln
Scientific publications
wat
Message 60259 - Posted: 2 Apr 2023 | 8:26:04 UTC

It is just interesting that acemd3 runs through and ATM does not; the errors appear after a few minutes.

The system was neither restarted during the calculation, nor was there an update, so the problem lies elsewhere.

Exactly the same on Linux.

Windows and drivers are up to date.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60260 - Posted: 2 Apr 2023 | 8:31:36 UTC - in response to Message 60259.
Last modified: 2 Apr 2023 | 8:31:56 UTC

it is just interesting that acemd3 runs through and ATM does not. errors appear after a few minutes
the system was neither restarted during the calculation nor was there an update
so the problem lies elsewhere
Exactly the same on Linux.
Windows and drivers are up to date

If you're time-slicing with another GPU project, that will cause a fatal "computation error" when BOINC switches between them.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60261 - Posted: 2 Apr 2023 | 11:03:21 UTC

Task failed.
ImportError: DLL load failed while importing _openmm: The specified module could not be found.
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

Encountered error while generating package metadata.
__________________

Another one.

ImportError: DLL load failed while importing _openmm: The specified module could not be found.
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

Encountered error while generating package metadata.
_____________________

Third one.

ImportError: DLL load failed while importing _openmm: The specified module could not be found.
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

Encountered error while generating package metadata.
_________________________________

I have eleven failed tasks (proud of setting the record), all revolving around the same thing.

bluestang
Send message
Joined: 13 Apr 15
Posts: 10
Credit: 2,542,462,606
RAC: 0
Level
Phe
Scientific publications
wat
Message 60264 - Posted: 2 Apr 2023 | 17:55:46 UTC

Beta or not... how can a project send out tasks with this length of runtime and not have any checkpointing of some sort?

The amount of resources wasted because of this has to be mind-boggling.

STARBASEn
Avatar
Send message
Joined: 17 Feb 09
Posts: 91
Credit: 1,603,303,394
RAC: 0
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwat
Message 60266 - Posted: 2 Apr 2023 | 22:22:50 UTC - in response to Message 60264.

I concur with bluestang. Some of those likely-successful WUs I lost had 20+ hours wasted because of a necessary reboot. The project owners should make some sort of fix a priority.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60267 - Posted: 2 Apr 2023 | 23:46:57 UTC

Or just acknowledge that you aren't willing to accept the project's limitations, and move on to other GPU projects that fit your usage conditions.

I have no issue letting work run to completion, because I know that I must let all hosts run uninterrupted.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60268 - Posted: 3 Apr 2023 | 0:01:08 UTC - in response to Message 60267.

Or just acknowledge you aren't willing to accept the project limitations and move onto other gpu projects that fit your usage conditions.

I have no issue letting work run to completion because I know that I must let all hosts run uninterrupted.

________________

Do you know what the problem is? Quico has not understood what abouh did at the very start. I am pretty sure that, whatever it is, if he brings it to the thread he will find an answer. There are a lot of people on this thread willing to help to the best of their ability - experts - and one of them is you.
I have paused all Microsoft updates for five weeks, since Microsoft Update seems to be what triggers the rest of the updates, like Intel's. Just for these WUs.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60269 - Posted: 3 Apr 2023 | 3:08:20 UTC - in response to Message 60268.

But abouh's app is different from Quico's. They use different external tools. You can't apply to Quico's app the same fixes that abouh made.

The Python tasks use the PyTorch libraries, and Quico's use the AToM libraries.

Plus, the Python app is mostly a CPU app, while the ATM app is mostly a GPU app.

They work very differently. Expecting the app structure and design of the Python app to be directly applicable to the AToM app is naive and simplistic.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60270 - Posted: 3 Apr 2023 | 8:10:32 UTC - in response to Message 60269.

But abouh's app is different from Quico's. They use different external tools. You can't apply the same fixes that abouh did for Quico's app.

The Python tasks use the pytorch libraries and the Quico uses the AtoM libraries.

Plus the Python app is mostly a cpu app while the ATM app is mostly a gpu app.

They work very differently. Expecting that the app structure and design of the Python app is directly applicable to the AToM app is naive and simplistic.

___________________

I am not saying anything, but I agree with the sentiments of some.
Maybe some of us can play with the AToM libraries.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60271 - Posted: 3 Apr 2023 | 8:52:44 UTC - in response to Message 60270.

Having looked into the internal logging of Quico's tasks in some detail because of the progress %age problem, it's clear that it goes through the motions of writing a checkpoint normally - 70 times per task for the recent short runs, 341 per task for the very long ones. That's about once every five minutes on my machines, which would be perfectly acceptable to me.

I would judge the problem to be at the other end - restarting the task after an interruption. That's more complicated, from the programmer's point of view: not only does the state of the science program's data have to be restored from disk in the proper format, but all the wrapper's counters and timings have to be re-aligned and restarted as well.

By all means explore and learn about the tools and libraries used for these tasks, but I suspect you'll have to get down and dirty with the application's code as well. Let us know how you get on.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60273 - Posted: 4 Apr 2023 | 13:30:45 UTC

Impressive.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60274 - Posted: 4 Apr 2023 | 15:39:05 UTC

Does anyone have any idea why this task:
https://www.gpugrid.net/result.php?resultid=33405348 failed after 5 1/2 hours?

This time, there was no overclocking involved. So the reason must have been a different one :-(

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60275 - Posted: 4 Apr 2023 | 18:29:37 UTC - in response to Message 60274.

ValueError: Energy is NaN - IOW, not a number.

An impossible value got the task thrown out. A couple of possible reasons:

a misconfigured or "bad" task;

the GPU running overclocked or hot, causing math errors.

[AF>FAH-Addict.net]toTOW
Send message
Joined: 28 Oct 10
Posts: 9
Credit: 25,781,299
RAC: 0
Level
Val
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60276 - Posted: 4 Apr 2023 | 18:47:37 UTC
Last modified: 4 Apr 2023 | 18:54:28 UTC

All WUs seem to be failing the same way, with missing files:
https://www.gpugrid.net/result.php?resultid=33406732
https://www.gpugrid.net/result.php?resultid=33406795
https://www.gpugrid.net/result.php?resultid=33406795

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60277 - Posted: 4 Apr 2023 | 18:54:51 UTC - in response to Message 60276.

We see this frequently with misconfigured tasks. The researcher does a poor job of updating the task-generation template when configuring new tasks.

It wastes time and resources for everyone.

[CSF] Aleksey Belkov
Avatar
Send message
Joined: 26 Dec 13
Posts: 85
Credit: 1,215,531,270
RAC: 25,194
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60278 - Posted: 4 Apr 2023 | 21:34:20 UTC - in response to Message 60276.

seems to be failing the same way with missing files

Same here:
https://www.gpugrid.net/result.php?resultid=33406558
But this is the first failure among a dozen completed successfully.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60279 - Posted: 5 Apr 2023 | 6:43:22 UTC - in response to Message 60277.

Wastes time and resources for every one.

Well, as long as a task fails within a few minutes (I had a few such ones yesterday), I think it's not that bad.
But I had one, the day before yesterday, which failed after some 5-1/2 hours - which is not good :-(

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60281 - Posted: 5 Apr 2023 | 8:47:03 UTC

task 27451592
task 27451185
task 27451971
task 27451763
Completed and validated. No errors as yet.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60282 - Posted: 5 Apr 2023 | 18:46:01 UTC

Thought I'd run a quick test to see if there was any progress on the restart front. Waited until a task had just finished, and let a new one start and run to the first checkpoint: then paused it, and waited while another project borrowed the GPU temporarily.

The test task was p38_2m_2j_5-QUICO_ATM_OFF_STEPS-1-5-RND9265_1. On restart, it started again from zero progress, zero elapsed time, and ran up to the 0.200% point: then it crashed as before. I didn't have any time to rescue any logs from the restart - my BOINC client cleaned and reused the slot for something else before I could catch it.

The website report says it ran for about 40 seconds, and stderr.txt contains the lines

Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /hdd/boinc-client/slots/2/tmp/pip-req-build-368b4spp
fatal: unable to access '/home/conda/feedstock_root/build_artifacts/git_1679396317102/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/etc/gitconfig': Permission denied

That doesn't sound very hopeful. It's still a problem.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60283 - Posted: 5 Apr 2023 | 19:05:03 UTC - in response to Message 60281.

task 27451592
task 27451185
task 27451971
task 27451763
Completed and validated. No errors as yet.

_______________________________

task 27451763
task 27451117
task 27452961
Completed and validated. No errors as yet. I dare not even sneeze near them. All updates are off.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60284 - Posted: 5 Apr 2023 | 19:35:54 UTC

This "OFF" in the WU points towards Python. "AToM" also has something to do with Python.
I errored out on one of Abou's WU because my GPU was updated. Python? I do not know. I do not know how to dive under the bonnet but Google up "OFF Python" and "AToM Python", there is a relationship.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60285 - Posted: 5 Apr 2023 | 20:44:22 UTC - in response to Message 60284.

I think 'Python' is a programming language, and 'AToM' is a scientific program written in that language.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60286 - Posted: 6 Apr 2023 | 4:55:32 UTC - in response to Message 60279.

Well, as long as a tasks fails within a few minutes (I had a few such ones yesterday), I think it's not that bad.
But I had one, day before yesterday, which failed after some 5-1/2 hours - which is not good :-(

What I've noticed lately on my machines is that when ATM tasks fail, it's mostly after 60-90 seconds.
And stderr always says:

FileNotFoundError: [Errno 2] No such file or directory: 'thrombin_noH_2-1a-3b_0.xml'
23:18:10 (18772): C:/Windows/system32/cmd.exe exited; CPU time 18.421875

see here:
https://www.gpugrid.net/result.php?resultid=33409106

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60287 - Posted: 6 Apr 2023 | 7:46:00 UTC - in response to Message 60283.

task 27451592
task 27451185
task 27451971
task 27451763
Completed and validated. No errors as yet.

_______________________________

task 27451763
task 27451117
task 27452961
Completed and validated. No errors as yet. I dare not even sneeze near them. All updates are off.

task 27452387
task 27452312
task 27452961
task 27452969
completed and validated.
One task in error,
task 33410323

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60289 - Posted: 6 Apr 2023 | 10:44:12 UTC - in response to Message 60286.

Similar story. MCL1_49_35_4-QUICO_ATM_OFF_STEPS-0-5-RND5875_2 failed in 44 seconds with

FileNotFoundError: [Errno 2] No such file or directory: 'MCL1_pmx-49-35_0.xml'

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60290 - Posted: 6 Apr 2023 | 12:09:09 UTC

Has anyone noticed that the WUs with 'Bace' in their name show progress as 100% while the elapsed-time counter is still ticking? Task Manager shows the task is still busy computing. This goes on for hours on end, and one task spent up to 24 hours in this state.
If a task is doing this, it does not mean it has failed. Check Task Manager first, and let it complete. Currently this task is doing it:
task 33409877
I wish someone would put up a notice that this project is not for people who switch off their computers at night or for other reasons.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60291 - Posted: 6 Apr 2023 | 12:10:38 UTC - in response to Message 60289.

Similar story. MCL1_49_35_4-QUICO_ATM_OFF_STEPS-0-5-RND5875_2 failed in 44 seconds with

FileNotFoundError: [Errno 2] No such file or directory: 'MCL1_pmx-49-35_0.xml'

same here, about 1 hour ago:

FileNotFoundError: [Errno 2] No such file or directory: 'MCL1_pmx-30-40_0.xml'

Such errors, happening often enough, may point to some kind of sloppy task configuration?

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60292 - Posted: 6 Apr 2023 | 14:17:30 UTC - in response to Message 60291.
Last modified: 6 Apr 2023 | 14:18:30 UTC

Similar story. MCL1_49_35_4-QUICO_ATM_OFF_STEPS-0-5-RND5875_2 failed in 44 seconds with

FileNotFoundError: [Errno 2] No such file or directory: 'MCL1_pmx-49-35_0.xml'

same here, about 1 hour ago:

FileNotFoundError: [Errno 2] No such file or directory: 'MCL1_pmx-30-40_0.xml'

such errors, happening often enough, may show some kind of sloppy tasks configuration ?

__________________

Same here. The task with 'MCL1' in its name lasted only 18 seconds.
task 33411408

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60293 - Posted: 6 Apr 2023 | 14:49:23 UTC
Last modified: 6 Apr 2023 | 15:30:56 UTC

This WU with 'Jnk1' in its name lasted ten seconds.
task 33411216


Edit. Now I have a WU with 'thrombin' in its name. It reached 100% in 15 minutes but is still busy on the GPU and CPU.
task 33413038

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60294 - Posted: 6 Apr 2023 | 20:16:42 UTC

This 'MCL1' has been running steadily for the last hour. But it is showing progress as 100% while the elapsed clock is ticking. Task Manager shows it is busy.
task 33412833

[CSF] Aleksey Belkov
Avatar
Send message
Joined: 26 Dec 13
Posts: 85
Credit: 1,215,531,270
RAC: 25,194
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60295 - Posted: 6 Apr 2023 | 21:45:11 UTC - in response to Message 60294.
Last modified: 6 Apr 2023 | 21:46:02 UTC

But it is showing progress as 100%

This is "normal" with all ATM WUs.
Perhaps the devs will be able to fix it later.
So there is no need to be surprised by this fact in every post -_-

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60296 - Posted: 6 Apr 2023 | 22:52:16 UTC - in response to Message 60293.
Last modified: 6 Apr 2023 | 22:55:29 UTC

This WU with 'Jnk1' in it, lasted ten seconds.
task 33411216


Edit. Now I have a WU with 'thrombin' in its name. Reached 100% in 15 minutes but is still busy with the GPU and CPU.
task 33413038

_______________

Completed and validated.

No - for some reason, people are aborting WUs like this 'thrombin' one. We normally watch the progress report; instead, check Task Manager. If there is a heartbeat, let it run.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60297 - Posted: 6 Apr 2023 | 23:23:17 UTC - in response to Message 60294.

This 'MCL1' has been running steadily for the last hour. But it is showing progress as 100% while the elapsed clock is ticking. Task Manager shows it is busy.
task 33412833

_______________

Completed and validated.
Aurum?

Speedy
Send message
Joined: 19 Aug 07
Posts: 42
Credit: 28,391,082
RAC: 0
Level
Val
Scientific publications
watwatwatwatwatwatwat
Message 60301 - Posted: 11 Apr 2023 | 8:51:48 UTC

When tasks are available, how much CPU do they require? Does the CPU usage fluctuate like the other Python tasks?

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60302 - Posted: 11 Apr 2023 | 11:59:50 UTC - in response to Message 60301.

When tasks are available, how much CPU do they require? Does the CPU usage fluctuate like the other Python tasks?

One CPU is plenty for these tasks. They don't need a full GPU, so I run Einstein, Milkyway or OPNG alongside them. The problem is that if BOINC time-slices the GPU, the ATM WU will fail when it gets restarted.
The exception is if the switch happens during the final step (zipping up, maybe?) after several hours; then it still uploads and reports as valid.
The best way to ensure these ATM WUs succeed is not to run a different GPU project, to avoid having BOINC switch the GPU away and crash the WU when it restarts. Running 2 ATM WUs per GPU, or an ACEMD task alongside an ATM task, is OK since nothing gets switched away.
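For anyone who wants to try two ATM WUs per GPU, here is a minimal app_config.xml sketch (it goes in the gpugrid.net project directory; the app name 'ATMbeta' is an assumption on my part - check the <app_name> entries in your client_state.xml for the real one):

<app_config>
  <app>
    <name>ATMbeta</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

Reload it with Options -> Read config files in BOINC Manager, or restart the client.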

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60303 - Posted: 11 Apr 2023 | 12:08:11 UTC - in response to Message 60297.
Last modified: 11 Apr 2023 | 12:10:05 UTC

This 'MCL1' has been running steadily for the last hour. But it is showing progress as 100% while the elapsed clock is ticking. Task Manager shows it is busy.
task 33412833
_______________
Completed and validated.
Aurum?

Yes, the failed WU is on my Rig-11, which is having intermittent failures/reboots due to a MB/GPU issue of unknown origin. I've swapped GPUs several times and the problem stays with the Rig-11 MB, so it's not a bad GPU. If I leave the GPU idle, the CPU runs WUs fine. Einstein and Milkyway don't seem to cause the problem, but Asteroids, GG and maybe OPNG do, at random intervals. It might also be the time-slicing I described in my penultimate reply.
It's probably time to scrap the MB. Since most are designed for gamers, they stuff too much junk onto them and compromise their reliability.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60304 - Posted: 11 Apr 2023 | 15:23:17 UTC

Looks like all of today's WUs are failing:

FileNotFoundError: [Errno 2] No such file or directory: 'CDK2_new_2_edit-1oiy-1h1q_0.xml'
It dumbfounds me why they still have the limit set to 7 failures. If the tasks fail at the end, that's several days of compute time wasted. Aren't two failures enough?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60305 - Posted: 11 Apr 2023 | 15:27:48 UTC - in response to Message 60304.

I had two fail in this way, but the rest (20+ or so) are running fine. Certainly not "all" of them.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60306 - Posted: 11 Apr 2023 | 16:05:14 UTC

Strangely enough, about 2 hours ago one of my rigs downloaded 2 ATM tasks while Python tasks were running.
The ATM tasks failed after a minute.
I checked my settings - it clearly says:
ATM (beta): no

So how come ATM tasks are being downloaded?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60307 - Posted: 11 Apr 2023 | 16:20:02 UTC - in response to Message 60306.

I think the beta toggle in preferences is 'sticky' in the scheduler.

I've seen similar: I didn't get Python beta tasks until I set beta in my preferences; I then unset beta in preferences and still got beta Python tasks. Beta is now set again for ATM.

Probably only a detach and reattach will fix it.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60308 - Posted: 12 Apr 2023 | 2:08:15 UTC - in response to Message 60306.

Strange enough, about 2 hours ago one of my rigs downloaded 2 ATM tasks, while Python tasks were running.
The ATM tasks failed after a minute.
I checked my settings - it cleary says:
ATM (beta): no

So, how come that ATMs are being downloaded?

I think ATMbeta is controlled by
Run test applications?

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60309 - Posted: 12 Apr 2023 | 6:59:28 UTC - in response to Message 60308.

I think ATMbeta is controlled by
Run test applications?

Oh, this might explain it.
While I unchecked "ATM (beta)", I neglected to uncheck "Run test applications".

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60310 - Posted: 12 Apr 2023 | 15:43:22 UTC
Last modified: 12 Apr 2023 | 15:43:37 UTC

This WU errored out for me with a NaN at 913 seconds. I never overclock my GPUs, and I power-limited this 2080 Ti to 180 W since GPUs are notorious for wasting energy. This NaN error is due to the calculation boundaries being set wrong.
https://www.gpugrid.net/workunit.php?wuid=27468777
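(For reference, on Linux a power cap like that can be set with the standard NVIDIA tool - a generic example, not specific to any host here:

sudo nvidia-smi -i 0 -pl 180

where -i selects the GPU index and -pl sets the power limit in watts.)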

bibi
Send message
Joined: 4 May 17
Posts: 14
Credit: 8,623,824,643
RAC: 1,735,191
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60312 - Posted: 12 Apr 2023 | 15:56:19 UTC

Hello Quico,
I hope my interpretation is correct.
see https://boinc.berkeley.edu/trac/wiki/WrapperApp

If no task has a checkpoint_filename defined, then on resume the job starts over and breaks in python pip.
The task with the run script should define checkpoint_filename. The progress file is updated after each checkpoint, so maybe it is enough to specify the progress file as the checkpoint_filename. Resume should then work exactly the same as starting from a checkpoint.

Progress:
The formula should be changed as suggested by Richard Haselgrove in http://www.gpugrid.net/forum_thread.php?id=5379&nowrap=true#60206:
progress = float(isample - last_sample)/float(num_samples - last_sample)


Translated with www.DeepL.com/Translator (free version)
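To make the idea concrete, here is a minimal sketch of the relevant part of the wrapper's job.xml, following the WrapperApp wiki page above (the application and file names are assumptions taken from the logs in this thread, not the batch's actual configuration):

<job_desc>
    <task>
        <application>bin/bash</application>
        <command_line>run.sh</command_line>
        <checkpoint_filename>progress</checkpoint_filename>
    </task>
</job_desc>

That is the change suggested above: with checkpoint_filename pointing at a file the run script updates, a resume should then behave as if it were starting from a checkpoint rather than starting the whole job over.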

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60319 - Posted: 14 Apr 2023 | 13:21:34 UTC
Last modified: 14 Apr 2023 | 13:39:21 UTC

File "/var/lib/boinc-client/slots/34/lib/python3.9/site-packages/openmm/app/statedatareporter.py", line 365, in _checkForErrors
raise ValueError('Energy is NaN. For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan')
ValueError: Energy is NaN.
https://www.gpugrid.net/workunit.php?wuid=27469907

I watched a WU finish; it spent 6 minutes out of 313 at 100%.
No checkpointing.
Has it been confirmed that the calculation boundaries are correct and are not the cause of the NaN errors?

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60320 - Posted: 15 Apr 2023 | 8:41:17 UTC - in response to Message 60319.

I, too, had such an error, after the task had run for 7,885 seconds:

File "Z:\BOINC\slots\3\lib\site-packages\openmm\app\statedatareporter.py", line 365, in _checkForErrors
raise ValueError('Energy is NaN. For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan')
ValueError: Energy is NaN


https://www.gpugrid.net/result.php?resultid=33436488

no overclocking.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60321 - Posted: 15 Apr 2023 | 11:07:03 UTC - in response to Message 60319.

File "/var/lib/boinc-client/slots/34/lib/python3.9/site-packages/openmm/app/statedatareporter.py", line 365, in _checkForErrors
raise ValueError('Energy is NaN. For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan')
ValueError: Energy is NaN.
https://www.gpugrid.net/workunit.php?wuid=27469907

Watched a WU finish and it spent 6 minutes out of 313 minutes on 100%.
No checkpointing.
Has it been confirmed that the calculation boundaries are correct and not the cause of the NaN errors?


Wow! Six minutes is a significant improvement over the hours it was taking before. Just don't give it a kick and abort.

bibi
Send message
Joined: 4 May 17
Posts: 14
Credit: 8,623,824,643
RAC: 1,735,191
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60322 - Posted: 15 Apr 2023 | 11:19:45 UTC - in response to Message 60312.

Too bad - not so simple after all. I added the checkpoint tag to the job.xml in the project directory under Windows; after two samples I did a suspend/resume, and the job again started with the first task and died in python pip.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60333 - Posted: 16 Apr 2023 | 18:26:41 UTC - in response to Message 60320.

I, too, had such an error, after the task had run for 7,885 seconds:

File "Z:\BOINC\slots\3\lib\site-packages\openmm\app\statedatareporter.py", line 365, in _checkForErrors
raise ValueError('Energy is NaN. For more information, see https://github.com/openmm/openmm/wiki/Frequently-Asked-Questions#nan')
ValueError: Energy is NaN


https://www.gpugrid.net/result.php?resultid=33436488

no overclocking.


This time, the task errored out after 16,400 seconds :-(

https://www.gpugrid.net/result.php?resultid=33442242

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60335 - Posted: 17 Apr 2023 | 8:10:17 UTC

It feels like there are at least four categories of ATMbeta WUs running simultaneously.
None have checkpointing.
The top priority should be to make checkpointing work.
The shotgun approach squanders a lot of compute time.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60336 - Posted: 17 Apr 2023 | 11:15:57 UTC

My nation, like many others, has gone into default. The most expensive item is the supply of electricity, and they frequently switch off the grid without informing us.
David H. says the WUs are checkpointing. If they are checkpointing, then why are they not recovering? Well, recovering or not, I cannot do a thing about the electric grid. So best of luck to the WUs, and as it is Ramadan, I have nothing left in the upper chamber to argue with. Over and out.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60352 - Posted: 26 Apr 2023 | 16:16:15 UTC

Still no checkpointing.
Suspend then resume = crash.
Many WUs are failing due to subprocess errors.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60353 - Posted: 28 Apr 2023 | 5:30:09 UTC

If there is a storm and the electricity goes, the WU crashes. I know that BOINCers do not restart for months on end, but I have to do a restart - the WU crashes. If the GPU updates or the system updates, the WU crashes. If the cat plays with the keyboard, the WU crashes.
I do not want catty remarks, but I will keep crashing them from now on. Who cares.

[CSF] Aleksey Belkov
Avatar
Send message
Joined: 26 Dec 13
Posts: 85
Credit: 1,215,531,270
RAC: 25,194
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60354 - Posted: 30 Apr 2023 | 12:40:47 UTC - in response to Message 60353.
Last modified: 30 Apr 2023 | 12:42:32 UTC

Who cares.

No, it's not about who cares.

It's about which of the project's people have the knowledge and resources to implement the necessary functionality, and which of them have the time for it.
And as you should understand, they don't make those decisions on their own; it's not a hobby.
The necessary specialists may currently be involved in other, higher-priority projects for the institute, and neither we nor the employees themselves can influence that.
Deal with it.

The number of tearful posts about the problem won't change anything, no matter how much someone would like it to.
Unless, of course, the goal is once again just to let off steam out of indignation.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60357 - Posted: 3 May 2023 | 8:34:49 UTC
Last modified: 3 May 2023 | 9:32:57 UTC

Task TYK2_m44_m55_5_FIX-QUICO_ATM_Sage_xTB-0-5-RND2847_0 (today):

FileNotFoundError: [Errno 2] No such file or directory: 'TYK2_m44_m55_0.xml'


Later - CDK2_miu_m26_4-QUICO_ATM_Sage_xTB-0-5-RND8419_0 running OK.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60358 - Posted: 5 May 2023 | 7:14:35 UTC
Last modified: 5 May 2023 | 7:46:54 UTC

And a similar batch configuration error with today's BACE run, like

BACE_m24_m7e_5-QUICO_ATM_Sage_xTB-0-5-RND7993_0

08:05:32 (386384): wrapper: running bin/bash (run.sh)
bin/bash: run.sh: No such file or directory

(five so far)

Edit - I've now wasted 20 of the things, and switched to Python to avoid quota errors. I should have dropped in to give you a hand when passing through Barcelona at the weekend!

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60359 - Posted: 5 May 2023 | 8:04:09 UTC

I cannot resource-share ATMbeta with other projects, because it gets stopped so the other projects can run, and then ends up with an error.

[CSF] Aleksey Belkov
Avatar
Send message
Joined: 26 Dec 13
Posts: 85
Credit: 1,215,531,270
RAC: 25,194
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60360 - Posted: 5 May 2023 | 9:01:39 UTC - in response to Message 60358.

And a similar batch configuration error with today's BACE run, like

Same for Win apps:
https://www.gpugrid.net/result.php?resultid=33475629
https://www.gpugrid.net/results.php?userid=101590
Sad : /

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60361 - Posted: 5 May 2023 | 11:44:07 UTC - in response to Message 60359.

I cannot resource share ATMBeta with other projects because it is stopped to run other projects. Ends up with an error.


Set all other GPU projects to a resource share of 0; then they won't run at all when you have ATM work.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60362 - Posted: 5 May 2023 | 16:10:12 UTC

Many of the recent ATMs errored out after not even a minute; stderr says:

wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)
Der Befehl "run.bat" ist entweder falsch geschrieben oder
konnte nicht gefunden werden.

In English: the command "run.bat" is either misspelled or could not be found.

What's up?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60363 - Posted: 5 May 2023 | 16:16:35 UTC

Same equivalent type of error in Linux for a great many tasks.

bin/bash: run.sh: No such file or directory

BACE_m7g_m7c_3-QUICO_ATM_Sage_xTB-0-5-RND8127_3

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60364 - Posted: 5 May 2023 | 17:19:23 UTC

Got a collection of twenty-one errored tasks. Suspended work fetch on that computer. The other is busy with Abous WU.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60365 - Posted: 5 May 2023 | 19:07:51 UTC

Now these are doing it as well: MCL1_m28_m47_1_FIX-QUICO_ATM_Sage_xTB-0-5-RND0954_0

18:09:56 (394275): wrapper: running bin/bash (run.sh)
bin/bash: run.sh: No such file or directory

The experimenters and/or staff have got to get a grip on this - you are wasting everybody's time and electricity.

BOINC is very unforgiving: you have to get it 100% exact, all at the same time, every time. It's worth you taking a pause after each new batch is prepared, and then going back and proof-reading the configuration. Five minutes spent checking would probably have meant getting some real research results over the weekend: now, nothing will probably work until Monday (and I'm not holding my breath then, either).

bluestang
Send message
Joined: 13 Apr 15
Posts: 10
Credit: 2,542,462,606
RAC: 0
Level
Phe
Scientific publications
wat
Message 60366 - Posted: 5 May 2023 | 21:27:48 UTC - in response to Message 60365.
Last modified: 5 May 2023 | 21:28:11 UTC

Now these are doing it as well: MCL1_m28_m47_1_FIX-QUICO_ATM_Sage_xTB-0-5-RND0954_0

18:09:56 (394275): wrapper: running bin/bash (run.sh)
bin/bash: run.sh: No such file or directory

The experimenters and/or staff have got to get a grip on this - you are wasting everybody's time and electricity.

BOINC is very unforgiving: you have to get it 100% exact, all at the same time, every time. It's worth you taking a pause after each new batch is prepared, and then going back and proof-reading the configuration. Five minutes spent checking would probably have meant getting some real research results over the weekend: now, nothing will probably work until Monday (and I'm not holding my breath then, either).



Exactly!

When you have more tasks in Error (277) than Valid (240) ... that is pretty damn sad!

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60369 - Posted: 6 May 2023 | 7:05:05 UTC - in response to Message 60365.

The experimenters and/or staff have got to get a grip on this - you are wasting everybody's time and electricity.

BOINC is very unforgiving: you have to get it 100% exact, all at the same time, every time. It's worth you taking a pause after each new batch is prepared, and then going back and proof-reading the configuration. Five minutes spent checking would probably have meant getting some real research results over the weekend: now, nothing will probably work until Monday (and I'm not holding my breath then, either).

+ 1

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60370 - Posted: 6 May 2023 | 11:20:02 UTC - in response to Message 60364.
Last modified: 6 May 2023 | 11:23:43 UTC

Got a collection of twenty-one errored tasks. Suspended work fetch on that computer. The other is busy with Abous WU.

___________

The Abous WU finished and I got one ATMBeta. It lasted all of one minute and three seconds. I suspended work fetch on this computer as well.
Two ATMBeta validated, twenty-two errored.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60371 - Posted: 6 May 2023 | 12:21:56 UTC

Maybe someone can answer a question I have. After running ATMBeta, Einstein starts but reports that the GPU is missing. How does this happen?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60372 - Posted: 6 May 2023 | 12:58:47 UTC - in response to Message 60371.
Last modified: 6 May 2023 | 12:59:14 UTC

atmbeta likely has nothing to do with it.

but ATMbeta uses CUDA, Einstein uses OpenCL. does BOINC still report OpenCL support in the startup log? you might need to reinstall your drivers.
____________

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60373 - Posted: 6 May 2023 | 16:27:29 UTC - in response to Message 60372.

atmbeta likely has nothing to do with it.

but ATMbeta uses CUDA, Einstein uses OpenCL. does BOINC still report OpenCL support in the startup log? you might need to reinstall your drivers.

_____________________________

Thank you. I have just finished reinstalling Windows and now the drivers.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60375 - Posted: 8 May 2023 | 3:47:12 UTC

Clean Windows install and drivers install.
task 27488709
I am at my wit's end. Now what's wrong?

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60376 - Posted: 8 May 2023 | 10:34:26 UTC - in response to Message 60358.

And a similar batch configuration error with today's BACE run, like

BACE_m24_m7e_5-QUICO_ATM_Sage_xTB-0-5-RND7993_0

08:05:32 (386384): wrapper: running bin/bash (run.sh)
bin/bash: run.sh: No such file or directory

(five so far)

Edit - now wasted 20 of the things, and switched to Python to avoid quota errors. I should have dropped in to give you a hand when passing through Barcelona at the weekend!


Yes, big mess-up on my end. More painful since it happened to two of the sets with more runs. I just forgot to run the script that copies the run.sh and run.bat files to the batch folders. It happened to 2 of the 8 batches, but yeah, a big whoops. Apologies for that. The "fixed" runs should be sent soon, and the "missing *0.xml" errors should not happen anymore either.
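
For the curious, the missing step is nothing exotic - roughly a copy loop like the sketch below (placeholder paths, not our actual layout):

# hypothetical layout: one sub-folder per batch, run.sh/run.bat kept in a template dir
for batch in batches/*/ ; do
    cp templates/run.sh templates/run.bat "$batch"
done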

Regarding checkpointing, I at least cannot do much more than pass the message along, which I have done several times.

Again, sorry for this. I can understand it to be very annoying.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60378 - Posted: 8 May 2023 | 10:52:27 UTC - in response to Message 60376.

Thanks for reporting back.

The good news is that task BACE_m4m_m4n_3_FIX-QUICO_ATM_Sage_xTB-0-5-RND4596_0 is running as it should.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,098,952,024
RAC: 8,566,567
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60379 - Posted: 8 May 2023 | 20:00:41 UTC

Extrapolated execution times for several of my currently running "BACE_" and "MCL1_" WUs are pointing to runs longer than previous batches.
I hope this doesn't lead to result files bigger than the server can handle...

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60380 - Posted: 8 May 2023 | 20:11:32 UTC

Agreed. My first BACE of the current batch ran for 20 minutes per sample, compared with previous batches which ran at speeds as low as 5 minutes per sample. It's touch and go whether they will complete within 24 hours (GTX 1660 Ti/Super).

But in spite of the apparently slower speed, it finished and was accepted after 7 hours, as before.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60381 - Posted: 9 May 2023 | 3:44:44 UTC - in response to Message 60379.
Last modified: 9 May 2023 | 3:45:30 UTC

Extrapolated execution times for several of my currently running "BACE_" and "MCL1_" WUs are pointing to runs longer than previous batches.
I hope this doesn't lead to result files bigger than the server can handle...

I am afraid that just now I am confronted with such a case: the file has a size of 719 MB, and it does not upload, just backing off all the time :-(
WTF is this? Did it run for 15 hours on an RTX 3070 just for nothing?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60382 - Posted: 9 May 2023 | 7:05:23 UTC

I'm in the same unfortunate situation with too large an upload.

GPUGRID BACE_m7i_m7a_3_FIX-QUICO_ATM_Sage_xTB-0-5-RND9648_1_0 0.000 736531.62 K 00:00:22 - 01:53:11 0.00 Kbps Upload pending (Retry in: 01:14:27), retried: 6 Numbskull

Profile Stoneageman
Avatar
Send message
Joined: 25 May 09
Posts: 224
Credit: 34,057,224,498
RAC: 32
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60383 - Posted: 9 May 2023 | 8:47:07 UTC

I now have 15 BACE tasks backed up because the server is not accepting the file size.

Will this be fixed, or should all BACE tasks be aborted?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60384 - Posted: 9 May 2023 | 9:04:17 UTC - in response to Message 60383.

I'd hang on to them for a day or two - it can be fixed, if the right person pulls their finger out.

I have one heading that way, and I'll post the debug messages when it's cooked. At the moment, my suspicion is the Apache configuration, but we need proof.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60385 - Posted: 9 May 2023 | 11:23:44 UTC - in response to Message 60384.

I'd hang on to them for a day or two - it can be fixed, if the right person pulls their finger out.

Although I doubt that this will happen :-(

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60386 - Posted: 9 May 2023 | 11:58:44 UTC

in previous instances of this problem, you could abort the large upload and it would report fine, and you still got credit most of the time.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60387 - Posted: 9 May 2023 | 12:23:26 UTC - in response to Message 60386.

I'd like to think that all that bandwidth carries something of value to the researchers - that would be the main point of it.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60388 - Posted: 9 May 2023 | 12:37:21 UTC - in response to Message 60387.

i thought one of the researchers said they don't need this file.

but if you don't ever upload it they won't get anything and you won't get credit anyway. not really anything to lose.

they've never been able to raise this limit on their server.
____________

Profile Stoneageman
Avatar
Send message
Joined: 25 May 09
Posts: 224
Credit: 34,057,224,498
RAC: 32
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60389 - Posted: 9 May 2023 | 12:57:29 UTC

I just aborted such a completed task which then showed as ready to report. Reported it, but zero credit :-(

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60390 - Posted: 9 May 2023 | 13:02:56 UTC - in response to Message 60388.

i thought one of the researchers said they don't need this file.

Perhaps Quico could confirm that, since we seem to have his attention?

Profile Stoneageman
Avatar
Send message
Joined: 25 May 09
Posts: 224
Credit: 34,057,224,498
RAC: 32
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60391 - Posted: 9 May 2023 | 13:27:07 UTC

As some compensation for the BACE tasks, I'm seeing the MCL1 sage seasoned tasks reporting after just a minute running and yet getting full credit :-)

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60392 - Posted: 9 May 2023 | 14:21:24 UTC

OK, confirmed - it is still the Apache problem.

Tue 09 May 2023 15:06:15 BST | GPUGRID | [http] [ID#15383] Received header from server: HTTP/1.1 413 Request Entity Too Large
Tue 09 May 2023 15:06:15 BST | GPUGRID | [http] [ID#15383] Received header from server: Date: Tue, 09 May 2023 14:06:15 GMT
Tue 09 May 2023 15:06:15 BST | GPUGRID | [http] [ID#15383] Received header from server: Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips mod_auth_gssapi/1.5.1 mod_auth_kerb/5.4 mod_fcgid/2.3.9 PHP/5.4.16 mod_wsgi/3.4 Python/2.7.5

File (the larger of two) is 754.1 MB (Linux decimal), 719.15 MB (Boinc binary).

At this end, we have two choices:

1) Abort the data transfer, as Ian suggests.
2) Wait 90 days for somebody to find the key to the server closet.

Quico?

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60393 - Posted: 9 May 2023 | 14:41:54 UTC

is this problem only with "BACE..." tasks, or has anyone seen it with other types of task as well?

Profile Stoneageman
Avatar
Send message
Joined: 25 May 09
Posts: 224
Credit: 34,057,224,498
RAC: 32
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60395 - Posted: 9 May 2023 | 16:28:52 UTC

Just BACE tasks affected. I've now aborted 14 tasks and half were credited.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60396 - Posted: 9 May 2023 | 16:36:10 UTC - in response to Message 60395.
Last modified: 9 May 2023 | 16:45:18 UTC

Just BACE tasks affected. I've now aborted 14 tasks and half were credited.

really strange, isn't it? What's the criterion for granting credit or not granting credit ???

I now aborted two such BACE tasks which could not upload.
For one I got credit, for the other one it said "upload failure" - real junk :-((( 15 hours on a RTX3070 just for NOTHING :-(((

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60397 - Posted: 9 May 2023 | 17:38:03 UTC
Last modified: 9 May 2023 | 17:38:40 UTC

I just aborted my too large upload and got some credit for it. Missed the 50% bonus because of holding onto it for too long.

Wish I had known that aborting a task may give you credits anyway.

I know better now, since we are likely to keep running into this situation because they will never fix the Apache misconfiguration.

kksplace
Send message
Joined: 4 Mar 18
Posts: 53
Credit: 1,464,751,749
RAC: 3,464,677
Level
Met
Scientific publications
wat
Message 60398 - Posted: 9 May 2023 | 19:57:58 UTC

When I attempt to Abort two of these WUs, nothing seems to happen at all. Both tasks still show "Uploading" and the Transfers page still shows them at 0%. I have tried "Retry Now" on the Transfers page several times (each) to no avail. Should I instead "Abort Transfer"?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60399 - Posted: 9 May 2023 | 22:06:00 UTC - in response to Message 60398.

Yes, you want to go to the Transfers page, select the tasks that are in upload backoff and "Abort Transfer"
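
If you prefer the command line, boinccmd can do the same thing - list the stuck transfers first, then abort the oversized one. The file name below is just the example from my earlier post; use the exact name and project URL your client reports (and add --host/--passwd if your setup needs them):

# list the stuck transfers to get the exact file name
boinccmd --get_file_transfers
# abort the oversized upload (file name is only an example)
boinccmd --file_transfer https://www.gpugrid.net/ BACE_m7i_m7a_3_FIX-QUICO_ATM_Sage_xTB-0-5-RND9648_1_0 abort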

bluestang
Send message
Joined: 13 Apr 15
Posts: 10
Credit: 2,542,462,606
RAC: 0
Level
Phe
Scientific publications
wat
Message 60400 - Posted: 10 May 2023 | 3:01:27 UTC

Way to go GPUgrid. Sure are on a roll lately with the complete utter mess ups.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60401 - Posted: 10 May 2023 | 4:17:08 UTC
Last modified: 10 May 2023 | 4:17:42 UTC

during last night, two of my machines downloaded and started "Bace" tasks (first letter in upper case, the following ones in lower case),
in contrast to the failing ones before, which were called "BACE" (all upper case).

Did anyone get the "Bace" ones before, and did they succeed, or was their upload file also too large?

The question for me now is: should I abort them?

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 467
Credit: 8,358,971,966
RAC: 9,198,359
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60402 - Posted: 10 May 2023 | 6:23:06 UTC - in response to Message 60401.

during last night, two of my machines downloaded and started "Bace" tasks (first letter in upper case, the following ones in lower case),
in contrast to the failing ones before, which were called "BACE" (all upper case).

Did anyone get the "Bace" ones before, and did they succeed, or was their upload file also too large?

The question for me now is: should I abort them?



https://www.gpugrid.net/workunit.php?wuid=27494001

The Bace unit was successful. See the link above. If you run them, watch the elapsed time and progress rate, and you will know in a few minutes how they will go.



Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 467
Credit: 8,358,971,966
RAC: 9,198,359
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60403 - Posted: 10 May 2023 | 6:53:49 UTC - in response to Message 60402.

during last night, two of my machines downloaded and started "Bace" tasks (first letter in upper case, the following ones in lower case),
in contrast to the failing ones before, which were called "BACE" (all upper case).

Did anyone get the "Bace" ones before, and did they succeed, or was their upload file also too large?

The question for me now is: should I abort them?



https://www.gpugrid.net/workunit.php?wuid=27494001

The Bace unit was successful. See the link above. If you run them, watch the elapsed time and progress rate, and you will know in a few minutes how they will go.





See examples:

This one is running well and should finish ok in a few hours. I am running two units simultaneously:

https://www.gpugrid.net/workunit.php?wuid=27494007


This one was running too long and I aborted it:

https://www.gpugrid.net/workunit.php?wuid=27492188


This one was probably good and I shouldn't have aborted it:

https://www.gpugrid.net/workunit.php?wuid=27494031


In the BOINC manager, highlight the unit and click the Properties button on the left, and its progress rate will tell you whether it's good or not.
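
A command-line equivalent, if you don't have the manager open - the output includes a "fraction done" line per task (this is just generic boinccmd, nothing ATM-specific):

# show all tasks with their fraction done and elapsed time
boinccmd --get_tasks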


Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60404 - Posted: 10 May 2023 | 7:53:02 UTC - in response to Message 60392.
Last modified: 10 May 2023 | 8:13:36 UTC

OK, confirmed - it is still the Apache problem.

Tue 09 May 2023 15:06:15 BST | GPUGRID | [http] [ID#15383] Received header from server: HTTP/1.1 413 Request Entity Too Large
Tue 09 May 2023 15:06:15 BST | GPUGRID | [http] [ID#15383] Received header from server: Date: Tue, 09 May 2023 14:06:15 GMT
Tue 09 May 2023 15:06:15 BST | GPUGRID | [http] [ID#15383] Received header from server: Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips mod_auth_gssapi/1.5.1 mod_auth_kerb/5.4 mod_fcgid/2.3.9 PHP/5.4.16 mod_wsgi/3.4 Python/2.7.5

File (the larger of two) is 754.1 MB (Linux decimal), 719.15 MB (Boinc binary).

At this end, we have two choices:

1) Abort the data transfer, as Ian suggests.
2) Wait 90 days for somebody to find the key to the server closet.

Quico?


That's weird. I'll take a look.
But this shouldn't happen, so cancel the BACE (uppercase) runs. I'll look into how to do it from here.
With the last implementation there shouldn't be any such file-size issues.

All bad/buggy jobs should be cancelled by now.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60405 - Posted: 10 May 2023 | 8:01:15 UTC - in response to Message 60404.

This seems like the clearest web advice:

https://www.keycdn.com/support/413-request-entity-too-large
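
For whoever has shell access on the server: on a stock CentOS/Apache layout like the one shown in the response headers, something along these lines should reveal whether the limit comes from LimitRequestBody or from mod_fcgid's FcgidMaxRequestLen (the paths and the actual culprit are only a guess from this end):

# search the Apache config for an explicit body-size limit (CentOS layout assumed)
grep -RinE 'LimitRequestBody|FcgidMaxRequestLen' /etc/httpd/
# after raising the offending value, reload Apache
sudo systemctl reload httpd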

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60406 - Posted: 10 May 2023 | 12:26:08 UTC - in response to Message 60404.

OK, confirmed - it is still the Apache problem.

Tue 09 May 2023 15:06:15 BST | GPUGRID | [http] [ID#15383] Received header from server: HTTP/1.1 413 Request Entity Too Large
Tue 09 May 2023 15:06:15 BST | GPUGRID | [http] [ID#15383] Received header from server: Date: Tue, 09 May 2023 14:06:15 GMT
Tue 09 May 2023 15:06:15 BST | GPUGRID | [http] [ID#15383] Received header from server: Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips mod_auth_gssapi/1.5.1 mod_auth_kerb/5.4 mod_fcgid/2.3.9 PHP/5.4.16 mod_wsgi/3.4 Python/2.7.5

File (the larger of two) is 754.1 MB (Linux decimal), 719.15 MB (Boinc binary).

At this end, we have two choices:

1) Abort the data transfer, as Ian suggests.
2) Wait 90 days for somebody to find the key to the server closet.

Quico?


That's weird. I'll take a look.
But this shouldn't happen, so cancel the BACE (uppercase) runs. I'll look into how to do it from here.
With the last implementation there shouldn't be any such file-size issues.

All bad/buggy jobs should be cancelled by now.


this is not something new from this project. this has been a recurring issue from time to time. seems to pop up about every year or so whenever the result files get so large for one reason or another. so don't feel bad if you are unable to find the setting to fix the file size limit. no one else from the project has been able to for the last several years.

why are the result files so large? 500+MB. that's the root cause of the issue. do you need the data in these files? if not, why are they being created?

____________

bibi
Send message
Joined: 4 May 17
Posts: 14
Credit: 8,623,824,643
RAC: 1,735,191
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60407 - Posted: 10 May 2023 | 13:09:05 UTC - in response to Message 60406.

These files hold the results from the last run, i.e. samples 1 to 70, needed to start the next run with samples 71 to 140. They are the checkpoint data.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60408 - Posted: 10 May 2023 | 14:46:26 UTC - in response to Message 60406.

OK, confirmed - it is still the Apache problem.

Tue 09 May 2023 15:06:15 BST | GPUGRID | [http] [ID#15383] Received header from server: HTTP/1.1 413 Request Entity Too Large
Tue 09 May 2023 15:06:15 BST | GPUGRID | [http] [ID#15383] Received header from server: Date: Tue, 09 May 2023 14:06:15 GMT
Tue 09 May 2023 15:06:15 BST | GPUGRID | [http] [ID#15383] Received header from server: Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips mod_auth_gssapi/1.5.1 mod_auth_kerb/5.4 mod_fcgid/2.3.9 PHP/5.4.16 mod_wsgi/3.4 Python/2.7.5

File (the larger of two) is 754.1 MB (Linux decimal), 719.15 MB (Boinc binary).

At this end, we have two choices:

1) Abort the data transfer, as Ian suggests.
2) Wait 90 days for somebody to find the key to the server closet.

Quico?


That's weird. I'll take a look.
But this shouldn't happen, so cancel the BACE (uppercase) runs. I'll look into how to do it from here.
With the last implementation there shouldn't be any such file-size issues.

All bad/buggy jobs should be cancelled by now.


this is not something new from this project. this has been a recurring issue from time to time. seems to pop up about every year or so whenever the result files get so large for one reason or another. so don't feel bad if you are unable to find the setting to fix the file size limit. no one else from the project has been able to for the last several years.

why are the result files so large? 500+MB. that's the root cause of the issue. do you need the data in these files? if not, why are they being created?


The heavy files are the .dcd trajectories, which technically I don't really need in order to perform the final free energy calculation, but they are useful in case something weird happens and we want to revisit those frames. .dcd files contain the coordinates of all the system atoms, uncompressed. Since there are other trajectory formats, such as .xtc, that compress this data and result in much smaller files, we asked for that format to be implemented in OpenMM. As far as I know this has been done in our lab, but it needs final approval from the "higher-ups" to get it running, and then ATM has to be modified to write .xtc trajectory files.
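
As a rough illustration of the size difference (not part of the app - just MDTraj's mdconvert tool run by hand, with placeholder file names), repacking a DCD as XTC looks like this:

# repack an uncompressed DCD trajectory as compressed XTC (file names are placeholders)
mdconvert MCL1_m51_m45_traj.dcd -t MCL1_m51_m45.prmtop -o MCL1_m51_m45_traj.xtc
# compare the two sizes
ls -lh MCL1_m51_m45_traj.dcd MCL1_m51_m45_traj.xtc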

Nevertheless, this shouldn't have happened (it ran OK in other BACE instances), and I apologise for it.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60409 - Posted: 11 May 2023 | 9:40:00 UTC - in response to Message 60408.

I resolved my issue by spoofing client_state.xml - I said the over-size file had completed uploading, and that the task was ready to report. The server accepted it as valid.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60410 - Posted: 11 May 2023 | 12:42:47 UTC

"ValueError: Energy is NaN" is back quite often :-(

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60411 - Posted: 12 May 2023 | 2:36:43 UTC

Looks like the Progress bar has stopped working again, all quickly pegged at 100%.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60412 - Posted: 12 May 2023 | 6:39:22 UTC - in response to Message 60411.

I noticed that too. It was working for a while; then with today's work it's back to 100% almost immediately.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60413 - Posted: 12 May 2023 | 7:19:53 UTC

Remind yourselves of my explanation at message 60315.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60414 - Posted: 12 May 2023 | 11:50:58 UTC
Last modified: 12 May 2023 | 11:52:13 UTC

I just noticed that on one of my machines 2 "BACE" tasks are being processed, plus a third one is waiting.

Will the same problem happen as before - upload file too large? So should I rather abort these 3 BACE tasks?

Edit: a BACE is also running on another machine right now

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60415 - Posted: 12 May 2023 | 14:16:06 UTC
Last modified: 12 May 2023 | 14:16:46 UTC

Why? I am uploading.
task 33495789

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60416 - Posted: 12 May 2023 | 14:45:18 UTC - in response to Message 60413.

Remind yourselves of my explanation at message 60315.

One would think that, having figured out how to get it working nearly normally, they'd maintain that level of proficiency instead of reverting back to the beginning.
Writing a BKM (Best Known Method) and checking the boxes when creating new work might prevent having to rediscover everything everywhere all at once.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60417 - Posted: 12 May 2023 | 15:30:50 UTC - in response to Message 60416.

The difference between 0-5 and n-5 has been consistent throughout - there hasn't been a "fix and revert". Just new data runs starting from 0 again.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60420 - Posted: 12 May 2023 | 23:04:15 UTC - in response to Message 60417.

+1
Thanks for reminding us all how it works.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60421 - Posted: 13 May 2023 | 4:07:48 UTC - in response to Message 60404.

Quico said on May 10th:

With the last implementation there shouldn't be any such file-size issues.
All bad/buggy jobs should be cancelled by now.

Last night, another BACE upload got stuck because of a 719 MB file size :-(
How come?

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60422 - Posted: 13 May 2023 | 10:58:01 UTC

The number of failures due to Computational Errors is skyrocketing and shamefully they still require 7 donors to fail before recognizing it.

Aurum
Avatar
Send message
Joined: 12 Jul 17
Posts: 399
Credit: 13,024,117,632
RAC: 112,475
Level
Trp
Scientific publications
watwatwat
Message 60423 - Posted: 13 May 2023 | 11:00:45 UTC - in response to Message 60417.

The difference between 0-5 and n-5 has been consistent throughout - there hasn't been a "fix and revert". Just new data runs starting from 0 again.


So Progress bars jumping to 100% rendering them useless is proper behavior?

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60424 - Posted: 13 May 2023 | 11:05:19 UTC - in response to Message 60423.

So Progress bars jumping to 100% rendering them useless is proper behavior?

No, it's "unfixed" behaviour, hopefully on the 'To do' list.

[FVG] pima1965
Send message
Joined: 20 Jan 10
Posts: 4
Credit: 2,706,015,364
RAC: 495,820
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60425 - Posted: 13 May 2023 | 14:52:04 UTC

Good evening. On only one of my PCs (Windows 11, i7-13700KF and RTX 2080 Ti), none of the GPUGRID ATMbeta tasks (CUDA 1121) can be processed. By now more than a hundred have ended after a few tens of seconds. Other tasks (for example those based on CUDA 1131) are processed on this PC without any problems. I have no idea what could be causing it, so I do not know how to fix it. Thanks in advance to anyone who can help me solve the problem.

Output su Stderr
<core_client_version>7.20.2</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
04:36:16 (31676): wrapper (7.9.26016): starting
04:36:16 (31676): wrapper: running python.exe (bin/conda-unpack)
04:36:17 (31676): python.exe exited; CPU time 0.000000
04:36:17 (31676): wrapper: running Library/usr/bin/tar.exe (xjvf input.tar.bz2)
MCL1_m51_m45_0.xml
MCL1_m51_m45_asyncre.cntl
MCL1_m51_m45.inpcrd
MCL1_m51_m45.prmtop
run.bat
run.sh
04:36:18 (31676): Library/usr/bin/tar.exe exited; CPU time 0.000000
04:36:18 (31676): wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)
04:36:20 (31676): C:/Windows/system32/cmd.exe exited; CPU time 0.015625
04:36:20 (31676): app exit status: 0x1
04:36:20 (31676): called boinc_finish(195)
0 bytes in 0 Free Blocks.
530 bytes in 4 Normal Blocks.
1144 bytes in 1 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 0 bytes.
Total allocations: 481994 bytes.
Dumping objects ->
{3078527} normal block at 0x00000221DD3AE4C0, 64 bytes long.
Data: <PATH=C:\ProgramD> 50 41 54 48 3D 43 3A 5C 50 72 6F 67 72 61 6D 44
{3078506} normal block at 0x00000221DD2D1060, 241 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
..\api\boinc_api.cpp(309) : {3078503} normal block at 0x00000221DB70B460, 8 bytes long.
Data: < &#221;! > 00 00 1A DD 21 02 00 00
{3077864} normal block at 0x00000221DD2D11F0, 241 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
{3077239} normal block at 0x00000221DB70BE10, 8 bytes long.
Data: <pk=&#221;! > 70 6B 3D DD 21 02 00 00
..\zip\boinc_zip.cpp(122) : {281} normal block at 0x00000221DB70D7F0, 260 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{266} normal block at 0x00000221DB713020, 16 bytes long.
Data: <87q&#219;! > 38 37 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{265} normal block at 0x00000221DB7128A0, 16 bytes long.
Data: < 7q&#219;! > 10 37 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{264} normal block at 0x00000221DB712490, 16 bytes long.
Data: <&#232;6q&#219;! > E8 36 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{263} normal block at 0x00000221DB712850, 16 bytes long.
Data: <&#192;6q&#219;! > C0 36 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{262} normal block at 0x00000221DB7122B0, 16 bytes long.
Data: < 6q&#219;! > 98 36 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{261} normal block at 0x00000221DB712E40, 16 bytes long.
Data: <p6q&#219;! > 70 36 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{260} normal block at 0x00000221DB70C780, 32 bytes long.
Data: <CUDA_DEVICE=0 PU> 43 55 44 41 5F 44 45 56 49 43 45 3D 30 00 50 55
{259} normal block at 0x00000221DB712A80, 16 bytes long.
Data: <p q&#219;! > 70 10 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{258} normal block at 0x00000221DB711070, 40 bytes long.
Data: < *q&#219;! &#199;p&#219;! > 80 2A 71 DB 21 02 00 00 80 C7 70 DB 21 02 00 00
{257} normal block at 0x00000221DB712350, 16 bytes long.
Data: <P6q&#219;! > 50 36 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{256} normal block at 0x00000221DB712300, 16 bytes long.
Data: <(6q&#219;! > 28 36 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{255} normal block at 0x00000221DB70CCC0, 32 bytes long.
Data: <C:/Windows/syste> 43 3A 2F 57 69 6E 64 6F 77 73 2F 73 79 73 74 65
{254} normal block at 0x00000221DB712CB0, 16 bytes long.
Data: < 6q&#219;! > 00 36 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{253} normal block at 0x00000221DB70C060, 32 bytes long.
Data: <xjvf input.tar.b> 78 6A 76 66 20 69 6E 70 75 74 2E 74 61 72 2E 62
{252} normal block at 0x00000221DB712800, 16 bytes long.
Data: <H5q&#219;! > 48 35 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{251} normal block at 0x00000221DB712670, 16 bytes long.
Data: < 5q&#219;! > 20 35 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{250} normal block at 0x00000221DB712C10, 16 bytes long.
Data: <&#248;4q&#219;! > F8 34 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{249} normal block at 0x00000221DB713160, 16 bytes long.
Data: <&#208;4q&#219;! > D0 34 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{248} normal block at 0x00000221DB712F80, 16 bytes long.
Data: <&#168;4q&#219;! > A8 34 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{247} normal block at 0x00000221DB712620, 16 bytes long.
Data: < 4q&#219;! > 80 34 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{245} normal block at 0x00000221DB712F30, 16 bytes long.
Data: <0 q&#219;! > 30 12 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{244} normal block at 0x00000221DB711230, 40 bytes long.
Data: <0/q&#219;! &#192;&#228;:&#221;! > 30 2F 71 DB 21 02 00 00 C0 E4 3A DD 21 02 00 00
{243} normal block at 0x00000221DB712EE0, 16 bytes long.
Data: <`4q&#219;! > 60 34 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{242} normal block at 0x00000221DB712530, 16 bytes long.
Data: <84q&#219;! > 38 34 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{241} normal block at 0x00000221DB70CD80, 32 bytes long.
Data: <Library/usr/bin/> 4C 69 62 72 61 72 79 2F 75 73 72 2F 62 69 6E 2F
{240} normal block at 0x00000221DB712AD0, 16 bytes long.
Data: < 4q&#219;! > 10 34 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{239} normal block at 0x00000221DB70C8A0, 32 bytes long.
Data: <bin/conda-unpack> 62 69 6E 2F 63 6F 6E 64 61 2D 75 6E 70 61 63 6B
{238} normal block at 0x00000221DB712260, 16 bytes long.
Data: <X3q&#219;! > 58 33 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{237} normal block at 0x00000221DB7124E0, 16 bytes long.
Data: <03q&#219;! > 30 33 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{236} normal block at 0x00000221DB7125D0, 16 bytes long.
Data: < 3q&#219;! > 08 33 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{235} normal block at 0x00000221DB712E90, 16 bytes long.
Data: <&#224;2q&#219;! > E0 32 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{234} normal block at 0x00000221DB7127B0, 16 bytes long.
Data: <&#184;2q&#219;! > B8 32 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{233} normal block at 0x00000221DB7123F0, 16 bytes long.
Data: < 2q&#219;! > 90 32 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{232} normal block at 0x00000221DB713110, 16 bytes long.
Data: <p2q&#219;! > 70 32 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{231} normal block at 0x00000221DB712FD0, 16 bytes long.
Data: <H2q&#219;! > 48 32 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{230} normal block at 0x00000221DB7123A0, 16 bytes long.
Data: < 2q&#219;! > 20 32 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{229} normal block at 0x00000221DB713220, 1488 bytes long.
Data: <&#160;#q&#219;! python.e> A0 23 71 DB 21 02 00 00 70 79 74 68 6F 6E 2E 65
{93} normal block at 0x00000221DB70CC60, 32 bytes long.
Data: <windows_x86_64__> 77 69 6E 64 6F 77 73 5F 78 38 36 5F 36 34 5F 5F
{92} normal block at 0x00000221DB70BCD0, 16 bytes long.
Data: < q&#219;! > 00 10 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{91} normal block at 0x00000221DB711000, 40 bytes long.
Data: <&#208;&#188;p&#219;! `&#204;p&#219;! > D0 BC 70 DB 21 02 00 00 60 CC 70 DB 21 02 00 00
{70} normal block at 0x00000221DB70BEB0, 16 bytes long.
Data: < &#234;&#249;&#134;&#246; > 80 EA F9 86 F6 7F 00 00 00 00 00 00 00 00 00 00
{69} normal block at 0x00000221DB70B0A0, 16 bytes long.
Data: <@&#233;&#249;&#134;&#246; > 40 E9 F9 86 F6 7F 00 00 00 00 00 00 00 00 00 00
{68} normal block at 0x00000221DB70BC80, 16 bytes long.
Data: <&#248;W&#246;&#134;&#246; > F8 57 F6 86 F6 7F 00 00 00 00 00 00 00 00 00 00
{67} normal block at 0x00000221DB70B8C0, 16 bytes long.
Data: <&#216;W&#246;&#134;&#246; > D8 57 F6 86 F6 7F 00 00 00 00 00 00 00 00 00 00
{66} normal block at 0x00000221DB70BDC0, 16 bytes long.
Data: <P &#246;&#134;&#246; > 50 04 F6 86 F6 7F 00 00 00 00 00 00 00 00 00 00
{65} normal block at 0x00000221DB70BBE0, 16 bytes long.
Data: <0 &#246;&#134;&#246; > 30 04 F6 86 F6 7F 00 00 00 00 00 00 00 00 00 00
{64} normal block at 0x00000221DB70B6E0, 16 bytes long.
Data: <&#224; &#246;&#134;&#246; > E0 02 F6 86 F6 7F 00 00 00 00 00 00 00 00 00 00
{63} normal block at 0x00000221DB70B640, 16 bytes long.
Data: < &#246;&#134;&#246; > 10 04 F6 86 F6 7F 00 00 00 00 00 00 00 00 00 00
{62} normal block at 0x00000221DB70B5F0, 16 bytes long.
Data: <p &#246;&#134;&#246; > 70 04 F6 86 F6 7F 00 00 00 00 00 00 00 00 00 00
{61} normal block at 0x00000221DB70B870, 16 bytes long.
Data: < &#192;&#244;&#134;&#246; > 18 C0 F4 86 F6 7F 00 00 00 00 00 00 00 00 00 00
Object dump complete.
</stderr_txt>
]]>

____________

wujj123456
Send message
Joined: 9 Jun 10
Posts: 16
Credit: 1,787,119,073
RAC: 3,100,787
Level
His
Scientific publications
watwatwatwat
Message 60426 - Posted: 14 May 2023 | 0:39:45 UTC - in response to Message 60192.

i'm a bit surprised right now, i looked at the resend, it was successfully completed in just over 2 minutes, how come? the computer has more WUs that were successfully completed in such a short time. Am I doing something wrong?

Did you figure out why? I couldn't find a reply to this in the thread. I just started recently and half of my WUs are like that, while the others look normal (other than the progress bar). Are these short ones legitimate results?

https://www.gpugrid.net/result.php?resultid=33503009
https://www.gpugrid.net/result.php?resultid=33503008
https://www.gpugrid.net/result.php?resultid=33502957
https://www.gpugrid.net/result.php?resultid=33505285
____________

wujj123456
Send message
Joined: 9 Jun 10
Posts: 16
Credit: 1,787,119,073
RAC: 3,100,787
Level
His
Scientific publications
watwatwatwat
Message 60427 - Posted: 14 May 2023 | 0:42:43 UTC - in response to Message 60425.
Last modified: 14 May 2023 | 0:44:09 UTC

Is your computer connected to the Internet? Can you open https://github.com/raimis/AToM-OpenMM.git with your browser? I checked a few results and the next few lines are usually fetching from the git repository, like this:

08:08:20 (12088): Library/usr/bin/tar.exe exited; CPU time 0.000000
08:08:20 (12088): wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)
Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git 'C:\ProgramData\BOINC\slots\13\tmp\pip-req-build-vp0jsx13'
Running command git rev-parse -q --verify 'sha^d7931b9a6217232d481731f7589d64b100a514ac'
Running command git fetch -q https://github.com/raimis/AToM-OpenMM.git d7931b9a6217232d481731f7589d64b100a514ac
Running command git checkout -q d7931b9a6217232d481731f7589d64b100a514ac

____________

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60428 - Posted: 14 May 2023 | 5:31:04 UTC - in response to Message 60426.
Last modified: 14 May 2023 | 5:48:06 UTC

i'm a bit surprised right now, i looked at the resend, it was successfully completed in just over 2 minutes, how come? the computer has more WUs that were successfully completed in such a short time. Am I doing something wrong?

Did you figure out why? I couldn't find reply to this in the thread. I just started recently and half of my WUs are like that, while the others looks normal (other than the progress bar). Are these short ones legitimate results?

https://www.gpugrid.net/result.php?resultid=33503009
https://www.gpugrid.net/result.php?resultid=33503008
https://www.gpugrid.net/result.php?resultid=33502957
https://www.gpugrid.net/result.php?resultid=33505285

__________________

I have been having a sneaking suspicion about these two-minute affairs. Most of the job/steps are done on some other computer, then it errors. It restarts on another machine, but from where it errored out. So the bulk of the job gets done on one computer, which gets no credit, and it completes on another in two minutes with all the credit. Nice, na? :)
________

For example, we had a power failure this morning and I suspended two tasks to put the laptops to sleep. Both ended with an error. The bulk of the job was done, but someone else will complete it in two minutes. If another machine can do these acrobatics from the last good checkpoint, then why is it not doing so on the original? As to the fairness of the affair, you decide.
These tasks still don't survive suspending or restarting.

[FVG] pima1965
Send message
Joined: 20 Jan 10
Posts: 4
Credit: 2,706,015,364
RAC: 495,820
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60429 - Posted: 14 May 2023 | 14:13:34 UTC - in response to Message 60427.

Hello and thanks for your reply. Yes, my PC is always connected to the internet, and I can correctly open https://github.com/raimis/AToM-OpenMM.git. Can you tell me what I have to do to solve my problem? Thanks again and best regards.

wujj123456
Send message
Joined: 9 Jun 10
Posts: 16
Credit: 1,787,119,073
RAC: 3,100,787
Level
His
Scientific publications
watwatwatwat
Message 60430 - Posted: 14 May 2023 | 16:14:11 UTC - in response to Message 60428.
Last modified: 14 May 2023 | 16:46:39 UTC

I have been having a sneaky suspicion about these two-minute affairs. Most of the job/ steps are done on some other computer then it errors. It restarts on another machine but from where it errored out. So, the bulk of the job gets done on one computer which gets no credit and completes on another in two minutes with all the credit. Nice na?

I haven't heard of any BOINC project carrying over results between hosts. If some hosts fail, the others always start afresh. In addition, none of the WUs listed above had results from any other hosts.

I suspect these are actually failures that were somehow marked as success, but I can't confirm either way from the output. Credit is one thing, but these WUs also have a quorum of 1, meaning this is taken as the final result. If that's bogus, the project likely wants to fix the bug, find these bogus results, and rerun them somehow.
____________

wujj123456
Send message
Joined: 9 Jun 10
Posts: 16
Credit: 1,787,119,073
RAC: 3,100,787
Level
His
Scientific publications
watwatwatwat
Message 60431 - Posted: 14 May 2023 | 16:45:51 UTC - in response to Message 60429.
Last modified: 14 May 2023 | 16:55:39 UTC

Hello and thanks for your reply. Yes, my PC is always connected to the internet, and I can correctly open https://github.com/raimis/AToM-OpenMM.git. Can you tell me what I have to do to solve my problem? Thanks again and best regards.

04:36:18 (31676): wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)
04:36:20 (31676): C:/Windows/system32/cmd.exe exited; CPU time 0.015625
04:36:20 (31676): app exit status: 0x1

Hmm, then I don't see anything else that could obviously go wrong. Your WUs basically failed at run.bat. Example extracted from the slot running the task on my host: https://pastebin.com/4nqK0egx.

This script seems to be independent enough that you can try running on its own. Try this.
1) Get to your GPUGrid project folder inside BOINC data folder (default is %programdata%\BOINC\projects\www.gpugrid.net\)
2) Copy that windows_x86_64__cuda1121.zip.35a24fdec33997d4c4468c32b53b139c to a temporary folder and unzip it. 7-zip should be able to unzip it directly but at worst you rename it to .zip and then unzip it.
3) Copy the run.bat from the paste link into the same folder, replace all `@echo` with `echo`, and sprinkle `timeout 5` everywhere. This will pause after each line and give you a chance to see the output. You might also want to change those "exit XX" lines to "echo something" so you see the failure instead of the shell exiting immediately. Run the script from your temporary folder. (This is important: the script refers to %CD%, so it expects to run in the folder where all the unzipped files and run.bat reside.) This should tell you which step failed first; how to fix that depends on the failure. I expect you will hit the failure before the `@echo Run AToM` line.

PS: I am not very familiar with Windows, so there must be better ways to debug a batch file instead of 3).
PS2: If you haven't already, reset the project first just to rule out the chance of some corrupted file.
____________

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60432 - Posted: 14 May 2023 | 19:15:28 UTC - in response to Message 60431.

Hello and thanks for your reply. Yes, my PC is always connected to the internet, and I can correctly open https://github.com/raimis/AToM-OpenMM.git. Can you tell me what I have to do to solve my problem? Thanks again and best regards.

04:36:18 (31676): wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)
04:36:20 (31676): C:/Windows/system32/cmd.exe exited; CPU time 0.015625
04:36:20 (31676): app exit status: 0x1

Hmm, then I don't see anything else that could obviously go wrong. Your WUs basically failed at run.bat. Example extracted from the slot running the task on my host: https://pastebin.com/4nqK0egx.

This script seems to be independent enough that you can try running on its own. Try this.
1) Get to your GPUGrid project folder inside BOINC data folder (default is %programdata%\BOINC\projects\www.gpugrid.net\)
2) Copy that windows_x86_64__cuda1121.zip.35a24fdec33997d4c4468c32b53b139c to a temporary folder and unzip it. 7-zip should be able to unzip it directly but at worst you rename it to .zip and then unzip it.
3) Copy the run.bat from the paste link into the same folder, replace all `@echo` with `echo`, and sprinkle `timeout 5` everywhere. This will pause after each line and give you a chance to see the output. You might also want to change those "exit XX" lines to "echo something" so you see the failure instead of the shell exiting immediately. Run the script from your temporary folder. (This is important: the script refers to %CD%, so it expects to run in the folder where all the unzipped files and run.bat reside.) This should tell you which step failed first; how to fix that depends on the failure. I expect you will hit the failure before the `@echo Run AToM` line.

PS: I am not very familiar with Windows, so there must be better ways to debug a batch file instead of 3).
PS2: If you haven't already, reset the project first just to rule out the chance of some corrupted file.


___________________

Here, solve it yourself. It is all gibberish to me. Marvels and Mysteries of ATM.
task 33506305

I just come to check what my computers are doing.

wujj123456
Send message
Joined: 9 Jun 10
Posts: 16
Credit: 1,787,119,073
RAC: 3,100,787
Level
His
Scientific publications
watwatwatwat
Message 60433 - Posted: 14 May 2023 | 22:43:15 UTC - in response to Message 60432.

Here, solve it yourself. It is all gibberish to me. Marvels and Mysteries of ATM.
task 33506305

I just come to check what my computers are doing.

FYI, the reply you quoted wasn't replying to you. It was for pima (Message 60429).

[FVG] pima1965
Send message
Joined: 20 Jan 10
Posts: 4
Credit: 2,706,015,364
RAC: 495,820
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60434 - Posted: 15 May 2023 | 6:30:33 UTC - in response to Message 60431.

Hello and thanks again. I have implemented everything you advised; at the moment there are no ATMbeta tasks available, so I have no way of knowing whether it has led to any results. I will keep you informed.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60435 - Posted: 15 May 2023 | 19:08:03 UTC

FileNotFoundError: [Errno 2] No such file or directory: 'TYK2_m42_m54_0.xml'

https://www.gpugrid.net/result.php?resultid=33509037

:-(

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 12,875,793
RAC: 0
Level
Pro
Scientific publications
wat
Message 60436 - Posted: 16 May 2023 | 7:24:08 UTC - in response to Message 60431.

When I did that I got this:
E:\programdata\BOINC\projects\www.gpugrid.net\1>echo Setup environment
Setup environment

E:\programdata\BOINC\projects\www.gpugrid.net\1>set HOMEPATH=E:\programdata\BOINC\projects\www.gpugrid.net\1

E:\programdata\BOINC\projects\www.gpugrid.net\1>set PATH=E:\programdata\BOINC\projects\www.gpugrid.net\1;E:\programdata\BOINC\projects\www.gpugrid.net\1\Library\usr\bin;E:\programdata\BOINC\projects\www.gpugrid.net\1\Library\bin;C:\Windows\system32;C:\Windows

E:\programdata\BOINC\projects\www.gpugrid.net\1>set PYTHONPATH=E:\programdata\BOINC\projects\www.gpugrid.net\1\Lib\python3.9\site-packages

E:\programdata\BOINC\projects\www.gpugrid.net\1>set SYSTEMROOT=C:\Windows

E:\programdata\BOINC\projects\www.gpugrid.net\1>echo Create a temporary directory
Create a temporary directory

E:\programdata\BOINC\projects\www.gpugrid.net\1>set TEMP=E:\programdata\BOINC\projects\www.gpugrid.net\1\tmp

E:\programdata\BOINC\projects\www.gpugrid.net\1>mkdir E:\programdata\BOINC\projects\www.gpugrid.net\1\tmp
A subdirectory or file E:\programdata\BOINC\projects\www.gpugrid.net\1\tmp already exists.

E:\programdata\BOINC\projects\www.gpugrid.net\1>echo Install AToM
Install AToM

E:\programdata\BOINC\projects\www.gpugrid.net\1>set REPO_URL=git+https://github.com/raimis/AToM-OpenMM.git@d7931b9a6217232d481731f7589d64b100a514ac

E:\programdata\BOINC\projects\www.gpugrid.net\1>python.exe -m pip install git+https://github.com/raimis/AToM-OpenMM.git@d7931b9a6217232d481731f7589d64b100a514ac || pause
Collecting git+https://github.com/raimis/AToM-OpenMM.git@d7931b9a6217232d481731f7589d64b100a514ac
Cloning https://github.com/raimis/AToM-OpenMM.git (to revision d7931b9a6217232d481731f7589d64b100a514ac) to e:\programdata\boinc\projects\www.gpugrid.net\1\tmp\pip-req-build-679j5xcv
Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git 'E:\programdata\BOINC\projects\www.gpugrid.net\1\tmp\pip-req-build-679j5xcv'
Running command git rev-parse -q --verify 'sha^d7931b9a6217232d481731f7589d64b100a514ac'
Running command git fetch -q https://github.com/raimis/AToM-OpenMM.git d7931b9a6217232d481731f7589d64b100a514ac
Running command git checkout -q d7931b9a6217232d481731f7589d64b100a514ac
Resolved https://github.com/raimis/AToM-OpenMM.git to commit d7931b9a6217232d481731f7589d64b100a514ac
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [16 lines of output]
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "E:\programdata\BOINC\projects\www.gpugrid.net\1\tmp\pip-req-build-679j5xcv\setup.py", line 23, in <module>
from async_re import __version__ as VERSION
File "E:\programdata\BOINC\projects\www.gpugrid.net\1\tmp\pip-req-build-679j5xcv\async_re.py", line 24, in <module>
from ommreplica import *
File "E:\programdata\BOINC\projects\www.gpugrid.net\1\tmp\pip-req-build-679j5xcv\ommreplica.py", line 8, in <module>
from simtk import openmm as mm
File "E:\programdata\BOINC\projects\www.gpugrid.net\1\lib\site-packages\simtk\__init__.py", line 1, in <module>
import openmm
File "E:\programdata\BOINC\projects\www.gpugrid.net\1\lib\site-packages\openmm\__init__.py", line 24, in <module>
from openmm.openmm import *
File "E:\programdata\BOINC\projects\www.gpugrid.net\1\lib\site-packages\openmm\openmm.py", line 10, in <module>
from . import _openmm
ImportError: DLL load failed while importing _openmm: The specified module could not be found.
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 12,875,793
RAC: 0
Level
Pro
Scientific publications
wat
Message 60437 - Posted: 16 May 2023 | 7:39:21 UTC - in response to Message 60431.
Last modified: 16 May 2023 | 8:09:35 UTC

When I did that I got this:

E:\programdata\BOINC\projects\www.gpugrid.net\1>run.bat

E:\programdata\BOINC\projects\www.gpugrid.net\1>echo Setup environment
Setup environment

E:\programdata\BOINC\projects\www.gpugrid.net\1>set HOMEPATH=E:\programdata\BOINC\projects\www.gpugrid.net\1

E:\programdata\BOINC\projects\www.gpugrid.net\1>set PATH=E:\programdata\BOINC\projects\www.gpugrid.net\1;E:\programdata\BOINC\projects\www.gpugrid.net\1\Library\usr\bin;E:\programdata\BOINC\projects\www.gpugrid.net\1\Library\bin;C:\Windows\system32;C:\Windows

E:\programdata\BOINC\projects\www.gpugrid.net\1>set PYTHONPATH=E:\programdata\BOINC\projects\www.gpugrid.net\1\Lib\python3.9\site-packages

E:\programdata\BOINC\projects\www.gpugrid.net\1>set SYSTEMROOT=C:\Windows

E:\programdata\BOINC\projects\www.gpugrid.net\1>echo Create a temporary directory
Create a temporary directory

E:\programdata\BOINC\projects\www.gpugrid.net\1>set TEMP=E:\programdata\BOINC\projects\www.gpugrid.net\1\tmp

E:\programdata\BOINC\projects\www.gpugrid.net\1>mkdir E:\programdata\BOINC\projects\www.gpugrid.net\1\tmp
A subdirectory or file E:\programdata\BOINC\projects\www.gpugrid.net\1\tmp already exists.

E:\programdata\BOINC\projects\www.gpugrid.net\1>echo Install AToM
Install AToM

E:\programdata\BOINC\projects\www.gpugrid.net\1>set REPO_URL=git+https://github.com/raimis/AToM-OpenMM.git@d7931b9a6217232d481731f7589d64b100a514ac

E:\programdata\BOINC\projects\www.gpugrid.net\1>python.exe -m pip install git+https://github.com/raimis/AToM-OpenMM.git@d7931b9a6217232d481731f7589d64b100a514ac || pause
Collecting git+https://github.com/raimis/AToM-OpenMM.git@d7931b9a6217232d481731f7589d64b100a514ac
Cloning https://github.com/raimis/AToM-OpenMM.git (to revision d7931b9a6217232d481731f7589d64b100a514ac) to e:\programdata\boinc\projects\www.gpugrid.net\1\tmp\pip-req-build-_9weckoo
Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git 'E:\programdata\BOINC\projects\www.gpugrid.net\1\tmp\pip-req-build-_9weckoo'
Running command git rev-parse -q --verify 'sha^d7931b9a6217232d481731f7589d64b100a514ac'
Running command git fetch -q https://github.com/raimis/AToM-OpenMM.git d7931b9a6217232d481731f7589d64b100a514ac
Running command git checkout -q d7931b9a6217232d481731f7589d64b100a514ac
Resolved https://github.com/raimis/AToM-OpenMM.git to commit d7931b9a6217232d481731f7589d64b100a514ac
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [16 lines of output]
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "E:\programdata\BOINC\projects\www.gpugrid.net\1\tmp\pip-req-build-_9weckoo\setup.py", line 23, in <module> from async_re import __version__ as VERSION
File "E:\programdata\BOINC\projects\www.gpugrid.net\1\tmp\pip-req-build-_9weckoo\async_re.py", line 24, in <module>
from ommreplica import *
File "E:\programdata\BOINC\projects\www.gpugrid.net\1\tmp\pip-req-build-_9weckoo\ommreplica.py", line 8, in <module>
from simtk import openmm as mm
File "E:\programdata\BOINC\projects\www.gpugrid.net\1\lib\site-packages\simtk\__init__.py", line 1, in <module>
import openmm
File "E:\programdata\BOINC\projects\www.gpugrid.net\1\lib\site-packages\openmm\__init__.py", line 24, in <module>
from openmm.openmm import *
File "E:\programdata\BOINC\projects\www.gpugrid.net\1\lib\site-packages\openmm\openmm.py", line 10, in <module>
from . import _openmm
ImportError: DLL load failed while importing _openmm: The specified module could not be found.
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
Press any key to continue . . .

E:\programdata\BOINC\projects\www.gpugrid.net\1>python.exe -m pip list
Package Version
------------ -------
atmmetaforce 0.3
configobj 5.0.8
numpy 1.24.2
OpenMM 8.0.0
pip 23.0.1
setuptools 67.6.0
six 1.16.0
wheel 0.40.0

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60438 - Posted: 16 May 2023 | 9:02:07 UTC - in response to Message 60435.

FileNotFoundError: [Errno 2] No such file or directory: 'TYK2_m42_m54_0.xml'

https://www.gpugrid.net/result.php?resultid=33509037

:-(


Crap, I forgot to clean out the ones that didn't equilibrate successfully here locally. Let me see if I can find the few others that crashed and cancel those WUs.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60439 - Posted: 16 May 2023 | 9:05:34 UTC - in response to Message 60410.

"ValueError: Energy is NaN" is back quite often :-(



Do these "Energy is NaN" errors come back really quickly? Are they on runs with similar names? Upon checking the results I have seen that some runs have indeed crashed, but not very often.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60440 - Posted: 16 May 2023 | 9:21:55 UTC - in response to Message 60417.

The difference between 0-5 and n-5 has been consistent throughout - there hasn't been a "fix and revert". Just new data runs starting from 0 again.


So I'm usually trying to hit 350 samples, which equates to a bit more than 60 ns of sampling time.
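
(Rough numbers, taking the settings from one of these runs as an assumption - 2000 production steps per sample at a 4 fs time step, with 22 replicas: each sample is about 8 ps per replica, so 350 samples come out to roughly 350 × 8 ps × 22 ≈ 62 ns of sampling in total, which is where the "bit more than 60 ns" comes from.)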

At the beginning I was sending the full run to a single volunteer, but there were the size issues and some people said the tasks were too long. I reduced the frame-saving frequency and started to divide these runs manually, but that was too time-consuming and very hard to track. It was also causing issues with the progress bars.

That's why we later implemented what we use now. Like in ACEMD, we can now chain these runs: instead of sending the further steps manually, it is done automatically. This let me divide the runs into smaller chunks, making them smaller in size and faster to run.
In theory this should also have fixed the issue with the progress bars, since the cntl file asks for +70 samples. But I guess the first chunk of a run shows a proper progress bar while the following ones are stuck at 100% from the beginning, because the control file reads +70 while the log of a continuation already starts at sample 71, so it looks as if the target has already been passed.
I'll pester the devs again to see if they can get a fix for it soon.


About the recent errors: some of them are on my end, I messed up a few times. We changed the preparation protocol and some running conditions for GPUGRID (as explained before), and sometimes a necessary tiny script was left in there and ran... I've taken measures to avoid this as much as possible. I hope we don't have an issue like before.
Regarding the BACE files with very big sizes... maybe I forgot to cancel some WUs? It was the first time I was doing this and the search bar is very wonky.

[CSF] Aleksey Belkov
Avatar
Send message
Joined: 26 Dec 13
Posts: 85
Credit: 1,215,531,270
RAC: 25,194
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60441 - Posted: 16 May 2023 | 11:14:55 UTC - in response to Message 60438.

Let me see if I can find the other few that crashed and cancel those WU.

FileNotFoundError:
https://www.gpugrid.net/result.php?resultid=33507881
Energy is NaN:
https://www.gpugrid.net/result.php?resultid=33507902
https://www.gpugrid.net/result.php?resultid=33509252
ImportError:
https://www.gpugrid.net/result.php?resultid=33504227
https://www.gpugrid.net/result.php?resultid=33503240

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 12,875,793
RAC: 0
Level
Pro
Scientific publications
wat
Message 60442 - Posted: 16 May 2023 | 20:23:33 UTC

Why does "python.exe -m pip install git+https://github.com/raimis/AToM-OpenMM.git@d7931b9a6217232d481731f7589d64b100a514ac" fail like this?
Collecting git+https://github.com/raimis/AToM-OpenMM.git@d7931b9a6217232d481731f7589d64b100a514ac
Cloning https://github.com/raimis/AToM-OpenMM.git (to revision d7931b9a6217232d481731f7589d64b100a514ac) to e:\programdata\boinc\projects\www.gpugrid.net\1\tmp\pip-req-build-_8e3bl8k
Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git 'E:\programdata\BOINC\projects\www.gpugrid.net\1\tmp\pip-req-build-_8e3bl8k'
Running command git rev-parse -q --verify 'sha^d7931b9a6217232d481731f7589d64b100a514ac'
Running command git fetch -q https://github.com/raimis/AToM-OpenMM.git d7931b9a6217232d481731f7589d64b100a514ac
Running command git checkout -q d7931b9a6217232d481731f7589d64b100a514ac
Resolved https://github.com/raimis/AToM-OpenMM.git to commit d7931b9a6217232d481731f7589d64b100a514ac
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [16 lines of output]
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "E:\programdata\BOINC\projects\www.gpugrid.net\1\tmp\pip-req-build-_8e3bl8k\setup.py", line 23, in <module> from async_re import __version__ as VERSION
File "E:\programdata\BOINC\projects\www.gpugrid.net\1\tmp\pip-req-build-_8e3bl8k\async_re.py", line 24, in <module>
from ommreplica import *
File "E:\programdata\BOINC\projects\www.gpugrid.net\1\tmp\pip-req-build-_8e3bl8k\ommreplica.py", line 8, in <module>
from simtk import openmm as mm
File "E:\programdata\BOINC\projects\www.gpugrid.net\1\lib\site-packages\simtk\__init__.py", line 1, in <module>
import openmm
File "E:\programdata\BOINC\projects\www.gpugrid.net\1\lib\site-packages\openmm\__init__.py", line 24, in <module>
from openmm.openmm import *
File "E:\programdata\BOINC\projects\www.gpugrid.net\1\lib\site-packages\openmm\openmm.py", line 10, in <module>
from . import _openmm
ImportError: DLL load failed while importing _openmm: The specified module could not be found.
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

[FVG] pima1965
Send message
Joined: 20 Jan 10
Posts: 4
Credit: 2,706,015,364
RAC: 495,820
Level
Phe
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60443 - Posted: 16 May 2023 | 20:39:47 UTC - in response to Message 60431.

Good evening, due to my very poor IT skills, despite your precise instructions which I followed to the letter, I was not able to understand and solve the problem. Given that my other PCs, some with Windows 10 and some with Windows 11, do not have any kind of problem, I believe that the cause lies with this specific PC. For the moment I have disconnected this PC from the GPUGRID project and I have also uninstalled BOINC. When I have more time I will try again. Thanks again for your kind cooperation.

wujj123456
Send message
Joined: 9 Jun 10
Posts: 16
Credit: 1,787,119,073
RAC: 3,100,787
Level
His
Scientific publications
watwatwatwat
Message 60444 - Posted: 18 May 2023 | 2:54:04 UTC - in response to Message 60442.

from openmm.openmm import *
File "E:\programdata\BOINC\projects\www.gpugrid.net\1\lib\site-packages\openmm\openmm.py", line 10, in <module>
from . import _openmm
ImportError: DLL load failed while importing _openmm: The specified module could not be found.
[end of output]


Ouch, sorry, I missed one step. Before you execute run.bat, you need to run this first inside the same folder: ".\python.exe bin/conda-unpack". Then run.bat should get you past the pip install command.
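
To spell out the order of commands (just a sketch, assuming the same copied project folder as in the earlier logs; adjust the drive and path to wherever your BOINC data directory lives):

cd /d E:\programdata\BOINC\projects\www.gpugrid.net\1
.\python.exe bin/conda-unpack
run.bat

conda-unpack fixes up the hard-coded prefix paths inside the shipped conda environment, which is presumably why the _openmm DLL could not be found before it was run.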

Before doing that though, can I ask whether you are running into the same problem? I didn't find a similarly failed WU from your machines. The steps were only for pima1965's errors. If your WUs aren't hitting setup failures but something else, there is no point in trying these steps in the first place...

PS: FWIW, the instructions I provided were simply an attempt to reproduce an environment based on projects\www.gpugrid.net\job.xml* without a running WU. Ultimately, if you can catch a running WU before it fails and gets cleaned up, you can copy its slot folder over to get a more accurate environment. (You can find the slot folder by checking the task's properties in the UI.)

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60445 - Posted: 18 May 2023 | 8:30:21 UTC

I had 14 errors in a row last night, between about 18:30 and 19:30 UTC. All failed with a variant of

Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /hdd/boinc-client/slots/5/tmp/pip-req-build-_q32nezm
fatal: unable to access 'https://github.com/raimis/AToM-OpenMM.git/': Recv failure: Connection reset by peer

Is that something you can control?

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60446 - Posted: 18 May 2023 | 8:34:19 UTC - in response to Message 60441.

Let me see if I can find the other few that crashed and cancel those WU.

FileNotFoundError:
https://www.gpugrid.net/result.php?resultid=33507881
Energy is NaN:
https://www.gpugrid.net/result.php?resultid=33507902
https://www.gpugrid.net/result.php?resultid=33509252
ImportError:
https://www.gpugrid.net/result.php?resultid=33504227
https://www.gpugrid.net/result.php?resultid=33503240


Thanks for this, I will take a close look at these systems to see what could be the reason for the error.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60447 - Posted: 18 May 2023 | 8:38:11 UTC - in response to Message 60445.

I had 14 errors in a row last night, between about 18:30 and 19:30 UTC. All failed with a variant of

Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /hdd/boinc-client/slots/5/tmp/pip-req-build-_q32nezm
fatal: unable to access 'https://github.com/raimis/AToM-OpenMM.git/': Recv failure: Connection reset by peer

Is that something you can control?


It seems this is a GitHub problem; it has been a bit unstable over the past few days.

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 126
Credit: 107,156,939
RAC: 28,034
Level
Cys
Scientific publications
watwatwatwatwatwat
Message 60451 - Posted: 20 May 2023 | 14:21:48 UTC
Last modified: 20 May 2023 | 14:24:30 UTC

4 tasks and 4 errors

2 computation errors - 195 (0xc3) EXIT_CHILD_FAILED
3 aborts - the tasks are complete but do not stop when finished.
Checked via BOINC tasks: 0% CPU usage, just hogging a GPU.
Just killed one at 4 hrs 50+ minutes.

Last abort yielded this stderr:
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.
Detected memory leaks!
Dumping objects ->
..\api\boinc_api.cpp(309) : {3078512} normal block at 0x000002153E930FD0, 8 bytes long.
Data: < 6@ > 00 00 36 40 15 02 00 00
..\lib\diagnostics_win.cpp(417) : {3077232} normal block at 0x000002153E919030, 1080 bytes long.
Data: <8> h > 38 3E 00 00 CD CD CD CD 68 01 00 00 00 00 00 00
..\zip\boinc_zip.cpp(122) : {278} normal block at 0x000002153E92E7F0, 260 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Object dump complete.


I do not have memory leaks.
I can run all my other projects (LHC included) without any problems.

The computation errors stopped after 166 and 180 seconds.

This stands out from the rest of the data: python setup.py egg_info did not run successfully.
exit code: 1

ImportError: DLL load failed while importing _openmm: The specified module could not be found.
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

Encountered error while generating package metadata.


So whatever batch I am getting, it's all buggy.


https://www.gpugrid.net/results.php?userid=107556
My list of tasks for you to look through.

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 126
Credit: 107,156,939
RAC: 28,034
Level
Cys
Scientific publications
watwatwatwatwatwat
Message 60456 - Posted: 20 May 2023 | 22:31:41 UTC
Last modified: 20 May 2023 | 22:35:20 UTC

Computer: DESKTOP-LFM92VN
Project GPUGRID

Name PTP1B_m67_m74_1-QUICO_ATM_Sage_xTB_14-4-5-RND6230_0

Application ATMbeta: Free energy calculations of protein-ligand binding 1.09 (cuda1121)
Workunit name PTP1B_m67_m74_1-QUICO_ATM_Sage_xTB_14-4-5-RND6230
State Running
Received 5/20/2023 10:05:38 PM
Report deadline 5/25/2023 10:05:37 PM
Estimated app speed 32,450.25 GFLOPs/sec
Estimated task size 1,000,000,000 GFLOPs
Resources 0.988 CPUs + 1 NVIDIA GPU (device 0)
CPU time at last checkpoint 00:00:00
CPU time 02:13:49
Elapsed time 02:21:04
Estimated time remaining 00:00:00
Fraction done 100.000%
Virtual memory size 7,131.64 MB
Working set size 955.92 MB
Directory slots/0
Process ID 2152

Debug State: 2 - Scheduler: 2

0 checkpoint?
No Stderr during the run in the slot folder.


Now the Stderr reads the same as all the rest. Import blah blah, memory leaks, etc.

https://www.gpugrid.net/result.php?resultid=33520613

I have yet to complete any of these new tasks.

GPU drivers are up to date.

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 12,875,793
RAC: 0
Level
Pro
Scientific publications
wat
Message 60460 - Posted: 21 May 2023 | 5:30:35 UTC - in response to Message 60456.

What does run.log show?

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60462 - Posted: 21 May 2023 | 10:00:21 UTC

Could someone with a stable power supply, and no El Niño events disrupting the energy supply in their area (South Asia, where I am, is badly affected by squalls and rainfall), take a spare computer, install Python or Anaconda on it, and then check these WUs? (All my errors are now being caused by weather events and power failures.) I let someone who is learning Python use my computers, and he installed both of these.
I wish these WUs could handle suspend mode. My preferences are set to keep tasks in memory.
If installing Python makes these WUs stable, what would that mean? Something like Climate, which requires 32-bit DLLs?

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 126
Credit: 107,156,939
RAC: 28,034
Level
Cys
Scientific publications
watwatwatwatwatwat
Message 60464 - Posted: 21 May 2023 | 10:29:02 UTC - in response to Message 60460.
Last modified: 21 May 2023 | 10:33:38 UTC

I will have to look at the next one.
This task has already been deleted.

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 126
Credit: 107,156,939
RAC: 28,034
Level
Cys
Scientific publications
watwatwatwatwatwat
Message 60465 - Posted: 21 May 2023 | 11:42:32 UTC - in response to Message 60460.

Yet another abort needed.
Computer: DESKTOP-LFM92VN
Project GPUGRID

Name p38_m2z_maa_3-QUICO_ATM_Sage_xTB_14-4-5-RND6656_0

Application ATMbeta: Free energy calculations of protein-ligand binding 1.09 (cuda1121)
Workunit name p38_m2z_maa_3-QUICO_ATM_Sage_xTB_14-4-5-RND6656
State Running
Received 5/20/2023 10:22:37 PM
Report deadline 5/25/2023 10:22:36 PM
Estimated app speed 32,450.25 GFLOPs/sec
Estimated task size 1,000,000,000 GFLOPs
Resources 0.988 CPUs + 1 NVIDIA GPU (device 0)
CPU time at last checkpoint 00:00:00
CPU time 02:18:29
Elapsed time 00:51:04
Estimated time remaining 00:00:00
Fraction done 100.000%
Virtual memory size 7,218.47 MB
Working set size 2,101.48 MB
Directory slots/0
Process ID 17800

Debug State: 2 - Scheduler: 2

This runs on a GTX 1080 and one virtual core.



What does run.log show?


Setup environment

D:\data\slots\0>set HOMEPATH=D:\data\slots\0

D:\data\slots\0>set PATH=D:\data\slots\0;D:\data\slots\0\Library\usr\bin;D:\data\slots\0\Library\bin;C:\Windows\system32;C:\Windows

D:\data\slots\0>set PYTHONPATH=D:\data\slots\0\Lib\python3.9\site-packages

D:\data\slots\0>set SYSTEMROOT=C:\Windows
Create a temporary directory

D:\data\slots\0>set TEMP=D:\data\slots\0\tmp

D:\data\slots\0>mkdir D:\data\slots\0\tmp
Install AToM

D:\data\slots\0>set REPO_URL=git+https://github.com/raimis/AToM-OpenMM.git@d7931b9a6217232d481731f7589d64b100a514ac

D:\data\slots\0>python.exe -m pip install git+https://github.com/raimis/AToM-OpenMM.git@d7931b9a6217232d481731f7589d64b100a514ac || exit 14
Collecting git+https://github.com/raimis/AToM-OpenMM.git@d7931b9a6217232d481731f7589d64b100a514ac
Cloning https://github.com/raimis/AToM-OpenMM.git (to revision d7931b9a6217232d481731f7589d64b100a514ac) to d:\data\slots\0\tmp\pip-req-build-n4bnfm46
Resolved https://github.com/raimis/AToM-OpenMM.git to commit d7931b9a6217232d481731f7589d64b100a514ac
Preparing metadata (setup.py): started
Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: async-re
Building wheel for async-re (setup.py): started
Building wheel for async-re (setup.py): finished with status 'done'
Created wheel for async-re: filename=async_re-3.3.0-py3-none-any.whl size=40735 sha256=b78cd7a2db0c0a4584d16c9a7967a2bdafe4f91754d0514cf1d4be7fd9e7038f
Stored in directory: c:\users\greg\appdata\local\pip\cache\wheels\e3\94\02\5d2f795e8088e5cda09e48b0a167d6325c316862c02fa11467
Successfully built async-re
Installing collected packages: async-re
Successfully installed async-re-3.3.0

D:\data\slots\0>python.exe -m pip list
Package Version
------------ -------
async-re 3.3.0
atmmetaforce 0.3
configobj 5.0.8
numpy 1.24.2
OpenMM 8.0.0
pip 23.0.1
setuptools 67.6.0
six 1.16.0
wheel 0.40.0
Configure AToM

D:\data\slots\0>echo localhost,0:0,1,CUDA,,D:\data\slots\0\tmp 1>nodefile
Extract restart

D:\data\slots\0>tar.exe xjvf restart.tar.bz2 || true
r0/p38_m2z_maa_ckpt.xml
r1/p38_m2z_maa_ckpt.xml
r10/p38_m2z_maa_ckpt.xml
r11/p38_m2z_maa_ckpt.xml
r12/p38_m2z_maa_ckpt.xml
r13/p38_m2z_maa_ckpt.xml
r14/p38_m2z_maa_ckpt.xml
r15/p38_m2z_maa_ckpt.xml
r16/p38_m2z_maa_ckpt.xml
r17/p38_m2z_maa_ckpt.xml
r18/p38_m2z_maa_ckpt.xml
r19/p38_m2z_maa_ckpt.xml
r2/p38_m2z_maa_ckpt.xml
r20/p38_m2z_maa_ckpt.xml
r21/p38_m2z_maa_ckpt.xml
r3/p38_m2z_maa_ckpt.xml
r4/p38_m2z_maa_ckpt.xml
r5/p38_m2z_maa_ckpt.xml
r6/p38_m2z_maa_ckpt.xml
r7/p38_m2z_maa_ckpt.xml
r8/p38_m2z_maa_ckpt.xml
r9/p38_m2z_maa_ckpt.xml
Run AToM

D:\data\slots\0>set CONFIG_FILE=p38_m2z_maa_asyncre.cntl

D:\data\slots\0>python.exe Scripts\rbfe_explicit_sync.py p38_m2z_maa_asyncre.cntl || exit 22
2023-05-21 12:44:43 - INFO - sync_re - Configuration:
2023-05-21 12:44:43 - INFO - sync_re - JOB_TRANSPORT: LOCAL_OPENMM
2023-05-21 12:44:43 - INFO - sync_re - BASENAME: p38_m2z_maa
2023-05-21 12:44:43 - INFO - sync_re - RE_SETUP: YES
2023-05-21 12:44:43 - INFO - sync_re - TEMPERATURES: 300
2023-05-21 12:44:43 - INFO - sync_re - LAMBDAS: 0.00, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95, 1.00
2023-05-21 12:44:43 - INFO - sync_re - DIRECTION: 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
2023-05-21 12:44:43 - INFO - sync_re - INTERMEDIATE: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
2023-05-21 12:44:43 - INFO - sync_re - LAMBDA1: 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.10, 0.20, 0.30, 0.40, 0.50, 0.50, 0.40, 0.30, 0.20, 0.10, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00
2023-05-21 12:44:43 - INFO - sync_re - LAMBDA2: 0.00, 0.10, 0.20, 0.30, 0.40, 0.50, 0.50, 0.50, 0.50, 0.50, 0.50, 0.50, 0.50, 0.50, 0.50, 0.50, 0.50, 0.40, 0.30, 0.20, 0.10, 0.00
2023-05-21 12:44:43 - INFO - sync_re - ALPHA: 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10
2023-05-21 12:44:43 - INFO - sync_re - U0: 110., 110., 110., 110., 110., 110., 110., 110., 110., 110., 110., 110., 110., 110., 110., 110., 110., 110., 110., 110., 110., 110.
2023-05-21 12:44:43 - INFO - sync_re - W0COEFF: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
2023-05-21 12:44:43 - INFO - sync_re - DISPLACEMENT: 22.0, 22.0, 22.0
2023-05-21 12:44:43 - INFO - sync_re - WALL_TIME: 9999
2023-05-21 12:44:43 - INFO - sync_re - CYCLE_TIME: 60
2023-05-21 12:44:43 - INFO - sync_re - CHECKPOINT_TIME: 300
2023-05-21 12:44:43 - INFO - sync_re - NODEFILE: nodefile
2023-05-21 12:44:43 - INFO - sync_re - SUBJOBS_BUFFER_SIZE: 0
2023-05-21 12:44:43 - INFO - sync_re - PRODUCTION_STEPS: 2000
2023-05-21 12:44:43 - INFO - sync_re - PRNT_FREQUENCY: 2000
2023-05-21 12:44:43 - INFO - sync_re - TRJ_FREQUENCY: 40000
2023-05-21 12:44:43 - INFO - sync_re - LIGAND1_ATOMS: ['5596', '5597', '5598', '5599', '5600', '5601', '5602', '5603', '5604', '5605', '5606', '5607', '5608', '5609', '5610', '5611', '5612', '5613', '5614', '5615', '5616', '5617', '5618', '5619', '5620', '5621', '5622', '5623', '5624', '5625', '5626', '5627', '5628', '5629', '5630', '5631', '5632', '5633', '5634', '5635', '5636', '5637', '5638', '5639', '5640']
2023-05-21 12:44:43 - INFO - sync_re - LIGAND2_ATOMS: ['5641', '5642', '5643', '5644', '5645', '5646', '5647', '5648', '5649', '5650', '5651', '5652', '5653', '5654', '5655', '5656', '5657', '5658', '5659', '5660', '5661', '5662', '5663', '5664', '5665', '5666', '5667', '5668', '5669', '5670', '5671', '5672', '5673', '5674', '5675', '5676', '5677', '5678', '5679', '5680', '5681', '5682', '5683']
2023-05-21 12:44:43 - INFO - sync_re - LIGAND1_CM_ATOMS: 5601
2023-05-21 12:44:43 - INFO - sync_re - LIGAND2_CM_ATOMS: 5646
2023-05-21 12:44:43 - INFO - sync_re - RCPT_CM_ATOMS: ['460', '483', '494', '501', '550', '745', '755', '771', '1178', '1337', '1363', '1654', '1673', '1689', '1703', '1720', '1739', '1756', '1763', '1773', '2532', '2685']
2023-05-21 12:44:43 - INFO - sync_re - CM_KF: 25.00
2023-05-21 12:44:43 - INFO - sync_re - CM_TOL: 10
2023-05-21 12:44:43 - INFO - sync_re - POS_RESTRAINED_ATOMS: ['4', '19', '51', '57', '71', '91', '112', '136', '153', '168', '187', '201', '223', '237', '256', '280', '295', '319', '325', '340', '364', '385', '402', '416', '435', '454', '460', '476', '483', '494', '501', '511', '532', '539', '550', '566', '577', '587', '597', '617', '629', '643', '665', '679', '686', '705', '729', '745', '755', '771', '793', '815', '834', '845', '877', '883', '903', '920', '931', '950', '969', '986', '996', '1018', '1042', '1056', '1077', '1101', '1116', '1135', '1159', '1178', '1197', '1219', '1236', '1253', '1275', '1292', '1307', '1321', '1337', '1356', '1363', '1382', '1401', '1413', '1429', '1449', '1471', '1477', '1487', '1511', '1522', '1541', '1556', '1571', '1591', '1605', '1617', '1633', '1654', '1673', '1689', '1703', '1720', '1739', '1756', '1763', '1773', '1785', '1804', '1818', '1832', '1851', '1867', '1889', '1900', '1917', '1939', '1958', '1972', '1984', '1996', '2013', '2029', '2046', '2066', '2085', '2104', '2125', '2142', '2161', '2180', '2204', '2211', '2230', '2252', '2273', '2292', '2309', '2320', '2330', '2342', '2361', '2380', '2397', '2421', '2433', '2452', '2482', '2488', '2499', '2513', '2532', '2542', '2558', '2572', '2587', '2599', '2610', '2625', '2644', '2666', '2685', '2704', '2716', '2736', '2743', '2765', '2782', '2796', '2808', '2820', '2835', '2852', '2866', '2873', '2894', '2910', '2920', '2934', '2958', '2982', '3003', '3027', '3045', '3051', '3066', '3085', '3102', '3121', '3135', '3159', '3176', '3193', '3214', '3228', '3245', '3259', '3275', '3287', '3306', '3330', '3341', '3357', '3364', '3375', '3394', '3411', '3421', '3436', '3455', '3474', '3488', '3495', '3519', '3533', '3552', '3580', '3586', '3593', '3607', '3619', '3636', '3655', '3667', '3684', '3703', '3725', '3744', '3763', '3782', '3806', '3825', '3841', '3848', '3870', '3876', '3883', '3893', '3908', '3927', '3946', '3968', '3990', '4009', '4020', '4031', '4046', '4057', '4067', '4091', '4105', '4126', '4145', '4162', '4173', '4192', '4206', '4223', '4248', '4254', '4276', '4293', '4307', '4327', '4337', '4351', '4367', '4387', '4406', '4413', '4423', '4445', '4451', '4470', '4480', '4496', '4508', '4527', '4546', '4561', '4583', '4600', '4619', '4635', '4654', '4666', '4677', '4689', '4711', '4735', '4754', '4768', '4778', '4788', '4805', '4815', '4834', '4844', '4861', '4871', '4892', '4912', '4922', '4939', '4960', '4977', '4997', '5003', '5015', '5027', '5050', '5056', '5072', '5082', '5102', '5108', '5129', '5141', '5158', '5169', '5189', '5204', '5215', '5239', '5251', '5270', '5289', '5308', '5320', '5335', '5359', '5381', '5392', '5411', '5425', '5446', '5458', '5473', '5489', '5508', '5519', '5539', '5563', '5577', '5591']
2023-05-21 12:44:43 - INFO - sync_re - POSRE_FORCE_CONSTANT: 25.0
2023-05-21 12:44:43 - INFO - sync_re - POSRE_TOLERANCE: 1.5
2023-05-21 12:44:43 - INFO - sync_re - ALIGN_LIGAND1_REF_ATOMS: ['5', '1', '20']
2023-05-21 12:44:43 - INFO - sync_re - ALIGN_LIGAND2_REF_ATOMS: ['5', '1', '20']
2023-05-21 12:44:43 - INFO - sync_re - ALIGN_KF_SEP: 2.5
2023-05-21 12:44:43 - INFO - sync_re - ALIGN_K_THETA: 25.0
2023-05-21 12:44:43 - INFO - sync_re - ALIGN_K_PSI: 25.0
2023-05-21 12:44:43 - INFO - sync_re - UMAX: 200.00
2023-05-21 12:44:43 - INFO - sync_re - ACORE: 0.062500
2023-05-21 12:44:43 - INFO - sync_re - UBCORE: 100.0
2023-05-21 12:44:43 - INFO - sync_re - FRICTION_COEFF: 0.100000
2023-05-21 12:44:43 - INFO - sync_re - TIME_STEP: 0.004
2023-05-21 12:44:43 - INFO - sync_re - OPENMM_PLATFORM: CUDA
2023-05-21 12:44:43 - INFO - sync_re - VERBOSE: no
2023-05-21 12:44:43 - INFO - sync_re - HMASS: 1.5
2023-05-21 12:44:43 - INFO - sync_re - MAX_SAMPLES: +70
2023-05-21 12:44:43 - INFO - sync_re - State parameters
2023-05-21 12:44:43 - INFO - sync_re - State: {'lambda': 0.0, 'atmdirection': 1.0, 'atmintermediate': 0.0, 'lambda1': 0.0, 'lambda2': 0.0, 'alpha': Quantity(value=0.1, unit=mole/kilocalorie), 'u0': Quantity(value=110.0, unit=kilocalorie/mole), 'w0': Quantity(value=0.0, unit=kilocalorie/mole), 'temperature': Quantity(value=300.0, unit=kelvin)}
2023-05-21 12:44:43 - INFO - sync_re - State: {'lambda': 0.05, 'atmdirection': 1.0, 'atmintermediate': 0.0, 'lambda1': 0.0, 'lambda2': 0.1, 'alpha': Quantity(value=0.1, unit=mole/kilocalorie), 'u0': Quantity(value=110.0, unit=kilocalorie/mole), 'w0': Quantity(value=0.0, unit=kilocalorie/mole), 'temperature': Quantity(value=300.0, unit=kelvin)}
2023-05-21 12:44:43 - INFO - sync_re - State: {'lambda': 0.1, 'atmdirection': 1.0, 'atmintermediate': 0.0, 'lambda1': 0.0, 'lambda2': 0.2, 'alpha': Quantity(value=0.1, unit=mole/kilocalorie), 'u0': Quantity(value=110.0, unit=kilocalorie/mole), 'w0': Quantity(value=0.0, unit=kilocalorie/mole), 'temperature': Quantity(value=300.0, unit=kelvin)}
2023-05-21 12:44:43 - INFO - sync_re - State: {'lambda': 0.15, 'atmdirection': 1.0, 'atmintermediate': 0.0, 'lambda1': 0.0, 'lambda2': 0.3, 'alpha': Quantity(value=0.1, unit=mole/kilocalorie), 'u0': Quantity(value=110.0, unit=kilocalorie/mole), 'w0': Quantity(value=0.0, unit=kilocalorie/mole), 'temperature': Quantity(value=300.0, unit=kelvin)}
2023-05-21 12:44:43 - INFO - sync_re - State: {'lambda': 0.2, 'atmdirection': 1.0, 'atmintermediate': 0.0, 'lambda1': 0.0, 'lambda2': 0.4, 'alpha': Quantity(value=0.1, unit=mole/kilocalorie), 'u0': Quantity(value=110.0, unit=kilocalorie/mole), 'w0': Quantity(value=0.0, unit=kilocalorie/mole), 'temperature': Quantity(value=300.0, unit=kelvin)}
2023-05-21 12:44:43 - INFO - sync_re - State: {'lambda': 0.25, 'atmdirection': 1.0, 'atmintermediate': 0.0, 'lambda1': 0.0, 'lambda2': 0.5, 'alpha': Quantity(value=0.1, unit=mole/kilocalorie), 'u0': Quantity(value=110.0, unit=kilocalorie/mole), 'w0': Quantity(value=0.0, unit=kilocalorie/mole), 'temperature': Quantity(value=300.0, unit=kelvin)}
2023-05-21 12:44:43 - INFO - sync_re - State: {'lambda': 0.3, 'atmdirection': 1.0, 'atmintermediate': 0.0, 'lambda1': 0.1, 'lambda2': 0.5, 'alpha': Quantity(value=0.1, unit=mole/kilocalorie), 'u0': Quantity(value=110.0, unit=kilocalorie/mole), 'w0': Quantity(value=0.0, unit=kilocalorie/mole), 'temperature': Quantity(value=300.0, unit=kelvin)}
2023-05-21 12:44:43 - INFO - sync_re - State: {'lambda': 0.35, 'atmdirection': 1.0, 'atmintermediate': 0.0, 'lambda1': 0.2, 'lambda2': 0.5, 'alpha': Quantity(value=0.1, unit=mole/kilocalorie), 'u0': Quantity(value=110.0, unit=kilocalorie/mole), 'w0': Quantity(value=0.0, unit=kilocalorie/mole), 'temperature': Quantity(value=300.0, unit=kelvin)}
2023-05-21 12:44:43 - INFO - sync_re - State: {'lambda': 0.4, 'atmdirection': 1.0, 'atmintermediate': 0.0, 'lambda1': 0.3, 'lambda2': 0.5, 'alpha': Quantity(value=0.1, unit=mole/kilocalorie), 'u0': Quantity(value=110.0, unit=kilocalorie/mole), 'w0': Quantity(value=0.0, unit=kilocalorie/mole), 'temperature': Quantity(value=300.0, unit=kelvin)}
2023-05-21 12:44:43 - INFO - sync_re - State: {'lambda': 0.45, 'atmdirection': 1.0, 'atmintermediate': 0.0, 'lambda1': 0.4, 'lambda2': 0.5, 'alpha': Quantity(value=0.1, unit=mole/kilocalorie), 'u0': Quantity(value=110.0, unit=kilocalorie/mole), 'w0': Quantity(value=0.0, unit=kilocalorie/mole), 'temperature': Quantity(value=300.0, unit=kelvin)}
2023-05-21 12:44:43 - INFO - sync_re - State: {'lambda': 0.5, 'atmdirection': 1.0, 'atmintermediate': 1.0, 'lambda1': 0.5, 'lambda2': 0.5, 'alpha': Quantity(value=0.1, unit=mole/kilocalorie), 'u0': Quantity(value=110.0, unit=kilocalorie/mole), 'w0': Quantity(value=1.0, unit=kilocalorie/mole), 'temperature': Quantity(value=300.0, unit=kelvin)}
2023-05-21 12:44:43 - INFO - sync_re - State: {'lambda': 0.5, 'atmdirection': -1.0, 'atmintermediate': 1.0, 'lambda1': 0.5, 'lambda2': 0.5, 'alpha': Quantity(value=0.1, unit=mole/kilocalorie), 'u0': Quantity(value=110.0, unit=kilocalorie/mole), 'w0': Quantity(value=1.0, unit=kilocalorie/mole), 'temperature': Quantity(value=300.0, unit=kelvin)}
2023-05-21 12:44:43 - INFO - sync_re - State: {'lambda': 0.55, 'atmdirection': -1.0, 'atmintermediate': 0.0, 'lambda1': 0.4, 'lambda2': 0.5, 'alpha': Quantity(value=0.1, unit=mole/kilocalorie), 'u0': Quantity(value=110.0, unit=kilocalorie/mole), 'w0': Quantity(value=0.0, unit=kilocalorie/mole), 'temperature': Quantity(value=300.0, unit=kelvin)}
2023-05-21 12:44:43 - INFO - sync_re - State: {'lambda': 0.6, 'atmdirection': -1.0, 'atmintermediate': 0.0, 'lambda1': 0.3, 'lambda2': 0.5, 'alpha': Quantity(value=0.1, unit=mole/kilocalorie), 'u0': Quantity(value=110.0, unit=kilocalorie/mole), 'w0': Quantity(value=0.0, unit=kilocalorie/mole), 'temperature': Quantity(value=300.0, unit=kelvin)}
2023-05-21 12:44:43 - INFO - sync_re - State: {'lambda': 0.65, 'atmdirection': -1.0, 'atmintermediate': 0.0, 'lambda1': 0.2, 'lambda2': 0.5, 'alpha': Quantity(value=0.1, unit=mole/kilocalorie), 'u0': Quantity(value=110.0, unit=kilocalorie/mole), 'w0': Quantity(value=0.0, unit=kilocalorie/mole), 'temperature': Quantity(value=300.0, unit=kelvin)}
2023-05-21 12:44:43 - INFO - sync_re - State: {'lambda': 0.7, 'atmdirection': -1.0, 'atmintermediate': 0.0, 'lambda1': 0.1, 'lambda2': 0.5, 'alpha': Quantity(value=0.1, unit=mole/kilocalorie), 'u0': Quantity(value=110.0, unit=kilocalorie/mole), 'w0': Quantity(value=0.0, unit=kilocalorie/mole), 'temperature': Quantity(value=300.0, unit=kelvin)}
2023-05-21 12:44:43 - INFO - sync_re - State: {'lambda': 0.75, 'atmdirection': -1.0, 'atmintermediate': 0.0, 'lambda1': 0.0, 'lambda2': 0.5, 'alpha': Quantity(value=0.1, unit=mole/kilocalorie), 'u0': Quantity(value=110.0, unit=kilocalorie/mole), 'w0': Quantity(value=0.0, unit=kilocalorie/mole), 'temperature': Quantity(value=300.0, unit=kelvin)}
2023-05-21 12:44:43 - INFO - sync_re - State: {'lambda': 0.8, 'atmdirection': -1.0, 'atmintermediate': 0.0, 'lambda1': 0.0, 'lambda2': 0.4, 'alpha': Quantity(value=0.1, unit=mole/kilocalorie), 'u0': Quantity(value=110.0, unit=kilocalorie/mole), 'w0': Quantity(value=0.0, unit=kilocalorie/mole), 'temperature': Quantity(value=300.0, unit=kelvin)}
2023-05-21 12:44:43 - INFO - sync_re - State: {'lambda': 0.85, 'atmdirection': -1.0, 'atmintermediate': 0.0, 'lambda1': 0.0, 'lambda2': 0.3, 'alpha': Quantity(value=0.1, unit=mole/kilocalorie), 'u0': Quantity(value=110.0, unit=kilocalorie/mole), 'w0': Quantity(value=0.0, unit=kilocalorie/mole), 'temperature': Quantity(value=300.0, unit=kelvin)}
2023-05-21 12:44:43 - INFO - sync_re - State: {'lambda': 0.9, 'atmdirection': -1.0, 'atmintermediate': 0.0, 'lambda1': 0.0, 'lambda2': 0.2, 'alpha': Quantity(value=0.1, unit=mole/kilocalorie), 'u0': Quantity(value=110.0, unit=kilocalorie/mole), 'w0': Quantity(value=0.0, unit=kilocalorie/mole), 'temperature': Quantity(value=300.0, unit=kelvin)}
2023-05-21 12:44:43 - INFO - sync_re - State: {'lambda': 0.95, 'atmdirection': -1.0, 'atmintermediate': 0.0, 'lambda1': 0.0, 'lambda2': 0.1, 'alpha': Quantity(value=0.1, unit=mole/kilocalorie), 'u0': Quantity(value=110.0, unit=kilocalorie/mole), 'w0': Quantity(value=0.0, unit=kilocalorie/mole), 'temperature': Quantity(value=300.0, unit=kelvin)}
2023-05-21 12:44:43 - INFO - sync_re - State: {'lambda': 1.0, 'atmdirection': -1.0, 'atmintermediate': 0.0, 'lambda1': 0.0, 'lambda2': 0.0, 'alpha': Quantity(value=0.1, unit=mole/kilocalorie), 'u0': Quantity(value=110.0, unit=kilocalorie/mole), 'w0': Quantity(value=0.0, unit=kilocalorie/mole), 'temperature': Quantity(value=300.0, unit=kelvin)}
2023-05-21 12:44:43 - INFO - sync_re - Started: ATM setup
2023-05-21 12:44:43 - INFO - sync_re - Started: create system
warning: AddRestraintForce() is deprecated. Use addVsiteRestraintForceCMCM()
warning: AddRestraintForce() is deprecated. Use addVsiteRestraintForceCMCM()
2023-05-21 12:44:46 - INFO - sync_re - Running with a 4.000000 fs time-step with bonded forces integrated 4 times per time-step
2023-05-21 12:44:46 - INFO - sync_re - Finished: create system (duration: 3.515999999999849 s)
2023-05-21 12:44:46 - INFO - sync_re - Started: create worker
2023-05-21 12:44:46 - INFO - sync_re - Device: CUDA 0
2023-05-21 12:45:24 - INFO - sync_re - Finished: create worker (duration: 37.702999999999975 s)
2023-05-21 12:45:24 - INFO - sync_re - Started: create replicas
2023-05-21 12:45:24 - INFO - sync_re - Loading checkpointfile r0/p38_m2z_maa_ckpt.xml
2023-05-21 12:45:27 - INFO - sync_re - Loading checkpointfile r1/p38_m2z_maa_ckpt.xml
2023-05-21 12:45:29 - INFO - sync_re - Loading checkpointfile r2/p38_m2z_maa_ckpt.xml
2023-05-21 12:45:32 - INFO - sync_re - Loading checkpointfile r3/p38_m2z_maa_ckpt.xml
2023-05-21 12:45:34 - INFO - sync_re - Loading checkpointfile r4/p38_m2z_maa_ckpt.xml
2023-05-21 12:45:37 - INFO - sync_re - Loading checkpointfile r5/p38_m2z_maa_ckpt.xml
2023-05-21 12:45:39 - INFO - sync_re - Loading checkpointfile r6/p38_m2z_maa_ckpt.xml
2023-05-21 12:45:42 - INFO - sync_re - Loading checkpointfile r7/p38_m2z_maa_ckpt.xml
2023-05-21 12:45:44 - INFO - sync_re - Loading checkpointfile r8/p38_m2z_maa_ckpt.xml
2023-05-21 12:45:47 - INFO - sync_re - Loading checkpointfile r9/p38_m2z_maa_ckpt.xml
2023-05-21 12:45:49 - INFO - sync_re - Loading checkpointfile r10/p38_m2z_maa_ckpt.xml
2023-05-21 12:45:52 - INFO - sync_re - Loading checkpointfile r11/p38_m2z_maa_ckpt.xml
2023-05-21 12:45:54 - INFO - sync_re - Loading checkpointfile r12/p38_m2z_maa_ckpt.xml
2023-05-21 12:45:57 - INFO - sync_re - Loading checkpointfile r13/p38_m2z_maa_ckpt.xml
2023-05-21 12:45:59 - INFO - sync_re - Loading checkpointfile r14/p38_m2z_maa_ckpt.xml
2023-05-21 12:46:02 - INFO - sync_re - Loading checkpointfile r15/p38_m2z_maa_ckpt.xml
2023-05-21 12:46:04 - INFO - sync_re - Loading checkpointfile r16/p38_m2z_maa_ckpt.xml
2023-05-21 12:46:07 - INFO - sync_re - Loading checkpointfile r17/p38_m2z_maa_ckpt.xml
2023-05-21 12:46:09 - INFO - sync_re - Loading checkpointfile r18/p38_m2z_maa_ckpt.xml
2023-05-21 12:46:12 - INFO - sync_re - Loading checkpointfile r19/p38_m2z_maa_ckpt.xml
2023-05-21 12:46:14 - INFO - sync_re - Loading checkpointfile r20/p38_m2z_maa_ckpt.xml
2023-05-21 12:46:17 - INFO - sync_re - Loading checkpointfile r21/p38_m2z_maa_ckpt.xml
2023-05-21 12:46:19 - INFO - sync_re - Replica 0: cycle 211, state 5
2023-05-21 12:46:19 - INFO - sync_re - Replica 1: cycle 211, state 2
2023-05-21 12:46:19 - INFO - sync_re - Replica 2: cycle 211, state 3
2023-05-21 12:46:19 - INFO - sync_re - Replica 3: cycle 211, state 11
2023-05-21 12:46:19 - INFO - sync_re - Replica 4: cycle 211, state 1
2023-05-21 12:46:19 - INFO - sync_re - Replica 5: cycle 211, state 6
2023-05-21 12:46:19 - INFO - sync_re - Replica 6: cycle 211, state 1
2023-05-21 12:46:19 - INFO - sync_re - Replica 7: cycle 211, state 10
2023-05-21 12:46:19 - INFO - sync_re - Replica 8: cycle 211, state 4
2023-05-21 12:46:19 - INFO - sync_re - Replica 9: cycle 211, state 8
2023-05-21 12:46:19 - INFO - sync_re - Replica 10: cycle 211, state 7
2023-05-21 12:46:19 - INFO - sync_re - Replica 11: cycle 211, state 20
2023-05-21 12:46:19 - INFO - sync_re - Replica 12: cycle 211, state 12
2023-05-21 12:46:19 - INFO - sync_re - Replica 13: cycle 211, state 14
2023-05-21 12:46:19 - INFO - sync_re - Replica 14: cycle 211, state 9
2023-05-21 12:46:19 - INFO - sync_re - Replica 15: cycle 211, state 13
2023-05-21 12:46:19 - INFO - sync_re - Replica 16: cycle 211, state 19
2023-05-21 12:46:19 - INFO - sync_re - Replica 17: cycle 211, state 15
2023-05-21 12:46:19 - INFO - sync_re - Replica 18: cycle 211, state 17
2023-05-21 12:46:19 - INFO - sync_re - Replica 19: cycle 211, state 16
2023-05-21 12:46:19 - INFO - sync_re - Replica 20: cycle 211, state 18
2023-05-21 12:46:19 - INFO - sync_re - Replica 21: cycle 211, state 21
2023-05-21 12:46:19 - INFO - sync_re - Finished: create replicas (duration: 55.406000000000176 s)
2023-05-21 12:46:19 - INFO - sync_re - Started: update replicas
2023-05-21 12:46:28 - INFO - sync_re - Finished: update replicas (duration: 9.1099999999999 s)
2023-05-21 12:46:28 - INFO - sync_re - Finished: ATM setup (duration: 105.7349999999999 s)
2023-05-21 12:46:28 - INFO - sync_re - Started: ATM simulations
2023-05-21 12:46:28 - INFO - sync_re - Additional number of samples: 70
2023-05-21 12:46:28 - INFO - sync_re - Started: sample 211
2023-05-21 12:46:28 - INFO - sync_re - Started: sample 211, replica 0
2023-05-21 12:47:05 - INFO - sync_re - Finished: sample 211, replica 0 (duration: 36.125 s)
2023-05-21 12:47:05 - INFO - sync_re - Started: sample 211, replica 1
2023-05-21 12:47:35 - INFO - sync_re - Finished: sample 211, replica 1 (duration: 30.8900000000001 s)
2023-05-21 12:47:35 - INFO - sync_re - Started: sample 211, replica 2
2023-05-21 12:48:07 - INFO - sync_re - Finished: sample 211, replica 2 (duration: 31.9849999999999 s)
2023-05-21 12:48:07 - INFO - sync_re - Started: sample 211, replica 3
2023-05-21 12:48:38 - INFO - sync_re - Finished: sample 211, replica 3 (duration: 30.54600000000005 s)
2023-05-21 12:48:38 - INFO - sync_re - Started: sample 211, replica 4
2023-05-21 12:49:09 - INFO - sync_re - Finished: sample 211, replica 4 (duration: 31.297000000000025 s)
2023-05-21 12:49:09 - INFO - sync_re - Started: sample 211, replica 5
2023-05-21 12:49:41 - INFO - sync_re - Finished: sample 211, replica 5 (duration: 31.375 s)
2023-05-21 12:49:41 - INFO - sync_re - Started: sample 211, replica 6
2023-05-21 12:50:11 - INFO - sync_re - Finished: sample 211, replica 6 (duration: 30.672000000000025 s)
2023-05-21 12:50:11 - INFO - sync_re - Started: sample 211, replica 7
2023-05-21 12:50:43 - INFO - sync_re - Finished: sample 211, replica 7 (duration: 31.6099999999999 s)
2023-05-21 12:50:43 - INFO - sync_re - Started: sample 211, replica 8
2023-05-21 12:51:14 - INFO - sync_re - Finished: sample 211, replica 8 (duration: 30.843000000000075 s)
2023-05-21 12:51:14 - INFO - sync_re - Started: sample 211, replica 9
2023-05-21 12:51:42 - INFO - sync_re - Finished: sample 211, replica 9 (duration: 28.672000000000025 s)
2023-05-21 12:51:42 - INFO - sync_re - Started: sample 211, replica 10
2023-05-21 12:52:13 - INFO - sync_re - Finished: sample 211, replica 10 (duration: 30.5 s)
2023-05-21 12:52:13 - INFO - sync_re - Started: sample 211, replica 11
2023-05-21 12:52:43 - INFO - sync_re - Finished: sample 211, replica 11 (duration: 30.312999999999874 s)
2023-05-21 12:52:43 - INFO - sync_re - Started: sample 211, replica 12
2023-05-21 12:53:15 - INFO - sync_re - Finished: sample 211, replica 12 (duration: 31.797000000000025 s)
2023-05-21 12:53:15 - INFO - sync_re - Started: sample 211, replica 13
2023-05-21 12:53:48 - INFO - sync_re - Finished: sample 211, replica 13 (duration: 32.5 s)
2023-05-21 12:53:48 - INFO - sync_re - Started: sample 211, replica 14
2023-05-21 12:54:20 - INFO - sync_re - Finished: sample 211, replica 14 (duration: 32.53099999999995 s)
2023-05-21 12:54:20 - INFO - sync_re - Started: sample 211, replica 15
2023-05-21 12:54:52 - INFO - sync_re - Finished: sample 211, replica 15 (duration: 32.34400000000005 s)
2023-05-21 12:54:52 - INFO - sync_re - Started: sample 211, replica 16
2023-05-21 12:55:24 - INFO - sync_re - Finished: sample 211, replica 16 (duration: 31.75 s)
2023-05-21 12:55:24 - INFO - sync_re - Started: sample 211, replica 17
2023-05-21 12:55:56 - INFO - sync_re - Finished: sample 211, replica 17 (duration: 31.875 s)
2023-05-21 12:55:56 - INFO - sync_re - Started: sample 211, replica 18
2023-05-21 12:56:29 - INFO - sync_re - Finished: sample 211, replica 18 (duration: 32.4849999999999 s)
2023-05-21 12:56:29 - INFO - sync_re - Started: sample 211, replica 19
2023-05-21 12:56:59 - INFO - sync_re - Finished: sample 211, replica 19 (duration: 30.077999999999975 s)
2023-05-21 12:56:59 - INFO - sync_re - Started: sample 211, replica 20
2023-05-21 12:57:30 - INFO - sync_re - Finished: sample 211, replica 20 (duration: 31.2650000000001 s)
2023-05-21 12:57:30 - INFO - sync_re - Started: sample 211, replica 21
2023-05-21 12:58:00 - INFO - sync_re - Finished: sample 211, replica 21 (duration: 30.593999999999824 s)
2023-05-21 12:58:00 - INFO - sync_re - Started: exchange replicas
2023-05-21 12:58:00 - INFO - sync_re - Replica 18: 17 --> 18
2023-05-21 12:58:00 - INFO - sync_re - Replica 20: 18 --> 17
2023-05-21 12:58:00 - INFO - sync_re - Finished: exchange replicas (duration: 0.047000000000025466 s)
2023-05-21 12:58:00 - INFO - sync_re - Started: update replicas
2023-05-21 12:58:09 - INFO - sync_re - Finished: update replicas (duration: 8.812000000000126 s)
2023-05-21 12:58:09 - INFO - sync_re - Started: write replicas samples and trajectories
2023-05-21 12:58:09 - INFO - sync_re - Finished: write replicas samples and trajectories (duration: 0.015999999999849024 s)
2023-05-21 12:58:09 - INFO - sync_re - Started: checkpointing
2023-05-21 12:58:59 - INFO - sync_re - Finished: checkpointing (duration: 50.031000000000176 s)
2023-05-21 12:58:59 - INFO - sync_re - Finished: sample 211 (duration: 750.9680000000001 s)
2023-05-21 12:58:59 - INFO - sync_re - Started: sample 212
2023-05-21 12:58:59 - INFO - sync_re - Started: sample 212, replica 0
2023-05-21 12:59:30 - INFO - sync_re - Finished: sample 212, replica 0 (duration: 30.687999999999874 s)
2023-05-21 12:59:30 - INFO - sync_re - Started: sample 212, replica 1
2023-05-21 13:00:00 - INFO - sync_re - Finished: sample 212, replica 1 (duration: 30.25 s)
2023-05-21 13:00:00 - INFO - sync_re - Started: sample 212, replica 2
2023-05-21 13:00:31 - INFO - sync_re - Finished: sample 212, replica 2 (duration: 30.797000000000025 s)
2023-05-21 13:00:31 - INFO - sync_re - Started: sample 212, replica 3
2023-05-21 13:01:02 - INFO - sync_re - Finished: sample 212, replica 3 (duration: 30.467999999999847 s)
2023-05-21 13:01:02 - INFO - sync_re - Started: sample 212, replica 4
2023-05-21 13:01:31 - INFO - sync_re - Finished: sample 212, replica 4 (duration: 29.71900000000005 s)
2023-05-21 13:01:31 - INFO - sync_re - Started: sample 212, replica 5
2023-05-21 13:02:02 - INFO - sync_re - Finished: sample 212, replica 5 (duration: 30.90599999999995 s)
2023-05-21 13:02:02 - INFO - sync_re - Started: sample 212, replica 6
2023-05-21 13:02:32 - INFO - sync_re - Finished: sample 212, replica 6 (duration: 29.96900000000005 s)
2023-05-21 13:02:32 - INFO - sync_re - Started: sample 212, replica 7
2023-05-21 13:03:03 - INFO - sync_re - Finished: sample 212, replica 7 (duration: 30.391000000000076 s)
2023-05-21 13:03:03 - INFO - sync_re - Started: sample 212, replica 8
2023-05-21 13:03:33 - INFO - sync_re - Finished: sample 212, replica 8 (duration: 30.34400000000005 s)
2023-05-21 13:03:33 - INFO - sync_re - Started: sample 212, replica 9
2023-05-21 13:04:04 - INFO - sync_re - Finished: sample 212, replica 9 (duration: 30.79599999999982 s)
2023-05-21 13:04:04 - INFO - sync_re - Started: sample 212, replica 10
2023-05-21 13:04:34 - INFO - sync_re - Finished: sample 212, replica 10 (duration: 30.375 s)
2023-05-21 13:04:34 - INFO - sync_re - Started: sample 212, replica 11
2023-05-21 13:05:04 - INFO - sync_re - Finished: sample 212, replica 11 (duration: 30.063000000000102 s)
2023-05-21 13:05:04 - INFO - sync_re - Started: sample 212, replica 12
2023-05-21 13:05:34 - INFO - sync_re - Finished: sample 212, replica 12 (duration: 29.827999999999975 s)
2023-05-21 13:05:34 - INFO - sync_re - Started: sample 212, replica 13
2023-05-21 13:06:05 - INFO - sync_re - Finished: sample 212, replica 13 (duration: 30.952999999999975 s)
2023-05-21 13:06:05 - INFO - sync_re - Started: sample 212, replica 14
2023-05-21 13:06:35 - INFO - sync_re - Finished: sample 212, replica 14 (duration: 30.264999999999873 s)
2023-05-21 13:06:35 - INFO - sync_re - Started: sample 212, replica 15
2023-05-21 13:07:04 - INFO - sync_re - Finished: sample 212, replica 15 (duration: 28.563000000000102 s)
2023-05-21 13:07:04 - INFO - sync_re - Started: sample 212, replica 16
2023-05-21 13:07:16 - INFO - sync_re - Finished: sample 212, replica 16 (duration: 12.0 s)
2023-05-21 13:07:16 - INFO - sync_re - Started: sample 212, replica 17
2023-05-21 13:07:27 - INFO - sync_re - Finished: sample 212, replica 17 (duration: 11.530999999999949 s)
2023-05-21 13:07:27 - INFO - sync_re - Started: sample 212, replica 18
2023-05-21 13:07:39 - INFO - sync_re - Finished: sample 212, replica 18 (duration: 11.938000000000102 s)
2023-05-21 13:07:39 - INFO - sync_re - Started: sample 212, replica 19
2023-05-21 13:07:51 - INFO - sync_re - Finished: sample 212, replica 19 (duration: 11.5 s)
2023-05-21 13:07:51 - INFO - sync_re - Started: sample 212, replica 20
2023-05-21 13:08:03 - INFO - sync_re - Finished: sample 212, replica 20 (duration: 11.967999999999847 s)
2023-05-21 13:08:03 - INFO - sync_re - Started: sample 212, replica 21
2023-05-21 13:08:14 - INFO - sync_re - Finished: sample 212, replica 21 (duration: 11.672000000000025 s)
2023-05-21 13:08:14 - INFO - sync_re - Started: exchange replicas
2023-05-21 13:08:14 - INFO - sync_re - Replica 4: 1 --> 1
2023-05-21 13:08:14 - INFO - sync_re - Replica 6: 1 --> 1
2023-05-21 13:08:14 - INFO - sync_re - Replica 7: 10 --> 11
2023-05-21 13:08:14 - INFO - sync_re - Replica 3: 11 --> 10
2023-05-21 13:08:14 - INFO - sync_re - Finished: exchange replicas (duration: 0.06300000000010186 s)
2023-05-21 13:08:14 - INFO - sync_re - Started: update replicas
2023-05-21 13:08:23 - INFO - sync_re - Finished: update replicas (duration: 8.827999999999975 s)
2023-05-21 13:08:23 - INFO - sync_re - Started: write replicas samples and trajectories
2023-05-21 13:08:23 - INFO - sync_re - Finished: write replicas samples and trajectories (duration: 0.0 s)
2023-05-21 13:08:23 - INFO - sync_re - Started: checkpointing
2023-05-21 13:09:12 - INFO - sync_re - Finished: checkpointing (duration: 49.266000000000076 s)
2023-05-21 13:09:13 - INFO - sync_re - Finished: sample 212 (duration: 613.1569999999999 s)
2023-05-21 13:09:13 - INFO - sync_re - Started: sample 213
2023-05-21 13:09:13 - INFO - sync_re - Started: sample 213, replica 0
2023-05-21 13:09:59 - INFO - sync_re - Finished: sample 213, replica 0 (duration: 46.4369999999999 s)
2023-05-21 13:09:59 - INFO - sync_re - Started: sample 213, replica 1
2023-05-21 13:10:45 - INFO - sync_re - Finished: sample 213, replica 1 (duration: 45.733999999999924 s)
2023-05-21 13:10:45 - INFO - sync_re - Started: sample 213, replica 2
2023-05-21 13:11:33 - INFO - sync_re - Finished: sample 213, replica 2 (duration: 48.28200000000015 s)
2023-05-21 13:11:33 - INFO - sync_re - Started: sample 213, replica 3
2023-05-21 13:12:24 - INFO - sync_re - Finished: sample 213, replica 3 (duration: 51.3119999999999 s)
2023-05-21 13:12:24 - INFO - sync_re - Started: sample 213, replica 4
2023-05-21 13:13:16 - INFO - sync_re - Finished: sample 213, replica 4 (duration: 51.922000000000025 s)
2023-05-21 13:13:16 - INFO - sync_re - Started: sample 213, replica 5
2023-05-21 13:14:08 - INFO - sync_re - Finished: sample 213, replica 5 (duration: 51.375 s)
2023-05-21 13:14:08 - INFO - sync_re - Started: sample 213, replica 6
2023-05-21 13:14:59 - INFO - sync_re - Finished: sample 213, replica 6 (duration: 51.53099999999995 s)
2023-05-21 13:14:59 - INFO - sync_re - Started: sample 213, replica 7
2023-05-21 13:15:51 - INFO - sync_re - Finished: sample 213, replica 7 (duration: 51.8130000000001 s)
2023-05-21 13:15:51 - INFO - sync_re - Started: sample 213, replica 8
2023-05-21 13:16:42 - INFO - sync_re - Finished: sample 213, replica 8 (duration: 51.0619999999999 s)
2023-05-21 13:16:42 - INFO - sync_re - Started: sample 213, replica 9
2023-05-21 13:17:34 - INFO - sync_re - Finished: sample 213, replica 9 (duration: 51.98500000000013 s)
2023-05-21 13:17:34 - INFO - sync_re - Started: sample 213, replica 10
2023-05-21 13:18:26 - INFO - sync_re - Finished: sample 213, replica 10 (duration: 52.0 s)
2023-05-21 13:18:26 - INFO - sync_re - Started: sample 213, replica 11
2023-05-21 13:19:18 - INFO - sync_re - Finished: sample 213, replica 11 (duration: 52.28099999999995 s)
2023-05-21 13:19:18 - INFO - sync_re - Started: sample 213, replica 12
2023-05-21 13:20:10 - INFO - sync_re - Finished: sample 213, replica 12 (duration: 51.266000000000076 s)
2023-05-21 13:20:10 - INFO - sync_re - Started: sample 213, replica 13
2023-05-21 13:21:03 - INFO - sync_re - Finished: sample 213, replica 13 (duration: 53.233999999999924 s)
2023-05-21 13:21:03 - INFO - sync_re - Started: sample 213, replica 14
2023-05-21 13:21:55 - INFO - sync_re - Finished: sample 213, replica 14 (duration: 52.03099999999995 s)
2023-05-21 13:21:55 - INFO - sync_re - Started: sample 213, replica 15
2023-05-21 13:22:49 - INFO - sync_re - Finished: sample 213, replica 15 (duration: 53.75 s)
2023-05-21 13:22:49 - INFO - sync_re - Started: sample 213, replica 16
2023-05-21 13:23:42 - INFO - sync_re - Finished: sample 213, replica 16 (duration: 53.077999999999975 s)
2023-05-21 13:23:42 - INFO - sync_re - Started: sample 213, replica 17
2023-05-21 13:24:34 - INFO - sync_re - Finished: sample 213, replica 17 (duration: 52.327999999999975 s)
2023-05-21 13:24:34 - INFO - sync_re - Started: sample 213, replica 18
2023-05-21 13:25:27 - INFO - sync_re - Finished: sample 213, replica 18 (duration: 52.82900000000018 s)
2023-05-21 13:25:27 - INFO - sync_re - Started: sample 213, replica 19
2023-05-21 13:26:20 - INFO - sync_re - Finished: sample 213, replica 19 (duration: 52.92099999999982 s)
2023-05-21 13:26:20 - INFO - sync_re - Started: sample 213, replica 20
2023-05-21 13:27:13 - INFO - sync_re - Finished: sample 213, replica 20 (duration: 53.40700000000015 s)
2023-05-21 13:27:13 - INFO - sync_re - Started: sample 213, replica 21
2023-05-21 13:28:05 - INFO - sync_re - Finished: sample 213, replica 21 (duration: 52.28099999999995 s)
2023-05-21 13:28:05 - INFO - sync_re - Started: exchange replicas
2023-05-21 13:28:05 - INFO - sync_re - Finished: exchange replicas (duration: 0.047000000000025466 s)
2023-05-21 13:28:05 - INFO - sync_re - Started: update replicas
2023-05-21 13:28:14 - INFO - sync_re - Finished: update replicas (duration: 9.030999999999949 s)
2023-05-21 13:28:14 - INFO - sync_re - Started: write replicas samples and trajectories
2023-05-21 13:28:14 - INFO - sync_re - Finished: write replicas samples and trajectories (duration: 0.0 s)
2023-05-21 13:28:14 - INFO - sync_re - Started: checkpointing
2023-05-21 13:29:04 - INFO - sync_re - Finished: checkpointing (duration: 50.016000000000076 s)
2023-05-21 13:29:04 - INFO - sync_re - Finished: sample 213 (duration: 1191.953 s)
2023-05-21 13:29:04 - INFO - sync_re - Started: sample 214
2023-05-21 13:29:04 - INFO - sync_re - Started: sample 214, replica 0
2023-05-21 13:29:58 - INFO - sync_re - Finished: sample 214, replica 0 (duration: 53.28099999999995 s)
2023-05-21 13:29:58 - INFO - sync_re - Started: sample 214, replica 1
2023-05-21 13:30:50 - INFO - sync_re - Finished: sample 214, replica 1 (duration: 52.483999999999924 s)
2023-05-21 13:30:50 - INFO - sync_re - Started: sample 214, replica 2
2023-05-21 13:31:43 - INFO - sync_re - Finished: sample 214, replica 2 (duration: 53.172000000000025 s)
2023-05-21 13:31:43 - INFO - sync_re - Started: sample 214, replica 3
2023-05-21 13:32:37 - INFO - sync_re - Finished: sample 214, replica 3 (duration: 53.141000000000076 s)
2023-05-21 13:32:37 - INFO - sync_re - Started: sample 214, replica 4
2023-05-21 13:33:30 - INFO - sync_re - Finished: sample 214, replica 4 (duration: 53.63999999999987 s)
2023-05-21 13:33:30 - INFO - sync_re - Started: sample 214, replica 5
2023-05-21 13:34:22 - INFO - sync_re - Finished: sample 214, replica 5 (duration: 51.98500000000013 s)
2023-05-21 13:34:22 - INFO - sync_re - Started: sample 214, replica 6
2023-05-21 13:35:07 - INFO - sync_re - Finished: sample 214, replica 6 (duration: 44.85900000000038 s)
2023-05-21 13:35:07 - INFO - sync_re - Started: sample 214, replica 7
2023-05-21 13:35:51 - INFO - sync_re - Finished: sample 214, replica 7 (duration: 44.35999999999967 s)
2023-05-21 13:35:51 - INFO - sync_re - Started: sample 214, replica 8

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 126
Credit: 107,156,939
RAC: 28,034
Level
Cys
Scientific publications
watwatwatwatwatwat
Message 60466 - Posted: 21 May 2023 | 11:44:50 UTC

I can't run ATM stuff at the moment.
Suspending the project until this deprecation problem is fixed.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60468 - Posted: 21 May 2023 | 16:33:28 UTC

Need some more tasks for the coming week. RTS=0

I am able to run the ATM tasks with no issues except for the occasional NaN task that everybody gets.

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 126
Credit: 107,156,939
RAC: 28,034
Level
Cys
Scientific publications
watwatwatwatwatwat
Message 60469 - Posted: 21 May 2023 | 17:54:18 UTC - in response to Message 60468.

Need some more tasks for the coming week. RTS=0

I am able to run the ATM tasks with no issues except for the occasional NaN task that everybody gets.


All I get now is QUICO and that dies.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 467
Credit: 8,358,971,966
RAC: 9,198,359
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60551 - Posted: 26 Jun 2023 | 12:09:39 UTC

These units still crash on restart, but otherwise they seem to run fine:

https://www.gpugrid.net/result.php?resultid=33522811

Though on my older computer they error out:

https://www.gpugrid.net/results.php?hostid=607570

https://www.gpugrid.net/result.php?resultid=33523431

I wonder if it's just too old for these units, though it did complete an ACEMD 3 unit.

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 12,875,793
RAC: 0
Level
Pro
Scientific publications
wat
Message 60586 - Posted: 12 Jul 2023 | 2:30:12 UTC

When will ATM be unsuspended?

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 12,875,793
RAC: 0
Level
Pro
Scientific publications
wat
Message 60588 - Posted: 12 Jul 2023 | 14:42:53 UTC - in response to Message 60425.
Last modified: 12 Jul 2023 | 14:43:36 UTC

Good evening. On only one of my PCs, with Windows 11, an i7-13700KF and an RTX 2080 Ti, none of the GPUGRID ATMbeta tasks (CUDA 1121) can be processed. By now more than a hundred have ended after a few tens of seconds. Other tasks (for example, CUDA 1131 based ones) are processed on this PC without any problems. I have no idea what could be causing it, so I do not know how to fix it. Thanks in advance to anyone who can help me solve the problem.

Stderr output
<core_client_version>7.20.2</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 195 (0xc3)</message>
<stderr_txt>
04:36:16 (31676): wrapper (7.9.26016): starting
04:36:16 (31676): wrapper: running python.exe (bin/conda-unpack)
04:36:17 (31676): python.exe exited; CPU time 0.000000
04:36:17 (31676): wrapper: running Library/usr/bin/tar.exe (xjvf input.tar.bz2)
MCL1_m51_m45_0.xml
MCL1_m51_m45_asyncre.cntl
MCL1_m51_m45.inpcrd
MCL1_m51_m45.prmtop
run.bat
run.sh
04:36:18 (31676): Library/usr/bin/tar.exe exited; CPU time 0.000000
04:36:18 (31676): wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)
04:36:20 (31676): C:/Windows/system32/cmd.exe exited; CPU time 0.015625
04:36:20 (31676): app exit status: 0x1
04:36:20 (31676): called boinc_finish(195)
0 bytes in 0 Free Blocks.
530 bytes in 4 Normal Blocks.
1144 bytes in 1 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 0 bytes.
Total allocations: 481994 bytes.
Dumping objects ->
{3078527} normal block at 0x00000221DD3AE4C0, 64 bytes long.
Data: <PATH=C:\ProgramD> 50 41 54 48 3D 43 3A 5C 50 72 6F 67 72 61 6D 44
{3078506} normal block at 0x00000221DD2D1060, 241 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
..\api\boinc_api.cpp(309) : {3078503} normal block at 0x00000221DB70B460, 8 bytes long.
Data: < &#221;! > 00 00 1A DD 21 02 00 00
{3077864} normal block at 0x00000221DD2D11F0, 241 bytes long.
Data: <<project_prefere> 3C 70 72 6F 6A 65 63 74 5F 70 72 65 66 65 72 65
{3077239} normal block at 0x00000221DB70BE10, 8 bytes long.
Data: <pk=&#221;! > 70 6B 3D DD 21 02 00 00
..\zip\boinc_zip.cpp(122) : {281} normal block at 0x00000221DB70D7F0, 260 bytes long.
Data: < > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
{266} normal block at 0x00000221DB713020, 16 bytes long.
Data: <87q&#219;! > 38 37 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{265} normal block at 0x00000221DB7128A0, 16 bytes long.
Data: < 7q&#219;! > 10 37 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{264} normal block at 0x00000221DB712490, 16 bytes long.
Data: <&#232;6q&#219;! > E8 36 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{263} normal block at 0x00000221DB712850, 16 bytes long.
Data: <&#192;6q&#219;! > C0 36 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{262} normal block at 0x00000221DB7122B0, 16 bytes long.
Data: < 6q&#219;! > 98 36 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{261} normal block at 0x00000221DB712E40, 16 bytes long.
Data: <p6q&#219;! > 70 36 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{260} normal block at 0x00000221DB70C780, 32 bytes long.
Data: <CUDA_DEVICE=0 PU> 43 55 44 41 5F 44 45 56 49 43 45 3D 30 00 50 55
{259} normal block at 0x00000221DB712A80, 16 bytes long.
Data: <p q&#219;! > 70 10 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{258} normal block at 0x00000221DB711070, 40 bytes long.
Data: < *q&#219;! &#199;p&#219;! > 80 2A 71 DB 21 02 00 00 80 C7 70 DB 21 02 00 00
{257} normal block at 0x00000221DB712350, 16 bytes long.
Data: <P6q&#219;! > 50 36 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{256} normal block at 0x00000221DB712300, 16 bytes long.
Data: <(6q&#219;! > 28 36 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{255} normal block at 0x00000221DB70CCC0, 32 bytes long.
Data: <C:/Windows/syste> 43 3A 2F 57 69 6E 64 6F 77 73 2F 73 79 73 74 65
{254} normal block at 0x00000221DB712CB0, 16 bytes long.
Data: < 6q&#219;! > 00 36 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{253} normal block at 0x00000221DB70C060, 32 bytes long.
Data: <xjvf input.tar.b> 78 6A 76 66 20 69 6E 70 75 74 2E 74 61 72 2E 62
{252} normal block at 0x00000221DB712800, 16 bytes long.
Data: <H5q&#219;! > 48 35 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{251} normal block at 0x00000221DB712670, 16 bytes long.
Data: < 5q&#219;! > 20 35 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{250} normal block at 0x00000221DB712C10, 16 bytes long.
Data: <&#248;4q&#219;! > F8 34 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{249} normal block at 0x00000221DB713160, 16 bytes long.
Data: <&#208;4q&#219;! > D0 34 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{248} normal block at 0x00000221DB712F80, 16 bytes long.
Data: <&#168;4q&#219;! > A8 34 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{247} normal block at 0x00000221DB712620, 16 bytes long.
Data: < 4q&#219;! > 80 34 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{245} normal block at 0x00000221DB712F30, 16 bytes long.
Data: <0 q&#219;! > 30 12 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{244} normal block at 0x00000221DB711230, 40 bytes long.
Data: <0/q&#219;! &#192;&#228;:&#221;! > 30 2F 71 DB 21 02 00 00 C0 E4 3A DD 21 02 00 00
{243} normal block at 0x00000221DB712EE0, 16 bytes long.
Data: <`4q&#219;! > 60 34 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{242} normal block at 0x00000221DB712530, 16 bytes long.
Data: <84q&#219;! > 38 34 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{241} normal block at 0x00000221DB70CD80, 32 bytes long.
Data: <Library/usr/bin/> 4C 69 62 72 61 72 79 2F 75 73 72 2F 62 69 6E 2F
{240} normal block at 0x00000221DB712AD0, 16 bytes long.
Data: < 4q&#219;! > 10 34 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{239} normal block at 0x00000221DB70C8A0, 32 bytes long.
Data: <bin/conda-unpack> 62 69 6E 2F 63 6F 6E 64 61 2D 75 6E 70 61 63 6B
{238} normal block at 0x00000221DB712260, 16 bytes long.
Data: <X3q&#219;! > 58 33 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{237} normal block at 0x00000221DB7124E0, 16 bytes long.
Data: <03q&#219;! > 30 33 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{236} normal block at 0x00000221DB7125D0, 16 bytes long.
Data: < 3q&#219;! > 08 33 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{235} normal block at 0x00000221DB712E90, 16 bytes long.
Data: <&#224;2q&#219;! > E0 32 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{234} normal block at 0x00000221DB7127B0, 16 bytes long.
Data: <&#184;2q&#219;! > B8 32 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{233} normal block at 0x00000221DB7123F0, 16 bytes long.
Data: < 2q&#219;! > 90 32 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{232} normal block at 0x00000221DB713110, 16 bytes long.
Data: <p2q&#219;! > 70 32 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{231} normal block at 0x00000221DB712FD0, 16 bytes long.
Data: <H2q&#219;! > 48 32 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{230} normal block at 0x00000221DB7123A0, 16 bytes long.
Data: < 2q&#219;! > 20 32 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{229} normal block at 0x00000221DB713220, 1488 bytes long.
Data: <&#160;#q&#219;! python.e> A0 23 71 DB 21 02 00 00 70 79 74 68 6F 6E 2E 65
{93} normal block at 0x00000221DB70CC60, 32 bytes long.
Data: <windows_x86_64__> 77 69 6E 64 6F 77 73 5F 78 38 36 5F 36 34 5F 5F
{92} normal block at 0x00000221DB70BCD0, 16 bytes long.
Data: < q&#219;! > 00 10 71 DB 21 02 00 00 00 00 00 00 00 00 00 00
{91} normal block at 0x00000221DB711000, 40 bytes long.
Data: <&#208;&#188;p&#219;! `&#204;p&#219;! > D0 BC 70 DB 21 02 00 00 60 CC 70 DB 21 02 00 00
{70} normal block at 0x00000221DB70BEB0, 16 bytes long.
Data: < &#234;&#249;&#134;&#246; > 80 EA F9 86 F6 7F 00 00 00 00 00 00 00 00 00 00
{69} normal block at 0x00000221DB70B0A0, 16 bytes long.
Data: <@&#233;&#249;&#134;&#246; > 40 E9 F9 86 F6 7F 00 00 00 00 00 00 00 00 00 00
{68} normal block at 0x00000221DB70BC80, 16 bytes long.
Data: <&#248;W&#246;&#134;&#246; > F8 57 F6 86 F6 7F 00 00 00 00 00 00 00 00 00 00
{67} normal block at 0x00000221DB70B8C0, 16 bytes long.
Data: <&#216;W&#246;&#134;&#246; > D8 57 F6 86 F6 7F 00 00 00 00 00 00 00 00 00 00
{66} normal block at 0x00000221DB70BDC0, 16 bytes long.
Data: <P &#246;&#134;&#246; > 50 04 F6 86 F6 7F 00 00 00 00 00 00 00 00 00 00
{65} normal block at 0x00000221DB70BBE0, 16 bytes long.
Data: <0 &#246;&#134;&#246; > 30 04 F6 86 F6 7F 00 00 00 00 00 00 00 00 00 00
{64} normal block at 0x00000221DB70B6E0, 16 bytes long.
Data: <&#224; &#246;&#134;&#246; > E0 02 F6 86 F6 7F 00 00 00 00 00 00 00 00 00 00
{63} normal block at 0x00000221DB70B640, 16 bytes long.
Data: < &#246;&#134;&#246; > 10 04 F6 86 F6 7F 00 00 00 00 00 00 00 00 00 00
{62} normal block at 0x00000221DB70B5F0, 16 bytes long.
Data: <p &#246;&#134;&#246; > 70 04 F6 86 F6 7F 00 00 00 00 00 00 00 00 00 00
{61} normal block at 0x00000221DB70B870, 16 bytes long.
Data: < &#192;&#244;&#134;&#246; > 18 C0 F4 86 F6 7F 00 00 00 00 00 00 00 00 00 00
Object dump complete.
</stderr_txt>
]]>

I just had a theory that cmd could fail because both you and I had set the default command processor to Windows Terminal instead of Console Window Host.
Unfortunately I can't test it because there are no more ATM tasks.

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 12,875,793
RAC: 0
Level
Pro
Scientific publications
wat
Message 60589 - Posted: 14 Jul 2023 | 13:29:56 UTC

Didn't help.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 467
Credit: 8,358,971,966
RAC: 9,198,359
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60590 - Posted: 14 Jul 2023 | 20:48:33 UTC - in response to Message 60589.
Last modified: 14 Jul 2023 | 20:59:15 UTC

Didn't help.


It could be a hardware problem (processor, RAM, etc), not software.

I have 2 computers crunching here.

One is an Intel Core i7 with 32 GB of RAM, and it completes both ACEMDs and ATMbetas successfully.

https://www.gpugrid.net/results.php?hostid=608721

The other is an AMD Phenom II with 16 GB of RAM, and it completes ACEMDs successfully, while ATMbetas error out. (I can't put any more RAM on this motherboard.)

https://www.gpugrid.net/results.php?hostid=607570

They both have the same OS.

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 12,875,793
RAC: 0
Level
Pro
Scientific publications
wat
Message 60591 - Posted: 15 Jul 2023 | 14:03:28 UTC
Last modified: 15 Jul 2023 | 14:05:49 UTC

In my case it crashes instantly on Wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 467
Credit: 8,358,971,966
RAC: 9,198,359
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60597 - Posted: 15 Jul 2023 | 23:15:45 UTC

These units still crash when shut down and then restarted. The progress bar goes to 100% after a few minutes once you get to the subsequent units in the series. It looks like nothing has been updated.

roundup
Send message
Joined: 11 May 10
Posts: 57
Credit: 1,561,695,193
RAC: 7,902,359
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60598 - Posted: 17 Jul 2023 | 13:13:03 UTC - in response to Message 60597.

ATM Beta still crashes after 40 seconds on an RTX 4080.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60599 - Posted: 17 Jul 2023 | 18:07:48 UTC

After some time, today I resumed crunching ATM tasks.

However, I notice a strange behaviour of BOINC when trying to download a second task per GPU:

When pushing the "update" button, no second task will download (although plenty of them are available), and the event log says:

17.07.2023 20:00:19 | GPUGRID | Requesting new tasks for NVIDIA GPU
17.07.2023 20:00:21 | GPUGRID | Scheduler request completed: got 0 new tasks
17.07.2023 20:00:21 | GPUGRID | No tasks sent
17.07.2023 20:00:21 | GPUGRID | No tasks are available for ATM: Free energy calculations of protein-ligand binding
17.07.2023 20:00:21 | GPUGRID | Tasks won't finish in time: BOINC runs 96.7% of the time; computation is enabled 100.0% of that

I had downloaded hundreds of ATM tasks on this system before, and I could always download a second one, which stayed in "waiting" position until the first one finished.
I had never seen this kind of message before.

Can anyone tell me what's wrong, and what I can do in order to get a second task downloaded?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60600 - Posted: 17 Jul 2023 | 18:14:27 UTC - in response to Message 60599.

The estimated time to completion is too long, so BOINC thinks they won't finish by their listed 5-day deadline. That's why. You can try editing the DCF (duration correction factor) in the client state file manually, or just wait for it to adjust itself.
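For reference, a minimal sketch of where that value lives, assuming a stock BOINC install: the duration correction factor is stored per project in client_state.xml, inside the matching <project> block. Stop the client before editing, or it will overwrite the change; the 1.0 shown here is only an illustrative value, and lowering it shrinks the runtime estimates for that project's tasks.

<project>
    <master_url>https://www.gpugrid.net/</master_url>
    ...
    <duration_correction_factor>1.000000</duration_correction_factor>
    ...
</project>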
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60601 - Posted: 17 Jul 2023 | 18:51:04 UTC - in response to Message 60600.

The estimated time to completion is too long, so BOINC thinks they won't finish by their listed 5-day deadline. ...

that's what I suspected first (I had that before on another machine), then I took a look at the times, and surprise:
right now, a task has been running for 2:32 hrs, indicated completion time: 34:50 minutes(!).
So the problem must be somewhere else :-(

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60602 - Posted: 17 Jul 2023 | 19:34:26 UTC - in response to Message 60601.

The estimated time to completion is too long, so BOINC thinks they won't finish by their listed 5-day deadline. ...

that's what I suspected first (I had that before on another machine), then I took a look at the times, and surprise:
right now, a task has been running for 2:32 hrs, indicated completion time: 34:50 minutes(!).
So the problem must be somewhere else :-(


It has to do with the estimated completion time of the task it's trying to download plus the tasks you already have, not just the tasks you have.

A brand new task might say it will take 90hrs to finish. You have 34hrs remaining on your work. So it thinks it would be just over 5 days before the new task would finish, and it decides not to download any.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60603 - Posted: 17 Jul 2023 | 19:50:07 UTC - in response to Message 60602.

... you have 34hrs remaining on your work. ...

NOT 34hrs, but 34 minutes !

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60604 - Posted: 17 Jul 2023 | 19:59:25 UTC - in response to Message 60603.
Last modified: 17 Jul 2023 | 20:01:19 UTC

... you have 34hrs remaining on your work. ...

NOT 34hrs, but 34 minutes !


That's inconsequential; it was just an example. The point was that it depends mostly on the time estimate of the task to be downloaded, which could already be in excess of 5 days, and then you're in the same situation.

Several of mine show initial estimates like 200+ days.
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,098,952,024
RAC: 8,566,567
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60605 - Posted: 17 Jul 2023 | 20:16:02 UTC - in response to Message 60601.

right now, a task has been running for 2:32 hrs, indicated completion time: 34:50 minutes(!).
So the problem must be somewhere else :-(

When trying to download a second task, set the "Store at least X days of work" parameter in BOINC's local preferences as close as possible to (but slightly above) the remaining calculated time for the task in progress.
In your example, with about 34 minutes remaining, try setting the "Store at least X days of work" parameter to 0.03 days (about 43 minutes).
And set the "Store up to an additional X days of work" parameter to 0.00.
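Roughly the same effect can be had by editing the file those local preferences are written to, global_prefs_override.xml; the client re-reads it on restart (or via the Manager's "Read local prefs file" option, where available). A minimal sketch with the values from the example above, where 0.03 / 0.00 are only illustrative:

<global_preferences>
    <work_buf_min_days>0.03</work_buf_min_days>
    <work_buf_additional_days>0.00</work_buf_additional_days>
</global_preferences>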

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60606 - Posted: 18 Jul 2023 | 14:31:30 UTC

@ ServicEnginIC, thanks for your hints.
However, I finally did not need to do anything: all of a sudden, two tasks per GPU got downlaoded :-)

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60610 - Posted: 20 Jul 2023 | 13:26:42 UTC

Sorry for being away for a while. We were testing ATM in a setup not available on GPUGRID. But we're back to crunching :)

I've seen that more or less everything is running fine. Apart from some crashes that can happen, everything seems to come back to me fine.

Is there anything specific I should look into? I already know about the progress reporting issue (if it persists), but there's not much more I can do about it on my end. If they plan to update the GPUGRID app at some point, I'll insist.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 467
Credit: 8,358,971,966
RAC: 9,198,359
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60611 - Posted: 20 Jul 2023 | 13:53:53 UTC - in response to Message 60610.

Sorry for being away for a while. We were testing ATM in a setup not available on GPUGRID. But we're back to crunching :)

I've seen that more or less everything is running fine. Apart from some crashes that can happen, everything seems to come back to me fine.

Is there anything specific I should look into? I already know about the progress reporting issue (if it persists), but there's not much more I can do about it on my end. If they plan to update the GPUGRID app at some point, I'll insist.


I have a question regarding the minimum hardware requirements (i.e. amount, speed and type of RAM, CPU speed and type, motherboard speed and requirements, etc.) for a computer to be able to complete these units successfully, on either Windows or Linux.

One of my computers has been running these units successfully, the other has not. They both have the same OS, but different hardware. I just want to know the limits.




Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60612 - Posted: 20 Jul 2023 | 14:18:38 UTC - in response to Message 60611.

Sorry for being away for a while. We were testing ATM in a setup not available on GPUGRID. But we're back to crunching :)

I've seen that more or less everything is running fine. Apart from some crashes that can happen, everything seems to come back to me fine.

Is there anything specific I should look into? I already know about the progress reporting issue (if it persists), but there's not much more I can do about it on my end. If they plan to update the GPUGRID app at some point, I'll insist.


I have a question regarding the minimum hardware requirements (i.e. amount, speed and type of RAM, CPU speed and type, motherboard speed and requirements, etc.) for a computer to be able to complete these units successfully, on either Windows or Linux.

One of my computers has been running these units successfully, the other has not. They both have the same OS, but different hardware. I just want to know the limits.






I'm not sure I'm the most adequate person to answer this question, but I'll try my best. AFAIK it should run anywhere; maybe the issue is more driver related? We recently tested on 40-series GPUs locally and it ran fine, since I saw some comments in the thread.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60613 - Posted: 20 Jul 2023 | 16:28:47 UTC

this kind of error

tar: run.log: file changed as we read it
tar: r*/*.xml: Cannot stat: No such file or directory
tar: Exiting with failure status due to previous errors


has happened quite often lately.
Quico, anything you can do about it?

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 467
Credit: 8,358,971,966
RAC: 9,198,359
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60614 - Posted: 20 Jul 2023 | 16:43:47 UTC - in response to Message 60612.

Sorry for being away for a while. We were testing ATM in a setup not available on GPUGRID. But we're back to crunching :)

I've seen that more or less everything is running fine. Apart from some crashes that can happen, everything seems to come back to me fine.

Is there anything specific I should look into? I already know about the progress reporting issue (if it persists), but there's not much more I can do about it on my end. If they plan to update the GPUGRID app at some point, I'll insist.


I have a question regarding the minimum hardware requirements (i.e. amount, speed and type of RAM, CPU speed and type, motherboard speed and requirements, etc.) for a computer to be able to complete these units successfully, on either Windows or Linux.

One of my computers has been running these units successfully, the other has not. They both have the same OS, but different hardware. I just want to know the limits.






I'm not sure I'm the most adequate person to answer this question, but I'll try my best. AFAIK it should run anywhere; maybe the issue is more driver related? We recently tested on 40-series GPUs locally and it ran fine, since I saw some comments in the thread.



Both computers are running the same driver, and both have the same type of video card, an RTX 2080 Ti.

Here is the portion of the log from the computer that has the errors:

Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /var/lib/boinc-client/slots/0/tmp/pip-req-build-9y8_6t1d
Running command git rev-parse -q --verify 'sha^d7931b9a6217232d481731f7589d64b100a514ac'
Running command git fetch -q https://github.com/raimis/AToM-OpenMM.git d7931b9a6217232d481731f7589d64b100a514ac
Running command git checkout -q d7931b9a6217232d481731f7589d64b100a514ac
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: -4
╰─> [0 lines of output]
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
15:34:05 (42979): bin/bash exited; CPU time 3.604100
15:34:05 (42979): app exit status: 0x1
15:34:05 (42979): called boinc_finish(195)

</stderr_txt>

https://www.gpugrid.net/result.php?resultid=33535521

Would this be a software or hardware problem?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60615 - Posted: 20 Jul 2023 | 17:00:39 UTC - in response to Message 60610.
Last modified: 20 Jul 2023 | 17:09:15 UTC

Sorry for being away for a while. We were testing ATM in a setup not available on GPUGRID. But we're back to crunching :)

I've seen that more or less everything is running fine. Apart from some crashes that can happen, everything seems to come back to me fine.

Is there anything specific I should look into? I already know about the progress reporting issue (if it persists), but there's not much more I can do about it on my end. If they plan to update the GPUGRID app at some point, I'll insist.


Several long-standing and well-discussed issues are still unresolved with these tasks. In decreasing priority:

1. Task checkpointing still does not work properly. It may be writing to the checkpoint file, but it never resumes from the checkpoint. Pausing or suspending work units for any reason will cause them to error out when they attempt to resume. This is an issue for anyone who runs multiple projects (BOINC will occasionally pause in-progress units to crunch other projects) or needs to shut down their computer for updates or whatever.

2. Runtime progress reporting ONLY works for the first-batch "0-5" labelled tasks. Anything from "1-5" through "4-5" does not work properly; they jump immediately to 100% and stay there until complete. This makes it hard to know how long they will run.

3. The estimated flops setting on these tasks is probably way too high, leading to crazy high runtime estimates. This could likely cause indirect issues with the BOINC client either not fetching work properly or not managing other projects properly.

4. Many batches are occasionally being sent out malformed, leading to errors. Most seem to be due to incorrect formatting or naming, stuff like this:
"+ tar cjvf restart.tar.bz2 'r*/*.xml'
tar: r*/*.xml: Cannot stat: No such file or directory"

These are things I've seen constant complaints about every time these tasks come back.

I would highly recommend that you guys attach a computer to the project like a normal user so that you can experience these tasks first-hand and properly troubleshoot them.
____________

bluestang
Send message
Joined: 13 Apr 15
Posts: 10
Credit: 2,542,462,606
RAC: 0
Level
Phe
Scientific publications
wat
Message 60616 - Posted: 20 Jul 2023 | 18:39:57 UTC

Yes, the constant WUs throwing errors! Luckily most at the beginning, but some run for quite some time before erroring out. More Errors than Valid is a huge waste of resources and time for everyone.

Ian&Steve says it well!

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60617 - Posted: 21 Jul 2023 | 7:48:12 UTC - in response to Message 60610.

Sorry for being away for a while. We were testing ATM in a setup not available on GPUGRID. But we're back to crunching :)

I've seen that more or less everything is running fine. Apart from some crashes that can happen, everything seems to come back to me fine.

Is there anything specific I should look into? I already know about the progress reporting issue (if it persists), but there's not much more I can do about it on my end. If they plan to update the GPUGRID app at some point, I'll insist.

________

Could you please make these tasks able to suspend? It's monsoon season in my part of the world, and every time it rains there is a power outage. Even though in Preferences I have set it to keep the WU in memory while on batteries, every time the power goes out the WU ends up with an error. Right now, 100% of the WUs that error out at my end do so for this reason.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60618 - Posted: 21 Jul 2023 | 8:39:49 UTC - in response to Message 60616.

Yes, the constant WUs throwing errors! Luckily most at the beginning, but some run for quite some time before erroring out. More Errors than Valid is a huge waste of resources and time for everyone.

Ian&Steve says it well!

Mentioning the "waste of resources": "ValueError: Energy is NaN." has happened again quite a lot in the recent past, mostly after between 1½ and 2 hours of runtime.
Given that electricity costs have tripled here since last year, such waste has become quite expensive :-(

bluestang
Send message
Joined: 13 Apr 15
Posts: 10
Credit: 2,542,462,606
RAC: 0
Level
Phe
Scientific publications
wat
Message 60619 - Posted: 26 Jul 2023 | 22:57:14 UTC

Another new batch...same old Errors.

Does Krembil have anything to do with this project lololol.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60620 - Posted: 27 Jul 2023 | 17:20:42 UTC

Valid 17, error 26. I know it makes no difference; plenty of computers are standing by and it will get done.

roundup
Send message
Joined: 11 May 10
Posts: 57
Credit: 1,561,695,193
RAC: 7,902,359
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60621 - Posted: 29 Jul 2023 | 10:21:38 UTC - in response to Message 60612.

AFAIK it should run anywhere; maybe the issue is more driver related? We recently tested on 40-series GPUs locally and it ran fine, since I saw some comments in the thread.


My driver for the RTX 4080 under Win11 is 536.23
All units error out after about 40 seconds.
I do not see this on a 2070S nor on a 3070 Laptop.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60622 - Posted: 29 Jul 2023 | 16:26:53 UTC - in response to Message 60621.

AFAIK it should run anywhere; maybe the issue is more driver related? We recently tested on 40-series GPUs locally and it ran fine, since I saw some comments in the thread.


My driver for the RTX 4080 under Win11 is 536.23
All units error out after about 40 seconds.
I do not see this on a 2070S nor on a 3070 Laptop.

It would be helpful if you unhid your computers so we could examine the output files to get a clue as to why the tasks are failing on your 40-series card.

roundup
Send message
Joined: 11 May 10
Posts: 57
Credit: 1,561,695,193
RAC: 7,902,359
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60623 - Posted: 29 Jul 2023 | 19:35:33 UTC - in response to Message 60622.
Last modified: 29 Jul 2023 | 19:45:12 UTC

It would be helpful if you unhid your computers so we could examine the output files to get a clue as to why the tasks are failing on your 40-series card.


Done.
I just ran 2 fresh WUs that errored out as usual.
Thank you very much.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60624 - Posted: 29 Jul 2023 | 19:38:57 UTC - in response to Message 60623.
Last modified: 29 Jul 2023 | 19:41:14 UTC


It would be helpful if you unhid your computers so we could examine the output files to get a clue as to why the tasks are failing on your 40-series card.


Done. Thank you very much.

Wasn't helpful. You don't have any result output at all. The tasks never even get to start the setup process. They just exit immediately.

Quico needs to reexamine his statement that the 40 series cards are working OK on the ATMbeta tasks.

[Edit]
I would reset the project to start with, in the hope that the task and app packages get downloaded again. Maybe the necessary Python environment never got set up correctly initially.

roundup
Send message
Joined: 11 May 10
Posts: 57
Credit: 1,561,695,193
RAC: 7,902,359
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60625 - Posted: 29 Jul 2023 | 19:52:58 UTC - in response to Message 60624.


I would reset the project to start with, in the hope that the task and app packages get downloaded again.

I reset the project and tried two new WUs. Same result - error after a few seconds.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60626 - Posted: 29 Jul 2023 | 21:30:00 UTC - in response to Message 60624.
Last modified: 29 Jul 2023 | 21:32:16 UTC

Quico needs to reexamine his statement that the 40 series cards are working OK on the ATMbeta tasks.


They do. Look at the leaderboard: many 40-series hosts are returning valid work from both Linux and Windows.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60627 - Posted: 29 Jul 2023 | 21:40:38 UTC - in response to Message 60626.

OK, so 40 series works fine for both Windows and Linux.

So what would you recommend for this volunteer to do for troubleshooting when tasks don't report any useful information?

The most logical step of project reset was not fruitful.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60628 - Posted: 29 Jul 2023 | 22:51:39 UTC

Does the 4080 run other projects' GPU tasks without errors?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60629 - Posted: 30 Jul 2023 | 3:12:46 UTC - in response to Message 60627.

Maybe a problem with BOINC itself. Might try a different BOINC version.
____________

roundup
Send message
Joined: 11 May 10
Posts: 57
Credit: 1,561,695,193
RAC: 7,902,359
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60630 - Posted: 30 Jul 2023 | 5:45:30 UTC - in response to Message 60628.

Does the 4080 run other projects' GPU tasks without errors?

Yes, it does without errors. PrimeGrid, SRBase, Einstein and WCG OPNG.
BOINC was updated to 7.22.2 within this beta phase - same result.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60632 - Posted: 3 Aug 2023 | 5:25:52 UTC - in response to Message 60619.

Another new batch...same old Errors.

Does Krembil have anything to do with this project lololol.

forget Krembil - it's down most of the time. Too bad what happened to WCG :-(

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60633 - Posted: 3 Aug 2023 | 11:21:33 UTC

It does not make any difference to Quico. The task will be completed on one or another computer and his science is done. It is our very expensive energy that is wasted, but, as Quico himself said, the science gets done, so who cares about wasted energy?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60634 - Posted: 4 Aug 2023 | 17:50:37 UTC - in response to Message 60633.

You also have the option to crunch something else if your time is wasted here.
____________

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60636 - Posted: 6 Aug 2023 | 1:02:51 UTC - in response to Message 60634.

You also have the option to crunch something else if your time is wasted here.


I wish you would put a dirty sock where required. In Asia, the transmission of power is through overhead lines. They run red hot and expand in our heat. Many people used to die from electrocution, so they switch off the grid. If the WUs cannot handle a suspension then there is no need for such useless remarks. You also have the option of not running off with your writing skills.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60638 - Posted: 6 Aug 2023 | 12:16:04 UTC - in response to Message 60636.
Last modified: 6 Aug 2023 | 12:18:50 UTC

If it hurts when you <do that> then the most obvious solution is to not <do that>. This applies to most things in life.

It’s well known that these tasks do not like to be interrupted. If your power grid is that unstable then it’s probably best for you to crunch something else, or invest in a battery backup to keep the computer running during power outages.

These are still classified as Beta after all and that comes with the implication that things will not always work, and you need to accept whatever compromises that comes with it. If you don’t then your other solution could be to just disable Beta processing from your profile and wait for ACEMD3 work.
____________

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60651 - Posted: 16 Aug 2023 | 10:44:28 UTC

Ok, back from holidays.
I've seen that in the last batch of jobs many Energy NaN errors were happening, which was completely unexpected.

I am testing some different settings internally to see if that overcomes this issue. If it is successful, new jobs will be sent by tomorrow/Friday.

This might be more time consuming, and I would not like to split them into even more chunks (I have a suspicion that this gives wonky results at some point), but if people see that they take too much time/space please let me know.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 28 Feb 23
Posts: 34
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60652 - Posted: 16 Aug 2023 | 10:50:32 UTC - in response to Message 60633.

It does not make any difference to Quico. The task will be completed on one or another computer and his science is done. It is our very expensive energy that is wasted, but, as Quico himself said, the science gets done, so who cares about wasted energy?

I'm pretty sure I never said that about wasted energy. What I might have mentioned is that completed jobs come back to me, and since I don't check what happens to every WU manually, these crashes might go under my radar.

As Ian&Steve C. said, this app is in "beta"/not-ideal conditions. Sadly I don't have the knowledge to fix it, otherwise I would. Errors on my end can be that I forget to upload some files (has happened) or that I send jobs without equilibrated systems (also happened). By trial and error I ended up with a workflow that should avoid these issues 99% of the time. Any other kind of error I can pass on to the devs, but I can't promise much more than that.
I'm here testing the science.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,098,952,024
RAC: 8,566,567
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60653 - Posted: 16 Aug 2023 | 12:57:20 UTC - in response to Message 60651.

This might be more time consuming, and I would not like to split them into even more chunks (I have a suspicion that this gives wonky results at some point), but if people see that they take too much time/space please let me know.

I think that the most harmful risk is that excessively heavy tasks generate result files bigger than 512 MB in size.
GPUGRID server can't handle them, and they won't upload...

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60654 - Posted: 16 Aug 2023 | 18:07:55 UTC - in response to Message 60653.

Absolutely, this is the biggest risk. Shame that you can't get Gianni or Toni to reconfigure the website html upload size limit to 1GB.

bluestang
Send message
Joined: 13 Apr 15
Posts: 10
Credit: 2,542,462,606
RAC: 0
Level
Phe
Scientific publications
wat
Message 60655 - Posted: 17 Aug 2023 | 14:27:48 UTC

Can you at least fix the "daily quota" limit or whatever it is that prevents a machine from getting more WUs?

8/17/2023 10:25:12 AM | GPUGRID | This computer has finished a daily quota of 14 tasks


After all, it is your WUs that are erroring out by the hundreds and causing this "daily quota" to kick in. It seems this batch is even worse than before.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60656 - Posted: 17 Aug 2023 | 17:05:27 UTC - in response to Message 60655.

The tasks I received this morning are running fine, and have reached 92%. Any problems will be the result of the combination of their tasks and your computer. The quota is there to protect their science from your computer. Trying to increase the quota without sorting out the cause of the underlying problem would be counter-productive.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60657 - Posted: 17 Aug 2023 | 18:21:49 UTC

It seems Windows hosts in general have a lot more problems than Linux hosts.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60658 - Posted: 17 Aug 2023 | 18:46:06 UTC - in response to Message 60657.

It seems Windows hosts in general have a lot more problems than Linux hosts.

Depends what the error is. I looked at host 553738:

Coprocessors [4] NVIDIA NVIDIA GeForce RTX 2080 Ti (11263MB) driver: 528.2

and tasks with

openmm.OpenMMException: Illegal value for DeviceIndex: 1

BOINC has a well-known design flaw: it reports all of a host's GPUs as identical copies of the first one, even if in reality they're different. And this project's apps are very picky about actually getting the sort of GPU they've been told to run on. So, if Device_0 is an RTX 2080 Ti and Device_1 isn't, you'll get an error like that.

The machine has completed other tasks today, presumably on Device 0, although the project doesn't report that for a successful task.
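If a host really does have mismatched cards, one possible workaround is to keep GPUGRID work off the odd device with an exclusion in cc_config.xml, roughly along these lines (a sketch only; device_num 1 is just an example, and the startup messages in the event log show which index BOINC gave each card):

<cc_config>
    <options>
        <use_all_gpus>1</use_all_gpus>
        <exclude_gpu>
            <url>https://www.gpugrid.net/</url>
            <device_num>1</device_num>
        </exclude_gpu>
    </options>
</cc_config>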

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60659 - Posted: 17 Aug 2023 | 18:48:58 UTC

Just anecdotally, most issues reported seem to be coming from Windows users with ATMbeta.
____________

roundup
Send message
Joined: 11 May 10
Posts: 57
Credit: 1,561,695,193
RAC: 7,902,359
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60661 - Posted: 18 Aug 2023 | 4:18:08 UTC - in response to Message 60659.

Most of the errors so far have occurred on the otherwise reliable Linux machine.
Since the app version is still 1.09, I had little hope that the 4080 would work on Windows this time - and rightly so.
When an OS-hardware combination works flawlessly on all other BOINC projects, it is not very far-fetched to suspect that the app is buggy when there are errors on GPUGrid.
I hope that the developers can look deeply into it again.

Freewill
Send message
Joined: 18 Mar 10
Posts: 13
Credit: 6,627,287,394
RAC: 30,274,262
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60662 - Posted: 18 Aug 2023 | 10:47:18 UTC

I have 11 valid and 3 error so far for this batch of tasks. I am getting the "raise ValueError('Energy is NaN." error. This is for Ubuntu 22.04 and single GPU or 2 identical GPU systems. What really hurts is the tasks are running a long time and then the error comes.

mmonnin
Send message
Joined: 2 Jul 16
Posts: 332
Credit: 4,006,821,065
RAC: 12,216,791
Level
Arg
Scientific publications
watwatwatwatwat
Message 60663 - Posted: 18 Aug 2023 | 16:08:20 UTC

I completed my first of these longer tasks in Win10. Sometimes this PC completes tasks failed by others and the opposite also happens.

This one had failed 3x for the same person on different PCs, then a 4th failure for someone else. All 4 PCs have over 100 failures within 1-2 minutes. It seems these users are getting plenty more than 14 per day.

https://www.gpugrid.net/workunit.php?wuid=27541862

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60664 - Posted: 19 Aug 2023 | 9:17:49 UTC

This is a new one on me: task 33579127

openmm.OpenMMException: Called setPositions() on a Context with the wrong number of positions

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60665 - Posted: 19 Aug 2023 | 11:58:50 UTC

One strange thing I noticed when I restarted downloading and crunching tasks on 2 PCs yesterday evening:
one of the two PCs got a new 6-digit host number in the GPUGRID system.
It's now 610779; before, it was 600874. No big deal, but I am wondering how that happened.
Does anyone have a logical explanation?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60666 - Posted: 19 Aug 2023 | 16:29:42 UTC - in response to Message 60665.
Last modified: 19 Aug 2023 | 16:31:17 UTC

BOINC thought it was a new host.

Could be many reasons.

Scheduler database issues, corruption of or missing client_state.xml file, change in hardware or OS etc. etc.

All it takes is the server to lose track of how many times the host has contacted the project and the server will reissue a new host ID.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,098,952,024
RAC: 8,566,567
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60667 - Posted: 20 Aug 2023 | 11:45:45 UTC - in response to Message 60653.

I think that the most harmful risk is that excessively heavy tasks generate result files bigger than 512 MB in size.
GPUGRID server can't handle them, and they won't upload...

I see that one of the issues that Quico seems to have addressed is just this one.
On this two-GTX 1650-GPU host, I was processing these two tasks:
syk_m39_m14_1-QUICO_ATM_Mck_Sage_2fs-1-5-RND8946_0, with these characteristics:
syk_m31_m03_2-QUICO_ATM_Mck_GAFF2_2fs-1-5-RND5226_0, with these characteristics
To test the sizes of the generated result files, I suspended network activity in BOINC Manager.
And now there are only two result files per task, both much lighter than in previous batches.
The excessively heavy result files problem is solved, and, by the way, there's less stress on server storage.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60668 - Posted: 20 Aug 2023 | 14:11:29 UTC - in response to Message 60667.

Yes, I saw that and it looks like good news. But, I would say, unproven as yet. This new batch has varying run times, according to what gene or protein is being worked on - the first part of the name, anyway.

The longest ones seem to be shp2_, but even those finish in under 12 hours on my cards. I think the really massive 500MB+ tasks ran over 24 hours on the same cards, so we haven't really tested the limits yet.

wujj123456
Send message
Joined: 9 Jun 10
Posts: 16
Credit: 1,787,119,073
RAC: 3,100,787
Level
His
Scientific publications
watwatwatwat
Message 60671 - Posted: 21 Aug 2023 | 2:41:07 UTC

I am a bit curious about what's going on with this 1660 Super host. I occasionally check the errored-out units to make sure it's not just my hosts always failing, and I've noticed more than once that some host would finish them super fast. It happened to some of my own valid WUs too. While this sometimes happens, what I hadn't seen until now is a host that either errors out or finishes super fast: https://www.gpugrid.net/results.php?hostid=610334

What makes a WU finish super fast when other computers error out? Are these fast completions actually valid results?
____________

bluestang
Send message
Joined: 13 Apr 15
Posts: 10
Credit: 2,542,462,606
RAC: 0
Level
Phe
Scientific publications
wat
Message 60672 - Posted: 22 Aug 2023 | 17:18:26 UTC
Last modified: 22 Aug 2023 | 17:41:13 UTC

What's with the new error/issue...

8/22/2023 12:49:47 PM | GPUGRID | Output file syk_m12_m14_3-QUICO_ATM_Mck_Sage_v3-1-5-RND3920_0_0 for task syk_m12_m14_3-QUICO_ATM_Mck_Sage_v3-1-5-RND3920_0 absent


Of course, it may not be new... I just saw it in my log after having a string of errors.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60673 - Posted: 22 Aug 2023 | 18:04:39 UTC

There appears to be a flaw or weakness in the application right at the end of the finishing up stage where the app needs to tar the output file.

Seems it can lose track of the filename often or looks for the incorrect filename.

Might have something to do with the way the host is handling access permissions or the slowness of the filesystem in accessing the output file.

I wonder if putting a wait loop into the code right before this process would let the filesystem settle down long enough to get access to the file for tar'ing.

Art_Brown
Send message
Joined: 3 Jun 09
Posts: 4
Credit: 1,074,953,114
RAC: 126
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60674 - Posted: 23 Aug 2023 | 19:12:52 UTC

ATMbeta tasks show 100% complete but do not finish, and Task Manager shows 'normal' working CPU loads (5 to 6% in my case) (12 core, 16 logical). I abort the runs and stop accepting new work for a week or so hoping the next batch finishes and uploads, but it might be an issue with my PC.
Windows 10, 64GB ram, two GTX-980ti cards
Can anyone offer suggestions?
Thanks.

Two examples:

Computer: Pecan
Project GPUGRID

Name tnks2_m07_m5d_2-QUICO_ATM_Mck_GAFF2_2fs-2-5-RND9083_0

Application ATMbeta 1.09 (cuda1121)
Workunit name tnks2_m07_m5d_2-QUICO_ATM_Mck_GAFF2_2fs-2-5-RND9083
State Running
Received 8/23/2023 6:41:59 AM
Report deadline 8/28/2023 6:41:59 AM
Estimated app speed 580.20 GFLOPs/sec
Estimated task size 1,000,000,000 GFLOPs
Resources 0.983 CPUs + 1 NVIDIA GPU (device 1)
CPU time at last checkpoint 00:00:00
CPU time 04:25:24
Elapsed time 05:11:46
Estimated time remaining 00:00:00
Fraction done 100.000%
Virtual memory size 6,867.10 MB
Working set size 761.31 MB
Directory slots/1
Process ID 5364

Debug State: 2 - Scheduler: 2

Computer: Pecan
Project GPUGRID

Name shp2_m24_m19_1-QUICO_ATM_Mck_Sage_2fs-2-5-RND5546_0

Application ATMbeta 1.09 (cuda1121)
Workunit name shp2_m24_m19_1-QUICO_ATM_Mck_Sage_2fs-2-5-RND5546
State Running
Received 8/23/2023 6:42:37 AM
Report deadline 8/28/2023 6:42:36 AM
Estimated app speed 580.20 GFLOPs/sec
Estimated task size 1,000,000,000 GFLOPs
Resources 0.983 CPUs + 1 NVIDIA GPU (device 0)
CPU time at last checkpoint 00:00:00
CPU time 04:40:09
Elapsed time 05:17:34
Estimated time remaining 00:00:00
Fraction done 100.000%
Virtual memory size 7,414.65 MB
Working set size 1,184.61 MB
Directory slots/0
Process ID 20016

Debug State: 2 - Scheduler: 2

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60675 - Posted: 23 Aug 2023 | 21:07:29 UTC - in response to Message 60674.

Yes. That is - sadly - normal for the current tasks in this series. Ultra-long tasks were split into bite-sized chunks - you can tell which chunk you're running from the task name. They're split into 0-5 to 4-5 (towards the end of the task name: there's no 5-5).

If you get a 0-5, it'll report progress normally, from 0% to 100%. All the others will jump quickly to 100%, and stay there.

Irritating, and it messes up work fetch, but it doesn't ultimately matter. Just let them run: they'll finish eventually, and life will move on.

Art_Brown
Send message
Joined: 3 Jun 09
Posts: 4
Credit: 1,074,953,114
RAC: 126
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60676 - Posted: 24 Aug 2023 | 15:39:38 UTC - in response to Message 60675.

Hi Richard and thanks for the info.
I stop all processing at 4pm daily, and when the ATMbeta tasks are still running, they don't survive the suspension and restart process. They spontaneously abort. So I'll still have to try new batches from time to time to avoid wasting CPU time for other projects.
Regards, Art

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60677 - Posted: 24 Aug 2023 | 18:40:52 UTC - in response to Message 60676.

The ATMbeta tasks cannot be stopped during processing or they will error out.

wujj123456
Send message
Joined: 9 Jun 10
Posts: 16
Credit: 1,787,119,073
RAC: 3,100,787
Level
His
Scientific publications
watwatwatwat
Message 60683 - Posted: 26 Aug 2023 | 21:45:17 UTC - in response to Message 60677.

I also pause computing during peak hours, both for cost and so as not to put more load onto the grid at the wrong time. I guess at when to stop fetching work from GPUGRID so that the WU finishes before the peak hours begin.

Given the WU can be split up, it would be nice if each chunk could be smaller, especially when the workload doesn't support checkpoints or suspend/resume. The average time sits at 6 hours now. Are there good reasons it can't be shorter? Like additional bandwidth, or overhead for each chunk?

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60684 - Posted: 27 Aug 2023 | 2:04:39 UTC - in response to Message 60638.

If it hurts when you <do that>, then the most obvious solution is to not <do that>. This applies to most things in life.

It’s well known that these tasks do not like to be interrupted. If your power grid is that unstable, then it’s probably best for you to crunch something else, or invest in a battery backup to keep the computer running during power outages.

These are still classified as Beta, after all, and that comes with the implication that things will not always work, and you need to accept whatever compromises come with it. If you don’t, then your other solution could be to just disable Beta processing in your profile and wait for ACEMD3 work.



The sad part is that you are extra smart.

KAMasud
Send message
Joined: 27 Jul 11
Posts: 137
Credit: 523,901,354
RAC: 3
Level
Lys
Scientific publications
watwat
Message 60685 - Posted: 27 Aug 2023 | 2:10:56 UTC - in response to Message 60683.

I also pause computing during peak hours, both for cost and to avoid putting more load onto the grid at the wrong time. I have to guess when to stop fetching work from GPUGRID so that the WU finishes before the peak hours begin.

Given that the WU can be split up, it would be nice if each chunk could be smaller, especially when the workload doesn't support checkpoints or suspend/resume. The average time sits at 6 hours now. Are there good reasons it can't be shorter, like additional bandwidth or overhead for each chunk?

You cannot predict that, but someone extra smart will come along and post irrelevant BS. 28 WUs each crunched for eight hours and then errored out. Multiply 28 by 8, then by the power tariff. All wasted.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60686 - Posted: 27 Aug 2023 | 10:45:52 UTC

this morning, an ATM task on a GTX 980 Ti errored out after more than 10 hours :-(((

The reason, as so often, is:
ValueError: Energy is NaN

What a waste of valuable energy! What I don't understand is why, by now, the developer doesn't have enough experience with this type of task to eliminate errors like this.

At any rate, if this happens again, I will quit crunching ATM. Electricity rates here tripled last year, and I can no longer afford to waste money.

roundup
Send message
Joined: 11 May 10
Posts: 57
Credit: 1,561,695,193
RAC: 7,902,359
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60687 - Posted: 27 Aug 2023 | 11:42:37 UTC - in response to Message 60686.

this morning, an ATM task on a GTX 980 Ti errored out after more than 10 hours :-(((

The reason, as so often, is:
ValueError: Energy is NaN


Similar here, after 11,363.13 seconds on an RTX 2070S for unit 33589476.
I understand that this can happen in a Beta project. However, I would expect the developer to iron out the most common errors, such as 'Energy is NaN', the progress indication jumping to 100%, the wrong remaining-runtime indication, and the RTX 40xx errors on Windows, to name a few.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60688 - Posted: 28 Aug 2023 | 2:20:07 UTC - in response to Message 60685.

I also pause computing during peak hours, both for cost and to avoid putting more load onto the grid at the wrong time. I have to guess when to stop fetching work from GPUGRID so that the WU finishes before the peak hours begin.

Given that the WU can be split up, it would be nice if each chunk could be smaller, especially when the workload doesn't support checkpoints or suspend/resume. The average time sits at 6 hours now. Are there good reasons it can't be shorter, like additional bandwidth or overhead for each chunk?

You cannot predict that, but someone extra smart will come along and post irrelevant BS. 28 WUs each crunched for eight hours and then errored out. Multiply 28 by 8, then by the power tariff. All wasted.


I’m sure doing the same thing over and over and expecting a different result is the solution :)
____________

bluestang
Send message
Joined: 13 Apr 15
Posts: 10
Credit: 2,542,462,606
RAC: 0
Level
Phe
Scientific publications
wat
Message 60689 - Posted: 28 Aug 2023 | 3:14:03 UTC - in response to Message 60688.

I also pause computing during peak hours, both for cost and to avoid putting more load onto the grid at the wrong time. I have to guess when to stop fetching work from GPUGRID so that the WU finishes before the peak hours begin.

Given that the WU can be split up, it would be nice if each chunk could be smaller, especially when the workload doesn't support checkpoints or suspend/resume. The average time sits at 6 hours now. Are there good reasons it can't be shorter, like additional bandwidth or overhead for each chunk?

You cannot predict that, but someone extra smart will come along and post irrelevant BS. 28 WUs each crunched for eight hours and then errored out. Multiply 28 by 8, then by the power tariff. All wasted.


I’m sure doing the same thing over and over and expecting a different result is the solution :)


You mean like releasing batch after batch after batch of WUs with the same issues that people have been complaining about for how long now? :)

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60690 - Posted: 28 Aug 2023 | 5:13:43 UTC - in response to Message 60689.

It’s beta. Accept it or move on.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60691 - Posted: 28 Aug 2023 | 9:06:32 UTC - in response to Message 60690.

It’s beta. Accept it or move on.

the question, though, is how much longer it will be beta.
Isn't the point of a beta that the developer of a tool is working on it in order to eliminate problems?
Here, not much seems to have been done so far. Always the same errors and problems; none of them have been solved after such a long time :-(

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60692 - Posted: 28 Aug 2023 | 11:06:30 UTC

here the next one:

https://www.gpugrid.net/result.php?resultid=33593354

it failed after 11,267 seconds, and doesn't even say why it failed (unless I am unable to spot it).

So I will stop crunching ATMs at least on this machine. Waste of expensive electricity :-(

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60693 - Posted: 28 Aug 2023 | 21:38:36 UTC - in response to Message 60691.

It’s beta. Accept it or move on.

the question, though, is how much longer it will be beta.
Isn't the point of a beta that the developer of a tool is working on it in order to eliminate problems?
Here, not much seems to have been done so far. Always the same errors and problems; none of them have been solved after such a long time :-(

Could be forever . . . does not matter as the science still gets done.

Either accept failures with a beta app or move on.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60694 - Posted: 29 Aug 2023 | 8:50:42 UTC - in response to Message 60693.

Judging by the number of tasks which have passed through the system over the past week (and yet more have just been added), it would appear that the scientific part of the project is now operating in 'production' mode, rather than 'beta' mode.

It would be appreciated if the beta wrinkles could be ironed out from the administrative aspects of the app, too.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60695 - Posted: 29 Aug 2023 | 16:05:30 UTC - in response to Message 60694.

Judging by the number of tasks which have passed through the system over the past week (and yet more have just been added), it would appear that the scientific part of the project is now operating in 'production' mode, rather than 'beta' mode.

It would be appreciated if the beta wrinkles could be ironed out from the administrative aspects of the app, too.

+ 1

bluestang
Send message
Joined: 13 Apr 15
Posts: 10
Credit: 2,542,462,606
RAC: 0
Level
Phe
Scientific publications
wat
Message 60696 - Posted: 29 Aug 2023 | 17:30:47 UTC - in response to Message 60690.

It’s beta. Accept it or move on.


This excuse/reason has been used for too long now. It's getting old and I'm sick of people letting devs/admins of projects slide by with crap apps instead of fixing them like they know they should.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60697 - Posted: 29 Aug 2023 | 20:47:15 UTC - in response to Message 60696.

then don't support the project? and move on?

Quico has said multiple times that he doesn't know how to fix it (the runtime/% and checkpointing). complaining more won't get it fixed. at this point, it's your own choice to run this or not. if you don't like it, don't do it.
____________

Speedy
Send message
Joined: 19 Aug 07
Posts: 42
Credit: 28,391,082
RAC: 0
Level
Val
Scientific publications
watwatwatwatwatwatwat
Message 60698 - Posted: 29 Aug 2023 | 21:34:21 UTC - in response to Message 60697.

I totally agree.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60699 - Posted: 29 Aug 2023 | 21:52:37 UTC - in response to Message 60697.

Quico has said multiple times that he doesn't know how to fix it (the runtime/% and checkpointing). complaining more won't get it fixed. at this point, it's your own choice to run this or not. if you don't like it, don't do it.

Quico is a research scientist - and at least he communicates with us (thank you). I wouldn't expect him to be an expert in project administration.

That's why my comment was explicitly directed at the (silent) administrators.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60700 - Posted: 30 Aug 2023 | 5:16:27 UTC - in response to Message 60699.

That's why my comment was explicitly directed at the (silent) administrators.

yes, they are very silent; and obviously they don't care whether or not we volunteers are confronted with annoyingly faulty tasks :-(

bluestang
Send message
Joined: 13 Apr 15
Posts: 10
Credit: 2,542,462,606
RAC: 0
Level
Phe
Scientific publications
wat
Message 60701 - Posted: 30 Aug 2023 | 15:48:08 UTC - in response to Message 60699.

Quico has said multiple times that he doesn't know how to fix it (the runtime/% and checkpointing). complaining more won't get it fixed. at this point, it's your own choice to run this or not. if you don't like it, don't do it.

Quico is a research scientist - and at least he communicates with us (thank you). I wouldn't expect him to be an expert in project administration.

That's why my comment was explicitly directed at the (silent) administrators.


Yes exactly. My comments are about the Admins/Devs...not Quico.

And as Richard has said, at least he communicates with us and does what he can. It's a shame the others can't, or won't.

Profile Stoneageman
Avatar
Send message
Joined: 25 May 09
Posts: 224
Credit: 34,057,224,498
RAC: 32
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60704 - Posted: 31 Aug 2023 | 12:04:54 UTC

As I understand it, GPUGRID is now just one of several projects under the Computational Science Lab, and the developers are mostly involved with Acellera.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60705 - Posted: 31 Aug 2023 | 18:15:03 UTC

the next problem I have been faced with for several days: the download of a task takes forever. The speed is about 10 kB/s :-(

This is nothing new, though. I remember that this kind of server problem comes up on a pretty regular basis :-(

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60706 - Posted: 31 Aug 2023 | 20:10:49 UTC - in response to Message 60705.

the next problem I have been faced with for several days: the download of a task takes forever. The speed is about 10 kB/s :-(

This is nothing new, though. I remember that this kind of server problem comes up on a pretty regular basis :-(


right now, the download of a task has been taking 1:40 hrs so far and the progress is about 55 %.
That's ridiculous :-(

What's going on at GPUGRID? Are the servers breaking down?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60707 - Posted: 31 Aug 2023 | 22:30:17 UTC

Lots of tasks going out to hosts and lots of results returning.

Network speed has decreased under the increased congestion.

We've seen this before when we had tons of acemd3 work.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60708 - Posted: 1 Sep 2023 | 5:46:33 UTC - in response to Message 60707.

Lots of tasks going out to hosts and lots of results returning.
Network speed has decreased under the increased congestion.

currently only 171 users are receiving and sending tasks, with several hours between receiving and sending.
So we are definitely not talking about outrageously high network traffic.
Something seems to be wrong with their servers.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60711 - Posted: 1 Sep 2023 | 19:16:39 UTC

The download times you mentioned are very long and not at all what I am experiencing.

Don't know if your network connection speed is very slow or whether your ISP is having issues routing your traffic from the project to you.

My download speeds are mainly in the range of 50-100 Mb/s according to the Transfers page in the Manager when I download new work.

roundup
Send message
Joined: 11 May 10
Posts: 57
Credit: 1,561,695,193
RAC: 7,902,359
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60712 - Posted: 2 Sep 2023 | 7:41:43 UTC - in response to Message 60711.

I had previously reported that ATMbeta fails after about 40 seconds on my RTX 4080 under Windows 11, while I see other users getting valid results on different RTX 40x0s.
Yesterday I installed Linux on the same machine and ATMbeta delivered valid results. The known bugs can again be observed, of course: Energy is NaN (some WUs), the progress bar jumps to 100% (except the 0-5 units), no checkpoints.

On Windows, ATMbeta seems to have a particular problem.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60714 - Posted: 3 Sep 2023 | 13:00:17 UTC - in response to Message 60711.

The download times you mentioned are very long and not at all what I am experiencing.

Don't know if your network connection speed is very slow or whether your ISP is having issues routing your traffic from the project to you.

My download speeds are mainly in the range of 50-100 Mb/s according to the Transfers page in the Manager when I download new work.

here, some downloads get done rather quickly, while others take forever and sometimes error out after a long time. STDERR then says the following:

<message>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>cmet_m16_m20_3-QUICO_ATM_Mck_GAFF2_v4-3-cmet_m16_m20_3-QUICO_ATM_Mck_GAFF2_v4-2-5-RND1222_1</file_name>
<error_code>-119 (md5 checksum failed for file)</error_code>
</file_xfer_error>
</message>

The download speed of my ISP is 300 Mbit/s which normally works well as long as the download server at the other end has no problems.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60715 - Posted: 3 Sep 2023 | 16:11:31 UTC

since yesterday, I have been facing a new problem:
a task fails after some time, but there is no stderr output, so one cannot see what the problem was.
Example:
https://www.gpugrid.net/result.php?resultid=33612295
the task failed after 2,731 seconds.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60716 - Posted: 3 Sep 2023 | 16:17:43 UTC - in response to Message 60714.

The download times you mentioned are very long and not at all what I am experiencing.

Don't know if your network connection speed is very slow or whether your ISP is having issues routing your traffic from the project to you.

My download speeds are mainly in the range of 50-100 Mb/s according to the Transfers page in the Manager when I download new work.

here, some downloads get done rather quickly, while others take forever and sometimes error out after a long time. STDERR then says the following:

<message>
WU download error: couldn't get input files:
<file_xfer_error>
<file_name>cmet_m16_m20_3-QUICO_ATM_Mck_GAFF2_v4-3-cmet_m16_m20_3-QUICO_ATM_Mck_GAFF2_v4-2-5-RND1222_1</file_name>
<error_code>-119 (md5 checksum failed for file)</error_code>
</file_xfer_error>
</message>

The download speed of my ISP is 300 Mbit/s which normally works well as long as the download server at the other end has no problems.


here an example of the download problem which I keep facing:

https://www.gpugrid.net/result.php?resultid=33613115

Erstellt 3 Sep 2023 | 11:20:22 UTC
Gesendet 3 Sep 2023 | 12:29:54 UTC
Empfangen 3 Sep 2023 | 12:43:06 UTC

since the download still hadn't finished after almost 70 minutes, it broke off :-(

I think GPUGRID needs to work on its servers quickly.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60717 - Posted: 3 Sep 2023 | 19:33:56 UTC

What do you have for transfers in your cc_config.xml file?

Just the basic 2 connections?

Any rate limiting?

I think the default should be 8 connections per project and 32 per host.

Especially if there are other BOINC projects running besides GPUGrid.
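For reference, those limits live in cc_config.xml in the BOINC data directory; a minimal sketch of the relevant options (the numbers are only example values, not a recommendation):

<cc_config>
  <options>
    <!-- total simultaneous file transfers across all projects -->
    <max_file_xfers>8</max_file_xfers>
    <!-- simultaneous file transfers per project -->
    <max_file_xfers_per_project>4</max_file_xfers_per_project>
  </options>
</cc_config>

The client picks the file up after Options > Read config files in the Manager, or after a restart.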

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60719 - Posted: 3 Sep 2023 | 20:26:11 UTC - in response to Message 60717.

What do you have for transfers in your cc_config.xml file?

Just the basic 2 connections?

Any rate limiting?

I think the default should be 8 connections per project and 32 per host.

Especially if there are other BOINC projects running besides GPUGrid.

it's 8 connections per project

and the downloads get even worse now. Several times now downloads have stopped after proceeding extremely slowly, with "download failed" in the BOINC manager :-(

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60720 - Posted: 3 Sep 2023 | 20:44:51 UTC
Last modified: 3 Sep 2023 | 20:45:57 UTC

the BOINC event log keeps saying: project servers may be temporarily down.

And the pending upload jumps to "repeat in ... hours" immediately.

So there is definitely something wrong with the servers over there.

P.S.: even sending this posting out took almost 1 minute

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60721 - Posted: 3 Sep 2023 | 22:11:23 UTC - in response to Message 60720.

Still believe the issue is local to you. In all the time you have reported issues with the downloads, I have not experienced any issues or backoffs.

Project is working normally for me, though the speeds have degraded from what I experienced a month or so ago.

Still no issues keeping all the hosts crunching the ATMbeta tasks.

Profile Stoneageman
Avatar
Send message
Joined: 25 May 09
Posts: 224
Credit: 34,057,224,498
RAC: 32
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60726 - Posted: 4 Sep 2023 | 9:24:19 UTC
Last modified: 4 Sep 2023 | 9:29:31 UTC

I regularly get backoffs on transfers, so it's not just you.
Their server has issues for sure. I have to use a different IP address from the one my hosts use just to access this site. Some ideas:

Reboot your router at least every 24 hrs. If you are not on a fixed IP, this is likely to get you a new IP address. This helps me greatly with maintaining good transfer speeds.

Use a VPN and set location as Spain.

Use a script on each host to keep tickling their server, such as....

@echo off
rem Poll the local BOINC client every 5 minutes and tell it the network is
rem available, so it retries any deferred (backed-off) transfers immediately.
:top
"C:\Program Files\BOINC\boinccmd" --host 127.0.0.1:31416 --passwd "yourpasswordhere" --network_available
TIMEOUT /T 300
goto top


Create a text file with this script and edit it to suit your install. Save it as a batch file, then double-click to run it.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60727 - Posted: 4 Sep 2023 | 16:06:36 UTC
Last modified: 4 Sep 2023 | 16:07:19 UTC

i started processing ATM again on my known stable host (Linux Ubuntu LTS, EPYC + 4x A4000).

out of 160 tasks that have processed, 9 had an error (excluding 9 tasks that had download errors and never wasted any processing time; download errors are just a "cost of doing business" with GPUGRID, IMO)

that's roughly a 5% error rate, and reasonable IMO. yeah, some failed after a decent processing time, but I'm not gonna get upset about it since the vast majority of tasks that touch my system complete successfully.

if anyone is having a significantly higher error rate, you might need to look into the stability of the system itself, or switch to linux, or re-examine how you are operating (don't stop the tasks for any reason if you can help it, don't reboot, don't run other projects, etc.), or any combination of the three.

when set up properly and accounting for project-specific idiosyncrasies, these tasks mostly run fine.
____________

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 467
Credit: 8,358,971,966
RAC: 9,198,359
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60728 - Posted: 5 Sep 2023 | 2:35:40 UTC

The uploads are definitely slow, upwards of 25 minutes for an 86 MB file. I just noticed it this weekend. Until last week, the uploads were taking less than a minute for this particular file. Downloads, though, still take less than a minute for me.

I think the server needs a reboot, and the management of this project needs a good kick in the ass.......................

That's just my opinion.



Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60734 - Posted: 6 Sep 2023 | 16:58:05 UTC

Any idea why there is no longer any stderr information on failed tasks?

Example: https://www.gpugrid.net/result.php?resultid=33616730

The task failed after 2,344 seconds, and I have no idea why, although I guess it may again have been the old and well-known "Energy is NaN" problem.

P.S. The servers, for downloads as well as uploads, are getting worse and worse :-(

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60735 - Posted: 6 Sep 2023 | 17:34:36 UTC - in response to Message 60734.

likely something specific to that host, nothing to do with the project. none of my errors exhibit that.

stderr information is stored and uploaded from the client state file directly. for it to be empty probably means something got corrupted with that file.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60736 - Posted: 6 Sep 2023 | 20:31:02 UTC

Also finally noticing the slow downloads/uploads for this project's tasks that many have been complaining about for a week.

It started getting really bad yesterday and today. Only 20-50 kb/s and lots of stalled activity now. Uploads are taking 25-30 minutes, for example, across all hosts.

I agree the servers need a good reboot. They should do that now that the available tasks have dwindled down to a dozen.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 467
Credit: 8,358,971,966
RAC: 9,198,359
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60739 - Posted: 11 Sep 2023 | 8:32:51 UTC

Getting the "Energy is NaN" error doesn't necessarily mean that the unit is a bad one. It means that the unit is more sensitive to failure.

Here is an example where my computer and two others had that error, but someone else finished it successfully:

https://www.gpugrid.net/workunit.php?wuid=27572371

Here is an example where my computer finished it successfully, while others didn't:

https://www.gpugrid.net/workunit.php?wuid=27573320

My computer is 610674.

BTW: The uploads are running at normal speeds, when there is very little work. The project obviously needs more bandwidth.



Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60740 - Posted: 11 Sep 2023 | 9:25:02 UTC - in response to Message 60739.

BTW: The uploads are running at normal speeds, when there is very little work. The project obviously needs more bandwidth.

Download problems are still persisting, a recent example see here:

https://www.gpugrid.net/result.php?resultid=33624287

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60741 - Posted: 11 Sep 2023 | 12:28:05 UTC - in response to Message 60739.

Getting the "Energy is NaN" error doesn't necessarily mean that the unit is a bad one. It means that the unit is more sensitive to failure.

this may be the case; but after such a long time, the developer should have been able to iron out this unusually high sensitivity.

2 or 3 weeks ago I stopped crunching ATMs on one of my hosts, after about every other task had failed for no apparent reason (no overclocking, no tasks from other projects running).

Now I am seeing a curious situation on the other two hosts: since no tasks from other projects are running (because none are available), as opposed to a short time ago, the failure rate of the ATMs has actually increased.
No idea why.
And when this happens after several hours (which it does), it is more than annoying. If the situation continues like this, I will stop crunching ATMs on these other two hosts as well, coming back only once the developer has improved the performance of the ATMs.

Too bad that GPUGRID has stopped all the other sub-projects like Python or ACEMD (both 3 and 4).

I have been with GPUGRID for almost 9 years, but during this time the situation has never been as bad as it has been in the recent past, strange server problems included :-(
No idea what's going on there ???

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60742 - Posted: 13 Sep 2023 | 16:00:45 UTC

I added up the runtimes of tasks that failed with "Energy is NaN" alone on September 11:
45,228 seconds = 12.56 hours.
That's not nice :-(

Speedy
Send message
Joined: 19 Aug 07
Posts: 42
Credit: 28,391,082
RAC: 0
Level
Val
Scientific publications
watwatwatwatwatwatwat
Message 60743 - Posted: 13 Sep 2023 | 21:01:00 UTC - in response to Message 60742.

It may not be nice but it's part of beta testing

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60744 - Posted: 14 Sep 2023 | 10:35:39 UTC - in response to Message 60743.

It may not be nice but it's part of beta testing

well, as already discussed here earlier: ATM has run as beta for half a year now.
So, at some point the ongoing problems could be solved, right?

Speedy
Send message
Joined: 19 Aug 07
Posts: 42
Credit: 28,391,082
RAC: 0
Level
Val
Scientific publications
watwatwatwatwatwatwat
Message 60745 - Posted: 14 Sep 2023 | 21:39:02 UTC - in response to Message 60744.

It may not be nice but it's part of beta testing

well, as already discussed here earlier: ATM has run as beta for half a year now.
So, at some point the ongoing problems could be solved, right?

Yes, you could be right in what you are saying. Nevertheless, the application is still in "Beta", so errors are still likely to show up every now and then, or a lot.

AnandBhat
Send message
Joined: 3 Mar 22
Posts: 1
Credit: 13,554,000
RAC: 0
Level
Pro
Scientific publications
wat
Message 60746 - Posted: 15 Sep 2023 | 0:19:15 UTC

A couple of my ATM tasks (my only GPUgrid tasks for a loooong time) failed with this error: openmm.OpenMMException: Unknown property 'version' in node 'IntegratorParameters'

https://www.gpugrid.net/result.php?resultid=33624768
https://www.gpugrid.net/result.php?resultid=33624760

Looking at a few of my wingmen who've had similar failures, it appears to be a bad batch.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60747 - Posted: 15 Sep 2023 | 16:27:42 UTC - in response to Message 60746.

A couple of my ATM tasks (my only GPUgrid tasks for a loooong time) failed with this error: openmm.OpenMMException: Unknown property 'version' in node 'IntegratorParameters'

https://www.gpugrid.net/result.php?resultid=33624768
https://www.gpugrid.net/result.php?resultid=33624760

Looking at a few of my wingmen who've had similar failures, appears to be a bad batch.

one thing you can be happy about: these tasks seem to fail after a few minutes, and not after many hours, as is often the case with the "Energy is NaN" error.
So not much waste of resources.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60748 - Posted: 18 Sep 2023 | 9:28:45 UTC

Two errors so far with today's new batch. Both of the form

FileNotFoundError: [Errno 2] No such file or directory: 'MCL1_mXX_mXX_0.xml'

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60750 - Posted: 19 Sep 2023 | 0:17:15 UTC
Last modified: 19 Sep 2023 | 0:17:35 UTC

Just an FYI here, an interesting observation of two tasks run on a host, where the first task finished up and reported correctly.

Then the next task started up in the same slot 110 and errored out.

The difference was that the second task failed to find the checkpoint xml files.

Compare the ends of the output file to see the difference.

Task https://www.gpugrid.net/result.php?resultid=33626002

+ echo 'Save output'
+ tar cjvf output.tar.bz2 run.log r0/MCL1_m27_m05.out r1/MCL1_m27_m05.out r10/MCL1_m27_m05.out r11/MCL1_m27_m05.out r12/MCL1_m27_m05.out r13/MCL1_m27_m05.out r14/MCL1_m27_m05.out r15/MCL1_m27_m05.out r16/MCL1_m27_m05.out r17/MCL1_m27_m05.out r18/MCL1_m27_m05.out r19/MCL1_m27_m05.out r2/MCL1_m27_m05.out r20/MCL1_m27_m05.out r21/MCL1_m27_m05.out r3/MCL1_m27_m05.out r4/MCL1_m27_m05.out r5/MCL1_m27_m05.out r6/MCL1_m27_m05.out r7/MCL1_m27_m05.out r8/MCL1_m27_m05.out r9/MCL1_m27_m05.out r0/MCL1_m27_m05.dcd r1/MCL1_m27_m05.dcd r10/MCL1_m27_m05.dcd r11/MCL1_m27_m05.dcd r12/MCL1_m27_m05.dcd r13/MCL1_m27_m05.dcd r14/MCL1_m27_m05.dcd r15/MCL1_m27_m05.dcd r16/MCL1_m27_m05.dcd r17/MCL1_m27_m05.dcd r18/MCL1_m27_m05.dcd r19/MCL1_m27_m05.dcd r2/MCL1_m27_m05.dcd r20/MCL1_m27_m05.dcd r21/MCL1_m27_m05.dcd r3/MCL1_m27_m05.dcd r4/MCL1_m27_m05.dcd r5/MCL1_m27_m05.dcd r6/MCL1_m27_m05.dcd r7/MCL1_m27_m05.dcd r8/MCL1_m27_m05.dcd r9/MCL1_m27_m05.dcd
tar: run.log: file changed as we read it
+ true
+ echo 'Save restart'
+ tar cjvf restart.tar.bz2 r0/MCL1_m27_m05_ckpt.xml r1/MCL1_m27_m05_ckpt.xml r10/MCL1_m27_m05_ckpt.xml r11/MCL1_m27_m05_ckpt.xml r12/MCL1_m27_m05_ckpt.xml r13/MCL1_m27_m05_ckpt.xml r14/MCL1_m27_m05_ckpt.xml r15/MCL1_m27_m05_ckpt.xml r16/MCL1_m27_m05_ckpt.xml r17/MCL1_m27_m05_ckpt.xml r18/MCL1_m27_m05_ckpt.xml r19/MCL1_m27_m05_ckpt.xml r2/MCL1_m27_m05_ckpt.xml r20/MCL1_m27_m05_ckpt.xml r21/MCL1_m27_m05_ckpt.xml r3/MCL1_m27_m05_ckpt.xml r4/MCL1_m27_m05_ckpt.xml r5/MCL1_m27_m05_ckpt.xml r6/MCL1_m27_m05_ckpt.xml r7/MCL1_m27_m05_ckpt.xml r8/MCL1_m27_m05_ckpt.xml r9/MCL1_m27_m05_ckpt.xml
16:23:56 (1259959): bin/bash exited; CPU time 6653.704111
16:23:56 (1259959): called boinc_finish(0)


And the task https://www.gpugrid.net/result.php?resultid=33626029
that failed later in the same slot.

+ echo 'Save output'
+ tar cjvf output.tar.bz2 run.log r0/MCL1_m09_m41.out r1/MCL1_m09_m41.out r10/MCL1_m09_m41.out r11/MCL1_m09_m41.out r12/MCL1_m09_m41.out r13/MCL1_m09_m41.out r14/MCL1_m09_m41.out r15/MCL1_m09_m41.out r16/MCL1_m09_m41.out r17/MCL1_m09_m41.out r18/MCL1_m09_m41.out r19/MCL1_m09_m41.out r2/MCL1_m09_m41.out r20/MCL1_m09_m41.out r21/MCL1_m09_m41.out r3/MCL1_m09_m41.out r4/MCL1_m09_m41.out r5/MCL1_m09_m41.out r6/MCL1_m09_m41.out r7/MCL1_m09_m41.out r8/MCL1_m09_m41.out r9/MCL1_m09_m41.out r0/MCL1_m09_m41.dcd r1/MCL1_m09_m41.dcd r10/MCL1_m09_m41.dcd r11/MCL1_m09_m41.dcd r12/MCL1_m09_m41.dcd r13/MCL1_m09_m41.dcd r14/MCL1_m09_m41.dcd r15/MCL1_m09_m41.dcd r16/MCL1_m09_m41.dcd r17/MCL1_m09_m41.dcd r18/MCL1_m09_m41.dcd r19/MCL1_m09_m41.dcd r2/MCL1_m09_m41.dcd r20/MCL1_m09_m41.dcd r21/MCL1_m09_m41.dcd r3/MCL1_m09_m41.dcd r4/MCL1_m09_m41.dcd r5/MCL1_m09_m41.dcd r6/MCL1_m09_m41.dcd r7/MCL1_m09_m41.dcd r8/MCL1_m09_m41.dcd r9/MCL1_m09_m41.dcd
tar: run.log: file changed as we read it
+ true
+ echo 'Save restart'
+ tar cjvf restart.tar.bz2 'r*/*.xml'
tar: r*/*.xml: Cannot stat: No such file or directory
tar: Exiting with failure status due to previous errors
16:44:27 (15335): bin/bash exited; CPU time 58.042528
16:44:27 (15335): app exit status: 0x2
16:44:27 (15335): called boinc_finish(195)


I want to bring attention to the part where the glob pattern for the checkpoint xml files was not expanded, so tar was handed the literal string:

+ tar cjvf restart.tar.bz2 'r*/*.xml'
tar: r*/*.xml: Cannot stat: No such file or directory
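That matches standard bash globbing behaviour rather than anything tar-specific: when a pattern matches no files, bash passes it through as a literal string. A small illustration (generic shell, hypothetical file name, not the project's actual wrapper):

cd "$(mktemp -d)" && mkdir r0

tar cjvf restart.tar.bz2 r*/*.xml     # no .xml files exist yet, so the glob stays literal
# tar: r*/*.xml: Cannot stat: No such file or directory

touch r0/MCL1_ckpt.xml
tar cjvf restart.tar.bz2 r*/*.xml     # now the glob expands and the archive is created

So the failed task most likely never wrote its checkpoint xml files in the first place.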

GoodOlClint
Send message
Joined: 28 Jul 22
Posts: 1
Credit: 145,213,300
RAC: 661,573
Level
Cys
Scientific publications
wat
Message 60751 - Posted: 19 Sep 2023 | 2:39:34 UTC

Got the "Energy is NaN" error today. Interestingly, the task jumped to 100% after just a few minutes; however, it continued running for over two hours afterwards. When it completed, it uploaded a 45 MB result file.

https://www.gpugrid.net/result.php?resultid=33626669

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60752 - Posted: 19 Sep 2023 | 5:44:42 UTC - in response to Message 60751.

Got the "Energy is NaN" error today.

Here the same a few hours ago, after so many times before :-(
It's really unbelievable that, after this problem has been happening on a regular basis for half a year now, the developer is still not willing or able to iron out this nasty error.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60753 - Posted: 19 Sep 2023 | 6:11:02 UTC - in response to Message 60728.

On September 5, Bedrich Hajek wrote:

The uploads are definitely slow, upwards of 25 minutes for an 86 MB file. I just noticed it this weekend. Until last week, the uploads were taking less than a minute for this particular file. Downloads, though, still take less than a minute for me.

I think the server needs a reboot, and the management of this project needs a good kick in the ass.......................

That's just my opinion.

Still nothing has changed. Despite only very few new tasks being available once in a while, downloads and uploads take forever (25-30 kb/s).
So it's clear that traffic congestion is definitely not the reason.
I am curious when the project management will finally move ahead and straighten out all the problems.
Seemingly, it really needs a good kick in the ass to wake them up ...

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60754 - Posted: 19 Sep 2023 | 7:11:15 UTC

Complaining here is doing no good other than making the other forum participants tired of the constant diatribes.

The ATMbeta researcher has nothing to do with the application development. He just uses the tools that the project gives to him.

You need to vent your frustration at acellera.com

They are the ones that develop the apps. Look here.

https://www.gpugrid.net/about.php

Complain to the principal investigators Gianni and Toni.

Their contact information is in their website URLs posted on the About page.

[CSF] Aleksey Belkov
Avatar
Send message
Joined: 26 Dec 13
Posts: 85
Credit: 1,215,531,270
RAC: 25,194
Level
Met
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60764 - Posted: 21 Sep 2023 | 22:32:00 UTC - in response to Message 60754.

Complaining here is doing no good other than making the other forum participants tired of the constant diatribes.

Golden words

However, it’s still no use, because as practice shows, no matter how you argue, such people will still pour out their dissatisfaction over and over again...

Nuadormrac
Send message
Joined: 21 Jul 12
Posts: 7
Credit: 360,934,258
RAC: 84,041
Level
Asp
Scientific publications
watwatwatwatwatwatwat
Message 60767 - Posted: 27 Sep 2023 | 18:58:19 UTC - in response to Message 60051.

Climate has routine trickle credits which come in, which is also verification that it's making it to the next checkpoint. Some climate models took weeks though, not hours...

If the "excess" data that is resulting in these failed uploads is necessary, it might be possible to add a compression step in the app, depending on what the data is. If it's text, it should be very compressible; otherwise less so. This could then be decompressed server-side to allow the data to go into whatever processing you have, so long as compression doesn't result in a loss of critical data due to compression errors.
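For what it's worth, the wrapper output quoted earlier in this thread already packs results with "tar cjvf ... .tar.bz2", i.e. bzip2 compression is applied. A quick generic check of how much a given file would shrink (hypothetical file name, plain shell):

bzip2 -k -9 run.log && ls -l run.log run.log.bz2
# plain-text logs usually compress well; binary .dcd trajectories much less so.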
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,098,952,024
RAC: 8,566,567
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60783 - Posted: 30 Oct 2023 | 18:04:13 UTC

It seems that today a new batch of ATMbeta tasks is in the field.
A subtle difference compared to previous ones:
The name of my last task from the previous batch was:
TYK2_m15_m16_3-QUICO_ATM_Sch_GAFF2_AIMNet2_RE-4-5-RND5965_0
The name of my first task from today's batch is:
CDK2_m10_m04_4-QUICO_ATM_Sch_AIMnet2_10-0-10-RND2956_0
If it is as usual, each task for a given investigation line will generate a chain of ten links (from 0 to 9), instead of the previous 5 (from 0 to 4).
Verification pending...

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60784 - Posted: 31 Oct 2023 | 8:23:14 UTC

what I notice is:
on both ATM tasks which I downloaded and started about 1 hour ago, the percentage progress bar in the BOINC manager works fine, i.e. it shows real values.
Either this was fixed, or it is coincidence.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60785 - Posted: 31 Oct 2023 | 9:11:30 UTC - in response to Message 60784.

We're still working on the first task in each sequence - 0-5, or now 0-10. This looks to be a big batch, so that may last some time.

The interesting point (the sudden jump to 100%) will come with the 1-10 tasks. But we're still working on the 1.09 application version from late March, so I'm not expecting a miracle.

goldfinch
Send message
Joined: 5 May 19
Posts: 31
Credit: 395,274,685
RAC: 102,466
Level
Asp
Scientific publications
wat
Message 60788 - Posted: 31 Oct 2023 | 12:06:14 UTC - in response to Message 60784.

what I notice is:
on both ATM tasks which I downloaded and started about 1 hour ago, the percentage progress bar in the BOINC manager works fine, i.e. it shows real values.
Either this was fixed, or it is coincidence.

Does that mean that ATM tasks are available now? I haven't received anything since... May or June, if I remember correctly. Even now, I explicitly clicked the Update button, and got nothing. NVIDIA RTX 3070 on Win 11, if that matters.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60789 - Posted: 31 Oct 2023 | 12:07:03 UTC - in response to Message 60788.

what I notice is:
on both ATM tasks which I downloaded and started about 1 hour ago, the percentage progress bar in the BOINC manager works fine, i.e. it shows real values.
Either this was fixed, or it is coincidence.

Does that mean that ATM tasks are available now? I haven't received anything since.. May or June, if i remember correctly. Even now, i explicitly clicked the Update button - nothing. nVidia RTX3070 on Win 11, if that matters.


the tasks now are "ATMbeta".

you need to go into your preferences and allow beta work.
____________

goldfinch
Send message
Joined: 5 May 19
Posts: 31
Credit: 395,274,685
RAC: 102,466
Level
Asp
Scientific publications
wat
Message 60790 - Posted: 31 Oct 2023 | 12:10:22 UTC - in response to Message 60789.


the tasks now are "ATMbeta".

you need to go into your preferences and allow beta work.

That's been done long time ago. My current settings are:
ACEMD 3: yes
ACEMD 4: yes
ATM (beta): yes
Quantum Chemistry (CPU): yes
Quantum Chemistry (CPU, beta): no
Python Runtime (CPU, beta): no
Python Runtime (GPU, beta): yes

goldfinch
Send message
Joined: 5 May 19
Posts: 31
Credit: 395,274,685
RAC: 102,466
Level
Asp
Scientific publications
wat
Message 60791 - Posted: 31 Oct 2023 | 12:15:52 UTC - in response to Message 60789.

That's strange. The 'Server Status' page reports 1,296 unsent units, but my BOINC client doesn't receive any. This is what I have in the logs:

31/10/2023 11:11:12 PM | GPUGRID | Sending scheduler request: To fetch work.
31/10/2023 11:11:12 PM | GPUGRID | Requesting new tasks for NVIDIA GPU and Intel GPU
31/10/2023 11:11:15 PM | GPUGRID | Scheduler request completed: got 0 new tasks
31/10/2023 11:11:15 PM | GPUGRID | No tasks sent
31/10/2023 11:11:15 PM | GPUGRID | No tasks are available for ACEMD 3: molecular dynamics simulations for GPUs
31/10/2023 11:11:15 PM | GPUGRID | No tasks are available for ACEMD 4: molecular dynamics simulations for GPUs
31/10/2023 11:11:15 PM | GPUGRID | No tasks are available for ATM: Free energy calculations of protein-ligand binding
31/10/2023 11:11:15 PM | GPUGRID | Project requested delay of 31 seconds

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60792 - Posted: 31 Oct 2023 | 12:52:53 UTC - in response to Message 60785.

Got my first 1-10 task of this run - specifically,

JNK1_m06_m05_5-QUICO_ATM_Sch_GAFF2_xTB_MM_False_v3-1-10-RND0728_0

It jumped immediately to 100%, as before.

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 60793 - Posted: 31 Oct 2023 | 13:12:44 UTC - in response to Message 60792.
Last modified: 31 Oct 2023 | 13:51:22 UTC

Same here for JNK1_m06_m05_2-QUICO_ATM_Sch_AIMNet2_10-1-10-RND3720 - straight to 100%, whereas for TYK2_m03_m15_1-QUICO_ATM_Sch_GAFF2_xTB_MM_False_v3-0-10-RND2341_3 the progress worked just fine.

An interesting thing I happened to notice, though, which might or might not help the devs fix this:
- On the 0-10 task, I noticed that progress jumped up in 1.446% increments every 5 minutes.
- On the 1-10 task, there's a 'progress' file in the slot directory with a number that jumps up in 0.01449 increments every 5 minutes. This number starts with a 1, like this: 1.391304348 => 1.405797101 => ...

The hypothesis is that this number should start with a 0 instead of a 1, and then progress would be reported OK?

<edit> It would be interesting to see whether, if I get a chunk 2-10, this progress number starts with 2.xxx. Might be an 'easy' fix to subtract the chunk number from the progress number before writing?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60794 - Posted: 31 Oct 2023 | 15:13:26 UTC - in response to Message 60790.


the tasks now are "ATMbeta".

you need to go into your preferences and allow beta work.

That's been done long time ago. My current settings are:
ACEMD 3: yes
ACEMD 4: yes
ATM (beta): yes
Quantum Chemistry (CPU): yes
Quantum Chemistry (CPU, beta): no
Python Runtime (CPU, beta): no
Python Runtime (GPU, beta): yes



there's more to it. there are two options you need to select.

make sure this one is checked:
"Run test applications?"
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60795 - Posted: 31 Oct 2023 | 15:25:37 UTC - in response to Message 60793.
Last modified: 31 Oct 2023 | 15:32:41 UTC

Same here for JNK1_m06_m05_2-QUICO_ATM_Sch_AIMNet2_10-1-10-RND3720 - direct to 100%, whereas TYK2_m03_m15_1-QUICO_ATM_Sch_GAFF2_xTB_MM_False_v3-0-10-RND2341_3 progress worked just fine.

Interesting thing I happened to notice though - which might or might not help the devs fixing this:
- On the 0-10 task, I noticed that progress jumped up in 1.446% increments every 5 minutes.
- On the 1-10 task, there's a 'progress' file in the slot directory with a number that jumps up in 0.01449 increments every 5 minutes. this number starts with a 1, like this: 1.391304348 => 1.405797101 => ...

Hypothesis is that this number should start with a 0 instead of a 1 and then progress would be reported OK?

<edit> would be interesting to see if I get a chunck 2-10 if this progress number starts with a 2.xxx then. Might be an 'easy' fix to subtract the chunk number from the progress number before writing?


there's no need to reanalyze all this. it's been done ad nauseam many months ago and discussed over and over again. all of the behavior is known at this point.

tasks in the "0" group will all process and count the progress naturally.
tasks in "1+" groups will all jump to 100% after the extraction phase. but will complete successfully in the normal time if you leave it alone.

the root cause very likely has to do with how they have decided to configure their asyncre.cntl file. using a relative value rather than absolute based on what group it's in.

for all run types, the last line of this file is listed as just "MAX_SAMPLES = +70". it's very possible that the app is communicating to boinc the run percentage based on the ratio of current sample to max sample.

so for all "0" runs, the current sample will be less than the max sample. 1 through 70.

however, for all "1+" runs, the current sample will always be greater than max sample (71+). boinc can't report greater than 100% completion, so it jumps to 100% and stays there until the task is actually finished.
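A toy numeric sketch of that hypothesis (the sample numbers are invented, and the assumption that the reported fraction is current_sample / MAX_SAMPLES is exactly that, an assumption):

max=70                              # from "MAX_SAMPLES = +70" in asyncre.cntl
for current in 35 70 85 140; do
    awk -v c="$current" -v m="$max" \
        'BEGIN { f = c/m; if (f > 1) f = 1; printf "sample %3d -> fraction %.2f\n", c, f }'
done
# chunk 0 runs samples 1-70, so the ratio climbs normally from 0 to 1;
# chunk 1 continues at sample 71+, so the ratio is already above 1 and gets clamped to 100%.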
____________

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 60796 - Posted: 31 Oct 2023 | 16:01:36 UTC - in response to Message 60795.


the root cause very likely has to do with how they have decided to configure their asyncre.cntl file. using a relative value rather than absolute based on what group it's in.

for all run types, the last line of this file is listed as just "MAX_SAMPLES = +70". it's very possible that the app is communicating to boinc the run percentage based on the ratio of current sample to max sample.

so for all "0" runs, the current sample will be less than the max sample. 1 through 70.

however, for all "1+" runs, the current sample will always be greater than max sample (71+). boinc can't report greater than 100% completion, so it jumps to 100% and stays there until the task is actually finished.


Well, that would match what I'm seeing: if chunk 0 = samples 1-70 and chunk 1 = samples 71-140, then progress would run 0-1 in the first case and 1-2 in the second (and probably 2-3 for chunk #2, and so on).
If the app reports progress in that way, and they can just change it to subtract the chunk number, that would bring all progress back to the 0-1 range where it should be; see the sketch below.
Sounds like a reasonably easy fix - if there isn't more to it, obviously, and if they have the time to do it...
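A rough sketch of what that subtraction would look like, using the value quoted from the 'progress' file above (this is speculation about a possible fix, not how the app actually behaves):

raw=1.405797101                     # observed in the slot's 'progress' file on a 1-10 task
awk -v r="$raw" 'BEGIN { printf "%.9f\n", r - int(r) }'
# -> 0.405797101, back in the 0-1 range BOINC expects

Within a chunk the value never reaches the next integer, so stripping the integer part would be equivalent to subtracting the chunk number.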

goldfinch
Send message
Joined: 5 May 19
Posts: 31
Credit: 395,274,685
RAC: 102,466
Level
Asp
Scientific publications
wat
Message 60799 - Posted: 31 Oct 2023 | 22:37:13 UTC - in response to Message 60794.


there's more to it. there are two options you need to select.

make sure this one is checked:
"Run test applications?"

Sure, I have it checked:
Run test applications?
This helps us develop applications, but may cause jobs to fail on your computer yes
If no work for selected applications is available, accept work from other applications? yes
Use Graphics Processing Unit (GPU) if available yes
Use Central Processing Unit (CPU) yes

I have both test and other applications enabled; that's why I'm surprised not to have received anything while WUs are available now.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60800 - Posted: 31 Oct 2023 | 22:43:27 UTC - in response to Message 60799.

what is your cache setting in BOINC? how much work are you asking for? I think with GPUGRID if you ask for "too much" you end up getting this response and getting nothing. if you're asking for something like 10 days of work, that might explain it. set your work cache to like 1 day or less.
____________

goldfinch
Send message
Joined: 5 May 19
Posts: 31
Credit: 395,274,685
RAC: 102,466
Level
Asp
Scientific publications
wat
Message 60801 - Posted: 31 Oct 2023 | 22:48:55 UTC - in response to Message 60799.

I even removed the project from BOINC and re-added it, to no avail:

1/11/2023 9:46:37 AM | GPUGRID | New computer location: work
1/11/2023 9:46:39 AM | GPUGRID | Started download of logogpugrid.png
1/11/2023 9:46:39 AM | GPUGRID | Started download of project_1.png
1/11/2023 9:46:39 AM | GPUGRID | Started download of project_2.png
1/11/2023 9:46:39 AM | GPUGRID | Started download of project_3.png
1/11/2023 9:46:41 AM | GPUGRID | Finished download of logogpugrid.png (0 bytes)
1/11/2023 9:46:42 AM | GPUGRID | Finished download of project_1.png (0 bytes)
1/11/2023 9:46:42 AM | GPUGRID | Finished download of project_2.png (0 bytes)
1/11/2023 9:46:42 AM | GPUGRID | Finished download of project_3.png (0 bytes)
1/11/2023 9:47:11 AM | GPUGRID | Sending scheduler request: To fetch work.
1/11/2023 9:47:11 AM | GPUGRID | Requesting new tasks for NVIDIA GPU and Intel GPU
1/11/2023 9:47:15 AM | GPUGRID | Scheduler request completed: got 0 new tasks
1/11/2023 9:47:15 AM | GPUGRID | No tasks sent
1/11/2023 9:47:15 AM | GPUGRID | No tasks are available for ACEMD 3: molecular dynamics simulations for GPUs
1/11/2023 9:47:15 AM | GPUGRID | No tasks are available for ACEMD 4: molecular dynamics simulations for GPUs
1/11/2023 9:47:15 AM | GPUGRID | No tasks are available for ATM: Free energy calculations of protein-ligand binding
1/11/2023 9:47:15 AM | GPUGRID | Project requested delay of 31 seconds

goldfinch
Send message
Joined: 5 May 19
Posts: 31
Credit: 395,274,685
RAC: 102,466
Level
Asp
Scientific publications
wat
Message 60802 - Posted: 31 Oct 2023 | 22:50:38 UTC - in response to Message 60800.

My settings are:


Store at least 0.1 days
and up to an additional 0.75 days of work.

Are these settings too low?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60803 - Posted: 31 Oct 2023 | 22:52:57 UTC - in response to Message 60802.

those should be OK to get at least one task I think. as long as BOINC doesn't think they will take so long that they miss the deadline.
____________

goldfinch
Send message
Joined: 5 May 19
Posts: 31
Credit: 395,274,685
RAC: 102,466
Level
Asp
Scientific publications
wat
Message 60804 - Posted: 31 Oct 2023 | 22:53:28 UTC - in response to Message 60802.

My computing prefs:

1/11/2023 9:52:09 AM | | General prefs: using separate prefs for work
1/11/2023 9:52:09 AM | | Reading preferences override file
1/11/2023 9:52:09 AM | | Preferences:
1/11/2023 9:52:09 AM | | - When computer is in use
1/11/2023 9:52:09 AM | | - 'In use' means mouse/keyboard input in last 3.00 minutes
1/11/2023 9:52:09 AM | | - max CPUs used: 5
1/11/2023 9:52:09 AM | | - Use at most 85% of the CPU time
1/11/2023 9:52:09 AM | | - suspend if non-BOINC CPU load exceeds 80%
1/11/2023 9:52:09 AM | | - max memory usage: 7.84 GB
1/11/2023 9:52:09 AM | | - When computer is not in use
1/11/2023 9:52:09 AM | | - max CPUs used: 8
1/11/2023 9:52:09 AM | | - Use at most 85% of the CPU time
1/11/2023 9:52:09 AM | | - suspend if non-BOINC CPU load exceeds 80%
1/11/2023 9:52:09 AM | | - max memory usage: 14.12 GB
1/11/2023 9:52:09 AM | | - Leave apps in memory if not running
1/11/2023 9:52:09 AM | | - Store at least 1.00 days of work
1/11/2023 9:52:09 AM | | - Store up to an additional 1.00 days of work
1/11/2023 9:52:09 AM | | - max disk usage: 80.00 GB
1/11/2023 9:52:09 AM | | - (to change preferences, visit a project web site or select Preferences in the Manager)

goldfinch
Send message
Joined: 5 May 19
Posts: 31
Credit: 395,274,685
RAC: 102,466
Level
Asp
Scientific publications
wat
Message 60805 - Posted: 31 Oct 2023 | 22:55:43 UTC - in response to Message 60803.

those should be OK to get at least one task I think. as long as BOINC doesnt think they will take so long to miss the deadline.

My laptop used to crunch ATM tasks successfully in the past, before the summer break. Is there anything different now? I mean, system requirements - have they changed dramatically in terms of available memory and graphics cards?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60806 - Posted: 31 Oct 2023 | 23:39:26 UTC - in response to Message 60804.

i see that the host is linking up to your "work" venue/location. verify that the settings for the work venue allow beta/test applications.
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,098,952,024
RAC: 8,566,567
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60807 - Posted: 31 Oct 2023 | 23:42:10 UTC - in response to Message 60801.

Your system seems not to be asking for ATMbeta tasks.
Try taking a look at message #60725.

goldfinch
Send message
Joined: 5 May 19
Posts: 31
Credit: 395,274,685
RAC: 102,466
Level
Asp
Scientific publications
wat
Message 60808 - Posted: 1 Nov 2023 | 1:07:54 UTC - in response to Message 60806.

This 'work' profile comes from WCG and doesn't have anything about beta-testing tasks. Besides, it didn't cause any issues in the past. Where else should I look? The logs show that it's GPUGRID that doesn't send me new tasks:

1/11/2023 12:02:55 PM | GPUGRID | No tasks are available for ATM: Free energy calculations of protein-ligand binding
That is, it appears to me that my BOINC client sends requests to the server but receives nothing. I enabled debug logging in BOINC; this is the output:
1/11/2023 12:05:39 PM | | [work_fetch] ------- start work fetch state -------
1/11/2023 12:05:39 PM | | [work_fetch] target work buffer: 86400.00 + 86400.00 sec
1/11/2023 12:05:39 PM | | [work_fetch] --- project states ---
1/11/2023 12:05:39 PM | GPUGRID | [work_fetch] REC 0.000 prio -0.000 can request work
1/11/2023 12:05:39 PM | | [work_fetch] --- state for CPU ---
1/11/2023 12:05:39 PM | | [work_fetch] shortfall 844411.53 nidle 0.00 saturated 2273.24 busy 0.00
1/11/2023 12:05:39 PM | GPUGRID | [work_fetch] share 0.000 no applications
1/11/2023 12:05:39 PM | | [work_fetch] --- state for NVIDIA GPU ---
1/11/2023 12:05:39 PM | | [work_fetch] shortfall 172800.00 nidle 1.00 saturated 0.00 busy 0.00
1/11/2023 12:05:39 PM | GPUGRID | [work_fetch] share 0.000 project is backed off (resource backoff: 502.21, inc 600.00)
1/11/2023 12:05:39 PM | | [work_fetch] --- state for Intel GPU ---
1/11/2023 12:05:39 PM | | [work_fetch] shortfall 172800.00 nidle 1.00 saturated 0.00 busy 0.00
1/11/2023 12:05:39 PM | GPUGRID | [work_fetch] share 0.000 project is backed off (resource backoff: 255.48, inc 600.00)
1/11/2023 12:05:39 PM | | [work_fetch] ------- end work fetch state -------
1/11/2023 12:05:39 PM | GPUGRID | choose_project: scanning
1/11/2023 12:05:39 PM | GPUGRID | can't fetch CPU: no applications
1/11/2023 12:05:39 PM | GPUGRID | can't fetch NVIDIA GPU: project is backed off
1/11/2023 12:05:39 PM | GPUGRID | can't fetch Intel GPU: project is backed off
Though I don't know what that means.

goldfinch
Send message
Joined: 5 May 19
Posts: 31
Credit: 395,274,685
RAC: 102,466
Level
Asp
Scientific publications
wat
Message 60809 - Posted: 1 Nov 2023 | 1:52:58 UTC - in response to Message 60807.

Your system seems not to be asking for ATMbeta tasks.
Try taking a look to message #60725
Thanks for the response. However, I checked earlier and confirmed that beta tasks were enabled: see
my post
Will appreciate any help in resolving this.

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,098,952,024
RAC: 8,566,567
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60810 - Posted: 1 Nov 2023 | 7:42:47 UTC - in response to Message 60808.
Last modified: 1 Nov 2023 | 7:49:31 UTC

When properly configured, the message should say something like this:

1/11/2023 12:02:55 PM | GPUGRID | No tasks are available for ATMbeta: Free energy calculations of protein-ligand binding

Try following message #60725, and access the specific links it contains to GPUGRID Project Preferences and GPUGRID Hosts.
Edit the Preferences as stated for the "Home" venue (for example), then change your host to the "Home" location. This should work.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60811 - Posted: 1 Nov 2023 | 11:59:31 UTC - in response to Message 60810.
Last modified: 1 Nov 2023 | 11:59:52 UTC

When properly configured, the message should say something like this:

1/11/2023 12:02:55 PM | GPUGRID | No tasks are available for ATMbeta: Free energy calculations of protein-ligand binding

Try following message #60725, and access the specific links it contains to GPUGRID Project Preferences and GPUGRID Hosts.
Edit the Preferences as stated for the "Home" venue (for example), then change your host to the "Home" location. This should work.


This is exactly what I meant about checking the venue settings. Many people do not know about the different venues or how they are set. The WCG-supplied computing web preferences are not the same thing as the project-specific host venue preferences. You can make different selections for different venues, which gives folks the ability to have different computers crunch different things within the same project.

If your 'home' or 'default' (blank) preferences allow ATMbeta, but the host is set to the work venue which does not allow ATMbeta, then you won't get them. You need to be mindful of what venue the host is set to, and what the specific settings for that venue are.

goldfinch, go here: https://gpugrid.net/prefs.php?subset=project and you will see that there are 4 different venues to choose from (default/home/school/work). Make sure you are setting the preferences to allow ATMbeta and test apps for the venue corresponding to your host's actual selected venue. You can see what venue it's set to here: https://gpugrid.net/hosts_user.php under the location column (blank = default).
____________

goldfinch
Send message
Joined: 5 May 19
Posts: 31
Credit: 395,274,685
RAC: 102,466
Level
Asp
Scientific publications
wat
Message 60812 - Posted: 1 Nov 2023 | 12:15:04 UTC - in response to Message 60811.

Thank you @Ian&Steve C. and @ServicEnginIC, I didn't realise that I was checking the *default* profile, while my *venue* profile was *work*, and the latter didn't have the Test tasks checkbox ticked. Tons of appreciation for your patience! Thank you very much!

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 60813 - Posted: 1 Nov 2023 | 12:54:17 UTC - in response to Message 60795.

there's no need to reanalyze all this. it's been done ad nauseam many months ago and discussed over and over again. all of the behavior is known at this point.

tasks in the "0" group will all process and count the progress naturally.
tasks in "1+" groups will all jump to 100% after the extraction phase. but will complete successfully in the normal time if you leave it alone.


Well, I did reanalyze it. Because I'm stubborn like that. But mainly because I failed to notice the 400+ posts hidden by default in this thread. :-D

The good news:
I came to the same conclusion as Richard Haselgrove's post:
progress = float(isample - last_sample)/float(num_samples - last_sample)

should fix it, but even better would be:
progress = float(isample - last_sample + 1)/float(num_samples - last_sample + 1)

Since that would make the denominator = number of samples in the batch (fractions of 1/70 instead of now 1/69) and would let the count go from 1->70 instead of 0->69.
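As a quick sanity check of that arithmetic, here is a minimal standalone Python sketch (not the project code; the variable names just mirror atm.py, and the batch sizes/start cycles are assumed from the discussion above):

# Standalone check of the proposed formula (illustration only, not atm.py itself)
def progress_values(last_sample, num_extra_samples=70):
    num_samples = num_extra_samples + last_sample - 1
    return [float(isample - last_sample + 1) / float(num_samples - last_sample + 1)
            for isample in range(last_sample, num_samples + 1)]

first_batch = progress_values(last_sample=1)    # a "0" task: samples 1..70
later_batch = progress_values(last_sample=71)   # a "1+" task: samples 71..140
print(first_batch[0], first_batch[-1])          # 1/70 ... 1.0
print(later_batch[0], later_batch[-1])          # also 1/70 ... 1.0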

The bad news:
None of the GitHub repos containing the above code, and being retrieved at WU start, contain any branch or issue aiming to fix the progress issue.

So I'll test my code fix locally once more. The first test seemed to work fine, but the WU terminated quickly on the NaN issue.
Then I'll raise an issue or a pull request on the appropriate GitHub repo to try and get it fixed there.

Fingers crossed...

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60814 - Posted: 1 Nov 2023 | 13:40:51 UTC - in response to Message 60813.

Refresh my memory:

isample = current sample?
last_sample = previous sample?
num_samples = MAX_SAMPLES?

If so, I don't think that code will work.

isample - last_sample will always = 1

max_samples being "+70" for ALL units (0-10, 1-10, 2-10, etc.) means that it's a relative value, not absolute. I think this is the crux of the issue, because the sample range for a 1-10 unit, for example, is actually [71,140], and for a 2-10 unit it would be [141,210], and so on, but the denominator is likely always using 70. So anything past the 0-10 unit's [1-70] range is a value >1 and is represented in BOINC by just maxing out to 100% straight away.


____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60815 - Posted: 1 Nov 2023 | 13:53:12 UTC - in response to Message 60814.

I think it's this section that has some problem, but I haven't fully digested exactly what it's doing yet.

starting from line 102 here: https://github.com/Gallicchio-Lab/AToM-OpenMM/blob/master/sync/atm.py

last_sample = self.replicas[0].get_cycle()
num_samples = self.config['MAX_SAMPLES']
if num_samples.startswith("+"):
    num_extra_samples = int(num_samples[1:])
    num_samples = num_extra_samples + last_sample - 1
    self.logger.info(f"Additional number of samples: {num_extra_samples}")

____________

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 60818 - Posted: 1 Nov 2023 | 15:06:58 UTC - in response to Message 60815.

I think it's this section that has some problem, but I haven't fully digested exactly what it's doing yet.

starting from line 102 here: https://github.com/Gallicchio-Lab/AToM-OpenMM/blob/master/sync/atm.py

last_sample = self.replicas[0].get_cycle()
num_samples = self.config['MAX_SAMPLES']
if num_samples.startswith("+"):
    num_extra_samples = int(num_samples[1:])
    num_samples = num_extra_samples + last_sample - 1
    self.logger.info(f"Additional number of samples: {num_extra_samples}")


For normal units (0-whatever), MAX_SAMPLES will be "70" and last_sample = 1.
In that case num_samples will be 70. isample iterating from 1->70 (inclusive).
So my formula's denominator will be num_samples-last_sample + 1 or 70-1+1=70.
The numerator (isample - last_sample + 1) goes from 1-1+1=1 until 70-1+1=70.
So it works for regular units.

For additional units (>0-whatever), MAX_SAMPLES will be "+70" and last_sample will be 71 or 141 or...
In that case the 'if num_samples.startswith("+")' clause will be triggered.
num_extra_samples will be 70 and num_samples = num_extra_samples + last_sample - 1, giving 70 + 71 - 1 = 140 (or 210, or...)
isample will iterate from 71->140 or 141->210 or...
Denominator will be 140 - 71 + 1 = 70 or 210 - 141 + 1 = 70 or...
Numerator will be 71-71+1=1 until 140-71+1=70, or 141-141+1=1 until 210-141+1=70, or...

so in both cases, whatever the NUM_SAMPLES may be and whatever the first sample number may be, the progress will go from 1/NUM_SAMPLES to NUM_SAMPLES/NUM_SAMPLES and you will get a nice representative percentage.
Except of course for the 0.199% added in the beginning for the unpack tasks...




Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60819 - Posted: 1 Nov 2023 | 15:28:10 UTC - in response to Message 60818.

Did you follow the exact logic of the code all the way through? Or are you making some assumptions about how you think it works?

For starters, ALL tasks ("0" or "1+") get +70 in MAX_SAMPLES; check your async.cntl for confirmation. It's relative all the time, just that the starting point for the 0 is 0, and the starting point for 1+ is whatever the end of the previous segment was. So all tasks follow the same code path in that respect.

Second, where do you get that last_sample=1? The code says last_sample = self.replicas[0].get_cycle(), but I haven't worked through the code yet to see what that actually evaluates to. Can you elaborate with specific code paths showing where "self.replicas[0].get_cycle()" = 1?

Similarly with num_extra_samples, how does this equal 70? It needs to be expanded from num_extra_samples = int(num_samples[1:]). Not sure how the unbounded num_samples[1:] ends up being 70 in this case.

I still think a problem lies in this section.


____________

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 60820 - Posted: 1 Nov 2023 | 16:26:25 UTC - in response to Message 60819.
Last modified: 1 Nov 2023 | 16:27:17 UTC

Did you follow the exact logic of the code all the way through? Or are you making some assumptions about how you think it works?

For starters, ALL tasks ("0" or "1+") get +70 in MAX_SAMPLES; check your async.cntl for confirmation. It's relative all the time, just that the starting point for the 0 is 0, and the starting point for 1+ is whatever the end of the previous segment was. So all tasks follow the same code path in that respect.


some comments added to follow along:

num_samples = self.config['MAX_SAMPLES']                # HERE, num_samples is a string type!
if num_samples.startswith("+"):
    num_extra_samples = int(num_samples[1:])            # [1:] skips the first character, so "+70" becomes "70"; int() casts it to an integer
    num_samples = num_extra_samples + last_sample - 1   # HERE, num_samples becomes an integer type
    self.logger.info(f"Additional number of samples: {num_extra_samples}")
else:
    num_samples = int(num_samples)                      # HERE, num_samples becomes an integer
    self.logger.info(f"Target number of samples: {num_samples}")


It doesn't matter if it's always "+70": for "0" tasks last_sample will be 1, so num_samples = num_extra_samples + last_sample - 1 = 70 + 1 - 1 = 70.

Second, where do you get that last_sample=1? The code says last_sample = self.replicas[0].get_cycle(), but I haven't worked through the code yet to see what that actually evaluates to. Can you elaborate with specific code paths showing where "self.replicas[0].get_cycle()" = 1?


The get_cycle() calculation is buried deep somewhere in the OpenMM libraries, so I haven't managed to pin it down exactly and can't prove it to you; however, empirically (run.log) the "0" units start from cycle 1 and the "1" units start from cycle 71.
Also it doesn't really matter what the last_sample is.
Let's say it's X.

Then:

num_samples = num_extra_samples + last_sample - 1 = 70 + X - 1
for isample in range(last_sample, num_samples + 1):
=> isample going from X until X + 70 - 1 (last integer of 'range' = excluded!)

numerator = isample - last_sample + 1
=> numerator going from X - X + 1 = 1 until X + 70 - 1 - X + 1 = 70

denominator = num_samples - last_sample + 1 = 70 + X - 1 - X + 1 = 70

so again progress will go from 1/70 to 70/70

Replace 70 everywhere by an arbitrary value of MAX_SAMPLES and again see that it doesn't matter what value is in there; it will work as expected.



Similarly with num_extra_samples, how does this equal 70? It needs to be expanded from num_extra_samples = int(num_samples[1:]). Not sure how the unbounded num_samples[1:] ends up being 70 in this case.

I still think a problem lies in this section.



Simple python string operation. Python is very clever and flexible with types. See my added comments in the first code snippet.

num_samples = self.config['MAX_SAMPLES']    # HERE, num_samples is a string type!


so it's a string here, but it's also an array of characters, where num_samples[0] will be a '+' in most cases. If it is a plus, then the IF-clause will trigger.

num_extra_samples = int(num_samples[1:])    # [1:] skips the first character, so "+70" becomes "70"; int() casts it to an integer


Since [0] is a '+', [1:] will be "70", because if the end of the slice is left empty, Python 'knows' how long the string is and returns everything up to the end of the string but not beyond. So not 'unbounded', but 'implicitly bounded'.
The int() around it re-types the "70" string into the integer 70. Assigning it to num_extra_samples redefines that variable as an 'int' (Python magic again).

And if it wasn't a '+' because MAX_SAMPLES was "70", then the 'else' clause will trigger:
num_samples = int(num_samples)


This will simply redefine the 'string' num_samples = "70" to an 'int' num_samples = 70

Plug that into the example above and see that, once again, the progress counter works as it should.
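For anyone who wants to see that string handling in isolation, here is a tiny standalone snippet (plain Python, not the AToM code; the "+70" value and the starting cycle 71 are just example inputs):

# Illustration only: how the MAX_SAMPLES string is re-typed
num_samples = "+70"                        # string, as read from the config
num_extra_samples = int(num_samples[1:])   # slice drops the '+', int() re-types "70" -> 70
print(type(num_samples).__name__, type(num_extra_samples).__name__)  # str int

last_sample = 71                           # example start cycle of a "1+" task
num_samples = num_extra_samples + last_sample - 1
print(num_extra_samples, num_samples)      # 70 140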

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60821 - Posted: 1 Nov 2023 | 17:04:05 UTC - in response to Message 60820.

OK, I'm following better (I forgot that [1:] omits the first character and was thinking it started from one, with the 0/1 mixup in what is "first"). I also couldn't find the get_cycle() routine. The way these tasks (scripts) are set up is very convoluted with how all the pieces of code get pulled in: importing bits of code from all over the place while the actual execution script is only a few lines long, lol. You really have to go down a rabbit hole to see what's happening.

so it seems the original progress calculation of

progress = float(isample)/float(num_samples - last_sample)


ends up evaluating to a negative number since the num_samples will be 70 and the last sample will be >70. BOINC must be freaking out not knowing what to do with a negative number and calling it 100%.
____________

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 60822 - Posted: 1 Nov 2023 | 17:14:57 UTC - in response to Message 60821.
Last modified: 1 Nov 2023 | 17:18:04 UTC

OK I'm following better (forgot that [1:] was omitting the first character and was thinking it was starting from one with the 0/1 mixup in what is "first"). I also couldn't find the get_cycle() routine. the way these tasks (scripts) are setup is very convoluted with how all the pieces of code get pulled in. importing bits of code from all over the place while the actual execution script is only a few lines long lol. really have to go down a rabbit hole to see what's happening.

so it seems the original progress calculation of
progress = float(isample)/float(num_samples - last_sample)


ends up evaluating to a negative number since the num_samples will be 70 and the last sample will be >70. BOINC must be freaking out not knowing what to do with a negative number and calling it 100%.


Not negative, no, but for "1+" units it will go beyond 1, and also (a minor issue) increment in fractions of 1/69 instead of 1/70.

Remember that num_samples = the max_samples parameter PLUS the last_sample!

isample will go from 71-140 (or 141-210 etc)
num_samples will be 140 or 210 or...
last_sample will be 71 or 141 or... (but remember it doesn't really matter)

so numerator 71=>140, denominator = 140 - 71 = 69 (or 210 - 141 = 69 or...)

progress going from 71/69 until 140/69. Both > 1 so progress immediately jumps to 100%.
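To make that jump visible, here is a short standalone sketch of the original formula for a "1+" batch (illustration only; the 71..140 range is taken from the example above):

# Original formula evaluated for a hypothetical "1+" task (samples 71..140)
last_sample = 71
num_samples = 70 + last_sample - 1                   # 140
old = [float(isample) / float(num_samples - last_sample)
       for isample in range(last_sample, num_samples + 1)]
print(round(old[0], 3), round(old[-1], 3))           # 1.029 ... 2.029, always > 1, so BOINC shows 100%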

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60823 - Posted: 1 Nov 2023 | 18:05:48 UTC - in response to Message 60822.

You're right, I missed that bit. Thanks.

I think a pull request to the original code would be good if you're able to do that. It will probably be the only way to get it to the coder's attention, since there is a bit of a game of telephone between the users and the person who has to ultimately make the change.

The other option would be to insert your own version of the atm.py code, and splice in some command(s) to replace it right after it downloads the original copy. This could be done by modifying the wrapper config file (job.xml.[...]) with an additional command to swap in a new version of run.sh; in the new run.sh there needs to be a new version of rbfe_explicit_sync.py; and finally in rbfe_explicit_sync.py can be your command to copy in the new version of atm.py. Very convoluted, but it should work to automate the changes for you locally on newly downloaded tasks without having to stop them (which would fail the task) or trying to intercept things on the fly.
____________

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 60824 - Posted: 1 Nov 2023 | 18:28:49 UTC - in response to Message 60823.

You're right, I missed that bit. Thanks.

I think a pull request to the original code would be good if you're able to do that. It will probably be the only way to get it to the coder's attention, since there is a bit of a game of telephone between the users and the person who has to ultimately make the change.

The other option would be to insert your own version of the atm.py code, and splice in some command(s) to replace it right after it downloads the original copy. This could be done by modifying the wrapper config file (job.xml.[...]) with an additional command to swap in a new version of run.sh; in the new run.sh there needs to be a new version of rbfe_explicit_sync.py; and finally in rbfe_explicit_sync.py can be your command to copy in the new version of atm.py. Very convoluted, but it should work to automate the changes for you locally on newly downloaded tasks without having to stop them (which would fail the task) or trying to intercept things on the fly.


That's basically how I'm testing it on my machine, but that would also imply somebody does the same on the server side - adding the new atm.py and run.sh to the 'program package'. If you do it locally, you also need to edit client_state.xml with BOINC stopped, to bypass the code-signing mechanism by inserting the correct md5sum and byte size. FYI - run.sh is part of the server-generated input files for each WU.

I'm not a programmer (or not anymore), so I have no real git skills, but I did post an issue on the relevant GitHub. If that's not picked up I'll try the pull request.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60825 - Posted: 1 Nov 2023 | 18:59:49 UTC - in response to Message 60824.

You can set <dont_check_file_sizes> in cc_config.xml and change anything you want in BOINC :) The only "BOINC" file you need to modify is the job.xml, and you set it up to copy in the new run.sh after it extracts the input files (overwriting the original). Only their tar file is checked, not the extracted contents.
____________

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 60826 - Posted: 1 Nov 2023 | 20:37:42 UTC - in response to Message 60825.

You can set <dont_check_file_sizes> in cc_config.xml and change anything you want in BOINC :) The only "BOINC" file you need to modify is the job.xml, and you set it up to copy in the new run.sh after it extracts the input files (overwriting the original). Only their tar file is checked, not the extracted contents.


Could be, although I'm reading this entry as "BOINC will check the integrity of this file (job.xml) to avoid tampering"

<file>
<name>job.xml.789bd8d206da56434f30083d18653299</name>
<nbytes>828.000000</nbytes>
<max_nbytes>0.000000</max_nbytes>
<status>1</status>
<signature_required/>
<file_signature>
4b7b99c3260c591fe387d31d63158d0061c1b2fb5ef74395eada7cbb13c67b80
...etcetera...
0e12d16e50df943339987857aa157b863ad1dcbb8712cd0e21c1968fc7ca561a
.
</file_signature>


But it's a moot point, isn't it? It would potentially fix the issue for me or for anyone willing to put in the tweaking effort but not for the general user.
I'll give it a try though. ;-)

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60827 - Posted: 1 Nov 2023 | 20:50:19 UTC - in response to Message 60826.

i did the same thing on PythonGPU earlier this year. worked fine.


____________

goldfinch
Send message
Joined: 5 May 19
Posts: 31
Credit: 395,274,685
RAC: 102,466
Level
Asp
Scientific publications
wat
Message 60828 - Posted: 2 Nov 2023 | 9:25:02 UTC

I had this task running for 3+ hours, and the laptop hung. I switched it off and back on, and restarted BOINC - the task failed. I saw a post about tasks failing on restarts, but that task's output is different from mine. I have a few questions:

    - can tasks be restarted?
    - if not, what's the purpose of the checkpoints?
    - what was wrong with my task after restart, and where/how can I get more detailed info, if needed? The BOINC log doesn't show the previous session's log, only the current one where the task failed, but I need the logs from the time of the hang...

As for the BOINC logs, here they are:

2/11/2023 7:22:35 PM | GPUGRID | [task_debug] task is running in processor group 0
2/11/2023 7:22:35 PM | GPUGRID | [task] task_state=EXECUTING for TYK2_m15_m16_5-QUICO_ATM_Sch_GAFF2_xTB_MM_False_v3-2-10-RND4709_0 from start
2/11/2023 7:23:34 PM | GPUGRID | [task] Process for TYK2_m15_m16_5-QUICO_ATM_Sch_GAFF2_xTB_MM_False_v3-2-10-RND4709_0 exited, exit code 195, task state 1
2/11/2023 7:23:34 PM | GPUGRID | [task] task_state=EXITED for TYK2_m15_m16_5-QUICO_ATM_Sch_GAFF2_xTB_MM_False_v3-2-10-RND4709_0 from handle_exited_app
2/11/2023 7:23:34 PM | GPUGRID | [task] result state=COMPUTE_ERROR for TYK2_m15_m16_5-QUICO_ATM_Sch_GAFF2_xTB_MM_False_v3-2-10-RND4709_0 from CS::report_result_error
2/11/2023 7:23:34 PM | GPUGRID | [task] Process for TYK2_m15_m16_5-QUICO_ATM_Sch_GAFF2_xTB_MM_False_v3-2-10-RND4709_0 exited
2/11/2023 7:23:34 PM | GPUGRID | [task] exit code 195 (0xc3): The operating system cannot run %1. (0xc3)
2/11/2023 7:23:36 PM | GPUGRID | Computation for task TYK2_m15_m16_5-QUICO_ATM_Sch_GAFF2_xTB_MM_False_v3-2-10-RND4709_0 finished
2/11/2023 7:23:36 PM | GPUGRID | Output file TYK2_m15_m16_5-QUICO_ATM_Sch_GAFF2_xTB_MM_False_v3-2-10-RND4709_0_0 for task TYK2_m15_m16_5-QUICO_ATM_Sch_GAFF2_xTB_MM_False_v3-2-10-RND4709_0 absent
2/11/2023 7:23:36 PM | GPUGRID | [task] result state=COMPUTE_ERROR for TYK2_m15_m16_5-QUICO_ATM_Sch_GAFF2_xTB_MM_False_v3-2-10-RND4709_0 from CS::app_finished
Should I activate other debug logs for such cases? If yes, which?

Thanks for your help.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60829 - Posted: 2 Nov 2023 | 10:29:34 UTC - in response to Message 60828.

At this time, tasks cannot be restarted at all. This is because checkpointing is broken in some way that the devs haven't figured out. Checkpointing is there because it's *supposed* to work; it just doesn't right now. Restarting for any reason will cause a task to fail.

Not sure why your task hung. It could be because it's a laptop and overheated, or it could be because it's Windows. The Windows application has a lot more problems than the Linux application.
____________

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 60830 - Posted: 2 Nov 2023 | 11:06:07 UTC - in response to Message 60827.

i did the same thing on PythonGPU earlier this year. worked fine.



Well, I tried; it didn't work until I killed BOINC and edited the (new) file size into client_state.xml. Even though the logfile clearly showed 'don't check file sizes' enabled, it failed due to a job.xml size mismatch.

Either a bug in the latest version, or some setting overriding it, like this one?

<signature_required/>


anyway, with the client_state size edit it does work.

made these changes:

<task>
<application>C:/Windows/system32/cmd.exe</application>
<command_line>/c copy ..\..\newrun.bat run.bat</command_line>
<weight>1</weight>
</task>
<task>
<application>C:/Windows/system32/cmd.exe</application>
<command_line>/c call run.bat *.cntl</command_line>
<setenv>CUDA_DEVICE=$GPU_DEVICE_NUM</setenv>
<stdout_filename>run.log</stdout_filename>
<weight>1000</weight>
<fraction_done_filename>progress</fraction_done_filename>
</task>


The '*.cntl' command-line argument added to run.bat is needed because the input config file xxxxx.cntl is actually hardcoded into run.bat for each WU - run.bat is part of the WU files.

So changes done to run.bat:
at the beginning:

set PARM1=%1
echo %PARM1%
for %%A in (%PARM1%) do (set "CONFIG_FILE=%%A")
echo %CONFIG_FILE%


...and deleted the original line setting the CONFIG_FILE variable

This will load the (alphabetically last) config file - hoping they never include more than one... ;-)

replace atm.py:
@echo Replace atm.py
copy ..\..\projects\www.gpugrid.net\atm_correct_progress.py Lib\site-packages\sync
rename Lib\site-packages\sync\atm.py atm.py.orig
rename Lib\site-packages\sync\atm_correct_progress.py atm.py

@echo Run AToM
python.exe Scripts\rbfe_explicit_sync.py %CONFIG_FILE% || goto EX22


and some exit handling to preserve relevant output files

set LEVEL=%ERRORLEVEL%

:EXIT
copy run.log ..\..\projects\www.gpugrid.net\
copy stderr.txt ..\..\projects\www.gpugrid.net\
copy progress ..\..\projects\www.gpugrid.net\

exit %LEVEL%

:EX14
set LEVEL=14
goto EXIT

:EX22
set LEVEL=22
goto EXIT


And now it's working fine with a 3-10 job (samples 211-280) and correct progress - no manual intervention.
Thanks for the tip!

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60831 - Posted: 2 Nov 2023 | 11:35:03 UTC - in response to Message 60830.

Glad to see it's working. A good amount of work for just a quality-of-life change, though. And the changes for Windows seem a bit more involved than they would be on Linux. Maybe the Windows vs Linux client is why you couldn't get it working initially? I assume you stopped and restarted BOINC after the change to cc_config, and not just re-read the config file; BOINC has to be restarted.

Either way. Cool that it works. Hopefully that can be pushed up to the devs through GitHub.
____________

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 60832 - Posted: 2 Nov 2023 | 11:40:29 UTC - in response to Message 60831.

Glad to see it’s working. A good amount of work for just a quality of life change though.


Yeah, and it didn't even really bother me in the first place. :-D I just like a good puzzle.

No response on the GIT issue yet, though....

goldfinch
Send message
Joined: 5 May 19
Posts: 31
Credit: 395,274,685
RAC: 102,466
Level
Asp
Scientific publications
wat
Message 60838 - Posted: 3 Nov 2023 | 8:55:04 UTC - in response to Message 60830.

Dear [BAT] Svennemans, can you please consolidate your fix in one post? I got what you did with the config files, but I couldn't follow what you changed in the Python files or where you put your new run.bat, because I'm not familiar with the project structure in BOINC and, particularly, with ATM projects. I also couldn't find the name and location of the modified config file. I think not only I but others, too, will appreciate a guide in the form of steps, especially considering that your system is Windows like mine... E.g.,


  1. go to %programdata%\BOINC\slots\0\Lib\site-packages\sync
  2. change atm.py as per such and such post
  3. go to ...
  4. change so and so as per such and such post
  5. create new run.bat and place it ...
  6. ...


Again, if you have time for that, your guide will be much appreciated. I would love to test your solution. (I have basic Python and cmd/powershell scripting skills - not enough to figure out things for myself, but enough to apply a fix.)
Thank you.

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 60839 - Posted: 3 Nov 2023 | 14:39:44 UTC - in response to Message 60838.

Sure I can, goldfinch.

Buckle up, because as Ian&Steve C. said:

A good amount of work for just a quality of life change


Here goes:

Procedure for Windows!

  1. Get an ATMbeta running task and set GPUGRID to 'no new tasks' because you'll need to stop BOINC later which would crash any running ATM task.
    If you do have queued up ATMbeta tasks, suspend them before they start.
  2. Figure out in which slot directory within %programdata%\BOINC\slots\ the ATM task is running - let's call that <atmslot>
  3. Determine where you'll put your locally changed files. I chose to put them in the project directory %programdata%\BOINC\projects\www.gpugrid.net\ but that's up to personal choice. You will need the path to this directory for the script changes, either absolute (e.g. C:\ProgramData\BOINC\projects\www.gpugrid.net\) or relative from the perspective of the <atmslot> directory (e.g. ..\..\projects\www.gpugrid.net)
    Let's call this the <localcopy> directory
  4. Copy <atmslot>\run.bat to <localcopy>\newrun.bat
    Copy <atmslot>\Lib\site-packages\sync\atm.py to <localcopy>\atm_correct_progress.py
    Copy %programdata%\BOINC\projects\www.gpugrid.net\job.xml.<a-very-long-number> to <localcopy>\newjob.xml
    Caution: When you're running other GPUGRID applications besides ATMbeta, there may be more than one job.xml.<very-long-number> present! To find the correct one, open <atmslot>\job.xml and check which one to use in all subsequent steps of this procedure.
    <atmslot>\job.xml:
    <soft_link>../../projects/www.gpugrid.net/job.xml.789bd8d206da56434f30083d18653299</soft_link>

  5. Edit <localcopy>\atm_correct_progress.py
    Find the line (approx #139) that says
    # Report progress on GPUGRID
    progress = float(isample)/float(num_samples - last_sample)

    and change it to this
    # Report progress on GPUGRID
    progress = float(isample - last_sample + 1)/float(num_samples - last_sample + 1)

    ATTENTION to readers that are not Python-savvy!! Be very careful not to touch/change any of the whitespace leading those lines, as Python treats that as relevant. Safest to not copy/paste the above lines, just manually make the changes!
    Save&close
  6. Edit <localcopy>\newrun.bat

    1. Add the following lines to the beginning of the file:
      set PARM1=%1
      echo %PARM1%
      for %%A in (%PARM1%) do (set "CONFIG_FILE=%%A")
      echo %CONFIG_FILE%

    2. Look for this line
      python.exe -m pip install %REPO_URL% || exit 14

      And change it to this
      python.exe -m pip install %REPO_URL% || goto EX14

      This is not strictly necessary, but together with the errorhandling in 6.5 will preserve a copy of logfiles when something goes wrong to facilitate debugging.
    3. Look for these lines
      @echo Extract restart
      tar.exe xjvf restart.tar.bz2 || true

      And insert code to copy the local version of atm.py
      @echo Extract restart
      tar.exe xjvf restart.tar.bz2 || true

      @echo Replace atm.py
      copy ..\..\projects\www.gpugrid.net\atm_correct_progress.py Lib\site-packages\sync
      rename Lib\site-packages\sync\atm.py atm.py.orig
      rename Lib\site-packages\sync\atm_correct_progress.py atm.py

      NOTE: I gave the example where <localcopy> = '..\..\projects\www.gpugrid.net\' which would work for anyone - but remember to change it if you put your <localcopy> elsewhere.
      NOTE2: I'm aware I could have directly copied over atm.py but this gives me a visual confirmation when inspecting <atmslot>\Lib\site-packages\sync that my script is working fine.
    4. Look for these lines
      @echo Run AToM
      set CONFIG_FILE=TYK2_m08_m01_asyncre.cntl
      python.exe Scripts\rbfe_explicit_sync.py %CONFIG_FILE% || exit 22

      Your 'set' statement will be different, because the config file 'xxxxxx.cntl' is different for each task. Don't worry about it.
      Delete the 'set' statement and make following changes to the 'python' statement
      @echo Run AToM
      python.exe Scripts\rbfe_explicit_sync.py %CONFIG_FILE% || goto EX22

      Note again the optional but useful change for errorhandling purposes
    5. Add the following lines to the end of the file
      set LEVEL=%ERRORLEVEL%

      :EXIT
      copy Lib\site-packages\sync\atm.py ..\..\projects\www.gpugrid.net\
      copy run.bat ..\..\projects\www.gpugrid.net\
      copy progress ..\..\projects\www.gpugrid.net\
      copy stderr.txt ..\..\projects\www.gpugrid.net\
      copy run.log ..\..\projects\www.gpugrid.net\

      exit %LEVEL%

      :EX14
      set LEVEL=14
      goto EXIT

      :EX22
      set LEVEL=22
      goto EXIT


      This allows you to inspect logfiles if the ATM task should fail unexpectedly and allows you to check if corrected versions of run.bat and atm.py were indeed copied into <atmslot>
    6. Save & Close


  7. Edit <localcopy>\newjob.xml

    1. Look for the following <task> entry
      <task>
      <application>Library/usr/bin/tar.exe</application>
      <command_line>xjvf input.tar.bz2</command_line>
      <setenv>PATH=$PWD/Library/usr/bin</setenv>
      <weight>1</weight>
      </task>

      and insert a new <task> behind the above
      <task>
      <application>Library/usr/bin/tar.exe</application>
      <command_line>xjvf input.tar.bz2</command_line>
      <setenv>PATH=$PWD/Library/usr/bin</setenv>
      <weight>1</weight>
      </task>
      <task>
      <application>C:/Windows/system32/cmd.exe</application>
      <command_line>/c copy ..\..\projects\www.gpugrid.net\newrun.bat run.bat</command_line>
      <weight>1</weight>
      </task>

      NOTE: I gave the example where <localcopy> = '..\..\projects\www.gpugrid.net\' which would work for anyone - but remember to change it if you put your <localcopy> elsewhere.

    2. In the next <task> entry, look for the following line
      <command_line>/c call run.bat</command_line>

      and change it to this
      <command_line>/c call run.bat *.cntl</command_line>

    3. Save & close
    4. Open a command prompt window and 'CD' to your <localcopy> directory. Type 'DIR newjob.xml'
      02/11/2023 04:29 1.043 newjob.xml
      1 File(s) 1.043 bytes
      0 Dir(s) 2.334.709.661.696 bytes free

      Write down the number of bytes for newjob.xml. It is 1043 in my case, but it may be different for you. Don't write down the thousands separator '.'.
      NOTE: You could also use <right-click>=>properties on the file in Explorer, but not everyone's OS language is english - and then you also need to remember to use the 'size' and NOT 'size on disk'


  8. Now wait for all running ATM tasks to finish. When they do - and remember from the beginning you should have no new (un-suspended) ATM jobs queued up for starting - shut down BOINC.
    Make sure the BOINC client is stopped, not just the manager. Best way to do this using BOINC manager is selecting menu "File"->"Exit BOINC", make sure "Stop running tasks when exiting the BOINC manager" is selected and press "OK". Using task manager, verify "boinc.exe" is not running.
  9. Navigate to %programdata%\BOINC directory
  10. Edit 'cc_config.xml'
    Look for the following line
    <dont_check_file_sizes>0</dont_check_file_sizes>

    And change it to
    <dont_check_file_sizes>1</dont_check_file_sizes>

    If the line wasn't there, just add it somewhere in the <options> section.
    Save & Close
    NOTE: this alone should do the trick theoretically, but it didn't work for me. Your mileage may vary. It doesn't hurt in any case and you're free to try and skip the next section editing 'client_state.xml' but for me, editing 'client_state.xml' was necessary...
  11. Edit 'client_state.xml'

    1. Search for the following section
      <project>
      <master_url>https://www.gpugrid.net/</master_url>
      <project_name>GPUGRID</project_name>

    2. Within this <project> section, look for the following sub-section
      <file>
      <name>job.xml.789bd8d206da56434f30083d18653299</name>
      <nbytes>1018.000000</nbytes>

      NOTE: You may have a different <very-long-number> after 'job.xml' and a different number in <nbytes> - don't worry about it.
      However, Caution: if you have multiple versions of job.xml, see Step 4 of this procedure to figure out the correct one.
    3. Change the <nbytes> number to the byte-size of newjob.xml you noted down.
      from
      <nbytes>1018.000000</nbytes>

      to
      <nbytes>1043.000000</nbytes>

      But do use your numbers, not mine...
    4. Save & Close


  12. Navigate back to %programdata%\BOINC\projects\www.gpugrid.net
    Copy newjob.xml to job.xml.<very-long-number>
    Use your <very-long-number>, not mine!
  13. Restart BOINC. Unsuspend queued-up ATM tasks, if any, and "Allow new tasks" for GPUGRID.
  14. Cross your fingers and see what happens. If you did everything right, progress will be correct from now on. If tasks fail immediately, you may need to check the steps from the beginning to see where it failed.



Procedure for Linux
Probably very similar to the above, except


  • Edit run.sh instead of run.bat
  • Replace all Windows commands by equivalent Linux shell commands
  • Replace all Windows path separators '\' by Linux path separators '/'


I don't have a Linux machine with GPU, so can't verify, but I'm sure some Linux-savvy user can post a full Linux version.

Remember:


  • The GPUGRID project may publish an updated job.xml version at any time! If that happens, you'll have to once again copy this version to 'newjob.xml' and redo everything starting from 'Edit newjob.xml'
  • The run.bat/run.sh files are part of the files downloaded from GPUGRID for every new ATM task. There again, GPUGRID may release a new version at any time. You'll have to redo the appropriate changes to a fresh copy of 'newrun.bat' when that happens.



Have fun! :-)

goldfinch
Send message
Joined: 5 May 19
Posts: 31
Credit: 395,274,685
RAC: 102,466
Level
Asp
Scientific publications
wat
Message 60842 - Posted: 4 Nov 2023 | 1:39:11 UTC - in response to Message 60839.

Well... breathtaking! Thank you so much! It works! I checked both run.bat and atm.py after the BOINC restart - they were the updated versions. However, this victory came with tears... While I was editing the new version of run.bat, my laptop hung, with an ATM task almost 100% complete... Maybe overheating - it's a laptop, after all. I wish checkpoints could be fixed as easily as the progress indicator!

Question: why did you copy and then rename the atm.py file instead of copying to the new name straight away? As for ways to determine the correct slot, here are my 2 cents (works with BOINC Manager; I don't know much about headless BOINC):


  1. Select a running ATMBeta task
  2. Click 'Property'
  3. Check the 'Directory' property - it shows the task's slot



Again, thanks for the instructions. Easy to follow, easy to implement. You even took care of people who don't know Python! So gracious of you... Thanks a lot!

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 60843 - Posted: 4 Nov 2023 | 12:41:44 UTC - in response to Message 60842.
Last modified: 4 Nov 2023 | 12:44:40 UTC

Well... breathtaking! Thank you so much! It works! I checked both run.bat and atm.py after the BOINC restart - they were the updated versions. However, this victory came with tears... While I was editing the new version of run.bat, my laptop hung, with an ATM task almost 100% complete... Maybe overheating - it's a laptop, after all. I wish checkpoints could be fixed as easily as the progress indicator!


Yeah, those checkpoints would indeed be great. I did see on the dev's GitHub page an issue for preemption and checkpointing, and a comment that they're trying to fix it, so fingers crossed...


Question: why did you copy and then rename the atm.py file instead of copying to a new name straight away?


See Note 2 in section 6.3. :-D
Long explanation: I had a couple of faults while debugging my procedure. BOINC then cleans out the slot directory faster than I could edit/validate the content of any files, so I did it this way: I could quickly navigate into the Lib\site-packages\sync directory as the code was being unpacked and see whether atm.py.orig popped up - or not. That was before I thought to include some error handling in run.bat.

As for the ways how to determine a correct slot, here are my 2 cents (works with BOINC Manager; i don't know much about headless BOINC):

  1. Select a running ATMBeta task
  2. Click 'Property'
  3. Check the 'Directory' property - it shows the task's slot



That's very true, and I didn't even think about that. I just quickly browsed through the few slot directories I had on my system to find the correct one. But your explanation is useful for anyone who doesn't know how to recognize the correct slot content on sight. I'll edit my post to include it.
<EDIT>: seems I can no longer edit my previous post. Oh well, they'll figure that one out, I'm sure. ;-)

For headless BOINC, you'd need to look into client_state.xml for an <active_task> section of gpugrid:
<active_task>
<project_master_url>https://www.gpugrid.net/</project_master_url>
<result_name>Tyk2_jmc_28_jmc_27_1_RE-QUICO_ATM_Sch_GAFF2-3-10-RND3439_0</result_name>
<active_task_state>1</active_task_state>
<app_version_num>109</app_version_num>
<slot>1</slot>
...



Again, thanks for the instructions. Easy to follow, easy to implement. You even took care of people who don't know Python! So gracious of you... Thanks a lot!


You're welcome, I'm happy it's useful to you - and maybe others.

goldfinch
Send message
Joined: 5 May 19
Posts: 31
Credit: 395,274,685
RAC: 102,466
Level
Asp
Scientific publications
wat
Message 60846 - Posted: 5 Nov 2023 | 3:10:30 UTC - in response to Message 60843.

I enabled checkpoint logging and discovered that checkpointing occurred right before uploading the results. That is, during computation checkpointing doesn't seem to occur (at least, according to the debug logs).

As for this,


Question: why did you copy and then rename the atm.py file instead of copying to a new name straight away?


See Note 2 in section 6.3. :-D
Long explanation: I had a couple of faults while debugging my procedure. BOINC then cleans out the slot directory faster than I could edit/validate the content of any files so I did it this way. I could quickly navigate into the Lib\site-packages\sync directory as the code was being unpacked and see if atm.py.orig popped up - or not. That was before I thought to include some errorhandling in run.bat.

That's not what I meant. My question was about why use this:
@echo Replace atm.py
copy ..\..\projects\www.gpugrid.net\atm_correct_progress.py Lib\site-packages\sync
rename Lib\site-packages\sync\atm.py atm.py.orig
rename Lib\site-packages\sync\atm_correct_progress.py atm.py

instead of this:
@echo Replace atm.py
rename Lib\site-packages\sync\atm.py atm.py.orig
copy ..\..\projects\www.gpugrid.net\atm_correct_progress.py Lib\site-packages\sync\atm.py

especially, seeing a similar approach in the config file modification:
<command_line>/c copy ..\..\projects\www.gpugrid.net\newrun.bat run.bat</command_line>
.
In the second approach there are only 2 commands instead of 3, and the result will be the same. If BOINC clears the slot directory, I don't see how using copy and a subsequent rename can help, or how it will be affected differently compared to renaming the original and copying a corrected Python file with the correct name. Maybe I'm missing something - after all, you debugged it, not I (:
Thanks again. The next thing to solve is cooling the laptop. Any useful scripts for that? (joking:)

goldfinch
Send message
Joined: 5 May 19
Posts: 31
Credit: 395,274,685
RAC: 102,466
Level
Asp
Scientific publications
wat
Message 60847 - Posted: 5 Nov 2023 | 8:06:04 UTC - in response to Message 60843.

Actually, your fix is more than simply a quality-of-life improvement. Because the progress is correctly displayed now, BOINC doesn't download the next task immediately, but waits for some time. In my case, it downloaded the next task at ~95% of the current task, which is better because the task doesn't spend too much time in the queue. So the fix also implicitly improves task management.

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 60850 - Posted: 5 Nov 2023 | 12:23:59 UTC - in response to Message 60846.

I enabled checkpoints logging and discovered that checkpointing occurred right before uploading the results. That is, during computation checkpointing doesn't seem to occur (at least, according to debug logs).


Interesting. I do see that the Python code creates a simulation state checkpoint after each of the 70 samples.

2023-11-05 06:37:16 - INFO - sync_re - Finished: sample 288, replica 21 (duration: 13.26600000000326 s)
2023-11-05 06:37:16 - INFO - sync_re - Started: exchange replicas
2023-11-05 06:37:16 - INFO - sync_re - Replica 15: 18 --> 17
2023-11-05 06:37:16 - INFO - sync_re - Replica 21: 17 --> 18
2023-11-05 06:37:16 - INFO - sync_re - Finished: exchange replicas (duration: 0.031000000017229468 s)
2023-11-05 06:37:16 - INFO - sync_re - Started: update replicas
2023-11-05 06:37:30 - INFO - sync_re - Finished: update replicas (duration: 14.046999999962281 s)
2023-11-05 06:37:30 - INFO - sync_re - Started: write replicas samples and trajectories
2023-11-05 06:37:30 - INFO - sync_re - Finished: write replicas samples and trajectories (duration: 0.0 s)
2023-11-05 06:37:30 - INFO - sync_re - Started: checkpointing
2023-11-05 06:38:45 - INFO - sync_re - Finished: checkpointing (duration: 74.75 s)
2023-11-05 06:38:45 - INFO - sync_re - Finished: sample 288 (duration: 372.85899999999674 s)
2023-11-05 06:38:45 - INFO - sync_re - Started: sample 289


So the potential should be there to have more granular checkpoints.

In the second approach there are only 2 commands instead of 3, and result will be the same. If BOINC clears the slot directory, i don't see how using copy and subsequent rename can help, or how it will be affected differently compared to renaming the original and copying a corrected Python file with the correct name. Maybe, i'm missing something - after all, you debugged it, not i (:


The objective was not to prevent cleaning the slot directory, but to hopefully be able to see the atm.py.orig file pop into existence in the second before BOINC cleans the slot. Granted, I could have done that with one less command. I stand duly chastised for my reckless waste of processing cycles. ;-)

Thanks again. Next thing to solve is cooling the laptop. Any useful scripts for that? (joking:)


Yup: Try this - worked for me. :-)

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60854 - Posted: 7 Nov 2023 | 16:50:56 UTC

Another idiosyncrasy that has been less often discussed, and I knew in the back of my mind that this was the case: since these tasks download some packages at runtime, you must maintain internet connectivity for the tasks to run. I had a small issue with one host where it couldn't access the internet due to a network adapter issue, and tasks started to fail one by one (only in the setup phase, I think; tasks that had already downloaded what they need will run fine).

I think this kind of goes against the long-running BOINC methodology where downloaded tasks are fully self-contained and can be run offline. I am aware that some folks might have limited network access or time-of-use billing schemes that make it sensible to limit network activity to certain times. This may be another mechanism causing errors for some folks.

I don't often have network connectivity issues, but I do think it would be in the interest of the users for the devs to rework this so that tasks are fully self contained. I just don't think that losing network access should be a source of errors, even if it doesn't waste much computational time.

I'm aware that this can make things harder for the devs, and I 100% understand why they do it the way they do (downloading directly from github allows them to make changes basically real-time without changing anything on the BOINC side of things). I'm not really demanding a change here, I can deal with it either way, but it would be nice.
____________

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 60857 - Posted: 7 Nov 2023 | 23:02:06 UTC - in response to Message 60854.

Yeah, I follow your logic. I can only assume they have a good reason for it.
On the plus side, it allows us to find and inspect the code when issues pop up.

And on that note, since I didn't get a response to my issue on GitHub regarding the progress indicator, I've taken a crash self-course in git and created a pull request for the code fix.

Fingers crossed...

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60858 - Posted: 7 Nov 2023 | 23:19:58 UTC - in response to Message 60857.

nice, a PR should at least get someone's attention lol
____________

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 60880 - Posted: 17 Nov 2023 | 14:58:31 UTC - in response to Message 60858.

nice, a PR should at least get someone's attention lol


And so it finally did. :-)

Pull request was accepted and merged into the original AToM-OpenMM/Master repo. All that's left now is for it to be merged into the proper repo that is retrieved at the execution of any WU, and progress % will be fixed.
Which will obviously only be useful if ATMbeta task generation starts up again...

Regarding one of your earlier questions:
second, where do you get that last_sample=1? the code says last_sample = self.replicas[0].get_cycle(), but havent worked through the code yet to see what that actually evaluates to. can you elaborate with specific code paths to where "self.replicas[0].get_cycle()" = 1?

I still haven't found the actual code location where it happens, but you can check that the starting cycle number is actually just read from the worker replica input file <taskname>.xml:
<Parameters ATMAcore=".0625" ATMAlpha="0" ATMDirection="1" ATMLambda1=".5" ATMLambda2=".5" ATMU0="0" ATMUbcore="2092" ATMUmax="4184" ATMW0="0" BiasEnergy="0" MonteCarloPressure="1" MonteCarloTemperature="300" REAlchemicalIntermediate="0" RECycle="0" REMDSteps="0" REPertEnergy="0" REPotEnergy="0" REStateId="0" RETemperature="0"/>

This is copied into the r0-rxx replica dirs as checkpoint/restart files. The RECycle parameter will be 0, or 70, or whatever at the start, and is then increased at every new cycle/checkpoint.
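For reference, reading that starting cycle back out of such a replica file only takes a few lines of standard-library Python (a sketch, not AToM code; the file name below is just an example):

# Illustration only: read RECycle from a replica checkpoint/restart XML file
import xml.etree.ElementTree as ET

root = ET.parse("r0/example_ckpt.xml").getroot()     # hypothetical path
params = root if root.tag == "Parameters" else root.find(".//Parameters")
print(int(float(params.get("RECycle"))))             # e.g. 0, 70, 140, ...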

The reason checkpointing/restarting doesn't work is that between the BOINC wrapper and the actual working (Python) program there is that run.bat/run.sh command shell process, acting as a sort of in-between wrapper that doesn't properly forward communication between the BOINC client/wrapper and the Python program. This leads to all sorts of mayhem that prevents the Python program from gracefully exiting and/or restarting using its built-in checkpoint/restart functionality.
That's because a restart re-runs run.bat/run.sh in its entirety, overwriting part but not all of the existing working files, leaving the Python program with inconsistent input data at restart and causing a crash.

I'm taking a quick look to see if I can figure out a workaround, but the true fix would be running the Python from the actual BOINC wrapper instead of using that .bat/.sh file in between. That would also imply, as you said before, having the AToM-OpenMM code downloaded as part of the project files instead of being retrieved for every new WU.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60881 - Posted: 17 Nov 2023 | 16:15:53 UTC - in response to Message 60880.

That's great that someone finally noticed the PR and acted on it. Maybe you need to drop a comment or something about merging it into the GPUGRID repo? ATM task generation IS ongoing right now; there appears to be a single small batch (~250 tasks) running. New tasks are generated when the previous segment is received, so the number of tasks in progress stays around 250, but the RTS shows 0 most of the time since so many hosts are asking for work.

We're about halfway through this run, it seems. Most of the tasks I'm getting now are 4-10s, 5-10s or 6-10s. They will replicate until 9-10, just like before.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60882 - Posted: 27 Nov 2023 | 16:58:25 UTC

A new batch was launched this afternoon (27 November 2023).

There's a possible systemic deployment error:

File "/hdd/boinc-client/slots/1/bin/rbfe_explicit_sync.py", line 2, in <module>
from sync.atm import openmm_job_AmberRBFE
ImportError: cannot import name 'openmm_job_AmberRBFE' from 'sync.atm' (/hdd/boinc-client/slots/1/lib/python3.9/site-packages/sync/atm.py)

syk_m22_m32_3-QUICO_ATM_Sch_ANI-0-10-RND4131
syk_m07_m35_5-QUICO_ATM_Sch_ANI-0-10-RND4539
syk_m17_m25_4-QUICO_ATM_Sch_ANI-0-10-RND6268
syk_m43_m15_2-QUICO_ATM_Sch_ANI-0-10-RND5566

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60883 - Posted: 27 Nov 2023 | 17:51:23 UTC - in response to Message 60882.

Can confirm. I have 80+ errors from these.

At least they error out after about 30s and don't waste much time.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60884 - Posted: 28 Nov 2023 | 17:24:12 UTC - in response to Message 60882.

A new batch was launched this afternoon (27 November 2023).

There's a possible systemic deployment error:

[quote] File "/hdd/boinc-client/slots/1/bin/rbfe_explicit_sync.py", line 2, in <module>
from sync.atm import openmm_job_AmberRBFE
...

Here, too, all tasks with name "syk..." are failing after about 1 minute :-(

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60894 - Posted: 20 Dec 2023 | 16:54:54 UTC - in response to Message 60880.

nice, a PR should at least get someone's attention lol


And so it finally did. :-)

Pull request was accepted and merged into the original AToM-OpenMM/Master repo. All that's left now is for it to be merged into the proper repo that is retrieved at the execution of any WU, and progress % will be fixed.
Which will obviously only be useful if ATMbeta task generation starts up again...


Please provide the direct link to the PR and the repo for the devs to incorporate your fix.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60895 - Posted: 20 Dec 2023 | 17:33:19 UTC - in response to Message 60894.
Last modified: 20 Dec 2023 | 17:58:14 UTC

nice, a PR should at least get someone's attention lol


And so it finally did. :-)

Pull request was accepted and merged into the original AToM-OpenMM/Master repo. All that's left now is for it to be merged into the proper repo that is retrieved at the execution of any WU, and progress % will be fixed.
Which will obviously only be useful if ATMbeta task generation starts up again...


Please provide the direct link to the PR and the repo for the devs to incorporate your fix.


i don't know why the devs need the users to spoon-feed them their own code and repos. they accepted the PR and merged it already. it almost seems like there's little to no inter-team communication about what is going on.

it's all here: https://github.com/Gallicchio-Lab/AToM-OpenMM/pull/56/

the PR was merged into this master on November 16th. but the tasks being distributed to users must be pulling from some other repo tag as the changes have not yet been reflected on subsequent tasks that we have received since then

i don't have any tasks for ATM so i don't remember off hand what tag it was pulling. probably not master or the latest v8.1.0 since those have the fix. probably pulling the v8.1.0beta tag from October.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60896 - Posted: 20 Dec 2023 | 17:47:28 UTC - in response to Message 60895.

... the PR was merged into this master on November 16th. but the tasks being distributed to users must be pulling from some other repo tag as the changes have not yet been reflected on subsequent tasks that we have received since then

Like all BOINC projects, GPUGrid has an applications page - it's part of the standard BOINC toolkit.

That shows that the active ATM Beta code was installed for distribution on 27 Mar 2023 for Linux, and the following day for Windows. Now that the source code has been updated, it will need to be re-compiled into binary form and re-deployed. That's the current stumbling block.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60897 - Posted: 20 Dec 2023 | 18:05:38 UTC - in response to Message 60896.
Last modified: 20 Dec 2023 | 18:08:01 UTC

... the PR was merged into this master on November 16th. but the tasks being distributed to users must be pulling from some other repo tag as the changes have not yet been reflected on subsequent tasks that we have received since then

Like all BOINC projects, GPUGrid has an applications page - it's part of the standard BOINC toolkit.

That shows that the active ATM Beta code was installed for distribution on 27 Mar 2023 for Linux, and the following day for Windows. Now that the source code has been updated, it will need to be re-compiled into binary form and re-deployed. That's the current stumbling block.


no that's not correct. you're not understanding how this application works. it's not the normal setup most boinc projects use.

this "app" is NOT a compiled binary! it's just a bunch of python scripts. just watch how these tasks run and you will see. start from the wrapper and look what's actually happening. what gets distributed to users as the "app" is a baseline zip archive package that contains the conda python environment and some prepackaged libraries, etc. when BOINC runs, it's using the wrapper and associated job.xml file to start execution of the scripts. somewhere along the way in the long chain of script execution, it reaches out to github to download the necessary files and the one in question.

wrapper -> unzip archive -> run script -> download stuff from github -> run more scripts

that's why these tasks fail if you try to run them offline or without an internet connection.
____________

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 60899 - Posted: 21 Dec 2023 | 20:44:22 UTC - in response to Message 60897.
Last modified: 21 Dec 2023 | 20:46:30 UTC

... the PR was merged into this master on November 16th. but the tasks being distributed to users must be pulling from some other repo tag as the changes have not yet been reflected on subsequent tasks that we have received since then

Like all BOINC projects, GPUGrid has an applications page - it's part of the standard BOINC toolkit.

That shows that the active ATM Beta code was installed for distribution on 27 Mar 2023 for Linux, and the following day for Windows. Now that the source code has been updated, it will need to be re-compiled into binary form and re-deployed. That's the current stumbling block.


no that's not correct. you're not understanding how this application works. it's not the normal setup most boinc projects use.

this "app" is NOT a compiled binary! it's just a bunch of python scripts. just watch how these tasks run and you will see. start from the wrapper and look what's actually happening. what gets distributed to users as the "app" is a baseline zip archive package that contains the conda python environment and some prepackaged libraries, etc. when BOINC runs, it's using the wrapper and associated job.xml file to start execution of the scripts. somewhere along the way in the long chain of script execution, it reaches out to github to download the necessary files and the one in question.

wrapper -> unzip archive -> run script -> download stuff from github -> run more scripts

that's why these tasks fail if you try to run them offline or without an internet connection.



Correct, each WU downloads the ATM code on the fly. And the repo it is being pulled from is the "HEAD" of this repo: https://github.com/raimis/AToM-OpenMM.
However, my pull request has been merged into https://github.com/Gallicchio-Lab/AToM-OpenMM (which is the ATM master repo), but NOT into the raimis one. The raimis one is still '17 commits behind' Gallicchio-Lab, and my fix is one of those 'todo' commits - check here:
https://github.com/raimis/AToM-OpenMM/compare/master...Gallicchio-Lab%3AAToM-OpenMM%3Amaster

So no compile needed. 2 things would potentially work:
1) Raimis merges the '17 commits' into his repo. That would then probably become the new 'HEAD' and WU's would automatically pull this (I hope) or the devs would potentially need to change the SHA code in run.bat/run.sh
2) The devs adapt the run.bat/run.sh to pull not from raimis 'HEAD', but from Gallicchio-Lab - but that might obviously have other side effects, i have no idea if they use raimis for a reason...

Relevant code in run.bat/sh:

@echo Install AToM
set REPO_URL=git+https://github.com/raimis/AToM-OpenMM.git@d7931b9a6217232d481731f7589d64b100a514ac


And for readers wondering where the hell that run.bat/sh comes from: it's part of each WU's "xxxxxx-input" file - which is just a bzipped tar file.
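
If anyone wants to check that for themselves, a minimal Python sketch to peek inside a WU's input archive from a slot directory follows. The file name here is just the placeholder used above; substitute the actual "-input" file from your own slot:

import tarfile

# The "xxxxxx-input" file is a bzip2-compressed tar archive; list its
# contents and pull out run.sh / run.bat for inspection.
with tarfile.open("xxxxxx-input", "r:bz2") as archive:
    names = archive.getnames()
    print(names)  # e.g. atom.tar, *.cntl, *.xml, run.sh, run.bat, ...
    for member in ("run.sh", "run.bat"):
        if member in names:
            archive.extract(member, path="inspect")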

Profile Bill F
Avatar
Send message
Joined: 21 Nov 16
Posts: 32
Credit: 86,638,150
RAC: 21,621
Level
Thr
Scientific publications
wat
Message 60900 - Posted: 22 Dec 2023 | 2:35:40 UTC - in response to Message 60899.

... the PR was merged into this master on November 16th. but the tasks being distributed to users must be pulling from some other repo tag as the changes have not yet been reflected on subsequent tasks that we have received since then

Like all BOINC projects, GPUGrid has an applications page - it's part of the standard BOINC toolkit.

That shows that the active ATM Beta code was installed for distribution on 27 Mar 2023 for Linux, and the following day for Windows. Now that the source code has been updated, it will need to be re-compiled into binary form and re-deployed. That's the current stumbling block.


no that's not correct. you're not understanding how this application works. it's not the normal setup most boinc projects use.

this "app" is NOT a compiled binary! it's just a bunch of python scripts. just watch how these tasks run and you will see. start from the wrapper and look what's actually happening. what gets distributed to users as the "app" is a baseline zip archive package that contains the conda python environment and some prepackaged libraries, etc. when BOINC runs, it's using the wrapper and associated job.xml file to start execution of the scripts. somewhere along the way in the long chain of script execution, it reaches out to github to download the necessary files and the one in question.

wrapper -> unzip archive -> run script -> download stuff from github -> run more scripts

that's why these tasks fail if you try to run them offline or without an internet connection.



Correct, each WU downloads the ATM code on the fly. And the repo it is being pulled from is the "HEAD" of this repo: https://github.com/raimis/AToM-OpenMM.
However, my pull request has been merged into https://github.com/Gallicchio-Lab/AToM-OpenMM (which is the ATM master repo), but NOT into the raimis one. The raimis one is still '17 commits behind' Gallicchio-Lab, and my fix is one of those 'todo' commits - check here:
https://github.com/raimis/AToM-OpenMM/compare/master...Gallicchio-Lab%3AAToM-OpenMM%3Amaster

So no compile needed. 2 things would potentially work:
1) Raimis merges the '17 commits' into his repo. That would then probably become the new 'HEAD' and WU's would automatically pull this (I hope) or the devs would potentially need to change the SHA code in run.bat/run.sh
2) The devs adapt the run.bat/run.sh to pull not from raimis 'HEAD', but from Gallicchio-Lab - but that might obviously have other side effects, i have no idea if they use raimis for a reason...

Relevant code in run.bat/sh:

@echo Install AToM
set REPO_URL=git+https://github.com/raimis/AToM-OpenMM.git@d7931b9a6217232d481731f7589d64b100a514ac


And for readers wondering where the hell that run.bat/sh comes from: it's part of each WU's "xxxxxx-input" file - which is just a bzipped tar file.


So we all "Hope" that either option #1 or #2 happens....

Thank you for your efforts.

Bill F

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 60901 - Posted: 23 Dec 2023 | 0:28:50 UTC - in response to Message 60900.


So no compile needed. 2 things would potentially work:
1) Raimis merges the '17 commits' into his repo. That would then probably become the new 'HEAD' and WU's would automatically pull this (I hope) or the devs would potentially need to change the SHA code in run.bat/run.sh
2) The devs adapt the run.bat/run.sh to pull not from raimis 'HEAD', but from Gallicchio-Lab - but that might obviously have other side effects, i have no idea if they use raimis for a reason...

Relevant code in run.bat/sh:

@echo Install AToM
set REPO_URL=git+https://github.com/raimis/AToM-OpenMM.git@d7931b9a6217232d481731f7589d64b100a514ac


And for readers wondering where the hell that run.bat/sh comes from: it's part of each WU's "xxxxxx-input" file - which is just a bzipped tar file.


So we all "Hope" that either option #1 or #2 happens....

Thank you for your efforts.

Bill F


UPDATE: Seems that Raimis is reading the forum, or a little bird told him, because he has merged all commits. That's the good news.

Now some dev still needs to update the commit SHA code in run.bat/run.sh to the new HEAD version 1aa4eb9c39de5e269da430949da2ef377b3d9ca2

New code needed in run.bat/sh:

@echo Install AToM
set REPO_URL=git+https://github.com/raimis/AToM-OpenMM.git@1aa4eb9c39de5e269da430949da2ef377b3d9ca2


Once that is fixed, the progress issue should be over and done with!

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60902 - Posted: 23 Dec 2023 | 8:25:57 UTC - in response to Message 60901.

Good news indeed.

Profile Bill F
Avatar
Send message
Joined: 21 Nov 16
Posts: 32
Credit: 86,638,150
RAC: 21,621
Level
Thr
Scientific publications
wat
Message 60903 - Posted: 24 Dec 2023 | 21:51:50 UTC - in response to Message 60901.


So no compile needed. 2 things would potentially work:
1) Raimis merges the '17 commits' into his repo. That would then probably become the new 'HEAD' and WU's would automatically pull this (I hope) or the devs would potentially need to change the SHA code in run.bat/run.sh
2) The devs adapt the run.bat/run.sh to pull not from raimis 'HEAD', but from Gallicchio-Lab - but that might obviously have other side effects, i have no idea if they use raimis for a reason...

Relevant code in run.bat/sh:

@echo Install AToM
set REPO_URL=git+https://github.com/raimis/AToM-OpenMM.git@d7931b9a6217232d481731f7589d64b100a514ac


And for readers wondering where the hell that run.bat/sh comes from: it's part of each WU's "xxxxxx-input" file - which is just a bzipped tar file.


So we all "Hope" that either option #1 or #2 happens....

Thank you for your efforts.

Bill F


UPDATE: Seems that Raimis is reading the forum, or a little bird told him, because he has merged all commits. That's the good news.

Now some dev still needs to update the commit SHA code in run.bat/run.sh to the new HEAD version 1aa4eb9c39de5e269da430949da2ef377b3d9ca2

New code needed in run.bat/sh:

@echo Install AToM
set REPO_URL=git+https://github.com/raimis/AToM-OpenMM.git@1aa4eb9c39de5e269da430949da2ef377b3d9ca2


Once that is fixed, the progress issue should be over and done with!


Are there multiple people who do dev work for the project, and is there a way to get a "little bird" to talk to them?

Bill F

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60913 - Posted: 3 Jan 2024 | 9:46:25 UTC

this morning, 2 of my Windows10 machines received ATMbeta tasks, and all of them failed after around 51 seconds (RTX3070) and around 81 seconds (Quadro P5000).

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60914 - Posted: 3 Jan 2024 | 10:16:35 UTC - in response to Message 60913.

this morning, 2 of my Windows10 machines received ATMbeta tasks, and all of them failed after around 51 seconds (RTX3070) and around 81 seconds (Quadro P5000).

still these faulty ATMs are being sent out, failing after a short time.
I now delisted them from my download choices in the web settings.

Are these tasks failing only on my systems, or do other crunchers experience the same problem ?

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60915 - Posted: 3 Jan 2024 | 10:28:58 UTC - in response to Message 60914.

this morning, 2 of my Windows10 machines received ATMbeta tasks, and all of them failed after around 51 seconds (RTX3070) and around 81 seconds (Quadro P5000).

still these faulty ATMs are being sent out, failing after a short time.
I now delisted them from my download choices in the web settings.

Are these tasks failing only on my systems, or do other crunchers experience the same problem ?

what kind of junk is this now? even after I set the ATMbeta to "no", they still come in and fail :-((((

So something seems to be wrong with the GPUGRID web settings :-(((

roundup
Send message
Joined: 11 May 10
Posts: 57
Credit: 1,561,695,193
RAC: 7,902,359
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60916 - Posted: 3 Jan 2024 | 11:20:49 UTC

I got 6 ATMbeta so far today. All of them error out with

openmm.OpenMMException: Unknown property 'version' in node 'IntegratorParameters'

ATMbetas have worked well on this machine before.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 467
Credit: 8,358,971,966
RAC: 9,198,359
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60917 - Posted: 3 Jan 2024 | 11:33:40 UTC

Same here with this:

Stderr output

<core_client_version>7.20.5</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
06:08:31 (183097): wrapper (7.7.26016): starting
06:08:48 (183097): wrapper (7.7.26016): starting
06:08:48 (183097): wrapper: running bin/python (bin/conda-unpack)
06:08:49 (183097): bin/python exited; CPU time 0.184758
06:08:49 (183097): wrapper: running bin/tar (xjvf input.tar.bz2)
06:08:50 (183097): bin/tar exited; CPU time 0.557920
06:08:50 (183097): wrapper: running bin/bash (run.sh)
+ echo 'Setup environment'
+ source bin/activate
++ _conda_pack_activate
++ local _CONDA_SHELL_FLAVOR
++ '[' -n x ']'
++ _CONDA_SHELL_FLAVOR=bash
++ local script_dir
++ case "$_CONDA_SHELL_FLAVOR" in
+++ dirname bin/activate
++ script_dir=bin
+++ cd bin
+++ pwd
++ local full_path_script_dir=/var/lib/boinc-client/slots/1/bin
+++ dirname /var/lib/boinc-client/slots/1/bin
++ local full_path_env=/var/lib/boinc-client/slots/1
+++ basename /var/lib/boinc-client/slots/1
++ local env_name=1
++ '[' -n '' ']'
++ export CONDA_PREFIX=/var/lib/boinc-client/slots/1
++ CONDA_PREFIX=/var/lib/boinc-client/slots/1
++ export _CONDA_PACK_OLD_PS1=
++ _CONDA_PACK_OLD_PS1=
++ PATH=/var/lib/boinc-client/slots/1/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
++ PS1='(1) '
++ case "$_CONDA_SHELL_FLAVOR" in
++ hash -r
++ local _script_dir=/var/lib/boinc-client/slots/1/etc/conda/activate.d
++ '[' -d /var/lib/boinc-client/slots/1/etc/conda/activate.d ']'
+++ ls -A /var/lib/boinc-client/slots/1/etc/conda/activate.d
++ '[' -n ocl-icd_activate.sh ']'
++ local _path
++ for _path in "$_script_dir"/*.sh
++ . /var/lib/boinc-client/slots/1/etc/conda/activate.d/ocl-icd_activate.sh
+++ conda_ocl_icd_activate
++++ ls /var/lib/boinc-client/slots/1/etc/OpenCL/vendors/
+++ [[ -z ocl-icd-system ]]
+ export PATH=/var/lib/boinc-client/slots/1:/var/lib/boinc-client/slots/1/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ PATH=/var/lib/boinc-client/slots/1:/var/lib/boinc-client/slots/1/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin:.
+ echo 'Create a temporary directory'
+ export TMP=/var/lib/boinc-client/slots/1/tmp
+ TMP=/var/lib/boinc-client/slots/1/tmp
+ mkdir -p /var/lib/boinc-client/slots/1/tmp
+ echo 'Install AToM'
+ REPO_URL=git+https://github.com/raimis/AToM-OpenMM.git@d7931b9a6217232d481731f7589d64b100a514ac
+ python -m pip install git+https://github.com/raimis/AToM-OpenMM.git@d7931b9a6217232d481731f7589d64b100a514ac
Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /var/lib/boinc-client/slots/1/tmp/pip-req-build-x7scn3kc
Running command git rev-parse -q --verify 'sha^d7931b9a6217232d481731f7589d64b100a514ac'
Running command git fetch -q https://github.com/raimis/AToM-OpenMM.git d7931b9a6217232d481731f7589d64b100a514ac
Running command git checkout -q d7931b9a6217232d481731f7589d64b100a514ac
+ python -m pip list
+ echo 'Configure AToM'
+ echo localhost,0:0,1,CUDA,,/var/lib/boinc-client/slots/1/tmp
+ echo 'Extract restart'
+ tar xjvf restart.tar.bz2
+ echo 'Run AToM'
+ CONFIG_FILE=JAK2_m02_m04_asyncre.cntl
+ python bin/rbfe_explicit_sync.py JAK2_m02_m04_asyncre.cntl
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.
Traceback (most recent call last):
File "/var/lib/boinc-client/slots/1/bin/rbfe_explicit_sync.py", line 8, in <module>
rx.setupJob()
File "/var/lib/boinc-client/slots/1/lib/python3.9/site-packages/sync/atm.py", line 82, in setupJob
self.worker = OMMWorkerATM(ommsystem, self.config, self.logger)
File "/var/lib/boinc-client/slots/1/lib/python3.9/site-packages/sync/worker.py", line 44, in __init__
self.simulation.loadState(basename + "_0.xml")
File "/var/lib/boinc-client/slots/1/lib/python3.9/site-packages/openmm/app/simulation.py", line 344, in loadState
self.context.setState(mm.XmlSerializer.deserialize(xml))
File "/var/lib/boinc-client/slots/1/lib/python3.9/site-packages/openmm/openmm.py", line 9742, in setState
return _openmm.Context_setState(self, state)
openmm.OpenMMException: Unknown property 'version' in node 'IntegratorParameters'
06:09:26 (183097): bin/bash exited; CPU time 33.678225
06:09:26 (183097): app exit status: 0x1
06:09:26 (183097): called boinc_finish(195)

</stderr_txt>
]]>

Erich56, Regarding your preferences, did you answer yes to these questions:

1. Run test applications?
2. If no work for selected applications is available, accept work from other applications?

Then change them to no.

What is the location setting for your computers? Default, Home, Work or School.
Make sure you modify the correct location setting preferences.

That's all that I can think of.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60918 - Posted: 3 Jan 2024 | 12:47:28 UTC - in response to Message 60917.

Erich56, Regarding your preferences, did you answer yes to these questions:

1. Run test applications?
2. If no work for selected applications is available, accept work from other applications?

Then change them to no.

What is the location setting for your computers? Default, Home, Work or School.
Make sure you modify the correct location setting preferences.

That's all that I can think of.


thank you, Bedrich, for your hints.
Indeed, I deselected "ATMbeta", but I forgot to deselect "Run test applications".
So now I have corrected this.

Still, though, I would guess that once "ATMbeta" is deselected, no ATMbeta tasks should be downloaded, regardless of whether "run test applications" is selected or not.
What happened is rather illogical :-(

roundup
Send message
Joined: 11 May 10
Posts: 57
Credit: 1,561,695,193
RAC: 7,902,359
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60919 - Posted: 3 Jan 2024 | 15:30:58 UTC - in response to Message 60918.

11 WUs on 3 different computers (1 Win11, 2 Linux) with 3 different GPUs failed today.
ATMbeta v1.09 has worked fine before on these machines (was there a change in the app but the version number 1.09 kept as before?).

Error messages same as in the postings before.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60920 - Posted: 3 Jan 2024 | 16:53:21 UTC - in response to Message 60919.
Last modified: 3 Jan 2024 | 16:57:58 UTC

11 WUs on 3 different computers (1 Win11, 2 Linux) with 3 different GPUs failed today.
ATMbeta v1.09 has worked fine before on these machines (was there a change in the app but the version number 1.09 kept as before?).

Error messages same as in the postings before.

from what I remember reading somewhere here recently, there was some issue with an expired license which they tried to fix but it failed. This though was true for the Linux version. So, as it looks, the same problem might be true for Windows as well :-(

However, I am wondering why no one from the team notices that all the tasks they send out are failing, and stops the distribution.

P.S.
I just noticed that one of my PCs received several ACEMD 3 tasks within the past hour - and they also failed after about a minute.
See here: http://www.gpugrid.net/result.php?resultid=33725238
Until this morning, they could be crunched successfully.
So there seems to be a major problem with GPUGRID at this time :-(

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60921 - Posted: 3 Jan 2024 | 16:56:52 UTC - in response to Message 60920.

11 WUs on 3 different computers (1 Win11, 2 Linux) with 3 different GPUs failed today.
ATMbeta v1.09 has worked fine before on these machines (was there a change in the app but the version number 1.09 kept as before?).

Error messages same as in the postings before.

from what I remember reading somewhere here recently, there was some issue with an expired license which they tried to fix but it failed. This though was true for the Linux version. So, as it looks, the same problem might be true for Windows as well :-(

However, I am wondering why no one from the team notices that all the tasks they send out are failing, and stops the distribution.


You’re confusing two different apps.

acemd3 had the expired license issue, which they tried to fix, but they ended up replacing the Linux version with the Windows version, which remains broken (because a Windows app can’t run on Linux).

This thread is about the ATM app, which is not subject to the same licensing issues.
____________

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,098,952,024
RAC: 8,566,567
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60933 - Posted: 7 Jan 2024 | 21:10:13 UTC

From January 5th onwards, my Linux hosts have received a limited number of new ATMbeta tasks.
Every one of them is now showing a linear % progress during execution.
As an example, I reproduce the "Properties" of the first one I received, at two different moments:

Application: ATMbeta: Free energy calculations of protein-ligand binding 1.09 (cuda1121)
Name: CDK2_m01_m10_1-QUICO_ATM_TEST_NEW-3-5-RND8234
State: Running
Received: Fri 05 Jan 2024 11:52:17 WET
Report deadline: Wed 10 Jan 2024 11:52:16 WET
Resources: 0.955 CPUs + 1 NVIDIA GPU
Estimated computation size: 1,000,000,000 GFLOPs
CPU time: 00:55:25
CPU time since checkpoint: 00:55:22
Elapsed time: 00:57:41
Estimated time remaining: 12d 22:00:55
Fraction done: 7.328%
Virtual memory size: 11.94 GB
Working set size: 1.48 GB
Directory: slots/3
Process ID: 14905
Progress rate: 7.560% per hour
Executable: wrapper_26198_x86_64-pc-linux-gnu

----------------------------------------------------------------------

Application: ATMbeta: Free energy calculations of protein-ligand binding 1.09 (cuda1121)
Name: CDK2_m01_m10_1-QUICO_ATM_TEST_NEW-3-5-RND8234
State: Running
Received: Fri 05 Jan 2024 11:52:17 WET
Report deadline: Wed 10 Jan 2024 11:52:16 WET
Resources: 0.955 CPUs + 1 NVIDIA GPU
Estimated computation size: 1,000,000,000 GFLOPs
CPU time: 07:00:18
CPU time since checkpoint: 07:00:16
Elapsed time: 07:11:01
Estimated time remaining: 6d 18:09:42
Fraction done: 51.526%
Virtual memory size: 11.94 GB
Working set size: 1.48 GB
Directory: slots/3
Process ID: 14905
Progress rate: 7.200% per hour
Executable: wrapper_26198_x86_64-pc-linux-gnu

This is an interesting correction of that issue compared to previous ATMbeta tasks!
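As a quick sanity check on the linearity, here is a small Python sketch using the two snapshots above; the numbers are simply copied from the listings, and the assumption is that progress really does advance linearly:

# Two snapshots from the task properties above
t1_h, done1 = 57 / 60 + 41 / 3600, 7.328      # 00:57:41 elapsed, 7.328 % done
t2_h, done2 = 7 + 11 / 60 + 1 / 3600, 51.526  # 07:11:01 elapsed, 51.526 % done

rate = (done2 - done1) / (t2_h - t1_h)        # % per hour between the snapshots
total_h = 100.0 / rate                        # projected total run time if linear

print(f"rate ~ {rate:.2f} %/h, projected total ~ {total_h:.1f} h")
# -> roughly 7.1 %/h and about 14 h total, consistent with the 7.2-7.6 %/h
#    progress rate BOINC itself reports in the listings.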

On the other hand, the values for "CPU time since checkpoint:" make me think that the "no checkpointing" issue is still pending correction.
This forces the tasks to be executed with no interruptions from beginning to end...
Also, they seem to continue failing on Windows hosts.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60934 - Posted: 8 Jan 2024 | 1:04:31 UTC

They fixed the percentage completion issue.

But I don't think anyone has tried the stop-restart experiment yet to determine whether the tasks can be stopped and restarted without failing.

The tasks are still too rare to throw any away for the experiment.

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 467
Credit: 8,358,971,966
RAC: 9,198,359
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60937 - Posted: 8 Jan 2024 | 11:20:56 UTC

I did the suspend and restart experiment. The unit didn't error out, but it didn't save work done before the suspension. It started at zero and is crunching normally. Let's see if it finishes successfully. Give it a few hours. It looks like we are making progress.

https://www.gpugrid.net/result.php?resultid=33726994

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60940 - Posted: 8 Jan 2024 | 20:43:19 UTC

Your stop-started task looks to have finished normally for credit.

Good to know you can pause-stop-restart the tasks now without erroring out.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60944 - Posted: 9 Jan 2024 | 6:20:53 UTC - in response to Message 60933.

They seem to continue failing on Windows hosts.

which is really too bad :-(

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60947 - Posted: 9 Jan 2024 | 12:07:40 UTC - in response to Message 60944.

They seem to continue failing on Windows hosts.

which is really too bad :-(

I have now downloaded an ATM task on one of my Windows hosts - still failing. To make sure that the problem is not with my system, I double-checked the task lists of other volunteers - same thing there.

So it seems clear that the license for the Linux app was updated, but NOT for the Windows app.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60948 - Posted: 9 Jan 2024 | 12:25:19 UTC - in response to Message 60947.

Looking at some of the English-language attempts at that task, they all have

Error occurred while processing: C:\DC\BOINC.
'tar.ext' is not recognized as an internal or external command,
operable program or batch file.
ERROR: Invalid requirement: './Acellera-AToM-OpenMM-*'

I don't see any evidence of an expired licence, but there's clearly something else wrong with the re-deployment.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 60949 - Posted: 9 Jan 2024 | 12:54:27 UTC - in response to Message 60947.

They seem to continue failing on Windows hosts.

which is really too bad :-(

I now downloaded an ATM on one of my Windows hosts - still failing. To make sure that the problem is not with my system, I double-checked the tasks lists of other volunteers - same thing there.

So it seems clear that the license for the Linux app was updated, but NOT for the Windows app.


The license update was for the ACEMD3 app, not ATM. Whatever problem ATM might be having on Windows, it’s not the same issue that they fixed for Linux on ACEMD3.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60950 - Posted: 9 Jan 2024 | 13:51:50 UTC - in response to Message 60949.

I don't see any evidence of an expired licence, but there's clearly something else wrong with the re-deployment.


... The license update was for the ACEMD3 app, not ATM. Whatever problem ATM might be having on Windows, it’s not the same issue that they fixed for Linux on ACEMD3.

okay, folks, thanks for the information.
So all we Windows people can do is: wait and see :-(

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,098,952,024
RAC: 8,566,567
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60951 - Posted: 9 Jan 2024 | 14:41:58 UTC - in response to Message 60944.

They seem to continue failing on Windows hosts.

which is really too bad :-(

Also, the intended research for ATM tasks can be altered in some way due to this issue.
As an example:
One of my Linux hosts, #557889, happened to catch, early this morning, the ATMbeta task TYK2_m15_m16_3-QUICO_ATM_opc-2-5-RND9826_6
This task is the third link in a 5-task chain, and it hangs from WU #27642885
This task, by chance, had previously been sent to 6 Windows hosts and failed there after a few seconds.
If it had been sent to two more Windows hosts and (consequently) failed, it would have reached its maximum number of allowed failures, and this particular line of investigation would have been truncated.
Coincidences happen...

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60952 - Posted: 9 Jan 2024 | 14:58:23 UTC - in response to Message 60951.

I have a _7 task running on a Linux box. That's the last chance saloon - I'll try to look after it.

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 12,875,793
RAC: 0
Level
Pro
Scientific publications
wat
Message 60953 - Posted: 10 Jan 2024 | 1:40:03 UTC - in response to Message 60948.
Last modified: 10 Jan 2024 | 1:40:38 UTC

tar.ext instead of tar.exe

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60954 - Posted: 10 Jan 2024 | 1:45:59 UTC - in response to Message 60953.

I noticed the typo also. Should be easy to fix. Proofreading . . . anyone??

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 286,049,934
RAC: 2,212,885
Level
Asn
Scientific publications
wat
Message 60955 - Posted: 10 Jan 2024 | 15:16:53 UTC

Exact same problem.


15:07:41 (7608): wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)
'tar.ext' n'est pas reconnu en tant que commande interne
ou externe, un programme exécutable ou un fichier de commandes.
____________

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 286,049,934
RAC: 2,212,885
Level
Asn
Scientific publications
wat
Message 60956 - Posted: 10 Jan 2024 | 15:54:29 UTC

16:49:42 (4080): wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)
'tar.ext' n'est pas reconnu en tant que commande interne
ou externe, un programme exécutable ou un fichier de commandes.
____________

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 32
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60957 - Posted: 10 Jan 2024 | 17:01:18 UTC

Thank you for spotting the typo. It has been updated. Hopefully the next round of jobs succeed on windows!

Thank you for your patience. (Like most academic labs, our current ability to support Windows is limited; all our in-house research and software is on Linux.)

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60958 - Posted: 10 Jan 2024 | 22:58:57 UTC

Thanks for the update Steve.

So are you the new 'contact point' for all the applications now?

Or just for the resurrected Quantum Chemistry app and tasks?

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 32
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 60959 - Posted: 11 Jan 2024 | 10:19:27 UTC - in response to Message 60958.

I am a new researcher/software engineer in the lab. Part of my responsibility is looking after GPUGRID and deploying this updated Quantum Chemistry app. I will try and keep an eye on these forums so issues can be addressed!

roundup
Send message
Joined: 11 May 10
Posts: 57
Credit: 1,561,695,193
RAC: 7,902,359
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 60960 - Posted: 11 Jan 2024 | 13:37:56 UTC - in response to Message 60959.

I will try and keep an eye on these forums so issues can be addressed!

Brilliant, Steve. That would already be great progress. Thank you!

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 60961 - Posted: 11 Jan 2024 | 15:16:39 UTC - in response to Message 60960.

I will try and keep an eye on these forums so issues can be addressed!

Brilliant, Steve. That would already be great progress. Thank you!


+ 1

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 60962 - Posted: 11 Jan 2024 | 17:59:20 UTC - in response to Message 60959.

I am a new researcher/software engineer in the lab. Part of my responsibility is looking after GPUGRID and deploying this updated Quantum Chemistry app. I will try and keep an eye on these forums so issues can be addressed!

+100

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61047 - Posted: 23 Jan 2024 | 11:06:03 UTC - in response to Message 60957.

Steve wrote on Jan. 10th:

Thank you for spotting the typo. It has been updated. Hopefully the next round of jobs succeed on windows!

unfortunately, the new round of jobs also does not work on Windows.
One after the other fails a short time after starting :-(

See here:
http://www.gpugrid.net/results.php?userid=125700

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 61048 - Posted: 23 Jan 2024 | 12:39:05 UTC - in response to Message 61047.
Last modified: 23 Jan 2024 | 12:41:10 UTC

You seem to have something wrong with your BOINC client. it's impossible to say what, but your stderr output is just blank, which is not normal and not an artifact of these tasks. since this is the same system where you saw weirdness with Asteroids, i do think you have some kind of problem with BOINC itself. it's impossible for us to guess without access to your system though.

this is what a Windows output should look like: http://www.gpugrid.net/result.php?resultid=33743283

09:54:40 (15568): wrapper (7.9.26016): starting
09:54:40 (15568): wrapper: running python.exe (bin/conda-unpack)
09:54:42 (15568): python.exe exited; CPU time 0.000000
09:54:42 (15568): wrapper: running Library/usr/bin/tar.exe (xjvf input.tar.bz2)
atom.tar
JNK1_m35_m25_0.xml
JNK1_m35_m25_asyncre.cntl
JNK1_m35_m25.inpcrd
JNK1_m35_m25.prmtop
run.bat
run.sh
09:54:43 (15568): Library/usr/bin/tar.exe exited; CPU time 0.000000
09:54:43 (15568): wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)
ERROR: Invalid requirement: './Acellera-AToM-OpenMM-*'
09:54:46 (15568): C:/Windows/system32/cmd.exe exited; CPU time 0.000000
09:54:46 (15568): app exit status: 0xd
09:54:46 (15568): called boinc_finish(195)


so yes, there is still a problem on Windows (probably something wrong in the run.bat file, or a file missing from the environment package or input files), but you have a larger problem as well.

while troubleshooting your asteroids problem, I had recommended to upgrade your BOINC client, and I think you did that, but you may have performed an in-place upgrade rather than a fresh install. I would recommend removing all aspects of BOINC on this system. completely delete everything. and re-install from a fresh install package. do not keep anything from the previous install.
____________

wujj123456
Send message
Joined: 9 Jun 10
Posts: 16
Credit: 1,787,119,073
RAC: 3,100,787
Level
His
Scientific publications
watwatwatwat
Message 61049 - Posted: 23 Jan 2024 | 21:11:21 UTC - in response to Message 61048.

Not sure if he has the same problem, but for me, the past few jobs on Windows are sent to the wrong platform AFAIC.

Host: https://www.gpugrid.net/results.php?hostid=615737
Task example: https://www.gpugrid.net/result.php?resultid=33744637
Error: The operating system cannot run %1

I checked tasks where other hosts subsequently succeeded and they are all Linux.

Geoff
Send message
Joined: 30 Aug 10
Posts: 2
Credit: 315,058,094
RAC: 110,399
Level
Asp
Scientific publications
watwatwat
Message 61051 - Posted: 24 Jan 2024 | 11:02:12 UTC

Windows task failure again; this is a copy of the run file up to the point where it hit the error:

Setup environment

C:\ProgramData\BOINC\slots\9>set HOMEPATH=C:\ProgramData\BOINC\slots\9

C:\ProgramData\BOINC\slots\9>set PATH=C:\ProgramData\BOINC\slots\9;C:\ProgramData\BOINC\slots\9\Library\usr\bin;C:\ProgramData\BOINC\slots\9\Library\bin;C:\Windows\system32;C:\Windows

C:\ProgramData\BOINC\slots\9>set PYTHONPATH=C:\ProgramData\BOINC\slots\9\Lib\python3.9\site-packages

C:\ProgramData\BOINC\slots\9>set SYSTEMROOT=C:\Windows
Create a temporary directory

C:\ProgramData\BOINC\slots\9>set TEMP=C:\ProgramData\BOINC\slots\9\tmp

C:\ProgramData\BOINC\slots\9>mkdir C:\ProgramData\BOINC\slots\9\tmp
Install AToM

C:\ProgramData\BOINC\slots\9>tar.exe xvf atom.tar
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/.github/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/.github/workflows/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/.github/workflows/publish.yml
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/LICENSE
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/README.md
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/abfe_explicit.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/abfe_explicit_zrestr.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/abfe_structprep.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/async_re.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/atom_nnp_wrapper.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/environment.yml
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/README.md
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/ligands/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/ligands/but.mol2
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/ligands/dap.mol2
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/ligands/dapp.mol2
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/ligands/dmso.mol2
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/ligands/dss.mol2
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/ligands/prop.mol2
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/ligands/thi.mol2
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/receptor/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/receptor/fkbp.pdb
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/scripts/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/scripts/analyze.sh
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/scripts/asyncre_template.cntl
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/scripts/equil_template.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/scripts/free_energies_template.sh
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/scripts/mdlambda_template.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/scripts/mintherm_template.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/scripts/nodefile
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/scripts/prep_template.sh
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/scripts/run_template.sh
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/scripts/runopenmm
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/scripts/setup-atm.sh
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/scripts/setup-settings.sh
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/fkbp/scripts/uwham_analysis.R
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/scripts/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/scripts/analyze.sh
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/scripts/nodefile
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/scripts/runopenmm
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/scripts/uwham_analysis.R
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/temoa-g1/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/temoa-g1/README.md
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/temoa-g1/equil.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/temoa-g1/mdlambda.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/temoa-g1/mintherm.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/temoa-g1/npt.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/temoa-g1/temoa-g1.inpcrd
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/temoa-g1/temoa-g1.prmtop
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/ABFE/temoa-g1/temoa-g1_asyncre.cntl
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/README.md
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/ligands/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/ligands/1H1Q.mol2
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/ligands/1H1Q.sdf
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/ligands/1H1R.mol2
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/ligands/1H1R.sdf
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/ligands/1H1S.mol2
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/ligands/1H1S.sdf
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/ligands/1OI9.mol2
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/ligands/1OI9.sdf
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/ligands/1OIU.mol2
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/ligands/1OIU.sdf
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/ligands/1OIY.mol2
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/ligands/1OIY.sdf
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/receptor/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/receptor/cdk2.pdb
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/scripts/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/scripts/analyze.sh
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/scripts/asyncre_template.cntl
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/scripts/free_energies_template.sh
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/scripts/prep_template.sh
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/scripts/run_template.sh
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/scripts/setup-atm.sh
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/scripts/setup-settings.sh
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/cdk2/scripts/uwham_analysis.R
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/eralpha/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/eralpha/README.md
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/eralpha/ligands/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/eralpha/ligands/2d.mol2
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/eralpha/ligands/2e.mol2
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/eralpha/ligands/3a.mol2
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/eralpha/ligands/3b.mol2
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/eralpha/receptor/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/eralpha/receptor/eralpha.pdb
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/eralpha/scripts/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/eralpha/scripts/analyze.sh
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/eralpha/scripts/asyncre_template.cntl
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/eralpha/scripts/equil_template.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/eralpha/scripts/free_energies_template.sh
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/eralpha/scripts/mintherm_template.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/eralpha/scripts/nodefile
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/eralpha/scripts/prep_template.sh
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/eralpha/scripts/run_template.sh
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/eralpha/scripts/runopenmm
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/eralpha/scripts/setup-atm.sh
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/eralpha/scripts/setup-settings.sh
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/eralpha/scripts/uwham_analysis.R
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/scripts/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/scripts/analyze.sh
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/scripts/nodefile
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/scripts/runopenmm
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/scripts/uwham_analysis.R
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/temoa-g1-g4/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/temoa-g1-g4/README.md
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/temoa-g1-g4/equil.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/temoa-g1-g4/mintherm.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/temoa-g1-g4/npt.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/temoa-g1-g4/temoa-g1-g4.inpcrd
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/temoa-g1-g4/temoa-g1-g4.prmtop
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/RBFE/temoa-g1-g4/temoa-g1-g4_asyncre.cntl
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/README.md
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/scripts/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/scripts/analyze.sh
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/scripts/nodefile
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/scripts/runopenmm
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/examples/scripts/uwham_analysis.R
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/gibbs_sampling.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/local_openmm_transport.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/ommreplica.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/ommsystem.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/ommworker.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/openmm_async_re.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/rbfe_explicit.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/rbfe_explicit_sync.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/rbfe_explicit_zrestr.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/rbfe_structprep.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/setup.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/sync/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/sync/__init__.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/sync/atm.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/sync/worker.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/temperatureRE_explicit.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/transport.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/utils/
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/utils/__init__.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/utils/logging.conf
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/utils/singal_guard.py
Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6/utils/timer.py

C:\ProgramData\BOINC\slots\9>python.exe -m pip install ./Acellera-AToM-OpenMM-* || exit 13
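
For what it's worth, that last line shows pip being handed the literal string './Acellera-AToM-OpenMM-*': unlike bash, cmd.exe does not expand wildcards, so pip never sees the real extracted directory name and rejects it as an invalid requirement. A rough Python sketch of one possible workaround, using only the names visible in the log above (this is not the project's actual fix, just an illustration):

import glob
import subprocess
import sys

# cmd.exe passes the wildcard through literally, so resolve it here
# before invoking pip (directory pattern taken from the log above).
matches = glob.glob("./Acellera-AToM-OpenMM-*")
if not matches:
    sys.exit(13)  # mirrors the "|| exit 13" in run.bat
subprocess.run([sys.executable, "-m", "pip", "install", matches[0]], check=True)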



Geoff
Send message
Joined: 30 Aug 10
Posts: 2
Credit: 315,058,094
RAC: 110,399
Level
Asp
Scientific publications
watwatwat
Message 61052 - Posted: 24 Jan 2024 | 11:10:37 UTC

Just to add, I've now had multiple failures over the last 10 minutes, all of them failing at the same point.

Hope this helps with the Windows debug.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61053 - Posted: 24 Jan 2024 | 13:51:07 UTC - in response to Message 61048.

Ian&Steve C. wrote yesterday:

You seem to have something wrong with your BOINC client. it's impossible to say what, but your stderr output is just blank, which is not normal and not an artifact of these tasks. since this is the same system where you saw weirdness with Asteroids, i do think you have some kind of problem with BOINC itself. it's impossible for us to guess without access to your system though.

...

so yes, there is still a problem on Windows (probably something wrong in the run.bat file, or a file missing from the environment package or input files. but you have a larger problem as well.

while troubleshooting your asteroids problem, I had recommended to upgrade your BOINC client, and I think you did that, but you may have performed an in-place upgrade rather than a fresh install. I would recommend removing all aspects of BOINC on this system. completely delete everything. and re-install from a fresh install package. do not keep anything from the previous install.


Yes, you are right, there is obviously something wrong with this BOINC installation. I will remove it and install it from scratch once the currently running Climateprediction tasks (which tend to last up to 14 days or even longer) are through.

Nevertheless, it's sad to learn that the Windows version of the ATM app is still faulty.
What I don't understand is: do they not test it before hundreds or thousands of faulty tasks are sent out? In fact, a test run in their own lab would have shown within 5 minutes that something is still wrong. I think those 5 minutes would be worth the time, right?

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 286,049,934
RAC: 2,212,885
Level
Asn
Scientific publications
wat
Message 61055 - Posted: 24 Jan 2024 | 19:03:50 UTC - in response to Message 61053.
Last modified: 24 Jan 2024 | 19:05:07 UTC

fully agree

Nevertheless, it's sad to learn that the Windows version of the ATM app is still faulty.
What I don't understand is: do they not test it before hundreds or thousands of faulty tasks are sent out? In fact, a test run in their own lab would have shown within 5 minutes that something is still wrong. I think those 5 minutes would be worth the time, right?
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61056 - Posted: 24 Jan 2024 | 19:16:45 UTC - in response to Message 61053.


Nevertheless, it's sad to learn that the Windows version of the ATM app is still faulty.
What I don't understand is: do they not test it before hundreds or thousands of faulty tasks are sent out? In fact, a test run in their own lab would have shown within 5 minutes that something is still wrong. I think those 5 minutes would be worth the time, right?


Steve, the researcher, in his first few posts about these tasks said that they don't have any Windows machines in the lab.

They only have Linux.

I'll post to Gianni that he needs to help get the Windows apps sorted out.

tomaras
Send message
Joined: 4 Mar 20
Posts: 12
Credit: 276,966,417
RAC: 310,552
Level
Asn
Scientific publications
wat
Message 61058 - Posted: 24 Jan 2024 | 19:42:15 UTC - in response to Message 60002.

It's been 10 months since this was posted. Where is the "hoped for" Windows version? Why are you wasting the potential of all our Windows machines and fast new GPUs?

roundup
Send message
Joined: 11 May 10
Posts: 57
Credit: 1,561,695,193
RAC: 7,902,359
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61059 - Posted: 24 Jan 2024 | 20:14:36 UTC - in response to Message 61058.

The 'Energy is NaN' error is still around:
http://gpugrid.net/result.php?resultid=33745995

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 61060 - Posted: 24 Jan 2024 | 20:36:28 UTC - in response to Message 61056.


Nevertheless, it's sad to learn that the Windows version of the ATM app is still faulty.
What I don't understand is: do they not test it before hundreds or thousands of faulty tasks are sent out? In fact, a test run in their own lab would have shown within 5 minutes that something is still wrong. I think those 5 minutes would be worth the time, right?


Steve, the researcher, in his first few posts about these tasks said that they don't have any Windows machines in the lab.

They only have Linux.

I'll post to Gianni that he needs to help get the Windows apps sorted out.


Hey Keith,

If you contact Gianni, pass on the following info I found from my testing.

There are 2 issues on the same line in this piece of code in run.bat:

@echo Install AToM
tar.exe xvf atom.tar
python.exe -m pip install ./Acellera-AToM-OpenMM-* || exit 13
python.exe -m pip list


1. The path separator '/' is wrong for Windows; it should be '\' instead. This makes pip install choke. This should be a trivial fix.
2. Windows CMD shell scripts do not support inline expansion of the '*' wildcard, so pip install doesn't find the module at the location it expects, "Acellera-AToM-OpenMM-*".
There are a few ways to fix this:
- Use the full name of the package folder, 'Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6'. This also implies that if the '2dd310b8027c68262906a8946f807896b49947b6' part is variable, run.bat would have to be changed every time.
- Generate a new atom.tar with a fixed folder name, for example always using 'Acellera-AToM-OpenMM' as the folder name of the package inside atom.tar, and adapt the run.bat pathname to .\Acellera-AToM-OpenMM accordingly.
- Use some scripting magic to pre-expand the wildcard into a variable (e.g. ATOM) and pass that variable to pip install. Something like this could work, but may have mixed results on different Windows installs, so solution 1 or 2 is preferred.

@echo Install AToM
tar.exe xvf atom.tar

set PARM1=.\Acellera-AToM-OpenMM-*
for %%A in (%PARM1%) do set ATOM=%%A

python.exe -m pip install %ATOM% || exit 13
python.exe -m pip list

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61063 - Posted: 24 Jan 2024 | 23:08:32 UTC

I posted to Gianni and he replied that he copied my message to Steve.

I will try and get a response from Steve directly via PM and reference your post and analysis.

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 126
Credit: 107,156,939
RAC: 28,034
Level
Cys
Scientific publications
watwatwatwatwatwat
Message 61064 - Posted: 24 Jan 2024 | 23:17:24 UTC

Here's the top half of my dump:

<core_client_version>7.24.1</core_client_version>
<![CDATA[
<message>
The operating system cannot run %1.
(0xc3) - exit code 195 (0xc3)</message>
<stderr_txt>
00:14:03 (19992): wrapper (7.9.26016): starting
00:14:03 (19992): wrapper: running python.exe (bin/conda-unpack)
00:14:13 (19992): python.exe exited; CPU time 0.000000
00:14:13 (19992): wrapper: running Library/usr/bin/tar.exe (xjvf input.tar.bz2)
atom.tar
JNK1_m38_m58_0.xml
JNK1_m38_m58_asyncre.cntl
JNK1_m38_m58.inpcrd
JNK1_m38_m58.prmtop
run.bat
run.sh
00:14:14 (19992): Library/usr/bin/tar.exe exited; CPU time 0.015625
00:14:14 (19992): wrapper: running C:/Windows/system32/cmd.exe (/c call run.bat)
ERROR: Invalid requirement: './Acellera-AToM-OpenMM-*'
00:14:18 (19992): C:/Windows/system32/cmd.exe exited; CPU time 0.000000
00:14:18 (19992): app exit status: 0xd
00:14:18 (19992): called boinc_finish(195)
0 bytes in 0 Free Blocks.
310 bytes in 4 Normal Blocks.
1144 bytes in 1 CRT Blocks.
0 bytes in 0 Ignore Blocks.
0 bytes in 0 Client Blocks.
Largest number used: 0 bytes.
Total allocations: 434076 bytes.
Dumping objects ->

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61065 - Posted: 24 Jan 2024 | 23:17:26 UTC

The python package is the large 1.9GB package that downloads to every host the first time it runs ATMbeta tasks. It is static and sets up the python environment in the project folder.

It only needs to be downloaded once, not for every task.

The name won't change until Steve updates or makes changes to it. If he fixes the package for Windows, the name should change. But he could then make the filename static and reference it directly without paths.

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 32
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61069 - Posted: 25 Jan 2024 | 9:13:01 UTC - in response to Message 61065.

Thank you all for the windows debugging info. I am looking into this!

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61073 - Posted: 25 Jan 2024 | 12:04:07 UTC - in response to Message 61069.

Thank you all for the windows debugging info. I am looking into this!

Thank you, Steve. I'm looking forward to crunching ATMs with all 6 of my GPUs on Windows.

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 61074 - Posted: 25 Jan 2024 | 12:11:48 UTC - in response to Message 61069.

Thank you all for the windows debugging info. I am looking into this!


Thanks for working on this, Steve!

I just got a WU called "T0_1-STEVE_TEST_ATM-1-5-RND5320" where I noticed you went for a pre-untarred folder "Acellera-AToM-OpenMM-gitrepo" inside the input file.

I'm happy to report that this went past the pip install statement without a hitch and is now happily simulating!

Good job!

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 61075 - Posted: 25 Jan 2024 | 12:24:30 UTC - in response to Message 61074.

And done successfully!
https://www.gpugrid.net/result.php?resultid=33751815

I've got another one in the queue that is not yet corrected, so I'm going to suspend GPUGRID for now to avoid a string of error WUs until you let us know the fix has been incorporated in all new WUs.

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 32
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61076 - Posted: 25 Jan 2024 | 14:07:43 UTC - in response to Message 61075.
Last modified: 25 Jan 2024 | 14:08:50 UTC

Great, thanks for the help! The new changes have been passed on to the researchers. The next round of jobs should have the fix.

(Please note I can't inject this fix into already sent WU's)

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 32
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61078 - Posted: 25 Jan 2024 | 15:07:12 UTC

Just to explain a bit about how this app currently works.

The "app" is a python environment we package as a zipfile (~1GB). This is downloaded once. It will be re-downloaded if we update the app. Updating an app is a rather time consuming process and error prone so we try and avoid it unless absolutely necessary.

In each work unit we include three main things: 1. The input molecular structures. 2. A few scripts that run the simulation. 3. A git code folder that contains the python code (AToM-OpenMM, a folder of a few MB).

The code folder could have been packaged into the "app" python environment. However, this code is something we update regularly with different features so it is easier to include it on a per work unit basis.
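
As a concrete illustration, the JNK1 work unit whose stderr Greg _BE posted above contains exactly these pieces. The per-file roles annotated below are an informed guess on the editor's part, not something the project has spelled out:

# contents of input.tar.bz2 for one ATM work unit (filenames from the JNK1 task above)
JNK1_m38_m58_0.xml          # (1) input molecular structures: system/structure definition
JNK1_m38_m58.inpcrd         # (1) input molecular structures: coordinates
JNK1_m38_m58.prmtop         # (1) input molecular structures: topology/parameters
JNK1_m38_m58_asyncre.cntl   # (2) control settings read by the run scripts
run.sh / run.bat            # (2) the scripts that run the simulation (Linux / Windows)
atom.tar                    # (3) the AToM-OpenMM code folder, pip-installed at start-up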

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61079 - Posted: 25 Jan 2024 | 16:54:15 UTC - in response to Message 61078.


+1

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 126
Credit: 107,156,939
RAC: 28,034
Level
Cys
Scientific publications
watwatwatwatwatwat
Message 61104 - Posted: 27 Jan 2024 | 22:35:16 UTC

What does this mean: <message>
The operating system cannot run %1.
(0xc3) - exit code 195 (0xc3)</message>

Is this something like Linux trying to run on Windows or what?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61106 - Posted: 27 Jan 2024 | 23:24:33 UTC - in response to Message 61104.

What does this mean: <message>
The operating system cannot run %1.
(0xc3) - exit code 195 (0xc3)</message>

Is this something like Linux trying to run on Windows or what?

Googling shows this:

"The operation system cannot run %1" is shown when some apps try to open links

Sounds like the app is trying to open links that are invalid or badly formed.

Or a variation of this:

ImportError: DLL load failed: The operating system cannot run %1.

Greg _BE
Send message
Joined: 30 Jun 14
Posts: 126
Credit: 107,156,939
RAC: 28,034
Level
Cys
Scientific publications
watwatwatwatwatwat
Message 61107 - Posted: 28 Jan 2024 | 13:30:55 UTC - in response to Message 61106.

What does this mean: <message>
The operating system cannot run %1.
(0xc3) - exit code 195 (0xc3)</message>

Is this something like Linux trying to run on Windows or what?

Googling shows this:

"The operation system cannot run %1" is shown when some apps try to open links

Sounds like the app is trying to open links that are invalid or badly formed.

Or a variation of this:

ImportError: DLL load failed: The operating system cannot run %1.



Lovely

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61110 - Posted: 29 Jan 2024 | 14:12:55 UTC

Got four failures for ATMbeta resends today, on two slightly differing versions of Linux Mint. All had

+ python -m pip install ./Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6 ./Acellera-AToM-OpenMM-gitrepo
ERROR: Cannot install acellera-atom-openmm 3.3.0rc4 (from /hdd/boinc-client/slots/4/Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6) and acellera-atom-openmm 3.3.0rc4 (from /hdd/boinc-client/slots/4/Acellera-AToM-OpenMM-gitrepo) because these package versions have conflicting dependencies.
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 61111 - Posted: 29 Jan 2024 | 14:18:15 UTC - in response to Message 61110.

Got four failures for ATMbeta resends today, on two slightly differing versions of Linux Mint. All had

+ python -m pip install ./Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6 ./Acellera-AToM-OpenMM-gitrepo
ERROR: Cannot install acellera-atom-openmm 3.3.0rc4 (from /hdd/boinc-client/slots/4/Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6) and acellera-atom-openmm 3.3.0rc4 (from /hdd/boinc-client/slots/4/Acellera-AToM-OpenMM-gitrepo) because these package versions have conflicting dependencies.
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts


Same. I had about 150 of these errors.
____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61112 - Posted: 29 Jan 2024 | 14:49:06 UTC - in response to Message 61111.

same. i had about 150 of these errors.

All of mine have gone to full workunit failure, with too many errors. Some of them have been tried by Windows computers, and have failed there too.

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 61113 - Posted: 29 Jan 2024 | 21:49:23 UTC - in response to Message 61076.

Great, thanks for the help! The new changes have been passed on to the researchers. The next round of jobs should have the fix.

(Please note I can't inject this fix into already sent WU's)


Hi Steve,

I noticed a new batch of ATMs has been released, but the fix has only been partially incorporated, leading to failures on both Windows and Linux.

The WUs do contain the pre-untarred 'Acellera-AToM-OpenMM-gitrepo' directory, which is good, but the 'atom.tar' file is also still included, which is both an inefficient use of network bandwidth and a source of errors.

Worse still, the run.bat and run.sh files both still have the following statement:
python.exe -m pip install ./Acellera-AToM-OpenMM-*


On Windows, this still leads to the same old error because of the invalid path separator / instead of \ and because the '*' is interpreted literally instead of as a wildcard. See Task error example here: https://www.gpugrid.net/result.php?resultid=33760402

On Linux, the same task leads to an error because it does interpret '*' as a wildcard, finding 2 AToM folders (from atom.tar and from Acellera-AToM-OpenMM-gitrepo) instead of 1, with a conflicting dependency error as result. See Task error example here: https://www.gpugrid.net/result.php?resultid=33760795

Resolution:
Replace the statement
python.exe -m pip install ./Acellera-AToM-OpenMM-*


by the following in run.bat (Windows):
python.exe -m pip install .\Acellera-AToM-OpenMM-gitrepo


by the following in run.sh (Linux):
python.exe -m pip install ./Acellera-AToM-OpenMM-gitrepo


As an optimization:
EITHER remove atom.tar from the WU, as well as the corresponding 'tar' statements in run.bat and run.sh
OR package Acellera-AToM-OpenMM-gitrepo folder inside of atom.tar instead of Acellera-AToM-OpenMM-2dd310b8027c68262906a8946f807896b49947b6

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 61114 - Posted: 29 Jan 2024 | 21:50:33 UTC - in response to Message 61112.

same. i had about 150 of these errors.

All of mine have gone to full workunit failure, with too many errors. Some of them have been tried by Windows computers, and have failed there too.


See the reason for both Linux and Windows errors in my post above...

Steve
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Send message
Joined: 21 Dec 23
Posts: 32
Credit: 0
RAC: 0
Level

Scientific publications
wat
Message 61115 - Posted: 30 Jan 2024 | 7:58:46 UTC

thank you for the catch again, I will check the correct scripts are being used

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61116 - Posted: 30 Jan 2024 | 9:40:03 UTC

Task TYK2_m10_m15_2_TEST-QUICO_ATM_500K_dih14fit-0-5-RND5394_0 (ATMbeta) received and is running correctly under Linux Mint 21.3

Hans Sveen
Send message
Joined: 29 Oct 08
Posts: 3
Credit: 402,605,899
RAC: 78
Level
Gln
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61125 - Posted: 30 Jan 2024 | 19:33:51 UTC

Hi!

I've got one on Win10 that is so far still running!

Task https://www.gpugrid.net/result.php?resultid=33764570

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61142 - Posted: 1 Feb 2024 | 11:22:18 UTC - in response to Message 61125.

Hi!

I've got one on Win10 that so far still running!

Task https://www.gpugrid.net/result.php?resultid=33764570


here, too :-)

Also the progress bar in the BOINC manager now works fine !

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 61211 - Posted: 8 Feb 2024 | 9:49:27 UTC

Has the ATM experiment been stopped? Or can we expect more work in future?

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,098,952,024
RAC: 8,566,567
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61294 - Posted: 17 Feb 2024 | 13:12:55 UTC

A new batch of ATMbeta tasks has been in the field since yesterday.
This batch seems to be heavier in processing time than most of the preceding ones.

A small tip:
Answering my own request for multi-GPU hosts at Message #61293.
Looking carefully at the Stderr output report for a given ATMbeta task, a line like this can be found:

.
+ echo localhost,0:N,1,CUDA,,/var/lib/boinc-client/slots/X/tmp
.

Where "N" corresponds to the Device Number (GPU) where the task was run on.

With ATMbeta tasks I'm not currently experiencing reliability problems, but identifying every single GPU can be useful, for example, to characterize its performance.
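
A rough sketch of how to collect those device numbers automatically on Linux, assuming the default /var/lib/boinc-client data directory and that the echoed nodefile line is still present in each slot's stderr.txt (both are assumptions; adjust the path to your own install):

# print "slot S -> GPU N" for every slot whose stderr contains the echoed nodefile line
grep -H 'echo localhost,0:' /var/lib/boinc-client/slots/*/stderr.txt \
  | sed 's#.*slots/\([0-9]*\)/stderr.txt.*localhost,0:\([0-9]*\),.*#slot \1 -> GPU \2#'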

roundup
Send message
Joined: 11 May 10
Posts: 57
Credit: 1,561,695,193
RAC: 7,902,359
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61301 - Posted: 19 Feb 2024 | 0:11:38 UTC - in response to Message 61294.

A new batch of ATMbeta tasks has been in the field since yesterday.
This batch seems to be heavier in processing time than most of the preceding ones.

Does anyone else have problems loading ATMbeta even though they were explicitly selected in the GPUGrid settings?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61302 - Posted: 19 Feb 2024 | 0:26:14 UTC - in response to Message 61301.

No, none at all.

As long as you also toggle to use "test tasks" in your project preferences.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 61303 - Posted: 19 Feb 2024 | 1:29:58 UTC - in response to Message 61302.

No, none at all.

As long as you also toggle to use "test tasks" in your project preferences.

this.

you need beta/test tasks selected.
____________

roundup
Send message
Joined: 11 May 10
Posts: 57
Credit: 1,561,695,193
RAC: 7,902,359
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61304 - Posted: 19 Feb 2024 | 2:51:02 UTC - in response to Message 61303.

you need beta/test tasks selected.

Silly me. I actually missed that. Thank you.

RockLr
Send message
Joined: 14 Mar 20
Posts: 7
Credit: 11,208,845
RAC: 1
Level
Pro
Scientific publications
wat
Message 61337 - Posted: 28 Feb 2024 | 10:27:08 UTC

Got an error when pausing and restarting on my non-English system.
https://www.gpugrid.net/result.php?resultid=34111833

子目录或文件 F:\apps\BOINC\data\slots\1\tmp 已经存在。

This means "A subdirectory or file F:\apps\BOINC\data\slots\1\tmp already exists." (the line is GBK-encoded, which is why it shows up as garbled character references in the stderr).

Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.

ImportError: DLL load failed while importing _openmm: 找不到指定的模块。 ("The specified module could not be found.")

I can't read the garbled output. Is it because of my system language? Or is there just something wrong with "importing _openmm"?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61338 - Posted: 29 Feb 2024 | 0:45:05 UTC - in response to Message 61337.

Just a Windows thing. It doesn't have the character library or something.

You generally can't stop or pause GPUGrid tasks.

Depending on the app, it either errors out or starts the calculation over from the beginning.

At least for the tasks that have been available lately.

RockLr
Send message
Joined: 14 Mar 20
Posts: 7
Credit: 11,208,845
RAC: 1
Level
Pro
Scientific publications
wat
Message 61340 - Posted: 29 Feb 2024 | 3:02:52 UTC - in response to Message 61338.

A half-day task that can't be paused may not be suitable for me...
By the way, it's also a problem for my poor laptop that GBs of stuff get installed in ...\slots\1\ for every task, rather than once in ...\projects\www.gpugrid.net\.

:(

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61344 - Posted: 29 Feb 2024 | 18:16:31 UTC - in response to Message 61340.

Tasks are always copied into and crunched in a slot in BOINC.

Never in the main BOINC directory or project directory. That is by design.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 61345 - Posted: 29 Feb 2024 | 18:27:55 UTC - in response to Message 61344.

yup. the only slight difference with GPUGRID ATM (and QChem) is that the environment package is very large (2-3GB when extracted). you do need to provision for that in this case. you don't notice it nearly as much with other projects shipping tiny binaries.

this is the way the project has decided to run their work.

but BOINC is all about choice. if the applications here do not mesh well with your system, probably best to find a different project that works better for you.
____________

kotenok2000
Send message
Joined: 18 Jul 13
Posts: 78
Credit: 12,875,793
RAC: 0
Level
Pro
Scientific publications
wat
Message 61346 - Posted: 29 Feb 2024 | 19:58:19 UTC - in response to Message 61338.
Last modified: 29 Feb 2024 | 20:00:16 UTC

The BOINC server doesn't have the correct character library.

You can use this https://www.online-decoder.com/

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61350 - Posted: 1 Mar 2024 | 10:01:14 UTC

Heads up: A new batch of ATM (not Beta) was released this morning - just 200 tasks, all allocated now, but it might be a sign of things to come.

Two things to note:

a) because of the shared DCF at this project, they were given a very low initial runtime estimate - less than 10 minutes, but likely to take nearer 4 hours on this card.
b) the task limit has also increased - up to four per card.

Don't be too greedy fetching work until this settles down!

Profile ServicEnginIC
Avatar
Send message
Joined: 24 Sep 10
Posts: 566
Credit: 6,098,952,024
RAC: 8,566,567
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61354 - Posted: 2 Mar 2024 | 9:27:15 UTC - in response to Message 61350.

Two things to note:

a) because of the shared DCF at this project, they were given a very low initial runtime estimate - less than 10 minutes, but likely to take nearer 4 hours on this card.
b) the task limit has also increased - up to four per card.

Don't be too greedy fetching work until this settles down!

Thank you for your notes.
Maybe a third thing to point out:
I've noticed that recently the delay for the GPUGRID server to attend to task requests has been decreased to 11 seconds, instead of the previous 31 seconds.

| GPUGRID | update requested by user
| GPUGRID | Sending scheduler request: Requested by user.
| GPUGRID | Not requesting tasks: don't need (CPU: ; NVIDIA GPU: job cache full)
| GPUGRID | Scheduler request completed
| GPUGRID | Project requested delay of 11 seconds

This affects every app, not only ATM.
And for small batches, it can cause the tasks ready to send to be drained roughly three times as fast until they run out.

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 286,049,934
RAC: 2,212,885
Level
Asn
Scientific publications
wat
Message 61355 - Posted: 2 Mar 2024 | 9:43:02 UTC

Hello, are there any Quantum Chemistry tasks for Windows?
____________

[VENETO]Fabrizio74
Send message
Joined: 30 Apr 16
Posts: 1
Credit: 869,790,580
RAC: 691,108
Level
Glu
Scientific publications
watwatwatwatwatwatwat
Message 61356 - Posted: 2 Mar 2024 | 12:30:28 UTC

Good morning, is it possible to process tasks lasting 7 to 10 hours with RTX 3xxx series cards, or possibly to resume them from the point where they were stopped without causing an error?
Thanks

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61357 - Posted: 2 Mar 2024 | 17:11:23 UTC - in response to Message 61356.

Good morning, is it possible to process tasks lasting 7 to 10 hours with RTX 3xxx series cards, or possibly to resume them from the point where they were stopped without causing an error?
Thanks


Process tasks in less than 7 hours with RTX 3000 cards?

YES

Stop running tasks and resume them without error or losing progress?

NO

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 61358 - Posted: 2 Mar 2024 | 17:21:09 UTC - in response to Message 61355.

Hello, are there any Quantum Chemistry tasks for Windows?


not yet.
____________

tomaras
Send message
Joined: 4 Mar 20
Posts: 12
Credit: 276,966,417
RAC: 310,552
Level
Asn
Scientific publications
wat
Message 61359 - Posted: 2 Mar 2024 | 22:52:00 UTC - in response to Message 60002.

Gonna go out on a limb here and say a very high percentage of the computers that look for work units from you are Windows based. Why on earth would you design something for Linux first?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 61361 - Posted: 3 Mar 2024 | 1:13:01 UTC - in response to Message 61359.

Because that’s what they know and use in their lab. They develop on Linux, test, tweak, then port to Windows later.
____________

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 2,903,431,809
RAC: 17,369,206
Level
Phe
Scientific publications
wat
Message 61362 - Posted: 3 Mar 2024 | 1:50:36 UTC - in response to Message 61359.

Gonna go out on a limb here and say a very high percentage of the computers that look for work units from you are Windows based. Why on earth would you design something for Linux first?


Very, very little in the world of computational science starts in Windows. You should see the world of genomics! Basically nothing can be accomplished in Windows.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61365 - Posted: 4 Mar 2024 | 13:52:25 UTC

The non-beta ATMs are flowing again. Hopefully I can get up to 11 completions so the server will start to recognise the true processing speed this time.

This seems to be a second, larger batch of 0-7 tasks. Waiting to see if the runtime progress bug has been finally eliminated for this app.

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 61366 - Posted: 4 Mar 2024 | 14:17:31 UTC - in response to Message 61365.

Waiting to see if the runtime progress bug has been finally eliminated for this app.


I know this was fixed on ATMbeta, did it not translate to ATM on the previous batch?

____________

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61367 - Posted: 4 Mar 2024 | 14:52:36 UTC - in response to Message 61366.

They were all 0-7 too - and progress during that first segment was always OK. I've not seen a 1-7 or later yet.

Boca Raton Community HS
Send message
Joined: 27 Aug 21
Posts: 36
Credit: 2,903,431,809
RAC: 17,369,206
Level
Phe
Scientific publications
wat
Message 61368 - Posted: 4 Mar 2024 | 15:43:28 UTC - in response to Message 61350.
Last modified: 4 Mar 2024 | 15:47:02 UTC


b) the task limit has also increased - up to four per card.



I can say that running 4x on the 4090 has pushed the GPU harder than any other work in terms of power utilization, but they are running really smoothly. I probably won't run 4x for too long though - it pulls almost 400W(!) (never goes above 58C though).

My time estimations look like they are working on the systems I have checked today.

Edit: Time estimation is not accurate on all of our systems (yet).

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 61369 - Posted: 4 Mar 2024 | 15:45:56 UTC

"No tasks available for ATM", says BOINC

700+ tasks ready to send, says gpugrid.net.

???

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 207
Credit: 1,669,151,456
RAC: 719,040
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61371 - Posted: 4 Mar 2024 | 19:42:55 UTC - in response to Message 61369.
Last modified: 4 Mar 2024 | 20:32:17 UTC

"No tasks available for ATM", says BOINC

700+ tasks ready to send, says gpugrid.net.

???


Same problem here.

Edit: My Linux machines can get the ATM tasks. But my Windows machines cannot.

zombie67 [MM]
Avatar
Send message
Joined: 16 Jul 07
Posts: 207
Credit: 1,669,151,456
RAC: 719,040
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61372 - Posted: 5 Mar 2024 | 0:55:29 UTC

Update: My win machines finally got some of the ATM tasks. Not sure what changed, but happy to see it.
____________
Reno, NV
Team: SETI.USA

jon b.
Send message
Joined: 8 Jul 12
Posts: 2
Credit: 38,535,197
RAC: 64,129
Level
Val
Scientific publications
watwatwatwatwatwat
Message 61377 - Posted: 6 Mar 2024 | 9:26:59 UTC

I noticed that the ATM tasks are receiving 1,125,000.00 credits each. This seems several hundred times greater than what I would expect for the computation time of these tasks.

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 61378 - Posted: 6 Mar 2024 | 11:06:40 UTC - in response to Message 61377.

I noticed that the ATM tasks are receiving 1,125,000.00 credits each. This seems several hundred times greater than what I would expect for the computation time of these tasks.


That's because BOINC credits are calculated based on computation operations (flops) not computation time, and GPUs have much higher flops than CPUs.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61379 - Posted: 6 Mar 2024 | 12:22:36 UTC - in response to Message 61378.

That's because BOINC credits are calculated based on computation operations (flops) not computation time, and GPUs have much higher flops than CPUs.

At this project, they aren't calculated at all - they are given at a fixed rate for each application type, set by the admins.

jon b.
Send message
Joined: 8 Jul 12
Posts: 2
Credit: 38,535,197
RAC: 64,129
Level
Val
Scientific publications
watwatwatwatwatwat
Message 61380 - Posted: 6 Mar 2024 | 14:51:31 UTC - in response to Message 61378.

That's because BOINC credits are calculated based on computation operations (flops) not computation time, and GPUs have much higher flops than CPUs.

Yes, I am well aware of how BOINC credits were originally designed. That is why I am pointing out the credit issue. If I continued crunching these tasks, the credit would eclipse projects I have been crunching (with GPUs) for 12 years in a matter of days. These even seem high compared to other GPUGRID subprojects.

roundup
Send message
Joined: 11 May 10
Posts: 57
Credit: 1,561,695,193
RAC: 7,902,359
Level
His
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61395 - Posted: 9 Mar 2024 | 10:20:59 UTC

I had 35 ATM Work Units over the recent days and not a single error on 2 Linux machines.
Congrats to the developers, great job!

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61396 - Posted: 9 Mar 2024 | 14:39:53 UTC - in response to Message 61395.

I had 35 ATM Work Units over the recent days and not a single error on 2 Linux machines.
Congrats to the developers, great job!

I too had quite a number of ATMs on my Windows machines, and almost all of them worked well - so obviously the initial problems with Windows were fixed, which is great! Hence, also my congrats to the developers :-)

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61397 - Posted: 9 Mar 2024 | 14:50:02 UTC - in response to Message 61365.

On March 4, Richard Haselgrove wrote:

The non-beta ATMs are flowing again. Hopefully I can get up to 11 completions so the server will start to recognise the true processing speed this time.

On some of my hosts I've processed clearly more than 11 completions, and still BOINC shows up to 20 days in the "remaining time" column; which means that no other tasks are being downloaded (and parked in a waiting position) as long as 1 task is being processed.

Interestingly enough: on the first or even second day, the behaviour was different: a remaining time of about 8-10 hours was shown, hence more than one task per GPU could be downloaded. And then, all of a sudden, the remaining times changed to up to 20 days - no idea why ???

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61398 - Posted: 9 Mar 2024 | 17:56:22 UTC - in response to Message 61397.

Usually it means you did other work like the QC tasks which reset the DCF to a higher value.

That borks the previous lower value and correct estimated times for the ATM tasks.

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 61399 - Posted: 9 Mar 2024 | 19:13:03 UTC - in response to Message 61398.

Usually it means you did other work like the QC tasks which reset the DCF to a higher value.

That borks the previous lower value and correct estimated times for the ATM tasks.


For me, on my Linux boot, it's the other way round. I need to set DCF to 0.01 in order to get reasonable estimates for QC tasks (still almost 10x the real duration though), but once ATM gets in the mix, the DCF skyrockets...

Anyway, with Erich56 being on Windows, it's unlikely that QC tasks were the issue.

But I'm seeing similar results on Windows. Even though ATM tasks consistently finish faster than any original estimate, at some point they go up to an estimated time of 20+ days.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61400 - Posted: 9 Mar 2024 | 20:09:17 UTC - in response to Message 61399.

Anyway, Erich56 being on Windows, unlikely that QC tasks were the issue.

exactly

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61408 - Posted: 13 Mar 2024 | 10:05:45 UTC

I now seem to receive several tasks which fail within 1-2 minutes.

As one can see, these tasks have been failing on hosts of other volunteers as well:
https://www.gpugrid.net/workunit.php?wuid=28095144

Why are all these faulty tasks being sent out all of a sudden?

Bedrich Hajek
Send message
Joined: 28 Mar 09
Posts: 467
Credit: 8,358,971,966
RAC: 9,198,359
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61409 - Posted: 13 Mar 2024 | 10:28:01 UTC - in response to Message 61408.

I now seem to receive several tasks which fail within 1-2 minutes.

As one can see, these tasks have been failing on hosts of other volunteers as well:
https://www.gpugrid.net/workunit.php?wuid=28095144

Why are all these faulty tasks being sent out all of a sudden?


I have the same problem:

https://www.gpugrid.net/results.php?hostid=610674&offset=0&show_names=0&state=0&appid=41



Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61410 - Posted: 13 Mar 2024 | 10:46:25 UTC - in response to Message 61408.

Why are all these faulty tasks being sent out all of a sudden?

Look in the task report on this website. The first one I tried was blank, but the next said:

openmm.OpenMMException: Unknown property 'version' in node 'IntegratorParameters'

That suggests an error preparing the batch - a project problem, not yours.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61411 - Posted: 13 Mar 2024 | 11:15:39 UTC - in response to Message 61410.
Last modified: 13 Mar 2024 | 11:28:47 UTC

Why are all these faulty tasks being sent out all of a sudden?

Look in the task report on this website. The first one I tried was blank, but the next said:

openmm.OpenMMException: Unknown property 'version' in node 'IntegratorParameters'

That suggests an error preparing the batch - a project problem, not yours.

Thank you Richard, I did spot what stderr was showing. So it was clear to me anyway that there seems to be a problem with the batch.
I keep receiving all these faulty tasks, so I'd better switch to "no new tasks"

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61413 - Posted: 13 Mar 2024 | 17:27:57 UTC - in response to Message 61410.

Why are all these faulty tasks being sent out all of a sudden?

Look in the task report on this website. The first one I tried was blank, but the next said:

openmm.OpenMMException: Unknown property 'version' in node 'IntegratorParameters'

That suggests an error preparing the batch - a project problem, not yours.

All these faulty tasks are "CDK8_" - I have received about 30 of them this afternoon.
Once more I am questioning whether tasks are being tested before a full batch is released. In the case of these CDK8 tasks, a test taking no longer than a few minutes would have revealed the problem.

Richard Haselgrove
Send message
Joined: 11 Jul 09
Posts: 1576
Credit: 5,766,936,851
RAC: 8,686,553
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 61414 - Posted: 14 Mar 2024 | 14:25:27 UTC - in response to Message 61413.

CDK8_s are being issued now. I have three running normally, but they're all on Linux machines. That shouldn't be a problem for an error of this type, so I would expect them to run under Windows too - but approach with caution!

tomaras
Send message
Joined: 4 Mar 20
Posts: 12
Credit: 276,966,417
RAC: 310,552
Level
Asn
Scientific publications
wat
Message 61415 - Posted: 17 Mar 2024 | 20:07:51 UTC - in response to Message 61362.

Yeah, but what exactly does that have to do with getting work units processed on computers that, from what I can see of the folks contributing here, are primarily running a Windows OS?

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 61416 - Posted: 18 Mar 2024 | 16:32:43 UTC - in response to Message 61415.

Yeah, but what exactly does that have to do with getting work units processed on computers that, from what I can see of the folks contributing here, are primarily running a Windows OS?


this is a prime example: https://github.com/gpugrid/gpugrid/issues/1

I know this thread is about the ATM app specifically, but stuff like this is why a Windows build might not be available.

The only app at this project without a Windows version is the Quantum Chemistry GPU app (PYSCFbeta). And they basically can't even build it for Windows, because the main codebase they are using for their version of the application is not offered for Windows at all. There's no build recipe for Windows.

It's a misconception to think that the applications here are 100% homegrown and original. They are using a lot of other code and applications, pieced together and adapted in a custom way to do their specific work. The fact that a major part of the code they didn't write or maintain isn't available for Windows kind of forces their hand. It's not that they don't want to support Windows; the choice was either don't do the work at all, or be stuck with Linux-only but at least get some work done, and they chose the latter.

To be more on-topic with this thread: ATM tasks DO have a Windows version available. Some systems produce errors for some unknown reason, while others work fine. Not getting work at all is a problem with your configuration somehow, since other Windows systems are getting work just fine.
____________

tomaras
Send message
Joined: 4 Mar 20
Posts: 12
Credit: 276,966,417
RAC: 310,552
Level
Asn
Scientific publications
wat
Message 61417 - Posted: 20 Mar 2024 | 23:48:58 UTC - in response to Message 61416.

I have two Windows computers running GPUGRID and I'm lucky to get two ATM work units every couple of days, no matter how often I manually request tasks.

Also, my RTX 4090 card can plow through a work unit in a few hours, yet the estimated remaining time begins at over 30 days. Since my settings ask for 2 days' worth of work, maybe there is something in that miscalculation that is preventing me from getting consistent tasks?

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61418 - Posted: 21 Mar 2024 | 0:25:58 UTC - in response to Message 61417.
Last modified: 21 Mar 2024 | 0:27:16 UTC

Once you've validated 11 ATM tasks, the DCF for that application should come down and the estimated times to completion should come down to something reasonable.

I have a watch script running and I don't think I really need it as I get a new task on the same scheduler connection I return one on.

But that is for the 65K available QC tasks. You have to wait for Windows compatible ATM tasks to appear again.

tomaras
Send message
Joined: 4 Mar 20
Posts: 12
Credit: 276,966,417
RAC: 310,552
Level
Asn
Scientific publications
wat
Message 61419 - Posted: 21 Mar 2024 | 1:00:07 UTC - in response to Message 61418.

Thanks. I've validated well over 11 ATM tasks. Oh well, it is what it is. I just hate to see all this GPU horsepower I have doing nothing.

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 286,049,934
RAC: 2,212,885
Level
Asn
Scientific publications
wat
Message 61420 - Posted: 21 Mar 2024 | 9:19:14 UTC

Good morning,
I would like to know if it is possible to assign a graphics card to one particular type of calculation, because I have a GTX 1650 4GB but GPUGRID keeps giving it Quantum Chemistry to calculate.
With only 4GB everything errors out, unlike on my RTX 4060.
I would like my GTX 1650 to calculate ATM, because that causes no problems; the calculation succeeds every time.
From what I know, GPUGRID does not handle an appconfig.xml file.
I would like my GTX 1650 to calculate ATM and my RTX 4060 Quantum Chemistry.
Is that possible?
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61421 - Posted: 21 Mar 2024 | 12:43:39 UTC - in response to Message 61418.

Keith Myers wrote:

Once you've validated 11 ATM tasks, the DCF for that application should come down and the estimated times to completion should come down to something reasonable.

None of my 4 hosts (with a total of 7 GPUs) is showing this behaviour, though.

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 286,049,934
RAC: 2,212,885
Level
Asn
Scientific publications
wat
Message 61422 - Posted: 21 Mar 2024 | 12:45:40 UTC

Will GPUGRID stop sending Quantum Chemistry units to my GTX 1650 if they all fail because of the amount of RAM, and automatically put my GTX 1650 on ATM, which should work correctly?
____________

Ian&Steve C.
Avatar
Send message
Joined: 21 Feb 20
Posts: 1035
Credit: 36,941,282,483
RAC: 47,536,110
Level
Trp
Scientific publications
wat
Message 61423 - Posted: 21 Mar 2024 | 12:54:58 UTC - in response to Message 61422.

Will GPUGRID stop sending Quantum Chemistry units to my GTX 1650 if they all fail because of the amount of RAM, and automatically put my GTX 1650 on ATM, which should work correctly?


go into your project preferences and disable running test applications and uncheck the Quantum Chemistry project.
____________

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61424 - Posted: 21 Mar 2024 | 16:42:16 UTC - in response to Message 61420.
Last modified: 21 Mar 2024 | 16:45:23 UTC


Good morning,
I would like to know if it is possible to assign a graphics card to one particular type of calculation, because I have a GTX 1650 4GB but GPUGRID keeps giving it Quantum Chemistry to calculate.
With only 4GB everything errors out, unlike on my RTX 4060.
I would like my GTX 1650 to calculate ATM, because that causes no problems; the calculation succeeds every time.
From what I know, GPUGRID does not handle an appconfig.xml file.
I would like my GTX 1650 to calculate ATM and my RTX 4060 Quantum Chemistry.
Is that possible?

Yes, use an exclude_gpu statement in your cc_config.xml file.
https://boinc.berkeley.edu/wiki/Client_configuration#Application_configuration


Something like this in the Options section.

<exclude_gpu>
<url>http://www.gpugrid.net/</url>
<device_num>1</device_num>
<type>NVIDIA</type>
<app>PYSCFbeta</app>
</exclude_gpu>

The 1650 should be enumerated as gpu#1 in Boinc in relation to the 4060. But check first in the Event Log which card is #0 and #1 in Boinc's thinking.
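
For completeness, a minimal sketch of the whole cc_config.xml with that exclude in place (assuming the 1650 really does enumerate as device 1 - check the Event Log first, as noted above). The exclude_gpu element has to sit inside <options>, and the file belongs in the BOINC data directory:

<cc_config>
  <options>
    <exclude_gpu>
      <url>http://www.gpugrid.net/</url>
      <device_num>1</device_num>
      <type>NVIDIA</type>
      <app>PYSCFbeta</app>
    </exclude_gpu>
  </options>
</cc_config>

After saving it, restart the BOINC client (or use Options > Read config files in the Manager) so the exclusion takes effect.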

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 286,049,934
RAC: 2,212,885
Level
Asn
Scientific publications
wat
Message 61425 - Posted: 22 Mar 2024 | 4:53:09 UTC - in response to Message 61424.

Thank you, but this does not work with GPUGRID because it does not handle an appconfig.xml file.
I just tried it, but it has no effect.
____________

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61426 - Posted: 22 Mar 2024 | 6:52:51 UTC - in response to Message 61425.

Thank you, but this does not work with GPUGRID because it does not handle an appconfig.xml file.
I just tried it, but it has no effect.

You refer to appconfig.xml. However, the addition suggested by Keith needs to be made in cc_config.xml (located in the BOINC data folder). Further, after making these changes in cc_config.xml, stop and restart BOINC. Only then will the changes take effect.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61427 - Posted: 22 Mar 2024 | 6:57:31 UTC - in response to Message 61425.

You asked for a solution for excluding your 1650 from Quantum Chemistry tasks while still being able to run both ATM and QC tasks on your 4060.

That is the solution I presented. The exclude statement goes into your cc_config.xml file, NOT your app_config.xml file.

You don't even need an app_config.xml file for this project; you can control everything just through your Compute Preferences for the project via your browser.

Look through the client configuration doc I linked for you.

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 286,049,934
RAC: 2,212,885
Level
Asn
Scientific publications
wat
Message 61428 - Posted: 22 Mar 2024 | 10:21:29 UTC - in response to Message 61427.

Thank you, I have put it in cc_config.xml and I will wait for ATM tasks for my GTX 1650.
You are very kind, and thank you again.
____________

[BAT] Svennemans
Send message
Joined: 27 May 21
Posts: 49
Credit: 246,897,017
RAC: 293,634
Level
Leu
Scientific publications
wat
Message 61429 - Posted: 22 Mar 2024 | 11:49:22 UTC - in response to Message 61419.

Thanks. I've validated well over 11 ATM tasks. Oh well, it is what it is. I just hate to see all this GPU horsepower I have doing nothing.


You can try to reset your DCF to a low value, which may fix the issue at least temporarily. If you're mixing PYSCF & ATM WUs on your machine this will not work for long, but with ATM only it might have a lasting effect.

- Stop BOINC client
- edit client_state.xml (in your main BOINC folder)
- look for the <project> section of gpugrid
- look inside this <project> section for <duration_correction_factor>
- set the value to 0.01 (lower than that is ignored by BOINC and will be updated to 0.01 after the first WU)

<duration_correction_factor>0.010000</duration_correction_factor>


- save and close
- Start BOINC client

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 286,049,934
RAC: 2,212,885
Level
Asn
Scientific publications
wat
Message 61430 - Posted: 22 Mar 2024 | 11:56:41 UTC - in response to Message 61429.

I advise you to switch to Linux.
There is always work. It is less straightforward to set up than Windows, but you will constantly have Quantum Chemistry to calculate.
Personally, I'm considering buying RTX A2000 or RTX 4000 SFF GPUs specifically for GPUGRID.
GPUGRID is the only BOINC GPU project that is worth the trouble.
The projects about numbers or space are, for me, just so-so.
____________

tomaras
Send message
Joined: 4 Mar 20
Posts: 12
Credit: 276,966,417
RAC: 310,552
Level
Asn
Scientific publications
wat
Message 61431 - Posted: 24 Mar 2024 | 22:12:29 UTC - in response to Message 61429.

Thanks... it's all kind of over my head. I've been running BOINC stuff since 2003 and have never had to edit anything. I'm just an old guy now.

Erich56
Send message
Joined: 1 Jan 15
Posts: 1091
Credit: 6,603,906,926
RAC: 3,160,162
Level
Tyr
Scientific publications
watwatwatwatwatwatwatwatwat
Message 61432 - Posted: 25 Mar 2024 | 17:21:31 UTC

After the ATMs had been running well over the past few weeks, almost all ATMs that my hosts received this afternoon failed after less than a minute with the error

13:14:21 (5332): app exit status: 0xc0000135
13:14:21 (5332): called boinc_finish(195)

why all of a sudden?

P.S. they failed on other volunteers' hosts, too. So the problem is not with my hosts.

tomaras
Send message
Joined: 4 Mar 20
Posts: 12
Credit: 276,966,417
RAC: 310,552
Level
Asn
Scientific publications
wat
Message 61434 - Posted: 27 Mar 2024 | 16:28:42 UTC - in response to Message 61432.

I have two Windows computers running the ATM tasks and I have not received even one task to run since 3/22/24.

Keith Myers
Send message
Joined: 13 Dec 17
Posts: 1288
Credit: 5,097,631,959
RAC: 8,944,362
Level
Tyr
Scientific publications
watwatwatwatwat
Message 61435 - Posted: 27 Mar 2024 | 21:23:00 UTC - in response to Message 61434.

If you look at the server status page https://www.gpugrid.net/server_status.php
you can see that there are only Quantum Chemistry tasks in large supply.
The ATM tasks have been very spotty for the last month.

Pascal
Send message
Joined: 15 Jul 20
Posts: 45
Credit: 286,049,934
RAC: 2,212,885
Level
Asn
Scientific publications
wat
Message 61437 - Posted: 27 Mar 2024 | 22:47:15 UTC - in response to Message 61435.

I advise you to switch to Linux.
There is always work. It is less straightforward to set up than Windows, but you will constantly have Quantum Chemistry to calculate.
Personally, I'm considering buying RTX A2000 or RTX 4000 SFF GPUs specifically for GPUGRID.
GPUGRID is the only BOINC GPU project that is worth the trouble.
The projects about numbers or space are, for me, just so-so.
