Message boards : News : ATM
Joined: 28 Feb 23 | Posts: 35 | Credit: 0 | RAC: 0
Hello GPUGRID!

You've probably already noticed that a new app called "ATM" has been deployed with some test runs. We are working on its validation and deployment, so expect more jobs for this app soon. Let me briefly explain what this new app is about.

The ATM application

ATM stands for Alchemical Transfer Method, a methodology designed by Emilio Gallicchio et al. for absolute and relative binding affinity predictions. The ATM method allows us to estimate binding affinities for molecules against a specific protein, measuring the strength with which they bind. This methodology falls under the category of alchemical free energy calculation methods, where unphysical intermediate states are used to estimate the free energy of physical processes (such as protein-ligand binding). The benefits of ATM, compared with other common free energy prediction methods (like the popular FEP), come from its simplicity: it can be used with any force field and does not require a lot of expertise to make it work properly.

Measuring experimental binding affinities between candidate molecules and the targeted protein is one of the first steps in drug discovery projects, but synthesizing molecules and performing experiments is expensive. Having the capacity to perform computational binding affinity predictions, particularly during drug lead optimization, is therefore extremely beneficial. We are actively testing and validating the ATM method so that we can start applying it to real drug discovery projects as soon as possible. Additionally, since these methods are usually applied to hundreds of molecules, they benefit greatly from the parallelization capabilities of GPUGRID, so if everything goes as expected, this could generate lots of work units.

The ATM app is based on Python, similar to the PythonRL application, and we ship it with a specific Python environment.
Here are the two main references for the ATM method, for both absolute and relative binding affinity predictions:

Absolute binding free energy estimation with ATM: https://arxiv.org/pdf/2101.07894.pdf
Relative binding free energy estimation with ATM: https://pubs.acs.org/doi/10.1021/acs.jcim.1c01129

For now we can only send jobs to Linux machines, but we hope to have a Windows version soon.
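For readers curious what an alchemical free energy estimate looks like in practice: the sketch below is not the ATM algorithm itself, just the classic Zwanzig exponential-averaging (free energy perturbation) estimator, which illustrates the general idea of recovering a free energy difference from sampled energy gaps between two states. The kT value and samples are illustrative.

```python
import math

def zwanzig_free_energy(delta_u_samples, kT=0.5961):
    """Estimate the free energy difference between two states from
    sampled energy differences delta_u = U1 - U0, via the Zwanzig
    formula: dG = -kT * ln< exp(-delta_u / kT) >_0.
    kT defaults to ~0.5961 kcal/mol (about 298 K)."""
    n = len(delta_u_samples)
    avg = sum(math.exp(-du / kT) for du in delta_u_samples) / n
    return -kT * math.log(avg)

# Toy check: if every sample has delta_u == 1.0, dG == 1.0 (up to rounding).
print(zwanzig_free_energy([1.0, 1.0, 1.0]))
```

Real methods such as ATM use many intermediate lambda states and more robust estimators; this single-step form is only the conceptual core.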
Joined: 28 Feb 23 | Posts: 35 | Credit: 0 | RAC: 0
I'm brand new to GPUGRID, so apologies in advance if I make some mistakes. I'm looking forward to learning from you all and discussing this app :)
Joined: 11 Jul 09 | Posts: 1639 | Credit: 9,948,167,649 | RAC: 8,672,171
Welcome! Let's start with some good news. I picked up one of your test tasks a couple of days ago: T0_1-QUICO_TEST_ATM-0-1-RND8922_0. It ran right through without raising any red flags, and validated at the end. A good start.
Joined: 21 Feb 20 | Posts: 1099 | Credit: 40,331,687,595 | RAC: 101,874
Thanks for creating an official topic for these types of tasks. The latest problem observed recently was upload hangs due to an oversized result file. It didn't cause an error, but the file never uploaded because its size exceeded the limit on your Apache server. The only resolution for the user was to abort the transfer and hope it didn't get marked as an error. Have you already addressed this issue, either by raising the Apache server's file size limit, or by adjusting the tasks so they don't create such large result files?
Joined: 6 Jan 15 | Posts: 76 | Credit: 25,490,534,331 | RAC: 32,636
Welcome, and thanks for the info, Quico.

I did notice that on a past batch the upload got halted by the server; the result was rejected. I checked the client_state file and the file was below max_nbytes, but it still wasn't allowed to upload. In past history the maximum allowed file size has been 700 MB, and these were around 713-730 MB, so something else controls this cap. A change might help, but I don't see where the issue would be.

The event log for TL9_72-RAIMIS_TEST_ATM said:

<nbytes>729766132.000000</nbytes>
<max_nbytes>10000000000.000000</max_nbytes>

https://ibb.co/4pYBfNS

parsing upload result response:

<data_server_reply>
<status>0</status>
<file_size>0</file_size>
</data_server_reply>

error code -224 (permanent HTTP error)

https://ibb.co/T40gFR9

I will run a test on the new units, but we would probably face the same issue if the server has not changed.

https://boinc.berkeley.edu/trac/wiki/JobTemplates
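To check whether a stuck upload is hitting the client-side cap, one can compare `<nbytes>` against `<max_nbytes>` in the relevant file entry of client_state.xml. A minimal sketch, using a hypothetical fragment shaped like the values quoted above:

```python
import xml.etree.ElementTree as ET

# Hypothetical <file> entry in the shape of a client_state.xml record;
# the numbers match the event-log values quoted above.
fragment = """
<file>
  <name>TL9_72-RAIMIS_TEST_ATM_result</name>
  <nbytes>729766132.000000</nbytes>
  <max_nbytes>10000000000.000000</max_nbytes>
</file>
"""

entry = ET.fromstring(fragment)
nbytes = float(entry.findtext("nbytes"))
max_nbytes = float(entry.findtext("max_nbytes"))
print(f"{nbytes / 1e6:.0f} MB uploaded file, {max_nbytes / 1e6:.0f} MB allowed by client")
```

Since ~730 MB is far below the 10,000 MB client-side limit, the observed ~700 MB cap would have to be enforced server-side (for example by the web server's upload size limit), which matches what Greger saw.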
Joined: 1 Jan 15 | Posts: 1150 | Credit: 11,925,424,501 | RAC: 9,216,147
> File size in past history that max allowed have been 700mb

Greger, are you sure it was 700 MB? From what I remember, it was 500 MB.
Joined: 11 Jul 09 | Posts: 1639 | Credit: 9,948,167,649 | RAC: 8,672,171
I have one which is looking a bit poorly. It's 'running' on host 132158 (Linux Mint 21.1, GTX 1660 Super, 64 GB RAM), but it's only showing 3% progress after 18 hours. (Image from remote monitoring on a Windows computer.) Are there any files I can examine, or which would be useful to you for debugging, or should I simply abort it?
Joined: 4 Oct 09 | Posts: 2 | Credit: 187,078,712 | RAC: 133,231
I am trying to upload one, but can't get it to do the transfer:

Computer: MSI-B550-A-Pro
Project: GPUGRID
Name: TL9_82-RAIMIS_TEST_ATM-0-1-RND3943_1
Application: ATM: Free energy calculations of protein-ligand binding 1.13 (cuda1121)
Workunit name: TL9_82-RAIMIS_TEST_ATM-0-1-RND3943
State: Uploading
Received: 3/1/2023 4:46:17 PM
Report deadline: 3/6/2023 4:46:16 PM
Estimated app speed: 16.548,99 GFLOPs/sec
Estimated task size: 1.000.000.000 GFLOPs
Resources: 0,949 CPUs + 1 NVIDIA GPU
CPU time at last checkpoint: 00:00:00
CPU time: 05:27:34
Elapsed time: 05:28:51
Estimated time remaining: 00:00:00
Fraction done: 100%
Virtual memory size: 0,00 MB
Working set size: 0,00 MB
Debug State: 4 - Scheduler: 0
Joined: 11 Jul 09 | Posts: 1639 | Credit: 9,948,167,649 | RAC: 8,672,171
I think mine is a failure. Nothing has been written to stderr.txt since 14:22:59 UTC yesterday, and the final entries are:

+ echo 'Run AToM'
+ CONFIG_FILE=Tyk2_new_2-ejm_49-ejm_50_asyncre.cntl
+ python bin/rbfe_explicit_sync.py Tyk2_new_2-ejm_49-ejm_50_asyncre.cntl
Warning: importing 'simtk.openmm' is deprecated. Import 'openmm' instead.

I'm aborting it. NB a previous user also failed with a task from the same workunit: 27418556
Joined: 28 Feb 23 | Posts: 35 | Credit: 0 | RAC: 0
Thanks everyone for the replies!

From what I have seen from the single test job I personally sent, one replica finished without issues but the other two blew up (Particle coordinate is NaN). I find this strange because I have only seen this in the preparation runs I do locally, not during production, and the errors should be different. I'll check a few things locally, since I changed a few things from my local runs, and we'll try again, also with different inputs.

> Welcome and thanks for info Quico

Thanks for this, I'll keep it in mind.

From the successful run the file size is 498 MB, so it should be close to the limit there, per what @Erich56 says. But that's useful information for when I run bigger systems.

> I think mine is a failure. Nothing has been written to stderr.txt since 14:22:59 UTC yesterday, and the final entries are:

Hmmm, that's weird. It shouldn't softlock at that step. Although this warning pops up, it should keep running without issues. I'll ask around.
Joined: 3 Jul 16 | Posts: 31 | Credit: 2,237,559,169 | RAC: 279,062
This task didn't want to upload, but neither would GPUGrid update when I aborted the upload. I only got 24-hour time-outs.

Greetings, Jens
Joined: 18 Sep 08 | Posts: 368 | Credit: 4,173,502,885 | RAC: 1,099
I just aborted one ATM WU (https://www.gpugrid.net/result.php?resultid=33338739) that had been running for over 7 days; it sat at 75% done the whole time. Got another one and it immediately jumped to 75% done. I'll probably just abort it and deselect any new ATM WUs...

STE\/E
Joined: 12 Jul 17 | Posts: 401 | Credit: 17,242,399,587 | RAC: 16,556,744
Some still running, many failing. Does ATM really need just one CPU? I think I saw a new 1.1 GB executable downloading; maybe the failures tried to run on the older version? What are the minimum VRAM and RAM requirements for ATM? Server Status shows both ATM and ATMbeta tasks, but the Tasks page shows them all as ATM. Strange: all my previously completed ATM WUs have vanished from my Tasks list? Thanks for the papers, I'll read them later.
Joined: 11 Jul 09 | Posts: 1639 | Credit: 9,948,167,649 | RAC: 8,672,171
Three successive errors on host 132158, all with:

python: can't open file '/hdd/boinc-client/slots/2/Scripts/rbfe_explicit_sync.py': [Errno 2] No such file or directory
Joined: 12 Jul 17 | Posts: 401 | Credit: 17,242,399,587 | RAC: 16,556,744
I let some computers run off all other WUs so they were just running two ATM WUs. It appears they do only use one CPU each, but that may just be a consequence of specifying a single CPU in the client_state.xml file. Might your ATM project benefit from using multiple CPUs?

<app_version>
<app_name>ATM</app_name>
<version_num>113</version_num>
<platform>x86_64-pc-linux-gnu</platform>
<avg_ncpus>1.000000</avg_ncpus>
<flops>46211986880283.171875</flops>
<plan_class>cuda1121</plan_class>
<api_version>7.7.0</api_version>
</app_version>

nvidia-smi reports ATM 1.13 WUs using 550 to 568 MB of VRAM, so call it 0.6 GB. BOINCtasks reports all WUs using less than 1.2 GB of RAM. That means my computers could easily run up to 20 ATM WUs simultaneously. Sadly, GPUGRID does not let us control the number of WUs we download the way LHC or WCG do, so we're stuck with the 2 set by the ACEMD project. I never run more than a single PYTHON WU on a computer, so I get two, abort one, and then have to uncheck PYTHON in my GPUGRID preferences just in case ACEMD or ATM WUs materialize.

I wonder how many years it's been since GG improved the UI to make it more user-friendly? When one opens their Preferences they still get 2 Warnings and 2 Strict Standards notices that have never been fixed. Please add a link to your applications: https://www.gpugrid.net/apps.php
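The per-WU footprint figures above translate directly into a host capacity estimate. A small sketch (the CPU count is read from an `<app_version>` fragment like the one quoted; the 12 GB card and 64 GB RAM host are example numbers, not anyone's actual machine):

```python
import xml.etree.ElementTree as ET

# Fragment in the shape of the client_state.xml <app_version> quoted above
# (closing tag added so it parses standalone).
app_version = """
<app_version>
  <app_name>ATM</app_name>
  <version_num>113</version_num>
  <avg_ncpus>1.000000</avg_ncpus>
</app_version>
"""
ncpus = float(ET.fromstring(app_version).findtext("avg_ncpus"))
print(f"CPUs per WU: {ncpus}")

# Back-of-the-envelope capacity using the observed footprints
# (~600 MB VRAM and ~1200 MB RAM per WU); host sizes are examples.
vram_mb, ram_mb = 12_000, 64_000
per_wu_vram_mb, per_wu_ram_mb = 600, 1200
max_wus = min(vram_mb // per_wu_vram_mb, ram_mb // per_wu_ram_mb)
print(f"Max concurrent WUs on this host: {max_wus}")  # -> 20
```

With these numbers VRAM is the binding constraint, which is consistent with the "up to 20 WUs" estimate in the post above.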
Joined: 4 Mar 18 | Posts: 53 | Credit: 2,797,176,011 | RAC: 2,106,034
Is there a way to tell if an ATM WU is progressing? I have had only one succeed so far over the last several weeks. However, all of the failures so far were one of two types: either a failure to upload (and the download aborted by me) or a simple "Error while computing", which happened very quickly.

I now have an ATM WU which has been processing for over seven hours. Looking at the WU properties, it shows the CPU time nearly equal to the elapsed time. The GPU shows processing spikes up to 99%, and the 'down' periods are short. As others have reported, the progress sits steadily at 75%. I am inclined to keep letting it compute, but want to know what behavior others have seen on successful ATM WUs.
Joined: 21 Feb 20 | Posts: 1099 | Credit: 40,331,687,595 | RAC: 101,874
Let me explain something about the 75%, since it seems many don't understand what's happening here.

The 75% is in no way an indication of how much the task has progressed. It is entirely a function of how BOINC interacts with the wrapper when tasks are set up the way these are. The wrapper uses a job.xml file to instruct BOINC on the different "subtasks" to perform over the course of a single task from the project. In the <task> element there is an option to add a <weight> argument, which tells BOINC how much "weight", as a percentage of total task completion, that subtask is worth: a weight of 1 equals 1%, and so on. If the weight argument is not defined, each subtask gets equal weight.

In the case of the ATM tasks, the job.xml file has four subtasks and no weights defined. The first three subtasks are just quick extractions and unpacking and complete quickly, which is why the tasks jump to 75% straight away. If a task stays at 75% indefinitely, that's a pretty strong indication it is stuck and probably won't make more progress.

By comparison, the PythonGPU tasks have two subtasks, but the first (extraction) subtask has a weight of 1 and the second (run.py) subtask has a weight of 99, which is why they don't show this behavior. And the acemd3 tasks have only one subtask in the file, so no weight is needed at all and progress is fairly linear.
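The weight mechanics described above can be sketched as a simplified model of the wrapper's fraction-done calculation (an illustration of the arithmetic, not the actual BOINC wrapper source):

```python
def fraction_done(weights, completed):
    """Wrapper-style progress: each subtask contributes its weight to
    the total; 'completed' is the number of finished subtasks, in order.
    Equal weights model a job.xml with no <weight> elements."""
    total = sum(weights)
    return sum(weights[:completed]) / total

# ATM job.xml: four subtasks, no weights -> all equal. After the three
# quick setup subtasks finish, the bar jumps straight to 75%.
print(fraction_done([1, 1, 1, 1], 3))   # -> 0.75

# PythonGPU: weights 1 and 99 -> the extraction step barely moves the bar.
print(fraction_done([1, 99], 1))        # -> 0.01
```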
Joined: 11 Jul 09 | Posts: 1639 | Credit: 9,948,167,649 | RAC: 8,672,171
I have one that's running (?) much the same, and I think I've found a way to confirm it's still alive. I looked at the task properties to see which slot directory it was running in (slot 2, in my case), then found the relevant directory and poked about a bit. Our usual touchstone (stderr.txt) turned out to be useless; it hadn't been touched in hours. But another file, run.log, is currently active. The most recent entries are current:

2023-03-08 21:55:05 - INFO - sync_re - Started: sample 107, replica 12
2023-03-08 21:55:17 - INFO - sync_re - Finished: sample 107, replica 12 (duration: 12.440164870815352 s)

which seems to suggest that all is well. Perhaps Quico could let us know how many samples to expect in the current batch?
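That run.log check can be automated. A sketch that scans log text in the format quoted above for the last finished sample; the 341-sample total is the figure Quico gives for the recent batch and will vary per batch, and the log excerpt here is just the two lines from the post:

```python
import re

# Sample run.log content in the format quoted above (hypothetical excerpt).
log = """\
2023-03-08 21:55:05 - INFO - sync_re - Started: sample 107, replica 12
2023-03-08 21:55:17 - INFO - sync_re - Finished: sample 107, replica 12 (duration: 12.440164870815352 s)
"""

TOTAL_SAMPLES = 341  # batch-dependent; per the project's recent runs

finished = [int(m.group(1))
            for m in re.finditer(r"Finished: sample (\d+)", log)]
if finished:
    last = max(finished)
    print(f"sample {last}/{TOTAL_SAMPLES} ({100 * last / TOTAL_SAMPLES:.1f}%)")
else:
    print("no finished samples logged yet")
```

In practice one would read the file from the slot directory (e.g. `open("/path/to/slots/2/run.log")`) instead of a literal string.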
Joined: 4 Mar 18 | Posts: 53 | Credit: 2,797,176,011 | RAC: 2,106,034
Thanks for the idea. Sure enough, that file is showing activity (on sample 324, replica 3 for me). OK, just going to sit and wait.

Ian&Steve, thanks for the explanation. Just one thought: what if the fourth item is just "do everything else"? Couldn't that mean going straight from 75% to 100% at some point (assuming it is progressing)?
Joined: 28 Feb 23 | Posts: 35 | Credit: 0 | RAC: 0
> I have one that's running (?) much the same. I think I've found a way to confirm it's still alive.

Thanks for this input (and everyone's). At least in the runs I sent recently, we are expecting 341 samples.

I've seen that there were many crashes in the last batch of jobs I sent. I'll check whether there were issues on my end or it's just that the systems decided to blow up.
©2025 Universitat Pompeu Fabra