Posts by Quico

1) Message boards : News : In-silico Binding Assay (ISBA/ACEMD3) (Message 61477)
Posted 2 May 2024 by Quico
Post:
This is a message Adrià posted in our Discord channel. Since he doesn't have an account on the GPUGRID forum, I'm opening the thread for him.
If you want to stay up to date on news related to this project and others, please join our Discord; we are usually more active there:

https://discord.gg/dCMkcafPpX

Hello GPUGRID!

Adrià here. I'll be bringing the ACEMD3 application back and sending new jobs of standard MD simulations.
(We've been testing it these past weeks to make sure it works well on both Windows and Linux.)

The main goal of this new batch of simulations will be to further validate our capacity to predict the binding mode of ligands using simulations and adaptive sampling methods. Those of you who have been around here for some time might already be familiar with these simulations, such as the benzamidine-trypsin system (https://www.pnas.org/doi/abs/10.1073/pnas.1103547108) or the dopamine D3 receptor with an antagonist ligand (https://www.nature.com/articles/s41598-018-19345-7#Ack1), which we were able to simulate thanks to GPUGRID and all your effort!

Now we are revisiting this method, which we call the in-silico binding assay (ISBA). During drug discovery campaigns, it's common to know of ligands that bind to your target without knowing their binding mode: the exact conformation and structure that the ligand and the protein adopt when bound. Knowing the binding mode is critical for further development of the molecule into a potent and usable drug.

The most precise way of discovering the binding mode is crystallization. However, that can take too much time or be outright impossible, depending on the protein. Therefore, we want to optimize and refine ISBA for binding mode prediction so it can be used during drug discovery projects. To summarize our objectives a bit: we want to predict binding modes for molecules larger than benzamidine, with the same precision, but with less simulation time than was needed for the D3 receptor system.

To do so, we'll be using the latest version of the adaptive sampling method we developed, AdaptiveBandit (https://pubs.acs.org/doi/abs/10.1021/acs.jctc.0c00205). The objective of the new simulations I'll be sending is to benchmark AdaptiveBandit in an ISBA scenario, improve the algorithm if required, and fine-tune its hyperparameters.
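To give a rough idea of what adaptive sampling with a bandit algorithm means in practice, here is a minimal, generic UCB1-style sketch in Python. It is not the AdaptiveBandit code (see the paper above for the actual method); the state discretization, the reward definition and the function names are illustrative assumptions only.

import math
import random

# Generic UCB1-style adaptive sampling loop (illustrative only; NOT the
# actual AdaptiveBandit implementation). Each "arm" is a candidate
# conformational state from which new short simulations can be respawned.
class UCB1Sampler:
    def __init__(self, n_states, exploration=1.0):
        self.counts = [0] * n_states    # times each state was respawned from
        self.values = [0.0] * n_states  # running mean reward per state
        self.c = exploration

    def select(self):
        # Respawn at least once from every state before applying UCB1.
        for state, n in enumerate(self.counts):
            if n == 0:
                return state
        total = sum(self.counts)
        # UCB1: mean reward plus an exploration bonus for rarely visited states.
        ucb = [self.values[s] + self.c * math.sqrt(2 * math.log(total) / self.counts[s])
               for s in range(len(self.counts))]
        return max(range(len(self.counts)), key=ucb.__getitem__)

    def update(self, state, reward):
        self.counts[state] += 1
        self.values[state] += (reward - self.values[state]) / self.counts[state]

def run_short_md(state):
    # Placeholder for launching a short MD trajectory from this state and
    # scoring how much new binding-relevant space it explored.
    return random.random()

sampler = UCB1Sampler(n_states=10)
for epoch in range(50):
    s = sampler.select()
    sampler.update(s, run_short_md(s))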

Let me know if there's any issue with the simulations. I'll be sending 100 ns trajectories for the most part, divided into two steps.
2) Message boards : News : ATM (Message 60652)
Posted 16 Aug 2023 by Quico
Post:
It does not make any difference to Quico. The task will be completed on one or another computer and his science is done. It is our very expensive energy that is wasted, but as Quico himself said, the science gets done, so who cares about wasted energy?

I'm pretty sure I never said that about wasted energy. What I might have mentioned is that completed jobs come back to me, and since I don't check what happens to every WU manually, these crashes might go under my radar.

As Ian&Steve C. said, this app is in "beta"/not-ideal conditions. Sadly I don't have the knowledge to fix it, otherwise I would. Errors on my end can be that I forgot to upload some files (has happened) or that I sent jobs without equilibrated systems (has also happened). By trial and error I ended up with a workflow that should avoid these issues 99% of the time. Any other kind of error I can pass on to the devs, but I can't promise much more than that.
I'm here testing the science.
3) Message boards : News : ATM (Message 60651)
Posted 16 Aug 2023 by Quico
Post:
Ok, back from holidays.
I've seen that in the last batch of jobs many Energy NaN errors were happening, which was completely unexpected.

I am testing some different settings internally to see whether they overcome this issue. If that is successful, new jobs should be sent by tomorrow/Friday.

This might be more time-consuming, and I would not like to split them into even more chunks (I have a suspicion that this gives wonky results at some point), but if people find that they take too much time/space, please let me know.
4) Message boards : News : ATM (Message 60612)
Posted 20 Jul 2023 by Quico
Post:
Sorry for being away for a while. We were testing ATM in a setup not available on GPUGRID. But we're back to crunching :)

I've seen that more or less everything is running fine. Apart from some crashes that can happen, everything seems to come back to me fine.

Is there anything specific I should look into? I already know about the progress reporting issue (if it persists), but there's not much more I can do on my end. If they plan to update the GPUGRID app at some point, I'll insist.


I have a question regarding the minimum hardware requirements (i.e. amount, speed, and type of RAM; CPU speed and type; motherboard speed and requirements; etc.) for a computer to be able to complete these units successfully, on either Windows or Linux.

One of my computers has been running these units successfully; the other has not. They both have the same OS, but different hardware. I just want to know the limits.


I'm not sure I'm the most adequate person to answer this question, but I'll try my best. AFAIK it should run anywhere; maybe the issue is more driver-related? Since I saw some comments about them in the thread: we recently tested on 40-series GPUs locally and they ran fine.
5) Message boards : News : ATM (Message 60610)
Posted 20 Jul 2023 by Quico
Post:
Sorry for being away for a while. We were testing ATM in a setup not available on GPUGRID. But we're back to crunching :)

I've seen that more or less everything is running fine. Apart from some crashes that can happen, everything seems to come back to me fine.

Is there anything specific I should look into? I already know about the progress reporting issue (if it persists), but there's not much more I can do on my end. If they plan to update the GPUGRID app at some point, I'll insist.
6) Message boards : News : ATM (Message 60447)
Posted 18 May 2023 by Quico
Post:
I had 14 errors in a row last night, between about 18:30 and 19:30 UTC. All failed with a variant of

Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /hdd/boinc-client/slots/5/tmp/pip-req-build-_q32nezm
fatal: unable to access 'https://github.com/raimis/AToM-OpenMM.git/': Recv failure: Connection reset by peer

Is that something you can control?


It seems that this is a GitHub problem. It has been a bit unstable over the past few days.
7) Message boards : News : ATM (Message 60446)
Posted 18 May 2023 by Quico
Post:
Let me see if I can find the other few that crashed and cancel those WUs.

FileNotFoundError:
https://www.gpugrid.net/result.php?resultid=33507881
Energy is NaN:
https://www.gpugrid.net/result.php?resultid=33507902
https://www.gpugrid.net/result.php?resultid=33509252
ImportError:
https://www.gpugrid.net/result.php?resultid=33504227
https://www.gpugrid.net/result.php?resultid=33503240


Thanks for this. I will take a close look at these systems to see what could be the reason for the error.
8) Message boards : News : ATM (Message 60440)
Posted 16 May 2023 by Quico
Post:
The difference between 0-5 and n-5 has been consistent throughout - there hasn't been a "fix and revert". Just new data runs starting from 0 again.


So I'm usually trying to hit 350 samples, which equates to a bit more than 60 ns of sampling time.

At the beginning I was sending the full run to a single volunteer, but there were the size issues, and some people said that the samples were too long. I reduced the frame-saving frequency and started to divide these runs manually, but that was too time-consuming and very hard to track. It was also causing issues with the progress bars.

That's why what we use now was implemented later on. Like in ACEMD, we can now chain these runs: instead of sending the subsequent steps manually, it is done automatically. This lets me divide the runs into smaller chunks, making them smaller in size and faster to run.
In theory this should also have fixed the issue with the progress bars, since the cntl file asks for +70 samples. But I guess the first step of a run shows a proper progress bar while the following ones are stuck at 100% from the beginning, because the control file reads +70 and the log file starts at sample 71.
I'll pester the devs again to see if they can get a fix for it soon.
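To illustrate why a continuation step could sit at 100% from the start: if progress were derived from the absolute sample number in the log against the 70 samples requested in the control file, the second chunk (which resumes at sample 71) would already be past its target. This is only a hypothetical sketch of that failure mode, not the wrapper's actual code.

def naive_progress(current_sample, samples_per_chunk=70):
    # Hypothetical buggy calculation: absolute sample index vs. chunk size.
    return min(current_sample / samples_per_chunk, 1.0)

def chunk_progress(current_sample, chunk_start, samples_per_chunk=70):
    # What the chained runs would need: progress relative to the chunk's own start.
    return min((current_sample - chunk_start) / samples_per_chunk, 1.0)

print(naive_progress(71))                  # 1.0 -> the second chunk looks finished immediately
print(chunk_progress(71, chunk_start=70))  # ~0.014 -> correct per-chunk progress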


About the recent errors: some of them are on my end, I messed up a few times. We changed the preparation protocol and some running conditions for GPUGRID (as explained before), and sometimes a necessary tiny script was still left to be run... I've taken measures to avoid this as much as possible. I hope we do not have an issue like before.
Regarding the very large BACE files... maybe I forgot to cancel some WUs? It was the first time I was doing this, and the search bar is very wonky.
9) Message boards : News : ATM (Message 60439)
Posted 16 May 2023 by Quico
Post:
"ValueError: Energy is NaN" is back quite often :-(



Do these Energy is NaN errors come back really quickly? Are they runs with similar names? Upon checking results I have seen that some runs have indeed crashed, but not very often.
10) Message boards : News : ATM (Message 60438)
Posted 16 May 2023 by Quico
Post:
FileNotFoundError: [Errno 2] No such file or directory: 'TYK2_m42_m54_0.xml'

https://www.gpugrid.net/result.php?resultid=33509037

:-(


Crap, I forgot to clean out the ones that didn't equilibrate successfully here locally. Let me see if I can find the other few that crashed and cancel those WUs.
11) Message boards : News : ATM (Message 60408)
Posted 10 May 2023 by Quico
Post:
OK, confirmed - it is still the Apache problem.

Tue 09 May 2023 15:06:15 BST | GPUGRID | [http] [ID#15383] Received header from server: HTTP/1.1 413 Request Entity Too Large
Tue 09 May 2023 15:06:15 BST | GPUGRID | [http] [ID#15383] Received header from server: Date: Tue, 09 May 2023 14:06:15 GMT
Tue 09 May 2023 15:06:15 BST | GPUGRID | [http] [ID#15383] Received header from server: Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips mod_auth_gssapi/1.5.1 mod_auth_kerb/5.4 mod_fcgid/2.3.9 PHP/5.4.16 mod_wsgi/3.4 Python/2.7.5

File (the larger of two) is 754.1 MB (Linux decimal), 719.15 MB (Boinc binary).

At this end, we have two choices:

1) Abort the data transfer, as Ian suggests.
2) Wait 90 days for somebody to find the key to the server closet.

Quico?


That's weird. I'll take a look.
But this shouldn't happen, so cancel the BACE (uppercase) runs. I'll have a look at how to do it from here.
With the last implementation there shouldn't be file-size issues like this.

All bad/buggy jobs should be cancelled by now.


this is not something new from this project. this has been a recurring issue from time to time. seems to pop up about every year or so whenever the result files get so large for one reason or another. so don't feel bad if you are unable to find the setting to fix the file size limit. no one else from the project has been able to for the last several years.

why are the result files so large? 500+MB. that's the root cause of the issue. do you need the data in these files? if not, why are they being created?


The heavy files are the .dcd trajectories, which technically I don't need in order to perform the final free-energy calculation, but they are necessary in case something weird is happening and we want to revisit those frames. A .dcd file contains the coordinates of all the system atoms, but uncompressed. Since there are other trajectory formats, such as .xtc, that compress this data and result in much smaller file sizes, we asked for that format to be implemented in OpenMM. As far as I know this has been implemented in our lab, but it needs the final approval of the "higher-ups" to get running, and then ATM has to be modified to process .xtc trajectory files.

Nevertheless, this shouldn't have happened (it ran OK in other BACE instances), and I apologise for it.
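As a rough illustration of the size difference, a trajectory can be rewritten from DCD to the compressed XTC format with a tool such as MDTraj. This is just a generic sketch with made-up file names, not the pipeline that will eventually go through OpenMM/ATM.

import os
import mdtraj as md

# Load the uncompressed DCD trajectory (a topology file is required) and
# rewrite it as XTC, which stores compressed, reduced-precision coordinates.
traj = md.load_dcd("trajectory.dcd", top="system.pdb")
traj.save_xtc("trajectory.xtc")

print("DCD size (MB):", os.path.getsize("trajectory.dcd") / 1e6)
print("XTC size (MB):", os.path.getsize("trajectory.xtc") / 1e6)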
12) Message boards : News : ATM (Message 60404)
Posted 10 May 2023 by Quico
Post:
OK, confirmed - it is still the Apache problem.

Tue 09 May 2023 15:06:15 BST | GPUGRID | [http] [ID#15383] Received header from server: HTTP/1.1 413 Request Entity Too Large
Tue 09 May 2023 15:06:15 BST | GPUGRID | [http] [ID#15383] Received header from server: Date: Tue, 09 May 2023 14:06:15 GMT
Tue 09 May 2023 15:06:15 BST | GPUGRID | [http] [ID#15383] Received header from server: Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips mod_auth_gssapi/1.5.1 mod_auth_kerb/5.4 mod_fcgid/2.3.9 PHP/5.4.16 mod_wsgi/3.4 Python/2.7.5

File (the larger of two) is 754.1 MB (Linux decimal), 719.15 MB (Boinc binary).

At this end, we have two choices:

1) Abort the data transfer, as Ian suggests.
2) Wait 90 days for somebody to find the key to the server closet.

Quico?


That's weird. I'll take a look.
But this shouldn't happen, so cancel the BACE (uppercase) runs. I'll have a look at how to do it from here.
With the last implementation there shouldn't be file-size issues like this.

All bad/buggy jobs should be cancelled by now.
13) Message boards : News : ATM (Message 60376)
Posted 8 May 2023 by Quico
Post:
And a similar batch configuration error with today's BACE run, like

BACE_m24_m7e_5-QUICO_ATM_Sage_xTB-0-5-RND7993_0

08:05:32 (386384): wrapper: running bin/bash (run.sh)
bin/bash: run.sh: No such file or directory

(five so far)

Edit - now wasted 20 of the things, and switched to Python to avoid quota errors. I should have dropped in to give you a hand when passing through Barcelona at the weekend!


Yes, a big mess-up on my end. More painful since it happened to two of the sets with the most runs. I just forgot to run the script that copies the run.sh and run.bat files to the batch folders. It happened to 2/8 batches, but yeah, big whoop. Apologies for that. The "fixed" runs should be sent soon. The "missing *0.xml" errors should not happen anymore either.

Regarding checkpointing, I at least cannot do much more than pass the message along, which I have done several times.

Again, sorry for this. I understand it is very annoying.
14) Message boards : News : ATM (Message 60237)
Posted 30 Mar 2023 by Quico
Post:
Yeah I'm sorry about that. I'm trying to learn as I go.

I'll be sending (and have already sent) some runs through the ATMbeta app. We tested the multiple_steps code and it seems to work fine. That way, if everything runs smoothly, everything should come in 70-sample runs (~13 ns), which should be much shorter for everyone and avoid the drag of the 24h+ runs.
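As a quick consistency check against the "350 samples, a bit more than 60 ns" figure quoted elsewhere in this thread, assuming a fixed amount of simulated time per sample (these numbers are inferred from the posts, not official):

ns_per_sample = 13 / 70      # ~0.186 ns per sample, from "70 samples (~13 ns)"
print(350 * ns_per_sample)   # ~65 ns, consistent with "a bit more than 60 ns"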
15) Message boards : News : ATM (Message 60233)
Posted 30 Mar 2023 by Quico
Post:
Thanks. Now I know what I'm looking for (and when), I was able to watch the next transition.

Task PTP1B_new_20669_2qbr_23472_T1_2A-QUICO_TEST_ATM-0-1-RND5753_3 started with a couple of 0.1% initial steps (as usual), but then jumped to 50.983%. It then moved on by 0.441% every five minutes or so.

The run.log shows the same figures as before: a pre-existing run of 114 samples, then the real work starts with sample 115, and should proceed to a max_sample of 341. The progress jumps match the completion of samples 115 - 120.

The %age intervals match the formula in Emilio Gallicchio's post 60160 (115/(341-114)), but I can't see where the initial big value of 50.983 comes from.

Also, I don't follow the logic of the resend explanation. Mine is replication _3, so there have been 3 previous attempts - but none of them got beyond the program setup stages: all failed in less than 100 seconds. So who did the first 114 samples?


The first 114 samples should have been calculated by: T_PTP1B_new_20669_2qbr_23472_1A_3-QUICO_TEST_ATM-0-1-RND2542_0.tar.bz2
I've been doing all the division and resends manually, and we've been simplifying the naming convention for my sake. Now we are testing a multiple_steps protocol just like in ACEMD, which should help ease things and, I hope, mess less with the progress reporter.
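One possible reading of the numbers quoted above, purely as an illustration (this is not confirmed to be the wrapper's actual formula): if progress were computed as the absolute sample index divided by (max_samples - previous_samples), a continuation starting at sample 115 would jump straight to roughly 51%, advance in ~0.44% steps, and pass 100% before reaching sample 341, which would also match the over-100% reports mentioned elsewhere in the thread.

previous_samples = 114   # samples completed by the earlier step of this run
max_samples = 341        # MAX_SAMPLES requested for the continuation

def suspected_progress(sample):
    # Hypothetical formula: absolute sample index over the continuation's span.
    return sample / (max_samples - previous_samples)

print(f"{suspected_progress(115):.3%}")                            # ~50.661%: the big initial jump
print(f"{suspected_progress(116) - suspected_progress(115):.3%}")  # ~0.441% per sample
print(f"{suspected_progress(341):.1%}")                            # ~150%: would overshoot 100%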
16) Message boards : News : ATM (Message 60225)
Posted 29 Mar 2023 by Quico
Post:
OK, it's the same story as yesterday. This task:

PTP1B_23486_23479_4_2A-QUICO_TEST_ATM-0-1-RND5081_2

downloaded at 15:26:54 UTC yesterday, and started running at about 16:30 UTC.

As before, the run.log shows a MAX_SAMPLES: 114, with timings that don't match my machine. The 16:30 run has MAX_SAMPLES: 341, and starts running with sample 115.

The machine downloaded a new task at 3:50:47 UTC: that normally happens around 85 - 90% progress, with an hour to run - but the existing one is still only at sample 308, so maybe three hours to go. And it's another PTP1B_new_ resend, so we may have to go round the cycle again.


I believe it's what I imagined. With the manual division I was doing before, I was splitting some runs into 2/3 steps: 114 - 228 - 341 samples. If the job ID has a 2A/3A, it's most probably starting from a previous checkpoint, and the progress report is going crazy because of it. I'll pass this on to Raimondas to see if he can take a look at it.

Our first priority is to have these job divisions done automatically, like ACEMD does; that way we can avoid these really long jobs for everyone. Doing this manually makes it really hard to track all the jobs and the resends. So I hope that in the next few days everything goes more smoothly.
17) Message boards : News : ATM (Message 60204)
Posted 27 Mar 2023 by Quico
Post:
I've read some people mentioning that the reporter doesn't work or that it goes over 100%. Does it work correctly for someone?

It varies from task to task - or, I suspect, from batch to batch. I mentioned a specific problem with a JNK1 task - task 33380692 - but it's not a general problem.

I suspect that it may have been a specific problem with setting the data that drives the progress %age calculation - the wrong expected 'total number of samples' may have been used.


This one is a rerun, meaning that 2/3 of the run was previously simulated.
Maybe it was expecting to start from 0 samples, and once it saw that we're at 228 from the beginning, it got confused.

I'll pass that on.

PS: But have other runs been reporting correctly?
18) Message boards : News : ATM (Message 60202)
Posted 27 Mar 2023 by Quico
Post:
I've read some people mentioning that the reporter doesn't work or that it goes over 100%. Does it work correctly for someone?
19) Message boards : News : ATM (Message 60201)
Posted 27 Mar 2023 by Quico
Post:
I've seen that you are unhappy with the last batch of runs, given that they take too much time. I've been experimenting with dividing the runs into different steps to find a sweet spot where you're happy and it's not madness for me to organize all these runs and re-runs. I'll backtrack to the setting we had before. Apologies for that.


I can't answer immediately on the termination question, but it's all open-source and I can look through it. In this case, it's more complicated, because BOINC will talk to the wrapper, and the wrapper will talk to the science app.

But the basic idea is that BOINC will send a request to terminate over the API, and wait for the application to close itself down as it sees fit. Actual signals will only be used to force termination in the case of an unconditional quit, such as an operating system closedown.


Right, probably the wrapper should send a termination signal to AToM.

We have of course access to AToM's sources https://github.com/Gallicchio-Lab/AToM-OpenMM and we can make sure that it checkpoints appropriately when it receives the signal.

However, I do not have access to the wrapper. Quico: please advise.


I'll ask Raimondas about this and the other things that have been mentioned, since he's the one taking care of this issue.
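For what it's worth, the kind of graceful shutdown being discussed usually looks like the sketch below: the science process registers a handler for the termination signal, finishes the current sample, writes a checkpoint and exits cleanly. This is a generic illustration, not AToM's or the wrapper's actual code; write_checkpoint() and the sample loop are placeholders.

import signal
import sys

stop_requested = False

def handle_term(signum, frame):
    # The wrapper/BOINC asked us to stop; finish the current sample first
    # so the checkpoint stays consistent.
    global stop_requested
    stop_requested = True

signal.signal(signal.SIGTERM, handle_term)

def write_checkpoint(sample):
    # Placeholder: persist whatever state is needed to resume from this sample.
    pass

for sample in range(1, 71):
    # run_one_sample(sample)  # placeholder for the actual MD work per sample
    if stop_requested:
        write_checkpoint(sample)
        sys.exit(0)  # exit cleanly so the task can resume from the checkpoint later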
20) Message boards : News : ATM (Message 60155)
Posted 24 Mar 2023 by Quico
Post:
progress reporting is still not working.

instead of halting progress at 75%, it now halts at 0.19%. the weights help prevent the task from jumping to 75%, but there is still something missing.

Python tasks are able to jump to about 1% after the extraction phase due to the weights, and then slowly creep up over time as the task progresses. 2%, 3%, 4%, etc. until they hit 100% in a natural and linear way. The ATM tasks do not do this at all. they sit at 0.19% for hours and hours with no indication of when they will complete. is it 4hrs? is it 20hrs? there's no feedback to the user. when it's done it just jumps to 100% without warning.

makes it very difficult to tell if a task is stuck or working.

-Edit-

The "BACE" tasks do seem to be reporting progress now. but the earlier tasks from yesterday ("T_p38") do not.


The T_p38 tasks were sent before the update, so I guess it makes sense that they don't show progress reporting yet. Is the progress reporting for the BACE runs good, or does it get stuck?

