1)
Message boards :
News :
In-silico Binding Assay (ISBA/ACEMD3)
(Message 61477)
Posted 2 May 2024 by Quico Post: This is a message Adrià posted in our Discord channel. Since he doesn't have an account on the GPUGRID forum, I'm opening this thread for him. If you want to stay up to date with news related to this project and others, please join our Discord; we are usually more active there: https://discord.gg/dCMkcafPpX Hello GPUGRID!
2)
Message boards :
News :
ATM
(Message 60652)
Posted 16 Aug 2023 by Quico Post: It does not make any difference to Quico. The task will be completed on one or another computer and his science is done. It is our very expensive energy that is wasted but as Quico himself said, the science gets done who cares about wasted energy? I'm pretty sure I never said that about wasted energy. What I might have mentioned is that completed jobs come back to me, and since I don't check what happens to every WU manually, these crashes might go under my radar. As Ian&Steve C. said, this app runs in "beta"/not-ideal conditions. Sadly I don't have the knowledge to fix it, otherwise I would. Errors on my end can be that I forgot to upload some files (happened) or that I sent jobs without equilibrated systems (also happened). By trial and error I ended up with a workflow that should avoid these issues 99% of the time. Any other kind of error I can pass on to the devs, but I can't promise much more than that. I'm here testing the science.
3)
Message boards :
News :
ATM
(Message 60651)
Posted 16 Aug 2023 by Quico Post: OK, back from holidays. I've seen that in the last batch of jobs many "Energy is NaN" errors were happening, which was completely unexpected. I am testing some different settings internally to see if they overcome this issue. If that is successful, new jobs will be sent by tomorrow/Friday. This might be more time-consuming, and I would not like to split them into even more chunks (I suspect that gives wonky results at some point), but if people see that they take too much time/space, please let me know.
4)
Message boards :
News :
ATM
(Message 60612)
Posted 20 Jul 2023 by Quico Post: Sorry for being absent for a while. We were testing ATM in a setup not available on GPUGRID. But we're back to crunching :) I'm not sure I'm the best person to answer this question, but I'll try my best. AFAIK it should run anywhere; maybe the issue is more driver-related? We recently tested on 40-series GPUs locally and it ran fine, since I saw some comments about that in the thread.
5)
Message boards :
News :
ATM
(Message 60610)
Posted 20 Jul 2023 by Quico Post: Sorry for being absent for a while. We were testing ATM in a setup not available on GPUGRID. But we're back to crunching :) I've seen that more or less everything is running fine. Apart from some crashes that can happen, everything seems to come back to me fine. Is there anything specific I should look into? I already know about the progress reporting issue (if it persists), but there's not much more I can do on my end. If they plan to update the GPUGRID app at some point, I'll insist.
6)
Message boards :
News :
ATM
(Message 60447)
Posted 18 May 2023 by Quico Post: I had 14 errors in a row last night, between about 18:30 and 19:30 UTC. All failed with a variant of It seems that this is a GitHub problem. It has been a bit unstable over the past few days.
7)
Message boards :
News :
ATM
(Message 60446)
Posted 18 May 2023 by Quico Post: Let me see if I can find the other few that crashed and cancel those WUs. Thanks for this, I will take a close look at these systems to see what could be the cause of the error.
8)
Message boards :
News :
ATM
(Message 60440)
Posted 16 May 2023 by Quico Post: The difference between 0-5 and n-5 has been consistent throughout - there hasn't been a "fix and revert". Just new data runs starting from 0 again. So I'm usually trying to hit 350 samples, which equates to a bit more than 60 ns of sampling time. At the beginning I was sending the full run to a single volunteer, but there were the size issues and some people said the samples were too long. I reduced the frame-saving frequency and started to divide these runs manually, but this was too time-consuming and very hard to track. That was also causing issues with the progress bars. That's why we later implemented what we use now. Like in AceMD, we can now chain these runs: instead of sending the further steps manually, it is done automatically. This helped me divide the runs into smaller chunks, making them smaller in size and faster to run. In theory this should also have fixed the issue with the progress bars, since the cntl file asks for +70 samples. But I guess the first step of a run shows a proper progress bar while the following ones are stuck at 100% from the beginning, since the control file reads +70 and the log file starts at 71. I'll pester the devs again to see if they can have a fix soon.

About the recent errors: some of them are on my end, I messed up a few times. We changed the preparation protocol and some running conditions for GPUGRID (as explained before), and sometimes a necessary tiny script was left unrun... I've taken the necessary measures to avoid this as much as possible. I hope we do not have an issue like before. Regarding the BACE files with very big sizes... maybe I forgot to cancel some WUs? It was the first time I was doing this, and the search bar works very wonkily.
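The stuck-at-100% symptom described above can be reproduced with a tiny sketch. This is a hypothetical reconstruction (the app's actual progress code is not shown anywhere in the thread): if a chained step's control file requests 70 more samples but the log continues the global sample count from 71, a naive samples/requested ratio is pinned at 100% from the first report, while an offset-aware version behaves correctly.

```python
def naive_progress(sample_in_log: int, samples_requested: int) -> float:
    # Buggy: ignores that a chained step's log continues the global count,
    # so a step whose log starts at sample 71 with only 70 requested
    # reports >= 100% immediately.
    return min(sample_in_log / samples_requested, 1.0)

def offset_progress(sample_in_log: int, samples_requested: int,
                    start_sample: int) -> float:
    # Fixed: measure progress relative to where this chunk started.
    return min((sample_in_log - start_sample) / samples_requested, 1.0)

# Second chunk of a chained run: control file says "+70", log resumes at 71.
print(naive_progress(71, 70))        # pinned at 1.0 from the very first sample
print(offset_progress(71, 70, 71))   # 0.0: this chunk has just begun
print(offset_progress(106, 70, 71))  # 0.5: halfway through this chunk
```

The fix is simply threading the chunk's starting sample into the progress calculation, which matches the observation that only the first chunk (starting from 0) reported correctly.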
9)
Message boards :
News :
ATM
(Message 60439)
Posted 16 May 2023 by Quico Post: "ValueError: Energy is NaN" is back quite often :-( Do these "Energy is NaN" errors come back really quickly? On runs with similar names? Upon checking results I have seen that some runs have indeed crashed, but not very often.
10)
Message boards :
News :
ATM
(Message 60438)
Posted 16 May 2023 by Quico Post: FileNotFoundError: [Errno 2] No such file or directory: 'TYK2_m42_m54_0.xml' Crap, I forgot to clean out the ones that didn't equilibrate successfully here locally. Let me see if I can find the other few that crashed and cancel those WUs.
11)
Message boards :
News :
ATM
(Message 60408)
Posted 10 May 2023 by Quico Post: OK, confirmed - it is still the Apache problem. The heavy files are the .dcd, which technically I don't really need in order to perform the final free energy calculation, but they are necessary in case something weird happens and we want to revisit those frames. .dcd files contain the coordinates of all the system's atoms, uncompressed. Since there are other trajectory formats, such as .xtc, that compress this data, resulting in much smaller file sizes, we asked for the format to be implemented in OpenMM. As far as I know this has been implemented in our lab, but it needs the final approval of the "higher-ups" to get it running, and then ATM must be modified to process .xtc trajectory files. Nevertheless, this shouldn't have happened (it ran OK in other instances with BACE), and I apologise for this.
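The size difference between the two formats can be illustrated with a toy sketch. This only shows the underlying idea, not the real codecs: actual XTC uses a more elaborate XDR-based integer packing, and actual DCD has headers and fixed-width records. The essence is that DCD stores raw single-precision floats, while XTC quantizes coordinates to 0.001 nm fixed-point integers, which then compress well.

```python
import struct
import zlib

# Toy trajectory frame: 1,000 atoms, 3 coordinates each (in nm).
coords = [(i * 0.00123, i * 0.00456, i * 0.00789) for i in range(1000)]
flat = [c for xyz in coords for c in xyz]

# DCD-like storage: raw 32-bit floats, 4 bytes per value, no compression.
dcd_like = struct.pack(f"{len(flat)}f", *flat)

# XTC-like idea: quantize to 0.001 nm fixed-point integers, then compress.
# (Real XTC does its own bit-packing instead of zlib; the principle is the same.)
ints = [round(c * 1000) for c in flat]
xtc_like = zlib.compress(struct.pack(f"{len(ints)}i", *ints), 9)

print(f"dcd-like: {len(dcd_like)} bytes, xtc-like: {len(xtc_like)} bytes")
```

The quantization step is lossy (coordinates are rounded to the nearest 0.001 nm), which is usually acceptable for visual inspection of frames, the use case described above, while cutting upload sizes substantially.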
12)
Message boards :
News :
ATM
(Message 60404)
Posted 10 May 2023 by Quico Post: OK, confirmed - it is still the Apache problem. That's weird, I'll take a look. This shouldn't happen, though, so cancel the BACE (uppercase) runs; I'll look into how to do that from here. With the last implementation there shouldn't be such file-size issues. All bad/buggy jobs should be cancelled by now.
13)
Message boards :
News :
ATM
(Message 60376)
Posted 8 May 2023 by Quico Post: And a similar batch configuration error with today's BACE run, like Yes, big mess-up on my end. More painful since it happened to two of the sets with the most runs. I just forgot to run the script that copies the run.sh and run.bat files to the batch folders. It happened to 2/8 batches, but yeah, big whoop. Apologies for that. The "fixed" runs should be sent soon. The "missing *0.xml" errors should not happen anymore either. Regarding checkpointing, I at least cannot do much more than pass the message along, which I have done several times. Again, sorry for this; I can understand it being very annoying.
14)
Message boards :
News :
ATM
(Message 60237)
Posted 30 Mar 2023 by Quico Post: Yeah, I'm sorry about that. I'm trying to learn as I go. I'll be sending (and have already sent) some runs through the ATMbeta app. We tested the multiple_steps code and it seems to work fine. That way, if everything runs smoothly, everything should get 70-sample runs (~13 ns), which should be much shorter for everyone and avoid the drag of the 24h+ runs.
15)
Message boards :
News :
ATM
(Message 60233)
Posted 30 Mar 2023 by Quico Post: Thanks. Now I know what I'm looking for (and when), I was able to watch the next transition. The first 114 samples should be calculated by: T_PTP1B_new_20669_2qbr_23472_1A_3-QUICO_TEST_ATM-0-1-RND2542_0.tar.bz2 I've been doing all the divisions and resends manually, and we've been simplifying the naming convention for my sake. Now we are testing a multiple_steps protocol, just like in AceMD, which should help ease things and, I hope, interfere less with the progress reporter.
16)
Message boards :
News :
ATM
(Message 60225)
Posted 29 Mar 2023 by Quico Post: OK, it's the same story as yesterday. This task: I believe it's what I imagined. With the manual division I was doing before, I was splitting some runs into 2/3 steps: 114 - 228 - 341 samples. If the job ID has a 2A/3A, it's most probably starting from a previous checkpoint, and the progress report goes crazy with it. I'll pass this on to Raimondas to see if he can take a look at it. Our first priority is to make these job divisions happen automatically, like ACEMD does; that way we can avoid these really long jobs for everyone. Doing this manually makes it really hard to track all the jobs and resends. So I hope that in the next few days everything goes smoother.
17)
Message boards :
News :
ATM
(Message 60204)
Posted 27 Mar 2023 by Quico Post: I've read some people mentioning that the reporter doesn't work or that it goes over 100%. Does it work correctly for anyone? This one is a rerun, meaning that 2/3 of the run was previously simulated. Maybe it was expecting to start from 0 samples, and once it saw that we were at 228 from the beginning, it got confused. I'll pass that along. PS: But have other runs been reporting correctly?
18)
Message boards :
News :
ATM
(Message 60202)
Posted 27 Mar 2023 by Quico Post: I've read some people mentioning that the reporter doesn't work or that it goes over 100%. Does it work correctly for anyone?
19)
Message boards :
News :
ATM
(Message 60201)
Posted 27 Mar 2023 by Quico Post: I've seen that you are unhappy with the last batch of runs, seeing that they take too much time. I've been experimenting with dividing the runs into different steps to find a sweet spot that you're happy with and that isn't madness for me when organizing all these runs and re-runs. I'll backtrack to the previous settings we had before. Apologies for that.
I'll ask Raimondas about this and the other things that have been mentioned, since he's the one taking care of this issue.
20)
Message boards :
News :
ATM
(Message 60155)
Posted 24 Mar 2023 by Quico Post: progress reporting is still not working. T_p38 were sent before the update, so I guess it makes sense that they don't show progress yet. Is the progress report for the BACE runs good? Or is it staying stuck?
©2026 Universitat Pompeu Fabra