ATM

Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 326,008
Message 60046 - Posted: 9 Mar 2023, 10:43:01 UTC - in response to Message 60045.  

At least in the runs I sent recently we are expecting 341 samples.

Thanks, that's helpful. I've reached sample 266, so I'll be able to predict when it's likely to finish.

But I think you need to reconsider some design decisions. The current task properties (from BOINC Manager) are:



This task will take over 24 hours to run on my GTX 1660 Ti - that's long, even by GPUGrid standards.

BOINC doesn't think it's checkpointed since the beginning, even though checkpoints are listed at the end of each sample in the job.log

BOINC Manager shows that the fraction done is 75.000% - and has displayed that figure, unchanging, since a few minutes into the run.

I'm not seeing any sign of an output file (or I haven't found it yet!), although it's specified in the <result> XML:

    <file_ref>
        <file_name>T_QUICO_Tyk2_new_2_ejm_47_ejm_55_4-QUICO_TEST_ATM-0-1-RND8906_2_0</file_name>
        <open_name>output.tar.bz2</open_name>
        <copy_file/>
    </file_ref>

More when it finishes.

Quico
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist

Joined: 28 Feb 23
Posts: 35
Credit: 0
RAC: 0
Message 60047 - Posted: 9 Mar 2023, 11:40:25 UTC - in response to Message 60046.  
Last modified: 9 Mar 2023, 11:56:09 UTC

At least in the runs I sent recently we are expecting 341 samples.

Thanks, that's helpful. I've reached sample 266, so I'll be able to predict when it's likely to finish.

But I think you need to reconsider some design decisions. The current task properties (from BOINC Manager) are:



This task will take over 24 hours to run on my GTX 1660 Ti - that's long, even by GPUGrid standards.



That's good to know, thanks. Next time I'll prepare them so they run for a shorter time and finish over the following submissions. Is there an approximate run time you'd suggest per task?


I'm not seeing any sign of an output file (or I haven't found it yet!), although it's specified in the XML:

    <file_ref>
        <file_name>T_QUICO_Tyk2_new_2_ejm_47_ejm_55_4-QUICO_TEST_ATM-0-1-RND8906_2_0</file_name>
        <open_name>output.tar.bz2</open_name>
        <copy_file/>
    </file_ref>

More when it finishes.


Can you see a cntxt_0 folder or several r0-r21 folders? These should be some of the outputs that the run generates, and also the ones I'm getting from the successful runs.

Richard Haselgrove
Message 60048 - Posted: 9 Mar 2023, 11:57:55 UTC - in response to Message 60047.  

Can you see a cntxt_0 folder or several r0-r21 folders? These should be some of the outputs that the run generates, and also the ones I'm getting from the successful runs.

Yes, I have all of those, and they're filling up nicely. I want to catch the final upload archive, and check it for size.

Quico
Message 60049 - Posted: 9 Mar 2023, 14:37:12 UTC - in response to Message 60048.  

Can you see a cntxt_0 folder or several r0-r21 folders? These should be some of the outputs that the run generates, and also the ones I'm getting from the successful runs.

Yes, I have all of those, and they're filling up nicely. I want to catch the final upload archive, and check it for size.


Ah, I see. From what I've seen, the final upload archive has been around 500 MB for these runs. Taking into account what was mentioned about file sizes at the beginning of the thread, I'll tweak some parameters to avoid such heavy files.

Ian&Steve C.

Joined: 21 Feb 20
Posts: 1114
Credit: 40,838,348,595
RAC: 4,765,598
Message 60050 - Posted: 9 Mar 2023, 14:50:08 UTC - in response to Message 60049.  

you should also add weights to the <tasks> element in the jobs.xml file that's being used as well as adding some kind of progress reporting for the main script. jumping to 75% at the start and staying there for 12-24hrs until it jumps to 100% at the end is counterintuitive for most users and causes confusion about if the task is doing anything or not.
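
To illustrate the point about weights, here is a minimal sketch of how a weighted progress figure behaves. It assumes overall progress is the weight of the finished tasks plus the running task's own fraction, divided by the total weight. The function name, the task lists and the weights are hypothetical, and the real job.xml for these tasks is not shown in this thread.

```python
def overall_fraction(weights, finished, current_fraction=0.0):
    """Weighted progress: finished tasks count fully, the running task partially."""
    total = sum(weights)
    done = sum(weights[:finished])
    if finished < len(weights):
        # The task currently running contributes its own fraction (0.0 to 1.0).
        done += weights[finished] * current_fraction
    return done / total


# Hypothetical job: three short setup steps followed by one long simulation.
equal = [1, 1, 1, 1]       # assumed equal weights for every step
weighted = [1, 1, 1, 97]   # assumed weights that reflect the long final step

print(overall_fraction(equal, 3))     # 0.75
print(overall_fraction(weighted, 3))  # 0.03
# With the long step also reporting its own progress (sample 266 of 341):
print(overall_fraction(weighted, 3, 266 / 341))  # roughly 0.79
```

With four equally weighted steps, three quick ones would leave the bar at 75% for the whole simulation. That is consistent with the symptom described here, but it is only a guess at the cause. Giving the long step most of the weight, and letting it report its own fraction, would make the bar move.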

Aurum

Joined: 12 Jul 17
Posts: 404
Credit: 17,408,899,587
RAC: 2
Message 60051 - Posted: 9 Mar 2023, 14:51:15 UTC - in response to Message 60047.  
Last modified: 9 Mar 2023, 14:53:22 UTC

Next time I'll prepare them so they run for a shorter time and finish over the following submissions. Is there an approximate run time you'd suggest per task?

The sweet spot would be 0.5 to 4 hours. Above 8 hours is starting to drag. Some climate projects take over a week to run. It really depends on your needs, we're here to serve :-) It seems a quicker turnaround time while you're tweaking your project would be to your benefit.

It seems it would help you if you created your own BOINC account and ran your WUs the same way we do. Get in the trenches with us and see what we see.

Richard Haselgrove
Message 60052 - Posted: 9 Mar 2023, 16:51:12 UTC
Last modified: 9 Mar 2023, 17:01:11 UTC

Well, here it is:



BOINC sees that as 500.28 MB (Linux counts in 1000s, BOINC counts in 1024s) - wish me luck!

Edit - phew, it got through. But that's very, very close to the old limit. Task 33344733
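
The difference between the two ways of counting can be checked with a short calculation. This is only a sketch of the unit arithmetic: the helper names are made up for illustration, and the only figure taken from the post is the 500.28 MB that BOINC displayed.

```python
MIB = 1024 ** 2  # bytes per megabyte when counting in 1024s
MB = 1000 ** 2   # bytes per megabyte when counting in 1000s


def mib_to_mb(value):
    """Convert a size counted in 1024s to the same size counted in 1000s."""
    return value * MIB / MB


def mb_to_mib(value):
    """Convert a size counted in 1000s to the same size counted in 1024s."""
    return value * MB / MIB


print(round(mib_to_mb(500.28), 2))  # 524.58
print(round(mb_to_mib(500.0), 2))   # 476.84
```

So a file shown as 500.28 MB when counting in 1024s is about 524.58 MB when counting in 1000s, a gap of roughly 5%.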

Quico
Message 60053 - Posted: 9 Mar 2023, 18:29:11 UTC - in response to Message 60051.  

Next time I'll prepare them so they run for a shorter time and finish over the following submissions. Is there an approximate run time you'd suggest per task?

The sweet spot would be 0.5 to 4 hours. Above 8 hours is starting to drag. Some climate projects take over a week to run. It really depends on your needs, we're here to serve :-) It seems a quicker turnaround time while you're tweaking your project would be to your benefit.

It seems it would help you if you created your own BOINC account and ran your WUs the same way we do. Get in the trenches with us and see what we see.


Once the Windows version is live my personal set-up will join the cause and will have more feedback :)

Well, here it is:



BOINC sees that as 500.28 MB (Linux counts in 1000s, BOINC counts in 1024s) - wish me luck!

Edit - phew, it got through. But that's very, very close to the old limit. Task 33344733


Thanks for the insight. I'll make it save frames less frequently to keep the file size down.

Ian&Steve C.
Message 60068 - Posted: 13 Mar 2023, 16:26:17 UTC

nothing but errors from the current ATM batch. run.sh is missing or misnamed/misreferenced.

Aurum
Message 60069 - Posted: 13 Mar 2023, 17:49:34 UTC
Last modified: 13 Mar 2023, 17:49:46 UTC

I vaguely recall GG had a rule something like a computer can only DL 200 WUs a day. If it's still in place it would be absurd since the overriding rule is that a computer can only hold 2 WUs at a time.
At the rate ATM WUs are failing I could hit that limit, so I halted GG DLs.
Please delete all your WUs until you fix the bug.

Richard Haselgrove
Message 60074 - Posted: 14 Mar 2023, 12:35:21 UTC

Today's tasks are running OK - the run.sh script problem has been cured.

I'm running one that the previous user aborted before it even started - no need for that any more (WU 27426736).

Ian&Steve C.
Message 60075 - Posted: 14 Mar 2023, 12:51:35 UTC - in response to Message 60074.  
Last modified: 14 Mar 2023, 12:52:47 UTC

i wouldn't say "cured", but newer tasks seem to be fine. I'm still getting a good number of resends with the same problem. i guess they'll make their way through the meat grinder before defaulting out.

example: http://www.gpugrid.net/result.php?resultid=33357435

Richard Haselgrove
Message 60076 - Posted: 14 Mar 2023, 14:47:33 UTC - in response to Message 60075.  

My point was: if you get one of these, let it run - it may be going to produce useful science. If it's one of the faulty ones, you waste about 20 seconds, and move on.

Aurum
Message 60084 - Posted: 15 Mar 2023, 9:28:37 UTC
Last modified: 15 Mar 2023, 9:30:08 UTC

Quico/GDF, GPU utilization is low so I'd like to test running 3 and 4 ATM WUs simultaneously.
Sadly GG chokes off work at 2 WUs per computer so that's presently impossible.

Quico
Message 60085 - Posted: 15 Mar 2023, 10:12:31 UTC - in response to Message 60076.  
Last modified: 15 Mar 2023, 10:16:48 UTC

Sorry about the missing run.sh issue of the past few days. It slipped past me. There were also a few re-send tests that crashed, but it should be fixed now.


Is there a way I could delete the failed/crashed files from the server?

We're also trying to find alternatives to avoid the filesize issue. I hope we can find a nice solution in the next few days.

Do the last few runs take less time, and are they less of a drag to run? I'm trying to find the sweet spot for everyone, or at least most of us.

Thanks everyone!

Quico
Message 60086 - Posted: 15 Mar 2023, 10:13:40 UTC - in response to Message 60084.  

Quico/GDF, GPU utilization is low so I'd like to test running 3 and 4 ATM WUs simultaneously.
Sadly GG chokes off work at 2 WUs per computer so that's presently impossible.


How low is it? It really shouldn't be the case at least taking into account the tests we performed internally.

Richard Haselgrove
Message 60087 - Posted: 15 Mar 2023, 10:47:56 UTC - in response to Message 60085.  

My host 508381 (GTX 1660 Ti) has finished a couple overnight, in about 9 hours. The last one finished just as I was reading your message, and I saw the upload size - 114 MB. Another failed with 'Energy is NaN', but that's another question.

The size and time figures are comfortable for me, but others will post their own views.

It would be helpful to work on the intermediate progress reports and checkpointing - at the moment, neither are reported to BOINC. This host (Linux Mint 20.3) spends the entire run reporting 75% progress: my other machine (Linux Mint 21.1) is stuck at 3%. Both run exactly the same build of BOINC.
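
As a rough idea of what intermediate progress reporting could look like, here is a minimal sketch in which the main script writes its fraction done to a small text file after each sample. It assumes the BOINC wrapper is told to read that file through a `<fraction_done_filename>` entry in the job file. The file name `progress.txt` and the helper name are hypothetical; the 341-sample total is the figure quoted earlier in the thread.

```python
import tempfile
from pathlib import Path


def write_fraction_done(path, samples_done, samples_total):
    """Write progress as a number between 0 and 1 and return the value written."""
    fraction = min(max(samples_done / samples_total, 0.0), 1.0)
    tmp = Path(str(path) + ".tmp")
    tmp.write_text(f"{fraction:.4f}\n")
    # Rename over the target so a reader never sees a half-written file.
    tmp.replace(path)
    return fraction


if __name__ == "__main__":
    # Demonstration in a scratch directory; a real task would write into its slot directory.
    with tempfile.TemporaryDirectory() as scratch:
        progress_file = Path(scratch) / "progress.txt"  # hypothetical file name
        write_fraction_done(progress_file, 266, 341)
        print(progress_file.read_text().strip())  # 0.7801
```

Calling this once per completed sample would be enough for the displayed percentage to advance steadily instead of sitting at a fixed value.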

Ian&Steve C.
Message 60091 - Posted: 15 Mar 2023, 11:29:34 UTC - in response to Message 60086.  
Last modified: 15 Mar 2023, 11:45:09 UTC

My observations show the GPU switching from periods of high utilization (~96-98%) to periods of idle (0%). About every minute or two.

i think the current size of the ATM tasks is pretty good. about 4hrs on a 3080Ti and about 5hrs on a 2080Ti.

I'll second Richard's comment that you should put some effort into checkpointing and into fixing the completion reporting (add weights to the job.xml file).

Aurum
Message 60093 - Posted: 15 Mar 2023, 14:47:42 UTC - in response to Message 60086.  
Last modified: 15 Mar 2023, 14:53:46 UTC

Quico/GDF, GPU utilization is low so I'd like to test running 3 and 4 ATM WUs simultaneously.
Sadly GG chokes off work at 2 WUs per computer so that's presently impossible.
How low is it? It really shouldn't be the case at least taking into account the tests we performed internally.

GPUgrid is set to only DL 2 WUs per computer.

It used to be higher, but since ACEMD WUs take around 12 hours and have approximately 50% GPU utilization, a normal BOINC client couldn't really make efficient use of more than 2. The history of setting the limit may have had something to do with DDOS attacks and throttling server access as a defense.

But Python WUs with a very low GPU utilization and ATM with about 25% utilization could run more. I believe it's possible for the work server to designate how many WUs of a given kind based on the client's hardware.

Some use a custom BOINC client that tricks the server into thinking their computer is more than one computer.

I suspect 1080s & 2080s could run 3 and 3080s could run 4 ATM WUs. Be nice to give it a try.

Checkpointing should be high on your To-Do List followed closely by progress reporting. File size is not an issue on the client side since you DL files over a GB. But increasing the limit on your server side would make that problem vanish. Run times have shortened and run fine, maybe a little shorter would be nice but not a priority.

Stephen Uitti

Joined: 17 Mar 14
Posts: 4
Credit: 77,427,636
RAC: 0
Message 60094 - Posted: 15 Mar 2023, 14:58:08 UTC

I noticed "Free energy calculations of protein ligand binding" in WUProp. For example, today's time is 0.03 hours. I checked, and I have 68 of these with minimal total run time. They all get "Error while computing". I looked at a recent work unit, 27429650 T_CDK2_new_2_edit_1oiu_26_T2_2A_1-QUICO_TEST_ATM-0-1-RND4575_0
The log has this:

+ python -m pip install git+https://github.com/raimis/AToM-OpenMM.git@5d7eac55295e8c6e777505c3ca7c998f1c85987d
Running command git clone --filter=blob:none --quiet https://github.com/raimis/AToM-OpenMM.git /t/boinclib/boinc-client/slots/8/tmp/pip-req-build-3qm67lb1
Running command git rev-parse -q --verify 'sha^5d7eac55295e8c6e777505c3ca7c998f1c85987d'
Running command git fetch -q https://github.com/raimis/AToM-OpenMM.git 5d7eac55295e8c6e777505c3ca7c998f1c85987d
Running command git checkout -q 5d7eac55295e8c6e777505c3ca7c998f1c85987d
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: -4


I'm running Linux Mint 19 (a bit out of date), git is git version 2.17.1
/usr/bin/python is Python 2.7.17 and /usr/bin/python3 is Python 3.6.9 -- this was common until recently
uname -a
Linux berfon 5.4.0-104-generic #118~18.04.1-Ubuntu SMP Thu Mar 3 13:53:15 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
My machine has a gtx-950, so cuda tasks are OK.
It's having an issue writing to /t/boinclib/boinc-client/slots/8/tmp

sudo ls -ld /t/boinclib/boinc-client/slots/8/
drwxrwx--x 2 boinc boinc 4096 Mar 15 10:24 /t/boinclib/boinc-client/slots/8/
So it doesn't look like a permissions issue. The disk drive this is on has over 1 TB space free. It looks to me like git failed, and this is what is happening on all the work units.
My machine is running "New version of ACEMD" routinely.
My preferences for GPUGrid is to run everything. I'm not sure which category this is in, but it must be one of the beta apps.

I hope this helps.
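
A side note on reading the exit code in the log above. In Python's subprocess convention on POSIX systems, a negative return code -N means the child process was terminated by signal N, and this sketch assumes pip is reporting that value unchanged. `describe_exit` is a hypothetical helper written for illustration.

```python
import signal


def describe_exit(code):
    """Turn a subprocess-style return code into a readable description."""
    if code < 0:
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {-code} ({name})"
    return f"exited with status {code}"


print(describe_exit(-4))  # killed by signal 4 (SIGILL)
```

Signal 4 is SIGILL (illegal instruction), which would suggest that the `setup.py egg_info` step itself crashed rather than the git commands, all of which appear to have completed in the log. That is only a reading of the exit code, not a confirmed diagnosis.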

©2025 Universitat Pompeu Fabra