New version of ACEMD 2.17 on multi GPU hosts

Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 3
Message 57306 - Posted: 17 Sep 2021, 16:02:20 UTC

I have two ADRIA tasks running now on host 132158 - Linux Mint, driver v460.

htop shows that they have different command lines, ending in '--device 0' and '--device 1'.
nvidia-smi shows an acemd3 app running on GPU 0, and another running on GPU 1.

All is looking good so far.

The only strange thing is that one is running app version 101, and the other is running version 1121. Two identical cards, so we'll see who wins!

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,876,970,595
RAC: 423,674
Message 57307 - Posted: 17 Sep 2021, 16:04:22 UTC - in response to Message 57306.  


The only strange thing is that one is running app version 101, and the other is running version 1121. Two identical cards, so we'll see who wins!


that's the best test we can hope for, the most apples to apples.

I'd certainly be interested to know if one is significantly faster than the other.

Keith Myers
Joined: 13 Dec 17
Posts: 1423
Credit: 9,186,946,190
RAC: 1,288,374
Message 57309 - Posted: 17 Sep 2021, 18:33:26 UTC

I got two new 2.18 tasks, one each on two hosts. Both CUDA_101 though.

Richard Haselgrove
Message 57310 - Posted: 17 Sep 2021, 19:03:10 UTC - in response to Message 57307.  

Here's a taster show, after 4 hours elapsed:

v1121 at 12.727%
(device 1, in 4x PCIe slot)

v101 at 11.368%
(device 0, in 16x PCIe slot, driving monitor)
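
For anyone who wants to turn interim figures like these into a projected finish time, here is a rough sketch. It assumes progress is linear over the whole run, which may not hold for every task; the function name is made up for illustration.

```python
def projected_hours(elapsed_hours, percent_done):
    """Linear extrapolation of total runtime from progress so far."""
    if percent_done <= 0:
        raise ValueError("percent_done must be positive")
    return elapsed_hours * 100.0 / percent_done


# Figures from the post above: both tasks after 4 hours elapsed.
for label, pct in (("v1121", 12.727), ("v101", 11.368)):
    print(f"{label}: about {projected_hours(4.0, pct):.1f} h total")
# v1121: about 31.4 h total
# v101: about 35.2 h total
```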

888
Joined: 28 Jan 21
Posts: 6
Credit: 106,022,917
RAC: 0
Message 57311 - Posted: 17 Sep 2021, 19:06:49 UTC
Last modified: 17 Sep 2021, 19:19:37 UTC

I received 4 GPUGrid WUs on my dual-GPU system (RTX 3070 and RTX 2070).
It was happily crunching one unit on each of the GPUs until BOINC downloaded and ran a WCG unit. The GPUGrid unit then failed with this message:

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
15:50:51 (128895): wrapper (7.7.26016): starting
15:50:51 (128895): wrapper (7.7.26016): starting
15:50:51 (128895): wrapper: running /bin/tar (xf conda-pack.tar.bz2)
15:52:07 (128895): /bin/tar exited; CPU time 75.576773
15:52:07 (128895): wrapper: running bin/acemd3 (--boinc --device 0)
19:27:16 (136305): wrapper (7.7.26016): starting
19:27:16 (136305): wrapper (7.7.26016): starting
19:27:16 (136305): wrapper: running bin/acemd3 (--boinc --device 1)
ERROR: /home/user/conda/conda-bld/acemd3_1618916459379/work/src/mdsim/context.cpp line 318: Cannot use a restart file on a different device!
19:27:20 (136305): bin/acemd3 exited; CPU time 3.452513
19:27:20 (136305): app exit status: 0x9e
19:27:20 (136305): called boinc_finish(195)


19:27:16 is exactly the timestamp at which the WCG process started.

Looks like it won't play happily with other projects. Has anyone else seen this?
I've suspended WCG for the moment.
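
A quick way to spot this pattern in a task's stderr is to pull out the `--device N` argument from each wrapper launch line and see whether it changed between starts. The sketch below assumes the log lines look like the ones pasted above; the helper names are hypothetical and nothing here is part of BOINC or acemd3.

```python
import re

# Matches wrapper launch lines such as:
#   wrapper: running bin/acemd3 (--boinc --device 0)
DEVICE_RE = re.compile(r"running bin/acemd3 \(.*--device (\d+)\)")


def devices_used(stderr_text):
    """Return the device index of every acemd3 launch, in log order."""
    return [int(m.group(1)) for m in DEVICE_RE.finditer(stderr_text)]


def restarted_on_other_device(stderr_text):
    """True if the task was launched on more than one distinct device."""
    return len(set(devices_used(stderr_text))) > 1


sample = """\
15:52:07 (128895): wrapper: running bin/acemd3 (--boinc --device 0)
19:27:16 (136305): wrapper: running bin/acemd3 (--boinc --device 1)
"""
print(devices_used(sample))               # [0, 1]
print(restarted_on_other_device(sample))  # True
```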

ServicEnginIC
Joined: 24 Sep 10
Posts: 593
Credit: 12,143,936,510
RAC: 4,251,066
Message 57312 - Posted: 17 Sep 2021, 19:22:58 UTC - in response to Message 57311.  

Has anyone else seen this?

It is an old, known problem.
Please take a look at Toni's Message #52865, dated Oct 17 2019,
especially the question "Can I use it on multi-GPU systems?"
Your failed task started on device 0, then restarted on device 1...

Ian&Steve C.
Message 57313 - Posted: 17 Sep 2021, 20:02:41 UTC - in response to Message 57311.  

I received 4 GPUGrid WUs on my dual-GPU system (RTX 3070 and RTX 2070).
It was happily crunching one unit on each of the GPUs until BOINC downloaded and ran a WCG unit. The GPUGrid unit then failed with this message:

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
15:50:51 (128895): wrapper (7.7.26016): starting
15:50:51 (128895): wrapper (7.7.26016): starting
15:50:51 (128895): wrapper: running /bin/tar (xf conda-pack.tar.bz2)
15:52:07 (128895): /bin/tar exited; CPU time 75.576773
15:52:07 (128895): wrapper: running bin/acemd3 (--boinc --device 0)
19:27:16 (136305): wrapper (7.7.26016): starting
19:27:16 (136305): wrapper (7.7.26016): starting
19:27:16 (136305): wrapper: running bin/acemd3 (--boinc --device 1)
ERROR: /home/user/conda/conda-bld/acemd3_1618916459379/work/src/mdsim/context.cpp line 318: Cannot use a restart file on a different device!
19:27:20 (136305): bin/acemd3 exited; CPU time 3.452513
19:27:20 (136305): app exit status: 0x9e
19:27:20 (136305): called boinc_finish(195)


19:27:16 is exactly the timestamp at which the WCG process started.

Looks like it won't play happily with other projects. Has anyone else seen this?
I've suspended WCG for the moment.


You need to extend the task-switching interval in your compute preferences. These GPUGRID tasks can take 12-24+ hrs depending on GPU power, so you might need to set it to a very high value. I have it set to 24 hrs (1440 minutes) on my hosts.

If you're running GPUGRID, a better option might be to set other projects to a resource share of 0, so that they only ask for work when no GPUGRID work is present.

FYI, this issue will happen if you simply stop BOINC and/or reboot your system. you'll need to be fine with leaving your system on for days at a time potentially.
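
For hosts managed without the GUI, the same interval can be put in a local preferences override. The sketch below only builds the XML text. It assumes the "switch between tasks every N minutes" preference is stored as `cpu_scheduling_period_minutes` inside `global_prefs_override.xml` in the BOINC data directory; check both names against the BOINC documentation for your client version before relying on them. The function name is made up for illustration.

```python
import xml.etree.ElementTree as ET


def switch_interval_override(minutes):
    """Build override XML setting the task-switch interval in minutes.

    Assumption: the tag name cpu_scheduling_period_minutes and the
    global_preferences root match what the BOINC client expects.
    """
    root = ET.Element("global_preferences")
    elem = ET.SubElement(root, "cpu_scheduling_period_minutes")
    elem.text = f"{minutes:.6f}"
    return ET.tostring(root, encoding="unicode")


# 24 hours, the value mentioned above.
print(switch_interval_override(1440))
```

The client has to re-read its preferences (or be restarted) before a changed override takes effect, and a restart can itself move a running task to another device, so it is safer to make the change while no GPUGRID task is in progress.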

888
Message 57314 - Posted: 17 Sep 2021, 20:04:31 UTC

Thanks for the quick reply clarifying the problem.
But 2 years and no fix to what seems like quite a basic problem......

Keith Myers
Message 57315 - Posted: 17 Sep 2021, 20:07:57 UTC

Wait a minute . . . . . I thought I read in this thread on the previous beta releases that the restarting on a different device issue was solved???

Ian&Steve C.
Message 57316 - Posted: 17 Sep 2021, 20:37:56 UTC - in response to Message 57315.  

Wait a minute . . . . . I thought I read in this thread on the previous beta releases that the restarting on a different device issue was solved???


That wasn’t the problem seen in previous app versions. We were seeing all tasks running on the same GPU.

Richard Haselgrove
Message 57329 - Posted: 19 Sep 2021, 14:43:20 UTC - in response to Message 57307.  
Last modified: 19 Sep 2021, 14:44:21 UTC

I'd certainly be interested to know if one is significantly faster than the other.

The head-to-head speed comparison results are in. Both tasks completed and validated, and both were given the same credit score. Cards are GTX 1660 SUPER (ASUS TUF, if it matters).

Runtime:

v1121	113,110.14 sec
v101	126,707.98 sec (12% longer)

Speed:
v1121	3.18% / hour (12% faster)
v101	2.84% / hour
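
The percentages follow directly from the two runtimes. A short worked check, using only the numbers quoted above:

```python
# Runtimes in seconds, as reported for the two tasks.
runtimes = {"v1121": 113110.14, "v101": 126707.98}


def percent_per_hour(runtime_s):
    """Average progress rate: 100% of the task spread over its runtime."""
    return 100.0 * 3600.0 / runtime_s


for label, seconds in runtimes.items():
    print(f"{label}: {percent_per_hour(seconds):.2f}% / hour")

# The runtime ratio and the speed ratio are the same number seen from both sides.
ratio = runtimes["v101"] / runtimes["v1121"]
print(f"v101 took {100 * (ratio - 1):.0f}% longer")
```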

Keith Myers
Message 57331 - Posted: 19 Sep 2021, 15:15:19 UTC - in response to Message 57329.  

If they keep both apps active, then the BOINC mechanism for choosing the most efficient application should become active once 10 valid tasks are completed on both apps.

The 1121 app's APR should prevail.

Keith Myers
Message 57335 - Posted: 19 Sep 2021, 22:24:41 UTC - in response to Message 57329.  

Hmmmm . . . . not enough tasks to draw a concrete conclusion, but on my daily driver with three identical RTX 2080 cards, the CUDA101 app was 2000 seconds faster than the CUDA1121 app.

https://www.gpugrid.net/results.php?userid=516740&offset=0&show_names=0&state=3&appid=

Though that might be down to a restart on a different device, albeit the same type of card. All cards are hybrids, have temps well under control, and boost the same.

Keith Myers
Message 57337 - Posted: 20 Sep 2021, 5:15:42 UTC

Anybody seen any sign of your credits exported to 3rd party aggregation websites yet?
Finished work over a day ago and still no stats from GPUGrid.

Richard Haselgrove
Message 57338 - Posted: 20 Sep 2021, 7:26:06 UTC - in response to Message 57337.  

Anybody seen any sign of your credits exported to 3rd party aggregation websites yet?

No. https://www.gpugrid.net/stats/ is accessible, but the files in it are dated September 16.

Somebody needs to restart a script.

ServicEnginIC
Message 57340 - Posted: 20 Sep 2021, 8:28:19 UTC - in response to Message 57337.  

Anybody seen any sign of your credits exported to 3rd party aggregation websites yet?
Finished work over a day ago and still no stats from GPUGrid.

Good observation.
My statistics for GPUGRID at BOINC STATS have also been blank since the new app v2.18 ADRIA tasks came out.

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist

Joined: 14 Mar 07
Posts: 1958
Credit: 629,356
RAC: 0
Message 57342 - Posted: 20 Sep 2021, 8:49:54 UTC - in response to Message 57340.  

Looking into this

GDF
Message 57344 - Posted: 20 Sep 2021, 9:47:17 UTC - in response to Message 57342.  

fixed

Keith Myers
Message 57346 - Posted: 20 Sep 2021, 15:04:45 UTC - in response to Message 57344.  

Thanks, Gianni.

ServicEnginIC
Message 57349 - Posted: 20 Sep 2021, 16:12:59 UTC - in response to Message 57344.  

fixed

Working again, thanks

©2026 Universitat Pompeu Fabra