New app update (acemd3)

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 52000 - Posted: 5 Jun 2019, 10:06:38 UTC - in response to Message 51999.  
Last modified: 5 Jun 2019, 10:07:04 UTC

I did find a "messy" install of the nvidia driver on the offending machine. There seems to be remnants of a driver installed via download directly from nvidia. I'll clean that up.

'sudo apt search nvidia' showed significant differences between the 2 machines.



From what I know, "apt search" does not look at the packages installed in your system but those "accessible" online. So, the difference may be in the repository configurations.
ID: 52000 · Rating: 0
biodoc

Joined: 26 Aug 08
Posts: 183
Credit: 10,085,929,375
RAC: 0
Message 52001 - Posted: 5 Jun 2019, 10:49:24 UTC - in response to Message 52000.  


From what I know, "apt search" does not look at the packages installed in your system but those "accessible" online. So, the difference may be in the repository configurations.


Yeah, dpkg -l | grep -i nvidia is the right command.
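To compare the two machines, the saved `dpkg -l` listings can be diffed directly. This is only a minimal sketch: `installed_packages` and `diff_packages` are hypothetical helper names, and the sample rows are made up to show the shape of the input, not taken from either host.

```python
def installed_packages(dpkg_output, keyword="nvidia"):
    """Return {package: version} for installed ('ii') rows of `dpkg -l` output."""
    packages = {}
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Data rows look like: ii  <name>  <version>  <arch>  <description...>
        if len(fields) >= 3 and fields[0] == "ii" and keyword in fields[1].lower():
            packages[fields[1]] = fields[2]
    return packages


def diff_packages(a, b):
    """Return packages whose presence or version differs between two hosts."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Made-up sample rows (hypothetical versions), only to illustrate the input format.
good = installed_packages(
    "ii  nvidia-driver-418  418.56-0ubuntu1  amd64  NVIDIA driver\n"
)
bad = installed_packages(
    "ii  nvidia-driver-418  418.56-0ubuntu1  amd64  NVIDIA driver\n"
    "rc  nvidia-settings    390.48-0ubuntu1  amd64  removed, config remains\n"
    "ii  nvidia-prime       0.8.8            all    Prime support\n"
)
print(diff_packages(good, bad))
```

Rows marked `rc` (removed, config files remain) are skipped on purpose, since only `ii` rows are fully installed packages.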

I went ahead and purged everything nvidia and reinstalled the nvidia driver. I didn't install the cuda toolkit though.

UPDATE: tasks still failing on this machine.
ID: 52001 · Rating: 0
Anon

Joined: 4 Jun 19
Posts: 3
Credit: 11,999,700
RAC: 0
Message 52002 - Posted: 5 Jun 2019, 11:40:30 UTC - in response to Message 51998.  
Last modified: 5 Jun 2019, 12:39:34 UTC

Toni
host:
CUDA: NVIDIA GPU 0: GeForce GTX 1080 (driver version 418.56, CUDA version 10.1, compute capability 6.1, 4096MB, 3968MB available, 9718 GFLOPS peak)
OpenCL: NVIDIA GPU 0: GeForce GTX 1080 (driver version 418.56, device version OpenCL 1.2 CUDA, 8112MB, 3968MB available, 9718 GFLOPS peak)

Progress.log from a valid task:


#
# ACEMD version 3.2.0rc0-65-gdb8d7f8
#
# Copyright (C) 2017-2019 Acellera (www.acellera.com)
#
# When publishing, please cite:
# ACEMD: Accelerating Biomolecular Dynamics in the Microsecond Time Scale
# M. J. Harvey, G. Giupponi and G. De Fabritiis,
# J Chem. Theory. Comput. 2009 5(6), pp1632-1639
# DOI: 10.1021/ct9000685
#
# ACEMD is running in Boinc mode!
#
# Read input file: input
# Parse input file
# WARNING: Keyword "hydrogenscale" is deprecated: Hydrogen mass scaling enabled when timestep > 2.0
# WARNING: Keyword "rigidbonds" is deprecated: Rigid bonds set when timestep > 1.0
# WARNING: Keyword "exclude" is deprecated: 1-4 exclusion automatically set by force-field
# WARNING: Keyword "1-4scaling" is deprecated: 1-4 scaling automatically set by force-field
# WARNING: Keyword "pmegridsizex" is deprecated: Feature not supported
# WARNING: Keyword "pmegridsizey" is deprecated: Feature not supported
# WARNING: Keyword "pmegridsizez" is deprecated: Feature not supported
# WARNING: Keyword "pmefreq" is deprecated: MTS not supported
# WARNING: Deprecated keyword "langevin" is replaced with "thermostat"
# WARNING: Deprecated keyword "langevindamping" is replaced with "thermostatDamping"
# WARNING: Keyword "energyfreq" is deprecated: Energies are now output every trajectoryFreq steps
$
$# Forcefield configuration
$
$ parameters parameters
$
$# Initial State
$
$ structure structure.psf
$ coordinates structure.pdb
$ temperature 300.00 # K
$ celldimension 62.230000 62.230000 62.230000 # A
$
$# Output
$
$ trajectoryFile output.xtc
$ trajectoryFreq 25000
$
$# Electrostatics
$
$ PME on
$ cutoff 9.00 # A
$ switching on
$ switchDist 7.50 # A
$ implicit off
$
$# Temperature Control
$
$ thermostat on
$ thermostatTemp 298.15 # K
$ thermostatDamping 1.00 # /ps
$
$# Pressure Control
$
$ barostat off
$ barostatPressure 1.0000 # bar
$ useFlexibleCell off
$ useConstantArea off
$ useConstantRatio off
$
$# Integration
$
$ timestep 4.00 # fs
$
$# External forces
$
$
$# Restraints
$
$
$# Run Configuration
$
$ restart off
$ run 250000
# Topology reports 23558 atoms
# Initializing engine
# Version: 7.3.1
# WARNING: overriding the plugin path to /var/lib/boinc-client/slots/40 with ACEMD_PLUGIN_DIR
# Plugin directory: /var/lib/boinc-client/slots/40
# Loaded plugins
# libOpenMMCUDA
# libOpenMMPME
# libOpenMMOpenCL
# libOpenMMCPU
# libOpenMMCudaCompiler
# Available platforms
# CUDA
# OpenCL
# CPU
#
# Bonded interactions
# Harmonic bond interactions
# Number of terms: 16569
# Harmonic angle interactions
# Number of terms: 11584
# Urey-Bradley interactions
# Number of terms: 2117
# Proper dihedral interations
# Number of terms: 5621
# Number of skipped terms: 1379
# NOTE: the skipped terms have zero force constants
# Improper dihedral interations
# Number of terms: 408
# Number of skipped terms: 10
# NOTE: the skipped terms have zero force constants
# CMAP interactions
# Number of terms: 0
# NOTE: CMAP interations skipped
#
# Non-bonded interactions
# Number of exclusions: 34709
# Lennard-Jones terms
# Cutoff distance: 9.000 A
# Switching distance: 7.500 A
# Coulombic (PME) term
# Ewald tolerance: 0.000500
# No NBFIX
# No implicit solvent
#
# Constraining hydrogen (X-H) bonds
# Number of constrained bonds: 15267
# Making water molecules rigid
# Number of water molecules: 7023
# Number of constraints: 22290
#
# Repartitioning hydrogen atom mass
# New hydrogen mass: 4.032 au
# Number of hydrogen atoms: 15267
#
# Creating simulation system
# Number of particles: 23558
# Number of degrees of freedom 48381
# Periodic box size: 62.230 62.230 62.230 A
#
# Using Langevin integrator (with temperature control)
# Thermostat target temperature: 298.15 K
# Thermostat friction coeficient: 1.00 ps^-1
#


Slotfolder 40 zip: https://filebin.net/jfv8ec4c6q8uszuw/Slot_40.zip?t=tvn13kdj

On the failing host the slot folders are empty. Either BOINC wipes them at the crash or the application never adds files to the slot folder, so I could not grab progress.log.
The task fails after about 1 second, which makes it impossible to grab, and it is not stored for upload, so it gets wiped.
I get this error on the older OS (16.04) with a GTX 970 and driver 418.56. The same driver produces valid tasks on the newer system (18.10).
So it looks to depend on the system, not the driver version. This compile issue still exists in the latest application, but it only affects the old system.

<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)
</message>
<stderr_txt>
14:22:07 (102554): wrapper (7.7.26016): starting
14:22:07 (102554): wrapper (7.7.26016): starting
14:22:07 (102554): wrapper: running acemd3 (--boinc input --device 0)
# Engine failed: Error launching CUDA compiler: 32512
sh: 1: : Permission denied

14:22:08 (102554): acemd3 exited; CPU time 0.132000
14:22:08 (102554): app exit status: 0x1
14:22:08 (102554): called boinc_finish(195)

</stderr_txt>
]]>
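
A possible reading of the number 32512, assuming it is the raw status returned by a `system()`-style call on Linux (the log does not say so explicitly): the child's exit code sits in the high byte. `decode_wait_status` is a hypothetical helper written for this sketch.

```python
def decode_wait_status(status):
    """Split a raw wait status (Linux layout) into (exit_code, signal_number)."""
    signal_number = status & 0x7F      # low 7 bits: terminating signal, 0 if none
    exit_code = (status >> 8) & 0xFF   # next byte: exit code of the child
    return exit_code, signal_number


exit_code, signal_number = decode_wait_status(32512)
print(exit_code, signal_number)
```

Under that assumption 32512 decodes to exit code 127 with no signal. 127 is the code a POSIX shell uses for "command not found", which would fit the empty command name in `sh: 1: : Permission denied`. That in turn suggests the path to the CUDA compiler was empty on this host, but this is a guess, not a confirmed diagnosis.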
ID: 52002 · Rating: 0
Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 52003 - Posted: 5 Jun 2019, 12:15:25 UTC - in response to Message 52002.  

Ahem, to clarify: I see the progress.log file of successful tasks only.
ID: 52003 · Rating: 0
Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 52004 - Posted: 5 Jun 2019, 13:38:42 UTC - in response to Message 52003.  
Last modified: 5 Jun 2019, 13:39:05 UTC

If anybody is so inclined, can they try to run the boinc client manually with the --exit_after_finish flag, so the slot directory is preserved on failure?


Thanks
ID: 52004 · Rating: 0
biodoc

Joined: 26 Aug 08
Posts: 183
Credit: 10,085,929,375
RAC: 0
Message 52005 - Posted: 5 Jun 2019, 15:31:07 UTC - in response to Message 51996.  


If somebody can post or upload the three components of a test workunit specification:

* <app_version>
* <workunit>
* <result>

all from client_state.xml - make sure you get the right (latest) version of <app_version>, there will be several of them - I can proofread that there are no bugs in the BOINC deployment of the app files. This one could be a problem with the version renaming or copying.


There is information in <app_version> but nothing for <workunit> or <result>.

<app_version>
    <app_name>acemd3</app_name>
    <version_num>202</version_num>
    <platform>x86_64-pc-linux-gnu</platform>
    <avg_ncpus>0.987442</avg_ncpus>
    <flops>28742507251613.187500</flops>
    <plan_class>cuda80</plan_class>
    <api_version>7.7.0</api_version>
    <file_ref>
        <file_name>wrapper_26198_x86_64-pc-linux-gnu</file_name>
        <main_program/>
    </file_ref>
    <file_ref>
        <file_name>acemd3.e72153abf98cb1fcd0f05fc443818dfc</file_name>
        <open_name>acemd3</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>job.xml.1245cc127550a015dcc9b3e1c2c84e13</file_name>
        <open_name>job.xml</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>libOpenMMOpenCL.so.6a31fa1ff5ae3a26ea64f2abfb5a66cc</file_name>
        <open_name>libOpenMMOpenCL.so</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>libOpenCL.so.1.0.0.43d4300566ce59d77e0fa316f8ee5b02</file_name>
        <open_name>libOpenCL.so.1</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>libgomp.so.1.0.0.efdf718669edc7fff00e0c5f7f0b8791</file_name>
        <open_name>libgomp.so.1.0.0</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>libOpenMM.so.5406dfd716045d08ad6369e2399a98e2</file_name>
        <open_name>libOpenMM.so</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>libOpenMMCUDA.so.8867021fdc0daf2e39f1b7228ece45af</file_name>
        <open_name>libOpenMMCUDA.so</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>libcudart.so.8.0.61.af43be839e6366e731accc514633bd1f</file_name>
        <open_name>libcudart.so.8.0</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>libfftw3f_threads.so.3.4.4.dd0c6fcfa550371acf730db2d9d5a270</file_name>
        <open_name>libfftw3f_threads.so.3</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>libgcc_s.so.1.d7f787a9bf6c3633eaebb9015c6d9044</file_name>
        <open_name>libgcc_s.so.1</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>libnvrtc-builtins.so.8.0.61.684f2f1d9f0934bcce91e77b69e17ec7</file_name>
        <open_name>libnvrtc-builtins.so</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>libOpenMMCudaCompiler.so.aaed781fe4caa9d1099312d458a9b902</file_name>
        <open_name>libOpenMMCudaCompiler.so</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>libfftw3f.so.3.4.4.a4580ddf9efebaad56fab49847a8c899</file_name>
        <open_name>libfftw3f.so.3</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>libOpenMMPME.so.3208e45e71567824e8390ab1c79c6a66</file_name>
        <open_name>libOpenMMPME.so</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>libnvrtc.so.8.0.61.ea3bff3d91151ddf671a0a1491635b57</file_name>
        <open_name>libnvrtc.so.8.0</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>libOpenMMCPU.so.19849b4ff1cf4d33f75d9433b4d5c6bb</file_name>
        <open_name>libOpenMMCPU.so</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>libcufft.so.8.0.61.889be25939bec6f9a2abec790772d28f</file_name>
        <open_name>libcufft.so.8.0</open_name>
        <copy_file/>
    </file_ref>
    <file_ref>
        <file_name>libstdc++.so.6.0.25.e344f48acfbd4f5abbf99b2c75cc5e50</file_name>
        <open_name>libstdc++.so.6</open_name>
        <copy_file/>
    </file_ref>
    <coproc>
        <type>NVIDIA</type>
        <count>1.000000</count>
    </coproc>
    <gpu_ram>512.000000</gpu_ram>
    <dont_throttle/>
</app_version>
ID: 52005 · Rating: 0
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 2
Message 52006 - Posted: 5 Jun 2019, 17:12:58 UTC - in response to Message 52005.  

Thanks. The context was

libcudart.so.8.0 => not found
libcufft.so.8.0 => not found

So right off the bat, the app had no chance of succeeding when it can't find its own downloaded libcudart.so.8.0 and libcufft.so.8.0 files in the project directory.

Both files will be copied with the correct names into the slot directory, although they will be downloaded under a different (versioned) name. So a static test outside the running BOINC environment will fail to find them, but a dynamic test during running should be OK. I don't think this one will take us much further.
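
To make the renaming concrete, here is a rough sketch of the idea. It is not BOINC's actual code: `file_map` and `stage` are hypothetical helpers that read the `<file_ref>` entries of an `<app_version>` block and copy each downloaded file into a slot directory under its `<open_name>`.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def file_map(app_version_xml):
    """Map each downloaded file_name to the open_name the app expects."""
    root = ET.fromstring(app_version_xml)
    mapping = {}
    for ref in root.findall("file_ref"):
        file_name = ref.findtext("file_name")
        # Entries without <open_name> (e.g. the wrapper) keep their own name.
        mapping[file_name] = ref.findtext("open_name") or file_name
    return mapping


def stage(project_dir, slot_dir, mapping):
    """Copy files from project_dir into slot_dir under their open_name."""
    Path(slot_dir).mkdir(parents=True, exist_ok=True)
    for file_name, open_name in mapping.items():
        shutil.copy2(Path(project_dir) / file_name, Path(slot_dir) / open_name)


# One entry taken from the <app_version> block quoted earlier in the thread.
sample = """
<app_version>
    <file_ref>
        <file_name>libcudart.so.8.0.61.af43be839e6366e731accc514633bd1f</file_name>
        <open_name>libcudart.so.8.0</open_name>
        <copy_file/>
    </file_ref>
</app_version>
"""
print(file_map(sample))
```

Running `ldd` against the staged copy in the slot directory, rather than against the versioned files in the project directory, is the closer match to what the app sees at run time.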
ID: 52006 · Rating: 0
Keith Myers

Joined: 13 Dec 17
Posts: 1423
Credit: 9,189,196,190
RAC: 1,326,743
Message 52007 - Posted: 5 Jun 2019, 17:44:05 UTC - in response to Message 52004.  

If anybody is so inclined, can they try to run the boinc client manually with the --exit_after_finish flag, so the slot directory is preserved on failure?


Thanks

I just tried the manual run of the client with the suggested --exit_after_finish parameter but it did not preserve the slot contents.

05-Jun-2019 10:38:59 [GPUGRID] Starting task a3-TONI_TEST9-2-3-RND2847_2
05-Jun-2019 10:39:03 [GPUGRID] [sched_op] Deferring communication for 00:06:31
05-Jun-2019 10:39:03 [GPUGRID] [sched_op] Reason: Unrecoverable error for task a3-TONI_TEST9-2-3-RND2847_2
mv: cannot stat 'slots/8/output.coor': No such file or directory
mv: cannot stat 'slots/8/output.vel': No such file or directory
mv: cannot stat 'slots/8/output.idx': No such file or directory
mv: cannot stat 'slots/8/output.dcd': No such file or directory
mv: cannot stat 'slots/8/COLVAR': No such file or directory
mv: cannot stat 'slots/8/log.file': No such file or directory
mv: cannot stat 'slots/8/HILLS': No such file or directory
mv: cannot stat 'slots/8/output.vel.dcd': No such file or directory
mv: cannot stat 'slots/8/output.xtc': No such file or directory
mv: cannot stat 'slots/8/output.xsc': No such file or directory
mv: cannot stat 'slots/8/output.xstfile': No such file or directory
05-Jun-2019 10:39:03 [GPUGRID] Computation for task a3-TONI_TEST9-2-3-RND2847_2 finished
05-Jun-2019 10:39:03 [GPUGRID] Output file a3-TONI_TEST9-2-3-RND2847_2_9 for task a3-TONI_TEST9-2-3-RND2847_2 absent
05-Jun-2019 10:39:05 [GPUGRID] Started upload of a3-TONI_TEST9-2-3-RND2847_2_0
05-Jun-2019 10:39:07 [GPUGRID] Finished upload of a3-TONI_TEST9-2-3-RND2847_2_0
^C05-Jun-2019 10:39:11 [---] Received signal 2
05-Jun-2019 10:39:11 [---] Exiting
keith@Darksider:~/Desktop/BOINC$
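
Since the flag did not preserve the slot here, another option would be to copy the slot directories while the task is still running. This is an untested sketch, not something verified against a live BOINC client: `snapshot_slots` and `watch` are hypothetical helpers, and the polling interval is an arbitrary choice.

```python
import shutil
import time
from pathlib import Path


def snapshot_slots(slots_root, backup_root):
    """Copy every non-empty slot directory into backup_root; return names copied."""
    copied = []
    for slot in sorted(Path(slots_root).iterdir()):
        try:
            if slot.is_dir() and any(slot.iterdir()):
                # dirs_exist_ok needs Python 3.8+; later passes overwrite earlier copies.
                shutil.copytree(slot, Path(backup_root) / slot.name, dirs_exist_ok=True)
                copied.append(slot.name)
        except OSError:
            # Files can vanish mid-copy when the task exits; keep what was copied.
            continue
    return copied


def watch(slots_root, backup_root, seconds=60.0, interval=0.2):
    """Poll for `seconds`, taking a snapshot on every pass."""
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        snapshot_slots(slots_root, backup_root)
        time.sleep(interval)
```

Started from the BOINC data directory just before the task launches, something like `watch("slots", "slot_backup")` might catch progress.log even if the task dies within a second. It needs read permission on the slot directories.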
ID: 52007 · Rating: 0
Keith Myers

Joined: 13 Dec 17
Posts: 1423
Credit: 9,189,196,190
RAC: 1,326,743
Message 52008 - Posted: 5 Jun 2019, 18:17:09 UTC

I thought that all the tasks I had downloaded had failed but I see I have one host that has been successfully processing the acemd3 tasks.

But I just aborted the cache thinking all the hosts were unsuccessful. Oops.

Now to try and compare what is different about that machine compared to the rest.

I believe the difference is that at one time I had installed the cuda toolkit on that host and then removed it long in the past.
ID: 52008 · Rating: 0
Keith Myers

Joined: 13 Dec 17
Posts: 1423
Credit: 9,189,196,190
RAC: 1,326,743
Message 52009 - Posted: 5 Jun 2019, 18:33:55 UTC

Anybody successfully run the new acemd3 app on a Turing card yet? I just realized that I still had a gpu_exclude for my Turing card on the host that had been successfully processing tasks. I somehow had skipped over removing the exclusion from that machine while I had done so on all the other hosts with Turing cards.

Could this be the reason that app fails?
ID: 52009 · Rating: 0
Jim1348

Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 52010 - Posted: 5 Jun 2019, 18:49:55 UTC

I see that there is a new version 2.02, which I just tried on my GTX 1070 (Ubuntu 16.04.6). I just use the Ubuntu repository driver, which is 396.54 (proprietary), without any CUDA toolkit that I know of.

It failed immediately.
GPUGRID 2.02 New version of ACEMD (cuda80) a67-TONI_TEST8-2-3-RND3156_0 00:00:03 (-) 0.00 100.000 - 6/10/2019 2:42:16 PM 0.985C + 1NV Computation error 0.00 MB i7-4790-G

http://www.gpugrid.net/results.php?hostid=482386

Explain to me (simply) what I should check, and I will do it.
ID: 52010 · Rating: 0
biodoc

Joined: 26 Aug 08
Posts: 183
Credit: 10,085,929,375
RAC: 0
Message 52011 - Posted: 5 Jun 2019, 19:16:03 UTC - in response to Message 52009.  

Anybody successfully run the new acemd3 app on a Turing card yet? I just realized that I still had a gpu_exclude for my Turing card on the host that had been successfully processing tasks. I somehow had skipped over removing the exclusion from that machine while I had done so on all the other hosts with Turing cards.

Could this be the reason that app fails?


I think the plan is to get a stable acemd3 app running on legacy hardware and then release a beta for Turing cards.

@jim1348, I get the same error on one of my machines with dual GTX 1080 cards.
ID: 52011 · Rating: 0
mdxi

Joined: 11 Feb 18
Posts: 1
Credit: 104,599,162
RAC: 0
Message 52012 - Posted: 5 Jun 2019, 19:27:21 UTC

I am also seeing failures due to the acemd binary not finding some libs:

[root@node02 www.gpugrid.net]# ldd acemd.919-80.bin 
        linux-vdso.so.1 (0x00007fff6a317000)
        libcuda.so.1 => /usr/lib/libcuda.so.1 (0x00007f740db2b000)
        libcudart.so.8.0 => not found
        libcufft.so.8.0 => not found
        libdl.so.2 => /usr/lib/libdl.so.2 (0x00007f740db26000)
        libpthread.so.0 => /usr/lib/libpthread.so.0 (0x00007f740db05000)
        libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f740d975000)
        libm.so.6 => /usr/lib/libm.so.6 (0x00007f740d82d000)
        libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007f740d813000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007f740d64e000)
        librt.so.1 => /usr/lib/librt.so.1 (0x00007f740d644000)
        libnvidia-fatbinaryloader.so.430.14 => /usr/lib/libnvidia-fatbinaryloader.so.430.14 (0x00007f740d3f6000)
        /lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007f740eca8000)


This is despite the libs being right there in the directory with the binary:

[root@node02 www.gpugrid.net]# ls -l libcu*
-rwxr-xr-x 1 boinc boinc    394472 May 17 18:34 libcudart.so.8.0
-rwxr-xr-x 1 boinc boinc    426680 Jun  4 18:26 libcudart.so.8.0.61.af43be839e6366e731accc514633bd1f
-rwxr-xr-x 1 boinc boinc 146745600 May 17 18:35 libcufft.so.8.0
-rwxr-xr-x 1 boinc boinc 146772424 Jun  4 18:28 libcufft.so.8.0.61.889be25939bec6f9a2abec790772d28f
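
One thing worth keeping in mind when reading this output: on Linux the dynamic loader does not search the executable's own directory by default, only the paths from `LD_LIBRARY_PATH`, any RPATH/RUNPATH in the binary, and the system library cache. So `not found` from a plain `ldd` does not by itself prove that the files are unusable. Whether that explains the task failures is a separate question. A small sketch for pulling the missing names out of saved `ldd` output, with `missing_libs` as a hypothetical helper name:

```python
def missing_libs(ldd_output):
    """Return the library names that `ldd` reported as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        name, arrow, target = line.strip().partition(" => ")
        if arrow and target.strip() == "not found":
            missing.append(name)
    return missing


# Three lines copied from the ldd listing above.
sample = """\
        libcuda.so.1 => /usr/lib/libcuda.so.1 (0x00007f740db2b000)
        libcudart.so.8.0 => not found
        libcufft.so.8.0 => not found
"""
print(missing_libs(sample))
```

Re-running `ldd` with `LD_LIBRARY_PATH` pointing at the directory that holds the libraries, and feeding the result through the same function, would show whether the list becomes empty.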


This machine is running Arch linux. Boinc was compiled locally, from the github source. The NVIDIA drivers are from Arch, with no modifications.

[root@node02 www.gpugrid.net]# pacman -Ss nvidia | grep installed
extra/nvidia 430.14-6 [installed]
extra/nvidia-utils 430.14-1 [installed]
extra/opencl-nvidia 430.14-1 [installed]


This machine is currently successfully crunching GPGPU WUs for Primegrid and Einstein@Home, so its configuration is known good.
ID: 52012 · Rating: 0
Keith Myers

Joined: 13 Dec 17
Posts: 1423
Credit: 9,189,196,190
RAC: 1,326,743
Message 52013 - Posted: 5 Jun 2019, 19:34:07 UTC - in response to Message 52011.  

Anybody successfully run the new acemd3 app on a Turing card yet? I just realized that I still had a gpu_exclude for my Turing card on the host that had been successfully processing tasks. I somehow had skipped over removing the exclusion from that machine while I had done so on all the other hosts with Turing cards.

Could this be the reason that app fails?


I think the plan is to get a stable acemd3 app running on legacy hardware and then release a beta for Turing cards.

@jim1348, I get the same error on one of my machines with dual GTX 1080 cards.

OK, that is a very different understanding from the one I had of the wrapper app. I thought it was to allow use of the Turing cards.

I guess I should put the gpu_exclude back in play for the hosts that failed the tasks.
ID: 52013 · Rating: 0
biodoc

Joined: 26 Aug 08
Posts: 183
Credit: 10,085,929,375
RAC: 0
Message 52014 - Posted: 5 Jun 2019, 19:55:12 UTC - in response to Message 52013.  
Last modified: 5 Jun 2019, 19:55:59 UTC

Anybody successfully run the new acemd3 app on a Turing card yet? I just realized that I still had a gpu_exclude for my Turing card on the host that had been successfully processing tasks. I somehow had skipped over removing the exclusion from that machine while I had done so on all the other hosts with Turing cards.

Could this be the reason that app fails?


I think the plan is to get a stable acemd3 app running on legacy hardware and then release a beta for Turing cards.

@jim1348, I get the same error on one of my machines with dual GTX 1080 cards.

OK, that is a very different understanding from the one I had of the wrapper app. I thought it was to allow use of the Turing cards.

I guess I should put the gpu_exclude back in play for the hosts that failed the tasks.


See this post: http://www.gpugrid.net/forum_thread.php?id=4927&nowrap=true#51934
ID: 52014 · Rating: 0
Keith Myers

Joined: 13 Dec 17
Posts: 1423
Credit: 9,189,196,190
RAC: 1,326,743
Message 52015 - Posted: 5 Jun 2019, 20:04:13 UTC - in response to Message 52014.  
Last modified: 5 Jun 2019, 20:50:16 UTC

Thanks for the edification.

[Edit] This is the error from trying to run on a Turing card.

<core_client_version>7.15.0</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
12:04:16 (22587): wrapper (7.7.26016): starting
12:04:16 (22587): wrapper (7.7.26016): starting
12:04:16 (22587): wrapper: running acemd3 (--boinc input --device 0)
# Engine failed: Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)

12:04:17 (22587): acemd3 exited; CPU time 0.164594
12:04:17 (22587): app exit status: 0x1
12:04:17 (22587): called boinc_finish(195)

</stderr_txt>
]]>
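
The nvrtc message is consistent with the toolkit version the app ships: the app_version lists libnvrtc.so.8.0.61, and CUDA 8.0 only knows architectures up to Pascal, while Turing cards report compute capability 7.5, which needs CUDA 10.0 or later. A small sketch of that check follows; `toolkit_supports` is a hypothetical helper, the table covers only three releases, and the idea that acemd3 passes the card's own architecture to nvrtc is an assumption.

```python
# Highest compute capability each CUDA toolkit release can target.
MAX_COMPUTE_CAPABILITY = {
    "8.0": (6, 2),    # up to Pascal
    "9.0": (7, 0),    # adds Volta
    "10.0": (7, 5),   # adds Turing
}


def toolkit_supports(cuda_version, compute_capability):
    """True if the given toolkit release can compile for this compute capability."""
    return compute_capability <= MAX_COMPUTE_CAPABILITY[cuda_version]


print(toolkit_supports("8.0", (6, 1)))   # GTX 1080 with the bundled CUDA 8.0 runtime
print(toolkit_supports("8.0", (7, 5)))   # Turing card with the same runtime
```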
ID: 52015 · Rating: 0
mmonnin

Joined: 2 Jul 16
Posts: 339
Credit: 7,990,341,558
RAC: 3,287
Message 52016 - Posted: 5 Jun 2019, 22:20:05 UTC - in response to Message 51990.  

I completed one while 5 others had errors.
https://www.gpugrid.net/workunit.php?wuid=16520276

nvcc -V results
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85


Same result on another PC but all tasks error on a 1080Ti
https://www.gpugrid.net/show_host_detail.php?hostid=477247
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85
ID: 52016 · Rating: 0
Keith Myers

Joined: 13 Dec 17
Posts: 1423
Credit: 9,189,196,190
RAC: 1,326,743
Message 52017 - Posted: 5 Jun 2019, 23:11:10 UTC

I think I should take one of the hosts that fail the app and install the cuda toolkit and see if it changes anything.

I know that Toni said the toolkit is unnecessary supposedly, but it might show something.
ID: 52017 · Rating: 0
mmonnin

Joined: 2 Jul 16
Posts: 339
Credit: 7,990,341,558
RAC: 3,287
Message 52018 - Posted: 6 Jun 2019, 1:38:53 UTC

It won't hurt. One PC of mine with 1070/1070Ti works and another with 1080Ti doesn't. Both have the same nvcc -V results.
ID: 52018 · Rating: 0
Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 52019 - Posted: 6 Jun 2019, 8:59:43 UTC - in response to Message 52018.  
Last modified: 6 Jun 2019, 9:00:22 UTC

Misc answers:

- No Turing support YET. If the app works, there will be many more possibilities.
- I don't think installing the CUDA toolkit will change anything, but who knows... Please don't break your systems (e.g. by tweaking PATH) to install it.
- I'm fairly positive the library copying/renaming is OK.
- I'll be updating the app soon. This seems to be some system-specific, non-reproducible behavior.
- In any case, updated drivers won't hurt.
ID: 52019 · Rating: 0

©2026 Universitat Pompeu Fabra