Message boards :
Number crunching :
New app update (acemd3)
Send message · Joined: 9 Dec 08 · Posts: 1006 · Credit: 5,068,599 · RAC: 0
I did find a "messy" install of the nvidia driver on the offending machine. There seem to be remnants of a driver installed via a direct download from nvidia; I'll clean that up. From what I know, "apt search" does not look at the packages installed on your system but at those available online. So the difference may be in the repository configurations.
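A minimal sketch of that distinction, assuming a Debian/Ubuntu system (the package names shown are just whatever your machine happens to have installed):

```shell
# `apt search` queries the package index fetched from the configured
# repositories (/etc/apt/sources.list and sources.list.d), so its output
# can differ between machines with different repo configs.
# To see what is actually installed, ask dpkg instead:
dpkg -l | grep -i nvidia

# Or restrict apt itself to installed packages:
apt list --installed 2>/dev/null | grep -i nvidia
```

Either of the last two commands should make a leftover run-file driver stand out as packages that no repository knows about.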
Send message · Joined: 26 Aug 08 · Posts: 183 · Credit: 10,085,929,375 · RAC: 0
Yeah, dpkg -l | grep -i nvidia is the right command. I went ahead and purged everything nvidia and reinstalled the nvidia driver. I didn't install the cuda toolkit, though. UPDATE: tasks are still failing on this machine.
Send message · Joined: 4 Jun 19 · Posts: 3 · Credit: 11,999,700 · RAC: 0
Toni, host info:

CUDA: NVIDIA GPU 0: GeForce GTX 1080 (driver version 418.56, CUDA version 10.1, compute capability 6.1, 4096MB, 3968MB available, 9718 GFLOPS peak)
OpenCL: NVIDIA GPU 0: GeForce GTX 1080 (driver version 418.56, device version OpenCL 1.2 CUDA, 8112MB, 3968MB available, 9718 GFLOPS peak)

progress.log from a valid task — slot folder 40 zip: https://filebin.net/jfv8ec4c6q8uszuw/Slot_40.zip?t=tvn13kdj

On the failed host the slot folders are empty: either BOINC wipes them at the crash, or the application never adds files to the slot folder. I could not grab progress.log — a task that fails after 1 second is impossible to catch, and since it is not stored for upload it gets wiped out. I get the error below on an older OS (16.04) with a GTX 970, driver 418.56. The same driver hands out valid tasks on a later system (18.10), so it looks to be the system, not the driver version. This compile issue still exists in the latest application but only affects old systems.

<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)
</message>
<stderr_txt>
14:22:07 (102554): wrapper (7.7.26016): starting
14:22:07 (102554): wrapper (7.7.26016): starting
14:22:07 (102554): wrapper: running acemd3 (--boinc input --device 0)
# Engine failed: Error launching CUDA compiler: 32512
sh: 1: : Permission denied
14:22:08 (102554): acemd3 exited; CPU time 0.132000
14:22:08 (102554): app exit status: 0x1
14:22:08 (102554): called boinc_finish(195)
</stderr_txt>
]]>
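A side note on that stderr: the 32512 in "Error launching CUDA compiler" looks like a raw wait()-style status, in which the child's exit code sits in the high byte. Decoding it:

```shell
# 32512 = 127 * 256: the high byte of a wait() status is the exit code,
# and 127 is the shell's "command not found" status. Together with the
# empty command in the "sh: 1: : Permission denied" line, this suggests
# the compiler command the app tried to launch was blank on the failing
# systems (an assumption worth checking, not a confirmed diagnosis).
echo $((32512 / 256))   # prints 127
```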
Send message · Joined: 9 Dec 08 · Posts: 1006 · Credit: 5,068,599 · RAC: 0
Aehm, to clarify: I see the progress.log file of successful tasks only.
Send message · Joined: 9 Dec 08 · Posts: 1006 · Credit: 5,068,599 · RAC: 0
If anybody is so inclined, can they try to run the boinc client manually with the --exit_after_finish flag, so the slot directory is preserved on failure? Thanks.
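For anyone trying this, a minimal sketch of a manual run (the data-directory path is the Ubuntu default and is an assumption — adjust for your install):

```shell
# Stop the regular client service first -- only one client may use the
# data directory at a time:
#   sudo systemctl stop boinc-client
#
# Then run the client in the foreground from the BOINC data directory.
# --exit_after_finish makes the client exit as soon as a task finishes,
# which should leave the slot directory contents in place for inspection.
if [ -d /var/lib/boinc-client ]; then
    cd /var/lib/boinc-client
    boinc --exit_after_finish
else
    echo "BOINC data directory not found; adjust the path for your install"
fi
```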
Send message · Joined: 26 Aug 08 · Posts: 183 · Credit: 10,085,929,375 · RAC: 0
There is information in <app_version> but nothing for <workunit> or <result>:

<app_version>
<app_name>acemd3</app_name>
<version_num>202</version_num>
<platform>x86_64-pc-linux-gnu</platform>
<avg_ncpus>0.987442</avg_ncpus>
<flops>28742507251613.187500</flops>
<plan_class>cuda80</plan_class>
<api_version>7.7.0</api_version>
<file_ref>
<file_name>wrapper_26198_x86_64-pc-linux-gnu</file_name>
<main_program/>
</file_ref>
<file_ref>
<file_name>acemd3.e72153abf98cb1fcd0f05fc443818dfc</file_name>
<open_name>acemd3</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>job.xml.1245cc127550a015dcc9b3e1c2c84e13</file_name>
<open_name>job.xml</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>libOpenMMOpenCL.so.6a31fa1ff5ae3a26ea64f2abfb5a66cc</file_name>
<open_name>libOpenMMOpenCL.so</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>libOpenCL.so.1.0.0.43d4300566ce59d77e0fa316f8ee5b02</file_name>
<open_name>libOpenCL.so.1</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>libgomp.so.1.0.0.efdf718669edc7fff00e0c5f7f0b8791</file_name>
<open_name>libgomp.so.1.0.0</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>libOpenMM.so.5406dfd716045d08ad6369e2399a98e2</file_name>
<open_name>libOpenMM.so</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>libOpenMMCUDA.so.8867021fdc0daf2e39f1b7228ece45af</file_name>
<open_name>libOpenMMCUDA.so</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>libcudart.so.8.0.61.af43be839e6366e731accc514633bd1f</file_name>
<open_name>libcudart.so.8.0</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>libfftw3f_threads.so.3.4.4.dd0c6fcfa550371acf730db2d9d5a270</file_name>
<open_name>libfftw3f_threads.so.3</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>libgcc_s.so.1.d7f787a9bf6c3633eaebb9015c6d9044</file_name>
<open_name>libgcc_s.so.1</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>libnvrtc-builtins.so.8.0.61.684f2f1d9f0934bcce91e77b69e17ec7</file_name>
<open_name>libnvrtc-builtins.so</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>libOpenMMCudaCompiler.so.aaed781fe4caa9d1099312d458a9b902</file_name>
<open_name>libOpenMMCudaCompiler.so</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>libfftw3f.so.3.4.4.a4580ddf9efebaad56fab49847a8c899</file_name>
<open_name>libfftw3f.so.3</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>libOpenMMPME.so.3208e45e71567824e8390ab1c79c6a66</file_name>
<open_name>libOpenMMPME.so</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>libnvrtc.so.8.0.61.ea3bff3d91151ddf671a0a1491635b57</file_name>
<open_name>libnvrtc.so.8.0</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>libOpenMMCPU.so.19849b4ff1cf4d33f75d9433b4d5c6bb</file_name>
<open_name>libOpenMMCPU.so</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>libcufft.so.8.0.61.889be25939bec6f9a2abec790772d28f</file_name>
<open_name>libcufft.so.8.0</open_name>
<copy_file/>
</file_ref>
<file_ref>
<file_name>libstdc++.so.6.0.25.e344f48acfbd4f5abbf99b2c75cc5e50</file_name>
<open_name>libstdc++.so.6</open_name>
<copy_file/>
</file_ref>
<coproc>
<type>NVIDIA</type>
<count>1.000000</count>
</coproc>
<gpu_ram>512.000000</gpu_ram>
<dont_throttle/>
</app_version>
Send message · Joined: 11 Jul 09 · Posts: 1639 · Credit: 10,159,968,649 · RAC: 2
Thanks. The context was

libcudart.so.8.0 => not found

Both files will be copied with the correct names into the slot directory, although they are downloaded under a different (versioned) name. So a static test outside the running BOINC environment will fail to find them, but a dynamic test during a run should be OK. I don't think this one will take us much further.
Send message · Joined: 13 Dec 17 · Posts: 1423 · Credit: 9,189,196,190 · RAC: 1,326,743
> If anybody is so inclined, can they try to run the boinc client manually with the --exit_after_finish flag, so the slot directory is preserved on failure?

I just tried the manual run of the client with the suggested --exit_after_finish parameter, but it did not preserve the slot contents.

05-Jun-2019 10:38:59 [GPUGRID] Starting task a3-TONI_TEST9-2-3-RND2847_2
05-Jun-2019 10:39:03 [GPUGRID] [sched_op] Deferring communication for 00:06:31
05-Jun-2019 10:39:03 [GPUGRID] [sched_op] Reason: Unrecoverable error for task a3-TONI_TEST9-2-3-RND2847_2
mv: cannot stat 'slots/8/output.coor': No such file or directory
mv: cannot stat 'slots/8/output.vel': No such file or directory
mv: cannot stat 'slots/8/output.idx': No such file or directory
mv: cannot stat 'slots/8/output.dcd': No such file or directory
mv: cannot stat 'slots/8/COLVAR': No such file or directory
mv: cannot stat 'slots/8/log.file': No such file or directory
mv: cannot stat 'slots/8/HILLS': No such file or directory
mv: cannot stat 'slots/8/output.vel.dcd': No such file or directory
mv: cannot stat 'slots/8/output.xtc': No such file or directory
mv: cannot stat 'slots/8/output.xsc': No such file or directory
mv: cannot stat 'slots/8/output.xstfile': No such file or directory
05-Jun-2019 10:39:03 [GPUGRID] Computation for task a3-TONI_TEST9-2-3-RND2847_2 finished
05-Jun-2019 10:39:03 [GPUGRID] Output file a3-TONI_TEST9-2-3-RND2847_2_9 for task a3-TONI_TEST9-2-3-RND2847_2 absent
05-Jun-2019 10:39:05 [GPUGRID] Started upload of a3-TONI_TEST9-2-3-RND2847_2_0
05-Jun-2019 10:39:07 [GPUGRID] Finished upload of a3-TONI_TEST9-2-3-RND2847_2_0
^C05-Jun-2019 10:39:11 [---] Received signal 2
05-Jun-2019 10:39:11 [---] Exiting
keith@Darksider:~/Desktop/BOINC$
Send message · Joined: 13 Dec 17 · Posts: 1423 · Credit: 9,189,196,190 · RAC: 1,326,743
I thought that all the tasks I had downloaded had failed, but I see I have one host that has been successfully processing the acemd3 tasks. Unfortunately, I just aborted the cache thinking all the hosts were unsuccessful. Oops. Now to try and compare what is different about that machine compared to the rest. I believe the difference is that at one time I had installed the cuda toolkit on that host and then removed it, long in the past.
Send message · Joined: 13 Dec 17 · Posts: 1423 · Credit: 9,189,196,190 · RAC: 1,326,743
Anybody successfully run the new acemd3 app on a Turing card yet? I just realized that I still had a gpu_exclude for the Turing card on the host that had been successfully processing tasks. I somehow skipped removing the exclusion from that machine while I had done so on all the other hosts with Turing cards. Could this be the reason the app fails?
Send message · Joined: 28 Jul 12 · Posts: 819 · Credit: 1,591,285,971 · RAC: 0
I see that there is a new version 2.02, which I just tried on my GTX 1070 (Ubuntu 16.04.6). I just use the Ubuntu repository driver, which is 396.54 (proprietary), without any toolkit that I know of. It failed immediately.

GPUGRID 2.02 New version of ACEMD (cuda80) a67-TONI_TEST8-2-3-RND3156_0 00:00:03 (-) 0.00 100.000 - 6/10/2019 2:42:16 PM 0.985C + 1NV Computation error 0.00 MB i7-4790-G

http://www.gpugrid.net/results.php?hostid=482386

Explain to me (simply) what I should check, and I will do it.
Send message · Joined: 26 Aug 08 · Posts: 183 · Credit: 10,085,929,375 · RAC: 0
> Anybody successfully run the new acemd3 app on a Turing card yet? I just realized that I still had a gpu_exclude for my Turing card on the host that had been successfully processing tasks.

I think the plan is to get a stable acemd3 app running on legacy hardware and then release a beta for Turing cards.

@jim1348, I get the same error on one of my machines with dual GTX 1080 cards.
Send message · Joined: 11 Feb 18 · Posts: 1 · Credit: 104,599,162 · RAC: 0
I am also seeing failures due to the acemd binary not finding some libs:
[root@node02 www.gpugrid.net]# ldd acemd.919-80.bin
linux-vdso.so.1 (0x00007fff6a317000)
libcuda.so.1 => /usr/lib/libcuda.so.1 (0x00007f740db2b000)
libcudart.so.8.0 => not found
libcufft.so.8.0 => not found
libdl.so.2 => /usr/lib/libdl.so.2 (0x00007f740db26000)
libpthread.so.0 => /usr/lib/libpthread.so.0 (0x00007f740db05000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f740d975000)
libm.so.6 => /usr/lib/libm.so.6 (0x00007f740d82d000)
libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007f740d813000)
libc.so.6 => /usr/lib/libc.so.6 (0x00007f740d64e000)
librt.so.1 => /usr/lib/librt.so.1 (0x00007f740d644000)
libnvidia-fatbinaryloader.so.430.14 => /usr/lib/libnvidia-fatbinaryloader.so.430.14 (0x00007f740d3f6000)
/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007f740eca8000)
This is despite the libs being right there in the directory with the binary:

[root@node02 www.gpugrid.net]# ls -l libcu*
-rwxr-xr-x 1 boinc boinc    394472 May 17 18:34 libcudart.so.8.0
-rwxr-xr-x 1 boinc boinc    426680 Jun  4 18:26 libcudart.so.8.0.61.af43be839e6366e731accc514633bd1f
-rwxr-xr-x 1 boinc boinc 146745600 May 17 18:35 libcufft.so.8.0
-rwxr-xr-x 1 boinc boinc 146772424 Jun  4 18:28 libcufft.so.8.0.61.889be25939bec6f9a2abec790772d28f

This machine is running Arch Linux. BOINC was compiled locally from the GitHub source. The NVIDIA drivers are from Arch, with no modifications.

[root@node02 www.gpugrid.net]# pacman -Ss nvidia | grep installed
extra/nvidia 430.14-6 [installed]
extra/nvidia-utils 430.14-1 [installed]
extra/opencl-nvidia 430.14-1 [installed]

This machine is currently successfully crunching GPGPU WUs for PrimeGrid and Einstein@Home, so its configuration is known good.
Send message · Joined: 13 Dec 17 · Posts: 1423 · Credit: 9,189,196,190 · RAC: 1,326,743
> Anybody successfully run the new acemd3 app on a Turing card yet? I just realized that I still had a gpu_exclude for my Turing card on the host that had been successfully processing tasks.

OK, that is a very different understanding from the one I had of the wrapper app. I thought it was to allow use of the Turing cards. I guess I should put the gpu_exclude back in play for the hosts that failed the tasks.
Send message · Joined: 26 Aug 08 · Posts: 183 · Credit: 10,085,929,375 · RAC: 0
> Anybody successfully run the new acemd3 app on a Turing card yet?

See this post: http://www.gpugrid.net/forum_thread.php?id=4927&nowrap=true#51934
Send message · Joined: 13 Dec 17 · Posts: 1423 · Credit: 9,189,196,190 · RAC: 1,326,743
Thanks for the edification.

[Edit] This is the error when trying to run on a Turing card:

<core_client_version>7.15.0</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)
</message>
<stderr_txt>
12:04:16 (22587): wrapper (7.7.26016): starting
12:04:16 (22587): wrapper (7.7.26016): starting
12:04:16 (22587): wrapper: running acemd3 (--boinc input --device 0)
# Engine failed: Error compiling program: nvrtc: error: invalid value for --gpu-architecture (-arch)
12:04:17 (22587): acemd3 exited; CPU time 0.164594
12:04:17 (22587): app exit status: 0x1
12:04:17 (22587): called boinc_finish(195)
</stderr_txt>
]]>
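That nvrtc error is consistent with the cuda80 plan class: CUDA 8.0's runtime compiler predates Turing, so it rejects the Turing architecture flag outright. A rough sketch of the relationship (the version pairs are taken from NVIDIA's toolkit release history and should be treated as assumptions to double-check against the official docs):

```shell
# Minimum CUDA toolkit able to generate code for a given compute capability.
min_cuda_for_cc() {
  case "$1" in
    6.0|6.1|6.2) echo "8.0"  ;;  # Pascal (e.g. GTX 1070/1080) -- within reach of the cuda80 app
    7.0|7.2)     echo "9.0"  ;;  # Volta
    7.5)         echo "10.0" ;;  # Turing (e.g. RTX 2080) -- beyond what CUDA 8.0 knows
    *)           echo "unknown" ;;
  esac
}

min_cuda_for_cc 7.5   # prints 10.0
```

So a Turing host would need a CUDA 10.x build of the app, matching Toni's "no Turing support yet".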
Send message · Joined: 2 Jul 16 · Posts: 339 · Credit: 7,990,341,558 · RAC: 3,287
I completed one while 5 others had errors. Same result on another PC, but all tasks error on a 1080 Ti: https://www.gpugrid.net/show_host_detail.php?hostid=477247

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85
Send message · Joined: 13 Dec 17 · Posts: 1423 · Credit: 9,189,196,190 · RAC: 1,326,743
I think I should take one of the hosts that fails the app, install the cuda toolkit, and see if it changes anything. I know Toni said the toolkit is supposedly unnecessary, but it might show something.
Send message · Joined: 2 Jul 16 · Posts: 339 · Credit: 7,990,341,558 · RAC: 3,287
It won't hurt. One PC of mine with a 1070/1070 Ti works and another with a 1080 Ti doesn't. Both have the same nvcc -V results.
Send message · Joined: 9 Dec 08 · Posts: 1006 · Credit: 5,068,599 · RAC: 0
Misc answers:

- No Turing support YET. If the app works, there will be many more possibilities.
- I don't think installing the cuda toolkit will change anything, but who knows... please don't break your systems (e.g. by tweaking PATH) to install it.
- I'm fairly positive the library copying/renaming is OK.
- I'll be updating the app soon. It seems to be some system-specific, non-reproducible behavior.
- In any case, updated drivers won't hurt.
©2026 Universitat Pompeu Fabra