Message boards : Graphics cards (GPUs) : WUs fall after resume [NOELIA]
| Author | Message |
|---|---|
[VENETO] sabayonino (Joined: 4 Apr 10, Posts: 50, Credit: 650,142,596, RAC: 0)
|
Hi, I turn off my PCs and suspend all WUs for GPUGrid (using a cron job). When I resume the WUs, they fail after a while: <core_client_version>7.0.29</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> MDIO: cannot open file "restart.coor" SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1574. acemd.2562.x64.cuda42: swanlibnv2.cpp:59: void swan_assert(int): Assertion `a' failed. SIGABRT: abort called Stack trace (15 frames): ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(boinc_catch_signal+0x4d)[0x551f6d] /lib64/libc.so.6(+0x38030)[0x7f4d28f25030] /lib64/libc.so.6(gsignal+0x35)[0x7f4d28f24fb5] /lib64/libc.so.6(abort+0x148)[0x7f4d28f26438] /lib64/libc.so.6(+0x30f92)[0x7f4d28f1df92] /lib64/libc.so.6(+0x31042)[0x7f4d28f1e042] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x482916] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x4848da] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x44d4bd] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x44e54c] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x41ec14] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sin+0xb6c)[0x407d6c] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sin+0x256)[0x407456] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f4d28f116c5] ../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sinh+0x49)[0x4072f9] Exiting... </stderr_txt> ]]> My PCs: Error Hosts and Others. This does not happen on all PCs, though. All operating systems (Gentoo Linux) are cloned. Thanks |
|
Joined: 17 Aug 08, Posts: 2705, Credit: 1,311,122,549, RAC: 0 |
I'd try to send BOINC the "quit client" command instead of suspending WUs with the cron job. MrS Scanning for our furry friends since Jan 2002 |
[VENETO] sabayonino
|
MrS wrote: "I'd try to send BOINC the 'quit client' command instead of suspending WUs with the cron job." I don't want to quit the client, only suspend this project so that another one can be processed, and then shut down the PC. I suspend the project, not the WU(s). |
|
|
Yes.. that's why I'm suggesting trying something else :) MrS Scanning for our furry friends since Jan 2002 |
[VENETO] sabayonino
|
D'oh! I turned on my PC and resumed a task; after 5 minutes the WU crashed. Same problem: http://www.gpugrid.net/result.php?resultid=6313064 <core_client_version>7.0.29</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> # Using device 0 # There are 2 devices supporting CUDA # Device 0: "GeForce GTX 560" # Clock rate: 1.62 GHz # Total amount of global memory: 1073283072 bytes # Number of multiprocessors: 7 # Number of cores: 56 # Device 1: "GeForce GTX 560" # Clock rate: 1.62 GHz # Total amount of global memory: 1073414144 bytes # Number of multiprocessors: 7 # Number of cores: 56 MDIO: cannot open file "restart.coor" # Using device 0 # There are 2 devices supporting CUDA # Device 0: "GeForce GTX 560" # Clock rate: 1.62 GHz # Total amount of global memory: 1073283072 bytes # Number of multiprocessors: 7 # Number of cores: 56 # Device 1: "GeForce GTX 560" # Clock rate: 1.62 GHz # Total amount of global memory: 1073414144 bytes # Number of multiprocessors: 7 # Number of cores: 56 SWAN: FATAL : swanMemcpyDtoH failed acemd.linux64.2352: swanlib_nv.c:390: error: Assertion `0' failed. SIGABRT: abort called Stack trace (13 frames): ../../projects/www.gpugrid.net/acemd.linux64.2352(boinc_catch_signal+0x4d)[0x482bed] /lib64/libc.so.6(+0x38030)[0x7f9e9f122030] /lib64/libc.so.6(gsignal+0x35)[0x7f9e9f121fb5] /lib64/libc.so.6(abort+0x148)[0x7f9e9f123438] /lib64/libc.so.6(+0x30f92)[0x7f9e9f11af92] /lib64/libc.so.6(+0x31042)[0x7f9e9f11b042] ../../projects/www.gpugrid.net/acemd.linux64.2352[0x491b33] ../../projects/www.gpugrid.net/acemd.linux64.2352[0x474510] ../../projects/www.gpugrid.net/acemd.linux64.2352[0x413c60] ../../projects/www.gpugrid.net/acemd.linux64.2352[0x407cba] ../../projects/www.gpugrid.net/acemd.linux64.2352[0x40857e] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f9e9f10e6c5] ../../projects/www.gpugrid.net/acemd.linux64.2352[0x407a19] Exiting...
</stderr_txt> ]]> "SWAN: FATAL : swanMemcpyDtoH failed": what is it? Over 6 hours lost. |
skgiven (Joined: 23 Apr 09, Posts: 3968, Credit: 1,995,359,260, RAC: 0) |
Probably something to do with using the cuda3.1 app: sent 9 Jan 2013, 21:04:19 UTC; reported 14 Jan 2013, 12:08:27 UTC; Error while computing; run time 62,671.43 s; CPU time 662.52 s; credit ---; ACEMD2: GPU molecular dynamics v6.16 (cuda31). Even at that it seems very long for a 'normal' length task; most of your NOELIA_hfXA tasks ran in 8.5K to 9K seconds (albeit on the 4.2 app). On the 3.1 app I would have expected it to take around twice that, but it ran for ~9 times as long. Perhaps there was a problem with the task, or it was a long WU that ended up in the wrong queue somehow? FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help |
[VENETO] sabayonino
|
I tried this. The following WUs failed when the system restarted: http://www.gpugrid.net/result.php?resultid=6329901 (NATHAN) http://www.gpugrid.net/result.php?resultid=6327036 (NOELIA) http://www.gpugrid.net/result.php?resultid=6326898 (NOELIA) Where is the problem? I'm running Gentoo Linux: emerge --info Portage 2.1.11.31 (default/linux/amd64/10.0, gcc-4.7.2, glibc-2.15-r3, 3.6.11-gentoo x86_64) ================================================================= System uname: Linux-3.6.11-gentoo-x86_64-Pentium-R-_Dual-Core_CPU_E6500_@_2.93GHz-with-gentoo-2.1 Timestamp of tree: Wed, 09 Jan 2013 18:45:01 +0000 ld GNU ld (GNU Binutils) 2.22 distcc 3.1 x86_64-pc-linux-gnu [enabled] app-shells/bash: 4.2_p37 dev-lang/python: 2.7.3-r2, 3.2.3 dev-util/pkgconfig: 0.27.1 sys-apps/baselayout: 2.1-r1 sys-apps/openrc: 0.11.8 sys-apps/sandbox: 2.5 sys-devel/autoconf: 2.68 sys-devel/automake: 1.11.6, 1.12.4 sys-devel/binutils: 2.22-r1 sys-devel/gcc: 4.5.4, 4.6.3, 4.7.2 sys-devel/gcc-config: 1.7.3 sys-devel/libtool: 2.4-r1 sys-devel/make: 3.82-r4 sys-kernel/linux-headers: 3.6 (virtual/os-headers) sys-libs/glibc: 2.15-r3 Repositories: gentoo science ACCEPT_KEYWORDS="amd64" ACCEPT_LICENSE="* -@EULA" CBUILD="x86_64-pc-linux-gnu" CFLAGS="-O2 -march=core2 -pipe" CHOST="x86_64-pc-linux-gnu" [...] NVIDIA GPU 0: GeForce GTX 660 (driver version unknown, CUDA version 5.0, compute capability 3.0, 133801984MB, 134215644MB available, 1982 GFLOPS peak) OpenCL: NVIDIA GPU 0: GeForce GTX 660 (driver version 310.19, device version OpenCL 1.1 CUDA, 2048MB, 134215644MB available) Nvidia-Drivers: 310.19 This happens on all my PCs, and only with GPUGrid. |
|
|
Did you already try ending BOINC instead of suspending the project, prior to a restart? MrS Scanning for our furry friends since Jan 2002 |
[VENETO] sabayonino
|
MrS wrote: "Did you already try ending BOINC instead of suspending the project, prior to a restart?" Yes, I did. If the WU keeps running, nothing happens; the WU is reported without errors. |
|
|
Not sure if I understand you correctly. So it solves your problem? MrS Scanning for our furry friends since Jan 2002 |
[VENETO] sabayonino
|
MrS wrote: "Not sure if I understand you correctly. So it solves your problem?" No. WU(s) fail if I resume them after rebooting the PC(s). As I wrote, I suspend the project via crontab plus the BOINC command line (suspend), then turn the PCs off; they cannot stay switched on 24 hours a day. When I resume the project (and its WUs), the WUs fail. I don't know why. All the failures point at SWAN: "SWAN: FATAL : swanMemcpyDtoH failed", "SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1574. acemd.2562.x64.cuda42: swanlibnv2.cpp:59: void swan_assert(int): Assertion `a' failed.", and "MDIO: cannot open file "restart.coor"" ... restart ... resume ... |
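The error signatures listed in the post above can be pulled out of a saved stderr log mechanically, which makes it easier to compare failures across hosts. This is only a minimal sketch: the function name `fatal_lines` and the file name `stderr.txt` are made up for illustration and are not part of BOINC or ACEMD.

```shell
#!/bin/sh
# Print only the lines that carry the failure signatures seen in this thread:
# "SWAN: FATAL" / "SWAN : FATAL" (with or without the space) and failed assertions.
fatal_lines() {
    grep -E 'SWAN ?: FATAL|Assertion .* failed'
}

# Hypothetical usage (stderr.txt is an assumed file holding a task's stderr text):
#   fatal_lines < stderr.txt
```

The "MDIO: cannot open file" line is deliberately not matched, so the output shows only the lines that accompany the actual abort.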
[VENETO] sabayonino
|
I may have solved it. The CUDA libraries weren't linked properly for the acemd.2562.x64.cuda42 app. I reinstalled cuda-toolkit-4.x.x; the libcufft.so.4 and libcudart.so.4 links were missing. Now: ldd acemd.2562.x64.cuda42 linux-vdso.so.1 (0x00007fff46dff000) libcufft.so.4 => /opt/cuda/lib64/libcufft.so.4 (0x00007fccc23a2000) libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x00007fccc17a2000) libcudart.so.4 => /opt/cuda/lib64/libcudart.so.4 (0x00007fccc1543000) libdl.so.2 => /lib64/libdl.so.2 (0x00007fccc133f000) libstdc++.so.6 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.2/libstdc++.so.6 (0x00007fccc1038000) libm.so.6 => /lib64/libm.so.6 (0x00007fccc0d41000) libgcc_s.so.1 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.2/libgcc_s.so.1 (0x00007fccc0b2b000) libc.so.6 => /lib64/libc.so.6 (0x00007fccc0783000) libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fccc0566000) libz.so.1 => /lib64/libz.so.1 (0x00007fccc0350000) librt.so.1 => /lib64/librt.so.1 (0x00007fccc0147000) /lib64/ld-linux-x86-64.so.2 (0x00007fccc43c6000) I tried to suspend/resume after a reboot, and the WU(s) keep running now. I will check more thoroughly tomorrow. :) [edit] No luck: after 10 minutes this WU (NOELIA) failed. Another WU (NATHAN) is still running. I think I give up :( |
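The library check described in the post above can be repeated for any of the project's binaries by filtering `ldd` output for unresolved entries. A minimal sketch, assuming the usual glibc `ldd` format in which a missing library is printed as `name => not found`; the helper name `missing_libs` is hypothetical.

```shell
#!/bin/sh
# Print the names of shared libraries that ldd reports as "not found".
# Reads ldd output on stdin, so it can be tried on saved output as well.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Hypothetical usage from the project directory:
#   ldd ./acemd.2562.x64.cuda42 | missing_libs
```

An empty result only means the dynamic linker can resolve every library; it says nothing about whether the resolved versions match what the app was built against.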
skgiven |
You can ignore this error: MDIO: cannot open file "restart.coor". Perhaps this is another lib problem? FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help |
[VENETO] sabayonino
|
skgiven wrote: "You can ignore this error, MDIO: cannot open file "restart.coor"" OK. Today I resumed the project, and NATHAN's WU is still crunching without errors. As soon as possible I'll try to reset the project on all PCs and check all dynamic libraries for all GPUGrid apps. |
|
|
So you suspend GPU-Grid via the cron job, set the PC to standby, restart, resume GPU-Grid and occasionally (or mostly) get these errors (as you said before). I'm not an expert in this, but to me your error messages sound like "something was in a state it should not have been in - be it driver, CUDA, GPU memory or whatever". I think we can agree on this, right? When you suspend GPU-Grid and have "leave applications in memory while suspended" active, BOINC and GPU-Grid think they can just continue exactly where they left off upon resuming. However, your PC went into standby in the meantime. The main memory contents should be preserved, but what about GPU memory, GPU caches and registers? I would not rule out the possibility that something goes wrong here. Some state is reset during standby, but the app is not expecting this, as it was only temporarily suspended. So I'm asking you again to test the following: instead of suspending GPU-Grid with your cron job, just shut down the entire BOINC core client. Now upon resuming from standby, BOINC and GPU-Grid know they're starting anew from the last checkpoint, which should definitely work. Much better than giving up, isn't it? MrS Scanning for our furry friends since Jan 2002 |
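The test suggested above (stop the whole core client rather than suspend the project) could be scripted roughly as follows. This is a sketch under assumptions: it relies on `boinccmd --quit`, which asks the running client to exit; the `BOINCCMD` variable, the function name and the 30-second pause are illustrative choices, not something taken from the thread.

```shell
#!/bin/sh
# Allow the boinccmd binary to be overridden (assumption: it is on PATH by default).
BOINCCMD="${BOINCCMD:-boinccmd}"

# Ask the running core client to exit. The client is expected to stop its
# running tasks as it goes down, so they restart from the last checkpoint later.
quit_client() {
    "$BOINCCMD" --quit
}

# Hypothetical shutdown sequence for the cron job (the pause length is a guess):
#   quit_client && sleep 30 && shutdown -h now
```

The point of the design is that no science app is left in memory across the power cycle, which is the state the post suspects of causing the failures.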
[VENETO] sabayonino
|
No "leave applications in memory..." is set. MrS wrote: "with your cron-job" It is a simple command line: boinccmd --project http://www.gpugrid.net/ suspend (see $ boinccmd --help or "Command-line options" at http://boinc.berkeley.edu/wiki/Client_configuration). MrS wrote: "instead of suspending GPU-Grid with your cron-job just shut down the entire BOINC core client" I can't shut down the entire client. I must suspend GPUGrid and all GPU projects for my own reasons. Crunching still goes on for a while on the CPU only, and then the PC is shut down (not standby or anything like it). None of the PCs has a monitor, keyboard or mouse; they are controlled via SSH (Secure Shell). MrS wrote: "but the app is not expecting this, as it was only temporarily suspended" I never had problems like this until a few months ago. I'll run some tests. |
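For reference, the cron-driven suspend described in the post above might look like the sketch below. The `boinccmd --project URL suspend` call is the one quoted in the post, and `resume` is its counterpart operation; the wrapper functions, the `BOINCCMD` override and the script path in the comments are assumptions made for illustration.

```shell
#!/bin/sh
# Project URL as quoted in the thread; boinccmd binary can be overridden for testing.
PROJECT_URL="http://www.gpugrid.net/"
BOINCCMD="${BOINCCMD:-boinccmd}"

# Suspend or resume a single project while the client keeps running other work.
suspend_project() { "$BOINCCMD" --project "$PROJECT_URL" suspend; }
resume_project()  { "$BOINCCMD" --project "$PROJECT_URL" resume; }

# Dispatch on the first argument so one script serves both cron entries.
case "${1:-}" in
    suspend) suspend_project ;;
    resume)  resume_project ;;
esac

# Hypothetical crontab entries (times and script path are made up):
#   30 22 * * * /usr/local/bin/gpugrid-ctl.sh suspend
#   @reboot     sleep 120 && /usr/local/bin/gpugrid-ctl.sh resume
```

Depending on how the client is configured, `boinccmd` may need to run from the BOINC data directory or be given the RPC password in order to authenticate.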
skgiven |
Try turning LAIM off, and if that doesn't work shut down Boinc completely, as MrS said. FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help |
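On a headless machine, LAIM ("leave applications in memory") can be switched off without the BOINC Manager by writing a local preferences override. The sketch below assumes the standard `global_prefs_override.xml` mechanism and its `leave_apps_in_memory` element; the data directory path is an assumption that varies by distribution, and the function name is hypothetical.

```shell
#!/bin/sh
# Write a minimal override that turns "leave applications in memory" (LAIM) off.
# NOTE: this replaces any existing override file in that directory, so merge by
# hand if other local preferences are already stored there.
write_laim_override() {
    data_dir="$1"
    cat > "$data_dir/global_prefs_override.xml" <<'EOF'
<global_preferences>
   <leave_apps_in_memory>0</leave_apps_in_memory>
</global_preferences>
EOF
}

# Hypothetical usage; /var/lib/boinc is an assumed data directory:
#   write_laim_override /var/lib/boinc && boinccmd --read_global_prefs_override
```

With LAIM off, a suspended task is removed from memory and later restarts from its last checkpoint, which is the behaviour the suggestion above is trying to force.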
[VENETO] sabayonino
|
skgiven wrote: "Try turning LAIM off, and if that doesn't work shut down Boinc completely, as MrS said." If I try to turn off the client while GPUGrid is running (not suspended), the WU(s) fail. :D Only the GPUGrid app has this problem. Other GPU projects work fine (turn off/on, suspend, resume, etc.) ;| |
[VENETO] sabayonino
|
I'm trying an update to nvidia-drivers-313.18: http://www.nvidia.com/object/linux-display-amd64-313.18-driver.html I see several bug fixes, e.g. "Fixed a regression that could cause OpenGL applications to crash while compiling shaders." I hope this helps. |
|
|
sabayonino wrote: "If I try to turn off the client when GPUGrid is running (not suspended), WU(s) fail." Well.. this doesn't happen under Windows. Maybe suspend+resume just causes the same error to appear later? The new driver could help, but not the OpenGL fixes, as CUDA is something completely separate. MrS Scanning for our furry friends since Jan 2002 |
©2025 Universitat Pompeu Fabra