WUs fail after resume [NOELIA]

[VENETO] sabayonino
Joined: 4 Apr 10 · Posts: 50 · Credit: 650,142,596 · RAC: 0
Message 27997 - Posted: 9 Jan 2013, 22:32:35 UTC

Hi,

I turn off my PCs and suspend all WUs for GPUGrid (using a cron job). When I resume the WUs, they fail after a while:

<core_client_version>7.0.29</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
MDIO: cannot open file "restart.coor"
SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1574.
acemd.2562.x64.cuda42: swanlibnv2.cpp:59: void swan_assert(int): Assertion `a' failed.
SIGABRT: abort called
Stack trace (15 frames):
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(boinc_catch_signal+0x4d)[0x551f6d]
/lib64/libc.so.6(+0x38030)[0x7f4d28f25030]
/lib64/libc.so.6(gsignal+0x35)[0x7f4d28f24fb5]
/lib64/libc.so.6(abort+0x148)[0x7f4d28f26438]
/lib64/libc.so.6(+0x30f92)[0x7f4d28f1df92]
/lib64/libc.so.6(+0x31042)[0x7f4d28f1e042]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x482916]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x4848da]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x44d4bd]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x44e54c]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42[0x41ec14]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sin+0xb6c)[0x407d6c]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sin+0x256)[0x407456]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f4d28f116c5]
../../projects/www.gpugrid.net/acemd.2562.x64.cuda42(sinh+0x49)[0x4072f9]

Exiting...
</stderr_txt>
]]>

my PCs Error Hosts and Others

But this does not happen on all PCs. All operating systems (Gentoo Linux) are cloned.

Thanks

ExtraTerrestrial Apes (Volunteer moderator, Volunteer tester)
Joined: 17 Aug 08 · Posts: 2705 · Credit: 1,311,122,549 · RAC: 0
Message 28008 - Posted: 10 Jan 2013, 21:56:30 UTC

I'd try to send BOINC the "quit client" command instead of suspending WUs with the cron job.

MrS
Scanning for our furry friends since Jan 2002

[VENETO] sabayonino
Message 28010 - Posted: 10 Jan 2013, 23:15:22 UTC - in response to Message 28008. Last modified: 10 Jan 2013, 23:16:23 UTC

Quote: I'd try to send BOINC the "quit client" command instead of suspending WUs with the cron job. MrS

I won't quit the client, only suspend this project so that another one gets processed, and then shut down the PC.

I suspend the project, not the WU(s).

ExtraTerrestrial Apes (Volunteer moderator, Volunteer tester)
Message 28018 - Posted: 11 Jan 2013, 22:50:22 UTC - in response to Message 28010.

Yes... that's why I'm suggesting trying something else :)

MrS

[VENETO] sabayonino
Message 28034 - Posted: 14 Jan 2013, 12:30:54 UTC. Last modified: 14 Jan 2013, 12:38:47 UTC

Doh! I turned on my PC and resumed a task... after 5 minutes the WU crashed. Same problem:

http://www.gpugrid.net/result.php?resultid=6313064

<core_client_version>7.0.29</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
# Using device 0
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 560"
# Clock rate: 1.62 GHz
# Total amount of global memory: 1073283072 bytes
# Number of multiprocessors: 7
# Number of cores: 56
# Device 1: "GeForce GTX 560"
# Clock rate: 1.62 GHz
# Total amount of global memory: 1073414144 bytes
# Number of multiprocessors: 7
# Number of cores: 56
MDIO: cannot open file "restart.coor"
# Using device 0
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 560"
# Clock rate: 1.62 GHz
# Total amount of global memory: 1073283072 bytes
# Number of multiprocessors: 7
# Number of cores: 56
# Device 1: "GeForce GTX 560"
# Clock rate: 1.62 GHz
# Total amount of global memory: 1073414144 bytes
# Number of multiprocessors: 7
# Number of cores: 56
SWAN: FATAL : swanMemcpyDtoH failed
acemd.linux64.2352: swanlib_nv.c:390: error: Assertion `0' failed.
SIGABRT: abort called
Stack trace (13 frames):
../../projects/www.gpugrid.net/acemd.linux64.2352(boinc_catch_signal+0x4d)[0x482bed]
/lib64/libc.so.6(+0x38030)[0x7f9e9f122030]
/lib64/libc.so.6(gsignal+0x35)[0x7f9e9f121fb5]
/lib64/libc.so.6(abort+0x148)[0x7f9e9f123438]
/lib64/libc.so.6(+0x30f92)[0x7f9e9f11af92]
/lib64/libc.so.6(+0x31042)[0x7f9e9f11b042]
../../projects/www.gpugrid.net/acemd.linux64.2352[0x491b33]
../../projects/www.gpugrid.net/acemd.linux64.2352[0x474510]
../../projects/www.gpugrid.net/acemd.linux64.2352[0x413c60]
../../projects/www.gpugrid.net/acemd.linux64.2352[0x407cba]
../../projects/www.gpugrid.net/acemd.linux64.2352[0x40857e]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f9e9f10e6c5]
../../projects/www.gpugrid.net/acemd.linux64.2352[0x407a19]

Exiting...
</stderr_txt>
]]>

SWAN: FATAL : swanMemcpyDtoH failed ---- What is it?

Over 6 hours lost.

skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0
Message 28038 - Posted: 14 Jan 2013, 15:36:18 UTC - in response to Message 28034. Last modified: 14 Jan 2013, 15:37:05 UTC

Probably something to do with using the cuda3.1 app:

9 Jan 2013 | 21:04:19 UTC    14 Jan 2013 | 12:08:27 UTC    Error while computing    62,671.43    662.52    ---    ACEMD2: GPU molecular dynamics v6.16 (cuda31)

Even at that it seems very long for a 'normal' length task; most of your NOELIA_hfXA tasks ran in 8.5K to 9K seconds (albeit on the 4.2 app). On the 3.1 app I would have expected it to take around twice that, but it ran for ~9 times as long. Perhaps there was a problem with the task, or it was a long WU that ended up in the wrong queue somehow?

FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help

[VENETO] sabayonino
Message 28046 - Posted: 14 Jan 2013, 19:41:30 UTC. Last modified: 14 Jan 2013, 19:44:30 UTC

I tried this: the following WUs went wrong when the system restarted.

http://www.gpugrid.net/result.php?resultid=6329901 (NATHAN)
http://www.gpugrid.net/result.php?resultid=6327036 (NOELIA)
http://www.gpugrid.net/result.php?resultid=6326898 (NOELIA)

Where is the problem?

I'm running Gentoo Linux:

emerge --info
Portage 2.1.11.31 (default/linux/amd64/10.0, gcc-4.7.2, glibc-2.15-r3, 3.6.11-gentoo x86_64)
=================================================================
System uname: Linux-3.6.11-gentoo-x86_64-Pentium-R-_Dual-Core_CPU_E6500_@_2.93GHz-with-gentoo-2.1
Timestamp of tree: Wed, 09 Jan 2013 18:45:01 +0000
ld GNU ld (GNU Binutils) 2.22
distcc 3.1 x86_64-pc-linux-gnu [enabled]
app-shells/bash: 4.2_p37
dev-lang/python: 2.7.3-r2, 3.2.3
dev-util/pkgconfig: 0.27.1
sys-apps/baselayout: 2.1-r1
sys-apps/openrc: 0.11.8
sys-apps/sandbox: 2.5
sys-devel/autoconf: 2.68
sys-devel/automake: 1.11.6, 1.12.4
sys-devel/binutils: 2.22-r1
sys-devel/gcc: 4.5.4, 4.6.3, 4.7.2
sys-devel/gcc-config: 1.7.3
sys-devel/libtool: 2.4-r1
sys-devel/make: 3.82-r4
sys-kernel/linux-headers: 3.6 (virtual/os-headers)
sys-libs/glibc: 2.15-r3
Repositories: gentoo science
ACCEPT_KEYWORDS="amd64"
ACCEPT_LICENSE="* -@EULA"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-O2 -march=core2 -pipe"
CHOST="x86_64-pc-linux-gnu"
[...]

NVIDIA GPU 0: GeForce GTX 660 (driver version unknown, CUDA version 5.0, compute capability 3.0, 133801984MB, 134215644MB available, 1982 GFLOPS peak)
OpenCL: NVIDIA GPU 0: GeForce GTX 660 (driver version 310.19, device version OpenCL 1.1 CUDA, 2048MB, 134215644MB available)

Nvidia-Drivers: 310.19

This happens on all my PCs and only with GPUGrid.

ExtraTerrestrial Apes (Volunteer moderator, Volunteer tester)
Message 28048 - Posted: 14 Jan 2013, 22:47:46 UTC - in response to Message 28046.

Did you already try ending BOINC instead of suspending the project, prior to a restart?

MrS

[VENETO] sabayonino
Message 28050 - Posted: 15 Jan 2013, 0:09:11 UTC - in response to Message 28048.

Quote: Did you already try ending BOINC instead of suspending the project, prior to a restart? MrS

Yes, I did. If the WU goes on... nothing happens. The WU is reported without errors.

ExtraTerrestrial Apes (Volunteer moderator, Volunteer tester)
Message 28055 - Posted: 15 Jan 2013, 21:42:56 UTC - in response to Message 28050.

Not sure if I understand you correctly. So it solves your problem?

MrS

[VENETO] sabayonino
Message 28057 - Posted: 15 Jan 2013, 23:05:43 UTC - in response to Message 28055.

Quote: Not sure if I understand you correctly. So it solves your problem? MrS

No. WU(s) fail if I resume them after rebooting the PC(s).

As I wrote, I suspend the project via crontab + the BOINC command line (suspend). Then I turn off the PCs; they cannot stay switched on 24 hours a day.

When I resume the project (and the WU(s)), the WU(s) fail. I don't know why.

All the failed runs point at SWAN:

SWAN: FATAL : swanMemcpyDtoH failed
SWAN : FATAL : Cuda driver error 700 in file 'swanlibnv2.cpp' in line 1574.
acemd.2562.x64.cuda42: swanlibnv2.cpp:59: void swan_assert(int): Assertion `a' failed.

and

MDIO: cannot open file "restart.coor"

... restart.... resume ...

[VENETO] sabayonino
Message 28058 - Posted: 16 Jan 2013, 0:37:58 UTC. Last modified: 16 Jan 2013, 0:45:51 UTC

I solved it (maybe...)

The CUDA libraries weren't linked properly for the acemd.2562.x64.cuda42 app. I reinstalled cuda-toolkit-4.x.x.

The libcufft.so.4 link was missing.
The libcudart.so.4 link was missing.

Now:

ldd acemd.2562.x64.cuda42
linux-vdso.so.1 (0x00007fff46dff000)
libcufft.so.4 => /opt/cuda/lib64/libcufft.so.4 (0x00007fccc23a2000)
libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x00007fccc17a2000)
libcudart.so.4 => /opt/cuda/lib64/libcudart.so.4 (0x00007fccc1543000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007fccc133f000)
libstdc++.so.6 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.2/libstdc++.so.6 (0x00007fccc1038000)
libm.so.6 => /lib64/libm.so.6 (0x00007fccc0d41000)
libgcc_s.so.1 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.7.2/libgcc_s.so.1 (0x00007fccc0b2b000)
libc.so.6 => /lib64/libc.so.6 (0x00007fccc0783000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fccc0566000)
libz.so.1 => /lib64/libz.so.1 (0x00007fccc0350000)
librt.so.1 => /lib64/librt.so.1 (0x00007fccc0147000)
/lib64/ld-linux-x86-64.so.2 (0x00007fccc43c6000)

I tried to suspend/resume after a reboot and the WU(s) keep going now... I will check more carefully tomorrow. :)

[edit] Nothing... After 10 minutes this WU (NOELIA) failed. Another WU (NATHAN) is still going...

I think I give up :(
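The library check described in the post above can be repeated on each cloned host by filtering the `ldd` output for entries the dynamic linker could not resolve. This is only a minimal sketch: the helper name `missing_libs` is made up for illustration, and it assumes the usual glibc `ldd` wording, where an unresolved library is printed as `name => not found`.

```shell
#!/usr/bin/env bash
# missing_libs: hypothetical helper, not part of BOINC or GPUGrid.
# Reads `ldd` output on stdin and prints the name of every shared
# library that the dynamic linker reported as unresolved.
missing_libs() {
    # glibc's ldd prints unresolved entries as "<name> => not found";
    # awk strips the leading tab and prints only the library name.
    awk '/=> not found/ { print $1 }'
}
```

It could be used as `ldd ./acemd.2562.x64.cuda42 | missing_libs`, run from the project directory. Empty output means every library resolved, which would match the `ldd` listing in the post after the toolkit was reinstalled.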

skgiven (Volunteer moderator, Volunteer tester)
Message 28059 - Posted: 16 Jan 2013, 1:02:03 UTC - in response to Message 28057. Last modified: 16 Jan 2013, 1:11:23 UTC

You can ignore this error:

MDIO: cannot open file "restart.coor"

Perhaps this is another lib problem?

[VENETO] sabayonino
Message 28063 - Posted: 16 Jan 2013, 12:04:01 UTC - in response to Message 28059. Last modified: 16 Jan 2013, 12:16:17 UTC

Quote: You can ignore this error, MDIO: cannot open file "restart.coor" Perhaps this is another lib problem?

OK. Today I resumed the project and NATHAN's WU is still crunching without errors.

As soon as possible I'll try to reset the project on all PCs and will check all dynamic libraries for all GPUGrid apps.

ExtraTerrestrial Apes (Volunteer moderator, Volunteer tester)
Message 28065 - Posted: 16 Jan 2013, 22:49:00 UTC

So you suspend GPU-Grid via the cron job, set the PC to standby, restart, resume GPU-Grid and occasionally (or mostly) get these errors (as you said before).

I'm not an expert in this, but to me your error messages sound like "something was in a state it should not have been in - be it driver, CUDA, GPU memory or whatever". I think we can agree on this, right?

When you suspend GPU-Grid and have "leave applications in memory while suspended" active, BOINC and GPU-Grid think they can just continue exactly where they left off upon resuming. However, your PC went into standby in the meantime. The main memory contents should be preserved, but what about GPU memory, GPU caches and registers? I would not rule out the possibility that something goes wrong here. Some state is reset during standby, but the app is not expecting this, as it was only temporarily suspended.

So I'm asking you again to test the following: instead of suspending GPU-Grid with your cron job, just shut down the entire BOINC core client. Now upon resuming from standby, BOINC and GPU-Grid know they're starting anew from the last checkpoint, which should definitely work.

Much better than giving up, isn't it?

MrS
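The test MrS proposes (stop the whole core client before powering down, so the app restarts from its last checkpoint) might be scripted roughly as follows. This is a hedged sketch, not something taken from the thread: `boinccmd --quit` is the standard command-line way to ask the client to exit, but the process name `boinc_client`, the 60-second timeout and the function name `stop_client_and_wait` are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Assumptions (not from the thread): the client process is called
# "boinc_client" and 60 seconds is long enough for it to exit.
BOINCCMD="${BOINCCMD:-boinccmd}"
CLIENT_NAME="${CLIENT_NAME:-boinc_client}"
TIMEOUT="${TIMEOUT:-60}"

# Ask the core client to exit, then wait until its process is gone.
# Returns 0 once no matching process remains, 1 on failure or timeout.
stop_client_and_wait() {
    $BOINCCMD --quit || return 1
    local waited=0
    # Poll once per second until the client process has disappeared.
    while pgrep -x "$CLIENT_NAME" >/dev/null 2>&1; do
        [ "$waited" -ge "$TIMEOUT" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A cron job could then run something like `stop_client_and_wait && shutdown -h now`, so the machine only powers off after the science apps have been closed by the client rather than killed mid-run.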

[VENETO] sabayonino
Message 28067 - Posted: 16 Jan 2013, 23:50:30 UTC - in response to Message 28065.

Quote: When you suspend GPU-Grid and have "leave applications in memory while suspended" active

No "leave applications in memory..." option is set.

Quote: with your cron-job

It is a simple CLI command:

boinccmd --project http://www.gpugrid.net/ suspend

See $ boinccmd --help or "Command-line options" ---> http://boinc.berkeley.edu/wiki/Client_configuration

Quote: instead of suspending GPU-Grid with your cron-job just shut down the entire BOINC core client

I can't suspend the entire client. I must suspend GPUGrid and all GPU projects for my own reasons. Crunching still goes on for a while with CPU only, then I shut down the PC (not standby or anything like it).

None of the PCs is equipped with a monitor, keyboard or mouse; they are controlled via ssh (Secure Shell).

Quote: but the app is not expecting this, as it was only temporarly suspended

I never had problems like this until a few months ago.

I'll do tests.
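For reference, the per-project suspend quoted in the post above, together with its `resume` counterpart, could be wrapped like this. It is a minimal sketch: the function name `project_op` and the overridable `BOINCCMD` variable are illustrative assumptions, while the `--project <URL> suspend` and `--project <URL> resume` forms are the ones `boinccmd --help` lists.

```shell
#!/usr/bin/env bash
# BOINCCMD can be overridden (e.g. with "echo" for a dry run);
# by default the real boinccmd found on PATH is used.
BOINCCMD="${BOINCCMD:-boinccmd}"
PROJECT_URL="http://www.gpugrid.net/"

# project_op: hypothetical wrapper; $1 must be "suspend" or "resume".
project_op() {
    case "$1" in
        suspend|resume)
            $BOINCCMD --project "$PROJECT_URL" "$1"
            ;;
        *)
            echo "usage: project_op suspend|resume" >&2
            return 2
            ;;
    esac
}
```

Calling `project_op suspend` from the cron job and `project_op resume` after boot reproduces the workflow described in the thread. Setting `BOINCCMD=echo` first prints the arguments instead of contacting the client, which is handy for checking a crontab entry over ssh.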

skgiven (Volunteer moderator, Volunteer tester)
Message 28068 - Posted: 17 Jan 2013, 9:52:13 UTC - in response to Message 28067. Last modified: 17 Jan 2013, 9:57:01 UTC

Try turning LAIM ("leave applications in memory") off, and if that doesn't work, shut down BOINC completely, as MrS said.

[VENETO] sabayonino
Message 28071 - Posted: 17 Jan 2013, 12:00:22 UTC - in response to Message 28068.

Quote: Try turning LAIM off, and if that doesn't work shut down Boinc completely, as MrS said.

If I try to turn off the client while GPUGrid is running (not suspended), the WU(s) fail. :D

Only the GPUGrid app has this problem. Other GPU projects work fine (turn off/on, suspend, resume, etc.) ;|

[VENETO] sabayonino
Message 28072 - Posted: 17 Jan 2013, 12:25:15 UTC. Last modified: 17 Jan 2013, 12:27:30 UTC

I'm trying to update to nvidia-drivers-313.18:

http://www.nvidia.com/object/linux-display-amd64-313.18-driver.html

I see several bug fixes, for example:

"Fixed a regression that could cause OpenGL applications to crash while compiling shaders."

...I hope this goes well.

ExtraTerrestrial Apes (Volunteer moderator, Volunteer tester)
Message 28075 - Posted: 17 Jan 2013, 19:56:57 UTC - in response to Message 28071.

Quote: If I try to Torn-Off the client when GPUGrid is running (not suspend) , WU(s) fail.

Well... this doesn't happen under Windows. Maybe suspend + resume just causes the same error to appear later?

The new driver could help, but not because of the OpenGL fixes, as CUDA is something completely separate.

MrS