some hosts won't get tasks

Retvari Zoltan
Message 57195 - Posted: 11 Jul 2021, 12:57:41 UTC

I suggest you force BOINC / GPUGRID to assign a new host ID to your non-working hosts.
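For reference, a minimal sketch of one way to do that on Linux (editing client_state.xml is unsupported, so treat this as an at-your-own-risk example; the path assumes a standard boinc-client install):

sudo systemctl stop boinc-client   # never edit client_state.xml while the client is running
# In /var/lib/boinc-client/client_state.xml, find the <project> block whose
# <master_url> points at GPUGRID and change its <hostid>NNNNN</hostid> line
# to <hostid>0</hostid>. On the next scheduler contact the server assigns a
# fresh host ID.
sudo systemctl start boinc-client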

Keith Myers
Message 57196 - Posted: 11 Jul 2021, 14:56:30 UTC

That might be a solution. It's easy enough to do, and you can always merge the old host ID back into the new one.

Ian&Steve C.
Message 57197 - Posted: 11 Jul 2021, 18:57:45 UTC - in response to Message 57195.  

I suggest you force BOINC / GPUGRID to assign a new host ID to your non-working hosts.


It could be a solution, but right now not much work is available anyway, so I will wait until work is plentiful again and reassess. If I'm still not getting work when there are thousands of tasks ready to send, then I'll do it. I'd really prefer not to, though.

jiipee
Message 57198 - Posted: 12 Jul 2021, 6:48:01 UTC
Last modified: 12 Jul 2021, 6:49:18 UTC

The last ACEMD3 work unit I saw was 27077654 (8th July 2021). It errored out. The same error seems to happen on others' hosts too, yet one host completed it successfully:

<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
10:36:10 (18462): wrapper (7.7.26016): starting
10:36:10 (18462): wrapper (7.7.26016): starting
10:36:10 (18462): wrapper: running acemd3 (--boinc input --device 0)
acemd3: error while loading shared libraries: libboost_filesystem.so.1.74.0: cannot open shared object file: No such file or directory
10:36:11 (18462): acemd3 exited; CPU time 0.000578
10:36:11 (18462): app exit status: 0x7f
10:36:11 (18462): called boinc_finish(195)

</stderr_txt>
]]>


Perhaps some bugs are waiting to be solved?
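A quick way to see which shared libraries a binary cannot resolve is ldd (a sketch only; the exact path to the acemd3 binary under the BOINC data directory varies, so locate it first):

cd /var/lib/boinc-client                    # default Linux data directory
find . -name 'acemd3*' -type f              # locate the app binary
ldd ./projects/www.gpugrid.net/acemd3 | grep 'not found'   # example path; prints each missing shared object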

Ian&Steve C.
Message 57199 - Posted: 12 Jul 2021, 14:09:22 UTC - in response to Message 57198.  

acemd3: error while loading shared libraries: libboost_filesystem.so.1.74.0: cannot open shared object file: No such file or directory

You need to install the Boost 1.74 package from your distribution or from a PPA. I have no idea what system you have, since your computers are hidden, and the install process varies from distribution to distribution; on Ubuntu there is a PPA for it.

That will fix your error.
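As a sketch of the Ubuntu route (the PPA name below is a commonly used source of newer Boost builds, not something confirmed in this thread, so verify it before adding it):

sudo add-apt-repository ppa:mhier/libboost-latest   # assumed PPA; check before use
sudo apt update
sudo apt install libboost1.74-dev                   # provides libboost_filesystem.so.1.74.0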

jiipee
Message 57200 - Posted: 12 Jul 2021, 18:17:09 UTC - in response to Message 57199.  

OK, thanks for the info. My computers mostly run CentOS 6/7, but there is also one Linux Mint machine and one Win10 machine.

Ian&Steve C.
Message 57267 - Posted: 3 Sep 2021, 15:17:40 UTC

I think it's resolved now.

Background:
When the cuda1121 app was released on July 1st, two of my hosts stopped receiving any tasks or applications. The cuda100 Linux app was pulled and replaced with cuda1121. All systems had compatible drivers and displayed compatible driver versions, yet only some systems continued to receive the new app; all the others constantly got a "no tasks available" message. I had no problem getting cuda100 tasks before.

I run the coproc_info files on all my hosts in a locked-down state, so each always shows the same obfuscated driver version and doesn't change when I change drivers. This can be useful for testing (for example, it's the only way I can get Einstein to send me the new 1.28 beta app for AMD, because BOINC detects OpenCL 1.2 even with compatible drivers, and Einstein will not send the app unless you report OpenCL 2.0+), and it gives me control over what is actually reported. Usually I do not update the coproc file with the latest info; if I want to change something, I unlock it, change what needs changing, and lock it back down.

Recently:
They pushed updates for the cuda1121 app, but also brought back a new cuda101 app. It was this app that I received, but I did not receive cuda1121.

So I had the thought: maybe they are actually checking the CUDA version reported by BOINC, not just checking for a compatible driver version. I checked the CUDA version listed in the "good" hosts' coproc files, and they all reported greater than 11.2. The bad hosts were outdated and still reported CUDA 11.1 from older driver installs, even though the driver version itself was high enough for CUDA 11.2 by NVIDIA's thresholds. This also explains why one of my hosts picked up tasks for the cuda101 app: previously cuda100 was taken away and that host didn't report a high enough CUDA version to get the 11.2 app, but once cuda101 was brought back, it qualified for that again.

So I've now recycled the coproc file on the two bad hosts to report CUDA 11.4, and I expect I'll get the new app now. It might be useful in the future to test cuda101 vs cuda1121 by manipulating the coproc file to control what gets sent; I assume GPUGRID sends the highest version you qualify for.

In short, the combination of an outdated coproc file (locked against updates) and the removal of the old cuda100 app is what caused my earlier trouble getting work on a few hosts. If they had kept the old cuda100 app in play, I would still have received that.
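As a sketch of that recycle step (the path assumes a standard Linux boinc-client install, the 11010/11040 values are illustrative, and chattr is only one way to lock the file):

cd /var/lib/boinc-client
sudo chattr -i coproc_info.xml      # unlock the previously locked file
# BOINC encodes CUDA 11.4 as 11040; bump whatever outdated value is there.
sudo sed -i 's|<cudaVersion>11010</cudaVersion>|<cudaVersion>11040</cudaVersion>|' coproc_info.xml
sudo chattr +i coproc_info.xml      # lock it again so the client can't rewrite it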

Richard Haselgrove
Message 57279 - Posted: 7 Sep 2021, 8:24:56 UTC - in response to Message 57267.  

Ian,

Did you make any other changes - to coproc_info or elsewhere?

After rebooting a Linux Mint machine, I get

Tue 07 Sep 2021 09:03:21 BST |  | CUDA: NVIDIA GPU 0: GeForce GTX 1660 SUPER (driver version 460.91, CUDA version 11.2, compute capability 7.5, 4096MB, 3974MB available, 5153 GFLOPS peak)
Tue 07 Sep 2021 09:03:21 BST |  | OpenCL: NVIDIA GPU 0: GeForce GTX 1660 SUPER (driver version 460.91.03, device version OpenCL 1.2 CUDA, 5942MB, 3974MB available, 5153 GFLOPS peak)

- all of which seems to match your settings, but I've still never been sent a task beyond version 212. Any ideas?

Ian&Steve C.
Message 57282 - Posted: 7 Sep 2021, 22:44:28 UTC - in response to Message 57279.  
Last modified: 7 Sep 2021, 22:48:11 UTC

I have to assume that their CUDA version is "11.2.1", the .1 denoting the Update 1 release, based on the fact that their app plan class is cuda1121.

Does BOINC reflect CUDA version 11.2.1 or greater in the coproc file? Your driver version is sufficient, but it's possible that BOINC isn't capturing these minor versions, and the project only knows what you have based on what BOINC tells it.
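A quick way to check (a sketch; the path assumes a standard Linux boinc-client install, <cudaVersion> is the element quoted from coproc_info later in this thread, and <display_driver_version> is assumed to sit alongside it):

grep -E '<(cudaVersion|display_driver_version)>' /var/lib/boinc-client/coproc_info.xml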

Try upgrading the drivers to 465+ to get into CUDA 11.3+ territory and ensure that your reported version is greater than required.

Also keep in mind the low task availability; it seems new work hasn't been available for a few days. Maybe they pulled back on sending work after I reported the issues with the new app.

Richard Haselgrove
Message 57283 - Posted: 8 Sep 2021, 11:10:25 UTC - in response to Message 57282.  

OK, I'll see your 465 and raise you 470 (-:

Wed 08 Sep 2021 12:04:41 BST |  | CUDA: NVIDIA GPU 0: NVIDIA GeForce GTX 1660 Ti (driver version 470.57, CUDA version 11.4, compute capability 7.5, 4096MB, 3972MB available, 5530 GFLOPS peak)
Wed 08 Sep 2021 12:04:41 BST |  | OpenCL: NVIDIA GPU 0: NVIDIA GeForce GTX 1660 Ti (driver version 470.57.02, device version OpenCL 3.0 CUDA, 5942MB, 3972MB available, 5530 GFLOPS peak)

That sounds plausible: coproc_info had <cudaVersion>11020</cudaVersion>, and it now has <cudaVersion>11040</cudaVersion>. No tasks on the first request but, as you say, they're as rare as hen's teeth. I'll leave it trying and see what happens.

Richard Haselgrove
Message 57285 - Posted: 8 Sep 2021, 21:18:07 UTC

OK, so I've got a Cryptic_Scout task running with v217 and cuda1121.

But it's on the machine where I didn't update the video driver. Go figure.

It's - according to BOINC Manager - on device 1, and two Einstein tasks are running on device 0. As usual.

I've had a long day in the hills (last day of summer weather), so I'll leave it for tonight. But at least I'll have some entrails to pick over in the morning.

Thought - I might exclude the project from devices other than 0, until we get to the bottom of this.
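For the record, BOINC's documented way to do that is an <exclude_gpu> entry in cc_config.xml; a sketch, assuming device 1 is the one to exclude and that no cc_config.xml exists yet (the URL must match the project URL your client shows):

cat > /var/lib/boinc-client/cc_config.xml <<'EOF'
<cc_config>
  <options>
    <exclude_gpu>
      <url>https://www.gpugrid.net/</url>
      <device_num>1</device_num>
    </exclude_gpu>
  </options>
</cc_config>
EOF
boinccmd --read_cc_config   # make the running client pick up the change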

Richard Haselgrove
Message 57286 - Posted: 8 Sep 2021, 21:38:48 UTC

Initial observations are that the Einstein tasks are running far slower than usual - implying that both sets of tasks are running on device zero, as other people have reported.

ServicEnginIC
Message 57287 - Posted: 8 Sep 2021, 21:48:21 UTC - in response to Message 57286.  
Last modified: 8 Sep 2021, 21:49:00 UTC

Initial observations are that the Einstein tasks are running far slower than usual - implying that both sets of tasks are running on device zero, as other people have reported.

The nvidia-smi command will quickly confirm this.

Richard Haselgrove
Message 57288 - Posted: 8 Sep 2021, 22:18:27 UTC - in response to Message 57287.  

Yup, so it has.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 166...  Off  | 00000000:01:00.0  On |                  N/A |
| 55%   87C    P2   126W / 125W |   1531MiB /  5941MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 166...  Off  | 00000000:05:00.0 Off |                  N/A |
| 31%   37C    P8    11W / 125W |      8MiB /  5944MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1133      G   /usr/lib/xorg/Xorg                 89MiB |
|    0   N/A  N/A     49977      C   bin/acemd3                        302MiB |
|    0   N/A  N/A     50085      C   ...nux-gnu__GW-opencl-nvidia     1135MiB |
|    1   N/A  N/A      1133      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

acemd3 running on GPU 0 is conclusive. And so to bed.

kksplace
Message 57583 - Posted: 11 Oct 2021, 19:07:43 UTC
Last modified: 11 Oct 2021, 19:13:46 UTC

After not crunching for several months, I started back up about a month ago. It took some time due to the limited supply of work units, but I received some GPUGrid WUs starting the first week of October; now I haven't received any since October 6th. I have tried snagging one while some show as available, but I only get the message "No tasks are available for New version of ACEMD" in the BOINC Manager event log. Any ideas what I may have changed or set incorrectly? (I am receiving and crunching Einstein and Milkyway WUs. My GPUGrid resource share is set 15 times higher than Einstein's and 50 times higher than Milkyway's.)

Nvidia 1080
Driver 470.63.01 Cuda Version: 11.4
Linux Mint OS

Edit: I have also tried a project reset, which did not help.

Computer is not hidden. Thank you for taking a look.

Ian&Steve C.
Message 57584 - Posted: 11 Oct 2021, 19:16:58 UTC - in response to Message 57583.  

No tasks are available right now; your system looks fine to me.

If you want to snag tasks or resends as they become available, you'll have to set up a script or looping command that keeps asking GPUGRID for work, as sketched below. BOINC's default work-fetch behavior falls into a kind of hidden backoff and stops checking after several consecutive requests return no work; a script that checks periodically is the only sure way to defeat this. Just make sure it checks on an interval longer than the default server cooldown (I think it's 30 seconds). Checking every 5 minutes gives you a good chance of catching resends or new work.
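A minimal sketch of such a loop, assuming boinccmd is installed and the URL matches the one your client uses for the project:

#!/bin/bash
# Ask the client to contact GPUGRID for work every 5 minutes; a manual
# project update clears BOINC's internal backoff for that project.
while true; do
    boinccmd --project https://www.gpugrid.net/ update
    sleep 300   # stay well above the ~30-second server-side request cooldown
done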