some hosts won't get tasks

Retvari Zoltan
Message 57195 - Posted: 11 Jul 2021, 12:57:41 UTC

I suggest you force BOINC / GPUGRID to assign a new host ID to your non-working hosts.
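For reference, a minimal sketch of one way to do that on Linux (editing client_state.xml is unsupported, so treat this as an at-your-own-risk example; the path assumes a standard boinc-client install):

sudo systemctl stop boinc-client   # never edit client_state.xml while the client is running
# In /var/lib/boinc-client/client_state.xml, find the <project> block whose
# <master_url> points at GPUGRID and change its <hostid>NNNNN</hostid> line
# to <hostid>0</hostid>. On the next scheduler contact the server assigns a
# fresh host ID.
sudo systemctl start boinc-client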

Keith Myers
Message 57196 - Posted: 11 Jul 2021, 14:56:30 UTC

That might be a solution. It's easy enough to do, and you can always merge the old host ID back into the new one.

Ian&Steve C.
Message 57197 - Posted: 11 Jul 2021, 18:57:45 UTC - in response to Message 57195.  

I suggest you force BOINC / GPUGRID to assign a new host ID to your non-working hosts.


It could be a solution, but right now not much work is available anyway, so I will wait until work is plentiful again and reassess. If I'm still not getting work when there are thousands of tasks ready to send, then I'll do it. I'd really prefer not to, though.

jiipee
Message 57198 - Posted: 12 Jul 2021, 6:48:01 UTC
Last modified: 12 Jul 2021, 6:49:18 UTC

The last ACEMD3 work unit I saw was 27077654 (8th July 2021). It errored out. The same error seems to happen on others' hosts too, yet one host completed it successfully:

<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)</message>
<stderr_txt>
10:36:10 (18462): wrapper (7.7.26016): starting
10:36:10 (18462): wrapper (7.7.26016): starting
10:36:10 (18462): wrapper: running acemd3 (--boinc input --device 0)
acemd3: error while loading shared libraries: libboost_filesystem.so.1.74.0: cannot open shared object file: No such file or directory
10:36:11 (18462): acemd3 exited; CPU time 0.000578
10:36:11 (18462): app exit status: 0x7f
10:36:11 (18462): called boinc_finish(195)

</stderr_txt>
]]>


Perhaps some bugs are waiting to be solved?
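A quick way to see which shared libraries a binary cannot resolve is ldd (a sketch only; the exact path to the acemd3 binary under the BOINC data directory varies, so locate it first):

cd /var/lib/boinc-client                    # default Linux data directory
find . -name 'acemd3*' -type f              # locate the app binary
ldd ./projects/www.gpugrid.net/acemd3 | grep 'not found'   # example path; prints each missing shared object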

Ian&Steve C.
Message 57199 - Posted: 12 Jul 2021, 14:09:22 UTC - in response to Message 57198.  

acemd3: error while loading shared libraries: libboost_filesystem.so.1.74.0: cannot open shared object file: No such file or directory

You need to install the Boost 1.74 package from your distribution or from a PPA. I have no idea what system you have, since your computers are hidden, and the install process varies from distribution to distribution; on Ubuntu there is a PPA for it.

That will fix your error.
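As a sketch of the Ubuntu route (the PPA name below is a commonly used source of newer Boost builds, not something confirmed in this thread, so verify it before adding it):

sudo add-apt-repository ppa:mhier/libboost-latest   # assumed PPA; check before use
sudo apt update
sudo apt install libboost1.74-dev                   # provides libboost_filesystem.so.1.74.0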

jiipee
Message 57200 - Posted: 12 Jul 2021, 18:17:09 UTC - in response to Message 57199.  

OK, thanks for the info. My computers mostly run CentOS 6/7, but there is also one Linux Mint machine and one Win10 machine.

Ian&Steve C.
Message 57267 - Posted: 3 Sep 2021, 15:17:40 UTC

I think it's resolved now.

Background:
When the cuda1121 app was released on July 1st, two of my hosts stopped receiving any tasks or applications. The cuda100 Linux app was pulled and replaced with cuda1121. All systems had compatible drivers and displayed compatible driver versions, yet only some systems continued to receive the new app; all the others constantly got a "no tasks available" message. I had no problem getting cuda100 tasks before.

I run the coproc_info files on all my hosts in a locked-down state, so each always shows the same obfuscated driver version and doesn't change when I change drivers. This can be useful for testing (for example, it's the only way I can get Einstein to send me the new 1.28 beta app for AMD, because BOINC detects OpenCL 1.2 even with compatible drivers, and Einstein will not send the app unless you report OpenCL 2.0+), and it gives me control over what is actually reported. Usually I do not update the coproc file with the latest info; if I want to change something, I unlock it, change what needs changing, and lock it back down.

Recently:
They pushed updates for the cuda1121 app, but also brought back a new cuda101 app. It was this app that I received, but I did not receive cuda1121.

So I had the thought: maybe they are actually checking the CUDA version reported by BOINC, not just checking for a compatible driver version. I checked the CUDA version listed in the "good" hosts' coproc files, and they all reported greater than 11.2. The bad hosts were outdated and still reported CUDA 11.1 from older driver installs, even though the driver version itself was high enough for CUDA 11.2 by NVIDIA's thresholds. This also explains why one of my hosts picked up tasks for the cuda101 app: previously cuda100 was taken away and that host didn't report a high enough CUDA version to get the 11.2 app, but once cuda101 was brought back, it qualified for that again.

So I've now recycled the coproc file on the two bad hosts to report CUDA 11.4, and I expect I'll get the new app now. It might be useful in the future to test cuda101 vs cuda1121 by manipulating the coproc file to control what gets sent; I assume GPUGRID sends the highest version you qualify for.

In short, the combination of an outdated coproc file (locked against updates) and the removal of the old cuda100 app is what caused my earlier trouble getting work on a few hosts. If they had kept the old cuda100 app in play, I would still have received that.
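As a sketch of that recycle step (the path assumes a standard Linux boinc-client install, the 11010/11040 values are illustrative, and chattr is only one way to lock the file):

cd /var/lib/boinc-client
sudo chattr -i coproc_info.xml      # unlock the previously locked file
# BOINC encodes CUDA 11.4 as 11040; bump whatever outdated value is there.
sudo sed -i 's|<cudaVersion>11010</cudaVersion>|<cudaVersion>11040</cudaVersion>|' coproc_info.xml
sudo chattr +i coproc_info.xml      # lock it again so the client can't rewrite it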

Richard Haselgrove
Message 57279 - Posted: 7 Sep 2021, 8:24:56 UTC - in response to Message 57267.  

Ian,

Did you make any other changes - to coproc_info or elsewhere?

After rebooting a Linux Mint machine, I get

Tue 07 Sep 2021 09:03:21 BST |  | CUDA: NVIDIA GPU 0: GeForce GTX 1660 SUPER (driver version 460.91, CUDA version 11.2, compute capability 7.5, 4096MB, 3974MB available, 5153 GFLOPS peak)
Tue 07 Sep 2021 09:03:21 BST |  | OpenCL: NVIDIA GPU 0: GeForce GTX 1660 SUPER (driver version 460.91.03, device version OpenCL 1.2 CUDA, 5942MB, 3974MB available, 5153 GFLOPS peak)

- all of which seems to match your settings, but I've still never been sent a task beyond version 212. Any ideas?

Ian&Steve C.
Message 57282 - Posted: 7 Sep 2021, 22:44:28 UTC - in response to Message 57279.  
Last modified: 7 Sep 2021, 22:48:11 UTC

I have to assume that their CUDA version is "11.2.1", the .1 denoting the Update 1 release, based on the fact that their app plan class is cuda1121.

Does BOINC reflect CUDA version 11.2.1 or greater in the coproc file? Your driver version is sufficient, but it's possible that BOINC isn't capturing these minor versions, and the project only knows what you have based on what BOINC tells it.
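A quick way to check (a sketch; the path assumes a standard Linux boinc-client install, <cudaVersion> is the element quoted from coproc_info later in this thread, and <display_driver_version> is assumed to sit alongside it):

grep -E '<(cudaVersion|display_driver_version)>' /var/lib/boinc-client/coproc_info.xml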

Try upgrading the drivers to 465+ to get into CUDA 11.3+ territory and ensure that your reported version is greater than required.

Also keep in mind the low task availability; it seems new work hasn't been available for a few days. Maybe they pulled back on sending work after I reported the issues with the new app.

Richard Haselgrove
Message 57283 - Posted: 8 Sep 2021, 11:10:25 UTC - in response to Message 57282.  

OK, I'll see your 465 and raise you 470 (-:

Wed 08 Sep 2021 12:04:41 BST |  | CUDA: NVIDIA GPU 0: NVIDIA GeForce GTX 1660 Ti (driver version 470.57, CUDA version 11.4, compute capability 7.5, 4096MB, 3972MB available, 5530 GFLOPS peak)
Wed 08 Sep 2021 12:04:41 BST |  | OpenCL: NVIDIA GPU 0: NVIDIA GeForce GTX 1660 Ti (driver version 470.57.02, device version OpenCL 3.0 CUDA, 5942MB, 3972MB available, 5530 GFLOPS peak)

That sounds plausible: coproc_info had <cudaVersion>11020</cudaVersion>, and it now has <cudaVersion>11040</cudaVersion>. No tasks on the first request but, as you say, they're as rare as hen's teeth. I'll leave it trying and see what happens.

Richard Haselgrove
Message 57285 - Posted: 8 Sep 2021, 21:18:07 UTC

OK, so I've got a Cryptic_Scout task running with v217 and cuda1121.

But it's on the machine where I didn't update the video driver. Go figure.

It's - according to BOINC Manager - on device 1, and two Einstein tasks are running on device 0. As usual.

I've had a long day in the hills (last day of summer weather), so I'll leave it for tonight. But at least I'll have some entrails to pick over in the morning.

Thought - I might exclude the project from devices other than 0, until we get to the bottom of this.
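For the record, BOINC's documented way to do that is an <exclude_gpu> entry in cc_config.xml; a sketch, assuming device 1 is the one to exclude and that no cc_config.xml exists yet (the URL must match the project URL your client shows):

cat > /var/lib/boinc-client/cc_config.xml <<'EOF'
<cc_config>
  <options>
    <exclude_gpu>
      <url>https://www.gpugrid.net/</url>
      <device_num>1</device_num>
    </exclude_gpu>
  </options>
</cc_config>
EOF
boinccmd --read_cc_config   # make the running client pick up the change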

Richard Haselgrove
Message 57286 - Posted: 8 Sep 2021, 21:38:48 UTC

Initial observations are that the Einstein tasks are running far slower than usual - implying that both sets of tasks are running on device zero, as other people have reported.

ServicEnginIC
Message 57287 - Posted: 8 Sep 2021, 21:48:21 UTC - in response to Message 57286.  
Last modified: 8 Sep 2021, 21:49:00 UTC

Initial observations are that the Einstein tasks are running far slower than usual - implying that both sets of tasks are running on device zero, as other people have reported.

The nvidia-smi command will quickly confirm this.

Richard Haselgrove
Message 57288 - Posted: 8 Sep 2021, 22:18:27 UTC - in response to Message 57287.  

Yup, so it has.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 166...  Off  | 00000000:01:00.0  On |                  N/A |
| 55%   87C    P2   126W / 125W |   1531MiB /  5941MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 166...  Off  | 00000000:05:00.0 Off |                  N/A |
| 31%   37C    P8    11W / 125W |      8MiB /  5944MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1133      G   /usr/lib/xorg/Xorg                 89MiB |
|    0   N/A  N/A     49977      C   bin/acemd3                        302MiB |
|    0   N/A  N/A     50085      C   ...nux-gnu__GW-opencl-nvidia     1135MiB |
|    1   N/A  N/A      1133      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

acemd3 running on GPU 0 is conclusive. And so to bed.

kksplace
Message 57583 - Posted: 11 Oct 2021, 19:07:43 UTC
Last modified: 11 Oct 2021, 19:13:46 UTC

After not crunching for several months, I started back up about a month ago. It took some time due to the limited supply of work units, but I received some GPUGrid WUs starting the first week of October; now I haven't received any since October 6th. I have tried snagging one while some show as available, but I only get the message "No tasks are available for New version of ACEMD" in the BOINC Manager event log. Any ideas what I may have changed or set incorrectly? (I am receiving and crunching Einstein and Milkyway WUs. My GPUGrid resource share is set 15 times higher than Einstein's and 50 times higher than Milkyway's.)

Nvidia 1080
Driver 470.63.01 Cuda Version: 11.4
Linux Mint OS

Edit: I have also tried a project reset, which did not help.

Computer is not hidden. Thank you for taking a look.

Ian&Steve C.
Message 57584 - Posted: 11 Oct 2021, 19:16:58 UTC - in response to Message 57583.  

No tasks are available right now; your system looks fine to me.

If you want to snag tasks or resends as they become available, you'll have to set up a script or looping command that keeps asking GPUGRID for work, as sketched below. BOINC's default work-fetch behavior falls into a kind of hidden backoff and stops checking after several consecutive requests return no work; a script that checks periodically is the only sure way to defeat this. Just make sure it checks on an interval longer than the default server cooldown (I think it's 30 seconds). Checking every 5 minutes gives you a good chance of catching resends or new work.
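A minimal sketch of such a loop, assuming boinccmd is installed and the URL matches the one your client uses for the project:

#!/bin/bash
# Ask the client to contact GPUGRID for work every 5 minutes; a manual
# project update clears BOINC's internal backoff for that project.
while true; do
    boinccmd --project https://www.gpugrid.net/ update
    sleep 300   # stay well above the ~30-second server-side request cooldown
done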