some hosts won't get tasks

Message boards : Number crunching : some hosts won't get tasks

Ian&Steve C.
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Message 57146 - Posted: 5 Jul 2021, 14:44:55 UTC

this one is a real head-scratcher for me. ever since the new app was released, two of my hosts have not been able to receive tasks. they don't give any error or other obvious sign that anything is wrong; they just always get the "no tasks available" response.

now I know that task availability is slim right now, but all hosts are requesting work on the same interval (every 100s) and 3 out of the 5 hosts are getting work fairly regularly. after running for several days, I would expect all hosts to at least get one task. it seems odd that 2 hosts will never get any tasks, they can't be THAT unlucky.

These hosts have no problem getting some tasks occasionally:
[7] RTX 2080 Ti / EPYC 7402P / Ubuntu 20.04
[1] RTX 3080 Ti / R9-5950X / Ubuntu 20.04
[1] GTX 1660 Super / [2] EPYC 7642 / Ubuntu 20.04


These hosts have not received tasks since the new app was released:
[8] RTX 2070 / EPYC 7402P / Ubuntu 20.04
[7] RTX 2080 / EPYC 7502 / Ubuntu 20.04

All hosts have the same "venue/location" in preferences, same OS/software package, compatible drivers, the proper boost packages installed, and I've reset the project on all hosts. I can't see an obvious reason why the two haven't received any work.
ID: 57146 · Rating: 0
Jim1348
Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 57148 - Posted: 5 Jul 2021, 15:27:57 UTC - in response to Message 57146.  

We have had discussions before about that. It appears there is a limit on the number of machines it will send work to. I expect it is part of their (undisclosed) anti-ddos system, but I don't think we know much about it other than it happens.
ID: 57148 · Rating: 0
Ian&Steve C.
Message 57149 - Posted: 5 Jul 2021, 15:34:19 UTC - in response to Message 57148.  

We have had discussions before about that. It appears there is a limit on the number of machines it will send work to. I expect it is part of their (undisclosed) anti-ddos system, but I don't think we know much about it other than it happens.


that situation is different. what you're referencing is a temporary communications block when too many computers are at the same physical location (sharing the same external IP address). In that case, a scheduler request would fail, but occasionally get through. I've worked around this problem for a long time and nothing has changed in that regard. these systems are spread across 3 physical locations and one of the systems (the 8x 2070) is actually the only host at its IP, so it's not competing with any other system.

so that's not the issue here. I have no problem making scheduler requests, and the hosts are always asking for work, but these two for some reason always get the response that no tasks are available. it seems unlikely that they would be THAT unlucky, never getting a resend when 3 other systems are occasionally picking them up.

ID: 57149 · Rating: 0
Jim1348
Message 57153 - Posted: 5 Jul 2021, 17:28:56 UTC - in response to Message 57149.  

that situation is different. what you're referencing is a temporary communications block when too many computers are at the same physical location (sharing the same external IP address).

I am well familiar with the temporary block. There are two problems present, the second problem is longer-term.
ID: 57153 · Rating: 0
Ian&Steve C.
Message 57154 - Posted: 5 Jul 2021, 17:56:50 UTC - in response to Message 57153.  

that situation is different. what you're referencing is a temporary communications block when too many computers are at the same physical location (sharing the same external IP address).

I am well familiar with the temporary block. There are two problems present, the second problem is longer-term.


can you link to some additional information about this second case? I've never seen that discussed here. only the one I mentioned.

but again, the server is responding, so it's not actually being blocked from communication. the server just always responds that there are no tasks, even though there probably are some at times.
ID: 57154 · Rating: 0
Jim1348
Message 57156 - Posted: 5 Jul 2021, 19:27:53 UTC - in response to Message 57154.  

It has been over a year since I last saw it mentioned. I searched my own posts, but unfortunately the search function does not work correctly.
ID: 57156 · Rating: 0
ServicEnginIC
Joined: 24 Sep 10
Posts: 592
Credit: 11,972,186,510
RAC: 1,447
Message 57157 - Posted: 5 Jul 2021, 21:05:50 UTC

I found this interesting Message #54344 from Retvari Zoltan
Also this other Message #51060 from kksplace
ID: 57157 · Rating: 0
Ian&Steve C.
Message 57158 - Posted: 5 Jul 2021, 21:20:03 UTC - in response to Message 57157.  
Last modified: 5 Jul 2021, 21:22:01 UTC

Thanks for digging, but those are both describing different situations. In the case from Zoltan, the user was getting a message that tasks won’t finish in time. I am not getting any such message. Only that tasks are not available.

And I’m not having any issue with, or getting any messages about, low disk space preventing work.
ID: 57158 · Rating: 0
Keith Myers
Joined: 13 Dec 17
Posts: 1419
Credit: 9,119,446,190
RAC: 891
Message 57159 - Posted: 6 Jul 2021, 0:31:39 UTC

I'm assuming GPUGrid is your only gpu project?

Any past projects that were gpu on those hosts?

You might still have an REC debt to those other projects.

Have you tried the work_fetch_debug or rr_simulation log flags?
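
For anyone following along, here is a minimal sketch of switching those debug flags on. It assumes the flags live in the `<log_flags>` section of `cc_config.xml` in the BOINC data directory, and that the round-robin simulation flag is spelled `rr_simulation` there. `build_cc_config` is a hypothetical helper for illustration, not part of BOINC.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags):
    """Return cc_config.xml text that enables the given log flags."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        # Each flag is written as <flag_name>1</flag_name>.
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")


# Assumed flag names; check them against the client configuration docs.
DEBUG_FLAGS = ["work_fetch_debug", "rr_simulation", "sched_op_debug"]

if __name__ == "__main__":
    print(build_cc_config(DEBUG_FLAGS))
```

The client has to re-read its config files (or be restarted) before the extra lines show up in the event log.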
ID: 57159 · Rating: 0
Ian&Steve C.
Message 57160 - Posted: 6 Jul 2021, 0:54:25 UTC - in response to Message 57159.  
Last modified: 6 Jul 2021, 0:54:42 UTC

GPUGRID is the only non-zero resource GPU project.

i never have to deal with REC, I do not run multiple projects at the same time. only one as prime (GPUGRID) and another as backup (Einstein), so it's always prioritizing GPUGRID.

it's asking for work (1sec, n devices), just never gets any. project always says no tasks available. if it was some REC thing I would get a different response: either something stating that, or the client would not even be asking for work (0sec, 0 devices).
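
A rough sketch of how that distinction could be checked from a saved event log. The two message patterns are assumptions about the client's wording (with the scheduler debug flag enabled) and may need adjusting. `summarize` is a hypothetical helper.

```python
import re

# Assumed event-log wording; adjust the patterns to match your client.
REQUEST_RE = re.compile(r"work request: ([\d.]+) seconds; ([\d.]+) devices")
RESULT_RE = re.compile(r"Scheduler request completed: got (\d+) new tasks")


def summarize(log_lines):
    """Count work-request lines, zero-sized requests, replies and tasks."""
    summary = {"requests": 0, "empty_requests": 0, "replies": 0, "tasks": 0}
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match:
            summary["requests"] += 1
            seconds, devices = float(match.group(1)), float(match.group(2))
            # 0 seconds and 0 devices means the client asked for nothing.
            if seconds == 0 and devices == 0:
                summary["empty_requests"] += 1
            continue
        match = RESULT_RE.search(line)
        if match:
            summary["replies"] += 1
            summary["tasks"] += int(match.group(1))
    return summary
```

Many non-empty requests with replies that never carry a task would match the symptom described here. Mostly empty requests would point at the client's own work-fetch logic instead.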
ID: 57160 · Rating: 0
Ian&Steve C.
Message 57175 - Posted: 8 Jul 2021, 3:57:58 UTC - in response to Message 57160.  

this "feels" like a similar issue being described here. maybe not exactly the same, but something similar at least.

https://einsteinathome.org/content/invalid-global-preferences-problem#comment-169966

but as best I can tell, all of my global_prefs files are the same, but they come from WCG anyway. not sure what that has to do with GPUGRID. and there is no <venue> line item on any of them, including the hosts which get work fine. they are pretty much identical between all hosts.

but it seems the symptoms from something like this are similar. the project just telling you that no tasks are available when they really are.

Richard, do you remember anything about this?

I've updated preferences (just re-instating existing settings) at both WCG and GPUGRID. and also tried removing GPUGRID totally from one host and adding it back. so far nothing changed.
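
The file comparison mentioned above can be sketched roughly like this. It assumes the contents of each host's global_prefs.xml have already been collected as text. The host names and helper functions are hypothetical. The venue check is a plain substring test rather than an XML parse, so a malformed file does not abort the comparison.

```python
import hashlib


def fingerprint(xml_text):
    """Hash the text with whitespace collapsed, so formatting is ignored."""
    normalized = " ".join(xml_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_venue(xml_text):
    """True if a <venue ...> element appears anywhere in the text."""
    return "<venue" in xml_text


def compare_hosts(prefs_by_host):
    """Group host names whose preference files have identical content."""
    groups = {}
    for host, text in prefs_by_host.items():
        groups.setdefault(fingerprint(text), []).append(host)
    return sorted(sorted(hosts) for hosts in groups.values())
```

If the working and non-working hosts all land in a single group and none contains a venue block, the preference files are unlikely to be what separates them.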
ID: 57175 · Rating: 0
Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Message 57176 - Posted: 8 Jul 2021, 7:40:17 UTC - in response to Message 57175.  

Richard, do you remember anything about this?

Yes, I remember it well. Message 150509 was one of my better bits of bug-hunting.

But I also draw your attention to Message 150489:

All my machines have global_prefs_override.xml files, so are functioning normally in spite of the oddities.
ID: 57176 · Rating: 0
Ian&Steve C.
Message 57179 - Posted: 8 Jul 2021, 11:41:46 UTC - in response to Message 57176.  

I have an override file as well. What’s the significance of that? It has my local settings, but what’s the significance to this issue?
ID: 57179 · Rating: 0
Keith Myers
Message 57180 - Posted: 8 Jul 2021, 13:36:56 UTC - in response to Message 57179.  

I have an override file as well. What’s the significance of that? It has my local settings, but what’s the significance to this issue?

An override file always takes precedence over any project preference file.

It is completely local to the host.
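
A minimal sketch of that precedence rule, assuming both files use a `<global_preferences>` root with one element per setting. The element names in the usage line are assumptions, `effective_prefs` is a hypothetical helper, and venue blocks are not handled.

```python
import xml.etree.ElementTree as ET


def effective_prefs(global_xml, override_xml=None):
    """Merge top-level settings; anything in the override file wins."""
    prefs = {
        child.tag: (child.text or "").strip()
        for child in ET.fromstring(global_xml)
    }
    if override_xml:
        for child in ET.fromstring(override_xml):
            # A setting present locally replaces the project-supplied value.
            prefs[child.tag] = (child.text or "").strip()
    return prefs
```

For example, if the project file had `<work_buf_min_days>10</work_buf_min_days>` and the override had `0.1` for the same element, the merged result would carry `0.1`.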
ID: 57180 · Rating: 0
Richard Haselgrove
Message 57181 - Posted: 8 Jul 2021, 13:48:01 UTC - in response to Message 57179.  

The significance lies in the fact that Einstein has re-written large parts of their server code in Drupal. In some respects, their re-write didn't exactly correspond with the original Berkeley PHP (or whatever it was) version of the code.

In particular, the Drupal code barfed when reading the Berkeley version of the global_prefs.xml file, when venues were in use. The Berkeley file was malformed: the BOINC client was relaxed enough to read it, but the Drupal server was stricter and threw an error. That's when scheduler requests failed and no work was sent.

But 'no work' was a specifically Drupal (Einstein project) problem. Work could be fetched from other projects as normal. Einstein solved the problem by modifying their Drupal code so that the missing tag became a non-fatal error - it just wrote a warning into the server log instead. And they fixed the Berkeley code so that projects which updated their server code didn't trigger the bug any more.

So, I don't think the Einstein discussion will be a pointer to the cause of your problem here - even though this project still uses server version 613 (dating to around 2012-2013, before the Einstein fixes of 2016). My machines continue to request GPU work from all three of Einstein, GPUGrid and WCG. Only Einstein has 100% work availability - 'no work available' is common at the other two.
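
To illustrate the strict-versus-relaxed difference described above, here is a small sketch. The malformed fragment is hypothetical (an unclosed venue element) and is not the actual file from the Einstein bug. Both reader functions are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical malformed file: the <venue> element is never closed.
MALFORMED = """<global_preferences>
<work_buf_min_days>0.1</work_buf_min_days>
<venue name="home">
<work_buf_min_days>0.5</work_buf_min_days>
</global_preferences>"""


def strict_read(xml_text, tag):
    """Parse the whole document; raises ET.ParseError on bad XML."""
    return ET.fromstring(xml_text).findtext(tag)


def relaxed_read(xml_text, tag):
    """Grab the first matching element and ignore everything else."""
    match = re.search(rf"<{tag}>([^<]*)</{tag}>", xml_text)
    return match.group(1) if match else None
```

On this input the relaxed reader still returns a value, while the strict one raises an error before returning anything. That matches the split described above, where one side read the file and the other rejected it.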
ID: 57181 · Rating: 0
Ian&Steve C.
Message 57182 - Posted: 8 Jul 2021, 14:18:46 UTC - in response to Message 57181.  
Last modified: 8 Jul 2021, 14:27:23 UTC

I figured the exact Einstein issue was not causing any issue at GPUGRID, just that some of the aspects of that situation feel similar to what's happening now.

I know this happened at seti once before. and I think it was somehow related to how many "days" of work were set in the compute preferences. IIRC if it was set too high, it would request work, but the project always responded that no tasks were available, even when there were. reducing the work request "days" allowed work to finally get sent.

That's what I think is happening now, though not necessarily related to days requested, just some situation LIKE that. nothing changed on my systems, just one day 2 hosts stopped ever getting work, even when it was available (always got no tasks available response). of course it's harder to troubleshoot now that there is much less work available and only the occasional resend.

I've now disabled work fetch for 2 of the "good" systems, and removed the script constantly looking for a top-up on the 3rd. the 3080ti host will now only check for work on BOINC's own logic, while the two bad hosts are trying for work every 100 seconds. so the two bad hosts should have a MUCH greater probability of grabbing work than the 3080ti host, yet the 3080ti host still manages to catch some here and there, and the other two hosts have not received anything since July 1st. always getting the "no tasks available" message. that's what makes me think there's something deeper going on: the behavior is outside of statistical norms.
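
A back-of-the-envelope sketch of that statistical argument. It assumes every task that reaches this group of hosts independently goes to one of them with probability proportional to how often that host polls. The host labels and the 600-second interval for the 3080ti host are hypothetical placeholders.

```python
def p_all_miss(poll_rates, unlucky_hosts, n_tasks):
    """Probability that none of n_tasks lands on any of the unlucky hosts.

    Each task is assumed to go to a host with probability proportional
    to that host's polling rate, independently of the other tasks.
    """
    total = sum(poll_rates.values())
    unlucky = sum(poll_rates[host] for host in unlucky_hosts)
    return (1 - unlucky / total) ** n_tasks


# Two "bad" hosts polling every 100 s; the 3080ti host is assumed
# (hypothetically) to poll about once every 600 s.
rates = {"bad_2070": 1 / 100, "bad_2080": 1 / 100, "good_3080ti": 1 / 600}

if __name__ == "__main__":
    for n in (1, 5, 10):
        print(n, p_all_miss(rates, ["bad_2070", "bad_2080"], n))
```

Under these assumed rates a single task would go to the 3080ti host about 1 time in 13, and the chance that it takes every task shrinks geometrically as more arrive.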
ID: 57182 · Rating: 0
ServicEnginIC
Message 57183 - Posted: 8 Jul 2021, 15:02:52 UTC - in response to Message 57182.  
Last modified: 8 Jul 2021, 15:03:22 UTC

...the other two hosts have not received anything since July 1st. always getting the "no tasks available" message...

July 1st is exactly the date when new application version ACEMD 2.12 (cuda1121) was launched.
And both your problematic hosts haven't received any task of this new version.
Simply coincidence?
I think not.
ID: 57183 · Rating: 0
Ian&Steve C.
Message 57184 - Posted: 8 Jul 2021, 15:06:55 UTC - in response to Message 57183.  

...the other two hosts have not received anything since July 1st. always getting the "no tasks available" message...

July 1st is exactly the date when new application version ACEMD 2.12 (cuda1121) was launched.
And both your problematic hosts haven't received any task of this new version.
Simply coincidence?
I think not.


I agree with this. but so far I can find no difference in the setup of the two bad hosts that would prevent them from getting work. it's the same as on the hosts that are getting work.
ID: 57184 · Rating: 0
ServicEnginIC
Message 57185 - Posted: 8 Jul 2021, 15:50:34 UTC - in response to Message 57184.  

I was thinking of any subtle change in requirements for task sending from server side, more than from your hosts side...
ID: 57185 · Rating: 0
Ian&Steve C.
Message 57186 - Posted: 8 Jul 2021, 17:32:21 UTC - in response to Message 57185.  

I was thinking of any subtle change in requirements for task sending from server side, more than from your hosts side...


yeah but if the hosts look the same from the outside, they should meet the same requirements. I think it's something not so obvious where the server isn't telling me what the problem is.
ID: 57186 · Rating: 0

©2025 Universitat Pompeu Fabra