Optimized bandwidth
| Author | Message |
|---|---|
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
I had connection problems with these DC projects: GPUGrid, Einstein@home, folding@home. However, I didn't have connection problems with TN-Grid or Rosetta@home (I set 24h workunits for Rosetta, so this project may not be affected by this). |
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 8,996
|
"I see nothing obviously wrong, so I hope it's some international connectivity issue." I'm in the USA. These are European BOINC projects that have never behaved like GG: Ibercivis, LHC, TN-GRID, Asteroids, YoYo, Yafu, Universe, QuChemPedIA. I would look at your server configuration some more. |
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 8,996
|
If I stop babysitting GPUGrid (i.e. clicking Retry All on the BoincTasks Transfer tab) for a couple of hours, this is what greets me: https://i.ibb.co/5BM0t5f/an-example-Transfers.jpg

I contacted Fred at eFMer and he pointed out the <refresh> command, which I tested:

    <config>
      <refresh>
        <uploads>60</uploads>
        <downloads>60</downloads>
      </refresh>
    </config>

or

    <config>
      <refresh>
        <auto>60</auto>
      </refresh>
    </config>

But sadly it only works on localhost and does not help my headless fleet. Does anyone know a way to get either BOINC (Retry pending transfers) or BoincTasks to issue the "Retry All" command hourly??? It would also help to eliminate the stifling 2 WU per GPU limitation. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 2
|
"If I stop babysitting, i.e. clicking Retry All on the BoincTasks Transfer tab, GPUGrid for a couple of hours this is what greets me:" you could write a script to issue the retry transfers, then just have it run locally on each system, looping with a timed wait. your systems are hidden so i can't really be more specific since I don't know what kind of setups you have.
|
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 8,996
|
"your systems are hidden so i can't really be more specific since I don't know what kind of setups you have." They're naked as a Jaybird now :-) |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 2
|
ok, since you are running Linux, try this.

you can use the boinccmd tool to retry transfers. you have to script it, since boinccmd only seems to be able to use the retry command on a single transfer; there is no "all" option. this script will search the stuck transfers, grab their file names, then retry them for the given project.

*note: if you have stuck transfers from another project you'll get an error, but you can just ignore that.

create a script with the following content (if using a repository install of BOINC):

    #!/bin/bash
    # list pending transfers, extract each file name, and retry it against GPUGRID
    for i in $(boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p'); do
        boinccmd --file_transfer https://gpugrid.net "$i" retry
    done

name the script something like "update_transfers.sh".

change permissions of the script to make it executable:

    sudo chmod +x update_transfers.sh

run it with the following command from the same directory where the script is saved:

    watch -n 600 ./update_transfers.sh

*replace the value 600 with whatever wait (in seconds) you want.

if you have a user install version of BOINC (i.e. one that does not need to be "installed" and just runs from your home folder), then you need to put the script in the same directory where your boinccmd executable is, and modify the script, replacing "boinccmd" with "./boinccmd".

you'll have to do this on each of your 40ish hosts.
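
For a headless fleet, the same retry loop could probably be driven from one machine instead of being installed on every host. The sketch below is an untested variant built on assumptions: it relies on boinccmd's `--host` and `--passwd` options and assumes each host already accepts remote GUI RPC connections (as hosts managed through BoincTasks generally must). The function names, the `PROJECT_URL` variable, and the host names and password variable in the usage comment are made up for illustration.

```bash
#!/bin/bash
# Sketch only: retry all pending transfers for one project, locally or on a remote host.
# PROJECT_URL, transfer_names and retry_transfers are illustrative names, not part of BOINC.

PROJECT_URL="${PROJECT_URL:-https://gpugrid.net}"

# Read `boinccmd --get_file_transfers` output on stdin and print one file name per line
# (same sed expression as the script above).
transfer_names() {
    sed -n -e 's/^.*name: //p'
}

# Usage: retry_transfers [boinccmd connection options, e.g. --host HOST --passwd PASSWORD]
# With no arguments it talks to the local client.
retry_transfers() {
    local name
    boinccmd "$@" --get_file_transfers | transfer_names |
        while IFS= read -r name; do
            # </dev/null keeps boinccmd from consuming the list of names on stdin
            boinccmd "$@" --file_transfer "$PROJECT_URL" "$name" retry </dev/null
        done
}

# Example (hypothetical host names and password variable):
#   for h in rig01 rig02; do retry_transfers --host "$h" --passwd "$RPC_PASSWORD"; done
```

Run from cron (hourly, for example), a loop like the one in the closing comment would stand in for the manual "Retry All" click. The password for each host is normally the contents of that host's gui_rpc_auth.cfg file.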
|
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 2
|
but the real problem is the short cooldown time from the project combined with a block on communications from the same IP (it seems they have a 30 second timer on that). they need to fix that.

basically, if most of your computers are at the same location, and thus have the same external IP address, the GPUGRID servers will only allow communication from 1 system at a time. when you have a lot of systems, it's very likely that several will try to communicate with the project at the same time or within a very short interval. the project seems to reject these successive requests, BOINC thinks there's a problem, and eventually just stops trying until you manually intervene. this is exacerbated by the short cooldown time. i think it's like 10 seconds or something ridiculously short, so the project is kind of forcing this behavior. if you only have 1 or 2 systems, especially if they are rather slow, you'll rarely or never run into this problem.

they need to fix one or both of these settings, either by shortening or turning off the IP block timer that they have set up, or by changing their project cooldown to something much longer, like 10 minutes.
|
|
Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 8
|
"they need to fix one or both of these settings. either by shortening or turning off the IP block timer that they have setup, or by changing their project cooldown to something much longer, like 10 minutes." +100 |
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 8,996
|
Thanks for the script. I installed it, but I also upgraded the Nvidia driver to 455.23.04, and when I rebooted I lost the headless computer. Is anyone seeing a problem with 455.23.04???

Yes, all my rigs are behind the same external IP address. I don't understand what a "project cooldown time" is. I thought it meant GG won't answer the phone for xx seconds, but then 10 seconds is good and 10 minutes is bad, and that's the opposite of what you said.

Yea, today was the last day of TOU season so I can crunch 24x7 for the next 6 months. I'm watching BoincTasks All Computers Transfers and everything is flying and working well. I expect that as the first round of 2080 WUs finish they'll start entering this banishment mode. Then the 1080s will follow into banishment.

You'd think these guys would want to get more work done faster instead of forcing less work to get done slower. They've implemented 3 things that reduce my throughput by about 75%: max 2 WUs per GPU, the IP blocker and the short cooldown time. I'm fine with 2 WU/GPU if they'd deliver work quickly.

WCG implemented their per-project limits under Device Profiles: Project Limits: "The following settings allow you to set the maximum number of tasks assigned to one of your devices for a project." So I set the limits to threads+2 and return all WUs within 24 hours, even the massive ARPs. I'd like to do the same for GG but they've hog-tied me. This is really frustrating since this is the only BOINC life science GPU project available. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 2
|
"cooldown" is an unofficial term. I don't know what it's called on the project server side, maybe "requested delay"? you can see it in your event log where it says "Project requested a delay of.." and then "Deferring communications for...".

basically, when you communicate with the project for a schedule request, after the request is completed the project always tells your system to wait some amount of time before trying to communicate again. this is standard BOINC behavior and every project has a different delay pre-set in their server configuration. SETI was 303 seconds. Einstein is like 60 seconds.

in the case of GPUGRID, that time is 31 seconds, which is much too short when you have many fast systems at the same IP. you don't need to be asking the project for more work every 30 seconds when it takes 20-60+ minutes to run a single WU.
|
|
Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
|
Not sure it is a BOINC server setting stopping the communications, as it affects access to the entire web site as well. It is more likely to be at the perimeter of the network, probably part of a network defence strategy against DDOS and similar style attacks. The settings may be out of Gpugrid's hands and controlled by the Network Administrators at UPF. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 2
|
it's a project server setting that controls the requested delay. the IP blocking timeout must be something set up on their network.

it's the combination of the short deferral time and the IP block timeout. either thing by itself doesn't cause a problem. i've forced my systems to go into a cooldown for 10 minutes after each schedule request (using a custom BOINC client) and that fixed all issues for my systems. the IP block timer is still in effect, but they no longer get stuck in idle because the chance that one is trying to communicate at the same time as another system is drastically reduced. the way things are by default almost guarantees that if you have more than 1 fast computer, you will have issues.

but the requested delay is absolutely in their control to change. I've seen many other projects adjust this value when they wanted to. there's no reason GPUGRID can't change it either, since they are using the BOINC server software.
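
A rough way to see why a longer deferral helps is a toy calculation. It is not a measurement, and the thread confirms none of its assumptions: every host behind one IP is taken to contact the server once per deferral period at an independent random moment, and a request is taken to fail if any other host's request landed inside the preceding block window. The function name and the model itself are illustrative; the 30 second window, the 31 second delay and the 10 minute cooldown are the figures quoted in this thread.

```bash
#!/bin/bash
# Toy model only: chance that a request collides with another host's request.
# Usage: collision_chance HOSTS BLOCK_WINDOW_SECONDS DELAY_SECONDS
collision_chance() {
    awk -v n="$1" -v w="$2" -v t="$3" 'BEGIN {
        p = w / t                  # chance one other host falls inside the block window
        if (p > 1) p = 1           # a window longer than the delay always collides
        printf "%.3f\n", 1 - (1 - p) ^ (n - 1)   # at least one of the n-1 other hosts collides
    }'
}

# Example: two hosts, 30 s window, 600 s (10 minute) delay
#   collision_chance 2 30 600    prints 0.050
```

Under these worst-case assumptions, two hosts with the default 31 second delay would collide almost every time, while a 10 minute delay brings that down to about one request in twenty. Real hosts contact the server far less regularly, so the numbers only indicate the direction of the effect.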
|
|
Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 8
|
Some projects are even more "busy" than GPUGrid. When I joined Universe I found it had a project delay time of 11 seconds. Totally ridiculous. No host or download server needs to be polled that often. So the first thing I did was put in a 120 second cooldown for that project for my client. As Ian states, that parameter is set in the project server software. |
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 8,996
|
Not a peep from GG staff. You'd think lengthening the cooldown period would be trivial to try. This problem is just getting worse for me. |
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 8,996
|
All I hear are crickets ;-( |
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 8,996
|
Will GPUGrid ever outgrow the need for a babysitter??? |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
"Will GPUGrid ever outgrow the need for a babysitter???" While we're waiting (probably in vain) for that, I've figured out the only way to mitigate this problem on our side: you should reduce the work cache settings on all of your hosts to roughly match the shortest workunits your host crunches. That way the host will ask for a new task only when the server will actually send a new (spare) one, so there will be no futile work requests. That results in a lower chance of getting your WAN IP address "banned" for a time period, so your other hosts (behind the same WAN IP) have a bigger chance of getting work as well. As there is plenty of work available at the moment, a new WU will be sent for sure, provided your IP is not "banned". (Note that the GPUGrid server will send only two workunits per GPU for a given host.)

The actual value depends on the GPU and the workunit too. As the present workunits are quite short, we should set a very short queue. I've set 0.01(+0) days on my host with a 2080Ti. This made my connection to the GPUGrid servers much less "lagging". 0.01 days is the lowest you can set, and it is also the 'unit' for the size of the cache (so you can't set 0.015 days). With lesser cards you can try higher values:

    days  seconds    h:m:s
    0.01      864    14:24
    0.02     1728    28:48
    0.03     2592    43:12
    0.04     3456    57:36
    0.05     4320  1:12:00
    0.06     5184  1:26:24
    0.07     6048  1:40:48
    0.08     6912  1:55:12
    0.09     7776  2:09:36
    0.10     8640  2:24:00

As my hosts (except for one) are crunching folding@home at the moment (btw my team is the 789th as I write this), I haven't tested it with getting work on multiple hosts, only by browsing the GPUGrid forums on my other PC. (But it should have an effect on getting work too.) I'm curious whether this fix works for you or not. |
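
The conversion behind that table, and one way the setting might be pushed to headless hosts, can be sketched as follows. The function names are illustrative. The mapping of "0.01(+0) days" onto the `work_buf_min_days` and `work_buf_additional_days` preference tags is an assumption about how the two cache fields correspond, and nobody in this thread has tried it.

```bash
#!/bin/bash
# Sketch: helpers for picking and writing a work cache size. Function names are illustrative.

# cache_days_to_time DAYS -> prints "<seconds> <h:mm:ss>"
cache_days_to_time() {
    awk -v d="$1" 'BEGIN {
        s = int(d * 86400 + 0.5)   # one day is 86400 seconds; round to a whole second
        printf "%d %d:%02d:%02d\n", s, int(s / 3600), int((s % 3600) / 60), s % 60
    }'
}

# print_cache_override MIN_DAYS EXTRA_DAYS -> prints a global_prefs_override.xml body
print_cache_override() {
    printf '<global_preferences>\n'
    printf '  <work_buf_min_days>%s</work_buf_min_days>\n' "$1"
    printf '  <work_buf_additional_days>%s</work_buf_additional_days>\n' "$2"
    printf '</global_preferences>\n'
}

# Example: cache_days_to_time 0.01    prints "864 0:14:24"
```

Saving the output of `print_cache_override 0.01 0` as global_prefs_override.xml in a host's BOINC data directory and then running `boinccmd --read_global_prefs_override` should make the client pick it up. Note that this replaces any override file already present on that host.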
|
Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
|
"Will GPUGrid ever outgrow the need for a babysitter???" "While we're waiting (probably in vain) for that I've figured out the only way to mitigate this problem on our side:" This approach definitely has merit, but would rely on a large percentage of Gpugrid users applying this method for any results to be seen. Let's see how many people try this. |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
"This approach definitely has merit, but would rely on a large percentage of Gpugrid users applying this method for any results to be seen." No. When my WAN IP gets blocked by gpugrid's DDOS prevention because my hosts issue too many requests in rapid succession to www.gpugrid.net, it does not have any effect on any other user's WAN IP blocking. |
|
Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0
|
"This approach definitely has merit, but would rely on a large percentage of Gpugrid users applying this method for any results to be seen." "No. When my WAN IP gets blocked by gpugrid's DDOS prevention due to my hosts issue too many requests in rapid succession for www.gpugrid.net, it does not have any effect on any other user's WAN IP blocking." So are we dealing with DDOS, contention from a saturated link, rate limiting on an under-resourced link, a badly configured router, QOS putting our connections at the bottom of the list, or a combination of these factors? Comments by volunteers in the forum indicate that DDOS and a saturated link are quite likely. The other options listed above also merit consideration. The title of this thread also suggests the rules on the network edge equipment have been modified to change bandwidth allocation. I guess we will never really know unless it is identified by Gpugrid. We are really just hypothesizing. It passes the time... and gives us a distraction. |
©2026 Universitat Pompeu Fabra