Optimized bandwidth

Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0	Message 54609 - Posted: 7 May 2020, 12:01:56 UTC
I had connection problems with these DC projects:
    GPUGrid
    Einstein@home
    folding@home
However, I didn't have connection problems with:
    TN-Grid
    Rosetta@home (I set 24h workunits for Rosetta, so this project may not be affected by this).

Aurum Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 32	Message 55300 - Posted: 16 Sep 2020, 16:30:26 UTC - in response to Message 54596.
I see nothing obviously wrong, so I hope it's some international connectivity issue. I'm in the USA. These are European BOINC projects that have never behaved like GG:
    Ibercivis
    LHC
    TN-GRID
    Asteroids
    YoYo
    Yafu
    Universe
    QuChemPedIA
I would look at your server configuration some more.

Aurum Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 32	Message 55387 - Posted: 30 Sep 2020, 16:25:06 UTC Last modified: 30 Sep 2020, 16:42:41 UTC
If I stop babysitting GPUGrid (i.e. clicking Retry All on the BoincTasks Transfer tab) for a couple of hours, this is what greets me:
https://i.ibb.co/5BM0t5f/an-example-Transfers.jpg
I contacted Fred at eFMer and he pointed out the <refresh> command, which I tested:
    <config>
      <refresh>
        <uploads>60</uploads>
        <downloads>60</downloads>
      </refresh>
    </config>
or
    <config>
      <refresh>
        <auto>60</auto>
      </refresh>
    </config>
But sadly it only works on localhost and does not help my headless fleet.
Does anyone know a way to get either BOINC (Retry pending transfers) or BoincTasks to issue the "Retry All" command hourly?
It would also help to eliminate the stifling 2 WU per GPU limitation.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 0	Message 55389 - Posted: 30 Sep 2020, 17:29:39 UTC - in response to Message 55387. Last modified: 30 Sep 2020, 17:32:35 UTC
Quote:
    If I stop babysitting GPUGrid (i.e. clicking Retry All on the BoincTasks Transfer tab) for a couple of hours, this is what greets me:
    https://i.ibb.co/5BM0t5f/an-example-Transfers.jpg
    I contacted Fred at eFMer and he pointed out the <refresh> command, which I tested:
        <config>
          <refresh>
            <uploads>60</uploads>
            <downloads>60</downloads>
          </refresh>
        </config>
    or
        <config>
          <refresh>
            <auto>60</auto>
          </refresh>
        </config>
    But sadly it only works on localhost and does not help my headless fleet.
    Does anyone know a way to get either BOINC (Retry pending transfers) or BoincTasks to issue the "Retry All" command hourly?
    It would also help to eliminate the stifling 2 WU per GPU limitation.
You could write a script to issue the retry transfers, then just have it run locally on each system, looping with a timed wait.
Your systems are hidden, so I can't really be more specific since I don't know what kind of setups you have.

Aurum Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 32	Message 55392 - Posted: 30 Sep 2020, 18:20:02 UTC - in response to Message 55389.
Quote:
    your systems are hidden so i can't really be more specific since I don't know what kind of setups you have.
They're naked as a Jaybird now :-)

Ian&Steve C. Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 0	Message 55393 - Posted: 30 Sep 2020, 18:50:55 UTC - in response to Message 55392. Last modified: 30 Sep 2020, 18:57:40 UTC
OK, since you are running Linux, try this.
You can use the boinccmd tool to retry transfers. You have to script it, since the boinccmd tool only seems to be able to use the retry command on a single transfer; there is no "all" option. This script will search the stuck transfers, grab their file names, then retry them for the given project.
Note: if you have stuck transfers from another project you'll get an error, but you can just ignore that.

Create a script with the following content if you are using a repository install of BOINC:

    #!/bin/bash
    # Retry every pending file transfer for GPUGrid.
    for i in $(boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p'); do
        boinccmd --file_transfer https://gpugrid.net "$i" retry
    done

Name the script something like "update_transfers.sh".
Change the permissions of the script to make it executable:

    sudo chmod +x update_transfers.sh

Run it with the following command from the same directory where the script is saved:

    watch -n 600 ./update_transfers.sh

Replace the value 600 with whatever wait (in seconds) you want.

If you have a user install version of BOINC (i.e. one that does not need to be "installed" and just runs from your home folder), then you need to put the script in the same directory where your boinccmd executable is, and modify the script, replacing "boinccmd" with "./boinccmd".
You'll have to do this on each of your 40ish hosts.
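For anyone who would rather keep the parsing step and the retry step separate, the same idea can be sketched as two small shell functions. This is only a sketch under assumptions: it assumes that `boinccmd --get_file_transfers` prints each transfer on an indented `name: <file>` line (which is what the script above relies on), and the function names and the `BOINCCMD` variable are made up for illustration, not part of BOINC.

```shell
#!/bin/bash
# Print the file names found in `boinccmd --get_file_transfers` output (read from stdin).
# Assumption: each transfer is reported on an indented "name: <file>" line.
transfer_names() {
    sed -n -e 's/^[[:space:]]*name: //p'
}

# Retry every listed transfer for one project URL.
# BOINCCMD is a hypothetical override, e.g. BOINCCMD=./boinccmd for a home-folder install.
retry_transfers() {
    local project_url="$1"
    local name
    "${BOINCCMD:-boinccmd}" --get_file_transfers | transfer_names |
    while IFS= read -r name; do
        "${BOINCCMD:-boinccmd}" --file_transfer "$project_url" "$name" retry
    done
}

# Usage (not run automatically): retry_transfers https://gpugrid.net
```

Reading names line by line also copes with file names that contain spaces, which the plain `for` loop would split. As with the original script, transfers that belong to another project will just produce an error that can be ignored.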

Ian&Steve C. Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 0	Message 55394 - Posted: 30 Sep 2020, 19:40:12 UTC Last modified: 30 Sep 2020, 19:44:07 UTC
But the real problem is the short cooldown time from the project combined with a block on communications from the same IP (it seems they have a 30 second timer on that). They need to fix that.
Basically, if most of your computers are at the same location, and thus have the same external IP address, the GPUGRID servers will only allow communication from 1 system at a time. When you have a lot, it's very likely that multiple systems will try to communicate with the project in some way at the same time or within a very short interval. The project seems to reject these successive requests, BOINC thinks there's a problem, and eventually it just stops trying until you manually intervene.
This is exacerbated by the short cooldown time. I think it's like 10 seconds or something ridiculously short, so the project is kind of forcing this behavior. If you only have 1 or 2 systems, especially if they are rather slow, you'll rarely or never run into this problem.
They need to fix one or both of these settings, either by shortening or turning off the IP block timer that they have set up, or by changing their project cooldown to something much longer, like 10 minutes.
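One thing that can be done on the client side is to avoid having every host behind the same IP fire its retry script at the same moment. A minimal sketch, assuming the retry script is run in a loop on each host: add a random offset to the wait. The function name and both arguments are hypothetical tuning knobs, not BOINC settings, and this only spreads out the script's own calls. It does not change the BOINC client's scheduling or anything on the project side.

```shell
#!/bin/bash
# Print a wait time in seconds: a fixed base plus a random offset in [0, spread).
# Randomness comes from /dev/urandom so hosts started at the same second still differ.
jittered_wait() {
    local base="$1"
    local spread="$2"
    local n
    n="$(od -An -N2 -tu2 /dev/urandom | tr -d ' \n')"
    echo $(( base + n % spread ))
}

# Example loop (commented out so that sourcing this file has no side effects):
# while true; do ./update_transfers.sh; sleep "$(jittered_wait 600 120)"; done
```

With a base of 600 and a spread of 120, each host waits somewhere between 10 and 12 minutes between retries, so a fleet that was started together drifts apart over time.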

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 0	Message 55399 - Posted: 30 Sep 2020, 22:15:43 UTC
Quote:
    they need to fix one or both of these settings. either by shortening or turning off the IP block timer that they have setup, or by changing their project cooldown to something much longer, like 10 minutes.
+100

Aurum Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 32	Message 55402 - Posted: 1 Oct 2020, 1:50:57 UTC Last modified: 1 Oct 2020, 1:52:28 UTC
Thanks for the script. I installed it, but I also upgraded the Nvidia driver to 455.23.04 and when I rebooted I lost the headless computer. Is anyone seeing a problem with 455.23.04?
Yes, all my rigs are behind the same external IP address.
I don't understand what a "project cooldown time" is. I thought it meant GG won't answer the phone for xx seconds, but then 10 seconds is good and 10 minutes is bad, and that's the opposite of what you said.
Yea, today was the last day of TOU season so I can crunch 24x7 for the next 6 months. I'm watching BoincTasks All Computers Transfers and everything is flying and working well. I expect that as the first round of 2080 WUs finish they'll start entering this banishment mode. Then the 1080s will follow into banishment.
You'd think these guys would want to get more work done faster instead of forcing less work to get done slower. They've implemented 3 things that reduce my throughput by about 75%: max 2 WUs per GPU, the IP blocker and the short cooldown time.
I'm fine with 2 WU/GPU if they'd deliver work quickly. WCG implemented per-project limits in their Device Profiles, under Project Limits: "The following settings allow you to set the maximum number of tasks assigned to one of your devices for a project." So I set the limits to threads+2 and return all WUs within 24 hours, even the massive ARPs. I'd like to do the same for GG but they've hog-tied me.
This is really frustrating since this is the only BOINC life science GPU project available.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 0	Message 55403 - Posted: 1 Oct 2020, 2:51:46 UTC - in response to Message 55402. Last modified: 1 Oct 2020, 3:00:08 UTC
Quote:
    I don't understand what a "project cooldown time" is. I thought it was GG won't answer the phone for xx seconds but then 10 seconds is good and 10 minutes is bad and that's the opposite of what you said.
"Cooldown" is an unofficial term. I don't know what it's called on the project server side, maybe "requested delay"? You can see it in your event log where it says "Project requested a delay of.." and then "Deferring communications for...".
Basically, when you communicate with the project for a schedule request, after the request is completed the project always tells your system to wait some amount of time before trying to communicate again. This is standard BOINC behavior and every project has a different delay pre-set in its server configuration. SETI was 303 seconds. Einstein is like 60 seconds.
In the case of GPUGRID, that time is 31 seconds, which is much too short when you have many fast systems at the same IP. You don't need to be asking the project for more work every 30 seconds when it takes 20-60+ minutes to run a single WU.
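To check what delay a project is asking for without opening the event log in a GUI, the log text can be filtered from the command line. A minimal sketch, assuming the message wording is close to the phrases quoted above (the pattern accepts "requested delay" with or without "a", and "communication" with or without the trailing "s", since the exact wording may differ between client versions). The function name is made up for illustration.

```shell
#!/bin/bash
# Filter event-log text on stdin down to the requested-delay and deferral lines.
# Assumption: the client logs phrases like "Project requested [a] delay of ..."
# and "Deferring communication[s] for ...".
delay_messages() {
    grep -E 'Project requested (a )?delay of|Deferring communications? for'
}

# Usage (not run automatically): boinccmd --get_messages | delay_messages
```

On a headless host this gives a quick view of how often the client is being told to back off, and for how long.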

rod4x4 Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0	Message 55404 - Posted: 1 Oct 2020, 3:33:05 UTC - in response to Message 55403.
Quote:
    Quote:
        I don't understand what a "project cooldown time" is. I thought it was GG won't answer the phone for xx seconds but then 10 seconds is good and 10 minutes is bad and that's the opposite of what you said.
    "Cooldown" is an unofficial term. I don't know what it's called on the project server side, maybe "requested delay"? You can see it in your event log where it says "Project requested a delay of.." and then "Deferring communications for...".
    Basically, when you communicate with the project for a schedule request, after the request is completed the project always tells your system to wait some amount of time before trying to communicate again. This is standard BOINC behavior and every project has a different delay pre-set in its server configuration. SETI was 303 seconds. Einstein is like 60 seconds.
    In the case of GPUGRID, that time is 31 seconds, which is much too short when you have many fast systems at the same IP. You don't need to be asking the project for more work every 30 seconds when it takes 20-60+ minutes to run a single WU.
I'm not sure it is a BOINC server setting that is stopping the communications, as it also affects access to the entire web site.
It is more likely to be at the perimeter of the network, probably part of a network defence strategy against DDOS and similar attacks. The settings may be out of Gpugrid's hands and controlled by the network administrators at UPF.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 0	Message 55405 - Posted: 1 Oct 2020, 3:55:19 UTC - in response to Message 55404. Last modified: 1 Oct 2020, 3:57:59 UTC
Quote:
    Not sure it is a BOINC server setting stopping the communications as it also affects the entire web site access as well. It is more likely to be at the perimeter of the network, probably part of network defence strategy against DDOS and similar style attacks. The settings may be out of Gpugrids hands and controlled by the Network Administrators at UPF.
It's a project server setting that controls the requested delay. The IP blocking timeout must be something set up on their network.
It's the combination of the short deferral time and the IP block timeout. Either thing by itself doesn't cause a problem.
I've forced my systems to go into a cooldown for 10 minutes after each schedule request (using a custom BOINC client) and that fixed all issues for my systems. The IP block timer is still in effect. They no longer get stuck in idle, because the chance that one is trying to communicate at the same time as another system is drastically reduced. The way things are by default almost guarantees that if you have more than 1 fast computer, you will have issues.
But the requested delay is absolutely in their control to change. I've seen many other projects adjust this value when they wanted to. There's no reason GPUGRID can't change it either, since they are using the BOINC server software.

Keith Myers Joined: 13 Dec 17 Posts: 1424 Credit: 9,189,946,190 RAC: 0	Message 55406 - Posted: 1 Oct 2020, 7:03:59 UTC
Some projects are even more "busy" than GPUGrid. When I joined Universe I found it had a project delay time of 11 seconds. Totally ridiculous. No host or download server needs to be polled that often. So the first thing I did was put in a 120 second cooldown for that project for my client.
As Ian states, that parameter is set in the project server software.

Aurum Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 32	Message 55419 - Posted: 4 Oct 2020, 10:39:07 UTC
Not a peep from GG staff. You'd think lengthening the cooldown period would be trivial to try.
This problem is just getting worse for me.

Aurum Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 32	Message 55432 - Posted: 5 Oct 2020, 21:50:14 UTC
All I hear are crickets ;-(

Aurum Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 32	Message 55532 - Posted: 10 Oct 2020, 9:50:21 UTC
Will GPUGrid ever outgrow the need for a babysitter???

Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0	Message 55535 - Posted: 10 Oct 2020, 20:58:23 UTC - in response to Message 55532. Last modified: 10 Oct 2020, 21:00:49 UTC
Quote:
    Will GPUGrid ever outgrow the need for a babysitter???
While we're waiting (probably in vain) for that, I've figured out the only way to mitigate this problem on our side:
You should reduce the work cache settings on all of your hosts to roughly match the shortest workunits your host crunches.
This way the host will ask for a new task only when the server will actually send a new (spare) one, so there will be no futile work requests. That results in a lower chance of getting your WAN IP address "banned" for a time period, so your other hosts (behind the same WAN IP) have a bigger chance of getting work as well. As there is plenty of work available at the moment, a new WU will be sent for sure, provided your IP is not "banned". (Note that the GPUGrid server will send only two workunits per GPU for a given host.)
The actual value depends on the GPU and on the workunit too; as the present workunits are quite short, we should set a very short queue.
I've set 0.01(+0) days on my host with a 2080Ti. This made my connection to the GPUGrid servers much less "lagging".
0.01 days is the lowest you can set; this is also the 'unit' for the size of the cache (so you can't set 0.015 days).
With lesser cards you can try higher values:

    days    seconds    h:m:s
    0.01      864        14:24
    0.02     1728        28:48
    0.03     2592        43:12
    0.04     3456        57:36
    0.05     4320      1:12:00
    0.06     5184      1:26:24
    0.07     6048      1:40:48
    0.08     6912      1:55:12
    0.09     7776      2:09:36
    0.10     8640      2:24:00

As my hosts (except for one) are crunching folding@home at the moment (btw my team is the 789th as I write this), I haven't tested it with getting work on multiple hosts, only by browsing the GPUGrid forums on my other PC. (But it should have an effect on getting work too.)
I'm curious whether that fix works for you or not.
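The conversion in the table above is just days times 86400, which is easy to script when trying other values. A minimal sketch follows; the function names are made up for illustration. The second function prints a preferences override using the `work_buf_min_days` and `work_buf_additional_days` elements of BOINC's `global_prefs_override.xml`. Where that file lives depends on the install, and writing it replaces any overrides already in place, so treat this as a starting point rather than a recipe.

```shell
#!/bin/bash
# Convert a work-cache size in days to "<seconds> <h:mm:ss>".
days_to_hms() {
    awk -v d="$1" 'BEGIN {
        s = int(d * 86400 + 0.5)   # round to the nearest whole second
        printf "%d %d:%02d:%02d\n", s, int(s / 3600), int((s % 3600) / 60), s % 60
    }'
}

# Print a global_prefs_override.xml body for the given cache sizes (in days).
# Arguments: minimum buffer, additional buffer (e.g. 0.01 and 0).
cache_override_xml() {
    printf '<global_preferences>\n'
    printf '  <work_buf_min_days>%s</work_buf_min_days>\n' "$1"
    printf '  <work_buf_additional_days>%s</work_buf_additional_days>\n' "$2"
    printf '</global_preferences>\n'
}

# Usage (not run automatically):
#   days_to_hms 0.06
#   cache_override_xml 0.01 0 > global_prefs_override.xml   # in the BOINC data directory
#   boinccmd --read_global_prefs_override
```

On a headless fleet this avoids clicking through the preferences dialog on every host, although the same values can of course be set through the project or account manager web preferences instead.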

rod4x4 Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0	Message 55539 - Posted: 11 Oct 2020, 0:56:52 UTC - in response to Message 55535.
Quote:
    Quote:
        Will GPUGrid ever outgrow the need for a babysitter???
    While we're waiting (probably in vain) for that, I've figured out the only way to mitigate this problem on our side:
    You should reduce the work cache settings on all of your hosts to roughly match the shortest workunits your host crunches.
    This way the host will ask for a new task only when the server will actually send a new (spare) one, so there will be no futile work requests. That results in a lower chance of getting your WAN IP address "banned" for a time period, so your other hosts (behind the same WAN IP) have a bigger chance of getting work as well. As there is plenty of work available at the moment, a new WU will be sent for sure, provided your IP is not "banned". (Note that the GPUGrid server will send only two workunits per GPU for a given host.)
    The actual value depends on the GPU and on the workunit too; as the present workunits are quite short, we should set a very short queue.
    I've set 0.01(+0) days on my host with a 2080Ti. This made my connection to the GPUGrid servers much less "lagging".
    0.01 days is the lowest you can set; this is also the 'unit' for the size of the cache (so you can't set 0.015 days).
    With lesser cards you can try higher values:

        days    seconds    h:m:s
        0.01      864        14:24
        0.02     1728        28:48
        0.03     2592        43:12
        0.04     3456        57:36
        0.05     4320      1:12:00
        0.06     5184      1:26:24
        0.07     6048      1:40:48
        0.08     6912      1:55:12
        0.09     7776      2:09:36
        0.10     8640      2:24:00

    As my hosts (except for one) are crunching folding@home at the moment (btw my team is the 789th as I write this), I haven't tested it with getting work on multiple hosts, only by browsing the GPUGrid forums on my other PC. (But it should have an effect on getting work too.)
    I'm curious whether that fix works for you or not.
This approach definitely has merit, but would rely on a large percentage of Gpugrid users applying this method for any results to be seen.
Let's see how many people try this.

Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0	Message 55544 - Posted: 11 Oct 2020, 13:21:05 UTC - in response to Message 55539. Last modified: 11 Oct 2020, 13:21:40 UTC
Quote:
    This approach definitely has merit, but would rely on a large percentage of Gpugrid users applying this method for any results to be seen.
No. When my WAN IP gets blocked by gpugrid's DDOS prevention because my hosts issue too many requests in rapid succession to www.gpugrid.net, it does not have any effect on the blocking of any other user's WAN IP.

rod4x4 Joined: 4 Aug 14 Posts: 266 Credit: 2,219,935,054 RAC: 0	Message 55545 - Posted: 11 Oct 2020, 14:53:18 UTC - in response to Message 55544. Last modified: 11 Oct 2020, 15:04:30 UTC
Quote:
    Quote:
        This approach definitely has merit, but would rely on a large percentage of Gpugrid users applying this method for any results to be seen.
    No. When my WAN IP gets blocked by gpugrid's DDOS prevention because my hosts issue too many requests in rapid succession to www.gpugrid.net, it does not have any effect on the blocking of any other user's WAN IP.
So are we dealing with DDOS, contention from a saturated link, rate limiting on an under-resourced link, a badly configured router, QOS putting our connections at the bottom of the list, or a combination of these factors?
Comments by volunteers in the forum indicate that DDOS and a saturated link are quite likely. The other options listed above also deserve consideration. The title of this thread also suggests that the rules on the network edge equipment have been modified to change bandwidth allocation.
I guess we will never really know unless it is identified by Gpugrid. We are really just hypothesizing. It passes the time... and gives us a distraction.