Optimized bandwidth

Ian&Steve C.
Message 55546 - Posted: 11 Oct 2020, 15:26:30 UTC
Last modified: 11 Oct 2020, 15:27:01 UTC

if you can find a way to prevent your system from communicating with the project for longer than the default 31 second comms delay, you will solve the problem.

or the project could simply increase that delay so you don't have to find some workaround yourself.

increasing the delay will not change the presence of the DDOS protections, but it will prevent the users from hitting them. it's really the best solution. it just seems like the project admins either don't know where this setting is in their own server software or the suggestion is falling on deaf ears.

this 100% solves the problem.

Retvari Zoltan
Message 55547 - Posted: 11 Oct 2020, 17:33:06 UTC - in response to Message 55546.  
Last modified: 11 Oct 2020, 17:41:57 UTC

increasing the [default 31 seconds communications] delay will not change the presence of the DDOS protections, but it will prevent the users from hitting them.
I wonder about the ideal length of that delay. We don't know the exact rules of the DDOS protection that is hitting us; those rules, combined with the number of hosts a given user has behind a single WAN IP, would determine the ideal delay length. The delay can't be longer than the shortest workunit on the fastest GPU, because that would make those hosts starve (so it would be no better than the DDOS protection making those hosts starve).
Taking the signs of the present DDOS protection and the present short workunits into consideration, I think there is a maximum number of hosts behind a single WAN IP that can work without some of them starving. Above that number, some random host behind that WAN IP will inevitably be hit by the DDOS protection.
So I think the delay should be around 600 seconds (10 minutes), but the length of the workunits should also be at least doubled, or even quadrupled. The present workunits should be in the short queue (for lesser GPUs).
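
A rough sketch of the arithmetic behind this, assuming every host behind the WAN IP contacts the server once per delay period. The function name is hypothetical, and the real thresholds of the DDOS protection are not known, so this only shows how the request rate scales with the delay:

```shell
#!/bin/sh
# Hypothetical helper: estimate scheduler requests per minute from one WAN IP,
# assuming each of $1 hosts contacts the project once every $2 seconds.
requests_per_minute() {
    hosts=$1
    delay=$2
    awk -v h="$hosts" -v d="$delay" 'BEGIN { printf "%.1f\n", h * 60 / d }'
}
```

For example, `requests_per_minute 40 31` and `requests_per_minute 40 600` compare 40 hosts at the default 31 second delay against the same hosts at a 10 minute delay.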

Ian&Steve C.
Message 55551 - Posted: 12 Oct 2020, 3:11:26 UTC - in response to Message 55547.  

A 10 min delay seems pretty ideal to me. The fastest WUs I see on my 2080ti take about 15 mins. The 2080ti is about the fastest card right now (Titan RTX is barely faster, RTX 8000 is a little slower) until Ampere support is added and we can properly gauge how the 30-series cards will perform.

I have no objection to the longer WUs, but they should be able to be restarted. Every time I have a power outage and my system was in the middle of some of those older PABLO WUs, I lose several hours of work since they error out immediately when they try to resume.

I’m aware of the issue with restarting tasks on a different device, but all of my systems run identical GPUs within the same system (not just the identical type; I run cards with the same part number/SKU within the same system). It even happens on my single-GPU system, so there’s really no excuse there.

10 mins is the delay I run. And it definitely solved the problem. But again the most elegant solution is to have the project make a global change server side so that the clients don’t have to individually implement a workaround.

Aurum
Message 55553 - Posted: 12 Oct 2020, 14:53:29 UTC - in response to Message 55535.  

You should reduce your work cache settings on all of your hosts to roughly match the shortest workunits your host crunches.
I've tried this before but not to the extreme of 0.01/0.01. As I recall it reduces the chances of getting big WUs such as ARP & HST from WCG. I'll give it a try on all my computers today but it'll probably take a day to see if it's working. I'm certain that 0.5/0.1 does not help GG.
Note that the GPUGrid server will send only two workunits per GPU for a given host.
This is part of the problem.
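
For reference, the two cache values discussed here can also be set locally through BOINC's global_prefs_override.xml instead of the website preferences. A minimal sketch, assuming you know your BOINC data directory (the function name is hypothetical, and it overwrites any override file already there):

```shell
#!/bin/sh
# Write a preferences override containing only the two work cache settings.
# $1 = BOINC data directory
# $2 = "store at least" days, $3 = "store up to an additional" days
write_cache_override() {
    cat > "$1/global_prefs_override.xml" <<EOF
<global_preferences>
   <work_buf_min_days>$2</work_buf_min_days>
   <work_buf_additional_days>$3</work_buf_additional_days>
</global_preferences>
EOF
}
```

Something like `write_cache_override /path/to/boinc/data 0.01 0.01` followed by `boinccmd --read_global_prefs_override` should make the client pick up the 0.01/0.01 values; the path shown is a placeholder.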

Aurum
Message 55554 - Posted: 12 Oct 2020, 15:31:43 UTC - in response to Message 55551.  

10 mins is the delay I run. And it definitely solved the problem.

Does this mean you set your Preferences like this?
Store at least 0.01 days of work.
Store up to an additional 0.01 days of work. {I have no idea what this line does or why it even exists.}

Aurum
Message 55555 - Posted: 12 Oct 2020, 16:13:55 UTC - in response to Message 55393.  

...you have to script it since the boinccmd tool only seems to have the ability to use the retry command on a single transfer. there is no "all" option.

this script will search the stuck transfers, grab their file names, then retry them for the given project. *note: if you have stuck transfers from another project you'll get an error, but you can just ignore that.

create a script with the following content:

if using a repository install of BOINC:
#!/bin/bash
for i in `boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p'`;do boinccmd --file_transfer https://gpugrid.net $i retry;done

This wouldn't work for the strangest reason: it must use the exact URL as returned by boinccmd --get_project_urls. I thought Apache rendered "www." useless a couple of decades ago. I can run it manually and it works well, but I cannot get it to run from my crontab.

name the script something like "update_transfers.sh"
change permissions of the script to make it executable
sudo chmod +x update_transfers.sh

The script cannot use a dot so I changed it to an underscore, BOINC_Retry_sh.
The script cannot be writable by a user other than root (aurum). So I did this:
sudo chmod 700 BOINC_Retry_sh
https://manpages.debian.org/stretch/cron/cron.8.en.html

So my script BOINC_Retry_sh is now this:
#!/bin/bash
# retry every pending file transfer reported by the local BOINC client
for i in $(boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p'); do
   boinccmd --host localhost --passwd mypw --file_transfer https://www.gpugrid.net "$i" retry
done

It wouldn't work without including --host localhost --passwd mypw, but that might be because I didn't store the script in the right folder.

run it with the following command from the same directory where the script is saved:
watch -n 600 ./update_transfers.sh

*replace the value 600 with whatever wait (in seconds) you want.
Forgot the watch command and will revisit that now.
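
The pieces above can be sketched as two small functions. This is only an illustration under stated assumptions: the function names are hypothetical, boinccmd is assumed to be on the PATH, and the project URL is passed in so it can match what boinccmd --get_project_urls prints. The sed expression is the one from the script above.

```shell
#!/bin/sh
# Extract transfer file names from `boinccmd --get_file_transfers` output
# read on stdin, using the same sed expression as the script above.
transfer_names() {
    sed -n -e 's/^.*name: //p'
}

# Retry every pending transfer for one project.
# $1 = project URL exactly as printed by `boinccmd --get_project_urls`
retry_transfers() {
    boinccmd --get_file_transfers | transfer_names | while read -r name; do
        boinccmd --file_transfer "$1" "$name" retry
    done
}
```

Calling `retry_transfers https://www.gpugrid.net` would then do one pass. When running from cron, calling boinccmd by its full path is probably safer, since cron jobs usually run with a minimal PATH.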

Aurum
Message 55558 - Posted: 12 Oct 2020, 16:36:19 UTC
Last modified: 12 Oct 2020, 16:40:20 UTC

It only took an hour to remind me of the problem with using a very short work queue 0.01/0.01. I believe I saw the same thing when I previously tried 0.1/0.1 which proved unworkable. I always run Milkyway along with GPUGrid since it dries up so quickly and I abhor idle computers.
Rig-02 13907 GPUGRID 12-10-2020 08:31 Not requesting tasks: don't need (CPU: ; NVIDIA GPU: job cache full)

So if I have MW WUs then I cannot DL a replacement GG WU.
I will not run GG exclusively just to implement a kluge.
GDF needs to fix this issue from his server side.

Ian&Steve C.
Message 55562 - Posted: 12 Oct 2020, 18:03:03 UTC - in response to Message 55555.  

The script cannot use a dot so I changed it to an underscore, BOINC_Retry_sh.

there should be no reason you can't run a ".sh". it's just a file extension. you could name it .anything or use no extension at all, as you did. it will execute the same either way; it's really inconsequential.

Aurum
Message 55571 - Posted: 12 Oct 2020, 19:23:26 UTC - in response to Message 55562.  

there should be no reason you can't run a ".sh". it's just a file extension. you could name it .anything or use no extension at all, as you did. it will execute the same either way; it's really inconsequential.

There is if you need to run it from a crontab:
As described above, the files under these directories have to be pass some sanity checks including the following: be executable, be owned by root, not be writable by group or other and, if symlinks, point to files owned by root. Additionally, the file names must conform to the filename requirements of run-parts: they must be entirely made up of letters, digits and can only contain the special signs underscores ('_') and hyphens ('-'). Any file that does not conform to these requirements will not be executed by run-parts. For example, any file containing dots will be ignored. This is done to prevent cron from running any of the files that are left by the Debian package management system when handling files in /etc/cron.d/ as configuration files (i.e. files ending in .dpkg-dist, .dpkg-orig, and .dpkg-new).
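
The naming rule in the quoted passage can be checked with a short function. A minimal sketch (the function name is hypothetical) that accepts only letters, digits, underscores and hyphens, as the passage describes for files handled by run-parts:

```shell
#!/bin/sh
# Return 0 if the file name uses only letters, digits, '_' and '-',
# which is the run-parts requirement quoted above; return 1 otherwise.
is_run_parts_name() {
    case "$1" in
        ''|*[!A-Za-z0-9_-]*) return 1 ;;
        *) return 0 ;;
    esac
}
```

With this, `is_run_parts_name update_transfers.sh` fails because of the dot, while `is_run_parts_name BOINC_Retry_sh` succeeds.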

Ian&Steve C.
Message 55572 - Posted: 12 Oct 2020, 19:30:00 UTC - in response to Message 55571.  
Last modified: 12 Oct 2020, 19:30:21 UTC

well that's why my instructions were designed around running it in an open terminal ;). just open the terminal and run it there with the watch command

Ian&Steve C.
Message 55573 - Posted: 12 Oct 2020, 19:35:51 UTC - in response to Message 55554.  
Last modified: 12 Oct 2020, 19:36:46 UTC

10 mins is the delay I run. And it definitely solved the problem.

Does this mean you set your Preferences like this?
Store at least 0.01 days of work.
Store up to an additional 0.01 days of work. {I have no idea what this line does or why it even exists.}


I have a custom client that was developed by a team member, which overrides the default comms delay and forces a longer timeout of whatever you wish. this is how i KNOW that the issue is solved with a longer timeout, because i've done it (as have several other teammates). this software is locked to our team however, so even if I were to give you the BOINC client software, it won't work unless you are on our team.

doesn't sound like you use anything but service installs anyway. this is a custom BOINC client that runs standalone from wherever you have it on your system. the benefit is that you don't have to "install" anything. you just copy the folder wherever you want, and run the executable from there. the downside is that it won't auto-run when you boot the system. but when you have a stable system with failover projects, it's not too much hassle. i reboot maybe every few months due to power outages or system upgrades.

Retvari Zoltan
Message 55579 - Posted: 12 Oct 2020, 22:10:26 UTC

I have another (much more sophisticated, yet not implemented) idea:
We should write a script that:
1. disables work fetch from GPUGrid
2. waits while there are two GPUGrid workunits per GPU on the host
3. enables work fetch from GPUGrid
4. waits until there are two GPUGrid workunits per GPU on the host
5. GOTO 1.

#1, #3 and #5 are trivial.
#2 and #4 are complex (especially to check how many usable Nvidia GPUs are present in the system), they should also include some sleep period. #4 should include the "update transfers" script.
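
The GPU-counting part of #2 and #4 might look something like this. It is only a sketch: the function names are hypothetical, it relies on `nvidia-smi -L` printing one "GPU <n>: ..." line per device, and it does not check whether BOINC can actually use each GPU.

```shell
#!/bin/sh
# Count GPUs from `nvidia-smi -L` output on stdin
# (one line starting with "GPU " per device).
count_gpus() {
    grep -c '^GPU '
}

# Target number of GPUGrid tasks for $1 GPUs: two per GPU,
# matching the per-GPU limit mentioned in this thread.
target_task_count() {
    echo $(( $1 * 2 ))
}
```

Usage would be along the lines of `target_task_count "$(nvidia-smi -L | count_gpus)"`; counting the tasks currently on the host is left out here.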

Ian&Steve C.
Message 55583 - Posted: 13 Oct 2020, 0:17:13 UTC - in response to Message 55579.  

Just make it wait a set amount of time, rather than wait for x number of tasks. Would be much simpler.

boinccmd project update (to initiate send/receive)
Wait 20-30secs (to allow proj update to complete)
boinccmd set NNT
wait 10 mins
boinccmd allow NT

And just loop that.

Ian&Steve C.
Message 55584 - Posted: 13 Oct 2020, 2:17:09 UTC - in response to Message 55583.  
Last modified: 13 Oct 2020, 2:22:03 UTC

here, i wrote it.

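#!/bin/bash
# loop forever: contact GPUGrid once, then block work fetch for 10 minutes
while :
do
   # trigger a scheduler request (send/receive)
   ./boinccmd --project https://www.gpugrid.net update
   # give the project update time to complete
   sleep 20
   # set "no new tasks" for GPUGrid
   ./boinccmd --project https://www.gpugrid.net nomorework
   sleep 10m
   # allow new tasks again right before the next update
   ./boinccmd --project https://www.gpugrid.net allowmorework
   sleep 1
done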


easy. put this script in whatever directory contains your boinccmd tool executable. edit it to whatever suits your needs. this is an infinite loop, best not to run this as a cronjob. just run it in a terminal and ctrl+c if you want to kill it.

unsure if this will totally fix the problem though, since it will still do a schedule request to report any finished work on 31 sec cycles. setting NNT only stops asking for new work; it doesn't stop reporting of completed work, and doesn't stop schedule requests (there is no boinccmd option to do that other than shutting off network comms to all projects, which is likely not desired). but since you won't finish WUs faster than 30s anyway, maybe it works? BOINC stops trying after a while when there's nothing to do.

Ian&Steve C.
Message 55593 - Posted: 13 Oct 2020, 13:29:08 UTC - in response to Message 55584.  

since GPUGRID is back, i'm running my script on my computers. i've removed my custom 10 min timer via the custom boinc client. so GPUGRID is running with a default comms delay of 31 seconds.

the script works as intended. it still reports completed work when tasks finish (if the 31 seconds has expired), but so far it doesn't seem to be causing any problems. work stays topped up on each 10 min schedule request.

Aurum, give this one a shot. feel free to play around with the sleep values. try bumping it up to 15mins if you still have issues with a 10 min timer.

note: I am not running the previous script at all to update_transfers. with the longer timer, the transfers seem to not be getting clogged up. but this is only 2 systems at the same IP. i'd be curious to know if it helps with your 40+ systems that you have at the same location.

having the project change their settings project-side is still the best solution, so that you won't even report completed work during the 10 min deferred comms.
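
To make it easier to try different sleep values, one cycle of the loop can be wrapped in a function. A minimal sketch, assuming boinccmd is on the PATH (with a standalone install you would call ./boinccmd from its own directory instead); the function name and parameters are hypothetical:

```shell
#!/bin/sh
# One throttle cycle, following the loop from the earlier post.
# $1 = project URL
# $2 = seconds to wait for the project update to complete
# $3 = how long to block work fetch (e.g. 10m or 15m with GNU sleep)
throttle_cycle() {
    boinccmd --project "$1" update
    sleep "$2"
    boinccmd --project "$1" nomorework
    sleep "$3"
    boinccmd --project "$1" allowmorework
    sleep 1
}
```

Running `while :; do throttle_cycle https://www.gpugrid.net 20 15m; done` in a terminal would give the 15 min variant; ctrl+c stops it as before.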

Aurum
Message 55594 - Posted: 13 Oct 2020, 15:38:38 UTC

Zoltan, I left all rigs with 0.01/0.01 overnight and awoke to the usual idle GPUs and a long list of WUs with (Project Backoff: x:x:x). Once they get tagged with Project Backoff they never seem to restart on their own.

Ian, the first script works great; I just can't get my crontab to invoke it periodically. I'll try your new approach today.

Thanks

Ian&Steve C.
Message 55595 - Posted: 13 Oct 2020, 16:08:04 UTC - in response to Message 55594.  

as you'll find out, you probably will need to remove the "./" prefix on the boinccmd lines, since my implementation is with a user install of BOINC that just runs the executable directly.

Aurum
Message 55596 - Posted: 13 Oct 2020, 16:57:16 UTC

It does not seem to get WUs suffering from the Project Backoff syndrome to move, but for WUs that finish after your script starts running it works. So I added training wheels & invoked your Retry script:
#!/bin/bash
while :
do
   /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net update
   /home/aurum/BOINC_Retry.sh
   echo "Update & Retry GPUGrid then sleep for 20 seconds"
   sleep 20
   /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net nomorework
   echo "NoMoreWork GPUGrid then sleep for 10 minutes"
   sleep 10m
   /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net allowmorework
   echo "AllowMoreWork GPUGrid then sleep for 1 second"
   sleep 1
done

It's working well so far on 2 rigs. I'll be adding it to more.

Ian&Steve C.
Message 55597 - Posted: 13 Oct 2020, 17:07:00 UTC - in response to Message 55596.  

yeah, nothing in this script will retry the stuck transfers. and if you have too many stuck transfers, the schedule requests won't even get new work (you'll see in the event log that you have too many stuck uploads or whatever).

clear your pending transfers, and ideally you won't need the retry transfers script anymore. but no promises. with so many systems, you might need to run both together. just have it retry any pending transfers every few mins or something. experiment with different values and find the setup that works for you.

Retvari Zoltan
Message 55598 - Posted: 13 Oct 2020, 20:53:53 UTC - in response to Message 55597.  

I think that Aurum needs the "retry transfers scripts" if all of his hosts are behind the same WAN IP.
The only solution to that many hosts is to make the workunits longer, or to lighten the DDOS protection, but I think the latter is out of GPUGrid's control.

©2026 Universitat Pompeu Fabra