Message boards :
Server and website :
Optimized bandwith
| Author | Message |
|---|---|
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 2
|
if you can find a way to prevent your system from communicating with the project for longer than the default 31 second comms delay, you will solve the problem. or the project could simply increase that delay so you don't have to find some workaround yourself. increasing the delay will not change the presence of the DDOS protections, but it will prevent the users from hitting them. it's really the best solution. it just seems like the project admins either don't know where this setting is in their own server software or the suggestion is falling on deaf ears. this 100% solves the problem.
|
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
> increasing the [default 31 seconds communications] delay will not change the presence of the DDOS protections, but it will prevent the users from hitting them.

I wonder about the ideal length of that delay. We don't know the exact rules of the DDOS protection that is hitting us; those rules, combined with the number of hosts a given user has behind a single WAN IP, would determine the ideal delay length. The delay can't be longer than the shortest workunit on the fastest GPU, because that would make those hosts starve (so it would be no better than the DDOS protection making them starve). Considering the behavior of the present DDOS protection and the present short workunits, I think there is a maximum number of hosts behind a single WAN IP that can work without some of them starving. Above that number, some random host behind that WAN IP will inevitably be hit by the DDOS protection. So I think the delay should be around 600 seconds (10 minutes), but the length of the workunits should also be at least doubled, or even quadrupled. The present workunits should go in the short queue (for lesser GPUs). |
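To put rough numbers on this, here is a small sketch that computes an upper bound on scheduler contacts per hour from one WAN IP. It assumes, pessimistically, that every host contacts the server as often as the delay allows. The function name and the host count of 40 are illustrative assumptions, and the real DDOS thresholds remain unknown.

```shell
#!/bin/bash
# Upper bound on scheduler requests per hour from a single WAN IP,
# assuming each host contacts the server once per delay period.
# Usage: max_requests_per_hour HOSTS DELAY_SECONDS
max_requests_per_hour() {
    echo $(( $1 * 3600 / $2 ))
}

max_requests_per_hour 40 31     # default 31 s delay  -> 4645
max_requests_per_hour 40 600    # 10 min delay        -> 240
```

Under these assumptions a 10 minute delay cuts the worst-case request rate by roughly a factor of 19, which is why the delay matters more as the number of hosts behind one IP grows.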
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 2
|
A 10 min delay seems pretty ideal to me. The fastest WUs I see on my 2080 Ti take about 15 mins. The 2080 Ti is about the fastest card right now (Titan RTX is barely faster, RTX 8000 is a little slower) until Ampere support is added and we can properly gauge how the 30-series cards will perform.

I have no objection to the longer WUs, but they should be able to be restarted. Every time I have a power outage while my system is in the middle of some of those older PABLO WUs, I lose several hours of work, since they error out immediately when they try to resume. I'm aware of the issue with restarting tasks on a different device, but all of my systems run identical GPUs within the same system (not just the identical type; I run cards with the same part number SKU within the same system). It even happens on my single-GPU system, so there's really no excuse there.

10 mins is the delay I run. And it definitely solved the problem. But again, the most elegant solution is for the project to make a global change server side so that the clients don't have to individually implement a workaround.
|
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 8,996
|
> You should reduce your work cache settings on all of your hosts to roughly match the shortest workunits your host crunches.

I've tried this before but not to the extreme of 0.01/0.01. As I recall, it reduces the chances of getting big WUs such as ARP & HST from WCG. I'll give it a try on all my computers today, but it'll probably take a day to see if it's working. I'm certain that 0.5/0.1 does not help GG.

> Note that the GPUGrid server will send only two workunits per GPU for a given host.

This is part of the problem. |
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 8,996
|
> 10 mins is the delay I run. And it definitely solved the problem.

Does this mean you set the Preferences like this?
Store at least 0.01 days of work.
Store up to an additional 0.01 days of work. {I have no idea what this line does or why it even exists.} |
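For hosts managed from the command line, the same two values can be set through a local override file. This is a minimal sketch, assuming a Debian-style data directory of /var/lib/boinc-client; the helper name write_cache_override is made up for illustration. As far as I understand the BOINC preferences, the first value is the buffer level the client tries to keep and the second is how much extra it asks for on top of that, so treat that reading as an assumption.

```shell
#!/bin/bash
# Write a global_prefs_override.xml with the two work-cache values.
# Usage: write_cache_override DATA_DIR MIN_DAYS EXTRA_DAYS
write_cache_override() {
    cat > "$1/global_prefs_override.xml" <<EOF
<global_preferences>
   <work_buf_min_days>$2</work_buf_min_days>
   <work_buf_additional_days>$3</work_buf_additional_days>
</global_preferences>
EOF
}

# Example (the data directory is an assumption; adjust for your install):
#   write_cache_override /var/lib/boinc-client 0.01 0.01
#   boinccmd --read_global_prefs_override   # ask the client to reload it
```

Note that this overwrites any existing override file in that directory, so back it up first if you already use one.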
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 8,996
|
> ...you have to script it since the boinccmd tool only seems to have the ability to use the retry command on a single transfer. there is no "all" option.

This wouldn't work for the strangest reason: it must use the exact URL as returned by boinccmd --get_project_urls. I thought Apache rendered "www." useless a couple of decades ago. I can run it manually and it works well, but I cannot get it to run from my crontab.

> name the script something like "update_transfers.sh"

The script name cannot contain a dot, so I changed it to an underscore: BOINC_Retry_sh. The script cannot be writable by a user other than root (aurum). So I did this:

    sudo chmod 700 BOINC_Retry_sh

https://manpages.debian.org/stretch/cron/cron.8.en.html

So my script BOINC_Retry_sh is now this:

    #!/bin/bash
    # Retry every pending GPUGrid transfer, one boinccmd call per file name.
    for i in $(boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p'); do
        boinccmd --host localhost --passwd mypw --file_transfer https://www.gpugrid.net "$i" retry
    done

It wouldn't work without including --host localhost --passwd mypw, but that might be because I didn't store the script in the right folder.

> run it with the following command from the same directory where the script is saved:

Forgot the watch command and will revisit that now. |
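One possible reason the password is needed: boinccmd looks for gui_rpc_auth.cfg in the current directory, so a script started from elsewhere cannot authenticate on its own. The sketch below changes into the data directory first and reads the transfer names line by line. The directory /var/lib/boinc-client and the "name: <file>" output format are assumptions here, and retry_all_transfers is just an illustrative name.

```shell
#!/bin/bash
# Sketch: retry every pending transfer for one project.
# Assumptions: the data directory path, and that each transfer appears
# on a "name: <file>" line in the boinccmd output.
PROJECT_URL="${PROJECT_URL:-https://www.gpugrid.net}"   # must match --get_project_urls
BOINC_DIR="${BOINC_DIR:-/var/lib/boinc-client}"

# The body runs in a subshell, so the caller's working directory is untouched.
retry_all_transfers() (
    # Run from the data directory so boinccmd can read gui_rpc_auth.cfg there.
    cd "$BOINC_DIR" || exit 1
    boinccmd --get_file_transfers |
        sed -n -e 's/^.*name: //p' |
        while IFS= read -r name; do
            # Quoting keeps each file name as a single argument.
            boinccmd --file_transfer "$PROJECT_URL" "$name" retry
        done
)

# Start with: retry_all_transfers
```

If the user running the script cannot read gui_rpc_auth.cfg, the --passwd option is still needed.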
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 8,996
|
It only took an hour to remind me of the problem with using a very short work queue of 0.01/0.01. I believe I saw the same thing when I previously tried 0.1/0.1, which proved unworkable. I always run Milkyway along with GPUGrid since it dries up so quickly and I abhor idle computers.

    Rig-02 13907 GPUGRID 12-10-2020 08:31 Not requesting tasks: don't need (CPU: ; NVIDIA GPU: job cache full)

So if I have MW WUs then I cannot DL a replacement GG WU. I will not run GG exclusively just to implement a kluge. GDF needs to fix this issue from his server side. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 2
|
> ...you have to script it since the boinccmd tool only seems to have the ability to use the retry command on a single transfer. there is no "all" option.

there should be no reason you can't run a ".sh". it's just a file extension. you could name it .anything or with no extension at all as you did. it will execute the same either way, it's really inconsequential.
|
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 8,996
|
> there should be no reason you cant run a ".sh" it's just a file extension. you could name it .anything or with no extension at all as you did. it will execute the same either way, its really inconsequential.

There is if you need to run it from a crontab:

> As described above, the files under these directories have to pass some sanity checks including the following: be executable, be owned by root, not be writable by group or other and, if symlinks, point to files owned by root. Additionally, the file names must conform to the filename requirements of run-parts: they must be entirely made up of letters, digits and can only contain the special signs underscores ('_') and hyphens ('-'). Any file that does not conform to these requirements will not be executed by run-parts. For example, any file containing dots will be ignored. This is done to prevent cron from running any of the files that are left by the Debian package management system when handling files in /etc/cron.d/ as configuration files (i.e. files ending in .dpkg-dist, .dpkg-orig, and .dpkg-new). |
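The naming rule quoted above is easy to check before dropping a file into one of those directories. A minimal sketch follows; run_parts_ok is a hypothetical helper that mirrors only the character rule from the man page, not the ownership or permission checks.

```shell
#!/bin/bash
# Succeeds if NAME consists only of letters, digits, '_' and '-',
# which is the run-parts file name rule quoted from the man page.
run_parts_ok() {
    case "$1" in
        ''|*[!A-Za-z0-9_-]*) return 1 ;;   # empty, or contains another character
        *) return 0 ;;
    esac
}

for f in update_transfers.sh BOINC_Retry_sh; do
    if run_parts_ok "$f"; then
        echo "$f: accepted"
    else
        echo "$f: ignored"
    fi
done
```

This prints "update_transfers.sh: ignored" and "BOINC_Retry_sh: accepted". The quoted text describes files placed in the cron directories; a line added with crontab -e names a command by its path instead, so the dot rule should not apply there.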
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 2
|
well that's why my instructions were designed around running it in an open terminal ;). just open the terminal and run it there with the watch command
|
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 2
|
> 10 mins is the delay I run. And it definitely solved the problem.

I have a custom client that was developed by a team member, which overrides the default comms delay and forces a longer timeout of whatever length you wish. this is how i KNOW that the issue is solved with a longer timeout, because i've done it (as have several other teammates). this software is locked to our team however, so even if I were to give you the BOINC client software, it won't work unless you are on our team. doesn't sound like you use anything but service installs anyway.

this is a custom BOINC client that runs stand alone from wherever you have it on your system. the benefit is that you don't have to "install" anything. you just copy the folder wherever you want, and run the executable from there. the downside is that it won't auto-run when you boot the system. but when you have a stable system with failover projects, it's not too much hassle. i reboot maybe every few months due to power outages or system upgrades.
|
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
I have another (much more sophisticated, yet not implemented) idea: We should write a script that:

1. disables work fetch from GPUGrid
2. waits while there are two GPUGrid workunits per GPU on the host
3. enables work fetch from GPUGrid
4. waits until there are two GPUGrid workunits per GPU on the host
5. GOTO 1.

#1, #3 and #5 are trivial. #2 and #4 are complex (especially checking how many usable Nvidia GPUs are present in the system); they should also include some sleep period. #4 should include the "update transfers" script. |
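The steps above can be sketched roughly as follows. This collapses steps 1 to 4 into a single check that is repeated periodically: if the host holds fewer than two GPUGrid tasks per GPU, work fetch is enabled, otherwise it is disabled. It assumes that nvidia-smi -L prints one "GPU n:" line per card and that boinccmd --get_tasks prints a "project URL:" line for each task. All function names are made up for illustration, and the count is rough because it includes tasks that have finished but are not yet reported.

```shell
#!/bin/bash
# Sketch of the fetch-gating idea (function names are illustrative).
PROJECT_URL="https://www.gpugrid.net"
TASKS_PER_GPU=2

# Number of Nvidia GPUs, assuming one "GPU n: ..." line per card.
count_gpus() {
    nvidia-smi -L | grep -c '^GPU '
}

# Number of tasks on this host that belong to the project.
count_project_tasks() {
    boinccmd --get_tasks | grep -cF "project URL: $PROJECT_URL"
}

# Succeeds when TASKS is below TASKS_PER_GPU * GPUS.
# Usage: needs_work TASKS GPUS
needs_work() {
    [ "$1" -lt $(( $2 * TASKS_PER_GPU )) ]
}

# One pass of the check: enable fetch when short of work, else disable it.
gate_once() {
    if needs_work "$(count_project_tasks)" "$(count_gpus)"; then
        boinccmd --project "$PROJECT_URL" allowmorework   # steps 3 and 4
    else
        boinccmd --project "$PROJECT_URL" nomorework      # steps 1 and 2
    fi
}

# Step 5 (GOTO 1), with a sleep period between checks:
#   while :; do gate_once; sleep 60; done
```

The 60 second polling interval is a placeholder; a retry of stuck transfers could be added to the same loop.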
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 2
|
> I have another (much more sophisticated, yet not implemented) idea:

Just make it wait a set amount of time, rather than wait for x number of tasks. Would be much simpler.

1. boinccmd project update (to initiate send/receive)
2. Wait 20-30 secs (to allow proj update to complete)
3. boinccmd set NNT
4. wait 10 mins
5. boinccmd allow NT

And just loop that.
|
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 2
|
here, i wrote it.

    #!/bin/bash
    # Loop forever: update, pause work fetch for 10 minutes, then allow it again.
    while :
    do
        ./boinccmd --project https://www.gpugrid.net update
        sleep 20    # give the project update time to complete
        ./boinccmd --project https://www.gpugrid.net nomorework
        sleep 10m
        ./boinccmd --project https://www.gpugrid.net allowmorework
        sleep 1
    done

easy. put this script in whatever directory contains your boinccmd tool executable. edit it to whatever suits your needs. this is an infinite loop, best not to run this as a cronjob. just run it in a terminal and ctrl+c if you want to kill it.

unsure if this will totally fix the problem though, since it will still do a schedule request to report any finished work on 31 sec cycles. setting NNT only stops asking for new work, it doesn't stop reporting of completed work, and doesn't stop schedule requests (there is no boinccmd tool to do that other than shutting off network comms to all projects, which is likely not desired). but since you won't finish WUs faster than 30s anyway maybe it works? BOINC stops trying after a while when there's nothing to do.
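One thing to watch with the loop: if it is killed during the 10 minute window, the project stays set to no new tasks. Here is a variant sketch that re-enables work fetch when the script exits. The function names and the BOINCCMD variable are illustrative assumptions; the boinccmd operations are the same three used in the script.

```shell
#!/bin/bash
# Variant sketch: same cycle, but work fetch is re-enabled on exit.
PROJECT_URL="https://www.gpugrid.net"
BOINCCMD="${BOINCCMD:-boinccmd}"   # set to ./boinccmd for a stand-alone client

# One cycle: update, wait, block new work, wait, allow new work.
# Usage: cycle UPDATE_WAIT_SECONDS PAUSE_SECONDS
cycle() {
    "$BOINCCMD" --project "$PROJECT_URL" update
    sleep "$1"
    "$BOINCCMD" --project "$PROJECT_URL" nomorework
    sleep "$2"
    "$BOINCCMD" --project "$PROJECT_URL" allowmorework
}

restore_fetch() {
    "$BOINCCMD" --project "$PROJECT_URL" allowmorework
}

run_loop() {
    trap restore_fetch EXIT    # runs whenever the script exits
    trap 'exit 130' INT        # ctrl+c exits, which fires the EXIT trap
    while :; do
        cycle 20 600
        sleep 1
    done
}

# Start with: run_loop
```

Add run_loop as the last line to use it. The 20 and 600 second values are the ones from the script and can be changed in the cycle call.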
|
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 2
|
since GPUGRID is back, i'm running my script on my computers. i've removed the custom 10 min timer that I had set via the custom boinc client, so GPUGRID is running with the default comms delay of 31 seconds.

the script works as intended. it still reports completed work when tasks finish (if the 31 seconds has expired), but so far it doesn't seem to be causing any problems. work stays topped up on each 10 min schedule request.

Aurum, give this one a shot. feel free to play around with the sleep values. try bumping it up to 15 mins if you still have issues with a 10 min timer.

note: I am not running the previous update_transfers script at all. with the longer timer, the transfers seem to not be getting clogged up. but this is only 2 systems at the same IP. i'd be curious to know if it helps with the 40+ systems that you have at the same location.

having the project change their settings project-side is still the best solution, so that you won't even report completed work during the 10 min deferred comms.
|
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 8,996
|
Zoltan, I left all rigs with 0.01/0.01 overnight and awoke to the usual idle GPUs and a long list of WUs with (Project Backoff: x:x:x). Once they get tagged with Project Backoff they never seem to restart on their own.

Ian, the first script works great, I just can't get my crontab to invoke it periodically. I'll try your new approach today. Thanks |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 2
|
> Zoltan, I left all rigs with 0.01/0.01 overnight and awoke to the usual idle GPUs and a long list of WUs with (Project Backoff: x:x:x). Once they get tagged with Project Backoff they never seem to restart on their own.

as you'll find out, you probably will need to remove the "./" prefix on the boinccmd lines, since my implementation is with a user install of BOINC that just runs the executable directly.
|
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 8,996
|
It does not seem to get WUs suffering from the Project Backoff syndrome to move, but for WUs that finish after your script starts running it works. So I added training wheels and invoked your Retry script:

    #!/bin/bash
    while :
    do
        /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net update
        /home/aurum/BOINC_Retry.sh
        echo "Update & Retry GPUGrid then sleep for 20 seconds"
        sleep 20
        /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net nomorework
        echo "NoMoreWork GPUGrid then sleep for 10 minutes"
        sleep 10m
        /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net allowmorework
        echo "AllowMoreWork GPUGrid then sleep for 1 second"
        sleep 1
    done

It's working well so far on 2 rigs. I'll be adding it to more. |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 2
|
yeah, nothing in this script will retry the stuck transfers. and if you have too many stuck transfers, the schedule requests won't even get new work (you'll see in the event log that you have too many stuck uploads or whatever).

clear your pending transfers, and ideally you won't need the retry transfers script anymore. but no promises. with so many systems, you might need to run both together. just have it retry any pending transfers every few mins or something. experiment with different values and find the setup that works for you.
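One way to run both together is to split the long pause into shorter chunks and retry pending transfers after each chunk. A minimal sketch follows, assuming a retry script like the one posted earlier in the thread. RETRY_CMD is a hypothetical path to it, and pause_with_retries is an illustrative name.

```shell
#!/bin/bash
# Sketch: sleep for TOTAL seconds in INTERVAL-sized chunks and run the
# retry command after each chunk. RETRY_CMD is a hypothetical path.
RETRY_CMD="${RETRY_CMD:-./BOINC_Retry_sh}"

# Usage: pause_with_retries TOTAL_SECONDS INTERVAL_SECONDS
pause_with_retries() {
    remaining=$1
    while [ "$remaining" -gt 0 ]; do
        step=$2
        # The last chunk may be shorter than the interval.
        if [ "$remaining" -lt "$step" ]; then
            step=$remaining
        fi
        sleep "$step"
        "$RETRY_CMD"
        remaining=$(( remaining - step ))
    done
}

# Example: in the loop, replace "sleep 10m" with
#   pause_with_retries 600 120    # five retries, one every 2 minutes
```

The 2 minute interval is only a starting point; with many hosts behind one IP, a longer interval may be kinder to the server.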
|
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
I think that Aurum needs the "retry transfers scripts" if all of his hosts are behind the same WAN IP. The only solution to that many hosts is to make the workunits longer, or to lighten the DDOS protection, but I think the latter is out of GPUGrid's control. |
©2026 Universitat Pompeu Fabra