Message boards :
Server and website :
Optimized bandwith
| Author | Message |
|---|---|
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 8,996
|
It won't work without a Retry. I've seen WUs that completed after BOINC_Nap.sh started go into the fatal "Download pending (Project backoff)" mode. E.g.,

    Rig-44 GPUGRID 3qhoA00_320_4-TONI_MDADex2sq-15-par_file 0.000 178.34 K 00:04:02 - 00:17:28 0.00 KBps Download pending (Project backoff: 01:07:59)

So I inserted Retry.

    #!/bin/bash
    while :
    do
      # Ask the project for an update, then retry every pending file transfer
      /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net update
      for i in $(boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p'); do
        boinccmd --host localhost --passwd pw --file_transfer https://www.gpugrid.net "$i" retry
      done
      echo "Update & Retry GPUGrid then sleep for 20 seconds"
      sleep 20
      # Stop fetching new work for 10 minutes
      /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net nomorework
      echo "NoMoreWork GPUGrid then sleep for 10 minutes"
      sleep 10m
      # Allow new work again before the next update
      /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net allowmorework
      echo "AllowMoreWork GPUGrid then sleep for 1 second"
      sleep 1
    done

It's a bit unnerving to look at my Project tab and see most of my GG set to "No new work." But watch for 10 minutes and they turn on and off. The downside is that if one wants to gracefully shut down ARPs with their 2 hour checkpoints, it turns them back on when they need to stay off. Suspending a WU stops new DLs. |
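The retry step in the script above can be pulled out into two small functions. This is only a sketch: the function names `transfer_names` and `retry_transfers` are made up for illustration, and it assumes, as the script in the post does, that `boinccmd --get_file_transfers` prints one `name: <file>` line per transfer.

```shell
#!/bin/bash

# Print the file name from every "name: <file>" line read on stdin.
# Uses the same sed expression as the script in the post.
transfer_names() {
    sed -n -e 's/^.*name: //p'
}

# Retry every pending transfer against one project URL.
# Arguments: project URL, then any boinccmd connection options
# (for example --host localhost --passwd pw).
retry_transfers() {
    local url="$1"
    shift
    local name
    boinccmd "$@" --get_file_transfers | transfer_names |
    while IFS= read -r name; do
        boinccmd "$@" --file_transfer "$url" "$name" retry
    done
}
```

With these defined, the inner loop would become `retry_transfers https://www.gpugrid.net --host localhost --passwd pw`. Note that the sed pattern matches any line containing `name: `, so it does not filter by project.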
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 2
|
I saw a few instances of transfers getting stuck, but they usually clear on the next attempt in a few minutes. The first backoff from a stuck transfer is rather short; then they get longer and longer with each successive retry failure. Having one stuck task doesn't prevent downloads of more work, but having a lot of stuck ones does. I ran my script without automatic transfer retries on my systems for over 24 hours, and even though one would occasionally get stuck, it always eventually cleared itself without intervention. That was my point: getting stuck occasionally isn't a problem if it eventually gets uploaded, which in my case they always did. You just have to trust it a bit and not get too anxious if you see a stuck one. I can see how having 40+ systems might be a different situation though. So if you absolutely need it, then do what works for you. I don't know what you are referring to with ARPs and 2-hour checkpoints though. Care to elaborate? What needs to stay off?
|
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
ARP is one of the many projects of World Community Grid. Africa Rainfall Project |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 2
|
OK, that only seems to further the confusion. My script won't change anything with WCG or its projects. Nor do I know what he means by suspending WUs, since my script doesn't do that either; it just stops getting new work for GPUGRID. So I don't see what the issue or connection is between this script and WCG/ARP or suspending WUs.
|
|
Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 8,996
|
Africa Rainfall Project has 2 to 3 hour checkpoints. If one wants to avoid discarding all that work, then an orderly shutdown is required. One selects "No New Work" for all projects, waits for everything to checkpoint, and then shuts down. This command reverses that:

    /usr/bin/boinccmd --project https://www.gpugrid.net allowmorework

It switches GG to Allow New Work long enough to start additional 1-2 hour GG WUs going. Just a small occasional nuisance. I still have not heard a peep out of GDF or Toni about whether they intend to fix their server issue. |
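One way to keep the loop from undoing a manual "No New Work" is to have it check for a flag file before it re-enables work. This is a sketch, not part of the original script: the `HOLD_FILE` path and the `may_allow_work` function are hypothetical names chosen for illustration.

```shell
#!/bin/bash

# Hypothetical flag file: create it by hand before an orderly shutdown,
# remove it when the host should fetch GPUGRID work again.
HOLD_FILE="${HOLD_FILE:-$HOME/.boinc_hold}"

# Succeeds (exit status 0) only when no hold has been requested.
may_allow_work() {
    [ ! -e "$HOLD_FILE" ]
}

# In the loop, the allowmorework line would then be guarded like this:
#   may_allow_work && /usr/bin/boinccmd --project https://www.gpugrid.net allowmorework
```

Running `touch ~/.boinc_hold` before selecting "No New Work" would leave GG switched off until the file is deleted.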
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
Now that COVID Moonshot sprint 5 is finished at Folding@home, my hosts have run out of work. So I've put them back on GPUGrid, and immediately got blocked by that DDoS defense. I've set my hosts to a 0.01-day work buffer, but then I realized that they start only 2 file transfers simultaneously, while a workunit has 9 files to download, so the given host contacts the GPUGrid servers 5 times to download a task. To reduce that I've increased the number of simultaneous file transfers per project to 10 (the global number to 20) by putting
<max_file_xfers>20</max_file_xfers>
<max_file_xfers_per_project>10</max_file_xfers_per_project>
in the <options> section of the cc_config.xml file, and re-reading the config files.
I can see in the log that the manager starts all 9 downloads at the same time: the "starting" and the "finished" messages were mixed up before; now all 9 "starting download of ..." messages are in a block, with the same timestamp. It seems to help; at least I can access the forum. |
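The "5 times" figure in the post is a ceiling division of the file count by the per-project transfer limit. The sketch below reproduces only that arithmetic; `transfer_rounds` is a hypothetical helper, and it counts batches of simultaneous transfers, which is not necessarily the number of connections the client opens.

```shell
#!/bin/bash

# Number of batches needed to move FILES files when at most LIMIT
# transfers run at once: ceil(FILES / LIMIT) in integer arithmetic.
transfer_rounds() {
    local files="$1" limit="$2"
    echo $(( (files + limit - 1) / limit ))
}

# transfer_rounds 9 2    -> 5 batches with the limit of 2 described in the post
# transfer_rounds 9 10   -> 1 batch with the raised per-project limit
```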
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0
|
There's a flaw in the logic there. If you examine BOINC's http_debug log, you can see that once the host has established a connection, it preserves it and keeps re-using it:

    26/11/2020 09:29:07 | GPUGRID | [http] [ID#12984] Info: Connection #7366 to host www.gpugrid.net left intact
    26/11/2020 09:29:08 | GPUGRID | Finished upload of 2jh1A01_348_1-TONI_MDADex2sj-33-50-RND7955_0_0
    26/11/2020 09:29:08 | GPUGRID | Started upload of 2jh1A01_348_1-TONI_MDADex2sj-33-50-RND7955_0_2
    26/11/2020 09:29:08 | GPUGRID | [http] [ID#12985] Info: Re-using existing connection! (#7366) with host www.gpugrid.net
    26/11/2020 09:29:18 | GPUGRID | Sending scheduler request: To report completed tasks.
    26/11/2020 09:29:18 | GPUGRID | Reporting 1 completed tasks
    26/11/2020 09:29:18 | GPUGRID | [http] [ID#1] Info: Re-using existing connection! (#7366) with host www.gpugrid.net
    26/11/2020 09:29:21 | GPUGRID | Started download of 2hy5B00_320_0-TONI_MDADex2sh-33-conf_file_enc
    26/11/2020 09:29:22 | GPUGRID | [http] [ID#12990] Info: Re-using existing connection! (#7366) with host www.gpugrid.net

That's a very short extract from a very long log, but connection #7366 was used for uploads, reporting, and downloads without needing to be re-established. By contrast, if your initial attempt was made at a moment when GPUGrid was unready to accept a connection from your IP address, all nine downloads will fail at the same time. This will drive BOINC into a project-wide backoff lasting several hours. |
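To check connection re-use across a longer log, the "Re-using existing connection!" lines can be tallied per connection number. This is a rough sketch that assumes the log lines look like the extract in the post; `reuse_counts` is a made-up name.

```shell
#!/bin/bash

# Read an http_debug log on stdin and print "<connection number> <count>"
# for every connection that was re-used at least once.
reuse_counts() {
    grep -o 'Re-using existing connection! (#[0-9]*)' |
    sed -e 's/.*(#\([0-9]*\))/\1/' |
    sort | uniq -c | awk '{print $2, $1}'
}
```

Fed the extract above, it should report connection 7366 with three re-uses; a log full of different connection numbers with low counts would suggest connections are being dropped and re-established.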
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
"There's a flaw in the logic there. If you examine BOINC's http_debug log, you can see that once the host has established a connection, it preserves it and keeps re-using it:"
I didn't examine the http_debug log before, so that's the reason for the flaw in my logic, however...
"By contrast, if your initial attempt was made at a moment when GPUGrid was unready to accept a connection from your IP address, all nine downloads will fail at the same time. This will drive BOINC into a project-wide backoff lasting several hours."
I thought that raising the "file transfers per project" limit would help, because I saw the same thing happen when the "per project" limit is 2 (or 5). Some of the files are downloaded, some of the downloads get stuck. After a few unsuccessful retries, the project backoff kicks in, even when the "per project" limit is low. My point is that this unknown DDoS protection is triggered even if the BOINC manager reuses the open HTTP connection(s). In the meantime it turned out that this method is not an adequate workaround: the uploads / downloads still get stuck on my hosts. So the only working method is the "file transfer retry" script. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0
|
My experience is slightly different. I have seven machines attached, made up of
2x Linux machines, with 2 GPUs each
3x Windows machines, with 2 GPUs each
2x Windows machines, with 1 GPU each

Each machine may make a random attempt to download new work, but usually gets rebuffed because the machines are at the limit of 'one task and a spare' per GPU.

The fun starts when a machine completes a task and starts to upload. If no other machine has phoned home in the last few minutes (define 'few'?), it connects straight away, uploads all six files, reports, and downloads a replacement - all without a delay, reusing the same connection. The sample log I posted this morning came into that category.

A Linux machine may hit too soon, and be rebuffed. But it'll keep trying the same pair of uploads for a full two minutes. If they don't get through, each upload will be backed off for one or two minutes, but another two will be tried. Usually, two pairs - four minutes - will be enough to establish the connection, and the final four uploads will sail through. The first two will retry, usually get through (I'm not sure how long BOINC keeps the successful connection open), and then the report/replacement also follows immediately.

Windows machines have a problem. When rebuffed, they only keep retrying the first pair for 21 or 22 seconds. The second pair, likewise, are only retried for 21/22 seconds. The third pair will always be attempted, but the whole task upload only gets 66 seconds (maximum) to complete uploading. That isn't enough - if the uploading hasn't started by then, the six consecutive failed uploads are enough to drive BOINC into a project-wide backoff of well over an hour.

When writing that bit of code, David Anderson made an unconscious assumption that one task = one upload, so three consecutive failed uploads (the actual trigger) implies three failed tasks over a period of time, and hence a server experiencing problems. His safeguard is safeguarding us from a completely different problem from the one that's facing us here. That's something I tried to address in https://github.com/BOINC/boinc/issues/3778, to singularly little effect. |
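The trigger described above can be modelled as a counter of consecutive failed uploads. The sketch below is a deliberately simplified model, not BOINC's actual code: the threshold of three comes from the post, while the function name `project_backoff` and the rule that a success resets the streak are assumptions.

```shell
#!/bin/bash

# Read one upload result per line on stdin ("ok" or "fail").
# Print "backoff" as soon as THRESHOLD consecutive failures are seen
# (default 3, the trigger named in the post), otherwise "no backoff".
# Assumption: any successful upload resets the streak to zero.
project_backoff() {
    local threshold="${1:-3}" streak=0 result
    while IFS= read -r result; do
        if [ "$result" = "fail" ]; then
            streak=$((streak + 1))
            if [ "$streak" -ge "$threshold" ]; then
                echo "backoff"
                return 0
            fi
        else
            streak=0
        fi
    done
    echo "no backoff"
}
```

Under this model a single task with six output files reaches the threshold on its own if its first three uploads fail, which is the mismatch between "three failed uploads" and "three failed tasks" that the post points out.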
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
"My experience is slightly different. I have seven machines attached, made up of"
Take your Windows machines to Folding. Their core_22 now has a CUDA version that works well. I will bring my Linux machines here (mostly GTX 1070s). Their control program works only with Python 2, and Ubuntu 20.04 only has Python 3, so I am being squeezed out as I upgrade. (They might fix it someday, but it is a generational thing). |
|
Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 2
|
From what the admins have posted, GPUGRID includes the whole Python package with the application, so the environment doesn’t matter. I run all my systems on Ubuntu 20.04 and no issues. Or did you mean “they” as in FAH?
|
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
"Or did you mean “they” as in FAH?"
Yes, it is the Folding control program that has the problem. I am using Ubuntu 20.04 here too. |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
"Their [FAH] control program works only with Python 2, and Ubuntu 20.04 only has Python 3, so I am being squeezed out as I upgrade. (They might fix it someday, but it is a generational thing)."
If you install Ubuntu 18.04 first, then upgrade it to 20.04, it will leave Python 2 on the system, and FAH will work. If you do a clean install of Ubuntu 20.04, FAH won't work. |
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
"If you install Ubuntu 18.04 first, then upgrade it to 20.04 it will leave Python 2 on the system, and FAH will work."
Good thought, but whenever I do an upgrade, it never works. I always end up having to do a clean install anyway. So I will just keep some machines on Ubuntu 18.04 for the time being. By the way, I just did my usual efficiency tests on GPUGrid, and found that the GTX 1660 Ti and GTX 1650 Super are the best, a little ahead of both the GTX 1060 and GTX 1070, so those are the ones I will use here. |
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
"I've set my hosts to 0.01 days work buffer, but then I've realized that they start only 2 file transfers simultaneously, while a workunit has 9 files to download, so the given host contacts the GPUGrid servers 5 times to download a task."
Good idea. I routinely set that to 4, but 10 is better. It is remarkable what you have to do (in most projects) to get work. |
©2026 Universitat Pompeu Fabra