Optimized bandwidth

Author	Message
Aurum Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 39	Message 55602 - Posted: 14 Oct 2020, 18:34:30 UTC Last modified: 14 Oct 2020, 18:41:36 UTC
It won't work without a Retry. I've seen WUs that completed after BOINC_Nap.sh started and then went into the fatal "Download pending (Project backoff)" mode. E.g.:
    Rig-44 GPUGRID 3qhoA00_320_4-TONI_MDADex2sq-15-par_file 0.000 178.34 K 00:04:02 - 00:17:28 0.00 KBps Download pending (Project backoff: 01:07:59)
So I inserted Retry:
    #!/bin/bash
    while :
    do
        # Contact the project, then retry every pending file transfer.
        /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net update
        for i in $(boinccmd --get_file_transfers | sed -n -e 's/^.*name: //p'); do
            boinccmd --host localhost --passwd pw --file_transfer https://www.gpugrid.net "$i" retry
        done
        echo "Update & Retry GPUGrid then sleep for 20 seconds"
        sleep 20
        # Stop fetching new work for 10 minutes.
        /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net nomorework
        echo "NoMoreWork GPUGrid then sleep for 10 minutes"
        sleep 10m
        # Allow new work again just before the next update.
        /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net allowmorework
        echo "AllowMoreWork GPUGrid then sleep for 1 second"
        sleep 1
    done
It's a bit unnerving to look at my Project tab and see most of my GG set to "No new work." But watch for 10 minutes and they turn on and off.
The downside is that if one wants to gracefully shut down ARPs with their 2-hour checkpoints, it turns them back on when they need to stay off. Suspending a WU stops new DLs.
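
The sed expression in the retry loop above can be tried on its own. This is a minimal sketch, assuming that `boinccmd --get_file_transfers` prints one `name: <file>` line per transfer. The sample text is an assumed shape, not captured output, and `transfer_names` is a hypothetical helper name. The two file names are taken from posts in this thread.

```shell
#!/bin/bash
# Sketch only: sample_output is an ASSUMED shape for the
# `boinccmd --get_file_transfers` listing, not a captured log.

# Print the file name from every "name: ..." line read on stdin.
# Same sed expression as in the script above.
transfer_names() {
    sed -n -e 's/^.*name: //p'
}

sample_output='1) -----------
   name: 3qhoA00_320_4-TONI_MDADex2sq-15-par_file
   direction: download
2) -----------
   name: 2hy5B00_320_0-TONI_MDADex2sh-33-conf_file_enc
   direction: download'

# Prints the two file names, one per line.
printf '%s\n' "$sample_output" | transfer_names
```

Note that the pattern matches any line containing `name: `, so it is worth checking the real listing on your own host before relying on it.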

Ian&Steve C. Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 0	Message 55603 - Posted: 14 Oct 2020, 19:37:14 UTC - in response to Message 55602.
I saw a few instances of transfers getting stuck. But they usually clear on the next attempt in a few mins. The first backoff from a stuck transfer is rather short. Then they get longer and longer on each successive retry failure. Having 1 stuck task doesn’t prevent downloads of more work. But having a lot of stuck ones does.
I ran my script without automatic transfer retries on my systems for over 24hrs, and even though one would occasionally get stuck, it always eventually cleared itself without intervention. That was my point. Getting stuck occasionally isn’t a problem if the transfer eventually goes through, which in my case it always did. You just have to trust it a bit and not get too anxious if you see a stuck one.
I can see how having 40+ systems might be a different situation though. So if you absolutely need it, then do what works for you.
I don’t know what you are referring to with ARPs and 2hr checkpoints though. Care to elaborate? What needs to stay off?

Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0	Message 55604 - Posted: 14 Oct 2020, 20:58:56 UTC - in response to Message 55603.
ARP is one of the many projects of World Community Grid: the Africa Rainfall Project.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 0	Message 55605 - Posted: 14 Oct 2020, 22:37:12 UTC - in response to Message 55604.
OK, that only seems to further the confusion. My script won't change anything with WCG or its projects. Nor do I know what he means by suspending WUs, since my script doesn't do that either; it just stops getting new work for GPUGRID.
So I don't see what the issue or connection is between this script and WCG/ARP or suspending WUs or whatever.

Aurum Joined: 12 Jul 17 Posts: 404 Credit: 17,412,649,587 RAC: 39	Message 55608 - Posted: 15 Oct 2020, 15:47:09 UTC Last modified: 15 Oct 2020, 15:48:16 UTC
Africa Rainfall Project has 2 to 3 hour checkpoints. If one wants to avoid discarding all that work then an orderly shutdown is required. One selects "No New Work" for all projects, waits for everything to checkpoint, and then shuts down. This command reverses that:
    /usr/bin/boinccmd --project https://www.gpugrid.net allowmorework
It switches GG to Allow New Work long enough to start additional 1-2 hour GG WUs. Just a small occasional nuisance.
I still have not heard a peep out of GDF or Toni about whether they intend to fix their server issue.
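
One possible way around that nuisance is to make the loop skip `allowmorework` while a shutdown is being prepared. The following is only a sketch of that idea, not something posted in the thread: the hold file path, the `HOLD_FILE` and `BOINCCMD` variables and the function name are all hypothetical names introduced here.

```shell
#!/bin/bash
# Sketch: skip "allowmorework" while a hold file exists.
# HOLD_FILE, BOINCCMD and the function name are hypothetical, for illustration.
HOLD_FILE=${HOLD_FILE:-/tmp/boinc_hold_new_work}
BOINCCMD=${BOINCCMD:-/usr/bin/boinccmd}

allow_more_work_unless_held() {
    local project_url=$1
    if [ -e "$HOLD_FILE" ]; then
        # A shutdown is being prepared: leave the project at No New Work.
        echo "hold file present, leaving $project_url at No New Work"
        return 0
    fi
    "$BOINCCMD" --host localhost --passwd pw --project "$project_url" allowmorework
}
```

In the loop, the `allowmorework` line would become `allow_more_work_unless_held https://www.gpugrid.net`. Running `touch /tmp/boinc_hold_new_work` before an orderly shutdown keeps GG off, and removing the file restores the normal on/off cycle.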

Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0	Message 55819 - Posted: 25 Nov 2020, 22:52:04 UTC
Now that COVID moonshot sprint 5 is finished at folding@home, my hosts have run out of work. So I've put them back on GPUGrid, and immediately got blocked by that DDOS defense.
I've set my hosts to a 0.01-day work buffer, but then I realized that they start only 2 file transfers simultaneously, while a workunit has 9 files to download, so the given host contacts the GPUGrid servers 5 times to download a task. To reduce that I've increased the number of simultaneous file transfers per project to 10 (the global number to 20) by putting
    <max_file_xfers>20</max_file_xfers>
    <max_file_xfers_per_project>10</max_file_xfers_per_project>
in the <options> section of the cc_config.xml file, and re-reading the config files.
I can see in the log that the manager starts all 9 downloads at the same time: the "starting" and the "finished" messages were mixed up before; now all 9 "starting download of ..." messages are in a block, having the same timestamp. It seems to help; at least I can access the forum.
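
For anyone who wants to try the same settings, here is a minimal sketch of a complete cc_config.xml containing only those two options, plus the batch arithmetic from the post (9 files at 2 per batch gives 5 batches). The function names are hypothetical. The location of the BOINC data directory varies by installation, so the path is left as an argument.

```shell
#!/bin/bash
# Sketch: write a minimal cc_config.xml with the two limits from the post.
# WARNING: this overwrites the target file; merge by hand if one already exists.
write_transfer_limits() {
    local config_path=$1
    cat > "$config_path" <<'EOF'
<cc_config>
  <options>
    <max_file_xfers>20</max_file_xfers>
    <max_file_xfers_per_project>10</max_file_xfers_per_project>
  </options>
</cc_config>
EOF
}

# Number of transfer batches needed for a task: ceil(files / per_project_limit).
transfer_batches() {
    local files=$1 limit=$2
    echo $(( (files + limit - 1) / limit ))
}

transfer_batches 9 2    # limit of 2 per project, as in the post
transfer_batches 9 10   # raised limit
```

After writing the file into the BOINC data directory, `boinccmd --read_cc_config` asks the running client to re-read it.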

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 55825 - Posted: 26 Nov 2020, 9:42:15 UTC - in response to Message 55819.
There's a flaw in the logic there. If you examine BOINC's http_debug log, you can see that once the host has established a connection, it preserves it and keeps re-using it:
    26/11/2020 09:29:07 | GPUGRID | [http] [ID#12984] Info: Connection #7366 to host www.gpugrid.net left intact
    26/11/2020 09:29:08 | GPUGRID | Finished upload of 2jh1A01_348_1-TONI_MDADex2sj-33-50-RND7955_0_0
    26/11/2020 09:29:08 | GPUGRID | Started upload of 2jh1A01_348_1-TONI_MDADex2sj-33-50-RND7955_0_2
    26/11/2020 09:29:08 | GPUGRID | [http] [ID#12985] Info: Re-using existing connection! (#7366) with host www.gpugrid.net
    26/11/2020 09:29:18 | GPUGRID | Sending scheduler request: To report completed tasks.
    26/11/2020 09:29:18 | GPUGRID | Reporting 1 completed tasks
    26/11/2020 09:29:18 | GPUGRID | [http] [ID#1] Info: Re-using existing connection! (#7366) with host www.gpugrid.net
    26/11/2020 09:29:21 | GPUGRID | Started download of 2hy5B00_320_0-TONI_MDADex2sh-33-conf_file_enc
    26/11/2020 09:29:22 | GPUGRID | [http] [ID#12990] Info: Re-using existing connection! (#7366) with host www.gpugrid.net
That's a very short extract from a very long log, but connection #7366 was used for uploads, reporting, and downloads without needing to be re-established.
By contrast, if your initial attempt was made at a moment when GPUGrid was unready to accept a connection from your IP address, all nine downloads will fail at the same time. This will drive BOINC into a project-wide backoff lasting several hours.
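
To check connection re-use on your own host, the log lines can be tallied per connection number. This is a small sketch that relies only on the "Re-using existing connection! (#N)" wording visible in the excerpt above. `connection_reuse_counts` is a hypothetical helper name, and the sample lines are copied from that excerpt.

```shell
#!/bin/bash
# Sketch: count how often each HTTP connection number is re-used.
# Reads BOINC event log text on stdin.
connection_reuse_counts() {
    sed -n 's/.*Re-using existing connection! (#\([0-9][0-9]*\)).*/\1/p' |
        sort | uniq -c | awk '{print "#" $2 ": " $1}'
}

sample_log='26/11/2020 09:29:08 | GPUGRID | [http] [ID#12985] Info: Re-using existing connection! (#7366) with host www.gpugrid.net
26/11/2020 09:29:18 | GPUGRID | [http] [ID#1] Info: Re-using existing connection! (#7366) with host www.gpugrid.net
26/11/2020 09:29:22 | GPUGRID | [http] [ID#12990] Info: Re-using existing connection! (#7366) with host www.gpugrid.net'

# Prints "#7366: 3" for the three re-use lines in the sample.
printf '%s\n' "$sample_log" | connection_reuse_counts
```

These `[http]` lines only appear when the http_debug log flag is enabled, i.e. `<http_debug>1</http_debug>` in the `<log_flags>` section of cc_config.xml.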

Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0	Message 55827 - Posted: 26 Nov 2020, 15:35:39 UTC - in response to Message 55825.
Richard Haselgrove wrote: "There's a flaw in the logic there. If you examine BOINC's http_debug log, you can see that once the host has established a connection, it preserves it and keeps re-using it: ... That's a very short extract from a very long log, but connection #7366 was used for uploads, reporting, and downloads without needing to be re-established."
I didn't examine the http_debug log before, so that's the reason for the flaw in my logic, however...
Richard Haselgrove wrote: "By contrast, if your initial attempt was made at a moment when GPUGrid was unready to accept a connection from your IP address, all nine downloads will fail at the same time. This will drive BOINC into a project-wide backoff lasting several hours."
I thought that raising the "file transfers per project" limit would help, because I saw the same thing happen when the "per project" limit is 2 (or 5). Some of the files are downloaded, some of the downloads get stuck. After a few unsuccessful retries, the project backoff kicks in, even when the "per project" limit is low. My point is that this unknown DDOS protection is triggered even if the BOINC manager reuses the open http connection(s).
In the meantime it turned out that this method is not an adequate workaround: the uploads / downloads still get stuck on my hosts. So the only working method is the "file transfer retry" script.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 55828 - Posted: 26 Nov 2020, 16:53:36 UTC - in response to Message 55827.
My experience is slightly different. I have seven machines attached, made up of
    2x Linux machines, with 2 GPUs each
    3x Windows machines, with 2 GPUs each
    2x Windows machines, with 1 GPU each
Each machine may make a random attempt to download new work, but usually gets rebuffed because the machines are at the limit of 'one task and a spare' per GPU.
The fun starts when a machine completes a task and starts to upload. If no other machine has phoned home in the last few minutes (define 'few'?), it connects straight away, uploads all six files, reports, and downloads a replacement, all without a delay, reusing the same connection. The sample log I posted this morning came into that category.
A Linux machine may hit too soon, and be rebuffed. But it'll keep trying the same pair of uploads for a full two minutes. If they don't get through, each upload will be backed off for one or two minutes, but another two will be tried. Usually, two pairs (four minutes) will be enough to establish the connection, and the final four uploads will sail through. The first two will retry, usually get through (I'm not sure how long BOINC keeps the successful connection open), and then the report/replacement also follows immediately.
Windows machines have a problem. When rebuffed, they only keep retrying the first pair for 21 or 22 seconds. The second pair, likewise, are only retried for 21/22 seconds. The third pair will always be attempted, but the whole task upload only gets 66 seconds (maximum) to complete uploading. That isn't enough: if the uploading hasn't started by then, the six consecutive failed uploads are enough to drive BOINC into a project-wide backoff of well over an hour.
When writing that bit of code, David Anderson made an unconscious assumption that one task = one upload, so three consecutive failed uploads (the actual trigger) implies three failed tasks over a period of time, and hence a server experiencing problems. His safeguard is safeguarding us from a completely different problem from the one that's facing us here. That's something I tried to address in https://github.com/BOINC/boinc/issues/3778, to singularly little effect.
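
The trigger described above can be illustrated with a toy model. This is not BOINC's actual code, only a sketch of the "three consecutive failed uploads" rule as stated in the post. `THRESHOLD` and `backoff_position` are hypothetical names.

```shell
#!/bin/bash
# Toy model of the behaviour described in the post, NOT BOINC's implementation.
# THRESHOLD=3 is the trigger quoted above.
THRESHOLD=3

# Args: a sequence of transfer results, each "ok" or "fail".
# Prints the 1-based position at which a project-wide backoff would start,
# or "none" if the run of consecutive failures never reaches THRESHOLD.
backoff_position() {
    local streak=0 position=0 result
    for result in "$@"; do
        position=$((position + 1))
        if [ "$result" = "fail" ]; then
            streak=$((streak + 1))
        else
            streak=0
        fi
        if [ "$streak" -ge "$THRESHOLD" ]; then
            echo "$position"
            return 0
        fi
    done
    echo "none"
}

# One task with six uploads that all fail: the threshold is reached
# inside a single task, at the third upload.
backoff_position fail fail fail fail fail fail

# Three pairs at roughly 22 seconds each gives the 66-second window.
echo "$((3 * 22)) seconds"
```

Under this model a single rebuffed task is enough to reach the threshold, which is the mismatch with the "one task = one upload" assumption described in the post.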

Jim1348 Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0	Message 55848 - Posted: 29 Nov 2020, 17:41:15 UTC - in response to Message 55828.
Richard Haselgrove wrote: "My experience is slightly different. I have seven machines attached, made up of 2x Linux machines, with 2 GPUs each; 3x Windows machines, with 2 GPUs each; 2x Windows machines, with 1 GPU each."
Take your Windows machines to Folding. Their core_22 now has a CUDA version that works well.
I will bring my Linux machines here (mostly GTX 1070's). Their control program works only with Python 2, and Ubuntu 20.04 only has Python 3, so I am being squeezed out as I upgrade. (They might fix it someday, but it is a generational thing).

Ian&Steve C. Joined: 21 Feb 20 Posts: 1117 Credit: 40,876,970,595 RAC: 0	Message 55849 - Posted: 29 Nov 2020, 22:33:45 UTC - in response to Message 55848. Last modified: 29 Nov 2020, 22:35:07 UTC
From what the admins have posted, GPUGRID includes the whole Python package with the application, so the environment doesn’t matter. I run all my systems on Ubuntu 20.04 with no issues.
Or did you mean “they” as in FAH?

Jim1348 Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0	Message 55852 - Posted: 30 Nov 2020, 2:35:43 UTC - in response to Message 55849.
Ian&Steve C. wrote: "Or did you mean “they” as in FAH?"
Yes, it is the Folding control program that has the problem. I am using Ubuntu 20.04 here too.

Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0	Message 55862 - Posted: 1 Dec 2020, 0:41:38 UTC - in response to Message 55848. Last modified: 1 Dec 2020, 0:42:29 UTC
Jim1348 wrote: "Their [FAH] control program works only with Python 2, and Ubuntu 20.04 only has Python 3, so I am being squeezed out as I upgrade. (They might fix it someday, but it is a generational thing)."
If you install Ubuntu 18.04 first, then upgrade it to 20.04, it will leave Python 2 on the system, and FAH will work. If you do a clean install of Ubuntu 20.04, FAH won't work.

Jim1348 Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0	Message 55865 - Posted: 1 Dec 2020, 15:39:56 UTC - in response to Message 55862.
Retvari Zoltan wrote: "If you install Ubuntu 18.04 first, then upgrade it to 20.04 it will leave Python 2 on the system, and FAH will work."
Good thought, but whenever I do an upgrade, it never works. I always end up having to do a clean install anyway. So I will just keep some machines on Ubuntu 18.04 for the time being.
By the way, I just did my usual efficiency tests on GPUGrid, and found that the GTX 1660 Ti and GTX 1650 Super are the best, a little ahead of both the GTX 1060 and GTX 1070, so those are the ones I will use here.

Jim1348 Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0	Message 55869 - Posted: 2 Dec 2020, 15:46:30 UTC - in response to Message 55819.
Retvari Zoltan wrote: "I've set my hosts to 0.01 days work buffer, but then I've realized that they start only 2 file transfers simultaneously, while a workunit has 9 files to download, so the given host contacts the GPUGrid servers 5 times to download a task. To reduce that I've increased the number of simultaneous file transfers per project to 10 (the global number to 20) by putting <max_file_xfers>20</max_file_xfers> <max_file_xfers_per_project>10</max_file_xfers_per_project> in the <options> section of cc_config.xml file, and re-read the config files."
Good idea. I routinely set that to 4, but 10 is better. It is remarkable what you have to do (in most projects) to get work.