Optimized bandwidth

Aurum
Message 55602 - Posted: 14 Oct 2020, 18:34:30 UTC
Last modified: 14 Oct 2020, 18:41:36 UTC

It won't work without a Retry. I've seen WUs that completed after the BOINC_Nap.sh started that went into the fatal Download pending (Project backoff) mode. E.g.,
Rig-44 GPUGRID 3qhoA00_320_4-TONI_MDADex2sq-15-par_file 0.000 178.34 K 00:04:02 - 00:17:28 0.00 KBps Download pending (Project backoff: 01:07:59)
So I inserted Retry.
#!/bin/bash
# Cycle GPUGrid work fetch on and off, retrying pending transfers on each pass.
while :
do
    # Contact the project, then retry every transfer the client lists, by file name.
    /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net update
    for i in $(/usr/bin/boinccmd --host localhost --passwd pw --get_file_transfers | sed -n -e 's/^.*name: //p'); do
        /usr/bin/boinccmd --host localhost --passwd pw --file_transfer https://www.gpugrid.net "$i" retry
    done
    echo "Update & Retry GPUGrid then sleep for 20 seconds"
    sleep 20
    # Stop fetching new work for the next 10 minutes.
    /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net nomorework
    echo "NoMoreWork GPUGrid then sleep for 10 minutes"
    sleep 10m
    # Re-enable work fetch just before the next update.
    /usr/bin/boinccmd --host localhost --passwd pw --project https://www.gpugrid.net allowmorework
    echo "AllowMoreWork GPUGrid then sleep for 1 second"
    sleep 1
done
It's a bit unnerving to look at my Projects tab and see most of my GG set to "No new work." But watch for 10 minutes and they turn on and off. The downside is that if one wants to gracefully shut down ARPs with their 2 hour checkpoints, the script turns GG work fetch back on when it needs to stay off. Suspending a WU stops new DLs.
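The name extraction that feeds the retry loop can be tried without a running client. The sketch below wraps the same sed expression in a function. The sample input is a hypothetical stand-in for `boinccmd --get_file_transfers` output and only assumes that each transfer has a line containing `name: `, as the script does.

```shell
#!/bin/sh
# Print one transfer file name per line, using the sed expression from the script above.
# Input is read from stdin so the function can be fed real or sample output.
transfer_names() {
    sed -n -e 's/^.*name: //p'
}

# Hypothetical sample; real boinccmd output has more fields per transfer.
sample_transfers() {
    printf '%s\n' \
        '1) -----------' \
        '   name: 3qhoA00_320_4-TONI_MDADex2sq-15-par_file' \
        '   direction: download'
}
```

Running `sample_transfers | transfer_names` prints only the file name. On a real host the same check would be `boinccmd --get_file_transfers | transfer_names`.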
Ian&Steve C.

Message 55603 - Posted: 14 Oct 2020, 19:37:14 UTC - in response to Message 55602.  

I saw a few instances of transfers getting stuck, but they usually clear on the next attempt in a few minutes. The first backoff from a stuck transfer is rather short; the backoffs then get longer on each successive retry failure. Having 1 stuck task doesn’t prevent downloads of more work, but having a lot of stuck ones does. I ran my script without automatic transfer retries on my systems for over 24 hours, and even though one would occasionally get stuck, it always eventually cleared itself without intervention. That was my point: getting stuck occasionally isn’t a problem if the transfer eventually goes through, which in my case it always did. You just have to trust it a bit and not get too anxious if you see a stuck one. I can see how having 40+ systems might be a different situation though. So if you absolutely need it, then do what works for you.
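To picture how successive failures stretch the wait, here is an illustrative doubling schedule. The 60-second base and 4-hour cap are made-up numbers for the example, and plain doubling is an assumption; this is not BOINC's actual backoff code.

```shell
#!/bin/sh
# Illustrative only: delay before the next retry after N consecutive failures.
# Arguments: failure count, base seconds (default 60), cap seconds (default 14400).
# The body runs in a subshell so its variables do not leak into the caller.
backoff_seconds() (
    failures=$1
    base=${2:-60}
    cap=${3:-14400}
    delay=$base
    i=1
    # Double the delay once per additional failure until the cap is reached.
    while [ "$i" -lt "$failures" ] && [ "$delay" -lt "$cap" ]; do
        delay=$((delay * 2))
        i=$((i + 1))
    done
    [ "$delay" -gt "$cap" ] && delay=$cap
    echo "$delay"
)
```

With these example numbers a single failure waits a minute, while a long run of failures ends up pinned at the cap, which is why one stuck file is harmless but many in a row are not.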

I don’t know what you are referring to with ARPs and 2hr checkpoints though. Care to elaborate? What needs to stay off?


Retvari Zoltan
Message 55604 - Posted: 14 Oct 2020, 20:58:56 UTC - in response to Message 55603.  

ARP is the Africa Rainfall Project, one of the many projects of World Community Grid.
Ian&Steve C.

Message 55605 - Posted: 14 Oct 2020, 22:37:12 UTC - in response to Message 55604.  

OK, that only seems to further the confusion. My script won't change anything with WCG or its projects. Nor do I know what he means by suspending WUs, since my script doesn't do that either; it just stops getting new work for GPUGRID. So I don't see what the issue or connection is between this script and WCG/ARP or suspending WUs or whatever.
Aurum
Message 55608 - Posted: 15 Oct 2020, 15:47:09 UTC
Last modified: 15 Oct 2020, 15:48:16 UTC

Africa Rainfall Project has 2 to 3 hour checkpoints. If one wants to avoid discarding all that work then an orderly shut down is required. One selects "No New Work" for all projects and waits for everything to checkpoint and then shuts down.
This command reverses that:
/usr/bin/boinccmd --project https://www.gpugrid.net allowmorework

It switches GG back to Allow New Work long enough for additional 1-2 hour GG WUs to start. Just a small occasional nuisance.

I still have not heard a peep out of GDF or Toni about whether they intend to fix their server issue.
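
One possible way around the shutdown nuisance would be to have the loop check for a flag file before re-enabling work fetch. This is a sketch, not part of the script posted earlier: the `HOLD_FILE` path and the function name are hypothetical, and `BOINCCMD` is overridable only so the logic can be tried without a client.

```shell
#!/bin/sh
# Hypothetical flag file: while it exists, work fetch is left off.
HOLD_FILE="${HOLD_FILE:-/tmp/gpugrid_hold}"
# Overridable so the logic can be exercised without a BOINC client.
BOINCCMD="${BOINCCMD:-/usr/bin/boinccmd}"

# Re-enable GPUGrid work fetch unless the hold file is present.
allow_more_work_unless_held() {
    if [ -e "$HOLD_FILE" ]; then
        echo "Hold file present, leaving GPUGrid at No New Work"
        return 0
    fi
    $BOINCCMD --project https://www.gpugrid.net allowmorework
}
```

In the loop, a call to this function would take the place of the plain allowmorework line. Before a planned shutdown one would `touch` the hold file, wait for the checkpoints, and remove the file after restarting.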
Retvari Zoltan
Message 55819 - Posted: 25 Nov 2020, 22:52:04 UTC

Now that COVID Moonshot sprint 5 is finished at Folding@home, my hosts have run out of work.
So I've put them back on GPUGrid, and immediately got blocked by that DDoS defense.
I've set my hosts to 0.01 days work buffer, but then I've realized that they start only 2 file transfers simultaneously, while a workunit has 9 files to download, so the given host contacts the GPUGrid servers 5 times to download a task.
To reduce that I've increased the number of simultaneous file transfers per project to 10 (the global number to 20) by putting
    <max_file_xfers>20</max_file_xfers>
    <max_file_xfers_per_project>10</max_file_xfers_per_project>
in the <options> section of cc_config.xml file, and re-read the config files.
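
For reference, a minimal file with just these two options could be generated as below. This is a sketch: the function name is hypothetical, and a real cc_config.xml usually holds other settings that should be kept, so the output is printed rather than written over an existing file.

```shell
#!/bin/sh
# Print a minimal cc_config.xml carrying only the two transfer limits.
# Arguments: global limit, per-project limit (defaults are the values from the post).
print_cc_config() {
    cat <<EOF
<cc_config>
  <options>
    <max_file_xfers>${1:-20}</max_file_xfers>
    <max_file_xfers_per_project>${2:-10}</max_file_xfers_per_project>
  </options>
</cc_config>
EOF
}
```

After merging the options into the client's cc_config.xml, `boinccmd --read_cc_config` asks a running client to re-read it.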

I can see in the log that the manager now starts all 9 downloads at the same time. Previously the "starting" and "finished" messages were interleaved; now all 9 "starting download of ..." messages appear in one block with the same timestamp.

It seems to help, at least I can access the forum.
Richard Haselgrove

Message 55825 - Posted: 26 Nov 2020, 9:42:15 UTC - in response to Message 55819.  

There's a flaw in the logic there. If you examine BOINC's http_debug log, you can see that once the host has established a connection, it preserves it and keeps re-using it:

26/11/2020 09:29:07 | GPUGRID | [http] [ID#12984] Info:  Connection #7366 to host www.gpugrid.net left intact
26/11/2020 09:29:08 | GPUGRID | Finished upload of 2jh1A01_348_1-TONI_MDADex2sj-33-50-RND7955_0_0
26/11/2020 09:29:08 | GPUGRID | Started upload of 2jh1A01_348_1-TONI_MDADex2sj-33-50-RND7955_0_2
26/11/2020 09:29:08 | GPUGRID | [http] [ID#12985] Info:  Re-using existing connection! (#7366) with host www.gpugrid.net
26/11/2020 09:29:18 | GPUGRID | Sending scheduler request: To report completed tasks.
26/11/2020 09:29:18 | GPUGRID | Reporting 1 completed tasks
26/11/2020 09:29:18 | GPUGRID | [http] [ID#1] Info:  Re-using existing connection! (#7366) with host www.gpugrid.net
26/11/2020 09:29:21 | GPUGRID | Started download of 2hy5B00_320_0-TONI_MDADex2sh-33-conf_file_enc
26/11/2020 09:29:22 | GPUGRID | [http] [ID#12990] Info:  Re-using existing connection! (#7366) with host www.gpugrid.net

That's a very short extract from a very long log, but connection #7366 was used for uploads, reporting, and downloads without needing to be re-established.

By contrast, if your initial attempt was made at a moment when GPUGrid was unready to accept a connection from your IP address, all nine downloads will fail at the same time. This will drive BOINC into a project-wide backoff lasting several hours.
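
A long http_debug log can be summarized by counting how often each connection number is re-used. The sketch below assumes the message text matches the extract above exactly; the function name is hypothetical.

```shell
#!/bin/sh
# Read client log lines on stdin and print "<connection number> <reuse count>",
# one line per connection, based on the "Re-using existing connection!" messages.
count_connection_reuse() {
    sed -n 's/.*Re-using existing connection! (#\([0-9][0-9]*\)).*/\1/p' |
        sort -n | uniq -c | awk '{ print $2, $1 }'
}
```

Fed the extract above, it would print `7366 3`, since that connection is re-used for an upload, the scheduler request, and a download. Many different connection numbers with a count of 1 each would suggest connections are not being kept open.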
Retvari Zoltan
Message 55827 - Posted: 26 Nov 2020, 15:35:39 UTC - in response to Message 55825.  

There's a flaw in the logic there. If you examine BOINC's http_debug log, you can see that once the host has established a connection, it preserves it and keeps re-using it:
...
That's a very short extract from a very long log, but connection #7366 was used for uploads, reporting, and downloads without needing to be re-established.
I didn't examine the http_debug log before, so that's the reason for the flaw in my logic, however...

By contrast, if your initial attempt was made at a moment when GPUGrid was unready to accept a connection from your IP address, all nine downloads will fail at the same time. This will drive BOINC into a project-wide backoff lasting several hours.
I thought that raising the "file transfers per project" limit would help, because I saw the same thing happen when the "per project" limit was 2 (or 5): some of the files are downloaded, and some of the downloads get stuck. After a few unsuccessful retries, the project backoff kicks in, even when the "per project" limit is low.
My point is that this unknown DDoS protection is triggered even if the BOINC manager reuses the open HTTP connection(s).

In the meantime it turned out that this method is not an adequate workaround: uploads and downloads still get stuck on my hosts.

So the only working method is the "file transfer retry" script.
Richard Haselgrove

Message 55828 - Posted: 26 Nov 2020, 16:53:36 UTC - in response to Message 55827.  

My experience is slightly different. I have seven machines attached, made up of

2x Linux machines, with 2 GPUs each
3x Windows machines, with 2 GPUs each
2x Windows machines, with 1 GPU each

Each machine may make a random attempt to download new work, but usually gets rebuffed because the machines are at the limit of 'one task and a spare' per GPU.

The fun starts when a machine completes a task and starts to upload. If no other machine has phoned home in the last few minutes (define 'few'?), it connects straight away, uploads all six files, reports, and downloads a replacement - all without a delay, reusing the same connection. The sample log I posted this morning came into that category.

A Linux machine may hit too soon, and be rebuffed. But it'll keep trying the same pair of uploads for a full two minutes. If they don't get through, each upload will be backed off for one or two minutes, but another two will be tried. Usually, two pairs - four minutes - will be enough to establish the connection, and the final four uploads will sail through. The first two will retry, usually get through (I'm not sure how long BOINC keeps the successful connection open), and then the report/replacement also follows immediately.

Windows machines have a problem. When rebuffed, they only keep retrying the first pair for 21 or 22 seconds. The second pair, likewise, are only retried for 21/22 seconds. The third pair will always be attempted, but the whole task upload only gets 66 seconds (maximum) to complete uploading. That isn't enough: if the uploading hasn't started by then, the six consecutive failed uploads are enough to drive BOINC into a project-wide backoff of well over an hour.

When writing that bit of code, David Anderson made an unconscious assumption that one task = one upload, so three consecutive failed uploads (the actual trigger) implies three failed tasks over a period of time, and hence a server experiencing problems. His safeguard is safeguarding us from a completely different problem from the one that's facing us here.

That's something I tried to address in https://github.com/BOINC/boinc/issues/3778, to singularly little effect.
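
The timings above can be put into numbers. The sketch assumes transfers go out two at a time (the per-project default mentioned earlier in the thread) and that each batch is retried for a fixed time before the next batch starts; the function name is hypothetical.

```shell
#!/bin/sh
# Total time a task's transfers are retried before every file has failed once.
# Arguments: number of files, simultaneous transfers, retry seconds per batch.
# The body runs in a subshell so its variables do not leak into the caller.
retry_window() (
    files=$1
    per_batch=$2
    secs=$3
    batches=$(( (files + per_batch - 1) / per_batch ))   # round up
    echo $(( batches * secs ))
)
```

`retry_window 6 2 22` gives 66, the Windows figure quoted above. With the two minutes per pair seen on Linux, the same six files would get up to 360 seconds, which fits the observation that Linux hosts usually connect before all uploads have failed.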
Jim1348

Message 55848 - Posted: 29 Nov 2020, 17:41:15 UTC - in response to Message 55828.  

My experience is slightly different. I have seven machines attached, made up of

2x Linux machines, with 2 GPUs each
3x Windows machines, with 2 GPUs each
2x Windows machines, with 1 GPU each

Take your Windows machines to Folding. Their core_22 now has a CUDA version that works well.

I will bring my Linux machines here (mostly GTX 1070's). Their control program works only with Python 2, and Ubuntu 20.04 only has Python 3, so I am being squeezed out as I upgrade. (They might fix it someday, but it is a generational thing).
Ian&Steve C.

Message 55849 - Posted: 29 Nov 2020, 22:33:45 UTC - in response to Message 55848.  
Last modified: 29 Nov 2020, 22:35:07 UTC

From what the admins have posted, GPUGRID includes the whole Python package with the application, so the environment doesn’t matter. I run all my systems on Ubuntu 20.04 and no issues.

Or did you mean “they” as in FAH?
Jim1348

Message 55852 - Posted: 30 Nov 2020, 2:35:43 UTC - in response to Message 55849.  

Or did you mean “they” as in FAH?

Yes, it is the Folding control program that has the problem.
I am using Ubuntu 20.04 here too.
Retvari Zoltan
Message 55862 - Posted: 1 Dec 2020, 0:41:38 UTC - in response to Message 55848.  
Last modified: 1 Dec 2020, 0:42:29 UTC

Their [FAH] control program works only with Python 2, and Ubuntu 20.04 only has Python 3, so I am being squeezed out as I upgrade. (They might fix it someday, but it is a generational thing).
If you install Ubuntu 18.04 first, then upgrade it to 20.04 it will leave Python 2 on the system, and FAH will work.
If you do a clean install of Ubuntu 20.04, FAH won't work.
Jim1348

Message 55865 - Posted: 1 Dec 2020, 15:39:56 UTC - in response to Message 55862.  

If you install Ubuntu 18.04 first, then upgrade it to 20.04 it will leave Python 2 on the system, and FAH will work.

Good thought, but whenever I do an upgrade, it never works. I always end up having to do a clean install anyway.
So I will just keep some machines on Ubuntu 18.04 for the time being.

By the way, I just did my usual efficiency tests on GPUGrid, and found that the GTX 1660 Ti and GTX 1650 Super are the best, a little ahead of both the GTX 1060 and GTX 1070, so those are the ones I will use here.
Jim1348

Message 55869 - Posted: 2 Dec 2020, 15:46:30 UTC - in response to Message 55819.  

I've set my hosts to 0.01 days work buffer, but then I've realized that they start only 2 file transfers simultaneously, while a workunit has 9 files to download, so the given host contacts the GPUGrid servers 5 times to download a task.
To reduce that I've increased the number of simultaneous file transfers per project to 10 (the global number to 20) by putting
    <max_file_xfers>20</max_file_xfers>
    <max_file_xfers_per_project>10</max_file_xfers_per_project>
in the <options> section of cc_config.xml file, and re-read the config files.

Good idea. I routinely set that to 4, but 10 is better.
It is remarkable what you have to do (in most projects) to get work.

©2026 Universitat Pompeu Fabra