SOS-Downloads stuck

Vagelis Giannadakis

Message 44772 - Posted: 18 Oct 2016, 22:28:44 UTC - in response to Message 44767.  

Same here, but only downloads stall for me, and again: only for GPUGrid.


After checking the logs a little more closely, I concur: only downloads exhibit this symptom, not uploads. For example:

18-Oct-2016 23:54:49 [GPUGRID] Requesting new tasks for NVIDIA GPU
18-Oct-2016 23:54:52 [GPUGRID] Scheduler request completed: got 1 new tasks
18-Oct-2016 23:54:54 [GPUGRID] Started download of ...
...
19-Oct-2016 00:00:04 [GPUGRID] Temporarily failed download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-coor_file: transient HTTP error
19-Oct-2016 00:00:04 [GPUGRID] Backing off 00:03:05 on download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-coor_file
19-Oct-2016 00:00:04 [GPUGRID] Temporarily failed download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-pdb_file: transient HTTP error
19-Oct-2016 00:00:04 [GPUGRID] Backing off 00:02:47 on download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-pdb_file
19-Oct-2016 00:00:04 [GPUGRID] Started download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-psf_file
19-Oct-2016 00:00:04 [GPUGRID] Started download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-par_file
...

The downloads of these two files kept failing and retrying; they took about 10 minutes to complete.
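To see how much of that delay is pure back-off, the "Backing off" lines can be totalled per file. This is only a minimal sketch: it assumes the event log lines look exactly like the ones quoted above, and the function name `backoff_per_file` is made up for illustration.

```python
import re
from datetime import timedelta

# Matches lines such as "Backing off 00:03:05 on download of <file>".
BACKOFF_RE = re.compile(r"Backing off (\d+):(\d+):(\d+) on download of (\S+)")

def backoff_per_file(log_lines):
    """Return a dict mapping file name to total back-off time (timedelta)."""
    totals = {}
    for line in log_lines:
        match = BACKOFF_RE.search(line)
        if match is None:
            continue  # not a back-off line
        hours, minutes, seconds, name = match.groups()
        delay = timedelta(hours=int(hours), minutes=int(minutes), seconds=int(seconds))
        totals[name] = totals.get(name, timedelta()) + delay
    return totals

sample = [
    "19-Oct-2016 00:00:04 [GPUGRID] Backing off 00:03:05 on download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-coor_file",
    "19-Oct-2016 00:00:04 [GPUGRID] Backing off 00:02:47 on download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-pdb_file",
]
totals = backoff_per_file(sample)
for name, delay in totals.items():
    print(name, delay)
```

Feeding it a whole event log would show which files accumulate the most waiting time.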

Retvari Zoltan

Message 44774 - Posted: 18 Oct 2016, 23:42:13 UTC - in response to Message 44767.  

It seems that everyone (including me) has this happening:

17 85 ms 83 ms 91 ms anella-val1-router.red.rediris.es [130.206.211.70]
18 * * * Request timed out.
19 83 ms 83 ms 86 ms grosso.upf.edu [84.89.134.145]

Is it the problem?

I assume you refer to #18: It's quite normal that some routers don't reply to requests which come from random computers on the internet.
I hoped to get some clues, but we're still just guessing the problem.
To investigate this issue, the network admins at the campus would need to do some packet-level traffic analysis, and then decide whether to take countermeasures locally or contact other ISPs for a solution. But frankly I think this issue doesn't have that much impact on the project's throughput. I don't know how many sites are hosted on this server (besides ps3grid.net and gpugrid.net). I presume there are a lot of servers hosting a lot of webpages at the campus which are routed through the same devices. Their traffic may interfere with GPUGrid's traffic, but it can't be analysed from outside.

Vagelis Giannadakis

Message 44775 - Posted: 19 Oct 2016, 8:06:06 UTC

I think this is more likely a gateway / firewall / reverse proxy issue. The connections are not closed; they are just stalled. Force-closing a connection raises an error on both sides immediately, and clearly this does not happen with our downloads. I think some network component (hardware or software), through which our connections are routed, intervenes and stalls them. Perhaps some network traffic limiter?

Richard Haselgrove

Message 44776 - Posted: 19 Oct 2016, 8:13:26 UTC - in response to Message 44772.  

transient HTTP error

Transient HTTP errors can be diagnosed further by setting the <http_debug> event log flag in BOINC. I'll do that next time I'm due to download a new task (if I remember to notice in time), but my expectation is that it will turn out to be simply BOINC's own timeout, which doesn't get us much further forward.

But it would confirm that reducing the timeout to 60 seconds is likely to help.

Richard Haselgrove

Message 44780 - Posted: 19 Oct 2016, 10:48:19 UTC

Well, I've downloaded and logged a new task, and - wouldn't you believe - it didn't get stuck.

But here's a log section the network gurus could have a look at.

19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 2736 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [GPUGRID] [file_xfer] http op done; retval 0 (Success)
19-Oct-2016 10:46:45 [GPUGRID] [file_xfer] file transfer status 0 (Success)
19-Oct-2016 10:46:45 [GPUGRID] Finished download of e24s6_e1s19p0f341-GERARD_ENDOPEP_frag30P_1-0-vel_file
19-Oct-2016 10:46:45 [GPUGRID] [file_xfer] Throughput 0 bytes/sec
19-Oct-2016 10:46:45 [GPUGRID] [http] HTTP_OP::init_get(): http://www.gpugrid.org/PS3GRID/download/b9/e24s6_e1s19p0f341-GERARD_ENDOPEP_frag30P_1-0-idx_file
19-Oct-2016 10:46:45 [GPUGRID] [http] HTTP_OP::libcurl_exec(): ca-bundle 'D:\BOINC\ca-bundle.crt'
19-Oct-2016 10:46:45 [GPUGRID] [http] HTTP_OP::libcurl_exec(): ca-bundle set
19-Oct-2016 10:46:45 [GPUGRID] Started download of e24s6_e1s19p0f341-GERARD_ENDOPEP_frag30P_1-0-idx_file
19-Oct-2016 10:46:45 [GPUGRID] [file_xfer] URL: http://www.gpugrid.org/PS3GRID/download/b9/e24s6_e1s19p0f341-GERARD_ENDOPEP_frag30P_1-0-idx_file
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Info:  Found bundle for host www.gpugrid.org: 0x40b89e0 [can pipeline]
19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Info:  Re-using existing connection! (#1191) with host www.gpugrid.org
19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Info:  Connected to www.gpugrid.org (84.89.134.145) port 80 (#1191)
19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: GET /PS3GRID/download/b9/e24s6_e1s19p0f341-GERARD_ENDOPEP_frag30P_1-0-idx_file HTTP/1.1
19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: Host: www.gpugrid.org
19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: User-Agent: BOINC client (windows_x86_64 7.7.0)
19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: Accept: */*
19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: Accept-Encoding: deflate, gzip
19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: Content-Type: application/x-www-form-urlencoded
19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: Accept-Language: en_GB
19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: 
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 12696 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes

Most of the time the download jogged along writing 1368 bytes at a time: I'm interpreting that as individual packets being received in the right order, and being sent to the disk-writing queue immediately.

"wrote 2736 bytes" appears a lot of times too - probably two packets arriving in reverse order, and both needing to be processed before being written.

But when a new file was being requested, the writes increased to 16384 bytes, and stayed that way for some time. That suggests to me that something in one or other system - server or client - is having problems walking and chewing gum at the same time. Since this is the only project where it happens, I'd suggest that possibly the server is the one on the verge of being overloaded.

Connection [ID#1271] was downloading:

e24s6_e1s19p0f341-GERARD_ENDOPEP_frag30P_1-0-coor_file - 924 KB with throughput 730677 bytes/sec (over 500 packets/sec, if my analysis is right). That's going to be really hard to diagnose from outside the lab, and even inside it without specialist equipment and skills. But one thing comes to mind - restricting BOINC to one file being transferred at a time might ease the pressure caused by that hiccup in the middle.
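The "over 500 packets/sec" estimate can be checked with a quick calculation. This sketch assumes, as in the post, that each 1368-byte write corresponds to one received packet; that is an interpretation of the log, not something the log states.

```python
# Figures taken from the log excerpt and the post above.
throughput_bytes_per_sec = 730_677   # reported throughput for the coor_file
bytes_per_write = 1_368              # typical "wrote N bytes" size, assumed to be one packet

packets_per_sec = throughput_bytes_per_sec / bytes_per_write
print(f"about {packets_per_sec:.0f} packets/sec")  # about 534
```

So the estimate holds under that assumption: roughly 534 writes per second.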

Beyond

Message 44782 - Posted: 19 Oct 2016, 16:32:25 UTC - in response to Message 44776.  

But it would confirm that reducing the timeout to 60 seconds is likely to help.

It does help. I'll post again what made these GPUGrid downloads acceptable for me:

<http_transfer_timeout>60</http_transfer_timeout>

That helps but in order to make it more acceptable I also have to start BOINC from the command line and use this argument:

--pers_retry_delay_max 60

It still starts and stops but retries much more quickly. Downloads went from sometimes taking hours to now around 7-8 minutes. We shouldn't have to jump through these hoops but at least these workarounds help (a lot).
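For anyone who wants to script the first workaround, here is a minimal sketch that builds a cc_config.xml body containing the timeout line. It assumes the setting belongs under `<options>` inside `<cc_config>`, as in the BOINC client configuration documentation; the helper name `build_cc_config` is made up, and the sketch does not merge with an existing file, so check it against your own cc_config.xml before using it.

```python
import xml.etree.ElementTree as ET

def build_cc_config(timeout_seconds):
    """Return a cc_config.xml body that sets only http_transfer_timeout."""
    root = ET.Element("cc_config")
    # Assumption: http_transfer_timeout is read from the <options> section.
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "http_transfer_timeout").text = str(timeout_seconds)
    return ET.tostring(root, encoding="unicode")

config_xml = build_cc_config(60)
print(config_xml)
```

The `--pers_retry_delay_max 60` part is a command-line argument and still has to be passed when starting the client, as described above.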

Vagelis Giannadakis

Message 44790 - Posted: 20 Oct 2016, 9:52:52 UTC

OK, I was able to capture a DEBUG-level section of the log with a failed download. It confirms (of course) the experience we have: the file download begins, a part of the file is downloaded, then the download stalls.

Here's how the download begins. Notice the connection #2131 (already open from a previous download / scheduler request), the thread performing the download (ID#2558) and the server-reported file size of 3193090 bytes:


20-Oct-2016 01:14:18 [GPUGRID] [http] HTTP_OP::init_get(): http://www.gpugrid.org/PS3GRID/download/1f4/e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file
20-Oct-2016 01:14:18 [GPUGRID] [http] HTTP_OP::libcurl_exec(): ca-bundle 'C:\Program Files\BOINC\ca-bundle.crt'
20-Oct-2016 01:14:18 [GPUGRID] [http] HTTP_OP::libcurl_exec(): ca-bundle set
20-Oct-2016 01:14:18 [GPUGRID] Started download of e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file
20-Oct-2016 01:14:18 [---] [http_xfer] [ID#2555] HTTP: wrote 4140 bytes
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Info: Found bundle for host www.gpugrid.org: 0x3f09850
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Info: Re-using existing connection! (#2131) with host www.gpugrid.org
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Info: Connected to www.gpugrid.org (84.89.134.145) port 80 (#2131)
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Sent header to server: GET /PS3GRID/download/1f4/e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file HTTP/1.1
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Sent header to server: User-Agent: BOINC client (windows_x86_64 7.6.9)
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Sent header to server: Host: www.gpugrid.org
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Sent header to server: Accept: */*
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Sent header to server: Accept-Encoding: deflate, gzip
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Sent header to server: Content-Type: application/x-www-form-urlencoded
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Sent header to server: Accept-Language: en_US
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Sent header to server:
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: HTTP/1.1 200 OK
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: Date: Wed, 19 Oct 2016 22:09:48 GMT
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: Server: Apache/2.2.3 (CentOS)
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: Last-Modified: Mon, 03 Oct 2016 08:52:19 GMT
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: ETag: "6b8c03c-30b902-53df20fbd56c0"
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: Accept-Ranges: bytes
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: Content-Length: 3193090
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: Cache-Control: max-age=300
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: Expires: Wed, 19 Oct 2016 22:14:48 GMT
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: Content-Type: text/plain; charset=UTF-8
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server:
20-Oct-2016 01:14:18 [---] [http_xfer] [ID#2558] HTTP: wrote 1053 bytes
...
20-Oct-2016 01:14:19 [---] [http_xfer] [ID#2558] HTTP: wrote 1380 bytes


At this point, the client has downloaded a part of the file and the connection stalls. Then, after about 5 minutes, the client gives up. It closes the connection and the thread dies (it does not appear in the log again):


20-Oct-2016 01:19:25 [GPUGRID] [http] [ID#2558] Info: Operation too slow. Less than 10 bytes/sec transferred the last 300 seconds
20-Oct-2016 01:19:25 [GPUGRID] [http] [ID#2558] Info: Closing connection 2131
20-Oct-2016 01:19:25 [GPUGRID] [http] HTTP error: Timeout was reached
20-Oct-2016 01:19:26 [GPUGRID] Temporarily failed download of e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file: transient HTTP error
20-Oct-2016 01:19:26 [GPUGRID] Backing off 00:02:04 on download of e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file
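The "Operation too slow" message is libcurl's low-speed abort, which libcurl exposes as the CURLOPT_LOW_SPEED_LIMIT and CURLOPT_LOW_SPEED_TIME options. The sketch below is a simplified illustration of that rule applied to the timestamps in the log, not libcurl's actual algorithm; the function `is_stalled` and the sample data layout are made up for the example.

```python
def is_stalled(samples, now, limit_bps=10, window_s=300):
    """Return True if fewer than limit_bps bytes/sec arrived in the last window_s seconds.

    samples: list of (seconds_since_start, bytes_written), oldest first.
    """
    if not samples or now - samples[0][0] < window_s:
        return False  # not enough history to judge yet
    recent_bytes = sum(size for t, size in samples if t > now - window_s)
    return recent_bytes / window_s < limit_bps

# Writes seen in the log: 01:14:18 (t=0) and 01:14:19 (t=1), then nothing.
samples = [(0, 1053), (1, 1380)]

print(is_stalled(samples, now=100))               # still inside the 300 s window
print(is_stalled(samples, now=307))               # 01:19:25, when the client gave up
print(is_stalled(samples, now=62, window_s=60))   # with a 60 s window instead
```

With a 60-second window the same stall would be detected about four minutes earlier, which is consistent with the reports that a shorter timeout speeds up the retries.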


About 2 minutes later, it gives it another try. Notice the new connection #2134 and the new thread ID#2579. Also notice the Range header in the request, asking the server to resume from byte offset 227893 instead of the beginning, and, in the server's response, the 206 Partial Content status code and the Content-Length and Content-Range headers:


20-Oct-2016 01:21:31 [GPUGRID] [http] HTTP_OP::init_get(): http://www.gpugrid.org/PS3GRID/download/1f4/e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file
20-Oct-2016 01:21:31 [GPUGRID] [http] HTTP_OP::libcurl_exec(): ca-bundle 'C:\Program Files\BOINC\ca-bundle.crt'
20-Oct-2016 01:21:31 [GPUGRID] [http] HTTP_OP::libcurl_exec(): ca-bundle set
20-Oct-2016 01:21:31 [GPUGRID] Started download of e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file
...
20-Oct-2016 01:21:32 [GPUGRID] [http] [ID#2579] Info: Connected to www.gpugrid.org (84.89.134.145) port 80 (#2134)
...
20-Oct-2016 01:21:32 [GPUGRID] [http] [ID#2579] Sent header to server: Range: bytes=227893-
...
20-Oct-2016 01:21:32 [GPUGRID] [http] [ID#2579] Received header from server: HTTP/1.1 206 Partial Content
...
20-Oct-2016 01:21:32 [GPUGRID] [http] [ID#2579] Received header from server: Content-Length: 2965197
...
20-Oct-2016 01:21:32 [GPUGRID] [http] [ID#2579] Received header from server: Content-Range: bytes 227893-3193089/3193090
20-Oct-2016 01:21:32 [---] [http_xfer] [ID#2579] HTTP: wrote 995 bytes
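The numbers in these headers can be cross-checked against each other. A minimal sketch, using only the values shown in the log above; the helper `parse_content_range` is made up for illustration and handles only the simple "bytes first-last/total" form.

```python
import re

def parse_content_range(value):
    """Parse 'bytes first-last/total' into a tuple of three ints."""
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+)", value)
    if match is None:
        raise ValueError(f"unsupported Content-Range: {value!r}")
    first, last, total = (int(group) for group in match.groups())
    return first, last, total

first, last, total = parse_content_range("bytes 227893-3193089/3193090")
content_length = 2965197   # Content-Length of the 206 response
full_size = 3193090        # Content-Length of the original 200 response

# The resumed part must start at the requested offset and cover the rest of the file.
print(first == 227893)
print(last - first + 1 == content_length)
print(first + content_length == full_size == total)
```

All three checks hold, so the server resumed exactly where the client asked and the partial response covers the rest of the file.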


The file sizes match up and the download begins once more. The client downloads another chunk of the file and then the connection stalls again. Again the connection is closed and the thread dies:


...
20-Oct-2016 01:21:34 [---] [http_xfer] [ID#2579] HTTP: wrote 1380 bytes
20-Oct-2016 01:26:39 [GPUGRID] [http] [ID#2579] Info: Operation too slow. Less than 10 bytes/sec transferred the last 300 seconds
20-Oct-2016 01:26:39 [GPUGRID] [http] [ID#2579] Info: Closing connection 2134
20-Oct-2016 01:26:39 [GPUGRID] [http] HTTP error: Timeout was reached
20-Oct-2016 01:26:39 [GPUGRID] Temporarily failed download of e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file: transient HTTP error
20-Oct-2016 01:26:39 [GPUGRID] Backing off 00:06:28 on download of e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file


After about 6 minutes, the client retries:


20-Oct-2016 01:33:07 [GPUGRID] [http] HTTP_OP::init_get(): http://www.gpugrid.org/PS3GRID/download/1f4/e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file
...
20-Oct-2016 01:33:07 [GPUGRID] Started download of e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file
20-Oct-2016 01:33:08 [GPUGRID] [http] [ID#2581] Info: Connection 2137 seems to be dead!
20-Oct-2016 01:33:08 [GPUGRID] [http] [ID#2581] Info: Closing connection 2137
...
20-Oct-2016 01:33:08 [GPUGRID] [http] [ID#2581] Info: Connected to www.gpugrid.org (84.89.134.145) port 80 (#2138)
20-Oct-2016 01:33:08 [GPUGRID] [http] [ID#2581] Sent header to server: GET /PS3GRID/download/1f4/e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file HTTP/1.1
...
20-Oct-2016 01:33:08 [GPUGRID] [http] [ID#2581] Received header from server: HTTP/1.1 206 Partial Content
...
20-Oct-2016 01:33:08 [---] [http_xfer] [ID#2581] HTTP: wrote 995 bytes
...


Finally, the download succeeds:


20-Oct-2016 01:33:26 [---] [http_xfer] [ID#2581] HTTP: wrote 179 bytes
20-Oct-2016 01:33:26 [GPUGRID] [http] [ID#2581] Info: Connection #2138 to host www.gpugrid.org left intact
20-Oct-2016 01:33:27 [GPUGRID] Finished download of e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file


This is not an error at the transport layer or lower. Such errors are either corrected automatically by the mechanism implementing the relevant layer (e.g. a lost packet is retransmitted automatically by TCP), or they immediately raise an error to the application layer above (the BOINC client in our case). This is caused by some mechanism working at the application layer, through which our connections are routed and which intervenes under certain conditions.

I checked the lifetimes of the connections to see whether they stall some fixed time after they are opened, but lifetime doesn't seem to matter: the second download attempt above stalled a few seconds after the connection had been opened.

I assert that GPUGRID has a mechanism, like a reverse proxy, which filters HTTP connections and selectively pauses them (but does not force-close them!) under some criteria, I believe when a certain bandwidth is reached.

Can someone from the project please check with the network people and see if this is the case?

skgiven
Volunteer moderator
Volunteer tester

Message 44830 - Posted: 24 Oct 2016, 14:34:34 UTC - in response to Message 44790.  
Last modified: 24 Oct 2016, 14:58:43 UTC

Maybe this is down to data cost management or network limitations but if not I guess the apparent network issues could stem from a disk issue (simply can't read from one of the drives/arrays fast enough - say when lots of users are downloading simultaneously). If that's the situation then the only real solution is faster disks/arrays, if it's really needed.

Cache-Control: max-age=300
That tells us the server's Hyper Text Transfer Protocol time-out is 5min.
Perhaps that should be reduced server side, say to 120?

Maybe reducing the number of simultaneous connections would help, but it might just spread the problem.

caffeineyellow5

Message 44831 - Posted: 25 Oct 2016, 0:08:10 UTC - in response to Message 44830.  

It also may be a consideration that this is on a university campus. When I worked for my last company, we worked with municipalities and schools to get them streaming television channels online and also allow online access to transfer MPEG-2 files into the servers from around the campus and off campus. Many times the university television staff would be at constant odds with the IT department because many departments wanted the bandwidth and there was only so much to go around. The IT department would act like they were working with us and the department to improve speed or cut down on interruptions, but then we would catch them by doing ongoing pings between us and the station computers and giving the data to IT and they would deny for a while and then say, "Oh yeah, that limiter parameter! We forgot about that! We'll 'loosen' that for you to get a better stream." Then we would still get calls from them asking why "our stream" was cutting out on people and always traced it back to IT giving bandwidth to other departments and putting limiters on the bandwidth that would make the signal intermittent. So maybe the department needs to battle for a more steady bandwidth, even if they have to trade speed for stability. If everybody was a few K slower up and downloading but the signal never broke, maybe we could live with that easier?

Beyond

Message 44833 - Posted: 25 Oct 2016, 14:10:50 UTC - in response to Message 44830.  

Maybe this is down to data cost management or network limitations but if not I guess the apparent network issues could stem from a disk issue (simply can't read from one of the drives/arrays fast enough - say when lots of users are downloading simultaneously). If that's the situation then the only real solution is faster disks/arrays, if it's really needed.

Cache-Control: max-age=300
That tells us the server's Hyper Text Transfer Protocol time-out is 5min.
Perhaps that should be reduced server side, say to 120?

Maybe reducing the number of simultaneous connections would help, but it might just spread the problem.

I don't think it's a traffic issue as it stalls here on every WU download, often several times before the download is complete. Doesn't matter what time of day. Again, this never happens on any other download of any kind. Only GPUGrid.

Jim1348

Message 44835 - Posted: 25 Oct 2016, 17:43:10 UTC
Last modified: 25 Oct 2016, 17:47:07 UTC

FWIW, I see the same thing on almost every download. But I have two GTX 960s on the same machine, which I just started up again. One of them downloaded all the files, while the second one got stuck as usual. It is always the longest file (or maybe the second-longest), and I concluded some time ago that it must be a problem with the server rather than the network. It seems to pause the long ones to give preference to the shorter ones, and then can't start up again.

After it times out at least once (after 5 minutes), I can manually restart the download OK. Or else I can just leave it alone, and it completes on the second or third try. So I lose maybe 10 minutes, and I don't worry about it.

Stefan
Project administrator
Project developer
Project tester
Project scientist

Message 44837 - Posted: 25 Oct 2016, 18:10:35 UTC

Is there something we need to take a look at or is it an individual issue? Can someone give me a tl;dr?

Jim1348

Message 44838 - Posted: 25 Oct 2016, 18:47:40 UTC - in response to Message 44837.  

Is there something we need to take a look at or is it an individual issue? Can someone give me a tl;dr?

All I can say is that it is reproducible, and so I don't see how it can be a network issue, unless it is a router or switch on your own network. It could be some sort of traffic-shaping that a router might do; I don't really know that it is a server per se, but it is not at all random. About the only time I don't see it is when re-attaching to the project after an absence, though I have not done rigorous tests on that.

Jim1348

Message 44846 - Posted: 26 Oct 2016, 0:52:09 UTC - in response to Message 44838.  

I allowed both GTX 960s to run dry, then requested a download and got a SDOERR-CASP20M and a SDOERR-CASP1XX. Both downloaded without a pause. So maybe allowing the connection to go idle for a while helps?

Retvari Zoltan

Message 44847 - Posted: 26 Oct 2016, 1:28:08 UTC - in response to Message 44846.  

I allowed both GTX 960s to run dry, then requested a download and got a SDOERR-CASP20M and a SDOERR-CASP1XX. Both downloaded without a pause. So maybe allowing the connection to go idle for a while helps?

I think that's not the right reason. Right now the network traffic at GPUGrid is very low, because there's plenty of work available in both queues, so there are no constant unfulfilled work requests. However, network statistics from calm and disturbed periods could prove or disprove it.

Jim1348

Message 44848 - Posted: 26 Oct 2016, 2:18:59 UTC - in response to Message 44847.  

That may be so, but I wonder if it is related to the fact that I run GPUGrid with a zero resource share? As one work unit ends and starts to upload, a new work unit starts to download. That is when I see the pauses, sometimes both on the upload and download.

So I wonder whether other people who are having the problem use a zero resource share also? If not, the pauses should not matter even if they occur, since the downloads for the next work unit will normally occur long before the current work unit is finished, and any pauses will be hidden.

Beyond

Message 44850 - Posted: 26 Oct 2016, 2:54:30 UTC - in response to Message 44848.  

So I wonder whether other people who are having the problem use a zero resource share also?

I'm using a high resource share and almost always have stalls/pauses.

Jim1348

Message 44852 - Posted: 26 Oct 2016, 8:36:09 UTC - in response to Message 44850.  

I just started another download three hours after the last upload finished, and it got stuck again on the longest file:
e15s9_e13s9p0f80-SDOERR_CASP11_crystal_ss_20ns_ntl9_1-0-pdb_file
(897.12 K file size)

All the shorter files downloaded quickly. So it does not seem to depend on uploading and downloading at the same time. And as usual, I was able to restart it after the 5-minute timeout, and it finished the download OK. So again it seems to be a server problem or something related thereto. I don't see how a transmission problem could distinguish so reliably between files based on their size. And I have a good cable modem connection, at 25Mbps/4Mbps, which I usually exceed in tests.


Stefan
Project administrator
Project developer
Project tester
Project scientist

Message 44856 - Posted: 26 Oct 2016, 13:06:03 UTC
Last modified: 26 Oct 2016, 13:11:13 UTC

Hm this sounds suspiciously familiar. We are having issues with another webservice of ours getting stuck at loading from time to time these days. The two could be related if the network of the university is having problems. I will report this to our guys just in case.

caffeineyellow5

Message 44857 - Posted: 26 Oct 2016, 14:05:32 UTC - in response to Message 44856.  

The Computing Preferences tab of the BOINC Options list has upload and download rate limiting. On some of my systems I set this to less than half of what the connection can push, and I have seen these user-side rate-limited connections pause and time out less often, if at all. It may be that the university is limiting the bandwidth and one of the triggers is a noticeable spike for a single connection.

©2026 Universitat Pompeu Fabra