Message boards :
Server and website :
SOS-Downloads stuck
| Author | Message |
|---|---|
|
Joined: 5 May 13 Posts: 187 Credit: 349,254,454 RAC: 0
|
"Same here, but only downloads stall for me, and again: only for GPUGrid."

After checking the logs a little closer, I concur: only downloads exhibit this symptom, not uploads. For example:

18-Oct-2016 23:54:49 [GPUGRID] Requesting new tasks for NVIDIA GPU
18-Oct-2016 23:54:52 [GPUGRID] Scheduler request completed: got 1 new tasks
18-Oct-2016 23:54:54 [GPUGRID] Started download of ...
...
19-Oct-2016 00:00:04 [GPUGRID] Temporarily failed download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-coor_file: transient HTTP error
19-Oct-2016 00:00:04 [GPUGRID] Backing off 00:03:05 on download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-coor_file
19-Oct-2016 00:00:04 [GPUGRID] Temporarily failed download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-pdb_file: transient HTTP error
19-Oct-2016 00:00:04 [GPUGRID] Backing off 00:02:47 on download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-pdb_file
19-Oct-2016 00:00:04 [GPUGRID] Started download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-psf_file
19-Oct-2016 00:00:04 [GPUGRID] Started download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-par_file
...

The downloads of these two files kept failing and retrying; they took about 10 minutes to complete.
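For anyone who wants to total up the back-off delays in their own event log, here is a minimal sketch. It assumes the log lines have the same "Backing off HH:MM:SS on download of <file>" shape as the excerpt in this post; the function name `backoffs` and the two-line sample are illustrative only.

```python
import re
from datetime import timedelta

# Matches event-log lines of the form
# "... Backing off HH:MM:SS on download of <file>"
BACKOFF_RE = re.compile(r"Backing off (\d+):(\d{2}):(\d{2}) on download of (\S+)")


def backoffs(log_text):
    """Return {file_name: total back-off time} for every back-off line found."""
    totals = {}
    for hours, minutes, seconds, name in BACKOFF_RE.findall(log_text):
        delay = timedelta(hours=int(hours), minutes=int(minutes), seconds=int(seconds))
        totals[name] = totals.get(name, timedelta()) + delay
    return totals


# Two lines copied from the log excerpt in this post.
sample = """\
19-Oct-2016 00:00:04 [GPUGRID] Backing off 00:03:05 on download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-coor_file
19-Oct-2016 00:00:04 [GPUGRID] Backing off 00:02:47 on download of e22s9_e15s1p0f97-GERARD_ENDOPEP_frag53P_1-0-pdb_file
"""

for name, delay in backoffs(sample).items():
    print(name, int(delay.total_seconds()), "s")
```

Running it over a whole log would show which files accumulate the longest waits, which may help compare experiences between hosts.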
|
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
"It seems that everyone (including me) has this happening:"

I assume you refer to #18: It's quite normal that some routers don't reply to requests which come from random computers on the internet. I hoped to get some clues, but we're still just guessing at the problem. To investigate this issue, the network admins at the campus should do some packet-level network traffic analysis, and then decide whether to take countermeasures locally or to contact other ISPs for a solution. But frankly I think this issue doesn't have that much impact on the project's throughput. I don't know how many sites are hosted on this server (besides ps3grid.net and gpugrid.net). I presume there are a lot of servers hosting a lot of webpages at the campus which are routed through the same devices. Their traffic may interfere with GPUGrid's traffic, but that can't be analysed from outside. |
|
Joined: 5 May 13 Posts: 187 Credit: 349,254,454 RAC: 0
|
I think this is more likely a gateway / firewall / reverse proxy issue. The connections are not closed, they are just stalled. Force-closing a connection raises an error on both sides immediately, and clearly this does not happen with our downloads. I think some network component (hardware or software), through which our connections are routed, intervenes and stalls them. Perhaps some network traffic limiter?
|
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0
|
"transient HTTP error"

Transient HTTP errors can be diagnosed further by setting the <http_debug> event log flag in BOINC. I'll do that next time I'm due to download a new task (if I remember to notice in time), but my expectation is that it will turn out to be simply BOINC's own timeout, which doesn't get us much further forward. But it would confirm that reducing the timeout to 60 seconds is likely to help. |
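A minimal sketch of what switching that flag on might look like, assuming the flag belongs in the `<log_flags>` section of a `cc_config.xml` file in the BOINC data directory. The helper name `debug_config` is made up for illustration; check the BOINC client configuration documentation before relying on the layout.

```python
import xml.etree.ElementTree as ET


def debug_config(flags=("http_debug",)):
    """Build a cc_config.xml document that switches on the given log flags.

    Assumption: log flags live under <cc_config><log_flags>, one element
    per flag, with the text "1" meaning enabled.
    """
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for flag in flags:
        ET.SubElement(log_flags, flag).text = "1"
    return ET.tostring(root, encoding="unicode")


print(debug_config())
```

The client would need to re-read its config files (or be restarted) before the extra `[http]` lines show up in the event log.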
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0
|
Well, I've downloaded and logged a new task, and - wouldn't you believe - it didn't get stuck. But here's a log section the network gurus could have a look at. 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 2736 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes 19-Oct-2016 10:46:45 [GPUGRID] [file_xfer] http op done; retval 0 (Success) 19-Oct-2016 10:46:45 [GPUGRID] [file_xfer] file transfer status 0 (Success) 19-Oct-2016 10:46:45 [GPUGRID] Finished download of e24s6_e1s19p0f341-GERARD_ENDOPEP_frag30P_1-0-vel_file 19-Oct-2016 10:46:45 [GPUGRID] [file_xfer] Throughput 0 bytes/sec 19-Oct-2016 10:46:45 [GPUGRID] [http] HTTP_OP::init_get(): http://www.gpugrid.org/PS3GRID/download/b9/e24s6_e1s19p0f341-GERARD_ENDOPEP_frag30P_1-0-idx_file 19-Oct-2016 10:46:45 [GPUGRID] [http] HTTP_OP::libcurl_exec(): ca-bundle 'D:\BOINC\ca-bundle.crt' 19-Oct-2016 10:46:45 [GPUGRID] [http] HTTP_OP::libcurl_exec(): ca-bundle set 19-Oct-2016 10:46:45 [GPUGRID] Started download of e24s6_e1s19p0f341-GERARD_ENDOPEP_frag30P_1-0-idx_file 19-Oct-2016 10:46:45 [GPUGRID] [file_xfer] URL: http://www.gpugrid.org/PS3GRID/download/b9/e24s6_e1s19p0f341-GERARD_ENDOPEP_frag30P_1-0-idx_file 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes 19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Info: Found bundle 
for host www.gpugrid.org: 0x40b89e0 [can pipeline] 19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Info: Re-using existing connection! (#1191) with host www.gpugrid.org 19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Info: Connected to www.gpugrid.org (84.89.134.145) port 80 (#1191) 19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: GET /PS3GRID/download/b9/e24s6_e1s19p0f341-GERARD_ENDOPEP_frag30P_1-0-idx_file HTTP/1.1 19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: Host: www.gpugrid.org 19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: User-Agent: BOINC client (windows_x86_64 7.7.0) 19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: Accept: */* 19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: Accept-Encoding: deflate, gzip 19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: Content-Type: application/x-www-form-urlencoded 19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: Accept-Language: en_GB 19-Oct-2016 10:46:45 [GPUGRID] [http] [ID#1273] Sent header to server: 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes 19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 12696 bytes 
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes

Most of the time the download jogged along writing 1368 bytes at a time: I'm interpreting that as individual packets being received in the right order, and being sent to the disk-writing queue immediately. "wrote 2736 bytes" appears a lot of times too - probably two packets arriving in reverse order, and both needing to be processed before being written.

But when a new file was being requested, the writes increased to 16384 bytes, and stayed that way for some time. That suggests to me that something in one or other system - server or client - is having problems walking and chewing gum at the same time. Since this is the only project where it happens, I'd suggest that possibly the server is the one on the verge of being overloaded.

Connection [ID#1271] was downloading: e24s6_e1s19p0f341-GERARD_ENDOPEP_frag30P_1-0-coor_file - 924 KB with throughput 730677 bytes/sec (over 500 packets/sec, if my analysis is right).

That's going to be really hard to diagnose from outside the lab, and even inside it without specialist equipment and skills. But one thing comes to mind - restricting BOINC to one file being transferred at a time might ease the pressure caused by that hiccup in the middle. |
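The write-size tally and the packets-per-second estimate above can be reproduced with a short script. This is only a sketch: it assumes the "HTTP: wrote N bytes" wording seen in this log and takes 1368 bytes as the per-packet payload, as the post does. The function names and the four-line sample are illustrative.

```python
import re
from collections import Counter

WRITE_RE = re.compile(r"HTTP: wrote (\d+) bytes")


def write_sizes(log_text):
    """Count how often each write size appears in an [http_xfer] log excerpt."""
    return Counter(int(size) for size in WRITE_RE.findall(log_text))


def packets_per_second(throughput_bytes_per_sec, payload_bytes=1368):
    """Estimate packets per second, assuming every packet carries payload_bytes."""
    return throughput_bytes_per_sec / payload_bytes


# A few representative lines in the format of the log above.
sample = """\
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 1368 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 2736 bytes
19-Oct-2016 10:46:45 [---] [http_xfer] [ID#1271] HTTP: wrote 16384 bytes
"""

print(write_sizes(sample))
print(round(packets_per_second(730677)))  # 730677 / 1368 is about 534
```

With the reported throughput of 730677 bytes/sec the estimate comes out at roughly 534 packets per second, consistent with "over 500". Note also that 2736 is exactly 2 x 1368, which fits the two-packets-at-once reading.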
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
"But it would confirm that reducing the timeout to 60 seconds is likely to help."

It does help. I'll post again what made these GPUGrid downloads acceptable for me:

<http_transfer_timeout>60</http_transfer_timeout>

That helps, but to make it more acceptable I also have to start BOINC from the command line with this argument:

--pers_retry_delay_max 60

It still starts and stops but retries much more quickly. Downloads went from sometimes taking hours to now around 7-8 minutes. We shouldn't have to jump through these hoops but at least these workarounds help (a lot). |
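If you already have a `cc_config.xml`, the timeout can be merged in without disturbing the rest of the file. The sketch below assumes the setting belongs inside the `<options>` section; the helper `set_option` and the `existing` sample document are hypothetical and only illustrate the merge.

```python
import xml.etree.ElementTree as ET


def set_option(config_xml, name, value):
    """Return config_xml with <options><name>value</name></options> set.

    Assumption: client options live under <cc_config><options>.
    Existing sections and settings are left untouched.
    """
    root = ET.fromstring(config_xml)
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")
    element = options.find(name)
    if element is None:
        element = ET.SubElement(options, name)
    element.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical existing file that only enables a log flag.
existing = "<cc_config><log_flags><http_debug>1</http_debug></log_flags></cc_config>"
updated = set_option(existing, "http_transfer_timeout", 60)
print(updated)
```

The `--pers_retry_delay_max 60` argument is a separate command-line setting and is not covered by this sketch.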
|
Joined: 5 May 13 Posts: 187 Credit: 349,254,454 RAC: 0
|
OK, I was able to capture a DEBUG-level section of the log with a failed download. It confirms (of course) the experience we have: the file download begins, a part of the file is downloaded, then the download stalls. Here's how the download begins. Notice the connection #2131 (already open from a previous download / scheduler request), the thread performing the download ID#2558 and the server-reported file size of 3193090 bytes:

20-Oct-2016 01:14:18 [GPUGRID] [http] HTTP_OP::init_get(): http://www.gpugrid.org/PS3GRID/download/1f4/e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file
20-Oct-2016 01:14:18 [GPUGRID] [http] HTTP_OP::libcurl_exec(): ca-bundle 'C:\Program Files\BOINC\ca-bundle.crt'
20-Oct-2016 01:14:18 [GPUGRID] [http] HTTP_OP::libcurl_exec(): ca-bundle set
20-Oct-2016 01:14:18 [GPUGRID] Started download of e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file
20-Oct-2016 01:14:18 [---] [http_xfer] [ID#2555] HTTP: wrote 4140 bytes
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Info: Found bundle for host www.gpugrid.org: 0x3f09850
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Info: Re-using existing connection! (#2131) with host www.gpugrid.org
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Info: Connected to www.gpugrid.org (84.89.134.145) port 80 (#2131)
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Sent header to server: GET /PS3GRID/download/1f4/e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file HTTP/1.1
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Sent header to server: User-Agent: BOINC client (windows_x86_64 7.6.9)
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Sent header to server: Host: www.gpugrid.org
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Sent header to server: Accept: */*
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Sent header to server: Accept-Encoding: deflate, gzip
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Sent header to server: Content-Type: application/x-www-form-urlencoded
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Sent header to server: Accept-Language: en_US
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Sent header to server:
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: HTTP/1.1 200 OK
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: Date: Wed, 19 Oct 2016 22:09:48 GMT
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: Server: Apache/2.2.3 (CentOS)
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: Last-Modified: Mon, 03 Oct 2016 08:52:19 GMT
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: ETag: "6b8c03c-30b902-53df20fbd56c0"
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: Accept-Ranges: bytes
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: Content-Length: 3193090
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: Cache-Control: max-age=300
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: Expires: Wed, 19 Oct 2016 22:14:48 GMT
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server: Content-Type: text/plain; charset=UTF-8
20-Oct-2016 01:14:18 [GPUGRID] [http] [ID#2558] Received header from server:
20-Oct-2016 01:14:18 [---] [http_xfer] [ID#2558] HTTP: wrote 1053 bytes
...
20-Oct-2016 01:14:19 [---] [http_xfer] [ID#2558] HTTP: wrote 1380 bytes

At this point, the client has downloaded a part of the file and the connection stalls. Then, after about 5 minutes, the client gives up. It closes the connection and the thread dies (it does not appear in the log again):

20-Oct-2016 01:19:25 [GPUGRID] [http] [ID#2558] Info: Operation too slow. Less than 10 bytes/sec transferred the last 300 seconds
20-Oct-2016 01:19:25 [GPUGRID] [http] [ID#2558] Info: Closing connection 2131
20-Oct-2016 01:19:25 [GPUGRID] [http] HTTP error: Timeout was reached
20-Oct-2016 01:19:26 [GPUGRID] Temporarily failed download of e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file: transient HTTP error
20-Oct-2016 01:19:26 [GPUGRID] Backing off 00:02:04 on download of e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file

About 2 minutes later, it gives it another try. Notice the new connection #2134 and the new thread ID#2579. Also notice the Range header in the request, asking the server to start sending from byte offset 227893 instead of the beginning, and the 206 Partial Content status code, the Content-Length and the Content-Range headers in the server's response:

20-Oct-2016 01:21:31 [GPUGRID] [http] HTTP_OP::init_get(): http://www.gpugrid.org/PS3GRID/download/1f4/e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file
20-Oct-2016 01:21:31 [GPUGRID] [http] HTTP_OP::libcurl_exec(): ca-bundle 'C:\Program Files\BOINC\ca-bundle.crt'
20-Oct-2016 01:21:31 [GPUGRID] [http] HTTP_OP::libcurl_exec(): ca-bundle set
20-Oct-2016 01:21:31 [GPUGRID] Started download of e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file
...
20-Oct-2016 01:21:32 [GPUGRID] [http] [ID#2579] Info: Connected to www.gpugrid.org (84.89.134.145) port 80 (#2134)
...
20-Oct-2016 01:21:32 [GPUGRID] [http] [ID#2579] Sent header to server: Range: bytes=227893-
...
20-Oct-2016 01:21:32 [GPUGRID] [http] [ID#2579] Received header from server: HTTP/1.1 206 Partial Content
...
20-Oct-2016 01:21:32 [GPUGRID] [http] [ID#2579] Received header from server: Content-Length: 2965197
...
20-Oct-2016 01:21:32 [GPUGRID] [http] [ID#2579] Received header from server: Content-Range: bytes 227893-3193089/3193090
20-Oct-2016 01:21:32 [---] [http_xfer] [ID#2579] HTTP: wrote 995 bytes

The file sizes match up and the download begins once more. The client downloads another chunk of the file and then the connection stalls again. Again the connection is closed and the thread dies:

...
20-Oct-2016 01:21:34 [---] [http_xfer] [ID#2579] HTTP: wrote 1380 bytes
20-Oct-2016 01:26:39 [GPUGRID] [http] [ID#2579] Info: Operation too slow. Less than 10 bytes/sec transferred the last 300 seconds
20-Oct-2016 01:26:39 [GPUGRID] [http] [ID#2579] Info: Closing connection 2134
20-Oct-2016 01:26:39 [GPUGRID] [http] HTTP error: Timeout was reached
20-Oct-2016 01:26:39 [GPUGRID] Temporarily failed download of e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file: transient HTTP error
20-Oct-2016 01:26:39 [GPUGRID] Backing off 00:06:28 on download of e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file

After about 6 minutes, the client retries:

20-Oct-2016 01:33:07 [GPUGRID] [http] HTTP_OP::init_get(): http://www.gpugrid.org/PS3GRID/download/1f4/e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file
...
20-Oct-2016 01:33:07 [GPUGRID] Started download of e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file
20-Oct-2016 01:33:08 [GPUGRID] [http] [ID#2581] Info: Connection 2137 seems to be dead!
20-Oct-2016 01:33:08 [GPUGRID] [http] [ID#2581] Info: Closing connection 2137
...
20-Oct-2016 01:33:08 [GPUGRID] [http] [ID#2581] Info: Connected to www.gpugrid.org (84.89.134.145) port 80 (#2138)
20-Oct-2016 01:33:08 [GPUGRID] [http] [ID#2581] Sent header to server: GET /PS3GRID/download/1f4/e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file HTTP/1.1
...
20-Oct-2016 01:33:08 [GPUGRID] [http] [ID#2581] Received header from server: HTTP/1.1 206 Partial Content
...
20-Oct-2016 01:33:08 [---] [http_xfer] [ID#2581] HTTP: wrote 995 bytes
...

Finally, the download succeeds:

20-Oct-2016 01:33:26 [---] [http_xfer] [ID#2581] HTTP: wrote 179 bytes
20-Oct-2016 01:33:26 [GPUGRID] [http] [ID#2581] Info: Connection #2138 to host www.gpugrid.org left intact
20-Oct-2016 01:33:27 [GPUGRID] Finished download of e26s1_e8s9p0f59-GERARD_ENDOPEP_frag30P_1-0-pdb_file

This is not an error at the transport layer or lower. Such errors are either corrected automatically by the mechanism implementing the relevant layer (e.g. TCP retransmits a lost packet on its own), or they immediately raise an error to the application layer above (the BOINC client in our case). This is caused by some mechanism working at the application layer, through which our connections are routed and which intervenes under certain conditions. I checked the lifetimes of the connections, to see if they are stalled some time after they are opened, but lifetime doesn't seem to matter: the second download attempt above was stalled a few seconds after the connection had been opened. I assert that GPUGRID has a mechanism, like a reverse proxy, which filters HTTP connections and selectively pauses them (but does not force-close them!) under some criteria, I believe when a certain bandwidth is reached. Can someone from the project please check with the network people and see if this is the case?
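The "file sizes match up" check on the resumed download can be written out explicitly. This is a small sketch using only the numbers from the log above; the function name `check_partial_response` is made up for illustration and is not part of BOINC or libcurl.

```python
import re

# Content-Range for a byte range looks like "bytes <first>-<last>/<total>".
CONTENT_RANGE_RE = re.compile(r"bytes (\d+)-(\d+)/(\d+)")


def check_partial_response(requested_offset, content_length, content_range):
    """Check that a 206 response resumes exactly where the client asked.

    The response is consistent when it starts at the requested offset,
    runs to the last byte of the file, and Content-Length equals the
    number of bytes in that range (last - first + 1).
    """
    match = CONTENT_RANGE_RE.fullmatch(content_range.strip())
    if match is None:
        raise ValueError(f"unexpected Content-Range value: {content_range!r}")
    first, last, total = map(int, match.groups())
    return (
        first == requested_offset
        and last == total - 1
        and content_length == last - first + 1
    )


# Values from the second attempt: "Range: bytes=227893-",
# "Content-Length: 2965197", "Content-Range: bytes 227893-3193089/3193090".
print(check_partial_response(227893, 2965197, "bytes 227893-3193089/3193090"))
```

Here 3193090 - 227893 = 2965197, so the server is offering exactly the remainder of the 3193090-byte file, and the resume request itself was honoured correctly.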
|
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
Maybe this is down to data cost management or network limitations but if not I guess the apparent network issues could stem from a disk issue (simply can't read from one of the drives/arrays fast enough - say when lots of users are downloading simultaneously). If that's the situation then the only real solution is faster disks/arrays, if it's really needed.

"Cache-Control: max-age=300"

That header only tells caches to treat the response as fresh for 300 seconds; the 5-minute time-out seen in the log is the client's own ("Less than 10 bytes/sec transferred the last 300 seconds"). If the server has a comparable time-out, perhaps that should be reduced server side, say to 120? Maybe reducing the number of simultaneous connections would help, but it might just spread the problem. |
caffeineyellow5 Joined: 30 Jul 14 Posts: 225 Credit: 2,658,976,345 RAC: 0
|
It also may be a consideration that this is on a university campus. When I worked for my last company, we worked with municipalities and schools to get them streaming television channels online and also allow online access to transfer MPEG-2 files into the servers from around the campus and off campus. Many times the university television staff would be at constant odds with the IT department because many departments wanted the bandwidth and there was only so much to go around. The IT department would act like they were working with us and the department to improve speed or cut down on interruptions, but then we would catch them by running ongoing pings between us and the station computers and giving the data to IT. They would deny it for a while and then say, "Oh yeah, that limiter parameter! We forgot about that! We'll 'loosen' that for you to get a better stream." Then we would still get calls from them asking why "our stream" was cutting out on people, and we always traced it back to IT giving bandwidth to other departments and putting limiters on the bandwidth that would make the signal intermittent. So maybe the department needs to battle for a more steady bandwidth, even if they have to trade speed for stability. If everybody was a few K slower uploading and downloading but the signal never broke, maybe we could live with that more easily? |
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
"Maybe this is down to data cost management or network limitations but if not I guess the apparent network issues could stem from a disk issue (simply can't read from one of the drives/arrays fast enough - say when lots of users are downloading simultaneously). If that's the situation then the only real solution is faster disks/arrays, if it's really needed."

I don't think it's a traffic issue, as it stalls here on every WU download, often several times before the download is complete. It doesn't matter what time of day. Again, this never happens on any other download of any kind. Only GPUGrid. |
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
FWIW, I see the same thing on almost every download. But I have two GTX 960s on the same machine, which I just started up again. One of them downloaded all the files, while the second one got stuck as usual. It is always the longest file (or maybe the second-longest), and I concluded some time ago that it must be a problem with the server rather than the network. It seems to pause the long ones to give preference to the shorter ones, and then can't start up again. After it times out at least once (after 5 minutes), I can manually restart the download OK. Or else I can just leave it alone, and it completes on the second or third try. So I lose maybe 10 minutes, and I don't worry about it. |
|
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 |
Is there something we need to take a look at or is it an individual issue? Can someone give me a tl;dr? |
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
"Is there something we need to take a look at or is it an individual issue? Can someone give me a tl;dr?"

All I can say is that it is reproducible, and so I don't see how it can be a network issue, unless it is a router or switch on your own network. It could be some sort of traffic-shaping that a router might do; I don't really know that it is a server per se, but it is not at all random. About the only time I don't see it is when re-attaching to the project after an absence, though I have not done rigorous tests on that. |
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
I allowed both GTX 960s to run dry, then requested a download and got a SDOERR-CASP20M and a SDOERR-CASP1XX. Both downloaded without a pause. So maybe allowing the connection to go idle for a while helps? |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
"I allowed both GTX 960s to run dry, then requested a download and got a SDOERR-CASP20M and a SDOERR-CASP1XX. Both downloaded without a pause. So maybe allowing the connection to go idle for a while helps?"

I don't think that's the reason. Right now the network traffic at GPUGrid is very low, because there's plenty of work available in both queues, so there are no constant unfulfilled work requests. However, network statistics from calm and disturbed periods could prove or disprove it. |
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
That may be so, but I wonder if it is related to the fact that I run GPUGrid with a zero resource share? As one work unit ends and starts to upload, a new work unit starts to download. That is when I see the pauses, sometimes both on the upload and download. So I wonder whether other people who are having the problem use a zero resource share also? If not, the pauses should not matter even if they occur, since the downloads for the next work unit will normally occur long before the current work unit is finished, and any pauses will be hidden. |
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
"So I wonder whether other people who are having the problem use a zero resource share also?"

I'm using a high resource share and almost always have stalls/pauses. |
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
I just started another download three hours after the last upload finished, and it got stuck again on the longest file: e15s9_e13s9p0f80-SDOERR_CASP11_crystal_ss_20ns_ntl9_1-0-pdb_file (897.12 K file size). All the shorter files downloaded quickly. So it does not seem to depend on uploading and downloading at the same time. And as usual, I was able to restart it after the 5-minute timeout, and it finished the download OK. So again it seems to be a server problem or something related to it. I don't see how a transmission problem could distinguish so reliably between files based on their size. And I have a good cable modem connection, at 25Mbps/4Mbps, which I usually exceed in tests. |
|
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0 |
Hm, this sounds suspiciously familiar. We are having issues with another web service of ours getting stuck while loading from time to time these days. The two could be related if the network of the university is having problems. I will report this to our guys just in case. |
caffeineyellow5 Joined: 30 Jul 14 Posts: 225 Credit: 2,658,976,345 RAC: 0
|
The Computing Preferences tab of the BOINC Options list has upload and download rate limiting. On some of my systems I set this to less than half of what they can push when opening the connection, and I have seen these user-side limited connections pause and time out less, if at all. It may be that the university is limiting the bandwidth and one of the triggers is a noticeable spike for a single connection. |
©2026 Universitat Pompeu Fabra