SOS-Downloads stuck

Author	Message
nanoprobe Send message Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0 Level Scientific publications	Message 44717 - Posted: 15 Oct 2016, 21:25:43 UTC After not having run this project for months I return to find the same problem that was here when I left. Absolutely unconscionable. Goodbye again. ID: 44717 · Rating: 0 · rate: / Reply Quote

caffeineyellow5 Send message Joined: 30 Jul 14 Posts: 225 Credit: 2,658,976,345 RAC: 0 Level Scientific publications	Message 44718 - Posted: 15 Oct 2016, 22:46:49 UTC What is the problem and why are you having it? Maybe we can help you figure it out. The project's uploads and downloads are working just fine and we are all getting tons of tasks at the moment... some every few minutes, with these SDOERR_CASP tasks being handed out that are literally running in less than 5-10 minutes on many hosts. One thing you could do, if frequent interruptions over the internet are pausing your uploads and downloads, is change the line in your BOINC cc_config from the default <http_transfer_timeout>3000</http_transfer_timeout> to <http_transfer_timeout>60</http_transfer_timeout> That way, if something does interrupt the transfer, it will retry the connection after 60 seconds and not wait 3000 seconds (50 minutes). As far as I can tell, all the servers are running fine (from the Server Status page), all my uploads and downloads are running smoothly, and nobody else has complained about this issue in weeks, not since a server-full issue one weekend. 1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!" Ephesians 6:18-20, please ;-) http://tbc-pa.org ID: 44718 · Rating: 0 · rate: / Reply Quote

nanoprobe Send message Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0 Level Scientific publications	Message 44721 - Posted: 16 Oct 2016, 0:47:19 UTC The problem is the downloads start and then stop and after whatever time elapses it tries again and it keeps doing the same thing over and over until all the files are received sometimes taking hours. This happens on all 4 of my machines with Nvidia cards and this project is the ONLY one that I run that gives me this issue. There is no <http_transfer_timeout>3000</http_transfer_timeout> tag in my cc_config file but I just added one. We'll see what happens but I'm not very optimistic. ID: 44721 · Rating: 0 · rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0 Level Scientific publications	Message 44723 - Posted: 16 Oct 2016, 6:36:21 UTC - in response to Message 44721. The problem is the downloads start and then stop and after whatever time elapses it tries again and it keeps doing the same thing over and over until all the files are received sometimes taking hours. This happens on all 4 of my machines with Nvidia cards and this project is the ONLY one that I run that gives me this issue. There is no <http_transfer_timeout>3000</http_transfer_timeout> tag in my cc_config file but I just added one. We'll see what happens but I'm not very optimistic. Note that the default, as stated in http://boinc.berkeley.edu/wiki/Client_configuration#Options, is actually 300. <http_transfer_timeout>seconds</http_transfer_timeout> Abort HTTP transfers if idle for this many seconds; default 300. ID: 44723 · Rating: 0 · rate: / Reply Quote
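For anyone who wants to try the timeout change discussed above without hand-editing XML, here is a minimal sketch in Python. It assumes cc_config.xml uses the usual layout of a `<cc_config>` root with an `<options>` child, as described on the client configuration wiki page linked above. The function name `set_option` is made up for this example, and the client still has to re-read its config files (or be restarted) before the new value takes effect.

```python
import xml.etree.ElementTree as ET


def set_option(config_xml: str, name: str, value: str) -> str:
    """Return config_xml with <options><name>value</name></options> set.

    An empty string is treated as "no cc_config.xml yet".
    """
    if config_xml.strip():
        root = ET.fromstring(config_xml)
    else:
        root = ET.Element("cc_config")
    if root.tag != "cc_config":
        raise ValueError("expected a <cc_config> root element")

    # Create <options> if the file does not have one yet.
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")

    # Overwrite the existing tag, or add it if it is missing.
    elem = options.find(name)
    if elem is None:
        elem = ET.SubElement(options, name)
    elem.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Start from an empty config and set the 60-second value from the thread.
    print(set_option("", "http_transfer_timeout", "60"))
```

Reading the real file, passing its text through `set_option`, and writing it back is left out on purpose, since the location of the BOINC data directory differs between installations.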

Beyond Send message Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 Level Scientific publications	Message 44724 - Posted: 16 Oct 2016, 12:53:15 UTC - in response to Message 44721. Last modified: 16 Oct 2016, 12:54:40 UTC The problem is the downloads start and then stop and after whatever time elapses it tries again and it keeps doing the same thing over and over until all the files are received sometimes taking hours. This happens on all 4 of my machines with Nvidia cards and this project is the ONLY one that I run that gives me this issue. There is no <http_transfer_timeout>3000</http_transfer_timeout> tag in my cc_config file but I just added one. We'll see what happens but I'm not very optimistic. This is the only project with that problem. Have no idea what they have set wrong but we've complained about it a number of times. Anyway, to address this problem I use the switch above: <http_transfer_timeout>60</http_transfer_timeout> That helps, but in order to make it acceptable I also have to start BOINC from the command line and use this argument: --pers_retry_delay_max 60 It still starts and stops but retries much more quickly. Downloads went from sometimes taking hours to now a maximum of 7-8 minutes. We shouldn't have to jump through these hoops but I really don't think there's anyone anymore on the project that knows how to configure the system. There are other easy fixes that we've asked for that never get addressed. ID: 44724 · Rating: 0 · rate: / Reply Quote
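To illustrate why capping the retry delay can turn hours into minutes, here is a rough sketch of a capped, doubling backoff. This is not the BOINC client's actual algorithm: the real client randomises its delays, and its default limits are not stated in this thread, so the 3600-second cap below is only a stand-in for "some large default". The function name `retry_delays` is made up for the example.

```python
def retry_delays(attempts: int, base: float, cap: float) -> list[float]:
    """Delay before each retry: base doubled on every attempt, never above cap."""
    return [min(cap, base * 2 ** n) for n in range(attempts)]


if __name__ == "__main__":
    # Eight stalled attempts with a large (assumed) cap versus a 60 s cap.
    uncapped = retry_delays(8, base=60, cap=3600)
    capped = retry_delays(8, base=60, cap=60)
    print("large cap :", uncapped, "total", sum(uncapped), "s")
    print("60 s cap  :", capped, "total", sum(capped), "s")
```

With a 60-second ceiling every retry waits the same short time, so the total waiting grows linearly with the number of stalls instead of ballooning.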

caffeineyellow5 Send message Joined: 30 Jul 14 Posts: 225 Credit: 2,658,976,345 RAC: 0 Level Scientific publications	Message 44736 - Posted: 16 Oct 2016, 22:47:21 UTC I just don't get these problems, except at times when everyone does because of server failures. I have systems in 2 different locations on 2 different internet providers (Comcast and RCN). ID: 44736 · Rating: 0 · rate: / Reply Quote

Beyond Send message Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 Level Scientific publications	Message 44742 - Posted: 17 Oct 2016, 2:20:53 UTC - in response to Message 44736. Last modified: 17 Oct 2016, 2:22:57 UTC You've complained about it previously as have many others. It happens here on virtually every GPUGrid download (Centurylink). It never happens on any other downloads, BOINC or otherwise. ID: 44742 · Rating: 0 · rate: / Reply Quote

nanoprobe Send message Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0 Level Scientific publications	Message 44751 - Posted: 17 Oct 2016, 16:16:27 UTC Maybe I'm overreacting but there are just too many issues with seemingly simple remedies for me. Since I'm already in the process of scaling back my DC operation it doesn't really matter. With that said, I have 3 SuperMicro dual socket boards with Xeon ES V4 CPUs that will be going up for sale. 2 are 28c/56t and 1 is 36c/72t. Won't go into all the details now. If there's anyone here interested, PM me and we'll discuss things. ID: 44751 · Rating: 0 · rate: / Reply Quote

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0 Level Scientific publications	Message 44756 - Posted: 17 Oct 2016, 20:04:27 UTC While I don't think the staff of GPUGrid could do anything about your HTTP timeout problem, out of curiosity I ask you to run a very basic network diagnostic: If you have a Windows-based PC on the same network as your crunching box, please open a command prompt and type
ping www.gpugrid.net -n 100
You can do it on Linux also, but I'm not familiar with its command syntax (the -n 100 parameter tells the ping command to try 100 times). You'll see a lot of (exactly 100, if everything's going well) messages like:
Reply from 84.89.134.145: bytes=32 time=83ms TTL=49
Then, at the end:
Ping statistics for 84.89.134.145:
Packets: Sent = 100, Received = 100, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 83ms, Maximum = 88ms, Average = 83ms
These are the actual results of my host; I'm curious about your statistics. I expect your packet loss and round trip times to be significantly higher than what I experience. Unfortunately these numbers do not reveal the device which is responsible for your problem, but I'm quite confident that it's closer to your end (most probably it's at your ISP) than to the GPUGrid site (otherwise many more users would have such difficulties). You could also try a traceroute command:
tracert www.gpugrid.net
which gives you a list of the devices between your end and grosso.upf.edu (on which the gpugrid.net project resides). Perhaps this list could help us to figure out what's wrong, especially if it gives you very different results when you run it multiple times. In some cases these errors are simply caused by network congestion (when the ISP has limited bandwidth to certain destinations), but it could depend on the time of the day.
On your end, however, P2P file sharing applications or appliances, or a faulty router/switch, could cause such strange errors (but I'm sure in this case there would be problems with other sites as well). ID: 44756 · Rating: 0 · rate: / Reply Quote
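For comparing results across posters, a small Python sketch like the one below can pull the numbers out of a pasted ping summary. It only understands the Windows wording shown in the post above; the function names are made up for this example. On the Linux question: the usual Linux `ping` takes the count as `-c 100` rather than `-n 100`, and the traceroute command is normally `traceroute` rather than `tracert`.

```python
import platform
import re


def ping_command(host: str, count: int = 100) -> list[str]:
    """Build the ping command line: -n on Windows, -c elsewhere."""
    flag = "-n" if platform.system() == "Windows" else "-c"
    return ["ping", flag, str(count), host]


def parse_windows_ping_summary(text: str):
    """Extract packet counts and round-trip times from Windows ping output.

    Returns None if the summary lines are not found.
    """
    packets = re.search(r"Sent = (\d+), Received = (\d+), Lost = (\d+)", text)
    times = re.search(
        r"Minimum = (\d+)ms, Maximum = (\d+)ms, Average = (\d+)ms", text
    )
    if not packets or not times:
        return None
    sent, received, lost = (int(g) for g in packets.groups())
    rtt_min, rtt_max, rtt_avg = (int(g) for g in times.groups())
    return {
        "sent": sent,
        "received": received,
        "lost": lost,
        "loss_percent": 100.0 * lost / sent if sent else 0.0,
        "min_ms": rtt_min,
        "max_ms": rtt_max,
        "avg_ms": rtt_avg,
    }


if __name__ == "__main__":
    sample = (
        "Ping statistics for 84.89.134.145: "
        "Packets: Sent = 100, Received = 100, Lost = 0 (0% loss), "
        "Approximate round trip times in milli-seconds: "
        "Minimum = 83ms, Maximum = 88ms, Average = 83ms"
    )
    print(parse_windows_ping_summary(sample))
```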

Betting Slip Send message Joined: 5 Jan 09 Posts: 670 Credit: 2,498,095,550 RAC: 0 Level Scientific publications	Message 44757 - Posted: 17 Oct 2016, 21:03:57 UTC - in response to Message 44756. Actually Retvari, I experience the same problems with downloads sticking, not the whole package, just one or two files that stick. My Stats:
Pinging www.gpugrid.net [84.89.134.145] with 32 bytes of data:
Ping statistics for 84.89.134.145:
Packets: Sent = 100, Received = 100, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 58ms, Maximum = 59ms, Average = 58ms
My Tracert:
Tracing route to www.gpugrid.net [84.89.134.145] over a maximum of 30 hops:
1 <1 ms <1 ms <1 ms 192.168.0.1
2 * * * Request timed out.
3 13 ms 14 ms 12 ms be363.pr2.hobir.isp.sky.com [89.200.135.232]
4 12 ms 12 ms 12 ms ae-3.r02.londen03.uk.bb.gin.ntt.net [83.231.221.45]
5 11 ms 11 ms 11 ms ae-3.r24.londen12.uk.bb.gin.ntt.net [129.250.4.23]
6 33 ms 33 ms 33 ms ae-6.r01.mdrdsp03.es.bb.gin.ntt.net [129.250.4.138]
7 36 ms 34 ms 34 ms rediris.baja.espanix.net [193.149.1.26]
8 48 ms 48 ms 48 ms CIEMAT.AE1.cica.rt1.and.red.rediris.es [130.206.245.38]
9 52 ms 52 ms 51 ms CICA.AE1.uv.rt1.val.red.rediris.es [130.206.245.34]
10 58 ms 60 ms 68 ms anella-val1-router.red.rediris.es [130.206.211.70]
11 * * * Request timed out.
12 58 ms 57 ms 57 ms grosso.upf.edu [84.89.134.145]
13 58 ms 57 ms 57 ms grosso.upf.edu [84.89.134.145]
14 58 ms 58 ms 58 ms grosso.upf.edu [84.89.134.145]
Trace complete.
I have also noticed the same thing on a remote host with a different ISP. ID: 44757 · Rating: 0 · rate: / Reply Quote

captainjack Send message Joined: 9 May 13 Posts: 171 Credit: 4,739,796,466 RAC: 1,182 Level Scientific publications	Message 44758 - Posted: 17 Oct 2016, 21:16:10 UTC I am also having issues downloading individual files. It usually takes two or three retries. My trace route says:
Tracing route to www.gpugrid.net [84.89.134.145] over a maximum of 30 hops:
1 <1 ms <1 ms <1 ms dsldevice.attlocal.net [192.168.1.254]
2 21 ms 20 ms 21 ms 99-17-40-3.lightspeed.edmdok.sbcglobal.net [99.17.40.3]
3 * * * Request timed out.
4 * * * Request timed out.
5 21 ms 22 ms 23 ms 12.83.71.89
6 28 ms 30 ms 29 ms ggr3.dlstx.ip.att.net [12.122.139.17]
7 27 ms 27 ms 27 ms 192.205.36.222
8 28 ms 26 ms 28 ms be2764.ccr22.dfw01.atlas.cogentco.com [154.54.47.213]
9 32 ms 32 ms 33 ms be2443.ccr22.iah01.atlas.cogentco.com [154.54.44.229]
10 46 ms 46 ms 45 ms be2690.ccr42.atl01.atlas.cogentco.com [154.54.28.129]
11 54 ms 54 ms 54 ms be2113.ccr42.dca01.atlas.cogentco.com [154.54.24.221]
12 59 ms 59 ms 58 ms be2807.ccr42.jfk02.atlas.cogentco.com [154.54.40.109]
13 129 ms 129 ms 129 ms be2747.ccr42.par01.atlas.cogentco.com [154.54.31.190]
14 143 ms 142 ms 142 ms be2423.ccr22.bio02.atlas.cogentco.com [130.117.50.78]
15 148 ms 147 ms 148 ms be2293.ccr22.mad05.atlas.cogentco.com [130.117.50.26]
16 147 ms 149 ms 147 ms be2853.rcr11.b015537-1.mad05.atlas.cogentco.com [154.54.56.62]
17 150 ms 161 ms 149 ms 149.11.68.2
18 149 ms 148 ms 148 ms CIEMAT.AE2.telmad.rt4.mad.red.rediris.es [130.206.245.2]
19 155 ms 154 ms 159 ms TELMAD.AE4.uv.rt1.val.red.rediris.es [130.206.245.89]
20 163 ms 162 ms 163 ms anella-val1-router.red.rediris.es [130.206.211.70]
21 * * * Request timed out.
22 161 ms 160 ms 160 ms grosso.upf.edu [84.89.134.145]
23 160 ms 159 ms 160 ms grosso.upf.edu [84.89.134.145]
24 161 ms 161 ms 160 ms grosso.upf.edu [84.89.134.145]
Trace complete.
Maybe that will help isolate the issue. ID: 44758 · Rating: 0 · rate: / Reply Quote

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0 Level Scientific publications	Message 44760 - Posted: 17 Oct 2016, 21:29:27 UTC After a quiet period when most downloads completed at the first attempt, in the last few days I've seen a marked increase in download delays - as Betting Slip says, usually just one file dropping to zero speed, while nominally still 'active'. That's coincided with more work being downloaded (and re-downloaded - see Pascal thread): I doubt that's a coincidence. If I notice quickly, I can briefly 'suspend network activity' for BOINC, then go back to 'network activity always', while the download status is still 'active', and hence the underlying TCP/IP connection is still alive (an authenticated route still exists). If I manage that, the download usually completes far faster than a normal download. That leads me to suspect that the majority of the packets have already arrived safely, with just a few gaps where individual packets have dropped out. And, for some reason, the 'resend packet xxxx' messages aren't getting through, or are themselves being dropped by the server. Three years ago, we had a great deal of success at SETI with advising Windows users to enable rfc1323 - but that was to overcome exactly the problem with "large bandwidth*delay" paths described in the RFC. SETI has moved to a better network environment since then. We don't have exactly the same problem here (I've tried the fix, and it made no difference), but I suspect we may need a similar sort of packet-level analysis to identify and alleviate the problem we are observing. ID: 44760 · Rating: 0 · rate: / Reply Quote
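As background on the "large bandwidth*delay" remark: without the window scaling option from RFC 1323, a TCP receive window tops out at 65,535 bytes, so a single connection cannot move more than one window per round trip. The sketch below just does that arithmetic for the round-trip times reported in this thread (58 ms, 83 ms and about 160 ms). It is only an illustration of the RFC's problem; as the post above says, enabling rfc1323 made no difference here, so this is not offered as the cause of the stalls. The function name is made up.

```python
MAX_UNSCALED_WINDOW = 65535  # bytes; largest TCP window without window scaling


def max_throughput_bytes_per_s(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on single-connection throughput: one window per round trip."""
    return window_bytes / (rtt_ms / 1000.0)


if __name__ == "__main__":
    # Average round-trip times taken from the ping/tracert results in this thread.
    for rtt_ms in (58, 83, 160):
        limit = max_throughput_bytes_per_s(MAX_UNSCALED_WINDOW, rtt_ms)
        print(f"RTT {rtt_ms:>3} ms -> at most {limit / 1024:.0f} KiB/s "
              f"({limit * 8 / 1e6:.2f} Mbit/s)")
```

The longer the path, the lower the ceiling, which is why the fix mattered most for hosts far from the server.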

Retvari Zoltan Send message Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0 Level Scientific publications	Message 44762 - Posted: 17 Oct 2016, 23:13:06 UTC My trace route looks very similar after the first couple of hops:
Tracing route to www.gpugrid.net [84.89.134.145]
<1 ms <1 ms 192.168.11.254 [192.168.11.254]
16 ms 16 ms lo1.bsr0-zugliget.net.telekom.hu [145.236.238.178]
16 ms 16 ms 81.183.3.4
16 ms 17 ms 81.183.3.4
16 ms 16 ms 81.183.3.145
23 ms 23 ms 80.157.202.125
22 ms 22 ms 80.150.171.74
28 ms 28 ms be2974.ccr21.muc03.atlas.cogentco.com [154.54.58.5]
34 ms 34 ms be3072.ccr21.zrh01.atlas.cogentco.com [130.117.0.17]
46 ms 45 ms be3080.ccr21.mrs01.atlas.cogentco.com [130.117.49.1]
58 ms 57 ms be2354.ccr21.vlc02.atlas.cogentco.com [130.117.0.150]
61 ms 62 ms be2339.ccr22.mad05.atlas.cogentco.com [130.117.49.81]
62 ms 63 ms be2853.rcr11.b015537-1.mad05.atlas.cogentco.com [154.54.56.62]
62 ms 63 ms 149.11.68.50
74 ms 74 ms CIEMAT.AE1.cica.rt1.and.red.rediris.es [130.206.245.38]
77 ms 77 ms CICA.AE1.uv.rt1.val.red.rediris.es [130.206.245.34]
83 ms 91 ms anella-val1-router.red.rediris.es [130.206.211.70]
* Request timed out.
83 ms 86 ms grosso.upf.edu [84.89.134.145]
83 ms 83 ms grosso.upf.edu [84.89.134.145]
91 ms 84 ms grosso.upf.edu [84.89.134.145]
I'm suspecting that one of my hosts has had a stalled download, and that made it crunch for Einstein@home for a while. But these glitches usually happen to my hosts almost only when new workunits become available after a near-empty period. That's when the ghost workunits appear too. Probably too many hosts are connected / trying to connect to the server during these periods. Perhaps it looks like a DDoS attack to some firewall/router along the way. ID: 44762 · Rating: 0 · rate: / Reply Quote
caffeineyellow5 Send message Joined: 30 Jul 14 Posts: 225 Credit: 2,658,976,345 RAC: 0 Level Scientific publications	Message 44763 - Posted: 18 Oct 2016, 4:12:34 UTC - in response to Message 44762.
Ping statistics for 84.89.134.145:
Packets: Sent = 100, Received = 99, Lost = 1 (1% loss),
Approximate round trip times in milli-seconds:
Minimum = 114ms, Maximum = 123ms, Average = 118ms
C:\Windows\system32>tracert www.gpugrid.net
Tracing route to www.gpugrid.net [84.89.134.145] over a maximum of 30 hops:
1 9 ms 11 ms 12 ms bdl1.rdl-ubr2.trpr-rdl.pa.cable.rcn.net [10.49.128.1]
2 13 ms 12 ms 10 ms bdle25-sub202.aggr1.phdl.pa.rcn.net [207.172.196.209]
3 11 ms 11 ms 11 ms xe-4-1-0.bar2.Philadelphia1.Level3.net [4.78.154.89]
4 * * * Request timed out.
5 16 ms 17 ms 17 ms Comcast-Level3-10G.boston1.Level3.net [4.68.110.90]
6 18 ms 16 ms 12 ms be2060.ccr41.jfk02.atlas.cogentco.com [154.54.31.9]
7 99 ms 96 ms 100 ms be2746.ccr41.par01.atlas.cogentco.com [154.54.29.118]
8 120 ms 121 ms 121 ms be2475.ccr21.bio02.atlas.cogentco.com [130.117.48.181]
9 121 ms 119 ms 117 ms be2235.ccr21.mad05.atlas.cogentco.com [130.117.48.134]
10 108 ms 105 ms 109 ms be2852.rcr11.b015537-1.mad05.atlas.cogentco.com [154.54.36.166]
11 103 ms 106 ms 105 ms 149.11.68.50
12 105 ms 106 ms 107 ms CIEMAT.AE2.telmad.rt4.mad.red.rediris.es [130.206.245.2]
13 112 ms 112 ms 107 ms TELMAD.AE4.uv.rt1.val.red.rediris.es [130.206.245.89]
14 121 ms 123 ms 119 ms anella-val1-router.red.rediris.es [130.206.211.70]
15 * * * Request timed out.
16 115 ms 114 ms 120 ms grosso.upf.edu [84.89.134.145]
17 114 ms 120 ms 119 ms grosso.upf.edu [84.89.134.145]
18 120 ms 114 ms 115 ms grosso.upf.edu [84.89.134.145]
Trace complete.
1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!" Ephesians 6:18-20, please ;-) http://tbc-pa.org ID: 44763 · Rating: 0 · rate: / Reply Quote

Dayle Diamond Send message Joined: 5 Dec 12 Posts: 84 Credit: 1,663,883,415 RAC: 0 Level Scientific publications	Message 44764 - Posted: 18 Oct 2016, 5:25:25 UTC Nanoprobe, check your messages ^^ ID: 44764 · Rating: 0 · rate: / Reply Quote

Vagelis Giannadakis Send message Joined: 5 May 13 Posts: 187 Credit: 349,254,454 RAC: 0 Level Scientific publications	Message 44765 - Posted: 18 Oct 2016, 8:18:46 UTC Download / upload issues have been around for a while now. We have discussed them at sufficient length to come to the conclusion that neither we nor GPUGRID staff have a clue as to their cause. :( For the record, I too have downloads / uploads that stall and succeed only after a number of retries, and all this only for GPUGRID! ID: 44765 · Rating: 0 · rate: / Reply Quote

Beyond Send message Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 Level Scientific publications	Message 44767 - Posted: 18 Oct 2016, 14:54:59 UTC - in response to Message 44765. For the record, I too have downloads / uploads stall, succeed only after a number of retries, and all this only for GPUGRID! Same here, but only downloads stall for me, and again: only for GPUGrid. Ping statistics for 84.89.134.145: Packets: Sent = 100, Received = 100, Lost = 0 (0% loss), Approximate round trip times in milli-seconds: Minimum = 170ms, Maximum = 215ms, Average = 175ms It seems that everyone (including me) has this happening: 17 85 ms 83 ms 91 ms anella-val1-router.red.rediris.es [130.206.211.70] 18 * * * Request timed out. 19 83 ms 83 ms 86 ms grosso.upf.edu [84.89.134.145] Is it the problem? Again, the ONLY project this happens to is GPUGrid and never on any other downloads of any kind. ID: 44767 · Rating: 0 · rate: / Reply Quote
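To check the "everyone has this happening" observation across several pasted traces, a short Python sketch like this can list each hop that timed out together with the last host that answered before it. It assumes Windows tracert output with one hop per line, and the function name is made up. Bear in mind that a silent hop on its own proves little: many routers simply do not answer traceroute probes while still forwarding traffic normally, which fits the fact that grosso.upf.edu is reached right afterwards.

```python
import re

# A hop line starts with the hop number, followed by timings and the host.
HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")


def timed_out_hops(trace_lines):
    """Return (hop number, last responding host before it) for each silent hop."""
    results = []
    last_host = None
    for line in trace_lines:
        match = HOP_LINE.match(line)
        if not match:
            continue  # header, blank line, "Trace complete." and so on
        hop, rest = int(match.group(1)), match.group(2).strip()
        if "Request timed out" in rest:
            results.append((hop, last_host))
            continue
        fields = rest.split()
        # "name [ip]" -> take the name; a bare IP address -> take it as is.
        last_host = fields[-2] if fields[-1].endswith("]") else fields[-1]
    return results


if __name__ == "__main__":
    excerpt = [
        "17 85 ms 83 ms 91 ms anella-val1-router.red.rediris.es [130.206.211.70]",
        "18 * * * Request timed out.",
        "19 83 ms 83 ms 86 ms grosso.upf.edu [84.89.134.145]",
    ]
    print(timed_out_hops(excerpt))
```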

Richard Haselgrove Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0 Level Scientific publications	Message 44768 - Posted: 18 Oct 2016, 15:17:37 UTC - in response to Message 44767. For me, it's only one file (or rarely two) that stalls, out of a dozen or more for a typical task. And it gets partway through the download before stalling. To me, that tells me that the destination address has been found, the route established, and the connection set up - no amount of pinging or tracerting is going to diagnose anything more than that. What we really need (and the project probably doesn't employ) is a network specialist experienced in analysing throughput at the individual packet level - and I don't know any of those, personally. ID: 44768 · Rating: 0 · rate: / Reply Quote

skgiven Volunteer moderator Volunteer tester Send message Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level Scientific publications	Message 44770 - Posted: 18 Oct 2016, 20:12:19 UTC - in response to Message 44768. If there is an internal network issue it's likely something the university needs to sort out, rather than the group. FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help ID: 44770 · Rating: 0 · rate: / Reply Quote

captainjack Send message Joined: 9 May 13 Posts: 171 Credit: 4,739,796,466 RAC: 1,182 Level Scientific publications	Message 44771 - Posted: 18 Oct 2016, 20:34:27 UTC Just processed a number of short tasks. Many of them had issues downloading files. In going back through the event log, all of the interrupts happened with the "*-psf_file" and "-pdb_file" files. More clues maybe? ID: 44771 · Rating: 0 · rate: / Reply Quote
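If others want to check whether the same file types are affected on their hosts, a tally along these lines might help. This Python sketch assumes you have already collected the names of the files whose downloads were interrupted from your own event log; the sample names below are invented purely for illustration, and only the "-psf_file" and "-pdb_file" endings come from the post above.

```python
from collections import Counter


def tally_by_suffix(file_names):
    """Count file names by the text after the last '-' (for example 'psf_file')."""
    return Counter(name.rsplit("-", 1)[-1] for name in file_names)


if __name__ == "__main__":
    # Hypothetical names; replace with the ones from your own event log.
    interrupted = [
        "exampletask1-psf_file",
        "exampletask1-pdb_file",
        "exampletask2-psf_file",
    ]
    for suffix, count in tally_by_suffix(interrupted).most_common():
        print(f"{suffix}: {count}")
```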