SOS-Downloads stuck
| Author | Message |
|---|---|
|
Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0
|
After not having run this project for months, I return to find the same problem that was here when I left. Absolutely unconscionable. Goodbye again. |
caffeineyellow5 Joined: 30 Jul 14 Posts: 225 Credit: 2,658,976,345 RAC: 0
|
What is the problem, and why are you having it? Maybe we can help you figure it out. Project uploads and downloads are working just fine, and we are all getting tons of tasks at the moment, some every few minutes, with these SDOERR_CASP tasks being handed out that literally run in less than 5-10 minutes on many hosts. One thing you could do, if frequent interruptions over the internet are pausing your uploads and downloads, is change the line in your BOINC cc_config from the default <http_transfer_timeout>3000</http_transfer_timeout> to <http_transfer_timeout>60</http_transfer_timeout>. That way, if something does interrupt the transfer, it will retry the connection after 60 seconds rather than waiting 3000 seconds (50 minutes). As far as I can tell, all the servers are running fine (according to the Server Status page), all my uploads and downloads are running smoothly, and nobody else has complained about this issue in weeks, not since a full server caused problems for a weekend. 1 Corinthians 9:16 "For though I preach the gospel, I have nothing to glory of: for necessity is laid upon me; yea, woe is unto me, if I preach not the gospel!" Ephesians 6:18-20, please ;-) http://tbc-pa.org |
|
Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0
|
The problem is that the downloads start and then stop; after whatever time elapses, the client tries again, and it keeps doing the same thing over and over until all the files are received, sometimes taking hours. This happens on all 4 of my machines with Nvidia cards, and this project is the ONLY one I run that gives me this issue. There is no <http_transfer_timeout>3000</http_transfer_timeout> tag in my cc_config file, but I just added one. We'll see what happens, but I'm not very optimistic. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0
|
Quote: "The problem is that the downloads start and then stop; after whatever time elapses, the client tries again, and it keeps doing the same thing over and over until all the files are received, sometimes taking hours. This happens on all 4 of my machines with Nvidia cards, and this project is the ONLY one I run that gives me this issue. There is no <http_transfer_timeout>3000</http_transfer_timeout> tag in my cc_config file, but I just added one. We'll see what happens, but I'm not very optimistic."
Note that the default, as stated in http://boinc.berkeley.edu/wiki/Client_configuration#Options, is actually 300 seconds. The syntax is <http_transfer_timeout>seconds</http_transfer_timeout> |
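For anyone adding the tag by hand: in cc_config.xml the timeout belongs inside the <options> section. The following is a minimal Python sketch, not an official BOINC tool, that sets the value in a config given as text. The function name `set_http_transfer_timeout` and the starting XML are made up for illustration; only the tag names come from the client configuration page linked above.

```python
import xml.etree.ElementTree as ET


def set_http_transfer_timeout(config_xml: str, seconds: int) -> str:
    """Return cc_config XML text with <http_transfer_timeout> set to `seconds`.

    Creates the <options> section and the tag if they are missing,
    and leaves every other setting untouched.
    """
    root = ET.fromstring(config_xml)
    if root.tag != "cc_config":
        raise ValueError("expected a <cc_config> root element")
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")
    tag = options.find("http_transfer_timeout")
    if tag is None:
        tag = ET.SubElement(options, "http_transfer_timeout")
    tag.text = str(seconds)
    return ET.tostring(root, encoding="unicode")


# Hypothetical starting file with no timeout tag, as described in the post above.
example = "<cc_config><options></options></cc_config>"
updated = set_http_transfer_timeout(example, 60)
print(updated)
```

The client only picks up the change once it re-reads its config files or is restarted.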
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
Quote: "The problem is that the downloads start and then stop; after whatever time elapses, the client tries again, and it keeps doing the same thing over and over until all the files are received, sometimes taking hours. This happens on all 4 of my machines with Nvidia cards, and this project is the ONLY one I run that gives me this issue. There is no <http_transfer_timeout>3000</http_transfer_timeout> tag in my cc_config file, but I just added one. We'll see what happens, but I'm not very optimistic."
This is the only project with that problem. I have no idea what they have set wrong, but we've complained about it a number of times. Anyway, to address this problem I use the setting above: <http_transfer_timeout>60</http_transfer_timeout>. That helps, but to make it acceptable I also have to start BOINC from the command line with this argument: --pers_retry_delay_max 60. It still starts and stops, but it retries much more quickly. Downloads went from sometimes taking hours to a maximum of 7-8 minutes now. We shouldn't have to jump through these hoops, but I really don't think there's anyone left on the project who knows how to configure the system. There are other easy fixes that we've asked for that never get addressed. |
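To see why capping the retry delay shortens a download that stalls repeatedly, here is a rough Python sketch of a doubling backoff with a ceiling. It is a simplified model only: the real client's backoff logic may differ, and the 3600-second ceiling is a hypothetical value chosen for comparison, not a documented BOINC default.

```python
def retry_schedule(attempts: int, first_delay: float, max_delay: float) -> list:
    """Delays in seconds before each retry: doubling each time, capped at max_delay."""
    delays = []
    delay = first_delay
    for _ in range(attempts):
        delays.append(min(delay, max_delay))
        delay *= 2
    return delays


# Six stalled attempts, starting from a 60 s delay in both cases.
capped = retry_schedule(6, 60, 60)      # ceiling of 60 s, as in the post above
uncapped = retry_schedule(6, 60, 3600)  # hypothetical much larger ceiling

print(sum(capped) / 60, "minutes spent waiting with a 60 s ceiling")
print(sum(uncapped) / 60, "minutes spent waiting with a 3600 s ceiling")
```

With the ceiling at 60 seconds, six stalls cost about six minutes of waiting, which is in line with the 7-8 minutes reported above.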
caffeineyellow5 Joined: 30 Jul 14 Posts: 225 Credit: 2,658,976,345 RAC: 0
|
I just don't get these problems, except at times when everyone does because of server failures. I have systems at 2 different locations on 2 different internet providers (Comcast and RCN). |
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
You've complained about it previously, as have many others. It happens here on virtually every GPUGrid download (CenturyLink). It never happens on any other downloads, BOINC or otherwise. |
|
Joined: 26 Feb 12 Posts: 184 Credit: 222,376,233 RAC: 0
|
Maybe I'm overreacting, but there are just too many issues with seemingly simple remedies for me. Since I'm already in the process of scaling back my DC operation, it doesn't really matter. With that said, I have 3 SuperMicro dual socket boards with Xeon ES V4 CPUs that will be going up for sale. 2 are 28c/56t and 1 is 36c/72t. I won't go into all the details now. If anyone here is interested, PM me and we'll discuss things. |
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
While I don't think the GPUGrid staff could do anything about your HTTP timeout problem, out of curiosity I'd ask you to run some very basic network diagnostics. If you have a Windows-based PC on the same network as your crunching box, please open a command prompt and type:
ping www.gpugrid.net -n 100
You can do it on Linux too, but I'm not familiar with its command syntax (the -n 100 parameter tells the ping command to try 100 times). You'll see a lot of messages (exactly 100, if everything goes well) like:
Reply from 84.89.134.145: bytes=32 time=83ms TTL=49
Then, at the end:
Ping statistics for 84.89.134.145:
Packets: Sent = 100, Received = 100, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 83ms, Maximum = 88ms, Average = 83ms
These are the actual results from my host; I'm curious about your statistics. I expect your packet loss and round trip times to be significantly higher than what I experience. Unfortunately, these numbers do not reveal which device is responsible for your problem, but I'm quite confident that it's closer to your end (most probably at your ISP) than to the GPUGrid site (otherwise many more users would have such difficulties). You could also try a traceroute command:
tracert www.gpugrid.net
This gives you a list of the devices between your end and grosso.upf.edu (on which the gpugrid.net project resides). Perhaps this list could help us figure out what's wrong, especially if it gives you very different results when you run it multiple times. In some cases these errors are simply caused by network congestion (when the ISP has limited bandwidth to certain destinations), which can depend on the time of day. On your end, however, P2P file sharing applications or appliances, or a faulty router/switch, could cause such strange errors (but I'm sure in that case there would be problems with other sites as well). |
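On the Linux question: the standard Linux and macOS ping takes the count with -c instead of -n, so the equivalent command is ping -c 100 www.gpugrid.net (and the traceroute tool is called traceroute rather than tracert). The Python sketch below builds the right command for either platform and pulls the figures out of a Windows-style summary like the one quoted above. The function names are invented for illustration, and the parser only handles the English Windows output format shown in this thread.

```python
import platform
import re


def ping_command(host: str, count: int, system: str) -> list:
    """Build a ping command; Windows takes the count with -n, Linux/macOS with -c."""
    flag = "-n" if system == "Windows" else "-c"
    return ["ping", flag, str(count), host]


def parse_windows_summary(text: str) -> dict:
    """Extract packet counts and round-trip times from a Windows ping summary."""
    packets = re.search(r"Sent = (\d+), Received = (\d+), Lost = (\d+)", text)
    times = re.search(r"Minimum = (\d+)ms, Maximum = (\d+)ms, Average = (\d+)ms", text)
    if packets is None or times is None:
        raise ValueError("no ping summary found")
    sent, received, lost = map(int, packets.groups())
    minimum, maximum, average = map(int, times.groups())
    return {
        "sent": sent,
        "received": received,
        "loss_percent": 100.0 * lost / sent,
        "min_ms": minimum,
        "max_ms": maximum,
        "avg_ms": average,
    }


# Summary text copied from the post above.
summary = """Ping statistics for 84.89.134.145:
Packets: Sent = 100, Received = 100, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 83ms, Maximum = 88ms, Average = 83ms"""

print(ping_command("www.gpugrid.net", 100, platform.system()))
print(parse_windows_summary(summary))
```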
|
Joined: 5 Jan 09 Posts: 670 Credit: 2,498,095,550 RAC: 0
|
Actually Retvari, I experience the same problems with downloads sticking: not the whole package, just one or two files that stick. My stats:
Pinging www.gpugrid.net [84.89.134.145] with 32 bytes of data:
Ping statistics for 84.89.134.145:
Packets: Sent = 100, Received = 100, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 58ms, Maximum = 59ms, Average = 58ms
My tracert:
Tracing route to www.gpugrid.net [84.89.134.145] over a maximum of 30 hops:
1 <1 ms <1 ms <1 ms 192.168.0.1
2 * * * Request timed out.
3 13 ms 14 ms 12 ms be363.pr2.hobir.isp.sky.com [89.200.135.232]
4 12 ms 12 ms 12 ms ae-3.r02.londen03.uk.bb.gin.ntt.net [83.231.221.45]
5 11 ms 11 ms 11 ms ae-3.r24.londen12.uk.bb.gin.ntt.net [129.250.4.23]
6 33 ms 33 ms 33 ms ae-6.r01.mdrdsp03.es.bb.gin.ntt.net [129.250.4.138]
7 36 ms 34 ms 34 ms rediris.baja.espanix.net [193.149.1.26]
8 48 ms 48 ms 48 ms CIEMAT.AE1.cica.rt1.and.red.rediris.es [130.206.245.38]
9 52 ms 52 ms 51 ms CICA.AE1.uv.rt1.val.red.rediris.es [130.206.245.34]
10 58 ms 60 ms 68 ms anella-val1-router.red.rediris.es [130.206.211.70]
11 * * * Request timed out.
12 58 ms 57 ms 57 ms grosso.upf.edu [84.89.134.145]
13 58 ms 57 ms 57 ms grosso.upf.edu [84.89.134.145]
14 58 ms 58 ms 58 ms grosso.upf.edu [84.89.134.145]
Trace complete.
I have also noticed the same thing on a remote host with a different ISP. |
|
Joined: 9 May 13 Posts: 171 Credit: 4,739,796,466 RAC: 334,273
|
I am also having issues downloading individual files. It usually takes two or three retries. My trace route says:
Tracing route to www.gpugrid.net [84.89.134.145] over a maximum of 30 hops:
1 <1 ms <1 ms <1 ms dsldevice.attlocal.net [192.168.1.254]
2 21 ms 20 ms 21 ms 99-17-40-3.lightspeed.edmdok.sbcglobal.net [99.17.40.3]
3 * * * Request timed out.
4 * * * Request timed out.
5 21 ms 22 ms 23 ms 12.83.71.89
6 28 ms 30 ms 29 ms ggr3.dlstx.ip.att.net [12.122.139.17]
7 27 ms 27 ms 27 ms 192.205.36.222
8 28 ms 26 ms 28 ms be2764.ccr22.dfw01.atlas.cogentco.com [154.54.47.213]
9 32 ms 32 ms 33 ms be2443.ccr22.iah01.atlas.cogentco.com [154.54.44.229]
10 46 ms 46 ms 45 ms be2690.ccr42.atl01.atlas.cogentco.com [154.54.28.129]
11 54 ms 54 ms 54 ms be2113.ccr42.dca01.atlas.cogentco.com [154.54.24.221]
12 59 ms 59 ms 58 ms be2807.ccr42.jfk02.atlas.cogentco.com [154.54.40.109]
13 129 ms 129 ms 129 ms be2747.ccr42.par01.atlas.cogentco.com [154.54.31.190]
14 143 ms 142 ms 142 ms be2423.ccr22.bio02.atlas.cogentco.com [130.117.50.78]
15 148 ms 147 ms 148 ms be2293.ccr22.mad05.atlas.cogentco.com [130.117.50.26]
16 147 ms 149 ms 147 ms be2853.rcr11.b015537-1.mad05.atlas.cogentco.com [154.54.56.62]
17 150 ms 161 ms 149 ms 149.11.68.2
18 149 ms 148 ms 148 ms CIEMAT.AE2.telmad.rt4.mad.red.rediris.es [130.206.245.2]
19 155 ms 154 ms 159 ms TELMAD.AE4.uv.rt1.val.red.rediris.es [130.206.245.89]
20 163 ms 162 ms 163 ms anella-val1-router.red.rediris.es [130.206.211.70]
21 * * * Request timed out.
22 161 ms 160 ms 160 ms grosso.upf.edu [84.89.134.145]
23 160 ms 159 ms 160 ms grosso.upf.edu [84.89.134.145]
24 161 ms 161 ms 160 ms grosso.upf.edu [84.89.134.145]
Trace complete.
Maybe that will help isolate the issue. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0
|
After a quiet period when most downloads completed at the first attempt, in the last few days I've seen a marked increase in download delays - as Betting Slip says, usually just one file dropping to zero speed, while nominally still 'active'. That's coincided with more work being downloaded (and re-downloaded - see Pascal thread): I doubt that's a coincidence. If I notice quickly, I can briefly 'suspend network activity' for BOINC, then go back to 'network activity always', while the download status is still 'active', and hence the underlying TCP/IP connection is still alive (an authenticated route still exists). If I manage that, the download usually completes far faster than a normal download. That leads me to suspect that the majority of the packets have already arrived safely, with just a few gaps where individual packets have dropped out. And, for some reason, the 'resend packet xxxx' messages aren't getting through, or are themselves being dropped by the server. Three years ago, we had a great deal of success at SETI with advising Windows users to enable rfc1323 - but that was to overcome exactly the problem with "large bandwidth*delay" paths described in the RFC. SETI has moved to a better network environment since then. We don't have exactly the same problem here (I've tried the fix, and it made no difference), but I suspect we may need a similar sort of packet-level analysis to identify and alleviate the problem we are observing. |
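For reference, the arithmetic behind the RFC 1323 issue mentioned above: without window scaling, a TCP receive window cannot exceed 65,535 bytes, so a single connection can carry at most one window per round trip. The Python sketch below works that limit out for round-trip times reported in this thread (58, 83 and 160 ms). It only illustrates the "large bandwidth*delay" effect and says nothing about what is actually stalling downloads here, since the post above notes that the fix made no difference.

```python
# Largest TCP window that fits in the 16-bit header field without RFC 1323 scaling.
UNSCALED_WINDOW_BYTES = 65535


def max_throughput_bytes_per_s(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on one connection's throughput: one full window per round trip."""
    return window_bytes / (rtt_ms / 1000.0)


# Round-trip times taken from the ping and tracert results posted in this thread.
for rtt_ms in (58, 83, 160):
    rate = max_throughput_bytes_per_s(UNSCALED_WINDOW_BYTES, rtt_ms)
    print(f"RTT {rtt_ms} ms: at most {rate * 8 / 1e6:.1f} Mbit/s per connection")
```

The longer the round trip, the lower the ceiling, which is why the problem described in the RFC shows up mainly on long-distance paths.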
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
My trace route looks very similar after the first couple of hops:
Tracing route to www.gpugrid.net [84.89.134.145] over a maximum of 30 hops:
1 <1 ms <1 ms <1 ms 192.168.11.254 [192.168.11.254]
2 16 ms 16 ms 16 ms lo1.bsr0-zugliget.net.telekom.hu [145.236.238.178]
3 16 ms 16 ms 16 ms 81.183.3.4
4 17 ms 16 ms 17 ms 81.183.3.4
5 19 ms 16 ms 16 ms 81.183.3.145
6 24 ms 23 ms 23 ms 80.157.202.125
7 22 ms 22 ms 22 ms 80.150.171.74
8 28 ms 28 ms 28 ms be2974.ccr21.muc03.atlas.cogentco.com [154.54.58.5]
9 33 ms 34 ms 34 ms be3072.ccr21.zrh01.atlas.cogentco.com [130.117.0.17]
10 46 ms 46 ms 45 ms be3080.ccr21.mrs01.atlas.cogentco.com [130.117.49.1]
11 58 ms 58 ms 57 ms be2354.ccr21.vlc02.atlas.cogentco.com [130.117.0.150]
12 62 ms 61 ms 62 ms be2339.ccr22.mad05.atlas.cogentco.com [130.117.49.81]
13 63 ms 62 ms 63 ms be2853.rcr11.b015537-1.mad05.atlas.cogentco.com [154.54.56.62]
14 63 ms 62 ms 63 ms 149.11.68.50
15 159 ms 74 ms 74 ms CIEMAT.AE1.cica.rt1.and.red.rediris.es [130.206.245.38]
16 78 ms 77 ms 77 ms CICA.AE1.uv.rt1.val.red.rediris.es [130.206.245.34]
17 85 ms 83 ms 91 ms anella-val1-router.red.rediris.es [130.206.211.70]
18 * * * Request timed out.
19 83 ms 83 ms 86 ms grosso.upf.edu [84.89.134.145]
20 84 ms 83 ms 83 ms grosso.upf.edu [84.89.134.145]
21 83 ms 91 ms 84 ms grosso.upf.edu [84.89.134.145]
Trace complete.
I suspect that one of my hosts has had a stalled download, and that made it crunch for Einstein@home for a while. But these glitches happen to my hosts almost exclusively when new workunits become available after a near-empty period. That's when the ghost workunits appear too. Probably too many hosts are connected to, or trying to connect to, the server during these periods. Perhaps it looks like a DDoS attack to some firewall/router along the way. |
caffeineyellow5 Joined: 30 Jul 14 Posts: 225 Credit: 2,658,976,345 RAC: 0
|
Ping statistics for 84.89.134.145:
Packets: Sent = 100, Received = 99, Lost = 1 (1% loss),
Approximate round trip times in milli-seconds:
Minimum = 114ms, Maximum = 123ms, Average = 118ms
C:\Windows\system32>tracert www.gpugrid.net
Tracing route to www.gpugrid.net [84.89.134.145] over a maximum of 30 hops:
1 9 ms 11 ms 12 ms bdl1.rdl-ubr2.trpr-rdl.pa.cable.rcn.net [10.49.128.1]
2 13 ms 12 ms 10 ms bdle25-sub202.aggr1.phdl.pa.rcn.net [207.172.196.209]
3 11 ms 11 ms 11 ms xe-4-1-0.bar2.Philadelphia1.Level3.net [4.78.154.89]
4 * * * Request timed out.
5 16 ms 17 ms 17 ms Comcast-Level3-10G.boston1.Level3.net [4.68.110.90]
6 18 ms 16 ms 12 ms be2060.ccr41.jfk02.atlas.cogentco.com [154.54.31.9]
7 99 ms 96 ms 100 ms be2746.ccr41.par01.atlas.cogentco.com [154.54.29.118]
8 120 ms 121 ms 121 ms be2475.ccr21.bio02.atlas.cogentco.com [130.117.48.181]
9 121 ms 119 ms 117 ms be2235.ccr21.mad05.atlas.cogentco.com [130.117.48.134]
10 108 ms 105 ms 109 ms be2852.rcr11.b015537-1.mad05.atlas.cogentco.com [154.54.36.166]
11 103 ms 106 ms 105 ms 149.11.68.50
12 105 ms 106 ms 107 ms CIEMAT.AE2.telmad.rt4.mad.red.rediris.es [130.206.245.2]
13 112 ms 112 ms 107 ms TELMAD.AE4.uv.rt1.val.red.rediris.es [130.206.245.89]
14 121 ms 123 ms 119 ms anella-val1-router.red.rediris.es [130.206.211.70]
15 * * * Request timed out.
16 115 ms 114 ms 120 ms grosso.upf.edu [84.89.134.145]
17 114 ms 120 ms 119 ms grosso.upf.edu [84.89.134.145]
18 120 ms 114 ms 115 ms grosso.upf.edu [84.89.134.145]
Trace complete. |
|
Joined: 5 Dec 12 Posts: 84 Credit: 1,663,883,415 RAC: 0
|
Nanoprobe, check your messages ^^ |
|
Joined: 5 May 13 Posts: 187 Credit: 349,254,454 RAC: 0
|
Download / upload issues have been around for a while now. We have discussed them at sufficient length to conclude that neither we nor the GPUGRID staff have a clue as to their cause. :( For the record, I too have downloads / uploads that stall and succeed only after a number of retries, and all this only for GPUGRID!
|
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
Quote: "For the record, I too have downloads / uploads stall, succeed only after a number of retries, and all this only for GPUGRID!"
Same here, but only downloads stall for me, and again: only for GPUGrid.
Ping statistics for 84.89.134.145:
Packets: Sent = 100, Received = 100, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 170ms, Maximum = 215ms, Average = 175ms
It seems that everyone (including me) has this happening:
17 85 ms 83 ms 91 ms anella-val1-router.red.rediris.es [130.206.211.70]
18 * * * Request timed out.
19 83 ms 83 ms 86 ms grosso.upf.edu [84.89.134.145]
Is that the problem? Again, the ONLY project this happens to is GPUGrid, and it never happens on any other downloads of any kind. |
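A note on the timed-out hop: a line of `* * *` followed by hops that answer normally generally just means that particular router does not reply to traceroute probes, so on its own it is not evidence of packet loss. To compare traces from different posters, a small Python sketch like the one below can pull out the addresses on either side of each silent hop. The function names are invented for illustration, and the parser assumes the Windows tracert layout pasted in this thread.

```python
import re

HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")
# An IPv4 address at the end of the line, with or without a closing bracket.
ADDRESS = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3})\]?\s*$")


def parse_tracert(text: str) -> list:
    """Return one entry per hop: its IPv4 address, or None if the hop timed out."""
    hops = []
    for line in text.splitlines():
        match = HOP_LINE.match(line)
        if not match:
            continue  # header, footer, or blank line
        rest = match.group(2)
        if "Request timed out" in rest:
            hops.append(None)
            continue
        address = ADDRESS.search(rest)
        if address:
            hops.append(address.group(1))
    return hops


def silent_hop_neighbours(hops: list) -> list:
    """For each timed-out hop, return (address before it, address after it)."""
    pairs = []
    for i, hop in enumerate(hops):
        if hop is None and 0 < i < len(hops) - 1:
            pairs.append((hops[i - 1], hops[i + 1]))
    return pairs


# The three lines quoted in the post above.
excerpt = """17 85 ms 83 ms 91 ms anella-val1-router.red.rediris.es [130.206.211.70]
18 * * * Request timed out.
19 83 ms 83 ms 86 ms grosso.upf.edu [84.89.134.145]"""

print(silent_hop_neighbours(parse_tracert(excerpt)))
```

Running it over each full trace in the thread would show whether the same silent hop sits directly in front of grosso.upf.edu for everyone.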
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0
|
For me, it's only one file (or rarely two) that stalls, out of a dozen or more for a typical task. And it gets partway through the download before stalling. That tells me that the destination address has been found, the route established, and the connection set up - no amount of pinging or tracerting is going to diagnose anything more than that. What we really need (and the project probably doesn't employ) is a network specialist experienced in analysing throughput at the individual packet level - and I don't know any of those, personally. |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
If there is an internal network issue, it's likely something the university needs to sort out, rather than the group. |
|
Joined: 9 May 13 Posts: 171 Credit: 4,739,796,466 RAC: 334,273
|
I just processed a number of short tasks. Many of them had issues downloading files. Going back through the event log, all of the interruptions happened with the "*-psf_file" and "*-pdb_file" files. More clues, maybe? |
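If others want to check whether their stalls cluster on the same file types, a tally by file-name suffix is enough. The Python sketch below counts names by the part after the last hyphen. The file names in it are hypothetical placeholders, not real GPUGrid file names, and collecting the stalled names from the event log is left to the reader.

```python
from collections import Counter


def tally_by_suffix(file_names):
    """Count file names by the part after the last '-' (for example 'psf_file')."""
    return Counter(name.rsplit("-", 1)[-1] for name in file_names)


# Hypothetical names of files whose downloads stalled.
stalled = ["taskA-psf_file", "taskA-pdb_file", "taskB-psf_file"]

for suffix, count in tally_by_suffix(stalled).most_common():
    print(suffix, count)
```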
©2026 Universitat Pompeu Fabra