Work unit failure rate

Author	Message
BarryAZ Joined: 16 Apr 09 Posts: 163 Credit: 921,733,849 RAC: 0	Message 13290 - Posted: 27 Oct 2009, 17:49:29 UTC I just looked into this on the seven systems I have actively running GPUGrid at the moment. Of the past 70 workunits completed (10 per system), I had a total of 8 failures -- most of these quite close to completion. None of the systems involved has an overclocked GPU. The GPUs range from one 9600GT to one GTS 250, the rest being 9800GTs. The OS is either Windows XP or Windows 7. The driver version is either 190.38 or 190.62. Every workstation had one failure (one had two). That seems to be a pretty high failure rate to cope with.

philip Joined: 29 Jan 09 Posts: 1 Credit: 562,650 RAC: 0	Message 13292 - Posted: 27 Oct 2009, 19:24:49 UTC - in response to Message 13290. Since I upgraded to Windows 7 I have had nothing but failures. Running 191.07 with 6.10.16 on Quad X9770 with 9800GX2. Bailing out. Just not worth it.

BarryAZ Joined: 16 Apr 09 Posts: 163 Credit: 921,733,849 RAC: 0	Message 13300 - Posted: 28 Oct 2009, 15:46:42 UTC - in response to Message 13292. Interesting -- I've not seen things as being worse with Win7 versus XP, but I think a 10% failure rate is 'suboptimal'. Quote (Message 13292): "Since I have upgraded to Windows 7 I had nothing but failures. Running 191.07 with 6.10.16 on Quad X9770 with 9800GX2. Bailing out. Just not worth it."

Dennis-TW Joined: 15 Jan 09 Posts: 6 Credit: 113,514,591 RAC: 0	Message 13304 - Posted: 29 Oct 2009, 7:49:51 UTC Recently I have had the same kind of failure rate: 2 of the last 15 workunits failed, both also quite close to completion, with a total loss of about 34 hours of GPU time. This is also on a 9800 GT; however, I haven't changed anything on the software side recently. I started a thread of my own to compare some details, since the workunits themselves don't seem to be total crashes: they were crunched perfectly by a GTX 260.

Dennis-TW Joined: 15 Jan 09 Posts: 6 Credit: 113,514,591 RAC: 0	Message 13358 - Posted: 2 Nov 2009, 16:20:37 UTC And yet another one bit the dust ... that makes it 3 failures out of the last 13. Any comment on this issue, or should I look for another CUDA project?

BarryAZ Joined: 16 Apr 09 Posts: 163 Credit: 921,733,849 RAC: 0	Message 13361 - Posted: 3 Nov 2009, 2:05:03 UTC - in response to Message 13358. The thing is, at the moment, GPU project options are rather limited. That 9800GT is supported on SETI (when they have work) and on Collatz (when it is running). Collatz is a very low resource project, and it is also the only one supporting lower power ATI GPUs. When the other CUDA/ATI GPU project (MW, which supports only double precision GPUs and so not that 9800GT, for instance) is in trouble, as it is at the moment, the load on Collatz seems to be simply too much. At this moment both Collatz and MW are offline. I use Collatz as my primary GPU project, with GPUGrid as my backup GPU project these days. Add to that the current work unit famine here (which is probably only short term), and life can get rather tedious for GPU BOINC folks. One thing I've seen here is that when we report problems with workunits (as we are doing here), or note that workunits are running longer for the same credit payout, the response (if there is a response) seems to be 'no, you are not seeing what you are seeing' or 'what have you (the user) changed?' when you, the user, haven't changed anything. That tends to be a tad frustrating for me. Quote (Message 13358): "And just another one bit the dust......that makes it 3 fails out of the last 13. Any comment on this issue or should I look for another CUDA project??"

Paul D. Buck Joined: 9 Jun 08 Posts: 1050 Credit: 37,321,185 RAC: 0	Message 13365 - Posted: 3 Nov 2009, 15:59:28 UTC - in response to Message 13361. Quote (Message 13361): "The thing is, at the moment, GPU project options are rather limited." Still sad but true ... the only good news is that Einstein is working on a GPU version ... though it is taking a lot longer than I had been expecting ... especially given that EaH has been one of the most reliable projects at having work up and available and at handling outages with grace ... and volume too ... Quote (Message 13361): "One thing that I've seen here is that problems with workunits (such as we are reporting here) - or when we note that workunits are running longer for the same credit pay out, there seems to be a response here (if there is a response), that no, you are not seeing what you are seeing, or, 'what have you (the user) changed (when you, the user) haven't changed things. That tends to be a tad frustrating for me." I have been wondering if this is just a symptom of the "admin fatigue" that we have been discussing elsewhere ... it is about that time here too ... As an answer to the situation, and as a reply to the admin's suggestion that problems with Resource Share are not a BOINC problem that is of interest to projects, I put GPU Grid in rotation with MW and Collatz and have noted that not only have my earnings here dropped through the floor, even the amount of time spent does not seem to be properly balanced ... of course I have seen all this for a long time but have not been able to get UCB's attention ... and am not likely to get GPU Grid's either ... then again ... a pox on both their houses ... I guess that means I am frustrated too ... :)

GDF Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Joined: 14 Mar 07 Posts: 1958 Credit: 629,356 RAC: 0	Message 13398 - Posted: 7 Nov 2009, 10:12:09 UTC - in response to Message 13365. The vast majority of errors are produced by overclocking and/or poor cooling of the GPUs. You might think that your GPU is not overclocked, but the manufacturer may have done it for you. Look up the clock that Nvidia suggests for your card here: http://en.wikipedia.org/wiki/GeForce_200_Series and compare it with your card's actual clock. Reducing it to the lower, recommended value might well fix all your problems. With a proper installation you should be able to get a nearly 100% success rate, as several users do. GDF

Snow Crash Joined: 4 Apr 09 Posts: 450 Credit: 539,316,349 RAC: 0	Message 13399 - Posted: 7 Nov 2009, 12:12:42 UTC OC, heat, underpowered PSU, drivers. Lately I have been taking BOINC offline, making a copy of my BOINC data folder and then using the copy to test out my new OC settings. This way, if I crash, I can keep making copies without hammering the GPUGrid servers, and I also don't run into any "allowed WUs per day" limits. I have found that this is about the best way for me, as there do not seem to be any really good testing tools for GPUs ... sorry, but relying on my visual inspection for artifacts is not a particularly rigorous process, and I wonder if those tools are stressing the GPU the same way that GPUGrid does. Aren't we more concerned with shaders first, memory second, and core really does not matter? @GDF - Could you please tell me which type of WUs are typically the most GPU intensive? This way I could refine my test process to make sure I am doing the best testing possible. Thank you, Steve
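Editor's note: the copy-and-restore routine Snow Crash describes above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function names are made up for illustration, the directory paths are hypothetical, and it assumes the BOINC client has already been stopped, as in the post.

```python
import shutil
from pathlib import Path


def snapshot_data_dir(data_dir, backup_root, label):
    """Copy the (stopped) client's data folder to backup_root/<name>-<label>.

    Hypothetical helper: BOINC itself provides nothing with this name.
    """
    src = Path(data_dir)
    dst = Path(backup_root) / f"{src.name}-{label}"
    if dst.exists():
        raise FileExistsError(f"snapshot already exists: {dst}")
    shutil.copytree(src, dst)  # creates missing parent directories as needed
    return dst


def restore_data_dir(snapshot, data_dir):
    """Replace the data folder with a fresh copy of an earlier snapshot."""
    target = Path(data_dir)
    if target.exists():
        shutil.rmtree(target)  # discard the state left behind by the failed test
    shutil.copytree(snapshot, target)
    return target


# Example use (paths are placeholders, not real BOINC defaults):
# snap = snapshot_data_dir("/path/to/boinc-data", "/path/to/backups", "pre-oc")
# ... run the overclock test offline; if it crashes:
# restore_data_dir(snap, "/path/to/boinc-data")
```

Keeping the snapshot untouched and restoring from it each time means the same work unit can be retried under different clock settings without contacting the project servers.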

fractal Joined: 16 Aug 08 Posts: 87 Credit: 1,248,879,715 RAC: 0	Message 13403 - Posted: 7 Nov 2009, 16:09:03 UTC Last modified: 7 Nov 2009, 16:14:12 UTC I have been running GPUGRID for more than a year now, with this machine on it the whole time. It had been pretty reliable, with no errors most of the time. Some batches of work have had it produce one error out of 20 work units. Lately it failed on five out of the last 18 work units, or just under 1/3 of them. The only change of late has been upgrading to BOINC 6.10.3. This was nice, since now all my work units (that complete) get me the bonus without my having to manually delete the 4 or 5 it would download with the older version of BOINC. A typical failure looks like:
Using CUDA device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce 9600 GSO"
# Clock rate: 1.46 GHz
# Total amount of global memory: 805044224 bytes
# Number of multiprocessors: 12
# Number of cores: 96
MDIO ERROR: cannot open file "restart.coor"
Cuda error: Kernel [pme_fill_charges_accumulate] failed in file 'fillcharges.cu' in line 73 : unspecified launch failure.
I will open the box and blow the dust out of it in case it is overheating, but I can confirm that the error rate appears to have increased of late.

Dennis-TW Joined: 15 Jan 09 Posts: 6 Credit: 113,514,591 RAC: 0	Message 13436 - Posted: 10 Nov 2009, 0:46:15 UTC - in response to Message 13403. Quote (Message 13403): "but I can confirm that the error rate appears to have increased of late." Copy that; it turns out that about 1/3 of all my WUs have failed recently. Though it's nice to read the general hints (overclocking, heat, drivers, blabla), the fact is that none of them applies to my station here. In July/August everything was running fine, but now in October/November I get this failure rate.

Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0	Message 13441 - Posted: 10 Nov 2009, 6:09:38 UTC - in response to Message 13403. Quote (Message 13403): "I have been running GPUGRID for more than a year now with this machine on it the whole time. It had been pretty regular about no errors most of the time. Some batches of work have had it produce one error out of 20 work units. Lately it failed on five out of the last 18 work units, or just under 1/3 of them. The only change of late has been upgrading to Boinc 6.10.3." Not sure of your video driver version, but this was posted today in the BOINC change log: "The new Nvidia API that BOINC 6.10 uses has a minimum driver set of CUDA 2.2, 185.85. If your present drivers are below this version, update first. If your present drivers are below this and they're the last available for your hardware, you cannot update to 6.10; stay at the last 6.6.x version for your OS. Spread the word."
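Editor's note: the minimum-driver rule quoted from the BOINC change log (185.85 for BOINC 6.10) can be checked mechanically. The sketch below is not BOINC code; it assumes driver versions are written as dot-separated numbers and compares them component by component, and the function names are illustrative only.

```python
# Minimum driver quoted in the change-log excerpt above (CUDA 2.2, 185.85).
MIN_DRIVER = (185, 85)


def parse_driver_version(text):
    """Turn a string such as '190.38' into a tuple of integers, (190, 38)."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(version_text, minimum=MIN_DRIVER):
    """Return True if the driver version is at or above the minimum."""
    return parse_driver_version(version_text) >= minimum


# Driver versions mentioned by posters in this thread.
for version in ["190.38", "190.62", "191.07", "190.18", "190.42"]:
    status = "ok" if meets_minimum(version) else "too old"
    print(f"{version}: {status}")
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes when the numeric parts have different lengths.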

fractal Joined: 16 Aug 08 Posts: 87 Credit: 1,248,879,715 RAC: 0	Message 13554 - Posted: 14 Nov 2009, 20:57:25 UTC - in response to Message 13436. Last modified: 14 Nov 2009, 20:58:00 UTC It has gone downhill for me. Up to a 100% error rate. Nothing has completed since the 11th. I just checked the machine, and all the fans are running just the same as they have been for the past half year. I will let it run with the case open for a while to see if the new work units need better cooling. Oh, and FWIW, I am running UNIX x86_64 Kernel Module 190.18 on http://www.gpugrid.net/show_host_detail.php?hostid=35424

BarryAZ Joined: 16 Apr 09 Posts: 163 Credit: 921,733,849 RAC: 0	Message 13601 - Posted: 18 Nov 2009, 16:09:50 UTC - in response to Message 13398. Right, I understand the natural administrative inclination to blame the user. This is a natural and human failing, and it has the advantage of avoiding a review of work for which the administrator has some accountability. (In my earlier life I had the admin tasks and know this inclination.) That being said, these are not overclocked CPUs or cards, and the problem is spread across 7 different configurations with three different GPU processors (9600GT, 9800GT, GTS 250). And it is getting WORSE. Over the past two weeks, 29 out of 85 completed results were failures. Frankly, if Collatz were not so overburdened with work (they support normal ATI cards as well as normal CUDA cards), I'd simply back off of GPUGrid and wait for this project to resolve its problems (note, I have not had failures over at Collatz for the much larger sampling over there). Perhaps when MW comes back to life, and Slicker returns from vacation over at GPU, I will fully move over to there and watch to see whether the response here is anything other than 'user error'. Quote (Message 13398): "the very large majority of errors are produced by overclocking and/or poor cooling of the GPUs. You might think that your GPUs is not overclocked but the manufacturer did it for you. Look up the suggested by Nvidia clock of your cards here: http://en.wikipedia.org/wiki/GeForce_200_Series and compare it with your card. Reducing it to the lower, recommended values might well fix all your problems. With a proper installation you should be able to get nearly 100% success rate as several users do GDF"

BarryAZ Joined: 16 Apr 09 Posts: 163 Credit: 921,733,849 RAC: 0	Message 13602 - Posted: 18 Nov 2009, 16:13:51 UTC - in response to Message 13436. Same failure rate here -- and with the same systems the rate was much lower in September. The change for me included a move to 6.10.x for the client. It certainly is a possible culprit. But that is not something I can control, in that any project supporting ATI GPUs requires the 6.10 client. Then again, getting Berkeley to accept that they are part of any problem is even less likely than getting folks here to step up to the plate. Curiously, I am NOT seeing these failures over at Collatz with either ATI or CUDA GPUs on the same computers. Quote (Message 13436): "'but I can confirm that the error rate appears to have increased of late.' Copy that, it turned out that about 1/3 of all my WUs do fail recently. Though it's nice to read about the general hints.....overclocking, heat, driver, blabla, it's just the fact that nothing of it applies to my station here....in July/August everything was running fine, but now in October/November I get this failure rate."

BarryAZ Joined: 16 Apr 09 Posts: 163 Credit: 921,733,849 RAC: 0	Message 13603 - Posted: 18 Nov 2009, 16:16:04 UTC - in response to Message 13554. Perhaps if enough of us report this here with enough variation in OS and hardware, eventually the spotlight on user error might be changed to a mirror.... Quote (Message 13554): "It has gone downhill for me. Up to 100% error rate. Nothing completed since the 11th. I just checked the machine and all the fans are running just the same as they have been for the past half year. I will let it run with the case open for a while to see if the new work units need better cooling. oh, and fwiw, I am running UNIX x86_64 Kernel Module 190.18 on http://www.gpugrid.net/show_host_detail.php?hostid=35424"

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 2	Message 13604 - Posted: 18 Nov 2009, 16:21:48 UTC I second this call. My recent failure rate has been 12 out of 58 - over 20% - across three cards: two Zotac 9800GT at completely stock speeds, and a Zotac 'AMP' edition (factory overclock) 9800GTX+. The cards have all been running on SETI since January with no sign of failure, and are doing so again as I type.
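Editor's note: the failure counts reported in the thread up to this point can be put on a common footing by converting them to percentages. A minimal sketch; the function name and the labels are illustrative, while the counts are taken directly from the posts above.

```python
def failure_rate(failed, total):
    """Return failures as a percentage of completed work units."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * failed / total


# (poster and message number, failed work units, work units completed)
reports = [
    ("BarryAZ, Message 13290", 8, 70),
    ("Dennis-TW, Message 13304", 2, 15),
    ("Dennis-TW, Message 13358", 3, 13),
    ("fractal, Message 13403", 5, 18),
    ("BarryAZ, Message 13601", 29, 85),
    ("Richard Haselgrove, Message 13604", 12, 58),
]

for who, failed, total in reports:
    print(f"{who}: {failed}/{total} = {failure_rate(failed, total):.1f}%")
```

These are small samples from individual hosts, so the percentages are only a rough comparison, not a project-wide error rate.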

GDF Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Joined: 14 Mar 07 Posts: 1958 Credit: 629,356 RAC: 0	Message 13605 - Posted: 18 Nov 2009, 16:27:13 UTC - in response to Message 13604. Last modified: 18 Nov 2009, 16:45:25 UTC We will try to upload a new application compiled with CUDA 2.3. Let's see if this solves the problem. The only change we had was that we are now distributing only a CUDA 2.2 application. gdf

BarryAZ Joined: 16 Apr 09 Posts: 163 Credit: 921,733,849 RAC: 0	Message 13606 - Posted: 18 Nov 2009, 16:32:41 UTC - in response to Message 13605. OK -- let folks know when that is in place -- for the moment, my GPUGrid processing is going on hold. Quote (Message 13605): "We will try to upload a new application compiled with cuda 2.3. Let's see if this serves the problem. The only change we had was that we are not distributing only a cuda2.2 application. gdf"

Richard Bertrand Joined: 15 Nov 09 Posts: 1 Credit: 0 RAC: 0	Message 13614 - Posted: 19 Nov 2009, 11:26:47 UTC - in response to Message 13603. To put in my 2cts... I have got CUDA working on a 9600M GS card within Kubuntu amd64 with the nVidia 190.42 driver and have a 100% failure rate thus far. I just started on the 15th of November with BOINC 6.4.5 from the Ubuntu repositories. GPUGrid is running the v6.70 CUDA version. The error messages state that a file couldn't be renamed, so I am not 100% sure whether it is the same issue as discussed here, but inspection of the permissions revealed no problems as far as I can see.