Work unit failure rate

Message boards : Graphics cards (GPUs) : Work unit failure rate

Andrew

Joined: 9 Dec 08
Posts: 29
Credit: 18,754,468
RAC: 0
Message 13642 - Posted: 22 Nov 2009, 1:12:47 UTC

In November I've had 3 failures to 11 successes on a non-overclocked 8800GT, so it's interesting that others are also reporting failures on 8800 or 9800 cards.
BlackNite

Joined: 21 Mar 09
Posts: 1
Credit: 2,518,637
RAC: 0
Message 13643 - Posted: 22 Nov 2009, 1:33:02 UTC

I had 9 failures in the last 32 WUs on an 8800GTS 512.
fractal

Joined: 16 Aug 08
Posts: 87
Credit: 1,248,879,715
RAC: 0
Message 13644 - Posted: 22 Nov 2009, 2:39:16 UTC

Things went from almost 100% failure back to 100% success for me on 16-Nov.

I did upgrade the CUDA driver from 190.18 to 190.42 and BOINC from 6.10.13 to 6.10.17 at that time in an attempt to get the machine to run Collatz. Collatz still doesn't like my linux64 machine, but GPUGRID is back to its old stable self. I'm not sure if my changes fixed it or if you did anything, but whoever sacrificed the chicken to Cthulhu has my thanks.
Profile Paul D. Buck

Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 13649 - Posted: 22 Nov 2009, 12:02:16 UTC

Just a cautionary note, this project is single precision heavy, MW is almost all double precision and Collatz is Integer ... so... success on one project does not at all imply that there is not a problem with the hardware side ... all three projects are using different parts of the cards ...

Just something to keep in mind ... and I did see a note elsewhere that someone reverted back to 6.6.x and their GPU Grid failures stopped ...
Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 13666 - Posted: 23 Nov 2009, 19:13:24 UTC - in response to Message 13644.  

Things went from almost 100% failure back to 100% success for me on 16-Nov.

I did upgrade cuda from 190.18 to 190.42 and boinc from 6.10.13

If you check, you'll see that almost all of these WUs were the ones talked about in this thread:

http://www.gpugrid.net/forum_thread.php?id=1468

They were later successfully completed by GTX 260 (and above) cards. Seems these WUs were pulled right around the 16th. I moved my sub GTX 260 cards to other projects for a few days because they were experiencing the same errors you were having. Now it seems things are sorted out and the sub GTX 260 cards are running better.


Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 318
Message 13686 - Posted: 24 Nov 2009, 18:05:13 UTC
Last modified: 24 Nov 2009, 18:37:50 UTC

Just had a nasty experience on host 43404 - a 9800GTX+. It looks as if D14-TONI_HERGdof2-0-40-RND9670 failed, and (for the first time in my experience) left the card in such a state that the next five tasks failed in quick succession. It also looks as if in the meantime, it has been trashing SETI Beta tasks in the characteristically SETI way, i.e. reporting 'success' but exiting early (after 17 seconds or so) with a false -9 overflow message and no useful science.

This happened just before SETI closed for weekly maintenance, so I can't check their logs until later. But I've looked through the local log, and it was definitely the GPUGrid task which was the first to fail: the subsequent problems lasted long enough to drive SETI DCF way down (0.0219), so now I've got a major excess to work off.

I rebooted the machine, and it's completed the next SETI Beta in a much saner 17m 34s (DCF 0.0889). I'll do one more SETI, then start the new queued GPUGrid. But I would be worried if it turns out that GPUGrid errors are wrecking the science, not only of your own project, but potentially other projects too.

Edit - next GPUGrid task has been running for 20 minutes now without a problem, so it seems the reboot was all it needed.
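The DCF swing Richard describes can be illustrated with a toy model. This is a sketch only, not BOINC's actual update rule, and the nominal estimate is invented; it just shows how a burst of tasks that falsely "succeed" after ~17 seconds drags the duration correction factor down, so that one sane result afterwards recovers it only slowly:

```python
# Toy exponential-smoothing model of a BOINC-style duration
# correction factor (DCF). NOT BOINC's real algorithm.

def update_dcf(dcf, estimated_s, actual_s, rate=0.1):
    """Move DCF a step toward the observed actual/estimated ratio."""
    ratio = actual_s / estimated_s
    return dcf + rate * (ratio - dcf)

dcf = 1.0
estimate = 1100.0  # hypothetical nominal runtime estimate, ~18 min

# Five tasks that falsely "succeed" after 17 seconds:
for _ in range(5):
    dcf = update_dcf(dcf, estimate, 17.0)
print(f"DCF after bogus results: {dcf:.4f}")  # dragged well below 1.0

# One sane 17m34s result pulls it back up only gradually:
dcf = update_dcf(dcf, estimate, 17 * 60 + 34)
print(f"DCF after a sane result: {dcf:.4f}")
```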
Siegfried Niklas
Joined: 23 Feb 09
Posts: 39
Credit: 144,654,294
RAC: 0
Message 13698 - Posted: 25 Nov 2009, 15:54:35 UTC - in response to Message 13686.  
Last modified: 25 Nov 2009, 15:56:44 UTC

Just had a nasty experience on host 43404 - a 9800GTX+. It looks as if D14-TONI_HERGdof2-0-40-RND9670 failed, and (for the first time in my experience) left the card in such a state that the next five tasks failed in quick succession.

[...]

Edit - next GPUGrid task has been running for 20 minutes now without a problem, so it seems the reboot was all it needed.



I had 4 faulty ...TONI_HERG... WUs on a 9800GT in the last few days.
Each "ERROR" crashed the driver (reboot needed).

One of these WUs (http://www.gpugrid.net/workunit.php?wuid=961479) has already errored out (too many error results).
CTAPbIi

Joined: 29 Aug 09
Posts: 175
Credit: 259,509,919
RAC: 0
Message 13700 - Posted: 26 Nov 2009, 3:23:41 UTC - in response to Message 13698.  
Last modified: 26 Nov 2009, 4:08:16 UTC

The last 3 WUs died just before the end...
32-IBUCH_2_reverse_TRYP_0911-9-40-RND8911
85-GIANNI_BIND_166_119-23-100-RND0667
8-GIANNI_BIND_2-34-100-RND3540

I just redid my overclocking and it looks stable; at least POEM's WUs are OK... The GPU was flashed years ago, so that's not the cause.
Daniel.Ahlborn

Joined: 12 Jan 09
Posts: 5
Credit: 3,359,168
RAC: 0
Message 13701 - Posted: 26 Nov 2009, 9:40:50 UTC

It doesn't seem like an OC problem. For the past couple of days I've had a failure rate of nearly 100% on my machine with a GTS 250 as well.

<core_client_version>6.10.17</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
# Using CUDA device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTS 250"
# Clock rate: 1.84 GHz
# Total amount of global memory: 536543232 bytes
# Number of multiprocessors: 16
# Number of cores: 128
MDIO ERROR: cannot open file "restart.coor"
Cuda error: Kernel [pme_fill_charges_accumulate] failed in file 'fillcharges.cu' in line 73 : unspecified launch failure.

</stderr_txt>
]]>

They are all failing after a couple of hours of running, with seemingly random reasons.

http://www.gpugrid.net/results.php?hostid=56508

To me it appears that the current WUs only run well on G200-based chips, since my other machine with a GTX 260 (G200b, 55nm, 216 SPs), same OS and same driver, works fine with anything they feed it.


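Failures like the one above can be triaged in bulk by scraping the stderr_txt blocks for the CUDA error line. A small sketch: the log format is taken from the post above, and the helper name is made up:

```python
import re

# Matches lines like:
#   Cuda error: Kernel [pme_fill_charges_accumulate] failed in file
#   'fillcharges.cu' in line 73 : unspecified launch failure.
CUDA_ERR = re.compile(
    r"Cuda error: Kernel \[(?P<kernel>\w+)\] failed in file "
    r"'(?P<file>[^']+)' in line (?P<line>\d+) : (?P<reason>.+?)\.?$"
)

def classify_stderr(stderr_txt):
    """Return (kernel, file, line, reason) for the first CUDA error, or None."""
    for line in stderr_txt.splitlines():
        m = CUDA_ERR.search(line)
        if m:
            return (m.group("kernel"), m.group("file"),
                    int(m.group("line")), m.group("reason"))
    return None

sample = """MDIO ERROR: cannot open file "restart.coor"
Cuda error: Kernel [pme_fill_charges_accumulate] failed in file 'fillcharges.cu' in line 73 : unspecified launch failure."""
print(classify_stderr(sample))
# ('pme_fill_charges_accumulate', 'fillcharges.cu', 73, 'unspecified launch failure')
```

Run over a host's collected task logs, a tally of (kernel, reason) pairs quickly shows whether the failures really are "random" or cluster on one kernel.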
Siegfried Niklas
Joined: 23 Feb 09
Posts: 39
Credit: 144,654,294
RAC: 0
Message 13717 - Posted: 28 Nov 2009, 18:19:28 UTC

I took a closer look at my results (last 2 weeks).

- not a single error on my highly overclocked GT200s (GTX260/GTX295)

- 12 errors (55 valid) on my four non-overclocked 9800GTs

-- 9 of the 12 errors on '...TONI_HERG...' WUs, 3 on '...IBUCH_..._TRYPE...'

I found not a single valid '...TONI_HERG...' result across all four 9800GTs.

(I tried BOINC 6.6.38 up to 6.10.17 and NV drivers 190.38/190.62/191.07 - no difference in failure rate.)

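Samples this small are noisy, so a confidence interval helps before blaming a particular card or WU batch. A sketch using the standard Wilson score interval and the numbers from this post (12 errors, 55 valid on the 9800GTs):

```python
from math import sqrt

def wilson_interval(errors, total, z=1.96):
    """95% Wilson score interval for a failure proportion."""
    p = errors / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return centre - half, centre + half

errors, valid = 12, 55  # figures from the post above
total = errors + valid
lo, hi = wilson_interval(errors, total)
print(f"failure rate {errors/total:.1%}, 95% CI {lo:.1%} to {hi:.1%}")
```

Even the lower bound here sits around 10%, so the 9800GT failure rate is clearly distinguishable from the zero-error GT200 cards.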
fractal

Joined: 16 Aug 08
Posts: 87
Credit: 1,248,879,715
RAC: 0
Message 13742 - Posted: 1 Dec 2009, 5:52:14 UTC - in response to Message 13644.  

Things went from almost 100% failure back to 100% success for me on 16-Nov.

I did upgrade the CUDA driver from 190.18 to 190.42 and BOINC from 6.10.13 to 6.10.17 at that time in an attempt to get the machine to run Collatz. Collatz still doesn't like my linux64 machine, but GPUGRID is back to its old stable self. I'm not sure if my changes fixed it or if you did anything, but whoever sacrificed the chicken to Cthulhu has my thanks.

It looks like Cthulhu ate everything he was given and wants more. I am back to a 100% error rate. I looked at the WUs I failed, and others fail them as well.

Should this be taken as a formal announcement that G92 boards are no longer welcome on GPUGRID? Finding G92/Linux-friendly projects is becoming more and more difficult...
fractal

Joined: 16 Aug 08
Posts: 87
Credit: 1,248,879,715
RAC: 0
Message 13756 - Posted: 2 Dec 2009, 6:06:19 UTC - in response to Message 13649.  

Just a cautionary note, this project is single precision heavy, MW is almost all double precision and Collatz is Integer ... so... success on one project does not at all imply that there is not a problem with the hardware side ... all three projects are using different parts of the cards ...

Just something to keep in mind ... and I did see a note elsewhere that someone reverted back to 6.6.x and their GPU Grid failures stopped ...

OK, I will admit finding factual information is hard. Very hard. But newer GPUs like the GT240, based on the GT215 GPU, are compute level 1.2. The only difference between compute level 1.2 and compute level 1.3 that NVIDIA documents is that compute level 1.3 supports double precision.

This begs the question: are GT240s, based on the GT215 chipset, supported by GPUGRID? We all know that GTS250s, based on the G92b chipset, are not, as are many GTX280s based on the G200 chipset, while GTX280s based on the G200b chipset DO work with GPUGRID.

Can boards that are expected to work be defined by their chipset, by their compute level, or by something else? NVIDIA's conventions are hard to understand, but it is clear that G92 is not welcome on GPUGRID, nor is G200. G200b is. Is GT215?
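For reference, the compute-capability split fractal is asking about can be written down as a small lookup. The codenames and capabilities below are per NVIDIA's CUDA documentation (CC 1.3 is the documented threshold for double precision); whether GPUGRID accepts each chip is a separate question:

```python
# Compute capability by GPU codename (per NVIDIA's CUDA docs).
COMPUTE_CAPABILITY = {
    "G92":    (1, 1),  # 8800GT, 9800GTX+, GTS 250
    "GT200":  (1, 3),  # GTX 260/280
    "GT200b": (1, 3),  # GTX 275/285, 55nm GTX 260
    "GT215":  (1, 2),  # GT 240
    "GT216":  (1, 2),  # GT 220
}

def supports_double(codename):
    """True if the chip's compute capability is at least 1.3."""
    return COMPUTE_CAPABILITY[codename] >= (1, 3)

for chip, cc in COMPUTE_CAPABILITY.items():
    print(f"{chip}: CC {cc[0]}.{cc[1]}, double precision: {supports_double(chip)}")
```

Note this cuts against a pure compute-level theory: GT200 (CC 1.3) was reported as problematic in this thread while GT215/GT216 (CC 1.2) were reported as fine.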
Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 13835 - Posted: 8 Dec 2009, 17:26:03 UTC - in response to Message 13756.  

For factual information try here: http://en.wikipedia.org/wiki/GeForce_200_Series
You should note that NVidia's newer cards are not necessarily based on anything in particular, and NVidia's naming system is beyond ridiculous.

The G200 series seems to include G92a, G92b and G96 based cards.
There are 40nm, 55nm and 65nm cores, and the transistor count varies from 0.26 billion to 2.8 billion.
Release dates don't seem to matter much either.

Particularly annoying specs:
GTX 280 cards used a 65nm fabricated core and were usually slower than the GTX 275s.
The older GTX 260s used 65nm, and the GTX 260M used a G92 core.
The GTS 250 uses a 65nm G92 A2 core, but still sort of works here!
The GTS 240 uses a 55nm G92b core - an afterthought, or perhaps a fulfil-contracts card.
The GT 220M uses a 65nm G96M core. I doubt that would work.

The combination of card factors that presently seem important to GPUGrid functionality includes:

Core size: 40nm Good, 55nm OK, 65nm Bad.

GPU codename: GT216 Good, GT215 Good, GT200b Good, GT200 Poor/OK, G92 A2 Poor/OK-ish, G92 Bad. The G90 is no longer compatible.

Memory: DDR3+DDR5 Good, DDR3 a mix of Good and Bad, DDR2 presumably Bad.

Overall performance: a combination of the amount and speed of the cores, shaders, memory, bus width, and other performance factors. Determines whether the card can finish in time.

Temperatures: too hot and it will crash. Depends on the physical architecture of the GPU and computer, use of fans, the GPUGrid work unit, and what else you are crunching...

And not forgetting:
How much use the card has seen, or how close it is to failure, given the card's other factors!
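These rules of thumb could be encoded as a crude scoring table. The ratings below are simply transcribed from this post; nothing here is official project policy, and the helper is hypothetical:

```python
# Crude transcription of the per-factor ratings from this post.
CORE_SIZE = {"40nm": "Good", "55nm": "OK", "65nm": "Bad"}
CODENAME = {
    "GT216": "Good", "GT215": "Good", "GT200b": "Good",
    "GT200": "Poor/OK", "G92 A2": "Poor/OK-ish", "G92": "Bad",
    "G90": "Incompatible",
}

def outlook(codename, core_size):
    """Pair the two per-card ratings from the post (hypothetical helper)."""
    return (CODENAME.get(codename, "Unknown"),
            CORE_SIZE.get(core_size, "Unknown"))

print(outlook("GT215", "40nm"))  # e.g. a GT 240
print(outlook("G92", "65nm"))    # e.g. an 8800GT
```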


©2025 Universitat Pompeu Fabra