Work unit failure rate

Message boards : Graphics cards (GPUs) : Work unit failure rate

Andrew

Joined: 9 Dec 08
Posts: 29
Credit: 18,754,468
RAC: 0
Message 13642 - Posted: 22 Nov 2009, 1:12:47 UTC

In November I've had 3 failures to 11 successes on a non-overclocked 8800GT, so it's interesting that others are also reporting failures on 8800 or 9800 cards.
BlackNite

Joined: 21 Mar 09
Posts: 1
Credit: 2,518,637
RAC: 0
Message 13643 - Posted: 22 Nov 2009, 1:33:02 UTC

I had 9 failures in the last 32 WUs on an 8800GTS 512.
fractal

Joined: 16 Aug 08
Posts: 87
Credit: 1,248,879,715
RAC: 0
Message 13644 - Posted: 22 Nov 2009, 2:39:16 UTC

Things went from almost 100% failure back to 100% success for me on 16-Nov.

I did upgrade the CUDA driver from 190.18 to 190.42 and BOINC from 6.10.13 to 6.10.17 at that time in an attempt to get the machine to run Collatz. Collatz still doesn't like my linux64 machine, but GPUGRID is back to its old stable self. I'm not sure if my changes fixed it or if you did anything, but whoever sacrificed the chicken to Cthulhu has my thanks.
Profile Paul D. Buck

Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 13649 - Posted: 22 Nov 2009, 12:02:16 UTC

Just a cautionary note, this project is single precision heavy, MW is almost all double precision and Collatz is Integer ... so... success on one project does not at all imply that there is not a problem with the hardware side ... all three projects are using different parts of the cards ...

Just something to keep in mind ... and I did see a note elsewhere that someone reverted back to 6.6.x and their GPU Grid failures stopped ...
Profile Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 13666 - Posted: 23 Nov 2009, 19:13:24 UTC - in response to Message 13644.  

Things went from almost 100% failure back to 100% success for me on 16-Nov.

I did upgrade cuda from 190.18 to 190.42 and boinc from 6.10.13

If you check, you'll see that almost all of these WUs were the ones talked about in this thread:

http://www.gpugrid.net/forum_thread.php?id=1468

They were later successfully completed by GTX 260 (and above) cards. Seems these WUs were pulled right around the 16th. I moved my sub GTX 260 cards to other projects for a few days because they were experiencing the same errors you were having. Now it seems things are sorted out and the sub GTX 260 cards are running better.


Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 318
Message 13686 - Posted: 24 Nov 2009, 18:05:13 UTC
Last modified: 24 Nov 2009, 18:37:50 UTC

Just had a nasty experience on host 43404 - a 9800GTX+. It looks as if D14-TONI_HERGdof2-0-40-RND9670 failed, and (for the first time in my experience) left the card in such a state that the next five tasks failed in quick succession. It also looks as if in the meantime, it has been trashing SETI Beta tasks in the characteristically SETI way, i.e. reporting 'success' but exiting early (after 17 seconds or so) with a false -9 overflow message and no useful science.

This happened just before SETI closed for weekly maintenance, so I can't check their logs until later. But I've looked through the local log, and it was definitely the GPUGrid task which was the first to fail: the subsequent problems lasted long enough to drive SETI DCF way down (0.0219), so now I've got a major excess to work off.

I rebooted the machine, and it's completed the next SETI Beta in a much saner 17m 34s (DCF 0.0889). I'll do one more SETI, then start the new queued GPUGrid. But I would be worried if it turns out that GPUGrid errors are wrecking the science, not only of your own project, but potentially other projects too.

Edit - next GPUGrid task has been running for 20 minutes now without a problem, so it seems the reboot was all it needed.
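The DCF swing Richard describes can be illustrated with a toy model. This is a sketch only, not BOINC's actual update rule, and the nominal estimate is invented; it just shows how a burst of tasks that falsely "succeed" after ~17 seconds drags the duration correction factor down, so that one sane result afterwards recovers it only slowly:

```python
# Toy exponential-smoothing model of a BOINC-style duration
# correction factor (DCF). NOT BOINC's real algorithm.

def update_dcf(dcf, estimated_s, actual_s, rate=0.1):
    """Move DCF a step toward the observed actual/estimated ratio."""
    ratio = actual_s / estimated_s
    return dcf + rate * (ratio - dcf)

dcf = 1.0
estimate = 1100.0  # hypothetical nominal runtime estimate, ~18 min

# Five tasks that falsely "succeed" after 17 seconds:
for _ in range(5):
    dcf = update_dcf(dcf, estimate, 17.0)
print(f"DCF after bogus results: {dcf:.4f}")  # dragged well below 1.0

# One sane 17m34s result pulls it back up only gradually:
dcf = update_dcf(dcf, estimate, 17 * 60 + 34)
print(f"DCF after a sane result: {dcf:.4f}")
```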
Siegfried Niklas
Joined: 23 Feb 09
Posts: 39
Credit: 144,654,294
RAC: 0
Message 13698 - Posted: 25 Nov 2009, 15:54:35 UTC - in response to Message 13686.  
Last modified: 25 Nov 2009, 15:56:44 UTC

Just had a nasty experience on host 43404 - a 9800GTX+. It looks as if D14-TONI_HERGdof2-0-40-RND9670 failed, and (for the first time in my experience) left the card in such a state that the next five tasks failed in quick succession.

[...]

Edit - next GPUGrid task has been running for 20 minutes now without a problem, so it seems the reboot was all it needed.



I had 4 faulty ...TONI_HERG... WUs on a 9800GT in the last few days.
Each "ERROR" crashed the driver (reboot needed).

One of these WUs (http://www.gpugrid.net/workunit.php?wuid=961479) has already errored out (too many error results).
CTAPbIi

Joined: 29 Aug 09
Posts: 175
Credit: 259,509,919
RAC: 0
Message 13700 - Posted: 26 Nov 2009, 3:23:41 UTC - in response to Message 13698.  
Last modified: 26 Nov 2009, 4:08:16 UTC

The last 3 WUs died just before the end...
32-IBUCH_2_reverse_TRYP_0911-9-40-RND8911
85-GIANNI_BIND_166_119-23-100-RND0667
8-GIANNI_BIND_2-34-100-RND3540

I just redid my overclocking and it looks stable; at least POEM's WUs are OK... The GPU was flashed years ago, so that's not the cause.
Daniel.Ahlborn

Joined: 12 Jan 09
Posts: 5
Credit: 3,359,168
RAC: 0
Message 13701 - Posted: 26 Nov 2009, 9:40:50 UTC

It doesn't seem like an OC problem. For the past couple of days I've had a failure rate of nearly 100% on my machine with a GTS 250 as well.

<core_client_version>6.10.17</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
# Using CUDA device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTS 250"
# Clock rate: 1.84 GHz
# Total amount of global memory: 536543232 bytes
# Number of multiprocessors: 16
# Number of cores: 128
MDIO ERROR: cannot open file "restart.coor"
Cuda error: Kernel [pme_fill_charges_accumulate] failed in file 'fillcharges.cu' in line 73 : unspecified launch failure.

</stderr_txt>
]]>

They are all failing after a couple of hours of running, with seemingly random reasons.

http://www.gpugrid.net/results.php?hostid=56508

To me it appears that the current WUs only run well on G200-based chips, since my other machine with a GTX 260 (G200b, 55nm, 216 SPs), same OS and same driver, works fine with anything they feed it.


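Failures like the one above can be triaged in bulk by scraping the stderr_txt blocks for the CUDA error line. A small sketch: the log format is taken from the post above, and the helper name is made up:

```python
import re

# Matches lines like:
#   Cuda error: Kernel [pme_fill_charges_accumulate] failed in file
#   'fillcharges.cu' in line 73 : unspecified launch failure.
CUDA_ERR = re.compile(
    r"Cuda error: Kernel \[(?P<kernel>\w+)\] failed in file "
    r"'(?P<file>[^']+)' in line (?P<line>\d+) : (?P<reason>.+?)\.?$"
)

def classify_stderr(stderr_txt):
    """Return (kernel, file, line, reason) for the first CUDA error, or None."""
    for line in stderr_txt.splitlines():
        m = CUDA_ERR.search(line)
        if m:
            return (m.group("kernel"), m.group("file"),
                    int(m.group("line")), m.group("reason"))
    return None

sample = """MDIO ERROR: cannot open file "restart.coor"
Cuda error: Kernel [pme_fill_charges_accumulate] failed in file 'fillcharges.cu' in line 73 : unspecified launch failure."""
print(classify_stderr(sample))
# ('pme_fill_charges_accumulate', 'fillcharges.cu', 73, 'unspecified launch failure')
```

Run over a host's collected task logs, a tally of (kernel, reason) pairs quickly shows whether the failures really are "random" or cluster on one kernel.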
Siegfried Niklas
Joined: 23 Feb 09
Posts: 39
Credit: 144,654,294
RAC: 0
Message 13717 - Posted: 28 Nov 2009, 18:19:28 UTC

I took a closer look at my results (last 2 weeks).

- not a single error on my highly overclocked GT200s (GTX260/GTX295)

- 12 errors (55 valid) on my four non-overclocked 9800GTs

-- 9 of the 12 errors on '...TONI_HERG...' WUs, 3 on '...IBUCH_..._TRYPE...'

I found not a single valid '...TONI_HERG...' result across all four 9800GTs.

(I tried BOINC 6.6.38 up to 6.10.17 and NV drivers 190.38/190.62/191.07 - no difference in failure rate.)

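Samples this small are noisy, so a confidence interval helps before blaming a particular card or WU batch. A sketch using the standard Wilson score interval and the numbers from this post (12 errors, 55 valid on the 9800GTs):

```python
from math import sqrt

def wilson_interval(errors, total, z=1.96):
    """95% Wilson score interval for a failure proportion."""
    p = errors / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return centre - half, centre + half

errors, valid = 12, 55  # figures from the post above
total = errors + valid
lo, hi = wilson_interval(errors, total)
print(f"failure rate {errors/total:.1%}, 95% CI {lo:.1%} to {hi:.1%}")
```

Even the lower bound here sits around 10%, so the 9800GT failure rate is clearly distinguishable from the zero-error GT200 cards.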
fractal

Joined: 16 Aug 08
Posts: 87
Credit: 1,248,879,715
RAC: 0
Message 13742 - Posted: 1 Dec 2009, 5:52:14 UTC - in response to Message 13644.  

Things went from almost 100% failure back to 100% success for me on 16-Nov.

I did upgrade the CUDA driver from 190.18 to 190.42 and BOINC from 6.10.13 to 6.10.17 at that time in an attempt to get the machine to run Collatz. Collatz still doesn't like my linux64 machine, but GPUGRID is back to its old stable self. I'm not sure if my changes fixed it or if you did anything, but whoever sacrificed the chicken to Cthulhu has my thanks.

It looks like Cthulhu ate everything he was given and wants more. I am back to a 100% error rate. I looked at the WUs I failed, and others fail them as well.

Should this be taken as a formal announcement that G92 boards are no longer welcome on GPUGRID? Finding G92/Linux-friendly projects is becoming more and more difficult...
fractal

Joined: 16 Aug 08
Posts: 87
Credit: 1,248,879,715
RAC: 0
Message 13756 - Posted: 2 Dec 2009, 6:06:19 UTC - in response to Message 13649.  

Just a cautionary note, this project is single precision heavy, MW is almost all double precision and Collatz is Integer ... so... success on one project does not at all imply that there is not a problem with the hardware side ... all three projects are using different parts of the cards ...

Just something to keep in mind ... and I did see a note elsewhere that someone reverted back to 6.6.x and their GPU Grid failures stopped ...

OK, I will admit finding factual information is hard. Very hard. But newer GPUs like the GT240, based on the GT215 GPU, are compute level 1.2. The only difference between compute level 1.2 and compute level 1.3 that NVIDIA documents is that compute level 1.3 supports double precision.

This begs the question: are GT240s, based on the GT215 chipset, supported by GPUGRID? We all know that GTS250s, based on the G92b chipset, are not, as are many GTX280s based on the G200 chipset, while GTX280s based on the G200b chipset DO work with GPUGRID.

Can boards that are expected to work be defined by their chipset, by their compute level, or by something else? NVIDIA's conventions are hard to understand, but it is clear that G92 is not welcome on GPUGRID, nor is G200. G200b is. Is GT215?
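For reference, the compute-capability split fractal is asking about can be written down as a small lookup. The codenames and capabilities below are per NVIDIA's CUDA documentation (CC 1.3 is the documented threshold for double precision); whether GPUGRID accepts each chip is a separate question:

```python
# Compute capability by GPU codename (per NVIDIA's CUDA docs).
COMPUTE_CAPABILITY = {
    "G92":    (1, 1),  # 8800GT, 9800GTX+, GTS 250
    "GT200":  (1, 3),  # GTX 260/280
    "GT200b": (1, 3),  # GTX 275/285, 55nm GTX 260
    "GT215":  (1, 2),  # GT 240
    "GT216":  (1, 2),  # GT 220
}

def supports_double(codename):
    """True if the chip's compute capability is at least 1.3."""
    return COMPUTE_CAPABILITY[codename] >= (1, 3)

for chip, cc in COMPUTE_CAPABILITY.items():
    print(f"{chip}: CC {cc[0]}.{cc[1]}, double precision: {supports_double(chip)}")
```

Note this cuts against a pure compute-level theory: GT200 (CC 1.3) was reported as problematic in this thread while GT215/GT216 (CC 1.2) were reported as fine.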
Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 13835 - Posted: 8 Dec 2009, 17:26:03 UTC - in response to Message 13756.  

For factual information try here: http://en.wikipedia.org/wiki/GeForce_200_Series
You should note that NVidia's newer cards are not necessarily based on anything in particular, and NVidia's naming system is beyond ridiculous.

The G200 series seems to include G92a, G92b and G96 based cards.
There are 40nm, 55nm and 65nm cores, and the transistor count varies from 0.26 billion to 2.8 billion.
Release dates don't seem to matter much either.

Particularly annoying specs:
GTX 280 cards used a 65nm fabricated core and were usually slower than the GTX 275s.
The older GTX 260s used 65nm, and the GTX 260M used a G92 core.
The GTS 250 uses a 65nm G92 A2 core, but still sort of works here!
The GTS 240 uses a 55nm G92b core - an afterthought, or perhaps a fulfil-contracts card.
The GT 220M uses a 65nm G96M core. I doubt that would work.

The combination of card factors that presently seem important to GPUGrid functionality includes:

Core size: 40nm Good, 55nm OK, 65nm Bad.

GPU codename: GT216 Good, GT215 Good, GT200b Good, GT200 Poor/OK, G92 A2 Poor/OK-ish, G92 Bad. The G90 is no longer compatible.

Memory: DDR3+DDR5 Good, DDR3 a mix of Good and Bad, DDR2 presumably Bad.

Overall performance: a combination of the amount and speed of the cores, shaders, memory, bus width, and other performance factors. Determines whether the card can finish in time.

Temperatures: too hot and it will crash. Depends on the physical architecture of the GPU and computer, use of fans, the GPUGrid work unit, and what else you are crunching...

And not forgetting:
How much use the card has seen, or how close it is to failure, given the card's other factors!
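These rules of thumb could be encoded as a crude scoring table. The ratings below are simply transcribed from this post; nothing here is official project policy, and the helper is hypothetical:

```python
# Crude transcription of the per-factor ratings from this post.
CORE_SIZE = {"40nm": "Good", "55nm": "OK", "65nm": "Bad"}
CODENAME = {
    "GT216": "Good", "GT215": "Good", "GT200b": "Good",
    "GT200": "Poor/OK", "G92 A2": "Poor/OK-ish", "G92": "Bad",
    "G90": "Incompatible",
}

def outlook(codename, core_size):
    """Pair the two per-card ratings from the post (hypothetical helper)."""
    return (CODENAME.get(codename, "Unknown"),
            CORE_SIZE.get(core_size, "Unknown"))

print(outlook("GT215", "40nm"))  # e.g. a GT 240
print(outlook("G92", "65nm"))    # e.g. an 8800GT
```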


©2025 Universitat Pompeu Fabra