Video Card Longevity

Jeremy

Joined: 15 Feb 09
Posts: 55
Credit: 3,542,733
RAC: 0
Message 9286 - Posted: 4 May 2009, 2:39:25 UTC - in response to Message 6136.  

Actually, number 2 might be more of a problem than you might think. The GT200 is a very hardy chip and can take some serious heat. However, I'll be receiving my THIRD GTX 260 (192) tomorrow if UPS cooperates. Two have failed me so far under warranty.

The first failure was unexplainable. It failed while I was away at work. The second, however, I caught. The fan on the video card stopped. Completely. 0 rpm. I noticed artifacts in Crysis and closed out the game. The GPU was at 115 deg C and climbing. I monitor all system temps via SpeedFan. When I discovered the video card fan wasn't spinning, I immediately shut the system down. I let it sit for an hour, then restarted it. The fan spun up properly and everything worked, but the video card failed and refused to POST 2 days later.

I now use SpeedFan's event monitor to automatically watch temps and shut the system down should any of them go higher than what I deem allowable; hopefully this will prevent any future RMA requests. It's not difficult to set up.

Point is, if you're running your system under high loads for a sustained period of time as you do with BOINC or FAH, you really need to keep a close eye on things as much as possible. Automated is best IMHO. If my system had shut down in the middle of my Crysis session I would've been annoyed, but at least my video card wouldn't have cooked itself. I leave my system unattended so often (sleep and work), that it just makes sense to have something there in case something goes awry while I'm away.
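For anyone who would rather script this kind of watchdog than configure SpeedFan's event system, here is a minimal sketch of the same idea in Python. It assumes an NVIDIA card with the nvidia-smi tool on the PATH; the 90 °C threshold and 30-second poll interval are placeholders to adjust for your own hardware, and the shutdown command needs admin/root rights.

#!/usr/bin/env python3
# Minimal GPU temperature watchdog sketch (assumes an NVIDIA card and
# nvidia-smi on the PATH; the 90 C threshold is an arbitrary example).
import subprocess
import sys
import time

THRESHOLD_C = 90      # shut down above this core temperature
POLL_SECONDS = 30     # how often to check

def gpu_temperature() -> int:
    """Return the hottest reported GPU core temperature in degrees C."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    # nvidia-smi prints one line per installed GPU; take the hottest one.
    return max(int(line) for line in out.splitlines() if line.strip())

def shutdown() -> None:
    """Issue a platform-appropriate shutdown command (needs admin rights)."""
    if sys.platform.startswith("win"):
        subprocess.call(["shutdown", "/s", "/t", "0"])
    else:
        subprocess.call(["shutdown", "-h", "now"])

if __name__ == "__main__":
    while True:
        temp = gpu_temperature()
        if temp >= THRESHOLD_C:
            print(f"GPU at {temp} C - shutting the system down")
            shutdown()
            break
        time.sleep(POLL_SECONDS)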

Jeremy
ID: 9286
uBronan

Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Message 9293 - Posted: 4 May 2009, 10:53:18 UTC

Heat is a common problem for the lifespan of any part, and failures are not seldom or rare as many people think; they are actually common.
I have had such a problem with a Gigabyte mainboard which I never got support on; most of these problems are related to overheating of one or more parts.
The board I have is cooled by those fanless heatsink blocks, but what happened with this board is that the manufacturer did not construct the cooling block the right way, causing the northbridge chip casing to melt.
The cooling block only had partial contact with the chip casing, but nobody can or will check these parts until failures appear.
I can show you pictures of this if asked. What most people don't know is that many parts on a mainboard are only cooled by airflow in the case, or are not cooled at all.
Hence the super high temps on these parts; some actually reach 128 °C or hotter. I remember a guy who took pictures with a thermal camera showing the hot spots, mainly on mainboards, at very high temps.
Another discussion is about hard drives and their temps. We think the lower the better, but I have read documents from Google suggesting that it is actually the moderate temps (between 45 and 65 °C) where they fare better and live longer. I also found out myself, when I still worked as an IT person at medium/large companies, that spinning up a drive frequently does more damage than letting it run 24/7. Of course we are talking about enterprise drives here, and not everybody wants to run their PC 24/7. But I always tell people to leave the machine on when they know they will need it again in a few hours.

Anyway, we could go on forever on this topic; I have read hundreds of reports and tests made by friends working with the same huge systems.
In general a computer can always fail, especially when operated outside its parameters for temperature, frequency and/or voltage.
This can occur by intent (OC, test, benchmark) or by failure of parts (breakdown of a fan or the cooling system).
And of course nowadays manufacturers design parts not to last long, but only as long as the lifespan they expect.
ID: 9293
Paul D. Buck

Joined: 9 Jun 08
Posts: 1050
Credit: 37,321,185
RAC: 0
Message 9304 - Posted: 4 May 2009, 18:48:58 UTC - in response to Message 9293.  

Another discussion is about hard drives and their temps. We think the lower the better, but I have read documents from Google suggesting that it is actually the moderate temps (between 45 and 65 °C) where they fare better and live longer. I also found out myself, when I still worked as an IT person at medium/large companies, that spinning up a drive frequently does more damage than letting it run 24/7.

You are seeing the effects of two factors: thermal cycling, which leads to expansion and contraction effects that can induce failures, and inrush currents, which cause other failure modes (I talked about this in my part of the failure mode discussion in the other referenced thread).
ID: 9304
uBronan

Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Message 9379 - Posted: 6 May 2009, 14:13:15 UTC

Yes, and it is nice to see such handy info, because we are always concerned about our hardware. In fact sometimes a little bit too much :)

ID: 9379
Edboard

Joined: 24 Sep 08
Posts: 72
Credit: 12,410,275
RAC: 0
Message 9383 - Posted: 6 May 2009, 17:53:39 UTC
Last modified: 6 May 2009, 17:56:20 UTC

Two of my GTX 280s died crunching 24/7 (GPUGRID/Folding/SETI) with a 16% OC (core and shader clocks only, not memory). They lasted approximately two months each. Since then, I do not OC my GPUs and only crunch about 10 hours/day.
ID: 9383
JockMacMad TSBT

Joined: 26 Jan 09
Posts: 31
Credit: 3,877,912
RAC: 0
Message 9508 - Posted: 9 May 2009, 9:05:01 UTC

I lost an ATI HD4850x2 at stock due to excessive temperatures.

Since then I have bought an AC unit for the room, which cost less than the card.
ID: 9508
uBronan

Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Message 9511 - Posted: 9 May 2009, 9:53:33 UTC
Last modified: 9 May 2009, 9:56:15 UTC

Well, you said it right: excessive temps over longer periods are a disaster, and some cards need better cooling than the manufacturer provides.
So to be honest, if I buy a card I always try to find one which blows the air out of the case.
I also look at the cooling on it and at reviews to see whether the card runs cooler than its competition,
or whether I can find a better solution to cool it, like watercooling or a better heatsink.
Take my previous card, an NVIDIA 6600 GT, which is a notorious hothead (up to 160 °C): I fitted it with a watercooler and got it to 68 °C under stress, and believe me, not many are able to get it that low.
So in both your cases the card probably ran too long at full power without enough airflow (cooling), but to be honest it is almost a science in itself to get optimum airflow in your machine.
Nevertheless I always make sure mine does, and of course I always have huge PC cases; even my HTPC has the biggest case I could find and is full of 12 cm cooling fans :D
All the fan slots have the best fans available (sounding like a little airplane ;)), meaning the best airflow with the least noise.
My main case tops it with a radial fan blowing on the mainboard; in my Stacker case I have my watercooling fans (3 x 12 cm) on top, then 2 x 12 cm on the back blowing out, and on the front another 2 x 12 cm fans blowing inwards onto the 4 drives.
Then of course there is the single fan on the mainboard chipset, the 2 fans in my power supply and the video card fan; all in all enough to drive some people in the house crazy ;)
ID: 9511
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester

Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 9587 - Posted: 10 May 2009, 12:15:17 UTC

Jeremy wrote:
Actually, number 2 might be more of a problem than you might think.


Yes and no.. well, it depends.
The "normal" failure rate of graphics cards seems to be higher than I expected, even without BOINC. Your first card seems to be one of them.

What I mean by type 2 is a sudden failure with no apparent reason (overclock at stock voltage is not a proper reason, as I explained above). What you describe for your 2nd card (fan failed) could well be attributed to this type, but I'd tend to assign the mechanism to type 3: it's heat damage, greatly accelerated compared to the normal decay. Admittedly, when I talk about type 3 I have normal temperatures in mind, i.e. 50 - 80 °C.

uBronan,

a heat sink which is not mounted properly is rather similar to a fan suddenly failing. It's obviously bad and has to be avoided, but what I'm talking about is what happens in the absence of such failures, in a system where the cooling system works as expected.

Regarding the hot spots on mainboards: not all of these components are silicon chips. For example, a dumb inductor coil can tolerate a much higher temperature. And power electronics can be manufactured with much cruder structures, which are therefore much less prone to damage than the very fine structures of current CPUs and GPUs. So the mere fact that a component is 125 °C hot does not necessarily tell you that this is wrong or too bad.

Regarding HDDs: sorry, but temperatures up to 65 °C are likely going to kill the disk!! Take a look, most are specified up to 60 °C. German c't magazine once showed a photo of an old 10k rpm SCSI disk after the fan failed.. some plastic had melted and the entire HDD had turned into an unshapely something [slightly exaggerated].

I also read about this Google study, and while their conclusion "we see lower failure rates at mid 40 °C than at mid 30 °C" is right, it is not so clear what this means. The advantage of their study is that they average over many different systems, so they can gather lots of data. The drawback, however, is that they average over many different systems. There's at least one question which cannot be answered easily: are the drives running at lower temperatures mounted in server chassis with cooling.. in critical systems, which experience a much higher load on their HDDs? There could be more such factors which influence the result and persuade us to draw wrong conclusions when we ignore the heterogeneous landscape of Google's server farm.

What I think is happening: the *old* rule of "lower temp is better" still applies, but in the range of mid-40 °C we are relatively safe from thermally induced HDD failures. Thus other factors start to dominate the failure rates, which may coincidentally seem linked to the HDD temperature, but which may actually be linked to the HDD type / class / usage patterns.

But I always tell people to leave the machine on when they know they will need it again in a few hours.


But don't forget that nowadays all HDDs have fluid-dynamic bearings (I imagine it is quite difficult to do permanent damage to a fluid) and that PC component costs went down, whereas power costs went up, as did PC power consumption. However, thermal cycling is of course still a factor.

Take my previous card, an NVIDIA 6600 GT, which is a notorious hothead (up to 160 °C): I fitted it with a watercooler and got it to 68 °C under stress, and believe me, not many are able to get it that low.


Well, mine ran at ~50 °C idle and ~70 °C load with a silent "NV Silencer" style cooler. And the emergency shutdown is set by the NV driver somewhere around 120-130 °C. Maybe you saw 160 °F mentioned somewhere?

MrS
Scanning for our furry friends since Jan 2002
ID: 9587
uBronan

Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Message 9591 - Posted: 10 May 2009, 15:06:15 UTC
Last modified: 10 May 2009, 15:07:38 UTC

The card which is a real hothead was running at temps exceeding 110 °C (not °F) with the stock fan from the manufacturer, when just gaming... I don't even want to think about that card running a DC project.
For your information, it is not being used anymore, but it still runs fine even after 5 years of use.
And no, the manufacturer sent me an answer to my question about the card running at 110 °C while gaming, saying it could take much higher temps without really failing. If they lied about the temps, or made up a story just to stop me emailing them about it, then of course I can't help that.
And yes, you are partially right about drive temps, but then again I was not talking about a 10k or 15k rpm drive but a plain 7.2k rpm drive.
Read the Google documents about drive temps in large clusters and the failure rates; it is a proven fact, and they also run plain SATA 7.2k drives. Since the high-speed drives fail much faster, and Google needs storage rather than speed, they run SATA/SAS instead of iSCSI or other solutions.
My 2 Seagate boot drives have been running constantly at 65 °C since I bought them about 2 years ago; they have never been cooler. And yes, there is a huge fan blowing cold air on them, which doesn't cool them down much.
The other drives are Samsungs, which run much cooler (37 °C) when not housed together with the Seagates; but since they share the same drive cage they are at 43 °C now.
So the Samsungs help to cool the Seagates.
ID: 9591
Andrew

Joined: 9 Dec 08
Posts: 29
Credit: 18,754,468
RAC: 0
Message 9596 - Posted: 10 May 2009, 16:58:02 UTC - in response to Message 9508.  

@ JockMacMad TSBT

Are you aware that using AC in the same room as your crunching machines may mean that you are paying several times over for that power? I'm not entirely sure about my figures, but basically, if you're dumping, say 100W in an AC'ed room, then I believe the AC unit will require a significant fraction of this to remove the heat (if the heat is being moved to a hotter place as is usual).

Perhaps someone else can provide numbers - I live in the UK where we sadly have no need for AC!
ID: 9596
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester

Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 9601 - Posted: 10 May 2009, 18:50:55 UTC - in response to Message 9596.  

Andrew,
you're right, an AC increases the power bill considerably. For supercomputers and clusters they usually factor in a factor of 2 for the cooling. So in your example a 100W PC in a room with AC will cost 200W in the end.
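As a rough sanity check on that factor of 2 (the COP numbers below are illustrative assumptions, not measurements of any particular unit): an AC moving heat out of a room needs roughly the heat load divided by its coefficient of performance in extra electricity, and fans, air handling and non-ideal conditions push real installations towards the factor of 2.

# Back-of-the-envelope sketch of the AC overhead discussed above.
# The COP (coefficient of performance) values are illustrative assumptions.
def total_wall_power(pc_watts: float, cop: float) -> float:
    """PC power plus the electricity the AC needs to pump that heat outside."""
    return pc_watts + pc_watts / cop

if __name__ == "__main__":
    for cop in (1.0, 2.0, 3.0):
        total = total_wall_power(100.0, cop)
        print(f"COP {cop:.0f}: a 100 W PC costs about {total:.0f} W at the wall")
    # An effective COP of 1 reproduces the 'factor of 2' rule of thumb quoted
    # above; a decent room unit usually does better than that on paper, but
    # real installations carry extra overhead.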

uBronan,
OK, I actually read the paper again.. damn, what am I doing here? Isn't it supposed to be Sunday?! :p
I think we were both a bit wrong. I remembered that they gathered stats from all their drives, which would mainly be desktop drives (IDE, SATA) mixed with some enterprise-class drives. For example, this could have led to a situation where the cheap desktop drives run around 35 °C and fail more often than the enterprise drives, which run hotter due to their higher spindle speeds.

However, this is not the case: they include only 5.4k and 7.2k rpm desktop drives, IDE and SATA. The very important passage:

Overall our experiments confirm previously reported temperature effects only for the high end of our temperature range(*) and especially for older drives. In the lower and middle temperature ranges, higher temperatures are not associated with higher failure rates.
...
We can conclude that at moderate temperature ranges it is likely that there are other effects which affect failure rates much more strongly than temperatures do.


I think that's quite what I've been saying :)

(*) Note that the high end of the temperature spectrum starts at 45°C for them and ends at 50°C. There the error rate rises, but the data quickly becomes noisy due to low statistics (large error bars).

Regarding that 6600 GT.. well, I can't accuse them of lying without further knowledge. They may very well have had some reason to state that you could have seen even higher temps without immediate chip failure. I think those chips were produced on the 110 nm node, which means much larger and more robust structures, i.e. if you move one atom it causes less of an effect.

Here's some nice information: most 6600 GTs run in the 60-80 °C range under load, and there is a statement that 127 °C is the limit where the NV driver does the emergency shutdown.

Do you know what? "You could have seen higher temps" means "emergency shut down happens later". Which is not lying, but totally different from "110°C is fine" :D

MrS
Scanning for our furry friends since Jan 2002
ID: 9601
uBronan

Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Message 9660 - Posted: 11 May 2009, 23:34:47 UTC

Well, lol, yeah, I have read it several times as well.
In my old job I had a cluster of 2100 drives divided over many cabinets, running the huge databases we had; those SCSI drives were also kept at temps near 15 °C.
In fact I also ran some tests for the company on workstations, to see what was the best temperature to maintain them at, but yes, those are all enterprise drives, which are sturdier than normal desktop drives.
And I must add that the newer drives seem to be much weaker than the older drives, probably related to the much finer surface of the platters.
Except for the new glass platters, which seem to be able to operate at higher temps.
ID: 9660
uBronan

Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Message 9704 - Posted: 13 May 2009, 11:59:14 UTC
Last modified: 13 May 2009, 12:00:52 UTC

Sorry ET, after I started up the old beast I saw that I misinformed you about the old card: it is an NVIDIA 6800 GT from Gigabyte.
The 6600 came out a bit later.
ID: 9704
Thamir Ghaslan

Joined: 26 Aug 08
Posts: 55
Credit: 1,475,857
RAC: 0
Message 10318 - Posted: 30 May 2009, 7:22:01 UTC - in response to Message 5638.  

Does anyone have any first hand experience with failures related to 24/7 crunching? Overclocked or stock?

I ask because a posting on another forum indicated failures from stress of crunching 24/7. There was no specific information given, so I don't really think that it was a valid statement.

Anyone?


I bought a GTX 280 in August 2008 and burned it out in March 2009.

So that's 6 months of stock 24/7 crunching on GPUGRID. The fan was set to automatic; I don't know if it would have made a difference if I had set it to a higher manual fan speed. I remember the temperatures were below the tolerable levels.

The relevant thread is here:

http://www.gpugrid.net/forum_thread.php?id=829&nowrap=true#7338

So yes, GPU failures are real; I've seen enough posts from other GPU owners. However, it's very rare to hear of CPU failures. I guess the safeguards on CPUs are more advanced than on GPUs.
ID: 10318
Bigred

Joined: 24 Nov 08
Posts: 10
Credit: 25,447,456
RAC: 0
Message 10319 - Posted: 30 May 2009, 7:50:56 UTC - in response to Message 10318.  

So far, I've had 1 GTX 260 out of 10 fail after 4 months of crunching. The fan bearings were totally worn out. It took 5 weeks to get its replacement. As always, my stuff runs at stock speeds.
ID: 10319
pharrg

Joined: 12 Jan 09
Posts: 36
Credit: 1,075,543
RAC: 0
Message 10381 - Posted: 2 Jun 2009, 15:57:53 UTC
Last modified: 2 Jun 2009, 16:00:07 UTC

I use XFX brand cards since they give a lifetime warranty if you register them. If mine burns out, I just do a replacement with them, though I've yet to have one die anyway. Keeping any video card cool is the other major factor: just like your CPU, the cooler you keep your GPU, the less likely you are to see failures or errors.
ID: 10381
uBronan

Joined: 1 Feb 09
Posts: 139
Credit: 575,023
RAC: 0
Message 10433 - Posted: 5 Jun 2009, 20:29:48 UTC

Looks like my 9600 GT is showing signs of breaking down as well.
I get random errors and saw some weird pixels when booting.
I even tried Folding@home, which ran for a couple of units, but none finished; all errored out.
So I guess video cards die from DC projects.
ID: 10433
popandbob

Joined: 18 Jul 07
Posts: 67
Credit: 43,351,724
RAC: 0
Message 10436 - Posted: 6 Jun 2009, 7:36:12 UTC - in response to Message 10433.  

So I guess video cards die from DC projects.


Saying that will scare others away.
They don't die from doing DC; they would have failed anyway. Yes, if there is a problem with a card it will show up faster when stressed harder, but to claim that projects like GPUGRID kill cards is wrong.

The best safeguard is to buy from good companies who will help solve problems. I've only dealt with EVGA and they've been good to me, but I can't comment on other places.

Bob
ID: 10436
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester

Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 10520 - Posted: 12 Jun 2009, 22:51:54 UTC - in response to Message 10436.  
Last modified: 12 Jun 2009, 22:52:45 UTC

They don't die from doing DC; they would have failed anyway. Yes, if there is a problem with a card it will show up faster when stressed harder, but to claim that projects like GPUGRID kill cards is wrong.


I don't think it's that simple.

Running load instead of idle will typically increase GPU temps by 20 - 30°C, which means a lifetime reduction of a factor of 4 to 8. So if a card fails after a half year of DC we could have expected it to last 2 to 4 years otherwise.
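That factor follows from the common rule of thumb that silicon lifetime roughly halves for every 10 °C of extra temperature (a simplification of the Arrhenius relation; the exact 10 °C step is an assumption, and real activation energies vary). A minimal sketch of the arithmetic:

# Sketch of the lifetime estimate above, assuming lifetime halves for every
# 10 C of additional temperature (rule-of-thumb Arrhenius approximation).
def lifetime_reduction(delta_t_c: float, halving_step_c: float = 10.0) -> float:
    """Factor by which expected lifetime shrinks for a given temperature rise."""
    return 2.0 ** (delta_t_c / halving_step_c)

if __name__ == "__main__":
    for delta in (20, 30):
        print(f"+{delta} C -> lifetime divided by {lifetime_reduction(delta):.0f}")
    # +20 C -> factor 4, +30 C -> factor 8, matching the post: a card that
    # dies after half a year under load might have lasted 2 to 4 years otherwise.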

And if it went into a lower-voltage 2D mode, then degradation would be further reduced without DC. I can't give precise numbers, but I'd go as far as saying "at a significantly reduced voltage, degradation almost doesn't matter any more". So you can kill a card in 6 months which might otherwise have lasted 10 years, most of that time spent in 2D mode.

So, yes, DC only accelerated the failure. However, it turned the card from "quasi-infinite lifetime" into "short lifetime". Which is by all practical means equivalent to "killing it". I sincerely think this is what we have to admit in order to be honest. To us and to our fellow crunchers.

Why are GPUs seeing higher failure rates than CPUs? Easy: CPUs must be specified to withstand 24/7 load. Up to now the GPU manufacturers didn't have to deal with such loads.. after a few days even the hardest gamer needs some sleep. Due to these relaxed load-conditions they specify higher temperatures than the CPU guys. Furthermore the CPU guys have more space for efficient cooling solutions, so their chips don't have to be as hot. GPUs have long hit the power wall, where the maximum clock speed is actually determined by the noise the user can stand with the best cooling solution the manufacturer can fit into 2 slots.

As a long term solution the manufacturers would have to offer special 24/7-versions of their cards: slightly lower clocks, slightly lower voltages, maybe a better cooling solution and the fan setting biased towards cooling rather than noise. Such cards could be used for 24/7 crunching.. but who would buy them? More expensive, slower and likely louder!

MrS

Scanning for our furry friends since Jan 2002
ID: 10520
Daniel Neely

Joined: 21 Feb 09
Posts: 5
Credit: 36,705,213
RAC: 0
Message 10521 - Posted: 12 Jun 2009, 23:38:27 UTC - in response to Message 10520.  

As a long term solution the manufacturers would have to offer special 24/7-versions of their cards: slightly lower clocks, slightly lower voltages, maybe a better cooling solution and the fan setting biased towards cooling rather than noise. Such cards could be used for 24/7 crunching.. but who would buy them? More expensive, slower and likely louder!



Isn't that called the nVidia Tesla?
ID: 10521