WU: NOELIA_INS1P

Jim1348

Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 33677 - Posted: 30 Oct 2013, 12:57:33 UTC
Last modified: 30 Oct 2013, 12:59:56 UTC

On my GTX 660s (Win7 64-bit with 331.58 drivers):

Concerning slow running: that happened once in a while to me, though only on the 660s and never on my GTX 650 Ti. But someone mentioned the old trick of setting the Power Management Mode to "Prefer Maximum Performance" in the Nvidia control panel, and I have not had a problem since. It is a little inconvenient to get to that setting now, since I normally connect my display to the internal Intel graphics adapter, but I used to always set it that way when I was running the monitor directly from the Nvidia card.

Concerning failures: I was getting occasional failures on various Noelias (not necessarily just INS1P), but only on one of my two cards, which was curious since they are supposedly identical. It turns out that the GPU core voltage setting on the one that failed was a little lower than the other, apparently because it was running into a power limit. So using MSI Afterburner, I raised the power limit (to 105%) and raised the voltage a little. That fixed it, and I have had no failures since. It is a truism that the Noelias work your card hard, and if there are any weaknesses, they will find them.
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 33715 - Posted: 2 Nov 2013, 11:33:37 UTC - in response to Message 33677.  

If you want to avoid the reduced power efficiency that comes along with the increased voltage, you could also scale the GPU clock back by 13 or 26 MHz - that should have the same stabilizing effect (but be a little slower and a little more power-efficient).

MrS
Scanning for our furry friends since Jan 2002
Jim1348

Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 33721 - Posted: 2 Nov 2013, 13:05:34 UTC - in response to Message 33715.  
Last modified: 2 Nov 2013, 13:29:46 UTC

If you want to avoid the reduced power efficiency that comes along with the increased voltage, you could also scale the GPU clock back by 13 or 26 MHz - that should have the same stabilizing effect (but be a little slower and a little more power-efficient).

MrS

Actually, I do set back both cards by 10 MHz, but for a different reason. I found that the problem card still had the slowdown on a subsequent Noelia work unit. Then I remembered another old trick that sometimes works to keep the clocks going - let MSI Afterburner control them. It doesn't seem to matter whether you increase or decrease the clock rate from the default, or by what amount. My guess is that it takes control away from the Nvidia software, or whatever they use. At least it has been working for six days now, which is encouraging, if not proof.

But such a small change in clock rate (it is very close to the Nvidia default of 980 MHz anyway) does not make any discernible change in temperature or power consumption as measured by GPU-Z. I would have to make a much larger change than that, which I will do if necessary. I think the chip on that particular card was just weak; when they test them, I am sure they don't run them through anything as rigorous as what we do here.

I also had to bump up the voltage a little more - I started at 25 mV, but that wasn't quite enough, so now it is 37 mV. It has been error-free for a couple of days and three Noelias, but I need more Noelias to test it.
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 33724 - Posted: 2 Nov 2013, 16:01:32 UTC - in response to Message 33721.  

The clock granularity of Keplers is 13 MHz, so you might want to stick to multiples of this. If you don't, the value gets rounded - no problem, unless your change is rounded back to the same clock speed and therefore changes nothing.
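To illustrate that rounding (a minimal sketch: the 13 MHz step is from the post above, while rounding-to-nearest is an assumption about the driver's behaviour):

```python
# Minimal sketch: Kepler clock offsets land on a 13 MHz grid (per the
# post above). Rounding-to-nearest is an assumed model of the driver.
STEP_MHZ = 13

def effective_offset(requested_mhz: int, step: int = STEP_MHZ) -> int:
    """Round a requested clock offset to the nearest multiple of `step`."""
    return round(requested_mhz / step) * step

# A -10 MHz request becomes -13 MHz (so it does change the clock),
# but a -5 MHz request rounds back to 0 and changes nothing.
for req in (-5, -10, -13, -26):
    print(f"requested {req:+d} MHz -> effective {effective_offset(req):+d} MHz")
```

Under this model, Jim's -10 MHz offset would effectively be a one-step (-13 MHz) downclock.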

And you're right, +/-10 MHz has a negligible effect on power consumption. What I was referring to was the increased power consumption from the voltage increase. It's not dramatic either (though larger than what the frequency change causes), but it's something you might not want.
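To put rough numbers on that, here is a first-order sketch using the classic CMOS dynamic-power relation (P proportional to f·V²). The +37 mV bump and the 980 MHz default clock are from Jim's posts; the ~1.05 V baseline core voltage is an assumption for illustration:

```python
# First-order CMOS dynamic power: P ~ f * V^2.
# The +37 mV bump and the 980 MHz default clock are from Jim's posts;
# the 1.05 V baseline core voltage is an assumption for illustration.
BASE_V, BASE_F = 1.050, 980.0     # volts, MHz

def rel_power(f_mhz: float, v: float) -> float:
    """Dynamic power relative to the baseline operating point."""
    return (f_mhz / BASE_F) * (v / BASE_V) ** 2

volt_bump = rel_power(BASE_F, BASE_V + 0.037)    # +37 mV, same clock
downclock = rel_power(BASE_F - 13.0, BASE_V)     # -13 MHz, same voltage

print(f"+37 mV : {100 * (volt_bump - 1):+.1f}% power")   # about +7%
print(f"-13 MHz: {100 * (downclock - 1):+.1f}% power")   # about -1.3%
```

Under these assumptions the voltage bump adds several times more power than a one-step downclock saves, which is the trade-off being described.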

And don't try to shoot for a 0% error rate at GPUGrid - I'm not sure that's actually possible, or what it would depend on (OS, drivers etc.). If you do get occasional errors, it's always a good idea to check whether the WU also fails for your wingmen (who should have crunched it a few days after your attempt).

MrS
Scanning for our furry friends since Jan 2002
Jim1348

Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 33729 - Posted: 2 Nov 2013, 17:43:01 UTC - in response to Message 33724.  

And you're right, +/-10 MHz has a negligible effect on power consumption. What I was referring to was the increased power consumption from the voltage increase. It's not dramatic either (though larger than what the frequency change causes), but it's something you might not want.

And don't try to shoot for a 0% error rate at GPUGrid - I'm not sure that's actually possible, or what it would depend on (OS, drivers etc.). If you do get occasional errors, it's always a good idea to check whether the WU also fails for your wingmen (who should have crunched it a few days after your attempt).

MrS

There is a small effect from the voltage increase thus far, but not that much. The problem card (0) is in the top slot, and runs a couple of degrees hotter than the bottom card (1) even without the boost; typically 68 and 66 degrees C, probably due to air flow from the side fans (I have one of the few motherboards that puts the top card in the very top slot, which then raises the lower card up also). When I raise the voltage, it adds a degree (or less) to that on average. I probably should reverse their slot positions, but it is not that important yet.

But the bottom card has done quite well - no errors in over a week; only the top card has had the errors.
http://www.gpugrid.net/results.php?hostid=159002&offset=0&show_names=1&state=0&appid=
I normally would buy Asus cards for better cooling (the non-overclocked versions), but I needed the space savings of these Zotac cards at the time. Now they are in a larger case, and I can replace them with anything if need be.

The main point for me is that the big problems of a few months ago are past, for the moment.
skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 33743 - Posted: 3 Nov 2013, 13:28:23 UTC - in response to Message 33729.  

The project's tasks have been quite stable of late; the only recent exception was a small batch of WUs that failed quickly. So it's a good time to find out whether you have a stable system or not.

In a system with 2 GPUs, the top GPU is more likely to be the warmest because it's sandwiched between the CPU and the other GPU.

If you have exhaust-cooling GPUs then the side fans would be better blowing into the case. If not, those fans might be better blowing out (but it depends on the case and other fans).

I have two GPUs in one open case. Despite both having triple fans, the top card's temperature was 72 to 73°C (with fans at 95%). I propped up two case fans to blow the air out from their sides (as they vent into the case). This dropped the top card's temperature to around 63°C. It also raised the temperature of the bottom card, but only up to 55°C :)
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help
Jim1348

Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 33744 - Posted: 3 Nov 2013, 14:36:43 UTC - in response to Message 33743.  

If you have exhaust-cooling GPUs then the side fans would be better blowing into the case. If not, those fans might be better blowing out (but it depends on the case and other fans).

I have two 120 mm side fans blowing in, a 120 mm rear fan blowing out, and a top 140 mm fan blowing out (the power supply is bottom-mounted). I think that establishes the airflow over the GPUs pretty well, but you never know until you try it another way; as you point out, it can do strange things. However, the top temperature for the top card is about 70°C, which is reasonable enough. The real limitation on temperature now is probably just the heatsinks/fans on the GPUs themselves.

But my theory of why that card had errors has more to do with the power limit than with temperature per se. It would bump up against the power limit (as shown by GPU-Z), and so the voltage (and/or current) to the GPU core could not increase any further when the Noelias needed it. By increasing the power limit to 105% and raising the base voltage, it can supply the current when it needs it. That particular chip just fell on the wrong end of the speed/power yield curve for number-crunching use, though it would be fine for other purposes. And I can re-purpose it for other use if need be; it just needs to last until the Maxwells come out.
Jacob Klein

Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 33746 - Posted: 3 Nov 2013, 15:07:24 UTC - in response to Message 33744.  
Last modified: 3 Nov 2013, 15:12:26 UTC

The claim is that errors can be caused by "not having enough voltage" or by "having too high of a temperature".

Do we have conclusive proof of this claim? Or is it more of a generalization based on experience? I'm struggling to understand how voltage or temperature can have any effect on error % rates, and would appreciate some guidance.

For me, I have:
- 3 GPUs (eVGA GTX 660 Ti FTW 3GB, eVGA GTX 460 SC, NVIDIA/DELL GTS 240); the 660 Ti and the 460 are both factory overclocked, which I haven't touched
- Intel i7 965XE quad-core hyperthreaded CPU, factory overclocked to 3742 MHz
- 1000-watt power supply (Dell XPS 730X case/system)
- The GPUs run tasks 24/7 (GPUGrid only runs on the 660 Ti and the GTX 460)... alongside a CPU fully loaded with CPU tasks
- Precision-X setting the GTX 660 Ti to a 140% Power Target (so it can upclock to its max boost of 1241 MHz without ever being limited by the 100% power limit)
- Precision-X fan curves set up so that the max GTX 660 Ti fan speed (80%) is reached before the 70°C mark (so I can keep max boost), and the max non-660-Ti speed (100%) occurs at 85°C (I've had no problems with GPUs running that hot) - see the sketch after this list
- System fans set up to help keep the 660 Ti nearly always below 70°C (whereas the default system fan settings would let the 660 Ti climb to 82°C even with the GPU at its max 80% fan, as the system runs quite hot)
- Normal temps for my "fully-loaded 24/7" system:
CPU cores: 77-85°C
GTX 660 Ti: 67-71°C
GTX 460: 66-75°C
GTS 240: 75-80°C
- BOINC/GPUGrid errors that appear to be caused by any sort of hardware problem: NONE
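As a hypothetical illustration of the fan-curve setup described above: the 80%-before-70°C breakpoint is from the post, while the lower breakpoints and the linear interpolation between points are assumptions (Precision-X's actual curve handling may differ).

```python
# Hypothetical fan curve in the spirit of the setup above: hit the 80%
# cap just before 70 C so the card can hold max boost. Breakpoints below
# 68 C and linear interpolation between points are assumptions.
CURVE = [(30, 30.0), (50, 50.0), (68, 80.0), (100, 80.0)]  # (temp C, fan %)

def fan_percent(temp_c: float) -> float:
    """Linearly interpolate the fan speed from the breakpoint table."""
    if temp_c <= CURVE[0][0]:
        return CURVE[0][1]
    for (t0, f0), (t1, f1) in zip(CURVE, CURVE[1:]):
        if temp_c <= t1:
            return f0 + (f1 - f0) * (temp_c - t0) / (t1 - t0)
    return CURVE[-1][1]

for t in (40, 60, 69, 75):
    print(f"{t} C -> {fan_percent(t):.0f}% fan")
```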
Jim1348

Joined: 28 Jul 12
Posts: 819
Credit: 1,591,285,971
RAC: 0
Message 33748 - Posted: 3 Nov 2013, 15:21:53 UTC - in response to Message 33746.  

The claim is that errors can be caused by "not having enough voltage" or by "having too high of a temperature".

Do we have conclusive proof of this claim? Or is it more of a generalization based on experience? I'm struggling to understand how voltage or temperature can have any effect on error % rates, and would appreciate some guidance.

All semiconductor manufacturers create yield curves for their production lots, showing how much voltage/current it takes to achieve a given speed. In general, the more power you supply to the chip, the faster it can be clocked. Of course, it also gets hotter, which can eventually destroy the chip; that is why a power limit is also specified (e.g. 95 watts for some Intel CPUs). But the chips vary, with some able to run fast at lower power and some requiring higher power to achieve the same speeds. You can get errors for a variety of reasons, with temperature being just one. But I have seen errors even below 70°C, so some other limitation may get you first.
GoodFodder

Joined: 4 Oct 12
Posts: 53
Credit: 333,467,496
RAC: 0
Message 33786 - Posted: 6 Nov 2013, 11:31:02 UTC

'New' (old?) 94x4-NOELIA_1MG_RUN4 very long-running (over 24 hrs).

I hope GPUGrid is not returning to these ridiculously large WUs again.

If so, I suspect the volunteer base is going to head downwards - can't they be split up?

Betting Slip

Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Message 33787 - Posted: 6 Nov 2013, 12:13:13 UTC - in response to Message 33786.  
Last modified: 6 Nov 2013, 12:25:26 UTC

'New' (old?) 94x4-NOELIA_1MG_RUN4 very long-running (over 24 hrs).

I hope GPUGrid is not returning to these ridiculously large WUs again.

If so, I suspect the volunteer base is going to head downwards - can't they be split up?

You will struggle with this type of WU because one of your cards has only 1 GB of memory and this Noelia unit uses 1.3 GB (but doesn't use much CPU). It will probably make any computer with a 1 GB card unresponsive.

I agree that the project is shooting itself in the foot by just dumping these WUs on machines that can't chew them.

http://www.gpugrid.net/forum_thread.php?id=3523
wdiz

Joined: 4 Nov 08
Posts: 20
Credit: 871,871,594
RAC: 0
Message 33795 - Posted: 8 Nov 2013, 16:39:16 UTC - in response to Message 33786.  

'New' (old?) 94x4-NOELIA_1MG_RUN4 very long-running (over 24 hrs).

I hope GPUGrid is not returning to these ridiculously large WUs again.

If so, I suspect the volunteer base is going to head downwards - can't they be split up?


Same here, with a GTX 680 or GTX 580 - a very long crunch!
dskagcommunity
Joined: 28 Apr 11
Posts: 462
Credit: 958,266,958
RAC: 28,485
Message 33796 - Posted: 8 Nov 2013, 17:02:52 UTC
Last modified: 8 Nov 2013, 17:05:43 UTC

Oh, it's not only me again... 32 hours... 560 Ti 448-core, 1.28 GB -_-
DSKAG Austria Research Team: http://www.research.dskag.at
skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 33797 - Posted: 8 Nov 2013, 23:49:57 UTC - in response to Message 33795.  
Last modified: 8 Nov 2013, 23:51:15 UTC

My 35x5-NOELIA_1MG_RUN4-2-4-RND8673_0, running on a GTX 660 Ti, is at 34% after 4 h 22 min, so it should complete in about 13 h (Win7 x64).
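That projection is just elapsed time divided by fraction done; a quick sanity check with the numbers from the post:

```python
# Estimate total runtime from progress: total ~ elapsed / fraction_done.
# 4 h 22 min elapsed at 34% done, per the post above.
elapsed_h = 4 + 22 / 60
fraction_done = 0.34

total_h = elapsed_h / fraction_done
print(f"estimated total: {total_h:.1f} h")        # ~12.8 h
print(f"remaining: {total_h - elapsed_h:.1f} h")  # ~8.5 h
```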

If you get a task that has been running too long, check the temps, GPU usage and so on, and do a system shutdown and restart if something looks wrong (e.g. temps too low).

Note that NOELIA_1MG tasks may not be similar to NOELIA_INS1p tasks.
PS. I've noticed that some of NOELIA's tasks now use a full CPU core/thread (but others still don't).
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help
Jeremy Zimmerman

Joined: 13 Apr 13
Posts: 61
Credit: 726,605,417
RAC: 0
Message 33799 - Posted: 9 Nov 2013, 0:21:19 UTC - in response to Message 33797.  

These NOELIA_1MG are about 8-9 hours on a GTX 680 with 2 GB memory:
http://www.gpugrid.net/result.php?resultid=7444885
http://www.gpugrid.net/result.php?resultid=7443692

and around 34 hours on a GTX 460 with 1 GB memory:
http://www.gpugrid.net/result.php?resultid=7440928

The same thing happened at the 768 MB / 1024 MB division in the past; now we move past the 1024 MB minimum for some WUs.
Betting Slip

Joined: 5 Jan 09
Posts: 670
Credit: 2,498,095,550
RAC: 0
Message 33801 - Posted: 9 Nov 2013, 0:51:48 UTC - in response to Message 33797.  

My 35x5-NOELIA_1MG_RUN4-2-4-RND8673_0, running on a GTX 660 Ti, is at 34% after 4 h 22 min, so it should complete in about 13 h (Win7 x64).

If you get a task that has been running too long, check the temps, GPU usage and so on, and do a system shutdown and restart if something looks wrong (e.g. temps too low).

Note that NOELIA_1MG tasks may not be similar to NOELIA_INS1p tasks.
PS. I've noticed that some of NOELIA's tasks now use a full CPU core/thread (but others still don't).

On a GTX 660 Ti with 2 GB of memory there's NO PROBLEM, but this thread is all about cards with less than 2 GB.
skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 33806 - Posted: 9 Nov 2013, 10:15:54 UTC - in response to Message 33801.  
Last modified: 9 Nov 2013, 10:17:29 UTC

The NOELIA_1MG WU I'm presently running is using 1.2 GB of GDDR5, so it wouldn't do well on a 1 GB card.

Cards impacted by this would be anything at or below 1 GB, and possibly other cards under some conditions. This includes:
most versions of the GT 440 and GTS 450, all versions of the GTX 460 and GTX 465,
the GT 545 (GDDR5 version), some GTX 550 Ti's, some GTX 560's and some GTX 560 Ti's,
and some GT 640's, the GTX 645, some GTX 650's and GTX 650 Ti's.

The 1280 MB cards that might be impacted are the GTX 470, GTX 560 Ti 448 and GTX 570.
It would be interesting to know how much GDDR was being used on the different operating systems (XP, Linux, Vista, W7, W8).

Note that where larger-memory versions exist, they tend to be more expensive, so not many people buy them.
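A minimal sketch of the memory check behind this list: the ~1.2 GB working set is from the post above, while the small card list and the allowance for driver/display overhead are illustrative assumptions.

```python
# Flag cards whose VRAM can't comfortably hold a ~1.2 GB NOELIA working
# set. The footprint is from the post above; the small card list and the
# 150 MB driver/display overhead allowance are assumptions.
WU_FOOTPRINT_MB = 1229   # ~1.2 GB
OVERHEAD_MB = 150        # assumed headroom for driver/display use

cards_mb = {
    "GTX 460": 1024,
    "GTX 570": 1280,
    "GTX 560 Ti 448": 1280,
    "GTX 660 Ti": 2048,
}

for name, vram in cards_mb.items():
    fits = vram >= WU_FOOTPRINT_MB + OVERHEAD_MB
    print(f"{name} ({vram} MB): {'should fit' if fits else 'likely too tight'}")
```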
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help
Dagorath

Joined: 16 Mar 11
Posts: 509
Credit: 179,005,236
RAC: 0
Message 33812 - Posted: 9 Nov 2013, 13:18:58 UTC - in response to Message 33806.  

It would be interesting to know how much GDDR was being used on the different operating systems (XP, Linux, Vista, W7, W8).


I'm not sure if we're talking about the same error here, but I had potx234-NOELIA_INS1P-12-14-RND6963_0 fail on my 660 Ti with 3 GB of memory on Linux, driver 331.17; more details here.

That task also failed on this host (1 GB, Linux, 560 Ti, driver unknown) but succeeded on this host (2 GB, Win7, 2 x 680).

I've had 4 other Noelias run successfully on my 660 Ti on Linux.

BOINC <<--- credit whores, pedants, alien hunters
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 33821 - Posted: 10 Nov 2013, 14:38:07 UTC - in response to Message 33748.  

The claim is that errors can be caused by "not having enough voltage" or by "having too high of a temperature".

Do we have conclusive proof of this claim? Or is it more of a generalization based on experience? I'm struggling to understand how voltage or temperature can have any effect on error % rates, and would appreciate some guidance.

All semiconductor manufacturers create yield curves for their production lots, showing how much voltage/current it takes to achieve a given speed. In general, the more power you supply to the chip, the faster it can be clocked. Of course, it also gets hotter, which can eventually destroy the chip; that is why a power limit is also specified (e.g. 95 watts for some Intel CPUs). But the chips vary, with some able to run fast at lower power and some requiring higher power to achieve the same speeds. You can get errors for a variety of reasons, with temperature being just one. But I have seen errors even below 70°C, so some other limitation may get you first.


Hi Jacob - I suppose you wouldn't mind me going a bit deeper?

To make a transistor switch (at a very high level), you apply a voltage, which in turn pulls electrons through the channel (or "missing electrons", aka holes, in the other direction). This physical movement of charge carriers is needed to make it switch, and it takes some time, which ultimately limits the clock speeds a chip can reach. This is where temperature and voltage must be considered.

The voltage is a measure of how hard the electrons are pulled, or how quickly they're accelerated. That's why the maximum achievable (error-free) frequency scales approximately linearly with voltage.
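As a rough model (the alpha-power law below is a standard textbook approximation, not something from the thread): the delay of a gate falls as the supply voltage rises above the threshold voltage, and the maximum error-free clock is its inverse,

$$ t_d \propto \frac{V_{dd}}{(V_{dd} - V_{th})^{\alpha}}, \qquad f_{\max} \propto \frac{1}{t_d}, $$

with alpha roughly between 1 and 2 for modern processes; over the small voltage range spanned by a typical overclock this comes out close to the linear scaling described above.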

Temperature is a measure of the vibrations of the atomic lattice. Without any vibrations the electrons wouldn't "see" the lattice at all: the atoms (in a single crystal) form a perfectly periodic potential landscape through which the electrons move as waves. If this periodic structure is disturbed (e.g. by the random fluctuations present at any temperature above 0 K), the electrons scatter off these perturbations. This slows their movement down and heats the lattice up (as in a regular resistor).

In a real chip there are chains of transistors which all have to switch within each clock cycle; in CPUs, each stage of the pipeline is such a domain. If individual transistors switch too slowly, the computation result will not yet have reached the output stage of that domain when the next clock cycle is triggered. The old result (or something in between, depending on how the result is composed) will be used as the input for the next stage, and a computation error will have occurred. That's why timing analysis is so important when designing a chip - the slowest path limits the overall clock speed the chip can achieve.
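A toy version of that timing constraint (the stage delays are invented numbers; real static timing analysis covers millions of paths):

```python
# Toy static-timing picture: the slowest pipeline stage sets the maximum
# clock. Per-stage worst-path delays (in ns) are invented numbers.
stage_delays_ns = [0.71, 0.93, 0.88, 1.02]

f_max_ghz = 1.0 / max(stage_delays_ns)   # clock period must cover the worst stage
print(f"worst path {max(stage_delays_ns)} ns -> f_max ~ {f_max_ghz:.2f} GHz")

# If heat, low voltage or degradation slows every path by 5%, f_max drops;
# clocking above it latches a stage's output before the result is ready.
slowed_ns = [d * 1.05 for d in stage_delays_ns]
print(f"5% slower -> f_max ~ {1.0 / max(slowed_ns):.2f} GHz")
```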

Putting it all together, it should now be clearer how increased temperature and too low a voltage can lead to errors. And to get a bit closer to reality: the real switching speed of each transistor is affected by many more factors, including fabrication tolerances, non-fatal defects (which also scatter electrons and hence slow them down), and defects developed from operating the chip under prolonged load (at high temperature and voltage).

At this point I can hand over to Jim: the manufacturer profiles their chips and determines proper working points (clock speed and voltage at the maximum allowed temperature). Depending on how carefully they do this (e.g. Intel usually allows for plenty of headroom, whereas factory-OC'ed GPUs have occasionally been set too aggressively), things either work out just fine... or the end user sees calculation errors. Mostly these will only appear under unusual workloads (which weren't tested for), after significant chip degradation, or just due to bad luck that wasn't caught by the initial IC error testing (which is rare, luckily). Hope this helps :)

MrS
Scanning for our furry friends since Jan 2002
Jacob Klein

Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 33823 - Posted: 10 Nov 2013, 16:18:22 UTC
Last modified: 10 Nov 2013, 16:19:50 UTC

It does help - thank you very much for the detailed explanation. I've read through it once, and I'll have to read it a few more times for it to sink in. I actually studied Computer Engineering for a few years before switching to a Bachelor's degree in Computer Information Systems.

But, I still don't quite understand one other thing.

When you overclock a CPU too far, you usually get a BSOD (presumably because the execution pointer is off in no-man's land, or because the data got jacked up in the pipeline, or both), right? But what about going "too far" on a GPU?

The scenario I'm looking to define better is: overclocking or overheating a GPU far enough to cause GPUGrid problems, but not far enough to cause Windows problems or driver resets. For these BOINC computation errors to occur, that scenario must exist, right? So why doesn't Windows catch this and explain the error to the user?