The hardware enthusiast's corner

Author	Message
Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 2	Message 56686 - Posted: 24 Feb 2021, 18:23:41 UTC
Is there a hardware doctor in the house? This one isn't looking too happy, so while we have a pause, I'll pull it out and take a look.
The same card, a few minutes apart. Both fans have been bouncing around for a few days. Fan % has been stable at 88%. GPU clock and power follow the fans - apply a bit of fan, the card speeds up and draws more power.
It's an Asus GTX 1660 Super Dual Evo, under warranty. The supplier says Asus is likely to take it back and refund the price, but there are no new ones in the UK to buy with the proceeds. I can check for dust bunnies etc., but if I go much further it'll probably void the warranty and break the card completely.
It's reached 96% of a D3RBandit task (slowly!), and it's displaying a perfectly clear image on my 1920x1200 monitor, so it can't be too far gone. Suggestions?

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 347,555	Message 56687 - Posted: 24 Feb 2021, 18:56:06 UTC - in response to Message 56686.
It’s definitely thermal throttling. And the fans look really iffy. Are they actually spinning when it says they aren’t? At those temps the fans should be at 100%.
Do you have an option to send the card for RMA (repair or replacement)? Usually in situations like this the warranty will either fix the card or replace it with an equal product, not just refund the purchase price. At least that’s been my experience in the US with ASUS. I would look up the RMA process for the U.K. on ASUS’s website.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 2	Message 56688 - Posted: 24 Feb 2021, 19:19:51 UTC
In the UK, the retailer is given the legal responsibility for sorting all that out. In this case, it's a local independent system builder / gamer hangout / business support / trade counter: I've used them for decades, and they have a good reputation. No problems there.
At the moment, I'm leaving the box undisturbed under my workbench until this task is finished - I haven't eyeballed the fans yet. It's moved on two checkpoints since I posted, so I may get a chance to look later tonight - otherwise tomorrow morning. The machine also has a GTX 1650, so I'll move that to the 16x slot: it can keep crunching smaller jobs while I think about it.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 347,555	Message 56689 - Posted: 24 Feb 2021, 19:44:27 UTC - in response to Message 56688. Last modified: 24 Feb 2021, 19:49:15 UTC
While the retailer has the legal responsibility to sort it out for you, do you have the CHOICE to instead send it directly to ASUS for RMA?
I do believe the problem is likely with the fans. The % is just what it's set to, but the tach reading shows how fast it's actually spinning. And in the case of your two pics, it's indicating that the fans aren't spinning or are only spinning intermittently. Bumping the fan percentage is probably providing enough power to the motor to overcome some stiction, allowing the fans to spin up and cool the GPU below the thermal throttling limit. That lets the GPU increase its clocks, which is likely why you see the increase in power consumption.
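
The distinction drawn above, between the commanded fan % and what the tachometer actually reports, can be sketched as a small check over logged readings. This is only an illustration: the `FanSample` type, the function names and both thresholds are hypothetical, and the threshold values are not taken from ASUS, NVIDIA or GPU-Z.

```python
from dataclasses import dataclass


@dataclass
class FanSample:
    commanded_percent: float  # the "Fan %" value the card is asked to run at
    tach_rpm: float           # the speed the tachometer actually reports


def classify_fan(sample: FanSample,
                 min_commanded_percent: float = 30.0,
                 stall_rpm: float = 200.0) -> str:
    """Return 'idle', 'stalled' or 'spinning' for one reading.

    Both thresholds are illustrative assumptions, not vendor values.
    """
    if sample.commanded_percent < min_commanded_percent:
        return "idle"      # the fan is not really being driven
    if sample.tach_rpm < stall_rpm:
        return "stalled"   # driven hard, but the tach says it is not turning
    return "spinning"


def stalled_fraction(samples: list[FanSample]) -> float:
    """Fraction of driven readings in which the fan was not turning."""
    driven = [s for s in samples if classify_fan(s) != "idle"]
    if not driven:
        return 0.0
    stalled = sum(1 for s in driven if classify_fan(s) == "stalled")
    return stalled / len(driven)
```

A reading such as `FanSample(88, 0)` (88% commanded, no tach signal) is classified as stalled, which is the pattern described for the two screenshots. A fan that starts and stops would show up as a stalled fraction somewhere between 0 and 1.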

ServicEnginIC Joined: 24 Sep 10 Posts: 593 Credit: 12,147,686,510 RAC: 4,315,110	Message 56691 - Posted: 24 Feb 2021, 20:29:58 UTC - in response to Message 56686.
I like the monitoring tool you're showing pics of very much. Thank you! It is very complete and comprehensive. I didn't know about it, since lately I'm crunching under Ubuntu Linux most of the time. There I use the Psensor utility. I've made a note to test it on my dual-boot Linux/Windows 10 host the next time I enter Windows for my periodic update schedule.
Regarding the problem itself, I agree with you both that the fans are not working as expected for those GPU temperatures and that Fan % setting.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 2	Message 56692 - Posted: 24 Feb 2021, 21:33:27 UTC
Well, the plucky little guy has crunched what may very well be its last WU - task 32541017. Valid, and no reported errors: though the 3 day+ runtime will mess up rod's statistics!
I've pulled the card, and the host is flying again on one wing. No obvious signs of damage, and the fans turn to the finger without stiffness. I couldn't see them in situ because the second card was too close.
Yes, ASUS do have a UK centre with a direct RMA procedure. But it handles so many classes of equipment it's hard to navigate, and when you get there it says both barcodes aren't valid. And neither matches the serial number on the invoice. I'll leave it for tonight, and try again tomorrow.

mmonnin Joined: 2 Jul 16 Posts: 339 Credit: 7,990,341,558 RAC: 3,629	Message 56693 - Posted: 24 Feb 2021, 22:00:13 UTC
Pull the cooler off to see if there is good contact. Reapply paste, etc. Be careful of any thermal pads on the GDDR.
"Warranty void if removed" stickers are illegal. https://www.ifixit.com/News/11748/warranty-stickers-are-illegal#:~:text=Most%20consumers%20don't%20know,language%20of%20your%20warranty%20says.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 347,555	Message 56694 - Posted: 24 Feb 2021, 22:30:17 UTC - in response to Message 56693.
"Pull the cooler off to see if there is good contact. Reapply paste, etc. Be careful of any thermal pads on GDDR. 'Warranty void if removed' stickers are illegal. https://www.ifixit.com/News/11748/warranty-stickers-are-illegal#:~:text=Most%20consumers%20don't%20know,language%20of%20your%20warranty%20says."
They’re illegal in the US. But maybe not in the U.K., where Richard is located.
I still think it’s a fan issue, since the % shows 88% but the tach wasn’t showing that the fan was actually spinning.

Keith Myers Joined: 13 Dec 17 Posts: 1423 Credit: 9,188,446,190 RAC: 1,336,521	Message 56696 - Posted: 25 Feb 2021, 0:52:51 UTC - in response to Message 56691. Last modified: 25 Feb 2021, 0:55:29 UTC
While it isn't as snazzy looking as GPU-Z, you do have a pretty good looking GPU monitoring GUI application called gpu-mon available in Linux. It is part of the gpu-utils suite by Ricks-Lab. https://github.com/Ricks-Lab/gpu-utils

ServicEnginIC Joined: 24 Sep 10 Posts: 593 Credit: 12,147,686,510 RAC: 4,315,110	Message 56698 - Posted: 25 Feb 2021, 6:16:23 UTC - in response to Message 56696.
"While it isn't as snazzy looking as GPU-Z, you do have a pretty good looking gpu GUI monitoring application called gpu-mon available in Linux. It is part of the gpu-utils suite by Ricks-Lab."
Certainly, it is very attractive for close monitoring of the GPU. Thank you very much.

bozz4science Joined: 22 May 20 Posts: 110 Credit: 115,525,136 RAC: 0	Message 56699 - Posted: 25 Feb 2021, 15:29:52 UTC Last modified: 25 Feb 2021, 15:44:16 UTC
I am also in need of a GPU diagnosis. Same symptom as before. WU downloaded, computation starts successfully, then randomly during the run I sometimes hear the fans suddenly spinning down (for no apparent reason), and looking at GPU-Z, the GPU just stops computing... Almost as if it is just exhausted after a couple of hours without any interruption and needs some rest. If I pause/suspend the WU and then restart from the last checkpoint, it immediately starts computing again. Today I let it rest for 20 min, as I was curious if it’d start again on its own, but unfortunately it didn't. Continuously suspending/unsuspending feels like it cannot be my remedy forever.... Kind of bummed with this.
[Image: GPU suddenly stops]
[Image: GPU starts back up again after suspension/restart]

Keith Myers Joined: 13 Dec 17 Posts: 1423 Credit: 9,188,446,190 RAC: 1,336,521	Message 56700 - Posted: 25 Feb 2021, 18:05:23 UTC
Some sort of Windows sleep/hibernate/idle detection going on? Windows doesn't see any mouse/keyboard activity and so idles the GPU? Some other Windows monitoring software idling the GPU?

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 347,555	Message 56701 - Posted: 25 Feb 2021, 18:43:34 UTC - in response to Message 56700.
"Some sort of Windows sleep/hibernate/idle detection going on? Windows doesn't see any mouse/keyboard activity and so idles the gpu? Some other Windows monitoring software idling the gpu?"
In addition, maybe some driver wonkiness? With strange issues like this I always recommend wiping out the drivers (use DDU on Windows, from Safe Mode) and doing a full re-install with the package from Nvidia, not allowing Windows to install drivers automatically.
Also, what are the BOINC compute settings? Are you giving it 100% CPU "time"? Do you have anything set up to pause the GPU crunching? Perhaps a setting to pause when the computer is in use? Or an exclusive app setting that stops BOINC when some other app is running? Are you running other projects? Is BOINC switching the computation to another project? All things you should check off the list.
Something strange that caught my eye when looking at his pics was that when the GPU load stops, the PCIe bus load shoots up, and vice versa when computation restarts. I'm having a hard time explaining what might cause that.

ServicEnginIC Joined: 24 Sep 10 Posts: 593 Credit: 12,147,686,510 RAC: 4,315,110	Message 56702 - Posted: 25 Feb 2021, 19:06:07 UTC - in response to Message 56699.
"WU downloaded, computation starts successfully, randomly during the run I sometimes hear fans suddenly spinning down (for no apparent reason) and looking at GPU-Z, the GPU just stops computing..."
I recently noticed similar behavior while experimenting with overclocking on my reworked GTX 1660 Ti graphics card, as described in previous posts. When power consumption reached the rated TDP and I tried GPU clock frequencies above a certain limit, the GPU clock suddenly dropped to a minimum value of 300 MHz. The only solution was a system restart, followed by trying lower frequencies. At borderline frequencies this didn't happen immediately: sometimes after a few seconds, and sometimes after a few minutes.
As this system is running under Linux, and yours under Windows, I guess that there might be some kind of self-protection built into the graphics card firmware, and therefore OS independent.
As can be seen in the GTX 1660 SUPER specifications, the stock Boost clock for this GPU is 1785 MHz. In your images, your GPU clock frequency at full performance reaches 1950 MHz, quite a bit higher than that...
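
A stall of the kind described in the last two posts (GPU load falling to nothing, or the clock dropping to about 300 MHz) can be timestamped with a small logger, so that it can later be matched against BOINC events or other software activity. The following is a minimal sketch, assuming an NVIDIA driver that ships `nvidia-smi`. The function names and the two thresholds are hypothetical, chosen only for illustration.

```python
import subprocess
import time

# Fields requested from nvidia-smi, in this order.
QUERY_FIELDS = "timestamp,clocks.sm,utilization.gpu,temperature.gpu"


def read_gpu_lines() -> list[str]:
    """Ask nvidia-smi for one CSV line per GPU (needs the NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip().splitlines()


def parse_line(line: str) -> dict:
    """Split one CSV line into named, numeric readings."""
    timestamp, sm_clock, util, temp = [part.strip() for part in line.split(",")]
    return {
        "timestamp": timestamp,
        "sm_clock_mhz": float(sm_clock),
        "gpu_util_percent": float(util),
        "temp_c": float(temp),
    }


def looks_collapsed(sample: dict,
                    floor_clock_mhz: float = 350.0,
                    idle_util_percent: float = 5.0) -> bool:
    """True if the GPU sits at its floor clock or does almost no work.

    Both thresholds are assumptions picked for illustration.
    """
    return (sample["sm_clock_mhz"] <= floor_clock_mhz
            or sample["gpu_util_percent"] <= idle_util_percent)


def watch(interval_s: float = 5.0, max_samples: int = 720) -> None:
    """Print a line whenever any GPU looks collapsed while a task should be running."""
    for _ in range(max_samples):
        for index, line in enumerate(read_gpu_lines()):
            sample = parse_line(line)
            if looks_collapsed(sample):
                print(f"GPU {index} collapsed at {sample['timestamp']}: {sample}")
        time.sleep(interval_s)
```

Calling `watch()` while a GPUGrid task is running prints one line per suspicious reading. The check only makes sense while a task is expected to be computing, since an idle GPU trips it as well.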

ServicEnginIC Joined: 24 Sep 10 Posts: 593 Credit: 12,147,686,510 RAC: 4,315,110	Message 56703 - Posted: 25 Feb 2021, 19:08:23 UTC - in response to Message 56701.
"something strange that caught my eye when looking at his pics, was that when the GPU load stops, the PCIe bus load shoots up, and vice versa when computation restarts. I'm having a hard time explaining what might cause that."
I wonder about the same.

bozz4science Joined: 22 May 20 Posts: 110 Credit: 115,525,136 RAC: 0	Message 56704 - Posted: 25 Feb 2021, 20:39:35 UTC Last modified: 25 Feb 2021, 20:49:04 UTC
A bit of background that might add some value to exploring this further. The monitor is connected to the 1660 Super card, which is the primary one as opposed to the 750 Ti. The card was overclocked from when I initially deployed the system here up until a week ago, when Ian&Steve advised me to scale it back or even forgo it altogether. So I guess this is not the issue with my card, ServicEnginIC. It is only running at moderate clock speeds (+40 MHz core / +200 MHz mem) as opposed to the ~125 MHz / +250 MHz offset on the core/mem clock on other projects. Additionally, I adjusted the thermal limit down to 67 C to keep the fans down, as I am sitting directly next to the computer.
Never experienced this problem on other projects except for MLC (BOINC project) and F@H. Those problems occurred around the same time as I installed the CUDA development toolkit and CUDA runtime back in January, when I tinkered a bit with NN programming with Keras. After this caused said issues across multiple projects, I uninstalled it as soon as I finished my project.
Compute settings are set to 100% CPU time, keeping tasks in memory upon suspension, no exclusive applications set to date, suspend computation if an application uses > 65% of CPU (programming stuff mainly). The last aspect is however completely unrelated to my issue, as every time this happened, I was not running any application at or over the defined threshold and all other running WUs were crunching along just fine.
I am beginning to suspect that this might be the culprit here. Maybe a clean install of the latest driver is the way to go... I'll tackle this as soon as my GPUGrid tasks finish up. Might the high PCIe bus load be caused by the CPU trying to take over and continue computing the GPU task?
If this doesn't solve it, I'll continue looking into the possibility of some software or Windows setting putting the GPU into hibernation mode, like Keith suggested. Thx for sharing your thoughts with me!

Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0	Message 56705 - Posted: 26 Feb 2021, 13:11:17 UTC - in response to Message 56704.
"While the card was overclocked from when I initially deployed the system here up until a week ago when Ian&steve advised to scale it back or even forfeit it altogether. So I guess this is not the issue with my card ServicEnginIC. It is only running at moderate clock speeds (+40 MHz Core / + 200 MHz mem)"
You should revert to the factory overclock, or even set lower frequencies / power limit. Overclocking the GPU memory is highly not recommended for GPUGrid.
"as opposed to ~125 MHz/+250 MHz offset on the core/mem clock on other projects."
Stable GPU overclocking achieved with other projects is deceitful. 90% GPU usage by the GPUGrid app results in higher GPU power draw at a lower GPU frequency than 90% GPU usage by other projects. Therefore you should not consider the apps of other projects as a reference for calibrating a GPU overclock for GPUGrid.
"Additionally, I adjusted the thermal limit down to 67 C to keep the fans down as I am directly sitting next to the computer."
Lowering the thermal limit would increase the fan speed (at the same GPU frequencies/voltages). Or maybe I'm missing something?
"Never experienced this problem on other projects except for MLC (Boinc project) and F@H."
That doesn't matter.
"Those problems occurred around the same time when I installed the CUDA development toolkit and CUDA runtime back in January, when I tinkered a bit with NN programming with Keras. After this caused said issues across multiple projects, I uninstalled it as soon as finished my project."
A full uninstall is recommended in this case. Download the latest driver, download DDU, then disable the networking on your PC and start DDU (it will restart in safe mode, do the cleaning, then restart in normal mode), then install the latest driver, and lastly enable the networking.
"Maybe a clean install of the latest driver is the way to go..."
Do not set any overclocking after you have installed the latest driver. Let it crunch 10 tasks, then you can try increasing GPU clocks.
"Might the high PCIe bus load be caused by the CPU trying to take over and continue computing the GPU task?"
The GPUGrid app is constantly polling the GPU. When the GPU is in its normal state, it will return some subresult to the CPU. The CPU does some Double Precision calculations with it, then puts it back to the GPU. When the GPU locks up, it doesn't return anything, so the polling is repeated at a much higher rate, which results in higher PCIe bus load.
"If this won't solve it, I'll continue looking into the possibility of some software or Windows setting putting the GPU into hibernation mode like Keith suggested."
It's not Windows, it's the overclocking, or the interference of some GPU tool/app with the GPUGrid app.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 347,555	Message 56706 - Posted: 26 Feb 2021, 13:45:29 UTC - in response to Message 56705. Last modified: 26 Feb 2021, 14:28:04 UTC
"Overclocking the GPU memory is highly not recommended for GPUGrid."
I've not had a single issue with mild overclocks on GPU memory with GPUGrid. Personally I only overclock the memory to the default P0 state clocks. On Turing this is +400 MHz. This has never caused an issue across many, many GPUs, and I know Keith OCs his memory even further without issue. The OP doesn't appear to be even pushing the clocks that far, so I doubt this is an issue for him, unless there is something defective with the GPU hardware.
"Lowering the thermal limit would increase the fan speed (at the same GPU frequencies/voltages). Or maybe I'm missing something?"
You're missing something. When overclocking newer Nvidia GPUs in Windows, you can set thermal limits for the overclock with certain software. It will limit the clock speeds based on temperatures. Fan speeds are ONLY controlled by the fan curves that are set, whether it be the default or a custom user curve.
"The GPUGrid app is constantly polling the GPU. When the GPU is in normal state, it will return some subresult to the CPU. The CPU does some Double Precision calculations with it, then puts it back to the GPU. When the GPU locks up, it doesn't return anything, so the polling is repeated at a much higher rate, which results in higher PCIe bus load."
Do you know for a fact that the application operates this way? I mean in regard to the shuffling of data to the CPU for DP processing. That sounds like a waste of resources when any GPU that is capable of processing GPUGRID tasks is also capable of DP processing. It would make a lot more sense to have the GPU do that, and it would be faster. Do you have information from the devs about this? Can you link to it please?
Polling certainly is the reason for the CPU thread being pegged at 100% for each GPU task, as you see this with most Nvidia/CUDA loads in other projects, but only GPUGRID has the high PCIe bus use. And from my experience, PCIe bus load is only high when computation is actually happening (at least under Linux, I can't really attest to PCIe load under the Windows app). I see about 25% PCIe bus load on a PCIe 3.0 x16 link, or around 40% on a PCIe 3.0 x8 link, and 80-90% on a 3.0 x4 link. My interpretation of the PCIe bus use for GPUGRID is that it's constantly reading data off the disk into memory and sending it over to the GPU, or constantly swapping data between the system memory and the GPU, across the PCIe bus. It's clear that the app doesn't cache all the necessary data for each task, since very little GPU memory is used. And certain beta tasks that have popped up in the past, which used a large portion of GPU memory, also saw a huge reduction in PCIe use.
But this is all contrary to what the OP is seeing: he's seeing low PCIe use during computation, and high use when not. That's completely opposite to how these tasks usually operate.
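
The bus-load percentages quoted above can be turned into rough absolute traffic figures. This is a back-of-the-envelope sketch only. It assumes that the reported "bus load" is a fraction of the link's theoretical one-way bandwidth, which monitoring tools do not guarantee, and it ignores protocol overhead. The per-lane figure follows from PCIe 3.0's 8 GT/s signalling with 128b/130b encoding. The function names are hypothetical, and 85% stands in for the quoted 80-90% range.

```python
# PCIe 3.0: 8 GT/s per lane, 128b/130b encoding, 8 bits per byte.
PCIE3_GBPS_PER_LANE = 8.0 * (128 / 130) / 8  # about 0.985 GB/s, one direction


def link_bandwidth_gbps(lanes: int) -> float:
    """Theoretical one-way bandwidth of a PCIe 3.0 link, in GB/s."""
    return lanes * PCIE3_GBPS_PER_LANE


def traffic_gbps(lanes: int, bus_load_percent: float) -> float:
    """Absolute traffic implied by a bus-load percentage on that link."""
    return link_bandwidth_gbps(lanes) * bus_load_percent / 100.0


# Loads quoted in the post; 85 is the midpoint of "80-90%".
observations = {16: 25.0, 8: 40.0, 4: 85.0}

for lanes, load in observations.items():
    print(f"x{lanes}: {load:.0f}% of {link_bandwidth_gbps(lanes):.2f} GB/s "
          f"= {traffic_gbps(lanes, load):.2f} GB/s")
```

Under those assumptions all three links carry a similar amount of data, roughly 3.2 to 3.9 GB/s. That is what a task moving a fairly constant stream of data during computation would look like, rather than traffic that scales with link width.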

bozz4science Joined: 22 May 20 Posts: 110 Credit: 115,525,136 RAC: 0	Message 56707 - Posted: 26 Feb 2021, 13:57:48 UTC - in response to Message 56706. Last modified: 26 Feb 2021, 14:03:11 UTC
"Overclocking the GPU memory is highly not recommended for GPUGrid"
My motivation for this was just to reduce the memory clock penalty that NVIDIA applies in the P0 state whenever the card is used for CUDA-enabled applications, which was discussed here some time ago.
"Stable GPU overclocking achieved with other projects is deceitful. ...Therefore you should not consider the apps of other projects as a reference of calibrating GPU overclock for GPUGrid."
I guess you are most certainly right on this point, Zoltan!
"A full uninstall is recommended in this case."
That's what my gut feeling is telling me. I will probably look into the DDU tool as soon as my task pipeline has emptied. Very interesting to hear about the app's inner workings and the GPU polling by the CPU. Never thought about it this much before.
"do you know for a fact that the application operates this way?"
Would be highly interested to hear more about this!
"It's not the Windows, it's the overclocking, or the interference of some GPU tool/app with the GPUGrid app"
I still believe it's the latter. I'll be smarter in a few days, when I can analyse whether a clean install of the driver solved the issue.
"the OP doesn't appear to be even pushing the clocks that far, so I doubt this is an issue for him."
I am 100% with you on this one. Especially as I have intermittently been running at full stock speeds and the same issue occurred.
"he's seeing low PCIe use during computation, and high use when not. That's completely opposite how these tasks usually operate."
I am getting ever more confused .... I'll see if reinstalling drivers will do the trick for me.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 2	Message 56708 - Posted: 26 Feb 2021, 14:36:39 UTC - in response to Message 56692.
"ASUS do have a UK centre with a direct RMA procedure. But it handles so many classes of equipment it's hard to navigate, and when you get there it says both barcodes aren't valid. And neither matches the serial number on the invoice. I'll leave it for tonight, and try again tomorrow."
Couldn't get the ASUS site to co-operate, so went the reseller route. Phone call: they looked up the order number in seconds, confirmed warranty status, and issued an RMA without quibble. And emailed me a label for courier collection, valid at my local convenience store.
So it's up to Asus now. I'm not holding my breath that the rest of the procedure will be so slick.