Opinions please: ignoring suspend requests

skgiven (Volunteer moderator, Volunteer tester)
Message 33251 - Posted: 29 Sep 2013, 10:26:44 UTC - in response to Message 33233.  

Raymond, I agree with the suggestions made by MrS, and reducing the power target is certainly worth a go.

Your issue is most likely with the card, power supply or cooling when running the GPUGrid app, but there are some other possibilities.

This behavior occurs when Asteroids@home, PrimaBoinca or FightMalaria@home are the CPU projects. This is a dedicated cruncher with no other work being processed.

Any ideas are appreciated. Thank you.

Does this 'start-stop' behavior occur when you are not using the CPU, or when you run different CPU projects (other than those you mention)?

A 12-thread processor that's maxed out could conceivably struggle to feed a Titan. Have you tried setting BOINC to use 10 threads (83.4%)?
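If you'd rather do that locally than through the web preferences, a global_prefs_override.xml in the BOINC data directory does the job. A minimal sketch in Python, assuming the stock BOINC override mechanism; the data-directory path is the usual Windows default, adjust to your install:

    from pathlib import Path

    # 83.4% of a 12-thread CPU leaves 10 threads for BOINC's CPU tasks.
    OVERRIDE = """<global_preferences>
      <max_ncpus_pct>83.4</max_ncpus_pct>
    </global_preferences>
    """

    # Usual BOINC data directory on recent Windows; adjust to your install.
    Path(r"C:\ProgramData\BOINC\global_prefs_override.xml").write_text(OVERRIDE)
    # Then: boinccmd --read_global_prefs_override (or restart the client).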

When your CPU is fully used, the draw on the PSU is probably another 100 W or so.
If it's not the power draw itself, then it could be the heat generated from it. Does the heat radiate into the case, or are you using an external water cooler?

To test if power/overuse of the CPU is the issue, stop running CPU tasks, and just run GPU tasks.

To test if heat is the issue, take the side of the case off and run a WU or two.

Might BOINC 7.0.28 be an issue?
Richard Haselgrove
Message 33252 - Posted: 29 Sep 2013, 10:41:55 UTC - in response to Message 33251.  

Might BOINC 7.0.28 be an issue?

BOINC v7.0.28 doesn't suffer from the "suspend GPU apps when applying thermal throttling to CPUs" bug that started this thread off. It has other weaknesses, but I don't think they would be relevant in this case.

I'd leave BOINC unchanged for the time being - work on the other possible causes, one at a time, until you find out what it is.
RaymondFO*
Message 33266 - Posted: 29 Sep 2013, 20:01:48 UTC - in response to Message 33251.  
Last modified: 29 Sep 2013, 20:29:22 UTC

Raymond, I agree with the suggestions made by MrS, and reducing the power target is certainly worth a go.

Your issue is most likely with the card, power supply or cooling when running the GPUGrid app, but there are some other possibilities.


I reduced the power target and lowered the temperature target to 69°C, but the "stop and go" still occurred at both 69°C and 71°C.

When your CPU is fully used, the draw on the PSU is probably another 100 W or so. If it's not the power draw itself, then it could be the heat generated from it. Does the heat radiate into the case, or are you using an external water cooler?


I have a 1250 W PSU in this computer. The CPU is water-cooled internally. I have had the side of the case off since June 2013 to generate maximum airflow, and the computer is near an open window. I will add a small desktop fan positioned to direct airflow towards the cards.

I have reduced CPU usage so that only GPUGRID tasks are running and no other CPU tasks. The "stop and go" event still occurred with zero CPU tasks running or listed in the BOINC client.

The BOINC local and project web settings are always the same for every project, as I dislike potential conflicts. Yes, I learned my lesson on that issue the hard way a while back.

The one possibility that was raised is that a maxed-out CPU could conceivably struggle to feed a Titan. I will run the CPU without Hyper-Threading, as that is the only possibility that has not yet been explored.
ExtraTerrestrial Apes (Volunteer moderator, Volunteer tester)
Message 33315 - Posted: 1 Oct 2013, 21:19:31 UTC - in response to Message 33266.  

Sounds like you should be pretty safe, hardware-wise. But reading a few posts back, I'm not sure which error we're actually talking about here. Could you summarize the "start and stop" problem? Are you seeing the "client suspended (user request)" message often in your tasks, and do they take quite long, about twice the CPU time they required?

MrS
RaymondFO*
Message 33348 - Posted: 3 Oct 2013, 18:05:53 UTC - in response to Message 33315.  

Sounds like you should be pretty safe, hardware-wise. But reading a few posts back, I'm not sure which error we're actually talking about here. Could you summarize the "start and stop" problem? Are you seeing the "client suspended (user request)" message often in your tasks, and do they take quite long, about twice the CPU time they required?

MrS



When a task is suspended by the computer I see "Scheduler wait: access violation", and later "waiting to run" while other GPUGRID tasks run. So the BOINC client window will show one or two GPU tasks in "waiting" mode, partially run, while the other two tasks are being crunched. I do not see any "client suspended (user request)" description. I have recently seen a BOINC message saying, in effect, that if this keeps occurring (the "stop and start" process) I should reset the project, and that is my next step.

The actual tasks take approximately the same amount of GPU/CPU crunch time whether they are crunched continuously or through this "stop and start" cycle. However, tasks being stopped with "Scheduler wait: access violation" and later shown as "waiting to run" make the wall-clock time much longer, as I am now crunching three or four tasks at the same time and switching back and forth between them, instead of two tasks being crunched continuously.
ExtraTerrestrial Apes (Volunteer moderator, Volunteer tester)
Message 33382 - Posted: 6 Oct 2013, 19:28:35 UTC - in response to Message 33348.  

It's been discovered in a news thread that the 331.40 beta driver fixes the access violations. Give it a try!

MrS
RaymondFO*
Message 33387 - Posted: 7 Oct 2013, 3:15:22 UTC - in response to Message 33382.  

It's been discovered in a news thread that the 331.40 beta driver fixes the access violations. Give it a try!

MrS


Installed the 331.40 beta and there are no issues to report. It would appear this situation may be resolved. Thank you for posting this information here.
skgiven (Volunteer moderator, Volunteer tester)
Message 33389 - Posted: 7 Oct 2013, 9:58:18 UTC - in response to Message 33387.  
Last modified: 7 Oct 2013, 9:59:24 UTC

Regarding the downclocking TJ and others experienced,

I suspected this was happening on my Windows 7 system, as some of the runtimes were a bit too long. As I have a GTX660 and a GTX660Ti in the same system, it's a bit difficult/time-consuming to identify such issues retrospectively. Fortunately Matt added the name of the GPU used to the stderr output file. However, while the device clock is also there, it looks like it's not being updated:

The following SANTI_baxbim2 task ran on the GTX660Ti, and although it reports 1110 MHz, when I checked it was at 705 MHz (GPU-Z and MSI Afterburner).

I362-SANTI_baxbim2-3-32-RND0950_0 Run time 54,173.06, CPU time 53,960.47

The task took 37% longer than two other tasks that ran on the same GPU,
I67-SANTI_baxbim2-1-32-RND7117_0 Run time 39,369.26, CPU time 39,052.88

I255-SANTI_baxbim2-0-32-RND5544_1 Run time 40,646.89, CPU time 40,494.18
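As a quick check of that 37% figure, nothing more than the ratio of the run times:

    # run times in seconds, from the tasks above
    slow, fast = 54173.06, 39369.26
    print(f"{(slow / fast - 1) * 100:.1f}% longer")  # -> 37.6% longer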

I also noticed some tasks running with the GPU power around 55% and very low temperatures, but I didn't have time to look into it.

In my situation I can't easily set Prefer Maximum Performance, as I'm using the iGPU for my only display (you can't use the NVidia Control Panel unless a monitor is plugged into an NVidia GPU). For now I have upgraded from the 326.41 to the 331.40 beta drivers. I will see how that gets on, and if I see further downclocking I will plug a monitor into an NVidia GPU and set Prefer Maximum Performance.
Jacob Klein
Message 33390 - Posted: 7 Oct 2013, 11:49:03 UTC
Last modified: 7 Oct 2013, 11:50:25 UTC

If you find nVidia bugs (like not being able to use the nVidia Control Panel when no monitor is connected, or performance degradation in the Adaptive power mode even under full 3D load from a CUDA task)... please report your issues in the nVidia driver feedback thread on their forums! Sometimes they are responsive.
TJ
Message 33395 - Posted: 7 Oct 2013, 15:12:03 UTC - in response to Message 33389.  

Skgiven, I think that what we see in the stderr file is the boost value of the GPU, and not the actual value or the value it ran at.
My 660 is set to 1045 MHz, while the stderr shows 1110 MHz, even when it is downclocked.
I have now switched the 660 to LR tasks only, and it has been running for 6 days and 9 hours without downclocking! Some of these WUs even ran without interruption by "The simulation has become unstable. Terminating to avoid lock-up (1)".

I saw the downclocking especially with SR tasks and betas, but I am not checking every WU that closely.
I have already downloaded the 331.40 beta driver and will install it the next time I need to reboot the system (it runs fine now, so I'll leave it).
Greetings from TJ
ExtraTerrestrial Apes (Volunteer moderator, Volunteer tester)
Message 33402 - Posted: 7 Oct 2013, 20:48:37 UTC

The clock Matt is displaying seems to be the "Boost Clock" property of a GPU. It may be what nVidia think a typical boost clock in games will be for the card. For my GTX660Ti I've got:

- a base clock of 1006 MHz (factory OC)
- 1084 MHz "Boost Clock"
- 1189 MHz running GPU-Grid stock
- 1228 MHz running GPU-Grid with clock offset
- 1040 - 1100 MHz @ 1.03 - 1.07 V with power consumption limited to 108 W, depending on the WU

In all cases 1084 MHz appears in the task output. Sure, it would be nice to see an average of the frequency, but it probably shouldn't be polled too often (to avoid conflicts with other utilities doing the same, and so as not to slow things down). Maybe one value every 1 or 10 minutes?
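A rough sketch of that kind of low-frequency sampling, assuming NVML via the pynvml Python bindings; the device index, interval and sample count are illustrative:

    import time
    import pynvml  # NVIDIA Management Library bindings (pip install nvidia-ml-py)

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU; adjust as needed

    samples = []
    for _ in range(10):  # e.g. ten samples at one-minute intervals
        samples.append(pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_GRAPHICS))
        time.sleep(60)

    print(f"average graphics clock: {sum(samples) / len(samples):.0f} MHz")
    pynvml.nvmlShutdown()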

BTW: we've drifted far away from the main topic here.

MrS
MJH
Message 33403 - Posted: 7 Oct 2013, 21:02:41 UTC - in response to Message 33402.  

The clock that's printed is whatever the runtime reports. It's not defined whether it's the base, peak or instantaneous value. The value always seems to be constant, even when the GPU is clearly throttling, so take it with a pinch of salt.

MJH
Jeremy Zimmerman
Message 33406 - Posted: 7 Oct 2013, 23:16:21 UTC - in response to Message 33403.  
Last modified: 7 Oct 2013, 23:24:37 UTC

It always seems to be what GPU-Z displays as the Default Boost Clock, not any over/underclock done with EVGA Precision, nor the actual boost. The GPU-Z pop-up description simply states: "Shows the default turbo boost clock frequency of the GPU without any overclocking."

Edit... This is for the GTX 680, in my case.

The 460 card does not show the default clock from GPU-Z, but rather the current GPU clock.

The code that pulls the clock speed seems to work differently between the two cards.
skgiven (Volunteer moderator, Volunteer tester)
Message 33409 - Posted: 7 Oct 2013, 23:29:33 UTC - in response to Message 33403.  
Last modified: 7 Oct 2013, 23:32:15 UTC

Yes, if you use GPU-Z (alas, only available on Windows), under the Graphics Card tab it will show what I think is the 'maximum boost clock': an estimate of what the boost would be if the power target, voltage and GPU utilization were at 100% (I think).

Under the Sensors tab it tells you the actual boost while crunching (or whatever else you care to do), plus lots of other useful info.

As GPUGrid WUs don't often get close to the max power limit or 100% GPU usage, many GPUs can boost higher. There is a lot of variation out there, with some cards not going higher than the 'maximum boost clock' and others going 15% or more past the reference 'maximum boost clock'.
Jacob Klein
Message 33411 - Posted: 7 Oct 2013, 23:42:20 UTC - in response to Message 33409.  
Last modified: 7 Oct 2013, 23:45:58 UTC

For me, on my eVGA GTX 660 Ti 3GB FTW (factory overclocked), I use a custom GPU fan profile (in Precision-X), so while working on GPUGrid the temperatures almost always drive the fan to its maximum (80%). The clock is usually at 1241 MHz, and the Power % is usually around 90-100%. But there are times when the temperature still gets above 72°C, in which case the clock goes down to 1228 MHz.

Also, there are times when a task really stresses the GPU (such that it would normally downclock so as not to exceed the 100% power target, even if temps are below 70°C), so in Precision-X I have set the power target to its maximum, 140%. So, in times of considerable stress, I do sometimes see it running at 1241 MHz, at ~105% power, while under 72°C.

Regarding reported clock speeds, even though the GPU is usually at maximum boost (1241 MHz), I believe GPUGrid task results always report 1124 MHz (which I believe may be the "default boost" for a "regular" GTX 660 Ti), which is fine by me.

Here are the GPU-Z values from my eVGA GTX 660 Ti:

GPU Clock: Shows the currently selected performance GPU clock speed of this device.
1046 MHz

Memory Clock: Shows the currently selected performance memory clock speed of this device.
1502 MHz

Boost Clock: Shows the typical turbo boost clock frequency of the GPU.
1124 MHz

Default GPU Clock: Shows the default GPU clock speed of this device without any overclocking.
1046 MHz

Default Memory Clock: Shows the default memory clock speed of this device without any overclocking.
1502 MHz

Default Boost Clock: Shows the default turbo boost clock frequency of the GPU without any overclocking.
1124 MHz
Jeremy Zimmerman
Message 33412 - Posted: 8 Oct 2013, 2:23:57 UTC - in response to Message 33411.  

Below are the outputs from GPU-Z compared with the stderr output.

GTX460
<stderr_txt>
# Device clock : 1760MHz
# Memory clock : 1850MHz

GPU-Z
Clock....Default / GPU: 760 / 880
Memory...Default / GPU: 900 / 925
Shader...Default / GPU: 1520 / 1760


GTX680
<stderr_txt>
# Device clock : 1124MHz
# Memory clock : 3200MHz

GPU-Z
Clock....Default / GPU: 1059 / 1085
Memory...Default / GPU: 1552 / 1600
Boost....Default / GPU: 1124 / 1150 (Boost is actually running 1201 ~90% of time)

Probably belongs in the wish list, but it would be nice to have the boost clock reading that GPU-Z pulls added to each temperature line in the stderr output, and then have the Device clock reported as the average of those readings. Of course, cross-linked to the task results so it could be pulled into a view when running stats on the cards. :)
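A sketch of what that averaging could look like, assuming a hypothetical stderr format in which each temperature line also carried a sampled clock; the line format below is invented purely for illustration:

    import re

    # Invented stderr format, for illustration only: a sampled clock on each temp line.
    stderr = """\
    # GPU 0 : 64C : 1201MHz
    # GPU 0 : 66C : 1189MHz
    # GPU 0 : 67C : 1176MHz
    """

    clocks = [int(m.group(1)) for m in re.finditer(r"(\d+)MHz", stderr)]
    print(f"# Device clock : {sum(clocks) // len(clocks)}MHz (avg of {len(clocks)} samples)")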
ExtraTerrestrial Apes (Volunteer moderator, Volunteer tester)
Message 33425 - Posted: 8 Oct 2013, 20:48:52 UTC - in response to Message 33403.  

The clock that's printed is whatever the runtime reports. It's not defined whether it's the base, peak or instantaneous value. The value always seems to be constant, even when the GPU is clearly throttling, so take it with a pinch of salt.

MJH

For Keplers it's the "Boost Clock" defined in their BIOS (hence it's not affected by the actual boost or downclocking), whereas for older cards the shader clock is reported (by the runtime).

Not sure how GPU-Z and the others are doing it, but I suspect the runtime could also report the actual instantaneous clock, which is what we're interested in.
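(For what it's worth, NVML, as opposed to the CUDA runtime proper, can report the instantaneous clock. A minimal one-off query via the pynvml Python bindings might look like this:)

    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    # Instantaneous clock vs. the card's rated maximum; the two differ under throttling.
    now = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_GRAPHICS)
    peak = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_GRAPHICS)
    print(f"current {now} MHz / max {peak} MHz")
    pynvml.nvmlShutdown()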

MrS
skgiven (Volunteer moderator, Volunteer tester)
Message 33440 - Posted: 9 Oct 2013, 18:51:31 UTC - in response to Message 33425.  

This thread has digressed some way from the original topic.
An 'observed boost clocks' thread would be in order, should anyone care to start one/shift a few posts?