Message boards : Graphics cards (GPUs) : Opinions please: ignoring suspend requests

Profile MJH
Project administrator
Project developer
Project scientist
Message 33184 - Posted: 24 Sep 2013 | 20:39:10 UTC

As I understand things, if "use at most X% of CPU time" is set to anything less than 100%, the BOINC client repeatedly suspends and resumes the app. Based on the number of complaints about slow performance I see here that are attributable to this, it's a problem that catches out many users.

Since this type of suspend request is sent to the application only in that case and when running CPU benchmarks, I could simply change the application to ignore these suspend/resume requests, avoiding this failure mode. Full suspension, where the app is terminated, would continue to work.
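
For illustration only (a sketch under assumptions, not the actual GPUGRID source): with the standard BOINC API, an app that initialises with direct_process_action disabled receives suspend/resume as status flags it can inspect, so ignoring the throttle-driven ones while still honouring quit requests might look roughly like this (do_checkpoint is a hypothetical helper):

    #include "boinc_api.h"
    #include <cstdlib>

    void do_checkpoint();  // hypothetical: write restart state to disk

    void init_boinc() {
        BOINC_OPTIONS options;
        boinc_options_defaults(options);
        options.direct_process_action = 0;  // library sets flags instead of suspending us
        boinc_init_options(&options);
    }

    // Called periodically from the app's main loop.
    void poll_boinc() {
        BOINC_STATUS status;
        boinc_get_status(&status);
        if (status.quit_request || status.abort_request || status.no_heartbeat) {
            do_checkpoint();  // full suspension still works: save state and exit;
            exit(0);          // BOINC restarts the task from the checkpoint later
        }
        // status.suspended (benchmark- or throttle-driven) is deliberately ignored here.
    }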

Opinions please!

Matt

TJ
Message 33185 - Posted: 24 Sep 2013 | 22:05:08 UTC - in response to Message 33184.

Hi Matt,

I have set a line in my cc_config.xml, posted by someone on the forum, to skip all benchmarks. Secondly, I run all my rigs at 100% all the time, and do not use TThrottle to lower temperature. TThrottle is a nice program: it lowers CPU and GPU usage as the temperature rises above a value set by the user.
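
For reference, the cc_config.xml line TJ mentions is presumably the client's standard benchmark-skipping option — a minimal sketch:

    <cc_config>
      <options>
        <skip_cpu_benchmarks>1</skip_cpu_benchmarks>
      </options>
    </cc_config>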

So as far as I'm concerned, you might disable every feature that suspends or hampers GPUGRID calculation - even the downclocking of the GPU after a WU has failed or been terminated to prevent a lock-up, if that is within your power to do, of course.

I think that people who are concerned about heat building up in the system, and who often have expensive graphics cards, could take a different view of it.
____________
Greetings from TJ

Richard Haselgrove
Message 33186 - Posted: 24 Sep 2013 | 22:12:01 UTC

I would, reluctantly, support that approach as a temporary stop-gap only.

I say reluctantly, because:

1) In principle, I support proper diagnosis and robust, permanent fixes. Other GPU apps that I run, from other projects, don't suffer unrecoverable errors as a result of suspension. To be fair, that could simply be because other apps don't run as long as GPUGrid's - you require longer-term stability than anyone else. And similarly, there are fewer reports of slow running at other projects, again perhaps because it isn't so noticeable on shorter tasks.

2) I suspect this has come to light because a stupid feature was, mistakenly in my view, included in the v7.0.64 client (sorry, we didn't catch it in time), and that client has remained 'recommended' for far longer than expected. Once that aberration has been corrected, I'd be happier if the requisite workaround were removed too - so we end up with the design working as intended (CPU throttling allowed, without disrupting GPU work). We do have to watch out for that '0.5 CPU' definition of coprocessor work, though.

Richard Haselgrove
Message 33188 - Posted: 24 Sep 2013 | 22:17:33 UTC - in response to Message 33185.

Even the downclocking of the GPU after a WU has failed or been terminated to prevent a lock-up, if that is within your power to do, of course.

Downclocking is a safety feature built into both hardware and drivers since about the Fermi launch. I don't think anyone should try to circumvent that - it would be like welding shut the safety valve on a steam boiler.

I'm sure downclocking a GPU is a source of irritation when it happens, but better to address the cause than to close your eyes to the symptom. Unlike a steam boiler, a molten GPU is hardly likely to kill anybody - but it could be expensive to replace.

TJ
Message 33190 - Posted: 24 Sep 2013 | 22:27:55 UTC - in response to Message 33188.

I agree with you Richard, however this downclocking has appeared frequently on my rig with the 660 since the introduction of the "terminating to prevent hang-up" behaviour. I am not sure that it is related, but it seems so. Every other day now I have to reboot the rig to fix it, and I am not always there to do so. But we are going off topic, so I'll stop about it.
____________
Greetings from TJ

Profile MJH
Project administrator
Project developer
Project scientist
Message 33194 - Posted: 25 Sep 2013 | 8:43:51 UTC - in response to Message 33186.

I would, reluctantly, support that approach as a temporary stop-gap only

To be clear, this proposal is designed to address only the cases where misconfigured BOINC clients repeatedly suspend the app, causing the runtime to increase dramatically. This change would save the user from having to fix their configuration manually.

It is unrelated to the 'access violation' problem.

Matt

Richard Haselgrove
Message 33195 - Posted: 25 Sep 2013 | 8:53:23 UTC

I understand that the downclocking that TJ is concerned about (which is permanent, until the next host reboot) is likely to occur if BOINC suspends the app without allowing time for what is known as a 'threadsafe' exit.

That's the reason for the note 'Do GPU kernels within critical sections' in the BOINC co-processor programming guide.
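
For readers unfamiliar with that note: the advice amounts to bracketing each kernel launch with BOINC's critical-section calls, so the client never suspends or kills the process while a kernel is in flight. A minimal sketch, with a placeholder kernel (my_kernel, grid, block and args are illustrative, not from any real app):

    #include "boinc_api.h"
    #include <cuda_runtime.h>

    void run_one_step() {
        boinc_begin_critical_section();    // client defers suspend/quit from here...
        my_kernel<<<grid, block>>>(args);  // placeholder GPU kernel launch
        cudaDeviceSynchronize();           // wait until the kernel has finished
        boinc_end_critical_section();      // ...to here; now a safe point to suspend
    }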

Profile MJH
Project administrator
Project developer
Project scientist
Message 33197 - Posted: 25 Sep 2013 | 11:32:49 UTC - in response to Message 33195.
Last modified: 25 Sep 2013 | 11:43:01 UTC


I understand that the downclocking that TJ is concerned about (which is permanent, until the next host reboot) is likely to occur if BOINC suspends the app without allowing time for what is known as a 'threadsafe' exit.


With the current application this should no longer be occurring. The application defers suspension and termination until a safe point. The BOINC library is no longer able to asynchronously suspend the process, since that was a significant cause of instability.
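
Presumably that deferral looks something like the following in the app's main loop (a sketch under assumptions, not the real source; run_md_step and checkpoint are placeholders, and direct process control is assumed disabled as in the earlier sketch): suspend and quit are only acted on between iterations, never mid-kernel.

    #include "boinc_api.h"
    #include "util.h"   // for boinc_sleep()
    #include <cstdlib>

    void main_loop(int nsteps) {
        for (int step = 0; step < nsteps; step++) {
            run_md_step();              // placeholder: one iteration of GPU work
            BOINC_STATUS s;
            boinc_get_status(&s);
            if (s.quit_request || s.no_heartbeat) {
                checkpoint();           // placeholder: state is consistent here
                exit(0);
            }
            while (s.suspended) {       // honour suspend only at this safe point
                boinc_sleep(1.0);
                boinc_get_status(&s);
            }
        }
    }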

Matt

TJ
Message 33199 - Posted: 25 Sep 2013 | 11:53:41 UTC - in response to Message 33197.


I understand that the downclocking that TJ is concerned about (which is permanent, until the next host reboot) is likely to occur if BOINC suspends the app without allowing time for what is known as a 'threadsafe' exit.


With the current application this should no longer be occurring. The application defers suspension and termination until a safe point. The BOINC library is no longer able to asynchronously suspend the process, since that was a significant cause of instability.

Matt

Well it happened yesterday twice on my rig with the 660, using app 8.14 (cuda55).
____________
Greetings from TJ

Profile MJH
Project administrator
Project developer
Project scientist
Message 33200 - Posted: 25 Sep 2013 | 11:58:07 UTC - in response to Message 33199.

OK TJ, could you please elaborate on the circumstances, in a private message to me.

Matt

Jacob Klein
Message 33208 - Posted: 25 Sep 2013 | 22:16:00 UTC - in response to Message 33200.
Last modified: 25 Sep 2013 | 22:33:08 UTC

You should not ignore suspend/resume requests, whether they come from benchmarking or from CPU throttling.

If there was a client bug that was fixed, so be it.
If there still exists a client bug, we should find out about it and fix it.

If any app, even a GPU app, is told to suspend, for whatever reason (including CPU throttling), it needs to suspend.
If the user wants to ensure no CPU throttling, then the user needs to change the setting to ensure no CPU throttling.
If the user specified CPU Throttling, then tasks will be and should be throttled.
Simple as that.

That's my opinion.

5pot
Message 33211 - Posted: 26 Sep 2013 | 4:17:50 UTC - in response to Message 33208.

You should not ignore suspend/resume requests, whether they come from benchmarking or from CPU throttling.

If there was a client bug that was fixed, so be it.
If there still exists a client bug, we should find out about it and fix it.

If any app, even a GPU app, is told to suspend, for whatever reason (including CPU throttling), it needs to suspend.
If the user wants to ensure no CPU throttling, then the user needs to change the setting to ensure no CPU throttling.
If the user specified CPU Throttling, then tasks will be and should be throttled.
Simple as that.

That's my opinion.


I'm going to err on the side of caution here, and agree with Jacob. It's up to the user to get their settings straight, and not the devs to override potentially "bad" settings without user consent.

That is if I do understand what we're talking about correctly. :)

P.S. No matter how much trouble it's causing, in this case, it's up to the user to fix.

Jacob Klein
Message 33227 - Posted: 27 Sep 2013 | 19:32:41 UTC - in response to Message 33208.

I'd also like to add another opinion:

I've heard of the following change to BOINC 7.2.1+:
client: don't apply CPU throttling to apps that use < .5 CPUs (like GPU, NCI).

Is it likely that GPUGrid tasks will show as using (0.5 CPUs) or more?
If so, then the application will receive suspend/resume events even in 7.2.x.

And if it has trouble responding to those events (trouble like task errors, or task hangs)... then the problem is in the application, not in the client. And it should be solved (by making it not error/hang), not worked around (i.e. by ignoring suspend/resume events).

Suspend/resume events need to be supported, even for GPU applications.

RaymondFO*
Message 33229 - Posted: 27 Sep 2013 | 21:30:48 UTC - in response to Message 33227.
Last modified: 27 Sep 2013 | 21:34:40 UTC

My TITAN uses 0.827 CPU and is subject to this issue, and I have others that exceed 0.50 CPU. So upgrading the BOINC client will not work here.

No other computer running GPUGRID is subject to this exception. I have always had 'use at most % of CPU time' set to 100%, likewise 100% of processors on multiprocessor systems, and I have never throttled back any computer.

While I have employed a "work-around," I would prefer this issue be resolved, as I have also recently experienced numerous computer and driver crashes. None of my other computers running GPUGRID has experienced this frequency of crashes. I am using driver 327.33 on this computer; the others are still using 320.49. The only third-party utility I am using is EVGA Precision X, and only to control the fan speeds, not for manual overclocking purposes.

Please note this situation does not occur with PrimeGrid or Collatz. Then again, their CPU resource requirement is below 0.10.

Thank you.

Raymond

Richard Haselgrove
Message 33230 - Posted: 27 Sep 2013 | 22:58:49 UTC - in response to Message 33229.

My TITAN uses 0.827 CPU and is subject to this issue, and I have others that exceed 0.50 CPU. So upgrading the BOINC client will not work here.

I can't see your computers - they are hidden - but could you expand further?

Does your Titan actually use 82.7% CPU - i.e., is CPU time recorded for your results 82.7% of total runtime? The average figure for my Kepler is nearer 98.5%, but for my Fermi more like 7.9%.

BOINC Manager displays a status line like 'Running (X.XX CPUs + Y.YY NVIDIA GPUs)'. This, like many of BOINC's numbers, is a server-generated estimate, and bears no relationship to the actual observed behaviour of your own CPU/GPU combination. But it is this estimate which will in the end be compared with David's arbitrary 0.5 CPU threshold "to throttle, or not to throttle". It would be best to devote our efforts to educating David Anderson about the behaviour and requirements of real GPUs in the real world, and get him to program BOINC accordingly - remembering not to pretend that the best settings for any one single project are necessarily correct across the board.

Profile skgiven
Volunteer moderator
Volunteer tester
Message 33231 - Posted: 28 Sep 2013 | 11:02:17 UTC - in response to Message 33230.
Last modified: 28 Sep 2013 | 11:06:33 UTC

I expect the 82.7% is just what's being reported in BOINC Task Properties (and what is used by the scheduler).
ALL Keplers (GK104 and GK110) use a full CPU when running GPUGrid tasks (except Noelia tasks, which use around 30 or 40%) - it's the app that makes them do this, but the scheduler uses a different system (the app and scheduler ignore each other). Fermis don't use a full CPU core.

If a non-Noelia task uses significantly less than a full CPU (as in <95%) then it suggests there might be a problem - the CPU is being overused (by other apps), the CPU is not capable of supporting the GPU fully (architecture or setup), or some other hardware is limiting the GPU (PCIE, RAM speeds, clogged up drive).

The BOINC scheduler thinks my GTX660Ti & GTX660 use 0.593 CPUs (W7, i7-3770K@4.2GHz), and my GTX670 uses 0.682 CPUs (Ubuntu, G2020@2.9GHz).

Although some of the 100% CPU use for one WU is polling, if you set it any lower the runtimes of the tasks increase, varying by task type. This would be more significant with the GK110 cards.

There are four processor usage BOINC settings that could cause the GPU to suspend and resume frequently (a global_prefs_override.xml sketch follows the list):
Computing allowed,

    While computer is in use (when not ticked)
    Use GPU while computer is in use (when not Ticked)
    While processor usage is less than (anything other than zero)
    Use at most % CPU time (anything other than 100%)
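
For anyone checking these by hand, the same four settings appear in global_prefs_override.xml in the BOINC data directory. A sketch of values that avoid the frequent suspend/resume, assuming the usual field names for these preferences:

    <global_preferences>
       <run_if_user_active>1</run_if_user_active>
       <run_gpu_if_user_active>1</run_gpu_if_user_active>
       <suspend_cpu_usage>0</suspend_cpu_usage>
       <cpu_usage_limit>100</cpu_usage_limit>
    </global_preferences>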


Maybe using some UPSes with 'While computer is on batteries' unselected could cause issues, or a laptop battery misreporting its connection state (OS or OS driver bugs).

In my experience some default settings, and the recommended settings at other projects, are unsuitable for crunching at GPUGrid. It would be useful if there were some way to automatically select recommended settings from your online account (or a Project button in BOINC Manager).
It would also be useful if, when you allow others to see your systems, they could also see your BOINC configuration - or at least an option for moderators to see it.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

RaymondFO*
Message 33233 - Posted: 28 Sep 2013 | 17:11:39 UTC - in response to Message 33231.

I expect the 82.7% is just what's being reported in BOINC Task Properties

Reply: That is correct. When the task is suspended it will initially read as "Scheduler wait: access violation" and then later as "waiting to run" while other GPUGRID tasks run.

If a non-Noelia task uses significantly less than a full CPU (as in <95%) then it suggests there might be a problem - the CPU is being overused (by other apps), the CPU is not capable of supporting the GPU fully (architecture or setup), or some other hardware is limiting the GPU (PCIE, RAM speeds, clogged up drive)

Reply: That may be true, but when app 8.13 was used with the 326.80 beta driver this never happened. Should I use an app_info file and force it to carve out one CPU for the video card?
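
(For what it's worth, a full app_info.xml shouldn't be needed just for that: on clients new enough to support it - 7.0.40 or later, I believe - an app_config.xml in the project directory can raise the scheduler's CPU reservation. A sketch, assuming the application name is acemdlong; check the real name in client_state.xml:

    <app_config>
      <app>
        <name>acemdlong</name>
        <gpu_versions>
          <gpu_usage>1.0</gpu_usage>
          <cpu_usage>1.0</cpu_usage>
        </gpu_versions>
      </app>
    </app_config>

Note this only changes BOINC's scheduling arithmetic, not how much CPU the app actually uses.)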


There are four processor usage Boinc settings that could cause the GPU to suspend and resume frequently:


While computer is in use (when not ticked) - Ticked
Use GPU while computer is in use (when not Ticked) - Ticked
While processor usage is less than (anything other than zero) - is zero
Use at most % CPU time (anything other than 100%) -100 %

Maybe using some UPSes with 'While computer is on batteries' unselected could cause issues, or a laptop battery misreporting its connection state (OS or OS driver bugs)

Reply: This is ticked (selected).

In my experience some default settings, and the recommended settings at other projects, are unsuitable for crunching at GPUGrid. It would be useful if there were some way to automatically select recommended settings from your online account (or a Project button in BOINC Manager).
It would also be useful if, when you allow others to see your systems, they could also see your BOINC configuration - or at least an option for moderators to see it.


System information:
Boinc version 7.0.28
Genuine Intel(R) Xeon(R) CPU 1366 @ 3.20GHz [Family 6 Model 44 Stepping 2]
(12 processors)
[2] NVIDIA GeForce GTX TITAN (4095MB) driver: 327.23
Microsoft Windows 8 x64 Edition, home edition
12 GB RAM
This behavior occurs with Asteroids@home, PrimaBoinca or FightMalaria@Home as the CPU projects. This is a dedicated cruncher with no other work being processed.

Any ideas are appreciated. Thank you.

Richard Haselgrove
Message 33234 - Posted: 28 Sep 2013 | 17:45:22 UTC - in response to Message 33231.

There are four processor usage Boinc settings that could cause the GPU to suspend and resume frequently:

Computing allowed,

  • While computer is in use (when not ticked)
  • Use GPU while computer is in use (when not Ticked)
  • While processor usage is less than (anything other than zero)
  • Use at most % CPU time (anything other than 100%)


Anybody reading and following that advice needs to be reminded that there are two different ways of setting computing preferences.

1) Via the Computing preferences link on your account page
2) Via the Tools|Computing preferences... menu in BOINC Manager.

Some web-based preference setting tools have the opposite sense to the list you quote: for example, the first one reads "Suspend work while computer is in use?" when using the web-based preference setting.

To avoid suspending tasks while you are using the computer, uncheck (untick) the web setting, or check (tick) the local setting. And so on. Whichever set you use, read the wording carefully.

Users should try to be clear in their own mind whether they are using web settings or local settings, and use one technique exclusively. In particular, even simply viewing the settings locally, and closing the dialog by clicking the OK button, creates a complete snapshot of the current active settings and uses it exclusively from that point onwards: future web-based preference changes will be ignored.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 33236 - Posted: 28 Sep 2013 | 18:10:27 UTC

Raymond: not sure if you understood it correctly. Whatever CPU usage number the BOINC manager shows has absolutely no effect on how GPU-Grid runs. The app takes its core, no matter what.

The only effect this percentage has is to help BOINC schedule tasks, i.e. decide how many tasks of which type to run. So long as you don't have any problems with that, don't bother changing this value via an app_config. You describe your problems as "numerous computer and driver crashes" which only your Titan box has. Some will argue that it's possible that the number of tasks can result in instability... but I think this is not at all common.

Instead I see a few possible other reasons:
- GPU-Grid taxes the GPU more than most projects -> it could push your power supply beyond its stability limits, or heat the system up too much
- your Titan might not be OK -> I'd try downclocking it a bit and/or lowering the power target (this would also help the PSU and heat)

MrS
____________
Scanning for our furry friends since Jan 2002

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 33237 - Posted: 28 Sep 2013 | 18:45:40 UTC

@MJH: obeying the suspend requests is surely the safe way. I'm not sure what unwanted side effects an override may have, now or in the future.

On the other hand, the problem of horrible performance due to this throttling setting is real, and it would be nice to find a solution or at least some relief. What I could imagine is adding this topic to the FAQ / system requirements / getting started / whatever sections.

What I'd put there:
- less than 100% CPU time is bad
- if a Kepler GPU with boost gets too hot / loud / expensive: reduce the power target, as this makes the chip run more efficiently

MrS
____________
Scanning for our furry friends since Jan 2002

Profile skgiven
Volunteer moderator
Volunteer tester
Message 33251 - Posted: 29 Sep 2013 | 10:26:44 UTC - in response to Message 33233.

Raymond, I agree with the suggestions made by MrS and reducing the power target is certainly worth a go.

Your issue is most likely with the card, power supply or cooling when running the GPUGRID app, but there are some other possibilities.

This behavior occurs with Asteroids@home, PrimaBoinca or FightMalaria@Home as the CPU projects. This is a dedicated cruncher with no other work being processed.

Any ideas are appreciated. Thank you.

Does this 'start-stop' behavior occur when you are not using the CPU, or when you run different CPU projects (than those you mention)?

A 12 thread processor that's maxed out could conceivably struggle to feed a Titan. Have you tried setting Boinc to use 10 threads (83.4%)?

When your CPU is fully used, the draw on the PSU is probably another 100W or so.
If it's not the power draw itself then it could be the heat generated from it. Does the heat radiate into the case, or are you using an external water cooler?

To test if power/overuse of the CPU is the issue, stop running CPU tasks, and just run GPU tasks.

To test if heat is the issue, take the side of the case off and run a WU or two.

Might Boinc 7.0.28 be an issue?
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Richard Haselgrove
Message 33252 - Posted: 29 Sep 2013 | 10:41:55 UTC - in response to Message 33251.

Might Boinc 7.0.28 be an issue?

BOINC v7.0.28 doesn't suffer from the "suspend GPU apps when throttling CPU time" bug that started this thread off. It has other weaknesses, but I don't think they would be relevant in this case.

I'd leave BOINC unchanged for the time being - work on the other possible causes, one at a time, until you find out what it is.

RaymondFO*
Message 33266 - Posted: 29 Sep 2013 | 20:01:48 UTC - in response to Message 33251.
Last modified: 29 Sep 2013 | 20:29:22 UTC

Raymond, I agree with the suggestions made by MrS and reducing the power target is certainly worth a go.

Your issue is most likely with the card, Power supply, cooling when running the GPUGRid app, but there are some other possibilities.


I reduced the power target and set the temperature target to 69°C, and at both 69°C and 71°C the "stop and go" still occurred.

When your CPU is fully used, the draw on the PSU is probably another 100W or so. If it's not the power draw itself then it could be the heat generated from it. Does the heat radiate into the case, or are you using an external water cooler?


I have a 1250W PSU attached to this computer. The CPU is water cooled internally. I have had the side of the case removed since June 2013 to generate maximum air flow, and the computer is near an open window. I will add a small desktop fan positioned to blow air towards the cards.

I have reduced the CPU usage so that only GPUGRID tasks are running and no other CPU tasks. The "stop and go" event still occurred with zero CPU tasks running or listed in the BOINC client.

The BOINC local and project web settings are always the same for all projects, as I dislike potential conflicts. Yes, I learned my lesson on that issue the hard way a while back.

The one possibility that was raised is that the CPU is being maxed out and could conceivably struggle to feed a Titan. I will run the CPU without hyper-threading, as that is the only other possibility that has not been explored.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 33315 - Posted: 1 Oct 2013 | 21:19:31 UTC - in response to Message 33266.

Sounds like you should be pretty safe, hardware-wise. But, reading a few posts back, I'm not sure which error we're actually talking about here. Could you summarize the "start and stop" problem? Are you seeing the "client suspended (user request)" message often in your tasks, and do they take quite long - about twice as long as the CPU time they required?

MrS
____________
Scanning for our furry friends since Jan 2002

RaymondFO*
Message 33348 - Posted: 3 Oct 2013 | 18:05:53 UTC - in response to Message 33315.

Sounds like you should be pretty safe, hardware-wise. But, reading a few posts back, I'm not sure which error we're actually talking about here. Could you summarize the "start and stop" problem? Are you seeing the "client suspended (user request)" message often in your tasks, and do they take quite long - about twice as long as the CPU time they required?

MrS



When the task is suspended by the computer I see "Scheduler wait: access violation" and then later "waiting to run" while other GPUGRID tasks run. So the BOINC client window will show one or two GPU tasks partially run and in "waiting" mode while the other two tasks are being crunched. I do not see any "client suspended (user request)" description. I have recently seen a BOINC message saying, in effect, that if this keeps occurring (the "stop and start" process) I should reset the project, and that is my next step.

The actual tasks take approximately the same amount of GPU/CPU crunch time whether they are crunched continuously or through this "stop and start" issue. But the stops caused by "Scheduler wait: access violation" followed by "waiting to run" make the wall-clock time much longer, as I am now crunching three or four tasks at the same time and switching back and forth between them, instead of two tasks being crunched continuously.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 33382 - Posted: 6 Oct 2013 | 19:28:35 UTC - in response to Message 33348.

It's been discovered in a news thread that 331.40 beta fixes the access violations. Give it a try!

MrS
____________
Scanning for our furry friends since Jan 2002

RaymondFO*
Message 33387 - Posted: 7 Oct 2013 | 3:15:22 UTC - in response to Message 33382.

It's been discovered in a news thread that 331.40 beta fixes the access violations. Give it a try!

MrS


Installed the 331.40 beta and no issues to report. It would appear this situation may be resolved. Thank you for posting this information here.

Profile skgiven
Volunteer moderator
Volunteer tester
Message 33389 - Posted: 7 Oct 2013 | 9:58:18 UTC - in response to Message 33387.
Last modified: 7 Oct 2013 | 9:59:24 UTC

Regarding the downclocking TJ and others experienced,

I suspected this was happening on my Windows 7 system, as some of the runtimes were a bit too long. As I have a GTX660 and a GTX660Ti in the same system, it's a bit difficult/time-consuming to identify such issues retrospectively. Fortunately Matt added the name of the GPU used to the stderr output file. However, while the device clock is also there, it looks like it's not being updated:

The following SANTI_baxbim2 task ran on the GTX660Ti, and although it reports 1110MHz, when I checked it was at 705MHz (GPU-Z and MSI Afterburner).

I362-SANTI_baxbim2-3-32-RND0950_0 Run time 54,173.06, CPU time 53,960.47

The task took 37% longer than two other tasks that ran on the same GPU,
I67-SANTI_baxbim2-1-32-RND7117_0 Run time 39,369.26, CPU time 39,052.88

I255-SANTI_baxbim2-0-32-RND5544_1 Run time 40,646.89, CPU time 40,494.18

I also noticed some tasks running with the GPU power around 55% and very low temperatures, but didn't have the time to look into it.

In my situation I can't easily set Prefer Maximum Performance, as I'm using the iGPU for my only display (you can't use NVidia Control Panel unless a monitor is plugged into an NVidia GPU). For now I upgraded from the 326.41 to the 331.40 beta drivers. I will see how that gets on, and if I see further downclocking I will plug a monitor into an NVidia GPU and set Prefer Maximum Performance.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Jacob Klein
Message 33390 - Posted: 7 Oct 2013 | 11:49:03 UTC
Last modified: 7 Oct 2013 | 11:50:25 UTC

If you find nVidia bugs (like being unable to use nVidia Control Panel when no monitor is connected, or performance degradation on Adaptive Performance even when under full 3D load from a CUDA task)... please report your issues in the nVidia driver feedback thread on their forums! Sometimes they are responsive.

TJ
Message 33395 - Posted: 7 Oct 2013 | 15:12:03 UTC - in response to Message 33389.

Skgiven, I think that what we see in the stderr file is the boost value of the GPU, and not the actual value it ran at.
My 660 is set to 1045MHz, while the stderr shows 1110MHz, even when it is downclocked.
I have now switched the 660 to LR only and it has been running for 6 days and 9 hours without downclocking! Some of these WUs even ran without interruption by "The simulation has become unstable. Terminating to avoid lock-up (1)".

I saw my downclocking especially with SR and betas, but I am not checking every WU that closely.
I have already downloaded the 331.40 beta driver and will install it next time I need to reboot the system (it runs fine now, so I'll leave it).
____________
Greetings from TJ

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 33402 - Posted: 7 Oct 2013 | 20:48:37 UTC

The clock Matt is displaying seems to be the "Boost Clock" property of a GPU. It may be what nVidia think a typical boost clock in games will be for the card. For my GTX660Ti I've got:

- a base clock of 1006 MHz (factory OC)
- 1084 MHz "Boost Clock"
- 1189 MHz running GPU-Grid stock
- 1228 MHz running GPU-Grid with clock offset
- 1040 - 1100 MHz @ 1.03 - 1.07 V with power consumption limited to 108 W, depending on the WU

In all cases 1084 MHz appears in the task output. Sure, it would be nice to see an average of the frequency, but it probably shouldn't be polled too often (to avoid conflicts with other utilities doing the same, and to not slow things down). Maybe one value every 1 or 10 minutes?
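
For what it's worth, the instantaneous clock can be sampled outside the runtime via NVML, which ships with the driver - an untested sketch of a once-a-minute sampler (link with -lnvidia-ml):

    #include <nvml.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        nvmlDevice_t dev;
        unsigned int mhz;
        nvmlInit();
        nvmlDeviceGetHandleByIndex(0, &dev);  // first GPU in the system
        for (int i = 0; i < 10; i++) {        // ten samples, one per minute
            nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &mhz);
            printf("SM clock: %u MHz\n", mhz);
            sleep(60);
        }
        nvmlShutdown();
        return 0;
    }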

BTW: we've drifted far away from the main topic here.

MrS
____________
Scanning for our furry friends since Jan 2002

Profile MJH
Project administrator
Project developer
Project scientist
Message 33403 - Posted: 7 Oct 2013 | 21:02:41 UTC - in response to Message 33402.

The clock that's printed is whatever the runtime reports. It's not defined whether it's the base, peak or instantaneous value. The value always seems to be constant, even when the GPU is clearly throttling, so take it with a pinch of salt.

MJH

Jeremy Zimmerman
Message 33406 - Posted: 7 Oct 2013 | 23:16:21 UTC - in response to Message 33403.
Last modified: 7 Oct 2013 | 23:24:37 UTC

It always seems to be what GPU-Z displays as the Default Boost Clock - not any over/underclock done with EVGA Precision, nor the actual boost. The GPU-Z pop-up description simply states: "Shows the default turbo boost clock frequency of the GPU without any overclocking."

Edit....This is for the Boost 680 for me.

The 460 card does not show the default clock under GPU-Z, but rather the GPU Clock.

The code that pulls the clock speed seems to work differently between the two cards.

Profile skgiven
Volunteer moderator
Volunteer tester
Message 33409 - Posted: 7 Oct 2013 | 23:29:33 UTC - in response to Message 33403.
Last modified: 7 Oct 2013 | 23:32:15 UTC

Yes, if you use GPU-Z (alas, only available on Windows), under the Graphics Card tab it will show what I think is the 'maximum boost clock': an estimate of what the boost would be if the power target, voltage and GPU utilization were all at 100% (I think).

Under the Sensors tab it tells you the actual boost while crunching (or whatever else you care to do), and lots of other useful info.

As GPUGrid WUs don't often get close to the max power limit or 100% GPU usage, many GPUs can boost higher. There is a lot of variation out there, with some cards not going higher than the 'maximum boost clock' and other cards going 15% or more past the reference 'maximum boost clock'.
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Jacob Klein
Message 33411 - Posted: 7 Oct 2013 | 23:42:20 UTC - in response to Message 33409.
Last modified: 7 Oct 2013 | 23:45:58 UTC

For me, for my eVGA GTX 660 Ti 3GB FTW (factory overclocked), I use a custom GPU fan profile (in Precision-X), so while working on GPUGrid the temperatures almost always drive the fan to its maximum (80%). The clock is usually at 1241 MHz, and the Power % is usually around 90-100%. But there are times when the temperature still gets above 72°C, in which case the clock goes down to 1228 MHz.

Also, there are times when a task really stresses the GPU (such that it would normally downclock to not exceed the 100% power target, even if temps are below 70°C), so in Precision-X I have set the power target to its maximum, 140%. So, in times of considerable stress, I do sometimes see it running at 1241 MHz, at ~105% Power, while under 72°C.

Regarding reported clock speeds: even though the GPU is usually at maximum boost (1241 MHz), I believe GPUGrid task results always report 1124 MHz (which I believe may be the "default boost" for a "regular" GTX 660 Ti), which is fine by me.

Here are the GPU-Z values from my eVGA GTX 660 Ti:

GPU Clock: Shows the currently selected performance GPU clock speed of this device.
1046 MHz

Memory Clock: Shows the currently selected performance memory clock speed of this device.
1502 MHz

Boost Clock: Shows the typical turbo boost clock frequency of the GPU.
1124 MHz

Default GPU Clock: Shows the default GPU clock speed of this device without any overclocking.
1046 MHz

Default Memory Clock: Shows the default memory clock speed of this device without any overclocking.
1502 MHz

Default Boost Clock: Shows the default turbo boost clock frequency of the GPU without any overclocking.
1124 MHz

Jeremy Zimmerman
Message 33412 - Posted: 8 Oct 2013 | 2:23:57 UTC - in response to Message 33411.

Below outputs from GPU-Z as compared to the stderr output.

GTX460
<stderr_txt>
# Device clock : 1760MHz
# Memory clock : 1850MHz

GPU-Z
Clock....Default / GPU: 760 / 880
Memory...Default / GPU: 900 / 925
Shader...Default / GPU: 1520 / 1760


GTX680
<stderr_txt>
# Device clock : 1124MHz
# Memory clock : 3200MHz

GPU-Z
Clock....Default / GPU: 1059 / 1085
Memory...Default / GPU: 1552 / 1600
Boost....Default / GPU: 1124 / 1150 (Boost is actually running 1201 ~90% of time)

Probably belongs in the wish list, but it would be nice to have the boost clock reading that GPU-Z pulls added to each temperature line in the stderr output, and then have the Device clock reported as the average of those readings. Of course, cross-linked to the task results so it could be pulled into a view when trying to run stats on the cards. :)

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 33425 - Posted: 8 Oct 2013 | 20:48:52 UTC - in response to Message 33403.

The clock that's printed is whatever the runtime reports. It's not defined whether it's the base, peak or instantaneous value. The value always seems to be constant, even when the GPU is clearly throttling, so take it with a pinch of salt.

MJH

For Keplers it's the "Boost Clock" defined in their BIOS (hence it's not affected by the actual boost or downclocking), whereas for older cards the shader clock is reported (by the runtime).

Not sure how GPU-Z and the others are doing it, but I suspect the runtime can also report the actual instantaneous clock, which is what we're interested in.

MrS
____________
Scanning for our furry friends since Jan 2002

Profile skgiven
Volunteer moderator
Volunteer tester
Message 33440 - Posted: 9 Oct 2013 | 18:51:31 UTC - in response to Message 33425.

This thread has digressed some way from the original topic.
An 'observed boost clocks' thread would be in order, should anyone care to start one/shift a few posts?
____________
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help
