Opinions please: ignoring suspend requests

Author	Message
MJH Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0	Message 33184 - Posted: 24 Sep 2013, 20:39:10 UTC
As I understand things, if the "Use at most % CPU time" setting is at anything less than 100%, the BOINC client repeatedly suspends the app. Based on the number of complaints of slow performance that I see here and that are attributable to this, it's a problem that seems to catch out many users. Since the type of suspend request sent to the application is used only here and when running CPU benchmarks, I could simply change the application to ignore these suspend/resume requests, avoiding this failure mode. Full suspending, where the app is terminated, would continue to work. Opinions please! Matt
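To make the slowdown concrete, here is a rough model, not taken from the BOINC source. It assumes the client enforces the limit by alternating run and suspend periods, so the app runs for roughly the configured percentage of wall-clock time. The function name `throttled_wall_time` is hypothetical, and suspend/resume overhead is ignored, so real slowdowns could be larger.

```python
def throttled_wall_time(compute_hours: float, cpu_limit_pct: float) -> float:
    """Estimate wall-clock hours for a task that needs `compute_hours` of
    uninterrupted running, assuming (a simplification) that the app is
    allowed to run for cpu_limit_pct percent of wall-clock time."""
    if not 0 < cpu_limit_pct <= 100:
        raise ValueError("cpu_limit_pct must be in (0, 100]")
    return compute_hours * 100.0 / cpu_limit_pct


if __name__ == "__main__":
    # Illustrative numbers only: an 8-hour task under various limits.
    for pct in (100, 80, 60, 50):
        print(f"{pct:3d}% limit -> {throttled_wall_time(8.0, pct):.1f} h")
```

Under this model a 50% limit doubles the wall-clock time of a task, which would match the kind of slow-performance complaints described above.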

TJ Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0	Message 33185 - Posted: 24 Sep 2013, 22:05:08 UTC - in response to Message 33184.
Hi Matt, I have set my cc_config to not do any benchmarks, using a line posted by someone on the forum. Secondly, I have all my rigs at 100% all the time, and do not use TThrottle to lower temperature. It is a nice program: it lowers CPU and GPU use when the temperature rises above a value set by the user. So for me you might disable all features that suspend or hamper GPUGRID calculation. Even the downclocking of the GPU after a WU has failed or been terminated to prevent a hang, if that is within your power, of course. I think that people who are concerned about heat build-up in their system, and who mostly have expensive graphics cards, could take a different view. Greetings from TJ

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 33186 - Posted: 24 Sep 2013, 22:12:01 UTC
I would, reluctantly, support that approach as a temporary stop-gap only. I say reluctantly, because:
1) In principle, I support permanent diagnosis and robust fixes. Other GPU apps that I run, from other projects, don't suffer unrecoverable errors as a result of suspension. To be fair, that could simply be because other apps don't run as long as GPUGrid - you require longer-term stability than anyone else. And similarly, there are fewer reports of slow running at other projects, again perhaps because it isn't so noticeable on shorter tasks.
2) I suspect this has come to light because a stupid feature was, mistakenly in my view, included in the v7.0.64 client (sorry, we didn't catch it in time), and that client has remained 'recommended' for far longer than expected. Once that aberration has been corrected, I'd be happier if the requisite workaround was removed too - so we end up with the design working as intended (CPU throttling allowed, without disrupting CPU work). We do have to watch out for that '0.5 CPU' definition of coprocessor work, though.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 33188 - Posted: 24 Sep 2013, 22:17:33 UTC - in response to Message 33185.
[quote]Even the downclocking of the GPU after a WU has failed or been terminated to prevent a hang, if that is within your power, of course.[/quote]
Downclocking is a safety feature built into both hardware and drivers since about the Fermi launch. I don't think anyone should try to circumvent that - it would be like welding shut the safety valve on a steam boiler. I'm sure downclocking a GPU is a source of irritation when it happens, but it is better to address the cause than to close your eyes to the symptom. Unlike a steam boiler, a molten GPU is hardly likely to kill anybody - but it could be expensive to replace.

TJ Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0	Message 33190 - Posted: 24 Sep 2013, 22:27:55 UTC - in response to Message 33188.
I agree with you Richard, however this downclocking has occurred frequently on my rig with the 660 since the introduction of the "termination to prevent hang up". I am not sure that it has to do with that, but it seems so. I now have to reboot my rig every other day to solve this, and I am not always there to do so. But we are going off topic, so I'll stop about it. Greetings from TJ

MJH Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0	Message 33194 - Posted: 25 Sep 2013, 8:43:51 UTC - in response to Message 33186.
[quote]I would, reluctantly, support that approach as a temporary stop-gap only[/quote]
To be clear, this is a proposal designed to address only the cases where misconfigured BOINC clients repeatedly suspend the app, causing the runtime to increase dramatically. This change would save the user from having to make a configuration change manually. It is unrelated to the 'access violation' problem. Matt

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 33195 - Posted: 25 Sep 2013, 8:53:23 UTC
I understand that the downclocking that TJ is concerned about (which is permanent, until the next host reboot) is likely to occur if BOINC suspends the app without allowing time for what is known as a 'threadsafe' exit. That's the reason for the note "Do GPU kernels within critical sections" in the BOINC co-processor programming guide.

MJH Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0	Message 33197 - Posted: 25 Sep 2013, 11:32:49 UTC - in response to Message 33195. Last modified: 25 Sep 2013, 11:43:01 UTC
[quote]I understand that the downclocking that TJ is concerned about (which is permanent, until the next host reboot) is likely to occur if BOINC suspends the app without allowing time for what is known as a 'threadsafe' exit.[/quote]
With the current application this should no longer be occurring. The application defers suspension and termination until a safe point. The BOINC library is no longer able to asynchronously suspend the process, since that was a significant cause of instability. Matt
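The idea of deferring suspension to a safe point can be sketched as follows. This is a toy simulation, not the GPUGRID application or the BOINC API: the class `SafePointWorker` and its methods are hypothetical names. A suspend request only sets a flag, and the flag is acted on between units of work, never in the middle of one.

```python
class SafePointWorker:
    """Toy worker that honours suspend requests only between work steps."""

    def __init__(self):
        self.step = 0
        self.suspend_requested = False
        self.log = []

    def request_suspend(self):
        # Only record the request; a step in progress is never interrupted.
        self.suspend_requested = True

    def request_resume(self):
        self.suspend_requested = False

    def run_one_step(self, on_mid_step=None):
        """Run one indivisible unit of work (standing in for a GPU kernel)."""
        self.log.append(f"begin step {self.step}")
        if on_mid_step is not None:
            on_mid_step(self)  # a request may arrive while the step runs
        self.log.append(f"end step {self.step}")
        self.step += 1

    def at_safe_point(self):
        """Called between steps; returns True if the worker should pause now."""
        if self.suspend_requested:
            self.log.append(f"suspended at safe point before step {self.step}")
            return True
        return False


if __name__ == "__main__":
    worker = SafePointWorker()
    # The suspend request arrives mid-step but takes effect only afterwards.
    worker.run_one_step(on_mid_step=lambda w: w.request_suspend())
    worker.at_safe_point()
    print("\n".join(worker.log))
```

The design choice illustrated here is that the step always completes before the pause takes effect, which is the property that an asynchronous suspend from outside the process cannot guarantee.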

TJ Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0	Message 33199 - Posted: 25 Sep 2013, 11:53:41 UTC - in response to Message 33197.
[quote][quote]I understand that the downclocking that TJ is concerned about (which is permanent, until the next host reboot) is likely to occur if BOINC suspends the app without allowing time for what is known as a 'threadsafe' exit.[/quote]
With the current application this should no longer be occurring. The application defers suspension and termination until a safe point. The BOINC library is no longer able to asynchronously suspend the process, since that was a significant cause of instability. Matt[/quote]
Well, it happened twice yesterday on my rig with the 660, using app 8.14 (cuda55). Greetings from TJ

MJH Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0	Message 33200 - Posted: 25 Sep 2013, 11:58:07 UTC - in response to Message 33199.
OK TJ, could you please elaborate on the circumstances in a private message to me? Matt

Jacob Klein Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0	Message 33208 - Posted: 25 Sep 2013, 22:16:00 UTC - in response to Message 33200. Last modified: 25 Sep 2013, 22:33:08 UTC
You should not ignore the request to suspend/resume, when benchmarking or when being CPU-throttled. If there was a client bug that was fixed, so be it. If there still exists a client bug, we should find out about it and fix it. If any app, even a GPU app, is told to suspend, for whatever reason (including CPU throttling), it needs to suspend. If the user wants to ensure no CPU throttling, then the user needs to change the setting to ensure no CPU throttling. If the user specified CPU throttling, then tasks will be and should be throttled. Simple as that. That's my opinion.

5pot Joined: 8 Mar 12 Posts: 411 Credit: 2,083,882,218 RAC: 0	Message 33211 - Posted: 26 Sep 2013, 4:17:50 UTC - in response to Message 33208.
[quote]You should not ignore the request to suspend/resume, when benchmarking or when being CPU-throttled. If there was a client bug that was fixed, so be it. If there still exists a client bug, we should find out about it and fix it. If any app, even a GPU app, is told to suspend, for whatever reason (including CPU throttling), it needs to suspend. If the user wants to ensure no CPU throttling, then the user needs to change the setting to ensure no CPU throttling. If the user specified CPU throttling, then tasks will be and should be throttled. Simple as that. That's my opinion.[/quote]
I'm going to err on the side of caution here, and agree with Jacob. It's up to the user to get their settings straight, and not the devs to override potentially "bad" settings without user consent. That is, if I do understand what we're talking about correctly. :) P.S. No matter how much trouble it's causing, in this case, it's up to the user to fix.

Jacob Klein Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0	Message 33227 - Posted: 27 Sep 2013, 19:32:41 UTC - in response to Message 33208.
I'd also like to add another opinion. I've heard of the following change in BOINC 7.2.1+:
"client: don't apply CPU throttling to apps that use < .5 CPUs (like GPU, NCI)."
Is it likely that GPUGrid tasks will show as using 0.5 CPUs or more? If so, then the application will receive suspend/resume events even in 7.2.x. And if it has trouble responding to those events (trouble like task errors, or task hangs), then the problem is in the application, not in the client. And it should be solved (by making it not error/hang), not worked around (i.e., by ignoring suspend/resume events). Suspend/resume events need to be supported, even for GPU applications.
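Read literally, the quoted change note amounts to a simple rule, sketched below. This is an illustration only, not BOINC client code: the function name `is_cpu_throttled` and the constant are hypothetical, and it assumes the comparison is made against the CPU fraction declared for the task.

```python
# Threshold taken from the quoted change note ("apps that use < .5 CPUs").
THROTTLE_EXEMPT_BELOW = 0.5


def is_cpu_throttled(declared_cpus: float, cpu_limit_pct: float) -> bool:
    """Return True if a task would still receive throttling suspend/resume
    events under the rule as quoted (illustrative reading only)."""
    if cpu_limit_pct >= 100:
        return False  # no throttling is configured at all
    # Tasks declaring fewer than 0.5 CPUs are exempt; all others are throttled.
    return declared_cpus >= THROTTLE_EXEMPT_BELOW


if __name__ == "__main__":
    # Illustrative values: a task declaring 0.6 CPUs and one declaring 0.1.
    print(is_cpu_throttled(0.6, 60))
    print(is_cpu_throttled(0.1, 60))
```

Under this reading, a GPU task that declares 0.5 CPUs or more is still throttled after the client change, which is the case the question above is asking about.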

RaymondFO* Joined: 22 Nov 12 Posts: 72 Credit: 14,040,706,346 RAC: 0	Message 33229 - Posted: 27 Sep 2013, 21:30:48 UTC - in response to Message 33227. Last modified: 27 Sep 2013, 21:34:40 UTC
My TITAN uses 0.827 CPU and is subject to this issue, and I have others that exceed 0.50 CPU. So upgrading the BOINC client will not work here. No other computer running GPUGRID is subject to this exception. I have always had CPU time usage at 100% and the multiprocessor usage setting at 100%, and I have never throttled back on any computer. While I have employed a "work-around," I would prefer this issue be resolved, as I have also recently experienced numerous computer and driver crashes. No other computer running GPUGRID has experienced this frequency of crashes, except this one. I am using driver 327.23 for this computer and the others are still using 320.49. The only third-party utility I am using is EVGA Precision X, only to control the fan speeds, not for manual overclocking. Please note this situation does not occur with PrimeGrid or Collatz. Then again, their CPU resource requirement is below 0.10. Thank you. Raymond

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 33230 - Posted: 27 Sep 2013, 22:58:49 UTC - in response to Message 33229.
[quote]My TITAN uses 0.827 CPU and is subject to this issue, and I have others that exceed 0.50 CPU. So upgrading the BOINC client will not work here.[/quote]
I can't see your computers - they are hidden - but could you expand further? Does your Titan actually use 82.7% CPU - i.e., is the CPU time recorded for your results 82.7% of total runtime? The average figure for my Kepler is nearer 98.5%, but for my Fermi more like 7.9%. BOINC Manager displays a status line like 'Running (X.XX CPUs + Y.YY NVIDIA GPUs)'. This, like many of BOINC's numbers, is a server-generated estimate, and bears no relationship to the actual observed behaviour of your own CPU/GPU combination. But it is this estimate which will in the end be compared with David's arbitrary 0.5 CPU threshold "to throttle, or not to throttle". It would be best to devote our efforts to educating David Anderson about the behaviour and requirements of real GPUs in the real world, and get him to program BOINC accordingly - remembering not to pretend that the best settings for any one single project are necessarily correct across the board.

skgiven Volunteer moderator Volunteer tester Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0	Message 33231 - Posted: 28 Sep 2013, 11:02:17 UTC - in response to Message 33230. Last modified: 28 Sep 2013, 11:06:33 UTC
I expect the 82.7% is just what's being reported in BOINC Task Properties (and what is used by the scheduler). ALL Keplers (GK104 and GK110) use a full CPU when running GPUGrid tasks (except Noelia tasks, which use around 30 or 40%) - it's in the app that they do this, but the scheduler uses a different system (the app and scheduler ignore each other). Fermis don't use a full CPU core. If a non-Noelia task uses significantly less than a full CPU (as in <95%) then it suggests there might be a problem - the CPU is being overused (by other apps), the CPU is not capable of supporting the GPU fully (architecture or setup), or some other hardware is limiting the GPU (PCIE, RAM speeds, clogged up drive). The BOINC scheduler thinks my GTX660Ti & GTX660 use 0.593 CPUs (W7, i7-3770K@4.2GHz), and my GTX670 uses 0.682 CPUs (Ubuntu G2020@2.9GHz). Although some of the 100% CPU for one WU is polling, if you set it to less, task runtimes increase, by an amount that varies by task type. This would be more significant with the GK110 cards.
There are four processor usage BOINC settings that could cause the GPU to suspend and resume frequently, under "Computing allowed":
* While computer is in use (when not ticked)
* Use GPU while computer is in use (when not ticked)
* While processor usage is less than (anything other than zero)
* Use at most % CPU time (anything other than 100%)
Maybe using some UPSs and having 'While computer is on batteries' unselected could cause issues, or if a laptop battery is misreporting its connection state (OS or OS driver bugs). In my experience some default settings and recommended settings at other projects are unsuitable for crunching at GPUGrid.
It would be useful if there was some way to automatically select recommended settings from your online account (or a Project Button in BM). It would also be useful if, when you allow others to see your systems, they could also see your BOINC configurations, or at least if there was an option for moderators to see these.

RaymondFO* Joined: 22 Nov 12 Posts: 72 Credit: 14,040,706,346 RAC: 0	Message 33233 - Posted: 28 Sep 2013, 17:11:39 UTC - in response to Message 33231.
[quote]I expect the 82.7% is just what's being reported in Boinc Task Properties[/quote]
Reply: That is correct. When the task is suspended it will initially read as "Scheduler wait: access violation" and then later as "Waiting to run", while other GPUGRID tasks will now run.
[quote]If a non-Noelia task uses significantly less than a full CPU (as in <95%) then it suggests there might be a problem - the CPU is being overused (by other apps), the CPU is not capable of supporting the GPU fully (architecture or setup), or some other hardware is limiting the GPU (PCIE, RAM speeds, clogged up drive).[/quote]
Reply: This may be true, but when app 8.13 with the 326.80 beta driver was being used this never happened. Should I use an app info file and force it to carve out 1 CPU for the video card?
[quote]There are four processor usage Boinc settings that could cause the GPU to suspend and resume frequently:[/quote]
* While computer is in use (when not ticked) - Ticked
* Use GPU while computer is in use (when not ticked) - Ticked
* While processor usage is less than (anything other than zero) - is zero
* Use at most % CPU time (anything other than 100%) - 100%
[quote]Maybe using some UPS's and having 'While computer is on batteries' unselected could cause issues, or if a laptop battery is misreporting its connection state (OS or OS driver bugs).[/quote]
Reply: This is ticked (selected).
[quote]In my experience some default settings and recommended settings at other projects are unsuitable for crunching at GPUGrid. It would be useful if there was some sort of way to automatic select recommended settings from your online account (or a Project Button in BM). It would also be useful if when you allow others to see your systems, they can also see your Boinc configurations, or at least an option for moderators to see these.[/quote]
System information:
BOINC version 7.0.28
Genuine Intel(R) Xeon(R) CPU 1366 @ 3.20GHz [Family 6 Model 44 Stepping 2] (12 processors)
[2] NVIDIA GeForce GTX TITAN (4095MB) driver: 327.23
Microsoft Windows 8 x64 Edition, home edition
12 GB RAM
This behavior occurs when Asteroids@home, PrimaBoinca or FightMalaria@home are the CPU projects. This is a dedicated cruncher with no other work being processed. Any ideas are appreciated. Thank you.

Richard Haselgrove Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 0	Message 33234 - Posted: 28 Sep 2013, 17:45:22 UTC - in response to Message 33231.
[quote]There are four processor usage Boinc settings that could cause the GPU to suspend and resume frequently: Computing allowed,
* While computer is in use (when not ticked)
* Use GPU while computer is in use (when not Ticked)
* While processor usage is less than (anything other than zero)
* Use at most % CPU time (anything other than 100%)[/quote]
Anybody reading and following that advice needs to be reminded that there are two different ways of setting computing preferences:
1) Via the Computing preferences link on your account page
2) Via the Tools | Computing preferences... menu in BOINC Manager.
Some web-based preference setting tools have the opposite sense to the list you quote: for example, the first one reads "Suspend work while computer is in use?" when using the web-based preference setting. To avoid suspending tasks while you are using the computer, uncheck (untick) the web setting, or check (tick) the local setting. And so on. Whichever set you use, read the wording carefully. Users should try to be clear in their own mind whether they are using web settings or local settings, and use one technique exclusively. In particular, even simply viewing the settings locally, and closing the dialog by clicking the OK button, creates a complete snapshot of the current active settings and uses it exclusively from that point onwards: future web-based preference changes will be ignored.

ExtraTerrestrial Apes Volunteer moderator Volunteer tester Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0	Message 33236 - Posted: 28 Sep 2013, 18:10:27 UTC
Raymond: not sure if you understood it correctly. Whatever CPU usage number BOINC Manager shows has absolutely no effect on how GPU-Grid runs. The app takes its core, no matter what. The only effect this percentage has is to help BOINC schedule tasks, i.e. how many tasks of which type to run. So long as you don't have any problems with that, don't bother changing this value via an app_config. You describe your problems as "numerous computer and driver crashes", which only your Titan box has. Some will argue that it's possible that the number of tasks will result in instability, but I think this is not at all common. Instead I see a few other possible reasons:
- GPU-Grid taxes the GPU more than most projects -> it could push your power supply beyond stability limits, or heat the system up too much
- your Titan might not be OK -> I'd try downclocking it a bit and/or lowering the power target (this would also help the PSU and heat)
MrS
Scanning for our furry friends since Jan 2002

ExtraTerrestrial Apes Volunteer moderator Volunteer tester Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0	Message 33237 - Posted: 28 Sep 2013, 18:45:40 UTC
@MJH: obeying the suspend requests is surely the safe way. I'm not sure what unwanted side effects an override may have now and in the future. On the other hand the problem of horrible performance due to this throttling setting is real and it would be nice to find a solution or at least some relief. What I could imagine is to add this topic into the FAQ / system requirements / getting started / whatever sections. What I'd put there:
- less than 100% CPU time is bad
- if a Kepler GPU with boost gets too hot / loud / expensive: reduce the power target, as this makes the chip run more efficiently
MrS