strange behaviour...
| Author | Message |
|---|---|
MJH Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0
|
Richard, What's this CPU throttling thing? Do you know how it works? There's nothing germane in the library code so presumably it's all in the client. Matt |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 318
|
Richard, Yes, in the client. It's meant for thermal control of CPUs, and it dates back to the early days of BOINC. If you look at the Computing preferences on your account here, the bottom item under Processor usage is: Use at most The implementation is crude: they wanted it to use the same source code on every platform, and there isn't a fine control like that. It therefore operates at a granularity of 1 second, so capeITLabs' 75% would have been 3 seconds on and 1 second off. That, of course, means three eternities on and one eternity off at the speeds GPUs operate. David Anderson made a gut reaction to a single user's request on the mailing list back in January: http://lists.ssl.berkeley.edu/pipermail/boinc_dev/2013-January/019305.html - I'm sure you can think of such a reason. That emerged in version 7.0.45. It was removed with v7.2.1. You might like to look at the note: client: don't apply CPU throttling to apps that use < .5 CPUs (like GPU, NCI). and http://boinc.berkeley.edu/trac/changeset/4cb34a123aacfaccc28b5f1f76717864b0b63a57/boinc-v2 with respect to the requested CPU reservation for Keplers and above (and make the same suggestion to any OpenCL developers you know). Links to the earlier changesets are contained in my email at http://lists.ssl.berkeley.edu/pipermail/boinc_dev/2013-July/020131.html Any casual reader here who wishes to apply thermal control to their CPU or GPU under Windows (only) would be better advised to consider TThrottle |
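To make the "3 seconds on and 1 second off" arithmetic concrete, here is a minimal sketch of how a percentage could map onto whole-tick run/suspend periods. It is not taken from the BOINC client source; the function name `duty_cycle` and the `tick_seconds` parameter are hypothetical and only illustrate the 1-second granularity described above.

```python
from fractions import Fraction


def duty_cycle(cpu_percent: int, tick_seconds: float = 1.0) -> tuple[float, float]:
    """Return (run_seconds, suspend_seconds) for the shortest whole-tick cycle.

    Illustrative only: tick_seconds models the throttling granularity
    (1 second in the description above); this is not BOINC's actual code.
    """
    if not 0 < cpu_percent <= 100:
        raise ValueError("cpu_percent must be in the range 1..100")
    ratio = Fraction(cpu_percent, 100)  # e.g. 75/100 reduces to 3/4
    on_ticks = ratio.numerator
    off_ticks = ratio.denominator - ratio.numerator
    return on_ticks * tick_seconds, off_ticks * tick_seconds


if __name__ == "__main__":
    print(duty_cycle(75))        # whole-second ticks
    print(duty_cycle(75, 0.25))  # same ratio with a hypothetical finer tick
```

With a 1-second tick, 75% comes out as 3 seconds of running followed by 1 second suspended; a smaller tick keeps the same ratio but shortens each pause, which is the part that matters for a GPU feeder thread.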
MJH Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0
|
Thanks Richard, I guess I'd better take a look and see exactly how this third suspend-resume mechanism works under the hood. MJH |
|
Joined: 25 Apr 13 Posts: 27 Credit: 240,283,511 RAC: 0
|
660 ... have aborted run ... just installed latest driver. |
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
660 ... have aborted run ... just installed latest driver. I have updated the drivers on my two GTX 660s to 327.23 and completed my first Noelia with no problems (4-NOELIA_INS1P-9-15-RND4205_0 14:09:09). Each card is running another Noelia with no problems thus far, so I will let them run and see what happens. |
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 318
|
Thanks Richard, I don't know whether this is coincidence, or whether you've been in communication behind the scenes, but David Anderson has just started work on a better throttling solution. "client: preliminary implementation (commented out) of sub-second throttling" http://boinc.berkeley.edu/trac/changeset/ebde7809ceaca8cc35d75c2a2b5adc32c19694e5/boinc-v2 http://boinc.berkeley.edu/trac/changeset/35f489d36f4c7734d13f76af5844ec42d244be59/boinc-v2 |
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
I'm against coarse-grained throttling for thermal control as it's inefficient for any hardware using adaptive power states (like boosting nVidias and Intels + AMDs with Turbo). The reason: during activity the hardware boosts into the maximum power state supported, which implies a high voltage and lower power efficiency, whereas during idle periods it obviously does nothing. If the throttling were fine-grained, the hardware could adjust to the requested performance level, sustain a lower power state (lower voltage, higher power efficiency) and achieve the same throughput. Starting and stopping this often is inefficient from a software perspective, though. At least for GPU-Grid there's a far better solution: simply lower the GPU's power target and leave it at 100% time. It will take care of adjusting clocks and voltages down by itself. The downside of this: it requires the user to use tuning software, since this is not even available in nVidia's control panel under Win (just checked mine). Let alone Linux or Mac OS, which generally don't have working hardware control software. Adjusting the power target down for CPUs is also not as easy as it should be, given that Intel's mobile chips already support cTDP in principle. And with AMD GPUs boosting is not yet as widespread, efficient and controllable as for the green team :/ MrS Scanning for our furry friends since Jan 2002 |
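The efficiency argument above can be put into rough numbers with the textbook approximation for dynamic CMOS power, P ≈ C·V²·f. The sketch below is only a back-of-the-envelope illustration: the voltages, clock and capacitance are made-up values, leakage/static power is ignored, and `dynamic_energy` is a hypothetical helper, not a measurement of any real card.

```python
def dynamic_energy(voltage: float, freq_hz: float, active_seconds: float,
                   capacitance: float = 1.0) -> float:
    """Energy from the dynamic-power approximation P = C * V^2 * f.

    Static (leakage) power is deliberately ignored, so treat the result
    as a rough comparison, not a prediction.
    """
    return capacitance * voltage ** 2 * freq_hz * active_seconds


# Hypothetical numbers, chosen only for illustration.
V_BOOST, V_LOW = 1.2, 0.9   # volts at the boost state and at a lower power state
F_BOOST = 1.0e9             # Hz at the boost state

# Coarse throttling: full boost clock and voltage, but active only half the time.
coarse = dynamic_energy(V_BOOST, F_BOOST, active_seconds=0.5)

# Fine-grained scaling: half the clock at the lower voltage, active all the time.
fine = dynamic_energy(V_LOW, F_BOOST / 2, active_seconds=1.0)

# Both cases complete the same number of clock cycles per second of wall time.
ratio = fine / coarse       # reduces to (V_LOW / V_BOOST) ** 2

if __name__ == "__main__":
    print(f"fine-grained / coarse energy ratio: {ratio:.4f}")
```

Under these assumptions the two approaches do the same work, and the energy ratio depends only on the squared voltage ratio, which is why staying in a lower power state wins whenever the hardware can actually drop its voltage.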
|
Joined: 25 Apr 13 Posts: 27 Credit: 240,283,511 RAC: 0
|
I91R9-NATHAN_KIDc22_glu-7-10-RND1126_1 Has been running for over 49 hours ... elapsed time increases, remaining time barely moves, but increases. |
|
Joined: 17 Feb 13 Posts: 181 Credit: 144,871,276 RAC: 0
|
Hi, GPUGrid Folks: Short run task has been grinding away for over 14 h...... 251-NOELIA_CRYST1-9-12-RND5111_0 (60% complete) :( John |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
Paul and John, If the GPU is cooler than normal, I suggest shutting the system down, turning the PSU off for a minute and then turning it back on and starting the system up again - doing this allowed me to finish a WU that had run for 5 days (but had really stopped after about 6h). Keep an eye on your runtime before and after you restart. FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help |
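For anyone who would rather not watch the runtime by hand, the symptom described in this thread (elapsed time keeps climbing while progress barely moves) can be checked mechanically. This is a minimal sketch under stated assumptions: `is_stalled`, its thresholds, and the idea of feeding it hand-noted (elapsed seconds, fraction done) pairs are all hypothetical, and it does not talk to the BOINC client.

```python
def is_stalled(samples, window_seconds=3600.0, min_progress=0.001):
    """Return True if progress gained less than min_progress over the window.

    samples: list of (elapsed_seconds, fraction_done) pairs in time order,
    e.g. noted down from the task list. Thresholds are arbitrary defaults.
    """
    if len(samples) < 2:
        return False
    latest_elapsed, latest_done = samples[-1]
    # Walk backwards to the newest sample that is at least one window old.
    for elapsed, done in reversed(samples[:-1]):
        if latest_elapsed - elapsed >= window_seconds:
            return (latest_done - done) < min_progress
    return False  # not enough history to decide yet


if __name__ == "__main__":
    # Elapsed time grows by an hour but the completed fraction does not move.
    print(is_stalled([(0, 0.60), (3600, 0.60)]))
```

A check like this only flags the condition; the fix discussed in the thread is still the full power-off restart.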
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
If the GPU is cooler than normal, I suggest shutting the system down, turning the PSU off for a minute and then turning it back on and starting the system up again - doing this allowed me to finish a WU that had run for 5 days (but had really stopped after about 6h). That fixed it for me with I18R10-NATHAN_KIDc22_glu-8-10-RND4986_1, which was taking 30 hours to complete on a GTX 660 (327.23 drivers). It had previously completed three others in the NATHAN_KIDc22 series with no problems in about 12 hours. That is unfortunately not a practical solution for me, since I lost 10 hours of CEP2 work running on the CPU. It seems to be more of a problem with the mid-range cards (GTX 660, 660 Ti). Are the 700 series cards immune? |
|
Joined: 17 Feb 13 Posts: 181 Credit: 144,871,276 RAC: 0
|
Many thanks, skgiven. Problem fixed. I had hoped to run these tasks in a 'set and forget' mode, but that may not be possible. Being unable to sleep last night, I took a peek at my machine at around 05:00h to see if all is well and that's when I discovered the long run. I will try again and if the problem recurs I will make the suggested fix. Thanks again, John |
|
Joined: 15 May 11 Posts: 108 Credit: 297,176,099 RAC: 0
|
It seems to be more of a problem with the mid-range cards (GTX 660, 660 Ti). Are the 700 series cards immune? Depends on whether the same thing that is causing your system to just stop processing is the same thing that causes 780s/Titans to have constant "Access violations" and app restarts. Could be the same thing causing different symptoms using different GPUs. I think it's all down to 8.14 myself. Operator. |
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
Depends on whether the same thing that is causing your system to just stop processing is the same thing that causes 780s/Titans to have constant "Access violations" and app restarts. The Memory Controller Load apparently runs at a constant 14% rate when it is running slowly, so I doubt that it is the start/stop condition. (It should run about 30% normally on these work units.) I know they had a similar problem with the older apps (before the 8 series), particularly with the GTX 660s, and thought it might have been solved. Otherwise, the 8.14 app works very nicely that I can see, except for one Noelia that errored out, but no crashes or other bad behavior. I hope they can get the last wrinkles ironed out for the mid-range cards, and also for the 700 cards or else there is not much incentive to upgrade to those. |
|
Joined: 25 Apr 13 Posts: 27 Credit: 240,283,511 RAC: 0
|
Shut the machine down while I went to work, 12 hrs later turned it back on. The elapsed time increases, the remaining stagnant ? |
|
Joined: 25 Apr 13 Posts: 27 Credit: 240,283,511 RAC: 0
|
This is the second run I have aborted. My GPUGRID credits are decreasing because I am running programs that don't work and I have to abort. |
|
Joined: 25 Apr 13 Posts: 27 Credit: 240,283,511 RAC: 0
|
All this started happening just recently ... |
|
Joined: 26 Jun 09 Posts: 815 Credit: 1,470,385,294 RAC: 0
|
Depends on whether the same thing that is causing your system to just stop processing is the same thing that causes 780s/Titans to have constant "Access violations" and app restarts. Have you checked whether the GPU clock still runs at full load (the load you want or have set)? I have had a lot of trouble with my 660's, even bought a new motherboard. They do fine now with the beta's and long runs and 8.14. Short runs (still) give the most problems. My 770 from Asus is almost error free with all types of WU, and moreover most WU's don't even stop en route, they run in one go. We can now see that with the new stderr Matt has made. So in my new builds only 770, 780 and Titan. Greetings from TJ |
|
Joined: 25 Apr 13 Posts: 27 Credit: 240,283,511 RAC: 0
|
If nothing will fix this, I will delete GPUGRID and run another BOINC program. |
|
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
|
Have you checked if the GPU clock runs still at full load (to load you want or have set to)? I have had a lot of troubles with my 660's, even bought a new motherboard. They do fine now with the beta's and long runs and 8.14. Short runs give (still) the most problems. Yes, the GPU clock shows running at full speed on GPU-Z. It is normally 993 MHz as set by the card, but I had reduced it to 980 MHz (hardly a difference) and also bumped up the core voltage slightly (by 12.5 mV) with Nvidia Inspector. But there was no obvious down-clocking, as was a problem for some Nvidia cards a few years ago. But maybe not all the relevant clocks are shown by GPU-Z? It is nothing I can fix at any rate, and I have seen no reports of such problems for these current drivers. It is on a Z77 motherboard with an Ivy Bridge i7-3770, with each GPU supported by a virtual CPU core, so that should not be a limitation. And the fact that a reboot fixes it would indicate that it is a software, not a hardware problem (to me at any rate). There was some speculation earlier on various reasons that some cards were affected and others weren't, such as cache size, memory bandwidth, etc., but I don't think any definitive answer has been found. It is apparently something only GPUGrid can fix. |
©2025 Universitat Pompeu Fabra