strange behaviour...

MJH
Message 33109 - Posted: 19 Sep 2013, 22:14:28 UTC - in response to Message 33107.  

Richard,

What's this CPU throttling thing? Do you know how it works? There's nothing germane in the library code, so presumably it's all in the client.

Matt

Richard Haselgrove
Message 33111 - Posted: 19 Sep 2013, 23:26:00 UTC - in response to Message 33109.  

Richard,

What's this CPU throttling thing? Do you know how it works? There's nothing germane in the library code, so presumably it's all in the client.

Matt

Yes, in the client. It's meant for thermal control of CPUs, and it dates back to the early days of BOINC. If you look at the Computing preferences on your account here, the bottom item under Processor usage is:

Use at most
Can be used to reduce CPU heat 100% of CPU time

The implementation is crude: they wanted it to use the same source code on every platform, and there isn't a portable fine-grained control for that. So it operates at a granularity of 1 second: capeITLabs' 75% would have been 3 seconds on and 1 second off. That, of course, means three eternities on and one eternity off at the speeds GPUs operate.
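
To make the arithmetic concrete, here is a minimal sketch of a duty cycle with 1-second ticks. It is not taken from the BOINC client source; the function name `throttle_schedule` and the credit-accumulator approach are assumptions chosen purely for illustration.

```python
def throttle_schedule(cpu_limit_pct, seconds):
    """Return one boolean per 1-second tick: True = run, False = suspend.

    Hypothetical illustration only; not the BOINC client's implementation.
    """
    schedule = []
    credit = 0  # accumulated "permission to run", in percent
    for _ in range(seconds):
        credit += cpu_limit_pct
        if credit >= 100:
            # Enough credit for a full second of running.
            schedule.append(True)
            credit -= 100
        else:
            # Not enough credit yet: the task sits suspended for this second.
            schedule.append(False)
    return schedule


# A 75% limit gives 3 running seconds out of every 4:
# throttle_schedule(75, 8) -> [False, True, True, True, False, True, True, True]
```

Whatever the exact ordering of the on and off seconds, the point is the same: the smallest unit of suspension is a whole second.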

David Anderson made a gut reaction to a single user's request on the mailing list back in January: http://lists.ssl.berkeley.edu/pipermail/boinc_dev/2013-January/019305.html - I'm sure you can think of such a reason. That emerged in version 7.0.45.

It was removed with v7.2.1. You might like to look at the note:

client: don't apply CPU throttling to apps that use < .5 CPUs (like GPU, NCI).

and http://boinc.berkeley.edu/trac/changeset/4cb34a123aacfaccc28b5f1f76717864b0b63a57/boinc-v2 with respect to the requested CPU reservation for Keplers and above (and make the same suggestion to any OpenCL developers you know).
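
The rule in that note can be sketched as a simple predicate. This is only an illustration of the quoted one-liner, not the client's code; the `Task` class, its `avg_ncpus` field and the function name `is_throttled` are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    avg_ncpus: float  # CPU fraction the app asks the client to reserve (assumed field)


def is_throttled(task, threshold=0.5):
    """Apply CPU throttling only to tasks that use at least `threshold` CPUs.

    Mirrors the note "don't apply CPU throttling to apps that use < .5 CPUs";
    everything else here is hypothetical.
    """
    return task.avg_ncpus >= threshold


gpu_task = Task("gpu_app_small_reservation", avg_ncpus=0.2)
cpu_task = Task("ordinary_cpu_app", avg_ncpus=1.0)
full_core_gpu_task = Task("gpu_app_reserving_full_core", avg_ncpus=1.0)
```

Under this reading, a GPU app that reserves a whole CPU core would fall on the throttled side of the threshold, which is why the requested CPU reservation matters.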

Links to the earlier changesets are contained in my email at http://lists.ssl.berkeley.edu/pipermail/boinc_dev/2013-July/020131.html

Any casual reader here who wishes to apply thermal control to their CPU or GPU under Windows (only) would be better advised to consider TThrottle.

MJH
Message 33112 - Posted: 19 Sep 2013, 23:29:57 UTC - in response to Message 33111.  

Thanks Richard,

I guess I'd better take a look and see exactly how this third suspend-resume mechanism works under the hood..

MJH

Paul
Message 33117 - Posted: 20 Sep 2013, 20:40:06 UTC - in response to Message 33094.  

660 ... have aborted run ... just installed latest driver.

Jim1348
Message 33120 - Posted: 20 Sep 2013, 23:16:27 UTC - in response to Message 33117.  

660 ... have aborted run ... just installed latest driver.

I have updated the drivers on my two GTX 660s to 327.23 and completed my first Noelia with no problems (4-NOELIA_INS1P-9-15-RND4205_0 14:09:09). Each card is running another Noelia with no problems thus far, so I will let them run and see what happens.

Richard Haselgrove
Message 33122 - Posted: 21 Sep 2013, 7:21:32 UTC - in response to Message 33112.  

Thanks Richard,

I guess I'd better take a look and see exactly how this third suspend-resume mechanism works under the hood..

MJH

I don't know whether this is coincidence, or whether you've been in communication behind the scenes, but David Anderson has just started work on a better throttling solution.

"client: preliminary implementation (commented out) of sub-second throttling"
http://boinc.berkeley.edu/trac/changeset/ebde7809ceaca8cc35d75c2a2b5adc32c19694e5/boinc-v2
http://boinc.berkeley.edu/trac/changeset/35f489d36f4c7734d13f76af5844ec42d244be59/boinc-v2

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 33124 - Posted: 21 Sep 2013, 11:41:09 UTC

I'm against coarse-grained throttling for thermal control as it's inefficient for any hardware using adaptive power states (like boosting nVidias and Intels + AMDs with Turbo). The reason: during activity the hardware boosts into the maximum power state supported, which implies a high voltage and lower power efficiency, whereas during idle periods it obviously does nothing.

If the throttling were fine-grained, the hardware could adjust to the requested performance level and sustain a lower power state (lower voltage, higher power efficiency) while achieving the same throughput. Starting and stopping this often is inefficient from a software perspective, though.
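
A rough worked example of the efficiency argument, assuming the usual first-order model that dynamic power scales with voltage squared times frequency. All the voltages, clocks and the 50% duty cycle below are made-up illustrative numbers, not measurements of any real card, and idle and leakage power are ignored.

```python
def dynamic_power(voltage, frequency, capacitance=1.0):
    """First-order dynamic power model: P ~ C * V^2 * f (arbitrary units)."""
    return capacitance * voltage ** 2 * frequency


def average_power_duty_cycled(voltage, frequency, duty):
    """Average power when running at (voltage, frequency) for a fraction `duty`
    of the time and doing nothing otherwise (idle power ignored)."""
    return duty * dynamic_power(voltage, frequency)


# Coarse throttling: boost state (hypothetical 1.2 V, relative clock 1.0),
# running half of the time, so throughput is 0.5 in relative units.
coarse = average_power_duty_cycled(voltage=1.2, frequency=1.0, duty=0.5)

# Sustained lower power state with the same throughput: relative clock 0.5
# all the time, at a hypothetical lower voltage of 0.9 V.
fine = dynamic_power(voltage=0.9, frequency=0.5)

print(f"coarse duty-cycled: {coarse:.3f}, sustained low state: {fine:.3f}")
```

With these assumed numbers the sustained low state draws noticeably less average power for the same work, which is the whole argument in miniature.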

At least for GPU-Grid there's a far better solution: simply lower the GPU's power target and leave it at 100% time. It will take care of adjusting clocks and voltages down by itself. The downside of this: it requires the user to use tuning software, since this is not even available in nVidia's control panel under Win (just checked mine). Let alone Linux or Mac OS, which generally don't have working hardware control software.

Adjusting the power target down for CPUs is also not as easy as it should be.. given Intel's mobile chips already support cTDP in principle. And with AMD GPUs boosting is not yet as widespread, efficient and controllable as for the green team :/

MrS
Scanning for our furry friends since Jan 2002

Paul
Message 33163 - Posted: 23 Sep 2013, 9:26:42 UTC - in response to Message 33117.  

I91R9-NATHAN_KIDc22_glu-7-10-RND1126_1

Has been running for over 49 hours ... elapsed time increases, remaining time barely moves, but increases.

John C MacAlister
Message 33165 - Posted: 23 Sep 2013, 9:54:28 UTC - in response to Message 33163.  
Last modified: 23 Sep 2013, 10:47:50 UTC

Hi, GPUGrid Folks:

Short run task has been grinding away for over 14 h......


251-NOELIA_CRYST1-9-12-RND5111_0 (60% complete)

:(

John

skgiven
Volunteer moderator
Volunteer tester
Message 33166 - Posted: 23 Sep 2013, 11:21:25 UTC - in response to Message 33165.  
Last modified: 23 Sep 2013, 11:22:51 UTC

Paul and John,

If the GPU is cooler than normal, I suggest shutting the system down, turning the PSU off for a minute and then turning it back on and starting the system up again - doing this allowed me to finish a WU that had run for 5 days (but had really stopped after about 6h). Keep an eye on your runtime before and after you restart.
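
For anyone who wants to check this from noted-down numbers rather than by eye, here is a minimal sketch of the "elapsed time goes up, progress doesn't" test. The function name `looks_stalled`, the sample format and the threshold are all assumptions; the samples would have to be read off BOINC Manager (or similar) by whatever means you have.

```python
def looks_stalled(samples, min_progress=0.001):
    """Decide whether a task appears stuck.

    samples: list of (elapsed_seconds, fraction_done) pairs in time order,
             with fraction_done between 0.0 and 1.0.
    Returns True if elapsed time advanced but progress moved by less than
    `min_progress` between the first and last sample. Hypothetical helper.
    """
    if len(samples) < 2:
        return False  # not enough data to judge
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    return t1 > t0 and (p1 - p0) < min_progress


# Example: one hour passes and the task still shows 60% done.
stuck = looks_stalled([(21600, 0.60), (25200, 0.60)])
healthy = looks_stalled([(0, 0.00), (3600, 0.10)])
```

Taking one set of readings before the restart and another after it gives a quick before/after comparison.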

Jim1348
Message 33167 - Posted: 23 Sep 2013, 12:35:59 UTC - in response to Message 33166.  

If the GPU is cooler than normal, I suggest shutting the system down, turning the PSU off for a minute and then turning it back on and starting the system up again - doing this allowed me to finish a WU that had run for 5 days (but had really stopped after about 6h).

That fixed it for me with I18R10-NATHAN_KIDc22_glu-8-10-RND4986_1, which was taking 30 hours to complete on a GTX 660 (327.23 drivers). It had previously completed three others in the NATHAN_KIDc22 series with no problems in about 12 hours.

That is unfortunately not a practical solution for me, since I lost 10 hours of CEP2 work running on the CPU. It seems to be more of a problem with the mid-range cards (GTX 660, 660 Ti). Are the 700 series cards immune?

John C MacAlister
Message 33168 - Posted: 23 Sep 2013, 12:57:43 UTC - in response to Message 33166.  
Last modified: 23 Sep 2013, 12:58:18 UTC

Many thanks, skgiven. Problem fixed. I had hoped to run these tasks in a 'set and forget' mode, but that may not be possible. Being unable to sleep last night, I took a peek at my machine at around 05:00h to see if all is well and that's when I discovered the long run.

I will try again and if the problem recurs I will make the suggested fix.

Thanks again,

John

Operator
Message 33171 - Posted: 23 Sep 2013, 16:04:33 UTC - in response to Message 33167.  

It seems to be more of a problem with the mid-range cards (GTX 660, 660 Ti). Are the 700 series cards immune?


Depends on whether the same thing that is causing your system to just stop processing is the same thing that causes 780s/Titans to have constant "Access violations" and app restarts.

Could be the same thing causing different symptoms using different GPUs.

I think it's all down to 8.14 myself.

Operator.

Jim1348
Message 33172 - Posted: 23 Sep 2013, 18:17:05 UTC - in response to Message 33171.  

Depends on whether the same thing that is causing your system to just stop processing is the same thing that causes 780s/Titans to have constant "Access violations" and app restarts.

The Memory Controller Load apparently runs at a constant 14% rate when it is running slowly, so I doubt that it is the start/stop condition. (It should run about 30% normally on these work units.) I know they had a similar problem with the older apps (before the 8 series), particularly with the GTX 660s, and thought it might have been solved. Otherwise, the 8.14 app works very nicely that I can see, except for one Noelia that errored out, but no crashes or other bad behavior. I hope they can get the last wrinkles ironed out for the mid-range cards, and also for the 700 cards or else there is not much incentive to upgrade to those.

Paul
Message 33173 - Posted: 23 Sep 2013, 19:45:52 UTC - in response to Message 33166.  

Shut the machine down while I went to work, 12 hrs later turned it back on.

The elapsed time increases, but the remaining time is stagnant?

Paul
Message 33174 - Posted: 23 Sep 2013, 20:06:18 UTC - in response to Message 33173.  

This is the second run I have aborted. My GPUGRID credits are decreasing because I am running programs that don't work and I have to abort.

Paul
Message 33175 - Posted: 23 Sep 2013, 20:11:55 UTC - in response to Message 33174.  

All this started happening just recently ...

TJ
Message 33176 - Posted: 23 Sep 2013, 21:26:38 UTC - in response to Message 33172.  

Depends on whether the same thing that is causing your system to just stop processing is the same thing that causes 780s/Titans to have constant "Access violations" and app restarts.

The Memory Controller Load apparently runs at a constant 14% rate when it is running slowly, so I doubt that it is the start/stop condition. (It should run about 30% normally on these work units.) I know they had a similar problem with the older apps (before the 8 series), particularly with the GTX 660s, and thought it might have been solved. Otherwise, the 8.14 app works very nicely that I can see, except for one Noelia that errored out, but no crashes or other bad behavior. I hope they can get the last wrinkles ironed out for the mid-range cards, and also for the 700 cards or else there is not much incentive to upgrade to those.

Have you checked whether the GPU clock still runs at full speed (the speed you want or have set it to)? I have had a lot of trouble with my 660s, and even bought a new motherboard. They do fine now with the betas and long runs and 8.14. Short runs (still) give the most problems.
My 770 from Asus is almost error free with all types of WU, and moreover most WUs don't even stop en route, they run in one go. We can now see that with the new stderr Matt has made. So in my new builds only 770, 780 and Titan.

Greetings from TJ

Paul
Message 33178 - Posted: 23 Sep 2013, 22:51:42 UTC - in response to Message 33175.  

If nothing will fix this, I will delete GPUGRID and run another BOINC program.

Jim1348
Message 33179 - Posted: 24 Sep 2013, 0:59:01 UTC - in response to Message 33176.  
Last modified: 24 Sep 2013, 1:04:20 UTC

Have you checked whether the GPU clock still runs at full speed (the speed you want or have set it to)? I have had a lot of trouble with my 660s, and even bought a new motherboard. They do fine now with the betas and long runs and 8.14. Short runs (still) give the most problems.

Yes, the GPU clock shows as running at full speed in GPU-Z. It is normally 993 MHz as set by the card, but I had reduced it to 980 MHz (hardly a difference) and also bumped up the core voltage slightly (by 12.5 mV) with Nvidia Inspector. But there was no obvious down-clocking, as was a problem for some Nvidia cards a few years ago. But maybe not all the relevant clocks are shown by GPU-Z? It is nothing I can fix at any rate, and I have seen no reports of such problems for these current drivers. It is on a Z77 motherboard with an Ivy Bridge i7-3770, with each GPU supported by a virtual CPU core, so that should not be a limitation. And the fact that a reboot fixes it would indicate that it is a software, not a hardware problem (to me at any rate).

There was some speculation earlier on various reasons that some cards were affected and others weren't, such as cache size, memory bandwidth, etc., but I don't think any definitive answer has been found. It is apparently something only GPUGrid can fix.

©2025 Universitat Pompeu Fabra