acemdlong application 8.14 - discussion

skgiven
Volunteer moderator
Volunteer tester
Message 32978 - Posted: 15 Sep 2013, 21:10:30 UTC - in response to Message 32970.  
Last modified: 15 Sep 2013, 21:52:55 UTC

I think Richard's explanation sums up what's going on well.

If a WU fails and the app doesn't crash out or fail the WU, but wants to restart it, the WU is stopped. As soon as this happens another WU will run (if it's already downloaded). In TJ's case the fix worked, but for Operator (who may have a dud card or some other issue), WUs keep trying to recover, one after the other.
Perhaps there should be a limit on the number of attempts to recover, say 20?

Operator, I thought you may have a dud card but I now think it's just a cooling issue.

The Access violations are occurring just after GPU0 reaches a high temp, usually 79°C or above - check your logs:

# GPU 0 : 80C
# GPU 1 : 74C
# Access violation : progress made, try to restart
# GPU [GeForce GTX TITAN] Platform [Windows] Rev [3203] VERSION [55]
# SWAN Device 1 :
# Name : GeForce GTX TITAN
# ECC : Disabled
# Global mem : 4095MB
# Capability : 3.5
# PCI ID : 0000:03:00.0
# Device clock : 928MHz
# Memory clock : 3004MHz
# Memory width : 384bit
# Driver version : r325_00 : 32641
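
One way to check the logs as suggested above is to scan a task's stderr output and record the last temperature readings seen before each access violation. The following is only a rough sketch: it assumes the line formats shown in the excerpt (`# GPU <n> : <t>C` and `# Access violation ...`), and the function name and sample text are hypothetical, not part of the app.

```python
import re

# Matches temperature lines such as "# GPU 0 : 80C" (format assumed from the excerpt).
GPU_TEMP = re.compile(r"^#\s*GPU\s+(\d+)\s*:\s*(\d+)C\s*$")

def temps_before_violations(log_text):
    """Return one {gpu_index: temp_C} snapshot per access violation line."""
    latest = {}      # most recent temperature seen for each GPU
    snapshots = []   # copy of `latest` taken at each access violation
    for raw in log_text.splitlines():
        line = raw.strip()
        match = GPU_TEMP.match(line)
        if match:
            latest[int(match.group(1))] = int(match.group(2))
        elif line.startswith("# Access violation"):
            snapshots.append(dict(latest))
    return snapshots

sample = """\
# GPU 0 : 80C
# GPU 1 : 74C
# Access violation : progress made, try to restart
# GPU [GeForce GTX TITAN] Platform [Windows] Rev [3203] VERSION [55]
"""

print(temps_before_violations(sample))
```

If the snapshots cluster around one temperature on one GPU, that supports the cooling theory; if they are spread across the low 70s as well, it points elsewhere.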

The issue is probably with the top GPU (closest to the CPU) getting too warm. If cooling is improved I bet the error rates will fall.

I suggest you test this by disabling the top GPU for crunching at GPUGrid (hopefully GPU0 according to BOINC) using cc_config, and then telling BOINC to re-read the config file:

    <cc_config>
      <options>
        <use_all_gpus>1</use_all_gpus>
        <exclude_gpu>
          <url>http://www.gpugrid.net/</url>
          <device_num>0</device_num>
        </exclude_gpu>
      </options>
    </cc_config>
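
If you would rather generate the file than hand-edit it, the same configuration can be built with Python's standard XML library. This is a minimal sketch, not an official tool: the helper name `build_cc_config` is made up for illustration, and it only emits the three settings used above.

```python
import xml.etree.ElementTree as ET

def build_cc_config(project_url, excluded_devices):
    """Build a cc_config tree that excludes the given GPU numbers for one project."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "use_all_gpus").text = "1"
    for device in excluded_devices:
        exclude = ET.SubElement(options, "exclude_gpu")
        ET.SubElement(exclude, "url").text = project_url
        ET.SubElement(exclude, "device_num").text = str(device)
    return root

root = build_cc_config("http://www.gpugrid.net/", [0])
ET.indent(root)  # pretty-printing helper, available from Python 3.9
print(ET.tostring(root, encoding="unicode"))
```

Save the printed output as cc_config.xml in the BOINC data directory and have BOINC re-read the config file as described above.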



You might still want to swap one or both cards into another system to check them.


FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

MJH
Message 32980 - Posted: 15 Sep 2013, 23:09:09 UTC - in response to Message 32978.  

Currently, when the app does a temporary exit, it tells the client to wait 30 seconds before attempting a restart. I'll probably change this to an immediate restart; this should minimise the opportunity for the client to chop and change tasks.

MJH

5pot
Message 32982 - Posted: 15 Sep 2013, 23:50:25 UTC

http://www.gpugrid.net/result.php?resultid=7280096

Here is an example of one that constantly was getting access violations. Notice how the temps were pretty much always in the low 70s. Which AFAIK, isn't hot. Or rather, too hot to matter.

I have also tried running only one WU, and suspending the other, and the errors were still present.

I will bump up the fans a little bit; currently I have them at around 80%.

TJ
Message 32983 - Posted: 16 Sep 2013, 0:02:12 UTC

I don't think the cards are too hot. I have had a GTX550Ti running at 79°C 24/7 with this project with almost no errors.
Moreover, Operator had no problems with version 8.03, and the temperatures should be about the same. And according to NVIDIA, its maximum temperature is 95°C!
Stranger still, there are a lot of access violations and WU stops and starts, but eventually the WU finishes okay.
Matt knows absolutely what he is doing and has made the app a whole lot better, but I guess he has overlooked a small thing between versions 8.03 and 8.14.
Perhaps he can have one more look tomorrow after a good sleep?
Greetings from TJ

5pot
Message 32985 - Posted: 16 Sep 2013, 3:57:26 UTC

I've dropped the clock, and still get the violations.

Let me ask this: does anyone have a GK110 that isn't getting access violations?

Jim1348
Message 32993 - Posted: 16 Sep 2013, 11:09:21 UTC - in response to Message 32983.  

I don't think the cards are too hot. I have had a GTX550Ti running at 79°C 24/7 with this project with almost no errors.

I don't think you can generalize from one card to the next even if they are in the same series, much less from a GTX550Ti to a GTX 660. If it fails when it hits a certain temperature, that looks like a smoking gun to me.

skgiven
Volunteer moderator
Volunteer tester
Message 32995 - Posted: 16 Sep 2013, 11:41:27 UTC - in response to Message 32985.  
Last modified: 16 Sep 2013, 11:52:36 UTC

Perhaps most of the access violations were already being recovered by the card while running the earlier apps; there are errors that the card can recover without intervention (recoverable errors).

Regarding safe temps - I agree with Jim, it's an individual thing.

When my GTX660Ti ran at around 78°C it had quite a lot of errors, and my 660 is also good until it gets into the high 70s. Now that both are usually below 60°C, errors are very rare (just 1 Beta since 6th Sept). One of my GTX670s is fine in the high 70s and my 470s were generally OK until they went into the 80s, but they use a Fermi architecture and the default fan profile allowed the cards to reach 93°C!

The GPU temperature of a card doesn't tell you how hot anything else is.

Perhaps Operator has different issues to 5pot, possibly Titan vs 780, or perhaps there are two different general issues; one that kicks in when the temp hits 80°C and a separate issue that can occur at lower temps.

Both systems have two GPU's (which might be significant, and can be tested by removing or disabling one), and both systems have one GPU that is noticeably hotter than the other; GPU0 which presumably sits between the other GPU and the CPU. This is common, but suggests insufficient cooling.
From experience of running single and multi-GPU setups, when you add a GPU, even when you get the GPU temperatures down to what you think is reasonable, the rest of the card is hotter, so you have to aim to reduce the ambient system temps a lot.

Operator
Message 32999 - Posted: 16 Sep 2013, 15:22:38 UTC - in response to Message 32995.  
Last modified: 16 Sep 2013, 15:23:13 UTC

Okay so let's take a look at somebody else.

After scanning the listing of top hosts I see another system similarly configured to mine, two GTX Titans, 3.20 Xeon processor, 12GB ram, etc. all the same as mine, except this box is running Win8.

Now take a look at these results, riddled with Access violations, with temps that occasionally approach 80C. And the times are in most cases close to (but actually better than) my system's WU completion times.

http://www.gpugrid.net/results.php?hostid=156948

It's not just my system having these issues.

I will get around to pulling out one of the GPUs and running tests this evening, but I do not anticipate that running with just one GPU installed will change what I have experienced running 8.14 with both GPUs installed.

Titans do not just 'crash' when they get to 80C. They do start reducing frequency to reduce temps, but they don't 'quit'.

I know there has been discussion previously about Titans and memory mis-reporting by the driver. I notice that issue is still not resolved. Titans (currently) come standard with 6GB of memory, but they are always reported in BOINC as having only 4GB. No idea what that's about or if it even matters.

Operator

Operator
Message 33001 - Posted: 16 Sep 2013, 16:31:30 UTC

MJH

Does the 4X Titan E5 system in your facility only run Linux?

If it runs Windows, does it display the same issues (Access violations, etc.)?

Operator

MJH
Message 33002 - Posted: 16 Sep 2013, 16:39:05 UTC - in response to Message 33001.  

We only run Linux in-house.

MJH

MJH
Message 33003 - Posted: 16 Sep 2013, 16:43:03 UTC - in response to Message 33001.  

The access violations appear to come from deep inside NVIDIA code. Maybe I'll have to get a Titan machine bumped to Windows for testing, since there's a limit to the information I can get remotely.


I know there has been discussion previously about Titans and memory mis-reporting by the driver. I notice that issue is still not resolved. Titans (currently) come standard with 6GB of memory, but they are always reported in BOINC as having only 4GB. No idea what that's about or if it even matters.


That's just because the client is using a 32-bit integer to hold the memory size. 4GB (strictly, one byte less) is the largest value representable in that datatype.
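
A quick arithmetic sketch of why a 6GB card shows up as 4095MB in the log posted earlier in the thread. It assumes the size is held in bytes in an unsigned 32-bit field and that oversized values are clamped to the maximum rather than wrapped; both details are assumptions made for illustration, not taken from the client source, and `reported_mib` is a made-up name.

```python
UINT32_MAX = 2**32 - 1   # largest value an unsigned 32-bit integer can hold
MIB = 1024 * 1024        # bytes per MiB

def reported_mib(actual_bytes):
    """Clamp a byte count to 32 bits (assumed behaviour) and convert to whole MiB."""
    stored = min(actual_bytes, UINT32_MAX)
    return stored // MIB

titan_bytes = 6 * 1024**3          # 6GB of physical memory
print(reported_mib(titan_bytes))   # 4095
```

Cards with 2GB or 3GB fit in the field unchanged, which is why only the larger cards appear to be mis-reported.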

MJH

5pot
Message 33007 - Posted: 16 Sep 2013, 18:12:25 UTC

Well, that's one step forward. Best of luck. Deep inside NVIDIA code doesn't sound good, but for me, while it does slow the tasks down, I'm just happy they're running and validating.

Operator
Message 33008 - Posted: 16 Sep 2013, 19:23:49 UTC - in response to Message 33003.  

The access violations appear to come from deep inside NVIDIA code. Maybe I'll have to get a Titan machine bumped to Windows for testing, since there's a limit to the information I can get remotely.

MJH


Thank you sir for the 32 bit reference on the memory size, makes perfect sense.

While we're on that topic, I don't suppose there is any hope of compiling a 64 bit version of the app and sending that out for testing?

It's my understanding that either a 32 bit or 64 bit app can be compiled by the toolkit.

Is this correct?

What's the downside of doing this?

Thanks,

Operator

MJH
Message 33009 - Posted: 16 Sep 2013, 19:27:59 UTC - in response to Message 33008.  


While we're on that topic, I don't suppose there is any hope of compiling a 64 bit version of the app and sending that out for testing?


No, the Windows application will stay 32-bit for the near future, since that will work on all hosts. Importantly, there's no performance advantage in a 64-bit version. It may happen in the future, but not until 1) the transition to CUDA 5.5 is complete, and 2) 32-bit hosts contribute an insignificant fraction of our compute capability.

Matt

MJH
Message 33011 - Posted: 16 Sep 2013, 19:38:45 UTC - in response to Message 32995.  


Perhaps most of the access violations were already being recovered by the card while running the earlier apps; there are errors that the card can recover without intervention (recoverable errors).


The app will attempt recovery if it ran long enough to make a new checkpoint file. If it starts and crashes before that point, it will just abort the task, to avoid getting stuck in a loop.
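
The rule can be illustrated with a toy simulation. This is a sketch of the policy as described in this post, not the app's actual code: the function name, the step counts, and the fixed checkpoint interval are all invented for the example.

```python
def simulate(total_steps, checkpoint_every, crashes):
    """Simulate a task under the 'restart only if a new checkpoint was made' rule.

    `crashes` lists, for each run in turn, how many steps that run completes
    before crashing. Once the list is used up, the next run is assumed to
    complete without crashing.
    """
    saved = 0  # progress stored in the most recent checkpoint
    for steps_before_crash in crashes:
        reached = min(saved + steps_before_crash, total_steps)
        if reached >= total_steps:
            return "finished"
        new_saved = (reached // checkpoint_every) * checkpoint_every
        if new_saved <= saved:
            return "aborted"    # crashed before writing a new checkpoint
        saved = new_saved       # progress made, so restart from here
    return "finished"

print(simulate(100, 10, [35, 25]))  # each run gets past a new checkpoint
print(simulate(100, 10, [35, 5]))   # second run crashes too early
```

Because every restart must begin from a strictly later checkpoint, the number of restarts is bounded by the number of checkpoints, so the task cannot loop forever.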


Regarding safe temps - I agree with Jim, it's an individual thing.


My experience is that you get the best performance out of a Titan if the temperature is below 78C. By 80 it is throttling. Over 80, the thermal environment is too challenging for it to maintain its target (the card will be spending most of its time in the lowest performance state), and you should really try to improve the cooling.

Just turning up the fan speed can have counter-intuitive effects. When we tried this on our chassis, the increased airflow through the GPUs hindered airflow around the cards, actually making the top parts of the card, away from the thermal sensors, hotter.

MJH

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Message 33014 - Posted: 16 Sep 2013, 20:48:24 UTC

For tests you could just grab an old spare HDD and install Windows on it - it doesn't even need to be activated.

What could be interesting is whether these access violations already happened in 8.03 but had no visible effect, or if they're caused by some change made to the app since then.

And you're right, Titans regulate themselves to "up to 80°C" by boosting clock speed and voltage, but not any higher. Hence it's their expected temperature. They'll always like more cooling, though, unless you go below 30 K.. :D

MrS
Scanning for our furry friends since Jan 2002

Operator
Message 33017 - Posted: 16 Sep 2013, 21:04:54 UTC - in response to Message 33014.  
Last modified: 16 Sep 2013, 21:05:28 UTC



What could be interesting is whether these access violations already happened in 8.03 but had no visible effect, or if they're caused by some change made to the app since then.

MrS



MrS;

Easy enough to find out.

Here's the results of the last 4 WUs crunched on the Titan system using 8.03 before 8.14 got downloaded automatically:

http://www.gpugrid.net/result.php?resultid=7264399

http://www.gpugrid.net/result.php?resultid=7263637

http://www.gpugrid.net/result.php?resultid=7262985

http://www.gpugrid.net/result.php?resultid=7262745

I don't see any evidence of either errors or Access violations.

And here is the very first WU running on the 8.14 app:

http://www.gpugrid.net/result.php?resultid=7265074

So I don't think it's about heat issues, third party software, tribbles in the vent shafts, moon phases, any of that.

I've been through this system thoroughly.

It clearly started with 8.14. From the very first 8.14 WU I got.

Operator

5pot
Message 33022 - Posted: 16 Sep 2013, 21:31:35 UTC

What you're forgetting is that it most likely wasn't reporting the error at that point. He added more debug information as time went on.

My WU times are in line with previous batches as well.

TJ
Message 33024 - Posted: 16 Sep 2013, 21:52:09 UTC - in response to Message 33017.  

Operator, you and I don't think it is a heat issue, but the temperature readings were absent in 8.03 and were introduced between 8.04 and 8.14.
So it could have something to do with the temperature readings or other things regarding temperature. Matt should know, as he programmed the otherwise improved app.
Greetings from TJ

MJH
Message 33026 - Posted: 16 Sep 2013, 21:56:14 UTC

For these access violation problems, it seems that I'm going to have to set up a Windows system with a Titan in the lab and try to reproduce it. Unfortunately I'll not be back to do that until mid October at the earliest. I hope you can tolerate the current state of affairs until then?

Matt
©2025 Universitat Pompeu Fabra