skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0
I eventually noticed my credit dropping and found that a NOELIA WU was barely progressing. The temps were cool, and it was advancing at about 0.002% per minute. After 235 h (over 9 days) it was only 66% complete. I aborted the WU, and a SANTI_MAR is now progressing normally.
gluglux8x68-NOELIA_DIPEPT-0-2-RND0892
7718045 159186 31 Jan 2014 | 22:54:24 UTC 11 Feb 2014 | 13:27:00 UTC Aborted by user 848,171.58 7,994.09 --- Long runs (8-12 hours on fastest card) v8.03 (cuda55)
So, just a reminder to keep an eye out for lazy tasks.
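For anyone who wants to spot a lazy task without watching the progress bar, here is a rough sketch of the arithmetic. It assumes progress is roughly linear, and the function names, the 12-hour expectation and the 3x slack factor are illustrative choices of mine, not anything taken from BOINC or the GPUGRID application.

```python
def projected_total_hours(elapsed_hours: float, percent_done: float) -> float:
    """Linearly extrapolate the total run time from the progress so far."""
    if percent_done <= 0:
        # No progress at all: there is no finite estimate.
        return float("inf")
    return elapsed_hours * 100.0 / percent_done


def is_lazy(elapsed_hours: float, percent_done: float,
            expected_hours: float = 12.0, slack: float = 3.0) -> bool:
    """Flag a task whose projected run time far exceeds what is expected.

    expected_hours and slack are hypothetical defaults; adjust them for
    your own card and for the queue the task came from.
    """
    return projected_total_hours(elapsed_hours, percent_done) > expected_hours * slack


if __name__ == "__main__":
    # The NOELIA_DIPEPT task above: 66% after 235 hours.
    print(round(projected_total_hours(235, 66)))  # 356
    print(is_lazy(235, 66))                       # True
```

By this estimate the task would have needed roughly 356 hours in total, against a queue described as 8-12 hours on the fastest card.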
____________
FAQ's
HOW TO:
- Opt out of Beta Tests
- Ask for Help

Jim1348
Joined: 28 Jul 12 Posts: 819 Credit: 1,591,285,971 RAC: 0
I have had about a dozen complete without problems (GTX 660), but one errored out for every user who received it. Fortunately it failed quickly, but maybe there are problems with the series.
http://www.gpugrid.net/workunit.php?wuid=5128425
These days, though, when I see a slow clock I assume the card is being over-stressed and is slowing down to protect itself, so I reduce the clocks. It is counter-intuitive, but it works; I haven't seen that problem in months.

Stefan (Project administrator, Project developer, Project tester, Project scientist)
Joined: 5 Mar 13 Posts: 348 Credit: 0 RAC: 0
Sounds nasty :/ I sent the post around, but I don't think much will come of it, as it seems to be an outlier (a very bad one, admittedly).

ecafkid
Joined: 31 Dec 10 Posts: 4 Credit: 1,359,947,817 RAC: 0
I aborted this WU after 24 hours; it was still at 0% and the Remaining column showed only dashes.
7764248 166275 11 Feb 2014 | 12:23:51 UTC 12 Feb 2014 | 12:32:31 UTC Aborted by user 86,754.02 0.00 --- Long runs (8-12 hours on fastest card) v8.15 (cuda55)

skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0
This has nothing to do with 'NOELIA_DIPEPT progressing very slowly', but regarding your setup:
You get a lot of errors! These problems may stem from overheating GPUs. 80C is probably too hot. I suggest you control your GPU temperatures so that they run a bit cooler. Try MSI Afterburner.
http://www.gpugrid.net/result.php?resultid=7758914
I can't tell what the temps are for device 2, but many tasks seem to fail shortly after the GPU reaches 80C. I believe throttling starts at 80C, but in this case the issue may be due to the task being swapped from one of your Quadro K5000s to the other, or to your Tesla K20c.
I wonder whether that kind of swap works well on the Quadros and Teslas?
Possibly something for Matt to think about, but if the temp is kept below 80C then this wouldn't be happening.
Stderr output
<core_client_version>7.2.33</core_client_version>
<![CDATA[
<message>
The file exists.
(0x50) - exit code 80 (0x50)
</message>
<stderr_txt>
# GPU [Quadro K5000] Platform [Windows] Rev [3203M] VERSION [55]
# SWAN Device 2 :
# Name : Quadro K5000
# ECC : Disabled
# Global mem : 4095MB
# Capability : 3.0
# PCI ID : 0000:05:00.0
# Device clock : 705MHz
# Memory clock : 2700MHz
# Memory width : 256bit
# Driver version : r331_00 : 33182
# GPU 0 : 74C
# GPU 1 : 65C
# GPU 0 : 75C
# GPU 0 : 76C
# GPU 0 : 77C
# GPU 0 : 78C
# GPU 1 : 66C
# GPU 0 : 79C
# GPU 1 : 67C
# GPU 1 : 68C
# GPU 1 : 69C
# GPU 0 : 80C
# GPU 1 : 70C
# GPU 1 : 71C
# GPU [Tesla K20c] Platform [Windows] Rev [3203M] VERSION [55]
# SWAN Device 0 :
# Name : Tesla K20c
# ECC : Enabled
# Global mem : 4095MB
# Capability : 3.5
# PCI ID : 0000:22:00.0
# Device clock : 705MHz
# Memory clock : 2600MHz
# Memory width : 320bit
# Driver version : r331_00 : 33182
SWAN : FATAL : Cuda driver error 719 in file 'swanlibnv2.cpp' in line 1963.
# SWAN swan_assert 0
</stderr_txt>
]]>
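If you want to check a batch of results for the same pattern, a small script can pull the temperature lines out of the stderr text. This is only a sketch: it assumes the lines look exactly like the `# GPU 0 : 74C` entries above, the function names are made up for illustration, and the GPU index in those lines may not match the SWAN device number (which is why I couldn't read the temps for device 2).

```python
import re

# Matches temperature lines such as "# GPU 0 : 74C" but not header lines
# such as "# GPU [Quadro K5000] Platform [Windows] ...".
TEMP_LINE = re.compile(r"^#\s*GPU\s+(\d+)\s*:\s*(\d+)C$")


def max_temps(stderr_text: str) -> dict:
    """Return the highest temperature logged for each GPU index."""
    peaks = {}
    for line in stderr_text.splitlines():
        match = TEMP_LINE.match(line.strip())
        if match:
            gpu, temp = int(match.group(1)), int(match.group(2))
            peaks[gpu] = max(temp, peaks.get(gpu, temp))
    return peaks


def hot_gpus(stderr_text: str, limit_c: int = 80) -> list:
    """List the GPU indices whose logged peak reached limit_c or more."""
    return sorted(g for g, t in max_temps(stderr_text).items() if t >= limit_c)


if __name__ == "__main__":
    sample = "\n".join([
        "# GPU [Quadro K5000] Platform [Windows] Rev [3203M] VERSION [55]",
        "# GPU 0 : 79C",
        "# GPU 1 : 69C",
        "# GPU 0 : 80C",
        "# GPU 1 : 71C",
    ])
    print(max_temps(sample))  # {0: 80, 1: 71}
    print(hot_gpus(sample))   # [0]
```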

ecafkid
Joined: 31 Dec 10 Posts: 4 Credit: 1,359,947,817 RAC: 0
Thanks! Is there an optimum temperature for them to run at? I know there is sometimes a magic temp at which things give their best performance. I really appreciate you looking into this. I am not a scientist or researcher; I just have some spare cycles on my computers that I feel should go to good use. I will probably bring three or four more online soon.

skgiven (Volunteer moderator, Volunteer tester)
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0
Try to keep your GPUs below 70C and they should run fine. 70C to ~78C is often OK, but once you go over 80C expect trouble, especially on the smaller cards. When GK104, GK106 and GK107 cards hit 70C they mostly stop boosting as high. The GK110 cards tend to throttle their boost when they hit 80C, so staying below these temperatures helps. Ideally, keep the GPUs as cool as possible, because that way they lose less power as heat. Power is one factor that can limit the GPU's clocks. Reliable voltage is another...
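If you'd rather script the check than watch a monitoring tool, something along these lines may do as a starting point. It is a sketch only: it assumes `nvidia-smi` is on the PATH and that your driver reports `temperature.gpu` as a plain number for every card, and the "ok / warm / too hot" labels and function names are simply my shorthand for the 70C and 80C marks above.

```python
import subprocess

# nvidia-smi query for one CSV line per GPU: index, name, temperature.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def classify(temp_c: int, warn_c: int = 70, limit_c: int = 80) -> str:
    """Label a temperature against the rough 70C / 80C marks."""
    if temp_c >= limit_c:
        return "too hot"
    if temp_c >= warn_c:
        return "warm"
    return "ok"


def parse_smi(csv_text: str) -> list:
    """Turn 'index, name, temp' lines into (index, name, temp, label) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, rest = line.split(",", 1)
        name, temp = rest.rsplit(",", 1)
        temp_c = int(temp)  # raises ValueError if the driver reports no number
        rows.append((int(index), name.strip(), temp_c, classify(temp_c)))
    return rows


def read_gpus() -> list:
    """Run nvidia-smi once and classify every GPU it reports."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_smi(result.stdout)


if __name__ == "__main__":
    for index, name, temp_c, label in read_gpus():
        print(f"GPU {index} ({name}): {temp_c}C, {label}")
```

Running it from a scheduled task every few minutes and logging the output would show whether a card creeps toward 80C before its tasks start failing.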