Lots of errors

nanoprobe

Joined: 26 Feb 12
Posts: 184
Credit: 222,376,233
RAC: 0
Message 41152 - Posted: 27 May 2015, 0:00:10 UTC

I'm starting to get a high number of short tasks that error out. Can someone explain why this is happening and how I can fix it? I have changed no settings.
Here is the log from one of the tasks.
WinXP SP3 dual 750Ti



http://www.gpugrid.net/result.php?resultid=14202446

Bedrich Hajek

Joined: 28 Mar 09
Posts: 490
Credit: 11,739,145,728
RAC: 95,752
Message 41154 - Posted: 27 May 2015, 0:59:50 UTC - in response to Message 41152.  

I am getting errors too, but mine are with GERARD_EQUI WUs. Three had errors, two finished OK. It seems to be a bad batch.

https://www.gpugrid.net/result.php?resultid=14210451



895456x4-GERARD_EQUI_26Apr_CXCL-0-1-RND0321_4
Workunit 10949024
Created 26 May 2015 | 23:34:38 UTC
Sent 26 May 2015 | 23:34:54 UTC
Received 26 May 2015 | 23:50:42 UTC
Server state Over
Outcome Computation error
Client state Compute error
Exit status -97 (0xffffffffffffff9f) Unknown error number
Computer ID 30790
Report deadline 31 May 2015 | 23:34:54 UTC
Run time 87.09
CPU time 77.31
Validate state Invalid
Credit 0.00
Application version Short runs (2-3 hours on fastest card) v8.47 (cuda65)
Stderr output
<core_client_version>7.4.42</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -97 (0xffffff9f)
</message>
<stderr_txt>
# GPU [GeForce GTX 690] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 1 :
# Name : GeForce GTX 690
# ECC : Disabled
# Global mem : 2047MB
# Capability : 3.0
# PCI ID : 0000:05:00.0
# Device clock : 1019MHz
# Memory clock : 3004MHz
# Memory width : 256bit
# Driver version : r343_98 : 34411
# GPU [GeForce GTX 690] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 1 :
# Name : GeForce GTX 690
# ECC : Disabled
# Global mem : 2047MB
# Capability : 3.0
# PCI ID : 0000:05:00.0
# Device clock : 1019MHz
# Memory clock : 3004MHz
# Memory width : 256bit
# Driver version : r343_98 : 34411
# GPU [GeForce GTX 690] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 1 :
# Name : GeForce GTX 690
# ECC : Disabled
# Global mem : 2047MB
# Capability : 3.0
# PCI ID : 0000:05:00.0
# Device clock : 1019MHz
# Memory clock : 3004MHz
# Memory width : 256bit
# Driver version : r343_98 : 34411
# GPU [GeForce GTX 690] Platform [Windows] Rev [3212] VERSION [65]
# SWAN Device 0 :
# Name : GeForce GTX 690
# ECC : Disabled
# Global mem : 2047MB
# Capability : 3.0
# PCI ID : 0000:04:00.0
# Device clock : 1019MHz
# Memory clock : 3004MHz
# Memory width : 256bit
# Driver version : r343_98 : 34411
# GPU 0 : 63C
# GPU 1 : 73C
# The simulation has become unstable. Terminating to avoid lock-up (1)
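
Editor's note: the report above shows the same exit status written two ways, `0xffffffffffffff9f` and `0xffffff9f`. Both appear to be two's-complement renderings of -97 at different widths (64-bit and 32-bit). The following is a minimal sketch of that arithmetic; the helper names are hypothetical and are not part of BOINC or the GPUGRID application.

```python
def as_unsigned_hex(code: int, bits: int) -> str:
    """Return the two's-complement hex form of a signed exit code."""
    mask = (1 << bits) - 1          # e.g. 0xffffffff for 32 bits
    return hex(code & mask)


def from_unsigned(value: int, bits: int) -> int:
    """Interpret an unsigned value as a signed two's-complement integer."""
    if value >= 1 << (bits - 1):    # top bit set means negative
        return value - (1 << bits)
    return value


print(as_unsigned_hex(-97, 32))        # 32-bit rendering
print(as_unsigned_hex(-97, 64))        # 64-bit rendering
print(from_unsigned(0xFFFFFF9F, 32))   # back to the signed value
```

So the two hex strings in the report describe one and the same exit code, not two different failures.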




Jacob Klein

Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 41155 - Posted: 27 May 2015, 1:07:33 UTC
Last modified: 27 May 2015, 1:10:10 UTC

I'm getting some nasty errors too, with the GERARD_EQUI_26Apr_CXCL tasks. They're causing major TDRs, which in turn cause hardware acceleration problems in other applications (like web browsing or gaming), and also cause driver problems where the clocks never go back to 3D-mode clocks.

Admins: Please look into which batches need to be revoked, to prevent these problems. It's a major headache, for me at least.

1154144x3-GERARD_EQUI_26Apr_CXCL-0-1-RND9216_7
http://www.gpugrid.net/result.php?resultid=14210052
895456x5-GERARD_EQUI_26Apr_CXCL-0-1-RND9089_5
http://www.gpugrid.net/result.php?resultid=14210507

nanoprobe

Joined: 26 Feb 12
Posts: 184
Credit: 222,376,233
RAC: 0
Message 41156 - Posted: 27 May 2015, 2:06:03 UTC

All my tasks are now erroring out. Suspending this project for now until this issue is resolved.

Eric

Joined: 12 Apr 15
Posts: 1
Credit: 49,381,475
RAC: 0
Message 41157 - Posted: 27 May 2015, 4:52:30 UTC

I've actually been having issues with the graphics drivers themselves crashing and Windows having to recover.

Stoneageman

Joined: 25 May 09
Posts: 224
Credit: 34,057,374,498
RAC: 0
Message 41158 - Posted: 27 May 2015, 5:46:47 UTC

Same here. Now have five GERARD_EQUI_26Apr_CXCL tasks crashed.

Gerard

Joined: 26 Mar 14
Posts: 101
Credit: 0
RAC: 0
Message 41160 - Posted: 27 May 2015, 9:05:55 UTC

Could you please post your errors in this thread? I will cancel the batch if they persist. Thanks for your patience...

tito

Joined: 21 May 09
Posts: 22
Credit: 2,002,780,169
RAC: 0
Message 41161 - Posted: 27 May 2015, 10:54:28 UTC

https://www.gpugrid.net/result.php?resultid=14210324
Short WU errored after 80 seconds on a 750 Ti.

nanoprobe

Joined: 26 Feb 12
Posts: 184
Credit: 222,376,233
RAC: 0
Message 41162 - Posted: 27 May 2015, 11:10:02 UTC
Last modified: 27 May 2015, 11:10:52 UTC

Could you please post your errors in this thread? I will cancel the batch if they persist. Thanks for your patience...



https://www.gpugrid.net/result.php?resultid=14210504

Gerard

Joined: 26 Mar 14
Posts: 101
Credit: 0
RAC: 0
Message 41164 - Posted: 27 May 2015, 12:50:55 UTC
Last modified: 27 May 2015, 12:52:26 UTC

We detected an unexpected parameterization error in some of the simulations and we have just cancelled them. Sorry for any inconvenience caused, and thank you for reporting it to us! If you find any other errors, please do not hesitate to tell us (hopefully this particular issue is already resolved).

Jacob Klein

Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 41166 - Posted: 27 May 2015, 12:56:58 UTC

Excellent. Thank you!!

nanoprobe

Joined: 26 Feb 12
Posts: 184
Credit: 222,376,233
RAC: 0
Message 41168 - Posted: 27 May 2015, 15:10:08 UTC


Jacob Klein

Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 41169 - Posted: 27 May 2015, 16:07:07 UTC - in response to Message 41168.  
Last modified: 27 May 2015, 16:08:05 UTC

nanoprobe:

What is the exact make/model of your GPU? Do the tasks still fail when the Boost clock is set to the reference clock? My hunch is that your GPU is overclocked too much, either by the factory or by you.

"The simulation has become unstable. Terminating to avoid lock-up"
... generally means that you are overclocking too much, or have a hardware problem... from my experience.
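
Editor's note: when comparing many failed tasks, it can help to pull the same few facts out of each stderr log. The following is a rough sketch that checks for the instability message and collects the `# GPU n : nnC` temperature lines, based only on the log format shown earlier in this thread. The function name and the returned dictionary layout are hypothetical. The message on its own does not say whether the cause is a bad work unit, overclocking, or hardware.

```python
import re

UNSTABLE_MSG = "The simulation has become unstable"

# Matches lines such as "# GPU 0 : 63C" but not the "# GPU [GeForce ...]" header.
TEMP_LINE = re.compile(r"^# GPU (\d+) : (\d+)C", re.MULTILINE)


def summarize_stderr(text: str) -> dict:
    """Report whether the instability message is present and any GPU temperatures."""
    temps = {int(m.group(1)): int(m.group(2)) for m in TEMP_LINE.finditer(text)}
    return {"unstable": UNSTABLE_MSG in text, "temps_c": temps}


sample = """# GPU [GeForce GTX 690] Platform [Windows] Rev [3212] VERSION [65]
# GPU 0 : 63C
# GPU 1 : 73C
# The simulation has become unstable. Terminating to avoid lock-up (1)
"""
print(summarize_stderr(sample))
```

If the same message shows up across many hosts on one batch at the same time, a bad batch is the more likely explanation than any single card.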

[CSF] Thomas H.V. DUPONT

Joined: 20 Jul 14
Posts: 732
Credit: 130,089,082
RAC: 0
Message 41170 - Posted: 27 May 2015, 17:14:22 UTC - in response to Message 41166.  

Excellent. Thank you!!

+1 :)
[CSF] Thomas H.V. Dupont
Founder of the team CRUNCHERS SANS FRONTIERES 2.0
www.crunchersansfrontieres

nanoprobe

Joined: 26 Feb 12
Posts: 184
Credit: 222,376,233
RAC: 0
Message 41171 - Posted: 27 May 2015, 17:37:37 UTC - in response to Message 41169.  
Last modified: 27 May 2015, 17:39:06 UTC

nanoprobe:

What is the exact make/model of your GPU? Do the tasks still fail when the Boost clock is set to the reference clock? My hunch is that your GPU is overclocked too much, either by the factory or by you.

"The simulation has become unstable. Terminating to avoid lock-up"
... generally means that you are overclocking too much, or have a hardware problem... from my experience.

Cards are PNY 750 Ti. No factory O/C. No six-pin PCI-E power plugs. 60 W load at 99%. They've been running stock out of the box since I bought them. I've been running the short tasks on these cards since I got them and have never had the failure rate I've been experiencing lately.
If it were one card producing all or most of the errors, I would suspect the card, but the tasks are failing on both cards.

zdnko

Joined: 17 Jan 09
Posts: 2
Credit: 22,278,452
RAC: 0
Message 41172 - Posted: 27 May 2015, 17:52:14 UTC

1232906x8-GERARD_EQUI_26Apr_CXCL-0-1-RND1418_4 causes a lot of GPU driver crashes. Stopped!

Jacob Klein

Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 41173 - Posted: 27 May 2015, 18:13:46 UTC - in response to Message 41171.  
Last modified: 27 May 2015, 18:21:17 UTC

nanoprobe:

Can you supply the exact model of the GPU, to confirm that it's not factory-overclocked?

Alternatively, could you use GPU-Z to confirm that the GPU Clock and Default Clock both say 1020 MHz (which is the stock speed of a GTX 750 Ti, per http://en.wikipedia.org/wiki/GeForce_700_series)?

If it's anything above 1020, then it is in fact overclocked, and I recommend using EVGA Precision X to downclock it back to reference 1020 MHz, to see if it helps.
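
Editor's note: the check described above is a plain comparison against the reference clock. Here is a minimal sketch of it, assuming you read the "GPU Clock" value from GPU-Z by hand and type it in. The function name is hypothetical, and the 1020 MHz default is only the figure quoted in this post.

```python
REFERENCE_MHZ = 1020  # stock GTX 750 Ti clock as quoted in the post above


def clock_status(gpu_clock_mhz: int, reference_mhz: int = REFERENCE_MHZ) -> str:
    """Compare a reported GPU clock against the reference clock."""
    if gpu_clock_mhz > reference_mhz:
        return f"overclocked by {gpu_clock_mhz - reference_mhz} MHz"
    return "at or below reference"


print(clock_status(1020))
print(clock_status(1085))
```

Anything reported as overclocked is a candidate for downclocking to the reference value before retesting.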

I'm getting frustrated trying to help by offering advice that gets ignored.

nanoprobe

Joined: 26 Feb 12
Posts: 184
Credit: 222,376,233
RAC: 0
Message 41190 - Posted: 29 May 2015, 1:00:26 UTC - in response to Message 41173.  

nanoprobe:

I'm getting frustrated trying to help by offering advice that gets ignored.

WOW! Let me offer you some advice. If it doesn't concern life, death or health then it surely isn't worth getting frustrated over.

FWIW the issue seems to have cleared up. The faulty WUs have been taken care of. Thanks for your help.

John C MacAlister

Joined: 17 Feb 13
Posts: 181
Credit: 144,871,276
RAC: 0
Message 41191 - Posted: 29 May 2015, 3:05:39 UTC

I have had over 20 WUs fail on my GTX 660 Ti devices. I will stop getting tasks and now go to bed.

Jacob Klein

Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 41192 - Posted: 29 May 2015, 3:49:35 UTC - in response to Message 41190.  
Last modified: 29 May 2015, 3:52:39 UTC

nanoprobe:

I'm getting frustrated trying to help by offering advice that gets ignored.

WOW! Let me offer you some advice. If it doesn't concern life, death or health then it surely isn't worth getting frustrated over.

FWIW the issue seems to have cleared up. The faulty WUs have been taken care of. Thanks for your help.


There were some faulty WUs, but they have nothing to do with tasks erroring out with "Simulation has become unstable." messages and no other error messages. Errors like yours are usually a result of overclocking too much. Please keep my advice (lower the clocks to reference clocks) in mind the next time you try to troubleshoot those errors.

Good luck,
Jacob