WU not completing

Message boards : Number crunching : WU not completing

John C MacAlister

Joined: 17 Feb 13
Posts: 181
Credit: 144,871,276
RAC: 0
Message 41125 - Posted: 22 May 2015, 14:21:52 UTC - in response to Message 41114.  

I tried running all CPU cores on BOINC WUs with two GPUGrid WUs. GPUGrid WUs failed every time.

I find I must run 5 BOINC CPU WUs max on my AMD FX-8350 8 core PC with two GPUGrid WUs to prevent failures.
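For anyone who wants to apply the same limit through the "use at most N% of the CPUs" preference rather than by hand, the arithmetic is sketched below. This is only a sketch: the function name is made up for illustration, and it ignores whatever CPU fraction BOINC itself reserves for the GPU tasks.

```python
def cpu_limit_percent(total_cores: int, cpu_tasks: int) -> float:
    """Percentage of cores that corresponds to running `cpu_tasks` CPU work units.

    Hypothetical helper for illustration; not part of BOINC.
    """
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    if not 0 <= cpu_tasks <= total_cores:
        raise ValueError("cpu_tasks must be between 0 and total_cores")
    return 100.0 * cpu_tasks / total_cores


# FX-8350 with 8 cores, 5 CPU WUs allowed alongside the two GPUGrid WUs
print(cpu_limit_percent(8, 5))  # 62.5
```

How the client rounds a fractional percentage is not covered here, so check the number of running CPU tasks after changing the setting.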
ID: 41125 · Rating: 0
Jacob Klein

Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 41145 - Posted: 26 May 2015, 1:13:42 UTC - in response to Message 41124.  
Last modified: 26 May 2015, 1:15:05 UTC

So, from what I can sort of understand, you don't know the exact make/model/manufacturer. But we're able to see:

GPU processor: GeForce GTX 760 GPU GK104
Core clock: 1006 MHz
Memory interface: 256-bit
Dedicated video memory: 2048 MB GDDR5


... which means that you have a 256-bit GTX 760 with a factory clock of 1006 MHz.

Looking at the wiki listing here:
http://en.wikipedia.org/wiki/GeForce_700_series
... The base core clock for your GPU is actually 980 MHz. This means, to my knowledge, that your GPU is factory-overclocked, and could be causing the failures.

I recommend installing EVGA Precision X and using it to lower the GPU Clock Offset value to -26 (so you are running at 980 MHz, the reference clock), then seeing whether that helps. You could even try lower values, like -100 or -200, to test.
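The offset that brings a factory-overclocked card back to reference is just the difference between the two clock figures quoted in this post. A minimal sketch of that arithmetic, with a made-up function name:

```python
def clock_offset_mhz(reported_mhz: int, reference_mhz: int) -> int:
    """Offset to enter in the tuning tool so the card runs at the reference clock.

    Hypothetical helper for illustration; a negative result means downclocking.
    """
    return reference_mhz - reported_mhz


# 1006 MHz factory clock reported above, 980 MHz reference base clock
print(clock_offset_mhz(1006, 980))  # -26
```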

Regards,
Jacob
ID: 41145 · Rating: 0
Killersocke

Joined: 18 Oct 13
Posts: 53
Credit: 406,647,419
RAC: 0
Message 41159 - Posted: 27 May 2015, 7:28:36 UTC - in response to Message 41145.  

I don't think that is the underlying problem.
All other applications, programs, BOINC projects etc.
are stable here.

https://www.gpugrid.net/forum_thread.php?id=4097
ID: 41159 · Rating: 0
Jacob Klein

Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 41163 - Posted: 27 May 2015, 12:29:30 UTC
Last modified: 27 May 2015, 12:30:24 UTC

Can you please remove the overclock, to at least test and rule that out?

I have had overclocks where everything works great, except GPUGrid tasks, because they work parts of the GPU in different ways. So, when testing, it's best to remove the overclock (or even put it at -200), to confirm that it resolves the problems.
ID: 41163 · Rating: 0
Gerard

Joined: 26 Mar 14
Posts: 101
Credit: 0
RAC: 0
Message 41165 - Posted: 27 May 2015, 12:56:54 UTC

If the problems you were having were with any of the workunits named "EQUI_26Apr_CXCL", the problem was most likely ours. These workunits were cancelled this morning (Spain time). Thanks for your understanding, and sorry for any inconvenience caused.
ID: 41165 · Rating: 0
Jacob Klein

Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 41167 - Posted: 27 May 2015, 13:20:20 UTC

The problems in this thread ... are different than the "EQUI_26Apr_CXCL" TDR tasks.

Any time it says "has become unstable", I continue to recommend taking the base clock down to attempt to resolve it. I wish people would listen :)
ID: 41167 · Rating: 0
Gerard

Joined: 26 Mar 14
Posts: 101
Credit: 0
RAC: 0
Message 41205 - Posted: 29 May 2015, 18:20:42 UTC - in response to Message 41167.  

In my experience, when the simulation gives an error of the type "has become unstable", it is mainly because of some misconfiguration of the molecular system (usually it means that an explosion occurred in some molecule due to extreme forces; this is what was happening in my case).

On the other hand, I have also noticed errors of the type "cuda errors", which are usually unsolvable and are related to some specific cards or to some random error in the calculation.

The third type is when no error is reported and the simulation seems to keep running indefinitely. I got some of these recently with the corrupted "EQUI_26Apr_CXCL" batch.

Do you think "has become unstable" errors could be also caused by overclocking? I doubt it; I would rather expect a "cuda error".
ID: 41205 · Rating: 0
Retvari Zoltan

Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 41206 - Posted: 29 May 2015, 18:37:23 UTC - in response to Message 41205.  
Last modified: 29 May 2015, 18:56:03 UTC

Do you think "has become unstable" errors could be also caused by overclocking?

It definitely does. (EDIT2: it's like the older error: "energies have become NAN")

EDIT3: these two tasks errored out on my new Palit JetStream GTX980 because it is factory overclocked, and MSI Afterburner raised its GPU clock by the same number of MHz that I had set for my standard GTX980: 14198435 14198172
Now both cards run fine at 1420MHz.

EDIT: There should be some safety check calculation (with known results) built into the client, which would regularly check the condition of the GPU (say, every 20 minutes).

EDIT4: the client should detect the real clock of the GPU somehow.
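
The safety check suggested in the first EDIT is essentially a known-answer test: run a fixed calculation, compare the result with a value recorded when the hardware was known to be good, and treat any difference as a warning. The sketch below only illustrates the idea. It runs on the CPU in plain Python, every name in it is made up, and a real check would have to run on the GPU inside the application.

```python
import hashlib
import struct


def reference_workload(steps: int = 10_000) -> bytes:
    """Deterministic floating-point loop; returns a digest of the final value."""
    x = 0.5
    for _ in range(steps):
        # Logistic map in its chaotic regime: a single wrong bit early on
        # changes the final value, so a silent error shows up in the digest.
        x = 3.9 * x * (1.0 - x)
    return hashlib.sha256(struct.pack("<d", x)).digest()


def hardware_looks_sane(expected: bytes, workload=reference_workload) -> bool:
    """Re-run the workload and compare it with the recorded known-good digest."""
    return workload() == expected


# Record the reference once on a known-good configuration...
known_good = reference_workload()
# ...then repeat the comparison periodically (for example every 20 minutes).
print(hardware_looks_sane(known_good))
```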
ID: 41206 · Rating: 0
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 2
Message 41207 - Posted: 29 May 2015, 19:09:31 UTC - in response to Message 41206.  

Hardware can also go out of tolerance if it's not given the correct supply voltage, either from the host PSU or via the BIOS/regulators on the card itself. It's possibly more likely that power components, rather than the calculation components, will suffer from aging when subject to the continuous stress of GPGPU work.
ID: 41207 · Rating: 0
Jacob Klein

Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 41208 - Posted: 29 May 2015, 19:25:43 UTC - in response to Message 41206.  
Last modified: 29 May 2015, 19:26:11 UTC

Do you think "has become unstable" errors could be also caused by overclocking?

It definitely does.


Yes. Definitely. Based on my own experience with 3 factory-overclocked GPUs.
ID: 41208 · Rating: 0
skgiven
Volunteer moderator
Volunteer tester

Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 41212 - Posted: 31 May 2015, 8:23:23 UTC - in response to Message 41208.  
Last modified: 31 May 2015, 8:31:04 UTC

If the power, voltage, GPU clocks or GDDR5 clocks are too high for any given task then the task can fail. This is more commonly seen on smaller cards, which can be weaker by design yet are pushed closer to their maximum (especially on XP & Linux, where GPU usage is often 99%).
GPU usage varies by task type/batch. This is why one setup or OC might work for one batch but not another, and I too have had issues with factory clocks on some smaller cards in the past.

I've even seen situations where running some CPU WUs causes the CPU to run hot enough to raise the temperature of GPU0 by several degrees C. Just running climate models, for example, can increase power usage by 30W, and that mostly ends up as heat in the case if you have a basic heatsink and fan cooler.

However, on a decent setup (with a GPU fan profile), GPU core clocks or temps may or may not be the reason tasks fail.

FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help
ID: 41212 · Rating: 0
skgiven
Volunteer moderator
Volunteer tester

Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 41269 - Posted: 7 Jun 2015, 10:27:40 UTC - in response to Message 41212.  

Tried 353.06 on an XPx86 system with a GTX770 (GPU0) and a GTX670 (GPU1).
While the GTX670 ran at ~98% Power and ~95% GPU usage, the GTX770's power remained at 48% constantly - it had downclocked and wouldn't be coaxed back to what it should be (despite restarts and task swapping).
I've tried other recent drivers too but had similar experiences.
Half expecting a GPU power related issue, I went back to 344.75 to see whether it's the driver, the mobo, a connector...
Now both the GTX770 and GTX670 are running at ~95% GPU usage. The power usage for the GTX670 is 98% while the power usage of the GTX770 is ~78%.

I think it's fair to conclude that the 344.75 driver works well on Windows XP while the more recent drivers do not.
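
A downclocked card like the one described above can also be spotted from the command line. The sketch below parses the CSV output of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` and flags any GPU that is busy but running below a chosen clock. The function names, the sample text and the 900 MHz threshold are made up for illustration. Note that `power.draw` is in watts rather than the percentage a tuning tool shows, and some GeForce cards or older drivers may not report every field.

```python
import subprocess

QUERY = "index,name,clocks.sm,power.draw,utilization.gpu"


def _to_float(value: str):
    """Return a float, or None for fields such as '[Not Supported]'."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gpu_rows(csv_text: str) -> list:
    """Turn the CSV text from nvidia-smi into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 5:
            continue  # skip anything that is not a complete row
        index, name, clock, power, util = fields
        rows.append({
            "index": int(index),
            "name": name,
            "sm_clock_mhz": _to_float(clock),
            "power_w": _to_float(power),
            "gpu_util_pct": _to_float(util),
        })
    return rows


def busy_but_slow(rows: list, min_clock_mhz: float, min_util_pct: float = 90.0) -> list:
    """GPUs that are heavily used yet clocked below `min_clock_mhz`."""
    return [
        r for r in rows
        if r["gpu_util_pct"] is not None and r["gpu_util_pct"] >= min_util_pct
        and r["sm_clock_mhz"] is not None and r["sm_clock_mhz"] < min_clock_mhz
    ]


def read_gpus() -> list:
    """Query the local driver; requires nvidia-smi on the PATH."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_rows(out)


# Made-up sample text, not real measurements from the system above
sample = "0, GeForce GTX 770, 705, 110.50, 95\n1, GeForce GTX 670, 1058, [Not Supported], 95\n"
print(busy_but_slow(parse_gpu_rows(sample), min_clock_mhz=900.0))
```

On a live system, `busy_but_slow(read_gpus(), min_clock_mhz=...)` would run the same check against the installed cards.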
ID: 41269 · Rating: 0

©2026 Universitat Pompeu Fabra