WU not completing
| Author | Message |
|---|---|
|
Joined: 17 Feb 13 Posts: 181 Credit: 144,871,276 RAC: 0
|
I tried running BOINC WUs on all CPU cores alongside two GPUGrid WUs. The GPUGrid WUs failed every time. I find I must run at most 5 BOINC CPU WUs on my 8-core AMD FX-8350 PC alongside the two GPUGrid WUs to prevent failures. |
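The core budget described above maps onto BOINC's "use at most N% of the CPUs" computing preference. Here is a minimal sketch of the arithmetic. The function names are made up for illustration, and the rounding in `tasks_allowed` (round down, never below one) is an assumption about how a client would interpret the percentage, not something stated in the thread.

```python
import math


def cpu_percent_for(total_cores: int, cpu_tasks: int) -> float:
    """Percentage of cores to allow so that at most cpu_tasks CPU tasks run."""
    if not 0 < cpu_tasks <= total_cores:
        raise ValueError("cpu_tasks must be between 1 and total_cores")
    return 100.0 * cpu_tasks / total_cores


def tasks_allowed(total_cores: int, percent: float) -> int:
    """Assumed interpretation of the percentage: round down, minimum of one."""
    return max(1, math.floor(total_cores * percent / 100.0))


# Figures from the post: 8 cores, 5 CPU tasks, the rest left for the GPU tasks.
pct = cpu_percent_for(8, 5)
print(pct)                    # 62.5
print(tasks_allowed(8, pct))  # 5
```

Leaving three cores idle here is the poster's empirical finding for this machine, not a general rule.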
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
So, as far as I can understand, you don't know the exact make/model/manufacturer. But we're able to see: GPU-Prozessor: GeForce GTX 760 GPU GK104, which means that you have a 256-bit GTX 760 with a factory clock of 1006 MHz. Looking at the wiki listing (http://en.wikipedia.org/wiki/GeForce_700_series), the base core clock for your GPU is actually 980 MHz. This means, to my knowledge, that your GPU is factory-overclocked, which could be causing the failures. I recommend installing EVGA Precision X and using it to set the GPU Clock Offset to -26 (so you are running at 980 MHz, the reference clock), and seeing if that helps at all. You could even try lower values, like -100 or -200, to test. Regards, Jacob |
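The offset to enter is simply the target clock minus the factory clock. Below is a small sketch using the two figures quoted in the post (1006 MHz factory, 980 MHz reference). The function name is illustrative, and nothing here talks to the card or to EVGA Precision X.

```python
def clock_offset_mhz(factory_mhz: int, target_mhz: int) -> int:
    """Offset to enter so that factory_mhz + offset == target_mhz."""
    return target_mhz - factory_mhz


FACTORY_MHZ = 1006    # factory clock quoted in the post
REFERENCE_MHZ = 980   # reference base clock quoted in the post

offset = clock_offset_mhz(FACTORY_MHZ, REFERENCE_MHZ)
print(offset)  # -26

# Resulting base clocks for the more aggressive test offsets mentioned.
for trial in (-100, -200):
    print(trial, FACTORY_MHZ + trial)  # 906 MHz, then 806 MHz
```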
|
Joined: 18 Oct 13 Posts: 53 Credit: 406,647,419 RAC: 0
|
I don't think that is the underlying problem. All other applications, programs, BOINC projects, etc. are stable here. https://www.gpugrid.net/forum_thread.php?id=4097 |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Can you please remove the overclock, to at least test and rule that out? I have had overclocks where everything works great, except GPUGrid tasks, because they work parts of the GPU in different ways. So, when testing, it's best to remove the overclock (or even put it at -200), to confirm that it resolves the problems. |
|
Joined: 26 Mar 14 Posts: 101 Credit: 0 RAC: 0
|
If the problems you were having were with any of the workunits named "EQUI_26Apr_CXCL", most likely the problem was ours. These workunits were cancelled this morning (Spain time). Thanks for your understanding, and sorry for any inconvenience caused. |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
The problems in this thread are different from the "EQUI_26Apr_CXCL" TDR tasks. Any time it says "has become unstable", I continue to recommend taking the base clock down to attempt to resolve it. I wish people would listen :) |
|
Joined: 26 Mar 14 Posts: 101 Credit: 0 RAC: 0
|
In my experience, when the simulation gives an error of the type "has become unstable", it is mainly because of some misconfiguration of the molecular system (usually it means that a molecule "exploded" due to extreme forces; this is what was happening in my case). On the other hand, I have also noticed errors of the type "cuda errors", which are usually unsolvable and are related to specific cards or to some random error in the calculation. The third type is when no error is reported and the simulation seems to run indefinitely. I got some of those this last time with the corrupted "EQUI_26Apr_CXCL" batch. Do you think "has become unstable" errors could be also caused by overclocking? I doubt it; I would rather expect a "cuda error". |
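The three failure types described above can be told apart mechanically. Here is a rough sketch that sorts a task into one of them using only the phrases quoted in the post. The enum, the function name, the one-hour hang threshold and the sample strings are all hypothetical, and real task logs may word these messages differently.

```python
from enum import Enum


class Failure(Enum):
    UNSTABLE = "simulation has become unstable"
    CUDA = "cuda error"
    HANG = "no error, but no progress"
    UNKNOWN = "unclassified"


def classify(stderr_text: str, seconds_since_progress: float,
             hang_after: float = 3600.0) -> Failure:
    """Classify a task from its stderr text and the time since it last advanced."""
    text = stderr_text.lower()
    if "has become unstable" in text:
        return Failure.UNSTABLE
    if "cuda error" in text:
        return Failure.CUDA
    # No error message at all: treat a long stall as the third type.
    if seconds_since_progress >= hang_after:
        return Failure.HANG
    return Failure.UNKNOWN


# Made-up sample inputs, for illustration only.
print(classify("The simulation has become unstable.", 0.0))  # Failure.UNSTABLE
print(classify("", 7200.0))                                  # Failure.HANG
```

The classification says nothing about the cause. As the replies in this thread point out, an "unstable" error can come from either the molecular system or the hardware.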
|
Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0
|
Do you think "has become unstable" errors could be also caused by overclocking? They definitely can be. (EDIT2: it's like the older error: "energies have become NAN") EDIT3: these two tasks errored out on my new Palit JetStream GTX980 because it is factory overclocked, and MSI Afterburner raised its GPU clock by the same number of MHz as I had set for my standard GTX980: 14198435 14198172. Now both cards run fine at 1420MHz. EDIT: There should be a safety check calculation (with known results) built into the client, which would regularly check the condition of the GPU (say, every 20 minutes). EDIT4: the client should detect the real clock of the GPU somehow. |
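The suggested safety check amounts to a periodic known-answer test. The sketch below is purely illustrative and is not how the GPUGrid client works. `reference_workload` is a CPU stand-in for whatever GPU kernel a real client would run, and `known_answer_ok`, `PeriodicCheck` and the 20-minute default are names and values assumed for the example.

```python
import math
import time
from typing import Callable, Optional

N = 1000
# Closed-form sum of squares 1..N, used as the independently known result.
EXPECTED = N * (N + 1) * (2 * N + 1) / 6


def reference_workload() -> float:
    """Stand-in computation; a real check would run on the GPU."""
    return float(sum(k * k for k in range(1, N + 1)))


def known_answer_ok(compute: Callable[[], float], expected: float,
                    rel_tol: float = 1e-9) -> bool:
    """True if compute() matches the known result; a NaN result always fails."""
    return math.isclose(compute(), expected, rel_tol=rel_tol)


class PeriodicCheck:
    """Runs a check at most once per interval (default: every 20 minutes)."""

    def __init__(self, check: Callable[[], bool], interval_s: float = 20 * 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._check = check
        self._interval = interval_s
        self._clock = clock
        self._last: Optional[float] = None

    def poll(self) -> Optional[bool]:
        """Return the check result if one was due, otherwise None."""
        now = self._clock()
        if self._last is not None and now - self._last < self._interval:
            return None
        self._last = now
        return self._check()


checker = PeriodicCheck(lambda: known_answer_ok(reference_workload, EXPECTED))
print(checker.poll())  # first poll always runs the check
```

A client would call `poll()` from its main loop and, on a `False` result, could pause work or warn the user that the card is producing wrong results at its current clocks.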
|
Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 2
|
Hardware can also go out of tolerance if it's not given the correct supply voltage, either from the host PSU or via the BIOS/regulators on the card itself. It's possibly more likely that power components will suffer from aging when subjected to the continuous stress of GPGPU work, compared to the calculation components. |
|
Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Do you think "has become unstable" errors could be also caused by overclocking? Yes. Definitely. Based on my own experience with 3 factory-overclocked GPUs. |
|
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0
|
If the power, voltage, GPU clocks or GDDR5 clocks are too high for any given task then the task can fail. This is more commonly seen on smaller cards, which can be weaker by design yet are more fully used and pushed to their maximum (especially on XP and Linux, where GPU usage is often 99%). GPU usage varies by task type and batch. This is why one setup or OC might work for one batch but not another, and I too had issues with factory clocks on some smaller cards in the past. I've even seen situations where running some CPU WUs causes the CPU to run hot enough to raise the temperature of GPU0 by several degrees C. Just running climate models, for example, can increase power usage by 30 W, and that mostly ends up as heat in the case if you have a basic heatsink and fan cooler. However, on a decent setup (with a GPU fan profile), GPU core clocks or temps may or may not be the reason tasks fail. |
|
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0
|
Tried 353.06 on an XP x86 system with a GTX770 (GPU0) and a GTX670 (GPU1). While the GTX670 ran at ~98% power and ~95% GPU usage, the GTX770's power remained at 48% constantly: it had downclocked and couldn't be coaxed back to where it should be, despite restarts and task swapping. I've tried other recent drivers too, with similar results. Half expecting a GPU power-related issue, I went back to 344.75 to see whether it's the driver, the mobo, a connector... Now both the GTX770 and GTX670 are running at ~95% GPU usage. The power usage for the GTX670 is 98%, while the power usage of the GTX770 is ~78%. I think it's fair to conclude that the 344.75 driver works well on Windows XP while the more recent drivers do not. |