Nearly every WU crashes - what's wrong here ?
| Author | Message |
|---|---|
|
Joined: 17 Nov 12 Posts: 30 Credit: 111,887,025 RAC: 0
|
Hi @all,

does anyone know what's happening here? Nearly all of the GPUGRID WUs crash on my machine.
http://www.gpugrid.net/results.php?userid=93083

It's a Q8200 CPU with 3 GPUs:
- GTX260
- GTX460
- GTX560Ti

The 260 is excluded from GPUGRID.

Over the last few days I've read a lot about crashing GPUGRID tasks, so I've done the following so far:
- disabled the screensaver and all energy-saving features
- set BOINC project switching to 1440 mins (24h) to prevent GPUGRID WUs from being suspended
- installed BOINC 7.2.4
- checked the cooling - everything's fine (according to HWINFO64 and GPU-Z)
- disabled all other BOINC projects on the machine - no effect

Hope someone can help here.
Thanks in advance
Rene |
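For readers wondering how a single GPU gets excluded from one project: the BOINC client reads an `<exclude_gpu>` entry from `cc_config.xml`. The sketch below only builds such an entry as text. The device number 0 is an assumption (the thread does not say which index the GTX260 has), the URL has to match the project URL as the client lists it, and the helper name `build_exclude_gpu_config` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_exclude_gpu_config(project_url: str, device_num: int, gpu_type: str = "NVIDIA") -> str:
    """Return cc_config.xml text that excludes one GPU from one project."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    exclude = ET.SubElement(options, "exclude_gpu")
    ET.SubElement(exclude, "url").text = project_url
    ET.SubElement(exclude, "device_num").text = str(device_num)
    ET.SubElement(exclude, "type").text = gpu_type
    return ET.tostring(root, encoding="unicode")


# device_num=0 is a hypothetical index for the GTX260, not taken from the thread
config_text = build_exclude_gpu_config("http://www.gpugrid.net/", 0)
print(config_text)
```

The client's startup messages list the device numbers it assigned, so that is probably the place to check before picking the index.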
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
9 valids and 55 errors tell me that your system can do work but isn't set up well. Some WUs have recently been more troublesome, but you have completed work from different queues at different times. I would suggest you stop running CPU tasks to see if your system performs better. What are the GPU and CPU temperatures? Also, IF you don't use the 260, remove it; it's quite power hungry. Running many different GPU projects can bring its own set of problems. If you must, I suggest keeping a very small cache/buffer of work and setting BOINC to switch between apps every 999 minutes. Should you not see any improvement, try the 314 drivers (advanced/clean install). FAQ's HOW TO: - Opt out of Beta Tests - Ask for Help |
|
Joined: 17 Nov 12 Posts: 30 Credit: 111,887,025 RAC: 0
|
Thanks for the quick reply. The 260 is crunching for POEM at the moment. But let's give it a try. I will disable all the other projects and only run GPUGRID. The 260 will stay excluded for now. Since the machine is located in a remote server room, it will be a bit difficult to remove the 260 on the fly ;)

According to HWINFO, the CPU core temps are around 55 deg; the GPU temps are
- 78 deg for the GTX460
- 69 deg for the GTX260
- 72 deg for the GTX560Ti

Should be no problem IMHO. |
|
Joined: 17 Nov 12 Posts: 30 Credit: 111,887,025 RAC: 0
|
Ok, I've disabled all other projects on this machine, did a complete driver cleanup (incl. Driver Cleaner PE) and reinstalled the NVIDIA drivers 314.22 (only the graphics driver). The GTX260 is still excluded from GPUGRID (makes no sense performance-wise IMHO). I've just received a couple of short runs...let's see what happens... |
|
Joined: 17 Nov 12 Posts: 30 Credit: 111,887,025 RAC: 0
|
Hmmm...looks like the first short run crashed again after about 2 hours on the 560Ti... :( This isn't funny anymore... |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
I suggest you try using fan control software to keep the GPU temps below 70°C, if possible. |
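To make the 70°C target concrete, here is a small sketch that checks the readings reported earlier in the thread against it. The function name `gpus_over_limit` is invented for this example; it only compares numbers and does not talk to any monitoring tool.

```python
GPU_TEMP_LIMIT_C = 70  # target suggested in the post above


def gpus_over_limit(temps_c: dict[str, int], limit_c: int = GPU_TEMP_LIMIT_C) -> list[str]:
    """Return the names of GPUs at or above the limit, hottest first."""
    hot = [name for name, temp in temps_c.items() if temp >= limit_c]
    return sorted(hot, key=lambda name: temps_c[name], reverse=True)


# Readings reported earlier in the thread (HWINFO, degrees Celsius)
reported = {"GTX460": 78, "GTX260": 69, "GTX560Ti": 72}
print(gpus_over_limit(reported))  # ['GTX460', 'GTX560Ti']
```

By that yardstick both cards that actually run GPUGRID were above the target at the time, while the excluded GTX260 was just under it.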
|
Joined: 17 Nov 12 Posts: 30 Credit: 111,887,025 RAC: 0
|
In a fit of deep frustration, I've just reinstalled Linux on this machine (which it already ran before Win7) ;) Let's see if it runs a bit more stably again. But I'll keep your suggestion in mind. |
|
Joined: 17 Nov 12 Posts: 30 Credit: 111,887,025 RAC: 0
|
Hmmmm...this bloody thing keeps crashing... :( But only GPUGRID tasks crash on this machine. All other projects (even GPU ones) are working fine. But is it normal that each GPUGRID task requests 16 GB of virtual memory? |
|
Joined: 17 Nov 12 Posts: 30 Credit: 111,887,025 RAC: 0
|
Ok, next try:
- removed the 260
- added 4 GB more physical RAM (6 GB total)

Started crunching 1 short and 1 long... |
|
Joined: 17 Nov 12 Posts: 30 Credit: 111,887,025 RAC: 0
|
It's absolutely unbelievable and totally annoying...the GPUGRID tasks still crash after an indeterminate time... :( Is there something like a "boinc task debug mode"? |
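On the "debug mode" question: I can't confirm a single switch, but the BOINC client does accept per-topic log flags in `cc_config.xml`, among them `task_debug` and `coproc_debug`. The sketch below just generates the corresponding `<log_flags>` section as text. Which flags are most useful for these particular crashes is a guess, and the helper name `build_log_flags_config` is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_log_flags_config(flags: list[str]) -> str:
    """Return cc_config.xml text with each given log flag switched on."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for flag in flags:
        ET.SubElement(log_flags, flag).text = "1"
    return ET.tostring(root, encoding="unicode")


# The choice of flags is an assumption, not advice given in the thread
debug_config = build_log_flags_config(["task_debug", "coproc_debug"])
print(debug_config)
```

The extra messages then show up in the client's event log; they describe what the client does with a task, not what happens inside the science application itself.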
|
Joined: 5 May 13 Posts: 187 Credit: 349,254,454 RAC: 0
|
Why don't you try removing all other GPUs from the machine? Maybe you're having other issues, like heat, power, etc. After everything else you've done, I think it's only a little more trouble to go through.
|
|
Joined: 17 Nov 12 Posts: 30 Credit: 111,887,025 RAC: 0
|
The thing is that only GPUGRID tasks are crashing. All the other projects (CPU or GPU) are working fine. But I will try that as well. |
|
Joined: 17 Nov 12 Posts: 30 Credit: 111,887,025 RAC: 0
|
Ok, next try:
- removed 4GB of RAM (the two modules left are working for sure, according to 48h of memtest86)
- added 2 more 120mm fans (in total: 1 intake, 3 exhausts, each @2000 rpm)

Crunching 1 short and 1 long now. CPU temps for all cores are 42 deg, GPU temps are at 73 deg. Let's see what happens... |
|
Joined: 1 Dec 12 Posts: 24 Credit: 60,122,950 RAC: 0
|
Do you have an overclock on your CPU or any FSB increase? In my case it helped when I set it back to stock clocks (Q6600). CPU times are higher now, but it's stable! I think my northbridge was running too hot with the CPU OC + folding @ PrimeGrid and 2 downclocked GPUs + folding @ GPUGRID. It is really a pain in the ass when WUs error out again and again. I know what it feels like. Very frustrating. Have you also seen this thread? |
|
Joined: 17 Nov 12 Posts: 30 Credit: 111,887,025 RAC: 0
|
No OC in place, neither CPU nor GPU, except for the 560Ti, which is a Palit Sonic and therefore has some factory OC. The WUs are still crashing... I think I can rule out any heat issues now. The RAM also works fine, if I can trust memtest86. This leaves the mainboard (MSI P7N SLI) and the PSU (650W LCPower) on the table. I know the LCPower isn't high end, but the same model is also working in my other BOINC machines and they don't show any issues. And a 560Ti + GTX460 + Q8200 should be no problem for this PSU. In another one of my machines it powers an i7-3770 + two HD6950s. The Radeons are running 3 MilkyWay WUs in parallel and are at 100%, and all CPU cores are at 100% too. So the PSU should work with this setup. Hmmm... |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
"GPU temps are at 73 deg." That's not disastrous but it's still slightly on the high side. I suggest you use reference settings for the 560Ti (822/1645/4008). If you haven't done it yet, try tweaking the clocks down a bit. It might be the case that the voltage isn't ideal for the clocks, so by dropping the clocks a notch you might find stability. I would start by reducing both by ~5% and test it on the short tasks. As FoldingNator said, overclocking the FSB can cause issues, and you shouldn't OC the PCIE bus for here. If you haven't OC'ed, and have tried everything else, you might want to reduce the FSB and possibly even the PCIE (though that might cause as many problems as it resolves). |
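As a worked example of the ~5% reduction, the sketch below scales the reference figures quoted above. Reading 822/1645/4008 as core/shader/memory clocks in MHz is my interpretation of the post, and the helper `reduce_clocks` is made up for this illustration. As far as I know the shader clock on these cards follows the core clock, so a tuning tool may only expose two of the three values, and it may snap to slightly different steps.

```python
# Labels are an interpretation of the 822/1645/4008 figures in the post above
REFERENCE_CLOCKS_MHZ = {"core": 822, "shader": 1645, "memory": 4008}


def reduce_clocks(clocks_mhz: dict[str, int], percent: float) -> dict[str, int]:
    """Scale every clock down by the given percentage, rounded to whole MHz."""
    factor = 1 - percent / 100
    return {name: round(mhz * factor) for name, mhz in clocks_mhz.items()}


print(reduce_clocks(REFERENCE_CLOCKS_MHZ, 5))
# {'core': 781, 'shader': 1563, 'memory': 3808}
```

So a first test on the short tasks would run at roughly 781 MHz core and 3808 MHz memory.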
|
Joined: 17 Nov 12 Posts: 30 Credit: 111,887,025 RAC: 0
|
Ok, thanks for the advice. I will try that. Interestingly, all the latest crashes were caused by segmentation violations... How much physical RAM should be in the box to run 2 or 3 GPUGRID WUs in parallel? At the moment the box has 2 GB physical and plenty of swap... |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
The GPUGrid tasks use about 140MB of system memory each, but if you are doing anything else (CPU crunching) 2GB is probably not enough, especially on Windows 7 - the OS is likely reading and writing to the drive a lot. Task Manager will tell you what you are using. I suggest you put the other 2GB back in, and maybe give it a quick test by swapping it with the existing modules first. |
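The arithmetic behind that answer can be sketched as follows, using the ~140MB per task quoted above and the 2 GB currently installed. The `other_usage_mb` parameter is a hypothetical placeholder for the OS and any CPU tasks; the thread gives no figure for those, so the numbers below cover the GPUGRID tasks alone.

```python
GPUGRID_TASK_MB = 140  # per-task system memory quoted in the post above


def ram_left_mb(physical_mb: int, gpu_tasks: int, other_usage_mb: int = 0) -> int:
    """Physical RAM left after GPUGRID tasks and any other known usage."""
    return physical_mb - gpu_tasks * GPUGRID_TASK_MB - other_usage_mb


for tasks in (2, 3):
    print(tasks, "tasks leave", ram_left_mb(2048, tasks), "MB of 2048 MB")
# 2 tasks leave 1768 MB of 2048 MB
# 3 tasks leave 1628 MB of 2048 MB
```

The GPUGRID tasks themselves are a small part of the budget; whatever the OS and any CPU projects need has to fit in the remainder, which is why Task Manager is the better judge.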
|
Joined: 17 Aug 08 Posts: 2705 Credit: 1,311,122,549 RAC: 0 |
- power down, remove the power cord, wait at least 15 minutes and try again - this has solved some weird sh*t for me in the past!
- run GPU-Grid only on the 460 or 560 - let's see if either one can be singled out
- I agree with slightly lower GPU core and memory clocks - the cards may have degraded slightly over time and might no longer be stable at stock clocks under high load in the summer (not much beats GPU-Grid except Furmark)
- try regular 3D tests like some 3D Mark
- removing the 260 should have ruled out the PSU... but I'd try another one anyway, since they do break and can cause weird errors

MrS
Scanning for our furry friends since Jan 2002 |
|
Joined: 17 Nov 12 Posts: 30 Credit: 111,887,025 RAC: 0
|
Ok, did the following now:
- re-installed Win7 (because of better tool support)
- down-clocked the GPUs to NVIDIA defaults
- fixed fan rpm to 75% for both cards

Started crunching 1 short and 1 long again...let's see what happens... GPU temps are @ 69 deg (GTX460) and 58 deg (GTX560Ti) |
©2025 Universitat Pompeu Fabra