acemd2 stops checkpointing?
| Author | Message |
|---|---|
|
Joined: 11 Jul 09 Posts: 27 Credit: 1,000,618,568 RAC: 0
|
Just got a new video card (ASUS ENGTX260), which is in the same system that has been successfully running GPUGRID for over a year (config below). With the new card, tasks are not erroring out (yet): the "current CPU time" is increasing normally, but the "checkpoint CPU time" has stopped updating. Any idea whether this is normal or something is wrong? Configuration: Athlon 64 X2 6000+; Linux 2.6.34 x86_64 (Fedora 13); NVIDIA driver Linux x86_64 256.35; ASUS ENGTX260 (GeForce GTX 260/216) |
GDF Joined: 14 Mar 07 Posts: 1958 Credit: 629,356 RAC: 0 |
It might be hung. The progress indicator should always work. gdf |
|
Joined: 11 Jul 09 Posts: 27 Credit: 1,000,618,568 RAC: 0
|
It definitely seems hung, even though the "current CPU time" continues to climb and the card appears to be crunching away at full speed. I noticed that once the "checkpoint CPU time" stops increasing, the "fraction done" stops increasing, as well. If I stop and restart the BOINC client, it starts over from the last checkpoint and seems to work normally for 30-60 minutes, then gets hung again. Any tips on further troubleshooting or diagnosis? It seems like overheating is a very common problem, but how do I determine if that is the issue? I already have as many case fans packed into this thing as I can... Thank you. |
|
Joined: 4 Apr 09 Posts: 450 Credit: 539,316,349 RAC: 0
|
To see whether it is heat, download and install GPUZ, nvidiaInspector, Precision, or RealTemp (all free; a quick search will find download sites). They will tell you not only the GPU temperature but also the clock speeds you are actually running at. Are you sure you are not soft-crashing down into 2D mode? Are you sure no power settings on your PC are throttling it down? Thanks - Steve |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
ETQuestor, how much RAM does your system have? What else are you crunching? What version of BOINC are you using? Is your BIOS configured to put devices to sleep? |
|
Joined: 11 Jul 09 Posts: 27 Credit: 1,000,618,568 RAC: 0
|
Snow Crash, do any of those info utilities support Linux? I don't have Windows on this system. I have no idea whether I'm dropping to 2D mode (how do I check?) or have a problem with power settings, but I ran GPUGRID on a GeForce 9600 GSO in this same system with no problem for almost a year. skgiven, I have 3GB of RAM and the OS reports only 25% RAM utilization. The only other thing running is SETI@Home, which only uses the CPU (they don't have a GPU app for Linux). I have BOINC 6.10.56 (x86_64 Linux). I don't think the BIOS is configured to put devices to sleep, but I'll check once I get home. Thanks to both of you for your suggestions. |
nenym Joined: 31 Mar 09 Posts: 137 Credit: 1,431,087,071 RAC: 64,039
|
"(they don't have a GPU app for linux). I have BOINC 6.10.56 (x86_64 Linux)" You can try Crunch3r's app: 2.2 at http://calbe.dw70.de/mb/viewtopic.php?f=9&t=116 or 3.0 at http://calbe.dw70.de/mb/viewtopic.php?f=9&t=120 . It needs a little ldd and ldconfig work, but nothing difficult. For more info, see Lunatic's Seti forum. |
|
Joined: 11 Jul 09 Posts: 27 Credit: 1,000,618,568 RAC: 0
|
I just found the "nvidia-smi" utility which is installed as part of the nvidia driver. It doesn't report back much, but it does read the GPU core temp, which is pegged at 79 degrees C. I *think* that is well within the operating range, so I think this means overheating is unlikely as the cause, right? |
|
Joined: 11 Jul 09 Posts: 27 Credit: 1,000,618,568 RAC: 0
|
OK, this is getting frustrating. I managed to finish one WU, but they are mostly erroring out now. The common error is listed below. I also noticed that the hangs (and the final errors) always occur on multiples of 15 minutes of CPU time (e.g., current CPU time = 1800 seconds). That can't be a coincidence. `SWAN : FATAL : Failure executing kernel sync [transpose_float2] [700]` `` acemd2_6.04_x86_64-pc-linux-gnu__cuda: ../swan/swanlib_nv.cpp:203: void swanRunKernel(const char*, int3, int3, size_t, ...): Assertion `0' failed. `` |
GDF Joined: 14 Mar 07 Posts: 1958 Credit: 629,356 RAC: 0 |
Can you show your computers? gdf |
|
Joined: 11 Jul 09 Posts: 27 Credit: 1,000,618,568 RAC: 0
|
GDF, I'm not sure whether you are just asking me to identify which computer is mine or asking me to do something. My computer is http://www.gpugrid.net/show_host_detail.php?hostid=43352 Please clarify if you wanted me to do something specific and I'll be happy to do it. |
skgiven Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 |
I think it could be a bit warm, and the WUs may be crashing when the system is in use. The tolerance of one card is not always the same as another's, and it changes over the card's life span. Leave the case door off; that should let the card cool down a bit more. Then don't use the system while a task is running and see how it gets on. |
|
Joined: 11 Jul 09 Posts: 27 Credit: 1,000,618,568 RAC: 0
|
Thanks for the advice, everyone. I invested some energy into cooling fans and whatnot a while ago, so I have a lot of air movement through my system. This is also a server, so it is 99.9% idle other than GPUGrid. Also, the core temp of the GTX 260 never exceeds 80 degrees C, which is well within normal operating range. I ran memtestg80 and consistently got errors in the high memory, so I am returning the card as defective. |
|
Joined: 11 Jul 09 Posts: 27 Credit: 1,000,618,568 RAC: 0
|
I swapped in a replacement card and it's happily running. Looks like the old card was defective. |
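
For anyone hitting the same symptom: the pattern described in this thread (current CPU time keeps climbing while checkpoint CPU time stays frozen, with hangs landing on multiples of 15 minutes) can be checked mechanically. Below is a minimal sketch in Python, assuming you already have some way to sample those two values periodically. The `Sample` type, the `is_stalled` and `near_multiple` helpers, and the 600-second and 5-second thresholds are hypothetical names and values picked for illustration; none of them come from BOINC or acemd2.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One observation of a running task (hypothetical type; values in seconds)."""
    current_cpu_time: float
    checkpoint_cpu_time: float


def is_stalled(samples: Sequence[Sample], max_gap: float = 600.0) -> bool:
    """Return True if, between the last two samples, CPU time advanced while
    the checkpoint time did not, and the gap between them exceeds max_gap.
    max_gap is an assumed threshold, not a BOINC setting."""
    if len(samples) < 2:
        return False  # not enough data to compare
    prev, last = samples[-2], samples[-1]
    checkpoint_frozen = last.checkpoint_cpu_time <= prev.checkpoint_cpu_time
    cpu_advancing = last.current_cpu_time > prev.current_cpu_time
    gap = last.current_cpu_time - last.checkpoint_cpu_time
    return checkpoint_frozen and cpu_advancing and gap > max_gap


def near_multiple(seconds: float, period: float = 900.0,
                  tolerance: float = 5.0) -> bool:
    """Return True if seconds lies within tolerance of a multiple of period
    (900 s = the 15-minute pattern reported in this thread)."""
    remainder = seconds % period
    return min(remainder, period - remainder) <= tolerance


if __name__ == "__main__":
    healthy = [Sample(100, 90), Sample(400, 390)]
    hung = [Sample(1800, 1795), Sample(2600, 1795)]
    print("healthy stalled?", is_stalled(healthy))
    print("hung stalled?", is_stalled(hung))
    print("1800 s near a 15-minute multiple?", near_multiple(1800))
```

This only flags the symptom. As the thread shows, the root cause here was faulty GPU memory, which a GPU memory test (memtestg80 in this case) exposed.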
©2026 Universitat Pompeu Fabra