High Failure Rate of SANTI Tasks
| Author | Message |
|---|---|
|
Joined: 5 May 13 Posts: 187 Credit: 349,254,454 RAC: 0
|
I am creating this thread because the other SANTI thread has been closed. Basically every SANTI task I've gotten has failed! ALL these tasks have also failed at least once on another host, although many of them have succeeded in the end. I think the project people must take a look.
|
|
Joined: 5 May 13 Posts: 187 Credit: 349,254,454 RAC: 0
|
I keep having many SANTIs fail on my 750Ti on Linux. Some of them succeed, but most of them fail. They run completely OK on my 650Ti in the same computer. Many of them fail on other systems as well, not just on mine. Eventually, they seem to find a system to their liking and succeed!

Two recent examples (my computer being 171276):
http://www.gpugrid.net/workunit.php?wuid=8275005
http://www.gpugrid.net/workunit.php?wuid=8281764

The recently failed SANTIs on my 750Ti:
http://www.gpugrid.net/result.php?resultid=11483800
http://www.gpugrid.net/result.php?resultid=11256545
http://www.gpugrid.net/result.php?resultid=11106610
http://www.gpugrid.net/result.php?resultid=11103836
http://www.gpugrid.net/result.php?resultid=11103808

I know my card is not a dud, since I tried it with Einstein and it worked OK. Something else must be wrong. I'm using the newest driver for Linux, 337.25. The errors occurred with older versions as well, 334.21 and 331.49.

Can a project researcher / engineer please take a look? I'll be glad to provide more information or try remedies.
|
|
Joined: 5 May 13 Posts: 187 Credit: 349,254,454 RAC: 0
|
I switched my 750Ti to Einstein for about a day and a half just to make sure the card is OK. I crunched 10 WUs successfully, so I am positive my card is OK. I switched back to GPUGRID yesterday evening and crunched a NOELIA_MG1EC. Then I got another SANTI_p53final, which errored out about 2 hours in. And take a look at this one: http://www.gpugrid.net/workunit.php?wuid=8259035 Something is definitely wrong with these WUs!
|
|
Joined: 6 Jan 09 Posts: 4 Credit: 151,278,745 RAC: 0
|
I hope you realized that one of the failed WUs was completed by a 750Ti under Windows. There is some indication that GPU memory is a factor for some GPUGRID WUs. For the last one you linked, it will be interesting to see whether that 780Ti can complete it, since it has more memory available than the other cards.

Now looking at your 750Ti host, it has 2 GPUs. Based on the forum, that is not easy to troubleshoot, but you should try to find the various suggestions for similar systems. However, you have a high failure rate, so I think the problem is with your system, probably related to overheating.

The main things that I would suggest are:
- Upgrade the Linux kernel and GPU driver if you can.
- Monitor GPU and system temps (the newer NVIDIA driver is better for that under Linux). If these get too hot, then you know you need better cooling or to downclock the GPUs.

Some other things that you can experiment with:
- Use only one GPU; one valid WU is better than 2 failed WUs. Also switch cards to ensure both work correctly.
- Run under the short queue to see what happens (if it is overheating, then hopefully WUs complete before temps get too high).
- Run only GPU work so the CPU can maintain the GPUs and not contribute to system heating.

From that you should get some clue about what is happening with your system and what you can do about it. |
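Editor's sketch of the temperature-monitoring suggestion above, in Python. It assumes the installed driver's `nvidia-smi` supports `--query-gpu` with the `temperature.gpu` field; the function names and the 80 C limit are hypothetical choices for illustration, not anything specified in this thread or by the project.

```python
import subprocess

# Fields requested from nvidia-smi; the output is one CSV line per GPU.
QUERY_FIELDS = "index,name,temperature.gpu"


def parse_gpu_temps(csv_text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    readings = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        try:
            temp_c = int(parts[-1])
        except ValueError:
            continue  # the driver reported a non-numeric value for this GPU
        readings.append({
            "index": int(parts[0]),
            "name": ", ".join(parts[1:-1]),
            "temp_c": temp_c,
        })
    return readings


def too_hot(readings, limit_c=80):
    """Return readings at or above limit_c (80 C is an arbitrary example)."""
    return [r for r in readings if r["temp_c"] >= limit_c]


def read_gpu_temps():
    """Run nvidia-smi once and return the parsed per-GPU temperatures."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_gpu_temps(out)
```

Calling `read_gpu_temps()` periodically while a WU is running and logging the result of `too_hot(...)` would show whether failures line up with temperature spikes on one particular card.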
|
Joined: 5 May 13 Posts: 187 Credit: 349,254,454 RAC: 0
|
My setup handles heat pretty adequately. To give you an idea, with ambient temperatures at ~30C my GPUs and CPU max out at ~70C. I've invested much time in building my crunching rig for heat and I think it works fine. It's also pretty quiet; you can easily sit next to it. It's not easy to tolerate the heat it emits though, that top exhaust fan is like a small oven! I also don't think it's the memory size, since my 650Ti crunches everything like a boss and it has half the memory of the 750Ti. I tried three different NVIDIA driver versions to no avail. Upgrading the kernel is a good idea and I think it's the next thing I will try. Another suspect in my mind is the motherboard. My ASUS P7P55D-E's second PCIEx16 slot works at x4. Over at Einstein it caused the 750Ti to perform really slowly, but maybe it causes stability problems with GPUGRID as well. I don't think I can avoid fiddling with the hardware again: swapping cards between slots, leaving just the 750 in there, etc.
|
|
Joined: 6 Jan 09 Posts: 4 Credit: 151,278,745 RAC: 0
|
Other systems are completing those failed WUs, so it is not the WUs. (Okay, there are occasionally bad batches, which usually get sorted out very quickly.) Hopefully you figure out what the problem with your system is. |
|
Joined: 5 May 13 Posts: 187 Credit: 349,254,454 RAC: 0
|
I upgraded the kernel today to the latest one for Ubuntu 12.04, 3.13. I then got two NOELIA_TRPS1S4; the one assigned to the 750Ti failed after ~1100 sec... The card then got a SANTI_p53final and it's still crunching that ~10 hours later. Maybe it will complete it (knocking on wood!) I also discovered today that the 4 lanes of my motherboard's second PCIEx16 slot come from the chipset, not from the CPU. I don't know if that could cause the errors, but I guess the next thing to test would be to take out the 650Ti and leave only the 750Ti in, on the primary PCIEx16 slot.
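Editor's sketch for checking which PCIe link each card actually negotiated before and after swapping slots, in Python. It assumes the installed `nvidia-smi` supports the `pcie.link.gen.current` and `pcie.link.width.current` query fields (older drivers may report them as unsupported); the function names and the expected width of 16 are hypothetical. Note that the reported link generation can drop while a GPU is idle, so readings are best taken under load.

```python
import subprocess

# Fields requested from nvidia-smi; the output is one CSV line per GPU.
PCIE_FIELDS = "index,name,pcie.link.gen.current,pcie.link.width.current"


def _to_int(text):
    """Return int(text), or None when the driver reports a non-numeric value."""
    try:
        return int(text)
    except ValueError:
        return None


def parse_pcie_links(csv_text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    links = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 4:
            continue  # skip blank or malformed lines
        links.append({
            "index": int(parts[0]),
            "name": ", ".join(parts[1:-2]),
            "gen": _to_int(parts[-2]),
            "width": _to_int(parts[-1]),
        })
    return links


def below_width(links, expected=16):
    """Return GPUs whose current link is narrower than `expected` lanes."""
    return [l for l in links if l["width"] is not None and l["width"] < expected]


def read_pcie_links():
    """Run nvidia-smi once and return the parsed per-GPU link information."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={PCIE_FIELDS}",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_pcie_links(out)
```

Running `below_width(read_pcie_links())` with both cards installed, and again with only the 750Ti in the primary slot, would confirm whether the card really moved from an x4 to an x16 link between tests.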
|
|
Joined: 17 Mar 10 Posts: 23 Credit: 1,173,824,416 RAC: 0
|
I also had two WUs from SANTI that failed with the message "The simulation has become unstable. Terminating to avoid lock-up (1)". This card runs other WUs just fine, so I don't think it is a hardware or driver problem. http://www.gpugrid.net/result.php?resultid=12188313 http://www.gpugrid.net/result.php?resultid=12529134 |
Beyond Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0
|
I'm running 7 EVGA & PNY factory OCed 750 Ti cards in Win7-64 and have had only 1 error: http://www.gpugrid.net/workunit.php?wuid=8906204 which appears to be a bad WU. Many, many SANTI WUs have completed successfully on these cards. As I have asked twice before, can you try your card on a Windows box? I suspect the card has problems, but testing it in a different environment will give you some answers. |
|
Joined: 5 May 13 Posts: 187 Credit: 349,254,454 RAC: 0
|
Hey Beyond, thanks for your response! Yes, I have concluded it is not the SANTIs after all, but something with my system. The card has failed with all sorts of WUs, but I have made some interesting observations. Take a look at the 750TI-650TI Combo on Linux thread, where I am continuing this discussion, as it is not a matter of SANTIs any more and I don't want to keep this thread at the head of the Number crunching section. Your input is always welcome and appreciated!
|
©2025 Universitat Pompeu Fabra