TONI_KIDln issues
| Author | Message |
|---|---|
Retvari Zoltan (Joined: 20 Jan 09, Posts: 2380, Credit: 16,897,957,044, RAC: 0)
|
I have two almost identical PCs. One of them began failing almost every TONI_KIDln WU, while the other does not. I tried updating the BOINC manager and the NVidia drivers, but it didn't help.

The 1st (failing) PC's config:
MB: Asus P5Q pro (Intel P45 chipset)
CPU: Core2 Quad 6600 @ 2.4GHz (stock)
RAM: 4GB 1066MHz DDR2 Kingston HyperX
VGA: Gigabyte GV-N480D5-15I-B (stock clocking, 72-76°C)
PSU: Chieftec A135-1000W
OS: WinXP SP3 x86
Boinc MGR 6.11.4
NVidia drivers 259.31
swan_sync=0

The 2nd PC's config:
MB: Asus P5Q Deluxe (Intel P45 chipset)
CPU: Core2 Quad 9550 @ 2.83GHz (stock)
RAM: 8GB 1066MHz DDR2 Kingston HyperX
VGA: Asus ENGTX480 (stock clocking, 72-76°C)
PSU: Gigabyte Superb 720W
OS: WinXP SP3 x86
Boinc MGR 6.11.4
NVidia drivers 259.12
swan_sync=0

Any ideas? |
skgiven (Joined: 23 Apr 09, Posts: 3968, Credit: 1,995,359,260, RAC: 0) |
The errors are not time specific. The TONI_KIDln tasks are failing at any time, from immediately through to over about 4000sec. I did see the odd other task failure too, like this one, but almost all failures were for TONI_KIDln tasks. Using 6.11.4 and more recent drivers seems to make no difference to the error messages:

<core_client_version>6.11.4</core_client_version>
<![CDATA[
<message>
Nem megfelelő funkció. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 480"
# Clock rate: 1.40 GHz
# Total amount of global memory: 1610153984 bytes
# Number of multiprocessors: 15
# Number of cores: 120
SWAN: Using synchronization method 0
MDIO ERROR: cannot open file "restart.coor"

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
Nem megfelelő funkció. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 480"
# Clock rate: 1.40 GHz
# Total amount of global memory: 1610153984 bytes
# Number of multiprocessors: 15
# Number of cores: 120
SWAN: Using synchronization method 0
MDIO ERROR: cannot open file "restart.coor"

(The Hungarian message "Nem megfelelő funkció." is the Windows error "Incorrect function.")

There is the odd nan error too, but I would expect to get the odd one of these anyway.

SWAN: Using synchronization method 0
MDIO ERROR: cannot open file "restart.coor"
ERROR: file deven.cpp line 855: # Energies have become nan

Perhaps these tasks are stressing your card and finding a weakness with it, or your system. I would be inclined to suspend any CPU tasks, close Boinc, restart and just run GPU tasks to see if the failures continue, just in case you are crunching a greedy CPU task that hogs system memory. Perchance did you check how much system memory was being used and are you crunching Lattice CPU tasks?

You might also want to make a RAM testing disk, and boot to it. http://oca.microsoft.com/en/windiag.asp

You might want to consider swapping the GPUs between systems. That would tell you a lot, but I would be inclined to test the RAM first!

PS. There is no point having 8GB RAM in your other system, running XP x86! |
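The distinction drawn above between the common failure and the odd nan failure can be checked mechanically when there are many task results to go through. The following is only a minimal sketch, assuming the stderr text of a task has already been copied into a string; the function names `summarize_stderr` and `has_nan_failure` are hypothetical helpers, not part of BOINC or the GPUGRID application.

```python
import re

# Sample stderr text, copied from the task output quoted in this thread.
SAMPLE = """# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 480"
# Clock rate: 1.40 GHz
# Total amount of global memory: 1610153984 bytes
# Number of multiprocessors: 15
# Number of cores: 120
SWAN: Using synchronization method 0
MDIO ERROR: cannot open file "restart.coor"
ERROR: file deven.cpp line 855: # Energies have become nan
"""


def summarize_stderr(text):
    """Return (device_name, error_lines) found in a task's stderr text."""
    device = None
    errors = []
    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r'# Device \d+: "(.+)"', line)
        if match:
            device = match.group(1)
        elif "ERROR" in line:
            # Collect every line mentioning ERROR; no judgement is made
            # here about which of them actually caused the failure.
            errors.append(line)
    return device, errors


def has_nan_failure(errors):
    """True if any collected error line reports nan energies."""
    return any("Energies have become nan" in e for e in errors)


device, errors = summarize_stderr(SAMPLE)
print(device, len(errors), has_nan_failure(errors))
```

Running this over the stderr of each failed task would separate the nan failures from the rest without reading every log by hand.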
liveonc (Joined: 1 Jan 10, Posts: 292, Credit: 41,567,650, RAC: 0)
|
I'd make a wild guess: it's your RAM. Why? Kingston HyperX is the RAM I've had the most difficulties with. It needs such a high RAM voltage compared with so many others just to get stable, sometimes so much that the mainboard starts freaking out. I can't imagine 8GB (4x2GB) is fun or easy; I've had problems with 4GB (2x2GB) of that RAM on Asus, MSI, & XFX mainboards. They're "touchy" DDR2 at 1066MHz; better to lower to 800MHz or increase the RAM voltage, which "may" also require you to increase the NB voltage.
|
Retvari Zoltan
|
> I would be inclined to suspend any CPU tasks, close Boinc, restart and just run GPU tasks to see if the failures continue;

I'll try this.

> Perchance did you check how much system memory was being used and are you crunching Lattice CPU tasks?

About 1.5GB used. I run 3 Rosetta tasks simultaneously; they consume about 250~400kb each. But that's true for my other system too. I don't crunch Lattice at all.

> You might also want to make a RAM testing disk, and boot to it.

I have Vista x64 and Win 7 x64 on both systems, so I can run their RAM test, but that'll be the last thing :)

> You might want to consider swapping the GPUs between systems. That would tell you a lot

That came to my mind too. I was just hoping there would be some other (simple) way to figure it out.

> PS. There is no point having 8GB RAM in your other system, running XP x86!

I know :) I play on the same PC sometimes, on Win 7 x64. This was my default operating system until I started crunching... |
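The remark about 8GB under XP x86 comes down to address-space arithmetic. A minimal sketch, assuming a plain 32-bit physical address space with no extensions; the variable names are illustrative, and the memory actually usable on a real system is typically somewhat lower than this upper bound because part of the address range is reserved for devices.

```python
ADDRESS_BITS = 32   # width of an address on an x86 (32-bit) OS
INSTALLED_GIB = 8   # RAM installed in the second PC

# Largest amount of memory a 32-bit address can reach.
addressable_bytes = 2 ** ADDRESS_BITS
addressable_gib = addressable_bytes / 2 ** 30

# RAM beyond that bound cannot be addressed by the 32-bit OS.
unreachable_gib = max(0, INSTALLED_GIB - addressable_gib)

print(addressable_gib, unreachable_gib)
```

Under this assumption, at least half of the installed 8GB goes unused until the machine is booted into one of the x64 systems.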
Retvari Zoltan
|
> I'd make a wild guess: it's your RAM. Why? Kingston HyperX is the RAM I've had the most difficulties with. It needs such a high RAM voltage compared with so many others just to get stable, sometimes so much that the mainboard starts freaking out. I can't imagine 8GB (4x2GB) is fun or easy; I've had problems with 4GB (2x2GB) of that RAM on Asus, MSI, & XFX mainboards. They're "touchy" DDR2 at 1066MHz; better to lower to 800MHz or increase the RAM voltage, which "may" also require you to increase the NB voltage.

OK, I lowered it to 800MHz on both systems. (These are those "Tall Heatsink" modules; even so, I don't like to increase the RAM voltage.) BTW, I never had any memory-related problem before with these modules. But it didn't help. There are two failed WUs since then. I will swap the GPUs... |
Retvari Zoltan
|
> Perhaps these tasks are stressing your card and finding a weakness with it, or your system.

I agree with that, but the KASHIF_HIVPR tasks are also the GPU-stressing kind, and they don't tend to fail. |
skgiven |
Let a few tasks run when you swap the cards. Drivers should not need updating, so it's not too much of a task. Report back any findings. Thanks, |
Retvari Zoltan
|
It seems to me now that I've solved the problem without swapping the GPUs. I looked for further differences between my two systems, and noticed that my failing GPU runs at 1.000V, while the other runs at 1.025V. So I raised the failing GPU's core voltage to 1.025V; since then it has completed 4 TONI_KIDln tasks. |
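Looking for differences between two nearly identical systems, as described above, can be done systematically by listing each machine's settings and comparing them key by key. This is just a sketch under the assumption that the settings are written down by hand as dictionaries; `config_diff` is a hypothetical helper, and the values are the ones reported in this thread.

```python
# Settings of the two PCs as reported in this thread (entered by hand).
failing = {
    "MB": "Asus P5Q pro",
    "CPU": "Core2 Quad 6600 @ 2.4GHz",
    "VGA": "Gigabyte GV-N480D5-15I-B",
    "PSU": "Chieftec A135-1000W",
    "OS": "WinXP SP3 x86",
    "Boinc MGR": "6.11.4",
    "NVidia drivers": "259.31",
    "swan_sync": "0",
    "GPU core voltage": "1.000V",
}
working = {
    "MB": "Asus P5Q Deluxe",
    "CPU": "Core2 Quad 9550 @ 2.83GHz",
    "VGA": "Asus ENGTX480",
    "PSU": "Gigabyte Superb 720W",
    "OS": "WinXP SP3 x86",
    "Boinc MGR": "6.11.4",
    "NVidia drivers": "259.12",
    "swan_sync": "0",
    "GPU core voltage": "1.025V",
}


def config_diff(a, b):
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


diff = config_diff(failing, working)
for key, (bad, good) in diff.items():
    print(f"{key}: failing={bad!r} working={good!r}")
```

Settings that match (OS, BOINC version, swan_sync) drop out of the result, leaving a short list of candidates to test one at a time, the GPU core voltage among them.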
skgiven |
Well spotted. There have been one or two of these issues before with the Fermi cards. I think the last one was a Gigabyte as well; perhaps they are cutting it a bit too fine. I'm guessing ASUS got the pick of the bunch to work with. My Asus ENGTX470 has a voltage of 0.975V, and crunches away quite happily at 704MHz. I paid a bit over the odds for it, but now I think it was definitely worth it. Good Luck, |
©2026 Universitat Pompeu Fabra