GPU problem
| Author | Message |
|---|---|
|
Joined: 12 Jul 07 Posts: 100 Credit: 21,848,502 RAC: 0
|
I'm getting a lot of errors like the ones below:

Sat 19 Jul 2008 23:15:13 BST|PS3GRID|Starting Gn30779-TEST12-0-5-acemd_0
Sat 19 Jul 2008 23:15:13 BST|PS3GRID|Starting task Gn30779-TEST12-0-5-acemd_0 using acemd version 625
Sat 19 Jul 2008 23:15:26 BST|PS3GRID|Computation for task Gn30779-TEST12-0-5-acemd_0 finished
Sat 19 Jul 2008 23:15:26 BST|PS3GRID|Output file Gn30779-TEST12-0-5-acemd_0_1 for task Gn30779-TEST12-0-5-acemd_0 absent
Sat 19 Jul 2008 23:15:26 BST|PS3GRID|Output file Gn30779-TEST12-0-5-acemd_0_2 for task Gn30779-TEST12-0-5-acemd_0 absent
Sat 19 Jul 2008 23:15:26 BST|PS3GRID|Output file Gn30779-TEST12-0-5-acemd_0_3 for task Gn30779-TEST12-0-5-acemd_0 absent

This usually happens when the 2nd WU of a download batch runs (or the 3rd or 4th); I don't think my rig has successfully processed a full batch of WUs. The 1st WU of a batch normally runs to completion, so I set my "connect every" to 0.1 with 0 cache to try to download only 1 WU at a time, but the above error came from a single-WU download (this one). A WU can run for anything from 13 seconds (as above) to 4 hours (this one) before failing.

Is anyone else getting errors like these? Any ideas why it's happening? Is anyone else using an 8800GS successfully?

Fedora 7, Q6600 (running 3xseti & 1xps3grid), Asus 8800GS running 173.14 drivers (also happened with 169.09) |
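Editorial note: for anyone wanting to count these failures in their own BOINC message log, here is a minimal Python sketch. It assumes the `timestamp|project|message` layout seen in the post above; the function and variable names are made up for illustration and are not part of BOINC.

```python
import re
from datetime import datetime

# Sample lines copied from the post above.
SAMPLE_LOG = """\
Sat 19 Jul 2008 23:15:13 BST|PS3GRID|Starting task Gn30779-TEST12-0-5-acemd_0 using acemd version 625
Sat 19 Jul 2008 23:15:26 BST|PS3GRID|Computation for task Gn30779-TEST12-0-5-acemd_0 finished
Sat 19 Jul 2008 23:15:26 BST|PS3GRID|Output file Gn30779-TEST12-0-5-acemd_0_1 for task Gn30779-TEST12-0-5-acemd_0 absent
Sat 19 Jul 2008 23:15:26 BST|PS3GRID|Output file Gn30779-TEST12-0-5-acemd_0_2 for task Gn30779-TEST12-0-5-acemd_0 absent
Sat 19 Jul 2008 23:15:26 BST|PS3GRID|Output file Gn30779-TEST12-0-5-acemd_0_3 for task Gn30779-TEST12-0-5-acemd_0 absent
"""

START = re.compile(r"(?:Starting|Restarting) task (\S+) using")
FINISH = re.compile(r"Computation for task (\S+) finished")
ABSENT = re.compile(r"Output file \S+ for task (\S+) absent")


def parse_time(stamp):
    # Drop the trailing zone name ("BST"); strptime's %Z handling varies by platform.
    return datetime.strptime(stamp.rsplit(" ", 1)[0], "%a %d %b %Y %H:%M:%S")


def summarize(log_text):
    """Return run time in seconds and number of absent output files per task."""
    tasks = {}
    for line in log_text.splitlines():
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue  # not a timestamp|project|message line
        stamp, _project, message = parts
        for key, pattern in (("start", START), ("finish", FINISH)):
            match = pattern.search(message)
            if match:
                tasks.setdefault(match.group(1), {"absent": 0})[key] = parse_time(stamp)
        match = ABSENT.search(message)
        if match:
            tasks.setdefault(match.group(1), {"absent": 0})["absent"] += 1

    report = {}
    for name, info in tasks.items():
        runtime = None
        if "start" in info and "finish" in info:
            runtime = (info["finish"] - info["start"]).total_seconds()
        report[name] = {"runtime_s": runtime, "absent_outputs": info["absent"]}
    return report
```

Calling `summarize(SAMPLE_LOG)` gives one entry for the task above, with a 13-second run time and three absent output files, which matches the "13 seconds" mentioned in the post.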
GDF Joined: 14 Mar 07 Posts: 1958 Credit: 629,356 RAC: 0 |
> I'm getting a lot of errors as below

Thanks for the accurate description. We knew there was a problem with WUs failing at start after a successful one, but this makes it much clearer. Keep in touch. We hope to fix it soon.

GDF |
UBT - NaRyan Joined: 16 Jul 08 Posts: 68 Credit: 1,242,980 RAC: 0
|
I have also had 1 workunit fail right at the start: Task ID 38220. By the looks of things, it's one of the ones Temujin had fail on him too (but after 924 seconds). Mine was moaning about "error while loading shared libraries: libcudart.so: cannot open shared object file: No such file or directory" |
|
Joined: 12 Jul 07 Posts: 100 Credit: 21,848,502 RAC: 0
|
While waiting for a fix, do you know of any tricks to reduce the number of failures? I've hit my 4 WU/day limit today with 4 failures :( I've tried restarting BOINC but that didn't work. Would restarting the machine help? Is there anything I can run to clean things up?

I don't mean to sound ungrateful or impatient, but any idea how long "soon" will be? Are we talking days or weeks?

How has the take-up of the GPU app been? Any idea how many GPU users you have? |
Stefan Ledwina Joined: 16 Jul 07 Posts: 464 Credit: 298,573,998 RAC: 0
|
I guess the fix will be implemented very soon... ;) I have also had issues with one of my cards (driver problems plus some failing tasks like the ones you described), and hit the max. of 4 WUs per day. Unfortunately there is nothing you can do to reset it...

pixelicious.at - my little photoblog |
GDF Joined: 14 Mar 07 Posts: 1958 Credit: 629,356 RAC: 0 |
> While waiting for a fix, do you know of any tricks to reduce the number of failures as I've hit my 4 WU/day limit today with 4 failures :(

Hi, we are looking into it. The problem is that we cannot replicate it here. It does happen to others, but much less frequently. At the moment your machine and Stefan's account for 90% of all errors; otherwise things are going well. It could be a driver problem for both. I hope that we can do something in days.

GDF |
|
Joined: 12 Jul 07 Posts: 100 Credit: 21,848,502 RAC: 0
|
> At the moment your machine and Stefan's are summing up 90% of all errors

Oops

> I hope that we can do something in days.

Many thanks |
UBT - NaRyan Joined: 16 Jul 08 Posts: 68 Credit: 1,242,980 RAC: 0
|
Just got the same error: Task ID: 38629

Mon 21 Jul 2008 17:32:53 BST|PS3GRID|Restarting task xD30815-TEST12-1-5-acemd_0 using acemd version 625
Mon 21 Jul 2008 17:32:54 BST|PS3GRID|Computation for task xD30815-TEST12-1-5-acemd_0 finished
Mon 21 Jul 2008 17:32:54 BST|PS3GRID|Output file xD30815-TEST12-1-5-acemd_0_0 for task xD30815-TEST12-1-5-acemd_0 absent
Mon 21 Jul 2008 17:32:54 BST|PS3GRID|Output file xD30815-TEST12-1-5-acemd_0_1 for task xD30815-TEST12-1-5-acemd_0 absent
Mon 21 Jul 2008 17:32:54 BST|PS3GRID|Output file xD30815-TEST12-1-5-acemd_0_2 for task xD30815-TEST12-1-5-acemd_0 absent
Mon 21 Jul 2008 17:32:54 BST|PS3GRID|Output file xD30815-TEST12-1-5-acemd_0_3 for task xD30815-TEST12-1-5-acemd_0 absent

Mind you, BOINC had been jumping about between workunits like a demented flea yesterday, because it thought it would not reach the workunit deadline. That workunit was listed as 0% and had just started after another one had just finished. I don't know if it's in any way related, but a workunit for the project I run alongside gpugrid had also just finished. |
Stefan Ledwina Joined: 16 Jul 07 Posts: 464 Credit: 298,573,998 RAC: 0
|
... Well, I'm not sure I can believe that it is only a driver problem... I run an 8800GT with Ubuntu 7.10 and 169.14 drivers, a 9800GTX with Fedora 9 and 173.14.09 drivers, and a GTX 260 with Ubuntu 8.04 and 177.13 drivers... Three different machines, three different OSes and three different driver versions, and all show the same errors from time to time...

The first is the error with WUs at start after a successful one:
"process exited with code 127 (0x7f, -129) acemd_6.25_x86_64-pc-linux-gnu__cuda: error while loading shared libraries: libcudart.so: cannot open shared object file: No such file or directory"
and the second error is:
"process exited with code 1 (0x1, -255)"

I know that the 177.13 driver for the GTX 260 is really crap, because PowerMizer does not work and it slows down the core clock speed of the card after the first successful WU, but the other two computers (drivers) too? I really hope you can find out what's going wrong. |
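Editorial note: the three numbers in "code 127 (0x7f, -129)" and "code 1 (0x1, -255)" look like the same exit status written three ways. The Python sketch below reproduces both strings under the assumption, inferred only from these two samples, that the second number is the status in hexadecimal and the third is the status minus 256. The function name is hypothetical. As far as I know, status 127 is also what the Linux dynamic loader returns when a required shared library cannot be loaded, which would fit the libcudart.so message.

```python
def describe_exit(code):
    # Assumption: hex form, then the status minus 256, as in the two quoted errors.
    return f"process exited with code {code} ({code:#x}, {code - 256})"


print(describe_exit(127))
print(describe_exit(1))
```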
|
Joined: 12 Jul 07 Posts: 100 Credit: 21,848,502 RAC: 0
|
> "process exited with code 127 (0x7f, -129)

I've only had this error once. In my BOINC directory I only have libcudart32.so and libcudart64.so, i.e. no libcudart.so, so would it be worth trying ln -s libcudart64.so libcudart.so ??

> and the second error "process exited with code 1 (0x1, -255)"

That's the one I get on all but 1 of my fails :( |
Stefan Ledwina Joined: 16 Jul 07 Posts: 464 Credit: 298,573,998 RAC: 0
|
> "process exited with code 127 (0x7f, -129)
>
> I've only had this error once. In my boinc directory I only have libcudart32.so and libcudart64.so, ie no libcudart.so so would it be worth trying

libcudart.so was downloaded from ps3grid/gpugrid with your first WU, and it should be in the BOINC/projects/www.ps3grid.net directory. |
|
Joined: 12 Jul 07 Posts: 100 Credit: 21,848,502 RAC: 0
|
> libcudart.so was downloaded from ps3grid/gpugrid with your first WU and it should be in the BOINC/projects/www.ps3grid.net directory.

Yep, you're right. I didn't think to look in there. |
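Editorial note: a quick way to check which libcudart files are present in the project directory mentioned above is sketched below in Python. The `projects/www.ps3grid.net` layout comes from the post; the function name and the path passed in are placeholders, since the location of the BOINC directory differs between installs.

```python
from pathlib import Path


def find_cudart(boinc_dir, project="www.ps3grid.net"):
    """Return the names of libcudart files found in the project directory."""
    project_dir = Path(boinc_dir) / "projects" / project
    if not project_dir.is_dir():
        return []  # wrong BOINC directory, or project not attached
    return sorted(p.name for p in project_dir.glob("libcudart*.so*"))
```

For example, `find_cudart("/path/to/BOINC")` (substitute your own directory) returns a list of file names, or an empty list if the directory or the library is missing.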
Stefan Ledwina Joined: 16 Jul 07 Posts: 464 Credit: 298,573,998 RAC: 0
|
Hmm, the only thing I can see is that we both use Quadcore CPUs... Gianni, could these problems be related to Quadcores, or is this only a coincidence? What are the CPU types of the other computers which throw out these errors? |
|
Joined: 12 Jul 07 Posts: 100 Credit: 21,848,502 RAC: 0
|
> What are the CPU types of the other computers which throw out these errors?

You may be on to something. I know of the following GPU users:

UBT - NaRyan
Q6600, 5 good, 2 errors, 1 abort
Athlon 6000+, 11 good

sneakysaurus
Q6600, 6 good, 5 errors

JG4KEZ(Koichi Soraku)
Xeon X3360, 7 good, 1 error |
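Editorial note: the figures in the list above can be turned into rough error rates with the Python sketch below. Two things are assumptions rather than facts from the thread: that the Athlon 6000+ entry is a second host of UBT - NaRyan, and the hand-written set of which CPUs count as quad-core. Aborts are left out of the rate. With one non-quad host and a few dozen tasks in total, the result is only a hint.

```python
# (user, CPU, good, errors, aborts) as listed in the post above.
# Assumption: the Athlon 6000+ entry is a second host of UBT - NaRyan.
HOSTS = [
    ("UBT - NaRyan", "Q6600", 5, 2, 1),
    ("UBT - NaRyan", "Athlon 6000+", 11, 0, 0),
    ("sneakysaurus", "Q6600", 6, 5, 0),
    ("JG4KEZ(Koichi Soraku)", "Xeon X3360", 7, 1, 0),
]

# Assumption: quad-core classification supplied by hand.
QUAD_CORE = {"Q6600", "Xeon X3360"}


def error_rates(hosts, quad_core):
    """Return errors / (good + errors) for quad-core hosts and for all others."""
    totals = {"quad": [0, 0], "other": [0, 0]}  # [errors, good + errors]
    for _user, cpu, good, errors, _aborts in hosts:
        group = "quad" if cpu in quad_core else "other"
        totals[group][0] += errors
        totals[group][1] += good + errors
    return {g: (e / n if n else 0.0) for g, (e, n) in totals.items()}


print(error_rates(HOSTS, QUAD_CORE))
```

On these numbers the quad-core hosts fail 8 of 26 finished tasks and the single other host fails none of 11.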
Stefan Ledwina Joined: 16 Jul 07 Posts: 464 Credit: 298,573,998 RAC: 0
|
It looks obvious that there's something wrong with Quads, but who knows... Let's see what G says. |
GDF Joined: 14 Mar 07 Posts: 1958 Credit: 629,356 RAC: 0 |
> What are the CPU types of the other computers which throw out these errors?
>
> You may be on to something.

The vast majority of errors are at start-up. We have submitted a series of very fast WUs to check it now. If you go over quota for the day, let me have your hostid.

GDF |
Stefan Ledwina Joined: 16 Jul 07 Posts: 464 Credit: 298,573,998 RAC: 0
|
Ok, but my queue is pretty full with ps3grid WUs, I doubt I'll get new WUs until tomorrow. But I'll try to stop running tasks, maybe I can get some new ones... |
|
Joined: 12 Jul 07 Posts: 100 Credit: 21,848,502 RAC: 0
|
> If you go over quota for the day, let me have your hostid.

PM sent |
GDF Joined: 14 Mar 07 Posts: 1958 Credit: 629,356 RAC: 0 |
Hi, your card seems to be overclocked, which makes it unstable and causes the errors! Is that right?

GDF |
|
Joined: 12 Jul 07 Posts: 100 Credit: 21,848,502 RAC: 0
|
who, me? not as far as I know, I've certainly not tweaked anything. nvclock gives the following

-- General info --
Card: Unknown Nvidia card
Architecture: G92 A2
PCI id: 0x606
GPU clock: 601.712 MHz
Bustype: PCI-Express

-- Shader info --
Clock: 1674.000 MHz
Stream units: 96 (1b)
ROP units: 12 (1b)

-- Memory info --
Amount: 384 MB
Type: 128 bit DDR3
Clock: 899.996 MHz

-- PCI-Express info --
Current Rate: 16X
Maximum rate: 16X

-- Sensor info --
Sensor: GPU Internal Sensor
GPU temperature: 18C

-- VideoBios information --
Version: 62.92.29.00.00
Signon message: ASUS EN8800GS TOP VGA BIOS Ver 62.92.29.00.AS13
Performance level 0: gpu 600MHz/shader 1700MHz/memory 900MHz/0.00V/100%
VID mask: 3
Voltage level 0: 0.95V, VID: 0
Voltage level 1: 1.00V, VID: 1
Voltage level 2: 1.05V, VID: 2
Voltage level 3: 1.10V, VID: 3 |
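Editorial note: one way to read the nvclock output above is to compare the clocks the card reports with "Performance level 0" from its own video BIOS. The Python sketch below does that with the numbers copied from the post; the names are illustrative only. It can show whether someone has pushed the card beyond its BIOS settings, but it cannot show whether the BIOS level itself is above NVIDIA's reference clocks, which are not given in this thread.

```python
# Clocks in MHz, copied from the nvclock output above.
REPORTED = {"gpu": 601.712, "shader": 1674.000, "memory": 899.996}
BIOS_LEVEL_0 = {"gpu": 600.0, "shader": 1700.0, "memory": 900.0}


def deviation_percent(reported, nominal):
    """Return how far each reported clock is from the BIOS value, in percent."""
    return {
        name: 100.0 * (reported[name] - nominal[name]) / nominal[name]
        for name in nominal
    }


for name, pct in deviation_percent(REPORTED, BIOS_LEVEL_0).items():
    print(f"{name}: {pct:+.2f}%")
```

All three reported clocks are within 2% of the BIOS values, so on this evidence the card is running at what its BIOS specifies rather than at a user-applied overclock.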
©2025 Universitat Pompeu Fabra