GPU problem

Temujin

Message 1279 - Posted: 19 Jul 2008, 22:46:08 UTC

I'm getting a lot of errors like the ones below:

Sat 19 Jul 2008 23:15:13 BST|PS3GRID|Starting Gn30779-TEST12-0-5-acemd_0
Sat 19 Jul 2008 23:15:13 BST|PS3GRID|Starting task Gn30779-TEST12-0-5-acemd_0 using acemd version 625
Sat 19 Jul 2008 23:15:26 BST|PS3GRID|Computation for task Gn30779-TEST12-0-5-acemd_0 finished
Sat 19 Jul 2008 23:15:26 BST|PS3GRID|Output file Gn30779-TEST12-0-5-acemd_0_1 for task Gn30779-TEST12-0-5-acemd_0 absent
Sat 19 Jul 2008 23:15:26 BST|PS3GRID|Output file Gn30779-TEST12-0-5-acemd_0_2 for task Gn30779-TEST12-0-5-acemd_0 absent
Sat 19 Jul 2008 23:15:26 BST|PS3GRID|Output file Gn30779-TEST12-0-5-acemd_0_3 for task Gn30779-TEST12-0-5-acemd_0 absent


This usually happens when the 2nd WU of a download batch runs (or the 3rd or 4th); I don't think my rig has ever successfully processed a full batch of WUs.
The 1st WU of a batch normally runs to completion, so I set my "connect every" to 0.1 with 0 cache to try to download only 1 WU at a time, but the error above came from a single-WU download.
A WU can run for anything from 13 seconds (as above) to 4 hours before failing.
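
For anyone who wants to pull the same figures out of their own client log, here is a minimal sketch. It assumes the log format is exactly the pipe-separated one quoted above; the function names (`parse_time`, `summarize`) are made up for illustration and are not part of BOINC.

```python
import re
from datetime import datetime

# The six client log lines quoted in the post above.
LOG = [
    "Sat 19 Jul 2008 23:15:13 BST|PS3GRID|Starting Gn30779-TEST12-0-5-acemd_0",
    "Sat 19 Jul 2008 23:15:13 BST|PS3GRID|Starting task Gn30779-TEST12-0-5-acemd_0 using acemd version 625",
    "Sat 19 Jul 2008 23:15:26 BST|PS3GRID|Computation for task Gn30779-TEST12-0-5-acemd_0 finished",
    "Sat 19 Jul 2008 23:15:26 BST|PS3GRID|Output file Gn30779-TEST12-0-5-acemd_0_1 for task Gn30779-TEST12-0-5-acemd_0 absent",
    "Sat 19 Jul 2008 23:15:26 BST|PS3GRID|Output file Gn30779-TEST12-0-5-acemd_0_2 for task Gn30779-TEST12-0-5-acemd_0 absent",
    "Sat 19 Jul 2008 23:15:26 BST|PS3GRID|Output file Gn30779-TEST12-0-5-acemd_0_3 for task Gn30779-TEST12-0-5-acemd_0 absent",
]

STARTED = re.compile(r"^(?:Starting|Restarting) task (\S+) using ")
FINISHED = re.compile(r"^Computation for task (\S+) finished$")
ABSENT = re.compile(r"^Output file \S+ for task (\S+) absent$")


def parse_time(stamp):
    # Drop the trailing zone label ("BST"); only the difference between
    # two stamps from the same log matters here.
    return datetime.strptime(stamp.rsplit(" ", 1)[0], "%a %d %b %Y %H:%M:%S")


def summarize(lines):
    """Map task name -> start time, finish time and count of absent output files."""
    tasks = {}
    for line in lines:
        stamp, _project, message = line.split("|", 2)
        for key, pattern in (("start", STARTED), ("finish", FINISHED), ("absent", ABSENT)):
            match = pattern.match(message)
            if not match:
                continue
            info = tasks.setdefault(match.group(1), {"start": None, "finish": None, "absent": 0})
            if key == "absent":
                info["absent"] += 1
            else:
                info[key] = parse_time(stamp)
    return tasks


for name, info in summarize(LOG).items():
    runtime = (info["finish"] - info["start"]).total_seconds()
    print(f"{name}: ran {runtime:.0f} s, {info['absent']} output files absent")
```

On the lines above this reports a 13-second run with three absent output files, matching the description.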

Is anyone else getting errors like these?
Any ideas why it's happening?
Anyone else using an 8800GS successfully?


Fedora 7, Q6600 (running 3xseti & 1xps3grid), Asus 8800GS running 173.14 drivers (also happened with 169.09)

GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist
Message 1282 - Posted: 20 Jul 2008, 8:26:11 UTC - in response to Message 1279.  

Thanks for the accurate description. We knew there was a problem with WUs failing at start-up right after a successful one, but this makes it much clearer.

Keep in touch. We hope to fix it soon.

GDF

UBT - NaRyan
Message 1283 - Posted: 20 Jul 2008, 9:29:24 UTC

I have also had 1 workunit fail right at the start.
Task ID 38220, and by the looks of things it's one of the ones Temujin had fail on him too (but after 924 seconds).

Mine was moaning about "error while loading shared libraries: libcudart.so: cannot open shared object file: No such file or directory"

Temujin
Message 1296 - Posted: 21 Jul 2008, 9:21:56 UTC - in response to Message 1282.  

While waiting for a fix, do you know of any tricks to reduce the number of failures? I've hit my 4 WU/day limit today with 4 failures :(
I've tried restarting BOINC but that didn't work; would restarting the machine help?
Is there anything I can run to clean things up?

I don't mean to sound ungrateful or impatient, but any idea how long "soon" will be?
Are we talking days or weeks?

How has the take-up of the GPU app been?
Any idea how many GPU users you have?

Stefan Ledwina
Message 1297 - Posted: 21 Jul 2008, 9:40:20 UTC

I guess the fix will be implemented very soon... ;)

I have also had issues with one of my cards (driver problems plus some failing tasks like the ones you described), and hit the max. of 4 WUs per day. Unfortunately there is nothing you can do to reset it...



pixelicious.at - my little photoblog

GDF
Message 1299 - Posted: 21 Jul 2008, 15:19:26 UTC - in response to Message 1296.  
Last modified: 21 Jul 2008, 15:19:46 UTC

Hi,
we are looking into it. The problem is that we cannot replicate it here. It does happen to others, but much less frequently. At the moment your machine and Stefan's account for 90% of all errors; otherwise things are going well. It could be a driver problem for both. I hope that we can do something in days.



GDF

Temujin
Message 1300 - Posted: 21 Jul 2008, 16:13:57 UTC - in response to Message 1299.  

"At the moment your machine and Stefan's account for 90% of all errors"
Oops

"I hope that we can do something in days."
Many thanks

UBT - NaRyan
Message 1302 - Posted: 21 Jul 2008, 16:46:26 UTC

Just got the same error: Task ID: 38629

Mon 21 Jul 2008 17:32:53 BST|PS3GRID|Restarting task xD30815-TEST12-1-5-acemd_0 using acemd version 625
Mon 21 Jul 2008 17:32:54 BST|PS3GRID|Computation for task xD30815-TEST12-1-5-acemd_0 finished
Mon 21 Jul 2008 17:32:54 BST|PS3GRID|Output file xD30815-TEST12-1-5-acemd_0_0 for task xD30815-TEST12-1-5-acemd_0 absent
Mon 21 Jul 2008 17:32:54 BST|PS3GRID|Output file xD30815-TEST12-1-5-acemd_0_1 for task xD30815-TEST12-1-5-acemd_0 absent
Mon 21 Jul 2008 17:32:54 BST|PS3GRID|Output file xD30815-TEST12-1-5-acemd_0_2 for task xD30815-TEST12-1-5-acemd_0 absent
Mon 21 Jul 2008 17:32:54 BST|PS3GRID|Output file xD30815-TEST12-1-5-acemd_0_3 for task xD30815-TEST12-1-5-acemd_0 absent

Mind you, BOINC had been jumping about the workunits like a demented flea yesterday, because it thought it would not reach the workunit deadline.
That workunit was listed at 0% and had just started after another one had just finished.

I don't know if it's in any way related, but a workunit for the project I run alongside GPUGRID had also just finished.

Stefan Ledwina
Message 1303 - Posted: 21 Jul 2008, 16:59:49 UTC - in response to Message 1299.  

"... It could be a driver problem for both. I hope that we can do something in days." (GDF)


Well, I'm not sure I can believe that it is only a driver problem...

I run an 8800GT with Ubuntu 7.10 and 169.14 drivers,
a 9800GTX with Fedora 9 and 173.14.09 drivers
and a GTX 260 with Ubuntu 8.04 and 177.13 drivers...

Three different machines, three different OSes and three different driver versions, and all show the same errors from time to time...

The first is the error on WUs starting right after a successful one, with the message

"process exited with code 127 (0x7f, -129)
acemd_6.25_x86_64-pc-linux-gnu__cuda: error while loading shared libraries: libcudart.so: cannot open shared object file: No such file or directory"

and the second error

"process exited with code 1 (0x1, -255)"

I know that the 177.13 driver for the GTX 260 is really crap, because PowerMizer does not work and it slows down the core clock speed of the card after the first successful WU, but would the other two computers (and drivers) have the same problem too?

I really hope you can find out what's going wrong.
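
A side note on reading the two messages above. In both of them the second number in parentheses equals the exit code minus 256; that is only an observation from these two messages, not documented BOINC behaviour. Exit status 127 is what the Linux dynamic loader conventionally returns when a required shared library cannot be loaded, which fits the libcudart.so text, while exit code 1 is a generic failure and says little on its own. The helper name `describe_exit` below is made up for illustration.

```python
def describe_exit(code):
    """Return (hex form, code - 256) for an 8-bit process exit code."""
    if not 0 <= code <= 255:
        raise ValueError("exit codes are expected to fit in one byte")
    return hex(code), code - 256


for code in (127, 1):
    hex_form, shifted = describe_exit(code)
    print(f"process exited with code {code} ({hex_form}, {shifted})")
```

This reproduces the "(0x7f, -129)" and "(0x1, -255)" pairs seen in the task output.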

Temujin
Message 1304 - Posted: 21 Jul 2008, 17:19:35 UTC - in response to Message 1303.  

"process exited with code 127 (0x7f, -129)
acemd_6.25_x86_64-pc-linux-gnu__cuda: error while loading shared libraries: libcudart.so: cannot open shared object file: No such file or directory"
I've only had this error once. In my BOINC directory I only have libcudart32.so and libcudart64.so, i.e. no libcudart.so, so would it be worth trying
ln -s libcudart64.so libcudart.so
??

and the second error "process exited with code 1 (0x1, -255)"
That's the one I get on all but 1 of my fails :(

Stefan Ledwina
Message 1305 - Posted: 21 Jul 2008, 17:50:08 UTC - in response to Message 1304.  

libcudart.so was downloaded from ps3grid/gpugrid with your first WU and it should be in the BOINC/projects/www.ps3grid.net directory.
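
A quick way to check where the CUDA runtime files actually sit is to search the whole BOINC tree, along the lines of the sketch below. The file names and the projects/www.ps3grid.net sub-directory come from this thread; the function name `find_cuda_runtime` and the throwaway directory built in `demo` are assumptions for illustration, so point the function at your own BOINC directory instead.

```python
import tempfile
from pathlib import Path


def find_cuda_runtime(boinc_dir, pattern="libcudart*.so*"):
    """Return every file under boinc_dir whose name matches pattern, sorted."""
    return sorted(p for p in Path(boinc_dir).rglob(pattern) if p.is_file())


def demo():
    # Build a throwaway tree that mimics the layout described in the thread.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        project = root / "projects" / "www.ps3grid.net"
        project.mkdir(parents=True)
        for path in (root / "libcudart32.so", root / "libcudart64.so", project / "libcudart.so"):
            path.touch()
        return [p.relative_to(root).as_posix() for p in find_cuda_runtime(root)]


print(demo())
```

If libcudart.so shows up under the project directory, the file is present and the loader error points at something other than a missing download.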

Temujin
Message 1306 - Posted: 21 Jul 2008, 17:55:55 UTC - in response to Message 1305.  

"libcudart.so was downloaded from ps3grid/gpugrid with your first WU and it should be in the BOINC/projects/www.ps3grid.net directory."

Yep, you're right
I didn't think to look in there

Stefan Ledwina
Message 1307 - Posted: 21 Jul 2008, 18:09:36 UTC
Last modified: 21 Jul 2008, 18:11:02 UTC

Hmm, the only thing in common I can see is that we both use quad-core CPUs...
Gianni, could these problems be related to quad-cores, or is this only a coincidence?
What are the CPU types of the other computers which throw out these errors?

Temujin
Message 1308 - Posted: 21 Jul 2008, 18:41:27 UTC - in response to Message 1307.  
Last modified: 21 Jul 2008, 18:59:38 UTC

"What are the CPU types of the other computers which throw out these errors?"
You may be on to something.
I know of the following GPU users:

UBT - NaRyan
Q6600, 5 good, 2 errors, 1 abort
Athlon 6000+, 11 good

sneakysaurus
Q6600, 6 good, 5 errors

JG4KEZ(Koichi Soraku)
Xeon X3360, 7 good, 1 error
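
To put rough numbers on that list, the tallies can be turned into per-CPU error rates as in the sketch below. The counts are the ones given above; leaving the single aborted task out of the totals is an assumption, and with this few hosts the result is a hint at most, not evidence that quad-cores are the cause.

```python
# (user, CPU, good results, errors) as listed above; the one abort is left out.
HOSTS = [
    ("UBT - NaRyan", "Q6600", 5, 2),
    ("UBT - NaRyan", "Athlon 6000+", 11, 0),
    ("sneakysaurus", "Q6600", 6, 5),
    ("JG4KEZ(Koichi Soraku)", "Xeon X3360", 7, 1),
]


def error_rate(good, errors):
    """Fraction of finished tasks that ended in error."""
    total = good + errors
    return errors / total if total else 0.0


def rates_by_cpu(hosts):
    """Pool the counts per CPU model and return CPU -> error rate."""
    totals = {}
    for _user, cpu, good, errors in hosts:
        pooled_good, pooled_errors = totals.get(cpu, (0, 0))
        totals[cpu] = (pooled_good + good, pooled_errors + errors)
    return {cpu: error_rate(g, e) for cpu, (g, e) in totals.items()}


for cpu, rate in rates_by_cpu(HOSTS).items():
    print(f"{cpu}: {rate:.0%} of tasks failed")
```

The Q6600 and the Xeon X3360 are quad-core parts and the Athlon 6000+ is dual-core, so the pooled rates line up with the suspicion, but only four hosts are involved.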

Stefan Ledwina
Message 1310 - Posted: 21 Jul 2008, 19:00:44 UTC

It looks obvious that there's something wrong with quads, but who knows...
Let's see what G says.



GDF
Message 1311 - Posted: 21 Jul 2008, 19:02:44 UTC - in response to Message 1308.  
Last modified: 21 Jul 2008, 19:03:17 UTC

The vast majority of errors are at start-up. We have submitted a series of very fast WUs to check it now.
If you go over quota for the day, let me have your hostid.


GDF

Stefan Ledwina
Message 1312 - Posted: 21 Jul 2008, 19:07:28 UTC

Ok, but my queue is pretty full with ps3grid WUs, I doubt I'll get new WUs until tomorrow.
But I'll try to stop running tasks, maybe I can get some new ones...

Temujin
Message 1313 - Posted: 21 Jul 2008, 19:08:54 UTC - in response to Message 1311.  

"If you go over quota for the day, let me have your hostid."
PM sent

GDF
Message 1314 - Posted: 21 Jul 2008, 19:18:36 UTC - in response to Message 1308.  




Hi,

your card seems to be overclocked, which would make it unstable and cause the errors!
Is that right?

GDF

Temujin
Message 1315 - Posted: 21 Jul 2008, 19:21:45 UTC - in response to Message 1314.  




who, me?

not as far as I know, I've certainly not tweaked anything.

nvclock gives the following
-- General info --
Card: Unknown Nvidia card
Architecture: G92 A2
PCI id: 0x606
GPU clock: 601.712 MHz
Bustype: PCI-Express

-- Shader info --
Clock: 1674.000 MHz
Stream units: 96 (1b)
ROP units: 12 (1b)
-- Memory info --
Amount: 384 MB
Type: 128 bit DDR3
Clock: 899.996 MHz

-- PCI-Express info --
Current Rate: 16X
Maximum rate: 16X

-- Sensor info --
Sensor: GPU Internal Sensor
GPU temperature: 18C

-- VideoBios information --
Version: 62.92.29.00.00
Signon message: ASUS EN8800GS TOP VGA BIOS Ver 62.92.29.00.AS13
Performance level 0: gpu 600MHz/shader 1700MHz/memory 900MHz/0.00V/100%
VID mask: 3
Voltage level 0: 0.95V, VID: 0
Voltage level 1: 1.00V, VID: 1
Voltage level 2: 1.05V, VID: 2
Voltage level 3: 1.10V, VID: 3
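
One way to read this output is to compare the running clocks with the "Performance level 0" line from the video BIOS, as sketched below. The numbers are copied from the nvclock output above; the 2% tolerance and the names `parse_bios_level`, `deviation_percent` and `exceeds_bios` are assumptions for illustration. The check only says whether the card runs above what its own BIOS specifies, not whether the BIOS values themselves are above NVIDIA's reference clocks.

```python
import re

# Running clocks in MHz, from the info sections of the nvclock output above.
REPORTED = {"gpu": 601.712, "shader": 1674.000, "memory": 899.996}
BIOS_LEVEL0 = "Performance level 0: gpu 600MHz/shader 1700MHz/memory 900MHz/0.00V/100%"


def parse_bios_level(line):
    """Extract the gpu/shader/memory clocks (MHz) from a performance-level line."""
    pairs = re.findall(r"(gpu|shader|memory) (\d+(?:\.\d+)?)MHz", line)
    return {name: float(mhz) for name, mhz in pairs}


def deviation_percent(reported, bios):
    """Signed difference of each running clock from its BIOS value, in percent."""
    return {name: 100.0 * (reported[name] - bios[name]) / bios[name] for name in bios}


def exceeds_bios(reported, bios, tolerance_percent=2.0):
    """Names of clock domains running more than tolerance_percent above BIOS."""
    deviations = deviation_percent(reported, bios)
    return [name for name, dev in deviations.items() if dev > tolerance_percent]


bios = parse_bios_level(BIOS_LEVEL0)
for name, dev in deviation_percent(REPORTED, bios).items():
    print(f"{name}: {dev:+.2f}% relative to BIOS level 0")
print("above BIOS clocks:", exceeds_bios(REPORTED, bios))
```

With these figures every clock is within about 1.5% of the BIOS value and none is above the tolerance, which is consistent with nothing having been tweaked by the user.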
