Work units failing in 64-bit Linux

Greg

Message 12638 - Posted: 22 Sep 2009, 23:38:44 UTC
Last modified: 22 Sep 2009, 23:41:30 UTC

Work units are failing on 64-bit Linux. The machine is a research unit that has proven itself to be very stable but is idle at the moment, so I thought I'd try to take advantage of the idle time.

See, for example,
http://www.gpugrid.net/result.php?resultid=1291252

The driver version is cudadriver_2.3_linux_64_190.16

Info about the setup:
CUDA Device Query (Runtime API) version (CUDART static linking)
There are 4 devices supporting CUDA

Device 0: "Tesla C1060"
  CUDA Driver Version:                           2.30
  CUDA Runtime Version:                          2.30
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         3
  Total amount of global memory:                 4294705152 bytes
  Number of multiprocessors:                     30
  Number of cores:                               240
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.30 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

Device 1: "Tesla C1060"
  CUDA Driver Version:                           2.30
  CUDA Runtime Version:                          2.30
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         3
  Total amount of global memory:                 4294705152 bytes
  Number of multiprocessors:                     30
  Number of cores:                               240
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.30 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

Device 2: "Tesla C1060"
  CUDA Driver Version:                           2.30
  CUDA Runtime Version:                          2.30
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         3
  Total amount of global memory:                 4294705152 bytes
  Number of multiprocessors:                     30
  Number of cores:                               240
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.30 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

Device 3: "Tesla C1060"
  CUDA Driver Version:                           2.30
  CUDA Runtime Version:                          2.30
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         3
  Total amount of global memory:                 4294705152 bytes
  Number of multiprocessors:                     30
  Number of cores:                               240
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.30 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

Test PASSED
GDF
Volunteer moderator
Project administrator
Project developer
Project tester
Volunteer developer
Volunteer tester
Project scientist

Message 12668 - Posted: 23 Sep 2009, 7:33:57 UTC - in response to Message 12638.  

Do the work units always fail, or just sometimes?

The type of error you are getting is usually caused by GPUs running too hot. Since your GPUs are Teslas, I am quite surprised.

gdf
Greg

Message 12696 - Posted: 24 Sep 2009, 0:55:04 UTC - in response to Message 12668.  
Last modified: 24 Sep 2009, 1:01:40 UTC

The first eight work units failed, so I suspended it at that point. It's definitely not overheating: the machine is housed in a 1U rack unit in a frigid, air-conditioned room. I wonder whether it has something to do with this line:

# Total amount of global memory: -262144 bytes
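
A side note on that number: 4294705152 bytes (the figure in the deviceQuery output above) is exactly 2^32 - 262144, so -262144 is what you get if the same value is read as a signed 32-bit integer. The sketch below just checks that arithmetic. The helper name as_signed_32 is made up for illustration, and the idea that the application prints the size through a 32-bit signed type is only an assumption; nothing in this thread confirms that it is related to the failures.

```python
def as_signed_32(value: int) -> int:
    """Reinterpret the low 32 bits of value as a two's-complement signed integer."""
    low = value & 0xFFFFFFFF  # keep only the low 32 bits
    # Values with the top bit set map to negative numbers.
    return low - (1 << 32) if low >= (1 << 31) else low


# Global memory size in bytes, as reported by deviceQuery for each Tesla C1060.
reported_by_device_query = 4294705152

print(as_signed_32(reported_by_device_query))  # -262144
```

If that reading is right, the negative figure would be a display issue rather than a sign that the card has lost its memory.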
Greg

Message 12697 - Posted: 24 Sep 2009, 1:24:13 UTC - in response to Message 12696.  

Found the problem. The Einstein@Home beta app was the culprit. I had left something around that GPUGrid didn't like. Unloading and reloading the kernel module fixed the problem.
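
For anyone hitting the same thing, here is a rough sketch of what "unloading and reloading the kernel module" can look like. It assumes the module is named nvidia and that modprobe is available; the function names are hypothetical, and the exact commands used on this machine are not stated in the thread. By default it only prints the commands. Actually running them needs root, and the unload will fail while anything (the BOINC client, an X server) still has the GPU open.

```python
import subprocess
from typing import List


def reload_module_commands(module: str = "nvidia") -> List[List[str]]:
    """Return the commands that unload and then reload a kernel module."""
    return [
        ["modprobe", "-r", module],  # unload; fails if the module is in use
        ["modprobe", module],        # load it again with a clean state
    ]


def reload_module(module: str = "nvidia", dry_run: bool = True) -> None:
    """Print the reload commands, or run them when dry_run is False."""
    for cmd in reload_module_commands(module):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises if a command fails


reload_module()  # dry run: prints the two modprobe commands
```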

