New batch KKi4

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 18862 - Posted: 7 Oct 2010, 17:21:36 UTC
Last modified: 7 Oct 2010, 17:21:56 UTC

Dear all, this is the continuation of an experiment we'd like to publish soon. The WUs are twice as large as those in the old "CAPBIND*" series.
ftpd

Joined: 6 Jun 08
Posts: 152
Credit: 328,250,382
RAC: 0
Message 18868 - Posted: 8 Oct 2010, 11:01:51 UTC - in response to Message 18862.  

Dear Toni,

I have already downloaded and processed a few of these WUs.
A few others were cancelled within 1 minute.

Is this already known?

Good luck and have a good weekend,


Ton (ftpd) Netherlands
Toni
Message 18870 - Posted: 8 Oct 2010, 12:58:33 UTC - in response to Message 18868.  

There should be nothing new with these WUs (except their length). By "cancelled", do you mean that they failed?
Richard Haselgrove

Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 351
Message 18871 - Posted: 8 Oct 2010, 13:18:16 UTC - in response to Message 18870.  
Last modified: 8 Oct 2010, 13:20:21 UTC

I had one this morning which has failed on two different machines so far: http://www.gpugrid.net/workunit.php?wuid=1966290

(Edit: but I've had one successful run, on the same machine, and another is currently at about 60%)
Saenger

Joined: 20 Jul 08
Posts: 134
Credit: 23,657,183
RAC: 0
Message 18872 - Posted: 8 Oct 2010, 13:29:53 UTC - in response to Message 18827.  

And a TONI_KK WU is broken as well.
The stderr is this (on my Linux host):
stderr out	

<core_client_version>6.10.17</core_client_version>
<![CDATA[
<message>
process exited with code 98 (0x62, -158)
</message>
<stderr_txt>
# There is 1 device supporting CUDA
# Device 0: "GeForce GT 240"
# Clock rate: 1.34 GHz
# Total amount of global memory:                 536150016 bytes
# Number of multiprocessors:                     12
# Number of cores:                               96
MDIO ERROR: read error for file "input.coor", byte number 0: expected to read number of atoms
ERROR: file mdioload.cpp line 80: Unable to read bincoordfile
 
11:16:36 (3686): called boinc_finish

</stderr_txt>
]]>


and this (from the other host, on Windows):
stderr out	

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
 - exit code 98 (0x62)
</message>
<stderr_txt>
# Using device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 260"
# Clock rate: 1.35 GHz
# Total amount of global memory:                 919994368 bytes
# Number of multiprocessors:                     27
# Number of cores:                               216
MDIO ERROR: read error for file "input.coor", byte number 0: expected to read number of atoms
ERROR: file mdioload.cpp line 80: Unable to read bincoordfile
 
called boinc_finish

</stderr_txt>
]]>

Greetings from Saenger

For questions about Boinc look in the BOINC-Wiki
ftpd
Message 18873 - Posted: 8 Oct 2010, 13:57:09 UTC - in response to Message 18870.  

Dear Toni,

They failed within 1 minute (after 10-15 seconds of processing).


Ton (ftpd) Netherlands
skgiven
Volunteer moderator
Volunteer tester

Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 18874 - Posted: 8 Oct 2010, 13:59:49 UTC - in response to Message 18871.  

I have finished 4 and my systems are running at least one. Reasonable performance compared to the other tasks. However I also got one immediate failure:

3105670 1965540 8 Oct 2010 4:12:47 UTC 8 Oct 2010 8:55:18 UTC Error while computing 2.48 0.95 0.00 --- ACEMD2: GPU molecular dynamics v6.11 (cuda31)

Name f178r2-TONI_KKi4-0-200-RND1238_2
Workunit 1965540
Created 8 Oct 2010 3:32:44 UTC
Sent 8 Oct 2010 4:12:47 UTC
Received 8 Oct 2010 8:55:18 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 98 (0x62)
Computer ID 71363
Report deadline 13 Oct 2010 4:12:47 UTC
Run time 2.484375
CPU time 0.953125
stderr out

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using device 0
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 470"
# Clock rate: 1.43 GHz
# Total amount of global memory: 1341849600 bytes
# Number of multiprocessors: 14
# Number of cores: 112
# Device 1: "GeForce GTX 470"
# Clock rate: 1.43 GHz
# Total amount of global memory: 1341718528 bytes
# Number of multiprocessors: 14
# Number of cores: 112
SWAN: Using synchronization method 0
MDIO ERROR: read error for file "input.coor", byte number 0: expected to read number of atoms
ERROR: file mdioload.cpp line 80: Unable to read bincoordfile

called boinc_finish

</stderr_txt>
]]>

Validate state Invalid
Claimed credit 0.00480620718015305
Granted credit 0
application version ACEMD2: GPU molecular dynamics v6.11 (cuda31)
Toni
Message 18877 - Posted: 8 Oct 2010, 18:53:23 UTC - in response to Message 18874.  

Hi, for those getting "byte number 0: expected to read number of atoms": it must have been a glitch in mass WU creation, so just let them die. Richard, I think your other failure was on a mobile card.
skgiven
Message 18878 - Posted: 8 Oct 2010, 19:08:27 UTC - in response to Message 18877.  

Thank you to everyone who reported this problem, and thank you Toni for letting us know it is just a WU creation glitch.

As these errors are immediate, they will have almost no impact on people's RAC. To date I have only had one such error; most KKi4 WUs run well.
Richard Haselgrove
Message 18890 - Posted: 9 Oct 2010, 17:49:25 UTC

One of my 9800GTs had a go at h230r2-TONI_KKi4-0-200-RND9586, but unfortunately crashed with an assertion failure at the bitter end, after more than 24 hours of work. C'est la vie.
skgiven
Message 18893 - Posted: 9 Oct 2010, 20:08:52 UTC - in response to Message 18890.  

Richard, you took that blow well.

Toni, perhaps Fermi-only long tasks would go down better; a failure after a few hours is no big deal but after a day it really bites, and not everyone is so understanding.

I've now had 5 failures, but all within 10 sec. 16 other KKi4 tasks ran well.
Beyond

Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 18909 - Posted: 10 Oct 2010, 12:58:34 UTC - in response to Message 18893.  

Toni, perhaps Fermi-only long tasks would go down better; a failure after a few hours is no big deal but after a day it really bites, and not everyone is so understanding.

I've now had 5 failures, but all under 10sec. 16 other KKi4 tasks ran well.

My GTX 260/216 runs the TONI_KKi4 WUs well; in fact, it runs everything well. The problem is with my three GT 240 cards: they won't run the TONI_KKi4 WUs, and they don't like the TONI_HERGMETAXDOFE WUs either. They do run KASHIF_HIVPR, TONI_CAPBIND and IBUCH very well, though.
skgiven
Message 18910 - Posted: 10 Oct 2010, 13:50:32 UTC - in response to Message 18909.  

I have had 4 finish on a GT 240, and just one that failed after 2.46 sec. Vista x64, all 512 MB GDDR5 cards.
Fred J. Verster

Joined: 1 Apr 09
Posts: 58
Credit: 35,833,978
RAC: 0
Message 18914 - Posted: 10 Oct 2010, 19:42:42 UTC - in response to Message 18910.  

Probably one faulty WU, as all hosts have failed on this one?

That's the only faulty WU (or result?) I've seen so far.
It's meant to stay that way :)




Knight Who Says Ni N!
Fred J. Verster
Message 18918 - Posted: 12 Oct 2010, 15:34:16 UTC - in response to Message 18914.  

Found 2 WUs, computed by 4 hosts, which all failed; 2 still have to report.

h176r1-TONI_KKi4-0-200-RND5770 is giving problems as well.

The faults I've seen so far all come from the x999y1-TONI_KKi4-0-200-RND5770 batch.

Many others must have noticed this, judging by the number of invalid results.
Application versions dynamics v6.05 (cuda), dynamics v6.11 (cuda31) and dynamics v6.06 (cuda30) are involved, all with "process exited with code 1 (0x1, -255)".
All cards are involved: NVIDIA 240, 250, 470 and 480.
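
Editorial note: the exit-status lines quoted in this thread show one code three ways, e.g. "code 98 (0x62, -158)" and "code 1 (0x1, -255)". In both cases the third number equals the code minus 256. The Python sketch below reproduces that pattern, on the assumption (not confirmed anywhere in this thread) that it holds for any code from 0 to 255. The name `describe_exit_code` is hypothetical and is not part of BOINC.

```python
def describe_exit_code(code: int) -> str:
    """Format an exit code the way the quoted messages show it.

    Assumption: the third number is the code with all bits above the
    low byte set, which for 0..255 is the same as code - 256.
    """
    if not 0 <= code <= 255:
        raise ValueError("expected an exit code between 0 and 255")
    signed = (-1 << 8) | code  # e.g. 98 -> -158, 1 -> -255
    return f"process exited with code {code} (0x{code:x}, {signed})"


if __name__ == "__main__":
    # The two codes that appear in this thread.
    print(describe_exit_code(98))
    print(describe_exit_code(1))
```

So the three numbers in each message are one exit status, not three different errors.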




Knight Who Says Ni N!
Richard Haselgrove
Message 18925 - Posted: 13 Oct 2010, 14:01:36 UTC - in response to Message 18890.  

One of my 9800GTs had a go at h230r2-TONI_KKi4-0-200-RND9586, but unfortunately crashed with an assertion failure at the bitter end, after more than 24 hours of work. C'est la vie.

This 9800GT host really doesn't like KKi4: it has now failed g105r2-TONI_KKi4-6-200-RND6062 with the same

SWAN : FATAL : Failure executing kernel sync [frc_sum_kernel] [700]
Assertion failed: 0, file swanlib_nv.cpp, line 121

error message. At least it only wasted 22 Ksec this time.
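
Editorial note: for anyone sorting through many failed results, the two failure modes discussed in this thread can be told apart by their stderr text. A minimal Python sketch, assuming the message strings appear verbatim as quoted in the posts above. The labels and the `classify_stderr` function are hypothetical and are not part of BOINC or ACEMD.

```python
# Substrings copied from the stderr excerpts in this thread.
# The labels on the left are made up for illustration.
SIGNATURES = [
    ("bad_input_file", 'MDIO ERROR: read error for file "input.coor", byte number 0'),
    ("kernel_failure", "SWAN : FATAL : Failure executing kernel sync"),
]


def classify_stderr(text: str) -> str:
    """Return the label of the first known signature found in the text."""
    for label, needle in SIGNATURES:
        if needle in text:
            return label
    return "unknown"


if __name__ == "__main__":
    sample = (
        "SWAN : FATAL : Failure executing kernel sync [frc_sum_kernel] [700]\n"
        "Assertion failed: 0, file swanlib_nv.cpp, line 121\n"
    )
    print(classify_stderr(sample))
```

The first label matches the immediate failures attributed above to a WU creation glitch; the second matches the late crashes reported on the 9800GT host.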
skgiven
Message 19161 - Posted: 1 Nov 2010, 1:13:34 UTC - in response to Message 18925.  
