Message boards : Graphics cards (GPUs) : GA: information and issues

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Message 15368 - Posted: 22 Feb 2010 | 18:42:11 UTC

Dear Crunchers,

we've submitted approximately 1000 WUs of the "GA" (gramicidin A) type. They are a re-issue of a system which we have already run for a while. The purpose of the runs is methodological: they use a model system to improve an algorithm that can be transferred to other molecules.

The video is here - though I'm making new ones.




skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 15374 - Posted: 22 Feb 2010 | 22:48:39 UTC - in response to Message 15368.

Keep challenging your methodologies and you will strengthen the research. Good decision for the long-term future. I hope you identify the subtle improvements you have made with the new applications and confirm existing results.

Thanks,

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Message 15381 - Posted: 23 Feb 2010 | 14:49:57 UTC - in response to Message 15374.
Last modified: 23 Feb 2010 | 14:51:48 UTC

Thanks. Btw, all of them are acemd2, so they have the higher bang-for-the-buck ratio (i.e. credits/hour) of the new app.

Snow Crash
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Message 15391 - Posted: 23 Feb 2010 | 19:06:54 UTC - in response to Message 15381.
Last modified: 23 Feb 2010 | 20:04:04 UTC

Finished my first GA!

GTX295, Shaders at 1620, WinXP
i7-920 HT ON at 4.0 GHz, 8 CPU threads of WCG HCMD2 fully loaded.

GPU = 5 hours
CPU usage = 1230 seconds
Time per step = 23.927 ms
Points w/ bonus = 6945.175

compared to recent TONI series avg on the same machine
GPU = 4 hours 40 minutes
CPU usage = 555 seconds
Time per step = 25.651 ms
Points w/ bonus = 6123.06875

so the CPU time is up by roughly 2.2x (1230 s vs 555 s) and the GPU time just a little ... looks good to me.
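
For anyone who wants to check those ratios, here is a minimal sketch in Python. The numbers are copied from the two runs above; the dictionary keys and variable names are only illustrative, and it assumes "GPU = 5 hours" means exactly 5.0 hours.

```python
# Figures copied from the two runs reported above (GA vs. recent TONI average).
# Key names are illustrative; "5 hours" is assumed to mean exactly 5.0 h.
ga = {"gpu_hours": 5.0, "cpu_seconds": 1230, "points": 6945.175}
toni = {"gpu_hours": 4 + 40 / 60, "cpu_seconds": 555, "points": 6123.06875}

# How much more CPU and GPU time the GA run needed.
cpu_ratio = ga["cpu_seconds"] / toni["cpu_seconds"]
gpu_ratio = ga["gpu_hours"] / toni["gpu_hours"]

# Credit earned per hour of GPU time for each run.
ga_points_per_hour = ga["points"] / ga["gpu_hours"]
toni_points_per_hour = toni["points"] / toni["gpu_hours"]

print(f"CPU time ratio: {cpu_ratio:.2f}x")
print(f"GPU time ratio: {gpu_ratio:.2f}x")
print(f"Points per GPU hour: GA {ga_points_per_hour:.0f}, TONI {toni_points_per_hour:.0f}")
```

On these figures the GA run takes about 7% longer on the GPU but still earns slightly more points per GPU hour.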

I'm looking forward to your new videos, I hope these results help you find a better answer :-)
____________
Thanks - Steve

K1atOdessa
Joined: 25 Feb 08
Posts: 249
Credit: 370,320,941
RAC: 0
Message 15394 - Posted: 23 Feb 2010 | 23:15:43 UTC

I've been getting nothing but errors on the "TONI_GA" ACEMD - GPU molecular dynamics v6.03 (cuda) WU's over the past 36 hours.


"SWAN : FATAL : Failure executing kernel [mshake_position_kernel_1] [2] [66,1,1][64,1,1]
Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 194

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information."


Running 6.10.17, with drivers 196.34. No issues with other WU's, even other types of v6.03's. Running 2x 8800GT + 1 GTS 240. Restart didn't help.

I've halted new WU's for now, but I think I may try to change the preferences not to get these new types. What should I de-select to prevent only these TONI_GA v6.03 types from downloading?

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Message 15404 - Posted: 24 Feb 2010 | 17:16:08 UTC - in response to Message 15394.
Last modified: 24 Feb 2010 | 17:20:00 UTC

Hi K1atOdessa,

do you know if they fail on the GTS or on the 8800 (or both)?
At present you can't filter one WU type, but you can filter out acemd2 altogether (Your account, gpugrid preferences). However, this batch of WUs should be over.

T

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Message 15405 - Posted: 24 Feb 2010 | 17:24:13 UTC - in response to Message 15391.
Last modified: 24 Feb 2010 | 17:24:30 UTC

Hi Steve, thanks for the report... timings look normal to me.

The only thing: I wouldn't swear that the CPU time is reproducible, even if you run two identical WUs (I may be wrong). What's important is that it is much less than the GPU time.

MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 15406 - Posted: 24 Feb 2010 | 17:34:38 UTC - in response to Message 15394.


"SWAN : FATAL : Failure executing kernel [mshake_position_kernel_1] [2] [66,1,1][64,1,1]


That's an out-of-memory error, but it's coming from an improbable place, making me think there's some other problem. Is the problem persisting over a hard-reset of the machine? Are you running anything else that might be using the GPU's memory?

MJH

K1atOdessa
Message 15407 - Posted: 24 Feb 2010 | 18:06:25 UTC - in response to Message 15406.


That's an out-of-memory error, but it's coming from an improbable place, making me think there's some other problem. Is the problem persisting over a hard-reset of the machine? Are you running anything else that might be using the GPU's memory?

MJH


It did continue after a hard shutdown / reboot. I am not doing anything else (games, etc.) with this machine. I've changed my WU options to only get the "old" WU types, and completed a couple "Full-atom molecular dynamics v6.71 (cuda23)" WUs with no issue.

ACEMD: yes
ACEMD ver 2.0: no
ACEMD beta: no


The interesting thing is that I did get one v6.03 WU to complete this morning, which was one I grabbed before changing the WU options. I am going to let it run with just the "ACEMD" type right now, but maybe switch back to accepting the "ACEMD ver 2.0" type after a couple of days of no issues to see what happens. I'd like to get the benefit of the performance increase since my cards are not high-end and take a while to complete.

Any other ideas of something I should try?

Snow Crash
Message 15408 - Posted: 24 Feb 2010 | 18:58:33 UTC
Last modified: 24 Feb 2010 | 18:59:58 UTC

Very consistent timings ... three more on the same machine:
I have unhidden my computers (56900) so you can verify.

------------------------------------------------
GPU (s)     CPU (s)     Time per step (ms)
------------------------------------------------
17910       1238        23.892
17889       1233        23.864
17722       1206        23.64
17935       1230        23.93
------------------------------------------------
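
A quick way to summarise the four runs in the table, as a sketch only: the figures are copied from the table, the units (seconds for GPU and CPU time, milliseconds per step) are assumed from the earlier report on the same machine, and the variable names are illustrative.

```python
from statistics import mean

# One entry per run, copied from the table above.
gpu_seconds = [17910, 17889, 17722, 17935]
cpu_seconds = [1238, 1233, 1206, 1230]
ms_per_step = [23.892, 23.864, 23.64, 23.93]

gpu_mean = mean(gpu_seconds)
cpu_mean = mean(cpu_seconds)
step_mean = mean(ms_per_step)

# Spread of the GPU time (max minus min) as a percentage of the mean.
gpu_spread_pct = (max(gpu_seconds) - min(gpu_seconds)) / gpu_mean * 100

print(f"Mean GPU time: {gpu_mean:.0f} s ({gpu_mean / 3600:.2f} h)")
print(f"Mean CPU time: {cpu_mean:.2f} s")
print(f"Mean time per step: {step_mean:.3f} ms")
print(f"GPU time spread: {gpu_spread_pct:.1f}% of the mean")
```

The GPU time varies by only about 1% between runs, which is what "consistent" means here.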

I am not complaining at all, they run very nicely for me.

The CPU seconds is still much less than when I run them
on my Vista PC which is a GTX285 i7-920 and takes ~5000 CPU sec.

Keep up the good work!
____________
Thanks - Steve

skgiven
Volunteer moderator
Volunteer tester
Message 15417 - Posted: 25 Feb 2010 | 1:50:42 UTC - in response to Message 15408.

K1atOdessa,

At some stage in the next few days it would probably be a good idea to make sure you have selected to receive work from other projects (ACEMD ver 2.0 and Betas) if the projects you have selected (ACEMD) have no work.

This is also in your projects settings.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Message 15425 - Posted: 25 Feb 2010 | 10:48:37 UTC - in response to Message 15408.

Hi Steve, noted. Thanks for the info.

Snow Crash
Message 15427 - Posted: 25 Feb 2010 | 13:19:00 UTC

Is it just coincidence that two machines got incorrect function errors, or is it something with the WU?
http://www.gpugrid.net/result.php?resultid=1902641

Based on the amount of time it processed on my machine it should have been finished. I just upgraded it to Win7 and it has been returning WUs OK after the upgrade, including a GA.
____________
Thanks - Steve

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Message 15437 - Posted: 25 Feb 2010 | 17:22:47 UTC - in response to Message 15427.
Last modified: 25 Feb 2010 | 17:24:01 UTC

It's a coincidence. The "other" machine is not returning any results.

Did you abort it manually based on elapsed time? Something went wrong, but I don't think the WU is anything special.

Snow Crash
Message 15438 - Posted: 25 Feb 2010 | 17:52:04 UTC - in response to Message 15437.
Last modified: 25 Feb 2010 | 17:53:08 UTC

I did not abort the WU ... the machine is at home and I am at work :-)
It did return another WU of a different type since then so it looks like the machine is OK. While the driver is the same one I was using for Vista and the OS itself should not make a difference from a stability standpoint, I will lower my OC when I get home today. I will also check my error and system event logs and post anything *special*.

<ot>Are you seeing any trending in general on Win7 machines producing more errors?</ot>
____________
Thanks - Steve

K1atOdessa
Message 15455 - Posted: 26 Feb 2010 | 5:13:04 UTC - in response to Message 15407.


OK. So I restricted my machine to only the v6.71 WU's over the past 2 days. See tasks, 7/8 completed with no issues. I flipped the options back to allow v6.03 WU's and instant failures again. Two ran longer than just a couple seconds, but eventually failed.

So, any ideas why I am seeing this failure activity only on the newer v6.03 WU's? Are these v6.03 WU's doing something different that the v6.71 ones didn't? I've had to go back to restricting downloads to only the v6.71 WU's because otherwise I'd quickly hit the max errored WU's and sit for 24 hours to do it again.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Message 15664 - Posted: 10 Mar 2010 | 11:54:40 UTC - in response to Message 15455.
Last modified: 10 Mar 2010 | 11:54:52 UTC

I've sent more GA runs... let's see if the newer application improves things.

And, btw, a new movie here.

skgiven
Volunteer moderator
Volunteer tester
Message 15667 - Posted: 10 Mar 2010 | 12:50:28 UTC - in response to Message 15664.

K1atOdessa, What cards does your system actually have?
I see GT8800 and GT240 ?!?

Have you swapped cards around and kept the same drivers?
If so, reinstall the driver to register the card, restart and then start crunching again.

skgiven
Volunteer moderator
Volunteer tester
Message 15669 - Posted: 10 Mar 2010 | 13:52:41 UTC - in response to Message 15667.

K1atOdessa, Strike that last message.

I see you have two 8800GT's and one GT240 in the same system.

Restart your system, first!
Upgrade to the latest version of Boinc (6.10.36). Restart again.
See if that works.

If you installed any of these cards recently you could try to manually reinstall the drivers from device manager, individually and for each card!

K1atOdessa
Message 15675 - Posted: 10 Mar 2010 | 18:41:13 UTC - in response to Message 15669.

Restart your system, first!
Upgrade to the latest version of Boinc (6.10.36). Restart again.
See if that works.


Thanks, I just saw in another thread that 6.10.36 is the current recommended version. I will upgrade to that later tonight to see what happens.

I've had all three cards in and working fine on the older WU's for some time, but if the upgrade to the newer BOINC version doesn't help, I'll try the manual reinstall of drivers for each card.

Thanks for the tips.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Message 16194 - Posted: 7 Apr 2010 | 13:42:25 UTC - in response to Message 15675.

I just submitted a new batch (GA7F). These should be shorter than usual.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Message 16340 - Posted: 16 Apr 2010 | 22:03:43 UTC - in response to Message 16194.

Another batch is out, GA7R. Also short.

X-Files 27
Joined: 11 Oct 08
Posts: 95
Credit: 68,023,693
RAC: 0
Message 16346 - Posted: 17 Apr 2010 | 1:41:01 UTC - in response to Message 16340.

Another batch is out, GA7R. Also short.


Got 3 errors: Incorrect function. (0x1) - exit code 1 (0x1)
1368864 -> 3 computers already reporting as error
1368710
1368868
____________

K1atOdessa
Message 16358 - Posted: 17 Apr 2010 | 13:15:34 UTC - in response to Message 16340.

Another batch is out, GA7R. Also short.


Got first error on new batch - Incorrect function. (0x1) - exit code 1 (0x1)

1369719 - already errored out by one other cruncher

Currently another is crunching. We'll see.

Stoneageman
Joined: 25 May 09
Posts: 224
Credit: 34,057,224,498
RAC: 14
Message 16371 - Posted: 17 Apr 2010 | 17:06:50 UTC - in response to Message 16358.

Had seven failures in the last 12hrs, with an average run time of 1hr before failure. I'm now shooting any on sight :-)

Snow Crash
Message 16374 - Posted: 17 Apr 2010 | 17:19:43 UTC - in response to Message 16371.

I have errored 2 out of 5.
The first one was only 20 minutes but the other was 2 hours.
Not a very good return rate.
____________
Thanks - Steve

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Message 16431 - Posted: 19 Apr 2010 | 14:28:08 UTC - in response to Message 16371.
Last modified: 19 Apr 2010 | 14:28:39 UTC

@ stoneageman and snowcrash - which of your computers errored out?

Snow Crash
Message 16432 - Posted: 19 Apr 2010 | 14:53:48 UTC - in response to Message 16431.

I aborted but everyone else failed
http://www.gpugrid.net/workunit.php?wuid=1369580

GTX295 (CompID = 56900) - failed, and so did everyone else
http://www.gpugrid.net/workunit.php?wuid=1369689
http://www.gpugrid.net/workunit.php?wuid=1369405
____________
Thanks - Steve

Stoneageman
Message 16435 - Posted: 19 Apr 2010 | 16:26:32 UTC - in response to Message 16431.

They all did!

Snow Crash
Message 16472 - Posted: 21 Apr 2010 | 14:01:36 UTC

failed for everyone who crunched this WU.
f111r1-TONI_GA7R-0-1-RND5547_5
____________
Thanks - Steve

Snow Crash
Message 16501 - Posted: 22 Apr 2010 | 16:42:21 UTC

failing for everyone ...
f103s2-TONI_GA7R-0-1-RND8503_2
____________
Thanks - Steve

Snow Crash
Message 16607 - Posted: 28 Apr 2010 | 14:53:13 UTC

failing for everyone
f109s10-TONI_GA8F-0-1-RND4323

Are the failures (not just the ones I have posted) legit due to parameters of the experiment, or do you see them as problems with the machines they are running on? If they are part of the experiment parameters, have you considered adding error handling that distinguishes between a machine error and a parameter-out-of-bounds type of error? Perhaps even awarding points for what amounts to a valid execution of invalid parameters?
____________
Thanks - Steve

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Message 16608 - Posted: 28 Apr 2010 | 15:04:00 UTC - in response to Message 16607.
Last modified: 28 Apr 2010 | 15:07:50 UTC

Please, abort any WUs named GA8 and GA9. (Not GA10).

Concerning the failures, these are not expected and should not happen of course. We are investigating their cause (possibly these runs are exposing a bug), but at the moment it is not possible to "trap" them.

Stoneageman
Message 16653 - Posted: 29 Apr 2010 | 23:08:51 UTC - in response to Message 16608.

Just had four GA10R and one GA10F fail on three different machines in the last few hours.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Message 16663 - Posted: 30 Apr 2010 | 7:48:22 UTC - in response to Message 16653.

Immediately or after a while?

ftpd
Joined: 6 Jun 08
Posts: 152
Credit: 328,250,382
RAC: 0
Message 16668 - Posted: 30 Apr 2010 | 8:55:08 UTC

Two GA10R WUs were also cancelled here just now, after 35 min.
Windows XP Pro, BOINC 6.10.45, GTX295.

Two WUs yesterday as well!

Success!!


____________
Ton (ftpd) Netherlands

Stoneageman
Message 16669 - Posted: 30 Apr 2010 | 9:30:46 UTC - in response to Message 16668.

Score so far with GA10R: nine failed and five successful. Quickest to fail was 4min and longest 3h10m. Average about 1hr. BOINC 6.10.43, driver 197.13, WinXP64.

Snow Crash
Message 16672 - Posted: 30 Apr 2010 | 12:43:16 UTC

If you know the GA8 is bad, can you please stop sending them out?
I got one very early this morning and wasted over an hour on it.
WU: f124s3-TONI_GA8F-0-1-RND6576_1
Returned: 30 Apr 2010 3:33:25 UTC
GPU Time: 4,560.34
CPU Time: 440.73


4 GA10 failures between late yesterday and today.

WU: f150r7-TONI_GA10R-0-1-RND0345_1
Returned: 30 Apr 2010 8:01:47 UTC
GPU Time: 1,697.97
CPU Time: 159.95

WU: f187r5-TONI_GA10R-0-1-RND5176_0
Returned: 30 Apr 2010 7:53:12 UTC
GPU Time: 12,832.70
CPU Time: 1,337.33

WU: f140r8-TONI_GA10R-0-1-RND4688_0
Returned: 30 Apr 2010 2:17:11 UTC
GPU Time: 4,305.75
CPU Time: 413.22

WU: f100r2-TONI_GA10R-0-1-RND6793_1
Returned: 29 Apr 2010 18:35:10 UTC
GPU Time: 796.77
CPU Time: 77.42
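
To put a number on the lost time, here is a rough sketch in Python that adds up the GPU seconds of the five failed WUs listed above, grouped by batch. The values are copied from this post; `batch_of` is a hypothetical helper, and the way it splits the WU name is only assumed from these five examples.

```python
# (work unit name, GPU seconds, CPU seconds), copied from the list above.
failed = [
    ("f124s3-TONI_GA8F-0-1-RND6576_1", 4560.34, 440.73),
    ("f150r7-TONI_GA10R-0-1-RND0345_1", 1697.97, 159.95),
    ("f187r5-TONI_GA10R-0-1-RND5176_0", 12832.70, 1337.33),
    ("f140r8-TONI_GA10R-0-1-RND4688_0", 4305.75, 413.22),
    ("f100r2-TONI_GA10R-0-1-RND6793_1", 796.77, 77.42),
]


def batch_of(name):
    """Return the batch label, e.g. 'GA8F' from 'f124s3-TONI_GA8F-0-1-RND6576_1'.

    Hypothetical helper: the name layout is assumed from the examples above.
    """
    return name.split("-")[1].split("_", 1)[1]


# Sum the GPU seconds spent on failed WUs for each batch.
gpu_by_batch = {}
for name, gpu_s, _cpu_s in failed:
    batch = batch_of(name)
    gpu_by_batch[batch] = gpu_by_batch.get(batch, 0.0) + gpu_s

total_gpu_hours = sum(gpu_by_batch.values()) / 3600

for batch, seconds in sorted(gpu_by_batch.items()):
    print(f"{batch}: {seconds:.2f} s of GPU time on failed WUs")
print(f"Total: {total_gpu_hours:.2f} GPU hours")
```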

____________
Thanks - Steve

K1atOdessa
Message 16729 - Posted: 1 May 2010 | 19:35:43 UTC - in response to Message 16608.

Please, abort any WUs named GA8 and GA9. (Not GA10).

Concerning the failures, these are not expected and should not happen of course. We are investigating their cause (possibly these runs are exposing a bug), but at the moment it is not possible to "trap" them.


I've had failures on the last 3 of 4 GA10's. I'm going to kill them as well as they appear to have similar issues with GA8 and GA9.

Siegfried Niklas
Joined: 23 Feb 09
Posts: 39
Credit: 144,654,294
RAC: 0
Message 16739 - Posted: 2 May 2010 | 8:10:25 UTC

Ongoing problems with "...-TONI_GA..." on all cards.

http://www.gpugrid.net/workunit.php?wuid=1412871

http://www.gpugrid.net/workunit.php?wuid=1404490

http://www.gpugrid.net/workunit.php?wuid=1413442

http://www.gpugrid.net/workunit.php?wuid=1413519

http://www.gpugrid.net/workunit.php?wuid=1413117

http://www.gpugrid.net/workunit.php?wuid=1413857

ftpd
Message 16740 - Posted: 2 May 2010 | 11:22:42 UTC

GA10R failed after 7 min.


____________
Ton (ftpd) Netherlands

skgiven
Volunteer moderator
Volunteer tester
Message 16743 - Posted: 2 May 2010 | 12:22:48 UTC - in response to Message 16740.

f114r5-TONI_GA10R-0-1-RND7226

Too many errors (may have bug)

Stoneageman
Message 16780 - Posted: 3 May 2010 | 13:52:58 UTC - in response to Message 16743.

GA11R....two failed & two completed ok. Looks like there's still an issue!

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Message 16781 - Posted: 3 May 2010 | 14:10:06 UTC - in response to Message 16780.
Last modified: 3 May 2010 | 14:19:44 UTC

Sorry guys, GA is making me sweat too... However for now I am not aware of mistakes in GA11.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,852,511,851
RAC: 10,070,133
Message 16794 - Posted: 3 May 2010 | 22:07:09 UTC
Last modified: 3 May 2010 | 22:09:26 UTC

Add another exit code 1 (0x1) to the collection:

f196r4-TONI_GA11R-0-1-RND1898_0

Edit - And a 'ERROR: file tclutil.cpp line 31: get_Dvec() element 0 (b) ':

f136r9-TONI_GA11R-0-1-RND4524_0

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Message 16797 - Posted: 4 May 2010 | 9:13:52 UTC - in response to Message 16794.

Hi Richard,

could it be that you are getting errors on host 45218 since you upgraded from 6.10.48 to 6.10.51?

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Message 16798 - Posted: 4 May 2010 | 9:16:41 UTC - in response to Message 16797.
Last modified: 4 May 2010 | 9:22:29 UTC

Stopped all suspicious GAxx. There is a small number of GAUS1 out that should run fine, except they produce large uploads. A batch of GAUS2 should work well. Thanks for all of your reports.

Richard Haselgrove
Message 16799 - Posted: 4 May 2010 | 9:37:00 UTC - in response to Message 16797.

Hi Richard,

could it be that you are getting errors on host 45218 since you upgraded from 6.10.48 to 6.10.51?

Interesting question. True, but I don't think you can claim "cause and effect".

I upgraded host 43404 to BOINC v6.10.51 at the same time. That host hasn't thrown any errors yet - but then, it hasn't been issued any GA11R tasks either.

The other difference between the hosts is that 43404 (factory overclocked 9800GTX+, no errors at the moment) is running NVidia drivers 190.38; host 45218 (stock speed 9800GT, errors on GA11R) I upgraded from 197.13 to 197.45 in the same session as I installed v6.10.51. (Both 197 drivers have difficulty holding my 1600 x 1200 resolution when I switch the DVI KVM to another host.) I'm active in BOINC development testing, and I'm not aware of any changes in v6.10.51 that could cause application errors - if anything, the 197 drivers might be more of a problem, because (at least as reported by BOINC) they leave less GPU RAM available for apps to use.

Both hosts are currently running TONI_GAUS1 tasks (do they count as 'GA' for the purposes of this thread?): 43404 is at 15%, 45218 is at 65%. That'll be the first head-to-head comparison between the two hosts - results in a few hours.

Richard Haselgrove
Message 16800 - Posted: 4 May 2010 | 9:51:24 UTC

Uh oh. "exit code 1 (0x1)" on f5r6-TONI_GAUS1-0-50-RND5224_1 - that's on 43404, the 9800 GTX+ that was OK overnight.

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Message 16801 - Posted: 4 May 2010 | 10:11:45 UTC - in response to Message 16800.

Yes, the driver is more likely to be the culprit than the BOINC version. (And I would also be cautious about cause and effect.)

Your hosts make an interesting pair. So, if I understand

43404, factory oc -> v 190.38, errors
45218 was 197.13, now 197.45, errors after upgrade (but did not crunch GA before it)

is that correct?

Richard Haselgrove
Message 16802 - Posted: 4 May 2010 | 10:22:01 UTC - in response to Message 16801.
Last modified: 4 May 2010 | 10:32:32 UTC

Yes, the driver is more likely to be the culprit than the BOINC version. (And I would also be cautious about cause and effect.)

Your hosts make an interesting pair. So, if I understand

43404, factory oc -> v 190.38, errors
45218 was 197.13, now 197.45, errors after upgrade (but did not crunch GA before it)

is that correct?

I wouldn't put it that way.

43404, factory oc -> v 190.38, 1 error with GAUS1, success with CAPBIND, HERG and pTEEI. Also succeeded with a GA10F and a GA8F, errored with another GA8F.

45218 fair comment, but it has no GA tasks shown from before upgrade, ONLY (so far) GA11R tasks since upgrade.

PS - I have a third 'control' host, 43362, with a non-overclocked 9800GT, BOINC v6.10.36, 190.38 driver. But it hasn't got any GA tasks yet....

Got to go out now, will review outcomes when I get back.

skgiven
Volunteer moderator
Volunteer tester
Message 16803 - Posted: 4 May 2010 | 11:24:17 UTC - in response to Message 16802.
Last modified: 4 May 2010 | 11:32:57 UTC

I'm also getting errors on my system with four GT240's.

Two things are interesting,
- they are all TONI_GA11R-0-1-RND
- The system picked up a few Beta 6.22 tasks. Although these all ran successfully, perhaps they interfered with the GA11R task in some way.

I say this because I saw the same pattern a few days ago; Betas OK but other WU's failed.


The Failures,
(columns: task ID | work unit ID | sent | reported | status | run time (s) | CPU time (s) | claimed credit | granted credit | application)
2266574 | 1428922 | 3 May 2010 13:02:26 UTC | 3 May 2010 15:17:42 UTC | Error while computing | 4,354.23 | 518.53 | 3,429.72 | --- | ACEMD - GPU molecular dynamics v6.03 (cuda)
2265705 | 1429788 | 3 May 2010 21:25:49 UTC | 4 May 2010 0:32:03 UTC | Error while computing | 827.67 | 108.02 | 3,429.72 | --- | ACEMD - GPU molecular dynamics v6.03 (cuda)
2271301 | 1429669 | 4 May 2010 1:32:37 UTC | 4 May 2010 6:44:26 UTC | Error while computing | 18,111.57 | 2,181.78 | 3,429.72 | --- | ACEMD - GPU molecular dynamics v6.03 (cuda)

f184r5-TONI_GA11R-0-1-RND4020_2
f123r4-TONI_GA11R-0-1-RND7715_1
f193r9-TONI_GA11R-0-1-RND1232_0

(BOINC 6.10.45, driver 197.45; 3 of the cards on this system are OC'd but complete other tasks OK)

ftpd
Joined: 6 Jun 08
Posts: 152
Credit: 328,250,382
RAC: 0
Level
Asp
Message 16814 - Posted: 4 May 2010 | 14:58:26 UTC
Last modified: 4 May 2010 | 15:00:00 UTC

f199r1-TONI_GA11R-0-1-RND7585_1
Workunit 1429855
Created 4 May 2010 5:26:42 UTC
Sent 4 May 2010 5:40:45 UTC
Received 4 May 2010 14:56:44 UTC
Server state Over
Outcome Success
Client state None
Exit status 0 (0x0)
Computer ID 47762
Report deadline 9 May 2010 5:40:45 UTC
Run time 14756.328125
CPU time 1624.516
stderr out <core_client_version>6.10.50</core_client_version>
<![CDATA[
<stderr_txt>
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 295"
# Clock rate: 1.24 GHz
# Total amount of global memory: 939327488 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Device 1: "GeForce GTX 295"
# Clock rate: 1.24 GHz
# Total amount of global memory: 939196416 bytes
# Number of multiprocessors: 30
# Number of cores: 240
MDIO ERROR: cannot open file "restart.coor"
# Time per step: 29.504 ms
# Approximate elapsed time for entire WU: 14751.813 s
called boinc_finish

</stderr_txt>
]]>

This one is OK with driver 197.45, Windows XP Pro and BOINC 6.10.50.
____________
Ton (ftpd) Netherlands

ftpd
Joined: 6 Jun 08
Posts: 152
Credit: 328,250,382
RAC: 0
Level
Asp
Message 16815 - Posted: 4 May 2010 | 15:03:29 UTC

f8r6-TONI_GAUS1-0-50-RND3701_2
Workunit 1431573
Created 4 May 2010 5:33:57 UTC
Sent 4 May 2010 5:40:45 UTC
Received 4 May 2010 15:00:57 UTC
Server state Over
Outcome Success
Client state None
Exit status 0 (0x0)
Computer ID 47762
Report deadline 9 May 2010 5:40:45 UTC
Run time 18665.984375
CPU time 2017.328
stderr out <core_client_version>6.10.50</core_client_version>
<![CDATA[
<stderr_txt>
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 295"
# Clock rate: 1.24 GHz
# Total amount of global memory: 939327488 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Device 1: "GeForce GTX 295"
# Clock rate: 1.24 GHz
# Total amount of global memory: 939196416 bytes
# Number of multiprocessors: 30
# Number of cores: 240
MDIO ERROR: cannot open file "restart.coor"
# Time per step: 28.732 ms
# Approximate elapsed time for entire WU: 18675.547 s
called boinc_finish

</stderr_txt>
]]>


This one is also OK; the only thing to note is the 60.13 MB upload. Same driver and BOINC version.
____________
Ton (ftpd) Netherlands
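A side note on the two successful logs above: each reports a time per step and an approximate elapsed time for the whole WU, so the implied number of simulation steps can be estimated by dividing one by the other. The sketch below does that in Python. It assumes the elapsed time is simply steps multiplied by time per step, which the logs do not state explicitly, and the function name `implied_steps` is made up for illustration.

```python
import re

def implied_steps(stderr_text):
    """Parse ACEMD stderr text and return (ms_per_step, elapsed_s, implied_steps).

    Returns None if either timing line is missing.
    """
    step = re.search(r"# Time per step:\s*([\d.]+)\s*ms", stderr_text)
    total = re.search(
        r"# Approximate elapsed time for entire WU:\s*([\d.]+)\s*s", stderr_text
    )
    if step is None or total is None:
        return None
    ms_per_step = float(step.group(1))
    elapsed_s = float(total.group(1))
    # Assumption (not stated in the logs): elapsed = steps * time per step.
    return ms_per_step, elapsed_s, elapsed_s / (ms_per_step / 1000.0)

# Timing lines copied from the two results posted above.
GA11R_LOG = "\n".join([
    "# Time per step: 29.504 ms",
    "# Approximate elapsed time for entire WU: 14751.813 s",
])
GAUS1_LOG = "\n".join([
    "# Time per step: 28.732 ms",
    "# Approximate elapsed time for entire WU: 18675.547 s",
])

for name, log in (("GA11R", GA11R_LOG), ("GAUS1", GAUS1_LOG)):
    ms, secs, steps = implied_steps(log)
    print(f"{name}: {ms} ms/step, {secs} s total, ~{steps:,.0f} steps implied")
```

Under that assumption the GA11R task works out to roughly 500,000 steps and the GAUS1 task to roughly 650,000, which would be consistent with the GAUS1 run taking longer on the same card at a similar per-step speed.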

Profile CNT - IQE
Joined: 21 Sep 09
Posts: 3
Credit: 1,951,950,972
RAC: 0
Level
His
Message 16820 - Posted: 4 May 2010 | 17:56:35 UTC
Last modified: 4 May 2010 | 18:42:52 UTC

Hi,

Is it possible to reactivate the log information about the GPU?

We want to know which card a WU is crunched on.

We have a 4x295 system, and now, if there is an errored WU, we don't know which card is bad.

The old log contained information like:

WU is started on GPU 4...

Computer:

http://www.gpugrid.net/show_host_detail.php?hostid=59988

OS: Ubuntu 9.10 64b
Drivers: 195.36.15


It's only TONI-GA WUs.
Errored WUs in the last 7 days:

http://www.gpugrid.net/result.php?resultid=2268379
http://www.gpugrid.net/result.php?resultid=2268375
http://www.gpugrid.net/result.php?resultid=2265727
http://www.gpugrid.net/result.php?resultid=2265544
http://www.gpugrid.net/result.php?resultid=2265061
http://www.gpugrid.net/result.php?resultid=2264541
http://www.gpugrid.net/result.php?resultid=2241247
http://www.gpugrid.net/result.php?resultid=2241094
http://www.gpugrid.net/result.php?resultid=2240609
http://www.gpugrid.net/result.php?resultid=2240370
http://www.gpugrid.net/result.php?resultid=2240225
http://www.gpugrid.net/result.php?resultid=2228009

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,852,511,851
RAC: 10,070,133
Level
Tyr
Message 16821 - Posted: 4 May 2010 | 19:05:30 UTC - in response to Message 16800.

Uh oh. "exit code 1 (0x1)" on f5r6-TONI_GAUS1-0-50-RND5224_1 - that's on 43404, the 9800 GTX+ that was OK overnight.

But it was followed immediately by SUCCESS with f92r9-TONI_GAUS1-0-50-RND9426_1. You're not wrong about that upload!

Profile K1atOdessa
Joined: 25 Feb 08
Posts: 249
Credit: 370,320,941
RAC: 0
Level
Asp
Message 16826 - Posted: 4 May 2010 | 22:15:01 UTC - in response to Message 16820.

> Is it possible to reactivate the log information about the GPU? We want to know which card a WU is crunched on.


Definitely would like to have that information in the WU's again.

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1576
Credit: 5,852,511,851
RAC: 10,070,133
Level
Tyr
Message 16831 - Posted: 5 May 2010 | 13:23:08 UTC

Latest conundrum:

f10r3-TONI_GAUS2-0-50-RND3526_0 FAILED on stock 9800GT
f79r2-TONI_GAUS2-0-50-RND5339_1 SUCCESS on overclocked 9800GTX+

Toni
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 9 Dec 08
Posts: 1006
Credit: 5,068,599
RAC: 0
Level
Ser
Message 16864 - Posted: 6 May 2010 | 11:42:19 UTC - in response to Message 16826.

> Definitely would like to have that information in the WU's again.

In fact, it should be restored in an upcoming update.

Profile K1atOdessa
Joined: 25 Feb 08
Posts: 249
Credit: 370,320,941
RAC: 0
Level
Asp
Message 16866 - Posted: 6 May 2010 | 13:38:14 UTC - in response to Message 16864.

>> Definitely would like to have that information in the WU's again.

> In fact, it should be restored in an upcoming update.


Thanks for the update.

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 16902 - Posted: 7 May 2010 | 21:57:30 UTC - in response to Message 16866.

Task f11r7-TONI_GAUS2-1-50-RND2945_4 failed after 21691 sec (6h). The system was not being used at the time.

Again, there were betas about!

The previous task on that card was successful, and so far the next task is OK (at 90%).

Profile skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 16966 - Posted: 11 May 2010 | 18:50:27 UTC - in response to Message 16902.
Last modified: 11 May 2010 | 18:52:37 UTC

Name f38r0-TONI_GAUS2-4-50-RND0732_0
Workunit 1455699
Created 9 May 2010 15:12:28 UTC
Sent 9 May 2010 15:51:00 UTC
Received 10 May 2010 13:14:06 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 3 (0x3)
Computer ID 51747
Report deadline 14 May 2010 15:51:00 UTC
Run time 37046.842462
CPU time 2079.649
stderr out

<core_client_version>6.10.43</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
# There is 1 device supporting CUDA
# Device 0: "GeForce GT 240"
# Clock rate: 1.62 GHz
# Total amount of global memory: 1073741824 bytes
# Number of multiprocessors: 12
# Number of cores: 96
MDIO ERROR: cannot open file "restart.coor"
# There is 1 device supporting CUDA
# Device 0: "GeForce GT 240"
# Clock rate: 1.62 GHz
# Total amount of global memory: 1073741824 bytes
# Number of multiprocessors: 12
# Number of cores: 96
SWAN : FATAL : Failure executing kernel sync [nb_k_nt_tp_pme] [999]
Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

</stderr_txt>
]]>

Validate state Invalid
Claimed credit 4458.63078703704
Granted credit 0
application version ACEMD - GPU molecular dynamics v6.03 (cuda)

---
Name f26r3-TONI_GAUS2-7-50-RND1787_1
Workunit 1456753
Created 10 May 2010 8:06:41 UTC
Sent 10 May 2010 8:17:06 UTC
Received 10 May 2010 17:58:29 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 1 (0x1)
Computer ID 55951
Report deadline 15 May 2010 8:17:06 UTC
Run time 29994.549999
CPU time 4620.22
stderr out

<core_client_version>6.10.50</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# There is 1 device supporting CUDA
# Device 0: "GeForce GT 240"
# Clock rate: 1.62 GHz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 12
# Number of cores: 96
MDIO ERROR: cannot open file "restart.coor"
# There is 1 device supporting CUDA
# Device 0: "GeForce GT 240"
# Clock rate: 1.62 GHz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 12
# Number of cores: 96
# There is 1 device supporting CUDA
# Device 0: "GeForce GT 240"
# Clock rate: 1.62 GHz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 12
# Number of cores: 96
# There is 1 device supporting CUDA
# Device 0: "GeForce GT 240"
# Clock rate: 1.62 GHz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 12
# Number of cores: 96

</stderr_txt>
]]>

Validate state Invalid
Claimed credit 4458.63078703704
Granted credit 0
application version ACEMD - GPU molecular dynamics v6.03 (cuda)
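For anyone comparing failed results like the two above, a small script can pull out the few lines that matter. The Python sketch below counts how many times the CUDA device header appears in a stderr block, looks for a `SWAN : FATAL` line, and checks for `called boinc_finish`. Treating each device header as one start of the application is an assumption based on how these logs look (the header repeats twice in the first result and four times in the second), not something the project has confirmed. The function name `summarize_stderr` is hypothetical.

```python
import re

def summarize_stderr(stderr_text):
    """Summarise an ACEMD stderr block from a BOINC result page."""
    # Assumption: the device header is printed once each time the app starts,
    # so repeated headers suggest the task was restarted.
    starts = len(re.findall(
        r"^# There (?:is|are) \d+ devices? supporting CUDA",
        stderr_text,
        flags=re.MULTILINE,
    ))
    fatal = re.search(r"SWAN : FATAL : (.+)", stderr_text)
    return {
        "app_starts": starts,
        "fatal": fatal.group(1).strip() if fatal else None,
        "finished": "called boinc_finish" in stderr_text,
    }

# Abbreviated copy of the stderr from f38r0-TONI_GAUS2-4-50-RND0732_0 above.
SAMPLE = "\n".join([
    "# There is 1 device supporting CUDA",
    '# Device 0: "GeForce GT 240"',
    'MDIO ERROR: cannot open file "restart.coor"',
    "# There is 1 device supporting CUDA",
    '# Device 0: "GeForce GT 240"',
    "SWAN : FATAL : Failure executing kernel sync [nb_k_nt_tp_pme] [999]",
    "Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203",
])

summary = summarize_stderr(SAMPLE)
print(summary)
```

Note that the `MDIO ERROR: cannot open file "restart.coor"` line also appears in the successful results earlier in the thread, so the sketch deliberately does not treat it as a failure indicator.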
