Toni (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist):
Dear Crunchers,
we've submitted approximately 1000 WUs of the "GA" (gramicidin A) type. They are a re-issue of a system which we have already run for a while. The purpose of the runs is methodological: they use a model system to improve an algorithm that can be transferred to other molecules.
The video is here - though I'm making new ones.
skgiven (Volunteer moderator, Volunteer tester):
Keep challenging your methodologies and you will strengthen the research. A good decision for the long-term future. I hope you can identify the subtle improvements made with the new applications and confirm your existing results.
Thanks,
Toni:
Thanks. By the way, all of them are acemd2, so they have the higher bang-for-the-buck ratio (i.e. credits/hour) of the new app.
|
Finished my first GA!
GTX295, Shaders at 1620, WinXP
i7-920 HT ON at 4.0 GHz, 8 CPU threads of WCG HCMD2 fully loaded.
GPU = 5 hours
CPU usage = 1230 seconds
Time per step = 23.927 ms
Points w/ bonus = 6945.175
compared to recent TONI series avg on the same machine
GPU = 4 hours 40 minutes
CPU usage = 555 seconds
Time per step = 25.651 ms
Points w/ bonus = 6123.06875
so the CPU time is up about 2.5x and the GPU time just a little ... looks good to me.
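For what it's worth, here is the same comparison worked out as credits per hour (just a back-of-the-envelope sketch using the figures quoted above):

#include <cstdio>

int main() {
    // Figures from the two runs quoted above (GA on acemd2 vs. a recent TONI unit)
    const double ga_hours    = 5.0;              // GPU = 5 hours
    const double ga_credit   = 6945.175;         // points w/ bonus
    const double toni_hours  = 4.0 + 40.0/60.0;  // GPU = 4 hours 40 minutes
    const double toni_credit = 6123.06875;       // points w/ bonus

    printf("GA   : %.0f credits/hour\n", ga_credit / ga_hours);     // ~1389
    printf("TONI : %.0f credits/hour\n", toni_credit / toni_hours); // ~1312
    return 0;
}

So the GA units do come out a bit ahead per hour, consistent with the acemd2 "bang for the buck" Toni mentioned.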
I'm looking forward to your new videos, I hope these results help you find a better answer :-)
____________
Thanks - Steve
|
I've been getting nothing but errors on the "TONI_GA" ACEMD - GPU molecular dynamics v6.03 (cuda) WU's over the past 36 hours.
"SWAN : FATAL : Failure executing kernel [mshake_position_kernel_1] [2] [66,1,1][64,1,1]
Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 194
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information."
Running 6.10.17, with drivers 196.34. No issues with other WU's, even other types of v6.03's. Running 2x 8800GT + 1 GTS 240. Restart didn't help.
I've halted new WU's for now, but I think I may try to change the preferences not to get these new types. What should I de-select to prevent only these TONI_GA v6.03 types from downloading?
Toni:
Hi K1atOdessa,
do you know if the failures are on the GTS or on the 8800s (or both)?
At present you can't filter one WU type, but you can filter out acemd2 altogether (Your account, gpugrid preferences). However, this batch of WUs should be over.
T
Toni:
Hi Steve, thanks for the report... timings look normal to me.
The only thing is, I wouldn't swear that the CPU time is reproducible even if you run two identical WUs (I may be wrong). What's important is that it is much less than the GPU time.
MJH (Project administrator, Project developer, Project scientist):
"SWAN : FATAL : Failure executing kernel [mshake_position_kernel_1] [2] [66,1,1][64,1,1]
That's an out-of-memory error, but it's coming from an improbable place, making me think there's some other problem. Is the problem persisting over a hard-reset of the machine? Are you running anything else that might be using the GPU's memory?
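For context, that message comes from the check the application makes right after launching a kernel; roughly like this (a simplified sketch only, not the actual swanlib source - the kernel body is a placeholder and the grid/block sizes are just the ones from your report). CUDA error code 2 corresponds to cudaErrorMemoryAllocation, which is why it reads as out-of-memory:

#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real mshake position update.
__global__ void mshake_position_kernel_1() { }

// Sketch of a SWAN-style checked launch: run the kernel, then inspect the
// CUDA status; cudaErrorMemoryAllocation (2) is what an out-of-memory
// failure would look like at this point.
static void swan_launch_checked(const char *name, dim3 grid, dim3 block)
{
    mshake_position_kernel_1<<<grid, block>>>();
    cudaError_t err = cudaGetLastError();      // launch/configuration errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();         // errors raised while executing
    if (err != cudaSuccess) {
        fprintf(stderr, "SWAN : FATAL : Failure executing kernel [%s] [%d] "
                        "[%d,%d,%d][%d,%d,%d]\n", name, (int)err,
                (int)grid.x, (int)grid.y, (int)grid.z,
                (int)block.x, (int)block.y, (int)block.z);
        assert(0);  // matches the "Assertion failed: 0" line in the report
    }
}

int main()
{
    // Grid and block sizes taken from the error message above.
    swan_launch_checked("mshake_position_kernel_1", dim3(66, 1, 1), dim3(64, 1, 1));
    return 0;
}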
MJH
|
> That's an out-of-memory error, but it's coming from an improbable place, making me think there's some other problem. Is the problem persisting over a hard-reset of the machine? Are you running anything else that might be using the GPU's memory?
> MJH
It did continue after a hard shutdown / reboot. I am not doing anything else (games, etc.) with this machine. I've changed my WU options to only get the "old" WU types, and completed a couple "Full-atom molecular dynamics v6.71 (cuda23)" WUs with no issue.
ACEMD: yes
ACEMD ver 2.0: no
ACEMD beta: no
The interesting thing is that I did get one v6.03 WU to complete this morning, which was one I grabbed before changing the WU options. I am going to let it run with just the "ACEMD" type right now, but maybe switch back to accepting "ACEMD ver 2.0" after a couple of days of no issues to see what happens. I'd like to get the benefit of the performance increase, since my cards are not high-end and take a while to complete.
Any other ideas of something I should try?
|
Very consistent timings ... three more on the same machine:
I have unhidden my computers (56900) so you can verify.
------------------------------------------------
GPU (s) ------ CPU (s) ------ Time/step (ms)
------------------------------------------------
17910 ------- 1238 ------ 23.892
17889 ------- 1233 ------ 23.864
17722 ------- 1206 ------ 23.64
17935 ------- 1230 ------ 23.93
------------------------------------------------
I am not complaining at all, they run very nicely for me.
The CPU time is still much less than when I run them
on my Vista PC, which has a GTX285 and an i7-920 and takes ~5000 CPU sec.
Keep up the good work!
____________
Thanks - Steve
skgiven:
K1atOdessa,
At some stage in the next few days it would probably be a good idea to make sure you have opted to receive work from the other application types (ACEMD ver 2.0 and the betas) in case the type you have selected (ACEMD) has no work.
This is also in your project settings.
Toni:
Hi Steve, noted. Thanks for the info.
|
Is this just coincidence that two machines got 'Incorrect function' errors, or is it something with the WU?
http://www.gpugrid.net/result.php?resultid=1902641
Based on the amount of time it processed on my machine, it should have been finished. I just upgraded it to Win7 and it has been returning WUs OK after the upgrade, including a GA.
____________
Thanks - Steve
Toni:
It's a coincidence. The "other" machine is not returning any results.
Did you abort it manually based on elapsed time? Something went wrong, but I don't think the WU is anything special.
|
I did not abort the WU ... the machine is at home and I am at work :-)
It did return another WU of a different type since then so it looks like the machine is OK. While the driver is the same one I was using for Vista and the OS itself should not make a difference from a stability standpoint, I will lower my OC when I get home today. I will also check my error and system event logs and post anything *special*.
(Off-topic: are you seeing any general trend of Win7 machines producing more errors?)
____________
Thanks - Steve
|
> That's an out-of-memory error, but it's coming from an improbable place, making me think there's some other problem. Is the problem persisting over a hard-reset of the machine? Are you running anything else that might be using the GPU's memory?
> MJH
> It did continue after a hard shutdown / reboot. I am not doing anything else (games, etc.) with this machine. I've changed my WU options to only get the "old" WU types, and completed a couple "Full-atom molecular dynamics v6.71 (cuda23)" WUs with no issue.
> ACEMD: yes
> ACEMD ver 2.0: no
> ACEMD beta: no
> The interesting thing is that I did get one v6.03 WU to complete this morning, which was one I grabbed before changing the WU options. I am going to let it run with just the "ACEMD" type right now, but maybe switch back to accepting "ACEMD ver 2.0" after a couple of days of no issues to see what happens. I'd like to get the benefit of the performance increase since my cards are not high-end and take a while to complete.
> Any other ideas of something I should try?
OK. So I restricted my machine to only the v6.71 WU's over the past 2 days. See tasks: 7/8 completed with no issues. I flipped the options back to allow v6.03 WU's and got instant failures again. Two ran longer than just a couple of seconds, but eventually failed.
So, any ideas why I am seeing this failure activity only on the newer v6.03 WU's? Are these v6.03 WU's doing something different that the v6.71 didn't? I've had to go back to restricting downloads to only the v6.71 WU's, because otherwise I'd quickly hit the max errored WU's and have to sit out for 24 hours, only to do it again.
Toni:
I've sent more GA runs... let's see if the newer application improves things.
And, by the way, there's a new movie here.
skgiven:
K1atOdessa, What cards does your system actually have?
I see GT8800 and GT240 ?!?
Have you swapped cards around and kept the same drivers?
If so, reinstall the driver to register the card, restart and then start crunching again.
skgiven:
K1atOdessa, Strike that last message.
I see you have two 8800GT's and one GT240 in the same system.
Restart your system, first!
Upgrade to the latest version of Boinc (6.10.36). Restart again.
See if that works.
If you installed any of these cards recently you could try to manually reinstall the drivers from Device Manager, individually for each card!
|
> Restart your system, first!
> Upgrade to the latest version of Boinc (6.10.36). Restart again.
> See if that works.
Thanks, I just saw in another thread that 6.10.36 is the current recommended version. I will upgrade to that later tonight to see what happens.
I've had all three cards in and working fine on the older WU's for some time, but if the upgrade to the newer BOINC version doesn't help, I'll try the manual reinstall of drivers for each card.
Thanks for the tips.
Toni:
I just submitted a new batch (GA7F). These should be shorter than usual.
Toni:
Another batch is out, GA7R. Also short.
|
> Another batch is out, GA7R. Also short.
Got 3 errors: Incorrect function. (0x1) - exit code 1 (0x1)
1368864 -> 3 computers already reporting as error
1368710
1368868
____________
|
> Another batch is out, GA7R. Also short.
Got first error on new batch - Incorrect function. (0x1) - exit code 1 (0x1)
1369719 - already errored out by one other cruncher
Currently another is crunching. We'll see.
|
Had seven failures in the last 12hrs, with an average run time of 1hr before failure. I'm now shooting any on sight :-)
|
I have errored 2 out of 5.
The first one was only 20 minutes but the other was 2 hours.
Not a very good return rate.
____________
Thanks - Steve
Toni:
@stoneageman and snowcrash - which of your computers errored out?
|
I aborted but everyone else failed
http://www.gpugrid.net/workunit.php?wuid=1369580
GTX295 (CompID = 56900) - failed, and so has everyone else
http://www.gpugrid.net/workunit.php?wuid=1369689
http://www.gpugrid.net/workunit.php?wuid=1369405
____________
Thanks - Steve
|
They all did!
|
failed for everyone who crunched this WU.
f111r1-TONI_GA7R-0-1-RND5547_5
____________
Thanks - Steve
|
failing for everyone ...
f103s2-TONI_GA7R-0-1-RND8503_2
____________
Thanks - Steve
|
failing for everyone
f109s10-TONI_GA8F-0-1-RND4323
Are the failures (not just the ones I have posted) legitimately due to the parameters of the experiment, or do you see them as problems with the machines they are running on? If they are part of the experiment parameters, have you considered adding error handling that distinguishes between a machine error and a parameter-out-of-bounds type of error? Perhaps even awarding points for what amounts to a valid execution of invalid parameters?
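Something along these lines is what I have in mind (just a sketch of the idea, nothing from the actual application; run_simulation(), the status values and the particular exit-code numbers are made up):

#include <cstdio>

// Sketch only: give "the simulation went out of bounds" a different exit
// code from "the machine/driver failed", so the server could tell them
// apart and perhaps still credit the former.
enum SimStatus { SIM_OK, SIM_DIVERGED, SIM_DEVICE_FAILURE };

enum ExitCode {
    EXIT_OK            = 0,
    EXIT_MACHINE_ERROR = 1,  // driver/CUDA/hardware problem: retry on another host
    EXIT_PARAM_ERROR   = 2   // parameters out of bounds: valid execution, invalid result
};

// Hypothetical stand-in for the real MD loop.
static SimStatus run_simulation() { return SIM_DIVERGED; }

int main()
{
    switch (run_simulation()) {
    case SIM_OK:
        return EXIT_OK;
    case SIM_DIVERGED:
        fprintf(stderr, "parameters out of bounds\n");
        return EXIT_PARAM_ERROR;   // could still be granted credit
    default:
        fprintf(stderr, "device/driver failure\n");
        return EXIT_MACHINE_ERROR;
    }
}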
____________
Thanks - Steve
Toni:
Please, abort any WUs named GA8 and GA9. (Not GA10).
Concerning the failures, these are not expected and should not happen of course. We are investigating their cause (possibly these runs are exposing a bug), but at the moment it is not possible to "trap" them.
|
Just had four GA10R and one GA10F fail on three different machines in the last few hours.
Toni:
Immediately or after a while?
ftpd:
For me also, two GA10R WUs now cancelled after 35 min.
Windows XP Pro, BOINC 6.10.45, GTX295.
Yesterday also two WUs!
Success!!
____________
Ton (ftpd) Netherlands
|
Score so far with GA10R: nine failed and five successful. Quickest to fail was 4 min and longest 3h10m; average about 1 hr. BOINC 6.10.43, driver 197.13, WinXP64.
|
If you know the GA8 is bad can you please stop sending them out?
I got one very early this morning and wasted over an hour on it.
WU: f124s3-TONI_GA8F-0-1-RND6576_1
Returned: 30 Apr 2010 3:33:25 UTC
GPU Time: 4,560.34
CPU Time: 440.73
4 GA10 failures between late yesterday and today.
WU: f150r7-TONI_GA10R-0-1-RND0345_1
Returned: 30 Apr 2010 8:01:47 UTC
GPU Time: 1,697.97
CPU Time: 159.95
WU: f187r5-TONI_GA10R-0-1-RND5176_0
Returned: 30 Apr 2010 7:53:12 UTC
GPU Time: 12,832.70
CPU Time: 1,337.33
WU: f140r8-TONI_GA10R-0-1-RND4688_0
Returned: 30 Apr 2010 2:17:11 UTC
GPU Time: 4,305.75
CPU Time: 413.22
WU: f100r2-TONI_GA10R-0-1-RND6793_1
Returned: 29 Apr 2010 18:35:10 UTC
GPU Time: 796.77
CPU Time: 77.42
____________
Thanks - Steve
|
> Please, abort any WUs named GA8 and GA9. (Not GA10).
> Concerning the failures, these are not expected and should not happen of course. We are investigating their cause (possibly these runs are exposing a bug), but at the moment it is not possible to "trap" them.
I've had failures on the last 3 of 4 GA10's. I'm going to kill them as well, as they appear to have similar issues to GA8 and GA9.
|
Ongoing problems with "...-TONI_GA..." on all cards.
http://www.gpugrid.net/workunit.php?wuid=1412871
http://www.gpugrid.net/workunit.php?wuid=1404490
http://www.gpugrid.net/workunit.php?wuid=1413442
http://www.gpugrid.net/workunit.php?wuid=1413519
http://www.gpugrid.net/workunit.php?wuid=1413117
http://www.gpugrid.net/workunit.php?wuid=1413857
ftpd:
GA10R failed after 7 min.
____________
Ton (ftpd) Netherlands
skgiven:
f114r5-TONI_GA10R-0-1-RND7226
Too many errors (may have bug)
|
GA11R... two failed and two completed OK. Looks like there's still an issue!
Toni:
Sorry guys, GA is making me sweat too... However for now I am not aware of mistakes in GA11.
|
Add another exit code 1 (0x1) to the collection:
f196r4-TONI_GA11R-0-1-RND1898_0
Edit - And an 'ERROR: file tclutil.cpp line 31: get_Dvec() element 0 (b)':
f136r9-TONI_GA11R-0-1-RND4524_0
Toni:
Hi Richard,
could it be that you are getting errors on host 45218 since you upgraded from 6.10.48 to 6.10.51?
Toni:
Stopped all suspicious GAxx. There is a small number of GAUS1 out that should run fine, except they produce large uploads. A batch of GAUS2 should work well. Thanks for all of your reports.
|
> Hi Richard,
> could it be that you are getting errors on host 45218 since you upgraded from 6.10.48 to 6.10.51?
Interesting question. True, but I don't think you can claim "cause and effect".
I upgraded host 43404 to BOINC v6.10.51 at the same time. That host hasn't thrown any errors yet - but then, it hasn't been issued any GA11R tasks either.
The other difference between the hosts is that 43404 (factory overclocked 9800GTX+, no errors at the moment) is running NVidia drivers 190.38; host 45218 (stock-speed 9800GT, errors on GA11R) I upgraded from 197.13 to 197.45 in the same session as I installed v6.10.51. (Both 197 drivers have difficulty holding my 1600 x 1200 resolution when I switch the DVI KVM to another host.) I'm active in BOINC development testing, and I'm not aware of any changes in v6.10.51 that could cause application errors - if anything, the 197 drivers might be more of a problem, because (at least as reported by BOINC) they leave less GPU RAM available for apps to use.
Both hosts are currently running TONI_GAUS1 tasks (do they count as 'GA' for the purposes of this thread?): 43404 is at 15%, 45218 is at 65%. That'll be the first head-to-head comparison between the two hosts - results in a few hours.
|
Uh oh. "exit code 1 (0x1)" on f5r6-TONI_GAUS1-0-50-RND5224_1 - that's on 43404, the 9800 GTX+ that was OK overnight.
Toni:
Yes, the driver is more likely to be the culprit than the BOINC version. (And I would also be cautious about cause and effect.)
Your hosts make an interesting pair. So, if I understand:
43404, factory OC -> v 190.38, errors
45218 was 197.13, now 197.45, errors after upgrade (but did not crunch GA before it)
is that correct?
|
> Yes, the driver is more likely to be the culprit than the BOINC version. (And I would also be cautious about cause and effect.)
> Your hosts make an interesting pair. So, if I understand:
> 43404, factory OC -> v 190.38, errors
> 45218 was 197.13, now 197.45, errors after upgrade (but did not crunch GA before it)
> is that correct?
I wouldn't put it that way.
43404, factory oc -> v 190.38, 1 error with GAUS1, success with CAPBIND, HERG and pTEEI. Also succeeded with a GA10F and a GA8F, errored with another GA8F.
45218: fair comment, but it has no GA tasks shown from before the upgrade, ONLY (so far) GA11R tasks since the upgrade.
PS - I have a third 'control' host, 43362, with a non-overclocked 9800GT, BOINC v6.10.36, 190.38 driver. But it hasn't got any GA tasks yet....
Got to go out now, will review outcomes when I get back.
skgiven:
I'm also getting errors on my system with four GT240's.
Two things are interesting,
- they are all TONI_GA11R-0-1-RND
- The system picked up a few Beta 6.22 tasks. Although these all ran successfully, perhaps they interfered with the GA11R task in some way.
I say this because I saw the same pattern a few days ago; Betas OK but other WU's failed.
The failures (task ID, workunit ID, sent, returned, outcome, run time (s), CPU time (s), claimed credit, granted credit, application):
2266574 1428922 3 May 2010 13:02:26 UTC 3 May 2010 15:17:42 UTC Error while computing 4,354.23 518.53 3,429.72 --- ACEMD - GPU molecular dynamics v6.03 (cuda)
2265705 1429788 3 May 2010 21:25:49 UTC 4 May 2010 0:32:03 UTC Error while computing 827.67 108.02 3,429.72 --- ACEMD - GPU molecular dynamics v6.03 (cuda)
2271301 1429669 4 May 2010 1:32:37 UTC 4 May 2010 6:44:26 UTC Error while computing 18,111.57 2,181.78 3,429.72 --- ACEMD - GPU molecular dynamics v6.03 (cuda)
f184r5-TONI_GA11R-0-1-RND4020_2
f123r4-TONI_GA11R-0-1-RND7715_1
f193r9-TONI_GA11R-0-1-RND1232_0
(BOINC 6.10.45, driver 197.45; 3 of the cards on this system are OC'd but complete other tasks OK)
ftpd:
f199r1-TONI_GA11R-0-1-RND7585_1
Workunit 1429855
Created 4 May 2010 5:26:42 UTC
Sent 4 May 2010 5:40:45 UTC
Received 4 May 2010 14:56:44 UTC
Server state Over
Outcome Success
Client state None
Exit status 0 (0x0)
Computer ID 47762
Report deadline 9 May 2010 5:40:45 UTC
Run time 14756.328125
CPU time 1624.516
stderr out <core_client_version>6.10.50</core_client_version>
<![CDATA[
<stderr_txt>
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 295"
# Clock rate: 1.24 GHz
# Total amount of global memory: 939327488 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Device 1: "GeForce GTX 295"
# Clock rate: 1.24 GHz
# Total amount of global memory: 939196416 bytes
# Number of multiprocessors: 30
# Number of cores: 240
MDIO ERROR: cannot open file "restart.coor"
# Time per step: 29.504 ms
# Approximate elapsed time for entire WU: 14751.813 s
called boinc_finish
</stderr_txt>
]]>
This one is OK with driver 197.45, Windows XP Pro and BOINC 6.10.50.
____________
Ton (ftpd) Netherlands
ftpd:
f8r6-TONI_GAUS1-0-50-RND3701_2
Workunit 1431573
Created 4 May 2010 5:33:57 UTC
Sent 4 May 2010 5:40:45 UTC
Received 4 May 2010 15:00:57 UTC
Server state Over
Outcome Success
Client state None
Exit status 0 (0x0)
Computer ID 47762
Report deadline 9 May 2010 5:40:45 UTC
Run time 18665.984375
CPU time 2017.328
stderr out <core_client_version>6.10.50</core_client_version>
<![CDATA[
<stderr_txt>
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 295"
# Clock rate: 1.24 GHz
# Total amount of global memory: 939327488 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Device 1: "GeForce GTX 295"
# Clock rate: 1.24 GHz
# Total amount of global memory: 939196416 bytes
# Number of multiprocessors: 30
# Number of cores: 240
MDIO ERROR: cannot open file "restart.coor"
# Time per step: 28.732 ms
# Approximate elapsed time for entire WU: 18675.547 s
called boinc_finish
</stderr_txt>
]]>
This one is also OK, only a 60.13 MB upload, same driver and BOINC version.
____________
Ton (ftpd) Netherlands
|
Hi,
Is it possible to reactivate the log information about the GPU?
We want to know which card a WU is crunched on.
We have a 4x GTX 295 system and now, if there is an errored WU, we don't know which card is bad.
The old log had information like:
WU is started on GPU 4...
Computer:
http://www.gpugrid.net/show_host_detail.php?hostid=59988
OS: Ubuntu 9.10 64-bit
Drivers: 195.36.15
It's only the TONI_GA WUs.
Errored WUs in the last 7 days:
http://www.gpugrid.net/result.php?resultid=2268379
http://www.gpugrid.net/result.php?resultid=2268375
http://www.gpugrid.net/result.php?resultid=2265727
http://www.gpugrid.net/result.php?resultid=2265544
http://www.gpugrid.net/result.php?resultid=2265061
http://www.gpugrid.net/result.php?resultid=2264541
http://www.gpugrid.net/result.php?resultid=2241247
http://www.gpugrid.net/result.php?resultid=2241094
http://www.gpugrid.net/result.php?resultid=2240609
http://www.gpugrid.net/result.php?resultid=2240370
http://www.gpugrid.net/result.php?resultid=2240225
http://www.gpugrid.net/result.php?resultid=2228009
|
> Uh oh. "exit code 1 (0x1)" on f5r6-TONI_GAUS1-0-50-RND5224_1 - that's on 43404, the 9800 GTX+ that was OK overnight.
But it was followed immediately by SUCCESS with f92r9-TONI_GAUS1-0-50-RND9426_1. You're not wrong about that upload!
|
> Is it possible to reactivate the log information about the GPU? We want to know which card a WU is crunched on.
Definitely would like to have that information in the WU's again.
|
Latest conundrum:
f10r3-TONI_GAUS2-0-50-RND3526_0 FAILED on stock 9800GT
f79r2-TONI_GAUS2-0-50-RND5339_1 SUCCESS on overclocked 9800GTX+
Toni:
> Definitely would like to have that information in the WU's again.
It should be restored in a coming update, in fact.
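Roughly, the idea is just to report at startup which CUDA device the task was assigned to, something like this (only a sketch of the kind of stderr line people are asking for, not the actual application code):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int dev = 0;
    cudaGetDevice(&dev);                     // device selected for this process

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);

    // The line multi-GPU crunchers want to see in stderr again:
    fprintf(stderr, "# WU is started on GPU %d (%s)\n", dev, prop.name);
    return 0;
}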
|
> Definitely would like to have that information in the WU's again.
> It should be restored in some coming update in fact.
Thanks for the update.
skgiven:
Task f11r7-TONI_GAUS2-1-50-RND2945_4 failed after 21691 sec (6 h). The system was not being used at the time.
Again, there were betas about!
The previous task on that card was successful; so far the next task is OK (@90%).
skgiven:
Name f38r0-TONI_GAUS2-4-50-RND0732_0
Workunit 1455699
Created 9 May 2010 15:12:28 UTC
Sent 9 May 2010 15:51:00 UTC
Received 10 May 2010 13:14:06 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 3 (0x3)
Computer ID 51747
Report deadline 14 May 2010 15:51:00 UTC
Run time 37046.842462
CPU time 2079.649
stderr out
<core_client_version>6.10.43</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
# There is 1 device supporting CUDA
# Device 0: "GeForce GT 240"
# Clock rate: 1.62 GHz
# Total amount of global memory: 1073741824 bytes
# Number of multiprocessors: 12
# Number of cores: 96
MDIO ERROR: cannot open file "restart.coor"
# There is 1 device supporting CUDA
# Device 0: "GeForce GT 240"
# Clock rate: 1.62 GHz
# Total amount of global memory: 1073741824 bytes
# Number of multiprocessors: 12
# Number of cores: 96
SWAN : FATAL : Failure executing kernel sync [nb_k_nt_tp_pme] [999]
Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
</stderr_txt>
]]>
Validate state Invalid
Claimed credit 4458.63078703704
Granted credit 0
application version ACEMD - GPU molecular dynamics v6.03 (cuda)
---
Name f26r3-TONI_GAUS2-7-50-RND1787_1
Workunit 1456753
Created 10 May 2010 8:06:41 UTC
Sent 10 May 2010 8:17:06 UTC
Received 10 May 2010 17:58:29 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 1 (0x1)
Computer ID 55951
Report deadline 15 May 2010 8:17:06 UTC
Run time 29994.549999
CPU time 4620.22
stderr out
<core_client_version>6.10.50</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# There is 1 device supporting CUDA
# Device 0: "GeForce GT 240"
# Clock rate: 1.62 GHz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 12
# Number of cores: 96
MDIO ERROR: cannot open file "restart.coor"
# There is 1 device supporting CUDA
# Device 0: "GeForce GT 240"
# Clock rate: 1.62 GHz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 12
# Number of cores: 96
# There is 1 device supporting CUDA
# Device 0: "GeForce GT 240"
# Clock rate: 1.62 GHz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 12
# Number of cores: 96
# There is 1 device supporting CUDA
# Device 0: "GeForce GT 240"
# Clock rate: 1.62 GHz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 12
# Number of cores: 96
</stderr_txt>
]]>
Validate state Invalid
Claimed credit 4458.63078703704
Granted credit 0
application version ACEMD - GPU molecular dynamics v6.03 (cuda)