Toni (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist):
Dear Crunchers,
we've submitted approximately 1000 WUs of the "GA" (gramicidin A) type. They are a re-issue of a system which we have already run for a while. The purpose of the runs is methodological: they use a model system to improve an algorithm that can be transferred to other molecules.
The video is here - though I'm making new ones.
skgiven (Volunteer moderator, Volunteer tester):
Keep challenging your methodologies and you will strengthen the research. A good decision for the long-term future. I hope you can identify the subtle improvements made with the new applications and confirm your existing results.
Thanks,
Toni:
Thanks. By the way, all of them are acemd2, so they have the higher bang-for-the-buck ratio (i.e. credits/hour) of the new app.
|
Finished my first GA!
GTX295, Shaders at 1620, WinXP
i7-920 HT ON at 4.0 GHz, 8 CPU threads of WCG HCMD2 fully loaded.
GPU = 5 hours
CPU usage = 1230 seconds
Time per step = 23.927 ms
Points w/ bonus = 6945.175
compared to recent TONI series avg on the same machine
GPU = 4 hours 40 minutes
CPU usage = 555 seconds
Time per step = 25.651 ms
Points w/ bonus = 6123.06875
so the CPU time is up about 2.5x and the GPU time just a little ... looks good to me.
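For what it's worth, here is the same comparison worked out as credits per hour (just a back-of-the-envelope sketch using the figures quoted above):

#include <cstdio>

int main() {
    // Figures from the two runs quoted above (GA on acemd2 vs. a recent TONI unit)
    const double ga_hours    = 5.0;              // GPU = 5 hours
    const double ga_credit   = 6945.175;         // points w/ bonus
    const double toni_hours  = 4.0 + 40.0/60.0;  // GPU = 4 hours 40 minutes
    const double toni_credit = 6123.06875;       // points w/ bonus

    printf("GA   : %.0f credits/hour\n", ga_credit / ga_hours);     // ~1389
    printf("TONI : %.0f credits/hour\n", toni_credit / toni_hours); // ~1312
    return 0;
}

So the GA units do come out a bit ahead per hour, consistent with the acemd2 "bang for the buck" Toni mentioned.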
I'm looking forward to your new videos, I hope these results help you find a better answer :-)
____________
Thanks - Steve
|
I've been getting nothing but errors on the "TONI_GA" ACEMD - GPU molecular dynamics v6.03 (cuda) WU's over the past 36 hours.
"SWAN : FATAL : Failure executing kernel [mshake_position_kernel_1] [2] [66,1,1][64,1,1]
Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 194
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information."
Running 6.10.17, with drivers 196.34. No issues with other WU's, even other types of v6.03's. Running 2x 8800GT + 1 GTS 240. Restart didn't help.
I've halted new WU's for now, but I think I may try to change the preferences not to get these new types. What should I de-select to prevent only these TONI_GA v6.03 types from downloading?
Toni:
Hi K1atOdessa,
do you know if the failures are on the GTS or on the 8800s (or both)?
At present you can't filter one WU type, but you can filter out acemd2 altogether (Your account, gpugrid preferences). However, this batch of WUs should be over.
T
Toni:
Hi Steve, thanks for the report... timings look normal to me.
The only thing is, I wouldn't swear that the CPU time is reproducible even if you run two identical WUs (I may be wrong). What's important is that it is much less than the GPU time.
MJH (Project administrator, Project developer, Project scientist):
"SWAN : FATAL : Failure executing kernel [mshake_position_kernel_1] [2] [66,1,1][64,1,1]
That's an out-of-memory error, but it's coming from an improbable place, making me think there's some other problem. Is the problem persisting over a hard-reset of the machine? Are you running anything else that might be using the GPU's memory?
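For context, that message comes from the check the application makes right after launching a kernel; roughly like this (a simplified sketch only, not the actual swanlib source - the kernel body is a placeholder and the grid/block sizes are just the ones from your report). CUDA error code 2 corresponds to cudaErrorMemoryAllocation, which is why it reads as out-of-memory:

#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real mshake position update.
__global__ void mshake_position_kernel_1() { }

// Sketch of a SWAN-style checked launch: run the kernel, then inspect the
// CUDA status; cudaErrorMemoryAllocation (2) is what an out-of-memory
// failure would look like at this point.
static void swan_launch_checked(const char *name, dim3 grid, dim3 block)
{
    mshake_position_kernel_1<<<grid, block>>>();
    cudaError_t err = cudaGetLastError();      // launch/configuration errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();         // errors raised while executing
    if (err != cudaSuccess) {
        fprintf(stderr, "SWAN : FATAL : Failure executing kernel [%s] [%d] "
                        "[%d,%d,%d][%d,%d,%d]\n", name, (int)err,
                (int)grid.x, (int)grid.y, (int)grid.z,
                (int)block.x, (int)block.y, (int)block.z);
        assert(0);  // matches the "Assertion failed: 0" line in the report
    }
}

int main()
{
    // Grid and block sizes taken from the error message above.
    swan_launch_checked("mshake_position_kernel_1", dim3(66, 1, 1), dim3(64, 1, 1));
    return 0;
}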
MJH
|
> That's an out-of-memory error, but it's coming from an improbable place, making me think there's some other problem. Is the problem persisting over a hard-reset of the machine? Are you running anything else that might be using the GPU's memory?
> MJH
It did continue after a hard shutdown / reboot. I am not doing anything else (games, etc.) with this machine. I've changed my WU options to only get the "old" WU types, and completed a couple "Full-atom molecular dynamics v6.71 (cuda23)" WUs with no issue.
ACEMD: yes
ACEMD ver 2.0: no
ACEMD beta: no
The interesting thing is that I did get one v6.03 WU to complete this morning, which was one I grabbed before changing the WU options. I am going to let it run with just the "ACEMD" type right now, but maybe switch back to accepting "ACEMD ver 2.0" after a couple of days of no issues to see what happens. I'd like to get the benefit of the performance increase, since my cards are not high-end and take a while to complete.
Any other ideas of something I should try?
|
Very consistent timings ... three more on the same machine:
I have unhidden my computers (56900) so you can verify.
------------------------------------------------
GPU (s) ------ CPU (s) ------ Time/step (ms)
------------------------------------------------
17910 ------- 1238 ------ 23.892
17889 ------- 1233 ------ 23.864
17722 ------- 1206 ------ 23.64
17935 ------- 1230 ------ 23.93
------------------------------------------------
I am not complaining at all, they run very nicely for me.
The CPU time is still much less than when I run them
on my Vista PC, which has a GTX285 and an i7-920 and takes ~5000 CPU sec.
Keep up the good work!
____________
Thanks - Steve
skgiven:
K1atOdessa,
At some stage in the next few days it would probably be a good idea to make sure you have opted to receive work from the other application types (ACEMD ver 2.0 and the betas) in case the type you have selected (ACEMD) has no work.
This is also in your project settings.
Toni:
Hi Steve, noted. Thanks for the info.
|
Is this just coincidence that two machines got 'Incorrect function' errors, or is it something with the WU?
http://www.gpugrid.net/result.php?resultid=1902641
Based on the amount of time it processed on my machine, it should have been finished. I just upgraded it to Win7 and it has been returning WUs OK after the upgrade, including a GA.
____________
Thanks - Steve
Toni:
It's a coincidence. The "other" machine is not returning any results.
Did you abort it manually based on elapsed time? Something went wrong, but I don't think the WU is anything special.
|
I did not abort the WU ... the machine is at home and I am at work :-)
It did return another WU of a different type since then so it looks like the machine is OK. While the driver is the same one I was using for Vista and the OS itself should not make a difference from a stability standpoint, I will lower my OC when I get home today. I will also check my error and system event logs and post anything *special*.
(Off-topic: are you seeing any general trend of Win7 machines producing more errors?)
____________
Thanks - Steve
|
> That's an out-of-memory error, but it's coming from an improbable place, making me think there's some other problem. Is the problem persisting over a hard-reset of the machine? Are you running anything else that might be using the GPU's memory?
> MJH
> It did continue after a hard shutdown / reboot. I am not doing anything else (games, etc.) with this machine. I've changed my WU options to only get the "old" WU types, and completed a couple "Full-atom molecular dynamics v6.71 (cuda23)" WUs with no issue.
> ACEMD: yes
> ACEMD ver 2.0: no
> ACEMD beta: no
> The interesting thing is that I did get one v6.03 WU to complete this morning, which was one I grabbed before changing the WU options. I am going to let it run with just the "ACEMD" type right now, but maybe switch back to accepting "ACEMD ver 2.0" after a couple of days of no issues to see what happens. I'd like to get the benefit of the performance increase since my cards are not high-end and take a while to complete.
> Any other ideas of something I should try?
OK. So I restricted my machine to only the v6.71 WU's over the past 2 days. See tasks: 7/8 completed with no issues. I flipped the options back to allow v6.03 WU's and got instant failures again. Two ran longer than just a couple of seconds, but eventually failed.
So, any ideas why I am seeing this failure activity only on the newer v6.03 WU's? Are these v6.03 WU's doing something different that the v6.71 didn't? I've had to go back to restricting downloads to only the v6.71 WU's, because otherwise I'd quickly hit the max errored WU's and have to sit out for 24 hours, only to do it again.
Toni:
I've sent more GA runs... let's see if the newer application improves things.
And, by the way, there's a new movie here.
skgiven:
K1atOdessa, What cards does your system actually have?
I see GT8800 and GT240 ?!?
Have you swapped cards around and kept the same drivers?
If so, reinstall the driver to register the card, restart and then start crunching again.
skgiven:
K1atOdessa, Strike that last message.
I see you have two 8800GT's and one GT240 in the same system.
Restart your system, first!
Upgrade to the latest version of Boinc (6.10.36). Restart again.
See if that works.
If you installed any of these cards recently you could try to manually reinstall the drivers from Device Manager, individually for each card!
|
> Restart your system, first!
> Upgrade to the latest version of Boinc (6.10.36). Restart again.
> See if that works.
Thanks, I just saw in another thread that 6.10.36 is the current recommended version. I will upgrade to that later tonight to see what happens.
I've had all three cards in and working fine on the older WU's for some time, but if the upgrade to the newer BOINC version doesn't help, I'll try the manual reinstall of drivers for each card.
Thanks for the tips.
Toni:
I just submitted a new batch (GA7F). These should be shorter than usual.
Toni:
Another batch is out, GA7R. Also short.
|
> Another batch is out, GA7R. Also short.
Got 3 errors: Incorrect function. (0x1) - exit code 1 (0x1)
1368864 -> 3 computers already reporting as error
1368710
1368868
____________
|
> Another batch is out, GA7R. Also short.
Got first error on new batch - Incorrect function. (0x1) - exit code 1 (0x1)
1369719 - already errored out by one other cruncher
Currently another is crunching. We'll see.
|
Had seven failures in the last 12hrs, with an average run time of 1hr before failure. I'm now shooting any on sight :-)
|
I have errored 2 out of 5.
The first one was only 20 minutes but the other was 2 hours.
Not a very good return rate.
____________
Thanks - Steve
Toni:
@stoneageman and snowcrash - which of your computers errored out?
|
I aborted but everyone else failed
http://www.gpugrid.net/workunit.php?wuid=1369580
GTX295 (CompID = 56900) - failed, and so has everyone else
http://www.gpugrid.net/workunit.php?wuid=1369689
http://www.gpugrid.net/workunit.php?wuid=1369405
____________
Thanks - Steve
|
They all did!
|
failed for everyone who crunched this WU.
f111r1-TONI_GA7R-0-1-RND5547_5
____________
Thanks - Steve
|
failing for everyone ...
f103s2-TONI_GA7R-0-1-RND8503_2
____________
Thanks - Steve
|
failing for everyone
f109s10-TONI_GA8F-0-1-RND4323
Are the failures (not just the ones I have posted) legitimately due to the parameters of the experiment, or do you see them as problems with the machines they are running on? If they are part of the experiment parameters, have you considered adding error handling that distinguishes between a machine error and a parameter-out-of-bounds type of error? Perhaps even awarding points for what amounts to a valid execution of invalid parameters?
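Something along these lines is what I have in mind (just a sketch of the idea, nothing from the actual application; run_simulation(), the status values and the particular exit-code numbers are made up):

#include <cstdio>

// Sketch only: give "the simulation went out of bounds" a different exit
// code from "the machine/driver failed", so the server could tell them
// apart and perhaps still credit the former.
enum SimStatus { SIM_OK, SIM_DIVERGED, SIM_DEVICE_FAILURE };

enum ExitCode {
    EXIT_OK            = 0,
    EXIT_MACHINE_ERROR = 1,  // driver/CUDA/hardware problem: retry on another host
    EXIT_PARAM_ERROR   = 2   // parameters out of bounds: valid execution, invalid result
};

// Hypothetical stand-in for the real MD loop.
static SimStatus run_simulation() { return SIM_DIVERGED; }

int main()
{
    switch (run_simulation()) {
    case SIM_OK:
        return EXIT_OK;
    case SIM_DIVERGED:
        fprintf(stderr, "parameters out of bounds\n");
        return EXIT_PARAM_ERROR;   // could still be granted credit
    default:
        fprintf(stderr, "device/driver failure\n");
        return EXIT_MACHINE_ERROR;
    }
}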
____________
Thanks - Steve
Toni:
Please, abort any WUs named GA8 and GA9. (Not GA10).
Concerning the failures, these are not expected and should not happen of course. We are investigating their cause (possibly these runs are exposing a bug), but at the moment it is not possible to "trap" them.
|
Just had four GA10R and one GA10F fail on three different machines in the last few hours.
Toni:
Immediately or after a while?
ftpd:
For me also, two GA10R WUs now cancelled after 35 min.
Windows XP Pro, BOINC 6.10.45, GTX295.
Yesterday also two WUs!
Success!!
____________
Ton (ftpd) Netherlands
|
Score so far with GA10R: nine failed and five successful. Quickest to fail was 4 min and longest 3h10m; average about 1 hr. BOINC 6.10.43, driver 197.13, WinXP64.
|
If you know the GA8 is bad can you please stop sending them out?
I got one very early this morning and wasted over an hour on it.
WU: f124s3-TONI_GA8F-0-1-RND6576_1
Returned: 30 Apr 2010 3:33:25 UTC
GPU Time: 4,560.34
CPU Time: 440.73
4 GA10 failures between late yesterday and today.
WU: f150r7-TONI_GA10R-0-1-RND0345_1
Returned: 30 Apr 2010 8:01:47 UTC
GPU Time: 1,697.97
CPU Time: 159.95
WU: f187r5-TONI_GA10R-0-1-RND5176_0
Returned: 30 Apr 2010 7:53:12 UTC
GPU Time: 12,832.70
CPU Time: 1,337.33
WU: f140r8-TONI_GA10R-0-1-RND4688_0
Returned: 30 Apr 2010 2:17:11 UTC
GPU Time: 4,305.75
CPU Time: 413.22
WU: f100r2-TONI_GA10R-0-1-RND6793_1
Returned: 29 Apr 2010 18:35:10 UTC
GPU Time: 796.77
CPU Time: 77.42
____________
Thanks - Steve
|
> Please, abort any WUs named GA8 and GA9. (Not GA10).
> Concerning the failures, these are not expected and should not happen of course. We are investigating their cause (possibly these runs are exposing a bug), but at the moment it is not possible to "trap" them.
I've had failures on the last 3 of 4 GA10's. I'm going to kill them as well, as they appear to have similar issues to GA8 and GA9.
|
Ongoing problems with "...-TONI_GA..." on all cards.
http://www.gpugrid.net/workunit.php?wuid=1412871
http://www.gpugrid.net/workunit.php?wuid=1404490
http://www.gpugrid.net/workunit.php?wuid=1413442
http://www.gpugrid.net/workunit.php?wuid=1413519
http://www.gpugrid.net/workunit.php?wuid=1413117
http://www.gpugrid.net/workunit.php?wuid=1413857
ftpd:
GA10R failed after 7 min.
____________
Ton (ftpd) Netherlands
skgiven:
f114r5-TONI_GA10R-0-1-RND7226
Too many errors (may have bug)
|
GA11R... two failed and two completed OK. Looks like there's still an issue!
Toni:
Sorry guys, GA is making me sweat too... However for now I am not aware of mistakes in GA11.
|
Add another exit code 1 (0x1) to the collection:
f196r4-TONI_GA11R-0-1-RND1898_0
Edit - And an 'ERROR: file tclutil.cpp line 31: get_Dvec() element 0 (b)':
f136r9-TONI_GA11R-0-1-RND4524_0
Toni:
Hi Richard,
could it be that you are getting errors on host 45218 since you upgraded from 6.10.48 to 6.10.51?
Toni:
Stopped all suspicious GAxx. There is a small number of GAUS1 out that should run fine, except they produce large uploads. A batch of GAUS2 should work well. Thanks for all of your reports.
|
> Hi Richard,
> could it be that you are getting errors on host 45218 since you upgraded from 6.10.48 to 6.10.51?
Interesting question. True, but I don't think you can claim "cause and effect".
I upgraded host 43404 to BOINC v6.10.51 at the same time. That host hasn't thrown any errors yet - but then, it hasn't been issued any GA11R tasks either.
The other difference between the hosts is that 43404 (factory overclocked 9800GTX+, no errors at the moment) is running NVidia drivers 190.38; host 45218 (stock-speed 9800GT, errors on GA11R) I upgraded from 197.13 to 197.45 in the same session as I installed v6.10.51. (Both 197 drivers have difficulty holding my 1600 x 1200 resolution when I switch the DVI KVM to another host.) I'm active in BOINC development testing, and I'm not aware of any changes in v6.10.51 that could cause application errors - if anything, the 197 drivers might be more of a problem, because (at least as reported by BOINC) they leave less GPU RAM available for apps to use.
Both hosts are currently running TONI_GAUS1 tasks (do they count as 'GA' for the purposes of this thread?): 43404 is at 15%, 45218 is at 65%. That'll be the first head-to-head comparison between the two hosts - results in a few hours.
|
Uh oh. "exit code 1 (0x1)" on f5r6-TONI_GAUS1-0-50-RND5224_1 - that's on 43404, the 9800 GTX+ that was OK overnight.
Toni:
Yes, the driver is more likely to be the culprit than the BOINC version. (And I would also be cautious about cause and effect.)
Your hosts make an interesting pair. So, if I understand:
43404, factory OC -> v 190.38, errors
45218 was 197.13, now 197.45, errors after upgrade (but did not crunch GA before it)
is that correct?
|
> Yes, the driver is more likely to be the culprit than the BOINC version. (And I would also be cautious about cause and effect.)
> Your hosts make an interesting pair. So, if I understand:
> 43404, factory OC -> v 190.38, errors
> 45218 was 197.13, now 197.45, errors after upgrade (but did not crunch GA before it)
> is that correct?
I wouldn't put it that way.
43404, factory oc -> v 190.38, 1 error with GAUS1, success with CAPBIND, HERG and pTEEI. Also succeeded with a GA10F and a GA8F, errored with another GA8F.
45218: fair comment, but it has no GA tasks shown from before the upgrade, ONLY (so far) GA11R tasks since the upgrade.
PS - I have a third 'control' host, 43362, with a non-overclocked 9800GT, BOINC v6.10.36, 190.38 driver. But it hasn't got any GA tasks yet....
Got to go out now, will review outcomes when I get back.
skgiven:
I'm also getting errors on my system with four GT240's.
Two things are interesting,
- they are all TONI_GA11R-0-1-RND
- The system picked up a few Beta 6.22 tasks. Although these all ran successfully, perhaps they interfered with the GA11R task in some way.
I say this because I saw the same pattern a few days ago; Betas OK but other WU's failed.
The failures (task ID, workunit ID, sent, returned, outcome, run time (s), CPU time (s), claimed credit, granted credit, application):
2266574 1428922 3 May 2010 13:02:26 UTC 3 May 2010 15:17:42 UTC Error while computing 4,354.23 518.53 3,429.72 --- ACEMD - GPU molecular dynamics v6.03 (cuda)
2265705 1429788 3 May 2010 21:25:49 UTC 4 May 2010 0:32:03 UTC Error while computing 827.67 108.02 3,429.72 --- ACEMD - GPU molecular dynamics v6.03 (cuda)
2271301 1429669 4 May 2010 1:32:37 UTC 4 May 2010 6:44:26 UTC Error while computing 18,111.57 2,181.78 3,429.72 --- ACEMD - GPU molecular dynamics v6.03 (cuda)
f184r5-TONI_GA11R-0-1-RND4020_2
f123r4-TONI_GA11R-0-1-RND7715_1
f193r9-TONI_GA11R-0-1-RND1232_0
(BOINC 6.10.45, driver 197.45; 3 of the cards on this system are OC'd but complete other tasks OK)
ftpd:
f199r1-TONI_GA11R-0-1-RND7585_1
Workunit 1429855
Created 4 May 2010 5:26:42 UTC
Sent 4 May 2010 5:40:45 UTC
Received 4 May 2010 14:56:44 UTC
Server state Over
Outcome Success
Client state None
Exit status 0 (0x0)
Computer ID 47762
Report deadline 9 May 2010 5:40:45 UTC
Run time 14756.328125
CPU time 1624.516
stderr out <core_client_version>6.10.50</core_client_version>
<![CDATA[
<stderr_txt>
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 295"
# Clock rate: 1.24 GHz
# Total amount of global memory: 939327488 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Device 1: "GeForce GTX 295"
# Clock rate: 1.24 GHz
# Total amount of global memory: 939196416 bytes
# Number of multiprocessors: 30
# Number of cores: 240
MDIO ERROR: cannot open file "restart.coor"
# Time per step: 29.504 ms
# Approximate elapsed time for entire WU: 14751.813 s
called boinc_finish
</stderr_txt>
]]>
This one is OK with driver 197.45, Windows XP Pro and BOINC 6.10.50.
____________
Ton (ftpd) Netherlands
ftpd:
f8r6-TONI_GAUS1-0-50-RND3701_2
Workunit 1431573
Created 4 May 2010 5:33:57 UTC
Sent 4 May 2010 5:40:45 UTC
Received 4 May 2010 15:00:57 UTC
Server state Over
Outcome Success
Client state None
Exit status 0 (0x0)
Computer ID 47762
Report deadline 9 May 2010 5:40:45 UTC
Run time 18665.984375
CPU time 2017.328
stderr out <core_client_version>6.10.50</core_client_version>
<![CDATA[
<stderr_txt>
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 295"
# Clock rate: 1.24 GHz
# Total amount of global memory: 939327488 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Device 1: "GeForce GTX 295"
# Clock rate: 1.24 GHz
# Total amount of global memory: 939196416 bytes
# Number of multiprocessors: 30
# Number of cores: 240
MDIO ERROR: cannot open file "restart.coor"
# Time per step: 28.732 ms
# Approximate elapsed time for entire WU: 18675.547 s
called boinc_finish
</stderr_txt>
]]>
This one is also OK, only a 60.13 MB upload, same driver and BOINC version.
____________
Ton (ftpd) Netherlands
|
Hi,
Is it possible to reactivate the log information about the GPU?
We want to know which card a WU is crunched on.
We have a 4x GTX 295 system and now, if there is an errored WU, we don't know which card is bad.
The old log had information like:
WU is started on GPU 4...
Computer:
http://www.gpugrid.net/show_host_detail.php?hostid=59988
OS: Ubuntu 9.10 64-bit
Drivers: 195.36.15
It's only the TONI_GA WUs.
Errored WUs in the last 7 days:
http://www.gpugrid.net/result.php?resultid=2268379
http://www.gpugrid.net/result.php?resultid=2268375
http://www.gpugrid.net/result.php?resultid=2265727
http://www.gpugrid.net/result.php?resultid=2265544
http://www.gpugrid.net/result.php?resultid=2265061
http://www.gpugrid.net/result.php?resultid=2264541
http://www.gpugrid.net/result.php?resultid=2241247
http://www.gpugrid.net/result.php?resultid=2241094
http://www.gpugrid.net/result.php?resultid=2240609
http://www.gpugrid.net/result.php?resultid=2240370
http://www.gpugrid.net/result.php?resultid=2240225
http://www.gpugrid.net/result.php?resultid=2228009
|
> Uh oh. "exit code 1 (0x1)" on f5r6-TONI_GAUS1-0-50-RND5224_1 - that's on 43404, the 9800 GTX+ that was OK overnight.
But it was followed immediately by SUCCESS with f92r9-TONI_GAUS1-0-50-RND9426_1. You're not wrong about that upload!
|
> Is it possible to reactivate the log information about the GPU? We want to know which card a WU is crunched on.
Definitely would like to have that information in the WU's again.
|
Latest conundrum:
f10r3-TONI_GAUS2-0-50-RND3526_0 FAILED on stock 9800GT
f79r2-TONI_GAUS2-0-50-RND5339_1 SUCCESS on overclocked 9800GTX+
Toni:
> Definitely would like to have that information in the WU's again.
It should be restored in a coming update, in fact.
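Roughly, the idea is just to report at startup which CUDA device the task was assigned to, something like this (only a sketch of the kind of stderr line people are asking for, not the actual application code):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int dev = 0;
    cudaGetDevice(&dev);                     // device selected for this process

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);

    // The line multi-GPU crunchers want to see in stderr again:
    fprintf(stderr, "# WU is started on GPU %d (%s)\n", dev, prop.name);
    return 0;
}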
|
> Definitely would like to have that information in the WU's again.
> It should be restored in some coming update in fact.
Thanks for the update.
skgiven:
Task f11r7-TONI_GAUS2-1-50-RND2945_4 failed after 21691 sec (6 h). The system was not being used at the time.
Again, there were betas about!
The previous task on that card was successful; so far the next task is OK (@90%).
skgiven:
Name f38r0-TONI_GAUS2-4-50-RND0732_0
Workunit 1455699
Created 9 May 2010 15:12:28 UTC
Sent 9 May 2010 15:51:00 UTC
Received 10 May 2010 13:14:06 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 3 (0x3)
Computer ID 51747
Report deadline 14 May 2010 15:51:00 UTC
Run time 37046.842462
CPU time 2079.649
stderr out
<core_client_version>6.10.43</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
# There is 1 device supporting CUDA
# Device 0: "GeForce GT 240"
# Clock rate: 1.62 GHz
# Total amount of global memory: 1073741824 bytes
# Number of multiprocessors: 12
# Number of cores: 96
MDIO ERROR: cannot open file "restart.coor"
# There is 1 device supporting CUDA
# Device 0: "GeForce GT 240"
# Clock rate: 1.62 GHz
# Total amount of global memory: 1073741824 bytes
# Number of multiprocessors: 12
# Number of cores: 96
SWAN : FATAL : Failure executing kernel sync [nb_k_nt_tp_pme] [999]
Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
</stderr_txt>
]]>
Validate state Invalid
Claimed credit 4458.63078703704
Granted credit 0
application version ACEMD - GPU molecular dynamics v6.03 (cuda)
---
Name f26r3-TONI_GAUS2-7-50-RND1787_1
Workunit 1456753
Created 10 May 2010 8:06:41 UTC
Sent 10 May 2010 8:17:06 UTC
Received 10 May 2010 17:58:29 UTC
Server state Over
Outcome Client error
Client state Compute error
Exit status 1 (0x1)
Computer ID 55951
Report deadline 15 May 2010 8:17:06 UTC
Run time 29994.549999
CPU time 4620.22
stderr out
<core_client_version>6.10.50</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# There is 1 device supporting CUDA
# Device 0: "GeForce GT 240"
# Clock rate: 1.62 GHz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 12
# Number of cores: 96
MDIO ERROR: cannot open file "restart.coor"
# There is 1 device supporting CUDA
# Device 0: "GeForce GT 240"
# Clock rate: 1.62 GHz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 12
# Number of cores: 96
# There is 1 device supporting CUDA
# Device 0: "GeForce GT 240"
# Clock rate: 1.62 GHz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 12
# Number of cores: 96
# There is 1 device supporting CUDA
# Device 0: "GeForce GT 240"
# Clock rate: 1.62 GHz
# Total amount of global memory: 536870912 bytes
# Number of multiprocessors: 12
# Number of cores: 96
</stderr_txt>
]]>
Validate state Invalid
Claimed credit 4458.63078703704
Granted credit 0
application version ACEMD - GPU molecular dynamics v6.03 (cuda)