Author |
Message |
ignasi Send message
Joined: 10 Apr 08 Posts: 254 Credit: 16,836,000 RAC: 0 Level
Scientific publications
|
Please use this thread to post any problem regarding all workunits tagged as *_pYEEI_*.
Thanks,
ignasi |
|
|
ignasi Send message
Joined: 10 Apr 08 Posts: 254 Credit: 16,836,000 RAC: 0 Level
Scientific publications
|
I am already aware of very recent reports of *_reverse1_pYEEI_2112_* failing.
Let any new result die. They have been cancelled though. |
|
|
|
They have been cancelled though.
Thanks for doing that. It isn't much of a problem when tasks crash after five seconds - it's much more annoying when they run for several hours first and then waste all that effort - but it is a bit wasteful of bandwidth when tasks take longer to download than they do to crunch! |
|
|
|
They have been cancelled though.
I don't think you zapped them all. I'll let my copy of WU 1037889 get its five seconds of fame overnight, but I will be most surprised if it's still alive in the morning. |
|
|
Edboard Send message
Joined: 24 Sep 08 Posts: 72 Credit: 12,410,275 RAC: 0 Level
Scientific publications
|
I have received two more of them today (they errored as expected after 2-3 seconds). You can see them here and here. |
|
|
|
I just got more and more, and now GPUGrid won't send me any new WUs ... I'll be back after a few days of Milkyway ... hopefully you will either have really deleted them, or they will have failed so many times by then that they will automagically be taken out of the pool :-(
These are the 2312 series ... it looks like the same error got put into the *replacement* batch?
____________
Thanks - Steve |
|
|
vaio Send message
Joined: 6 Nov 09 Posts: 20 Credit: 10,781,505 RAC: 0 Level
Scientific publications
|
I came home today to a dead wu and a corrupted desktop.
This on a rock steady rig with a GTS 250 at stock....fine for >2 months til now.
Can't get work.....have it folding at the moment.
Never noted what wu it was.....had to reboot to check out problems.
Corrupted desktop went with the reboot.
____________
Join Here
Team Forums |
|
|
vaio Send message
Joined: 6 Nov 09 Posts: 20 Credit: 10,781,505 RAC: 0 Level
Scientific publications
|
Just had a look.....today's electricity bill contribution:
23 Dec 2009 13:53:20 UTC 23 Dec 2009 16:24:17 UTC Error while computing 7.32 6.92 0.04 --- Full-atom molecular dynamics v6.71 (cuda)
1661886 1042549 23 Dec 2009 13:49:32 UTC 23 Dec 2009 13:53:20 UTC Error while computing 5.38 4.84 4,531.91 --- Full-atom molecular dynamics v6.71 (cuda)
1661884 1042547 23 Dec 2009 13:41:25 UTC 23 Dec 2009 13:45:50 UTC Error while computing 5.45 4.83 4,531.91 --- Full-atom molecular dynamics v6.71 (cuda)
1661877 1042543 23 Dec 2009 13:24:37 UTC 23 Dec 2009 13:28:34 UTC Error while computing 5.35 4.94 4,531.91 --- Full-atom molecular dynamics v6.71 (cuda)
1661868 1042537 23 Dec 2009 13:32:52 UTC 23 Dec 2009 13:36:48 UTC Error while computing 6.35 5.89 4,022.81 --- Full-atom molecular dynamics v6.71 (cuda)
1661862 1042532 23 Dec 2009 13:28:34 UTC 23 Dec 2009 13:32:52 UTC Error while computing 3.38 2.94 0.02 --- Full-atom molecular dynamics v6.71 (cuda)
1661803 1042484 23 Dec 2009 13:36:48 UTC 23 Dec 2009 13:41:25 UTC Error while computing 6.31 5.88 4,022.81 --- Full-atom molecular dynamics v6.71 (cuda)
1661790 1042473 23 Dec 2009 13:20:20 UTC 23 Dec 2009 13:24:37 UTC Error while computing 5.52 4.86 4,531.91 --- Full-atom molecular dynamics v6.71 (cuda)
1661780 1042463 23 Dec 2009 13:45:50 UTC 23 Dec 2009 13:49:32 UTC Error while computing 5.34 4.83 4,531.91 --- Full-atom molecular dynamics v6.71 (cuda)
1661734 1042434 23 Dec 2009 13:15:30 UTC 23 Dec 2009 13:20:20 UTC Error while computing 7.60 7.00 0.04 --- Full-atom molecular dynamics v6.71 (cuda)
1661728 1042430 23 Dec 2009 13:11:24 UTC 23 Dec 2009 13:15:30 UTC Error while computing 7.44 6.86 0.04 --- Full-atom molecular dynamics v6.71 (cuda)
1661471 1042269 23 Dec 2009 11:35:27 UTC 23 Dec 2009 13:11:24 UTC Error while computing 5,546.53 768.25 3,539.96 --- Full-atom molecular dynamics v6.71 (cuda)
1660795 1041835 23 Dec 2009 7:20:21 UTC 28 Dec 2009 7:20:21 UTC In progress --- --- --- --- Full-atom molecular dynamics v6.71 (cuda)
1660714 1030551 23 Dec 2009 6:01:18 UTC 23 Dec 2009 20:10:45 UTC Completed and validated 45,871.60 5,901.35 3,539.96 4,778.94 Full-atom molecular dynamics v6.71 (cuda)
1659183 1040817 22 Dec 2009 19:52:50 UTC 23 Dec 2009 11:41:26 UTC Completed and validated 56,499.53 4,107.08 4,531.91 6,118.08 Full-atom molecular dynamics v6.71 (cuda)
1658422 1035863 22 Dec 2009 15:30:32 UTC 23 Dec 2009 7:20:21 UTC Error while computing 4,779.29 644.83 3,539.96 --- Full-atom molecular dynamics v6.71 (cuda)
1655707 1038933 21 Dec 2009 23:47:15 UTC 23 Dec 2009 6:01:18 UTC Completed and validated 52,962.71 2,202.70 4,428.01 5,977.82 Full-atom molecular dynamics v6.71 (cuda)
1654210 1038221 22 Dec 2009 4:53:16 UTC 22 Dec 2009 19:55:27 UTC Completed and validated 53,802.30 2,652.13 4,428.01 5,977.82 Full-atom molecular dynamics v6.71 (cuda)
1653811 1037642 21 Dec 2009 13:08:13 UTC 21 Dec 2009 13:09:44 UTC Error while computing 4.61 0.16 0.00 --- Full-atom molecular dynamics v6.71 (cuda)
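For anyone wanting to sift a dump like the one above, here is a minimal, hypothetical triage sketch (Python, not a GPUGRID tool). It only assumes the row format shown in this post, and it separates the instant failures (a few seconds, typically a bad input file) from tasks that died after real GPU time (more likely the card, driver or application):

STATUSES = ("Error while computing", "Completed and validated", "In progress")

def triage(rows):
    quick_fail, long_fail, ok = [], [], []
    for row in rows:
        for status in STATUSES:
            if status in row:
                head, _, tail = row.partition(status)
                break
        else:
            continue                      # not a result row
        tokens = tail.split()
        runtime = None
        if tokens and tokens[0] != "---":
            runtime = float(tokens[0].replace(",", ""))   # run time in seconds
        if status == "Completed and validated":
            ok.append((head.strip(), runtime))
        elif status == "Error while computing":
            bucket = quick_fail if (runtime is not None and runtime < 60) else long_fail
            bucket.append((head.strip(), runtime))
    return quick_fail, long_fail, ok

# Usage: paste the rows into results.txt, then
#   quick, slow, good = triage(open("results.txt").read().splitlines())
# Errors within seconds point at a corrupt input; errors after thousands of
# seconds point at the card, driver or application instead.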
____________
Join Here
Team Forums |
|
|
GDF Volunteer moderator Project administrator Project developer Project tester Volunteer developer Volunteer tester Project scientist Send message
Joined: 14 Mar 07 Posts: 1957 Credit: 629,356 RAC: 0 Level
Scientific publications
|
Vaio,
what's your host?
gdf |
|
|
|
Vaio,
what's your host?
gdf
It's host 55606
Although he's had a couple of pYEEIs, I reckon it was the IBUCH_TRYP at 13:11 which did the damage - corrupted the internal state of the card so badly that all subsequent tasks failed (even ones which are normally OK on G92). That would account for the corrupted desktop as well. |
|
|
|
I had one fail today:
184-IBUCH_reverse1fix_pYEEI_2312-0-40-RND4766_4
I guess the "fix" in the name isn't quite there yet. :)
This WU has failed on 5 hosts so far.
WU: 1043207
stderr:
<core_client_version>6.10.18</core_client_version>
<![CDATA[
<message>
- exit code 98 (0x62)
</message>
<stderr_txt>
# Using CUDA device 0
# There is 1 device supporting CUDA
# Device 0: "GeForce GTX 280"
# Clock rate: 1.35 GHz
# Total amount of global memory: 1073741824 bytes
# Number of multiprocessors: 30
# Number of cores: 240
ERROR: mdsim.cu, line 101: Failed to parse input file
called boinc_finish
</stderr_txt>
]]>
|
|
|
ignasi Send message
Joined: 10 Apr 08 Posts: 254 Credit: 16,836,000 RAC: 0 Level
Scientific publications
|
Certainly not.
My apologies,
ignasi |
|
|
canardo Send message
Joined: 11 Feb 09 Posts: 4 Credit: 8,675,472 RAC: 0 Level
Scientific publications
|
Hello Ignasi,
Just have a look here, comp id: 26091.
Also, TONI's fail, unfortunately at the end of the run.
Ciao,
Jaak
____________
|
|
|
|
In general, do you test new configurations with the people who have opted in to "Run test applications"?
____________
Thanks - Steve |
|
|
vaio Send message
Joined: 6 Nov 09 Posts: 20 Credit: 10,781,505 RAC: 0 Level
Scientific publications
|
I pulled down some new work today and it seems to be behaving so far.
Also, the downloading issue seems to have corrected itself....whatever weighting I gave it, it would never give me more than one work unit at a time.
This morning it pulled two.
____________
Join Here
Team Forums |
|
|
ignasi Send message
Joined: 10 Apr 08 Posts: 254 Credit: 16,836,000 RAC: 0 Level
Scientific publications
|
Hello Ignasi,
Just have a look here, comp id: 26091.
Also, TONI's fail, unfortunately at the end of the run.
Ciao,
Jaak
Please report this *HERG* fail on its own thread as well; it will be helpful.
|
|
|
|
Another quick fail batch ...
name 333-IBUCH_reverse_pYEEI_2912-0-40-RND8124
application Full-atom molecular dynamics
created 29 Dec 2009 13:42:42 UTC
minimum quorum 1
initial replication 1
max # of error/total/success tasks 5, 10, 6
Task ID   Computer   Sent   Time reported or deadline   Status   Run time (sec)   CPU time (sec)   Claimed credit   Granted credit   Application
1683801 26061 29 Dec 2009 20:20:30 UTC 29 Dec 2009 20:21:53 UTC Error while computing 2.22 0.06 0.00 --- Full-atom molecular dynamics v6.71 (cuda23)
1685230 31780 29 Dec 2009 21:06:43 UTC 29 Dec 2009 21:08:06 UTC Error while computing 2.12 0.09 0.00 --- Full-atom molecular dynamics v6.71 (cuda23)
1685404 54778 29 Dec 2009 21:30:44 UTC 3 Jan 2010 21:30:44 UTC In progress --- --- --- --- Full-atom molecular dynamics v6.71 (cuda23)
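For what it's worth, the limits shown above ("max # of error/total/success tasks 5, 10, 6") are what eventually takes a bad batch out of circulation. A minimal sketch of how that plays out, based on my understanding of the generic BOINC server behaviour rather than GPUGRID's actual code:

def workunit_state(n_error, n_success, n_total,
                   max_error=5, max_success=6, max_total=10):
    # Mirrors the per-workunit limits listed in the header above.
    if n_error >= max_error:
        return "errored out: too many failed results, no more replicas sent"
    if n_success >= max_success:
        return "closed: enough successful results"
    if n_total >= max_total:
        return "closed: too many results issued in total"
    return "still being replicated to new hosts"

# The batch above so far: two quick errors and one copy in progress.
print(workunit_state(n_error=2, n_success=0, n_total=3))
# Three more failures and the server finally gives up on the workunit, which
# is why these quick-failers disappear "automagically" after a while.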
____________
Thanks - Steve |
|
|
|
Starting a little while ago, I've been receiving a bunch of these WUs -- and they're all getting compute errors within a few seconds of starting.
GPU is EVGA GTX280 (factory OC), CPU is C2Q Q6600 2.4 GHZ (no OC). Vista 32 bit.
You can follow my name link to get to the details on the computer and the WUs.
____________
Want to find one of the largest known primes? Try PrimeGrid. Or help cure disease at WCG.
|
|
|
|
So now it looks like I have errored too many times and the server will not send me any work. See ya later ... I'm going back to Milkyway for another couple of days :-(
____________
Thanks - Steve |
|
|
|
A couple of my GPUs have just choked on seven of these reverse_pYEEI WUs and are now idle. It begs the question of why so many were sent out in the first place when they were suspect!
UPDATE: another four have taken out one more GPU :{ |
|
|
|
I *just* managed to squeak by. I had six of these error out, dropping my daily quota to 9. The next WU was the 9th of the day; fortunately, it's a different series and is crunching normally. If it had been another error, I think this GPU would have been done for the day. (Unless it's still counting this as WUs per CPU core, in which case I had a lot of headroom.)
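The near-miss above is the daily-quota mechanism at work. A toy model of it (the starting quota of 15 is only inferred from "six errors dropped it to 9", and the adjustment rules - minus one per error, doubling on success - are assumptions about the stock BOINC scheduler, not GPUGRID specifics):

class HostQuota:
    def __init__(self, max_per_day=15):
        self.max_per_day = max_per_day   # assumed project ceiling
        self.quota = max_per_day
        self.sent_today = 0

    def can_send(self):
        return self.sent_today < self.quota

    def on_sent(self):
        self.sent_today += 1

    def on_error(self):
        self.quota = max(1, self.quota - 1)                  # assumption: -1 per error

    def on_success(self):
        self.quota = min(self.max_per_day, self.quota * 2)   # assumption: doubles, capped

host = HostQuota()
for _ in range(6):            # six quick pYEEI failures in a row
    host.on_sent()
    host.on_error()
print(host.quota, host.can_send())   # prints 9 True: quota down to 9, work can still be sent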
____________
Want to find one of the largest known primes? Try PrimeGrid. Or help cure disease at WCG.
|
|
|
|
UPDATE: Four more have trashed another GPU.
Aborted a boatload of these critters, yet still they come. It's like they are breeding! |
|
|
Beyond Send message
Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 Level
Scientific publications
|
Can you PLEASE PLEASE PLEASE make sure WU batches are OK before sending them out. |
|
|
|
GTX295 - Nine *_pYEEI_* WUs crashed in a row.
http://www.gpugrid.net/results.php?hostid=53295
"MDIO ERROR: syntax error in file "structure.psf", line number 1: failed to find PSF keyword
ERROR: mdioload.cu, line 172: Unable to read topology file"
No new work sent for 7.5 hours (recently got new work).
Should I abort *_pYEEI_* on other GPUs (cache)?
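That stderr points at a corrupt or truncated topology file in the download, not at the cards: a valid PSF has the PSF keyword on its first line, which is exactly what the parser says is missing. If you want to check a cached task before burning GPU time, something along these lines works (a hypothetical helper; the BOINC data path shown is the Windows Vista/7 default and the slot layout will differ on other setups):

import glob, os

def looks_like_psf(path):
    # A well-formed PSF topology has "PSF" on its first non-blank line.
    with open(path, "r", errors="replace") as f:
        for line in f:
            if line.strip():
                return "PSF" in line.split()
    return False

boinc_data = r"C:\ProgramData\BOINC"      # assumption: default BOINC data directory
for psf in glob.glob(os.path.join(boinc_data, "slots", "*", "structure.psf")):
    print(psf, "OK" if looks_like_psf(psf) else "corrupt or truncated")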
|
|
|
hzels Send message
Joined: 4 Sep 08 Posts: 7 Credit: 52,864,406 RAC: 0 Level
Scientific publications
|
last WUs all going down the drain:
<stderr_txt>
# Using CUDA device 0
# There are 2 devices supporting CUDA
# Device 0: "GeForce GTX 280"
# Clock rate: 1.55 GHz
# Total amount of global memory: 1073741824 bytes
# Number of multiprocessors: 30
# Number of cores: 240
# Device 1: "GeForce GTX 260"
# Clock rate: 1.51 GHz
# Total amount of global memory: 939524096 bytes
# Number of multiprocessors: 27
# Number of cores: 216
MDIO ERROR: syntax error in file "structure.psf", line number 1: failed to find PSF keyword
ERROR: mdioload.cu, line 172: Unable to read topology file
called boinc_finish
</stderr_txt>
I'm off to Collatz for a few days. |
|
|
|
I just had another one of these fail:
1057058
____________
Want to find one of the largest known primes? Try PrimeGrid. Or help cure disease at WCG.
|
|
|
skgiven Volunteer moderator Volunteer tester
Send message
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level
Scientific publications
|
Had 2 fail in a few seconds on one system, 3 on another.
184-IBUCH_reverse_pYEEI_2912-0-40-RND6748 http://www.gpugrid.net/workunit.php?wuid=1056751
128-IBUCH_reverse_pYEEI_2912-0-40-RND3643 http://www.gpugrid.net/workunit.php?wuid=1056695
Also, could not get any tasks this morning between about 1am and noon, on the same system, but running a task now.
http://www.gpugrid.net/workunit.php?wuid=1056826
http://www.gpugrid.net/workunit.php?wuid=1056758 |
|
|
Beyond Send message
Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 Level
Scientific publications
|
Please use this thread to post any problem regarding all workunits tagged as *_pYEEI_*.
Thanks,
ignasi
As you can see (I hope), massive problems have been reported and many systems have been locked out of receiving new WUs (and are sitting idle) due to these faulty units. Don't you think it's about time to pull the rest? It looks like they're just being allowed to run until they fail so many times that the server cancels them. That's not showing any concern at all for the people who are doing your work.
I know they're not being canceled because I've received 22 of them so far today. Every one of those 22 has failed on several machines before being sent to me. That's just wrong.
|
|
|
skgiven Volunteer moderator Volunteer tester
Send message
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level
Scientific publications
|
In a way the _pYEEI_ tasks are SPAM!
I had to take extreme action yesterday - shut down my system for a couple of hours ;) |
|
|
ignasi Send message
Joined: 10 Apr 08 Posts: 254 Credit: 16,836,000 RAC: 0 Level
Scientific publications
|
My most sincere apologies to everybody for all this.
I wanted to fill up the queue before going offline for some days but obviously it didn't work as expected.
The balance between keeping the crunchers' support, not having an empty queue, and having a private life is always very sensitive to human error.
Sincerely,
ignasi |
|
|
|
My most sincere apologies to everybody for all this.
No worries here; stuff happens. It's the nature of "free" distributed computing that there are going to be minor problems along the way.
Happy new year!
____________
Want to find one of the largest known primes? Try PrimeGrid. Or help cure disease at WCG.
|
|
|
Beyond Send message
Joined: 23 Nov 08 Posts: 1112 Credit: 6,162,416,256 RAC: 0 Level
Scientific publications
|
My most sincere apologies to everybody for all this.
Happy new year!
Thanks for letting us know what happened. Communication is appreciated.
Happy new year everyone!
|
|
|
|
The balance between keeping the crunchers' support, not having an empty queue, and having a private life is always very sensitive to human error.
Sincerely,
ignasi
"A PRIVATE life".......... well ok. However, we expect you to sleep with the server :)
|
|
|
ignasi Send message
Joined: 10 Apr 08 Posts: 254 Credit: 16,836,000 RAC: 0 Level
Scientific publications
|
"A PRIVATE life".......... well ok. However, we expect you to sleep with the server :)
I am afraid girlfriends are too jealous... |
|
|
|
"A PRIVATE life".......... well ok. However, we expect you to sleep with the server :)
I am afraid girlfriends are too jealous...
You have more than ONE!!! No wonder he can't get the WUs straight; he is sleep-deprived :-)
Keep up the good work, we'll crunch the best we can!
____________
Thanks - Steve |
|
|
|
Three compute errors (1, 2, 3) on this host. |
|
|
AndyMM Send message
Joined: 27 Jan 09 Posts: 4 Credit: 582,988,184 RAC: 0 Level
Scientific publications
|
Sorry, but I'm going to say goodbye. The last 3 days have been non-stop computation errors, made even worse by the fact that the cards then just sat there doing nothing.
Switching all my GPUs to F@H. I do not accept having my money wasted on units that process for 17 hours and then show a computing error.
|
|
|
|
Hi,
It's because you have been accepting beta work from us. If reliability of work is of paramount importance to you, don't track the beta application.
Matt |
|
|
skgiven Volunteer moderator Volunteer tester
Send message
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level
Scientific publications
|
Switching all my GPUs to F@H.
Your cards do a lot more work here than they can at F@H.
If the problem is Beta related, you just need to turn the Betas off, as MJH said. It might also be that you need to restart the system: sometimes one failure can cause continuous failures (a runaway) until you restart. I say this because the problem was limited to your GTX 295 and not your GTX 275.
Many of your tasks seem to have been aborted by the user - some immediately, and one after running for a long time:
286-IBUCH_esrever_pYEEI_0301-10-40-RND7408 - Aborted by user after 43,189.28 seconds.
Turn off Betas, restart and see how you get on. |
|
|
AndyMM Send message
Joined: 27 Jan 09 Posts: 4 Credit: 582,988,184 RAC: 0 Level
Scientific publications
|
Thanks for the comments. I looked in my GPUGrid preferences and did not notice anything saying Beta.
I did see
"Run test applications?
This helps us develop applications, but may cause jobs to fail on your computer"
which was already set to no.
Please advise: how do I turn off receiving Beta work units?
Thanks
Andy |
|
|
AndyMM Send message
Joined: 27 Jan 09 Posts: 4 Credit: 582,988,184 RAC: 0 Level
Scientific publications
|
Thanks for the comments. I looked in my GPUGrid preferences and did not notice anything saying Beta.
I did see
"Run test applications?
This helps us develop applications, but may cause jobs to fail on your computer"
which was already set to no.
I have found the answer in another thread. So unless someone has switched "Run Test Applications" off for me in the last 2 days, I have never accepted Beta applications.
I have reattached a 275. I will leave that running for a few days. The 295s will stay on F@H for now; they run F@H (and used to run S@H) fine - it was only GPUGrid causing problems.
Also, FYI, it was me who aborted the work units after seeing this thread and relating it to the problems I had been having. After seeing work units process for hours and then show a computation error, I was not in the mood to waste any more time. |
|
|
skgiven Volunteer moderator Volunteer tester
Send message
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level
Scientific publications
|
Other users cannot see whether you have Betas enabled or not; I just suggest you turn them off if you are having problems. There are many things that can cause errors, and we can only guess as we do not have all the info. I can't tell if your system has automatic updates turned on, or if you have your local BOINC client set to 'Use GPU while computer is in use'. All I can do is suggest you disable automatic updates, as these force restarts and crash tasks, and turn off 'Use GPU while computer is in use' if you watch any video on your system.
GL |
|
|
AndyMM Send message
Joined: 27 Jan 09 Posts: 4 Credit: 582,988,184 RAC: 0 Level
Scientific publications
|
Thanks for the advice. The PCs are all part of a crunching farm I have - all headless and controlled via VNC.
Only 4 of them have 9-series or higher NVIDIA cards suitable for GPUGrid. The rest are simple quads with built-in graphics running Rosetta and WCG.
Either way, I will leave a single 275 running on GPUGrid for now. The rest can stay on F@H.
Andy
|
|
|
|
Compute error with a GTX295 GPU on this computer. |
|
|
|
I got a weird WU (1949860): it errored out but then reported a success?
# There is 1 device supporting CUDA
# Device 0: "GeForce 9800 GT"
# Clock rate: 1.75 GHz
# Total amount of global memory: 523829248 bytes
# Number of multiprocessors: 14
# Number of cores: 112
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel sync [frc_sum_kernel] [999]
Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
# There is 1 device supporting CUDA
# Device 0: "GeForce 9800 GT"
# Clock rate: 1.75 GHz
# Total amount of global memory: 523829248 bytes
# Number of multiprocessors: 14
# Number of cores: 112
# Time per step: 59.524 ms
# Approximate elapsed time for entire WU: 37202.798 s
called boinc_finish
____________
|
|
|
|
I've seen a recent handful of errors on my GTX295, and I know a teammate of mine has seen a few also. TONI WUs process fine (and I think those are more computationally intensive), so I think our OC is OK.
Are you seeing a higher failure rate on these WUs between last night and early this morning?
____________
Thanks - Steve |
|
|
ignasi Send message
Joined: 10 Apr 08 Posts: 254 Credit: 16,836,000 RAC: 0 Level
Scientific publications
|
Still happening?
Could you post some of these failed results so I can double-check them?
Thanks |
|
|
|
Still happening?
No. Everything looks good now :-)
____________
Thanks - Steve |
|
|
ftpd Send message
Joined: 6 Jun 08 Posts: 152 Credit: 328,250,382 RAC: 0 Level
Scientific publications
|
16-3-2010 10:40:54 GPUGRID Restarting task p34-IBUCH_chall_pYEEI_100301-15-40-RND6745_1 using acemd version 671
16-3-2010 10:40:55 GPUGRID Restarting task p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0 using acemd2 version 603
16-3-2010 10:58:32 GPUGRID Computation for task p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0 finished
16-3-2010 10:58:32 GPUGRID Output file p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0_1 for task p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0 absent
16-3-2010 10:58:32 GPUGRID Output file p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0_2 for task p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0 absent
16-3-2010 10:58:32 GPUGRID Output file p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0_3 for task p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0 absent
16-3-2010 10:58:32 GPUGRID Starting p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0
16-3-2010 10:58:34 GPUGRID Starting task p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0 using acemd2 version 603
16-3-2010 11:29:43 GPUGRID Computation for task p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0 finished
16-3-2010 11:29:43 GPUGRID Output file p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0_1 for task p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0 absent
16-3-2010 11:29:43 GPUGRID Output file p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0_2 for task p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0 absent
16-3-2010 11:29:43 GPUGRID Output file p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0_3 for task p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0 absent
16-3-2010 11:29:43 GPUGRID Starting a33-TONI_HERG79a-3-100-RND6672_0
16-3-2010 11:29:44 GPUGRID Starting task a33-TONI_HERG79a-3-100-RND6672_0 using acemd2 version 603
I am also using a GTX 295; both jobs were cancelled after 45 min on device 1.
Yesterday, 3 out of 4 jobs were cancelled after almost 5 hours of processing.
I could use some help!!!!!!
See also "errors after 7 hours"
____________
Ton (ftpd) Netherlands |
|
|
ftpd Send message
Joined: 6 Jun 08 Posts: 152 Credit: 328,250,382 RAC: 0 Level
Scientific publications
|
Please HELP!!!
Today again 4 out of 5 jobs cancelled after more than 4 hours processing!!
GTX 295 - Windows XP
____________
Ton (ftpd) Netherlands |
|
|
ftpd Send message
Joined: 6 Jun 08 Posts: 152 Credit: 328,250,382 RAC: 0 Level
Scientific publications
|
Again today, 6 out of 6 cancelled after 45 secs.
Windows XP - GTX 295 - driver 197.13
Also running: Windows XP - GTS 250 - driver 197.13 - no problems in about 10 hours.
Any ideas????
____________
Ton (ftpd) Netherlands |
|
|
ignasi Send message
Joined: 10 Apr 08 Posts: 254 Credit: 16,836,000 RAC: 0 Level
Scientific publications
|
@ftpd
I see all your errors, yes.
Your case is one of the hardest to debug. All the WUs you took were already started by somebody else, so it is not input-file corruption; that also fits with the fact that they fail only after some execution time.
Nor do we see any major failure attributable to the application alone.
But what I do observe in your case is that none of the other cards have such a high rate of failure with similar or even identical WUs and app version.
Have you considered that the source might be the card itself?
What brand is the card? Can you monitor its temperature while running?
Is it your video output card?
Do you experience this sort of failure in other projects? |
|
|
ftpd Send message
Joined: 6 Jun 08 Posts: 152 Credit: 328,250,382 RAC: 0 Level
Scientific publications
|
Dear Ignasi,
Since yesterday I have also had problems with Milkyway WUs on the same card.
The temperature is OK, about 65C.
With the 6.71 CUDA app there are no problems - only with acemd2?
This computer runs 24/7 and the card is not used (except for the monitor).
The card is 6 months old.
Regards,
Ton
PS Now processing device 1 = collatz and device 0 = gpugrid
____________
Ton (ftpd) Netherlands |
|
|
skgiven Volunteer moderator Volunteer tester
Send message
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level
Scientific publications
|
You may want to consider returning it to the seller or manufacturer (RTM) if it is under warranty. If you have tried it in more than one system with the same result, I think it is fair to say the issue is with the card. As you are now getting errors with other projects and the error rate is rising, the card might actually fail soon. |
|
|
|
looks like this WU is bad ...
p25-IBUCH_101b_pYEEI_100304-11-80-RND0419_5
I will be starting it in a few hours so we'll see if the string of errors continues for this WU.
____________
Thanks - Steve |
|
|
ftpd Send message
Joined: 6 Jun 08 Posts: 152 Credit: 328,250,382 RAC: 0 Level
Scientific publications
|
Last night, on the same machine (GTX 295), 3 out of 4 were OK!!!!!!!!!!
____________
Ton (ftpd) Netherlands |
|
|
ignasi Send message
Joined: 10 Apr 08 Posts: 254 Credit: 16,836,000 RAC: 0 Level
Scientific publications
|
looks like this WU is bad ...
p25-IBUCH_101b_pYEEI_100304-11-80-RND0419_5
I will be starting it in a few hours so we'll see if the string of errors continues for this WU.
Thanks Snow Crash,
Certainly some WUs seem to be condemned to die. We have been discussing that internally, and it could be either that a result is corrupted by chance when saved/uploaded/etc., or that particular cards are corrupting results from time to time.
Anyway, please let us know if you detect any pattern of failure regarding 'condemned WUs'.
cheers,
i |
|
|
|
You guys do such a good job that I have not seen another "one off" WU error.
I just finished my first *long* WU and it took precisely twice as long as previous pYEEI WUs, which I bet is exactly what you planned. Excellent work.
Can you tell us anything about the number of atoms and how much time these WUs model?
____________
Thanks - Steve |
|
|
ignasi Send message
Joined: 10 Apr 08 Posts: 254 Credit: 16,836,000 RAC: 0 Level
Scientific publications
|
Can you tell us anything about the number of atoms and how much time these WUs model?
Sure.
These *long* WUs are exactly twice as long as the previous ones with similar names. They model exactly 5 nanoseconds (ns) of ~36,000 atoms (*pYEEI* & *PQpYEEIPI*). In these systems we have a protein (our good old friend the SH2 domain) and a ligand (phosphoTYR-GLU-GLU-ILE & PRO-GLN-phosphoTYR-GLU-GLU-ILE-PRO-ILE // amino acids) for which we are computing 'free energies of binding' - basically the strength of their interaction.
We want to increase the size for one main reason: our 'optimal' simulation time for analysis is currently no shorter than 5 ns. That means our waiting time used to be made up of a normal WU (2.5 ns) + queuing + another normal WU (2.5 ns), and this times 50, which is the number of WUs for one of these experiments.
As you can see, the time-to-answer can vary greatly. With WUs twice as long, we skip the queuing time, and now, with a faster application, that shouldn't be much of a hassle.
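Putting rough numbers on that (a back-of-the-envelope sketch only: the 9 hours per long WU comes from a cruncher's report later in this thread, and the 12-hour average queuing gap between chained WUs is purely my assumption):

ns_per_normal_wu = 2.5      # ns simulated per normal WU (from the post)
ns_per_long_wu   = 5.0      # ns simulated per *long* WU (from the post)
segments         = 50       # chained 5 ns segments per experiment ("times 50")
hours_per_long   = 9.0      # reported GPU time for one long WU
queue_hours      = 12.0     # assumed average wait between chained WUs

hours_per_normal = hours_per_long * ns_per_normal_wu / ns_per_long_wu  # ~4.5 h

# Old scheme: each 5 ns segment = two normal WUs with a queuing gap between them.
old_hours = segments * (2 * hours_per_normal + queue_hours)
# New scheme: one long WU per segment, no intermediate queuing.
new_hours = segments * hours_per_long

print(f"old: ~{old_hours / 24:.0f} days, new: ~{new_hours / 24:.0f} days per experiment")
# old: ~44 days, new: ~19 days - the doubling mostly buys back the queuing time.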
However, it is still a test. We want to have your feedback on them.
thanks,
ignasi |
|
|
|
Looking at 9 hours of processing on a current, state-of-the-art GPU to return 5 ns worth of simulated time puts into perspective just how important it is for all of us to pull together. I've read some of the papers you guys have published - not that I can follow any of it - but I always knew you were working at the atomic level (seriously cool, you rock!).
Also, knowing that with the normal size you ultimately need to put together 100 WUs back to back before you have anything that even makes sense to start looking at highlights why we need to turn these WUs around as quickly as possible. Best case, you don't get a finished run for more than 3 months ... and that's the best case. I imagine it is more common to have to wait at least 4 months.
Running stably with a small cache will reduce the overall runtime of an experiment much more than any one GPU running fast. So, everyone ... get those cards stable, turn your cache down low, and keep on crunching!
____________
Thanks - Steve |
|
|
ftpd Send message
Joined: 6 Jun 08 Posts: 152 Credit: 328,250,382 RAC: 0 Level
Scientific publications
|
Today 6 out of 6 were cancelled after a few hours of processing! Also the long 6.71 WU.
What can we do about it???
____________
Ton (ftpd) Netherlands |
|
|
|
1. Are you connecting to this machine remotely?
2. Are you crunching anything else on this machine?
3. Can you suspend one of the WUs that are currently running to see if the other one will finish properly?
____________
Thanks - Steve |
|
|
skgiven Volunteer moderator Volunteer tester
Send message
Joined: 23 Apr 09 Posts: 3968 Credit: 1,995,359,260 RAC: 0 Level
Scientific publications
|
Your GTX295 is getting about 7K points per day on average. On that system it should be getting about 49K! It must be particularly annoying to have 4 tasks all fail after getting more than 50% of the way through; one task must have been about 20 min from finishing!
RTM it, try it in a different system, or edit your config file to run only 1 task at a time on your GTX295 (28,500 would be a good bit better than 7,000, if that worked), or try Snow Crash's suggestion - suspend one task and let the other finish before beginning the second (you need to select 'No New Tasks' before starting the second task).
By the way, one of the tasks that failed on your GTX295 also failed for me on a card that very rarely fails, and also failed for someone else. So it is possible that that particular task was problematic.
At least your new GTS250 is running well! |
|
|
ftpd Send message
Joined: 6 Jun 08 Posts: 152 Credit: 328,250,382 RAC: 0 Level
Scientific publications
|
I am not connected remotely. It is my office machine. It was crunching over the weekend.
It is also crunching Milkyway, Collatz and SETI - all CUDA GPU jobs.
I also run 1 GPUGRID job and 1 SETI (or other) job at a time.
I also have a GTX 260 and a GTS 250 (in other machines) - no problems with those cards.
____________
Ton (ftpd) Netherlands |
|
|
ftpd Send message
Joined: 6 Jun 08 Posts: 152 Credit: 328,250,382 RAC: 0 Level
Scientific publications
|
25-3-2010 13:12:57 GPUGRID Computation for task p20-IBUCH_025a_pYEEI_100309-13-20-RND9969_0 finished
25-3-2010 13:12:57 GPUGRID Output file p20-IBUCH_025a_pYEEI_100309-13-20-RND9969_0_1 for task p20-IBUCH_025a_pYEEI_100309-13-20-RND9969_0 absent
25-3-2010 13:12:57 GPUGRID Output file p20-IBUCH_025a_pYEEI_100309-13-20-RND9969_0_2 for task p20-IBUCH_025a_pYEEI_100309-13-20-RND9969_0 absent
25-3-2010 13:12:57 GPUGRID Output file p20-IBUCH_025a_pYEEI_100309-13-20-RND9969_0_3 for task p20-IBUCH_025a_pYEEI_100309-13-20-RND9969_0 absent
25-3-2010 13:12:57 GPUGRID Starting p16-IBUCH_2_PQpYEEIPI_long_100319-2-4-RND1703_0
25-3-2010 13:12:58 GPUGRID Starting task p16-IBUCH_2_PQpYEEIPI_long_100319-2-4-RND1703_0 using acemd version 671
25-3-2010 13:13:34 GPUGRID Computation for task p16-IBUCH_2_PQpYEEIPI_long_100319-2-4-RND1703_0 finished
25-3-2010 13:13:34 GPUGRID Output file p16-IBUCH_2_PQpYEEIPI_long_100319-2-4-RND1703_0_1 for task p16-IBUCH_2_PQpYEEIPI_long_100319-2-4-RND1703_0 absent
25-3-2010 13:13:34 GPUGRID Output file p16-IBUCH_2_PQpYEEIPI_long_100319-2-4-RND1703_0_2 for task p16-IBUCH_2_PQpYEEIPI_long_100319-2-4-RND1703_0 absent
25-3-2010 13:13:34 GPUGRID Output file p16-IBUCH_2_PQpYEEIPI_long_100319-2-4-RND1703_0_3 for task p16-IBUCH_2_PQpYEEIPI_long_100319-2-4-RND1703_0 absent
1 job cancelled after 3 hours 12 minutes and 1 job cancelled after 22 secs.
Any reasons?
Yesterday 4 jobs - all OK!!!
____________
Ton (ftpd) Netherlands |
|
|
|
I was very stable until I started running both versions of the app. Then I started to get failures on the old 6.71, which made my system unstable, and then the new version 6.03 would start to crash. I would restart my computer, a couple of 6.03s would run, and all was good until I ran a 6.71, it errored, and it again made my system unstable.
Last night in BOINC Manager I set "No New Tasks" for GPUGrid.
Then I went to my GPUGrid preferences here on the website and told it to only send me ACEMD 2 (this is the new app version and is much faster).
Back in BOINC Manager I did a "Reset" of GPUGrid.
Then I told it to accept new work.
So far everything looks good with no errors. I have a vague suspicion that one of the DLLs distributed with the apps is different but is not being replaced, and that is what causes problems on otherwise stable machines.
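For anyone nursing a headless box through the same thing, the sequence above can be scripted. This is only a sketch, not an official procedure: it assumes Python 3, a stock boinccmd on the PATH, and the usual GPUGRID project URL, and the web-preferences step (ACEMD 2 only) still has to be done in a browser first.

import subprocess, time

PROJECT = "http://www.gpugrid.net/"        # check with: boinccmd --get_project_status

def boinccmd(*args):
    subprocess.run(["boinccmd", *args], check=True)

boinccmd("--project", PROJECT, "nomorework")    # 1. "No New Tasks"
# 2. (manual) set the GPUGrid web preferences to send only ACEMD 2 work
boinccmd("--project", PROJECT, "reset")         # 3. reset the project
time.sleep(10)                                  # give the client a moment to clean up
boinccmd("--project", PROJECT, "allowmorework") # 4. accept new work again
boinccmd("--project", PROJECT, "update")        # contact the scheduler right away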
____________
Thanks - Steve |
|
|
ignasi Send message
Joined: 10 Apr 08 Posts: 254 Credit: 16,836,000 RAC: 0 Level
Scientific publications
|
Actually, we shouldn't be distributing the old app anymore.
There are, though, some WUs sent last week with the old app; that was a mistake.
In principle, all new WUs will come with the new app.
Let's see whether any weird failures still show up.
cheers,
i
|
|
|