*_pYEEI_* information and issues

AndyMM

Send message
Joined: 27 Jan 09
Posts: 4
Credit: 582,988,184
RAC: 0
Level
Lys
Message 14793 - Posted: 29 Jan 2010, 15:09:41 UTC - in response to Message 14789.  

[quote]Thanks for the comments. I looked in my GPUGrid preferences and did not notice anything saying Beta
I did see
"Run test applications?
This helps us develop applications, but may cause jobs to fail on your computer"

Which was already set to no.
[/quote]

I have found the answer in another thread, so unless someone has switched "Run test applications" off for me in the last 2 days, I have never accepted beta applications.

I have re-attached a 275 and will leave it running for a few days. The 295s will stay on F@H for now; they run F@H (and used to run S@H) fine, and it was only GPUGrid causing problems.

Also, FYI, it was me who aborted the work units after seeing this thread and relating it to the problems I had been having. After watching work units process for hours and then show a computation error, I was not in the mood to waste any more time.
ID: 14793 · Rating: 0
Profile skgiven
Volunteer moderator
Volunteer tester
Avatar

Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 14799 - Posted: 29 Jan 2010, 17:25:47 UTC - in response to Message 14793.  

Other users cannot see whether you have Betas enabled or not; I just suggest turning them off if you are having problems. There are many things that can cause errors, and we can only guess because we do not have all the information. I can't tell if your system has automatic updates turned on, or if your local BOINC client is set to "Use GPU while computer is in use". All I can do is suggest disabling automatic updates, as these force restarts and crash tasks, and turning off "Use GPU while computer is in use" if you watch any video on your system.
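If you would rather script that second setting than click through the client, here is a minimal Python sketch; the run_gpu_if_user_active tag, the data-directory path and the boinccmd flag are assumptions to verify against your own BOINC version:

# write_override.py - hypothetical helper: tell BOINC not to use the GPU while
# the computer is in use, so video playback does not disturb running GPUGrid tasks
import os

BOINC_DIR = r"C:\ProgramData\BOINC"   # assumed data directory; adjust for your install

override = """<global_preferences>
   <run_gpu_if_user_active>0</run_gpu_if_user_active>
</global_preferences>
"""

with open(os.path.join(BOINC_DIR, "global_prefs_override.xml"), "w") as f:
    f.write(override)

# then run "boinccmd --read_global_prefs_override" (or restart the client)
# so the new preference takes effect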

GL
ID: 14799 · Rating: 0
AndyMM

Send message
Joined: 27 Jan 09
Posts: 4
Credit: 582,988,184
RAC: 0
Level
Lys
Message 14823 - Posted: 30 Jan 2010, 9:46:57 UTC - in response to Message 14799.  

Thanks for the advice. The PCs are all part of a crunching farm I have, all headless and controlled over VNC.
Only 4 of them have 9-series or higher Nvidia cards suitable for GPUGrid; the rest are simple quads with built-in graphics running Rosetta and WCG.
Either way, I will leave a single 275 running on GPUGrid for now. The rest can stay on F@H.
Andy
ID: 14823 · Rating: 0
Profile [AF>Libristes>Jip] Elgrande71
Avatar

Send message
Joined: 16 Jul 08
Posts: 45
Credit: 78,618,001
RAC: 0
Level
Thr
Message 15015 - Posted: 5 Feb 2010, 11:29:44 UTC

Compute error with a GTX 295 GPU on this computer.
ID: 15015 · Rating: 0
Profile X-Files 27
Avatar

Send message
Joined: 11 Oct 08
Posts: 95
Credit: 68,023,693
RAC: 0
Level
Thr
Message 15647 - Posted: 8 Mar 2010, 14:03:45 UTC

I got a weird WU (1949860): it errored out but then finished as a success?

# There is 1 device supporting CUDA
# Device 0: "GeForce 9800 GT"
# Clock rate: 1.75 GHz
# Total amount of global memory: 523829248 bytes
# Number of multiprocessors: 14
# Number of cores: 112
MDIO ERROR: cannot open file "restart.coor"
SWAN : FATAL : Failure executing kernel sync [frc_sum_kernel] [999]
Assertion failed: 0, file ../swan/swanlib_nv.cpp, line 203

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
# There is 1 device supporting CUDA
# Device 0: "GeForce 9800 GT"
# Clock rate: 1.75 GHz
# Total amount of global memory: 523829248 bytes
# Number of multiprocessors: 14
# Number of cores: 112
# Time per step: 59.524 ms
# Approximate elapsed time for entire WU: 37202.798 s
called boinc_finish
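For anyone who wants to scan a pile of these stderr dumps rather than read them by hand, a rough Python sketch (it only assumes the '#'-prefixed format shown above):

# parse_stderr.py - summarize an acemd stderr dump: count SWAN fatal errors
# and collect any reported "Time per step" values
import re, sys

def summarize(path):
    fatal = 0
    steps = []
    for line in open(path, errors="replace"):
        if "SWAN : FATAL" in line:
            fatal += 1
        m = re.search(r"# Time per step:\s*([\d.]+) ms", line)
        if m:
            steps.append(float(m.group(1)))
    print(f"SWAN fatal errors: {fatal}")
    print(f"time per step (ms): {steps if steps else 'none reported'}")

if __name__ == "__main__":
    summarize(sys.argv[1])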

ID: 15647 · Rating: 0
Snow Crash

Send message
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Level
Lys
Message 15648 - Posted: 8 Mar 2010, 14:47:29 UTC - in response to Message 15647.  

I've seen a recent handful of errors on my GTX 295, and I know a teammate of mine has seen a few also. TONI WUs process fine (and I think they are more computationally intensive), so I think our OC is OK.

Are you seeing a higher failure rate on these WUs between last night and early this morning?
Thanks - Steve
ID: 15648 · Rating: 0
ignasi

Send message
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Level
Pro
Message 15774 - Posted: 16 Mar 2010, 10:03:46 UTC - in response to Message 15648.  

Still happening?

Could you post some of these failed results so I can double-check them?

thanks
ID: 15774 · Rating: 0
Snow Crash

Send message
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Level
Lys
Message 15775 - Posted: 16 Mar 2010, 10:32:30 UTC - in response to Message 15774.  

Still happening?

No. Everything looks good now :-)

Thanks - Steve
ID: 15775 · Rating: 0
ftpd

Send message
Joined: 6 Jun 08
Posts: 152
Credit: 328,250,382
RAC: 0
Level
Asp
Message 15776 - Posted: 16 Mar 2010, 10:35:12 UTC - in response to Message 15774.  

16-3-2010 10:40:54 GPUGRID Restarting task p34-IBUCH_chall_pYEEI_100301-15-40-RND6745_1 using acemd version 671
16-3-2010 10:40:55 GPUGRID Restarting task p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0 using acemd2 version 603
16-3-2010 10:58:32 GPUGRID Computation for task p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0 finished
16-3-2010 10:58:32 GPUGRID Output file p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0_1 for task p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0 absent
16-3-2010 10:58:32 GPUGRID Output file p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0_2 for task p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0 absent
16-3-2010 10:58:32 GPUGRID Output file p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0_3 for task p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0 absent
16-3-2010 10:58:32 GPUGRID Starting p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0
16-3-2010 10:58:34 GPUGRID Starting task p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0 using acemd2 version 603
16-3-2010 11:29:43 GPUGRID Computation for task p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0 finished
16-3-2010 11:29:43 GPUGRID Output file p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0_1 for task p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0 absent
16-3-2010 11:29:43 GPUGRID Output file p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0_2 for task p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0 absent
16-3-2010 11:29:43 GPUGRID Output file p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0_3 for task p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0 absent
16-3-2010 11:29:43 GPUGRID Starting a33-TONI_HERG79a-3-100-RND6672_0
16-3-2010 11:29:44 GPUGRID Starting task a33-TONI_HERG79a-3-100-RND6672_0 using acemd2 version 603
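A quick way to count how many tasks in a day's log ended like this is to scan for the "absent" lines; a rough sketch, assuming the event log has been saved to a plain text file in the format shown above:

# scan_boinc_log.py - count tasks whose output files were reported absent
import re, sys
from collections import defaultdict

absent = defaultdict(int)
for line in open(sys.argv[1], errors="replace"):
    m = re.search(r"Output file \S+ for task (\S+) absent", line)
    if m:
        absent[m.group(1)] += 1

for task, n in sorted(absent.items()):
    print(f"{task}: {n} output file(s) absent")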

I am also using a GTX 295; both jobs were cancelled after 45 min on device 1.
Yesterday, 3 jobs out of 4 were cancelled after almost 5 hours of processing.

I could use some help!

See also "errors after 7 hours"

Ton (ftpd) Netherlands
ID: 15776 · Rating: 0
ftpd

Send message
Joined: 6 Jun 08
Posts: 152
Credit: 328,250,382
RAC: 0
Level
Asp
Message 15797 - Posted: 17 Mar 2010, 14:18:07 UTC

Please HELP!

Today again 4 out of 5 jobs were cancelled after more than 4 hours of processing!

GTX 295 - Windows XP


Ton (ftpd) Netherlands
ID: 15797 · Rating: 0
ftpd

Send message
Joined: 6 Jun 08
Posts: 152
Credit: 328,250,382
RAC: 0
Level
Asp
Message 15809 - Posted: 18 Mar 2010, 9:25:32 UTC

Again today, 6 out of 6 were cancelled after 45 secs.

Windows XP - GTX 295 - driver 197.13

Also running Windows XP - GTS 250 - driver 197.13 - no problems in about 10 hours.

Any ideas?
Ton (ftpd) Netherlands
ID: 15809 · Rating: 0
ignasi

Send message
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Level
Pro
Message 15810 - Posted: 18 Mar 2010, 12:21:43 UTC - in response to Message 15809.  
Last modified: 18 Mar 2010, 12:24:51 UTC

@ftpd

I see all your errors, yes.
Your case is one of the hardest to debug. All the WUs you took had already been started by somebody else, so it is not input-file corruption; nor does that fit, given that they fail only after some execution time.
We also do not see any major failure rate attributable to the application alone.
But what I observe in your case is that none of the other cards has such a high rate of failure with similar or even identical WUs and app version.

Have you considered that the source might be the card itself?
What brand is the card? Can you monitor temperature while running?
Is that your video output card?
Do you experience that sort of failure in other projects?
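On the temperature question, a minimal polling sketch; it assumes nvidia-smi is on the PATH and that the driver supports these query flags (older drivers may only offer the plain "nvidia-smi -q" text output):

# gpu_temp_log.py - print each GPU's temperature once a minute while crunching
import subprocess, time

while True:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,temperature.gpu", "--format=csv,noheader"],
        capture_output=True, text=True).stdout.strip()
    print(time.strftime("%H:%M:%S"), out)
    time.sleep(60)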
ID: 15810 · Rating: 0
ftpd

Send message
Joined: 6 Jun 08
Posts: 152
Credit: 328,250,382
RAC: 0
Level
Asp
Message 15811 - Posted: 18 Mar 2010, 12:31:59 UTC - in response to Message 15810.  

Dear Ignasi,

Since yesterday I have also had problems with Milky Way WUs on the same card.

The temperature is OK, about 65 °C.

With the 6.71 CUDA app there were no problems; only with acemd2?

This computer is working 24/7 and nothing is using the card except the monitor.

The card is 6 months old.

Regards,

Ton

PS Now processing device 1 = collatz and device 0 = gpugrid
Ton (ftpd) Netherlands
ID: 15811 · Rating: 0
Profile skgiven
Volunteer moderator
Volunteer tester
Avatar

Send message
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level
His
Message 15813 - Posted: 18 Mar 2010, 13:09:10 UTC - in response to Message 15811.  

You may want to consider returning it to the seller or manufacturer (RTM) if it is under warranty. If you have tried it in more than one system with the same result, I think it is fair to say the issue is with the card. As you are now getting errors with other projects and the error rate is rising, the card might actually fail soon.
ID: 15813 · Rating: 0
Snow Crash

Send message
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Level
Lys
Message 15822 - Posted: 19 Mar 2010, 0:48:15 UTC

Looks like this WU is bad ...
p25-IBUCH_101b_pYEEI_100304-11-80-RND0419_5
I will be starting it in a few hours, so we'll see if the string of errors continues for this WU.

Thanks - Steve
ID: 15822 · Rating: 0
ftpd

Send message
Joined: 6 Jun 08
Posts: 152
Credit: 328,250,382
RAC: 0
Level
Asp
Message 15826 - Posted: 19 Mar 2010, 9:27:31 UTC

Last night, on the same GTX 295 machine, 3 out of 4 were OK!

Ton (ftpd) Netherlands
ID: 15826 · Rating: 0
ignasi

Send message
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Level
Pro
Message 15830 - Posted: 19 Mar 2010, 10:02:08 UTC - in response to Message 15822.  

looks like this WU is bad ...
p25-IBUCH_101b_pYEEI_100304-11-80-RND0419_5
I will be starting it in a few hours so we'll see if the string of errors continues for this WU.


Thanks Snow Crash,
Certainly some WUs seem to be condemned to die. We have been discussing this internally, and it can either be that a result is corrupted by chance when saved/uploaded/etc., or that particular cards are corrupting results from time to time.

Anyways, please let us know if you detect any pattern of failure regarding 'condemned WUs'.

cheers,
i
ID: 15830 · Rating: 0
Snow Crash

Send message
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Level
Lys
Message 15875 - Posted: 21 Mar 2010, 13:09:48 UTC

You guys do such a good job that I have not seen another "one off" WU error.

I just finished my first *long* WU and it took precisely twice as long as previous pYEEI WUs, which I bet is exactly what you planned. Excellent work.

Can you tell us anything about the number of atoms and how much time these WUs model?
Thanks - Steve
ID: 15875 · Rating: 0
ignasi

Send message
Joined: 10 Apr 08
Posts: 254
Credit: 16,836,000
RAC: 0
Level
Pro
Message 15902 - Posted: 22 Mar 2010, 10:54:27 UTC - in response to Message 15875.  
Last modified: 22 Mar 2010, 10:55:07 UTC

Can you tell us anything about the number of atoms and how much time these WUs model?


Sure.
These *long* WUs are exactly twice as long as the previous ones with similar names. They model exactly 5 nanoseconds (ns) of a ~36,000-atom system (*pYEEI* & *PQpYEEIPI*). In these systems we have a protein (our good old friend, the SH2 domain) and a ligand (phosphoTYR-GLU-GLU-ILE and PRO-GLN-phosphoTYR-GLU-GLU-ILE-PRO-ILE, as amino acids) for which we are computing 'free energies of binding', basically the strength of their interaction.
We want to increase the WU size for one main reason: our 'optimal' simulation time for analysis is currently no shorter than 5 ns. That means the waiting time for each 5 ns is a normal WU (2.5 ns) + queuing + another normal WU (2.5 ns), and this times 50, which is the number of WUs for one of these experiments.
As you can see, the time-to-answer can vary greatly. With WUs twice as long, we cut out the intermediate queuing time, and with the faster application that shouldn't be much of a hassle.
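As a back-of-the-envelope illustration of that trade-off (the queue delay and per-WU runtime below are assumed numbers for illustration, not project figures):

# wu_turnaround.py - compare time-to-answer for two normal (2.5 ns) WUs vs one long (5 ns) WU
crunch_hours_per_2_5ns = 4.5   # assumed runtime of a normal WU on a fast card
queue_hours = 12.0             # assumed wait before a returned WU's successor is picked up

normal_chain = 2 * crunch_hours_per_2_5ns + queue_hours   # WU + queue + WU
long_wu = 2 * crunch_hours_per_2_5ns                      # one long WU, no intermediate queue
print(f"5 ns via normal WUs: {normal_chain:.1f} h, via one long WU: {long_wu:.1f} h")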

However, it is still a test. We want to have your feedback on them.

thanks,
ignasi
ID: 15902 · Rating: 0
Snow Crash

Send message
Joined: 4 Apr 09
Posts: 450
Credit: 539,316,349
RAC: 0
Level
Lys
Message 15908 - Posted: 22 Mar 2010, 12:24:22 UTC - in response to Message 15902.  

Looking at 9 hours of processing on a current, state-of-the-art GPU to return 5 ns worth of simulated time puts into perspective just how important it is for all of us to pull together. I've read some of the papers you guys have published, and not that I can follow any of it, but I always knew you were working at the atomic level (seriously cool, you rock!).

Also, knowing that at the normal size you ultimately need to put together 100 WUs back to back before you have anything that even makes sense to start looking at highlights why we need to turn these WUs around as quickly as possible. Best case, you don't get a finished run for more than 3 months ... and that's the best case. I imagine it is more common to have to wait at least 4 months.

Running stable with a small cache will reduce the overall runtime of an experiment much more than any one GPU running fast. So, everyone ... get those cards stable, turn your cache down low, and keep on crunching!
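The months-long figure above follows from simple arithmetic; a small sketch with assumed (not measured) turnaround numbers:

# experiment_length.py - rough length of one 100-step sequential trajectory
steps = 100             # normal-size WUs chained back to back (figure from the post above)
crunch_hours = 4.5      # assumed compute time per normal WU on a fast card
overhead_hours = 19.5   # assumed queue/download/upload time before the next step starts

days = steps * (crunch_hours + overhead_hours) / 24.0
print(f"roughly {days:.0f} days, i.e. about {days / 30:.1f} months per trajectory")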
Thanks - Steve
ID: 15908 · Rating: 0