Message boards : Graphics cards (GPUs) : *_pYEEI_* information and issues
Joined: 27 Jan 09 · Posts: 4 · Credit: 582,988,184 · RAC: 0
[quote]Thanks for the comments. I looked in my GPUGrid preferences and did not notice anything saying Beta. I did see "Run test applications? This helps us develop applications, but may cause jobs to fail on your computer", which was already set to no.[/quote] I have found the answer in another thread. So unless someone has switched "Run test applications" off for me in the last 2 days, I have never accepted beta applications. I have re-attached a 275 and will leave that running for a few days. The 295s will stay on F@H for now; they run F@H (and used to run S@H) fine, it was only GPUGrid causing problems. Also, FYI, it was me who aborted the work units after seeing this thread and relating it to the problems I had been having. After seeing work units process for hours and then show a computation error, I was not in the mood to waste any more time.
skgiven · Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0
Other users cannot see whether you have Betas enabled or not; I just suggest you turn them off if you are having problems. There are many things that can cause errors, and we can only guess as we do not have all the info. I can't tell if your system has automatic updates turned on, or if your local BOINC client is set to use the GPU while the computer is in use. All I can do is suggest you disable automatic updates, as these force restarts and crash tasks, and turn off "Use GPU while computer is in use" if you watch any video on that system. GL
Joined: 27 Jan 09 · Posts: 4 · Credit: 582,988,184 · RAC: 0
Thanks for the advice. The PCs are all part of a crunching farm I have, all headless and controlled over VNC. Only 4 of them have 9-series or higher Nvidia cards suitable for GPUGrid; the rest are simple quads with built-in graphics running Rosetta and WCG. Either way, I will leave a single 275 running on GPUGrid for now. The rest can stay on F@H. Andy
[AF>Libristes>Jip] Elgrande71 · Joined: 16 Jul 08 · Posts: 45 · Credit: 78,618,001 · RAC: 0
Computation error with a GTX295 GPU on this computer.
X-Files 27 · Joined: 11 Oct 08 · Posts: 95 · Credit: 68,023,693 · RAC: 0
I got a weird WU (1949860): it errored out but then ended up a success?
Joined: 4 Apr 09 · Posts: 450 · Credit: 539,316,349 · RAC: 0
I've seen a recent handful of errors on my GTX295, and I know a teammate of mine has seen a few also. TONI WUs (which I think are more computationally intensive) process fine, so I think our OC is OK. Are you seeing a higher failure rate on these WUs between last night and early this morning? Thanks - Steve
Joined: 10 Apr 08 · Posts: 254 · Credit: 16,836,000 · RAC: 0
Still happening? Could you post some of these failed results so I can double-check that they are right? Thanks.
Joined: 4 Apr 09 · Posts: 450 · Credit: 539,316,349 · RAC: 0
[quote]Still happening?[/quote] No, everything looks good now :-) Thanks - Steve
Joined: 6 Jun 08 · Posts: 152 · Credit: 328,250,382 · RAC: 0
16-3-2010 10:40:54 GPUGRID Restarting task p34-IBUCH_chall_pYEEI_100301-15-40-RND6745_1 using acemd version 671
16-3-2010 10:40:55 GPUGRID Restarting task p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0 using acemd2 version 603
16-3-2010 10:58:32 GPUGRID Computation for task p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0 finished
16-3-2010 10:58:32 GPUGRID Output file p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0_1 for task p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0 absent
16-3-2010 10:58:32 GPUGRID Output file p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0_2 for task p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0 absent
16-3-2010 10:58:32 GPUGRID Output file p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0_3 for task p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0 absent
16-3-2010 10:58:32 GPUGRID Starting p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0
16-3-2010 10:58:34 GPUGRID Starting task p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0 using acemd2 version 603
16-3-2010 11:29:43 GPUGRID Computation for task p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0 finished
16-3-2010 11:29:43 GPUGRID Output file p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0_1 for task p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0 absent
16-3-2010 11:29:43 GPUGRID Output file p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0_2 for task p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0 absent
16-3-2010 11:29:43 GPUGRID Output file p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0_3 for task p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0 absent
16-3-2010 11:29:43 GPUGRID Starting a33-TONI_HERG79a-3-100-RND6672_0
16-3-2010 11:29:44 GPUGRID Starting task a33-TONI_HERG79a-3-100-RND6672_0 using acemd2 version 603

I am also using a GTX 295; both of these jobs were cancelled after 45 minutes on device 1. Yesterday 3 jobs out of 4 were cancelled after almost 5 hours of processing. I can use some help! See also the "errors after 7 hours" thread. Ton (ftpd) Netherlands
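A quick way to see how often this pattern occurs is to scan the BOINC event log for tasks whose output files were reported absent. Below is a minimal sketch only; the log filename and the exact message wording are assumptions taken from the excerpt above and may differ between BOINC versions and installations.

```python
# Minimal sketch: count GPUGRID tasks whose output files were reported absent.
# LOG_PATH is an assumption; point it at your BOINC event log (often stdoutdae.txt
# in the BOINC data directory, or a saved copy of the Manager's Messages tab).
import re
from collections import defaultdict

LOG_PATH = "stdoutdae.txt"

absent = defaultdict(int)   # task name -> number of "Output file ... absent" lines
finished = set()            # task names that reported "Computation ... finished"

with open(LOG_PATH, errors="replace") as log:
    for line in log:
        done = re.search(r"Computation for task (\S+) finished", line)
        if done:
            finished.add(done.group(1))
        miss = re.search(r"Output file \S+ for task (\S+) absent", line)
        if miss:
            absent[miss.group(1)] += 1

for task in sorted(finished):
    status = "outputs absent (likely failed)" if absent.get(task) else "ok"
    print(f"{task}: {status}")
```

Run against the log excerpt above, this would flag both p31-IBUCH_21_pYEEI_100301-13-40-RND4121_0 and p9-IBUCH_201_pYEEI_100301-13-40-RND6673_0 as having all of their output files missing.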
Joined: 6 Jun 08 · Posts: 152 · Credit: 328,250,382 · RAC: 0
Please help! Today, again, 4 out of 5 jobs were cancelled after more than 4 hours of processing. GTX 295 - Windows XP. Ton (ftpd) Netherlands
Joined: 6 Jun 08 · Posts: 152 · Credit: 328,250,382 · RAC: 0
Again today, 6 out of 6 were cancelled after 45 seconds. Windows XP - GTX 295 - driver 197.13. Also running: Windows XP - GTS 250 - driver 197.13 - no problems in about 10 hours. Any ideas? Ton (ftpd) Netherlands
Joined: 10 Apr 08 · Posts: 254 · Credit: 16,836,000 · RAC: 0
@ftpd I see all your errors, yes. Your case is one of the hardest to debug. All the WUs you received had already been started by somebody else, so it is not input file corruption; nor would corrupted inputs explain why they fail only after some execution time. We also see no major failure attributable to the application alone, at least. What I do observe in your case is that none of the other cards have such a high rate of failure with similar or even identical WUs and app versions. Have you considered that the source might be the card itself? What brand is the card? Can you monitor the temperature while it is running? Is that your video output card? Do you experience that sort of failure in other projects?
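On the temperature question, something like the sketch below can log GPU temperatures alongside the crunching so spikes can be matched against failure times. It assumes nvidia-smi is installed and the driver supports the --query-gpu interface (newer drivers); on a 2010-era Windows XP host a GUI monitor such as GPU-Z is the more practical option.

```python
# Minimal temperature-logging sketch (assumes nvidia-smi with --query-gpu support).
import subprocess
import time

def log_gpu_temps(interval_s: int = 30, samples: int = 120) -> None:
    """Print one timestamped line per sample with the temperature of every GPU."""
    for _ in range(samples):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,temperature.gpu", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        print(time.strftime("%Y-%m-%d %H:%M:%S"), out.replace("\n", " | "))
        time.sleep(interval_s)

if __name__ == "__main__":
    log_gpu_temps()
```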
Joined: 6 Jun 08 · Posts: 152 · Credit: 328,250,382 · RAC: 0
Dear Ignasi, Since yesterday I also have problems with MilkyWay WUs on the same card. The temperature is OK, about 65 °C. With the 6.71 CUDA application there are no problems; only with acemd2? This computer is working 24/7 and nothing else uses the card (except for the monitor). The card is 6 months old. Regards, Ton. PS: Now processing device 1 = Collatz and device 0 = GPUGrid. Ton (ftpd) Netherlands
skgiven · Joined: 23 Apr 09 · Posts: 3968 · Credit: 1,995,359,260 · RAC: 0
You may want to consider returning it to the seller or manufacturer (RTM) if it is under warranty. If you have tried it in more than one system with the same result, I think it is fair to say the issue is with the card. As you are now getting errors with other projects and the error rate is rising, the card might actually fail soon.
Joined: 4 Apr 09 · Posts: 450 · Credit: 539,316,349 · RAC: 0
Looks like this WU is bad ... p25-IBUCH_101b_pYEEI_100304-11-80-RND0419_5. I will be starting it in a few hours, so we'll see if the string of errors continues for this WU. Thanks - Steve
Joined: 6 Jun 08 · Posts: 152 · Credit: 328,250,382 · RAC: 0
Last night, on the same machine (GTX 295), 3 out of 4 were OK! Ton (ftpd) Netherlands
Joined: 10 Apr 08 · Posts: 254 · Credit: 16,836,000 · RAC: 0
[quote]looks like this WU is bad ...[/quote] Thanks Snow Crash, certainly some WUs seem to be condemned to die. We have been discussing it internally; it can be either that a result is corrupted by chance when saved/uploaded/etc., or that particular cards corrupt results from time to time. Anyway, please let us know if you detect any pattern of failure regarding 'condemned WUs'. cheers, i
Joined: 4 Apr 09 · Posts: 450 · Credit: 539,316,349 · RAC: 0
You guys do such a good job that I have not seen another "one off" WU error. I just finished my first *long* WU and it took precisely twice as long as the previous pYEEI WUs, which I bet is exactly what you planned. Excellent work. Can you tell us anything about the number of atoms and how much time these WUs model? Thanks - Steve
Joined: 10 Apr 08 · Posts: 254 · Credit: 16,836,000 · RAC: 0
[quote]Can you tell us anything about the number of atoms and how much time these WUs model?[/quote] Sure. These *long* WUs are exactly twice as long as the previous ones with similar names. They model exactly 5 nanoseconds (ns) of a ~36,000-atom system (*pYEEI* & *PQpYEEIPI*). In these systems we have a protein (our good old friend, the SH2 domain) and a ligand (phosphoTYR-GLU-GLU-ILE and PRO-GLN-phosphoTYR-GLU-GLU-ILE-PRO-ILE; amino acids) for which we are computing 'free energies of binding', basically the strength of their interaction. We wanted to increase the size for one main reason: our 'optimal' simulation time for analysis is currently no shorter than 5 ns. That means the waiting time is made up of a normal WU (2.5 ns) + queuing + a normal WU (2.5 ns), and this times 50, which is the number of WUs for one of these experiments. As you can see, the time-to-answer can vary greatly. With WUs twice as long, we omit that queuing time, and with the faster application it shouldn't be much of a hassle. However, it is still a test, and we want your feedback on them. thanks, ignasi
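As a rough illustration of the queuing argument above, the sketch below compares the time-to-answer for an experiment of 50 five-nanosecond segments delivered either as normal (2.5 ns) WUs with a queuing gap inside each segment, or as long (5 ns) WUs. The crunch-time and queuing figures are illustrative assumptions, not project numbers, and hand-off time between segments is ignored.

```python
# Back-of-the-envelope time-to-answer comparison; all inputs are illustrative guesses.
SEGMENTS = 50          # 5 ns segments per experiment (from the post above)
CRUNCH_NORMAL_H = 4.5  # assumed hours to crunch one 2.5 ns (normal) WU
QUEUE_H = 12.0         # assumed hours a finished WU waits before its successor is sent

# Normal WUs: each 5 ns segment = WU + queuing + WU.
normal_total_h = SEGMENTS * (2 * CRUNCH_NORMAL_H + QUEUE_H)
# Long WUs: each 5 ns segment = one WU of twice the crunch time, no mid-segment queue.
long_total_h = SEGMENTS * (2 * CRUNCH_NORMAL_H)

print(f"normal WUs: {normal_total_h / 24:.0f} days per experiment")
print(f"long WUs:   {long_total_h / 24:.0f} days per experiment")
```

With these guesses the long WUs remove about 25 days of pure queuing from each experiment; the real saving depends entirely on how long results actually sit in the queue.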
Joined: 4 Apr 09 · Posts: 450 · Credit: 539,316,349 · RAC: 0
Looking at 9 hours of processing on a current, state-of-the-art GPU to return 5 ns worth of simulated time puts into perspective just how important it is for all of us to pull together. I've read some of the papers you guys have published, and not that I can follow any of it, but I always knew you were working at the atomic level (seriously cool, you rock!). Also, knowing that with the normal size you ultimately need to put together 100 WUs back to back before you have anything that even makes sense to start looking at highlights why we need to turn these WUs around as quickly as possible. Best case scenario, you don't get a finished run for more than 3 months ... and that's best case. I imagine it is more common to have to wait at least 4 months. Running stable with a small cache will reduce the overall runtime of an experiment much more than any one GPU running fast. So everyone: get those cards stable, turn your cache down low, and keep on crunching! Thanks - Steve