New project in long queue

Message boards : News : New project in long queue
noelia

Message 28895 - Posted: 1 Mar 2013, 10:49:14 UTC

Hello all,

After testing the new application, it is time to send a new project. I'm sending around 6,000 WUs to the long queue at the moment. Credits will be around 100,000. Let me know if you have any issues, since this is the first big thing we have submitted to the recently updated long queue and these WUs include new features.

Noelia
flashawk

Message 28896 - Posted: 1 Mar 2013, 11:16:58 UTC

I can't download any. I keep trying, but no long runs in the last hour.
Bedrich Hajek

Message 28899 - Posted: 1 Mar 2013, 12:20:38 UTC - in response to Message 28896.  

These units appear to be very long; finishing time could be close to 20 hours on my computers. Assuming there are no errors!!
flashawk

Message 28911 - Posted: 2 Mar 2013, 7:02:10 UTC

I've had to abort 3 NOELIAs in the past 2 hours; GPU usage was at 100% and the memory controller was at 0%. I had to reboot the computer to get the GPUs working again. Windows popped up an error message complaining that "acemd.2865P.exe had to be terminated unexpectedly". As soon as I suspended the NOELIA work unit, the error message went away.

This was on a GTX 560, GTX 670 and a GTX 680. Windows XP 64-bit, BOINC v7.0.28.
dskagcommunity
Message 28930 - Posted: 3 Mar 2013, 9:20:30 UTC
Last modified: 3 Mar 2013, 9:21:51 UTC

So far I got one error after 14,000 secs :( and a second one which was successful after 15 hours (560 Ti 448-core edition, 157k credits). Now it's calculating a third one... let's see.
DSKAG Austria Research Team: http://www.research.dskag.at



Beyond
Message 28933 - Posted: 3 Mar 2013, 13:17:06 UTC - in response to Message 28895.  

The first one of these I received was:

005px1x2-NOELIA_005p-0-2-RND6570_0

After running for over 24 hours this happened:

<core_client_version>7.0.52</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified.
 (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
MDIO: cannot open file "restart.coor"
SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1574.
Assertion failed: a, file swanlibnv2.cpp, line 59

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

</stderr_txt>
]]>


I have another one that's at 62.5% after 14 hours. Looking at some of the NOELIA WUs, they seem to be failing all over the place, some of them repeatedly. They're also too long for my machines to process and return in 24 hours. After the one that's running either errors out or completes, I will be aborting the NOELIA WUs. Wasting 24+ hours of GPU time per failure is not my favorite way to waste electricity. Sorry. BTW, the TONI WUs run fine.
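For anyone sifting through a batch of failed results, the <stderr_txt> block in reports like the one above can be pulled out mechanically. A minimal sketch (not an official BOINC tool; it just assumes the tag layout shown in the dumps in this thread):

```python
import re

def extract_stderr(report: str) -> str:
    """Pull the body of the <stderr_txt> block out of a BOINC result report."""
    match = re.search(r"<stderr_txt>(.*?)</stderr_txt>", report, re.DOTALL)
    return match.group(1).strip() if match else ""

sample = '''<core_client_version>7.0.52</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified.
 (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
MDIO: cannot open file "restart.coor"
SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1574.
</stderr_txt>
]]>'''

# First line of the captured stderr body:
print(extract_stderr(sample).splitlines()[0])  # MDIO: cannot open file "restart.coor"
```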
microchip
Message 28934 - Posted: 3 Mar 2013, 14:10:34 UTC

I've found NOELIA WUs to be highly unreliable, even on the short queue. I don't like getting one, as I've no idea if it'll complete without errors. I had to abort a short NOELIA one yesterday as it kept crunching in circles: it crunched for some minutes and then returned to the beginning to do the same all over again.

Team Belgium
Retvari Zoltan
Message 28935 - Posted: 3 Mar 2013, 14:17:03 UTC
Last modified: 3 Mar 2013, 14:18:20 UTC

These new NOELIA tasks don't use a full CPU thread (core, if you like) to feed a Kepler-type GPU, the way other workunits (like TONI_AGG) used to. Is this behavior intentional or not? Maybe that's why it takes so long to process them. It takes 40,400 secs for my overclocked GTX 580 to finish these tasks, while it takes 38,800 for a (slightly overclocked) GTX 670, so there is a demonstrable loss (~5%) in their performance.
GPUGRID

Message 28936 - Posted: 3 Mar 2013, 15:19:13 UTC

Some of the new NOELIA units are bugged somehow, I think. Some run fine, some of them don't.
Jim Daniels (JD)

Message 28937 - Posted: 3 Mar 2013, 17:17:41 UTC

I posted this in the "long application updated to the latest version" thread, but Firehawk implied these issues should be reported in this thread. I don't know if this is a 6.18 issue or a NOELIA WU issue, but I guess time will tell. So I apologize in advance for the double posting, if that is a bigger faux pas than not knowing which thread is the appropriate one to post to. ;-)

--------------------

While running my first 6.18 long run task my laptop locked up and I had to do a hard reboot. After the system was back up this WU had terminated with an error. The details are below. However, I have run two 6.18 WUs successfully since then. It appears one other host also terminated with an error on this WU.

The NOELIA WUs seem to be averaging about 80% GPU utilization on my GTX 680M, and the run times are over 18 hours. I don't know how much effect having to share CPU time is having on these numbers.

Error Details:

i7-3740QM 16GB - Win7 Pro x64 - GTX 680m (Alienware 9.18.13.717)
GPU: dedicated to GPUGRID - CPU: SETI, Poem, Milkyway, WUProp, FreeHAL

------------------------------------------------------------------------

Work Unit: 4209987 (041px21x2-NOELIA_041p-0-2-RND9096_0)

Stderr output

<core_client_version>7.0.28</core_client_version>
<![CDATA[
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
MDIO: cannot open file "restart.coor"
SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1574.
Assertion failed: a, file swanlibnv2.cpp, line 59

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

</stderr_txt>
]]>

flashawk

Message 28938 - Posted: 3 Mar 2013, 17:40:37 UTC

You're getting us hawks mixed up. I've been using this name since '95, and I think that's the first time that's happened.
Jim Daniels (JD)

Message 28940 - Posted: 3 Mar 2013, 19:09:22 UTC - in response to Message 28938.  

Mea Culpa.
Stoneageman
Message 28941 - Posted: 3 Mar 2013, 19:28:05 UTC
Last modified: 3 Mar 2013, 19:31:29 UTC

I'm also getting several Noelia tasks making very slow progress, the same problem as was flagged in the beta test. The size of the upload is causing issues for me as well.
Bikermatt
Message 28942 - Posted: 3 Mar 2013, 20:00:13 UTC

The Noelia workunits refuse to run on my 660 Ti Linux system; they lock up or make no progress. I have finished one each on two different Linux systems with 670s without problems.
Jim1348

Message 28943 - Posted: 3 Mar 2013, 20:33:00 UTC

The Noelia longs either fail in the first 3 to 4 minutes on my GTX 560 and GTX 650 Ti (the only four failures I have had on GPUGRID), or else they complete successfully.

I can't tell if they take any longer, though; the last one took 23 hours instead of the more usual 18, but I have seen that on the 6.17 work units also, so it may just be the size of the work unit. I have not had any hangs thus far (Win7 64-bit; BOINC 7.0.52 x64).

All in all, it is not that bad for me.
flashawk

Message 28944 - Posted: 3 Mar 2013, 21:12:11 UTC

Well Jim, now you can understand how most of us got 90% of our errors. If you had looked closer, you would have noticed that almost all of them came from the first run of NOELIAs in early February. Instead, you thought you would display your distributed computing prowess and give us your expert advice, and proceeded to tell us about our substandard components, our inability to overclock correctly, and the overheating issues we must be having.

I'm referencing this thread.

http://www.gpugrid.net/forum_thread.php?id=3299
Jim1348

Message 28945 - Posted: 3 Mar 2013, 22:12:07 UTC - in response to Message 28944.  

flashawk,

Thank you for your insight. But I just started on Feb. 14, and I think it was well past your first group of errors. At any rate, they ran fine on my cards even though not on some others, where they often failed after an hour or more. Maybe you can give better advice?
flashawk

Message 28946 - Posted: 3 Mar 2013, 22:44:05 UTC

I guess it's understandable; the best advice I could ever give in my 51 years is "wait and see". I don't walk into another's club house and start rearranging the furniture. There have been many times when I've jumped to quick conclusions in my own mind, only to find out later that I was wrong.

Anyway, I didn't mean to be too harsh. Let me be the first to say "Welcome to GPUGRID", and I'm sure you have a lot to contribute.
Richard Haselgrove

Message 28947 - Posted: 4 Mar 2013, 0:19:33 UTC
Last modified: 4 Mar 2013, 0:44:59 UTC

I managed to get http://www.gpugrid.net/result.php?resultid=6567675 to run to completion, by making sure it wasn't interrupted during computation. But 12 hours on a GTX 670 is a long time to run without task switching, when you're trying to support more than one BOINC project.

Edit - on the other hand, task http://www.gpugrid.net/result.php?resultid=6563457 following behind on the same card with the same configuration failed three times with

SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1574.
Assertion failed: a, file swanlibnv2.cpp, line 59

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

The TONI task following, again on the same card, seems to have started and to be running normally.
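If preemption during computation really is the trigger, one possible workaround (a sketch only; the 720-minute value is arbitrary and just needs to exceed a task's run time between checkpoints) is to stretch BOINC's task-switch interval via a global_prefs_override.xml in the BOINC data directory:

```xml
<!-- global_prefs_override.xml fragment (sketch): raise the "switch between
     tasks every N minutes" preference so a long GPU task is less likely
     to be preempted mid-run -->
<global_preferences>
   <cpu_scheduling_period_minutes>720</cpu_scheduling_period_minutes>
</global_preferences>
```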
flashawk

Message 28948 - Posted: 4 Mar 2013, 1:07:04 UTC - in response to Message 28947.  
Last modified: 4 Mar 2013, 1:27:51 UTC

Richard Haselgrove wrote:

SWAN : FATAL : Cuda driver error 999 in file 'swanlibnv2.cpp' in line 1574.
Assertion failed: a, file swanlibnv2.cpp, line 59

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.


I just had the same thing happen to me, Richard. Right after the computation error, a TONI WU started on the same GPU card and sat idle with 0% GPU load and 0% memory controller usage. I had to suspend BOINC and reboot to get the GPU crunching again. As far as times go on my GTX 670s, the NOELIA WUs' output files have ranged from 112 MB to 172 MB so far; the smaller one took 7.5 hours and the larger one took 11.75 hours.

So I think the size of the output file directly affects the run time (as usual). They may have to pull the plug on this batch and rework them; we'll have to wait and see what they decide.

Edit: Check out this one, which I just downloaded a couple of minutes ago. I noticed it ended in a 6, which means I'm the 7th person to get it. This is off the hook, man!

http://www.gpugrid.net/workunit.php?wuid=4210634
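The trailing number flashawk is reading is BOINC's replication counter: the _N at the end of a task name is zero-based, so a task ending in _6 is indeed the seventh copy issued. A quick sketch of the arithmetic (the helper name is mine, not a BOINC API):

```python
def issue_number(task_name: str) -> int:
    """1-based count of how many copies of this workunit have been sent out,
    read from BOINC's zero-based _N replication suffix on the task name."""
    replication = int(task_name.rsplit("_", 1)[1])
    return replication + 1  # _0 is the first copy issued

print(issue_number("041px21x2-NOELIA_041p-0-2-RND9096_0"))  # 1 (first copy)
print(issue_number("041px21x2-NOELIA_041p-0-2-RND9096_6"))  # 7 (seventh copy)
```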
©2025 Universitat Pompeu Fabra