NOELIAs are back!

Message boards : Number crunching : NOELIAs are back!

Previous · 1 · 2 · 3 · 4 · 5 · Next

Trotador
Joined: 25 Mar 12
Posts: 103
Credit: 14,948,929,771
RAC: 14
Message 29735 - Posted: 7 May 2013, 18:56:38 UTC
Last modified: 7 May 2013, 19:29:03 UTC

So far, 4 of them on Linux on a 660Ti without problems; the fourth is about to finish, at between 11 and 12 hours. Lower PPD than Nathan's, but better for the summer as they stress the GPUs less :)

Edit: crunching times seem to be improving; maybe they were that high due to the problems I'm having with WCG CEP uploads. I will report once I have additional data.
Stoneageman
Joined: 25 May 09
Posts: 224
Credit: 34,057,374,498
RAC: 0
Message 29736 - Posted: 7 May 2013, 19:00:21 UTC

I've processed 40 NOELIAs so far without error. I've had 15 of Nathan's fail in the last 7 days!
flashawk
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Message 29737 - Posted: 7 May 2013, 19:03:45 UTC

I've completed 21 NOELIAs so far and have had 0 errors.
tomba
Joined: 21 Feb 09
Posts: 497
Credit: 700,690,702
RAC: 0
Message 29738 - Posted: 7 May 2013, 19:08:03 UTC - in response to Message 29732.  

An experience of 1 is not much experience.

65 recently-reported successes is an experience!!

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 29739 - Posted: 7 May 2013, 19:30:43 UTC - in response to Message 29738.  

An experience of 1 is not much experience.

65 recently-reported successes is an experience!!

So because they run for maybe half the people (looks like mainly XP and Linux, although I did have 1 finish in Win7/64 so far), is all OK with you? I remember VERY well how loudly you screamed when YOU were having problems with some WUs.
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 29740 - Posted: 7 May 2013, 19:35:28 UTC

Guys, let's be fair here: the issues with this run seem to be fewer than with previous NOELIAs. Whether the job was fit for the production queue is open to debate, but don't assume they didn't do any internal testing just because errors happen.

And don't assume it's all Noelia's fault, or that the others could easily have avoided these errors - she is using features which had been implemented previously but had not been used in production runs before.

@neilp62: taking a look at your failures I see that other people return many of those tasks just fine. And they all fail within 5.x seconds, i.e. during/after initialization. That means it's not a random error, it's systematic. In which case a driver update is the first thing to try. You're using 301.42, which is rather old by GPU standards. Try 314.22 (works for me) or the current one (320.something).

@Firehawk: well.. you're completing quite a few of them with driver 301.42. Anyway, in case of problems (which surely applies now) the 1st thing I'd do is to update the driver.

@Beyond: the error message you're getting rather quickly after a WU started on your GTX460 768MB is
swanMakeTexture2D failed -- array create failedAssertion failed: a, file swanlibnv2.cpp, line 59

That looks like the creation / allocation of some array. This might indeed be due to running out of memory. BOINC should prevent this, but the amount of memory needed has to be reported properly by the WU, otherwise BOINC can't do anything about it. I'll forward this to the Devs.

@John: your CPU has 2 cores disguised as 4. As SK said, I'd first try to reduce the number of CPU tasks and see how GPU performance improves. Once you've got these numbers you can still decide whether this trade-off is worth it for you.
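For anyone wanting to try that, one concrete way (an illustrative sketch, not project guidance) is BOINC's cc_config.xml, placed in the BOINC data directory, which can cap the number of CPU threads the client will use; it takes effect after restarting the client:

```xml
<!-- cc_config.xml: tell the BOINC client to use at most 2 CPU threads,
     leaving the rest free to feed the GPU tasks.
     <ncpus> is a standard cc_config option; the value 2 is just an
     example for a 2-core/4-thread CPU. -->
<cc_config>
  <options>
    <ncpus>2</ncpus>
  </options>
</cc_config>
```

The same cap can also be set from the manager's computing preferences ("Use at most N% of the processors").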

MrS
Scanning for our furry friends since Jan 2002
flashawk
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Message 29741 - Posted: 7 May 2013, 20:29:12 UTC

Hey guys, I wasn't trying to diminish the problems you are having, and I apologize for coming off wrong.
GPUGRID
Joined: 12 Dec 11
Posts: 91
Credit: 2,730,095,033
RAC: 0
Message 29742 - Posted: 7 May 2013, 20:48:52 UTC - in response to Message 29740.  
Last modified: 7 May 2013, 21:05:04 UTC



@Firehawk: well.. you're completing quite a few of them with driver 301.42. Anyway, in case of problems (which surely applies now) the 1st thing I'd do is to update the driver.


Well remembered, mate. I had to roll back to this driver because the newer ones had much worse temperature control, but since these new units run cooler, I may give it a try before changing projects. Thanks for the heads-up.

Update: this seems to have done the trick on the AMD 6x690 machine, which was suffering from poor performance on half of the GPUs. They are warming up right now, which is always a good sign. I'll see about stability in a few hours. Thanks again, ETApes.
skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 29743 - Posted: 7 May 2013, 21:18:27 UTC - in response to Message 29742.  

I'm presently using 311.06 on W7x64. I've had a few driver resets and app crashes, but so far these appear close to the start of runs (up to 5%), and if I just close and reopen Boinc, Noelia's WU's restart and crunch away reasonably well.

The last time I had a driver restart was when I resumed a suspended Albert CPU WU. As soon as it resumed, it suspended the second running GPUGrid WU so that it could start. This immediately resulted in a driver restart. However, the Noelia WU managed to restart after the Albert WU finished, without any intervention on my part. When I did several suspends on a WU yesterday I didn't have any driver problems, but they were further into the run.

I really dislike Boinc suspending GPU WU's to use a CPU thread. Obviously I won't be running Albert tasks any time soon.
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help
Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 29744 - Posted: 7 May 2013, 21:22:44 UTC

I had two failures, and many successful 'NOELIA_klebe_run-0-3's.
The first one was stuck for 8 hours before I noticed it, but after a system restart it immediately ran into an error "Kernel not foundAssertion failed: a, file swanlibnv2.cpp, line 59".
The second one ran for 3 hours, then ran into an "ERROR: file deven.cpp line 1743: deven_Cellresize(): invalid dimensions" - it's probably due to overclocking.
Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 29745 - Posted: 7 May 2013, 21:24:20 UTC - in response to Message 29708.  

These new units are using really low CPU power. The GPUs are way cooler as well, and the processing times seem to be very high. Not using all the hardware.

+1
skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 29746 - Posted: 7 May 2013, 21:59:22 UTC - in response to Message 29745.  

I had two failures, and many successful 'NOELIA_klebe_run-0-3's.
The first one was stuck for 8 hours before I noticed it, but after a system restart it immediately ran into an error "Kernel not foundAssertion failed: a, file swanlibnv2.cpp, line 59".
The second one ran for 3 hours, then ran into an "ERROR: file deven.cpp line 1743: deven_Cellresize(): invalid dimensions" - it's probably due to overclocking.

file swanlibnv2.cpp, line 59 - might be a CUDA bug/app issue.
The Cellresize error might be due to OC, or it might be something else/new; I don't recall seeing it before. Probably best to report such errors, just in case.

These new units are using really low CPU power. The GPUs are way cooler as well, and the processing times seem to be very high. Not using all the hardware.

+1

Yeah, we think these tasks could perhaps do with using more CPU resources (or possibly be higher priority), but conversely, they are using too much GPU memory for some cards. So, not using enough resources for some and too much for others. Making everybody happy is a cinch :))

The reported downclocking may depend on setup and OS/drivers.

When I reduce the CPU usage I'm getting reasonably high GPU performance (W7x64). I'm running 4 climate models (for stability ;p) and two GPUGrid WU's. The GTX660Ti is presently using 88% power, 89% GPU utilization, 848MB GDDR5 (for the bigger equations perhaps) and the shaders are up to 1215MHz. The GTX470 is using 93% GPU and 751MB GDDR (smaller equation maybe).

Are there larger and smaller equations in use, and are they assigned based on GDDR capacity or Compute Capability (CC2.0 small, CC2.1 or above large)?
Perhaps there is some variation in the memory usage of these tasks (different WU's use different amounts of memory); I'm seeing 751MB, 801MB and 848MB on different GPU's. Only the 751MB WU is running on a card not used for a display. Again, it might be a coincidence or there might be something in it. Those with 2 GPU's could check.
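As a quick way to do that check, `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader` reports per-GPU memory use; the small parser below is a hypothetical helper for collecting those numbers (the sample output is illustrative, not taken from any machine in this thread):

```python
def parse_gpu_memory(csv_text: str) -> dict:
    """Parse `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader`
    output into a {gpu_index: MiB_used} mapping."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, mem = (field.strip() for field in line.split(","))
        usage[index] = int(mem.split()[0])  # "751 MiB" -> 751
    return usage

# Illustrative sample: two GPUs using different amounts of memory
sample = "0, 848 MiB\n1, 751 MiB"
print(parse_gpu_memory(sample))  # {'0': 848, '1': 751}
```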
neilp62
Joined: 23 Nov 10
Posts: 14
Credit: 8,017,535,732
RAC: 0
Message 29747 - Posted: 7 May 2013, 22:28:46 UTC - in response to Message 29740.  


@neilp62: taking a look at your failures I see that other people return many of those tasks just fine. And they all fail within 5.x seconds, i.e. during/after initialization. That means it's not a random error, it's systematic. In which case a driver update is the first thing to try. You're using 301.42, which is rather old by GPU standards. Try 314.22 (works for me) or the current one

MrS, many thanks for the suggestion! I normally avoid changing my crunching rig's config as much as possible once it is stable, but I see that my other rig, which has a mix of 650Ti & 560Ti cards running version 314.22, has now completed 4 NOELIA WUs. I'll try upgrading the 680 rig as soon as I get home.

Do you tend to keep your crunch platforms on the latest drivers, or do you wait for systematic issues to arise before upgrading?
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 29749 - Posted: 8 May 2013, 7:57:05 UTC - in response to Message 29747.  

I like to play it half-safe: wait a few weeks to see if others discover any issues with newer drivers, and if not, I'll upgrade when I feel like it. And I won't use beta drivers for crunching unless there's a very good reason to do so.
Going straight for the leading edge can be painful in the BOINC world; it's more like the bleeding edge :p

MrS
Scanning for our furry friends since Jan 2002
neilp62
Joined: 23 Nov 10
Posts: 14
Credit: 8,017,535,732
RAC: 0
Message 29757 - Posted: 8 May 2013, 18:33:57 UTC - in response to Message 29749.  

Going straight for the leading edge can be painful in the BOINC world, it's more like the bleeding edge :p

Unfortunately, staying still for the sake of 'stability' seems to be just as bloody...

I've upgraded to the version 314.22 WHQL driver, and my GTX680 platform (ID 87170) is still failing long run WUs within 6 seconds, with the following error in the BOINC log:
5/8/2013 10:57:22 | GPUGRID | [sched_op] Reason: Unrecoverable error for task 290px29x1-NOELIA_klebe_run-1-3-RND8276_0 ( - exit code 98 (0x62))

No hardware or software config changes have been made to this platform since the new long run WUs were queued. I have no clue why my other rig (ID 137898) (in which I am constantly changing GPU setups, and which runs hotter than the GTX680 platform) is completing long runs successfully, but 87170 can't even initialize the new long runs.

Any further insights would be greatly appreciated. Thank you in advance for the help and the patience!
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 29760 - Posted: 8 May 2013, 19:40:48 UTC - in response to Message 29757.  

I've just seen that you're using an anonymous platform on that rig. There's been an app update in the not too distant past, introducing features which hadn't been used by the previous Nathans, but are being used now.

If you're still running an older app there, we'd have an easy explanation.

MrS
Scanning for our furry friends since Jan 2002
flashawk
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Message 29761 - Posted: 8 May 2013, 19:47:00 UTC

When Tomba first started this thread, I jumped up and checked my computers and discovered that I had 3 running. I use Precision X to adjust my EVGA cards; the 3 NOELIAs that were running had caused those cards' boost clocks to jump up considerably, along with more frequent voltage spikes. Now that all my cards are running these new NOELIAs, I've had to adjust the voltage and core clock on all the cards back down to where I know they are in a safe range.

Every card is different, and I have now turned on K-Boost, which locks all the voltages to where I want them to be. I've had pretty good luck with that feature of Precision X. I did have issues getting it to work in Windows XP Pro x64 because of the lack of SP3; when I installed it, I put it in compatibility mode (the installation files and the installed program) and it worked. That doesn't mean any of this will work for anyone else; I just thought it might be worth mentioning, especially how my cards' core clocks jumped up so high on their own. If I hadn't used a utility to adjust the clocks and voltages, these issues may not have shown up at all.
Robert Gammon
Joined: 28 May 12
Posts: 63
Credit: 714,535,121
RAC: 0
Message 29762 - Posted: 8 May 2013, 19:56:00 UTC - in response to Message 29761.  

I tested NOELIAs on Ubuntu 13.04 with NVIDIA 319.17 on a GTX660Ti. Nine hours of execution of a SHORT RUN showed less than 20% progress. Further tests with Long Runs indicated a runtime of over 5000 minutes, minimum.
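That minimum is a straightforward linear extrapolation of the numbers above; a tiny sketch (the helper function is hypothetical, not part of BOINC):

```python
def estimated_total_minutes(elapsed_minutes: float, progress_fraction: float) -> float:
    """Linearly extrapolate a task's total runtime from elapsed time and
    the progress fraction reported by BOINC (0.0 .. 1.0)."""
    if not 0 < progress_fraction <= 1:
        raise ValueError("progress fraction must be in (0, 1]")
    return elapsed_minutes / progress_fraction

# 9 hours elapsed at 20% progress -> at least 2700 minutes (45 h) total
print(estimated_total_minutes(9 * 60, 0.20))  # 2700.0
```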

I will wait for these to be exhausted, or for the GPUGrid admins to fix ACEMD for Linux so that it runs these klebe NOELIAs as well as the Windoze version of ACEMD does.
neilp62
Joined: 23 Nov 10
Posts: 14
Credit: 8,017,535,732
RAC: 0
Message 29767 - Posted: 9 May 2013, 1:21:25 UTC - in response to Message 29760.  

If you're still running an older app there we'd have an easy explanation.


Yes - I believe you've found the root cause. I adapted an app_info.xml from the forum to run on rig 87170 in August last year, to permit execution of two low-utilization Paola 3EKO WUs at a time. The cudart32_42_9.dll, cufft32_42_9.dll and acemd.2562.cuda42.dll are dated 6/17/2012, and the tcl85.dll is 11/23/2010. I see from my other rig that app updates occurred on 11/4/2012 and 2/25/2013.

What's the fastest/simplest resolution - delete the current app_info.xml?

Thanks!
skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 29774 - Posted: 9 May 2013, 11:13:05 UTC - in response to Message 29767.  
Last modified: 9 May 2013, 12:01:56 UTC

Yes, delete the app_info file and restart Boinc.

While I'm not overly impressed with app_config (confusing terminology, misleading reporting, and the requirement to reset the project to get rid of it), it's probably a better option than building and maintaining app_info files, and it might improve in later versions.
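For completeness, a minimal app_config.xml for running two GPUGRID tasks per GPU might look like the sketch below; the app name `acemdlong` is an assumption and should be checked against the names in the project's client_state.xml:

```xml
<!-- app_config.xml, placed in the GPUGRID project directory.
     gpu_usage 0.5 lets two tasks share one GPU; cpu_usage reserves
     a full CPU thread per task. The app name is an assumption. -->
<app_config>
  <app>
    <name>acemdlong</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```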

In the past some WU's didn't utilize the GPU to a high extent (<60%), but these were mostly Betas, and that isn't usually the situation. Although some non-recommended configurations/setups with your GTX 680 (4095MB) might let you eke out a slight overall performance advantage, I can get both Nate's and Noelia's WU's to run at ~90% GPU utilization while running 5 CPU WU's, or 94/95% simply by reducing the CPU usage for CPU projects.
At present I doubt that you will get a significant improvement running two tasks, and you could simply tweak your setup to optimize for GPUGrid throughput. So I don't really see the need for using app_info or app_config.
Also, the very concept of trying to reach 100% GPU utilization with Keplers is a bit dubious; these cards self-adjust towards a GPU-specific power optimum. If your GPU utilization rises but your core clock rate drops, what's the point?

-
Just want to add that I've not had any NOELIA WU failures, despite messing about, and my second GPU [GTX470] (operating at PCIE2 x8 in PCIE slot 1, going by GPU-Z, the motherboard manufacturer and Afterburner, but GPU (device 0) according to BM) has only used 751MB GDDR since I installed it (for 3 of Noelia's WU's). The GTX660Ti (top PCIE slot; operating at PCIE3 x8) has used varying amounts of GDDR, from ~800MB to almost 1GB.


©2025 Universitat Pompeu Fabra