NOELIAs are back!

Message boards : Number crunching : NOELIAs are back!

Previous · 1 · 2 · 3 · 4 · 5 · Next

Trotador
Joined: 25 Mar 12
Posts: 103
Credit: 14,948,929,771
RAC: 14
Message 29735 - Posted: 7 May 2013, 18:56:38 UTC
Last modified: 7 May 2013, 19:29:03 UTC

So far, 4 of them on Linux on a 660Ti without problems; the fourth is about to finish, at between 11 and 12 hours. Lower PPD than Nathan's, but better for the summer as they stress the GPUs less :)

Edit: crunching times seem to be improving; maybe they were that high due to the problems I'm having with WCG CEP uploads. I will report once I have additional data.
Stoneageman
Joined: 25 May 09
Posts: 224
Credit: 34,057,374,498
RAC: 0
Message 29736 - Posted: 7 May 2013, 19:00:21 UTC

I've processed 40 NOELIAs so far without error. I've had 15 of Nathan's fail in the last 7 days!
flashawk
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Message 29737 - Posted: 7 May 2013, 19:03:45 UTC

I've completed 21 NOELIAs so far and have had 0 errors.
tomba
Joined: 21 Feb 09
Posts: 497
Credit: 700,690,702
RAC: 0
Message 29738 - Posted: 7 May 2013, 19:08:03 UTC - in response to Message 29732.  

An experience of 1 is not much experience.

65 recently-reported successes is an experience!!

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Message 29739 - Posted: 7 May 2013, 19:30:43 UTC - in response to Message 29738.  

An experience of 1 is not much experience.

65 recently-reported successes is an experience!!

So because they run for maybe half the people (looks like mainly XP and Linux, although I did have 1 finish in Win7/64 so far), is all OK with you? I remember VERY well how loudly you screamed when YOU were having problems with some WUs.
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 29740 - Posted: 7 May 2013, 19:35:28 UTC

Guys, let's be fair here: the issues with this run seem to be fewer than with previous NOELIAs. Whether the job was fit for the production queue is open to debate, but don't assume they didn't do any internal testing just because errors happen.

And don't assume it's all Noelia's fault, or that the others could easily have avoided these errors - she is using features which had been implemented previously but had not been used in production runs before.

@neilp62: taking a look at your failures I see that other people return many of those tasks just fine. And they all fail within 5.x seconds, i.e. during/after initialization. That means it's not a random error, it's systematic. In which case a driver update is the first thing to try. You're using 301.42, which is rather old by GPU standards. Try 314.22 (works for me) or the current one (320.something).

@Firehawk: well.. you're completing quite a few of them with driver 301.42. Anyway, in case of problems (which surely applies now) the 1st thing I'd do is to update the driver.

@Beyond: the error message you're getting rather quickly after a WU started on your GTX460 768MB is
swanMakeTexture2D failed -- array create failedAssertion failed: a, file swanlibnv2.cpp, line 59

That looks like the creation / allocation of some array. This might indeed be due to running out of memory. BOINC should prevent this, but the amount of memory needed has to be reported properly by the WU, otherwise BOINC can't do anything about it. I'll forward this to the Devs.

@John: your CPU has 2 cores disguised as 4. As SK said, I'd first try to reduce the number of CPU tasks and see how GPU performance improves. Once you've got these numbers you can still decide whether this trade-off is worth it for you.
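For anyone wanting to try that, one concrete way (an illustrative sketch, not project guidance) is BOINC's cc_config.xml, placed in the BOINC data directory, which can cap the number of CPU threads the client will use; it takes effect after restarting the client:

```xml
<!-- cc_config.xml: tell the BOINC client to use at most 2 CPU threads,
     leaving the rest free to feed the GPU tasks.
     <ncpus> is a standard cc_config option; the value 2 is just an
     example for a 2-core/4-thread CPU. -->
<cc_config>
  <options>
    <ncpus>2</ncpus>
  </options>
</cc_config>
```

The same cap can also be set from the manager's computing preferences ("Use at most N% of the processors").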

MrS
Scanning for our furry friends since Jan 2002
flashawk
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Message 29741 - Posted: 7 May 2013, 20:29:12 UTC

Hey guys, I wasn't trying to diminish the problems you are having, and I apologize for coming off wrong.
GPUGRID
Joined: 12 Dec 11
Posts: 91
Credit: 2,730,095,033
RAC: 0
Message 29742 - Posted: 7 May 2013, 20:48:52 UTC - in response to Message 29740.  
Last modified: 7 May 2013, 21:05:04 UTC



@Firehawk: well.. you're completing quite a few of them with driver 301.42. Anyway, in case of problems (which surely applies now) the 1st thing I'd do is to update the driver.


Well remembered, mate. I had to roll back to this driver because the newer ones had much worse temperature control, but since these new units run cooler, I may give it a try before changing projects. Thanks for the heads-up.

Update: this seems to have done the trick on the AMD 6x690 machine, which was suffering from poor performance on half of the GPUs. They are warming up right now, which is always a good sign. I'll see about stability in a few hours. Thanks again, ETApes.
skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 29743 - Posted: 7 May 2013, 21:18:27 UTC - in response to Message 29742.  

I'm presently using 311.06 on W7x64. I've had a few driver resets and app crashes, but so far these appear close to the start of runs (up to 5%), and if I just close and reopen Boinc, Noelia's WU's restart and crunch away reasonably well.

The last time I had a driver restart was when I resumed a suspended Albert CPU WU. As soon as it resumed, it suspended the second running GPUGrid WU so that it could start. This immediately resulted in a driver restart. However, the Noelia WU managed to restart after the Albert WU finished, without any intervention on my part. When I did several suspends on a WU yesterday I didn't have any driver problems, but they were further into the run.

I really dislike Boinc suspending GPU WU's to use a CPU thread. Obviously I won't be running Albert tasks any time soon.
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help
Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 29744 - Posted: 7 May 2013, 21:22:44 UTC

I had two failures, and many successful 'NOELIA_klebe_run-0-3's.
The first one was stuck for 8 hours before I noticed it, but after a system restart it immediately ran into an error "Kernel not foundAssertion failed: a, file swanlibnv2.cpp, line 59".
The second one ran for 3 hours, then ran into an "ERROR: file deven.cpp line 1743: deven_Cellresize(): invalid dimensions" - it's probably due to overclocking.
Retvari Zoltan
Joined: 20 Jan 09
Posts: 2380
Credit: 16,897,957,044
RAC: 0
Message 29745 - Posted: 7 May 2013, 21:24:20 UTC - in response to Message 29708.  

These new units are using really low CPU power. The GPUs are way cooler as well, and the processing times seem to be very high. Not using all the hardware.

+1
skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 29746 - Posted: 7 May 2013, 21:59:22 UTC - in response to Message 29745.  

I had two failures, and many successful 'NOELIA_klebe_run-0-3's.
The first one was stuck for 8 hours before I noticed it, but after a system restart it immediately ran into an error "Kernel not foundAssertion failed: a, file swanlibnv2.cpp, line 59".
The second one ran for 3 hours, then ran into an "ERROR: file deven.cpp line 1743: deven_Cellresize(): invalid dimensions" - it's probably due to overclocking.

file swanlibnv2.cpp, line 59 - might be a CUDA bug/app issue.
The Cellresize error might be due to OC, or it might be something else/new; I don't recall seeing it before. Probably best to report such errors, just in case.

These new units are using really low CPU power. The GPUs are way cooler as well, and the processing times seem to be very high. Not using all the hardware.

+1

Yeah, we think these tasks could perhaps do with using more CPU resources (or possibly be higher priority), but conversely, they are using too much GPU memory for some cards. So, not using enough resources for some and too much for others. Making everybody happy is a cinch :))

The reported downclocking may depend on setup and OS/drivers.

When I reduce the CPU usage I'm getting reasonably high GPU performance (W7x64). I'm running 4 climate models (for stability ;p) and two GPUGrid WU's. The GTX660Ti is presently using 88% power, 89% GPU utilization, 848MB GDDR5 (for the bigger equations perhaps) and the shaders are up to 1215MHz. The GTX470 is using 93% GPU and 751MB GDDR (smaller equation maybe).

Are there larger and smaller equations in use, and are they assigned based on GDDR capacity or Compute Capability (CC2.0 small, CC2.1 or above large)?
Perhaps there is some variation in the memory usage of these tasks (different WU's use different amounts of memory); I'm seeing 751MB, 801MB and 848MB on different GPU's. Only the 751MB WU is running on a card not used for a display. Again, it might be a coincidence or there might be something in it. Those with 2 GPU's could check.
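As a quick way to do that check, `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader` reports per-GPU memory use; the small parser below is a hypothetical helper for collecting those numbers (the sample output is illustrative, not taken from any machine in this thread):

```python
def parse_gpu_memory(csv_text: str) -> dict:
    """Parse `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader`
    output into a {gpu_index: MiB_used} mapping."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, mem = (field.strip() for field in line.split(","))
        usage[index] = int(mem.split()[0])  # "751 MiB" -> 751
    return usage

# Illustrative sample: two GPUs using different amounts of memory
sample = "0, 848 MiB\n1, 751 MiB"
print(parse_gpu_memory(sample))  # {'0': 848, '1': 751}
```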
neilp62
Joined: 23 Nov 10
Posts: 14
Credit: 8,017,535,732
RAC: 0
Message 29747 - Posted: 7 May 2013, 22:28:46 UTC - in response to Message 29740.  


@neilp62: taking a look at your failures I see that other people return many of those tasks just fine. And they all fail within 5.x seconds, i.e. during/after initialization. That means it's not a random error, it's systematic. In which case a driver update is the first thing to try. You're using 301.42, which is rather old by GPU standards. Try 314.22 (works for me) or the current one

MrS, many thanks for the suggestion! I normally avoid changing my crunching rig's config as much as possible once it is stable, but I see that my other rig, which has a mix of 650Ti & 560Ti cards running version 314.22, has now completed 4 NOELIA WUs. I'll try upgrading the 680 rig as soon as I get home.

Do you tend to keep your crunch platforms on the latest drivers, or do you wait for systematic issues to arise before upgrading?
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 29749 - Posted: 8 May 2013, 7:57:05 UTC - in response to Message 29747.  

I like to play it half-safe: wait a few weeks to see if others discover any issues with newer drivers, and if not, I'll upgrade when I feel like it. And I won't use beta drivers for crunching unless there's a very good reason to do so.
Going straight for the leading edge can be painful in the BOINC world; it's more like the bleeding edge :p

MrS
Scanning for our furry friends since Jan 2002
neilp62
Joined: 23 Nov 10
Posts: 14
Credit: 8,017,535,732
RAC: 0
Message 29757 - Posted: 8 May 2013, 18:33:57 UTC - in response to Message 29749.  

Going straight for the leading edge can be painful in the BOINC world, it's more like the bleeding edge :p

Unfortunately, staying still for the sake of 'stability' seems to be just as bloody...

I've upgraded to the version 314.22 WHQL driver, and my GTX680 platform (ID 87170) is still failing long run WUs within 6 seconds, with the following error in the BOINC log:
5/8/2013 10:57:22 | GPUGRID | [sched_op] Reason: Unrecoverable error for task 290px29x1-NOELIA_klebe_run-1-3-RND8276_0 ( - exit code 98 (0x62))

No hardware or software config changes have been made to this platform since the new long run WUs were queued. I have no clue why my other rig (ID 137898) (in which I am constantly changing GPU setups, and which runs hotter than the GTX680 platform) is completing long runs successfully, but 87170 can't even initialize the new long runs.

Any further insights would be greatly appreciated. Thank you in advance for the help and the patience!
ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Message 29760 - Posted: 8 May 2013, 19:40:48 UTC - in response to Message 29757.  

I've just seen that you're using an anonymous platform on that rig. There's been an app update in the not too distant past, introducing features which hadn't been used by the previous Nathans, but are being used now.

If you're still running an older app there, we'd have an easy explanation.

MrS
Scanning for our furry friends since Jan 2002
flashawk
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Message 29761 - Posted: 8 May 2013, 19:47:00 UTC

When Tomba first started this thread, I jumped up and checked my computers and discovered that I had 3 running. I use Precision X to adjust my EVGA cards; the 3 NOELIAs that were running had caused those cards' boost clocks to jump up considerably, along with more frequent voltage spikes. Now that all my cards are running these new NOELIAs, I've had to adjust the voltage and core clock on all the cards back down to where I know they are in a safe range.

Every card is different, and I have now turned on K-Boost, which locks all the voltages to where I want them to be. I've had pretty good luck with that feature of Precision X. I did have issues getting it to work in Windows XP Pro x64 because of the lack of SP3; when I installed it, I put it in compatibility mode (the installation files and the installed program) and it worked. That doesn't mean any of this will work for anyone else; I just thought it might be worth mentioning, especially how my cards' core clocks jumped up so high on their own. If I hadn't used a utility to adjust the clocks and voltages, these issues may not have shown up at all.
Robert Gammon
Joined: 28 May 12
Posts: 63
Credit: 714,535,121
RAC: 0
Message 29762 - Posted: 8 May 2013, 19:56:00 UTC - in response to Message 29761.  

I tested NOELIAs on Ubuntu 13.04 with NVIDIA 319.17 on a GTX660Ti. Nine hours of execution of a SHORT RUN showed less than 20% progress. Further tests with Long Runs indicated a runtime of over 5000 minutes, minimum.
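That minimum is a straightforward linear extrapolation of the numbers above; a tiny sketch (the helper function is hypothetical, not part of BOINC):

```python
def estimated_total_minutes(elapsed_minutes: float, progress_fraction: float) -> float:
    """Linearly extrapolate a task's total runtime from elapsed time and
    the progress fraction reported by BOINC (0.0 .. 1.0)."""
    if not 0 < progress_fraction <= 1:
        raise ValueError("progress fraction must be in (0, 1]")
    return elapsed_minutes / progress_fraction

# 9 hours elapsed at 20% progress -> at least 2700 minutes (45 h) total
print(estimated_total_minutes(9 * 60, 0.20))  # 2700.0
```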

I will wait for these to be exhausted, or for the GPUGrid admins to fix ACEMD for Linux so that it runs these klebe NOELIAs as well as the Windoze version of ACEMD does.
neilp62
Joined: 23 Nov 10
Posts: 14
Credit: 8,017,535,732
RAC: 0
Message 29767 - Posted: 9 May 2013, 1:21:25 UTC - in response to Message 29760.  

If you're still running an older app there we'd have an easy explanation.


Yes - I believe you've found the root cause. I adapted an app_info.xml from the forum to run on rig 87170 in August last year, to permit execution of two low-utilization Paola 3EKO WUs at a time. The cudart32_42_9.dll, cufft32_42_9.dll and acemd.2562.cuda42.dll are dated 6/17/2012, and the tcl85.dll is 11/23/2010. I see from my other rig that app updates occurred on 11/4/2012 and 2/25/2013.

What's the fastest/simplest resolution - delete the current app_info.xml?

Thanks!
skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 29774 - Posted: 9 May 2013, 11:13:05 UTC - in response to Message 29767.  
Last modified: 9 May 2013, 12:01:56 UTC

Yes, delete the app_info file and restart Boinc.

While I'm not overly impressed with app_config (confusing terminology, misleading reporting, and the requirement to reset the project to get rid of it), it's probably a better option than building and maintaining app_info files, and it might improve in later versions.
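For completeness, a minimal app_config.xml for running two GPUGRID tasks per GPU might look like the sketch below; the app name `acemdlong` is an assumption and should be checked against the names in the project's client_state.xml:

```xml
<!-- app_config.xml, placed in the GPUGRID project directory.
     gpu_usage 0.5 lets two tasks share one GPU; cpu_usage reserves
     a full CPU thread per task. The app name is an assumption. -->
<app_config>
  <app>
    <name>acemdlong</name>
    <gpu_versions>
      <gpu_usage>0.5</gpu_usage>
      <cpu_usage>1.0</cpu_usage>
    </gpu_versions>
  </app>
</app_config>
```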

In the past some WU's didn't utilize the GPU to a high extent (<60%), but these were mostly Betas, and that isn't usually the situation. Although some non-recommended configurations/setups with your GTX 680 (4095MB) might let you eke out a slight overall performance advantage, I can get both Nate's and Noelia's WU's to run at ~90% GPU utilization while running 5 CPU WU's, or 94/95% simply by reducing the CPU usage for CPU projects.
At present I doubt that you will get a significant improvement running two tasks, and you could simply tweak your setup to optimize for GPUGrid throughput. So I don't really see the need for using app_info or app_config.
Also, the very concept of trying to reach 100% GPU utilization with Keplers is a bit dubious; these cards self-adjust towards a GPU-specific power optimum. If your GPU utilization rises but your core clock rate drops, what's the point?

-
Just want to add that I've not had any NOELIA WU failures, despite messing about, and my second GPU [GTX470] (operating at PCIE2 x8 in PCIE slot 1, going by GPU-Z, the motherboard manufacturer and Afterburner, but GPU (device 0) according to BM) has only used 751MB GDDR since I installed it (for 3 of Noelia's WU's). The GTX660Ti (top PCIE slot; operating at PCIE3 x8) has used varying amounts of GDDR, from ~800MB to almost 1GB.


©2025 Universitat Pompeu Fabra