NOELIAs are back!

[AF>Le_Pommier] McRoger
Joined: 30 Aug 08
Posts: 12
Credit: 15,800,629
RAC: 0
Level: Pro
Message 29861 - Posted: 12 May 2013, 13:14:27 UTC - in response to Message 29858.  
Last modified: 12 May 2013, 13:15:10 UTC

There is a problem on some Linux operating systems: tasks look as if they will take forever to complete. I think it's the more recent versions of Linux that are affected, but not all. It's possible that something is missing in the drivers or in Linux (missing libraries, for example) that prevents the drivers from being used correctly.

If you consistently get SIGSEGV errors on Linux, or these WUs are going to take hundreds of hours (go by the % complete), just abort them.

If anyone finds a fix, post it up.


I'm running Debian Wheezy, so yes, a very recent version of kernel, drivers and libraries.

And if I try to install « glibc-2.13-1 » (which contains the libpthread.so.0 mentioned in the error message), apt tells me that « libc6 » is already installed instead.

So yes indeed, it might be that the library the Linux application was compiled against is not compatible with the latest versions (but I'm no developer, so that is just an assumption).

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level: His
Message 29880 - Posted: 12 May 2013, 18:44:01 UTC - in response to Message 29859.  

Finished my first Noelia without errors on XP running a 660TI. Took about 35 minutes longer than the same card on Linux Mint

While you have only run one task type each on Linux and XP, it looks like Linux Mint (3.5.0-17-generic) is ~5% faster (4.5% for Nathan's and 5.8% for Noelia's):

Linux
306px37x2-NOELIA_klebe_run-1-3-RND7661_1 4440201 10 May 2013 | 20:53:39 UTC 11 May 2013 | 7:16:08 UTC Completed and validated 36,653.31 16,521.36 127,800.00
I40R14-NATHAN_dhfr36_6-10-32-RND5144_0 4440304 10 May 2013 | 20:53:39 UTC 11 May 2013 | 12:08:30 UTC Completed and validated 17,866.73 17,627.17 70,800.00

XP
306px2x1-NOELIA_klebe_run-1-3-RND0127_0 4442470 11 May 2013 | 19:28:14 UTC 12 May 2013 | 6:21:56 UTC Completed and validated 38,796.59 17,414.91 127,800.00
I12R11-NATHAN_dhfr36_6-13-32-RND4528_0 4442251 11 May 2013 | 19:29:37 UTC 12 May 2013 | 11:33:18 UTC Completed and validated 18,676.30 18,565.16 70,800.00

All 'Long runs (8-12 hours on fastest card) v6.18 (cuda42)'

That's more than I thought it would be (I expected 1%, possibly 3%). There might be some task variation, but running both on the same system makes for a very solid comparison.
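The percentages quoted above can be reproduced from the task listings. This is only a minimal sketch, assuming the first number after "Completed and validated" is the run time in seconds; the function name `percent_slower` is made up for illustration.

```python
# Run times in seconds, copied from the Linux and XP task listings above.
# Assumption: the first figure after "Completed and validated" is run time.
RUN_TIMES = {
    "NOELIA": {"linux": 36653.31, "xp": 38796.59},
    "NATHAN": {"linux": 17866.73, "xp": 18676.30},
}


def percent_slower(baseline: float, other: float) -> float:
    """Return how much longer `other` took than `baseline`, in percent."""
    return (other - baseline) / baseline * 100.0


for task, times in RUN_TIMES.items():
    diff = percent_slower(times["linux"], times["xp"])
    print(f"{task}: XP run took {diff:.1f}% longer than Linux")
```

With these figures the NOELIA difference comes to roughly 2,143 seconds, which lines up with the "about 35 minutes longer" in the quoted post.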
FAQ's

HOW TO:
- Opt out of Beta Tests
- Ask for Help

Jozef J
Joined: 7 Jun 12
Posts: 112
Credit: 1,140,895,172
RAC: 0
Level: Met
Message 30024 - Posted: 16 May 2013, 18:44:03 UTC

I2HDQ_17R4-SDOERR_2HDQc-1-4-RND1951_0
I99R11-NATHAN_dhfr36_6-8-32-RND2501_0 202 (0xca) EXIT_ABORTED_BY_PROJECT


http://www.gpugrid.net/result.php?resultid=6878277
http://www.gpugrid.net/result.php?resultid=6851367

Errors, and errors again and again, and more NOELIAs incoming :-(((
This comes after about two weeks without problems, restarts or blue screens.
Every time my RAC reaches about 620k, some bad NOELIA jobs come in. And this is not the first time I have complained about the problem appearing just when I have about 600-620k RAC. Now all my participation in the project over the past two weeks has gone down the drain. Is there any conspiracy behind it or just incompetence?

Can I do anything more than complain here? :-)))

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level: Met
Message 30026 - Posted: 16 May 2013, 19:13:25 UTC - in response to Message 30024.  

Is there any conspiracy behind it or just incompetence?

Neither.

Noelia is not to blame, she's just the first to use new features which the project needs in the future.

MrS
Scanning for our furry friends since Jan 2002

Jozef J
Joined: 7 Jun 12
Posts: 112
Credit: 1,140,895,172
RAC: 0
Level: Met
Message 30031 - Posted: 16 May 2013, 20:01:48 UTC - in response to Message 30026.  
Last modified: 16 May 2013, 20:04:57 UTC

Every time my RAC reaches about 620k, some bad NOELIA jobs come in. And this is not the first time I have complained about the problem appearing just when I have about 600-620k RAC. This is exactly the third time I have had a problem with NOELIA tasks just when I reached 620k RAC. It is amazing, Mr. Scientist. And that's your answer, Mr. Moderator?

I have proof of it in the work logs.

Do you think that people whose English is not perfect cannot understand protein simulations?! Shame on you.

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level: Tyr
Message 30038 - Posted: 17 May 2013, 3:16:50 UTC
Last modified: 17 May 2013, 3:17:14 UTC

It's obviously an international conspiracy to keep your RAC low. We're all involved and participate in LJRAC (Lowering Jozef's Recent Average Credit). BTW, the checks are in the mail...

JK. Seriously, we're all suffering the same problem. This is not what one would call a smoothly running project. Just saying...

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level: Met
Message 30039 - Posted: 17 May 2013, 7:41:04 UTC

Well, I have two systems with nVidia cards, slow ones though. However, they have done NOELIAs without problems so far, taking between 2 and 3 days. The systems are stable, nothing is overclocked, and they are not running the latest BOINC or drivers. If it works then I leave it as is. If not, I'll try updating the video drivers. One CPU core is always free; that seems to be important.

It could be the setup of the system and drivers that results in errors; Microsoft Windows overall is a complex and heavily controlling OS.
Greetings from TJ

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level: His
Message 30047 - Posted: 17 May 2013, 13:16:55 UTC - in response to Message 30039.  
Last modified: 17 May 2013, 13:46:33 UTC

If it works then I leave it as is.

Always the best advice.
Unfortunately I test stupid problems and get errors for my efforts. Today while testing something and looking into another issue/fix, I had to suspend WU's. This caused the driver to restart two or three times, and then I got a blue screen. On reboot lots of C+ errors and all my running WU's crashed and burned. Not an issue if I had been running FightMalaria@home, but I was running 5 climate models and lost several hundred hours - scunnered!

Possible fix here - works for me.

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level: Tyr
Message 30049 - Posted: 17 May 2013, 14:06:38 UTC - in response to Message 30047.  

I was running 5 climate models and lost several hundred hours - scunnered!

Ouch.

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Level: Met
Message 30054 - Posted: 17 May 2013, 15:29:26 UTC

I have read here in the forum many times that suspending a GPUGRID WU can cause errors and a blue screen. That is why I have never tried it. For Albert@Home and Einstein@Home it can be done without harm (in my case).
Greetings from TJ

[PUGLIA] kidkidkid3
Joined: 23 Feb 11
Posts: 101
Credit: 1,589,743,957
RAC: 360
Level: His
Message 30058 - Posted: 18 May 2013, 10:52:54 UTC - in response to Message 30054.  

Hi all,
after 25 hours (72% completed) I suspended this NOELIA WU.
On resume I got this error; I then aborted it because it restarted from 0%.
Do you have any idea about this?
Thanks in advance.
K.

http://www.gpugrid.net/result.php?resultid=6879994

Dreams do not always come true. But not because they are too big or impossible. Why did we stop believing.
(Martin Luther King)

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level: Met
Message 30060 - Posted: 18 May 2013, 12:04:11 UTC

I think it's pretty safe to say that, with the current NOELIAs, suspending a WU almost certainly triggers a driver reset. For me this has taken down a few hours of Einstein work, twice. Now I do my testing whenever I have other WUs running. Not ideal, but better than the alternative.

@Jozef: when was the last time that throwing insults at people actually helped you? Looking at your tasks I can see that in the last 2 weeks you had 3 NOELIAs and 2 NATHANs fail with computation errors. That's unfortunate, but not unusual, and you can be sure the scientists are looking into it. But it's nowhere near the scale of the global conspiracy which you seem to suspect.

Actually, everyone gets bonus credits for each long-run WU, as the risk of losing them is higher than for shorter WUs. So you're being compensated for a certain failure rate from the start.

MrS
Scanning for our furry friends since Jan 2002

flashawk
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Level: Arg
Message 30075 - Posted: 18 May 2013, 21:39:15 UTC

The suspend-restart blue screen has never happened to me, and I suspend quite often (Windows XP Pro x64). Maybe it's an OS-specific issue. I also have my checkpoints set to 900 seconds (15 minutes); I did this mainly for the climate models I run. I do have problems when finishing an SDOERR and starting a NOELIA on the same GPU: no crashes, just the card running wild on the GPU clock.

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Level: His
Message 30088 - Posted: 19 May 2013, 13:44:05 UTC - in response to Message 30075.  
Last modified: 19 May 2013, 13:52:56 UTC

I've only seen the 'suspend & crash' problem on W7. Seeing as different OSes handle the drivers differently, it's bound to be OS related.

On XP I think you still can't set Prefer maximum performance in NVidia control panel - might explain the 'card running wild on the GPU clock' issue. Anyway, it's a driver issue; they took that feature away.

ExtraTerrestrial Apes
Volunteer moderator
Volunteer tester
Joined: 17 Aug 08
Posts: 2705
Credit: 1,311,122,549
RAC: 0
Level: Met
Message 30090 - Posted: 19 May 2013, 15:54:47 UTC - in response to Message 30088.  

I've only seen the 'suspend & crash' problem on W7.

Add W8 to that!

MrS
Scanning for our furry friends since Jan 2002

klepel
Joined: 23 Dec 09
Posts: 189
Credit: 4,798,881,008
RAC: 0
Level: Arg
Message 30098 - Posted: 20 May 2013, 0:17:07 UTC - in response to Message 30047.  

This caused the driver to restart two or three times, and then I got a blue screen. On reboot lots of C+ errors and all my running WU's crashed and burned. Not an issue if I had been running FightMalaria@home, but I was running 5 climate models and lost several hundred hours - scunnered!


Sometimes I think it is not necessarily the GPUGRID WUs that cause the blue screen; it might also be the CLIMATEPREDICTION.NET WUs. I had a blue screen around the same time as you, after which one of the CLIMATEPREDICTION.NET WUs no longer worked while the GPUGRID WU continued. However, it mostly happens on a system with a GTX 570 card.

flashawk
Joined: 18 Jun 12
Posts: 297
Credit: 3,572,627,986
RAC: 0
Level: Arg
Message 30100 - Posted: 20 May 2013, 2:46:37 UTC

The climate models (CPDN) are very, very sensitive to any kind of an interruption. When I set my checkpoints to every 15 minutes, my computation error rate dropped by 70% and if I do 3 or more suspend/restarts within 10 minutes, I'll get at least 1 error.

When I reboot my computers (every 200 hours), I suspend the tasks and close BOINC by clicking exit; that works every time for me. If I get any kind of crash or system freeze (neither has happened in over a year), it's guaranteed that I will lose some climate models. I even have new APC units just in case.
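For anyone who prefers to set the 15-minute checkpoint interval in a file rather than through the BOINC Manager, here is a rough sketch. It assumes the setting maps to the `disk_interval` element of a `global_prefs_override.xml` file in the BOINC data directory; that mapping is my understanding rather than something confirmed in this thread, so check the BOINC documentation before relying on it. The helper name `build_override` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_override(checkpoint_seconds: int) -> str:
    """Build a minimal preferences override as an XML string.

    Assumption: BOINC reads the checkpoint interval from a
    <disk_interval> element inside <global_preferences>.
    """
    root = ET.Element("global_preferences")
    ET.SubElement(root, "disk_interval").text = str(checkpoint_seconds)
    return ET.tostring(root, encoding="unicode")


# 900 seconds = 15 minutes, the value mentioned above.
print(build_override(900))
```

The sketch only prints the XML; where the file lives and how the client is told to re-read it depends on the installation.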

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level: Tyr
Message 30115 - Posted: 20 May 2013, 13:58:37 UTC - in response to Message 30100.  

The climate models (CPDN) are very, very sensitive to any kind of an interruption. When I set my checkpoints to every 15 minutes, my computation error rate dropped by 70% and if I do 3 or more suspend/restarts within 10 minutes, I'll get at least 1 error.

I've been wondering about CPDN, because the people reporting crashes often mention that they lose CPDN work. I'm not running that project and have never had any hard crashes, nothing but some ACEMD errors on certain WU types. Nothing else running on the machine is ever affected.

Vagelis Giannadakis
Joined: 5 May 13
Posts: 187
Credit: 349,254,454
RAC: 0
Level: Asp
Message 30286 - Posted: 24 May 2013, 9:53:46 UTC
Last modified: 24 May 2013, 9:56:19 UTC

Hi all,

My GTX 650Ti is currently working on a NOELIA, but the GPU utilization looks pretty low: elapsed 10h, remaining 17h. That will be a total of 27 hours! A previous SDOERR took 18h.

I use Linux and so can't observe GPU utilization directly, but judging by the temperature (50C), the GPU is clearly not being fully used. It gets to >60C when it is.

Has anyone else observed this?

Edit: The WU's process (acemd.2868) consumes 5-10% CPU, but this doesn't seem to be the cause of the under-utilization, rather the symptom of it. I tried suspending CPU tasks and projects and it didn't change.

My configuration:
Ubuntu Server 12.04 x86_64
Kernel 3.2.0-41-generic.
NVIDIA driver 319.17
BOINC 7.0.65
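The run-time figures in this post can be put side by side with a few lines of arithmetic. This is just a sketch using the numbers given above; the "remaining" value is BOINC's own estimate, so the projected total is only as good as that estimate, and the function names are made up for illustration.

```python
def projected_total_hours(elapsed_h: float, remaining_h: float) -> float:
    """Projected total run time from elapsed time plus estimated time left."""
    return elapsed_h + remaining_h


def fraction_done(elapsed_h: float, remaining_h: float) -> float:
    """Share of the projected total that has already been completed."""
    return elapsed_h / projected_total_hours(elapsed_h, remaining_h)


noelia_total = projected_total_hours(10.0, 17.0)  # figures from this post
sdoerr_total = 18.0                                # previous SDOERR run time

print(f"Projected NOELIA total: {noelia_total:.0f} h")
print(f"Fraction done so far:   {fraction_done(10.0, 17.0):.0%}")
print(f"NOELIA vs SDOERR:       {noelia_total / sdoerr_total:.1f}x as long")
```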

Beyond
Joined: 23 Nov 08
Posts: 1112
Credit: 6,162,416,256
RAC: 0
Level: Tyr
Message 30293 - Posted: 24 May 2013, 14:54:04 UTC - in response to Message 30286.  

My GTX 650Ti is currently working on a NOELIA, but the GPU utilization looks pretty low: elapsed 10h, remaining 17h. That will be a total of 27 hours! A previous SDOERR took 18h.

I use Linux and so can't observe GPU utilization directly, but judging by the temperature (50C), the GPU is clearly not being fully used. It gets to >60C when it is.

Has anyone else observed this?

On my 650 Ti GPUs (and others) the GPU utilization runs 5-6% lower on NOELIA and NATHAN_KID WUs than on SDOERR WUs. (Win7-64)

©2025 Universitat Pompeu Fabra