acemdshort application 8.15 - discussion

MJH
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32745 - Posted: 5 Sep 2013, 21:41:37 UTC - in response to Message 32743.  


We still have some work to do on resolving the suspension of tasks. I've noticed that they continue running for up to 15 seconds even after being suspended.


That is a side effect of having more code enclosed in critical sections. As the article you found indicates, prompt termination on suspend requires the monitoring thread to wake up while the app thread is outside a critical region.
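
To illustrate the point about critical sections, here is a minimal Python sketch. It is not the BOINC library's actual code: the class `SuspendGate` and the function `ticks_until_suspended` are hypothetical names invented for this example. They model a monitoring thread that can only honour a suspend request on a tick where the app is outside a critical section.

```python
class SuspendGate:
    """Toy model: a suspend request is deferred while the app is in a critical section."""

    def __init__(self):
        self.critical_depth = 0        # nesting depth of critical sections
        self.suspend_requested = False
        self.suspended = False

    def begin_critical(self):
        self.critical_depth += 1

    def end_critical(self):
        if self.critical_depth == 0:
            raise RuntimeError("end_critical without begin_critical")
        self.critical_depth -= 1

    def request_suspend(self):
        self.suspend_requested = True

    def monitor_tick(self):
        """One wake-up of the (hypothetical) monitoring thread."""
        if self.suspend_requested and self.critical_depth == 0:
            self.suspended = True
        return self.suspended


def ticks_until_suspended(schedule):
    """Return the tick on which a pending suspend request is honoured.

    schedule is a list of booleans, one per monitor tick; True means the app
    is inside a critical section during that tick. Returns None if the app
    never leaves its critical sections within the schedule.
    """
    gate = SuspendGate()
    gate.request_suspend()
    for tick, in_critical in enumerate(schedule, start=1):
        if in_critical:
            gate.begin_critical()
        honoured = gate.monitor_tick()
        if in_critical:
            gate.end_critical()
        if honoured:
            return tick
    return None


if __name__ == "__main__":
    print(ticks_until_suspended([True, True, True, False]))
```

With a schedule in which the app spends three consecutive ticks inside critical sections, the request is only honoured on the fourth tick, which is the kind of delay described above.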

The only way this is going to get fixed is to change the dumb way the BOINC client lib bludgeons the app process to death, and give the app an opportunity to close down gracefully.
This will take a bit of work, but it's high on the Todo list.
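
As a rough idea of what a graceful close-down could look like, the following Python sketch has the app poll a quit flag between work steps and hand back its last checkpoint instead of being killed mid-step. It is an illustration under assumptions, not ACEMD or BOINC code: `run_simulation`, `quit_requested` and `on_step` are made-up names.

```python
import threading


def run_simulation(total_steps, quit_requested, on_step=None):
    """Run up to total_steps work steps, stopping cleanly if asked to quit.

    quit_requested is a threading.Event that another thread (for example a
    hypothetical monitor) sets when the app should shut down. on_step is an
    optional callback invoked after each completed step.
    """
    checkpoint = 0
    for step in range(1, total_steps + 1):
        if quit_requested.is_set():
            # Stop between steps, so the checkpoint is always consistent.
            return {"finished": False, "checkpoint": checkpoint}
        # ... one unit of real work would go here ...
        checkpoint = step
        if on_step is not None:
            on_step(step)
    return {"finished": True, "checkpoint": checkpoint}


if __name__ == "__main__":
    stop = threading.Event()
    # Simulate a quit request arriving right after step 3 completes.
    result = run_simulation(10, stop, on_step=lambda s: stop.set() if s == 3 else None)
    print(result)
```

The design choice here is that the app, not the client, decides when it is safe to stop, so any pending I/O can finish first.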

MJH

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 32746 - Posted: 5 Sep 2013, 21:43:54 UTC - in response to Message 32745.  
Last modified: 5 Sep 2013, 21:44:56 UTC

Thanks.
Regarding
http://boinc.berkeley.edu/trac/changeset/b98bc309cceccf95b9fac578c47cbea06a8b8150/boinc-v2

... It looked like a simple-to-moderate code change that just changes the way suspension works with the critical sections. It looks very applicable to making our suspend requests run more smoothly, and I hope it isn't hard to implement. (I don't know much about where the API code comes into play, but if it's just "a piece that's included when building apps", then maybe it'll be pretty easy for you to "hook it in".)

MJH
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32747 - Posted: 5 Sep 2013, 21:50:25 UTC - in response to Message 32746.  

Jacob,

That fix would already have been included in 8.12 when I updated to the latest boinc library revision.

Matt

Jacob Klein
Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 32749 - Posted: 5 Sep 2013, 21:52:29 UTC - in response to Message 32747.  
Last modified: 5 Sep 2013, 21:55:31 UTC

Well, for my situation there, it was an 8.11 that caused the problem.
:) I'll keep testing, and hopefully it works even better in the already-released 8.12 and 8.13

Thanks for making progress - I really do appreciate it!!

Edit: 8.13 is suspending/resuming VERY nicely. I can't wait to have 8.13 running on a NOELIA_KLEBE task (to test it!)

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 32755 - Posted: 5 Sep 2013, 22:13:06 UTC - in response to Message 32749.  

I tested it on a NOELIA_INS task in order to get a beta. Suspending and restarting worked. (I didn't get a beta WU, though, as it knew that a task was suspended :( )
Greetings from TJ

Richard Haselgrove
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 295,172
Message 32757 - Posted: 5 Sep 2013, 23:23:36 UTC - in response to Message 32745.  

The only way this is going to get fixed is to change the dumb way the BOINC client lib bludgeons the app process to death, and give the app an opportunity to close down gracefully.
This will take a bit of work, but it's high on the Todo list.

MJH

You have allies in the BOINC community. Eric Korpela of SETI@home wrote (on 13 Nov 2008 - unfortunately in a private forum I can't link):

Yes, the terminate with no mercy policy sucks and we should find if there is a way around it, or at least a way to allow I/O to finish.

About time we got round to fixing that...

nanoprobe
Joined: 26 Feb 12
Posts: 184
Credit: 222,376,233
RAC: 0
Message 32760 - Posted: 5 Sep 2013, 23:44:31 UTC

First MJHarvey_Crash beta just finished.
http://www.gpugrid.net/result.php?resultid=7251789

The only betas I can't get to finish are the Noelia_Klebe ones:
http://www.gpugrid.net/result.php?resultid=7248385

Could the Noelia_Klebes be troublesome because my card is a PE and ramps up to 1200MHz on the core when crunching? I just wondered, because it runs at 1200MHz for all the other tasks too. I've also noticed that there is a huge discrepancy between GPU and CPU times on these failed tasks, whereas all the others show GPU and CPU times that are very close.

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 32766 - Posted: 6 Sep 2013, 7:31:07 UTC - in response to Message 32760.  

Your card mostly ran at 58°C, so it wasn't overly taxed by the Noelia_Klebe WU.
My GTX660Ti also clocks up to ~1200MHz. That said, I also get the odd error from it and other similar cards.

The Noelia_Klebe WUs don't use a full CPU core/thread in the same way most other WUs do. This has been the case since they were first released.

MJH
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 32815 - Posted: 6 Sep 2013, 20:51:21 UTC

acemdshort is now updated to 8.14. This version has improved stability during suspend/resume.

MJH

nanoprobe
Joined: 26 Feb 12
Posts: 184
Credit: 222,376,233
RAC: 0
Message 32817 - Posted: 7 Sep 2013, 1:06:31 UTC

My GTX660Ti also clocks up to ~1200MHz. That said I also get the odd error from it and other similar cards.

I wish it were just an odd now-and-then error. I haven't had one NOELIA_KLEBE beta complete and validate yet. They all end with the "time exceeded" error after running for an hour or so.

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 32826 - Posted: 8 Sep 2013, 14:02:19 UTC

I had one CRASH test overnight that took 22,493.99 seconds to complete. Checking the system showed that it was running at half the GPU's core clock, so that is the explanation. However, the stderr report gives no reason for it; the core clock was reported there as it should be, 1058MHz. I rebooted the system and all is normal again.
I have seen reduced clock speeds before, but that was after an error or an ACEMD crash; it is new that it happened without any errors.
Greetings from TJ

Carlesa25
Joined: 13 Nov 10
Posts: 328
Credit: 72,619,453
RAC: 0
Message 32846 - Posted: 8 Sep 2013, 17:55:34 UTC

Hello: 8.14 tasks are running at low GPU load (<60%) on my GTX 770, and the load is also very unstable, varying by more than ±10%.

The CPU runs smoothly, but the result is that tasks take twice as long as necessary; a short task takes about four hours...?

skgiven
Volunteer moderator
Volunteer tester
Joined: 23 Apr 09
Posts: 3968
Credit: 1,995,359,260
RAC: 0
Message 32848 - Posted: 8 Sep 2013, 18:10:52 UTC - in response to Message 32826.  

TJ, I guess you are referring to this WU,

Task: 194-MJHARVEY_CRASH1-1-25-RND6694_0
Work unit: 4759387
Sent: 7 Sep 2013, 14:08:27 UTC
Reported: 8 Sep 2013, 0:59:20 UTC
Status: Completed and validated
Run time: 22,493.99 s
CPU time: 22,461.82 s
Credit: 18,750.00
Application: ACEMD beta version v8.14 (cuda55)

When a WU doesn't use the GPU enough, it can cause the GPU to downclock. The temps were only 52°C, while your other runs on similar WUs had temps rising to 67°C. A 15°C drop sounds about right for a downclock.

Perhaps a mechanism to report changes in core clock, as well as temp, would be useful (if it's not too late)!
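
A minimal sketch of what such clock-change reporting might look like, assuming the app already had a series of core-clock readings to work from. The function name `clock_changes`, the tolerance value and the sample readings are all made up for illustration (the 529 figure is simply half of the 1058MHz mentioned earlier in the thread); nothing here reflects how ACEMD actually reads or logs clocks.

```python
def clock_changes(samples_mhz, tolerance_mhz=10):
    """Return (sample_index, old_clock, new_clock) for each significant change.

    A change is reported when a reading differs from the current reference
    clock by more than tolerance_mhz; the reference then moves to the new
    reading, so small jitter around a level is not reported.
    """
    changes = []
    if not samples_mhz:
        return changes
    reference = samples_mhz[0]
    for index, clock in enumerate(samples_mhz[1:], start=1):
        if abs(clock - reference) > tolerance_mhz:
            changes.append((index, reference, clock))
            reference = clock
    return changes


if __name__ == "__main__":
    # Hypothetical readings: normal clock, a drop to half speed, then recovery.
    readings = [1058, 1058, 1060, 529, 529, 1058]
    for index, old, new in clock_changes(readings):
        print(f"sample {index}: core clock changed from {old}MHz to {new}MHz")
```

Logging a line like this next to the temperature readings would have made a silent downclock visible in stderr.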

Carlesa25
Joined: 13 Nov 10
Posts: 328
Credit: 72,619,453
RAC: 0
Message 32849 - Posted: 8 Sep 2013, 18:54:34 UTC - in response to Message 32848.  

Hello: Yes, that task completed, but the one now running, SANTI_MARwt2-4-25-uan RND4912_0, has a GPU load of <65%.

The GTX 770 is running at a 1254 MHz GPU clock without problems.
Temperature 55 °C, FB usage 20%, bus usage 7% (both fluctuating by ±2%).
Memory usage: 519 MB on the GPU.

Carlesa25
Joined: 13 Nov 10
Posts: 328
Credit: 72,619,453
RAC: 0
Message 32850 - Posted: 8 Sep 2013, 19:31:40 UTC - in response to Message 32849.  
Last modified: 8 Sep 2013, 19:40:37 UTC

Hello: Regarding the low GPU usage: if that is simply how these tasks work, the solution would be to run two tasks on the GPU at once to reach maximum load.

It would be useful if those responsible could confirm this, so that we can decide how to handle these tasks.

NOTE: I tried running two 8.14 tasks on the same GTX 770, and the total load went from 55% to 70% (±5%), with 777 MB of memory in use, FB 22%, bus 8% and a 1254 MHz GPU clock.

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 32851 - Posted: 8 Sep 2013, 19:48:48 UTC - in response to Message 32848.  

TJ, I guess you are referring to this WU,

194-MJHARVEY_CRASH1-1-25-RND6694_0 4759387 7 Sep 2013 | 14:08:27 UTC 8 Sep 2013 | 0:59:20 UTC Completed and validated 22,493.99 22,461.82 18,750.00 ACEMD beta version v8.14 (cuda55)

Yes, skgiven, that is the one.
Later this morning I had one error, but that did not downclock the core.
But as these CRASH tests are Santi's SR, and I used to get a lot of errors with those, my error rate has dropped significantly.
Greetings from TJ

Carlesa25
Joined: 13 Nov 10
Posts: 328
Credit: 72,619,453
RAC: 0
Message 32860 - Posted: 9 Sep 2013, 10:30:32 UTC - in response to Message 32850.  

Hello: Sorry... my problems with low GPU load on 8.14 resulted from a corrupted driver. After reinstalling it, the issue is solved and GPU load is back to normal, around 85%.

TJ
Joined: 26 Jun 09
Posts: 815
Credit: 1,470,385,294
RAC: 0
Message 32861 - Posted: 9 Sep 2013, 11:35:31 UTC - in response to Message 32860.  
Last modified: 9 Sep 2013, 11:36:48 UTC

My Asus 770 runs at a steady 91-92% GPU load with a core clock of 1097MHz, although I have it set to 1060MHz. This is with a Nathan WU, so obviously not the 8.14 app yet.
Greetings from TJ

MJH
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 33143 - Posted: 22 Sep 2013, 11:43:27 UTC

The server should now once again be dishing out Short tasks to Linux clients.

MJH

MJH
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 34154 - Posted: 8 Dec 2013, 15:34:24 UTC

I've promoted the 8.15 beta application to the short queue.
This version has a workaround to catch tasks that repeatedly fail, necessitating a machine reset.
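
The post does not say how the workaround detects such tasks. One simple way to catch repeated failures, shown here purely as an assumption, is to count the consecutive failed attempts at the end of a task's history and give up once a threshold is reached. The function `should_abort` and the threshold of three are hypothetical, not taken from the 8.15 application.

```python
def should_abort(history, max_consecutive_failures=3):
    """Decide whether a task should stop being retried.

    history is a list of booleans in chronological order, True for a
    successful attempt and False for a failed one. The task is aborted when
    the most recent attempts are all failures and there are at least
    max_consecutive_failures of them.
    """
    streak = 0
    for succeeded in reversed(history):
        if succeeded:
            break
        streak += 1
    return streak >= max_consecutive_failures


if __name__ == "__main__":
    print(should_abort([True, False, False, False]))
    print(should_abort([False, False, True, False]))
```

Counting only the trailing failures means that a single success resets the streak, so a task that recovers is not aborted for old errors.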

Matt

©2025 Universitat Pompeu Fabra