New CPU work units

Jacob Klein

Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 38545 - Posted: 16 Oct 2014, 14:13:50 UTC - in response to Message 38544.  
Last modified: 16 Oct 2014, 14:17:59 UTC

For the last couple of days I've had two GPU tasks and one CPUMD task running in high priority, and up until now all ran with no issues. Just now, seemingly at random, BOINC decided to kill one of the GPU tasks, sending it to "waiting to run" mode. If I suspend the CPUMD task, both GPU tasks will run. Allowing the CPUMD task to run shuts out a GPU task.



Read here: http://www.gpugrid.net/forum_thread.php?id=3898&nowrap=true#38505

It's not random.

When your GPU tasks switched out of "high priority" (deadline panic) mode, they also became lower on the food chain of client task scheduling. Instead of order 1 (where they were scheduled before the MT task) they became order 3 (scheduled after the MT task). And then, since the scheduler will only schedule up to ncpus+1, that is why only 1 GPU task is presently scheduled, instead of both (assuming each of your GPU tasks is budgeted to use 0.5 or more CPU also).

Not random at all. Working as designed, correctly...
... given the circumstances of the GPUGrid MT task estimates being completely broken.
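
The ordering effect described above can be illustrated with a toy model. This is only a sketch of the rule as stated in this post (tasks are admitted in scheduling order until the CPU budget of ncpus+1 is used up), not BOINC's actual scheduler code. The `Task` class, the `schedule` function, the 4-CPU host, and the per-task CPU budgets (4.0 for the MT task, 0.9 for each GPU task) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    cpus: float  # CPU budget the client reserves for this task (assumed values)


def schedule(tasks_in_order, ncpus):
    """Admit tasks in scheduling order while total CPU stays <= ncpus + 1."""
    limit = ncpus + 1
    used = 0.0
    running, waiting = [], []
    for task in tasks_in_order:
        if used + task.cpus <= limit:
            running.append(task.name)
            used += task.cpus
        else:
            waiting.append(task.name)  # would show as "waiting to run"
    return running, waiting


# Hypothetical 4-CPU host after the GPU tasks leave high priority:
# the MT task is now considered before both GPU tasks.
order = [Task("cpumd_mt", 4.0), Task("gpu_a", 0.9), Task("gpu_b", 0.9)]
running, waiting = schedule(order, ncpus=4)
print(running)  # ['cpumd_mt', 'gpu_a']
print(waiting)  # ['gpu_b']
```

In this model, suspending the MT task removes its 4.0 CPUs from the list, so both GPU tasks fit again, which matches the behaviour reported in the quoted message.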
ID: 38545 · Rating: 0
eXaPower

Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Message 38547 - Posted: 16 Oct 2014, 15:15:09 UTC - in response to Message 38545.  
Last modified: 16 Oct 2014, 15:23:12 UTC

Jacob, one GPU task has been running for 37 hours straight in high priority mode, another GPU task for 22 hours straight in high priority, and one CPUMD task for 24 hours straight in high priority. During this time I haven't added any tasks to the cache. If all three tasks were already running in high priority mode (order 1 or 3; is there a way to find out which?), why did BOINC kick one out after all this time? Since the very beginning these three tasks have been in high priority, and I haven't changed any BOINC scheduler settings or the allowed CPU usage. I had a similar issue when a CPUMD task was in the cache, so I've stopped letting any task sit in the cache, keeping only tasks that can compute on the available GPU/CPU.

If I suspend the CPUMD task, both GPU tasks run, one in high priority and the other not. Suspending the CPUMD task also switches one GPU task that was in high priority to non-high priority. When the CPUMD task is running alongside one GPU task and I suspend the task that is in "waiting to run", the running GPU task leaves high priority mode.
ID: 38547 · Rating: 0
Jacob Klein

Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 38548 - Posted: 16 Oct 2014, 15:27:44 UTC - in response to Message 38547.  
Last modified: 16 Oct 2014, 15:34:59 UTC

"High priority mode" for a task means that "Presently, if tasks were scheduled in a FIFO order in the round-robin scheduler, the given task will not make deadline. We need to prioritize it to be run NOW." The UI should show you whether the task is in "High Priority" mode, on the Tasks tab, in the Status column.

A task can move out of "High priority mode" when the round-robin simulation indicates that it WOULD make deadline. When tasks are suspended/resumed/downloaded, when progress percentages get updated, when running estimates get adjusted (as tasks progress), when the computer's on_frac and active_frac and gpu_active_frac values change ... the client re-evaluates all tasks to determine which ones need to be "High priority" or not.
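
As a rough illustration of the deadline check described above, here is a minimal sketch that runs tasks one after another in FIFO order and flags any whose simulated finish time falls past its deadline. It is a deliberate simplification: the real client simulates several processors and factors in values such as on_frac. The function name `deadline_misses` and the sample numbers are made up for illustration.

```python
def deadline_misses(tasks):
    """tasks: list of (name, remaining_seconds, seconds_until_deadline), in FIFO order.

    Returns the names of tasks whose simulated finish time is past their deadline.
    """
    clock = 0.0
    missed = []
    for name, remaining, deadline in tasks:
        clock += remaining  # tasks run one after another in this toy model
        if clock > deadline:
            missed.append(name)
    return missed


# Made-up numbers: task "b" finishes at 10800 s but is due at 9000 s.
before = deadline_misses([("a", 3600, 7200), ("b", 7200, 9000)])
# After "b" has progressed and its remaining estimate shrinks, it fits again.
after = deadline_misses([("a", 3600, 7200), ("b", 5000, 9000)])
print(before)  # ['b']
print(after)   # []
```

The second call shows why a task can drop out of high priority without any user action: a changed estimate alone is enough to flip the result of the simulation.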

Did you read the information in the links that were in my post? They're useful. After reading that information, do you still think the client scheduler is somehow broken?

Also, you can turn on some cc_config flags to see extra output in the Event Log. Specifically, you could investigate rr_simulation, rrsim_detail, cpu_sched, cpu_sched_debug, or coproc_debug. I won't be able to explain the output, but you could probably infer the meaning of some of it.
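
For reference, a minimal sketch of a cc_config.xml that switches on the log flags named above, assuming the usual `<log_flags>` section of the BOINC client configuration file. The file goes in the BOINC data directory, and the client has to re-read its config files (or be restarted) before the flags take effect. Enable only the flags you need, since some of them are very verbose.

```xml
<cc_config>
  <log_flags>
    <rr_simulation>1</rr_simulation>
    <rrsim_detail>1</rrsim_detail>
    <cpu_sched>1</cpu_sched>
    <cpu_sched_debug>1</cpu_sched_debug>
    <coproc_debug>1</coproc_debug>
  </log_flags>
</cc_config>
```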
ID: 38548 · Rating: 0
eXaPower

Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Message 38549 - Posted: 16 Oct 2014, 17:49:30 UTC - in response to Message 38548.  
Last modified: 16 Oct 2014, 18:10:23 UTC

Some cc_config flag output. BOINC thinks I'm going to miss the deadline for the CPUMD task (1138 hr remaining estimate):
14/10/16 13:34:52 | GPUGRID | [cpu_sched_debug] 5146-MJHARVEY_CPUDHFR-0-1-RND3131_0 sched state 2 next 2 task state 1
BOINC says the CPUMD task is 20% complete after 24 hr; the progress file is at 3.5 million steps.

BOINC runs the NOELIA unfold task (97% complete, 18 hr estimated remaining) in high priority while the CPUMD task is running:
14/10/16 13:33:52 | GPUGRID | [cpu_sched_debug] unfoldx5-NOELIA_UNFOLD-19-72-RND4631_0 sched state 2 next 2 task state 1
At the same time it boots the task that BOINC thinks will miss its deadline, the SDOERR task (63% complete, 174 hr remaining estimate):
14/10/16 13:33:52 | GPUGRID | [cpu_sched_debug] I1R119-SDOERR_BARNA5-38-100-RND1580_0 sched state 1 next 1 task state 0

Here are some newer task states that have changed:
14/10/16 13:43:13 | GPUGRID | [cpu_sched_debug] 5146-MJHARVEY_CPUDHFR-0-1-RND3131_0 sched state 1 next 1 task state 0

14/10/16 13:47:13 | GPUGRID | [cpu_sched_debug] unfoldx5-NOELIA_UNFOLD-19-72-RND4631_0 sched state 2 next 2 task state 1

14/10/16 13:47:13 | GPUGRID | [cpu_sched_debug] I1R119-SDOERR_BARNA5-38-100-RND1580_0 sched state 2 next 2 task state 1

14/10/16 13:56:05 | GPUGRID | [rr_sim] 24011.34: unfoldx5-NOELIA_UNFOLD-19-72-RND4631_0 finishes (0.90 CPU + 1.00 NVIDIA GPU) (721404.58G/30.04G)

14/10/16 14:00:07 | GPUGRID | [rr_sim] 4404370.74: 5146-MJHARVEY_CPUDHFR-0-1-RND3131_0 finishes (4.00 CPU) (54297244.54G/12.33G)

14/10/16 13:56:05 | GPUGRID | [rr_sim] 658381.65: I1R119-SDOERR_BARNA5-38-100-RND1580_0 finishes (0.90 CPU + 1.00 NVIDIA GPU) (19780638.18G/30.04G)
14/10/16 13:56:05 | GPUGRID | [rr_sim] I1R119-SDOERR_BARNA5-38-100-RND1580_0 misses deadline by 348785.46
14/10/16 13:58:05 | GPUGRID | [cpu_sched_debug] skipping GPU job I1R119-SDOERR_BARNA5-38-100-RND1580_0; CPU committed

14/10/16 13:59:05 | GPUGRID | [cpu_sched_debug] unfoldx5-NOELIA_UNFOLD-19-72-RND4631_0 sched state 2 next 2 task state 1

14/10/16 13:59:05 | GPUGRID | [cpu_sched_debug] I1R119-SDOERR_BARNA5-38-100-RND1580_0 sched state 1 next 1 task state 0

14/10/16 13:59:05 | GPUGRID | [cpu_sched_debug] 5146-MJHARVEY_CPUDHFR-0-1-RND3131_0 sched state 2 next 2 task state 1

Now all three tasks are running with new task states after being rescheduled (I downloaded a new Long task):
14/10/16 14:10:40 | GPUGRID | [cpu_sched_debug] unfoldx5-NOELIA_UNFOLD-19-72-RND4631_0 sched state 2 next 2 task state 1

14/10/16 14:10:40 | GPUGRID | [cpu_sched_debug] I1R119-SDOERR_BARNA5-38-100-RND1580_0 sched state 2 next 2 task state 1

14/10/16 14:10:40 | GPUGRID | [cpu_sched_debug] 5146-MJHARVEY_CPUDHFR-0-1-RND3131_0 sched state 2 next 2 task state 1
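
To make the [rr_sim] lines above easier to read, here is a small sketch that pulls the numbers out of a "finishes" line and converts them to hours. It assumes the leading number is the simulated finish time in seconds from now, and that the trailing "(…G/…G)" pair is remaining work in GFLOP over the assumed speed in GFLOPS; that reading is an interpretation of the log format, not something stated in this thread. The names `RR_SIM` and `parse_rr_sim` are made up for illustration.

```python
import re

# Matches lines such as:
# ... [rr_sim] 658381.65: TASKNAME finishes (0.90 CPU + 1.00 NVIDIA GPU) (19780638.18G/30.04G)
RR_SIM = re.compile(
    r"\[rr_sim\]\s+(?P<finish>[\d.]+):\s+(?P<task>\S+)\s+finishes\s+"
    r"\(.*?\)\s+\((?P<work>[\d.]+)G/(?P<rate>[\d.]+)G\)"
)


def parse_rr_sim(line):
    """Return task name and time estimates in hours, or None if the line does not match."""
    m = RR_SIM.search(line)
    if m is None:
        return None
    finish = float(m["finish"])
    work = float(m["work"])
    rate = float(m["rate"])
    return {
        "task": m["task"],
        "finish_hours": finish / 3600,                # simulated finish time
        "work_over_rate_hours": work / rate / 3600,   # cross-check: work divided by speed
    }


line = (
    "14/10/16 13:56:05 | GPUGRID | [rr_sim] 658381.65: "
    "I1R119-SDOERR_BARNA5-38-100-RND1580_0 finishes "
    "(0.90 CPU + 1.00 NVIDIA GPU) (19780638.18G/30.04G)"
)
info = parse_rr_sim(line)
print(info["task"], round(info["finish_hours"], 1))
```

For the SDOERR line this gives roughly 183 hours under both calculations, which is in the same range as the 174 hr remaining estimate quoted earlier in this post.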
ID: 38549 · Rating: 0
eXaPower

Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Message 38585 - Posted: 20 Oct 2014, 12:14:34 UTC
Last modified: 20 Oct 2014, 12:16:48 UTC

ID: 38585 · Rating: 0
d_a_dempsey

Joined: 18 Dec 09
Posts: 6
Credit: 1,046,736,560
RAC: 0
Message 38632 - Posted: 22 Oct 2014, 13:03:58 UTC

I have a problem with the Test application for CPU MD work units. This is obviously a test setup, according to both application name and this discussion thread, and the work units are being pushed to my machines even though my profile is set to not receive WUs from test applications.

I'm happy to do GPU computing for you guys, but I'm not willing to let you take over complete machines for days. Please control your app to respect the "Run test applications?" setting in our profiles.

Thank you,

David
ID: 38632 · Rating: 0
MJH

Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 38634 - Posted: 22 Oct 2014, 13:44:00 UTC - in response to Message 38632.  

Hm, sorry about that. Should only be going to machines opted in to test WUs.
I should point out the app is close to production - the main remaining problem with it is the ridiculous runtime estimates the client is inexplicably generating.

Matt
ID: 38634 · Rating: 0
eXaPower

Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Message 38724 - Posted: 28 Oct 2014, 11:57:57 UTC - in response to Message 38634.  

Are the working SSE2 CPUMD tasks on vacation? Were the returned results incomplete or invalid? 10000 tasks disappeared.
From the look of the BOINC stats and GPUGRID graphs, a decent number of new users' CPU-only machines were added, with credit awarded.
ID: 38724 · Rating: 0
sis651

Joined: 25 Nov 13
Posts: 66
Credit: 282,724,028
RAC: 69
Message 38731 - Posted: 28 Oct 2014, 22:22:33 UTC

I got some CPU work units to test, but I had a problem with them. Currently I'm crunching some AVX units, and I crunched non-AVX/SSE2 units before.
My problem is that when I pause the units and restart BOINC, none of the CPU work units resume from their last progress. They start crunching from the beginning. In an area with short but frequent blackouts, it's not possible to run these CPU units.
ID: 38731 · Rating: 0
boinc127

Joined: 31 Aug 13
Posts: 11
Credit: 7,952,212
RAC: 0
Message 38732 - Posted: 28 Oct 2014, 23:59:14 UTC - in response to Message 38731.  

I believe the project admins dumped the AVX mt program because of some flaws in it. When I ran the AVX program I also noticed the program never checkpointed.

From MJH in another post:

The buggy Windows AVX app is gone now. Please abort any instances of it still running. It's replaced with the working SSE2 app.


http://www.gpugrid.net/forum_thread.php?id=3812&nowrap=true#38680

For now at least, there are no other CPU beta workunits to test. I guess the project admins will revise and replace the workunits when they are ready and able to.
ID: 38732 · Rating: 0
MJH

Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 38737 - Posted: 29 Oct 2014, 7:52:00 UTC - in response to Message 38731.  

I got some CPU work units to test, but I had a problem with them. Currently I'm crunching some AVX units, and I crunched non-AVX/SSE2 units before.


Make sure that the application executable that you are running has "sse2" in its name, not "avx". Manually delete the old AVX app binary from the project directory if necessary.

MJH
ID: 38737 · Rating: 0
eXaPower

Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Message 38741 - Posted: 29 Oct 2014, 12:57:50 UTC - in response to Message 38737.  
Last modified: 29 Oct 2014, 13:00:03 UTC

Received 5 abandoned 9.03 "AVX" tasks. All are computing with SSE2, even with the AVX app binary in the directory. Checkpoints are working. BOINC client progress reporting is still off (at 70% with 3.7 million steps left to compute). The progress file is reporting computed steps properly.
ID: 38741 · Rating: 0
John C MacAlister

Joined: 17 Feb 13
Posts: 181
Credit: 144,871,276
RAC: 0
Message 38768 - Posted: 30 Oct 2014, 14:30:22 UTC

Hola, Amigos en Barcelona!

No CPU tasks received: are there any available?

Thanks!

John
ID: 38768 · Rating: 0
Astiesan

Joined: 8 Jun 10
Posts: 3
Credit: 1,209,302,653
RAC: 29,582
Message 38777 - Posted: 1 Nov 2014, 0:38:29 UTC
Last modified: 1 Nov 2014, 0:42:26 UTC

mdrun-463-901-sse-32 occasionally causes a soft system freeze when going from the active state into the sleeping state, i.e. screensaver off to on.

By soft system freeze, I mean that all parts of the start bar/menu are locked (I do use start8, but it's confirmed to occur without it active as well). Windows-R can bring up the Run menu, and I can use cmd and taskkill mdrun, after which the start menu itself returns to normal; however, the bar continues to be unresponsive. Killing explorer.exe to reset the start bar results in a hard freeze requiring a reboot. During the soft freeze, alt-tab and other windows are VERY slow to respond until mdrun is killed. Afterwards all other windows work fine, but the start bar is unusable and forces a reboot of the system.

There is nothing in the error logs.

Any assistance or ideas in resolving this would be appreciated.

My system:
Windows 8.1 64-bit
i7 4790K @ stock
ASRock Z97-Extreme4
EVGA GTX 970 SC ACX @ stock
2x8GB HyperX Fury DDR3-1866 @ stock
ID: 38777 · Rating: 0
tomba

Joined: 21 Feb 09
Posts: 497
Credit: 700,690,702
RAC: 0
Message 38781 - Posted: 1 Nov 2014, 10:20:52 UTC

I gave four cores of my AMD FX-8350 to the app. I've done four WUs, which all completed in a remarkably consistent time of just over 16 hours, with a-bit-mean 920 credits each.

I just checked the server status:



...and was a little surprised to see my 16 hours well under the minimum run time of 19.16 hours.

ID: 38781 · Rating: 0
eXaPower

Joined: 25 Sep 13
Posts: 293
Credit: 1,897,601,978
RAC: 0
Message 38782 - Posted: 1 Nov 2014, 11:12:59 UTC - in response to Message 38781.  
Last modified: 1 Nov 2014, 11:26:59 UTC

A current CPUMD task is 2.5 million steps, not 5 million as with prior tasks. Maybe this is why the credit awarded is lower? All four of the tasks you completed were 2.5 million steps.
ID: 38782 · Rating: 0
tomba

Joined: 21 Feb 09
Posts: 497
Credit: 700,690,702
RAC: 0
Message 38784 - Posted: 1 Nov 2014, 12:09:50 UTC - in response to Message 38782.  

A current CPUMD task is 2.5 million steps, not 5 million as with prior tasks. Maybe this is why the credit awarded is lower? All four of the tasks you completed were 2.5 million steps.

I did complete this 5M-step WU on 24 October and got 3342 credits...
ID: 38784 · Rating: 0
MJH

Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 38785 - Posted: 1 Nov 2014, 12:17:08 UTC - in response to Message 38781.  

Yes, the credit allocation is wrong - need to work out how to fix that.

Matt
ID: 38785 · Rating: 0
tomba

Joined: 21 Feb 09
Posts: 497
Credit: 700,690,702
RAC: 0
Message 38786 - Posted: 1 Nov 2014, 12:24:51 UTC - in response to Message 38785.  

Yes, the credit allocation is wrong - need to work out how to fix that.

Matt

A fixed 2.5M per completion would be a nice 'n' easy solution ;)
ID: 38786 · Rating: 0
Jacob Klein

Joined: 11 Oct 08
Posts: 1127
Credit: 1,901,927,545
RAC: 0
Message 38787 - Posted: 1 Nov 2014, 12:51:29 UTC
Last modified: 1 Nov 2014, 12:58:05 UTC

I have completed 2 of the new (I think?) tasks, of application type "Test application for CPU MD v9.01 (mtsse2)", on my host (id: 153764), running 8 logical CPUs (4 cores hyperthreaded).

When I first got the tasks, I think the estimated run time was something like 4.5 hours. But then, after it completed the first task (which took way longer - it took 15.75 hours of run time), it realized it was wrong, and adjusted the estimated run times for the other tasks to be ~16 hours.

For each of the 2 completed tasks:
- Task size: 2.5 million steps
- Run Time: ~16.4 hours
- CPU Time: ~104 hours (My CPUs were slightly overcommitted by my own doing)
- Credit granted: ~3700
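
To put these figures next to the ones reported earlier in the thread, here is a short sketch that normalizes credit by task size. The numbers are taken from the posts above (920 and 3342 credits from tomba, ~3700 from this post); the helper name `credits_per_million_steps` is made up for illustration, and the comparison ignores differences in hardware and app version.

```python
def credits_per_million_steps(credit, million_steps):
    """Credit granted divided by task size in millions of MD steps."""
    return credit / million_steps


reports = [
    ("tomba, 2.5M-step WU", 920, 2.5),
    ("tomba, 5M-step WU", 3342, 5.0),
    ("Jacob, 2.5M-step WU", 3700, 2.5),
]
for label, credit, msteps in reports:
    rate = credits_per_million_steps(credit, msteps)
    print(f"{label}: {rate:.0f} credits per million steps")

# Average number of logical CPUs kept busy on this host: CPU time / run time.
avg_busy_cpus = 104 / 16.4
print(f"average busy CPUs: {avg_busy_cpus:.1f} of 8")
```

The per-step credit differs by about a factor of four between the lowest and highest report, which is consistent with the admin's remark that the credit allocation is wrong.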

I will continue to occasionally run these, to help you test, especially when new versions come out.

Regards,
Jacob
ID: 38787 · Rating: 0

©2025 Universitat Pompeu Fabra