Message boards : News : New CPU work units
Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0
For the last couple of days I've had two GPU tasks and one CPUMD task running in high priority, and until now all ran with no issues. Just now, seemingly at random, BOINC decided to kick one of the GPU tasks out, sending it to "waiting to run". If I suspend the CPUMD task, both GPU tasks will run. Allowing the CPUMD task to run shuts down a GPU task.

Read here: http://www.gpugrid.net/forum_thread.php?id=3898&nowrap=true#38505

It's not random. When your GPU tasks switched out of "high priority" (deadline panic) mode, they also became lower on the food chain of client task scheduling: instead of order 1 (where they were scheduled before the MT task) they became order 3 (scheduled after the MT task). And since the scheduler will only schedule up to ncpus + 1, only one GPU task is presently scheduled instead of both (assuming each of your GPU tasks is also budgeted to use 0.5 CPU or more). Not random at all. Working as designed, correctly... given the circumstances of the GPUGrid MT task estimates being completely broken.
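To make that ordering concrete, here is a toy sketch (illustrative Python, not the actual BOINC client code; the host size, task names, and per-task CPU budgets are made-up values): tasks are considered in scheduling order, and a task is skipped once committing it would push total CPU usage past ncpus + 1.

```python
# Toy model of the behaviour described above, NOT the real BOINC client:
# tasks are considered in scheduling order, and a task is skipped once
# committing it would push total CPU usage past ncpus + 1.
ncpus = 4  # made-up host size for illustration

# (name, scheduling order, CPUs budgeted per task) - illustrative values only
tasks = [
    ("CPU MT task", 1, 4.0),  # MT task at order 1
    ("GPU task A",  3, 0.9),  # GPU tasks dropped to order 3
    ("GPU task B",  3, 0.9),
]

committed = 0.0
for name, order, cpus in sorted(tasks, key=lambda t: t[1]):
    if committed + cpus > ncpus + 1:
        print(f"skipping {name}: CPU committed")
        continue
    committed += cpus
    print(f"scheduling {name} (CPUs committed: {committed:.1f})")
```

With these numbers the MT task and one GPU task get scheduled (4.0 + 0.9 = 4.9 CPUs), and the second GPU task is skipped because it would exceed the ncpus + 1 budget, which is the behaviour described above.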
Joined: 25 Sep 13 · Posts: 293 · Credit: 1,897,601,978 · RAC: 0
For the last couple of days I've had two GPU tasks and one CPUMD task running in high priority, and until now all ran with no issues. Just now, seemingly at random, BOINC decided to kick one of the GPU tasks out, sending it to "waiting to run". If I suspend the CPUMD task, both GPU tasks will run. Allowing the CPUMD task to run shuts down a GPU task.

Jacob, one GPU task has been running for 37 hours straight in high priority mode, another GPU task for 22 hours straight in high priority, and one CPUMD task for 24 straight hours in high priority mode. During this time I haven't added any tasks to the cache. If all three tasks were already running in high priority (order 1 or 3; is there a way to find out which?), why did BOINC kick one out after all this time? From the very beginning these three tasks have been in high priority, and I haven't changed any BOINC scheduler or allowed-CPU-usage settings. I had a similar issue when a CPUMD task was sitting in the cache, so I've stopped letting any task sit in the cache and only keep tasks that can compute on the available GPU/CPU.

If I suspend the CPUMD task, both GPU tasks will run, with one in high priority and the other not. If I suspend the CPUMD task, the one GPU task that was in high priority changes to non-high priority. When the CPUMD task is running alongside one GPU task, and the task that is "waiting to run" is suspended, the running GPU task drops out of high priority mode.
Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0
"High priority mode" for a task means that "Presently, if tasks were scheduled in a FIFO order in the round-robin scheduler, the given task will not make deadline. We need to prioritize it to be ran NOW." It should show you, in the UI, if the task is in "High Priority" mode, on that Tasks tab, in the Status column. A task can move out of "High priority mode" when the round-robin simulation indicates that it WOULD make deadline. When tasks are suspended/resumed/downloaded, when progress percentages get updated, when running estimates get adjusted (as tasks progress), when the computers on_frac and active_frac and gpu_active_frac values change ... the client re-evaluates all tasks to determine which ones need to be "High priority" or not. Did you read the information in the links that were in my post? They're useful. After reading that information, do you still think the client scheduler is somehow broken? Also, you can turn on some cc_config flags to see extra output in Event Log... specifically, you could investigate rr_simulation, rrsim_detail, cpu_sched, cpu_sched_debug, or coproc_debug. I won't be able to explain the output, but you could probably infer the meaning of some of it. |
Joined: 25 Sep 13 · Posts: 293 · Credit: 1,897,601,978 · RAC: 0
Some cc_config flag output. BOINC thinks I'm going to miss the deadline for the CPUMD task (1138 hr remaining estimate):

14/10/16 13:34:52 | GPUGRID | [cpu_sched_debug] 5146-MJHARVEY_CPUDHFR-0-1-RND3131_0 sched state 2 next 2 task state 1

BOINC says the CPUMD task is 20% complete after 24 hr; the progress file is at 3.5 million steps.

BOINC will run the NOELIA_UNFOLD task (97% complete, 18 hr estimated remaining) in high priority while the CPUMD task is running:

14/10/16 13:33:52 | GPUGRID | [cpu_sched_debug] unfoldx5-NOELIA_UNFOLD-19-72-RND4631_0 sched state 2 next 2 task state 1

while booting out the task BOINC thinks will miss its deadline, the SDOERR task at 63% complete (174 hr remaining estimate):

14/10/16 13:33:52 | GPUGRID | [cpu_sched_debug] I1R119-SDOERR_BARNA5-38-100-RND1580_0 sched state 1 next 1 task state 0

Here are some newer task states that have changed:

14/10/16 13:43:13 | GPUGRID | [cpu_sched_debug] 5146-MJHARVEY_CPUDHFR-0-1-RND3131_0 sched state 1 next 1 task state 0
14/10/16 13:47:13 | GPUGRID | [cpu_sched_debug] unfoldx5-NOELIA_UNFOLD-19-72-RND4631_0 sched state 2 next 2 task state 1
14/10/16 13:47:13 | GPUGRID | [cpu_sched_debug] I1R119-SDOERR_BARNA5-38-100-RND1580_0 sched state 2 next 2 task state 1
14/10/16 13:56:05 | GPUGRID | [rr_sim] 24011.34: unfoldx5-NOELIA_UNFOLD-19-72-RND4631_0 finishes (0.90 CPU + 1.00 NVIDIA GPU) (721404.58G/30.04G)
14/10/16 14:00:07 | GPUGRID | [rr_sim] 4404370.74: 5146-MJHARVEY_CPUDHFR-0-1-RND3131_0 finishes (4.00 CPU) (54297244.54G/12.33G)
14/10/16 13:56:05 | GPUGRID | [rr_sim] 658381.65: I1R119-SDOERR_BARNA5-38-100-RND1580_0 finishes (0.90 CPU + 1.00 NVIDIA GPU) (19780638.18G/30.04G)
14/10/16 13:56:05 | GPUGRID | [rr_sim] I1R119-SDOERR_BARNA5-38-100-RND1580_0 misses deadline by 348785.46
14/10/16 13:58:05 | GPUGRID | [cpu_sched_debug] skipping GPU job I1R119-SDOERR_BARNA5-38-100-RND1580_0; CPU committed
14/10/16 13:59:05 | GPUGRID | [cpu_sched_debug] unfoldx5-NOELIA_UNFOLD-19-72-RND4631_0 sched state 2 next 2 task state 1
14/10/16 13:59:05 | GPUGRID | [cpu_sched_debug] I1R119-SDOERR_BARNA5-38-100-RND1580_0 sched state 1 next 1 task state 0
14/10/16 13:59:05 | GPUGRID | [cpu_sched_debug] 5146-MJHARVEY_CPUDHFR-0-1-RND3131_0 sched state 2 next 2 task state 1

Now the three tasks are all running with new task states after being rescheduled (I downloaded a new Long task):

14/10/16 14:10:40 | GPUGRID | [cpu_sched_debug] unfoldx5-NOELIA_UNFOLD-19-72-RND4631_0 sched state 2 next 2 task state 1
14/10/16 14:10:40 | GPUGRID | [cpu_sched_debug] I1R119-SDOERR_BARNA5-38-100-RND1580_0 sched state 2 next 2 task state 1
14/10/16 14:10:40 | GPUGRID | [cpu_sched_debug] 5146-MJHARVEY_CPUDHFR-0-1-RND3131_0 sched state 2 next 2 task state 1
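Reading the [rr_sim] lines above: the pair in parentheses appears to be remaining work and estimated processing rate (GFLOPs and GFLOPs/s), and the leading number is the simulated finish time in seconds. A rough sketch of the arithmetic, using the figures from the log (the interpretation of the log format is an assumption; the logged times differ slightly, presumably because the client also folds in availability fractions):

```python
# Approximate check of the [rr_sim] finish times: remaining GFLOPs divided
# by the estimated rate in GFLOPs/s, taken from the log lines above.
entries = [
    ("unfoldx5-NOELIA_UNFOLD",  721404.58,   30.04),  # logged: 24011.34 s
    ("I1R119-SDOERR_BARNA5",    19780638.18, 30.04),  # logged: 658381.65 s
    ("5146-MJHARVEY_CPUDHFR",   54297244.54, 12.33),  # logged: 4404370.74 s
]
for name, gflops_left, rate in entries:
    seconds = gflops_left / rate
    print(f"{name}: ~{seconds:,.0f} s (~{seconds / 3600:,.0f} h)")
```

The CPUMD task works out to roughly 1,200 hours of simulated remaining time, which is why the client treats it as a certain deadline miss despite the progress file showing normal step progress.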
Joined: 25 Sep 13 · Posts: 293 · Credit: 1,897,601,978 · RAC: 0
CPUMD tasks completed past the deadline: credit was still awarded.
http://www.gpugrid.net/workunit.php?wuid=10159833
http://www.gpugrid.net/workunit.php?wuid=10158842
Joined: 18 Dec 09 · Posts: 6 · Credit: 1,046,736,560 · RAC: 0
I have a problem with the "Test application for CPU MD" work units. This is obviously a test setup, according to both the application name and this discussion thread, yet the work units are being pushed to my machines even though my profile is set to not receive WUs from test applications. I'm happy to do GPU computing for you guys, but I'm not willing to let you take over complete machines for days. Please make the app respect the "Run test applications?" setting in our profiles. Thank you, David
Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0
Hm, sorry about that. Should only be going to machines opted in to test WUs. I should point out the app is close to production - the main remaining problem with it is the ridiculous runtime estimates the client is inexplicably generating. Matt |
Joined: 25 Sep 13 · Posts: 293 · Credit: 1,897,601,978 · RAC: 0
Are the working SSE2 CPUMD tasks on vacation? Were the returned results incomplete or invalid? 10,000 tasks disappeared. From the look of the BOINC stats and GPUGRID graphs, a decent number of new users' CPU-only machines were added, with credit awarded.
Joined: 25 Nov 13 · Posts: 66 · Credit: 282,724,028 · RAC: 62
I got some CPU work units to test, but I had a problem with them. Currently I'm crunching some AVX units; I crunched non-AVX/SSE2 units before. My problem is that when I paused the units and restarted BOINC, none of the CPU work units resumed from their last progress: they started crunching from the beginning. In an area with short but frequent blackouts it's not possible to run these CPU units.
Joined: 31 Aug 13 · Posts: 11 · Credit: 7,952,212 · RAC: 0
I believe the project admins dumped the AVX MT program because of some flaws in it. When I ran the AVX program I also noticed that it never checkpointed.

From MJH in another post: "The buggy Windows AVX app is gone now. Please abort any instances of it still running. It's replaced with the working SSE2 app."
http://www.gpugrid.net/forum_thread.php?id=3812&nowrap=true#38680

For now at least, there are no other CPU beta work units to test. I guess the project admins will revise and replace the work units when they are ready and able to.
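For context on why a non-checkpointing app loses progress: a science app normally writes its progress to a small file every so often and reads it back on startup, so a restart resumes where it left off. A minimal generic sketch (illustrative Python, not the actual GPUGRID/mdrun code; file name and step counts are made up):

```python
import os

CHECKPOINT_FILE = "checkpoint.txt"   # hypothetical progress file
TOTAL_STEPS = 2_500_000
CHECKPOINT_EVERY = 100_000

# Resume from the last checkpoint if one exists; otherwise start at step 0.
start = 0
if os.path.exists(CHECKPOINT_FILE):
    with open(CHECKPOINT_FILE) as f:
        start = int(f.read().strip())

for step in range(start, TOTAL_STEPS):
    # ... one step of simulation work would go here ...
    if (step + 1) % CHECKPOINT_EVERY == 0:
        with open(CHECKPOINT_FILE, "w") as f:
            f.write(str(step + 1))   # record progress so a restart resumes here
```

Without that write/read step, every restart begins at step 0, which matches the behaviour reported with the AVX build.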
Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0
I got some CPU work units to test, but I had a problem with them. Currently I'm crunching some AVX units; I crunched non-AVX/SSE2 units before.

Make sure that the application executable you are running has "sse2" in its name, not "avx". Manually delete the old AVX app binary from the project directory if necessary.

MJH
Joined: 25 Sep 13 · Posts: 293 · Credit: 1,897,601,978 · RAC: 0
Received 5 abandoned 9.03 "AVX" tasks. All are computing with SSE2 even with the AVX app binary in the directory. Checkpoints are working; BOINC client progress reporting is still off (at 70% with 3.7 million steps left to compute). The progress file is reporting the steps computed properly.
Joined: 17 Feb 13 · Posts: 181 · Credit: 144,871,276 · RAC: 0
Hello, friends in Barcelona! No CPU tasks received; are there any available? Thanks! John
Joined: 8 Jun 10 · Posts: 3 · Credit: 1,209,302,653 · RAC: 26,784
mdrun-463-901-sse-32 occasionally causes a soft system freeze when the machine goes from the active state into a sleeping state, i.e. screensaver off to on. By soft system freeze, I mean that the start bar/menu (I do use start8, but it's confirmed to occur without it active as well) is completely locked up. Windows-R can bring up the Run dialog, and I can use cmd and taskkill to kill mdrun; the start menu itself then returns to normal, but the bar remains unresponsive. Killing explorer.exe to reset the start bar results in a hard freeze requiring a reboot. During the soft freeze, Alt-Tab and other windows are VERY slow to respond until mdrun is killed; afterwards all other windows work fine, but the start bar is unusable and forces a reboot of the system. There is nothing in the error logs. Any assistance or ideas in resolving this would be appreciated.

My system:
Windows 8.1 64-bit
i7 4790K @ stock
ASRock Z97-Extreme4
EVGA GTX 970 SC ACX @ stock
2x8GB HyperX Fury DDR3-1866 @ stock
Joined: 21 Feb 09 · Posts: 497 · Credit: 700,690,702 · RAC: 0
I gave four cores of my AMD FX-8350 to the app. I've done four WUs, which all completed in a remarkably consistent time of just over 16 hours, with a somewhat stingy 920 credits each. I just checked the server status and was a little surprised to see my 16 hours well under the listed minimum run time of 19.16 hours.
Joined: 25 Sep 13 · Posts: 293 · Credit: 1,897,601,978 · RAC: 0
A current CPUMD task is 2.5 million steps, not 5 million like the prior tasks. Maybe this is why the credit awarded is lower? All four of the tasks you completed were 2.5 million steps.
Joined: 21 Feb 09 · Posts: 497 · Credit: 700,690,702 · RAC: 0
A current CPUMD task is 2.5 million steps, not 5 million like the prior tasks. Maybe this is why the credit awarded is lower? All four of the tasks you completed were 2.5 million steps.

I did complete this 5M-step WU on 24 October and got 3342 credits...
Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0
Yes, the credit allocation is wrong - need to work out how to fix that. Matt |
Joined: 21 Feb 09 · Posts: 497 · Credit: 700,690,702 · RAC: 0
Yes, the credit allocation is wrong - need to work out how to fix that.

A fixed 2.5M per completion would be a nice 'n' easy solution ;)
Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0
I have completed 2 of the new (I think?) tasks, of application type "Test application for CPU MD v9.01 (mtsse2)", on my host (id: 153764), running 8 logical CPUs (4 cores hyperthreaded). When I first got the tasks, I think the estimated run time was something like 4.5 hours. But then, after it completed the first task (which took way longer - it took 15.75 hours of run time), it realized it was wrong, and adjusted the estimated run times for the other tasks to be ~16 hours.

For each of the 2 completed tasks:
- Task size: 2.5 million steps
- Run time: ~16.4 hours
- CPU time: ~104 hours (my CPUs were slightly overcommitted by my own doing)
- Credit granted: ~3700

I will continue to occasionally run these, to help you test, especially when new versions come out.

Regards, Jacob