Message boards : News : New CPU work units
Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0
For the last couple of days I've had two GPU tasks and one CPUMD task running in high priority, and until now all ran with no issues. Just now, seemingly at random, BOINC decided to kick one of the GPU tasks out, sending it to "waiting to run". If I suspend the CPUMD task, both GPU tasks will run. Allowing the CPUMD task to run shuts down a GPU task.

Read here: http://www.gpugrid.net/forum_thread.php?id=3898&nowrap=true#38505

It's not random. When your GPU tasks switched out of "high priority" (deadline panic) mode, they also became lower on the food chain of client task scheduling: instead of order 1 (where they were scheduled before the MT task) they became order 3 (scheduled after the MT task). And since the scheduler will only schedule up to ncpus + 1, only one GPU task is presently scheduled instead of both (assuming each of your GPU tasks is also budgeted to use 0.5 CPU or more). Not random at all. Working as designed, correctly... given the circumstances of the GPUGrid MT task estimates being completely broken.
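To make that ordering concrete, here is a toy sketch (illustrative Python, not the actual BOINC client code; the host size, task names, and per-task CPU budgets are made-up values): tasks are considered in scheduling order, and a task is skipped once committing it would push total CPU usage past ncpus + 1.

```python
# Toy model of the behaviour described above, NOT the real BOINC client:
# tasks are considered in scheduling order, and a task is skipped once
# committing it would push total CPU usage past ncpus + 1.
ncpus = 4  # made-up host size for illustration

# (name, scheduling order, CPUs budgeted per task) - illustrative values only
tasks = [
    ("CPU MT task", 1, 4.0),  # MT task at order 1
    ("GPU task A",  3, 0.9),  # GPU tasks dropped to order 3
    ("GPU task B",  3, 0.9),
]

committed = 0.0
for name, order, cpus in sorted(tasks, key=lambda t: t[1]):
    if committed + cpus > ncpus + 1:
        print(f"skipping {name}: CPU committed")
        continue
    committed += cpus
    print(f"scheduling {name} (CPUs committed: {committed:.1f})")
```

With these numbers the MT task and one GPU task get scheduled (4.0 + 0.9 = 4.9 CPUs), and the second GPU task is skipped because it would exceed the ncpus + 1 budget, which is the behaviour described above.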
Joined: 25 Sep 13 · Posts: 293 · Credit: 1,897,601,978 · RAC: 0
For the last couple of days I've had two GPU tasks and one CPUMD task running in high priority, and until now all ran with no issues. Just now, seemingly at random, BOINC decided to kick one of the GPU tasks out, sending it to "waiting to run". If I suspend the CPUMD task, both GPU tasks will run. Allowing the CPUMD task to run shuts down a GPU task.

Jacob, one GPU task has been running for 37 hours straight in high priority mode, another GPU task for 22 hours straight in high priority, and one CPUMD task for 24 straight hours in high priority mode. During this time I haven't added any tasks to the cache. If all three tasks were already running in high priority (order 1 or 3; is there a way to find out which?), why did BOINC kick one out after all this time? From the very beginning these three tasks have been in high priority, and I haven't changed any BOINC scheduler or allowed-CPU-usage settings. I had a similar issue when a CPUMD task was sitting in the cache, so I've stopped letting any task sit in the cache and only keep tasks that can compute on the available GPU/CPU.

If I suspend the CPUMD task, both GPU tasks will run, with one in high priority and the other not. If I suspend the CPUMD task, the one GPU task that was in high priority changes to non-high priority. When the CPUMD task is running alongside one GPU task, and the task that is "waiting to run" is suspended, the running GPU task drops out of high priority mode.
Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0
"High priority mode" for a task means that "Presently, if tasks were scheduled in a FIFO order in the round-robin scheduler, the given task will not make deadline. We need to prioritize it to be ran NOW." It should show you, in the UI, if the task is in "High Priority" mode, on that Tasks tab, in the Status column. A task can move out of "High priority mode" when the round-robin simulation indicates that it WOULD make deadline. When tasks are suspended/resumed/downloaded, when progress percentages get updated, when running estimates get adjusted (as tasks progress), when the computers on_frac and active_frac and gpu_active_frac values change ... the client re-evaluates all tasks to determine which ones need to be "High priority" or not. Did you read the information in the links that were in my post? They're useful. After reading that information, do you still think the client scheduler is somehow broken? Also, you can turn on some cc_config flags to see extra output in Event Log... specifically, you could investigate rr_simulation, rrsim_detail, cpu_sched, cpu_sched_debug, or coproc_debug. I won't be able to explain the output, but you could probably infer the meaning of some of it. |
Joined: 25 Sep 13 · Posts: 293 · Credit: 1,897,601,978 · RAC: 0
Some cc_config flag output. BOINC thinks I'm going to miss the deadline for the CPUMD task (1138 hr remaining estimate):

14/10/16 13:34:52 | GPUGRID | [cpu_sched_debug] 5146-MJHARVEY_CPUDHFR-0-1-RND3131_0 sched state 2 next 2 task state 1

BOINC says the CPUMD task is 20% complete after 24 hr; the progress file is at 3.5 million steps.

BOINC will run the NOELIA_UNFOLD task (97% complete, 18 hr estimated remaining) in high priority while the CPUMD task is running:

14/10/16 13:33:52 | GPUGRID | [cpu_sched_debug] unfoldx5-NOELIA_UNFOLD-19-72-RND4631_0 sched state 2 next 2 task state 1

while booting out the task BOINC thinks will miss its deadline, the SDOERR task at 63% complete (174 hr remaining estimate):

14/10/16 13:33:52 | GPUGRID | [cpu_sched_debug] I1R119-SDOERR_BARNA5-38-100-RND1580_0 sched state 1 next 1 task state 0

Here are some newer task states that have changed:

14/10/16 13:43:13 | GPUGRID | [cpu_sched_debug] 5146-MJHARVEY_CPUDHFR-0-1-RND3131_0 sched state 1 next 1 task state 0
14/10/16 13:47:13 | GPUGRID | [cpu_sched_debug] unfoldx5-NOELIA_UNFOLD-19-72-RND4631_0 sched state 2 next 2 task state 1
14/10/16 13:47:13 | GPUGRID | [cpu_sched_debug] I1R119-SDOERR_BARNA5-38-100-RND1580_0 sched state 2 next 2 task state 1
14/10/16 13:56:05 | GPUGRID | [rr_sim] 24011.34: unfoldx5-NOELIA_UNFOLD-19-72-RND4631_0 finishes (0.90 CPU + 1.00 NVIDIA GPU) (721404.58G/30.04G)
14/10/16 14:00:07 | GPUGRID | [rr_sim] 4404370.74: 5146-MJHARVEY_CPUDHFR-0-1-RND3131_0 finishes (4.00 CPU) (54297244.54G/12.33G)
14/10/16 13:56:05 | GPUGRID | [rr_sim] 658381.65: I1R119-SDOERR_BARNA5-38-100-RND1580_0 finishes (0.90 CPU + 1.00 NVIDIA GPU) (19780638.18G/30.04G)
14/10/16 13:56:05 | GPUGRID | [rr_sim] I1R119-SDOERR_BARNA5-38-100-RND1580_0 misses deadline by 348785.46
14/10/16 13:58:05 | GPUGRID | [cpu_sched_debug] skipping GPU job I1R119-SDOERR_BARNA5-38-100-RND1580_0; CPU committed
14/10/16 13:59:05 | GPUGRID | [cpu_sched_debug] unfoldx5-NOELIA_UNFOLD-19-72-RND4631_0 sched state 2 next 2 task state 1
14/10/16 13:59:05 | GPUGRID | [cpu_sched_debug] I1R119-SDOERR_BARNA5-38-100-RND1580_0 sched state 1 next 1 task state 0
14/10/16 13:59:05 | GPUGRID | [cpu_sched_debug] 5146-MJHARVEY_CPUDHFR-0-1-RND3131_0 sched state 2 next 2 task state 1

Now the three tasks are all running with new task states after being rescheduled (I downloaded a new Long task):

14/10/16 14:10:40 | GPUGRID | [cpu_sched_debug] unfoldx5-NOELIA_UNFOLD-19-72-RND4631_0 sched state 2 next 2 task state 1
14/10/16 14:10:40 | GPUGRID | [cpu_sched_debug] I1R119-SDOERR_BARNA5-38-100-RND1580_0 sched state 2 next 2 task state 1
14/10/16 14:10:40 | GPUGRID | [cpu_sched_debug] 5146-MJHARVEY_CPUDHFR-0-1-RND3131_0 sched state 2 next 2 task state 1
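Reading the [rr_sim] lines above: the pair in parentheses appears to be remaining work and estimated processing rate (GFLOPs and GFLOPs/s), and the leading number is the simulated finish time in seconds. A rough sketch of the arithmetic, using the figures from the log (the interpretation of the log format is an assumption; the logged times differ slightly, presumably because the client also folds in availability fractions):

```python
# Approximate check of the [rr_sim] finish times: remaining GFLOPs divided
# by the estimated rate in GFLOPs/s, taken from the log lines above.
entries = [
    ("unfoldx5-NOELIA_UNFOLD",  721404.58,   30.04),  # logged: 24011.34 s
    ("I1R119-SDOERR_BARNA5",    19780638.18, 30.04),  # logged: 658381.65 s
    ("5146-MJHARVEY_CPUDHFR",   54297244.54, 12.33),  # logged: 4404370.74 s
]
for name, gflops_left, rate in entries:
    seconds = gflops_left / rate
    print(f"{name}: ~{seconds:,.0f} s (~{seconds / 3600:,.0f} h)")
```

The CPUMD task works out to roughly 1,200 hours of simulated remaining time, which is why the client treats it as a certain deadline miss despite the progress file showing normal step progress.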
Joined: 25 Sep 13 · Posts: 293 · Credit: 1,897,601,978 · RAC: 0
CPUMD tasks completed past the deadline: credit was still awarded.
http://www.gpugrid.net/workunit.php?wuid=10159833
http://www.gpugrid.net/workunit.php?wuid=10158842
Joined: 18 Dec 09 · Posts: 6 · Credit: 1,046,736,560 · RAC: 0
I have a problem with the "Test application for CPU MD" work units. This is obviously a test setup, according to both the application name and this discussion thread, yet the work units are being pushed to my machines even though my profile is set to not receive WUs from test applications. I'm happy to do GPU computing for you guys, but I'm not willing to let you take over complete machines for days. Please make the app respect the "Run test applications?" setting in our profiles. Thank you, David
Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0
Hm, sorry about that. Should only be going to machines opted in to test WUs. I should point out the app is close to production - the main remaining problem with it is the ridiculous runtime estimates the client is inexplicably generating. Matt |
Joined: 25 Sep 13 · Posts: 293 · Credit: 1,897,601,978 · RAC: 0
Are the working SSE2 CPUMD tasks on vacation? Were the returned results incomplete or invalid? 10,000 tasks disappeared. From the look of the BOINC stats and GPUGRID graphs, a decent number of new users' CPU-only machines were added, with credit awarded.
Joined: 25 Nov 13 · Posts: 66 · Credit: 282,724,028 · RAC: 62
I got some CPU work units to test, but I had a problem with them. Currently I'm crunching some AVX units; I crunched non-AVX/SSE2 units before. My problem is that when I paused the units and restarted BOINC, none of the CPU work units resumed from their last progress: they started crunching from the beginning. In an area with short but frequent blackouts it's not possible to run these CPU units.
Joined: 31 Aug 13 · Posts: 11 · Credit: 7,952,212 · RAC: 0
I believe the project admins dumped the AVX MT program because of some flaws in it. When I ran the AVX program I also noticed that it never checkpointed.

From MJH in another post: "The buggy Windows AVX app is gone now. Please abort any instances of it still running. It's replaced with the working SSE2 app."
http://www.gpugrid.net/forum_thread.php?id=3812&nowrap=true#38680

For now at least, there are no other CPU beta work units to test. I guess the project admins will revise and replace the work units when they are ready and able to.
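For context on why a non-checkpointing app loses progress: a science app normally writes its progress to a small file every so often and reads it back on startup, so a restart resumes where it left off. A minimal generic sketch (illustrative Python, not the actual GPUGRID/mdrun code; file name and step counts are made up):

```python
import os

CHECKPOINT_FILE = "checkpoint.txt"   # hypothetical progress file
TOTAL_STEPS = 2_500_000
CHECKPOINT_EVERY = 100_000

# Resume from the last checkpoint if one exists; otherwise start at step 0.
start = 0
if os.path.exists(CHECKPOINT_FILE):
    with open(CHECKPOINT_FILE) as f:
        start = int(f.read().strip())

for step in range(start, TOTAL_STEPS):
    # ... one step of simulation work would go here ...
    if (step + 1) % CHECKPOINT_EVERY == 0:
        with open(CHECKPOINT_FILE, "w") as f:
            f.write(str(step + 1))   # record progress so a restart resumes here
```

Without that write/read step, every restart begins at step 0, which matches the behaviour reported with the AVX build.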
Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0
I got some CPU work units to test, but I had a problem with them. Currently I'm crunching some AVX units; I crunched non-AVX/SSE2 units before.

Make sure that the application executable you are running has "sse2" in its name, not "avx". Manually delete the old AVX app binary from the project directory if necessary.

MJH
Joined: 25 Sep 13 · Posts: 293 · Credit: 1,897,601,978 · RAC: 0
Received 5 abandoned 9.03 "AVX" tasks. All are computing with SSE2 even with the AVX app binary in the directory. Checkpoints are working; BOINC client progress reporting is still off (at 70% with 3.7 million steps left to compute). The progress file is reporting the steps computed properly.
Joined: 17 Feb 13 · Posts: 181 · Credit: 144,871,276 · RAC: 0
Hello, friends in Barcelona! No CPU tasks received; are there any available? Thanks! John
Joined: 8 Jun 10 · Posts: 3 · Credit: 1,209,302,653 · RAC: 26,784
mdrun-463-901-sse-32 occasionally causes a soft system freeze when the machine goes from the active state into a sleeping state, i.e. screensaver off to on. By soft system freeze, I mean that the start bar/menu (I do use start8, but it's confirmed to occur without it active as well) is completely locked up. Windows-R can bring up the Run dialog, and I can use cmd and taskkill to kill mdrun; the start menu itself then returns to normal, but the bar remains unresponsive. Killing explorer.exe to reset the start bar results in a hard freeze requiring a reboot. During the soft freeze, Alt-Tab and other windows are VERY slow to respond until mdrun is killed; afterwards all other windows work fine, but the start bar is unusable and forces a reboot of the system. There is nothing in the error logs. Any assistance or ideas in resolving this would be appreciated.

My system:
Windows 8.1 64-bit
i7 4790K @ stock
ASRock Z97-Extreme4
EVGA GTX 970 SC ACX @ stock
2x8GB HyperX Fury DDR3-1866 @ stock
Joined: 21 Feb 09 · Posts: 497 · Credit: 700,690,702 · RAC: 0
I gave four cores of my AMD FX-8350 to the app. I've done four WUs, which all completed in a remarkably consistent time of just over 16 hours, with a somewhat stingy 920 credits each. I just checked the server status and was a little surprised to see my 16 hours well under the listed minimum run time of 19.16 hours.
Joined: 25 Sep 13 · Posts: 293 · Credit: 1,897,601,978 · RAC: 0
A current CPUMD task is 2.5 million steps, not 5 million like the prior tasks. Maybe this is why the credit awarded is lower? All four of the tasks you completed were 2.5 million steps.
Joined: 21 Feb 09 · Posts: 497 · Credit: 700,690,702 · RAC: 0
A current CPUMD task is 2.5 million steps, not 5 million like the prior tasks. Maybe this is why the credit awarded is lower? All four of the tasks you completed were 2.5 million steps.

I did complete this 5M-step WU on 24 October and got 3342 credits...
Joined: 12 Nov 07 · Posts: 696 · Credit: 27,266,655 · RAC: 0
Yes, the credit allocation is wrong - need to work out how to fix that. Matt |
Joined: 21 Feb 09 · Posts: 497 · Credit: 700,690,702 · RAC: 0
Yes, the credit allocation is wrong - need to work out how to fix that.

A fixed 2.5M per completion would be a nice 'n' easy solution ;)
Joined: 11 Oct 08 · Posts: 1127 · Credit: 1,901,927,545 · RAC: 0
I have completed 2 of the new (I think?) tasks, of application type "Test application for CPU MD v9.01 (mtsse2)", on my host (id: 153764), running 8 logical CPUs (4 cores hyperthreaded). When I first got the tasks, I think the estimated run time was something like 4.5 hours. But then, after it completed the first task (which took way longer - it took 15.75 hours of run time), it realized it was wrong, and adjusted the estimated run times for the other tasks to be ~16 hours.

For each of the 2 completed tasks:
- Task size: 2.5 million steps
- Run time: ~16.4 hours
- CPU time: ~104 hours (my CPUs were slightly overcommitted by my own doing)
- Credit granted: ~3700

I will continue to occasionally run these, to help you test, especially when new versions come out.

Regards, Jacob