Message boards :
Graphics cards (GPUs) :
Advice for GPU placement
| Author | Message |
|---|---|
|
Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
:) I'm a rock star at breaking things, for sure! It sounds like you were using "max_concurrent" to mean "only run this many at the same time, but allow more to be fetched." David is likely arguing that, if you can't run more than that many simultaneously, why buffer more? Consider tasks that take 300 days to complete (yes, RNA World has them). If you're set to run only 3 as "max_concurrent", why would you want to get a 4th task that would sit there for 300 days? You might consider asking for a separation of functionality --- "max_concurrent_to_schedule" [which is what you want] vs "max_concurrent_to_fetch" [which is what David is changing max_concurrent to mean]. Then you could set the first one to a value, leave the second one unbound, and get back your desired behavior. I hope this makes sense to you. Please feel free to add the text/info to the PR. Note: I doubt it waits until the cache is completely exhausted of max_blah items before asking for more. I'm betting, instead, that work fetch will still top off even if you have some of that task type, but only up to the max_blah setting. |
|
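A sketch of the split Jacob proposes, written as hypothetical app_config.xml syntax. Neither tag exists in the BOINC client; both names are placeholders taken from this post, and the app name is made up for illustration:

```xml
<!-- Hypothetical sketch: these tags are Jacob's proposed names,
     not options the BOINC client actually parses. -->
<app_config>
  <app>
    <name>rna_world</name>
    <!-- run at most 3 at once, but let the cache hold more -->
    <max_concurrent_to_schedule>3</max_concurrent_to_schedule>
    <!-- leaving max_concurrent_to_fetch out = fetch unbounded -->
  </app>
</app_config>
```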
Send message Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731
|
:) I'm a rock star at breaking things, for sure! Yes, I guess I generalized. I didn't wait to see if all my 1000 tasks finished before the work request was initiated. From the testing by Richard and in the host emulator, I assume that when the number of tasks fell below my <project_max_concurrent>16</project_max_concurrent> statement, the client would finally report all 485 completed tasks and ask for more work. But according to the client configuration document https://boinc.berkeley.edu/wiki/Client_configuration#Application_configuration the intended purpose of max_concurrent and project_max_concurrent is to limit the number of tasks of an application that run at a given time, and the number of running jobs for a project, respectively. The original purpose of the max_concurrent parameters shouldn't be circumvented by the new commit code. The key phrases that need to be emphasized are "run at a given time" and "number of running jobs". |
|
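For reference, the statement Keith quotes is normally placed in an app_config.xml file in the project's directory, per the wiki page linked above; a minimal sketch:

```xml
<!-- app_config.xml, placed in the project's directory
     under the BOINC data folder -->
<app_config>
  <project_max_concurrent>16</project_max_concurrent>
</app_config>
```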
Send message Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731
|
Thanks for the comments, Jacob. I have added your observations to my post and will await what Richard has to say about your new classifications when his new day begins. [Edit] I should also note that I have been using <project_max_concurrent> statements on all my hosts for years. Their caches have never been limited to the N tasks in those statements. I have always maintained the server-side limit of 100 CPU tasks plus 100 tasks per GPU for the caches on all hosts, and never had any issues fetching replacement work to keep them topped up at the cache limit. The problem in my original post began when I added the gpu_exclude statements, which prevented all CPU tasks from running. |
|
Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
You're welcome. Since the max_concurrent documentation already has a 'documented meaning', you might use that to suggest (gently push) toward keeping it the same, and putting any work-fetch limits into a new variable. I could see it ending up that way, maybe. |
|
Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
Hi Keith and Jacob! Slightly odd place for this conversation, but I hope other posters will excuse me joining in. There's a problem with <max_concurrent>. I noticed a couple of years ago that it could prevent full use of your computer if you tried to run both GPU and CPU versions of the same app at the same time. Last year, Keith fell over the same problem, and with two independent reports David started work on fixing it. This work is explicitly about the proper definition of max_concurrent, but in fixing it, David realised that if you run less work concurrently you'll finish less work overall, so you might over-fill the cache and run into deadline problems. So, for the time being, he's put in a rather crude kludge. I've seen and documented a case where a project ran completely dry before finally requesting new work. I think there's universal agreement that this is the wrong sledgehammer, used to crack the wrong nut. I'm trying to put forward the concept of "proportional work fetch". I tend to run a 0.25-day cache (partly because of the fast turnaround required by this project). If you run on all four cores of a quad CPU, BOINC will interpret that as a 1.0-day CPU cache, to keep all four cores busy for 0.25 days. A max_concurrent of 2 should limit that to 0.5 days, and so on. Unless anyone can suggest a better solution? I was on a conference call last night, where I heard other developers, in no uncertain terms, urge David to start work on a better work-fetch solution as soon as possible - and, in particular, not to release a new client in the gap where the max_concurrent bug has been fixed but work fetch is broken. We're getting very close to the completion of the max_concurrent fix. Keith's logs from two nights ago revealed a small bug, which I've reported, but it should be easy to fix. And then we can move on to phase 2.
Contrary to what Keith said, under the new "Community Governance" management of BOINC, no developer is allowed to merge their own code without independent scrutiny - not even David. I got approval last night that, if Keith and I can say the max_concurrent code works within its design limits (i.e. putting aside the work fetch bug for the moment), and if Juha can confirm the C++ code has been written correctly, then we can collectively approve and merge. David wants that to happen so he has a clean code base before he starts on the work-fetch part of the problem. I'll keep an eye on progress, and it's very easy on Windows to test as we go along (replacement binaries are available automatically, within minutes, of any code change; and if the code won't compile, that's also reported back to the developer automatically). But I don't run the sort of heavy metal that you guys do, so any help with the monitoring and testing process that you can contribute will be greatly appreciated. |
|
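Richard's proportional-fetch arithmetic can be sketched in a few lines of Python. This is an illustration of the proposal as described in his post, not actual BOINC client code, and the function name is made up:

```python
# Sketch of the "proportional work fetch" idea: scale the work
# request by how many instances are actually allowed to run.
def cpu_days_to_fetch(cache_days, ncpus, max_concurrent=None):
    """Work request in CPU-days, capped by max_concurrent if set."""
    usable = ncpus if max_concurrent is None else min(ncpus, max_concurrent)
    return usable * cache_days

# A 0.25-day cache on a quad-core CPU: all four cores busy.
print(cpu_days_to_fetch(0.25, 4))       # 1.0 CPU-days
# The same cache with max_concurrent = 2: half the request.
print(cpu_days_to_fetch(0.25, 4, 2))    # 0.5 CPU-days
```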
Send message Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731
|
Good morning Richard, and sorry about the location of this discussion. Jacob has provided some useful information to improve my understanding. My concern was that David, in the PR2918 comments, said: "Not fetching work if max_concurrent is reached is the intended behavior." and that is what alarmed me greatly. Very encouraging to hear that your conference call with the other developers also raised concerns about work fetching. I was getting overly worked up, I guess, in thinking the commit was going to master soon with the unintended consequence of breaking work fetch. I thought that would cause massive complaints from everyone who noticed their caches were no longer maintained at all times. Thanks for clarifying that even David needs consensus from the other developers to merge code into master. |
|
Send message Joined: 2 Jul 16 Posts: 338 Credit: 7,987,341,558 RAC: 213
|
I've never liked using those two statements when trying to limit a project to something like 50% of CPU threads while wanting another project to use the other 50%. I would often end up with a full queue for the project using the max-tasks statement while the other threads sat idle, because the queue was full. It's another situation where task run priority should be separate from work download priority, and another BOINC client instance ends up being the better way to finely tune BOINC management on a PC. |
|
Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Are you guys sure that RTX 2080 GPUs get work from GPUGrid? From my testing, it seems my PC hasn't gotten a single task from that project since installing that GPU. http://www.gpugrid.net/results.php?hostid=326413 |
|
Send message Joined: 21 Mar 16 Posts: 513 Credit: 4,673,458,277 RAC: 0
|
The only 2080s that should get WUs at this time are the ones in the same machine as a 1000-series card or below. The WUs will not yet work on 2000-series cards. |
|
Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
My machine has a 2080, a 980 Ti, and a 980 Ti, and I have a GPU exclusion set up so GPUGrid doesn't run work on the 2080. Yet GPUGrid work fetch never gives my PC any work. Any idea why? |
|
Send message Joined: 2 Jul 16 Posts: 338 Credit: 7,987,341,558 RAC: 213
|
My machine has 2080, 980 Ti, 980 Ti. And I have a GPU Exclusion setup so GPUGrid work doesn't run work on the 2080. Do you have a device_num with your project URL exclusion? Otherwise all GPUs will be excluded instead of just the Turing card. <device_num>0</device_num> |
|
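For reference, a GPU exclusion restricted to the RTX 2080 (device 0 in Jacob's log) would look like this in cc_config.xml, using the standard exclude_gpu options:

```xml
<cc_config>
  <options>
    <!-- Exclude only device 0 for GPUGRID; omitting <device_num>
         would exclude every GPU for that project. -->
    <exclude_gpu>
      <url>http://www.gpugrid.net/</url>
      <device_num>0</device_num>
    </exclude_gpu>
  </options>
</cc_config>
```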
Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Yes, I have that set correctly.
2/7/2019 12:12:35 PM | | Starting BOINC client version 7.14.2 for windows_x86_64
2/7/2019 12:12:35 PM | | log flags: file_xfer, sched_ops, task, scrsave_debug, unparsed_xml
2/7/2019 12:12:35 PM | | Libraries: libcurl/7.47.1 OpenSSL/1.0.2g zlib/1.2.8
2/7/2019 12:12:35 PM | | Data directory: E:\BOINC Data
2/7/2019 12:12:35 PM | | Running under account jacob
2/7/2019 12:12:35 PM | | CUDA: NVIDIA GPU 0: GeForce RTX 2080 (driver version 418.81, CUDA version 10.1, compute capability 7.5, 4096MB, 3551MB available, 10687 GFLOPS peak)
2/7/2019 12:12:35 PM | | CUDA: NVIDIA GPU 1: GeForce GTX 980 Ti (driver version 418.81, CUDA version 10.1, compute capability 5.2, 4096MB, 3959MB available, 6060 GFLOPS peak)
2/7/2019 12:12:35 PM | | CUDA: NVIDIA GPU 2: GeForce GTX 980 Ti (driver version 418.81, CUDA version 10.1, compute capability 5.2, 4096MB, 3959MB available, 7271 GFLOPS peak)
2/7/2019 12:12:35 PM | | OpenCL: NVIDIA GPU 0: GeForce RTX 2080 (driver version 418.81, device version OpenCL 1.2 CUDA, 8192MB, 3551MB available, 10687 GFLOPS peak)
2/7/2019 12:12:35 PM | | OpenCL: NVIDIA GPU 1: GeForce GTX 980 Ti (driver version 418.81, device version OpenCL 1.2 CUDA, 6144MB, 3959MB available, 6060 GFLOPS peak)
2/7/2019 12:12:35 PM | | OpenCL: NVIDIA GPU 2: GeForce GTX 980 Ti (driver version 418.81, device version OpenCL 1.2 CUDA, 6144MB, 3959MB available, 7271 GFLOPS peak)
2/7/2019 12:12:35 PM | | Host name: Speed
2/7/2019 12:12:35 PM | | Processor: 16 GenuineIntel Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz [Family 6 Model 63 Stepping 2]
2/7/2019 12:12:35 PM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes f16c rdrand syscall nx lm avx avx2 vmx tm2 dca pbe fsgsbase bmi1 smep bmi2
2/7/2019 12:12:35 PM | | OS: Microsoft Windows 10: Professional x64 Edition, (10.00.18329.00)
2/7/2019 12:12:35 PM | | Memory: 63.89 GB physical, 73.39 GB virtual
2/7/2019 12:12:35 PM | | Disk: 80.00 GB total, 59.08 GB free
2/7/2019 12:12:35 PM | | Local time is UTC -5 hours
2/7/2019 12:12:35 PM | | No WSL found.
2/7/2019 12:12:35 PM | | VirtualBox version: 5.2.27
2/7/2019 12:12:35 PM | GPUGRID | Found app_config.xml
2/7/2019 12:12:35 PM | GPUGRID | Your app_config.xml file refers to an unknown application 'acemdbeta'. Known applications: 'acemdlong', 'acemdshort'
2/7/2019 12:12:35 PM | GPUGRID | Config: excluded GPU. Type: all. App: all. Device: 0
2/7/2019 12:12:35 PM | Einstein@Home | Config: excluded GPU. Type: all. App: all. Device: 0
2/7/2019 12:12:35 PM | Albert@Home | Config: excluded GPU. Type: all. App: all. Device: 0
2/7/2019 12:12:35 PM | | Config: event log limit 20000 lines
2/7/2019 12:12:35 PM | | Config: use all coprocessors |
|
Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
However, this message appears to be related, and it shows up whenever a Short/Long task is available but not given to me. I'll have to think about what it means, and whether I have something misconfigured. Edit: My cache settings are a 10-day buffer plus 0.5 days additional. Since I believe BOINC interprets that as "may be disconnected for 10 days", it may be limiting my ability to get work based on a calculation involving those 10 days. Time for me to rethink my cache settings (which I had intentionally set for other valid reasons), and then retest.
2/7/2019 3:36:39 PM | GPUGRID | Tasks won't finish in time: BOINC runs 85.2% of the time; computation is enabled 95.1% of that |
|
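A rough illustration of the kind of calculation behind that message, assuming a 5-day GPUGrid deadline; the real client logic differs in detail, but the percentages are the ones from Jacob's log:

```python
# Why a large cache can block work fetch: queued work is measured
# in wall-clock time, scaled by how often the client actually runs.
on_frac = 0.852        # fraction of time BOINC runs (from the log)
active_frac = 0.951    # fraction of that with computation enabled
effective = on_frac * active_frac   # usable fraction of wall-clock time

deadline_days = 5.0    # typical GPUGrid deadline (assumed here)
cache_days = 10.0      # the original buffer setting

# Wall-clock days needed to chew through the requested buffer:
queued_wallclock = cache_days / effective
print(round(effective, 3))               # 0.81
print(queued_wallclock > deadline_days)  # True: a new task would miss its deadline
```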
Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
After I changed my cache settings from 10.0d and 0.5d to 2.0d and 0.5d, the message went away, and I started getting GPUGrid work for the first time on this PC since getting the RTX 2080. http://www.gpugrid.net/results.php?hostid=326413 Thanks. |
|
Send message Joined: 2 Jul 16 Posts: 338 Credit: 7,987,341,558 RAC: 213
|
Good to see. I don't think I've ever seen that message about not finishing in time. |
|
Send message Joined: 18 Jan 09 Posts: 21 Credit: 3,950,530 RAC: 0
|
2/7/2019 3:36:39 PM | GPUGRID | Tasks won't finish in time: BOINC runs 85.2% of the time; computation is enabled 95.1% of that
Well, there is your explanation, clearly set out above! You weren't being sent any work because BOINC knew that you wouldn't finish it in time. If it doesn't suit you to run BOINC 24/7, then it seems the only way forward is to drop the cache levels. Glad you got it sorted out. |
|
Send message Joined: 12 Jul 17 Posts: 404 Credit: 17,408,899,587 RAC: 0
|
For now I have to remove the project_max_concurrent statement from cc_config and use the CPU limitation in Local Preferences to limit the number of cores to 16.
Why not use this in your cc_config? <ncpus>16</ncpus> |
|
Send message Joined: 12 Jul 17 Posts: 404 Credit: 17,408,899,587 RAC: 0
|
We're getting very close to the completion of the max_concurrent fix.
Richard, I'm not sure what you guys are up to, but I sure hope you take the WCG MIP project into consideration before you roll it out. https://www.worldcommunitygrid.org/forums/wcg/viewpostinthread?post=569786 They coded the use of the L3 cache wrong, and it uses 4-5 MB per MIP WU. If you exceed that, I've seen BOINC performance cut in half. I have to use max_concurrent in my WCG app_config or I cannot run MIP simulations. <app_config>
<app>
<name>mip1</name>
<!-- needs 5 MB L3 cache per mip1 WU, use 5-10 -->
<!-- Xeon E5-2699v4, L3 Cache = 55 MB -->
<max_concurrent>10</max_concurrent>
<fraction_done_exact>1</fraction_done_exact>
</app>
</app_config> |
|
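A quick back-of-the-envelope check of the max_concurrent value in that app_config, using the figures quoted in its comments (the 5 MB per WU footprint is the poster's own estimate):

```python
# How many mip1 WUs fit in the L3 cache before they start evicting
# each other, per the numbers in the app_config comments above.
l3_cache_mb = 55     # Xeon E5-2699 v4 L3 cache, per the comment
mb_per_wu = 5        # estimated L3 footprint per mip1 WU
limit = l3_cache_mb // mb_per_wu
print(limit)         # 11 would fit; the post sets 10 to leave headroom
```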
Send message Joined: 12 Jul 17 Posts: 404 Credit: 17,408,899,587 RAC: 0
|
We're getting very close to the completion of the max_concurrent fix. |
©2025 Universitat Pompeu Fabra