Message boards :
Graphics cards (GPUs) :
Advice for GPU placement
| Author | Message |
|---|---|
|
Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
:) I'm a rock star at breaking things, for sure! It sounds like you were using "max_concurrent" to mean "only run this many at the same time, but allow more to be fetched." David is likely arguing that, if you can't run more than that many simultaneously, why buffer more? Consider tasks that take 300 days to complete (yes, RNA World has them). If you're set to run only 3 as "max_concurrent", why would you want to get a 4th task that would sit there for 300 days? You might consider asking for a separation of functionality --- "max_concurrent_to_schedule" [which is what you want] vs "max_concurrent_to_fetch" [which is what David is changing max_concurrent to mean]. Then you could set the first one to a value, leave the second one unbound, and get back your desired behavior. I hope this makes sense to you. Please feel free to add the text/info to the PR. Note: I doubt it waits until the cache is completely exhausted of max_blah items before asking for more. I'm betting, instead, that work fetch will still top off even if you have some of that task type, but only up to the max_blah setting. |
|
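A sketch of the split Jacob proposes, written as hypothetical app_config.xml syntax. Neither tag exists in the BOINC client; both names are placeholders taken from this post, and the app name is made up for illustration:

```xml
<!-- Hypothetical sketch: these tags are Jacob's proposed names,
     not options the BOINC client actually parses. -->
<app_config>
  <app>
    <name>rna_world</name>
    <!-- run at most 3 at once, but let the cache hold more -->
    <max_concurrent_to_schedule>3</max_concurrent_to_schedule>
    <!-- leaving max_concurrent_to_fetch out = fetch unbounded -->
  </app>
</app_config>
```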
Send message Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731
|
:) I'm a rock star at breaking things, for sure! Yes, I guess I generalized. I didn't wait to see if all my 1000 tasks finished before the work request was initiated. From the testing by Richard and in the host emulator, I assume that when the number of tasks fell below my <project_max_concurrent>16</project_max_concurrent> statement, the client would finally report all 485 completed tasks and ask for more work. But according to the client configuration document https://boinc.berkeley.edu/wiki/Client_configuration#Application_configuration the intended purpose of max_concurrent and project_max_concurrent is to limit the number of tasks of an application that run at a given time, and the number of running jobs for a project, respectively. The original purpose of the max_concurrent parameters shouldn't be circumvented by the new commit code. The key phrases that need to be emphasized are "run at a given time" and "number of running jobs". |
|
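For reference, the statement Keith quotes is normally placed in an app_config.xml file in the project's directory, per the wiki page linked above; a minimal sketch:

```xml
<!-- app_config.xml, placed in the project's directory
     under the BOINC data folder -->
<app_config>
  <project_max_concurrent>16</project_max_concurrent>
</app_config>
```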
Send message Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731
|
Thanks for the comments, Jacob. I have added your observations to my post and will await what Richard has to say about your new classifications when his new day begins. [Edit] I should also note that I have been using <project_max_concurrent> statements on all my hosts for years. Their caches have never been limited to the N tasks in those statements. I have always maintained the server-side limit of 100 CPU tasks plus 100 tasks per GPU for the caches on all hosts, and never had any issues fetching replacement work to keep them topped up at the cache limit. The problem in my original post began when I added the gpu_exclude statements, which prevented all CPU tasks from running. |
|
Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
You're welcome. Since the max_concurrent documentation already has a 'documented meaning', you might use that to suggest (gently push) toward keeping it the same, and putting any work-fetch limits into a new variable. I could see it ending up that way, maybe. |
|
Send message Joined: 11 Jul 09 Posts: 1639 Credit: 10,159,968,649 RAC: 351
|
Hi Keith and Jacob! Slightly odd place for this conversation, but I hope other posters will excuse me joining in. There's a problem with <max_concurrent>. I noticed a couple of years ago that it could prevent full use of your computer if you tried to run both GPU and CPU versions of the same app at the same time. Last year, Keith fell over the same problem, and with two independent reports David started work on fixing it. This work is explicitly about the proper definition of max_concurrent, but in fixing it, David realised that if you run less work concurrently you'll finish less work overall, so you might over-fill the cache and run into deadline problems. So, for the time being, he's put in a rather crude kludge. I've seen and documented a case where a project ran completely dry before finally requesting new work. I think there's universal agreement that this is the wrong sledgehammer, used to crack the wrong nut. I'm trying to put forward the concept of "proportional work fetch". I tend to run a 0.25-day cache (partly because of the fast turnaround required by this project). If you run on all four cores of a quad CPU, BOINC will interpret that as a 1.0-day CPU cache, to keep all four cores busy for 0.25 days. A max_concurrent of 2 should limit that to 0.5 days, and so on. Unless anyone can suggest a better solution? I was on a conference call last night, where I heard other developers, in no uncertain terms, urge David to start work on a better work-fetch solution as soon as possible - and, in particular, not to release a new client in the gap where the max_concurrent bug has been fixed but work fetch is broken. We're getting very close to the completion of the max_concurrent fix. Keith's logs from two nights ago revealed a small bug, which I've reported, but it should be easy to fix. And then we can move on to phase 2.
Contrary to what Keith said, under the new "Community Governance" management of BOINC, no developer is allowed to merge their own code without independent scrutiny - not even David. I got approval last night that, if Keith and I can say the max_concurrent code works within its design limits (i.e. putting aside the work fetch bug for the moment), and if Juha can confirm the C++ code has been written correctly, then we can collectively approve and merge. David wants that to happen so he has a clean code base before he starts on the work-fetch part of the problem. I'll keep an eye on progress, and it's very easy on Windows to test as we go along (replacement binaries are available automatically, within minutes, of any code change; and if the code won't compile, that's also reported back to the developer automatically). But I don't run the sort of heavy metal that you guys do, so any help with the monitoring and testing process that you can contribute will be greatly appreciated. |
|
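Richard's proportional-fetch arithmetic can be sketched in a few lines of Python. This is an illustration of the proposal as described in his post, not actual BOINC client code, and the function name is made up:

```python
# Sketch of the "proportional work fetch" idea: scale the work
# request by how many instances are actually allowed to run.
def cpu_days_to_fetch(cache_days, ncpus, max_concurrent=None):
    """Work request in CPU-days, capped by max_concurrent if set."""
    usable = ncpus if max_concurrent is None else min(ncpus, max_concurrent)
    return usable * cache_days

# A 0.25-day cache on a quad-core CPU: all four cores busy.
print(cpu_days_to_fetch(0.25, 4))       # 1.0 CPU-days
# The same cache with max_concurrent = 2: half the request.
print(cpu_days_to_fetch(0.25, 4, 2))    # 0.5 CPU-days
```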
Send message Joined: 13 Dec 17 Posts: 1419 Credit: 9,119,446,190 RAC: 731
|
Good morning Richard, and sorry about the location of this discussion. Jacob has provided some useful information to improve my understanding. My concern was that David, in the PR2918 comments, said: "Not fetching work if max_concurrent is reached is the intended behavior." and that is what alarmed me greatly. Very encouraging to hear that your conference call with the other developers also raised concerns about work fetching. I was getting overly worked up, I guess, in thinking the commit was going to master soon with the unintended consequence of breaking work fetch. I thought that would cause massive complaints from everyone who noticed their caches were no longer maintained at all times. Thanks for clarifying that even David needs consensus from the other developers to merge code into master. |
|
Send message Joined: 2 Jul 16 Posts: 338 Credit: 7,987,341,558 RAC: 213
|
I've never liked using those two statements when trying to limit a project to something like 50% of CPU threads while wanting another project to use the other 50%. I would often end up with a full queue for the project using the max-tasks statement while the other threads sat idle, because the queue was full. It's another situation where task run priority should be separate from work download priority, and another BOINC client instance ends up being the better way to finely tune BOINC management on a PC. |
|
Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Are you guys sure that RTX 2080 GPUs get work from GPUGrid? From my testing, it seems my PC hasn't gotten a single task from that project since installing that GPU. http://www.gpugrid.net/results.php?hostid=326413 |
|
Send message Joined: 21 Mar 16 Posts: 513 Credit: 4,673,458,277 RAC: 0
|
The only 2080s that should get WUs at this time are the ones in the same machine as a 1000-series card or below. The WUs will not yet work on 2000-series cards. |
|
Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
My machine has a 2080, a 980 Ti, and a 980 Ti, and I have a GPU exclusion set up so GPUGrid doesn't run work on the 2080. Yet GPUGrid work fetch never gives my PC any work. Any idea why? |
|
Send message Joined: 2 Jul 16 Posts: 338 Credit: 7,987,341,558 RAC: 213
|
My machine has 2080, 980 Ti, 980 Ti. And I have a GPU Exclusion setup so GPUGrid work doesn't run work on the 2080. Do you have a device_num with your project URL exclusion? Otherwise all GPUs will be excluded instead of just the Turing card. <device_num>0</device_num> |
|
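For reference, a GPU exclusion restricted to the RTX 2080 (device 0 in Jacob's log) would look like this in cc_config.xml, using the standard exclude_gpu options:

```xml
<cc_config>
  <options>
    <!-- Exclude only device 0 for GPUGRID; omitting <device_num>
         would exclude every GPU for that project. -->
    <exclude_gpu>
      <url>http://www.gpugrid.net/</url>
      <device_num>0</device_num>
    </exclude_gpu>
  </options>
</cc_config>
```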
Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
Yes, I have that set correctly.
2/7/2019 12:12:35 PM | | Starting BOINC client version 7.14.2 for windows_x86_64
2/7/2019 12:12:35 PM | | log flags: file_xfer, sched_ops, task, scrsave_debug, unparsed_xml
2/7/2019 12:12:35 PM | | Libraries: libcurl/7.47.1 OpenSSL/1.0.2g zlib/1.2.8
2/7/2019 12:12:35 PM | | Data directory: E:\BOINC Data
2/7/2019 12:12:35 PM | | Running under account jacob
2/7/2019 12:12:35 PM | | CUDA: NVIDIA GPU 0: GeForce RTX 2080 (driver version 418.81, CUDA version 10.1, compute capability 7.5, 4096MB, 3551MB available, 10687 GFLOPS peak)
2/7/2019 12:12:35 PM | | CUDA: NVIDIA GPU 1: GeForce GTX 980 Ti (driver version 418.81, CUDA version 10.1, compute capability 5.2, 4096MB, 3959MB available, 6060 GFLOPS peak)
2/7/2019 12:12:35 PM | | CUDA: NVIDIA GPU 2: GeForce GTX 980 Ti (driver version 418.81, CUDA version 10.1, compute capability 5.2, 4096MB, 3959MB available, 7271 GFLOPS peak)
2/7/2019 12:12:35 PM | | OpenCL: NVIDIA GPU 0: GeForce RTX 2080 (driver version 418.81, device version OpenCL 1.2 CUDA, 8192MB, 3551MB available, 10687 GFLOPS peak)
2/7/2019 12:12:35 PM | | OpenCL: NVIDIA GPU 1: GeForce GTX 980 Ti (driver version 418.81, device version OpenCL 1.2 CUDA, 6144MB, 3959MB available, 6060 GFLOPS peak)
2/7/2019 12:12:35 PM | | OpenCL: NVIDIA GPU 2: GeForce GTX 980 Ti (driver version 418.81, device version OpenCL 1.2 CUDA, 6144MB, 3959MB available, 7271 GFLOPS peak)
2/7/2019 12:12:35 PM | | Host name: Speed
2/7/2019 12:12:35 PM | | Processor: 16 GenuineIntel Intel(R) Core(TM) i7-5960X CPU @ 3.00GHz [Family 6 Model 63 Stepping 2]
2/7/2019 12:12:35 PM | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss htt tm pni ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes f16c rdrand syscall nx lm avx avx2 vmx tm2 dca pbe fsgsbase bmi1 smep bmi2
2/7/2019 12:12:35 PM | | OS: Microsoft Windows 10: Professional x64 Edition, (10.00.18329.00)
2/7/2019 12:12:35 PM | | Memory: 63.89 GB physical, 73.39 GB virtual
2/7/2019 12:12:35 PM | | Disk: 80.00 GB total, 59.08 GB free
2/7/2019 12:12:35 PM | | Local time is UTC -5 hours
2/7/2019 12:12:35 PM | | No WSL found.
2/7/2019 12:12:35 PM | | VirtualBox version: 5.2.27
2/7/2019 12:12:35 PM | GPUGRID | Found app_config.xml
2/7/2019 12:12:35 PM | GPUGRID | Your app_config.xml file refers to an unknown application 'acemdbeta'. Known applications: 'acemdlong', 'acemdshort'
2/7/2019 12:12:35 PM | GPUGRID | Config: excluded GPU. Type: all. App: all. Device: 0
2/7/2019 12:12:35 PM | Einstein@Home | Config: excluded GPU. Type: all. App: all. Device: 0
2/7/2019 12:12:35 PM | Albert@Home | Config: excluded GPU. Type: all. App: all. Device: 0
2/7/2019 12:12:35 PM | | Config: event log limit 20000 lines
2/7/2019 12:12:35 PM | | Config: use all coprocessors |
|
Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
However, this message appears to be related, and it shows up whenever a Short/Long task is available but not given to me. I'll have to think about what it means, and whether I have something misconfigured. Edit: My cache settings are a 10-day buffer plus 0.5 days additional. Since I believe BOINC interprets that as "may be disconnected for 10 days", it may be limiting my ability to get work based on a calculation involving those 10 days. Time for me to rethink my cache settings (which I had intentionally set for other valid reasons), and then retest.
2/7/2019 3:36:39 PM | GPUGRID | Tasks won't finish in time: BOINC runs 85.2% of the time; computation is enabled 95.1% of that |
|
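A rough illustration of the kind of calculation behind that message, assuming a 5-day GPUGrid deadline; the real client logic differs in detail, but the percentages are the ones from Jacob's log:

```python
# Why a large cache can block work fetch: queued work is measured
# in wall-clock time, scaled by how often the client actually runs.
on_frac = 0.852        # fraction of time BOINC runs (from the log)
active_frac = 0.951    # fraction of that with computation enabled
effective = on_frac * active_frac   # usable fraction of wall-clock time

deadline_days = 5.0    # typical GPUGrid deadline (assumed here)
cache_days = 10.0      # the original buffer setting

# Wall-clock days needed to chew through the requested buffer:
queued_wallclock = cache_days / effective
print(round(effective, 3))               # 0.81
print(queued_wallclock > deadline_days)  # True: a new task would miss its deadline
```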
Send message Joined: 11 Oct 08 Posts: 1127 Credit: 1,901,927,545 RAC: 0
|
After I changed my cache settings from 10.0d and 0.5d to 2.0d and 0.5d, the message went away, and I started getting GPUGrid work for the first time on this PC since getting the RTX 2080. http://www.gpugrid.net/results.php?hostid=326413 Thanks. |
|
Send message Joined: 2 Jul 16 Posts: 338 Credit: 7,987,341,558 RAC: 213
|
Good to see. I don't think I've ever seen that message about not finishing in time. |
|
Send message Joined: 18 Jan 09 Posts: 21 Credit: 3,950,530 RAC: 0
|
2/7/2019 3:36:39 PM | GPUGRID | Tasks won't finish in time: BOINC runs 85.2% of the time; computation is enabled 95.1% of that
Well, there is your explanation, clearly set out above! You weren't being sent any work because BOINC knew that you wouldn't finish it in time. If it doesn't suit you to run BOINC 24/7, then it seems the only way forward is to drop the cache levels. Glad you got it sorted out. |
|
Send message Joined: 12 Jul 17 Posts: 404 Credit: 17,408,899,587 RAC: 0
|
For now I have to remove the project_max_concurrent statement from cc_config and use the CPU limitation in Local Preferences to limit the number of cores to 16.
Why not use this in your cc_config? <ncpus>16</ncpus> |
|
Send message Joined: 12 Jul 17 Posts: 404 Credit: 17,408,899,587 RAC: 0
|
We're getting very close to the completion of the max_concurrent fix.
Richard, I'm not sure what you guys are up to, but I sure hope you take the WCG MIP project into consideration before you roll it out. https://www.worldcommunitygrid.org/forums/wcg/viewpostinthread?post=569786 They coded the use of the L3 cache wrong, and it uses 4-5 MB per MIP WU. If you exceed that, I've seen BOINC performance cut in half. I have to use max_concurrent in my WCG app_config or I cannot run MIP simulations. <app_config>
<app>
<name>mip1</name>
<!-- needs 5 MB L3 cache per mip1 WU, use 5-10 -->
<!-- Xeon E5-2699v4, L3 Cache = 55 MB -->
<max_concurrent>10</max_concurrent>
<fraction_done_exact>1</fraction_done_exact>
</app>
</app_config> |
|
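A quick back-of-the-envelope check of the max_concurrent value in that app_config, using the figures quoted in its comments (the 5 MB per WU footprint is the poster's own estimate):

```python
# How many mip1 WUs fit in the L3 cache before they start evicting
# each other, per the numbers in the app_config comments above.
l3_cache_mb = 55     # Xeon E5-2699 v4 L3 cache, per the comment
mb_per_wu = 5        # estimated L3 footprint per mip1 WU
limit = l3_cache_mb // mb_per_wu
print(limit)         # 11 would fit; the post sets 10 to leave headroom
```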
Send message Joined: 12 Jul 17 Posts: 404 Credit: 17,408,899,587 RAC: 0
|
We're getting very close to the completion of the max_concurrent fix. |
©2025 Universitat Pompeu Fabra