Message boards : Graphics cards (GPUs) : PCI-e Bandwidth Usage
**hiigaran** (Joined: 25 Aug 12, Posts: 3, Credit: 52,783,413, RAC: 0)
Copying this from the BOINC forums: I've been having some discussions on several sites regarding GPUs and bandwidth usage for distributed-computing projects, and I wanted to broaden things by hopefully getting some BOINC experts in on the matter. If anyone is able to help out by posting some data from their rigs, it would really help.
**Beyond** (Joined: 23 Nov 08, Posts: 1112, Credit: 6,162,416,256, RAC: 0)
I've been discussing this over on the Folding@Home forums, and to my disappointment, anything less than PCI-e 3.0 x4 or PCI-e 2.0 x8 results in bandwidth saturation, and thus a performance loss, because the GPUs never reach full load. That matches what I've seen here: PCI-e 2.0 x8 gives full speed with my now-aging 750 Ti cards, while PCI-e 2.0 x4 bottlenecks somewhat. Of course, the faster the GPU, the more likely the bottleneck.
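A quick way to check what link a card has actually negotiated while crunching (assuming an NVIDIA card and a reasonably recent driver) is nvidia-smi; the slot's wiring often differs from its physical size:

```
# Report the PCIe generation and lane width currently negotiated,
# plus GPU load, for every NVIDIA card in the system.
nvidia-smi --query-gpu=name,pcie.link.gen.current,pcie.link.width.current,utilization.gpu --format=csv
```

Note that cards drop to a lower PCIe generation at idle to save power, so run the query while a task is actually loaded.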
**hiigaran** (Joined: 25 Aug 12, Posts: 3, Credit: 52,783,413, RAC: 0)
I was afraid of that. Really wanted to drop a good $10K on a dedicated F@H/BOINC rig and use PCI-e splitters to multiply the GPU capacity. Would have saved me a lot of money buying extra mobos and CPUs. You think every other GPU project on BOINC would be the same?
**Beyond** (Joined: 23 Nov 08, Posts: 1112, Credit: 6,162,416,256, RAC: 0)
> You think every other GPU project on BOINC would be the same?

I'm sure some would be better, and perhaps even unaffected if they do all the processing on the GPU. It's been so long (years) since I tested this on other projects that I won't hazard a guess. There should be someone who's checked this behavior on other projects more recently. Sorry I can't be of more help.
**Retvari Zoltan** (Joined: 20 Jan 09, Posts: 2380, Credit: 16,897,957,044, RAC: 0)
> I was afraid of that. Really wanted to drop a good $10K on a dedicated F@H/BOINC rig and use PCI-e splitters to multiply the GPU capacity. Would have saved me a lot of money buying extra mobos and CPUs.

You don't have to buy a very expensive MB and CPU for GPU crunching, provided that you put only 1 GPU in every MB and you don't crunch for CPU projects on that host. Even a recent Celeron can feed a GTX 980 Ti in a cheap m-ATX MB (however, I would recommend at least an i3). You can gain 10-15% performance by using a non-WDDM OS like Linux or Windows XP.

> You think every other GPU project on BOINC would be the same?

Surely they are, to some extent. Any calculation modelling an N-body process is much more complex than a hashing algorithm (it may need double-precision calculations, or extra "forces" applied depending on the state of the given system), so it needs to be controlled by the CPU, and it needs PCIe bandwidth. There's variation in the PCIe bandwidth requirement between different workunit batches of the GPUGrid app, and it could be the same for other projects. The algorithms of "purely mathematical" projects (like PrimeGrid, Collatz or maybe SETI@home) are more like hashing algorithms, so they could need less PCIe bandwidth than GPUGrid, Einstein@Home or MilkyWay@home, but that could change over time. This situation comes from the fact that the GPUs we use for these projects are made for gaming, so their computing capabilities are "crippled" (disabled or absent double-precision FPUs in the cores); but even the "professional" GPUs are still just co-processors, and they can't do everything on their own (although their development is heading that way).
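To see that batch-to-batch variation yourself, nvidia-smi has a device-monitor mode that samples PCIe throughput once per second (whether the PCIe metric group is available depends on the GPU and driver generation):

```
# Sample PCIe Rx/Tx throughput (MB/s) once per second while a task runs;
# -s t selects the PCIe throughput metric group.
nvidia-smi dmon -s t
```

Watching the Rx/Tx columns across different workunit batches is a direct way to confirm the variation described above.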
**hiigaran** (Joined: 25 Aug 12, Posts: 3, Credit: 52,783,413, RAC: 0)
Thanks for the info. Guess my ultimate system vision might not be exactly what I'd wanted. I should be able to get away with multiple triple- or quad-card setups though, given the right chipset. So if the DC projects are to some extent similar to each other in their secondary hardware requirements, what kind of CPU would be recommended if each separate system had three or four cards?
(Joined: 1 Jun 16, Posts: 1, Credit: 17,942,449, RAC: 0)
Some of the work units require the use of a CPU core in addition to a process running on the GPU. With SETI, for example, some of the work units require only very light CPU usage, and a single CPU core can quite easily keep many parallel processes running across multiple GPUs. Some of the work units, however, require a dedicated CPU core or thread. A quick look in Task Manager or CPUID HWMonitor (a very useful little app if you haven't already got it) will show what is going on. As far as I am aware, it is not currently possible to automatically select one type of work unit or refuse another. This has also resulted in a reduction in RAC, as the CPU-dependent work units don't necessarily yield higher credits; they just take several times longer to run.

With SETI, I have found peak performance on my system (3770K and 980 Ti) occurs when crunching 4 work units at the same time on the GPU and no more than 3 at a time on the CPU, leaving 5 CPU threads free, 4 of which are used as required by the work units that need both GPU and CPU (see the config sketch after this post). The older work units that do not require much of the CPU complete in about 15 minutes each, so I crunch about 16 per hour. The CPU-intensive units take an hour each, so I only crunch 4 per hour. The credit received for either is broadly the same.

The situation with GPUGRID is the opposite, as it is designed to run almost exclusively within the GPU. I am sure I could happily fit 4 x 980s in my PC case and work away at GPUGRID, with the added bonus of not having to turn the heating on in my house during winter, providing I am sitting in the same room as the computer!
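For anyone wanting to reproduce the 4-tasks-per-GPU setup above: BOINC reads an app_config.xml from the project's directory. Below is a minimal sketch; the app name setiathome_v8 is an assumption, so check the actual `<name>` entries in your client_state.xml, which vary by project and application version.

```xml
<!-- app_config.xml, placed in the SETI@home project folder
     (e.g. .../projects/setiathome.berkeley.edu/).
     Runs 4 tasks per GPU, budgeting a quarter of a CPU core each. -->
<app_config>
  <app>
    <name>setiathome_v8</name>  <!-- assumed app name; verify in client_state.xml -->
    <gpu_versions>
      <gpu_usage>0.25</gpu_usage>  <!-- 1/4 of a GPU per task, i.e. 4 tasks per GPU -->
      <cpu_usage>0.25</cpu_usage>  <!-- CPU budget reserved per GPU task -->
    </gpu_versions>
  </app>
</app_config>
```

After saving the file, use "Read config files" in BOINC Manager's Options menu so the change takes effect without restarting the client.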
(Joined: 3 Nov 15, Posts: 38, Credit: 6,768,093, RAC: 0)
> The situation with GPUGRID is the opposite, as it is designed to run almost exclusively within the GPU.

I do not know how the ACEMD app is designed, but on my system it uses 3-10% of a CPU core, and it often utilizes the PCIe x4 bus at up to 60%. The CPU usage, however small, has a significant impact on performance: running the app with realtime (RR) priority increased GPU usage from 80% to 96% (see the command sketch after this post).

> I am sure I could happily fit 4 x 980s in my PC case and work away at GPUGRID, with the added bonus of not having to turn the heating on in my house during winter, providing I am sitting in the same room as the computer!

You think eco :) I use this waste heat to dry my powders and papercraft.
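On Linux, the realtime round-robin (SCHED_RR) scheduling described above can be applied with chrt. This is a sketch under assumptions: that the GPUGrid science app shows up as a process named acemd (the name is an assumption; check with ps), and that you accept that realtime scheduling can starve other processes, which is why the lowest realtime priority is used here:

```
# Move the running GPUGrid app (assumed process name: acemd) to
# SCHED_RR (round-robin realtime) at priority 1, the lowest RT level.
# pidof -s returns a single PID even if several instances are running.
sudo chrt -r -p 1 "$(pidof -s acemd)"
```

If GPU usage climbs the way the poster reports, the task was being starved of timely CPU service rather than raw CPU cycles.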