Message boards : Number crunching : Multiple Teslas in one box. Is there a limit per machine for tasks?
**Coleslaw** (Joined: 24 Jul 08, Posts: 36, Credit: 363,857,679, RAC: 0)
One of my teammates has a very impressive rig and is having an issue getting more than 8 work units at a time onto his Teslas. He has 8 dual-GPU cards in the rig, and the only way he could feed all 16 GPUs was to add a second project. Is this a limitation on the server side? He is running Scientific Linux.
http://paste.ubuntu.com/7654934/
System in question: http://www.gpugrid.net/show_host_detail.php?hostid=176860
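For context, the usual first check in a case like this is whether the BOINC client itself is detecting and scheduling all of the GPUs, since by default it only uses the most capable one(s). A minimal cc_config.xml sketch, assuming a stock BOINC 7.x client (the data directory path varies by distribution):

```xml
<!-- cc_config.xml, placed in the BOINC data directory (often /var/lib/boinc-client on Linux).
     <use_all_gpus> tells the client to schedule work on every detected GPU instead of
     only those "equivalent" to the best one. Restart the client after adding it. -->
<cc_config>
  <options>
    <use_all_gpus>1</use_all_gpus>
  </options>
</cc_config>
```

If the client's startup messages already list all 16 devices, the cause is more likely the application-side limit discussed later in this thread.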
Joined: 22 Jan 14, Posts: 3, Credit: 729,009,819, RAC: 0
All Teslas; the ones on the left are K10s, the ones on the right are M2090s. I am not sure why, but GPUGRID seems to only want to see the first 9 of the 16 GPUs on the K10s (devices 0-8).
Joined: 22 Jan 14, Posts: 3, Credit: 729,009,819, RAC: 0
It gets more fun... I have seen GPUGRID try to use GPU 9 once in a while, but it errors out with an immediate computation error. I got a v7 client (7.2.33) installed that works on SL 6.5 and copied over the *.xml files and the slots directory. When I fired it back up, the work units landed in chaos: the GPUGRID work units that landed on devices 9 and up failed, while Einstein seems perfectly happy to use those devices.
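A possible stop-gap, sketched here on the assumption of a BOINC 7.x client: cc_config.xml can exclude specific device indexes for one project only, so the devices GPUGRID cannot use stay available to Einstein instead of producing instant errors.

```xml
<!-- cc_config.xml fragment: keep GPUGRID tasks off device indexes it cannot use,
     while leaving those GPUs available to other projects such as Einstein@Home.
     One <exclude_gpu> element is required per excluded device index. -->
<cc_config>
  <options>
    <use_all_gpus>1</use_all_gpus>
    <exclude_gpu>
      <url>http://www.gpugrid.net/</url>
      <device_num>9</device_num>
    </exclude_gpu>
    <exclude_gpu>
      <url>http://www.gpugrid.net/</url>
      <device_num>10</device_num>
    </exclude_gpu>
    <!-- repeat for device_num 11 through 15 -->
  </options>
</cc_config>
```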
**Coleslaw** (Joined: 24 Jul 08, Posts: 36, Credit: 363,857,679, RAC: 0)
Do the admins have any input on this issue? We were wondering whether there is some kind of server-side limit preventing more than 8 GPU work units on a single machine, or whether running more than 8 at a time is a known issue. It surprises me that nobody has bothered to chime in, even just to ask for additional information.
**MJH** (Joined: 12 Nov 07, Posts: 696, Credit: 27,266,655, RAC: 0)
Don't know - I've got some 8-GPU (K40) machines that run OK. What goes wrong, exactly? Who makes that server, by the way?

MJH
Joined: 22 Jan 14, Posts: 3, Credit: 729,009,819, RAC: 0
> Don't know - I've got some 8-GPU (K40) machines that run OK. What goes wrong, exactly?

It is going past the ninth GPU that pushes it over (devices 0-8 make 9). All work units fail instantly on GPU 9 (the tenth device) or higher; they rarely even get tried on device numbers past 9, as the GPUs are handed work units sequentially. HP makes the SL270s Gen8 servers these reside in. The K10 is a dual-GPU card; the K40 is a single GPU, albeit significantly stronger. I have tried two installs of SL 6.5, and I removed the 5th and 6th cards to make sure it was not the cards themselves - the problem just migrated to the next ones in line. I updated the drivers to 337.19 and the problem persisted. If I drop both servers down to 9 GPUs, the problem goes away. I would prefer to have all the K10s in one box; as it is, I have 3 empty slots in the left SL node and need 2 more servers to use all of them.
http://i.imgur.com/pEDLqoM.png
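One way to separate a driver problem from an application limit is to confirm that all 16 devices enumerate correctly outside BOINC. A minimal CUDA runtime sketch along those lines, assuming the CUDA toolkit is installed so it can be built with nvcc:

```c
// list_gpus.cu - enumerate every CUDA device the driver/runtime exposes.
// Build: nvcc list_gpus.cu -o list_gpus
// If all 16 devices print here but GPUGRID tasks still fail on index 9 and up,
// the driver is fine and the limit sits in the application.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA devices visible: %d\n", count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("  %2d: %s  (PCI %04x:%02x:%02x)\n",
               i, prop.name, prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}
```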
**skgiven** (Joined: 23 Apr 09, Posts: 3968, Credit: 1,995,359,260, RAC: 0)
Four possibilities come to mind:
1. BOINC can't facilitate any more GPUs.
2. The ACEMD app limits the number.
3. It's processor related.
4. It's PCIe lane related.

I can't answer these possibilities, but I can explain my thinking:

1. Is there a BOINC GPU cap/limit? If so, that's the issue.
2. Are the apps limited to 8 GPUs? If so, then this is the problem.
3. The E5-2670 is an 8-core, 16-thread S2 processor and there are two mounted. It would make sense to use one CPU until the next is needed (energy saving), and probably to use the second CPU before using HT. Is there a problem starting the second CPU up? Do the drivers or the app not see it? Do the processor's power settings need to be altered?
4. The maximum number of PCI Express lanes is 40 for that processor; I don't know whether the board supports twice that or not. Also, while the first slot might be PCIe 3.0 (possibly the 2nd, 3rd and 4th as well), I doubt they all are, and I would expect it to drop to PCIe 2.0. These 40 lanes are also shared with other devices, so in reality you might only have 32, which means 4 lanes per 8 GPU cores. The point is, your GPUs might not operate with fewer than 4 lanes per GPU. (A quick way to check the actual link width per GPU is sketched after this post.)

FAQ's - HOW TO: Opt out of Beta Tests - Ask for Help
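Possibility 4 can be checked directly on the box: the NVML library that ships with the NVIDIA driver reports the current PCIe generation and link width per GPU. A sketch, assuming nvml.h and libnvidia-ml are available from the driver or GPU Deployment Kit install (on older Fermi parts some of these queries may simply return "not supported"):

```c
/* pcie_check.c - report current PCIe generation and link width per GPU via NVML.
 * Build (the include path for nvml.h may differ per install):
 *   gcc pcie_check.c -o pcie_check -lnvidia-ml
 * A GPU reporting an unexpectedly narrow link (e.g. x1 or x2) would point at
 * the PCIe-lane explanation; full-width links would tend to rule it out. */
#include <stdio.h>
#include <nvml.h>

int main(void) {
    if (nvmlInit() != NVML_SUCCESS) {
        printf("NVML init failed\n");
        return 1;
    }
    unsigned int count = 0;
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        char name[96] = "?";
        unsigned int gen = 0, width = 0;
        if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS)
            continue;
        nvmlDeviceGetName(dev, name, sizeof(name));
        nvmlDeviceGetCurrPcieLinkGeneration(dev, &gen);
        nvmlDeviceGetCurrPcieLinkWidth(dev, &width);
        printf("GPU %2u: %s  PCIe gen %u x%u\n", i, name, gen, width);
    }
    nvmlShutdown();
    return 0;
}
```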
**MJH** (Joined: 12 Nov 07, Posts: 696, Credit: 27,266,655, RAC: 0)
> It is going past the ninth GPU that pushes it over (devices 0-8 make 9).

Yes. Our app has a limit of 8 GPUs per host. I'll see about getting that fixed, but it won't happen for a wee while.

Matt