Message boards : Number crunching : Multiple Teslas in one box. Is there a limit per machine for tasks?
**Coleslaw** (Joined: 24 Jul 08, Posts: 36, Credit: 363,857,679, RAC: 0)
One of my teammates has a very impressive rig and is having an issue getting more than 8 work units at a time onto his Teslas. He has 8 dual-GPU cards in the rig, and the only way he could feed all 16 GPUs was to add a second project. Is this a limitation on the server side? He is running Scientific Linux.
http://paste.ubuntu.com/7654934/
System in question: http://www.gpugrid.net/show_host_detail.php?hostid=176860
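For context, the usual first check in a case like this is whether the BOINC client itself is detecting and scheduling all of the GPUs, since by default it only uses the most capable one(s). A minimal cc_config.xml sketch, assuming a stock BOINC 7.x client (the data directory path varies by distribution):

```xml
<!-- cc_config.xml, placed in the BOINC data directory (often /var/lib/boinc-client on Linux).
     <use_all_gpus> tells the client to schedule work on every detected GPU instead of
     only those "equivalent" to the best one. Restart the client after adding it. -->
<cc_config>
  <options>
    <use_all_gpus>1</use_all_gpus>
  </options>
</cc_config>
```

If the client's startup messages already list all 16 devices, the cause is more likely the application-side limit discussed later in this thread.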
Joined: 22 Jan 14, Posts: 3, Credit: 729,009,819, RAC: 0
All Teslas; the ones on the left are K10s, the ones on the right are M2090s. I am not sure why, but GPUGRID seems to only want to see the first 9 of the 16 GPUs on the K10s (devices 0-8).
Joined: 22 Jan 14, Posts: 3, Credit: 729,009,819, RAC: 0
It gets more fun... I have seen GPUGRID try to use GPU 9 once in a while, but it errors out with an immediate computation error. I got a v7 client (7.2.33) installed that works on SL 6.5 and copied over the *.xml files and the slots directory. When I fired it back up, the work units landed in chaos: the GPUGRID work units that landed on devices 9 and up failed, while Einstein seems perfectly happy to use those devices.
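A possible stop-gap, sketched here on the assumption of a BOINC 7.x client: cc_config.xml can exclude specific device indexes for one project only, so the devices GPUGRID cannot use stay available to Einstein instead of producing instant errors.

```xml
<!-- cc_config.xml fragment: keep GPUGRID tasks off device indexes it cannot use,
     while leaving those GPUs available to other projects such as Einstein@Home.
     One <exclude_gpu> element is required per excluded device index. -->
<cc_config>
  <options>
    <use_all_gpus>1</use_all_gpus>
    <exclude_gpu>
      <url>http://www.gpugrid.net/</url>
      <device_num>9</device_num>
    </exclude_gpu>
    <exclude_gpu>
      <url>http://www.gpugrid.net/</url>
      <device_num>10</device_num>
    </exclude_gpu>
    <!-- repeat for device_num 11 through 15 -->
  </options>
</cc_config>
```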
**Coleslaw** (Joined: 24 Jul 08, Posts: 36, Credit: 363,857,679, RAC: 0)
Do the admins have any input on this issue? We were wondering whether there is some kind of server-side limit preventing more than 8 GPU work units on a single machine, or whether running more than 8 at a time is a known issue. It surprises me that nobody has bothered to chime in, even just to ask for additional information.
**MJH** (Joined: 12 Nov 07, Posts: 696, Credit: 27,266,655, RAC: 0)
Don't know - I've got some 8-GPU (K40) machines that run OK. What goes wrong, exactly? Who makes that server, by the way?

MJH
Joined: 22 Jan 14, Posts: 3, Credit: 729,009,819, RAC: 0
> Don't know - I've got some 8-GPU (K40) machines that run OK. What goes wrong, exactly?

It is going past the ninth GPU that pushes it over (devices 0-8 make 9). All work units fail instantly on GPU 9 (the tenth device) or higher; they rarely even get tried on device numbers past 9, as the GPUs are handed work units sequentially. HP makes the SL270s Gen8 servers these reside in. The K10 is a dual-GPU card; the K40 is a single GPU, albeit significantly stronger. I have tried two installs of SL 6.5, and I removed the 5th and 6th cards to make sure it was not the cards themselves - the problem just migrated to the next ones in line. I updated the drivers to 337.19 and the problem persisted. If I drop both servers down to 9 GPUs, the problem goes away. I would prefer to have all the K10s in one box; as it is, I have 3 empty slots in the left SL node and need 2 more servers to use all of them.
http://i.imgur.com/pEDLqoM.png
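One way to separate a driver problem from an application limit is to confirm that all 16 devices enumerate correctly outside BOINC. A minimal CUDA runtime sketch along those lines, assuming the CUDA toolkit is installed so it can be built with nvcc:

```c
// list_gpus.cu - enumerate every CUDA device the driver/runtime exposes.
// Build: nvcc list_gpus.cu -o list_gpus
// If all 16 devices print here but GPUGRID tasks still fail on index 9 and up,
// the driver is fine and the limit sits in the application.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA devices visible: %d\n", count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("  %2d: %s  (PCI %04x:%02x:%02x)\n",
               i, prop.name, prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);
    }
    return 0;
}
```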
**skgiven** (Joined: 23 Apr 09, Posts: 3968, Credit: 1,995,359,260, RAC: 0)
Four possibilities come to mind:
1. BOINC can't facilitate any more GPUs.
2. The ACEMD app limits the number.
3. It's processor related.
4. It's PCIe lane related.

I can't answer these possibilities, but I can explain my thinking:

1. Is there a BOINC GPU cap/limit? If so, that's the issue.
2. Are the apps limited to 8 GPUs? If so, then this is the problem.
3. The E5-2670 is an 8-core, 16-thread S2 processor and there are two mounted. It would make sense to use one CPU until the next is needed (energy saving), and probably to use the second CPU before using HT. Is there a problem starting the second CPU up? Do the drivers or the app not see it? Do the processor's power settings need to be altered?
4. The maximum number of PCI Express lanes is 40 for that processor; I don't know whether the board supports twice that or not. Also, while the first slot might be PCIe 3.0 (possibly the 2nd, 3rd and 4th as well), I doubt they all are, and I would expect it to drop to PCIe 2.0. These 40 lanes are also shared with other devices, so in reality you might only have 32, which means 4 lanes per 8 GPU cores. The point is, your GPUs might not operate with fewer than 4 lanes per GPU. (A quick way to check the actual link width per GPU is sketched after this post.)

FAQ's - HOW TO: Opt out of Beta Tests - Ask for Help
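Possibility 4 can be checked directly on the box: the NVML library that ships with the NVIDIA driver reports the current PCIe generation and link width per GPU. A sketch, assuming nvml.h and libnvidia-ml are available from the driver or GPU Deployment Kit install (on older Fermi parts some of these queries may simply return "not supported"):

```c
/* pcie_check.c - report current PCIe generation and link width per GPU via NVML.
 * Build (the include path for nvml.h may differ per install):
 *   gcc pcie_check.c -o pcie_check -lnvidia-ml
 * A GPU reporting an unexpectedly narrow link (e.g. x1 or x2) would point at
 * the PCIe-lane explanation; full-width links would tend to rule it out. */
#include <stdio.h>
#include <nvml.h>

int main(void) {
    if (nvmlInit() != NVML_SUCCESS) {
        printf("NVML init failed\n");
        return 1;
    }
    unsigned int count = 0;
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        char name[96] = "?";
        unsigned int gen = 0, width = 0;
        if (nvmlDeviceGetHandleByIndex(i, &dev) != NVML_SUCCESS)
            continue;
        nvmlDeviceGetName(dev, name, sizeof(name));
        nvmlDeviceGetCurrPcieLinkGeneration(dev, &gen);
        nvmlDeviceGetCurrPcieLinkWidth(dev, &width);
        printf("GPU %2u: %s  PCIe gen %u x%u\n", i, name, gen, width);
    }
    nvmlShutdown();
    return 0;
}
```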
**MJH** (Joined: 12 Nov 07, Posts: 696, Credit: 27,266,655, RAC: 0)
> It is going past the ninth GPU that pushes it over (devices 0-8 make 9).

Yes. Our app has a limit of 8 GPUs per host. I'll see about getting that fixed, but it won't happen for a wee while.

Matt