Message boards : Wish list : Correct reporting of coprocessors (GPUs)
**skgiven** | Joined: 23 Apr 09 | Posts: 3968 | Credit: 1,995,359,260 | RAC: 0
When you have more than one GPU in a system, but of mixed models (from the same vendor, ATI or NVidia), the computer information details are inaccurate. Only one GPU is listed; the second or third card is indicated only by a count, such as [2].
**MJH** | Joined: 12 Nov 07 | Posts: 696 | Credit: 27,266,655 | RAC: 0
Yeah, that's pretty annoying. I've not yet turned my attention to the website side of things. It's on the list, though.

Matt
**TJ** | Joined: 26 Jun 09 | Posts: 815 | Credit: 1,470,385,294 | RAC: 0
This "issue" however is also present at other projects, so it seems to me more a BOINC problem than on project level. Greetings from TJ |
**skgiven** | Joined: 23 Apr 09 | Posts: 3968 | Credit: 1,995,359,260 | RAC: 0
Yeah, ideally this would be done by Berkeley, but Matt's GPUGrid app reads the GPUs, knows which GPU a WU is using, and now reports this correctly in the stderr output file (on Windows). So in theory it could supply the GPU info for the Computer Information Details under Coprocessor, rather than using what BOINC reports. As long as we have the information, we might as well use it to correct what isn't read/reported accurately. Maybe some of the methods and code could be passed to BOINC central for incorporation at some stage?
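For context, here is a minimal sketch (not GPUGRID's actual code) of how an application can enumerate the GPUs itself with the CUDA runtime API and write per-device details to stderr, independent of what BOINC detected:

```cuda
// Minimal sketch: list every CUDA device with its memory and compute
// capability, writing to stderr the way a BOINC science app would.
// Build with: nvcc -o gpuinfo gpuinfo.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        fprintf(stderr, "No CUDA devices detected\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess) continue;
        // totalGlobalMem is in bytes; major.minor is the compute capability
        fprintf(stderr, "Device %d: %s, %zu MB, compute capability %d.%d\n",
                i, prop.name, prop.totalGlobalMem / (1024 * 1024),
                prop.major, prop.minor);
    }
    return 0;
}
```

An app that logs this per task, together with the device index it was actually assigned, produces exactly the kind of stderr evidence described above.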
**Beyond** | Joined: 23 Nov 08 | Posts: 1112 | Credit: 6,162,416,256 | RAC: 0
> Matt's GPUGrid app reads the GPUs, knows which GPU a WU is using and now reports this correctly in the stderr output file

Then hopefully we can get a fix for sending WUs to GPUs with too little memory to handle them. Currently, Noelia WUs need a GPU with 1 GB of memory, while Santi and Nathan WUs run great on 768 MB GPUs. It gets old aborting stuck Noelias every day (which also slow other [non-NVidia] processes to a crawl).
**MJH** | Joined: 12 Nov 07 | Posts: 696 | Credit: 27,266,655 | RAC: 0
Yes - if you can persuade DA to fix it, that'd be great.

MJH
Joined: 11 Jul 09 | Posts: 1639 | Credit: 10,159,968,649 | RAC: 428
Errors, omissions and bugs excepted, BOINC already reports considerable details of each GPU in a system - have a look at the internals of a sched_request.xml file sometime (the most recent one for each attached project is kept in the root of the BOINC Data directory until overwritten).

If the 6 GB of the Titan is still not being detected properly, that might be a 32-bit glitch in the CUDA runtime support in the consumer-grade drivers. We had problems in some cases with Keplers too: BOINC can't afford to purchase samples of every high-end card, and Rom Walton solved that one by remoting in to my GTX 670 (a problem with the high word of a 64-bit variable passed 'by reference' not being initialised in the way Rom expected, IIRC). If you have test Titans in the lab that could be made available to Rom via remote access, that might help fix that bug.

The issue of multiple GPUs in a single host being shown as multiple instances of the best GPU is cosmetic: although full details are passed back, the BOINC database hasn't been adapted to store, and more importantly display, such fine-grained detail.

But I do think there's a fundamental design problem which is going to cause problems here. When a task is allocated to a volunteer's computer, the capability checking is done on the server: questions like "does the host have a >1 GB CUDA card?" or "does the host have a double-precision ATI card?" can be addressed. But - under the present design - those task requirements are not passed back to the host and re-tested locally when the task is scheduled. Thus, if a particular host has both a 1 GB card and a 768 MB card, there's no way of telling it (after allocation) not to run a 1 GB task on the 768 MB card. That whole problem of heterogeneous resources within a single host is going to require a major re-write, I think.
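For illustration, a hand-written sketch of the kind of <coproc_cuda> block a client puts inside sched_request.xml; the values here are invented and the exact field set varies by client version:

```xml
<coprocs>
    <coproc_cuda>
        <count>2</count>
        <name>GeForce GTX 670</name>
        <peak_flops>2459648000000.000000</peak_flops>
        <cudaVersion>5050</cudaVersion>
        <drvVersion>32723</drvVersion>
        <totalGlobalMem>2147483648</totalGlobalMem>
        <major>3</major>
        <minor>0</minor>
    </coproc_cuda>
</coprocs>
```

Note how one set of properties (the most capable card's) is reported with a <count> of 2, which is why the website shows a single model name followed by [2]; whatever extra per-card detail is passed back isn't stored by the server database in a displayable form, as noted above.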
**Beyond** | Joined: 23 Nov 08 | Posts: 1112 | Credit: 6,162,416,256 | RAC: 0
> When a task is allocated to a volunteer's computer, the capability checking is done on the server: questions like "does the host have a >1 GB CUDA card?" or "does the host have a double-precision ATI card?" can be addressed. But - under the present design - those task requirements are not passed back to the host and re-tested locally when the task is scheduled.

However, for hosts with one NVidia GPU, or multiple GPUs with similar amounts of memory, it is currently possible to allocate appropriately sized tasks. Is that correct?
Joined: 11 Jul 09 | Posts: 1639 | Credit: 10,159,968,649 | RAC: 428
> However, for hosts with one NVidia GPU, or multiple GPUs with similar amounts of memory, it is currently possible to allocate appropriately sized tasks. Is that correct?

I'm pretty confident it is. If every NVidia card in a system meets, for example, a cuda55_himem plan class requiring >= 1 GB, server and client should work in harmony.

But with mixed card memory sizes, the server would still allocate tasks, and there's nothing the project can do to make hosts pick the right card. Individual users can set up detailed cc_config options (they'd have to set <use_all_gpus> anyway, so that might be possible). But I fear that some users might not, or might get it wrong. It's not a 'set and forget' scenario, as far as the project is concerned.
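A minimal sketch of such a cc_config.xml, using the client's documented <use_all_gpus> and <exclude_gpu> options; the device number is an assumption for illustration (check the event log's GPU listing for the real numbering on a given host):

```xml
<cc_config>
    <options>
        <!-- detect and use every GPU, not just the most capable one -->
        <use_all_gpus>1</use_all_gpus>
        <!-- keep GPUGRID tasks off device 1, e.g. a 768 MB card
             (device number is illustrative; see the event log) -->
        <exclude_gpu>
            <url>http://www.gpugrid.net/</url>
            <device_num>1</device_num>
            <type>NVIDIA</type>
        </exclude_gpu>
    </options>
</cc_config>
```

The file lives in the BOINC Data directory and is applied via "Read config files" in BOINC Manager or a client restart. The catch is exactly the one raised above: every affected volunteer has to set this up by hand, and get it right.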
**skgiven** | Joined: 23 Apr 09 | Posts: 3968 | Credit: 1,995,359,260 | RAC: 0
> The issue of multiple GPUs in a single host being shown as multiple instances of the best GPU is cosmetic

Actually, it doesn't report multiple instances of the best GPU.