What is happening and what will happen at GPUGRID, update for 2021

Keith Myers
Message 57300 - Posted: 14 Sep 2021, 18:22:03 UTC

Great news! Now to just have some tasks ready to send.
Philip C Swift [Gridcoin]

Message 57304 - Posted: 17 Sep 2021, 9:57:51 UTC - in response to Message 57240.  

Good news! :-)
stiwi

Message 57305 - Posted: 17 Sep 2021, 15:57:22 UTC
Last modified: 17 Sep 2021, 16:00:31 UTC

Got 1 WU.

BOINC shows a runtime of 6 days on my 2080 Ti with a 60% power target. Hopefully that's not the real time.

Edit: 1% after 10 minutes, so probably around 16 hours.
Keith Myers
Message 57308 - Posted: 17 Sep 2021, 18:17:05 UTC

Picked up a task apiece on two hosts. Some 800 tasks in progress now.

Hope this means the project is getting back to releasing steady work.
Ian&Steve C.

Message 57360 - Posted: 23 Sep 2021, 15:20:57 UTC
Last modified: 23 Sep 2021, 15:23:55 UTC

It seems that even though the cuda1121 app is available and works fine on Ampere cards, there's nothing preventing the cuda101 app from being sent to an Ampere host. Those tasks will always fail.

example: https://gpugrid.net/result.php?resultid=32643471

The project-side scheduler needs to be adjusted so that the cuda101 app is not sent to Ampere hosts. This can be achieved by checking the compute capability reported by the host: in addition to the CUDA version checks, the cuda101 app should be limited to hosts with compute capability below 8.0, while hosts with 8.0 or greater should only get the cuda1121 app.

Or simply remove the cuda101 app and require all users to update their video drivers to use the cuda1121 app.
Keith Myers
Message 57363 - Posted: 24 Sep 2021, 0:00:36 UTC - in response to Message 57360.  
Last modified: 24 Sep 2021, 0:00:55 UTC

This is correct. I had exactly the same failure when the CUDA101 app was sent to my Ampere RTX 3080.

The failure was an inability to compile the CUDA kernel, because it was expecting a different architecture.

https://www.gpugrid.net/result.php?resultid=32642922
Toni
Volunteer moderator · Project administrator · Project developer · Project tester · Project scientist
Message 57365 - Posted: 24 Sep 2021, 13:57:01 UTC - in response to Message 57363.  

Technical question: does anybody know whether the compute capability (CC) is available in the scheduler?
Richard Haselgrove

Message 57366 - Posted: 24 Sep 2021, 14:19:16 UTC - in response to Message 57365.  
Last modified: 24 Sep 2021, 14:30:08 UTC

Technical question: does anybody know whether the compute capability (CC) is available in the scheduler?

I'm sure it is. I'll start digging out some references, if you want.

Sorry, you caught me in the middle of a driver update.

Try https://boinc.berkeley.edu/trac/wiki/AppPlanSpec#NVIDIAGPUapps:

<min_nvidia_compcap>MMmm</min_nvidia_compcap>
minimum compute capability
<max_nvidia_compcap>MMmm</max_nvidia_compcap>
maximum compute capability
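If the project defines its plan classes in XML (plan_class_spec.xml) rather than in C++, something along these lines might do it. This is only a rough sketch: the two compcap elements are the ones documented above, but the surrounding element names and the version numbers are from memory and should be checked against the wiki page before use.

<plan_classes>
    <plan_class>
        <name>cuda101</name>
        <gpu_type>nvidia</gpu_type>
        <cuda/>
        <!-- placeholder floor: a CUDA 10.1 capable driver -->
        <min_cuda_version>10010</min_cuda_version>
        <!-- intended to exclude compute capability 8.0 (Ampere) and above;
             verify whether the bound is inclusive or exclusive -->
        <max_nvidia_compcap>800</max_nvidia_compcap>
    </plan_class>
    <plan_class>
        <name>cuda1121</name>
        <gpu_type>nvidia</gpu_type>
        <cuda/>
        <!-- placeholder floor: a CUDA 11.2 capable driver; no compcap cap needed -->
        <min_cuda_version>11020</min_cuda_version>
    </plan_class>
</plan_classes>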
Ian&Steve C.

Message 57367 - Posted: 24 Sep 2021, 14:28:27 UTC - in response to Message 57366.  

To expand on what Richard wrote, I'm sure it's available. Einstein@Home uses this metric in its scheduler to restrict some of its apps to certain generations of Nvidia GPUs. So it's definitely information that's provided from the host to the project via BOINC.
Richard Haselgrove

Send message
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57368 - Posted: 24 Sep 2021, 14:33:31 UTC - in response to Message 57367.  
Last modified: 24 Sep 2021, 14:34:00 UTC

So it’s definitely information that’s provided from the host to the project via BOINC.

From the most recent sched_request file sent from this computer to your server:

<coproc_cuda>
   ...
   <major>7</major>
   <minor>5</minor>
Toni
Volunteer moderator · Project administrator · Project developer · Project tester · Project scientist
Message 57369 - Posted: 24 Sep 2021, 14:44:30 UTC - in response to Message 57368.  

Uhm... yes, but I was wondering how to retrieve it in the C++ code.
Richard Haselgrove

Send message
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57370 - Posted: 24 Sep 2021, 14:56:53 UTC - in response to Message 57369.  

Uhm... yes, but I was wondering how to retrieve it in the C++ code.

Same principle. Start with Specifying plan classes in C++, third example.

    ...
    if (!strcmp(plan_class, "cuda23")) {
        if (!cuda_check(c, hu,
            100,        // minimum compute capability (1.0)
            200,        // max compute capability (2.0)
            2030,       // min CUDA version (2.3)
            19500,      // min display driver version (195.00)
            384*MEGA,   // min video RAM
            1.,         // # of GPUs used (may be fractional, or an integer > 1)
            .01,        // fraction of FLOPS done by the CPU
            .21            // estimated GPU efficiency (actual/peak FLOPS)
        )) {
            return false;
        }
    }
ServicEnginIC
Message 57377 - Posted: 25 Sep 2021, 21:52:50 UTC
Last modified: 25 Sep 2021, 21:53:52 UTC

Uhm... yes, but I was wondering how to retrieve it in the C++ code.

I guess there might be an easier workaround, with no need to touch the current code.
It would consist of splitting the ACEMD3 app on the Project Preferences page into ACEMD3 (cuda 101) and ACEMD3 (cuda 1121).
This way, Ampere users would be able to untick the ACEMD3 (cuda 101) app, manually preventing themselves from receiving tasks that are sure to fail.
Keith Myers
Message 57378 - Posted: 26 Sep 2021, 2:49:42 UTC - in response to Message 57377.  

That won't work for multi-GPU users who have both Turing and Ampere cards installed in the same host.

I have a 2080 and 3080 together in a host.
ServicEnginIC
Message 57379 - Posted: 26 Sep 2021, 7:58:44 UTC - in response to Message 57378.  

That won't work for multi-GPU users who have both Turing and Ampere cards installed in the same host.

I have a 2080 and 3080 together in a host.

It should work as a manual selection in Project Preferences to receive ACEMD3 (cuda 1121) tasks only.
Your RTX 3080 (Ampere - device 0) can't process ACEMD3 (cuda 101), as seen in your failed task e1s627_I757-ADRIA_AdB_KIXCMYB_HIP-1-2-RND0972_0, but it can process ACEMD3 (cuda 1121), as seen in your successful task e1s385_I477-ADRIA_AdB_KIXCMYB_HIP-0-2-RND6281_2.
And your RTX 2080 (Turing - device 1) on the same host can also process ACEMD3 (cuda 1121) tasks, as seen in your successful task e1s667_I831-ADRIA_AdB_KIXCMYB_HIP-1-2-RND8282_1.
Therefore, restricting preferences in a particular venue for your host #462662 to receiving only ACEMD3 (cuda 1121) tasks would work for both cards.

The exception is the general limitation of the ACEMD3 app on all kinds of mixed multi-GPU systems when restarting tasks on a different device.
It was described by Toni in his Message #52865, dated 17 Oct 2019.
Paragraph: Can I use it on multi-GPU systems?

Can I use it on multi-GPU systems?

In general yes, with one caveat: if you have DIFFERENT types of NVIDIA GPUs in the same PC, suspending a job in one and restarting it in the other will NOT be possible (errors on restart). Consider restricting the client to one GPU type only ("exclude_gpu", see here).
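For reference, that exclusion lives in cc_config.xml on the client. A minimal sketch might look like the following; the URL and the device number are only an example (device 1 would be the RTX 2080 on the host above), so adjust them to whichever card you want to keep away from GPUGRID:

<cc_config>
    <options>
        <exclude_gpu>
            <!-- project whose tasks should not run on this GPU -->
            <url>https://www.gpugrid.net/</url>
            <!-- device number as reported by the BOINC client at startup -->
            <device_num>1</device_num>
        </exclude_gpu>
    </options>
</cc_config>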
Richard Haselgrove

Send message
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57380 - Posted: 26 Sep 2021, 9:59:18 UTC

There are two different aspects to this debate:

1) What will a project server send to a mixed-GPU client?
2) Which card will the client choose to allocate a task to?

The project server will allocate work solely on the basis of Keith's 3080. BOINC has been developed, effectively, to hide his 2080 from the server.

Keith has some degree of control over the behaviour of his client. He can exclude certain applications from particular cards (using cc_config.xml), but he can't exclude particular versions of the same application - the control structure is too coarse.

He can also control certain behaviours of applications at the plan_class level (using app_config.xml), but that control structure is too fine - it doesn't contain any device-level controls.

Other projects have been able to develop general-purpose GPU applications which are at least compatible with mixed-device hosts - tasks assigned to the 'wrong' or 'changed' device at least run, even if efficiency is downgraded. If this project is unable to follow that design criterion (and I don't know why it is unable at this moment), then I think the only available solution at this time is to divide the versions into separate applications - analogous to the old short/long tasks - so that the limited range of available client options can be leveraged.
Ian&Steve C.

Send message
Joined: 21 Feb 20
Posts: 1116
Credit: 40,839,470,595
RAC: 6,423
Level
Trp
Scientific publications
wat
Message 57381 - Posted: 26 Sep 2021, 12:15:28 UTC - in response to Message 57380.  

I haven’t seen enough compelling evidence to justify keeping the cuda101 app. The cuda1121 app works on all hosts and is basically the same speed.

Removing the cuda101 app would solve all problems.
Keith Myers
Message 57383 - Posted: 26 Sep 2021, 16:51:41 UTC - in response to Message 57381.  

I agree. Simply remove the CUDA101 app and stop sending tasks to any host that hasn't updated its drivers to the CUDA 11.2 level.
Richard Haselgrove

Send message
Joined: 11 Jul 09
Posts: 1639
Credit: 10,159,968,649
RAC: 428
Level
Trp
Scientific publications
watwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwatwat
Message 57391 - Posted: 29 Sep 2021, 13:22:37 UTC

What's with the new ACEMD beta version 9.17, introduced today? What are we testing?

I got a couple of really short tasks on Linux host 508381. That risks really messing up the DCF (duration correction factor).
SolidAir79

Message 57392 - Posted: 29 Sep 2021, 14:50:26 UTC

I received 15 of the test WUs. No problems on Ampere; they all crunched without issue. I just want more :)