Ampere 10496 & 8704 & 5888 fp32 cores!

bozz4science Joined: 22 May 20 Posts: 110 Credit: 115,525,136 RAC: 0	Message 55692 - Posted: 5 Nov 2020, 22:51:43 UTC - in response to Message 55691. Kudos to you! From what I understand so far, I have to support Zoltan's opinion. I really must stress that I am very keen on efficiency, as that is something everyone should factor into their hardware decisions. Anyway, it still seems to offer a valid value proposition for me, especially at this price point. And it is still very early; future support for RTX 30xx cards could definitely offer some potential for optimisation.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 423,674	Message 55693 - Posted: 5 Nov 2020, 23:04:32 UTC - in response to Message 55691. Last modified: 5 Nov 2020, 23:21:34 UTC Quote: "I got my EVGA 3070 black today ... Testing results with CUDA 11.1 special app: about 75-85% speed vs. my 2080 Ti on my collection of WUs. Thank you for all the effort you have put into this benchmark! Regrettably, your benchmarks confirmed my expectations. Performance-wise it's a bit better than I expected (67.6% + 10% ≈ 74.4% of the 2080 Ti). Power-consumption-wise, it seems as of yet that it's not worth investing in an upgrade from the RTX 2*** series for crunching." Take it with a grain of salt so far; that is exactly why I made the disclaimer. As you yourself mentioned, the types of calculations are different, and if SETI performs a large number of INT calculations, then this result wouldn't be unexpected. Ampere should see the most benefit from a pure FP32 load, and according to your previous comments, GPUGRID should be mostly FP32. Quote: "It could also be that source code changes might be necessary to take full advantage of the new architecture." The new SETI app has ZERO source code changes from the older 10.2 app; I simply compiled it with the CUDA 11.1 library instead of 10.2. That's why I was attempting to build a FAHbench version with CUDA 11.1, but I hit a snag there and will have to wait. I don't do FAH, but since users here have said that GPUGRID is similar in work performed and software used, the 3070 should perform on par with the 2080 Ti. Check this page for a comparison: https://folding.lar.systems/folding_data/gpu_ppd_overall It shows the F@h PPD of a 3070 just behind the 2080 Ti. At $500 and 220W, that makes sense: not as power efficient as I'd like, but pushing it beyond the 2080 Ti nonetheless.

Keith Myers Joined: 13 Dec 17 Posts: 1423 Credit: 9,187,696,190 RAC: 1,276,885	Message 55694 - Posted: 6 Nov 2020, 0:53:58 UTC - in response to Message 55690. Last modified: 6 Nov 2020, 0:59:46 UTC Judging from the issue raised on the OpenMM GitHub repo, it seems they let their SSL certificate expire back in September, and no one has done anything about it. https://github.com/openmm/openmm-org/issues/38 https://github.com/openmm/openmm This OpenMM forum? https://simtk.org/plugins/phpBB/indexPhpbb.php?group_id=161&pluginname=phpBB

Keith Myers Joined: 13 Dec 17 Posts: 1423 Credit: 9,187,696,190 RAC: 1,276,885	Message 55695 - Posted: 6 Nov 2020, 1:11:47 UTC The compiler optimizations for Zen 3 haven't made it into any Linux toolchain yet either. GCC 11 and Clang 12 are supposed to get znver3 targets; along with the upcoming 5.10 kernel, they are expected to arrive next April with the 21.04 distro release. https://www.phoronix.com/scan.php?page=news_item&px=AMD-Zen-3-Linux-Expectations

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 423,674	Message 55720 - Posted: 10 Nov 2020, 22:15:32 UTC Last modified: 10 Nov 2020, 23:05:23 UTC Got the 3070 up and running on Einstein@home for some more testing. As far as Einstein performance goes: the 2080 Ti does the current batch of GW tasks in about 4 minutes using 225W (360 t/day), and the 3070 does them in about 5 minutes using 170W (288 t/day). The 2080 Ti does the FGRP tasks in about 6 minutes using 225W (240 t/day), and the 3070 does them in about 7:20 using 150W (196.4 t/day). So for Einstein GW tasks the 3070 is about 5% more efficient (energy per task), and for the Einstein GR (FGRP) tasks it is about 19% more efficient. Now, this isn't to say you should buy Ampere (or even Nvidia cards) for Einstein, since some of the newer AMD cards perform much better there (faster and more efficient overall). But this serves as an additional data point with a different processing type. You can see the efficiency improvements here, especially on the GR tasks.
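The efficiency figures above can be checked against the quoted timings. This is a minimal sketch. It assumes each card draws its stated wattage for the whole task and reads "more efficient" as "less energy per task". The table layout and function names are illustrative and not taken from any project tool.

```python
# Timings (minutes per task) and power draw (W) as reported in the post.
RUNS = {
    ("2080 Ti", "GW"): (4.0, 225),
    ("3070", "GW"): (5.0, 170),
    ("2080 Ti", "FGRP"): (6.0, 225),
    ("3070", "FGRP"): (7 + 20 / 60, 150),  # 7:20 per task
}

MINUTES_PER_DAY = 24 * 60


def tasks_per_day(minutes_per_task: float) -> float:
    """Throughput of one card running one task at a time."""
    return MINUTES_PER_DAY / minutes_per_task


def wh_per_task(minutes_per_task: float, watts: float) -> float:
    """Energy per task, assuming a constant power draw for the whole run."""
    return watts * minutes_per_task / 60


def energy_saving(kind: str) -> float:
    """Fraction of energy per task the 3070 saves versus the 2080 Ti."""
    old = wh_per_task(*RUNS[("2080 Ti", kind)])
    new = wh_per_task(*RUNS[("3070", kind)])
    return 1 - new / old


if __name__ == "__main__":
    for (card, kind), (minutes, watts) in RUNS.items():
        print(f"{card:8} {kind:5} {tasks_per_day(minutes):6.1f} t/day "
              f"{wh_per_task(minutes, watts):5.1f} Wh/task")
    for kind in ("GW", "FGRP"):
        print(f"{kind}: 3070 saves {energy_saving(kind):.1%} energy per task")
```

The sketch gives about 5.6% less energy per task for GW and about 18.5% for FGRP. That is close to the ~5% and ~19% in the post, so the post's "more efficient" most likely means energy per task.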

Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0	Message 55731 - Posted: 12 Nov 2020, 23:54:31 UTC - in response to Message 55720. Last modified: 12 Nov 2020, 23:56:42 UTC I wonder: if you slowed down the 2080 Ti by 20% (reducing the GPU voltage accordingly as well) to match the speed of the 3070, would it be about as efficient? (Or even better, in the case of GW tasks?)

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 423,674	Message 55732 - Posted: 13 Nov 2020, 1:57:11 UTC - in response to Message 55731. The 2080 Ti was already power limited to 225W. It's not possible (under Linux) to reduce the voltage; there are just no tools for it. Reducing the power limit indirectly reduces the voltage, but I don't have direct control of it. I could power limit further, but that would only slow the card down further: in my experience (across 6 different 2080 Tis), you start to lose too much clock speed below 215-225W. The 2080 Ti was also watercooled, with temps never exceeding 40°C, so it had as much of an advantage as it could have had. The 3070 was run at a 200W power limit, but it never even came close to that in these Einstein loads, with speed probably limited only by temperature/clock boost bins. But that's really par for the course for mid-level Nvidia cards running Einstein tasks: the GR tasks are just lightweight and don't pull a lot of power, and the GW tasks are more CPU-bound than anything else, so you don't get full GPU utilization. Further efficiency gains could probably be made on the 3070 with more power limiting and overclocking, but I didn't bother for this test.

SolidAir79 Joined: 22 Aug 19 Posts: 7 Credit: 168,393,363 RAC: 0	Message 56525 - Posted: 15 Feb 2021, 21:18:14 UTC Are there any apps I can use yet on my 30-series cards?

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 423,674	Message 56527 - Posted: 15 Feb 2021, 22:19:59 UTC - in response to Message 56525. Quote: "Any apps i can use yet on my 30s cards?" Nope.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 423,674	Message 56590 - Posted: 17 Feb 2021, 0:25:32 UTC https://www.gpugrid.net/workunit.php?wuid=27026028 With no Ampere-compatible app, and no mechanism to prevent work from being sent to incompatible systems, situations like this will only become more common as more users upgrade to these new cards. 3 of the 6 systems that have handled this WU were using Ampere cards and failed because of it. If we had an Ampere-compatible CUDA 11.1 app, this WU would have been completed by the first system.
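The missing safeguard described above amounts to checking a host's compute capability against the builds an app actually has before sending work. Below is a minimal sketch under these assumptions:

- Each build records the range of compute capabilities it ships native code for.
- `AppBuild`, `pick_build` and the build names are hypothetical. This is not GPUGRID's or BOINC's actual scheduler code.
- The capability values follow NVIDIA's numbering: 7.5 for Turing, 8.6 for RTX 30-series cards, with CUDA 10.x topping out at 7.5.
- PTX forward compatibility is ignored, so an unsupported host is treated conservatively.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# Compute capability as (major, minor): (7, 5) = Turing, (8, 6) = RTX 30-series.
ComputeCapability = Tuple[int, int]


@dataclass(frozen=True)
class AppBuild:
    """Hypothetical record of one app build and the GPUs it has native code for."""
    name: str
    min_cc: ComputeCapability
    max_cc: ComputeCapability


def pick_build(host_cc: ComputeCapability,
               builds: Sequence[AppBuild]) -> Optional[AppBuild]:
    """Return the first build that supports host_cc, or None to withhold work."""
    for build in builds:
        # Tuples compare lexicographically, so (8, 6) > (7, 5).
        if build.min_cc <= host_cc <= build.max_cc:
            return build
    return None


# Illustrative builds: a CUDA 10.x app without Ampere code, and a CUDA 11.1 app.
CUDA10_BUILD = AppBuild("hypothetical_cuda10", (5, 0), (7, 5))
CUDA111_BUILD = AppBuild("hypothetical_cuda111", (5, 0), (8, 6))

if __name__ == "__main__":
    turing, ampere = (7, 5), (8, 6)
    print(pick_build(turing, [CUDA10_BUILD]))  # the CUDA 10.x build
    print(pick_build(ampere, [CUDA10_BUILD]))  # None: don't send the task
    print(pick_build(ampere, [CUDA10_BUILD, CUDA111_BUILD]))  # the CUDA 11.1 build
```

If the CUDA 10.x build is the only one available, an RTX 30-series host simply gets no task. The workunit is not spent on a card that is certain to fail it.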

Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0	Message 56601 - Posted: 17 Feb 2021, 12:29:28 UTC - in response to Message 56590. Last modified: 17 Feb 2021, 12:29:40 UTC I "saved" a workunit earlier (my host was the 7th to crunch it, and the first successful one). It had been sent to two Ampere cards before.

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 423,674	Message 56604 - Posted: 17 Feb 2021, 15:03:36 UTC - in response to Message 56601. I have a couple like that which I similarly saved, even some _7s (the 8th attempt). On this one, 50% of the users had RTX 30-series cards: https://www.gpugrid.net/workunit.php?wuid=27025291
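The "_7s (8th)" shorthand relies on each copy of a workunit being named with a zero-based replica index after the last underscore. This small sketch maps a name to its attempt number under that naming assumption. `attempt_number` is a hypothetical helper, and the example name is made up.

```python
def attempt_number(result_name: str) -> int:
    """Return the 1-based attempt number from a result name like '<wu>_7'.

    Assumes the text after the last underscore is the zero-based replica
    index, so '_0' is the first copy sent out and '_7' is the 8th.
    """
    _, sep, index = result_name.rpartition("_")
    if not sep or not index.isdigit():
        raise ValueError(f"no replica suffix in {result_name!r}")
    return int(index) + 1


if __name__ == "__main__":
    print(attempt_number("example_workunit_name_7"))  # 8
```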

Asghan Joined: 30 Oct 19 Posts: 7 Credit: 405,900 RAC: 0	Message 57019 - Posted: 25 Jun 2021, 6:51:57 UTC Is there any update regarding Nvidia Ampere workunits? My Ampere cards are getting bored -.-

Ian&Steve C. Joined: 21 Feb 20 Posts: 1116 Credit: 40,876,970,595 RAC: 423,674	Message 57355 - Posted: 21 Sep 2021, 12:31:24 UTC Last modified: 21 Sep 2021, 12:32:54 UTC My 3080 Ti (limited to 300W, mind you) did this ADRIA task in under 9.5 hrs: https://www.gpugrid.net/result.php?resultid=32642251 Has anyone with a high-power 3090 or 3080 Ti run it faster? My model of 3080 Ti will only go up to 366W, but I know some 3080 Tis and 3090s can reach into the 400-500W range.

Boca Raton Community HS Joined: 27 Aug 21 Posts: 38 Credit: 7,254,068,306 RAC: 0	Message 57980 - Posted: 1 Dec 2021, 1:42:27 UTC - in response to Message 57355. I am not running 3090s or 3080 Ti cards, but I do have some times for high-end Turing and Ampere GPUs on ADRIA tasks for comparison. Nvidia Quadro RTX 6000: 8.4 hours. Nvidia RTX A6000: 6.7 hours.

Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0	Message 57995 - Posted: 1 Dec 2021, 16:59:57 UTC - in response to Message 57980. Quote: "I am not running 3090s or 3080ti cards but I do have some times/comparisons for high-end Turing and Ampere GPUs for Adria tasks. Nvidia Quatro RTX6000- 8.4 hours Nvidia RTX A6000- 6.7 hours" The Nvidia Quadro RTX 6000 is a "full chip" version of the RTX 2080 Ti (4608 vs 4352 CUDA cores), while the Nvidia RTX A6000 is the "full chip" version of the RTX 3090 (10752 vs 10496 CUDA cores). The rumoured RTX 3090 Ti will also have the "full chip". The RTX 3080 Ti has 10240 CUDA cores. (The real-world GPUGrid performance of Ampere architecture cards scales with half of the advertised number of CUDA cores.)
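The "half of the advertised CUDA cores" rule of thumb can be turned into a quick estimate. This sketch uses only the core counts quoted in this thread. It ignores clock speeds, memory bandwidth and power limits, so treat it as a rough heuristic, not a performance model. The names are illustrative.

Under this rule a 3070 lands at about 67.6% of a 2080 Ti, which matches the figure quoted earlier in the thread. The rule predicts the A6000 to be about 1.17x faster than the Quadro RTX 6000. The reported times (8.4 h vs 6.7 h) show roughly 1.25x.

```python
# Advertised CUDA core counts quoted in this thread.
CARDS = {
    "RTX 2080 Ti": ("Turing", 4352),
    "Quadro RTX 6000": ("Turing", 4608),
    "RTX 3070": ("Ampere", 5888),
    "RTX 3080 Ti": ("Ampere", 10240),
    "RTX 3090": ("Ampere", 10496),
    "RTX A6000": ("Ampere", 10752),
}


def effective_cores(card: str) -> float:
    """Rule of thumb: count only half of an Ampere card's advertised cores."""
    arch, cores = CARDS[card]
    return cores / 2 if arch == "Ampere" else cores


def predicted_speedup(new: str, old: str) -> float:
    """Estimated speed ratio new/old, ignoring clocks and power limits."""
    return effective_cores(new) / effective_cores(old)


if __name__ == "__main__":
    print(f"3070 vs 2080 Ti: {predicted_speedup('RTX 3070', 'RTX 2080 Ti'):.1%}")
    observed = 8.4 / 6.7  # Quadro RTX 6000 hours / RTX A6000 hours
    predicted = predicted_speedup("RTX A6000", "Quadro RTX 6000")
    print(f"A6000 vs RTX 6000: predicted {predicted:.2f}x, observed {observed:.2f}x")
```

The gap between the predicted 1.17x and the observed 1.25x must come from something the rule leaves out, clock speed being one candidate. That is a guess, not something measured in the thread.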

Boca Raton Community HS Joined: 27 Aug 21 Posts: 38 Credit: 7,254,068,306 RAC: 0	Message 57996 - Posted: 1 Dec 2021, 17:18:51 UTC - in response to Message 57995. Quote: "(The real world GPUGrid performance of the Ampere architecture cards scales with the half of the advertised number of CUDA cores)." Is that true for all Nvidia GPUs or just Ampere? Just out of curiosity, why is it this way?

Keith Myers Joined: 13 Dec 17 Posts: 1423 Credit: 9,187,696,190 RAC: 1,276,885	Message 58002 - Posted: 1 Dec 2021, 19:37:41 UTC - in response to Message 57996. Just guessing here, but every new generation of Nvidia cards has basically doubled, or at least increased, the CUDA core count. The GPUGrid app (as well as a very few other project apps) is also really well coded for parallel computation. So you can say that crunch time scales with more cores. You can tell how well optimized an application is by how much sustained utilization it produces and how close to the card's max TDP it runs. The GPUGrid apps and the Minecraft apps are the only two apps I know of that run at 97-100% utilization through the entire computation, at the full power capability of the card. Kudos to the app developers of these projects. Job well done!

Retvari Zoltan Joined: 20 Jan 09 Posts: 2380 Credit: 16,897,957,044 RAC: 0	Message 58003 - Posted: 1 Dec 2021, 19:43:18 UTC - in response to Message 57996. Last modified: 1 Dec 2021, 20:29:26 UTC Quote: "(The real world GPUGrid performance of the Ampere architecture cards scales with the half of the advertised number of CUDA cores)." Quote: "Is that true for all NVidia GPUs or just Ampere? Just out of curiosity, why is it this way?" It's true only for the Ampere architecture. As you can see in the picture above, the number of FP32 units has been doubled in the Ampere architecture (the INT32 units have been "upgraded" to also handle FP32). However, they reside within an (almost) unchanged streaming multiprocessor (SM), which cannot feed that many cores much better. From a cruncher's point of view, the number of SMs should have been doubled as well (by making "smaller" SMs). The other limiting factor is power consumption, as the RTX 3080 Ti (RTX 3090, etc.) easily hits the 350W power limit with this architecture. https://www.reddit.com/r/hardware/comments/ikok1b/explaining_amperes_cuda_core_count/ https://www.tomshardware.com/features/nvidia-ampere-architecture-deep-dive https://support.passware.com/hc/en-us/articles/1500000516221-The-new-NVIDIA-RTX-3080-has-double-the-number-of-CUDA-cores-but-is-there-a-2x-performance-gain-

Keith Myers Joined: 13 Dec 17 Posts: 1423 Credit: 9,187,696,190 RAC: 1,276,885	Message 58005 - Posted: 1 Dec 2021, 21:42:05 UTC As long as you can keep INT32 operations out of the warp scheduler, the Ampere series can do two FP32 operations on the same clock.
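The shared FP32/INT32 datapath point above can be illustrated with an idealized per-SM throughput model. The model assumes the published per-SM lane counts:

- Turing: 64 FP32 lanes plus 64 INT32 lanes.
- GA10x Ampere: 64 FP32 lanes plus 64 lanes that do either FP32 or INT32.

It also assumes perfect scheduling. It ignores memory, register bandwidth, the per-scheduler issue rate and power limits. It is a sketch for intuition, not a description of how the GPUGRID app actually behaves.

```python
def turing_clocks(fp32_ops: float, int32_ops: float) -> float:
    """Clocks per SM when FP32 and INT32 each have their own 64 lanes."""
    return max(fp32_ops, int32_ops) / 64


def ampere_clocks(fp32_ops: float, int32_ops: float) -> float:
    """Clocks per SM with 64 FP32-only lanes plus 64 shared FP32/INT32 lanes.

    INT32 work can only use the shared half, while FP32 work can spread over
    all 128 lanes, so the SM is bound by whichever limit is hit first.
    """
    return max((fp32_ops + int32_ops) / 128, int32_ops / 64)


def ampere_speedup(int_per_fp: float, fp32_ops: float = 100.0) -> float:
    """Ideal Ampere/Turing per-SM speedup for a given INT32:FP32 mix."""
    int32_ops = int_per_fp * fp32_ops
    return turing_clocks(fp32_ops, int32_ops) / ampere_clocks(fp32_ops, int32_ops)


if __name__ == "__main__":
    for mix in (0.0, 0.25, 0.5, 1.0):
        print(f"{mix:4.2f} INT32 per FP32 -> {ampere_speedup(mix):.2f}x")
```

In this model only a pure FP32 stream reaches the 2x that the doubled core count advertises. The more INT32 work is mixed in, the closer Ampere gets to Turing's per-SM rate. The factors the model leaves out, such as SM feeding and power limits, push in the same direction.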