Message about suboptimal build

Message boards : Multicore CPUs : Message about suboptimal build

Author	Message
Sebastian M. Bobrecki Joined: 4 Oct 09 Posts: 6 Credit: 110,801,812 RAC: 0	Message 39115 - Posted: 11 Dec 2014 \| 8:45:14 UTC
	I got these messages: "Compiled SIMD instructions: AVX_256 (Gromacs could use AVX_128_FMA on this machine, which is better)" and "The current CPU can measure timings more accurately than the code in mdrun_mtavx.901 was configured to use. This might affect your simulation speed as accurate timings are needed for load-balancing. Please consider rebuilding mdrun_mtavx.901 with the GMX_USE_RDTSCP=ON CMake option."
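The first warning amounts to a mismatch between the SIMD level the binary was compiled for and what the CPU supports. As a rough illustration only, the sketch below maps Linux `/proc/cpuinfo` feature flags to the acceleration names mentioned in this thread. The function names `read_cpu_flags` and `suggest_simd` are hypothetical, and the rule "AVX plus FMA4 means AVX_128_FMA" is a simplifying assumption, not GROMACS's actual detection logic.

```python
from typing import Iterable, Set


def read_cpu_flags(path: str = "/proc/cpuinfo") -> Set[str]:
    """Return the feature flags of the first CPU listed (Linux only)."""
    with open(path) as handle:
        for line in handle:
            if line.startswith("flags"):
                # Line looks like: "flags\t\t: fpu vme ... avx fma4 ..."
                return set(line.split(":", 1)[1].split())
    return set()


def suggest_simd(cpu_flags: Iterable[str]) -> str:
    """Pick an acceleration name from CPU flags (simplified heuristic)."""
    flags = set(cpu_flags)
    # Assumption: FMA4 together with AVX indicates Bulldozer/Piledriver.
    if "avx" in flags and "fma4" in flags:
        return "AVX_128_FMA"
    if "avx" in flags:
        return "AVX_256"
    if "sse4_1" in flags:
        return "SSE4.1"
    if "sse2" in flags:
        return "SSE2"
    return "None"
```

On a Linux host, `suggest_simd(read_cpu_flags())` would give a first guess at which build variant suits the machine; the warning printed by the application itself remains the authoritative hint.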

MJH Project administrator Project developer Project scientist Joined: 12 Nov 07 Posts: 696 Credit: 27,266,655 RAC: 0	Message 39126 - Posted: 11 Dec 2014 \| 16:38:10 UTC - in response to Message 39115.
	Nothing important. MJH

Sebastian M. Bobrecki Joined: 4 Oct 09 Posts: 6 Credit: 110,801,812 RAC: 0	Message 39129 - Posted: 11 Dec 2014 \| 22:36:16 UTC - in response to Message 39126. Last modified: 11 Dec 2014 \| 22:43:37 UTC
	MJH wrote: "Nothing important."
	However, I think it is important, because performance is very low. My notebook, which has 8 threads clocked at 1.86 GHz, reaches 7.603 ns/day. A system with 64 threads (32 used by the application) clocked at 2.5 GHz is only about 2 times faster, reaching just 15.829 ns/day.
	According to the study by Professor Agner Fog of the Technical University of Denmark on processors with the Bulldozer and Piledriver architectures, quote: "- The throughput of 256-bit store instructions is less than half the throughput of 128-bit store instructions on Bulldozer and Piledriver. It is particularly bad on the Piledriver, which has a throughput of one 256-bit store per 17 - 20 clock cycles. - 128-bit register-to-register moves have zero latency, while 256-bit register-to-register moves have a latency of 2 clocks plus a penalty of 2-3 clocks for using a different domain (see below) on Bulldozer and Piledriver." and: "Therefore, there is no advantage in using 256-bit instructions on Bulldozer and Piledriver when the bottleneck is execution unit throughput or instruction decoding. The poor throughput of 256-bit stores makes it a disadvantage to use 256-bit registers on the Piledriver."
	This is probably the reason why the GROMACS developers spent time on a dedicated optimization. Quote from the GROMACS project site: "Currently the supported acceleration options are: none, SSE2, SSE4.1, AVX-128-FMA (AMD Bulldozer + Piledriver) and AVX-256 (Intel Sandy+Ivy Bridge)." and: "On x86, the performance difference between SSE2 and SSE4.1 is minor. All other, higher acceleration differences are significant."
	Therefore, I think it would be good to also have a version of the application built with these optimizations. I would certainly be delighted.
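The comparison in the post above can be checked with a little arithmetic. The sketch below uses only the figures quoted in the post and normalizes throughput by threads times clock speed. That normalization is a crude assumption introduced here for illustration: it ignores per-clock differences between the two CPUs, memory bandwidth, and how threads map to physical cores.

```python
# Figures as quoted in the post; the dictionary layout is illustrative.
notebook = {"threads": 8, "clock_ghz": 1.86, "ns_per_day": 7.603}
server = {"threads": 32, "clock_ghz": 2.5, "ns_per_day": 15.829}


def thread_ghz(machine: dict) -> float:
    """Crude compute budget: threads used times clock speed in GHz."""
    return machine["threads"] * machine["clock_ghz"]


# Raw speedup of the server over the notebook (about 2.08).
speedup = server["ns_per_day"] / notebook["ns_per_day"]

# Ratio of the crude compute budgets (about 5.38).
resource_ratio = thread_ghz(server) / thread_ghz(notebook)

# Throughput per thread-GHz on the server relative to the notebook
# (about 0.39).
relative_efficiency = speedup / resource_ratio

print(f"speedup:             {speedup:.2f}x")
print(f"thread-GHz ratio:    {resource_ratio:.2f}x")
print(f"relative efficiency: {relative_efficiency:.2f}")
```

Under this rough measure the larger machine delivers well under half the throughput per thread-GHz of the notebook, which is consistent with the poster's concern, although it does not by itself show that the AVX_256 build is the cause.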
