Message boards : Multicore CPUs : Message about suboptimal build

Sebastian M. Bobrecki
Joined: 4 Oct 09
Posts: 5
Credit: 110,798,797
RAC: 0
Message 39115 - Posted: 11 Dec 2014 | 8:45:14 UTC

I got these messages:

Compiled SIMD instructions: AVX_256 (Gromacs could use AVX_128_FMA on this machine, which is better)
and
The current CPU can measure timings more accurately than the code in
mdrun_mtavx.901 was configured to use. This might affect your simulation
speed as accurate timings are needed for load-balancing.
Please consider rebuilding mdrun_mtavx.901 with the GMX_USE_RDTSCP=OFF CMake option.

MJH
Project administrator
Project developer
Project scientist
Joined: 12 Nov 07
Posts: 696
Credit: 27,266,655
RAC: 0
Message 39126 - Posted: 11 Dec 2014 | 16:38:10 UTC - in response to Message 39115.

Nothing important.

MJH

Sebastian M. Bobrecki
Message 39129 - Posted: 11 Dec 2014 | 22:36:16 UTC - in response to Message 39126.
Last modified: 11 Dec 2014 | 22:43:37 UTC

MJH wrote: "Nothing important."

However, I think it is important, because performance is very low. My notebook, which has 8 threads clocked at 1.86 GHz, reaches 7.603 ns/day. A system with 64 threads (32 used by the application) clocked at 2.5 GHz is only about 2 times faster, reaching just 15.829 ns/day.
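
As a rough sanity check on those figures, here is a small Python sketch. The normalization by threads times clock speed is only a simplifying assumption of mine: it ignores differences in IPC, memory bandwidth and SIMD width between the two machines, so treat the result as an indication rather than a measurement.

```python
# Back-of-the-envelope comparison of the two runs quoted above.
# Assumption (not a measurement): throughput would ideally scale
# linearly with threads x clock speed. Real CPUs differ in IPC,
# memory bandwidth and SIMD width, so this is only a rough guide.

notebook = {"ns_per_day": 7.603, "threads": 8, "ghz": 1.86}
server = {"ns_per_day": 15.829, "threads": 32, "ghz": 2.5}  # 32 of 64 threads used


def thread_ghz(machine):
    """Nominal compute resources used by the run, in thread-GHz."""
    return machine["threads"] * machine["ghz"]


# How much faster the server actually is.
speedup = server["ns_per_day"] / notebook["ns_per_day"]

# How much more nominal compute the server run had available.
resources = thread_ghz(server) / thread_ghz(notebook)

# Fraction of the naive linear-scaling expectation that was achieved.
efficiency = speedup / resources

print(f"measured speedup:    {speedup:.2f}x")
print(f"thread-GHz ratio:    {resources:.2f}x")
print(f"relative efficiency: {efficiency:.0%}")
```

Under that assumption, the larger system has roughly 5.4 times the thread-GHz of the notebook but delivers only about 2.1 times the throughput, which is under 40% of what naive linear scaling would suggest.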

A study by Professor Agner Fog of the Technical University of Denmark says the following about processors with the Bulldozer and Piledriver architectures:
"- The throughput of 256-bit store instructions is less than half the throughput of 128-bit store instructions on Bulldozer and Piledriver. It is particularly bad on the Piledriver, which has a throughput of one 256-bit store per 17 - 20 clock cycles.
- 128-bit register-to-register moves have zero latency, while 256-bit register-to-register moves have a latency of 2 clocks plus a penalty of 2-3 clocks for using a different domain (see below) on Bulldozer and Piledriver."
and:
"Therefore, there is no advantage in using 256-bit instructions on Bulldozer and Piledriver when the bottleneck is execution unit throughput or instruction decoding. The poor throughput of 256-bit stores makes it a disadvantage to use 256-bit registers on the Piledriver."

This is probably the reason why the GROMACS developers invested time in developing an appropriate optimization. Quote from the GROMACS project site:
"Currently the supported acceleration options are: none, SSE2, SSE4.1, AVX-128-FMA (AMD Bulldozer + Piledriver) and AVX-256 (Intel Sandy+Ivy Bridge)."
and:
"On x86, the performance difference between SSE2 and SSE4.1 is minor. All other, higher acceleration differences are significant."

Therefore, I think it would be good to also have a version of the application built with these optimizations. I would certainly be delighted.
