While the previous two parts where more focused on theory and concepts, we are now going to actually get our hands dirty and write some code to see how different approaches to SIMD processing compare in practice, both from a performance an an ease-of-implementation point of view.
Prelude
In Part II I used F# (pseudo-)code to illustrate the difference between the AoS (array of structures) and SoA (structure of array) approaches. That's why I thought using F# for implementing the benchmarks might be a good idea. So I installed Visual Studio 2015, created a new F# console application project and installed System.Numerics.Vectors in its most recent 4.1.0 incarnation via NuGet. Yet, when I tried to use System.Numerics.Vector<T> IntelliSense wanted to convince me there was no such thing:Maybe just a problem with the F# language service? I tried to run this little hello world sample
but that didn't work either, because it references the wrong version of System.Numerics.Vectors:
I didn't have luck with manually replacing the "C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.6\System.Numerics.Vectors.dll" reference with the one delivered by the System.Numerics.Vectors NuGet package either.
For now, I thus resorted to using C# as an implementation language instead.
Managed Implementations
For the upcoming benchmarks we will use the same problem we stated in the previous post: Compute the squared L2 norm of a set of 3-vectors. According to the previous post, we want to compare five possible approaches:- Scalar AoS
- AoS vectorized across each 3-vector
- On-the-fly vectorization (converting AoS to SoA)
- Scalar SoA
- Vectorized SoA
For further details, you can consult the full benchmark code here. In case you want to run it yourself and change some parameters, make sure that vectorLen is a multiple of laneWidth, as the remaing code assumes.
All of the following benchmark code was compiled and run on an Intel Core i5-4570 (Haswell) with 16 GB auf DD3-SDRAM on Windows 10 Pro. The managed implementation was developed using Visual Studio 2015 on .NET 4.6 and targeted x64 (release build, "RyuJIT").
Scalar AoS
This version is probably the easiest to understand and, as long as you don't intend to vectorize, a pretty reasonable one: We have an array of Vector3 structures (here we simply use the one provided by System.Numerics.Vectors) and compute the resulting dot products vector by vector:The outer loop over j is there to ensure the total number of computed dot products is 32 billion. For the inner loop, the JIT compiler generates the following machine code:
As expected, it uses scalar AVX instructions to compute the dot products (VMULSS, VADDSS).
AoS vectorized across each 3-vector
In this version, we still compute a single dot product per iteration, but we compute the squares of the components at once and then determine the (horizontal) sum of those squared components:When using Vector3.Dot, the compiler emits code that uses the DPPS (SSE 4.1) or VDPPS (AVX) instructions:
For some reason, it's loading the data twice, first into XMM0 and then into XMM1 (VMOVSS, VMOVSD, VSHUFPS).
On-the-fly vectorization
Because AoS isn't a data layout well suited for vectorization, last time we came up with the idea of reordering the data on the fly. Vector gather instructions would help with that, but System.Numerics.Vector<T> only supports loading consecutive elements for now. The only managed solution I could come with is to first manually gather the required data into temporary arrays and then creating the vector instances from these temporary data structures:That works, in principle, meaning that the compiler can now emit VMULPS and VADDPS instructions to compute 8 dot products at once. Yet, because the JIT compiler doesn't employ VGATHERDPS all this gathering becomes quite cumbersome:
As you can probably imagine, this code isn't exactly a candidate for the world's fastest dot3 product implementation...
Scalar SoA
Let's move on to a proper SoA data layout. The scalar version is similar to the scalar AoS version, only that we now index the components instead of the array of vectors:The resulting machine code is likewise similar to scalar AoS:
Vectorized SoA
The SoA layout makes it easy to use vector arithmetic to compute 8 dot products at once:This is what the compiler makes of it:
Nice, but sprinkled with range (?) checks. I also wonder, why it emits VMOVUPD instead of VMOVUPS instructions.
Unmanaged Implementations
After implementing and running the above variants in C#, I figured it would be useful to have something to compare the results to. Thus, I ported the benchmark code to C++ to see what the Visual C++ optimizer, its auto-vectorizer and SIMD intrinsics can do and how close we can get to the theoretical peak performance of the Haswell CPU. For the "native" implementation I used the Visual C++ 2015 compiler with the following flags:/GS- /GL /W3 /Gy /Zi /Gm- /Ox /Ob2 /Zc:inline /fp:fast /WX- /Zc:forScope /arch:AVX2 /Gd /Oy /Oi /MD /Ot
Scalar AoS
Again, the code for this version is pretty straightforward:In case you wonder about the inner loop construction: while (i--) turned out to result in slightly faster code than a more traditional for loop.
No surprises regarding the machine code, either:
AoS vectorized across each 3-vector
Let's use the DPPS instruction via intrinsics:Notice the little trick of directly loading four consecutive floats instead of scalar loads and shuffling. Strictly speaking, this might go wrong for the last element of the vector, if you try to access unallocated memory... In reality you'd handle that special case separately (or simply allocate a few more bytes). The corresponding machine code is really compact:
On-the-fly vectorization
In contrast to the managed version, we can now employ AVX's vector gather instructions to load eight of each 3ed component value into YMM registers:_mm256_fmadd_ps results in FMA3 (fused multiply add) instructions, combining multiplication and addition/accumulation in one instruction:
Scalar SoA
Auto-vectorized SoA
In order for the auto-vectorizer to kick in, we need to use a for-loop for the inner iteration:Now the compiler even generates FMA instructions:
Vectorized SoA
Of course we can also vectorize manually by using intrinsics:Vectorized SoA using FMA
This is the same as above, but it additionaly makes use of FMA:Results
The following figure displays the performance in GFLOP/s of the different versions. The dashed line is at 51.2 GFLOP/s, the theoretical peak performance of a single Haswell core (single precision): First of all, both, all the AoS variants and the scalar SoA version, don't even come close to the vectorized SoA versions. Second, any attempts at accelerating the original AoS version failed (C#) or only provide insignificant performance gains (C++). Even vector gather can't save the day and in fact further impairs performance. In any event, the gains don't justify the more complicated code.If you really need the performance SIMD can provide, you have to switch to a SoA layout: While Visual C++'s auto-vectorizer may relieve you of writing SIMD intrinsics directly, it still requires SIMD-friendly—that is: SoA—code. As long as it works, it provides the most accessible way of writing high-performance code. The second-best way, from a usability stand point, is probably C# and System.Numerics.Vectors, which enables (explicit) SIMD programming via a comparably easy-to-use interface.
Yet, the plot above also shows that non of the managed solutions is really able to keep up with any of the vectorized C++ versions. One reason for that is the inferior code generation of the JIT compiler compared to the C++ optimizer. Others are more intrinsic to the managed programming model (null-pointer checks, range checks). But also System.Numerics.Vectors is far from being complete: For instance, there is no support for FMA or scatter/gather operations. A "Vector8" type could help treating a float[] as AVX-sized chunks.