Friday, August 28, 2015

SIMD Fundamentals. Part III: Implementation & Benchmarks

This is the third part of my series on SIMD Fundamentals, targeted at .NET developers. You may read the other two parts here:
While the previous two parts were more focused on theory and concepts, we are now going to get our hands dirty and write some code to see how different approaches to SIMD processing compare in practice, both from a performance and an ease-of-implementation point of view.

Prelude

In Part II I used F# (pseudo-)code to illustrate the difference between the AoS (array of structures) and SoA (structure of arrays) approaches. That's why I thought using F# for implementing the benchmarks might be a good idea. So I installed Visual Studio 2015, created a new F# console application project and installed System.Numerics.Vectors in its most recent 4.1.0 incarnation via NuGet. Yet, when I tried to use System.Numerics.Vector<T>, IntelliSense wanted to convince me there was no such thing:
Maybe just a problem with the F# language service? I tried to run this little hello world sample

open System.Numerics

[<EntryPoint>]
let main argv =
    let laneWidth = System.Numerics.Vector<float32>.Count
    printfn "%i" laneWidth
    0
but that didn't work either, because it references the wrong version of System.Numerics.Vectors:

------ Build started: Project: ConsoleApplication1, Configuration: Release Any CPU ------
C:\Program Files (x86)\Microsoft SDKs\F#\4.0\Framework\v4.0\fsc.exe -o:obj\Release\ConsoleApplication1.exe --debug:pdbonly --noframework --define:TRACE --doc:bin\Release\ConsoleApplication1.XML --optimize+ --platform:x64 -r:"C:\Program Files (x86)\Reference Assemblies\Microsoft\FSharp\.NETFramework\v4.0\4.4.0.0\FSharp.Core.dll" -r:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.6\mscorlib.dll" -r:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.6\System.Core.dll" -r:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.6\System.dll" -r:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.6\System.Numerics.dll" -r:"C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.6\System.Numerics.Vectors.dll" --target:exe --warn:3 --warnaserror:76 --vserrors --validate-type-providers --LCID:1033 --utf8output --fullpaths --flaterrors --subsystemversion:6.00 --highentropyva+ --sqmsessionguid:dbfb1201-d8ea-4f7f-af58-f587b9d5e3cb "C:\Users\Frank\AppData\Local\Temp\.NETFramework,Version=v4.6.AssemblyAttributes.fs" AssemblyInfo.fs Program.fs
AosVsSoa\ConsoleApplication1\ConsoleApplication1\Program.fs(5,37): error FS0039: The value, constructor, namespace or type 'Vector' is not defined
Done building project "ConsoleApplication1.fsproj" -- FAILED.
I didn't have luck with manually replacing the "C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.6\System.Numerics.Vectors.dll" reference with the one delivered by the System.Numerics.Vectors NuGet package either.

For now, I thus resorted to using C# as an implementation language instead.

Managed Implementations

For the upcoming benchmarks we will use the same problem we stated in the previous post: Compute the squared L2 norm of a set of 3-vectors. According to the previous post, we want to compare five possible approaches:
  • Scalar AoS
  • AoS vectorized across each 3-vector
  • On-the-fly vectorization (converting AoS to SoA)
  • Scalar SoA
  • Vectorized SoA
To avoid any memory bottlenecks and keep everything cached, we will use a comparatively small set of vectors. I found that an array of 1024 to 2048 Vec3fs yielded peak performance. Note that this has the nice side effect of being a multiple of the AVX vector lane width (32 bytes, or 8 single-precision floating-point values). To get a reliable estimate of the average performance, we repeat the procedure a large number of times: for the benchmarks, all procedures had to compute a total of 32 · 2^30 (about 34 billion) dot products.

For further details, you can consult the full benchmark code here. In case you want to run it yourself and change some parameters, make sure that vectorLen is a multiple of laneWidth, as the remaining code assumes.
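
To give a rough idea of how the benchmark is parameterized, here is a minimal sketch of the constants the snippets below rely on. The exact values in the real benchmark may differ, and totalDots is a name I made up purely for this illustration:

using System.Numerics;

static class BenchmarkConfig {
    // Number of floats per Vector<float> (8 on an AVX-capable CPU under RyuJIT).
    static readonly int laneWidth = Vector<float>.Count;

    // Size of the working set: small enough to stay in cache and a
    // multiple of laneWidth, so the vectorized loops need no tail handling.
    const int vectorLen = 2048;

    // Total amount of work (32 * 2^30 dot products), split into 'reps'
    // passes over the vectorLen vectors.
    const long totalDots = 32L * 1024 * 1024 * 1024;
    static readonly int reps = (int)(totalDots / vectorLen);
}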

All of the following benchmark code was compiled and run on an Intel Core i5-4570 (Haswell) with 16 GB of DDR3 SDRAM on Windows 10 Pro. The managed implementation was developed using Visual Studio 2015 on .NET 4.6 and targeted x64 (release build, "RyuJIT").

Scalar AoS

This version is probably the easiest to understand and, as long as you don't intend to vectorize, a pretty reasonable one: We have an array of Vector3 structures (here we simply use the one provided by System.Numerics.Vectors) and compute the resulting dot products vector by vector:

static void Dot3AosScalar(Vector3[] vs, float[] dp) {
    for (var j = 0; j < reps; ++j) {
        for (var i = 0; i < dp.Length; ++i) {
            dp[i] = vs[i].X * vs[i].X + vs[i].Y * vs[i].Y + vs[i].Z * vs[i].Z;
        }
    }
}
The outer loop over j is there to ensure the total number of computed dot products reaches 32 · 2^30. For the inner loop, the JIT compiler generates the following machine code:

cmp r9d,r10d
jae 00007FF98D096F68
movsxd r11,r9d
imul r11,r11,3
vmovss xmm0,dword ptr [rcx+r11*4+10h]
vmulss xmm0,xmm0,xmm0
lea r11,[rcx+r11*4+10h]
vmovss xmm1,dword ptr [r11+4]
vmulss xmm1,xmm1,xmm1
vaddss xmm0,xmm0,xmm1
vmovss xmm1,dword ptr [r11+8]
vmulss xmm1,xmm1,xmm1
vaddss xmm0,xmm0,xmm1
movsxd r11,r9d
vmovss dword ptr [rdx+r11*4+10h],xmm0
inc r9d
cmp r8d,r9d
jg 00007FF98D096F06
As expected, it uses scalar AVX instructions to compute the dot products (VMULSS, VADDSS).

AoS vectorized across each 3-vector

In this version, we still compute a single dot product per iteration, but we compute the squares of the components at once and then determine the (horizontal) sum of those squared components:

static void Dot3AosVectorDp(Vector3[] vs, float[] dp) {
    for (var j = 0; j < reps; ++j) {
        for (var i = 0; i < dp.Length; ++i) {
            dp[i] = Vector3.Dot(vs[i], vs[i]);
        }
    }
}
When using Vector3.Dot, the compiler emits code that uses the DPPS (SSE 4.1) or VDPPS (AVX) instructions:

cmp r9d,r10d
jae 00007FF98D077A54
movsxd r11,r9d
imul r11,r11,3
lea r11,[rcx+r11*4+10h]
vmovss xmm1,dword ptr [r11+8]
vmovsd xmm0,qword ptr [r11]
vshufps xmm0,xmm0,xmm1,44h
vmovss xmm2,dword ptr [r11+8]
vmovsd xmm1,qword ptr [r11]
vshufps xmm1,xmm1,xmm2,44h
vdpps xmm0,xmm0,xmm1,0F1h
movsxd r11,r9d
vmovss dword ptr [rdx+r11*4+10h],xmm0
inc r9d
cmp r8d,r9d
jg 00007FF98D0779F6
For some reason, it's loading the data twice, first into XMM0 and then into XMM1 (VMOVSS, VMOVSD, VSHUFPS).

On-the-fly vectorization

Because AoS isn't a data layout well suited for vectorization, last time we came up with the idea of reordering the data on the fly. Vector gather instructions would help with that, but System.Numerics.Vector<T> only supports loading consecutive elements for now. The only managed solution I could come up with is to first manually gather the required data into temporary arrays and then create the vector instances from these temporary data structures:

static void Dot3AosGather(Vector3[] vs, float[] dp) {
    var xtmp = new float[laneWidth];
    var ytmp = new float[laneWidth];
    var ztmp = new float[laneWidth];
    for (var j = 0; j < reps; ++j) {
        for (var i = 0; i < dp.Length; i += laneWidth) {
            for (var k = 0; k < laneWidth; ++k) {
                xtmp[k] = vs[i + k].X;
                ytmp[k] = vs[i + k].Y;
                ztmp[k] = vs[i + k].Z;
            }
            var x = new Vector<float>(xtmp);
            var y = new Vector<float>(ytmp);
            var z = new Vector<float>(ztmp);
            var dpv = x * x + y * y + z * z;
            dpv.CopyTo(dp, i);
        }
    }
}
That works, in principle, meaning that the compiler can now emit VMULPS and VADDPS instructions to compute 8 dot products at once. Yet, because the JIT compiler doesn't employ VGATHERDPS all this gathering becomes quite cumbersome:

xor r9d,r9d
mov r10d,dword ptr [rbx+8]
movsxd r10,r10d
cmp r10,8
setge r10b
movzx r10d,r10b
mov r11d,dword ptr [rbp+8]
movsxd r11,r11d
cmp r11,8
setge r11b
movzx r11d,r11b
and r10d,r11d
mov r11d,dword ptr [rax+8]
movsxd r11,r11d
cmp r11,8
setge r11b
movzx r11d,r11b
and r10d,r11d
test r10d,r10d
je 00007FF98D078059
mov r10d,dword ptr [rsi+8]
lea r11d,[r8+r9]
cmp r11d,r10d
jae 00007FF98D078144
movsxd r11,r11d
imul r14,r11,3
vmovss xmm0,dword ptr [rsi+r14*4+10h]
movsxd r14,r9d
vmovss dword ptr [rbx+r14*4+10h],xmm0
imul r11,r11,3
lea r11,[rsi+r11*4+10h]
vmovss xmm1,dword ptr [r11+4]
movsxd r14,r9d
vmovss dword ptr [rbp+r14*4+10h],xmm1
vmovss xmm2,dword ptr [r11+8]
movsxd r11,r9d
vmovss dword ptr [rax+r11*4+10h],xmm2
inc r9d
cmp r9d,8
jl 00007FF98D077FFD
jmp 00007FF98D0780DF
mov r10d,dword ptr [rsi+8]
lea r11d,[r8+r9]
cmp r11d,r10d
jae 00007FF98D078144
movsxd r10,r11d
imul r11,r10,3
vmovss xmm0,dword ptr [rsi+r11*4+10h]
mov r11d,dword ptr [rbx+8]
cmp r9d,r11d
jae 00007FF98D078144
movsxd r11,r9d
vmovss dword ptr [rbx+r11*4+10h],xmm0
imul r10,r10,3
lea r10,[rsi+r10*4+10h]
vmovss xmm1,dword ptr [r10+4]
mov r11d,dword ptr [rbp+8]
cmp r9d,r11d
jae 00007FF98D078144
movsxd r11,r9d
vmovss dword ptr [rbp+r11*4+10h],xmm1
vmovss xmm2,dword ptr [r10+8]
mov r10d,dword ptr [rax+8]
cmp r9d,r10d
jae 00007FF98D078144
movsxd r10,r9d
vmovss dword ptr [rax+r10*4+10h],xmm2
inc r9d
cmp r9d,8
jl 00007FF98D078059
vmovupd ymm0,ymmword ptr [rbx+10h]
vmovupd ymm1,ymmword ptr [rbp+10h]
vmovupd ymm2,ymmword ptr [rax+10h]
vmulps ymm0,ymm0,ymm0
vmulps ymm1,ymm1,ymm1
vaddps ymm0,ymm0,ymm1
vmulps ymm1,ymm2,ymm2
vaddps ymm0,ymm0,ymm1
lea r9d,[r8+7]
cmp r9d,ecx
jae 00007FF98D078144
vmovupd ymmword ptr [rdi+r8*4+10h],ymm0
add r8d,8
cmp ecx,r8d
jg 00007FF98D077FB2
As you can probably imagine, this code isn't exactly a candidate for the world's fastest dot3 product implementation...

Scalar SoA

Let's move on to a proper SoA data layout. The scalar version is similar to the scalar AoS version, except that we now index the component arrays instead of an array of vectors:

static void Dot3SoaScalar(float[] xs, float[] ys, float[] zs, float[] dp) {
    for (var j = 0; j < reps; ++j) {
        for (var i = 0; i < dp.Length; ++i) {
            dp[i] = xs[i] * xs[i] + ys[i] * ys[i] + zs[i] * zs[i];
        }
    }
}
The resulting machine code is likewise similar to scalar AoS:

movsxd rsi,r11d
vmovss xmm0,dword ptr [rcx+rsi*4+10h]
vmulss xmm0,xmm0,xmm0
movsxd rsi,r11d
vmovss xmm1,dword ptr [rdx+rsi*4+10h]
vmulss xmm1,xmm1,xmm1
vaddss xmm0,xmm0,xmm1
movsxd rsi,r11d
vmovss xmm1,dword ptr [r8+rsi*4+10h]
vmulss xmm1,xmm1,xmm1
vaddss xmm0,xmm0,xmm1
movsxd rsi,r11d
vmovss dword ptr [r9+rsi*4+10h],xmm0
inc r11d
cmp r10d,r11d
jg 00007FF98D076BDF

Vectorized SoA

The SoA layout makes it easy to use vector arithmetic to compute 8 dot products at once:

static void Dot3SoaVectorized(float[] xs, float[] ys, float[] zs, float[] dp) {
    for (var j = 0; j < reps; ++j) {
        for (var i = 0; i < dp.Length; i += laneWidth) {
            var x = new Vector<float>(xs, i);
            var y = new Vector<float>(ys, i);
            var z = new Vector<float>(zs, i);
            var d = x * x + y * y + z * z;
            d.CopyTo(dp, i);
        }
    }
}
This is what the compiler makes of it:

lea ebp,[r11+7]
cmp ebp,esi
jae 00007FF98D06828A
vmovupd ymm0,ymmword ptr [rcx+r11*4+10h]
cmp ebp,edi
jae 00007FF98D06828A
vmovupd ymm1,ymmword ptr [rdx+r11*4+10h]
cmp ebp,ebx
jae 00007FF98D06828A
vmovupd ymm2,ymmword ptr [r8+r11*4+10h]
vmulps ymm0,ymm0,ymm0
vmulps ymm1,ymm1,ymm1
vaddps ymm0,ymm0,ymm1
vmulps ymm1,ymm2,ymm2
vaddps ymm0,ymm0,ymm1
cmp ebp,r10d
jae 00007FF98D06828A
vmovupd ymmword ptr [r9+r11*4+10h],ymm0
add r11d,8
cmp r10d,r11d
jg 00007FF98D068220
Nice, but sprinkled with what appear to be range checks. I also wonder why it emits VMOVUPD instead of VMOVUPS instructions.

Unmanaged Implementations

After implementing and running the above variants in C#, I figured it would be useful to have something to compare the results to. Thus, I ported the benchmark code to C++ to see what the Visual C++ optimizer, its auto-vectorizer and SIMD intrinsics can do and how close we can get to the theoretical peak performance of the Haswell CPU. For the "native" implementation I used the Visual C++ 2015 compiler with the following flags:
/GS- /GL /W3 /Gy /Zi /Gm- /Ox /Ob2 /Zc:inline /fp:fast /WX- /Zc:forScope /arch:AVX2 /Gd /Oy /Oi /MD /Ot

Scalar AoS

Again, the code for this version is pretty straightforward:

void dot3_aos_scalar(const vector<Vec3f>& vs, vector<float>& dp) {
    for (auto j = 0; j < reps; ++j) {
        auto i = vector_len;
        while (i--) {
            dp[i] = vs[i].x * vs[i].x + vs[i].y * vs[i].y + vs[i].z * vs[i].z;
        }
    }
}
In case you wonder about the inner loop construction: while (i--) turned out to result in slightly faster code than a more traditional for loop.

No surprises regarding the machine code, either:

lea rax,[rax-0Ch]
lea rdx,[rdx-4]
vmovss xmm0,dword ptr [rax-8]
vmovss xmm2,dword ptr [rax-4]
vmovss xmm3,dword ptr [rax]
vmulss xmm1,xmm0,xmm0
vmulss xmm0,xmm2,xmm2
vaddss xmm2,xmm1,xmm0
vmulss xmm1,xmm3,xmm3
vaddss xmm2,xmm2,xmm1
vmovss dword ptr [rdx],xmm2
sub ecx,1
jne benchmark<<lambda_918319110c26e9fabd8b05fc8f2cd5cd>,<lambda_c092a5680821d9f5b8bc5a7043f59100> >+0B3h (07FF656A82803h)

AoS vectorized across each 3-vector

Let's use the DPPS instruction via intrinsics:

void dot3_aos_vector_dp(const vector<Vec3f>& vs, vector<float>& dp) {
    // 0000 0000 0111 0001: mul lower three components, store sum in lowest component
    static const auto mask = 0x71;
    for (auto j = 0; j < reps; ++j) {
        const auto pvs = (float*)vs.data();
        auto pdp = (float*)dp.data();
        auto i = vector_len;
        while (i--) {
            // load 16 bytes (xyz|x)
            const auto xyzx = _mm_loadu_ps(pvs + i * 3);
            // compute d = x*x + y*y + z*z + 0*0 -> 000d
            const auto xxyyzz00 = _mm_dp_ps(xyzx, xyzx, mask);
            // store d (lower 4 bytes of dpv)
            _mm_store_ss(pdp + i, xxyyzz00);
        }
    }
}
Notice the little trick of directly loading four consecutive floats instead of using scalar loads and shuffling. Strictly speaking, this can go wrong for the last element of the vector, because loading 16 bytes there reads four bytes past the end of the array and might touch unallocated memory... In reality you'd handle that special case separately (or simply allocate a few more bytes). The corresponding machine code is really compact:

lea rcx,[rcx-0Ch]
lea rax,[rax-4]
vmovups xmm0,xmmword ptr [rcx]
vdpps xmm0,xmm0,xmm0,71h
vmovss dword ptr [rax],xmm0
sub edx,1
jne benchmark<<lambda_d1ac89d5e59a169233af7a419374e043>,<lambda_c092a5680821d9f5b8bc5a7043f59100> >+0C0h (07FF7B83F2100h)

On-the-fly vectorization

In contrast to the managed version, we can now employ AVX2's vector gather instructions to load eight values of each of the three components into YMM registers:

void dot3_aos_vector_gather(const vector<Vec3f>& vs, vector<float>& dp) {
    static const auto epi32_one = _mm256_set1_epi32(1);
    static const auto x_offsets = _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21);
    static const auto y_offsets = _mm256_add_epi32(x_offsets, epi32_one);
    static const auto z_offsets = _mm256_add_epi32(y_offsets, epi32_one);
    for (auto j = 0; j < reps; ++j) {
        const auto pvs = (float*)vs.data();
        auto pdp = (__m256*)dp.data();
        auto i = vector_len / lane_width;
        while (i--) {
            // xyz|xyz|xyz|xyz|xyz|xyz|xyz|xyz -> load 8 * 3 = 24 scattered floats
            const auto base = pvs + i * 3 * lane_width;
            const auto xs = _mm256_i32gather_ps(base, x_offsets, sizeof(float));
            const auto ys = _mm256_i32gather_ps(base, y_offsets, sizeof(float));
            const auto zs = _mm256_i32gather_ps(base, z_offsets, sizeof(float));
            const auto xx = _mm256_mul_ps(xs, xs);
            const auto yyxx = _mm256_fmadd_ps(ys, ys, xx);
            const auto zzyyxx = _mm256_fmadd_ps(zs, zs, yyxx);
            pdp[i] = zzyyxx;
        }
    }
}
_mm256_fmadd_ps results in FMA3 (fused multiply add) instructions, combining multiplication and addition/accumulation in one instruction:

lea rax,[rax-60h]
lea r8,[r8-20h]
vpcmpeqb ymm2,ymm2,ymm2
vmovups ymm5,ymm0
vgatherdps ymm5,dword ptr [rax+ymm6*4],ymm2
vpcmpeqb ymm2,ymm2,ymm2
vmovups ymm4,ymm0
vgatherdps ymm4,dword ptr [rax+ymm7*4],ymm2
vpcmpeqb ymm2,ymm2,ymm2
vmovups ymm1,ymm0
vgatherdps ymm1,dword ptr [rax+ymm8*4],ymm2
vmulps ymm2,ymm5,ymm5
vfmadd231ps ymm2,ymm4,ymm4
vmovups ymm0,ymm2
vfmadd231ps ymm0,ymm1,ymm1
vmovups ymmword ptr [r8],ymm0
sub edx,1
jne dot3_aos_vector_gather+190h (07FF624C81470h)

Scalar SoA

void dot3_soa_scalar(const vector<float>& xs, const vector<float>& ys, const vector<float>& zs, vector<float>& dp) {
    for (auto j = 0; j < reps; ++j) {
        auto i = vector_len;
        while (i--) {
            dp[i] = xs[i] * xs[i] + ys[i] * ys[i] + zs[i] * zs[i];
        }
    }
}
lea rax,[rax-4]
vmovss xmm0,dword ptr [rax]
vmovss xmm2,dword ptr [rdx+rax]
vmovss xmm3,dword ptr [r8+rax]
vmulss xmm1,xmm0,xmm0
vmulss xmm0,xmm2,xmm2
vaddss xmm2,xmm1,xmm0
vmulss xmm1,xmm3,xmm3
vaddss xmm2,xmm2,xmm1
vmovss dword ptr [r9+rax],xmm2
sub ecx,1
jne dot3_soa_scalar+50h (07FF6F2F31330h)

Auto-vectorized SoA

In order for the auto-vectorizer to kick in, we need to use a for-loop for the inner iteration:

void dot3_soa_autovec(const vector<float>& xs, const vector<float>& ys, const vector<float>& zs, vector<float>& dp) {
    for (auto j = 0; j < reps; ++j) {
        for (auto i = 0; i < vector_len; ++i) {
            dp[i] = xs[i] * xs[i] + ys[i] * ys[i] + zs[i] * zs[i];
        }
    }
}
Now the compiler even generates FMA instructions:

lea rax,[rcx+rbx]
vmovups ymm2,ymmword ptr [r10+rcx]
vmovups ymm3,ymmword ptr [rcx]
vmovups ymm0,ymmword ptr [r11+rcx]
lea rcx,[rcx+20h]
vmulps ymm1,ymm2,ymm2
vfmadd231ps ymm1,ymm3,ymm3
vfmadd231ps ymm1,ymm0,ymm0
vmovups ymmword ptr [rax+r8],ymm1
sub r9,1
jne dot3_soa_autovec+190h (07FF7B8661510h)

Vectorized SoA

Of course we can also vectorize manually by using intrinsics:

void dot3_soa_vectorized(const vector<float>& xs, const vector<float>& ys, const vector<float>& zs, vector<float>& dp) {
    for (auto j = 0; j < reps; ++j) {
        const auto px = (__m256*)xs.data();
        const auto py = (__m256*)ys.data();
        const auto pz = (__m256*)zs.data();
        auto pd = (__m256*)dp.data();
        auto i = vector_len / lane_width;
        while (i--) {
            pd[i] = _mm256_add_ps(_mm256_add_ps(_mm256_mul_ps(px[i], px[i]), _mm256_mul_ps(py[i], py[i])), _mm256_mul_ps(pz[i], pz[i]));
        }
    }
}
lea rax,[rax-20h]
vmovups ymm0,ymmword ptr [rbx+rax]
vmovups ymm2,ymmword ptr [rax]
vmovups ymm3,ymmword ptr [rdi+rax]
vmulps ymm1,ymm0,ymm0
vmulps ymm0,ymm2,ymm2
vaddps ymm2,ymm1,ymm0
vmulps ymm1,ymm3,ymm3
vaddps ymm2,ymm2,ymm1
vmovups ymmword ptr [rsi+rax],ymm2
sub r11d,1
jne dot3_soa_vectorized+52h (07FF7183817D2h)

Vectorized SoA using FMA

This is the same as above, but it additionally makes use of FMA:

void dot3_soa_vectorized_fma(const vector<float>& xs, const vector<float>& ys, const vector<float>& zs, vector<float>& dp) {
    for (auto j = 0; j < reps; ++j) {
        const auto px = (__m256*)xs.data();
        const auto py = (__m256*)ys.data();
        const auto pz = (__m256*)zs.data();
        auto pd = (__m256*)dp.data();
        auto i = vector_len / lane_width;
        while (i--) {
            pd[i] = _mm256_fmadd_ps(pz[i], pz[i], _mm256_fmadd_ps(py[i], py[i], _mm256_mul_ps(px[i], px[i])));
        }
    }
}
lea rax,[rax-20h]
vmovups ymm0,ymmword ptr [rbx+rax]
vmovups ymm2,ymmword ptr [rax]
vmovups ymm3,ymmword ptr [rdi+rax]
vmulps ymm0,ymm0,ymm0
vfmadd231ps ymm0,ymm2,ymm2
vfmadd231ps ymm0,ymm3,ymm3
vmovups ymmword ptr [rsi+rax],ymm0
sub r11d,1
jne dot3_soa_vectorized+52h (07FF7DF9913D2h)

Results

The following figure displays the performance in GFLOP/s of the different versions. The dashed line is at 51.2 GFLOP/s, the theoretical peak performance of a single Haswell core (single precision):
First of all, neither the AoS variants nor the scalar SoA version even come close to the vectorized SoA versions. Second, any attempt at accelerating the original AoS version either failed (C#) or only provided insignificant performance gains (C++). Even vector gather can't save the day and in fact further impairs performance. In any event, the gains don't justify the more complicated code.

If you really need the performance SIMD can provide, you have to switch to an SoA layout: While Visual C++'s auto-vectorizer may relieve you of writing SIMD intrinsics directly, it still requires SIMD-friendly—that is: SoA—code. As long as it works, it provides the most accessible way of writing high-performance code. The second-best way, from a usability standpoint, is probably C# and System.Numerics.Vectors, which enables (explicit) SIMD programming via a comparatively easy-to-use interface.

Yet, the plot above also shows that none of the managed solutions is really able to keep up with any of the vectorized C++ versions. One reason for that is the inferior code generation of the JIT compiler compared to the C++ optimizer. Others are more intrinsic to the managed programming model (null-pointer checks, range checks). System.Numerics.Vectors itself is also far from complete: for instance, there is no support for FMA or scatter/gather operations. A "Vector8" type could help treat a float[] as AVX-sized chunks.

Conclusions

Want speed? Use a vectorized SoA approach. Want more speed? Use C++. That's what I learned from this little investigation. That, and—until we have an AI-driven optimizing compiler that is smart enough to understand what we are trying to achieve and then automatically transforms our code into the most efficient form—that writing high-performance code will remain both an art and hard work for the time being.

Saturday, June 13, 2015

SIMD Fundamentals. Part II: AoS, SoA, Gather/Scatter - Oh my!

Last time we looked at a very basic example of a data parallel problem: adding two arrays. Unfortunately, data parallelism is not always so easy to extract from existing code bases and often requires considerable effort. Today we will therefore address a slightly more complicated problem and move step by step from a pure scalar to a fully vectorized version.

The problem and its SISD solution

Imagine we wanted to compute the squared ℓ²-norm of a set of m 3-vectors u in R³, which is really just the dot product of each vector x with itself: ‖x‖² = x² + y² + z²

In good OO fashion, you'd probably first define a Vec3f data type, in its simplest form like so (F# syntax)

type Vec3f = struct
    val public x: float32
    val public y: float32
    val public z: float32
    new(x: float32, y: float32, z: float32) = { x = x; y = y; z = z }
end
Our vector field u can then simply be represented by an array of these structs and the code that performs the computation of the dot products is

for i = 0 to m - 1 do
    dp.[i] <- u.[i].x * u.[i].x + u.[i].y * u.[i].y + u.[i].z * u.[i].z
Load the three components, square them, add up the squares and store the result in dp:
 
For SSE2, that's a total of three MULSS and two ADDSS instructions, so the one dot product we get out is worth 5 FLOPs. Now we would of course like to make use of our shiny SIMD hardware to accelerate the computation. There are a few options of how we can approach this:

Pseudo-SIMD: Bolting SIMD on top of AoS

Our current data layout is an "array of structures" (AoS). As we shall see further below, AoS isn't a particularly good choice for vectorization, and SIMDifying such code won't yield peak performance. Yet it can be done, and depending on the SIMD instruction set used it may even provide some speed-up.

Here's the idea: Instead of loading the three scalar components into separate scalar registers, we store all of them in a single SIMD register. In the case of SSE2, n = 4 for single-precision floats, so one of the elements in the register is a dummy value that we can ignore. Now we can multiply the SIMD register with itself to get the squared components (plus the ignored fourth one). Then we need to horizontally add up the three squares we are interested in; in pseudo-F#:


for i = 0 to m - 1 do
    // init a SIMD register (type float4) to 4x 0.0f
    let mutable xyzw = float4 0.0f
    // pack x, y and z components into SIMD reg.
    xyzw.[0] <- u.[i].x
    xyzw.[1] <- u.[i].y
    xyzw.[2] <- u.[i].z
    // component-wise multiplication
    let xxyyzzww = xyzw * xyzw
    // extract x², y² and z², store the sum
    dp.[i] <- xxyyzzww.[0] + xxyyzzww.[1] + xxyyzzww.[2]
Graphically:

Although we were able to replace three MULSS with a single MULPS and therefore express 5 FLOPs through only 3 arithmetic instructions, there are multiple issues with this idea: 
  1. We need three scalar loads plus shuffling operations to place the three components in a single SIMD register.
  2. We effectively only use 3 of the 4 vector lanes, as we operate on Vec3f, not Vec4f structs. This problem becomes even more severe with wider vector paths.
  3. We use vector operations only for multiplication, but not for addition.
  4. Horizontal operations like adding the components are comparatively slow, require shuffling the components around and extracting a single scalar (the result).
  5. We still only compute one dot product per iteration.
Another example of this kind of pseudo-vectorization can be found in Microsoft's SIMD sample pack (ray tracer sample). Yet, for all the reasons mentioned above, you should avoid this approach whenever possible.

On-the-fly vectorization

If you recall our "Hello, world!" example from Part I, then you may also remember that vectorizing our loop meant that we computed n results per iteration instead of one. And that is what we have to achieve for our current dot-product example as well: We need to compute n dot products in each iteration. How can we accomplish this?

Imagine our data (array of Vec3f structs) to form a 3×m matrix (three rows, m columns). Each column represents a Vec3f, each row the x, y or z components of all Vec3fs. In the previous section, we tried to vectorize along the columns by parallelizing the computation of a single dot product—and largely failed.

The reason is that the data parallelism in this example can really be found along the rows of our imagined matrix, as each dot product can be computed independently of the others and thus in parallel. Furthermore, m, the number of Vec3fs, is typically much larger than n, so we don't face problems in utilizing the full SIMD width. The wider the SIMD vector is, the more dot products we can compute per iteration.

As with our example from last time, the vectorized version of the algorithm is actually very similar to the plain scalar one. That's why vectorizing a data-parallel problem isn't really hard once you get the idea. The only difference is that we don't handle individual floats, but chunks of n floats:
  1. Load n x values and square them
  2. Load n y values and square them
  3. Load n z values and square them
  4. Add the n x, y and z values
  5. Store the n resulting dot products
In pseudo-F# the algorithm is now

for j = 0 to (m / 4) - 1 do // assume m % 4 = 0
    let i = 4 * j
    // load four x components [x0, x1, x2, x3] into a SIMD register
    let xs = u.[i .. i+3].x
    let ys = u.[i .. i+3].y
    let zs = u.[i .. i+3].z
    // component-wise multiplication and addition -> 4 DPs
    dp.[i .. i+3] <- xs * xs + ys * ys + zs * zs
Graphically:

The good part about this version is that it uses vector arithmetic instructions throughout, performing n times the work of its scalar counterpart, for a total of 20 FLOPs per iteration (with n = 4).

The bad part is the one labeled "scalar loads & shuffling" in the picture above: Before we can compute the n results, we have to gather n x, y and z values, but our data is still laid out in memory as an AoS, i.e. ordered [xyzxyzxyzxyzxyz...]. Loading logically successive [x0 x1 x2 x3 ...] values thus requires indexed/indirect load instructions (vector gather). SIMD instruction sets without gather/scatter support, like SSE2, have to load the data using n conventional scalar loads and appropriately place them in the vector registers, hurting performance considerably.

Approaching nirvana: Structure of arrays (SoA)

To avoid this drawback, maybe we should just store the data in a SIMD-friendly way, as a Structure of Arrays (SoA): Instead of modeling the vector field u as a number of Vec3f structs, we consider it to be an object of three float arrays:

type Vec3Field = { x: float32[];
                   y: float32[];
                   z: float32[] }
This is indeed very similar to switching from a column-major to a row-major matrix layout, because it changes our data layout from [xyzxyzxyz...] to [xxxxx...yyyyy...zzzzz...].

Observe how this implementation differs from the previous one only in that we index the components instead of the vectors:

for j = 0 to (m / 4) - 1 do // assume m % 4 = 0
    let i = 4 * j
    // load four x components [x0, x1, x2, x3] into a SIMD register
    let xs = u.x.[i .. i+3]
    let ys = u.y.[i .. i+3]
    let zs = u.z.[i .. i+3]
    // component-wise multiplication and addition -> 4 DPs
    dp.[i .. i+3] <- xs * xs + ys * ys + zs * zs
And yet it saves us a lot of loads and shuffling, resulting in pure, 100% vectorized SIMD code:
As this version really only replaces the scalar instructions with vector equivalents but executes 20 FLOPs in each iteration, we should now indeed get about 4x the performance of the scalar version and a much more favourable arithmetic-to-branch/load/store ratio.

In fact, once the hard part of the work is done (switching from AoS to an SoA data layout), many compilers can even automatically vectorize loops operating on that kind of data.

Conclusions

Vectorizing OO code takes some getting used to; the SoA view of the world may be a bit alien to non-APL programmers. If you can foresee the need for vectorizing your code at some point in the future, it may be wise to use an SoA layout from the get-go instead of having to rewrite half of your code later on. Experience certainly helps in identifying data-parallel problems; but as a rule of thumb, good candidates for extracting data parallelism are those problems where thinking in terms of operations that apply to a whole array/field of items comes naturally.

In part III we are going to discuss concrete implementations of the above concepts using F#, RyuJIT and System.Numerics.Vectors and compare the performance of the different versions—once they sorted out this issue.

Sunday, June 7, 2015

SIMD Fundamentals. Part I: From SISD to SIMD

For a long time, .NET developers didn't have to care about SIMD programming, as it was simply not available to them (except maybe for Mono's Mono.Simd). Only recently has Microsoft introduced a new JIT compiler for .NET, RyuJIT, which, in conjunction with a special library, System.Numerics.Vectors, offers access to the SIMD hardware of modern CPUs. The goal of this three-part series of articles is therefore to introduce the fundamental ideas of SIMD to .NET developers who want to make use of all the power today's CPUs provide.

Parallelism in modern CPUs

In recent years CPUs gained performance mainly by increasing parallelism on all levels of execution. The advent of x86 CPUs with multiple cores established true thread-level parallelism in the PC world. Solving the "multi-core problem", i.e. the question of how to distribute workloads between different threads in order to use all that parallel hardware goodness, ideally (semi-)automatically, suddenly became and continues to be one of the predominant goals of current hardware- and software-related research.

Years before the whole multi-core issue started, CPUs already utilized another kind of parallelism to increase performance: Instruction level parallelism (ILP) was exploited via pipelining, super-scalar and out-of-order execution and other advanced techniques. The nice thing about this kind of parallelism is that your average Joe Developer doesn't have to think about it; it's all handled by smart compilers and smart processors.

But then in 1997 Intel introduced the P55C and with it a new 64-bit-wide SIMD instruction set called MMX ("multimedia extension"; after all, "multimedia" was "the cloud" of the 90s). MMX made available a level of parallelism new to most PC developers: data level parallelism (DLP). Contrary to ILP, however, DLP requires specifically designed code to be of any use. This was true for MMX and it remains true for its successors like Intel's 256-bit-wide AVX2. Just as with multithreading, programmers need to understand how to use those capabilities properly in order to exploit the tremendous amounts of floating-point horsepower.

From SISD to SIMD

Whenever I learn new concepts, I like to first think of examples and generalize afterwards (inductive reasoning). In my experience that's true for most people, so let us therefore start with the "Hello, world!" of array programming: Suppose you have two floating point arrays, say xs and ys of length m and, for some reason, you want to add those arrays component-wise and store the result in an array zs. The conventional "scalar" or SISD (Single Instruction Single Data, see Flynn's taxonomy) way of doing this looks like this (C# syntax)

for (var i = 0; i < m; ++i) {
    zs[i] = xs[i] + ys[i];
}

Fetch two floats, add them up and store the result in zs[i]. Easy:
For this specific example, adding SIMD (Single Instruction Multiple Data) is almost trivial: For simplicity, let us further assume that m is evenly divisible by n, the vector lane width. Now instead of performing one addition per iteration, we add up xs and ys in chunks of n items, in pseudo-C# with an imagined array range expression syntax

for (var i = 0; i < m; i += n) {
    zs[i:(i+n-1)] = xs[i:(i+n-1)] + ys[i:(i+n-1)];
}

We fetch n items from xs and n items from ys using vector load instructions, add them up using a single vector add and store the n results in zs. Compared to the scalar version, we need to decode fewer instructions and perform n times the work in each iteration. Think of each SIMD register as a window, n values wide, into an arbitrarily long stream (array) of scalar values. Each SIMD instruction modifies n scalar values of that stream at once (that's where the speed-up comes from) and moves on to the next chunk of n values:
So SIMD really is just a form of array processing, where you think in terms of applying one operation (single instruction) to a lot of different elements (multiple data), loop-unrolling on steroids. And that's already really all you need to know about SIMD, basically.
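
For the curious, here is roughly what that pseudo-code looks like when written against the real System.Numerics.Vector<T> API (a sketch, assuming the array length is a multiple of Vector<float>.Count):

using System.Numerics;

static void AddArrays(float[] xs, float[] ys, float[] zs) {
    var n = Vector<float>.Count;           // SIMD lane width, e.g. 8 floats with AVX
    // assumes xs.Length == ys.Length == zs.Length and is a multiple of n
    for (var i = 0; i < xs.Length; i += n) {
        var x = new Vector<float>(xs, i);  // load n floats from xs
        var y = new Vector<float>(ys, i);  // load n floats from ys
        (x + y).CopyTo(zs, i);             // one vector add, store n results
    }
}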

Yet, this example is perhaps the most obvious data parallel case one could think of. It gets a whole lot more interesting once you add OOP and the data layout this paradigm propagates to the mix. We will look into the issue of vectorizing typical OOP code in the next part of this series.

Thursday, June 4, 2015

NativeInterop 2.4.0

NativeInterop v2.4.0 is ready and awaiting your download from NuGet. This version brings a revised version of Buffer.Copy tuned for .NET 4.6/RyuJIT: by using a 16-byte block copy, the JITter can generate movq (x86) or movdqu (x64) instructions for further improved performance. Enjoy!
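
To illustrate the technique, here is a minimal sketch of a 16-byte block copy in unsafe C#. This is only an illustration of the idea, not the actual NativeInterop implementation, and whether the JIT really emits a single 16-byte move depends on the struct layout and JIT version:

static unsafe class BlockCopySketch {
    // A 16-byte value type; assigning one of these copies 16 bytes at a time,
    // which the JIT can lower to a single wide move as described above.
    struct Block16 { public long Lo; public long Hi; }

    public static void Copy(byte* src, byte* dst, int count) {
        // copy whole 16-byte blocks
        for (; count >= 16; count -= 16, src += 16, dst += 16)
            *(Block16*)dst = *(Block16*)src;
        // copy any remaining tail bytes one by one
        for (; count > 0; --count)
            *dst++ = *src++;
    }
}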


Now I just have to figure out where the larger performance drop for large data sizes comes from compared to the other methods; memcpy somehow reaches approx. 20 GB/s beyond the L3 cache.

Btw., I have a further post on SIMD programming with .NET almost ready, but I can't publish it yet due to problems with the current version of System.Numerics.Vectors in combination with .NET 4.6/VS 2015 RC. Hope that'll get fixed soon!