
Friday, August 28, 2015

SIMD Fundamentals. Part III: Implementation & Benchmarks

This is the third part of my series on SIMD Fundamentals, targeted at .NET developers. You may read the other two parts here:
While the previous two parts were more focused on theory and concepts, we are now going to actually get our hands dirty and write some code to see how different approaches to SIMD processing compare in practice, both from a performance and an ease-of-implementation point of view.

Prelude

In Part II I used F# (pseudo-)code to illustrate the difference between the AoS (array of structures) and SoA (structure of arrays) approaches. That's why I thought using F# for implementing the benchmarks might be a good idea. So I installed Visual Studio 2015, created a new F# console application project and installed System.Numerics.Vectors in its most recent 4.1.0 incarnation via NuGet. Yet, when I tried to use System.Numerics.Vector<T>, IntelliSense wanted to convince me there was no such thing:
Maybe just a problem with the F# language service? I tried to run this little hello world sample

but that didn't work either, because it references the wrong version of System.Numerics.Vectors:

I didn't have any luck manually replacing the "C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.6\System.Numerics.Vectors.dll" reference with the one delivered by the System.Numerics.Vectors NuGet package either.

For now, I thus resorted to using C# as an implementation language instead.

Managed Implementations

For the upcoming benchmarks we will use the problem stated in the previous post: compute the squared L2 norm of a set of 3-vectors. We want to compare the five approaches introduced there:
  • Scalar AoS
  • AoS vectorized across each 3-vector
  • On-the-fly vectorization (converting AoS to SoA)
  • Scalar SoA
  • Vectorized SoA
To avoid any memory bottlenecks and keep everything cached, we will use a comparatively small set of vectors. I found that an array of 1024 to 2048 Vec3fs yielded peak performance. Note that this has the nice side effect of being a multiple of the AVX vector lane width (32 bytes, or 8 single-precision floating-point values). To get a reliable estimate of the average performance, we repeat the procedure a large number of times: for the benchmarks, all procedures had to compute a total of 32 · 2^30 (roughly 34 billion) dot products.
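To make the setup concrete, here is a sketch of what the measurement harness could look like (the constants and the kernel name are assumptions for illustration, referring to the scalar AoS sketch below, not the actual benchmark code):

using System;
using System.Diagnostics;
using System.Numerics;

const int vectorLen = 2048;               // a multiple of the AVX lane width
const long totalDots = 32L * (1L << 30);  // 32 · 2^30 dot products in total
int reps = (int)(totalDots / vectorLen);

var vectors = new Vector3[vectorLen];
var results = new float[vectorLen];

var sw = Stopwatch.StartNew();
Dot3ScalarAoS(vectors, results, reps);    // one of the kernels sketched below
sw.Stop();

// one dot3 product costs 5 flops (3 MUL + 2 ADD)
double gflops = 5.0 * totalDots / sw.Elapsed.TotalSeconds / 1e9;
Console.WriteLine(gflops.ToString("F1") + " GFLOP/s");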

For further details, you can consult the full benchmark code here. In case you want to run it yourself and change some parameters, make sure that vectorLen is a multiple of laneWidth, as the remaining code assumes.

All of the following benchmark code was compiled and run on an Intel Core i5-4570 (Haswell) with 16 GB of DDR3 SDRAM on Windows 10 Pro. The managed implementations were developed using Visual Studio 2015 on .NET 4.6 and targeted x64 (release build, "RyuJIT").

Scalar AoS

This version is probably the easiest to understand and, as long as you don't intend to vectorize, a pretty reasonable one: We have an array of Vector3 structures (here we simply use the one provided by System.Numerics.Vectors) and compute the resulting dot products vector by vector:
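A rough sketch of that kernel (identifier names like vectors, results and reps are my assumptions):

using System.Numerics; // Vector3 comes from the System.Numerics.Vectors package

// one squared L2 norm per inner iteration
static void Dot3ScalarAoS(Vector3[] vectors, float[] results, int reps)
{
    for (int j = 0; j < reps; j++)
        for (int i = 0; i < vectors.Length; i++)
        {
            var v = vectors[i];
            results[i] = v.X * v.X + v.Y * v.Y + v.Z * v.Z;
        }
}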

The outer loop over j is there to ensure the total number of computed dot products is 32 billion. For the inner loop, the JIT compiler generates the following machine code:

As expected, it uses scalar AVX instructions to compute the dot products (VMULSS, VADDSS).

AoS vectorized across each 3-vector

In this version, we still compute a single dot product per iteration, but we compute the squares of the components at once and then determine the (horizontal) sum of those squared components:
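In sketch form (again, the names are assumptions):

using System.Numerics;

// Vector3.Dot squares all three components at once and horizontally
// sums them, yielding one dot product per iteration
static void Dot3DotAoS(Vector3[] vectors, float[] results, int reps)
{
    for (int j = 0; j < reps; j++)
        for (int i = 0; i < vectors.Length; i++)
        {
            var v = vectors[i];
            results[i] = Vector3.Dot(v, v);
        }
}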

When using Vector3.Dot, the compiler emits code that uses the DPPS (SSE 4.1) or VDPPS (AVX) instructions:

For some reason, it's loading the data twice, first into XMM0 and then into XMM1 (VMOVSS, VMOVSD, VSHUFPS).

On-the-fly vectorization

Because AoS isn't a data layout well suited for vectorization, last time we came up with the idea of reordering the data on the fly. Vector gather instructions would help with that, but System.Numerics.Vector<T> only supports loading consecutive elements for now. The only managed solution I could come up with is to first manually gather the required data into temporary arrays and then create the vector instances from these temporary data structures.
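A sketch of that manual gathering (all names are assumptions):

using System.Numerics;

// gather the X, Y and Z components of laneWidth consecutive Vec3s into
// temporary arrays, then build Vector<float> instances from them
static void Dot3OnTheFly(Vector3[] vectors, float[] results, int reps)
{
    int laneWidth = Vector<float>.Count; // 8 floats with AVX enabled
    var xs = new float[laneWidth];
    var ys = new float[laneWidth];
    var zs = new float[laneWidth];

    for (int j = 0; j < reps; j++)
        for (int i = 0; i < vectors.Length; i += laneWidth)
        {
            for (int k = 0; k < laneWidth; k++) // the manual "gather"
            {
                xs[k] = vectors[i + k].X;
                ys[k] = vectors[i + k].Y;
                zs[k] = vectors[i + k].Z;
            }
            var x = new Vector<float>(xs);
            var y = new Vector<float>(ys);
            var z = new Vector<float>(zs);
            (x * x + y * y + z * z).CopyTo(results, i);
        }
}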

That works, in principle, meaning that the compiler can now emit VMULPS and VADDPS instructions to compute 8 dot products at once. Yet, because the JIT compiler doesn't employ VGATHERDPS, all this gathering becomes quite cumbersome:

As you can probably imagine, this code isn't exactly a candidate for the world's fastest dot3 product implementation...

Scalar SoA

Let's move on to a proper SoA data layout. The scalar version is similar to the scalar AoS version, only that we now index the components instead of the array of vectors:
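A rough sketch (xs, ys and zs are assumed names for the three component arrays):

// SoA layout: the components live in three separate arrays
static void Dot3ScalarSoA(float[] xs, float[] ys, float[] zs,
                          float[] results, int reps)
{
    for (int j = 0; j < reps; j++)
        for (int i = 0; i < xs.Length; i++)
            results[i] = xs[i] * xs[i] + ys[i] * ys[i] + zs[i] * zs[i];
}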

The resulting machine code is likewise similar to scalar AoS:

Vectorized SoA

The SoA layout makes it easy to use vector arithmetic to compute 8 dot products at once:
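In sketch form (names are again assumptions):

using System.Numerics;

// load 8 consecutive X, Y and Z components each and compute
// 8 dot products per iteration
static void Dot3VectorizedSoA(float[] xs, float[] ys, float[] zs,
                              float[] results, int reps)
{
    int laneWidth = Vector<float>.Count;
    for (int j = 0; j < reps; j++)
        for (int i = 0; i < xs.Length; i += laneWidth)
        {
            var x = new Vector<float>(xs, i);
            var y = new Vector<float>(ys, i);
            var z = new Vector<float>(zs, i);
            (x * x + y * y + z * z).CopyTo(results, i);
        }
}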

This is what the compiler makes of it:

Nice, but sprinkled with what appear to be range checks. I also wonder why it emits VMOVUPD instead of VMOVUPS instructions.

Unmanaged Implementations

After implementing and running the above variants in C#, I figured it would be useful to have something to compare the results to. Thus, I ported the benchmark code to C++ to see what the Visual C++ optimizer, its auto-vectorizer and SIMD intrinsics can do and how close we can get to the theoretical peak performance of the Haswell CPU. For the "native" implementation I used the Visual C++ 2015 compiler with the following flags:
/GS- /GL /W3 /Gy /Zi /Gm- /Ox /Ob2 /Zc:inline /fp:fast /WX- /Zc:forScope /arch:AVX2 /Gd /Oy /Oi /MD /Ot

Scalar AoS

Again, the code for this version is pretty straightforward:

In case you wonder about the inner loop construction: while (i--) turned out to result in slightly faster code than a more traditional for loop.

No surprises regarding the machine code, either:

AoS vectorized across each 3-vector

Let's use the DPPS instruction via intrinsics:

Notice the little trick of directly loading four consecutive floats instead of performing scalar loads and shuffling. Strictly speaking, this might go wrong for the last element of the array, as it reads beyond the allocated memory... In reality you'd handle that special case separately (or simply allocate a few more bytes). The corresponding machine code is really compact:

On-the-fly vectorization

In contrast to the managed version, we can now employ AVX2's vector gather instructions to load eight values of each of the three components into YMM registers:

_mm256_fmadd_ps results in FMA3 (fused multiply add) instructions, combining multiplication and addition/accumulation in one instruction:

Scalar SoA

Auto-vectorized SoA

In order for the auto-vectorizer to kick in, we need to use a for-loop for the inner iteration:

Now the compiler even generates FMA instructions:

Vectorized SoA

Of course we can also vectorize manually by using intrinsics:

Vectorized SoA using FMA

This is the same as above, but it additionally makes use of FMA:

Results

The following figure displays the performance in GFLOP/s of the different versions. The dashed line is at 51.2 GFLOP/s, the theoretical peak performance of a single Haswell core (single precision):
First of all, neither the AoS variants nor the scalar SoA version come anywhere close to the vectorized SoA versions. Second, any attempt at accelerating the original AoS version either failed (C#) or provided only insignificant performance gains (C++). Even vector gather can't save the day and in fact further impairs performance. In any event, the gains don't justify the more complicated code.

If you really need the performance SIMD can provide, you have to switch to an SoA layout: While Visual C++'s auto-vectorizer may relieve you of writing SIMD intrinsics directly, it still requires SIMD-friendly—that is: SoA—code. As long as it works, it provides the most accessible way of writing high-performance code. The second-best way, from a usability standpoint, is probably C# with System.Numerics.Vectors, which enables (explicit) SIMD programming via a comparatively easy-to-use interface.

Yet, the plot above also shows that none of the managed solutions can really keep up with any of the vectorized C++ versions. One reason for that is the inferior code generation of the JIT compiler compared to the C++ optimizer; other reasons are intrinsic to the managed programming model (null-pointer checks, range checks). System.Numerics.Vectors is also far from complete: for instance, there is no support for FMA or scatter/gather operations. A "Vector8" type could help treat a float[] as AVX-sized chunks.

Conclusions

Want speed? Use a vectorized SoA approach. Want more speed? Use C++. That's what I learned from this little investigation. That, and that until we have an AI-driven optimizing compiler smart enough to understand what we are trying to achieve and to automatically transform our code into its most efficient form, writing high-performance code will remain both an art and hard work for the time being.

Thursday, June 4, 2015

NativeInterop 2.4.0

NativeInterop v2.4.0 is ready and awaiting your download from NuGet. This version brings a revised Buffer.Copy tuned for .NET 4.6/RyuJIT: by using a 16-byte block copy, the JITter can generate movq (x86) or movdqu (x64) instructions for further improved performance. Enjoy!
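The core idea, as a rough sketch (not the actual NativeInterop implementation): copy 16-byte blocks via struct assignment, so the JIT can emit wide moves instead of byte-wise copies.

using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
struct Block16 { public long A, B; } // 16 bytes

static unsafe void CopyByBlocks(byte* src, byte* dst, int byteCount)
{
    int blocks = byteCount / 16;
    var pSrc = (Block16*)src;
    var pDst = (Block16*)dst;
    for (int i = 0; i < blocks; i++)
        pDst[i] = pSrc[i];               // one 16-byte block per assignment
    for (int i = blocks * 16; i < byteCount; i++)
        dst[i] = src[i];                 // copy the remaining tail bytes
}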


Now I just have to figure out where the larger performance drop for large data sizes comes from, compared to the other methods: memcpy somehow reaches approx. 20 GB/s beyond the L3 cache.

By the way, I have a further post on SIMD programming with .NET almost ready, but I can't publish it yet due to problems with the current version of System.Numerics.Vectors in combination with .NET 4.6/VS 2015 RC. I hope that'll get fixed soon!

Monday, May 19, 2014

Binary STL I/O using NativeInterop.Stream

There are essentially three common ways of reading/writing structured binary data of some user-defined type T (read: "files of records") from/to files in C# and .NET in general:
  1. Use System.IO.BinaryReader/BinaryWriter and their ReadXXX/Write methods to read/write individual fields of the data type T in question.
  2. Use one of the System.Runtime.InteropServices.Marshal.PtrToStructure/StructureToPtr methods to convert between (pinned) byte arrays (which can be written to and read from a System.IO.Stream) and non-generic value types or "formatted class types" via .NET's COM marshalling infrastructure.
  3. Use a little bit of unsafe code and cast a byte* pointing at the first entry of a pinned byte buffer to a T*, so structs of that type T can be written/read by simply dereferencing a pointer.
Generally, the whole task becomes a lot easier when T is an unmanaged type (required for option #3 to work properly). Unmanaged types are either primitive types or value types (struct in C#) composed of only unmanaged types. See the recursion in the definition? What that essentially boils down to is that an unmanaged type may be composed of arbitrarily nested structs, as long as no reference type is involved at any level.

Sadly, C# cannot constrain type parameters to unmanaged types and, as a consequence, does not support dereferencing generic pointer types. (In C# there is only the struct constraint, but that is not sufficient as a struct might contain fields of reference type.) What that means is that option #3 from above only works for concrete types.
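For a concrete type, option #3 boils down to something like this (Record is a hypothetical example type, not part of any library):

using System.IO;

struct Record { public int Id; public double Value; } // hypothetical record type

static unsafe Record ReadRecord(Stream stream)
{
    var buffer = new byte[sizeof(Record)];
    stream.Read(buffer, 0, buffer.Length);
    fixed (byte* p = buffer)   // pin the buffer...
        return *(Record*)p;    // ...and dereference the reinterpreted pointer
}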

Because F# can constrain a type parameter to be an unmanaged type, I used F# to build the NativeInterop library that—at its core—exposes the possibility to (unsafely!) dereference generic pointers in C#. This in turn enabled the implementation of some generic extension methods (NativeInterop ≥ v1.4.0) for System.IO.Stream that provide easy*-to-use and efficient** methods for reading and writing structured binary files.

A word of warning: the F# unmanaged constraint does not surface in the NativeInterop API when used from C# (and probably VB.NET). Thus the user of NativeInterop has to make sure that their data types are truly unmanaged! If that is not the case, NativeInterop may produce arbitrary garbage!

Example: Writing and Reading Binary STL files

Let's look at a simple example of how to use the aforementioned library methods to handle a simple structured binary file format: STL files essentially contain a description of a triangle mesh (triangulated surface). The exact format of the binary version (there is also an ASCII variant) is described on Wikipedia:
  • An 80-byte ASCII header, which may contain some descriptive string
  • The number of triangles in the mesh as an unsigned 32-bit integer
  • For each triangle there is one record of the following format:
    • A normal vector made up of three single-precision floating-point numbers
    • Three vertices, each made up of three single-precision floating-point numbers
    • A field called "Attribute byte count" (a 16-bit unsigned integer); this should usually be zero, but some software may interpret it as color information.

Modeling STL Records in C#

In this example, we will only explicitly model the triangle records. The header information is so simple that it's easier to just directly write out an 80-byte ASCII string.

To represent the triangle information, we use the following user-defined unmanaged type(s):

[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct STLVector
{
    public readonly float X;
    public readonly float Y;
    public readonly float Z;
 
    public STLVector(float x, float y, float z) {
        this.X = x;
        this.Y = y;
        this.Z = z;
    }
}
 
[StructLayout(LayoutKind.Sequential, Pack = 1)]    
struct STLTriangle
{
    // 4 * 3 * 4 byte + 2 byte = 50 byte
    public readonly STLVector Normal;
    public readonly STLVector A;
    public readonly STLVector B;
    public readonly STLVector C;
    public readonly ushort AttributeByteCount;
 
    public STLTriangle(
        STLVector normalVec, 
        STLVector vertex1, 
        STLVector vertex2, 
        STLVector vertex3, 
        ushort attr = 0) 
    {
        Normal = normalVec;
        A = vertex1;
        B = vertex2;
        C = vertex3;
        AttributeByteCount = attr;            
    }
}

Defining a Test Geometry

For testing purposes, we can now define a super-simple triangle mesh, a single tetrahedron:

// tetrahedron, vertex order: right hand rule
var mesh = new STLTriangle[] {
    new STLTriangle(new STLVector(0, 0, 0),
                    new STLVector(0, 0, 0),
                    new STLVector(0, 1, 0),
                    new STLVector(1, 0, 0)),
    new STLTriangle(new STLVector(0, 0, 0),
                    new STLVector(0, 0, 0),
                    new STLVector(0, 0, 1),
                    new STLVector(0, 1, 0)),
    new STLTriangle(new STLVector(0, 0, 0),
                    new STLVector(0, 0, 0),
                    new STLVector(0, 0, 1),
                    new STLVector(1, 0, 0)),
    new STLTriangle(new STLVector(0, 0, 0),
                    new STLVector(0, 1, 0),
                    new STLVector(0, 0, 1),
                    new STLVector(1, 0, 0)),
};

We leave the normals at zero: most software will automatically derive correct surface normals, provided the vertices of each face are specified following the right-hand rule (i.e. enumerated counter-clockwise when looking at the face from the outside).

Writing STL

Now it's really straightforward to generate a valid STL file from the available data:

using (var bw = new BinaryWriter(File.OpenWrite(filename), Encoding.ASCII)) {
    // Encode the header string as ASCII and put it into an 80-byte buffer
    var headerString = "Tetrahedron";
    var header = new byte[80];
    Encoding.ASCII.GetBytes(headerString, 0, headerString.Length, header, 0);
    bw.Write(header);
    // write #triangles
    bw.Write(mesh.Length);
    // use extension method from NativeInterop.Stream to write out the mesh data
    bw.BaseStream.WriteUnmanagedStructRange(mesh);
}

And here we have it, our, ahem, "beautiful" tetrahedron rendered in MeshLab:


Note how all the surface normals are sticking outward.

Reading STL

By using the ReadStructRange<T> extension method, reading binary STL data is just as simple:

string header;
STLTriangle[] mesh;
 
using (var br = new BinaryReader(File.OpenRead(filename), Encoding.ASCII)) {
    header = Encoding.ASCII.GetString(br.ReadBytes(80));
    var triCount = br.ReadUInt32();
    mesh = br.BaseStream.ReadUnmanagedStructRange<STLTriangle>((int)triCount);
} 

Conclusion

Reading and writing simple structured binary data is easy using NativeInterop.Stream. Get it now from NuGet! For reporting bugs or suggesting new features, please use the BitBucket issue tracker.

*Easy, because the user of the library doesn't have to fiddle with unsafe code.
**Efficient, because under the hood it boils down to a generic version of option #3 with zero marshalling overhead.

Friday, April 11, 2014

NativeInterop is live on NuGet

Today I released a first version of my NativeInterop package on NuGet. You can find a description of the purpose and usage of the package as well as the source code on BitBucket.

The motivation to build and release this package really evolved from two practical issues I encountered when building performance-critical C# code:

  1. Creating a native, high-performance generic array data structure in C# seems to be impossible (see A minimalistic native 64 bit array ...).
  2. Reading structured binary data from a byte[] requires some ugly hacks to get both decent performance and genericity (see e.g. Reading Unmanaged Data Into Structures).
The reason for both of these issues is the fact that C# lacks an "unmanaged" type constraint, and thus you cannot express something like
static unsafe T Read<T>(T* p) {
    return *p;
}
But in F#, you can; you'd simply encode this as
open NativeInterop
let pVal = NativePtr.read ptr
where ptr is of type nativeptr<'T> and 'T is constrained to unmanaged types.

The performance offered by the NativeInterop package should be on par with non-generic unsafe C# code. The NativeInterop package also contains an implementation of NativeArray64, but this time without using inline IL. It turned out that in newer versions of the .NET framework, the AGUs are utilized correctly for the address (offset) computation (instead of emitting IMUL instructions): Calling NativePtr.set<'T>/get<'T>/set64<'T>/get64<'T> (or NativePtr.Get/Set or IntPtr.Get/Set or NativePtr.get_Item/set_Item, respectively) should all result in the generation of a single mov instruction.

Sunday, April 25, 2010

A minimalistic native 64 bit array implementation for .NET (updated code)

If you ever felt the need to process huge amounts of data with an algorithm implemented on .NET/the CLR, you've surely run into the 2^31-items limit of the CLR's current array implementation, which only supports Int32 array indices (this also affects other collections like List<T>, as those internally use standard arrays for storage).
You can try to circumvent this limitation by implementing your own array-like data structure, either by emulating contiguous storage via a collection of standard .NET arrays (partition your data into chunks of at most 2^31 items each, as sketched below), or by using native APIs and some evil pointer arithmetic to get maximum performance.
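The chunking variant might look roughly like this (a sketch; ChunkedArray and its chunk size are made up for illustration):

using System;

// emulate one huge array with a jagged array of fixed-size chunks
class ChunkedArray<T>
{
    const long ChunkSize = 1L << 24; // items per chunk (example value)
    readonly T[][] chunks;

    public ChunkedArray(long length)
    {
        long chunkCount = (length + ChunkSize - 1) / ChunkSize;
        chunks = new T[chunkCount][];
        for (long i = 0; i < chunkCount; i++)
            chunks[i] = new T[Math.Min(ChunkSize, length - i * ChunkSize)];
    }

    public T this[long index]
    {
        get { return chunks[index / ChunkSize][index % ChunkSize]; }
        set { chunks[index / ChunkSize][index % ChunkSize] = value; }
    }
}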
A while ago I tried to implement the latter approach in C#, which isn't a big deal: it's only a matter of a few unsafe blocks for pointer arithmetic and a call to Marshal.AllocHGlobal for allocating memory on the unmanaged heap. However, when I tried to make that custom collection generic, I ran into an unsolvable problem:
public unsafe T this[long index]
{
    get
    {                
        return *((T*)pBase + index);
    }
    set
    {
        *((T*)pBase + index) = value;
    }
}

This code does not compile. The reason is that there is no way to tell the C# compiler that T shall be constrained to unmanaged types.
Interestingly, F# 2.0 does feature such a constraint! This is how a minimalistic F# implementation of such a native 64-bit array could look:
namespace NativeTools
 
#nowarn "9"
#nowarn "42"
 
open System
open System.Runtime
open Microsoft.FSharp.NativeInterop
 
module internal PointerArithmetic =
    [<CompiledName("AddIntPtrToIntPtr")>]
    [<Unverifiable>]
    let inline addNativeInt (x: nativeptr<'T>) (n: nativeint) : nativeptr<'T> = 
        (NativePtr.toNativeInt x) + n * (# "sizeof !0" type('T) : nativeint #) |> NativePtr.ofNativeInt
    
    // "reinterpret_cast<IntPtr>(x)"... EVIL!
    [<CompiledName("Int64ToIntPtr")>]
    [<Unverifiable>]
    let inline int64ToNativeint (x: int64) = (# "" x : nativeint #)
 
    [<CompiledName("AddInt64ToIntPtr")>]
    [<Unverifiable>]
    let inline addInt64 (x: nativeptr<'a>) (o: int64) : nativeptr<'a> = addNativeInt x (int64ToNativeint o)
    
[<Sealed>]
type NativeArray64<'T when 'T: unmanaged>(length: int64) =
    let itemSize: int64 = (int64)(InteropServices.Marshal.SizeOf(typeof<'T>))
    let mutable isDisposed = false
    let allocatedBytes = length * itemSize
    let blob = InteropServices.Marshal.AllocHGlobal(nativeint allocatedBytes)
    let pBlobBase: nativeptr<'T> = NativePtr.ofNativeInt blob
    let disposeLock = new Object()
 
    member this.Length = length
    member this.BaseAddress = pBlobBase
    member this.ItemSize = itemSize
    member this.IsDisposed = isDisposed
    member this.AllocatedBytes = allocatedBytes
    
    member private this.Free () =
        lock disposeLock (fun () ->
            if isDisposed
                then ()
                else InteropServices.Marshal.FreeHGlobal blob
                     isDisposed <- true
        )
           
    member this.Item
        with get (idx: int64) =
                        NativePtr.read (PointerArithmetic.addInt64 pBlobBase idx)                    
        and  set (idx: int64) (value: 'T) =
                        NativePtr.write (PointerArithmetic.addInt64 pBlobBase idx) value
        
    member private this.Items = seq {
            for i in 0L .. length - 1L do
                yield this.[i]
        }
 
    override this.Finalize () = this.Free()
    
    interface IDisposable with
        member this.Dispose () =
            GC.SuppressFinalize this
            this.Free()
 
    interface Collections.Generic.IEnumerable<'T> with
        member this.GetEnumerator () : Collections.Generic.IEnumerator<'T> =
            this.Items.GetEnumerator()
        member this.GetEnumerator () : Collections.IEnumerator =
            this.Items.GetEnumerator() :> Collections.IEnumerator

UPDATE 2010-04-25: Removed a few bugs.

You can use this data structure in your C# code like a normal array:
var length = 8L * 1024L * 1024L * 1024L;

// allocate a byte array of 8 GiB
using (var arr = new NativeTools.NativeArray64<byte>(length))
{
    arr[0] = 123;
    arr[length - 1] = 222;
    Console.WriteLine("Allocated " + arr.AllocatedBytes);
}
// auto-disposed ...