Frank’s Scratchpad: July 2014

In my previous post I demonstrated how one can use the NativeIntrop library to easily implement file I/O for structured binary data formats, STL in this particular case. In this example, we used NativeInterop.Stream.ReadUnmanagedStructRange to read a sequence of bytes from a file as an array of STL triangle structs.

When implementing ReadUnmanagedStructRange I had these design goals in mind:

Ease-of-use (no user-side tinkering with unsafe code)
Portability (after all, NativeInterop is a portable class library)
Genericity (should work with any unmanaged data type)
Efficiency

Today's post concerns how I tried to fulfill the first three bullet points without sacrificing too much of the last one ("efficiency"/"performance").

Choices

Implementing ReadUnmanagedStructRange (and its dual, WriteUnmanagedStructRange) essentially requires a method to convert a stream of bytes to a (managed) array of the target type (which is supposed to be unmanaged/blittable). In context of reading binary STL, we would therefore need to somehow convert a byte[] to an STLTriangle[].

In a language like C that guarantees neither type nor memory safety we would simply treat the raw byte array (represented as a byte*) as a STLTriangle* and be done. And while that approach is possible in C# as well, it requires unsafe code at every point we want to access the STL data. Put differently: As there is no way to change the "runtime type" of a .NET array (certainly for good reasons!) the solution of our problem boils down to either...

"Manually" construct objects of the target type from the byte[] data or store them in the result array, or...
Unsafely copy the contents of the raw byte[] to an STLTrianlge[] ("memcpy"), without explicitly or implicitly creating intermediate/temporary objects

Both options can be implemented in .NET in multiple ways; for option (1) that is

BinaryReader.ReadXXX: read individual fields directly from the stream
BitConverter.ToXXX: convert small chunks of the byte[] to individual fields of the target data type
Buffer.BlockCopy: read multiple fields of homogenous type at once
Marshal.PtrToStructure: interpret larger chunks of the byte[] as complete STLTriangle structs

For option (2) ("memcpy" approach) we will consider

Simple, unsafe C# code (*dst++ = *src++)
Marshal.Copy
memcpy (via P/Invoke)
cpblk CIL instruction
NativeInterop.Buffer.Copy using a custom blocked memcpy implementation (ReadUnmanagedStrucRange uses this function under the hood)

For our sample scenario, we assume that the STL data is already available in-memory as a byte[]; otherwise we would only be measuring I/O transfer speed. We also don't want to measure GC performance; therefore the following implementations will use a pre-allocated result array ("tris") to store the re-interpreted STLTriangle data.

BinaryReader.ReadXXX

Probably the most straight-forward approach for parsing a binary file is using a BinaryReader to get the data for individual fields of the target data structure and then create the data structure from that data. An implementation for reading an STL file might look like the following bit of code:

public static void BinaryReaderRead(byte[] triBytes, STLTriangle[] tris) {
    using (var br = new BinaryReader(new MemoryStream(triBytes))) {
        for (int i = 0; i < tris.Length; ++i) {
            var normX = br.ReadSingle();
            var normY = br.ReadSingle();
            var normZ = br.ReadSingle();
            var aX = br.ReadSingle();
            var aY = br.ReadSingle();
            var aZ = br.ReadSingle();
            var bX = br.ReadSingle();
            var bY = br.ReadSingle();
            var bZ = br.ReadSingle();
            var cX = br.ReadSingle();
            var cY = br.ReadSingle();
            var cZ = br.ReadSingle();
            var abc = br.ReadUInt16();
            tris[i] = new STLTriangle(
                new STLVector(normX, normY, normZ),
                new STLVector(aX, aY, aZ),
                new STLVector(bX, bY, bZ),
                new STLVector(cX, cY, cZ),
                abc);
        }
    }
}

Note that I'm using a MemoryStream here to simulate reading from a Stream object while avoiding disk I/O.

BitConverter.ToXXX

I guess revealing upfront the BinaryReader approach as not being exactly the most efficient approach will be hardly surprising for my dear readers. The next optimization step could therefore be to read the whole file into a flat byte[] at once and extract the data using BitConverter:

public static void BitConverterTo(byte[] triBytes, STLTriangle[] tris) {
    for (int i = 0; i < tris.Length; ++i) {
        var offset = i * STLTriangle.Size;
        var normX = BitConverter.ToSingle(triBytes, offset);
        var normY = BitConverter.ToSingle(triBytes, offset + 4);
        var normZ = BitConverter.ToSingle(triBytes, offset + 8);
        var aX = BitConverter.ToSingle(triBytes, offset + 12);
        var aY = BitConverter.ToSingle(triBytes, offset + 16);
        var aZ = BitConverter.ToSingle(triBytes, offset + 20);
        var bX = BitConverter.ToSingle(triBytes, offset + 24);
        var bY = BitConverter.ToSingle(triBytes, offset + 28);
        var bZ = BitConverter.ToSingle(triBytes, offset + 32);
        var cX = BitConverter.ToSingle(triBytes, offset + 36);
        var cY = BitConverter.ToSingle(triBytes, offset + 40);
        var cZ = BitConverter.ToSingle(triBytes, offset + 44);
        var abc = BitConverter.ToUInt16(triBytes, offset + 48);
        tris[i] = new STLTriangle(
            new STLVector(normX, normY, normZ),
            new STLVector(aX, aY, aZ),
            new STLVector(bX, bY, bZ),
            new STLVector(cX, cY, cZ),
            abc);
    }
}

Here we convert chunks of four bytes to the single precision floating point fields of an STLTriangle, representing the components of the vertices and the normal vector. The last component is the 2-byte attribute count field.

Buffer.BlockCopy

We can further refine the previous approach by extracting not single floats from the byte[], but all vertex/normal data for each single triangle at once using Buffer.BlockCopy:

public static void BufferBlockCopy(byte[] triBytes, STLTriangle[] tris) {
    var coords = new float[12];
    for (int i = 0; i < tris.Length; ++i) {
        var offset = i * STLTriangle.Size;
        System.Buffer.BlockCopy(triBytes, offset, coords, 0, 48);
        var abc = BitConverter.ToUInt16(triBytes, offset + 48);
        tris[i] = new STLTriangle(
            new STLVector(coords[0], coords[1], coords[2]),
            new STLVector(coords[3], coords[4], coords[5]),
            new STLVector(coords[6], coords[7], coords[8]),
            new STLVector(coords[9], coords[10], coords[11]),
            abc);
    }
}

Unfortunately Buffer.BlockCopy can only handle arrays whose elements are of primitive type (in this case, we copy from a byte[] to a float[]). We can therfore not use this approach to copy all bytes at once from the byte[] to a STLTriangle[].

Marshal.PtrToStructure

The Marshal class provides a method PtrToStructure (or alternatively in generic form), which essentially reads an (unmanaged) struct from an arbitrary memory address:

public static unsafe void MarshalPtrToStructureGeneric(byte[] triBytes, STLTriangle[] tris) {
    var triType = typeof(STLTriangle);
    fixed (byte* pBytes = &triBytes[0])
    {
        for (int i = 0; i < tris.Length; ++i) {
            var offset = i * STLTriangle.Size;                    
            tris[i] = Marshal.PtrToStructure<STLTriangle>(new IntPtr(pBytes + offset));
        }
    }
}

dst++ = src++

Essentially the same as with Marshal.PtrToStructure can be achieved with a little bit of unsafe C#:

public static unsafe void UnsafePointerArith(byte[] triBytes, STLTriangle[] tris) {
    fixed (byte* pbuffer = triBytes)
    fixed (STLTriangle* ptris = tris)
    {
        var pSrc = (STLTriangle*)pbuffer;
        var pDst = ptris;
        for (int i = 0; i < tris.Length; ++i) {
            *pDst++ = *pSrc++;
        }
    }
}

Here we treat the (fixed) byte input array as an STLTriangle* which we can then dereference to store each triangle data in the result array of type STLTriangle[]. Note that you cannot implement this kind of operation in a generic way in C#, as C# does not allow to dereference pointers to objects of arbitrary type.

Marshal.Copy

The aformentioned unsafe C# approach already is a form of memcpy, albeit an unoptimized one (we copy blocks of 50 bytes at each iteration); furthermore it's specialized for copying from byte[] to STLTriangle[]. Marshal.Copy on the other hand is a little more generic as it can copy any array of primitive elements to an arbitrary memory location (and vice versa):

public static unsafe void ConvertBufferMarshalCopy(byte[] triBytes, STLTriangle[] tris) {
    fixed (STLTriangle* pTris = &tris[0]) {
        Marshal.Copy(triBytes, 0, new IntPtr(pTris), triBytes.Length);
    }
}

memcpy (P/Invoke)

As we already concluded that we essentially need an efficient way to copy one block of memory to some other location in memory, why not simply use the system-provided, highly optimized, memcpy implementation?

[DllImport("msvcrt.dll", EntryPoint = "memcpy", CallingConvention = CallingConvention.Cdecl, SetLastError = false)]
static extern IntPtr memcpy(IntPtr dest, IntPtr src, UIntPtr count);
 
public static unsafe void ConvertBufferMemcpy(byte[] triBytes, STLTriangle[] tris) {
    fixed (byte* src = triBytes)
    fixed (STLTriangle* dst = tris)
    {
        memcpy(new IntPtr(dst), new IntPtr(src), new UIntPtr((uint)triBytes.Length));
    }
}

Easy. Of course, this introduces a dependency on Microsoft's C runtime library msvcrt.dll and is thus not platform independent.

cpblk

Interestingly, the CIL provides a specialized instruction just for copying blocks of memory, namely cpblk. A few years back, Alexander Mutel used cpblk in his nice performance comparison of different memcpy methods for .NET. To my knowledge, this instruction is currently not directly accessible from either C# or F# (not sure about C++/CLI, though). Yet, we can generate the neccessary instructions using a bit of runtime code generation; here I'm using Kevin Montrose's excellent Sigil library:

// parameters: src, dst, length (bytes)
static Action<IntPtr, IntPtr, uint> CpBlk = GenerateCpBlk();
 
static Action<IntPtr, IntPtr, uint> GenerateCpBlk() {
    var emitter = Sigil.Emit<Action<IntPtr, IntPtr, uint>>.NewDynamicMethod("CopyBlockIL");
    // emit IL
    emitter.LoadArgument(1); // dest
    emitter.LoadArgument(0); // src
    emitter.LoadArgument(2); // len
    emitter.CopyBlock();
    emitter.Return();
    // compile to delegate
    return emitter.CreateDelegate();
}
 
public static unsafe void ConvertBufferCpBlk(byte[] triBytes, STLTriangle[] tris) {
    fixed (byte* src = triBytes)
    fixed (STLTriangle* dst = tris)
    {
        CpBlk(new IntPtr(src), new IntPtr(dst), (uint)triBytes.Length);
    }
}

While it's a bit clunky that code generation is required to get access to cpblk, the nice thing about it is that it's platform-independent: any CLI-compliant implementation must provide it. There is even hope that the next version of F# will come with an updated version of it's pointer intrinsics library that will also feature access to cpblk (cf. Additional intrinsics for the NativePtr module and the corresponding pull request).

NativeInterop.Buffer.Copy

My own little NativeInterop helper library also has tools for moving around stuff in memory. For instance, the Buffer module contains a method Copy that copies an input array of type T to an array of type U. Both T and U must be unmanaged types (Note: Neither the C# compiler nor the CLR can enforce this constraint!). Under the covers, Buffer.Copy(T[], U[]) creates an empty result array and then copies the input bytes using a custom block copy implementation to the result array. Using Buffer.Copy is simple:

public static void ConvertBufferBufferConvert(byte[] triBytes, STLTriangle[] tris) {
    NativeInterop.Buffer.Copy(triBytes, tris);
}

Unfortunately, I couldn't use any of the other mem-copy methods to implement Buffer.Copy as they are either not generic enough (Marshal.Copy, Buffer.BlockCopy), introduce dependencies on some native libraries (memcpy via P/Invoke) or require runtime-code generation or unsafe C# code, both of which isn't available for Portable Class Libraries like NativeInterop (cpblk, *dst++=*src++).

I therefore experimented with different memcpy algorithms (different block sizes, different degrees of loop-unrolling, with or without considering alignment...) and—for the time being—setteled with a comparatively simple, aligned qword copying mechanism that offers decent performance on x64 with the current x64 JIT.

Multi-Threading

For some of the methods we will also look at a multi-threaded variant. Essentially, all of those versions look similar to this (here: memcpy):

public static unsafe void ConvertBufferMemcpyMultiThreaded(byte[] triBytes, STLTriangle[] tris) {
    var threadcount = Environment.ProcessorCount;
    var chunksize = triBytes.Length / threadcount;
    var remainder = triBytes.Length % threadcount;
 
    fixed (byte* pBytes = &triBytes[0])
    fixed (STLTriangle* pTris = &tris[0])
    {
        var tasks = new Action[threadcount];
        for (var i = 0; i < threadcount - 1; ++i) {
            var offset = i * chunksize;
            var newSrc = new IntPtr(pBytes + offset);
            var newDst = new IntPtr((byte*)pTris + offset);
            tasks[i] = () => memcpy(newSrc, newDst, new UIntPtr((uint)chunksize));                    
        }
 
        var finalOffset = (threadcount - 1) * chunksize;
        var finalSrc = new IntPtr(pBytes + finalOffset);
        var finalDst = new IntPtr((byte*)pTris + finalOffset);
        tasks[threadcount - 1] = () => memcpy(finalSrc, finalDst, new UIntPtr((uint)(chunksize + remainder)));
 
        Parallel.Invoke(tasks);
    }
}

Given that copying (large) chunks of memory should be a bandwidth-limited problem, multi-threading shouldn't help much, at least for efficient implementations. Methods with a lot of overhead might profit a little more.

Benchmark

As a benchmark problem, I chose to create fake STL data from 2^6 up to 2^20 triangles. As each triangle struct is 50 bytes, the total amount of data transferred (read + write) varies between approx. 6 kB up to approx. 100 MB and should thus stress all cache levels.

Each method described above ran up to 10 000 times with a time-out of 5 s for each of the 15 problem sizes. To minimize interference from other processes, I set the process priority to "high" and a garbage collection is kicked off before each round.

System.Diagnostics.Stopwatch, which I used to measure transfer times, has a resolution of 300 ns on my system. For very small working sets and the fastest methods, that's already too coarse. Of course I could have measured the total time for, say, 10 000 runs and just compute the average. Yet, it turns out that even with sufficient warm-up, there are still enough outliers in the measured timings to skew the result. After some experimenting, I decided to use the mean of the values below the 10th percentile instead. That seems to be more stable than relying on the timings of individual runs and also excludes extreme outliers. Still, I wouldn't trust the measurements for < 2^8 (25 kB) too much.

I ran the benchmarks on an Intel Core i7-2600K @ 3.4 - 4.2 GHz and 32 GiB of DDR3-1600 RAM (PC3-12800, dual channel; peak theoretical transfer rate: 25 GiB/s) under Windows 8.1 Pro. Both the CLR's current x64 JIT compiler "JIT64" as well as RyuJIT (CTP 4) received the oportunity to show of their performance in the benchmarks.

Results & Analysis

First let's have a look at the results from JIT64 (click for full-size):

Ignoring the multithreaded versions for now (dashed lines), it is clear immediately that memcpy (P/Invoke) offers great overall performance for both the smallest and the largest data sets. Marshal.Copy and cpblk come as a close second. Unsafe C# (*dst++ = *src++) offers stable, but comparatively poor performance.

NativeInterop's Buffer.Copy doesn't even come close to the fastest methods for small data sets, but offers comparable performance for larger sets. Something in its implementation is generating way more overhead than neccessary... That "something" turns out to be the allocation of GCHandles for fixing the managed arrays: F# 3.1 doesn't support a "fixed" statement as C# does. To check whether that truly could be the source of the overhead, I implemented a version of the cpblk method that uses GCHandle.Alloc instead of fixed and—lo and behold—this slowed down cpblk to approximately the same speed as Buffer.Copy (light-blue).

Unsurprisingly, the multithreaded versions come with quite some overhead, so they can't compete for< 500 kB. Only for problem sizes that fit into Sandy Bridge's L3 cache we see some nice scaling. Problem sizes beyond the L3 cache size are again limited by the system's DRAM bandwidth and don't profit from multi-threading.

The performance of all other (non-memcpy-like) methods is abysmal at best. We best just skip them...

How does this picture change once we switch to an (early!) preview of the upcoming RyuJIT compiler? As it turns out, not by much; yet there is still happening something interesting:

Suddenly, "*dst++ = *src++" has become one of the fastest implementations. What's going on here? Let's have a look at the generated assembly for *dst++ = *src++; first let's see what JIT64 produces:

lea rdx,[r9+r8]
lea rax,[r9+32h]
mov rcx, r9
mov r9, rax
mov rax, qword ptr [rcx]
mov qword ptr [rsp+20h], rax
mov rax, qword ptr [rcx+8]
mov qword ptr [rsp+28h], rax
mov rax, qword ptr [rcx+10h]
mov qword ptr [rsp+30h], rax
mov rax, qword ptr [rcx+18h]
mov qword ptr [rsp+38h], rax
mov rax, qword ptr [rcx+20h]
mov qword ptr [rsp+40h], rax
mov rax, qword ptr [rcx+28h]
mov qword ptr [rsp+48h], rax
mov ax, word ptr [rcx+30h]
mov word ptr [rsp+50h], ax
lea rcx, [rsp+20h]
mov rax, qword ptr [rcx]
mov qword ptr [rdx], rax
mov rax, qword ptr [rcx+8]
mov qword ptr [rdx+8], rax
mov rax, qword ptr [rcx+10h]
mov qword ptr [rdx+10h], rax
mov rax, qword ptr [rcx+18h]
mov qword ptr [rdx+18h], rax
mov rax, qword ptr [rcx+20h]
mov qword ptr [rdx+20h], rax
mov rax, qword ptr [rcx+28h]
mov qword ptr [rdx+28h], rax
mov ax, word ptr [rcx+30h]
mov word ptr [rdx+30h], ax
inc r11d
cmp r11d, r10d
jl 00007FFE57865410

Hm, so it first moves data from [rcx] (offset into the original byte[]) to some temporary at [rsp+20h] and then copies that to the destination [rdx] (offset into the result STLTriangle[]). That's obviously one step that's not neccessary and where RyuJIT can improve upon (there are 28 mov instructions, but only 14 qword movs + 2 word movs are required to move 50 bytes from memory to another memory location). So, what does RyuJIT really do?

lea r9, [rcx+32h]
lea r10, [rax+32h]
movdqu xmm0, xmmword ptr [rax]
movdqu xmmword ptr [rcx], xmm0
movdqu xmm0, xmmword ptr [rax+10h]
movdqu xmmword ptr [rcx+10h], xmm0
movdqu xmm0, xmmword ptr [rax+20h]
movdqu xmmword ptr [rcx+20h], xmm0
mov r11w, word ptr [rax+30h]
mov word ptr [rcx+30h], r11w
inc r8d
cmp edx, r8d
mov rax, r10
mov rcx, r9
jg 00007FFE57884CE8

We see that RyuJIT is able to not only remove the unrequired intermediate load/store ops, but in addition it also issues (unaligned) SIMD mov instructions to copy 16 bytes at a time. This optimization also works for NativeInterop.Buffer.Copy: Once I increased the blocksize to at least 16 bytes (or larger) performance became identical to that of *dst++=*src++ in the RyuJIT case (aside from the overhead for small problem sizes due to GCHandle allocations).

Conclusion

When your only concern is reading from/writing to a file on a comparatively slow disk, all of the above shouldn't bother you much. Most of the methods will be simply "fast enough". As soon as you start to move around data that's already in memory (e.g. cached) or maybe it resides on a very fast SSD, choosing the right method however becomes critical.

While NativeInterop.Buffer.Copy isn't the fastest of the bunch (yet), it's performance is competitive for medium to large problem sizes and in any case orders of magnitudes faster then the usual BinaryReader-approach. If you want convenience, portability and genericity while maintaining reasonable performance, NativeInterop provides a good all-round solution, I believe. If you want raw speed, though, use memcpy (or whatever may be available on your platform).

Method	Convenience	Portability	Safety	Efficiency	Genericity
BinaryReader.ReadXXX	+	+	+	-	-
BitConverter.ToXXX	-	+	o	-	-
Buffer.BlockCopy	-	+	o	-	-
Marshal.PtrToStructure	-	+	o	-	+
dst++ = src++	o	+	-	o	o
Marshal.Copy	+	+	-	+	o
memcpy	o	-	-	+	+
cpblk	-	+	-	+	+
Buffer.Convert/Copy, Stream.ReadUnmanagedStructRange	+	+	-	o/+	+

Frank’s Scratchpad

Wednesday, July 16, 2014

Methods for Reading Structured Binary Data: Benchmarks & Comparisons