
Wednesday, July 16, 2014

Methods for Reading Structured Binary Data: Benchmarks & Comparisons

In my previous post I demonstrated how one can use the NativeInterop library to easily implement file I/O for structured binary data formats, STL in this particular case. In this example, we used NativeInterop.Stream.ReadUnmanagedStructRange to read a sequence of bytes from a file as an array of STL triangle structs.

When implementing ReadUnmanagedStructRange I had these design goals in mind:
  • Ease-of-use (no user-side tinkering with unsafe code)
  • Portability (after all, NativeInterop is a portable class library)
  • Genericity (should work with any unmanaged data type)
  • Efficiency
Today's post concerns how I tried to fulfill the first three bullet points without sacrificing too much of the last one ("efficiency"/"performance").

Choices

Implementing ReadUnmanagedStructRange (and its dual, WriteUnmanagedStructRange) essentially requires a method to convert a stream of bytes to a (managed) array of the target type (which is supposed to be unmanaged/blittable). In the context of reading binary STL, we therefore need to somehow convert a byte[] to an STLTriangle[].
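For reference, here is a sketch of what the blittable target types might look like (the actual definitions are in the previous post; the exact member names here are illustrative). The Pack = 1 layout is essential: without it, the CLR would pad STLTriangle to 52 bytes instead of the 50 bytes mandated by the binary STL format.

```csharp
using System.Runtime.InteropServices;

// Sketch of the blittable STL types; Pack = 1 prevents padding so that
// sizeof(STLTriangle) matches the on-disk record size of 50 bytes.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct STLVector {
    public float X, Y, Z;
}

[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct STLTriangle {
    public const int Size = 50; // 12 floats (48 bytes) + 1 ushort (2 bytes)

    public STLVector Normal;
    public STLVector A, B, C;
    public ushort AttributeByteCount;

    public STLTriangle(STLVector n, STLVector a, STLVector b, STLVector c, ushort abc) {
        Normal = n; A = a; B = b; C = c; AttributeByteCount = abc;
    }
}
```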

In a language like C that guarantees neither type nor memory safety we would simply treat the raw byte array (represented as a byte*) as an STLTriangle* and be done. And while that approach is possible in C# as well, it requires unsafe code at every point we want to access the STL data. Put differently: As there is no way to change the "runtime type" of a .NET array (certainly for good reasons!) the solution of our problem boils down to either...
  1. "Manually" construct objects of the target type from the byte[] data and store them in the result array, or...
  2. Unsafely copy the contents of the raw byte[] to an STLTriangle[] ("memcpy"), without explicitly or implicitly creating intermediate/temporary objects
Both options can be implemented in .NET in multiple ways; for option (1) that is
  • BinaryReader.ReadXXX: read individual fields directly from the stream
  • BitConverter.ToXXX: convert small chunks of the byte[] to individual fields of the target data type
  • Buffer.BlockCopy: read multiple fields of homogenous type at once
  • Marshal.PtrToStructure: interpret larger chunks of the byte[] as complete STLTriangle structs
For option (2) ("memcpy" approach) we will consider
  • Simple, unsafe C# code (*dst++ = *src++)
  • Marshal.Copy
  • memcpy (via P/Invoke)
  • cpblk CIL instruction
  • NativeInterop.Buffer.Copy using a custom blocked memcpy implementation (ReadUnmanagedStructRange uses this function under the hood)
For our sample scenario, we assume that the STL data is already available in-memory as a byte[]; otherwise we would only be measuring I/O transfer speed. We also don't want to measure GC performance; therefore the following implementations will use a pre-allocated result array ("tris") to store the re-interpreted STLTriangle data.

BinaryReader.ReadXXX

Probably the most straight-forward approach for parsing a binary file is using a BinaryReader to get the data for individual fields of the target data structure and then create the data structure from that data. An implementation for reading an STL file might look like the following bit of code:

public static void BinaryReaderRead(byte[] triBytes, STLTriangle[] tris) {
    using (var br = new BinaryReader(new MemoryStream(triBytes))) {
        for (int i = 0; i < tris.Length; ++i) {
            var normX = br.ReadSingle();
            var normY = br.ReadSingle();
            var normZ = br.ReadSingle();
            var aX = br.ReadSingle();
            var aY = br.ReadSingle();
            var aZ = br.ReadSingle();
            var bX = br.ReadSingle();
            var bY = br.ReadSingle();
            var bZ = br.ReadSingle();
            var cX = br.ReadSingle();
            var cY = br.ReadSingle();
            var cZ = br.ReadSingle();
            var abc = br.ReadUInt16();
            tris[i] = new STLTriangle(
                new STLVector(normX, normY, normZ),
                new STLVector(aX, aY, aZ),
                new STLVector(bX, bY, bZ),
                new STLVector(cX, cY, cZ),
                abc);
        }
    }
}

Note that I'm using a MemoryStream here to simulate reading from a Stream object while avoiding disk I/O.

BitConverter.ToXXX

It will hardly surprise my dear readers when I reveal upfront that the BinaryReader approach is not exactly the most efficient one. The next optimization step is therefore to read the whole file into a flat byte[] at once and extract the data using BitConverter:

public static void BitConverterTo(byte[] triBytes, STLTriangle[] tris) {
    for (int i = 0; i < tris.Length; ++i) {
        var offset = i * STLTriangle.Size;
        var normX = BitConverter.ToSingle(triBytes, offset);
        var normY = BitConverter.ToSingle(triBytes, offset + 4);
        var normZ = BitConverter.ToSingle(triBytes, offset + 8);
        var aX = BitConverter.ToSingle(triBytes, offset + 12);
        var aY = BitConverter.ToSingle(triBytes, offset + 16);
        var aZ = BitConverter.ToSingle(triBytes, offset + 20);
        var bX = BitConverter.ToSingle(triBytes, offset + 24);
        var bY = BitConverter.ToSingle(triBytes, offset + 28);
        var bZ = BitConverter.ToSingle(triBytes, offset + 32);
        var cX = BitConverter.ToSingle(triBytes, offset + 36);
        var cY = BitConverter.ToSingle(triBytes, offset + 40);
        var cZ = BitConverter.ToSingle(triBytes, offset + 44);
        var abc = BitConverter.ToUInt16(triBytes, offset + 48);
        tris[i] = new STLTriangle(
            new STLVector(normX, normY, normZ),
            new STLVector(aX, aY, aZ),
            new STLVector(bX, bY, bZ),
            new STLVector(cX, cY, cZ),
            abc);
    }
}

Here we convert chunks of four bytes to the single precision floating point fields of an STLTriangle, representing the components of the vertices and the normal vector. The last component is the 2-byte attribute count field.

Buffer.BlockCopy

We can further refine the previous approach by extracting not single floats from the byte[], but all vertex/normal data for each single triangle at once using Buffer.BlockCopy:

public static void BufferBlockCopy(byte[] triBytes, STLTriangle[] tris) {
    var coords = new float[12];
    for (int i = 0; i < tris.Length; ++i) {
        var offset = i * STLTriangle.Size;
        System.Buffer.BlockCopy(triBytes, offset, coords, 0, 48);
        var abc = BitConverter.ToUInt16(triBytes, offset + 48);
        tris[i] = new STLTriangle(
            new STLVector(coords[0], coords[1], coords[2]),
            new STLVector(coords[3], coords[4], coords[5]),
            new STLVector(coords[6], coords[7], coords[8]),
            new STLVector(coords[9], coords[10], coords[11]),
            abc);
    }
}
 
Unfortunately Buffer.BlockCopy can only handle arrays whose elements are of primitive type (in this case, we copy from a byte[] to a float[]). We can therefore not use this approach to copy all bytes at once from the byte[] to an STLTriangle[].

Marshal.PtrToStructure

The Marshal class provides a method PtrToStructure (or alternatively in generic form), which essentially reads an (unmanaged) struct from an arbitrary memory address:

public static unsafe void MarshalPtrToStructureGeneric(byte[] triBytes, STLTriangle[] tris) {
    var triType = typeof(STLTriangle);
    fixed (byte* pBytes = &triBytes[0])
    {
        for (int i = 0; i < tris.Length; ++i) {
            var offset = i * STLTriangle.Size;                    
            tris[i] = Marshal.PtrToStructure<STLTriangle>(new IntPtr(pBytes + offset));
        }
    }
}

*dst++ = *src++

Essentially the same effect as with Marshal.PtrToStructure can be achieved with a little bit of unsafe C#:

public static unsafe void UnsafePointerArith(byte[] triBytes, STLTriangle[] tris) {
    fixed (byte* pbuffer = triBytes)
    fixed (STLTriangle* ptris = tris)
    {
        var pSrc = (STLTriangle*)pbuffer;
        var pDst = ptris;
        for (int i = 0; i < tris.Length; ++i) {
            *pDst++ = *pSrc++;
        }
    }
}

Here we treat the (fixed) byte input array as an STLTriangle* which we can then dereference to store each triangle in the result array of type STLTriangle[]. Note that you cannot implement this kind of operation in a generic way in C#, as C# does not allow dereferencing pointers to objects of arbitrary (generic) type.

Marshal.Copy

The aforementioned unsafe C# approach already is a form of memcpy, albeit an unoptimized one (we copy blocks of 50 bytes at each iteration); furthermore it's specialized for copying from byte[] to STLTriangle[]. Marshal.Copy on the other hand is a little more generic, as it can copy any array of primitive elements to an arbitrary memory location (and vice versa):

public static unsafe void ConvertBufferMarshalCopy(byte[] triBytes, STLTriangle[] tris) {
    fixed (STLTriangle* pTris = &tris[0]) {
        Marshal.Copy(triBytes, 0, new IntPtr(pTris), triBytes.Length);
    }
}

memcpy (P/Invoke)

As we already concluded that we essentially need an efficient way to copy one block of memory to some other location in memory, why not simply use the system-provided, highly optimized, memcpy implementation?

[DllImport("msvcrt.dll", EntryPoint = "memcpy", CallingConvention = CallingConvention.Cdecl, SetLastError = false)]
static extern IntPtr memcpy(IntPtr dest, IntPtr src, UIntPtr count);
 
public static unsafe void ConvertBufferMemcpy(byte[] triBytes, STLTriangle[] tris) {
    fixed (byte* src = triBytes)
    fixed (STLTriangle* dst = tris)
    {
        memcpy(new IntPtr(dst), new IntPtr(src), new UIntPtr((uint)triBytes.Length));
    }
}

Easy. Of course, this introduces a dependency on Microsoft's C runtime library msvcrt.dll and is thus not platform independent.

cpblk

Interestingly, the CIL provides a specialized instruction just for copying blocks of memory, namely cpblk. A few years back, Alexander Mutel used cpblk in his nice performance comparison of different memcpy methods for .NET. To my knowledge, this instruction is currently not directly accessible from either C# or F# (not sure about C++/CLI, though). Yet, we can generate the necessary instructions using a bit of runtime code generation; here I'm using Kevin Montrose's excellent Sigil library:

// parameters: src, dst, length (bytes)
static Action<IntPtr, IntPtr, uint> CpBlk = GenerateCpBlk();
 
static Action<IntPtr, IntPtr, uint> GenerateCpBlk() {
    var emitter = Sigil.Emit<Action<IntPtr, IntPtr, uint>>.NewDynamicMethod("CopyBlockIL");
    // emit IL
    emitter.LoadArgument(1); // dest
    emitter.LoadArgument(0); // src
    emitter.LoadArgument(2); // len
    emitter.CopyBlock();
    emitter.Return();
    // compile to delegate
    return emitter.CreateDelegate();
}
 
public static unsafe void ConvertBufferCpBlk(byte[] triBytes, STLTriangle[] tris) {
    fixed (byte* src = triBytes)
    fixed (STLTriangle* dst = tris)
    {
        CpBlk(new IntPtr(src), new IntPtr(dst), (uint)triBytes.Length);
    }
}

While it's a bit clunky that code generation is required to get access to cpblk, the nice thing about it is that it's platform-independent: any CLI-compliant implementation must provide it. There is even hope that the next version of F# will come with an updated version of its pointer intrinsics library that will also feature access to cpblk (cf. Additional intrinsics for the NativePtr module and the corresponding pull request).

NativeInterop.Buffer.Copy

My own little NativeInterop helper library also has tools for moving around stuff in memory. For instance, the Buffer module contains a method Copy that copies an input array of type T to an array of type U. Both T and U must be unmanaged types (Note: Neither the C# compiler nor the CLR can enforce this constraint!). Under the covers, Buffer.Copy(T[], U[]) creates an empty result array and then copies the input bytes using a custom block copy implementation to the result array. Using Buffer.Copy is simple:

public static void ConvertBufferBufferConvert(byte[] triBytes, STLTriangle[] tris) {
    NativeInterop.Buffer.Copy(triBytes, tris);
}

Unfortunately, I couldn't use any of the other mem-copy methods to implement Buffer.Copy, as they are either not generic enough (Marshal.Copy, Buffer.BlockCopy), introduce dependencies on native libraries (memcpy via P/Invoke), or require runtime code generation or unsafe C# code, neither of which is available for Portable Class Libraries like NativeInterop (cpblk, *dst++=*src++).

I therefore experimented with different memcpy algorithms (different block sizes, different degrees of loop unrolling, with or without considering alignment...) and—for the time being—settled on a comparatively simple, aligned qword copying mechanism that offers decent performance on x64 with the current x64 JIT.

Multi-Threading

For some of the methods we will also look at a multi-threaded variant. Essentially, all of those versions look similar to this (here: memcpy):

public static unsafe void ConvertBufferMemcpyMultiThreaded(byte[] triBytes, STLTriangle[] tris) {
    var threadcount = Environment.ProcessorCount;
    var chunksize = triBytes.Length / threadcount;
    var remainder = triBytes.Length % threadcount;
 
    fixed (byte* pBytes = &triBytes[0])
    fixed (STLTriangle* pTris = &tris[0])
    {
        var tasks = new Action[threadcount];
        for (var i = 0; i < threadcount - 1; ++i) {
            var offset = i * chunksize;
            var newSrc = new IntPtr(pBytes + offset);
            var newDst = new IntPtr((byte*)pTris + offset);
            tasks[i] = () => memcpy(newSrc, newDst, new UIntPtr((uint)chunksize));                    
        }
 
        var finalOffset = (threadcount - 1) * chunksize;
        var finalSrc = new IntPtr(pBytes + finalOffset);
        var finalDst = new IntPtr((byte*)pTris + finalOffset);
        tasks[threadcount - 1] = () => memcpy(finalSrc, finalDst, new UIntPtr((uint)(chunksize + remainder)));
 
        Parallel.Invoke(tasks);
    }
}

Given that copying (large) chunks of memory should be a bandwidth-limited problem, multi-threading shouldn't help much, at least for efficient implementations. Methods with a lot of overhead might profit a little more.

Benchmark

As a benchmark problem, I chose to create fake STL data from 2^6 up to 2^20 triangles. As each triangle struct is 50 bytes, the total amount of data transferred (read + write) varies between approx. 6 kB up to approx. 100 MB and should thus stress all cache levels.

Each method described above ran up to 10 000 times with a time-out of 5 s for each of the 15 problem sizes. To minimize interference from other processes, I set the process priority to "high" and a garbage collection is kicked off before each round.

System.Diagnostics.Stopwatch, which I used to measure transfer times, has a resolution of 300 ns on my system. For very small working sets and the fastest methods, that's already too coarse. Of course I could have measured the total time for, say, 10 000 runs and just compute the average. Yet, it turns out that even with sufficient warm-up, there are still enough outliers in the measured timings to skew the result. After some experimenting, I decided to use the mean of the values below the 10th percentile instead. That seems to be more stable than relying on the timings of individual runs and also excludes extreme outliers. Still, I wouldn't trust the measurements for < 2^8 (25 kB) too much.
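As a sketch (not the actual benchmark harness used here), the "mean of the values below the 10th percentile" statistic can be computed like this:

```csharp
using System;
using System.Linq;

static class RobustTiming
{
    // Mean of all timings at or below the 10th percentile: more stable
    // than individual runs and immune to extreme outliers caused by GC
    // pauses, context switches, timer jitter etc.
    public static double MeanBelowTenthPercentile(double[] timings)
    {
        var count = Math.Max(1, timings.Length / 10); // keep at least one sample
        return timings.OrderBy(t => t).Take(count).Average();
    }
}
```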

I ran the benchmarks on an Intel Core i7-2600K @ 3.4 - 4.2 GHz and 32 GiB of DDR3-1600 RAM (PC3-12800, dual channel; peak theoretical transfer rate: 25 GiB/s) under Windows 8.1 Pro. Both the CLR's current x64 JIT compiler "JIT64" as well as RyuJIT (CTP 4) received the opportunity to show off their performance in the benchmarks.

Results & Analysis

First let's have a look at the results from JIT64:


Ignoring the multithreaded versions for now (dashed lines), it is clear immediately that memcpy (P/Invoke) offers great overall performance for both the smallest and the largest data sets. Marshal.Copy and cpblk come as a close second. Unsafe C# (*dst++ = *src++) offers stable, but comparatively poor performance.

NativeInterop's Buffer.Copy doesn't even come close to the fastest methods for small data sets, but offers comparable performance for larger sets. Something in its implementation is generating way more overhead than necessary... That "something" turns out to be the allocation of GCHandles for pinning the managed arrays: F# 3.1 doesn't support a "fixed" statement as C# does. To check whether that truly could be the source of the overhead, I implemented a version of the cpblk method that uses GCHandle.Alloc instead of fixed and—lo and behold—this slowed down cpblk to approximately the same speed as Buffer.Copy (light-blue).
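For illustration, pinning without C#'s fixed statement looks roughly like this (a sketch of what a library written in F# 3.1 has to resort to; the GCHandle allocation and release is exactly the per-call overhead in question):

```csharp
using System;
using System.Runtime.InteropServices;

static class PinnedCopy
{
    // Copy the raw bytes of src into dst without using "fixed": pin dst
    // via a GCHandle so the GC cannot move it, then let Marshal.Copy do
    // the transfer. GCHandle.Alloc/Free costs a fixed overhead per call,
    // which dominates for small arrays.
    public static void Copy(byte[] src, Array dst)
    {
        var hDst = GCHandle.Alloc(dst, GCHandleType.Pinned);
        try
        {
            Marshal.Copy(src, 0, hDst.AddrOfPinnedObject(), src.Length);
        }
        finally
        {
            hDst.Free();
        }
    }
}
```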

Unsurprisingly, the multithreaded versions come with quite some overhead, so they can't compete for < 500 kB. Only for problem sizes that fit into Sandy Bridge's L3 cache do we see some nice scaling. Problem sizes beyond the L3 cache size are again limited by the system's DRAM bandwidth and don't profit from multi-threading.

The performance of all the other (non-memcpy-like) methods is abysmal at best. We'd best just skip them...

How does this picture change once we switch to an (early!) preview of the upcoming RyuJIT compiler? As it turns out, not by much; yet something interesting is going on:


Suddenly, "*dst++ = *src++" has become one of the fastest implementations. What's going on here? Let's have a look at the generated assembly for *dst++ = *src++; first let's see what JIT64 produces:

lea rdx,[r9+r8]
lea rax,[r9+32h]
mov rcx, r9
mov r9, rax
mov rax, qword ptr [rcx]
mov qword ptr [rsp+20h], rax
mov rax, qword ptr [rcx+8]
mov qword ptr [rsp+28h], rax
mov rax, qword ptr [rcx+10h]
mov qword ptr [rsp+30h], rax
mov rax, qword ptr [rcx+18h]
mov qword ptr [rsp+38h], rax
mov rax, qword ptr [rcx+20h]
mov qword ptr [rsp+40h], rax
mov rax, qword ptr [rcx+28h]
mov qword ptr [rsp+48h], rax
mov ax, word ptr [rcx+30h]
mov word ptr [rsp+50h], ax
lea rcx, [rsp+20h]
mov rax, qword ptr [rcx]
mov qword ptr [rdx], rax
mov rax, qword ptr [rcx+8]
mov qword ptr [rdx+8], rax
mov rax, qword ptr [rcx+10h]
mov qword ptr [rdx+10h], rax
mov rax, qword ptr [rcx+18h]
mov qword ptr [rdx+18h], rax
mov rax, qword ptr [rcx+20h]
mov qword ptr [rdx+20h], rax
mov rax, qword ptr [rcx+28h]
mov qword ptr [rdx+28h], rax
mov ax, word ptr [rcx+30h]
mov word ptr [rdx+30h], ax
inc r11d
cmp r11d, r10d
jl  00007FFE57865410


Hm, so it first moves data from [rcx] (offset into the original byte[]) to some temporary at [rsp+20h] and then copies that to the destination [rdx] (offset into the result STLTriangle[]). That's obviously one step that's not necessary and that RyuJIT can improve upon (there are 28 mov instructions, but only 12 qword movs + 2 word movs are required to move 50 bytes from one memory location to another). So, what does RyuJIT really do?

lea r9, [rcx+32h]
lea r10, [rax+32h]
movdqu xmm0, xmmword ptr [rax]
movdqu xmmword ptr [rcx], xmm0
movdqu xmm0, xmmword ptr [rax+10h]
movdqu xmmword ptr [rcx+10h], xmm0
movdqu xmm0, xmmword ptr [rax+20h]
movdqu xmmword ptr [rcx+20h], xmm0
mov r11w, word ptr [rax+30h]
mov word ptr [rcx+30h], r11w
inc r8d
cmp edx, r8d
mov rax, r10
mov rcx, r9
jg 00007FFE57884CE8

We see that RyuJIT not only removes the unnecessary intermediate load/store ops, but also issues (unaligned) SIMD mov instructions to copy 16 bytes at a time. This optimization also works for NativeInterop.Buffer.Copy: Once I increased the block size to at least 16 bytes, performance became identical to that of *dst++ = *src++ in the RyuJIT case (aside from the overhead for small problem sizes due to GCHandle allocations).

Conclusion

When your only concern is reading from/writing to a file on a comparatively slow disk, none of the above should bother you much: most of the methods will simply be "fast enough". As soon as you start to move around data that's already in memory (e.g. cached) or resides on a very fast SSD, however, choosing the right method becomes critical.

While NativeInterop.Buffer.Copy isn't the fastest of the bunch (yet), its performance is competitive for medium to large problem sizes and in any case orders of magnitude faster than the usual BinaryReader approach. If you want convenience, portability and genericity while maintaining reasonable performance, NativeInterop provides a good all-round solution, I believe. If you want raw speed, though, use memcpy (or whatever may be available on your platform).

Method                                               | Convenience | Portability | Safety | Efficiency | Genericity
-----------------------------------------------------|-------------|-------------|--------|------------|-----------
BinaryReader.ReadXXX                                 | +           | +           | +      | -          | -
BitConverter.ToXXX                                   | -           | +           | o      | -          | -
Buffer.BlockCopy                                     | -           | +           | o      | -          | -
Marshal.PtrToStructure                               | -           | +           | o      | -          | +
*dst++ = *src++                                      | o           | +           | -      | o          | o
Marshal.Copy                                         | +           | +           | -      | +          | o
memcpy                                               | o           | -           | -      | +          | +
cpblk                                                | -           | +           | -      | +          | +
Buffer.Convert/Copy, Stream.ReadUnmanagedStructRange | +           | +           | -      | o/+        | +

Sunday, April 25, 2010

A minimalistic native 64 bit array implementation for .NET (updated code)

If you ever felt the need to process huge amounts of data via an algorithm implemented using .NET/the CLR, you've surely run into the 2^31-items limit of the CLR's current array implementation, which only supports Int32 array indices (this also affects other collections like List<T>, as those internally use standard arrays for storage).
You can try to circumvent this limitation by implementing your own array-like data structure, either by emulating contiguous storage via a collection of standard .NET arrays (partition your data in chunks with 2^31 items each), or by using native APIs and some evil pointer arithmetic to get maximum performance.
A while ago I tried to implement the latter approach in C#, which isn't a big deal: only a matter of some unsafe blocks for pointer arithmetic and a call to Marshal.AllocHGlobal for allocating memory on the unmanaged heap. However, when I tried to make that custom collection into a generic one, I ran into an unsolvable problem:
public unsafe T this[long index]
{
    get
    {                
        return *((T*)pBase + index);
    }
    set
    {
        *((T*)pBase + index) = value;
    }
}

This code does not compile. The reason is that there is no way to tell the C# compiler that T shall be constrained to unmanaged types.
Interestingly, F# 2.0 does feature such a constraint! This is how a minimalistic F# implementation of such a native 64 bit array could look:
namespace NativeTools
 
#nowarn "9"
#nowarn "42"
 
open System
open System.Runtime
open Microsoft.FSharp.NativeInterop
 
module internal PointerArithmetic =
    [<CompiledName("AddIntPtrToIntPtr")>]
    [<Unverifiable>]
    let inline addNativeInt (x: nativeptr<'T>) (n: nativeint) : nativeptr<'T> = 
        (NativePtr.toNativeInt x) + n * (# "sizeof !0" type('T) : nativeint #) |> NativePtr.ofNativeInt
    
    // "reinterpret_cast<IntPtr>(x)"... EVIL!
    [<CompiledName("Int64ToIntPtr")>]
    [<Unverifiable>]
    let inline int64ToNativeint (x: int64) = (# "" x : nativeint #)
 
    [<CompiledName("AddInt64ToIntPtr")>]
    [<Unverifiable>]
    let inline addInt64 (x: nativeptr<'a>) (o: int64) : nativeptr<'a> = addNativeInt x (int64ToNativeint o)
    
[<Sealed>]
type NativeArray64<'T when 'T: unmanaged>(length: int64) =
    let itemSize: int64 = (int64)(InteropServices.Marshal.SizeOf(typeof<'T>))
    let mutable isDisposed = false
    let allocatedBytes = length * itemSize
    let blob = InteropServices.Marshal.AllocHGlobal(nativeint allocatedBytes)
    let pBlobBase: nativeptr<'T> = NativePtr.ofNativeInt blob
    let disposeLock = new Object()
 
    member this.Length = length
    member this.BaseAddress = pBlobBase
    member this.ItemSize = itemSize
    member this.IsDisposed = isDisposed
    member this.AllocatedBytes = allocatedBytes
    
    member private this.Free () =
        lock disposeLock (fun () ->
            if isDisposed
                then ()
                else InteropServices.Marshal.FreeHGlobal blob
                     isDisposed <- true
        )
           
    member this.Item
        with get (idx: int64) =
                        NativePtr.read (PointerArithmetic.addInt64 pBlobBase idx)                    
        and  set (idx: int64) (value: 'T) =
                        NativePtr.write (PointerArithmetic.addInt64 pBlobBase idx) value
        
    member private this.Items = seq {
            for i in 0L .. length - 1L do
                yield this.[i]
        }
 
    override this.Finalize () = this.Free()
    
    interface IDisposable with
        member this.Dispose () =
            GC.SuppressFinalize this
            this.Free()
 
    interface Collections.Generic.IEnumerable<'T> with
        member this.GetEnumerator () : Collections.Generic.IEnumerator<'T> =
            this.Items.GetEnumerator()
        member this.GetEnumerator () : Collections.IEnumerator =
            this.Items.GetEnumerator() :> Collections.IEnumerator

UPDATE 2010-04-25: Removed a few bugs.

You can use this data structure in your C# code like a normal array:
var length = 8L * 1024L * 1024L * 1024L;

// allocate a byte-array of 8 GiB
using (var arr = new NativeTools.NativeArray64<byte>(length))
{
    arr[0] = 123;
    arr[length - 1] = 222;
    Console.WriteLine("Allocated " + arr.AllocatedBytes);
}
// auto-disposed ...

Friday, April 10, 2009

Generic Arithmetic for .NET

To me, one of the major design flaws and sources of frustration in .NET's generics system is the absence of support for some sort of common arithmetic interface for numeric types (float, double, int, etc.).

But why should anyone care about such things? Well, the problem becomes obvious as soon as you try to implement—say—a generic vector class:

(Note: I'll only show the relevant parts of the code here to avoid clutter and confusion)

// a vector in R^n
class Vector<T>
{
    T[] components;

    // ctors, properties etc. ...

    // operator definition, here: vector addition
    public static Vector<T> operator +(Vector<T> lhs,
                                       Vector<T> rhs)
    {
        Vector<T> res = new Vector<T>(lhs.Length);

        for (int i = 0; i < res.Length; ++i)
        {
            // compiler error!
            res[i] = lhs[i] + rhs[i];
        }

        return res;
    }
}

The reason for the compiler error is that there's simply not enough information available for the compiler to check whether T supports op_Addition. Now, if the CLI designers had included some interface definition like

interface INumeric<T>
{
    T Add(T rhs);
    T Sub(T rhs);
    T Mul(T rhs);
    T Div(T rhs);
    // etc.
}

... (or similar) and if they had implemented this interface for every numeric primitive type, we could simply use an interface constraint to solve our problem:

// a vector in R^n
class Vector<T> where T : INumeric<T>
{
    T[] components;

    // ctors, properties ...

    public static Vector<T> operator +(Vector<T> lhs, Vector<T> rhs)
    {
        Vector<T> res = new Vector<T>(lhs.Length);

        for (int i = 0; i < res.Length; ++i)
        {
            res[i] = lhs[i].Add(rhs[i]);       // OK, T implements INumeric<T>
        }

        return res;
    }
}

(Note: Operators are always statically defined in .NET, hence we can only use normal prefix method calls, as static methods can never be virtual and thus neither abstract nor part of an interface definition.)

Unfortunately, such an interface does not exist. Several people have already tried to circumvent this limitation. One of the first approaches uses special numerical interfaces whose implementations provide the arithmetic operations for concrete types. This technique works reasonably well, but has the disadvantage of forcing the user to explicitly provide an implementation of such an interface (or at least the name of a type implementing it).

More recently, a new solution came up which uses runtime code generation by building and compiling expression trees on the fly (or using LCG for older .NET versions). This indeed is a very elegant solution to the problem. Using an implementation of this concept like the one found in the MiscUtil library, you can simply write something like

T sum = Operator<T>.Add(lhs, rhs);

... inside your generic math algorithms.

The appropriate "Add" method for type T is generated on-the-fly upon the first call and cached for later reuse. The drawback of this method is the need for delegate invocations, which tend to be a lot more expensive than interface calls (at least my first benchmarks indicate that, especially on x64; curiously, the effect is much less severe on x86).
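A minimal sketch of this expression-tree technique (MiscUtil's Operator<T> works roughly along these lines; the class name here is illustrative): build the lambda "(a, b) => a + b" for T at runtime, compile it once, and cache the resulting delegate in a static field.

```csharp
using System;
using System.Linq.Expressions;

static class GenericOperator<T>
{
    // Compiled once per closed type T, then cached for all later calls.
    public static readonly Func<T, T, T> Add = CompileAdd();

    static Func<T, T, T> CompileAdd()
    {
        var a = Expression.Parameter(typeof(T), "a");
        var b = Expression.Parameter(typeof(T), "b");
        // Expression.Add resolves the built-in or user-defined op_Addition for T
        var body = Expression.Add(a, b);
        return Expression.Lambda<Func<T, T, T>>(body, a, b).Compile();
    }
}
```

After the one-time codegen cost on first use, GenericOperator<double>.Add(x, y) behaves like an ordinary delegate invocation.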

Interestingly, the latest F# CTP also provides generic vector and matrix classes (namespace Microsoft.FSharp.Math in assembly FSharp.PowerPack.dll). You can instantiate e.g. matrices like that:

Matrix<float> m = MatrixModule.Generic.create<float>(3, 3, 0f);

This returns a 3×3 matrix of floats initialized to 0.0f. So how did they do that? As far as I understand, their approach is very similar to the first method I described (using interfaces to define "calculator" types). To avoid having the user explicitly specify the operators’ implementation they use some kind of “type registry”, mapping each known numeric type to an appropriate implementation of an interface that defines basic arithmetic operations.

I’ve tried to do a quick n’ dirty implementation of this idea in C# which resulted in something like this (I haven’t yet checked if this actually runs...):

namespace GenericMath
{
    public interface INumeric<T>
    {
        T Add(T lhs, T rhs);
        T Sub(T lhs, T rhs);
        T Mul(T lhs, T rhs);
        T Div(T lhs, T rhs);
        T Abs(T x);
        T Zero { get; }
        T One { get; }
    }

    class FloatNumerics : INumeric<float>
    {
        public float Add(float lhs, float rhs)
        {
            return lhs + rhs;
        }

        public float Sub(float lhs, float rhs)
        {
            return lhs - rhs;
        }

        public float Mul(float lhs, float rhs)
        {
            return lhs * rhs;
        }

        public float Div(float lhs, float rhs)
        {
            return lhs / rhs;
        }

        public float Abs(float x)
        {
            return Math.Abs(x);
        }

        public float Zero
        {
            get { return 0f; }
        }

        public float One
        {
            get { return 1f; }
        }
    }

    // implementations for other types ...

    public static class ArithmeticAssociations
    {
        static readonly IDictionary<Type, object> operationsDict = new Dictionary<Type, object>
        {
            { typeof(float), new FloatNumerics() },
            { typeof(double), new DoubleNumerics() },
            { typeof(int), new Int32Numerics() }
        };

        static string numericIfaceName = typeof(INumeric<>).FullName;
        static string numericIfaceNameShort = typeof(INumeric<>).Name;

        public static bool UnregisterType<T>()
        {
            return operationsDict.Remove(typeof(T));
        }

        public static void RegisterType<T>(INumeric<T> arithmetics)
        {
            operationsDict[typeof(T)] = arithmetics;
        }

        public static INumeric<T> TryGetNumericOps<T>()
        {
            Type t = typeof(T);

            if (!operationsDict.ContainsKey(t))
            {
                throw new ArgumentException("No implementation of "
                    + numericIfaceNameShort + "<" + t.Name + "> registered.");
            }

            Type opsType = operationsDict[t].GetType();

            if (opsType.GetInterface(numericIfaceName) == null)
            {
                throw new ArgumentException("Arithmetic object associated with "
                    + t.Name + " does not implement " + numericIfaceNameShort);
            }

            return (INumeric<T>)operationsDict[t];
        }
    }

    // a sample class that uses INumeric + ArithmeticAssociations
    public sealed class Vector<T>
    {
        static INumeric<T> ops = ArithmeticAssociations.TryGetNumericOps<T>();

        T[] components;

        public T this[int i]
        {
            get { return components[i]; }
            set { components[i] = value; }
        }

        public int Length { get { return components.Length; } }

        public Vector(int size)
        {
            components = new T[size];
        }

        public static T operator *(Vector<T> lhs, Vector<T> rhs)
        {
            T res = ops.Zero;

            for (int i = 0; i < lhs.Length; ++i)
            {
                res = ops.Add(res, ops.Mul(lhs[i], rhs[i]));
            }

            return res;
        }

        public static Vector<T> operator *(T lambda, Vector<T> vec)
        {
            Vector<T> result = new Vector<T>(vec.Length);

            for (int i = 0; i < result.Length; ++i)
            {
                result[i] = ops.Mul(vec[i], lambda);
            }

            return result;
        }

        // other properties, methods and operators ...
    }

    // ...
}

Objects of the type Vector can now be created by this simple line:

Vector<float> v = new Vector<float>(3);

As you can see, this is totally transparent to the user (no need to specify a "Calculator" type parameter) and doesn't have the potential performance problems of the delegate/Func method.

Maybe someone finds this useful and can even improve the idea. I also plan to do a performance comparison of the different methods soon.


Update 2009-04-11:
I found another nice blog entry that covers the same topic and also presents a solution similar to the one shown above.