C Sharp

Performance

Performance as it relates to the problem of finalization is an important issue. The .NET team believes that some kind of tracing collector must exist to handle cycles of references, which reference counting alone cannot reclaim, and handling cycles accounts for a large part of a tracing collector's cost. The team also believes that code-execution performance can be substantially hurt by the cost of reference counting. The good news is that of all the objects allocated in a running program, the number that really need deterministic finalization is small. However, it's usually difficult to isolate the performance cost to just those objects. Let's look at some pseudocode for a simple reference assignment when using a tracing collector and then at the pseudocode when using reference counting:

// Tracing.
a = b;

That's it. The compiler turns that into a single move instruction and might even optimize the whole thing away in some circumstances.

// Ref counting.
if (a != null)
    if (InterlockedDecrement(ref a.m_ref) == 0)
        a.FinalRelease();
if (b != null)
    InterlockedIncrement(ref b.m_ref);
a = b;

The bloat in this code is considerable: the working set is bigger, and the execution cost is dramatically higher, especially given the two interlocked instructions. You can limit the code bloat by moving this logic into a helper method, but only at the cost of further lengthening the code path. In addition, code generation ultimately suffers once you add all the necessary try blocks, because the optimizer has its hands somewhat tied in the presence of exception-handling code; this is true even in unmanaged C++. It's also worth noting that every object is four bytes bigger because of the extra reference-count field, again increasing memory usage and the size of the working set.
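The helper-method approach just mentioned could look something like the following sketch. The Assign method and the RefCounted class are hypothetical, not part of any library; the helper centralizes the increment/decrement pair, and it deliberately addrefs the source before releasing the target so that self-assignment is safe:

```csharp
using System;
using System.Threading;

public class RefCounted
{
    public int m_ref = 1;          // creator holds the initial reference
    public virtual void FinalRelease() { }

    // Hypothetical helper: performs "target = source" with ref counting.
    // AddRef the source first so that self-assignment can't drop the
    // count to zero before the increment happens.
    public static RefCounted Assign(RefCounted target, RefCounted source)
    {
        if (source != null)
            Interlocked.Increment(ref source.m_ref);
        if (target != null)
        {
            if (Interlocked.Decrement(ref target.m_ref) == 0)
                target.FinalRelease();
        }
        return source;
    }
}
```

A call site then shrinks to `a = RefCounted.Assign(a, b);`, trading inline bloat for a method call on every single assignment.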

Two example applications follow that demonstrate this cost; they are provided courtesy of the .NET team. The benchmark loops, allocating objects, doing two assignments, and exiting the scope of one reference per iteration. As with any benchmark, it's open to some subjective interpretation. One might even argue that in the context of this routine, most of the reference counting could be optimized away. That's probably true; here, however, the point is to demonstrate the effect. In a real program, those kinds of optimizations are difficult, if not impossible, to perform automatically. In fact, C++ programmers make those optimizations manually, which leads to reference-counting bugs. Also note that in a real program the ratio of assignments to allocations is much higher than it is here.

Here's the first example, ref_gc.cs, the version that relies on the tracing GC:

using System;
public class Foo
{
    private static Foo m_f;
    private int m_member;
    public static void Main(String[] args)
    {
        int ticks = Environment.TickCount;
        Foo f2 = null;
        for (int i=0; i < 10000000; ++i)
        {
            Foo f = new Foo();
            // Assign static to f2.
            f2 = m_f;
            // Assign f to the static.
            m_f = f;
            // f goes out of scope.
        }
        // Assign f2 to the static.
        m_f = f2;
        // f2 goes out of scope.
        ticks = Environment.TickCount - ticks;
        Console.WriteLine("Ticks = {0}", ticks);
    }
    public Foo()
    {
    }
}

And here's the second, ref_rm.cs, a reference-counted version that uses interlocked operations for thread safety:

using System;
using System.Threading;
public class Foo
{
    private static Foo m_f;
    private int m_member;
    private int m_ref;
    public static void Main(String[] args)
    {
        int ticks = Environment.TickCount;
        Foo f2 = null;
        for (int i=0; i < 10000000; ++i)
        {
            Foo f = new Foo();
            // Assign static to f2.
            if (f2 != null)
            {
                if (Interlocked.Decrement(ref f2.m_ref) == 0)
                    f2.Dispose();
            }
            if (m_f != null)
                Interlocked.Increment(ref m_f.m_ref);
            f2 = m_f;
            // Assign f to the static.
            if (m_f != null)
            {
                if (Interlocked.Decrement(ref m_f.m_ref) == 0)
                    m_f.Dispose();
            }
            if (f != null)
                Interlocked.Increment(ref f.m_ref);
            m_f = f;
            // f goes out of scope.
            if (Interlocked.Decrement(ref f.m_ref) == 0)
                f.Dispose();
        }
        // Assign f2 to the static.
        if (m_f != null)
        {
            if (Interlocked.Decrement(ref m_f.m_ref) == 0)
                m_f.Dispose();
        }
        if (f2 != null)
            Interlocked.Increment(ref f2.m_ref);
        m_f = f2;
        // f2 goes out of scope.
        if (Interlocked.Decrement(ref f2.m_ref) == 0)
            f2.Dispose();
        ticks = Environment.TickCount - ticks;
        Console.WriteLine("Ticks = {0}", ticks);
    }
    public Foo()
    {
        m_ref = 1;
    }
    public virtual void Dispose()
    {
    }
}

Note that there's only one thread and therefore no bus contention, making this the "ideal" case for the reference-counted version. You could probably fine-tune it a bit, but not much. It's also worth noting that Visual Basic has not historically had to worry about using interlocked operations for its reference counting (although Visual C++ has). In prior releases of Visual Basic, a component ran in a single-threaded apartment and was guaranteed that only a single thread at a time would be executing in it. One of the goals for Visual Basic .NET is multithreaded programming, and another is eliminating the complexity of COM's multiple threading models. For those of you who want to see a version that doesn't use lock prefixes, a third example follows, again courtesy of the .NET team: ref_rs.cs, a reference-counted version that assumes it's running in a single-threaded environment. This program is not nearly as slow as the thread-safe version, but it's still quite a bit slower than the GC version.

using System;
public class Foo
{
    private static Foo m_f;
    private int m_member;
    private int m_ref;
    public static void Main(String[] args)
    {
        int ticks = Environment.TickCount;
        Foo f2 = null;
        for (int i=0; i < 10000000; ++i)
        {
            Foo f = new Foo();
            // Assign static to f2.
            if (f2 != null)
            {
                if (--f2.m_ref == 0)
                    f2.Dispose();
            }
            if (m_f != null)
                ++m_f.m_ref;
            f2 = m_f;
            // Assign f to the static.
            if (m_f != null)
            {
                if (--m_f.m_ref == 0)
                    m_f.Dispose();
            }
            if (f != null)
                ++f.m_ref;
            m_f = f;
            // f goes out of scope.
            if (--f.m_ref == 0)
                f.Dispose();
        }
        // Assign f2 to the static.
        if (m_f != null)
        {
            if (--m_f.m_ref == 0)
                m_f.Dispose();
        }
        if (f2 != null)
            ++f2.m_ref;
        m_f = f2;
        // f2 goes out of scope.
        if (--f2.m_ref == 0)
            f2.Dispose();
        ticks = Environment.TickCount - ticks;
        Console.WriteLine("Ticks = {0}", ticks);
    }
    public Foo()
    {
        m_ref = 1;
    }
    public virtual void Dispose()
    {
    }
}

Obviously, there are many variables here. Still, running all three of these applications results in the GC version running almost twice as fast as the single-threaded reference-counting example and roughly four times faster than the reference-counting example that uses lock prefixes. My personal experience was as follows; these numbers represent the mean over several runs on an IBM ThinkPad 570 using the .NET Framework SDK Beta 1 compiler:

GC Version (ref_gc)                       1162ms
Ref Counting (ref_rm)(multi-threaded)     4757ms
Ref Counting (ref_rs)(single-threaded)    1913ms
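Finally, recall the claim at the start of this section that a tracing collector must exist to handle cycles. The following sketch, built around a hypothetical Node class with a manual reference count, shows why reference counting alone can't reclaim two objects that point at each other: after the external references are dropped, both counts are stuck at 1 and neither object is ever finalized, whereas a tracing collector would reclaim both.

```csharp
using System;

// Hypothetical manually ref-counted class, used only to
// illustrate why cycles defeat reference counting.
public class Node
{
    public Node Next;
    public int RefCount = 1;   // one reference held by the creator

    public void AddRef() { ++RefCount; }
    public void Release()
    {
        if (--RefCount == 0)
            Console.WriteLine("finalized");
    }
}

public class CycleDemo
{
    public static void Main()
    {
        Node a = new Node();
        Node b = new Node();

        // Build a cycle: a -> b -> a.
        a.Next = b; b.AddRef();
        b.Next = a; a.AddRef();

        // Drop the external references, as if a and b went out of scope.
        a.Release();
        b.Release();

        // Neither count reaches zero: the cycle keeps both alive forever.
        Console.WriteLine("a: {0}, b: {1}", a.RefCount, b.RefCount);
        // prints "a: 1, b: 1" -- Release never writes "finalized"
    }
}
```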