What began as a small peek at JDK source code quickly showed me a whole new side of Java.

java.util.concurrent.atomic.LongAdder is a new class that will be introduced with Java 8. It provides an atomic way to add up a large number of values. According to the documentation, LongAdder is faster than AtomicLong. I was curious how that could be and had a look at the implementation.
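As a quick orientation, the public API is small: increment or add to accumulate, sum to read. A minimal single-threaded example:

```java
import java.util.concurrent.atomic.LongAdder;

public class LongAdderBasics {
    public static long demo() {
        LongAdder adder = new LongAdder();
        adder.increment();      // add 1
        adder.add(41);          // add an arbitrary value
        return adder.sum();     // aggregate the internal state into one long
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 42
    }
}
```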

[Graph: LongAdder vs. AtomicLong performance on a quad-core machine, discussed below]

I found that yes, LongAdder is faster (see graph on the right). And I also learned a bit about how atomics are implemented in Java. It turned out to be more complex and more interesting than I expected.

Performance

First, how well does LongAdder perform? I downloaded the latest preview of JDK 8 (1.8.0-ea-b84 64bit, under Windows 7) and ran a performance test. I compared the performance of LongAdder, AtomicLong, the synchronized keyword, and an empty statement (as a control to see how much overhead there is).

You can see my performance test code here. In short, for each of the four test cases, I ran a number of threads (1 to 20) at a time. The thread to test LongAdder was equivalent to this simplified version:

static class LongAdderThread implements Runnable {
    public StressTest stressTest;

    @Override
    public void run() {
        try {
            stressTest.cyclicBarrier.await();   // wait until all threads are ready
            for (int i = 0; i < 1000000; i++) {
                stressTest.longAdder.increment();
            }
            stressTest.cyclicBarrier.await();   // signal that this thread is done
        } catch (InterruptedException | BrokenBarrierException e) {
            throw new RuntimeException(e);
        }
    }
}

All threads share the same StressTest instance, so longAdder is the shared atomic variable under test. The CyclicBarrier is used to start and stop the threads together, eliminating thread creation and join overhead from the measurement.
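A minimal, self-contained version of such a harness might look like this (the actual test code differs; the class and constant names here are my own):

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.LongAdder;

public class LongAdderStress {
    static final int THREADS = 4;
    static final int INCREMENTS = 1_000_000;

    public static long run() {
        LongAdder adder = new LongAdder();
        // Parties: the worker threads plus the main thread, which uses
        // the two barrier passes to time only the increment loops.
        CyclicBarrier barrier = new CyclicBarrier(THREADS + 1);
        try {
            for (int t = 0; t < THREADS; t++) {
                new Thread(() -> {
                    try {
                        barrier.await();                    // start together
                        for (int i = 0; i < INCREMENTS; i++) {
                            adder.increment();
                        }
                        barrier.await();                    // report completion
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }).start();
            }
            barrier.await();                                // release the workers
            long start = System.nanoTime();
            barrier.await();                                // wait for all workers
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(THREADS + " threads: " + elapsedMs + " ms");
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return adder.sum();
    }

    public static void main(String[] args) {
        System.out.println("sum = " + run());
    }
}
```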

This is what the result looks like on a quad-core machine (8 logical cores with hyper-threading):

[Graph: duration per thread vs. number of threads for LongAdder, AtomicLong, synchronized, and an empty loop on a quad-core machine]

The horizontal axis is the number of concurrent threads started for the test, and the vertical axis is the duration in milliseconds divided by the number of threads. So if LongAdder plots a value of 3 ms at 10 threads, that means running those 10 threads took 30 ms in total.

All four versions eventually flatten into a horizontal line in this graph. This means the system is saturated: adding another thread at that point is equivalent to putting it in a queue from which it gets pulled once the other threads have run. The graph is therefore really only interesting for low thread counts.

The synchronized and AtomicLong versions follow a similar pattern. After a brief adjustment phase, their performance is virtually constant from 2 threads on. This is to be expected: these test threads do nothing but access a contended variable, so adding a new thread simply makes it wait in line for access. As expected, AtomicLong does a much better job of this than synchronized.

LongAdder, however, shows completely different behaviour: it gets more efficient as threads are added. Its performance only levels off at about 4 threads, the number of physical cores, and even then it seems to improve slightly with each added thread as a constant overhead is spread across more of them. This gives it performance characteristics similar to the empty loop, meaning it behaves as if there were no shared variable serializing the threads.

How is this possible? Simply by not having a locked variable. Well, that’s not quite true, but the idea of LongAdder is to avoid contention by making each thread work on a different variable.

Implementation

Here is an excerpt of the LongAdder implementation with a few key elements:

public class LongAdder extends Striped64 implements Serializable {

    public void add(long x) {
        Cell[] as; long b, v; int m; Cell a;
        if ((as = cells) != null || !casBase(b = base, b + x)) {
            boolean uncontended = true;
            if (as == null || (m = as.length - 1) < 0 ||
                (a = as[getProbe() & m]) == null ||
                !(uncontended = a.cas(v = a.value, v + x)))
                longAccumulate(x, null, uncontended);
        }
    }

    public long sum() {
        Cell[] as = cells; Cell a;
        long sum = base;
        if (as != null) {
            for (int i = 0; i < as.length; ++i) {
                if ((a = as[i]) != null)
                    sum += a.value;
            }
        }
        return sum;
    }
}

This was the first Java library source file I had ever read, and it was not what I expected. Assignments inside if clauses? Really? One could argue that in some of these cases you do get better performance, because fewer if statements are evaluated and assignments only happen when needed. I suppose it's the first time I have looked at speed-optimized Java code; my focus so far has been on making code reliable. Still, it makes me wonder whether this kind of optimization should be left to the compiler.

Programming style aside, two JDK 8 classes are used: Striped64 and Cell.

Striped64 is a new class in JDK 8 that serves as the base for a whole family of new classes in java.util.concurrent.atomic that share the same basic machinery as LongAdder. There is, for example, a more general class LongAccumulator, which allows developers to implement variants such as a LongMultiplier. The classes are duplicated to work with doubles, so there is also a DoubleAdder and a DoubleAccumulator.

The basic concept of a Striped64 is that it holds a hash table of Cells (think of each Cell as an AtomicLong). When two threads try to add something to a LongAdder – which is a Striped64 – then there is a good chance that the threads will try to add their value to different Cells in that hash table. This reduces the contention to a near minimum.
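The idea can be illustrated with a toy striped counter (my own simplification, not the JDK code): each thread hashes to one of a fixed set of slots, so concurrent writers usually touch different AtomicLongs. The real Striped64 additionally rehashes a thread's probe value on contention and grows the table.

```java
import java.util.concurrent.atomic.AtomicLong;

public class ToyStripedCounter {
    private final AtomicLong[] cells;

    /** stripes must be a power of two so the index mask below works. */
    public ToyStripedCounter(int stripes) {
        cells = new AtomicLong[stripes];
        for (int i = 0; i < stripes; i++) cells[i] = new AtomicLong();
    }

    public void increment() {
        // Spread threads across cells by thread id; Striped64 uses a
        // per-thread probe value that is rehashed on contention instead.
        int index = (int) (Thread.currentThread().getId() & (cells.length - 1));
        cells[index].incrementAndGet();
    }

    public long sum() {
        long s = 0;
        for (AtomicLong cell : cells) s += cell.get();
        return s;
    }

    // Hypothetical self-check: N threads each do M increments.
    public static long stress(int threads, int perThread) {
        ToyStripedCounter counter = new ToyStripedCounter(8);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) counter.increment();
            });
            workers[i].start();
        }
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return counter.sum();
    }
}
```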

Cells are the building blocks of a Striped64. A Cell is a “Padded variant of AtomicLong supporting only raw accesses plus CAS” according to the source code. This is what Cell looks like:

static final class Cell {
    volatile long p0, p1, p2, p3, p4, p5, p6;
    volatile long value;
    volatile long q0, q1, q2, q3, q4, q5, q6;
    Cell(long x) { value = x; }

    final boolean cas(long cmp, long val) {
        return UNSAFE.compareAndSwapLong(this, valueOffset, cmp, val);
    }

    // Unsafe mechanics
    private static final sun.misc.Unsafe UNSAFE;
    private static final long valueOffset;
    static {
        try {
            UNSAFE = sun.misc.Unsafe.getUnsafe();
            Class<?> ak = Cell.class;
            valueOffset = UNSAFE.objectFieldOffset
                (ak.getDeclaredField("value"));
        } catch (Exception e) {
            throw new Error(e);
        }
    }
}

The only attribute of interest is “value”. The long attributes around it are for padding only. Padding is a strategy to reduce CPU cache contention by trying to make sure that different cells do not land in the same cache line. It has been found to bring significant performance improvements.

You can also see that the mysterious CAS means “compare and swap”, an atomic operation provided by Unsafe. Unsafe provides an interface to the underworld that is the JVM: memory can be allocated, overwritten, and cast. And a list of atomic operations supported by the JVM can be accessed. This blog post has some pretty cool examples of what Unsafe can do.

Unsafe.compareAndSwap operations are atomic. They take a pointer to a chunk of memory (in this case comprised of this and valueOffset which together point to value), a compare value and a swap value. If the JVM finds that the value of the addressed memory is equal to the compare value, then it stores the swap value in the addressed memory and returns true. This means that CAS operations are a fast and thread safe way to update the value of a variable and get feedback on whether the operation was successful or whether there was contention.
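The same compare-and-swap primitive is exposed safely through the atomic classes; the classic CAS retry loop looks like this (using AtomicLong.compareAndSet rather than Unsafe, and a made-up multiply operation as the example):

```java
import java.util.concurrent.atomic.AtomicLong;

public class CasDemo {
    /** Atomically multiplies the counter by a factor using a CAS retry loop. */
    public static long multiply(AtomicLong counter, long factor) {
        while (true) {
            long current = counter.get();     // read the compare value
            long next = current * factor;     // compute the swap value
            // compareAndSet succeeds only if nobody changed the value
            // between our read and this call; on contention we retry.
            if (counter.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    public static void main(String[] args) {
        AtomicLong c = new AtomicLong(6);
        System.out.println(multiply(c, 7)); // prints 42
    }
}
```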

With this short and sudden trip all the way down to atomic operations on actual memory, we can now attempt to step through the LongAdder.add method. Here it is again:

public void add(long x) {
    Cell[] as; long b, v; int m; Cell a;
    if ((as = cells) != null || !casBase(b = base, b + x)) {
        boolean uncontended = true;
        if (as == null || (m = as.length - 1) < 0 ||
            (a = as[getProbe() & m]) == null ||
            !(uncontended = a.cas(v = a.value, v + x)))
            longAccumulate(x, null, uncontended);
    }
}

The basic strategy of Striped64 is to use a single base value until the first case of contention. From then on, Striped64 maintains a hash table of cells to avoid contention.

  1. (line 3) If we already have a hash table of cells (there has been contention before) then skip to step 3.
  2. (line 3) Try to CAS the base (add our value to the base atomically). If successful, we’re done. Otherwise continue.
  3. (line 5) If we have no hash table (this means we have our first case of contention) then skip to step 6.
  4. (line 6) If we do not have an entry in the hash table for the current thread (getProbe() returns a hash value for the current thread) then skip to step 6.
  5. (line 7) Try to CAS the hash table entry. If successful, we’re done. Otherwise there was contention and we continue.
  6. (line 8) The method Striped64.longAccumulate will do all the heavy lifting of maintaining the hash table. It will create it if necessary. It will double the size of the hash table whenever there was contention, so that over time the probability of contention is minimized. Finally, it will add the new value to one of the cells. The longAccumulate method is too complex to discuss here in detail.
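The steps above can be sketched in plain Java. This is my own stripped-down illustration with a fixed-size, lock-protected table, not the real Striped64 logic, which is lock-free and grows and rehashes the table on contention:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

public class MiniAdder {
    private final AtomicLong base = new AtomicLong();
    private volatile AtomicLongArray cells;   // null until first contention

    public void add(long x) {
        AtomicLongArray as = cells;
        // Steps 1-2: with no cell table yet, try to CAS the base.
        if (as == null) {
            long b = base.get();
            if (base.compareAndSet(b, b + x)) return;
            // Contention on base: fall through to the cells.
        }
        accumulate(x);
    }

    // Stand-in for longAccumulate (steps 3-6): create the table lazily,
    // then add to the slot for this thread. Simplified to a synchronized
    // method; the real code avoids locking on this path.
    private synchronized void accumulate(long x) {
        if (cells == null) cells = new AtomicLongArray(4);
        int index = (int) (Thread.currentThread().getId() & 3);
        cells.addAndGet(index, x);
    }

    public long sum() {
        long s = base.get();
        AtomicLongArray as = cells;
        if (as != null)
            for (int i = 0; i < as.length(); i++) s += as.get(i);
        return s;
    }
}
```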

Now, how do we get the grand total out of LongAdder once we’re done? Here is the sum method of LongAdder:

public long sum() {
    Cell[] as = cells; Cell a;
    long sum = base;
    if (as != null) {
        for (int i = 0; i < as.length; ++i) {
            if ((a = as[i]) != null)
                sum += a.value;
        }
    }
    return sum;
}

The sum method simply adds up the base and all entries in the cells array. This is clearly not atomic and that is what the documentation says as well.
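A small demonstration of that: a sum taken while writers are still running is only a moment-in-time estimate, but once all writers have finished, sum() is exact (helper names are my own):

```java
import java.util.concurrent.atomic.LongAdder;

public class SumSnapshot {
    public static long concurrentSum(int threads, int perThread) {
        LongAdder adder = new LongAdder();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) adder.increment();
            });
            workers[i].start();
        }
        // A sum taken here could be anything between 0 and the final total:
        long snapshot = adder.sum();
        assert snapshot >= 0 && snapshot <= (long) threads * perThread;

        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        // After all writers have finished, the sum is exact.
        return adder.sum();
    }

    public static void main(String[] args) {
        System.out.println(concurrentSum(4, 250_000)); // prints 1000000
    }
}
```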

So this is what LongAdder is: an elegant idea, efficiently implemented, with excellent performance. Personally, I find it a bit scary to see how unmaintainable the source code looks. However, I learned a few tricks along the way that I might want to use in the future, variable padding to avoid cache-line contention, for example. This little peek at the Java library implementation has made me curious: what could I learn from the implementation of the Collection classes? I loved reading "Effective Java" by Josh Bloch, who designed the collections framework, so it seems well worth taking a look.
