What began as a small peek at JDK source code quickly showed me a whole new side of Java.
java.util.concurrent.atomic.LongAdder is a new class that will be introduced with Java 8. It provides an atomic way to sum up a large number of values. According to the documentation, LongAdder is faster than AtomicLong. I was curious how that could be and had a look at the implementation.
I found that yes, LongAdder is faster (see graph on the right). And I also learned a bit about how atomics are implemented in Java. It turned out to be more complex and more interesting than I expected.
Performance
First, how well does LongAdder perform? I downloaded the latest preview of JDK 8 (1.8.0-ea-b84 64bit, under Windows 7) and ran a performance test. I compared the performance of LongAdder, AtomicLong, the synchronized keyword, and an empty statement (as a control to see how much overhead there is).
You can see my performance test code here. In short, for each of the four test cases, I ran a number of threads (1 to 20) at a time. The thread to test LongAdder was equivalent to this simplified version:
```java
static class LongAdderThread implements Runnable {

    public StressTest stressTest;

    @Override
    public void run() {
        stressTest.cyclicBarrier.await();
        for (int i = 0; i < 1000000; i++) {
            stressTest.longAdder.increment();
        }
        stressTest.cyclicBarrier.await();
    }
}
```
All threads share the same StressTest instance, so longAdder is the shared atomic variable that is tested. CyclicBarrier is used to control the threads and to eliminate the variation in thread creation and join overhead from the result.
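To make the setup concrete, here is a minimal sketch of how such a barrier-controlled harness can be wired up. The class and method names (BarrierHarness, run) are my own for illustration, not the original test code:

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.LongAdder;

public class BarrierHarness {

    // Runs nThreads incrementing threads and returns the measured duration in ms.
    public static long run(int nThreads, int increments) throws Exception {
        LongAdder adder = new LongAdder();
        // nThreads worker parties plus the main thread as timekeeper
        CyclicBarrier barrier = new CyclicBarrier(nThreads + 1);
        for (int t = 0; t < nThreads; t++) {
            new Thread(() -> {
                try {
                    barrier.await();                  // wait until all threads are ready
                    for (int i = 0; i < increments; i++) {
                        adder.increment();
                    }
                    barrier.await();                  // signal completion
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }).start();
        }
        barrier.await();                              // release the workers...
        long start = System.nanoTime();
        barrier.await();                              // ...and wait for them to finish
        long durationMs = (System.nanoTime() - start) / 1_000_000;
        if (adder.sum() != (long) nThreads * increments) {
            throw new AssertionError("lost updates");
        }
        return durationMs;
    }
}
```

Because the timer only starts once every worker has reached the first barrier, thread creation cost is excluded from the measurement, which is exactly what the CyclicBarrier is there for.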
This is what the result looks like on a quad-core machine (8 logical cores with hyper-threading):
The horizontal axis is the number of concurrent threads that were started for the test, and the vertical axis is the duration in milliseconds, normalized by the number of threads. So if LongAdder is plotted at 3 ms for 10 threads, that means running those 10 threads took 30 ms.
All four versions appear to eventually plot a horizontal line in this graph. This means that the system is saturated. Adding another thread at this stage is equivalent to adding the thread to a waiting queue from which it gets pulled once all other threads have been run. This means that the graph really is only interesting for a low number of threads.
Synchronized and AtomicLong both follow a similar pattern. There is a bit of an adjustment going on, but from 2 threads on, their performance is virtually constant. This is to be expected. These test threads do nothing but access a locked variable, so adding a new thread will simply make that thread wait in a queue to access the locked variable. AtomicLong does a much better job at this than synchronized, as expected.
However, LongAdder shows completely different behaviour. It gets more efficient with more threads, and it only starts to show constant performance from about 4 threads on, the number of physical cores. Even then, performance seems to improve slightly as more threads are added, as a constant overhead is spread across them. This gives it performance characteristics similar to the empty loop thread, meaning it performs as if there were no locked variable serializing the threads.
How is this possible? Simply by not having a locked variable. Well, that’s not quite true, but the idea of LongAdder is to avoid contention by making each thread work on a different variable.
Implementation
Here is an excerpt of the LongAdder implementation with a few key elements:
```java
public class LongAdder extends Striped64 implements Serializable {

    public void add(long x) {
        Cell[] as; long b, v; int m; Cell a;
        if ((as = cells) != null || !casBase(b = base, b + x)) {
            boolean uncontended = true;
            if (as == null || (m = as.length - 1) < 0 ||
                (a = as[getProbe() & m]) == null ||
                !(uncontended = a.cas(v = a.value, v + x)))
                longAccumulate(x, null, uncontended);
        }
    }

    public long sum() {
        Cell[] as = cells; Cell a;
        long sum = base;
        if (as != null) {
            for (int i = 0; i < as.length; ++i) {
                if ((a = as[i]) != null)
                    sum += a.value;
            }
        }
        return sum;
    }
}
```
LongAdder was the first Java library source file I had ever looked at, and it was not what I expected. Assignments inside if clauses? Really? One could argue that for some of these cases you do get better performance, because the number of if statements is reduced and assignments only occur when needed. I suppose it's the first time I have looked at speed-optimized Java code, and my focus so far has been on making code reliable. Still, it makes me wonder whether this kind of optimization should be left to the compiler.
Programming style aside, two JDK 8 classes are used: Striped64 and Cell.
Striped64 is a new class in JDK 8 that serves as the base for a whole family of new classes in java.util.concurrent.atomic that share the same basic functionality as LongAdder. There is, for example, the more general class LongAccumulator, which allows developers to implement variants such as a LongMultiplier. The classes are duplicated to work with doubles, so there is also a DoubleAdder and a DoubleAccumulator.
The basic concept of a Striped64 is that it holds a hash table of Cells (think of each Cell as an AtomicLong). When two threads try to add something to a LongAdder – which is a Striped64 – then there is a good chance that the threads will try to add their value to different Cells in that hash table. This reduces the contention to a near minimum.
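The striping idea can be illustrated without any JDK internals. Here is a deliberately simplified toy version backed by plain AtomicLongs (my own sketch, not the Striped64 code; the real implementation also grows the table and re-hashes threads on contention):

```java
import java.util.concurrent.atomic.AtomicLong;

public class ToyStripedCounter {

    private final AtomicLong[] cells;

    // stripes must be a power of two so the bit mask below works as a modulo
    public ToyStripedCounter(int stripes) {
        cells = new AtomicLong[stripes];
        for (int i = 0; i < stripes; i++) {
            cells[i] = new AtomicLong();
        }
    }

    public void add(long x) {
        // Hash the current thread to a cell, so concurrent threads usually
        // hit different AtomicLongs and rarely contend with each other.
        int index = (int) (Thread.currentThread().getId() & (cells.length - 1));
        cells[index].addAndGet(x);
    }

    public long sum() {
        long sum = 0;
        for (AtomicLong cell : cells) {
            sum += cell.get();
        }
        return sum;
    }
}
```

The trade-off is the same as in the real class: writes get cheaper because they spread across cells, while reading the total requires visiting every cell.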
Cells are the building blocks of a Striped64. A Cell is a “Padded variant of AtomicLong supporting only raw accesses plus CAS” according to the source code. This is what Cell looks like:
```java
static final class Cell {
    volatile long p0, p1, p2, p3, p4, p5, p6;
    volatile long value;
    volatile long q0, q1, q2, q3, q4, q5, q6;

    Cell(long x) { value = x; }

    final boolean cas(long cmp, long val) {
        return UNSAFE.compareAndSwapLong(this, valueOffset, cmp, val);
    }

    // Unsafe mechanics
    private static final sun.misc.Unsafe UNSAFE;
    private static final long valueOffset;
    static {
        try {
            UNSAFE = sun.misc.Unsafe.getUnsafe();
            Class<?> ak = Cell.class;
            valueOffset = UNSAFE.objectFieldOffset
                (ak.getDeclaredField("value"));
        } catch (Exception e) {
            throw new Error(e);
        }
    }
}
```
The only attribute of interest is “value”. The long attributes around it are for padding only. Padding is a strategy to reduce CPU cache contention by trying to make sure that different cells do not land in the same cache line. It has been found to bring significant performance improvements.
You can also see that the mysterious CAS means “compare and swap”, an atomic operation provided by Unsafe. Unsafe provides an interface to the underworld that is the JVM: memory can be allocated, overwritten, and cast. And a list of atomic operations supported by the JVM can be accessed. This blog post has some pretty cool examples of what Unsafe can do.
Unsafe.compareAndSwap operations are atomic. They take a pointer to a chunk of memory (in this case comprised of this and valueOffset which together point to value), a compare value and a swap value. If the JVM finds that the value of the addressed memory is equal to the compare value, then it stores the swap value in the addressed memory and returns true. This means that CAS operations are a fast and thread safe way to update the value of a variable and get feedback on whether the operation was successful or whether there was contention.
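The same pattern is available outside Unsafe through AtomicLong.compareAndSet. A hand-rolled add built on a CAS retry loop looks like this (a sketch of the general pattern, not JDK code):

```java
import java.util.concurrent.atomic.AtomicLong;

public class CasAdder {

    private final AtomicLong value = new AtomicLong();

    public void add(long x) {
        long current;
        do {
            current = value.get();                        // read the current value
            // compareAndSet only succeeds if nobody changed value in between;
            // on contention it returns false and we simply retry.
        } while (!value.compareAndSet(current, current + x));
    }

    public long get() {
        return value.get();
    }
}
```

This is essentially what AtomicLong.addAndGet does internally, and it shows why contention hurts: every failed CAS costs a full retry of the read-modify-write cycle.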
With this short and sudden trip all the way down to atomic operations on actual memory, we can now attempt to step through the LongAdder.add method. Here it is again:
```java
public void add(long x) {
    Cell[] as; long b, v; int m; Cell a;
    if ((as = cells) != null || !casBase(b = base, b + x)) {
        boolean uncontended = true;
        if (as == null || (m = as.length - 1) < 0 ||
            (a = as[getProbe() & m]) == null ||
            !(uncontended = a.cas(v = a.value, v + x)))
            longAccumulate(x, null, uncontended);
    }
}
```
The basic strategy of Striped64 is to use a single base value until the first case of contention. From then on, Striped64 maintains a hash table of cells to avoid contention.
1. (line 3) If we already have a hash table of cells (there has been contention before), skip to step 3.
2. (line 3) Try to CAS the base (add our value to the base atomically). If successful, we're done. Otherwise continue.
3. (line 5) If we have no hash table yet (this is our first case of contention), skip to step 6.
4. (line 6) If we do not have an entry in the hash table for the current thread (getProbe() returns a hash value for the current thread), skip to step 6.
5. (line 7) Try to CAS the hash table entry. If successful, we're done. Otherwise there was contention and we continue.
6. (line 8) The method Striped64.longAccumulate does all the heavy lifting of maintaining the hash table: it creates the table if necessary, doubles its size whenever there was contention (so that over time the probability of contention is minimized), and finally adds the new value to one of the cells. The longAccumulate method is too complex to discuss here in detail.
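The steps above can be sketched with plain atomics. This is my own simplification: it uses a fixed-size cell table and a simple contention flag, and skips the probing and table-growing logic that the real longAccumulate handles:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

public class MiniAdder {

    private final AtomicLong base = new AtomicLong();
    // In the real Striped64 the cell table starts out null and grows on
    // contention; here it has a fixed power-of-two size to keep things short.
    private final AtomicLongArray cells = new AtomicLongArray(16);
    private volatile boolean contended = false;

    public void add(long x) {
        // Fast path: as long as there has been no contention, CAS the base.
        if (!contended) {
            long b = base.get();
            if (base.compareAndSet(b, b + x)) {
                return;
            }
            contended = true;          // first collision: switch to the cells
        }
        // Slow path: pick a cell by thread hash and add there.
        int i = (int) (Thread.currentThread().getId() & (cells.length() - 1));
        cells.addAndGet(i, x);
    }

    public long sum() {
        long sum = base.get();
        for (int i = 0; i < cells.length(); i++) {
            sum += cells.get(i);
        }
        return sum;
    }
}
```

An uncontended MiniAdder behaves exactly like a single AtomicLong; only after the first failed CAS does it start spreading updates across cells, which mirrors the base-then-cells strategy described above.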
Now, how do we get the grand total out of LongAdder once we’re done? Here is the sum method of LongAdder:
```java
public long sum() {
    Cell[] as = cells; Cell a;
    long sum = base;
    if (as != null) {
        for (int i = 0; i < as.length; ++i) {
            if ((a = as[i]) != null)
                sum += a.value;
        }
    }
    return sum;
}
```
The sum method simply adds up the base and all entries in the cells array. This is clearly not atomic and that is what the documentation says as well.
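Putting it all together, typical LongAdder usage increments from many threads and reads the total once the writers are done. A small self-contained example (my own, not from the article's test code):

```java
import java.util.concurrent.atomic.LongAdder;

public class SumDemo {

    public static long countInParallel(int nThreads, int perThread)
            throws InterruptedException {
        LongAdder counter = new LongAdder();
        Thread[] threads = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.increment();
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();             // wait until all updates are done
        }
        // sum() is only guaranteed to be an exact total once the
        // writers have finished, since it is not an atomic snapshot.
        return counter.sum();
    }
}
```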
So this is what LongAdder is: an elegant idea efficiently implemented, leading to excellent performance. Personally, I find it a bit scary to see how unmaintainable the source code looks. However, I learned a few tricks along the way that I might want to use in the future. Variable padding to avoid cache contention, for example. This little peek at a Java library implementation has made me curious: what can I learn from the implementation of the Collection classes? I loved reading "Effective Java" by Josh Bloch, who designed the collection framework, which makes me think it's well worth taking a look at it.
Nice article, busy writing about similar.
One note on the assignments in if statements: they are meaningful, as the fields read are volatile. What the code is aiming for is to minimize the number of volatile reads (a potential source of cache misses) and to cache the values read locally (for consistency). Like anything lock-free, it's a rather intricate relationship between the different variables and their visibility, as deduced from happens-before guarantees.
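That pattern is visible in sum() above: the volatile cells field is read once into a local, and all further work uses the local. A contrived sketch of the difference (illustrative only, not JDK code):

```java
public class VolatileReadDemo {

    volatile long[] data = {1, 2, 3};

    // Naive: re-reads the volatile field on every access; each volatile
    // read is ordered by the memory model and may miss the cache.
    long sumNaive() {
        long s = 0;
        for (int i = 0; i < data.length; i++) {
            s += data[i];
        }
        return s;
    }

    // JDK style: one volatile read into a local, then plain reads. This also
    // guarantees we iterate over a single consistent array reference even if
    // another thread replaces data concurrently.
    long sumLocal() {
        long[] d = data;               // single volatile read
        long s = 0;
        for (int i = 0; i < d.length; i++) {
            s += d[i];
        }
        return s;
    }
}
```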
The end result is indeed very C looking, but have no doubt that Doug knows exactly what he is doing ;-)
As for leaving it to the compiler, this is not magic that can be done by the JIT compiler at the moment… the JIT compiler would have to guess at all those intricate relationships above and err on the side of safety, I’m not sure we’ll ever have a compiler smart enough to produce equally efficient code.
You are benchmarking bareback, have you tried JMH?
Also note the JDK is aiming for massive scalability these days, they are keen to scale to the 128 CPU machines and the above approach needs to be considered in that context.
Thanks :-)
Thanks for the feedback! Please feel free to post a link once you’re done writing, I’m keen to give it a read.
I agree that the fields had to be assigned to local variables, not only for efficiency but also because you don't want someone else to change the value you're working on. The point I was trying to make is that code is much more readable and maintainable if assignments only happen outside if statements. I have no doubt that Doug knows what he's doing! But code like this hides bugs more easily. And he's not the only one who will ever maintain the code. For example, the Java repository shows that LongAdder was reviewed by three developers before it was added. And who knows how many changes will be made to it in the coming years?
As for compiler optimization, it’s true that I got carried away there. It’s wishful thinking that compilers can do all the work. But have a look at this cool talk about JIT optimizations. It’s quite fascinating. It also made me wonder what I was actually testing. Is it possible that the compiler optimized away some of my test code?
I’ve never benchmarked Java code before, so my first instinct was to do it by hand. I’ll have a look at JMH. Thanks for the tip!
I posted it yesterday: http://psy-lob-saw.blogspot.com/2013/06/java-concurrent-counters-by-numbers.html
It presents some alternatives to LongAdder and discusses the design and motivation behind it. I sent a link to Doug and got good feedback from him, so I think it’s fairly sound.
The volatile reads also serve as happens-before edges, not just optimisations. Reading volatile fields in a particular order is meaningful.
The way the guys who write this code avoid bugs is by doing some insanely intense testing (see here: https://github.com/russel/java-concurrency-torture), and having a very careful review process. Plus, the code is open for the public to review which is great and one of the things that has always been great about Java. Assuming the implementation is not unnecessarily complex (you can find a simpler way to achieve the same) I’m not sure a more explicit assignment style would improve the overall readability/maintainability of the code.
Thanks for the link, I’ll go check it out. It’s awesome that you got feedback from Doug Lea!
Yes, you’d have to break up the if statements to make sure you’re still reading the fields in the same order if you wanted to move the assignments outside the if clauses. I’ve been tempted for a while to rewrite the code to see whether it would make it harder or easier to read. Just haven’t found the time yet. I’m reading “Clean Code” by Robert Martin at the moment, and it makes a few very good points on the importance of keeping your code clean and easily readable. So now I’m spotting unclean code everywhere. First and foremost in my own code, of course…
Hey, in your github code, I only see the “write” part and don’t see the “read”. It would be interesting to add reading measurement and see.
Hey, that’s true, I did neglect the “read” side of things. That might yield interesting results as well.