Several times on this blog, I've dealt with issues where some kind of benchmarking was required. The method, implementation, environment all play a role and has subsequently been the object of much discussion. In this post let's see if we can agree on some way of benchmarking.
In my oppinion there are 4 dimensions to benchmarking which we need to deal with separately. In industry benchmarking can be crual to architecting your solution correctly and/or dealing with performance issues. Industry benchmarking needs to come as close to the real scenario as possible. If you're working on a massive scale, benchmarking on a laptop is a no go, if your setup is heavy on I/O you need to clone your production environment in order to produce reliable results, etc etc.
The second type of benchmarking is what is typical for bloggers and I'm no exception: Microbenchmarking. Microbenchmarking is where we take certain routines out and benchmark them individually, my recent Fibonnaci posts are examples of this. Like the commentators pointed out, there needs to be an emphasis on equality where possible. A Ruby solution won't look like a Clojure solution, but differences like returning or printing the result should be eliminated and defining the specific area to benchmark is very important: Are we timing the function or the inner-loop?
This second form of benchmarking is in divided into 2 distinct categories as well (bringing us up to the 4D benchmarking), because we have many languages which put the JVM to good effect with all the added benefits, but there are also languages that dont. Benchmarking on the JVM requires a good knowledge of both the JVM and HotSpot, whereas benchmarking outside of the JVM requires a good knowledge of whichever compiler you are working with. In that regard I couldn't do Haskell any favors, for not having a clue about how the compiled code runs, start up times etc?
Being able to benchmark on the JVM is important and requires some background. I recommend reading this post in particular.
The unix time function is ruled out for the obvious reasons that 1) We dont want to be affected by the startup time of the JVM and 2) We dont need it. Especially when we're looking at single algorithm/loop etc, what we really want to know is how well that given body of code performs. Its interaction with the rest of our system can be disregarded for the sake of comparison and since we're often dealing with very small numbers, subtle differences quickly become not so subtle. Before we can get into the actual benchmarking we need to boot the JVM.
For the sake of blogging/microbenchmarking we're typically not doing any major optimizations, because usually idioms are being compared. There are exceptions, but we can deal with them once we get there. The JVM is a fantastic eco-system which allows for rigorious introspection - all of which we can disgard for our simple exercises.
Things we can disregard:
JVM Parameters:
-XX:+PrintCompilation - This gives us a heads-up whenever a method is compiled.
-verbose:GC - Lets us keep tabs on when the GC is running
-XX:-PrintGC - Print messages at GC
-XX:-PrintGCDetails - More verbose GC messages
-XX:-TraceClassLoading - Trace the loading of classes
-agentlib:hprof[=options] - Heap/CPU - read this: options
All of these (and dim sum) can be very helpful in tracing performance bottlenecks, but for microbenchmarking they seem overkill. Depending on how much allocation we're doing we might trigger some heavy GC, but in that case we should account be stripping high/low values from our timings.
The things we cannot disregard:
JVM Paramters:
-Xms128M - The minimum amount of memory the JVM allocation
-Xmx512M - The maximum amount
-server/-client - Depending on which we choose for a given task, you actually get different compilers
We should strive to keep the values identical because depending on the task they can make all the difference. The minimum amount is allocation at the JVM boot, meaning no time will be spent growing the memory space while the program is running if you don't exceed this. Thought I haven't checked Jarkko Oranen suggested that the preformance degradation which comes by approaching the Xmx value is due to the GC working double time to free up memory, therefore its important to set the max to an appropriate value - though it varies from test to test we should strive to keep these values identical.
I don't think there's a set standard for how we benchmark when doing blog comparisons, so allow me to introduce and outline for how we can handle comparisons done on this blog:
Do multiple passes: The code being benched should be executed repeatedly a given number of times, at default for smaller algorithms could be 20 passes.
Because we eventually hit GCs or other burps of the system, multiple passes ensure that we get the overall picture.
Garbage collect: Dont heap it up between runs, but clean after every pass.
This is the simplest way in which you can tame the GC and help get uniform results.
Filter highest and lowest values: Take the highest and lowest value and remove them from the timings.
When timing on the JVM some of the bumps are very significant - You can have a series of runs going at 5ms and then suddenly a pass that takes 40ms - Since its more a reflection of the disturbance than the actual algorithm, it can safely be stripped.
Warm up the JVM: To trigger all the optimizations etc, the main loop should be repeated a number of times
Since we're blogging, I recommend just going for either 1 pass of the algorithm, or repeated passes of the algorithm until we've have been crunching for 1 minute. So if you algorithm takes 30 seconds for 1 pass, it will run twice, if it takes 2 minutes per pass, it will run once.
To make this fair I'll provide a benchmark macro in Clojure and if you are also the happy user of some other JVM language which you want included in future benchmarks on this site, please provide me with a similar function/macro, which I will then both post here and use in future comparisons.
(defmacro microbench " Evaluates the expression n number of times, returning the average time spent in computation, removing highest and lowest values. If the body of expr returns nil, only the timing is returned otherwise the result is printed - does not affect timing. Before timings begin, a warmup is performed lasting either 1 minute or 1 full computational cycle, depending on which comes first." [n expr] {:pre [(> n 2)]} `(let [warm-up# (let [start# (System/currentTimeMillis)] (println "Warming up!") (while (< (System/currentTimeMillis) (+ start# (* 60 1000))) (with-out-str ~expr) (System/gc)) (println "Benchmarking...")) timings# (doall (for [pass# (range ~n)] (let [start# (System/nanoTime) retr# ~expr timing# (/ (double (- (System/nanoTime) start#)) 1000000.0)] (when retr# (println retr#)) (System/gc) timing#))) runtime# (reduce + timings#) highest# (apply max timings#) lowest# (apply min timings#)] (println "Total runtime: " runtime#) (println "Highest time : " highest#) (println "Lowest time : " lowest#) (println "Average : " (/ (- runtime# (+ highest# lowest#)) (- (count timings#) 2))) timings#))
The code is effectively divded into 3 sections:
Warm-up: A while loop runs for at least 1 minute, trapping all output so we dont see any printing from whatever we're benchmarking.
Timings: The actually expression is repeatedly run and the milliseconds spent in each pass is returned
Stats: Highest/Lowest values are filtered, results are printed
Using this to test our Fibonacci code from my previous post, then becomes:
user> (microbench 20 (let [limit (.pow (BigInteger/TEN) 999)] (loop [a 0 b 1 i 1] (if (< b limit) (recur b (+ a b) (inc i)) nil)))) Warming up! Benchmarking... Total runtime: 110.08787899999999 Highest time : 7.98691 Lowest time : 5.135988 Average : 5.386943388888889
Notice the final 'nil' in the code. If I had returned 'i' instead it would have printed the result of each pass before printing the final stats - It would not affect the timing. Or if we used the more idiomatic (?) version:
user> (microbench 20 (let [limit (.pow (BigInteger/TEN) 999)] (count (take-while #(< % limit) fib-seq)))) Warming up! Benchmarking... 4782 4782 4782 4782 4782 4782 4782 4782 4782 4782 4782 4782 4782 4782 4782 4782 4782 4782 4782 4782 Total runtime: 74.73764200000001 Highest time : 5.031017 Lowest time : 3.168419 Average : 3.696567
Of course in that case its cheating, because the fib-seq is calculated once and then kept in memory.
Benchmarks can be fun, but I think its important to 1) Agree on some methodology and 2) Not go nuts over a few milliseconds here and there. Certain algorithms perform better than others, and certain algorithms can be more or less idiomatically expressed in various languages. Ultimately when looking at the results we have to apply some common sense as we're not always doing 1:1 comparisons. Using these benchmarks to definitely positively declare one language superior to the other should neither be our goal nor is it possible.
If you've taken a look at Alex Osbornes attempt at the WideFinder 2 challenge, you saw how he plowed through 45 Gigabytes of text in a blazing 8m 4s blowing both Scala and Java out of the water. Does that then conclusively state that Clojure is faster than both those languages? Of course not, Clojure compiles to bytecode, just like Scala and Java so the exact same speed could be obtained by both of them. The point is, that the concise and elegant Clojure code can be as powerful than other languages without the same level of incidental complexity being put on the user, so it makes sense to focus on the quality of the code while keeping half an eye on the performance. If all that mattered was speed, we'd all still be writing ASM.
This is my proposal for a uniform way to benchmark in the future - let me know what you think, I'm only happy to accept changes, improvements, implementations in other languages etc. And JVM outsiders shouldn't feel left out.