Getting benchmarking right

2010-02-28 16:30:56

Several times on this blog, I've dealt with issues where some kind of benchmarking was required. The method, the implementation, and the environment all play a role, and each has subsequently been the object of much discussion. In this post, let's see if we can agree on a common way of benchmarking.



Benchmarking is 4D

In my opinion there are 4 dimensions to benchmarking which we need to deal with separately. In industry, benchmarking can be crucial to architecting your solution correctly and/or dealing with performance issues. Industry benchmarking needs to come as close to the real scenario as possible: if you're working at massive scale, benchmarking on a laptop is a no-go; if your setup is heavy on I/O, you need to clone your production environment in order to produce reliable results; and so on.

The second type of benchmarking is what is typical for bloggers, and I'm no exception: microbenchmarking. Microbenchmarking is where we take certain routines out and benchmark them individually; my recent Fibonacci posts are examples of this. As the commenters pointed out, there needs to be an emphasis on equality where possible. A Ruby solution won't look like a Clojure solution, but differences like returning versus printing the result should be eliminated, and defining the specific area to benchmark is very important: are we timing the function or the inner loop?

This second form of benchmarking is divided into 2 distinct categories as well (bringing us up to the 4D benchmarking), because many languages put the JVM to good use with all its added benefits, while others don't. Benchmarking on the JVM requires a good knowledge of both the JVM and HotSpot, whereas benchmarking outside of the JVM requires a good knowledge of whichever compiler you are working with. In that regard I couldn't do Haskell any favors, since I don't have a clue about how its compiled code runs, its start-up times, etc.


JVM Benchmarking

Being able to benchmark on the JVM is important and requires some background. I recommend reading this post in particular.

The unix time command is ruled out for the obvious reasons that 1) we don't want to be affected by the startup time of the JVM, and 2) we don't need it. Especially when we're looking at a single algorithm, loop, etc., what we really want to know is how well that given body of code performs. Its interaction with the rest of our system can be disregarded for the sake of comparison, and since we're often dealing with very small numbers, subtle differences quickly become not so subtle. So before we can get into the actual benchmarking, we need to boot the JVM and do all our timing from inside it.
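To make that concrete, here is a minimal sketch of what timing from inside the JVM looks like - all measurement happens after the JVM is already up, so startup cost never enters the numbers. This is just an illustration (the name time-once is mine, and the real macro I'll use comes later in the post):

(defn time-once
  "Minimal sketch: runs f once inside the already-running JVM and reports
   the elapsed wall-clock time in milliseconds. JVM startup is never measured."
  [f]
  (let [start  (System/nanoTime)
        result (f)]
    {:result result
     :ms     (/ (double (- (System/nanoTime) start)) 1000000.0)}))

;; e.g. (time-once #(reduce + (range 1000000)))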

What we don't care about

For the sake of blogging/microbenchmarking we're typically not doing any major optimizations, because usually idioms are being compared. There are exceptions, but we can deal with them once we get there. The JVM is a fantastic ecosystem which allows for rigorous introspection - all of which we can disregard for our simple exercises.

Things we can disregard:

JVM Parameters:

-XX:+PrintCompilation - This gives us a heads-up whenever a method is compiled.

-verbose:gc - Lets us keep tabs on when the GC is running

-XX:+PrintGC - Print messages at GC

-XX:+PrintGCDetails - More verbose GC messages

-XX:+TraceClassLoading - Trace the loading of classes

-agentlib:hprof[=options] - Heap/CPU profiling; see the hprof documentation for the available options

All of these (and then some) can be very helpful in tracing performance bottlenecks, but for microbenchmarking they seem like overkill. Depending on how much allocation we're doing we might trigger some heavy GC, but in that case we account for it by stripping the highest/lowest values from our timings.

The things we cannot disregard:

JVM Parameters:

-Xms128M - The minimum amount of memory the JVM allocates

-Xmx512M - The maximum amount it may grow to

-server/-client - Depending on which we choose for a given task, we actually get different JIT compilers

We should strive to keep these values identical between runs, because depending on the task they can make all the difference. The minimum amount is allocated at JVM boot, meaning no time will be spent growing the memory space while the program is running, as long as you don't exceed it. Though I haven't checked, Jarkko Oranen suggested that the performance degradation which comes from approaching the Xmx value is due to the GC working double time to free up memory, so it's important to set the max to an appropriate value. While the appropriate values vary from test to test, we should strive to keep them identical across the implementations being compared.
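Since a forgotten flag silently skews the numbers, it can be worth double-checking from inside the REPL which settings the JVM was actually started with. The following is just a convenience sketch of my own (report-jvm-settings is not part of the benchmark macro below):

(import '[java.lang.management ManagementFactory])

(defn report-jvm-settings
  "Sketch: print the heap limits and the JVM arguments this process was
   actually started with, so -Xms/-Xmx/-server mishaps are caught early."
  []
  (let [rt (Runtime/getRuntime)]
    (println "Max heap (bytes)      :" (.maxMemory rt))    ; roughly -Xmx
    (println "Committed heap (bytes):" (.totalMemory rt))  ; grows from -Xms
    (println "JVM arguments         :"
             (.getInputArguments (ManagementFactory/getRuntimeMXBean)))))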


Methodology

I don't think there's a set standard for how we benchmark when doing blog comparisons, so allow me to introduce an outline for how comparisons will be handled on this blog:

Do multiple passes: The code being benchmarked should be executed repeatedly a given number of times; a default for smaller algorithms could be 20 passes.

Because we eventually hit GCs or other burps of the system, multiple passes ensure that we get the overall picture.

Garbage collect: Don't let garbage heap up between runs; clean up after every pass.

This is the simplest way in which you can tame the GC and help get uniform results.

Filter highest and lowest values: Take the highest and lowest value and remove them from the timings.

When timing on the JVM some of the bumps are very significant - you can have a series of runs going at 5ms and then suddenly a pass that takes 40ms. Since that's more a reflection of the disturbance than of the actual algorithm, it can safely be stripped.

Warm up the JVM: To trigger the JIT optimizations etc., the main loop should be repeated a number of times before timing begins.

Since we're blogging, I recommend just going for either 1 pass of the algorithm, or repeated passes of the algorithm until we've been crunching for 1 minute. So if your algorithm takes 30 seconds per pass, it will run twice; if it takes 2 minutes per pass, it will run once. A plain-function sketch of this methodology (minus the warm-up) is shown right below.
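Put together as a plain function, the passes/GC/trimming steps look roughly like this. This is only a sketch under my own naming (trimmed-average-ms); it skips the warm-up, and the real macro follows in the next section:

(defn trimmed-average-ms
  "Sketch of the methodology above: run f n times, garbage collect between
   passes (outside the timed region), then average the timings with the
   single highest and lowest value stripped."
  [n f]
  {:pre [(> n 2)]}
  (let [timings (doall
                 (for [_ (range n)]
                   (let [start   (System/nanoTime)
                         _       (f)
                         elapsed (/ (double (- (System/nanoTime) start))
                                    1000000.0)]
                     (System/gc)
                     elapsed)))]
    (/ (- (reduce + timings) (apply max timings) (apply min timings))
       (- n 2))))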


Implementation

To make this fair, I'll provide a benchmark macro in Clojure. If you are the happy user of some other JVM language which you want included in future benchmarks on this site, please provide me with a similar function/macro, which I will then both post here and use in future comparisons.

(defmacro microbench
  " Evaluates the expression n number of times, returning the average
    time spent in computation, removing highest and lowest values.

    If the body of expr returns nil, only the timings are returned; otherwise
    the result of each pass is also printed - this does not affect the timing.

    Before timings begin, a warmup is performed lasting at least 1 minute
    or 1 full computational cycle, whichever takes longer."
  [n expr] {:pre [(> n 2)]}
  `(let [warm-up#  (let [start# (System/currentTimeMillis)]
                     (println "Warming up!")
                     (while (< (System/currentTimeMillis) (+ start# (* 60 1000)))
                            (with-out-str ~expr)
                            (System/gc))
                     (println "Benchmarking..."))
         timings#  (doall
                    (for [pass# (range ~n)]
                      (let [start#    (System/nanoTime)
                            retr#     ~expr
                            timing#   (/ (double (- (System/nanoTime) start#))
                                         1000000.0)]
                        (when retr# (println retr#))
                        (System/gc)
                        timing#)))
         runtime#  (reduce + timings#)
         highest#  (apply max timings#)
         lowest#   (apply min timings#)]
     (println "Total runtime: " runtime#)
     (println "Highest time : " highest#)
     (println "Lowest time  : " lowest#)
     (println "Average      : " (/ (- runtime# (+ highest# lowest#))
                                   (- (count timings#) 2)))
     timings#))

The code is effectively divided into 3 sections:

Warm-up: A while loop runs for at least 1 minute, trapping all output so we don't see any printing from whatever we're benchmarking.

Timings: The actual expression is run repeatedly and the milliseconds spent in each pass are recorded.

Stats: Highest/Lowest values are filtered, results are printed

Using this to test our Fibonacci code from my previous post then looks like this:

user> (microbench 20
              (let [limit (.pow (BigInteger/TEN) 999)]
                   (loop [a 0 b 1 i 1]
                      (if (< b limit)
                          (recur b (+ a b) (inc i))
                          nil))))
Warming up!
Benchmarking...
Total runtime:  110.08787899999999
Highest time :  7.98691
Lowest time  :  5.135988
Average      :  5.386943388888889

Notice the final 'nil' in the code. If I had returned 'i' instead, it would have printed the result of each pass before printing the final stats - it would not affect the timing. Or we could use the more idiomatic (?) version:

user> (microbench 20
                  (let [limit (.pow (BigInteger/TEN) 999)]
                       (count (take-while #(< % limit) fib-seq))))
Warming up!
Benchmarking...
4782
4782
4782
4782
4782
4782
4782
4782
4782
4782
4782
4782
4782
4782
4782
4782
4782
4782
4782
4782
Total runtime:  74.73764200000001
Highest time :  5.031017
Lowest time  :  3.168419
Average      :  3.696567

Of course in that case it's cheating, because fib-seq is calculated once and then kept in memory.
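For reference, fib-seq comes from the earlier Fibonacci post; a typical top-level lazy definition (reproduced here only so the example is self-contained - it may differ slightly from the original) looks like this, and it's exactly the fact that the var holds onto the realized sequence that makes repeated passes "cheat":

;; Classic lazy Fibonacci sequence; since the var retains the head of the
;; sequence, elements realized in one pass stay cached for all later passes.
(def fib-seq
  (lazy-cat [0 1] (map + fib-seq (rest fib-seq))))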


Conclusion

Benchmarks can be fun, but I think it's important to 1) agree on some methodology and 2) not go nuts over a few milliseconds here and there. Certain algorithms perform better than others, and certain algorithms can be more or less idiomatically expressed in various languages. Ultimately, when looking at the results we have to apply some common sense, as we're not always doing 1:1 comparisons. Using these benchmarks to definitively declare one language superior to another should neither be our goal, nor is it possible.

If you've taken a look at Alex Osborne's attempt at the WideFinder 2 challenge, you saw how he plowed through 45 gigabytes of text in a blazing 8m 4s, blowing both Scala and Java out of the water. Does that conclusively show that Clojure is faster than both those languages? Of course not: Clojure compiles to bytecode, just like Scala and Java, so the exact same speed could be obtained in both of them. The point is that concise and elegant Clojure code can be as powerful as code in other languages without the same level of incidental complexity being put on the user, so it makes sense to focus on the quality of the code while keeping half an eye on the performance. If all that mattered was speed, we'd all still be writing ASM.

This is my proposal for a uniform way to benchmark in the future - let me know what you think. I'm only too happy to accept changes, improvements, implementations in other languages, etc. And JVM outsiders shouldn't feel left out.

danlei
2010-02-28 17:41:54
Hi Lau!

Using your macro, Clojure gets a lot closer to SBCL (~9ms):


Total runtime:  13223.448962
Highest time :  89.741439
Lowest time  :  8.055909
Average      :  13.1519555250501


I did exclude some peaks in the results of my last post, but using this macro is way more easy than doing it by hand (and also more fair).

Something like this could be handy to have in contrib. If you have signed a CA, maybe you should consider putting it in.
Lau
2010-02-28 17:43:53
Danlei,

I was juuuust about to send that CA :)

You have some enormous peaks - You wouldn't be running Windows by any chance?
danlei
2010-02-28 17:52:01
"You have some enormous peaks — You wouldn’t be running Windows by any chance?"

Guilty as charged.
Lau
2010-02-28 17:54:38
Ah - Then your results make more sense!
Isaac Gouy
2010-02-28 20:19:23
>> Being able to benchmark on the JVM is important and requires some background <<

Yes.

http://www.elis.ugent.be/JavaStats

"JavaStats is a tool, written in Python that allows to run Java benchmarks automatically until a sufficiently narrow confidence interval is obtained for a given performance metric."
Isaac Gouy
2010-02-28 21:54:24
» Being able to benchmark on the JVM is important and requires some background «

There are links to the "Statistically Rigorous Java Performance Evaluation" pdf paper and slides here -

http://shootout.alioth.debian.org/flawed-benchmarks.php#bogus
Meikel
2010-03-01 00:05:30
Lau, what will I tell you about this macro? Exactly: http://paste.pocoo.org/show/184094.
danlei
2010-03-01 00:15:48
I think one should provide microbench-fn and a thin macro wrapper microbench around it. (After all, time is a macro, too)
Mike
2010-03-01 01:01:58
I don't like asking the user to specify the number of times to run. Better to think about how long it takes to run. I.e. - you run the loop N times, but keep increasing N until it takes more than some amount of time to make N passes. You then report the same data.
Dhananjay Nene
2010-03-01 05:23:05
Some great points here in terms of what should be covered in a detailed benchmarking exercise.  +1

I would differ in one statement. It is not import to "agree" upon one methodology as it is to apply one consistently and allow others the means necessary to apply another consistently at their end should they wish to change it.

When blogging benchmarking results, two more aspects could be covered :

a. Published code should contain benchmarking related code (even when it complicates indicating LOC for the core logic even as one attempts to emphasise brevity of one language)
b. Blog post should contain specifics of hardware / os / compiler / other relevant environment etc.

These are important so that the reader could choose to further extend the exercise at his end should he wish to, even as he retains the confidence he is able to continue to extend at the point where you left off, (given that the benchmarking code is in the post) and  he can make at least some reasonable mental adjustments for differences in environments if any.
Dhananjay Nene
2010-03-01 05:26:34
re: Last Comment : Read "it is not import" as "it is not important"
Meikel
2010-03-01 07:29:55
@danlei:

The usual approach is (defn microbench* ...) + (defmacro microbench [n & body] `(microbench* ~n (fn [] ~@body))).
Patrick
2010-03-01 11:20:57
I would recommend looking at Japex (https://japex.dev.java.net/) and Caliper (http://code.google.com/p/caliper/). Caliper is newer and under active development. Both of them can capture and graph benchmark results, track historical results to some extent, compare different algorithms, etc. Plus they handle warmup. No reason to re-invent the wheel, IMO.
danlei
2010-03-01 13:34:53
@Meikel

Yes, that's what I meant. Nitpick: I'd stick with naming it microbench-fn (if it's exposed), since the asterisk means either something like "helper" or "variation" (for example clojure.core/list*), and doesn't give a clue about what the function does.
Meikel
2010-03-01 20:04:31
@danlei

the * naming is wide-spread eg. in Scheme (in particular scsh) and also Clojure. This is the first time I hear about -fn convention.
tx
2010-03-01 21:44:34
To minimize the effects of the JVM expanding heap space you may want to also set -Xmx and -Xms to the same value.
danlei
2010-03-02 00:09:15
@meikel

Yes, * is used widely, but for different things. I'm not sure if -fn is used as often, but one sees it here and there, and I think it's more informative than *.
Isaac Gouy
2010-03-02 01:13:28
> Microbenchmarking is where we take certain routines out and benchmark them individually, my recent Fibonacci posts are examples of this.

Unless the micro-benchmarking is done a lot better than the benchmarks game does it; unless what's being benchmarked is a more interesting than the benchmarks game tasks - what's the point?


Back in 2000 Doug Bagley was showing timings for Fib programs written in 27 languages -

http://web.archive.org/web/20010606110920/www.bagley.org/~doug/shootout/bench/fibo/


Brent Fulgham revived that  in 2004 showing timings for Fib programs written in even more languages -

http://shootout.alioth.debian.org/gp4/benchmark.php?test=recursive&lang=all


The challenge is to find new, relevant, interesting problems and provide a more interesting, more accurate measurement and analysis than the benchmarks game provides.
Lau
2010-03-02 09:02:24
@Isaac: The point is still to have a minimalistic uniform methodology for doing these small comparisons. The only reason I mentioned how Enterprise benchmarking is done, was to avoid rolling out the big guns with graphing, randomized input etc etc. Usually all we care about is seeing how well a certain compiler takes the code down to the metal.

@danlei, @meikel: I'm not sure that the asterisk suffix is used for many different things, but I know that since december 2 years ago, ClojureQL has consistently used that suffix to denote driver functions, and indeed my benchmark would benefit from one such. If we can't agree on a convention, I side with danlei in saying that -fn is more descriptive. 

@tx: Interesting tip, I'll have to give it a whirl

@Patrick: I like new wheels, especially for keeping things as minimalistic as possible

@Dahananjay: Agreed.

To the rest of you reading, if you have similar snippets in Scala, Python, JRuby etc, feel free to post them and I'll be happy to apply them.
Isaac Gouy
2010-03-02 19:54:58
@Lau: "The point is still to have a minimalistic uniform methodology for doing these small comparisons."

My mistake - I guess you don't actually mean "Getting benchmarking right".

For JVM benchmarking JavaStats gives everyone a way to avoid being fooled by bad methodology. (It doesn't provide graphing or randomized input.)

And the question was - What's the point of "doing these small comparisons" which just repeat what others have done over the last 10 years ago. 

Where's the challenge in that? What's interesting about that?
gene t
2010-03-17 19:54:58
re: "Clojure compiles to bytecode, just like Scala and Java so the exact same speed could be obtained "

I've seen discussion of scala (and maybe clojure?) generating bytecode sequences that are impossible from javac, but this is the only example i could google

http://clojure-log.n01se.net/date/2010-01-14.html#i19