Haskell,Ruby,Scala,Clojure - Tweaked!

2010-02-24 21:17:11

Last night I did very small superficial comparison, of 3 ways of getting the first fibonnaci number consisting of 1000 digits. This attracted a lot of attention from Rubists, Haskaloonies and Clojurians alike, here I'll share their contributions.



Preface

Microbenchmarking sucks, especially on the JVM. HotSpot is doing such a great job of optimizing code, that sometimes you end up getting an incorrect idea of how your code performs. Nevertheless I got, many tips for improving performance, so it seems the Internet Performance Rule of Thumb is: Always benchmark the Fibs!


Haskell

Team Haskell weren't pleased completing the job in 7 msecs, so they suggested that I recompile the program using -O2. I used GHC v. 6.10:

limit = 10^999
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
main = print . length . takeWhile (< limit) $ fibs

(thanks to: "laulau")

Real time, average of 20: 7 msecs.
Nothing gained, but 7 msecs is blazing.


Ruby

Ruby had big room for improvement as the very clever Matrix approach didn't yield very satisfying performance. Here's one commentators approach

limit = 10**999
def fib2(a,b)
 return b,b+a
end
a,b = fib2(0,1)
while b < limit
 a,b = fib2(a,b)
end
puts b

(thanks to: Aaron)

Average on 20 runs: 35 msecs.

Huge improvement and honor is restored to the Ruby camp!


Scala

Camp Scala was good enough to contribute a solution as well:

object Main {
  def main(args: Array[String]) {
    def f(a: BigInt, b: BigInt) = (b, b + a)
    val max = BigInt(10).pow(999)
    var (a, b) = f(0, 1)
    while(b < max) {
      val temp = f(a, b)
      a = temp._1
      b = temp._2
    }
    println("\nb = " + b)
  }
}

(thanks to: Rahul G ,*updated*)

Average: 18 ms.


Clojure

Clojures loops are recursive and side-effect free, but even still it looks a little clunky:

(time (let [limit (.pow (BigInteger/TEN) 999)]
        (loop [a 0 b 1 i 1]
          (if (< b limit)
            (recur b (+ a b) (inc i))
            i))))

(thanks to: Duc)

Nevertheless, it averages at: 5 msecs. Hooray! It's a new record!


Conclusion

If nothing else I think its fun to see these challenges approached from different languages, which is also what I like about Project Euler. And I'm certain that this will not be the last time I bring Haskell along for the ride. Last nights post were meant as small reflections on 3 very different languages, so I was a little surprised of the attention that it got - But its positive, I think its great when we all can learn a little from one another.

I'm open for ideas on another arena for comparisons. We could do something crazy like solve Global Warming.. oh wait, I already did that. How about Euler 200 then?

Don Stewart
2010-02-24 22:29:05
Thanks for clarifying which tools and compiler flags you're using, however, there are some very serious flaws remaining.

Just to double check: you are still measuring different algorithms in each language, and also measuring different aspects of the total runtime.

To repeat: the programs you present aren't implementing the same functions. In particular, the Clojure version is using a loop, while the Haskell version is using memoization. 

Here, for example, is the iterative version you use in the Clojure example, translated to Haskell:

limit = 10^999

loop a b i
     | b &lt; limit = loop b (a+b) (i+1)
     | otherwise = i

main = print (loop 0 1 1) 

This is a different algorithm, so comparing it against say, a memozing version *is meaningless*.

Secondly, what are you trying to measure? Total runtime? Runtime of just the inner loop excluding startup time?

The Clojure version doesn&#039;t seem to be doing IO, and you&#039;re not including its startup time, while, e.g. the Haskell version includes startup time *and* IO!

Unless they&#039;re doing *the same algorithm* and the *same total runtime*  -- the comparisons on this post will continue to be meaningless!

Could you please clarify what exactly you&#039;re trying to measure? I would suggest you state precisely:

 * iterative or memozing fibonacci (specify the required algorithm)
 * total runtime, as measured by one external tool
Joel Mueller
2010-02-24 22:38:00
I haven't run the Scala code, but the page where that code snippet is posted seems to be saying the answer to Euler 25 is 3321, which is incorrect.

http://ideone.com/Tt4Yw0Ys

I presume you ignored the F# one-liner in your follow-up because you don't have Mono installed on your Linux box, so you couldn't compare performance on the same machine?
Don Stewart
2010-02-24 22:39:33
Just to follow up, using a statistically robust mechanism based on kernel density functions (see the Criterion package for robust benchmarking), the iterative Haskell solution measures out at at mean execution time of 0.819 ms, when we exclude all startup costs (as you do in the Clojure example), with a standard deviation over 20 runs of 39.29665 us.

This just illustrates how silly it is to time different algorithms, in different ways.
Rahul G
2010-02-24 22:46:53
Hi Lau,

I am extremely sorry for the incorrect code (The Scala version). Here's a corrected version (translation of Aaron's Ruby version actually) : http://ideone.com/pvNdSwq3

Please update your blog post accordingly. Thanks.
Ochronus
2010-02-24 23:12:06
Have you considered a solution like this in ruby?

limit = 10**999
i = 0
n, m = 0, 1
while n &lt; limit do
    n, m = m, n + m
    i += 1
end
puts i
domor
2010-02-25 00:51:26
Dons: I get similar results to you for the looping version, but unlike what I would have thought the list one is much faster. Any idea why? I think I used Criterion correctly...

http://codepad.org/vO0sCnLF

(Sorry for the tangent - I don't think fibs is really worth benchmarking, but I'm interested in the work GHC does on the code.)
Don Stewart
2010-02-25 00:57:36
@domor: criterion is likely caching that fibs list, since it's a CAF. so you're not reevaluating it each time.
Peter
2010-02-25 04:07:32
Here's a Scala version based on the first algorithm:

val limit = BigInt(10) pow 999
lazy val fib: Stream[BigInt] = 0 #:: 1 #:: fib.zip(fib.tail).map { case(x,y) =&gt; x + y }
fib takeWhile (_ &lt; limit) length
Peter
2010-02-25 04:11:19
Regarding the scala version I just posted. Stream doesn't have zipWith at the moment so I used zip and map. If zipWith was added fib would be like:

lazy val fib: Stream[BigInt] = 0 #:: 1 #:: fib.zipWith(fib.tail) { _ + _ }
Peter
2010-02-25 04:43:17
I certainly agree with what Don has said above about comparing the same algorithm and the same runtime. As Scala and Clojure are both running in the JVM I just tried a comparison between both algorithms. I've written what should be basically the same method for recording time. Averaging over 10,000 iterations, and including a warm up of 10,000 as well, as suggested by Dhananjay in the previous post.

The code is here:
http://gist.github.com/314194
http://gist.github.com/314192

On this machine I get:

Clojure lazy-cat:  5.1545
Clojure loop: 1.7102

Scala memoized stream: 2.523
Scala while loop: 1.4752
Peter
2010-02-25 05:07:54
And just in case... the above Scala is Scala 2.8. If anyone needs 2.7 then Stream.cons needs to be used rather than #:: in which case fib is:

lazy val fib: Stream[BigInt] = Stream.cons(0, Stream.cons(1, fib.zip(fib.tail).map { case(x,y) =&gt; x + y }))

Cheers
Lau
2010-02-25 09:05:32
Hi all,

Time has been a little short lately, so I apologize for the slowness of replies both past and future :)

@Rahul: No need to apologize, in fact its me who is guilty for not double checking your code, something I usually do before posting on this blog. I'll pay closer attention next time.

@Don: You point out serious flaws and I agree. This benchmark, as well as every other microbenchmark I've seen to date, can give you a little indicator of the performance of a language and the algorithm chosen, but no more than that. For all JVM based languages HotSpot needs to be controlled in order to see whats actually going on with the code and to ensure that takes a lot more work than a fun little exercise like this justifies. The numbers were supposed to be shadows, nothing more nothing less. For this post, different algorithms are used in both posts, for the reason of showing off various styles and approaches to the problem. Those who have supplied code with a built in timer for the actual work, will see those timings. JVM based languages aren't penalized for the JVM startup/shutdown time. Its true that the Haskell timing also includes I/O, but when we're talking about unscientific timings which are below 10 milliseconds, it would be (for me at least) a waste of time to improve on the purity of the results. 

@Rest:: Real benchmarking is hard work and takes some thorough preparations - To do so in a several different languages is not trivial and therefor beyond the scope of this post. Take what you see as indicators of performance as well as a demonstration of how these 4 different languages can be used to approach a single problem from different angles.
Pollux
2010-02-25 19:16:19
-Ruby

This is more or less the same, although it seems to be a bit faster than the current implementation, might be worth checking out.

limit = 10**999
tmp = [0,1]
while tmp.last &lt; limit 
 tmp &lt;&lt; tmp.shift+tmp.first
end
p tmp.last
danlei
2010-02-25 22:54:54
Here is an (almost 1:1) CL adaption of your Clojure code:

(defun euler25 ()
  (labels ((rec (a b i)
                (if (&lt; b #.(expt 10 999))
                    (rec b (+ a b) (1+ i))
                  i)))
    (rec 0 1 1)))

DHL&gt; (time (euler25))
(EULER25) took 16 milliseconds (0.016 seconds) to run 
                    with 2 available CPU cores.
During that period, 31 milliseconds (0.031 seconds) were spent in user mode
                    0 milliseconds (0.000 seconds) were spent in system mode
 1,048,848 bytes of memory allocated.

On my machine this took 29ms in average on CCL/win32, the clojure version 23.65ms.

Then, I gave SBCL a try, which blew both out of the water:

  seconds  |     consed    | calls |  sec/call  |  name  
-----------------------------------------------------------
     8.836 | 1,048,826,608 | 1,000 |   0.008836 | EULER25
-----------------------------------------------------------
     8.836 | 1,048,826,608 | 1,000 |            | Total

Sorry, but my last post got messed up by an angle bracket.
Lau
2010-02-26 08:48:08
Danlei - Does it change your results of you move the calculation of limit, #.(expt 10 999) outside of the labels block?
danlei
2010-02-26 13:21:13
It doesn't make much of a difference:


(defun euler25 ()
  (let ((limit (expt 10 999)))
    (labels ((rec (a b i)
                  (if (&lt; b limit)
                      (rec b (+ a b) (1+ i))
                    i)))
      (rec 0 1 1))))


Is a wee bit slower:


  seconds  |     consed    | calls |  sec/call  |  name  
-----------------------------------------------------------
     9.228 | 1,048,835,080 | 1,000 |   0.009228 | EULER25
-----------------------------------------------------------
     9.228 | 1,048,835,080 | 1,000 |            | Total

estimated total profiling overhead: 0.01 seconds
overhead estimation parameters:
  2.18e-7s/call, 1.022e-5s total profiling, 5.8200003e-6s internal profiling


I would have used the aequivalent of #. in the Clojure versiont, too (namely #=), but it wouldn't work with .pow.
danlei
2010-02-26 13:25:56
The interesting thing about the Clojure version is: It DOES get some really good timings (downto about 8ms), but in between, there often are calls around 50ms, sometimes even up to ~400ms.

Maybe that is due to HotSpot's dynamic optimization, but I really don't know anything about that.
Yahya
2010-02-28 00:52:05
Even tho "Microbenchmarking sucks", I'd like to see how Python performs at this task, using, e.g. one of the algorithms at:
    http://en.literateprograms.org/Fibonacci_numbers_%28Python%29

Anyone?
Meikel
2010-02-28 16:59:14
We are already at Euler 215. ;) http://clojure-euler.wikispaces.com/Problem+215
Mork
2010-03-11 14:22:59
&gt; This benchmark, as well as every other microbenchmark I’ve seen to date, can give you a little indicator of the performance of a language and the algorithm chosen, but no more than that. 

Not even that: If you vary algorithms and languages at the same time, you cannot deduce much meaningful information from the run times. As dons pointed out, the measurements are even less useful since you also vary the amount of processing you measure (input/output, startup time) at the same time as well, and in this case (microbenchmarks) input/output and startup time can be significant parts of the benchmark run time.

The microbenchmark might be a useful performance indicator if you used the same code with the same compiler and varied one flag at a time; perhaps even if you used the same code and different compilers, though care needs to be taken to keep compiler settings similar in some meaningful way.

The microbenchmark might be also a useful performance indicator if you compared two implementations using the same compiler and swapped only the underlying data structure.

But by varying just about everything from program to program, it becomes pointless to even tell us the run times; as presented here, they merely spread misinformation.
After a few posts like this one starts to feel that these numbers have only one purpose: To give the article's arbitrarily chosen ranking of languages by performance an air of authority.

For example: In both this post and the previous post the Clojure implementations just so happen to be the ones using an efficient approach to solving the problem, while the other languages typically implement Fibonacci in clearly slower ways (sometimes absurdly so).

I'm confident that you could have come up with a Clojure program that is slower than the three programs presented here -- but you chose not to. Instead, you end the article with the statement:

&gt; 5 msecs. Hooray! It’s a new record!

Yes, it is a record: of all the unrelated runtimes you gave, this is the lowest.