Clojure vs Ruby & Scala - Transient Newsgroups

2009-12-13 15:30:27

Recently I had the pleasure of reading a blog post which demonstrated a fun exercise in both Ruby and Scala, namely scraping newsgroups. I had a look at both solutions and decided to roll one in Clojure as well, examining the differences between the famous Ruby, the juggernaut Scala and the elegant Clojure.


Preface

If you have not read the blog post I linked above, I recommend that you do so to get the full story. The short version goes like this: Mr. Cox was asked, as part of a job interview, to write a newsgroup scraper in any language, and he rolled two very similar versions, one in Ruby and one in Scala. Both programs do the same thing: walk through all the files in a directory containing 20 downloaded newsgroups, strip out the individual words using a regex and count the number of occurrences. Finally the result is dumped into two files, the first sorted alphabetically, the second sorted by the number of occurrences of each word.


Ruby

For me it was interesting to see Ruby contrasted so directly with Scala, because there are more similarities than I thought. Ruby has often been lauded and applauded because it is said to enable the developer to produce software very quickly - I must admit that from this example I fail to see how, but that could be my fault.

Ruby solution

#Change rootDir to the location of your newsgroup files
rootDir = "/home/zcox/dev/20_newsgroups"
raise rootDir + " does not exist" unless File.directory? rootDir

#Iterates over all files under rootDir, opens each one and passes it to the block
def files(rootDir)
  Dir.foreach(rootDir) do |dir|
    if dir != "." && dir != ".."
      puts "Processing " + dir
      Dir.foreach(rootDir + "/" + dir) do |file|
        if file != "." && file != ".."
          open(rootDir + "/" + dir + "/" + file) do |f|
            yield(f)
          end
        end
      end
    end
  end
end

t1 = Time.now
counts = Hash.new(0) #0 will be the default value for non-existent keys
files(rootDir) do |file|
  file.read.scan(/\w+/) { |word| counts[word.downcase] += 1 }
end

puts "Writing counts in decreasing order"
open("counts-descreasing-ruby", "w") do |out|
  counts.sort { |a, b| b[1] <=> a[1] }.each { |pair| out << "#{pair[0]}\t#{pair[1]}\n" }
end

puts "Writing counts in alphabetical order"
open("counts-alphabetical-ruby", "w") do |out|
  counts.sort { |a, b| a[0] <=> b[0] }.each { |pair| out << "#{pair[0]}\t#{pair[1]}\n" }
end

t2 = Time.now
puts "Finished in " + (t2 - t1).to_s + " seconds"

There are quite a few things which catch my interest when reading that example. Firstly, it looks a lot like everything else you see these days in terms of structure and syntax, but it mixes the long vertical code of most imperative languages with, in a few places, the dense horizontal code which really packs a punch. I'm thinking primarily of the lines which output the result files; there's a lot of functionality stuffed into a very short space.

Also, when reading it, I kind of regret my earlier criticisms of Python's indentation rules, because even though they aren't optimal, they beat the explicit block terminators which Ruby has. I can't help but wonder how you feel as a developer when you're writing out lines 14 - 19 in the above Ruby code. I think I'd feel a little tied up.

One of the interesting things about the code is this:

counts.sort { |a, b| b[1] <=> a[1] }.each { |pair| out << "#{pair[0]}\t#{pair[1]}\n" }

counts is a hash of the already accumulated statistics, which is sorted on the fly according to the integer value and output as pairs. It might have deserved a line or two more, but it's pretty cool. Reminds me a little of Groovy.

Lastly, the actual assignment of scores to each word is done here:

  file.read.scan(/\w+/) { |word| counts[word.downcase] += 1 }

Again there's the clever use of code blocks, where you iterate over some data, giving each item a name |word| and working with it. Compared to a solution in, say, Java, there's a lot of ceremony being avoided here.

All in all, I hope some commentators will help enlighten me as to why there is so much hype concerning Ruby - to me it looks a little like your regular imperative language, with a couple of interesting features tossed into the mix.

Scala

I've previously compared Scala with Clojure, especially its concurrency capabilities, and unfortunately Scala came out looking pretty bad due to a bug in their code-base, but this time Scala compares very nicely to Ruby.

Firstly, I wouldn't call this typical Scala because there are no classes or methods being defined; it's as stripped of boilerplate as is possible with Scala. Normally Scala comes across as very verbose and this is no exception, as it's almost 40% longer than the Ruby version - which I guess is a trade-off, because it does yield significantly better performance.

Scala solution:

//Change rootDir to the location of your newsgroup files
import java.io._
val rootDir = new File("/home/lau/Desktop/20_newsgroups")
if (!rootDir.exists) throw new IllegalArgumentException(rootDir + " does not exist")

/** Iterates over all files under rootDir, opens each one and passes it to the function */
def files(rootDir: File)(process: File => Unit) {
  for (dir <- rootDir.listFiles; if dir.isDirectory) {
    println("Processing " + dir)
    for (file <- dir.listFiles; if file.isFile) {
      process(file)
    }
  }
}

val t1 = System.currentTimeMillis
var counts = Map.empty[String, Int].withDefaultValue(0)
files(rootDir) { file =>
  file.split("""\W+""").foreach { word => counts += (word.toLowerCase -> (counts(word.toLowerCase) + 1)) }
}

println("Writing counts in decreasing order")
write(counts, "counts-descreasing-scala") {_._2 > _._2}

println("Writing counts in alphabetical order")
write(counts, "counts-alphabetical-scala") {_._1 < _._1}

val t2 = System.currentTimeMillis
println("Finished in " + ((t2 - t1)/1000.0) + " seconds");

/** Writes the specified map to the specified file in tab-delimited format, sorted accordingly. */
def write[K, V](map: Map[K, V], file: String)(sort: (Tuple2[K, V], Tuple2[K, V]) => Boolean) {
  using (new PrintWriter(new FileWriter(file))) { out =>
    map.toList.sort(sort).foreach { pair => out.println(pair._1 + "\t" + pair._2) }
  }
}

/** Converts a File to a String. */
implicit def file2String(file: File): String = {
  val builder = new StringBuilder
  using (new BufferedReader(new FileReader(file))) { reader =>
    var line = reader.readLine
    while (line != null) {
      builder.append(line).append('\n')
      line = reader.readLine
    }
  }
  builder.toString
}

/** Performs some operation on the specified closeable object and then ensures it gets closed. */
def using[Closeable <: {def close(): Unit}, B](closeable: Closeable)(getB: Closeable => B): B =
  try {
    getB(closeable)
  } finally {
    closeable.close()
  }

Now this does the same thing as the Ruby version, so why the added code? Primarily it's file2String and 'using', which aren't (?) in Scala's core library, that are taking up space, so it's not a big deal. Scala actually mimics Ruby quite closely in that sorting and assignment of scores are done in almost exactly the same way. As a Scala developer you spend much of your time writing out variable types, which must be tedious, but the pay-off is quite a bit of speed. It's only in the static typing that I can find an explanation for why Scala so fiercely outperforms Ruby.

But of course, the most interesting solution is saved for last :)

Clojure

Solving this problem in Clojure was extremely simple, so I decided to tweak performance a little bit. Firstly we need a function to which I can pass a list of words and in return get their distribution, or frequency of occurrence.

In its simplest form, you could do this:

(reduce #(assoc %1 %2 (inc (get %1 %2 0))) {}  ["you" "are" "you" "and" "I" "am" "me"])
{"me" 1, "am" 1, "I" 1, "and" 1, "are" 1, "you" 2}

That's really simple: the first argument is a hash-map, to which I assoc the string as the key and, as the value, either (inc 0) (the default if get finds nothing) or (inc old-value) if there already is a value. As you can see it returns 1 for all words except 'you'. But there's a problem: it's case sensitive. So we can formally define the case-insensitive version like so:

(defn distribution [items]
  (reduce #(let [lowcase (.toLowerCase %2)]
             (assoc %1 lowcase (inc (get %1 lowcase 0))))
          {} items))

And that's really doing the exact same thing, but as my good friend Mr. Grand noticed, assoc always returns a new hash-map instead of mutating one, as Clojure defaults to immutability. Because the computation is contained locally in this function, I don't have a problem using transients (mutable data): we're still functional, with no visible side-effects:

(defn distribution [items]
  (persistent!
   (reduce #(let [lowcase (.toLowerCase #^String %2)]
              (assoc! %1 lowcase (inc (%1 lowcase 0))))
           (transient {}) items)))

In this case, transients buy us something between 10% and 15% in performance, which is nice, and the code still looks the same.
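
If you want to sanity-check that number yourself, here is roughly how I'd measure it at the REPL. This is only a sketch: sample.txt is a hypothetical newsgroup file, and distribution-plain stands for the earlier, non-transient definition renamed so the two can coexist:

(def words (vec (re-seq #"\w+" (slurp "sample.txt")))) ; sample.txt is a stand-in file

(time (dotimes [_ 100] (distribution-plain words)))    ; persistent maps
(time (dotimes [_ 100] (distribution words)))          ; transient maps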

So now we have the core functionality implemented, all we need to do is produce the distribution hash-map for every file in the newsgroups:

(pmap #(distribution (re-seq #"\w+" (slurp* %)))
      (filter #(.isFile #^File %)
              (file-seq (java.io.File. dir))))

Reading that from the tail end: I call file-seq on a directory, giving me a list of its entire contents, directories and files. I then keep only those items which return true for the .isFile test. The result, which is a list of java.io.File objects, is passed to slurp*, which emits the content as a string, and that string is run through the regex \w+, which gives me a list of all the words. Since the computation of distribution is the heavy lifter, I've passed all of that to a pmap, meaning that I activate both cores on my system, letting me work on multiple files at once. The result of this entire operation is as many hash-maps as there are files, all of them structured like

{"you" 5 "the" 22 "one" 11}

...etc. The next step is then to join all of these hash-maps together into a single map containing the accumulated counts for all words:

(reduce #(merge-with + %1 %2) ....)
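
To see what that does, here is merge-with on two small per-file maps (a REPL sketch with made-up numbers; key order may differ):

(merge-with + {"you" 5 "the" 22} {"you" 3 "one" 11})
{"you" 8, "the" 22, "one" 11}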

In the reduce above, '....' stands for the pmap expression shown earlier. So that's a parallelized newsgroup parser which returns one giant hash-map. All that's left to do is sort it and emit it to the files, so I'll paste the entire thing:

(ns user
  (:use [clojure.contrib duck-streams str-utils])
  (:import [java.io File]))

(defn distribution [items]
  (persistent!
   (reduce #(let [lowcase (.toLowerCase #^String %2)]
              (assoc! %1 lowcase (inc (%1 lowcase 0))))
           (transient {}) items)))

(defn process-newsgroups [dir]
  (let [format #(str-join \newline (map (fn [_] (str-join \tab _)) %))
        data   (reduce #(merge-with + %1 %2)
                       (pmap #(distribution (re-seq #"\w+" (slurp* %)))
                             (filter #(.isFile #^File %)
                                     (file-seq (java.io.File. dir)))))]
    (spit "/home/lau/Desktop/alphabetical" (format (sort data)))
    (spit "/home/lau/Desktop/decending"    (format (sort-by val > data)))))

(time (process-newsgroups "/home/lau/Desktop/20_newsgroups/"))

As you can see, sorting is a breeze. Alphabetically you just call sort, as it works on the keys. To sort on the values I use sort-by, explicitly asking for the value and passing > as the comparator. The result of sorting is a list of vectors, like (["one" 1] ["two" 2]), which is passed to format. format is also in Clojure core, but I've shadowed it with an anonymous function that interposes a tab between items and a newline between vectors. The result is then passed to spit, which writes the actual file.
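
For example, here is what sort and sort-by do on a small map (a REPL sketch):

(sort {"the" 22, "one" 11, "you" 5})
(["one" 11] ["the" 22] ["you" 5])

(sort-by val > {"the" 22, "one" 11, "you" 5})
(["the" 22] ["one" 11] ["you" 5])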

(note: spit and slurp* are from contrib.duck-streams and str-join is from str-utils)

Conclusion

So what's the fun of comparing if we don't round it all up:


                 Ruby   Scala   Clojure
LOC               36     54      19
Runtime (secs)    88     39      25


So it seems that Ruby allows you to write somewhat compact code, but you have to suffer a great performance hit. Scala gives you a somewhat bloated code-size, but in return the runtime drops by more than half. With Clojure you get a minimal code-size, parallelized computations and blazing performance: roughly 72% faster than Ruby and 36% faster than Scala, measured as the reduction in runtime ((88 - 25)/88 and (39 - 25)/39).

Now as always, remember that a benchmark only gives you an indication of the truth; results will vary a little. I ran all 3 scripts from the command line with no other programs being used at the same time, but still.

Also, I don't want anybody to get the wrong impression: this is not your typical x VS y post, of which I have done several. Mr. Cox didn't write out those examples to demonstrate the awesome power of either language, so this cannot be considered anything other than a superficial look at the code - he solved a fun problem and I was happy to play along. Nothing I've said should be taken as a criticism of his code or of his blog post, which I quite enjoyed :)

Update: Danny Lagrouw of Scala fame wanted to represent the Scala colors in a few more flavors, so he has taken the above code and improved upon it. He made two versions, one using foldLeft and the other using foreach, each targeting both the 2.7 and 2.8 Scala compilers. The foldLeft versions failed to complete on my system, so the benchmarks are for the mutable foreach versions (which are faster). Please see his contributions: here. The results are:

Update(2): More and more contributions are coming in - I'll benchmark and add them as I get them.


               LOC    Runtime (secs)
Ruby            36     88
Scala (orig)    54     39
Clojure         19     25
Scala (2.7)     38     45
Scala (2.8)     38     54
Python          32     13
Java            61     21


Big thanks to all for contributing.


Justin Kramer
2009-12-13 21:11:14
Nice post. FYI, clojure.contrib.seq-utils has a frequencies function which does the same thing as your distribution, minus lowercasing. This would be equivalent:

    (frequencies (map #(.toLowerCase %) items))

It doesn't use transients, however (maybe someone should submit a patch).
ajay
2009-12-13 22:16:11
For one thing, I don't think it is a fair comparison to count the LOC of Clojure code without putting the closing parenthesis ")" on separate lines like in Scala / Ruby. Only then would the LOC comparison be valid. Nevertheless, even after doing so, Clojure code is smaller by 15-odd lines I guess, so it's still good.

More … the Scala code does some checking to see if the file exists or not. Your Clojure code does not do that. That will bump up the LOC by 4 - 5 lines I guess
Lau
2009-12-13 22:39:33
@ajay: Thanks for stopping by.

Regarding the LOC then the only fair way to go, is present the idiomatic approach of both languages, which will be the easiest read for the users. For Lisp thats always having all closing parens after the last expression. For Ruby it's different.

In regards to checking if the directory exists, it's a matter of adding (when-let [dir (file-seq ...)] ...) :)

But don't read this as the authoritative comparison between Ruby and Clojure, it's just a fun little exercise.
Michael
2009-12-14 00:08:54
It's been a while, but doesn't Ruby have some equivalent to Perl's File::Find module? That should take care of that ugly files routine and be more predictable in case symlinks are in there.
Not using some standard library in an interview question usually doesn't bode well.
Brenton Ashworth
2009-12-14 22:56:22
Great post!

I did a Clojure version before reading this. Yours is better.

There is one thing I would like to point out. I have run each version on my machine multiple times and the Scala code is always the fastest. Ruby takes 1m50s, your Clojure version takes 19s, my Clojure version takes 25s and the Scala version always takes between 11s and 15s.

I think it took longer the first time I ran it (compilation).

After my testing, I would definitely not conclude that Clojure is faster than Scala.

I prefer Clojure to Scala but I was impressed by Scala's performance here. Also, the code is not so bad if you remove lines 37 to 57 which should be in a library anyway.
Nathan Youngman
2009-12-15 04:39:31
@Michael, Ruby does have a Dir.glob() function which could really clean up that example.

@Lau, I think you answered your own question about the Ruby "hype". While being a regular imperative language may not be the best, it does make it approachable for the regular imperative programmer. And there are a "couple of interesting features".

If you asked Matz (language designer), Ruby's killer feature is internal DSLs ... macros that read like English. It's also nice to hear that the core team is thinking of functional features, even if it's just in terms of lazy evaluation and not a radical shift to immutable data structures:
http://rubyconf2009.confreaks.com/19-nov-2009-09-00-keynote-yukuhiro-matz-matsumoto.html
Lau
2009-12-15 10:36:43
@Brenton: Glad you liked it.

Regarding your benchmark, I'll make the guess that you're on a single core machine. See my Clojure version does almost the exact same thing as the Scala version, but with one major difference: It parallelizes work on all available cores. So on a single core system, Scala will be faster. For every core you add, Clojure should scale and become faster and faster. This is consistent with my observations when benchmarking, Scala only uses about 80% of 1 core.

Generally I wouldn't make the case that Clojure is faster than Scala, because if you just look at the compiler alone I think Scala outperforms Clojure. 

@Nathan: I'll have a look at that video - Macros are great, but I don't know if I'd suffer a ~450% performance hit to get them :)
Daniel Sobral
2009-12-15 13:47:31
Lines 39 to 49 in the Scala example could be replaced with:

implicit def file2String(file: File): String = io.Source.fromFile(file).mkString

Unfortunately, that wouldn't close the resource. Instead, 

implicit def file2String(file: File): String = using (io.Source.fromFile(file)) { source => source.mkString }

ought work, but only on Scala 2.8. 

Also, 

def using[Closeable <: {def close(): Unit}, B](closeable: Closeable)(getB: Closeable => B): B =

could be written

def using[Closeable <: java.io.Closeable, B](closeable: Closeable)(getB: Closeable => B): B =

But that wouldn't work with the simplification above.

As for that being unlike Scala, it is unlike large Scala programs, but it is exactly like smallish Scala scripts.

I found it very curious that the foldLeft solution wouldn't finish. I can only assume there's some thrashing going on with the map structure. Not something I'd have expected.

Regarding parallelism and cores, the 80% figure you gave is the strangest I have seen. I'm used to seeing all cores fully busy -- or one core, if whatever I'm running is not parallelized.

Personally, for this problem, I'd have mapped the files with Futures, and then merged them, pretty much as you did in the Clojure version, except that instead of using pmap in place of map, one uses map and wraps the result in Futures.future. Sadly, there's no map-merge in Scala.

I posted such a version to http://paste.pocoo.org/show/157157/.  This is as functional as a program doing I/O can be. Out of curiosity, I made a non-functional one as well: http://paste.pocoo.org/show/157160/. Runs in about the same time, but I'm guessing my sample is too small.
Daniel Sobral
2009-12-15 13:50:11
MMmmm. The last post was mangled, because of "<:". Anyway, I meant to say that { def close(): Unit } can be replaced by java.io.Closeable.
Sam Aaron
2009-12-15 17:53:28
The Ruby version you cite is pretty poor. There's a lot of accidental complexity boiler plated in (which looks like it has some php heritage).

A nicer version is here: http://gist.github.com/257079

This has been optimised for readability and to fit in with Ruby's normal conventions.

I really don't like the LOC measure as a good indication of the lack of accidental complexity as compressed code can be complicated to decompress in your brain.

However, if you're going for LOC, then let me suggest this Ioke version. It's shorter and (in my opinion) far more easily digestible than the Clojure version, whilst also including its own implementation of (spit) (as of the time of writing, Ioke doesn't have an equivalent).

counts = method(
  files_to_read = FileSystem["**"] reject(name, FileSystem directory?(name))
  combined_text = files_to_read inject("", text, filename, text += FileSystem readFully(filename))
  words_filter  = method(text, #/\w+/ allMatches(combined_text) map(lower))

  words_filter(combined_text) inject({} withDefault(0), counts, word, counts[word] += 1. counts)
)

spit = method(outfile, enumerable,
  FileSystem withOpenFile(outfile, fn(file, enumerable each(x, file println(x)))))

spit("alphabetical_output_ioke.txt", counts sort)
spit("count_output_ioke.txt", counts sortBy(value))

(http://gist.github.com/257080)
Brenton Ashworth
2009-12-15 18:31:57
@Lau: I have two cores. Here are the numbers

 -- time clj wordcounts_lau.clj
real	0m19.758s
user	0m32.160s
sys	0m1.835s

if I change pmap to map in your code I get
real	0m20.157s
user	0m24.256s
sys	0m1.333s

-- time scala wordcounts.scala
real	0m15.593s
user	0m12.615s
sys	0m1.026s

I am using Scala 2.7.3 on a Mac. 

It is also interesting that the parallel version uses 32% more CPU to get a 2% speed up.
Shot
2009-12-15 19:33:46
I’d seriously argue the following is more rubyish – with the idiomatic.rb version being the more, well, idiomatic one, while the efficient.rb version being more efficient by not having to flatten the per-file word arrays: http://gist.github.com/257164
Jon Harrop
2009-12-22 05:19:21
You might like my F# solution:

http://fsharpnews.blogspot.com/2009/12/zach-cox-word-count-challenge.html
Lau
2009-12-22 08:56:16
@John: Thanks for contributing.
anshul
2009-12-28 13:23:46
That is a very non-rubyish ruby code.   The way I would write it in real life would be something like the efficient.rb of @Shot above...
Michael
2009-12-28 22:45:05
Here are my performance results (1 Core, AMD Sempron, Ubuntu 8.04 LTS, Sun JSDK 1.6.0-12):

Clojure (your version, best performance with -server):  49.8s
Clojure (your version, pmap -> map, -server): 47.0s
Scala (recent 2.8 nightly build, -server): 32.9s
JRuby 1.4 (-server): 37.3s
Ruby 1.8.6: 78s
Lau
2009-12-28 22:56:04
@Michael: Thanks for sharing.

It's not surprising that the Clojure version runs poorly on a 1 core system, but I'm positively surprised about the JRuby result.
Tim
2009-12-29 07:38:52
Profiling on 1 core doesn't really contribute except for a nostalgic look back when Moore's Law still applied to CPU frequency ;)

Assuming Clojure's winning performance comes largely from parallelization, it would be nice to see how it fares with more cores and more CPUs. That is the primary reason for the resurgence of interest in FP languages by more than FP adherents.
Michael
2009-12-29 11:54:15
@Tim: Well, there are several people in the Clojure community that are touting that Clojure can be as fast as Java, e. g. http://www.pragprog.com/titles/shcloj/programming-clojure

"Clojure is fast. Wherever you need it, you can get the exact same performance that you could get from hand-written Java code."

But as more than one evidence shows, they are simply not telling the truth. Evidence might suggest that Clojure is slower than both Scala and maybe even JRuby(!).

Benchmarking on 1 Core is the only way to really compare code performance and not parallelization features (Clojure's are really cool). Parallelization of calculations might be interesting for number crunching big data, but not for web applications (which is what most of us are doing). But then if you have extremely high performance requirements (e. g. processing Terabytes of Large Hadron Collider data) and parallelization requirements than Clojure might also not be suited because of not optimal performance...
Lau
2009-12-29 11:57:13
Hi Michael,

Just to correct a few misunderstandings. Functional Programming has never been faster than imperative programming. Most Scala code is somewhat imperative. Functional programming does however give you substantially shorter programs, programs with fewer bugs and finally programs which parallelize easily. So comparing a functional program to an imperative one on a single core system is redundant. So the question is, how fast can you make Clojure go? Well if you check out WideFinder2 you'll see that it goes both faster than Java, Scala and single-threaded C.

/Lau
Morgan Conrad
2009-12-30 00:25:53
I posted a Java version at the Artima site, 

http://www.artima.com/forums/flat.jsp?forum=270&thread=277968

Fill in a proper file directory name and have at it.  Be interesting to see how it compares for speed.  It's around 50 LOC depending how you count...
Lau
2009-12-30 09:00:07
Hi Morgan,

Thanks for contributing, I'd be happy to benchmark your solution and put it on the site, but for the sake of comparison could you line up your solution with the others?

1. Output 2 files, one in descending order (tab separated), the other in alphabetical

2. Record System/nanoSeconds before you start walking the directory, and output how many seconds elapsed after both files are written to disk

Thanks
Sakuraba
2009-12-30 10:50:55
Great post. Now I know why neither of these languages has surpassed Java yet. Ruby is too slow, Scala is too hard and Clojure just makes the average developer say wtf when looking at it for the first time.

I want the instant-eval-REPL-Engine from Scala, the syntax from Groovy and the performance from Java. That would be the next big thing.
Lau
2009-12-30 10:57:21
@Sakuraba: You might be right, but this blog isn't about the 'next big thing', as we would then have to take into account the lowest common denominator. It's about the code which at first looks mystifying, but holds tremendous power. Btw regarding performance, did you notice Clojure now out-performs Java on WideFinder2? :)
Morgan Conrad
2009-12-30 20:46:55
Hi Lau,

I have changed the Java as you requested.  You can find the source code at

http://flyingspaniel.wikidot.com/local--files/misc/WordCounter.java
(or through links on it's parent page, http://flyingspaniel.wikidot.com/misc)

I blogged briefly about it at
http://flyingspaniel.blogspot.com/2009/12/answering-clojure-vs-ruby-scala.html
Warning - that code won't compile because I removed a bunch of the Java  things cause they confuse my HTML so much.

Thanks
Morgan Conrad
2009-12-30 20:48:11
That should say "java angle bracket" things.  The <Type> stuff...
John
2009-12-30 20:56:23
I implemented both a Python version and a Factor version, for comparison.

http://re-factor.blogspot.com/2009/12/counting-word-frequencies.html
Morgan Conrad
2009-12-30 21:51:37
I have updated the Java code to match the directory structure of the newsgroup files.  Basically, it looks two directories deep.  Code (now complete with all the angle bracket generic stuff) is posted at 

http://flyingspaniel.wikidot.com/local--files/misc/WordCounter.java
(or through links on it’s parent page, http://flyingspaniel.wikidot.com/misc)

and
http://flyingspaniel.blogspot.com/2009/12/answering-clojure-vs-ruby-scala.html

It seems to me that, for this problem, the speed of the language's I/O libraries is more important than the intrinsic speed of the languages themselves.
Lau
2009-12-30 23:03:09
@John: I've benchmarked your Python solution and congratz, you're now the fastest entry!
Tommy
2010-01-01 12:30:07
Does anyone have a readable Clojure version? I've programmed quite a lot of Common Lisp but no Clojure and the code posted is impenetrable for me. 

Looking at the code gives almost no hints on what it is supposed to do.

Please insert a few more bindings to introduce some names for the intermediate data.
Lau
2010-01-01 12:41:58
Hi Tommy,

I will try to help your understanding along. distribution walks through a seq of words and builds up a hash-map; for every word it meets, it looks that word up in the map. If it doesn't find a value it defaults to 0 and increments that; if it finds a value it increments it. So it works its way through a seq like this:
(distribution '("one" "two" "one" "three"))
1: (:one 1)  ; seeing one
2: (:one 1 :two 1)  ; seeing two
3: (:one 2 :two 1)  ; seeing one again, 3rd item
4: (:one 2 :two 1 :three 1)  ; seeing three
So that's how distribution counts the number of occurrences. Then in the main loop, a file-seq is generated which contains all directories and files under the argument. That is passed to a filter which strips out the directories, so the following functions only work on files. The pmap then works like a map, just parallelized: it slurps each file to get the content, runs it through a regex separating out the words, and passes those to distribution, doing what you saw above. Once pmap is done, the data is a series of hash-maps containing the count of occurrences in each file. To merge that into our master set, I call reduce, which accumulates a result by calling 'merge-with +' on the first hash-map, taking the result and calling it on the 2nd, taking that result and calling it on the 3rd, etc., ending up with one huge hash-map of all the words and their accumulated occurrences. This is then stored in 'data'.
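
If it helps, here is the main pipeline again with the intermediate results named. This is just a sketch: it assumes the same ns declaration as the full program, and word-counts is a name I'm inventing here:

(defn word-counts [dir]
  (let [everything (file-seq (java.io.File. dir))                   ; all dirs and files
        just-files (filter #(.isFile #^File %) everything)          ; keep only the files
        per-file   (pmap #(distribution (re-seq #"\w+" (slurp* %))) ; one count-map per file
                         just-files)]
    (reduce #(merge-with + %1 %2) per-file)))                       ; merge into one map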

Anything I skipped, which I shouldn't have?
Paul King
2010-01-01 14:20:38
Groovy equivalent of your Ruby example:

import static java.lang.System.currentTimeMillis as now
rootDir = new File("/home/zcox/dev/20_newsgroups")
assert rootDir.exists() && rootDir.isDirectory()

t1 = now()
counts = [:]
rootDir.eachFileRecurse{ f ->
  if (!f.isDirectory()) {
    f.text.eachMatch(/\w+/) { word ->
      def w = word.toLowerCase()
      def c = counts[w] ?: 0
      counts[w] = c + 1
    }
  }
}

println "Writing counts in decreasing order"
new File("counts-descreasing-groovy").withWriter { out ->
  counts.sort { a, b -> b.value <=> a.value }.each { k, v -> out << "$k\t$v\n" }
}

println "Writing counts in alphabetical order"
new File("counts-alphabetical-groovy").withWriter { out ->
  counts.sort { it.key }.each { k, v -> out << "$k\t$v\n" }
}

println "Finished in ${now() - t1} millis"
Paul King
2010-01-01 14:30:18
(Hopefully) nicer formatted version of Groovy equivalent to Ruby example:

import static java.lang.System.currentTimeMillis as now
rootDir = new File("/Temp/rootDir")
assert rootDir.exists() && rootDir.isDirectory()

t1 = now()
counts = [:]
rootDir.eachFileRecurse{ f ->
  if (!f.isDirectory()) {
    f.text.eachMatch(/\w+/) { word ->
      def w = word.toLowerCase()
      def c = counts[w] ?: 0
      counts[w] = c + 1
    }
  }
}

println "Writing counts in decreasing order"
new File(rootDir, "counts-descreasing-groovy").withWriter { out ->
  counts.sort { a, b -> b.value <=> a.value }.each { k, v -> out << "$k\t$v\n" }
}

println "Writing counts in alphabetical order"
new File(rootDir, "counts-alphabetical-groovy").withWriter { out ->
  counts.sort { it.key }.each { k, v -> out << "$k\t$v\n" }
}

println "Finished in ${now() - t1} millis"
Eric Bowman
2010-01-02 03:14:19
I struggled to get Daniel Sobral's Scala code to work with Scala 2.8 RC5, so I modified it to scale nicely across processors using an executor.  This works on 2.7.7; I didn't test it on 2.8.0.  I have to confess, using actors in this way makes me nervous.  Perhaps Akka would be a good choice here.  Anyhow creating a fixed executor isn't that big a deal.  Thanks to Daniel for the start.

Code is here:  http://paste.pocoo.org/show/161199/

On an 8 core machine, I setup scala with a 4GB heap, and compared the running time against the Clojure version, running on 1.1.0.

Scala version, 10 times:
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 5.588 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 4.76 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 4.083 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 4.245 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 7.843 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 4.782 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 4.026 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 4.263 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 9.326 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 5.191 seconds

It seems to fairly predictably hiccup when it needs to garbage collect, and this version is definitely fairly memory intensive.

On the same machine, the clojure version (running with no special command line options) clocks in at:

user=> (time (process-newsgroups "/home/ebowman/newsgroups/full/20_newsgroups"))
"Elapsed time: 10177.707843 msecs"
nil
user=> (time (process-newsgroups "/home/ebowman/newsgroups/full/20_newsgroups"))
"Elapsed time: 7819.693648 msecs"
nil
user=> (time (process-newsgroups "/home/ebowman/newsgroups/full/20_newsgroups"))
"Elapsed time: 8400.974754 msecs"
nil
user=> (time (process-newsgroups "/home/ebowman/newsgroups/full/20_newsgroups"))
"Elapsed time: 7718.277628 msecs"
nil
user=> (time (process-newsgroups "/home/ebowman/newsgroups/full/20_newsgroups"))
"Elapsed time: 7799.463482 msecs"
nil
user=> (time (process-newsgroups "/home/ebowman/newsgroups/full/20_newsgroups"))
"Elapsed time: 7547.845221 msecs"
nil
user=> (time (process-newsgroups "/home/ebowman/newsgroups/full/20_newsgroups"))
"Elapsed time: 7596.377432 msecs"
nil
user=> (time (process-newsgroups "/home/ebowman/newsgroups/full/20_newsgroups"))
"Elapsed time: 7712.731791 msecs"
nil
user=> (time (process-newsgroups "/home/ebowman/newsgroups/full/20_newsgroups"))
"Elapsed time: 7644.399926 msecs"
nil
user=> (time (process-newsgroups "/home/ebowman/newsgroups/full/20_newsgroups"))
"Elapsed time: 7860.660474 msecs"
nil

On the whole, quite impressive how little code is required.  But I'm happy to see that a statically-typed language which uses cores carefully does go a bit faster.
Eric Bowman
2010-01-02 03:34:12
Just for fun, I also wrote a version using a mutable map, and predictable the performance is slightly better.

Code is here: http://paste.pocoo.org/show/161204/

Results on my 8-core machine:

Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 5.732 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 4.422 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 4.004 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 4.199 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 5.327 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 3.943 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 3.921 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 3.898 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 5.26 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 3.929 seconds
Eric Bowman
2010-01-02 03:37:05
As one last little exercise before calling it a night, I was curious to see what is the performance of the mutable map version, using a thread pool of 1 thread instead of 8.  Surprisingly not that much worse than 8 threads!  So probably there is still quite a bit of room for more parallelism here:

Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 8.985 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 7.272 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 7.184 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 7.09 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 8.664 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 7.053 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 7.096 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 7.073 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 8.502 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 7.028 seconds
Daniel Sobral
2010-01-02 20:12:48
Eric, I wonder what problem you had running my code. You can't run it on REPL, because of initialization issues if you encapsulate it on an object, and you can't run it within an object extending Application, because of the same problem. It runs fine as a script here, however. You might want to catch&print exceptions, particularly on file2string, as these will definitely cause problems -- the actor shutdowns, and the join never succeeds.
Daniel Sobral
2010-01-02 20:17:05
Ah, on a side note, the mutable version I did for comparison is broken. This line doesn't work:

word => counts(word) = counts.getOrElseUpdate(word, 0) + 1

See this blog for explanations why. :-)
Eric Bowman
2010-01-02 22:55:24
Hi Daniel,

I upgraded to RC6, and the imperative version finishes now, but as you mentioned there is a bug.

The functional version hangs; see:

http://paste.pocoo.org/show/161458/

Seems like a Scala bug?
danlei
2010-01-03 08:30:20
I have done an implementation in Common Lisp, it's at http://gist.github.com/267404

My approach was not to get the job done in the least possible LOC, but to make it more or less comparable to the clojure version, reuseable and readable (I really don't like designators like x v l &c for example). The solution itself is lines 1-28, the rest implements stuff like slurp and merge-into, which I want to put in my utilities package once I have the time to do so. (slurp will be done differently though, since as I have done it here it won't work correctly with wide characters)

Performance on my machine and implementation was worse than Clojure's, but a little profiling showed that it spends about 40% with IO, so other CL implementations may be faster. Also, I experimented a bit with writing the files directly (without using slurp) and without using format, which gave a little speedup, but I'll leave it like it is and do performance tweaking when I put the merge-with and maybe slurp/spit stuff in my utilities package.

If someone wants to time it, make sure to declaim optimize accordingly and set *print-pretty* to nil, since this uses format for outputting, as I said above, then (time (output-directory-distribution "home:newsgroups;**;*.*" "home:output;")).
poker joe
2010-01-03 08:56:12
the scala version, provided by eric (which is using java.util.concurrent) http://paste.pocoo.org/show/161199/

does seem to be faster than the clojure version, so the original blog post should be updated
Lau
2010-01-03 09:23:03
@danlei: Very nicely implemented.
Eric Bowman
2010-01-03 09:46:35
@Lau: you need to set a bigger heap size than the default. Try: JAVA_OPTS="-Xmx2000m -Xms2000m" scala [script].

(Also, 2.7.6 is fatally broken, you should use 2.7.7...)
Eric Bowman
2010-01-03 09:57:34
This version is a little faster still:

http://paste.pocoo.org/show/161535/

...and also doesn't fall foul of the "Source doesn't close the file" bug in 2.7.x.

Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 5.479 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 3.909 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 3.972 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 3.749 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 5.206 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 3.742 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 3.959 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 3.661 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 4.402 seconds
Writing counts in decreasing order
Writing counts in alphabetical order
Finished in 3.748 seconds
Lau
2010-01-03 10:14:20
@Eric: I'll commend you on the source code, it's very nicely structured and readable. Regarding the timings, they varied quite a bit for the first 6 runs, probably while HotSpot was considering what to do with it. Worst run was 66s, best was 48s, then it stabilized just below 40s.
Eric Bowman
2010-01-03 12:11:57
I'm wondering about the machine you are running on; my performance characteristics are quite different.  I ran the python version, and its performance is comparable to clojure, but slower than scala:

$ for ((i=0;i<10;i++)) do python python.py; done
Writing counts in decreasing order
Writing counts in decreasing order
Finished in 8.87481188774 seconds
Writing counts in decreasing order
Writing counts in decreasing order
Finished in 9.00109100342 seconds
Writing counts in decreasing order
Writing counts in decreasing order
Finished in 8.84871888161 seconds
Writing counts in decreasing order
Writing counts in decreasing order
Finished in 8.87842607498 seconds
Writing counts in decreasing order
Writing counts in decreasing order
Finished in 8.85654592514 seconds
Writing counts in decreasing order
Writing counts in decreasing order
Finished in 9.07612895966 seconds
Writing counts in decreasing order
Writing counts in decreasing order
Finished in 8.83537912369 seconds
Writing counts in decreasing order
Writing counts in decreasing order
Finished in 8.91312813759 seconds
Writing counts in decreasing order
Writing counts in decreasing order
Finished in 8.86974906921 seconds
Writing counts in decreasing order
Writing counts in decreasing order
Finished in 9.06246709824 seconds

CPU: Dual Intel(R) Xeon(R) CPU E5520  @ 2.27GHz
MemTotal:       24663692 kB
Linux snappy 2.6.30-gentoo-r4 #1 SMP Sat Sep 5 15:30:30 IST 2009 x86_64 Intel(R) Xeon(R) CPU E5520 @ 2.27GHz GenuineIntel GNU/Linux
java version "1.6.0_17"
Java(TM) SE Runtime Environment (build 1.6.0_17-b04)
Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01, mixed mode)
Scala code runner version 2.7.7.r0-b20100101230322 -- Copyright 2002-2009, LAMP/EPFL
Python 2.6.4
Lau
2010-01-03 12:14:31
@Eric: I think I'll need to devise a benchmark which relies less on I/O speed and more on computations. I'm on 
Ubuntu 9.10, 
Java 1.6.0_13, 
HotSpot b02, 
Python 2.6.2
1 GB RAM, 250GB Seagate HD
Paul King
2010-01-03 15:05:13
I ran some benchmarks with some Groovy variations:

12-LOC Groovy version 95+ sec
60-LOC Java 29 sec
13-LOC Static Groovy 22 sec
29-LOC Parallel Static Groovy with 2 processors 12 sec
Gabriel C.
2010-01-03 23:23:07
Lau, your Scala code is far from idiomatic: it seems a straight translation of the Ruby version... no wonder it looks verbose!
And where is the Clojure version of these lines?
# if (!rootDir.exists) throw new IllegalArgumentException(rootDir + " does not exist")
# println("Processing" + dir)
# println("Writing counts in decreasing order")
# println("Writing counts in alphabetical order")
Lau
2010-01-03 23:30:20
@ALL: *ATTENTION* The comments starting from #38 - #50 all have my Gravatar next to them, even when I'm not the author, so check the name.

@Gabriel: I think I answered that when this debate began :)
Zach Cox
2010-01-04 17:52:20
Wow I never thought my original post would generate so much interest!  Just posted a comment on it with some additional background and revised Ruby & Scala versions: http://blogs.sourceallies.com/2009/12/word-counts-example-in-ruby-and-scala/comment-page-1/#comment-280

@Lau: I am going to learn Clojure in 2010 so that I can understand your code!  Thanks for posting it.

To everyone else: thanks for continuing the discussion!  It's great to see all of the different solutions in various languages.  I have learned a lot from all of you.
Daniel Spiewak
2010-01-05 06:34:56
For the record, I think this whole post ridiculous, slanted and without any positive merit.  Micro-benchmarks seldom serve any purpose, and *never* when it's a cross-language comparison.  And as if that wasn't bad enough, you're considering lines of code at the same time?!

With that said, I couldn't help myself.  :-)  http://paste.pocoo.org/show/162160/  I golfed your original implementation down a bit, made it substantially more idiomatic and leveraged some of those handy functions which come built into the standard library.  I didn't have time to test anything, but I'm reasonably sure that it's correct.  My code is substantially more functional than your Scala version (no more mutable state), quite a bit more readable than your Clojure version, and still manages to accomplish the task in only 27 lines.  If I were actually trying to solve this problem for myself, I probably would have written something similar to this.

Oh, and if you let me throw readability to the wind, I could probably shave eight or nine more lines off of that total, putting it basically on-par with the Clojure implementation.
Mogens Heller Grabe
2010-01-10 17:21:50
I agree with Daniel Spiewak completely, and I seldomly find it interesting to solve this kind of micro-scale problems. This time, however, I find it intriguing that the languages are so different - only thing missing is a C# version!

"But isn't that like Java?" someone might ask... if I had solved it with idiomatic C# 2, then yes it would probably be hard to distinguish a C# version from a Java version. But C# 3 changed that!

Check out a C# 3 version here: http://mookid.dk/oncode/archives/950
Lau
2010-01-10 17:25:04
@Mogens: Thanks for checking in and leaving a link to the code. I still haven't gotten around to benchmarking it, but hopefully I'll find a quiet moment soon.
virasak
2010-01-13 09:35:12
My best Scala version: http://gist.github.com/265899#file_word_freq.scala
Sal
2010-02-06 23:02:56
How about a Clojure vs Haskell post?