Clojure vs Ruby & Scala — Transient Newsgroups
Recently I had the good pleasure of reading a blogpost which demonstrated a fun exercise in both Ruby and Scala, namely scraping newsgroups. I had a look at both solutions and decided to roll one in Clojure as well, examining the differences between the famous Ruby, the Juggernaut Scala and the elegant Clojure.
Preface
If you have not read the blogpost I linked above, I recommend that you do so to get the full story. The short version goes like this: Mr. Cox was asked to do a newsgroup scraper for a job interview in any language and he rolled two very similar versions in both Ruby and Scala. Both programs do the same thing: Walk through all the files in a directory which contains 20 downloaded newsgroups, stripping out the individual words using a regex and count the number of occurrences. Finally the result is dumped into 2 files, the first alphabetically sorted the second sorted according to the number of occurrences of each word.
Ruby
For me it was interesting to see Ruby contrasted so directly with Scala, because there are more similarities than I thought. Ruby has often been lauded and applauded because it is said to enable the developer to produce software very quickly — I must admit from this example I fail to see how, but that could be my fault.
Ruby solution
#Change rootDir to the location of your newsgroup files
rootDir = "/home/zcox/dev/20_newsgroups"
raise rootDir + " does not exist" unless File.directory? rootDir
#Iterates over all files under rootDir, opens each one and passes it to the block
def files(rootDir)
Dir.foreach(rootDir) do |dir|
if dir != "." && dir != ".."
puts "Processing " + dir
Dir.foreach(rootDir + "/" + dir) do |file|
if file != "." && file != ".."
open(rootDir + "/" + dir + "/" + file) do |f|
yield(f)
end
end
end
end
end
end
t1 = Time.now
counts = Hash.new(0) #0 will be the default value for non-existent keys
files(rootDir) do |file|
file.read.scan(/\w+/) { |word| counts[word.downcase] += 1 }
end
puts "Writing counts in decreasing order"
open("counts-descreasing-ruby", "w") do |out|
counts.sort { |a, b| b[1] <=> a[1] }.each { |pair| out << "#{pair[0]}\t#{pair[1]}\n" }
end
puts "Writing counts in alphabetical order"
open("counts-alphabetical-ruby", "w") do |out|
counts.sort { |a, b| a[0] <=> b[0] }.each { |pair| out << "#{pair[0]}\t#{pair[1]}\n" }
end
t2 = Time.now
puts "Finished in " + (t2 - t1).to_s + " seconds"
There are quite a few things which catch my interest when reading that example. Firstly it looks a lot like everything else you see these days in terms of structure and syntax, but it has an interesting mix of both the long vertical code like most imperative languages, but also in a few places the deeper horizontal code which really packs a punch. I’m thinking primarily of the lines which output the result files, there’s a lot of functionality stuffed into a very short space.
Also when reading it, I kinda regret my earlier criticisms of Pythons indentation rules, because even though they aren’t optimal they beat the explicit block constraints which Ruby have. I can’t help but wonder how you feel as a developer when you’re writing out lines 14 — 19 in the above Ruby code. I think I’d feel a little tied up.
One of the interesting things about the code is this:
counts.sort { |a, b| b[1] <=> a[1] }.each { |pair| out << "#{pair[0]}\t#{pair[1]}\n" }
counts is a hash of the already accumulated statistics, which is sorted on the fly according to the integer value and output as pairs. It might have deserved a line or two more, but its pretty cool. Reminds me a little of Groovy.
Lastly, the actual assignment of scores to each word is done here
file.read.scan(/\w+/) { |word| counts[word.downcase] += 1 }
Again there’s the clever use of code-blocks where you iterate over some data, giving it a name |word| and working with it. Compared to a solution in say Java, there’s alot of ceremony being avoided here.
All in all, I hope some commentators will help enlighten me as to why there is so much hype concerning Ruby — to me it looks a little like your regular imperative language, with a couple of interesting features tossed in the mix.
Scala
I’ve previously compared Scala with Clojure, especially its Concurrency capabilities and unfortunately Scala came out looking pretty bad due a bug in their code-base, but this time Scala compares very nicely to Ruby.
Firstly I wouldn’t call this typical Scala because there are no classes and methods being defined, its as stripped of Boilerplate as is possible with Scala. Normally Scala comes across as very verbose and this isn’t the exception as its almost 40% longer than the Ruby version — Which I guess is a trade off, because it does yield significantly better performance.
Scala solution:
//Change rootDir to the location of your newsgroup files
import java.io._
val rootDir = new File("/home/lau/Desktop/20_newsgroups")
if (!rootDir.exists) throw new IllegalArgumentException(rootDir + " does not exist")
/** Iterates over all files under rootDir, opens each one and passes it to the function */
def files(rootDir: File)(process: File => Unit) {
for (dir <- rootDir.listFiles; if dir.isDirectory) {
println("Processing" + dir)
for (file <- dir.listFiles; if file.isFile) {
process(file)
}
}
}
val t1 = System.currentTimeMillis
var counts = Map.empty[String, Int].withDefaultValue(0)
files(rootDir) { file =>
file.split("""\W+""").foreach { word => counts = counts(word.toLowerCase) += 1 }
}
println("Writing counts in decreasing order")
write(counts, "counts-descreasing-scala") {_._2 > _._2}
println("Writing counts in alphabetical order")
write(counts, "counts-alphabetical-scala") {_._1 < _._1}
val t2 = System.currentTimeMillis
println("Finished in " + ((t2 - t1)/1000.0) + " seconds");
/** Writes the specified map to the specified file in tab-delimited format, sorted accordingly. */
def write[K, V](map: Map[K, V], file: String)(sort: (Tuple2[K, V], Tuple2[K, V]) => Boolean) {
using (new PrintWriter(new FileWriter(file))) { out =>
map.toList.sort(sort).foreach { pair => out.println(pair._1 + "\t" + pair._2) }
}
}
/** Converts a File to a String. */
implicit def file2String(file: File): String = {
val builder = new StringBuilder
using (new BufferedReader(new FileReader(file))) { reader =>
var line = reader.readLine
while (line != null) {
builder.append(line).append('\n')
line = reader.readLine
}
}
builder.toString
}
/** Performs some operation on the specified closeable object and then ensures it gets closed. */
def using[Closeable <: {def close(): Unit}, B](closeable: Closeable)(getB: Closeable => B): B =
try {
getB(closeable)
} finally {
closeable.close()
}
Now this does the same thing as the Ruby version, so why the added code? Primarily its the file2String and ‘using’ which aren’t (?) in Scalas core that are taking up space, so its not a big deal. Scala actually mimics Ruby quite closely in that sorting and assignment of scores are done in almost exactly the same way. As a Scala developer you spent much of your time writing out variables types which must be tedious, but the pay-off is quite a bit of speed. Its only in the static typing that I can find an explanation for why Scala outperforms Ruby fiercely.
But of course, the most interesting solution is saved for now :)
Clojure
Solving this problem in Clojure was extremely simple, so I decided to tweak performance a little bit. Firstly we need a function to which I can pass a list of words and in return get their distribution, or frequency of occurrence.
In its simplest form, you could do this
(reduce #(assoc %1 %2 (inc (get %1 %2 0))) {} ["you" "are" "you" "and" "I" "am" "me"])
{"me" 1, "am" 1, "I" 1, "and" 1, "are" 1, "you" 2}
That’s really simple, the first argument is a hashmap, to which I assign the string as the key and the value as either (inc 0) (default if get fails) or (inc old-value) if there is a value. As you can see it returns 1 for all words except ‘you’. But there’s a problem, its case sensitive. So we can formally define the case-sensitive version as so:
(defn distribution [items]
(reduce #(let [lowcase (.toLowerCase %2)]
(assoc %1 lowcase (inc (get %1 lowcase 0))))) {} items)))
And that’s really doing the exact same thing, but as my good friend Mr. Grand noticed, assoc always returns a new hash-map instead of mutating, as Clojure defaults to immutability. Because the computation is contained locally in this function, I don’t have a problem using transients (mutable data) because we’re still functional, no side-effects:
(defn distribution [items]
(persistent!
(reduce #(let [lowcase (.toLowerCase #^String %2)]
(assoc! %1 lowcase (inc (%1 lowcase 0)))) (transient {}) items)))))
In this case, transients buy us something between 10% — 15% performance, which is nice and the code still looks the same.
So now we have the core functionality implemented, all we need to do is produce the distribution hash-map for every file from the news group:
(pmap #(distribution (re-seq #"\w+" (slurp* %))) (filter #(.isFile #^File %) (file-seq (java.io.File. dir))))
Reading that from the tail-end, I call file-seq on a directory giving me a list containing the entire contents, directories and files. I then filter out those items which return true for the “.isFile” test. The result which is a list of java.io.File objects, is passed to slurp* which emits the content as a string and that result is run through the regex ‘\w+’, which gives me a list of all the words. Since the computation of distribution is the heavy-lifter, I’ve passed that to a pmap, meaning that I activate both cores on my system, letting me work on multiple files at once. The result of this entire operation is as many hash-maps as there are files, all of them structured like
{“you” 5 “the” 22 “one” 11}
…etc. The next step, is then to join all of these hash-maps together into a single map, containing the accumulated values for all words:
(reduce #(merge-with + %1 %2) ....)
Where ‘…’ is the pmap just above. So that’s a parallelized newsgroup parser which returns a giant hashmap. All thats left to do is sort it emit it to the files, so I’ll paste the entire thing:
(ns user
(:use [clojure.contrib duck-streams str-utils])
(:import [java.io File]))
(defn distribution [items]
(persistent!
(reduce #(let [lowcase (.toLowerCase #^String %2)]
(assoc! %1 lowcase (inc (%1 lowcase 0)))) (transient {}) items)))
(defn process-newsgroups [dir]
(let [format #(str-join \newline (map (fn [_] (str-join \tab _)) %))
data (reduce #(merge-with + %1 %2)
(pmap #(distribution (re-seq #"\w+" (slurp* %)))
(filter #(.isFile #^File %)
(file-seq (java.io.File. dir)))))]
(spit "/home/lau/Desktop/alphabetical" (format (sort data)))
(spit "/home/lau/Desktop/decending" (format (sort-by val > data)))))
(time (process-newsgroups "/home/lau/Desktop/20_newsgroups/"))
As you can see sorting is a breeze. Alphabetically you just call sort as it works on the keys. To work on the values I use sort-by and explicitly ask for the value and pass > as the comparator. The result of sorting is a list of vectors ([“one” 1] [“two” 2]) which is passed to format. Format is also in Clojure-Core but I’ve shadowed it with an anonymous function that interposes a tab between items and a newline between vectors. The result is then passed to spit, which writes the actual file.
(note: spit and slurp* are from contrib.duck-streams and str-join are from str-utils)
Conclusion
So whats the fun of comparing if we dont round it all up:
| Ruby | Scala | Clojure | |
| LOC | 36 | 54 | 19 |
| Runtime (secs) | 88 | 39 | 25 |
| LOC | 36 | 54 | 19 |
So it seems that Ruby allows you to write somewhat compact code, but you have to suffer a great performance hit. Scala gives you a somewhat bloated code-size, but in return you get about than 50% more performance out of your system. With Clojure you get a minimal code-size, parallelized computations and blazing performance that’s 72% faster than Ruby and 36% faster than Scala.
Now as always, remember that the benchmark gives you an indication of the truth, results will vary a little. I ran all 3 scripts from command-line with no other programs being used at the same time, but still.
Also I don’t want anybody to get the wrong impression: This is not your typical x VS y post of which I have done several. Mr. Cox didn’t write out those examples to demonstrate the awesome power of either language so this can not be considering anything other than a superficial look at the code — He solved a fun problem and I was happy to play along. Nothing I’ve said should be taken as a critism of his code or of his blogpost which I quite enjoyed :)
Update: Danny Lagrouw of Scala fame wanted to represent the Scala colors in a few more flavors, so he has taken the above code and improved upon it. He made 2 versions, one using foldLeft and the other using foreach. Both of these versions are for the Scala compiler in versions 2.7 and 2.8. Both versions using foldLeft failed to complete on my system so the benchmark is for the mutable version (which is faster). Please see his contributions: here. The results are:
Update(2): More and more contributions are coming in — I’ll benchmark and add them as I get them.
| LOC | Runtime (secs) | |
| Ruby | 36 | 88 |
| Scala (orig) | 54 | 39 |
| Clojure | 19 | 25 |
| Scala (2.7) | 38 | 45 |
| Scala (2.8) | 38 | 54 |
| Python | 32 | 13 |
| Java | 61 | 21 |
Big thanks to all for contributing.
| | |
Comments are closed.

about 2 months ago
@ALL: *ATTENTION* The comments starting from #38 — #50 all have my Gravatar next to it, even when I’m not the author, so check the name.
@Gabriel: I think I answered that when this debate begun :)
about 2 months ago
Wow I never thought my original post would generate so much interest! Just posted a comment on it with some additional background and revised Ruby & Scala versions: http://blogs.sourceallies.com/2009/12/word-counts-example-in-ruby-and-scala/comment-page-1/#comment-280
@Lau: I am going to learn Clojure in 2010 so that I can understand your code! Thanks for posting it.
To everyone else: thanks for continuing the discussion! It’s great to see all of the different solutions in various languages. I have learned a lot from all of you.
about 2 months ago
For the record, I think this whole post ridiculous, slanted and without any positive merit. Micro-benchmarks seldom serve any purpose, and *never* when it’s a cross-language comparison. And as if that wasn’t bad enough, you’re considering lines of code at the same time?!
With that said, I couldn’t help myself. :-) http://paste.pocoo.org/show/162160/ I golfed your original implementation down a bit, made it substantially more idiomatic and leveraged some of those handy functions which come built into the standard library. I didn’t have time to test anything, but I’m reasonably sure that it’s correct. My code is substantially more functional than your Scala version (no more mutable state), quite a bit more readable than your Clojure version, and still manages to accomplish the task in only 27 lines. If I were actually trying to solve this problem for myself, I probably would have written something similar to this.
Oh, and if you let me throw readability to the wind, I could probably shave eight or nine more lines off of that total, putting it basically on-par with the Clojure implementation.
about 1 month ago
I agree with Daniel Spiewak completely, and I seldomly find it interesting to solve this kind of micro-scale problems. This time, however, I find it intriguing that the languages are so different — only thing missing is a C# version!
“But isn’t that like Java?” someone might ask… if I had solved it with idiomatic C# 2, then yes it would probably be hard to distinguish a C# version from a Java version. But C# 3 changed that!
Check out a C# 3 version here: http://mookid.dk/oncode/archives/950
about 1 month ago
@Mogens: Thanks for checking in and leaving a link to the code. I still haven’t gotten around to benchmarking it, but hopefully I’ll find a quiet moment soon.
about 1 month ago
My best Scala version: http://gist.github.com/265899#file_word_freq.scala
about 1 month ago
How about a Clojure vs Haskell post?