Clojure vs Ruby & Scala – Transient Newsgroups

Recently I had the good pleasure of reading a blogpost which demonstrated a fun exercise in both Ruby and Scala, namely scraping newsgroups. I had a look at both solutions and decided to roll one in Clojure as well, examining the differences between the famous Ruby, the Juggernaut Scala and the elegant Clojure. Clojure stands out by being purely function, yet still allowing local mutation using transients.

Preface

If you have not read the blogpost I linked above, I recommend that you do so to get the full story. The short version goes like this: Mr. Cox was asked to do a newsgroup scraper for a job interview in any language and he rolled two very similar versions in both Ruby and Scala. Both programs do the same thing: Walk through all the files in a directory which contains 20 downloaded newsgroups, stripping out the individual words using a regex and count the number of occurrences. Finally the result is dumped into 2 files, the first alphabetically sorted the second sorted according to the number of occurrences of each word.

Ruby

For me it was interesting to see Ruby contrasted so directly with Scala, because there are more similarities than I thought. Ruby has often been lauded and applauded because it is said to enable the developer to produce software very quickly – I must admit from this example I fail to see how, but that could be my fault.

Ruby solution

#Change rootDir to the location of your newsgroup files
rootDir = "/home/zcox/dev/20_newsgroups"
raise rootDir + " does not exist" unless File.directory? rootDir

#Iterates over all files under rootDir, opens each one and passes it to the block
def files(rootDir)
  Dir.foreach(rootDir) do |dir|
    if dir != "." && dir != ".."
      puts "Processing " + dir
      Dir.foreach(rootDir + "/" + dir) do |file|
        if file != "." && file != ".."
          open(rootDir + "/" + dir + "/" + file) do |f|
            yield(f)
          end
        end
      end
    end
  end
end

t1 = Time.now
counts = Hash.new(0) #0 will be the default value for non-existent keys
files(rootDir) do |file|
  file.read.scan(/\w+/) { |word| counts[word.downcase] += 1 }
end

puts "Writing counts in decreasing order"
open("counts-descreasing-ruby", "w") do |out|
  counts.sort { |a, b| b[1] <_> a[1] }.each { |pair| out << "#{pair[0]}\t#{pair[1]}\n" }
end

puts "Writing counts in alphabetical order"
open("counts-alphabetical-ruby", "w") do |out|
  counts.sort { |a, b| a[0] <_> b[0] }.each { |pair| out << "#{pair[0]}\t#{pair[1]}\n" }
end

t2 = Time.now
puts "Finished in " + (t2 - t1).to_s + " seconds"

There are quite a few things which catch my interest when reading that example. Firstly it looks a lot like everything else you see these days in terms of structure and syntax, but it has an interesting mix of both the long vertical code like most imperative languages, but also in a few places the deeper horizontal code which really packs a punch. I’m thinking primarily of the lines which output the result files, there’s a lot of functionality stuffed into a very short space.

Also when reading it, I kinda regret my earlier criticisms of Pythons indentation rules, because even though they aren’t optimal they beat the explicit block constraints which Ruby have. I can’t help but wonder how you feel as a developer when you’re writing out lines 14 – 19 in the above Ruby code. I think I’d feel a little tied up.

One of the interesting things about the code is this:

counts.sort { |a, b| b[1] <_> a[1] }.each { |pair| out << "#{pair[0]}\t#{pair[1]}\n" }

counts is a hash of the already accumulated statistics, which is sorted  on the fly according to the integer value and output as pairs. It might have deserved a line or two more, but its pretty cool. Reminds me a little of Groovy.

Lastly, the actual assignment of scores to each word is done here

  file.read.scan(/\w+/) { |word| counts[word.downcase] += 1 }

Again there’s the clever use of code-blocks where you iterate over some data, giving it a name |word| and working with it. Compared to a solution in say Java, there’s alot of ceremony being avoided here.

All in all, I hope some commentators will help enlighten me as to why there is so much hype concerning Ruby – to me it looks a little like your regular imperative language, with a couple of interesting features tossed in the mix.

Scala

I’ve previously compared Scala with Clojure, especially its Concurrency capabilities and unfortunately Scala came out looking pretty bad due a bug in their code-base, but this time Scala compares very nicely to Ruby.

Firstly I wouldn’t call this typical Scala because there are no classes and methods being defined, its as stripped of Boilerplate as is possible with Scala. Normally Scala comes across as very verbose and this isn’t the exception as its almost 40% longer than the Ruby version – Which I guess is a trade off, because it does yield significantly better performance.

Scala solution:

//Change rootDir to the location of your newsgroup files
import java.io._
val rootDir = new File("/home/lau/Desktop/20_newsgroups")
if (!rootDir.exists) throw new IllegalArgumentException(rootDir + " does not exist")

/** Iterates over all files under rootDir, opens each one and passes it to the function */
def files(rootDir: File)(process: File =>; Unit) {
  for (dir <_- _-="_-" filesrootdir="filesrootdir" printlnprocessing="printlnprocessing" counts="Map.empty[String," dir.listfiles="dir.listfiles" _="_" dir="dir" t1="System.currentTimeMillis" file.isfile="file.isfile" processfile="processfile" rootdir.listfiles="rootdir.listfiles" if="if" val="val" for="for" dir.isdirectory="dir.isdirectory" file="" int.withdefaultvalue0="int.withdefaultvalue0" var="var">
  file.split("""\W+""").foreach { word => counts = counts(word.toLowerCase) += 1 }
}

println("Writing counts in decreasing order")
write(counts, "counts-descreasing-scala") {_._2 > _._2}

println("Writing counts in alphabetical order")
write(counts, "counts-alphabetical-scala") {_._1 > _._1}

val t2 = System.currentTimeMillis
println("Finished in " + ((t2 - t1)/1000.0) + " seconds");

/** Writes the specified map to the specified file in tab-delimited format, sorted accordingly. */
def write[K, V](map: Map[K, V], file: String)(sort: (Tuple2[K, V], Tuple2[K, V]) =>; Boolean) {
  using (new PrintWriter(new FileWriter(file))) { out =>
    map.toList.sort(sort).foreach { pair => out.println(pair._1 + "\t" + pair._2) }
  }
}

/** Converts a File to a String. */
implicit def file2String(file: File): String = {
  val builder = new StringBuilder
  using (new BufferedReader(new FileReader(file))) { reader =>
    var line = reader.readLine
    while (line != null) {
      builder.append(line).append('\n')
      line = reader.readLine
    }
  }
  builder.toString
}

/** Performs some operation on the specified closeable object and then ensures it gets closed. */
def using[Closeable &lt;_:_ closeable="" closeablegetb:_="closeablegetb:_" bcloseable:_="bcloseable:_" unit="unit" close:_="close:_" def="def"> B): B =
  try {
    getB(closeable)
  } finally {
    closeable.close()
  }

Now this does the same thing as the Ruby version, so why the added code? Primarily its the file2String and ‘using’ which aren’t (?) in Scalas core that are taking up space, so its not a big deal. Scala actually mimics Ruby quite closely in that sorting and assignment of scores are done in almost exactly the same way. As a Scala developer you spent much of your time writing out variables types which must be tedious, but the pay-off is quite a bit of speed. Its only in the static typing that I can find an explanation for why Scala outperforms Ruby fiercely.

But of course, the most interesting solution is saved for now :)

Clojure Transients

Solving this problem in Clojure was extremely simple, so I decided to tweak performance a little bit. Firstly we need a function to which I can pass a list of words and in return get their distribution, or frequency of occurrence.

In its simplest form, you could do this

(reduce #(assoc %1 %2 (inc (get %1 %2 0))) {}  ["you" "are" "you" "and" "I" "am" "me"])
{"me" 1, "am" 1, "I" 1, "and" 1, "are" 1, "you" 2}

That’s really simple, the first argument is a hashmap, to which I assign the string as the key and the value as either (inc 0) (default if get fails) or (inc old-value) if there is a value. As you can see it returns 1 for all words except ‘you’. But there’s a problem, its case sensitive. So we can formally define the case-sensitive version as so:

(defn distribution [items]
  (reduce #(let [lowcase (.toLowerCase %2)]
	       (assoc %1 lowcase (inc (get %1 lowcase 0))))) {} items)))

And that’s really doing the exact same thing, but as my good friend Mr. Grand noticed, assoc always returns a new hash-map instead of mutating, as Clojure defaults to immutability. Because the computation is contained locally in this function, I don’t have a problem using transients (mutable data) because we’re still functional, no side-effects:

(defn distribution [items]
  (persistent!
   (reduce #(let [lowcase (.toLowerCase #^String %2)]
	      (assoc! %1 lowcase (inc (%1 lowcase 0)))) (transient {}) items)))))

In this case, transients buy us something between 10% – 15% performance, which is nice and the code still looks the same.

So now we have the core functionality implemented, all we need to do is produce the distribution hash-map for every file from the news group:

(pmap #(distribution (re-seq #"\w+" (slurp* %)))
			  (filter #(.isFile #^File %)
				  (file-seq (java.io.File. dir))))

Reading that from the tail-end, I call file-seq on a directory giving me a list containing the entire contents, directories and files. I then filter out those items which return true for the “.isFile” test. The result which is a list of java.io.File objects, is passed to slurp* which emits the content as a string and that result is run through the regex ‘\w+’, which gives me a list of all the words. Since the computation of distribution is the heavy-lifter, I’ve passed that to a pmap, meaning that I activate both cores on my system, letting me work on multiple files at once. The result of this entire operation is as many hash-maps as there are files, all of them structured like

{"you" 5 "the" 22 "one" 11}

…etc. The next step, is then to join all of these hash-maps together into a single map, containing the accumulated values for all words:

(reduce #(merge-with + %1 %2) ....)

Where ‘…’ is the pmap just above. So that’s a parallelized newsgroup parser which returns a giant hashmap. All thats left to do is sort it emit it to the files, so I’ll paste the entire thing:

(ns user
  (:use [clojure.contrib duck-streams str-utils])
  (:import [java.io File]))

(defn distribution [items]
  (persistent!
   (reduce #(let [lowcase (.toLowerCase #^String %2)]
	      (assoc! %1 lowcase (inc (%1 lowcase 0)))) (transient {}) items)))

(defn process-newsgroups [dir]
  (let [format #(str-join \newline (map (fn [_] (str-join \tab _)) %))
		data	(reduce #(merge-with + %1 %2)
			(pmap #(distribution (re-seq #"\w+" (slurp* %)))
				(filter #(.isFile #^File %)
					(file-seq (java.io.File. dir)))))]
    (spit "/home/lau/Desktop/alphabetical" (format (sort data)))
    (spit "/home/lau/Desktop/decending"    (format (sort-by val &gt; data)))))

(time (process-newsgroups "/home/lau/Desktop/20_newsgroups/"))

As you can see sorting is a breeze. Alphabetically you just call sort as it works on the keys. To work on the values I use sort-by and explicitly ask for the value and pass > as the comparator. The result of sorting is a list of vectors ([“one” 1] [“two” 2]) which is passed to format. Format is also in Clojure-Core but I’ve shadowed it with an anonymous function that interposes a tab between items and a newline between vectors. The result is then passed to spit, which writes the actual file.

(note: spit and slurp* are from contrib.duck-streams and str-join are from str-utils)

Conclusion

So whats the fun of comparing if we dont round it all up:

Ruby Scala Clojure
LOC 36 54 19
Runtime (secs) 88 39 25
LOC 36 54 19

So it seems that Ruby allows you to write somewhat compact code, but you have to suffer a great performance hit. Scala gives you a somewhat bloated code-size, but in return you get about than 50% more performance out of your system. With Clojure you get a minimal code-size, parallelized computations and blazing performance that’s 72% faster than Ruby and 36% faster than Scala.

Now as always, remember that the benchmark gives you an indication of the truth, results will vary a little. I ran all 3 scripts from command-line with no other programs being used at the same time, but still.

Also I don’t want anybody to get the wrong impression: This is not your typical x VS y post of which I have done several. Mr. Cox didn’t write out those examples to demonstrate the awesome power of either language so this can not be considering anything other than a superficial look at the code – He solved a fun problem and I was happy to play along. Nothing I’ve said should be taken as a critism of his code or of his blogpost which I quite enjoyed :)

Update: Danny Lagrouw of Scala fame wanted to represent the Scala colors in a few more flavors, so he has taken the above code and improved upon it. He made 2 versions, one using foldLeft and the other using foreach. Both of these versions are for the Scala compiler in versions 2.7 and 2.8. Both versions using foldLeft failed to complete on my system so the benchmark is for the mutable version (which is faster). Please see his contributions: here. The results are:

Update(2): More and more contributions are coming in – I’ll benchmark and add them as I get them.

LOC Runtime (secs)
Ruby 36 88
Scala (orig) 54 39
Clojure 19 25
Scala (2.7) 38 45
Scala (2.8) 38 54
Python 32 13
Java 61 21

Big thanks to all for contributing.
Lau

About the author

Lau Jensen is the owner of Best In Class, an avid Clojure Developer well experienced in project management and also the primary author of ClojureQL. You are very welcome to follow his posts on this blog or get in touch directly on lau.jensen@bestinclass.dk