Python vs Clojure - Reloaded

2009-10-18 14:09:51

In my last post I tried to outline some of the contrasts between the popular Python and Clojure. A repeated comment from the Python community was 'Why dont you let the code speak for itself, go to Rosetta'....So I did and this is what I found.

Preface

The Python community is very active and sizeable, which is good. Quite a few of them let me know that the examples which I had picked from the PyEuler project were not showing off Pythons best side and looking at their counter-examples I cannot deny that. On the encouragement of several Python users I've now visisted Rosetta code for the first time and here I'll share my experiences, solving 3 of their challenges.

Disclaimer

If you read the comments in the last post you'll see that tempers tend to rise when you're contrasting languages and that's not what I'm going for. I want this to be a fun apples-to-apples comparison, where hopefully both Python and Clojure users take away something useful. But please remember before you start ripping my comment-box in pieces: Inverting the example does not always invert the conclusion. If we determine that certain codepaths are error prone and 'dangerous' providing a safe/correct example won't necessarily change the principle, so let's try to keep the debate principle oriented. Makes sense?

Rosetta

Now I won't claim any expertise on the Rosetta project I visisted it yesterday for the first time. It's basically a place where you can go and show off solutions in every programming language you can think of, which is quite insightful. I'll try to comment on size, readability, boilerplate and correctness in the following examples. My method was simple: I saw a link to a task from the front page, solved that, clicked around until I found something that had a Python solution, solved that and so on.

#1: Lucas-Lehmer test

If the name rings a bell and you can't quite place it, perhaps it's because it relates to Mersenne primes which were mentioned on Slashdot earlier this week. The 48th number was found, winning some $100.000,- USD to a lucky (patient?) university. Mathematically speaking, the test is this:

Lucas-Lehmer Test: for p a prime, the Mersenne number 2p − 1 is prime if and only if 2p − 1 divides S(p − 1) where S(n + 1) = (S(n))2 − 2, and S(1) = 4.

So it's an iterative walk through S to determine if p is a Mersenne prime, its some Big Integer math which (until the next release) is a bit slow on the JVM. We'll start with the Python teams solution:

from sys import stdout
from math import sqrt, log

def is_prime ( p ):
  if p == 2: return True # Lucas-Lehmer test only works on odd primes
  elif p <_ in="in" range3="range3" no="no" print="print" _-="_-" range2="range2" intsqrtp1="intsqrtp1" mersenne="mersenne" else:_="else:_" _="1" true="true" or="or" log10="log10" finding="finding" count="count" upb_prime1:_="upb_prime1:_" _0="_0" p1:_="p1:_" upb_prime="int(" return="return" _2="_2" _1="_1" _2:_="_2:_" _0:_="_0:_" mprimes="mprimes" printmdp="printmdp" of="of" and="and" primes="primes" log2="log2" was="was" i="i" requested="requested" is_mersenne_prime="is_mersenne_prime" int="int" is_primep="is_primep" bits="bits" decimal="decimal" mp-1="mp-1" if="if" m_p="m_p" find="find" _:_="_:_" def="def" places="places" m2..d:upb_prime="m2..d:upb_prime" stdout.flush="stdout.flush" _45="_45" upb_count="45" maximum="maximum" for="for" number="number" false="false" unsigned="unsigned" long_bits_width="long_bits_width" is_mersenne_primep:_="is_mersenne_primep:_" enough="enough" p="p" given="given" s="=" precision="20000">= upb_count: break
print

Weighing in at 36 lines Team Python has produced a working solution. In the last post we we're specifically looking at Pythons functional programming (FP) capabilities, but since this is a comparison in a broader sense I won't deduct points for mutability and/or side effects :) Methodically speaking their prime tester is similar to mine, there's a couple of special cases but otherwise we look for divisors up until the squareroot of n. The biggest difference in both testers, is that Python code often relies on breaking out of nesting loops when certain conditions occur, but apart from that we're using the same approach. To new-comers their translation

2p - 1      to (1 << p) - 1

Might be a little confusing, but it's a great performance booster. << means bit-shift-left and

n << x        =     x*2^n

So, using the same methodology, allow me to present 11 lines of Clojure:

(defn prime? [i]
  (cond (< i 4)           (>= i 2)
        (zero? (rem i 2)) false
  :else (not-any? #(zero? (rem i %)) (range 3 (inc (Math/sqrt i))))))))

(defn mersenne? [p] (or (= p 2)
  (let [mp   (dec (bit-shift-left 1 p))]
    (loop [n 3 s 4]
      (if (> n p)
        (zero? s)
        (recur (inc n) (rem (- (* s s) 2) mp)))))))

(filter mersenne? (filter prime? (iterate inc 1)))

One thing you need to be mindful of, is that functions suffixed "?" in Clojure is often (by convention) used as predicates, meaning they will return either true of false. My prime? asks "are we below 4? If we are we're looking at numbers 1, 2 and 3, whereof only the first isn't a prime, so the question is really "are we looking at 2 or 3"? Shortened to (>= i 2). Mersenne? is exactly similar to Pythons and the filters should read like plain english. (iterate inc 1) iterates the function inc(rement) starting with one, 1.2.3.4.5.6.......n.

Conclusion

Python: Reads easy, well layed out - 36 lines is a bit much though

Clojure: Reads easy, 11 lines is excellent, no boilerplate.

If we're agreed, Clojure takes a tender lead of 1 - 0. If we're not agreed, I'm sure you'll let me know :)

It's worth noting though, that in the world outside Clojure and Python we see a C++ solution weighing about 45 lines, Java at about 40 lines and Ruby at 42 - So Clojure's 11 lines are impressive.

#2: Counting programming examples

This was a fun little challenge. Basically you get a URL which feeds you some XML giving you titles of all Programmings tasks on Rosetta, you then crawl these looking for examples, counting them as you go along. This is one of the times where I have to say Team Pythons solution really impressed me. It weighs in at a trimmed 12 lines:

import urllib, xml.dom.minidom

x = urllib.urlopen("http://www.rosettacode.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Programming_Tasks&cmlimit=500&format=xml")

tasks = []
for i in xml.dom.minidom.parseString(x.read()).getElementsByTagName("cm"):
    t = i.getAttribute('title').replace(" ", "_")
    y = urllib.urlopen("http://www.rosettacode.org/w/index.php?title=%s&action=raw" % t)
    tasks.append( y.read().lower().count("{{header|") )
    print t.replace("_", " ") + ": %d examples." % tasks[-1]

print "\nTotal: %d examples." % sum(tasks)

Quite nice dont you think? First they get the xml into x and then they define an array called tasks. Here they inject the number of examples they crawl in their for-loop. They took the advice of Rosetta and match "{{header|" in all subpages, which I've found actually doesn't give quite the correct result, because not everybody uses this tag like they should! Shame on them.

Now, this had me pressed up against the wall in terms of line-numbers, so I knew I had to come up with something good. May I present 11 long lines of Clojure:

(use net.cgrand.enlive-html)

(let [xml         "http://www.rosettacode.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Programming_Tasks&cmlimit=500&format=xml"
      title-url   "http://www.rosettacode.org/w/index.php?title="
      task-count  #(try
                             (dec (count (select (html-resource (java.net.URL. (str title-url %))) [[:h2]])))
                             (catch Exception 0))
      results    (pmap #(assoc % :tasks (task-count (.replace (:title %) \space \_)))
                       (map :attrs (select (html-resource (java.net.URL. xml-url)) [:cm])))]
  (doseq [r results] (println (:title r) (apply str (repeat (- 55 (count (:title r))) \space)) (:tasks r)))
  (println "Total: " (reduce + (map :tasks results))))

I know that I couldn't possible cram more code into that small space :) Allow me to explain:

I define task-count as being an anonymous function which counts the number of -h2- tags on a given URL and decrements 1 from this count. The reason for this is that when looking through the pages it's obvious that the '{{header' tag is being misused, but it was also obvious that the h2 tag is reserved for 2 things, 1x 'Content' and code-examples, so subtract the 1 and you got the number of examples.

Then comes the fun part - I run a parallel-map (that means multithreaded) on a function which builds a hash-map. assoc(iate) means to connect a key-value pair. Here the hash-map is whatever it's being fed, meaning I work directly on the one select returns, the key is :tasks and the value is (task-count) of pulling out the :title tag from parsing the attributes of :cm tags on the url. Cryptic? Let me show you a boiled down example:

user> (first (select (html-resource (java.net.URL. xml-url)) [:cm]))
{:tag :cm, :attrs {:title "100 doors", :ns "0", :pageid "2151"}, :content nil}
user> (:attrs (first (select (html-resource (java.net.URL. xml-url)) [:cm])))
{:title "100 doors", :ns "0", :pageid "2151"}

Line 1: The select methods returns as many as those maps are there are occurances of [:cm] in the XML. Then I pull out the attributes, giving me :title, :ns and :pageid. When that's fed into the main pmap I replace spaces with underscores making "100 doors" => "100_doors". That title is then fed to task-count, which returns the number of tasks on that page. That number is then tied to the hash-map with the key :tasks.

Then finally you have the printing which is straightforward, except I threw in that (repeat (- 55 (count (:title r))) \space) expression, which just aligns everything very nicely, injecting as many spaces as is needed to print 55 characters in. (it's a hack, but it's much prettier than the alternative). As an encouragement, know that once you're comfortable with Clojure, these routines take 2 - 5 minutes to write and half of that to read - but if you're coming from another language, I know that it's harder.

Conclusion

Python: Gets almost correct count, with very concise and readable code - zero boilerplate.

Runtime: 1 hour 53 minutes

Clojure: Gets almost correct count, with very concise and somewhat complex code, zero boilerplate

Runtime: 12 minutes !!

Behold the power of pmap! Deciding on a multi threaded strategy was key to such a huge performance gain, opening as many threads as makes sense on my duo-core system (pmap takes care of that automatically). Also note, I say 'almost correct' on with Clojure as well because user generated html is user generated, it's error prone. Although I tested dozens of pages manually finding Clojure to be right every time, that is no guarantee.

Sidenote: For this problem which both languages solved beautifully in ~10 lines, Java solves in 60+ lines, Perl about 50 and Ruby at about 35. So quite impressive on both counts.

Ok, last one.

#3: Top-rank per group

This one is more of a data juggling act than anything else. You get some data and is asked to represent that in a way that's idiomatic to your language, not mutating it at run-time (which I did first). The data consists of 4 columns and n rows. Team Python weighs in at 12 lines (not counting data definition):

from collections import defaultdict

data = [('Employee Name', 'Employee ID', 'Salary', 'Department'),
        ('Tyler Bennett', 'E10297', 32000, 'D101'),
        ('John Rappl', 'E21437', 47000, 'D050'),
        ('George Woltman', 'E00127', 53500, 'D101'),
        ('Adam Smith', 'E63535', 18000, 'D202'),
        ('Claire Buckman', 'E39876', 27800, 'D202'),
        ('David McClellan', 'E04242', 41500, 'D101'),
        ('Rich Holcomb', 'E01234', 49500, 'D202'),
        ('Nathan Adams', 'E41298', 21900, 'D050'),
        ('Richard Potter', 'E43128', 15900, 'D101'),
        ('David Motsinger', 'E27002', 19250, 'D202'),
        ('Tim Sampair', 'E03033', 27000, 'D101'),
        ('Kim Arlich', 'E10001', 57000, 'D190'),
        ('Timothy Grove', 'E16398', 29900, 'D190')]

departments = defaultdict(list)
for rec in data[1:]:
    departments[rec[-1]].append(rec)

N = 3
format = "%-15s " * len(data[0])
for department, recs in departments.iteritems():
    print "Department", department
    print " ", format % data[0]
    for rec in sorted(recs, key=lambda rec: -rec[-2])[:N]:
        print " ", format % rec
    print

Starting with the data definition, then in Clojure terms that would be a vector containing lists. I'm not sure what a Python user would call it. It's low ceremony, but not being used to all the commas they kinda stick out.

They sort the data using 'sorted' which takes a lambda as the comperator. I think that's what Graham would call a 'near Lisp experience' and it is quite cool. Despite coming in at 12 lines, it's a slim read because all the functionality is packed into that 1 line - very powerful. (albeit a little Perl like "rec: -rec[-2])[:N]:")

Weighing in at 6 lines, we have Clojure:

(def data
     [{:name "Tyler Bennett"   :id "E10297" :salary 32000 :department "D101"}
      {:name "John Rappl"      :id "E21437" :salary 47000 :department "D050"}
      {:name "George Woltman"  :id "E00127" :salary 53500 :department "D101"}
      {:name "Adam Smith"      :id "E63535" :salary 18000 :department "D202"}
      {:name "Claire Buckman"  :id "E39876" :salary 27800 :department "D202"}
      {:name "David McClellan" :id "E04242" :salary 41500 :department "D101"}
      {:name "Rich Holcomb"    :id "E01234" :salary 49500 :department "D202"}
      {:name "Nathan Adams"    :id "E41298" :salary 21900 :department "D050"}
      {:name "Richard Potter"  :id "E43128" :salary 15900 :department "D101"}
      {:name "David Motsinger" :id "E27002" :salary 19250 :department "D202"}
      {:name "Tim Sampair"     :id "E03033" :salary 27000 :department "D101"}
      {:name "Kim Arlich"      :id "E10001" :salary 57000 :department "D190"}
      {:name "Timothy Grove"   :id "E16398" :salary 29900 :department "D190"}])

(let [n           3
      departments (reduce #(assoc %1 (keyword (:department %2))
                                  (conj ((keyword (:department %2)) %1) %2)) {}
                                        (sort-by :salary data))]
  (doseq [d departments]
    (println (key d) (map #(str "[" (:name %) ": " (:salary %) "]") (take n (val d))))))

The data is a vector containing hash-maps. In the 'let' statement I bind n to 3, this is a requirement of the task and it states how many employees we extract (max). Then I do all my data-mangling in the binding to 'departments' and if you're comfortable with Clojure it should read like plain english...well almost :) Sort gets the raw data and compare the :salary values asking if the first is smaller than the next. This returns all the above data sorted by salary, highest first. Then that gets passed to reduce, which produces a hashmap which groups all the departments, like {:D101 [old-entries new-entries]}, growing it everytime a new employee from that department comes through.

And with that little functional beauty I'm actually done. The entries went in sorted by size, so when grouped by departments they're already lined up the way I want. All I have to do is call (take n (:D101 ..)) and I've got the result I wanted.

Conclusion

Python: Very concise, suffers a little on readability (-rec[:-2]) for instance, no boilerplate

Clojure: Very concise and readable, basically 2 expressions handle everything

So for a job such as this I'd say they are neck and neck. In the same amount of space I show off heavy use of reduce, assoc and sort, so I feel I should get an extra point for that :)

Where both languages packed all their power into 1 expression/statement, notice how C++ gets over 70 lines for this simple problem (again, not counting definitions), that's 700% more possibility for bugs, 69 lines wasted on ceremony, perhaps 2 hours straight into the waste basket. I only spent a few minutes on my solution and I assume it's the same for Team Python.

Reloaded

So I felt it was good to do a little more informal comparison after my first post earlier this week. Both Clojure and Python are tools and so I don't want to create an atmosphere we're people get all worked up about tools. There are some facts that we should consider and then there's an element of style and taste.

If I had to ascribe a 'coolness' factor to Python, I'd say it's pretty cool. Without hesitation I would pick it above C++, Perl, Ruby, TCL etc, just to name a few odd balls. It can come in very concise chunks and it's easy to dive into. Regarding boiletplate, clearness of syntax and ceremony it's virtually unmatched. When I reviewed Scala I was told that it had very thin boilerplate, but if that's true then Pythons is invisible. But like I also told one disgruntled Scala user, it's not always fun being compared to Clojure and I think a few Python users felt that they got more competition than they hoped for. Oh well, as long as the competition is friendly, its mutually beneficial.

Not forgetting, I was asked to let the code speak for itself, so I won't say another word...

pyclusion

I hope you enjoyed the read :)

Peter
2009-10-18 17:40:05
This is so biased, it hurts! Please try to make the clojure code more clear, and skip that silly line counting theme.
Jarkko Oranen
2009-10-18 17:58:17
This post wasn't quite as inflammatory as the previous one, but I still don't like the line-counting... Anyway, I thought the last Clojure example was needlessly noisy, so here's my version: http://gist.github.com/212716

Actually, clojure.contrib.seq-utils has a group-by function, and by using that the whole thing would become just 
(group-by :department (sort-by :salary > data)) and then the printing code... Functional languages are very good at this kind of data-massaging.
Lau
2009-10-18 19:45:22
@Peter: Biased? Well if you say so :) Regarding the line-counting I've decided to make that a topic for itself, there's more to it than meets the eye.
John
2009-10-18 19:58:22
Gotta say, I laughed out loud at that picture :)

Reading this was educational.  Thanks.
James Reeves
2009-10-19 00:14:53
Your Python code is not idiomatic. For instance, you write:

    for i in range(3, int(sqrt(p))+1, 2 ):
      if p % i == 0: return False
    return True

When in Python, a better way to write it would be:

    return not any(p % i == 0 for i in range(3, int(sqrt(p))+1, 2))

Which is about the same length as the equivalent Clojure line:

    (not-any? #(zero? (rem i %)) (range 3 (inc (Math/sqrt i)) 2))

I don't think you do Clojure any favours by comparing concisely written Clojure code with verbosely written Python code.
Michael Mol
2009-10-19 03:08:24
OK, hotshot, let's see you fill in some code.  :P Closure currently has more unimplemented tasks than Python.

http://rosettacode.org/wiki/Reports:Tasks_not_implemented_in_Python
http://rosettacode.org/wiki/Reports:Tasks_not_implemented_in_Clojure

Seriously, though, It'd be awesome to have more folks in the Clojure community filling in some tasks.  Rosetta Code is my baby, and one of its primary purposes is to aid people in learning a language by seeing how a familiar task is done in both familiar and unfamiliar languages.
kylesmith
2009-10-19 04:27:38
You can use zipmap to define data:

(def data (map #(zipmap [:name :id :salary :department] %)
     [["Tyler Bennett"   "E10297" 32000 "D101"]
      ["John Rappl"      "E21437" 47000 "D050"]
      ["George Woltman"  "E00127" 53500 "D101"]
      ["Adam Smith"      "E63535" 18000 "D202"]
      ["Claire Buckman"  "E39876" 27800 "D202"]
      ["David McClellan" "E04242" 41500 "D101"]
      ["Rich Holcomb"    "E01234" 49500 "D202"]
      ["Nathan Adams"    "E41298" 21900 "D050"]
      ["Richard Potter"  "E43128" 15900 "D101"]
      ["David Motsinger" "E27002" 19250 "D202"]
      ["Tim Sampair"     "E03033" 27000 "D101"]
      ["Kim Arlich"      "E10001" 57000 "D190"]
      ["Timothy Grove"   "E16398" 29900 "D190"]]))
Lau
2009-10-19 09:08:56
@James: Thanks - Of course it would be unfair for me to write verbose Python and compare it to Clojure, but you see I never wrote any Python code, it's all from public sides, in this case Rosetta. And second I think it's great that you then stop by and show off a more concise version, we should be fair at all times.
Lau
2009-10-19 09:09:33
@Kyle: Very good point, thanks for making it.
Jacques Mattheij
2009-10-19 23:06:47
thread at HN about this posting:

http://news.ycombinator.com/item?id=890697

The short version: the timing of the second sample is off by a factor of 24 in my repeat test, the first example can be shortened by a good 50% without trying really hard, probably a lot more by someone more proficient in python.
Jacques Mattheij
2009-10-19 23:09:00
Line counting is a silly measure. Try stripping out comments and then gzipping the source, after making sure both pieces of code implement comparable algorithms.

The resulting filesize says a lot more than linecount.
Andrew Dalke
2009-10-20 00:00:59
To follow up with @James, other parts of the code can easily be made more terse and idiomatic. Also, you aren't even comparing two things the same. For example, 4 of the lines you count in the prime code are just to set up the upper limit, while the Clojure code uses an unbounded loop. The Python code can also easily be unbounded, saving in total about 7 LOC.

Your Clojure code doesn't do the same output. The Python code lists the primes in the form "M2 M3 M5 M7 M13 M17 M19 ..." as they are found, which yours does not do. That's another few lines of code. You also don't show the same set of comments that they do.

In addition, LOC is a bad metric. It encourages Python style like "if p == 2: return True", which means there's no syntax helping spot control flow. You also include blank lines, which means you are doing pure LOC and not SLOC.

As an example, the URL fetching code LOC includes 3 blank lines in Python's 12 lines while only 1 of Clojure's 11 is blank. That's 9 SLOC vs. 10. If LOC is important, then what about character count? Consider Python's svelte 570 bytes vs. 749. Why, that's 30% "more possibilities for typos"!  :)

Also, regarding #3, the Python code could have used a dict(name="Tyler Bennett", ...) for each row instead of a list, in which case it would be closer to what you have and not need the magic [-1] you complained about. Or used a list comprehension and a zip to minimize all the duplicate "name", "id", etc. Also, I see here too you don't reproduce the same output as the Python code - where's the code which prints "Employee Name", "Employee ID" and so forth?

The outputs between the Python and Clojure codes are only somewhat the same, which doesn't seem right for something which states to be a apples-to-apples comparison.
Lau
2009-10-20 00:05:50
@Andrew: You make some good points and I'll accept anything you have to say about my code, but please remember. I picked the Python examples from Rosetta because I was told that they were 'exemplary', I didn't consider optimizing them.

And yes, LOC is a bad and widely misunderstood metric in itself, so don't take is as more than a humoristic score-board for these excercies. I'm preparing an entire blogpost on quality of code and how to judge it. Your point with character-count is of course excellent and for the same reason I made the ironic comment ' I cant cram more code into that small space' :)

Thanks for stopping by, Lau.
d0m
2009-10-20 08:28:01
I found your article quite interesting. A little caveat thought, I don't agree with you on the easily readable clojure code.  I can practically read the python code and understanding its meaning as if I was reading a roman.. however, really not the same thing for clojure where I need to examine every lines twice. I'm not saying clojure is bad code -- far from that. However, I think by trying to optimize line of code you minimize clarity which is a no-no for me. 

Last thing, it would be really interesting if you could write the time it took you to write those clojure snippets. Does writting a really compact-clever 10 lines of code takes less time than a verbose python program? I don't think so. 

But please, don't take it bad.. I quite enjoyed reading the article and I've learnt a bit of clojure in the same time. :)
Lau
2009-10-20 10:02:58
@d0m: Hey, thanks for stopping by. Optimizing line-count while killing clarity is a big nono - I was simply an attempt at humor.

I can't tell you how long it took to write every snippet, but we're talking minutes not hours. In example #2 for instance, it was harder to find the [h2] pattern in the html than to write the parallelized code which dug it out.

Thanks, Lau
Antti Rasinen
2009-10-20 17:59:48
The groupby argument also works for Python as well. Just as with Clojure, we can use dicts to store the data. See http://gist.github.com/214369 for an implementation. Six lines and arguably prettier output than with the Clojure version.

Conclusions? Counting lines is silly when most of the lines are spent in human-computer interaction.
Paddy3118
2009-10-24 20:14:07
Don't shorten the Python, purely for the sake of line count.
Don't 'optimize' the Python for speed, when speed requirements are not known to be broken.
Do present clear and concise, maintainable and idiomatic Python however.

And if in doubt,

import this

:-)
Paddy3118
2009-10-24 23:03:06
Just looking at the Rosetta Code site, you would find that J language has answered a large set of examples and the coders usually take some time to *expand* what could, in most cases be a very short number of characters on one line as answers to a lot of tasks. (Thanks J guys).  They know that lines of code needs to be tempered by other considerations.

I have started to read a bit more about the J language, not because of the use of less lines in an answer, but because of the sometimes novel ways that the J language users use to solve problems that, when I read some of their explanations in talk pages, make me want to learn about the language that shapes their thought processes.

- Paddy.
Eric Moritz
2009-12-01 09:51:34
That last example can translate the Clojure code pretty much verbatim:

from collections import defaultdict
result = reduce(lambda x,y: x[y[-1]].append(y) or x,
        	sorted(data[1:], lambda x,y: cmp(y[-2], x[-2])),
                defaultdict(list))
N = 3
for department, recs in result.iteritems():
    print "\n".join([department] + ["[%s:%s]" % (x[0],x[-2]) for x in recs[:N]])

Only one line extra line for the defaultdict import.
Ludovic Kuty
2009-12-08 09:35:54
In the #2 example, I did three things to make it run:
 - replace (use net.cgrand.enlive-html) by (use 'net.cgrand.enlive-html)
 - replace (catch Exception 0) by (catch Exception _ 0)
 - replace xml in the first let binding by xml-url

Just to let you know.

Thanks for all the great Clojure examples on your blog.