In this post I'll show you how to make a beautiful Swing application with all the Mac trimmings, a functional web scraper which gracefully overlooks malformed HTML, and finally how to have some exploratory fun with Clojure, REPL style!
Sometimes I get an idea for a blogpost and immediately I start a REPL and pick up the trail, but this time I really stepped in it. This time, we're pulling out the big guns. There's been a lot of talk about the 'death of the desktop app' for various valid reasons, but after this post, hopefully 'elegance & style' won't be valid critique points. This is what we're doing:

Originally I wanted to do a post on how to beautify Swing applications, which in my part of the world are notoriously ugly. Despite the fact that we're not 'fully there' yet, that just didn't make for an interesting post in itself. So what was this Swing app supposed to do? My original idea was to scrape Google or Bing, pick up links and highlight the ones which matched a certain domain. But seeing how 1) Google forbids that kind of use of their servers and 2) Bing is Microsoft's flagship, meaning you get 9 results, then BSOD, then 9 results and so on, I turned my attention towards social networks.
Let's get to it.
Thinking I'd be very quick about it, I resorted to regular expressions for the extraction of data. That cost me a few hours! In #Clojure on Freenode you can observe the following:
<LauJensen> ~regex ?
<clojurebot> Sometimes people have a problem, and decide to solve it with regular expressions. Now they have two problems.
Oh, if only I had listened. I spent about an hour trying to construct some regular expressions which would be adaptable to the various search engines, without having to start from scratch every time. After having failed I turned to RSchulz of Scala fame for advice. He helped me get the results I wanted, but the solution had 3 problems which are inherent to regular expressions.
I know what you're thinking: "Well all that's true of Perl, but people still use that" - I know I know, but it's not for me. For the sake of making a comparison, let me show you the difference between the 2 methods.
If you do a search on Bing.com and open the source code, you'll see that it contains tons of links. Some of these are inside a div-tag of the class "sb_tlst", so first off we would need to get the content of this div tag:
(let [search-url (str "http://www.bing.com/search?q=" phrase "&go=&form=QBLH&filt=all&first=0") search-results (get-page search-url) div-results (re-seq #"
If 'search-results' contains the raw HTML output from querying bing.com, then div-results would contain a sequence of sequences. The outer seq holds every match of the sb_tlst div and each inner seq contains what was matched and the group match. The latter is the (.*?), which is the interesting bit. So taking the 'last' of every item in div-results gets the data, example:
CODE IS MIA
That's the first result, and the content of the div-tag. This gives us 2 challenges: getting the title and the link. Thankfully Bing.com is very straightforward in that all headers (h3) contain both the link and title, so we can define the regexes like so:
(for [result div-results] {:title (last (re-find #">([^<]+)<" (last result))) :url (last (re-find #"
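To make the regex approach concrete, here's a minimal sketch of both steps run against a tiny hand-written HTML string instead of a live page. The sample markup, the div pattern and the href pattern are my own assumptions for illustration (only the title pattern is the one shown above), so treat this as a rough reconstruction rather than the exact code:

```clojure
;; Hand-written stand-in for a fetched results page (an assumption, not real Bing markup).
(def sample-html
  (str "<div class=\"sb_tlst\"><h3><a href=\"http://clojure.org\">Clojure</a></h3></div>"
       "<div class=\"sb_tlst\"><h3><a href=\"http://example.com\">Example</a></h3></div>"))

;; Illustrative pattern: one capturing group (.*?) around the div's content.
(def div-results
  (re-seq #"<div class=\"sb_tlst\">(.*?)</div>" sample-html))

;; Each element of div-results is a vector: [whole-match captured-group].
;; Taking `last` of a match therefore yields the content of the div.
(def hits
  (for [result div-results]
    {:title (last (re-find #">([^<]+)<" (last result)))
     ;; Hypothetical href pattern, written for this sample only.
     :url   (last (re-find #"href=\"([^\"]+)\"" (last result)))}))
```

Even on this toy input, every pattern is tied to the exact shape of the markup, which is the fragility I'm complaining about.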
What do you think of that, is it acceptable? This is, beyond the shadow of a doubt, completely unacceptable in a production system, or an open source application for that matter! I'll share with you a story from a freelancer I've worked with, a very experienced guy who's been around the block quite a few times. He was asked to help out with a production system, while the maintainer and developer of that system was away on vacation. It was written in Perl and was modelled after some weird circular architecture, where things were calling things which in the end called the caller. After reviewing the system for several hours he reported back to the CEO something to this effect:
This system is coded in such an ugly and unreadable style, that I've looked at it for several hours and still have no idea what the components do. I cannot help you with this.
What do you think the CEO replied? "Well, you don't have to tell me it's Perl twice!"? No - he fired the lead developer. Talking about elegant, easy-to-read code isn't just for style points. Sometimes it's business critical.
So, not wanting to fall into that trap, let's roll out a big gun. Meet...
Enlive is the brainchild of CGrand whom I've mentioned a few times on this blog. It's based on Tagsoup which is an HTML parser designed specifically for the kind of HTML you see out there on the web: Crude, malformed and incoherent. The term Tagsoup refers to malformed HTML and is said to be 'the most severe problem in web authoring' today, but I have a feeling that's not going to trouble me much.
Enlive lets me define selectors (à la CSS) in my comfortable Lispy syntax and it can then extract data, transform it based on templates and much more. But let's answer the big question of the day: How do we build a webscraper with it? Well, you'll need 2 things: Firefox and a REPL.
CGrand says that I can use CSS selectors to pick out my data, so let's see if he's right. I use the Webdeveloper plugin for Firefox to view styling; you can use whatever you want, even read the source if you're not in a hurry. We'll start with Reddit:
As you can see right below the address bar, Reddit has structured their page like so: html -> body -> div id="content" -> div id="siteTable" etc. So what I want now is to pull out all the links in the div with the id "siteTable". Let's try:
macswing> (def page (html-resource (java.net.URL. "http://www.reddit.com/search?q=clojure")))
#'macswing/page
macswing> (select page [:div#siteTable [:a (attr? :href)]])
{:tag :a, :attrs {:href "http://blog.n01se.net/?p=41", :class "title "}, :content ["Clojure in Clojure"]}
{:tag :a, :attrs {:href "http://www.reddit.com/domain/blog.n01se.net"}, :content ["blog.n01se.net"]}
{:tag :a, :attrs {:class "author", :href "http://www.reddit.com/user/lispnik"}, :content ["lispnik"]}
{:tag :a, :attrs {:class "subreddit hover", :href "http://www.reddit.com/r/programming/"}, :content ["programming"]}
{:tag :a, :attrs {:target "_parent", :href "http://www.reddit.com/r/programming/comments/90c0x/clojure_in_clojure/", :class "comments"}, :content ["13 comments"]}
...
The result from my selector [:div#siteTable [:a (attr? :href)]] was a list of all the links and their titles. That's almost a little too easy. The first link is good but the rest are not, the difference being the class="title" attribute. I can filter on that simply by selecting as follows:
[:div#siteTable [:a.title (attr? :href)]]
It can't be THAT easy? Let's try another site, Mixx.com:
So it looks like it doesn't need to be any more complicated than Reddit: simply select the #mod_search_results div-tag and strip out the links. Let's try:
macswing> (def page (html-resource (java.net.URL. "http://www.mixx.com/search?query=clojure&x=0&y=0")))
#'macswing/page
macswing> (select page [:div#mod_search_results [:a (attr? :href)]])
{:tag :a, :attrs {:rel "bookmark", :href "http://www.mixx.com/stories/8182751/jruby_groovy_scala_clojure_job_trends_indeed_com"}, :content ["jruby, groovy, scala, clojure Job Trends | Indeed.com"]}
{:tag :a, :attrs {:target "_blank", :rel "nofollow", :href "http://www.indeed.com/jobtrends?q=+jruby%2C+groovy%2C+scala%2C+clojure"}, :content ["view story"]}
{:tag :a, :attrs {:title "Login to vote!", :class "login", :href "https://www.mixx.com/login?return=http%3A%2F%2Fwww.mixx.com%2Fstories%2F8182751%2Fjruby_groovy_scala_clojure_job_trends_indeed_com"}, :content [{:tag :span, :attrs {:class "alt"}, :content ["Vote"]}]}
{:tag :a, :attrs {:class "fn url", :href "http://www.mixx.com/users/twittersubmitter"}, :content ["TwitterSubmitter"]}
{:tag :a, :attrs {:target "_blank", :rel "nofollow", :href "http://www.indeed.com/jobtrends?q=+jruby%2C+groovy%2C+scala%2C+clojure"}, :content ["http://www.indeed.com/jobtrends?q..."]}
Ok, so that's not quite what we wanted. It seems Mixx is throwing in a bunch of links which are irrelevant for us. Actually the only link which looks like a hit is the first one, with the attribute :rel set to "bookmark". What do you think will happen if I change my selector from
[:div#mod_search_results [:a (attr? :href)]]
to
[:div#mod_search_results [:a (attr= :rel "bookmark") (attr? :href)]]
Let's see:
macswing> (select page [:div#mod_search_results [:a (attr= :rel "bookmark") (attr? :href)]])
{:tag :a, :attrs {:rel "bookmark", :href "http://www.mixx.com/stories/8182751/jruby_groovy_scala_clojure_job_trends_indeed_com"}, :content ["jruby, groovy, scala, clojure Job Trends | Indeed.com"]}
{:tag :a, :attrs {:rel "bookmark", :href "http://www.mixx.com/stories/6855654/clojure_is_a_dynamic_programming_language_that_targets_the_java_virtual_machine"}, :content ["Clojure is a dynamic programming language that targets the Java Virtual Machine."]}
{:tag :a, :attrs {:rel "bookmark", :href "http://www.mixx.com/stories/8107164/programming_clojure"}, :content ["Programming Clojure"]}
{:tag :a, :attrs {:rel "bookmark", :href "http://www.mixx.com/stories/8219960/chaos_theory_vs_clojure_best_in_class_the_blog"}, :content ["Chaos Theory vs Clojure « Best In Class ? The Blog"]}
{:tag :a, :attrs {:rel "bookmark", :href "http://www.mixx.com/stories/7994944/scala_vs_clojure_round_2_concurrency_best_in_class_the_blog"}, :content ["Scala vs Clojure ? Round 2: Concurrency! « Best In Class ? The Blog"]}
Bingo! Now we're getting the data we need. I'm sure most of you reading will now easily be able to adapt Enlive to whatever site needs scraping, but let me help you along:
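If you want to experiment with selectors without hitting a live site, here's a minimal sketch that parses a small hand-written page from a string. It assumes Enlive is on the classpath and aliased as html; the markup, the #results id and the names page, hits and summaries are made up for the example:

```clojure
;; Assumes the Enlive library is available on the classpath.
(require '[net.cgrand.enlive-html :as html])

;; Hand-written stand-in for a results page (not real Mixx or Reddit markup).
(def page
  (html/html-resource
    (java.io.StringReader.
      (str "<html><body><div id=\"results\">"
           "<a rel=\"bookmark\" href=\"http://example.com/a\">First hit</a>"
           "<a class=\"login\" href=\"http://example.com/login\">Vote</a>"
           "<a rel=\"bookmark\" href=\"http://example.com/b\">Second hit</a>"
           "</div></body></html>"))))

;; Same selector shape as above: links inside the div that have an href
;; and whose rel attribute equals "bookmark".
(def hits
  (html/select page [:div#results [:a (html/attr= :rel "bookmark") (html/attr? :href)]]))

;; Reduce every matching node to a [title link] pair.
(def summaries
  (map (fn [node] [(html/text node) (get-in node [:attrs :href])]) hits))
```

Swapping in another site should then mostly be a matter of changing the id and the attribute predicates.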
Right, now since this is a demo app I'll stick with these 2. Let's look at how to integrate that into our Swing application. I want to define a little function which takes a site to scrape, the phrase to use and a 'table row object' to report back to. Because we brought Enlive along, all that separates one scrape from another is the phrase and the selector, so let's make a function with that in mind. I'll comment on it in 2 parts, #1:
(defn web-search [rows phrase engine]
  (let [phrase (.replaceAll phrase " " "+")
        site   (condp = engine
                 :Mixx   {:url      (str "http://www.mixx.com/search?query=" phrase)
                          :selector (selector [:div#mod_search_results [:a (attr? :href) (attr= :rel "bookmark")]])}
                 :Reddit {:url      (str "http://www.reddit.com/search?q=" phrase)
                          :selector (selector [:div#siteTable [:a.title (attr? :href)]])}
                 (JOptionPane/showMessageDialog nil "Not yet implemented"))]
I get a normal search phrase, wherein I replace all spaces with +'s, so "clojure sql" => "clojure+sql". Then I ask what the parameter engine is equal to. Whatever hit I get, I return a hashmap with the query-url and the Enlive selector. The trailing expression in condp acts as the default (an implicit :else), saying that if nothing matched I pop up a dialog saying "Not yet implemented". The last part, the logic, is as follows:
    (if-not site
      0
      (let [selection (select (html-resource (java.net.URL. (site :url)))
                              (site :selector))
            cnt       (count selection)]
        (dorun
          (map #(.add rows (Vector. [(str (inc %1))
                                     (apply str (map text (:content %2)))
                                     (:href (:attrs %2))]))
               (range cnt)
               selection))
        (when (zero? cnt)
          (JOptionPane/showMessageDialog nil "Your search returned 0 results"))
        cnt))))
I start out by saying that if I don't have a site (i.e. the engine isn't implemented, in which case the dialog call returned nil and site is nil) then just return 0, we have no results.
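The lookup step can be tried on its own at the REPL. Below is a minimal sketch under a few assumptions: site-for is a hypothetical helper name, the selectors are shortened placeholder vectors rather than compiled Enlive selectors, and the default clause returns nil directly instead of popping up a dialog, so no GUI is needed:

```clojure
(defn site-for
  "Hypothetical helper: returns a map with :url and :selector for a known
  engine, or nil when the engine is not implemented."
  [phrase engine]
  (let [phrase (.replaceAll phrase " " "+")]
    (condp = engine
      :Mixx   {:url      (str "http://www.mixx.com/search?query=" phrase)
               :selector [:div#mod_search_results]}   ; placeholder data only
      :Reddit {:url      (str "http://www.reddit.com/search?q=" phrase)
               :selector [:div#siteTable]}            ; placeholder data only
      ;; The last, unpaired expression is condp's default clause.
      nil)))

(def reddit-site  (site-for "clojure sql" :Reddit))
(def unknown-site (site-for "clojure sql" :Digg))
```

Because the default clause yields nil, a plain if-not on the result is enough to tell "not implemented" apart from a real site.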
If we do have a site, then I define my selection as being the Enlive selection of the html-feed. Then in the map I know I need to convert my results into a Java Vector since it's going into a JTable, so in one blow I construct a Vector containing an index, the content of my selection (the text) and the value of the href attribute, the link. If I've added nothing because I had zero results I notify the user, and in either case I return that count. Is that the simplest webscraper you've ever seen?
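The row-building step is easy to play with in isolation, too. Here's a small sketch that turns Enlive-shaped node maps into java.util.Vector rows; the sample nodes are hand-written, and node-text and selection->rows are hypothetical names (node-text is a simplified stand-in for Enlive's text function):

```clojure
(import 'java.util.Vector)

(defn node-text
  "Hypothetical stand-in for Enlive's text: concatenates all strings found
  in a node's content, descending into child nodes."
  [node]
  (apply str (map #(if (string? %) % (node-text %)) (:content node))))

(defn selection->rows
  "Builds one Vector per node: 1-based index, link text and href."
  [selection]
  (map-indexed
    (fn [i node]
      (Vector. [(str (inc i)) (node-text node) (:href (:attrs node))]))
    selection))

;; Hand-written nodes in the shape Enlive's select returns.
(def rows
  (selection->rows
    [{:tag :a :attrs {:href "http://example.com/a"} :content ["First hit"]}
     {:tag :a :attrs {:href "http://example.com/b"}
      :content [{:tag :span :attrs {} :content ["Second"]} " hit"]}]))
```

I've used map-indexed here instead of mapping over (range cnt) and the selection in parallel; it's only a stylistic alternative and produces the same three columns.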
I've long been asking for an alternative to Swing, JavaFX and QtJambi for various reasons, one of them being style. Recently however, I discovered an initiative to bring some of the aesthetic qualities of OS X to Swing, check out the project here: MacWidgets.
Doing Java-Interop for an extended period of time is not encouraged, so I'll walk you through some of the parts where Clojure one-ups Java and then I'll leave the remaining experimentation to you guys. You can see all the Widgets: here.
Java is for the most part instantiating classes and applying methods to them and in a way you have to cater to that when doing interop. In other cases, not so much. Every so often you need to implement an 'onClick' event, something which fires whenever the component is clicked. Let's say I want to be able to do this:
(on-click button (doseq [r results] (apply santizer r) (.update-ui ui)))
We know that to add an onClick event, we just have to register an ActionListener whose actionPerformed method we override, so it's pretty straightforward. The only tricky thing here is the code which is supposed to fire in the event of a click: if on-click were an ordinary function, that code would be evaluated immediately as an argument, before the function is even called - not the effect we want. Clojure provides a way out of course, the infamous Lisp macro system where you control evaluation. An example could be:
(defmacro on-click [obj & body]
  `(.addActionListener ~obj
     (proxy [ActionListener] []
       (actionPerformed [e#] ~@body))))
The backquote ` (syntax-quote) which I wrap my code-body in means that the body is a template rather than code to run right away, and that I will be using reader macros throughout it. Since I control evaluation, anything that doesn't have a special symbol next to it is emitted literally (with symbols namespace-qualified), so a bare "obj" would stay the symbol obj and at runtime it would be whatever obj is at that time. But ~obj (unquote) substitutes whatever form was passed as the argument 'obj', and ~@body splices in all the forms of the body. Similarly, e# is an auto-gensym: the reader replaces it with a freshly generated symbol. It's a special precaution you need to take when writing macros. It makes e's name something like e__1248__auto__, guaranteeing that it won't conflict with anything in the user-provided code. But macros are 1) a unique property of Lisps, setting them greatly apart and above all other languages, 2) a subject for a later blogpost. Paul Graham wrote 'On Lisp' which extensively deals with macros throughout its 476 pages - it's a big topic.
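To see the delayed evaluation for yourself, here's a small sketch you can paste into a REPL. It repeats the on-click macro so the block stands alone; the JButton, the clicks atom and the use of doClick to simulate a click are my assumptions for the demo, not part of the actual app:

```clojure
(import '(java.awt.event ActionListener)
        '(javax.swing JButton))

(defmacro on-click [obj & body]
  `(.addActionListener ~obj
     (proxy [ActionListener] []
       (actionPerformed [e#] ~@body))))

;; Inspect the code the macro generates, without running it.
(def expansion
  (macroexpand-1 '(on-click button (println "clicked"))))

(def clicks (atom 0))
(def button (JButton. "Search"))

;; Registering the handler does not run the body...
(on-click button (swap! clicks inc))
(def clicks-before @clicks)

;; ...the body only runs once the button is actually "clicked".
(.doClick button)
(def clicks-after @clicks)
```

Printing expansion at the REPL shows the generated name that replaced e#, which makes the gensym business a lot less abstract.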
I showed you that, simply to demonstrate how sometimes we can abstract our way out of Java-Interop, while at other times we have to suffer a little bit.
One of the macros in core which helps us a bit is 'doto'. Like 'on-click', it takes 2 arguments: x & body. It will evaluate x once and walk through the body, evaluating all its forms with x as the first argument, and finally return x. I've tried to contrast one way of using doto against traditional Java, which you can see in this image:
I know I'm bordering on abuse of 'doto', but that's only a compliment in this regard. When you look at the Clojure statement you intuitively see what's going on. First I'm declaring a SourceList, then I'm adding stuff to its model, then setting the UI, then implementing the click-handler. On the Java side you're looking at a bunch of objects all referencing each other non-sequentially and to me it's just a harder read than the Clojure version - but maybe I'm biased?
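For a much smaller taste of doto than the SourceList code, here's a sketch with a plain java.util.ArrayList. The list and the names engines-a and engines-b are only stand-ins chosen so it runs without any Swing setup:

```clojure
(import 'java.util.ArrayList)

;; Java-style: name the object, call methods on it one by one, return it.
(def engines-a
  (let [l (ArrayList.)]
    (.add l "Mixx")
    (.add l "Reddit")
    l))

;; doto-style: the object is threaded in as the first argument of every
;; form, and doto returns the object itself when it is done.
(def engines-b
  (doto (ArrayList.)
    (.add "Mixx")
    (.add "Reddit")))
```

Both definitions end up with the same list; the doto version just doesn't need the temporary name.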
This however is what the largest chunks of the UI code look like - it's not pretty and it's not exciting to write, but that's Swing for you I guess. Finally a note on the interface:
Type whatever you like as the search string, then click "Mixx" if you want to search there or "Reddit" if you want that instead. Results will be appended to each other. Click the little trash-can in the lower right to clear the results. Looks kinda nice for a Swing app, doesn't it? But please remember, it's not written to be user-friendly, thread-safe, 'nice' or anything other than just this: to show off a Mac-styled look.
That's it, we're done. Now people have been talking about the death of desktop apps, but I honestly don't think we'll see that anytime soon. Sure they'll be tightly integrated with online services and such, but they still have their place. There are several UI toolkits available, and a few are cross-platform. They come with a set of multithreading problems of their own.
Secondly I demonstrated Enlive as a contrast to the more traditional way of scraping using text/pattern-matching, in this case regex. Although Enlive greatly outshines the competition, I didn't demonstrate perhaps the most fascinating aspect of it, templates. It would be possible to define a selection scheme for Digg.com and via a template have that outputted in the exact same format/layout as Reddit. That may be quite pointless, but there are many uses where that type of template based transformation can be applied to good effect.
About the code:
As always I hope you enjoyed the read and will let me know if anything is unclear,
/Lau