MacSwing meets Enlive - Functional Social Webscraping!

2009-10-09 12:03:37

In this post I'll show you how to make a beautiful Swing application with all the Mac-trimmings, a functional webscraper which gracefully overlooks malformed html and finally how to have some exploratory fun with Clojure, REPL style!

Preface

Sometimes I get an idea for a blogpost and immediately I start a REPL and pick up the trail, but this time I really stepped in it. This time, we're pulling out the big guns. There's been a lot of talk about the 'death of the desktop app' for various valid reasons, but after this post, hopefully 'elegance & style' won't be valid critique points. This is what we're building:

swingapp

Originally I wanted to do a post on how to beautify Swing applications, which in my part of the world are notoriously ugly. Despite the fact that we're not 'fully there' yet, that just didn't make for an interesting post in itself. So what was this Swing app supposed to do? My original idea was to scrape Google or Bing, pick up links and highlight the ones which matched a certain domain. But seeing how 1) Google forbids that kind of use of their servers and 2) Bing is Microsoft's flagship, meaning you get 9 results, then BSOD, then 9 results and so on, I turned my attention towards social networks.

Let's get to it.

The Scraper

Thinking I'd be very quick about it, I resorted to Regular Expressions for the extraction of data - That cost me a few hours! In #Clojure on Freenode you can observe the following:

<LauJensen> ~regex ?
<clojurebot> Sometimes people have a problem, and decide to solve it with regular expressions. Now they have two problems.

Oh, if only I had listened. I spent about an hour trying to construct some regular expressions which would be adaptable to the various search engines without having to start from scratch every time. After having failed I turned to RSchulz of Scala fame for advice. He helped me get the results I wanted, but the solution had 3 problems which are inherent to regular expressions:

  1. Unreadable - at least they will be when I come back in 2 weeks
  2. Unforgiving - You get a match or you don't, no fault tolerance, hard to debug
  3. Crude - I would have to give up my rule-of-thumb regarding elegant code

I know what you're thinking: "Well all that's true of Perl, but people still use that" - I know I know, but it's not for me. For the sake of making a comparison, let me show you the difference between the 2 methods.

Regular Expressions

If you do a search on Bing.com and open the source code, you'll see that it contains tons of links. A part of these are inside a div-tag of the class "sb_tlst", so first off we would need to get the content of this div tag:

  (let [search-url     (str "http://www.bing.com/search?q=" phrase
                            "&go=&form=QBLH&filt=all&first=0")
        search-results (get-page search-url)
        div-results    (re-seq #"<div class=\"sb_tlst\">(.*?)</div>" search-results)

If 'search-results' contains the raw HTML output from querying bing.com, then div-results would contain a sequence of sequences. The outer seq holds every match of the sb_tlst div, and each inner seq contains what was matched followed by the group match. The latter is the (.*?), which is the interesting bit. So taking the 'last' of every item in div-results gets the data, example:

   CODE IS MIA
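
To make the shape of that result concrete, here is a small sketch run against a made-up HTML fragment. The markup in sample-html is a stand-in I invented for illustration, not Bing's real output, and the div pattern is an assumption based on the class name mentioned above.

```clojure
;; Hypothetical stand-in for a slice of a result page; not real Bing markup.
(def sample-html
  (str "<div class=\"sb_tlst\"><h3><a href=\"http://clojure.org\">Clojure</a></h3></div>"
       "<div class=\"sb_tlst\"><h3><a href=\"http://planet.clojure.in\">Planet Clojure</a></h3></div>"))

;; With a capture group in the pattern, re-seq yields one vector per match:
;; [whole-match group-1]. The captured group is therefore the last element.
(def div-results
  (re-seq #"<div class=\"sb_tlst\">(.*?)</div>" sample-html))

(count div-results)          ; the outer seq: one entry per matched div
(count (first div-results))  ; the inner seq: whole match plus one group
(map last div-results)       ; only the captured contents of each div
```

Note that the non-greedy (.*?) stops at the first closing div it meets, so nested divs would already break this.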

That's the first result, and the content of the div-tag. This gives us 2 challenges: getting the title and getting the link. Thankfully Bing.com is very straightforward in that all headers (h3) contain both the link and the title, so we can define the regexes like so:

(for [result div-results]
  {:title (last (re-find #">([^<]+)<"          (last result)))
   :url   (last (re-find #"href=\"([^\"]+)\"" (last result)))})

What do you think of that, is it acceptable? This is beyond the shadow of a doubt, completely unacceptable in a production system, or an OpenSource application for that matter! I'll share with you a story from a freelancer I've worked with, a very experienced guy who's been around the block quite a few times. He was asked to help out with a production system, while the maintainer and developer of that system was away on vacation. It was written in Perl and was modelled after some weird circular architecture, where things were calling things which in the end called the caller. After reviewing the system for several hours he reported back to the CEO something to this effect

This system is coded in such an ugly and unreadable style, that I've looked at it for several hours and still have no idea what the components do. I cannot help you with this.

What do you think the CEO replied? "Well, you don't have to tell me it's Perl twice!" ? No - He fired the lead-developer. Speaking about elegant, easy to read code, isn't just for style-points. Sometimes it's business critical.

So, not wanting to fall into that trap, let's roll out a big gun. Meet...

Enlive

Enlive is the brainchild of CGrand whom I've mentioned a few times on this blog. It's based on Tagsoup which is an HTML parser designed specifically for the kind of HTML you see out there on the web: Crude, malformed and incoherent. The term Tagsoup refers to malformed HTML and is said to be 'the most severe problem in web authoring' today, but I have a feeling that's not going to trouble me much.

Enlive lets me define selectors (à la CSS) in my comfortable Lispy syntax, and it can then extract data, transform it based on templates and much more. But let's answer the big question of the day: How do we build a webscraper with it? Well, you'll need 2 things: Firefox and a REPL.

CGrand says that I can use CSS selectors to pick out my data, so let's see if he's right. I use the Webdeveloper plugin for Firefox to view styling, you can use whatever you want, even read the source if you're not in a hurry. We'll start with Reddit:

scrreddit

As you can see right below the address bar, Reddit has structured their page like so: html -> body -> div id="content" -> div id="siteTable" etc etc. So what I want now is to pull out all the links in the div with the id "siteTable", let's try:

macswing> (def page (html-resource (java.net.URL. "http://www.reddit.com/search?q=clojure")))
#'macswing/page
macswing> (select page [:div#siteTable [:a (attr? :href)]])
{:tag :a, :attrs {:href "http://blog.n01se.net/?p=41", :class "title "}, :content ["Clojure in Clojure"]}
{:tag :a, :attrs {:href "http://www.reddit.com/domain/blog.n01se.net"}, :content ["blog.n01se.net"]}
{:tag :a, :attrs {:class "author", :href "http://www.reddit.com/user/lispnik"}, :content ["lispnik"]}
{:tag :a, :attrs {:class "subreddit hover", :href "http://www.reddit.com/r/programming/"}, :content ["programming"]}
{:tag :a, :attrs {:target "_parent", :href "http://www.reddit.com/r/programming/comments/90c0x/clojure_in_clojure/", :class "comments"}, :content ["13 comments"]}
...

The result from my selector [:div#siteTable [:a (attr? :href)]] was a list of all the links and their titles. That's almost a little too easy. The first link is good but the rest are not, the difference being the class="title" attribute. I can filter on that simply by selecting as follows:

[:div#siteTable [:a.title (attr? :href)]]

It can't be THAT easy? Let's try another site, Mixx.com:

mixx

So it looks like it doesn't need to be any more complicated than Reddit: simply select the #mod_search_results div-tag and pull out the links. Let's try:

macswing> (def page (html-resource (java.net.URL. "http://www.mixx.com/search?query=clojure&x=0&y=0")))
#'macswing/page
macswing>  (select page [:div#mod_search_results [:a (attr? :href)]])
{:tag :a, :attrs {:rel "bookmark", :href "http://www.mixx.com/stories/8182751/jruby_groovy_scala_clojure_job_trends_indeed_com"}, :content ["jruby, groovy, scala, clojure Job Trends | Indeed.com"]}
{:tag :a, :attrs {:target "_blank", :rel "nofollow", :href "http://www.indeed.com/jobtrends?q=+jruby%2C+groovy%2C+scala%2C+clojure"}, :content ["view story"]}
{:tag :a, :attrs {:title "Login to vote!", :class "login", :href "https://www.mixx.com/login?return=http%3A%2F%2Fwww.mixx.com%2Fstories%2F8182751%2Fjruby_groovy_scala_clojure_job_trends_indeed_com"}, :content [{:tag :span, :attrs {:class "alt"}, :content ["Vote"]}]}
{:tag :a, :attrs {:class "fn url", :href "http://www.mixx.com/users/twittersubmitter"}, :content ["TwitterSubmitter"]}
{:tag :a, :attrs {:target "_blank", :rel "nofollow", :href "http://www.indeed.com/jobtrends?q=+jruby%2C+groovy%2C+scala%2C+clojure"}, :content ["http://www.indeed.com/jobtrends?q..."]}

Ok, so that's not quite what we wanted. It seems Mixx is throwing in a bunch of links which are irrelevant for us. Actually the only link which looks like a hit is the first one, with the attribute :rel set to "bookmark". What do you think will happen if I change my selector from

[:div#mod_search_results [:a (attr? :href)]]

to

[:div#mod_search_results [:a (attr= :rel "bookmark") (attr? :href)]]

Let's see:

macswing> (select page [:div#mod_search_results [:a (attr= :rel "bookmark") (attr? :href)]])
{:tag :a, :attrs {:rel "bookmark", :href "http://www.mixx.com/stories/8182751/jruby_groovy_scala_clojure_job_trends_indeed_com"},
 :content ["jruby, groovy, scala, clojure Job Trends | Indeed.com"]}
{:tag :a, :attrs {:rel "bookmark", :href "http://www.mixx.com/stories/6855654/clojure_is_a_dynamic_programming_language_that_targets_the_java_virtual_machine"},
 :content ["Clojure is a dynamic programming language that targets the Java Virtual Machine."]}
{:tag :a, :attrs {:rel "bookmark", :href "http://www.mixx.com/stories/8107164/programming_clojure"},
 :content ["Programming Clojure"]}
{:tag :a, :attrs {:rel "bookmark", :href "http://www.mixx.com/stories/8219960/chaos_theory_vs_clojure_best_in_class_the_blog"},
 :content ["Chaos Theory vs Clojure « Best In Class ? The Blog"]}
{:tag :a, :attrs {:rel "bookmark", :href "http://www.mixx.com/stories/7994944/scala_vs_clojure_round_2_concurrency_best_in_class_the_blog"},
 :content ["Scala vs Clojure ? Round 2: Concurrency! « Best In Class ? The Blog"]}

Bingo! Now we're getting the data we need. I'm sure most of you reading will now easily be able to adapt Enlive to whatever site needs scraping, but let me help you along:
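
Here's a minimal sketch of the pattern you can try without hitting anybody's servers. It assumes Enlive is on the classpath and required as html; the markup in sample-page and the helper name hits are made up for the example, only the selector is the one from above.

```clojure
(ns scrape-sketch
  (:require [net.cgrand.enlive-html :as html]))

;; Made-up markup imitating the two kinds of links Mixx returned above.
(def sample-page
  (html/html-snippet
   "<div id=\"mod_search_results\">
      <a rel=\"bookmark\" href=\"http://example.com/story/1\">First <b>story</b></a>
      <a rel=\"nofollow\" href=\"http://example.com/out/1\">view story</a>
      <a rel=\"bookmark\" href=\"http://example.com/story/2\">Second story</a>
    </div>"))

(defn hits
  "Hypothetical helper: one {:title .. :url ..} map per node matching selector."
  [page selector]
  (for [node (html/select page selector)]
    {:title (html/text node)                 ; text of the node and its children
     :url   (get-in node [:attrs :href])}))

;; Only the two rel="bookmark" links should survive the filter.
(hits sample-page
      [:div#mod_search_results [:a (html/attr= :rel "bookmark") (html/attr? :href)]])
```

Adapting this to another site should mostly be a matter of swapping the selector, which is exactly the property the rest of the post leans on.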

Right, now since this is a demo app I'll stick with these 2. Let's look at how to integrate that into our Swing Application. I want to define a little function which takes a site to scrape, the phrase to use and a 'table row object' to report back to. Because we brought Enlive along all that separates one scrape from another is the phrase and the selector, so let's make a function with that in mind. I'll comment on it in 2 parts, #1:

(defn web-search [rows phrase engine]
  (let [phrase  (.replaceAll phrase " " "+")
        site    (condp = engine
                  :Mixx
                    {:url      (str "http://www.mixx.com/search?query=" phrase)
                     :selector (selector [:div#mod_search_results [:a (attr? :href) (attr= :rel "bookmark")]])}
                  :Reddit
                    {:url      (str "http://www.reddit.com/search?q=" phrase)
                     :selector (selector [:div#siteTable [:a.title (attr? :href)]])}
                  (JOptionPane/showMessageDialog nil "Not yet implemented"))]

I get a normal search phrase, wherein I replace all spaces with +'s, so "clojure sql" => "clojure+sql". Then I ask what the parameter engine is equal to. Whatever hit I get I return a hashmap with the query-url and the Enlive selector. There's an implicit :else clause at the bottom, saying that if nothing matched I pop up a dialog saying "Not yet implemented". The last part, the logic is as follows:
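
The lookup on its own can be sketched like this. site-for is a name I made up, the selectors are left out, and I've swapped .replaceAll for java.net.URLEncoder, which also turns spaces into +'s but escapes other awkward characters too. The default clause returns nil here, standing in for the dialog call, which itself returns nil.

```clojure
(defn site-for
  "Hypothetical helper: maps an engine keyword to its search URL, or nil."
  [engine phrase]
  (let [phrase (java.net.URLEncoder/encode phrase "UTF-8")]
    (condp = engine
      :Mixx   {:url (str "http://www.mixx.com/search?query=" phrase)}
      :Reddit {:url (str "http://www.reddit.com/search?q=" phrase)}
      ;; A trailing expression without a test is condp's default.
      ;; Leave it out and an unknown engine throws IllegalArgumentException.
      nil)))

(site-for :Reddit "clojure sql") ; {:url "http://www.reddit.com/search?q=clojure+sql"}
(site-for :Digg "clojure sql")   ; nil
```

Because the default evaluates to nil, a plain (if-not site ...) is enough to detect an engine that isn't implemented.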

    (if-not site
      0
      (let [selection    (select (html-resource (java.net.URL. (site :url))) (site :selector))
            cnt          (count selection)]
        (dorun
         (map #(.add rows (Vector. [(str (inc %1))
                                    (apply str (map text (:content %2)))
                                    (:href (:attrs %2))]))
              (range cnt) selection))
        (when (zero? cnt)
          (JOptionPane/showMessageDialog nil "Your search returned 0 results"))
        cnt))))

I start out by saying that if I don't have a site (ie. not implemented) then just return 0, we have no results.

If we do have a site, then I define my selection as being the Enlive selection of the html-feed. Then in the map I know I need to convert my results into a Java Vector since it's going into a JTable, so in one blow I construct a Vector containing an index, the content of my selection (the text) and the value of the href attribute, the link. If I had zero results I notify the user, and in either case I return the count. Is that the simplest webscraper you've ever seen?
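
If you want to poke at the row-building step without a network connection, here is a sketch that feeds it hand-written nodes. The selection data and the name add-rows! are made up, the content is joined with plain str instead of Enlive's text, and I've used doseq rather than (dorun (map ...)) since the loop is there purely for its side effects.

```clojure
(import 'java.util.Vector)

;; Hypothetical stand-in for an Enlive selection, shaped like the REPL output above.
(def selection
  [{:tag :a
    :attrs {:href "http://blog.n01se.net/?p=41" :class "title "}
    :content ["Clojure in Clojure"]}
   {:tag :a
    :attrs {:href "http://www.mixx.com/stories/8107164/programming_clojure"}
    :content ["Programming " "Clojure"]}])

(defn add-rows!
  "Appends one Vector per node to rows and returns the number of nodes."
  [^Vector rows selection]
  (doseq [[idx node] (map-indexed vector selection)]
    (.add rows (Vector. [(str (inc idx))              ; 1-based row number
                         (apply str (:content node))  ; assumes string-only content
                         (get-in node [:attrs :href])])))
  (count selection))

(def rows (Vector.))
(add-rows! rows selection) ; 2
```

The nested Vector of Vectors is the shape a DefaultTableModel accepts for its row data, which is why the conversion happens here rather than in the UI code.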

The Desktop App

I've long been asking for an alternative to Swing, JavaFX and QtJambi for various reasons, one of them being style. Recently however, I discovered an initiative to bring some of the aesthetic qualities of OSX to Swing, check out the project here: MacWidgets.

Doing Java-Interop for an extended period of time is not encouraged, so I'll walk you through some of the parts where Clojure one-ups Java and then I'll leave the remaining experimentation to you guys. You can see all the Widgets: here.

Java is for the most part instantiating classes and applying methods to them, and in a way you have to cater to that when doing interop. In other cases, not so much. Every so often you need to implement an 'onClick' event, something which fires whenever the component is clicked. Let's say I want to be able to do this:

(on-click button (doseq [r results]
                   (apply sanitizer r)
                   (.update-ui ui)))

We know that to add an onClick event, we just have to override the existing method, so it's pretty straightforward. The only tricky thing here is the code which is supposed to fire in the event of a click, because if on-click were an ordinary function, that code would be evaluated immediately when I pass it in - not the effect we want. Clojure provides a way out of course, the infamous Lisp macro system where you control evaluation. An example could be:

(defmacro on-click [obj & body]
  `(.addActionListener ~obj
         (proxy [ActionListener] []
           (actionPerformed [e#] ~@body))))

The backquote ` which I wrap my code-body in is the syntax-quote: it hands the code back as a template instead of evaluating it, and it lets me use a few reader-macros inside. Since I control evaluation, anything that doesn't have a special symbol next to it is carried over literally (symbols are merely resolved to their namespace), so the call to .addActionListener is emitted exactly as written. But ~obj unquotes, inserting whatever form was passed as 'obj', and ~@body splices in the forms of the body. Similarly the e# is an auto-gensym, shorthand for what you'd otherwise do with (gensym). It's a special precaution you need to take when writing macros. It makes e's name something like e__1234__auto__, guaranteeing that it won't conflict with anything in the user-provided code. But Macros are 1) A unique property of Lisps, setting them greatly apart and above all other languages, 2) A subject for a later blogpost. Paul Graham wrote 'On Lisp' which extensively deals with Macros throughout its 476 pages - it's a big topic.
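
If you'd like to see the macro at work at the REPL, something along these lines should do. The button, the clicks atom and the hand-fired event are my own additions for the demo; the macro is the one from above with its imports spelled out.

```clojure
(import '(java.awt.event ActionEvent ActionListener)
        '(javax.swing JButton))

(defmacro on-click [obj & body]
  `(.addActionListener ~obj
     (proxy [ActionListener] []
       (actionPerformed [e#] ~@body))))

;; Look at the generated code without running it: the parameter gets a
;; generated name and the body is spliced in unevaluated.
(macroexpand-1 '(on-click button (println "clicked")))

;; Hypothetical usage: count clicks in an atom.
(def clicks (atom 0))
(def button (JButton. "Search"))
(on-click button (swap! clicks inc))

;; Fire the registered listeners by hand, so no window or mouse is needed.
(doseq [^ActionListener l (.getActionListeners button)]
  (.actionPerformed l (ActionEvent. button ActionEvent/ACTION_PERFORMED "click")))

@clicks ; 1
```

One thing to keep in mind: proxy binds the symbol this inside its method bodies, so a this in the code you hand to on-click refers to the listener, not to whatever surrounds the call.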

I showed you that, simply to demonstrate how sometimes we can abstract our way out of Java-Interop, while at other times we have to suffer a little bit.

One of the macros in core which helps us a bit is 'doto', which, like 'on-click', takes 2 arguments: x & body. It will evaluate x and walk through the body, evaluating all its forms with x as the first argument, and finally return x. I've tried to contrast one way of using doto against traditional Java, which you can see in this image:

doto-vs-java

I know I'm bordering on abuse of 'doto', but that's only a compliment in this regard. When you look at the Clojure statement you intuitively see what's going on. First I'm declaring a SourceList, then I'm adding stuff to its model, then setting the UI, then implementing the click-handler. On the Java side you're looking at a bunch of objects all referencing each other non-sequentially and to me it's just a harder read than the Clojure version - but maybe I'm biased?
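
For those who can't see the image, the same idea in miniature, using a plain ArrayList instead of Swing widgets. The list contents and the names engines-a and engines-b are just for illustration.

```clojure
(import '(java.util ArrayList))

;; Without doto: name the object, then repeat that name on every call.
(def engines-a
  (let [l (ArrayList.)]
    (.add l "Mixx")
    (.add l "Reddit")
    l))

;; With doto: each form receives the new ArrayList as its first argument,
;; and the list itself is returned at the end.
(def engines-b
  (doto (ArrayList.)
    (.add "Mixx")
    (.add "Reddit")))

(= engines-a engines-b) ; true
```

The Swing version is the same trick with more calls: construct the component once, then list the setters underneath it.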

This however is what the largest chunks of the UI code look like - it's not pretty and it's not exciting to write, but that's Swing for you I guess. Finally a note on the interface:

swingapp

Type whatever you like as the search string, then click "Mixx" if you want to search there or "Reddit" if you want that instead. Results will be appended to each other. Click the little trash-can in the lower right to clear the results. Looks kinda nice for a Swing app, doesn't it? But please remember, it's not written to be user-friendly, thread-safe, 'nice' or anything other than just this: To show off a Mac styled look.

Conclusion

That's it, we're done. Now people have been talking about the death of desktop apps, but I honestly don't think we'll see that anytime soon. Sure they'll be tightly integrated with online services and such, but they still have their place. There are several UI toolkits available, a few are crossplatform. They come with a set of multithreading problems of their own.

Secondly I demonstrated Enlive as a contrast to the more traditional way of scraping using text/pattern-matching, in this case regex. Although Enlive greatly outshines the competition, I didn't demonstrate perhaps the most fascinating aspect of it: templates. It would be possible to define a selection scheme for Digg.com and via a template have that outputted in the exact same format/layout as Reddit. That may be quite pointless, but there are many uses where that type of template based transformation can be applied to good effect.

About the code:

As always I hope you enjoyed the read and will let me know if anything is unclear,

/Lau

Walt
2009-10-15 00:28:27
Hi, 

In the defn for web-search, "text" is an undefined symbol on line 41. 

I'm enjoying your posts a lot, BTW.
Lau
2009-10-15 09:31:54
@Walt: Text is a function from enlive so you should double check your imports.

/Lau
Meikel
2009-10-15 13:57:58
> I discovered an initiative to bring some of the *aesthetic qualities of OSX* to Swing.

Ok. Who are you? You are not Lau! Where is the real Lau? Are you holding him hostage? ;)
Meikel
2009-10-15 14:35:13
Some more comments on the implementation:

- *iek* (dorun (map ...))
- beware of "this" in the on-click body
- future is not equivalent to SwingUtilities/invokeLater
Walt
2009-10-15 14:52:20
Ah, my enlive was too old. All works now. Thanks again.
Lau
2009-10-15 23:00:55
Hey Meikel - 

Future isn't equivalent, but in this case it serves the same purpose. And generally, (dorun/doall (map)) should be (doseq) - agreed? :)

/Lau
Meikel
2009-10-15 23:18:11
(doall (map ..)) has its uses. But (dorun (map ...)) should most likely be a doseq. Yes.
Daniel Sobral
2009-11-04 16:35:19
I'll now try to write a program in Clojure to extract those tags without knowing Clojure. I'll then evaluate it by the time it took me, and how legible I think it is without experience with Clojure.

Which isn't really fair, as I have done some Lisp before.

If you can't be bothered to learn regexp, which isn't exactly Turing-complete but is still quite powerful, then you really shouldn't try to do stuff with it. Or criticize it.

If you had learned regexp, you'd know that its usefulness is searching for patterns in unstructured text (or text whose structure is not relevant to the pattern), and would never have attempted to do it.

On the other hand, if you had to get data from a site whose HTML is so broken that no JVM parser can make heads or tails of it, like I had on one occasion, you'd be wasting your time if you tried to use anything else.
Daniel Sobral
2009-11-04 16:40:25
Oh, btw, Ruby lets you do stuff with its AST, which is as powerful as Lisp macros.