Since we've had so much fun with multiple cores running at once, how about upping the game to play with multiple servers? Hadoop is a framework for distributed computing, which lets us process jobs on multiple servers at once, giving us more power *grunt*. In this first post I'll run through how to set up your first Hadoop server running in VirtualBox using Arch.
I'm doing these experiments on my tiny MacBook Pro laptop, so I want my Linux installation in the VBox to be as lean and clean as possible. Arch strikes a perfect balance between functionality and bloat, and for something as simple as running a Hadoop server it's very easy to set up.
I think it's a beautiful thing when a cleanly installed Linux replies "No entries" to "netstat -lnput" right after installation. Arch lets you build your system from the ground up, and although that takes a little longer than Ubuntu, it might just make for a better end result.
Clojure is an excellent language for writing data parsers et al, so what could be more fun than taking our regular code and processing it on a multiserver network? In industry, many tasks are of such dimensions that it's pointless to run them on a single server, so if you have something like Flightcaster in mind, you need to get comfortable with distributed computing. Secondly, Hadoop is Java based, meaning that getting my hands all the way from Clojure into the Engine Room is very doable.
Worth mentioning as well is the fact that there are already a couple of Clojure interfaces out in the open. As most people know, the crew behind Flightcaster released Crane, and Stuart Sierra released the creatively named clojure-hadoop library.
Thanks to the kind donations I was able to purchase Vimeo Plus, so you can now follow the screencasts in HD, hopefully giving you a clearer rendering of the text! If you know all there is to know about installing Arch and getting Hadoop up and running in Pseudo Distributed Mode, then feel free to skip this entire post. It's a mandatory first stop for me, to ensure that everyone can follow future experiments using Hadoop.
Since this is HD, double-click for fullscreen or go to the Vimeo site.
For your own set up, these are the things you need to change:
sshd: ALL: ALLOW
java: ALL: ALLOW
daemons=(... sshd rsyncd ...)
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
~/hadoop/conf/core-site.xml
Pseudo xml: property: name: fs.default.name value: hdfs://localhost:9000
~/hadoop/conf/hdfs-site.xml
Pseudo xml: property: name: dfs.replication value: 1
~/hadoop/conf/mapred-site.xml
Pseudo xml: property: name: mapred.job.tracker value: localhost:9001
All of the XML configuration files are 6 lines long - I hope everybody is cool with that :)
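For anyone who would rather copy and paste than squint at the screencast, here is a sketch of what the pseudo XML above expands to. I'm assuming the usual Hadoop layout of a single `<configuration>` root wrapping one `<property>`; the names and values are the ones listed above, and everything else in your files should stay as it is.

~/hadoop/conf/core-site.xml

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

~/hadoop/conf/hdfs-site.xml

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

~/hadoop/conf/mapred-site.xml

```xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```

A replication factor of 1 makes sense here because there is only one node to hold each block; on a real cluster you would likely want that number higher.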
This was the obligatory step which we just have to get over with. The next step is making or using some kind of Clojure interface to Hadoop in order to run jobs on it. Stay tuned for round #2.