Clojure Vs Global Warming

Nobody who’s connected to the rest of the world, whether via TV or the Internet, is unaware of Global Warming – the phenomenon which threatens to destroy us all if we don’t collectively assume responsibility for the globe. Here’s my contribution to a solution, in 98 lines of computation-heavy Clojure!

Preface

As a Danish citizen I feel it’s mandatory to engage in this debate. Denmark recently hosted a major conference to facilitate solutions to the climate threats of this, the 21st century. The result, as you all probably know, was unfortunately a huge failure, so tons of CO2 were needlessly emitted by flying in all the foreign officials, security motorcades etc. etc.

I plan on getting a better result. I’ve learned that the National Oceanic and Atmospheric Administration (NOAA) has published about 3 Gigabytes of tarballed weather data, going back as far as 1929. My mission is now to organize and parse that data, to see exactly what our recent boom in CO2 emissions is doing to the environment. From the total effect I’ll be able to approximate Clojure’s contribution to Global Warming.

Why should I read this post?

Well, if you should read it, it’s because one or more of the following applies:

  • You care about Global Warming
  • You love heavily computational, parallelized Clojure
  • You want to know how to stream tarballs and GZips

Data

First we have to get the data from NOAA. For the sake of all of you who want to repeat these calculations to show the neighborhood kids that they shouldn’t spend their spare time lighting dumpsters on fire and generally wasting energy, I’ll go through every step:

#1: Preparing URLs

Every dataset is found on their FTP server in a subdirectory named after the year the data is from, and in that directory there’s a tar-ball called gsod_year.tar. I’m not a wget expert, so we’ll tag-team with Clojure:

(spit "urls"
   (apply str
        (map #(format "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/%d/gsod_%d.tar\n" % %)
              (range 1929 2010))))

--- OR ---

(->> (range 1929 2010)
	   (map #(format "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/%d/gsod_%d.tar\n" % %))
	   (apply str)
	   (spit "urls"))

(spit is from clojure.contrib.duck-streams)

I show both variants because you’ll want to get comfortable with both -> (thread as the 2nd item) and ->> (thread as the last item); they’re here to stay and are quite handy. Either of those snippets produces a file called ‘urls’ which contains links to all the tar-balls available (except 2010, which of course isn’t complete yet), totalling about 3 Gigabytes.
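
To get a quick feel for the difference, here’s a toy example of my own (not from the post), with the expansions shown as comments:

(-> 10 (/ 2))   ;; expands to (/ 10 2)  => 5
(->> 10 (/ 2))  ;; expands to (/ 2 10)  => 1/5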

To download all the data, issue this command from the terminal:

 $ mkdir dataset && cd dataset
 $ wget -i ../urls
--2010-01-20 17:18:53--  ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1929/gsod_1929.tar
           => `gsod_1929.tar'
Resolving ftp.ncdc.noaa.gov... 205.167.25.101
Connecting to ftp.ncdc.noaa.gov|205.167.25.101|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD /pub/data/gsod/1929 ... done.
==> SIZE gsod_1929.tar ... 71680
==> PASV ... done.    ==> RETR gsod_1929.tar ... done.
Length: 71680 (70K)

100%[=====================================>] 71.680      94,5K/s   in 0,7s

The first file is just 70K as you can see, but the last is 90M – they grow as the number of weather stations increases. Now you’ve got all of your tars sitting in the same directory, so if you want to extract the data, do this:

 $ for i in *.tar; do echo $i && tar -xf $i && rm $i; done
(..wait about 20 minutes..)

That would give you about 450,000 GZips weighing in at a heavy 10 Gigabytes. If you unrolled these files they’d expand into about 50 Gigabytes of raw text! Expanding that is too ambitious for my little laptop, so let’s think this through. When we’ve processed all the weather data we need it sorted by year (and perhaps month). To do a running sort of the data we’d have to hold the head, which is close to impossible in this case – even ‘ls’ struggles with the 450k files, so Java’s heap will too. The best way to attack this problem is to write a tar-ball reader which lets us peek at the GZips, and then a GZip reader which lets us work with the text without any pre-unpacking. Java’s IO is centered around Input/Output streams, so if we can expose the data via streams it will be presorted and I won’t have to overload my hard drive.

Processing

We’re looking down the barrel of several Gigabytes of raw text data, so we need to set up some kind of headless processing. When I speak about holding/losing the head, I’m referring to how we handle memory. If you keep a reference to the head of a lazy sequence while processing it, the entire sequence is accumulated in memory. Not holding the head means only keeping an item/chunk in memory at any given moment.
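
To make that concrete, here’s a toy example of mine (not from the post). The first form pins the head, because xs is still needed after the reduce, so every realized element is retained and the heap eventually blows up; the second form holds no head and runs in roughly constant memory:

(let [xs (map inc (range 100000000))]
  [(reduce + xs) (first xs)])            ;; holds the head -> OutOfMemoryError

(reduce + (map inc (range 100000000)))   ;; no head held -> constant memory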

First we have to narrow the field as much as possible, so let’s look at how the professionals do it:

[Image: Hockey Stick – (C) Al Gore]

As we can clearly see, the temperatures in the Northern Hemisphere explode from the start to the middle of the 1900s. So we’ll follow Mr. Gore’s lead and select the stations which record temperatures in the Northern Hemisphere (NH) – hopefully by the end of this blogpost we can reproduce the Hockey Stick using Clojure.

To do that, download the history file from NOAA’s website; it contains the IDs of all weather stations as well as their positions by longitude and latitude. The NH is defined by the latitude coordinates that are positive. The file starts with 20 lines of header information and then many lines like these:

010014 99999 SOERSTOKKEN                   NO NO    ENSO  +59783 +005350 +00490
010015 99999 BRINGELAND                    NO NO    ENBL  +61383 +005867 +03270
010016 99999 RORVIK/RYUM                   NO NO          +64850 +011233 +00140
010017 99999 FRIGG                         NO NO    ENFR  +59933 +002417 +00480
010020 99999 VERLEGENHUKEN                 NO SV          +80050 +016250 +00080
010030 99999 HORNSUND                      NO SV          +77000 +015500 +00120
010040 99999 NY-ALESUND II                 NO SV    ENAS  +78917 +011933 +00080
010050 99999 ISFJORD RADIO                 NO NO    ENIS  +78067 +013633 +00050

The structure of the file is based on fixed indices, meaning I know that e.g. the WBAN ID always starts at index 8 and ends at 12 (7–11 when zero-based). We cannot simply split this on words (\w+) because some elements are occasionally left out, while the index structure remains. That means we can pick out the stations that are in the Northern Hemisphere like so:

(defn northern-stations [filename]
  (let [data  (->> (line-seq (reader filename)) (drop 20))      ; skip the 20 header lines
        north (for [station data :when (= \+ (nth station 58))] ; index 58 = sign of the latitude
                (vec (take 2 (.split station " "))))]
    (reduce #(assoc %1 :stn  (conj (:stn %1) (%2 0))
                       :wban (conj (:wban %1) (%2 1)))
            {} north)))

If you’re a regular here that should be very straightforward, but in case you’re not:

  1. Start a line-reader and skip the first 20 lines
  2. Walk the lines and, :when index 58 is “+”, take the first 2 elements, STN and WBAN, and bundle them in a vector
  3. Reduce that series of vectors into a hashmap à la {:stn [id1 id2 id3] :wban [id1 id2 id3]}

The reason this doesn’t break is that neither STN nor WBAN is left out anywhere in the data; NOAA consistently uses all-nines IDs (99999/999999) to indicate that a field is blank. With all the valid stations extracted, we can now set up a data-parser which only extracts data from these stations.
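
Just to show the shape of what comes back – the path here is hypothetical, and the IDs are simply the ones from the sample lines above:

(northern-stations "res/history")
;; => {:stn ("010014" "010015" ...), :wban ("99999" "99999" ...)}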

Gunzip

We can start by unpacking the 1929 tarball manually to inspect the data. java.util.zip has a few classes for working with GZip archives, so let’s try them out and see if we can’t integrate them nicely.

My personal preference would be to write some kind of wrapper, which would enable me to work on the data like so:

(with-zipstream [stream "/path/to/gzip.gz"]
   (...work on data...))

Well, as you know it’s easy to conform a Lisp to your liking, so here’s my personal Gunship:

;; assumes the relevant classes are imported, e.g.:
;; (import '(java.io FileInputStream InputStreamReader BufferedReader)
;;         '(java.util.zip GZIPInputStream))
(defmacro with-zipstream [bindings & body]
  `(with-open [~(bindings 0) (->> (FileInputStream. ~(bindings 1))
                                  GZIPInputStream. InputStreamReader.
                                  BufferedReader.)]
     ~@body))

But although it’s tempting to abstract away using macros, this constitutes macro-abuse, since this is equally helpful:

(defn dump-stream [stream sz]
  ;; read sz bytes off the stream and re-expose them as an InputStream
  (let [buffer (make-array Byte/TYPE sz)]
    (.read stream buffer 0 sz)
    (ByteArrayInputStream. buffer)))

(defn line-stream [tarstream tarentry]
  ;; gunzip the current tar entry and eagerly return all of its lines
  (with-open [zipfile (->> (dump-stream tarstream (.getSize tarentry))
                           GZIPInputStream. InputStreamReader. BufferedReader.)]
    (doall (for [line (repeatedly #(.readLine zipfile)) :while line] line))))

This does much the same as core’s line-seq, except that it returns a non-lazy (doall) sequence of all the lines in the GZip.

Now you have access to the weather data which looks like so:

STN--- WBAN   YEARMODA    TEMP       DEWP      SLP        STP       VISIB
030050 99999  19291001    45.3  4    40.0  4  1001.6  4  9999.9  0   17.1
030050 99999  19291002    49.5  4    45.2  4   977.6  4  9999.9  0    9.3
030050 99999  19291003    49.0  4    41.7  4   975.7  4  9999.9  0   10.9
030050 99999  19291004    45.7  4    38.5  4   992.0  4  9999.9  0    6.2
030050 99999  19291005    46.5  4    41.5  4   997.8  4  9999.9  0    7.8
030050 99999  19291006    49.5  4    46.5  4   990.1  4  9999.9  0    7.8
030050 99999  19291007    48.2  4    44.8  4   979.1  4  9999.9  0    9.3
030050 99999  19291008    46.5  4    39.2  4   994.3  4  9999.9  0   12.4
030050 99999  19291009    44.7  4    40.0  4  1005.4  4  9999.9  0   10.9
030050 99999  19291010    48.7  4    47.0  4  1000.6  4  9999.9  0    8.4
030050 99999  19291011    48.7  4    39.2  4   995.5  4  9999.9  0   12.4

With direct access to the text compressed away in the GZips, it’s easier to implement a tarball reader, because we can see the result of our experiments immediately – just printing the binary values from each entry (file) in the tarball wouldn’t tell me much about my success or failure in uncompressing the data.

Tar

To work with tarballs I’ve downloaded this jar file. You can also visit the Javadocs: here. First things first: let’s see if we can peek at the data, starting with the smallest set, 1929.

As always when we’re in Java-land, we have to think carefully about how we want to abstract methods and workflows. The class TarInputStream lets me view a tarball as a series of TarEntries (compressed files).
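
As a quick sanity check – and this is just a sketch of mine, assuming TarInputStream and TarEntry from that jar are imported – we can list the first few entries of a tarball like this:

(with-open [tar (->> "res/dataset/gsod_1929.tar" FileInputStream. TarInputStream.)]
  (doseq [entry (->> (repeatedly #(.getNextEntry tar))
                     (take-while identity)
                     (take 5))]
    (println (.getName entry) (.getSize entry))))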

With the ability to walk a zipstream as pure text already in place, we can now wrap the tarball in a process-tarball function. It’s handy to go through the data one tarball at a time, both for sorting, as I mentioned before, and for keeping both cores busy with the data. To give you an idea of what I’m thinking, here’s the top-level processing:

(defn process-tarball
  [filename stn-ids wban-ids headers]
  (println "Parsing: " (.getName filename))
  (flush)
  (let [tarstream (->> filename FileInputStream. TarInputStream.)
        readings  (extract-readings tarstream stn-ids wban-ids)
        cnt       (count readings)]
    {:year (re-find #"\d{4}" (.getName filename))
     :mean (when-not (zero? cnt)
             (/ (reduce + readings) cnt))}))

If you imagine that extract-readings just walks through all the entries and returns a seq of the temperatures it picks up, then this will collect all readings from a tarball and return a hash-map containing the year parsed and the mean temperature for that year – quite neat for about 20 lines of Clojure.

With the ability to peek through the tar into the GZip, and through the compression into the text, we can start writing extract-readings. The easiest thing to do would be to have a for loop run through every line of the GZips and merge those lines into one huge sequence, giving us a line-seq of all the GZips. The problem, however, is that Java spends 2 bytes per char in a String. If we apply the math to the set from 2002, it goes like this:

77 MB tarball => 10,000 GZips weighing 360 MB => roughly 1.8 GB of raw text (at the ~5x expansion we saw earlier) => Strings worth at least 3.6 GB

That means that on a weak system like mine, I’m down for the count once we reach 1950. That deprives me of the luxury of picking columns out at the highest level of the parser, meaning my extract-readings function needs to be more specific than just pulling out one line at a time:

(defn extract-readings
  [tarstream stn-ids wban-ids]
  (->> (Double. (nth (cols data) 3)) ; 4th column = TEMP
       (for [data (rest (line-stream tarstream file))])
       (for [file (repeatedly #(.getNextEntry tarstream))
             :while file
             :when (let [[_ stn wban] (re-find #"(\d+)-(\d+)" (.getName file))]
                     (and (not (.isDirectory file))
                          (or (stn-ids stn) (wban-ids wban))))])
       flatten))

If you’re new to ->>, that’s probably not easy to read. First, I know that I want to work on every line of each file, so I pull out the 4th column from calling ‘cols’ on that line (cols just splits on spaces) and cast it to a Double. That expression is then fed to the first for loop, which runs through all of the lines in the file, where file is the result of the final for loop. The final for loop runs through all the entries in the tarball, picking out those entries which are not directories and are in the valid stations list, i.e. stn-ids and wban-ids respectively.
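
cols itself isn’t shown in this post; a minimal version matching that description (my sketch, not necessarily what’s in the repo) could be:

(defn cols [line]
  (seq (.split line "\\s+")))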

Currently we’re at about 30 lines of Clojure and we’ve already got our main data-extraction function set up. On the JVM we have several options for processing data concurrently: agents that carry state across asynchronous processes, futures that are just threads, promises and more. For this job two approaches seemed appealing: either a LinkedBlockingQueue or a parallelized map. I opted for #2. Pmap walks the data using a ‘sliding window’ approach, meaning that if a thread falls too far behind, pmap will wait – this is good because of the heavy memory load inherent in this challenge. With LinkedBlockingQueue you have to decide on a number of workers, but with pmap you just have to think in chunks – i.e. how big do I want them? For this job it’s a no-brainer: a chunk = a tarball.
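
Here’s a toy illustration of mine (not from the post) of that sliding window – pmap keeps only roughly (+ 2 number-of-cores) chunks in flight at a time:

(time (doall (map  #(do (Thread/sleep 200) %) (range 8)))) ;; serial: ~1600 ms
(time (doall (pmap #(do (Thread/sleep 200) %) (range 8)))) ;; parallel: a few chunks at a time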

Ready to Launch

Now, with the above functions implemented, you have very free hands to decide how you want to attack the data, show/save the output etc. Here’s one way to go:

(defn process-weather-data
  [dataset history-file]
  (let [stations   (northern-stations history-file)
        stn-ids    (disj (set (:stn stations))  "999999") ; 999999 = blank STN field
        wban-ids   (disj (set (:wban stations)) "99999")  ; 99999  = blank WBAN field
        dataset    (->> (File. dataset) file-seq (filter #(.isFile %)) sort)
        headers    [:stn :wban :yearmoda :temp]
        result     (->> dataset
                        (pmap #(process-tarball % stn-ids wban-ids headers))
                        doall)]
    (spit "result" (sort-by :year result))))

(my runtime: 1 hour 20 minutes)

First we pull out the stations in the Northern Hemisphere and break that down into 2 sets (for fast comparisons). Then from both of those I disjoin the sentinel IDs (“999999” for STN, “99999” for WBAN) which per convention mark an empty field (see the README on NOAA). Then I take the path ‘dataset’ and mangle it into a sequence of the files, sorted in ascending order. The sort is nice because 1) it allows me to track progress, and 2) the 2 largest data-sets will only have 2 threads running instead of 4 for the majority of the time. Then I manually define some headers; it wouldn’t be hard to rip them from a tarball, but there’s no need. Finally I start the process off by calling pmap, which launches 4 threads.
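
Kicking it off is then just a matter of (assuming the data and history file live where the later examples put them):

(process-weather-data "res/dataset/" "res/history")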

To avoid my system buckling (and spending excessive time) I boot a lean Arch installation, which claims less than 100 MB of RAM for itself. After 5 minutes this was the scenario:

[Image: 5 minutes in]

Couldn’t be better: both cores are boiling and the memory isn’t headed toward a heap explosion. After 45 minutes, this was the situation:

[Image: 45 minutes in]

The memory consumption has stabilized at 79.2% and both cores are still going full speed – excellent!

Results – Round 1

With the data ‘spit’ directly into a result file, it’s easy to read it back and mangle it any way we want. For instance:

(doseq [{:keys [year mean]} (read-string (slurp "result"))]
  (println (format "%s\t%s" year (if mean
                                   (str (/ (* (- mean 32) 5) 9))
                                   "null"))))

That will give you 2 columns of the readings converted to Celsius, which you can copy/paste into your favorite spreadsheet editor and observe the following:

[Image: Temperature Graph #1]

And now perhaps you’re wondering: where’s the Hockey Stick? Well, one explanation could be that we’re actually not doing a very good job of picking our input data, as we have simply parsed all available data. Because we see an explosion in the number of weather stations in the years 1995–2009, it’s a fair assumption that if these are unevenly distributed then, because of their great number, they distort the graph quite a bit.

A closer look

So we need to be more picky about the stations we use, to avoid distorted data. Let’s try to extract all the stations used in 1929 (our first recorded year) and follow their readings throughout the following years – that will give us a clear indication of the variations in global temperatures without any weight distortion. To accomplish this, we can help ourselves by making a function which extracts all station IDs from a given tarball:

(defn get-stations [filename]
  (let [tarstream    (->> filename FileInputStream. TarInputStream.)
        all-stations (for [file (repeatedly #(.getNextEntry tarstream))
                           :while file
                           :when (not (.isDirectory file))]
                       (let [[_ stn wban] (re-find #"(\d+)-(\d+)-" (.getName file))]
                         {:stn stn :wban wban}))]
    {:stn  (disj (set (map :stn  all-stations)) "999999") ; 999999 = blank STN
     :wban (disj (set (map :wban all-stations)) "99999")})) ; 99999 = blank WBAN

climate> (get-stations "../dataset/gsod_1929.tar")
{:stn #{"037950" "033110" "038940" "034970" "039800" "033960" "032620"
        "030750" "030910" "038040" "038560" "038110" "990061" "037770"
        "036010" "039530" "038640" "031590" "033790" "030050" "039730"},
 :wban #{}}

So all we need to do in order to follow these stations is keep only those in the Northern Hemisphere and re-run the job. For the sake of faster experiments while we wash our data, I’ll accept the station IDs and the output filename as arguments:

(defn process-weather-data
  [dataset history-file stations output]
  (let [dataset   (->> (File. dataset) file-seq (filter #(.isFile %)) sort)
        nstations (northern-stations history-file)
        stn-ids   (set (filter #((set (:stn  nstations)) %) (:stn  stations)))
        wban-ids  (set (filter #((set (:wban nstations)) %) (:wban stations)))
        headers   [:stn :wban :yearmoda :temp]
        result    (doall
                   (pmap #(process-tarball % stn-ids wban-ids headers) dataset))]
    (spit (str output ".raw") (with-out-str (prn result)))
    (println "Done")))

(let [tracked-stations  (get-stations "res/dataset/gsod_1929.tar")]
  (process-weather-data "res/dataset/" "res/history"
                        tracked-stations "stats-1929"))

(my runtime: 11 minutes)

Now we don’t have as much data as before (although still a lot), but we know that it’s not distorted by the addition of a huge number of stations in various locations. That gives us the following graph:

[Image: Temperature Graph #2 – Tracking 14 stations]

Hmm… I’m really starting to wonder how that Hockey Stick was produced, because what we clearly see from following about 14 stations is a gradual decrease in global temperature. But let’s give Mr. Gore the benefit of the doubt and expand our scope to include all stations used from 1929–1940, and then follow those throughout the years. That will give us a ton of data and keep us somewhat safe from weight distortion. I’ll introduce a helper to compile all the stations from a given range:

(defn get-station-series [base start end]
  (apply merge-with into
         (for [i (range start (inc end))]
           (get-stations
            (str base (if (not= \/ (last base)) "/") "gsod_" i ".tar")))))

Call that with the directory containing the tars as a base and then 2 integers (1929 1940), and you’ll get back two sets containing all stations used in the period. Then all you need to do is start the job with those stations, filtered down to the NH:

(let [tracked-stations (get-station-series "res/dataset" 1929 1940)]
  (process-weather-data "res/dataset/" "res/history"
                        tracked-stations "stats-1929-1940"))

(my runtime: 10 minutes)

Let that run for a while and you’ll get a lot of data, resulting in the following graph:

[Image: Temperature Graph #3 – Tracking 450 stations]

It seems that when we leave out the great number of weather stations that were introduced in the last 50 years or so, the tendency is absolutely not a rise in temperature.

Final crack at the Hockey Stick

OK – if at first you don’t succeed, try harder. The reason the data is distorted is that the stations aren’t being weighted correctly. Some areas have a higher density of stations, and some stations report more frequently than others – long story short: we need to get a visual impression of each station’s readings. By looking at each station individually, instead of compressing them into an unevenly weighted average, we will be able to clearly deduce how the weather has changed throughout the years recorded.

To make this really easy, I’ll introduce a helper which spits out files ready for OpenOffice Spreadsheet:

(defn emit-dataset [data]
  (let [uids (distinct (flatten (map #(map :uid %) (map :reads data))))]
    (with-out-str
      (doseq [{:keys [year reads]} data]
        (print year)
        (doseq [uid uids]
          (if-let [reading (first (filter #(= uid (:uid %)) reads))]
            (print (str "," (:mean reading))) ; str avoids print's padding space
            (print ",null")))
        (println)))))

(spit output (emit-dataset result))

As you can see, this little helper just outputs a CSV file, keyed on the :uid of each reading. To get that bit of info I changed the reader, but I won’t go through it here; if you’re interested you can read the source on Github. The last line dumps the CSV to a file, so you can place it at the bottom of your main function.
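
The output then has one row per year and one column per tracked station, something like this (values made up for illustration):

1929,45.3,null,44.1
1930,46.0,45.2,null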

The 1929 stations:

[Image: 14 stations from 1929, seen individually]

Despite the fact that some years are without readings, it’s clear that the general tendency is in direct contradiction with what Mr. Gore has shown. His graph shows a massive increase in warmth throughout the 1900s, meaning that the entire graph above should be rising quite drastically – the inconvenient truth, however, seems to be that there is no significant rise in temperature.

[Image: 100+ stations from 1929–33, seen individually]

It’s hard to get a good grasp of the data when it’s visualized like this, but reviewing over 100 stations we see that a few areas are seeing a rise in temperature while most aren’t. I did a small hack ‘n’ slash sparkline viewer to try to get a feel for the larger sets, coming up on 500 stations, and they seem to show the same tendency.

Hockey Game is Over?

I’ve worked the official numbers from NOAA and now you’re all able to re-run the computations at home. It’s clear from the official data that the globe isn’t wildly heating up; in fact, in some places it is now cooling down. I’m always a bit skeptical when politicians try to solve the world’s problems using taxes, and indeed the camps are divided on this issue. My personal opinion is that the climate is changing, and quite noticeably so over the past 20 years, but it’s difficult for me to say whether this is caused by temperature, CO2 or even human influence. As always, though, it’s best to trust the scientists who work with this on a daily basis, and they are overwhelmingly convinced that we have a problem.

Lau

Code here: Github

About the author

Lau Jensen is the owner of Best In Class, an avid Clojure developer well experienced in project management, and the primary author of ClojureQL. You are very welcome to follow his posts on this blog or get in touch directly at lau.jensen@bestinclass.dk.
