Brian's Brain Optimized in ClojureScript

In this post I'll walk you through some of the steps I took to bring Brian's Brain down from 340ms per frame to about 35ms in compiled ClojureScript. Although this is a speedup of nearly 10x, I think it's still much too slow. In the current brute-force traversal of the array we are essentially iterating 90 columns by 90 rows, multiplied by 8 passes while testing for active neighbors. Although running all 8 passes on each cell is highly unlikely, our worst-case scenario is 64,800 lookups for each frame – but even here, 35ms is ridiculous – so let's take a look inside.

Live demo

First, let's have a look at the end result. This is Brian's Brain, 100x100 cells @ 20 FPS (50ms)
 

It’s all about the board

In our original version, we defined the board like this:

(def board
     (for [x (range (dim-board 0))]
       (for [y (range (dim-board 1))]
         [(if (< 50 (rand-int 100)) :on :off) x y])))

This is essentially a persistent, nested sequence of vectors. The benefit is obviously that it's idiomatic Clojure, easy to interface with and shareable between threads. Unfortunately, in JavaScript we only have a single thread, so persistence gives us nothing more than a performance hit.

Our options for fast board access are narrowed down to either a 1D JS array which is accessed via index, or a 2D JS array which is accessed via coordinates. Accessing cells via index means calculating an index for a coordinate on each read, which only makes sense if the 1D array is significantly faster than the 2D array, thus negating the cost of calculating the index. So let's break it down:
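To make the trade-off concrete, here is roughly what the two layouts look like in plain JavaScript (a sketch; the sizes and the getter names are illustrative, not from the post's code):

```javascript
const W = 50, H = 50;

// 2D: an array of row arrays, accessed by coordinates.
const multi = Array.from({ length: H }, () => new Int32Array(W));

// 1D: a flat array, accessed by a computed index.
const single = new Int32Array(W * H);

function get2d(x, y) {
  return multi[y][x];
}

function get1d(x, y) {
  // Every read pays for one multiply and one add.
  return single[y * W + x];
}

multi[3][4] = 20;
single[3 * W + 4] = 20;
```

The question is whether the flat array's faster raw access can pay for that extra multiply-and-add on every lookup.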

int[][] multi = new int[50][50];
int[] single = new int[2500];

Under the hood, this compiles to:

BIPUSH 50
BIPUSH 50
MULTIANEWARRAY int[][] 2
ASTORE 1
SIPUSH 2500
NEWARRAY T_INT
ASTORE 2

Which tells us that there is native support for multidimensional arrays. Let's see if the same is true of access to the nested cells:

for (int i = 0; i < 50; ++i)
    for (int j = 0; j < 50; ++j)
    {
        multi[i][j] = 20;
        single[i*50+j] = 20;
    }

Compiled:

ALOAD 1: multi
ILOAD 3: i
AALOAD
ILOAD 4: j
BIPUSH 20
IASTORE

ALOAD 2: single
ILOAD 3: i
BIPUSH 50
IMUL
ILOAD 4: j
IADD
BIPUSH 20
IASTORE

(source: Stackoverflow.com)

Again we see native support for accessing nested arrays, so in all likelihood using a flat 1D array would not buy us any significant performance increase, because we would have to calculate the index on each lookup.

So let's make a board which is a 2D JavaScript array:

(defn make-board
  [w h]
  (let [board  (make-array h)
        buffer (make-array h)]
    (forloop [(y 0) (< y h) (inc y)]
       (let [row  (int-array w)]
         (forloop [(x 0) (< x w) (inc x)]
             (aset row x (rand-nth [on off on])))
         (aset board y (int-array row))
         (aset buffer y (int-array row))))
    (atom [board buffer])))

Basically this creates the array of rows and populates each cell with an on/off value. Because it's a mutable JavaScript array, we need to create both a board for rendering and a buffer to write to. If we were to read and write on only one table, the update would become incorrect, as each pass would mix cells from the current and the next generation of the board.
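In JavaScript terms, the read-from-one, write-to-the-other discipline looks roughly like this (a minimal sketch with illustrative names and a toy rule, not the post's actual step function):

```javascript
// Two small boards: read from `board`, write to `buffer`, then swap.
let board  = [Int32Array.of(0, 2), Int32Array.of(1, 0)];
let buffer = [Int32Array.of(0, 0), Int32Array.of(0, 0)];

function step(rule) {
  for (let y = 0; y < board.length; y++) {
    for (let x = 0; x < board[y].length; x++) {
      // All reads hit the old generation only.
      buffer[y][x] = rule(board[y][x]);
    }
  }
  // Swap roles instead of allocating fresh arrays.
  [board, buffer] = [buffer, board];
}

// Toy rule: on (2) -> dying (1), everything else -> off (0).
step(s => (s === 2 ? 1 : 0));
```

The swap at the end is what keeps the whole thing allocation-free: the old board becomes next generation's scratch buffer.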

A key to gaining better performance is making sure that the loops themselves generate as little overhead as possible. In the first version I was mapping over seqs, which was the main cause of the slow traversal. Take a quick look at these comparisons and you'll find that we need some kind of while loop to get optimal performance: JSPerf Loops. There's bound to be a significant difference between the ClojureScript code we write and the resulting JavaScript generated when we apply advanced compiler optimizations, but for the sake of sanity let's try to get as close as possible.

The forloop macro generates exactly that kind of loop while mimicking JavaScript's idiomatic for syntax. I've borrowed this directly from David Nolen's interesting port, Chambered:

(defmacro forloop
  [[init pred step] & body]
  `(loop [~@init]
     (when ~pred
       ~@body
       (recur ~step))))
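For reference, this is roughly the shape of JavaScript we're hoping loop/recur compiles down to: a bare while loop with no iterator objects or callback allocation (fillRow is an illustrative name):

```javascript
function fillRow(row, value) {
  let x = 0;                 // (x 0)         -- init
  while (x < row.length) {   // (< x w)       -- pred
    row[x] = value;          // body
    x = x + 1;               // (inc x)       -- step
  }
  return row;
}
```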

Optimizing 3 important functions

So with a simple looping construct in place, there are actually only 3 functions being called frequently enough to require optimization: get-cell, active-neighbors and stepfn.

The simplest of these is the get-cell function, which accesses the board-array and fetches the content of a given cell. In idiomatic Clojure, it would look like so:

(defn get-cell
  [b x y]
  (aget b (mod y 120) (mod x 120)))

This will do a million lookups in about 55ms. I think this is near optimal for the JS engine, but looking at the Chrome profiler it appears that mod takes curiously long to complete, which left me wondering if rolling a pure JS mod would be a bit faster. JavaScript's own mod, the % operator, is unusable for this task because it is a remainder and returns negative results for negative operands. Here's a workaround:

(defn mod [x m]
  (js* "((x%m)+m)%m"))

(defn get-cell
  [b x y]
  (js* "b[brianscript.core.mod(y,90)][brianscript.core.mod(x,90)]"))

The js* function is a fail-safe for when ClojureScript just won't generate the code you need, and I couldn't find a direct route to the % operator. Although this uses 2 calls to %, it gives us a direct call to b[][] and reduces the runtime to consistently hit < 40ms per million lookups. Crude, but faster.
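In plain JavaScript, the wrapped mod is the same expression the js* form emits:

```javascript
// The % operator is a remainder, so (-1 % 90) === -1 in JavaScript.
// Adding m and taking % again yields the non-negative, Clojure-style
// mod needed for the toroidal (wrap-around) board lookups.
function mod(x, m) {
  return ((x % m) + m) % m;
}
```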

Active Neighbors

This function is by far the most important performance-wise, as it looks up a maximum of 8 adjacent cells for each cell on the board. The original version looks like this:

(defn active-neighbors [above [left _ right] below]
  (count
   (filter #(= :on (% 0))
           (concat above [left right] below))))

(defn rules [above current below]
  (let [[self x y]  (second current)]
    (cond
      (= :on    self)                              [:dying x y]
      (= :dying self)                              [:off   x y]
      (= 2 (active-neighbors above current below)) [:on    x y]
      :else                                        [:off   x y])))

In a functional approach, this makes all the sense in the world. I'm cutting the board into overlapping rows of 3 at a time and looking for the :on keyword. There's a small gain in using integers instead of keywords for the state of each cell, which is simply achieved by removing the : prefix and adding definitions:

(def off   0)
(def dying 1)
(def on    2)

The real value here is that it allows us to use an int-array, which performs roughly 10-12% faster than an untyped array.
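A sketch of why this matters in the generated JavaScript: keyword states would have to live in a plain array of objects, while integer states fit a typed Int32Array (the names here are illustrative):

```javascript
// Integer cell states, mirroring the definitions above.
const OFF = 0, DYING = 1, ON = 2;

// A typed row: fixed-size, integer-only, cheap for the JIT to access.
const row = Int32Array.of(ON, OFF, DYING, OFF);

// Checking for on-cells is then a plain integer comparison.
function countOn(r) {
  let n = 0;
  for (let i = 0; i < r.length; i++) {
    if (r[i] === ON) n++;
  }
  return n;
}
```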

The second gain is achieved by looking directly at each coordinate instead of generating 3 seqs per lookup, which is done by calling get-cell instead of partition/concat. And the final boost is achieved by short-circuiting the process as soon as we have a definitive answer. Here's how:

(defn active-neighbors
  [blocal x y]
  (let [coords [[(dec x) (dec y)] [x (dec y)] [(inc x) (dec y)]
                [(dec x) y]                   [(inc x) y]
                [(dec x) (inc y)] [x (inc y)] [(inc x) (inc y)]]]
    (loop [[c & cs] coords
           cnt      0]
      (if (and c (< cnt 3))
        (let [state (get-cell blocal (c 0) (c 1))]
          (cond
           (and (= state on) (= cnt 2))
           3

           (= state on)
           (recur cs (inc cnt))

           :else (recur cs cnt)))
        cnt))))

We loop through the coordinates and check whether the count is below 3. If it's not, we simply break the loop and return 3, since it makes no difference to the algorithm whether there are more neighbors: we need exactly 2 live neighbors to set the on state. If we already have 2 on neighbors and the cell we're examining is also on, we can likewise stop looking and just return 3. Otherwise we keep looking. This function ran at about 320ms in the functional version, and now it's down to about 80ms in a live REPL and 30ms compiled with optimizations.
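The short-circuit idea translates to plain JavaScript roughly like this (a sketch; getCell, the delta list and the constant are illustrative assumptions, not the compiled output):

```javascript
const ON = 2;

function activeNeighbors(getCell, x, y) {
  const deltas = [[-1, -1], [0, -1], [1, -1],
                  [-1,  0],          [1,  0],
                  [-1,  1], [0,  1], [1,  1]];
  let cnt = 0;
  for (const [dx, dy] of deltas) {
    if (getCell(x + dx, y + dy) === ON) {
      cnt++;
      // More than 2 on-neighbors: the rule can never fire,
      // so stop scanning the remaining cells.
      if (cnt === 3) return 3;
    }
  }
  return cnt;
}
```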

The step function

The final piece of the puzzle is applying the rules to the entire board and updating the state. The functional version allocated quite a few seqs and continuously triggered garbage collection while running. The new version allocates nothing:

(defn stepfn
  [stage]
  (let [[dimx dimy]       dim-board
        [board buffer]    @stage]
    (forloop [(y 0) (< y dimy) (inc y)]
       (forloop [(x 0) (< x dimx) (inc x)]
                (let [current-state (get-cell board x y)]
                  (aset buffer y x
                        (cond
                         (= on current-state)                dying
                         (= dying current-state)             off
                         (= 2 (active-neighbors board x y))  on
                         :else off)))))
    (reset! stage [buffer board])))

This stepfn gives us < 5ms of overhead, and nearly all of the work is done in active-neighbors.

Final remarks on performant Clojurescript

I started out fairly unhappy with the compiler, because when you're looking for performance, indirections are a pain and they consume time. When I applied the advanced optimizations, I was happy to see much smaller code and improved performance. I don't think the array-based version is horribly complex to work with compared to the straight functional approach, but I can think of very few real-world scenarios where this would be necessary. Could a user not wait 340ms for a report to be computed? I think so. But doing array traversal and performance-intensive graphics outside of the GPU does seem like driving in a nail with a screwdriver. Here's what can be computed in a single millisecond on the GPU:

For me, I’m done with Brians Brain and I hope you’ve enjoyed following the experience. Here are the steps:

– Brians functional Brain, Swing UI, parallized: Brians Functional Brain

– Brians functional Brain, naively ported to Clojurescript: Brians Brain on Clojurescript

– Brians Brain in optimized Clojurescript: This post

Code to be found here: BrianScript

Lau

About the author

Lau Jensen is the owner of Best In Class, an avid Clojure developer well experienced in project management, and the primary author of ClojureQL. You are very welcome to follow his posts on this blog or get in touch directly at lau.jensen@bestinclass.dk

  • What’s the deal with inspecting JVM bytecode if you’re eventually compiling to JS?

  • LauJensen

    The bytecode is Javascript Bytecode, not JVM Bytecode.