The sequel to 'Brians functional brain'! I'll take you through a rewrite of the functional Brian's Brain simulator where I employ transients to thoroughly optimize our computational model and type-hints/double buffering to bring render times down! By adding 10 lines we more than double performance! Please don't read this, before having read the preceding post.
As I demonstrated in my last post, the immutable approach bears in itself many benefits, but as we also saw it gave us a big performance hit. I set out to figure out why and how to fix it - follow with me as we up performance quite drastically. Notice, that it is entirely possible to do a rewrite of the old code which is closer to our offset, but for the sake of not being boring I'll take another route.
Why was our last model so performance heavy? I could give many explanations from the code, but in order to make sure that I'm not missing anything obvious, let's whip out a profiler and have a peek:

(For profiling Clojure code I recommend jvisualvm for new-comers, it's a 2-click process to get started)
What we're seeing is roughly 50% of our CPU time going respectively to future, deref and get. Future as you recall is our method for firing our activity-loop and deref peeks into our atom. At first glance you'd think this is where we could find the first big win, but unfortunately the only reason we're seeing this is because the profiler goes between our code and HotSpot. When we remove the profiler these repetitive calls are optimized away by HotSpot, so we have to examine our code manually.
In the last post we modeled our code to be as idiomatic as possible, meaning we wanted the shortest route between concept and implementation. I think we both achieved this and paid the price. Now we have two routes to chose from - either maintain the original design by way of abstraction or optimize it throughly - as I'm sure you've guessed I'm opting for the latter. Primarily because this is a blog and not a business setting, so it might as well be 'educational' :)
I briefly mentioned last time that the most efficient way to define our board is as a single vector - the lookup times are stunning I'm told. I'll also strip out the coordinates and keep them in a separate struct, that makes for cleaner code later on:
(def board-size (apply * dim-board))
(def board (-> (vec (repeat board-size :off))
(assoc (/ (dim-board 0) 2) :on)
(assoc (inc (/ (dim-board 0) 2)) :on)))
(def cords (for [y (range (dim-board 0)) x (range (dim-board 1))] [x y]))
Throughout this post I'll assume as a minimum that you read my last post and have the knowledge that comes with it. So with that in mind the only two things that might be surprising is the macro -> which threads the first expression through the remaining forms. In this case I generates the single-dimensional board, then put 2 :on pixels at the middle of the first row - when you run the app you'll see why :) The second is the behavior of Clojures 'list comprehension' macro 'for', it will walk all values of x for each y, so not [0 0] [1 1] [2 2], but [0 0] [1 0] [2 0] and so on.
(note: if you want an assymetrical board, use a nested for or the :when clause)
As you hopefully recall, last time we called (partition 3 1...) to take out 3 items and increment the offset by 1, in order to build our torus-windows. This comes with 2 penalties.
Since we're no longer using (x, y) coordinates but indices, we'll define the following:
(defn torus-coordinate
[idx]
(cond (neg? idx) (+ idx board-size)
(>= idx board-size) (- idx board-size)
:else idx))
Quite simple, if we exceed the size of the board, we subtract the board-size. Contrary for negative values. Along with this model comes a new way of tracking down our neighbors, based on their index. All we need to do is define the number of values between each neighbor and then add that to our index.
(def above (- (dim-board 0)))
(def below (dim-board 0))
(def neighbors [-1 1 (dec above) above (inc above)
(dec below) below (inc below)])
So for every given index, we can now determine the neighbors like so:
(map #(torus-coordinate (+ idx %)) neighbors)
So all we need now, it to find an effecient way to walk the board.................allow me to introduce
If you missed it in the last post, the subtle painful detail of the functional/immutable approach, is that we're constructing new datastructures on every iteration of the board. Since all datastructures are immutable, we can never modify only recreate with the wanted changes. Thankfully Rich has built 'structural sharing' into our primary datastructures, making the pain as small as possible, but for this task, it wasn't enough. So let's carefully approach the topic of mutability in functional programming, an oxymoron I know.
If a tree falls in the woods and nobody is there to hear it, does it still make a sound?
That's how Rich Hickey opens the discussion on transients. His question is really this
If a pure function, internally mutates values and returns an immutable result, is that bad?
My first instict was: "Yes it is" and on larger scales I still see it like that. However for small and simple operations where mutability comes with a huge performance boost, it could be the right way to go. But when we start touching on mutability, that revives all my old fears from imperative software development, especially in a concurrent setting. For those of you who are like me, I'll walk you through some features of transients, which make them a useful alternative.
First, transients are explicitly defined:
user> (def t (transient [])) #
This gives me a transient vector to work with, clearly a separate type from our usual dataypes.
#1) Okay, they're explicitly defined but will eventually leak into my functional code!
Our regular functions for working with immutable structures assoc, dissoc, conj etc. all fail miserably on transients:
user> (assoc t 2 10) ; Evaluation aborted. clojure.lang.PersistentVector$TransientVector cannot be cast to clojure.lang.Associative [Thrown class java.lang.ClassCastException]
When you need to apply these operations to your transients they must be suffixed with an exclamation mark!
user> (assoc! t 2 10) #
Notice 2 things:
Though you can willfully violate point #1, you should never do it. When you've applied some operation to your transient, leave it and work with the return value, like you'll see me doing below. The last point means, that if the return value is accidentally leaked into the rest of the system it's instantly detected. Like I said starting out, transients should at a maximum be used to mutate 'local data' and given the explicitness surrounding the use of them, Clojure helps to keep you in place. When you're done, you return the transient like so:
user> (persistent! t) [1 2 10]
Have I convinced you it won't leak?
#2) Fine, they don't leak specifically, but in a concurrent setting they'll betray me!
I doubt you'll find anything in Clojure that doesn't have strong concurrency semantics, transients are no exception. They enforce thread-isolation, which is a big deal. If your transient is modified from any other thread than the one where it's spawned, this will happen:
user> (future (conj! t 5)) ; Evaluation aborted. java.lang.IllegalAccessError: Mutable used by non-owner thread [Thrown class java.util.concurrent.ExecutionException]
There are a few more things I could mention, but it's all covered. Clojure is still keeping me safe and ensuring my sleep at night.
We've come close to the flame without getting burnt, let's see what's gained:
In the first version, we defined, quite elegantly, some nested maps calling both torus-window and rules. For the sake of overview I've cooked this process into one loop. Clojures loop start by me binding names to initial values and then calling recur with the next iterations values.
(defn step [board]
(loop [i 0 next-state (transient board)]
(if (< i board-size)
(let [self (nth board i)]
(recur (inc i)
(assoc! next-state i (cond
(= :on self) :dying
(= :dying self) :off
(= 2 (on-neighbors i board)) :on
:else :off))))
(persistent! next-state))))
The intial value of 'next-state' is a transient version of our current board. The logic is simple: If we have not yet worked through all the items in board-size we bind 'self' to the actual value at the (nth i) value of our board. Then we 'recur' with an increment index and assoc(ciate)! the new state directly into our transient. This is where we gain some real momentum, since we're mutating the transient and not making a new structure every time. The last line is the 'else' clause to our if-statement, persisting the next-state vector and returning it. This change in structure requires no change in the remaining code. And the result? Per iteration of the board on my system:
That's quite a speed up, which causes another concern. My quicksilver spur of the moment naive painting strategy is not doing justice to my now speedy computations! So since I've done a few visualization (and might do more later on), let's make it right once and for all.
(see errata in top of post)
In our initial declarations, we skip the JPanel and add a Canvas instead, the Canvas supports Double Buffering:
(let [canvas (doto (Canvas.) (.setIgnoreRepaint true)
(.setSize (dim-screen 0) (dim-screen 1)))]
(.createBufferStrategy canvas 2)
Nothing special, but it does provoke a small change in the activity-loop, which in all other regards is identical to the old one:
(defn activity-loop [canvas stage]
(let [buffer (.getBufferStrategy canvas)
bi (.. (GraphicsEnvironment/getLocalGraphicsEnvironment)
(getDefaultScreenDevice) (getDefaultConfiguration)
(createCompatibleImage (dim-screen 0) (dim-screen 1)))
g2d (.createGraphics bi)]
(while true
(time (swap! stage step))
(render-scene g2d buffer bi @stage))))
The binding of 'buffer' is very standard, but try to write out the declaration of 'bi' in Java, it's quite painful. For demo purposes I've wrapped (swap! ...) in the time-macro, if run that will repeatedly output the time it takes to compute an iteration. If you want to test rendering, wrap render-scene in time instead. In the source below I've removed it and wrapped the rendering in a future, because I don't need it done 'right now'.
Finally we're ready to do some rendering. As you know, we don't have the comfort of coordinates stored directly on the board, which rules out running (map render board).... or does it?
(defn render-scene [g2d buffer bi stage]
(doto #^Graphics2D g2d
(.setColor Color/BLACK)
(.fillRect 0 0 (dim-screen 0) (dim-screen 1)))
(dorun (map #(render-cell g2d %1 %2) cords stage))
(.drawImage (.getDrawGraphics #^BufferStrategy buffer) bi 0 0 nil)
(when (not (.contentsLost #^BufferStrategy buffer))
(.show #^BufferStrategy buffer)))
There are a few things to notice. First off, I'm still using a map but just 'interleaving' my coordinates on the fly in the 5.th line. Secondly you see me add type-hints (the #^type in blue). Type-hints remove reflection calls where the compiler has to go and ask for the type, which takes a lot of time. To figure out when the compiler was doing this I ran the program with this line added to the top of the source:
(set! *warn-on-reflection* true)
Which gave me several tips for optimizing the code:
Reflection warning, transient-ca.clj:55 - call to setColor can't be resolved.
Reflection warning, transient-ca.clj:55 - call to fillRect can't be resolved.
Reflection warning, transient-ca.clj:59 - reference to field getDrawGraphics can't be resolved.
Reflection warning, transient-ca.clj:59 - call to drawImage can't be resolved.
Reflection warning, transient-ca.clj:60 - reference to field contentsLost can't be resolved.
Reflection warning, transient-ca.clj:61 - reference to field show can't be resolved.
Reflection warning, transient-ca.clj:64 - reference to field getBufferStrategy can't be resolved.
It couldn't be easier, it tells me which file, which line number and which method needs help. Some of these don't fire more than once - when the object is initialized - others fire repeatedly throughout my code. The last kind needs to be type-hinted in order to gain some performance. These are my results, your mileage will vary:
_
With great reservation I've introduced transients. You don't get them for free, they come with some advice:
Follow my lead here, I'm being exemplary for your sake. You don't start by writing un-idiomatic transient code, you start by making it elegant, idiomatic, functional and easily understood. If performance is critically low and you need to optimize, start from the top and work your way down, optimizing smaller routines first. Close to the last step you'll find transients. They are not meant to be your first resort. At least I don't think so, but you came here for my 2 cents right? :)
I showed off type-hints as a way to optimize your code, it's always a good idea. Personally I don't like seeing them if I don't have to, so if I'm not feeling the pain I usually don't bother to add them, but in a case such as the above there's no regrets.
Finally, we're still functional, there are no side-effects and no state-in-between. We've added a double-buffer rendering strategy and all in all we've only grown about 10 lines, coming to 79 lines.
Hope I haven't tempted anybody above measure,
/Lau
Get the source: here
ps: Transients are in the alpha-snapshot of Clojure, get it straight from github.