Brians Transient! Brain

2009-10-04 14:18:07

The sequel to 'Brians functional brain'! I'll take you through a rewrite of the functional Brian's Brain simulator where I employ transients to thoroughly optimize our computational model and type-hints/double buffering to bring render times down! By adding 10 lines we more than double performance! Please don't read this, before having read the preceding post.

As I demonstrated in my last post, the immutable approach bears in itself many benefits, but as we also saw it gave us a big performance hit. I set out to figure out why and how to fix it - follow with me as we up performance quite drastically. Notice, that it is entirely possible to do a rewrite of the old code which is closer to our offset, but for the sake of not being boring I'll take another route.

Why oh why?

Why was our last model so performance heavy? I could give many explanations from the code, but in order to make sure that I'm not missing anything obvious, let's whip out a profiler and have a peek:

visualvm

(For profiling Clojure code I recommend jvisualvm for new-comers, it's a 2-click process to get started)

What we're seeing is roughly 50% of our CPU time going respectively to future, deref and get. Future as you recall is our method for firing our activity-loop and deref peeks into our atom. At first glance you'd think this is where we could find the first big win, but unfortunately the only reason we're seeing this is because the profiler goes between our code and HotSpot. When we remove the profiler these repetitive calls are optimized away by HotSpot, so we have to examine our code manually.

Overview

In the last post we modeled our code to be as idiomatic as possible, meaning we wanted the shortest route between concept and implementation. I think we both achieved this and paid the price. Now we have two routes to chose from - either maintain the original design by way of abstraction or optimize it throughly - as I'm sure you've guessed I'm opting for the latter. Primarily because this is a blog and not a business setting, so it might as well be 'educational' :)

The World

I briefly mentioned last time that the most efficient way to define our board is as a single vector - the lookup times are stunning I'm told. I'll also strip out the coordinates and keep them in a separate struct, that makes for cleaner code later on:

(def board-size (apply * dim-board))
(def board        (-> (vec (repeat board-size :off))
                      (assoc (/ (dim-board 0) 2)       :on)
                      (assoc (inc (/ (dim-board 0) 2)) :on)))

(def cords (for [y (range (dim-board 0)) x (range (dim-board 1))] [x y]))

Throughout this post I'll assume as a minimum that you read my last post and have the knowledge that comes with it. So with that in mind the only two things that might be surprising is the macro -> which threads the first expression through the remaining forms. In this case I generates the single-dimensional board, then put 2 :on pixels at the middle of the first row - when you run the app you'll see why :) The second is the behavior of Clojures 'list comprehension' macro 'for', it will walk all values of x for each y, so not [0 0] [1 1] [2 2], but [0 0] [1 0] [2 0] and so on.

(note: if you want an assymetrical board, use a nested for or the :when clause)

The Window

As you hopefully recall, last time we called (partition 3 1...) to take out 3 items and increment the offset by 1, in order to build our torus-windows. This comes with 2 penalties.

  1. We're allocation a lot of lists everytime we walk over our grid, allocation takes time.
  2. Everything we allocate we later garbage collect (GC), GC gives us periodic performance hits.

Since we're no longer using (x, y) coordinates but indices, we'll define the following:

(defn torus-coordinate
  [idx]
  (cond (neg? idx)          (+ idx board-size)
        (>= idx board-size) (- idx board-size)
    :else idx))

Quite simple, if we exceed the size of the board, we subtract the board-size. Contrary for negative values. Along with this model comes a new way of tracking down our neighbors, based on their index. All we need to do is define the number of values between each neighbor and then add that to our index.

(def above     (- (dim-board 0)))
(def below     (dim-board 0))
(def neighbors [-1 1 (dec above) above (inc above)
                     (dec below) below (inc below)])

So for every given index, we can now determine the neighbors like so:

(map #(torus-coordinate (+ idx %)) neighbors)

This code is overly simplistic and too fast: read errata

So all we need now, it to find an effecient way to walk the board.................allow me to introduce

- - - TRANSIENTS - - -

If you missed it in the last post, the subtle painful detail of the functional/immutable approach, is that we're constructing new datastructures on every iteration of the board. Since all datastructures are immutable, we can never modify only recreate with the wanted changes. Thankfully Rich has built 'structural sharing' into our primary datastructures, making the pain as small as possible, but for this task, it wasn't enough. So let's carefully approach the topic of mutability in functional programming, an oxymoron I know.

If a tree falls in the woods and nobody is there to hear it, does it still make a sound?

That's how Rich Hickey opens the discussion on transients. His question is really this

If a pure function, internally mutates values and returns an immutable result, is that bad?

My first instict was: "Yes it is" and on larger scales I still see it like that. However for small and simple operations where mutability comes with a huge performance boost, it could be the right way to go. But when we start touching on mutability, that revives all my old fears from imperative software development, especially in a concurrent setting. For those of you who are like me, I'll walk you through some features of transients, which make them a useful alternative.

First, transients are explicitly defined:

user> (def t (transient []))
#

This gives me a transient vector to work with, clearly a separate type from our usual dataypes.

#1) Okay, they're explicitly defined but will eventually leak into my functional code!

Our regular functions for working with immutable structures assoc, dissoc, conj etc. all fail miserably on transients:

user> (assoc t 2 10)
; Evaluation aborted.
clojure.lang.PersistentVector$TransientVector cannot be cast to clojure.lang.Associative
  [Thrown class java.lang.ClassCastException]

When you need to apply these operations to your transients they must be suffixed with an exclamation mark!

user> (assoc! t 2 10)
#

Notice 2 things:

  1. We've not digressed to using statements, it's still an expression, you still need to capture the returned value.
  2. You don't get a vector back, it's still a transient object and that makes all the difference

Though you can willfully violate point #1, you should never do it. When you've applied some operation to your transient, leave it and work with the return value, like you'll see me doing below. The last point means, that if the return value is accidentally leaked into the rest of the system it's instantly detected. Like I said starting out, transients should at a maximum be used to mutate 'local data' and given the explicitness surrounding the use of them, Clojure helps to keep you in place. When you're done, you return the transient like so:

user> (persistent! t)
[1 2 10]

Have I convinced you it won't leak?

#2) Fine, they don't leak specifically, but in a concurrent setting they'll betray me!

I doubt you'll find anything in Clojure that doesn't have strong concurrency semantics, transients are no exception. They enforce thread-isolation, which is a big deal. If your transient is modified from any other thread than the one where it's spawned, this will happen:

user> (future (conj! t 5))
; Evaluation aborted.
java.lang.IllegalAccessError: Mutable used by non-owner thread
  [Thrown class java.util.concurrent.ExecutionException]

There are a few more things I could mention, but it's all covered. Clojure is still keeping me safe and ensuring my sleep at night.

We've come close to the flame without getting burnt, let's see what's gained:

Lets put it to good use

In the first version, we defined, quite elegantly, some nested maps calling both torus-window and rules. For the sake of overview I've cooked this process into one loop. Clojures loop start by me binding names to initial values and then calling recur with the next iterations values.

(defn step [board]
  (loop [i 0 next-state (transient board)]
    (if (< i board-size)
      (let [self         (nth board i)]
        (recur (inc i)
               (assoc! next-state i (cond
                                      (= :on    self)                :dying
                                      (= :dying self)                :off
                                      (= 2 (on-neighbors i board))   :on
                                      :else                          :off))))
      (persistent! next-state))))

The intial value of 'next-state' is a transient version of our current board. The logic is simple: If we have not yet worked through all the items in board-size we bind 'self' to the actual value at the (nth i) value of our board. Then we 'recur' with an increment index and assoc(ciate)! the new state directly into our transient. This is where we gain some real momentum, since we're mutating the transient and not making a new structure every time. The last line is the 'else' clause to our if-statement, persisting the next-state vector and returning it. This change in structure requires no change in the remaining code. And the result? Per iteration of the board on my system:

Before: 180 msecs_____ After: 37 msecs

That's quite a speed up, which causes another concern. My quicksilver spur of the moment naive painting strategy is not doing justice to my now speedy computations! So since I've done a few visualization (and might do more later on), let's make it right once and for all.
(see errata in top of post)

Double buffered strategic rendering

In our initial declarations, we skip the JPanel and add a Canvas instead, the Canvas supports Double Buffering:

(let [canvas  (doto (Canvas.) (.setIgnoreRepaint true)
                    (.setSize (dim-screen 0) (dim-screen 1)))]
  (.createBufferStrategy canvas 2)

Nothing special, but it does provoke a small change in the activity-loop, which in all other regards is identical to the old one:

(defn activity-loop [canvas stage]
  (let [buffer (.getBufferStrategy canvas)
        bi     (.. (GraphicsEnvironment/getLocalGraphicsEnvironment)
                   (getDefaultScreenDevice) (getDefaultConfiguration)
                   (createCompatibleImage (dim-screen 0) (dim-screen 1)))
        g2d    (.createGraphics bi)]
    (while true
      (time (swap! stage step))
      (render-scene g2d buffer bi @stage))))

The binding of 'buffer' is very standard, but try to write out the declaration of 'bi' in Java, it's quite painful. For demo purposes I've wrapped (swap! ...) in the time-macro, if run that will repeatedly output the time it takes to compute an iteration. If you want to test rendering, wrap render-scene in time instead. In the source below I've removed it and wrapped the rendering in a future, because I don't need it done 'right now'.

Finally we're ready to do some rendering. As you know, we don't have the comfort of coordinates stored directly on the board, which rules out running (map render board).... or does it?

(defn render-scene [g2d buffer bi stage]
  (doto #^Graphics2D g2d
    (.setColor Color/BLACK)
    (.fillRect 0 0 (dim-screen 0) (dim-screen 1)))
  (dorun (map #(render-cell g2d %1 %2) cords stage))
  (.drawImage (.getDrawGraphics #^BufferStrategy buffer) bi 0 0 nil)
  (when (not (.contentsLost #^BufferStrategy buffer))
    (.show #^BufferStrategy buffer)))

There are a few things to notice. First off, I'm still using a map but just 'interleaving' my coordinates on the fly in the 5.th line. Secondly you see me add type-hints (the #^type in blue). Type-hints remove reflection calls where the compiler has to go and ask for the type, which takes a lot of time. To figure out when the compiler was doing this I ran the program with this line added to the top of the source:

(set! *warn-on-reflection* true)

Which gave me several tips for optimizing the code:

Reflection warning, transient-ca.clj:55 - call to setColor can't be resolved.
Reflection warning, transient-ca.clj:55 - call to fillRect can't be resolved.
Reflection warning, transient-ca.clj:59 - reference to field getDrawGraphics can't be resolved.
Reflection warning, transient-ca.clj:59 - call to drawImage can't be resolved.
Reflection warning, transient-ca.clj:60 - reference to field contentsLost can't be resolved.
Reflection warning, transient-ca.clj:61 - reference to field show can't be resolved.
Reflection warning, transient-ca.clj:64 - reference to field getBufferStrategy can't be resolved.

It couldn't be easier, it tells me which file, which line number and which method needs help. Some of these don't fire more than once - when the object is initialized - others fire repeatedly throughout my code. The last kind needs to be type-hinted in order to gain some performance. These are my results, your mileage will vary:

Old: 260 msecs / frame

2xBuffered: 230 msecs/frame

type-hinted+2xBuffered: 40 msecs/frame

_

Conclusions

With great reservation I've introduced transients. You don't get them for free, they come with some advice:

DO NOT PREMATURELY OPTIMIZE

Follow my lead here, I'm being exemplary for your sake. You don't start by writing un-idiomatic transient code, you start by making it elegant, idiomatic, functional and easily understood. If performance is critically low and you need to optimize, start from the top and work your way down, optimizing smaller routines first. Close to the last step you'll find transients. They are not meant to be your first resort. At least I don't think so, but you came here for my 2 cents right? :)

I showed off type-hints as a way to optimize your code, it's always a good idea. Personally I don't like seeing them if I don't have to, so if I'm not feeling the pain I usually don't bother to add them, but in a case such as the above there's no regrets.

Finally, we're still functional, there are no side-effects and no state-in-between. We've added a double-buffer rendering strategy and all in all we've only grown about 10 lines, coming to 79 lines.

Hope I haven't tempted anybody above measure,

/Lau

Get the source: here

ps: Transients are in the alpha-snapshot of Clojure, get it straight from github.

Rick Moynihan
2009-10-05 14:48:34
Another great post... However I'm curious how you knew to disregard the profilers information...  i.e. how is it you know that hotspot will remove those calls?
Lau
2009-10-05 14:54:09
@Rick: Thanks. There are 3 things you can do. 1) Learn how to read bytecode and see for yourself. 2) Completely rewrite your program removing all 'derefs' for instance and seeing if the profiler still complains. 3) Ask <a href="http://clj-me.cgrand.net" rel="nofollow">CGrand</a>. 

I went with the last option and he told me that both he and <a href="http://twitter.com/richhickey" rel="nofollow">Rich Hickey</a> had made similar observations and seen that a complete rewrite didn't change the performance characteristics.
Rick Moynihan
2009-10-05 15:15:46
Hi Lau, Is (1) really an option, as you mention HotSpot optimises the calls away, which implies that the answer isn't really in the byte-code as byte HotSpot optimisations are performed at runtime not compile time.  (Though I understand that certain byte code structures are likely to be optimised by HotSpot).

I'm really just wondering what the rule of thumb is here.  Do we assume that significant time spent in Clojure's core isn't the bottleneck?  I appreciate profiling is hard, inexact and very much a hiesen-problem, but being able to identify performance red-herrings such as this seems to be something the clojure community should document...  Particularly when it leads to the blame being incorrectly laid at Clojure's feet.

Anyway, I'm really enjoying your posts!  Keep up the good work!
Lau
2009-10-05 17:50:49
Hey Rick.

You're asking the right questions. I'll be very careful to give out some definite guidelines for profiling bottlenecks, like I mentioned I conferred with CGrand before  reaching any conclusions. Cliff Click did a presentation about 1 year ago, demonstrating in a small way some results of examining bytecode to determine if the various compilers were inlining. I get the impression that in the right setting (I'm guessing primarily on Azul) it's a good way to gain some insight. Generally, if there is anything to be learned it's that repeated calls to deref and get are to be expected as the top hitters in most profiling results. The practical side of it, is of course trying to rid your code of these calls, re-profile it and see if anything has changed. I think the best course of action would be to dedicate a separate post for this topic, working through some use-cases. I'll consider it. In the mean time, check out the presentation: <a href="http://www.infoq.com/presentations/click-fast-bytecodes-funny-languages" rel="nofollow">here</a>
Miron
2009-10-05 17:55:14
Hello, haven't run the app yet (my current clojure setup isn't recent enough to include transients), but I'm wondering about wrapping.

It seems to me that your rows wrap in a weird way (the first cell off the right end of the row should be the first cell on the left, I think it's the first cell of the next row in your implementation).

What am I missing here? Is it really a bug?

Great post,
Miron
Lau
2009-10-05 19:38:03
@Miron: Hey, You're not missing anything, it's true that this is a naive 1D board comprehension, which only appears 2D when you map it graphically. In would be more correct to map the last cell in a row to the first index on the same row. For a good example of this read the preceding post, the current is only about performance optimizations, so I think it's best to keep it simple. Good catch though.
danlei
2009-10-06 00:57:20
Nice article, Lau. Keep up the good work.
Mark Reid
2009-10-10 01:27:28
Great post. I've also been profiling Clojure code recently and while VisualJVM is a nice tool for looking at overall performance I would recommend the clojure.contrib.profiling for more fine-grained profiling.

What nice about c.c.profiling is that you can wrap and name specific parts of your code in (prof :name ...). When you then call (profile ...) at the top-level of your code you get a table of min/mean/max runtimes for each named code section as well as how often each were called. When you don't want profiling on anymore set *enabling-profiling* to false and the `prof` macros disappear leaving your original code running without any interference from the profiler.

Looking forward to your future posts (pun semi-intended).
John
2009-10-18 21:38:31
What version of Clojure does this work with?  I tried 1.0, but it backtraces with an error: Exception in thread "main" java.lang.Exception: Unable to resolve symbol: transient in this context (transient-ca.clj:33)
Lau
2009-10-18 21:40:02
@John: Like I state as the very bottom this works with the snapshot alpha of Clojure, meaning pull it directly from Github and compile with 'ant'. Transients are bleeding edge :)
John
2009-10-18 22:25:00
I missed it in the "ps".  Sorry!  Thank you!
jdz
2009-10-22 11:32:53
I'm [slowly] writing a blog post about Brian's Brain in Common Lisp. And before I bash your implementation into corner, I'd like to point you to following deficiencies in your version:
- Your version will bail go down burning if supplied with odd width (easily fixable).
- At least on my computer it animates garbage if I supply it with geometry 160x100.
- Even with the default geometry, the brain's evolution should be symmetric. Yours gets assymmetric.

And I'm not mentioning the issue with double-buffering (on relatively fast computers, I suppose).
Lau
2009-10-22 12:27:46
@jdz: Hey my friend. As any respectable Common Lisper you're not keeping up with the times. In the post I clearly say how to modify the for loop if you want to support asymmetric boards, and in one of my first comments I also note that it's a naive board translation, skipping 1 row down on overflow, simply because the point of this post was to show of transients and type-hints, nothing else.

RE DoubleBuffering, if there is a bug in the java lib, you're the first guy to have noticed it - I wonder if its a single unit issue.
jdz
2009-10-22 12:56:34
Oh, my bad, missed the assymmetry note.  Although it conflicts face-to-face to your own "design goals" (referring to "how much would you have to change this code for it to work in 3 dimensions?" in Ikeda map article). I mean -- why would you intentionally write code that has specific assumptions about the input data when you can as easily write the general version?  I just don't get it.

And about the row-skpping I also don't agree. It does not matter how fast you can get me wrong results!

You have done a good job of showing how easy it can be to write something interesting to play with.  And you are to blame that I'm writing my own implementation.  But is showing off features more important than doing great job?

(I have a little pet theory that deing things right would make the code be not as concise and nice as in your articles, but I'll leave that to your conscience.)

And thanks for the articles, keep up the good work and let's hope I don't have anything to bitch about in the future! :P
jdz
2009-10-22 13:50:05
Just adding to my reply: the asymmetry note is not correct.  The problem with the cords definition is that the wrong indices are used for dimensions vector.  If the indices are swapped, everything works just fine.