In a Nutshell
LambdaCube 3D is a domain specific language and library that makes it possible to program GPUs in a purely functional style.
Purely Functional Rendering Engine
We recently added a new example to the repository that shows one way of implementing convolution filters. Our variance shadow mapping example uses a Gaussian blur as a component, but it’s implemented in a rather naive, straightforward way. This time we’re going to have a closer look at it and improve its performance while trying to preserve the output as much as possible. This little experiment was primarily inspired by a post from Daniel Rákos. While the optimisation is not specific to the Gaussian blur, we’ll stick with it as an example.
First of all, we’ll define a function to calculate the weights of a blur for a given filter width. As Daniel points out, taking the binomial coefficients, i.e. the appropriate row of Pascal’s triangle, and normalising it gives us the desired values. First let’s define the weights without normalisation:
binomialCoefficients :: Int -> [Float] binomialCoefficients n = iterate next  !! (n-1) where next xs = [x+y | x <- xs ++  | y <- 0:xs]
A convolution filter takes samples from the neighbourhood of a given texel and computes a weighted sum from them, therefore we need to assign offsets to the weights before we can use them. We will measure the offsets in texels. The following function assigns offsets to any list of weights with the assumption that the middle element of the list corresponds to the centre of the filter (from now on, we’ll assume the filter width is an odd integer):
withOffsets :: [Float] -> [(Float, Float)] withOffsets cs = [(o, c) | c <- cs | o <- [-lim..lim]] where lim = fromIntegral (length cs `quot` 2)
Finally, we want to normalise the weights, so their sum is 1. We define our normalisation step to work on the weights with offsets:
normalise :: [(Float, Float)] -> [(Float, Float)] normalise ocs = [(o, c/s) | (o, c) <- ocs] where s = sum [c | (_, c) <- ocs]
At this point, we can define a function to calculate weighted samples for a Gaussian blur:
gaussianSamples :: Int -> [(Float, Float)] gaussianSamples = normalise . withOffsets . binomialCoefficients
For instance, gaussianSamples 5 equals [(-2.0,0.0625),(-1.0,0.25),(0.0,0.375),(1.0,0.25),(2.0,0.0625)].
To improve the performance of the filter, we should make it as small as possible. One obvious thing we can do is to simply remove terms that don’t contribute to the results significantly. Since our primary use case is real-time graphics, a bit of inaccuracy most likely won’t make any perceivable difference. Our solution is to introduce a thresholding operation that throws away terms that are much smaller than the biggest term:
threshold :: Float -> [(Float, Float)] -> [(Float, Float)] threshold t ocs = [oc | oc@(_, c) <- ocs, c*t >= m] where m = maximum [c | (_, c) <- ocs]
This simple method can be surprisingly efficient. When we calculate the weights for wide Gaussian blurs, it turns out that a large portion of the outer areas hardly contribute to the final value. For instance, threshold 1000 (withOffsets (binomialCoefficients 101)) leaves only 37 terms.
Another, more clever trick we can apply is specific to texture sampling. Since the GPU provides linear interpolation for free, we can get a weighted sum of two adjacent samples with a single query: f(x) = (1 – frac(x)) * f(floor(x)) + frac(x) * f(ceil(x)). All we need to do is merge adjacent samples by calculating the necessary sampling position. The following function assumes that all samples are adjacent and separated by the unit distance; both criteria are met if we provide the output of withOffsets:
offsetWeight :: [(Float, Float)] -> [(Float, Float)] offsetWeight  =  offsetWeight [ow] = [ow] offsetWeight ((o1,w1):(o2,w2):ows) = (o1+w2/w', w') : offsetWeight ows where w' = w1+w2
The resulting list is half as long as the input. We can also combine this optimisation with thresholding, but only with care: we need to do the sample merging first, since offsetWeight assumes consecutive samples. It’s best to wrap it all up in a single function:
gaussianSamples :: Float -> Int -> [(Float, Float)] gaussianSamples tolerance = normalise . threshold tolerance . offsetWeight . withOffsets . binomialCoefficients
To stick with the previous example, length (gaussianSamples 1000 101) equals 19. This graph shows how these two optimisations affect the number of samples needed to reproduce a Gaussian of a given width:
So how shall we take these weights into use? First of all, we need some input to apply the filter to. We went with a simple pattern generated in a fragment shader; the geometry rendered is just a screen-sized quad:
originalImage :: Exp Obj (FrameBuffer N1 V4F) originalImage = Accumulate accCtx PassAll frag (Rasterize triangleCtx prims) clearBuf where accCtx = AccumulationContext Nothing (ColorOp NoBlending (one' :: V4B) :. ZT) clearBuf = FrameBuffer (ColorImage n1 (V4 0 0 0 1) :. ZT) prims = Transform vert (Fetch "geometrySlot" Triangle (IV2F "position")) vert :: Exp V V2F -> VertexOut () vert pos = VertexOut pos' (floatV 1) (ZT) where V2 x y = unpack' pos pos' = pack' (V4 x y (floatV 0) (floatV 1)) frag :: Exp F () -> FragmentOut (Color V4F :+: ZZ) frag _ = FragmentOut (col :. ZT) where V4 x y _ _ = unpack' fragCoord' x' = sqrt' x @* floatF 16 y' = sqrt' y @* floatF 16 r = Cond ((x' @+ y') @% (floatF 50) @< (floatF 25)) (floatF 0) (floatF 1) g = floatF 0 b = Cond ((x' @- y') @% (floatF 50) @< (floatF 25)) (floatF 0) (floatF 1) col = pack' (V4 r g b (floatF 1))
This is a complete single-pass pipeline description. The vertex shader does nothing beyond simply extending the position vector with the missing Z and W coordinates. The fragment shader uses the fragment coordinate as its only input, so we don’t need to pass any data between the two shader phases. The resulting image is the following:
Next, we’ll define a one-dimensional convolution pass that takes a direction vector, a list of weights with offsets (which are used to scale the aforementioned vector), and an input image. The list is used to generate an unrolled loop in the fragment shader, while the vertex shader performs just an elementary coordinate transformation.
convolve :: V2F -> [(Float, Float)] -> Exp Obj (Image N1 V4F) -> Exp Obj (FrameBuffer N1 V4F) convolve (V2 dx dy) weights img = Accumulate accCtx PassAll frag (Rasterize triangleCtx prims) clearBuf where resX = windowWidth resY = windowHeight dir' :: Exp F V2F dir' = Const (V2 (dx / fromIntegral resX) (dy / fromIntegral resY)) accCtx = AccumulationContext Nothing (ColorOp NoBlending (one' :: V4B) :. ZT) clearBuf = FrameBuffer (ColorImage n1 (V4 0 0 0 1) :. ZT) prims = Transform vert (Fetch "postSlot" Triangle (IV2F "position")) vert :: Exp V V2F -> VertexOut V2F vert uv = VertexOut pos (Const 1) (NoPerspective uv' :. ZT) where uv' = uv @* floatV 0.5 @+ floatV 0.5 pos = pack' (V4 u v (floatV 1) (floatV 1)) V2 u v = unpack' uv frag :: Exp F V2F -> FragmentOut (Color V4F :+: ZZ) frag uv = FragmentOut (sample :. ZT) where sample = foldr1 (@+) [ texture' smp (uv @+ dir' @* floatF ofs) @* floatF coeff | (ofs, coeff) <- weights] smp = Sampler LinearFilter Clamp tex tex = Texture (Texture2D (Float RGBA) n1) (V2 resX resY) NoMip [img]
Incorporating the convolution pass is as simple as using function composition:
weights = gaussianSamples 1000 101 dirH = V2 1 0 dirV = V2 0 1 finalImage :: Exp Obj (FrameBuffer N1 V4F) finalImage = filterPass dirV (filterPass dirH originalImage) where filterPass dir = convolve dir weights . projectBuffer projectBuffer = PrjFrameBuffer "" tix0
Using our original definition for gaussianSamples, i.e. not using thresholding or merging, we get the following rendering:
When we add both optimisations, the resulting image is indistinguishable from the previous one:
Despite the striking similarity, the two images are not identical. If we calculate the difference and enhance the result, we get the following pattern:
With 8-bit RGB components, the maximum difference anywhere on the image is just 1, so no wonder it cannot be detected by the naked eye. This is a really nice result, because the run-time increases steeply with the number of samples. The width of the filter is also a factor, but it’s difficult to predict its effect due to the complexity of the GPU hardware. On the hardware we tested (nVidia GeForce 9650M GT) we got the following timings:
At a first glance, it looks like the time per frame is more or less directly proportional to the number of samples used in the filtering passes after we deduct the time needed to run the no-op 1×1 filter. However, in this particular case, the variant with both optimisations turned on seems to have a smaller time to sample count ratio than the naive version, about 0.2 vs. 0.25 ms, so the gains are even better than expected. On other hardware we might see different relations depending on the peculiarities of memory access costs.
At this point we ran into the limits of the current LambdaCube implementation. One more potential optimisation worth trying would be to calculate the sample coordinates in the vertex shader, making the texture reads non-dependent. Currently this is not possible due to the lack of support for array types in the shaders. It’s also evident that the EDSL approach without explicit let bindings is not viable, because compilation time is exponential with respect to the complexity of the shader. While there have been some positive developments in this area, we intend to switch to the DSL approach in the next version anyway, which should solve this problem. Finally, if you run the example, you’ll notice that you cannot resize the window. The reason for this is rather mundane: currently texture sizes are pipeline compile-time constants. In the future we’ll change their type from V2I to Exp Obj V2I, which will make it possible to modify the values during run-time.
After a long hiatus, we shall conclude our series of posts on the FRP physics example. The previous post discussed the falling brick whose motion state is updated upon collision events, while this one will focus on the mouse draggable rag doll. In fact, the Bullet portion of this feature turned out to be less interesting than some of the observations we made while implementing the interaction in Elerea.
The rag doll itself is straightforward business: it consists of a few capsules (one for each joint), and the constraints that hold them together. These objects are instantiated in the complexBody function given a description. The general idea in Bullet is that constraints are configured in world space. E.g. if we want to connect two bricks with a spring, we have to provide the world-space coordinates of the pivot points given the current position and orientation of the bricks, and the system calculates the necessary parameters based on this input. Afterwards, everything is handled by the physics world, and we don’t have to worry about it at all.
Originally, we wanted to extend the attribute system to be able to describe all the parameters of the constraints in a convenient way. Unfortunately, it turned out that the Bullet API is rather inconsistent in this area, and it would have required too much up front work to create a cleaner façade in front of it for the sake of the example. However, we intend to revisit this project in the future, when LambdaCube itself is in a better shape.
To make things interesting, we allow the user to pick up objects one at a time and drag them around. This is achieved by temporarily establishing a so-called point-to-point constraint while the mouse button is pressed. This constraint simply makes sure that two points always coincide in space without imposing any limits on orientation.
The logic we want to implement is the following:
The high-level process is described by the pickConstraint function:
pickConstraint :: BtDynamicsWorldClass bc => bc -> Signal Vec2 -> Signal CameraInfo -> Signal Bool -> Signal Vec2 -> SignalGen (Signal ()) pickConstraint dynamicsWorld windowSize cameraInfo mouseButton mousePos = do press <- edge mouseButton release <- edge (not <$> mouseButton) pick <- generator $ makePick <$> press <*> windowSize <*> cameraInfo <*> mousePos releaseInfo <- do rec sig <- delay Nothing $ do released <- release newPick <- pick currentPick <- sig case (released, newPick, currentPick) of (True, _, _) -> return Nothing (_, Just (constraintSignal, body), _) -> do constraint <- constraintSignal return $ Just (constraint, body, constraintSignal) (_, _, Just (_, body, constraintSignal)) -> do constraint <- constraintSignal return $ Just (constraint, body, constraintSignal) _ -> return Nothing return sig effectful2 stopPicking release releaseInfo
First, we define press and release events by detecting rising and falling edges of the mouseButton signal. The derived signals yield True only at the moment when the value of mouseButton changes in the appropriate direction. Afterwards, we define the pick signal, which has the type Maybe (Signal BtPoint2PointConstraint, BtRigidBody). When the user presses the button while hovering over a dynamic body, pick carries a signal that corresponds to the freshly instantiated constraint plus a reference to the body in question, otherwise it yields Nothing.
The releaseInfo signal is defined recursively through a delay, which is the most basic way of defining a stateful stream transformer in Elerea. In fact, the stateful and transfer combinators provided by the library are defined in a similar manner. The reason why we can’t use them in this case is the fact that the state contains signals that we need to sample to calculate the next state. This flattening is made possible thanks to Signal being a Monad instance.
The type of the state is Maybe (BtPoint2PointConstraint, BtRigidBody, Signal BtPoint2PointConstraint). The elements of the triple are: the current sample of the constraint, the body being dragged, and the time-changing signal that represents the constraint. The transformation rules described through pattern matching are the following:
In the end, releaseInfo will carry a triple wrapped in Just between a successful pick and a release event, and Nothing at any other moment. This signal, along with release itself, forms the input of stopPicking, which just invokes the appropriate Bullet functions to destroy the constraint at the right moment.
The missing piece of the puzzle is makePick, which is responsible for creating the constraint signal:
makePick :: Bool -> Vec2 -> CameraInfo -> Vec2 -> SignalGen (Maybe (Signal BtPoint2PointConstraint, BtRigidBody)) makePick press windowSizeCur cameraInfoCur mousePosCur = case press of False -> return Nothing True -> do pickInfo <- execute $ pickBody dynamicsWorld windowSizeCur cameraInfoCur mousePosCur case pickInfo of Nothing -> return Nothing Just (body, hitPosition, distance) -> do constraint <- createPick dynamicsWorld body hitPosition distance windowSize cameraInfo mousePos return $ Just (constraint, body)
This is a straightforward signal generator, and passing it into generator in the definition of pick ensures that it is invoked in every frame. The pickBody function is an ordinary IO operation that was already mentioned in the first post of this series. Most of the work is done in createPick when an appropriate body is found:
createPick :: (BtDynamicsWorldClass bc, BtRigidBodyClass b) => bc -> b -> Vec3 -> Float -> Signal Vec2 -> Signal CameraInfo -> Signal Vec2 -> SignalGen (Signal BtPoint2PointConstraint) createPick dynamicsWorld body hitPosition distance windowSize cameraInfo mousePos = do make' (createPickConstraint dynamicsWorld body hitPosition) [ setting :!~ flip set [impulseClamp := 30, tau := 0.001] , pivotB :< pivotPosition <$> windowSize <*> cameraInfo <*> mousePos ] where createPickConstraint dynamicsWorld body hitPosition = do bodyProj <- transformToProj4 <$> btRigidBody_getCenterOfMassTransform body let localPivot = trim ((extendWith 1 hitPosition :: Vec4) .* fromProjective (inverse bodyProj)) pickConstraint <- btPoint2PointConstraint1 body localPivot btDynamicsWorld_addConstraint dynamicsWorld pickConstraint True return pickConstraint pivotPosition windowSize cameraInfo mousePos = Just (rayFrom &+ (normalize (rayTo &- rayFrom) &* distance)) where rayFrom = cameraPosition cameraInfo rayTo = rayTarget windowSize cameraInfo mousePos
The actual constraint is instantiated in createPickConstraint, which is just a series of Bullet API calls. We define the second pivot point as a signal attribute; the signal is a stateless function of the starting distance, the mouse position, and the view projection parameters. Such signals can be defined by lifting a pure function (in this case pivotPosition) using the applicative combinators. Since pivotPosition never yields Nothing, the pivot point is updated in every frame.
The most interesting outcome of this experiment, at least in our opinion, is the realisation how FRP can make it easier to deal with mutable state in a disciplined way. In particular, it provides a nice solution in the situation when a mutable variable needs to be modified by several entities. Since all the future edits are available as a signal, it is straightforward to resolve edit conflicts with a state machine. In fact, the FRP approach practically forces us to do so.
Dealing with the interdependencies of several time-varying values can also be tricky. Again, with FRP we have no choice but to clearly define what happens in all the possible constellations. One example for this in the above code is the definition of releaseInfo, where we used pattern matching to account for all the possibilities. It is an open question how this method scales as the program grows in complexity, and we’ll see that better in our future experiments.
In the previous post, we introduced a simple attribute system to give the raw Bullet API a friendlier look. As it turns out, this system is easy to extend along the temporal dimension, since an attribute of a physical object is a time-varying value – implemented as a mutable variable under the hood. The common theme in various FRP approaches is about reifying the whole lifetime of such values in some form, which naturally leads to the idea of using the attributes as the bridge between the reactive library and the physics engine.
To make a seamless integration of Elerea and Bullet possible, we needed to define a few additional primitive signal constructors. Originally, the only way to feed data from IO sources into an Elerea network was through the external primitive, which constructed a signal and a corresponding IO function to update its output (the first argument is the initial value):
external :: a -> IO (Signal a, a -> IO ())
The obvious problem with external is the fact that it works outside the signal network, so it cannot be defined in a convenient way in terms of any entity that lives inside the reactive world. The simplest solution is to directly embed IO computations. One-off computations can be directly executed in SignalGen through the execute primitive, which is equivalent to liftIO. IO signals can be constructed with the effectful* family of functions, which are analogous to the applicative lifting combinators, but apply an IO function to the input signals instead of a pure one:
effectful :: IO a -> SignalGen (Signal a) effectful1 :: (t -> IO a) -> Signal t -> SignalGen (Signal a) effectful2 :: (t1 -> t2 -> IO a) -> Signal t1 -> Signal t2 -> SignalGen (Signal a) ...
The results must be in the SignalGen context in order to ensure that the constructed signals cause the IO computations to be evaluated exactly once per sampling step. The reason is simple: to meet this condition, the signal needs to be memoised, which requires an additional state variable, and state variables can only be introduced in SignalGen.
First, we extend the list of attribute operations with a fifth member, which defines the attribute to be updated by a signal. Since we might not want to take full control of the attribute, just intervene once in a while, the signal is required to be partial through Maybe. When the signal yields Nothing, the attribute is managed by Bullet during that superstep.
data AttrOp o = forall a . Attr o a := a | forall a . Attr o a :~ (a -> a) | forall a . Attr o a :!= IO a | forall a . Attr o a :!~ (a -> IO a) | forall a . Attr o a :< Signal (Maybe a)
Now we have all the building blocks necessary to define a signal-based variant of set from the last post:
set' :: o -> [AttrOp o] -> SignalGen (Signal ()) set' obj as = go as (return ()) where go  sig = return sig go (a:as) sig = case a of Attr getter setter := x -> execute (setter obj x >> return ()) >> go as sig Attr getter setter :~ f -> execute (getter obj >>= setter obj . f >> return ()) >> go as sig Attr getter setter :!= x -> execute (x >>= setter obj >> return ()) >> go as sig Attr getter setter :!~ f -> execute (getter obj >>= f >>= setter obj >> return ()) >> go as sig Attr getter setter :< s -> do dummy <- flip effectful1 s $ \mx -> case mx of Nothing -> return () Just x -> setter obj x >> return () go as (liftA2 const sig dummy)
The first four cases are unchanged, they just need to be wrapped with execute. The signal case is also straightforward: we use effectful1 to sample the signal and update the attribute with its current value. What might not be clear first is all the additional plumbing. Unfortunately, this is all necessary due to the fact that if a signal is not referenced anywhere in the system, it gets garbage collected.
In this case, effectful1 doesn’t produce any meaningful output. All we need is its side effects. However, we still need to store the dummy signal (a stream of unit values), otherwise the setter will only be updated until the next garbage collection round. We can think of the dummy signal as a thread that’s kept alive only as long as we have a handle to it. To ensure that the reference is not lost, we carefully wrap it in another dummy signal that keeps all the signal setters alive. It is up to the caller of set’ to store the final signal.
It is useful to define the equivalent of make as well:
make' :: IO o -> [AttrOp o] -> SignalGen (Signal o) make' act flags = do obj <- execute act dummy <- set' obj flags return (liftA2 const (return obj) dummy)
To reduce the chance of user error, we return a signal that represents the object we just constructed. The signal always yields a reference to the object, and keeps all the signal attributes alive. To ensure correct behaviour, we should always sample this signal when querying the object, instead of saving the reference and passing it around. This is necessary for another reason as well: thanks to Elerea’s dependency handling, it makes sure that the signal attributes are updated before the object is queried.
The example scene contains two independent details: a falling brick whose collisions with a designated object cause reactions, and a ragdoll that can be dragged with the mouse. In the rest of this post we’ll concentrate on the brick.
All the objects are created in the top level of the signal network, in the SignalGen monad. In this setup, the object used for collision detection is actually a ghost sphere, not a solid body. We’ll refer to it as the query space. Whenever the brick intersects this sphere, its position and orientation is reset, but its velocity is retained. First we create the sphere, which is just an ordinary IO operation, so we can execute it:
querySpace <- execute $ do ghostObject <- make btPairCachingGhostObject [ collisionFlags :~ (.|. e_btCollisionObject_CollisionFlags_CF_NO_CONTACT_RESPONSE) , collisionShape := sphereShape ghostRadius , worldTransform := Transform idmtx 0 ] btCollisionWorld_addCollisionObject dynamicsWorld ghostObject 1 (-1) return ghostObject
Afterwards, we create a signal that tells us in every frame which objects are colliding with the query space:
collisions <- effectful $ collisionInfo querySpace
The collisionInfo function returns a list of tuples, where each member contains references to both bodies involved, plus some spatial information that we ignore in this scenario. This is not a library function; its full definition is part of the example. The exact details are not relevant to the topic of this post, as they are just a direct application of the Bullet API, so we’re not going to discuss it here.
Given the collision signal, we can now define the brick:
let initBrickTrans = Transform idmtx (Vec3 2 20 (-3)) brick <- do rec brick <- make' (snd <$> localCreateRigidBodyM dynamicsWorld 1 initBrickTrans (boxShape brickSize)) [worldTransform :< boolToMaybe initBrickTrans . bodyInCollision brickBody <$> collisions] brickBody <- snapshot brick return brick
We use make’ to invoke the constructor of the brick and define the temporal behaviour of its worldTransform attribute in a single step. Again, the details of the construction are not particularly interesting: all we need is a mass, an initial transformation (position and orientation), and a collision shape for starters.
The real magic happens in the attribute override. Given the signal that tells us who collides with the query space, we can derive the signal that describes how the world transform of the brick needs to be updated over time. This is achieved by mapping a pure function over the signal, which yields Nothing if brickBody is not involved in any of the collisions, and Just initBrickTrans if it is.
One interesting bit to note is the recursion needed for the above definition, which is made possible thanks to SignalGen being a MonadFix instance. In order to define the world transform update signal, we need a reference to the object that we’re just creating. The reference comes from taking a snapshot of the brick signal. Since the update signal doesn’t need to be sampled for the sake of constructing the object, we don’t end up in an infinite loop.
While the general idea of using signals to define time-varying attributes works in practice, it leads to the need for ‘dummy’ signals that have to be kept around explicitly. The big problem with this solution is that it’s a potential source of programmer error. We believe that it is just one manifestation of the more general issue that Elerea provides no way to define the death of signals in a deterministic way. Currently we rely on the garbage collector to clean up all the update activity that’s not needed any more, and it’s up to the programmer to define the signals in a way that they stop their activity at the right time.
While FRP research solved the problem of start times in several ways, it’s not nearly as clear how to describe the endpoint of a signal’s life. Most likely all we need is a few additional primitives that capture the essence of end times the same way the SignalGen monad captures the essence of start times. Recently there have been some interesting developments in this area; we’re hoping that e.g. the work of Wolfgang Jeltsch or Heinrich Apfelmus will help us come up with a practical solution.
Earlier this year we were experimenting with creating an FRP-style API on top of the Bullet physics engine. As a result, we developed a simple example scene using Elerea, which is now available in the project repository. This post is the first in a series to discuss the example in detail. We start the tour by introducing the libraries used.
Bullet is a C++ library, which makes it tricky to drive from Haskell. Csaba solved the problem by generating a plain C wrapper around it, and created a Haskell binding for this wrapper. This provides us a nice steady base to build on. Programming against this interface feels very much like using Gtk2Hs. As an example, let’s see a slightly simplified variant of a function from the example used in mouse picking. In this function, we cast a ray into the world and return the closest dynamic rigid body it hit:
rayTarget :: Vec2 -> CameraInfo -> Vec2 -> Vec3 pickBody :: BtCollisionWorldClass bc => bc -> Vec2 -> CameraInfo -> Vec2 -> IO (Maybe (BtRigidBody, Vec3, Float)) pickBody dynamicsWorld windowSize cameraInfo mousePosition = do let rayFrom = cameraPosition cameraInfo rayTo = rayTarget windowSize cameraInfo mousePosition rayResult <- btCollisionWorld_ClosestRayResultCallback rayFrom rayTo btCollisionWorld_rayTest dynamicsWorld rayFrom rayTo rayResult hasHit <- btCollisionWorld_RayResultCallback_hasHit rayResult case hasHit of False -> return Nothing True -> do collisionObj <- btCollisionWorld_RayResultCallback_m_collisionObject_get rayResult isNotPickable <- btCollisionObject_isStaticOrKinematicObject collisionObj internalType <- btCollisionObject_getInternalType collisionObj case isNotPickable || internalType /= e_btCollisionObject_CollisionObjectTypes_CO_RIGID_BODY of True -> return Nothing False -> do btCollisionObject_setActivationState collisionObj 4 -- DISABLE_DEACTIVATION hitPosition <- btCollisionWorld_ClosestRayResultCallback_m_hitPointWorld_get rayResult body <- btRigidBody_upcast collisionObj -- this would be null if the internal type is not CO_RIGID_BODY return $ Just (body, hitPosition, len (hitPosition &- rayFrom))
We can think of the camera info as the transformation matrix that maps the world on the screen. The rayTarget function returns the endpoint of the ray corresponding to the mouse position on the far plane of the view frustum given all the relevant information. First we create a data structure (the ‘ray result callback’) to hold the result of the raycast, then perform the actual ray test. The value of hasHit is true if the segment between rayFrom and rayTo intersects any object in the physics world.
The C++ snippet corresponding to the first five lines of the do block might look something like this:
btVector3 rayFrom = cameraPosition(cameraInfo); btVector3 rayTo = rayTarget(windowSize, cameraInfo, mousePosition); btCollisionWorld::ClosestRayResultCallback rayCallback(rayFrom, rayTo); dynamicsWorld->rayTest(rayFrom, rayTo, rayCallback); bool hasHit = rayCallback.hasHit();
If hasHit is true, we can get a reference to the object from rayResult, and check if it is of the right type. If everything matches, we return the body, the world position of the point where the ray hit it first, and the distance to that point from the camera. One of the nice things about this binding is that it uses the vector types from the vect library out of the box instead of exposing Bullet specific vectors, so all the spatial calculations are really easy to write without having to jump through extra hoops first.
Elerea is an FRP library that’s primarily aimed at game programming. Its basic abstraction is the discrete generic stream – referred to as Signal –, and it can be used to describe fully dynamic data-flow networks. In essence, it provides a nice compositional way to define the state transformation during the simulation step. It also allows IO computations to be interspersed in this description, thereby providing lightweight (threadless) framework for cooperative multitasking.
There are two kinds of structures in Elerea. A value of type Signal a can be thought of as a time-varying value of type a. All the future values of the signal are fully determined by its definition, i.e. signals are context independent just as we expect from ordinary values in a pure functional language. The other structure is the SignalGen monad, which is a context where stateful signals are constructed. Mutually dependent signals can be defined thanks to the fact that SignalGen is an instance of MonadFix.
The basic idea behind SignalGen can be understood in terms of start times. Every context corresponds to a (discrete) moment on the global timeline, and every stateful signal constructed in that context is considered to start at that moment. However, signals themselves are defined in terms of the global time, which allows us to combine signals that originate from different contexts (e.g. create a numeric signal that’s the point-wise sum of two unrelated numeric signals). The API ensures that no signal can physically exist before its start time; theoretically, signals are undefined until that point.
When executing the resulting data-flow network, Elerea guarantees consistency by double buffering. The superstep that advances the network consists of two phases: read and commit. In the read phase every node queries its dependencies, and no-one changes their output. In the commit phase every node performs its state transition independently, based on the input from the read phase, so the inconsistent state is never observed anywhere in the system.
While the C-style API allows us to access all the functionalities, it’s not very comfortable to use. The biggest issue is its verbosity: all the names are fully qualified out of necessity, and each property of an object has to be set separately. Therefore, we took some inspiration from the glib attribute system used by Gtk2Hs, and implemented something similar in the example. An attribute is the pair of a getter and the corresponding setter for a given property:
data Attr o a = forall x . Attr !(o -> IO a) !(o -> a -> IO x)
We allow arbitrary return types for setters to make it easier to define attributes, since many Bullet setters return something other than unit. However, we discard these values for the time being, so it’s really just to avoid having to type ‘() <$’ so many times.
Attributes are brought to life through attribute operations, which specify how to calculate the value of the property. There are four possibilities: set a given value, transform the current value with a pure function, set a given value coming from an IO computation, and transform the current one with an IO computation. These are denoted as follows:
infixr 0 :=, :~, :!=, :!~ data AttrOp o = forall a . Attr o a := a | forall a . Attr o a :~ (a -> a) | forall a . Attr o a :!= IO a | forall a . Attr o a :!~ (a -> IO a)
We need existentials to hide the type of the property and only expose the type of the object, so we can easily create collections of attributes. Now we can define the functions that connect them to the actual objects:
set :: o -> [AttrOp o] -> IO o set obj attrs = (>> return obj) $ forM_ attrs $ \op -> case op of Attr _ setter := x -> setter obj x >> return () Attr getter setter :~ f -> getter obj >>= setter obj . f >> return () Attr _ setter :!= x -> x >>= setter obj >> return () Attr getter setter :!~ f -> getter obj >>= f >>= setter obj >> return () get :: o -> Attr o a -> IO a get obj (Attr getter _) = getter obj make :: IO o -> [AttrOp o] -> IO o make act flags = do obj <- act set obj flags return obj
This is a fully generic system nothing to do with Bullet at this point. The set function takes an object and updates all the attributes listed in its second argument. The get function is just a thin helper to retrieve the value of a property given the corresponding attribute. Finally, make is another thin helper that allows us to construct an object and set its attributes in a single step.
A simple example is the world transform property of collision objects. It can be read and written by the following two functions:
btCollisionObject_getWorldTransform :: BtCollisionObjectClass bc => bc -> IO Transform btCollisionObject_setWorldTransform :: BtCollisionObjectClass bc => bc -> Transform -> IO Transform
Turning it into an attribute is as simple as constructing a pair out of the above functions:
worldTransform :: BtCollisionObjectClass o => Attr o Transform worldTransform = Attr btCollisionObject_getWorldTransform btCollisionObject_setWorldTransform
Given this definition, we can write set body [worldTransform := …] to update the transform, and get body worldTransform to retrieve it. In the next post we’ll see how to extend the above system to define attributes tied to Elerea signals, which allows us to define all their future values at the time of their creation, and how to use this capability to define a rigid body whose position is reset every time it collides with another given object.
One of the test examples in the repository is an implementation of Variance Shadow Mapping, which lets us demonstrate multi-pass rendering through a relatively simple effect. In this post we’ll first have a look at the runnable EDSL implementation, discussing some of our syntactic woes on the way.
The Wikipedia article on shadow mapping gives a good overview of the general idea: first we render the scene from the point of view of the light source (using perspective projection for point or spot lights, and orthographic projection for directional lights), recording the distances of various surfaces from the light, then use this information as a look-up table in a second pass when we render the scene in the space of the main camera.
All we need to do to make shadow mapping work is to think carefully about the transformation matrices involved. The transformation from object local coordinates to screen coordinates is normally decomposed into three phases:
The model matrix is potentially different for every object, while view and projection are global to the rendering pass, so they can be composed before rendering. Consequently, we are going to use the following transformation matrices:
The first step is to record the distances of the closest surfaces from the light source. Whether the light is directional or point light is irrelevant; all we need is the light matrix to be able to transform the vertices as desired. This simple depth pass is handled by the following definition:
depth :: Exp Obj (FrameBuffer N1 (Float, Float)) depth = Accumulate accCtx PassAll frag (Rasterize triangleCtx prims) clearBuf where accCtx = AccumulationContext Nothing (DepthOp Less True :. ColorOp NoBlending True :. ZT) clearBuf = FrameBuffer (DepthImage n1 1000 :. ColorImage n1 0 :. ZT) prims = Transform vert (Fetch "geometrySlot" Triangle (IV3F "position")) lightMatrix = Uni (IM44F "lightMatrix") modelMatrix = Uni (IM44F "modelMatrix") vert :: Exp V V3F -> VertexOut Float vert pos = VertexOut lightPos (floatV 1) (Smooth depth :. ZT) where lightPos = lightMatrix @*. modelMatrix @*. v3v4 pos V4 _ _ depth _ = unpack' lightPos frag :: Exp F Float -> FragmentOut (Depth Float :+: Color Float :+: ZZ) frag depth = FragmentOutRastDepth (depth :. ZT)
Due to the lack of dedicated syntax, the notation is somewhat heavier than the DSL equivalent would be. The Exp type constructor corresponds to the @ operator; its first argument is the frequency, while the second is the payload type. For technical reasons, we also need our own non-flat tuple representation, which is constructed with the :. operator (:+: in the type system). In this case, depth is a one-layer framebuffer that stores two floating point numbers per pixel. In this particular case, both happen to be the same: the distance from the light. The first is the depth generated by the rasteriser, and the second is the output of the fragment shader. The latter value will change later when we calculate variance.
The top level of the definition starts with the last stage of the pipeline, the accumulation step. This is where the fragment shader is applied to the output of the rasteriser, i.e. the fragment stream. The accumulation context describes what happens to each fragment produced by the shader: they are only kept if their depth is less than that in the framebuffer, the new depth is written to the buffer, and the new colour simply overwrites the old one without blending. There is no additional fragment filter, which is expressed with the PassAll constant (equivalent to providing a constantly true function). Before accumulation, the framebuffer is cleared by setting the raster depth to 1000 and the ‘colour’ to 0 in each pixel.
As for the fragment stream, it is obtained by rasterising the primitive stream, which consists of triangles. There is only one vertex attribute, the position. Everything else is irrelevant for this pass, since we don’t need to calculate anything besides the actual shapes. Geometry can be pushed into the pipeline through named slots, and this is the point where we can define the name.
The vertex shader transforms the position into light-space coordinates by going through the world coordinate system first. Our infix operators are all prefixed with @ to avoid name collusions with the operators defined in the standard prelude; we intend to drop this prefix in the DSL. The v3v4 function simply extends a 3D vector with the homogeneous coordinate w=1. For the payload, it simply emits the z coordinate of the light-space position. We don’t have a convenient interface for swizzling at the moment: vector components can be extracted by pattern matching on the result of the unpack’ function (all the primitive functions have an apostrophe in their name to avoid collision with the prelude). Also, the definition of uniforms makes it quite apparent that we are basically building a raw AST.
In order to make use of the depth information, we need to convert it into a sampler, which will be referenced in the second, final pass:
shadowMapSize :: Num a => a shadowMapSize = 512 sm :: Exp Obj (FrameBuffer N1 (Float, V4F)) sm = Accumulate accCtx PassAll frag (Rasterize triangleCtx prims) clearBuf where accCtx = AccumulationContext Nothing (DepthOp Less True :. ColorOp NoBlending (one' :: V4B) :. ZT) clearBuf = FrameBuffer (DepthImage n1 1000 :. ColorImage n1 (V4 0.1 0.2 0.6 1) :. ZT) prims = Transform vert (Fetch "geometrySlot" Triangle (IV3F "position", IV3F "normal")) cameraMatrix = Uni (IM44F "cameraMatrix") lightMatrix = Uni (IM44F "lightMatrix") modelMatrix = Uni (IM44F "modelMatrix") lightPosition = Uni (IV3F "lightPosition") vert :: Exp V (V3F, V3F) -> VertexOut (V3F, V4F, V3F) vert attr = VertexOut viewPos (floatV 1) (Smooth (v4v3 worldPos) :. Smooth lightPos :. Smooth worldNormal :. ZT) where worldPos = modelMatrix @*. v3v4 localPos viewPos = cameraMatrix @*. worldPos lightPos = lightMatrix @*. worldPos worldNormal = normalize' (v4v3 (modelMatrix @*. n3v4 localNormal)) (localPos, localNormal) = untup2 attr frag :: Exp F (V3F, V4F, V3F) -> FragmentOut (Depth Float :+: Color V4F :+: ZZ) frag attr = FragmentOutRastDepth (luminance :. ZT) where V4 lightU lightV lightDepth lightW = unpack' lightPos uv = clampUV (scaleUV (pack' (V2 lightU lightV) @/ lightW)) surfaceDistance = texture' sampler uv lightPortion = Cond (lightDepth @<= surfaceDistance @+ floatF 0.01) (floatF 1) (floatF 0) lambert = max' (floatF 0) (dot' worldNormal (normalize' (lightPosition @- worldPos))) intensity = lambert @* lightPortion luminance = pack' (V4 intensity intensity intensity (floatF 1)) clampUV x = clamp' x (floatF 0) (floatF 1) scaleUV x = x @* floatF 0.5 @+ floatF 0.5 (worldPos, lightPos, worldNormal) = untup3 attr sampler = Sampler PointFilter Clamp shadowMap shadowMap :: Texture (Exp Obj) DIM2 SingleTex (Regular Float) Red shadowMap = Texture (Texture2D (Float Red) n1) (V2 shadowMapSize shadowMapSize) NoMip [PrjFrameBuffer "shadowMap" tix0 depth]
The definition of the sampler is right at the bottom. First we convert the framebuffer yielded by depth into an image using PrjFrameBuffer, which projects a given member of a tuple during the conversion. The predefined value tix0 is the tuple index of the first element. In this case, our ‘tuple’ is degenerate, since it consists of a single element anyway. The resulting image is converted into a two-dimensional texture (shadowMap) with just a floating-point red channel. Finally, the texture is wrapped in a sampler structure (sampler), which specifies that it’s a non-repeating image that must not be smoothened during sampling, since that wouldn’t be meaningful for depth values at the edges of objects.
The pipeline setup is very similar to that of the depth pass, and most of the difference is in the shaders. This is a more complex case, where we have tuples both in the vertex and the fragment stream. Again, for technical reasons we need to unpack these representations with a dedicated function (untup*) and pattern matching, just like vectors. We could also use view patterns to make this extra step a bit less painful, but in the end all this won’t be necessary in the DSL.
As for shadows, the work is divided up between the two shader phases. The vertex shader calculates the vectors needed: view position (used by the rasteriser), world space position, light space position and world space surface normal. We cheated a bit with the normal calculation, since we don’t use the inverse transpose of the matrix. This is fine as long as our transformations don’t involve any non-uniform scaling, or we’re only scaling axis-aligned cuboids. The n3v4 function extends a 3D vector with w=0, so it is treated as a direction.
The fragment shader calculates the final colour of each pixel. The light space position is used to address the shadow map (sampler) as well as to quickly determine the distance to the light source without calculating an extra square root. The value of lightPortion is 0 if there is an occluder between the current point and the light source, 1 otherwise. To avoid self-shadowing, a little offset is applied to the stored depth of the closest surface (this could have been done in the first pass as well). Afterwards, we calculate the light contribution using the Lambert model, i.e. taking the cosine of the angle between the surface normal and the vector that points towards the light. Multiplying this value with the light proportion gives us an image like this:
We can clearly see that the chosen depth offset is not sufficient, since the background plane still suffers from self-shadowing.
In the actual example, we use a spotlight instead of a point light that radiates in every direction. This effect is simply achieved by calculating a colour that depends on the shadow UV coordinates. We can change the definition of intensity and luminance to get a better idea of the light’s direction:
uv' = uv @- floatF 0.5 spotShape = floatF 1 @- length' uv' @* floatF 4 intensity = max' (floatF 0) (spotShape @* lambert) V2 spotR spotG = unpack' (scaleUV (round' (uv' @* floatF 10)) @* intensity) luminance = pack' (V4 spotR spotG intensity (floatF 1)) @* lightPortion
The resulting image is maybe a bit more interesting than the previous one:
There are several issues with basic shadow mapping. Aliased shadow edges are only the tip of the iceberg; nowadays they can be easily fixed by using shadow samplers, which can provide bilinear filtering on the light proportion instead of just a binary comparison. However, if we want softer shadows, life gets a lot more complicated. The obvious solution is to take several samples of the shadow map and average the results, but this entails a potentially severe performance hit. We can also get away with a single sample if we apply a jitter to the UV coordinates, but this results in a noisy pattern instead of a smooth transition.
Variance shadow mapping makes it possible to get a proper smoothing effect with a single sample from the shadow map. The basic idea is to store a probability distribution instead of an exact function, and estimate the probability of a pixel being occluded instead of performing an exact test. The VSM algorithm uses Chebyshev’s inequality for the estimation. Since our shadow map is now a probability distribution, we can directly blur it with a Gaussian filter and get meaningful results. Another nice side effect is that the new formula also addresses the problem of self-shadowing, and provides a robust solution in place of the rather brittle and scene dependent offset hack.
In the first pass, we store the first two moments of the depth distribution:
moments :: Exp Obj (FrameBuffer N1 (Float, V2F)) moments = Accumulate accCtx PassAll frag (Rasterize triangleCtx prims) clearBuf where accCtx = AccumulationContext Nothing (DepthOp Less True :. ColorOp NoBlending (one' :: V2B) :. ZT) clearBuf = FrameBuffer (DepthImage n1 1000 :. ColorImage n1 (V2 0 0) :. ZT) prims = Transform vert (Fetch "geometrySlot" Triangle (IV3F "position")) lightMatrix = Uni (IM44F "lightMatrix") modelMatrix = Uni (IM44F "modelMatrix") vert :: Exp V V3F -> VertexOut Float vert pos = VertexOut lightPos (floatV 1) (Smooth depth :. ZT) where lightPos = lightMatrix @*. modelMatrix @*. v3v4 pos V4 _ _ depth _ = unpack' lightPos frag :: Exp F Float -> FragmentOut (Depth Float :+: Color V2F :+: ZZ) frag depth = FragmentOutRastDepth (pack' (V2 moment1 moment2) :. ZT) where dx = dFdx' depth dy = dFdy' depth moment1 = depth moment2 = depth @* depth @+ floatF 0.25 @* (dx @* dx @+ dy @* dy)
The difference between moments and depth is the type (two-dimensional float vector instead of a single float per pixel) and the fragment shader, which calculates two moments of the distribution. The second pass (vsm) is also similar to the basic shadow mapping case, the only thing that changes is the formula for the light portion, which now becomes the maximum probability that the surface is lit:
V2 moment1 moment2 = unpack' (texture' sampler uv) variance = max' (floatF 0.002) (moment2 @- moment1 @* moment1) distance = max' (floatF 0) (lightDepth @- moment1) lightProbMax = variance @/ (variance @+ distance @* distance)
The other thing that changes slightly is the definition of the sampler:
sampler = Sampler LinearFilter Clamp shadowMap shadowMap :: Texture (Exp Obj) DIM2 SingleTex (Regular Float) RG shadowMap = Texture (Texture2D (Float RG) n1) (V2 shadowMapSize shadowMapSize) NoMip [PrjFrameBuffer "shadowMap" tix0 moments]
Unlike the previous one, this texture has a green component as well to store the second moment, and the sampler is set up to perform linear filtering. Using the value of lightProbMax directly as the light portion is a good first approximation, but it leads to light bleeding, a well-known problem with VSM:
Before addressing the bleeding issue, we should first take advantage of the fact that the shadow map can be filtered. We are going to insert an extra pair of passes between moments and vsm that blurs the shadow map. It is a pair of passes because we exploit the separability of the Gaussian filter, so first we blur the image vertically, then horizontally, thereby doing O(n) work per pixel instead of O(n2), where n is the width of the filter. The blur is described by the following function:
blur :: [(Float, Float)] -> Exp Obj (Image N1 V2F) -> Exp Obj (FrameBuffer N1 V2F) blur coefficients img = filter1D dirH (PrjFrameBuffer "" tix0 (filter1D dirV img)) where dirH v = Const (V2 (v / shadowMapSize) 0) :: Exp F V2F dirV v = Const (V2 0 (v / shadowMapSize)) :: Exp F V2F filter1D :: (Float -> Exp F V2F) -> Exp Obj (Image N1 V2F) -> Exp Obj (FrameBuffer N1 V2F) filter1D dir img = Accumulate accCtx PassAll frag (Rasterize triangleCtx prims) clearBuf where accCtx = AccumulationContext Nothing (ColorOp NoBlending (one' :: V2B) :. ZT) clearBuf = FrameBuffer (ColorImage n1 (V2 0 0) :. ZT) prims = Transform vert (Fetch "postSlot" Triangle (IV2F "position")) vert :: Exp V V2F -> VertexOut V2F vert uv = VertexOut pos (Const 1) (NoPerspective uv' :. ZT) where uv' = uv @* floatV 0.5 @+ floatV 0.5 pos = pack' (V4 u v (floatV 1) (floatV 1)) V2 u v = unpack' uv frag :: Exp F V2F -> FragmentOut (Color V2F :+: ZZ) frag uv = FragmentOut (sample :. ZT) where sample = foldr1 (@+) [texture' smp (uv @+ dir ofs) @* floatF coeff | (ofs, coeff) <- coefficients] smp = Sampler LinearFilter Clamp tex tex = Texture (Texture2D (Float RG) n1) (V2 shadowMapSize shadowMapSize) NoMip [img]
The blur function takes a set of coefficient-offset pairs and an input image, and yields a framebuffer containing the filtered version of the image. This shows the power of our approach: since this is an ordinary pure function, it can be applied to any image in any pipeline. In fact, it could be part of a standard library of utility functions after factoring out the resolution as a separate argument instead of using our global constant shadowMapSize. The only notable novelty here is the use of Haskell for meta-programming: the amount of shader code generated is proportional to the length of the coefficient list, since the summation is expanded statically. We need to find a convenient substitute for this facility when migrating to the DSL.
To insert the blur, we need to change the definition of the sampler in vsm:
sampler = Sampler LinearFilter Clamp shadowMapBlur shadowMapBlur :: Texture (Exp Obj) DIM2 SingleTex (Regular Float) RG shadowMapBlur = Texture (Texture2D (Float RG) n1) (V2 shadowMapSize shadowMapSize) NoMip [PrjFrameBuffer "shadowMap" tix0 blurredMoments] where blurredMoments = blur blurCoefficients (PrjFrameBuffer "blur" tix0 moments) blurCoefficients = [(-4.0, 0.05), (-3.0, 0.09), (-2.0, 0.12), (-1.0, 0.15), (0.0, 0.16), (1.0, 0.15), (2.0, 0.12), (3.0, 0.09), (4.0, 0.05)]
After this change, our shadows change drastically:
As it turns out, blurring also helps against light bleeding to a certain extent, since light bleeding appears in areas where the variance is big. However, it is still quite obvious. There are several ways to address the problem, and we chose the simplest for the sake of the example: raising the light portion to a power. The higher the exponent, the less the light bleeding, but unfortunately increasing the exponent also causes overdarkening. In a way, this is a hack in a similar vein as the depth offset is for simple shadow mapping. In the example, we use the square of lightProbMax as the light portion, which gives us nice soft shadows:
We recommend using this example as a starting point for your own experiments, as it has no external dependencies.
In the earlier posts, we mentioned several times that we imagine LambdaCube as a standalone DSL. But what would this mean in practice?
One of our long-term goals is to build a productive content authoring environment that’s based on a purely functional rendering pipeline. Who knows, it might even be feasible to package LambdaCube as a Unity plugin! Another possible application could be generating an optimised rendering component to be integrated in another game engine. In any case, interoperability is a major concern for us. Driven by this goal, we came up with the following plan for the LambdaCube stack:
The heart of this scheme is obviously the intermediate language. The frontend compiler is going to be an independent tool (available both as a plain executable and a library with at least a C interface) that should be possible to easily integrate into any build system. This tool is responsible for most of the static analysis, some trivial as well as (hopefully) clever optimisations, and also planning the allocation of resources. As a result, it turns the declarative pipeline description into a series of instructions for a virtual machine that handles GPU specific data structures as primitives, e.g. setting up the rendering context, or binding vertex buffers, textures, and framebuffers in the right order. Given this intermediate description, we can take it further in many directions, as illustrated by the figure above.
Of course, LambdaCube is not likely to be useful when you are working on the bleeding edge and need accurate control over the resources. Also, when we are on said bleeding edge, it is most likely premature to think about building high-level abstractions. Therefore, we’d like to make it clear that we don’t see LambdaCube as a one-size-fits-all solution. If we were to make a mission statement, it would go something like this: we want to cover 95% of the use cases, making even moderately complex projects easier, and not get in the way of the remaining 5%.
We quickly put together two HD quality videos of the big LambdaCube examples for the lazy ones. First the Quake 3 level viewer:
Then the Stunts example, featuring some truly mad driving skils:
Over and out!
LambdaCube is based on a purely functional model of the graphics pipeline. The graphics pipeline is a multi-rate data-flow network, therefore it was a straightforward decision to use streams as the basic abstraction for data that’s processed sequentially. However, LambdaCube is not a purely stream-oriented language, as we will see below. For the time being, we only deal with DX10/OGL3.2 constructs, i.e. we allow geometry shaders but no tesselation.
The language extends beyond the responsibility of shader languages by also reifying framebuffers, thereby being able to expose results of single rendering passes as plain values. Conceptually, a framebuffer is the result of a fold operation over a fragment stream, and exposing it as a first-class value is the main source of the extra expressive power over conventional stream programming languages. A multi-pass rendering process is simply described by an expression that contains at least one subexpression that evaluates to a framebuffer converted to a sampler.
First we’ll look at the simple case of transforming a vertex buffer into a framebuffer that contains the final rendering in a single pass. This is achieved by sending the data through the rendering pipeline, whose stages are each modelled by a pure function. Part of the design process was to decide how we want to decompose the whole process.
The major building blocks of the LambdaCube pipeline are the following functions (with simplified Haskell-style type signatures):
fetch :: VertexBuffer a -- The input buffer → p -- The associated primitive → VertexStream p a -- The stream enumerating the elements of the buffer transform :: VertexShader a b -- The function applied to each vertex → VertexStream p a -- The incoming stream → PrimitiveStream p 1 V b -- The resulting stream reassemble :: GeometryShader p1 p2 n a b -- The function applied to each primitive → PrimitiveStream p1 1 V a -- The incoming stream → PrimitiveStream p2 n G b -- The resulting stream rasterize :: RasterContext p -- The settings used for rasterisation → PrimitiveStream p n f a -- The incoming stream → FragmentStream n a -- The resulting stream accumulate :: AccumulationContext b -- Accumulation settings → FragmentFilter a -- Filtering predicate (fragment kept if true) → FragmentShader a b -- The function applied to each fragment → FragmentStream n a -- The incoming stream of fragments → FrameBuffer n b -- The starting state of the framebuffer → FrameBuffer n b -- The final state of the framebuffer
Each stage boundary has a corresponding stream type, which captures the totality of the data that passes through the cross-section as an abstract entity.
The pipeline starts its operation by fetching the vertex data, which means taking a buffer of records and enumerating them one by one. While this is just a formal step in the model, it is significant in practice as fetching is the bridge between the CPU and the GPU. In the model, one can think of VertexBuffer t as an array that contains a piece of data of type t for each vertex. In reality, it is a reference to future data, and it consists of a mapping of names to parts of the data structure (typically a flat tuple) describing each vertex – basically a list of attribute names and types –, plus it specifies the name of the corresponding input slot.
Besides the vertex data, we also need to specify how consecutive vertices are assembled into primitives. The second parameter of fetch provides this information in a strongly typed way. The type p is different for each kind of primitive: point, line, or triangle. This distinction is useful mainly during rasterisation, as we will see below.
The first transformation is a simple map operation over the vertex stream: the vertex shader is applied to each vertex independently. We chose the name transform for this step because it is typically used to transform vertices from object local space to camera or clip space besides calculating additional data. Vertex shaders are simply pure functions:
VertexShader a b ≡ a@V → b@V'
This is the first time we see frequencies appear in the types, denoted with the type operator @. According to the above equivalence, a vertex shader is a function that takes a value in a vertex context (V) and produces a vertex output (V').
The vertex context can be thought of as an instance of the reader monad, where the input type is a tuple composed of additional information for the vertex (e.g. the index of the vertex within the stream). Consequently, it is an applicative functor, therefore any primitive constant and function operating on primitive types can be lifted to work in it. Since we do not need the expressiveness of monads, we can make this lifting implicit, which makes it trivial to define vertex shaders in a natural way.
The original vertex context is supplied by the transform function, while the vertex output has to be defined by the programmer. There is only one way to construct a value of type a@V':
vertexOut :: V4F@V → Float@V → (Interpolated a)@V → a@V'
The vertexOut function associates an interpolated value in a vertex context (Interpolated a) with a position in homogeneous coordinates (V4F) and a point size (Float). An interpolated value is simply a value associated with an interpolation strategy, which can be flat, smooth, or linear. The following functions can be used to construct such values:
flat :: a@f → (Interpolated a)@f smooth :: Continuous a ⇒ a@f → (Interpolated a)@f linear :: Continuous a ⇒ a@f → (Interpolated a)@f
The interpolation strategy chosen will be relevant during rasterisation. Since integer attributes cannot be interpolated, they must be flat shaded. Floating-point attributes (including vectors and matrices) can be interpolated; the formulae that define the resulting value for a fragment at a given position depending on the interpolation strategy are described in the OpenGL standard.
The final result of the transformation step is a primitive stream, i.e. transform is also responsible for primitive assembly besides vertex shading, while these are generally considered to be distinct stages of the rendering pipeline. For the time being, we don’t see the practical need to separate these stages, since we cannot do anything useful with the output of the vertex shader other than passing it forward.
The PrimitiveStream type constructor has for parameters: primitive type, layer count, stage of origin, and payload type. As the type signature of the transform function confirms, vertex shader preserves the primitive type originally associated with the stream, and the final type of the payload is identical to the output type of the vertex shader. The stage of origin is always V (to distinguish it from a primitive stream produced by the geometry shader), and the layer count is 1 by definition, as only geometry shaders can produce a multi-layered output, i.e. address a three-dimensional framebuffer.
The next step in the pipeline is optional: we take the stream of primitives produced by transform, and emit a stream of a possibly different primitive type, where each input corresponds to zero, one, or several (with a statically known maximum) outputs. We can see the role of the stage of origin in the primitive stream’s type: since the reassemble function requires a vertex shader output, the type system prevents us from adding more than one geometry shader to the pipeline.
The actual transformation is defined by the geometry shader, which corresponds to the following type:
GeometryShader p1 p2 n a b ≡ (n, p2, Int, GeometryTransducer p1 a b) GeometryTransducer p1 a b ≡ ((PrimitiveVertices p1 a)@G → (i, Int)@G, i@G → (i, j, Int)@G, j@G → (j, b)@G')
Unfortunately, we cannot provide a simple, elegant definition similar to the vertex shader, because geometry shaders are more involved. The arguments of the GeometryShader type constructor are the following: input primitive, output primitive, layer count, payload input and output types. This type is equivalent to a 4-tuple that directly specifies the layer count and the output primitive (strongly typed), the maximum number of primitives emitted for any incoming primitive, and the actual shader logic.
At the present moment, we prescribe a rigid structure of two nested loops in the geometry shader. This is defined by the user by providing the following three functions:
The final output must be defined in the inner loop using a function similar to vertexOut, but slightly more involved:
geometryOut :: V4F@G → Float@G → Int@G → Int@G → j@G → (Interpolated a)@G → (j, a)@G'
In order, the arguments are the vertex position, the point size, the primitive id, the layer id, the next inner loop state, and the payload, which needs to carry interpolation information in the same way as vertex outputs.
We are not particularly happy with this solution due to its clunkiness. A more elegant alternative would be to define the output with a single function as a list of dynamic length, but it might be difficult to provide a set of constructors that’s convenient to use and easy to generate code from at the same time. In any case, this is a design problem we shall attack again after having a reasonably user-friendly implementation.
During rasterisation, each primitive is turned into a stream of fragments, and the return value of rasterize is the concatenation of these streams. Within a primitive, each resulting fragment has a unique two-dimensional coordinate, i.e. they correspond to separate pixels of the framebuffer. This transformation is a fixed functionality of the GPU, and it is not fully defined, so the exact outcome might depend on the hardware. However, the OpenGL standard, for instance, prescribes that the rasterisation process fulfils certain criteria, e.g. the result is invariant to translation by integer coordinates as long as the primitive is not clipped.
Ultimately, the role of the rasterisation process is to determine the list of coordinates that a primitive covers, and calculate the interpolation of the vertex attributes in every resulting fragment. The formula used for interpolation depends on the strategy chosen in the vertex or geometry shader (if the latter is present), as discussed above.
The raster context is a tuple describing the details of rasterisation, whose exact shape depends on the primitive type. For instance, RasterContext Triangle is the most complex case, which specifies the following factors:
The fact that fragments correspond to distinct coordinates is the key in achieving high performance. Since they all affect different pixels of the output, they can be processed in parallel. For our purposes, this means that the ordering of fragments coming from any given primitive is undefined and cannot be relied upon.
The final step in the rendering process, performed by the accumulate function, consumes the fragment stream through a map and a fold phase. The fragment shader specifies the function used to map over the incoming fragments, while the final accumulation is a primarily fixed-structure computation that takes a framebuffer and blends the fragments into it one by one. Similarly to transform, this step is generally considered to be two separate phases that we fused together in the current version of the model.
The framebuffer is really just a glorified multidimensional array. The dimensions are width, height, component count, and layer count. The last dimension is encoded as a phantom type (denoted with n throughout the above type signatures), and it is greater than one only in the presence of a geometry shader. The first two dimensions are addressed through the fragment coordinates. As for the components, there are three kinds: depth, stencil, and colour. The first two serve special purposes – mainly related to discarding fragments through fixed functionality –, while colour components can be used to store any kind of data. We refer to this distinction as the semantics of a given piece of data. Semantics is similar to interpolation strategy in a sense that both are represented by a simple tagging scheme.
Similarly to the raster context, the accumulation context is a tuple that contains all the information needed to decide what operations to perform when processing each fragment produced by the rasterisation phase. The tuple has three main components, each optional depending on the type of the framebuffer used during accumulation, and each corresponding to a semantics:
The fragment shader is used to process the fragment before blending, while the fragment filter is a simple boolean function in the fragment context:
FragmentFilter a ≡ a@F → Bool@F FragmentShader a b ≡ a@F → b@F'
If the filter evaluates to false for a given fragment, that fragment is thrown away instead of being blended. Depth and stencil testing act as further filters.
Otherwise, the fragment shader is similar to the other shaders: it receives the interpolated attributes along with some additional data produced by the pipeline (the fragment context F), and yields a fragment output (F'). The output can be constructed with either of these functions:
fragmentOut :: a@F → (Color a)@F' fragmentOutDepth :: Float@F → a@F → (Depth Float, Color a)@F' fragmentOutRasterDepth :: a@F → (Depth Float, Color a)@F'
The fragmentOut function is used in the absence of a depth buffer. If the framebuffer has a depth component, the fragment comes with a computed depth value accessible through the fragment context, which can be either overridden (fragmentOutDepth) or kept (fragmentOutRasterDepth). The latter is a special case with a practical significance: it allows us to perform the depth test before evaluating the fragment shader, thereby getting a chance to discard the fragment earlier. This is a well-known optimisation that doesn’t change the final output.
In the end, the blending equation is used to combine the corresponding pixel in the framebuffer with the output of the fragment shader, and the result replaces the former.
In order to produce complex effects, especially when using deferred rendering, one pass is not sufficient. The contents of a framebuffer can be exposed to subsequent passes through ordinary samplers. Due to the multitude of the texture formats and the general complexity of the interaction between datatypes, we omit the actual type signatures here and only present the basic concept. The gory details will be discussed in a subsequent post.
Our model deals with four kinds of image types. In the order of getting data from one pass to the next they are the following:
There are functions that can convert from each type of the above list to the type below it by specifying all the additional information needed and possibly converting the payload on the way (e.g. projecting out just one element of a tuple type). Ultimately, samplers can be accessed in shader contexts through a family of primitive operations whose name starts as texture, which take a sampler and a sampling point (coordinates of matching dimension), and return the sample at that point. This is sufficient to create effects like shadow mapping or perform screen-space processing like fake ambient occlusion or bloom.
Besides multiple passes, we found it necessary to be able to handle meshes and bigger groups of objects at a higher level. In practice, this means being able to write fold-like loops over groups of objects. For instance, we might want to loop over the list of lights to accumulate their individual contributions to the scene. For each light, we need to render all the objects it affects to build the corresponding shadow map. We might also have an inner loop in case we use a cascading scheme (partitioning the scene) to improve the accuracy of the shadows.
If we want to define these loops with respect to the frequency framework, we can say that objects live in a frequency range between constants and primitives. It also seems that the general case is in essence a tree fold operation, where the nodes closer to the root correspond to the outer loops, and the leaves are the individual vertex buffers, yielding a framebuffer as the end result. The tree describes the order objects are rendered in, so it is not the scene graph itself, only related to it through a mechanical transformation.
Unfortunately, we haven’t been able to work out the details of the object layer yet. For the time being, we implement such loops manually in the Haskell EDSL version of LambdaCube.
At its heart, the graphics pipeline is simply a configurable data-flow network. The main input of the network arrives as a stream of vertex descriptions, and additional data can be provided in various slots, which are constant during a rendering pass: uniforms of basic types and samplers (textures with some attached logic). After some processing steps, the final output is one or more raster images.
We can look at this data-flow network as a mathematical function that maps scene descriptions to bitmaps. The internal structure of this function can be defined in terms of smaller building blocks that correspond to various stages of the pipeline. Even the programmable pipeline has a more or less fixed global structure, but the transformations within the main stages can be freely defined through shaders, which suggests that the pipeline can be naturally modelled as a higher-order function.
If a single execution of the whole pipeline – a pass – is in essence a function application, it is also straightforward to compose these functions to build more complex rendering processes. All we need is a mapping from framebuffers to textures, which can be read back through samplers in a later pass. In the end, even multi-pass rendering can be modelled as a pure function.
Unfortunately, the practice of programming GPUs completely obscures the purity of the transformation, because the data-flow has to be set up manually, and its description requires at least two languages: a shader language for the GPU parts and the host language to hook up everything on the CPU side. The latter involves programming against a stateful API that sets up the video card through side effects. This is an unnecessarily error prone process, and difficult to structure with composable modules.
Working directly with the data-flow is beneficial for several reasons. As an abstraction over the hardware, it opens up the possibility of using all the capabilities of the machine without having to specifically program for each platform, and might also make profiling-driven optimisations feasible. This also makes the code future proof, since only the data-flow compiler needs to be updated from time to time. Besides, pure functions provide a degree of compositionality unmatched by code relying on side effects. It should be much easier to create reusable components in this system, because they are context independent. At the same time, the final pipeline could be made efficient by applying whole-program optimisation steps based on mathematical equalities, e.g. floating (hoisting) certain operations from the fragment shader to an earlier stage.
Despite the fact that declarative stream programming is a well-established paradigm – represented by languages as old as e.g. Lustre –, and GPUs are an obvious subject for this approach, it is difficult to find systems that try to implement the idea outlined above. We quickly present a few relevant solutions below, and we’d be happy to receive pointers to others if you happen to know one or two; feel free to comment!
The Stanford Real-Time Programmable Shading Project is a good starting point, since this is where the concept of computation frequency was first introduced. The frequency associated with a node in the data-flow network rather unsurprisingly specifies how often that node emits new values. This quantity is directly related to the pipeline phase the node is in: nodes in the fragment shader produce several values for each input coming from the vertex shader, which in turn processes several records (vertex attributes) for each possible change in the uniforms, since uniforms are fixed during a pass. The frequency concept can reach as far as the preprocessing stage, since compile-time constants can be thought of as the ultimate slowest-changing values.
While the language developed by the project is deliberately C-like in its appearance, it is purely functional in nature. All the pipeline is described by a single program, and it is up to the system to either allocate everything in the appropriate shader stage or even multiple passes if the fragment-level computation is too complex to fit in one (or in the absence of programmable shaders). Frequencies are handled by the type system and inferred by the compiler. Unfortunately, due to the fact that this is a decade-old project, the expressiveness of the language is severely limited.
If we are to marry graphics and functional programming, it is impossible not to mention Conal Elliott’s name. His Vertigo project shows another way to turn purely functional descriptions into executable shader code, and the demos show how to define parametric surfaces and various materials. Unfortunately, this system doesn’t address the issue of frequencies, as it generates only vertex shaders. However, Conal’s work is generally interesting because he approaches modelling from a purely mathematical standpoint, therefore everything is dictated by precise denotational semantics.
One more recent attempt is Renaissance, which allows the user to describe a single pass in a pure functional language. Given such a description, it can infer the type and frequency of the nodes solely from their dependencies, and allocate all the operations in the corresponding shader stages. Unlike RTSL, the syntax of this language is inspired by Haskell, and type annotations are optional. Conceptually, the programmer defines a fragment shader, and it is up to the compiler to float computations back to earlier stages if they don’t refer to any fragment-frequency constructs. Otherwise, the two systems are similar in expressive power.
The project formerly known as HaGPipe is a Haskell EDSL. This is the only other system we’re aware of that can describe multi-pass rendering processes (i.e. rendering to textures and using the results in subsequent passes) without any need for assistance from the host language or manually tinkering with framebuffers. All the other systems mentioned here are in essence shader language replacements, only intended to be just a small component in a rendering engine. GPipe distinguishes between frequencies in the type system, by explicitly stating whether a value is a PrimitiveStream or a FragmentStream, for instance. It also makes it possible to explicitly configure the fixed functionality of the rendering pipeline, e.g. blending equation or depth test, thereby providing a complete solution.
Unlike the others, Spark approaches modelling using OO principles. It is interesting for us mainly because it is trying to attack the same problem from another direction: organisation of rendering code with reusable components, where the separation of concerns is independent from the pipeline stages. In Spark, the means of combination are inheritance and generous use of mixins. The type system is aware of different frequencies, and it also allows the user to define new ones.
We decided to make LambdaCube a DSL as opposed to an embedded DSL, like GPipe, for a number of reasons.
Most importantly, DSLs are decoupled from the compiler of the host language. As a result, pipeline descriptions can be bundled with geometry and other scene data without introducing unnecessary dependencies. They don’t have to be mixed with the code, where they don’t belong.
Being responsible for all the compilation stages starting with parsing gives us in general a lot more control over the process of executing code as opposed to relying on a host language compiler. For instance, it should allow us to provide facilities to perform quick partial recompiles combined with hot-swapping, at least during development when no complex optimisations are turned on. Also, it could potentially enable advanced features like interactive optimisation, where individual rewrite rules could be turned on and off while a test scene is running, thereby getting instant feedback about their effect on performance.
Finally, when designing a DSL, we are not limited by the features or structure of a host language, e.g. its type system or standard library. We can potentially get a better fit with the domain. Even if the host language has an expressive type system that’s technically capable of modelling the domain, using it directly might not necessarily provide optimal experience. For instance, type errors tend to be cryptic due to the disconnect with the domain.
At the present moment, LambdaCube is only available as a pseudo-EDSL in Haskell, until we implement the compiler front-end. The language exposed is essentially the AST of the actual DSL made slightly friendlier through a thin wrapper module, which adds convenience features like infix operators. However, we intend to move forward with the parser as soon as possible.
Even if the title of this opening post might sound slightly pompous, we believe that a bit of ambition won’t hurt at this point. After all, we intend to challenge the state of the art in 3D rendering. As for who the mysterious ‘we’ refers to, you can find out in the About section.
The primary topic of this blog will be LambdaCube 3D, a domain specific language and library that provides a purely functional way to program GPUs. However, we also intend to cover additional issues related to creating interactive audiovisual content, especially games, in idiomatic Haskell. Just a few examples: reliable cross-platform audio programming, integrating physics, functional reactive programming (FRP henceforth).
LambdaCube started its life in the beginning of 2009 as Csaba’s toy project to gain some experience in Haskell by putting cool things on the screen through the OpenGL bindings. As Gergely was his mentor at the university at the time, they teamed up and submitted an application to the Jane Street Summer Project. As a result, the demo application was completely rewritten and turned into a library in a few months, and saw its first Hackage release. This library was still designed as a reimplementation of the OGRE 3D engine, i.e. it would be able to load content created for OGRE, and it had superficial integration with the Bullet physics engine as well.
The first release was followed by a long period of silence due to other obligations (and maybe a slight case of double stealth mode), but development sped up again in 2011. After a few months of work behind the curtains, an improved version of the library was released, which already made it possible to build the Stunts demo. This version still followed the same principles as the first one, focusing on OGRE support, and was decidedly the last release of its kind, since it was impossible to provide a nice functional interface while building on a legacy base.
Previously, there have been attempts at programming GPUs in a way that’s truly functional in spirit. So far, the most feature complete system to our knowledge is GPipe, which served as the primary inspiration for the third major version of LambdaCube. Just as GPipe, LambdaCube allows the developer to describe the whole rendering pipeline in a purely functional manner as a stream processing network, and let the system figure out how to map the description to the underlying graphics APIs and manage resources. The primary difference between the two is that GPipe is a DSL embedded in Haskell, while LambdaCube aims to be a fully independent DSL that can exist outside the Haskell world.
In its current state, LambdaCube is already functional, but still in its infancy. The current API is a rudimentary EDSL that is not intended for direct use in the long run. It is essentially the internal phase of a compiler backend exposed for testing purposes. To exercise the library, we have created two small proof of concept examples: a port of the old LambdaCube Stunts example, and a Quake III level viewer. It is our pleasure to see that LambdaCube already provides much better runtime performance than GPipe without compromising the overall design.
If you are interested in getting your hands dirty, feel free to check out the Getting Started section.