LambdaCube 3D

Purely Functional Rendering Engine

LambdaCube on Hackage at last

We noticed a while ago that, because Hackage packages are so easily accessible, many people still equate LambdaCube with its former incarnation, which intended to replicate Ogre3D in Haskell. Today we finally got around to uploading the real thing. To make everything clear, we also marked the old packages as deprecated. The packages to use at the moment are the following:

Note that this update is a somewhat hasty response, mainly meant to deprecate the old engine. The current structure is by no means final: for one thing, we’re working on separating the OpenGL-specific parts from the core and splitting the package in two, which would later allow us or others to create different back-ends for LambdaCube. We’re also going to upload the Stunts and Quake 3 demos built on the new engine in the near future.

Introducing the LambdaCube Intermediate Representation

A few months ago we gave a quick overview of our long-term plans with LambdaCube. One of the central elements in this vision is an intermediate representation (IR) that allows us to split the LambdaCube compiler, separating the front-end and back-end functionalities. Currently we’re reorganising the implementation into two packages: one to compile the EDSL to the IR, and the other to execute the IR over native OpenGL. Today we’ll have a quick glance at the present shape of the IR that makes this possible.

Before jumping to the meat of the topic, we need a little context. The figure in the linked post provides a good overview, so let’s take a closer look at it:

lc-architecture

The front-end compiler produces the IR, which is just a plain data structure. The back-end provides a library-style API that allows applications to manipulate the inputs of the pipeline (uniforms, textures, and geometry), and execute it at will. A typical LambdaCube back-end API is expected to contain functions along these lines:

compilePipeline :: IR -> IO Pipeline
setUniform :: Pipeline -> UniformName -> GpuPrimitive -> IO ()
setTexture :: Pipeline -> TextureName -> Texture -> IO ()
addGeometry :: Pipeline -> SlotName -> Geometry -> IO GeometryRef
removeGeometry :: Pipeline -> GeometryRef -> IO ()
executePipeline :: Pipeline -> IO ()

Or if the back-end is e.g. a Java library, it could define a Pipeline class:

class Pipeline {
    Pipeline(IR description) { ... }
    void setUniform(UniformName name, GpuPrimitive value) { ... }
    void setTexture(TextureName name, Texture texture) { ... }
    GeometryRef addGeometry(SlotName slot, Geometry geometry) { ... }
    void removeGeometry(GeometryRef geometry) { ... }
    void execute() { ... }
}

In typical usage, compilePipeline is invoked once in the initialisation phase to build a run-time representation of the pipeline. Afterwards, rendering a frame consists of setting up the inputs as desired, then executing the pipeline. Note that geometry doesn’t necessarily mean just vertex attributes; it can also include uniform settings. Our OpenGL back-end also allows per-object uniforms.
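
To make the intended usage concrete, here is a sketch of a render loop written against the hypothetical Haskell API above; how the application obtains the IR and the per-frame uniform value is left as a parameter, since none of this refers to an actual package yet.

import Control.Monad (forever)

-- A sketch only: IR, Pipeline, UniformName and GpuPrimitive refer to the
-- hypothetical back-end API listed above.
renderLoop :: IR -> UniformName -> IO GpuPrimitive -> IO ()
renderLoop ir timeName readTime = do
    pipeline <- compilePipeline ir          -- done once, at initialisation
    forever $ do
        value <- readTime                   -- per-frame input from the application
        setUniform pipeline timeName value  -- update a named pipeline input...
        executePipeline pipeline            -- ...then run all the passes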

Now we can have a closer look at the pipeline! The top-level definition of the pipeline IR is captured by the following data structure:

data Pipeline
    = Pipeline
    { textures   :: Vector TextureDescriptor
    , samplers   :: Vector SamplerDescriptor
    , targets    :: Vector RenderTarget
    , programs   :: Vector Program
    , slots      :: Vector Slot
    , commands   :: [Command]
    }
    deriving Show

A pipeline is represented as a collection of various top-level constants followed by a series of rendering commands. The constants are sorted into five different categories:

  1. Texture descriptions: specify the size and shape of textures, and also contain references to samplers.
  2. Sampler descriptions: specify the sampler parameters like wrap logic or filter settings.
  3. Render targets: each target is a list of images (either a texture or the output) with semantics (colour, depth, or stencil) specified.
  4. Shader programs: isolated fragments of the original pipeline that can be compiled separately, each corresponding to a rendering pass; they also specify the structure of their inputs and outputs (names and types).
  5. Input slots: each slot is a reference to some storage space for geometry that will be fed to the rendering passes. Slot descriptions define the structure of the input (vertex attributes and uniforms coming from the geometry), and they also enumerate references to all the passes that use them as a convenience feature.

Given all the data above, the commands describe the steps needed to execute the pipeline.

data Command
    = SetRasterContext          RasterContext
    | SetAccumulationContext    AccumulationContext
    | SetRenderTarget           RenderTargetName
    | SetProgram                ProgramName
    | SetSamplerUniform         UniformName TextureUnit
    | SetTexture                TextureUnit TextureName
    | SetSampler                TextureUnit (Maybe SamplerName)
    | RenderSlot                SlotName
    | ClearRenderTarget         [(ImageSemantic,Value)]
    | GenerateMipMap            TextureUnit
    | SaveImage                 FrameBufferComponent ImageRef
    | LoadImage                 ImageRef FrameBufferComponent

As we can see, this instruction set doesn’t resemble that of a conventional assembly language. There are no control structures, and we don’t deal with values at this level, only data dependencies. In essence, the pipeline program defines a suitable traversal order for the original pipeline definition. The job of the front-end compiler is to figure out this order and the necessary allocation of resources (mainly the texture units).

Most of the instructions – the ones whose name starts with Set – just specify the GPU state required to render a given pass. When everything is set up, RenderSlot is used to perform the pass. It can be optionally preceded by a ClearRenderTarget call. We also need some extra machinery to keep the results of passes around for further processing if the dependency graph refers to them several times. For the time being LoadImage and SaveImage are supposed to serve this purpose, but this part of the IR is still in flux.
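
To illustrate, the command stream of a single pass might have roughly the following shape; this is purely illustrative, and the concrete names and values are taken as parameters because their representation belongs to the IR:

-- Purely illustrative: the shape of a command stream for one rendering pass,
-- built from the constructors above.
singlePass :: RenderTargetName -> ProgramName -> SlotName
           -> RasterContext -> AccumulationContext
           -> [(ImageSemantic, Value)] -> [Command]
singlePass target prog slot rastCtx accCtx clearValues =
    [ SetRenderTarget target          -- where this pass renders to
    , SetRasterContext rastCtx
    , SetAccumulationContext accCtx
    , SetProgram prog                 -- the shader program of the pass
    , ClearRenderTarget clearValues   -- optional clear before drawing
    , RenderSlot slot                 -- draw everything stored in the slot
    ]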

The nice thing about this scheme is its clear separation of static structure and data. We basically take the standard OpenGL API and replace direct loads of data with named references. The rendering API is used to assign data during run-time using these names. This doesn’t necessarily have to involve a hash lookup, since the API can provide additional functionality to retrieve direct references to use in time-critical code. The language maps closely to existing graphics interfaces, so it’s easy to create a lightweight interpreter or even a native code generator for it. Finally, there is a straightforward way to extend it with features we don’t support yet, like instancing or transform feedback, if the need arises.

A few thoughts on geometry shaders

We just added a new example to the LambdaCube repository, which shows off cube map based reflections. Reflections are rendered by sampling a cube map, which is created by rendering the world from the centre of the reflecting object in six directions. This is done in a single pass, using a geometry shader to replicate every incoming triangle six times. Here is the final result:

Reflecting surface simulated with cube mapping

While the main focus of this blog is language and API design, we need to describe the pipeline structure of the example to put the rest of the discussion into context. The high-level structure corresponds to the following data-flow graph:

Pipeline structure for the cube map example

The most important observation is that several pieces of this graph are reused multiple times. For instance, all geometry goes through the model-view transformation, but sometimes this is performed in a vertex shader (VS), sometimes in a geometry shader (GS). Also, the same lighting equation is used both when creating the reflection map and when rendering the non-reflective parts of the final image, so the corresponding fragment shader (FS) is shared.

The Good

For us, the most important result of writing this example was that we could express all the above-mentioned instances of shared logic in a straightforward way. The high-level graph structure is captured by the top declarations in sceneRender’s definition:

sceneRender = Accumulate accCtx PassAll reflectFrag (Rasterize rastCtx reflectPrims) directRender
  where
    directRender = Accumulate accCtx PassAll frag (Rasterize rastCtx directPrims) clearBuf
    cubeMapRender = Accumulate accCtx PassAll frag (Rasterize rastCtx cubePrims) clearBuf6

    accCtx = AccumulationContext Nothing (DepthOp Less True :. ColorOp NoBlending (one' :: V4B) :. ZT)
    rastCtx = triangleCtx { ctxCullMode = CullFront CCW }
    clearBuf = FrameBuffer (DepthImage n1 1000 :. ColorImage n1 (V4 0.1 0.2 0.6 1) :. ZT)
    clearBuf6 = FrameBuffer (DepthImage n6 1000 :. ColorImage n6 (V4 0.05 0.1 0.3 1) :. ZT)
    worldInput = Fetch "geometrySlot" Triangles (IV3F "position", IV3F "normal")
    reflectInput = Fetch "reflectSlot" Triangles (IV3F "position", IV3F "normal")
    directPrims = Transform directVert worldInput
    cubePrims = Reassemble geom (Transform cubeMapVert worldInput)
    reflectPrims = Transform directVert reflectInput

The top-level definition describes the last pass, which draws the reflective capsule – whose geometry is carried by the primitive stream reflectPrims – on top of the image emitted by a previous pass called directRender. The two preceding passes render the scene without the capsule (worldInput) both to a screen-sized framebuffer and to the cube map. We can see that the pipeline section generating the cube map has a reassemble phase, which corresponds to the geometry shader. Note that these two passes have no data dependencies between each other, so they can be executed in any order by the back-end.

It’s easy to see how the same fragment shader is used in the first two passes. The more interesting story is finding a way to express the model-view transformation in one place and use it both in directVert and geom. As it turns out, we can simply extract the common functionality and give it a name. The function we get this way is frequency-agnostic, which is reflected in its type:

    transformGeometry :: Exp f V4F -> Exp f V3F -> Exp f M44F -> (Exp f V4F, Exp f V4F, Exp f V3F)
    transformGeometry localPos localNormal viewMatrix = (viewPos, worldPos, worldNormal)
      where
        worldPos = modelMatrix @*. localPos
        viewPos = viewMatrix @*. worldPos
        worldNormal = normalize' (v4v3 (modelMatrix @*. n3v4 localNormal))

The simpler use case is directVert, which simply wraps the above functionality in a vertex shader:

    directVert :: Exp V (V3F, V3F) -> VertexOut () (V3F, V3F, V3F)
    directVert attr = VertexOut viewPos (floatV 1) ZT (Smooth (v4v3 worldPos) :. Smooth worldNormal :. Flat viewCameraPosition :. ZT)
      where
        (localPos, localNormal) = untup2 attr
        (viewPos, worldPos, worldNormal) = transformGeometry (v3v4 localPos) localNormal viewCameraMatrix

As for the geometry shader…

The Bad

… we already mentioned in the introduction of our functional pipeline model that we aren’t happy with the current way of expressing geometry shaders. The current approach is a very direct mapping of two nested for loops onto an initialisation function and two state transformers – essentially unfold kernels. The outer loop is responsible for one primitive per iteration, while the inner loop emits the individual vertices. Without further ado, here’s the geometry shader needed by the example:

    geom :: GeometryShader Triangle Triangle () () 6 V3F (V3F, V3F, V3F) 
    geom = GeometryShader n6 TrianglesOutput 18 init prim vert
      where
        init attr = tup2 (primInit, intG 6)
          where
            primInit = tup2 (intG 0, attr)
        prim primState = tup5 (layer, layer, primState', vertInit, intG 3) 
          where
            (layer, attr) = untup2 primState
            primState' = tup2 (layer @+ intG 1, attr)
            vertInit = tup3 (intG 0, viewMatrix, attr)
            viewMatrix = indexG (map cubeCameraMatrix [1..6]) layer
        vert vertState = GeometryOut vertState' viewPos pointSize ZT (Smooth (v4v3 worldPos) :. Smooth worldNormal :. Flat cubeCameraPosition :. ZT)
          where
            (index, viewMatrix, attr) = untup3 vertState
            vertState' = tup3 (index @+ intG 1, viewMatrix, attr)
            (attr0, attr1, attr2) = untup3 attr
            (localPos, pointSize, _, localNormal) = untup4 (indexG [attr0, attr1, attr2] index)
            (viewPos, worldPos, worldNormal) = transformGeometry localPos localNormal viewMatrix

The init function’s sole job is to define the initial state and iteration count of the outer loop. The initial state is just a loop counter set to zero plus the input of the shader in a single tuple called attr, while the iteration count is 6. The prim function takes care of increasing this counter, specifying the layer for the primitive (equal to the counter), and picking the appropriate view matrix from one of six uniforms. It defines the iteration count (3, since we’re drawing triangles) and the initial state of the inner loop, which contains another counter set at zero, the chosen view matrix, and the attribute tuple. Finally, the vert function calculates the output attributes using transformGeometry, and also its next state, which only differs from the current one in having the counter incremented.

On the one hand, we had success in reusing the common logic between different shader stages simply by extracting it as a pure function. On the other hand, it is obvious at this point that directly mapping imperative loops results in really awkward code. At least it does the job!

The Next Step?

We’ve been thinking about alternative ways to model geometry shaders that would allow a more convenient and ‘natural’ manner of expressing our intent. One option we’ve considered lately would be to have the shader yield a list of lists. This would allow us to use scoping to access attributes in the inner loop instead of having to pass them around explicitly, not to mention doing away with explicit loop counters altogether. We could use existing techniques to generate imperative code, e.g. stream fusion. However, it is an open question how we could introduce lists or some similar structure in the language without disrupting other parts, keeping the use of the new feature appropriately constrained. One thing is clear: there has to be a better way.
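
To illustrate the intent, here is roughly what the cube map shader of this example could look like under such a model, written as plain CPU-side Haskell over ordinary lists; this is just a sketch, not LambdaCube API:

-- One input primitive (the list of its vertex attributes) is mapped to a
-- list of output primitives, one inner list per layer.  The attributes stay
-- in scope throughout, and there are no explicit loop counters.
cubeMapGeometry :: (view -> attr -> attr)  -- per-vertex transform for a given view
                -> [view]                  -- the six per-face view matrices
                -> [attr]                  -- vertices of the incoming triangle
                -> [[attr]]                -- emitted triangles, one per layer
cubeMapGeometry transformWith views vertices =
    [ map (transformWith view) vertices | view <- views ]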

Designing a custom kind system for rendering pipeline descriptions

We are in the process of creating the next iteration of LambdaCube, where we finally depart from the Haskell EDSL approach and turn the language into a proper DSL. The reasons behind this move were outlined in an earlier post. However, we still use the EDSL as a testing ground while designing the type system, since GHC comes with a rich set of features available for immediate use. The topic of today’s instalment is our recent experiment to make illegal types unrepresentable through a custom kind system. This is made possible by the fact that GHC recently introduced support for promoting datatypes to kinds.

Even though the relevant DataKinds extension is over a year old (although it’s been officially supported only since GHC 7.6), we couldn’t find any example that uses it to model a real-life domain. Our first, limited impression is that this is a direction worth pursuing.

Perhaps surprisingly, this idea has already come up in the context of computer graphics, in the Spark project. Tim Foley’s dissertation briefly discusses the type theory behind Spark (see Chapter 5 for details). The basic idea is that we can introduce a separate kind called Frequency (Spark refers to this concept as RecordType), and constrain the Exp type constructor (@ in Spark) to take a type of this kind as its first argument.

To define the new kind, all we need to do is write a plain data declaration, as opposed to the four empty data declarations we used to have:

data Frequency = Obj | V | G | F

As we enable the DataKinds extension, this definition automatically creates a kind and four types of this kind. Now we can change the definition of Exp to take advantage of it:

data Exp :: Frequency -> * -> * where
    ...

For the time being, we diverge from the Spark model. The difference is that the resulting type has kind * in our case, while in the context of Spark it would be labelled as RateQualifiedType. Unfortunately, Haskell doesn’t allow using non-* return kinds in data declarations, so we can’t test this idea with the current version. Since having a separate Exp universe is potentially useful, we might adopt the notion for the DSL proper.

We don’t have to stop here. There are a few more areas where we can sensibly constrain the possible types. For instance, primitive and fragment streams as well as framebuffers in LambdaCube have a type parameter that identifies the layer count, i.e. the number of framebuffer layers we can emit primitives to using geometry shaders. Instead of rolling a custom solution, we can now use type-level naturals with support for numeric literals. Among others, the definitions of the stream types and images reflect the new structure:

data VertexStream prim t
data PrimitiveStream prim (layerCount :: Nat) (stage :: Frequency) t
data FragmentStream (layerCount :: Nat) t

data Image (layerCount :: Nat) t where
    ...

Playing with kinds led to a little surprise when we looked into the texture subsystem. We had marker types called DIM1, DIM2, and DIM3, which were used for two purposes: to denote the dimension of primitive vectors and also the shape of equilateral textures. Both structures have distinct fourth options: 4-dimensional vectors and rectangle-shaped textures. While they are related – e.g. the texture shape implies the dimension of the vectors used to address texels – these are different concepts, and we consider it an abuse of the type system to let them share some cases. Now vector dimensions are represented as type-level naturals, and the TextureShape kind is used to classify the phantom types denoting the different options for texture shapes. It’s exactly like moving from an untyped language to a typed one.
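
The resulting separation looks roughly like the following sketch; the constructor and type names here are illustrative assumptions rather than the exact ones used in LambdaCube:

{-# LANGUAGE DataKinds, KindSignatures #-}

import GHC.TypeLits (Nat)

-- Texture shapes get their own closed kind...
data TextureShape = Tex1D | Tex2D | Tex3D | TexRect

-- ...while vector dimensions are ordinary type-level naturals, so the two
-- can no longer be mixed up in a phantom type parameter.
data VectorType (dim :: Nat) a
data TextureType (shape :: TextureShape) t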

But what did we achieve in the end? It looks like we could express the very same constraints with good old type classes. One crucial difference is that kinds defined as promoted data types are closed. Since LambdaCube tries to model a closed domain, we see this as an advantage. It also feels conceptually and notationally cleaner to express simple membership constraints with kinds than by shoving them in the context. However, the final decision about whether to use this approach has to wait until we have the DSL working.
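
For comparison, here is a sketch of the open-world encoding that the promoted kind replaces; the names are ours, not LambdaCube’s:

-- Four empty marker types plus a class collecting them.  Any module can add
-- further instances later, so the set of frequencies stays open, unlike the
-- closed Frequency kind above.
data ObjT
data VT
data GT
data FT

class IsFrequency freq
instance IsFrequency ObjT
instance IsFrequency VT
instance IsFrequency GT
instance IsFrequency FT

With this encoding, every signature that mentions the frequency parameter has to carry an IsFrequency constraint in its context, which is exactly the noise the kind-based version avoids.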

Optimising convolution filters

We recently added a new example to the repository that shows one way of implementing convolution filters. Our variance shadow mapping example uses a Gaussian blur as a component, but it’s implemented in a rather naive, straightforward way. This time we’re going to have a closer look at it and improve its performance while trying to preserve the output as much as possible. This little experiment was primarily inspired by a post from Daniel Rákos. While the optimisation is not specific to the Gaussian blur, we’ll stick with it as an example.

First of all, we’ll define a function to calculate the weights of a blur for a given filter width. As Daniel points out, taking the binomial coefficients, i.e. the appropriate row of Pascal’s triangle, and normalising them gives us the desired values. First let’s define the weights without normalisation:

binomialCoefficients :: Int -> [Float]
binomialCoefficients n = iterate next [1] !! (n-1)
  where
    next xs = [x+y | x <- xs ++ [0] | y <- 0:xs]

A convolution filter takes samples from the neighbourhood of a given texel and computes a weighted sum of them, so we need to assign offsets to the weights before we can use them. We will measure the offsets in texels. The following function assigns offsets to any list of weights with the assumption that the middle element of the list corresponds to the centre of the filter (from now on, we’ll assume the filter width is an odd integer):

withOffsets :: [Float] -> [(Float, Float)]
withOffsets cs = [(o, c) | c <- cs | o <- [-lim..lim]]
  where
    lim = fromIntegral (length cs `quot` 2)

Finally, we want to normalise the weights so that their sum is 1. We define our normalisation step to work on the weights with offsets:

normalise :: [(Float, Float)] -> [(Float, Float)]
normalise ocs = [(o, c/s) | (o, c) <- ocs]
  where
    s = sum [c | (_, c) <- ocs]

At this point, we can define a function to calculate weighted samples for a Gaussian blur:

gaussianSamples :: Int -> [(Float, Float)]
gaussianSamples = normalise . withOffsets . binomialCoefficients

For instance, gaussianSamples 5 equals [(-2.0,0.0625),(-1.0,0.25),(0.0,0.375),(1.0,0.25),(2.0,0.0625)].

To improve the performance of the filter, we should make it as small as possible. One obvious thing we can do is simply remove terms that don’t contribute significantly to the result. Since our primary use case is real-time graphics, a bit of inaccuracy most likely won’t make any perceptible difference. Our solution is to introduce a thresholding operation that throws away terms that are much smaller than the biggest term:

threshold :: Float -> [(Float, Float)] -> [(Float, Float)]
threshold t ocs = [oc | oc@(_, c) <- ocs, c*t >= m]
  where
    m = maximum [c | (_, c) <- ocs]

This simple method can be surprisingly efficient. When we calculate the weights for wide Gaussian blurs, it turns out that a large portion of the outer areas hardly contribute to the final value. For instance, threshold 1000 (withOffsets (binomialCoefficients 101)) leaves only 37 terms.
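
These counts are easy to reproduce with a small helper built from the functions above; for instance, sampleCount 1000 101 gives the 37 quoted above:

-- Number of samples left after thresholding a blur of the given width.
sampleCount :: Float -> Int -> Int
sampleCount tolerance width =
    length (threshold tolerance (withOffsets (binomialCoefficients width)))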

Another, more clever trick we can apply is specific to texture sampling. Since the GPU provides linear interpolation for free, we can get a weighted sum of two adjacent samples with a single query: f(x) = (1 - frac(x)) * f(floor(x)) + frac(x) * f(ceil(x)). All we need to do is merge adjacent samples by calculating the necessary sampling position. The following function assumes that all samples are adjacent and separated by the unit distance; both criteria are met if we provide the output of withOffsets:

offsetWeight :: [(Float, Float)] -> [(Float, Float)]
offsetWeight [] = []
offsetWeight [ow] = [ow] 
offsetWeight ((o1,w1):(o2,w2):ows) = (o1+w2/w', w') : offsetWeight ows
  where
    w' = w1+w2

The resulting list is about half as long as the input. We can also combine this optimisation with thresholding, but only with care: we need to do the sample merging first, since offsetWeight assumes consecutive samples. It’s best to wrap it all up in a single function:

gaussianSamples :: Float -> Int -> [(Float, Float)]
gaussianSamples tolerance = normalise . threshold tolerance . offsetWeight . withOffsets . binomialCoefficients

To stick with the previous example, length (gaussianSamples 1000 101) equals 19. This graph shows how these two optimisations affect the number of samples needed to reproduce a Gaussian of a given width:

Sample count for various optimisations given the width of the Gaussian

So how do we put these weights to use? First of all, we need some input to apply the filter to. We went with a simple pattern generated in a fragment shader; the geometry rendered is just a screen-sized quad:

originalImage :: Exp Obj (FrameBuffer N1 V4F)
originalImage = Accumulate accCtx PassAll frag (Rasterize triangleCtx prims) clearBuf
  where
    accCtx = AccumulationContext Nothing (ColorOp NoBlending (one' :: V4B) :. ZT)
    clearBuf = FrameBuffer (ColorImage n1 (V4 0 0 0 1) :. ZT)
    prims = Transform vert (Fetch "geometrySlot" Triangle (IV2F "position"))

    vert :: Exp V V2F -> VertexOut ()
    vert pos = VertexOut pos' (floatV 1) (ZT)
      where
        V2 x y = unpack' pos
        pos' = pack' (V4 x y (floatV 0) (floatV 1))

    frag :: Exp F () -> FragmentOut (Color V4F :+: ZZ)
    frag _ = FragmentOut (col :. ZT)
      where
        V4 x y _ _ = unpack' fragCoord'
        x' = sqrt' x @* floatF 16
        y' = sqrt' y @* floatF 16
        r = Cond ((x' @+ y') @% (floatF 50) @< (floatF 25)) (floatF 0) (floatF 1)
        g = floatF 0
        b = Cond ((x' @- y') @% (floatF 50) @< (floatF 25)) (floatF 0) (floatF 1)
        col = pack' (V4 r g b (floatF 1))

This is a complete single-pass pipeline description. The vertex shader does nothing beyond simply extending the position vector with the missing Z and W coordinates. The fragment shader uses the fragment coordinate as its only input, so we don’t need to pass any data between the two shader phases. The resulting image is the following:

The input used to test the filters

Next, we’ll define a one-dimensional convolution pass that takes a direction vector, a list of weights with offsets (which are used to scale the aforementioned vector), and an input image. The list is used to generate an unrolled loop in the fragment shader, while the vertex shader performs just an elementary coordinate transformation.

convolve :: V2F -> [(Float, Float)] -> Exp Obj (Image N1 V4F) -> Exp Obj (FrameBuffer N1 V4F)
convolve (V2 dx dy) weights img = Accumulate accCtx PassAll frag (Rasterize triangleCtx prims) clearBuf
  where
    resX = windowWidth
    resY = windowHeight
    dir' :: Exp F V2F
    dir' = Const (V2 (dx / fromIntegral resX) (dy / fromIntegral resY))

    accCtx = AccumulationContext Nothing (ColorOp NoBlending (one' :: V4B) :. ZT)
    clearBuf = FrameBuffer (ColorImage n1 (V4 0 0 0 1) :. ZT)
    prims = Transform vert (Fetch "postSlot" Triangle (IV2F "position"))

    vert :: Exp V V2F -> VertexOut V2F
    vert uv = VertexOut pos (Const 1) (NoPerspective uv' :. ZT)
      where
        uv'    = uv @* floatV 0.5 @+ floatV 0.5
        pos    = pack' (V4 u v (floatV 1) (floatV 1))
        V2 u v = unpack' uv

    frag :: Exp F V2F -> FragmentOut (Color V4F :+: ZZ)
    frag uv = FragmentOut (sample :. ZT)
      where
        sample = foldr1 (@+) [ texture' smp (uv @+ dir' @* floatF ofs) @* floatF coeff
                             | (ofs, coeff) <- weights]
        smp = Sampler LinearFilter Clamp tex
        tex = Texture (Texture2D (Float RGBA) n1) (V2 resX resY) NoMip [img]

Incorporating the convolution pass is as simple as using function composition:

weights = gaussianSamples 1000 101
dirH = V2 1 0
dirV = V2 0 1

finalImage :: Exp Obj (FrameBuffer N1 V4F)
finalImage = filterPass dirV (filterPass dirH originalImage)
  where
    filterPass dir = convolve dir weights . projectBuffer
    projectBuffer = PrjFrameBuffer "" tix0

Using our original definition for gaussianSamples, i.e. not using thresholding or merging, we get the following rendering:

The final image after applying the naive Gaussian blur

When we add both optimisations, the resulting image is indistinguishable from the previous one:

The final image after applying the optimised Gaussian blur

Despite the striking similarity, the two images are not identical. If we calculate the difference and enhance the result, we get the following pattern:

Enhanced difference between the two blurred images

With 8-bit RGB components, the maximum difference anywhere on the image is just 1, so no wonder it cannot be detected by the naked eye. This is a really nice result, because the run-time increases steeply with the number of samples. The width of the filter is also a factor, but it’s difficult to predict its effect due to the complexity of the GPU hardware. On the hardware we tested (nVidia GeForce 9650M GT) we got the following timings:

The comparative performance of the optimisations

At first glance, it looks like the time per frame is more or less directly proportional to the number of samples used in the filtering passes, after we deduct the time needed to run the no-op 1×1 filter. However, in this particular case, the variant with both optimisations turned on seems to have a smaller time-to-sample-count ratio than the naive version, about 0.2 vs. 0.25 ms, so the gains are even better than expected. On other hardware we might see different relations depending on the peculiarities of memory access costs.

At this point we ran into the limits of the current LambdaCube implementation. One more potential optimisation worth trying would be to calculate the sample coordinates in the vertex shader, making the texture reads non-dependent. Currently this is not possible due to the lack of support for array types in the shaders. It’s also evident that the EDSL approach without explicit let bindings is not viable, because compilation time is exponential with respect to the complexity of the shader. While there have been some positive developments in this area, we intend to switch to the DSL approach in the next version anyway, which should solve this problem. Finally, if you run the example, you’ll notice that you cannot resize the window. The reason for this is rather mundane: currently texture sizes are pipeline compile-time constants. In the future we’ll change their type from V2I to Exp Obj V2I, which will make it possible to modify the values during run-time.

Using Bullet physics with an FRP approach (part 3)

After a long hiatus, we shall conclude our series of posts on the FRP physics example. The previous post discussed the falling brick whose motion state is updated upon collision events, while this one will focus on the mouse draggable rag doll. In fact, the Bullet portion of this feature turned out to be less interesting than some of the observations we made while implementing the interaction in Elerea.

Defining a rag doll

The rag doll itself is straightforward business: it consists of a few capsules (one for each joint) and the constraints that hold them together. These objects are instantiated by the complexBody function, given a description. The general idea in Bullet is that constraints are configured in world space. For example, if we want to connect two bricks with a spring, we have to provide the world-space coordinates of the pivot points given the current position and orientation of the bricks, and the system calculates the necessary parameters based on this input. Afterwards, everything is handled by the physics world, and we don’t have to worry about it at all.

Originally, we wanted to extend the attribute system to be able to describe all the parameters of the constraints in a convenient way. Unfortunately, it turned out that the Bullet API is rather inconsistent in this area, and it would have required too much up front work to create a cleaner façade in front of it for the sake of the example. However, we intend to revisit this project in the future, when LambdaCube itself is in a better shape.

Reactive mouse picking logic

To make things interesting, we allow the user to pick up objects one at a time and drag them around. This is achieved by temporarily establishing a so-called point-to-point constraint while the mouse button is pressed. This constraint simply makes sure that two points always coincide in space without imposing any limits on orientation.

The logic we want to implement is the following:

  1. When the mouse button is pressed, check what the pointer is hovering over by shooting a ray into the world.
  2. If the ray hits a dynamic (finite-mass and not player controlled) object first, instantiate a constraint whose pivot point is the exact same spot where the ray hit.
  3. While the button is pressed, ensure that the other pivot point of the constraint is kept equal to a spatial position that corresponds to the current mouse position and is at the same distance from the camera as the original grabbing position.
  4. When the button is released, destroy the constraint.

The high-level process is described by the pickConstraint function:

pickConstraint :: BtDynamicsWorldClass bc => bc -> Signal Vec2 -> Signal CameraInfo
               -> Signal Bool -> Signal Vec2 -> SignalGen (Signal ())
pickConstraint dynamicsWorld windowSize cameraInfo mouseButton mousePos = do
    press <- edge mouseButton
    release <- edge (not <$> mouseButton)
    pick <- generator $ makePick <$> press <*> windowSize <*> cameraInfo <*> mousePos

    releaseInfo <- do
        rec sig <- delay Nothing $ do
                released <- release
                newPick <- pick
                currentPick <- sig
                case (released, newPick, currentPick) of
                    (True, _, _)                             -> return Nothing
                    (_, Just (constraintSignal, body), _)    -> do
                        constraint <- constraintSignal
                        return $ Just (constraint, body, constraintSignal)
                    (_, _, Just (_, body, constraintSignal)) -> do
                        constraint <- constraintSignal
                        return $ Just (constraint, body, constraintSignal)
                    _                                        -> return Nothing                            

        return sig

    effectful2 stopPicking release releaseInfo

First, we define press and release events by detecting rising and falling edges of the mouseButton signal. The derived signals yield True only at the moment when the value of mouseButton changes in the appropriate direction. Afterwards, we define the pick signal, which carries values of type Maybe (Signal BtPoint2PointConstraint, BtRigidBody). When the user presses the button while hovering over a dynamic body, pick carries a signal that corresponds to the freshly instantiated constraint plus a reference to the body in question; otherwise it yields Nothing.

The releaseInfo signal is defined recursively through a delay, which is the most basic way of defining a stateful stream transformer in Elerea. In fact, the stateful and transfer combinators provided by the library are defined in a similar manner. The reason why we can’t use them in this case is the fact that the state contains signals that we need to sample to calculate the next state. This flattening is made possible thanks to Signal being a Monad instance.
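
To show what ‘a similar manner’ means, here is a sketch of how stateful could be recovered from delay in the Simple variant of Elerea; the library’s actual definition may differ in details:

{-# LANGUAGE RecursiveDo #-}

import Control.Applicative ((<$>))
import FRP.Elerea.Simple (Signal, SignalGen, delay)

-- The next value is computed from the delayed signal itself: the same
-- recursion-through-delay pattern used by releaseInfo above.
statefulSketch :: a -> (a -> a) -> SignalGen (Signal a)
statefulSketch x0 step = do
    rec sig <- delay x0 (step <$> sig)
    return sig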

The type of the state is Maybe (BtPoint2PointConstraint, BtRigidBody, Signal BtPoint2PointConstraint). The elements of the triple are: the current sample of the constraint, the body being dragged, and the time-changing signal that represents the constraint. The transformation rules described through pattern matching are the following:

  1. if the mouse button is released, the state is cleared to Nothing;
  2. if a pick was generated, it is stored along with the current sample of the constraint signal;
  3. if there is an ongoing pick, the constraint signal is sampled;
  4. in any other case, the state is cleared.

In the end, releaseInfo will carry a triple wrapped in Just between a successful pick and a release event, and Nothing at any other moment. This signal, along with release itself, forms the input of stopPicking, which just invokes the appropriate Bullet functions to destroy the constraint at the right moment.

The missing piece of the puzzle is makePick, which is responsible for creating the constraint signal:

makePick :: Bool -> Vec2 -> CameraInfo -> Vec2 -> SignalGen (Maybe (Signal BtPoint2PointConstraint, BtRigidBody))
makePick press windowSizeCur cameraInfoCur mousePosCur = case press of
    False -> return Nothing
    True -> do
        pickInfo <- execute $ pickBody dynamicsWorld windowSizeCur cameraInfoCur mousePosCur
        case pickInfo of
            Nothing -> return Nothing
            Just (body, hitPosition, distance) -> do
                constraint <- createPick dynamicsWorld body hitPosition distance windowSize cameraInfo mousePos
                return $ Just (constraint, body)

This is a straightforward signal generator, and passing it into generator in the definition of pick ensures that it is invoked in every frame. The pickBody function is an ordinary IO operation that was already mentioned in the first post of this series. Most of the work is done in createPick when an appropriate body is found:

createPick :: (BtDynamicsWorldClass bc, BtRigidBodyClass b)
           => bc -> b -> Vec3 -> Float -> Signal Vec2 -> Signal CameraInfo
           -> Signal Vec2 -> SignalGen (Signal BtPoint2PointConstraint)
createPick dynamicsWorld body hitPosition distance windowSize cameraInfo mousePos = do
    make' (createPickConstraint dynamicsWorld body hitPosition)
        [ setting :!~ flip set [impulseClamp := 30, tau := 0.001]
        , pivotB :< pivotPosition <$> windowSize <*> cameraInfo <*> mousePos
        ]
  where
    createPickConstraint dynamicsWorld body hitPosition = do
        bodyProj <- transformToProj4 <$> btRigidBody_getCenterOfMassTransform body
        let localPivot = trim ((extendWith 1 hitPosition :: Vec4) .* fromProjective (inverse bodyProj))
        pickConstraint <- btPoint2PointConstraint1 body localPivot
        btDynamicsWorld_addConstraint dynamicsWorld pickConstraint True
        return pickConstraint

    pivotPosition windowSize cameraInfo mousePos = Just (rayFrom &+ (normalize (rayTo &- rayFrom) &* distance))
      where
        rayFrom = cameraPosition cameraInfo
        rayTo = rayTarget windowSize cameraInfo mousePos

The actual constraint is instantiated in createPickConstraint, which is just a series of Bullet API calls. We define the second pivot point as a signal attribute; the signal is a stateless function of the starting distance, the mouse position, and the view projection parameters. Such signals can be defined by lifting a pure function (in this case pivotPosition) using the applicative combinators. Since pivotPosition never yields Nothing, the pivot point is updated in every frame.

The final take-away

The most interesting outcome of this experiment, at least in our opinion, is the realisation of how FRP can make it easier to deal with mutable state in a disciplined way. In particular, it provides a nice solution when a mutable variable needs to be modified by several entities. Since all the future edits are available as a signal, it is straightforward to resolve edit conflicts with a state machine. In fact, the FRP approach practically forces us to do so.

Dealing with the interdependencies of several time-varying values can also be tricky. Again, with FRP we have no choice but to clearly define what happens in every possible combination of circumstances. One example of this in the above code is the definition of releaseInfo, where we used pattern matching to account for all the possibilities. It is an open question how well this method scales as the program grows in complexity; our future experiments should give us a better picture.

Using Bullet physics with an FRP approach (part 2)

In the previous post, we introduced a simple attribute system to give the raw Bullet API a friendlier look. As it turns out, this system is easy to extend along the temporal dimension, since an attribute of a physical object is a time-varying value – implemented as a mutable variable under the hood. The common theme in various FRP approaches is reifying the whole lifetime of such values in some form, which naturally leads to the idea of using the attributes as the bridge between the reactive library and the physics engine.

Time-varying attributes with Elerea

To make a seamless integration of Elerea and Bullet possible, we needed to define a few additional primitive signal constructors. Originally, the only way to feed data from IO sources into an Elerea network was through the external primitive, which constructed a signal and a corresponding IO function to update its output (the first argument is the initial value):

external :: a -> IO (Signal a, a -> IO ())

The obvious problem with external is the fact that it works outside the signal network, so it cannot be defined in a convenient way in terms of any entity that lives inside the reactive world. The simplest solution is to directly embed IO computations. One-off computations can be directly executed in SignalGen through the execute primitive, which is equivalent to liftIO. IO signals can be constructed with the effectful* family of functions, which are analogous to the applicative lifting combinators, but apply an IO function to the input signals instead of a pure one:

effectful :: IO a -> SignalGen (Signal a)
effectful1 :: (t -> IO a) -> Signal t -> SignalGen (Signal a)
effectful2 :: (t1 -> t2 -> IO a) -> Signal t1 -> Signal t2 -> SignalGen (Signal a)
...

The results must be in the SignalGen context in order to ensure that the constructed signals cause the IO computations to be evaluated exactly once per sampling step. The reason is simple: to meet this condition, the signal needs to be memoised, which requires an additional state variable, and state variables can only be introduced in SignalGen.

First, we extend the list of attribute operations with a fifth member, which defines the attribute to be updated by a signal. Since we might not want to take full control of the attribute, just intervene once in a while, the signal is required to be partial through Maybe. When the signal yields Nothing, the attribute is managed by Bullet during that superstep.

data AttrOp o = forall a . Attr o a := a
              | forall a . Attr o a :~ (a -> a)
              | forall a . Attr o a :!= IO a
              | forall a . Attr o a :!~ (a -> IO a)
              | forall a . Attr o a :< Signal (Maybe a)

Now we have all the building blocks necessary to define a signal-based variant of set from the last post:

set' :: o -> [AttrOp o] -> SignalGen (Signal ())
set' obj as = go as (return ())
  where
    go [] sig     = return sig
    go (a:as) sig = case a of
        Attr getter setter := x  ->
            execute (setter obj x >> return ()) >> go as sig
        Attr getter setter :~ f  ->
            execute (getter obj >>= setter obj . f >> return ()) >> go as sig
        Attr getter setter :!= x ->
            execute (x >>= setter obj >> return ()) >> go as sig
        Attr getter setter :!~ f ->
            execute (getter obj >>= f >>= setter obj >> return ()) >> go as sig
        Attr getter setter :< s  -> do     
            dummy <- flip effectful1 s $ \mx -> case mx of
                Nothing -> return ()
                Just x  -> setter obj x >> return ()
            go as (liftA2 const sig dummy)

The first four cases are unchanged; they just need to be wrapped with execute. The signal case is also straightforward: we use effectful1 to sample the signal and update the attribute with its current value. What might not be clear at first is the need for all the additional plumbing. Unfortunately, this is all necessary because a signal that is not referenced anywhere in the system gets garbage collected.

In this case, effectful1 doesn’t produce any meaningful output; all we need is its side effect. However, we still need to store the dummy signal (a stream of unit values), otherwise the setter will only be invoked until the next garbage collection round. We can think of the dummy signal as a thread that’s kept alive only as long as we have a handle to it. To ensure that the reference is not lost, we carefully wrap it in another dummy signal that keeps all the signal setters alive. It is up to the caller of set’ to store the final signal.

It is useful to define the equivalent of make as well:

make' :: IO o -> [AttrOp o] -> SignalGen (Signal o)
make' act flags = do
    obj <- execute act
    dummy <- set' obj flags
    return (liftA2 const (return obj) dummy)

To reduce the chance of user error, we return a signal that represents the object we just constructed. The signal always yields a reference to the object, and keeps all the signal attributes alive. To ensure correct behaviour, we should always sample this signal when querying the object, instead of saving the reference and passing it around. This is necessary for another reason as well: thanks to Elerea’s dependency handling, it makes sure that the signal attributes are updated before the object is queried.

A simple use case

The example scene consists of two independent parts: a falling brick whose collisions with a designated object cause reactions, and a rag doll that can be dragged with the mouse. In the rest of this post we’ll concentrate on the brick.

The LambdaCube ‘FRP physics’ example

All the objects are created at the top level of the signal network, in the SignalGen monad. In this setup, the object used for collision detection is actually a ghost sphere, not a solid body. We’ll refer to it as the query space. Whenever the brick intersects this sphere, its position and orientation are reset, but its velocity is retained. First we create the sphere, which is just an ordinary IO operation, so we can execute it:

    querySpace <- execute $ do
        ghostObject <- make btPairCachingGhostObject
                       [ collisionFlags :~ (.|. e_btCollisionObject_CollisionFlags_CF_NO_CONTACT_RESPONSE)
                       , collisionShape := sphereShape ghostRadius
                       , worldTransform := Transform idmtx 0
                       ]
        btCollisionWorld_addCollisionObject dynamicsWorld ghostObject 1 (-1)
        return ghostObject

Afterwards, we create a signal that tells us in every frame which objects are colliding with the query space:

    collisions <- effectful $ collisionInfo querySpace

The collisionInfo function returns a list of tuples, where each member contains references to both bodies involved, plus some spatial information that we ignore in this scenario. This is not a library function; its full definition is part of the example. The exact details are not relevant to the topic of this post, as they are just a direct application of the Bullet API, so we’re not going to discuss them here.

Given the collision signal, we can now define the brick:

    let initBrickTrans = Transform idmtx (Vec3 2 20 (-3))
    brick <- do
        rec brick <- make' (snd <$> localCreateRigidBodyM dynamicsWorld 1 initBrickTrans (boxShape brickSize))
                     [worldTransform :< boolToMaybe initBrickTrans . bodyInCollision brickBody <$> collisions]
            brickBody <- snapshot brick
        return brick

We use make’ to invoke the constructor of the brick and define the temporal behaviour of its worldTransform attribute in a single step. Again, the details of the construction are not particularly interesting: all we need is a mass, an initial transformation (position and orientation), and a collision shape for starters.

The real magic happens in the attribute override. Given the signal that tells us who collides with the query space, we can derive the signal that describes how the world transform of the brick needs to be updated over time. This is achieved by mapping a pure function over the signal, which yields Nothing if brickBody is not involved in any of the collisions, and Just initBrickTrans if it is.
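
The two small helpers used in the attribute override are not shown in the post; boolToMaybe plausibly looks like the definition below, while bodyInCollision simply checks whether the given body appears in the collision list.

-- A plausible definition, not taken verbatim from the example.
boolToMaybe :: a -> Bool -> Maybe a
boolToMaybe x b = if b then Just x else Nothing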

One interesting bit to note is the recursion needed for the above definition, which is made possible thanks to SignalGen being a MonadFix instance. In order to define the world transform update signal, we need a reference to the object that we’re just creating. The reference comes from taking a snapshot of the brick signal. Since the update signal doesn’t need to be sampled for the sake of constructing the object, we don’t end up in an infinite loop.

The role of dummy signals

While the general idea of using signals to define time-varying attributes works in practice, it leads to the need for ‘dummy’ signals that have to be kept around explicitly. The big problem with this solution is that it’s a potential source of programmer error. We believe that it is just one manifestation of the more general issue that Elerea provides no way to define the death of signals in a deterministic way. Currently we rely on the garbage collector to clean up all the update activity that’s not needed any more, and it’s up to the programmer to define the signals in a way that they stop their activity at the right time.

While FRP research has solved the problem of start times in several ways, it’s not nearly as clear how to describe the endpoint of a signal’s life. Most likely all we need is a few additional primitives that capture the essence of end times the same way the SignalGen monad captures the essence of start times. Recently there have been some interesting developments in this area; we’re hoping that e.g. the work of Wolfgang Jeltsch or Heinrich Apfelmus will help us come up with a practical solution.

Using Bullet physics with an FRP approach (part 1)

Earlier this year we were experimenting with creating an FRP-style API on top of the Bullet physics engine. As a result, we developed a simple example scene using Elerea, which is now available in the project repository. This post is the first in a series to discuss the example in detail. We start the tour by introducing the libraries used.

The raw Bullet API

Bullet is a C++ library, which makes it tricky to drive from Haskell. Csaba solved the problem by generating a plain C wrapper around it and creating a Haskell binding for this wrapper. This gives us a nice, steady base to build on. Programming against this interface feels very much like using Gtk2Hs. As an example, let’s look at a slightly simplified variant of a function from the example, which is used for mouse picking. In this function, we cast a ray into the world and return the closest dynamic rigid body it hits:

rayTarget :: Vec2 -> CameraInfo -> Vec2 -> Vec3

pickBody :: BtCollisionWorldClass bc => bc -> Vec2 -> CameraInfo -> Vec2 -> IO (Maybe (BtRigidBody, Vec3, Float))
pickBody dynamicsWorld windowSize cameraInfo mousePosition = do
    let rayFrom = cameraPosition cameraInfo
        rayTo = rayTarget windowSize cameraInfo mousePosition
    rayResult <- btCollisionWorld_ClosestRayResultCallback rayFrom rayTo
    btCollisionWorld_rayTest dynamicsWorld rayFrom rayTo rayResult
    hasHit <- btCollisionWorld_RayResultCallback_hasHit rayResult

    case hasHit of
        False -> return Nothing
        True -> do
            collisionObj <- btCollisionWorld_RayResultCallback_m_collisionObject_get rayResult
            isNotPickable <- btCollisionObject_isStaticOrKinematicObject collisionObj
            internalType <- btCollisionObject_getInternalType collisionObj
            case isNotPickable || internalType /= e_btCollisionObject_CollisionObjectTypes_CO_RIGID_BODY of
                True -> return Nothing
                False -> do
                    btCollisionObject_setActivationState collisionObj 4 -- DISABLE_DEACTIVATION
                    hitPosition <- btCollisionWorld_ClosestRayResultCallback_m_hitPointWorld_get rayResult
                    body <- btRigidBody_upcast collisionObj -- this would be null if the internal type is not CO_RIGID_BODY
                    return $ Just (body, hitPosition, len (hitPosition &- rayFrom))

We can think of the camera info as the transformation matrix that maps the world onto the screen. The rayTarget function returns the endpoint of the ray corresponding to the mouse position on the far plane of the view frustum, given all the relevant information. First we create a data structure (the ‘ray result callback’) to hold the result of the raycast, then perform the actual ray test. The value of hasHit is true if the segment between rayFrom and rayTo intersects any object in the physics world.

The C++ snippet corresponding to the first five lines of the do block might look something like this:

  btVector3 rayFrom = cameraPosition(cameraInfo);
  btVector3 rayTo = rayTarget(windowSize, cameraInfo, mousePosition);
  btCollisionWorld::ClosestRayResultCallback rayCallback(rayFrom, rayTo);
  dynamicsWorld->rayTest(rayFrom, rayTo, rayCallback);
  bool hasHit = rayCallback.hasHit();

If hasHit is true, we can get a reference to the object from rayResult, and check if it is of the right type. If everything matches, we return the body, the world position of the point where the ray hit it first, and the distance to that point from the camera. One of the nice things about this binding is that it uses the vector types from the vect library out of the box instead of exposing Bullet-specific vectors, so all the spatial calculations are really easy to write without having to jump through extra hoops.

Elerea basics

Elerea is an FRP library that’s primarily aimed at game programming. Its basic abstraction is the discrete generic stream – referred to as Signal – and it can be used to describe fully dynamic data-flow networks. In essence, it provides a nice compositional way to define the state transformation performed during the simulation step. It also allows IO computations to be interspersed in this description, thereby providing a lightweight (threadless) framework for cooperative multitasking.

There are two kinds of structures in Elerea. A value of type Signal a can be thought of as a time-varying value of type a. All the future values of the signal are fully determined by its definition, i.e. signals are context independent just as we expect from ordinary values in a pure functional language. The other structure is the SignalGen monad, which is a context where stateful signals are constructed. Mutually dependent signals can be defined thanks to the fact that SignalGen is an instance of MonadFix.

The basic idea behind SignalGen can be understood in terms of start times. Every context corresponds to a (discrete) moment on the global timeline, and every stateful signal constructed in that context is considered to start at that moment. However, signals themselves are defined in terms of the global time, which allows us to combine signals that originate from different contexts (e.g. create a numeric signal that’s the point-wise sum of two unrelated numeric signals). The API ensures that no signal can physically exist before its start time; theoretically, signals are undefined until that point.

When executing the resulting data-flow network, Elerea guarantees consistency by double buffering. The superstep that advances the network consists of two phases: read and commit. In the read phase every node queries its dependencies and no node changes its output. In the commit phase every node performs its state transition independently, based on the input from the read phase, so an inconsistent state is never observed anywhere in the system.
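
As a minimal illustration of these concepts, here is a tiny self-contained network using the Simple variant of the library (not code from the example): one stateful signal, one derived signal, and a sampler whose every invocation performs one superstep.

import Control.Applicative ((<$>), (<*>))
import Control.Monad (replicateM_)
import FRP.Elerea.Simple

main :: IO ()
main = do
    sample <- start $ do
        counter <- stateful (0 :: Int) (+ 1)   -- a stateful signal, built in SignalGen
        let doubled = (* 2) <$> counter        -- a pure, point-wise derived signal
        return ((,) <$> counter <*> doubled)
    replicateM_ 5 (sample >>= print)           -- five supersteps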

Attribute system

While the C-style API gives us access to all the functionality, it’s not very comfortable to use. The biggest issue is its verbosity: all the names are fully qualified out of necessity, and each property of an object has to be set separately. Therefore, we took some inspiration from the glib attribute system used by Gtk2Hs and implemented something similar in the example. An attribute is the pair of a getter and the corresponding setter for a given property:

data Attr o a = forall x . Attr !(o -> IO a) !(o -> a -> IO x)

We allow arbitrary return types for setters to make it easier to define attributes, since many Bullet setters return something other than unit. However, we discard these values for the time being, so it’s really just to avoid having to type ‘() <$’ so many times.

Attributes are brought to life through attribute operations, which specify how to calculate the value of the property. There are four possibilities: set a given value, transform the current value with a pure function, set a given value coming from an IO computation, and transform the current one with an IO computation. These are denoted as follows:

infixr 0 :=, :~, :!=, :!~

data AttrOp o = forall a . Attr o a := a
              | forall a . Attr o a :~ (a -> a)
              | forall a . Attr o a :!= IO a
              | forall a . Attr o a :!~ (a -> IO a)

We need existentials to hide the type of the property and only expose the type of the object, so we can easily create collections of attributes. Now we can define the functions that connect them to the actual objects:

set :: o -> [AttrOp o] -> IO o
set obj attrs = (>> return obj) $ forM_ attrs $ \op -> case op of
    Attr _      setter := x  -> setter obj x >> return ()
    Attr getter setter :~ f  -> getter obj >>= setter obj . f >> return ()
    Attr _      setter :!= x -> x >>= setter obj >> return ()
    Attr getter setter :!~ f -> getter obj >>= f >>= setter obj >> return ()

get :: o -> Attr o a -> IO a
get obj (Attr getter _) = getter obj 

make :: IO o -> [AttrOp o] -> IO o
make act flags = do
    obj <- act
    set obj flags
    return obj

This is a fully generic system that has nothing to do with Bullet at this point. The set function takes an object and updates all the attributes listed in its second argument. The get function is just a thin helper to retrieve the value of a property given the corresponding attribute. Finally, make is another thin helper that allows us to construct an object and set its attributes in a single step.
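As a quick sanity check that the machinery really is generic (hypothetical code, not part of the Bullet bindings; it assumes the Attr, AttrOp, set, get and make definitions above are in scope), the same attributes work for any object whose properties can be read and written in IO, for instance a record of IORefs:

import Control.Applicative ((<$>), (<*>))
import Data.IORef

data Particle = Particle { posRef :: IORef Float, massRef :: IORef Float }

position, mass :: Attr Particle Float
position = Attr (readIORef . posRef) (\p x -> writeIORef (posRef p) x)
mass     = Attr (readIORef . massRef) (\p x -> writeIORef (massRef p) x)

newParticle :: IO Particle
newParticle = Particle <$> newIORef 0 <*> newIORef 1

test :: IO ()
test = do
    p <- make newParticle [position := 10, mass :~ (* 2)]
    get p position >>= print   -- 10.0
    get p mass >>= print       -- 2.0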

A simple example is the world transform property of collision objects. It can be read and written by the following two functions:

btCollisionObject_getWorldTransform
  :: BtCollisionObjectClass bc => bc -> IO Transform
btCollisionObject_setWorldTransform
  :: BtCollisionObjectClass bc => bc -> Transform -> IO Transform

Turning it into an attribute is as simple as constructing a pair out of the above functions:

worldTransform :: BtCollisionObjectClass o => Attr o Transform
worldTransform = Attr btCollisionObject_getWorldTransform
                      btCollisionObject_setWorldTransform

Given this definition, we can write set body [worldTransform := ...] to update the transform, and get body worldTransform to retrieve it. In the next post we’ll see how to extend the above system to define attributes tied to Elerea signals, which lets us specify all of their future values at the time of their creation, and how to use this capability to define a rigid body whose position is reset every time it collides with a particular other object.

Variance Shadow Mapping

One of the test examples in the repository is an implementation of Variance Shadow Mapping, which lets us demonstrate multi-pass rendering through a relatively simple effect. In this post we’ll first have a look at the runnable EDSL implementation, discussing some of our syntactic woes on the way.

Basic shadow mapping

The Wikipedia article on shadow mapping gives a good overview of the general idea: first we render the scene from the point of view of the light source (using perspective projection for point or spot lights, and orthographic projection for directional lights), recording the distances of various surfaces from the light, then use this information as a look-up table in a second pass when we render the scene in the space of the main camera.

All we need to do to make shadow mapping work is to think carefully about the transformation matrices involved. The transformation from object local coordinates to screen coordinates is normally decomposed into three phases:

  1. model: object-local to world space (global coordinates);
  2. view: world to camera space (camera-local coordinates);
  3. projection: camera to screen space (coordinates after perspective projection and normalisation).

The model matrix is potentially different for every object, while view and projection are global to the rendering pass, so they can be composed before rendering. Consequently, we are going to use the following transformation matrices:

  • modelMatrix: per-object model matrix;
  • lightMatrix: view-projection matrix for the light;
  • cameraMatrix: view-projection matrix for the camera.

The first step is to record the distances of the closest surfaces from the light source. Whether the light is a directional or a point light is irrelevant; all we need is the light matrix to transform the vertices as desired. This simple depth pass is handled by the following definition:

depth :: Exp Obj (FrameBuffer N1 (Float, Float))
depth = Accumulate accCtx PassAll frag (Rasterize triangleCtx prims) clearBuf
  where
    accCtx = AccumulationContext Nothing (DepthOp Less True :. ColorOp NoBlending True :. ZT)
    clearBuf = FrameBuffer (DepthImage n1 1000 :. ColorImage n1 0 :. ZT)
    prims = Transform vert (Fetch "geometrySlot" Triangle (IV3F "position"))

    lightMatrix = Uni (IM44F "lightMatrix")
    modelMatrix = Uni (IM44F "modelMatrix")

    vert :: Exp V V3F -> VertexOut Float
    vert pos = VertexOut lightPos (floatV 1) (Smooth depth :. ZT)
      where
        lightPos = lightMatrix @*. modelMatrix @*. v3v4 pos
        V4 _ _ depth _ = unpack' lightPos

    frag :: Exp F Float -> FragmentOut (Depth Float :+: Color Float :+: ZZ)
    frag depth = FragmentOutRastDepth (depth :. ZT)

Due to the lack of dedicated syntax, the notation is somewhat heavier than the DSL equivalent would be. The Exp type constructor corresponds to the @ operator; its first argument is the frequency, while the second is the payload type. For technical reasons, we also need our own non-flat tuple representation, which is constructed with the :. operator (:+: in the type system). Here, depth is a one-layer framebuffer that stores two floating-point numbers per pixel, and in this particular case both happen to be the same value: the distance from the light. The first is the depth generated by the rasteriser, and the second is the output of the fragment shader. The latter value will change later when we calculate variance.

The top level of the definition starts with the last stage of the pipeline, the accumulation step. This is where the fragment shader is applied to the output of the rasteriser, i.e. the fragment stream. The accumulation context describes what happens to each fragment produced by the shader: they are only kept if their depth is less than that in the framebuffer, the new depth is written to the buffer, and the new colour simply overwrites the old one without blending. There is no additional fragment filter, which is expressed with the PassAll constant (equivalent to providing a constantly true function). Before accumulation, the framebuffer is cleared by setting the raster depth to 1000 and the ‘colour’ to 0 in each pixel.

As for the fragment stream, it is obtained by rasterising the primitive stream, which consists of triangles. There is only one vertex attribute, the position. Everything else is irrelevant for this pass, since we don’t need to calculate anything besides the actual shapes. Geometry can be pushed into the pipeline through named slots, and this is the point where we can define the name.

The vertex shader transforms the position into light-space coordinates by going through the world coordinate system first. Our infix operators are all prefixed with @ to avoid name collisions with the operators defined in the standard prelude; we intend to drop this prefix in the DSL. The v3v4 function simply extends a 3D vector with the homogeneous coordinate w=1. For the payload, the shader simply emits the z coordinate of the light-space position. We don’t have a convenient interface for swizzling at the moment: vector components can be extracted by pattern matching on the result of the unpack' function (all the primitive functions have an apostrophe in their name to avoid collision with the prelude). Also, the definition of uniforms makes it quite apparent that we are basically building a raw AST.

In order to make use of the depth information, we need to convert it into a sampler, which will be referenced in the second, final pass:

shadowMapSize :: Num a => a
shadowMapSize = 512

sm :: Exp Obj (FrameBuffer N1 (Float, V4F))
sm = Accumulate accCtx PassAll frag (Rasterize triangleCtx prims) clearBuf
  where
    accCtx = AccumulationContext Nothing (DepthOp Less True :. ColorOp NoBlending (one' :: V4B) :. ZT)
    clearBuf = FrameBuffer (DepthImage n1 1000 :. ColorImage n1 (V4 0.1 0.2 0.6 1) :. ZT)
    prims = Transform vert (Fetch "geometrySlot" Triangle (IV3F "position", IV3F "normal"))

    cameraMatrix = Uni (IM44F "cameraMatrix")
    lightMatrix = Uni (IM44F "lightMatrix")
    modelMatrix = Uni (IM44F "modelMatrix")
    lightPosition = Uni (IV3F "lightPosition")

    vert :: Exp V (V3F, V3F) -> VertexOut (V3F, V4F, V3F)
    vert attr = VertexOut viewPos (floatV 1) (Smooth (v4v3 worldPos) :. Smooth lightPos :. Smooth worldNormal :. ZT)
      where
        worldPos = modelMatrix @*. v3v4 localPos
        viewPos = cameraMatrix @*. worldPos
        lightPos = lightMatrix @*. worldPos
        worldNormal = normalize' (v4v3 (modelMatrix @*. n3v4 localNormal))
        (localPos, localNormal) = untup2 attr

    frag :: Exp F (V3F, V4F, V3F) -> FragmentOut (Depth Float :+: Color V4F :+: ZZ)
    frag attr = FragmentOutRastDepth (luminance :. ZT)
      where
        V4 lightU lightV lightDepth lightW = unpack' lightPos
        uv = clampUV (scaleUV (pack' (V2 lightU lightV) @/ lightW))

        surfaceDistance = texture' sampler uv
        lightPortion = Cond (lightDepth @<= surfaceDistance @+ floatF 0.01) (floatF 1) (floatF 0)

        lambert = max' (floatF 0) (dot' worldNormal (normalize' (lightPosition @- worldPos)))

        intensity = lambert @* lightPortion
        luminance = pack' (V4 intensity intensity intensity (floatF 1))

        clampUV x = clamp' x (floatF 0) (floatF 1)
        scaleUV x = x @* floatF 0.5 @+ floatF 0.5

        (worldPos, lightPos, worldNormal) = untup3 attr

    sampler = Sampler PointFilter Clamp shadowMap

    shadowMap :: Texture (Exp Obj) DIM2 SingleTex (Regular Float) Red
    shadowMap = Texture (Texture2D (Float Red) n1) (V2 shadowMapSize shadowMapSize) NoMip [PrjFrameBuffer "shadowMap" tix0 depth]

The definition of the sampler is right at the bottom. First we convert the framebuffer yielded by depth into an image using PrjFrameBuffer, which projects a given member of a tuple during the conversion. The predefined value tix0 is the tuple index of the first element. In this case, our ‘tuple’ is degenerate, since it consists of a single element anyway. The resulting image is converted into a two-dimensional texture (shadowMap) with just a floating-point red channel. Finally, the texture is wrapped in a sampler structure (sampler), which specifies that it’s a non-repeating image that must not be smoothed during sampling, since that wouldn’t be meaningful for depth values at the edges of objects.

The pipeline setup is very similar to that of the depth pass, and most of the difference is in the shaders. This is a more complex case, where we have tuples both in the vertex and the fragment stream. Again, for technical reasons we need to unpack these representations with a dedicated function (untup*) and pattern matching, just like vectors. We could also use view patterns to make this extra step a bit less painful, but in the end all this won’t be necessary in the DSL.
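With the ViewPatterns extension enabled, for instance, the unpacking could move into the argument position, so the explicit untup3 binding in the where block disappears (just a sketch of the syntax, not how the example is actually written):

    frag :: Exp F (V3F, V4F, V3F) -> FragmentOut (Depth Float :+: Color V4F :+: ZZ)
    frag (untup3 -> (worldPos, lightPos, worldNormal)) = FragmentOutRastDepth (luminance :. ZT)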

As for shadows, the work is divided up between the two shader phases. The vertex shader calculates the vectors needed: view position (used by the rasteriser), world space position, light space position and world space surface normal. We cheated a bit with the normal calculation, since we don’t use the inverse transpose of the matrix. This is fine as long as our transformations don’t involve any non-uniform scaling, or we’re only scaling axis-aligned cuboids. The n3v4 function extends a 3D vector with w=0, so it is treated as a direction.
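For reference, helpers like v3v4, n3v4 and v4v3 are ordinary functions over the AST; in terms of the pack'/unpack'/Const primitives used above they could be defined roughly as follows (a sketch, the exact definitions in the example may differ slightly):

v3v4, n3v4 :: Exp s V3F -> Exp s V4F
v3v4 v = let V3 x y z = unpack' v in pack' (V4 x y z (Const 1))  -- point: w = 1
n3v4 v = let V3 x y z = unpack' v in pack' (V4 x y z (Const 0))  -- direction: w = 0

v4v3 :: Exp s V4F -> Exp s V3F
v4v3 v = let V4 x y z _ = unpack' v in pack' (V3 x y z)          -- drop the w component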

The fragment shader calculates the final colour of each pixel. The light space position is used to address the shadow map (sampler) as well as to quickly determine the distance to the light source without calculating an extra square root. The value of lightPortion is 0 if there is an occluder between the current point and the light source, and 1 otherwise. To avoid self-shadowing, a little offset is applied to the stored depth of the closest surface (this could have been done in the first pass as well). Afterwards, we calculate the light contribution using the Lambert model, i.e. taking the cosine of the angle between the surface normal and the vector that points towards the light. Multiplying this value by the light portion gives us an image like this:

Basic shadow mapping

We can clearly see that the chosen depth offset is not sufficient, since the background plane still suffers from self-shadowing.

In the actual example, we use a spotlight instead of a point light that radiates in every direction. This effect is simply achieved by calculating a colour that depends on the shadow UV coordinates. We can change the definition of intensity and luminance to get a better idea of the light’s direction:

        uv' = uv @- floatF 0.5
        spotShape = floatF 1 @- length' uv' @* floatF 4
        intensity = max' (floatF 0) (spotShape @* lambert)

        V2 spotR spotG = unpack' (scaleUV (round' (uv' @* floatF 10)) @* intensity)

        luminance = pack' (V4 spotR spotG intensity (floatF 1)) @* lightPortion

The resulting image is maybe a bit more interesting than the previous one:

Basic shadow mapping with a colourful spotlight

Variance shadow mapping

There are several issues with basic shadow mapping. Aliased shadow edges are only the tip of the iceberg; nowadays they can be easily fixed by using shadow samplers, which can provide bilinear filtering on the light proportion instead of just a binary comparison. However, if we want softer shadows, life gets a lot more complicated. The obvious solution is to take several samples of the shadow map and average the results, but this entails a potentially severe performance hit. We can also get away with a single sample if we apply a jitter to the UV coordinates, but this results in a noisy pattern instead of a smooth transition.

Variance shadow mapping makes it possible to get a proper smoothing effect with a single sample from the shadow map. The basic idea is to store a probability distribution instead of an exact function, and estimate the probability of a pixel being occluded instead of performing an exact test. The VSM algorithm uses Chebyshev’s inequality for the estimation. Since our shadow map is now a probability distribution, we can directly blur it with a Gaussian filter and get meaningful results. Another nice side effect is that the new formula also addresses the problem of self-shadowing, and provides a robust solution in place of the rather brittle and scene dependent offset hack.
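For reference, the one-tailed Chebyshev inequality (Cantelli’s inequality) says that for a distribution with mean μ and variance σ², and any threshold t > μ,

P(x ≥ t) ≤ σ² / (σ² + (t − μ)²)

This upper bound is exactly the quantity computed as lightProbMax in the second pass below: μ and σ² are reconstructed from the two stored moments, and t is the light-space depth of the current fragment.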

In the first pass, we store the first two moments of the depth distribution:

moments :: Exp Obj (FrameBuffer N1 (Float, V2F))
moments = Accumulate accCtx PassAll frag (Rasterize triangleCtx prims) clearBuf
  where
    accCtx = AccumulationContext Nothing (DepthOp Less True :. ColorOp NoBlending (one' :: V2B) :. ZT)
    clearBuf = FrameBuffer (DepthImage n1 1000 :. ColorImage n1 (V2 0 0) :. ZT)
    prims = Transform vert (Fetch "geometrySlot" Triangle (IV3F "position"))

    lightMatrix = Uni (IM44F "lightMatrix")
    modelMatrix = Uni (IM44F "modelMatrix")

    vert :: Exp V V3F -> VertexOut Float
    vert pos = VertexOut lightPos (floatV 1) (Smooth depth :. ZT)
      where
        lightPos = lightMatrix @*. modelMatrix @*. v3v4 pos
        V4 _ _ depth _ = unpack' lightPos

    frag :: Exp F Float -> FragmentOut (Depth Float :+: Color V2F :+: ZZ)
    frag depth = FragmentOutRastDepth (pack' (V2 moment1 moment2) :. ZT)
      where
        dx = dFdx' depth
        dy = dFdy' depth
        moment1 = depth
        moment2 = depth @* depth @+ floatF 0.25 @* (dx @* dx @+ dy @* dy)

The difference between moments and depth is the type (a two-dimensional float vector instead of a single float per pixel) and the fragment shader, which calculates two moments of the distribution. The second pass (vsm) is also similar to the basic shadow mapping case; the only thing that changes is the formula for the light portion, which now becomes the maximum probability that the surface is lit:

        V2 moment1 moment2 = unpack' (texture' sampler uv)
        variance = max' (floatF 0.002) (moment2 @- moment1 @* moment1)
        distance = max' (floatF 0) (lightDepth @- moment1)
        lightProbMax = variance @/ (variance @+ distance @* distance)

The other thing that changes slightly is the definition of the sampler:

    sampler = Sampler LinearFilter Clamp shadowMap

    shadowMap :: Texture (Exp Obj) DIM2 SingleTex (Regular Float) RG
    shadowMap = Texture (Texture2D (Float RG) n1) (V2 shadowMapSize shadowMapSize) NoMip [PrjFrameBuffer "shadowMap" tix0 moments]

Unlike the previous one, this texture has a green component as well to store the second moment, and the sampler is set up to perform linear filtering. Using the value of lightProbMax directly as the light portion is a good first approximation, but it leads to light bleeding, a well-known problem with VSM:

Variance shadow mapping with obvious light bleeding

Before addressing the bleeding issue, we should first take advantage of the fact that the shadow map can be filtered. We are going to insert an extra pair of passes between moments and vsm that blurs the shadow map. It is a pair of passes because we exploit the separability of the Gaussian filter, so first we blur the image vertically, then horizontally, thereby doing O(n) work per pixel instead of O(n²), where n is the width of the filter. The blur is described by the following function:

blur :: [(Float, Float)] -> Exp Obj (Image N1 V2F) -> Exp Obj (FrameBuffer N1 V2F)
blur coefficients img = filter1D dirH (PrjFrameBuffer "" tix0 (filter1D dirV img))
  where
    dirH v = Const (V2 (v / shadowMapSize) 0) :: Exp F V2F
    dirV v = Const (V2 0 (v / shadowMapSize)) :: Exp F V2F

    filter1D :: (Float -> Exp F V2F) -> Exp Obj (Image N1 V2F) -> Exp Obj (FrameBuffer N1 V2F)
    filter1D dir img = Accumulate accCtx PassAll frag (Rasterize triangleCtx prims) clearBuf
      where
        accCtx = AccumulationContext Nothing (ColorOp NoBlending (one' :: V2B) :. ZT)
        clearBuf = FrameBuffer (ColorImage n1 (V2 0 0) :. ZT)
        prims = Transform vert (Fetch "postSlot" Triangle (IV2F "position"))

        vert :: Exp V V2F -> VertexOut V2F
        vert uv = VertexOut pos (Const 1) (NoPerspective uv' :. ZT)
          where
            uv'    = uv @* floatV 0.5 @+ floatV 0.5
            pos    = pack' (V4 u v (floatV 1) (floatV 1))
            V2 u v = unpack' uv

        frag :: Exp F V2F -> FragmentOut (Color V2F :+: ZZ)
        frag uv = FragmentOut (sample :. ZT)
          where
            sample = foldr1 (@+) [texture' smp (uv @+ dir ofs) @* floatF coeff | (ofs, coeff) <- coefficients]
            smp = Sampler LinearFilter Clamp tex
            tex = Texture (Texture2D (Float RG) n1) (V2 shadowMapSize shadowMapSize) NoMip [img]

The blur function takes a set of coefficient-offset pairs and an input image, and yields a framebuffer containing the filtered version of the image. This shows the power of our approach: since this is an ordinary pure function, it can be applied to any image in any pipeline. In fact, it could be part of a standard library of utility functions after factoring out the resolution as a separate argument instead of using our global constant shadowMapSize. The only notable novelty here is the use of Haskell for meta-programming: the amount of shader code generated is proportional to the length of the coefficient list, since the summation is expanded statically. We need to find a convenient substitute for this facility when migrating to the DSL.
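Since the coefficients are plain Haskell values as well, they don’t have to be written out by hand; a small helper along these lines (hypothetical, not part of the example) could compute a normalised Gaussian kernel for any radius and standard deviation:

gaussianCoefficients :: Float -> Int -> [(Float, Float)]
gaussianCoefficients sigma radius = [(fromIntegral o, weight o / total) | o <- offsets]
  where
    offsets  = [-radius .. radius]
    weight o = exp (negate (fromIntegral (o * o)) / (2 * sigma * sigma))
    total    = sum (map weight offsets)

For example, gaussianCoefficients 2.5 4 yields a 9-tap kernel roughly comparable to the hand-written blurCoefficients used below.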

To insert the blur, we need to change the definition of the sampler in vsm:

    sampler = Sampler LinearFilter Clamp shadowMapBlur

    shadowMapBlur :: Texture (Exp Obj) DIM2 SingleTex (Regular Float) RG
    shadowMapBlur = Texture (Texture2D (Float RG) n1) (V2 shadowMapSize shadowMapSize) NoMip [PrjFrameBuffer "shadowMap" tix0 blurredMoments]
      where
        blurredMoments = blur blurCoefficients (PrjFrameBuffer "blur" tix0 moments)
        blurCoefficients = [(-4.0, 0.05), (-3.0, 0.09), (-2.0, 0.12), (-1.0, 0.15), (0.0, 0.16), (1.0, 0.15), (2.0, 0.12), (3.0, 0.09), (4.0, 0.05)]

After this change, our shadows change drastically:

Variance shadow mapping with Gaussian blur

As it turns out, blurring also helps against light bleeding to a certain extent, since light bleeding appears in areas where the variance is big. However, it is still quite obvious. There are several ways to address the problem, and we chose the simplest for the sake of the example: raising the light portion to a power. The higher the exponent, the less the light bleeding, but unfortunately increasing the exponent also causes overdarkening. In a way, this is a hack in a similar vein to the depth offset used for simple shadow mapping. In the example, we use the square of lightProbMax as the light portion, which gives us nice soft shadows.
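In terms of code, the only change compared to the vsm fragment shown earlier is the final light portion (a one-line sketch; the surrounding definitions stay the same):

        lightPortion = lightProbMax @* lightProbMax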

We recommend using this example as a starting point for your own experiments, as it has no external dependencies.

A few words about the LambdaCube stack and goals

In the earlier posts, we mentioned several times that we imagine LambdaCube as a standalone DSL. But what would this mean in practice?

One of our long-term goals is to build a productive content authoring environment that’s based on a purely functional rendering pipeline. Who knows, it might even be feasible to package LambdaCube as a Unity plugin! Another possible application could be generating an optimised rendering component to be integrated in another game engine. In any case, interoperability is a major concern for us. Driven by this goal, we came up with the following plan for the LambdaCube stack:

The heart of this scheme is obviously the intermediate language. The front-end compiler is going to be an independent tool (available both as a plain executable and as a library with at least a C interface) that should be easy to integrate into any build system. This tool is responsible for most of the static analysis, some trivial as well as (hopefully) clever optimisations, and also for planning the allocation of resources. As a result, it turns the declarative pipeline description into a series of instructions for a virtual machine that handles GPU specific data structures as primitives, e.g. setting up the rendering context, or binding vertex buffers, textures, and framebuffers in the right order. Given this intermediate description, we can take it further in many directions, as illustrated by the figure above.

Of course, LambdaCube is not likely to be useful when you are working on the bleeding edge and need accurate control over the resources. Also, when we are on said bleeding edge, it is most likely premature to think about building high-level abstractions. Therefore, we’d like to make it clear that we don’t see LambdaCube as a one-size-fits-all solution. If we were to make a mission statement, it would go something like this: we want to cover 95% of the use cases, making even moderately complex projects easier, and not get in the way of the remaining 5%.
