### In a Nutshell

LambdaCube 3D is a domain specific language and library that makes it possible to program GPUs in a purely functional style.

Purely Functional Rendering Engine

April 28, 2013

Posted by on We are in the process of creating the next iteration of LambdaCube, where we finally depart from the Haskell EDSL approach and turn the language into a proper DSL. The reasons behind this move were outlined in an earlier post. However, we still use the EDSL as a testing ground while designing the type system, since GHC comes with a rich set of features available for immediate use. The topic of today’s instalment is our recent experiment to make illegal types unrepresentable through a custom kind system. This is made possible by the fact that GHC recently introduced support for promoting datatypes to kinds.

Even though the relevant **DataKinds** extension is over a year old (although it’s been officially supported only since GHC 7.6), we couldn’t find any example to use it for modelling a real-life domain. Our first limited impression is that this is a direction worth pursuing.

It might sound surprising that this idea was already brought up in the context of computer graphics in the Spark project. Tim Foley’s dissertation briefly discusses the type theory behind Spark (see Chapter 5 for details). The basic idea is that we can introduce a separate kind called **Frequency** (Spark refers to this concept as **RecordType**), and constrain the **Exp** type constructor (**@** in Spark) to take a type of this kind as its first argument.

To define the new kind, all we need to do is write a plain data declaration, as opposed to the four empty data declarations we used to have:

data Frequency = Obj | V | G | F

As we enable the DataKinds extension, this definition automatically creates a kind and four types of this kind. Now we can change the definition of **Exp** to take advantage of it:

data Exp :: Frequency -> * -> * where ...

For the time being, we diverge from the Spark model. The difference is that the resulting type has kind ***** in our case, while in the context of Spark it would be labelled as **RateQualifiedType**. Unfortunately, Haskell doesn’t allow using non-* return kinds in data declarations, so we can’t test this idea with the current version. Since having a separate **Exp** universe is potentially useful, we might adopt the notion for the DSL proper.

We don’t have to stop here. There are a few more areas where we can sensibly constrain the possible types. For instance, primitive and fragment streams as well as framebuffers in LambdaCube have a type parameter that identifies the layer count, i.e. the number of framebuffer layers we can emit primitives to using geometry shaders. Instead of rolling a custom solution, now we can use type level naturals with support for numeric literals. Among others, the definition of stream types and images reflects the new structure:

data VertexStream prim t data PrimitiveStream prim (layerCount :: Nat) (stage :: Frequency) t data FragmentStream (layerCount :: Nat) t data Image (layerCount :: Nat) t where ...

Playing with kinds led to a little surprise when we looked into the texture subsystem. We had marker types called **DIM1**, **DIM2**, and **DIM3**, which were used for two purposes: to denote the dimension of primitive vectors and also the shape of equilateral textures. Both structures have distinct fourth options: 4-dimension vectors and rectangle-shaped textures. While they are related – e.g. the texture shape implies the dimension of vectors used to address texels –, these are different concepts, and we consider it an abuse of the type system to let them share some cases. Now vector dimensions are represented as type-level naturals, and the **TextureShape** kind is used to classify the phantom types denoting the different options for texture shapes. It’s exactly like moving from an untyped language to a typed one.

But what did we achieve in the end? It looks like we could express the very same constraints with good old type classes. One crucial difference is that kinds defined as promoted data types are closed. Since LambdaCube tries to model a closed domain, we see this as an advantage. It also feels conceptually and notationally cleaner to express simple membership constraints with kinds than by shoving them in the context. However, the final decision about whether to use this approach has to wait until we have the DSL working.

April 11, 2013

Posted by on We recently added a new example to the repository that shows one way of implementing convolution filters. Our variance shadow mapping example uses a Gaussian blur as a component, but it’s implemented in a rather naive, straightforward way. This time we’re going to have a closer look at it and improve its performance while trying to preserve the output as much as possible. This little experiment was primarily inspired by a post from Daniel Rákos. While the optimisation is not specific to the Gaussian blur, we’ll stick with it as an example.

First of all, we’ll define a function to calculate the weights of a blur for a given filter width. As Daniel points out, taking the binomial coefficients, i.e. the appropriate row of Pascal’s triangle, and normalising it gives us the desired values. First let’s define the weights without normalisation:

binomialCoefficients :: Int -> [Float] binomialCoefficients n = iterate next [1] !! (n-1) where next xs = [x+y | x <- xs ++ [0] | y <- 0:xs]

A convolution filter takes samples from the neighbourhood of a given texel and computes a weighted sum from them, therefore we need to assign offsets to the weights before we can use them. We will measure the offsets in texels. The following function assigns offsets to any list of weights with the assumption that the middle element of the list corresponds to the centre of the filter (from now on, we’ll assume the filter width is an odd integer):

withOffsets :: [Float] -> [(Float, Float)] withOffsets cs = [(o, c) | c <- cs | o <- [-lim..lim]] where lim = fromIntegral (length cs `quot` 2)

Finally, we want to normalise the weights, so their sum is 1. We define our normalisation step to work on the weights with offsets:

normalise :: [(Float, Float)] -> [(Float, Float)] normalise ocs = [(o, c/s) | (o, c) <- ocs] where s = sum [c | (_, c) <- ocs]

At this point, we can define a function to calculate weighted samples for a Gaussian blur:

gaussianSamples :: Int -> [(Float, Float)] gaussianSamples = normalise . withOffsets . binomialCoefficients

For instance, **gaussianSamples 5** equals **[(-2.0,0.0625),(-1.0,0.25),(0.0,0.375),(1.0,0.25),(2.0,0.0625)]**.

To improve the performance of the filter, we should make it as small as possible. One obvious thing we can do is to simply remove terms that don’t contribute to the results significantly. Since our primary use case is real-time graphics, a bit of inaccuracy most likely won’t make any perceivable difference. Our solution is to introduce a thresholding operation that throws away terms that are much smaller than the biggest term:

threshold :: Float -> [(Float, Float)] -> [(Float, Float)] threshold t ocs = [oc | oc@(_, c) <- ocs, c*t >= m] where m = maximum [c | (_, c) <- ocs]

This simple method can be surprisingly efficient. When we calculate the weights for wide Gaussian blurs, it turns out that a large portion of the outer areas hardly contribute to the final value. For instance, **threshold 1000 (withOffsets (binomialCoefficients 101))** leaves only 37 terms.

Another, more clever trick we can apply is specific to texture sampling. Since the GPU provides linear interpolation for free, we can get a weighted sum of two adjacent samples with a single query: f(x) = (1 – frac(x)) * f(floor(x)) + frac(x) * f(ceil(x)). All we need to do is merge adjacent samples by calculating the necessary sampling position. The following function assumes that all samples are adjacent and separated by the unit distance; both criteria are met if we provide the output of **withOffsets**:

offsetWeight :: [(Float, Float)] -> [(Float, Float)] offsetWeight [] = [] offsetWeight [ow] = [ow] offsetWeight ((o1,w1):(o2,w2):ows) = (o1+w2/w', w') : offsetWeight ows where w' = w1+w2

The resulting list is half as long as the input. We can also combine this optimisation with thresholding, but only with care: we need to do the sample merging first, since **offsetWeight** assumes consecutive samples. It’s best to wrap it all up in a single function:

gaussianSamples :: Float -> Int -> [(Float, Float)] gaussianSamples tolerance = normalise . threshold tolerance . offsetWeight . withOffsets . binomialCoefficients

To stick with the previous example, **length (gaussianSamples 1000 101)** equals 19. This graph shows how these two optimisations affect the number of samples needed to reproduce a Gaussian of a given width:

So how shall we take these weights into use? First of all, we need some input to apply the filter to. We went with a simple pattern generated in a fragment shader; the geometry rendered is just a screen-sized quad:

originalImage :: Exp Obj (FrameBuffer N1 V4F) originalImage = Accumulate accCtx PassAll frag (Rasterize triangleCtx prims) clearBuf where accCtx = AccumulationContext Nothing (ColorOp NoBlending (one' :: V4B) :. ZT) clearBuf = FrameBuffer (ColorImage n1 (V4 0 0 0 1) :. ZT) prims = Transform vert (Fetch "geometrySlot" Triangle (IV2F "position")) vert :: Exp V V2F -> VertexOut () vert pos = VertexOut pos' (floatV 1) (ZT) where V2 x y = unpack' pos pos' = pack' (V4 x y (floatV 0) (floatV 1)) frag :: Exp F () -> FragmentOut (Color V4F :+: ZZ) frag _ = FragmentOut (col :. ZT) where V4 x y _ _ = unpack' fragCoord' x' = sqrt' x @* floatF 16 y' = sqrt' y @* floatF 16 r = Cond ((x' @+ y') @% (floatF 50) @< (floatF 25)) (floatF 0) (floatF 1) g = floatF 0 b = Cond ((x' @- y') @% (floatF 50) @< (floatF 25)) (floatF 0) (floatF 1) col = pack' (V4 r g b (floatF 1))

This is a complete single-pass pipeline description. The vertex shader does nothing beyond simply extending the position vector with the missing Z and W coordinates. The fragment shader uses the fragment coordinate as its only input, so we don’t need to pass any data between the two shader phases. The resulting image is the following:

Next, we’ll define a one-dimensional convolution pass that takes a direction vector, a list of weights with offsets (which are used to scale the aforementioned vector), and an input image. The list is used to generate an unrolled loop in the fragment shader, while the vertex shader performs just an elementary coordinate transformation.

convolve :: V2F -> [(Float, Float)] -> Exp Obj (Image N1 V4F) -> Exp Obj (FrameBuffer N1 V4F) convolve (V2 dx dy) weights img = Accumulate accCtx PassAll frag (Rasterize triangleCtx prims) clearBuf where resX = windowWidth resY = windowHeight dir' :: Exp F V2F dir' = Const (V2 (dx / fromIntegral resX) (dy / fromIntegral resY)) accCtx = AccumulationContext Nothing (ColorOp NoBlending (one' :: V4B) :. ZT) clearBuf = FrameBuffer (ColorImage n1 (V4 0 0 0 1) :. ZT) prims = Transform vert (Fetch "postSlot" Triangle (IV2F "position")) vert :: Exp V V2F -> VertexOut V2F vert uv = VertexOut pos (Const 1) (NoPerspective uv' :. ZT) where uv' = uv @* floatV 0.5 @+ floatV 0.5 pos = pack' (V4 u v (floatV 1) (floatV 1)) V2 u v = unpack' uv frag :: Exp F V2F -> FragmentOut (Color V4F :+: ZZ) frag uv = FragmentOut (sample :. ZT) where sample = foldr1 (@+) [ texture' smp (uv @+ dir' @* floatF ofs) @* floatF coeff | (ofs, coeff) <- weights] smp = Sampler LinearFilter Clamp tex tex = Texture (Texture2D (Float RGBA) n1) (V2 resX resY) NoMip [img]

Incorporating the convolution pass is as simple as using function composition:

weights = gaussianSamples 1000 101 dirH = V2 1 0 dirV = V2 0 1 finalImage :: Exp Obj (FrameBuffer N1 V4F) finalImage = filterPass dirV (filterPass dirH originalImage) where filterPass dir = convolve dir weights . projectBuffer projectBuffer = PrjFrameBuffer "" tix0

Using our original definition for **gaussianSamples**, i.e. not using thresholding or merging, we get the following rendering:

When we add both optimisations, the resulting image is indistinguishable from the previous one:

Despite the striking similarity, the two images are not identical. If we calculate the difference and enhance the result, we get the following pattern:

With 8-bit RGB components, the maximum difference anywhere on the image is just 1, so no wonder it cannot be detected by the naked eye. This is a really nice result, because the run-time increases steeply with the number of samples. The width of the filter is also a factor, but it’s difficult to predict its effect due to the complexity of the GPU hardware. On the hardware we tested (nVidia GeForce 9650M GT) we got the following timings:

At a first glance, it looks like the time per frame is more or less directly proportional to the number of samples used in the filtering passes after we deduct the time needed to run the no-op 1×1 filter. However, in this particular case, the variant with both optimisations turned on seems to have a smaller time to sample count ratio than the naive version, about 0.2 vs. 0.25 ms, so the gains are even better than expected. On other hardware we might see different relations depending on the peculiarities of memory access costs.

At this point we ran into the limits of the current LambdaCube implementation. One more potential optimisation worth trying would be to calculate the sample coordinates in the vertex shader, making the texture reads non-dependent. Currently this is not possible due to the lack of support for array types in the shaders. It’s also evident that the EDSL approach without explicit let bindings is not viable, because compilation time is exponential with respect to the complexity of the shader. While there have been some positive developments in this area, we intend to switch to the DSL approach in the next version anyway, which should solve this problem. Finally, if you run the example, you’ll notice that you cannot resize the window. The reason for this is rather mundane: currently texture sizes are pipeline compile-time constants. In the future we’ll change their type from **V2I** to **Exp Obj V2I**, which will make it possible to modify the values during run-time.