In a Nutshell
LambdaCube 3D is a domain specific language and library that makes it possible to program GPUs in a purely functional style.
Purely Functional Rendering Engine
We recently added a new example to the repository that shows one way of implementing convolution filters. Our variance shadow mapping example uses a Gaussian blur as a component, but it’s implemented in a rather naive, straightforward way. This time we’re going to have a closer look at it and improve its performance while trying to preserve the output as much as possible. This little experiment was primarily inspired by a post from Daniel Rákos. While the optimisation is not specific to the Gaussian blur, we’ll stick with it as an example.
First of all, we’ll define a function to calculate the weights of a blur for a given filter width. As Daniel points out, taking the binomial coefficients, i.e. the appropriate row of Pascal’s triangle, and normalising it gives us the desired values. First let’s define the weights without normalisation:
binomialCoefficients :: Int -> [Float] binomialCoefficients n = iterate next  !! (n-1) where next xs = [x+y | x <- xs ++  | y <- 0:xs]
A convolution filter takes samples from the neighbourhood of a given texel and computes a weighted sum from them, therefore we need to assign offsets to the weights before we can use them. We will measure the offsets in texels. The following function assigns offsets to any list of weights with the assumption that the middle element of the list corresponds to the centre of the filter (from now on, we’ll assume the filter width is an odd integer):
withOffsets :: [Float] -> [(Float, Float)] withOffsets cs = [(o, c) | c <- cs | o <- [-lim..lim]] where lim = fromIntegral (length cs `quot` 2)
Finally, we want to normalise the weights, so their sum is 1. We define our normalisation step to work on the weights with offsets:
normalise :: [(Float, Float)] -> [(Float, Float)] normalise ocs = [(o, c/s) | (o, c) <- ocs] where s = sum [c | (_, c) <- ocs]
At this point, we can define a function to calculate weighted samples for a Gaussian blur:
gaussianSamples :: Int -> [(Float, Float)] gaussianSamples = normalise . withOffsets . binomialCoefficients
For instance, gaussianSamples 5 equals [(-2.0,0.0625),(-1.0,0.25),(0.0,0.375),(1.0,0.25),(2.0,0.0625)].
To improve the performance of the filter, we should make it as small as possible. One obvious thing we can do is to simply remove terms that don’t contribute to the results significantly. Since our primary use case is real-time graphics, a bit of inaccuracy most likely won’t make any perceivable difference. Our solution is to introduce a thresholding operation that throws away terms that are much smaller than the biggest term:
threshold :: Float -> [(Float, Float)] -> [(Float, Float)] threshold t ocs = [oc | oc@(_, c) <- ocs, c*t >= m] where m = maximum [c | (_, c) <- ocs]
This simple method can be surprisingly efficient. When we calculate the weights for wide Gaussian blurs, it turns out that a large portion of the outer areas hardly contribute to the final value. For instance, threshold 1000 (withOffsets (binomialCoefficients 101)) leaves only 37 terms.
Another, more clever trick we can apply is specific to texture sampling. Since the GPU provides linear interpolation for free, we can get a weighted sum of two adjacent samples with a single query: f(x) = (1 – frac(x)) * f(floor(x)) + frac(x) * f(ceil(x)). All we need to do is merge adjacent samples by calculating the necessary sampling position. The following function assumes that all samples are adjacent and separated by the unit distance; both criteria are met if we provide the output of withOffsets:
offsetWeight :: [(Float, Float)] -> [(Float, Float)] offsetWeight  =  offsetWeight [ow] = [ow] offsetWeight ((o1,w1):(o2,w2):ows) = (o1+w2/w', w') : offsetWeight ows where w' = w1+w2
The resulting list is half as long as the input. We can also combine this optimisation with thresholding, but only with care: we need to do the sample merging first, since offsetWeight assumes consecutive samples. It’s best to wrap it all up in a single function:
gaussianSamples :: Float -> Int -> [(Float, Float)] gaussianSamples tolerance = normalise . threshold tolerance . offsetWeight . withOffsets . binomialCoefficients
To stick with the previous example, length (gaussianSamples 1000 101) equals 19. This graph shows how these two optimisations affect the number of samples needed to reproduce a Gaussian of a given width:
So how shall we take these weights into use? First of all, we need some input to apply the filter to. We went with a simple pattern generated in a fragment shader; the geometry rendered is just a screen-sized quad:
originalImage :: Exp Obj (FrameBuffer N1 V4F) originalImage = Accumulate accCtx PassAll frag (Rasterize triangleCtx prims) clearBuf where accCtx = AccumulationContext Nothing (ColorOp NoBlending (one' :: V4B) :. ZT) clearBuf = FrameBuffer (ColorImage n1 (V4 0 0 0 1) :. ZT) prims = Transform vert (Fetch "geometrySlot" Triangle (IV2F "position")) vert :: Exp V V2F -> VertexOut () vert pos = VertexOut pos' (floatV 1) (ZT) where V2 x y = unpack' pos pos' = pack' (V4 x y (floatV 0) (floatV 1)) frag :: Exp F () -> FragmentOut (Color V4F :+: ZZ) frag _ = FragmentOut (col :. ZT) where V4 x y _ _ = unpack' fragCoord' x' = sqrt' x @* floatF 16 y' = sqrt' y @* floatF 16 r = Cond ((x' @+ y') @% (floatF 50) @< (floatF 25)) (floatF 0) (floatF 1) g = floatF 0 b = Cond ((x' @- y') @% (floatF 50) @< (floatF 25)) (floatF 0) (floatF 1) col = pack' (V4 r g b (floatF 1))
This is a complete single-pass pipeline description. The vertex shader does nothing beyond simply extending the position vector with the missing Z and W coordinates. The fragment shader uses the fragment coordinate as its only input, so we don’t need to pass any data between the two shader phases. The resulting image is the following:
Next, we’ll define a one-dimensional convolution pass that takes a direction vector, a list of weights with offsets (which are used to scale the aforementioned vector), and an input image. The list is used to generate an unrolled loop in the fragment shader, while the vertex shader performs just an elementary coordinate transformation.
convolve :: V2F -> [(Float, Float)] -> Exp Obj (Image N1 V4F) -> Exp Obj (FrameBuffer N1 V4F) convolve (V2 dx dy) weights img = Accumulate accCtx PassAll frag (Rasterize triangleCtx prims) clearBuf where resX = windowWidth resY = windowHeight dir' :: Exp F V2F dir' = Const (V2 (dx / fromIntegral resX) (dy / fromIntegral resY)) accCtx = AccumulationContext Nothing (ColorOp NoBlending (one' :: V4B) :. ZT) clearBuf = FrameBuffer (ColorImage n1 (V4 0 0 0 1) :. ZT) prims = Transform vert (Fetch "postSlot" Triangle (IV2F "position")) vert :: Exp V V2F -> VertexOut V2F vert uv = VertexOut pos (Const 1) (NoPerspective uv' :. ZT) where uv' = uv @* floatV 0.5 @+ floatV 0.5 pos = pack' (V4 u v (floatV 1) (floatV 1)) V2 u v = unpack' uv frag :: Exp F V2F -> FragmentOut (Color V4F :+: ZZ) frag uv = FragmentOut (sample :. ZT) where sample = foldr1 (@+) [ texture' smp (uv @+ dir' @* floatF ofs) @* floatF coeff | (ofs, coeff) <- weights] smp = Sampler LinearFilter Clamp tex tex = Texture (Texture2D (Float RGBA) n1) (V2 resX resY) NoMip [img]
Incorporating the convolution pass is as simple as using function composition:
weights = gaussianSamples 1000 101 dirH = V2 1 0 dirV = V2 0 1 finalImage :: Exp Obj (FrameBuffer N1 V4F) finalImage = filterPass dirV (filterPass dirH originalImage) where filterPass dir = convolve dir weights . projectBuffer projectBuffer = PrjFrameBuffer "" tix0
Using our original definition for gaussianSamples, i.e. not using thresholding or merging, we get the following rendering:
When we add both optimisations, the resulting image is indistinguishable from the previous one:
Despite the striking similarity, the two images are not identical. If we calculate the difference and enhance the result, we get the following pattern:
With 8-bit RGB components, the maximum difference anywhere on the image is just 1, so no wonder it cannot be detected by the naked eye. This is a really nice result, because the run-time increases steeply with the number of samples. The width of the filter is also a factor, but it’s difficult to predict its effect due to the complexity of the GPU hardware. On the hardware we tested (nVidia GeForce 9650M GT) we got the following timings:
At a first glance, it looks like the time per frame is more or less directly proportional to the number of samples used in the filtering passes after we deduct the time needed to run the no-op 1×1 filter. However, in this particular case, the variant with both optimisations turned on seems to have a smaller time to sample count ratio than the naive version, about 0.2 vs. 0.25 ms, so the gains are even better than expected. On other hardware we might see different relations depending on the peculiarities of memory access costs.
At this point we ran into the limits of the current LambdaCube implementation. One more potential optimisation worth trying would be to calculate the sample coordinates in the vertex shader, making the texture reads non-dependent. Currently this is not possible due to the lack of support for array types in the shaders. It’s also evident that the EDSL approach without explicit let bindings is not viable, because compilation time is exponential with respect to the complexity of the shader. While there have been some positive developments in this area, we intend to switch to the DSL approach in the next version anyway, which should solve this problem. Finally, if you run the example, you’ll notice that you cannot resize the window. The reason for this is rather mundane: currently texture sizes are pipeline compile-time constants. In the future we’ll change their type from V2I to Exp Obj V2I, which will make it possible to modify the values during run-time.