In a Nutshell
LambdaCube 3D is a domain specific language and library that makes it possible to program GPUs in a purely functional style.
Purely Functional Rendering Engine
In the earlier posts, we mentioned several times that we imagine LambdaCube as a standalone DSL. But what would this mean in practice?
One of our long-term goals is to build a productive content authoring environment that’s based on a purely functional rendering pipeline. Who knows, it might even be feasible to package LambdaCube as a Unity plugin! Another possible application could be generating an optimised rendering component to be integrated in another game engine. In any case, interoperability is a major concern for us. Driven by this goal, we came up with the following plan for the LambdaCube stack:
The heart of this scheme is obviously the intermediate language. The frontend compiler is going to be an independent tool (available both as a plain executable and a library with at least a C interface) that should be possible to easily integrate into any build system. This tool is responsible for most of the static analysis, some trivial as well as (hopefully) clever optimisations, and also planning the allocation of resources. As a result, it turns the declarative pipeline description into a series of instructions for a virtual machine that handles GPU specific data structures as primitives, e.g. setting up the rendering context, or binding vertex buffers, textures, and framebuffers in the right order. Given this intermediate description, we can take it further in many directions, as illustrated by the figure above.
Of course, LambdaCube is not likely to be useful when you are working on the bleeding edge and need accurate control over the resources. Also, when we are on said bleeding edge, it is most likely premature to think about building high-level abstractions. Therefore, we’d like to make it clear that we don’t see LambdaCube as a one-size-fits-all solution. If we were to make a mission statement, it would go something like this: we want to cover 95% of the use cases, making even moderately complex projects easier, and not get in the way of the remaining 5%.
We quickly put together two HD quality videos of the big LambdaCube examples for the lazy ones. First the Quake 3 level viewer:
Then the Stunts example, featuring some truly mad driving skils:
Over and out!
LambdaCube is based on a purely functional model of the graphics pipeline. The graphics pipeline is a multi-rate data-flow network, therefore it was a straightforward decision to use streams as the basic abstraction for data that’s processed sequentially. However, LambdaCube is not a purely stream-oriented language, as we will see below. For the time being, we only deal with DX10/OGL3.2 constructs, i.e. we allow geometry shaders but no tesselation.
The language extends beyond the responsibility of shader languages by also reifying framebuffers, thereby being able to expose results of single rendering passes as plain values. Conceptually, a framebuffer is the result of a fold operation over a fragment stream, and exposing it as a first-class value is the main source of the extra expressive power over conventional stream programming languages. A multi-pass rendering process is simply described by an expression that contains at least one subexpression that evaluates to a framebuffer converted to a sampler.
First we’ll look at the simple case of transforming a vertex buffer into a framebuffer that contains the final rendering in a single pass. This is achieved by sending the data through the rendering pipeline, whose stages are each modelled by a pure function. Part of the design process was to decide how we want to decompose the whole process.
The major building blocks of the LambdaCube pipeline are the following functions (with simplified Haskell-style type signatures):
fetch :: VertexBuffer a -- The input buffer → p -- The associated primitive → VertexStream p a -- The stream enumerating the elements of the buffer transform :: VertexShader a b -- The function applied to each vertex → VertexStream p a -- The incoming stream → PrimitiveStream p 1 V b -- The resulting stream reassemble :: GeometryShader p1 p2 n a b -- The function applied to each primitive → PrimitiveStream p1 1 V a -- The incoming stream → PrimitiveStream p2 n G b -- The resulting stream rasterize :: RasterContext p -- The settings used for rasterisation → PrimitiveStream p n f a -- The incoming stream → FragmentStream n a -- The resulting stream accumulate :: AccumulationContext b -- Accumulation settings → FragmentFilter a -- Filtering predicate (fragment kept if true) → FragmentShader a b -- The function applied to each fragment → FragmentStream n a -- The incoming stream of fragments → FrameBuffer n b -- The starting state of the framebuffer → FrameBuffer n b -- The final state of the framebuffer
Each stage boundary has a corresponding stream type, which captures the totality of the data that passes through the cross-section as an abstract entity.
The pipeline starts its operation by fetching the vertex data, which means taking a buffer of records and enumerating them one by one. While this is just a formal step in the model, it is significant in practice as fetching is the bridge between the CPU and the GPU. In the model, one can think of VertexBuffer t as an array that contains a piece of data of type t for each vertex. In reality, it is a reference to future data, and it consists of a mapping of names to parts of the data structure (typically a flat tuple) describing each vertex – basically a list of attribute names and types –, plus it specifies the name of the corresponding input slot.
Besides the vertex data, we also need to specify how consecutive vertices are assembled into primitives. The second parameter of fetch provides this information in a strongly typed way. The type p is different for each kind of primitive: point, line, or triangle. This distinction is useful mainly during rasterisation, as we will see below.
The first transformation is a simple map operation over the vertex stream: the vertex shader is applied to each vertex independently. We chose the name transform for this step because it is typically used to transform vertices from object local space to camera or clip space besides calculating additional data. Vertex shaders are simply pure functions:
VertexShader a b ≡ a@V → b@V'
This is the first time we see frequencies appear in the types, denoted with the type operator @. According to the above equivalence, a vertex shader is a function that takes a value in a vertex context (V) and produces a vertex output (V').
The vertex context can be thought of as an instance of the reader monad, where the input type is a tuple composed of additional information for the vertex (e.g. the index of the vertex within the stream). Consequently, it is an applicative functor, therefore any primitive constant and function operating on primitive types can be lifted to work in it. Since we do not need the expressiveness of monads, we can make this lifting implicit, which makes it trivial to define vertex shaders in a natural way.
The original vertex context is supplied by the transform function, while the vertex output has to be defined by the programmer. There is only one way to construct a value of type a@V':
vertexOut :: V4F@V → Float@V → (Interpolated a)@V → a@V'
The vertexOut function associates an interpolated value in a vertex context (Interpolated a) with a position in homogeneous coordinates (V4F) and a point size (Float). An interpolated value is simply a value associated with an interpolation strategy, which can be flat, smooth, or linear. The following functions can be used to construct such values:
flat :: a@f → (Interpolated a)@f smooth :: Continuous a ⇒ a@f → (Interpolated a)@f linear :: Continuous a ⇒ a@f → (Interpolated a)@f
The interpolation strategy chosen will be relevant during rasterisation. Since integer attributes cannot be interpolated, they must be flat shaded. Floating-point attributes (including vectors and matrices) can be interpolated; the formulae that define the resulting value for a fragment at a given position depending on the interpolation strategy are described in the OpenGL standard.
The final result of the transformation step is a primitive stream, i.e. transform is also responsible for primitive assembly besides vertex shading, while these are generally considered to be distinct stages of the rendering pipeline. For the time being, we don’t see the practical need to separate these stages, since we cannot do anything useful with the output of the vertex shader other than passing it forward.
The PrimitiveStream type constructor has for parameters: primitive type, layer count, stage of origin, and payload type. As the type signature of the transform function confirms, vertex shader preserves the primitive type originally associated with the stream, and the final type of the payload is identical to the output type of the vertex shader. The stage of origin is always V (to distinguish it from a primitive stream produced by the geometry shader), and the layer count is 1 by definition, as only geometry shaders can produce a multi-layered output, i.e. address a three-dimensional framebuffer.
The next step in the pipeline is optional: we take the stream of primitives produced by transform, and emit a stream of a possibly different primitive type, where each input corresponds to zero, one, or several (with a statically known maximum) outputs. We can see the role of the stage of origin in the primitive stream’s type: since the reassemble function requires a vertex shader output, the type system prevents us from adding more than one geometry shader to the pipeline.
The actual transformation is defined by the geometry shader, which corresponds to the following type:
GeometryShader p1 p2 n a b ≡ (n, p2, Int, GeometryTransducer p1 a b) GeometryTransducer p1 a b ≡ ((PrimitiveVertices p1 a)@G → (i, Int)@G, i@G → (i, j, Int)@G, j@G → (j, b)@G')
Unfortunately, we cannot provide a simple, elegant definition similar to the vertex shader, because geometry shaders are more involved. The arguments of the GeometryShader type constructor are the following: input primitive, output primitive, layer count, payload input and output types. This type is equivalent to a 4-tuple that directly specifies the layer count and the output primitive (strongly typed), the maximum number of primitives emitted for any incoming primitive, and the actual shader logic.
At the present moment, we prescribe a rigid structure of two nested loops in the geometry shader. This is defined by the user by providing the following three functions:
The final output must be defined in the inner loop using a function similar to vertexOut, but slightly more involved:
geometryOut :: V4F@G → Float@G → Int@G → Int@G → j@G → (Interpolated a)@G → (j, a)@G'
In order, the arguments are the vertex position, the point size, the primitive id, the layer id, the next inner loop state, and the payload, which needs to carry interpolation information in the same way as vertex outputs.
We are not particularly happy with this solution due to its clunkiness. A more elegant alternative would be to define the output with a single function as a list of dynamic length, but it might be difficult to provide a set of constructors that’s convenient to use and easy to generate code from at the same time. In any case, this is a design problem we shall attack again after having a reasonably user-friendly implementation.
During rasterisation, each primitive is turned into a stream of fragments, and the return value of rasterize is the concatenation of these streams. Within a primitive, each resulting fragment has a unique two-dimensional coordinate, i.e. they correspond to separate pixels of the framebuffer. This transformation is a fixed functionality of the GPU, and it is not fully defined, so the exact outcome might depend on the hardware. However, the OpenGL standard, for instance, prescribes that the rasterisation process fulfils certain criteria, e.g. the result is invariant to translation by integer coordinates as long as the primitive is not clipped.
Ultimately, the role of the rasterisation process is to determine the list of coordinates that a primitive covers, and calculate the interpolation of the vertex attributes in every resulting fragment. The formula used for interpolation depends on the strategy chosen in the vertex or geometry shader (if the latter is present), as discussed above.
The raster context is a tuple describing the details of rasterisation, whose exact shape depends on the primitive type. For instance, RasterContext Triangle is the most complex case, which specifies the following factors:
The fact that fragments correspond to distinct coordinates is the key in achieving high performance. Since they all affect different pixels of the output, they can be processed in parallel. For our purposes, this means that the ordering of fragments coming from any given primitive is undefined and cannot be relied upon.
The final step in the rendering process, performed by the accumulate function, consumes the fragment stream through a map and a fold phase. The fragment shader specifies the function used to map over the incoming fragments, while the final accumulation is a primarily fixed-structure computation that takes a framebuffer and blends the fragments into it one by one. Similarly to transform, this step is generally considered to be two separate phases that we fused together in the current version of the model.
The framebuffer is really just a glorified multidimensional array. The dimensions are width, height, component count, and layer count. The last dimension is encoded as a phantom type (denoted with n throughout the above type signatures), and it is greater than one only in the presence of a geometry shader. The first two dimensions are addressed through the fragment coordinates. As for the components, there are three kinds: depth, stencil, and colour. The first two serve special purposes – mainly related to discarding fragments through fixed functionality –, while colour components can be used to store any kind of data. We refer to this distinction as the semantics of a given piece of data. Semantics is similar to interpolation strategy in a sense that both are represented by a simple tagging scheme.
Similarly to the raster context, the accumulation context is a tuple that contains all the information needed to decide what operations to perform when processing each fragment produced by the rasterisation phase. The tuple has three main components, each optional depending on the type of the framebuffer used during accumulation, and each corresponding to a semantics:
The fragment shader is used to process the fragment before blending, while the fragment filter is a simple boolean function in the fragment context:
FragmentFilter a ≡ a@F → Bool@F FragmentShader a b ≡ a@F → b@F'
If the filter evaluates to false for a given fragment, that fragment is thrown away instead of being blended. Depth and stencil testing act as further filters.
Otherwise, the fragment shader is similar to the other shaders: it receives the interpolated attributes along with some additional data produced by the pipeline (the fragment context F), and yields a fragment output (F'). The output can be constructed with either of these functions:
fragmentOut :: a@F → (Color a)@F' fragmentOutDepth :: Float@F → a@F → (Depth Float, Color a)@F' fragmentOutRasterDepth :: a@F → (Depth Float, Color a)@F'
The fragmentOut function is used in the absence of a depth buffer. If the framebuffer has a depth component, the fragment comes with a computed depth value accessible through the fragment context, which can be either overridden (fragmentOutDepth) or kept (fragmentOutRasterDepth). The latter is a special case with a practical significance: it allows us to perform the depth test before evaluating the fragment shader, thereby getting a chance to discard the fragment earlier. This is a well-known optimisation that doesn’t change the final output.
In the end, the blending equation is used to combine the corresponding pixel in the framebuffer with the output of the fragment shader, and the result replaces the former.
In order to produce complex effects, especially when using deferred rendering, one pass is not sufficient. The contents of a framebuffer can be exposed to subsequent passes through ordinary samplers. Due to the multitude of the texture formats and the general complexity of the interaction between datatypes, we omit the actual type signatures here and only present the basic concept. The gory details will be discussed in a subsequent post.
Our model deals with four kinds of image types. In the order of getting data from one pass to the next they are the following:
There are functions that can convert from each type of the above list to the type below it by specifying all the additional information needed and possibly converting the payload on the way (e.g. projecting out just one element of a tuple type). Ultimately, samplers can be accessed in shader contexts through a family of primitive operations whose name starts as texture, which take a sampler and a sampling point (coordinates of matching dimension), and return the sample at that point. This is sufficient to create effects like shadow mapping or perform screen-space processing like fake ambient occlusion or bloom.
Besides multiple passes, we found it necessary to be able to handle meshes and bigger groups of objects at a higher level. In practice, this means being able to write fold-like loops over groups of objects. For instance, we might want to loop over the list of lights to accumulate their individual contributions to the scene. For each light, we need to render all the objects it affects to build the corresponding shadow map. We might also have an inner loop in case we use a cascading scheme (partitioning the scene) to improve the accuracy of the shadows.
If we want to define these loops with respect to the frequency framework, we can say that objects live in a frequency range between constants and primitives. It also seems that the general case is in essence a tree fold operation, where the nodes closer to the root correspond to the outer loops, and the leaves are the individual vertex buffers, yielding a framebuffer as the end result. The tree describes the order objects are rendered in, so it is not the scene graph itself, only related to it through a mechanical transformation.
Unfortunately, we haven’t been able to work out the details of the object layer yet. For the time being, we implement such loops manually in the Haskell EDSL version of LambdaCube.