7. Writing Fast Futhark Programs

This document contains tips, tricks, and hints for writing efficient Futhark code. Ideally you’d need to know nothing more than an abstract cost model, but sometimes it is useful to have an idea of how the compiler will transform your program, what values look like in memory, and what kind of code the compiler will generate for you. These details are documented below. Don’t be discouraged by the complexities mentioned here - most Futhark programs are written without worrying about any of these details, and they still manage to run with good performance. This document focuses on corner cases and pitfalls, which easily makes for depressing reading.

7.1. Parallelism

The Futhark compiler only generates parallel code for explicitly parallel constructs such as map and reduce. A plain loop will not result in parallel code (unless the loop body itself contains parallel operations). The most important parallel constructs are the second-order array combinators (SOACs) such as map and reduce, but functions such as copy are also parallel.

When describing the asymptotic cost of a Futhark function, it is not enough to give a traditional big-O measure of the total amount of work. Both foldl and reduce involve O(n) work, where n is the size of the input array, but foldl is sequential while reduce is parallel, and this is an important distinction. To make this distinction, each function is described by two costs: the work, which is the total number of operations, and the span (sometimes called depth), which is intuitively the “longest chain of sequential dependencies”. We say that foldl has span O(n), while reduce has span O(log(n)). This is the sense in which reduce is more parallel than foldl. The documentation for a Futhark function should mention both its work and span. See this for more details on parallel cost models and pointers to literature.
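
For example, both of the following functions (illustrative names, not from the standard library) compute the sum of an array with O(n) work, but they differ in span:

def sum_par (xs: []i32) : i32 = reduce (+) 0 xs -- span O(log(n))
def sum_seq (xs: []i32) : i32 = foldl (+) 0 xs  -- span O(n)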

7.1.1. Scans and reductions

The scan and reduce SOACs are rather inefficient when their operators operate on arrays. If possible, use tuples instead (see Small Arrays). The one exception is when the operator is a map2 or equivalent. Example:

reduce (map2 (+)) (replicate n 0) xss

Such “vectorised” operators are typically handled quite efficiently, although to be on the safe side you can rewrite the above by interchanging the reduce and the map:

map (reduce (+) 0) (transpose xss)

Avoid reductions over tiny arrays, e.g. reduce (+) 0 [x,y,z]. In such cases the compiler will generate complex code to exploit a negligible amount of parallelism. Instead, just unroll the loop manually (x+y+z) or perhaps use foldl (+) 0 [x,y,z], which produces a sequential loop.

7.1.2. Histograms

The reduce_by_index construct (“generalised histogram”) has a clever and adaptive implementation that handles multiple updates of the same bin efficiently. Its main weakness is when computing a very large histogram (many millions of bins) where only a tiny fraction of the bins are updated. This is because the main mechanism for optimising conflicts is by duplicating the histogram in memory, but this is not efficient when it is very large. If you know your program involves such a situation, it may be better to implement the histogram operation by sorting and then performing an irregular segmented reduction.

Particularly with the GPU backends, reduce_by_index is much faster when the operator involves a single 32-bit or 64-bit value. Even if you really want an 8-bit or 16-bit result, it may be faster to compute it with a 32-bit or 64-bit type and manually mask off the excess bits.
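
As a sketch of typical usage, the following expression counts how many times each of k bins occurs in an index array is (here k and is are assumed to be in scope, and the counts use i32 as suggested above):

reduce_by_index (replicate k 0i32) (+) 0 is (map (\_ -> 1) is)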

7.1.3. Nested parallelism

Futhark allows nested parallelism, understood as a parallel construct used inside some other parallel construct. The simplest example is nested SOACs. Example:

map (\xs -> reduce (+) 0 xs) xss

Nested parallelism is allowed and encouraged, but its compilation to efficient code is rather complicated, depending on the compiler backend that is used. The problem is that sometimes exploiting all levels of parallelism is not optimal, yet how much to exploit depends on run-time information that is not available to the compiler.

7.1.3.1. Sequential backends

The sequential backends are straightforward: all parallel operations are compiled into sequential loops. Due to Futhark’s low-overhead data representation (see below), this is often surprisingly efficient.

7.1.3.2. Multicore backend

Whenever the multicore backend encounters nested parallelism, it generates two code versions: one where the nested parallel constructs are also parallelised (possibly recursively involving further nested parallelism), and one where they are turned into sequential loops. At runtime, based on the amount of work available and self-tuning heuristics, the scheduler picks the version that it believes best balances overhead with exploitation of parallelism.

7.1.3.3. GPU backends

The GPU backends handle parallelism through an elaborate program transformation called incremental flattening. The full details are beyond the scope of this document, but some of its properties are useful to know. See this paper for more details.

The main restriction is that the GPU backends can only handle regular nested parallelism, meaning that the sizes of inner parallel dimensions are invariant to the outer parallel dimensions. For example, this expression contains irregular nested parallelism:

map (\i -> reduce (+) 0 (iota i)) is

This is because the size of the nested parallel construct is i, and i has a different value for every iteration of the outer map. The Futhark compiler will currently turn the irregular constructs (here, the reduce) into a sequential loop. Depending on how complicated the irregularity is, it may even refuse to generate code entirely.

Incremental flattening is based on generating multiple code versions to cater to different classes of datasets. At run-time, one of these versions will be picked for execution by comparing properties of the input (its size) with a threshold parameter. These threshold parameters have sensible defaults, but for optimal performance, they can be tuned with futhark-autotune.

7.2. Value Representation

The compiler discards all type abstraction when compiling. Using the module system to make a type abstract causes no run-time overhead.

7.2.1. Scalars

Scalar values (i32, f64, bool, etc) are represented as themselves. The internal representation does not distinguish signs, so i32 and u32 have the same representation, and converting between types that differ only in sign is free.
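
For example (a minimal illustration; as_bits is not a standard function), the conversion below generates no code at all:

def as_bits (x: i32) : u32 = u32.i32 x -- free: same bits, only the sign interpretation changes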

7.2.2. Tuples

Tuples are flattened and then represented directly by their individual components - there are no tuple objects at runtime. A function that takes an argument of type (f64,f64) corresponds to a C function that takes two arguments of type double. This has one performance implication: whenever you pass a tuple to a function, the entire tuple is copied (except any embedded arrays, which are always passed by reference, see below). Due to the compiler’s heavy use of inlining, this is rarely a problem in practice, but it can be a concern when using the loop construct with a large tuple as the loop variant parameter.
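
As an illustration (the function name is hypothetical), the loop below carries a tuple of two scalars as its loop parameter; the components are conceptually copied on every iteration, which is cheap here but would be more costly for a very large tuple:

def sum_and_max (xs: []i32) : (i32, i32) =
  loop (s, m) = (0, i32.lowest) for x in xs do
    (s + x, i32.max m x)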

7.2.3. Records

Records are turned into tuples by simply sorting their fields and discarding the labels. This means there is no overhead to using a record compared to using a tuple.
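
For instance, a value of the record type below (an illustrative type) is represented exactly like a value of type (f64, f64), with the x field before the y field because the fields are sorted by name:

type point = {x: f64, y: f64}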

7.2.4. Sum Types

A sum type value is represented as a tuple containing all the payload components in order, prefixed with an i8 tag to identify the constructor. For example,

#foo i32 bool | #bar i32

would be represented as a tuple of type

(i8, i32, bool, i32)

where the value

#foo 42 false

is represented as

(1, 42, false, 0)

where #foo is assigned the tag 1 because it is alphabetically after #bar.

To shrink the tuples, if multiple constructors have payload elements of the same type, the compiler assigns them to the same elements in the result tuple. The representation of the above sum type is actually the following:

(i8, i32, bool)

The types must be the same for deduplication to take place - despite i32 and f32 being of the same size, they cannot be assigned the same tuple element. This means that the type

#foo [n]i32 | #bar [n]i32

is efficiently represented as

(i8, [n]i32)

rather than

(i8, [n]i32, [n]i32)

However, the type

#foo [n]i32 | #bar [n]f32

becomes

(i8, [n]i32, [n]f32)

which is not great. Be careful when using sum types with large arrays in their payloads.

7.2.5. Functions

Higher-order functions are implemented via defunctionalisation. At run-time, they are represented by a record containing their lexical closure. Since the type system forbids putting functions in arrays, this is essentially a constant cost, and not worth worrying about.

7.2.6. Arrays

Arrays are the only Futhark values that are boxed - that is, are stored on the heap.

The elements of an array are unboxed, stored adjacent to each other in memory. There is zero memory overhead except for the minuscule amount needed to track the shape of the array.

7.2.6.1. Multidimensional arrays

At the surface language level, Futhark may appear to support “arrays of arrays”, and this is indeed a convenient aspect of its programming model, but at runtime multi-dimensional arrays are stored in flattened form. A value of type [x][y]i32 is laid out in memory simply as one array containing x*y integers. This means that constructing an array [x,y,z] can be (relatively) expensive if x, y, z are themselves large arrays, as they must be copied in their entirety.

Since arrays cannot contain other arrays, memory management only has to be concerned with one level of indirection. In practice, it means that Futhark can use straightforward reference counting to keep track of when to free the memory backing an array, as circular references are not possible. Further, since arrays tend to be large and relatively few in number, the usual performance impact of naive reference counting is not present.

7.2.6.2. Arrays of tuples

For arrays of tuples, Futhark uses the so-called structure of arrays representation. In Futhark terms, an array [n](a,b,c) is at runtime represented as the tuple ([n]a,[n]b,[n]c). This means that the final memory representation always consists of arrays of scalars.

This has some significant implications. For example, zip and unzip are very cheap, as the actual runtime representation is always “unzipped”, so these functions don’t actually have to do anything.
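
As an illustration (dot is a hypothetical function), the zip below does not construct an array of pairs in memory; the two component arrays are simply traversed together:

def dot [n] (xs: [n]f64) (ys: [n]f64) : f64 =
  reduce (+) 0 (map (\(x, y) -> x * y) (zip xs ys))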

Since records and sum types are represented as tuples, this also explains how arrays of these are represented.

7.2.6.3. Element order

The exact in-memory element ordering is up to the compiler, and depends on how the array is constructed and how it is used. Absent any other information, Futhark represents multidimensional arrays in row-major order. However, depending on how the array is traversed, the compiler may insert code to represent it in some other order. For particularly tricky programs, an array may even be duplicated in memory, represented in different ways, to ensure efficient traversal. This means you should normally not worry about how to represent your arrays to ensure coalesced access on GPUs or similar. That is the compiler’s job.

7.3. Crucial Optimisations

Some of the optimisations done by the Futhark compiler are important, complex, or subtle enough that it may be useful to know how they work, and how to write code that caters to their quirks.

7.3.1. Fusion

Futhark performs fusion of SOACs. For an expression map f (map g A), the compiler will optimise this into a single map with the composition of f and g, which prevents us from storing an intermediate array in memory. This is called vertical fusion or producer-consumer fusion. In this case the producer is map g and the consumer is map f.

Fusion does not depend on the expressions being adjacent as in this example, as the optimisation is performed on a data dependency graph representing the program. This means that you can decompose your programs into many small parallel operations without worrying about the overhead, as the compiler will fuse them together automatically.
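
For instance, in the (illustrative) function below the two maps can be fused, so the intermediate array bs is not stored in memory:

def incr_then_double (as: []i32) : []i32 =
  let bs = map (+ 1) as -- producer
  in map (* 2) bs       -- consumer, fused with the producer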

Not all producer-consumer relationships between SOACs can be fused. Generally, map can always be fused as a producer, but scan, reduce, and similar SOACs can only act as consumers.

Horizontal fusion occurs when two SOACs take as input the same array, but are not themselves in a producer-consumer relationship. Example:

(map f xs, map g xs)

Such cases are fused into a single operation that traverses xs just once. More than two SOACs can be involved in horizontal fusion, and they need not be of the same kind (e.g. one could be a map and the other a reduce).
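
As a concrete (illustrative) example, the following computes two reductions over xs, which the compiler can fuse into a single pass:

def sum_and_product (xs: []f64) : (f64, f64) =
  (reduce (+) 0 xs, reduce (*) 1 xs)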

7.4. Free Operations

Some operations such as array slicing, take, drop, transpose and reverse are “free” in the sense that they merely return a different view of some underlying array. In most cases they have constant cost, no matter the size of the array they operate on. This is because they are index space transformations that simply result in different code being generated when the arrays are eventually used.
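
For instance, extracting a matrix column as below (an illustrative function) normally does not copy the matrix; the transpose merely changes the index order used by the code that consumes the result:

def column [n][m] (a: [n][m]f64) (j: i64) : [n]f64 =
  (transpose a)[j]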

However, there are some cases where the compiler is forced to manifest such a “view” as an actual array in memory, which involves a full copy. An incomplete list follows:

  • Any array returned by an entry point is converted to row-major order.

  • An array returned by an if branch may be copied if its representation is substantially different from that of the other branch.

  • An array returned by a loop body may be copied if its representation is substantially different from that of the initial loop values.

  • An array is copied whenever it becomes the element of another multidimensional array. This is most obviously the case for array literals ([x,y,z]), but also for map expressions where the mapped function returns an array.

Note that this notion of “views” is not part of the Futhark type system - it is merely an implementation detail. Strictly speaking, all functions that return an array (e.g. reverse) should be considered to have a cost proportional to the size of the array, even if that cost will almost never actually be paid at run-time. If you want to be sure no copy takes place, it may be better to explicitly maintain tuples of indexes into some other array.

7.5. Small Arrays

The compiler assumes arrays are “large”, which for example means that operations across them are worth parallelising. It also means they are boxed and heap-allocated, even when the size is a small constant. This can cause unexpectedly bad performance when using small constant-size arrays (say, five elements or less). Consider using tuples or records instead. This post contains more information on how and why. If in doubt, try both and measure which is faster.
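
For example (the vec3 type and function are hypothetical), a three-component vector is usually better modelled as a tuple than as a [3]f64 array:

type vec3 = (f64, f64, f64)

def vec3_add ((a, b, c): vec3) ((x, y, z): vec3) : vec3 =
  (a + x, b + y, c + z)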

7.6. Inlining

The compiler currently inlines all functions at their call site, unless they have been marked with the noinline attribute (see Attributes). This can lead to code explosion, which mostly results in slow compile times, but can also affect run-time performance. In many cases this is currently unavoidable, but sometimes the program can be rewritten such that instead of calling the same function in multiple places, it is called in a single place, in a loop. E.g. we might rewrite f x (f y (f z v)) as:

loop acc = v for a in [z,y,x] do
  f a acc