7. Writing Fast Futhark Programs¶
This document contains tips, tricks, and hints for writing efficient Futhark code. Ideally you’d need to know nothing more than an abstract cost model, but sometimes it is useful to have an idea of how the compiler will transform your program, what values look like in memory, and what kind of code the compiler will generate for you. These details are documented below. Don’t be discouraged by the complexities mentioned here - most Futhark programs are written without worrying about any of these details, and they still manage to run with good performance. This document focuses on corner cases and pitfalls, which easily makes for depressing reading.
7.1. Parallelism¶
The Futhark compiler only generates parallel code for explicitly parallel constructs such as map and reduce. A plain loop will not result in parallel code (unless the loop body itself contains parallel operations). The most important parallel constructs are the second-order array combinators (SOACs) such as map and reduce, but functions such as copy are also parallel.
When describing the asymptotic cost of a Futhark function, it is not enough to give a traditional big-O measure of the total amount of work. Both foldl and reduce involve O(n) work, where n is the size of the input array, but foldl is sequential while reduce is parallel, and this is an important distinction. To capture this distinction, each function is described by two costs: the work, which is the total amount of operations, and the span (sometimes called depth), which is intuitively the “longest chain of sequential dependencies”. We say that foldl has span O(n), while reduce has span O(log(n)). This is why reduce is more parallel than foldl. The documentation for a Futhark function should mention both its work and span. See this for more details on parallel cost models and pointers to literature.
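To make the distinction concrete, both of the following functions compute the same sum with O(n) work, but only the second has logarithmic span (a minimal sketch; the function names are ours):

-- Sequential: O(n) span, compiled to a plain loop.
def sum_seq (xs: []i32) : i32 = foldl (+) 0 xs

-- Parallel: O(log(n)) span, compiled to a parallel reduction.
def sum_par (xs: []i32) : i32 = reduce (+) 0 xs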
7.1.1. Scans and reductions¶
The scan and reduce SOACs are rather inefficient when their operators are on arrays. If possible, use tuples instead (see Small Arrays). The one exception is when the operator is a map2 or equivalent. Example:
reduce (map2 (+)) (replicate n 0) xss
Such “vectorised” operators are typically handled quite efficiently.
To be on the safe side, though, you can rewrite the above by interchanging the reduce and the map:
map (reduce (+) 0) (transpose xss)
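As a sketch of the advice to prefer tuples, a reduction over an array of pairs can use a tuple-typed operator instead of an operator on length-2 arrays (the function name is ours):

-- Component-wise sum of an array of pairs; the tuple operator is
-- handled efficiently, unlike an operator on length-2 arrays.
def sum_pairs [n] (xs: [n](i32, i32)) : (i32, i32) =
  reduce (\(a1, b1) (a2, b2) -> (a1 + a2, b1 + b2)) (0, 0) xs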
Avoid reductions over tiny arrays, e.g. reduce (+) 0 [x,y,z]. In such cases the compiler will generate complex code to exploit a negligible amount of parallelism. Instead, just unroll the loop manually (x+y+z) or perhaps use foldl (+) 0 [x,y,z], which produces a sequential loop.
7.1.2. Histograms¶
The reduce_by_index
construct (“generalised histogram”) has a
clever and adaptive implementation that handles multiple updates of
the same bin efficiently. Its main weakness is when computing a very
large histogram (many millions of bins) where only a tiny fraction of
the bins are updated. This is because the main mechanism for
optimising conflicts is by duplicating the histogram in memory, but
this is not efficient when it is very large. If you know your program
involves such a situation, it may be better to implement the histogram
operation by sorting and then performing an irregular segmented
reduction.
Particularly with the GPU backends, reduce_by_index
is much faster
when the operator involves a single 32-bit or 64-bit value. Even if
you really want an 8-bit or 16-bit result, it may be faster to compute
it with a 32-bit or 64-bit type and manually mask off the excess bits.
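As an illustration of this advice, here is a hypothetical sketch that counts occurrences with 32-bit addition and only truncates to u8 at the end (the function name is ours):

-- Count how often each index in [0,k) occurs in 'is', using i32
-- addition internally and truncating to u8 only at the very end.
def hist_count [n] (k: i64) (is: [n]i64) : [k]u8 =
  map u8.i32 (reduce_by_index (replicate k 0) (+) 0 is (replicate n 1))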
7.1.3. Nested parallelism¶
Futhark allows nested parallelism, understood as a parallel construct used inside some other parallel construct. The simplest example is nested SOACs. Example:
map (\xs -> reduce (+) 0 xs) xss
Nested parallelism is allowed and encouraged, but its compilation to efficient code is rather complicated, depending on the compiler backend that is used. The problem is that sometimes exploiting all levels of parallelism is not optimal, yet how much to exploit depends on run-time information that is not available to the compiler.
7.1.3.1. Sequential backends¶
The sequential backends are straightforward: all parallel operations are compiled into sequential loops. Due to Futhark’s low-overhead data representation (see below), this is often surprisingly efficient.
7.1.3.2. Multicore backend¶
Whenever the multicore backend encounters nested parallelism, it generates two code versions: one where the nested parallel constructs are also parallelised (possibly recursively involving further nested parallelism), and one where they are turned into sequential loops. At runtime, based on the amount of work available and self-tuning heuristics, the scheduler picks the version that it believes best balances overhead with exploitation of parallelism.
7.1.3.3. GPU backends¶
The GPU backends handle parallelism through an elaborate program transformation called incremental flattening. The full details are beyond the scope of this document, but some of its properties are useful to know. See this paper for more details.
The main restriction is that the GPU backends can only handle regular nested parallelism, meaning that the sizes of inner parallel dimensions are invariant to the outer parallel dimensions. For example, this expression contains irregular nested parallelism:
map (\i -> reduce (+) 0 (iota i)) is
This is because the size of the nested parallel construct is i, and i has a different value for every iteration of the outer map. The Futhark compiler will currently turn the irregular constructs (here, the reduce) into a sequential loop. Depending on how complicated the irregularity is, it may even refuse to generate code entirely.
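One way to recover regularity in the example above is to pad every inner computation to a common invariant size, assuming an upper bound m on the values in is (a sketch; the names are ours):

-- Each inner reduction now has the invariant size m; indices past i
-- contribute 0, so the result is unchanged.
def sums [n] (m: i64) (is: [n]i64) : [n]i64 =
  map (\i -> reduce (+) 0 (map (\j -> if j < i then j else 0) (iota m))) is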
Incremental flattening is based on generating multiple code versions to cater to different classes of datasets. At run-time, one of these versions will be picked for execution by comparing properties of the input (its size) with a threshold parameter. These threshold parameters have sensible defaults, but for optimal performance, they can be tuned with futhark-autotune.
7.2. Value Representation¶
The compiler discards all type abstraction when compiling. Using the module system to make a type abstract causes no run-time overhead.
7.2.1. Scalars¶
Scalar values (i32, f64, bool, etc.) are represented as themselves. The internal representation does not distinguish signs, so i32 and u32 have the same representation, and converting between types that differ only in sign is free.
7.2.2. Tuples¶
Tuples are flattened and then represented directly by their individual components - there are no tuple objects at runtime. A function that takes an argument of type (f64,f64) corresponds to a C function that takes two arguments of type double. This has one performance implication: whenever you pass a tuple to a function, the entire tuple is copied (except any embedded arrays, which are always passed by reference, see below). Due to the compiler’s heavy use of inlining, this is rarely a problem in practice, but it can be a concern when using the loop construct with a large tuple as the loop variant parameter.
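For instance, a function such as the following (a hypothetical sketch) receives its argument as two separate scalars at run-time; no tuple object ever exists:

-- At run-time this corresponds to a C function taking two doubles.
def hypot2 ((x, y): (f64, f64)) : f64 = f64.sqrt (x*x + y*y)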
7.2.3. Records¶
Records are turned into tuples by simply sorting their fields and discarding the labels. This means there is no overhead to using a record compared to using a tuple.
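For example, the following type (a sketch) is represented exactly like the tuple (f64, f64), with the fields sorted by name so that x comes first:

-- Same run-time representation as (f64, f64), regardless of the
-- order in which the fields are written.
type point = {y: f64, x: f64}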
7.2.4. Sum Types¶
A sum type value is represented as a tuple containing all the payload components in order, prefixed with an i8 tag to identify the constructor. For example,
#foo i32 bool | #bar i32
would be represented as a tuple of type
(i8, i32, bool, i32)
where the value
#foo 42 false
is represented as
(1, 42, false, 0)
where #foo is assigned the tag 1 because it is alphabetically after #bar.
To shrink the tuples, if multiple constructors have payload elements of the same type, the compiler assigns them to the same elements in the result tuple. The representation of the above sum type is actually the following:
(i8, i32, bool)
The types must be the same for deduplication to take place - despite i32 and f32 being of the same size, they cannot be assigned the same tuple element. This means that the type
#foo [n]i32 | #bar [n]i32
is efficiently represented as
(i8, [n]i32)
but
#foo [n]i32 | #bar [n]f32
becomes
(i8, [n]i32, [n]f32)
which is not great. Take caution when you use sum types with large arrays in their payloads.
7.2.5. Functions¶
Higher-order functions are implemented via defunctionalisation. At run-time, they are represented by a record containing their lexical closure. Since the type system forbids putting functions in arrays, this is essentially a constant cost, and not worth worrying about.
7.2.6. Arrays¶
Arrays are the only Futhark values that are boxed - that is, are stored on the heap.
The elements of an array are unboxed, stored adjacent to each other in memory. There is zero memory overhead except for the minuscule amount needed to track the shape of the array.
7.2.6.1. Multidimensional arrays¶
At the surface language level, Futhark may appear to support “arrays of arrays”, and this is indeed a convenient aspect of its programming model, but at runtime multi-dimensional arrays are stored in flattened form. A value of type [x][y]i32 is laid out in memory simply as one array containing x*y integers. This means that constructing an array [x,y,z] can be (relatively) expensive if x, y, z are themselves large arrays, as they must be copied in their entirety.
Since arrays cannot contain other arrays, memory management only has to be concerned with one level of indirection. In practice, it means that Futhark can use straightforward reference counting to keep track of when to free the memory backing an array, as circular references are not possible. Further, since arrays tend to be large and relatively few in number, the usual performance impact of naive reference counting is not present.
7.2.6.2. Arrays of tuples¶
For arrays of tuples, Futhark uses the so-called structure-of-arrays representation. In Futhark terms, an array [n](a,b,c) is at runtime represented as the tuple ([n]a,[n]b,[n]c). This means that the final memory representation always consists of arrays of scalars.
This has some significant implications. For example, zip and unzip are very cheap, as the actual runtime representation is always “unzipped”, so these functions don’t actually have to do anything.
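As a small illustration, the zip below is free, because the array of pairs is already stored as two separate arrays (a sketch; the function name is ours):

-- zip merely re-labels the two underlying arrays; no data is moved.
def dot [n] (xs: [n]f64) (ys: [n]f64) : f64 =
  reduce (+) 0 (map (\(x, y) -> x * y) (zip xs ys))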
Since records and sum types are represented as tuples, this also explains how arrays of these are represented.
7.2.6.3. Element order¶
The exact in-memory element ordering is up to the compiler, and depends on how the array is constructed and how it is used. Absent any other information, Futhark represents multidimensional arrays in row-major order. However, depending on how the array is traversed, the compiler may insert code to represent it in some other order. For particularly tricky programs, an array may even be duplicated in memory, represented in different ways, to ensure efficient traversal. This means you should normally not worry about how to represent your arrays to ensure coalesced access on GPUs or similar. That is the compiler’s job.
7.3. Crucial Optimisations¶
Some of the optimisations done by the Futhark compiler are important, complex, or subtle enough that it may be useful to know how they work, and how to write code that caters to their quirks.
7.3.1. Fusion¶
Futhark performs fusion of SOACs. Given an expression map f (map g A), the compiler will optimise it into a single map with the composition of f and g, which avoids storing an intermediate array in memory. This is called vertical fusion or producer-consumer fusion. In this case the producer is map g and the consumer is map f.
Fusion does not depend on the expressions being adjacent as in this example, as the optimisation is performed on a data dependency graph representing the program. This means that you can decompose your programs into many small parallel operations without worrying about the overhead, as the compiler will fuse them together automatically.
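As a sketch, the two maps below are fused into a single pass over xs, so the intermediate array ys never exists in memory (the function name is ours):

-- Fused into one map; no memory traffic for ys.
def incr_double [n] (xs: [n]i32) : [n]i32 =
  let ys = map (+ 1) xs
  in map (\y -> y * 2) ys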
Not all producer-consumer relationships between SOACs can be fused. Generally, map can always be fused as a producer, but scan, reduce, and similar SOACs can only act as consumers.
Horizontal fusion occurs when two SOACs take as input the same array, but are not themselves in a producer-consumer relationship. Example:
(map f xs, map g xs)
Such cases are fused into a single operation that traverses xs just once. More than two SOACs can be involved in horizontal fusion, and they need not be of the same kind (e.g. one could be a map and the other a reduce).
7.4. Free Operations¶
Some operations such as array slicing, take, drop, transpose and reverse are “free” in the sense that they merely return a different view of some underlying array. In most cases they have constant cost, no matter the size of the array they operate on. This is because they are index space transformations that simply result in different code being generated when the arrays are eventually used.
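For instance, the following function (a sketch; the name is ours) moves no data at run-time - the reverses and the take merely change how xs is subsequently indexed:

-- Extracts the last k elements in order; the reverses and the take
-- are index-space transformations, so no copying takes place.
def last_elems [n] (k: i64) (xs: [n]i32) : [k]i32 =
  reverse (take k (reverse xs))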
However, there are some cases where the compiler is forced to manifest such a “view” as an actual array in memory, which involves a full copy. An incomplete list follows:
- Any array returned by an entry point is converted to row-major order.
- An array returned by an if branch may be copied if its representation is substantially different from that of the other branch.
- An array returned by a loop body may be copied if its representation is substantially different from that of the initial loop values.
- An array is copied whenever it becomes the element of another multidimensional array. This is most obviously the case for array literals ([x,y,z]), but also for map expressions where the mapped function returns an array.
Note that this notion of “views” is not part of the Futhark type system - it is merely an implementation detail. Strictly speaking, all functions that return an array (e.g. reverse) should be considered to have a cost proportional to the size of the array, even if that cost will almost never actually be paid at run-time. If you want to be sure no copy takes place, it may be better to explicitly maintain tuples of indexes into some other array.
7.5. Small Arrays¶
The compiler assumes arrays are “large”, which for example means that operations across them are worth parallelising. It also means they are boxed and heap-allocated, even when the size is a small constant. This can cause unexpectedly bad performance when using small constant-size arrays (say, five elements or less). Consider using tuples or records instead. This post contains more information on how and why. If in doubt, try both and measure which is faster.
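For example, a small 3D vector is likely better represented as a record of scalars than as a [3]f64 (a sketch; the names are ours):

-- Stays unboxed at run-time, unlike a heap-allocated [3]f64 array.
type vec3 = {x: f64, y: f64, z: f64}

def vadd (a: vec3) (b: vec3) : vec3 =
  {x = a.x + b.x, y = a.y + b.y, z = a.z + b.z}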
7.6. Inlining¶
The compiler currently inlines all functions at their call site,
unless they have been marked with the noinline
attribute (see
Attributes). This can lead to code explosion, which mostly
results in slow compile times, but can also affect run-time
performance. In many cases this is currently unavoidable, but
sometimes the program can be rewritten such that instead of calling
the same function in multiple places, it is called in a single place,
in a loop. E.g. we might rewrite f x (f y (f z v))
as:
loop acc = v for a in [z,y,x] do
f a acc