Futhark 0.21.13 released

Posted on July 7, 2022 by Troels Henriksen

Mighty software isn’t released, rather it escapes, and so Futhark 0.21.13 just did (full changelog here). In contrast to most of the releases since the last release I bothered writing a blog post about, this one has important user-visible improvements. I will merely summarise most of the changes since the last announcement, either because they are small or because I already wrote about them, but one of them (an entirely new backend for vectorised CPU code) deserves a longer exposition.

The summarised changes are as follows.

Entirely new fusion engine. Fusion of parallel loops is perhaps the single most important optimisation in the entire Futhark compiler, and one of the first we implemented. The fusion algorithm was always conceived of as a graph reduction algorithm (as described in our very first paper), but the actual implementation was based on two traversals of the AST, with the dataflow graph existing only implicitly. It worked well enough to remain around for years, but it was difficult to understand how the algorithm corresponded to the code, impossible to extend it in certain ways, and had a few bugs that were difficult to fix.

As their master’s project at DIKU, Amar Topalovic and Walter Restelli-Nielsen replaced the old fusion engine with a new one that implements the same algorithm, but by explicitly transforming a graph representation of the program based on a set of rules. While this also fixes a few bugs, the main improvement is to serve as a good foundation for further improvements. The impact on programs is negligible, as a central goal was to reproduce the behaviour of the old fusion engine, but it’s quite impressive that two students managed to rebuild a critical part of the Futhark compiler as part of their master’s thesis.
Exposing the structure of records in the external API, as discussed in a previous blog post.
Deprecation of using == on arrays. This is because we are working on a language improvement that we have code-named AUTOMAP because silly names helps keep you motivated, but which is more commonly known as vectorisation or broadcasting. The idea is to allow functions to apply to arrays without having to use map explicitly; for example allowing programmers to write A + B instead of map2 (map2 (+)) A B when adding two matrices. Unfortunately, since == is an operator like any other, this means that its behaviour will change from returning a boolean indicating whether two arrays are equal, to returning a boolean array indicating the element-wise equality of the arrays. If desired, this array can then be passed to reduce to see if all the components are true.
Migration of sequential code to GPU. Another master’s thesis project, this one by Philip Børgesen. This is a GPU pipeline optimisation pass that migrates non-parallel code to the GPU in order to reduce CPU-GPU communication. This is a known GPU programming technique, but Futhark until now used a fairly naive compilation strategy of running all parallel code on the GPU, and all surrounding sequential code on the CPU. This requires moving data back and forth over the PCIe bus as well as expensive synchronisation. Even though the GPU isn’t great at running purely sequential code, it can still outperform the CPU for small nuggets of computation when the data is already present on the GPU.

This summary doesn’t truly express how extremely thorough Philip’s work was, which among other things involved designing a variant of the Ford-Fulkerson max flow algorithm to minimise CPU-GPU communication. See his thesis for the full details and a performance evaluation.

A new backend

This Futhark release comes with a new parallel compiler backend courtesy of the students Pema Malling, Louis Marott Normann, Oliver Petersen, and Kristoffer Kortbæk, who implemented it for their bachelor’s thesis. (See a pattern? Also, thesis here.) This backend is an extension of the multicore CPU backend that Minh Duc Tran implemented (as his master’s thesis, as you might by now expect) to make better use of the SIMD instructions that are supported by modern CPUs.

To understand how the new backend works, first let’s recap how the existing multicore backend works in simplified terms. Ultimately the question is how to execute in parallel a small collection of built-in parallel primitives, roughly corresponding to generalised variants of the source-language functions map, reduce, scan, and a few others. Consider map as the simplest case. Every element is independent of the others, so in principle you could start a new thread for each one in order to maximise parallelism. In practice, the overhead of such maximal parallelism is not worth it. Your CPU does not have a million cores - it has perhaps 16. So instead, we split the array being maped into a number of chunks, assign a chunk to each CPU core, after which each thread performs a sequential map on its chunk.

Even though we have now decided not to exploit thread parallelism inside each chunk, we are semantically still dealing with map, and each iteration is still independent. Modern CPUs have SIMD instructions that allow data parallelism - performing the same operation on different data - which we can use to execute the otherwise sequential maps. The performance advantage of using SIMD instructions are very context-dependent, but at a high level, it seems clearly better to have one instruction working on multiple pieces of data at once. In principle, an SIMD addition instruction operating on vectors of four numbers each should be four times faster than having to execute four distinct addition instructions.

In principle, a C compiler could automatically vectorise the loops produced by the existing multicore backend and make them use SIMD instructions. In practice, auto-vectorisation is quite fragile, so it doesn’t happen except for very simple cases. Ironically, I’m not a big fan of black-box compiler optimisations, especially when they are fragile. Thus the core idea of the new backend: instead of processing each chunk sequentially within a thread, uses SIMD instructions to further squeeze out a bit of performance.

Another person who doesn’t much like auto-vectorisation is Matt Pharr. In fact, he found auto-vectorisation so fragile that he invented a new programming language called ISPC in order to make vectorisation more reliable. ISPC is a low-level data-parallel language like CUDA, in which multiple program instances are active simultaneously, and where vectorisation cannot fail. It is a property of the programming model, rather than a black-box compiler optimisation. The new Futhark backend, tellingly named ispc, generates ISPC code for the per-chunk loops (the sequential maps). This makes it the responsibility of the ISPC compiler itself to select the right SIMD instructions for a given CPU - this is the kind of stuff we really don’t have the resources to maintain ourselves. One disadvantage is that ISPC isn’t really intended as a compilation target, but students fortunately have high pain tolerance, so they managed to make it work - even very exotic stuff, such as dynamic bounds checking.

So what is performance like? It’s a bit unpredictable. ISPC is very sensitive to static memory access patterns and control flow, and the students solely focused on backend work. The input code they are compiling is produced by the existing multicore pipeline, and is not at all optimised for SIMD execution in terms of memory layout and nested parallelism. On mandelbrot I get about 1.67x speedup on my laptop with a Ryzen 4750 CPU compared to the multicore backend. Not much, but it’s free performance - all it requires is passing --backend=ispc to the Futhark compiler.

On crystal I get a speedup of 6.78x, which is far more than I would expect. The probable cause is that ISPC uses CPU intrinsics for trigonometric operations such as sines and cosines, while a C compiler will normally make a function call to the math library.

On Barnes-Hut n-body, the ispc backend sees a slowdown of 0.62x compared to multicore. This is because the core of this program is an irregular romp through a tree structure, which is poorly suited to SIMD execution - or at least the kind of SIMD code we currently generate.

We hope to improve these results by adding dedicated optimisation passes to the compilation pipeline used for the ispc backend. If you’re a student, feel free to hit us up!

The fruits of forced labour

Exploiting students as a source of free labour has long been part of the Futhark development process. It’s also not unusual in academia in general, and jokes about dusty decks, developed by generations of students, are common. The punchline tends to be about the code quality of these systems, which tend to be monstrosities developed by accretion as each contributor added whatever they required for their immediate needs, without any consideration for overall design. Many research compilers are exactly like this, and ever since we started development of Futhark, I was terrified that we would end up like that.

While Futhark has received many contributions from students, I think we have managed to avoid these problems. The overall code quality of the compiler is still pretty high, judging by the ease with which we can still make changes and the relatively low bug density quantity. Further, the bugs we do have tend to be localised logic errors, rather than unintended interactions between distinct components. I have a few guesses for how we managed this:

Take software engineering seriously. The Futhark compiler is cleanly designed as a frontend, middle-end, and back-end. Interaction between these parts is through value-oriented interfaces. There is no mysterious hidden state in the form of global symbol tables. Compiler passes are functions from input program to output program. If you prettyprint any of these intermediate programs, what you see is the full state of the compiler. This makes it pretty easy to define new passes without having to modify much of the compiler, or to remove ones if they turn out to be broken. Only changes to the intermediate representation, which are rare, require non-local changes.
Be honest about the difficulty. While we are certainly actively recruiting students (it’s free labour!) we are honest about how difficult it is to work on a nontrivial compiler. We are not interested in having students at all costs. Some students aren’t up for doing applied projects at this level of complexity, and that’s fine.
Say no. Even if a student manages to complete a project, and even if they get a good grade, it is not a given that their code will make it into the compiler. We do not expect students to stick around to maintain their contributions, so we only merge code that we believe we can maintain long-term on our own. As an example, a student has been experimenting with supporting multi-GPU execution in the CUDA backend, which has not yet been merged. This is not because the student did bad work, but because the problem is inherently really hairy, requires modifications to our GPU runtime system, and it is not yet clear that this approach is the right one in the long run.
Treat the student as an open source contributor. For this latest run of projects, I asked the students to make draft pull requests quite early on. This let me do systematic code reviews (and fixes), correct misunderstandings, and make sure that the students didn’t go too far down incorrect paths. Having regular supervision meetings about concepts and ideas is well and good, but actually reviewing concrete code line-by-line helped this last batch of projects have even higher code quality than usual.
Be lucky. Maybe we just got really lucky this last semester. In that case, let’s just hope that we stay lucky.