# A case study in parallelisation: Advent of Code 2022, day 9

I am currently participating in the Advent of
Code, with the goal of solving every task
in Futhark. I did this once
before,
but I like to think that both Futhark and myself have improved since
then. Inspired by Snektron’s solutions from last
year I’m trying to do *everything*
in Futhark - including the input parsing - and also to come up with
data-parallel solutions, even when the problem is not really suited
for it. I doubt Advent of Code considers parallelism to be a goal
when they come up with their problems, so there will certainly be some
that have to be solved sequentially, but so far I’ve managed to
parallelise all except one. I will do a proper retrospective when
(if) I finish, but in this post I want to take a closer look at the
day 9 problem. It is a problem
that at first glance looks very sequential, and to be honest is
probably *best* solved using a sequential algorithm, but a
work-efficient
algorithm actually turns out to be possible.

### Problem statement

I’ll skip the elf metaphors and go straight to the problem. We are
simulating two points in a discrete 2D grid, the *head* and the
*tail*. These two points must never be more than one grid cell apart,
but they *may* occupy the same cell. Whenever the head moves, the
tail must also be moved appropriately, to make sure it is still
touching. Examples from the problem statement:

```
..... ..... .....
.TH.. -> .T.H. -> ..TH.
..... ..... .....
... ... ...
.T. .T. ...
.H. -> ... -> .T.
... .H. .H.
... ... ...
..... ..... .....
..... ..H.. ..H..
..H.. -> ..... -> ..T..
.T... .T... .....
..... ..... .....
..... ..... .....
..... ..... .....
..H.. -> ...H. -> ..TH.
.T... .T... .....
..... ..... .....
```

Given an initial position of the head and tail, we are then given a sequence of head movements, and asked to simulate the corresponding movement of the tail. The final answer requested by the problems statement is the number of distinct coordinates touched by the tail, but the interesting part is the simulation itself.

### Analysis

At first glance, this looks pretty sequential, like most time-stepped
simulations. However, there is hope: it is straightforward to, in
parallel, compute all the intermediate positions occupied by the head,
given a list of movements. This is because the head movement is just
addition of 2D coordinate points, which is
associative, and
so we can compute them with a scan. For example, we represent a
movement upwards as `(0,1)`

. In Futhark we can implement it like
this:

```
type pos = (i32,i32)
def pos_add (a: pos) (b: pos) = (a.0 + b.0, a.1 + b.1)
def coords moves = scan pos_add (0,0) moves
```

Then we observe that after every movement of the head and corresponding adjustment of the tail, the position of the tail can be described relative to the head - and there are only nine possible states (the eight neighbouring grid cells and the cell containing the head itself). If we can only compute the position of the tail relative to that of the head, for every step of the simulation, then we can find the absolute positions of the tail.

Okay, so how do we do that? It turns out that the relative position
of a tail *after* a move depends on the movement of the head and the
relative position of the tail *before* a move. For example:

If the tail overlaps the head and we move the head right, then the new position of the tail is to the left.

`.... .... .H.. -> .TH. .... ....`

If the tail overlaps the head and we move the head left, then the new position of the tail is to the right.

`.... .... .H.. -> .HT. .... ....`

If the tail is down-left from the head and we move the head right, then the new position of the tail is to the left.

`..... ..... ..H.. -> ..TH. .T... ..... ..... .....`

We can imagine implementing this as some function full of control flow:

```
type dir = #U | #D | #L | #R | #UL | #UR | #DL | #DR | #C
def step (head_mov: dir) (tail_rel: dir) : dir =
match (head_mov, tail_rel)
case (#R, #C) -> #L
case (#L, #C) -> #R
case (#R, #DL) -> #L
...
```

The `tail_rel`

is the state of our simulation, while the `head_mov`

the head motion of the next timestep (which does not depend on the
state). We can simulate the movement of the tail simply by applying
this `step`

function repeatedly. Essentially we have a sequential
loop where this function is the loop body. Where is the parallelism?

Now, there is a trick that allows you to “parallelise” any sequential
loop. It’s based on the observation that we can view the body of a
loop as a function from one loop state to another (like the `step`

function above), and if we construct a list of the functions
corresponding to each iteration of the loop, we can *compose* all
those functions in parallel, as function composition is an associative
operator. The result of such a reduction will be a single function
from initial state to final state, while the result of a *scan* will
be a list of functions from initial state to corresponding
intermediate states - which is what we want to get the intermediate
tails.

The reason this doesn’t *actually* let us naively parallelise any
sequential loop is that the result of function composition is really
just the two functions concatenated. The result of actually performing
such a reduction with composition is essentially just a big unrolled
loop - the final function is as sequential and as costly as the
original loop was. *But* if we can come up with a different
representation for our function, one that allows a more efficient
definition of composition, then we might have a chance.

### The trick

The `step`

function above has a quite small domain - the `dir`

type
has only 9 possible values, for a total of *81* possible inputs.
Further, the `head_mov`

argument does not depend on the simulation
state, but simply on which time step we are in. This means we can
really think of *step* as a family of 9 functions of type `dir -> dir`

- a function with a domain of size *9*. Such a function can
easily be represented as a lookup table! A table that given a
current relative position produces the new relative tail position.

Let us number the potential relative positions from 0 to 8:

```
0 1 2
3 4 5
6 7 8
```

A value of 0 means *upper left* (`#UL`

in the `dir`

type) and a value
of *4* means *center* (`#C`

).

For an *upwards* movement of the head, we can then explain the
movement of the tail with the following table:

```
3 4 5
6 7 8
7 7 7
```

For example, if the tail was previously *down and to the right*,
corresponding to position 8 in the first table, we look at the
bottom-right corner in the last table and find a 7 - meaning that the
tail will now be directly below the head.

Since each entry of these tables can be represented in 4 bits, we can use a 64-bit integer to store a table:

`type perm = u64`

The name `perm`

is because it is a permutation-like structure,
although it is strictly speaking not a permutation.

We define a small utility function for converting integer arrays to this more efficient representation - uses of this function will fold away any array literals:

```
def tovec (xs: [9]u64) : perm =
let f i = xs[i32.u64 i]<<(i*4)
in f 0 | f 1 | f 2 | f 3 | f 4 | f 5 | f 6 | f 7 | f 8
```

And a function for looking up a numbered field in one of these tables:

```
def perm_lookup (i: u64) (x: perm) : u64 =
4)) & 0b1111 (x >> (i *
```

Then we can define a function that, given a head direction, produces the corresponding table.

```
def perm (d: dir): perm =
match d case #U -> tovec [3,4,5, 6,7,8, 7,7,7]
case #D -> tovec [1,1,1, 0,1,2, 3,4,5]
case #L -> tovec [1,2,5, 4,5,5, 7,8,5]
case #R -> tovec [3,0,1, 3,3,4, 3,6,7]
case #UL -> tovec [4,5,5, 7,8,5, 7,7,8]
case #DL -> tovec [1,1,2, 1,2,5, 4,5,5]
case #UR -> tovec [3,3,4, 3,6,7, 6,7,7]
case #DR -> tovec [0,1,1, 3,0,1, 3,3,4]
case #C -> tovec [0,1,2, 3,4,5, 6,7,8]
```

And finally, the crucial piece - composition of `perm`

s. This is done
exactly the same way you compose permutations (and is the reason I
named the type `perm`

):

```
def perm_compose (a: perm) (b: perm): perm =
0 a) b,
tovec [perm_lookup (perm_lookup 1 a) b,
perm_lookup (perm_lookup 2 a) b,
perm_lookup (perm_lookup 3 a) b,
perm_lookup (perm_lookup 4 a) b,
perm_lookup (perm_lookup 5 a) b,
perm_lookup (perm_lookup 6 a) b,
perm_lookup (perm_lookup 7 a) b,
perm_lookup (perm_lookup 8 a) b] perm_lookup (perm_lookup
```

That’s basically it. We can now compute a function for each intermediate step simply by saying

` map perm dirs |> scan perm_compose (perm #C)`

where we use `perm #C`

as the neutral element (when the head does not
move, neither does the tail). The end result is an array of
functions, represented as an array of `perm`

s.

The full program has a bunch more code to then go from the relative tail positions to the absolute positions, but it’s embarrassingly parallel and not particularly interesting.

This trick, of coming up with a bespoke representation of a small-domain function and using that to parallelise an otherwise sequential loop, is known to the literature before, but it is the first time I have had a chance to try it myself. I’m quite pleased with how well it turned out.