Programming arm64: Fibonacci

Writing assembly code marks my happiest programming days. It's a wild sport. You talk to the machine directly by composing a sequence of instructions. No one tells you how to layout your data, no one can take goto away from you. You are free to do whatever you like.

These days I find myself longing for the good old fun. Somewhere in the deep corner of my desk drawer lies a raspberry pi. It's a 3b+ model I bought years ago, featuring a 64-bit quadcore 1.4GHz ARM cpu and a relatively small RAM of 1GB. It's tiny, cheap, quiet, a perfect playground for assembly programming.

First things first, I need to get a 64bit OS. The official OS Raspbian is 32-bit only. Fortunately Ubuntu Mate 18.04 for Pi has a 64-bit flavor. Its desktop is slow as snail, but as long as the terminal works all right, I can't complain. To have a stable performance, we need to set cpu frequency at a fixed value. The performance numbers in the following sections are all based on 600MHz.

Now let's start with a little exercise to get familiar with arm64 assembly programming. Fibonacci sequence is a simple example that frequently turns up in benchmarks, so I will use it as warm up.

When writing assembly, you almost always want to document first what you are going to do. At least you need to have a clear big picture. The fibonacci sequence is defined as:

fib(n): n=0? -> 0 n=1? -> 1 else -> fib(n-1) + fib(n-2)

It's straight forward to translate the pseudo code above into arm64 code. I won't pretend it is easy though. You must spend some time going through the arm64 instruction set.

fib: cmp x0,1 ble 1f stp x29,x30,[sp,-16]! sub x29,x0,2 sub x0,x0,1 bl fib mov x1,x29 mov x29,x0 mov x0,x1 bl fib add x0,x0,x29 ldp x29,x30,[sp],16 1: ret

When calling the code above, fib(40) takes about 6.10 seconds to complete, with 54.3 million calls per second. Half of the calls takes the quick path, containing only 3 instructions, the other half takes the longer path of 13 instructions. Roughly each call takes about 11 cpu cycles. So most of the time our Pi is able to execute 1 instruction per cycle, possibly with 1 extra cycle for memory access and jumps. However, modern CPU behavior is really hard to predict, so this result may not apply to other programs.

Let's write the same code in C:

long fib(long n) { if (n > 1) return fib(n-1)+fib(n-2); else return n; } Compiled with gcc -O2, it takes 5.34 seconds. I am devastated. How on earch is my hand rolled assembly slower than compiled code! After a closer examination of its output, which is too bloated to be included here, I find that the smart ass is basically computing fib(n) = fib(n-1) + fib(n-3) + fib(n-5) + ... and thus skipping almost half of the calls. It's like taking a shortcut in a marathon match. Well, if they want to play fancy tricks, it's only fair that we do the same: fib: cmp x0,1 ble 1f stp x27,x28,[sp,-32]! str x30,[sp,16] mov x28,0 // s <- 0 sub x27,x0,1 // i <- n-1 2: mov x0,x27 bl fib add x28,x28,x0 // s <- s + fib(i) subs x27,x27,2 // i <- i - 2 bgt 2b // i > 0? cinc x0,x28,eq ldr x30,[sp,16] ldp x27,x28,[sp],32 1: ret

The code above takes 3.80 seconds. 30% speed up is not a bad investment.

At the end, let's write a sane version that can finish the work instantly.

fib(n): n=0? -> 0 n=1? -> 1 n=2? -> 1 else -> 2*fib(n-2)+fib(n-3) fib: cmp x0,2 ble 1f stp x29,x30,[sp,-16]! mov x29,x0 sub x0,x0,2 bl fib mov x1,x29 add x29,x0,x0 sub x0,x1,3 bl fib add x0,x0,x29 ldp x29,x30,[sp],16 ret 1: blt 1f mov x0,1 1: ret Another way is formatting fib() into tail recursive call, which can be further turned into a loop: fib(n)=fib1(0,1,0,n) fib1(x,y,i,n): i=n? -> x else -> fib1(y,x+y,i+1,n) The corresponding assembly is: fib: mov x3,x0 mov x0,0 mov x1,1 mov x2,0 fib1: cmp x2,x3 beq 1f add x4,x0,x1 mov x0,x1 mov x1,x4 add x2,x2,1 b fib1 1: ret

Nice and easy, isn't it? An idle CPU is cool but not fun. In days ahead, we should try more challenging tasks on our little Pi.