Writing assembly code marks my happiest programming days. It's a wild sport. You talk to the machine directly by composing a sequence of instructions. No one tells you how to lay out your data, no one can take goto away from you. You are free to do whatever you like.
These days I find myself longing for the good old fun. Somewhere in a deep corner of my desk drawer lies a Raspberry Pi. It's a 3B+ model I bought years ago, featuring a 64-bit quad-core 1.4GHz ARM CPU and a relatively small 1GB of RAM. It's tiny, cheap, and quiet: a perfect playground for assembly programming.
First things first, I need to get a 64-bit OS. The official OS, Raspbian, is 32-bit only. Fortunately, Ubuntu MATE 18.04 for Pi has a 64-bit flavor. Its desktop is as slow as a snail, but as long as the terminal works all right, I can't complain. To get stable measurements, we need to pin the CPU frequency to a fixed value. The performance numbers in the following sections were all taken at 600MHz.
Now let's start with a little exercise to get familiar with arm64 assembly programming. The Fibonacci sequence is a simple example that frequently turns up in benchmarks, so I will use it as a warm-up.
When writing assembly, you almost always want to document first what you are going to do. At the very least, you need a clear big picture. The Fibonacci sequence is defined as:
It's straightforward to translate the pseudocode above into arm64 code. I won't pretend it is easy though. You must spend some time going through the arm64 instruction set.
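In recurrence form, assuming the usual base cases fib(0) = 0 and fib(1) = 1 (the call counts quoted later are consistent with a base case at n < 2):

```latex
\operatorname{fib}(n) =
\begin{cases}
  n, & n < 2 \\
  \operatorname{fib}(n-1) + \operatorname{fib}(n-2), & n \ge 2
\end{cases}
```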
When calling the code above, fib(40) takes about 6.10 seconds to complete, at 54.3 million calls per second. Half of the calls take the quick path, containing only 3 instructions; the other half take the longer path of 13 instructions. That averages 8 instructions per call, and each call takes roughly 11 CPU cycles. So most of the time our Pi is able to execute 1 instruction per cycle, possibly with 1 extra cycle for memory accesses and jumps. However, modern CPU behavior is really hard to predict, so this result may not apply to other programs.
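The arithmetic behind those figures can be checked with a few lines of C. This is only a sketch: the helper names are made up for illustration, and it assumes the recursion bottoms out at n < 2, so that the number of calls follows calls(n) = 1 + calls(n-1) + calls(n-2).

```c
/* Number of calls a naive recursive fib(n) makes, assuming the base case
   covers n < 2, so calls(0) = calls(1) = 1 and
   calls(n) = 1 + calls(n - 1) + calls(n - 2) otherwise. */
unsigned long long naive_calls(int n) {
    unsigned long long prev = 1, cur = 1; /* calls(0), calls(1) */
    for (int i = 2; i <= n; i++) {
        unsigned long long next = 1 + cur + prev;
        prev = cur;
        cur = next;
    }
    return cur;
}

/* Calls per second for a run that made `calls` calls in `seconds`. */
double calls_per_second(unsigned long long calls, double seconds) {
    return (double)calls / seconds;
}

/* CPU cycles spent per call at a fixed clock of `hz`. */
double cycles_per_call(double hz, double seconds, unsigned long long calls) {
    return hz * seconds / (double)calls;
}
```

With the numbers from the text (6.10 seconds at 600MHz), naive_calls(40) is 331,160,281, which works out to about 54.3 million calls per second and about 11 cycles per call.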
Let's write the same code in C:
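A minimal sketch of the naive recursive version, assuming the base case returns n for n < 2. The name fib_naive is mine, and the exact listing that was benchmarked may differ in details such as the integer type.

```c
#include <stdint.h>

/* Naive doubly recursive Fibonacci.
   Assumption: fib(0) = 0 and fib(1) = 1, i.e. the base case is n < 2. */
uint64_t fib_naive(uint64_t n) {
    if (n < 2)
        return n;                               /* quick path */
    return fib_naive(n - 1) + fib_naive(n - 2); /* two recursive calls */
}
```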
Compiled with gcc -O2, it takes 5.34 seconds. I am devastated. How on earth is my hand-rolled assembly slower than compiled code! After a closer examination of its output, which is too bloated to be included here, I find that the smart ass is basically computing:
The code above takes 3.80 seconds. A 30% speedup is not a bad return on the investment.
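For a picture of the kind of rewrite an optimizer can apply to this recursion, here is a sketch in C in which one of the two recursive calls is turned into a loop with an accumulator. Whether this matches the gcc output examined above is an assumption, and the name fib_half_loop is made up for illustration.

```c
#include <stdint.h>

/* Fibonacci with the fib(n - 2) call replaced by a loop.
   Only the fib(n - 1) branch still recurses, so the number of calls
   drops while the result stays the same. */
uint64_t fib_half_loop(uint64_t n) {
    uint64_t acc = 0;
    while (n >= 2) {
        acc += fib_half_loop(n - 1); /* the one remaining real call */
        n -= 2;                      /* continue as if computing fib(n - 2) */
    }
    return acc + n;                  /* n is now 0 or 1: the base case */
}
```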
To finish, let's write a sane version that can complete the work instantly.
Nice and easy, isn't it? An idle CPU is cool but not fun. In the days ahead, we should try more challenging tasks on our little Pi.
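A minimal sketch of such a version in C, assuming fib(0) = 0 and fib(1) = 1. It keeps only the last two values and loops n times, so fib(40) needs 40 additions instead of hundreds of millions of calls. The name fib_iter is mine.

```c
#include <stdint.h>

/* Iterative Fibonacci: O(n) additions, constant memory.
   The result fits in 64 bits up to n = 93; larger n wraps around. */
uint64_t fib_iter(unsigned n) {
    uint64_t a = 0, b = 1; /* fib(0), fib(1) */
    for (unsigned i = 0; i < n; i++) {
        uint64_t next = a + b;
        a = b;
        b = next;
    }
    return a;              /* after n steps, a holds fib(n) */
}
```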