April 4, 2020
Programming arm64: sine

The mere thought to implement sine and cosine surprises myself as a C programmer. These are like sacred stones in libc. No one would think of turning them over and checking how good they are. More over, conventional wisdom always warns us not to reinvent wheels. The thing is, if we don't try to reinvent, how can you tell if the wheels we got are the best? Besides, everyone using the same wheel is quite boring. For an assembly programmer, only cycles and bytes are sacred, and one will try to build things from scratch just to save either of them.

Since I have never directly programmed FPU before, implementing sine and cosine using assembly can be a good exercise for me. It will force me to go through the list of FPU instructions and pick up the right ones for the job. Fortunately it's not as unpleasant as I thought. Instruction manuals are just like API documents, except they are better documented and don't deprecate.

The Algorithm

We use taylor series to approximate the value of sin(x). See the definition of sin1a(x) and cos1a(x) below. The smaller x is, the more accurate the series is. Therefore we will only calculate values within the range [0,PI_4), and map everything else to that range. sin(x) = sin1(floor(x / PI_4)[2:0], x mod PI_4) sin1(i, x) = i=0? -> sin1a(x) i=1? -> cos1a(PI_4 - x) i=2? -> cos1a(x) i=3? -> sin1a(PI_4 - x) i=4? -> -sin1a(x) i=5? -> -cos1a(PI_4 - x) i=6? -> -cos1a(x) i=7? -> -sin1a(PI_4 - x) cos(x) = sin(PI_2 - x) sin1a(x) = x - x^3/3! + x^5/5! - x^7/7! cos1a(x) = 1 - x^2/2! + x^4/4! - x^6/6!

The Code

Translation to code is straight forward. The only tricky part probably is the floor and mod operation. We don't want to use libc here, obviously, since the overhead of function calls will completely wipe out any benefits we might have. Turns out we only need 3 FPU instructions to do the work. Can't be simpler, can't be happier. newcos: adr x0,PI_2 ldr d1,[x0] fsub d0,d1,d0 newsin: adr x0,PI_4 ldr d1,[x0] // d1 <- PI_4 mov x2,x30 // save lr for sin2a5 & sin2a6 adr x3,sin2a0 // origin of jump table fdiv d2,d0,d1 // d2 <- x / PI_4 frintm d2,d2 // d2 <- floor (d2) fmsub d0,d1,d2,d0 // d0 <- x mod PI_4 fcvtas x1,d2 and x1,x1,#7 add x3,x3,x1,lsl#4 br x3 .align 4 sin2a0: b sin1a .align 4 sin2a1: fsub d0,d1,d0 b cos1a .align 4 sin2a2: b cos1a .align 4 sin2a3: fsub d0,d1,d0 b sin1a .align 4 sin2a4: fneg d0,d0 b sin1a .align 4 sin2a5: fsub d0,d1,d0 bl cos1a fneg d0,d0 br x2 .align 4 sin2a6: bl cos1a fneg d0,d0 br x2 .align 4 sin2a7: fsub d0,d0,d1 b sin1a

In the code above, we implement the 8 conditional branches of sin1() as a jump table. Usually jump table is a list of addresses you can jump to. We are doing things differently here that each table entry is a 16-byte code block, which can contain up to 4 instructions. We dispatch to each entry directly, saving the address loading from memory. I guess no compiler optimization will care to go this far. Such is the joy of assembly programming -- you are not at the mercy of your language implementation.

Computing taylor series of sin1a and cos1a is pretty simple, without any branches:

// d0 - x // Use x0, v0~v7 sin1a: adr x0,sin1a1 ldp d4,d5,[x0],16 ldp d6,d7,[x0] fmul d1,d0,d0 // d1 <- x^2 fmul d2,d1,d0 // d2 <- x^3 fmul d3,d2,d1 // d3 <- x^5 fmul d1,d1,d3 // d1 <- x^7 fmadd d0,d5,d2,d0 fmadd d0,d6,d3,d0 fmadd d0,d7,d1,d0 ret cos1a: adr x0,cos1a1 ldp d4,d5,[x0],16 ldp d6,d7,[x0] fmul d1,d0,d0 // d1 <- x^2 fmul d2,d1,d1 // d2 <- x^4 fmul d3,d2,d1 // d3 <- x^6 fmov d0,d4 // d0 <- 1 fmadd d0,d1,d5,d0 fmadd d0,d2,d6,d0 fmadd d0,d3,d7,d0 ret .align 3 PI: .double 3.14159265358979 PI_4: .double 0.78539816339745 PI_2: .double 1.57079632679489 .align 4 sin1a1: .double 1,-0.16666666667,0.00833333333333,-0.0001984126984126 cos1a1: .double 1,-0.5,0.04166666666667,-0.001388888888889

Benchmark

Running on Raspberry Pi 3B+ (2017) at 600MHz,
           calls/s    cycles/call
sin(libc)  3265486    183          
newsin     7003550    85
cos(libc)  3355043    178
newcos     6601419    91
See, our new wheels are running at 2x speed of old, with a small foot print at 352 bytes. It seems that assembly programming is not a laughable business after all. And we haven't got to vector instructions yet.