The mere thought to implement sine and cosine surprises myself as a C programmer. These are like sacred stones in libc. No one would think of turning them over and checking how good they are. More over, conventional wisdom always warns us not to reinvent wheels. The thing is, if we don't try to reinvent, how can you tell if the wheels we got are the best? Besides, everyone using the same wheel is quite boring. For an assembly programmer, only cycles and bytes are sacred, and one will try to build things from scratch just to save either of them.
Since I have never directly programmed FPU before, implementing sine and cosine using assembly can be a good exercise for me. It will force me to go through the list of FPU instructions and pick up the right ones for the job. Fortunately it's not as unpleasant as I thought. Instruction manuals are just like API documents, except they are better documented and don't deprecate.
The AlgorithmWe use taylor series to approximate the value of
sin(x). See the definition of
cos1a(x)below. The smaller x is, the more accurate the series is. Therefore we will only calculate values within the range
[0,PI_4), and map everything else to that range.
The CodeTranslation to code is straight forward. The only tricky part probably is the floor and mod operation. We don't want to use libc here, obviously, since the overhead of function calls will completely wipe out any benefits we might have. Turns out we only need 3 FPU instructions to do the work. Can't be simpler, can't be happier.
In the code above, we implement the 8 conditional branches of
sin1() as a jump table.
Usually jump table is a list of addresses you can jump to.
We are doing things differently here that
each table entry is a 16-byte code block, which can contain up to 4 instructions.
We dispatch to each entry directly, saving the address loading from memory.
I guess no compiler optimization will care to go this far.
Such is the joy of assembly programming -- you are not at the mercy of your language implementation.
Computing taylor series of
cos1a is pretty simple, without any branches:
BenchmarkRunning on Raspberry Pi 3B+ (2017) at 600MHz,
calls/s cycles/call sin(libc) 3265486 183 newsin 7003550 85 cos(libc) 3355043 178 newcos 6601419 91See, our new wheels are running at 2x speed of old, with a small foot print at 352 bytes. It seems that assembly programming is not a laughable business after all. And we haven't got to vector instructions yet.