April 18, 2020
Programming arm64: Drawing lines

This time we are going to implement line drawing in 3 different ways. All of them will be in assembly only, since we already had enough fun competing with C compiler in the last two exercises.

## The Simple Way

The simplest way of line drawing is to walk through the line step by step. The stepping size, dx and dy for each direction, is calculated before the walk begins. Let D be the max of the distances in both x and y direction, then we have dx = (x1-x0)/D, dy=(y1-x0)/D. Translation to assembly is straight forward: // draw_line_simple(x0,y0,x1,y1) // x0 - x0 // x1 - y0 // x2 - x1 // x3 - y1 draw_line_simple: .begin .is n,x4 .is v,x5 .is w,x6 .is t,x7 scvtf s0,x0 // x0 scvtf s1,x1 // y0 scvtf s2,x2 // x1 scvtf s3,x3 // x2 fsub s2,s2,s0 // s2<-dx fsub s3,s3,s1 // s3<-dy fabs s4,s2 fabs s5,s3 fmax s4,s4,s5 fcvtas \$n,s4 fdiv s2,s2,s4 // s2 <- dx/n fdiv s3,s3,s4 // s3 <- dy/n mov \$w,SIZE // canvas width mov \$v,255 // pixel value b 2f 1: fadd s0,s0,s2 fadd s1,s1,s3 fcvtas w0,s0 fcvtas w1,s1 2: madd \$t,\$w,x1,x0 strb \$v.w,[\$c,\$t] subs \$n,\$n,1 bpl 1b ret .end

There are 24 instructions in the code above. It takes about 3.3s to draw 1 million random lines within a 512x512 canvas on Raspberry Pi 3B+ running at 600MHz. To simplify things, the canvas is just a byte array buffer pointed to by a global register `\$c`. Note that the time also counts 4 random number generations for each line.

## The Bresenham Way

Bresenham algorithm is a rather clever way of drawing lines with only integer arithmetic. In depth explanation of the algorithm can be found in any textbook on computer graphics. It can be summarized as below:

DLB(x0,y0,x1,y1) = sx <- sign(x1-x0) sy <- sign(y1-y0) dx <- |x1-x0| dy <- |y1-y0| DLB1(x0,y0,0) WHERE DLB1(x,y,e) is defined as x=x1 AND y=y1 ? -> Done. e <- e + dx - dy / e <= 0 ? -> | dy + 2e >= 0? -> DLB1(x+sx,y+sy,e) | | else -> DLB1(x,y+sy,e+dy) | | else ->| dx >= 2e? -> DLB1(x+sx,y+sy,e) \ | else DLB1(x+sx,y,e-dx) When translated into assembly: draw_line_b: .begin .is x0,x0 .is y0,x1 .is x1,x2 .is y1,x3 .greg dx .greg dy .greg dd .greg sx .greg sy .greg e mov \$sx,1 mov \$sy,1 subs \$dx,\$x1,\$x0 cneg \$sx,\$sx,mi cneg \$dx,\$dx,mi subs \$dy,\$y1,\$y0 cneg \$sy,\$sy,mi cneg \$dy,\$dy,mi mov \$e,0 sub \$dd,\$dx,\$dy mov w5,255 mov x4,SIZE dl1: madd x6,\$y0,x4,\$x0 strb w5,[\$c,x6] cmp \$x0,\$x1 ccmp \$y0,\$y1,0,eq beq 4f add \$x0,\$x0,\$sx add \$y0,\$y0,\$sy adds \$e,\$e,\$dd bmi 1f cmp \$dx,\$e,lsl#1 bge dl1 sub \$y0,\$y0,\$sy sub \$e,\$e,\$dx b dl1 1: adds xzr,\$dy,\$e,lsl#1 bge dl1 sub \$x0,\$x0,\$sx add \$e,\$e,\$dy b dl1 4: ret .end

The algorithm is implemented in 32 instructions. It takes 2.8s to draw 1 million random lines, about 12% speed up over the simple version.

## The SIMD Way

In the two previous ways, dots on the line are calculated one after another. It doesn't have to be sequential though, since the position of each dot can be calculated independently. In fact, using vector instructions of arm64, also known as SIMD, we can calculate 4 dots at a time.

draw_line_p4: .begin .is n,x4 .is t,x5 .is width,x6 .vreg a .vreg b .vreg d .vreg s .vreg x .vreg y .vreg w mov \$t,255 // pixel value mov \$width,SIZE madd x7,x1,\$width,x0 strb \$t.w,[\$c,x7] mov \$a.2s,w0 mov \$a.2s,w1 scvtf \$a.2s,\$a.2s mov \$b.2s,w2 mov \$b.2s,w3 scvtf \$b.2s,\$b.2s fabd \$d.2s,\$b.2s,\$a.2s fmaxp s0,\$d.2s fcvtas \$n,s0 cbz \$n,4f // Done fsub \$s.2s,\$b.2s,\$a.2s dup \$d.2s,v0.2s fdiv \$s.2s,\$s.2s,\$d.2s ldr q4,factors // v4 <- [1,2,3,4] dup \$x.4s,\$a.2s // x <- [x0,x0,x0,x0] dup \$y.4s,\$a.2s // y <- [y0,y0,y0,y0] fmla \$x.4s,v4.4s,\$s.2s // x += sx[1,2,3,4] fmla \$y.4s,v4.4s,\$s.2s // y += sy[1,2,3,4] dup \$w.4s,\$width.w fmov v7.4s,4.0 fmul v4.4s,v7.4s,\$s.2s // v4 <- 4[sx,sx,sx,sx] fmul v5.4s,v7.4s,\$s.2s // v5 <- 4[sy,sy,sy,sy] b 1f 2: fadd \$x.4s,\$x.4s,v4.4s fadd \$y.4s,\$y.4s,v5.4s 1: // x,y containing the coordinates fcvtas v0.4s,\$x.4s fcvtas v1.4s,\$y.4s mla v0.4s,\$w.4s,v1.4s // v0 <- y*width+x subs \$n,\$n,4 bge 1f add \$n,\$n,4 adr x1,3f sub x1,x1,\$n,lsl#3 mov \$n,0 br x1 1: // v0 contains 4 pixel address mov w8,v0.4s strb \$t.w,[\$c,x8] mov w8,v0.4s strb \$t.w,[\$c,x8] mov w8,v0.4s strb \$t.w,[\$c,x8] mov w8,v0.4s strb \$t.w,[\$c,x8] 3: cbnz \$n,2b 4: ret factors: .float 1.0,2.0,3.0,4.0 .end // draw_line_p4

The program is about twice the size of the simple version. It may look quite complicated with all those vector instructions, but it's essentially doing the same stepping, except doing it 4 at a time. The loop of at the bottom should give you a rough idea of how it works.

The SIMD implementation draws 40% faster than the simple version, and 30% faster than the Bresenham algorithm. However, if I have to choose one of these to draw lines, I will probably use the slowest yet simplest way. Cleverness is like entropy, the less of it in a system, the better we are.

## Some other words

You may have noticed that in these implementations we didn't touch stack at all. ARM64 has a lot of registers. The abundance of registers allows us to store a fair amount of program states directly in CPU. Unfortunately, as register usage grows, it quickly becomes too much a headache for an assembly programmer to remember which is for which.

In the code listings above, there are `.is` and `.greg` directives to the rescue. They are borrowed from MMIXAL specially for allocating and naming registers. I had to write a simple script to translate the source to the standard GAS format. It makes life so much easier that I can't really complain about the extra step.