During our last lecture, we were given code to disassemble and analyze. So I copied the tarball to my folder, ran the usual "tar -xvf name" on it, ran "make", and was ready to go. This is the code I was analyzing:
#include <stdlib.h>
#include <stdio.h>

void main() {
    int a[256], i, t = 0; // t must start at zero, or the sum is garbage
    srand(1); // initialize prng
    for (i = 0; i < 256; i++) {
        a[i] = rand(); // load array randomly
    }
    for (i = 0; i < 256; i++) {
        t += a[i]; // sum the array
    }
    printf("%d\n", t); // print the sum
}
As you can see, it's a pretty easy and straightforward program. After running "objdump -d name", I got the following:
0000000000400500 <main>:
400500: d11003ff sub sp, sp, #0x400
400504: 52800020 mov w0, #0x1 // #1
400508: a9bd7bfd stp x29, x30, [sp,#-48]!
40050c: 910003fd mov x29, sp
400510: f90013f5 str x21, [sp,#32]
400514: 9100c3b5 add x21, x29, #0x30
400518: a90153f3 stp x19, x20, [sp,#16]
40051c: 9110c3b4 add x20, x29, #0x430
400520: aa1503f3 mov x19, x21
400524: 97ffffef bl 4004e0 <srand@plt>
400528: 97ffffe2 bl 4004b0 <rand@plt>
40052c: b8004660 str w0, [x19],#4
400530: eb14027f cmp x19, x20
400534: 54ffffa1 b.ne 400528 <main+0x28>
400538: 4f000400 movi v0.4s, #0x0
40053c: 3cc106a1 ldr q1, [x21],#16
400540: eb15029f cmp x20, x21
400544: 4ea18400 add v0.4s, v0.4s, v1.4s
400548: 54ffffa1 b.ne 40053c <main+0x3c>
40054c: 4eb1b800 addv s0, v0.4s
400550: a94153f3 ldp x19, x20, [sp,#16]
400554: f94013f5 ldr x21, [sp,#32]
400558: 90000000 adrp x0, 400000 <_init-0x460>
40055c: a8c37bfd ldp x29, x30, [sp],#48
400560: 911da000 add x0, x0, #0x768
400564: 0e043c01 mov w1, v0.s[0]
400568: 911003ff add sp, sp, #0x400
40056c: 17ffffe1 b 4004f0 <printf@plt>
Now this code took a bit of figuring out. Breaking it into sections helped:
400524: 97ffffef bl 4004e0 <srand@plt>
This line is our call to srand(); that part is fairly clear.
400528: 97ffffe2 bl 4004b0 <rand@plt>
40052c: b8004660 str w0, [x19],#4
400530: eb14027f cmp x19, x20
400534: 54ffffa1 b.ne 400528 <main+0x28>
This block behaves in a pretty interesting way: it FIRST calls rand(), stores the result, and only then checks the exit condition. The line "str w0, [x19],#4" shows that instead of incrementing an index, the compiler decided to increment a pointer into the array; when the pointer reaches the computed end of the array, the loop stops.
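In plain C terms, the compiler effectively rewrote the index-based loop as a pointer walk, something like this (my sketch of the transformation, not literal compiler output; x19 plays the role of p and x20 the role of end):

#include <stdlib.h>

void fill(int *a)
{
    int *p   = a;         /* x19: current element           */
    int *end = a + 256;   /* x20: one past the last element */
    do {
        *p++ = rand();    /* bl rand / str w0, [x19],#4     */
    } while (p != end);   /* cmp x19, x20 / b.ne            */
}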
400538: 4f000400 movi v0.4s, #0x0
40053c: 3cc106a1 ldr q1, [x21],#16
400540: eb15029f cmp x20, x21
400544: 4ea18400 add v0.4s, v0.4s, v1.4s
400548: 54ffffa1 b.ne 40053c <main+0x3c>
This block is the whole reason for the exercise. Here we touch on the Single Instruction, Multiple Data (SIMD) concept. "movi v0.4s, #0x0" fills a vector register, divided into four word-sized (32-bit) lanes, with zeros, preparing it to be the sum register. "ldr q1, [x21],#16" loads another vector register with four values from our initial array. Then the exit condition is checked again. Then the loaded values are added to the sum register, lane by lane, in a single instruction, and the loop ends or continues depending on the result of the comparison.
40054c: 4eb1b800 addv s0, v0.4s
This instruction adds up all four lanes of the sum vector register into the lowest lane (s0), zeroing the rest of the register.
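Written out in plain C, the vector loop plus the final addv amount to keeping four running sums and collapsing them at the end. Here is my scalar sketch of the idea (not what the hardware literally executes, which handles each group of four in one instruction):

int vector_sum(const int *a)
{
    int sum[4] = {0, 0, 0, 0};        /* movi v0.4s, #0x0          */
    for (int i = 0; i < 256; i += 4) {
        sum[0] += a[i];               /* ldr q1, [x21],#16 brings  */
        sum[1] += a[i + 1];           /* in four values; the add   */
        sum[2] += a[i + 2];           /* v0.4s, v0.4s, v1.4s does  */
        sum[3] += a[i + 3];           /* all four lanes at once    */
    }
    return sum[0] + sum[1] + sum[2] + sum[3];   /* addv s0, v0.4s  */
}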
40056c: 17ffffe1 b 4004f0 <printf@plt>
Finally, we branch to printf (a tail call, note the b rather than bl) to print the result.
As we can see, by performing a couple of extra steps to work with the vector registers, we save an enormous amount of time on the calculations, which now happen four at a time. This is the beauty of vectorization: if you can condense your loops like that, you should!
Unfortunately, vectorization has limitations that we have to watch out for. First of all, we need to pass the "-O3" option to gcc for it to happen (it is -O3 that turns on GCC's vectorizer pass, -ftree-vectorize). This is what the loop that was vectorized in my example becomes when I change the option to "-O2":
400538: b8404660 ldr w0, [x19],#4
40053c: eb1302df cmp x22, x19
400540: 0b0002b5 add w21, w21, w0
400544: 54ffffa1 b.ne 400538 <main+0x38>
It reads much more easily, but it does the additions one at a time; no more vectorization for us. Unfortunately, in some situations running the third level of optimization might be problematic for the rest of the code, since it changes it so drastically, so one would have to choose.
Another limitation shows up when we try to add two arrays into a third one (which would compile to something like "add v0.4s, v1.4s, v2.4s"). Since the compiler doesn't know whether the arrays overlap, it plays it safe and assumes that they do, and vectorization will not happen. To fix that, you need to qualify the pointers with "restrict" (or GCC's "__restrict__") to promise that they do not overlap.
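Here is a minimal sketch of the fix, using the C99 restrict qualifier; the function and parameter names are just for illustration:

/* restrict promises the compiler that c never overlaps a or b,
   making the vectorized add legal */
void add_arrays(int *restrict c, const int *restrict a,
                const int *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}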
Another limitation is that vectorization can have problems with calculations on misaligned values: imagine accessing the first element of one array and the second element of another and going from there. There is a "__builtin_assume_aligned" function in GCC to tell the compiler what alignment it can count on. This problem has been actively worked on, though, and it seems that by now the compiler should often figure it out on its own.
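A sketch of how that built-in is used (again, the names are illustrative):

void add_aligned(int *restrict c, const int *restrict a,
                 const int *restrict b, int n)
{
    /* promise GCC that each pointer is 16-byte aligned, so it can
       use full-width vector loads without a scalar fix-up prologue */
    int *restrict cp       = __builtin_assume_aligned(c, 16);
    const int *restrict ap = __builtin_assume_aligned(a, 16);
    const int *restrict bp = __builtin_assume_aligned(b, 16);

    for (int i = 0; i < n; i++)
        cp[i] = ap[i] + bp[i];
}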
Another limitation on vectorization is that you cannot vectorize values coming one at a time from a function. In our code, when we do "a[i]=rand();", the compiler sees that rand() returns only one value per call, so obtaining four values takes four separate calls. Hence no vectorization.
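To see the difference, compare a fill loop built around a call with one built around plain arithmetic (my sketch; the names and the formula are made up for illustration, and in reasonably recent GCC you can add -fopt-info-vec to the compile line to have it report which loops it vectorized):

#include <stdlib.h>

void fill_with_calls(int *a)   /* not vectorized: rand() hands   */
{                              /* back one value per call        */
    for (int i = 0; i < 256; i++)
        a[i] = rand();
}

void fill_inline(int *a)       /* vectorizable: pure arithmetic, */
{                              /* four lanes computed at once    */
    for (int i = 0; i < 256; i++)
        a[i] = i * 7 + 3;
}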
And the last limitation I will mention is that no output, outside calculation, or conditional check can be performed on the values inside a loop you want vectorized.
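For example (a quick sketch), putting the printf inside the sum loop forces every value through a function call one at a time, and the loop drops back to scalar code:

#include <stdio.h>

int sum_with_output(const int *a)
{
    int t = 0;
    for (int i = 0; i < 256; i++) {
        t += a[i];
        printf("%d\n", a[i]);  /* I/O per element blocks vectorization */
    }
    return t;
}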
This lab was very interesting, showing me that sometimes a small change in the code or the optimization level can have a significant impact on performance, if it allows vectorization to kick in!