[SPO600] Final post

What a long and wonderful journey it’s been. I’ve learned so much about how computer actually works with the code I’m writing. I’ve learned how complex and diverse can the real-world programs be. I’ve learned that if you want to submit a patch for those programs, you need to be prepared to face tons of criticism, and very diverse opinions from different members of the community.

After a week and a half of waiting for a review from the community members for my latest patch, I got another destroying reply that pretty much said “what that guy said you could do and it would be useful, it actually useless”. I am going to stick with the opinion that the CORE developer had (the patchV2 that I created), and hope that the rank will take precedence when it actually comes to the decision making about my patch. I will continue monitoring the communication about it and I will try to push the patch through , even though it will happen, as I expect, not earlier than July. Finally, it will simply feel good to have such thing done in my life!

My patch is located on my Git repository: https://github.com/ArtemL/SPO600/blob/master/spinlock-docsV2.patch  

My CommitFest post on PostgreSQL: https://commitfest.postgresql.org/5/208/

I tried to find my mail archive, but it’s not in the archive mode yet.

So the result is that I’ve created the patch that was requested by one of the core developers, but that developer didn’t comment on it later, and the other members of the community didn’t like that idea. I will have to see if I can do something about it, but at this moment, during the release cycle, they are not very responsive.

[SPO600] The last attempt

This morning I sent the last email, asking for a feedback for my patch from the community. Over the last week I’ve been getting nearly a hundred emails a day from the developers email stream, so I can see them pretty busy. I understand the fact that during the release they will not have time for a future release non-critical patch reviews, but it surely doesn’t help me!

[SPO600] Project delays

After several days of waiting for a response to my patch, I have decided to go on IRC and try to bug people directly. This brought me two pieces of information. First of all, if the channel has around a thousand users, it doesn’t mean that if you ask a question, anyone will answer. Second, there is nothing I can do at this time that would make the community to review my patch at the moment. There is a current release happening, so people who are in charge of committing, or even reviewing, are busy with it. Here is the part of the IRC chat:

01[09:14] <ArtemL> Greeting everyone! I was wondering… I have submitted a patch for 2015/06 commitfest, as well as for pgsql-hackers mailing list, but got a response… I was wondering if I am doing something wrong, or I shouldn’t get any response?
01[09:14] <ArtemL> sorry, got NO response
02[09:15] * todakure (c896474c@gateway/web/freenode/ip.200.150.71.76) Quit (Ping timeout: 246 seconds)
02[09:16] * cleme1mp (~mclement@96-36-24-135.static.aldl.mi.charter.com) Quit (Ping timeout: 250 seconds)
03[09:16] * cleme1mp (~mclement@96-36-24-135.static.aldl.mi.charter.com) has joined #postgresql
03[09:17] * dantti (~dantti@200.204.2.138) has joined #postgresql
[09:19] <Snow-Man> sigh.
[09:19] <Snow-Man> If you did all that then I doubt you did anything wrong.
[09:20] <oicu> ArtemL, people are busy trying to wrap up 9.5
[09:20] <Snow-Man> Sadly, we’re quite busy and haven’t had a chance to review everything.
[09:20] <RhodiumToad> also -hackers has been pretty quiet over the past few days
[09:20] <Snow-Man> Ah, yeah, and if it’s in the 2015/06 commitfest then you’re not likely to hear much til June.
03[09:21] * mattarse (~mattarse@113.164.94.155) has joined #postgresql
03[09:21] * davidfetter_fbn (~chatzilla@2601:9:a81:4000:c8e0:f09:3cef:b553) has joined #postgresql
[09:21] <Snow-Man> RhodiumToad: I noticed that also.. Though I’m at a loss to explain it.
03[09:21] * ancientcc (~ancientcc@140.207.223.188) has joined #postgresql
[09:22] <RhodiumToad> it’s not a US holiday or anything, is it?
03[09:23] * crobbins (~crobbins@2602:306:3403:d520:3d65:4b89:e725:c9f7) has joined #postgresql
02[09:23] * ancientcc (~ancientcc@140.207.223.188) Quit (Client Quit)
01[09:23] <ArtemL> Oh I see. That’s reassuring! I guess I will wait until they will get a bit more free. Is there any way to get a response “yes, your patch is likely to get implemented” or “nah, you gotta do it better?”
[09:23] <Snow-Man> Nope.

Snow-Man (Steven Frost) would be one of the top contributors and reviewers, and is likely to be the one to review my code later. So… I patiently will wait. I will try to send another email later this week just to see if someone is bored enough to actually have a look at it, but my hopes are low…

[SPO600] Disassembling lab

During out last lecture, we were given the code to disassemble and analyze. So I copied the tarball to my folder, run the usual “tar -xvf name” on it, run “make”, and now I was ready to go. This is the code I was analyzing:

#include <stdlib.h>
#include <stdio.h>

void main() {

int a[256], i, t;

srand(1); // initialize prng

for (i=0; i<256; i++) {
a[i]=rand(); // load array randomly
}

for (i=0; i<256; i++) {
t+=a[i]; // sum the array
}
printf("%d\n",t); // print the sum
}

As you can see, it’s pretty easy and straightforward program. After I run “objdump -d name”, I got the following:

0000000000400500 <main>:
400500: d11003ff sub sp, sp, #0x400
400504: 52800020 mov w0, #0x1 // #1
400508: a9bd7bfd stp x29, x30, [sp,#-48]!
40050c: 910003fd mov x29, sp
400510: f90013f5 str x21, [sp,#32]
400514: 9100c3b5 add x21, x29, #0x30
400518: a90153f3 stp x19, x20, [sp,#16]
40051c: 9110c3b4 add x20, x29, #0x430
400520: aa1503f3 mov x19, x21
400524: 97ffffef bl 4004e0 <srand@plt>
400528: 97ffffe2 bl 4004b0 <rand@plt>
40052c: b8004660 str w0, [x19],#4
400530: eb14027f cmp x19, x20
400534: 54ffffa1 b.ne 400528 <main+0x28>
400538: 4f000400 movi v0.4s, #0x0
40053c: 3cc106a1 ldr q1, [x21],#16
400540: eb15029f cmp x20, x21
400544: 4ea18400 add v0.4s, v0.4s, v1.4s
400548: 54ffffa1 b.ne 40053c <main+0x3c>
40054c: 4eb1b800 addv s0, v0.4s
400550: a94153f3 ldp x19, x20, [sp,#16]
400554: f94013f5 ldr x21, [sp,#32]
400558: 90000000 adrp x0, 400000 <_init-0x460>
40055c: a8c37bfd ldp x29, x30, [sp],#48
400560: 911da000 add x0, x0, #0x768
400564: 0e043c01 mov w1, v0.s[0]
400568: 911003ff add sp, sp, #0x400
40056c: 17ffffe1 b 4004f0 <printf@plt>

Now this code took a bit of figuring out. Breaking the code into the sections helped:

400524: 97ffffef bl 4004e0 <srand@plt>

This line is out call to srand(), this is fairly clear.

400528: 97ffffe2 bl 4004b0 <rand@plt>
40052c: b8004660 str w0, [x19],#4
400530: eb14027f cmp x19, x20
400534: 54ffffa1 b.ne 400528 <main+0x28>

This block acts in a pretty interesting way, since it FIRST calls rand() function, stores the result, and only then it checks the exit condition. “str w0, [x19],#4” Line shows that instead of incrementing the index, compiler decided to increment the pointer to the array, and when the pointer points to the projected end of the array, the loop will stop.

400538: 4f000400 movi v0.4s, #0x0
40053c: 3cc106a1 ldr q1, [x21],#16
400540: eb15029f cmp x20, x21
400544: 4ea18400 add v0.4s, v0.4s, v1.4s
400548: 54ffffa1 b.ne 40053c <main+0x3c>

This block is the whole reason for the exercise. Here we are touching upon Single Instruction Multiple Data (SIMD) concept. “400538: 4f000400 movi v0.4s, #0x0” loads a vector register, divided in 4 single word sized parts(32 bits) with zeros, preparing it to be the sum register.  "ldr q1, [x21],#16" loads another vector register with the values from our initial array. Then again, exit condition checked. Then the loaded values are added to the sum register in the respective positions. Then the loop ends or continues depending on the result of the condition check.

40054c: 4eb1b800 addv s0, v0.4s

This line adds up all the parts of the sum vector register into the rightmost position, filling the rest with nulls.

40056c: 17ffffe1 b 4004f0 <printf@plt>

Finally, we are calling printf for the output.

As we can see, by performing couple more of step to work with the vector registers, we save enormous time on the calculations, that now happen four at the time. This is the beauty of vectorization: if you can condense it like that, you should use it!

Unfortunately, there are problems with the vectorization, that we have to watch out for. First of all, we need to specify ‘-O3″ gcc option for it to work. This is the code that was vectorized in my example, when I change the option to “-O2”:

400538: b8404660 ldr w0, [x19],#4
40053c: eb1302df cmp x22, x19
400540: 0b0002b5 add w21, w21, w0
400544: 54ffffa1 b.ne 400538 <main+0x38>

It reads much easier, but it does addition one at the time. No more vectorization for us. Unfortunately, in some situations running the third level of optimization might be problematic for the rest of the code due to drastic changes in it. So one would have to choose.

Another limitation is happening when we try to add two arrays into a third one (it would be something like this: “add v0.4s, v1.4s, v2.4s“).  Since the compiler doesn’t know whether the arrays are overlapping or not, it will play safe and assume that they will. Vectorization will not happen then. To fix that, you would need to add “__restricted__” option to the array to specify that it will not be overlapped.

Another limitation is that the vectorization might have problems with doing calculations on misaligned values. So imagine accessing first element of the first array and second element of the second array and going from there. There is an option to run “__builtin_assume_aligned” function (can be found here) to tell the compiler how to align. Now, according to these changes, this problem is actively worked on, and it seems that the compiler should figure it out on his own by now.

Another limitation for vectorization is that you cannot vectorize values, coming one at the time from a function. In our code, when we do “a[i]=rand();“, compiler sees that rand() will return only 1 value at the time, so you need to perform 4 separate calls to obtain 4 values. Hence no vectorization.

And the last limitation I will mention is that there is no output/calculation/condition can be done on the vectorized values.

This lab was very interesting, showing me that sometimes, a small change in a code or optimization level can have a significant impact on performance, if it allows vectorization to kick in!

[SPO600] Project – Part 3 – V2

As I have described in my previous post, after hours upon hours of learning how to submit a path, I finally made it. It took the community only 2 hours to completely destroy my work. Good thing that I was given two suggestions to how to improve my code. One suggestion was to drastically change my code, while another suggestion was to abandon the code and do something else. I decided to… do both. Well, except the abandoning part. First of all, I localized my code to the proper parts, instead of creating one giant lump in the top. Second, I created the file layout. Third, I committed. Four, profit!

[SPO600] Project – Nightmares of patching

So, I’ve done with coding the patch. At least so I thought. Next step is to submit the patch and, hopefully, get it accepted. To avoid all extra communication I decided to follow the instructions for submitting the patch as closely as I can. That’s where I started having problems. Although PostgreSQL community claims that anyone can submit a patch, steps that one has to take seemed to be pretty scary to me.

First of all, I had to read tons of documentation, related to submitting the patch. These are some (not all) of the links I had to pile through:

https://wiki.postgresql.org/wiki/Submitting_a_Patch

https://wiki.postgresql.org/wiki/Working_with_Git

https://wiki.postgresql.org/wiki/Creating_Clean_Patches

http://petereisentraut.blogspot.ca/2009/09/how-to-submit-patch-by-email.html

http://en.wikipedia.org/wiki/Diff_utility

After spending few hours on just reading through those articles, I finally started trying to do something. First of all, I run “git diff –color s_lock.h s_lock2.h” to see how things will come out (s_lock.h is the original file, while s_lock2.h is my new file). Just as I expected, it was horrible. Tons of whitespace everywhere. Alignment, that I did in Visual Studio for the file, came out completely screwed up  on Linux. Plus, seeing my code in the context of the file showed me how poorly I organized it. So, back to the drawing board.

After I fixed all the problems and reworked the organization, I run the command again and checked that everything was, in fact as I expected it to be. Second round of changes took only a few minutes, and the third run of the command showed that I am finally there.

The biggest question now was “Now what?”. I decided to take it easy and followed these steps:

git clone git://git.postgresql.org/git/postgresql.git
cd postgresql
git checkout -b my-cool-feature
$EDITOR
git commit -a
git diff --patience master my-cool-feature | filterdiff --format=context > ../my-cool-feature.patch

I created new directory, pulled the repository there, switched my new file, and run the commit. It complained that it didn’t know who I am, so I had to run “git config user.name” command. Trying to commit again didn’t work, so I spent some time trying to understand why. Finally I realized that I need to set up not only my name, but also my email. “git config user.email” command fixed the problem. Then it committed fine.

The first warning sign was the message that told me the amount of lines that were removed and added. I was surprised, since I have changed way less lines than it said. I went into the patch file and realized something terrible. The git repository has newer version of the file than the one I downloaded from the website. It includes the changes people have made since the last release. So what should I do? Should I use the older version, but then my patch file has tons of extra lines, that aren’t in the git repository, or should I change the new version of file, which means I would have to change a whole lot to the code I have written? With a sign I went to check what was changed in the new file. Again, I run “diff” command to see the changes (it’s a very handy command, I must say!), and started working on merging my code with the new file.

Some time later I finally created my working patch file. Next step is to send an email to the developers with my patch.

This was the most emotionally complicated task. I sent the email to developers with info, outlined in https://wiki.postgresql.org/wiki/Submitting_a_Patch#Patch_submission. Hopefully, they will take it as something serious, not as a lame email by a student! Also, hearing from my classmates that their patches were getting rejected really didn’t help my case. In any case, I sent an email, I created the post on CommitFest. Uploaded my patch to github, as it was requested in the CommitFest post, and…

Now we wait.

[SPO600] Project – Alternative Path – Update #1

As I’ve mentioned before, I decided to update the spinlock file with new structure and documentation. As the first few updates showed me, it’s not as easy as one would think. Although quite a bit of code is repeating itself (hence the whole idea of restructuring: get rid of repeating code and ease up reading the file), there are a lot of distinctions between platforms, that are hard to address. For instance, one variable (which I assume is used for register definition) is defined as “unsigned int” for one platform, while it is defined as “unsigned char” for another. Also, depending on the result of a particular #ifdef statement, some platforms (for example “arm”) will do different code for a half of the whole cycle, so combining it with other #ifdef statements will be very  nested, bulky, and hard to follow.

Documentation also proved to be a slight problem. Different platforms had comments is random places. Not only I have to preserve the location, I have to make sure, that the comment will be correctly linked to the correct platform code, which is a bit hard for a platform merge.

Despite these problems, the file becomes shorter and shorter! I hope the community will like it!

[SPO600] Project, step 2, update #2

As I mentioned in my previous post, my plans to optimize spinlocks of PostgreSQL for aarch64 were ruined by whoever already implemented them. At the same time, I spent long time trying to understand the file and see what and how was already implemented. The file has a lot of repeating code, and not enough of documentation to elaborate on platforms and compilers already supported. So as a result, I will be reworking the file with new structure and documentation!

Example of what I mean: Let’s compare two platform-specific implementations:

First:

#ifdef __checi386__ /* 32-bit i386 */
#define HAS_TEST_AND_SET

typedef unsigned char slock_t;

#define TAS(lock) tas(lock)

static __inline__ int
tas(volatile slock_t *lock)
{
register slock_t _res = 1;
__asm__ __volatile__(
" cmpb $0,%1 \n"
" jne 1f \n"
" lock \n"
" xchgb %0,%1 \n"
"1: \n"
: "+q"(_res), "+m"(*lock)
:
: "memory", "cc");
return (int) _res;
}

#define SPIN_DELAY() spin_delay()

static __inline__ void
spin_delay(void)
{

__asm__ __volatile__(
" rep; nop \n");
}

#endif /* __i386__ */

Second:

#ifdef __x86_64__ /* AMD Opteron, Intel EM64T */
#define HAS_TEST_AND_SET

typedef unsigned char slock_t;

#define TAS(lock) tas(lock)

#define TAS_SPIN(lock) (*(lock) ? 1 : TAS(lock))

static __inline__ int
tas(volatile slock_t *lock)
{
register slock_t _res = 1;

__asm__ __volatile__(
" lock \n"
" xchgb %0,%1 \n"
: "+q"(_res), "+m"(*lock)
:
: "memory", "cc");
return (int) _res;
}

#define SPIN_DELAY() spin_delay()

static __inline__ void
spin_delay(void)
{
__asm__ __volatile__(
" rep; nop \n");
}

#endif /* __x86_64__ */

As you can see, pretty much the only difference is extra line for TAS_SPIN(lock) definition and some difference in the assembly of the tas(lock) implementation. With majority of PS implementation being so much alike, I would like to shorted the code by about 70%, and have the checks  made at the time when needed. This will also make the flow of the code much easier to read, as there will be mostly only 1 path, as opposed to multiple paths, which is a lot harder to read.

Another thing I wanted to do for this file is to add some information in documentation. As a student/developer, I would find it very convenient to read right in the beginning of the file what platforms the code was already optimized for, what compilers were supported, and where compiler intrinsics vs PS assembler were used.  This allows for a quick read to see if there is something missing, or can be further improved.

Wait for “[SPO600] Project, step 2, update #3” to see what I have done to the file!

[SPO600] Project, step 2, update #1

As I have mentioned in step 1 of my project, I was planning to optimize the spinlock function of PostgreSQL with platform specific code (in particular for aarch64). Unfortunately, due to hard to read code, I haven’t noticed the present aarch64 implementation. Possibly, this calls for a file documentation update, which I will discuss in the later update.

After a lecture, given on spinlocks, compiler intrinsics and atomic libraries, I was given an idea to rewrite the file entirely, to avoid the use of platform-specific spinlocks, and to switch them to compiler intrinsics. Further digging through the code showed that currently the code is checking whether it’s possible to use the compiler intrinsics on a specific platform, using them if it’s possible and if not, invoking platform specific code. If no code was written for a specific platform, a general code is used.

So the structure of the spinlock file is:

  •  check if spinlocks were requested at all: “#ifdef HAVE_SPINLOCKS”
    • if yes, check if gcc compiler will be used: “#if defined(__GNUC__) || defined(__INTEL_COMPILER)”
      • If yes, check the platform: “#ifdef __checi386__”
        • Now, if the platform is expected to support or not support the compiler intrinsics, then the code will only have one version for each of the two implemented functions (tas(lock) and spin_delay(lock)). In case if there are options, it will check to make sure that the right option will be used: “#ifndef __INTEL_COMPILE check ” or “#ifdef HAVE_GCC_INT_ATOMICS”.
      • After finishing with one platform, move to another. Currently supported hardware-specific implementations are for: i386, x86_64, ia64, arm, aarch64, s390 and s390x, sparc, powerpc, mc68000 and m68k (on Linux), vax, alpha, mips and sgi, m32r, sh.
    • If another compiler is used, go through the list of supported compilers and use the compiler intrinsics if it’s available. List of additional implemented compilers: UNIVEL, alpha, hppa, non gcc HPUX on IA64, AIX, SUNPRO, WIN32 and WIN64 compilers.
    • If no compilers can be used, it will stop with an error saying that “PostgreSQL does not have native spinlock support on this platform.  To continue the compilation, rerun configure using –disable-spinlocks.  However, performance will be poor.” So, essentially, we can’t use spinlocks on this machine.
  • If spinlocks weren’t requested, create the fake implementations, which shouldn’t be called anyways, since, again, no spinlocks were requested. My assumption is that they create it to generally implement the methods.
  • And finally, if whatever required methods weren’t implemented in the above steps, at the bottom of the file, those values are assigned based on default values.

So the structure of the file that is checking: 1) the possibility of using compiler intrinsics as an alternative for a spin-lock code; 2) if none are available, it does platform-specific in-line assembler spin-lock implementation; 3) if none can be done, it will make a generic implementation. Considering that this code structure will always utilize the best case scenario (intrinsics, then PS assembly, then generic), I have a strong opinion that this file cannot be reworked for a better spin-lock/alternative implementation, therefore the optimization of this file is complete.

Having said that, I am still in need to submit something for the step 3 of SPO600 project. What I will do, I will explain in my next post “[SPO600] Project, step 2, update #2”.

[SPO600] – Step 1 – Revised

New decisions:

As you might remember, in my previous blog post about step 1 of the SPO600 project I decided to take on couple of bug reports for PostgreSQL program. After I tried to download and install the program, I found out that the program was widely optimised for x86_64 and i(insert a number here)86 architectures, but it barely has any optimization for aarch64. This, being much closer to the requirements of the project, became my new target for the project.

After downloading the source code, I checked for the place that had optimizations for other platforms, thinking that if it was important to optimize for them, it should be important to optimize for aarch64 as well. Bingo! One of the files had platform-specific locking mechanism. Sounds like a great place for optimization!

The file I will be working on is: 

/src/include/storage/s_lock.h

The platform-independent code that relies and utilizes my code:

/src/include/storage/spin.h

The functions I will be implementing for aarch64: 

* void S_INIT_LOCK(slock_t *lock)
* Initialize a spinlock (to the unlocked state).
*
* int S_LOCK(slock_t *lock)
* Acquire a spinlock, waiting if necessary.
* Time out and abort() if unable to acquire the lock in a
* “reasonable” amount of time — typically ~ 1 minute.
* Should return number of “delays”; see s_lock.c
*
* void S_UNLOCK(slock_t *lock)
* Unlock a previously acquired lock.
*
* bool S_LOCK_FREE(slock_t *lock)
* Tests if the lock is free. Returns TRUE if free, FALSE if locked.
* This does *not* change the state of the lock.
*
* void SPIN_DELAY(void)
* Delay operation to occur inside spinlock wait loop.

Future coding:

Although the code is written in c language, it heavily utilizes  inline assembler language, hence my newly acquired  skill in coding in assembler will come in handy. To properly carry out this project I will have to ensure that I will implement the methods, addressing the specifics and capabilities of the aarch64 system. The file that I will be working on also has default platform-independent implementations of these function. In case if I find out that those implementations cannot be improved, the goal of my project will be to prove this new thesis. This will be done by providing the facts about each function that make it impossible to optimize for aarch64.

Next step: 

Although the standard step now would be to provide profiling information to show that the functions work slower on aarch64, than on other platform (x86_64 would be a good fit due to available to us architecture); I decided to first code the optimized functions, and then compare the results WITH and WITHOUT them. Why? Just because the function doesn’t run slower on aarch64 than on x86_64, it doesn’t mean that it won’t run faster with the optimization! So it make sense to check it against itself optimized VS non-optimized.

Additional Information:

Please refer to my previous blog post for information non the community, submission guidelines, etc, here:

https://lisyonok85.wordpress.com/2015/03/08/spo600-project-stage-1/