Today I went very low level, digging through the x86 assembly code generated by GCC. I wanted to check whether the compiler was correctly allocating registers for the most important NekoVM variables. In the main loop there are three variables declared as "register": the accumulator (VM register), the VM stack pointer and the VM code pointer. These three are really important since you're basically manipulating them all the time.
Checking the generated code with
gcc -O3 -S interp.c
I found that none of them was actually allocated to a register. Looking at the OCaml VM sources, I noticed that its interpreter binds these variables to specific registers using a GCC asm extension such as:
register int *sp asm("%edi");
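As a minimal sketch of how such a binding can be kept portable, the register name can be hidden behind a macro that expands to nothing outside GCC on 32-bit x86. The macro name `VM_REG` and the toy function `sum_stack` are hypothetical and only illustrate the syntax; they are not taken from the NekoVM or OCaml sources. Note that GCC only accepts an explicit register on a local variable declared with the `register` keyword, and it does not promise to keep the value there at all times.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: pin a local to a named register only when
   compiling with GCC for 32-bit x86; otherwise expand to nothing. */
#if defined(__GNUC__) && defined(__i386__)
#  define VM_REG(name) __asm__(name)
#else
#  define VM_REG(name)
#endif

/* Toy stand-in for an interpreter loop: walks n ints through a
   pointer that is bound to %edi when the macro is active. */
int sum_stack(const int *stack, size_t n) {
    register const int *sp VM_REG("%edi") = stack;
    int total = 0;
    assert(stack != NULL || n == 0);
    while (n-- > 0)
        total += *sp++;
    return total;
}
```

On any other compiler or architecture this compiles as ordinary C, so the same source keeps working while only the GCC build gets the pinned register.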
When I tried the same binding in NekoVM, I first got a weird, unreadable GCC error saying that it couldn't allocate enough registers. I commented out parts of the code and retested to find what was not working. In several cases in the VM (adding a string and an integer, adding two strings, and resizing the stack) I was using local variables that, for some reason, prevented the registers from being allocated.
I fixed it by moving these parts of the code into separate functions, which is not so bad because the call overhead is very low compared to the cost of the operations performed. I then managed to get my registers allocated properly. But when running... segfault.
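The idea of moving a rare, expensive case out of the main loop can be sketched as follows. Everything here is hypothetical (the opcode names, `slow_add_string_int`, and the `VM_NOINLINE` macro), and the `noinline` attribute is my own assumption about how to stop GCC from inlining the helper back into the loop at -O3; the post does not say how the real code does it.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: keep the helper out of line so its locals do not add
   register pressure to the interpreter loop. */
#if defined(__GNUC__)
#  define VM_NOINLINE __attribute__((noinline))
#else
#  define VM_NOINLINE
#endif

/* Hypothetical slow path: writes "<s><i>" into out, returns its length. */
static VM_NOINLINE size_t slow_add_string_int(char *out, size_t cap,
                                              const char *s, int i) {
    int n = snprintf(out, cap, "%s%d", s, i);
    assert(n >= 0);
    return (size_t)n;
}

enum { OP_ADD_INT, OP_ADD_STR_INT, OP_END };

/* Toy loop: the fast path stays inline, the slow path is one call. */
int run_add(const int *code, char *out, size_t cap) {
    int acc = 0;
    for (;;) {
        switch (*code++) {
        case OP_ADD_INT:      /* fast path: add the immediate operand */
            acc += *code++;
            break;
        case OP_ADD_STR_INT:  /* slow path: acc becomes the string length */
            acc = (int)slow_add_string_int(out, cap, "n=", acc);
            break;
        case OP_END:
            return acc;
        default:              /* unknown opcode */
            return -1;
        }
    }
}
```

The loop body now only needs a handful of live values, while the temporaries used for formatting live in the helper's own frame.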
I had to read more documentation about manual register allocation to understand what was going wrong. According to the calling conventions, only some registers (on x86: %ebx, %ebp, %esi and %edi) are preserved across calls. One of my allocated registers was %ecx, which was changed inside a call and crashed the VM on return. I then moved the VM stack pointer to %edi and the VM code pointer to %esi.
For the accumulator, I could have used %ebx, but that would have left GCC with one less preserved register to work with and might hurt performance. Since "acc" is most of the time assigned when returning from a call, I only added save and restore statements at a few opcodes, and bound it to %eax, which is the processor's accumulator.
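Here is a rough sketch of that save and restore pattern, under the assumption of GCC on 32-bit x86. The macro names `ACC_REG`, `ACC_SAVE` and `ACC_RESTORE`, the opcodes, and `helper_call` are all made up for illustration, since the actual NekoVM macros are not shown here. GCC does not guarantee that a local pinned to a call-clobbered register survives anything but an explicit copy, which is exactly why the copy is needed around the call.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: only GCC on 32-bit x86 gets the explicit %eax binding. */
#if defined(__GNUC__) && defined(__i386__)
#  define ACC_REG __asm__("%eax")
#else
#  define ACC_REG
#endif
#if defined(__GNUC__)
#  define VM_NOINLINE __attribute__((noinline))
#else
#  define VM_NOINLINE
#endif

/* Hypothetical macros: copy acc to a stack slot and back around a call. */
#define ACC_SAVE(slot)    ((slot) = acc)
#define ACC_RESTORE(slot) (acc = (slot))

static int calls_made = 0;

/* Stand-in for a helper whose result is not the new acc (for example
   growing the stack). Any call is free to overwrite %eax. */
static VM_NOINLINE void helper_call(void) { calls_made++; }

enum { OP_INC, OP_CALL_HELPER, OP_RET };

int run_acc(const unsigned char *pc, int start) {
    register int acc ACC_REG = start;
    int saved;
    assert(pc != NULL);
    for (;;) {
        switch (*pc++) {
        case OP_INC:
            acc++;
            break;
        case OP_CALL_HELPER:
            ACC_SAVE(saved);     /* %eax is not preserved across calls */
            helper_call();
            ACC_RESTORE(saved);
            break;
        case OP_RET:
            return acc;
        default:                 /* unknown opcode */
            return -1;
        }
    }
}
```

Opcodes whose call result becomes the new accumulator need no such copy, which is why only a few opcodes have to pay for it.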
With only a few lines changed in the interpreter code, and some GCC-specific statements added here and there (macros used when __GNUC__ is defined), I got a 2x speedup on my favorite fibonacci(35) sample. That's pretty good news! Tomorrow I will check how the Microsoft compiler allocates registers and whether I can deal with it the way I did with GCC. I also need to collect some statistics on the Dinoparc website about which opcodes are used most often, to see if I can specialize some of them a little to get an additional speedup. Right now the VM only has 54 opcodes, so there are more slots available.