Fast Flash9 Fixed Point Sine
Recently I've been starting playing a bit with Flash9.
A few months ago, after adapting the Haxe compiler to target the AVM2 (which also runs ActionScript3), I wrote the library which enable you to write Flash9 assembler in Haxe.
Yesterday, I started to microbenchmark some of the most common operations in order to check what kind of thing is expensive and what kind of thing can be done fast.
I was using some Bench program written in Haxe, then generating the corresponding AS3 code by using the HaxetoAS3 Generator, and comparing the results of both outputs.
This way, I could find that some of the operations in Haxe were a bit under optimized, so now everything should be fixed on CVS and these improvements will be included in next 1.17 release.
I was also able to get some interesting conclusions :
 array accesses are not that fast
 static calls are very slow compared to method calls (like 10 times slower). That's a bit strange given that the VM should be able to resolve them at JIT'time and them optimize better
 in particular, while floating point and integer arithmetics operations are very fast, calls to Math functions such as "sin" or "cos" are very slow
As you know, games and 3D engines use a lot of Math functions, so I was interested in writing an optimized Math.sin function. Using precomputed tables would not work because of array access time, so I was thinking to use some approximative float calculus (like Taylor series). After a bit of googling, I found this blog post which was exactly what I was looking for. This can be rewrote as the following code (without precomputed constants) :
function fastsin( x : Float ) { if( x < PI ) x += 2*PI else if( x > PI ) x = 2*PI; var sin = (x  x * abs(x) / PI) * 4 / PI; var fix = sin + .225 * (sin * abs(sin)  sin); return fix; }
This code does not calculate exactly a sin, but gives a very very good approximate that can be perfectly used in games or 3D.
Clamping
First remark : the clamping of value in the [PI,PI] range is a bit slow and inaccurate in the case the value is 10*PI for example.
So how can we easily clamp a Float value between two bounds ?
Using a modulo does not work very well in that case since we want a mod(2PI). We need then some kind of overflow... just like Flash9 int type which is constraint into the range [2^31,2^31].
Does that remind you something ? Let's do the maths :
var range = 2^31 / PI; x = int(x * range) / range;
And here it is ! Superfast accurate clamp !
Fixed point
Another good thing to know is that integer calculus is a lot faster than floating point calculus, so making the rest of the fastsin operations using integer calculus would be a lot faster.
But we have floating point values, so how do we turn them into integer operations ? By using Fixed Point integers. For those that don't know about it, here how it works :
 you convert a float to an int by multiplying it by K=2^15/PI, so you have a 0,000096 precision which should be enough for our fastsin.
 Since our float is between PI and PI, it will turn into an integer between [2^15,2^15] which is good since it means we can multiply two of these integers without any overflow.
 addition and subtraction works the same, since (a*K)+(b*K) == (a+b)*K
 for multiplication however, you have (a*K)*(b*K) = (a * b) * K * K so once multiplied you have to divise by K which is not very easy here, but there will be some trick.
 once calculus is terminated, you can simply divide your integer by K to get the float result
Let's get things into practice. First, after clamping, we reduce the integer from [2^31,2^31] to [2^15,2^15] range :
var x = int(angle * 2^31 / PI) >> 16; var x = int(angle * 683565275.57643158978229477811035) >> 16;
Then we can perform several simplifications of the sin variable calculus :
var sin = (x/K  x/K * abs(x/K) / PI) * (4 / PI); var sin = (x  x * abs(x) / (PI * K)) * (4 / PI) / K; var sin = (x  x * abs(x) / (PI * 2^15 / PI)) * (4 / PI) / K; var sinB = (x  x * abs(x) / 2^15) * (4/PI) * (2^30/PI) / (2^15/PI); var sinB = (x  ((x * abs(x)) >> 15)) * 41721;
Here, we then have calculated sinB = sin*2^30/PI using only integer calculus. Now that the first part is optimized, we can also continue with the second part :
var fix = sinB/(2^30/PI) + .225 * (sinB/(2^30/PI) * abs(sinB/(2^30/PI))  sinB/(2^30/PI)); var fix = (.775*sinB + .225*sinB*abs(sinB) / (2^30/PI)) / (2^30/PI); var fix = (sinB + (.225/.775)*PI*sinB*abs(sinB)/2^30) * .775*PI/2^30 var fix = (sinB + ~0.912*(sinB/2^15)*abs(sinB/2^15)) * .775* PI/2^30 var sinC = sinB >> 15; var fix = (sinB + ~0.912 * sinC * abs(sinC)) * .775*PI/2^30; var fix = (sinB + ((sinC * abs(sinC)) >> 9) * 467) * .775*PI/2^30; var fix = (sinB + ((sinC * abs(sinC)) >> 9) * 467) return fix / 441009855.21060102566599663103894;
Let's explain a bit the 0.912 trick here. In fixed point, we want to do only a multiplication by an integer. So we have to find an good balance between precision and speed. So we start calculating all powers of 2 multiplied by 0.912, until we find that (2^9 * 0.912) ~= 467, so we can then substitute 0.912 by 467/2^9, and use a right shift instead of the division.
Ouch ! We're done ! Let's write everything down : this is the Haxe version (you can't compile it with Haxe 1.16 since __int__ was just added on CVS) :
function fastsin( x : Float ) { var x = (untyped __int__(angle*683565275.576431589782294)) >> 16; var sinB = (x  ((x * (x<0?x:x)) >> 15)) * 41721; var sinC = sinB >> 15; var fix = sinB + (sinC * (sinC<0?sinC:sinC) >> 9) * 467; return fix / 441009855.21060102566599663103894; }
Doing some quick tests gives the following :
 approximation works as good as the floating point version, with a difference of less than 0.00129 between the real sinus calculus and the fast one.
 when called as a method, speed compared to Math.sin is 225% (more than two times faster), when inlined 320% (more than three times faster) !
Bad news and good news !
By doing this optimization, we could manage to remove entirely the cost of the static call. It means that putting fastsin inside a static function would go slower that real Math.sin. But the good news is that we can inline the code.
This is something that Haxe will support soon. You will be able to mark fastsin method as inline and each time it's called, it will insert at compile time in your code the 5 lines of the method body instead of the actual call, giving you the best speedup you can get.
The bad news is for ActionScript3 users. I don't know why, but even with typing everything to int, some of the integer operations are not correctly optimized by the MXMLC compiler. Hence, it actually performs quite bad...
Now, a very good news : if you're interested in getting the best performances from Flash9 VM, you will have to use Haxe :)

(you)May 28, 2017 at 18:45