[If you have arrived here from a blog post discussing asm.js please read my own opinion about asm.js]
Disclaimer: This is my personal blog. The views expressed on this page are mine alone and not those of my employer.
Mr. C. is a C programmer as you can probably guess from his name. Today he was asked by his boss to write a very simple function: given an array of numbered 2d points calculate vector sum of all even numbered points... He opens his favorite text editor and quickly types something like (I'll be intentionally skipping some
Oh, that was pretty easy... There are still some time left, so mr. C. throws in some array creation and benchmarking code:
His code seems to compile and run just fine:
% gcc --std=c99 -O3 -o point point.c % ./point create: 508075 [50.81 per iteration] usec, sum: 214779 [21.48 per iteration] usec
That was very easy, but the numbers don't look as impressive as mr. C. or his boss expected:
% ~/src/v8/d8 point1.js create: 4629000 [462.9 per iteration] usec, sum: 2056000 [205.6 per iteration] usec
C is a low-level programming language. People call it a portable assembly sometimes. You just take a piece of memory and say: hey, this smallish sequence of 24 bytes is actually a
Point which has one
uint32_t field followed by two
Small chunk of unused memory between
x appears to make
y fields nicely aligned. CPUs ♥ aligned doubles. On some architectures you can't even read an unaligned one directly from memory into a floating point register (you'll get punished if you try).
Arrays in C are also nice and tight: you just take a bigger region of memory and assume that it contains a sequence of
Points one after another.
Point with this and that as fields. Instead you have an extremely flexible thing called
Object can have properties which can come and go as they wish (almost). You can write constructor:
but nothing actually prevents you from modifying
V8 actually tries to make the object more struct like. It attempts to figure out how much space should be preallocated directly inside the object for properties by looking at the constructor and/or doing allocation profiling. Some VMs just use a fixed constant or even store all properties outside of an object in a separate array. Another thing to notice is that V8 always boxes numbers that are not 31-bit integers (32-bit on x64). Some VMs (e.g. SpiderMonkey and JSC) don't box doubles but instead use NaN-tagging instead.
Point object will be 3.3 times as big on x64 version of V8 than it was in
C program and it will not be continuously allocated in memory (doubles are accessed via indirection). There are three additional fields: pointer to hidden class (known as
Map in V8,
Shape in SpiderMonkey,
Array of points object looks basically the same:
The mathematics of performance is simple: each point of flexibility and each indirection (arrow on the picture) adds additional runtime cost. Evaluating simply looking expression
points? How do I get property called
i from it? How then I get
points is an array of
Point objects" via types declarations JIT has to eavesdrop to notice that.
If you want to get good performance you need to speak loud enough for JIT compiler to hear you and adapt to your application. This means: your objects should have stable layouts, property access and call sites better be monomorphic, no funny language features (e.g. with or eval) in sight, using typed arrays where appropriate instead of pure JS arrays etc. In a sense you'll be writing statically typed code in a language that does not support static typing.
"Wait a second!" somebody might say here "Mr. C's code above is pretty damn type stable and yada-yada. But it's still 10x times slower! Can he make it faster?"
Yes, he can but he will have to
This is a very naïve implementation (I would even say parody) of the famous
malloc&free duo. It does not try to optimize memory usage at all but it is perfect for our demonstration.
Original C program returns
sum on the stack so will have our toy stack to mimic that.
(byte*)points + sizeof(Point) * i might be wasteful because multiplication is expensive compared to addition so good C compiler will try to replace
points[i] with a new artificial induction variable
p' that starts as
p' = points[i] and is advanced as
p += sizeof(Point).
Summing loop was translated from C in the same way as initialization loop in
createArrayOfPoints but there is another important classical optimization demonstrated here: scalar replacement. Smart compiler sees (via escape analysis usually) that
Point sum does not have to exist as a continuos entity on the stack. Instead it can be exploded and its fields becomes separate variables (register allocator later puts them into registers). Result of the
sumArrayOfPoints is returned on the "stack" just like in the original C version.
The rest of code is unchanged but minor changes are required in the
We have to take care of managing stack space for stack allocated
sum object and we should remember to
% ~/src/v8/d8 point.c.js create: 532000 [53.2 per iteration] usec, sum: 988000 [98.8 per iteration] usec
If you are curious enough to look at the code V8 generates for the sum loop and compare it with code generated from C by GCC you will notice that while it is pretty tight V8 does not understand that i32 and f64 are actually the same array and treats them separately. There are also bounds checks that are not there for C code.
Don't be charmed by the sweet spot
You can hit the sweet spot, no doubt about that. But you have to follow rules to hit it. And those rules are strict (application runs best when it's actually pretty static in it's behavior) and sometimes unknown (different VMs rely on different heuristics). And hitting the sweetest part requires some strange engineering decisions as demonstrated by the mr. C.'s example above.
Real world performance is not as shallow as some micro-benchmark score. It has at least three very important facets: speed, memory, VM's complexity.