I should say that I am not alone in my desire to explain how things work internally and help people write more performant code. A lot of people all over the world are trying to do the same. But there is, I think, a problem that prevents this knowledge from being absorbed efficiently by developers. We are trying to convey our knowledge in the wrong form. I am guilty of this myself:
- sometimes I wrap things I know about V8 into hard-to-digest lists of "do this, not that" recommendations. The problem with serving knowledge this way is that it does not really explain anything. Most probably it will be followed like a sacred ritual, and it can easily become outdated without anybody noticing.
- sometimes, when trying to explain how VMs work internally, we choose the wrong level of abstraction. I love the thought that seeing a slide full of assembly code might encourage people to learn assembly and reread that slide later, but I am afraid that such slides often just fall through and are forgotten as something not useful in practice.
Note that I have a habit of checking at least some of the final results computed by my μbenchmarks. This saves me from the embarrassment of somebody discovering that my revolutionary jsperf test cases are nothing but my own bugs.
If you take the code above and put it into a Lua interpreter you will get something like this:
$ lua points.lua
150.2
However, if you just try to run the translated code with d8 (V8's standalone shell), it will politely refuse:
$ d8 points.js
points.js:9: ReferenceError: Table is not defined
var array = new Table();
                ^
ReferenceError: Table is not defined
    at MakeArrayOfPoints (points.js:9:19)
    at points.js:37:13
The reason for this refusal is simple: we are still missing the runtime system code that is actually responsible for implementing the object model and the semantics of loads and stores. It might seem obvious, but I want to highlight it: a VM that looks like a single black box from the outside is, on the inside, an orchestra of boxes playing together to deliver the best possible performance. There are compilers, runtime routines, an object model, a garbage collector, etc. Fortunately our language and example are very simple, so our runtime system is only a couple dozen lines long:
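A minimal sketch of such a runtime system might look like this (the identifiers `Table`, `LOAD` and `STORE` are my assumptions for illustration, not the original code):

```javascript
// Tables are backed by a Map so they can hold keys of any type, just
// like Lua tables; every property access in the translated code goes
// through the LOAD/STORE helpers instead of touching t[k] directly.
function Table() {
  this.map = new Map();
}

Table.prototype = {
  load:  function (key)        { return this.map.get(key); },
  store: function (key, value) { this.map.set(key, value); }
};

// The translated points.js would call these at every access site.
function LOAD(t, k)     { return t.load(k); }
function STORE(t, k, v) { t.store(k, v); }
```

Loading a missing key simply yields `undefined`, which is a reasonable stand-in for Lua's `nil`.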
Notice that I have to use a Harmony Map object, because a table can potentially contain any key, not just string ones.
$ d8 --harmony quasi-lua-runtime.js points.js
737
invokedynamic is essentially a structural inline cache exposed at bytecode level. Inline caching (usually abbreviated as IC in V8 sources) is actually a very old technique developed roughly 30 years ago for Smalltalk VMs.
Good duck always quacks the same way
The idea behind inline caching is very simple: we want to create a bypass, or fast path, that allows us to quickly load an object's property without entering the runtime system, as long as our assumptions about the object and its properties are correct. It's quite hard to formulate any meaningful assumptions about object layout in a program written in a language full of dynamic typing, late binding and other quirks like eval, so instead we want to let our loads and stores observe and learn: once they see some object, they adapt themselves in a way that makes subsequent loads from similarly structured objects faster. In a sense, we cache knowledge about the layout of a previously seen object inside the load/store itself, hence the name inline caching. ICs can actually be applied to virtually any operation with dynamic behavior, as long as you can figure out a meaningful fast path: arithmetic operators, calls to free functions, calls to methods, etc. Some ICs can also cache more than a single fast path, that is, they become polymorphic.
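To make the observe-and-learn idea concrete, here is a toy self-patching load site. It is nothing like a real VM's machinery (a real IC compares a single hidden-class pointer, and patches machine code rather than a variable); the set of key names is used as a crude stand-in for "layout":

```javascript
// A load site that starts as a "miss" handler; after observing one
// object it patches itself into a fast path specialized for that
// object's apparent layout, and falls back to re-learn on a mismatch.
var loadX = function miss(obj) {
  var layout = Object.keys(obj).join();      // remember the observed layout
  loadX = function (obj) {
    if (Object.keys(obj).join() === layout) return obj.x;  // fast path
    return miss(obj);                        // assumption broken: re-learn
  };
  return obj.x;
};
```

After the first call, `loadX` no longer points at the miss handler; subsequent calls with similarly shaped objects skip straight to the specialized path.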
If we start thinking about how to apply ICs to the translated code above, it soon becomes obvious that we need to change our object model. There is no way we can do a fast load from a Map; we always have to go through the get method. [If we could peek into the raw hashtable behind Map, we could make ICs work for us even without a new object layout, by caching the bucket index.]
Discovering hidden structure
For efficiency, tables that are used like structured data should become more like C structs: a sequence of named fields at fixed offsets. The same goes for tables that are used as arrays: we want numeric properties to be stored in an array-like fashion. But obviously not every table fits such a representation: some are genuinely used as hashtables, either containing non-string, non-number keys, or containing too many string-named properties that come and disappear as the table is mutated. Unfortunately we can't perform any kind of expensive type inference; instead we have to discover the structure behind each and every table while the program runs, creating and mutating them. Fortunately, there is a well-known technique that allows us to do precisely that ☺. This technique is known as hidden classes.
The idea behind hidden classes boils down to two simple things:
- the runtime system associates a hidden class with each and every object, just like a Java VM would associate an instance of java.lang.Class with every object;
- if the layout of the object changes, then the runtime system will create or find a new hidden class that matches the new layout and attach it to the object.
Hidden classes have a very important feature: they allow the VM to quickly check assumptions about object layout with a simple comparison against a cached hidden class. This is exactly what we need for our inline caches. Let's implement a simple hidden class system for our quasi-Lua runtime. Every hidden class is essentially a collection of property descriptors, where each descriptor is either a real property or a transition that points from a class that does not have some property to a class that does:
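A sketch of such a hidden class system might look like this (the identifiers are assumptions made for this illustration):

```javascript
// A descriptor is either a real Property, holding the slot index where
// the value lives, or a Transition, pointing at the klass an object
// gets after adding that property.
function Property(index)   { this.index = index; }
function Transition(klass) { this.klass = klass; }

function Klass() {
  this.descriptors = new Map();  // name -> Property | Transition
  this.keys = [];                // names of real properties, in order
}

Klass.prototype.addProperty = function (key) {
  var desc = this.descriptors.get(key);
  if (desc instanceof Transition) {
    return desc.klass;  // reuse: objects built alike share klasses
  }
  var klass = this.clone();
  klass.append(key);
  // Leave a transition behind so the next object constructed the same
  // way finds this klass instead of creating a fresh one.
  this.descriptors.set(key, new Transition(klass));
  return klass;
};

Klass.prototype.hasProperty = function (key) {
  return this.descriptors.get(key) instanceof Property;
};

Klass.prototype.getIndex = function (key) {
  return this.descriptors.get(key).index;
};

// Clone copies only real properties; transitions stay on the old klass.
Klass.prototype.clone = function () {
  var klass = new Klass();
  klass.keys = this.keys.slice(0);
  for (var i = 0; i < this.keys.length; i++) {
    klass.descriptors.set(this.keys[i], new Property(i));
  }
  return klass;
};

Klass.prototype.append = function (key) {
  this.keys.push(key);
  this.descriptors.set(key, new Property(this.keys.length - 1));
};
```

Starting from an empty root klass, adding `"x"` then `"y"` always walks the same transition edges, so any two objects built that way end up sharing one hidden class.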
Transitions exist to enable sharing of hidden classes between objects that are created in the same way: if you have two objects that share a hidden class and you add the same property to both of them, you don't want them to end up with different hidden classes.
Now we can make our tables flexible and allow them to adapt to the way they are constructed.
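One way to sketch such an adaptive table is below. All names are assumptions, and it is deliberately simplified: any non-string key immediately demotes the table to a slow Map-backed dictionary, whereas a real runtime would also keep numeric keys fast in an elements array.

```javascript
// A compact hidden class: offsets maps property names to slots in the
// table's properties array; transitions maps names to the klass a
// table gets after adding that name.
function Klass() {
  this.offsets = new Map();
  this.transitions = new Map();
}

Klass.prototype.addProperty = function (key) {
  var next = this.transitions.get(key);
  if (!next) {
    next = new Klass();
    next.offsets = new Map(this.offsets);
    next.offsets.set(key, next.offsets.size);
    this.transitions.set(key, next);
  }
  return next;  // shared by every table that grows the same way
};

var ROOT_KLASS = new Klass();

function Table() {
  this.klass = ROOT_KLASS;  // hidden class, starts empty
  this.properties = [];     // fast storage for string-named properties
  this.slow = null;         // Map storage once the table goes slow
}

Table.prototype.store = function (key, value) {
  if (this.slow === null && typeof key !== "string") this.convertToSlow();
  if (this.slow !== null) { this.slow.set(key, value); return; }
  if (!this.klass.offsets.has(key)) {
    this.klass = this.klass.addProperty(key);  // layout changed: new klass
  }
  this.properties[this.klass.offsets.get(key)] = value;
};

Table.prototype.load = function (key) {
  if (this.slow !== null) return this.slow.get(key);
  var offset = this.klass.offsets.get(key);
  return offset === undefined ? undefined : this.properties[offset];
};

Table.prototype.convertToSlow = function () {
  this.slow = new Map();
  var self = this;
  this.klass.offsets.forEach(function (offset, key) {
    self.slow.set(key, self.properties[offset]);
  });
  this.properties = null;
  this.klass = null;
};
```

Two tables that receive the same properties in the same order end up pointing at the very same klass object, which is what the inline caches will later compare against.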
Now that we have hidden classes in our runtime system, allowing quick checks of object layout and quick loads of properties by their index, we just have to implement the inline caches themselves. This requires some additional functionality in both the compiler and the runtime system (remember how I was talking about cooperation between the different parts of a VM?).
Patchwork quilts of generated code
- a global variable per IC will be used to emulate a modifiable call instruction;
- closures will be used instead of stubs.

There is no way for a stub to discover which call site invoked it (arguments.caller is not fine-grained enough), so we'll just pass the IC's id into the IC stub explicitly. Here is what the IC-ified code will look like:
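A sketch of what the IC-ified translation might look like; the names `LOAD$N`/`STORE$N` and `NAMED_*_MISS` are my assumptions, and these placeholder stubs always take the generic path (a real MISS stub would also overwrite the global it was called through with a specialized fast-path closure):

```javascript
// Placeholder MISS stubs: generic loads/stores with an IC id argument.
function NAMED_LOAD_MISS(t, key, ic)         { return t.get(key); }
function NAMED_STORE_MISS(t, key, value, ic) { t.set(key, value); }

// One mutable global per access site, emulating a patchable call.
var STORE$0 = NAMED_STORE_MISS;  // IC with id 0: the store in MakePoint
var LOAD$1  = NAMED_LOAD_MISS;   // IC with id 1: the load in PointX

function MakePoint(x) {
  var point = new Map();         // stand-in for the runtime's Table
  STORE$0(point, 'x', x, 0);     // each site passes its own IC id along
  return point;
}

function PointX(point) {
  return LOAD$1(point, 'x', 1);
}
```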
The changes above are again self-explanatory: every property load/store site got its own IC with an id. One small last step is left: implementing the MISS stubs and the stub "compiler" that produces specialized stubs:
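Here is a self-contained sketch of a load MISS stub and its stub "compiler" (all identifiers are assumptions, and for brevity an array `ICS` of mutable slots stands in for the one-global-per-IC scheme, while stores stay generic):

```javascript
// Minimal hidden-class machinery so the sketch runs on its own:
// a table is { klass, properties }, klasses are shared via transitions.
var ROOT = { offsets: new Map(), transitions: new Map() };

function addProperty(klass, key) {
  var next = klass.transitions.get(key);
  if (!next) {
    next = { offsets: new Map(klass.offsets), transitions: new Map() };
    next.offsets.set(key, next.offsets.size);
    klass.transitions.set(key, next);
  }
  return next;
}

function Table() { this.klass = ROOT; this.properties = []; }

function store(t, key, value) {  // generic store, no IC here
  if (!t.klass.offsets.has(key)) t.klass = addProperty(t.klass, key);
  t.properties[t.klass.offsets.get(key)] = value;
}

// One slot per load site; every slot starts in the MISS state.
var ICS = [NAMED_LOAD_MISS];

function NAMED_LOAD_MISS(t, key, ic) {
  // Observe the receiver, then patch the call site with a stub
  // specialized for this hidden class and property name.
  ICS[ic] = CompileNamedLoadFastProperty(t.klass, key);
  return t.properties[t.klass.offsets.get(key)];
}

function CompileNamedLoadFastProperty(klass, key) {
  var offset = klass.offsets.get(key);  // slot is fixed for this klass
  return function (t, k, ic) {
    if (t.klass !== klass || k !== key) return NAMED_LOAD_MISS(t, k, ic);
    return t.properties[offset];        // one compare, one indexed load
  };
}
```

The fast path is exactly the shape the text promised: a single identity comparison against the cached hidden class, then a direct indexed load; anything else falls back to MISS, which simply re-learns and repatches.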
It's a lot of code (and comments), but it should be simple to understand given all the explanations above: ICs observe, and the stub compiler/factory produces adapted, specialized stubs [an attentive reader can even notice that I could have initialized all keyed store ICs with fast loads from the very start, or that an IC gets stuck in the fast state once it enters it].
If we throw all the code we got together and rerun our "benchmark" we'll get very pleasing results:
$ d8 --harmony quasi-lua-runtime-ic.js points-ic.js
117
This is a factor of 6 speedup compared to our first naïve attempt!
I hope that tomorrow, or maybe the week after, you will stop, shout Eureka!, and tell your coworkers why conditionally adding properties to an object in one place in the code can affect the performance of some other distant hot loop touching those objects. Because hidden classes, you know, they change!