I was thinking about writing a smallish blog post summarizing my thoughts on closure variables vs. instance field performance, as a reply to Marijn Haverbeke's post which starts out with a mystery, when I realized that this is an ideal candidate for a longer post that illustrates how V8 handles closures and how these design decisions affect performance.
Contexts
If you program in JavaScript you probably know that every function carries around a lexical environment that is used to resolve variables into their values when the function's body is executed. This sounds a bit vague but actually boils down to a very simple thing:
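Consider a sketch along these lines (a minimal sketch; the names makeF, f, x and y are the ones the explanation below refers to):

```javascript
function makeF() {
  var x = 1;
  var y = 2;
  return function f() {
    return x + y;  // x and y outlive makeF's activation
  };
}

var f = makeF();
f();  // still able to read x and y
```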
Function f needs some kind of storage attached to it to keep the variables x and y, because when f is called, makeF's activation that initially created them no longer exists.
V8 does exactly this: it creates an object called Context and attaches it to the closure (which is internally represented with an instance of the JSFunction class).
[From here on I will be saying context variable instead of captured variable.] There are however a couple of important nuances. First of all, V8 creates a Context when we enter makeF, not when we create the closure itself, as many people expect. It is important to keep this in mind when binding variables that are also used in a hot loop: the optimizing compiler will not be able to allocate such variables to registers, and each load and store will become a memory operation.
[There is an optimization called register promotion that allows a compiler to forward loads to stores across loops and delay storing a value into memory for as long as possible, but nothing similar is implemented in V8 at the moment.]
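For example, in a sketch like the following (the names are mine), the variable sum is captured by an inner closure, so inside the hot loop every read and write of sum has to go through the Context in memory:

```javascript
function sumArray(array) {
  var sum = 0;  // captured below, hence allocated to the Context
  for (var i = 0; i < array.length; i++) {
    sum += array[i];  // every load and store of sum is a memory operation
  }
  // capturing sum in a closure is what forces it out of a register
  return function () { return sum; };
}
```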
Another thing to keep in mind: if a Context is potentially needed, it will be created eagerly when you enter the scope and will be shared by all closures created in that scope. If the scope itself is nested inside a closure, then the newly created Context will have a pointer to the parent one. This might lead to surprising memory leaks. For example:
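Here is a sketch of such a leak (HUGE and GIANT stand in for some large objects, and use is just a helper of mine that keeps a variable referenced):

```javascript
var HUGE = new Array(1000000);
var GIANT = new Array(10000000);

function use(value) { return value; }

function outer() {
  var x = HUGE;
  function inner() {
    var y = GIANT;
    use(x);  // using x from inner forces x into the Context created on entry to outer
    function innerF() {
      use(y);  // using y from innerF forces y into the shared Context created on entry to inner
    }
    function innerG() {
      /* uses nothing, but carries the same shared Context */
    }
    return innerG;
  }
  return inner();
}

var o = outer();  // o retains both GIANT and HUGE
```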
In this code the closure innerG stored in the variable o will retain:

- GIANT through a link to the shared Context that was used by innerF to access the variable y;
- HUGE through a link to the shared Context that links to the parent Context created for inner.
inner itself will survive as long as o points to innerG, because inner is retained by the context it created for innerF and innerG. The same is true for outer (except that outer is also retained through a global variable, so there is no leak): the context it created for inner points back to it. For asynchronous code with deeply nested callbacks this might mean that the outermost callback will be kept alive until the innermost one dies. This will in turn increase the pressure on the garbage collector, and in the worst case the outer callback might even end up being promoted to the old generation. Avoiding deep callback nesting in hot places might improve your app's performance by reducing the pressure on the GC. It is useful to keep this picture in mind when debugging memory leaks in callback-centric code bases.
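For example, in a callback chain of this shape (readConfig, connect and query are hypothetical asynchronous APIs, stubbed out here purely for illustration) the outermost callback stays reachable through the context chain for as long as the innermost one can still fire:

```javascript
// Hypothetical async helpers, stubbed with setTimeout for illustration.
function readConfig(cb) { setTimeout(function () { cb({ db: 'test' }); }, 0); }
function connect(config, cb) { setTimeout(function () { cb({}); }, 0); }
function query(conn, cb) { setTimeout(function () { cb(42); }, 0); }

readConfig(function (config) {        // retained until the innermost
  connect(config, function (conn) {   // callback below has fired, because
    query(conn, function (result) {   // each closure links to the Contexts
      console.log(config.db, result); // of all enclosing scopes
    });
  });
});
```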
In addition to the normal rules governing outer scope references, certain language constructs force variables to be context allocated:

- a direct call to eval and the with-statement cause all variables in all scopes enclosing the scope that contains them to be context allocated;
- a reference to the arguments object from a non-strict function causes parameters to be context allocated.
From the last observation it also follows that hot functions which use the arguments object should either have an empty formal parameter list or be declared strict, because this allows V8 to avoid the allocation and enables inlining (functions that have to allocate a context are not inlined).
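For example (a sketch; the function names are mine):

```javascript
// Non-strict and touching arguments: the parameters a and b become
// context allocated and the function will not be inlined.
function slowAdd(a, b) {
  return arguments.length === 1 ? a : a + b;
}

// Declaring the function strict avoids the context allocation...
function strictAdd(a, b) {
  "use strict";
  return arguments.length === 1 ? a : a + b;
}

// ...and so does an empty formal parameter list.
function varArgAdd() {
  return arguments.length === 1 ? arguments[0]
                                : arguments[0] + arguments[1];
}
```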
Generated code
Now let's take a look at the code that V8's compilers (there are two) generate for reading and writing context variables, compared to the code generated for monomorphic instance field accesses.
[To see the machine code generated by V8 you can fetch V8's source from the repository, build a standalone shell d8 and invoke it with --print-code --print-code-stubs --code-comments. See the cheat sheet below.

```
$ svn co http://v8.googlecode.com/svn/branches/bleeding_edge v8
$ cd v8
$ make dependencies
$ make ia32.release objectprint=on disassembler=on
$ out/ia32.release/d8 --print-code --print-code-stubs --code-comments test.js
```
]
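Here is the kind of code I will be looking at (a minimal sketch): (1) marks the instance field load and (2) the context variable load discussed below:

```javascript
function ClassicObject() {
  this.x = 10;
}

ClassicObject.prototype.getX = function () {
  return this.x;  // (1) instance field load
};

function ClosureObject() {
  var x = 10;
  return {
    getX: function () {
      return x;  // (2) context variable load
    }
  };
}

var classic_object = new ClassicObject();
var closure_object = new ClosureObject();

// Warm up both loads so ICs get patched and the optimizer kicks in.
for (var i = 0; i < 100000; i++) {
  classic_object.getX();
  closure_object.getX();
}
```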
Not surprisingly, the instance field load (1) is compiled by the non-optimizing compiler into an inline cache invocation:
After it has been executed several times, the IC call gets patched with the following stub:
Nothing unexpected here: it is a classical implementation of inline caching. If you want to read a high-level explanation of how inline caches and hidden classes work, check out my post with an implementation of inline caches in JavaScript.
If we take a look at the load of the context variable (2), we will discover that it is compiled even by the non-optimizing compiler into something much simpler:
There are two things to notice here:

- V8 keeps a dedicated register esi for the current context, to avoid loading it from the frame or from the closure object itself;
- the compiler was able to resolve the variable to a fixed index during compilation, so there is no late binding, no lookup overhead and no need to involve inline caches.
If we take a look at the optimized code, we discover that the load of a context variable is basically the same, but the load of an instance field from optimized code is a bit different:
[Here comments of the form ;;; @N: refer to instructions in Crankshaft's low-level IR, aka lithium.]
Crankshaft specialized the load site for a particular type of object and inserted type guards that cause deoptimization and a switch to unoptimized code when they fail. One can say that Crankshaft essentially inlined the IC stub, decomposed it into individual operations (checking non-sminess, checking the hidden class, loading the field) and rerouted the slow case (a miss) through deoptimization. V8 does not actually implement type specialization as inlining of a stub, but this is a very handy way of thinking about it, especially because the main and only source of type information currently used by V8 is inline caches. [Take note of this, I'll discuss some consequences below.]
Splitting the guards from the actual operations (like loads) allows the optimizing compiler to eliminate redundancy. Let's check what happens if we add one more field to our class (I am skipping the warm-up code):
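Something along these lines (again a sketch):

```javascript
function ClassicObject() {
  this.x = 10;
  this.y = 20;
}

ClassicObject.prototype.getSum = function () {
  return this.x + this.y;  // two field loads plus an addition: three ICs
};

function ClosureObject() {
  var x = 10;
  var y = 20;
  return {
    getSum: function () {
      return x + y;  // two statically resolved context loads
    }
  };
}
```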
The non-optimized version of getSum will have three ICs (one for each property load and one for +, which also has a bit of late binding mixed in), but the optimized version is more compact than that:
Instead of leaving two check-non-smi and two check-maps guards, the compiler performed common subexpression elimination (CSE) and eliminated the redundant ones. The code looks shiny, but the loads from the context look no less shiny and do not require any guards, because their binding is resolved statically. How does it happen, then, that closure-based OOP ends up being slower than the classical one?
Let's return to our example and add more OOPness to it (after all, OO is about methods calling methods calling methods calling methods…):
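For example (again a sketch):

```javascript
function ClassicObject() {
  this.x = 10;
  this.y = 20;
}

ClassicObject.prototype.getX = function () { return this.x; };
ClassicObject.prototype.getY = function () { return this.y; };
ClassicObject.prototype.getSum = function () {
  return this.getX() + this.getY();
};

function ClosureObject() {
  var x = 10;
  var y = 20;
  function getX() { return x; }
  function getY() { return y; }
  return {
    getSum: function () {
      return getX() + getY();
    }
  };
}

var classic_object = new ClassicObject();
var closure_object = new ClosureObject();

for (var i = 0; i < 100000; i++) {
  classic_object.getSum();
  closure_object.getSum();
}
```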
I am not going to look at the non-optimized code now, but will proceed immediately to the optimized code of the function ClassicObject.prototype.getSum:
It looks almost the same as what we had before… Wait a second! What happened to all those calls to getX and getY, what is this check-prototype-maps thingy, and how did the field loads appear directly in getSum's code?
The truth is that Crankshaft inlined both of these small functions into getSum and completely eliminated the calls. Obviously this inlining decision becomes incorrect if somebody replaces getX or getY on the ClassicObject.prototype, which is why Crankshaft generated a guard against the hidden class of the prototype: the helpful check-prototype-maps fellow. There is another interesting trick behind this check: V8's hidden classes encode not only the structure of the object (which properties the object has at which offsets) but also the methods attached to the object (which closure is attached to which property). This allows making object-oriented programs quite a bit faster, with a single guard verifying several assumptions about the object.
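For example, a mutation as innocent as the following one (continuing the sketch above) invalidates the assumptions behind check-prototype-maps:

```javascript
// Replacing a method changes the prototype's hidden class:
// the guard in the optimized getSum fails and it deoptimizes.
ClassicObject.prototype.getX = function () {
  return this.x | 0;
};
```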
If we now take a look at the optimized code attached to closure_object.getSum, we will be a little bit disappointed:
The optimizing compiler was able to inline both getX and getY, but it did not realize several things:

- even at the parser level it is already obvious that getX and getY are immutable constants that are never modified at runtime, so there is no need to guard the inlined code with a check against closure identity;
- getX, getY and getSum have the same context, but the optimizer did not take that into account; instead it embedded the context of getX and the context of getY as constants into the code, introducing unnecessary indirection through cells (instructions @20 and @32).
This happens because Crankshaft does not fully utilize the static information that the parser could give it, and instead relies on type feedback, which can easily be lost if you create two ClosureObject objects.
To observe this disaster, let's change our warm-up code a bit:
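For example, warming up with two objects of each kind instead of one (a sketch):

```javascript
var classic_object_1 = new ClassicObject();
var classic_object_2 = new ClassicObject();
var closure_object_1 = new ClosureObject();
var closure_object_2 = new ClosureObject();

for (var i = 0; i < 100000; i++) {
  classic_object_1.getSum();
  classic_object_2.getSum();
  closure_object_1.getSum();
  closure_object_2.getSum();
}
```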
The optimized code of ClassicObject.prototype.getSum did not change at all, because the hidden classes of both classic objects are the same, since they were constructed the same way. However, the quality of the code in ClosureObject's getSum degraded significantly. [Half a year ago V8 would even produce several copies of the optimized code, one for each getSum closure, but now it pays more attention to sharing optimized code across closures produced from the same function literal.]
The calls to getX and getY are no longer inlined, and what is more, V8 does not even understand that they are guaranteed to be functions: the calls are not direct and go through a generic CallFunctionStub that checks whether the target (passed in the register edi) is actually a function.
Why did this happen? The short answer: the type feedback from the two instances of getSum got mixed together, turning the getX and getY call sites into megamorphic calls. The easiest way to explain it is to draw a picture:
V8 shares unoptimized code across all closures produced from the same function literal. At the same time, inline caches and other facilities collecting type feedback are attached to the unoptimized code. As a result, type feedback is shared and mixed. Everything is mostly fine while the feedback is based on hidden classes, because they capture the organization of objects: objects constructed the same way have the same hidden class. However, for the getX and getY call sites V8 collects call targets. If you have only one ClosureObject with a single getSum, everything will seem monomorphic to V8 because getX and getY are always the same. But if you create and start using another ClosureObject, those call sites become megamorphic: the identity of the call targets does not match anymore.
There are multiple things that V8 could do better here, like utilizing the static information about immutability of bindings that the parser can recover from the source, or using SharedFunctionInfo identity instead of closure identity for call target feedback and inlining guards (Issue 2206).
Until these issues are addressed, classical objects might be a better choice if you are looking for predictable performance. Still, each individual case requires careful investigation (e.g. is it a singleton, or are multiple objects going to be produced? is it on a hot path? etc.). And of course, don't optimize before you need to optimize :-)