Imagine for a moment that you are a modern day hero. A prince or a princess adept in JavaScript programming traveling through enchanted world and searching for the World’s Salvation, the Ultimate Love or the well guarded recipe of Siberian dumplings.
Sooner or later you encounter a merciless Sphinx who asks you the simplest question you’ve ever heard in your life: “What is the cost of accessing the length
property of a string value in JavaScript running in the recent version of Chrome?”
You are a seasoned adventurer and your pockets are full of tools. You throw together a simple jsPerf test case and tell Sphinx the answer.
And Sphinx eats you.
Compilers vs. μbenchmarks
It’s a fairy tale inside the fairy tale really. Microbenchmarks are both the best friends and worst foes of compilers. Nothing helps you to solve a performance problem better than a microbenchmark demonstrating it. Nothing spoils your day more than a microbenchmark that is used to derive incorrect conclusions.
As compilers get smarter it becomes harder to write a microbenchmark that would give you an answer to even the simplest question. As Robert Sheckley wrote in his short story Ask a Foolish Question: In order to ask a question you must already know most of the answer.
Back in the glorious days of interpreted JavaScript you could write:
and get the answer you were searching for. This was backed by a simple mathematics: repeat action many times in the loop, measure how much that takes and then substract amount of time needed for the empty loop itself.
\[\frac{(N \cdot C_{op} + C_{loop}) - C_{loop}}{N} \approx C_{op}\]But this mathematical tranquility ended when JIT compilers arrived from the other universe and started optimizing things.
Loop-invariant code motion
There are many ways for an optimizing compiler to spoil your calculations. One of them is the loop-invariant code motion usually abbreviated as LICM. This optimization identifies expressions that produce the same value on all loop iterations and hoists them out of the loop.
In fact many JavaScript programmers are actually performing LICM by hand when they manually move array.length
expression out of the loop condition and keep in the temporary variable during iteration.
Implementation of the LICM is usually pretty straightforward. It involves figuring out dependencies between different expressions inside the loop. For example stores can affect loads from the same location making them non-invariant if the stored value itself is non-invariant. Calls to function might potentially have completely unknown effects and so on. You can take a look at V8’s LICM pass if you’d like to see how it can be done.
[IonMonkey as well implements a variation of the classical LICM. PyPy and LuaJIT2 tracing JITs apply a different approach: they combine loop peeling with CSE to achieve effect similar to LICM. You can read more in a paper Loop-Aware Optimizations in PyPy's Tracing JIT from PyPy developers. There were some indications that JavaScriptCore might go the same route.]
Lets now take a look at the intermediate representation that V8’s optimizing compiler generates for a nicely warmed up and optimized version of CostOfLength
:
B0 t4 Parameter 1 // N Goto B1 B1 i20 Constant 0 i51 Constant 1 t19 Constant "Hello" t25 CallConstantFunction Date.now t51 StringLength t19 i142 Change t51 tagged to int i138 Change t4 tagged to int Goto B2 B2 // loop entry i32 Phi [i20, i142] // res i34 Phi [i20, i53] // i CompareIDAndBranch (i34 < i138) goto (B3, B5) B3 Goto B4 B4 StackCheck i53 Add i34 i52 // i++ Goto B2
[I prettified hydrogen manually by throwing out some irrelevant details but did not remove anything important.]
As you can see string length access was hoisted out of the loop entirely and is now sitting right above the loop entry block B2
. This turned our pretty mathematical formula for $C_{op}$ into something strange:
Hmm. This does not look right. No wonder Sphinx was not happy with the answer.
The only advice I can give here is to avoid loop invariant expressions when trying to estimate cost of operations.
Constant propagation
This is another optimization that could have agitated the Sphinx but V8 does not apply it in this case.
The idea behind this optimization is even simpler than LICM: you just fold all constant expressions until only non-constant ones that depend on some runtime values or behavior remain. There are various well known and widely used forms of this optimization for example Sparse Conditional Constant Propagation.
V8 in fact does a limited constant folding for numeric expressions during SSA graph construction phase. What would happen if it did the same for StringLength
operation?
It’s quite easy to try out by adding a simple canonicalization rule for HStringLength
IR instruction:
With this trivial 14 lines change most of which is just boilerplate code StringLength
disappears entirely from the optimized version of CostOfLength
function and mathematics of the microbenchmark is distorted further:
B0 t4 Parameter 1 // N Goto B1 B1 i20 Constant 0 i51 Constant 1 t19 Constant "Hello" t25 CallConstantFunction Date.now t51 StringLength t19 i147 Constant 5 // StringLength was evaluated during compilation! i138 Change t4 tagged to int Goto B2 B2 // loop entry i32 Phi [i20, i147] // res i34 Phi [i20, i53] // i CompareIDAndBranch (i34 < i138) goto (B3, B5) B3 Goto B4 B4 StackCheck i53 Add i34 i52 // i++ Goto B2\[\frac{(0 \cdot C_{op} + C_{loop}) - C_{loop}}{N} \approx \frac{0}{N} = 0\]
While V8 currently does not fold StringLength
away example above demonstrates that it can be easily implemented. The advice here would be to avoid constant expressions when trying to estimate cost of operations.
Dead code elimination (DCE)
DCE is another optimization that is often mentioned in connection to wrong microbenchmarking. DCE tries to eliminate operations that have no observable effect on the rest of the program execution.
If a JavaScript applied DCE to our microbenchmark then both loops would just disappear from the program. The same would happen to any expression without side effects, computations which are done but never used etc.
At the moment V8 does not perform DCE and prototyping one just to illustrate the point of the post is a bit too much for me… So I will not discuss it further. However the advice here would be: use and preferably check results of the computations inside your microbenchmarks to prevent DCE from throwing them away. Checking is especially helpful because it allows to catch bugs both in microbenchmarks and JIT compilers.
What did we actually measure?
Now that we understood why Sphinx ate us and got some tips on avoiding it in the future it is time to dig a little bit deeper.
Is there any difference between optimized code produced for the loop inside CostOfLoop
and the loop inside CostOfLengthPlusLoop
? Were we actually right when we said that CostOfLengthPlusLoop
is $(C_{op} + C_{loop})$?
Looking at the generated code indeed reveals that there is a difference:
[Confusing stack-check
inside both loops is emitted by V8 to allow interruption of long running loops. If V8 for some reason (debugging, preemption, GC request etc) needs to pause JavaScript execution it will just set stack limit (located at `0xba5adc` for the runs above) and any currently running loop will pass control to the runtime system.]
Indeed two strange moves appeared in CostOfLenthPlusLoop
. My math was wrong again!
The second one (2)
is easy to explain: it came from the φ-function for the variable res
. This variable is 0
(if loop was not executed) and is equal to str.length
after the first iteration. That’s exactly what this move does - initializes res
at the backedge of the loop.
[Attentive readers with a bit of compiler background could have noticed that if the loop was peeled then this move would disappear because `res` never changes after the first iteration.]
Now, where did (1)
come from? It stores res
into a memory location (stack slot) that is never read again inside the loop. This does not make much sense! Indeed it does not, but it’s actually a long story. It’s an artifact of the simple approach that V8 uses in its register allocator to place spill stores: if the value is spill somewhere during its lifetime then it is spilled at its definition.
In our case res
is spilled to the stack after the loop because we have a call to Date.now
. This causes both res
and i
to be stored to their spill slots again and again on loop entry. It is surely suboptimal and can be fixed (see Issue 2450 for more details).
Essentially those spill stores and stack-check
(that involves memory load) is what consumes most cycles when you time both loops. Even if str.length
was still inside the loop its cost would be quite negligible compared to the cost of the loop itself.
In any case right now overall difference between CostOfLenthPlusLoop
and CostOfLoop
is far too small for millisecond precision measurement even with N = 1e9
.
Moral?
…and μbenchmarks, Sphinx, compiler engineer and our hero lived happily ever after
I think the only way to win this game against the merciless Sphinx is to transcend the question and realize that there is no single right and simple answer to it.
The time when you could have easily guessed the cost of a single operation by simple observation is over. If you want to know something then sometimes the only way is to go and actually learn it from the inside out.
Take care and don’t fall into the Sphinx’s fairy tale trap again.