this slide
was intentionally
left blank

[but you can always reach me at @mraleph and [email protected] for questions]

Any Java programmers?


class Producer {
  byte[] src;

  public void j4(Consumer dst) {
    for (int i = 0; i < src.length; i++) {
      dst.consume(src[i]);
    }
  }

  public void j5(Consumer dst) {
    for (byte b : src) {
      dst.consume(b);
    }
  }
}

Which is faster?

j4 is faster
j5 is faster
the same
depends on consume
depends on definition of "faster"

j4 is faster
j5 is faster
the same
depends on consume
depends on definition of "faster"


static void measure(String name,
                    Runnable r) {
  long start = System.currentTimeMillis();
  for (int i = 0; i < 1000; i++) r.run();
  long end = System.currentTimeMillis();
  System.out.println(
      name + " took " +
      (end - start) + " ms.");
}

Don't do this!

use JMH instead


class Consumer {
  void consume(byte b) { }
}

public static void main(String[] args) {
  final Producer p =
      new Producer(new byte[1024 * 1024 * 10]);
  final Consumer c = new Consumer();

  measure("j4", () -> p.j4(c));
  measure("j5", () -> p.j5(c));
}


class Consumer {
  void consume(byte b) { }
}

public static void main(String[] args) {
  final Producer p =
      new Producer(new byte[1024 * 1024 * 10]);
  final Consumer c = new Consumer();

  measure("j4", () -> p.j4(c));  // ⇒ 7 ms
  measure("j5", () -> p.j5(c));  // ⇒ 6 ms
}


measure("j4", () -> p.j4(c));  // ⇒ 7 ms
measure("j5", () -> p.j5(c));  // ⇒ 6 ms
measure("j4", () -> p.j4(c));  // ⇒ 0 ms (!)
measure("j5", () -> p.j5(c));  // ⇒ 0 ms (!)


class Consumer {
  void consume(byte b) {
    /* doing nothing */
  }
}

measure("j4", () -> p.j4(c));  // ⇒ 7 ms
measure("j5", () -> p.j5(c));  // ⇒ 6 ms
measure("j4", () -> p.j4(c));  // ⇒ 0 ms (!)
measure("j5", () -> p.j5(c));  // ⇒ 0 ms (!)


class Consumer {
  byte trap = 1;
  void consume(byte b) {
    if (b == trap) /* do something */
       throw new RuntimeException("impossible!");
  }
}

measure("j4", () -> p.j4(c));
measure("j5", () -> p.j5(c));


class Consumer {
  byte trap = 1;
  void consume(byte b) {
    if (b == trap) /* do something */
       throw new RuntimeException("impossible!");
  }
}

measure("j4", () -> p.j4(c));  // ⇒ 4248 ms
measure("j5", () -> p.j5(c));  // ⇒ 4233 ms


class Consumer {
  volatile byte trap = 1;
  void consume(byte b) {
    if (b == trap) /* do something heavier! */
       throw new RuntimeException("impossible!");
  }
}

measure("j4", () -> p.j4(c));
measure("j5", () -> p.j5(c));


class Consumer {
  volatile byte trap = 1;
  void consume(byte b) {
    if (b == trap) /* do something heavier! */
       throw new RuntimeException("impossible!");
  }
}

measure("j4", () -> p.j4(c));  // ⇒ 13130 ms
measure("j5", () -> p.j5(c));  // ⇒ 4738 ms

try to figure it out

this talk is about JavaScript

or read http://psy-lob-saw.blogspot.dk/2014/08/the-volatile-read-suprise.html

Any C++ programmers?


float dot(uint8_t* va, float* vb, size_t n) {
    float sum = 0.0;
    for (size_t i = 0; i < n; i++) {
      sum += va[i] * vb[i];
    }
    return sum;
}

int main(int argc, char* argv[]) {
  /*
   * call dot() on 2^11 long vectors 2^21 times
   */
}

$ gcc -m32 -O2 ./test.cc -o ./t32 && time ./t32
./t32 4.23s user 0.00s system 100% cpu 4.231 total
$ gcc -m64 -O2 ./test.cc -o ./t64 && time ./t64
./t64 12.16s user 0.00s system 100% cpu 12.155 total

bug in GCC <4.9

generated code hits partial register dependency stall


float dot(uint8_t* va, float* vb, size_t n) {
    // ...
      sum += I2F(va[i]) * vb[i];
    // ...
}

float __attribute__((always_inline)) I2F(int i) {
    float f;
    __asm__("xorps %0, %0;"
            "cvtsi2ss %1, %0"
            : "=X"(f)
            : "r"(i));
    return f;
}

now JavaScript

we say performance
we mean jsperf.com


// split
testString.split("e").length - 1;

// while
var count = 0;
while (testString.length) {
  if (testString.charAt(0) === "e") {
      count += 1;
   }
   testString = testString.slice(1);
}


// split (IT'S OVER 9000K OP/S ON FF!)
testString.split("e").length - 1;

// while
var count = 0;
while (testString.length) {
  if (testString.charAt(0) === "e") {
      count += 1;
   }
   testString = testString.slice(1);  // WAT?
}

DOUBT

EVERYTHING

µbench 101

the cost of `OP`?


function benchmark() {
  var start = Date.now();
  /* OP */
  return (Date.now() - start) / N;
}

what if `OP` faster than clock?


function benchmark(N) {
  var start = Date.now();
  for (var i = 0; i < N; i++) {
    /* OP */
  }
  return (Date.now() - start) / N;
}

\[C\]

\[C \times N\]

\[\frac{C \times N}{N}\]

\[\frac{C \times \cancel{N}}{\cancel{N}}\]







    while (testString.length) {
      if (testString.charAt(0) === "e") {
          count += 1;
       }
       testString = testString.slice(1);
    }


function benchmark() {
  var testString = "...";

  var start = Date.now();
  for (var i = 0; i < N; i++) {
    var count = 0;
    while (testString.length) {
      if (testString.charAt(0) === "e") {
          count += 1;
       }
       testString = testString.slice(1);
    }
  }
  return (Date.now() - start) / N;
}


function benchmark() {
  var testString = "...";

  var start = Date.now();
  for (var i = 0; i < N; i++) {
    var count = 0;
    while (testString.length) {
      if (testString.charAt(0) === "e") {
          count += 1;
       }
       testString = testString.slice(1);
    }
  }
  return (Date.now() - start) / N;
}

repetitions start with `testString === ""`

\[C \times N\]

\[C \times 1 + 0 \times (N - 1)\]

\[\frac{C}{N}\]

\[\frac{C}{N} \approx 0\]

jsperf misuse

`OP` should take the same time

when repeatedly executed

«`~~` for the win!»

NOPE

JITs are clever

they optimize your program as it runs

\[C \times N\]

\[C_u N_u + C_o N_o\]

[unoptimized + optimized version]

\[{\sum C_{i} N_i}\]

[multiple unoptimized and optimized versions]

\[{\sum C_{i} N_i} + \color{#c1357a}{\frac{1}{\sqrt{2\pi}}\int C(\xi) e ^ {-\frac{\xi^2}{2}} d\xi}\]

[math gets too complicated to approach analytically]

JITs are clever

to measure you need to outsmart

Disclaimer: On slides I will show optimizations as source-to-source transformations. In reality compilers operate on more sophisticated representations.


function benchmark() {
  var i = '1561565', j;
  var start = Date.now();
  for (var n = 0; n < N; n++) {
    j = ~~i;
  }
  return (Date.now() - start) / N;
}


function benchmark() {
  var j;
  var start = Date.now();
  for (var n = 0; n < N; n++) {
    j = ~~'1561565';
  }
  return (Date.now() - start) / N;
}


function benchmark() {
  var j;
  var start = Date.now();
  for (var n = 0; n < N; n++) {
    j = ~-1561566;
  }
  return (Date.now() - start) / N;
}


function benchmark() {
  var j;
  var start = Date.now();
  for (var n = 0; n < N; n++) {
    j = 1561565;
  }
  return (Date.now() - start) / N;
}

constant propagation

«hey, I can trick the compiler!»


function benchmark() {
  var i = Date.now().toString(), j;
  //  ^ not a constant any more
  var start = Date.now();
  for (var n = 0; n < N; n++) {
    j = ~~i;
  }
  return (Date.now() - start) / N;
}

«Wrong answer, McFly!»


function benchmark() {
  var i = Date.now().toString(), j;
  var start = Date.now();
  for (var n = 0; n < N; n++) {
    j = ~~i;
    //  ^^^ loop invariant
  }
  return (Date.now() - start) / N;
}


function benchmark() {
  var i = Date.now().toString(), j;
  var start = Date.now();
  var j0 = ~~i;
  //  ^^^^^^^^ hoisted from the loop
  for (var n = 0; n < N; n++) {
    j = j0;
  }
  return (Date.now() - start) / N;
}

loop invariant code motion


function benchmark() {
  var i = Date.now().toString(), j;
  var start = Date.now();
  var j0 = ~~i;
  for (var n = 0; n < N; n++) {
    j = j0;
  }
  return (Date.now() - start) / N;
}


function benchmark() {
  var i = Date.now().toString(), j;
  var start = Date.now();
  var j0 = ~~i;
  for (var n = 0; n < N; n++) {
    j = j0;
  }
  return (Date.now() - start) / N;
}


function benchmark() {
  var i = Date.now().toString(), j;
  var start = Date.now();
  var j0 = ~~i;
  for (var n = 0; n < N; n++) {
    j = j0;
  }
  return (Date.now() - start) / N;
}


function benchmark() {
  var i = Date.now().toString(), j;
  var start = Date.now();
  var j0 = ~~i;
  for (var n = 0; n < N; n++) {
    j = j0;
  }
  return (Date.now() - start) / N;
}


function benchmark() {
  var start = Date.now();
  for (var n = 0; n < N; n++) {
    /* sound of silence */
  }
  return (Date.now() - start) / N;
}


function benchmark() {
  var start = Date.now();
  /*
   * sound of silence
   */
  return (Date.now() - start) / N;
}

dead code elimination

[most engines won't go as far as shown with it,
but it is possible]


// Remember this one?
// split (IT'S OVER 9000K OP/S ON FF!)
testString.split("e").length - 1;
// it's full of dead code.

optimizer eats µbenchmarks for breakfast

chew proof µbenchmarks?

do not try
to write them

avoid constants

avoid loop invariants

avoid dead code

verify results


function benchmark() {
  var i = Date.now().toString(),
      i1 = Date.now().toString(), t;
  var j;
  var start = Date.now();
  for (var n = 0; n < N; n++) {
    t = i; i = i1; i1 = t;
    j = ~~i;
  }
  if (i != j || i1 != j) throw "whoa?";
  return (Date.now() - start) / N;
}

this is as good
as it gets

still not enough

(potentially)


for (var n = 0; n < N; n++) {
  t = i; i = i1; i1 = t;
  j = ~~i;
}


for (var n = 0; n < (N / 2); n++) {
  t = i; i = i1; i1 = t;
  j = ~~i;
  t = i; i = i1; i1 = t;
  j = ~~i;              
}
if ((N % 2) === 1) {
  t = i; i = i1; i1 = t;
  j = ~~i;
}


for (var n = 0; n < (N / 2); n++) {
  j = ~~i1;
  t = i1;
  j = ~~i;
}


for (var n = 0; n < (N / 2); n++) {
  j = ~~i1; /* dead */
  t = i1;   /* dead */
  j = ~~i;  /* invariant */
}

loop unrolling

V8 doesn't do it

but I want to induce paranoia ☺

«good speedup in Chrome 32!»

«IT'S A KIND OF MAGIC!»

WRONG


function benchmark() {
  function fn(a, b, c, d) {
    return a + b + c + d;
  }

  var start = new Date;
  for (var n = 0; n < N; n++) {
    fn('a', 'b', 'c', 'd');
  }
  return (new Date - start);
}


function benchmark() {
  function fn(a, b, c, d) {
    return a + b + c + d;
  }

  var start = new Date;
  for (var n = 0; n < N; n++) {
    fn('a', 'b', 'c', 'd');
  }
  return (new Date - start);
}


function benchmark() {
  function fn(a, b, c, d) {
    return a + b + c + d;
  }

  var start = new Date;
  for (var n = 0; n < N; n++) {
    fn('a', 'b', 'c', 'd');
//  ^^ why, yes, sir, I know callee.
  }
  return (new Date - start);
}


function benchmark() {
  function fn(a, b, c, d) {
    return a + b + c + d;
  }

  var start = new Date;
  for (var n = 0; n < N; n++) {
    fn('a', 'b', 'c', 'd');
//  ^^ it's right there
  }
  return (new Date - start);
}


function benchmark() {
  function fn(a, b, c, d) {
    return a + b + c + d;
  }

  var start = new Date;
  for (var n = 0; n < N; n++) {
    fn('a', 'b', 'c', 'd');
//  ^^ I take it there and put it here.
  }
  return (new Date - start);
}


function benchmark() {
  function fn(a, b, c, d) {
    return a + b + c + d;
  }

  var start = new Date;
  for (var n = 0; n < N; n++) {
    'a' + 'b' + 'c' + 'd';
//  ^^ DONE
  }
  return (new Date - start);
}


function benchmark() {
  function fn(a, b, c, d) {
    return a + b + c + d;
  }

  var start = new Date;
  for (var n = 0; n < N; n++) {

    /* aaaaand it is eliminated as DEAD code */
  }
  return (new Date - start);
}

inlining

Looks simple!

Why the difference?


load("lodash.js");
load("benchmark.js");

Benchmark.prototype.setup = function() {
  function fn(a, b, c, d) {
    return a + b + c + d;
  }
};

var suite = new Benchmark.Suite;
suite
  .add('invoke', function () { fn('a', 'b', 'c', 'd'); })
  .on('cycle', function (event) { print(event.target); })
  .run();

$ v8/3.19.18.23/out/ia32.release/d8 test.js
invoke x 24,417,452 ops/sec ±1.13%

$ v8/3.22.4/out/ia32.release/d8 test.js
invoke x 774,162,347 ops/sec ±0.27%

$ v8/3.19.18.23/out/ia32.release/d8    \
      --print-opt-code --code-comments \
      --trace-hydrogen --trace-phase=Z \
      --trace-deopt                    \
      test.js > code.asm

+ IRHydra

It was optimized
correctly

It was optimized almost correctly

OSR entry tries to convert non-number to integer


var start = new Date;
//  ^^^^^ compiler wrongly speculates
//        that it is an integer
for (var n = 0; n < N; n++) {
  /* nothing */
}
return (new Date - start);

It was V8's bug.

but it looked on benchmark
like function calls were slow

More traps here.


function benchmark() {
  function fn(a, b, c, d) {
    return a + b + c + d;
  }

  var start = new Date;
  for (var n = 0; n < N; n++) {
    fn('a', 'b', 'c', 'd');
  }
  return (new Date - start);
}

benchmark(); // => 2 ms
benchmark(); // => 37 ms

Related to how V8 collects type feedback and guards inlined functions

µbench vs. VM

$ d8 concat.js

  
    Inside 
    x 198,893,290 ops/sec ±1.08%
  
  
    Outside
    x 674,248,118 ops/sec ±1.08%
  
Fastest is Outside

«quick! move all your strings out of functions»

time to look
at the code

we expect both variants to be DCEd

IRHydra²

all samples from this talk
are demos within the tool

why `Outside` is fast(er)?

hmm, hard to see the forest behind the trees

lets activate an "interesting IR" mode

"uninteresting" IR instructions are hidden
source is spliced into IR
loops are marked with red stripes on the left

By the way: that was core loop of our benchmark and it was EMPTY

only iteration variable increment and loop itself remained

`Outside` is fast
at doing nothing

V8 failed to constant fold

[actually a bug]

lets fix it!


--- src/hydrogen.cc (revision 21348)
+++ src/hydrogen.cc (working copy)
@@ -9639,6 +9639,14 @@
       return left;
     }

+    if (FLAG_fold_constants &&
+        left->IsConstant() && HConstant::cast(left)->HasStringValue() &&
+        right->IsConstant() && HConstant::cast(right)->HasStringValue()) {
+      return AddUncasted<HStringAdd>(
+          left, right, allocation_mode.GetPretenureMode(),
+          STRING_ADD_CHECK_NONE, allocation_mode.feedback_site());
+    }
+
     // Register the dependent code with the allocation site.
     if (!allocation_mode.feedback_site().is_null()) {
       ASSERT(!graph()->info()->IsStub());

$ d8 concat.js

  
    Inside 
    x 758,530,478 ops/sec ±1.82%
  
  
    Outside
    x 674,248,118 ops/sec ±1.14%
  
Fastest is Inside

both are doing nothing!

presumption of performance

reasonable code
reasonably fast

confirmation bias

«prototype chains are slow»


var obj =
  Object.create(
    Object.create(
      Object.create(
        Object.create(
          Object.create({prop: 10})))));


var obj =
  Object.create(
    Object.create(
      Object.create(
        Object.create(
          Object.create({prop: 10})))));
                  // LISP tribute ^^^^^


function doManyLookups() {
  var counter = 0;
  for(var i = 0; i < 1000; i++)
    for(var j = 0; j < 1000; j++)
      for(var k = 0; k < 1000; k++)
        counter += obj.prop;
  print('In total: ' + counter);
}


function lookupAndCache() {
  var counter = 0;
  var value = obj.prop;
  for(var i = 0; i < 1000; i++)
    for(var j = 0; j < 1000; j++)
      for(var k = 0; k < 1000; k++)
        counter += value;
  print('In total: ' + counter);
}


// State of art benchmark driver.
function measure(f) {
  var start = Date.now();
  f();
  var end = Date.now();
  print(f.name + ' took ' +
        (end - start) + ' ms.');
}


measure(doManyLookups);
measure(lookupAndCache);

$ node prototype.js
In total: 10000000000
doManyLookups took 8243 ms.
In total: 10000000000
lookupAndCache took 1058 ms.

CONFIRMED

lets make it harder


-   Object.create({prop: 10})))));
+   Object.create({ get prop () { return 10 }})))));

$ node prototype.js
In total: 10000000000
doManyLookups took 1082 ms.
In total: 10000000000
lookupAndCache took 1061 ms.

«what kind of voodoo is this?»

getter was completely inlined

all prototype traversal checks were hoisted

data property is handled in a generic way

through an inline cache

old V8 did not inline loads from data properties
defined on prototypes

trying newer V8

$ d8 prototype.js
In total: 10000000000
doManyLookups took 1294 ms.
In total: 10000000000
lookupAndCache took 1189 ms.

$ d8 prototype.js
In total: 10000000000
doManyLookups took 1294 ms.
In total: 10000000000
lookupAndCache took 1189 ms.

prototype chain traversal got LICMed

... and now for something completely different

what if we run benchmark twice?


measure(doManyLookups);
measure(doManyLookups);
measure(lookupAndCache);
measure(lookupAndCache);

$ d8 prototype.js | grep took
doManyLookups took 1301 ms.
doManyLookups took 3408 ms.
lookupAndCache took 1204 ms.
lookupAndCache took 3406 ms.

what just
happened here?

turns out: 'In total: ' + counter type-feedback leaks upwards into the loop and causes excessive boxing of the counter variable

solution: hide `+` from representation inference


-   print('In total: ' + counter);
+   print('In total: ' + counter.toString());

$ d8 prototype.js | grep took
doManyLookups took 1298 ms.
doManyLookups took 1119 ms.
lookupAndCache took 1188 ms.
lookupAndCache took 982 ms.

«huh?»

now desert!

method call
vs.
function call


function mk(word) {
    var len = word.length;
    if (len > 255) return undefined;
    var i = len >> 2;
    return String.fromCharCode(
        (word.charCodeAt(    0) & 0x03) << 14 |
        (word.charCodeAt(    i) & 0x03) << 12 |
        (word.charCodeAt(  i+i) & 0x03) << 10 |
        (word.charCodeAt(i+i+i) & 0x03) <<  8 |
        len
    );
}


Benchmark.prototype.setup = function() {
   function mk(word) {
        /* ... */
    }

    var MK = function() { };
    MK.prototype.mk = mk;
    var mker = new MK;
};


suite
  .add('Function', function() {
    var key = mk('www.wired.com');
    key = mk('www.youtube.com');
    key = mk('scorecardresearch.com');
    key = mk('www.google-analytics.com');
  })
  .add('Method', function() {
    var key = mker.mk('www.wired.com');
    key = mker.mk('www.youtube.com');
    key = mker.mk('scorecardresearch.com');
    key = mker.mk('www.google-analytics.com');
  })

$ d8 method-vs-function.js

  
    Function
    x   4,149,776 ops/sec ±0.62%
  
  
    Method
    x 682,273,122 ops/sec ±0.72%
  
Fastest is Method

so method call IS faster?

keep secret what I am going to show you now


--- a/method-function.js
+++ b/method-function.js
@@ -2,6 +2,9 @@
 load("../benchmark.js");

 Benchmark.prototype.setup = function() {
+   "Speed" + "your" + "JS" + "with" +
+   "this" + "one" + "weird" + "trick";
+
    function mk(word) {
         var len = word.length;
         if (len > 255) return undefined;

$ d8 method-vs-function.js

  
    Function
    x 695,708,197 ops/sec ±0.38%
  
  
    Method
    x 692,496,013 ops/sec ±0.29%
  
Fastest is Function,Method

`Function` was never optimized!

— heuristics that decide when to optimize are based (among other things) on the amount of initialized inline caches

— function call does not go through an IC, but method call does

— Method had enough initialized ICs to trigger optimizations, but Function didn't!

— Until fake + ICs were added in the setup section

now `Function` is optimized!

... and it's body is almost completely LICMed

SOURCE view is easier

measuring almost empty loop again!

µbench NEJ TAK

"Given a string of space separated words `[a-z]*` find most common one."


function count(str) {
  var words = str.split(' ');
  var counts = {};
  var maxWord, maxCount = 0;
  var word, count;

  for (var i = 0; i < words.length; i++) {
    word = words[i];
    /* count word here */
  }

  return maxWord;
}


count = counts[word];
if (typeof count === "undefined") {
  counts[word] = count = 1;
} else {
  counts[word] = count + 1;
}

if (count > maxCount) {
  maxCount = count;
  maxWord = word;
}

# on 1.5Mb / 255k words string
$ d8 count.js
naive x 13.51 ops/sec ±2.87%

In `C++` we would just update the cell.


var cell = counts[word];
if (typeof cell === "undefined") {
  count = 1;
  counts[word] = { value: count };
} else {
  cell.value = count = cell.value + 1;
}

# on 1.5Mb / 255k words string
$ d8 count.js
naive x 13.51 ops/sec ±2.87%
cells x 20.42 ops/sec ±2.51%

Rethink the foundations!


function count(str) {
  var words = str.split(' ');
  var counts = {};
  var maxWord, maxCount = 0;
  var word, count;

  for (var i = 0; i < words.length; i++) {
    word = words[i];
    /* count word here */
  }

  return maxWord;
}

stream it char by char into a trie


function count() {
  // Idea 1: compute substring at the very end.
  var maxCount = 0, maxWordStart, maxWordEnd,
      lastSpace = 0;

  // Idea 2: instead of hash-table use simplistic trie.
  var root = { val: 0 };
  var curr = root;

  for (var i = 0; i < str.length; i++) {
    var ch = str.charCodeAt(i);
    if (ch === 32) { // if space increase the count;
    } else { // if not space advance in the trie;
    }
  }

  return str.substring(maxWordStart + 1, maxWordEnd);
}


// handling non-space: find or create target node.
var next = curr[ch];
if (typeof next === 'undefined') {
  next = curr[ch] = { val: 0 };
}
curr = next;


// handling space: increase node's counter.
var count = curr.val = curr.val + 1;
if (count > maxCount) {
  maxCount     = count;
  maxWordStart = lastSpace;
  maxWordEnd   = i;
}
curr = root; // restart from the root
lastSpace = i;

# on 1.5Mb / 255k words string
$ d8 count.js
naive x 13.51 ops/sec ±2.87%
cells x 20.42 ops/sec ±2.51%
trie  x 61.46 ops/sec ±0.48%

Inside	x 198,893,290 ops/sec ±1.08%
Outside	x 674,248,118 ops/sec ±1.08%

Inside	x 758,530,478 ops/sec ±1.82%
Outside	x 674,248,118 ops/sec ±1.14%

Function	x 4,149,776 ops/sec ±0.62%
Method	x 682,273,122 ops/sec ±0.72%

Function	x 695,708,197 ops/sec ±0.38%
Method	x 692,496,013 ops/sec ±0.29%

this slide was intentionally left blank

[but you can always reach me at @mraleph and [email protected] for questions]

Any Java programmers?

Which is faster?

Don't do this!

use JMH instead

try to figure it out

this talk is about JavaScript

Any C++ programmers?

bug in GCC <4.9

generated code hits partial register dependency stall

now JavaScript

we say performance we mean jsperf.com

DOUBT

EVERYTHING

µbench 101

the cost of OP?

what if OP faster than clock?

\[C\]

\[C \times N\]

\[\frac{C \times N}{N}\]

\[\frac{C \times \cancel{N}}{\cancel{N}}\]

repetitions start with testString === ""

\[C \times N\]

\[C \times 1 + 0 \times (N - 1)\]

\[\frac{C}{N}\]

\[\frac{C}{N} \approx 0\]

jsperf misuse

OP should take the same time

when repeatedly executed

«~~ for the win!»

NOPE

JITs are clever

they optimize your program as it runs

\[C \times N\]

\[C_u N_u + C_o N_o\]

[unoptimized + optimized version]

\[{\sum C_{i} N_i}\]

[multiple unoptimized and optimized versions]

\[{\sum C_{i} N_i} + \color{#c1357a}{\frac{1}{\sqrt{2\pi}}\int C(\xi) e ^ {-\frac{\xi^2}{2}} d\xi}\]

[math gets too complicated to approach analytically]

JITs are clever

to measure you need to outsmart

Disclaimer: On slides I will show optimizations as source-to-source transformations. In reality compilers operate on more sophisticated representations.

constant propagation

«hey, I can trick the compiler!»

«Wrong answer, McFly!»

loop invariant code motion

dead code elimination

[most engines won't go as far as shown with it,but it is possible]

optimizer eats µbenchmarks for breakfast

chew proof µbenchmarks?

do not try to write them

avoid constants avoid loop invariants avoid dead code verify results

this is as goodas it gets

still not enough

(potentially)

loop unrolling

V8 doesn't do it

but I want to induce paranoia ☺

«good speedup in Chrome 32!»

«IT'S A KIND OF MAGIC!»

WRONG

inlining

Looks simple!

Why the difference?

+ IRHydra

It was optimizedcorrectly

It was optimized almost correctly

OSR entry tries to convert non-number to integer

It was V8's bug.

but it looked on benchmarklike function calls were slow

More traps here.

Related to how V8 collects type feedback and guards inlined functions

µbench vs. VM

«quick! move all your strings out of functions»

time to lookat the code

we expect both variants to be DCEd

IRHydra2

all samples from this talkare demos within the tool

this slide
was intentionally
left blank

we say performance
we mean jsperf.com

the cost of `OP`?

what if `OP` faster than clock?

repetitions start with `testString === ""`

`OP` should take the same time

«`~~` for the win!»

[most engines won't go as far as shown with it,
but it is possible]

do not try
to write them

avoid constants

avoid loop invariants

avoid dead code

verify results

this is as good
as it gets

It was optimized
correctly

but it looked on benchmark
like function calls were slow

time to look
at the code

IRHydra²

all samples from this talk
are demos within the tool

why `Outside` is fast(er)?

`Outside` is fast
at doing nothing

reasonable code
reasonably fast

old V8 did not inline loads from data properties
defined on prototypes

what just
happened here?

solution: hide `+` from representation inference

method call
vs.
function call

`Function` was never optimized!

now `Function` is optimized!

"Given a string of space separated words `[a-z]*` find most common one."

In `C++` we would just update the cell.

never assume that
some language feature
has to be slow

talk to JS VM teams
and report bugs