Microbenchmarks are experiments

Wake up babe, new microbenchmark just dropped!

More languages, more insights!

A few interesting takeaways:

* Java and Kotlin are quick! Possible explanation: Google is heavily invested in performance here.
* Js is really fast as far as interpreted / jit languages go.
* Python is quite slow without things like PyPy. pic.twitter.com/GIshus2UXO
— Ben Dicken (@BenjDicken) November 25, 2024

This one has cool animations of balls bouncing around. It claims more insights compared to the other microbenchmark that was all the rage last week! Yay!

Sigh. 2025 is almost around the corner and people still seem to think that the goal of a benchmark is to produce a number which by itself reveals some hidden truth about the world. You write a loop in Dart, you write a loop in C. You run it three times (statistics!). You compare the running time and, lo and behold, these two numbers reveal you all you need to know about Dart and C performance. The rest of the day can be spent browsing numerological magazines for the meaning of your date of birth…

Benchmarks are not numerology. Their results are not a divine revelation. Benchmarks are experiments. Their results are meaningless without interpretation and validation. Our understanding of performance is driven by the same cyclic process that drives science in general: you formulate a hypothesis, you devise the experiment, you analyze the results, you adjust the hypothesis and repeat."Zen and The Art of Motorcycle Maintenance" has a pretty good description of scientific method, which I think every programmer should read and will likely benefit from.

I hope next time you see a cool benchmark animation on your favorite social network your first question will rather be “Why have the author spent time making the animation, instead of making an analysis of results benchmarks?”. The answer is that animations are simpler.

Anyway, the benchmark in question. I know, I know. I claimed with rightful indignation that I would not be bothered by meaningless benchmarks. Still I could not resist trying to understand what hides behind the number, because that is much more important then the number itself. The benchmark itself fits on a napkin:

import 'dart:math';

void main(List<String> args) {
  final u = int.parse(args[0]); // Get an input number from the command line
  final r = Random().nextInt(10000); // Get a random integer 0 <= r < 10k

  final List<int> a = List.filled(10000, 0, growable: false);
  for (int i = 0; i < 10000; i++) {
    // 10k outer loop iterations
    for (int j = 0; j < 100000; j++) {
      // 100k inner loop iterations, per outer loop iteration
      a[i] = a[i] + j % u; // Simple sum
    }
    a[i] += r; // Add a random value to each element in array
  }
  print(a[r]); // Print out a single element from the array
}

JavaScript version is: The code is slightly modified compared to the original to make it directly runnable in JavaScriptCore shell: console.log became print and process.argv[2] became arguments[0]

var u = Number(arguments[0]);           // Get an input number from the command line
var r = Math.floor(Math.random() * 10000); // Get a random number 0 <= r < 10k
var a = new Int32Array(10000);             // Array of 10k elements initialized to 0
for (let i = 0; i < 10000; i++) {          // 10k outer loop iterations
  for (let j = 0; j < 100000; j++) {       // 100k inner loop iterations, per outer loop iteration
    a[i] = a[i] + j%u;                     // Simple sum
  }
  a[i] += r;                               // Add a random value to each element in array
}
print(a[r]);                         // Print out a single element from the array

Running this on nightly versions of Dart and JavaScriptCore yields the following:You might be surprised that a former V8 engineer like me is using JSC for benchmarking here, but that's simply because JSC was fastest of two runtimes. Thus it makes sense to look at what it does.

$ dart compile exe -o dart/code.exe dart/code.dart
Generated: /Users/vegorov/src/temp/languages/loops/dart/code.exe
$ hyperfine '~/.jsvu/bin/javascriptcore js/code.js -- 40' 'dart/code.exe 40'
Benchmark 1: ~/.jsvu/bin/javascriptcore js/code.js -- 40
  Time (mean ± σ):      2.079 s ±  0.026 s    [User: 2.052 s, System: 0.013 s]
  Range (min … max):    2.057 s …  2.145 s    10 runs

Benchmark 2: dart/code.exe 40
  Time (mean ± σ):      2.809 s ±  0.023 s    [User: 2.773 s, System: 0.011 s]
  Range (min … max):    2.790 s …  2.860 s    10 runs

Summary
  ~/.jsvu/bin/javascriptcore js/code.js -- 40 ran
    1.35 ± 0.02 times faster than dart/code.exe 40

Okay, that’s interesting. Maybe even disconcerting to some. JavaScript is faster than Dart?! How is this possible? Well, the truth is JavaScript is a pretty fast language if all you need to is loop over Int32Array using constant iteration bounds and add few integers together.

Leaving snark aside, the first thing to notice is that Dart and JavaScript programs are not exactly equivalent: JavaScript is written using Int32Array while Dart is using List<int> which is closer to JS Array. Lets fix that:

  // ...
  final a = Int64List(10000);
  // ...

$ dart compile exe -o dart/code.exe dart/code.dart
Generated: /Users/vegorov/src/temp/languages/loops/dart/code.exe
$ hyperfine '~/.jsvu/bin/javascriptcore js/code.js -- 40' 'dart/code.exe 40'
Benchmark 1: ~/.jsvu/bin/javascriptcore js/code.js -- 40
  Time (mean ± σ):      2.081 s ±  0.033 s    [User: 2.051 s, System: 0.014 s]
  Range (min … max):    2.060 s …  2.171 s    10 runs

Benchmark 2: dart/code.exe 40
  Time (mean ± σ):      2.212 s ±  0.015 s    [User: 2.175 s, System: 0.011 s]
  Range (min … max):    2.192 s …  2.241 s    10 runs

Summary
  ~/.jsvu/bin/javascriptcore js/code.js -- 40 ran
    1.06 ± 0.02 times faster than dart/code.exe 40

Well, that’s better. Still some difference though, so it makes sense to look at the generated code.

$ dart compile exe -o dart/code.exe                                   \
  -v                                                                  \
  --extra-gen-snapshot-options=--print-flow-graph-optimized           \
  --extra-gen-snapshot-options=--print-flow-graph-filter=dart_::_main \
  --extra-gen-snapshot-options=--disassemble-optimized                \
  --extra-gen-snapshot-options=--code-comments                        \
  dart/code.dart > /tmp/dart-code.asm 2>&1
$ ~/.jsvu/bin/javascriptcore --dumpDFGDisassembly=true js/code.js -- 40

Here is what Dart’s AOT compiler produces for the innermost loop:

;; B7
;; CheckStackOverflow:58(stack=0, loop=2)
loop:
  ldr tmp, [thr, #56]
  cmp sp, tmp
  bls ->interrupt
;; Branch if RelationalOp(<, v15, v66) goto (5, 6)
  movz tmp2, #0x86a0
  movk tmp2, #0x1 lsl 16
  cmp r4, tmp2
  bge ->loop_exit
;; B5
;; v67 <- LoadIndexed:44([_Int64List] v46, v10)
  add tmp, r2, r1 lsl #3
  ldr r5, [tmp, #23]
;; v23 <- BinaryInt64Op(%, v15, v5)
  cbz r0, 0x1036b2400
  sdiv r7, r4, r0
  msub r6, r7, r0, r4
  cmp r6, zr
  blt ->mod_slow_path
;; v24 <- BinaryInt64Op(+, v67, v23)
  add r7, r5, r6
;; StoreIndexed:50([_Int64List] v46, v10, v24)
  add r5, r2, r1 lsl #3
  str r7, [r5, #23]
;; v25 <- BinaryInt64Op(+, v15, v68 T{_Smi})
  add r5, r4, #0x1
;; ParallelMove r4 <- r5 goto:56 B7
  mov r4, r5
  b ->loop

JavaScriptCore does this for comparison:I have stripped information about DFG/B3/Air IL levels because it makes output too verbose for an unprepared eye. Full output for both Dart and JSC is available here

loop:
  ldr      w7, [x2, w9, uxtw#2]
  sxtw     x6, w8
  mul      x6, x6, x0
  lsr      x6, x6, #32
  asr      w6, w6, #4
  add      w6, w6, w6, lsr #31
  msub     w6, w6, w4, w8
  tst      w8, w8
  b.mi     ->mod_slow_path
  adds     w6, w6, w7
  b.vs     ->overflow
  str      w6, [x2, w9, uxtw#2]
  add      w8, w8, #1
  cmp      w8, w3
  b.lt     ->loop

Comparing Dart and JSC output reveals the following differences:

Modulo operation is nowhere to be seen in JSC output, instead we have multiplication and some bitwise arithmetic. This is possible because JSC specialized the code a particular value of u. You can notice this in the DFG dump:
```
  4  0 58:  D@139:< 2:->	JSConstant(..., Int32: 40, ...)
  ...
  25  5 58:  D@136:<!2:->	ArithMod(Int32:D@142, Int32:D@139, ...)
```
Modulo with constant divisor can be strength reduced to avoid actual division operation.
JSC has rotate loop in such a way that there is only a single branch
There is an interrupt checking in Dart’s code (performed by a confusingly named CheckStackOverflow IL instruction) and not in JSC code. These interrupt checks are inserted to allow VM to interrupt long running code to perform various activities (e.g. OOB message delivery or GC).

The last item is the most interesting to me. I have always known that the cost of interrupt checks inserted into all loops tends to add up if these loops are very tight. I periodically come back to this and discuss with my colleagues various possible optimizations we could apply here, but so far we have not come up with something that would be fairly simple to implement and would significantly improve the performance. Instead, I have added an escape hatch: @pragma('vm:unsafe:no-interrupts') can tell the compiler that it should omit interrupt checks on the loops inside the function. JSC uses a combination of signals and code patching to interrupt JIT compiled code: a signal is sent to interrupt the thread, if thread is executing JIT compiled code that code will be patched to go into runtime system when signal handler completes. See VMTraps.cpp and JumpReplacement::installVMTrapBreakpoint. For environments where code patching is not an option JSC also supports a polling based interruption mechanism which works by inserting explicit checks into the JIT compiled code.

It might be tempting to just plaster this pragma on the main, but that’s exactly what one should not do. At least not when using this pragma in the real world (though you probably should not actually use it in the real world).

// no. NO. NO! Not a good idea.
// @pragma('vm:unsafe:no-interrupts')
void main(List<String> args) {
  // ...
}

Other Dart VM specific pragmas are documented here. Those with unsafe in their names are intended to be used with caution - only if you fully understand the implications associated with them.

Your reasoning should go like this: main has two nested loops, and executes for ~2seconds. Making it uninterruptible for the whole two seconds is not a good idea. What if VM needs to trigger a GC in some other isolate in the same isolate group? That GC would have to wait whole 2 seconds, blocking the whole isolate group. Not a good idea. Instead, you could do the following:

void main(List<String> args) {
  // ...
  @pragma('vm:unsafe:no-interrupts')
  void innerLoop(int i) {
    for (int j = 0; j < 100000; j++) {
      // 100k inner loop iterations, per outer loop iteration
      a[i] = a[i] + j % u; // Simple sum
    }
  }

  for (int i = 0; i < 10000; i++) {
    // 10k outer loop iterations
    innerLoop(i);
    a[i] += r; // Add a random value to each element in array
  }
  // ...
}

This way main will still be interruptible in the outer loop, but there will be no interrupt checks in the inner loop reducing uninterruptible periods to 0.2ms, which is probably just fine for most of the situations.Note that inliner fully inlines and eliminates overhead of the innerLoop. It does not always happen - so when writing performance sensitive code and factoring it into various helpers you should always check the code which gets generated.

This version of Dart code yields the following result:

$ hyperfine '~/.jsvu/bin/javascriptcore js/code.js -- 40' 'dart/code.exe 40'
Benchmark 1: ~/.jsvu/bin/javascriptcore js/code.js -- 40
  Time (mean ± σ):      2.100 s ±  0.059 s    [User: 2.056 s, System: 0.016 s]
  Range (min … max):    2.059 s …  2.236 s    10 runs

Benchmark 2: dart/code.exe 40
  Time (mean ± σ):      2.091 s ±  0.022 s    [User: 2.056 s, System: 0.010 s]
  Range (min … max):    2.069 s …  2.131 s    10 runs

Summary
  dart/code.exe 40 ran
    1.00 ± 0.03 times faster than ~/.jsvu/bin/javascriptcore js/code.js -- 40

Which is fine by me. By the way, as described in a sidenote above JSC does also support polling based interruption, so we could do this comparison the other way around: keep Dart VM’s interruption checks in the code and ask JSC to add them as well by passing --usePollingTraps=true:

$ hyperfine '~/.jsvu/bin/javascriptcore --usePollingTraps=true js/code.js -- 40' 'dart/code.exe 40'
Benchmark 1: ~/.jsvu/bin/javascriptcore --usePollingTraps=true js/code.js -- 40
  Time (mean ± σ):      2.204 s ±  0.032 s    [User: 2.161 s, System: 0.015 s]
  Range (min … max):    2.174 s …  2.261 s    10 runs

Benchmark 2: dart/code.exe 40
  Time (mean ± σ):      2.197 s ±  0.013 s    [User: 2.167 s, System: 0.008 s]
  Range (min … max):    2.187 s …  2.220 s    10 runs

Summary
  dart/code.exe 40 ran
    1.00 ± 0.02 times faster than ~/.jsvu/bin/javascriptcore --usePollingTraps=true js/code.js -- 40

Now you ask a question: why are we comparing against JavaScript and not against a more serious competitor like C?

#include "stdio.h"
#include "stdlib.h"
#include "stdint.h"

int main (int argc, char** argv) {
  int u = atoi(argv[1]);               // Get an input number from the command line
  int r = rand() % 10000;              // Get a random integer 0 <= r < 10k
  int32_t a[10000] = {0};              // Array of 10k elements initialized to 0
  for (int i = 0; i < 10000; i++) {    // 10k outer loop iterations
    for (int j = 0; j < 100000; j++) { // 100k inner loop iterations, per outer loop iteration
      a[i] = a[i] + j%u;               // Simple sum
    }
    a[i] += r;                         // Add a random value to each element in array
  }
  printf("%d\n", a[r]);                // Print out a single element from the array
}

$ clang -O3 -o c/code.exe c/code.c
$ hyperfine 'c/code.exe 40' 'dart/code.exe 40'
Benchmark 1: c/code.exe 40
  Time (mean ± σ):     644.0 ms ±  11.9 ms    [User: 630.9 ms, System: 3.3 ms]
  Range (min … max):   635.1 ms … 671.6 ms    10 runs

Benchmark 2: dart/code.exe 40
  Time (mean ± σ):      2.234 s ±  0.024 s    [User: 2.186 s, System: 0.014 s]
  Range (min … max):    2.189 s …  2.267 s    10 runs

Summary
  c/code.exe 40 ran
    3.47 ± 0.07 times faster than dart/code.exe 40

Well, that’s some serious competion!

What if I told you that you can make original Dart version (one with List.filled and interrupt checks) to run just 10% slower than C version with a one line change? And on that line there would be only 4 non-whitespace characters and a semicolon?

$ hyperfine --warmup 1 'c/code.exe 40' 'dart/code.exe 40'
Benchmark 1: c/code.exe 40
  Time (mean ± σ):     638.6 ms ±   7.0 ms    [User: 629.2 ms, System: 2.1 ms]
  Range (min … max):   631.6 ms … 655.6 ms    10 runs

Benchmark 2: dart/code.exe 40
  Time (mean ± σ):     692.4 ms ±   5.0 ms    [User: 667.6 ms, System: 6.0 ms]
  Range (min … max):   686.1 ms … 701.3 ms    10 runs

Summary
  c/code.exe 40 ran
    1.08 ± 0.01 times faster than dart/code.exe 40

Are you ready?

import 'dart:math';

void main(List<String> args) {
  final u = int.parse(args[0]); // Get an input number from the command line
  final r = Random().nextInt(10000); // Get a random integer 0 <= r < 10k

  final a = List.filled(10000, 0, growable: false);
  for (int i = 0; i < 10000; i++) {
    // 10k outer loop iterations
    a[i]; /* <- THIS LINE WAS ADDED */
    for (int j = 0; j < 100000; j++) {
      // 100k inner loop iterations, per outer loop iteration
      a[i] = a[i] + j % u; // Simple sum
    }
    a[i] += r; // Add a random value to each element in array
  }

  print(a[r]); // Print out a single element from the array
}

I added a[i]; right before the inner loop and that somehow made things massively faster. How did that happen?

What LLVM notices and what Dart does not is that a loop: As far as I could see by inspecting output of -mllvm --print-before-all this happens in the LLVM's LICM pass, probably in promoteLoopAccessesToScalars.

for (int j = 0; j < 100000; j++) {
  a[i] = a[i] + j % u; // Simple sum
}

is the same as

var v = a[i];
for (int j = 0; j < 100000; j++) {
  v = v + j % u; // Simple sum
}
a[i] = v;

This means inner loop no longer touches any memory which makes it much faster.

Dart compiler is relatively good at forwarding loads, but it does not perform any load hoisting. Once we manually added a[i] before the loop made a[i] inside the loop redundant. Compiler noticed that it has a[i] value flowing on all edges and eliminated a[i] load inside the loop:You might be tempted to try combining Int64List change with a[i]; hint but that would not work. It turns out that Dart's load forwarding pass is very shy around forwarding stores into typed lists. It is a bug I intend to fix.

var v = a[i];
for (int j = 0; j < 100000; j++) {
 v = v + j % u;
 a[i] = v;  // (1)
}
a[i] = v + r; // (2)

After that it noticed that a[i] = ... at (1) is dead because it will be overwritten by store at (2) and we end up with code similar to the one LLVM was able to produce.

var v = a[i];
for (int j = 0; j < 100000; j++) {
 v = v + j % u;
}
a[i] = v + r;

In reality compiler should really try to perform this optimization itself: I have occasionally encountered hot loops (e.g. lexer keeping a character position in a field) where it would be really beneficial. I am planning to take a look at this some time in the future.

I wish to conclude this post in a circular fashion, like a novel where weary hero returns home and discovers that everything, including him, have changed. Hopefully you, reader, too have changed and no longer believe that measuring a time it takes to perform 1000000000 of modulo operations and looking at that number alone somehow enlightens you about performance of some programming language.

Addendum: Is this a good benchmark?

I want to highlight one important thing: I think this is a very poor benchmark which does not demonstrate anything useful for the purpose of cross language performance comparison. It might reveal something interesting if you dig deep but otherwise it is just useless garbage created to farm engagement capitalizing on people’s emotions. People like to see their favorite languages do well on this sort of meaningless comparisons.

The workload it performs it not meaningful: if this is the sort of workload you need to run in your program there are other ways to implement it which would be much more meaningful. You don’t need these loops at all: you can compute the result analytically using a formula.

When you see a benchmark like this you have two options:

walk past it and ignore it;
dig deep to understand what is happening, fix some serious issues (if any) and then ignore it;

Want to discuss contents of this post? Drop me a mail [email protected] or find me on Mastodon, X or Bluesky.