Wake up babe, new microbenchmark just dropped!
More languages, more insights!
— Ben Dicken (@BenjDicken) November 25, 2024
A few interesting takeaways:
* Java and Kotlin are quick! Possible explanation: Google is heavily invested in performance here.
* Js is really fast as far as interpreted / jit languages go.
* Python is quite slow without things like PyPy. pic.twitter.com/GIshus2UXO
This one has cool animations of balls bouncing around. It claims more insights compared to the other microbenchmark that was all the rage last week! Yay!
Sigh. 2025 is almost around the corner and people still seem to think that the goal of a benchmark is to produce a number which by itself reveals some hidden truth about the world. You write a loop in Dart, you write a loop in C. You run it three times (statistics!). You compare the running time and, lo and behold, these two numbers reveal you all you need to know about Dart and C performance. The rest of the day can be spent browsing numerological magazines for the meaning of your date of birth…
Benchmarks are not numerology. Their results are not a divine revelation. Benchmarks are experiments. Their results are meaningless without interpretation and validation. Our understanding of performance is driven by the same cyclic process that drives science in general: you formulate a hypothesis, you devise the experiment, you analyze the results, you adjust the hypothesis and repeat."Zen and The Art of Motorcycle Maintenance" has a pretty good description of scientific method, which I think every programmer should read and will likely benefit from.
I hope next time you see a cool benchmark animation on your favorite social network your first question will rather be “Why have the author spent time making the animation, instead of making an analysis of results benchmarks?”. The answer is that animations are simpler.
Anyway, the benchmark in question. I know, I know. I claimed with rightful indignation that I would not be bothered by meaningless benchmarks. Still I could not resist trying to understand what hides behind the number, because that is much more important then the number itself. The benchmark itself fits on a napkin:
import 'dart:math';
void main(List<String> args) {
final u = int.parse(args[0]); // Get an input number from the command line
final r = Random().nextInt(10000); // Get a random integer 0 <= r < 10k
final List<int> a = List.filled(10000, 0, growable: false);
for (int i = 0; i < 10000; i++) {
// 10k outer loop iterations
for (int j = 0; j < 100000; j++) {
// 100k inner loop iterations, per outer loop iteration
a[i] = a[i] + j % u; // Simple sum
}
a[i] += r; // Add a random value to each element in array
}
print(a[r]); // Print out a single element from the array
}
JavaScript version is:
The code is slightly modified compared to the original to make it directly runnable in JavaScriptCore shell: console.log
became print
and process.argv[2]
became arguments[0]
var u = Number(arguments[0]); // Get an input number from the command line
var r = Math.floor(Math.random() * 10000); // Get a random number 0 <= r < 10k
var a = new Int32Array(10000); // Array of 10k elements initialized to 0
for (let i = 0; i < 10000; i++) { // 10k outer loop iterations
for (let j = 0; j < 100000; j++) { // 100k inner loop iterations, per outer loop iteration
a[i] = a[i] + j%u; // Simple sum
}
a[i] += r; // Add a random value to each element in array
}
print(a[r]); // Print out a single element from the array
Running this on nightly versions of Dart and JavaScriptCore yields the following:You might be surprised that a former V8 engineer like me is using JSC for benchmarking here, but that's simply because JSC was fastest of two runtimes. Thus it makes sense to look at what it does.
$ dart compile exe -o dart/code.exe dart/code.dart
Generated: /Users/vegorov/src/temp/languages/loops/dart/code.exe
$ hyperfine '~/.jsvu/bin/javascriptcore js/code.js -- 40' 'dart/code.exe 40'
Benchmark 1: ~/.jsvu/bin/javascriptcore js/code.js -- 40
Time (mean ± σ): 2.079 s ± 0.026 s [User: 2.052 s, System: 0.013 s]
Range (min … max): 2.057 s … 2.145 s 10 runs
Benchmark 2: dart/code.exe 40
Time (mean ± σ): 2.809 s ± 0.023 s [User: 2.773 s, System: 0.011 s]
Range (min … max): 2.790 s … 2.860 s 10 runs
Summary
~/.jsvu/bin/javascriptcore js/code.js -- 40 ran
1.35 ± 0.02 times faster than dart/code.exe 40
Okay, that’s interesting. Maybe even disconcerting to some. JavaScript is faster
than Dart?! How is this possible? Well, the truth is JavaScript is a pretty
fast language if all you need to is loop over Int32Array
using constant
iteration bounds and add few integers together.
Leaving snark aside, the first thing to notice is that Dart and JavaScript
programs are not exactly equivalent: JavaScript is written using Int32Array
while Dart is using List<int>
which is closer to JS Array
. Lets fix that:
// ...
final a = Int64List(10000);
// ...
$ dart compile exe -o dart/code.exe dart/code.dart
Generated: /Users/vegorov/src/temp/languages/loops/dart/code.exe
$ hyperfine '~/.jsvu/bin/javascriptcore js/code.js -- 40' 'dart/code.exe 40'
Benchmark 1: ~/.jsvu/bin/javascriptcore js/code.js -- 40
Time (mean ± σ): 2.081 s ± 0.033 s [User: 2.051 s, System: 0.014 s]
Range (min … max): 2.060 s … 2.171 s 10 runs
Benchmark 2: dart/code.exe 40
Time (mean ± σ): 2.212 s ± 0.015 s [User: 2.175 s, System: 0.011 s]
Range (min … max): 2.192 s … 2.241 s 10 runs
Summary
~/.jsvu/bin/javascriptcore js/code.js -- 40 ran
1.06 ± 0.02 times faster than dart/code.exe 40
Well, that’s better. Still some difference though, so it makes sense to look at the generated code.
$ dart compile exe -o dart/code.exe \
-v \
--extra-gen-snapshot-options=--print-flow-graph-optimized \
--extra-gen-snapshot-options=--print-flow-graph-filter=dart_::_main \
--extra-gen-snapshot-options=--disassemble-optimized \
--extra-gen-snapshot-options=--code-comments \
dart/code.dart > /tmp/dart-code.asm 2>&1
$ ~/.jsvu/bin/javascriptcore --dumpDFGDisassembly=true js/code.js -- 40
Here is what Dart’s AOT compiler produces for the innermost loop:
;; B7
;; CheckStackOverflow:58(stack=0, loop=2)
loop:
ldr tmp, [thr, #56]
cmp sp, tmp
bls ->interrupt
;; Branch if RelationalOp(<, v15, v66) goto (5, 6)
movz tmp2, #0x86a0
movk tmp2, #0x1 lsl 16
cmp r4, tmp2
bge ->loop_exit
;; B5
;; v67 <- LoadIndexed:44([_Int64List] v46, v10)
add tmp, r2, r1 lsl #3
ldr r5, [tmp, #23]
;; v23 <- BinaryInt64Op(%, v15, v5)
cbz r0, 0x1036b2400
sdiv r7, r4, r0
msub r6, r7, r0, r4
cmp r6, zr
blt ->mod_slow_path
;; v24 <- BinaryInt64Op(+, v67, v23)
add r7, r5, r6
;; StoreIndexed:50([_Int64List] v46, v10, v24)
add r5, r2, r1 lsl #3
str r7, [r5, #23]
;; v25 <- BinaryInt64Op(+, v15, v68 T{_Smi})
add r5, r4, #0x1
;; ParallelMove r4 <- r5 goto:56 B7
mov r4, r5
b ->loop
JavaScriptCore does this for comparison:I have stripped information about DFG/B3/Air IL levels because it makes output too verbose for an unprepared eye. Full output for both Dart and JSC is available here
loop:
ldr w7, [x2, w9, uxtw#2]
sxtw x6, w8
mul x6, x6, x0
lsr x6, x6, #32
asr w6, w6, #4
add w6, w6, w6, lsr #31
msub w6, w6, w4, w8
tst w8, w8
b.mi ->mod_slow_path
adds w6, w6, w7
b.vs ->overflow
str w6, [x2, w9, uxtw#2]
add w8, w8, #1
cmp w8, w3
b.lt ->loop
Comparing Dart and JSC output reveals the following differences:
- Modulo operation is nowhere to be seen in JSC output, instead we have
multiplication and some bitwise arithmetic. This is possible because JSC
specialized the code a particular value of
u
. You can notice this in the DFG dump:4 0 58: D@139:< 2:-> JSConstant(..., Int32: 40, ...) ... 25 5 58: D@136:<!2:-> ArithMod(Int32:D@142, Int32:D@139, ...)
Modulo with constant divisor can be strength reduced to avoid actual division operation.
- JSC has rotate loop in such a way that there is only a single branch
- There is an interrupt checking in Dart’s code (performed by a confusingly
named
CheckStackOverflow
IL instruction) and not in JSC code. These interrupt checks are inserted to allow VM to interrupt long running code to perform various activities (e.g. OOB message delivery or GC).
The last item is the most interesting to me. I have always known that the cost
of interrupt checks inserted into all loops tends to add up if these loops are
very tight. I periodically come back to this and discuss with my colleagues
various possible optimizations we could apply here, but so far we have not
come up with something that would be fairly simple to implement and would
significantly improve the performance. Instead, I have added an escape hatch:
@pragma('vm:unsafe:no-interrupts')
can tell the compiler that it should
omit interrupt checks on the loops inside the function.
JSC uses a combination of signals and code patching to
interrupt JIT compiled code: a signal is sent to interrupt the thread, if thread
is executing JIT compiled code that code will be patched to go into runtime
system when signal handler completes. See VMTraps.cpp
and JumpReplacement::installVMTrapBreakpoint
.
For environments where code patching is not an option JSC also supports a
polling based
interruption mechanism which works by inserting explicit checks into the JIT
compiled code.
It might be tempting to just plaster this pragma on the main
, but that’s
exactly what one should not do. At least not when using this pragma in the
real world (though you probably should not actually use it in the real world).</p>
// no. NO. NO! Not a good idea.
// @pragma('vm:unsafe:no-interrupts')
void main(List<String> args) {
// ...
}
Other Dart VM specific pragmas are documented here. Those with unsafe
in their names are intended to be used with
caution - only if you fully understand the implications associated with them.
Your reasoning should go like this: main
has two nested loops, and executes
for ~2seconds. Making it uninterruptible for the whole two seconds is not a
good idea. What if VM needs to trigger a GC in some other isolate in the same
isolate group? That GC would have to wait whole 2 seconds, blocking the
whole isolate group. Not a good idea. Instead, you could do the following:
void main(List<String> args) {
// ...
@pragma('vm:unsafe:no-interrupts')
void innerLoop(int i) {
for (int j = 0; j < 100000; j++) {
// 100k inner loop iterations, per outer loop iteration
a[i] = a[i] + j % u; // Simple sum
}
}
for (int i = 0; i < 10000; i++) {
// 10k outer loop iterations
innerLoop(i);
a[i] += r; // Add a random value to each element in array
}
// ...
}
This way main
will still be interruptible in the outer loop, but there will
be no interrupt checks in the inner loop reducing uninterruptible periods to
0.2ms, which is probably just fine for most of the situations.Note that inliner fully inlines and eliminates overhead of the innerLoop
. It does not always happen - so when writing performance sensitive code and factoring it into various helpers you should always check the code which gets generated.
This version of Dart code yields the following result:
$ hyperfine '~/.jsvu/bin/javascriptcore js/code.js -- 40' 'dart/code.exe 40'
Benchmark 1: ~/.jsvu/bin/javascriptcore js/code.js -- 40
Time (mean ± σ): 2.100 s ± 0.059 s [User: 2.056 s, System: 0.016 s]
Range (min … max): 2.059 s … 2.236 s 10 runs
Benchmark 2: dart/code.exe 40
Time (mean ± σ): 2.091 s ± 0.022 s [User: 2.056 s, System: 0.010 s]
Range (min … max): 2.069 s … 2.131 s 10 runs
Summary
dart/code.exe 40 ran
1.00 ± 0.03 times faster than ~/.jsvu/bin/javascriptcore js/code.js -- 40
Which is fine by me. By the way, as described in a sidenote above JSC does also
support polling based interruption, so we could do this comparison the other
way around: keep Dart VM’s interruption checks in the code and ask JSC to add
them as well by passing --usePollingTraps=true
:
$ hyperfine '~/.jsvu/bin/javascriptcore --usePollingTraps=true js/code.js -- 40' 'dart/code.exe 40'
Benchmark 1: ~/.jsvu/bin/javascriptcore --usePollingTraps=true js/code.js -- 40
Time (mean ± σ): 2.204 s ± 0.032 s [User: 2.161 s, System: 0.015 s]
Range (min … max): 2.174 s … 2.261 s 10 runs
Benchmark 2: dart/code.exe 40
Time (mean ± σ): 2.197 s ± 0.013 s [User: 2.167 s, System: 0.008 s]
Range (min … max): 2.187 s … 2.220 s 10 runs
Summary
dart/code.exe 40 ran
1.00 ± 0.02 times faster than ~/.jsvu/bin/javascriptcore --usePollingTraps=true js/code.js -- 40
Now you ask a question: why are we comparing against JavaScript and not against a more serious competitor like C?
#include "stdio.h"
#include "stdlib.h"
#include "stdint.h"
int main (int argc, char** argv) {
int u = atoi(argv[1]); // Get an input number from the command line
int r = rand() % 10000; // Get a random integer 0 <= r < 10k
int32_t a[10000] = {0}; // Array of 10k elements initialized to 0
for (int i = 0; i < 10000; i++) { // 10k outer loop iterations
for (int j = 0; j < 100000; j++) { // 100k inner loop iterations, per outer loop iteration
a[i] = a[i] + j%u; // Simple sum
}
a[i] += r; // Add a random value to each element in array
}
printf("%d\n", a[r]); // Print out a single element from the array
}
$ clang -O3 -o c/code.exe c/code.c
$ hyperfine 'c/code.exe 40' 'dart/code.exe 40'
Benchmark 1: c/code.exe 40
Time (mean ± σ): 644.0 ms ± 11.9 ms [User: 630.9 ms, System: 3.3 ms]
Range (min … max): 635.1 ms … 671.6 ms 10 runs
Benchmark 2: dart/code.exe 40
Time (mean ± σ): 2.234 s ± 0.024 s [User: 2.186 s, System: 0.014 s]
Range (min … max): 2.189 s … 2.267 s 10 runs
Summary
c/code.exe 40 ran
3.47 ± 0.07 times faster than dart/code.exe 40
Well, that’s some serious competion!
What if I told you that you can make original Dart version (one with
List.filled
and interrupt checks) to run just 10% slower than C version with
a one line change? And on that line there would be only 4 non-whitespace
characters and a semicolon?
$ hyperfine --warmup 1 'c/code.exe 40' 'dart/code.exe 40'
Benchmark 1: c/code.exe 40
Time (mean ± σ): 638.6 ms ± 7.0 ms [User: 629.2 ms, System: 2.1 ms]
Range (min … max): 631.6 ms … 655.6 ms 10 runs
Benchmark 2: dart/code.exe 40
Time (mean ± σ): 692.4 ms ± 5.0 ms [User: 667.6 ms, System: 6.0 ms]
Range (min … max): 686.1 ms … 701.3 ms 10 runs
Summary
c/code.exe 40 ran
1.08 ± 0.01 times faster than dart/code.exe 40
Are you ready?
import 'dart:math';
void main(List<String> args) {
final u = int.parse(args[0]); // Get an input number from the command line
final r = Random().nextInt(10000); // Get a random integer 0 <= r < 10k
final a = List.filled(10000, 0, growable: false);
for (int i = 0; i < 10000; i++) {
// 10k outer loop iterations
a[i]; /* <- THIS LINE WAS ADDED */
for (int j = 0; j < 100000; j++) {
// 100k inner loop iterations, per outer loop iteration
a[i] = a[i] + j % u; // Simple sum
}
a[i] += r; // Add a random value to each element in array
}
print(a[r]); // Print out a single element from the array
}
I added a[i];
right before the inner loop and that somehow made things
massively faster. How did that happen?
What LLVM notices and what Dart does not is that a loop:
As far as I could see by inspecting output of
-mllvm --print-before-all
this happens in the LLVM's LICM pass, probably in promoteLoopAccessesToScalars
.
for (int j = 0; j < 100000; j++) {
a[i] = a[i] + j % u; // Simple sum
}
is the same as
var v = a[i];
for (int j = 0; j < 100000; j++) {
v = v + j % u; // Simple sum
}
a[i] = v;
This means inner loop no longer touches any memory which makes it much faster.
Dart compiler is relatively good at forwarding loads, but it does not perform
any load hoisting. Once we manually added a[i]
before the loop
made a[i]
inside the loop redundant. Compiler noticed that it has a[i]
value flowing on all edges and eliminated a[i]
load inside the loop:You might be tempted to try combining Int64List
change with a[i];
hint but that would not work. It turns out that Dart's load forwarding pass is very shy around forwarding stores into typed lists. It is a bug I intend to fix.
var v = a[i];
for (int j = 0; j < 100000; j++) {
v = v + j % u;
a[i] = v; // (1)
}
a[i] = v + r; // (2)
After that it noticed that a[i] = ...
at (1)
is dead because it will
be overwritten by store at (2)
and we end up with code similar to the one
LLVM was able to produce.
var v = a[i];
for (int j = 0; j < 100000; j++) {
v = v + j % u;
}
a[i] = v + r;
In reality compiler should really try to perform this optimization itself: I have occasionally encountered hot loops (e.g. lexer keeping a character position in a field) where it would be really beneficial. I am planning to take a look at this some time in the future.
I wish to conclude this post in a circular fashion, like a novel where weary hero returns home and discovers that everything, including him, have changed. Hopefully you, reader, too have changed and no longer believe that measuring a time it takes to perform 1000000000 of modulo operations and looking at that number alone somehow enlightens you about performance of some programming language.
My blog no longer has
comments section. If you want to discuss
contents of this post you can send me an email at [email protected]
or find me on Mastodon or
X.