In the past few months I have started receiving more and more questions about performance of some specific Dart operations. Here is an example of such a question asked by Romain Rastel in the context of his work on improving performance of ChangeNotifier in Flutter.
Looks like creating a fixed-length list with a small number of items, can be, sometimes a lot less performant than creating a growable list. pic.twitter.com/B5opjZkmrX
— Romain Rastel 💙 (@lets4r) November 30, 2020
Given my experience I knew exactly what was going wrong in this particular benchmark after the very first glance… but for the sake of storytelling let me pretend that I did not. How would I approach this then?
I would normally start by trying to reproduce the reported numbers. In this particular case that means creating an empty Flutter application
$ flutter create ubench
$ cd ubench
then in lib/benchmark.dart
I’d put the following code. (This code was taken from Romain’s gist with a minor fix: in the original version benchmark names were swapped by accident, so the fixed-length benchmark was allocating growable lists and vice versa.)
// ubench/lib/benchmark.dart
import 'package:benchmark_harness/benchmark_harness.dart';

abstract class Benchmark extends BenchmarkBase {
  const Benchmark(String name) : super(name);

  @override
  void exercise() {
    for (int i = 0; i < 100000; i++) {
      run();
    }
  }
}

class GrowableListBenchmark extends Benchmark {
  const GrowableListBenchmark(this.length) : super('growable[$length]');

  final int length;

  @override
  void run() {
    List<int>()..length = length;
  }
}

class FixedLengthListBenchmark extends Benchmark {
  const FixedLengthListBenchmark(this.length) : super('fixed-length[$length]');

  final int length;

  @override
  void run() {
    List(length);
  }
}

void main() {
  const GrowableListBenchmark(32).report();
  const FixedLengthListBenchmark(32).report();
}
and finally I’d run it in the release mode
$ flutter run --release -t lib/benchmark.dart
...
I/flutter (18126): growable[32](RunTime): 31464.890625 us.
I/flutter (18126): fixed-length[32](RunTime): 713.8279800142756 us
The result seems to show fixed length lists being 43 times faster to allocate than growable lists. Should we leave it at that and head over to refactor our code to use as many fixed-length lists as possible?
Absolutely not… or at least not with an expectation that our code will become 43 times faster. It does actually make sense to prefer fixed-length lists over growable lists where fixed-length lists are a natural fit: they have a slightly smaller memory footprint, are faster to allocate and involve fewer indirections to access an element. But you should make this choice deliberately, based on a clear understanding of how things work, and not based on raw, uninterpreted results of microbenchmarks.
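As a rule of thumb, reach for a fixed-length list when the size is known up front and never changes, and for a growable list otherwise. A small illustration (not code from the benchmark itself):

// A lookup table whose size never changes: a natural fit for a
// fixed-length list.
final squares = List<int>.generate(256, (i) => i * i, growable: false);

// A buffer that accumulates incoming values: a natural fit for a
// growable list.
final received = <int>[];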
Drawing conclusions from raw microbenchmark numbers without any
sort of critical analysis is a common pitfall associated with
microbenchmarking and we should do our best to avoid falling into it.
Unfortunately package:benchmark_harness does not make it easier to avoid such pitfalls: it provides developers with a way to write microbenchmarks, but it does not give them tools or guidance on how to validate their benchmarks and interpret the results. To make things worse, package:benchmark_harness does not even attempt to make it straightforward to write an accurate microbenchmark.
Consider for example that I could have written this list benchmark in the
following way, without overriding exercise
to repeat run
100000
times:
// ubench/lib/benchmark-without-exercise.dart
import 'package:benchmark_harness/benchmark_harness.dart';

// Just using BenchmarkBase directly. Rest is the same.
class GrowableListBenchmark extends BenchmarkBase {
  // ...
}

// Just using BenchmarkBase directly. Rest is the same.
class FixedLengthListBenchmark extends BenchmarkBase {
  // ...
}
Running this variant would show that growable lists are only 6 times slower than fixed-length lists:
$ flutter run --release -t lib/benchmark-without-exercise.dart
I/flutter (14407): growable[32](RunTime): 1.8629797056305768 us.
I/flutter (14407): fixed-length[32](RunTime): 0.3052065645659146 us.
Which benchmark result should I trust? Neither of them really! I should look under the hood and try to understand what exactly is happening.
Flutter and Dart already provide developers with enough tooling to figure out why benchmark numbers look this way. Unfortunately some of this tooling is somewhat obscure and hard to discover.
For example, it is well known that you can use flutter run --profile to profile your application with Observatory; however, it is not well known that you can also profile release builds using native profilers (like simpleperf on Android or Instruments on iOS). Similarly, it is barely known (most likely not known at all outside of the group of engineers working on the VM) that you can dump annotated disassembly of a specific method from an AOT build by doing
$ flutter build apk --extra-gen-snapshot-options=--print-flow-graph,\
--disassemble,\
--print-flow-graph-filter=FixedLengthListBenchmark.run
I could spend the rest of this post explaining how one could use these tools to understand what exactly is going on in these list benchmarks, but instead I would like to try and imagine how an integrated tooling for benchmarking could be built out of the primitives provided by Dart and Flutter. This tooling should not only run benchmarks, but also automatically provide enough insight for a developer to spot mistakes they made during benchmarking and help them interpret the results.
Preliminary setup
I have forked benchmark_harness
package into mraleph/benchmark_harness
on GitHub. All of my prototype code is going to live in a new experimental-cli
branch in the fork.
From here on I will document the evolution of this experimental benchmarking CLI. I would like to stress the highly experimental nature of this tooling: as you will notice, some of its features end up depending on patches to Dart and Flutter SDK internals. It might be weeks or months before these patches land and it becomes possible to simply merge my changes into the upstream version of the harness.
I started by adding a trivial bin/benchmark_harness.dart
script which would
serve as an entry point to our new benchmarking tooling.
$ git clone [email protected]:mraleph/benchmark_harness.git
$ cd benchmark_harness
$ cat > bin/benchmark_harness.dart
void main() { print('Running benchmarks...'); }
^D
Finally I changed pubspec.yaml in the ubench project (remember, it is the empty Flutter project we created to host our benchmarks) to have a path dependency on my version of benchmark_harness
# ubench/pubspec.yaml
dependencies:
  # ...
  benchmark_harness:
    path: ../benchmark_harness
  # ...
This allows me to run the benchmark_harness script while in the ubench project directory
$ flutter pub get
$ flutter pub run benchmark_harness
Running benchmarks...
Generating benchmarks
Have you ever looked at how benchmark_harness
runs your benchmarks?
It turns out that this package is doing something rather simple (and to an extent naive): it starts a Stopwatch, then repeatedly calls exercise until 2 seconds elapse according to the stopwatch. The elapsed time divided by the number of times exercise was called is the reported benchmark score. Take a look yourself:
// benchmark_harness/lib/src/benchmark_base.dart
abstract class BenchmarkBase {
  // Measures the score for the benchmark and returns it.
  double measure() {
    // ...
    // Run the benchmark for at least 2000ms.
    var result = measureFor(exercise, 2000);
    // ...
  }

  // Exercises the benchmark. By default invokes [run] 10 times.
  void exercise() {
    for (var i = 0; i < 10; i++) {
      run();
    }
  }

  // Measures the score for this benchmark by executing it repeatedly until
  // time minimum has been reached.
  static double measureFor(Function f, int minimumMillis) {
    var minimumMicros = minimumMillis * 1000;
    var iter = 0;
    var watch = Stopwatch();
    watch.start();
    var elapsed = 0;
    while (elapsed < minimumMicros) {
      f();
      elapsed = watch.elapsedMicroseconds;
      iter++;
    }
    return elapsed / iter;
  }
}
This code unfortunately has an issue which makes it unsuitable for microbenchmarking: the measured loop has a bunch of overhead unrelated to the exercise itself. Most noticeably, it gets the current time from the OS on each and every iteration. There is also the overhead associated with multiple levels of virtual dispatch between the measured loop and the body of the run method containing the actual operation we want to measure. There was a PR against benchmark_harness which tried to address the issue of calling Stopwatch.elapsedMilliseconds too often, but it somehow got stuck in limbo despite being approved.
The best way to avoid these overheads is to have a separate measured loop for each benchmark.
Here is how this could look. The user declares microbenchmarks by writing top-level functions marked with the @benchmark annotation.
// ubench/lib/main.dart
import 'package:benchmark_harness/benchmark_harness.dart';

const N = 32;

@benchmark
void allocateFixedArray() {
  List.filled(N, null, growable: false);
}

@benchmark
void allocateGrowableArray() {
  List.filled(N, null, growable: true);
}
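The @benchmark annotation itself can be a trivial marker that the code generator looks for, something along these lines (a sketch; the exact file in the fork may differ):

// benchmark_harness/lib/benchmark_harness.dart (sketch)

/// Marker annotation: the code generator emits a measured loop for every
/// top-level function annotated with [benchmark].
class Benchmark {
  const Benchmark();
}

const benchmark = Benchmark();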
Benchmarking tooling would then generate an auxiliary source file which contains a measured loop for each benchmark, plus some code to select which benchmarks should run at compile time:
// ubench/lib/main.benchmark.dart
import 'package:benchmark_harness/benchmark_runner.dart' as benchmark_runner;
import 'package:ubench/main.dart' as lib;

// ...

void _$measuredLoop$allocateFixedArray(int numIterations) {
  while (numIterations-- > 0) {
    lib.allocateFixedArray();
  }
}

// ...

const _targetBenchmark =
    String.fromEnvironment('targetBenchmark', defaultValue: 'all');
const _shouldMeasureAll = _targetBenchmark == 'all';
const _shouldMeasure$allocateFixedArray =
    _shouldMeasureAll || _targetBenchmark == 'allocateFixedArray';

// ...

void main() {
  benchmark_runner.runBenchmarks(const {
    // ...
    if (_shouldMeasure$allocateFixedArray)
      'allocateFixedArray': _$measuredLoop$allocateFixedArray,
    // ...
  });
}
The actual measurement would happen in a simple measure
helper function:
// benchmark_harness/lib/benchmark_runner.dart

/// Runs the given measured [loop] function with exponentially increasing
/// parameter values until it finds one that causes [loop] to run for at
/// least [thresholdMilliseconds] and returns a [BenchmarkResult] describing
/// that run.
BenchmarkResult measure(void Function(int) loop,
    {required String name, int thresholdMilliseconds = 5000}) {
  var n = 2;
  final sw = Stopwatch();
  do {
    n *= 2;
    sw.reset();
    sw.start();
    loop(n);
    sw.stop();
  } while (sw.elapsedMilliseconds < thresholdMilliseconds);
  return BenchmarkResult(
    name: name,
    elapsedMilliseconds: sw.elapsedMilliseconds,
    numIterations: n,
  );
}
We are starting with a very simple implementation, which should nevertheless satisfy our initial microbenchmarking needs. However, for more complex cases we might want to do something a bit more rigorous: for example, once a large enough numIterations is found, we could repeat loop(numIterations) multiple times and assess statistical properties of the observed running times, as sketched below.
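For example, such a follow-up measurement could look something like this (the measureStatistics name and the Stats class are mine, not part of the harness):

import 'dart:math' show sqrt;

class Stats {
  Stats(this.mean, this.stdDev);
  final double mean;   // microseconds per iteration
  final double stdDev; // spread between repeated runs
}

// Sketch: once a [numIterations] large enough to run for a few seconds is
// found, repeat loop(numIterations) several times and look at the spread
// of the per-iteration cost rather than at a single number.
Stats measureStatistics(void Function(int) loop, int numIterations,
    {int samples = 10}) {
  final perIteration = <double>[];
  final sw = Stopwatch();
  for (var i = 0; i < samples; i++) {
    sw
      ..reset()
      ..start();
    loop(numIterations);
    sw.stop();
    perIteration.add(sw.elapsedMicroseconds / numIterations);
  }
  final mean = perIteration.reduce((a, b) => a + b) / perIteration.length;
  final variance = perIteration
          .map((t) => (t - mean) * (t - mean))
          .reduce((a, b) => a + b) /
      perIteration.length;
  return Stats(mean, sqrt(variance));
}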
Using source_gen
To generate main.benchmark.dart we need to parse main.dart and find all functions annotated with the @benchmark annotation. Fortunately Dart has a number of canonical tools for code generation which make this really easy. All I had to do was depend on package:source_gen and define a subclass of GeneratorForAnnotation:
// benchmark_harness/lib/src/benchmark_generator.dart
class BenchmarkGenerator extends GeneratorForAnnotation<Benchmark> {
  // ...

  @override
  String generateForAnnotatedElement(
      Element element, ConstantReader annotation, BuildStep buildStep) {
    final name = element.name;
    return '''
void _\$measuredLoop\$${name}(int numIterations) {
  while (numIterations-- > 0) {
    lib.${name}();
  }
}
''';
  }
}
which I then wrapped in a Builder
// benchmark_harness/lib/builder.dart
Builder benchmarkLibraryBuilder(BuilderOptions options) =>
LibraryBuilder(BenchmarkGenerator(), generatedExtension: '.benchmark.dart');
and finally exposed this Builder
factory through build.yaml
# benchmark_harness/build.yaml
builders:
  benchmark:
    import: "package:benchmark_harness/builder.dart"
    builder_factories: ["benchmarkLibraryBuilder"]
    build_extensions: {".dart": [".benchmark.dart"]}
    auto_apply: dependents
    build_to: source
That was basically it. Now whenever I run build_runner build
in ubench
I will
get lib/main.benchmark.dart
generated for benchmarks defined in lib/main.dart
:
ubench $ flutter pub run build_runner build
ubench $ ls lib/*.benchmark.dart
lib/main.benchmark.dart
You can see the full source of the code generator in this commit.
Running benchmarks
We can execute individual benchmarks by passing appropriate --dart-define
to flutter run
, for example:
$ flutter run --release --dart-define targetBenchmark=allocateFixedArray -t lib/main.benchmark.dart
Launching lib/main.benchmark.dart on Pixel 3a in release mode...
Running Gradle task 'assembleRelease'...
Running Gradle task 'assembleRelease'... Done 4.9s
âś“ Built build/app/outputs/flutter-apk/app-release.apk (4.9MB).
Installing build/app/outputs/flutter-apk/app.apk... 1,268ms
Flutter run key commands.
h Repeat this help message.
c Clear the screen
q Quit (terminate the application on the device).
I/flutter (12463): benchmark_harness[{"event":"benchmark.running"}]
I/flutter (12463): benchmark_harness[{"event":"benchmark.result","params":{...}}]
I/flutter (12463): benchmark_harness[{"event":"benchmark.done"}]
Application finished.
But doing this manually is not exactly what I was aiming for. Instead I am going to change the bin/benchmark_harness.dart script to build the benchmarks and then run all generated files to collect benchmark results (for the full code see this commit).
// benchmark_harness/bin/benchmark_harness.dart
void main() async {
  // ...

  // Generate benchmark wrapper scripts.
  print(red('Generating benchmark wrappers'));
  'flutter pub run build_runner build'.start(progress: Progress.devNull());

  // Run all generated benchmarks.
  final resultsByFile = <String, Map<String, BenchmarkResult>>{};
  for (var file in find('*.benchmark.dart').toList().map(p.relative)) {
    resultsByFile[file] = await runBenchmarksIn(file);
  }

  // Report results.
  // ...
}

/// Runs all benchmarks in `.benchmark.dart` [file] one by one and collects
/// their results.
Future<Map<String, BenchmarkResult>> runBenchmarksIn(String file) async {
  // ...
}
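The elided body of runBenchmarksIn is essentially: launch flutter run --release for the generated file and scrape the benchmark_harness[...] events from its output. A simplified sketch is below; the real code additionally discovers benchmark names and launches each one separately via --dart-define, and BenchmarkResult.fromJson is assumed to exist:

import 'dart:convert';
import 'dart:io';

// Sketch: run a generated benchmark file on the device and collect the
// JSON events printed by runBenchmarks().
Future<Map<String, BenchmarkResult>> runBenchmarksIn(String file) async {
  final results = <String, BenchmarkResult>{};
  final process =
      await Process.start('flutter', ['run', '--release', '-t', file]);
  final marker = RegExp(r'benchmark_harness\[(.*)\]');
  await for (final line in process.stdout
      .transform(utf8.decoder)
      .transform(const LineSplitter())) {
    final match = marker.firstMatch(line);
    if (match == null) continue;
    final event = jsonDecode(match.group(1)!) as Map<String, dynamic>;
    if (event['event'] == 'benchmark.result') {
      final result = BenchmarkResult.fromJson(event['params']);
      results[result.name] = result;
    } else if (event['event'] == 'benchmark.done') {
      process.kill(); // all benchmarks have finished running
    }
  }
  return results;
}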
Such a helper script makes running benchmarks really straightforward:
$ flutter pub run benchmark_harness
Generating benchmark wrappers
Found 2 benchmarks in lib/main.benchmark.dart
measuring allocateFixedArray
benchmark is running
done
measuring allocateGrowableArray
benchmark is running
done
--------------------------------------------------------------------------------
Results for lib/main.benchmark.dart
allocateFixedArray: 0.0000030226074159145355 ms/iteration (fastest)
allocateGrowableArray: 0.00018900632858276367 ms/iteration (62.5 times as slow)
Understanding the results
Now that we have a tool for running microbenchmarks, let's extend it with support for profiling benchmarks as they are running. This will help us understand where a benchmark is spending its time and confirm that it is measuring exactly what we want it to measure.
Flutter’s release builds exclude Dart’s builtin profiler so we will have
to use a native profiler instead, for example simpleperf
on Android.
Android has comprehensive documentation for using simpleperf
, which I am not going to duplicate here. simpleperf
also comes
with C++ (and Java) code called app_api
which can be linked into an application to allow programmatic access to the profiler.
In reality app_api does not do anything overly fancy: it just runs the simpleperf binary with the right command line options. That's why I decided to simply port the relevant parts of app_api to pure Dart. We could also bind to the C++ version of app_api using Dart FFI, but that would require packaging this C++ code as a Flutter plugin, which complicates things, because benchmark_harness is a pure Dart package and it can't depend on a Flutter plugin package.
// benchmark_harness/lib/src/simpleperf/profiling_session.dart
class ProfilingSession {
  Future<void> start(
      {RecordingOptions options = const RecordingOptions()}) async {
    // ...
    await _startSimpleperfProcess(options);
  }

  Future<void> _startSimpleperfProcess(RecordingOptions options) async {
    final simpleperfBinary = await _findSimplePerf();
    _simpleperf = await Process.start(
      simpleperfBinary,
      [
        'record',
        '--log-to-android-buffer',
        '--log',
        'debug',
        '--stdio-controls-profiling',
        '--in-app',
        '--tracepoint-events',
        '/data/local/tmp/tracepoint_events',
        '-o',
        options.outputFilename ?? _makeOutputFilename(),
        '-e',
        options.event,
        '-f',
        options.frequency.toString(),
        '-p',
        _getpid().toString(),
        ..._callgraphFlagsFrom(options),
      ],
      workingDirectory: simpleperfDataDir,
    );
    // ...
  }
}
Then I adjusted benchmark_runner.dart to run each benchmark it has just measured under the profiler and save the profile into a perf-$benchmarkName.data file, created in the application's data directory:
Future<void> runBenchmarks(Map<String, void Function(int)> benchmarks) async {
  _event('benchmark.running');
  final profiler = Platform.isAndroid ? ProfilingSession() : null;
  for (var entry in benchmarks.entries) {
    final result = measure(entry.value, name: entry.key);
    _event('benchmark.result', result);
    if (profiler != null) {
      // Run benchmark for the same amount of iterations and profile it.
      await profiler.start(
          options: RecordingOptions(outputFilename: 'perf-${entry.key}.data'));
      entry.value(result.numIterations);
      await profiler.stop();
    }
  }
  _event('benchmark.done');
}
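The _event helper used above is just a thin wrapper that prints a JSON payload wrapped in the benchmark_harness[...] marker we saw in the flutter run output earlier. A possible implementation could look like this (a sketch; it assumes BenchmarkResult has a toJson() method so that jsonEncode can serialize it):

import 'dart:convert';

// Sketch: emit a machine-readable event into the device log so that the
// CLI driving `flutter run` can pick it up.
void _event(String event, [Object? params]) {
  final payload = <String, Object?>{
    'event': event,
    if (params != null) 'params': params, // relies on params.toJson()
  };
  print('benchmark_harness[${jsonEncode(payload)}]');
}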
The NDK also comes with a helper script, api_profiler.py, which implements two commands:
- api_profiler.py prepare configures your device for profiling - we are going to call it before running benchmarks;
- api_profiler.py collect pulls collected profiles from the device - we are going to call it after all benchmarks finish running to pull all generated perf-*.data files from the device (see the sketch after this list for how the CLI could invoke the script).
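In the benchmark_harness CLI these two steps boil down to a couple of process invocations around the benchmark runs. Here is a minimal sketch; the location of api_profiler.py inside the NDK and the helper name are my assumptions, not part of the actual harness:

import 'dart:io';

/// Sketch: invoke the NDK's api_profiler.py helper. The script's location
/// ($ndkPath/simpleperf/api_profiler.py) is illustrative and may differ
/// between NDK versions.
Future<void> runApiProfiler(String ndkPath, String command) async {
  final script = '$ndkPath/simpleperf/api_profiler.py';
  final result = await Process.run('python3', [script, command]);
  if (result.exitCode != 0) {
    throw Exception('api_profiler.py $command failed: ${result.stderr}');
  }
}

// Usage around the benchmark runs:
//   await runApiProfiler(ndkPath, 'prepare');  // before running benchmarks
//   ... run all *.benchmark.dart files ...
//   await runApiProfiler(ndkPath, 'collect');  // pulls perf-*.data files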
Displaying collected profiling data
The NDK’s simpleperf binary supports both record and report commands, just like Linux perf. Looking around in the NDK I also discovered a bunch of helper scripts written in Python (e.g. report_html.py, which can generate an HTML report). Peeking into those scripts I discovered that they make use of the libsimpleperf_report.so library, which handles parsing and symbolization of collected profiles. The API for this library is defined at the top of the simpleperf/report_lib_interface.cpp file in simpleperf sources.
Using ffigen I generated dart:ffi based bindings for this library, allowing me to use it from the benchmark_harness script to process collected profiling samples:
final reportLib = report_bindings.NativeLibrary(
    ffi.DynamicLibrary.open(ndk.simpleperfReportLib));

Future<void> _printProfile(String profileData) async {
  final session = reportLib.CreateReportLib();
  reportLib.SetRecordFile(session, Utf8.toUtf8(profileData).cast());

  // Iterate over all collected samples.
  for (;;) {
    final sample = reportLib.GetNextSample(session);
    if (sample == ffi.nullptr) {
      break;
    }
    final period = sample.ref.period;
    final symbol = reportLib.GetSymbolOfCurrentSample(session);
    final dsoName = Utf8.fromUtf8(symbol.ref.dso_name.cast());
    final symbolName = Utf8.fromUtf8(symbol.ref.symbol_name.cast());

    // Process sample for the symbol [symbolName] in dso [dsoName] and collect
    // aggregate statistics (samples per symbol, total sampling period, etc).
    // ...
  }

  // Report top N hottest symbols.
}
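The aggregation elided in the loop above can be as simple as summing the sampling period per symbol and then sorting. A sketch with illustrative names; the map key combines the symbol and DSO name, matching the report format shown below:

// Sketch: inside the sample loop above, accumulate
//   periodBySymbol['$symbolName ($dsoName)'] += period;
// and afterwards print the hottest entries.
void reportHottestSymbols(Map<String, int> periodBySymbol, {int top = 10}) {
  final totalPeriod =
      periodBySymbol.values.fold<int>(0, (sum, period) => sum + period);
  final entries = periodBySymbol.entries.toList()
    ..sort((a, b) => b.value.compareTo(a.value));
  for (final entry in entries.take(top)) {
    final percent = (entry.value * 100 / totalPeriod).toStringAsFixed(2);
    print('${percent.padLeft(6)}% ${entry.key}');
  }
}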
When I ran this for the first time I discovered that simpleperf can't really attribute most of the samples to a meaningful symbol, neither for libapp.so (which contains the AOT compiled Dart code) nor for libflutter.so (which contains the Flutter engine code). Here is the very first report I got:
Hot methods when running allocateGrowableArray:
88.24% _kDartIsolateSnapshotInstructions (libapp.so)
4.04% unknown (libflutter.so)
3.15% unknown ([kernel.kallsyms])
1.44% pthread_mutex_lock (libc.so)
1.30% pthread_mutex_unlock (libc.so)
...
This is not surprising: both of these libraries are stripped and don't contain any useful symbol information for simpleperf to use.
Fortunately, libflutter.so symbols can be fetched from the Cloud Storage bucket where the build infrastructure archives them, e.g. symbols for an ARM64 Android release build of the Flutter engine at commit e115066d... reside in gs://flutter_infra/flutter/e115066d.../android-arm64-release/symbols.zip. Just a few months ago I wrote some Dart code for downloading and caching Flutter Engine symbols based on a commit hash for @flutter-symbolizer-bot, so I could just reuse the very same code here.
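Fetching those symbols boils down to downloading and unpacking a single zip archive per engine build. A minimal sketch, assuming the gs:// bucket is reachable through the public storage.googleapis.com endpoint (the function name is mine):

import 'dart:io';

// Sketch: download symbols.zip for the given engine commit hash and build
// variant (e.g. 'android-arm64-release') and unpack it into [targetDir].
Future<void> downloadEngineSymbols(String engineHash, String variant,
    {required String targetDir}) async {
  final url = Uri.parse('https://storage.googleapis.com/flutter_infra/'
      'flutter/$engineHash/$variant/symbols.zip');
  final client = HttpClient();
  final request = await client.getUrl(url);
  final response = await request.close();
  final zipFile = File('$targetDir/symbols.zip');
  await response.pipe(zipFile.openWrite());
  client.close();
  // Unpacking is delegated to `unzip` here; package:archive would work too.
  await Process.run('unzip', ['-o', zipFile.path, '-d', targetDir]);
}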
Getting symbols for libapp.so is a more interesting problem. The Dart VM AOT compiler is capable of producing DWARF debug sections in the ELF binaries it generates. However just passing --extra-gen-snapshot-options=--dwarf-stack-traces does not get us there: libapp.so will still be missing symbols. Inspection of the Flutter CLI source reveals that it simply always instructs the AOT compiler to produce stripped output. Users familiar with advanced Flutter tooling options might know about the --split-debug-info=... flag, which instructs the AOT compiler to produce a separate ELF file containing just the DWARF sections but none of the snapshot payload itself. This file would be enough for symbolization, but unfortunately it does not play well with tools like llvm-objdump. GNU objdump supports separate debug info discovered through .gnu_debuglink, but some features, like intermixing source and disassembly, do not seem to work in this mode.
For prototyping purposes I opted to patch flutter
CLI to allow a user to
suppress stripping by specifying --no-strip
in --extra-gen-snapshot-options
.
Now when we build our benchmark we can preserve DWARF debug information in
libapp.so
and subsequently use it for symbolization, making our profile
more informative:
Hot methods when running allocateGrowableArray:
54.17% Precompiled_Stub__iso_stub_AllocateArrayStub (libapp.so)
14.29% Precompiled______measuredLoop_allocateGrowableArray_21146765_1230 (libapp.so)
8.01% Precompiled__GrowableList_0150898__GrowableList_0150898__withData_0150898_136 (libapp.so)
7.63% Precompiled__GrowableList_0150898__GrowableList_0150898__164 (libapp.so)
4.96% Precompiled__GrowableList_0150898__allocateData_0150898_161 (libapp.so)
3.66% unknown ([kernel.kallsyms])
I then took it one step further and used the information available in DWARF to pretty print symbol names (which simpleperf does not seem to do). Fortunately I did not actually have to parse DWARF by hand: package:native_stack_traces already has all the necessary tooling for ELF/DWARF parsing:
String _userFriendlyName(Dwarf dwarf, elf_lib.Elf elf, String symbol) {
  final elfSymbol = elf.staticSymbolFor(symbol);
  if (elfSymbol != null) {
    final callInfo = dwarf.callInfoFor(elfSymbol.value);
    if (callInfo != null && callInfo.isNotEmpty) {
      final lastInfo = callInfo.last;
      if (lastInfo is DartCallInfo) {
        return lastInfo.function
            .replaceFirst(r'_$measuredLoop$', 'measured loop for ');
      }
      return lastInfo.toString();
    }
  }
  return symbol.replaceFirst('Precompiled_Stub__iso_stub_', 'Stub::');
}
This yields even more readable output:
$ flutter pub run benchmark_harness report
Results for lib/main.benchmark.dart
allocateFixedArray: 0.000006573274731636047 ms/iteration (fastest)
allocateGrowableArray: 0.00020492076873779297 ms/iteration (31.2 times as slow)
Hot methods when running allocateFixedArray:
99.64% measured loop for allocateFixedArray (libapp.so)
Hot methods when running allocateGrowableArray:
54.17% Stub::AllocateArrayStub (libapp.so)
14.29% measured loop for allocateGrowableArray (libapp.so)
8.01% new _GrowableList._withData (libapp.so)
7.63% new _GrowableList (libapp.so)
4.96% _GrowableList._allocateData (libapp.so)
3.66% unknown ([kernel.kallsyms])
We can now see that the two benchmarks behave rather differently: allocateFixedArray
spends all of its time in its measured loop, while
allocateGrowableArray
spends time in various methods related to array
allocation.
To understand this better we need to look at the native code generated for both of these measured loops.
There are different possible ways to implement this, e.g. I could choose to delegate this task to perf annotate or something similar. However perf is rather picky with regard to the folder structure when searching for symbols, and none of the NDK scripts seemed to fit the bill, so I opted to just use llvm-objdump and annotate its output with profiling information.
With some post-processing to use Dart VM specific names for reserved registers like thr (reserved for the current Thread pointer) and to demangle symbol names (via the _userFriendlyName helper), we get output like this:
Hot methods when running allocateFixedArray:
99.64% measured loop for allocateFixedArray (libapp.so)
0: stp fp, lr, [sp, #-16]!
4: mov fp, sp
8: ldr x2, [fp, #16]
7.87% c: ldr x16, [thr, #64]
18.40% 10: cmp sp, x16
9.30% 14: b.ls ->60
8.88% 18: cmp x2, null
1c: b.eq ->68
9.65% 20: asr x3, x2, #1
24: tbz w2, #0, ->2c
28: ldur x3, [x2, #7]
9.45% 2c: sub x2, x3, #1
30: cmp x3, #0
9.85% 34: b.le ->50
9.72% 38: adds x0, x2, x2
3c: b.vc ->48
40: bl Stub::AllocateMintSharedWithoutFPURegsStub
44: stur x2, [x0, #7]
8.90% 48: mov x2, x0
7.60% 4c: b.al ->c
50: mov x0, null
54: mov sp, fp
58: ldp fp, lr, [sp], #16
5c: ret
60: bl Stub::StackOverflowSharedWithoutFPURegsStub
64: b.al ->18
68: bl Stub::NullErrorSharedWithoutFPURegsStub
A developer familiar with ARM assembly language might be able to spot the issue with our benchmark just from this output (yep, there is obviously an issue).
However I decided to spend a bit more time on it and overlay the information about Dart VM compiler’s intermediate language on top of the assembly, making it much easier to grok.
Internally our compiler has support for annotating machine code it generates with human readable comments, which among other things describe IL from which a particular piece of machine code was produced.
I have experimented with two different paths for getting this information out of the AOT compiler:
- add a flag --write-code-comments-to=output.json which tells the compiler to dump the offset-to-comment mapping into a JSON file, and then ingest this JSON in our benchmarking CLI;
- add a flag --write-code-comments-as-synthetic-source-to=comments.txt which tells the compiler to synthesize one gigantic file out of all code comments and write the offset-to-comment mapping as a DWARF line program into the .debug_line section.
In the end I decided that the second approach is better because it makes it possible for any tool that understands .debug_line to display code comments alongside disassembly (e.g. gdb will start displaying them as well).
Here is my patch against the Dart SDK implementing the necessary plumbing. With this in place the only change I had to make was to add -S to the llvm-objdump invocation and it handled the rest:
Hot methods when running allocateFixedArray:
99.64% measured loop for allocateFixedArray (libapp.so)
;; Enter frame
0: stp fp, lr, [sp, #-16]!
4: mov fp, sp
;; ParallelMove r2 <- S+2
8: ldr x2, [fp, #16]
;; CheckStackOverflow:30(stack=0, loop=1)
7.87% c: ldr x16, [thr, #64]
18.40% 10: cmp sp, x16
9.30% 14: b.ls ->60
;; CheckNull:10(v3, NoSuchMethodError) T{int}
8.88% 18: cmp x2, null
1c: b.eq ->68
;; v26 <- UnboxInt64(v3 T{int}) T{int}
9.65% 20: asr x3, x2, #1
24: tbz w2, #0, ->2c
28: ldur x3, [x2, #7]
;; v6 <- BinaryInt64Op(- [tr], v26 T{int}, v32) T{int}
9.45% 2c: sub x2, x3, #1
;; Branch if RelationalOp(>, v26 T{int}, v34) T{bool} goto (3, 5)
30: cmp x3, #0
9.85% 34: b.le ->50
;; v30 <- BoxInt64(v6) T{int}
9.72% 38: adds x0, x2, x2
3c: b.vc ->48
40: bl Stub::AllocateMintSharedWithoutFPURegsStub
44: stur x2, [x0, #7]
;; ParallelMove r2 <- r0 goto:28 B4
8.90% 48: mov x2, x0
7.60% 4c: b.al ->c
;; ParallelMove r0 <- C
50: mov x0, null
;; Return:36(v0)
54: mov sp, fp
58: ldp fp, lr, [sp], #16
5c: ret
;; CheckStackOverflowSlowPath
60: bl Stub::StackOverflowSharedWithoutFPURegsStub
64: b.al ->18
;; slow path check null (nsm) operation
68: bl Stub::NullErrorSharedWithoutFPURegsStub
Hot methods when running allocateGrowableArray:
54.17% Stub::AllocateArrayStub (libapp.so)
14.29% measured loop for allocateGrowableArray (libapp.so)
;; Enter frame
0: stp fp, lr, [sp, #-16]!
4: mov fp, sp
8: sub sp, sp, #16
;; CheckStackOverflow:8(stack=0, loop=0)
c: ldr x16, [thr, #64]
10: cmp sp, x16
14: b.ls ->94
;; v36 <- UnboxedConstant:32(#32) [32, 32] T{_Smi}
18: mov x0, #32
;; ParallelMove r1 <- S+2, S-2 <- r0
1c: ldr x1, [fp, #16]
20: stur x0, [fp, #-16]
;; CheckStackOverflow:30(stack=0, loop=1)
24: ldr x16, [thr, #64]
0.73% 28: cmp sp, x16
2c: b.ls ->9c
;; CheckNull:10(v3, NoSuchMethodError) T{int}
0.33% 30: cmp x1, null
34: b.eq ->a4
;; v26 <- UnboxInt64(v3 T{int}) T{int}
0.30% 38: asr x2, x1, #1
3c: tbz w1, #0, ->44
40: ldur x2, [x1, #7]
;; v6 <- BinaryInt64Op(- [tr], v26 T{int}, v32) T{int}
0.35% 44: sub x1, x2, #1
;; ParallelMove S-1 <- r1
48: stur x1, [fp, #-8]
;; Branch if RelationalOp(>, v26 T{int}, v34) T{bool} goto (3, 5)
10.41% 4c: cmp x2, #0
50: b.le ->84
;; PushArgument(v18)
0.29% 54: ldr x16, [pp, #5160]
0.35% 58: stp x0, x16, [sp, #-16]!
;; StaticCall:10( _GrowableList@0150898.<0> v18, v36, result_type = T{_GrowableList})
5c: bl new _GrowableList
0.33% 60: add sp, sp, #16
;; ParallelMove r2 <- S-1
64: ldur x2, [fp, #-8]
;; v30 <- BoxInt64(v6) T{int}
0.53% 68: adds x0, x2, x2
6c: b.vc ->78
70: bl Stub::AllocateMintSharedWithoutFPURegsStub
74: stur x2, [x0, #7]
;; ParallelMove r1 <- r0, r0 <- S-2 goto:28 B4
0.29% 78: mov x1, x0
7c: ldur x0, [fp, #-16]
0.38% 80: b.al ->24
;; ParallelMove r0 <- C
84: mov x0, null
;; Return:36(v0)
88: mov sp, fp
8c: ldp fp, lr, [sp], #16
90: ret
;; CheckStackOverflowSlowPath
94: bl Stub::StackOverflowSharedWithoutFPURegsStub
98: b.al ->18
;; CheckStackOverflowSlowPath
9c: bl Stub::StackOverflowSharedWithoutFPURegsStub
a0: b.al ->30
;; slow path check null (nsm) operation
a4: bl Stub::NullErrorSharedWithoutFPURegsStub
8.01% new _GrowableList._withData (libapp.so)
7.63% new _GrowableList (libapp.so)
4.96% _GrowableList._allocateData (libapp.so)
3.66% unknown ([kernel.kallsyms])
..(run with -v to disassemble all hot methods in libapp.so)..
Now it should be visible in the output that allocateFixedArray does not actually contain any array allocation: it got eliminated (sunk) by the compiler, so our benchmark seems to be comparing the performance of an empty while (N-- > 0); loop to the performance of a loop that actually does some array allocations.
The empty loop however is far from trivial. It seems to contain two sources of overhead by itself:
- Variable numIterations seems to be unboxed and reboxed on each loop iteration, as evident from the following sequence of instructions that implements numIterations--:

  ;; CheckNull:10(v3, NoSuchMethodError) T{int}
  18: cmp x2, null
  1c: b.eq ->68
  ;; v26 <- UnboxInt64(v3 T{int}) T{int}
  20: asr x3, x2, #1
  24: tbz w2, #0, ->2c
  28: ldur x3, [x2, #7]
  ;; v6 <- BinaryInt64Op(- [tr], v26 T{int}, v32) T{int}
  2c: sub x2, x3, #1
  <...>
  ;; v30 <- BoxInt64(v6) T{int}
  38: adds x0, x2, x2
  3c: b.vc ->48
  40: bl Stub::AllocateMintSharedWithoutFPURegsStub
  44: stur x2, [x0, #7]

- Loop header contains a CheckStackOverflow instruction which performs a memory load and a comparison on each loop iteration:

  ;; CheckStackOverflow:30(stack=0, loop=1)
  24: ldr x16, [thr, #64]
  28: cmp sp, x16
  2c: b.ls ->9c
For accurate microbenchmarking we should certainly make an effort to reduce the overhead of the measured loop to a bare minimum.
Preventing boxing of numIterations
void _$measuredLoop$allocateGrowableArray(int numIterations) {
  while (numIterations-- > 0) {
    lib.allocateGrowableArray();
  }
}
Why is numIterations kept boxed by the compiler? There are multiple factors at play here. Most importantly, the compiler is simply unable to prove that numIterations is not null on entry to the function, which happens because we use a tear-off of the measured loop method to invoke it. TFA (type flow analysis, our global type propagation algorithm) does not attempt to infer precise type information for parameters of closures. (If we were running in sound non-nullable by default (NNBD) mode then the compiler would also be able to rely on the fact that numIterations can never be null. This illustrates one of the benefits of NNBD: it gives the compiler more information to work with.) Things would be different if our code was structured using classes, for example:
abstract class MeasuredLoop {
  void run(int numIterations);
}

class Loop$allocateGrowableArray extends MeasuredLoop {
  void run(int numIterations) =>
      _$measuredLoop$allocateGrowableArray(numIterations);
}

void main() async {
  await benchmark_runner.runBenchmarks(const {
    // Use class instead of tearing off _$measuredLoop$allocateGrowableArray
    'allocateGrowableArray': Loop$allocateGrowableArray(),
  });
}
Then the compiler would be able to infer that numIterations is never nullable, and not only unbox numIterations locally inside run and _$measuredLoop$..., but even alter the calling conventions of these two functions to pass the parameter around without boxing.
The reason why this works for methods and does not (yet) work for closures is that a statically typed method call gives the compiler more information about potential callees than a statically typed closure call.
class Loop$allocateGrowableArray extends MeasuredLoop {
  void run(int numIterations) {
    // Can only be reached from a call-site where
    // static type of a receiver is either [dynamic] or
    // related to [Loop$allocateGrowableArray].
  }
}

MeasuredLoop loop;
loop.run(...);  // Can invoke only overrides of [MeasuredLoop.run].

// ---------------------------------------------------------

var g = (int numIterations) {
  // Can be reached from any closure call-site with a compatible
  // function type and invocation signature.
};

void Function(int) f;
f(...);  // Can reach any compatible closure in the program.
Right now TFA does a very conservative approximation assuming that any closure could be invoked from anywhere else. This certainly could be improved: for example, TFA could use static type information to narrow potential call-sites for each closure. This can be taken even further - TFA could attempt to track the flow of closures through the program and try to establish which call-sites it could reach, but that’s a more sophisticated analysis.
Given that TFA can't infer numIterations as never containing null, we might want to help it by adding an explicit check at the start of the measured loop wrapper:
void _$measuredLoop$allocateGrowableArray(int numIterations) {
  if (numIterations == null) throw 'Unexpected null!';
  while (numIterations-- > 0) {
    lib.allocateGrowableArray();
  }
}
Unfortunately this does not help: even though the compiler infers that numIterations is never null inside the loop, it still does not unbox it, because its unboxing algorithm is overly conservative around int typed variables. The current set of heuristics only unboxes integer variables if all reaching values are either boxing operations or constants. This is probably a bug that needs to be fixed, but in the meantime we could simply satisfy the requirement ourselves, as shown below. (I am allowing myself a slight imprecision when talking about unboxing variables here: the compiler does not actually unbox variables, because it operates on an SSA form which does not contain any operations with variables anymore. It would be more precise to talk about unboxing phis, but I did not want to confuse readers without a compiler background too much.)
void _$measuredLoop$allocateGrowableArray(int numIterations) {
  // n will be unboxed because it is either
  // Box(Unbox(numIterations) + 0) or Box(Unbox(n) - 1)
  var n = numIterations + 0;
  while (n-- > 0) {
    lib.allocateGrowableArray();
  }
}
Eliminating CheckStackOverflow
CheckStackOverflow instructions inserted by the compiler have a dual purpose: they check for stack overflow (as their name implies), but they also serve as interruption points, allowing the VM to cleanly interrupt a thread executing Dart code. This mechanism can, for example, be used by the GC to park the mutator thread at a safepoint. CheckStackOverflow checks are rather fast: they consist of a memory load, which usually hits the CPU cache, and a compare-and-branch, which is almost never taken. However they can still have a visible cost in very tight loops.
As an experiment I changed the VM to completely eliminate CheckStackOverflow instructions from functions marked with @pragma('vm:no-interrupts').
With the CheckStackOverflow instruction the empty loop takes 3.56 ns per iteration, and without it 1.78 ns per iteration, which is basically twice as fast.
The final shape of an empty measuring loop is:
;; v9 <- BinaryInt64Op(- [tr], v6 T{int}, v36) T{int}
24: sub x0, x1, #1
;; Branch if RelationalOp(>, v6 T{int}, v34) T{bool} goto (3, 5)
28: cmp x1, #0
2c: b.le ->38
;; ParallelMove r1 <- r0 goto:32 B4
30: mov x1, x0
34: b.al ->24
Preventing compiler from optimizing computations out
The final step is to fix the benchmark to prevent the compiler from optimizing out our computation. The core idea is to change each benchmark to return the value it produces and to make the measured loops consume the returned values:
// ubench/lib/main.dart
@benchmark
Object allocateFixedArray() {
  return List.filled(N, null, growable: false);
}

// ubench/lib/main.benchmark.dart
@pragma('vm:never-inline')
@pragma('vm:no-interrupts')
void _$measuredLoop$allocateFixedArray(int numIterations) {
  // ...
  while (n-- > 0) {
    final result = lib.allocateFixedArray();
    // Call a special intrinsic from dart:_internal which keeps
    // the value alive and prevents it from being optimized out,
    // but does not result in any actual calls being produced.
    benchmark_runner.reachabilityFence(result);
  }
}
Final benchmark results
With the benchmark harness and the benchmark itself fixed to prevent the compiler from optimizing away the list allocation, we get the following results:
Results for lib/main.benchmark.dart
allocateFixedArray: 131.607 (±0.845) ns/iteration (fastest)
allocateGrowableArray: 177.443 (±1.603) ns/iteration (1.3 times as slow)
Hot methods when running allocateFixedArray:
77.98% Stub::AllocateArrayStub (libapp.so)
6.19% measured loop for allocateFixedArray (libapp.so)
5.21% unknown ([kernel.kallsyms])
1.16% pthread_mutex_lock (libc.so)
1.16% pthread_mutex_unlock (libc.so)
Hot methods when running allocateGrowableArray:
53.13% Stub::AllocateArrayStub (libapp.so)
11.53% new _GrowableList._withData (libapp.so)
9.35% new _GrowableList (libapp.so)
8.15% measured loop for allocateGrowableArray (libapp.so)
6.23% _GrowableList._allocateData (libapp.so)
3.92% unknown ([kernel.kallsyms])
So allocating a growable list with 32 elements takes ~30% more time than allocating a fixed-length list with 32 elements, which is a much more reasonable result than what we were getting from our initial benchmarks.
This also makes sense: a growable array is just an object that wraps a fixed-length array, so allocating one is a bit slower because you need to allocate and initialize more memory and jump through more hoops to get there. In allocateFixedArray most of the time is spent in a stub (a piece of handwritten machine code) which allocates the array, while in allocateGrowableArray there are multiple functions involved in doing the job.
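Conceptually (a simplified mental model, not the actual dart:core sources) the two shapes look like this:

// A fixed-length list is a single heap object that stores its elements
// inline:
//
//   [ header | length | e0 | e1 | ... | e31 ]
//
// A growable list is a small wrapper object pointing at such a backing
// store, so creating one requires two allocations and an extra level of
// indirection on element access.
class PseudoGrowableList<E> {
  PseudoGrowableList(int capacity) : _data = List<E?>.filled(capacity, null);

  List<E?> _data;  // backing fixed-length array (the second allocation)
  int _length = 0; // logical length, grows as elements are added
}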
Final remarks on benchmarking
In general I have chosen to neglect in this post some of the hardest problems associated with microbenchmarking: for example, I decided to ignore JIT altogether and focus specifically on AOT. Benchmarking JITs is really hard because they behave like living organisms and show wild fluctuations in performance even after long warmup (see for example the Virtual Machine Warmup Blows Hot and Cold paper). I also decided to focus on a simplistic average cost of operation metric, which might hide some important but not constantly present overheads. Array allocation is actually a great example: it might be rather cheap in general, but every so often it might trigger a GC, the cost of which is proportional to the amount of live data and will probably be close to 0 in a microbenchmark. Averaging over thousands and thousands of operations would completely erase the cost of the GC; however, in a real application GCs like this might matter, because they might cause a missed frame.
Next Part
In Part 2 of the series I am going to use the benchmarking CLI I have just implemented to answer more performance questions I have gotten on Twitter, for example:
- how changing a field initializer from [] to List.filled(0, null) can improve the performance of your code;
- performance characteristics of async syntax sugar;
- what happens when you call a closure through a dynamically typed variable.
Till next time.
I am always available to help you with Dart related performance questions and concerns. Do not hesitate to reach out to me by mail or through other channels like Twitter.