In the past few months I have started receiving more and more questions about performance of some specific Dart operations. Here is an example of such a question asked by Romain Rastel in the context of his work on improving performance of ChangeNotifier in Flutter.
Looks like creating a fixed-length list with a small number of items, can be, sometimes a lot less performant than creating a growable list. pic.twitter.com/B5opjZkmrX
— Romain Rastel đź’™ (@lets4r) November 30, 2020
Given my experience I knew exactly what was going wrong in this particular benchmark after the very first glance… but for the sake of storytelling let me pretend that I did not. How would I approach this then?
I would normally start by trying to reproduce the reported numbers. In this particular case that means creating an empty Flutter application
$ flutter create ubench
$ cd ubench
then in lib/benchmark.dart
I’d put the following code (taken from Romain’s gist with a minor fix: in the original version the benchmark names were swapped by accident, so the fixed-length benchmark was allocating growable lists and vice versa):
// ubench/lib/benchmark.dart
import 'package:benchmark_harness/benchmark_harness.dart';

abstract class Benchmark extends BenchmarkBase {
  const Benchmark(String name) : super(name);

  @override
  void exercise() {
    for (int i = 0; i < 100000; i++) {
      run();
    }
  }
}

class GrowableListBenchmark extends Benchmark {
  const GrowableListBenchmark(this.length) : super('growable[$length]');

  final int length;

  @override
  void run() {
    List<int>()..length = length;
  }
}

class FixedLengthListBenchmark extends Benchmark {
  const FixedLengthListBenchmark(this.length) : super('fixed-length[$length]');

  final int length;

  @override
  void run() {
    List(length);
  }
}

void main() {
  const GrowableListBenchmark(32).report();
  const FixedLengthListBenchmark(32).report();
}
and finally I’d run it in release mode
$ flutter run --release -t lib/benchmark.dart
...
I/flutter (18126): growable[32](RunTime): 31464.890625 us.
I/flutter (18126): fixed-length[32](RunTime): 713.8279800142756 us
The result seems to show fixed length lists being 43 times faster to allocate than growable lists. Should we leave it at that and head over to refactor our code to use as many fixed-length lists as possible?
Absolutely not… or at least not with the expectation that our code will become 43 times faster. It does actually make sense to prefer fixed-length lists over growable lists where fixed-length lists are a natural fit: they have a slightly smaller memory footprint, are faster to allocate and involve fewer indirections to access an element. But you should make this choice deliberately, based on a clear understanding of how things work, and not based on raw, uninterpreted results of microbenchmarks.
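To give a hedged illustration of what "natural fit" means here (the names below are made up and not taken from the benchmark code):
// If the number of elements is known upfront and never changes, a
// fixed-length list is a natural fit...
final daysPerMonth = List<int>.filled(12, 0, growable: false);
// ...while data that is accumulated incrementally naturally wants a
// growable list.
final incoming = <int>[];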
Drawing conclusions from raw microbenchmark numbers without any sort of critical analysis is a common pitfall associated with microbenchmarking and we should do our best to avoid falling into it. Unfortunately package:benchmark_harness does not make it easier to avoid such pitfalls: it provides developers with a way to write microbenchmarks but does not give them tools or guidance on how to validate their benchmarks and interpret the results. To make things worse, package:benchmark_harness does not even attempt to make it straightforward to write an accurate microbenchmark.
Consider for example that I could have written this list benchmark in the
following way, without overriding exercise
to repeat run
100000
times:
// ubench/lib/benchmark-without-exercise.dart
import 'package:benchmark_harness/benchmark_harness.dart';

// Just using BenchmarkBase directly. Rest is the same.
class GrowableListBenchmark extends BenchmarkBase {
  // ...
}

// Just using BenchmarkBase directly. Rest is the same.
class FixedLengthListBenchmark extends BenchmarkBase {
  // ...
}
Running this variant would show that growable lists are only 6 times slower than fixed length lists
$ flutter run --release -t lib/benchmark-without-exercise.dart
I/flutter (14407): growable[32](RunTime): 1.8629797056305768 us.
I/flutter (14407): fixed-length[32](RunTime): 0.3052065645659146 us.
Which benchmark result should I trust? Neither of them really! I should look under the hood and try to understand what exactly is happening.
Flutter and Dart already provide developers with enough tooling to figure out why benchmark numbers look this way. Unfortunately some of this tooling is somewhat obscure and hard to discover.
For example, it is well known that you can use flutter run --profile to profile your application with Observatory; however, it is less well known that you can also profile release builds using native profilers (like simpleperf on Android or Instruments on iOS). Similarly, it is barely known (most likely not known at all outside of the group of engineers working on the VM) that you can dump annotated disassembly of a specific method from an AOT build by doing
$ flutter build apk --extra-gen-snapshot-options=--print-flow-graph,\
--disassemble,\
--print-flow-graph-filter=FixedLengthListBenchmark.run
I could spend the rest of this post explaining how one could use these tools to understand what exactly is going on in these list benchmarks, but instead I would like to try and imagine how an integrated tooling for benchmarking could be built out of the primitives provided by Dart and Flutter. This tooling should not only run benchmarks, but also automatically provide enough insight for a developer to spot mistakes they made during benchmarking and help them interpret the results.
Preliminary setup
I have forked benchmark_harness
package into mraleph/benchmark_harness
on GitHub. All of my prototype code is going to live in a new experimental-cli
branch in the fork.
From here on I will document the evolution of this experimental benchmarking CLI. I would like to stress the highly experimental nature of this tooling: as you will notice, some of its features end up depending on patches to Dart and Flutter SDK internals. It might be weeks or months before these patches land and it becomes possible to just merge my changes into the upstream version of the harness.
I started by adding a trivial bin/benchmark_harness.dart
script which would
serve as an entry point to our new benchmarking tooling.
$ git clone [email protected]:mraleph/benchmark_harness.git
$ cd benchmark_harness
$ cat > bin/benchmark_harness.dart
void main() { print('Running benchmarks...'); }
^D
Finally I changed pubspec.yaml in the ubench project (remember, it is the empty Flutter project we created to host our benchmarks) to have a path dependency on my version of benchmark_harness:
# ubench/pubspec.yaml
dependencies:
  # ...
  benchmark_harness:
    path: ../benchmark_harness
  # ...
This allows me to run the benchmark_harness script while in the ubench project directory:
$ flutter pub get
$ flutter pub run benchmark_harness
Running benchmarks...
Generating benchmarks
Have you ever looked at how benchmark_harness runs your benchmarks? It turns out that this package is doing something rather simple (and to an extent naive): it starts a Stopwatch, then repeatedly calls exercise until 2 seconds elapse according to the stopwatch. The time elapsed divided by the number of times exercise was called is the reported benchmark score. Take a look yourself:
// benchmark_harness/lib/src/benchmark_base.dart
abstract class BenchmarkBase {
  // Measures the score for the benchmark and returns it.
  double measure() {
    // ...
    // Run the benchmark for at least 2000ms.
    var result = measureFor(exercise, 2000);
    // ...
  }

  // Exercises the benchmark. By default invokes [run] 10 times.
  void exercise() {
    for (var i = 0; i < 10; i++) {
      run();
    }
  }

  // Measures the score for this benchmark by executing it repeatedly until
  // time minimum has been reached.
  static double measureFor(Function f, int minimumMillis) {
    var minimumMicros = minimumMillis * 1000;
    var iter = 0;
    var watch = Stopwatch();
    watch.start();
    var elapsed = 0;
    while (elapsed < minimumMicros) {
      f();
      elapsed = watch.elapsedMicroseconds;
      iter++;
    }
    return elapsed / iter;
  }
}
This code unfortunately has an issue which makes it unsuitable for microbenchmarking: the measured loop has a bunch of overhead unrelated to the exercise itself. Most noticeably, it gets the current time from the OS on each and every iteration. There is also overhead associated with multiple levels of virtual dispatch between the measured loop and the body of the run method containing the actual operation we want to measure. There was a PR against benchmark_harness which tried to address the issue of calling Stopwatch.elapsedMicroseconds too often, but it somehow got stuck in limbo despite being approved.
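The idea behind that PR can be sketched roughly like this: instead of consulting the clock after every single call, run the benchmark in batches and only check the elapsed time once per batch. This is my own sketch of the idea, not the actual PR:
// Inside BenchmarkBase: a sketch (not the actual PR) which checks the clock
// once per batch of calls instead of after every call.
static double measureFor(Function f, int minimumMillis) {
  final minimumMicros = minimumMillis * 1000;
  const batchSize = 10; // illustrative batch size
  var iter = 0;
  var elapsed = 0;
  final watch = Stopwatch()..start();
  while (elapsed < minimumMicros) {
    for (var i = 0; i < batchSize; i++) {
      f();
    }
    iter += batchSize;
    elapsed = watch.elapsedMicroseconds;
  }
  return elapsed / iter;
}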
The best way to avoid these overheads is to have a separate measured loop for each benchmark.
Here is how this could look. The user declares microbenchmarks by writing a top-level function marked with the @benchmark annotation:
// ubench/lib/main.dart
import 'package:benchmark_harness/benchmark_harness.dart';

const N = 32;

@benchmark
void allocateFixedArray() {
  List.filled(N, null, growable: false);
}

@benchmark
void allocateGrowableArray() {
  List.filled(N, null, growable: true);
}
The benchmarking tooling would then generate an auxiliary source file which contains a measured loop for each benchmark, plus some code to select, at compile time, which benchmarks should run:
// ubench/lib/main.benchmark.dart
import 'package:benchmark_harness/benchmark_runner.dart' as benchmark_runner;
import 'package:ubench/main.dart' as lib;

// ...
void _$measuredLoop$allocateFixedArray(int numIterations) {
  while (numIterations-- > 0) {
    lib.allocateFixedArray();
  }
}
// ...

const _targetBenchmark =
    String.fromEnvironment('targetBenchmark', defaultValue: 'all');
const _shouldMeasureAll = _targetBenchmark == 'all';
const _shouldMeasure$allocateFixedArray =
    _shouldMeasureAll || _targetBenchmark == 'allocateFixedArray';
// ...

void main() {
  benchmark_runner.runBenchmarks(const {
    // ...
    if (_shouldMeasure$allocateFixedArray)
      'allocateFixedArray': _$measuredLoop$allocateFixedArray,
    // ...
  });
}
The actual measurement would happen in a simple measure
helper function:
// benchmark_harness/lib/benchmark_runner.dart

/// Runs the given measured [loop] function with exponentially increasing
/// parameter values until it finds one that causes [loop] to run for at
/// least [thresholdMilliseconds] and returns a [BenchmarkResult] describing
/// that run.
BenchmarkResult measure(void Function(int) loop,
    {required String name, int thresholdMilliseconds = 5000}) {
  var n = 2;
  final sw = Stopwatch();
  do {
    n *= 2;
    sw.reset();
    sw.start();
    loop(n);
    sw.stop();
  } while (sw.elapsedMilliseconds < thresholdMilliseconds);
  return BenchmarkResult(
    name: name,
    elapsedMilliseconds: sw.elapsedMilliseconds,
    numIterations: n,
  );
}
We are starting with a very simple implementation, which should nevertheless satisfy our initial microbenchmarking needs. However, for more complex cases we might want to do something a bit more rigorous: for example, once a large enough numIterations is found we can repeat loop(numIterations) multiple times and assess statistical properties of the observed running times.
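For illustration, here is a minimal sketch of what that could look like on top of the measure function above; the helper names are made up and not part of the harness:
// A sketch: repeat the measured loop several times for a fixed numIterations
// and look at the spread of per-iteration times. Names here are hypothetical.
import 'dart:math' as math;

List<double> sampleTimesPerIteration(void Function(int) loop, int numIterations,
    {int samples = 10}) {
  final times = <double>[];
  final sw = Stopwatch();
  for (var i = 0; i < samples; i++) {
    sw
      ..reset()
      ..start();
    loop(numIterations);
    sw.stop();
    times.add(sw.elapsedMicroseconds / numIterations);
  }
  return times;
}

double mean(List<double> xs) => xs.reduce((a, b) => a + b) / xs.length;

double stdDev(List<double> xs) {
  final m = mean(xs);
  return math.sqrt(
      xs.map((x) => (x - m) * (x - m)).reduce((a, b) => a + b) / xs.length);
}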
Using source_gen
To generate main.benchmark.dart we need to parse main.dart and find all functions annotated with @benchmark. Fortunately Dart has a number of canonical tools for code generation which make this really easy. All I had to do was depend on package:source_gen and define a subclass of GeneratorForAnnotation:
// benchmark_harness/lib/src/benchmark_generator.dart
class BenchmarkGenerator extends GeneratorForAnnotation<Benchmark> {
  // ...
  @override
  String generateForAnnotatedElement(
      Element element, ConstantReader annotation, BuildStep buildStep) {
    final name = element.name;
    return '''
void _\$measuredLoop\$$name(int numIterations) {
  while (numIterations-- > 0) {
    lib.$name();
  }
}
''';
  }
}
which I then wrapped in a Builder
// benchmark_harness/lib/builder.dart
Builder benchmarkLibraryBuilder(BuilderOptions options) =>
    LibraryBuilder(BenchmarkGenerator(), generatedExtension: '.benchmark.dart');
and finally exposed this Builder
factory through build.yaml
# benchmark_harness/build.yaml
builders:
  benchmark:
    import: "package:benchmark_harness/builder.dart"
    builder_factories: ["benchmarkLibraryBuilder"]
    build_extensions: {".dart": [".benchmark.dart"]}
    auto_apply: dependents
    build_to: source
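One more bit of plumbing, mentioned here only for completeness: to actually run the code generation the ubench project needs build_runner available as a dev dependency. Something along these lines (a sketch, exact version constraints omitted):
# ubench/pubspec.yaml
dev_dependencies:
  # ...
  build_runner: any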
That was basically it. Now whenever I run build_runner build
in ubench
I will
get lib/main.benchmark.dart
generated for benchmarks defined in lib/main.dart
:
ubench $ flutter pub run build_runner build
ubench $ ls lib/*.benchmark.dart
lib/main.benchmark.dart
You can see the full source of the code generator in this commit.
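One piece not spelled out above is the @benchmark annotation itself. A minimal definition, sketched here under my own assumptions rather than copied from the fork, is just a marker class with a const instance exported by the package:
// benchmark_harness/lib/benchmark_harness.dart (sketch)

/// Marks a top-level function as a microbenchmark for which a measured loop
/// should be generated.
class Benchmark {
  const Benchmark();
}

const benchmark = Benchmark();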
Running benchmarks
We can execute individual benchmarks by passing appropriate --dart-define
to flutter run
, for example:
$ flutter run --release --dart-define targetBenchmark=allocateFixedArray -t lib/main.benchmark.dart
Launching lib/main.benchmark.dart on Pixel 3a in release mode...
Running Gradle task 'assembleRelease'...
Running Gradle task 'assembleRelease'... Done 4.9s
âś“ Built build/app/outputs/flutter-apk/app-release.apk (4.9MB).
Installing build/app/outputs/flutter-apk/app.apk... 1,268ms
Flutter run key commands.
h Repeat this help message.
c Clear the screen
q Quit (terminate the application on the device).
I/flutter (12463): benchmark_harness[{"event":"benchmark.running"}]
I/flutter (12463): benchmark_harness[{"event":"benchmark.result","params":{...}}]
I/flutter (12463): benchmark_harness[{"event":"benchmark.done"}]
Application finished.
But doing this manually is not exactly what I was aiming for. Instead I am going to change the bin/benchmark_harness.dart script to both build the benchmarks and then run all generated files to collect benchmark results (for the full code see this commit).
// benchmark_harness/bin/benchmark_harness.dart
void main() async {
  // ...

  // Generate benchmark wrapper scripts.
  print(red('Generating benchmark wrappers'));
  'flutter pub run build_runner build'.start(progress: Progress.devNull());

  // Run all generated benchmarks.
  final resultsByFile = <String, Map<String, BenchmarkResult>>{};
  for (var file in find('*.benchmark.dart').toList().map(p.relative)) {
    resultsByFile[file] = await runBenchmarksIn(file);
  }

  // Report results.
  // ...
}

/// Runs all benchmarks in `.benchmark.dart` [file] one by one and collects
/// their results.
Future<Map<String, BenchmarkResult>> runBenchmarksIn(String file) async {
  // ...
}
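The body of runBenchmarksIn is elided here. One plausible way to implement it (my own assumption, not necessarily what the actual commit does) is to launch flutter run for the generated file and parse the benchmark_harness[...] JSON events which the runner prints to the log:
// Sketch: extract and decode a benchmark_harness[...] event from a log line.
// The format is assumed from the output shown above; the real implementation
// in the fork may differ.
import 'dart:convert';

final _eventPattern = RegExp(r'benchmark_harness\[(.*)\]');

Map<String, dynamic>? tryParseEvent(String logLine) {
  final match = _eventPattern.firstMatch(logLine);
  if (match == null) return null;
  return jsonDecode(match.group(1)!) as Map<String, dynamic>;
}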
Such a helper script makes running benchmarks really straightforward:
$ flutter pub run benchmark_harness
Generating benchmark wrappers
Found 2 benchmarks in lib/main.benchmark.dart
measuring allocateFixedArray
benchmark is running
done
measuring allocateGrowableArray
benchmark is running
done
--------------------------------------------------------------------------------
Results for lib/main.benchmark.dart
allocateFixedArray: 0.0000030226074159145355 ms/iteration (fastest)
allocateGrowableArray: 0.00018900632858276367 ms/iteration (62.5 times as slow)
Understanding the results
Now that we have a tool for running microbenchmarks, let's extend it with support for profiling benchmarks as they are running. This will help us understand where a benchmark is spending its time and confirm that it is measuring exactly what we want it to measure.
Flutter’s release builds exclude Dart’s builtin profiler so we will have
to use a native profiler instead, for example simpleperf
on Android.
Android has comprehensive documentation for using simpleperf
, which I am not going to duplicate here. simpleperf
also comes
with C++ (and Java) code called app_api
which can be linked into an application to allow programmatic access to the profiler.
In reality app_api does not do anything overly fancy: it just runs the simpleperf binary with the right command line options. That's why I decided to simply port the relevant parts of app_api to pure Dart. We could also bind to the C++ version of app_api using Dart FFI, but that requires packaging this C++ code into a Flutter plugin, which complicates things, because benchmark_harness is a pure Dart package and it can't depend on a Flutter plugin package.
// benchmark_harness/lib/src/simpleperf/profiling_session.dart
class ProfilingSession {
  Future<void> start(
      {RecordingOptions options = const RecordingOptions()}) async {
    // ...
    await _startSimpleperfProcess(options);
  }

  Future<void> _startSimpleperfProcess(RecordingOptions options) async {
    final simpleperfBinary = await _findSimplePerf();
    _simpleperf = await Process.start(
      simpleperfBinary,
      [
        'record',
        '--log-to-android-buffer',
        '--log',
        'debug',
        '--stdio-controls-profiling',
        '--in-app',
        '--tracepoint-events',
        '/data/local/tmp/tracepoint_events',
        '-o',
        options.outputFilename ?? _makeOutputFilename(),
        '-e',
        options.event,
        '-f',
        options.frequency.toString(),
        '-p',
        _getpid().toString(),
        ..._callgraphFlagsFrom(options),
      ],
      workingDirectory: simpleperfDataDir,
    );
    // ...
  }
}
Then I adjusted benchmark_runner.dart to run the benchmark it just measured under the profiler and to save the profile into a perf-$benchmarkName.data file, which will be created in the application's data directory:
Future<void> runBenchmarks(Map<String, void Function(int)> benchmarks) async {
  _event('benchmark.running');
  final profiler = Platform.isAndroid ? ProfilingSession() : null;
  for (var entry in benchmarks.entries) {
    final result = measure(entry.value, name: entry.key);
    _event('benchmark.result', result);
    if (profiler != null) {
      // Run the benchmark for the same number of iterations and profile it.
      await profiler.start(
          options: RecordingOptions(outputFilename: 'perf-${entry.key}.data'));
      entry.value(result.numIterations);
      await profiler.stop();
    }
  }
  _event('benchmark.done');
}
The NDK also comes with a helper script api_profiler.py, which implements two commands:
- api_profiler.py prepare configures your device for profiling - we are going to call it before running benchmarks;
- api_profiler.py collect pulls collected profiles from the device - we are going to call it after all benchmarks finish running to pull all generated perf-*.data files from the device.
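Put together, the host-side flow looks roughly like this (a sketch; how you invoke the NDK scripts depends on where your NDK lives):
$ api_profiler.py prepare             # configure the device for profiling
$ flutter pub run benchmark_harness   # run benchmarks; profiles are recorded on the device
$ api_profiler.py collect             # pull the recorded perf-*.data files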
Displaying collected profiling data
The NDK’s simpleperf binary supports both record and report commands, just like Linux perf. Looking around in the NDK I have also discovered a bunch of helper scripts written in Python (e.g. report_html.py, which can generate an HTML report). Peeking into those scripts I discovered that they make use of the libsimpleperf_report.so library which handles parsing and symbolization of collected profiles. The API for this library is defined at the top of the simpleperf/report_lib_interface.cpp file in the simpleperf sources.
Using ffigen I generated dart:ffi based bindings for this library, allowing me to use it from the benchmark_harness script to process the collected profiling samples:
final reportLib = report_bindings.NativeLibrary(
    ffi.DynamicLibrary.open(ndk.simpleperfReportLib));

Future<void> _printProfile(String profileData) async {
  final session = reportLib.CreateReportLib();
  reportLib.SetRecordFile(session, Utf8.toUtf8(profileData).cast());

  // Iterate over all collected samples.
  for (;;) {
    final sample = reportLib.GetNextSample(session);
    if (sample == ffi.nullptr) {
      break;
    }
    final period = sample.ref.period;
    final symbol = reportLib.GetSymbolOfCurrentSample(session);
    final dsoName = Utf8.fromUtf8(symbol.ref.dso_name.cast());
    final symbolName = Utf8.fromUtf8(symbol.ref.symbol_name.cast());
    // Process sample for the symbol [symbolName] in dso [dsoName] and collect
    // aggregate statistics (samples per symbol, total sampling period, etc).
    // ...
  }
  // Report top N hottest symbols
}
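The aggregation itself is elided above; it can be as simple as summing sampled periods per symbol and then sorting. A sketch of the general idea, not the exact code from the fork:
// Inside the sample loop one would accumulate something like:
//   periodBySymbol[symbolName] = (periodBySymbol[symbolName] ?? 0) + period;
//   totalPeriod += period;
// and afterwards print the hottest symbols:
void reportHottestSymbols(Map<String, int> periodBySymbol, int totalPeriod,
    {int top = 10}) {
  final entries = periodBySymbol.entries.toList()
    ..sort((a, b) => b.value.compareTo(a.value));
  for (final entry in entries.take(top)) {
    final percent = (entry.value / totalPeriod * 100).toStringAsFixed(2);
    print('$percent% ${entry.key}');
  }
}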
When I ran this for the first time I discovered that simpleperf can't really attribute most of the samples to a meaningful symbol, neither for libapp.so (which contains AOT compiled Dart code) nor for libflutter.so (which contains Flutter engine code). Here is the very first report I got:
Hot methods when running allocateGrowableArray:
88.24% _kDartIsolateSnapshotInstructions (libapp.so)
4.04% unknown (libflutter.so)
3.15% unknown ([kernel.kallsyms])
1.44% pthread_mutex_lock (libc.so)
1.30% pthread_mutex_unlock (libc.so)
...
This is not surprising: both of these libraries are stripped and don't contain any useful symbol information for simpleperf to use.
Fortunately, libflutter.so symbols can be fetched from the Cloud Storage bucket where the build infrastructure archives them, e.g. symbols for an ARM64 Android release build of the Flutter engine at commit e115066d... reside in gs://flutter_infra/flutter/e115066d.../android-arm64-release/symbols.zip. Just a few months ago I wrote some Dart code for downloading and caching Flutter engine symbols based on the commit hash for @flutter-symbolizer-bot, so I could just reuse the very same code here.
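For reference, the same symbols can also be fetched manually with gsutil; the engine commit hash below is a placeholder:
$ gsutil cp gs://flutter_infra/flutter/<engine-commit>/android-arm64-release/symbols.zip .
$ unzip symbols.zip -d flutter-symbols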
Getting symbols for libapp.so is a more interesting problem. The Dart VM AOT compiler is capable of producing DWARF debug sections in the ELF binaries it generates. However just passing --extra-gen-snapshot-options=--dwarf-stack-traces does not get us there: libapp.so will still be missing symbols. Inspection of the Flutter CLI source reveals that it simply always instructs the AOT compiler to produce stripped output. Users familiar with advanced Flutter tooling options might know about the --split-debug-info=... flag which instructs the AOT compiler to produce a separate ELF file containing just the DWARF sections but none of the snapshot payload itself. This file would be enough for symbolization, but unfortunately it does not play well with tools like llvm-objdump. GNU objdump supports separate debug info discovered through .gnu_debuglink, but some features, like intermixing source and disassembly, do not seem to work in this mode.
For prototyping purposes I opted to patch the flutter CLI to allow a user to suppress stripping by specifying --no-strip in --extra-gen-snapshot-options.
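With the patched CLI the build step looks something like this (remember that --no-strip only exists in my patched version):
$ flutter build apk --release \
    --extra-gen-snapshot-options=--no-strip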
Now when we build our benchmark we can preserve DWARF debug information in
libapp.so
and subsequently use it for symbolization, making our profile
more informative:
Hot methods when running allocateGrowableArray:
54.17% Precompiled_Stub__iso_stub_AllocateArrayStub (libapp.so)
14.29% Precompiled______measuredLoop_allocateGrowableArray_21146765_1230 (libapp.so)
8.01% Precompiled__GrowableList_0150898__GrowableList_0150898__withData_0150898_136 (libapp.so)
7.63% Precompiled__GrowableList_0150898__GrowableList_0150898__164 (libapp.so)
4.96% Precompiled__GrowableList_0150898__allocateData_0150898_161 (libapp.so)
3.66% unknown ([kernel.kallsyms])
I then took it one step further and used the information available in DWARF to pretty print symbol names (which simpleperf does not seem to do). Fortunately I did not actually have to parse DWARF by hand: package:native_stack_traces already has all the necessary tooling for handling ELF/DWARF parsing for us:
String _userFriendlyName(Dwarf dwarf, elf_lib.Elf elf, String symbol) {
  final elfSymbol = elf.staticSymbolFor(symbol);
  if (elfSymbol != null) {
    final callInfo = dwarf.callInfoFor(elfSymbol.value);
    if (callInfo != null && callInfo.isNotEmpty) {
      final lastInfo = callInfo.last;
      if (lastInfo is DartCallInfo) {
        return lastInfo.function
            .replaceFirst(r'_$measuredLoop$', 'measured loop for ');
      }
      return lastInfo.toString();
    }
  }
  return symbol.replaceFirst('Precompiled_Stub__iso_stub_', 'Stub::');
}
This yields even more readable output:
$ flutter pub run benchmark_harness
Results for lib/main.benchmark.dart
allocateFixedArray: 0.000006573274731636047 ms/iteration (fastest)
allocateGrowableArray: 0.00020492076873779297 ms/iteration (31.2 times as slow)
Hot methods when running allocateFixedArray:
99.64% measured loop for allocateFixedArray (libapp.so)
Hot methods when running allocateGrowableArray:
54.17% Stub::AllocateArrayStub (libapp.so)
14.29% measured loop for allocateGrowableArray (libapp.so)
8.01% new _GrowableList._withData (libapp.so)
7.63% new _GrowableList (libapp.so)
4.96% _GrowableList._allocateData (libapp.so)
3.66% unknown ([kernel.kallsyms])
We can now see that the two benchmarks behave quite differently: allocateFixedArray spends all of its time in its own measured loop, while allocateGrowableArray spends its time in various methods related to array allocation.
To understand this better we need to look at the native code generated for both of these measured loops.
There are different possible ways to implement this, e.g. I could choose to
delegate this task to perf annotate
or something similar. However perf
is rather picky with regards to folder structure when searching for symbols,
and none of the NDK scripts seemed to fit the bill, so I opted to just use
llvm-objdump
and annotate the result with profiling information.
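The raw disassembly itself can be obtained with something along these lines (a sketch; the actual tool also narrows the output down to the symbols we care about):
$ llvm-objdump -d libapp.so > libapp.asm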
With some post-processing to use Dart VM specific names for reserved registers like thr (reserved for the current Thread pointer) and to demangle symbol names (via the _userFriendlyName helper) we get output like this:
Hot methods when running allocateFixedArray:
99.64% measured loop for allocateFixedArray (libapp.so)
0: stp fp, lr, [sp, #-16]!
4: mov fp, sp
8: ldr x2, [fp, #16]
7.87% c: ldr x16, [thr, #64]
18.40% 10: cmp sp, x16
9.30% 14: b.ls ->60
8.88% 18: cmp x2, null
1c: b.eq ->68
9.65% 20: asr x3, x2, #1
24: tbz w2, #0, ->2c
28: ldur x3, [x2, #7]
9.45% 2c: sub x2, x3, #1
30: cmp x3, #0
9.85% 34: b.le ->50
9.72% 38: adds x0, x2, x2
3c: b.vc ->48
40: bl Stub::AllocateMintSharedWithoutFPURegsStub
44: stur x2, [x0, #7]
8.90% 48: mov x2, x0
7.60% 4c: b.al ->c
50: mov x0, null
54: mov sp, fp
58: ldp fp, lr, [sp], #16
5c: ret
60: bl Stub::StackOverflowSharedWithoutFPURegsStub
64: b.al ->18
68: bl Stub::NullErrorSharedWithoutFPURegsStub
A developer familiar with ARM assembly language might be able to spot the issue with our benchmark just from this output (yep, there is obviously an issue).
However I decided to spend a bit more time on it and overlay the information about Dart VM compiler’s intermediate language on top of the assembly, making it much easier to grok.
Internally our compiler has support for annotating machine code it generates with human readable comments, which among other things describe IL from which a particular piece of machine code was produced.
I have experimented with two different paths for getting this information out of the AOT compiler:
- add a flag --write-code-comments-to=output.json which tells the compiler to dump the offset-to-comment mapping into a JSON file, and then ingest this JSON in our benchmarking CLI;
- add a flag --write-code-comments-as-synthetic-source-to=comments.txt which tells the compiler to synthesize one gigantic file out of all code comments and to write the offset-to-comment mapping as a DWARF line program into the .debug_line section.
In the end I decided that the second approach is better because it makes it possible for any tool that understands .debug_line to display code comments alongside the disassembly (e.g. gdb will start displaying them as well).
Here is my patch against the Dart SDK implementing the necessary plumbing. With this in place the only change I had to make was to add -S to the llvm-objdump invocation and it handled the rest:
Hot methods when running allocateFixedArray:
99.64% measured loop for allocateFixedArray (libapp.so)
;; Enter frame
0: stp fp, lr, [sp, #-16]!
4: mov fp, sp
;; ParallelMove r2 <- S+2
8: ldr x2, [fp, #16]
;; CheckStackOverflow:30(stack=0, loop=1)
7.87% c: ldr x16, [thr, #64]
18.40% 10: cmp sp, x16
9.30% 14: b.ls ->60
;; CheckNull:10(v3, NoSuchMethodError) T{int}
8.88% 18: cmp x2, null
1c: b.eq ->68
;; v26 <- UnboxInt64(v3 T{int}) T{int}
9.65% 20: asr x3, x2, #1
24: tbz w2, #0, ->2c
28: ldur x3, [x2, #7]
;; v6 <- BinaryInt64Op(- [tr], v26 T{int}, v32) T{int}
9.45% 2c: sub x2, x3, #1
;; Branch if RelationalOp(>, v26 T{int}, v34) T{bool} goto (3, 5)
30: cmp x3, #0
9.85% 34: b.le ->50
;; v30 <- BoxInt64(v6) T{int}
9.72% 38: adds x0, x2, x2
3c: b.vc ->48
40: bl Stub::AllocateMintSharedWithoutFPURegsStub
44: stur x2, [x0, #7]
;; ParallelMove r2 <- r0 goto:28 B4
8.90% 48: mov x2, x0
7.60% 4c: b.al ->c
;; ParallelMove r0 <- C
50: mov x0, null
;; Return:36(v0)
54: mov sp, fp
58: ldp fp, lr, [sp], #16
5c: ret
;; CheckStackOverflowSlowPath
60: bl Stub::StackOverflowSharedWithoutFPURegsStub
64: b.al ->18
;; slow path check null (nsm) operation
68: bl Stub::NullErrorSharedWithoutFPURegsStub
Hot methods when running allocateGrowableArray:
54.17% Stub::AllocateArrayStub (libapp.so)
14.29% measured loop for allocateGrowableArray (libapp.so)
;; Enter frame
0: stp fp, lr, [sp, #-16]!
4: mov fp, sp
8: sub sp, sp, #16
;; CheckStackOverflow:8(stack=0, loop=0)
c: ldr x16, [thr, #64]
10: cmp sp, x16
14: b.ls ->94
;; v36 <- UnboxedConstant:32(#32) [32, 32] T{_Smi}
18: mov x0, #32
;; ParallelMove r1 <- S+2, S-2 <- r0
1c: ldr x1, [fp, #16]
20: stur x0, [fp, #-16]
;; CheckStackOverflow:30(stack=0, loop=1)
24: ldr x16, [thr, #64]
0.73% 28: cmp sp, x16
2c: b.ls ->9c
;; CheckNull:10(v3, NoSuchMethodError) T{int}
0.33% 30: cmp x1, null
34: b.eq ->a4
;; v26 <- UnboxInt64(v3 T{int}) T{int}
0.30% 38: asr x2, x1, #1
3c: tbz w1, #0, ->44
40: ldur x2, [x1, #7]
;; v6 <- BinaryInt64Op(- [tr], v26 T{int}, v32) T{int}
0.35% 44: sub x1, x2, #1
;; ParallelMove S-1 <- r1
48: stur x1, [fp, #-8]
;; Branch if RelationalOp(>, v26 T{int}, v34) T{bool} goto (3, 5)
10.41% 4c: cmp x2, #0
50: b.le ->84
;; PushArgument(v18)
0.29% 54: ldr x16, [pp, #5160]
0.35% 58: stp x0, x16, [sp, #-16]!
;; StaticCall:10( _GrowableList@0150898.<0> v18, v36, result_type = T{_GrowableList})
5c: bl new _GrowableList
0.33% 60: add sp, sp, #16
;; ParallelMove r2 <- S-1
64: ldur x2, [fp, #-8]
;; v30 <- BoxInt64(v6) T{int}
0.53% 68: adds x0, x2, x2
6c: b.vc ->78
70: bl Stub::AllocateMintSharedWithoutFPURegsStub
74: stur x2, [x0, #7]
;; ParallelMove r1 <- r0, r0 <- S-2 goto:28 B4
0.29% 78: mov x1, x0
7c: ldur x0, [fp, #-16]
0.38% 80: b.al ->24
;; ParallelMove r0 <- C
84: mov x0, null
;; Return:36(v0)
88: mov sp, fp
8c: ldp fp, lr, [sp], #16
90: ret
;; CheckStackOverflowSlowPath
94: bl Stub::StackOverflowSharedWithoutFPURegsStub
98: b.al ->18
;; CheckStackOverflowSlowPath
9c: bl Stub::StackOverflowSharedWithoutFPURegsStub
a0: b.al ->30
;; slow path check null (nsm) operation
a4: bl Stub::NullErrorSharedWithoutFPURegsStub
8.01% new _GrowableList._withData (libapp.so)
7.63% new _GrowableList (libapp.so)
4.96% _GrowableList._allocateData (libapp.so)
3.66% unknown ([kernel.kallsyms])
..(run with -v to disassemble all hot methods in libapp.so)..
Now it should be visible in the output that allocateFixedArray does not actually contain any array allocation: it got eliminated (sunk) by the compiler. So our benchmark seems to be comparing the performance of an empty while (N-- > 0); loop to the performance of a loop that actually does some array allocations.
The empty loop however is far from trivial. It seems to contain two sources of overhead by itself:
- Variable numIterations seems to be unboxed and reboxed on each loop iteration, as evident from the following sequence of instructions that implements numIterations--:

  ;; CheckNull:10(v3, NoSuchMethodError) T{int}
  18: cmp x2, null
  1c: b.eq ->68
  ;; v26 <- UnboxInt64(v3 T{int}) T{int}
  20: asr x3, x2, #1
  24: tbz w2, #0, ->2c
  28: ldur x3, [x2, #7]
  ;; v6 <- BinaryInt64Op(- [tr], v26 T{int}, v32) T{int}
  2c: sub x2, x3, #1
  <...>
  ;; v30 <- BoxInt64(v6) T{int}
  38: adds x0, x2, x2
  3c: b.vc ->48
  40: bl Stub::AllocateMintSharedWithoutFPURegsStub
  44: stur x2, [x0, #7]

- The loop header contains a CheckStackOverflow instruction which performs a memory load and a comparison on each loop iteration:

  ;; CheckStackOverflow:30(stack=0, loop=1)
  24: ldr x16, [thr, #64]
  28: cmp sp, x16
  2c: b.ls ->9c
For accurate microbenchmarking we should certainly make an effort to reduce the overhead of the measured loop to the bare minimum.
Preventing boxing of numIterations
void _$measuredLoop$allocateGrowableArray(int numIterations) {
  while (numIterations-- > 0) {
    lib.allocateGrowableArray();
  }
}
Why is numIterations kept boxed by the compiler? There are multiple factors at play here. Most importantly, the compiler is simply unable to prove that numIterations is not null on entry to the function, which happens because we use a tear-off of a measured loop method to invoke it. TFA (type flow analysis, our global type propagation algorithm) does not attempt to infer precise type information for parameters of closures. (If we were running in sound non-nullable by default (NNBD) mode, the compiler would also be able to rely on the fact that numIterations can never be null. This illustrates one of the benefits of NNBD: giving the compiler more information to work with.) If our code was structured using classes instead, for example:
abstract class MeasuredLoop {
  void run(int numIterations);
}

class Loop$allocateGrowableArray extends MeasuredLoop {
  void run(int numIterations) =>
      _$measuredLoop$allocateGrowableArray(numIterations);
}

void main() async {
  await benchmark_runner.runBenchmarks({
    // Use a class instead of tearing off _$measuredLoop$allocateGrowableArray.
    'allocateGrowableArray': Loop$allocateGrowableArray(),
  });
}
then the compiler would be able to infer that numIterations is never nullable, and it would not only unbox numIterations locally inside run and _$measuredLoop$... but even alter the calling conventions of these two functions to pass the parameter around without boxing.
The reason why this works for methods and does not (yet) work for closures is that a statically typed method call gives the compiler more information about potential callees than a statically typed closure call.
class Loop$allocateGrowableArray extends MeasuredLoop {
  void run(int numIterations) {
    // Can only be reached from a call-site where
    // static type of a receiver is either [dynamic] or
    // related to [Loop$allocateGrowableArray].
  }
}

MeasuredLoop loop;
loop.run(...);  // Can invoke only overrides of [MeasuredLoop.run].

// ---------------------------------------------------------

var g = (int numIterations) {
  // Can be reached from any closure call-site with a compatible
  // function type and invocation signature.
};

void Function(int) f;
f(...);  // Can reach any compatible closure in the program.
Right now TFA uses a very conservative approximation, assuming that any closure could be invoked from any call-site. This certainly could be improved: for example, TFA could use static type information to narrow the set of potential call-sites for each closure. This could be taken even further: TFA could attempt to track the flow of closures through the program and try to establish which call-sites each closure could reach, but that is a more sophisticated analysis.
Given that TFA can't infer that numIterations never contains null, we might want to help it by adding an explicit check at the start of the measured loop wrapper:
void _$measuredLoop$allocateGrowableArray(int numIterations) {
  if (numIterations == null) throw 'Unexpected null!';
  while (numIterations-- > 0) {
    lib.allocateGrowableArray();
  }
}
Unfortunately this does not help: even though the compiler now infers that numIterations is never null inside the loop, it still does not unbox it, because its unboxing algorithm is overly conservative around int typed variables. The current set of heuristics only unboxes an integer variable if all values reaching it are either boxing operations or constants. (I am allowing myself a slight imprecision when talking about unboxing here: the compiler does not actually unbox variables, because it operates on an SSA form which does not contain any operations with variables anymore. It would be more precise to talk about unboxing phis, but I did not want to confuse readers without a compiler background too much.) This is probably a bug that needs to be fixed, but in the meantime we can simply satisfy this requirement:
void _$measuredLoop$allocateGrowableArray(int numIterations) {
  // n will be unboxed because it is either
  // Box(Unbox(numIterations) + 0) or Box(Unbox(n) - 1).
  var n = numIterations + 0;
  while (n-- > 0) {
    lib.allocateGrowableArray();
  }
}
Eliminating CheckStackOverflow
CheckStackOverflow instructions inserted by the compiler have a dual purpose: they check for stack overflow (as their name implies), but they also serve as interruption points, allowing the VM to cleanly interrupt a thread executing Dart code. This mechanism can, for example, be used by the GC to park the mutator thread at a safepoint. CheckStackOverflows are rather fast: they consist of a memory load, which usually hits the CPU cache, and a compare-and-branch, which is almost never taken. However they can still have a visible cost in very tight loops.
As an experiment I have changed the VM to completely eliminate CheckStackOverflow instructions from functions marked with @pragma('vm:no-interrupts').
With a CheckStackOverflow instruction the empty loop takes 3.56 ns per iteration; without it the empty loop takes 1.78 ns per iteration, which is basically twice as fast.
The final shape of an empty measuring loop is:
;; v9 <- BinaryInt64Op(- [tr], v6 T{int}, v36) T{int}
24: sub x0, x1, #1
;; Branch if RelationalOp(>, v6 T{int}, v34) T{bool} goto (3, 5)
28: cmp x1, #0
2c: b.le ->38
;; ParallelMove r1 <- r0 goto:32 B4
30: mov x1, x0
34: b.al ->24
Preventing compiler from optimizing computations out
The final step is to fix the benchmark itself to prevent the compiler from optimizing out our computation. The core idea is to change each benchmark to return the value it produces and to make the measured loops consume the returned values:
// ubench/lib/main.dart
@benchmark
Object allocateFixedArray() {
  return List.filled(N, null, growable: false);
}

// ubench/lib/main.benchmark.dart
@pragma('vm:never-inline')
@pragma('vm:no-interrupts')
void _$measuredLoop$allocateFixedArray(int numIterations) {
  // ...
  while (n-- > 0) {
    final result = lib.allocateFixedArray();
    // Call a special intrinsic from dart:_internal which keeps
    // the value alive and prevents it from being optimized out.
    // However does not result in any actual calls being produced.
    benchmark_runner.reachabilityFence(result);
  }
}
Final benchmark results
With the benchmark harness and the benchmark itself fixed to prevent the compiler from optimizing away the list allocation, we get the following results:
Results for lib/main.benchmark.dart
allocateFixedArray: 131.607 (±0.845) ns/iteration (fastest)
allocateGrowableArray: 177.443 (±1.603) ns/iteration (1.3 times as slow)
Hot methods when running allocateFixedArray:
77.98% Stub::AllocateArrayStub (libapp.so)
6.19% measured loop for allocateFixedArray (libapp.so)
5.21% unknown ([kernel.kallsyms])
1.16% pthread_mutex_lock (libc.so)
1.16% pthread_mutex_unlock (libc.so)
Hot methods when running allocateGrowableArray:
53.13% Stub::AllocateArrayStub (libapp.so)
11.53% new _GrowableList._withData (libapp.so)
9.35% new _GrowableList (libapp.so)
8.15% measured loop for allocateGrowableArray (libapp.so)
6.23% _GrowableList._allocateData (libapp.so)
3.92% unknown ([kernel.kallsyms])
So allocating a growable list with 32 elements takes ~30% more time than allocating a fixed-length list with 32 elements, which is a much more reasonable result than what we were getting from our initial benchmarks.
This also makes sense: a growable array is just an object wrapping a fixed-length array, so allocating a growable array is a bit slower than allocating a fixed-length one: you need to allocate and initialize more memory, and you need to jump through more hoops to get there. In allocateFixedArray most of the time is spent in a stub (a piece of handwritten machine code) which allocates the array, while in allocateGrowableArray there are multiple functions involved in doing the job.
Final remarks on benchmarking
In general I have chosen to neglect some of the hardest problems associated with microbenchmarking in this post: for example, I decided to ignore JIT altogether and focus specifically on AOT. Benchmarking JITs is really hard because they behave like living organisms and show wild fluctuations in performance even after a long warmup (see, for example, the Virtual Machine Warmup Blows Hot and Cold paper). I also decided to focus on a simplistic average cost of operation metric, which might hide some important but not constantly present overheads. Array allocation is actually a great example: it might be rather cheap in general, but every so often it might trigger a GC, the cost of which will be proportional to the amount of live data, which will probably be close to 0 in a microbenchmark. Averaging over thousands and thousands of operations would completely erase the cost of the GC; however, in a real application GCs like this might matter because they might cause a missed frame.
Next Part
In Part 2 of the series I am going to use the benchmarking CLI I have just implemented to answer more performance questions I have gotten on Twitter, for example:
- how changing a field initializer from [] to List.filled(0, null) can improve the performance of your code;
- performance characteristics of async syntax sugar;
- what happens when you call a closure through a dynamically typed variable.
Till next time.
I am always available to help you with Dart related performance questions and concerns. Do not hesitate to reach out to me by mail or through other channels like Twitter.