D compilation is too slow and I am forking the compiler

While working on my current project, the constant creep of increasing compilation times was becoming more and more noticeable. Even after throwing my usual tools at the problem, the total time was still over 7 seconds.

Seven. Seconds! — *Seven*. **Seconds**. *Unacceptable‼*

Compile time profiling showed that the blame lay with my liberal use of metaprogramming and std.regex, which I wasn’t willing to give up on. The usual approach to reducing D build times is to split the program into packages, compile one package at a time (to a static library or object file), use D “header” files (.di) to avoid parsing implementations more often than necessary, then link everything together. However, this was too much work, didn’t fit neatly into my existing toolchain, and I wanted to try something else.

C++ compilers handle a very similar problem with precompiled headers, in which the results of parsing header files is serialized and saved to disk in an implementation-defined format. D implementations do not have an analogous feature at the moment, but that still left me wondering:

The precompiled header format is highly implementation-specific, and compilers are free to break compatibility even across minor versions.
The goal is to preserve information and avoid doing repeated work across multiple invocations of the compiler.
The main difficulty with implementing it is serializing the compiler’s internal state to disk, and deserializing it back into memory.

But… what if we didn’t care about saving it to disk? Can we checkpoint the compiler as it’s parsing the files one-by-one, and rewind as necessary?

Hmm… If only there was some way to efficiently snapshot a process’s internal state, and rewind / resume it back to that point… Oh wait, there is – and it’s called fork(). Granted, not a typical use case, but still a promising idea to build on. Let’s get to work.

Part 1 – Graph Theory

In order to correctly partially rebuild a D program, we need to know which parts need to be rebuilt. In our case, this comes down to knowing which files need to be recompiled when a certain file changes.

Luckily, DMD has us covered, as it can output dependency information in a plethora of formats. Great, let’s use that! Slurping the results of -deps gets us a directed graph like this:

GraphViz output — *Digger’s module dependencies (click to enlarge)*

As you can clearly see…

Ahem. Let’s try that again:

As you may have noticed, this graph has cycles. D does allow cyclic dependencies between modules (as long as at most one module in the cycle has static constructors), which is a problem for us as we can’t meaningfully snapshot the compiler in the middle of a cycle – we need to treat them as an all-or-nothing indivisible entity.

Fortunately, graph theory comes to the rescue, and we can use Kosaraju’s clever algorithm to separate the graph into strongly connected components. This simplifies the graph into a DAG:

Our job is not done yet. We need to flatten this DAG into a list, as we can’t meaningfully combine checkpoints in which one is not a parent of another. We can use topological sorting to get there:

Great! Now, we have a list of components, wherein each component is a group of files that we can compile together, and create a snapshot after each such component.

File modification times work nicely as a secondary sorting parameter: we can assume that more recently modified files are more likely to be modified again soon, so placing them in front of older files can save us from rebuilding older files (ranked equally dependency-wise).

(Implementation note: here I got lazy and used an О(n²) insertion sort for the topological/chronological sorting, though I’m sure a more efficient algorithm exists. As an additional curiosity, std.algorithm.sort won’t work here because it apparently expects a weak order for its predicate, while we only have a strict partial order – full credit to Feep on IRC for figuring this out.)

Part 2 – Hacking the Compiler

All the graph theory in the world will not help us in our endeavor if we can’t get the compiler to actually compile files one-by-one. Currently, DMD compiles source code in roughly the following order:

Read all source files (modules) given to it on its command line into memory
Parse all loaded modules
Load all imported modules, and parse them too
Perform the first semantic pass on all modules
Perform the second semantic pass on all modules
Perform the third semantic pass on all modules
Perform inlining (yes, DMD inlines in the frontend by manipulating the language AST, not backend) on all modules
Generate code and write object files for all modules
Link.

This is clearly unsuitable for us: we want to be able to perform as many steps as possible one file at a time.

Fortunately, we can coax the compiler in doing just that with a few relatively simple changes: instead of processing all files at once, read a group of them at a time, go through all the motions to compile it, and repeat.

I wish I could say I got all compilation steps to work serially. Unfortunately, I haven’t managed to serialize code generation - my attempts ended with linking errors involving template instantiations (an area of the compiler which everyone agrees has become unordinarily messy). My theory is that the compiler is attempting to place template instantiations in object files whose modules have already been compiled. In theory, the -allinst switch is supposed to alleviate that (by emitting template instantiations in every module which instantiates the template, rather than just once); however, it currently appears to be a bit broken.

Still, the third semantic pass does work serially (in my tests so far), and takes a significant part of compilation time, so this is still a win for us!

Part 3 – It’s Forking Time

We’re getting to the meat of the subject: actually making use of fork() for snapshotting.

Recall the flattened dependency graph from part 1:

Here’s how the corresponding snapshots should look like:

Note how the compilation order is the reverse of the topological order (we’ll get back to this in a bit). All snapshots except the first hold the result of compiling one module, plus all the modules before it; fork()’s CoW memory allows all snapshots to share memory pages for previous modules, thus not consuming extra RAM (though real results vary a bit due to fragmentation).

We are going to split our design into three parts:

A separate fork driver program, which handles module dependencies, and decides when it’s time to recompile, which files need to be recompiled, and in which order.
A “fork-server” compiler process, which communicates with the fork driver and is the owner of snapshot fork processes. Which process is the fork server can change when rewinding, but there should be exactly one fork server (i.e. process reading commands from the driver) at a time.
All the snapshot fork processes.

Implementation-wise, our goals are:

Keep a communication channel open with the driver program, which sends commands like “compile these files” or “rewind”.
Ensure the main instance doesn’t crash, even in the face of segmentation faults or ICEs in compiler code.
Reasonably manage resources, and clean up old snapshots as necessary.

The commands the driver program can send would then be:

Compile the next group of files, and reply with success/failure
Create a snapshot
Rewind to a previously-created snapshot
Finish compilation with all the files compiled so far (i.e. emit machine code and link).

Some optimizations we can do upon closer examination:

We can combine the first two commands into one, and create a snapshot automatically before compiling a new file group.
Since we need to fork (in order to snapshot) before compiling a file group anyway, we can use that fork as our “backup” in case the compilation errors out or crashes. In this case, the fork is promoted to the current fork-server instead of becoming a snapshot.

We also need a channel for snapshots to communicate with the fork-server, which we can do with a simple POSIX pipe. This brings the lifetime of the compiler process and its forks to the following:

Hmm, that looked less messy in my head. You might find the implementation (which I tried to comment well) easier to follow instead.

A note on quitting: One thing you may notice from the above chart is that there is no explicit “quit” command; the quit signal is implicit when the other side of the pipe is closed. This is straight-forward in the case of the fork-server reading from the driver, but gets more interesting in case of snapshots.

Each snapshot implicitly keeps an open file descriptor to all snapshots before it. This is necessary so that when rewinding a snapshot causes it to be promoted to the current fork-server, said snapshot is able to rewind again to one of its ancestors. Now, when a fork server exits, only its immediate parent snapshot remains with zero processes on the other end of its control channel, which causes said snapshot to exit. However, as it exits, it also closes the only file on the read end of its immediate parent’s control channel as well. This creates a domino-like chain reaction in which all snapshots cause each other to exit in turn, up to whatever snapshot was currently promoted to the fork-server by rewinding, or all snapshots is case the driver closed its control connection. Fun!

Part 4 – Putting it all together

There is one final piece missing: a program to drive the fork-server, and tell it when and what to compile.

The driver needs to:

Collect dependencies of the program being compiled
Flatten it into a sorted list of components
Start the fork server
Control the fork-server to recompile the parts of the program that change (and their dependencies).

Note that dependency discovery needs to happen as a separate step from the compilation. This is because the compilation order of the two is reverse: when discovering dependencies, we must visit the compiled files “top-to-bottom” (i.e. program entry point down to the lowest dependency), which we can’t do in order for snapshots to work (as no part of dependent files may be read before the dependees are saved to a snapshot).

All together, the main loop of the driver is pretty simple. To put it in the context of the fork-server flowchart above:

Was it worth it?

Let’s find out:

(Compilation takes a bit longer in this video as it’s using a debug DMD build.) The time savings are not spectacular, but still significant! (They would be even more significant if not for those pesky template instantiations…)

To try it yourself, check out the dmdforker branch in my dmd GitHub fork.

Caveats:

I’ve only tested that It Works On My Machine™, and only on a pair (read: literally two) non-trivial programs.
Aside from the problems with serializing code generation mentioned above, inlining + CTFE is another troublesome combination. Unlike in ~~sane~~ most compilers, DMD’s inlining is implemented in the frontend (i.e. over the semantic AST). Normally, any inlining happens before all CTFE has executed, but this is no longer true now that we’re processing all modules in turn. This is a problem, as the inliner does not worry about making sure that the AST remains well-formed after it’s done its transformations; all it cares about is that it’s good enough for the backend to understand it sufficiently to turn it into machine code. Workarounds: don’t use -inline; use less CTFE; or, annotate troublesome functions with pragma(inline, false);.
Changes in the topology of modules’ dependencies are not handled:
- Dependencies to new modules should “work” in the sense that -i will cause them to be compiled at the same time as the importing module.
- New backward dependencies (to modules that have been incidentally already compiled) should cause no problems.
- Changes causing forward dependencies in the graph, however, will break compilation (you may see this as module foo from file foo.d is specified twice on the command line – restart the server to rebuild the graph in this case).
- The driver (dmdforker) will also not know about changes in modules outside of the initial dependency graph (it will think the program is up to date).

Some anticipated questions:

Q: How does this affect the integrity of the compilation process?

The question is multi-faceted:

“False cache hits” (i.e. the compiler not recompiling code when it should have, and using stale code instead) should be impossible as long as the driver’s dependency graph is up-to-date, as the compiler doesn’t even know the file names of files that it hasn’t compiled yet, and it can’t even guess them because we are traversing the dependency graph bottom-up.
Changes in the edges within the dependency graph can cause some wasted work or errors (see caveats above), but never wrong results - you’ll get an error if a change in the dependency graph would have caused a dependency to be incorrectly included in a snapshot.
Additions of new nodes (files) in the dependency graph will cause the current driver implementation to not see changes in them in rebuilds following their addition. This will manifest as the driver claiming the program is up-to-date when it might not be.
That said, I did run into some weird errors on the way (plus the inlining issues mentioned above). The compiler code base wasn’t really designed to compile files one at a time (though the new -i switch does impose many similar constraints). YMMV.

Q: Isn’t all this a bit too much trouble to save 4 seconds on a 9-second compile?

Yes. This was an idea I was sitting on for a while, and decided to have a go implementing it given the opportunity. The technique might be useful for some truly larger projects, though, and might be applicable to compilers other than D, or processes other than compilation.

Q: Has this been done before?

Precompiled headers have been around for a long while, and the idea of using fork() for snapshots is not new, but I haven’t seen it used for the goal of speeding up compilation.

There exist debuggers capable of snapshotting / reverse execution, like Mozilla’s rr. However, fork():

does not require special permissions
has no dependencies
is more portable
is faster.

Q: What next?

Well, assuming there is interest to continue this hack of a weekend project:

The compiler patches could potentially be upstreamed into DMD, thus making -fork-server a built-in option. The protocol is simple (you can even use it interactively), and other build tools could use it as well. Dub is a great candidate: as it already knows the dependency graph between Dub packages, it could treat each package as a component.
It sure would be nice to fix code generation to work per-file! This would also allow greatly speeding up optimizations, which take a lot of time for release builds.
It should be possible to detect changes in the dependency graph and automatically rebuild it as necessary, thus avoiding the need to manually restart the fork driver when it happens.
There is no direct analogue to fork() on Windows, which would seem to make a Windows port impossible… however, Windows does support it in its Linux subsystem. There is also the mysterious RtlCloneUserProcess function, which seems to do much of what fork() would. So, perhaps not impossible?

Thanks for reading!

Acknowledgements:

Feep and UplinkCoder on IRC for brainstorming parts of this article
Jonathan Marler for implementing and pushing through dmd -i.

Discuss this article on Reddit, Hacker News, the D forum, or the comment section below.