Splicing git repositories

Some projects eventually get split into multiple source repositories. Sometimes, however, it is useful to present the project as a single repository: it is harder to examine a project’s development history when it is scattered among several repositories. Specifically, git bisect is not very helpful when there are strong inter-dependencies between the repositories, as with the reference D programming language implementation. Or you may simply dislike the layout because it pushes you into a complicated build process you would rather avoid, which was the case for me with the new DerelictOrg repositories.

I created d-dot-git a while ago to solve the first problem. This program creates a new repository which contains the D component repositories as submodules, referencing mainline commits in each repository through a history which chronologically follows all D repositories. The result is a repository with a linear history, which you can then easily use with git bisect. You can see the resulting repository here. Digger takes away the chore of setting up git bisect run and building D, and allows you to specify a D source code test case directly. I also covered this in part 2 of my DConf 2014 presentation (“Reducing D Bugs”).
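To make the "history which chronologically follows all repositories" idea concrete, here is a minimal sketch in Python. It is not d-dot-git's actual code: the `linearize` function, the repository names, and the commit IDs are all made up for illustration, and it assumes each repository's mainline is already available as a list of (timestamp, commit) pairs. Each step of the merged history records which commit every submodule would point at.

```python
import heapq
from typing import Dict, List, Tuple


def linearize(histories: Dict[str, List[Tuple[int, str]]]) -> List[Dict[str, str]]:
    """Merge per-repository mainline histories into one linear history.

    `histories` maps a (hypothetical) repository name to its mainline
    commits as (timestamp, commit id) pairs. The result has one entry
    per original commit, in chronological order; each entry is the set
    of submodule pointers that would be recorded at that step.
    """
    # One chronologically sorted stream of (timestamp, repo, commit) per repo.
    streams = [
        [(ts, name, commit) for ts, commit in sorted(commits)]
        for name, commits in histories.items()
    ]
    state: Dict[str, str] = {}
    snapshots: List[Dict[str, str]] = []
    # heapq.merge interleaves the already-sorted streams by timestamp.
    for _ts, name, commit in heapq.merge(*streams):
        state[name] = commit            # only this submodule pointer moves
        snapshots.append(dict(state))   # the others keep their last value
    return snapshots


# Illustrative data only; these are not real commit IDs.
example = linearize({
    "dmd": [(1, "a1"), (4, "a2")],
    "phobos": [(2, "b1"), (3, "b2")],
})
```

Because every step changes exactly one submodule pointer, bisecting the merged history can pin a regression on a single commit in a single component repository.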

I wanted to do something different for Derelict – submodules wouldn’t cut it. The current structure of the Derelict project (it has a history of moves and refactorings) consists of a DerelictOrg GitHub organization, with a repository for each Derelict component. The repositories have a partially overlapping directory structure (each repository has a source/derelict directory, which then contains the sub-package with the respective component source code). I’d have liked one repository whose root is the contents of the derelict package, so that the component sub-packages sit directly at the root level. This would allow using the repository as a git submodule: one would only need to clone a project with --recursive, with no further setup or dependencies needed (aside from rdmd, which is included with D).

There are existing approaches and tools that could’ve worked to achieve this:

  • git filter-branch followed by a subtree merge would be OK for a one-off conversion. However, I wanted a live mirror, which would keep itself in sync with the latest DerelictOrg changes.
  • David Fraser’s splice-repos and Philippe Bruhat’s git-stitch-repo solve the same problem.
    However, they use git-fast-import/export to move all the data around, which makes them less efficient than they could be, as they need to shuffle all the data from the target directories directly.

The program I wrote, git-splice-subtree, works by reading and writing raw Git objects from the Git database. Since the files and the subtrees are not affected, the program only concerns itself with creating git commit and tree objects that point to the existing ones in the right pattern. It does this by fetching all repositories’ objects into a single one, then spawning batch git database object readers (git cat-file) and writers (git hash-object) and piping data to them directly. By not touching any of the repositories’ actual files, and spawning as few processes overall as possible, it is quite speedy (almost instant on the DerelictOrg repositories, which is great for a cron job). I’ve also managed to avoid having to choose between redundant disk writes for temporary files, or spawning a process for every commit, by tricking git hash-object into reading from the same named pipe over and over.
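To illustrate why only tree and commit objects need to be created, here is a small Python sketch of how such objects are laid out and named. It is not git-splice-subtree's code (the real tool pipes object data to git hash-object rather than hashing anything itself). The helper names, the component names gl3 and sdl2, and the author line are hypothetical, and the sketch assumes a repository using Git's default SHA-1 object format.

```python
import hashlib
from typing import Dict, List, Tuple

DIR_MODE = "40000"  # mode Git stores for a subtree (directory) entry


def object_id(kind: str, body: bytes) -> str:
    """Name of a Git object: SHA-1 over '<type> <size>\\0' plus the body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries: Dict[str, Tuple[str, str]]) -> bytes:
    """Serialize a tree object from name -> (mode, hex object id).

    Each entry is '<mode> <name>\\0' followed by the 20 raw ID bytes.
    Git orders entries by name, comparing directories as if their
    name ended in '/'.
    """
    def sort_key(item: Tuple[str, Tuple[str, str]]) -> bytes:
        name, (mode, _oid) = item
        return name.encode() + (b"/" if mode == DIR_MODE else b"")

    out = b""
    for name, (mode, oid) in sorted(entries.items(), key=sort_key):
        out += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(oid)
    return out


def commit_body(tree: str, parents: List[str], ident: str, message: str) -> bytes:
    """Serialize a commit object; `ident` is 'Name <email> <unix time> <tz>'."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {ident}", f"committer {ident}", "", message]
    return "\n".join(lines).encode()


# Stand-in for the existing source/derelict/<component> subtrees. A real run
# would use IDs read from the fetched repositories; the empty tree is used
# here only so that the example is self-contained.
EMPTY_TREE = object_id("tree", b"")

# New root tree that places the (hypothetical) components at the top level.
spliced = tree_body({
    "sdl2": (DIR_MODE, EMPTY_TREE),
    "gl3": (DIR_MODE, EMPTY_TREE),
})
root_id = object_id("tree", spliced)

commit = commit_body(
    root_id, [], "Example Author <author@example.com> 0 +0000", "Splice components\n"
)
commit_id = object_id("commit", commit)
```

The new root tree only references the subtrees by ID, so none of the file contents are read or rewritten. That is the property that lets the splice skip the data shuffling an export/import round trip requires.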

DerelictMerge generates the spec file for git-splice-subtree. The resulting repository is on BitBucket (to avoid unintentionally pinging any of the authors on GitHub).