Optimizations when importing Git commit history

One of our biggest concerns with release.new was speed. We didn't want the user waiting too long while we generated release notes. For the most part, the release notes are generated from the Git commit history. So speed was all about importing the Git commit history for a given repository as quickly as possible.

Being members of the Laravel community, we used the laravel/framework repository as our benchmark. Laravel is one of the most popular web frameworks. It has nearly 40,000 commits and thousands of contributors. So it has a large history. It also has frequent releases.

We figured if we could quickly generate release notes for recent releases of Laravel, other projects would be fine. So we set a goal: generate the release notes for a recent Laravel release in under 10 seconds. This was our threshold.

Immediately, we knew using the GitHub REST API was not an option. There would be too many requests to too many endpoints. For large repositories, you'd need multiple requests just to paginate results. Even if you could send requests in parallel, some depend on the responses of others. Leaving all that aside, we'd likely get rate limited by GitHub with just a few simultaneous users.

This left us with actually cloning the repository. Now, if you went to your command line and ran git clone https://github.com/laravel/framework.git, it'd probably take 20 seconds. That puts us well past our threshold.

We were going to need to make a few optimizations.

Optimistic prefetching

One of the easiest optimizations on the web is to prefetch. To speed up cloning the repository, we implemented something similar.

When you visit release.new, you are presented with a form to enter information about your release. The very first field is the clone URL. Once the user has entered the clone URL, we can optimistically assume they won't change its value and will continue filling out the other fields.

Technically, this means that as soon as the clone URL field is blurred, we can clone the repository. So we run a job in the background to prefetch the Git commit history.
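
In the browser, a minimal sketch of that looks something like the following. The field id and the /api/prefetch endpoint are assumptions for illustration; release.new's actual implementation may differ:

    // Grab the clone URL field. The element id is hypothetical.
    const cloneUrlField = document.querySelector<HTMLInputElement>('#clone-url');

    if (cloneUrlField) {
      cloneUrlField.addEventListener('blur', () => {
        const cloneUrl = cloneUrlField.value.trim();
        if (cloneUrl === '') return;

        // Fire-and-forget: ask the server to start cloning in the background.
        // The endpoint is an assumption, not release.new's actual API.
        fetch('/api/prefetch', {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({ cloneUrl }),
        }).catch(() => {
          // Prefetching is best-effort; a failure just means no head start.
        });
      });
    }

Because the request is fire-and-forget, the form stays responsive whether or not the prefetch succeeds.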

As this runs in the background, it gives release.new a head start. The head start lasts as long as it takes the user to fill out the other fields and press "Generate release notes".

That may not seem like long. But however long it is gets deducted from the time the user waits for the clone. More importantly, the user spends less time in the "loading state". Spending 7 seconds in the "loading state" is a better user experience than spending 10 seconds in it.

In some cases, for smaller repositories or when the user takes longer to complete the form, there may be no "loading state" at all. If the prefetch has already completed, the release notes appear to be generated instantly after submitting the form.

Blobless cloning

Prefetching is a nice touch. But it doesn't actually speed up cloning the repository. Our real delay was git clone. Since we are not going to make changes to the repository, we only need a "read-only" copy. We don't even need the files. Just the commit history and some branch or tag data.

In Git terms, we need the commits and references, but not the blobs. We can achieve this using a combination of options for git clone, notably --bare and --filter. With --bare, we essentially just copy the contents of the .git folder. Not the project files.
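
On the command line, a bare clone of our benchmark repository looks like this (the result is a framework.git directory containing only Git's internal data):

    # Bare clone: repository data only, no working tree is checked out.
    git clone --bare https://github.com/laravel/framework.git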

Using --bare speeds up git clone quite a bit as it reduces the overall size. For example, it reduced the cloned laravel/framework repository from 132 MB down to 105 MB. Good, but still bigger than it needs to be. This is because the blobs, the contents of every file at every point in history, are still included. Since the laravel/framework repository has nearly 40,000 commits, that's a lot of blobs.

This led us to the --filter option. Specifically --filter=blob:none. This creates a "blobless" clone. Git doesn't fetch the blobs (the file contents) until it needs them. In our case, it never needs them, since we only read the basic data from git log. With this option, we were able to reduce the cloned size to 40 MB and the clone time to around 8 seconds.
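
Put together, the final command and the kind of history query we run against it look something like this. The log format and tag range are illustrative; the exact query release.new runs will differ:

    # Blobless bare clone: all commits and references, no file contents.
    git clone --bare --filter=blob:none https://github.com/laravel/framework.git

    # Walk the history without ever needing a blob.
    git -C framework.git log --pretty=format:'%H %an %s' v9.0.0..v9.1.0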

In closing

These two strategies combined allowed us to hit our threshold and feel confident our users would not be waiting any longer than necessary. Of course, prefetching is always a good strategy to keep in your developer toolbox, especially when working with web technology. But knowing the specifics of the technology you are using was the real win. In this case, digging into the options of git clone to find the optimal way to clone a repository.

If you manage your own build environments, you may be interested in optimizing your own git clone commands. I highly recommend reading this GitHub blog post. It takes a deep dive into the various options to speed up git clone and when to use them.

Try it yourself, for free!

Want to generate release notes for your open-source project? Head over to release.new and give it a try. We'd love to hear your feedback!