How we cleaned up our big Git repo

Jonathan Koppenhaver

In a previous article, I discussed how we converted a large SVN repository to Git. This article explores our next project—cleaning up our new Git repository.

We got to a usable state after the transition to Git, but we can do better. Now we will clean up the repository by rewriting all of the commit messages and removing a lot of extra junk.

Given how long many of these operations take—especially the just-completed SubGit conversion—and the number of iterations I went through to make each piece perfect, I strongly advise making a copy of the repository after each major step. Feed the new copy into the next step. That lets you rerun the later steps without worrying about corrupting previous work.

To help with that and to keep your sanity intact, you’ll almost certainly want to combine the various commands presented here into a script personalized for your situation. I ran through various parts of the process a hundred times and was extremely grateful for a consistent script.
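Here is a minimal sketch of what the skeleton of such a script might look like. The WORK_DIR layout and the snapshot and run_step helpers are hypothetical names chosen for illustration; the idea is simply that each step works on a fresh copy of the previous step's output.

```shell
#!/bin/sh
# Sketch of a step runner. WORK_DIR, snapshot and run_step are
# hypothetical names, not part of any tool mentioned in this article.
set -eu

WORK_DIR="${WORK_DIR:-/tmp/repo-cleanup}"

# Copy the output of one step into a fresh directory for the next step.
# Any stale copy left over from an earlier run is removed first.
snapshot() {
    src="$1"
    dst="$2"
    rm -rf "$dst"
    mkdir -p "$(dirname "$dst")"
    cp -a "$src" "$dst"
}

# Run one step against a copy of the previous step's result.
# usage: run_step <previous-step> <this-step> <command> [args...]
# The command runs with the new copy as its working directory.
run_step() {
    prev="$1"
    step="$2"
    shift 2
    snapshot "$WORK_DIR/$prev" "$WORK_DIR/$step"
    ( cd "$WORK_DIR/$step" && "$@" )
}
```

With the converted repository stored under $WORK_DIR/01-subgit, a call such as run_step 01-subgit 02-bfg java -jar ~/bfg.jar --private -bi /tmp/large-blobs.list would run BFG on a copy, so a bad run can be repeated without redoing the conversion.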

Let’s review our requirements from last time.

Our project requirements

  1. Convert an SVN repo to a Git repo, maintaining history for git blame.
  2. Minimize repo size as much as possible. Target is 2 GB.
  3. Developers need to be able to keep working in SVN until we’re ready to switch.
  4. Reformat commit messages to remove obnoxious template and to include the original SVN revision number.

Our tools

  • BFG Repo-Cleaner: Cleans out bad files and commits from your history much more efficiently than Git’s built-in filter-branch command.
  • incremental-git-filterbranch: Another tool for efficiently filtering Git commits.
  • Git LFS: Git Large File Storage, a Git add-on for storing large files outside of the repository.
  • JFrog Artifactory: Storage for binaries. We’ll use it as our LFS store, although there are other options available.

Step 1: Rewrite Commit Messages

Our SVN repository required that commit messages follow a bulky XML-esque template, a case of noble intentions gone wrong. I wanted to reformat the messages in Git to be cleaner and easier to read. The incremental-git-filterbranch project provides a relatively quick way to do this:

incremental-git-filterbranch --no-lock -- $REPO_PATH "--msg-filter \"$PATH_TO_SOME_SCRIPT\"" $NEW_REPO_PATH

$PATH_TO_SOME_SCRIPT is executed once for each commit; it receives the current commit message on STDIN and must write the modified message to STDOUT. I wrote a Ruby script to parse our template and rework the message into something more palatable, but feel free to use your language of choice. As the “incremental” portion of the name implies, if you rerun the command against the same repository, it only processes commits added since the previous run.
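To make the STDIN-to-STDOUT contract concrete, here is a small filter sketched in shell rather than Ruby. The summary, ticket and description tags are an invented stand-in for our real template, which isn't reproduced here, and the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical message filter: reads a commit message on STDIN and writes
# the cleaned-up message to STDOUT. The <summary>/<ticket>/<description>
# tags are an invented example template, not the real one.

# Print the text between <tag> and </tag> when both sit on one line of
# $message (only the first matching line is used).
field() {
    printf '%s\n' "$message" | sed -n "s:.*<$1>\(.*\)</$1>.*:\1:p" | head -n 1
}

reformat_message() {
    message="$(cat)"
    summary="$(field summary)"
    ticket="$(field ticket)"
    description="$(field description)"

    # Messages that do not follow the template pass through unchanged.
    if [ -z "$summary" ]; then
        printf '%s\n' "$message"
        return
    fi

    # Subject line, with the ticket in parentheses when there is one.
    if [ -n "$ticket" ]; then
        printf '%s (%s)\n' "$summary" "$ticket"
    else
        printf '%s\n' "$summary"
    fi

    # Blank line, then the body, as Git convention expects.
    if [ -n "$description" ]; then
        printf '\n%s\n' "$description"
    fi
}
```

Saved as the filter script, the file would end with a bare call to reformat_message so that it processes whatever arrives on STDIN. It can be tried by hand first by piping a sample message into it.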

Step 2: Remove Large Objects

(Note: In retrospect, I might have skipped this step in favor of just including everything in Git LFS. Depending on your situation, you might still prefer to get rid of old, large files, particularly if some are egregiously large.)

Our repository contained a lot of very large files. I was able to strip out many of them with SubGit’s configuration, but that’s an all-or-nothing process. I wanted to delete all large files (let’s say over 1 MB) that are no longer in use at the head of main or a few other important branches. This was a compromise between repo size and commit integrity that was acceptable for our situation, and if we ever really need a deleted file, we can boot the SVN server. So far, this hasn’t happened.

There’s a great tool called BFG Repo-Cleaner that provides many options for selectively deleting files from the repository. If main is the only branch you care about preserving, you can call BFG directly and it should just work. However, because there are multiple branches we need to keep healthy, our situation requires a little more work. We need to generate a list of all of the objects that are currently in use on main or any other important branch.

git ls-tree -r main | cut -f 1 | cut -d ' ' -f 3 >> /tmp/active-objects.list

Repeat the above line, replacing main with the name of each of your important branches. git ls-tree -r lists every file (blob) in that branch’s tree, the cut calls trim each line down to just the blob ID, and >> appends the IDs to the list file. Because >> appends, delete /tmp/active-objects.list before rerunning so that stale IDs from an earlier attempt don’t linger.
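If there are more than a couple of branches, the repetition can be wrapped in a loop. This is a sketch only: the function names are illustrative, and the branch names you pass in are your own.

```shell
#!/bin/sh
# Sketch: collect the object IDs in use on several branches in one pass.

# Reduce `git ls-tree -r` output ("<mode> <type> <id><TAB><path>") to IDs.
extract_blob_ids() {
    cut -f 1 | cut -d ' ' -f 3
}

# usage: list_active_objects <branch>...
# Prints the sorted, de-duplicated IDs of every object on the given branches.
list_active_objects() {
    for branch in "$@"; do
        git ls-tree -r "$branch" | extract_blob_ids
    done | sort -u
}
```

Run from inside the repository, list_active_objects main release-1 release-2 > /tmp/active-objects-sorted.list (with hypothetical branch names) produces the same sorted file as the manual sort and uniq step.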

We ought to sort and remove duplicates from our list:

sort /tmp/active-objects.list | uniq > /tmp/active-objects-sorted.list

Now that we have our list of active objects, we can use some more command line magic to generate a list of all objects that are larger than 1 MB and are not currently in use on our important branches. Thanks to Stack Overflow contributors for figuring this one out:

comm -23 \
   <(git rev-list --objects --all | git cat-file --batch-check="%(objecttype) %(objectname) %(objectsize) %(rest)" | grep ^blob | awk '$3 > 1024 * 1024 { print $2 }' | sort) \
   /tmp/active-objects-sorted.list \
   > /tmp/large-blobs.list
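The filtering half of that pipeline is easier to trust once you have seen it work on a few made-up lines. The sketch below pulls the awk and comm stages into a function; large_unused_blobs and its arguments are illustrative names, and it uses a temporary file instead of process substitution so that it also runs in a plain POSIX shell.

```shell
#!/bin/sh
# Sketch of the filtering logic on its own, so it can be tried on sample data.

# usage: large_unused_blobs <min-bytes> <sorted-active-ids-file>
# Reads "<type> <id> <size> <path>" lines on STDIN (the --batch-check
# format used above) and prints the IDs of blobs larger than <min-bytes>
# that are not listed in the active file.
large_unused_blobs() {
    min_bytes="$1"
    active_list="$2"
    candidates="$(mktemp)"
    awk -v min="$min_bytes" '$1 == "blob" && $3 > min { print $2 }' \
        | sort -u > "$candidates"
    # comm -23 keeps lines that appear only in the first (candidate) file.
    comm -23 "$candidates" "$active_list"
    rm -f "$candidates"
}
```

In real use its input would be the output of git rev-list --objects --all piped through git cat-file --batch-check with the format string shown above. Both lists must be sorted under the same locale, or comm will complain that its input is out of order.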

Great — /tmp/large-blobs.list is the list of offending files. Let’s nuke them with the BFG. From the root folder of our latest repo copy:

java -jar ~/bfg.jar --private -bi /tmp/large-blobs.list

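Alright! Our repo should be looking much leaner now, although the disk space is only reclaimed once the old objects are garbage-collected (the git reflog expire and git gc commands at the end of Step 3 take care of that). If you check out an old commit, it might be missing some of its large files wherever a file’s contents differ from the version on the branch head. But that was kind of the point here!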

Step 3: Convert to LFS

Next, we’re going to hook up Git LFS. We’ve removed many of the large objects from our repository, but there are still plenty left. High resolution PNGs, the occasional committed binary, and some large test files can all add up to significant disk space. Git LFS works by moving the actual file to an external storage solution, and replacing it in the repository with a pointer to its new location. This keeps the size of the repository down so your initial git clone of the repository is fast, and you only need to download the full file when it’s needed for the commit you want to check out.

You’ll want to create a new repository in Artifactory (or your binary storage solution of choice), give out appropriate permissions, and then tell Git to use it. There are two ways to do this. You can modify your local Git settings via:

git config lfs.url ssh://git@artifactory.company.com/artifactory/myrepo-lfs

Or, you can create a file named .lfsconfig in the root of your repository and put a similar entry in it:

[lfs]
   url = ssh://git@artifactory.company.com/artifactory/myrepo-lfs

There are pros and cons to both options. If you use the .lfsconfig file, everyone gets the correct settings just by checking out the branch. However, unless you rewrite history to include that file in your old commits, anyone who checks out an older commit will have broken LFS URLs and will need to remember to run the git config command.

Since it’s rare for us to go back in time, I went with the .lfsconfig approach, and committed it to all active branches. In retrospect, I should have rewritten the repository’s initial commit to include .lfsconfig. This would have been a more thorough solution that avoids the occasional confused developer with missing LFS files.

Next, we need to populate the .gitattributes file with the list of file types we want to store in LFS. The contents will look something like the following, with one entry for each path pattern you want LFS to handle:

*.exe filter=lfs diff=lfs merge=lfs -text
*.dll filter=lfs diff=lfs merge=lfs -text

In this case, we’re moving all exe and dll files over to LFS. The filter, diff and merge keywords tell Git that it needs to use the special lfs versions of those operations to appropriately handle the files. -text tells Git to treat these files as binary rather than text, so it performs no line-ending conversion on them.

Like .lfsconfig, you’ll need to commit .gitattributes to your repository and you’ll probably want to include it in the initial commit unless you aren’t worried about people checking out pre-conversion commits.

As an aside: while you’re poking around Git LFS, you might bump into smudge and clean. These are Git’s names for the filters it runs when it checks a file out into the working tree and when it stages a file, respectively. LFS uses smudge to replace the “pointer” file with the actual contents from the LFS store, while clean takes the new version of an LFS-tracked file, saves its contents in the local LFS cache (they are uploaded to the LFS store when you push), and puts a new pointer in the repository in its place.
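A pointer is just a tiny text file whose first line names the LFS spec version, followed by an oid line and a size line. When a developer reports a “corrupt” binary, it is often a pointer that smudge never replaced. The sketch below checks for that; is_lfs_pointer is a hypothetical helper name, and the first-line check follows the pointer format in the Git LFS specification.

```shell
#!/bin/sh
# Sketch: report whether a checked-out file is still an LFS pointer,
# meaning smudge has not replaced it with the real contents.
#
# A pointer file looks like this:
#   version https://git-lfs.github.com/spec/v1
#   oid sha256:<64 hex digits>
#   size <bytes>

# usage: is_lfs_pointer <file>
# Exit status 0 if the first line is the LFS version line, 1 otherwise.
is_lfs_pointer() {
    head -n 1 "$1" | grep -qx 'version https://git-lfs\.github\.com/spec/v1'
}
```

If a file turns out to be a pointer, running git lfs pull in that checkout fetches the real contents, provided the LFS URL is configured correctly.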

Well, we’ve finished our configuration so we can run the conversion.

git lfs migrate import --everything --include="*.exe,*.dll"

That’ll run for a while and convert your designated large files into LFS objects. The --everything flag tells it to convert all reachable commits, but you can be more selective if you’d like. Note that .gitattributes tells Git how to handle these types of files going forward, but isn’t used by the conversion process, so we still have to specify the file types with --include.
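Since the same patterns have to appear in both .gitattributes and --include, one way to keep them from drifting apart is to generate both from a single list. This is a sketch with illustrative function names, not something Git LFS provides.

```shell
#!/bin/sh
# Sketch: derive both the .gitattributes entries and the --include value
# from one list of patterns.

# usage: lfs_attributes <pattern>...
# Prints one .gitattributes line per pattern.
lfs_attributes() {
    for pattern in "$@"; do
        printf '%s filter=lfs diff=lfs merge=lfs -text\n' "$pattern"
    done
}

# usage: lfs_include_arg <pattern>...
# Prints the patterns joined with commas, as --include expects.
lfs_include_arg() {
    ( IFS=,; printf '%s\n' "$*" )
}
```

For example, lfs_attributes '*.exe' '*.dll' > .gitattributes writes the two entries shown earlier, and --include="$(lfs_include_arg '*.exe' '*.dll')" passes the matching value to git lfs migrate import. Keep the patterns quoted so the shell doesn't expand them against files in the current directory.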

When it’s done, you might consider running:

git reflog expire --expire-unreachable=now --all
git gc --prune=now

This tells Git to run garbage collection and get rid of the old unnecessary copies of your files, keeping only the LFS version. If you check the size of your .git folder now, it should be about the same size as it was before the LFS conversion, but your large files will now be in the lfs subfolder, which is not part of the repository proper.

Step 4: Share

That’s it! We’ve cleaned up our commit messages, removed a ton of unneeded data, and relocated the remaining big stuff to a better home. All that’s left is to share our work with the team:

git remote set-url origin ssh://git@git.mycompany.com/myrepo.git
git push --all
git push --tags

We took our massive 135 GB Subversion repository and mashed it down to a much more manageable 2 GB Git repository, thanks to careful SubGit conversion, BFG Repo-Cleaner, and Git LFS. Good luck with your own repository-management endeavors!
