How we converted a huge SVN repo to Git

Jonathan Koppenhaver

Four years ago, we decided to move our source control from Subversion (SVN), which had served us well for over a decade, to Git. While we knew the transition would be challenging, we desperately wanted a first-class pull request experience to help improve code quality.

When we started, our primary SVN repository consisted of around 500,000 revisions, including some regrettable mistakes, like new employees accidentally adding 2 gigs of data files. As a result, the total size of the repository on the SVN server was a whopping 135 GB. The Internet suggests Git repos are happiest when smaller than 300 MB, or 1 GB, or 2 GB, depending on what you read. Whichever number you believe, we were well outside of a comfortable size.

We had two options:

  1. Just copy-and-paste the head of the SVN trunk into an empty Git main branch, and call it good.
  2. Carefully curate the history that gets converted, minimizing overall size while maintaining maximum value.

In either case, we could keep the SVN server around to refer to the original source if necessary. However, no one is really going to want to dig around in SVN, and having a functioning git blame is worth some suffering. So, we investigated Option 2 to see how far we could take it.

While searching the Internet for solutions, I found plenty of articles talking about converting repositories with 10,000 revisions. Cute, I thought, but will it hold up to our beast?

The answer was “no.” In the end, the only way I could complete the conversion was by running SubGit on a Linux VM. Thanks to several tools and a lot of patience, our new Git repo weighed in at 2 gigs: large, but manageable. With several years of hindsight now available, I can say that we don’t miss the deleted data.

The steps below go through the process we used to convert a repository from SVN to Git. In a subsequent article, we will explore additional steps to reduce the size of the Git repo and clean up its commit messages.

Our project requirements

  • Convert SVN repo to Git repo, maintaining history for git blame.
  • Minimize repo size as much as possible. Target is 2 GB.
  • Developers need to be able to keep working in SVN until we’re ready to switch.
  • Reformat commit messages to remove the obnoxious template and to include the original SVN revision number.

Our tools

  • Atlassian’s svn-migration-scripts.jar — Used to generate a mapping of SVN users to Git users. See also their SVN-to-Git Migration document.
  • SubGit: Our heavy lifter; it does the actual SVN to Git conversion. It’s free for one-time conversions. There’s a paid version available if you need to sync changes back from Git.
  • A Linux machine with Java installed — If you have an SVN repo of any real size, you’ll 100% want to use Linux to do the conversion. An unimpressive Linux VM converted commits about 30 times faster than my fairly high-end Windows developer machine. I suspect that WSL2 might be sufficient, but I haven’t tried it.

Step 1: Map your authors

First up, we need to generate a mapping from SVN usernames to Git’s name-plus-email format. We can use Atlassian’s tools to do that with:

java -jar svn-migration-scripts.jar authors $SVN_REPO_URL > authors.txt

This will spit out authors.txt containing a list of all the people who have committed to your SVN repo. Something like:

aperson = aperson
bsmith = bsmith
cjones = cjones

Now we do some grunt work to convert this into useful names. I used a combination of Regex Find-and-Replaces in Sublime Text plus some good ol’ fashioned elbow grease to get this into the final format:

aperson = Andy Person <andy_person@mycompany.com>
bsmith = Bob Smith <bob_smith@mycompany.com>
cjones = Clara Jones <clara_jones@mycompany.com>

I had fun trying to remember former employees from just their first initials and last names. That said, my authors list was around the maximum size I would be willing to go through by hand. If your list is very long, you might want to make your peace with just having SVN-style usernames in your converted commits. Additionally, if anyone new starts committing between now and when you transition to Git, you’ll have to add them manually to the file.
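
Because new committers can appear between generating the mapping and the final cutover, one way to catch them is to regenerate the raw list and compare it against the curated file. The following is a minimal Python sketch, not part of Atlassian's or SubGit's tooling. The function names and the authors_raw.txt file name are hypothetical, and it assumes both files use the "username = value" layout shown above.

```python
from pathlib import Path


def parse_authors(text):
    """Parse 'svn_user = value' lines into a dict, skipping other lines."""
    authors = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        user, _, value = line.partition("=")
        authors[user.strip()] = value.strip()
    return authors


def find_unmapped(raw_text, curated_text):
    """Return SVN usernames that still need a real 'Name <email>' entry.

    A user counts as unmapped if it is missing from the curated file, or
    if its curated value is still the bare username placeholder.
    """
    raw = parse_authors(raw_text)
    curated = parse_authors(curated_text)
    return sorted(
        user for user in raw
        if user not in curated or curated[user] == user
    )


if __name__ == "__main__":
    # Hypothetical file names: a fresh dump next to the hand-edited mapping.
    raw_path, curated_path = Path("authors_raw.txt"), Path("authors.txt")
    if raw_path.exists() and curated_path.exists():
        for user in find_unmapped(raw_path.read_text(), curated_path.read_text()):
            print(f"needs mapping: {user}")
```

Rerunning the authors command into authors_raw.txt shortly before the switch and then running this check would list anyone who still needs a hand-written entry.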

Step 2: Configure SubGit

SubGit has a bunch of configuration options, described in detail in the documentation. To generate a basic configuration file, run:

subgit configure $SVN_REPO_URL $OUTPUT_FOLDER

This will create a config file in $OUTPUT_FOLDER/subgit/. Now is a good time to copy your authors.txt into the same folder as well. Your config will be very specific to your situation, but here are a few options I found useful:

[core]
 authorsFile = subgit/authors.txt

In the core section, be sure to point authorsFile at your carefully-curated authors.txt.

[svn]
 gitCommitMessage = %message\n\nSVN: %revision
 trunk = trunk:refs/heads/master
 branches = svn_path_to_branch:refs/heads/path_to_git_branch
 excludePath = /goofyPath
 excludePath = /a/path/with/wildcards*/**
 excludePath = *.abc

The svn section is where I spent the most time.

  • gitCommitMessage lets you reformat the SVN commit message. There are several variables discussed in the documentation, but I was most interested in appending the SVN revision to the end using %revision. This lets us maintain a connection to revisions referenced in our issue tracking software.
  • trunk and branches let you map SVN branches to Git. We have a relatively complicated branching model in this repository, so I polled our Project, Product, and Development teams to determine which branches were most likely to need future development, and converted only those to Git. If another branch unexpectedly needed work done on it, it would just have to happen in SVN.
  • excludePath is my favorite property, as it let me fix some of the mistakes we had made over the years. Someone had amazingly managed to commit all of trunk into a subfolder of trunk at one point, so I excluded that. In other cases, certain types of large binary files tended to be unnecessarily committed. I filtered them out entirely. Finally, a few projects had already been extracted from the repository into their own Git repositories, and we had no need for their history clogging up our new repository, so I excluded them as well.

[auth "default"]
 passwords = username/password

Finally, be sure to add valid SVN credentials to the auth section so that SubGit is able to talk to your SVN server.
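
Going back to the gitCommitMessage template above: once every converted commit ends with an "SVN: ..." line, the original revision can be recovered from a Git commit message, which is handy when an old ticket only mentions an SVN revision. Here is a small Python sketch of that lookup. The function name is hypothetical, and it assumes %revision renders as a plain number, optionally prefixed with "r"; check a few converted commits before relying on it.

```python
import re

# Matches the last-line trailer produced by "%message\n\nSVN: %revision".
# Assumption: the revision is a number, possibly with a leading "r".
SVN_TRAILER = re.compile(r"^SVN: r?(\d+)\s*$", re.MULTILINE)


def svn_revision(commit_message):
    """Return the SVN revision recorded in a converted commit, or None."""
    matches = SVN_TRAILER.findall(commit_message)
    # Take the last match in case the original message also had an "SVN:" line.
    return int(matches[-1]) if matches else None


if __name__ == "__main__":
    example = "Fix login bug\n\nSVN: 482113"  # made-up message and revision
    print(svn_revision(example))
```

The message text could come from something like git log --format=%B, so the same function can build a revision-to-commit index for the whole repository.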

Step 3: Run SubGit

Alright, we’re all configured — let’s go!

subgit import $LOCAL_PATH_TO_NEW_REPO

SubGit will start iterating through the revisions in your SVN repo, modifying or skipping each as needed to match the rules you’ve given it. It’s been a while since I ran it, but I think it took about 18 hours to process the 200,000 commits we decided to keep. It’s slow. And remember this is on my Linux machine — it was massively slower on Windows! Fortunately, it keeps track of its progress so if you have to restart the process — or if you need to run it again because there’s new work in SVN — you can efficiently pick up where you left off.
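
To put those numbers in perspective, here is a quick back-of-the-envelope calculation in Python. It only uses the rough figures quoted in this article (about 200,000 commits, about 18 hours, and the roughly 30x Linux-versus-Windows difference), and it assumes that speed difference would hold for the entire run, which I never measured end to end.

```python
def conversion_estimate(commits, hours, slowdown=1.0):
    """Return (commits per second, total hours) for a given slowdown factor."""
    total_hours = hours * slowdown
    rate = commits / (total_hours * 3600)
    return rate, total_hours


# Figures from the text: roughly 200,000 commits in about 18 hours on Linux.
linux_rate, linux_hours = conversion_estimate(200_000, 18)
# Assumption: the "about 30 times faster" observation applies to the whole run.
windows_rate, windows_hours = conversion_estimate(200_000, 18, slowdown=30)

print(f"Linux:   {linux_rate:.1f} commits/s over {linux_hours:.0f} hours")
print(f"Windows: {windows_rate:.2f} commits/s over {windows_hours / 24:.1f} days")
```

That works out to about 3 commits per second on Linux, versus roughly three weeks of wall-clock time for the same import on Windows.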

So great, we made it through SubGit’s conversion. The new repository was about 5 GB — a huge improvement from the SVN source and usable if we’re desperate, but we can do better. Depending on the state of your repository, you might be happy stopping here. If you want to polish your new repo further, check back soon for how I cleaned up the newly-created Git repo.
