The more you commit to a Git repository, the bigger the repository gets.
A project’s repository lives in a .git folder at its root. Our Rails project at TripCase has a 60M Git folder:
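The original listing didn’t survive, but checking is a one-liner. A sketch (it creates a throwaway repo so the command works anywhere; the 60M figure came from the TripCase repo, a fresh repo will be tiny):

```shell
# Measure a repository's .git folder with du.
cd "$(mktemp -d)" && git init -q .
du -sh .git
```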
This folder could’ve been a lot bigger, but Git has clever compression mechanisms: every object is zlib-compressed, and when objects are packed, Git can store just the delta between similar files instead of a full copy for every commit.
However, Git can’t produce useful deltas for most binary files. If you’re committing a lot of them, even when each change is small, your repository can grow quite large.
An Exercise
Create a new repository:
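The original commands were lost; presumably something like this (the repository name here is made up):

```shell
cd "$(mktemp -d)"            # somewhere to work; the original path is unknown
mkdir image-repo && cd image-repo
git init
```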
Let’s make our first commit: we’ll throw in an image from elsewhere on the disk. First, let’s find a good one.
Beach.jpg looks good. Let’s copy it into our project:
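The copy command didn’t survive; on macOS it was presumably a cp out of /Library/Desktop Pictures. So this sketch runs anywhere, a generated 10M file stands in for Beach.jpg:

```shell
cd "$(mktemp -d)" && git init -q .
# Original (assumed): cp "/Library/Desktop Pictures/Beach.jpg" .
# Portable stand-in: 10M of random bytes, which, like a JPEG,
# won't compress or delta well.
dd if=/dev/urandom of=Beach.jpg bs=1048576 count=10 2>/dev/null
ls -lh Beach.jpg
```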
Commit it:
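The commit listing was also lost; a sketch (the dd stand-in replaces the real image, and the identity config just makes the commit work in a clean environment):

```shell
cd "$(mktemp -d)" && git init -q .
git config user.name demo && git config user.email demo@example.com
dd if=/dev/urandom of=Beach.jpg bs=1048576 count=10 2>/dev/null  # stand-in image
git add Beach.jpg
git commit -m "Add Beach.jpg"
```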
Now check out the objects folder:
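That listing is gone too. A self-contained sketch of what it showed, again with a generated 10M stand-in for the image:

```shell
cd "$(mktemp -d)" && git init -q .
git config user.name demo && git config user.email demo@example.com
dd if=/dev/urandom of=Beach.jpg bs=1048576 count=10 2>/dev/null
git add . && git commit -q -m "Add Beach.jpg"
# Each object sits loose in .git/objects: one ~10M blob for the image,
# plus tiny tree and commit objects.
du -h .git/objects/*/*
```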
Notice the “10.0M” object.
Let’s overwrite that file with a new image, Zebras.jpg (the largest one in the backgrounds directory), and commit it:
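Another lost listing; a sketch, with 25M of fresh random bytes standing in for Zebras.jpg:

```shell
cd "$(mktemp -d)" && git init -q .
git config user.name demo && git config user.email demo@example.com
dd if=/dev/urandom of=Beach.jpg bs=1048576 count=10 2>/dev/null
git add . && git commit -q -m "Add Beach.jpg"
# Original (assumed): cp ".../Zebras.jpg" Beach.jpg
# Stand-in: overwrite the tracked file with 25M of new random bytes.
dd if=/dev/urandom of=Beach.jpg bs=1048576 count=25 2>/dev/null
git commit -q -a -m "Swap in the zebras image"
```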
The objects again:
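A sketch of what that listing showed (same stand-in setup as above):

```shell
cd "$(mktemp -d)" && git init -q .
git config user.name demo && git config user.email demo@example.com
dd if=/dev/urandom of=Beach.jpg bs=1048576 count=10 2>/dev/null
git add . && git commit -q -m "Add Beach.jpg"
dd if=/dev/urandom of=Beach.jpg bs=1048576 count=25 2>/dev/null
git commit -q -a -m "Swap in the zebras image"
# Two big loose blobs now: roughly 10M and 25M.
du -h .git/objects/*/*
```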
Use Preview.app to make a trivial change to the content of the image.
Add an arrow or some text somewhere on it. Just leave it mostly the same image.
Commit the change:
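The commit listing was lost. To mimic the Preview edit in a runnable sketch: saving in Preview re-encodes the JPEG, so nearly every byte changes even for a tiny visual tweak; regenerating the stand-in at a slightly different size (28M) approximates that.

```shell
cd "$(mktemp -d)" && git init -q .
git config user.name demo && git config user.email demo@example.com
dd if=/dev/urandom of=Beach.jpg bs=1048576 count=25 2>/dev/null
git add . && git commit -q -m "Zebras image"
# Stand-in for the Preview edit: re-encoding rewrites essentially
# the whole file, so regenerate it at 28M.
dd if=/dev/urandom of=Beach.jpg bs=1048576 count=28 2>/dev/null
git commit -q -a -m "Add an arrow to the image"
```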
The objects again:
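A sketch of the full three-commit state (stand-in sizes of 10M, 25M, and 28M mirror the originals):

```shell
cd "$(mktemp -d)" && git init -q .
git config user.name demo && git config user.email demo@example.com
dd if=/dev/urandom of=Beach.jpg bs=1048576 count=10 2>/dev/null
git add . && git commit -q -m "Add Beach.jpg"
dd if=/dev/urandom of=Beach.jpg bs=1048576 count=25 2>/dev/null
git commit -q -a -m "Zebras image"
dd if=/dev/urandom of=Beach.jpg bs=1048576 count=28 2>/dev/null
git commit -q -a -m "Add an arrow to the image"
# Three full-size loose blobs (~10M, ~25M, ~28M), even though the
# last two images would look almost identical.
du -h .git/objects/*/*
```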
There’s a new 28M object right there after the 25M object. Why? They’re essentially the same image, except I added an arrow to it. Shouldn’t Git be fancy enough to store only the diff?
Let’s use the git-gc command to clean up the objects directory:
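A sketch of that step. git gc rolls the loose objects into a single pack; with these incompressible stand-ins (as with re-encoded JPEGs) there are no useful deltas to find, so the pack stays at roughly the sum of the blobs:

```shell
cd "$(mktemp -d)" && git init -q .
git config user.name demo && git config user.email demo@example.com
dd if=/dev/urandom of=Beach.jpg bs=1048576 count=10 2>/dev/null
git add . && git commit -q -m "Add Beach.jpg"
dd if=/dev/urandom of=Beach.jpg bs=1048576 count=25 2>/dev/null
git commit -q -a -m "Zebras image"
dd if=/dev/urandom of=Beach.jpg bs=1048576 count=28 2>/dev/null
git commit -q -a -m "Add an arrow to the image"
git gc
# The loose objects are gone; everything now lives in one pack.
du -h .git/objects/pack/*
```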
Oh, so that’s what the cleanup does. The pack folder contains the following files:
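The exact listing is gone, but a pack directory after git gc holds a large .pack file with the object data plus an .idx index for looking objects up (newer Git versions may write a .rev file as well). A sketch:

```shell
cd "$(mktemp -d)" && git init -q .
git config user.name demo && git config user.email demo@example.com
dd if=/dev/urandom of=Beach.jpg bs=1048576 count=28 2>/dev/null
git add . && git commit -q -m "Add image"
git gc
ls -lh .git/objects/pack
```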
Notice that the pack file is 63M, which is exactly the sum of the objects we had before (28M + 10M + 25M). The cleanup probably would’ve been more useful if there had been unreachable or duplicate objects.
According to the documentation, git-gc has an --aggressive option:
Usually git gc runs very quickly while providing good disk space utilization and performance. This option will cause git gc to more aggressively optimize the repository at the expense of taking much more time. The effects of this optimization are persistent, so this option only needs to be used occasionally; every few hundred changesets or so.
Let’s see if that helps:
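A sketch of that attempt. --aggressive spends much more effort searching for deltas and recompressing, but there are no good deltas between these files, so the size barely moves:

```shell
cd "$(mktemp -d)" && git init -q .
git config user.name demo && git config user.email demo@example.com
dd if=/dev/urandom of=Beach.jpg bs=1048576 count=10 2>/dev/null
git add . && git commit -q -m "Add Beach.jpg"
dd if=/dev/urandom of=Beach.jpg bs=1048576 count=25 2>/dev/null
git commit -q -a -m "Zebras image"
dd if=/dev/urandom of=Beach.jpg bs=1048576 count=28 2>/dev/null
git commit -q -a -m "Add an arrow to the image"
git gc --aggressive
du -sh .git/objects
```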
Nope.
We Do This To Ourselves
GitHub has super useful, world-class image diffing support. In order to diff images, you have to commit multiple versions of the same image. And as we demonstrated above, Git isn’t great at optimizing for this behavior.
This isn’t the only feature GitHub supports despite it being a pretty bad idea. From their “What is my disk quota?” page:
GitHub supports rendering design files like PSDs and 3D models.
Because these graphic file types can be very large, GitHub’s designers use a service like Dropbox to stay in sync. Only the final image assets are committed into our repositories.
GitHub isn’t alone:
- Image diffing is popular with Git
- Facebook’s visual regression testing tool Huxley encourages you to commit images to your repository
- Kaleidoscope heavily markets their image diffing features
What We’ve Learned
Git has features that optimize disk space for text data. It’s not great at binary files, though.
Maybe it’s not technically possible to only store the diff from an image change. Maybe Git just doesn’t have this feature. Maybe committing images inefficiently isn’t that big of a deal.
Committing large binary files is fine, I guess. The value seems to outweigh the costs, which appear to be performance, disk size, and clone time. It should be fine as long as you don’t anticipate ever blowing past your Git host’s disk quota. However, that seems like a hard thing to estimate.
Going back in time to remove binary files does not look trivial.
Hit me up on Twitter if you have any feedback.