This post is from the CollabNet VersionOne blog and has not been updated since the original publish date.
Stash your Trash – Keep GIT clean with Gerrit 2.10
Garbage collection is really important, not just in the real world but also within your Git repositories. If you have lots of development activities going on, chances are very high that your Git clone and push operations will get slower and slower.
Fortunately, the cure is simple: Run git garbage collection (git gc).
We have seen examples where running gc cut clone and push times by more than 99 percent. The same goes for code browsing: run git gc and you can list your branches online within milliseconds instead of seconds again.
If Git garbage collection has so many benefits, you may be wondering why not all servers are configured to run it periodically. Git gc can consume a lot of resources (memory and CPU) if you do not constrain it properly. This may lead to situations where your server stops responding or even kills processes because of a memory shortage. Furthermore, setting up a cron job to run git gc periodically can be cumbersome and will require root permissions. Last but not least, garbage collection will prune unreferenced commits, which may result in permanent data loss if you had an accidental branch deletion or a force push gone wrong.
Fortunately, TeamForge’s Git Integration with its unique history protection feature can run Git garbage collection periodically while making sure you do not run out of resources or lose any accidentally deleted/rewritten branches. We just upgraded the Git backend for all our TeamForge versions to Gerrit 2.10. If you are interested in the other great features that come with the upgrade to Gerrit 2.10, check out this presentation. Our help documentation describes how you can update your existing installation to Gerrit 2.10. Customers using a disconnected install can find the new packages here.
This blog post is about Git garbage collection though, so let’s see how to configure this in Gerrit.
First you have to decide how much memory and how many CPU cores should be dedicated to garbage collection. Gerrit uses JGit for Git garbage collection. This means that all gc-related code runs inside Gerrit’s JVM and no native Git processes are spawned. Hence, there is no additional memory overhead for garbage collection, but push, clone and review operations happening in parallel might be a bit slower.
In our experience, you should dedicate about ¼ of Gerrit’s configured Java heap to garbage collection. If you have forgotten which heap limit you set, check the container.heaplimit setting in /opt/collabnet/gerrit/etc/gerrit.config.
If you do not have this parameter set, assume Gerrit’s total heap space will be ¼ of the memory dedicated to your machine.
As far as CPUs are concerned, dedicate at least ¼ of the server’s CPU cores. If you have fewer than four cores, dedicate one core to gc. Those resources are only dedicated to garbage collection while it is actually running; otherwise they are used to support ongoing clone, push and review operations.
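The sizing rule of thumb above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name gc_resources and its parameters are hypothetical, and rounding the core count up is one reading of "at least ¼".

```python
import math
from typing import Optional, Tuple


def gc_resources(
    cores: int,
    heap_gb: Optional[float] = None,
    machine_memory_gb: Optional[float] = None,
) -> Tuple[int, float]:
    """Return (threads, window memory in GB) for garbage collection.

    Hypothetical helper that follows the rule of thumb from this post:
    a quarter of the Java heap and a quarter of the cores (at least one).
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if heap_gb is None:
        if machine_memory_gb is None:
            raise ValueError("need heap_gb or machine_memory_gb")
        # No heap limit configured: assume the heap is 1/4 of machine memory.
        heap_gb = machine_memory_gb / 4
    # Round up so that "at least a quarter" of the cores is used.
    threads = max(1, math.ceil(cores / 4))
    window_memory_gb = heap_gb / 4
    return threads, window_memory_gb
```

For a small server with 2 cores and an 8 GB heap, gc_resources(2, heap_gb=8) yields one thread and 2 GB of memory for garbage collection.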
Let’s assume we have a large Gerrit server with 16 cores and 24 GB of Java heap configured. In that case, we would use 4 cores for garbage collection and 6 GB memory. Those values have to be put into a config file called /opt/collabnet/gerrit/.gitconfig as follows:
[pack]
    threads=4
    windowMemory=6g
[gc]
    pruneexpire=2.weeks.ago
There is another particularly interesting property called gc.pruneexpire. It determines how long it takes before unreferenced objects are removed from your repositories. By default, it is set to two weeks, which is a good value. We only mention this property here because setting it to values smaller than a minute may cause problems if you have frequent push activity to your repositories while garbage collection is running.
The file /opt/collabnet/gerrit/.gitconfig should be owned by the gerrit Unix user. It is read whenever a garbage collection is executed, so there is no need to restart Gerrit if you change your mind on resources.
How do you trigger Git garbage collection now? You have three choices:
1. Trigger manually using Gerrit’s gc command
2. Press the “Run GC” button in Gerrit’s Project commands (see screenshot below)
3. Configure gc to run periodically across all repositories
We recommend that you first try option one or two to figure out whether you are still satisfied with the performance of ordinary fetch, push and review operations while garbage collection is running. Both options require your user to have Gerrit’s Garbage Collection capability assigned to it. Once you have found the optimal settings, proceed with option three.
Garbage collection should be run at least once a week. For sites with considerable load (more than 100k requests per day), we would recommend every other day. We have seen highly loaded sites (more than 500k requests per day) where the interval was reduced to one day as well.
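For option one, the following Python sketch builds the SSH invocation of Gerrit's gc command. The host name, user name and helper name are hypothetical placeholders, and port 29418 is only assumed because it is Gerrit's usual SSH port; adjust all of them to your site. The sketch only assembles the command; it does not run it.

```python
import shlex
from typing import List, Sequence


def build_gerrit_gc_command(
    host: str,
    user: str,
    projects: Sequence[str] = (),
    port: int = 29418,  # assumption: Gerrit's usual SSH port
) -> List[str]:
    """Build an ssh command line that triggers Gerrit's gc command.

    With no projects given, gc is requested for all repositories.
    """
    targets = list(projects) if projects else ["--all"]
    return ["ssh", "-p", str(port), f"{user}@{host}", "gerrit", "gc", *targets]


if __name__ == "__main__":
    # Hypothetical host and user, shown only to illustrate the result.
    command = build_gerrit_gc_command("gerrit.example.com", "admin", ["my/project"])
    print(shlex.join(command))
```

The returned list can be handed to subprocess.run(command, check=True) by a user who holds the Garbage Collection capability.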
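As a rough illustration, the frequency recommendation above can be written as a small lookup. The function name is hypothetical, and the thresholds are simply the ones quoted in this post, not hard limits.

```python
def recommended_gc_interval_days(requests_per_day: int) -> int:
    """Suggest a gc interval in days from the daily request count."""
    if requests_per_day > 500_000:
        return 1  # highly loaded sites: daily
    if requests_per_day > 100_000:
        return 2  # considerable load: every other day
    return 7  # everyone else: at least once a week
```

Treat the result as a starting point and shorten the interval if clone and push times degrade between runs.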
The parameters that control how often gc is executed are startTime and interval in Gerrit’s gc config section. Let’s say you would like to run Git garbage collection every two days at 1 am. In that case, you would add the following lines to /opt/collabnet/gerrit/etc/gerrit.config:
[gc]
    startTime = 1:00
    interval = 2 day
[core]
    packedGitOpenFiles = 512
Please also notice our setting for core.packedGitOpenFiles. If you have set this value already, consider doubling it (and make sure that your system ulimit allows for that many open files), otherwise set it to at least 512. The reason for this is that Git gc will now run inside the same JVM as Gerrit and more Git pack files will be opened in parallel.
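The rule for core.packedGitOpenFiles can be sketched as follows. Both helper names are hypothetical. Note that the limit check, when called without an explicit limit, reads the open-file limit of the Python process running it, which is only a stand-in for the limit that applies to the Gerrit process.

```python
from typing import Optional

MINIMUM_OPEN_FILES = 512  # floor suggested in this post


def recommended_packed_git_open_files(current: Optional[int] = None) -> int:
    """Double an existing setting, otherwise start at the suggested floor."""
    if current is None:
        return MINIMUM_OPEN_FILES
    return max(MINIMUM_OPEN_FILES, current * 2)


def fits_open_file_limit(value: int, soft_limit: Optional[int] = None) -> bool:
    """Check that value stays below the soft open-file limit.

    The comparison is strict because the process needs file descriptors
    for sockets and logs in addition to pack files.
    """
    if soft_limit is None:
        import resource  # Unix only

        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        if soft_limit == resource.RLIM_INFINITY:
            return True
    return value < soft_limit
```

For example, an existing setting of 1024 becomes 2048, which would not fit under a soft limit of 1024 open files, so the ulimit would have to be raised first.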
To check whether your settings have been picked up correctly after a Gerrit restart, you can have a look at /opt/collabnet/gerrit/logs/gerrit.system.logs
You should spot lines similar to
INFO |WorkQueue-1|GarbageCollectionRunner| Triggering gc on all repositories
whenever a periodic garbage collection kicks in.
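If you prefer to check this with a script, a minimal Python sketch could filter the log for such lines. The helper name is hypothetical, and the two markers are taken from the sample line above; the exact wording may differ between Gerrit versions.

```python
from typing import Iterable, List

RUNNER_MARKER = "GarbageCollectionRunner"
MESSAGE_MARKER = "Triggering gc on all repositories"


def find_gc_triggers(lines: Iterable[str]) -> List[str]:
    """Return the log lines that report a periodic gc run being triggered."""
    return [
        line.rstrip("\n")
        for line in lines
        if RUNNER_MARKER in line and MESSAGE_MARKER in line
    ]
```

Pass it an open log file, for example find_gc_triggers(open(path)) with path pointing at the log file named above, and expect one entry per scheduled run.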
You may also check whether you have configured any cron jobs or similar mechanisms that still run native git gc. Those should be disabled, as Gerrit’s gc algorithm is typically more efficient and will not cause any race conditions with running Gerrit instances.
Git garbage collection has many more tuning parameters than the ones introduced in this post. If you think that we missed an essential one or would like to share your configuration, please drop a comment. We are currently preparing a Gerrit Performance Cheat Sheet (not just about garbage collection) and are particularly interested in the gerrit.config/.gitconfig of large Gerrit installations.