10 Things You Should Know About Git Replication in the Enterprise
1. Git is distributed, isn’t replication a concept for centralized version control systems?
You’re right, Git is a distributed version control system. However, most enterprises require their developers to synch their local work with central, “blessed” repositories. Those central synch points make sure no work is lost if a laptop gets stolen or damaged, and developers only have to know one place to go to learn about their colleagues’ work. The same applies to build/CI servers like Jenkins, which monitor and pull the latest source code from the blessed repository as well. Consequently, most remote Git operations go to a central Git server, creating considerable load and a need for replication solutions.
2. Most Git operations are offline, why can’t a single Git server easily handle the few that are not?
Remote Git update operations (git clone, pull, fetch) are very CPU intensive. A single fetch operation already consumes 0.5 CPUs due to the strong encryption and compression used by Git’s protocols. Vertical scaling soon reaches physical and economic limits. A small number of users and build systems that fetch in parallel can already exhaust the CPU capacity of a single server or completely saturate its available network bandwidth. If users and build systems are located in multiple geographies, the problems get worse: network latencies and outages degrade fetch performance and reliability, frustrating users and breaking builds.
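To get a feeling for how quickly a single server runs out of CPU, here is a small back-of-the-envelope sketch in Python. It only uses the 0.5 CPUs per fetch figure quoted above; the function names and the example core counts are assumptions for illustration, and real capacity also depends on network bandwidth, repository size, and caching.

```python
import math

# CPU share consumed by one fetch, as quoted in the text above.
CPU_PER_FETCH = 0.5


def max_parallel_fetches(cores: int, cpu_per_fetch: float = CPU_PER_FETCH) -> int:
    """Fetches a server can run at once before its CPUs are saturated."""
    return math.floor(cores / cpu_per_fetch)


def servers_needed(parallel_fetches: int, cores_per_server: int,
                   cpu_per_fetch: float = CPU_PER_FETCH) -> int:
    """Servers required to handle a given number of simultaneous fetches."""
    return math.ceil(parallel_fetches * cpu_per_fetch / cores_per_server)


# Hypothetical example: an 8-core server and 100 clients fetching at once.
print(max_parallel_fetches(8))   # 16
print(servers_needed(100, 8))    # 7
```

In this rough model, an 8-core machine is saturated by 16 simultaneous fetches, a number that a busy CI farm can reach on its own.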
3. How can Git Replication help me out here?
Git replication helps by setting up local and remote mirrors (replica servers) that provide up-to-date copies of the source code stored in the central repositories. Local mirrors (same data center) help to deal with the CPU and network load caused by build systems and many parallel update operations: just have those build systems fetch from the mirror instead of the central repository.
Remote mirrors (different data centers) decrease network latency and hence improve fetch performance and reliability for remote users. Those remote developers would get the latest source code from a mirror that is located close by and hence would not suffer from poor network connectivity to the central server.
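With plain Git, one way to make an existing clone read from a nearby mirror while still pushing to the central repository is to give the remote separate fetch and push URLs. The following Python sketch builds the two standard `git remote set-url` commands for that. The function names and the example URLs are hypothetical, and whether pushes should go to the central server or to the mirror depends on the replication product you use.

```python
import subprocess


def mirror_remote_commands(mirror_url: str, central_url: str,
                           remote: str = "origin") -> list[list[str]]:
    """Build the git commands that fetch from a mirror and push to central."""
    return [
        # Fetches, pulls and clones of this remote now hit the mirror.
        ["git", "remote", "set-url", remote, mirror_url],
        # Pushes keep going to the central, blessed repository.
        ["git", "remote", "set-url", "--push", remote, central_url],
    ]


def apply_to_clone(repo_dir: str, mirror_url: str, central_url: str) -> None:
    """Run the commands inside an existing local clone (requires git)."""
    for cmd in mirror_remote_commands(mirror_url, central_url):
        subprocess.run(cmd, cwd=repo_dir, check=True)
```

A build agent in Europe could, for example, call `apply_to_clone("/builds/project", "https://git-eu.example.com/project.git", "https://git.example.com/project.git")` once after its initial clone.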
Whenever a new commit gets pushed to the central repository, all local and remote mirrors will be automatically updated by the underlying replication mechanism.
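To make the idea of automatically updated mirrors more concrete, here is a minimal Python sketch of a fan-out that a hook on the central server could run after each push. It is not how any particular product implements replication. It only illustrates the general pattern with the standard `git push --mirror` command. The function name and the `run` parameter are assumptions made for illustration.

```python
import subprocess
from typing import Callable, Iterable


def push_to_mirrors(repo_dir: str, mirror_urls: Iterable[str],
                    run: Callable = subprocess.run) -> dict[str, bool]:
    """Push all refs of the central repository to every mirror.

    Returns a map of mirror URL to success flag, so that mirrors which
    were unreachable can be retried later instead of blocking the others.
    """
    results: dict[str, bool] = {}
    for url in mirror_urls:
        # --mirror pushes every ref and deletes refs removed on the source.
        completed = run(["git", "push", "--mirror", url], cwd=repo_dir)
        results[url] = completed.returncode == 0
    return results
```

A caller would pass the path of the central bare repository and the list of mirror URLs, then queue every mirror whose flag is `False` for a later retry. A production-grade solution additionally has to replicate permissions and cope with long network outages, which this sketch does not attempt.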
4. Do you have anything that visualizes how Git replication works?
Absolutely, just check out this animated presentation.
5. Is there anything to watch out for before deciding on a concrete solution?
There are many vendors out there that provide Git solutions that can handle huge amounts of load by providing multiple servers. However, most of them require those servers to be hosted in the same data center or at least to be connected to each other by a very fast, reliable network. Also check whether any subcomponents like database servers need to be in the same data center (this applies to both read and write access). Those kinds of setups will only help if your users are within the same region as this data center and have great network connectivity. If you have developers in remote locations, make sure that the solution you go with deals well with servers in regions with bad, unreliable network connections.
Another pitfall is repository permissions and user management. Some solutions only replicate repository content and leave permission and user management to the admins. This approach does not scale well if you have dozens of replica servers, creating a maintenance and compliance nightmare. Obviously, permissions to repositories should be replicated as well. If a user leaves the company or their access to a repository changes, all replica servers have to reflect those changes immediately.
Last but not least, replication servers should still serve their users if they temporarily lose network connectivity to the main server, ensuring business continuity in tough environments.
6. Does CollabNet provide an Enterprise Git Replication solution?
Starting with TeamForge 8.1, CollabNet provides Enterprise Git replication with as many replica servers as you like. It allows enterprises to set up local and remote Git mirrors, reducing server load and improving fetch performance for developers and build systems across the globe. Repository permissions and user accounts are replicated along with repository data, protecting assets on every mirror server. We optimized our solution for environments that have slow, unreliable network connections between data centers. As TeamForge uses the same role and permission model for all its assets, code hosted in Subversion (replica) servers is protected as well.
7. What does replication look like for an end user?
Once a Git replication server has been set up, it automatically registers itself with its master and shows up in the list of available servers. The screenshot below shows a local mirror that handles the additional load of build/CI servers and one in a remote location.
Parameters like replica server title and description can be changed directly in TeamForge’s web interface. The following screenshot shows a filterable and sortable list of all replication operations and their status that happened in a given timeframe.
If you would like to replicate a repository to one of those mirrors, all you have to do is check a box in the edit repository dialog:
In the next screenshot you can see the replication process in progress. Status indicators tell all developers whether a repository is currently in synch.
Also note the protocol selector which enables you to clone the repository using different protocols. Developers would just select the mirror of their choice and use the clone command displayed for it. If developers are using Eclipse or Visual Studio, they can select their favorite mirror server directly from their IDE (see next section for an example).
If you are interested in the details of an ongoing synch operation, click on the status icon and you will be directed to this screen:
And those are basically all the screens related to replication. In addition, we provide command line tools for administrators to monitor and control ongoing replication requests. The rest is all handled automatically behind the scenes.
8. How was your solution tested?
We wanted to make absolutely sure that whatever Git replication solution we came up with would also work in tough environments with flaky network conditions and hundreds of repositories. For this reason we set up a performance lab with replication servers on multiple continents and the entire Android Open Source Project (700+ repositories, 40 GB+ source code). Furthermore, we configured our firewalls to arbitrarily drop IP packets and temporarily block connections. We are very proud that we managed to design a solution that succeeded in replicating all repositories to all mirrors without any errors. Obviously, a performance lab can never fully simulate reality, so we are looking forward to your feedback. Eating our own dog food, we have already set up replication servers on multiple continents in our internal production environment and gained a performance advantage (while doing git clone) of up to 20x. The next screenshot shows how we clone from our European Git replica server using GitEye:
We also learned from our experience doing large-scale Subversion replication for years. We have customers with 50+ Subversion replication servers and analyzed their central server load patterns. The findings influenced our decision to go with a mostly push-based approach for Git replication. As a result, the web services of TeamForge stayed pretty responsive even in a heavily used production environment:
9. How many Git replication servers do I need and how do I fine-tune them?
You should have at least one Git replication server per geographic region that has a suboptimal network connection to your main server, or wherever you see the network bandwidth already saturated. Furthermore, you may stand up colocated mirrors if server load is the main challenge and vertical scaling is no longer economically feasible. We have prepared a Git performance tuning cheat sheet for you that covers all relevant factors for Git server performance: hardware sizing, software tuning parameters, and strategies for dealing with heavy CI use cases. It applies to Git master servers as well as their mirrors.
If you would like to learn more about the rationale behind this cheat sheet or have any questions, please leave a comment on this blog post.
10. How can I install your solution?
If any questions arise, add a comment to this blog post or ask our support.