Contribution patterns in open source
A month or so ago, Michael Ogawa published some fascinating and beautiful "home movies" on the commit patterns in various important Open Source projects. He also open-sourced the program that creates these records, and so one of the Subversion committers took the time to generate a similar home movie for the Subversion project. After spending way more time than seems reasonable watching them all, I think I begin to see, in the movies, some traces of the respective cultures and their processes. Thought you might like to skip forward a few viewings with these hints.
As you work through these, you might want to open the movies in a separate window, so you can keep the commentary side by side with the actual video.
You should start with the Python movie. There are several reasons for that. First, the pattern you see here is the classic, ur-open-source pattern: one hacker scratching itches somehow breaks through to popularity. Second, that long "lone wolf" preamble is easier to watch, and you’ll need the practice before we’re done! The movie’s 4:42 long; here are some notes, keyed to their times:
- 0:24: "guido" is Guido van Rossum, the "lone wolf" in question. You’ll see a lot of him.
- 0:44: The reason he seems to be dancing around is that he focuses on particular groups of files–particular functions–shifting from time to time. You can roughly think to yourself, each time he wanders to a new spot, "there’s one more important feature complete."
- 0:51: Some helpers appear, sjoerd and jack (jansen). They’ll both take a while to become major contributors, but remain important throughout the movie. What this really signals is Guido opening his trust to outsiders for the first time, a key moment.
- 1:27: More names float in. Again, though they take a while to gain prominence, these are names that will stay with us throughout the project. At this time as well, jack and guido are consistently working the same group of files–their names are superimposed, the circle of files unites them. jack has become nearly as significant to the project as guido himself.
- 1:40: "jack" and "guido" divide the work and specialize: guido now trusts jack to do work guido doesn’t even review and improve.
- 2:00: Many more names appear, drifting in and out: guido has found ways to trust many contributors.
- 2:30: About half-way through the movie, several clear specialists have emerged, freeing guido to concentrate on the core.
- 2:41: A contributor named "gstein" appears (remember that name).
- 3:00: New contributors come thick and fast, contributing to the core (clustering around guido), and also to the satellite domains of specialization: the specialists have begun to recruit their own helpers and off-loaders. By this time, you can hardly spot any particular "key" contributor, and the python product itself is, at v2.0.0, a firmly established product.
- 4:14: rehosting / renaming of contributors
What have we seen?
- Initially, guido maintained control. It’s not that he was the only person interested, or even contributing, but all contributions came through him.
- As others gain his trust, they go through an apprenticeship phase, where he reviews, and often reworks, their contributions. Soon, though, they specialize in some part of the project, and guido is freed to return to his primary focus. "Promotion" (commit rights) comes through merit; even commit rights don’t instantly grant uncontrolled rights, but eventually the productivity of the team grows.
- Specialization is efficient and productive, but only after trust is earned.
- Success in implementation leads to market success; market success leads both to more demands and more hands: synergy.
The next one to view is probably PostgreSQL. To understand this video, you should know that this advanced database arose from a long period of academic research and development. You may notice that the first product version you see is actually 6.0, giving you some sense of this. The history you actually watch here is the open-source transformation of this academic code base into a production product. The initial contributors are mostly students (and former students) of the original designer, Michael Stonebraker. The list of contributors grows over time, but is far more stable than we saw in the Python case. The most obvious specialization we see is Thomas Lockhart’s work on the documentation; nearly all the other contributors stay centered on the screen, jointly committing to all the same files … basically, all the files of the project. This reflects several things, including the tightly intertwined nature of a relational database (virtually every component is involved in virtually every transaction).
What have we seen?
- You might note that this video has none of the names from the first one.
- All in all, this history looks very much like the canonical view of in-house, enterprise, closed-source software. This really is an open-source project, don’t misunderstand, but a variety of factors gives it a very un-open-source pattern. (Key factors would be the long academic build-up, and the inherent complexity of RDBMS work.)
Let’s look at the Eclipse movie next.
- 0:30: Wow! Thirty seconds in, and we already seem to have more names and circles and commits than the entire history of the first two projects, combined! The new files fall thickly and continuously, like rainfall, or mist, or star-stuff into a black hole.
What have we seen?
- Of course, the key technological factor here is the Eclipse Rich Client Platform. What IBM initially bequeathed to the open-source world was not primarily a product — though it took the form of an IDE — but rather a game-changing framework for creating extensions. All the circles of specialization you see are individual extensions, and the RCP allows them to be extraordinarily independent of each other, in just exactly the way that the many components of the PostgreSQL RDBMS aren’t.
Let’s now look at the Apache HTTPD movie. This is actually the history of Apache 2.0. The 1.x httpd was already popular, but the Apache community had gradually built up quite a list of major changes they wanted. Nearly everything about a web server is in some sense public: the interpretation of the standard protocol HTTP, the plug-in APIs, and so forth. Even many internal facilities, such as the Apache Portable Runtime Library, had become popular separable components, and had to be redesigned with respect for current users. As a result, this project begins with an intensive documentation phase, where the specifications for all these public interfaces were hammered out.
- 0:20: November 1996: activity begins
- 1:09: June 1998: Apache group meets to discuss 2.0. So, what had they been doing before then? They’d been doing 2.0! In Open Source, it’s not necessary to meet face-to-face to get work done–though this meeting really kicks off the activity.
- 1:30: During this period, individual contributors are providing basic specifications for features they care about. Others comment, within the spec documents. (It’s been said that, during this period, the community was "arguing within the specs.")
- 1:45: August 1998: code begins to appear, and contributors come thick and fast
- 2:00: areas of specialization begin to appear. "document" contributions are becoming end-user/administrator documents, rather than the original design specifications.
Watch for these names:
- 2:15: Wilfredo Sanchez
- 2:30: Ben Collins-Sussman
- 2:33: Karl Fogel
- 2:35: b. w. fitzpatrick
- 2:40: justin erenkrantz
- 2:42: sander striker
- 3:24: joe orton
- 4:14: greg stein (remember him from Python?)
What have we seen?
Again, as with Eclipse, this video begins with a foundation for extensibility. In this case, the focus was on documentation rather than code, but the effect is the same: when the coding begins, the flood-gates open wide. We see relatively little modularity of practice (though it does exist in the code, and the practice does begin to develop, near the end). And there were those names ….
Turning to the Subversion movie:
Notice how each new release brings in more committers: synergy, once again.
- 2:08: Tight cluster (cmpilato, sussman, kfogel) in core, hints of specialization elsewhere (brane for Windows, yoshiki for Ruby)
- 6:30: as 1.5.0 takes longer than planned, branches form for features for 1.6 and beyond
Watch for those Apache names. Karl Fogel would be hard to miss, here at the outset!
- 1:05: the "sussman" who arrives here is Ben Collins-Sussman, mentioned above
- 1:15: gstein (greg stein)
- 1:23: joe (orton)
- 1:25: fitz (b. w. fitzpatrick)
- 2:20: striker (sander striker)
- 2:22: jerenkrantz (justin erenkrantz)
- 3:35: jrepenning (me!)
- 4:40: markphip (mark phippard, popular Submerged blogger)
- 5:38: wsanchez
- 5:38: mhagger, now leader of split-off project cvs2svn
What have we seen?
- A substantial number of committers, especially the early committers, came from the Apache project (including the "serial open sourcerer," Greg Stein).
- A highly trusting, highly interactive work pattern appeared at the very outset, no doubt in part because so many contributors had worked together before, at Apache.
- As was evident in Python and Apache, popularity in the marketplace breeds more committers and more progress, producing explosive growth, and justifying the open source mantra "release early, release often."