Code Provenance in GitHub

Mozilla is making more use of GitHub. One of the things legal-minded Mozillians care about when people are putting together Mozilla code is legal provenance – where the code came from, and whether the people contributing it have sufficient rights to do so. GitHub is designed to allow code to be pulled in, mixed up and exchanged between many different people. Leaving aside for the moment how well provenance tracking works in the current system, let’s posit a few scenarios:

  • A Mozilla member wants to prove, for CV purposes, that they wrote the original version of an interesting algorithm now used in Firefox
  • A suggestion is made that some code we are shipping is part of a proprietary codebase, and the legal team want to investigate
  • We are using some code under licence X, but the known upstream is (now, at least) under licence Y, which is incompatible with our licences, and the project contacts us and tells us to stop

How would you deal with these sorts of issues in a GitHub-based project? How do you track provenance?

6 thoughts on “Code Provenance in GitHub”

  1. Mostly the answer is, “the same as for any open-source project that accepts public contributions,” except that Git and GitHub provide several tools to make the job easier than in some other systems like CVS and Subversion.

    For example, Git can record both the name of the person who wrote a piece of code and the person who committed it to a repository (if they are different people). GitHub does this by default for pull requests. It also records when a commit was merged from one repository to another, and by whom. GitHub also shows who is maintaining known forks of the code, and when those forks diverged or merged. (Contrast this to CVS or SVN, where the only name attached to changeset metadata is that of the committer to the main repository, and there are no links or metadata preservation between repositories.)

  2. What also somewhat concerns me – is GitHub itself open source? Do we know what that site is doing to our code? Can we replicate or fork the experience of the site if we want to?

  3. kairo: No, it’s not open source, although there are open source clones and competitors (e.g. Gitorious). However, the social side is hard to clone. If all we do is store source code in it, we can always get our data out in full. I don’t know what the export is like for other data such as discussions, reviews and tickets.

  4. kairo: The site can’t really do anything to the code; git relies on using hashes of the actual content plus the parent commits, so it is difficult to maliciously alter the repo and not get caught.

    In terms of the experience of the site, remember that the alternative is hg.mozilla.org, which is entirely open source but which doesn’t even syntax-highlight code and has been stagnant for basically all of its existence. (The pushlog stuff is custom.)

  5. As long as people stick to just doing pulls or straight rebases/cherry-picks of commits from other repos, the original authorship information will remain intact, so there’s nothing to worry about. Each commit’s author will be recorded in any repository where their code is pulled. Of course, it would always be possible to copy-paste code instead of pulling the commits and then not mention the source in your commit message, but that is no more true of Git than of any other VCS.

    Put another way: DVCSes offer you lots more ways to mix together different people’s commits, but they all leave clear trails of who did what. If anything, it will reduce the incentive to copy-paste code and make things easier to track. E.g., if you commit a patch on someone’s behalf, you can specify them as the author, then commit any changes by you separately as a follow-up under your name.

    Robert: You can’t fork the site experience so easily, but you know it’s not doing anything malicious to your code. Every git commit (I think Hg is similar) is identified by a SHA1 hash of the commit contents concatenated with metadata that includes the hash of any parent commits. If the site alters anything in the history, it will change all the hashes after that point, and when you do a pull, your local git copy will complain that the histories don’t match. Of course, they could append extra commits to the end if they liked, or not let you make changes, but that’s pretty easy to spot.

  6. Surprisingly you are not the only one in this game. For example, Gnome is running all code on git.gnome.org and they are working on making bugzilla+self-hosted git better.

    Also, on another note, I found Gitorious surprisingly similar to and competitive with GitHub (except that there are far fewer people there). Of course, you could have your own instance of Gitorious hosted on git.mozilla.org if you so choose.
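
Several comments above rely on two properties of Git: a commit records its author and its committer separately, and a commit's ID is a SHA-1 hash over its contents, including the IDs of its parent commits. The Python sketch below illustrates both under simplifying assumptions. The names, email addresses, timestamps and messages are hypothetical, and real commits may carry extra headers (such as signatures) that are left out here. It is meant as an illustration of the hashing scheme, not a replacement for Git's own tooling.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Hash an object the way Git does: SHA-1 over '<type> <size>\\0' + body."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree, parents, author, committer, message) -> bytes:
    """Build a simplified commit body (no signature or other extra headers)."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines.append(f"author {author}")        # who wrote the change
    lines.append(f"committer {committer}")  # who applied it to the repository
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


# The ID of an empty tree; used so the example needs no real files.
EMPTY_TREE = git_object_id("tree", b"")

# Hypothetical identities: Alice wrote the patch, Bob committed it for her.
alice = "Alice Example <alice@example.org> 1300000000 +0000"
bob = "Bob Example <bob@example.org> 1300000600 +0000"

root_id = git_object_id(
    "commit", commit_body(EMPTY_TREE, [], alice, bob, "Add algorithm")
)
child_id = git_object_id(
    "commit", commit_body(EMPTY_TREE, [root_id], bob, bob, "Follow-up fix")
)

# Rewriting the authorship of the first commit changes its ID ...
tampered_root_id = git_object_id(
    "commit", commit_body(EMPTY_TREE, [], bob, bob, "Add algorithm")
)
# ... and any descendant rebuilt on top of it gets a different ID as well.
tampered_child_id = git_object_id(
    "commit", commit_body(EMPTY_TREE, [tampered_root_id], bob, bob, "Follow-up fix")
)

print("original root :", root_id)
print("tampered root :", tampered_root_id)
print("original child:", child_id)
print("tampered child:", tampered_child_id)
```

Because the child's hash covers the parent's ID, a change to the authorship of an early commit propagates to every commit after it, which is why anyone holding an existing clone would notice a mismatch on their next pull.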