I’ve just released version 1.0 of some new software called slic, which I’ve been using to do license analysis on Firefox OS. From the README:
This is slic, the Speedy LIcense Checker. It scans a codebase and identifies the license of each file. It can also optionally extract the license text and a list of copyright holders – this is very useful for meeting the obligations of licenses which require reproduction of license text. It outputs data in a JSON structure (by default) which can then be further processed by other tools.
It runs at about 120 files per second on a single core of a 3 GHz Intel i7 (it’s CPU-bound, at least on a machine with an SSD), so you can do 100,000 files in less than 15 minutes. Parallel invocation is left as an exercise for the reader, but could easily be bolted on the side by dividing up the inputs.
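As a rough illustration of the "dividing up the inputs" idea, here is a minimal sketch, not part of slic itself. The names `split_inputs`, `scan_parallel` and `estimate_minutes` are hypothetical, and `scan_chunk` stands in for whatever function (or wrapper around a slic invocation) scans one list of files; slic's own interface is not assumed here.

```python
from concurrent.futures import ProcessPoolExecutor


def estimate_minutes(n_files, files_per_second=120.0):
    """Single-core scan time in minutes at the quoted ~120 files/second."""
    return n_files / files_per_second / 60.0


def split_inputs(paths, n_chunks):
    """Divide paths into at most n_chunks contiguous, roughly equal chunks."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    size, extra = divmod(len(paths), n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when there are few inputs
            chunks.append(list(paths[start:end]))
        start = end
    return chunks


def scan_parallel(paths, scan_chunk, workers=4):
    """Run scan_chunk (hypothetical; must be a picklable top-level function
    that takes a list of paths and returns a list of results) on each chunk
    in a separate process, then flatten the results in input order."""
    chunks = split_inputs(paths, workers)
    results = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for chunk_result in pool.map(scan_chunk, chunks):
            results.extend(chunk_result)
    return results
```

Since the work is CPU-bound, separate processes (rather than threads) are the natural fit in Python. At the quoted rate, `estimate_minutes(100_000)` comes out at just under 14 minutes on one core, which matches the "less than 15 minutes" figure.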
The code is Python, and it uses a multi-stage regexp-based classifier: for families of licenses, it starts with a generic classification and then refines it by checking the various sub-possibilities. Potential future enhancements include a hash-based cache to avoid doing the same work more than once, and integration with a popular online spreadsheet tool to help manage exceptions and manual license determinations.