The Dictionary Dilemma

The Mozilla Foundation has a policy of only including software in our CVS repository, and in builds we distribute, which is available under our MPL/LGPL/GPL tri-licence or a compatible licence (e.g. the BSD licence or an MPL/LGPL dual licence). The reason for this policy is to present a simple story to people who want to use or distribute our code. We want to be able to say: “Happy with the MPL? OK – use what you like.”

Many localisation teams come to us asking if they can include a dictionary with their localisation of Thunderbird. However, a lot of available free software dictionaries, including many of those used by the OpenOffice project, are available under the GPL or LGPL alone.

[Sidenote: why on earth do people use the LGPL for data? It doesn’t make any sense. For a start, a straight reading prohibits modification of the data. “The modified work must itself be a software library”, section 2a). They should have said “Library”, not “software library”.]

What can be done? Compiling a dictionary is a lengthy process, and it’s work that no-one wants to have to repeat.

So here’s an idea. You create a bit of code which checks text against the dictionary. This code is covered by the same licence as the dictionary. You then feed it a large number of documents in the relevant language. Words which fail the spell check are discarded; words which pass are added to an output file. Eventually, the output file is a new dictionary which you created, and which you can license however you choose. After all, one can’t claim that the licence of spell-checking software infects the documents it spell-checks.
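To make the idea concrete, here is a minimal sketch in Python of the filtering step described above. It is only an illustration under stated assumptions: `harvest_words` and `is_known_word` are hypothetical names, the tokenising regular expression is a simplification that would not suit every language, and the tiny `reference` set merely stands in for a real spell checker backed by the existing dictionary.

```python
import re

# Simplified tokeniser (an assumption): runs of letters, optionally joined
# by apostrophes. Real languages need more careful word segmentation.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def harvest_words(documents, is_known_word):
    """Collect every distinct word in `documents` that passes the spell check.

    `is_known_word` is a hypothetical callable standing in for the
    spell checker that wraps the existing dictionary.
    """
    accepted = set()
    for text in documents:
        for token in WORD_RE.findall(text):
            word = token.lower()
            if is_known_word(word):   # passes the check: keep it
                accepted.add(word)
            # words that fail the check are simply discarded
    return sorted(accepted)


# Stand-in for the existing dictionary, purely for demonstration.
reference = {"the", "cat", "sat", "on", "mat"}
docs = ["The cat sat on the mat.", "Teh cat szat on the mat"]

wordlist = harvest_words(docs, reference.__contains__)
print(wordlist)  # ['cat', 'mat', 'on', 'sat', 'the']
```

Note that the output can never contain a word that is missing from the reference dictionary, so the result is at best a subset of it. That bears directly on the legal and moral questions below.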

The questions are: Is it technically possible? If so, does it produce a useful dictionary? If so, is it legal? If so, is it moral?

18 thoughts on “The Dictionary Dilemma”

  1. “A large number of documents” might usefully be called “Wikipedia” for this purpose. Why not also use several dictionaries at once, in parallel? This makes it easier to argue that the dictionaries are being used as reference sources, not as a basis for a derived work.

  2. Surely a better starting point would be one of the out-of-copyright dictionaries in Gutenberg?

  3. Well I think you have yourself argued against it in another context (if memory serves me right)

    The other scenario.
    * We have a file with key/value pairs
    * We have a program that assigns new values to the keys found in said file.
    * The new key value pair is saved in a new file.

    The difference here is roughly that my scenario reuses less than half of the original file, while yours reuses nearly 100%.

    You (again, if memory serves me right) concluded that, given the approach in my scenario:

    * the new file is a derived file, and must be under the same licences as the original
    * authors of the old file must be added as contributors to the new file, since they “invented” the keys

    Either there is something fundamentally different about the two scenarios that I am missing, or it is rather funny how you changed your standpoint now that you (or Mozilla) have something to gain.

  4. FYI, itself can’t distribute these dictionaries in its localized releases for the very same reason. They had to create an automated tool (in the form of a macro) that downloads the dictionaries and installs them after the fact. Other software is installed this way: think of the Microsoft Core Fonts, which are required to be distributed in their .exe form. Tools have been made that download the executables, extract the fonts and install them on a Linux system.

    Couldn’t Thunderbird do the same thing and ‘suggest’ downloading and installing automatically the required dictionary for a given language, in the form of a separate (maybe GPL’ed) extension, using its own update system?

  5. Compiling the dictionary in the manner that you suggest probably isn’t possible; that still sounds like a derived work.

    Using LGPL or GPL for a dictionary sounds quite mad: for example, in the UK, electronic dictionaries would have an additional database right that would subsist. Those licences do not address that right, and thus the author of the dictionary hasn’t licensed those rights. Database right is much like copyright, but extends differently (for example, there is no fair dealing/fair use exception for commercial usage). And it only lasts 15 years. (Incidentally, your compilation technique probably infringes the database right.)

    So, you might want to examine whether or not the trilicence scenario is sufficient in and of itself. Database right is somewhat common in the EU, and somewhat less common (but becoming more so) within WTO members.

  6. Henrik: that’s an interesting point.

    In the case you mention, someone is translating the UI of a Mozilla application. Translations are held to be derivative works of the original in copyright law in every jurisdiction I know of. They are also Modifications (in the MPL sense) because you started with an original MPLed file. So, whichever of our licenses you use, translations are definitely covered.

    So if it turns out that my suggestion is the same as doing a translation, then it will mean that my suggestion won’t work legally, rather than that translations are not subject to the MPL.

    So are they the same? I don’t think so. A dictionary is designed to be used to look up words. So we are using the dictionary for its primary purpose in spell-checking all these documents and removing misspelled words. We then have left some documents which are quite similar to the original. Are you arguing that these documents are subject to the licence of the spell-checker?

    If, at some later point, we then go and run sort and uniq over those documents, that’s nothing to do with the spell checker or the dictionary.

  7. FFS!

    Stop looking for a technical solution. Have you tried ASKING these people? I’m sure they would be more than happy for their dictionaries to be used by Mozilla.

  8. kijella: in some cases, possibly yes – although normally l10n teams try this before coming and asking me to do something about it. Often, a dictionary is based on a previous dictionary, and so on, back a long way – and all the contributors aren’t even known, and/or can’t be contacted.

  9. Your idea is not technically feasible for many languages, those with complex affix rules or productive compounding. For these languages, the dictionary is the easiest part of the spellchecker resource. The hard part is the affix table and the rule-set for compounding. You will not get those with automatic methods.

    For example, in Hungarian any single word stem can be the basis for thousands of different word forms. Below are the most common of the hundreds of word forms with the stem “anya” (“mother”), from a large Hungarian webcorpus:

    anya anyja anyám anyák anyját anyám anyát anya anyának anyád anyjának anya anyja anyjával anyámat anyámnak anyánk anyákat anyádat anyáknak anyjától anyával anyja anyjához anyám anyától anyámmal anyák anyára anyád anyjára anyjukat anyaként anyámtól anyához anyánk anyává anyák anyádnak anyáról anyjának anyjukkal anyámhoz anyjuktól anyákkal anyjáról anyánál anyád anyámnak anyátok anyának anyámra anyát anyákra anyjába anyánkat anyámék anyáknál anyáddal anyánk anyában anyjánál anyáink anyját anyámat anyjuknak anyán anyáé anyából anya anyámnál anyával anyjából anyádba anyáktól anyákétól anyámék anyját anyánál anyámmal anyánknak anyjává anyámról anyátokat anyának anyákról anyjáé

  10. I was involved in the construction of a translation dictionary some time ago. This is clearly a domain where copyright issues are both really important and hard to grasp: don’t words belong to everybody?

    In its simplest form, the copyright in a dictionary covers the set of words that it includes, and a decision about infringement will be based on the level of similarity between the two dictionaries. It’s not the method you used that counts, it’s the end result.

    At the end of the day, if you want to be really clean, you have to bite the bullet and construct a new dictionary from scratch, documenting all the sources precisely. If you only need a list of words, the method you suggest seems usable. Use edited content as input (online newspapers, magazines, public domain literature) and select all the unique words in it. You can check by hand whether the words that appear only once or twice are bogus, or just remove them automatically.

    Content whose origin has not been clearly checked initially ends up being a world of pain; for a good example, see the Watanabe font problem in Debian. And they were lucky that the copyright owner didn’t decide to go to court.

  11. Gerv, I have asked Ricardo Ueda Karpischek, br.ispell coordinator and main contributor, what we can do to solve the license conflict.

    Currently the “source” files are GPL (c)Contributors. Some scripts parse them and “compile” the files Thunderbird needs.

    Ricardo wants to ask contributors to transfer the copyright to him. So, he could keep the “source” with GPL and release the “compiled” files as public domain.

    What do you think?

  12. I don’t think that this would be legal. But it would give “dictionary attack” a whole new meaning. :) With today’s computers it isn’t infeasible to check billions of words against a dictionary to build a new one. It would be just as easy to check all serially generated bit strings against a Harry Potter book and “rebuild” the whole book word for word. It wouldn’t be any more legal than just copying. Rule of thumb: if you’re trying to get around the intent of the law, it will almost never stand up in court.

    But what about an extra server which only delivers non-MPL’d software? Mozilla, Firefox and Thunderbird could then download those dictionaries from there.

  13. The United Nations publishes all of its official documents in each of its six official languages: Arabic, Chinese, English, French, Russian and Spanish.

    A quick review seems to suggest that these have a reasonable level of copyediting. So, that’s an enormous corpus for generating six substantial wordlists, for a start.

  14. Wordlists are easy, by comparison. As Daniel points out, dictionaries tend to have suffix and affix rules, and possibly hyphenation rules as well. While it might be possible to automatically convert a wordlist to a dictionary, it’s probably not easy (for example: unless you can be sure you have all possible words, you can’t tell which suffix rules might be appropriate).

    I also suspect it would be considered a derived work (because of why you’re doing it, not how).


  15. “Ricardo wants to ask contributors to transfer the copyright to him. So, he could keep the ‘source’ with GPL and release the ‘compiled’ files as public domain.

    What do you think?”

    Those files wouldn’t be in the preferred form for modification. This might get around the letter of the licence, but would be really incompatible with the spirit. (But then, the same accusation might be levelled at my plan.)

    However, reading what people have written, it looks like this probably wouldn’t work technically. Ah, well. No short cuts, then…

  16. Oh, one other thing I thought was worth mentioning: when considering whether work B is a derivative work of work A, courts don’t generally care how you got from A to B, just the end result.

    [This is partly why SCO’s “we need access to IBM’s AIX SCCS” argument is bogus. I can take a copy of any copyrighted work (GPL, proprietary, or whatever) and progressively change it by small increments until it no longer resembles the original. Despite the fact that every step along the way is a clear derivative of step N-1, eventually the code will reach a point at which it is not legally a derived work of the original. Provided that I don’t distribute the intermediate works, I’m in the clear.]

  17. Malcolm: I think you need to consider “clean-room” implementations, coming at the problem from the other direction to your step-wise removing-derivativeness.

    IANAL, but my gut instinct is that legally, this could work. Morally, it’s grey, at the very least. And technically, unless you’re happy with simple wordlists, it’s somewhere between “definitely non-trivial” and “a big step back”.

    Gerv: I think, as a first step (if something better can’t be arranged in time for TB 1.1) that MoFo need to host 3rd party dictionaries somewhere, and ensure that latest versions of the dictionaries are available from there (MLP could help to publicise the right place for the dictionaries among the l10n community, but we’d need to know where the right place is ;-) NB, exists, which may be a good starting point (given that mozdev != MoFo).
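
For readers who want to try the corpus-only approach suggested in comment 10 (collect the unique words from edited text and drop the ones that appear only once or twice), the following Python sketch shows one way it might look. It uses no existing dictionary at all. The function name `frequent_words`, the `min_count` threshold of 3 and the simple tokeniser are illustrative assumptions, not anything prescribed in the thread, and the output is a plain wordlist with none of the affix or compounding rules discussed in comments 9 and 14.

```python
import re
from collections import Counter

# Simplified tokeniser (an assumption): runs of Unicode letters only.
WORD_RE = re.compile(r"[^\W\d_]+")


def frequent_words(texts, min_count=3):
    """Return the words seen at least `min_count` times across `texts`.

    Rare words (seen once or twice with the default threshold) are
    dropped automatically, on the assumption that most are typos.
    """
    counts = Counter()
    for text in texts:
        counts.update(word.lower() for word in WORD_RE.findall(text))
    return sorted(word for word, n in counts.items() if n >= min_count)


sample = ["the cat and the dog", "the bird and teh cat", "and a cat"]
print(frequent_words(sample))  # ['and', 'cat', 'the']
```

The threshold trades coverage for cleanliness: a low value lets typos through, while a high value discards legitimate rare words, which is why the comment also suggests checking the rare words by hand.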