Link Fingerprints

I’ve noticed that some organisations, particularly Microsoft, have been releasing security updates where the download URL contains some form of checksum for the file. Here’s an example, which is the download for this security vulnerability in Windows 2000.

The page doesn’t explain which algorithm they use to generate the hash. It’s a 128-bit value, but it’s not straight MD5 (wrong values), or SHA1 (wrong length). It also doesn’t say why it’s there, or give you any way to use that information to verify the download, so I assume it’s designed for automated tools like Windows Update or similar. If anyone has more info, please comment. [Update: apparently, it’s just a GUID – basically a random number. So all the stuff below is my invention rather than being copied from anyone else.]

This is a great idea in principle, because it means that no-one can break into the server and trojan the file – if they do, then the checksum doesn’t match any more. And if they change the name of the directory it’s in to match the new checksum, all the published links break and users still don’t get the trojaned file. But it’s no use if the checksum isn’t checked.

So I started thinking. It must be fairly common that users download files where they are given the URL by a trusted source – or, at least, get it from another location or server which would also have to be compromised.

So what’s to stop any user agent verifying a downloaded file against its path-based checksum, or ‘Link Fingerprint’? It would work like this: if the user agent downloaded a file whose URL had a path component matching a particular pattern (perhaps /md5[0-9a-fA-F]{32}/, since an MD5 digest written in hex is 32 characters), the user agent would checksum the file, compare it to the URL, and put up a nasty scary “don’t trust this file” warning if they were different. I’m assuming it’s fairly unlikely that many files will have paths like that where the relevant path component isn’t an md5sum.
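As an illustration, a user agent’s check might look something like the following minimal Python sketch. The /md5…/ path convention, the regex, and the function name are all assumptions for illustration, not an implemented standard:

```python
import hashlib
import re
from urllib.parse import urlparse

# Hypothetical convention: a path component of "md5" + 32 hex digits.
FINGERPRINT_RE = re.compile(r"^md5([0-9a-fA-F]{32})$")

def check_link_fingerprint(url, file_bytes):
    """Return False only when a fingerprint component is present
    in the URL path and does not match the downloaded bytes."""
    for component in urlparse(url).path.split("/"):
        match = FINGERPRINT_RE.match(component)
        if match:
            actual = hashlib.md5(file_bytes).hexdigest()
            return actual == match.group(1).lower()
    return True  # no fingerprint component: nothing to verify
```

On a mismatch, the user agent would raise the “don’t trust this file” warning rather than silently saving the download.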

Yes, files may be digitally signed with the distributing company’s key, as most Linux distributions do with their RPMs, but verifying that often involves a lot of PKI hassle which is currently out of the reach of many users. The idea of this would be a simple way for a software distributor to reduce the likelihood of their download being successfully trojaned, with no need for the user to be made to perform verification steps.

There are implementation details, of course – you’d need to watch out for redirects, for a start – but is the basic idea useful, or are there too many potential problems with it? Or perhaps it’s solving an issue that isn’t a problem in practice?

24 thoughts on “Link Fingerprints”

  1. I find it disturbing when prominent developers encourage the use of MD5 or SHA1. Both are broken, MD5 considerably more than SHA1.

    Better to use Whirlpool or Tiger.

  2. To me, this sounds like a good idea. I see the following advantages:

    A) Not only do users not need any PKI stuff, but server admins also won’t need any special tools (e.g. signtool, special packaging and the like), and can protect any file format (even .txt) that otherwise wouldn’t support signing/certificates, because the file is never modified.

    B) As no additional files are required on the server (the classic .md5 ones, for example), no additional HTTP/FTP request is necessary. Thus, the browser needs no “guessing”, as it can easily detect the feature’s presence just from the URL (it might even highlight such links as “md5-secured” before they are clicked).

    Some further thoughts:

    1) For Firefox, I’d prefer this to be a built-in feature, not an extension. This way, even users who don’t care about security will benefit from it. I think the whole functionality won’t need much code, so Firefox would not be bloated.

    2) A formal specification should be put up somewhere, to encourage other vendors/developers to implement it (reduces their feeling of “we’re copying firefox’s behaviour here, no idea if that’s going to change”)

  3. [Anyway, the checksum in the MS download link just looks like a GUID. Perhaps, in this case, it isn’t a checksum at all? Doesn’t change anything about the idea, though]

  4. Good idea. It certainly would need a lot of testing though before it could be included in Firefox. Maybe require a special flag in .mozconfig at first until it’s bullet proof and then make it default?

  5. I do this to prevent evil people from using a phishing attack.

    Say my site:
    http://robertaccetturarules.com/deleteaccount/?id=2&delete=true

    you could technically craft a URL which, if the user was recently logged in, would put them pretty close to deleting their account.

    What I do is take their id, and md5 with a secret code stored on the server to generate a hash.

    So on the confirmation page it asks, do you wish to delete?

    http://robertaccetturarules.com/deleteaccount/?id=2&delete=true&hash=62bb62f5049118f7f4d4c299e3f4b260

    Now you can’t give a user a direct URL to delete an account/file/bit/whatever. They need to go through my confirmation dialog. You would need to know my secret code(s) to generate a good md5 to successfully do so.

    Also prohibits writing a script to do something on the site.
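    The keyed-hash scheme this comment describes can be sketched roughly as below. This is an illustration, not the commenter’s actual code: the secret and function names are made up, and a standard HMAC construction is used, since that is the conventional way to key a hash:

```python
import hashlib
import hmac

SECRET = b"server-side secret"  # hypothetical; known only to the server

def make_delete_url(user_id):
    # Tag the id with a keyed hash so only the server can mint valid links.
    tag = hmac.new(SECRET, str(user_id).encode(), hashlib.md5).hexdigest()
    return ("http://robertaccetturarules.com/deleteaccount/"
            f"?id={user_id}&delete=true&hash={tag}")

def verify_delete_request(user_id, tag):
    # Recompute the tag and compare in constant time.
    expected = hmac.new(SECRET, str(user_id).encode(), hashlib.md5).hexdigest()
    return hmac.compare_digest(expected, tag)
```

    A crafted URL without a valid hash parameter fails verification, so a bare link can’t trigger the deletion directly.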

  6. If it were decided to use bittorrent as a means of firefox distribution how might this thinking integrate?

  7. Gerv: the example you give looks like a UUID, which is either a random number or a number based on the MAC address of the machine generating it, and the time. That won’t help with verifying the contents of the file, but other than that the idea is an interesting one :)

  8. I believe that an HTTP header may be better for HTTP requests. “X-MD5-Sum:”, perhaps? I can see the other one working well if the server had some support for it, such that any file dir/file had an implicit presence in dir/md5$filemd5/file iff filemd5 == the file’s md5 sum. That’s not even much additional work on the server end — it can be done in 2 easy stages (assuming dir/md5$filemd5/file has been requested):
    1) (optional, in case the server can recognize these md5-included links) search for dir/md5$filemd5/file and send it if it is found
    2) search for dir/file and send it iff filemd5 matches the md5 sum of the file. (The sums would likely be computed in the background beforehand, or on first request, and entered into a lookup table for quick reference.)
    This setup allows people to keep their site organized however they like, and won’t require them to create a new (even if one-entry) directory hierarchy each time they add or modify a file, or remove a useless level of the hierarchy every time they modify or remove a file.

    Either of these solutions would be less unwieldy, I think, than creating this directory structure for each file (from a user/publisher’s perspective). Only problem is that each requires modifications to any server used.
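    The two-stage server-side lookup suggested in this comment might look like the following sketch (the lookup table and function names are invented for illustration):

```python
import hashlib
import re

# Stage 0 (background): path -> md5 hex digest lookup table.
md5_table = {}

def register(path, content):
    md5_table[path] = hashlib.md5(content).hexdigest()

COMPONENT_RE = re.compile(r"^md5([0-9a-fA-F]{32})$")

def resolve(request_path):
    """Map 'dir/md5<digest>/file' back to 'dir/file', serving it
    only if the digest matches the file's recorded md5 sum."""
    # Stage 1: the literal path exists; serve it as-is.
    if request_path in md5_table:
        return request_path
    # Stage 2: strip the fingerprint component and check the digest.
    parts = request_path.split("/")
    for i, part in enumerate(parts):
        match = COMPONENT_RE.match(part)
        if match:
            real = "/".join(parts[:i] + parts[i + 1:])
            if md5_table.get(real) == match.group(1).lower():
                return real
            return None  # fingerprint present but wrong: refuse to serve
    return None  # no such file
```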

  9. Call me boring, but this seems to be a fair bit of fuss to solve not much of a problem. This wouldn’t stop people from downloading questionable files – the vast majority of these come from web sites that probably don’t have much intention of providing MD5-directory links. It’s a nice idea, and I wonder how Microsoft plan to use it, but to include it in FF seems a pretty low priority.

    Anyway, Gerv, keep up the good work – I always await your contributions on planet.mozilla.org – great stuff, and enjoy your Easter!

    Tom

  10. Darin, ERIIX: you’ve both missed the point :-). The idea is not to provide a method so the browser can check that the file requested was successfully downloaded. As Darin says, Content-MD5 does that job fine.

    The idea is to make it so the browser can check that the file downloaded is the one the supplier meant when they gave them the original link – because the URL itself and the file contents are bound together. An HTTP header such as Content-MD5 doesn’t do that, because there’s no tying to the original URL.

    Tom Page: you’ve also missed the point. The idea is not to “stop people downloading questionable files” – as you say, that’s somewhat tricky. The idea is that sites can provide some protection against their files being trojaned if one of their servers is hacked.

    junfal: the actual hash algorithm used is fairly irrelevant, but your assertion that MD5 and SHA1 have been “broken” is somewhat simplistic. If they’ve been broken, please find me another meaningful message which hashes to the same MD5 as this comment.

  11. Look at the second part of my post, where I thought of a way to automate the link-based system you already discussed. Only the first sentence is about what I now know to be Content-MD5 (I hadn’t heard of that header before). The rest of the post was about a fairly simple patch for servers (HTTP, FTP, etc.) that would automate the confirmable link process. It wouldn’t help on the browser end, but I wanted to point out how easy adding that on the server side could be. That way, the server and client sides can work together without additional effort on the part of the one posting.

  12. I’ve been considering adding support for something along these lines to my extension. However, I’d planned to implement this feature as a handler for a new protocol (current working name is “wonton”) of the form:

    wonton://server.domain/filename.ext?alg=md5&digest=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

    where the algorithm parameter could be any commonly used cryptographic hash function. The user agent would strip off the parameters, temporarily store the digest/checksum, retrieve the file as a standard HTTP request, compute the digest upon download completion, and compare the two digest/checksums. In the case of Firefox, it might also store the reference checksum as an additional data point in the downloads.rdf file. Behavior in the event of an unsupported digest algorithm is user agent defined.

    The advantages as I see them:

    1. The scheme is explicit – no need for pattern matching or attempting to guess whether a URL requires automated checksum comparison.

    2. Doesn’t require the user to administer the server or alter the directory structure. This would enable the scheme to be used with public servers (e.g. ‘contrib’ directories, mirrors, etc.).

    Disadvantages:

    1. The scheme is explicit – thus not transparent to the user agent. I.e. a browser that doesn’t support the protocol would cough on a “wonton” URL. It would also therefore usually require a file originator to supply two links to the file, a standard URL and a “wonton” URL.
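    A user agent’s handling of such a “wonton” URL might look like the following sketch (“wonton” is the commenter’s working name; the parsing details and the supported-algorithm table here are illustrative assumptions):

```python
import hashlib
from urllib.parse import parse_qs, urlparse

SUPPORTED = {"md5": hashlib.md5, "sha1": hashlib.sha1}

def parse_wonton(url):
    """Split a wonton:// URL into the plain HTTP URL to fetch
    and the (algorithm, digest) pair to verify against."""
    parts = urlparse(url)
    params = parse_qs(parts.query)
    return (f"http://{parts.netloc}{parts.path}",
            params["alg"][0], params["digest"][0])

def verify(alg, digest, file_bytes):
    if alg not in SUPPORTED:
        return None  # unsupported algorithm: behaviour is UA-defined
    return SUPPORTED[alg](file_bytes).hexdigest() == digest.lower()
```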

  13. ERIIX: Sure, you could change the server code to make it easier for such links to work. After all, a URL doesn’t have to match an actual path on the server’s disk. This would allow people to create checkable URLs for any file on that server.

    Having the md5sum as a URL parameter might achieve the same thing – but perhaps having “magic” URL parameter names is a bit dangerous.

    Charles: as you say, the big problem with your scheme is that it’s in no way backwardly-compatible. Every URL would have to be given twice for the foreseeable future – every page would need two links, every email would need two lines. It would be hard to explain to people how to know which one to click, and people who clicked the wonton one in unsupported user agents would get a nasty error.

    In the light of the “boil the sea” nature of the proposal, I think that the first thing you list under “advantages” is pretty minor. The second one isn’t actually an advantage of your proposal over mine – see ERIIX’s point about changing the server software.

    Jon: Thanks for the info. I don’t see why you think this “spoils it for me” – it just means I’ve invented something totally new rather than copied someone else’s idea.

    Incidentally, my idea has all the advantages of the GUID idea too.

  14. Thinking about it more, I’ve come to the conclusion that the path name might not be the correct place to put the md5 information.

    Look at a sample URL:
    http://www.example.org:8080/path/to/file?param=value#anchor

    Of these, different parts are instructions to different bits of software.

    • http://www.example.org:8080 – instructions to the browser to say which machine to contact
    • path/to/file – instructions to the server to say which file to serve
    • param=value – instructions to the server to say which file to serve
    • anchor – instruction to the browser to say where in the file to go.

    Now, our md5 information is actually an instruction to the browser to do a verification – so it really belongs in one of the URL sections which are instructions to the browser. You can’t include it in the protocol, host and port, otherwise the browser would contact the wrong place – but what about the anchor?

    When you are downloading a file to disk, the anchor is never used, because the file is not displayed to the user. That means it’s available to be borrowed for this purpose. So, I think we want to design a system with syntax like this:

    http://www.example.org/securityupdate#md5sum!5F3C

    Legacy user agents will just ignore the anchor, whereas new user agents would use it as a security check. And it works with any file, with no server-side changes of any sort required.
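    The anchor-based check could be sketched like this (a minimal illustration of the #md5sum! syntax proposed above; the function name is made up):

```python
import hashlib
from urllib.parse import urlsplit

def check_fragment_fingerprint(url, file_bytes):
    """Verify a download against a '#md5sum!<digest>' fragment.
    Returns True on a match or when no fingerprint is present."""
    fragment = urlsplit(url).fragment
    if not fragment.startswith("md5sum!"):
        return True  # ordinary anchor, or none: nothing to check
    expected = fragment[len("md5sum!"):].lower()
    return hashlib.md5(file_bytes).hexdigest() == expected
```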

  15. Much better, I think. Cleaner, more intuitive syntax, avoids the complication of redirects, etc. Still some ambiguity in that any digest information could be a valid internal anchor (consider links into an html page about md5 checksums with internal anchors like #md5sum!introduction), however unlikely, but most cases could fail silently due to invalid checksum format. Still, might want to add some sort of identifying prefix, like

    #ADV!md5sum!5F3C…

    where ADV stands for Automated Download Verification, to further reduce the potential for conflicts.

    >Legacy user agents will just ignore the anchor

    When ‘right-click’ downloading, yes, but try left-clicking on the following link in Firefox:

    http://downloads.mozdev.org/mdhashtool/mdhashtool-0.3.xpi#md5sum!b3187251c16675ac7d20bb762ad53967

    Note that the checksum anchor is carried into the installation warning notice. A minor quibble, but there could be others.

    I do like the fact that this scheme doesn’t require any server-side changes. In the long run, when the sea is aboil, this might not be important, but, as one of your countrymen famously observed, in the long run we are all dead. In the short run, nothing helps generate enthusiasm for a new idea like ease of use and working examples to demonstrate its benefits.

    So, if you aren’t planning to do anything with this immediately, I’d like to incorporate support for it in my tool as a technology demonstration. Will you formalize the syntax and define a set of digest algorithm identifiers?

  16. Charles: I think the chances of anyone using internal anchors of the form md5sum followed by an exclamation mark are so tiny that adding ADV in front doesn’t reduce the possibility any further. And it makes the URL longer, which is bad.

    Interesting point about the XPI install dialog; but it is, as I’m sure you’d agree, an edge case.

    In terms of formalising the syntax, the questions are: which algorithm(s), which identifier, and which separator char. For sep, I chose ! because it’s fairly uncommon but doesn’t need to be escaped in URLs. For identifier, we can choose either e.g. “md5” or “md5sum”. I’m not sure the “sum” adds much, apart from a bit more uniqueness.

    In terms of algorithms, I think it’s important to keep the number of algorithms small, so as not to put a burden on implementors. So, we want one with a short digest to keep URL size down in non-NSA-proof applications, and one which has no known attacks or weaknesses. MD5 is the obvious choice for the first. Perhaps that’s all we need right now. Alternatively, perhaps something like SHA-256 is the right choice for the second?

    Gerv

  17. You’re overloading the fragment identifier, so that’s going to make the semantics people unhappy – you’re not specifying a part of the file to be downloaded.

    I would say it’s better to use the search portion of the URL, as it’s more semantically correct (“I want example.xpi with a checksum of abcdef…”) and web servers will ignore this for static file requests. You could use the rel attribute to turn on the client-side checksum processing, e.g.

    <a href="http://example.org/example.xpi?abcedf…" rel="download">Download now!</a>

    This has the possible advantage that the server logs will identify if checksum has been altered in the requesting link. However, it does have the disadvantage that it won’t work for URLs that are serving files dynamically and already use the search portion of the URL.

    I suppose you could always introduce Yet Another Namespace to XHTML:

    <a href="http://example.org/example.xpi" lf:checksum="md5:abcdef…">Download now!</a>

  18. Andrew:

    Some interesting ideas here. I share your concerns about abusing the fragment identifier semantics. Perhaps we’re back to the “magic” URL parameter that Gerv mentioned earlier if we’re determined to have the checksum data embedded in the URL itself.

    However, the other alternative you mentioned, the new XHTML namespace, might be an even better solution. Or, rather than create a new XHTML namespace, why not place the entire checksum/digest declaration in the ‘rel’ attribute (as a new link type).

    Either way, you lose the benefits of URL-embedded checksums, i.e. you couldn’t email or type into the address bar a simple URL string, but on the other hand, no extraneous parameter or anchor ever gets sent to the server. The fingerprint-enabled link solution is entirely client-based, which in my opinion is a plus. In addition, there is no ambiguity in interpreting the checksum/digest component, and no strangely-suffixed URL appears in the status bar to confuse or alarm the user – it’s essentially transparent and backwardly-compatible. Newer user agents that recognized the new link type or namespace could optionally display such links as “secured”, like jens suggested above, and make it easy to manually examine the checksum (via a ‘properties’-type context menu) to provide visual cues for the savvy user.

    Gerv, thoughts?

  19. An addition inspired by Gerv’s latest post: You could add a new “checksum” attribute right into HTML via WHAT-WG, avoiding YAN.

    I think Charles’ point about emailing links is a good argument for keeping the checksum in the URL. Perhaps we just have to RFC that “#” is a fragment identifier, but “#!” is a checksum identifier:

    http://example.org/example.xpi#!abcedf

    It’s backwardly compatible (as noted above) and “!” isn’t a legal character for an HTML or XML id, so it’s fairly safe too.

  20. Andrew: I didn’t realise that ! wasn’t legal for HTML or XML IDs. That’s really useful :-) So we could extend things in the way you specify without any possibility of a clash.

    http://example.org/example.rpm#!md5!A3F4C12E

    I don’t think adding a “checksum” attribute into HTML really solves the same problem – the big advantage of this proposal is that the links can be emailed, copied, pasted and put in plain text without losing the checksum information.

    Gerv
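    Parsing the proposed “#!<alg>!<digest>” suffix could look like the following sketch (the recognised algorithm set is an assumption based on the discussion above, and the function name is made up):

```python
import hashlib

ALGORITHMS = {"md5": hashlib.md5, "sha256": hashlib.sha256}

def split_checksum_fragment(url):
    """Split off a '#!<alg>!<digest>' suffix; returns (plain_url,
    alg, digest), or (url, None, None) when none is present."""
    base, sep, fragment = url.partition("#!")
    if not sep:
        return url, None, None  # no checksum identifier at all
    alg, bang, digest = fragment.partition("!")
    if not bang or alg not in ALGORITHMS:
        return url, None, None  # not a recognised checksum fragment
    return base, alg, digest
```

    An ordinary “#fragment” never contains “#!”, so legacy anchors pass through untouched.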

  21. Just thought I’d better double check. Yes, “!” is not a valid character for HTML or XML (1.0, 1.1) ids. Furthermore, it is allowed in the fragment identifier in the URL spec. That should keep Hixie happy. ;)

    Is it really necessary to have the “md5!” part, though? Can’t we just pick one good hash algorithm and use that? And if we do have to specify the hash algorithm, can we have a more aesthetically pleasing character like “#!md5@abcdef…” or “#!md5*abcdef…”? Shame we can’t use “:” or “;”.

  22. I don’t think it’s wise to just pick one hash algorithm. The recent breakthroughs in hash attacks make it clear that we need to be able to change horses further down the line.

    I’d be all for a more aesthetically-pleasing character, but IMO it should be one which doesn’t have to be escaped. So that’s a fairly small set. * would do. But I rather like being able to use ! as a separator in all cases.