URL Encoding Confusion

According to RFC 1738, Section 2.2, a URL is a series of octets. Octets which have no visible ASCII representation, plus other problematic chars, must be encoded as %HH. It doesn’t say anything about character sets.

In all modern browsers (including Mozilla, now that bug 44272 is fixed), JavaScript’s escape() function (which was originally designed for escaping strings into URLs) uses %HH for Unicode codepoints below 0x0100, but %uHHHH for codepoints from 0x0100 upwards. Apparently, this is what the ECMAScript spec says. Where did this strange %u encoding come from? The newer encodeURIComponent() (not supported by IE 5.0) seems to do the right thing, always encoding each individual byte of the character’s UTF-8 representation as %HH.
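To make the difference concrete, here is a small sketch that runs both functions over the same strings. The helper name compareEncodings is made up for this illustration; escape() and encodeURIComponent() are the standard globals.

```javascript
// Hypothetical helper (not part of any API): encode one string both ways.
function compareEncodings(text) {
  return {
    escaped: escape(text),               // legacy: %HH below 0x0100, %uHHHH from there up
    component: encodeURIComponent(text)  // %HH for every byte of the UTF-8 form
  };
}

// u-umlaut (U+00FC) sits below 0x0100, the euro sign (U+20AC) above it.
for (const sample of ["\u00FC", "\u20AC"]) {
  const result = compareEncodings(sample);
  console.log(sample, result.escaped, result.component);
}
// ü %FC %C3%BC
// € %u20AC %E2%82%AC
```

Plain ASCII input comes out the same from both functions for letters, digits and spaces; they only part ways once non-ASCII characters (and a few punctuation marks) are involved.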

Form submission seems to encode in the character set of the page, as %HH. So encodeURIComponent() matches form submission if the document is in the UTF-8 character set. Otherwise, it doesn’t.
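As a rough illustration of that mismatch, the sketch below percent-encodes a string the way a page in ISO-8859-1 would, and compares the result with encodeURIComponent(). The function percentEncodeLatin1 is a hypothetical stand-in, not a browser API, and it ignores the other rules of real form encoding (for instance, spaces becoming +).

```javascript
// Hypothetical helper: percent-encode text as if the page charset were ISO-8859-1.
// Each character below 0x0100 maps to exactly one byte with the same value.
function percentEncodeLatin1(text) {
  let out = "";
  for (const ch of text) {
    const code = ch.codePointAt(0);
    if (code > 0xFF) {
      throw new RangeError("not representable in ISO-8859-1: " + ch);
    }
    if (/[A-Za-z0-9\-_.~]/.test(ch)) {
      out += ch; // unreserved characters pass through untouched
    } else {
      out += "%" + code.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return out;
}

console.log(percentEncodeLatin1("f\u00FCr")); // f%FCr
console.log(encodeURIComponent("f\u00FCr"));  // f%C3%BCr
```

A server that expects one of these and receives the other will decode the wrong character, which is the whole problem in a nutshell.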

But if you give IE a mailto: URL, it’ll decode any %HH bytes as if they were ISO-8859-1 (I think). Perhaps it’s using the default charset of the platform. So you have to use escape() – you can’t use encodeURIComponent, otherwise it gets things like U-UMLAUT wrong.

Mozilla, on the other hand, assumes UTF-8. But this means you can’t escape() something into a mailto: if it’s got characters in it between 0x007F and 0x0100. escape() uses %HH (direct encoding of value), whereas Mozilla expects %HH%HH (2-byte UTF-8). You need to use encodeURIComponent. >sigh<
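The two readings can be imitated in plain JavaScript, which makes it easier to see why neither function suits both browsers. This is only a sketch: decodeAsLatin1 and decodeAsUtf8 are made-up names, with unescape() standing in for "treat each %HH as one ISO-8859-1 character" and decodeURIComponent() standing in for "treat the bytes as UTF-8". It says nothing about how either browser is actually implemented.

```javascript
// Hypothetical stand-in for a client that reads %HH bytes as ISO-8859-1.
function decodeAsLatin1(encoded) {
  return unescape(encoded);
}

// Hypothetical stand-in for a client that reads %HH bytes as UTF-8.
// Returns null when the bytes are not valid UTF-8.
function decodeAsUtf8(encoded) {
  try {
    return decodeURIComponent(encoded);
  } catch (err) {
    if (err instanceof URIError) return null;
    throw err;
  }
}

const viaEscape = escape("\u00FC");                // "%FC"
const viaComponent = encodeURIComponent("\u00FC"); // "%C3%BC"

console.log(decodeAsLatin1(viaEscape));    // ü
console.log(decodeAsLatin1(viaComponent)); // Ã¼  (two characters, wrong)
console.log(decodeAsUtf8(viaComponent));   // ü
console.log(decodeAsUtf8(viaEscape));      // null (a lone 0xFC is not valid UTF-8)
```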

So why does escape() use %uXXXX? What is IE up to? Is there some definitive representation of Unicode characters in URLs, or is it just a contract between client and server? Does anyone know of a document which explains all of this, in words of less than one syllable?

4 thoughts on “URL Encoding Confusion”

  1. As far as I know, the definitive representation of URLs now that IDN is around is URL-escaped UTF-8.

    Unfortunately, a lot of servers assume other charsets (and have for years), so browsers have to do some wild guessing to figure out exactly what to send to the server, etc.

  2. Well, http://www.faqs.org/rfcs/rfc2396.html says:

    ‘This document updates and merges “Uniform Resource Locators” [RFC1738] and “Relative Uniform Resource Locators” [RFC1808]’

    But it makes no mention of the %uHHHH encoding.

    Then there is http://www.ietf.org/rfc/rfc2718.txt
    “Guidelines for new URL Schemes”

    which includes the part:

    “2.2.5 Character encoding

    “When describing URL schemes in which (some of) the elements of the URL are actually representations of sequences of characters, care should be taken not to introduce unnecessary variety in the ways in which characters are encoded into octets and then into URL characters. Unless there is some compelling reason for a particular scheme to do otherwise, translating character sequences into UTF-8 (RFC 2279) [3] and then subsequently using the %HH encoding for unsafe octets is recommended.”

    Then there is http://www.w3.org/International/iri-edit/ which is where they are working on what they hope will become an RFC for IRIs, if it is adopted.

    Their latest draft also appears to make no mention of the %uHHHH format, although I must admit that I have not read it in detail. One of the two draft authors works at Microsoft.

    So, text -> Sequence of octets in the UTF-8 encoding -> % escaped string seems to be the way that all the standards documents advocate, but that does not really answer your question!

    Tim.
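The text -> UTF-8 octets -> % escaped string pipeline from the comment above can be spelled out step by step. The sketch below is one possible reading, not a reference implementation: percentEncodeUtf8 is a made-up name, TextEncoder is assumed to be available (it is in current browsers and Node.js), and the choice to leave only letters, digits and -_.~ unescaped is an assumption for this example.

```javascript
// Hypothetical helper: text -> UTF-8 octets -> %HH for every octet
// that is not an unreserved ASCII character.
function percentEncodeUtf8(text) {
  const octets = new TextEncoder().encode(text); // step 1: characters to UTF-8 octets
  let out = "";
  for (const octet of octets) {
    const ch = String.fromCharCode(octet);
    if (/[A-Za-z0-9\-_.~]/.test(ch)) {
      out += ch; // safe octet, emitted as the ASCII character itself
    } else {
      // step 2: unsafe octet becomes %HH
      out += "%" + octet.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return out;
}

console.log(percentEncodeUtf8("f\u00FCr \u20AC")); // f%C3%BCr%20%E2%82%AC
```

For this input the output is the same as encodeURIComponent() gives. The two differ on a handful of punctuation characters such as ! and *, which encodeURIComponent() leaves alone and this sketch escapes.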