Entity Parsing

I’ve discovered an interesting HTML parsing foible. HTML entities, such as &amp;, are used to encode special and extended characters in HTML. But because web page authors are sloppy, both IE and Firefox attempt to “do the right thing” when faced with unterminated (no semicolon) HTML entities like:

“I used Sun’s &#106ava compiler”

However, if HTML sanitisers don’t do exactly the same thing when decoding, you may be able to slip some script past them. So, for example:

<img src="&#106ava&#115cript&#58aler&#116&#40&#39Oops&#39&#41&#59">

could, taken literally, be the name of a valid image on the webserver, so a sanitiser that doesn’t decode it might let it through. However, IE and Firefox will decode it to:

<img src="javascript:alert('Oops');">

IE executes the JavaScript immediately; in Firefox you need to do a View Image before it runs (not sure why). Perhaps this isn’t a big deal, but I have no idea whether any commonly-used HTML sanitisers would fall for this one…
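
To make the mismatch concrete, here is a minimal Python sketch. It is not taken from any real sanitiser: the two regular expressions and the looks_dangerous check are hypothetical stand-ins for a strict decoder (semicolon required) and a lenient, browser-like decoder (semicolon optional), and only decimal references are handled.

```python
import re

# Hypothetical decoders, for illustration only (decimal references only).
STRICT = re.compile(r"&#([0-9]+);")    # semicolon required
LENIENT = re.compile(r"&#([0-9]+);?")  # semicolon optional, as the browsers behave

def decode(text, pattern):
    """Replace each numeric character reference matched by pattern."""
    return pattern.sub(lambda m: chr(int(m.group(1))), text)

def looks_dangerous(url):
    """Naive check a sanitiser might apply to a URL attribute."""
    return url.strip().lower().startswith("javascript:")

payload = "&#106ava&#115cript&#58aler&#116&#40&#39Oops&#39&#41&#59"

strict_view = decode(payload, STRICT)    # what a strict sanitiser sees
lenient_view = decode(payload, LENIENT)  # what the browser ends up with

print(looks_dangerous(strict_view))   # False: nothing was decoded
print(looks_dangerous(lenient_view))  # True: javascript:alert('Oops');
```

The sanitiser only catches the script if it decodes at least as leniently as the browser does before it inspects the value.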

5 thoughts on “Entity Parsing”

  1. Oddly enough, Firefox executes it immediately if you use it in a data: URL, but only once per tab. Also, if I try to put an if statement in to restrict it to Internet Explorer only, I get an uncaught exception in the JS console…

    I think this will turn up a lot of very obscure bugs, if anyone bothers to check.

  2. Nitpicking: Those are not entities but (numeric) character references.

    SGML/HTML allows the omission of the semicolon in certain situations. I don’t have the Handbook here, but this might be one of the situations where it is permissible to omit the semicolon.

  3. Henri,

    There’s something about that in the HTML 4.01 specification:

    Note. In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.

  4. I frankly don’t see the point; whether or not the refc delimiter (reference close, usually ‘;’) is used, the attribute value passed to the script engine by the markup parser should be the same. If you wanted an ampersand in a path component you would have to escape it just as you would have to in a query component.
    As to parsing rules, SGML delimiters are context sensitive, which seems to cause a lot of confusion. In your example, the context is unambiguous[0] and the character references (I wouldn’t file the distinction from entity references under nitpicking, so &deity help me) can be expanded.

    [0] The quoted W3C prose is by no means unambiguous, I’m afraid, especially in gaining the general character reference potpourri noise – e.g. for a character reference the “middle of a word” would have to be a function name while this particular part of that note is quite likely twaddling about entity references (who would want to use a function character “in the middle of a word” ;-). Just in the idle hope to make someone’s head explode, the mentioned scenario “at a line break” is a special case in itself and does not mean that the refc delimiter is even omitted. Look out:

    In the more common case

    Hell&#111 world

    the character reference ends at the space character and should be parsed as

    Hello world

    while this case

    Hell&#111
    world

    should be parsed as

    Helloworld

    because in this context the line break is a refc delimiter too (RE, ‘record end’) and as such absorbed.
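
The two cases in the last comment can be modelled with a short Python sketch. This only encodes the rule as the commenter describes it (a line break right after the reference acts as the closing delimiter and is absorbed, while a space ends the reference and is kept). It makes no claim about what any browser or SGML parser actually does, and the CHAR_REF pattern and expand function are hypothetical names.

```python
import re

# Hypothetical model of the rule described above: a decimal character
# reference may be closed by ";" or by a line break, and either closer is
# consumed. Any other character ends the reference and is left in place.
CHAR_REF = re.compile(r"&#([0-9]+)(?:;|\r?\n)?")

def expand(text):
    """Expand decimal character references under the modelled rule."""
    return CHAR_REF.sub(lambda m: chr(int(m.group(1))), text)

space_case = expand("Hell&#111 world")     # space is kept
newline_case = expand("Hell&#111\nworld")  # line break is absorbed

print(space_case)    # Hello world
print(newline_case)  # Helloworld
```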