I’ve discovered an interesting HTML parsing foible. HTML entities, such as &amp;amp;, are used to encode special and extended characters in HTML. But, as web page authors are sloppy, both IE and Firefox attempt to “do the right thing” when faced with unterminated (no semicolon) HTML entities like:
“I used Sun’s java compiler”
However, if HTML sanitisers don’t do exactly the same thing when decoding, you may be able to slip some script past them. So, for example:
could be the name of a valid image on the webserver. However, IE and Firefox will decode it to:
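As a hypothetical illustration of this mismatch (not the example from the post), here is a minimal Python sketch. The functions `strict_decode` and `looks_safe` and the `payload` value are made up for the purpose; `html.unescape` from the standard library stands in for a lenient, browser-like decoder, since it also accepts numeric references without a trailing semicolon. A real sanitiser would of course do far more than this.

```python
import html
import re


def strict_decode(value: str) -> str:
    """Hypothetical sanitiser step: decode only semicolon-terminated decimal references."""
    return re.sub(r"&#([0-9]+);", lambda m: chr(int(m.group(1))), value)


def looks_safe(value: str) -> bool:
    """Hypothetical check: reject values that decode to a javascript: URL."""
    return not strict_decode(value).strip().lower().startswith("javascript:")


# Made-up attribute value; 106 is the code point of 'j', and the semicolon is omitted.
payload = "&#106avascript:alert(1)"

print(looks_safe(payload))     # True: the strict decoder leaves the reference untouched
print(html.unescape(payload))  # javascript:alert(1): the lenient decoder expands it
```

The check passes because the strict decoder never sees the letter "j", while the lenient decoder reconstructs the full scheme. Any difference in decoding rules between the filter and the browser opens the same kind of gap.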
One of the first bugs I found had a unique way of interpreting something similar:
Oddly enough, Firefox executes it immediately if you use it in a data: URL, but only once per tab. Also, if I try to put in an if statement to restrict it to Internet Explorer, I get an uncaught exception in the JS console…
I think this will turn up a lot of very obscure bugs, if anyone bothers to check.
Nitpicking: Those are not entities but (numeric) character references.
SGML/HTML allows the omission of the semicolon in certain situations. I don’t have the Handbook here, but this might be one of the situations where it is permissible to omit the semicolon.
There’s something about that in the HTML 4.01 specification:
I frankly don’t see the point; whether or not the refc delimiter (reference close, usually ‘;’) is used, the attribute value passed to the script engine by the markup parser should be the same. If you wanted an ampersand in a path component you would have to escape it just as you would have to in a query component.
As to parsing rules, SGML delimiters are context sensitive, which seems to cause a lot of confusion. In your example, the context is unambiguous and the character references (I wouldn’t file the distinction from entity references under nitpicking, so &deity help me) can be expanded.
The quoted W3C prose is by no means unambiguous, I’m afraid, especially in gaining the general character reference potpourri noise – e.g. for a character reference the “middle of a word” would have to be a function name, while this particular part of that note is quite likely twaddling about entity references (who would want to use a function character “in the middle of a word”? ;-). Just in the idle hope of making someone’s head explode: the mentioned scenario “at a line break” is a special case in itself and does not mean that the refc delimiter is even omitted. Look out:
In the more common case
the character reference ends at the space character and should be parsed as
while this case
should be parsed as
because in this context the line break is a refc delimiter too (RE, ‘record end’) and as such absorbed.
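The rule described in this comment can be sketched roughly as follows, assuming decimal character references only. The regular expression, the function name `expand_refs`, and the sample strings are illustrative assumptions, not the commenter's examples and not a full SGML parser: a reference is closed by `;` or by a line break, both of which are consumed, whereas a space merely ends the reference and stays in the output.

```python
import re

# Decimal character reference followed by an optional reference end:
# either refc (';') or a record end (line break). Both are consumed.
# Any other character, such as a space, ends the reference but is kept.
CHAR_REF = re.compile(r"&#([0-9]+)(?:;|\r?\n)?")


def expand_refs(text: str) -> str:
    """Expand decimal character references, absorbing ';' or a line break."""
    return CHAR_REF.sub(lambda m: chr(int(m.group(1))), text)


print(repr(expand_refs("AT&#38 T")))   # 'AT& T'  (space ends the reference, is kept)
print(repr(expand_refs("AT&#38\nT")))  # 'AT&T'   (line break acts as reference end)
print(repr(expand_refs("AT&#38;T")))   # 'AT&T'   (explicit refc)
```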