HTML5 Email Address Regexp

Most Recent Update: 2013-03-26: If you found this page by searching, the HTML5 email address regexp you are looking for is:

/^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,253}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,253}[a-zA-Z0-9])?)*$/

(This should be all one line.) This regexp works unchanged in either Perl or JavaScript.
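
A minimal sketch of using the pattern from JavaScript follows. The helper name `isHtml5Email` and the sample addresses are made up for illustration; only the regexp itself comes from this page.

```javascript
// The pattern from the top of this post, copied unchanged.
const HTML5_EMAIL = /^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,253}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,253}[a-zA-Z0-9])?)*$/;

// Hypothetical helper: true when the whole string matches the pattern.
function isHtml5Email(address) {
  return typeof address === "string" && HTML5_EMAIL.test(address);
}

// Illustrative inputs only.
const samples = ["gerv@example.org", "gerv@kitten", "no-at-sign", "two@@example.org"];
for (const sample of samples) {
  console.log(sample, isHtml5Email(sample));
}
```

With these inputs the first two addresses are accepted and the last two are rejected. A match only says the string has the right shape, not that the mailbox exists.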

Version history: this version improves upon the previous, shorter version which was here from Feb 2012 to March 2013 in that it restricts domain labels to 255 characters and prevents them beginning or ending with a “-”. That prior version differed from the version originally calculated in May 2011 in the blog post below, in that it stopped using \w and started using non-capturing parentheses.
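
As a rough illustration of what the tightened domain rules change, the sketch below runs the current pattern next to the May 2011 pattern from the original post. The variable names, the `compare` helper and the test addresses are assumptions made for this example.

```javascript
// Current pattern (top of this page).
const current = /^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,253}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,253}[a-zA-Z0-9])?)*$/;
// May 2011 pattern (from the original post).
const original2011 = /^[\w.!#$%&'*+\-/=?\^`{|}~]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*$/i;

// Hypothetical helper: report what each pattern says about one address.
function compare(address) {
  return { current: current.test(address), original2011: original2011.test(address) };
}

// Illustrative edge cases: hyphens at label edges and label length.
const cases = [
  ["ordinary domain", "a@example.com"],
  ["label starts with hyphen", "a@-example.com"],
  ["label ends with hyphen", "a@example-.com"],
  ["255-character label", "a@" + "x".repeat(255)],
  ["256-character label", "a@" + "x".repeat(256)],
];
for (const [description, address] of cases) {
  console.log(description, compare(address));
}
```

The two patterns agree on the ordinary domain and the 255-character label. Only the May 2011 pattern accepts the hyphen-edged labels and the 256-character label.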

Original post follows:


Summary: can anyone help me out by confirming I’ve correctly calculated the regexp to match the HTML5 email address format spec?

The regexp to match official RFC2822 email addresses is:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

HTML5 realises this is ridiculous, and proposes a more limited format for email addresses. Having read the spec, and followed the logic below, I think the regexp form of the spec is this:

/^[\w.!#$%&'*+\-/=?\^`{|}~]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*$/i

Am I right?

Logic:

HTML5 says:

“1*( atext / "." ) "@" ldh-str *( "." ldh-str ) where atext is defined in RFC 5322 section 3.2.3, and ldh-str is defined in RFC 1034 section 3.5.”

Mining those RFCs:

atext           =   ALPHA / DIGIT /    ; Printable US-ASCII
                    "!" / "#" /        ;  characters not including
                    "$" / "%" /        ;  specials.  Used for atoms.
                    "&" / "'" /
                    "*" / "+" /
                    "-" / "/" /
                    "=" / "?" /
                    "^" / "_" /
                    "`" / "{" /
                    "|" / "}" /
                    "~"

<ldh-str> ::= <let-dig-hyp> | <let-dig-hyp> <ldh-str>
<let-dig-hyp> ::= <let-dig> | "-"
<let-dig> ::= <letter> | <digit>
<letter> ::= any one of the 52 alphabetic characters A through Z in upper case and a through z in lower case
<digit> ::= any one of the ten digits 0 through 9

That leads to the following, if I’m not mistaken:

/^[a-zA-Z0-9.!#$%&'*+\-/=?\^_`{|}~]+@[a-z0-9-]+(\.[a-zA-Z0-9-]+)*$/i

which can be reduced, by using ‘\w’, to:

/^[\w.!#$%&'*+\-/=?\^`{|}~]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*$/i
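
The reduction can be checked mechanically. The sketch below assumes plain JavaScript regexps without the `u` flag, where `\w` is exactly `[A-Za-z0-9_]`; in Perl or other Unicode-aware engines `\w` may match more, so the result does not carry over automatically. The names `explicit`, `reduced` and `disagreements` are made up for this example.

```javascript
// Explicit character class, as derived from the RFC grammar.
const explicit = /^[a-zA-Z0-9.!#$%&'*+\-/=?\^_`{|}~]+@[a-z0-9-]+(\.[a-zA-Z0-9-]+)*$/i;
// Same pattern with the letters, digits and underscore folded into \w.
const reduced = /^[\w.!#$%&'*+\-/=?\^`{|}~]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*$/i;

// Try every ASCII character once as a local part and once as a domain,
// and collect any input on which the two patterns give different answers.
function disagreements() {
  const found = [];
  for (let code = 0; code < 128; code++) {
    const ch = String.fromCharCode(code);
    for (const candidate of [ch + "@example", "x@" + ch]) {
      if (explicit.test(candidate) !== reduced.test(candidate)) {
        found.push(candidate);
      }
    }
  }
  return found;
}

console.log(disagreements().length); // 0 under the stated assumption
```

This only probes single ASCII characters, so it is a sanity check rather than a proof.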

21 thoughts on “HTML5 Email Address Regexp”

  1. Is there any reason to not use “\w” on both sides? ie:

    /^[\w.!#$%&'*+\-/=?\^`{|}~]+@[\w-]+(\.[\w-]+)*$/i

  2. Looks about right, and suitably vague.

    I take issue with the approach the ‘official’ one uses – Jan accepts that his regex rejects a small number of valid email addresses, so it can catch a small number of invalid email addresses (the example he gives is rejecting .museum, a valid address, so it will also reject .office, an invalid address).

    No regex will ever perfectly validate that the string entered is a working email address – a syntactically valid domain might not exist or the user might not have been created, so I always err on being overly simple and allowing some invalid addresses through, which this regex does.

  3. Shouldn’t you use a non-capturing group for the last group? And a small innocent question, isn’t at least one of these needed? This leads to:

    /^[\w.!#$%&'*+\-/=?\^`{|}~]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+$/i

  4. @A & Gervase

    In theory, browsers should accept IDNs and transform them internally to the punycode version.

    However, none of the existing implementations actually allows IDNs.
    I personally can’t use input@email because I know some of our users have IDN e-mail addresses. I don’t want to exclude potential customers either.

    BMO bug 618876 if you’re interested (don’t spam please).

  5. Just wondering though about IDN email addresses. What about the part before the @ sign. Can those characters be something other than ascii when using an IDN email address?

  6. Might it make sense to have the HTML5 definition of a valid email address be a specific regex of ECMAScript 4 instead of the cryptic definitions that are currently given?

  7. A: My understanding is that email-to-IP syntax is intentionally excluded from the HTML5 regexp.

    Peter: Possibly, but (I think) the beauty of using \w is that for Unicode-aware regexps, it’ll automatically be covered.

    Havvy: that’s a question for Hixie :-) I suspect he’s considered it. Perhaps a non-normative note would be appropriate.

  8. nAmo: You are right about the non-capturing group. But it’s not true that we need at least one bit with a dot. “gerv@kitten” is a perfectly valid email address.

    Gerv

  9. On the one hand, .museum is essentially irrelevant: in actual usage, museums all use addresses in .org or occasionally .com or a two-letter geographic domain. The same is true of all the new longer-than-three-letters TLDs, except (arguably) for .info, which occasionally gets used by advertisers.

    On the other hand, I don’t think a regex is the right way to determine whether the domain portion of the address is valid. The right way, IMO, is to look it up in the DNS. If it has an MX record, it’s a valid domain for email purposes. If it has an A record and port 25 is open, it’s *arguably* a valid domain for email purposes. If both of these checks fail, it’s no good. I believe this is the approach Email::Valid follows.

    > none of the existing implementations actually allows IDNs.

    If you’re the first implementation to support them (in the domain portion), make it a pref and have it be turned off by default for the first while. IDNs have security implications, and so it’s best not to spring them on everyone suddenly, IMO.

    The username portion (before the at sign) is another matter. Various mail servers in practice already have any number of weird variations on what the username portion of the email address can contain and what it means. It’s usually not case-sensitive, but it can be. It usually doesn’t have punctuation (except period, hyphen, and underscore), but it can, and there are existing mail servers that e.g. use the apostrophe/single-quote character to separate semantically distinct components of the username. It’s all a matter of the recipient mail server’s policy. The only thing I’m pretty sure is universally disallowed is whitespace, for rather obvious reasons.

    Since one organization (whoever runs the mail server) is responsible for creating all of the addresses, this doesn’t cause the same kinds of security issues that come up with IDNs. (Mail domains that allow third parties to create arbitrary mail accounts, like Yahoo and Hotmail and Gmail, have restrictive policies on what characters can be in the username, as well they should, but again the details are up to the people who run the mail server.)

  10. Come to think of it, I probably should explicitly mention that there are existing mail servers that have things in the username portion that technically don’t conform with RFC 822. This includes some fairly important domains, e.g., Docomo.

  11. Sorry I never got back to you about this, and glad to see you got better answers this way.

  12. This should do it:

    /^([\w+\!\#\$\%\&\'\*\+\-\/\=\?\^\_\`\{\|\}\~])+((\.?)([\w+\!\#\$\%\&\'\*\+\-\/\=\?\^\_\`\{\|\}\~]))+\@[a-zA-Z0-9-]+((\.?)([a-zA-Z0-9-]+))+$/;

    It’s based off these rules:

    The local-part of the email address may use any of these ASCII characters:

    Uppercase and lowercase English letters (a–z, A–Z)
    Digits 0 to 9
    Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
    Character . (dot, period, full stop) provided that it is not the first or last character, and provided also that it does not appear two or more times consecutively (e.g. John..Doe@example.com is not allowed).

    The domain name part of an email address has to conform to strict guidelines: it must match the requirements for a hostname, consisting of letters, digits, hyphens and dots.

    Rules were found here:
    http://en.wikipedia.org/wiki/Email_address

  13. Mark: Thing is, I’m interested in what the HTML5 regexp is going to be, because that will become a common limiter on what email addresses work in a lot of places. I don’t really care about other sets of rules that there may be :-)

  14. But those rules I quoted are from RFC 5321 and RFC 5322, so they’re the “official” rules for email addresses. Why would the HTML5 regexp use anything different? Oh, wait… ;)

  15. Mark: Because the official official rules are a lot more complicated than that, and include a load of stuff hardly anyone uses, and simplicity is a great thing :-)

  16. As far as I know, the RE I posted allows for everything except quoted addresses like:

    "bob smith"@example.com

    and IP addresses for the domain part like:

    bob.smith@[10.254.79.161]

    which are both valid according to the RFCs, but hardly ever used. I would be interested to find out what other rules there are that my RE doesn’t allow for. I’ve tested it extensively and I haven’t found any cases where it doesn’t work other than the quoted local-part and the IP domain part.

  17. As I mentioned earlier, even the rules Mark lists don’t cover everything that’s currently in use out there in the wild. For example, most mobile phone users in Japan have email addresses that do not conform to those rules.

    I still think using a regular expression is categorically the wrong way to handle this and that the approach used by Email::Valid is inherently superior.

    I’m also not sure how it’s any of HTML5’s business what characters people do or do not have in the username portion of their email addresses, and I really don’t think implementing such limits in web browsers is a good idea. If mail administrators even for quite major sites can’t be bothered to follow RFC2822 in this regard, what on earth would make anyone think they’ll change it just to accommodate HTML5? If people can’t type their friends’ actual email addresses into a given email address form field, the combination of that website and that browser will be broken for those users. Any site that gets hit with this will either tell people that browser isn’t supported, or else it’ll go back to using regular old type="text" inputs instead of type="email". I’d vote for the latter approach, because if type="email" means some email addresses won’t work, then type="email" is obviously borked.

    If you must have a regex, I propose the following:
    /^((?!@)\S)+[@]((?!@)\S)+$/
    That’s one or more non-whitespace characters other than @, followed by @, followed by one or more non-whitespace characters other than @.

    For added bonus points, any time the field loses focus or stops changing for more than a couple of seconds, ask the resolver library whether the part after the @ is listed in DNS (and, like, show red squiggles under it or something if it doesn’t resolve).

    Also, intranets frequently use domain names that do not contain a period or start at the DNS root, and everyone expects them to Just Work as long as the local DNS resolver knows what they mean. Just saying.
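
For anyone skimming the thread, here is a rough sketch that runs three of the patterns discussed above against the addresses people argued about. The key names (`post`, `dotRequired`, `minimal`) and the `verdicts` helper are made up for this example, and `minimal` is written with a trailing `$` so that the whole string has to match.

```javascript
const patterns = {
  // The \w-based pattern from the original post.
  post: /^[\w.!#$%&'*+\-/=?\^`{|}~]+@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*$/i,
  // Comment 3: non-capturing group, at least one dot in the domain.
  dotRequired: /^[\w.!#$%&'*+\-/=?\^`{|}~]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+$/i,
  // Comment 17: anything without whitespace or extra @ signs.
  minimal: /^((?!@)\S)+[@]((?!@)\S)+$/,
};

// Hypothetical helper: map each pattern name to true/false for one address.
function verdicts(address) {
  const result = {};
  for (const [name, pattern] of Object.entries(patterns)) {
    result[name] = pattern.test(address);
  }
  return result;
}

// Addresses taken from the discussion above.
const addresses = [
  "gerv@kitten",
  "bob.smith@[10.254.79.161]",
  '"bob smith"@example.com',
  "John..Doe@example.com",
];
for (const address of addresses) {
  console.log(address, verdicts(address));
}
```

Under these patterns, `gerv@kitten` passes `post` and `minimal` but not `dotRequired`. The bracketed IP form passes only `minimal`, and the quoted local part with a space fails all three. `John..Doe@example.com` passes all three, since none of them looks for consecutive dots.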