Phishing – Browser-based Defences

The recent Shmoo Group punycode phishing demo has caused a lot of debate on the topic of phishing. In fact, domain registrars were warned about this problem (guideline 3, subsection b) when IDN was invented, and many registrars (for example, the Japanese) have implemented sensible controls to counter the problem of homographic domains.

However, it’s caused me to crystallise a number of thoughts I’ve been having on the subject into a paper – Phishing – Browser-based Defences – which covers what we can be doing in the browser space to mitigate the phishing problem. It lists the qualities of a good solution, analyses some of the suggestions that have been made, and then makes three of its own.

Please let me know what you think.

35 thoughts on “Phishing – Browser-based Defences”

  1. The glyph-based hashing scheme sounds like a good idea, but I doubt it would be very helpful. Since the glyphs don’t have any explicit meaning to the user, they will be ignored. Just like status bar icons for unused features, users will totally ignore things that don’t have any obvious use.

    Even I, as a more technically-aware user, can’t see myself remembering the glyphs for my banks’ domains.

    Further, the tendency of banks to do silly things like inexplicably changing the subdomain of their secure servers (secure.mybank.com -> secure2.mybank.com) would lead people to say “well, I guess my bank just changed something again!” even if only the glyphs changed.

    The SSH-style hashed domain history seems like a great idea, but will push too many people’s privacy buttons.

    The best idea seems to be your “Phish Finder” combination solution. Implementing all the points except the hashed SSL domains seems like a great start.

    Extending your first point: show a warning if any illogical sequence of characters appears, not just simple homographs. I can’t think of a normal circumstance where a Cyrillic character would appear within a domain of Latin characters (or vice versa). Maybe I’m uninformed; this might be too discriminatory.

  2. In general it looks good.

    The glyph thing sounds a bit like a method Lotus Notes uses for passwords. One of the strengths of that is the distinctiveness of the icons. I’m not sure if (even a carefully selected) set of Unicode glyphs would be as memorable.

    A thought I’d had a while ago was to add SSL information to the password store, so that when you visit a site and enter a username and password you have used somewhere else, the browser could say: “You normally use that pair for paypal.com, and this page cannot be verified to be part of paypal.com”, etc.
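
    A rough sketch of the idea in Python (the store layout and names here are purely illustrative assumptions):

        import hashlib

        # Hypothetical store: hash of (username, password) -> set of domains used on.
        credential_domains = {}

        def cred_hash(username, password):
            return hashlib.sha256((username + ":" + password).encode()).hexdigest()

        def check_submission(username, password, domain):
            known = credential_domains.setdefault(cred_hash(username, password), set())
            if known and domain not in known:
                print("You normally use that pair for", ", ".join(sorted(known)),
                      "and this page cannot be verified to be part of those sites.")
            known.add(domain)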

    In terms of the moving pages / banks using new subdomains etc., I can’t remember exactly, but don’t SSL certificates have an organisation identifier that could be used?

  3. Probably the browser should implement what the registration companies should have done beforehand by rejecting registrations of mixed-character-set domain names. The browser wouldn’t reject the domain, but would display a warning message (probably like your “new site” warning). Purely Cyrillic, US-ASCII, kanji or whatever domains would be fine. Only mixed domains would trigger the message.

  4. Nice work Gerv!

    One way to solve the unbounded history problem would be to make the cache expiry based on both age and frequency of visit. This way, the more often you visit a site, the longer it stays in your history. I think an effective history is almost a prerequisite for automatic anti-phishing. (Isn’t one of the Firefox 2 plans to have a complete history stored?)
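
    A minimal sketch of such an expiry rule (the half-life and threshold values are illustrative guesses, not tuned numbers):

        import time

        HALF_LIFE_DAYS = 30.0  # assumed decay rate for history entries

        def keep_in_history(visit_count, last_visit_epoch, threshold=1.0):
            # Score grows with visit frequency but decays with age, so
            # often-visited sites survive in the history for longer.
            age_days = (time.time() - last_visit_epoch) / 86400.0
            return visit_count * 0.5 ** (age_days / HALF_LIFE_DAYS) >= threshold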

    If the history is long-lived, you could marry the domain-hashed icon scheme with the user-selected icon scheme – the browser picks a default icon, which the user can then override. Hopefully the visible browser default will prompt the user to select a “better” icon. This could also be improved by opening an explanatory window on first use, like Firefox’s popup blocker.

    Obviously this fails the user-intervention criterion, but it may help address Greg’s concerns that the browser-picked icon would be meaningless or forgettable for users.

    If you do allow the user to override the icon, it would probably be a good idea to make it distinctive. I can foresee users (like me!) setting all their trusted sites to the same icon, e.g. a heart or star to symbolise “favourite”. This would give sites that are automatically assigned this icon an unfair level of trust. Perhaps the colour could be changed; I’m sure two colours could be picked in such a way that they are distinctive even for the colour-blind.

    Greg’s concern over changing domains, e.g. server1 vs. server2, could be addressed by only using the name up to the second- or third-level domain names in .com/.net/.org or country-code domains respectively. After all, this is the part of the domain that uniquely identifies the organisation you’re contacting.
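
    A naive sketch of that extraction (deliberately oversimplified; real registry rules are messier, as noted below):

        GTLDS = {"com", "net", "org"}

        def organisation_part(host):
            # Keep two labels under .com/.net/.org and three under a ccTLD.
            # Deliberately naive: many registries allocate names both directly
            # under the top and under second-level domains.
            labels = host.lower().rstrip(".").split(".")
            keep = 2 if labels[-1] in GTLDS else 3
            return ".".join(labels[-keep:])

        # organisation_part("secure2.mybank.com") -> "mybank.com"
        # organisation_part("www.mybank.co.uk")   -> "mybank.co.uk"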

    Finally, I think all browser history including anti-phishing history should be removed by a “sanitise” function – as this is its advertised purpose.

  5. A thorough discussion of the issues, and I like the ‘new site’ approach.

    Generally, each user should be aware of which sites he/she is visiting, and which ones to trust.
    So, a ‘trust’ list compiled and managed by the user himself for sites already visited and trusted
    (after checking the SSL certificates, etc.) is indeed a good way to go.

    It should be more than just a technical feature of FF: a genuinely good practice, and core to doing trusted, safe and secure business on the internet.
    So a browser should only allow the user to do business (literally!) with a site claiming to be secure (because it is using SSL) when the user is also aware of this and has marked that specific site as trustworthy (not just some vague certification agent).

    Next to the ‘new secure site, do you want to trust it?’ message, it should help the user determine the security aspects of that new site (is the domain name really what it looks like, is the certificate not just valid but also issued by a trustworthy agent, and are its fields valid and the same as the actual domain name, etc.). After a domain name has been marked as trusted, the browser should warn about domain names which look alike (which may be valid or not).

    But a warning about suspicious domain names is useful for non-SSL sites too.

  6. Nice suggestions. There really is no silver bullet to solve this problem, so any kind of “solution” will probably be a combination of different measures.

    Another suggestion:
    Allow the user to specify a limited number (or a range) of Unicode characters that he is likely to meet.

    If a user knows only, say, three different languages, characters not used in any of these languages should be treated as “suspicious”. Personally I don’t speak any East European language, so there is no harm in not displaying domain names with Cyrillic letters to me.

    Most American and Western European users will probably only want to allow characters from the ISO-8859-1 charset. Domains containing other characters can be displayed in “raw” form or with a warning icon or whatever.

    This should of course be user configurable. Perhaps it can be integrated with settings for character encoding and Accept-Language.
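
    A sketch of such a check, using raw codepoint ranges as a stand-in for a real per-user language setting:

        # Assumed user preference: Latin-1 only (roughly ISO-8859-1).
        ALLOWED_RANGES = [(0x0000, 0x00FF)]

        def display_form(unicode_domain, punycode_domain):
            familiar = all(any(lo <= ord(ch) <= hi for lo, hi in ALLOWED_RANGES)
                           for ch in unicode_domain)
            # Unfamiliar characters: show the raw xn-- form instead.
            return unicode_domain if familiar else punycode_domain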

  7. On the topic of “close lexical proximity” I think you should consider using the Levenshtein distance on the two strings.

    This could be further improved by flattening homographs (and almost-homographs?) down to the same character prior to comparison, e.g. replacing Cyrillic “a” with an ASCII “a”, “1” with “l”, and so on. This would lead to almost all class 4 and 5 attacks resulting in a Levenshtein distance of zero.

    This method would probably not fall foul of mistaking “ibm.com” for “bmi.com”.

    Finding all the Unicode characters in all homographic classes would certainly be time consuming, but I think real value could be achieved quickly by starting with the ASCII/Latin characters and working outward.
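
    A sketch of the combination, with a toy homograph table standing in for the real (much larger) one:

        FOLD = {"\u0430": "a",  # Cyrillic a -> Latin a
                "\u043e": "o",  # Cyrillic o -> Latin o
                "1": "l", "0": "o"}

        def fold(s):
            return "".join(FOLD.get(ch, ch) for ch in s.lower())

        def levenshtein(a, b):
            # Standard dynamic-programming edit distance.
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                cur = [i]
                for j, cb in enumerate(b, 1):
                    cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                                   prev[j - 1] + (ca != cb)))
                prev = cur
            return prev[-1]

        # levenshtein(fold("paypa1.com"), fold("paypal.com")) == 0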

    Aside: Is it possible to use Unicode punctuation in the class 5 attacks, e.g. secure.paypal.com vs. secureִpaypal.com?

  8. Since the glyphs don’t have any explicit meaning to the user, they will be ignored. Just like status bar icons for unused features, users will totally ignore things that don’t have any obvious use.

    But I think a user might notice when they change, even if they don’t remember them to the extent that if you asked them in the street “what are your bank’s glyphs?”, they wouldn’t remember.

    “Saft”, a browser add-on for Apple’s Safari, now has a feature that detects possible IDN spoofs

    How does it detect them? Or does it put up that pop-up for all IDN domains? If so, that’s highly discriminatory.

    Probably the browser should implement what the registration companies should have done beforehand by rejecting registrations of mixed-character-set domain names.

    How do you define “mixed character set”? Each language in the world does not use a unique set of characters – there’s a lot of overlap and mixing.

    If the history is long-lived, you could marry the domain-hashed icon scheme with the user-selected icon scheme – the browser picks a default icon, which the user can then override.

    But if the user overrides it, then the icon is not the same as the one in the bank’s publicity, and won’t be the same as the one in any other browser he uses, and anyone else using his browser would also not see what they expect.

    It would also mean malicious people could e.g. change the icon for internet cafe or public access browsers to disguise phishing attempts. Best not to have it changeable.

    Greg’s concern over changing domains, e.g. server1 vs. server2, could be addressed by only using the name up to the second- or third-level domain names in .com/.net/.org or country-code domains respectively.

    That’s a non-trivial problem; registrars all have different rules for how they allocate domains (straight under the top, have a set of second levels, or both), and they change them sometimes.

    So, a ‘trust’ list compiled and managed by the user himself for sites already visited and trusted
    (after checking the SSL certificates, etc.) is indeed a good way to go.

    Er, that’s not at all what I’m recommending. It would be far too much work for the user.

    If a user knows only, say, three different languages, characters not used in any of these languages should be treated as “suspicious”.

    That’s very discriminatory. Just because a user doesn’t know Arabic, why should they not buy things from an Arabic domain which has English pages and offers international shipping?

  9. ‘How do you define “mixed character set”?’

    The same way the non-enforced rules for the registries would have done. Languages don’t mix character sets inside one single word. You don’t need Cyrillic and ASCII characters in the same hostname ([cyrillic].[us-ascii] would be allowed). The same goes for kanji and Greek letters. The only problem is the range of different possible character sets. But it is still a solvable problem.
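
    For instance, a crude per-label check along these lines (abusing the first word of the Unicode character name as a script tag):

        import unicodedata

        def scripts_in_label(label):
            # "CYRILLIC SMALL LETTER A" -> "CYRILLIC", and so on.
            return {unicodedata.name(ch, "UNKNOWN").split()[0]
                    for ch in label if ch.isalpha()}

        def mixed_script_warning(hostname):
            # Mixed scripts *within* one label trigger the warning;
            # [cyrillic].[us-ascii] as a whole remains acceptable.
            return any(len(scripts_in_label(label)) > 1
                       for label in hostname.split("."))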

  10. Gerv,

    I think you overestimate people’s ability to remember the shapes. Humans aren’t that good. And besides, the phishing site can confuse them by saying “make sure these shapes are shown” (their shapes, not those of the original site).

    I think the ‘new site’ idea is better.

    Another thing to consider is showing the SSL info more prominently, and maybe also the whois information. If that changed from “PayPal, Inc” to “www.xn--pypal-4ve.com”, they might notice. Certs should not use punycode (see bug 228524).

  11. Because you’ve titled the article ‘Phishing – Browser-based Defences’ and you’ve listed ‘Phishing URL Classes’, it seems odd to omit vulnerabilities of the http://www.paypal.com/&*&*(&):%$@badsite.com type from the list. It would be better to mention them and explain how Firefox deals with them.

  12. I had much the same idea, and was using it for anti-phishing code on a website. On the first page you were asked for your username, and on the second page you were asked for your password.

    However, on the second page the layout, colours, icons etc. were all derived from a hash of the username and a secret that only the website knew. Thus if a phisher tried to impersonate the site, they would not be able to correctly impersonate the password page.

    Colour seems to work well for people: they instantly recognise if the page has different colours than it did previously, even if the difference is subtle. People who are colour-blind can usually distinguish enough of a difference to make impersonation difficult. It does, however, cause problems for people who are blind. I never spoke to a blind user of our site, so I never discussed what we could do; one suggestion was to use the W3C aural CSS markup to customise the pitch/speed etc. of the reading of the page.

    With a browser it can be tied to the domain name as you mention with great effect. I’d suggest looking at using colour as an additional feature to match the symbols you suggest. Colour allows for a lot more data to be encoded from the hash (24 bits vs about 10 bits at most for symbols) and can be combined with the symbols for a total of maybe 34 bits worth of hash.
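
    A sketch of how the derivation might look (SHA-256 and this particular symbol set are just my assumptions):

        import hashlib

        SYMBOLS = "\u2600\u2601\u2605\u2660\u2663\u2665\u2666\u266b"  # 8 neutral glyphs

        def page_fingerprint(username, site_secret):
            h = hashlib.sha256((site_secret + ":" + username).encode()).digest()
            colour = "#%02x%02x%02x" % (h[0], h[1], h[2])   # 24 bits of colour
            glyphs = SYMBOLS[h[3] % 8] + SYMBOLS[h[4] % 8]  # 6 more bits of symbols
            return colour, glyphs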

  13. Re mixed character sets – I’m sure there are legitimate Japanese (and probably other) companies which combine Japanese and Western text within their name. I don’t think prohibiting this is a good idea.

    I agree that most of the suggested letter-colour-coding ideas are stupid, as is turning off IDN. I don’t think the ‘hashing to fancy symbol’ idea is really a good one, though. Sorry, but it just seems unnecessarily complicated. On the other hand, the ‘have you been here before’ idea for SSL sites (along with method of hashing permanent history) is definitely excellent. I’d like to see this method implemented.

    Re type 5 attacks – somebody already suggested a list of homographs. I don’t think it would be an impossible task to find all homographs for ASCII lower-case letters, digits, and hyphen. (Could possibly even be done automatically via bitmap comparisons at a low pixel size, allowing for minor variation…)

    These are the only characters commonly used in domains. The browser defense could then work as follows:

    1) Map any Unicode characters that are similar to or indistinguishable from those ASCII characters (e.g. Cyrillic, mathematical letters) into ASCII

    2) Look up the resulting (ASCII-version) domain (DNS)

    3) If the ASCII-version domain exists, show a warning (possibly a popup box that offers to redirect to the ‘real’ site – while popup boxes are annoying, it’s possible to miss the yellow bar and if there are essentially zero false positives this would be acceptable. Might need to include a ‘Don’t warn again for this domain’ option in case there ever is a false +ve.)

    Restrictions/mappings for other languages/characters might need to be implemented differently but these are currently a much less urgent problem (as few people use them).
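
    A minimal sketch of steps 1–3, assuming a toy homograph table and using a plain DNS lookup for step 2:

        import socket

        TO_ASCII = {"\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p"}  # toy table

        def check(domain):
            flat = "".join(TO_ASCII.get(ch, ch) for ch in domain)   # step 1
            if flat == domain:
                return None                 # already plain ASCII; nothing to compare
            try:
                socket.gethostbyname(flat)  # step 2: does the ASCII twin exist?
            except socket.gaierror:
                return None
            return flat                     # step 3: warn, offering to redirect here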

    –sam

  14. Gerv,

    That’s very discriminatory. Just because a user doesn’t know Arabic, why should they not buy things from an Arabic domain which has English pages and offers international shipping?

    I don’t mean that they should be blocked from domains containing Arabic letters. The domain name should just be displayed in the raw xn--xxxxx form instead of in Arabic letters.

    If the user doesn’t know Arabic, neither the raw form nor the original form using Arabic letters will make much sense to the user, so the browser might as well show the one that is least likely to cause confusion.

  15. Christian – As IDN catches on you may find more sites use it legitimately, e.g. http://www.ToysЯUs.com, so spotting mixed character sets will inevitably lead to false positives.

    Gerv – While I originally advocated the domain hashed icon idea, on reflection I don’t think it would be useful. In fact, the icon itself is open to spoofing by the phishers, as pointed out by mvl above, which would convey a false sense of security for the phishing site. So it could even be considered harmful.

    The “new site” idea is good, but it still suffers from a lack of visibility. However, you don’t want the browser to interrupt you with a dialogue or info-bar for each new SSL site, otherwise users will just get used to clicking through it and its value will be greatly reduced.

    Ideally it would give a warning only if the site was similar to a previously visited site. This is where a full domain history, homograph folding and Levenschtein distance (mentioned above) would come in.

    To implement this it would be necessary to store the plain text of the domain, though. Perhaps the history could be encrypted on disk, as is the passwords file, instead of hashed. (If an intruder has access to your filesystem, you probably have a whole lot more to be worried about anyway.)

    I still believe the bottom-level domain (server1/server2/etc.) problem is solvable and worthwhile. You could get large initial coverage, and therefore immediate user benefit, from supporting the .com/.net/.org (“gTLD”?) case. The rest of the problem is bounded (about 260 country-code TLDs), well defined, and slow changing.

    A similar approach should be applied when creating the homograph folding algorithm. The Unicode code charts – http://www.unicode.org/charts/ – already provide some useful information on related characters. Incomplete does not equate to worthless in either of these cases.

    Perry – I really like your idea, but given that an attacker can effectively inspect a given user’s hash value it is open to a brute force attack. That said, it’s still an interesting idea!

    Sam – The only problem with your approach is that it assumes that the ASCII-flattened domain name is the official name, but the reverse could be true, as per my very first point.

  16. I have a small question about the Shmoo Group punycode demo.

    When one tries to view the source of the page, one obtains the “real” domain name. I haven’t yet had time to read through your paper, but I wondered whether this had been taken into account.

  17. Putting together Unicode characters from the Miscellaneous Symbols section can result in many potentially offensive or humorous combinations and ideological statements.

    Male + Crown
    Female + Warning :-)
    Snowman + Sun
    Beverage + Recyclable
    Cross + Peace
    Hammer&Sickle + Frown
    Star&Crescent + Biohazard
    Star&Crescent + Airplane

  18. Christian – As IDN catches on you may find more sites use it legitimately, e.g. http://www.ToysЯUs.com, so spotting mixed character sets will inevitably lead to false positives.

    I don’t agree that this is a legitimate use. It is almost l33t sp34k, i.e. an attempt to take an “artistic” advantage of the visual resemblance of some characters whose semantics are quite different.

    Some people may think that this is kewl, but personally I prefer to turn off this kind of coolness, rather than allowing all kinds of similar-looking domain names.

    I don’t claim that my suggestion is a silver bullet, but I disagree that it is discriminatory (except against creative marketing people :-).

  19. Is it possible to use Unicode punctuation in the class 5 attacks, e.g. secure.paypal.com vs. secureִpaypal.com?

    Unless the registrars get their act together, then sure, why not?

    The same way the non-enforced rules for the registries would have done. Languages don’t mix character sets inside one single word.

    As I understand it, the rules specified that homographic sets would be treated as a block. They did not specify that one particular combination (e.g. the non-mixed one) would be the only one allowed. So under the guidelines, you could legitimately have the mixed character set form http://www.ToysЯUs.com as your main domain, but http://www.ToysRUs.com would go along with it as part of the block. (Leave aside the suitability of that particular example for a moment; it’s just an example.)

    The “new site” idea is good, but it still suffers from a lack of visibility.

    It’s in the most visible place for security UI! If users are not checking the bottom right corner before making a transaction, then we’ve already lost – because they don’t even know if they are using SSL, or whether they are on the correct site.

    There has to be a small amount of user training – and it is “look in the bottom right corner”.

    superyooser: I did say “carefully chosen”. The political and religious symbols would not be used, and probably not the male/female ones either. There are plenty of neutral ones.

    an attempt to take an “artistic” advantage of the visual resemblance of some characters whose semantics are quite different.

    I’m sure there are plenty of legitimate uses for mixing characters from different areas of the Unicode standard.

    Andrew mentioned the “Levenschtein distance”. My worry is that while it’s a simple algorithm for determining a distance between two strings, I haven’t seen evidence that it correctly models the human perception of the distance between two strings.

  20. I’m sure there are plenty of legitimate uses for mixing characters from different areas of the Unicode standard.

    Could you mention some? Note that it is not a problem if the characters are among those the user has allowed (e.g. I would probably allow most characters from ISO-8859-1). There is only a problem if the domain contains characters that the user has specified that he is not familiar with.

  21. There is only a problem if the domain contains characters that the user has specified that he is not familiar with.

    I think it’s wildly optimistic to require users to specify what character sets they are familiar with. Most won’t have a clue what to do, no matter how you present the UI.

    And what would the setting be on a public access browser, like in an Internet cafe?

  22. But I think a user might notice when they change, even if they don’t remember them to the extent that if you asked them in the street “what are your bank’s glyphs?”, they wouldn’t remember.

    Even passively, I still don’t think average users (e.g. Blake’s mother) will notice a change in their banks’ glyphs unless they are very prominent: huge and ugly. Even I don’t think I would. It’s just “that garbage that’s always there when I check my bank balance.” Since we don’t want to introduce garish glyphs into the Firefox interface and non-garish glyphs will be ineffective, I really don’t think the glyph method has much usefulness.

  23. Gerv – I’m going to run with my atrocious example above (sorry) to try and clarify how I think the Levenschtein distance can be used effectively:

    Assume we have “www.ToysЯUs.com” that is legitimate and previously visited and “www.ToysRUs.com” that is phishy and new.

    Before the domains are compared, we flatten the homographs down. The legitimate domain becomes “www.toysrus.com” and the phishy domain becomes “www.toysrus.com”. I’m assuming here that we consider upper and lower case letters as “homographs”, and similarly the characters “R” and “Я”.

    This is the part of the process that takes into account human perception of the characters. Only after the homograph flattening would the Levenschtein distance be calculated.

    As with the example above, all class 5 attacks would have a homograph-flattened Levenschtein distance (“HFLD”?!) of zero. If you extended your definition of “homograph” to include similar-looking characters, such as [0oO] or [1iIl|], you would catch all class 4 attacks as well.

    Furthermore, you could consider flattening homographic digraphs. For example, “rn” looks a lot like “m”.
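
    The digraph pass could simply rewrite the string before the character-level fold; a tiny sketch (pairs purely illustrative):

        DIGRAPHS = {"rn": "m", "vv": "w", "cl": "d"}

        def fold_digraphs(s):
            for pair, single in DIGRAPHS.items():
                s = s.replace(pair, single)
            return s

        # fold_digraphs("www.rnybank.com") -> "www.mybank.com"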

  24. Andrew: you missed the point of my question. I understand how Levenschtein distances work; I just don’t think it’s yet been proved that Levenschtein distances are a good model for human perception of how close two domains are.

    Ignore the homograph factor for a moment. For some humans, http://www.indipendant.com might be a lot closer to http://www.independent.com (given that you can’t compare the two side-by-side) than http://www.indpendent.com, even though it has a longer Levenschtein distance.

  25. Okay, I see your point. I was (am) attacking the problem from the direction of flattening/hashing down the strings until you can “safely” calculate the Levenschtein distance.

    A quick Google for “word recognition” turns up this really interesting looking paper:

    http://www.microsoft.com/typography/ctfonts/WordRecognition.aspx

    I haven’t had time to read it yet.

    I still think the flattening/hashing method is worth pursuing; it may be necessary to group characters into broader “shape” classes, e.g. [aceo] and [gpq].

    P.S. We’ve been misspelling “Levenshtein” as “Levenschtein” – my fault!

  26. Is there any use in doing a google search for related links to the domain name? For example, searching for related:www.paypal.com brings up many sites, whereas searching for related:www.paypa1.com brings up none.

  27. James: interesting idea. The problem is that every SSL domain you visit would then be reported to Google. People tend not to like that sort of thing.

    Gerv

  28. A fourth (longer term) approach is to harness the distributed discretionary power of social networks to distinguish between legitimate and malicious sites.

    Currently, orkut tells me I am connected to over four million people through about 50 friends, while Friendster tells me there are almost 9000 people three or fewer degrees of separation away from me. If these members of my social network identified legitimate sites (or if such information could be derived from their browsing practices), and these identifications were made available to my browser, then my browser would have highly accurate and trusted information on the vast majority of sites I visit.

  29. myk: That, like many solutions, would require sending each domain accessed to some sort of central server. A lot of people don’t like that sort of thing…

  30. Not necessarily. The database of domain names your friends know about, and what they think about those domain names, is small, even for a large set of domains, and could be downloaded wholesale to users’ computers. So users could check on domains locally without having to submit each domain to a central server.
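
    A sketch of that local check (the file name and one-domain-per-line format are assumptions):

        def load_vouched(path="social-trust-db.txt"):
            # The whole database is fetched periodically and read from disk,
            # so visited domains are never reported to a central server.
            with open(path) as f:
                return {line.strip().lower() for line in f if line.strip()}

        def locally_vouched(domain, vouched):
            return domain.lower() in vouched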