Charset confusion

We’ve been discussing how to fix Bugzilla’s charset problems.

At the moment, Bugzilla doesn’t send a charset header on its pages. Although this means they don’t validate, it can be a good thing, because if a Bugzilla is used solely by people with one default charset, everything just about works. In fact, if people with different charsets comment on different bugs, things still just about work – because of browser charset auto-detection. b.m.o. is an example of the latter case. However, multiple different charsets on the same bug usually lead to a nasty mess and unreadable comments.

What you really want is for everyone to be able to discuss the same bug without a problem. And that basically means using UTF-8. This is fine for new installations, with recent versions of Perl and MySQL. However, how do you write code which upgrades a Bugzilla and converts all the comments to UTF-8, if you don’t know what charset they were in, in the first place? :-|
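
To make the difficulty concrete, here is a minimal Python sketch of the only conversion that is close to safe: accept a comment if it is pure ASCII or already decodes as strict UTF-8, and refuse to guess otherwise. This is not Bugzilla code; the function name `to_utf8_if_certain` and the status strings are made up for illustration. Note that even "valid UTF-8" is only very likely, not guaranteed, to be the author's intended charset.

```python
def to_utf8_if_certain(raw: bytes):
    """Return (text, status) for a stored comment.

    Hypothetical helper: converts only when the bytes are pure ASCII
    or already well-formed UTF-8; anything else is left for a human.
    """
    try:
        # ASCII bytes mean the same thing in nearly every legacy charset.
        return raw.decode("ascii"), "ascii"
    except UnicodeDecodeError:
        pass
    try:
        # Strict decoding rejects malformed sequences.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Could be Latin-1, KOI8-R, EUC-KR, ...: the bytes alone do not say.
        return None, "unknown"


# The same byte is a perfectly legal character in several charsets,
# which is why the "unknown" case cannot be resolved automatically.
ambiguous = b"\xe9"
as_latin1 = ambiguous.decode("latin-1")
as_koi8r = ambiguous.decode("koi8_r")
```

Everything that comes back as "unknown" is exactly the part of the database this post is asking about.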

Magic solutions for this problem would be welcome… (Read the bug first, though.)

Gerv

7 thoughts on “Charset confusion”

  1. Yuck.

    I didn’t read (all of) the comments, but it’s sounding like the best solution is some automated absolutely-certain conversion (if there’s such a thing, I’m pretty sure some encodings are unique enough to do this) and then per-bug user triaging. For little databases this wouldn’t really matter, but for a b.m.o-sized database it’s nearly impossible.

    Hope you get this figured out — somehow. I suppose now’s an opportune time to lament *not* creating Bugzilla with a specified encoding, eh?

  2. Could you check possible encodings against a dictionary program, to see if they’re likely to be human readable?

    The best “magic solution” in the comments seemed to be automatic encoding detection, plus allowing users to flag badly encoded comments. If flagged comments could be re-encoded instantly by the server and sent back out, that would provide a minimum of user pain.

  3. 1) Check the browser: if it is not NN4, send
     <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    in the head of the page, and have all forms submit in UTF-8;
    if it is NN4, use Latin-1.
    2) Use ‘BINARY’ columns in MySQL.

  4. Yegor: that’s not the problem :-) We know how to send UTF-8; what we are trying to work out is how to convert all data of unknown charsets into UTF-8…

    Gerv

  5. Complicated problems may have simple solutions :) :

    Old messages should be stored as is, with a user option to choose the encoding of the old messages.
    Old messages should be automatically converted once the user has chosen the old encoding for the current login.

    All the new messages should be stored in UTF-8.

    ?

  6. Yegor: unfortunately, Bugzilla displays lots of comments on a single page, and so they all have to be in the same charset, otherwise they won’t display properly. So we can’t just leave the old ones as-is.

    Converting based on login might be an interesting idea, though… “Reconvert all my comments assuming they are EUC-KR”, sort of thing. Hmm…

    Gerv
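
The per-login idea from the last comment ("Reconvert all my comments assuming they are EUC-KR") could be sketched roughly as below. This is only an illustration under assumed data: the comment records are plain dicts with hypothetical `author` and `raw` keys, not Bugzilla's actual schema, and `reconvert_user_comments` is an invented name. Comments that already decode as UTF-8, or that fail to decode in the charset the user chose, are left untouched.

```python
def reconvert_user_comments(comments, author, assumed_charset):
    """Return a new list with one user's comments re-encoded as UTF-8.

    comments: list of dicts with hypothetical keys "author" and "raw" (bytes).
    assumed_charset: the legacy charset the user says they were using.
    """
    converted = []
    for comment in comments:
        raw = comment["raw"]
        if comment["author"] == author:
            try:
                raw.decode("utf-8")  # already valid UTF-8: keep as is
            except UnicodeDecodeError:
                try:
                    raw = raw.decode(assumed_charset).encode("utf-8")
                except UnicodeDecodeError:
                    pass  # the user's guess does not fit; leave the bytes alone
        converted.append({**comment, "raw": raw})
    return converted


comments = [
    {"author": "kim", "raw": "한글".encode("euc_kr")},
    {"author": "gerv", "raw": b"caf\xe9"},
]
result = reconvert_user_comments(comments, "kim", "euc_kr")
```

Only the first record is rewritten here; the second belongs to a different login and keeps its original bytes.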