Statistical Analysis on 2

I just spent an hour or two trying to do some statistical analysis on, as I suggested back in December. My approach was to extract the words with a very naive “\b\w+\b” Perl regexp and count the chains of consecutive words that turn up most often. However, the results weren’t particularly good :-( They get skewed by a few repetitive spams, and things like that. Make the chains too short, and you don’t find out much. Make them too long, and you fall foul of the fact that different people express themselves in different ways. In a message base of the first 6600-odd messages, I found the following:

Keep up the good work: 72
I would like to see: 28
it would be nice: 57
I would like to: 89
I don’t like: 47
in the address bar: 35
I really like the: 23

Not exactly earth-shattering, although very nice. Does anyone have any better ideas for uses for the Hendrix data in the newsgroup? Even if we never look at it, it serves the purpose of diverting non-bug-reports and making people feel like they’ve made a contribution. But it would be nice if we could get more out of it.
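For anyone who wants to try the same kind of count, here is a minimal sketch in Python rather than the Perl I actually used. The function name `count_chains`, the `dedupe` option and the sample messages are all made up for illustration. Dropping exact duplicate messages is just one assumed way of blunting the repetitive-spam problem; it is not something my script did.

```python
import re
from collections import Counter

# Same naive word pattern as the Perl regexp mentioned above.
WORD_RE = re.compile(r"\b\w+\b")

def count_chains(messages, n, dedupe=True):
    """Count every run of n consecutive words across all messages.

    Hypothetical helper: with dedupe=True, exact duplicate messages are
    counted once, so a spam posted many times cannot dominate the totals.
    """
    if dedupe:
        messages = set(messages)
    counts = Counter()
    for text in messages:
        words = WORD_RE.findall(text.lower())
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

# Invented sample data, standing in for the real message base.
sample = [
    "Keep up the good work, I really like the tabs.",
    "I would like to see a faster start. Keep up the good work!",
    "BUY NOW BUY NOW",
    "BUY NOW BUY NOW",
]

chains = count_chains(sample, 5)
for phrase, count in chains.most_common(3):
    print(f"{phrase}: {count}")
```

Note that chains run straight across sentence boundaries here, which is part of why longer chains get noisy.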

5 thoughts on “Statistical Analysis on 2”

  1. I know that mconnor (masochist that he is) actually reads all of Hendrix. Skims, really, but he does let every post go over his retina. Brave man. At the end of the day, what it gives him is a good aggregate indication of what people are thinking.

    This is, IMO, a bit of a research problem involving natural language processing and keyword searching. I’m sure that there are people looking into these issues at schools and research institutes; maybe we can get them to plug into things?

    The other way to do this is to mine the message group for things we know we’re interested in, much like you did here. Keywords like “search”, “problem”, “can’t”, “better”, “suggestion” could be used to group and sift the messages, making them easier to digest.

    Great idea trying to get more out of this, though. Similarly, I’d like to make it easier for people to submit to Hendrix from within the product.

  2. How about having something that picks a random message from the last month (or so) and displays the date of the submission and the first few sentences at the top of each Bugzilla page, with a link to the message? It could perhaps also keep track, on a per-Bugzilla-user basis, of which messages they’ve been shown, so that if someone thinks “hey, that Hendrix snippet I got shown three days ago actually seems worth following up on”, they can find it.

    The people who could make best use of that feedback are, after all, hitting Bugzilla an awful lot…


  3. I have tried to respond to a few of them and had problems getting the addresses right. The messages bounced.

    It might be helpful to have a simple web app that lists the messages and lets you click one, type a reply in a field, and have the app do the mailing for you.

    Would this be useful?

  4. You could try KEA, which extracts key phrases from documents and is likely to do a better job than your simple analysis.

    Another idea would be to apply the Cosine measure to cluster similar documents into groups and see what kind of groups you get (and how large each group is).

  5. Stuart: That sounds like a great way to distract people from their workflow ;-P

    Ray: such an app already exists. It’s called a newsreader :-)

    Perry: Thanks for the tip. It looks like KEA requires you to train it on some documents from which you have already extracted key phrases. This doesn’t appeal to my lazy side :-)
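As for the cosine measure suggested in comment 4, a rough sketch of how it might be tried is below. Everything in it is an assumption for illustration: plain word counts as the document vectors, a greedy single-pass grouping against the first message of each group, an arbitrary threshold of 0.5, and invented sample messages. A real attempt would probably want term weighting and a proper clustering algorithm.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Turn a message into a bag-of-words vector (word -> count)."""
    return Counter(re.findall(r"\b\w+\b", text.lower()))

def cosine(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster(messages, threshold=0.5):
    """Greedy grouping: each message joins the first group whose seed
    (first) message is at least `threshold` similar, else starts a new one."""
    groups = []  # list of (seed_vector, [member messages])
    for text in messages:
        vec = vectorize(text)
        for seed, members in groups:
            if cosine(seed, vec) >= threshold:
                members.append(text)
                break
        else:
            groups.append((vec, [text]))
    return [members for _, members in groups]

# Invented sample data.
sample = [
    "the address bar is slow",
    "address bar is too slow",
    "I love the new tabs",
]

groups = cluster(sample)
for members in sorted(groups, key=len, reverse=True):
    print(len(members), members)
```

Sorting the groups by size gives a quick view of which topics come up most, which is the "how large each group is" part of the suggestion.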