Statistical Analysis on

The newsgroup, which takes the output of Hendrix, now has 23,000 posts in it, and acquires more at the rate of hundreds per day. While we don’t have the resources to follow up each one individually, I think we might find some interesting things by doing analysis on the aggregate data. This would give us some idea of trends.

Perl (and, I’m sure, other languages) has appropriate modules for getting the data: Net::NNTP to download the messages, and News::Archive to persist them to disk. One could then read them in, filter them by product (there’s a header for that) and do one-, two- and three-word frequency analysis to try to extract some common themes.
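The filter-then-count idea could be sketched as follows. This is shown in Python rather than Perl for brevity; the header name `Product` and the sample posts are assumptions for illustration, since the post doesn’t name the actual header.

```python
import re
from collections import Counter
from email import message_from_string

def product_of(raw_message):
    """Read the product header (name assumed to be 'Product')."""
    return message_from_string(raw_message)["Product"]

def ngram_counts(bodies, n):
    """Count n-word phrases (n = 1, 2 or 3) across message bodies."""
    counts = Counter()
    for body in bodies:
        words = re.findall(r"[a-z']+", body.lower())
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

# Hypothetical sample posts: headers, blank line, body.
posts = [
    "Product: Firefox\n\nThe back button is broken.",
    "Product: Firefox\n\nThe back button works for me.",
    "Product: Thunderbird\n\nSearch is slow.",
]

firefox_bodies = []
for p in posts:
    msg = message_from_string(p)
    if msg["Product"] == "Firefox":
        firefox_bodies.append(msg.get_payload())

print(ngram_counts(firefox_bodies, 2).most_common(3))
```

Running the same corpus through n = 1, 2 and 3 and eyeballing the top of each list is probably enough to surface recurring themes without any heavier text-mining machinery.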

Anyone feel like this would be a fun project for an hour or two? :-)

5 thoughts on “Statistical Analysis on

  1. Daniel: I might be interested if it were free software and ran on Linux :-) However, sadly it seems it is neither.

    Also, I think I need something a bit more specialised that understands the structure of newsgroup messages, so I can filter them to work with a subset.

  2. Yea, that does sound like a good lark…

    lemme think on it overnight, and see what I can come up with in the morning.

  3. Gerv: I spent an hour or so on this and got it started. It downloads the messages to disk and creates a database. I didn’t have time to get around to the real meat of the task, so the only information in the database so far is the arrival dates of the messages.

    I ended up using the perl modules you mentioned; there were other possibilities, but these do look like the best choices. (I started to do this in emacs lisp, but I figured you’d want to run it from a cron job.) The script is kinda slow, about 50-100 messages a minute by my estimate, but I suppose that’s adequate. We’ll see what happens when it actually does some real work on the messages :)

    This could easily go into the Mozilla CVS, but for now it’s at . Let’s use email for further communication; the comments field on a weblog is suboptimal.
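The arrival-date database described in the last comment could be sketched like this, assuming SQLite; the table and column names are hypothetical, and a real script would use a file rather than an in-memory database.

```python
import sqlite3
from email.utils import parsedate_to_datetime

# In-memory database for the sketch; the real script would persist to a file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (msgid TEXT PRIMARY KEY, arrived TEXT)")

def record_arrival(msgid, date_header):
    """Store a message's arrival date, keyed by its Message-ID header."""
    arrived = parsedate_to_datetime(date_header).isoformat()
    conn.execute("INSERT OR IGNORE INTO messages VALUES (?, ?)",
                 (msgid, arrived))

record_arrival("<abc@example.com>", "Mon, 06 Feb 2006 10:00:00 +0000")
print(conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0])
```

Keying on Message-ID and using `INSERT OR IGNORE` means the script can be re-run from cron without duplicating rows it has already recorded.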