This is an excerpt from an email from Ken, a friend who does term extraction and text analysis professionally, regarding text analysis on SickCity tweets to separate the signal from the noise.
In response to my request for links where I could read up on this, Ken writes:
It's a bit of an art; there is no single recipe for it. OK, here are some details...
One way to go is to use a package that automatically does all the massaging work, for example MALLET. It's a nice package with a few good algorithms, and it should get you started quickly. The only trouble is that it doesn't include one of the most powerful algorithms: support vector machines (SVMs).
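To make the "package does the massaging" idea concrete, here is a minimal sketch in Python using scikit-learn rather than MALLET (my substitution, since MALLET is a Java tool); the example texts and labels are invented, and LogisticRegression stands in for the kind of MaxEnt classifier MALLET ships with:

```python
# A minimal sketch of the "let the package do the massaging" approach,
# using scikit-learn in place of MALLET. Texts and labels are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["home sick with a fever today", "great day at the beach"]
labels = [1, 0]  # 1 = sick-related, 0 = noise

# The vectorizer strips punctuation, tokenizes, drops stopwords, and
# applies TF-IDF weighting; the classifier then trains on the result.
model = make_pipeline(TfidfVectorizer(stop_words="english"),
                      LogisticRegression())
model.fit(texts, labels)
print(model.predict(["feeling sick and feverish"]))  # expected: [1]
```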
Most packages require that you do the massaging yourself. Places to read more:
* The LingPipe website has a useful tutorial on text classification
* A few books are helpful, e.g., "Web Data Mining" by Liu and "Foundations of Statistical Natural Language Processing" by Manning and Schütze
* I think there are a couple of free tutorials on SVMs for text classification on the web. One of the libraries I used (LIBSVM) had a decent tutorial.
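Since Ken mentions LIBSVM, here is a small sketch of its official Python bindings (`pip install libsvm-official`) in action; the sparse vectors are invented toy data in the index:value style described in the steps below, and the parameter choices are purely illustrative:

```python
# Toy LIBSVM run: each document is a sparse {term_index: weight} dict,
# the same index:value idea described in the preprocessing steps below.
from libsvm.svmutil import svm_train, svm_predict

y = [1, 1, 0, 0]                      # labels
x = [{1: 0.7, 4: 0.2}, {1: 0.5, 3: 0.1},
     {2: 0.9, 5: 0.4}, {2: 0.6, 6: 0.3}]

model = svm_train(y, x, "-t 0 -c 1")  # linear kernel, C = 1
p_labels, p_acc, p_vals = svm_predict([1], [{1: 0.6, 4: 0.1}], model)
print(p_labels)                       # expected: [1.0]
```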
Roughly, text preprocessing involves the following (a sketch of these steps follows the list):
- Begin with a set of positive and negative text examples
- For each individual text, filter out punctuation/numbers/most symbols, and tokenize into single words
- Filter out stopwords (frequent words like 'the', 'and', etc.)
- Count the frequency of every word in the corpus, and filter out highly infrequent words
- Convert each text into a sparse vector of numbers. It's generally a list of int:float pairs, where the first number is the index for a particular term, and the second is the weighted frequency of that term in the document. For term weighting, I usually use something like TF-IDF (you can read more about that on the web).
- Every machine learning package has a slightly different input format.
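As a rough illustration of the whole pipeline, here is a plain-Python sketch of the steps above; the corpus, stopword list, and minimum-frequency threshold are all placeholder choices, and the weighting uses one common TF-IDF variant, tf * log(N / df):

```python
# Plain-Python sketch of the preprocessing steps above. The corpus,
# stopword list, and frequency threshold are placeholders.
import math
import re
from collections import Counter

corpus = [
    "Stuck home sick with the flu :(",
    "Flu season is here, feeling sick",
    "Loving the sunshine today!",
]
stopwords = {"the", "and", "is", "with", "a"}  # tiny illustrative list

# Strip punctuation/numbers, lowercase, tokenize, drop stopwords.
docs = [
    [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]
    for text in corpus
]

# Count corpus-wide frequencies and drop very rare terms (here: terms
# seen only once; real thresholds depend on corpus size).
corpus_counts = Counter(w for doc in docs for w in doc)
index = {w: i + 1 for i, w in enumerate(          # 1-based term indices
    sorted(w for w, c in corpus_counts.items() if c >= 2))}

# Document frequency (how many docs contain each kept term), for IDF.
df = Counter(w for doc in docs for w in set(doc) if w in index)
n_docs = len(docs)

# Convert each document to sparse index:weight pairs, weighted by
# tf * log(N / df). A doc with no surviving terms prints an empty line.
for doc in docs:
    tf = Counter(w for w in doc if w in index)
    vec = {index[w]: tf[w] * math.log(n_docs / df[w]) for w in tf}
    print(" ".join(f"{i}:{v:.3f}" for i, v in sorted(vec.items())))
```

The printed lines are in roughly the sparse index:value layout LIBSVM expects; other packages want the same information arranged in their own format.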