CS 660 Assignment 4

Due at the beginning of class on Thursday, February 18

The point of this exercise is to play with a Naive Bayes classifier. You will write a Naive Bayes classifier to distinguish spam from non-spam. You can use Dr. Cottrell's assignment from his Stanford AI course for guidance. You may use your own mail files or his. I encourage you to spend the time putting together your own corpus, based on whatever your spam filter has collected and a set of emails you've saved.

You may work in groups of 2 or 3 if you wish. If you are looking for someone to work with, you may use the class mailing list.

My lecture was a little weak on formulas. Look up the Naive Bayes formulation (Wikipedia works here).

Basically, the classification is the value c of C that maximizes
Pr(C=c) times the product over all features i of Pr(F_i=f_i | C=c)
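
To make the argmax concrete, here is a minimal sketch of the decision rule in Python, worked in log space so the product of many small probabilities doesn't underflow. The names priors, cond_prob, and features are illustrative assumptions, not something the assignment specifies:

    import math

    def score(c, features, priors, cond_prob):
        # log of Pr(C=c) * product over i of Pr(F_i = f_i | C = c)
        #   priors[c]            -- Pr(C = c)
        #   cond_prob[(c, i, f)] -- Pr(F_i = f | C = c), f is True ("occurs") or False
        #   features             -- dict mapping feature index i to True/False for this message
        total = math.log(priors[c])
        for i, f in features.items():
            total += math.log(cond_prob[(c, i, f)])
        return total

    def classify(features, priors, cond_prob, classes=("spam", "nonspam")):
        # The argmax in the formula above: pick the class with the highest score.
        return max(classes, key=lambda c: score(c, features, priors, cond_prob))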

You will need to run a script that pulls out words and creates a table of word occurrences (each word you use is a feature) per message. You will then compute, for each feature, the probability that the feature does/doesn't occur in a spam message, and in a nonspam message. (That's four probabilities per feature; f_i in the formula above is either "occurs" or "doesn't occur", and c is either "spam" or "nonspam".)
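
As one possible starting point, here is a rough sketch of the extraction and probability step, assuming each message is a plain-text file and treating each word as a feature that either occurs or doesn't occur in a message. The add-one smoothing is my own addition (the assignment doesn't call for it); it just keeps every probability strictly between 0 and 1, so a word unseen in one class can't zero out the product:

    import re
    from collections import Counter

    def words_in(path):
        # Distinct lowercase words in one message (occurrence, not count).
        with open(path, errors="ignore") as f:
            return set(re.findall(r"[a-z']+", f.read().lower()))

    def train(spam_paths, nonspam_paths):
        # Count, for each word, how many spam and nonspam messages it occurs in.
        counts = {"spam": Counter(), "nonspam": Counter()}
        totals = {"spam": len(spam_paths), "nonspam": len(nonspam_paths)}
        for c, paths in (("spam", spam_paths), ("nonspam", nonspam_paths)):
            for p in paths:
                counts[c].update(words_in(p))
        priors = {c: totals[c] / (totals["spam"] + totals["nonspam"]) for c in totals}
        vocab = set(counts["spam"]) | set(counts["nonspam"])
        # occurs[c][w] estimates Pr(w occurs | C = c), with add-one smoothing;
        # Pr(w doesn't occur | C = c) is just 1 - occurs[c][w].
        occurs = {c: {w: (counts[c][w] + 1) / (totals[c] + 2) for w in vocab}
                  for c in counts}
        return priors, occurs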

Divide your examples into two sets, using about 2/3 of your emails for training and 1/3 for validation. Use the training corpus to compute the probabilities, and then run your classifier on the remaining emails.
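
Here is a sketch of the split and the validation run, building on the hypothetical words_in() and train() above and restating the argmax rule with the occur/doesn't-occur word probabilities; the function names and the random-shuffle split are assumptions for illustration:

    import math
    import random

    def split(paths, frac=2 / 3, seed=0):
        # Shuffle one class's messages, then cut into (training, validation).
        paths = list(paths)
        random.Random(seed).shuffle(paths)
        cut = int(round(len(paths) * frac))
        return paths[:cut], paths[cut:]

    def classify_message(path, priors, occurs):
        # The argmax rule, with f_i = "word i occurs in this message".
        msg = words_in(path)
        best, best_score = None, None
        for c in priors:
            s = math.log(priors[c])
            for w, p in occurs[c].items():
                s += math.log(p if w in msg else 1.0 - p)
            if best_score is None or s > best_score:
                best, best_score = c, s
        return best

    def accuracy(labeled_paths, priors, occurs):
        # labeled_paths: list of (path, "spam" or "nonspam") pairs from the validation set.
        hits = sum(classify_message(p, priors, occurs) == label
                   for p, label in labeled_paths)
        return hits / len(labeled_paths)

The idea is to call split() once per class, feed the two training portions to train(), and then report accuracy() on the held-out third.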

To hand in on paper:

To email me:

Never leave a Goldsmith homework until the night before it's due.