The point of this exercise is to play with a Naive Bayes classifier. You will write a naive Bayes classifier to distinguish spam from non-spam. You can use Dr. Cottrell's assignment from his Stanford AI course for guidance. You may use your own mail files, or his. I encourage you to spend the time putting together your own corpus, based on whatever your spam filter has collected and a set of emails you've saved.
You may work in groups of 2 or 3 if you wish. If you are looking for someone to work with, you may use the class mailing list.
My lecture was a little weak on formulas. Look up Naive Bayes formulations (Wikipedia works here).
Basically, the classification is the value c of C that maximizes

    Pr(C = c) * prod_i Pr(F_i = f_i | C = c)
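The decision rule above can be sketched in Python. Working in log space avoids floating-point underflow when multiplying many small probabilities; the dictionary layouts (`priors`, `cond`) are assumptions for this sketch, not a required design.

```python
import math

def classify(msg_words, priors, cond, vocab):
    """Pick the class c maximizing Pr(C=c) * prod_i Pr(F_i=f_i | C=c).

    priors: {"spam": p, "nonspam": p}
    cond:   {(word, c): Pr(word occurs | C=c)}  (hypothetical layout)
    """
    best_c, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for w in vocab:
            p = cond[(w, c)]
            # the feature value f is either "occurs" or "doesn't occur"
            score += math.log(p if w in msg_words else 1.0 - p)
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```

Note that every word in the vocabulary contributes a factor, whether or not it appears in the message, because "doesn't occur" is itself an observed feature value.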
You will need to run a script that pulls out words and creates a table of word occurrences (each word you use is a feature) per message. You will then compute, for each feature, the probability that the feature does/doesn't occur in a spam message, and in a nonspam message. (That's 4 probabilities; f in the above equation is either "occurs" or "doesn't occur", c is either "spam" or "nonspam".)
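One way to do the word extraction and probability estimation, sketched under assumptions: a crude regex tokenizer stands in for whatever word-splitting you choose, and add-one (Laplace) smoothing, which the assignment does not mandate, keeps a word seen in only one class from producing a zero probability.

```python
import re
from collections import Counter

def word_set(text):
    # crude tokenizer: lowercase alphabetic runs (an assumption, not a spec)
    return set(re.findall(r"[a-z]+", text.lower()))

def estimate(messages, labels):
    """Estimate Pr(word occurs | C=c) for each word and class.

    messages: list of raw email texts; labels: parallel list of class names.
    Pr(word doesn't occur | C=c) is just 1 minus the returned value.
    """
    n = Counter(labels)        # number of messages per class
    occ = Counter()            # (word, class) -> messages containing the word
    vocab = set()
    for text, c in zip(messages, labels):
        for w in word_set(text):
            occ[(w, c)] += 1
            vocab.add(w)
    # add-one smoothing over the two outcomes (occurs / doesn't occur)
    cond = {(w, c): (occ[(w, c)] + 1) / (n[c] + 2)
            for w in vocab for c in n}
    return cond, vocab
```

Each probability is estimated per message, not per word token: a word that appears five times in one email still counts once for that email.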
Divide your examples into two sets, using about 2/3 of your emails for training and 1/3 for validation. Use the training corpus to compute the probabilities, and then run your classifier on the remaining emails.
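The split-and-validate step might look like the following sketch; the `classify_fn` callback signature and the fixed shuffle seed are assumptions made for illustration.

```python
import random

def split_and_score(messages, labels, classify_fn, seed=0):
    """Shuffle, train on about 2/3 of the data, validate on the rest.

    classify_fn(train_msgs, train_labels, test_msg) -> predicted label
    (this signature is an assumption for the sketch).
    Returns the fraction of validation emails classified correctly.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    idx = list(range(len(messages)))
    rng.shuffle(idx)
    cut = (2 * len(idx)) // 3
    train, test = idx[:cut], idx[cut:]
    tr_m = [messages[i] for i in train]
    tr_l = [labels[i] for i in train]
    correct = sum(classify_fn(tr_m, tr_l, messages[i]) == labels[i]
                  for i in test)
    return correct / len(test)
```

Shuffling before splitting matters if your emails are stored in any systematic order (e.g. all spam first), since a straight 2/3 cut would then give a badly skewed training set.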