For background on this problem, see the lectures of the machine learning course at USTC.
This exercise is taken from HW4.
Programming Exercise: Naive Bayes
We provide you with a data set that contains spam and non-spam emails (“hw4 nb.zip”). Please use the Naive Bayes Classifier to detect the spam emails. Finish the following exercises by programming. You can use your favorite programming language.
- Remove all the tokens that contain non-alphabetic characters.
- Train the Naive Bayes Classifier on the training set according to Algorithm 1.
- Test the Naive Bayes Classifier on the test set according to Algorithm 2.
- Compute the confusion matrix, precision, recall, and F1 score. Please report your result.
Remove non-alphabetic tokens
In this section, I simply split every email on the delimiters " " and "\n" to obtain all the tokens, and then filter out the empty tokens, since they carry no meaning for our classification. Finally, I skip every token that contains non-alphabetic characters.
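The filtering step described above can be sketched as follows. This is a minimal illustration, not the original code: the function name `tokenize` is hypothetical, `str.split()` is assumed as the splitter (it also splits on tabs and drops empty strings automatically), and `str.isalpha()` is assumed as the alphabetic test (it also accepts non-ASCII letters).

```python
def tokenize(text):
    """Split an email on whitespace and keep only purely alphabetic tokens."""
    # With no argument, str.split() splits on any run of whitespace
    # (spaces, newlines, tabs) and never produces empty strings.
    tokens = text.split()
    # str.isalpha() is True only when every character in the token is a letter,
    # so tokens such as "$1000" or "here2" are dropped.
    return [tok for tok in tokens if tok.isalpha()]


if __name__ == "__main__":
    sample = "Win $1000 now\nclick here2  please"
    print(tokenize(sample))  # ['Win', 'now', 'click', 'please']
```

Whether to lowercase the tokens as well is a separate choice that the exercise does not specify, so it is left out here.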
Training and Testing
The full code is as follows:
import numpy as np
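For reference, training and testing a Naive Bayes classifier on token lists might look like the sketch below. It assumes a multinomial model with Laplace (add-one) smoothing and log-probabilities, which may differ in detail from Algorithm 1 and Algorithm 2 of the homework. The names `train_nb` and `predict_nb` and the toy data are hypothetical.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Estimate log-priors and smoothed log-likelihoods.

    docs: list of token lists; labels: list of class labels, one per document.
    """
    vocab = {tok for doc in docs for tok in doc}
    log_prior = {}
    log_likelihood = {}
    for c in sorted(set(labels)):
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(tok for doc in class_docs for tok in doc)
        total = sum(counts.values())
        # Laplace smoothing (assumed): add 1 to every count so that a word
        # unseen in one class does not get zero probability.
        log_likelihood[c] = {
            tok: math.log((counts[tok] + 1) / (total + len(vocab)))
            for tok in vocab
        }
    return log_prior, log_likelihood, vocab


def predict_nb(model, doc):
    """Return the class with the highest posterior log-score for one document."""
    log_prior, log_likelihood, vocab = model
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for tok in doc:
            if tok in vocab:  # tokens never seen in training are ignored
                score += log_likelihood[c][tok]
        scores[c] = score
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy data for illustration only, not the homework data set.
    docs = [["cheap", "pills", "buy"], ["meeting", "tomorrow"],
            ["buy", "cheap", "now"], ["project", "meeting", "notes"]]
    labels = ["spam", "ham", "spam", "ham"]
    model = train_nb(docs, labels)
    print(predict_nb(model, ["cheap", "buy"]))      # spam
    print(predict_nb(model, ["meeting", "notes"]))  # ham
```

Working in log space avoids numerical underflow when many small probabilities are multiplied over a long email.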
Results
As we can see, the results on the test set are quite good: only one spam email is classified as a normal email, and only one normal email is incorrectly classified as spam.
The overall result is as follows:
output
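The confusion matrix, precision, recall, and F1 score requested in the exercise can be computed from the true and predicted labels with a small helper such as the one below. This is a sketch under assumptions: the name `binary_metrics` is hypothetical, spam is treated as the positive class, and the labels in the example are made-up toy values, not the actual test-set results.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Compute confusion-matrix counts, precision, recall and F1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        # Rows are the true class, columns the predicted class.
        "confusion": [[tp, fn], [fp, tn]],
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Toy labels: one spam missed and one normal email flagged, as in the text,
    # but the remaining counts are invented for illustration.
    y_true = ["spam", "spam", "spam", "ham", "ham", "ham"]
    y_pred = ["spam", "spam", "ham", "ham", "ham", "spam"]
    print(binary_metrics(y_true, y_pred))
```

With one false positive and one false negative, precision and recall are equal, so the F1 score equals both of them.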