Logistic Regression for Intelligent Email Spam Detection: A Practical Approach
DOI: https://doi.org/10.9734/bpi/mcsru/v2/3819
Keywords: Spam, ROC curve, logistic, UCI data
Abstract
This paper presents an experiment on spam filtering with logistic regression, in which the filter's effectiveness depends on the characteristics of the token frequency distribution. The discussion focuses on the importance of data cleaning before model development, in particular the need to exclude inconsistent features before they enter the model. The experiment uses the UCI dataset, which gives token counts as percentages for each email. The model's discriminative performance is evaluated with an ROC curve. The UCI data provided insight into how token counts influence spam classification, and the ROC analysis gave a clear view of the model's discriminative power.
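As a rough illustration of the workflow the abstract describes, the following minimal sketch fits a logistic regression to token-frequency features and summarizes the ROC curve by its area (AUC). It is not the authors' implementation. The data are synthetic stand-ins for the UCI features, and the function names (`fit_logistic`, `roc_auc`, `make_synthetic`), the two-feature layout, and the learning-rate and epoch settings are all assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.1, epochs=300):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for one email: p(spam | x) - label
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(X, w, b):
    """Predicted spam probability for each row of X."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    spam email scores higher than a random non-spam email (ties count half)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def make_synthetic(n, rng):
    """Hypothetical stand-in for the UCI data: two token-frequency
    percentages per email, with spam (label 1) assumed to score higher."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        mean = 2.0 if label == 1 else 0.5
        X.append([max(0.0, rng.gauss(mean, 0.7)), max(0.0, rng.gauss(mean, 0.7))])
        y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = make_synthetic(300, rng)
X_test, y_test = make_synthetic(200, rng)

w, b = fit_logistic(X_train, y_train)
auc = roc_auc(y_test, predict_proba(X_test, w, b))
print(f"test AUC: {auc:.3f}")
```

An AUC near 1 means the model ranks spam above non-spam almost always, while 0.5 corresponds to chance. With real data, a library implementation and a held-out split would normally replace the hand-written fitting loop.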