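I. Objectives

(1) Learn to install and use a spam filtering system on Windows.

(2) Understand the main functional modules of a spam filtering system.

(3) Understand the principles of text content filtering.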

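II. Experiment Content

(1) Analyze and debug the main functional modules of the spam filtering program.

(2) Select an experimental dataset.

(3) Run the spam filtering system on Windows.

(4) Use the spam filtering system to run filtering experiments on the experimental dataset.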

III. Overall System Description and Per-Function Description

Overall system description

Per-function description

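1) Predicting the result

First, build the dictionaries: for each word in an email's word list, increment its count if the word is already in the dictionary; otherwise add it. Next, compute p(s|w) for the words in each file and select the 15 words with the greatest influence on classification. Then compute the Bayesian probability from those words, and finally compute the accuracy of the predictions.

Custom function used:

SpamEmailBayes()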
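The dictionary-building step described above can be sketched as follows. This is a minimal illustration rather than the lab system's own code: the function name `add_to_dict` and the sample words are hypothetical, and it assumes each email has already been tokenized into a word list with duplicates removed, so that each count is the number of emails containing the word.

```python
def add_to_dict(word_list, word_dict):
    """Count each word of one email into word_dict (modified in place)."""
    for word in word_list:
        if word in word_dict:
            word_dict[word] += 1  # word already known: increment its count
        else:
            word_dict[word] = 1   # first occurrence: add it to the dictionary
    return word_dict


# Hypothetical word lists from two spam emails (already deduplicated).
spam_dict = {}
add_to_dict(["free", "prize"], spam_dict)
add_to_dict(["prize", "click"], spam_dict)
print(spam_dict)  # {'free': 1, 'prize': 2, 'click': 1}
```

The same routine would be run once over the spam training emails and once over the normal ones, giving the two dictionaries that the later steps compare.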

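2) Computing the Bayesian probability

Given the word vector $w=(w_1,w_2,\ldots,w_n)$, compute the probability that an email containing this word vector is spam.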
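Written out, the computation can be stated as follows. The report does not state its assumptions, so two are assumed here: the words are conditionally independent, and spam and normal mail are equally likely a priori. Writing $p_i = p(s|w_i)$ for the per-word spam probability, the combined probability takes the form:

$$P(s \mid w) = \frac{\prod_{i=1}^{n} p_i}{\prod_{i=1}^{n} p_i + \prod_{i=1}^{n} (1 - p_i)}$$

This matches what `calBayes` evaluates: `ps_w` accumulates the first product and `ps_n` the second.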

Function used:

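def calBayes(self, wordList, spamdict, normdict):
    # wordList maps each selected word to its spam probability p(s|w).
    # spamdict and normdict are not used in this computation.
    ps_w = 1  # running product of p(s|w_i)
    ps_n = 1  # running product of 1 - p(s|w_i)

    for word, prob in wordList.items():
        print(word + "/" + str(prob))
        ps_w *= prob
        ps_n *= (1 - prob)
    # Combined probability that the email is spam
    p = ps_w / (ps_w + ps_n)
    # print(str(ps_w) + "////" + str(ps_n))
    return p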
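A small worked example may help check the arithmetic. The sketch below is a standalone restatement of the same combination rule, not the lab system's code; the name `combine_probabilities` and the three per-word probabilities are made up for illustration. It uses `math.prod`, which requires Python 3.8 or later.

```python
from math import prod


def combine_probabilities(word_probs):
    """Combine per-word spam probabilities p(s|w_i) into one spam score."""
    p_spam = prod(word_probs.values())                 # product of p_i
    p_norm = prod(1 - p for p in word_probs.values())  # product of (1 - p_i)
    return p_spam / (p_spam + p_norm)


# Hypothetical per-word probabilities for one email.
score = combine_probabilities({"free": 0.9, "prize": 0.8, "meeting": 0.2})
print(round(score, 4))  # 0.9
```

Here the spam-side product is 0.9 × 0.8 × 0.2 = 0.144 and the normal-side product is 0.1 × 0.2 × 0.8 = 0.016, so the score is 0.144 / 0.160 = 0.9.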

3) Computing p(s|w) for each file to obtain the 15 words with the greatest influence on classification

Function used:

def getTestWords(self, testDict, spamDict, normDict, normFilelen, spamFilelen):
    wordProbList = {}
    for word, num in testDict.items():
        inSpam = word in spamDict
        inNorm = word in normDict
        if inSpam and inNorm:
            # Share of spam files and of normal files that contain the word
            pw_s = spamDict[word] / spamFilelen
            pw_n = normDict[word] / normFilelen
        elif inSpam:
            # Word seen only in spam: use 0.01 for the normal side
            pw_s = spamDict[word] / spamFilelen
            pw_n = 0.01
        elif inNorm:
            # Word seen only in normal mail: use 0.01 for the spam side
            pw_s = 0.01
            pw_n = normDict[word] / normFilelen
        else:
            # Word in neither dictionary: set its probability to 0.4
            wordProbList.setdefault(word, 0.4)
            continue
        ps_w = pw_s / (pw_s + pw_n)
        wordProbList.setdefault(word, ps_w)
    # Keep only the 15 words with the highest p(s|w)
    top15 = sorted(wordProbList.items(), key=lambda d: d[1], reverse=True)[0:15]
    return dict(top15)
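To make the four cases concrete, here is a minimal standalone sketch with toy data. The helper name `word_spam_prob`, the word counts, and the file totals are all hypothetical. The constants 0.01 and 0.4 are the ones used in the function above, and the example keeps the top 2 words instead of 15 so the result is easy to verify by hand.

```python
def word_spam_prob(word, spam_dict, norm_dict, spam_files, norm_files):
    """Estimate p(s|w) for one word, covering the four cases described above."""
    in_spam, in_norm = word in spam_dict, word in norm_dict
    if not in_spam and not in_norm:
        return 0.4  # word in neither dictionary: fixed default
    pw_s = spam_dict[word] / spam_files if in_spam else 0.01
    pw_n = norm_dict[word] / norm_files if in_norm else 0.01
    return pw_s / (pw_s + pw_n)


# Toy counts (hypothetical): number of emails of each class containing the word.
spam_dict = {"free": 8, "prize": 5}
norm_dict = {"free": 2, "meeting": 6}

probs = {w: word_spam_prob(w, spam_dict, norm_dict, 10, 10)
         for w in ["free", "prize", "meeting", "hello"]}
top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:2]
print([w for w, _ in top])  # ['prize', 'free']
```

With 10 files per class, "free" scores 0.8 / (0.8 + 0.2) = 0.8, "prize" scores 0.5 / (0.5 + 0.01) ≈ 0.98, "meeting" scores 0.01 / (0.01 + 0.6) ≈ 0.016, and the unseen word "hello" receives the default 0.4.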