Neaya~

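Notes, records, summaries

Bayesian Classification

Abstract: Bayesian classification in machine learning, applying the bag-of-words model, and the term-frequency extraction model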

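Uses of Bayesian classification

  • Bayesian classification tends to work well for text-related analysis and classification tasks

  • Example:

    Emails: 100 in total, 70 normal and 30 spam.

    The word "办证" (roughly "get documents made", a phrase typical of spam) appears in 10 of the normal emails and in 25 of the spam emails.

    Let X be "the email contains 办证" and H be "the email is spam".

    𝑃(𝑋|𝐻) = 25/30 = 5/6

    𝑃(𝐻) = 30/100 = 3/10

    𝑃(𝑋) = 35/100 = 7/20

    By Bayes' theorem, 𝑃(𝐻|𝑋) = 𝑃(𝑋|𝐻) · 𝑃(𝐻) / 𝑃(𝑋) = (5/6 × 3/10) / (7/20) = 5/7, so an email containing "办证" is spam with probability 5/7.

  • Naive Bayes comes in three common variants:

    • Multinomial model
    • Bernoulli model
    • Gaussian model
      • The Gaussian model works well on continuous data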
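
The spam example above can be checked with exact fractions. This is a minimal sketch; the variable names are illustrative, and it assumes the counts in the example refer to the number of emails that contain the word.

```python
from fractions import Fraction

# Counts from the spam example (assumed to be numbers of emails).
total_emails = 100
spam_emails = 30
spam_with_word = 25     # spam emails containing the word
normal_with_word = 10   # normal emails containing the word

p_x_given_h = Fraction(spam_with_word, spam_emails)              # P(X|H) = 5/6
p_h = Fraction(spam_emails, total_emails)                        # P(H)   = 3/10
p_x = Fraction(spam_with_word + normal_with_word, total_emails)  # P(X)   = 7/20

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)  # 5/7
```

Using Fraction instead of floats keeps the result exact, so it can be compared directly with the hand calculation.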

Implementing naive Bayes with sklearn

"""
# @Time : 2020/8/13
# @Author : Jimou Chen
"""
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB  # the three naive Bayes models

iris = load_iris()
# No random_state is set, so every run draws a different train/test split.
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target)

# Build all three naive Bayes models and compare their results.
# sklearn's metrics expect (y_true, y_pred). The recorded output further down
# came from a run that passed the predictions first, so its matrix rows are
# predicted labels rather than true labels.

# Multinomial model
mul = MultinomialNB()
mul.fit(x_train, y_train)
mul_pred = mul.predict(x_test)
print(classification_report(y_test, mul_pred))
print(confusion_matrix(y_test, mul_pred))

# Bernoulli model
bernoulli = BernoulliNB()
bernoulli.fit(x_train, y_train)
bernoulli_pred = bernoulli.predict(x_test)
print(classification_report(y_test, bernoulli_pred))
print(confusion_matrix(y_test, bernoulli_pred))

# Gaussian model
gaussian = GaussianNB()
gaussian.fit(x_train, y_train)
gaussian_pred = gaussian.predict(x_test)
print(classification_report(y_test, gaussian_pred))
print(confusion_matrix(y_test, gaussian_pred))

D:\Anaconda\Anaconda3\python.exe D:/Appication/PyCharm/Git/MachineLearning/machine_learning/贝叶斯/iris_贝叶斯.py
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         9
           1       1.00      0.39      0.56        28
           2       0.06      1.00      0.11         1

    accuracy                           0.55        38
   macro avg       0.69      0.80      0.56        38
weighted avg       0.98      0.55      0.66        38

[[ 9  0  0]
 [ 0 11 17]
 [ 0  0  1]]
              precision    recall  f1-score   support

           0       1.00      0.24      0.38        38
           1       0.00      0.00      0.00         0
           2       0.00      0.00      0.00         0

    accuracy                           0.24        38
   macro avg       0.33      0.08      0.13        38
weighted avg       1.00      0.24      0.38        38

[[ 9 11 18]
 [ 0  0  0]
 [ 0  0  0]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         9
           1       1.00      0.85      0.92        13
           2       0.89      1.00      0.94        16

    accuracy                           0.95        38
   macro avg       0.96      0.95      0.95        38
weighted avg       0.95      0.95      0.95        38

[[ 9  0  0]
 [ 0 11  2]
 [ 0  0 16]]
D:\Anaconda\Anaconda3\lib\site-packages\sklearn\metrics\_classification.py:1221: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))

Process finished with exit code 0

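  • Over repeated runs, the Gaussian model gave the best results of the three, which fits the iris features being continuous measurements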
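
One way to make the "repeated runs" comparison less dependent on a single random split is cross-validation. The sketch below is not part of the original experiment; the choice of 5 folds and the mean_accuracy dictionary are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

iris = load_iris()

models = {
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
    "GaussianNB": GaussianNB(),
}

# Mean accuracy over 5 stratified folds for each model (5 is an arbitrary choice).
mean_accuracy = {}
for name, model in models.items():
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

BernoulliNB binarizes each feature at 0.0 by default, and every iris measurement is positive, so all samples look identical to it. That is consistent with the Bernoulli run above predicting a single class.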

Bag-of-words model

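'''Bag-of-words model'''
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)  # sparse document-term count matrix

# Vocabulary in column order (get_feature_names() was removed in scikit-learn 1.2)
print(cv.get_feature_names_out().tolist())
# One row per text, one column per word
print(cv_fit.toarray())

# Total count of each word across all texts
print(cv_fit.toarray().sum(axis=0))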
['bird', 'cat', 'dog', 'fish']
[[0 1 1 1]
 [0 2 1 0]
 [1 0 0 1]
 [1 0 0 0]]
[2 3 2 2]
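
To see where the numbers above come from, the same counts can be rebuilt by hand. This is a simplified sketch, not how CountVectorizer is implemented: it splits on whitespace only, and the names tokenized, vocabulary and counts are illustrative.

```python
from collections import Counter

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]

# Split each text on whitespace (a simplification of the real tokenizer).
tokenized = [text.lower().split() for text in texts]

# The vocabulary is the sorted set of all words seen in any text.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per text, one column per vocabulary word.
counts = []
for tokens in tokenized:
    word_counts = Counter(tokens)
    counts.append([word_counts[word] for word in vocabulary])

print(vocabulary)  # ['bird', 'cat', 'dog', 'fish']
for row in counts:
    print(row)
```

Each row records how often each vocabulary word occurs in one text. Word order is discarded, which is why the model is called a bag of words.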
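  • With the default settings, the counting only works properly on text whose words are separated by spaces, such as English; Chinese text has to be segmented into words first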
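
The limitation can be seen in a small sketch. The two sentences and their hand-made segmentation are made-up examples; in practice a Chinese word segmenter (for example jieba) would produce the spaced version.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Raw Chinese sentences: no spaces between words.
raw = ["我爱机器学习", "机器学习很有趣"]
# The same sentences segmented by hand (assumed segmentation, for illustration).
segmented = ["我 爱 机器 学习", "机器 学习 很 有趣"]

# Without spaces, each whole sentence becomes a single "word".
raw_vocab = sorted(CountVectorizer().fit(raw).vocabulary_)
# With spaces, real words are found. The default token pattern keeps only
# tokens of two or more characters, so single-character words are dropped.
seg_vocab = sorted(CountVectorizer().fit(segmented).vocabulary_)

print(raw_vocab)
print(seg_vocab)
```

To keep single-character words as well, the token_pattern parameter of CountVectorizer can be changed from its default.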

Term-frequency extraction model (TF)
