千家信息网

How to Use Naive Bayes

Published: 2025-02-05. Author: 千家信息网 editor. Last updated: 2025-02-05.

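This article explains how to use the Naive Bayes classifier. It first works through a small document-classification example by hand, then gives a Python implementation and applies it to spam filtering.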

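1. Overview

Advantages: still effective when training data is scarce; can handle multi-class problems.

Disadvantages: sensitive to how the input data is prepared.

Suitable data type: nominal data.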

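2. Principle

3. Document Classification

A, B, C, D, ... are the words that can appear in a document. Assume the whole vocabulary consists of only the four words A, B, C and D, and that there are 5 training samples. Each entry in the table is 1 if the word appears in the document and 0 otherwise.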


               A  B  C  D  Class
Document 1     0  0  1  1  0
Document 2     0  1  1  1  0
Document 3     1  0  0  1  1
Document 4     1  1  0  0  1
Document 5     1  1  1  0  1
Test document  1  0  1  0  ?

Classes: C0, C1

Test document: W

Goal: max{P(C0|W), P(C1|W)} ===> max{log[P(C0|W)], log[P(C1|W)]} (the logarithm is monotonic, so it does not change which class wins)

P(C0|W) = P(W|C0) * P(C0) / P(W)
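Written out as a decision rule, the two lines above amount to the following. This is only a restatement under the article's own definitions; the symbol \hat{c} for the predicted class is notation introduced here, and the last step assumes all probabilities are strictly positive so that the logarithm is defined.

```latex
\hat{c} = \arg\max_{c \in \{C_0, C_1\}} P(c \mid W)
        = \arg\max_{c} \frac{P(W \mid c)\,P(c)}{P(W)}
        = \arg\max_{c} P(W \mid c)\,P(c)
        = \arg\max_{c} \bigl[\log P(W \mid c) + \log P(c)\bigr]
```

P(W) is the same for every class, which is why it can be dropped from the comparison.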

P(C0) = 2 / 5 ==> 2 of the 5 training documents belong to class 0 and 3 belong to class 1, so P(C1) = 3 / 5
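As a quick check, the priors can be counted directly from the label column of the table. The sketch below is illustrative only; the names labels and class_priors are assumptions and do not appear in the article's code.

```python
from fractions import Fraction

# Class labels of the five training documents, copied from the table
# (hypothetical variable name).
labels = [0, 0, 1, 1, 1]


def class_priors(labels):
    """Return {class: P(class)} as exact fractions of the training set."""
    total = len(labels)
    return {c: Fraction(labels.count(c), total) for c in sorted(set(labels))}


priors = class_priors(labels)
print(priors[0], priors[1])  # 2/5 3/5
```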

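P(W|C0) = P(A,B,C,D|C0) ==> Naive Bayes (words are assumed conditionally independent given the class) ==> P(A|C0) * P(B|C0) * P(C|C0) * P(D|C0)

P(A|C0) = (0 + 0)/(0 + 0 + 1 + 1 + 0 + 1 + 1 + 1) = 0/5 = 0 ==> occurrences of A in class-0 documents / total number of words in class-0 documents

P(B|C0) = (0 + 1)/(0 + 0 + 1 + 1 + 0 + 1 + 1 + 1) = 1/5 ==> occurrences of B in class-0 documents / total number of words in class-0 documents

P(C|C0) = (1 + 1)/(0 + 0 + 1 + 1 + 0 + 1 + 1 + 1) = 2/5 ==> occurrences of C in class-0 documents / total number of words in class-0 documents

P(D|C0) = (1 + 1)/(0 + 0 + 1 + 1 + 0 + 1 + 1 + 1) = 2/5 ==> occurrences of D in class-0 documents / total number of words in class-0 documents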
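The same ratios can be reproduced mechanically from the table. The following is a minimal sketch with exact fractions; docs, labels and word_likelihoods are names made up for this illustration and are not part of the article's listing.

```python
from fractions import Fraction

# Rows are documents 1-5, columns are the words A, B, C, D (1 = word present).
docs = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
]
labels = [0, 0, 1, 1, 1]


def word_likelihoods(docs, labels, cls):
    """P(word | cls) = count of the word in class docs / total words in class docs."""
    rows = [d for d, y in zip(docs, labels) if y == cls]
    counts = [sum(col) for col in zip(*rows)]  # per-word counts within the class
    total = sum(counts)                        # total word count within the class
    return [Fraction(c, total) for c in counts]


p_c0 = word_likelihoods(docs, labels, 0)
p_c1 = word_likelihoods(docs, labels, 1)
print([str(p) for p in p_c0])  # ['0', '1/5', '2/5', '2/5']
print([str(p) for p in p_c1])  # ['3/7', '2/7', '1/7', '1/7']
```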

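Multiplying many small probabilities drives the product toward 0, and a single factor that is exactly 0 makes the whole product 0 ==> take the logarithm and add instead (the exact-zero case is handled in the code by initializing the counts to 1 instead of 0)

log[P(W|C0) * P(C0)] = log[P(A|C0) * P(B|C0) * P(C|C0) * P(D|C0) * P(C0)]

= log[P(A|C0)] + log[P(B|C0)] + log[P(C|C0)] + log[P(D|C0)] + log[P(C0)]

log[P(W|C1) * P(C1)] is computed in the same way.

Test sample W = (1, 0, 1, 0). Only the words that occur in W contribute, and the term log[P(W)], which is identical for both classes, is dropped:

log[P(C0|W)] = 1 * log(0/5) + 0 * log(1/5) + 1 * log(2/5) + 0 * log(2/5) + log(2/5) = -inf, because log(0/5) diverges

log[P(C1|W)] = 1 * log(3/7) + 0 * log(2/7) + 1 * log(1/7) + 0 * log(1/7) + log(1 - 2/5) = log(9/245) ≈ -3.30 (natural logarithm)

The second value is larger, so W is assigned to class C1.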
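The log(0) term for class C0 is what the smoothed initialization in the article's code (counts starting at 1, denominators at 2.0, as in trainNB1) is meant to avoid. Below is a rough, self-contained sketch of the toy example under that same initialization; the function and variable names are assumptions made for this illustration.

```python
import math

# Toy data from the table: columns are the words A, B, C, D.
docs = [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
]
labels = [0, 0, 1, 1, 1]
test_doc = [1, 0, 1, 0]


def smoothed_log_likelihoods(docs, labels, cls):
    """log P(word | cls) with counts started at 1 and the denominator at 2.0."""
    rows = [d for d, y in zip(docs, labels) if y == cls]
    counts = [1 + sum(col) for col in zip(*rows)]
    denom = 2.0 + sum(sum(r) for r in rows)
    return [math.log(c / denom) for c in counts]


def log_score(doc, docs, labels, cls):
    """log P(W | cls) + log P(cls), using only the words present in doc."""
    prior = labels.count(cls) / len(labels)
    loglik = smoothed_log_likelihoods(docs, labels, cls)
    return sum(x * l for x, l in zip(doc, loglik)) + math.log(prior)


scores = {c: log_score(test_doc, docs, labels, c) for c in (0, 1)}
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

With smoothing both scores are finite, and the comparison still favors class 1 for this test document.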

# -*- coding: UTF-8 -*-
import numpy as np

'''
1. Bernoulli model ==> ignores how often a word occurs in a document and records only
   whether it occurs; all words are assumed to carry equal weight
2. Multinomial model
'''


def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]  # 1 = abusive post, 0 = normal post (cf. pAbusive below)
    return postingList, classVec


def createVocabList(dataSet):
    # Union of all words that occur in any document
    vocaSet = set()
    for document in dataSet:
        vocaSet = vocaSet | set(document)
    return list(vocaSet)


'''
vocabList = ['','',.....]
inputSet = ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please']
'''
def setOfWords2Vec(vocabList, inputSet):
    # 0/1 vector: 1 if the vocabulary word occurs in inputSet
    returnVec = [0] * len(vocabList)

    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print('the word: %s is not in my vocabulary!' % word)
    return returnVec


'''
P(c|w) = P(w|c) * P(c) / P(w)
1. P(c)
2. P(w|c)

trainMatrix   ===> one 0/1 word vector per document
trainCategory ===> [0,0,1,1,1]  the vector of class labels
pAbusive = (0 + 0 + 1 + 1 + 1) / 5 = 3/5 ==> prior P(C1); P(C0) = 1 - pAbusive = 2/5

A    B    C    D    category
0    0    1    1    0
0    1    1    1    0
1    0    0    1    1
1    1    0    0    1
1    1    1    0    1
1    0    1    0    ?

numTrainDocs = 5 => 5 documents
numWords = 4 => 4 features

p0Num = [0,0,0,0]
p1Num = [0,0,0,0]
p0Denom = 0.0
p1Denom = 0.0
    0    0    1    1    0 ===> p0Num=[0,0,1,1]  p0Denom=2
    0    1    1    1    0 ===> p0Num=[0,1,2,2]  p0Denom=5
    1    0    0    1    1 ===> p1Num=[1,0,0,1]  p1Denom=2
    1    1    0    0    1 ===> p1Num=[2,1,0,1]  p1Denom=4
    1    1    1    0    1 ===> p1Num=[3,2,1,1]  p1Denom=7

P(C0|W) = P(W|C0) * P(C0) / P(W) = P(A,B,C,D|C0) * P(C0) / P(W)
        = P(A|C0) * P(B|C0) * P(C|C0) * P(D|C0) * P(C0) / P(W)
P(C1|W) = P(W|C1) * P(C1) / P(W) = P(A,B,C,D|C1) * P(C1) / P(W)
        = P(A|C1) * P(B|C1) * P(C|C1) * P(D|C1) * P(C1) / P(W)
P(W) ==> identical for both classes, no need to compute it

max{P(C0|W),P(C1|W)} ===> max{Log[P(C0|W)],Log[P(C1|W)]}

Log[P(C0|W)] = Log[P(A|C0)] + Log[P(B|C0)] + Log[P(C|C0)] + Log[P(D|C0)] + Log[P(C0)]
P(A|C0) = 0/(0+1+2+2) = 0/5
P(B|C0) = 1/(0+1+2+2) = 1/5
P(C|C0) = 2/(0+1+2+2) = 2/5
P(D|C0) = 2/(0+1+2+2) = 2/5

Log[P(C1|W)] = Log[P(A|C1)] + Log[P(B|C1)] + Log[P(C|C1)] + Log[P(D|C1)] + Log[P(C1)]
P(A|C1) = 3/(3+2+1+1) = 3/7
P(B|C1) = 2/(3+2+1+1) = 2/7
P(C|C1) = 1/(3+2+1+1) = 1/7
P(D|C1) = 1/(3+2+1+1) = 1/7

Test sample
1    0    1    0    ?
Log[P(C0|W)] = 1 * Log[0/5] + 0 * Log[1/5] + 1 * Log[2/5] + 0 * Log[2/5] + Log[2/5]
Log[P(C1|W)] = 1 * Log[3/7] + 0 * Log[2/7] + 1 * Log[1/7] + 0 * Log[1/7] + Log[1 - 2/5]

Note that Log[0] occurs ==> so for the initial values we set
p0Num = [1,1,1,1]
p1Num = [1,1,1,1]
p0Denom = 2.0
p1Denom = 2.0
'''
def trainNB0(trainMatrix, trainCategory):
    # Unsmoothed version: a word never seen in a class gets log(0) = -inf
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = np.zeros(numWords)
    p1Num = np.zeros(numWords)
    p0Denom = 0.0
    p1Denom = 0.0

    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vec = np.log(p1Num / p1Denom)
    p0Vec = np.log(p0Num / p0Denom)

    return p0Vec, p1Vec, pAbusive


def trainNB1(trainMatrix, trainCategory):
    # Smoothed version: counts start at 1 and denominators at 2.0, so log(0) never occurs
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0

    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vec = np.log(p1Num / p1Denom)
    p0Vec = np.log(p0Num / p0Denom)

    return p0Vec, p1Vec, pAbusive


def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # Sum the log-probabilities of the words present, then add the log prior
    p1 = np.sum(vec2Classify * p1Vec) + np.log(pClass1)
    p0 = np.sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)

    if p1 > p0:
        return 1
    else:
        return 0


def testingNB():
    listOPosts, listClasses = loadDataSet()
    myVocabList = createVocabList(listOPosts)
    trainMat = []

    for postingDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postingDoc))

    # Use the smoothed trainer: with trainNB0 the -inf entries turn into nan
    # (0 * -inf) inside classifyNB and the comparison becomes meaningless
    p0V, p1V, pAb = trainNB1(trainMat, listClasses)

    testEntry = ['love', 'my', 'dalmation']
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, 'classified as:', classifyNB(thisDoc, p0V, p1V, pAb))
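The note at the top of the listing mentions a multinomial model, but the code only builds 0/1 set-of-words vectors. One possible way to build the count-based vector that a multinomial model would use is sketched below. The helper name bagOfWords2Vec and the small vocabulary are assumptions for illustration and are not part of the original listing.

```python
def bagOfWords2Vec(vocabList, inputSet):
    """Count how many times each vocabulary word occurs in inputSet.

    Hypothetical counterpart to setOfWords2Vec: it stores counts, not 0/1 flags.
    """
    index = {word: i for i, word in enumerate(vocabList)}  # word -> column
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in index:          # words outside the vocabulary are ignored
            returnVec[index[word]] += 1
    return returnVec


vocab = ['dog', 'my', 'stupid']
print(bagOfWords2Vec(vocab, ['my', 'dog', 'my', 'cat']))  # [1, 2, 0]
```

The rest of the pipeline (training and classification) can consume these vectors unchanged, since both functions only add and multiply the entries.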

4. Filtering Spam Email

import random
import re


def textParse(bigString):
    # Split on runs of non-word characters; r'\W*' can match the empty string,
    # which on Python 3.7+ splits the text into single characters
    listOfTokens = re.split(r'\W+', bigString)
    # Lower-case the tokens and drop those of length <= 2
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]


def spamTest():
    docList = []
    classList = []

    for i in range(1, 26):
        # Read all words of the i-th spam message (label 1);
        # the directory name is kept as written in the article
        with open('emial/spam/%d.txt' % i) as f:
            wordList = textParse(f.read())
        docList.append(wordList)
        classList.append(1)

        # Read all words of the i-th ham message (label 0)
        with open('emial/ham/%d.txt' % i) as f:
            wordList = textParse(f.read())
        docList.append(wordList)
        classList.append(0)

    vocabList = createVocabList(docList)
    trainSet = list(range(50))  # a list, so that indices can be deleted
    testSet = []

    # Move 10 randomly chosen documents from the training set to the test set
    for i in range(10):
        randIndex = random.randrange(len(trainSet))
        testSet.append(trainSet[randIndex])
        del trainSet[randIndex]

    trainMat = []
    trainClasses = []

    for docIndex in trainSet:
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])

    # Smoothed trainer, so that unseen words do not produce log(0)
    p0V, p1V, pSpam = trainNB1(trainMat, trainClasses)
    errorCount = 0

    for docIndex in testSet:
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])

        if classifyNB(wordVector, p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
            print('classification error', docList[docIndex])

    # Report once, after every test document has been classified
    print('the error rate is: ', float(errorCount) / len(testSet))
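spamTest holds out 10 of the 50 messages at random and trains on the remaining 40. The same hold-out split can be written more compactly with random.sample, as in the sketch below. The function name holdout_split and its seed parameter are hypothetical and only serve this illustration.

```python
import random


def holdout_split(n_items, n_test, seed=None):
    """Split the indices 0..n_items-1 into disjoint train and test index lists."""
    rng = random.Random(seed)             # a fixed seed makes the split repeatable
    indices = list(range(n_items))
    test = rng.sample(indices, n_test)    # n_test distinct indices
    test_lookup = set(test)
    train = [i for i in indices if i not in test_lookup]
    return train, test


train, test = holdout_split(50, 10, seed=0)
print(len(train), len(test))  # 40 10
```

Because a single random split of 50 messages gives a noisy error rate, repeating the split several times and averaging the results gives a steadier estimate.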

