Applying a probabilistic algorithm to spam
filtering
Olga V. Okhlupina, Dmitry S. Murashko
Abstract— Among the common methods of combating spam, a special place is occupied by a probabilistic machine learning algorithm based on the well-known Bayes theorem. The so-called "naive" Bayesian classifier assigns a document to the class with the maximum a posteriori probability. With the development of machine learning methods, the Bayesian algorithm has not lost its relevance and remains popular for a wide range of tasks, including spam detection. The main advantages of this classifier are simplicity, fast training, fairly high accuracy, and reliability. The paper considers the problem of detecting spam messages using a probabilistic machine learning algorithm. A mathematical justification and an implementation of the Bayesian algorithm on a concrete example, with program code in the Python programming language, are given.
Keywords— Spam, filtering, probabilistic algorithm, Bayes formula, a posteriori probability, conditional probability, machine learning, class, classifier, training.
I. INTRODUCTION
The use of approaches and methods implemented in artificial intelligence systems makes it possible to significantly improve the efficiency of solutions to many practical problems compared to traditional approaches. Human experience in spam detection plays an important role in the fight against such mailings, owing to its ability to handle non-standard cases and its high effectiveness. However, optimizing the filtering process and improving its mechanisms is becoming increasingly relevant.
There are a number of spam detection methods, each with its advantages and disadvantages with respect to the necessary criteria: simplicity, trainability, reliability, and the minimization of false conclusions.
In the first part of the paper, the mathematical basis of the probabilistic Bayesian algorithm is given. The second part is devoted to the implementation of an algorithm for detecting and filtering spam messages with a demonstration of program code in the Python programming language.
II. Bayesian algorithm and filtering
The probabilistic Bayesian algorithm is based on the well-known Bayes theorem [1], which is closely related to conditional probabilities, and belongs to the class of machine learning algorithms.

Manuscript received February 19, 2022.

Olga V. Okhlupina, Ph.D., Associate Professor, Bryansk State University of Engineering and Technology, Bryansk, Russia (e-mail: helga131081@yandex.ru)

Dmitry S. Murashko, student, Bryansk State University of Engineering and Technology, Bryansk, Russia (e-mail: murashko100500@gmail.com)
We introduce the following notation. Let X be a set of objects and Y a (finite) set of classes. The probability space X × Y has density p(x, y) = P(y) p(x|y), where P(y) are the a priori probabilities of the appearance of objects of each class and p_y(x) = p(x|y) are the class distribution densities (likelihood functions). Let a: X → Y be the filtering algorithm, and let λ_{yk} denote the loss incurred when an object of class y is assigned to class k (λ_{yy} = 0, λ_{yk} > 0 for y ≠ k).

When detecting spam: y = 1 means spam, y = 0 means not spam, and λ_{01} > λ_{10} (that is, letting spam through costs less than falsely flagging a legitimate message).

If we assume that the losses are determined only by the true class of the object, and not by the class to which it was mistakenly assigned, then λ_{yk} = λ_y for all y, k ∈ Y, y ≠ k.

Given the a priori probabilities P(y) and the likelihood functions p_y(x), with λ_{yk} = λ_y (y ≠ k) and λ_{yy} = 0, the average risk is minimized by the algorithm

a(x) = arg max_{y ∈ Y} λ_y P(y) p_y(x).
According to the definition of conditional probability, we have p(x, y) = p_y(x) P(y) = P(y|x) p(x). The conditional probability P(y|x) is the a posteriori probability of class y for an object x. To calculate it, we apply the Bayes formula:

P(y|x) = p(x, y) / p(x) = p_y(x) P(y) / Σ_{k ∈ Y} p_k(x) P(k).

In terms of the a posteriori probability, the algorithm takes the form a(x) = arg max_{y ∈ Y} λ_y P(y|x).
Under the condition of equal losses, λ_y = 1, we are simply maximizing P(y|x). If, in addition, the classes are equally probable, then the object x is assigned to the class with the highest distribution density:

a(x) = arg max_{y ∈ Y} p_y(x).
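The decision rule above can be sketched numerically. All numbers below are hypothetical, chosen only to illustrate the arg max; they are not taken from the example in Section III:

```python
# Minimal sketch of the minimum-average-risk decision rule
#   a(x) = argmax_y  lambda_y * P(y) * p_y(x)
# All values are illustrative placeholders.

priors = {'spam': 0.5, 'not spam': 0.5}              # P(y)
likelihood = {'spam': 8.0e-10, 'not spam': 4.5e-9}   # p_y(x) for one object x
losses = {'spam': 1.0, 'not spam': 1.0}              # lambda_y (equal losses)


def classify(priors, likelihood, losses):
    # pick the class maximizing lambda_y * P(y) * p_y(x)
    return max(priors, key=lambda y: losses[y] * priors[y] * likelihood[y])


print(classify(priors, likelihood, losses))  # -> not spam
```

With equal losses and equal priors the rule reduces to comparing the likelihoods, exactly as stated above.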
For example, a mail system that has been trained on a certain number of incoming emails (belonging to two classes: spam and not spam) has to assign each new message to one of the classes considered during training.
The words in a letter are assumed to be independent of one another. The use of the so-called "naive" Bayesian algorithm rests on this assumption of independence and equal treatment of all the features under consideration. It should be noted that these assumptions, while not entirely correct in reality, justify themselves in practical application. Hence the "naivety" of the algorithm.
P(y|x) is calculated by the Bayes formula for each class (by building frequency tables for all objects, relative to the desired outcome, from which likelihood tables are derived). The class with the highest P(y|x) is the desired one.
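The frequency-table step can be sketched as follows. This is a toy illustration using Python's collections.Counter; the word lists are invented for the sketch and are not the paper's example:

```python
from collections import Counter

# Toy training documents per class (illustrative words only)
docs = {
    'spam':     ['order', 'bike', 'sale', 'order'],
    'not spam': ['conference', 'tomorrow', 'bike'],
}

# Frequency table: how often each word occurs in each class
freq = {cls: Counter(words) for cls, words in docs.items()}

# Likelihood table with Laplace smoothing (alpha = 1):
#   P(word | class) = (n + 1) / (N + V),  V = size of the shared vocabulary
vocab = set(w for words in docs.values() for w in words)
likelihood = {
    cls: {w: (freq[cls][w] + 1) / (len(docs[cls]) + len(vocab))
          for w in vocab}
    for cls in docs
}

print(likelihood['spam']['order'])  # (2 + 1) / (4 + 5) = 1/3
```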
III. Algorithm implementation
Here is an example of the implementation of the Bayesian algorithm.
Let the system be offered the following messages as a training sample (see Table 1):
Table 1

Spam                                            | Not spam
Laptops at a bargain price                      | There will be a conference tomorrow
Sale! Order a bike and get headphones as a gift | Order skates and a bike

As the message to be checked for spam, we select the following: "Skates are presented on the website. Order one pair and a bike."

We will perform the mathematical calculations by hand and carry out verification using the program code.

To calculate the probabilities, we use the formula P = (n + α)/(N + αV), where α is the smoothing parameter (we set it equal to 1), n is the number of occurrences of the word in the documents of the class, N is the number of words in the documents of the class, and V is the number of distinct words in the training sample (here V = 13, with N = 9 for "Spam" and N = 6 for "Not spam").

Let's enter the data in Table 2 (probabilities are computed only for the words of the message being checked):

Table 2

Word          | In "Spam" | In "Not spam" | P(word|"Spam") | P(word|"Not spam")
laptops       |     1     |       0       |                |
bargain       |     1     |       0       |                |
price         |     1     |       0       |                |
sale          |     1     |       0       |                |
order         |     1     |       1       | (1+1)/(13+9)   | (1+1)/(13+6)
bike          |     1     |       1       | (1+1)/(13+9)   | (1+1)/(13+6)
get           |     1     |       0       |                |
headphones    |     1     |       0       |                |
gift          |     1     |       0       |                |
tomorrow      |     0     |       1       |                |
will be       |     0     |       1       |                |
conference    |     0     |       1       |                |
skates        |     0     |       1       | (1+0)/(13+9)   | (1+1)/(13+6)
pair          |     0     |       0       | (1+0)/(13+9)   | (1+0)/(13+6)
website       |     0     |       0       | (1+0)/(13+9)   | (1+0)/(13+6)
are presented |     0     |       0       | (1+0)/(13+9)   | (1+0)/(13+6)
one           |     0     |       0       | (1+0)/(13+9)   | (1+0)/(13+6)

We get the following result for the "Spam" class:

(2/4) · (2/22) · (2/22) · (1/22) · (1/22) · (1/22) · (1/22) · (1/22) = 1/1247178944 ≈ 8.01809560 · 10^(-10).

For "Not spam":

(2/4) · (2/19) · (2/19) · (2/19) · (1/19) · (1/19) · (1/19) · (1/19) = 16/3575486956 ≈ 4.47491494 · 10^(-9).

We implement the task in the Python programming language. As a result of running the program, we get the weights (see Table 3, Fig. 1):

Table 3

Spam     | Not spam
8.01E-10 | 4.47E-09

C:\Python39\python.exe C:/Projects/SpamLearn/main.py
Weights: spam - 8.018095597354795e-10, not spam - 4.474914940788886e-09
Not spam

Fig. 1. The result of the program (console output translated from Russian)
Since the weights of the "Spam" class are less than the weights of the "Not spam" class, we can conclude that the message is not spam, which is confirmed by the program.
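The hand calculation can be double-checked with a few lines of Python, using the counts from Table 2 (α = 1, V = 13, N = 9 words in the "Spam" documents, N = 6 in "Not spam", and a prior of 2/4 for each class):

```python
# Occurrence counts in each class for the words of the checked message
# (order, bike, skates, pair, website, are-presented, one), from Table 2
spam_counts     = [1, 1, 0, 0, 0, 0, 0]
not_spam_counts = [1, 1, 1, 0, 0, 0, 0]

alpha, V = 1, 13           # smoothing parameter and vocabulary size
N_spam, N_not_spam = 9, 6  # word counts of the spam / not-spam documents
prior = 2 / 4              # two messages of each class out of four

w_spam = prior
for n in spam_counts:
    w_spam *= (n + alpha) / (N_spam + alpha * V)          # factors 2/22 and 1/22

w_not_spam = prior
for n in not_spam_counts:
    w_not_spam *= (n + alpha) / (N_not_spam + alpha * V)  # factors 2/19 and 1/19

print(w_spam)      # ~8.018e-10
print(w_not_spam)  # ~4.475e-09
```

The two weights reproduce the values in Table 3 and Fig. 1, and w_spam < w_not_spam, so the message is classified as not spam.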
Program listing:

# the library from which the list of punctuation marks is taken
from string import punctuation
# the library from which the list of "stop words" is taken
from stop_words import get_stop_words

# a list with "spam" messages
spam_line = ['Laptops at a bargain price',
             'Sale! Order a bike and get headphones as a gift']
# a list with "non-spam" messages
not_spam_line = ['There will be a conference tomorrow',
                 'Order skates and a bike']
# the message to be checked for spam
search_spam_line = 'The site presents skates. Order one pair and a bike'


# a function for formatting a message (accepts a string, returns a tuple)
def clear_line(line_clearing: str) -> tuple:
    # the whole string is converted to lowercase
    line_clearing = line_clearing.lower()
    # go through the list of punctuation marks,
    # replacing each punctuation mark with "emptiness"
    for i in punctuation:
        line_clearing = line_clearing.replace(i, '')
    # split the string into a list of words
    list_words = line_clearing.split()
    # go through the list of "stop words"
    # (the original example processes Russian messages)
    for i in get_stop_words('ru'):
        # if the word is found in the list, remove it
        if i in list_words:
            list_words.remove(i)
    # return the finished list as a tuple
    return tuple(list_words)


# spam check function (accepts a string and a dictionary, returns a boolean value)
def check_spam(line_check: str, table_info: dict) -> bool:
    # the "spam" and "not spam" weights
    result = [1, 1]
    # the number of words in the training sample,
    # the number of words in "spam" and in "not spam"
    count = [0, 0, 0]
    ### filling in count ###
    # pass over all the "spam" entries
    for i in table_info['spam']:
        # add the number of occurrences to the total counter
        count[0] += table_info['spam'][i]
        # add the number of occurrences to the local "spam" counter
        count[1] += table_info['spam'][i]
    # pass over the "not spam" entries
    for i in table_info['not spam']:
        # if the entry is not in "spam",
        # add the number of occurrences to the total counter
        if i not in table_info['spam']:
            count[0] += table_info['not spam'][i]
        # add the number of occurrences to the local "not spam" counter
        count[2] += table_info['not spam'][i]
    # smoothing parameter
    a = 1
    # pass over the words of the checked string after formatting
    for i in clear_line(line_check):
        # if the word is not in "spam", add it with the value 0
        if i not in table_info['spam']:
            table_info['spam'][i] = 0
        # if the word is not in "not spam", add it with the value 0
        if i not in table_info['not spam']:
            table_info['not spam'][i] = 0
        # update the weights according to the formula
        result[0] *= (a + table_info['spam'][i]) / (a * count[0] + count[1])
        result[1] *= (a + table_info['not spam'][i]) / (a * count[0] + count[2])
    # multiply by the a priori probabilities of the classes
    result[0] *= table_info['count in spam'] / (table_info['count in spam'] + table_info['count in not spam'])
    result[1] *= table_info['count in not spam'] / (table_info['count in spam'] + table_info['count in not spam'])
    # debugging information about the weights
    print('Weights: spam - %s, not spam - %s' % (result[0], result[1]))
    # if the "spam" weight is greater than the "not spam" weight,
    # return true (spam); otherwise return false (not spam)
    if result[0] > result[1]:
        return True
    else:
        return False


# a "training" function (accepts a list of "spam" and a list of "not spam", returns a dictionary)
def learn_spam(spam: list, not_spam: list) -> dict:
    # create a "blank sheet" of the dictionary
    dict_words = {'spam': {}, 'not spam': {},
                  'count in spam': 0, 'count in not spam': 0}
    # buffer lists for words
    spam_words, not_spam_words = [], []
    # go through the "spam" list,
    # merging the formatting results into a single list
    for i in spam:
        spam_words.extend(clear_line(i))
    # pass over the "spam" buffer list, adding to the resulting dictionary
    # the word and, as the value, the number of its repetitions
    for i in spam_words:
        dict_words['spam'][i] = spam_words.count(i)
    # do the same for the "not spam" list
    for i in not_spam:
        not_spam_words.extend(clear_line(i))
    for i in not_spam_words:
        dict_words['not spam'][i] = not_spam_words.count(i)
    # store the lengths of the training lists in the resulting dictionary
    dict_words['count in spam'], dict_words['count in not spam'] = len(spam), len(not_spam)
    # return the "trained" dictionary
    return dict_words


# call the spam check function, passing it the checked string and
# the dictionary "trained" on the "spam" and "not spam" lists
if check_spam(search_spam_line, learn_spam(spam_line, not_spam_line)):
    # the function returned "true"
    print('Spam')
else:
    # the function returned "false"
    print('Not spam')
IV. Advantages and disadvantages of the algorithm
The classifier in question performs better than other simple algorithms when the amount of training data is small.

The naive Bayesian algorithm is characterized by simplicity and speed in determining the class of a given data set. It is effective when working with categorical features. When a categorical value appears in the test set that is not represented in the training set, the zero-frequency problem arises, which requires a smoothing technique to solve.

In reality, it is extremely rare for features to be truly independent. However, the point is not that the parameters must be independent, but that we do not have to model any dependence between them. This speeds up training and allows prediction on arbitrary data sets.
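The zero-frequency problem mentioned above is easy to demonstrate: without smoothing, a single unseen word turns the whole product of probabilities into zero, while Laplace smoothing keeps it finite. A minimal sketch (the counts are illustrative, not taken from the paper's example):

```python
def word_prob(n, N, V, alpha):
    # (n + alpha) / (N + alpha * V): smoothed estimate of P(word | class);
    # with alpha = 0 this is the raw frequency n / N
    return (n + alpha) / (N + alpha * V)


N, V = 9, 13        # class word count and vocabulary size (illustrative)
counts = [2, 2, 0]  # the last word never appeared in this class

unsmoothed = 1.0
smoothed = 1.0
for n in counts:
    unsmoothed *= word_prob(n, N, V, alpha=0)  # no smoothing
    smoothed *= word_prob(n, N, V, alpha=1)    # Laplace smoothing

print(unsmoothed)  # 0.0 -- the unseen word wipes out the product
print(smoothed)    # small but non-zero
```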
References
[1] V. E. Gmurman, Teoriya veroyatnostej i matematicheskaya statistika: uchebnoe posobie dlya vuzov. 11 izd. M.: Vysshaya shkola, 2005. 479 p. (In Russian)
[2] Vysokourovnevyj yazyk programmirovaniya Python [Online]. Available: https://www.python.org/
[3] D. Barber, Bayesian reasoning and machine learning. Cambridge University Press, 2012. 642 p.
[4] O.V. Ohlupina, A.A. Prokopenko, A.O. Zgonnikova, O yomkosti modeli klassifikacii // Uchyonye zapiski Bryanskogo gosudarstvennogo universiteta. Bryansk: BGU, 2021 (4). pp. 22-27. (In Russian)
Olga V. Okhlupina, Candidate Sc. (Phys. and Math.), associate Professor, Bryansk state engineering-technological University, Prospekt Stanke Dimitrova, 3, Bryansk 241037, Russia.
Dmitry S. Murashko, student, Bryansk state engineering-technological University, Prospekt Stanke Dimitrova, 3, Bryansk 241037, Russia.