Combined Classifier for Website Messages Filtration
Veniamin Tarasov, Ekaterina Mezenceva, Danila Karbaev
Volga Region State University of Telecommunications and Informatics, 77 Moskovskoe sh., Samara, 443090, Russian Federation
Abstract. The paper describes a new approach to filtering website messages using a combined classifier. Information security standards for internet resources require user data protection; however, the increasing volume of spam messages in interactive sections of websites poses a special problem. Spam messages vary significantly in content, but their common feature is that they are usually of little interest to the majority of recipients. Many filtering approaches are based on the naive Bayes classifier, an effective method for automatically constructing anti-spam filters with high performance. Unlike many email filtering solutions, the proposed approach is based on an effective combination of the Bayes and Fisher methods, which allows us to build an accurate and stable spam filter. In this paper we consider the organization of a combined classifier according to given optimization criteria based on statistical methods, probability calculations and decision rules. We also consider optimization criteria for grading messages based on statistical methods. Classifiers normally accept a compromise between acceptable levels of false-positive and false-negative errors and use threshold values for decision-making, which may vary. To obtain more reliable spam detection results, we need to analyze the sets of results produced by different filters and the subsets where they overlap. The approach we suggest is a classifier organization that combines the Bayes and Fisher methods to improve filtration quality, based on the analysis of the subsets and set overlaps identified by both methods (spam, non-spam, false triggering and spam leaks).
Keywords: combined classifier; spam filter; optimization criterion.
DOI: 10.15514/ISPRAS-2015-27(3)-20
For citation: Tarasov V., Mezenceva E., Karbaev D. Combined Classifier for Website Messages Filtration. Trudy ISP RAN/Proc. ISP RAS, vol. 27, issue 3, 2015, pp. 291-302. DOI: 10.15514/ISPRAS-2015-27(3)-20.
1. Introduction
The constantly growing volume of data, the number of users, and the number of groups devoted to various subjects significantly decrease the effectiveness and authenticity of communicated information. In this regard, the task of increasing the efficiency of statistical data filtration and authentication algorithms is highly topical. The subject has a history of more than 20-30 years in computer science, and it is becoming more urgent. Antispam features for interactive sections of websites are still at a very early stage of development.
Message filtration for email is developing widely and manual antispam methods are in use, while automated antispam protection of corporate websites (including comments, forums and other interactive sections) is becoming a priority. In practice, there are no universal software solutions that protect all types of interactive website sections from spam. There are only a small number of specialized tools that prevent automatic message posting. Some of them are designed as plugins for a particular content management system such as WordPress: Akismet, Quiz, Spam Karma, etc. These modules have some disadvantages: the "as is" distribution model does not include a statistical base, and most online services do not provide multilingual filtration and support only the English language. Other blog comment hosting services, such as IntenseDebate, Disqus and Livefyre, do not provide a self-hosted option, with the exception of Discourse.
Therefore, a spam filtering software solution should have the following properties: the use of multiple filtering methods, both formal and linguistic, united by a common intelligent decision-making core; high speed and precision; easy installation and use.
This work describes a new approach to spam filtration that combines the Bayes and Fisher methods, making it possible to significantly reduce the number of false triggerings and to increase spam detection.
2. Calculation of combined probabilities of conditions
The main idea of message classification is to select all conditions, calculate the probabilities of the selected conditions, and then combine all calculated probabilities into one value for the message under study. Messages with many spam attributes and few non-spam attributes will have a value close to 1, and messages with many non-spam attributes and few spam attributes will have a value close to 0.
We will build a classifier of messages received by the website that grades incoming messages into three categories (spam, non-spam, unidentified). To do so, we need to identify all conditions (words and word combinations) in the message to be analyzed, calculate statistical probabilities for selected conditions and combine all probabilities into one value for the whole message. In most cases the probability of assigning a message to one category is much higher than to the others, which determines how the message is graded.
Before calculating the combined probabilities of conditions, we need to calculate the probability of assigning a certain condition to a specific category. For this we can divide the identified number of messages with condition i in this category by the total number of messages in the same category, but we would rather use another method described below. Let's assume:
$F_{ai}$ is the number of messages with condition i in the spam group;
$F_{bi}$ is the number of messages with condition i in the non-spam group.
Then the statistical probability of the appearance of condition i in a spam message can be calculated as follows:
$$P_{ai} = \frac{F_{ai}}{F_{ai} + F_{bi}} \tag{1}$$
and the probability of appearance of i condition in a non-spam message, as follows:
$$P_{bi} = \frac{F_{bi}}{F_{ai} + F_{bi}} \tag{2}$$
Thus, the number of messages with condition i in one category will be divided by the total number of messages featuring this condition i .
The use of (1) and (2) takes into account the fact that with time the number of messages in both categories may be equal, i.e. these formulas do not depend on the number of messages in a specific category.
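As a minimal sketch of (1) and (2), the two statistical probabilities can be computed directly from the per-category message counts. The function name and the sample counts are illustrative assumptions and are not taken from the authors' PHP implementation.

```python
def condition_probabilities(f_ai: int, f_bi: int) -> tuple[float, float]:
    """Return (P_ai, P_bi) for one condition i, following formulas (1) and (2).

    f_ai: number of spam messages containing condition i
    f_bi: number of non-spam messages containing condition i
    """
    total = f_ai + f_bi
    if total == 0:
        # Formulas (1) and (2) are undefined for a condition never seen in training.
        raise ValueError("condition was never seen in training")
    return f_ai / total, f_bi / total


# Hypothetical counts: the condition occurs in 8 spam and 2 non-spam messages.
p_spam, p_ham = condition_probabilities(f_ai=8, f_bi=2)
print(p_spam, p_ham)  # 0.8 0.2
```

The two values always sum to 1, since both counts are divided by the same total.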
Note that the formulas above give accurate results only for conditions that the filter has encountered in both categories. As a result, the spam filter becomes too sensitive to rare words in the early stages of learning. To solve this problem, we calculate a new probability that combines the expected a priori probability (Pex), taken with weight (w), with the probabilities calculated according to (1) and (2). If Pex = 0.5 and the weight of the expected probability is equal to one word (w = 1), the weighted probabilities based on (1) and (2) are estimated as:
$$\bar{P}_{ai} = \frac{w \cdot P_{ex} + P_{ai}\,(F_{ai} + F_{bi})}{w + F_{ai} + F_{bi}},$$
$$\bar{P}_{bi} = \frac{w \cdot P_{ex} + P_{bi}\,(F_{ai} + F_{bi})}{w + F_{ai} + F_{bi}}.$$
This approach avoids division by zero in the following formulas and takes rare words into account.
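The weighting step can be sketched as follows, assuming the defaults w = 1 and Pex = 0.5 mentioned above. The function name and the sample counts are hypothetical.

```python
def weighted_probability(p: float, f_ai: int, f_bi: int,
                         w: float = 1.0, p_ex: float = 0.5) -> float:
    """Blend the raw probability p (from (1) or (2)) with the expected prior p_ex.

    The raw estimate is weighted by the number of messages that contained the
    condition, the prior by w, so rare conditions stay close to p_ex.
    """
    n = f_ai + f_bi
    return (w * p_ex + p * n) / (w + n)


# A word seen once, only in spam: the raw estimate 1.0 is pulled toward 0.5.
print(weighted_probability(1.0, f_ai=1, f_bi=0))  # 0.75
# A word never seen in training falls back to the prior.
print(weighted_probability(0.0, f_ai=0, f_bi=0))  # 0.5
```

With w > 0 and 0 < Pex < 1 the result is never exactly 0 or 1, which keeps the later products and logarithms well defined.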
To obtain the combined probabilities for the whole document (message) we use the dictionary built at the filter learning step. We introduce the following events: A - the document is spam, B - the document is non-spam. We assume that the probabilities are independent, so they can be multiplied:
$$P(A) = P_{a1} \times P_{a2} \times \dots \times P_{aM} \tag{3}$$
for the probability of co-occurrence of the words in spam;
$$P(B) = P_{b1} \times P_{b2} \times \dots \times P_{bM} \tag{4}$$
for the probability of co-occurrence of the words in non-spam [[1]].
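The products (3) and (4) can be sketched as shown here. The log-space variant is an added assumption rather than something the paper describes: it is a common way to avoid floating-point underflow when a message contains many words. All names and sample values are illustrative.

```python
import math


def combined_probability(probs: list[float]) -> float:
    """Direct product of per-condition probabilities, as in (3) and (4)."""
    return math.prod(probs)


def log_combined_probability(probs: list[float]) -> float:
    """ln of the same product, computed as a sum of logarithms.

    Assumption: every probability is strictly positive, for example because
    it was smoothed with the weighted estimate beforehand.
    """
    return sum(math.log(p) for p in probs)


spam_probs = [0.9, 0.8, 0.75]  # hypothetical probabilities of three words
p_a = combined_probability(spam_probs)
print(round(p_a, 4))           # 0.54
print(math.isclose(math.exp(log_combined_probability(spam_probs)), p_a))  # True
```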
3. Decision rules based on Bayes theorem
To estimate the probability that a message belongs to one of the three categories (spam, non-spam, unidentified messages), we consider two methods of classification. Here we apply Bayes formulas using a priori knowledge [[1]]. We introduce two hypotheses for any given message: $H_A$ if the message is spam, $H_B$ if the message is non-spam. Further, we introduce the following notation:
$F_a$ is the total quantity of spam messages;
$F_b$ is the total quantity of non-spam messages;
$P_a = \frac{F_a}{F_a + F_b}$ is the a priori probability that a message is spam;
$P_b = \frac{F_b}{F_a + F_b}$ is the a priori probability that a message is not spam;
$O_a = \frac{P_a}{1 - P_a}$ is the a priori odds that a message is spam;
$O_b = \frac{P_b}{1 - P_b}$ is the a priori odds that a message is non-spam.
Then, based on Bayes theorem and using the a priori knowledge, we obtain
$$P(H_A) = \frac{P(A)\,O_a}{P(A)\,O_a + P(B)\,O_b},$$
the a posteriori probability that a message is spam, and
$$P(H_B) = \frac{P(B)\,O_b}{P(A)\,O_a + P(B)\,O_b},$$
the a posteriori probability that a message is non-spam.
The probabilities P(A) and P(B) are estimated according to (3) and (4).
The given algorithm is implemented in a spam detection and filtering system for websites [[2]].
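The decision rule of this section can be illustrated with a short sketch. It is not the registered implementation [[2]]; the function name, argument names and sample numbers are assumptions made for the example.

```python
def bayes_posteriors(p_a_msg: float, p_b_msg: float,
                     f_a: int, f_b: int) -> tuple[float, float]:
    """Return (P(H_A), P(H_B)) for one message.

    p_a_msg, p_b_msg: combined probabilities P(A) and P(B) from (3) and (4)
    f_a, f_b: total numbers of spam and non-spam training messages (both > 0)
    """
    p_a = f_a / (f_a + f_b)          # a priori probability of spam
    p_b = f_b / (f_a + f_b)          # a priori probability of non-spam
    o_a = p_a / (1.0 - p_a)          # a priori odds of spam
    o_b = p_b / (1.0 - p_b)          # a priori odds of non-spam
    num_a = p_a_msg * o_a
    num_b = p_b_msg * o_b
    denom = num_a + num_b
    if denom == 0.0:
        raise ValueError("both combined probabilities are zero")
    return num_a / denom, num_b / denom


# Hypothetical message: P(A) = 0.54, P(B) = 0.006; 400 spam, 500 non-spam in training.
post_spam, post_ham = bayes_posteriors(0.54, 0.006, f_a=400, f_b=500)
print(round(post_spam, 3), round(post_ham, 3))
```

The two posteriors share a denominator, so they always sum to 1.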
4. Decision rules based on Fisher's method
According to the Fisher method, all probabilities are multiplied together in a similar manner to the Bayes method, then the natural logarithm of the product is taken and the result is multiplied by -2. To do this we introduce the variable hisqv, which is given by one of the following expressions: $hisqv = -2\ln P(A)$ or $hisqv = -2\ln P(B)$,
where the probabilities P(A) and P(B) are calculated according to (3) and (4). Fisher proved that if the probabilities in (3) and (4) are independent and random, the value $-2\ln P(A)$ follows the $\chi^2$ distribution with 2n degrees of freedom (n is the number of words in the document):
$$F(x) = \int_0^x \frac{t^{n-1} e^{-t/2}}{2^n\,\Gamma(n)}\,dt \tag{5}$$
where $\Gamma(n)$ is the gamma function.
In view of the foregoing, and using the representation of the gamma function for an integer argument, $\Gamma(n) = (n-1)!$, (5) can be written as:
$$F(x) = \frac{1}{2^n\,(n-1)!} \int_0^x t^{n-1} e^{-t/2}\,dt, \quad x = hisqv \tag{6}$$
The calculation of the factorial and the integrand in (6) can cause an overflow error because of the range of floating-point numbers in the PHP programming language. Thus, a recurrence formula is used in the calculation algorithm. The probability (6) is calculated by the Gaussian quadrature formula with 15 nodes:
$$\int_a^b f(t)\,dt \approx \frac{b-a}{2}\sum_{i=1}^{n} A_i f(t_i),$$
where $t_i = \frac{b+a}{2} + \frac{b-a}{2}\,x_i$, $x_i$ are the nodes of the Gaussian quadrature formula, and $A_i$ are the Gaussian coefficients ($i = 1, 2, \dots, 15$) [[3]]. In our case $a = 0$, $b = hisqv$.
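The quadrature step can be sketched in Python as follows. Two points are assumptions of this sketch rather than the authors' method: the 15 nodes and coefficients are computed by Newton iteration on the Legendre polynomial instead of being taken from tables [[3]], and overflow is avoided by evaluating the integrand of (6) in log space with `math.lgamma`, not by the recurrence formula the paper mentions.

```python
import math


def gauss_legendre(n: int = 15) -> tuple[list[float], list[float]]:
    """Nodes x_i and coefficients A_i of the n-point Gauss rule on [-1, 1]."""
    nodes, coeffs = [], []
    for i in range(1, n + 1):
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))  # initial guess for the i-th root
        for _ in range(100):                              # Newton iterations on P_n(x)
            p_prev, p = 1.0, x                            # P_0(x), P_1(x)
            for k in range(2, n + 1):                     # three-term recurrence up to P_n
                p_prev, p = p, ((2 * k - 1) * x * p - (k - 1) * p_prev) / k
            dp = n * (x * p - p_prev) / (x * x - 1.0)     # derivative P_n'(x)
            step = p / dp
            x -= step
            if abs(step) < 1e-15:
                break
        nodes.append(x)
        coeffs.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, coeffs


NODES, COEFFS = gauss_legendre(15)


def integrate(f, a: float, b: float) -> float:
    """Approximate the integral of f over [a, b] with the 15-node rule."""
    half, mid = (b - a) / 2.0, (b + a) / 2.0
    return half * sum(A * f(mid + half * x) for x, A in zip(NODES, COEFFS))


def chi2_cdf(hisqv: float, n: int) -> float:
    """F(hisqv) from (6): chi-squared distribution with 2n degrees of freedom."""
    if hisqv <= 0.0:
        return 0.0
    log_norm = n * math.log(2.0) + math.lgamma(n)  # ln(2^n * (n-1)!)

    def integrand(t: float) -> float:
        # t is strictly positive because Gauss nodes lie inside (a, b).
        return math.exp((n - 1) * math.log(t) - t / 2.0 - log_norm)

    return integrate(integrand, 0.0, hisqv)


print(round(chi2_cdf(6.0, 3), 4))  # 0.5768
```

A fixed 15-node rule is accurate for moderate values of hisqv; for very large values the integrand is concentrated in a small part of the interval and the approximation degrades.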
The value returned by the function F(hisqv) is low if a text contains many spam conditions. We need the opposite result to rate the message correctly, so we subtract the value from 1. Applying the same subtraction for a large number of non-spam conditions gives the probability that the message is not spam. However, the Fisher method is not symmetrical. We need to combine the probabilities of spam and non-spam into a single value in the range between 0 and 1. For this we use the Fisher index:
$$I = \frac{1 + P(H'_A) - P(H'_B)}{2},$$
where $P(H'_A) = 1 - F(-2\ln P(A))$ is the probability that a document belongs to spam and $P(H'_B) = 1 - F(-2\ln P(B))$ is the probability that a document belongs to non-spam [[4]].
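A compact sketch of the Fisher index is given here. To stay self-contained it swaps the quadrature for the closed-form series of the chi-squared tail with an even number of degrees of freedom, $1 - F(x) = e^{-x/2}\sum_{k=0}^{n-1}\frac{(x/2)^k}{k!}$, which is mathematically equivalent for 2n degrees of freedom. The function names and sample probabilities are assumptions.

```python
import math


def chi2_tail_even(x: float, n: int) -> float:
    """1 - F(x) for the chi-squared distribution with 2n degrees of freedom."""
    m = x / 2.0
    term = total = math.exp(-m)
    for k in range(1, n):
        term *= m / k          # builds e^{-m} * m^k / k! step by step
        total += term
    return min(total, 1.0)


def fisher_index(spam_probs: list[float], ham_probs: list[float]) -> float:
    """Combine per-word probabilities into one score in [0, 1]; 1 means spam.

    Assumption: all probabilities are strictly positive and both lists
    describe the same n words of the message.
    """
    n = len(spam_probs)
    hisqv_a = -2.0 * sum(math.log(p) for p in spam_probs)
    hisqv_b = -2.0 * sum(math.log(p) for p in ham_probs)
    p_ha = chi2_tail_even(hisqv_a, n)   # P(H'_A) = 1 - F(-2 ln P(A))
    p_hb = chi2_tail_even(hisqv_b, n)   # P(H'_B) = 1 - F(-2 ln P(B))
    return (1.0 + p_ha - p_hb) / 2.0


# Hypothetical three-word message dominated by spam words.
print(round(fisher_index([0.9, 0.8, 0.95], [0.1, 0.2, 0.05]), 2))  # 0.98
```

Swapping the two argument lists gives a score near 0, and equal lists give 0.5, which is the symmetry the index is meant to restore.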
5. Optimization criteria for grading messages based on statistical methods
Let's assume that the whole set of conditions is divided into classes A and B, where A is the class of spam messages and B is the class of non-spam messages. The task of assigning a message to any of these classes is not directly connected to the statistical verification of the following hypotheses: the simple hypothesis $H_A: X \in A$ against the alternative $H_B: X \in B$, where X is the message qualifying condition. As we know from mathematical statistics, if a message belongs to class A but was qualified as class B, a 1st type error occurs with conditional probability $\alpha$, the level of significance. This is the error of selecting the alternative hypothesis $H_B$ instead of the correct $H_A$. If hypothesis $H_B$ is true but $H_A$ was nevertheless selected, a 2nd type error occurs with conditional probability $\beta$.
The 1st type error, or false-negative error, occurs if the spam filter erroneously lets an undesired message through by identifying it as non-spam (spam leakage, or insufficient method completeness). While the spam filter is capable of identifying a large share of undesired messages, the task of minimizing the faulty filtering of desired (non-spam) messages, i.e. minimizing the 2nd type error, may become a higher priority.
The 2nd type error, or false-positive error, occurs if the spam filter erroneously classifies a legitimate message as spam (false triggering, or insufficient method accuracy). The spam filter is efficient when the number of such errors is low, i.e. with a minimal 2nd type error level. However, all current antispam systems demonstrate a correlation between 1st and 2nd type errors.
Classifiers normally accept a compromise between acceptable levels of 1st and 2nd type errors and use threshold values for decision-making, which may vary. This results in the "strictness" or "softness" of the classifier. The level of significance set during statistical hypothesis verification is taken as the threshold value. An increase in filter sensitivity leads to more 1st type errors (spam leaks), and a decrease in sensitivity leads to more 2nd type errors (false triggering).
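The two error types can be made concrete with a small sketch that estimates them from labelled test messages. It assumes each rate is taken relative to the size of its own class (spam for leaks, non-spam for false triggering); the paper does not state which base is used, and the names and data below are hypothetical.

```python
def error_rates(truth: list[bool], predicted: list[bool]) -> tuple[float, float]:
    """Return (alpha, beta): spam leak rate and false triggering rate.

    truth[i] is True if message i really is spam,
    predicted[i] is True if the filter marked it as spam.
    """
    spam_preds = [p for t, p in zip(truth, predicted) if t]
    ham_preds = [p for t, p in zip(truth, predicted) if not t]
    alpha = sum(1 for p in spam_preds if not p) / len(spam_preds)  # 1st type error
    beta = sum(1 for p in ham_preds if p) / len(ham_preds)         # 2nd type error
    return alpha, beta


truth = [True] * 4 + [False] * 6
predicted = [True, True, True, False,                 # one spam message leaks
             False, False, False, False, False, True]  # one false triggering
alpha, beta = error_rates(truth, predicted)
print(alpha, round(beta, 3))  # 0.25 0.167
```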
6. Bayes optimization criterion
To evaluate classification quality, we need to consider the losses related to 1st and 2nd type errors. For this we split the space of condition X into two half-spaces $X_A$ and $X_B$ at the point $x_0$. Let us define $c_1$ as the conditional price of a 1st type error and $c_2$ as the conditional price of a 2nd type error, $P(A)$ as the a priori probability of class A and $P(B)$ as the a priori probability of class B, with $P(A) + P(B) = 1$. The values $c_1$ and $c_2$ depend on the coefficients of the price matrix $C_{2\times 2} = \{c_{ij}\}$ and on the 1st and 2nd type errors:
$$c_1 = c_{12}\,\alpha + c_{11}\,(1 - \alpha) \tag{7}$$
$$c_2 = c_{21}\,\beta + c_{22}\,(1 - \beta) \tag{8}$$
These values are also called the conditional risks given that hypotheses $H_A$ and $H_B$, respectively, are true.
According to decision-making theory, we introduce the decision rule of classification that minimizes the loss (risk) function [[3]]:
$$R = c_1 P(A) + c_2 P(B) \tag{9}$$
where $c_1$ and $c_2$ are determined by (7) and (8).
Function (9) represents the average risk, which depends on the threshold value $x_0$, because the values $c_1$ and $c_2$ depend on $x_0$ through the type I and type II errors; these errors are therefore correlated.
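Formulas (7) to (9) translate into a few lines of code. The price matrix, priors and error levels in the example are hypothetical values chosen only to show the arithmetic.

```python
def conditional_risks(alpha: float, beta: float,
                      c: list[list[float]]) -> tuple[float, float]:
    """Return (c1, c2) from (7) and (8).

    c is the 2x2 price matrix with c[i][j] standing for c_{(i+1)(j+1)}.
    """
    c1 = c[0][1] * alpha + c[0][0] * (1.0 - alpha)
    c2 = c[1][0] * beta + c[1][1] * (1.0 - beta)
    return c1, c2


def average_risk(alpha: float, beta: float, c: list[list[float]],
                 p_a: float, p_b: float) -> float:
    """Average risk R from (9); p_a + p_b is expected to equal 1."""
    c1, c2 = conditional_risks(alpha, beta, c)
    return c1 * p_a + c2 * p_b


# Hypothetical prices: correct decisions cost 0, a spam leak costs 1,
# a false triggering costs 10.
prices = [[0.0, 1.0],
          [10.0, 0.0]]
risk = average_risk(alpha=0.045, beta=0.01, c=prices, p_a=0.4, p_b=0.6)
print(round(risk, 3))  # 0.078
```

Making a false triggering ten times as expensive as a leak is one way to express the priority of protecting desired messages; other price ratios move the optimal threshold accordingly.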
The minimum value $R_{min}$ of the risk function (9), reached at the point $x_0$, is called the Bayes risk. At this point
$$\frac{f_1(X)}{f_2(X)} = \frac{(c_{21} - c_{22})\,P(B)}{(c_{12} - c_{11})\,P(A)} \tag{10}$$
where $f_1(X)$ and $f_2(X)$ are the probability density distributions of the condition X on classes A and B, respectively. The right-hand side of (10), $\frac{(c_{21} - c_{22})\,P(B)}{(c_{12} - c_{11})\,P(A)}$, is the threshold with which the likelihood ratio $f_1(X)/f_2(X)$ is compared; it is constant for a given selection of $c_{ij}$. Thus, if the inequality
$$\frac{f_1(X)}{f_2(X)} > \frac{(c_{21} - c_{22})\,P(B)}{(c_{12} - c_{11})\,P(A)}$$
is true, the observed vector X is assigned to class A; if the inequality
$$\frac{f_1(X)}{f_2(X)} < \frac{(c_{21} - c_{22})\,P(B)}{(c_{12} - c_{11})\,P(A)}$$
is true, the observed vector X is assigned to class B. If the equality
$$\frac{f_1(X)}{f_2(X)} = \frac{(c_{21} - c_{22})\,P(B)}{(c_{12} - c_{11})\,P(A)}$$
is true, the observed vector X can be assigned to either of the classes A or B. The latter expression is the equation for the boundary between classes A and B. This decision rule belongs to the Bayes rules [[5]]. The technique can be applied to many practical problems formulated in terms of statistical decision-making theory, under the assumption that the probability densities $f_1(X)$ and $f_2(X)$ are known. In most practical cases the functions $f_1(X)$ and $f_2(X)$ are not known, and we need to determine the estimates $\tilde{f}_1(X)$, $\tilde{f}_2(X)$ on training sets using an approximation method [[5]], which can slow the classifier down. Considering this, we use the following approach: at the filter learning stage, the estimates $\tilde{f}_1(X)$, $\tilde{f}_2(X)$ are determined on small training sets of 100-200 elements, and the optimality criterion used to obtain such estimates can be excluded from the program flow.
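The likelihood-ratio rule built on (10) can be sketched as follows, assuming the density values $f_1(X)$ and $f_2(X)$ at the observed X are already available (known or estimated). The function name, the price matrix layout `c[i][j]` for $c_{(i+1)(j+1)}$ and the sample numbers are assumptions.

```python
def bayes_decision(f1_x: float, f2_x: float, c: list[list[float]],
                   p_a: float, p_b: float) -> str:
    """Assign X to class "A" (spam) or "B" (non-spam) by the rule based on (10).

    Returns "boundary" when X lies exactly on the border between the classes.
    Assumes c12 != c11 and p_a > 0.
    """
    threshold = (c[1][0] - c[1][1]) * p_b / ((c[0][1] - c[0][0]) * p_a)
    # Compare f1 with threshold * f2 instead of f1 / f2 to avoid dividing by zero.
    lhs, rhs = f1_x, threshold * f2_x
    if lhs > rhs:
        return "A"
    if lhs < rhs:
        return "B"
    return "boundary"


unit_prices = [[0.0, 1.0],
               [1.0, 0.0]]   # hypothetical: both error types cost the same
print(bayes_decision(0.3, 0.1, unit_prices, p_a=0.5, p_b=0.5))  # A
print(bayes_decision(0.1, 0.3, unit_prices, p_a=0.5, p_b=0.5))  # B
```

With equal prices and equal priors the threshold is 1, so the rule simply picks the class with the larger density.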
The results of numerous tests on training sets allowed us to identify the optimal threshold values for decision-making: $x_H = 0.95$ for the higher threshold and $x_L = 0.4$ for the lower threshold.
Thereby we set a strict limit for spam messages and a regular one for non-spam messages. Such threshold values provide minimum leakage of desired messages into spam, i.e. minimum false triggering. However, any system administrator can easily set more convenient threshold values to suit their needs.
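Grading by the two thresholds can be sketched as follows. Whether scores exactly equal to a threshold count as spam or non-spam is not specified in the paper, so the inclusive comparisons here are an assumption, as are the names.

```python
X_H = 0.95  # higher threshold: at or above it the message is graded as spam
X_L = 0.4   # lower threshold: at or below it the message is graded as non-spam


def grade(score: float, upper: float = X_H, lower: float = X_L) -> str:
    """Grade a combined score in [0, 1] into one of the three categories."""
    if score >= upper:
        return "spam"
    if score <= lower:
        return "non-spam"
    return "unidentified"


for s in (0.98, 0.7, 0.2):
    print(s, grade(s))
# 0.98 spam
# 0.7 unidentified
# 0.2 non-spam
```

An administrator who prefers a stricter or softer filter would pass different `upper` and `lower` values.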
7. Combined filter
To obtain more reliable spam detection results, we need to analyze the sets of results produced by different filters and the subsets where they overlap. We suggest exactly this kind of approach to classifier organization: the combined use of the Bayes and Fisher methods to improve filtration quality, based on the analysis of the subsets and set overlaps identified by both methods (spam, non-spam, false triggering and spam leaks).
Let us assume that $S = \{s_i\}$ ($i = 1, \dots, M$) is the set of documents (messages), including both desired and spam messages, and that $S_B \subset S$ and $S_F \subset S$ are the sets of documents identified by the Bayes and Fisher classifiers, respectively. Then the subset resulting from the overlap $S_B \cap S_F$ for each of the indicated categories may be used for evaluating the quality of the combined filter operation (see Fig. 1).
Fig. 1. Illustration of overlap degree of two subsets SB and SF.
The completeness of the overlap $S_B \cap S_F$ also determines the subsets $S_B \setminus S_F$ and $S_F \setminus S_B$. As a measure of the degree of overlap of the two sets $S_B$ and $S_F$ we suggest using the absolute measure $N(S_B \cap S_F)$, the number of documents shared by these subsets. Thus, the maximum value of this measure for category $l$ (spam, non-spam, false triggering and spam leaks) will be used as the optimality criterion for evaluating spam filter self-teaching:
$$N_l(S_B^l \cap S_F^l) \to \max.$$
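The overlap measure maps naturally onto set intersection. In this sketch each filter's output is represented as a dictionary from category name to a set of message identifiers; this representation and the sample identifiers are assumptions made for illustration.

```python
def overlap_sizes(bayes: dict[str, set[int]],
                  fisher: dict[str, set[int]]) -> dict[str, int]:
    """N_l(S_B^l ∩ S_F^l) for every category l present in both results."""
    categories = bayes.keys() & fisher.keys()
    return {cat: len(bayes[cat] & fisher[cat]) for cat in categories}


# Hypothetical message ids assigned to each category by the two filters.
bayes_result = {"spam": {1, 2, 3, 4}, "non-spam": {5, 6, 7}}
fisher_result = {"spam": {2, 3, 4, 8}, "non-spam": {5, 6, 9}}

sizes = overlap_sizes(bayes_result, fisher_result)
print(sizes["spam"], sizes["non-spam"])  # 3 2
```

The set differences `bayes[cat] - fisher[cat]` and `fisher[cat] - bayes[cat]` give the documents on which the two filters disagree.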
Once the best overlap values of the sets $S_B$ and $S_F$ are reached across all categories, the administrator will be able to choose a filter for further application (see Fig. 2).
Fig. 2. The algorithm of combined filter accuracy evaluation.
As a benefit of the combined filter implementation, it became possible to evaluate all components of the overall picture:
- spam messages caught by both filters;
- spam messages caught only by the Bayes filter or only by the Fisher filter;
- simultaneous false triggering of both filters;
- false triggering of each individual filter;
- simultaneous spam leaks by both filters;
- spam leaks of each individual filter.
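The components listed above can be counted with a short sketch, given ground-truth labels and the verdicts of both filters. The data layout (dictionaries from message id to a boolean "is spam" flag), the label strings and the sample data are assumptions.

```python
from collections import Counter


def breakdown(truth: dict[int, bool], bayes: dict[int, bool],
              fisher: dict[int, bool]) -> Counter:
    """Count messages in each component of the combined picture."""
    report = Counter()
    for msg_id, is_spam in truth.items():
        b, f = bayes[msg_id], fisher[msg_id]
        if is_spam:
            if b and f:
                key = "spam caught by both"
            elif b:
                key = "spam caught only by Bayes (Fisher leak)"
            elif f:
                key = "spam caught only by Fisher (Bayes leak)"
            else:
                key = "spam leaked by both"
        else:
            if b and f:
                key = "false triggering of both"
            elif b:
                key = "false triggering of Bayes only"
            elif f:
                key = "false triggering of Fisher only"
            else:
                key = "non-spam passed by both"
        report[key] += 1
    return report


truth = {1: True, 2: True, 3: True, 4: False, 5: False, 6: False}
bayes = {1: True, 2: True, 3: False, 4: True, 5: False, 6: False}
fisher = {1: True, 2: False, 3: False, 4: True, 5: True, 6: False}
for label, count in sorted(breakdown(truth, bayes, fisher).items()):
    print(count, label)
```

A spam message caught by only one filter is, at the same time, a leak of the other filter, so one counter key covers both views.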
Before testing, the filter was trained on 1100 messages (400 spam and 500 non-spam). The tests were run on a flow of 1223 messages. The Bayes method showed 2.9 percent false triggering and 9.8 percent spam omission. The Fisher method showed 1.5 and 4.5 percent, respectively. The combined filter showed the best result with 1.0 and 4.5 percent.
The experimental results confirmed the feasibility of using the selected filtering algorithms. Only with the whole picture can we make a reasonable comparison of the self-teaching quality of the combined filter.
References
[1]. E. Mezenceva, V. Tarasov. "Securing computer networks. The method of multi-module spam filtering on websites," Information Technologies, 2012. vol. 6, P. 18-22 (in Russian).
[2]. E. Mezenceva. "The software system of recognition and spam filtering on the sites," Certificate of state registration of the computer program №2011619160, [Registered in the Computer Program Registry, Moscow, on November 25th, 2011] (in Russian).
[3]. S. Nikolskiy. Quadrature Formulas. "Nauka", Moscow, 1974. 224 p. (in Russian).
[4]. E. Mezenceva, V. Tarasov. "Computer networks security. Web programming of the multi-module spam filter," Software Engineering, 2012. vol. 4, P. 27-32 (in Russian).
[5]. E. Mezenceva, V. Tarasov. "An optimal filter construction based on combining statistical classifiers," Information and communications technologies, book 1, 2013. vol. 4, P. 53-57 (in Russian).
Combined Classifier for Website Messages Filtration
Veniamin Tarasov, Ekaterina Mezenceva, Danila Karbaev
Volga Region State University of Telecommunications and Informatics, 77 Moskovskoe shosse, Samara, 443090, Russia.
Abstract. The paper considers a new approach to filtering messages on websites using a combined classifier. The level of user data protection is defined by information security standards for Internet resources; in addition, the number of spam messages in interactive sections of websites is constantly growing.
Unlike common solutions for email, the proposed approach is based on the combined use of the Bayes and Fisher methods, which made it possible to develop an effective software solution for spam filtering. The main idea of message classification is to select all features, calculate probabilities for individual features, and then combine all the calculated probabilities into a value for the whole message. Optimality criteria for message classification based on statistical models are considered. As an example, threshold values were set that ensure a minimum of desired messages being passed into spam, i.e. a minimum of false triggering. To obtain more reliable spam detection results, it is necessary to analyze the sets of results of individual filters and the subsets of their intersections. The paper considers an approach to building a combined classifier that satisfies the optimality criteria and supports decision-making in message classification based on statistical methods. We propose exactly this approach to classifier organization, which consists in the combined use of the Bayes and Fisher methods to improve filtering quality based on the analysis of the intersection subsets of the sets recognized by both methods (spam/non-spam, false triggering and spam leaks). Thanks to the implementation of the combined filter, the training quality of the combined filter can be compared on a sound basis.
Keywords: combined classifier, spam filter, optimization criterion.
DOI: 10.15514/ISPRAS-2015-27(3)-20
For citation: Tarasov V., Mezenceva E., Karbaev D. Combined Classifier for Website Messages Filtration. Trudy ISP RAN/Proc. ISP RAS, vol. 27, issue 3, 2015, pp. 291-302 (in English). DOI: 10.15514/ISPRAS-2015-27(3)-20.