Scientific article on the topic 'SENTIMENT ANALYSIS OF ARABIC TWEETS USING SVM CLASSIFIER WITH POS TAGGING FEATURES'

Field: Computer and Information Sciences

CC BY
Keywords
SENTIMENT ANALYSIS / SVM

Abstract of the scientific article in computer and information sciences. Authors: Jafar Kamel, Panov Alexander

Social media platforms are open spaces that allow users to express their opinions freely, which has made them some of the most popular and widely used sites on the Internet. Twitter is among the most visited social networking sites, and its number of users increases day by day. Because of the amount of information, opinions, and points of view that these sites contain, it has become important to extract and analyze these opinions so that those who benefit from this information can make appropriate decisions based on the analysis and classification of the texts. The field of opinion mining and sentiment analysis has received great attention from researchers, but most studies have focused on English texts. This research therefore studies Arabic texts, especially given the increased demand for sentiment analysis tools for Arabic written in both standard and colloquial forms. The research relies on machine learning and uses the Support Vector Machine (SVM) algorithm to classify tweets as having positive, negative, or neutral sentiment polarity, because it is one of the well-performing algorithms for text classification in general.


Sentiment Analysis of Arabic Tweets Using SVM Classifier with POS Tagging Features

Kamel Jafar, Alexander Panov


I. INTRODUCTION

With the rapid development of web technology, use of the Internet has grown quickly in many fields. In its early days, most content was produced by companies, governments, and universities, but individuals now create 71% of Internet content. Today we cannot imagine life without the Internet, which we use for purposes such as browsing and sending messages on social networking sites. These sites allow millions of people to publish their opinions, ideas, experiences, and anything that attracts their attention on platforms such as Twitter, Facebook, Instagram, and other forums. Social networking has therefore become an important part of our virtual life and has changed the way we communicate [1]. This communication is shaped by emotions, which play an important role in thinking and in expressing opinions. Internet users do not only consume information; they also contribute useful information, including opinions, which have become a focus of researchers who study their trends to support decision-making in a given field [2].

II. Related Work

Although Arabic is one of the most widely used languages on the Internet, it has not received appropriate attention, especially Standard Arabic, compared to other languages such as English. The reasons are its complex linguistic structure and the limited linguistic resources available for Arabic, such as dictionaries and grammars, which is one of the challenges facing researchers working on Arabic. Among the studies that investigated the Arabic language and data classification with the support vector machine (SVM) are the following.

The researcher in [3] presented an Arabic sentiment corpus called GLASC, built from online news and data shared via the big data source GDELT. GLASC consists of 152,621 news items arranged into three categories (positive, negative, and neutral), and each news item has an average sentiment value between -1 and 1. The best performance, 92.37%, was obtained by the HHM SVM classifier.

The authors of [4] proposed a hybrid system for classifying Arabic sentiment, called NB-MLP, which combines the Naïve Bayes algorithm with a Multilayer Perceptron (MLP) network. Six data sets, covering hotels, cinemas, products, restaurants, and tweets, were used to test the network. The data sets were categorized into positive and negative, and 10-fold cross-validation was used to test the suggested models.

The authors of [5] proposed a method that uses Naïve Bayes, KNN, and K-means clustering, and applied it to reviews of mobile phones. The classification accuracy was 91%.

The research in [6] dealt with opinion mining and sentiment analysis of the Arabic language on social networking sites, blogs, and Twitter. The authors focused on technical-language blogs to determine the feelings expressed and then linked the discussion points in the blog messages with related tweets on Twitter. This was done using content similarity and emotional score measurement, and text mining techniques were used to extract the required data.

The researchers in [7] studied the effect of preprocessing on the analysis of Arabic sentiment, especially in the Saudi dialect, using Twitter as a source of information because its tweets are short and rich in vernacular text. They used three supervised machine learning algorithms, SVM, Naïve Bayes, and KNN, and compared their classification results on 2,434 tweets.

The article [8] studied how to apply a sentiment analysis algorithm and how its performance is influenced by different types of preprocessing applied to the raw data, which were movie reviews. The results showed high validity in the sentiment analysis even with the small sample used; the accuracy exceeded 70% when appropriate NLP algorithms were applied.

The authors of [9] used a method called PSO for the selected features to increase the classification accuracy of the SVM classifier. They classified 200 reviews of a smartphone product into positive and negative and evaluated the classifier using 10-fold cross-validation. The accuracy of the algorithm, measured using the confusion matrix and the ROC curve, reached 95%.

The researchers of [10] focused on sentiment analysis of tweets in the Saudi dialect. They suggested a hybrid method that combines semantic meaning with machine learning to determine the trend of Arabic tweets and used a lexicon-based classifier to classify the tweets. The accuracy of the hybrid method is 84%.

The article [11] presented a new sentiment analysis application for tweets about a specific product and tested it using four of the most common supervised classification methods: SVM, Naive Bayes, Max Entropy, and J48. The J48 classifier proved the most efficient of the techniques used, with a classification accuracy of 92%.

The authors of [12] focused on evaluating Arabic content with an opinion mining and analysis tool. They collected various forms of the Arabic language (Classical, Modern Standard, and Colloquial). Comments and reviews were the inputs of the tool, and the outputs indicated the trends in those comments. The tool also determines whether an input is objective or subjective, positive or negative, and strong or weak. They used Naïve Bayes for data classification, with a classification accuracy of 94%.

III. Methodology

The architecture of the proposed sentiment analysis system, which consists of several stages, is illustrated in Figure (1).

Figure 1 Architecture of the proposed sentiment analysis system

A. Data Collection

Searching for data is difficult because the data available on the Internet are enormous and extracting opinions from them is hard [2]. Searching for opinions on the Internet showed that they come from numerous sources, namely:

• Websites for Reviews

• News Articles

• Blogs

• Social Media Posts

• Web Discourse

Social media has become a platform for conveying people's voices to the public, and the rapid progress of the Internet has made it an interactive medium in which users interact with each other to generate content such as news reports and posts in forums, blogs, and microblogs such as Twitter. These are primary sources of textual opinions, which are extracted using Natural Language Processing [13]. Twitter is a social networking platform that has become one of the most popular microblogs among Internet users and has received considerable attention in recent years. Sentiment extraction from Twitter data has been used in several fields and applications, such as behavioral economics in stock market applications, public health, and natural disasters; one example is its use during the Olympic Games in London in 2012 [14]. Users share ideas and produce a huge number of daily messages that can be collected and used to extract feelings and emotions about various topics [15]. Most of these are colloquial tweets that are easy for a user to understand but difficult for a system to interpret [16]. Short messages written on Twitter are called tweets and have a maximum length of 140 characters. There are two types of tweets: those that express an opinion and those that merely state facts. The language of the tweets is a mixture of classical and colloquial Arabic.

A tweet may contain emoticons such as ":)", which are emotional expressions represented by strings of characters or symbols and are good indicators for detecting emotions on Twitter. Their pictorial equivalents, emoji, are faces that express gestures such as happiness or sadness. Tweets also contain abbreviations, which shorten a word or phrase because of the length limit of a tweet; for example, "q8" is meant as "great", although the same abbreviation may also denote the word "Kuwait". Special symbols are used as well: @ directs the tweet to a Twitter user, the hashtag # is used for searching and ranking [17] (for example, #love generally indicates positive feelings and #sad generally indicates negative feelings), and RT marks a retweet of another person's tweet [18].

B. Preprocessing of tweets

At this stage, tweets that do not contain text are removed to increase the efficiency of the classification. Duplicate tweets are also removed, including retweets, because they represent unwanted data (spam). An example is: "RT: Bruno Guido, UNHCR representative, said in Iraq: For Iraq to advance, people must be able to return to their homes. 3.3 million displaced people have returned to their homes."

The symbol RT (Retweet) at the beginning of the tweet marks it as a retweet. Every tweet added to the tweet file was compared with the stored tweets to check whether it was already present, so repetition was eliminated from the stored texts, which increased the efficiency of the results obtained.
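The filtering described in this stage can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `preprocess`, the test for "contains text" (at least one letter), and the way the leading RT marker is stripped before comparison are all assumptions.

```python
import re

def preprocess(tweets):
    """Drop tweets without text and tweets already stored (retweets included)."""
    seen = set()
    kept = []
    for raw in tweets:
        # Assumption: a retweet is recognised by a leading "RT" marker,
        # which is stripped so the retweet compares equal to the original.
        text = re.sub(r"^RT\b[\s:]*", "", raw.strip())
        # Assumption: a tweet "contains text" if it has at least one letter.
        if not any(ch.isalpha() for ch in text):
            continue
        # Skip tweets that are already present in the stored set.
        if text in seen:
            continue
        seen.add(text)
        kept.append(text)
    return kept
```

Keeping the first copy and discarding later ones preserves the order in which tweets were collected.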

C. Normalization of Text

The Arabic text extracted from Twitter includes Arabic words (classical and colloquial), words written in foreign languages such as English, Arabic and Hindi numerals, punctuation marks, phonetic signs of the Arabic language (especially those of the Holy Qur'an), diacritics, letters repeated within a single word, and expressive images such as a smiley face. The size of the Arabic text is reduced by removing information that does not affect the sentiment classification of the text. This primary processing converts the original textual data into a form ready for classification, after the properties required for classification have been determined. Tweets range from official reports such as messages and circulating news to opinions about current events, and we observed that most tweets contain letters, words, and symbols that affect the accuracy and validity of the classification results. These cases were addressed after splitting the tweet into words. The purpose of this processing is to obtain words that match the corpora of Arabic words, so that the part of speech to which each word belongs can be determined. The cases dealt with include:

• Replacement of Some Letters

• Processing of Emojis and Emoticons

• Removal of Duplicate Letters

• Removal of Stop Words

• Processing of Merging Words
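Some of the cases listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the paper does not say which letters are replaced or which stop words are removed, so the letter map (unifying alef variants and alef maqsura), the two-word stop list, and the rule "three or more repeats collapse to one" are hypothetical choices.

```python
import re

# Arabic diacritics U+064B..U+0652 plus the tatweel elongation mark U+0640.
DIACRITICS = re.compile("[\u064B-\u0652\u0640]")

# Hypothetical letter replacements: alef with hamza/madda -> bare alef,
# alef maqsura -> yeh.
LETTER_MAP = str.maketrans({
    "\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627",
    "\u0649": "\u064A",
})

# Arabic-Indic digits mapped to ASCII digits.
DIGIT_MAP = str.maketrans(
    "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669",
    "0123456789",
)

# Tiny illustrative stop list ("fi" = in, "min" = from); a real list is longer.
STOP_WORDS = {"\u0641\u064A", "\u0645\u0646"}

def normalize(tweet):
    """Return the normalized tokens of one tweet."""
    text = DIACRITICS.sub("", tweet)
    text = text.translate(LETTER_MAP).translate(DIGIT_MAP)
    # Collapse a character repeated three or more times into one occurrence.
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    return [tok for tok in text.split() if tok not in STOP_WORDS]
```

Emoji handling and the splitting of merged words are left out, because the text gives no detail on how they were processed.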

D. Part-of-Speech (POS) Tagging

We used two methods to associate each word with its part of speech: rules based on word weights (morphological patterns), on the presence of prefixes and suffixes, or on the words that precede or follow the word; and a manually designed lexicon.

1. Ontology rules, which are divided into:

• Ontology rules that depend on the diacritics of the word. Without diacritics, the same written form may be a verb with negative sentiment polarity or a description with positive sentiment polarity. A word that carries a stress mark is therefore treated as a verb, and a word without it is checked against nouns that consist of the same number of letters in the same sequence.

• Ontology rules that do not depend on the diacritics of the word. They depend on the number of letters in the word and on the appearance of certain letters at certain positions in the word.

• Ontology rules that depend on the presence of certain prefixes and suffixes, through which a word can be identified as a noun or a verb; for example, a word is a verb if it ends with a suffix that is not part of the root of the word.

• Ontology rules that depend on the words that precede the word and indicate whether it is a noun or a verb. Several features differentiate the noun from the verb, but some of them do not apply to tweets because they are governed by the context of speech or by the dialect used. For example, a preposition at the beginning of a word does not mean that the word is a noun in tweets, because in the dialect of the Levant the same prefix is also attached to verbs as a marker of the present tense. It is therefore necessary to study the tweets first and then formulate rules that suit both classical and colloquial Arabic texts.

2. A dictionary of words. A dictionary of Arabic words was created manually. It records whether a word is a noun with a description, a noun with no description, a verb, or a letter, which is needed to decide how to process negation and how to determine the sentiment polarity of a word or phrase from its context. It contains classical words and colloquial words from multiple dialects.
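The combination of a lexicon with affix and context rules can be sketched as below. The paper's actual rules and dictionary entries are not recoverable from the text, so every entry here is a hypothetical stand-in chosen from well-known Arabic grammar: the definite article "al-" as a noun cue and the particles "lam" and "sawfa" as cues that a verb follows. The order of application (lexicon first, then rules) is also an assumption.

```python
# Hypothetical hand-built lexicon: word -> part of speech.
LEXICON = {"\u062C\u0645\u064A\u0644": "ADJ"}  # "jamil" (beautiful)

# Hypothetical context rule: these particles are followed by a verb.
VERB_PARTICLES = {"\u0644\u0645", "\u0633\u0648\u0641"}  # "lam", "sawfa"

# Hypothetical prefix rule: the definite article marks a noun.
DEFINITE_ARTICLE = "\u0627\u0644"  # "al-"

def tag(tokens):
    """Tag each token using the lexicon first, then the rules."""
    tags = []
    for i, word in enumerate(tokens):
        if word in LEXICON:
            tags.append(LEXICON[word])
        elif i > 0 and tokens[i - 1] in VERB_PARTICLES:
            tags.append("VERB")
        elif word.startswith(DEFINITE_ARTICLE) and len(word) > 3:
            tags.append("NOUN")
        else:
            tags.append("UNK")  # no rule fired
    return tags
```

As the section notes, such rules have to be checked against real tweets, since dialect usage can break cues that hold in classical Arabic.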

E. Polarity of Words

A polarity is given to each word, whether classical or colloquial. Some words are subject to rules before they receive their final sentiment polarity, which is determined on three levels:

1. Sentiment polarity at the level of the word

• Each word of the tweet is given an initial sentiment polarity, because this polarity may be changed or ignored as a result of the following levels.

• Emoji and emoticons. It is very important to address these cases because they can determine the sentiment polarity of the tweet as a whole.

• The sentiment polarity of hashtags. Studying the tweets showed that positive hashtags affect the sentiment polarity at the level of the tweet as a whole.

2. Sentiment polarity at the phrase level

Counting the words with positive sentiment polarity and the words with negative sentiment polarity is not sufficient, because the position of a word within the phrase or sentence can change its sentiment polarity. Rules were therefore defined to give the phrase as a whole a sentiment polarity that depends on the POS tag of each word and on the sentiment polarity of each word in the phrase, as follows:

A. If a word is a noun with no suffix attached and is followed by a word beginning with the definite article, the two words are treated as a phrase whose sentiment polarity is given in Table (1):

Table 1 Sentiment Polarity at Phrase Level

B. If the phrase contains three words, where the first word is indefinite (it does not carry the definite article), is annexed to a second word carrying the definite article, and is followed by a third word carrying the definite article, then the sentiment polarity of the phrase is as in Table (2).

Table 2 Sentiment Polarity at Phrase Level (3 words)
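Rule B amounts to a table lookup. The sketch below transcribes the numeric rows of Table 2 as they appear in the source; the names `TABLE_2` and `three_word_polarity` are hypothetical, and combinations not listed in the table return `None` instead of a guessed value.

```python
# (polarity of the preceding two-word phrase, polarity of the third word)
#   -> polarity of the three-word phrase, transcribed from Table 2.
TABLE_2 = {
    (+1, +1): +1,
    (+1, 0): +1,
    (-1, -1): -1,
    (-1, +1): -1,
    (0, +1): +1,
    (-1, 0): -1,
    (0, 0): 0,
    (0, -1): -1,
}

def three_word_polarity(previous_phrase, third_word):
    """Look up the phrase polarity; None if the pair is not in Table 2."""
    return TABLE_2.get((previous_phrase, third_word))
```

In the listed rows, a non-neutral preceding phrase keeps its polarity whatever the third word is, and the third word decides only when the preceding phrase is neutral.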

3. Sentiment polarity at the level of the entire tweet

The number of words with positive sentiment polarity and the number of words with negative sentiment polarity are counted, taking into account the previous levels, as follows:

a. If the tweet contains one or more positive hashtags, the tweet is considered positive and the sentiment polarity of the words is ignored. If it contains negative hashtags, the tweet is considered negative and the sentiment polarity of the words is likewise ignored. The sentiment polarity of the tweet is then as in Table (3).

Table 3 The Sentiment Polarity of the tweet based on the hashtag

Sentiment Polarity    #
Positive              (positive hashtag)
Negative              (negative hashtag)

b. If the tweet contains emoji and emoticons, these symbols contribute to its sentiment polarity: if all the symbols are positive, the sentiment polarity of the tweet is positive, and if all the symbols are negative, it is negative, as in Table (4).

Table 4 Tweet contains Emoji and Emoticons

Sentiment Polarity    Emoji / Emoticon
Positive              (positive symbol)
Negative              (negative symbol)
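The tweet-level rules can be combined into one decision function, sketched below. Several points are assumptions, because the text does not settle them: hashtags are checked before emoji and emoticons, a group with mixed polarities decides nothing and falls through, and the final fallback is a simple majority of word polarities with ties counted as neutral.

```python
def unanimous(polarities):
    """Return +1 or -1 if all non-neutral values agree, otherwise None."""
    signs = {p for p in polarities if p != 0}
    return signs.pop() if len(signs) == 1 else None

def tweet_polarity(hashtags, symbols, words):
    """Each argument is a list of polarities taken from {-1, 0, +1}."""
    # Rules a and b: unanimous hashtags, then unanimous emoji/emoticons,
    # override the polarities of the individual words.
    for group in (hashtags, symbols):
        verdict = unanimous(group)
        if verdict is not None:
            return verdict
    # Assumed fallback: compare the counts of positive and negative words.
    positives = sum(1 for p in words if p > 0)
    negatives = sum(1 for p in words if p < 0)
    return (positives > negatives) - (positives < negatives)
```

For example, `tweet_polarity([1], [], [-1, -1])` is positive: the hashtag overrides the two negative words, as rule a requires.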

F. SVM Algorithm:

The support vector machine (SVM) is a machine learning method used for classification. It is based on statistical learning theory and on the principle of Structural Risk Minimization (SRM), which minimizes the upper bound on the expected risk [19] and thereby reduces the misclassification error [20]. Decision-making in the SVM algorithm is fast, which is why it is used in real-world applications [1]. Three essential elements contribute to its success: first, the maximum margin principle; second, duality theory; and third, the kernel trick. It has become an effective tool for machine learning problems and difficulties such as the curse of dimensionality, because it can learn regardless of the dimensionality of the feature space and measures the complexity of a hypothesis by the separating hyperplane rather than by the number of features [21]. The algorithm quickly gained popularity because of the following features: first, elaborate mathematical representations; second, geometrical explanations; third, generalization ability; and fourth, promising empirical performance [22].

The geometric interpretation of the algorithm is that it searches for an optimal separating hyperplane in a multidimensional space that divides the data into two classes [23]. This is the case of binary classification, in which the training data belong to only two classes; multiclass SVM is applied if the training data belong to more than two classes. The main idea of SVM is to build an optimal hyperplane in a space that solves classification and discrimination problems, and the larger the margin of the optimal hyperplane, the greater the efficiency of the algorithm and the accuracy of the classification [24]. To determine the largest margin, two parallel hyperplanes are built on both sides of the separating hyperplane; the margin is the distance between these two parallel hyperplanes, as in Figure (2) [25]. The SVM algorithm is based on the concept of decision planes that define decision boundaries: a decision plane is the boundary that separates groups of entities with different class memberships. The algorithm finds the optimal hyperplane with the largest margin of separation between classes using Lagrange's optimization formula. The benefit of the margin is to avoid falling into local minima and to obtain the best classification [26]. The algorithm can be thought of as a linear algorithm in a higher-dimensional space [27] (as in Figure 2).

Values of Table 1, Sentiment Polarity at Phrase Level (the Arabic example phrases are not legible and are omitted):

First Word            Second Word           Phrase
Sentiment Polarity    Sentiment Polarity    Sentiment Polarity
+1                    +1                    +1
+1                    -1                    +1
+1                    -1                    0
-1                    -1                    +1
-1                    +1                    -1
-1                    0                     -1
+1                    0                     +1
0                     +1                    +1
0                     -1                    -1
0                     0                     0

Values of Table 2, Sentiment Polarity at Phrase Level (3 words) (the Arabic example phrases are not legible and are omitted):

Previous Phrase       Third Word            Phrase
Sentiment Polarity    Sentiment Polarity    Sentiment Polarity
+1                    +1                    +1
+1                    0                     +1
-1                    -1                    -1
-1                    +1                    -1
0                     +1                    +1
-1                    0                     -1
0                     0                     0
0                     -1                    -1

Figure 2 Linear Classification and Regression

The margin in the SVM algorithm has two concepts. The first is the geometric margin, which is the distance to the hyperplane and is related to the norm of the weight vector: the larger the margin, the smaller the norm of the weight vector. The second is the numerical (functional) margin, the magnitude of Yi f(Xi), which appears in the loss function used in the standard SVM. If Yi f(Xi) is large, the loss is small, meaning that the number of classification errors (the empirical error) is reduced [28]. An SVM classifier is a Maximum Margin Hyperplane (MMH) classifier, not a probabilistic classifier [29]. Because the algorithm minimizes the empirical classification error and maximizes the geometric margin simultaneously, it is called the largest margin classifier [23]. The SVM classification algorithm builds a model from the training data in order to predict the classes of the test data. The idea of training is to find the margin that will classify the test data, and this is achieved through three classification techniques that depend on the separability of the training data: first, linearly separable data; second, linearly inseparable data; and third, non-linearly separable data [30].
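The two margin concepts can be made concrete for a linear decision function f(x) = w·x + b. The sketch below uses the standard textbook definitions, which the paper does not write out: the numerical margin is y·f(x), the geometric margin divides it by the norm of w, and the hinge loss max(0, 1 − y·f(x)) is assumed to be the loss function the text refers to. All function names are hypothetical.

```python
import math

def decision_value(w, b, x):
    """Linear decision function f(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def functional_margin(w, b, x, y):
    """Numerical margin y * f(x) for a label y in {-1, +1}."""
    return y * decision_value(w, b, x)

def geometric_margin(w, b, x, y):
    """Signed distance from x to the hyperplane: y * f(x) / ||w||."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return functional_margin(w, b, x, y) / norm

def hinge_loss(w, b, x, y):
    """Zero once the numerical margin reaches 1, linear below that."""
    return max(0.0, 1.0 - functional_margin(w, b, x, y))
```

With w = (3, 4) and b = -5, the point (3, 4) labelled +1 has numerical margin 20 and geometric margin 4, so its hinge loss is zero, while a point lying on the hyperplane has loss 1.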

The SVM algorithm was originally proposed for binary classification problems (classification into only two categories, such as 1 or 0), but nowadays big data often has to be classified into more than two categories, which is more complex than binary classification. This created the need for multiclass SVM (if M is the number of classes, then M > 2 denotes the multiclass case), which is a major requirement in science and engineering, in which the SVM algorithm deals with all classes simultaneously [14], and which is a powerful and accurate technique in model classification and knowledge mining [31]. There are two types of multiclass SVM. The first type divides the classification problem into a set of binary classification problems, using the one-vs-rest (OvR) and one-vs-one (OvO) methods. The second type solves the multiclass classification problem in a single model; examples are the regression method, Crammer's multiclass SVM, and Weston's multiclass SVM [9].
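The first type of multiclass SVM only relabels the data, which the following sketch illustrates. It shows how the binary problems are formed, not how each SVM is trained; the function names and the +1/-1 label convention are assumptions.

```python
from itertools import combinations

def one_vs_rest(labels):
    """One binary problem per class: +1 for that class, -1 for all others."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else -1 for y in labels] for c in classes}

def one_vs_one(labels):
    """One binary problem per pair of classes.

    Each problem keeps only the samples of its two classes, stored as
    (sample index, binary label) with +1 for the first class of the pair.
    """
    classes = sorted(set(labels))
    problems = {}
    for a, b in combinations(classes, 2):
        problems[(a, b)] = [
            (i, 1 if y == a else -1)
            for i, y in enumerate(labels)
            if y in (a, b)
        ]
    return problems
```

For M classes, one-vs-rest yields M binary problems and one-vs-one yields M(M − 1)/2; with the three classes used here (positive, negative, neutral) both give three.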

IV. Results and Discussion

A. Features of Training:

Four feature groups were used for training: verb features, noun features, (verb and noun) features, and unigram features, with negation tools added to each group. The POS tagging method was used to identify the features of the Arabic text: one part of the tagging is based on ontology rules for Arabic words, which mostly lack diacritics, and the other part is based on the hand-designed lexicon of words. After conversion into a numeric format that the algorithm can handle, these features were used with the multi-class SVM algorithm and the RBF kernel. The training and test data are divided into two groups:

a. First training set: all tweets are treated as opinions; it consists of 1,500 tweets.

First test set: 300 tweets.

b. Second training set: only tweets classified as opinions are used, and the rest, which were classified as news, are ignored; it consists of 500 tweets.

Second test set: 100 tweets.

Each tweet is represented by 50 training features, giving (50 * 1500) features for the first training set and (50 * 500) features for the second training set.
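The RBF kernel compares two feature vectors, here two tweets of 50 numeric features each. The sketch uses the standard definition K(x, z) = exp(-gamma * ||x - z||^2); the paper does not report its kernel parameters, so the value of `gamma` in any call is a hypothetical choice.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)
```

The kernel equals 1 for identical tweets and decays towards 0 as their feature vectors move apart, with `gamma` controlling how fast.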

B. Measurements of Performance of SVM Algorithm in Classification:

1) Confusion Matrix: The confusion matrix tabulates the predicted classes against the actual classes. The terms used in defining a confusion matrix are TP (true positives), TN (true negatives), FP (false positives), and FN (false negatives).

2) Accuracy: The ratio of the number of correct predictions to the total number of predictions.

Accuracy = (TP + TN) / (TP + FP + TN + FN)

3) Precision: Out of all cases that were marked as positive, how many are truly positive.

Precision = TP / (TP + FP)

4) Recall or Sensitivity: Out of all the actual positive cases, how many were identified as positive.

Recall = TP / (TP + FN)

5) F1-Score: The F1 score is the harmonic mean of Precision and Recall, which gives equal importance to FP and FN. It is a very useful metric compared to accuracy. The problem with accuracy is that if the training dataset is highly imbalanced (for example, 95% positive class and 5% negative class), the model learns to predict the positive class properly and does not learn how to identify the negative class, yet it still has very high accuracy on the test dataset because it identifies the positives well.

F1 score = 2 * (Precision * Recall) / (Precision + Recall)
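These measures can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definitions, with recall as TP / (TP + FN), and adds the true negative rate TN / (TN + FP) that the result figures report; returning 0.0 when a denominator is zero is an assumption.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall, F1 and TN rate from the counts."""
    def ratio(numerator, denominator):
        # Assumption: an undefined ratio (empty denominator) is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # also the true positive rate
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "tn_rate": ratio(tn, tn + fp),
    }
```

On a test set with 95 positive and 5 negative tweets, a classifier that predicts positive for everything gives `classification_metrics(95, 0, 5, 0)`: accuracy is 0.95 while the true negative rate is 0.0, which is the weakness of accuracy on imbalanced data described above.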

C. Classification results for the SVM algorithm according to the adopted features:

1. Verb features: The features (verbs + negation tools) were used as the features of the training model for the SVM algorithm. Figures (3) and (4) show the results for the first and second datasets. The evaluation metrics for the first group are all high except the true negative rate, because the classifier performs better on positive tweets than on negative tweets. The results for the second group were uneven because its ability to classify positive tweets was lower. Comparing the two groups, precision, accuracy, the F-score, and the true positive rate decreased in the second group, while recall and the true negative rate increased. The reason is that the first training group is larger than the second, although positive tweets outnumbered negative tweets in both datasets. Based on this comparison, the second training group was judged better than the first even though most of the measures were better in the first group, because the second training group distinguished negative tweets well while still classifying positive tweets acceptably.

Figure 3 Results of Classifying the First Training Set using Noun Features (metrics: Accuracy, Precision, Recall, F-measure, TP-Rate, TN-Rate)

Figure 4 Results of Classifying the Second Training Set using Noun Features (metrics: Accuracy, Precision, Recall, F-measure, TP-Rate, TN-Rate)

2. Features of verbs and nouns: The features (verbs + nouns + negation tools) were used as the features of the training model for the SVM algorithm. Figures (5) and (6) show the results for the first and second groups. Comparing the two groups, all classification measures increased in the second group except recall, and for this reason the second training group was found to be better than the first.

Accuracy Precision Recall F-measure TP-Rate TN-Rate Figure 5 Results of Classifying the First Training Set using verbs and nouns Features

Accuracy Precision Recall F-measure TP-Rate TN-Rate Figure 7 Results of Classifying the First Training Set using unigrams Features

D. Comparison of the final results for all classification features:

By comparing the results of the classification of the four features (verbs, nouns, verbs and nouns, and unity Unigram). Figures (9) and (10) showing that the best features of the classification are the attributes of verbs, because the classifier's performance in categorizing the tweets into positive, negative and neutral emotions was the best.

Accuracy

Precision

Accuracy Precision Recall F> measure TP ■Rate TN-Rate Figure 6 Results of Classifying the Second Training Set using verbs and nouns Features

3. Unigram's Features: The features (Unigram + negation tools) were used, and the Unigram is all the words of the tweet except for the excluded Arabic words, to be the features of a model of the training algorithm and SVM.

Figures (7) and (8) show the results of the first and second datasets. It was also noted that the results of the evaluation metrics for the second group were uneven. By comparing the results of the first group with the results of the second group, and since the results of the two groups are close, and for this reason, it was found that the second training group is better than the first training group; because the correct negativity rate measure in the second group is higher than the first group.

Figure 9 Comparing the results of the metrics used in classification to the first training dataset (panels: Accuracy, Precision, Recall, F-measure, True Positive Rate, True Negative Rate; bars: verbs, nouns, verbs & nouns, unigrams)

Figure 8 Results of Classifying the Second Training Set using unigrams Features (bars: Accuracy, Precision, Recall, F-measure, TP-Rate, TN-Rate)

Figure 10 Comparing the results of the metrics used in classification to the second training dataset (panels: Accuracy, Precision, Recall, F-measure, True Positive Rate, True Negative Rate; bars: verbs, nouns, verbs & nouns, unigrams)

Conclusions and Recommendations

A. Conclusions:

By applying the proposed system, the following conclusions can be drawn:

1. Classifying the tweets using the small training group gave better results than using the large training group.

2. The use of ontology rules to tag each word with its part of speech helped identify classical and colloquial words that were not present in the manually created dictionary, and sometimes not in other Arabic dictionaries, because texts written on social networking sites often lack diacritics.

3. Classifying tweets using verb features gave better results than using the other features.

4. Classifying emoticons and emojis helped identify the sentiment polarity of tweets.

5. The processing of hashtags helped to determine the sentiment polarity of the tweet.

6. The sentiment polarity of a tweet can be determined based on the presence of certain words in the tweet.

7. Adding more ontology rules for determining the sentiment polarity of a tweet helps in building a new classifier.

8. Tweets may contain only a few words or contain misspellings, which makes categorizing tweets a difficult task, in addition to the inherent difficulty of classifying an opinion as positive, negative, or neutral. Studies of opinions in tweets are also sparse compared to other studies.
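The emoticon and hashtag handling mentioned in conclusions 4 and 5 could look roughly like the sketch below. The polarity lexicon `EMOTICON_POLARITY`, the underscore-splitting rule for hashtags, and the function names are assumptions made for illustration; the paper's actual rules are not given in this section.

```python
# Hypothetical polarity lexicon (an assumption, not the paper's list).
EMOTICON_POLARITY = {
    ":)": "positive",
    ":(": "negative",
    "😊": "positive",
    "😢": "negative",
}


def split_hashtag(tag):
    """Drop the leading '#' and split the hashtag body on underscores."""
    return [part for part in tag.lstrip("#").split("_") if part]


def preprocess(tweet):
    """Return (word tokens, emoticon polarities) for a whitespace-split tweet."""
    tokens, polarities = [], []
    for raw in tweet.split():
        if raw in EMOTICON_POLARITY:
            polarities.append(EMOTICON_POLARITY[raw])
        elif raw.startswith("#"):
            tokens.extend(split_hashtag(raw))
        else:
            tokens.append(raw)
    return tokens, polarities


# Toy tweet: a two-word hashtag, one ordinary word, and one emoticon.
tokens, polarities = preprocess("#فيلم_رائع أحببته :)")
```

Here the hashtag contributes its words to the token list, while the emoticon is removed from the text and kept as a separate polarity cue.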

B. Recommendations:

1. Using other social media websites, such as Facebook, for opinion mining and sentiment analysis.

2. Using other algorithms to classify tweets.

3. Increasing the number of categories when classifying tweets using opinion mining and sentiment analysis algorithms.

4. Using other features to classify tweets (trigram and bigram).

5. Classifying the images accompanying the tweets as part of the opinions.

6. Addressing the links accompanying the tweets that may refer to other tweets related to the opinions of the tweeters.

7. Devising new ontology rules to help define the grammatical features of the support vector machine algorithm.
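As an illustration of the bigram and trigram features suggested in recommendation 4, the sketch below extracts contiguous n-grams from a token list. The function name `ngrams` and the toy tweet are illustrative assumptions, not part of the proposed system.

```python
def ngrams(tokens, n):
    """Return the contiguous n-token tuples of `tokens`, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy tokenized tweet meaning "I do not like this film".
tokens = ["لا", "أحب", "هذا", "الفيلم"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

A bigram such as the negation tool followed by the verb keeps the two words together as one feature, which unigram features cannot do.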

References

[1] B. Liu, Sentiment Analysis and Opinion Mining, Chicago: Morgan and Claypool Publishers, 2012.

iНе можете найти то, что вам нужно? Попробуйте сервис подбора литературы.

[2] Asmita Dhokrat and Sunil Khillare and C. Namrata Mahender, "Review on Techniques and Tools used for opinion Mining," International Journal of Computer Applications Technology and Research, vol. 4, no. 6, pp. 419 - 424, 2015.

[3] A. Nasser, Large-Scale Arabic Sentiment Corpus And Lexicon Building For Concept-Based Sentiment Analysis Systems, Ankara: School of Science and Engineering of Hacettepe University, 2018.

[4] Mohammad Subhi Al-Batah and Shakir Mrayyen and Malek Alzaqebah, "Investigation of Naive Bayes Combined with Multilayer Perceptron for Arabic Sentiment Analysis and opinion Mining," J. Comput. Sci., vol. 14, pp. 1104-1114, 2018.

[5] Ruchika Aggarwal and Latika Gupta, "A Hybrid Approach for Sentiment Analysis using Classification Algorithm," International Journal of Computer Science And Mobile Computing, Ijcsmc, vol. 6, no. 6, p. 149 - 157, 2017.

[6] S. Alhazmi, LINKING ARABIC SOCIAL MEDIA BASED ON SIMILARITY AND SENTIMENT, Manchester: The University of Manchester, 2016.

[7] Waad A Al-Harbi and Ahmed Emam, "Effect of Saudi Dialect Preprocessing On Arabic Sentiment Analysis," International Journal Of Advanced Computer Technology (Ijact) , pp. 91-99, 2016.

[8] Rababah Osama and Al Hwaitat Ahmad and Qudah Dana, "Sentiment Analysis As A Way of Web Optimization," Scientific Research and Essays, vol. 11, pp. 90--96, 2016.

[9] Wahyudi Mochamad and Kristiyanti Dinar Ajeng, "Sentiment Analysis Of Smartphone Product Review Using Support Vector Machine Algorithm-Based Particle Swarm Optimization," Journal Of Theoretical And Applied Information Technology, vol. 91, pp. 189201, 2016.

[10] Aldayel Haifa K and Azmi Aqil M, "Arabic tweets sentiment analysis - a hybrid scheme," Journal of Information Science, vol. 42, pp. 782-797, 2016.

[11] Suresh Hima and Raj.S G, "Analysis of Machine Learning Techniques for Opinion Mining," International Journal of Advanced Research, vol. 3, no. 12, pp. 375-381, 2015.

[12] Al-Kabi Mohammed N and Gigieh Amal H and Alsmadi Izzat M and Wahsheh Heider A and Haidar Mohamad M, "Opinion mining and analysis for Arabic language," (IJACSA) International Journal of Advanced Computer Science and Applications, vol. 5, no. 5, pp. 181195, 2014.

[13] Bhonde Reshma and Bhagwat Binita and Ingulkar Sayali and Pande Apeksha, "Sentiment Analysis Based on Dictionary Approach," International Journal of Emerging Engineering Research and Technology, vol. 3, no. 1, pp. 51-55, 2015.

[14] Lauer Fabien and Guermeur Yann, "MSVMpack: A Multi-Class Support Vector Machine Package," The Journal of Machine Learning Research, vol. 12, pp. 2293-2296, 2011.

[15] Tumsare Pranali and Sambare Ashish S and Jain Sachin R and Olah Andrada, "Opinion mining in natural language processing using sentiwordnet and fuzzy," International Journal of Emerging Trends & Technology in Computer Science (IJETTCS), vol. 3, no. 3, pp. 153158 , 2014.

[16] Sharma Richa and Nigam Shweta and Jain Rekha, "Opinion mining of movie reviews at document level," International Journal on Information , pp. 13-21, 2014.

[17] O. D. E, "Blog mining-review and extensions: From each according to his opinion," Decision support systems, vol. 51, no. 4, pp. 821830, 2011.

[18] S. Olha, Opinion Mining And Sentiment Analysis Using Bayesian And Neural Networks Approaches, Master thesis, University of Tartu, Institute of Computer Science, 2017.

[19] Surya Prakash Sharma and Rajdev Tiwari and Rajesh Prasad, "Opinion Mining and Sentiment Analysis on Customer Review Documents- A Survey," International Journal of Advanced Research in Computer and Communication Engineering, pp. 156-159, 2017.

[20] D. Oraon, Study On Proximal Support Vector Machine As A Classifier, Department Of Electronics And Communication Engineering National Institute Of Technology, Rourkela, Orissa, 2012.

[21] Patil Gaurangi and Galande Varsha and Kekan Vedant and Dange Kalpana, "Sentiment Analysis Using Support Vector Machine," International Journal of Innovative Research in Computer and Communication Engineering, vol. 2, no. 1, pp. 2607-2612, 2014.

[22] Tian Yingjie and Shi Yong and Liu Xiaohui, "Recent Advances On Support Vector Machines Research," Technological and economic development of Economy, vol. 18, no. 1, pp. 5-33, 2012.

[23] Bhavsar, H and Ganatra, A, "Increasing Efficiency of Support Vector Machine using the Novel Kernel Function: Combination of Polynomial and Radial Basis Function," International Journal on Advanced Computer Theory and Engineering (IJACTE), vol. 3, no. 5, pp. 17-54, 2014.

[24] Kulkarni A. A. and Hundekar V. A. and Sannakki S. S. and Rajpurohit V. S., "Survey on Opinion Mining Algorithms and Applications," International Journal of Computer Techniques, vol. 4, no. 3, p. 9, 2017.

[25] Khairnar, Jayashri and Kinikar, Mayura, "Machine Learning Algorithms for Opinion Mining and Sentiment Classification," International Journal of Scientific and Research Publications, vol. 3, no. 6, pp. 1-6, 2013.

[26] Yash Ahuja and Sumit Kumar Yadav, "Multiclass Classification and Support Vector Machine," Global Journal of Computer Science and Technology Interdisciplinary, vol. 12, no. 11, pp. 15-20, 2012.

[27] (Karatzoglou, Alexandros and Meyer, David and Hornik, Kurt, "Support Vector Machines in R," Journal of Statistical Software, vol. 15, no. 9, pp. 1-28, 2006.

[28] Javier, M and Alberto, M, "Support Vector Machines with Applications," Statistical Science, vol. 21, no. 3, pp. 322-336, 2006.

[29] "Machine Learning Algorithms for Opinion Mining and Sentiment Classification," International Journal of Scientific and Research Publications, vol. 3, no. 6, pp. 1-6, 2013.

[30] Bhuvaneswari P. and Kumar J. S., "Support Vector Machine Technique for EEG Signals," International Journal of Computer Applications, vol. 63, no. 13, pp. 1-5, 2013.

[31] Asogbon, Mojisola G and Samuel, Oluwarotimi W and Omisore, Mumini O and Ojokoh, Bolanle A, "A multi-class Support Vector Machine Approach for Students Academic Performance Prediction," International Journal of Multidisciplinary and Current Research, vol. 4, pp. 210-215, 2016.

Kamel S. Jafar, Ph.D. student at MIREA - Russian Technological University (RTU MIREA) (78, Vernadskogo pr., Moscow, 119454 Russia). Scientific specialty: Mathematical and software computing systems complexes and computer networks. E-mail: [email protected]

Panov A. Vladimirovich, Associate Professor of the Department Academic at MIREA - Russian Technological University (RTU MIREA) (78, Vernadskogo pr., Moscow, 119454 Russia). E-mail: [email protected]
