CC BY


Applied Linguistics


UDC 81'33, 81'32 https://www.doi.org/10.33910/2687-0215-2020-2-1-16-27

Automatic recognition of messages from virtual communities of drug addicts

V. I. Firsanova¹

¹ Saint Petersburg State University, 7/9 Universitetskaya Emb., Saint Petersburg 199034, Russia

Abstract. The paper describes building a binary classifier with a Convolutional Neural Network (CNN) using two different types of word vector representations, Bag-of-Words and Word Embeddings. The purpose of the classifier is to recognise messages published in virtual communities of drug-addicted people. This system may find application in healthcare as a tool for automatic identification of addicts' communities. It may also provide insights into the features of addicts' online discourse. The classifier is based on a dataset from Russian-speaking online VK (VKontakte) communities. The dataset comprises texts of publications and comments posted in two types of open communities. The first type includes communities which actively discuss problems of addiction to psychotropic and psychoactive substances. The second type focuses on the discussion of private issues: the users share their life stories and ask for help or advice. In the latter case, publications are not related to drug addiction issues. The experiments centred around the development, evaluation and comparative analysis of two models, based on Bag-of-Words and Word Embeddings, respectively. The neural network was trained on a Tesla T4 graphics processing unit on the Google Colab platform. The model with the best performance showed an F1-Score of 0.99 and an Accuracy of 0.95; however, testing of the programme revealed a few weaknesses. The programme still requires retraining on a supplemented dataset which includes publications collected from both addicts' and non-addicts' communities describing various mental conditions including depression, anxiety and nervous disorders. This opens up an opportunity to create software that can automatically distinguish publications made by people struggling with depression caused by the use of psychoactive substances from publications made by people suffering from depressive disorders of a different kind.

Keywords: text classification, Word Embeddings, Bag-of-Words, Convolutional Neural Networks, supervised learning, text categorisation, neural networks, one-hot encoding, classification algorithm.

Introduction

The paper describes the process of building a classifier. It may find application as a tool for the content analysis of virtual communities on social media based on the processing of published texts. The study resulted in the development of two models of the classifier. They were based on different types of word vector representations, Bag-of-Words and Word Embeddings. The classifier was built using a Convolutional Neural Network (CNN). Both models were built with the Keras library (Keras: the Python deep learning API 2020) in Python 3 (Python 3.6.7 documentation 2020). All the experiments were performed on Google Colab (Google Colab 2020) using a Tesla T4 GPU. The classifier assigns one of the two defined labels. If the argmax (the index of the largest value) of the neural network prediction is 0, the programme assigns the label "non-addicts" and outputs the following message: "This text was most likely published in a non-addicts' virtual community." In turn, if the argmax of the neural network prediction is 1, the programme assigns the label "addicts" and outputs the following message: "This text was most likely published in an addicts' virtual community."
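The decision rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `decide` and the dictionaries are hypothetical, and the prediction is assumed to be a two-element list of class probabilities such as a softmax output.

```python
# Hypothetical names; the labels and messages follow the description in the text.
LABELS = {0: "non-addicts", 1: "addicts"}
MESSAGES = {
    0: "This text was most likely published in a non-addicts' virtual community.",
    1: "This text was most likely published in an addicts' virtual community.",
}


def decide(prediction):
    """Return (label, message) for a two-element prediction vector."""
    # argmax: the index of the largest value in the prediction
    class_index = max(range(len(prediction)), key=lambda i: prediction[i])
    return LABELS[class_index], MESSAGES[class_index]


label, message = decide([0.12, 0.88])
print(label)
print(message)
```

Keeping the labels and messages in dictionaries keyed by class index means the rule stays the same if the wording of the messages changes.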

To train the model based on Word Embeddings, a frequency dictionary was built for the dataset, and word rank indices were used for building vector representations. In addition, an Embedding layer was added to the CNN model. To train the model based on Bag-of-Words representations, one-hot encoding was applied to the integer representations of word frequency indices. During the data pre-processing, the whole training set was cut into short text pieces of equal length. During the parameter optimisation, the sample length, the sampling step size and the number of words from the frequency dictionary analysed during the neural network training were adjusted. Two models (one based on Bag-of-Words and the other based on Word Embeddings) were evaluated, and the best one was chosen for the final testing.

Both Bag-of-Words and Word Embeddings representations can be used to analyse lexical features without directly investigating syntactic ones. However, valency filling could still be analysed, as lemmatisation was not implemented during the data pre-processing. A linguistic analysis of the publications from the test set was therefore carried out to find out how important syntactic analysis is for the research. During the linguistic analysis, the lexical diversity and syntactic features of the texts were taken into account. Then, the linguistic analysis results were compared with the model outputs. During the analysis and the programme testing, several ambiguous texts were found in the dataset, and the model made mistakes in analysing them. An anonymous survey was conducted to find out whether these examples were truly ambiguous or the programme errors were caused by insufficient model training. Respondents were asked to decide whether a given passage was published in a drug addicts' virtual community or some other online community.

The study presents an application of machine learning techniques for solving psycholinguistic tasks. The study result is not a diagnostic tool. Since the study raises an acute social problem, a discussion about the possibilities of its practical application requires prior consultation with experts.

Related work

In 1983, Dmitry Spivak introduced a new term "linguistics of altered states of consciousness" to designate the research field that combines psycholinguistics and neuroscience. Studies in linguistics of altered states of consciousness describe linguistic features and processes that take place in an altered state of consciousness. "Altered" can refer to a state induced by psychoactive substances or by staying in adverse conditions (for example, in the highlands) or to an unusual emotional state (for example, anxiety or fear) (Spivak 1983).

This study also belongs to the linguistics of altered states of consciousness. Judging by the content of texts from the training dataset, the authors of some publications might systematically (repeatedly and regularly) consume psychotropic and psychoactive substances. Thus, the dataset might contain messages created by people in an altered state of consciousness. Since all the processed data is anonymous, statistical evidence for that cannot be provided, although several confirming examples from the dataset can be shown. Examples (1)-(3) present transliterated Russian texts with the authors' original spelling and punctuation and their translation into English given in brackets.

(1) Sejchas kuryu chasten'ko, no ne bolee togo. (Now [I']m smoking quite often but no more than that.)

(2) Ya sejchas kuryu 50-60 kosyachkov v den'. (Now I'm smoking 50-60 joints a day.)

(3) Mne 22 goda iz nih ya shest' let upotreblyayu narkotiki. (I'm 22 and I've been consuming drugs for six years.)

Psychoactive substance consumption might cause brain disorders. For example, it might influence the functioning of brain areas responsible for speech production. In some cases, brain resources are not sufficient to compensate for the work of the damaged areas (Luria 1976; Jakobson 1973). This fact partly explains why speech disorders are common among people suffering from alcohol, nicotine and drug addiction. Perhaps such problems can be reflected in an individual's online discourse since digital communication has a lot in common with live interaction.

The study results can find a practical implementation in psychotherapy. The analysis of speech or texts produced by a drug addict or a person in an altered state can shed light on the patient's perception of the world. For example, stigmatisation manifested in the form of speech formulas is widespread among drug addicts (for instance, "there are no former drug addicts"). The analysis of speech semantic features during psychotherapy facilitates de-stigmatisation and may debunk myths about addiction. This is an important step in treatment which allows a patient to establish a trusting relationship with a therapist (Shajdukova 2013).

Dmitry Spivak also wrote about the connection between linguistics and psychotherapy. He stated that the effectiveness of psychotherapy depends entirely on the influence exerted on deeper layers of the patient's consciousness, and there is only one tool to achieve this — the therapist's word (Spivak 1983).

Dataset

VK (VKontakte social networking service) features various public pages related to drug abuse. In total, the official VK search interface identified 31 drug addicts' virtual communities. All the found communities can be divided into three thematic clusters. The first cluster includes 11 trip-report libraries. Trip-reports usually contain around 500 words and describe the process of psychoactive substance consumption and its physiological and psychological influence. The second cluster includes 14 public pages where users share information, personal experience and stories, ask for help, etc. The third cluster includes 6 entertaining pages about the peculiarities of drug addicts' life on which users publish images, video or audio recordings with short comments.

Publications in drug addicts' online communities contain a lot of obscene vocabulary, slang and vernacular language. The training dataset should be linguistically balanced, which means that texts in the set of publications from non-addicts' communities should possess the same features. Otherwise, the CNN model might err in classifying publications containing obscene language and ignore other linguistic characteristics. The publications from non-addicts' communities were collected from 43 VK public pages. Of them, 23 pages represent thematic communities of urban residential areas where people share information and ask for advice, and 20 are pages that discuss private issues.

The dataset contains 23 983 publications of two classes. The volume of the dataset is 1 636 998 words. The collection of texts from non-addicts' communities contains 857 195 words (52% of the data), and the collection of texts from drug addicts' communities contains 779 803 words (48% of the data). Consider two random examples from the training dataset. Example (4) represents an excerpt from a publication in a drug addicts' community, while example (5) represents a text of a publication in a non-addicts' community.

(4) Anonimno. Lyubil babu odnu. Nu kak lyubil? CHuvstva byli tochno pomnyu... S baboj vmeste zhili goda tri uzhe. Lyubila menya govorila. Nu vobshchem v odin iz dnej poekhal v garazh... (Anonymously. Loved one woman. Well did [I]? There were some feelings for sure [I] remember... With the woman [we'd] lived for about three years already. Loved me, [she] used to say. Well, anyway one day [I] went to the garage...)

(5) Vot chego ne hvataet devushkam??. Vrode dazhe nad tvoimi shutkami smeetsya,no net v itoge prosti ne sud'ba ne lyublyu i bla bla bla... mozhno ne anonimno. (What do girls want??. Seems that [she] even laughs at your jokes, but no in the end [she says] sorry it's not meant to be [I] don't love [you] and blah blah blah... not necessarily anonymously.)

Both texts are about romantic relationships. It is possible to assume that both authors are male. In example (4), the author's sex is explicitly indicated by the use of masculine verb forms. In example (5), such a conclusion can be drawn from the message context. Both authors left anonymity or non-anonymity tags in their posts ("anonimno (anonymously)", "mozhno ne anonimno (not necessarily anonymously)"). Both authors use vernacular expressions ("lyubil babu (loved a woman, vernacular)", "bla bla bla (blah blah blah)"). They do not strictly follow the rules of spelling and punctuation. In terms of lexical diversity, the texts can be called similar. The differences lie in the choice of syntactic structures. Unlike the author of example (5), the author of example (4) prefers inverted word order and simple sentences; he unconsciously splits the text into short statements that are easier to process. However, both authors avoid using an explicit subject.

Thus, the randomly selected texts have a significant number of common linguistic features, and they can be called ambiguous for automatic classification. The major feature of example (4) is the absence of specific slang and lexicon denoting psychoactive substances. As stated earlier, the classifier described in the study is not syntactically sensitive; however, since lemmatisation was not implemented, the programme might analyse such features as valency filling and grammatical forms quite effectively. Nevertheless, the classifier assigns class 0 ("non-addicts") to both examples (4) and (5), although the full publication from which example (4) was taken was classified correctly as class 1 ("addicts") when loaded into the classifier.

Data pre-processing

With the VK application programming interface (VK API), a bot that returns a list of all the community publications was built in Python 3 (VK API 2020). An API makes it possible to establish an interaction between two systems (Lauret 2019), in this case, between the developer's programming environment and the VK database. Publications from the drug addicts' virtual communities received the label "1". Publication texts from the non-addicts' communities received the label "0". All the dataset samples were randomly shuffled and divided into a training set containing 80% of the data and a validation set containing 20% of the data. In addition, a test set containing new publications was collected to assess the programme performance.
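The labelling and the 80/20 split described above can be illustrated with the standard library alone. This is a minimal sketch under stated assumptions: the function name `label_and_split`, the fixed random seed and the toy input texts are hypothetical, and the original study may have shuffled and split the data differently.

```python
import random


def label_and_split(addict_texts, other_texts, train_share=0.8, seed=42):
    """Attach class labels, shuffle, and split into training and validation sets."""
    # Class 1: drug addicts' communities; class 0: non-addicts' communities.
    samples = [(text, 1) for text in addict_texts]
    samples += [(text, 0) for text in other_texts]

    # A seeded generator keeps the shuffle reproducible (the seed is an assumption).
    rng = random.Random(seed)
    rng.shuffle(samples)

    cut = int(len(samples) * train_share)
    return samples[:cut], samples[cut:]


# Toy data standing in for the collected publications.
train_set, validation_set = label_and_split(["a"] * 6, ["b"] * 4)
print(len(train_set), len(validation_set))
```

Shuffling before the cut matters here: without it, the validation set would contain texts of only one class.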

The dataset was split into samples of equal length for the model training. The original length of texts in the dataset varies. For example, a collection of texts from the drug addicts' virtual communities contains a large number of trip-reports. The length of some trip-reports exceeds 500 words, whereas the average length of publication texts from the non-addicts' communities, in which users discuss their private life, is about 50-60 words, and such texts make up about half of the non-addicts' communities collection. Splitting the texts into samples of equal length made the dataset more uniform and convenient for automatic processing and, in particular, for machine learning. Moreover, this had a positive impact since it reduced the likelihood of class 1 ("addicts") erroneously assigned to any lengthy text, regardless of its content. Besides, the processing of such short data chunks requires less computing power than analysing the entire array. This makes the dataset more flexible for automatic processing.
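Splitting a text into samples of equal length can be sketched as a sliding window over the sequence of word indices. The parameter names `x_len` and `step` mirror the xLen and step parameters discussed in the model training section; the function itself is a hypothetical illustration, and the source does not say whether windows were taken per publication or over the concatenated texts of each class.

```python
def split_into_samples(word_indices, x_len, step):
    """Cut a sequence into windows of x_len items, moving forward by step items."""
    samples = []
    # The last window must still fit entirely inside the sequence.
    for start in range(0, len(word_indices) - x_len + 1, step):
        samples.append(word_indices[start:start + x_len])
    return samples


# Toy sequence of ten word indices, windows of four words, step of two.
windows = split_into_samples(list(range(10)), x_len=4, step=2)
print(windows)
```

With a step smaller than the window length the windows overlap, so every word appears in several samples; with a larger step some words are skipped.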

Convolutional neural networks were chosen for the model training (Kim 2014). This type of artificial neural network has been successfully used to solve natural language processing problems. It works well with phrase and collocation analysis and automatic feature extraction, and it is considered an effective tool for part-of-speech tagging, semantic analysis and named entity recognition (Collobert, Weston 2008). Today, this class of neural networks is actively used for text classification (Tao, Chang 2019). In a CNN, data is processed as follows: convolution layers perform operations on the matrices, extracting information about the most significant features of the analysed entities (Goodfellow et al. 2016).

Neural network training implies computing the weights of connections between neurons (for example, between classes and corresponding text features), which allows the model to make non-linear decisions (Ma et al. 2017). This makes it possible to solve complex tasks in natural signal processing, for example, with speech, texts or images. Such signals represent complex structures. Classifying natural signal data requires the simultaneous analysis of multiple variables (Kim 2017). For example, with deep neural networks, it is possible to analyse the context.

Two types of word vector representations were used for data processing. The first type, Bag-of-Words, does not take into account grammar, text structure or word order, although it can capture the word usage frequency (Zhang et al. 2010). The second type, Word Embeddings, maps words to dense vectors. Word Embeddings rely on the following idea: semantically close words appear in similar contexts. A word vector represents the position of a token in the meaning space, and the comparison of such vectors makes it possible to identify semantically close tokens and establish thematic clusters (Turney, Pantel 2010). Word Embeddings representations are considered efficient for natural language processing tasks (Yin, Shen 2018). Unlike Bag-of-Words, Word Embeddings make it possible to analyse not only the word usage frequency but also the word compatibility. Thus, the hypothesis was put forward that the Word Embeddings model would show better results in the VK publication classification task than the Bag-of-Words model.
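A common way to compare word vectors is cosine similarity. The sketch below is only an illustration of that comparison: the three-dimensional vectors are invented toy values (the model in the study uses a 30-dimensional space learned during training), and the source does not state which similarity measure, if any, was applied.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy vectors; real embeddings are learned from the data.
toy_vectors = {
    "bol'": [0.9, 0.1, 0.3],            # pain
    "narkoticheskih": [0.8, 0.2, 0.4],  # narcotic
    "pasport": [0.1, 0.9, 0.0],         # passport
}

close = cosine_similarity(toy_vectors["bol'"], toy_vectors["narkoticheskih"])
distant = cosine_similarity(toy_vectors["bol'"], toy_vectors["pasport"])
print(close > distant)
```

Values near 1 indicate vectors pointing in almost the same direction, which is how semantically close tokens would show up in the meaning space.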

Data pre-processing, model training and classifier testing were implemented in the Google Colab environment (Google Colab 2020). The code is written in Python 3 (Python 3.6.7 documentation 2020). To train the model, the Keras library was used (Keras 2020). One of the natural language processing problems is that a neural network can only process numerical data, such as vectors representing the word usage frequency and contextual dependencies (Webster, Kit 1992). For the Bag-of-Words model, one-hot encoding vectors were used, for example, [0. 1. 1. ... 0. 0. 0.]. For the Word Embeddings model, sequences of frequency indices were used, for example, [2441, 4228, 747, 23, 335].

Google Colab sessions could get interrupted during the data pre-processing and training due to the system limitations in computing power. Thus, finding a balance between data processing efficiency and energy consumption was another important task. During tokenisation, the text data was converted to lower case, and all the characters except the Cyrillic and Latin letters were filtered out to make the personalised publication texts uniform and more convenient to process. A word form delimited by whitespace characters was considered one token. After the tokenisation, the volume of the training dataset was 8 998 819 characters, or 1 221 630 words. Then, a frequency dictionary was built, in which each word was encoded with a unique frequency rank index. A fragment of the frequency dictionary is given in Table 1; all the word forms are presented in transliterated Russian and their translation is given in brackets.
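The tokenisation and the frequency dictionary can be sketched with the standard library. The study used the Keras library for this step; the code below is a hypothetical re-implementation of the described behaviour, and the tie-breaking rule (alphabetical order among equally frequent words) is an assumption.

```python
import re
from collections import Counter

# Everything except lower-case Latin and Cyrillic letters is treated as a separator.
NON_LETTERS = re.compile(r"[^a-zа-яё]+")


def tokenise(text):
    """Lower-case the text, drop non-letter characters, split on whitespace."""
    cleaned = NON_LETTERS.sub(" ", text.lower())
    return cleaned.split()


def build_frequency_dictionary(texts):
    """Map each word form to its frequency rank; rank 1 is the most frequent word."""
    counts = Counter(token for text in texts for token in tokenise(text))
    # Descending frequency first, alphabetical order for ties (an assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


print(tokenise("TV-3 приглашает семьи!"))
print(build_frequency_dictionary(["и он и она", "и он"]))
```

Because digits and punctuation are separators, a token such as "TV-3" is reduced to "tv", which matches the behaviour described for the vectorisation step.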

Table 1. The frequency dictionary sample

Word form                     Index   Word form                              Index
pasport (passport)            905     bol' (pain)                            909
rukami (hands, instrumental)  906     dolzhno (should)                       910
diko (wildly)                 907     narkoticheskih (narcotic, genitive)    911
odnim (one, instrumental)     908     metallurgov (metallurgists, genitive)  912

The volume of the dictionary is 137 803 tokens. Index 1 encoded the most frequent word in the dictionary ("i", the Russian conjunction "and"), and index 137 803 encoded the least frequent one, a word form that occurs only once and is ranked alphabetically among such forms ("odnoklasnitsy", a mistyped feminine plural form of the Russian noun "classmate"). The analysis of low-frequency words and entities with typos was not considered efficient because it may cause overfitting, a situation in which the model fits its weights to the training data so closely that they become unsuitable for new examples (Mou et al. 2016). To prevent this, the maximum word count parameter, or mWC, was defined as the maximum number of words from the frequency dictionary that the model takes into account during training and classification. If a given text contains words with frequency indices higher than mWC, these words are ignored, and the computation is carried out only with tokens whose indices are lower than or equal to mWC. For example, if mWC is 20 000, the word encoded with the index 20 001 will be ignored by the programme.
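The effect of mWC can be shown with a short filter. The function name `text_to_indices` and the toy dictionary are hypothetical; the rule itself (keep indices lower than or equal to mWC, drop the rest) follows the description above.

```python
def text_to_indices(tokens, frequency_dictionary, max_word_count):
    """Convert tokens to frequency indices, ignoring words ranked above max_word_count."""
    indices = []
    for token in tokens:
        index = frequency_dictionary.get(token)
        # Unknown words and words ranked above mWC are skipped.
        if index is not None and index <= max_word_count:
            indices.append(index)
    return indices


# Toy dictionary: the ranks are invented for the illustration.
toy_dictionary = {"i": 1, "bol'": 909, "redkoe": 20001}
print(text_to_indices(["i", "bol'", "redkoe", "neizvestnoe"], toy_dictionary, 20000))
```

In this toy case the word ranked 20 001 and the word missing from the dictionary are both dropped, so only the indices 1 and 909 remain.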

During the vectorisation, the data array was converted with the Keras module into sequences of frequency indices according to the dictionary ranking. For example, the piece of text "TV-3 priglashaet sem'i iz Chelyabinska prinyat' uchastie (TV-3 invites families from Chelyabinsk to take part in...)" was transformed into the following form: [2441, 4228, 747, 23, 335, 1324, 1860]. Each number in this sequence corresponds to the word index from the dictionary, although the numeral "3", which is a part of the word "TV-3", was omitted after the tokenisation since all the characters except letters were filtered out. Word vector representations of this type were used to implement the Word Embeddings model. Then, the sequences were transformed into a one-hot encoding matrix in which one vector corresponds to one numeric index; for example, the number 3 is encoded as follows: [0. 0. 1. 0. ...]. The matrices were used to implement the Bag-of-Words model.
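The conversion of an index sequence into a Bag-of-Words vector can be sketched without any library. This is an assumption-laden illustration: the function name is hypothetical, and the position layout (index 3 marks the third position) is chosen to match the example in the text, which may differ from the layout produced by the Keras utilities used in the study.

```python
def to_bag_of_words(sequence, max_word_count):
    """Return a 0/1 vector of length max_word_count marking which indices occur."""
    vector = [0.0] * max_word_count
    for index in sequence:
        # Indices start at 1, so index k is stored at position k - 1.
        if 1 <= index <= max_word_count:
            vector[index - 1] = 1.0
    return vector


print(to_bag_of_words([3], 5))
print(to_bag_of_words([2, 3, 3, 9], 6))
```

The second call shows two properties of this representation: a repeated index is marked only once, and the order of the words in the sample is lost.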

Model training

To train the Bag-of-Words model, a sequential Convolutional Neural Network was used, in which a Dense layer with 200 units and the rectified linear (ReLU) activation function is connected to an output Dense layer with 2 units and the softmax activation function. To prevent overfitting, Dropout with a rate of 0.5 was used. To improve the model performance, BatchNormalization was applied. To train the Word Embeddings model, a CNN with an Embedding layer was used, and a vector space of 30 dimensions was built. SpatialDropout1D was used to prevent overfitting. A Flatten layer was used to reshape the multidimensional data. The model also included a Dense layer with 200 units and the ReLU activation function and an output Dense layer with 2 units and the softmax activation function. BatchNormalization was applied again. The visualisation of both models is shown in Fig. 1.
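The two architectures can be sketched with the Keras Sequential API. This is a hedged reconstruction, not the original code: it contains only the layers named in the description, the order of the Dropout and BatchNormalization layers and the SpatialDropout1D rate are assumptions, and the values of `MAX_WORD_COUNT` and `X_LEN` are illustrative (20 000 is the mWC example from the text, 50 is the optimised sample length). TensorFlow is imported inside the functions, so it is needed only when a model is actually built.

```python
X_LEN = 50              # sample length in words (xLen); illustrative value
MAX_WORD_COUNT = 20000  # mWC; illustrative value
EMBEDDING_DIM = 30      # dimensionality of the embedding space


def build_bow_model(max_word_count=MAX_WORD_COUNT):
    """Bag-of-Words model: one 0/1 vector of length max_word_count per sample."""
    from tensorflow.keras import layers, models  # imported lazily

    model = models.Sequential([
        layers.Input(shape=(max_word_count,)),
        layers.Dense(200, activation="relu"),
        layers.Dropout(0.5),
        layers.BatchNormalization(),
        layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model


def build_embedding_model(max_word_count=MAX_WORD_COUNT, x_len=X_LEN):
    """Word Embeddings model: one sequence of x_len word indices per sample."""
    from tensorflow.keras import layers, models  # imported lazily

    model = models.Sequential([
        layers.Input(shape=(x_len,)),
        # +1 so that the index max_word_count itself has a row in the table.
        layers.Embedding(max_word_count + 1, EMBEDDING_DIM),
        layers.SpatialDropout1D(0.2),  # rate is an assumption
        layers.Flatten(),
        layers.BatchNormalization(),
        layers.Dense(200, activation="relu"),
        layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Calling `build_bow_model().summary()` in an environment with TensorFlow installed prints the layer stack and parameter counts, which is a quick way to compare the sketch with Fig. 1.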

Fig. 1. Visualisation of the Bag-of-Words and Word Embeddings models

For both models, the Adam optimisation algorithm, the categorical cross-entropy loss function and the following metrics were used: Accuracy, Precision, Recall and F1-Score. The classifier evaluation metrics are based on counting false negative and false positive model responses (Manning et al. 2008). Type I errors, or "false alarms", and type II errors, or "target omissions", are the key concepts in statistical hypothesis testing (Easton, McColl 1997).
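The four metrics can be computed directly from the counts of true and false responses. The sketch below uses the standard definitions; the function name and the toy labels are hypothetical, and class 1 ("addicts") is treated as the positive class by assumption.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute Accuracy, Precision, Recall and F1-Score for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # type I errors
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # type II errors
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels: three class 1 texts and five class 0 texts.
metrics = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
print(metrics)
```

In the toy case one class 1 text is missed and one class 0 text raises a false alarm, so Precision and Recall are both 2/3 while Accuracy is 0.75.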

Both models were trained for 20 epochs with a batch size of 200. The optimised modifications of the Bag-of-Words and Word Embeddings models were compared using the selected metrics. Only one model was chosen for the classifier testing. During the parameter optimisation, the sample length was increased from 10 to 50 words (xLen parameter). Then, the Bag-of-Words model recognition percentage of class 0 increased from 82% to 95%, and for class 1 it increased from 72% to 90%. The average recognition percentage increased from 72% (precision = 0.93, recall = 0.95, f1 = 0.94, loss = 2.56) to 93% (precision = 0.93, recall = 0.95, f1 = 0.94, loss = 2.63). The Word Embeddings model recognition rate of class 0 grew from 81% to 95%, and for class 1 it increased from 67% to 90%. The average percentage also increased from 74% (precision = 0.96, recall = 0.97, f1 = 0.96, loss = 1.28) to 93% (precision = 0.94, recall = 0.96, f1 = 0.95, loss = 0.25).

The results were significant; however, the parameter optimisation continued with reducing the sampling step size (step parameter) from 100 to 50 units. The hypothesis was that reducing the sampling step size would make the sampling denser and allow the model to conduct a more thorough analysis. This hypothesis was confirmed. The Bag-of-Words model recognition percentage of class 0 increased to 98%, and for class 1 it increased to 94%. The average recognition percentage increased by 2% (precision = 0.99, recall = 0.99, f1 = 0.99, loss = 0.16). The Word Embeddings model recognised 98% of class 0 texts and 89% of class 1 texts. Its average recognition percentage increased by 1% (precision = 0.96, recall = 0.97, f1 = 0.96, loss = 0.28).
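The effect of the step size on the number of training samples follows from simple arithmetic. The sketch below assumes, purely for illustration, that the 1 221 630 words of the training set are treated as one continuous stream; the source does not specify this, so the resulting counts are not figures from the study.

```python
def count_samples(n_words, x_len, step):
    """Number of windows of x_len words that fit into n_words with the given step."""
    if n_words < x_len:
        return 0
    return (n_words - x_len) // step + 1


TRAINING_WORDS = 1221630  # training set volume after tokenisation

print(count_samples(TRAINING_WORDS, x_len=50, step=100))
print(count_samples(TRAINING_WORDS, x_len=50, step=50))
```

Under this assumption, halving the step roughly doubles the number of samples; with a step of 100 and a window of 50 words, half of the words would never enter any sample.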

The Word Embeddings model managed to achieve higher values of the selected metrics as well as lower losses during the optimisation process; however, after the final transformations, the Word Embeddings model handled the VK classification task worse than the Bag-of-Words model with the same parameters. The hypothesis that the Word Embeddings model would perform more effectively was not confirmed. The Bag-of-Words modification obtained during the parameter optimisation was used for the classifier testing. Table 2 presents the model evaluation.

Table 2. Evaluation

Bag-of-Words model results

Optimised parameter   xLen = 10   xLen = 50   step = 50
Precision             0.93        0.93        0.99
Recall                0.95        0.95        0.99
F1-Score              0.94        0.94        0.99
Accuracy              0.72        0.93        0.95
Loss                  2.56        2.63        0.16

Word Embeddings model results

Optimised parameter   xLen = 10   xLen = 50   step = 50
Precision             0.96        0.97        0.96
Recall                0.97        0.96        0.97
F1-Score              0.96        0.95        0.96
Accuracy              0.74        0.93        0.94
Loss                  1.28        0.95        0.28

Analysis

The test dataset contained texts that were ambiguous for classification. The classifier made mistakes in their recognition. An anonymous online survey was conducted to check whether the texts were truly ambiguous and whether the choice of the CNN approach was justified.

In his monograph, Dmitry Spivak outlines some linguistic features of speech typical of a person in an altered state of consciousness. They include a large number of narrow denominational signs of a natural language, such as abusive and onomatopoeic vocabulary, substitutive words (e.g., "that stuff", "well", etc.), "stamp" collocations, proper names and words frequent for a specific context (e.g., "doctor" and "injection" in the hospital) (Spivak 1983). Spivak also lists syntactic features of speech produced by people in an altered state of consciousness, for example, an unclear distinction between sentence parts, weakly expressed connections between the components of an utterance, and a predominance of simple isolated sentences. Consider as an example a transcript of a transliterated Russian utterance (the English translation is presented in brackets) from one of Spivak's studies: "Nu, normal'no, ukol horosho poshel, snachala pohuzhe, a sejchas normal'no, sejchas eto, nu, minut uzhe desyat' (Well, it's okay, the injection went well, at first it was worse, but now it's okay, now err, well, about ten minutes already)" (Spivak 1983). Texts created by addicted people might have most of the listed features, as their authors could be in an altered state at the time of publication or their speech could be impaired due to the prolonged and systematic use of psychoactive substances, which, in turn, could affect their use of language online.

The classifier was tested on different types of publications. For example, the test results showed that trip-reports (texts describing the process of psychoactive substance consumption and its effects) are recognised well by the program. Such texts often have a similar syntactic structure. Example (6) is an excerpt from a trip-report from a drug addicts' virtual community presented as a transliterated Russian text with its translation into English given in brackets. The model recognised this text correctly and assigned it to class 1 ("addicts").

(6) Ladno, poka tam vse bryzzhut zhelch'yu, rasskazhu vam koroten'kuyu stori, dlya kogo-to mozhet dazhe pouchitel'nuyu. Bylo eto gde-to v 2k14, poekhali s drugom Dimoj v kurortnyj gorodochek... (Okay, while everyone's spewing their bile, [I']ll tell you a short story, for someone it may even be instructive. It happened sometime in 2k14, [we] went with a friend Dima to a resort town...)

The linguistic analysis led to the conclusion that example (6) more likely belongs to class 1 ("addicts"). The text lacks explicit class 1 features, such as names of psychoactive substances or specific slang. However, syntactic inversions with subject omission ("poekhali s drugom Dimoj (went with a friend Dima)"), emphatic expressions designating aggression ("vse bryzzhut zhelch'yu (everyone's spewing their bile)") and neologisms ("stori (story, vernacular anglicism)") show that the text was more likely published in a drug addicts' virtual community. According to the survey results, only 20% of respondents judged the text to be an excerpt from a message published in a drug addicts' online community.

The linguistic analysis shows that in some cases a deep semantic analysis is required in addition to the morpho-syntactic analysis to assign a proper class. Humans can grasp the meaning of a text (its hidden message) immediately. However, both the Word Embeddings and the Bag-of-Words models are limited in this capacity due to their settings. In example (7), the author describes not only what was happening but also his emotional state at that moment. To do so, he not only uses the verb "zanervnichal (became nervous)", but also lists the girl's actions. The reader can guess how closely the narrator watched her every movement and conclude that he was nervously concentrated. Moreover, the text contains the phrase "vse budet rovno (everything will be smooth)" which indicates that something illegal was about to happen. To analyse such expressions, a special vocabulary might be needed in addition to the CNN model. The classification here should be based on semantic analysis. The linguistic approach allows implementing such a deep analysis, but will a machine ever be able to handle this?

(7) Korotkij telefonnyj razgovor s poluchatelem, i fraza "Vse budet rovno", pribavila mne uverennosti. Moya ochered' uzhe podhodila. Devushka bezrazlichno perekladyvala moi produkty sveryayas' s opis'yu, rutinnaya rabota. Vdrug, paket s kofe vyzval u nee vnezapnyj interes, shiroko razvernuv ego, ona prinyalas' vnimatel'no rassmatrivat' soderzhimoe, prinyuhivayas' i vorosha soderzhimoe ruchkoj. Ya zanervnichal. (The short phone conversation with the recipient and the phrase "Everything will be smooth" made me feel more confident. It was my turn already. A girl indifferently shifted my products, checking the inventory, routine work. Suddenly, a coffee pack aroused her unexpected interest, she opened it wide and began to examine its contents carefully, sniffing and stirring the contents with a pen. I got nervous.)

Conclusion

The research centred on the development of CNN-based models for text classification. One-hot encoding matrices were used to train the Bag-of-Words model, and sequences of frequency indices were used to train the Word Embeddings model. To build the word vector representations, a frequency dictionary was compiled from the training dataset. After the model parameter optimisation, the Bag-of-Words model showed better results in classifying publications from VK virtual communities. The models were evaluated by the Accuracy, Precision, Recall and F1-Score metrics. The final version of the CNN classifier correctly recognises 98% of texts from the non-addicts' online communities and 94% of texts from the addicts' online communities.
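The two input representations summarised above can be illustrated with a minimal sketch in plain Python. It is not the code used in the study: the whitespace tokenisation, the function names, the tie-breaking rule and the reserved padding index 0 are all assumptions made for illustration, and the toy sentences merely reuse words from the examples in this paper.

```python
from collections import Counter


def build_frequency_dictionary(texts, max_words=None):
    """Map each word to its frequency rank (1 = most frequent in the training texts)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending frequency, then alphabetically so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    if max_words is not None:
        ranked = ranked[:max_words]
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_index_sequence(text, dictionary, length):
    """Word Embeddings input: frequency indices, truncated or padded with 0 to a fixed length."""
    indices = [dictionary[w] for w in text.lower().split() if w in dictionary]
    indices = indices[:length]
    return indices + [0] * (length - len(indices))


def to_bag_of_words(text, dictionary):
    """Bag-of-Words input: position i holds 1 if the word with index i occurs in the text."""
    vector = [0] * (len(dictionary) + 1)  # position 0 is reserved for padding (assumption)
    for w in text.lower().split():
        if w in dictionary:
            vector[dictionary[w]] = 1
    return vector


# Hypothetical toy training set; the real dataset consists of VK publications.
train_texts = ["vse budet rovno", "vse budet horosho", "rasskazhu vam stori"]
dictionary = build_frequency_dictionary(train_texts)
sequence = to_index_sequence("vse budet rovno", dictionary, length=5)
bag = to_bag_of_words("vse budet rovno", dictionary)
```

In this sketch the index sequence preserves word order and would feed an embedding layer, whereas the binary vector records only which dictionary words occur in the text.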

The programme was tested on publications absent from the training sample. In addition, a linguistic analysis of texts that were ambiguous for classification was carried out. Then, a survey was conducted to find out how random people online would rate such texts. Testing and analysis showed that texts from drug addicts' virtual communities are characterised not only by specific slang, vernacular and obscene language but also by vocabulary describing the symptoms of mental and physical disorders. Users in drug addicts' virtual communities tend to describe their nervous conditions or depression online.

In many cases, model predictions match the judgments made after the linguistic analysis, although they often contradict the opinions of human respondents. Moreover, the model classifies texts even more accurately than humans. Thus, the CNN model can capture the structural features of texts published in addicts' online communities without analysing their meaning or sentiment. Perhaps it will be possible to implement a deeper analysis if the model is supplemented with a vocabulary and the dataset is supplemented with texts describing the author's mental or emotional state. The model can be expanded by increasing the number of classes for recognition, for example, by training the model to recognise the type of publication (e.g., a trip-report, a help request, a comment, etc.). The classifier can accurately recognise messages in which people complain about their health or mental state; therefore, it was decided to continue the research by exploring the capabilities of neural networks and developing an extended programme.

Sources

Google Colab. (2020) [Online]. Available at: https://colab.research.google.com/ (accessed 07.12.2020). (In English)

Keras. (2020) [Online]. Available at: https://keras.io/ (accessed 07.12.2020). (In English)

NumPy Documentation. (2020) NumPy. [Online]. Available at: https://numpy.org/doc/ (accessed 07.12.2020). (In English)

Python 3.6.7 documentation. (2020) Python. [Online]. Available at: https://docs.python.org/release/3.6.7/ (accessed 07.12.2020). (In English)

VK API. (2020) VK Developers. [Online]. Available at: https://vk.com/dev/manuals (accessed 07.12.2020). (In Russian)

References

Collobert, R., Weston, J. (2008) A unified architecture for natural language processing. In: ICML '08: Proceedings of the 25th International Conference on Machine Learning. New York: Association for Computing Machinery Publ., pp. 160-167. https://doi.org/10.1145/1390156.1390177 (In English)

Easton, V. J., McColl, J. H. (1997) Hypothesis testing. Statistics Glossary. [Online]. Available at: http://www.stats.gla.ac.uk/steps/glossary/hypothesis_testing.html (accessed 07.12.2020). (In English)

Goodfellow, I., Bengio, Y., Courville, A. (2016) Deep learning. Cambridge: The MIT Press, 800 p. (In English)

Jakobson, R. (1973) Main trends in the science of language. London: Routledge Publ., 76 p. (In English)

Kim, P. (2017) MATLAB deep learning: With machine learning, neural networks and artificial intelligence. Berkeley: Apress Publ., 151 p. https://doi.org/10.1007/978-1-4842-2845-6 (In English)

Kim, Y. (2014) Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on empirical methods in natural language processing (EMNLP). Doha: Association for Computational Linguistics Publ., pp. 1746-1751. https://www.doi.org/10.3115/v1/D14-1181 (In English)

Lauret, A. (2019) The design of web APIs. Shelter Island: Manning Publications, 392 p. (In English)

Luria, A. R. (1976) Basic problems of neurolinguistics. The Hague: Mouton Publ., 398 p. (In English)

Ma, A., Stagliano, A., Wills, G. (2017) Supervised classification algorithms. O'Reilly. [Online]. Available at: https://learning.oreilly.com/videos/supervised-classification-algorithms/9781492023937 (accessed 07.12.2020). (In English)

Manning, C. D., Raghavan, P., Schütze, H. (2008) Introduction to information retrieval. New York: Cambridge University Press, 496 p. (In English)

Mou, L., Meng, Z., Yan, R. et al. (2016) How transferable are neural networks in NLP applications? In: Proceedings of the 2016 Conference on empirical methods in natural language processing. Austin: Association for Computational Linguistics Publ., pp. 479-489. https://www.doi.org/10.18653/v1/D16-1046 (In English)

Shajdukova, L. K. (2013) Sovremennye podkhody k reabilitatsii narkozavisimykh [Modern approaches to the rehabilitation of the drug addicts]. Kazanskij meditsinskij zhurnal / Kazan Medical Journal, 94 (3): 402-405. (In Russian)

Spivak, D. L. (1983) Lingvisticheskaja tipologija iskusstvenno vyzyvaemykh sostojanij izmenennogo soznanija. Soobshchenie 1 [The linguistic typology of artificially caused altered states of consciousness. I]. Fiziologija cheloveka / Human Physiology, 1: 141-146. (In Russian)

Tao, W. I., Chang, D. (2019) News text classification based on an improved convolutional neural network. Tehnicki vjesnik / Technical Gazette, 26 (5): 1400-1409. https://doi.org/10.17559/TV-20190623122323 (In English)

Turney, P. D., Pantel, P. (2010) From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37: 141-188. https://doi.org/10.1613/jair.2934 (In English)

Webster, J. J., Kit, C. (1992) Tokenization as the initial phase in NLP. In: COLING '92: Proceedings of the 14th conference on Computational linguistics. Vol. 4. Stroudsburg: Association for Computational Linguistics Publ., pp. 1106-1110. https://doi.org/10.3115/992424.992434 (In English)

Yin, Z., Shen, Y. (2018) On the dimensionality of word embedding. In: 32nd Conference on Neural information processing systems (NeurIPS 2018). [Online]. Available at: https://arxiv.org/abs/1812.04224 (accessed 07.12.2020). (In English)

Zhang, Y., Jin, R., Zhou, Z.-H. (2010) Understanding bag-of-words model: A statistical framework. International Journal of Machine Learning and Cybernetics, 1 (1-4): 43-52. https://doi.org/10.1007/s13042-010-0001-0 (In English)

Author:

Victoria I. Firsanova, SPIN: 8926-9681, e-mail: vifirsanova@gmail.com

For citation: Firsanova, V. I. (2020) Automatic recognition of messages from virtual communities of drug addicts. Journal of Applied Linguistics and Lexicography, 2 (1): 16-27. https://www.doi.org/10.33910/2687-0215-2020-2-1-16-27 Received 11 November 2020; reviewed 2 December 2020; accepted 7 December 2020.

Copyright: © The Author (2020). Published by Herzen State Pedagogical University of Russia. Open access under CC BY-NC License 4.0.
