DOI: 10.15514/ISPRAS-2020-32(3)-9

Recommendation system based on user actions in the social network

V.V. Monastyrev, ORCID: 0000-0001-6770-4481 <vit34-95@mail.ru> P.D. Drobintsev, ORCID: 0000-0003-1116-7765 <drobintsev_pd@spbstu.ru> Peter the Great St.Petersburg Polytechnic University, 29, Polytechnicheskaya, St.Petersburg, 195251, Russia

Abstract. A large number of people now use photo hosting services, social networks, and other online services, and in doing so they leave a great deal of information about themselves on the Internet: photos, comments, geotags, and so on. This information can be used to build a system that identifies target groups of users, which can then serve as the basis for advertising campaigns, recommendations, and similar applications. This article describes a system that identifies users' interests from their actions in a social network. The following data types were selected for analysis: published photos and text, comments on posts, information about favorite publications, and geotags. Identifying target groups requires analyzing both the images and the text. Image analysis consists of object recognition; text analysis consists of extracting the main topic of the text and determining its sentiment. The analysis results are joined with the rest of the information by a unique identifier, producing a data mart from which target groups can be selected with a simple SQL query.

Keywords: machine learning; recommendation system; natural language processing; image recognition

For citation: Monastyrev V.V., Drobintsev P.D. Recommendation system based on user actions in the social network. Trudy ISP RAN/Proc. ISP RAS, vol. 32, issue 3, 2020, pp. 101-108. DOI: 10.15514/ISPRAS-2020-32(3)-9

1. Introduction

People actively use various Internet services and leave a lot of data on the Internet: photos, text, and so on. This information makes it possible to divide users into groups according to their interests. Many companies have their own recommendation systems that operate on this principle, for example Yandex [1], Google (YouTube) [2], and Netflix [3]. In this article, we look at a recommendation system that identifies interest groups from the following data: photos, text, rated publications, and geotags. The final goal is to create a target data table in an SQL database, from which a list of users with a specified interest can be obtained by an SQL query. Creating such a table requires recognizing objects in images and recognizing the main topic and the tone of the text. This shows which topics a user treats positively, which negatively, and which neutrally.

Thus, the final table will contain information about what the user posts, what they comment on, what and how they evaluate, as well as information about geolocation. Based on this information, which is specific to a particular user, you can easily get different groups of users by interests and geolocation.

2. Existing recommendation systems

As mentioned above, many large companies use recommendation systems to process their data. The design depends on the specific task and the available data, so each company builds its data processing pipeline in the way that suits it, and such solutions are usually not open source. These can be systems for recommending movies, music, friends, interesting authors, and so on. Let's look at some of them in more detail.

To generate a smart news feed, the social network VKontakte labels data with the help of users who have been granted expert status [4]. These users vote for or against publications on a particular topic. The labeled data is then passed to a neural network, which is trained on it and improved. Because of the large amount of labeled data, the neural network is well trained and can find similar publications that are more likely to interest users. One disadvantage is that not every project can attract a large number of users for data labeling. In addition, this solution is not open source.

Another example is Yandex Music. The recommendation system analyzes the user's actions: likes and dislikes, skipped tracks, repeated playback, and so on. Each action has a weight that is later used in the algorithm. In addition, the system analyzes similar profiles. The final list of recommendations is compiled using Matrixnet [5], which processes the list of all possible recommendations and determines which ones should be shown to the user on the Yandex Music home page and in what order. More than a hundred models are used when making recommendations for a single user. This consumes a large amount of resources: hundreds of servers collect data about user requests to the search engine, viewed products, etc. This approach can be used by large companies, but it is not suitable for small projects.

The systems described above, and other similar systems, are tailored to the specific data set that a particular service works with. Also, the entire data processing pipeline (data cleaning, preprocessing, model training) is not open source. This article discusses how to work with the most popular data types and how to build the data processing and model training algorithm so that it can be reused on other data types and in other projects.

3. Approach to building a recommendation system

The data set analyzed in this article was collected from a photo hosting service. It contains 127 images, 307 comments, 496 rating entries (likes and dislikes), and 47 geotags. The recommendation system consists of several data processing modules. The algorithm of the system is shown in fig. 1.

Raw data is sent to the system input. This data is divided into three categories:

• images;

• text;

• geotags and ratings.

To identify user interests, images and text will be processed by machine learning modules. Image processing involves a module that will recognize objects in the image. Text processing includes two submodules: recognition of the main subject of the text and recognition of the tone of the text (positive or negative).

All processed data will be combined by a unique identifier (id). As a result, this will create a set of target tables that contain the following information:

• what the user posts;

• what the user writes about and in what tone;

• what the user evaluates positively;

• what the user evaluates negatively;

• geotags attached to the user's records.

This data will help you identify user groups based on their interests. You can use interest groups to recommend new publications, recommend various products, and so on.

The MySQL [6] relational database was chosen for storing information. Images are not stored in the database itself but in the file system; the database stores only links to them. The machine learning modules are written in Python, as this language offers a wide range of tools for data processing.

4. Machine learning modules

Let's take a closer look at how machine learning modules work for image and text processing.

4.1 Module for recognizing objects in an image

The pre-trained Inception-v3 [7] model was used for recognizing objects in images. This is one of the most popular models for this task [8]. It achieves an accuracy of more than 78.1% on the ImageNet dataset and has been trained on 1000 classes [9]. The pre-trained model was chosen because it performs well, is open source, is easily integrated into existing solutions, and works fast enough (about 1-2 seconds per image on an Intel Core i7).

When analyzing images, this model outputs the top prediction classes with the highest score value. Within the recommendation system, only the value with the highest score was recorded. An example of how the model works is shown in fig. 2:

flags = Namespace(image_file='server_uploads/30/8bl0a2;7-7;69-49c8-96e2-6dl5676c916g.jpeg', model_dir='tmp/imagenet', num_top_predictions=5)

unparsed = ['D:\\Users\\WinUser\\Anaconda3\\lib\\site-packages\\ipykernel_launcher.py', '-f', 'C:\\Users\\WinUser\\AppData\\Roaming\\jupyter\\runtime\\kernel-fafb8a94-6d99-4718-bbe4-80870598319a.json']

television, television system (score = 0.84738)

Fig. 2. Example of how the image recognition model works

In total, 127 photos from the original data set were processed using this model. Of the 1000 classes available in the model, 87 images were recognized. The average score value for all data is about 0.49. Information about recognized objects is shown in fig. 3.

Fig. 3. Recognized objects

The most frequent label on the chart, "chainlink fence", is a classifier error; such images have a very low score. The objects that actually occur most often are the sea, the coast, cars, and architectural objects. The classifier results on this data set were analyzed manually. Correct predictions had a score greater than 0.5, so images above this threshold were considered correctly recognized and were used in later steps (incorrectly recognized images also occur, but they account for only about 10%).
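The selection rule described above (keep only the top-scoring class per image, then accept it only if the score exceeds 0.5) can be sketched as follows. This is a minimal illustration, not the authors' code: the names `Prediction`, `top_prediction`, and `accepted` are hypothetical, and the sample labels and scores are made up, apart from the television score quoted from fig. 2.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    photo_id: int
    label: str
    score: float


SCORE_THRESHOLD = 0.5  # cut-off reported in the text


def top_prediction(photo_id, candidates):
    """Keep only the highest-scoring (label, score) pair for one photo."""
    label, score = max(candidates, key=lambda pair: pair[1])
    return Prediction(photo_id, label, score)


def accepted(predictions, threshold=SCORE_THRESHOLD):
    """Return the predictions whose score is strictly above the threshold."""
    return [p for p in predictions if p.score > threshold]


# Hypothetical classifier output: photo id -> list of (label, score) pairs.
raw = {
    1: [("television, television system", 0.84738), ("monitor", 0.05)],
    2: [("chainlink fence", 0.12), ("seashore, coast", 0.10)],
}
top = [top_prediction(pid, cands) for pid, cands in raw.items()]
kept = accepted(top)
```

With this sample input, photo 2 is dropped: its best guess is the low-score "chainlink fence" label, which is the kind of error the threshold is meant to filter out.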

All data was written to a MySQL table with the following fields:

• id;

• photo_id;

• photo_desc;

• score.

Here id is a unique identifier, photo_id is a foreign key from the photo table, photo_desc is the name of the recognized object, and score is the score value returned by the model.
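For illustration, a table with these four fields could be created and filled as shown below. The sketch uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a database server; the table name `photo_desc`, the column types, and the `path` column of the `photo` table are assumptions, since the article lists only the field names.

```python
import sqlite3

# In-memory SQLite database used here as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE photo (
    id   INTEGER PRIMARY KEY,
    path TEXT NOT NULL            -- link to the file; the image itself is not stored
);
CREATE TABLE photo_desc (
    id         INTEGER PRIMARY KEY,
    photo_id   INTEGER NOT NULL REFERENCES photo(id),
    photo_desc TEXT NOT NULL,     -- name of the recognized object
    score      REAL NOT NULL      -- classifier score
);
""")

# One hypothetical photo and the recognition result recorded for it.
conn.execute("INSERT INTO photo (id, path) VALUES (?, ?)", (1, "uploads/1.jpeg"))
conn.execute(
    "INSERT INTO photo_desc (photo_id, photo_desc, score) VALUES (?, ?, ?)",
    (1, "television, television system", 0.84738),
)
row = conn.execute("SELECT photo_id, photo_desc, score FROM photo_desc").fetchone()
```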

4.2 Module for analyzing the text topic

Working with text is more complex than image recognition, so there are no ready-made models here. Each language has its own grammar, and it is difficult to adapt one model to all languages at once. In our case, the entire text was in Russian. However, there are various algorithms that can be adapted to a particular data set and trained on it. To extract the main topic of the text, a model based on the Latent Dirichlet Allocation (LDA) [10] algorithm was used. The main idea of this algorithm is that each document is treated as a mixture of topics in certain proportions. Each topic is a set of its most common words, and each document consists of a specific set of words [14].

The original data set cannot be passed directly to the model. The text first needs additional processing:

• eliminate unnecessary characters (punctuation marks);

• remove stop words (conjunctions, particles, etc.);

• form stable phrases;

• perform lemmatization.

The simple_preprocess() function of the Gensim library [11] was used to remove punctuation and tokenize the text. Stop words were removed using the stop word list from the NLTK [12] package. Bigrams and trigrams were formed as stable phrases using the Gensim library. The ru2 model from the spaCy package was used for lemmatization.
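The first three preprocessing steps can be illustrated with a plain-Python stand-in. This is only a sketch of the idea, not the pipeline used in the article: a regular expression replaces Gensim's tokenizer, `STOP_WORDS` is a tiny hand-written set instead of the NLTK list, `merge_bigrams` is a hypothetical frequency-based substitute for Gensim's phrase detection, English sample sentences are used instead of Russian ones, and lemmatization is omitted because it requires a language model.

```python
import re
from collections import Counter

# Tiny illustrative stop word set (an assumption, not the NLTK list).
STOP_WORDS = {"the", "a", "and", "in", "is", "on", "to"}


def tokenize(text):
    """Lowercase the text and keep only runs of letters, dropping punctuation."""
    return re.findall(r"[^\W\d_]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word set."""
    return [t for t in tokens if t not in STOP_WORDS]


def merge_bigrams(docs, min_count=2):
    """Join adjacent word pairs that occur at least min_count times in the corpus."""
    counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    frequent = {pair for pair, n in counts.items() if n >= min_count}
    merged_docs = []
    for doc in docs:
        out, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                out.append(doc[i] + "_" + doc[i + 1])
                i += 2  # both words were consumed by the merged phrase
            else:
                out.append(doc[i])
                i += 1
        merged_docs.append(out)
    return merged_docs


comments = [
    "The Black Sea is calm, and the coast is empty!",
    "A trip to the Black Sea in July.",
]
docs = merge_bigrams([remove_stop_words(tokenize(c)) for c in comments])
```

Here "black sea" occurs in both comments, so it is merged into the single token `black_sea`, which is the effect that forming stable phrases is meant to achieve.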

The main inputs to the LDA model are the dictionary and the corpus. Gensim assigns a unique identifier to each word in the documents (the dictionary), and the corpus records how often each word occurs in each document.
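The relationship between the dictionary and the corpus can be shown with a small plain-Python sketch of the bag-of-words representation. It mimics the idea behind Gensim's dictionary and corpus but does not call Gensim; the function names `build_dictionary` and `doc2bow` and the rule of assigning ids in order of first appearance are assumptions made for this illustration.

```python
from collections import Counter


def build_dictionary(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc2bow(doc, token2id):
    """Represent one document as sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


# Two already preprocessed toy documents.
docs = [["sea", "coast", "sea"], ["car", "coast"]]
token2id = build_dictionary(docs)
corpus = [doc2bow(doc, token2id) for doc in docs]
```

In this toy example "sea" receives id 0 and appears twice in the first document, so the first corpus entry starts with the pair (0, 2).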

One of the hyperparameters is the number of topics; since the data set was fairly small, it was set to 20. The alpha and eta parameters (both set to 'auto') affect the sparsity of topics. The chunksize parameter (set to 100) is the number of documents in each training chunk, and passes (set to 10) is the total number of training passes.

Fig. 4. Visualization of the LDA model operation

To visualize the result, an interactive diagram was built using the pyLDAvis [13] package, which is shown in fig. 4.

4.3 Module for analyzing text sentiment

A convolutional neural network was used to analyze the tone of the text [15, 16, 17, 18]. Word2Vec was used to create the feature space. Training was conducted on a corpus of Russian-language messages from Twitter, which contains 114991 positive and 111923 negative tweets, as well as 17639674 unlabeled tweets [19]. Before training, all data was preprocessed (converted to lowercase, links replaced with a token, etc.). The Word2Vec model was trained using the Gensim library, and the Keras library [20] was used to build the neural network. This model, trained on tweets, was then applied to text messages from the data set in question. The model metrics are shown in fig. 5.
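The two preprocessing operations named above, lowercasing and replacing links with a token, can be sketched with the standard library. The token name `URL`, the URL pattern, and the whitespace normalization step are assumptions made for this illustration; the article does not specify them.

```python
import re

# Matches http(s) links and bare www. links (a simplified, assumed pattern).
URL_RE = re.compile(r"https?://\S+|www\.\S+")


def preprocess_tweet(text):
    """Lowercase the text, replace links with a placeholder, and tidy whitespace."""
    text = text.lower()
    text = URL_RE.sub("URL", text)    # placeholder token name is an assumption
    text = re.sub(r"\s+", " ", text)  # collapse repeated whitespace (assumed step)
    return text.strip()


cleaned = preprocess_tweet("Great VIEW https://t.co/AbC1  today")
```

Replacing every link with the same token keeps the information that a link was present while preventing thousands of unique URLs from entering the vocabulary.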

Fig. 5. Metrics of the text tone model

This model was used to process the comments in the original data set. The results are shown in fig. 6.

Fig. 6. Results of the text tone model

Here the abscissa shows the value predicted by the model, and the ordinate shows the number of comments with that value. As the figure shows, most comments in the data set had a neutral tone (values between 0.3 and 0.7 were treated as neutral; this data was reviewed manually).
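The mapping from the model output to a tone label can be written as a small function. This is a sketch under stated assumptions: the function name `sentiment_label` is hypothetical, and the handling of the exact boundary values 0.3 and 0.7 (treated here as neutral) is a choice made for the example, since the text gives only the ranges.

```python
from collections import Counter


def sentiment_label(score, low=0.3, high=0.7):
    """Map a model output in [0, 1] to 'negative', 'neutral', or 'positive'."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score < low:
        return "negative"
    if score > high:
        return "positive"
    return "neutral"  # boundary values 0.3 and 0.7 count as neutral here


# Hypothetical model outputs for five comments.
scores = [0.12, 0.45, 0.55, 0.91, 0.30]
distribution = Counter(sentiment_label(s) for s in scores)
```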

The trained model was used on the source data. All results were written to a MySQL table.

5. Results

The results of all three models were recorded in MySQL. All data is combined with a single id. This way we can now distinguish user groups based on their interests. As a result, the database contains the following tables.

• Post. This table stores the id, photo and / or text, rating, geolocation (if available), and author of the publication;

• Comment. This table stores the id, publication id, text, rating, and comment author;

• Rating. This table stores the id, user id, photo id, and rating (negative, positive, or neutral);

• Object in the photo. This table stores the id, information about objects in the image (this information was obtained using the model), and the image id;

• Main theme of the text. This table stores the id, the main subject of the text, the type of post (post or comment), and the id of the post or comment;

• Tone of the text. This table stores the id, the tone of the text, the type of post (post or comment), and the id of the post or comment.

Let's look at an example of making recommendations using these tables. Suppose we are building an individual recommendation system that recommends interesting authors. To do this, we need to select what the user posts and what they rate positively (posts and comments). Then we need to find authors who publish similar images and recommend them to the user. For example, if these are sea coasts, we can use the following SQL query: "select distinct (photo.user_id) from photo_desc, photo where (photo_desc.photo_desc like '%coast%' or photo_desc.photo_desc like '%sea%') and photo_desc.photo_id = photo.id and score > '0.5'". Three such users were found in the data set under consideration (fig. 7).

Fig. 7. Result of the SQL query
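The author search described in this section can be reproduced on toy data. The sketch below runs an equivalent query through Python's sqlite3 module as a stand-in for MySQL; the table and column names (`photo`, `photo_desc`, `user_id`) follow the query quoted in the text where they can be read from it and are otherwise assumptions, and all rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in for the MySQL database
conn.executescript("""
CREATE TABLE photo (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL);
CREATE TABLE photo_desc (
    id         INTEGER PRIMARY KEY,
    photo_id   INTEGER NOT NULL REFERENCES photo(id),
    photo_desc TEXT NOT NULL,
    score      REAL NOT NULL
);
""")
# Invented sample data: four photos posted by three users.
conn.executemany(
    "INSERT INTO photo (id, user_id) VALUES (?, ?)",
    [(1, 10), (2, 10), (3, 11), (4, 12)],
)
conn.executemany(
    "INSERT INTO photo_desc (photo_id, photo_desc, score) VALUES (?, ?, ?)",
    [
        (1, "seashore, coast", 0.81),
        (2, "sports car", 0.66),       # confident, but not a coast
        (3, "seashore, coast", 0.32),  # a coast, but below the 0.5 threshold
        (4, "seashore, coast", 0.74),
    ],
)

QUERY = """
SELECT DISTINCT photo.user_id
FROM photo_desc
JOIN photo ON photo_desc.photo_id = photo.id
WHERE (photo_desc.photo_desc LIKE ? OR photo_desc.photo_desc LIKE ?)
  AND photo_desc.score > ?
ORDER BY photo.user_id
"""
user_ids = [row[0] for row in conn.execute(QUERY, ("%coast%", "%sea%", 0.5))]
```

Note that a pattern such as '%sea%' also matches any label that merely contains those letters, so matching against exact label names may be preferable when precision matters.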

6. Conclusion

As a result, we implemented a recommendation system that identifies target groups of users. The article described how the data is processed by several machine learning models. The Inception-v3 model was used for image processing, an LDA-based model was used to extract the subject of the text, and a neural network-based model was used to determine the tone of the text. The model results were used for building SQL queries. The results of all models on the test data set were checked manually. For the object recognition model, the threshold score value was set to 0.5. For the text tone recognition model, values of 0-0.3 were treated as negative text, 0.3-0.7 as neutral text, and 0.7-1 as positive text.

This system can be used on small projects, since the models are trained on labeled data from open sources. In addition, the logic of setting up search targets is quite clear; it can be performed by any analyst who knows SQL. This architecture is suitable for almost any purpose, whether it is recommending services, searching for interesting publications, etc. Future plans include:

• Fully automating the launch of model training. For this, we plan to use the Linux scheduler or Jenkins/TeamCity;

• Deploying the recommendation system in a real project. So far, the data was received as a separate set of values and processed on a separate computer. For full operation of the service, the entire data processing pipeline is to be moved to a production server;

• Analyzing model metrics. After the system is deployed in the service, we plan to analyze the accuracy of the models. This can be tracked by user clicks on the proposed content. It will also make it possible to run A/B tests, in which some users see recommendations from one model and others from another. These tests will help identify the best-performing models.

References

[1]. Recommendation Technology 'Disco'. URL: https://yandex.com/company/technologies/disco/.

[2]. Covington P., Adams J., Sargin E. Deep Neural Networks for YouTube Recommendations. In Proc. of the 10th ACM Conference on Recommender Systems - RecSys '16, 2016, pp. 191-198.

[3]. Gomez-Uribe C.A., Hunt N. The Netflix Recommender System. ACM Transactions on Management Information Systems, vol. 6, issue 4, 2015, pp. 1-19.

[4]. VK Experts. URL: https://vk.com/press/theme-feeds.

[5]. Matrixnet. URL: https://yandex.ru/company/technologies/matrixnet/.

[6]. Joel Murach. Murach's MySQL. Mike Murach & Associates, 2012, 612 p.

[7]. TensorFlow models, GitHub. URL: https://github.com/tensorflow/models.

[8]. Szegedy C., Vanhoucke V., Ioffe S., Shlens J., Wojna Z. Rethinking the Inception Architecture for Computer Vision. In Proc. of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2818-2826.

[9]. 1000 synsets for Task 2 (same as in ILSVRC2012). URL: http://image-net.org/challenges/LSVRC/2014/browse-synsets.

[10]. Bíró I., Szabó J. Latent Dirichlet Allocation for Automatic Document Categorization. Lecture Notes in Computer Science, vol. 5782, 2009, pp. 430-441.

[11]. Gensim project page. URL: https://pypi.org/project/gensim/.

[12]. NLTK project page. URL: https://www.nltk.org/.

[13]. pyLDAvis project page. URL: https://www.nltk.org/.

[14]. Thematic modeling using Gensim (Python). URL: https://webdevblog.ru/tematicheskoe-modelirovanie-s-pomoshhju-gensim-python/.

[15]. Jin R., Lu L., Lee J., Usman A. Multi-representational convolutional neural networks for text classification. Computational Intelligence, vol. 35, issue 3, 2019, pp. 599-609.

[16]. Text tonality analysis using convolutional neural networks. URL: https://habr.com/ru/company/mailru/blog/417767/.

[17]. Cliche M. BB_twtr at SemEval-2017 Task 4: Twitter Sentiment Analysis with CNNs and LSTMs. In Proc. of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 2017, pp. 573-580.

[18]. Zhang Y., Wallace B. A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification. arXiv preprint arXiv:1510.03820, 2015.

[19]. Rubtsova Y.V. Constructing a corpus for sentiment classification training. Programmnye produkty i sistemy, no. 27, 2015, pp. 72-78 (in Russian).

[20]. Keras project page. URL: https://keras.io/.

Information about authors

Vitaly Viktorovich MONASTYREV - student. Research interests: neural networks, recommender systems, machine learning.

Pavel Dmitrievich DROBINTSEV - Ph.D., Associate Professor. Research interests: test automation, formal models, software verification, artificial intelligence applications.
