CC BY
UDC 004.85

N. Seleznev¹, I. Irkhin¹,², V. Kantor¹,²

¹Yandex.Taxi

²Moscow Institute of Physics and Technology (State University)

Automated extraction of rider's attributes based on taxi mobile application activity logs

Whether the task is ordinary user segmentation or estimating a user's number of trips in the next month, data scientists at Yandex.Taxi frequently need to represent a user as a feature vector. One of the main sources of data useful for this task is mobile application activity logs, which are quite technical and weakly structured. Manual feature extraction from this type of data is complicated, as it requires solid knowledge of human behavior and cognitive abilities paired with a deep understanding of the technical details of log generation. We propose a method that automatically constructs an n-dimensional dense vector representation of a user based on her application activity. The constructed representation acts as a feature set for both supervised and unsupervised tasks. The evaluation shows that the tested models successfully learn to extract crucial information about a user. Moreover, we tested our method on a real-world supervised learning task. The results show that the obtained user representation is useful both on its own and in combination with manually crafted features from the user's taxi order history.

Key words: multitask learning, text embedding, log-data analysis, mobile application activity logs, automated feature extraction.

© Seleznev N., Irkhin I., Kantor V., 2018

© Federal State Autonomous Educational Institution of Higher Education «Moscow Institute of Physics and Technology (State University)», 2018

1. Introduction

Yandex.Taxi is a service that allows its users to order an official taxi at an affordable rate without calling a dispatcher. One can order a taxi on the site or through the Yandex.Taxi application for iOS or Android¹. Yandex.Taxi users generate a substantial amount of data, mainly coming from their history of orders and application activity logs. This data is used extensively for machine learning objectives throughout the company, such as recommending destination points for a given trip or estimating the taxi demand for a given area.

Both streams of data (history of orders and application activity logs) contain crucial information about the users and are complementary to each other in various user-oriented machine learning tasks. However, there is some difficulty in analyzing them together. Users' history of orders is well structured and, in many ways, straightforward to extract features from. At the same time, logs of users' activity in the application are much less accessible without extensive study of the data. Besides, feature extraction from application logs requires some expertise in human behavior, cognitive abilities and psychology, specifically as applied to understanding user activity in a mobile application. Overall, it is extremely labor-intensive to extract features from application logs and, as a consequence, the company uses its data less efficiently than it could if application logs were easier to work with.

In order to help machine learning practitioners throughout Yandex.Taxi analyze technical and weakly structured application logs, we propose a method for automatic construction of a user's vector representation based on her mobile application activity. The proposed representation is an n-dimensional dense vector constructed from a given Yandex.Taxi user's mobile application log history. This representation maps all users to the same vector space. It acts as a feature set for both supervised and unsupervised machine learning tasks.

Later in this paper we will refer to the aforementioned n-dimensional dense vector constructed from a given user's mobile application activity logs as «user representation», «user-embedding», «representation» or «user-vector».

2. Related Work

The construction of user representations based on weakly structured or unstructured data has been around for a while. The popular setup is to bring users of some service to the same vector space with its products and make product recommendations for users based on some distance metric or more sophisticated techniques [3,11,12]. This paper is not concerned with recommendation systems and aims to solve supervised learning tasks as in [1] and to find similar users as in [3].

Although our approach is closely related to the model presented in [1], one of the main differences is that mobile application activity data is more technical and less interpretable than website activity data. Moreover, we are not only interested in a user representation that is explicitly trained on some number of supervised learning tasks, but are equally concerned with the ability of this representation to generalize to previously unseen tasks. For that reason, apart from supervised learning tasks, we employ various techniques to improve generalization in an ordinary multitask fashion [2]. Furthermore, we test various models capable of word-level embedding and compare their performance against each other on a set of mimic tasks. We also show that our method may be applied to a real-world production task. Finally, we study the relationship between the method's performance on supervised tasks and the configuration of auxiliary tasks it was trained on.

The method employed in this paper is comparable with the one suggested in [3], as one of the goals of our approach is to identify similar users. In some of the tested models we use a similar strategy to obtain the user representation, except that we are interested not in representing a user's individual log sessions but in the aggregated history of her sessions. Nevertheless, we test the idea of averaging word-level embeddings that belong to a user's application activity log history, which is close in spirit to the approach of [3]. One of the distinctive features of our setup is that the notion of context in Yandex.Taxi mobile application logs is ill-defined. Therefore, it is not immediately justified to use word2vec [4] and other context-based embedding techniques to obtain word-level embeddings.

¹ The company's description is taken from the official website: https://yandex.com/support/taxi/

Parts of the presented approach may be used for categorical feature embedding as in [5]. During the course of training, some of the tested models learn representations for mobile application activity logs' event names (identifiers of some event happening, e.g. start of the application or tap on the «order button»). After training, one may use the Euclidean space representations of said event names for machine learning tasks.

3. Methodology

Our method aims to obtain a fixed-length dense vector representation of an arbitrary Yandex.Taxi user from her activity in the mobile application. Apart from being fixed-length, this representation should:

1) be able to act as a feature set for business-oriented supervised learning tasks, such as estimation of a user's Lifetime Value or identification of a user's service preferences (child seat requirement etc.). Below, this property is referred to as «predictive power»;

2) help identify similar users in terms of business metrics, such as willingness to accept surge pricing² or tariff preferences³. Below, this property is referred to as «similarity».

3.1. Data Description and Preprocessing

The main source of users' data is their logged activity in the mobile application. The log is represented by a series of consecutive events, some of which contain detailed descriptions of the event. Each event has: event_name, event_value (description), event_timestamp, event_region, session_id and event_coordinates.

Example 1. If event_name is «accept_order_button_is_clicked», then its description might be «tariff: economy, surge_value: 1.5, source_coordinate: (10, 10), target_coordinate: (20, 20)».

After the manual selection process, there are 169 unique event names, 40 of which contain event values. The selection of event names for user-text (concatenated event names and event values) creation was done manually based on the amount of useful information they bear.

Example 2. Event «application_started» is ignored, because it seemingly bears no relevant information about the user except for the fact that she started the app, and that information is logged seconds later on the first screen she sees.

Preprocessing of event values aims to extract useful information from raw logs and to help text-embedding models distinguish between events that look similar at first glance.

Example 3. event_value «surge: 1.2» is transformed to «surge_yes surge_value_1_2», while event_value «surge: 1.0» is transformed to «surge_no surge_value_1_0» to enable text-embedding models to tell the difference between the situation in which surge price was accepted and the opposite.
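The preprocessing described in Examples 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the whitelist `SELECTED_EVENTS`, the event name `surge_shown`, the function names and the rule "a multiplier above 1.0 means surge" are assumptions made for the example.

```python
import re

# Hypothetical whitelist; the paper selects 169 event names manually.
SELECTED_EVENTS = {"accept_order_button_is_clicked", "surge_shown"}


def transform_surge(event_value: str) -> str:
    """Turn 'surge: 1.2' into 'surge_yes surge_value_1_2' (see Example 3)."""
    match = re.search(r"surge:?\s*(\d+)\.(\d+)", event_value)
    if match is None:
        return event_value  # values without a surge multiplier pass through
    whole, frac = match.groups()
    # Assumption: a multiplier above 1.0 means surge pricing was in effect.
    flag = "surge_yes" if float(f"{whole}.{frac}") > 1.0 else "surge_no"
    return f"{flag} surge_value_{whole}_{frac}"


def build_user_text(events):
    """Concatenate selected event names and preprocessed values into a user-text.

    `events` is an iterable of (event_name, event_value) pairs in log order.
    """
    tokens = []
    for name, value in events:
        if name not in SELECTED_EVENTS:
            continue  # e.g. application_started is dropped (see Example 2)
        tokens.append(name)
        if value:
            tokens.append(transform_surge(value))
    return " ".join(tokens)
```

For instance, `build_user_text([("application_started", ""), ("surge_shown", "surge: 1.5")])` keeps only the second event and emits its name followed by the two surge tokens.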

² Surge pricing is a method to balance taxi demand and supply by charging a higher price for the trip.

³ Tariff is a class of the car that arrives when a taxi is ordered in the Yandex.Taxi application. There are plenty of available tariffs: economy, comfort, business and others. Users may have preferences regarding the tariff.

The dataset for experimentation consists of 5,539 user-texts generated from selected users' Yandex.Taxi mobile application activity until 1 November. The average length of a user-text is 781 words. The vocabulary size is 1,249, and the total number of words is 4,401,355.

3.2. Predictive Power and Similarity Evaluation

To evaluate the performance of user-embeddings on predictive power and similarity tasks, we collected a set of business metrics⁴ that are used as target values in these tasks. Predictive power of a user representation is measured by its ability to predict the collected business metrics associated with the user. For the experiments, we chose 8 business metrics. The symbol «*» in the column «Metric name» indicates the presence of information directly associated with the metric value in the user's application logs.

Table 1

User's Business Metrics Description

| Metric name | Description | Performance evaluation metric |
|---|---|---|
| accepts surge* | 1 if user has accepted surge pricing at least once, 0 otherwise. | Accuracy |
| tariff* | The distribution of user's total taxi orders among available tariffs. | Categorical cross entropy |
| card system | The distribution of user's total taxi orders among available card systems. | Categorical cross entropy |
| payment type* | The distribution of user's total taxi orders among available payment types. | Categorical cross entropy |
| mean cost | Average cost of user's order. | RMSE |
| mean travel time | Average travel time of user's trip. | RMSE |
| cancel frequency* | The number of cancelled orders divided by the total number of orders user had. | RMSE |
| num. orders | Total number of orders user had. | RMSE |

The user-embedding's ability to help identify similar users is measured as follows. First, for each user we find the top-n most similar users (in our experiments n = 5) based on cosine similarity. Second, for each business metric from Table 1 we evaluate the variance within the group of selected users. For real-valued metrics and for the binary one we use the regular variance; for the other metrics, which are essentially distributions, we measure the average pairwise Hellinger distance in the group.
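The similarity evaluation above can be sketched in plain Python as follows. The function names are illustrative, and two details are assumptions because the text does not fix them: the group consists of the n neighbours only (without the query user), and the variance is the population variance.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_n_similar(embeddings, user, n=5):
    """Indices of the n users closest to `user` by cosine similarity."""
    scores = [
        (cosine_similarity(embeddings[user], emb), idx)
        for idx, emb in enumerate(embeddings)
        if idx != user
    ]
    return [idx for _, idx in sorted(scores, reverse=True)[:n]]


def variance(values):
    """Population variance, used for real-valued and binary metrics."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    return math.sqrt(
        sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2
    )


def mean_pairwise_hellinger(distributions):
    """Average Hellinger distance over all pairs in a group of distributions."""
    pairs = list(combinations(distributions, 2))
    return sum(hellinger(p, q) for p, q in pairs) / len(pairs)
```

Averaging `variance` or `mean_pairwise_hellinger` over all query users gives one number per business metric, comparable to a baseline in which the neighbours are replaced by randomly sampled users.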

3.3. Models

In our experiments, we evaluated the performance of 5 models of different nature, some of which have both unsupervised and supervised versions (* indicates the presence of a supervised version of a model):

• Word2Vec (W2V)*

• FastText (FT)

• Doc2Vec (D2V)

• Autoencoder (AE)*

• ARTM*

Before diving deeper into the models' architectures, it is crucial to define the concept of «guide». A guide is a task additional to the model's original unsupervised objective. With this additional task, we encourage the model to pay more attention to the textual features that are indicative of e.g. a user's tariff preferences or the card system she uses. We employ guides explicitly in an ordinary multitask learning fashion, i.e. we add auxiliary losses to the original unsupervised loss, while the embedding stage is shared among all of the tasks. As an example, one may think of an autoencoder model which takes as input a user-text (in a bag-of-words representation) and transforms it to a dense fixed-length vector with the objective to minimize the reconstruction loss together with an auxiliary binary cross entropy for classifying the user as one that accepts surge pricing or not. In the described setup, the autoencoder tries to learn the user representation in such a way that it preserves both the information important for reconstruction (the original objective) and the patterns indicative of surge pricing acceptance. In the remainder of the paper we refer to any auxiliary task as a guide. All supervised models were trained with 4 guides: payment type, tariff, mean cost and num orders.

⁴ Business metric is an attribute of a user which describes her pattern of service usage. For example, it might be the average number of orders per month or the number of cancelled orders during the last week.

All the models except for unsupervised versions of W2V, FT and D2V are trained to obtain 100-dimensional user representation. Unsupervised W2V, FT and D2V obtain 200-dimensional representation. The choice of dimensionality is guided by each model's performance in the predictive power task.

Word2Vec

As the unsupervised version of the model (W2V simple) we used the gensim [6] implementation of Skip-Gram word2vec trained on the full corpus of user-texts. In order to obtain a given user's representation, all word vectors from his or her user-text are averaged. The supervised word2vec model (CBOW) has multi-layer perceptrons attached to the embedding layer of word2vec, one for each guide we introduce to the model. At each epoch of training, the regular word2vec model is first trained on the whole corpus of user-texts; then the embedding layer is taken out and trained simultaneously with multiple classifiers (MLPs) on top of it to minimize the loss associated with prediction of the user's business metrics. The whole process is repeated until the classification converges. In the described architecture the original word2vec objective acts as a regularizer that helps the model generalize better to unseen tasks (e.g. prediction of business metrics that were not used as guides during training). The supervised model has 2 versions: one with a global average pooling layer on top of the embedding layer (W2V POOL) and the other with an LSTM [10] layer in that place (W2V LSTM).

Fig. 1. General supervised word2vec architecture:

w - word, t - index of current central word in the word2vec window (CBOW), s - half-length of the word2vec window, e_i - embedding of word i, e_av - embedding of user-text, I - number of words in user-text, ŷ_j - prediction of guide j (business metric), m - number of guides used for training
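The averaging step used by W2V simple (and, conceptually, the global average pooling in W2V POOL) can be illustrated with a small stdlib-only sketch. The `word_vectors` mapping stands in for a trained word2vec model, and skipping out-of-vocabulary words is an assumption of this example rather than something stated in the paper.

```python
def average_embedding(user_text, word_vectors):
    """Average the vectors of all known words in a whitespace-tokenized user-text.

    `word_vectors` maps a word to its embedding (a list of floats).
    Returns None when no word of the text has an embedding.
    """
    vectors = [word_vectors[w] for w in user_text.split() if w in word_vectors]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(vec[k] for vec in vectors) / len(vectors) for k in range(dim)]
```

Every user-text, regardless of its length, is mapped to a vector of the same dimensionality as the word embeddings, which is what makes the result usable as a fixed-length feature set.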

ARTM

We chose ARTM as the topic modeling approach (and BigARTM⁵ [7] as the tool) because it is a generalized topic modeling method. Taken with 2 different sets of parameters, it acts as a generalization of 2 of the most popular approaches to the task, namely LDA and PLSA. The unsupervised version of ARTM (ARTM simple) is a simple LDA model implemented in the BigARTM library and trained on the full corpus of user-texts. The supervised ARTM (ARTM guided) is a regular ARTM model with guides represented as modalities added to the original word modality. Intuitively, the model takes, for example, a user's mean order cost or her willingness to accept surge pricing as an additional modality in the topic modeling task.

Autoencoder

The unsupervised autoencoder (AE simple) encodes a bag-of-words representation of a user-text into a fixed-length vector and then reconstructs the original input from it. The encoded representation is used for subsequent tasks. The supervised autoencoder (AE guided) is a regular autoencoder model in which the output of the encoder is also fed to dense layers for business metrics prediction. The model is trained in a regular multitask fashion, with the total loss being a weighted sum of the reconstruction loss and all guides' losses. Contrary to the supervised word2vec, this model is trained in an end-to-end fashion.

Fig. 2. General supervised autoencoder architecture:

[wc_0 ... wc_n] - bag-of-words representation of user-text, n - number of words in the vocabulary, wc_i - number of times word i appeared in user-text, dim - dimensionality of user-text embedding, e - embedding of user-text, ŷ_j - prediction of guide j (business metric), m - number of guides used for training
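The multitask objective of the guided autoencoder, a weighted sum of the reconstruction loss and the guides' losses, can be written down as a short sketch. The weights, the default weight of 1.0 and the function name are hypothetical; the paper does not report the values it used.

```python
def multitask_loss(reconstruction_loss, guide_losses, guide_weights,
                   reconstruction_weight=1.0):
    """Weighted sum of the reconstruction loss and the per-guide losses.

    `guide_losses` and `guide_weights` map a guide name to its loss value
    and weight; guides without an explicit weight default to 1.0.
    """
    total = reconstruction_weight * reconstruction_loss
    for name, loss in guide_losses.items():
        total += guide_weights.get(name, 1.0) * loss
    return total
```

Because the encoder is shared, gradients of every term flow into the same embedding, so the weights control how strongly each guide shapes the user representation relative to reconstruction.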

Doc2Vec

As the unsupervised version of the model we used the gensim implementation of DBOW doc2vec [13], trained on the full corpus of user-texts with each user-text treated as a document. In order to obtain a user representation, the corresponding document vector is inferred. There is no supervised version of this model.

⁵ BigARTM is a tool to infer topic models, based on a novel technique called Additive Regularization of Topic Models (taken from BigARTM official website: http://bigartm.readthedocs.io/en/stable/intro.html).

FastText

The original Facebook Research fastText [8] implementation is used. There is no supervised version of this model.

4. Results

4.1. Mock Evaluation: Predictive Power

Table 2

Predictive Power Evaluation Results

Each row in the table presents a model, each column - a guide (business metric) for which performance was evaluated. Row «constant» shows the performance of the best constant prediction fitted on the training set for each business metric. Row «W2V untrained» refers to the supervised word2vec model which was not trained, just initialized with random weights. All performance evaluation metrics are taken from Table 1. In all columns except for the first one, lower is better.

| Model | accepts surge | card system | payment type | tariff | cancel frequency | mean travel time | mean cost | num orders |
|---|---|---|---|---|---|---|---|---|
| W2V untrained | 0.8578 | 0.719 | 0.4959 | 0.2129 | 0.1629 | 11.8916 | 198.809 | 52.986 |
| W2V simple | 0.8758 | 0.7388 | 0.4393 | 0.2152 | 0.1688 | 11.8363 | 192.227 | 49.9414 |
| W2V POOL | 0.8657 | 0.5608 | 0.3958 | 0.1537 | 0.1635 | 11.7972 | 175.895 | 47.2452 |
| W2V LSTM | 0.7769 | 0.6904 | 0.4797 | 0.5538 | 0.1758 | 11.9295 | 184.502 | 41.8545 |
| FT | 0.8736 | 0.6248 | 0.4314 | 0.1541 | 0.1691 | 11.7443 | 191.327 | 50.5823 |
| D2V | 0.6072 | 0.7749 | 0.5243 | 0.2185 | 0.192 | 12.6536 | 220.606 | 55.9574 |
| ARTM simple | 0.9112 | 0.8008 | 0.4681 | 0.1836 | 0.1378 | 11.7484 | 202.655 | 38.3039 |
| ARTM guided | 0.9162 | 0.7478 | 0.5148 | 0.2165 | 0.1641 | 11.7736 | 194.085 | 38.8015 |
| AE simple | 0.8664 | 0.6767 | 0.4776 | 0.1605 | 0.1722 | 11.9182 | 200.173 | 36.0736 |
| AE guided | 0.87 | 0.5838 | 0.4319 | 0.1632 | 0.1693 | 11.8151 | 183.072 | 36.7271 |
| constant | 0.5819 | 0.7823 | 0.5259 | 0.2152 | 0.1932 | 12.6442 | 220.624 | 60.0897 |

The results suggest that our supervised W2V model (W2V POOL) shows the best performance in 4 out of 8 prediction tasks. The guides used for supervision are: payment type, tariff, mean cost and num orders. The W2V POOL model outperforms the others in 3 out of the 4 tasks it was explicitly supervised on. It also shows the best result in the card system distribution estimation, despite the fact that it was not supervised with respect to this metric. The opposite is true for the num orders metric, on which W2V POOL was supervised, yet it struggles to beat the other models.

In 2 of the tasks (cancel frequency and mean travel time) the best results are shown by unsupervised models (LDA and FastText).

The largest variance in the predictive power among the models is present for tariff and num orders tasks, the smallest - for mean travel time and mean cost.

Additionally, in none of the tasks is the constant prediction the best one, which is evidence that nearly all models have learned to extract meaningful information about users' business metrics.

4.2. Mock Evaluation: Similarity

In the similarity task, the supervised autoencoder (AE guided) beats the others in 3 out of 8 tasks; W2V untrained shows the same result, also winning 3 out of 8 tasks. For now, we cannot suggest a reasonable explanation of that phenomenon. The supervised autoencoder shows the best performance in only one task it was supervised on. Moreover, the model shows the best performance on 2 tasks for which it had no guide.

The largest variance in the similarity task performance is present for tariff and mean cost tasks, the smallest - for card system and payment type.

Table 3

Similarity Evaluation Results

Each row in the table presents a model, each column - a guide (business metric) for which performance was evaluated. The row named «random» presents variances calculated based on n random samples from the dataset (i.e. the step with selection of the n most similar users to the given user is replaced with random uniform sampling of n users). The row named «W2V untrained» refers to the word2vec model which was not trained, just initialized with random weights. The details of evaluation are described in section 3.2 of this paper. In each column, lower is better.

| Model | accepts surge | payment type | card system | tariff | cancel frequency | mean cost | num orders | mean travel time |
|---|---|---|---|---|---|---|---|---|
| W2V untrained | 0.0984 | 0.2792 | 0.6893 | 0.0879 | 0.0203 | 26951.9 | 3103.32 | 82.0377 |
| W2V simple | 0.0976 | 0.2689 | 0.6563 | 0.0951 | 0.0223 | 29554.4 | 2238.73 | 104.411 |
| W2V POOL | 0.0956 | 0.2258 | 0.5596 | 0.0907 | 0.0209 | 29819.8 | 2167.41 | 92.5748 |
| W2V LSTM | 0.1259 | 0.2645 | 0.6188 | 0.0894 | 0.0239 | 30774.1 | 773.741 | 108.91 |
| FT | 0.1024 | 0.2685 | 0.6562 | 0.0935 | 0.0224 | 35447 | 2420.46 | 103.427 |
| D2V | 0.1762 | 0.319 | 0.7217 | 0.1375 | 0.0289 | 67849.9 | 2483.63 | 127.114 |
| ARTM simple | 0.1073 | 0.233 | 0.5595 | 0.0893 | 0.023 | 57348.9 | 1704.41 | 116.845 |
| ARTM guided | 0.1111 | 0.2805 | 0.6674 | 0.0969 | 0.0243 | 65342.3 | 2375.38 | 104.782 |
| AE simple | 0.0967 | 0.283 | 0.6833 | 0.0865 | 0.0229 | 39925 | 1943.59 | 102.506 |
| AE guided | 0.0928 | 0.223 | 0.555 | 0.0785 | 0.0214 | 49074.9 | 1405.92 | 108.226 |
| random | 0.1966 | 0.3165 | 0.7271 | 0.1356 | 0.0293 | 67277.2 | 2107.31 | 121.262 |

It is important to note that in 5 out of 8 tasks supervised models beat the others; however, in 3 tasks the untrained word2vec with randomly initialized weights wins.

Furthermore, in none of the tasks is random grouping of users the best one, which is evidence that nearly all the models have learned to place users that are similar in terms of business metrics closer to each other in terms of cosine distance.

4.3. Guide Validation

We also studied the relationship between the addition of different guides to the autoencoder model and its performance on the predictive power task. In order to estimate the relation, we trained and evaluated the supervised autoencoder model with 255 possible combinations of guides. Then we created a set of 255 examples, each of which is represented by a vector of 8 variables indicating whether the model was trained with guide g (guides[g]=1) or without it (guides[g]=0); this is our feature set. The target values are the performance measures on the 8 business metric prediction tasks from the predictive power evaluation stage. We train 8 regression models separately to predict the performance on each business metric for every possible set of guides.
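The feature set for this analysis can be enumerated directly: 8 guides give 2^8 = 256 on/off patterns, and dropping the pattern with no guide at all leaves the 255 combinations mentioned above. The sketch below builds that binary design matrix; the guide order and the function name are illustrative choices, not taken from the paper.

```python
from itertools import product

GUIDES = [
    "tariff", "payment type", "accepts surge", "card system",
    "cancel frequency", "num orders", "mean cost", "mean travel time",
]


def guide_design_matrix(guides=GUIDES):
    """All non-empty on/off combinations of guides as rows of 0/1 indicators.

    Row r, column g equals 1 if the model in experiment r was trained
    with guide g, and 0 otherwise.
    """
    rows = [list(bits) for bits in product((0, 1), repeat=len(guides))]
    return [row for row in rows if any(row)]
```

Each row is then paired with the 8 performance values measured for the corresponding model, and one OLS regression per business metric is fitted on these 255 examples.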

Table 4

Guide Validation Results

Each coefficient with coordinates (g, m) in the table shows the effect of introducing the g-th guide on the performance of the model on the m-th business metric prediction task. The coefficients are normalized to the scale of the dependent variable (each coefficient shows the relative change in the target variable if the guide is present, ceteris paribus). Empty cells are coefficients which did not pass the 95% significance level measured by regular significance tests applied to OLS regression coefficients.

tariff payment type accepts surge card system cancel frequency num orders mean cost mean travel time

tariff 0.0831 0.0039

payment type -0.0459 -0.0705 0.0018

accepts surge -0.0173 0.0064

card system -0.0740 0.0033 -0.1069

cancel frequency 0.0021 0.0036

num orders 0.0041

mean cost -0.0110 0.0019 0.0073 -0.005 -0.0451 -0.0040

mean travel time -0.0086

Table 4 offers some insights about the guides and their effects on the separate predictive power tasks. One of them is that, as expected, the introduction of a guide to the model boosts its performance on the corresponding prediction task, ceteris paribus (for example, if we add the guide for the mean cost prediction task, the RMSE on this task falls by 4.5%). The only artefact is the tariff task, which demonstrates the opposite.

Another feature is that some of the guides appear to contribute not only to the predictive power on their own metric, but to others as well. An example is the card system guide, which not only helps to predict the credit card system type better, but also boosts the performance on the payment type task. We speculate that this phenomenon may be explained as follows: if one gives the model the information about the card system of the user (e.g. MasterCard), then it may infer that this user's payment type is likely card and not cash. A less intuitive relation is seen between mean cost and payment type, where the introduction of the mean cost guide improves the model's performance on the estimation of the payment type distribution.

Overall, if each row is summed up, one may see that some guides improve the total performance of the model and some do not. This information may be useful to select guides for models' training.

4.4. Application to Production Task

In this part of the section we investigate how obtained user representation may be applied in a real-world setup.

The task is to predict the number of users' trips up to a given date based on their activity in the first month. The given date is fixed for all users, while the starting date may vary. We use both available data streams, namely, history of orders and mobile app activity logs. There are 3,999 users in the dataset.

First, we extract features from users' history of orders (94 features in total). We fit a boosting model (CatBoost [9]) on 2,999 samples from the dataset and evaluate it using 1,000 samples as the test set. Second, we construct user representations from users' first month of mobile application activity and fit the same model on the constructed vectors. Then, we fit the model on the combined feature set, which contains both the hand-crafted features generated from the history of orders and the user-embeddings obtained by our method. For the construction of user-embeddings we use the unsupervised autoencoder model so as to prevent leaks indicative of users' future trips; moreover, the autoencoder model is trained using only the first month of users' mobile application activity.

Table 5

Production Task Results

The task is to predict the number of user's trips up to a given date based on his or her activity in the first month. The row «Best of median / mean» shows the performance of the best constant prediction on this dataset.

| Feature Set | MAE | RMSE |
|---|---|---|
| Best of median / mean | 13.93 | 32.46 |
| Hand-crafted features (HC) | 12.11 | 27.06 |
| User-embeddings (UE) | 11.79 | 26.2 |
| Combined (HC + UE) | 11.2 | 24.6 |

Table 5 shows that the user-embeddings obtained by our method perform better than hand-crafted features (MAE 11.79 vs. 12.11, RMSE 26.2 vs. 27.06). Moreover, the combined representation yields the best results. It is important to note that in the combined version we use both available data streams while also avoiding manual feature extraction from users' mobile application activity logs, which is a very labor-intensive procedure. We therefore suppose that our method may improve existing production processes by enriching them with automatic feature extraction from mobile application activity logs.
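To put the differences in Table 5 on a common scale, one can compute the relative error reduction of each feature set with respect to a chosen baseline. The snippet below is only a worked-arithmetic sketch over the numbers reported in the table; the dictionary keys and function names are shorthand introduced for this example.

```python
# (MAE, RMSE) pairs copied from Table 5.
RESULTS = {
    "constant": (13.93, 32.46),
    "HC": (12.11, 27.06),
    "UE": (11.79, 26.2),
    "HC+UE": (11.2, 24.6),
}


def relative_reduction(baseline, candidate):
    """Fraction by which `candidate` lowers the error of `baseline`."""
    return (baseline - candidate) / baseline


def reductions_vs(baseline_key, results=RESULTS):
    """Relative (MAE, RMSE) reductions of every feature set vs. the baseline."""
    base_mae, base_rmse = results[baseline_key]
    return {
        name: (relative_reduction(base_mae, mae),
               relative_reduction(base_rmse, rmse))
        for name, (mae, rmse) in results.items()
        if name != baseline_key
    }
```

Calling `reductions_vs("HC")` expresses the user-embedding and combined results as fractions of the hand-crafted baseline error; a negative value means the feature set is worse than the baseline.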

5. Conclusions and Future Work

We show how various models of different nature may be used to obtain user representation through her Yandex.Taxi mobile application activity. Such representation is capable of acting as a feature set for supervised learning tasks and is helpful to identify similar users in terms of their business metrics. We also studied the relation between the method's performance and configuration of guides it was fed with. The findings suggest that some guides are complementary to each other and some are the opposite. One can tune the configuration of guides in order to achieve best overall performance.

Our method is not yet deployed in the company, as the process faces various challenges. The main obstacle is that Yandex.Taxi is growing rapidly and the existing mobile application log generating process is constantly changing (event names are changed or merged, etc.). So, in order to keep the method's performance at the same level, one needs to constantly retrain it. However, training of the best models is quite time-consuming: on a machine with 16 CPU cores, 2.5 GHz each, the supervised word2vec takes almost 10 hours to converge with a training set size of 5,000 users. Our aim is to scale training up to around 10,000,000 users. Our approach is going to be deployed as soon as we optimize it for faster training. The current state of the method is sufficient for a one-time improvement of various models used in Yandex.Taxi; for continuous usage in production processes, however, the challenge outlined above needs to be overcome.

Future work may focus on the performance of context-based embedding models when context is absent. Studying how word-level embeddings change in the course of training also looks promising: it may reveal ways to separate the training of words that benefit from a context-based approach from the training of those that do not, which might help learn better representations.

We would like to thank Tatiana Saveleva, Arsenii Ashukha and Anton Pankratov for their contributions to reviewing and drafting the paper and for their thoughts on algorithm design and evaluation.

Литература

1. Zolna Konrad User Modeling Using LSTM Networks // Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17). 2017. P. 5025-5026.

2. Ruder S. An overview of multi-task learning in deep neural networks // arXiv preprint arXiv:1706.05098. 2017.

3. Arora S., Warrier D. Decoding fashion contexts using word embeddings // KDD Workshop on Machine learning meets fashion. 2016.

4. Mikolov Т., Sutskever I., Chen K., Corrado G.S., Dean J. Distributed representations of words and phrases and their compositionalitv // Advances in neural information processing systems. 2013. P. 3111-3119.

5. Guo Ch., Berkhahn F. Entity embeddings of categorical variables. arXiv preprint arXiv:1604.06737, 2016.

6. Rehurek R., Sojka P. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010. P. 361-369.

7. Vorontsov K., Frei O., Apishev M., Rom,ov P., Dudarenko M. Bigartm: Open source library for regularized multimodal topic modeling of large collections // International Conference on Analysis of Images, Social Networks and Texts. 2015. P. 370-381.

8. Bojanowski P., Grave E., Joulin A., Mikolov T. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.

9. Dorogush A. V., Ershov V., Gulin A. CatBoost: gradient boosting with categorical features support. 2017.

10. Hochreiter S., Schmidhuber J. Long short-term memory // Neural computation. 1997. V. 9, N 8. P. 1735-1780.

11. Liu H., Wu L., Zhang D., Jian M., Zhang X. Multi-perspective User2Vec: Exploiting re-pin activity for user representation learning in content curation social network // Signal Processing. 2018. V. 142. P. 450-456.

12. Ozsoy M.G. From word embeddings to item recommendation. arXiv preprint arXiv:1601.01356. 2016.

13. Le Q., Mikolov T. Distributed representations of sentences and documents // International Conference on Machine Learning. 2014. P. 1188-1196.

References

1. Zolna Konrad User Modeling Using LSTM Networks. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17). 2017. P. 5025-5026.

2. Ruder S. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv: 1706.05098. 2017.

3. Arora S., Warrier D. Decoding fashion contexts using word embeddings. KDD Workshop on Machine learning meets fashion. 2016.

4. Mikolov T., Sutskever I., Chen K., Corrado G.S., Dean J. Distributed representations of words and phrases and their compositionalitv. Advances in neural information processing systems. 2013. P. 3111-3119.

5. Guo Ch., Berkhahn F. Entity embeddings of categorical variables. arXiv preprint arXiv: 1604.06737, 2016.

6. Rehurek R., Sojka P. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010. P. 361-369.

7. Vorontsov K., Frei O., Apishev M., Romov P., Dudarenko M. Bigartm: Open source library for regularized multimodal topic modeling of large collections. International Conference on Analysis of Images, Social Networks and Texts. 2015. P. 370-381.

8. Bojanowski P., Grave E., Joulin A., Mikolov T. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.

9. Dorogush A.V., Ershov V., Gulin A. CatBoost: gradient boosting with categorical features support. 2017.

10. Hochreiter S., Schmidhuber J. Long short-term memory. Neural computation. 1997. V. 9, N 8. P. 1735-1780.

11. Liu H., Wu L., Zhang D., Jian M., Zhang X. Multi-perspective User2Vec: Exploiting re-pin activity for user representation learning in content curation social network. Signal Processing. 2018. V. 142. P. 450-456.

12. Ozsoy M.G. From word embeddings to item recommendation. arXiv preprint arXiv:1601.01356. 2016.

13. Le Q., Mikolov T. Distributed representations of sentences and documents. International Conference on Machine Learning. 2014. P. 1188-1196.

Received by the editors on 13.09.2018