
Applied Linguistics


УДК 81'32 https://doi.org/10.33910/2687-0215-2022-4-1-15-25

Controlling impression: Making ruGPT3 generate sentiment-driven movie reviews

A. V. Margolina

1 Higher School of Economics, 16 Soyuza Pechatnikov Str., Saint Petersburg 190121, Russia

Author

Anastasia V. Margolina, e-mail: avmargolina@edu.hse.ru

For citation: Margolina, A. V. (2022) Controlling impression: Making ruGPT3 generate sentiment-driven movie reviews. Journal of Applied Linguistics and Lexicography, vol. 4, no. 1, pp. 15-25. https://doi.org/10.33910/2687-0215-2022-4-1-15-25

Received 8 April 2022; reviewed 16 May 2022; accepted 20 June 2022.

Funding: The study did not receive any external funding.

Copyright: © A. V. Margolina (2022). Published by Herzen State Pedagogical University of Russia. Open access under CC BY-NC License 4.0.

Abstract. In this research paper, I investigate the controlled text generation capabilities of ruGPT3Large through fine-tuning, specifically focusing on generating movie reviews based on a designated sentiment attribute. Controlled text generation is an active area of inquiry within the domain of Natural Language Processing, particularly for the Russian language. This study exemplifies a simple approach to controllable text generation by training ruGPT3 on a textual dataset containing sentiment-marked prompts, enabling the model to recognize patterns and generate analogous texts. The research provides a comprehensive analysis of the limitations, shortcomings and merits of fine-tuning a large language model using prompts embedded in a dataset. The generated texts exhibit coherence, logical structure, abundant coreferential links, and narratives and vocabularies characteristic of film reviews. Nevertheless, ruGPT3-generated reviews exhibit certain linguistic errors. I classify the most prevalent errors, such as named entity confusion, grammatical gender inconsistencies and sentiment fluctuations. Given that the primary objective is to evaluate the efficacy of basic fine-tuning with respect to the specified attribute, both automatic sentiment analysis and human evaluation are employed for output assessment. In comparing the outputs of the fine-tuned model and the baseline ruGPT3Large, I observe that positive sentiment generation is the most successful, while neutral and negative sentiments are produced by the models less accurately.

Keywords: ruGPT3, fine-tuning, controlling text generation, sentiment, prompt

Introduction

Natural Language Generation (NLG) is a major field of today's Natural Language Processing (NLP) and aims to automatically produce, or mimic, human-like text. With the emergence of new and more advanced language models (LMs), such as ChatGPT, generated texts are becoming more and more comprehensible. However, a pretrained model cannot always be successfully applied to downstream tasks with zero-shot or few-shot learning (Brown et al. 2020, 34). Such tasks are better handled by fine-tuning an LM on a big corpus of textual data. An example of a downstream task is controllable text generation (CTG), in which an attribute (sentiment, style) affects the output. This type of text generation can be used in a wide variety of applications, such as generating product descriptions for e-commerce websites, composing news articles, creating chatbot responses, and more. In CTG, the process can be broken down into three primary components. First, there is the 'controlled element', which comprises two parts: a 'controlled condition' (e. g., a sentiment like positivity) and a 'source text'. The source text can be either nonexistent or serve as a text prompt, depending on the application's requirements. Second, there is the generative model that acts as the operational mechanism. And third, there is the generated text, which aligns with the initially specified controlled condition and serves as the output (Zhang et al. 2020, 4).

In this paper, I explore a simple method of sentiment-controlled text generation: fine-tuning ruGPT3 with a prompt. The idea is to teach the LM to generate the output (in my case, film reviews) according to the input attribute (i. e., the sentiment). In this context, CTG is a transfer learning task which reduces the computational cost, the amount of training time and the necessary volume of data. I use a pretrained LM and then feed it my custom dataset in order to teach the LM the style of movie review narratives, as well as the style of three sentiments: positive, neutral and negative. The aim of the paper is twofold: not only to produce sentiment-controlled movie reviews, but to do so in Russian.

For the evaluation of generated texts, I propose an approach which differs slightly from the one available in the literature. Common metrics such as ROUGE and BLEU do not fit my task well because they are usually applied to estimate the quality of summarization and translation. Instead, I compare my fine-tuned model with the basic ruGPT3Large using human evaluation. Further, the Dostoevsky library's sentiment analysis is used to evaluate the dataset of reviews generated by the fine-tuned LM. This method shows how well the model generates a text depending on the sentiment given in the prompt.

Previous Research

CTG is a rapidly developing area of NLP. Several research papers cover different approaches to controlling an attribute. The transformer architecture proposed by Vaswani et al. (Vaswani et al. 2017) is the state of the art in language modeling. Since I also use this architecture as the basis of my experiment, in this section I review CTG methods that use the decoder part of the transformer.

One approach towards CTG is to use prompts or seed texts to guide the generation process. For example, the CTRL model introduced by Keskar et al. (Keskar et al. 2019) allows users to input control codes that represent specific attributes or styles, such as sentiment, tense or persona, to generate a text that adheres to these constraints. Similarly, the GPT-3 model has been shown to be capable of generating text in a specific style or tone, such as formal, informal or sarcastic, based on the prompts provided by the user.

In my work, I focus only on CTG in which sentiment is the attribute impacting the output. The authors of Plug and Play Language Models (PPLM) proposed a solution for this case by creating a CTG pipeline: it consists of a "large, pretrained LM and a Bag of Words (BoW) or a small, easy-to-train discriminator" (Dathathri et al. 2020, 10). This approach gives very comprehensible results; however, it does not fit my task. The PPLM method was developed four years ago on the basis of GPT-2, while I explore the Russian version of GPT-3, which, despite being based on the GPT-2 architecture, has its own specifics due to the language and the pre-training corpus.

The Russian-language CTG is not a well-covered field of NLP, although some research is available. In (Nikolich, Puchkova 2021), the authors fine-tune a pretrained ruGPT3 in order to make it produce short sequences (summaries) of given long texts. Unlike my task, which is CTG, the authors explore text-to-text generation: they feed the model a parallel dataset which consists of original texts and their shortened versions. This article is significant in the context of fine-tuning the Russian ruGPT3 because it demonstrates the advantages and limitations of this LM. For instance, the authors notice that ruGPT3 sometimes changes named entities in the output (ibid., 6).

Methodology and Data

This research aims to describe the difficulties, opportunities and prospects of controllable text generation in Russian with a given sentiment using basic fine-tuning. I achieve this by training ruGPT3Large on a dataset of cinema reviews in which the prompt is already set, and by evaluating the resulting models linguistically and statistically. The architecture of ruGPT3 is similar to that of GPT-2: it is a decoder-only transformer-based model, which makes it well suited for text generation (Radford et al. 2019). ruGPT3Large is demanding in terms of GPU memory: to fine-tune the LM, I use an NVIDIA RTX 4090 with 24 GB of memory.

The data is gathered through the parsing of a specific website that hosts a comprehensive library of virtually every movie and associated user reviews in Russian. This platform was selected as the textual resource due to the sentiment markup provided by the very users who contribute to the text. Consequently, the primary benefit of this corpus is the presence of sentiments annotated by humans.

The data was parsed using the Selenium library for Python (Selenium with Python 2022). The original dataset consists of almost 200k reviews with three types of sentiment (positive, neutral and negative); however, the sentiment distribution is heavily imbalanced (Table 1).

Table 1. Distribution of reviews by sentiment

Total reviews | Positive | Neutral | Negative
199,354 | 148,834 | 28,643 | 21,877

Table 1 shows that roughly 75% of the reviews are positive: if all 200k reviews were used for fine-tuning, the model would be biased towards one sentiment. In order to balance the distribution, I cut the dataframe to 20k reviews for each sentiment, making their total number 60k.
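The balancing step can be sketched in plain Python. The record layout, field names and toy class sizes below are hypothetical stand-ins for the parsed dataframe:

```python
import random

def balance(reviews, per_class, seed=0):
    """Downsample each sentiment class to at most per_class reviews."""
    rng = random.Random(seed)
    balanced = []
    for label in ("positive", "neutral", "negative"):
        subset = [r for r in reviews if r["sentiment"] == label]
        balanced.extend(rng.sample(subset, min(per_class, len(subset))))
    return balanced

# Toy imbalanced corpus: 6 positive, 2 neutral, 2 negative reviews.
corpus = ([{"sentiment": "positive", "text": "..."}] * 6
          + [{"sentiment": "neutral", "text": "..."}] * 2
          + [{"sentiment": "negative", "text": "..."}] * 2)

balanced = balance(corpus, per_class=2)
print(len(balanced))  # 2 reviews per class, 6 in total
```

The same downsampling on the real dataframe yields the 20k-per-sentiment corpus described above.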

To feed the data into the model, I changed the format into a plain .txt file with the prompt. Now the dataset consisted of movie reviews, each in the following format: "<s>Тональность: [позитивная, нейтральная или негативная]\nТекст: [текст отзыва]</s>" [<s>Sentiment: [positive, neutral or negative]\nText: [the text of the review]</s>].

All texts are divided by line break and special characters that mark the beginning (<s>) of the string and the end of the string (</s>). For training, I divided the dataset into a train-validation split with a ratio of 85:15, which is 51k for train and 9k for validation. The objective of incorporating reviews with a prompt into the model is to facilitate the memorization of patterns by ruGPT3. This is achieved by utilising the second segment of the prompt, which serves as a continuation that the model must generate, namely, the review itself.
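The formatting and split described above can be sketched as follows; the example reviews and file name are illustrative:

```python
import random

def to_prompt(sentiment, text):
    """Wrap one review in the <s>...</s> prompt format used for fine-tuning."""
    return "<s>Тональность: {}\nТекст: {}</s>".format(sentiment, text)

reviews = [
    ("позитивная", "Прекрасный фильм, советую всем."),
    ("негативная", "Скучно и предсказуемо."),
    ("нейтральная", "Впечатления остались неоднозначные."),
]

lines = [to_prompt(s, t) for s, t in reviews]

# 85:15 train/validation split, as in the paper.
random.seed(0)
random.shuffle(lines)
cut = int(len(lines) * 0.85)
train, val = lines[:cut], lines[cut:]

# Each prompted review goes on its own line of the plain .txt training file.
with open("train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(train))
```

On the full 60k corpus the same 85:15 cut produces the 51k/9k split used for training.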

Training and Generation Processes

RuGPT3Large is a very demanding model in terms of GPU because of its number of parameters. Table 2 displays the fine-tuning parameters, which were chosen so that ruGPT3 would fit into GPU memory. Table 2 shows that I chose the minimal batch size: "a smaller batch size makes the training more stable but decreases the per-step computation efficiency significantly" (Conglong 2022, 2). The learning rate is the default one. With these settings and this GPU, the model took 7.5 hours to fine-tune.

Table 2. ruGPT3 fine-tuning parameters

Parameter Value

num_train_epochs 5

per_device_train_batch_size 1

per_device_eval_batch_size 1

block_size 1024

learning_rate 2.5e-4
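Under the Hugging Face Trainer API (an assumption: the paper does not name its training stack), the Table 2 settings would map onto a configuration roughly like this sketch. The model identifier, file paths and output directory are assumptions, not taken from the paper:

```python
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2Tokenizer, TextDataset, Trainer,
                          TrainingArguments)

tokenizer = GPT2Tokenizer.from_pretrained("sberbank-ai/rugpt3large_based_on_gpt2")
model = GPT2LMHeadModel.from_pretrained("sberbank-ai/rugpt3large_based_on_gpt2")

# block_size from Table 2 applies to dataset chunking, not to TrainingArguments.
train_dataset = TextDataset(tokenizer=tokenizer, file_path="train.txt",
                            block_size=1024)
eval_dataset = TextDataset(tokenizer=tokenizer, file_path="val.txt",
                           block_size=1024)

args = TrainingArguments(
    output_dir="rugpt3-reviews",       # hypothetical output directory
    num_train_epochs=5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    learning_rate=2.5e-4,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    # Causal LM objective: mlm=False means plain next-token prediction.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
```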

Tuning the hyperparameters of the generation process is key to achieving comprehensible results. Their choice varies from task to task: summarization would need different parameters than, for instance, machine translation. My task is text continuation with a given prompt, so I set the parameters as displayed in Table 3.

Table 3. ruGPT3 generation parameters

Parameter Value

repetition_penalty 5.0

top_p 0.95

top_k 5

temperature 1

no_repeat_ngram_size 2

As A. Nikolich and A. Puchkova write, "the temperature of the GPT model ranges from 0 to 1, and generally with a lower temperature the model is more likely to choose tokens with a higher probability of occurrence" (Nikolich, Puchkova 2021, 3). For their summarization task, they chose a temperature of zero in order to make the model less 'creative' and more 'focused' on the given text. Since I want my model to write more freely and less trivially, I set the temperature to 1.

Another key parameter is repetition_penalty: setting it to 1 means no penalty, and the higher the value, the greater the punishment. For my model, repetition_penalty was set to 5 in order to decrease frequent repetitions. Without this parameter, the model would generate long sequences with a relatively high percentage of repeated words or numbers. Penalizing the model motivates it to come up with synonyms rather than repetitions. Repetition_penalty also correlates with another parameter from Table 3: no_repeat_ngram_size. By setting it to 2, I prevented any bigram from appearing more than once in the text. This, like repetition_penalty, makes the text more diverse.

The top_k parameter sorts tokens by probability and removes the least probable tokens below the k-th token. It improves the quality of the text by deleting all words that are unlikely to appear, so that the language model does not go off-topic. However, it does not always work well: in some cases there is a large pool of comparably probable words, and top_k sampling makes the text less realistic and less diverse. For such cases, Holtzman et al. suggest top_p sampling (Holtzman et al. 2020). This sampling, also called nucleus sampling, selects "the highest probability tokens whose cumulative probability mass exceeds the pre-chosen threshold p" (ibid., 5). This way, it does not sample extremely unrealistic tokens, but it preserves diversity when the highest-scoring tokens have low confidence.
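The difference between the two sampling filters can be illustrated with a minimal sketch over a toy next-token distribution; the tokens and their probabilities are invented for illustration:

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens and renormalise them."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = dict(ranked[:k])
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

def top_p_filter(probs, p):
    """Nucleus sampling: keep the smallest high-probability set whose
    cumulative mass reaches the threshold p, then renormalise."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = {}, 0.0
    for tok, prob in ranked:
        kept[tok] = prob
        cum += prob
        if cum >= p:
            break
    total = sum(kept.values())
    return {tok: pr / total for tok, pr in kept.items()}

# Toy next-token distribution (hypothetical values).
probs = {"отличный": 0.5, "хороший": 0.3, "скучный": 0.15, "зелёный": 0.05}

print(top_k_filter(probs, 2))   # keeps exactly the 2 most probable tokens
print(top_p_filter(probs, 0.9)) # keeps tokens until their mass reaches 0.9
```

With these values, top_k=2 always truncates to two tokens, while top_p=0.9 keeps three: the cutoff adapts to how confident the distribution is.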

The last key parameter, which is not displayed in Table 3, is early stopping. I used early stopping to prevent the model from generating endless or very long sequences. By setting early_stopping to True, I 'allowed' my model to stop the beam search "when at least num_beams sentences are finished per batch" (Hugging Face 2022).
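Putting the Table 3 settings together, a generation call would look roughly like this sketch, assuming the Hugging Face `generate` API; the fine-tuned checkpoint path and num_beams value are assumptions:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("sberbank-ai/rugpt3large_based_on_gpt2")
model = GPT2LMHeadModel.from_pretrained("rugpt3-reviews")  # hypothetical checkpoint

# The sentiment prompt; the model continues with the review text.
prompt = "<s>Тональность: позитивная\nТекст:"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

output = model.generate(
    input_ids,
    do_sample=True,
    temperature=1.0,
    top_k=5,
    top_p=0.95,
    repetition_penalty=5.0,
    no_repeat_ngram_size=2,
    max_length=1024,
    num_beams=2,                       # assumed; required for early stopping
    early_stopping=True,               # stop once num_beams sequences finish
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```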

Results

This section presents the outcomes of fine-tuning a language model for the purpose of CTG with a particular focus on sentiment control. The results consist of three text examples generated by the model, which were subjected to sentiment analysis using the Dostoevsky library. This analysis offers significant insights into the sentiment conveyed by the generated text. To facilitate comparison, the basic ruGPT3Large model was used as a reference point. This section aims to provide a comprehensive understanding of the potential and limitations of fine-tuned language models in generating sentiment-controlled text.

Positive Sentiment

I compare the performance of the fine-tuned language model and the referential model when generating a positive text from the same prompt. The results indicate that the fine-tuned model produced a more coherent and structured text, with fewer factual errors. Additionally, the fine-tuned model demonstrated a better understanding of the sentiment expressed in the prompt, as evidenced by the positive review it generated:

Этот фильм я ждала с нетерпением, так как очень люблю творчество Тима Бертона и Джонни Деппа в целом. Во-первых, из-за того, что это экранизация книги 'Алиса в стране чудес', а во-вторых, хотелось посмотреть на игру таких замечательных актеров, как Хелена Бонем Картер, Энн Хэтэуэй, Миа Васиковска и конечно же неподражаемого Джонни Дэппа. Что могу сказать об актерском составе? Актеры подобраны просто замечательно! Все до единого идеально вписались в свои роли. Каждый персонаж имеет свою изюминку, которая делает его запоминающимся. В общем, актеры справились со своей задачей на все 100%. Особенно хочется отметить Чеширского Кота (Хелена Бонам Картер), Мартовского Зайца (Энн Хэтуэй) и Красную Королеву (Бонни Райт). Эти персонажи получились настолько яркими и харизматичными, но при этом не менее глубокими и интересными, чем их книжные прототипы. Не зря Тим Бертон назвал этот фильм самым личным для него. Он хотел показать зрителю всю ту любовь, которую он испытывает к своим героям. Получилось ли у него это? Думаю, да. Лично я получила огромное удовольствие от просмотра этой замечательной картины. 10 из 10 Приятного просмотра! </s>

We can see that the review produced by my LM is coherent and well-structured. It also has few factual mistakes. Further, ruGPT3 handles the named entities of the film correctly: it does not confuse them, and it uses only names that are linked to the movie Alice in Wonderland.

Interestingly, the referential model also demonstrated some degree of understanding of the sentiment expressed in the prompt, as indicated by the sequence of positive words it generated towards the end of the text. However, the referential model lacked the ability to produce a text that follows a review structure, resulting in a more general description of the movie:

Этот фильм рассказывает о том, как молодой человек по имени Джоуи влюбился в одну из самых красивых девушек на планете. Она была настолько красива, что он не мог отвести от нее глаз. Но однажды его жизнь круто изменилась и девушка стала для него важнее всего на свете.\n

So, while both models display a rudimentary understanding of sentiment, the fine-tuned RuGPT3 model shows an added capability: it not only captures the sentiment but also adheres to the structural norms of a movie review, which the referential model does not achieve.

In contrast, the fine-tuned model utilized the end-of-sequence (eos) token to signify the end of generated text, which facilitated a more structured and coherent text. The referential model also made use of the eos-token, albeit in an improvised manner due to the lack of exposure to examples of its use. As a result, the referential model resorted to mimicking the eos-token by generating tokens such as '\n' or '\t' to signify the end of a sequence.

Overall, my findings suggest that fine-tuning a language model can significantly improve its ability to generate a coherent and structured text with a better understanding of sentiment, as compared to referential models. These results highlight the potential of fine-tuning techniques in CTG and their importance in improving the quality of attribute-driven text generation tasks.

Negative Sentiment

This section of the study examines the performance of the fine-tuned ruGPT3 language model when generating a negative review. The generated text demonstrates a clear understanding of the prompt and effectively conveys the negative sentiment:

Этот фильм я посмотрел только из-за того, что в нем снимается один из моих любимых актеров — Леонардо ДиКаприо. Я не мог пропустить этот фильм и посмотреть его от начала до конца. И вот, наконец, добрался до этого фильма. Скажу честно, мне было очень трудно досмотреть это 'творение' до финальных титров. Фильм показался мне настолько нудным и затянутым, насколько это вообще возможно для такого рода фильмов. Но обо всем по порядку. Начнем с сюжета. Сюжетная линия довольно банальна и предсказуема. Ничего нового нам не рассказывают о главном герое или героинях этой картины. Нам просто дают понять, чем они занимаются на протяжении этих двух часов. Актерская игра тоже оставляет желать лучшего. Все персонажи какие-то пресные и скучные. Единственный кто хоть как то выделяется среди них — это персонаж Мэттью МакКонахи. Вот он действительно порадовал меня своей актерской игрой. Остальные же актеры сыграли свои роли так себе. В этом фильме нет ни одного момента, который бы заставил зрителя сопереживать кому-нибудь из героев. Нет здесь таких моментов, которые заставят вас задуматься о чем-либо важном. Ну а концовка... Она ужасна. Такого бездарного финала я давно не видел. После просмотра у меня остался такой осадок, как будто меня обокрали. Вывод: Если вы хотите посмотреть хороший триллер со смыслом, то вам лучше выбрать другой фильм. 3 из 10

The model makes use of typical negative tokens (boring, trivial) and employs metaphors and irony to further strengthen the sentiment expressed. At the end of the review, ruGPT3 gives the imaginary film a rating of three points out of ten, which signals correct use of the controlled sentiment. It is worth mentioning that the model has learned the pattern of metaphor: "После просмотра у меня остался такой осадок, как будто меня обокрали" [after watching, I was left with such an aftertaste, as if I had been robbed], as well as to produce irony by putting a word in quotation marks, denoting that the film cannot actually be called art. However, a small mistake in lexical construction is observed in the review: "Я не мог пропустить этот фильм и посмотреть его от начала до конца" [I could not miss the film and watch it from the beginning to the end]. The model forgets to put another negation, required by Russian, in the second part of the sentence, after the conjunction (translated into English, the correct version would require a "could not but" structure).

Furthermore, I analyze the structure and narrative of the generated text and find that the model is able to use coreferential links and vocabulary markers to create a natural and coherent review. It uses pronouns with regard to the named entities it mentions: "the character of Matthew McConaughey", "he" (the actor), "this movie", etc. The model also connects different parts of the review with vocabulary markers, such as "first things first" and "let's begin with the plot". This makes the text look natural and coherent.

In contrast, the referential model was only able to produce a few sentences that correlate with the given sentiment and lacks the depth and complexity of the fine-tuned model: "Этот фильм рассказывает о том, что такое зло. В нем рассказывается о людях, которые стали заложниками собственных пороков и совершили ужасные преступления" [This film is about what evil is. It tells about people who have become hostages to their own vices and committed terrible crimes].

The referential model utilizes vocabulary that fits the given sentiment but fails to convey an overall negative sentiment in its output. This suggests that the referential model lacks the ability to fully comprehend and synthesize the given prompt to generate a meaningful output with a clear sentiment.

Neutral Sentiment

The fine-tuned ruGPT3 model demonstrates the ability to produce a coherent, suitably ambivalent example of neutral sentiment. This is the model's review, with a six-point rating, for an imaginary film:

Этот фильм я смотрела в кинотеатре, на большом экране с хорошим звуком. Впечатления остались неоднозначные. С одной стороны — очень красивая картинка и спецэффекты (все-таки 3D), с другой — сюжетная линия не доведена до логического конца или хотя бы завершенности. Не могу сказать, что он меня разочаровал, но ожидал большего. Актерский состав порадовал. Том Круз как всегда неотразим! Эмили Блант тоже хорошо справилась со своей ролью. А вот ее героиня показалась мне какой-то 'неживой'. Я имею в виду поведение главной героини во многих ситуациях. Например, когда она узнала о том, кто был отцом ее мужа. Или когда ей пришлось бежать из плена вместе с пленными американцами. Она вела себя так, как будто все это время была под наркотическим действием. Это немного раздражало. Но больше всего расстроил конец фильма. Он получился каким-то скомканным и нелогичным. Что же мы получили в итоге? Красивая фантастическая мелодрама про спасение мира от инопланетных захватчиков. 6 из 10.

The model uses both positive and negative words to express its ambiguous opinion. Although neutral sentiment is often viewed as problematic, the model's output suggests that it can be used effectively to provide a nuanced and balanced review. In contrast, the referential model's output reflects a lack of understanding of the task at hand, producing a vague and general description of a film's plot that does not take into account the sentiment requested in the prompt:

Этот фильм о том, как один из самых могущественных людей в истории человечества — Дон Корлеоне — оказался на обочине жизни. Он не смог приспособиться к новым условиям и вынужден был начать все сначала. В конце концов он добился того, чего хотел больше всего на свете...

This time, it cannot be said that the referential model uses neutral words in order to fit the given prompt. Rather, it seems to simply avoid positive and negative words, as if neutral sentiment were just the absence of sentiment. The referential model's inability to produce meaningful text in a neutral context raises questions about the effectiveness of zero-shot methods for generating text with a specific sentiment.

Typical Mistakes

As a result of fine-tuning, the texts generated by a large language model tend to have fewer lexical and syntactic mistakes, resulting in more coherent and human-like outputs. However, the generated texts are not free from some common flaws.

One of the most common mistakes observed in the fine-tuned ruGPT3 model is the improper use of named entities. The model tends to mix up various components — such as actors' names, characters and plot intricacies — from different cinematic or literary works, resulting in a confounding blend of unrelated elements. This mixing up is illustrated by the following example:

<...>И вот я наткнулась на фильм 'Игра в имитацию', который повествует нам историю гениального математика Алана Тьюринга (Бенедикт Камбербэтч). Фильм снят по мотивам книги Эндрю Ходжеса «Алан Тильда: Энигма». Действие фильма происходит в годы Второй мировой войны.

Молодой математик вместе со своей девушкой Джоан Кларк (Кира Найтли) пытаются взломать немецкую шифровальную машину Enigma. Для этого они нанимают двух лучших криптографов — математика Лорда Генри Мортимера (Эдриан Броуди) и физика Кипа Торна (Марк Стронг).<...>

In this example, the use of named entities is predominantly accurate, encompassing verifiable references to a specific film, its literary basis and the involved actors. However, the LM gets one name wrong: Adrien Brody did not actually appear in the film. The model also invented for him a 'fictional' character who is not part of the actual movie.

This specific occurrence can be ascribed to the size of the LM employed in the study. It has been posited that an expansive LM demonstrates an enhanced capacity for employing named entities effectively. This relationship between model size and named entity recognition is attributable to the increased knowledge and memory retention capabilities of larger models, which consequently facilitates more comprehensive and accurate name usage.

Another observable limitation within the fine-tuned ruGPT3 model is the issue of grammatical gender inconsistency. The model occasionally exhibits a propensity to modify gender markers throughout the text, starting with a particular gendered verb form and then altering it in later sections. This is an example of such modification:

Этот фильм я смотрела в кинотеатре, на большом экране с хорошим звуком. Впечатления остались неоднозначные. С одной стороны — очень красивая картинка и спецэффекты (все-таки 3D), с другой — сюжетная линия не доведена до логического конца или хотя бы завершенности. Не могу сказать, что он меня разочаровал, но ожидал большего.

In this passage, the initial sentence utilizes a verb in the feminine form, whereas the final sentence employs a masculine form. This pattern suggests that the model's memory capacity may be limited, impeding its ability to consistently recall and adhere to previously established grammatical constructions. As a result, it is crucial to consider the model's inherent constraints when interpreting and evaluating the generated text.

Similar to the issue of grammatical gender inconsistency, the model sometimes confuses the sentiment of the text by changing it halfway through. For instance, if the given prompt is negative, the model may write a usual negative review and then conclude with "great film, recommend to watch" and ten points out of ten, thereby changing the sentiment from negative to positive. This common mistake applies mostly to neutral and negative sentiments, while the positive ones are almost never mistaken by the model.

While the fine-tuning process of a large language model results in texts that are more coherent and human-like, common flaws such as named entity confusion, grammatical gender inconsistency and sentiment change still persist. It is essential to consider the model's inherent limitations when interpreting and evaluating the generated text, as these flaws can significantly affect the accuracy and reliability of the model's outputs.

Sentiment Evaluation: Dostoevsky Library

In order to perform a technical evaluation of the generated text, the Dostoevsky library was employed: a neural model that provides sentiment analysis for Russian text. The generated texts were organized into a dataframe with columns representing the sentiment categories, each containing 50 reviews. The Dostoevsky model predicted the sentiment and returned a dictionary with three keys denoting the sentiments and their corresponding float values representing the polarity of the text. The value of -1 indicates the most negative sentiment, the value of 1 indicates the most positive sentiment, and 0 denotes a neutral sentiment. The dictionary was sorted in descending order based on the highest value, and the mode was extracted from the list of 50 values. The results of this analysis are presented in Table 4.
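The per-review label choice and the mode extraction can be sketched in plain Python; the prediction dictionaries below are invented stand-ins for the actual output of Dostoevsky's model:

```python
from collections import Counter

# Hypothetical Dostoevsky-style predictions: each dict maps a sentiment
# label to a score for one generated review.
predictions = [
    {"positive": 0.8, "neutral": 0.15, "negative": 0.05},
    {"positive": 0.6, "neutral": 0.3, "negative": 0.1},
    {"neutral": 0.7, "positive": 0.2, "negative": 0.1},
]

def top_label(scores):
    """Pick the sentiment with the highest score for one review."""
    return max(scores, key=scores.get)

labels = [top_label(p) for p in predictions]
mode = Counter(labels).most_common(1)[0][0]
print(mode)  # the most frequent predicted sentiment across the sample
```

Applied per 50-review column, this yields the modes reported in Table 4.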

Table 4. Distribution of sentiment categories and corresponding modes in generated texts

Sentiment category | Mode | Number of occurrences in 50 samples
Positive | Positive | 42
Neutral | Neutral | 35
Negative | Neutral | 29

It is observed that the positive sentiment is predicted the most accurately, while the neutral sentiment is relatively stable. There is, however, a notable limitation with negative reviews, which the model frequently classifies as neutral. It should be noted that these statistics depend largely on the Dostoevsky model itself. While the classifier may be a source of error, 50 texts per category are sufficient to observe a slight disadvantage for negative reviews and a clear advantage for positive ones.
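The aggregation step (take the top label per review, then the mode over each 50-review column) can be sketched as follows. The score dictionaries here are hypothetical stand-ins for Dostoevsky output; in the study they would come from running the library's classifier over each of the 50 generated reviews:

```python
from collections import Counter

def top_label(scores: dict) -> str:
    """Label with the highest score in a per-review sentiment dictionary."""
    return max(scores, key=scores.get)

def column_mode(predictions: list) -> str:
    """Most frequent top label across all reviews in one sentiment column."""
    return Counter(top_label(p) for p in predictions).most_common(1)[0][0]

# Hypothetical per-review score dictionaries (illustrative values only).
negative_column = [
    {"negative": 0.2, "neutral": 0.7, "positive": 0.1},
    {"negative": 0.6, "neutral": 0.3, "positive": 0.1},
    {"negative": 0.1, "neutral": 0.8, "positive": 0.1},
]
print(column_mode(negative_column))  # -> neutral
```

This toy column reproduces the pattern reported for negative reviews: even when some individual reviews score as negative, the mode of the column can come out neutral.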


Human Perspective

To assess the reliability of the sentiment analysis provided by the Dostoevsky library, a psycholinguistic experiment was conducted via an online survey with 15 participants. Each participant was presented with a text carrying no indication of its sentiment and was asked to rate it on a scale from -5 to 5, with -5 being the most negative, 0 neutral, and 5 the most positive. The mean values of the participants' ratings are presented in Table 5.

Table 5. Mean values of sentiment

Sentiment by LM                                        Positive    Neutral    Negative
Mean sentiment of human assessment (scale -5 to 5)     4.6         2.2        -3.1
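The per-category means in Table 5 are simple averages of the participants' ratings. A minimal sketch of the computation, with hypothetical raw ratings chosen only for illustration (the study averages over 15 participants):

```python
from statistics import mean

# Hypothetical raw ratings on the -5..5 scale; the study reports only
# the per-category means, so these numbers are illustrative.
ratings = {
    "positive": [5, 5, 4, 5, 4],
    "neutral":  [3, 2, 1, 3, 2],
    "negative": [-4, -3, -2, -4, -3],
}
means = {label: round(mean(vals), 1) for label, vals in ratings.items()}
print(means)  # per-category means of the illustrative ratings
```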

The results of the survey suggest that the fine-tuned ruGPT3 model generates positive sentiment more successfully than neutral or negative sentiment. The mean rating of 4.6 for positive sentiment indicates high accuracy, whereas the mean rating of 2.2 for neutral sentiment suggests a positivity bias. Interestingly, negative sentiment fared much better under human assessment than under the Dostoevsky validation: the mean rating of -3.1 given by the participants indicates a satisfactory level of accuracy.

Human evaluation also shows that the positive sentiment is the most successful, while the neutral and the negative ones are harder for the fine-tuned ruGPT3 to produce. While human evaluation revealed that negative reviews were predominantly detected as negative or close to negative, Table 4 shows that the Dostoevsky model classified bad reviews as neutral. Therefore, it is essential to consider both the model's and the human participants' performance in sentiment analysis to provide a more comprehensive understanding of the accuracy of sentiment analysis.

Conclusion

The findings suggest that a simple fine-tuning of a language model can produce coherent results in generating reviews with a narrative structure and a rating of the film. Incorporating a prompt in the fine-tuning process significantly influences the generated output; thus, controlling text generation by adding an attribute to the prompt can be a useful technique. The generated reviews, as analyzed from the human perspective, are very natural in terms of language and writing style. They have a narrative characteristic of the review genre and also provide a rating of the film. The sentiment analysis reveals that the model accurately generates positive sentiment and deals satisfactorily with neutral sentiment. However, the limitations of the fine-tuned ruGPT3 model should also be taken into account: named entity confusion, grammatical gender inconsistency and sentiment fluctuations. Further, the model still struggles to accurately generate negative sentiment in the context of controlled text generation.

The psycholinguistic experiment conducted in this study highlights the importance of human evaluation in assessing the accuracy of sentiment analysis provided by language models. By comparing the results of the machine-generated sentiment analysis with human evaluation, it is possible to identify the strengths and weaknesses of the model and improve its performance.

In summary, while fine-tuned language models like ruGPT3 offer promising results in sentiment-controlled text generation, further research is needed to address their limitations and enhance performance. Prompt-tuning methods and human evaluation can aid in achieving more accurate and reliable results. As a further perspective, it is worth applying the ruPrompts library to train a prompt for ruGPT3, in which the seed is optimized by gradient descent (ruPrompts library 2022); the seed is divided into two components, the format and the provider (Totmina 2022, 5). Such prompt-tuning may well enhance the results of sentiment-controlled text generation.

Conflict of interest

The author declares that there is no conflict of interest, either existing or potential.

Sources

Hugging Face. (2022) [Online]. Available at: https://huggingface.co/docs/transformers/main_classes/text_generation#transformers.GenerationMixin.generate.early_stopping (accessed 23.03.2022). (In English)

ruPrompts library. (2022) [Online]. Available at: https://github.com/ai-forever/ru-prompts (accessed 23.03.2022). (In English)

Selenium with Python. (2022) [Online]. Available at: https://selenium-python.readthedocs.io/ (accessed 20.03.2022). (In English)

References

Brown, T. B., Mann, B., Ryder, N. et al. (2020) Language models are few-shot learners. [Online]. Available at: https://arxiv.org/pdf/2005.14165.pdf (accessed 29.03.2022). (In English)

Dathathri, S., Madotto, A., Lan, J. et al. (2020) Plug and play language models: A simple approach to controlled text generation. ICLR 2020. [Online]. Available at: https://arxiv.org/pdf/1912.02164.pdf (accessed 29.03.2022). (In English)

Holtzman, A., Buys, J., Du, L. et al. (2020) The curious case of neural text degeneration. Conference paper, ICLR 2020. [Online]. Available at: https://arxiv.org/pdf/1904.09751.pdf (accessed 29.03.2022). (In English)

Keskar, N. S., McCann, B., Varshney, L. R. et al. (2019) CTRL: A conditional transformer language model for controllable generation. arXiv. [Online]. Available at: https://doi.org/10.48550/arXiv.1909.05858 (accessed 29.03.2022). (In English)

Li, C., Zhang, M., He, Y. (2022) The stability-efficiency dilemma: Investigating sequence length warmup for training GPT models. In: NeurIPS 2022: 36th Conference on Neural Information Processing Systems. [Online]. Available at: https://arxiv.org/pdf/2108.06084.pdf (accessed 29.03.2022). (In English)

Nikolich, A., Puchkova, A. (2021) Fine-tuning GPT-3 for Russian text summarization. [Online]. Available at: https://arxiv.org/pdf/2108.03502.pdf (accessed 29.05.2022). (In English)

Radford, A., Wu, J., Child, R. et al. (2019) Language models are unsupervised multitask learners. [Online]. Available at: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf (accessed 29.03.2022). (In English)

Totmina, E. (2022) Detoxification of Russian texts based on combination of controlled generation using pretrained ruGPT3 and the Delete method. In: Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference "Dialogue 2022". Vol. 21. Moscow: Russian State University for the Humanities Publ., pp. 1167-1174. http://doi.org/10.28995/2075-7182-2022-21-1158-1165 (In English)

Vaswani, A., Shazeer, N., Parmar, N. et al. (2017) Attention is all you need. In: U. von Luxburg, I. Guyon (eds.). NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems. New York: Curran Associates Inc. Publ. [Online]. Available at: arXiv:1706.03762 (accessed 29.03.2022). (In English)

Zhang, H., Song, H., Li, S. et al. (2022) A survey of controllable text generation using transformer-based pre-trained language models. Association for Computing Machinery, vol. 37, no. 4, article 111. https://doi.org/10.48550/arXiv.2201.05337 (In English)
