
УДК 81.00 ББК 81.00


T.E. Zagibalov

FEATURES FOR CHINESE SENTIMENT CLASSIFICATION

The paper considers basic processing units for Chinese natural language processing applied to automatic sentiment classification. Experimental results suggest that 1) preliminary word segmentation does not improve performance and 2) neither words nor characters on their own are good units of processing for NLP tasks in the Chinese language.

Key words: Chinese language; natural language processing; sentiment analysis; basic processing unit

There are some distinctive characteristics of the Chinese language that are known to affect language processing. This paper presents an investigation of these in connection with sentiment classification. The paper first outlines problems with conceptualising Chinese text as comprising a sequence of 'words'. In particular, the problem of automatically segmenting text into words is discussed and tested in an experiment. The difficulty of splitting Chinese text into words raises the issue of what kind of basic unit of processing to use in sentiment analysis. The paper then describes the kinds of units experimented on and the data for the experiments, as well as the basic concepts, algorithms and evaluation metrics. Finally, the paper reports experiments in sentiment classification and discusses the results.

The 'Word' in Chinese Language Processing

One of the central problems in Chinese NLP in general, and in Chinese sentiment analysis in particular, is what the basic unit of processing should be. The problem is caused by a distinctive feature of the Chinese language: the absence of orthographically marked word boundaries, while it is widely assumed that the word is of central importance for computational language processing. The absence of word delimiters cannot be remedied simply by using dictionary lookup (or any other method) to segment a text into words, because the language has a rather specific structure: a single vocabulary word (e.g. 吃饭 'to eat') can include a part with no separate meaning, as in examples (1-a) and (1-b), but the same 'meaningless' part may be a separate word in other cases (see examples (2-a) and (2-b)).

1-a
他 吃饭
he eat (food)
'He is eating.'

1-b
他 吃 半 小时 的 饭
he eat half hour DE food
'He has been eating for half an hour.'

2-a
他 吃 好 饭
he eat good food
'He is eating good food.'

2-b
饭 他 得 吃
food he must eat
'He must eat food.'

Example (1-a) demonstrates that the character sequence 吃饭 (to eat, lit. eat food) is one unit: a vocabulary word which is not to be segmented into smaller units. The same word is split in (1-b), but the second part still does not have a separate meaning and serves as a way of introducing an adverbial phrase. In example (2-a), however, the second character is not only separated from the first one but becomes a word in its own right: a noun with a preceding adjective. In the last example (2-b) the word 饭 (food) is a topicalized object and is clearly used as a separate word.

The example above is not an exception, but an instance of a very frequent morphological phenomenon in Chinese. One of the characteristics of Chinese morphology is that in many cases words are built in the same way as phrases, so that words and phrases can share the same structure. One of the most widely used patterns is VERB + OBJECT, as in the example above, which is also used for phrases consisting of separate words. Such patterns are very productive, which results in a potentially endless number of phrase-like words.

This characteristic of the language makes it difficult even for human beings to segment texts into separate 'words'. [Tsai, 2001] and [Hoosain, 1991] show that segmentation is not part of how native speakers of Chinese understand written texts. They found that a segmented text was more difficult for native Chinese speakers to read, as evidenced by a significant slowdown in reading. Tsai also describes an experiment in which Chinese speakers had to break a text into words; the results showed substantial disagreement on where to divide the characters into words.

Preliminary Word Segmentation of Chinese Texts

Even in cases where a human can segment words quite easily, segmentation might be very difficult for a computer. A major problem is caused by segmentation ambiguity, of which there are two types [Guo, 1997; Liang, 1987]: overlapping ambiguity, e.g. 大学 生活 (university life) vs. 大学生 活 ((a) university student lives), as shown in examples (3-a) and (3-b); and hidden ambiguity, e.g. 个人 (individual) vs. 个 人 (GE person), as shown in examples (4-a) and (4-b).

3-a
大学 生活 很 有趣
university life very interesting
'University life is very interesting.'

3-b
大学生 活 不 下去 了
student life not continue LE (sentence-final particle)
'University students can no longer make a living.'

4-a
个人 的 力量
individual DE power
'the power of an individual'

4-b
三 个 人 的 力量
three GE person DE power
'the power of three persons'

These examples show that automatic segmentation requires understanding of context even in such 'easy' cases, which makes complete segmentation a very difficult task. Nevertheless, many researchers report good results for the segmenters they have developed. This can be explained by the fact that in many word segmentation experiments researchers have adopted their own subjective understanding of what a word in Chinese is, so that training and test corpora are tagged not according to objective criteria but according to ones the research community has agreed upon. [Xue, 2003] comments: «In practice, noting the difficulty in defining wordhood, researchers in automatic word segmentation of Chinese text generally adapt their own working definitions of what a word is, or simply rely on native speakers' subjective judgements. The problem with native speakers' subjective judgements is that native speakers generally show great inconsistency in their judgements of wordhood, as should perhaps be expected given the difficulty of defining what a word is in Chinese».

This problem is also crucial for sentiment analysis, since some sort of basic unit needs to be defined in order for sentiment information to be associated with it. Many NLP researchers working with Chinese use an initial segmentation module intended to break a text into 'words' before it is subjected to further processing. Although this can facilitate the use of subsequent computational techniques, there is no clear definition of what a 'word' is in the Chinese language, so the use of such segmenters is of dubious theoretical status; indeed, good results have been reported for systems which do not carry out such pre-processing [Xu et al., 2004; Foo and Li, 2001].

Preliminary Segmentation Experiment

To measure the impact that preliminary segmentation has on sentiment classification of Chinese documents, I compared the performance of two supervised classifiers, Naive Bayes multinomial (NBm) and Support Vector Machine (SVM), using the entries of a sentiment dictionary as features. In the first series of experiments the corpus was split into words (segmented), whereas in the second the features were extracted directly from the text without preliminary segmentation. All the experiments used 10-fold cross-validation.
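A minimal sketch of this setup, assuming scikit-learn implementations of the two classifiers (the paper does not name its tooling, and docs, labels and ntusd below are placeholder variables):

    # Sketch only: scikit-learn is an assumed toolkit, not the paper's.
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC

    def dictionary_feature_counts(text, dictionary):
        """Count occurrences of each dictionary entry directly in the text;
        for the 'segmented' runs the same counting is applied to the
        word-segmented version of the text instead."""
        return [text.count(entry) for entry in dictionary]

    # docs: list of review strings; labels: list of 0/1 sentiment tags;
    # ntusd: list of NTUSD entries. All three are placeholders.
    X = [dictionary_feature_counts(doc, ntusd) for doc in docs]
    for clf in (MultinomialNB(), LinearSVC()):
        scores = cross_val_score(clf, X, labels, cv=10)  # 10-fold cross-validation
        print(type(clf).__name__, round(scores.mean(), 4))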

Sentiment Dictionary

For this and all subsequent experiments I used the NTU sentiment dictionary (NTUSD) [Ku et al., 2005]. The dictionary has 2,809 items in the 'positive' part and 8,273 items in the 'negative' part. For these experiments, the dictionary was converted from Traditional Chinese (Big5 encoding) into Simplified Chinese (UTF-8 encoding) and all duplicate entries were removed, which resulted in 2,598 items in the 'positive' part and 7,692 items in the 'negative' part.
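The re-encoding and de-duplication might look like this (file names are hypothetical; note that re-encoding Big5 as UTF-8 does not by itself map traditional glyphs to simplified ones, so that mapping needs a separate conversion table or tool):

    # Hypothetical file names; Python's codecs handle the Big5 -> UTF-8 step.
    with open("ntusd_positive.big5", encoding="big5") as src:
        entries = [line.strip() for line in src if line.strip()]
    unique_entries = sorted(set(entries))  # remove duplicate entries
    with open("ntusd_positive.utf8", "w", encoding="utf-8") as dst:
        dst.write("\n".join(unique_entries))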

Test Corpus

All experiments were carried out on a corpus of product reviews downloaded from the web-site IT 168. All the reviews were tagged by their authors as either positive or negative. Most reviews consist of two or three parts: positive opinion, negative opinion and comments (other), though some reviews have only one part. After duplicate reviews were removed, the final version of the corpus comprised 29,531 reviews, of which 23,122 were positive (78 %) and 6,409 negative (22 %). The corpus covered 10,631 different products in 255 product categories; most of the reviewed products are items of either software or consumer electronics.

From manual inspection it appeared that some users misused the sentiment tagging facility on the web-site, and quite a few reviews were tagged erroneously. However, the parts of the reviews were tagged much more accurately, so I used only the relevant (negative or positive) review parts as the documents in the corpus. The final version of the corpus included only the first 10,000 reviews, whose parts were extracted to make a balanced test corpus. As the corpus covered 10 thematic domains (mostly electrical appliances such as digital cameras, mobile phones and computers), I also balanced each of these domains. The resulting corpus contains 8,140 reviews, of which 4,073 are positive and 4,067 are negative.
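A sketch of the balancing step (the paper does not spell out the exact procedure; this version simply keeps equally many positive and negative review parts within each domain):

    import random
    from collections import defaultdict

    def balance_corpus(parts, seed=0):
        """parts: list of (domain, label, text) tuples, label in {'pos', 'neg'}.
        For each domain, sample equal numbers of positive and negative parts."""
        random.seed(seed)
        grouped = defaultdict(list)
        for domain, label, text in parts:
            grouped[(domain, label)].append(text)
        balanced = []
        for domain in {d for d, _ in grouped}:
            pos, neg = grouped[(domain, "pos")], grouped[(domain, "neg")]
            n = min(len(pos), len(neg))
            balanced += [(domain, "pos", t) for t in random.sample(pos, n)]
            balanced += [(domain, "neg", t) for t in random.sample(neg, n)]
        return balanced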

Segmenter

To split the corpus into words I used a publicly available segmenter implemented by [Peterson, 1999]. The segmenter uses a 138,000-word vocabulary and a version of the maximal matching algorithm: when looking for words, it attempts to match the longest word possible. This simple algorithm is surprisingly effective given a large and diverse lexicon: its segmentation accuracy can be expected to lie around 95 % [Wong and Chan, 1996], although one should note the methodological and language-specific issues discussed above.
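The maximal matching idea can be sketched in a few lines (an illustration of the algorithm, not [Peterson, 1999]'s actual code; the max_word_len cut-off is an assumption):

    def max_match(text, vocabulary, max_word_len=8):
        """Greedy maximal matching: at each position take the longest
        vocabulary word starting there, falling back to a single character."""
        words, i = [], 0
        while i < len(text):
            for j in range(min(len(text), i + max_word_len), i, -1):
                if text[i:j] in vocabulary or j == i + 1:
                    words.append(text[i:j])
                    i = j
                    break
        return words

Note that on example (3-a) such a greedy matcher would take 大学生 before 大学, producing exactly the overlapping-ambiguity error discussed above.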

Table 1
Results of sentiment classification of product reviews from the web-site IT 168, with and without segmentation. The features are NTU sentiment dictionary items.

                         Accuracy   Precision   Recall   F-Measure
    NBm (segmented)      83.59      0.84        0.84     0.84
    NBm (not segmented)  85.61      0.86        0.86     0.86
    SVM (segmented)      81.67      0.83        0.82     0.82
    SVM (not segmented)  85.50      0.86        0.86     0.86

The results presented in Table 1 show that segmenting the corpus into words hurt performance for both classifiers, suggesting that preliminary segmentation may negatively affect a sentiment classifier.

Words and Characters as Features for Sentiment Classification

In the absence of preliminary word segmentation, there are two possible types of feature that could be used in Chinese sentiment classification: (vocabulary) words and characters. This section reports experiments on these two types. The experiments evaluate various techniques that can facilitate classification, including a simple negation check, as there is no general agreement as to whether this feature is useful for sentiment classification. This section also describes and tests an approach which divides the text into zones.

Processing based on words and on characters is tested separately and in combination. The latter approach is inspired by results published by [Nie et al., 2000], who found that for Chinese processing (IR in particular) the most effective features were a combination of dictionary lookup (using the longest-match algorithm) with single-character unigrams. [Yuen et al., 2004] showed that Chinese characters constitute a distinct sub-lexical unit which, though having a smaller number of distinct types, has greater linguistic significance than words. Their experiments on sentiment classification of words by means of characters proved effective, achieving a precision of 80.23 % and a recall of 85.03 % with only 20 characters.
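A sketch of how such combined features could be extracted (the combination scheme here, longest-match dictionary items plus every character as a unigram, is an assumption based on the description in [Nie et al., 2000]; max_match is the helper sketched earlier):

    def word_and_char_features(text, sentiment_dictionary):
        """Dictionary items found by longest match, plus all
        single-character unigrams of the text."""
        words = [w for w in max_match(text, sentiment_dictionary)
                 if w in sentiment_dictionary]
        return words + list(text)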

Basic Concepts

Frequency. The sentiment score (see below) is based on a basic unit's relative (normalised) frequency:

F_a = N_a / N    (1)

where N_a is the number of times the unit a occurred in a collection of documents and N is the total number of basic units (lexical units or characters, as appropriate) in the collection of documents.

Sentiment score

Each word (dictionary item) occurring in the positive side of the dictionary is assigned a positive sentiment score of 1 and a negative sentiment score of 0, and vice versa for words in the negative side.

• Word score. Words have a score of 1 for the class (sentiment) in which they are present and 0 for the class in which they are not present.

• Character scores. The characters for the experiments are extracted from the NTU sentiment dictionary. Most of the characters occur in both sides of the dictionary, positive and negative. The score for a character with respect to sentiment i (positive or negative) is:

S_i = F_i / (F_i + F_j)    (2)

where F_i is the unit's relative frequency in the document collection of sentiment i and F_j is the character's relative frequency in the opposite side of the dictionary.

• Document score. The score of a document is calculated as the sum of the scores of the units it contains.
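Taken together, formulas (1) and (2) and the document score can be sketched as follows (the exact form of (2) is reconstructed above from its verbal description, and this sketch inherits that assumption):

    from collections import Counter

    def relative_frequencies(units):
        """Formula (1): F_a = N_a / N over the basic units of a collection."""
        counts = Counter(units)
        total = sum(counts.values())
        return {u: n / total for u, n in counts.items()}

    def character_score(char, freq_same_side, freq_opposite_side):
        """Formula (2): a character's score for sentiment i, from its relative
        frequencies on the two sides of the dictionary."""
        f_i = freq_same_side.get(char, 0.0)
        f_j = freq_opposite_side.get(char, 0.0)
        return f_i / (f_i + f_j) if f_i + f_j > 0 else 0.0

    def document_score(units, unit_scores):
        """Document score: the sum of the scores of the units it contains."""
        return sum(unit_scores.get(u, 0.0) for u in units)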

Experimental Data and Classification Algorithm

The experiments in the remainder of this paper use the same sentiment dictionary and test corpus as the preliminary segmentation experiments above.

Basic Classification Algorithm

Classification is done by summing up the sentiment scores of all the classification units found in a document. Since there are two classes (positive and negative) the algorithm does this twice to obtain positive and negative scores for a document, which are then compared to make a decision about its sentiment (see Algorithm 1).

Algorithm 1 Basic Sentiment Classifier

Require: a list of basic units a, each with sentiment scores S_a,pos and S_a,neg, and a collection of documents D
Return: sentiment tags for all classified documents in D
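In code, Algorithm 1 reduces to two score sums and a comparison (a minimal sketch; treating ties as unclassified is an assumption, though it is one way to produce the partial coverage reported below):

    def classify(document_units, pos_scores, neg_scores):
        """Algorithm 1: sum the positive and negative sentiment scores of the
        units in a document and compare the two totals."""
        pos = sum(pos_scores.get(u, 0.0) for u in document_units)
        neg = sum(neg_scores.get(u, 0.0) for u in document_units)
        if pos > neg:
            return "positive"
        if neg > pos:
            return "negative"
        return None  # tie (e.g. no known units): document left unclassified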

Evaluation Metrics and Statistical Significance Test

Accuracy. Since the product review test corpus is balanced with respect to positive and negative documents, I chose accuracy as the evaluation metric for all the experiments. I present accuracy for the whole corpus as well as for each class:

accuracy = number of documents classified correctly / total number of documents

Coverage. To measure what proportion of the test data was classified (regardless of correctness), I use coverage:

coverage = number of documents classified / total number of documents

Classification skew. Sentiment classification in the experiments presented here can be split into two subtasks: finding positive documents and finding negative documents. Both subtasks can be evaluated by accuracy, and it is important to consider positive and negative classification accuracy separately, as overall accuracy does not reflect subtask performance: for example, one classifier may have accuracies of 0.50 and 1.00 for the two classes, giving an overall accuracy of 0.75, while another may have 0.76 and 0.74 with the same overall accuracy. Despite equal overall accuracy, the second classifier is performing much better.

Precision. I also use precision for evaluation of classification performance:

precision = number of documents classified correctly / total number of documents classified

Statistical significance. I use the paired t-test to test whether the results of any two experiments are significantly different at the 95 % level.
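The three document-level metrics can be written directly from these definitions (a sketch; gold and predicted are hypothetical parallel lists of tags, with None marking unclassified documents):

    def evaluate(gold, predicted):
        """Accuracy, coverage and precision as defined above."""
        total = len(gold)
        classified = sum(1 for p in predicted if p is not None)
        correct = sum(1 for g, p in zip(gold, predicted)
                      if p is not None and p == g)
        return {
            "accuracy": correct / total,
            "coverage": classified / total,
            "precision": correct / classified if classified else 0.0,
        }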

Unigram-Based Classification

Unigram-based classification computes the sum of all the sentiment scores of the basic-unit instances found in a document. In the experiments presented here I test the performance of characters, words, and the combination of words and characters for sentiment classification.

Table 2
Results of unigram-based sentiment classification using different types of features.

                         --------- Accuracy ---------
    Basic Unit Kinds     Overall   Positive   Negative   Precision   Coverage
    Chars                0.68      0.82       0.54       0.68        1.00
    Words                0.68      0.71       0.66       0.87        0.79
    Words and Chars      0.72      0.84       0.59       0.72        1.00

Character-Based Classification Performance

Table 2 shows that the character-based classifiers performed reasonably well. However, the results are highly unbalanced and tend to be more accurate on positive documents. The results are skewed towards the positive class because all characters have scores based on their normalised frequency in the appropriate side of the sentiment dictionary. The sum of the scores of all characters in the positive side is 1,803.05, against 2,016.16 for the negative side, a ratio of 1:1.12. Bearing in mind that there are twice as many negative characters as positive ones, on average an item in the positive part of the list has a score almost twice as big as that of an average item in the negative part (2 × 1,803.05 / 2,016.16 ≈ 1.8).

Word-Based Classification Performance

The word-based classifier performed at the same level as the character-based classifier: although it produced a more balanced classification, the t-test showed no significant difference between the two. It should be noted, though, that the word-based classifier used only binary scores. A particular disadvantage of the word-based classifier is its low coverage: 21 % of all documents were left unclassified. In terms of precision, however, the word-based classifier performed much better than any other classifier.

Word and Character Combination Performance

The best result in this test was achieved by combining words and characters: the combination achieved an overall accuracy of 0.72, significantly better than the character-only classifier. On the other hand, the combination still inherited the degree of skew of the character-based classifier.

Discussion

The experiments described above tested two kinds of basic units for sentiment classification, characters and words, applying them separately and in combination. The main purpose of the experiments was to find the best kind of basic units.

The character-based classifier achieved an accuracy of 0.68. The main reason for this is that characters do not usually form semantically independent units (unlike words and phrases) and often have rather vague and ambiguous meanings. This was reflected in their distribution across the sentiment classes: the most frequent characters were present in both classes, and so the presence-based score could not contribute to classification.

The performance of the word-based classifier was also relatively high (about 0.68). The drawback of the word-based classifier is its relatively low coverage: up to 23 % of documents were not classified in the classification experiments. The low coverage might be a result of the more domain-dependent nature of words: although the list of sentiment words is quite large, it does not include all the words used in the corpus to express attitude, since many of these words have sentiment-related meaning only in the context of a particular topic. However, the high precision (up to 0.88) indicates the importance of capturing a bigger context: words are longer than characters and cover bigger portions of text. Indeed, many of the 'words' are actually sentiment-bearing phrases which cover all relevant context.

Although the coverage of the word-based classifier was not high, it achieved a very high precision compared to the other classifiers (see Table 2). This can be attributed to the more context-dependent nature of the word as compared to the character.

When combined, words and characters performed relatively well, showing the best features of both: accuracy was never too bad, and coverage was fairly good. The combination of characters and words was able to classify many more documents than the word-based classifier. It is also worth noting that all character-based classifiers benefited from combination with words and performed better in all the tests.

The results obtained from the experiments indicate that the best classifier is one based on the combination of words and characters applied to non-segmented text. This, in turn, suggests that neither words nor characters on their own can be considered as the most suitable units for Chinese NLP (or at least for sentiment classification).

References

1. Tsai, Chih-Hao. Word Identification and Eye Movements in Reading Chinese: A Modeling Approach. PhD thesis, University of Illinois at Urbana-Champaign, 2001.

2. Hoosain, Rumjahn. Psycholinguistic Implications for Linguistic Relativity: A Case Study of Chinese. Lawrence Erlbaum Associates, Mahwah, NJ, 1991.

3. Liang, Nanyuan. A written Chinese automatic word segmentation system. Journal of Chinese Information Processing, 1(2), 1987. P. 44-52.

4. Guo, Jin. Chinese Language Modeling for Speech Recognition. PhD thesis, National University of Singapore, 1997.

5. Xue, Nianwen. Chinese word segmentation as character tagging. Computational Linguistics and Chinese Language Processing, 8(1), 2003. P. 29-48.

6. Foo, Schubert and Li, Hui. Chinese word segmentation accuracy and its effects on information retrieval. TEXT Technology, 2001. P. 1-11.

7. Xu, Jia, Zens, Richard and Ney, Hermann. Do we need Chinese word segmentation for statistical machine translation? In Proceedings of the Third SIGHAN Workshop on Chinese Language Processing. Boston, MA, 2004. P. 257-264.

8. Ku, Lun-Wei, Wu, Tung-Ho, Lee, Li-Ying and Chen, Hsin-Hsi. Construction of an evaluation corpus for opinion extraction. In Proceedings of the Fifth NTCIR Workshop Meeting on Evaluation of Information Access Technologies: Information Retrieval, Question Answering and Cross-Lingual Information Access. Tokyo, Japan, 2005.

9. Peterson, Erik. A Chinese Named Entity Extraction System. In Proceedings of the 8th Annual Conference of the International Association of Chinese Linguistics. Melbourne, Australia, 1999.

10. Wong, Pak-kwong and Chan, Chorkin. Chinese word segmentation based on maximum matching and word binding force. In Proceedings of the 16th International Conference on Computational Linguistics. Copenhagen, Denmark, 1996. P. 200-203.

11. Nie, Jian-Yun, Gao, Jianfeng, Zhang, Jian and Zhou, Ming. On the use of words and n-grams for Chinese information retrieval. In Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages. 2000. P. 148-156.

12. Yuen, Raymond W.M., Chan, Terence Y.W., Lai, Tom B.Y., Kwong, O.Y. and T'sou, Benjamin K.Y. Morpheme-based derivation of bipolar semantic orientation of Chinese words. In Proceedings of the 20th International Conference on Computational Linguistics. Geneva, Switzerland, 2004. P. 1008-1016.
