Research article: 'Comparative analysis of different machine translation methods'

Field: Linguistics and literary studies

CC BY
Keywords
MACHINE TRANSLATION (MT) / RULE-BASED MACHINE TRANSLATION (RBMT) / STATISTICAL MACHINE TRANSLATION (SMT) / MACHINE TRANSLATION TECHNOLOGIES

Abstract of the research article in linguistics and literary studies. Author: O.M. Shevchenko

This paper gives a brief analysis of some major machine translation methods designed to speed up the rate of multilingual text translation. Machine translation is achieved by computer software transforming text from one language to another. At present two different approaches in machine translation (MT) are used: rule-based machine translation (RBMT) and statistical machine translation (SMT). Each of them has its advantages and disadvantages. However, current MT quality still remains imperfect as the natural languages are complex and work on different levels.



COMPARATIVE ANALYSIS OF DIFFERENT MACHINE TRANSLATION METHODS

© O.M. Shevchenko*

National Technical University of Ukraine "Kyiv Polytechnic Institute", Kyiv, Ukraine


Introduction. Machine translation (MT) is the automatic translation of text from one language into another by a computerized system.

The modern world produces huge volumes of multilingual content, and we often face the problem of translating it as quickly as possible. A large amount of information from all areas of life is available to Internet users, yet the content of many interesting sites is presented only in a foreign language. To overcome the language barrier quickly, various machine translation systems are widely used today [3].

At present, automated translation can help handle the growing volume of translation work while increasing translation productivity. How does a program manage to translate text coherently from one language into another? What are the current approaches to machine translation?

There are currently two fundamentally different machine translation technologies. One is based on rules (rule-based machine translation, or RBMT) and the other on statistics (statistical machine translation, or SMT).

* Senior Lecturer, Department of English No. 3, Faculty of Linguistics; MBA (Master of Business Administration).

Both technologies have their pros and cons, supporters and opponents, and a frequently discussed question is which of them yields the higher-quality result. In this paper we consider the two main machine translation approaches and their general principles.

Rule-based MT technology.

Rule-based machine translation applies a great number of linguistic rules (algorithms) in the following sequence: analysis, transfer and generation. The program analyzes the source text and uses the results of the analysis to synthesize the translation. This method requires an extremely large lexicon with information about the morphological, syntactic and semantic structure of the language. The translation is done with the help of built-in dictionaries for a given language pair and relies on grammar rules that cover the morphological, syntactic and semantic analysis of words in both languages. On the basis of these complex sets of grammar rules, the grammatical structure of the source language is transferred into the grammatical structure of the target language [1]. The process performed by such a system resembles human reasoning in that the text is analyzed step by step using a variety of algorithms.

This method is used by most developers of machine translation systems (PROMT and ABBYY Compreno in Russia, SYSTRAN in France, the USA and South Korea, Apertium in Spain, GramTrans in the Scandinavian countries, etc.).

When a rule-based MT system translates a sentence, the source sentence normally goes through the following stages:

Stage 1: Morphological analysis.

Before translating a sentence, the program first analyzes each word in terms of morphology, i.e. it identifies gender, number, person and other morphological characteristics. At this stage the program does not resolve grammatical ambiguity; it only stores the information. The following example illustrates the general scheme of this method: 'A programmer writes a code' (source language: English, target language: Russian). In this sentence 'a' is an indefinite article; 'programmer' is a noun; 'writes' is a verb; 'a' is an indefinite article; 'code' is a noun.

After morphological analysis, the system resolves grammatical ambiguity on the contextual level, i.e. it decides which part of speech is meant when a word may belong to several.

For example, the English word 'record' can be used as a verb (to record = to write something down) or as a noun (a record = a written account of something). In the context 'to record', the system determines that the word is a verb form and assigns it the appropriate morphological characteristics.
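The two steps described above (store every possible reading first, then choose one from context) can be sketched in Python. This is a minimal illustration, not the procedure of any particular RBMT product: the toy lexicon, the feature names and the rule "article before a noun, 'to' before a verb" are all assumptions made for the example.

```python
# Toy lexicon: each word form maps to ALL of its possible readings.
# Entries and feature names are illustrative assumptions.
LEXICON = {
    "a": [{"pos": "article"}],
    "the": [{"pos": "article"}],
    "to": [{"pos": "particle"}],
    "programmer": [{"pos": "noun", "number": "singular"}],
    "writes": [{"pos": "verb", "person": 3, "number": "singular"}],
    "code": [{"pos": "noun", "number": "singular"}, {"pos": "verb"}],
    "record": [{"pos": "noun", "number": "singular"}, {"pos": "verb"}],
}


def analyze(tokens):
    """Morphological analysis: attach every candidate reading, resolve nothing."""
    return [(tok, LEXICON.get(tok.lower(), [{"pos": "unknown"}])) for tok in tokens]


def disambiguate(analyzed):
    """Contextual step: choose one reading per word from the previous word."""
    resolved = []
    prev_pos = None
    for token, readings in analyzed:
        choice = readings[0]  # default: first listed reading
        if len(readings) > 1:
            if prev_pos == "particle":    # 'to record' -> verb
                wanted = "verb"
            elif prev_pos == "article":   # 'a record' -> noun
                wanted = "noun"
            else:
                wanted = None
            for reading in readings:
                if reading["pos"] == wanted:
                    choice = reading
                    break
        resolved.append((token, choice))
        prev_pos = choice["pos"]
    return resolved


if __name__ == "__main__":
    for phrase in ("to record", "a record", "A programmer writes a code"):
        print(phrase, "->", [(t, r["pos"]) for t, r in disambiguate(analyze(phrase.split()))])
```

Keeping `analyze` free of any decision mirrors the text: ambiguity is recorded at the morphological stage and resolved only afterwards.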

Stage 2: Syntactic Analysis.

The next stage in the translation process determines the parts of the sentence and their positions, the boundaries of simple sentences, and their relationships with each other in complex sentences. First, the program searches for a predicate, then for a subject that precedes the predicate (direct word order is assumed). If there is no subject before the predicate, the system searches for it after the predicate, or it assumes that there is no subject at all, as, for example, in impersonal sentences ("It is cold") or in imperative sentences ("Switch off the computer"). In our example, the system provides syntactic information about the verb:

writes = Present Simple, 3rd person singular, Active Voice.
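The search order described for this stage (predicate first, then a subject before it, then a subject after it, otherwise no subject) can be sketched as follows. The tag names, including `verb_base` for an uninflected verb form, and the rule that a sentence-initial base-form verb marks an imperative are assumptions made for this illustration only.

```python
SUBJECT_TAGS = ("noun", "pronoun")


def find_predicate_and_subject(tagged):
    """tagged: list of (word, pos) pairs for one simple sentence.

    Returns (predicate, subject); subject is None when none is assumed.
    """
    pred_idx = next(
        (i for i, (_, pos) in enumerate(tagged) if pos.startswith("verb")), None
    )
    if pred_idx is None:
        return None, None
    predicate = tagged[pred_idx][0]

    # 1. Direct word order: nearest noun or pronoun before the predicate.
    for i in range(pred_idx - 1, -1, -1):
        if tagged[i][1] in SUBJECT_TAGS:
            return predicate, tagged[i][0]

    # 2. Sentence-initial base-form verb: treat as imperative, no subject.
    if pred_idx == 0 and tagged[0][1] == "verb_base":
        return predicate, None

    # 3. Otherwise look for the subject after the predicate.
    for word, pos in tagged[pred_idx + 1:]:
        if pos in SUBJECT_TAGS:
            return predicate, word
    return predicate, None


if __name__ == "__main__":
    sentence = [("A", "article"), ("programmer", "noun"), ("writes", "verb"),
                ("a", "article"), ("code", "noun")]
    print(find_predicate_and_subject(sentence))
```

A real system would work on phrase structures rather than single tagged words; the flat list here only makes the search order visible.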

Stage 3: Sentence Synthesis.

This is the final stage of the translation process, in which the elements within groups are coordinated: the predicate and the words that depend on it (subject, direct and/or indirect object) are arranged according to the rules of the target language, and the correct word order is applied. The program uses a set of algorithms that take into account the grammatical and other features of the particular target language. In our example, the elements of the sentence are coordinated and arranged according to the rules of the target language: En. 'A programmer' (subject) + 'writes' (predicate) + 'a code' (direct object) → Ru. 'Программист' (subject) + 'пишет' (predicate) + 'код' (direct object).
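As a rough illustration of transfer and synthesis for the example sentence, the sketch below maps the head words of subject, predicate and direct object through a three-entry bilingual dictionary and emits them in subject-predicate-object order. The dictionary, the function name and the fixed word order are assumptions made for this example; a real system would also inflect the target words for case, gender and number.

```python
# Hypothetical three-entry bilingual dictionary for the example sentence.
BILINGUAL = {
    "programmer": "программист",
    "writes": "пишет",
    "code": "код",
}


def transfer_and_generate(subject, predicate, direct_object):
    """Translate the head words and arrange them as subject + predicate + object.

    English articles are dropped by the caller because Russian has none.
    Words missing from the dictionary are passed through unchanged.
    """
    words = [BILINGUAL.get(w.lower(), w) for w in (subject, predicate, direct_object)]
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:] + "."


if __name__ == "__main__":
    print(transfer_and_generate("programmer", "writes", "code"))
```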

As a result, in spite of certain inaccuracies found in the translation, the user will understand the gist of the text translated with the help of the Rule-based MT system.

The advantages of systems based on grammar rules are fairly good grammatical and syntactic accuracy, stable results, and customizability.

However, creating such systems requires much time and huge linguistic resources, such as thousands of specialized bilingual dictionaries, as well as good knowledge of the grammar, syntax, semantics, etc. of both the source and the target language. This makes RBMT system development very time-consuming and expensive.

Statistical MT technology.

Statistical machine translation is based on statistical translation models and language models obtained from the analysis of bilingual texts. It does not use linguistic translation rules; instead it relies on a statistical calculation of the probability of a match. A bilingual corpus containing a huge amount of text in the source language, together with its human translation into the target language, is loaded into the system. The system then analyzes statistical data about interlingual matches, syntactic structures, etc. In effect, it is a self-learning system that builds on previously obtained statistical results. The larger and more varied the body of parallel text, the better the results of statistical machine translation: if you work with large databases of parallel texts, you can expect higher translation quality. Every newly translated text improves the quality of subsequent translations [2].
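The "statistical calculation of the probability of a match" can be illustrated with a relative-frequency estimate: the probability of a target phrase given a source phrase is the number of times the pair was observed divided by the number of times the source phrase was observed. The sketch below is a simplified illustration under that assumption; the observation counts are invented for the example, and real systems combine such estimates with a language model and other features.

```python
from collections import Counter, defaultdict


def estimate_translation_probs(phrase_pairs):
    """Relative-frequency estimate of P(target | source) from observed pairs."""
    pair_counts = Counter(phrase_pairs)
    source_counts = Counter(src for src, _ in phrase_pairs)
    probs = defaultdict(dict)
    for (src, tgt), n in pair_counts.items():
        probs[src][tgt] = n / source_counts[src]
    return probs


def best_translation(probs, src):
    """Return the most probable target phrase for a source phrase."""
    return max(probs[src], key=probs[src].get)


if __name__ == "__main__":
    # Invented observations standing in for pairs extracted from a parallel corpus.
    observed = (
        [("was launched", "была запущена")] * 3
        + [("was launched", "был запущен")]
    )
    table = estimate_translation_probs(observed)
    print(table["was launched"])
    print(best_translation(table, "was launched"))
```

Adding more observed pairs changes the counts and therefore the estimates, which is the sense in which each newly translated text can improve later translations.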

Statistical machine translation systems are characterized by quick setup and by the ease of adding new language pairs. Statistical MT can thus be described as the process of finding and matching equivalent pairs of phrases in the source and target languages.

During translation, a statistical MT system breaks source sentences up into phrases. Finding relevant pairs of phrases yields fewer errors in target-language sentences, because the phrases carry whole word combinations and preserve the word order of the target language.

The following example illustrates how the SMT system splits the sentence 'The space station was launched' to create pairs of phrases:

Source phrase            Target phrase
The space station        космическая станция
The space station was    космическая станция была
Station was              станция была
Station was launched     станция была запущена
Was launched             был(а) запущен(а)
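The splitting step in this example can be sketched as enumerating contiguous word sequences of the source sentence and looking each one up in a phrase table. The phrase table below contains only the five pairs from the example; the length limits of two to four words and the function names are assumptions made for the illustration.

```python
# Phrase table holding only the five pairs listed in the example above.
PHRASE_TABLE = {
    "the space station": "космическая станция",
    "the space station was": "космическая станция была",
    "station was": "станция была",
    "station was launched": "станция была запущена",
    "was launched": "был(а) запущен(а)",
}


def contiguous_phrases(sentence, min_len=2, max_len=4):
    """Enumerate all contiguous word sequences of min_len..max_len words."""
    words = sentence.split()
    return [
        " ".join(words[i:i + n])
        for n in range(min_len, max_len + 1)
        for i in range(len(words) - n + 1)
    ]


def match_phrases(sentence, table):
    """Keep only the candidate phrases that have an entry in the phrase table."""
    return {
        phrase: table[phrase.lower()]
        for phrase in contiguous_phrases(sentence)
        if phrase.lower() in table
    }


if __name__ == "__main__":
    for src, tgt in match_phrases("The space station was launched", PHRASE_TABLE).items():
        print(f"{src:24} {tgt}")
```

Choosing which of the overlapping matches to combine into a full translation is a separate search problem that this sketch does not attempt.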

The weak point of a statistical system is the lack of a mechanism for grammatical analysis of sentences in either the source or the target language. It is hard to imagine that a system which does not analyze text in terms of grammar can provide an adequate translation.

Sample English-to-Russian translations produced by four MT systems:

1. Renewable resources are those which replenish themselves naturally.

Google Translate: Возобновляемые ресурсы являются те, которые пополнять себя естественно.
Microsoft Translator: Возобновляемые ресурсы являются те, которые пополнения себя естественно.
PROMT: Возобновимые ресурсы - те, которые пополняют себя естественно.
Systran: Возобновимые ресурсы те, которые пополняют естественно.

2. We can keep this secret between ourselves.

Google Translate: Мы можем держать это в тайне между нами.
Microsoft Translator: Мы можем хранить эту тайну между собой.
PROMT: Мы можем держать это в секрете между нами.
Systran: Мы можем держать этот секрет между собой.

3. I thought I was going to miss my train, so I rushed to the station.

Google Translate: Я думал, что буду скучать по мой поезд, так что я бросился к станции.
Microsoft Translator: Я думал, что собираюсь пропустить мой поезд, так я бросился на вокзал.
PROMT: Я думал, что собрался опоздать на свой поезд, таким образом, я помчался к станции.
Systran: Я думал, что я шло опоздать к моему поезду поэтому я поспешил к станции.

4. There is plenty of time to make up your mind.

Google Translate: Существует много времени для вас, чтобы сделать свой ум.
Microsoft Translator: Есть много времени для вас, чтобы сделать свой ум.
PROMT: Есть много времени для Вас, чтобы решиться.
Systran: Время множества для вас составить ваш разум

Conclusion. We have analyzed four machine translation systems that use two different approaches. Two of them are based on the statistical MT method (Google Translate and Microsoft Translator) and two on the rule-based MT method (PROMT and Systran). Currently, all automated systems performing online translation, irrespective of the method used, are still far from perfect.

References:

1. Costa-Jussa M.R., Farrus M. Study and Comparison of Rule-Based and Statistical Catalan-Spanish Machine Translation Systems (2012). Available at: http://www.cai.type.skcontent/2012/2/study-and-comparison-of-rule-based-and-statistical-catalan-spanish-machine-translation-systems/1007.pdf (Accessed on 25th April 2015).

2. Koehn P., Och F.J., Marcu D. Statistical phrase-based translation (2003). Available at: http://aclweb.org/anthology/N/N03/N03-1017.pdf (Accessed on 25th April 2015).

3. Potapova K.K. Novye informatsionnye tekhnologii i lingvistika [New information technologies and linguistics]. Informatsiya, 2002. pp. 368-370.
