
Intelligent Systems and Technologies

DOI: 10.18721/JCSTCS.11205 UDC 004

SEQUENCE-TO-SEQUENCE BASED ENGLISH-CHINESE TRANSLATION MODEL

Tian Zhaolin, Zhang Weiwei

Peter the Great St. Petersburg Polytechnic University, St. Petersburg, Russian Federation

In recent years, with the continuous improvement of artificial intelligence theory, artificial neural networks have become novel tools for machine translation. Compared with traditional Statistical Machine Translation (SMT), neural network based Neural Machine Translation (NMT) surpasses SMT in many aspects such as translation accuracy, long-distance reordering, syntax, tolerance to noisy data, etc. In 2014, with the emergence of sequence-to-sequence (seq2seq) models and the introduction of attention mechanisms into them, NMT was further refined and its performance steadily improved. This article uses the currently popular sequence-to-sequence model to construct a neural machine translation model from English to Chinese. In addition, this paper uses Long Short-Term Memory (LSTM) in place of the traditional RNN in order to solve the vanishing and exploding gradient problems that the traditional RNN faces with long-distance dependences. The attention mechanism is also introduced into the model: it allows the neural network to pay more attention to the relevant parts of the input sequences and less to the unrelated parts when performing prediction tasks. In the experimental part, this article uses TensorFlow to build the NMT model described here.

Keywords: NMT, seq2seq, LSTM, attention mechanism, encoder-decoder, TensorFlow.

Citation: Tian Zhaolin, Zhang Weiwei. Sequence-to-sequence based English-Chinese translation model. St. Petersburg State Polytechnical University Journal. Computer Science. Telecommunications and Control Systems, 2018, Vol. 11, No. 2, Pp. 55-63. DOI: 10.18721/JCSTCS.11205


1. Introduction

Introducing more reforms and implementing the One Belt One Road strategy, China is increasingly participating in international affairs. However, due to the peculiarities of the Chinese language, it is difficult for non-native speakers to master Chinese in a short period of time. In addition, due to the differences in grammatical logic between Chinese and Western languages such as English and Russian, traditional statistical machine translation often fails to achieve satisfactory results. As more and more Chinese people travel around the world, and people in other countries take a growing interest in China, Chinese, the world's most spoken language, and English, the world's most widely used language, inevitably intersect.

Continuous improvement of the relevant theories of artificial intelligence in the field of machine translation and the growing availability of high-performance hardware in the 21st century have paved the way for the large-scale application of artificial neural networks in machine translation and created a rare opportunity for the further development of neural machine translation. In 2013, Kalchbrenner and Blunsom proposed an end-to-end encoder-decoder model for machine translation. However, the traditional RNN used in the decoder suffers from vanishing and exploding gradients, which makes it difficult for the model to handle long-distance dependences in practice. Also in 2013, Graves et al. applied deep bi-directional LSTM to speech recognition, paving the way for deeper applications of bi-directional LSTM in Neural Machine Translation (NMT). In 2014, Cho et al. proposed a new sequence-to-sequence model and used LSTM (actually a variation of RNN) instead of the traditional RNN as encoder and decoder. In the same year, Bengio et al. introduced the attention mechanism into NMT so that the neural network can pay more attention to the relevant parts of the input sequences and less attention to the unrelated parts when performing prediction tasks.

This paper uses a mature seq2seq model to construct a translation model from English to Chinese. The structure of this paper is as follows. The second section introduces the data source and data preprocessing. The third section describes in detail the encoder, the attention mechanism and the decoder of the seq2seq model. The fourth section presents the experimental results, the approximate implementation of the model, and the evaluation of the model. The fifth section gives recommendations for further research.

2. Data Source and Data Preprocessing

The data used in this paper comes from the United Nations Parallel Corpus [1]. The English-Chinese parallel corpus contains almost fifteen million sentence pairs, produced and manually translated between 1990 and 2014, with sentence-level alignments.

Before building our seq2seq model, we need to preprocess the parallel corpus as follows (a minimal sketch of these steps is given after the list).

• Handling the training/testing datasets: extract 100,000 sentences from the parallel corpus as the testing dataset; the remaining sentences form the training dataset.

• Handling source sentences: add a "BOS" token at the beginning and an "EOS" token at the end of each sentence.

• Handling dictionaries: generate two dictionaries, one for Chinese and one for English, based on the training dataset.

• Handling unknown words: if a word from the testing dataset does not exist in these two dictionaries, replace it with "UNK".

• Handling input sequences: generate one-hot vectors from the original sentence using the two dictionaries, then combine these one-hot vectors into input sequences.
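To make these steps concrete, the following minimal Python sketch shows one possible implementation of the preprocessing described above. It is only an illustration under stated assumptions: the helper names (build_vocab, encode_sentence, one_hot), the vocabulary size limit and the toy sentences are ours, not part of the original pipeline.

from collections import Counter

BOS, EOS, UNK = "BOS", "EOS", "UNK"

def build_vocab(tokenized_sentences, max_size=40000):
    # Build a word -> index dictionary from the training side of the corpus.
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    words = [BOS, EOS, UNK] + [w for w, _ in counts.most_common(max_size)]
    return {w: i for i, w in enumerate(words)}

def encode_sentence(tokens, vocab):
    # Wrap a sentence with BOS/EOS and map out-of-vocabulary words to UNK.
    tokens = [BOS] + tokens + [EOS]
    return [vocab.get(t, vocab[UNK]) for t in tokens]

def one_hot(indices, vocab_size):
    # Turn a list of word indices into a list of one-hot vectors.
    vectors = []
    for idx in indices:
        v = [0.0] * vocab_size
        v[idx] = 1.0
        vectors.append(v)
    return vectors

# Toy usage: "parallel" is not in the vocabulary and is therefore mapped to UNK.
en_vocab = build_vocab([["the", "united", "nations"], ["the", "corpus"]])
ids = encode_sentence(["the", "parallel", "corpus"], en_vocab)
inputs = one_hot(ids, len(en_vocab))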

3. The Model

Encoder. Assuming an input sequence x = (x_1, \ldots, x_T), a traditional recurrent neural network (RNN) computes the hidden state vector h = (h_1, \ldots, h_T) and the output y = (y_1, \ldots, y_T) by iterating the following equations from t = 1 to T:

h_t = H(W_{xh} x_t + W_{hh} h_{t-1} + b_h),   (1)

y_t = W_{hy} h_t + b_y,   (2)

where W_{xh} denotes the input-hidden weight matrix, W_{hh} the hidden-hidden weight matrix, W_{hy} the hidden-output weight matrix, b_h the hidden bias vector, b_y the output bias vector, and H the hidden layer function.

However, Long Short-Term Memory (LSTM) [2] has an advantage in dealing with long-distance dependences thanks to its gate mechanism, so we use the LSTM cell proposed by Gers et al. in 2002 [3]. In our model, H is therefore implemented by the following equations:

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i),   (3)

f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f),   (4)

c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c),   (5)

o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o),   (6)

h_t = o_t \tanh(c_t),   (7)

where \sigma denotes the logistic sigmoid function, and i, f, o, c denote the input gate, the forget gate, the output gate and the cell activation vectors respectively, all of which have the same size as the hidden vector h.
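For clarity, Eqs. (3)-(7) can be written out directly in NumPy. The sketch below is a single LSTM step with the peephole connections W_ci, W_cf, W_co treated as diagonal (element-wise) weights, as in the Gers et al. formulation; the weight-dictionary layout and the initialization are our own assumptions for the example, not the TensorFlow implementation used later in the paper.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_weights(input_size, hidden_size, seed=0):
    # Illustrative random initialization of all LSTM parameters.
    rng = np.random.default_rng(seed)
    W = {}
    for gate in ("i", "f", "c", "o"):
        W["x" + gate] = rng.normal(scale=0.1, size=(hidden_size, input_size))
        W["h" + gate] = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
        W["b" + gate] = np.zeros(hidden_size)
    for gate in ("i", "f", "o"):  # diagonal peephole weights
        W["c" + gate] = rng.normal(scale=0.1, size=hidden_size)
    return W

def lstm_step(x_t, h_prev, c_prev, W):
    # One time step of the LSTM cell H, following Eqs. (3)-(7).
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + W["ci"] * c_prev + W["bi"])    # Eq. (3)
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + W["cf"] * c_prev + W["bf"])    # Eq. (4)
    c_t = f_t * c_prev + i_t * np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + W["bc"])  # Eq. (5)
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + W["co"] * c_t + W["bo"])       # Eq. (6)
    h_t = o_t * np.tanh(c_t)                                                        # Eq. (7)
    return h_t, c_t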

A disadvantage of traditional recurrent neural networks is that they cannot take advantage of subsequent context. In this paper we use a bidirectional recurrent neural network [4], which processes the input sequence in both directions with two separate hidden layers that then feed forward to the same output layer. The forward hidden state \overrightarrow{h}, the backward hidden state \overleftarrow{h} and the output sequence y are computed as follows:

\overrightarrow{h}_t = H(W_{x\overrightarrow{h}} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}}),   (8)

\overleftarrow{h}_t = H(W_{x\overleftarrow{h}} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}}),   (9)

y_t = W_{\overrightarrow{h}y} \overrightarrow{h}_t + W_{\overleftarrow{h}y} \overleftarrow{h}_t + b_y.   (10)

Here we can also use the LSTM cell described above in place of the traditional recurrent cell [5, 6]. As a result, bidirectional long short-term memory is the basic structure of the model in this paper.
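Reusing the lstm_step function from the previous sketch, a bidirectional pass can be illustrated as follows. This is again only a schematic version of Eqs. (8)-(10) with H implemented by the LSTM; the parameter names and shapes are assumed for the example.

import numpy as np

def bidirectional_pass(xs, W_fwd, W_bwd, hidden_size):
    # Run the input sequence in both directions and collect the hidden states.
    T = len(xs)
    h_f = np.zeros(hidden_size); c_f = np.zeros(hidden_size)
    h_b = np.zeros(hidden_size); c_b = np.zeros(hidden_size)
    fwd, bwd = [None] * T, [None] * T
    for t in range(T):                  # forward direction, Eq. (8)
        h_f, c_f = lstm_step(xs[t], h_f, c_f, W_fwd)
        fwd[t] = h_f
    for t in reversed(range(T)):        # backward direction, Eq. (9)
        h_b, c_b = lstm_step(xs[t], h_b, c_b, W_bwd)
        bwd[t] = h_b
    # Eq. (10) then combines fwd[t] and bwd[t] through the output weight matrices.
    return fwd, bwd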

Furthermore, it has been shown that a deep neural network usually performs better than a single-layer one. In our case, it is entirely possible to stack several layers of bidirectional RNN to obtain a deep bidirectional RNN [9]. Assuming all the hidden layers share the same function H, the hidden states of the n-th layer are computed as follows:

\overrightarrow{h}_t^n = H(W_{\overrightarrow{h}^{n-1}\overrightarrow{h}^n} \overrightarrow{h}_t^{n-1} + W_{\overrightarrow{h}^n\overrightarrow{h}^n} \overrightarrow{h}_{t-1}^n + b_{\overrightarrow{h}}^n),   (11)

\overleftarrow{h}_t^n = H(W_{\overleftarrow{h}^{n-1}\overleftarrow{h}^n} \overleftarrow{h}_t^{n-1} + W_{\overleftarrow{h}^n\overleftarrow{h}^n} \overleftarrow{h}_{t+1}^n + b_{\overleftarrow{h}}^n).   (12)

If we define \overrightarrow{h}^0 = \overleftarrow{h}^0 = x, then the output of the network y_t is:

y_t = W_{\overrightarrow{h}^N y} \overrightarrow{h}_t^N + W_{\overleftarrow{h}^N y} \overleftarrow{h}_t^N + b_y.   (13)

Attention Mechanism. In this paper, we implement a global attention mechanism [7, 8] in our model. First, we take the hidden state h_t at the top layer of the deep LSTM and generate a probability distribution based on the context vector c_t to help predict the current target word y_t:

\tilde{h}_t = \tanh(W_c [c_t; h_t]),   (14)

p(y_t \mid y_{<t}, x) = \mathrm{softmax}(W_s \tilde{h}_t),   (15)

where \tilde{h}_t denotes the attentional vector.

The idea of the global attention mechanism is to consider all hidden states of the encoder when calculating the context vector c_t. In this mechanism, by comparing the current target hidden state h_t with every source hidden state h_s, we obtain a variable-length alignment vector a_t, whose size equals the number of time steps on the source side:

a_t(s) = \frac{\exp(\mathrm{score}(h_t, h_s))}{\sum_{s'} \exp(\mathrm{score}(h_t, h_{s'}))}.   (16)

The score can be calculated in three ways, all of which are shown below:

\mathrm{score}(h_t, h_s) = \begin{cases} h_t^\top h_s & \text{(dot)} \\ h_t^\top W_a h_s & \text{(general)} \\ W_a [h_t; h_s] & \text{(concat)} \end{cases}   (17)

In our model, we use the general score (the second case in Eq. (17)), which has been shown to perform best [7], to compute the alignment vector a_t. Treating the alignment vector as weights, we take the weighted average over all the source hidden states to generate the context vector c_t. An example of the resulting alignments is shown in Fig. 1 [8].
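A compact NumPy sketch of this global attention step with the general score is given below. The shapes and parameter names (W_a, W_c) are illustrative assumptions; H_s stacks all source hidden states row by row.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def global_attention(h_t, H_s, W_a, W_c):
    # h_t: current target hidden state, shape (d_t,)
    # H_s: all source hidden states, shape (S, d_s)
    scores = H_s @ (W_a @ h_t)        # general score h_s^T W_a h_t for every source step
    a_t = softmax(scores)             # alignment vector, Eq. (16)
    c_t = a_t @ H_s                   # context vector: weighted average of source states
    h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))   # attentional vector, Eq. (14)
    return h_tilde, a_t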

Decoder. The decoder is trained to predict the next word y_t given the context vector c and all previously predicted words \{y_1, \ldots, y_{t-1}\} [10, 11]. The corresponding probability is

p(y) = \prod_{t=1}^{T} p(y_t \mid \{y_1, \ldots, y_{t-1}\}, c),   (18)

where y = (y_1, \ldots, y_T). The conditional probability in the recurrent neural network can also be defined as

p(y_t \mid \{y_1, \ldots, y_{t-1}\}, c) = g(y_{t-1}, h_t, c),   (19)

where g denotes a nonlinear function that outputs the probability of y_t, and h_t denotes the hidden state of the recurrent neural network [12].

Fig. 1. English-French sample alignments found by RNNsearch-50 (Bahdanau et al., 2014)

In our model, every conditional probability described in Eq. (18) is defined as

p(y_t \mid y_1, \ldots, y_{t-1}, x) = g(y_{t-1}, h_t, c_t),   (20)

where h_t is computed by the following equation:

h_t = f(h_{t-1}, y_{t-1}, c_t).   (21)

Note that the probability here is conditioned on a distinct context vector c_t for each target word y_t. The context vector c_t is obtained by the method described in the previous section.
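The following sketch puts Eqs. (20) and (21) together for a single decoder step, reusing lstm_step and softmax from the sketches above. The embedding matrix E, the decoder weights W_dec and the output projection W_s are assumed parameters; g is realized here as a linear layer followed by a softmax, which is one common choice rather than the only one.

import numpy as np

def decoder_step(y_prev_id, h_prev, mem_prev, ctx, E, W_dec, W_s):
    # Feed the previous target word y_{t-1} together with the context vector c_t, Eq. (21).
    x_t = np.concatenate([E[y_prev_id], ctx])
    h_t, mem_t = lstm_step(x_t, h_prev, mem_prev, W_dec)   # new decoder state h_t
    probs = softmax(W_s @ h_t)                             # p(y_t | y_<t, x), Eq. (20)
    return probs, h_t, mem_t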

4. Experiments

Dependencies. Our model runs on Linux, and we need the following tools to be ready:

• Python >= 3.5

• TensorFlow >= 1.2

• Numpy >= 1.12

It is preferable to have a GPU to help speed up the training process [25].

Model Structure. Fig. 2 shows the overall structure of our model: a two-layer bidirectional LSTM as the encoder and a two-layer LSTM as the decoder. The detailed initial configuration of the encoder and the decoder is shown below.

Encoder

• Hidden state size: 1024

• Number of layers: 2


• Input keep probability: 1.0

• Output keep probability: 1.0

Decoder

• Hidden state size: 1024

• Number of layers: 2

• Input keep probability: 1.0

• Output keep probability: 1.0

Other configurations

• Learning rate: 0.0005

• Batch size: 128

• Beam size: 5

• Size of attentional vector: 512

Fig. 2. The structure of the EN-CH model

Fig. 3. Cross Entropy Loss

Model Evaluation and Result Analysis. First, we use cross entropy as the loss of our model. Fig. 3 shows the variation of the cross entropy during training: as training progresses, the cross entropy decreases gradually and settles at around 0.5, fluctuating between 0.1 and 0.9, which shows that the model is still quite effective.

We also notice that the losses vary in a relatively large range after the model has been trained for over 50,000 steps. This is because the model performs better with shorter sentences than with longer ones. When input sentences are short, the translations produced by the model are exactly the same as the standard translations most of the time, and the losses may therefore be close to zero. When input sentences are long, the translations produced by the model may be less accurate, but still acceptable, or may express the same idea in a different way than the standard translations. This is why the losses in such cases are relatively larger than those for short sentences.

Table 1 shows some examples of short sentences.

Table 2 shows some examples of long sentences.

Secondly, we evaluate our model with the BLEU score, an automatic evaluation metric for machine translation. The BLEU score is computed on the test dataset using the following equations [13]:

Table 1. Examples of short sentences: model translations vs. standard translations at training steps 229500, 277000 and 283500 (losses 0.360127, 0.335871 and 0.315215)

Table 2. Examples of long sentences: model translations vs. standard translations at training steps 276500 and 277500 (losses 0.715091 and 0.894594)

BP = \begin{cases} 1 & \text{if } c > r, \\ e^{1 - r/c} & \text{if } c \le r, \end{cases}   (22)

\mathrm{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right),   (23)

\log \mathrm{BLEU} = \min\left( 1 - \frac{r}{c},\; 0 \right) + \sum_{n=1}^{N} w_n \log p_n,   (24)

where BP denotes the brevity penalty, c denotes the length of the output sentence, r denotes the length of the standard translation sentence, p_n denotes the modified n-gram precision, N equals 4 and w_n equals 0.25.
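As an illustration of Eqs. (22)-(24), the sketch below computes a sentence-level BLEU score in plain Python. The actual evaluation aggregates n-gram counts over the whole test set; this simplified single-sentence version with a small smoothing constant is only meant to make the formulas concrete.

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, N=4):
    # candidate, reference: token lists (for Chinese, e.g. segmented words or characters).
    log_precisions = []
    for n in range(1, N + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum(min(c, ref[g]) for g, c in cand.items())   # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        p_n = max(clipped / total, 1e-9)                         # avoid log(0) in this sketch
        log_precisions.append(math.log(p_n))
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))           # brevity penalty, Eq. (22)
    return bp * math.exp(sum(0.25 * lp for lp in log_precisions))  # Eqs. (23)-(24), w_n = 1/4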

In our model, we obtained a BLEU score of 26.6 for the English-Chinese task.

5. Further Development of NMT

In a rapidly developing and highly competitive environment, NMT technology is making significant progress. NMT will also be continuously improved in many aspects, including:

• Rare word problem [14, 15]

• Use of single-language data [16, 17]

• Multilingual Translation / Multilingual NMT [18]

• Memory mechanism [19]

• Language fusion [20]

• Coverage issues [21]

• Training process [22]

• A priori knowledge fusion [23]

• Multi-modal translation [24]

Acknowledgement

We would like to thank Professor E.A. Rodionova for her great help. We would also like to express our gratitude to all the people who have helped and supported us in finishing this article.

References

1. Ziemski M., Junczys-Dowmunt M., Pouliquen B. The United Nations Parallel Corpus v1.0. Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC'16), Portorož, Slovenia, 2016.

2. Hochreiter S., Schmidhuber J. Long Short-Term Memory. Neural Computation, 1997, Vol. 9, No. 8, Pp. 1735-1780.

3. Gers F.A., Schraudolph N.N., Schmidhuber J. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 2002, Vol. 3, Pp. 115-143.

4. Schuster M., Paliwal K.K. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 1997, Vol. 45, Pp. 2673-2681.

5. Graves A., Mohamed Abdel-Rahman, Hinton G. Speech recognition with deep recurrent neural networks. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, Pp. 6645-6649, preprint arXiv: 1303.5778 03.2013 (https://arxiv.org/pdf/1303.5778.pdf).

6. Graves A., Schmidhuber J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 2005, Vol. 18, Issue 5-6, Pp. 602-610.

7. Luong M.-T., Pham H., Manning C.D. Effective approaches to attention-based neural machine translation. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015, Pp. 1412-1421, preprint arXiv: 1508.04025, 2015.

8. Bahdanau D., Cho K., Bengio Y. Neural machine translation by jointly learning to align and translate, preprint arXiv: 1409.0473, 2014.

9. Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Yoshua Bengio. How to construct deep recurrent neural networks, preprint arXiv: 1312.6026, 12.2013.

10. Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation, preprint arXiv: 1406.1078, 2014.

11. Sutskever I., Vinyals O., Le Q.V. Sequence to sequence learning with neural networks. Proceedings of the Advances in Neural Information Processing Systems, 2014, Pp. 3104-3112, preprint arXiv: 1409.3215.

12. Kalchbrenner N., Blunsom Ph. Recurrent continuous translation models. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013, Pp. 1700-1709.

13. Papineni K., Roukos S., Ward T., Zhu W.-J. BLEU: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, 2002, Pp. 311-318.

14. Jean S., Kyunghyun Cho, Memisevic R., Bengio Y. On using very large target vocabulary for neural machine translation. Conference ACL-2015, preprint arXiv: 1412.2007, 2014.

15. Luong M.-T., Sutskever I., Le Q.V., Vinyals O., Zaremba W. Addressing the rare word problem in neural machine translation, preprint arXiv: 1410.8206, 2014.

16. Sennrich R., Haddow B., Birch A. Improving neural machine translation models with monolingual data. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, preprint arXiv: 1511.06709, 2015.

17. Cheng Y., Xu W., He Z., He W., Wu H., Sun M., Liu Y. Semi-supervised learning for neural machine translation, preprint arXiv: 1606.04596, 2016.

18. Dong D., Wu H., He W., Yu D., Wang H. Multi-task learning for multiple language translation. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, 2015, Vol. 1, Pp. 1723-1732.

19. Wang M., Lu Z., Li H., Liu Q. Memory-enhanced decoder for neural machine translation, preprint arXiv: 1606.02003, 2016.

20. Sennrich R., Haddow B. Linguistic input features improve neural machine translation. Proceedings of the First Conference on Machine Translation, 2016, Pp. 83-91, preprint arXiv: 1606.02892, 2016.

21. Tu Z., Lu Z., Liu Y., Liu X., Li H. Modeling coverage for neural machine translation. ACL Conference, preprint arXiv: 1601.04811, 2016.

22. Shen S., Cheng Y., He Z., He W., Wu H., Sun M., Liu Y. Minimum risk training for neural machine translation, preprint arXiv: 1512.02433, 2015.

23. Cohn T., Hoang C.D.V., Vymolova E., Yao K., Dyer C., Haffari G. Incorporating structural alignment biases into an attentional neural translation model, preprint arXiv: 1601.01085, 2016.

24. Hitschler J., Schamoni S., Riezler S. Multimodal Pivots for Image Caption Translation. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 2016, Pp. 2399-2409.

25. Britz D., Goldie A., Minh-Thang Luong, Quoc Le. Massive exploration of neural machine translation architectures, preprint arXiv:1703.03906, 2017.

Received 08.05.2018.


The Authors

TIAN Zhaolin

E-mail: peter0431peter@gmail.com

ZHANG Weiwei

E-mail: soszhang@outlook.com

© Peter the Great St. Petersburg Polytechnic University, 2018
