Research article: "Content and structure of the corpus of spontaneous Mongolian"


CC BY

Abstract. Author: Yu Rong

The paper gives a general description of the Corpus of Spontaneous Mongolian (CSM), developed at the School of Mongolian Studies, Inner Mongolia University. The corpus includes 40 conversations with a total listening time of about 20 hours. Every record contains a free conversation (about 30 minutes long) between two old acquaintances and consists of a sound file in the WAV format and an annotation file in the TextGrid format. Five annotation layers (Mong, Pro, Tra, Morp and Seg) are attached to every sound file using Praat. Linguistic, paralinguistic and non-linguistic tagging is added to each annotation layer as well. The paper also introduces the annotation principles and the retrieval system applied in the CSM.



Yu Rong (Hasmandal)

Inner Mongolia University, Hohhot

CONTENT AND STRUCTURE OF THE CORPUS OF SPONTANEOUS MONGOLIAN

The Corpus of Spontaneous Mongolian (CSM) is the result of a research project funded by the Chinese Academy of Social Sciences. The present paper gives a general description of the CSM, introducing its structure, annotation layers and retrieval system.

1. General introduction to the CSM

1.1. Working environment of the CSM

The operating systems and software used for the CSM are Windows 7 or Vista, Praat and the Yunlong International Phonetic Alphabet. Praat and the Yunlong International Phonetic Alphabet can be downloaded freely. Written Mongolian letters are typed using the traditional Mongolian input method available in Windows 7 or Vista, so no other Mongolian input method needs to be installed.

1.2. Recording method

The data have been recorded with the software Cool Edit and either an Audio-Technica AT9944 electret condenser microphone or a SONY ECM-44B microphone, as 16 kHz, 16-bit monaural sound files, and stored on an IBM X60 computer in the WAV format. All the recordings have been made in standard recording studios at the School of Mongolian Studies, Inner Mongolia University, and the School of Journalism and Communication, Hohhot University of Nationalities.

1.3. Data of the CSM

The CSM is not a corpus of read speech, but of spontaneous speech. Every group of data recorded includes a free, unrestricted conversation between two old acquaintances lasting about 30 minutes. Before recording, in order to ease the tension and collect natural and spontaneous data, the participants are informed about the aim of the recording and asked to be prepared for the topic they are going to discuss. The situation in which the participants have nothing to talk about has not occurred because they are familiar with each other.

1.4. Participants selection

The participants chosen for the CSM are announcers, TV and radio hosts (or hostesses), journalists, teachers and students majoring in broadcasting who live in Hohhot, demonstrate correct pronunciation with as little dialectal influence as possible, and were either educated in Standard Mongolian or have passed the Test of Standard Mongolian. The corpus consists of 40 conversations obtained from 80 subjects. The distribution of their ages and genders is shown in Table 1.

Table 1. Age and gender structure of the subjects chosen for the CSM.

Sex       18-25   26-35   36-53   Total
Male         21       6       5      32
Female       29      11       8      48
Total        50      17      13      80

1.5. Structure of the CSM

The CSM is divided into three components with the same data, but different kinds of annotation, namely, the CSM-1, the CSM-2 and the CSM-3.

1.5.1. The CSM-1. It contains 40 conversations lasting 20 hours, with 5 annotation layers. Linguistic, paralinguistic and non-linguistic tagging has been applied. Some linguistic phenomena, such as syllable loss, vowel loss in the first syllable and consonant loss, are also marked according to actual pronunciation. Some phonetic changes are transcribed carefully instead of being tagged.

1.5.2. The CSM-2. In addition to linguistic, paralinguistic and non-linguistic tagging, tags are attached not only to such phenomena as syllable loss, vowel loss in the first syllable and consonant loss, but also to some phonetic changes. For example, if [f] changes into [s], it is tagged as <W,f;s>. However, some function words and suffixes are transcribed carefully according to their actual pronunciation instead of being tagged.

1.5.3. The CSM-3. It comprises the files named D001 to D018, about 9 hours in total, which are supplied with more accurate tags. Here tagging is attached to each phonetic change, including phonetic changes of function words and suffixes. In addition, the normal pronunciation of a given word or suffix is provided. For example, if the pronunciation of the past tense suffix [san] changes into [sq], it is tagged as <W,san;sq>.

The files in the CSM are named D001 to D040. Each file corresponds to 2 subjects, who are numbered from 001 to 080 and supplied with the tag M for male or F for female. For example, if the two participants of the file D001 (i.e., conversation 001) are a man and a woman, their numbers are 001M and 002F. Each file contains a sound file in the WAV format and an annotation file in the TextGrid format.
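The naming scheme can be illustrated with a minimal Python sketch. It assumes that speakers are numbered consecutively in pairs (D001 gives 001 and 002, D040 gives 079 and 080); the text confirms this only for D001, so the general rule and the function names `speaker_numbers` and `speaker_labels` are hypothetical.

```python
def speaker_numbers(file_id: str) -> tuple[str, str]:
    """Return the two speaker numbers for a file name such as 'D001'.

    Assumption (not stated in the paper): file Dnnn holds speakers
    2n-1 and 2n, which matches the D001 -> 001, 002 example.
    """
    if len(file_id) != 4 or file_id[0] != "D" or not file_id[1:].isdigit():
        raise ValueError(f"not a CSM file name: {file_id!r}")
    n = int(file_id[1:])
    if not 1 <= n <= 40:
        raise ValueError(f"file number out of range D001-D040: {file_id!r}")
    return f"{2 * n - 1:03d}", f"{2 * n:03d}"


def speaker_labels(file_id: str, sexes: tuple[str, str]) -> tuple[str, str]:
    """Append the sex tag (M or F) to each speaker number."""
    for sex in sexes:
        if sex not in ("M", "F"):
            raise ValueError(f"sex tag must be 'M' or 'F', got {sex!r}")
    first, second = speaker_numbers(file_id)
    return first + sexes[0], second + sexes[1]


print(speaker_labels("D001", ("M", "F")))
```

For the example given in the text, a man and a woman in D001, the sketch yields the labels 001M and 002F.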

2. Annotation and its principles

2.1. Annotation layers

First of all, 5 annotation layers, namely, Mong, Pro, Tra, Morp, Seg, have been attached to every sound file using Praat. Linguistic, paralinguistic and non-linguistic tagging is also added to each layer of annotation.

2.1.1. First layer: Written Mongolian orthography (Mong). The first layer is the transcription of a sound file using the traditional Written Mongolian input method. Here, a 30-minute sound file is divided into prosodic words, with every function word or suffix being regarded as a whole with the preceding lexical word. It is required that every prosodic word correspond to a spectrogram. To sum up, transcribing in old Mongolian script is the first step of the CSM annotation, a characteristic feature of the CSM which makes the subsequent annotation work more successful.

2.1.2. Second layer: pronunciation (Pro). Just as on the first layer, a 30-minute sound file is divided into segments which correspond to prosodic words as represented by spectrograms. Here the actual pronunciation is transcribed using the IPA symbols. The phonemic transcription made on this layer can be easily converted to the phonetic one used on the fourth and fifth layers where allophonic variation is consistently represented.

2.1.3. Third layer: Latin transliteration (Tra). Here the prosodic words isolated on the first and second layers are divided into roots and suffixes according to their spectrograms and transliterated into Latin orthography for the convenience of reference. The principles of Latin transliteration used here are the same as those applied in the corpus of Written Mongolian. No matter how many varieties a word's (or a morpheme's) pronunciation has, its Latin transliteration on the third layer is unified. For example, although the pronunciation of the word written as in Mongolian has many varieties, such as [fitha:], [fta:], [fth], [fta] or [ftha], on the third layer it is consistently transliterated into Latin as "$IDE". The various pronunciations of this word can be easily found by its Latin transliteration "$IDE". For the convenience of reference, the symbol "-" is put before a noun suffix, while the symbol "/" is used for a verbal suffix.
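As a rough illustration of the prefix convention on the third layer, the following Python sketch sorts a Tra label into root, noun suffix or verbal suffix by its first character. The labels "-SFX" and "/SFX" are hypothetical placeholders, not real CSM entries, and the sketch assumes that every label without a prefix is a root or a function word.

```python
def classify_tra_label(label: str) -> str:
    """Classify a third-layer (Tra) label by its leading symbol."""
    if label.startswith("-"):
        return "noun suffix"      # "-" marks a noun suffix
    if label.startswith("/"):
        return "verbal suffix"    # "/" marks a verbal suffix
    return "root"                 # assumption: anything else is a root or function word


# "$IDE" is taken from the text; the two suffix labels are placeholders.
for label in ("$IDE", "-SFX", "/SFX"):
    print(label, "->", classify_tra_label(label))
```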

2.1.4. Fourth layer: Morphemes (Morp). Just as on the third layer, the prosodic words are divided into roots and suffixes according to their spectrograms. Here they are presented in an allophonic transcription using the IPA symbols, based on their actual pronunciation. On this layer, phonetic changes within a syllable, a function word or a suffix, as well as devoicing of vowels and voiced consonants, are tagged.

2.1.5. Fifth layer: Segments (Seg). Here the roots and affixes isolated on the third and fourth layers are divided into phonetic segments according to their spectrograms. These segments are transcribed using narrow (allophonic) transcription which reflects their actual pronunciation. Phonetic changes have been tagged as well, such as vowel or consonant loss, devoicing of a vowel or a voiced consonant, etc.

Together, these different kinds of annotation make it possible to observe linguistic phenomena in spontaneous speech that are hard to distinguish acoustically, e.g., phonemic loss, phonetic developments, and sound changes in segments, syllables, function words, roots, etc. The following is an example of annotation used in the CSM.

2.2. Annotation principles

First of all, sound files are read using Praat and supplied with 5 annotation layers. The first layer is transcribed into Written Mongolian, the second one is in the IPA-based phonemic transcription, the third one is in Latin transliteration, and the fourth and fifth ones are in the IPA-based phonetic (allophonic) transcription. The main annotation principles are given below.

2.2.1. Phonetic tagging should be based on actual pronunciation and transcribed using the IPA symbols.

2.2.2. If non-linguistic phenomena, such as silence, laughing, coughing, hawking, breathing, smacking, swallowing saliva, etc., occur in continuous speech and last for more than 0.2 seconds, they are tagged on all 5 annotation layers; if they last for less than 0.2 seconds, they are tagged on the fifth layer only.
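Principle 2.2.2 amounts to a single duration threshold, which the following Python sketch makes explicit. The text does not say how an event of exactly 0.2 seconds is treated; the sketch assumes it counts as short, and the function name `layers_for_event` is hypothetical.

```python
LAYERS = ("Mong", "Pro", "Tra", "Morp", "Seg")
THRESHOLD_S = 0.2  # seconds, from principle 2.2.2


def layers_for_event(duration_s: float) -> tuple[str, ...]:
    """Return the layers on which a non-linguistic event is tagged."""
    if duration_s > THRESHOLD_S:
        return LAYERS
    # Assumption: an event of exactly 0.2 s is handled like a shorter one.
    return ("Seg",)


print(layers_for_event(0.35))
print(layers_for_event(0.10))
```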

2.2.3. The tag <pz> is used to mark a gap between sound segments. The closure phase of a consonant is also tagged as <pz><cl> when a word begins with a plosive or affricate consonant following a non-linguistic phenomenon, e.g., a pause or breathing. In addition, if the closure phase of a plosive or affricate consonant lasts for more than 0.11 seconds, it is tagged as <pz><cl> as well.

2.2.4. It is known that vowel length is relevant to word meaning in Mongolian. In linguistic analysis, it is difficult to distinguish between long vowels and short vowels, since vowel length depends on other factors, such as speed of speech, language style, phonetic environment, sound intensity, pitch, etc. This is also an important issue for speech recognition and text-to-speech conversion. In the CSM vowel length is classified into 4 different levels.

(1) If the duration of a vowel is less than 0.08 seconds, it is regarded as a short vowel and transcribed according to its actual pronunciation.

(2) If the duration of a vowel is between 0.08 and 0.11 seconds (including 0.08 seconds), it is regarded as a semi-long vowel and followed in a phonetic transcription by the symbol [].

(3) If the duration of a vowel is between 0.11 and 0.20 seconds (including 0.11 seconds), it is regarded as a long vowel and followed in a phonetic transcription by the symbol [:].

(4) If the duration of a vowel is 0.20 seconds or more, it is regarded as an extra-long vowel and preceded in a phonetic transcription by the symbol <H>.
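The four levels can be summarised as a small classification function. This is a minimal Python sketch of the thresholds listed above, not the tooling used by the CSM annotators; the function name `classify_vowel_length` is hypothetical.

```python
def classify_vowel_length(duration_s: float) -> str:
    """Map a vowel duration in seconds to one of the four CSM length levels."""
    if duration_s < 0:
        raise ValueError("duration must be non-negative")
    if duration_s < 0.08:
        return "short"        # level (1): less than 0.08 s
    if duration_s < 0.11:
        return "semi-long"    # level (2): 0.08 s up to, but not including, 0.11 s
    if duration_s < 0.20:
        return "long"         # level (3): 0.11 s up to, but not including, 0.20 s
    return "extra-long"       # level (4): 0.20 s or more, tagged <H>


for d in (0.05, 0.08, 0.11, 0.20):
    print(d, classify_vowel_length(d))
```

The lower bound of each interval is inclusive, as stated in levels (2) to (4).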

2.2.5. Some specific rules for tagging plosive and affricate consonants are used in the CSM.

(1) If a plosive or affricate consonant has a full-fledged articulation, with the catch, occlusion and release, the tag for the closure phase <cl> should be put before the annotation of the consonant itself.

(2) If a plosive or affricate consonant is represented by a voiced allophone, its closure phase should not be marked separately from the consonant itself.

(3) If a plosive or affricate consonant has only the period of occlusion with no release, the tag for the closure phase <cl> and the consonant itself should be written together.

(4) An extra-short vowel or breathing <BR> should be distinguished after a plosive or affricate consonant occurring in syllable-final position.

2.2.6. The loss of short vowels in syllables after the second one is not tagged.

2.2.7. Such linguistic phenomena as syllable loss, changes in suffixes or function words, etc. are tagged on the fourth layer, while all other types of phonetic changes are annotated on the fifth layer.

2.2.8. If the loss of a syllable or a phoneme occurred word-initially, it is tagged before the following phoneme of the word; if it occurred word-finally, it is tagged after the last phoneme of the word; if it occurred word-medially, it is tagged after the preceding phoneme.

2.2.9. Devoicing of a vowel or a voiced consonant is marked by the IPA voiceless diacritic [ ̥ ] (a small ring below the symbol).

2.2.10. If the speech signal is so weak that a clear spectrogram cannot be produced, but the whole sentence can still be understood, the sentence should be transcribed into Written Mongolian on the first layer and marked by <?> on the other layers.

2.3. Content of the CSM annotations

The 2005 revised version of the IPA is used in the CSM annotations.

2.3.1. Symbols for vowels. Short vowels [a, 9, i, o, u, o, u, e, re] in initial syllables, short vowels [3, a, i, s, e, u] in non-initial syllables, long vowels [a:, 9:, i:, o:, u:, o:, u:, e:, re:, e:], compound vowels [ai (ae), oi (oe) , ui (ue), ui (ue/y:), ua (ua:)] are tagged in the CSM.

Vowel tagging should be based on their actual pronunciation and should not be changed arbitrarily.

There are only a few allophones of compound vowels in Mongolian which are given in parentheses.

There are a number of non-linguistic phenomena occurring in spontaneous speech as opposed to read speech materials. One of them is that unclear sounds appear frequently before or after most vowels on the spectrogram. According to the above-mentioned correlation between types of annotation and vowel durations, if an unclear sound is included into a vowel segment, the duration of the vowel will be influenced. Therefore, these unclear sound segments are tagged differently in the CSM. The detailed explanation can be found on the list of tags below.

2.3.2. Symbols for consonants. Phonetic symbols for consonants [n, q, p (b, w, $, P), p», x (y, k»), k (x, g, k»), m, l (1), s (z), J (3), t» , t (d), f», f №), j, r (r, j, 3), w] are used in the CSM. The symbols for phonemes occurring only in loanwords are [te, te», f, k», te, te», g, 3] etc.

Allophones belonging to the same phoneme are given in parentheses.

2.3.3. Latin letters for transliterating Written Mongolian. The following Latin letters are used on the third layer of the CSM for transliterating Written Mongolian vowels: A (v), E (v(, I (W), 0 (vu 4), V (vu 5), O (v° 6(, U )v° 7). The consonants are transliterated as N )>(, NG (o), B (e), p (9(, H (X), G (ft), M )m(, L (l), S (s), $ (*), T )< D (,), C (*), J (,), Y (T), R (*), W (v). The symbols used only in loanwords are e (v), ZH (T), CH (oo), F (f), K (R), C (c), Z (z), h (nh), r (fr), lh (?h).

2.3.4. Tagging of phonetic changes. Since the CSM is a corpus of spontaneous speech, many phonetic changes occurring in continuous speech have been consistently tagged there.

(1) Phonetic changes are denoted on the fifth layer by the symbol <W,> which is followed by the original phoneme, and, after a semicolon, by its actual pronunciation. For example, <W,k;x> means that [k] has changed into [x]; <W, a:;a> means that [a:] has changed into [a].

(2) A change of a suffix or a syllable is likewise denoted by the symbol <W,>, but on the fourth layer; it is followed by the original form and, after a semicolon, by its actual pronunciation. For example, <W,san;s> means that the suffix [san] has changed into [s].

(3) Sometimes an epenthetic sound is inserted when a suffix is added to a root. For example, when a genitive case suffix is added to the root ending in a vowel, the consonant [k] is inserted in between. In this case, annotation is made according to the actual pronunciation.
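Because the <W,> tag has a fixed shape (original form, semicolon, actual pronunciation), it can be extracted mechanically from an annotation string. The following Python sketch uses a regular expression for this; the pattern and the function name `parse_changes` are assumptions for illustration, not part of the CSM retrieval system.

```python
import re

# <W,original;actual>, allowing optional spaces as in "<W, a:;a>"
CHANGE_TAG = re.compile(r"<W,\s*([^;<>]+);([^<>]+)>")


def parse_changes(annotation: str) -> list[tuple[str, str]]:
    """Return (original, actual) pairs for every <W,...> tag in the string."""
    return [
        (m.group(1).strip(), m.group(2).strip())
        for m in CHANGE_TAG.finditer(annotation)
    ]


for tag in ("<W,k;x>", "<W, a:;a>", "<W,san;s>"):
    print(tag, "->", parse_changes(tag))
```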

2.3.5. Tagging of vowel, consonant or syllable loss. There are cases of vowel, consonant or syllable loss consistently tagged in the CSM.

(1) Vowel loss is denoted by the symbol <V,> which is followed by the dropped vowel. For example, th<V,a> refers to the loss of the vowel [a] after [th]; <V,i>g refers to the loss of the vowel [i] before [g].

(2) Consonant loss is denoted by the symbol <C,> which is followed by the dropped consonant. For example, <C,k>a refers to the loss of the consonant [k] before [a]; i:<C,m> refers to the loss of the consonant [m] after [i:].

(3) As for syllable loss, it is denoted by the symbol <S,> which is followed by the dropped syllable. For example, <S,pe:> refers to the loss of the syllable [pe:].
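Since each loss tag records the dropped material, both the form as actually pronounced and the form without the loss can be derived from one annotation string. The Python sketch below shows one way to do this; the function names and the idea of reinserting the dropped material to obtain a "full form" are assumptions for illustration, not a procedure described in the paper.

```python
import re

LOSS_TAG = re.compile(r"<([VCS]),([^<>]+)>")
KINDS = {"V": "vowel", "C": "consonant", "S": "syllable"}


def realized_form(annotation: str) -> str:
    """Drop the loss tags, leaving only what was actually pronounced."""
    return LOSS_TAG.sub("", annotation)


def full_form(annotation: str) -> str:
    """Put the dropped material back in place of each loss tag."""
    return LOSS_TAG.sub(lambda m: m.group(2), annotation)


def list_losses(annotation: str) -> list[tuple[str, str]]:
    """Return (kind, dropped material) for every loss tag."""
    return [(KINDS[m.group(1)], m.group(2)) for m in LOSS_TAG.finditer(annotation)]


for example in ("th<V,a>", "<C,k>a", "i:<C,m>", "<S,pe:>"):
    print(example, realized_form(example), full_form(example), list_losses(example))
```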

2.3.6. List of tags used in the CSM. Apart from phonetic changes, there are various kinds of non-linguistic phenomena, such as breathing, smacking, coughing, hawking, etc., which occur in a corpus of continuous speech as opposed to one of read speech. In the CSM, all these phenomena are consistently tagged. The layer and the position where a given tag is put are specified by the annotation principles applied in the CSM. The following is the list of tags used for annotation.

Paralinguistic phenomena

<P1>: a pause between sentences lasting more than 0.2 seconds; used on each layer.

<P2>: a pause between words lasting more than 0.2 seconds; used on each layer.

<pz>: a gap between sound segments within a word, also a pause between words or sentences lasting less than 0.2 seconds; used on the fifth layer.

<H>: an extra-long vowel with the duration of more than 0.2 seconds; put before the corresponding vowel.

<G>: an extra-long consonant with the duration of more than 0.2 seconds; put before the corresponding consonant.

<E>: an extra-short vowel occurring in word or syllable-final position; used on the fifth layer.

<cl>: closure phase of a consonant, also the occlusion of a plosive or affricate consonant.

<vb>: irregular noise in speech signal before the articulation of a vowel.

<sv>: vibration of the vocal cords after the articulation of a vowel.

<uv>: a formant clearly visible on a spectrogram after the articulation of a vowel with no lasting vibration of the vocal cords.

<fr>: irregular vibration of the vocal cords after the articulation of a vowel.

<th>: a glottal sound.

<sh>: shortening of a compound vowel lasting for less than 0.11 seconds; put before the corresponding compound vowel.

<tr>: inversion of a sound sequence.

<O>: a loanword, a dialectism or an archaism; used on the second layer only and put before the corresponding word.

<S,>: syllable loss.

<W,>: a phonetic change.

<V,>: vowel loss.

<C,>: consonant loss.

<TP>: murmuring, an unclear mixed sound.

<IN>: a suction sound.

<BR>: breathing.

<SM>: smacking.

<OV>: repeating.

<F>: a cheer, a sigh or a meaningless sound.


<D>: sound correction.

<?>: unclear.

Non-linguistic phenomena

<LA>: laughing.

<CO>: coughing.

<NS>: noises and other sounds occurring during the conversation.

<HA>: hawking.

<SS>: swallowing saliva.

<LS>: speaking while laughing, unable to be analyzed.

3. Retrieval system of the CSM

The CSM is supplied with a retrieval system which can be used as follows. First, a TextGrid file is converted into Excel using Praat. Secondly, the symbol "-" before a noun suffix needs to be modified because it turns out to be an error code in Excel. In order to solve the problem, all error codes in the Excel C column need to be selected and replaced by the symbol "'-". Thirdly, the content of the annotation layers should be ordered by entering a simple command so that the content of Mong, Pro, Tra, Morp, Seg appears in columns C2, D2, E2, F2, G2 respectively. Fourthly, the sound data are ordered according to segmentation time. Lastly, blank spaces in columns C2 and D2 can be filled. Thereby, all the annotations are entered into Excel tables ordered by segmentation time, and the retrieval system can be started using the Excel services.

Using our retrieval methods, the various forms of a word or a suffix can be looked up by their unified Latin transliteration as presented on the third layer. If a tagging error is found during a search and the original sound file needs to be checked, one can easily return to the original file.
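The lookup by unified transliteration can be sketched outside Excel as well. The following Python example is a hypothetical illustration, not the CSM retrieval system: it assumes each annotated unit is available as a row with a start time, a Tra label and a Morp transcription, and it shows the apostrophe workaround for cells that begin with "-". The start times and the "-SFX" row are invented placeholders; "$IDE" and its two pronunciations come from the text.

```python
from collections import defaultdict


def excel_safe(cell: str) -> str:
    """Prefix a leading '-' with an apostrophe so a spreadsheet keeps it as text."""
    return "'" + cell if cell.startswith("-") else cell


def build_index(rows: list[dict]) -> dict[str, list[tuple[float, str]]]:
    """Group (start time, Morp form) pairs by Tra label, ordered by time."""
    index: dict[str, list[tuple[float, str]]] = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["start"]):
        index[row["Tra"]].append((row["start"], row["Morp"]))
    return dict(index)


# Hypothetical rows: times and the suffix entry are placeholders.
rows = [
    {"start": 12.40, "Tra": "$IDE", "Morp": "fta:"},
    {"start": 3.15, "Tra": "$IDE", "Morp": "fitha:"},
    {"start": 7.80, "Tra": "-SFX", "Morp": "x"},
]
index = build_index(rows)
print(index["$IDE"])
print(excel_safe("-SFX"))
```

Keeping the start time next to each variant mirrors the point made above: a hit can be traced back to its position in the original sound file.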

As the CSM is richly supplied with different kinds of annotation, it can meet a variety of research needs. Thanks to its retrieval system, the CSM can be used directly in linguistic research.

Finally, I wish to express my heartfelt thanks to Professor Choijinjab (Inner Mongolia University), Professor Huhe (Chinese Academy of Social Sciences) and Professor Maekawa Kikuo (National Institute for Japanese Language and Linguistics), who provided valuable advice during the construction of the corpus.
