COMPARATIVE ANALYSIS OF KAZAKH LANGUAGE FREQUENCY DICTIONARIES
ZHANABEKOVA AYMAN ABDILDAEVNA
Head of the Department of Applied Linguistics, A. Baitursynov Institute of Linguistics, Doctor of Philosophy,
Almaty, Kazakhstan
KOZHAKHMETOVA AKTOTY KOZHAKHMETOVNA
Researcher of the Department of Applied Linguistics, PhD student,
Almaty, Kazakhstan
TLEGENOVA GULDEN BAKYTKAZYKYZY
Researcher of the Department of Applied Linguistics, PhD student,
Almaty, Kazakhstan
BARMENKULOVA AIDA SERIKKHANOVNA
Researcher of the Department of Applied Linguistics, PhD student,
Almaty, Kazakhstan
BESIROV YERKIN BEKZHANOVICH
Researcher of the Department of Applied Linguistics, Almaty, Kazakhstan
Resume. The article compares frequency dictionaries of written and spoken Kazakh. It finds that the verb "bol" ("to be") is the most frequent word in the frequency dictionaries of written language, that the verb "de" is the most frequent word in spoken language, and that, according to the National Corpus of the Kazakh language, the most frequently encountered word is the pronoun "bir". In addition, the article compares the frequency of parts of speech in written and oral texts.
Keywords: statistics, frequency dictionary, alphabetic-frequency dictionary, frequency-alphabetic dictionary
Түйіндеме. Мақалада қазақ тіліндегі жазба тілдің жиілік сөздіктері мен ауызша тілдің жиілік сөздіктеріне салыстырмалы талдау жасалады. Жазба тілде «бол» етістігінің, ал ауызша тілде «де» етістігінің ең жоғары жиілікке ие сөз екендігі, Қазақ тілі ұлттық корпус мәліметтері бойынша «бір» есімдігінің ең жиі кездесетін сөз екендігі анықталады. Сонымен қатар сөз таптарының жазба және ауызша тіл мәтіндеріндегі жиілігі салыстырылады.
Тірек сөздер: статистика, жиілік сөздік, әліпбилі-жиілік сөздік, жиілікті-әліпбилі сөздік
A comparative analysis of the frequency dictionaries published for Kazakh is the key to compiling a statistical picture of linguistic units, i.e. a set of statistical regularities in the language. Linguistic units can be analysed statistically from different points of view: for example, the frequency of letters, of letter combinations, of syllables, of usage patterns, of words, and of attested word forms. In the 1970s, researchers such as K. Bektaev, A.K. Zhubanov, A. Akhabaev, S. Myrzabekov, and A. Belbotaev were engaged in compiling frequency dictionaries of the Kazakh language. They compiled frequency dictionaries for the language of fairy tales, journalism, and mathematical terminology, as well as a frequency dictionary of the language of Abai. The most notable of these works is the set of frequency dictionaries of the texts of M. Auezov's works in 20 volumes.
These dictionaries are significant statistical works of the twentieth century. The frequency dictionary of M. Auezov's 20-volume collected works covers both the fiction and the journalism of the author, totaling more than 2 million words. The text volumes underlying the other frequency dictionaries are significantly smaller. Therefore, among the frequency dictionaries published in those years, we take the dictionary of the 20-volume collection of M. Auezov's works for comparison with the frequency dictionary of oral speech texts.
In 1986, the statistical team of the A. Baitursynov Institute of Linguistics at the National Academy of Sciences of Kazakhstan began work on creating various dictionaries based on M. Auezov's collected works in 20 volumes. As a result of this effort, "Frequency dictionaries of texts of M. O. Auezov's works in 20 volumes" (22 printer's sheets) were published in 1995 [1].
Table 1-excerpt from the frequency dictionary of M. Auezov's works in 20 volumes
Word/word class Total
бол/ет 38944
де/ет 31901
бұл/ес 22893
е/ет 21649
өз/ес 20471
да/шл 18254
ал/ет 17958
кел/ет 17797
де/шл 17293
ол/ес 17269
сол/ес 15599
мен/ес 14717
айт/ет 14273
бер/ет 11925
бір/ес 10293
бір/са 10244
отыр/ет 10082
тұр/ет 8742
сөз/зт 8575
кет/ет 8571
жүр/ет 8536
шық/ет 8221
қал/ет 7974
көр/ет 7628
бар/ет 7378
көп/сн 7302
The 29,483 words in the list (excluding proper names) are given in descending order of frequency.
The verb "bol" ("to be") is in first place in this frequency dictionary. It occurs 38,944 times in the texts of M. Auezov's collected works, which is 2.28% of the entire text. In second place, the verb "de" occurs 31,901 times (1.87%). Together these two verbs cover 4.15% of the entire text. The dictionary thus shows what share of the whole text is covered by any group of the most common words. The first 10 words in the list alone (бол/ет, де/ет, бұл/ес, е/ет, өз/ес, да/шл, ал/ет, кел/ет, де/шл, ол/ес) cover 13% of the whole text. The first 31 words of the dictionary cover 1/4 of the text (25%), 174 words cover 1/2 (50%), and 1,292 words cover 80%.
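The coverage figures above follow from simple arithmetic: each word's share is its absolute frequency divided by the total number of running words, and the shares are summed down the list. The following is a minimal Python sketch of that calculation, not the procedure used by the compilers. It assumes a total of about 1,708,000 running words, a value back-computed here from "38,944 occurrences = 2.28%" and not a figure stated in the dictionary; the function names are illustrative.

```python
from itertools import accumulate

# First ten entries of Table 1 (root/word class, absolute frequency).
TOP_TEN = [
    ("бол/ет", 38944), ("де/ет", 31901), ("бұл/ес", 22893),
    ("е/ет", 21649), ("өз/ес", 20471), ("да/шл", 18254),
    ("ал/ет", 17958), ("кел/ет", 17797), ("де/шл", 17293),
    ("ол/ес", 17269),
]

# ASSUMPTION: total running words, back-computed from 38944 / 0.0228.
# The source does not state this number directly.
TOTAL_TOKENS = 1_708_000


def cumulative_coverage(entries, total_tokens):
    """Return (word, cumulative coverage in percent) for each entry, in order."""
    running_totals = accumulate(count for _, count in entries)
    return [
        (word, 100.0 * running / total_tokens)
        for (word, _), running in zip(entries, running_totals)
    ]


def words_needed(entries, total_tokens, target_percent):
    """Return how many top entries reach target_percent, or None if never reached."""
    coverage = cumulative_coverage(entries, total_tokens)
    for rank, (_, covered) in enumerate(coverage, start=1):
        if covered >= target_percent:
            return rank
    return None


coverage = cumulative_coverage(TOP_TEN, TOTAL_TOKENS)
for word, covered in coverage:
    print(f"{word}\t{covered:.2f}%")
```

Under the assumed total, the first entry comes out at about 2.28%, the first two at about 4.15%, and all ten at about 13%, in line with the figures quoted from the dictionary. Given the full 29,483-entry list, `words_needed` would return the number of words required for 25%, 50%, or 80% coverage.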
After the aforementioned frequency dictionaries had been compiled, work on creating further frequency dictionaries for the Kazakh language paused for a while. In 2016, under the supervision of Professor A.K. Zhubanov of the Department of Applied Linguistics at the A. Baitursynov Institute of Linguistics, a comprehensive frequency dictionary was created and published. This dictionary combines the previously published frequency dictionaries of Kazakh and is titled "The Frequency Dictionary of the Kazakh Language" [2].
Below we present an excerpt from the beginning of this Frequency Dictionary, where the most common words are listed. The words with a frequency of over 10,000 in the "frequency-alphabetic dictionary of the Kazakh language" are given in Table 2.
Word/word class Absolute frequency Cumulative text coverage (%)
бол/ет. 47716 2.2695
де/ет. 40178 4.1805
бұл/ес. 26338 5.4332
е/ет. 26122 6.6756
өз/ес. 23635 7.7997
кел/ет. 22566 8.8730
ал/ет. 22522 9.9442
ол/ес. 22262 11.0031
да/шл. 22013 12.0501
де/шл. 20282 13.0147
сол/ес. 17723 13.8577
мен/ес. 17341 14.6825
айт/ет. 16410 15.4630
бір/са. 15276 16.1895
бер/ет. 14691 16.8883
отыр/ет. 12382 17.4772
осы/ес. 11818 18.0393
жоқ/мд. 11415 18.5822
сөз/зт. 11309 19.1201
тұр/ет. 11022 19.6443
кет/ет. 10635 20.1501
жүр/ет. 10555 20.6522
бір/ес. 10344 21.1442
As we can see, this dictionary also includes a significant number of verbs. The highest frequencies belong to auxiliary verbs, pronouns, and conjunctions. The 23 words in the table together account for more than 20% of the total text volume.
Today special attention is being paid to applied scientific projects aimed at meeting modern needs and improving the functioning of the state language. In 2016, on the initiative of the Minister of Education and Science E. Sagadiev, it was decided to develop an effective methodology for teaching the state language and to create a frequency dictionary of the Kazakh language containing the necessary lexical and grammatical minimums. As a result of the research conducted on this task, the Institute compiled the "Frequency Dictionary of the Kazakh Language for General Education" [3].
The "Frequency Dictionary of the Kazakh Language for General Education" is based on texts in five functional styles, including fiction, journalism, science, and colloquial speech. These five functional styles represent the diversity of the Kazakh language, and the resulting dictionary contains 36,265 entries (words and lexical units).
The dictionary is organized in three structural formats: an alphabet-frequency dictionary, a frequency-alphabet dictionary, and a reverse alphabet-frequency dictionary. Additionally, a "circulation dictionary" (spread dictionary) has been created for each style, as well as dictionaries for the individual word classes and for word formations. The final product is a comprehensive electronic dictionary that can be accessed through a folder on the computer. Let us now look at an excerpt from the frequency-alphabet part of this "Frequency Dictionary of the Kazakh Language for General Education", containing the most commonly used words, those with a frequency of over 20,000 (Table 3).
Table 3-excerpt from the "frequency-alphabet dictionary of the Kazakh language in general education"
Word/word class Total
бол/ет 108133
ол/ес 62796
мен/шл 58406
де/ет 46200
ал/ет 43359
бер/ет 40295
және/шл 39765
кел/ет 34526
де/шл 33943
бұл/ес 32850
да/шл 32109
жыл/зт 30882
бір/ес 30288
өз/ес 29594
тұр/ет 28643
сөз/зт 26107
осы/ес 25995
адам/зт 25614
қазақ/зт 25069
айт/ет 23421
білім/зт 22915
үшін/шл 22598
кет/ет 22105
ел/зт 22098
отыр/ет 22038
қандай/ес 21378
екен/ет 21328
біз/ес 21267
сол/ес 20532
In this dictionary, too, the frequency of the auxiliary verb "bol" ("to be") is very high compared to other verbs. From this table we can also see that auxiliary verbs, conjunctions, and pronouns are the most frequently used words. The statistical data from this frequency dictionary are similar to the figures in the combined "Frequency Dictionary of the Kazakh Language" mentioned above.
As the name "frequency dictionary for general education" suggests, this dictionary was compiled in order to establish a lexical minimum for teaching the state language. For this purpose, a larger share of the texts was taken from school textbooks. The total volume of the texts was 7 million words, which is a large amount, and the number of distinct words in the dictionary is correspondingly significant.
The methods and techniques for constructing a frequency dictionary vary. One method involves entering the text into a dictionary of word forms and identifying the root of each word. Another method involves passing the text through a morphological analyzer, which automatically identifies the root and class of each word in the text.
The first method requires a significant amount of manual labor. Moreover, when homonyms fall together under a single root in the word list, it becomes difficult to determine which word class each occurrence of that root belongs to.
The second approach, although easy and quick to implement, does not guarantee 100% accuracy. With this method, too, homonyms are not automatically separated. A frequency dictionary free of homonym errors can be generated automatically only from texts in which the homonyms have already been disambiguated.
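Whichever method produces the annotation, the counting step is the same once every token carries a root and a word class. The following is a minimal Python sketch of that step only; the tagged token list is a hypothetical hand-annotated sample, not data from the dictionaries, and no real morphological analyzer is called. Counting (root, class) pairs rather than bare roots keeps homonyms such as де/ет and де/шл as separate entries.

```python
from collections import Counter


def build_frequency_dictionary(tagged_tokens):
    """Count (root, word_class) pairs and return them in frequency-alphabetic order.

    Entries are sorted by descending frequency; ties are broken alphabetically.
    """
    counts = Counter(tagged_tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# HYPOTHETICAL sample: tokens already reduced to (root, word class) by an
# annotator or analyzer. Class labels follow the tables: ет, ес, шл.
SAMPLE_TOKENS = [
    ("бол", "ет"), ("де", "ет"), ("де", "шл"), ("бол", "ет"),
    ("ол", "ес"), ("де", "ет"), ("бол", "ет"),
]

frequency_dictionary = build_frequency_dictionary(SAMPLE_TOKENS)
for (root, word_class), count in frequency_dictionary:
    print(f"{root}/{word_class} {count}")
```

If the annotation assigns a homonym to the wrong class, the error passes straight into the counts, which is why the accuracy of the dictionary depends on disambiguating the text first.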
In this regard, the staff of the Department of Applied Linguistics have morphologically annotated prose texts with a total word count of 500,000, such as works by M. Auezov and A. Kekilbayev. Based on this corpus, a frequency dictionary has been compiled in which word classes are assigned with 100% accuracy [4]. Let us examine an excerpt from this frequency dictionary (Table 4).
Table 4-excerpt from the frequency-alphabetic dictionary of words in the Kazakh language
№ Word/word class Absolute frequency Cumulative text coverage (%)
1 2 3 4
1 бол/ет. 666 2.320
2 де/ет. 541 4.204
3 еді/ет. 479 5.873
4 кел/ет. 407 7.290
5 бір/са. 352 8.516
6 ал/ет. 335 9.683
7 да/шл. 334 10.846
8 бұл/ес. 299 11.888
9 тұр/ет. 240 12.724
10 өз/ес. 238 13.553
11 де/шл. 231 14.357
12 сол/ес. 231 15.162
13 отыр/ет. 224 15.942
14 жат/ет. 218 16.701
15 ол/ес. 215 17.450
16 осы/ес. 214 18.196
17 көр/ет. 193 18.868
18 қал/ет. 179 19.491
19 үй/зт. 178 20.111
20 айт/ет. 176 20.724
21 бер/ет. 171 21.320
22 жоқ/мд. 166 21.898
23 күн/зт. 166 22.476
24 шық/ет. 163 23.044
25 кет/ет. 161 23.605
26 жүр/ет. 144 24.107
27 көз/зт. 144 24.608
28 жер/зт. 143 25.106
29 қара/ет. 136 25.580
30 Ш/зт 135 26.050
31 бала/зт. 132 26.510
32 бас/зт. 131 26.966
33 біл/ет. 119 27.381
34 көп/сн. 118 27.792
35 сөз/зт. 115 28.192
36 бар/ес. 114 28.589
37 ауыл/зт. 113 28.983
38 бар/ет. 113 29.377
39 мен/шл. 113 29.770
40 түс/ет. 109 30.150
41 үлкен/сн. 109 30.529
42 соң/шл. 105 30.895
43 қыл/ет. 99 31.240
44 бар/мд. 98 31.581
45 жақ/зт. 97 31.919
46 ат/зт. 94 32.247
47 екі/са. 91 32.564
48 ет/ет 90 32.877
49 сал/ет. 89 33.187
50 қой/ет. 88 33.494
51 не/ес. 86 33.793
52 алды/ке. 84 34.086
53 ел/зт. 84 34.378
54 баста/ет. 81 34.660
55 ет/ет. 81 34.943
56 кш/зт. 79 35.218
57 бірақ/шл. 75 35.479
58 мен/ес. 74 35.737
59 тр/ет 73 35.991
60 ақ/сн. 72 36.242
61 қол/зт. 71 36.489
62 енді/үс. 67 36.722
63 жол/зт. 67 36.956
64 адам/зт. 64 37.179
65 бет/зт. 63 37.398
66 күл/ет. 63 37.618
67 қара/сн. 63 37.837
68 жалғыз/сн. 61 38.049
69 жан/зт. 61 38.262
70 мал/зт. 61 38.474
71 ғой/шл. 60 38.683
72 қас/зт. 60 38.892
73 тап/ет. 60 39.101
74 аз/сн. 59 39.307
75 үсті/ке. 59 39.512
76 жігіт/зт. 58 39.714
77 қарай/шл. 58 39.916
78 ғана/шл. 57 40.115
79 бас/ет. 56 40.310
80 және/шл. 56 40.505
81 тарт/ет. 56 40.700
82 қыз/зт. 55 40.892
83 күйеу/зт. 54 41.080
84 жас/сн. 53 41.264
85 жет/ет. 53 41.449
In the frequency-alphabetic list of this small dictionary, the most common words that make up 50% of the text are listed in full, while the words covering 60%, 70%, 80%, 90%, and 100% of the text are given in abbreviated form, at intervals; Table 4 shows the beginning of this list. The words are grouped into 10 bands based on their frequency of occurrence. The first band includes the most common verb, "bol" ("to be"), which is used 666 times and makes up 2.32% of the text. This verb also has the highest frequency in the previously published dictionaries. The high frequency of "bol" is due to its role as an auxiliary verb and its close connection with grammatical forms. Frequent use is characteristic of grammatical units, which carry abstract grammatical meaning. High-frequency verbs, like "bol", are often auxiliary verbs: for example, де, еді, кел, ал, тұр, отыр, жатыр, көр, айт, бер, шық, кет, жүр, бар, түс, қыл, сал, қой, баста, etc. Similarly, conjunctions such as да, де, мен, бірақ, ғой, қарай, ғана, және, тағы and pronouns such as бұл, өз, сол, ол, осы, бар, мен are among the most common words in this Frequency Dictionary. Since this frequency dictionary was derived from the texts of literary works, the content words in it are not necessarily common in other kinds of text and may not appear there at all.
Now we compare the oral language texts with the dictionaries above. Table 5 gives an excerpt from the beginning of the frequency dictionary of oral language texts, where the most common words are listed.
Table 5-excerpt from the frequency-alphabetic dictionary of words in the texts of the oral language
№ Word/word class Absolute frequency Cumulative text coverage (%)
1 де/ет 1429 2.8655
2 бол/ет 1385 5.6428
3 мен/ес 1307 8.2637
4 ол/ес 1101 10.4714
5 сол/ес 813 12.1017
6 ал/ет 744 13.5936
7 айт/ет 705 15.0073
8 ғой/шл 603 16.2165
9 бір/са 575 17.3695
10 кел/ет 566 18.5045
11 да/шл 534 19.5753
12 енді/үс 531 20.6401
13 өзі/ес 511 21.6648
14 біз/ес 508 22.6834
15 керек/мд 406 23.4976
16 жоқ/мд 399 24.2977
17 кет/ет 377 25.0536
18 бер/ет 354 25.7635
19 бар/ес 351 26.4673
20 бар/ет 344 27.1572
21 кез/зт 340 27.8389
22 кейін/үс 306 28.4525
23 жүр/ет 303 29.0601
24 аз/ес 302 29.6657
25 осы/ес 297 30.2613
26 не/ес 296 30.8548
27 отыр/ет 281 31.4183
28 тұр/ет 273 31.9658
29 бұл/ес 269 32.5052
30 бала/зт 259 33.0245
31 жатыр/ет 257 33.5399
32 жаңағы/сн 254 34.0492
In the oral language frequency dictionary the verb "bol" occupies the second position in frequency, which differs slightly from the other dictionaries. In the written language frequency dictionaries "bol" has the highest frequency, whereas in the oral language frequency dictionary the verb "de" has the highest frequency and "bol" has shifted to second place.
However, we believe that the appearance of "de" in the first position in the oral language dictionary is also linked to its homonymous function-word form (де/шл), with which it forms a homonymous series. Therefore, we consider the maximum frequency of "bol" to be a phenomenon characteristic of both the written language frequency dictionaries and the oral language frequency dictionary.
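The rank shift described above can be read directly off the tables. As a minimal Python sketch, the lists below hold the first five entries of Table 2 (written language) and Table 5 (oral language) in rank order; the function name is illustrative only.

```python
# First five entries, in rank order, copied from Table 2 and Table 5.
WRITTEN_TOP = ["бол/ет", "де/ет", "бұл/ес", "е/ет", "өз/ес"]
ORAL_TOP = ["де/ет", "бол/ет", "мен/ес", "ол/ес", "сол/ес"]


def rank_of(entry, ranked_entries):
    """Return the 1-based rank of entry, or None if it is not in the list."""
    try:
        return ranked_entries.index(entry) + 1
    except ValueError:
        return None


for entry in ("бол/ет", "де/ет"):
    written = rank_of(entry, WRITTEN_TOP)
    oral = rank_of(entry, ORAL_TOP)
    print(f"{entry}: written rank {written}, oral rank {oral}")
```

Only the two verbs swap places: the remaining entries in the top five of the oral list are pronouns, while the written list also contains the auxiliary е/ет.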
In the frequency dictionary of spoken language, apart from a few high-frequency words, the most commonly used words are auxiliary verbs, pronouns, and conjunctions, as in the written language frequency dictionaries. For instance: auxiliary verbs - бол, де, ал, кел, кет, бер, бар, жүр, отыр, тұр, жатыр, көр, шық, қал, қой, біл, жаса; pronouns - ол, сол, өзі, біз, сіз, осы, бұл, сен, мына, ана; conjunctions - ғой, да, де, ба, бірақ, үшін [5].
From the table above, it can be seen that, in addition to the auxiliary verbs, pronouns, and conjunctions that are most common in all the frequency dictionaries mentioned, the frequency dictionary of spoken language often includes words that are typical of spoken communication: for instance, жаңағы, ғой, сондай, айт, енді, керек, жоқ, кез, жер, бәрі, қазір, жақсы, емес, жақ, etc.
The content words found at high frequencies in the frequency dictionaries of written language differ from those found at high frequencies in oral language. In the frequency dictionary of 500,000 words (mainly consisting of the texts of works by M. Auezov and A. Kekilbayev), the words "house", "sun", "eye", "child", "head", "word", "village", and "horse" are frequently used. In the "Frequency Dictionary of the Kazakh Language for General Education", the words "year", "person", "Kazakh", and "education" are also commonly used. Finally, in the combined "Frequency Dictionary of the Kazakh Language", the words "word", "sun", "earth", "house", "head", "person", "child", and "people" are often used.
If we compare the frequently occurring lexical items listed above, it can be observed that in oral communication the word жаңағы ("the aforementioned") is frequently used as a means of connecting thoughts, along with affirmations such as "well" and "yes" and negations such as "no". Additionally, the adverbs "now", which refers to the present moment, and "then", which introduces what follows, are frequently used. Apart from these, the high-frequency words of spoken language are the same auxiliary verbs, conjunctions, and pronouns that are found in the written language dictionaries. Therefore, this should not lead to the assumption that we frequently use only these specific words in our everyday language: alongside the features inherent in oral language, the features of written language are also present. In summary, both written and spoken language share a high frequency of use for linguistic units that serve to connect, to point, or to support grammatical meanings.
Table 6-comparison table of frequency dictionaries
Columns: 1 - frequency-alphabetic dictionary of M. Auezov's works in 20 volumes; 2 - frequency-alphabetic dictionary of the Kazakh language; 3 - frequency-alphabetic dictionary of the fiction style in Kazakh; 4 - frequency-alphabetic dictionary of the Kazakh language in general education; 5 - frequency-alphabetic dictionary of the oral language
1 2 3 4 5
бол/ет бол/ет бол/ет бол/ет де/ет
де/ет де/ет де/ет ол/ес бол/ет
бұл/ес бұл/ес еді/ет мен/шл мен/ес
е/ет е/ет кел/ет де/ет ол/ес
өз/ес өз/ес бір/са ал/ет сол/ес
да/шл кел/ет ал/ет бер/ет ал/ет
ал/ет ал/ет да/шл және/шл айт/ет
кел/ет ол/ес бұл/ес кел/ет ғой/шл
де/шл да/шл тұр/ет де/шл бір/са
ол/ес де/шл өз/ес бұл/ес кел/ет
сол/ес сол/ес де/шл да/шл да/шл
мен/ес мен/ес сол/ес жыл/зт енді/үс
айт/ет айт/ет отыр/ет бір/ес өзі/ес
бер/ет бір/са жат/ет өз/ес біз/ес
бір/ес бер/ет ол/ес тұр/ет керек/мд
бір/са отыр/ет осы/ес сөз/зт жоқ/мд
отыр/ет осы/ес көр/ет осы/ес кет/ет
тұр/ет жоқ/мд қал/ет адам/зт бар/ет
In addition, the following observations can be made regarding the word classes in all the dictionaries mentioned. There is a high frequency of nouns, verbs, pronouns, and adjectives. Frequency dictionaries that label word classes group words of the same class in one section, and word-class statistics are generated from the reverse alphabet-frequency versions of these dictionaries.
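Because every entry carries its class label after the slash, word-class totals can be obtained by summing frequencies per label. The following is a minimal Python sketch using only the first ten entries of Table 5, so the totals illustrate the procedure and are not the dictionary's actual word-class statistics; the function name is illustrative.

```python
from collections import defaultdict

# First ten entries of Table 5 (oral language): root/word class, frequency.
ORAL_TOP_TEN = [
    ("де/ет", 1429), ("бол/ет", 1385), ("мен/ес", 1307), ("ол/ес", 1101),
    ("сол/ес", 813), ("ал/ет", 744), ("айт/ет", 705), ("ғой/шл", 603),
    ("бір/са", 575), ("кел/ет", 566),
]


def frequency_by_word_class(entries):
    """Sum absolute frequencies per word-class label, largest total first."""
    totals = defaultdict(int)
    for entry, count in entries:
        _, word_class = entry.rsplit("/", 1)  # the label follows the last slash
        totals[word_class] += count
    return dict(sorted(totals.items(), key=lambda item: -item[1]))


class_totals = frequency_by_word_class(ORAL_TOP_TEN)
for word_class, total in class_totals.items():
    print(f"{word_class} {total}")
```

In this small excerpt, verbs (ет) and pronouns (ес) account for most occurrences; nouns do not appear because none of them reaches the top ten of the oral list.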
The statistical analysis of these frequency dictionaries shows a specific pattern for pronouns. Pronouns do not have the highest share of the lexicon, because the language has fewer pronouns than words of other classes. However, in terms of use in texts, pronouns come right after verbs and adjectives, ahead of the other word classes. In other words, pronouns are few in number but are used very often in communication.
It can be inferred that the most frequent word classes in the language are nouns and verbs, as they occupy the first and second positions in terms of both their proportion in the lexicon and their proportion in typical texts. This suggests that not only are these word classes frequently used in our language, but they also constitute a significant portion of its vocabulary.
Based on these data, it can be seen that the most common words in the Kazakh language are nouns and verbs.
This information was obtained from the frequency-alphabetic dictionary of the National Corpus of the Kazakh language, which was compiled from the texts of the main corpus, containing 30 million words (Table 7).
Table 7-frequency dictionary of texts of the National Corpus of the Kazakh language (main corpus)
According to the Corpus frequency dictionary, the pronoun "bir" ("one") is in 1st place, the verb "de" is in 2nd place, and the verb "bol" ("to be") is in 3rd place.
LIST OF USED LITERATURE:
1. Bektaev K. B., Zhubanov A. K., Myrzabekov S., Belbotaev A. B. Frequency dictionaries of texts of M. O. Auezov's works in 20 volumes. - Almaty-Turkestan, 1995. - 346 p.
2. Zhubanov A., Zhanabekova A., Karbozova B., Kozhakhmetova A. Frequency dictionary of the Kazakh language. - Almaty: Publishing house "Kazakh tili", 2016. - 665 p.
3. Frequency dictionary of the Kazakh language in general education. - Almaty, 2016. - 1472 p.
4. A frequency dictionary of a text corpus of 500,000 words based on the works of M. Auezov and A. Kekilbayev. - Electronic version.
5. Zhubanov A., Zhanabekova A., Tokmyrzaev D., Utegenova B. Frequency dictionary of texts of the Kazakh spoken language. - Almaty, 2020. - 168 p.