Muhammad al-Xorazmiy nomidagi TATU Farg'ona filiali "Al-Farg'oniy avlodlari" elektron ilmiy jurnali ISSN 2181-4252 Tom: 1 | Son: 2 | 2024-yil
"Descendants of Al-Farghani" electronic scientific journal of Fergana branch of TATU named after Muhammad al-Khorazmi. ISSN 2181-4252 Vol: 1 | Iss: 2 | 2024 year
Электронный научный журнал "Потомки Аль-Фаргани" Ферганского филиала ТАТУ имени Мухаммада аль-Хоразми ISSN 2181-4252 Том: 1 | Выпуск: 2 | 2024 год
ORGANIZATION OF WORD SEARCH IN UZBEK TEXTS BASED ON BOYER-MOORE-HORSPOOL ALGORITHM
Husniya Akhmedova
Department of Infocommunication engineering TUIT named after Muhammad al Khwarizmi, Tashkent, Uzbekistan
khusniya 150586@gmail. com
Abstract: The development of modern information and communication technologies has led to an increase in the need for means that express, store, edit, transmit and transform information in the form of text. This article analyzes the methods and algorithms used in the implementation of electronic word processing. Also, the application and features of the Boyer-Moore-Horspool algorithm were studied when searching for a given word in Uzbek texts. On the basis of the experiments, the relevant information and conclusions were obtained on evaluating the effectiveness of the algorithm.
Key words: text, algorithm, text processing, Uzbek language, Boyer-Moore-Horspool algorithm, information search.
Introduction. Presenting data as text is one of the most widely used methods, and text is among the most convenient forms of information for computer processing. A text is an ordered collection of words that conveys a specific meaning. Texts used and processed by computers are called electronic texts. A great deal of research on text processing has been carried out worldwide and continues today. For example, the book "125 Problems in Text Algorithms: With Solutions" by Maxime Crochemore, Thierry Lecroq and Wojciech Rytter gives detailed information on the characteristics of elements found in text files, their classification, approaches to preprocessing, and the related models and algorithms [1]. The work of V.V. Dikovitsky and M.G. Shishaev, "Обработка текстов естественного языка в моделях поисковых систем" ("Natural language text processing in search engine models"), discusses a computational-linguistic model and methods for text preprocessing and information retrieval [2].
Also, the textbook by E.I. Bolshakova, E.S. Klyshinsky, D.V. Lande et al., "Автоматическая обработка текстов на естественном языке и компьютерная лингвистика" ("Automatic natural language text processing and computational linguistics", available only in Russian), covers the main issues of computational linguistics, from the theory of linguistic and mathematical modeling to technological solutions [3]. It also considers the classification and clustering of textual information and the fundamentals of the fractal theory of textual information. The list of such studies could be continued at length. The results of past and ongoing research can be seen in the capabilities of modern text editors.
Organization of word search in Uzbek language texts. Encoding systems are used to represent text elements electronically in computers. Such systems assign a binary code to every symbol that can occur in an electronic text [4,5].
Type            1      2      3      4      5      6
letter          Aa     Bb     Dd     Ee     Ff     Gg
letter          Hh     Ii     Jj     Kk     Ll     Mm
letter          Nn     Oo     Pp     Qq     Rr     Ss
letter          Tt     Uu     Vv     Xx     Yy     Zz
letter+sign     O'o'   G'g'
letter+letter   Shsh   Chch   ng
sign            '
Table 1. Latin alphabet of the Uzbek language
Texts in the Uzbek language are written using the letters and symbols of the Latin script. Table 1 lists the letters of the Uzbek Latin alphabet.
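The grouping in Table 1 means that one Uzbek letter can occupy two characters in an electronic text. The following is a minimal sketch of splitting a word into alphabet letters; the function name `uzbek_letters` and the greedy matching rule are assumptions made for illustration, and the sketch assumes the ASCII apostrophe is used for the sign. Greedy matching can misread some words in which "n" and "g" belong to different letters, so it should be read as an illustration rather than a complete tokenizer.

```python
# Two-character letters from Table 1, written with the ASCII apostrophe.
MULTI = ("o'", "g'", "sh", "ch", "ng")


def uzbek_letters(word):
    """Split a Latin-script Uzbek word into alphabet letters (greedy match)."""
    word = word.lower()
    letters = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in MULTI:
            # A letter+sign or letter+letter combination counts as one letter.
            letters.append(pair)
            i += 2
        else:
            letters.append(word[i])
            i += 1
    return letters


print(uzbek_letters("so'zidan"))    # ['s', "o'", 'z', 'i', 'd', 'a', 'n']
print(uzbek_letters("boshlanadi"))  # ['b', 'o', 'sh', 'l', 'a', 'n', 'a', 'd', 'i']
```

The search algorithms discussed in this article operate on individual characters, so they do not need this split; it only shows how the alphabet maps onto the stored characters.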
Electronic word processing refers to the input, editing, formatting and printing of texts and documents using computers. The process can be complex, depending on how the text is created and designed, on its structure, and on other technical features. With modern technologies it is possible to extract useful information from electronic text, verify it, and process it accordingly. These technologies are built on the methods, models and algorithms used to organize text processing.
Researchers such as A. Norov, Sh. Muradov, B. Akmuradov, U. Khamdamov, Dj. Elov, Dj. Sultanov, I. Narzullayev and M. Mukhiddinov have discussed a number of problems in organizing the processing of Uzbek texts, together with their computational-linguistic solutions [6-10]. However, many issues that arise in Uzbek word processing remain unsolved. Small texts can be read and analyzed without special tools, but extracting and analyzing the necessary information from a large amount of text takes a lot of time and resources.
To solve this problem, algorithms that search for a given string of characters in a text are used. Examples include the traditional sequential search algorithm, which compares one character at a time; the Boyer-Moore-Horspool algorithm [11], which searches using the character values listed in a comparison table; the Rabin-Karp algorithm [12], which searches using hash values of the pattern and of text fragments; and the Aho-Corasick algorithm [13], which searches using a prefix tree. Each of these algorithms has its own capabilities, and high efficiency can be achieved by using one or more of them according to the characteristics of the computer and of the given text.
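As a point of reference for the traditional approach, a minimal sketch of sequential search is given here. The function name `naive_search` is an assumption made for illustration; the sample text is the Uzbek sentence used in the search example of this article.

```python
def naive_search(text, word):
    """Return the index of the first occurrence of word in text, or -1."""
    n, m = len(text), len(word)
    # Try every possible starting position, moving one character at a time.
    for start in range(n - m + 1):
        k = 0
        # Compare character by character from left to right.
        while k < m and text[start + k] == word[k]:
            k += 1
        if k == m:
            return start
    return -1


text = "Berilgan matn assalom so'zidan boshlanadi"
print(naive_search(text, "assalom"))  # 14
```

Because the word is shifted by only one position after every mismatch, each text position may be examined several times; the table-based algorithm described next avoids part of this work.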
Methods. The Boyer-Moore-Horspool algorithm is used to find a sequence of M symbols in a string of N symbols, where 0 < M < N.
The process of searching for a selected word in a given text can be explained with the following example. Suppose the word M = "assalom" must be found in the text N = "Berilgan matn assalom so'zidan boshlanadi" ("The given text starts with the word assalom").
The search process is organized in two stages. In the first stage, a comparison table d is created. To build it, characters are removed from the end of the searched word one at a time, and the character left at the end after each removal is assigned the number of characters removed so far, so the numbers increase from the end of the word toward its beginning.
Depending on the structure of the word, some additional rules apply when creating the comparison table. For example, if the selected word is "assalom", the comparison table is formed in the following order.
Step 1. The last character of the word, "m", is removed, and the number 1 is assigned to the letter "o", which is now at the end of the remaining characters. The number 1 means that one character has been removed.
a s s a l o m
- - - - - 1 -
Step 2. When the second character is removed from the end of the word, the number 2 is assigned to the letter "l" remaining at the end.
a s s a l o m
- - - - 2 1 -
Step 3. When the third character is removed from the end of the word, the number 3 is assigned to the letter "a" remaining at the end.
a s s a l o m
- - - 3 2 1 -
Step 4. When the fourth character is removed from the end of the word, the number 4 is assigned to the letter "s" remaining at the end.
a s s a l o m
- - 4 3 2 1 -
Step 5. According to the rule, when the fifth character is removed from the end of the word, the number 5 would be assigned to the letter "s" remaining at the end. However, since the letter "s" has already received a number at a position closer to the end of the word, that number is reused, that is, the number 4 is assigned.
a s s a l o m
- 4 4 3 2 1 -
Step 6. In the same way as in step 5, the letter "a" is assigned the number 3.
a s s a l o m
3 4 4 3 2 1 -
Step 7. If the character at the end of the word occurs only once in the word, its number is equal to the length of the word. If this character occurs more than once in the word, it is given the number of its occurrence closest to the end, in accordance with the rules discussed above. In the example under consideration, the letter "m" occurs once in the word, so its number is 7, the length of the word.
a s s a l o m
3 4 4 3 2 1 7
The comparison table, with repeated letters merged into a single entry, is shown in Table 2 below.
Word letters     s   a   l   o   m
d table values   4   3   2   1   7
Table 2. Comparison table for the selected word
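The construction steps above can be sketched in Python as follows. This is a minimal illustration under the rules described in the text, not the author's program; the function name `build_shift_table` is an assumption.

```python
def build_shift_table(word):
    """Return the comparison table d as a dict: character -> shift value."""
    if not word:
        raise ValueError("the searched word must not be empty")
    m = len(word)
    table = {}
    # Every character except the last gets its distance to the end of the
    # word. A later occurrence (closer to the end) overwrites an earlier one.
    for i, ch in enumerate(word[:-1]):
        table[ch] = m - 1 - i
    # The last character gets the word length unless it also occurs earlier,
    # in which case the value already stored for it is kept.
    table.setdefault(word[-1], m)
    return table


print(build_shift_table("assalom"))
# {'a': 3, 's': 4, 'l': 2, 'o': 1, 'm': 7}
```

A dictionary is used here so that only characters that actually occur in the word are stored; characters absent from the table are treated during the search as having a shift equal to the word length.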
At the second stage, the string is searched for using the values of the comparison table, as follows:
- the searched word is aligned with the beginning of the given text;
- the characters are compared starting from the last character of the word and moving toward its beginning;
- if all characters match, the searched word has been found at this position. If a character does not match, the word is shifted by a step equal to the comparison-table value of the text symbol aligned with the last character of the word, and the comparison is made again.
N  Berilgan matn assalom so'zidan boshlanadi
1  assalom                                    a=3
2     assalom                                 m=7
3            assalom                          s=4
4                assalom                      found
Figure 1. Boyer-Moore-Horspool word search process
If the text symbol that determines the shift is not in the comparison table, the word is shifted by a step equal to its length. If the end of the word matches but an earlier part does not, the word is shifted by the table value of the text symbol aligned with its last character, and the process continues until the word is found or the text is exhausted.
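The second stage can be sketched as follows. This is a minimal illustration under the rules stated above; the function name `horspool_find` and the returned list of shifts are assumptions added so that the trace can be compared with Figure 1.

```python
def horspool_find(text, word):
    """Return (index, shifts): index of the first match or -1, and the
    list of shift steps taken before the search stopped."""
    n, m = len(text), len(word)
    if m == 0 or m > n:
        return -1, []
    # Comparison table: distance from each character to the end of the word.
    table = {ch: m - 1 - i for i, ch in enumerate(word[:-1])}
    shifts = []
    pos = 0
    while pos <= n - m:
        # Compare from the last character of the word toward the first.
        k = m - 1
        while k >= 0 and text[pos + k] == word[k]:
            k -= 1
        if k < 0:
            return pos, shifts
        # Shift by the table value of the text symbol under the last
        # character of the word; unknown symbols shift by the word length.
        step = table.get(text[pos + m - 1], m)
        shifts.append(step)
        pos += step
    return -1, shifts


text = "Berilgan matn assalom so'zidan boshlanadi"
print(horspool_find(text, "assalom"))  # (14, [3, 7, 4])
```

The shifts 3, 7 and 4 correspond to the values a=3, m=7 and s=4 shown in Figure 1, after which the word is found at position 14 (counting from 0).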
Software. In most cases, the best way to study and evaluate the general characteristics of an algorithm is to develop and test software that implements it. The algorithm in question can be evaluated in this way.
The first step of the process is to create a comparison table, which is created separately for each search word, and the values of this table differ according to the characteristics of the word. Figure 2 below shows the algorithm for creating a comparison table in the form of a block diagram.
Figure 2. Algorithm block diagram for creating a comparison table for the selected word
Here, word is the searched word, s is the length of the searched word, and a{} is the comparison table object. The algorithm can be summarized as follows. First, the variables are declared. Then values are assigned by removing one character at a time from the end of the word toward its beginning. If a symbol occurs more than once in the word, the value of the occurrence closest to the end of the word is used.
If the last symbol of the word also occurs earlier in the word, it receives the value of that earlier occurrence; otherwise it is assigned a value equal to the word length. Repeated symbols are merged into a single entry to form the comparison table.
At the second stage of the search, using a comparison table, the position of the first character of the word that matches the given pattern is searched for in a large volume of text. The following Figure 3 shows the word search algorithm in the form of a block diagram.
Figure 3. Block diagram of the word search algorithm
Here, text is the large text in which the word is searched for; word is the searched word; s is the length of the searched word; start is the cursor position in the text from which the search begins; and the remaining variable holds the number of characters that match the pattern.
As mentioned above, the algorithm is applied when the searched word is shorter than the text; otherwise it is reported that the word does not occur in the text.
The word is found by shifting the pattern by the values in the comparison table, depending on whether the symbols of the text match those of the word. Because the searched word can occur more than once in the text, the process can be repeated until the end of the text is reached.
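Repeating the search until the text ends can be sketched as follows. This is a minimal illustration; the function name `horspool_find_all` and the sample text with two occurrences are made up for the example, and the window is compared with a slice rather than character by character to keep the sketch short.

```python
def horspool_find_all(text, word):
    """Return the starting positions of every occurrence of word in text."""
    n, m = len(text), len(word)
    if m == 0 or m > n:
        # The word cannot occur in a text shorter than itself.
        return []
    table = {ch: m - 1 - i for i, ch in enumerate(word[:-1])}
    positions = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == word:
            positions.append(pos)
        # After a match or a mismatch, shift by the table value of the
        # text symbol aligned with the last character of the word.
        pos += table.get(text[pos + m - 1], m)
    return positions


print(horspool_find_all("assalom, assalom!", "assalom"))  # [0, 9]
```

Continuing with the same shift rule after a match keeps the search going without returning to text positions that have already been examined.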
Results. The complexity of an algorithm is usually measured in terms of execution time or memory used. In both cases, the complexity depends on the size of the input data [14].
The time complexity of an algorithm is expressed as a function relating the length of the input string to the running time of the algorithm. It is usually written in big-O notation, where only the highest-order term is kept. For the Boyer-Moore-Horspool algorithm, building the comparison table takes O(M + |Σ|) time, and the search takes O(N/M) time in the best case and O(N·M) time in the worst case, with average behavior close to linear in N. Here, Σ represents the set of characters that can occur in the given text and the searched word. In most cases, Uzbek language texts use letters of the Latin alphabet, digits and symbols with various meanings [15].
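To make the difference in work concrete, the sketch below counts character comparisons for sequential search and for Boyer-Moore-Horspool search on the example sentence. The function names are hypothetical, and the counts describe only this short example; they are not a general complexity measurement.

```python
def count_naive(text, word):
    """Character comparisons made by sequential search up to the first match."""
    n, m = len(text), len(word)
    comparisons = 0
    for start in range(n - m + 1):
        k = 0
        while k < m:
            comparisons += 1
            if text[start + k] != word[k]:
                break
            k += 1
        else:
            return comparisons  # all m characters matched
    return comparisons


def count_horspool(text, word):
    """Character comparisons made by Horspool search up to the first match."""
    n, m = len(text), len(word)
    table = {ch: m - 1 - i for i, ch in enumerate(word[:-1])}
    comparisons, pos = 0, 0
    while pos <= n - m:
        k = m - 1
        while k >= 0:
            comparisons += 1
            if text[pos + k] != word[k]:
                break
            k -= 1
        else:
            return comparisons  # all m characters matched
        pos += table.get(text[pos + m - 1], m)
    return comparisons


text = "Berilgan matn assalom so'zidan boshlanadi"
print(count_naive(text, "assalom"), count_horspool(text, "assalom"))  # 23 11
```

Only the table lookups for the shift are left uncounted here; they are dictionary accesses rather than comparisons between text and word characters.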
The Boyer-Moore-Horspool algorithm is among the most efficient string search algorithms because the size of each jump is determined by the values in the comparison table. Experiments show that it is several times faster than the traditional sequential comparison algorithm. Table 3 below shows the time taken to find a word located at different positions in a 50,000-word text.
№    Position   Simple, time (ms)   BMH, time (ms)
1.   11431      10.2                1.1
2.   12531      10.4                1.3
3.   28392      11.0                2.1
4.   35425      11.7                2.4
Table 3. Experimental results
The values presented in this table may vary with the capabilities of the computer used for the experiment and with the size of the given text. However, the relative advantage of the Boyer-Moore-Horspool algorithm over the simple algorithm is largely preserved, and the absolute difference between the times grows sharply as the size of the text increases sharply. A graphical representation of the data in Table 3 is given in Figure 4 below.
Figure 4. The graph of the position of the word in the text versus the time it took to find it
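A comparison of this kind can be reproduced with a small timing sketch such as the following. It is not the program used for the experiment: the function names, the synthetic text and the number of repetitions are assumptions, and the measured times will differ from those in Table 3 depending on the computer and the Python version.

```python
import timeit


def naive_search(text, word):
    """Sequential search: test every position, shifting by one each time."""
    n, m = len(text), len(word)
    for start in range(n - m + 1):
        if text[start:start + m] == word:
            return start
    return -1


def horspool_search(text, word):
    """Boyer-Moore-Horspool search using a comparison (shift) table."""
    n, m = len(text), len(word)
    table = {ch: m - 1 - i for i, ch in enumerate(word[:-1])}
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == word:
            return pos
        pos += table.get(text[pos + m - 1], m)
    return -1


# Synthetic stand-in for a large text: 50,000 filler words, target at the end.
text = "matn " * 50_000 + "assalom"
word = "assalom"

# Both algorithms must agree on the position before timing them.
assert naive_search(text, word) == horspool_search(text, word) == 250_000

for name, fn in (("Simple", naive_search), ("BMH", horspool_search)):
    seconds = timeit.timeit(lambda: fn(text, word), number=3) / 3
    print(f"{name}: {seconds * 1000:.1f} ms")
```

Both functions compare the current window with a slice so that the timing difference comes from the number of positions visited rather than from how a single window is compared.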
Conclusions. Today, many modern text processing and search systems use hybrid combinations of several algorithms. Which algorithm to use depends on the type and complexity of the situation that arises in organizing the process. It should be taken into account that each algorithm has its own advantages, and in some situations the highest efficiency can be achieved with the simplest algorithm. The graph obtained in this study shows that the Boyer-Moore-Horspool algorithm is several times more efficient than the traditional search method when searching for a word in a given Uzbek language text.
References
1. Maxime Crochemore, Thierry Lecroq, Wojciech Rytter. "125 Problems in Text Algorithms: With Solutions". Cambridge University Press, 2021.
2. V.V. Dikovitsky, M.G. Shishaev. "Обработка текстов естественного языка в моделях поисковых систем". Сборник научных трудов, 2010.
3. E.I. Bolshakova, E.S. Klyshinsky, D.V. Lande, A.A. Noskov, O.V. Peskova, Ye.V. Yagunova. "Автоматическая обработка текстов на естественном языке и компьютерная лингвистика". М.: МИЭМ, 2011. P. 272. ISBN 978-5-94506-2948
4. Julie D. Allen, Deborah Anderson, Joe Becker and others. "The Unicode Standard Version 7.0 - Core Specification". Includes bibliographical references and index. ISBN 978-1-936213-09-2. http://www.unicode.org/versions/Unicode7.0.0/
5. Ian Waters. "ASCII Table". In: PowerShell for Beginners. Apress, Berkeley, CA, 2021. https://doi.org/10.1007/978-1-4842-7064-6_13
6. Akmuradov B., Khamdamov U., Elov Dj., Sultanov Dj., Narzullayev I. Organization of initial text processing in the Uzbek language synthesizer // International Conference on Information Science and Communications Technologies (ICISCT 2021). 4-6 November, Tashkent, 2021. 5 p.
7. Akmuradov B., Khamdamov U., Mukhiddinov M., Zarmasov E. A novel algorithm for dividing Uzbek language words into syllables for concatenative text-to-speech synthesizer // International Journal of Advanced Trends in Computer Science and Engineering. Volume 9, No. 4, July-August 2020. P. 4657-4664
8. Akmuradov B.U. Nutq sintezatorining o'zbek tili lotin alifbosidagi elektron matn elementlarini tahrirlash algoritmi // МУҲАММАД АЛ-ХОРАЗМИЙ АВЛОДЛАРИ. Scientific-practical and information-analytical journal. № 1(15), March 2021. P. 8-16
9. Norov A. The numeral modeling of separating Uzbek words into syllables // «TurkLang-2018». VI International Conference on Computer Processing of Turkic Languages. Tashkent, October 18-20, 2018. P. 43-48
10. Norov A.M., Muradov Sh.A. Матн транслитерациясига оид алгоритмларни куришда компьютер лингвистикасининг зиддиятли масалалари // References: Contemporary Learning Principles and Publication Issues. A collection of articles. Scientific and methodological publication. Karshi, 2013. P. 114-119
11. A.V. Zheludkov, D.V. Makarov, P.V. Fadeev. "Исследование алгоритмов точного поиска подстроки в строке". International scientific journal «Символ Науки», №11-3/2016. ISSN 2410-700X
12. Harutyunyan E.A., Galstyan D.M. "Анализ алгоритмов поиска строкового шаблона Рабина-Карпа и Кнута-Морриса-Пратта" // Information technology as a basis for progressive scientific research. Collection of international scientific and practical conferences. Уфа, 2022
13. Saima Hasib, Mahak Motwani, Amit Saxena. "Importance of Aho-Corasick String Matching Algorithm in Real World Applications" // (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 4 (3), 2013. P. 467-469
14. Deshko I.P., Tsvetkov V.Ya. "Оценка сложности алгоритма". Journal "Славянский Форум", 3 (33), 2021. P. 38-49
15. Akmuradov B., Mukhiddinov M., Sultanov Dj., Narzullayev I. A methodology of differentiation based on grouping of symbols in the electronic text in Uzbek language synthesizer // International Scientific and Practical Conference "Modern Scientific Solutions to Actual Problems". Russia, Rostov-on-Don, 2020. P. 171-174