


ANALYSIS OF MODERN TRENDS OF INFORMATION SEARCH IMPROVEMENT

Tereschenko V.

postgraduate student of Department of Informatics and Higher Mathematics Kremenchuk Mykhaylo Ostrohradskiy National University, Ukraine


Abstract

In the modern conditions of the development of Internet information technologies and search engines, there is a need for new methods of providing effective information search. Accordingly, the paper analyzes the principles of the functioning of information search systems and, based on present-day requirements, highlights the most important development trends. Against this background, a number of scientific works in the field of information search are analyzed. The study examines the Google ranking factors published in 2019, considers the prospect of using the vector space model (VSM), improves the outdated SeoRank method, and investigates the prospect of using the precedent (case-based) methodology for improving search methods and, in particular, for building information retrieval systems. It is emphasized that organizing search on the basis of precedents combines different approaches to solving the problem of intellectualization and personalization of search.


Keywords: search engine optimization, search engine, search engine results, information search.

Introduction. In the conditions of scientific and technological progress and the development of Internet technologies, there is an enormous increase in the amount of available information that can be used to solve important tasks in research activities and to support decision-making in the scientific, technical, social and other spheres [1]. Effective analysis of this information and its application in making strategic decisions gives an advantage in the development not only of the modern economy, but also of science and technology.

Search engines develop in different directions: new ranking factors appear or their priorities change, requirements for the quality of sites and of the links referring to them increase (new anti-spam algorithms appear), the format of interaction between the search engine and the user changes, and new services appear that simplify the search for information. Since the requirements for search speed and the relevance of information increase with every passing day, the requirements for methods and algorithms for searching and presenting information increase as well. The process of searching for and displaying information on the Internet has a number of features, the main of which are the huge number of web resources, the need to take into account the semantic peculiarities of information, the influence of a large number of factors on the search, and the need to take into account the features of hypertext markup and metainformation [1].

The object of study is the process of organizing an information search.

The subject of study is the modern methods of information search.

The purpose of the work is a significant improvement in the results of information search.

Problem statement. The essence of information retrieval in the general case is that the search engine selects, from the set of documents located in its database, those that satisfy the information need and correspond to the information request (that is, are relevant) according to certain criteria [4].

To date, there are many methods and algorithms for information retrieval, but the continuous development of this field and the growth of data volumes require continuous improvement of existing methods and the development of qualitatively new approaches. Accordingly, the problem of improving information search methods is relevant.

Materials and methods. Many scientists have worked on information search problems: Ashmanov I.S. [1], Kolysnychenko D.M. [2], Krokhina O.I. [3], Manning K.D. [4], Klimchuk S.O. [5], et al. For example, in his book I.S. Ashmanov [1] summarizes the experience of well-known specialists and SEO professionals; his analysis of the principles of operation of search engines deserves particular attention.

D.M. Kolisnychenko [2] describes in detail the algorithms and methods of using the most popular search engines of today's Internet - Google, Yandex and Rambler. In addition, the author examines how to develop one's own Google-based applications: Google's personalized search engines. Although the work of O.I. Krokhina [3] is aimed at SEO copywriters, Internet marketers, search engine optimization specialists, webmasters and site owners, it also discusses the general principles of search engine algorithms, and it is on this basis that she explains how to write text for a site that will be equally well received by users and will provide high positions in search engine results. The textbook by K.D. Manning [4] is conceived as an introductory course in information search and is written from the standpoint of computer science; along with classic search, it considers web search, the principles of search engines, and the classification and clustering of texts. The book contains a modern account of all aspects of designing and implementing systems for collecting, indexing and searching documents, of methods for evaluating such systems, and an introduction to machine learning methods.

In his work [5], S.O. Klimchuk explains the principles, important from the researcher's point of view, of organizing a case-based reasoning system. In particular, the advantages of the precedent methodology in the creation of intelligent decision support tools are analyzed. The publication deserves attention given the possibility of applying this methodology to building an information search system.

Obviously, the problem is widely discussed by the scientific community. However, despite a significant number of researchers' publications, the problem of improving information search methods has not been completely solved and remains relevant.

In accordance with the general principles of organizing information search, at the heart of every search method and its algorithms lies an implementation model, which is used to refine the search strategy [2]. Thus, we can say that, formally, for a search algorithm such a model is a mathematical representation capable of mapping any relevant object in the information retrieval system to any criteria for its use by the system in order to perform a search task.

The problem is that the differences between existing search algorithms have generated a significant variety of models [2]. If a model is fairly general, then the corresponding search algorithm will be useful only for a very superficial conceptualization of information search. On the other hand, if the model is defined deeply enough to cover all possible aspects of the system, then a problem arises in the complex description of the principles of organization, which creates difficulties for further improvement of the algorithm. Thus, it is expedient to create an improved algorithm for implementing information search whose model satisfies both the criterion of generality and the criterion of depth.

Given the significant accumulation of information together with the progressive growth of its volume, and the importance of advances in the field of information search, the issue of developing and improving search algorithms and methods remains relevant.

Information search optimization appeared along with the progress of search engines [1]. At that time, search engines attached great importance to aspects that site owners could easily manipulate: text on a page, keywords in meta tags, and other internal factors. This led to a situation where, in the results of many search engines, the first few pages were occupied by sites devoted entirely to advertising [2].

The idea of automated processing of text information with the help of electronic computers arose in the middle of the XX century. The development of computational linguistics contributed to the integration of mathematical methods (first of all, statistics and discrete mathematics) and linguistics for solving applied problems in the analysis of textual information [3].

With the advent of Google PageRank [11], more attention began to be paid to external factors, which helped Google become the global leader in search and made it more difficult to optimize using only the text on the site. For a long time, PageRank was one of the most important Google ranking algorithms [1]. The algorithm is applied to a collection of documents linked by hyperlinks (such as web pages of the World Wide Web) and assigns each of them a numerical value that measures its "importance" or "authority" among the other documents.

The more links pointed to a page, the "more important" it was. In addition, the weight transmitted by a link from page A to page B depended on the weight of page A itself. Thus, PageRank was a method of calculating the weight of a page by counting the importance of the references to it [3].
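To illustrate the idea described above, the following is a minimal sketch of a PageRank-style power iteration over a toy link graph; the damping factor of 0.85, the iteration count and the graph itself are assumptions made for the example and are not taken from the cited sources.

```python
# Minimal PageRank sketch: the more (and the "heavier") the pages linking to a
# page, the larger its score. The damping factor 0.85 is a commonly used default.
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping a page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}              # start from a uniform score
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:                         # dangling page: spread its weight evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share        # a link passes part of the source's weight
        rank = new_rank
    return rank

if __name__ == "__main__":
    toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    for page, score in sorted(pagerank(toy_graph).items(), key=lambda x: -x[1]):
        print(page, round(score, 3))
```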

In the light of recent efforts by well-known search engines to combat purchased links, cheat code and other manipulative methods that artificially raise the ranking of a web resource in the search results, the role of so-called "behavioral factors" as elements of promoting a site into the TOP of a search engine has grown significantly.

Behavioral factors are indicators that characterize the user's interaction with the search engine and their behavior directly on the site. Their main task is to improve the quality of search result ranking [1].

The influence of behavioral factors on ranking in search engines is undeniable. The main behavioral factors include [12]:

1. Clicks in the search engine results. The greatest value is attached to the first and the last click, since, according to Yandex algorithms, these are usually considered the most relevant.

2. Visits to the resource, which indicate its popularity and demand in the network space.

3. Time spent both on the site as a whole and in its separate sections. This is one of the most important criteria for assessing the quality of a resource, since a good portal keeps visitors for a long time, while users leave bad ones almost immediately.

4. Depth of viewing, calculated from the number of pages viewed. This factor depends on the previous one: the more time a user spends on a site and the more pages are viewed, the better the search engine will consider the resource.

5. Return index - the number of users who form a permanent target audience.

6. Bounce rate, which covers users who did not view more than one page. This factor is considered negative, indicating low site quality and irrelevance to the subject. It is also important to note that this criterion cannot be considered decisive, because visitors could leave the page not only because it did not meet their expectations, but also because they immediately found the answer to their question.

7. Links to the site, i.e. the number of referrals to the web site not only from search engines but also from other sources (social networks, etc.).

Usually site owners analyze behavioral factors using web analytics systems connected directly to the site, the most popular of which are Yandex.Metrica and Google Analytics [4]. This fact must be taken into account when choosing the principles by which a search algorithm will work. However, when the owner of a web resource uses several counters simultaneously (for example, Google Analytics and Yandex.Metrica), a noticeable difference appears in the calculated statistics.

The main reasons for such differences are that the analytics systems operate with different data and count the same indicators in different ways. Yandex.Metrica and Google Analytics may display different data for a variety of reasons [1]:

1. The counters are installed in different places in the HTML code. For example, if one counter is placed in the <head> tag and the other just before the closing </body> tag, then most users who did not wait for the page to load will not appear in the statistics of the second counter.

2. Wrong time zone setting. In Yandex.Metrica and Google Analytics the time zone used for calculating statistics can be set in the counter settings. If different time zones are specified, the statistics will differ.

3. Filter settings. Different filter rules lead to differences in the displayed data.

In addition, there are differences in the understanding of terminology [4]. The average user is accustomed to treating the visit counts of the two systems as equivalent, but this is not correct. In Yandex, visits are the number of sessions of user interaction with the site during which one or more pages are viewed; a visit ends 30 minutes after the user's last activity. In Google, a session includes all usage data (page views, transactions, etc.).
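As a small illustration of the visit definition mentioned above, the following sketch groups a user's page-view timestamps into visits using the 30-minute inactivity rule; the timestamps, the function name and the data layout are assumptions made for the example, not the actual logic of either analytics system.

```python
from datetime import datetime, timedelta

def split_into_visits(view_times, timeout_minutes=30):
    """Group page-view timestamps into visits: a new visit starts when the gap
    since the previous page view exceeds the inactivity timeout (30 min here)."""
    visits, current = [], []
    for t in sorted(view_times):
        if current and t - current[-1] > timedelta(minutes=timeout_minutes):
            visits.append(current)
            current = []
        current.append(t)
    if current:
        visits.append(current)
    return visits

if __name__ == "__main__":
    views = [datetime(2019, 5, 1, 10, 0), datetime(2019, 5, 1, 10, 10),
             datetime(2019, 5, 1, 11, 5)]          # the 55-minute gap starts a second visit
    print(len(split_into_visits(views)))            # -> 2
```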

The same applies to the bounce rate. In Google, this is the percentage of visits during which no more than one page was opened. In Yandex, it is the share of visits within which there was only one page view.

However, when determining relevance, search engines first of all pay attention to how many times a page contains the phrase that matches the user's query. This parameter is called the keyword frequency: the higher it is, the more relevant the page is considered. Until recently, optimizers deliberately increased the frequency of keywords to the point of making the text unreadable. Currently, search engines actively combat such methods and lower the ranking when they are detected.

To determine the frequency of keywords, special mathematical algorithms are used, which calculate the ratio of the number of keyword occurrences to the volume of the resulting text. The optimal ratio is considered to be 3-5%. Since search engines are not able to evaluate texts in terms of readability, this circumstance allows optimizers to increase the frequency of keywords up to a certain limit which, on the one hand, violates the rules of using search engines and, on the other hand, does not exceed the criteria established by them.
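As a simple illustration of the frequency calculation described above, the sketch below computes the share of keyword occurrences in a text and compares it with the 3-5% range mentioned in the text; the naive tokenization and the sample sentence are assumptions made for the example.

```python
import re

def keyword_density(text, keyword):
    """Share of words in the text that match the keyword (case-insensitive)."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w == keyword.lower())
    return hits / len(words)

if __name__ == "__main__":
    sample = "Search engines rank pages. Search quality depends on many factors."
    density = keyword_density(sample, "search")
    print(f"{density:.1%}")                          # 2 of 10 words -> 20%
    print("within 3-5% range:", 0.03 <= density <= 0.05)
```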

Nowadays, a rather important problem in the field of information search is the design of intelligent systems oriented towards open and dynamic databases [4]. Such systems are based on the integration of adaptable, modifiable and learning models of searching, discovering and operating knowledge, focused on the specificity of the subject area and the corresponding type of uncertainty, reflecting their ability to develop and change their state. Using precedent-based search, different approaches to solving the problem of intellectualization and personalization of search can be combined.

In most encyclopedic sources, the term «precedent» is defined as a case that occurred before and served as an example or justification for subsequent cases of the same kind [5]. Case-Based Reasoning is a technique capable of solving a new or unknown problem by using or adapting the solution to a known problem, that is, by using the experience gained from solving similar problems. The precedent-based approach emerged in the course of research into the creation of expert systems (knowledge-based systems).

Typically, the precedent-based inference process is decomposed into four main stages, which form the so-called precedent-based reasoning cycle [5] (a minimal sketch of this cycle is given after the list of stages below). The main stages of the cycle are:

1) retrieving from the library of precedents the precedent (or precedents) most relevant (similar) to the situation that has arisen;

2) reusing the retrieved precedent in an attempt to resolve the current problem;

3) revising and, if necessary, adapting the obtained solution to the problem;

4) retaining the new solution as part of a new precedent.
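A minimal sketch of the four-stage cycle listed above (retrieve, reuse, revise, retain); the word-overlap similarity measure, the threshold and the structure of the case library are simplifying assumptions made for illustration, not the design proposed in [5].

```python
# Minimal case-based reasoning sketch: retrieve -> reuse -> revise -> retain.
case_library = [
    {"query": "wastewater nitrogen removal methods", "result_urls": ["doc12", "doc7"]},
    {"query": "vector space model information search", "result_urls": ["doc3", "doc19"]},
]

def similarity(query_a, query_b):
    """Share of the words of query_a that also occur in query_b (a crude measure)."""
    a, b = set(query_a.lower().split()), set(query_b.lower().split())
    return len(a & b) / len(a) if a else 0.0

def retrieve(query):
    """Stage 1: pick the stored precedent most similar to the new query."""
    return max(case_library, key=lambda case: similarity(query, case["query"]))

def solve(query, threshold=0.4):
    precedent = retrieve(query)
    if similarity(query, precedent["query"]) < threshold:
        return None                                   # no sufficiently similar precedent
    answer = list(precedent["result_urls"])           # stage 2: reuse the stored solution
    # stage 3 (revise) would adapt `answer` to the new query; omitted in this sketch
    case_library.append({"query": query, "result_urls": answer})   # stage 4: retain
    return answer

if __name__ == "__main__":
    print(solve("information search with vector space model"))
```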

The advantages of case-based reasoning include the following aspects [5]:

1) the ability to directly use the experience gained by the system without intensive involvement of an expert in a particular subject area;

2) the possibility of reducing the time needed to find a solution by reusing an existing solution to a similar task;

3) the possibility of avoiding the repetition of a previously wrong decision;

4) there is no need for a complete and in-depth examination of knowledge regarding a specific subject area.

The disadvantages of precedent-based reasoning include the following [5]:

1) descriptions of precedents are usually limited to superficial knowledge of the subject area;

2) a large number of precedents can lead to a decrease in system performance;

3) it is problematic to define criteria for indexing and comparing precedents;

4) problems with debugging algorithms for determining similar (analogous) precedents;

5) the inability to obtain solutions to problems for which there are no precedents, or for which the degree of similarity is less than a given threshold value.

The main purpose of using the precedent apparatus in an information search system is to return a response to a user query based on precedents that occurred in the past when similar queries were performed. The information about the new query is used to retrieve the most appropriate precedent (or precedents) from the case library. The retrieved precedent is reused to solve the new problem [5]. The proposed solution can then be adapted to the new situation and implemented in practice. If successful, the proven solution, together with the description of the request, forms a new precedent that is stored in the precedent database.

Experiments. Within the framework of the research, theoretical work was carried out: modern methods of information search were analyzed and, based on present-day requirements, the most important aspects were outlined. From the practical point of view, the information retrieval system Google, which returns the web pages of documents found for given keywords, was used for the experimental research. Google was chosen as the researched search engine for two reasons: 1) its popularity and leadership among search engines in the world and in Ukraine in particular; 2) difficulties with access in our country to a competitor (for comparison), the Yandex search engine. The search settings provided the optimal display option of 10 electronic documents per page. Based on generally acknowledged facts (user preferences, behavioral factors, the effect of long search, etc.) and on our own experience, it is known that the required information for a search query should be found within the first 3 pages. Therefore, a limit of 3 full pages was chosen for the calculations, that is, the volume of the researched search engine results was 30 electronic documents.

To carry out experiments on these pages, the computer algebra system Mathcad and the visual and mathematical functionality of Microsoft Office Excel were used to implement the proposed method for assessing the relevancy of a document to a request (the modified SeoRank developed by the author).

Results and discussion. To solve more complex tasks of information search (machine translation, automatic abstracting and other tasks of analytical processing of textual information), it is necessary to use methods of linguistic analysis of texts that make it possible not only to find concepts and key vocabulary, but also to define various connections between them.

Taking into account the aforementioned, a promising approach in terms of generality and depth of working with text is the use of the vector space model (VSM) [4]. The corresponding model describes an information search algorithm based on a modified frequency criterion which, in addition to using word frequency, also takes into account semantic weight, thus improving the quality of the search query. This makes it possible to receive relevant data even if most of the query words are not contained in the context (document), as long as there is semantic similarity between the context and the query. The Vector Space Model (VSM) is a mathematical model [4] representing texts in which each document is mapped to a vector that expresses its content. This representation makes it easy to compare words, look for similar ones, classify, cluster, etc.

In general, there are two basic approaches to semantic search and, indeed, to comparing documents by content. The first approach is based on manually assigning attributes to objects and processing exactly these attributes and the corresponding objects [2]. The second approach, which actually represents value, is based on the opposite idea: instead of complex logical rules, a simple mathematical model is used - statistical analysis of already existing texts. This approach originates in the work on the LSA method (Latent Semantic Analysis) [3]. Later, the method underwent many modifications and became quite popular. Today, Google and many other major search engines use one of the parameters of this method (the tf*idf index) when ranking results [7].

The principle of the search algorithm according to this method is quite simple: the more often two words occur in the same contexts (documents), the closer they are in content.

For LSA, the weight of a term in a particular document is calculated simply as the tf*idf index, which stands for "term frequency * inverse document frequency" [4]. Term frequency is calculated as the number of occurrences of a particular term in a particular document divided by the total number of words in this document.

Document frequency is the number of documents in which the term occurs, divided by the total number of documents. Inverse document frequency, respectively, is the reciprocal of the document frequency, i.e. idf = 1/df. Usually, to mitigate the effect of idf on the overall result, its logarithm is taken instead of the value itself.

Accordingly, in the general case, the term frequency combined with the inverse document frequency (the tf*idf model) is used to calculate the weight $d_i$ of term $i$ in a document:

$$d_i = tf_i \cdot idf_i, \qquad (1)$$

where $tf_i$ is the frequency of occurrence of term $i$ in the document, and $idf_i$ is the inverse frequency of occurrence of term $i$ in the whole collection.

All documents are ranked according to their similarity to the entered request. The absence of common terms in two documents does not necessarily mean that the documents are not semantically similar. Similarly, documents that are relevant to a queried request may not contain such terms.
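To make formula (1) and the similarity-based ranking concrete, here is a small sketch that builds tf*idf vectors for a toy document collection and ranks the documents against a query by cosine similarity; the toy texts, the logarithmic idf (as mentioned above) and the cosine measure are assumptions made for illustration.

```python
import math
from collections import Counter

docs = {
    "d1": "search engine ranking factors and search quality",
    "d2": "vector space model for information search",
    "d3": "wastewater treatment with immobilized microorganisms",
}

def tokenize(text):
    return text.lower().split()

def tf_idf_vectors(documents):
    """Weight d_i = tf_i * idf_i for every term i of every document (formula (1))."""
    tokenized = {name: tokenize(text) for name, text in documents.items()}
    n_docs = len(tokenized)
    df = Counter()                                    # number of documents containing a term
    for words in tokenized.values():
        df.update(set(words))
    idf = {t: math.log(n_docs / df_t) for t, df_t in df.items()}   # log of inverse doc frequency
    vectors = {}
    for name, words in tokenized.items():
        tf = Counter(words)
        vectors[name] = {t: (count / len(words)) * idf[t] for t, count in tf.items()}
    return vectors, idf

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    vectors, idf = tf_idf_vectors(docs)
    query_words = Counter(tokenize("information search model"))
    total = sum(query_words.values())
    query_vec = {t: (c / total) * idf.get(t, 0.0) for t, c in query_words.items()}
    ranking = sorted(((n, cosine(query_vec, v)) for n, v in vectors.items()), key=lambda x: -x[1])
    for name, score in ranking:
        print(name, round(score, 3))
```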

As is known, within the framework of information retrieval the content of documents (such as web pages) is an important characteristic for analyzing and constructing optimal search results [2], since the results should not contain items whose content is duplicated on other pages; the amount of information noise should be minimal, and the main content should be relevant to the subject of the search. Consequently, evaluating web pages for duplication of information and for its novelty is a necessary stage in the construction of optimal information search algorithms.

To solve this task of finding duplicates, the best-suited method is the shingle method [1], the main idea of which is to split the compared texts into overlapping word sequences selected from the text (shingles), for each of which a checksum is calculated.
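A minimal sketch of the shingle idea described above, assuming a shingle length of three words and CRC32 checksums; both choices, as well as the sample pages, are illustrative assumptions rather than parameters taken from the cited work.

```python
import zlib

def shingles(text, size=3):
    """Checksums of all overlapping word sequences (shingles) of the given length."""
    words = text.lower().split()
    return {zlib.crc32(" ".join(words[i:i + size]).encode("utf-8"))
            for i in range(max(len(words) - size + 1, 1))}

def shingle_similarity(text_a, text_b, size=3):
    """Jaccard similarity of the two shingle sets: values near 1.0 indicate near-duplicates."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    return len(a & b) / len(a | b) if a | b else 0.0

if __name__ == "__main__":
    page_1 = "search engines rank documents by relevance to the query"
    page_2 = "search engines rank documents by relevance to the user query"
    print(round(shingle_similarity(page_1, page_2), 2))   # high value -> likely near-duplicates
```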

The proximity measure of two text documents $sim(D_i, D_j)$ was determined on the basis of the probability apparatus, namely as the product of the probability that a random word $w$ is included in document $D_i$ given that it is included in document $D_j$, multiplied by the probability of the occurrence of this word in document $D_j$:

$$sim(D_i, D_j) = P(w \in D_i \mid w \in D_j) \cdot P(w \in D_j). \qquad (2)$$

In this case, the novelty parameter of a new document $D_i$ is

$$New_i = \frac{Rank_i \cdot sim(D_i, PlusDic)}{\log(i+1) \cdot \sum_{j=1}^{N} sim(D_i, D_j)}, \qquad (3)$$

where $N$ is the total number of web documents; $D_j$ is the $j$-th current document; $D_i$ is the $i$-th document; $PlusDic$ is the dictionary; $sim(D_i, D_j)$ is the proximity measure of documents $i$ and $j$; $sim(D_i, PlusDic)$ is the proximity measure of the $i$-th document and the dictionary; $Rank_i$ is the rank of the $i$-th document.
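The following sketch shows how the novelty estimate of formula (3) could be computed from pairwise similarity values; the word-overlap estimate used here in place of the probabilistic measure of formula (2), the toy documents, the dictionary and the rank values are all assumptions made for illustration.

```python
import math

def word_set(text):
    return set(text.lower().split())

def sim(text_a, text_b):
    """Stand-in for formula (2): P(w in A | w in B) * P(w in B), estimated from word sets."""
    a, b = word_set(text_a), word_set(text_b)
    if not a or not b:
        return 0.0
    p_a_given_b = len(a & b) / len(b)
    p_b = len(b) / len(a | b)
    return p_a_given_b * p_b

def novelty(i, documents, plus_dic, ranks):
    """Formula (3): Rank_i * sim(D_i, PlusDic) / (log(i+1) * sum_j sim(D_i, D_j)), i from 1."""
    d_i = documents[i - 1]
    denom = math.log(i + 1) * sum(sim(d_i, d_j) for d_j in documents)
    return ranks[i - 1] * sim(d_i, plus_dic) / denom if denom else 0.0

if __name__ == "__main__":
    documents = ["search engine ranking factors",
                 "ranking factors of modern search engines",
                 "information search with precedents"]
    plus_dic = "search ranking relevance information"
    ranks = [0.92, 0.91, 0.86]                # e.g. relevancy scores of the documents
    for i in range(1, len(documents) + 1):
        print(i, round(novelty(i, documents, plus_dic, ranks), 3))
```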

From the point of view of increasing the reliability of evaluating the relevancy of a document to a request, the author developed, for the first time, a modification of the outdated SeoRank method and justified the expediency of using this advanced SeoRank [2] to determine the relevancy of a document's information blocks (web pages) with respect to its main content, which is presented on the web page as meta tag information. That is, a detailed assessment of the components of the document takes place. Unlike previous methods of assessing relevancy (e.g., PageRank), in calculating the modified form of SeoRank the author does not consider the relevancy of information blocks relative to specific search queries and does not take into account external parameters such as resource interactions, physical availability of resources, compliance with standards, etc., but provides the ability to evaluate information blocks within a specific document (web page) [2].

Formally, the modified SeoRank is calculated as

$$SeoRank = \sum_{i=1}^{4} a_i r_i, \qquad (4)$$

where $r_i$ is the value of parameter $i$ and $a_i$ is the weight of that parameter, with the total weight

$$\sum_{i=1}^{4} a_i = 1. \qquad (5)$$

The results of the calculations in Table 1 clearly show that the use of the method of assessing the relevancy of a document to a request first proposed by the author makes it possible in practice to determine how accurate the search engine's results are.

Table 1
Values of the modified SeoRank (rounded to hundredths) for the first 30 documents returned for three search requests

Document | «Кременчуцька ТЕС» | «Ey^b» | «інформаційний пошук»
1  | 0,92 | 0,93 | 0,95
2  | 0,91 | 0,94 | 0,91
3  | 0,86 | 0,91 | 0,78
4  | 0,85 | 0,88 | 0,81
5  | 0,75 | 0,91 | 0,88
6  | 0,58 | 0,86 | 0,84
7  | 0,66 | 0,79 | 0,78
8  | 0,54 | 0,81 | 0,93
9  | 0,71 | 0,51 | 0,91
10 | 0,52 | 0,78 | 0,82
11 | 0,88 | 0,79 | 0,32
12 | 0,76 | 0,77 | 0,47
13 | 0,83 | 0,62 | 0,38
14 | 0,66 | 0,63 | 0,48
15 | 0,57 | 0,57 | 0,52
16 | 0,81 | 0,83 | 0,74
17 | 0,51 | 0,81 | 0,62
18 | 0,49 | 0,66 | 0,53
19 | 0,77 | 0,89 | 0,81
20 | 0,55 | 0,82 | 0,84
21 | 0,52 | 0,84 | 0,54
22 | 0,44 | 0,46 | 0,33
23 | 0,54 | 0,48 | 0,63
24 | 0,44 | 0,55 | 0,38
25 | 0,41 | 0,67 | 0,41
26 | 0,37 | 0,71 | 0,44
27 | 0,92 | 0,81 | 0,51
28 | 0,41 | 0,33 | 0,47
29 | 0,37 | 0,45 | 0,32
30 | 0,22 | 0,74 | 0,31

Fig. 1. Graphs of the calculated modified SeoRank values for the first 30 documents returned by the Google search engine for three search queries

As can be seen from Table 1, the use of the developed modified SeoRank method makes it possible not only to carry out a detailed assessment of the components of a document for relevance, but also to establish how accurate, from the given point of view, the search engine's output is. Obviously, the resulting figures allow us to conclude that the analogue used by the search engine to solve similar problems is insufficiently accurate.

Given that the Google search engine (which, as mentioned in the article, gives priority to the relevancy factor) forms its results on the principle of "decay" (from the document most relevant to the search query to the least relevant), and given that a wide range of analogues of the method proposed in the article (the outdated PageRank, MozRank, Keyword Ranking, the usual SeoRank) are used to solve this task, the advantage of the developed modified SeoRank over its counterparts is evident. This can be seen from the graphs in Figure 1, which show significant differences between the "falling" order of Google's results and the actual relevance of the documents returned for a search query. In contrast to the existing methods of assessing relevance (for example, PageRank), the modified form of SeoRank provides an opportunity to assess the relevance of information blocks within a specific document (web page), which can significantly improve search results and verify their compliance with the requirements of the search query, as Figure 1 shows.

Accordingly, within the framework of the study and, in particular, in order to improve the outdated SeoRank method, it was established for the first time and proposed that the following parameters should be used to calculate the modified SeoRank [1] (an illustrative sketch combining these parameters with formula (4) follows the list):

1) the relevancy of the web page title ("title") to the text of the information block, r1 - the ratio of the number of occurrences of words from the title in the block text to the total number of words of the block;

2) the relevancy of the keywords of the web page ("meta keywords") to the text of the information block, r2 - the ratio of the number of keyword occurrences in the block text to the total number of words of the block;

3) the relevancy of the words from the description of the web page or document ("meta description") to the text of the information block, r3 - the ratio of the number of occurrences of words from the description in the block text to the total number of words of the block;

4) the relevancy of the headers of the web page or document ("headers") to the text of the information block, r4 - the ratio of the number of occurrences of words from the headings ("H1"-"H6") of the web page to the total number of words from the headings of the block.
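Putting the four parameters above together with formulas (4) and (5), the following sketch computes a modified SeoRank value for one information block. The equal weights a_i (summing to 1, as required by formula (5)), the toy meta data and the helper names are assumptions made for illustration; for simplicity all four ratios are computed against the block's word count, although the article defines r4 against the number of words in the block's headings.

```python
def occurrence_ratio(source_words, block_words):
    """Share of block words that also occur in the given source (title, keywords, ...)."""
    if not block_words:
        return 0.0
    source = {w.lower() for w in source_words}
    return sum(1 for w in block_words if w.lower() in source) / len(block_words)

def modified_seorank(block_text, title, meta_keywords, meta_description, headers,
                     weights=(0.25, 0.25, 0.25, 0.25)):
    """SeoRank = sum_i a_i * r_i over the four relevancy parameters r1..r4 (formula (4)),
    with the weights a_i summing to 1 (formula (5))."""
    assert abs(sum(weights) - 1.0) < 1e-9
    block_words = block_text.split()
    r = [
        occurrence_ratio(title.split(), block_words),              # r1: title vs block
        occurrence_ratio(meta_keywords, block_words),              # r2: meta keywords vs block
        occurrence_ratio(meta_description.split(), block_words),   # r3: meta description vs block
        occurrence_ratio(" ".join(headers).split(), block_words),  # r4: H1-H6 headers vs block
    ]
    return sum(a * r_i for a, r_i in zip(weights, r))

if __name__ == "__main__":
    score = modified_seorank(
        block_text="modern methods of information search and ranking of search results",
        title="Information search methods",
        meta_keywords=["information", "search", "ranking"],
        meta_description="Analysis of modern information search methods",
        headers=["Information search", "Ranking factors"],
    )
    print(round(score, 2))
```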

Conclusions. As a result of the analysis of the Google ranking factors published in 2019, the following conclusions can be drawn regarding the principles of operation of modern search engines:

1. Increased attention is paid to relevant, high-quality content ("Relevant, holistic content is more important than ever") and to the influence of behavioral factors;

2. The effect of keywords is diminishing ("Keywords are becoming increasingly obsolete");

3. Mobile friendliness: creating a working mobile version of the site is an obligatory step in promoting a resource. Since 2015, Google has ranked sites adapted for mobile devices higher;

4. The role of social networks is growing: if there are transitions to the site from social networks, the site will rank higher in search ("Social signals - a bonus for positive rankings").

Based on the review of the current state of research in the field of optimizing methods and algorithms of information search, the following problems are identified: a large amount of duplicate content; no breakdown of web search results by topic; a significant amount of information spam when viewing documents, which greatly affects the time spent searching for and viewing documents. The requirements for search speed and information relevance increase with every passing day, and with them the requirements for methods and algorithms for searching and presenting information [1]. The continuous development of information retrieval and the growth of data volumes require continuous improvement of existing methods and the development of qualitatively new approaches. These and other factors indicate that the problem of developing and improving effective information search techniques in web systems is relevant.

In accordance with the set requirements, the use of the vector space model (VSM) [4] is promising in terms of generality and depth in the context of linguistic analysis of texts. The essence of the model lies in the mathematical representation of texts in which each document is mapped to a vector that expresses its content. At the same time, a necessary stage in the construction of optimal information search algorithms is the evaluation of web pages for duplication of information and for its novelty. To find duplicates, the best-suited method is the shingle method [1], the main idea of which is to split the compared texts into word sequences selected from the text (shingles). From the point of view of increasing the reliability of evaluating the relevancy of a document to a request, it is advisable to use the improved SeoRank [2] method to determine the relevance of the information blocks of a web page relative to the main content that is presented on the web page as meta tag information.

In addition, in the course of the research the prospect of using the precedent methodology within the framework of improving search methods and, in particular, in the construction of the structure of a distributed information retrieval system was established. It is emphasized that organizing search on the basis of precedents makes it possible to combine different approaches to solving the problem of intellectualization and personalization of search, to reduce the load on the search engine index, and to simplify the solution of a number of other problems.

The findings and suggestions of this study can be used in research and teaching. The results can be used to further analyze and refine information search methods in general and in the educational field in particular.

References

1. Ашманов И.С., Иванов А.А. Продвижение сайта в поисковых системах. М.: Вильямс, 2016. 304 с.

2. Колисниченко Д.Н. Поисковые системы и продвижение сайтов. М.: Диалектика, 2014. 272 с.

3. Крохина О.И. Первая книга SEO-копирайтера. Как написать текст для поисковых машин и пользователей / О.И. Крохина, М.Н. Полосина - М.: Инфра-Инженерия, 2015. - 216 с.

4. Маннинг К., Рагхаван П., Шютце Х. Введение в информационный поиск. М.: Вильямс, 2017. 640 с.

5. Климчук С.О. Розроблення прецедентної системи підтримки прийняття рішень. Вісник Національного університету «Львівська Політехніка». 2010. № 689. С. 169-176.

6. Терещенко В.В., Терещенко В.Л. Перспективність вдосконалення систем інформаційного пошуку. Четверта Всеукраїнська науково-практична конференція «ІТ-Перспектива». Кременчук: КрНУ, 2017. С. 26-28.

7. Терещенко В.В. Аналіз сучасних методів інформаційного пошуку. Вісник Кременчуцького національного університету імені Михайла Остроградського. Кременчук: КрНУ, 2018. Випуск 3 (110). С. 26-32.

8. Терещенко В.В. Перспективність вдосконалення інформаційного пошуку за допомогою прецедентів // The scientific heritage (ISSN 9215-0365), Budapest (Hungary). No 41 (2019). С. 47-52.

9. Урвачева В.А. Обзор методов информационного поиска. Вестник Таганрогского института имени А.П. Чехова. 2016. №1. С. 457-463

10. Alexandras N., Mark M. Detecting Spam Web Pages through Content Analysis // Microsoft Research, 2012, - РР. 1-6.

11. Brin S., Page L. The anatomy of a large-scale hypertextual Web search engine // Computer Networks and ISDN Systems, 2004. - PP. 107-117.

12. Fetterly D., Manasse M., Najork M. Spam, Damn Spam, and Statistics // Int'l Workshop on the Web and Databases, ACM Press, 2004, - PP. 1-6.

13. Ganz A., Sieh L., Behavioral factors and SEO // Proceedings of 24th International Conference on Computer Communications and Networks (ICCCN 2015) Las Vegas, Nevada, USA August 3 - August 6, 2015, Scottsdale, Arizona, USA. - PP. 218-223.
