Scientific article on the topic 'THE SUBSYSTEM OF SEARCH IN THE DISTRIBUTED INFORMATION SYSTEMS'

THE SUBSYSTEM OF SEARCH IN THE DISTRIBUTED INFORMATION SYSTEMS. Text of a scientific article in the specialty "Computer and information sciences"

CC BY
Keywords
SEARCH / INFORMATION / NETWORK / DISTRIBUTED INFORMATION SYSTEMS / RESOURCE / INTERNET

Abstract of the scientific article in computer and information sciences. Authors: Yurchyna Alexey, Bugay Alexander, Amons Alexander

The article describes the principles of the search subsystems of distributed information systems. It describes the basic mechanisms of a search engine, reviews modern search systems and identifies their positive and negative traits; it establishes that a successful search relies on search algorithms, synonyms and a thesaurus of synonyms.


Text of the scientific work on the topic "THE SUBSYSTEM OF SEARCH IN THE DISTRIBUTED INFORMATION SYSTEMS"



Alexey Yurchyna

Student, NTU KPI named after Sikorsky, Department of Computer Science (FIVT)

Alexander Bugay

Student, NTU KPI named after Sikorsky, Department of Computer Science (FIVT)

Alexander Amons

Ph.D., associate professor, NTU KPI named after Sikorsky

THE SUBSYSTEM OF SEARCH IN THE DISTRIBUTED INFORMATION SYSTEMS

Summary. The article describes the principles of the search subsystems of distributed information systems. It describes the basic mechanisms of a search engine, reviews modern search systems and identifies their positive and negative traits; it establishes that a successful search relies on search algorithms, synonyms and a thesaurus of synonyms.

Key words: search, information, network, distributed information systems, resource, Internet.

Introduction. Due to the rapid development of telecommunications technology, and of the Internet in particular, the problem of developing effective information search is becoming extremely topical. Distributed systems exist in various forms and store information by many different methods. Information search in such systems is currently the subject of scientific debate and research, because information that cannot be found is effectively lost.

Interest in the problem of information search has not weakened throughout the lifetime of the network. To carry out a search that satisfies the user's information needs, it is necessary to know what determines a successful search and which problems arise when working with the information.

This article examines existing search systems, analyzes their advantages and disadvantages, and proposes solutions that provide high efficiency of information search in distributed information systems.

Analysis of the recent research and publications. The subsystem of search in distributed information systems has been the subject of research by many scientists, in particular A. Trusov [4], V. Trusov [4], T. Atanasova [1], B. Voyskunskii [3], A. Barysheva [2] and others.

Setting objectives. To analyze the mechanisms and principles of the subsystems of information search in distributed information systems; to explore the algorithms of the information search subsystem in the distributed information systems of the Internet; to describe the major search systems and identify their positive and negative traits.

The material presentation. The information search engine occupies one of the leading positions in a distributed information system, and the efficiency of the system depends on its implementation. At the same time, implementing extensive search capabilities may negatively affect system performance [3].

Three basic requirements are imposed on the search engine:

- control of resource coverage;

- control of the accuracy of information received from the network;

- high speed of search.

Control of resource coverage

When searching for information on any question, different types of resources are used. Knowledge of all the major types of modern network resources, and an understanding of the technical and thematic specifics of their content and of how they are accessed, is a necessary condition for successfully planning and carrying out search operations.

Control of the accuracy of information

This control can be performed by different means. The traditional methods of verification are:

- locating information sources that are alternative to the one found;

- checking the factual material and determining how often it is used by other sources;

- clarifying the status of the document and the rating of the host on which it can be found by means of the search engines;

- obtaining information about the status and competence of the author of the material by using special search services;

- analyzing individual elements of the host to assess the skills of the specialists who support it, and others [2].

Speed of search in the network

The speed of search in the network depends on two factors:

- competent planning of the search procedures;

- skill in working with the selected type of resource.

A generalized algorithm of information search in the distributed information systems of the Internet is shown in Figure 1.

Fig. 1. Algorithm of information search in the distributed information systems of the Internet (flowchart with a Yes/No decision branch; the block labels are illegible in the source)

Drawing up the plan of search work means selecting the search services and tools that correspond to the specifics of the task and deciding the order in which to apply them, depending on the expected impact. After gaining access to the appropriate resource, it is important to be able to understand its structure and methods of navigation quickly. The necessary skills are speed in carrying out the actions and a skillful combination of the search engines with the information processing abilities of the local client program and of the search engine's server.

An important place in information search is taken by the use of synonyms, which includes:

- analysis of the search task, the given subject area, keywords and descriptors;

- search for information with the use of synonyms.

The expansion of the subject area by synonyms (Fig. 2) includes:

- the formation of a synonyms thesaurus;

- cutting the thesaurus into sections by keywords or descriptors;

- the formation of the subject area;

- storing the information obtained while expanding the subject area with synonyms.

Fig. 2. Algorithm of expanding the subject area by using synonyms
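The expansion step can be illustrated with a minimal sketch. It assumes the thesaurus is already available as a mapping from a keyword to its synonyms; the name `expand_subject_area` and the sample entries in `THESAURUS` are hypothetical and are not taken from the article.

```python
from typing import Dict, List, Set

# Hypothetical synonym thesaurus: each keyword maps to its synonyms.
THESAURUS: Dict[str, Set[str]] = {
    "search": {"retrieval", "lookup"},
    "network": {"net", "web"},
}


def expand_subject_area(keywords: List[str],
                        thesaurus: Dict[str, Set[str]]) -> List[str]:
    """Return the keywords followed by their synonyms, without duplicates."""
    expanded: List[str] = []
    seen: Set[str] = set()
    for word in keywords:
        # Keep the original keyword first, then its synonyms in sorted order.
        for term in [word, *sorted(thesaurus.get(word, set()))]:
            if term not in seen:
                seen.add(term)
                expanded.append(term)
    return expanded


if __name__ == "__main__":
    print(expand_subject_area(["search", "network"], THESAURUS))
```

Each expanded term can then be submitted as a separate query, or the terms can be combined with OR, depending on what the chosen search service supports.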

Another important element that affects the results of information search is the keyword thesaurus, which involves expanding the subject area with synonyms and forming a synonyms thesaurus on this basis.

The formation of the synonyms thesaurus (Fig. 3) includes:

1. analysis of the subject area, keywords and descriptors;

2. determination of the synonyms in the given subject area;

3. formation of the synonyms thesaurus of the given subject area;

4. formation of sections of the synonyms thesaurus by the keywords or descriptors of the given subject area [4].

Fig. 3. Algorithm of synonyms thesaurus formation
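As an illustration of steps 3 and 4, the sketch below builds a symmetric thesaurus from synonym pairs and then cuts it into sections by keyword. It assumes that synonym pairs for the subject area have already been determined (step 2); the function names and sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_thesaurus(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Form a symmetric synonyms thesaurus from (term, synonym) pairs."""
    thesaurus: Dict[str, Set[str]] = defaultdict(set)
    for first, second in pairs:
        thesaurus[first].add(second)
        thesaurus[second].add(first)
    return dict(thesaurus)


def sections_by_keyword(thesaurus: Dict[str, Set[str]],
                        keywords: List[str]) -> Dict[str, List[str]]:
    """Cut the thesaurus into sections, one per keyword of the subject area."""
    return {word: sorted(thesaurus.get(word, set())) for word in keywords}


if __name__ == "__main__":
    pairs = [("search", "retrieval"), ("search", "lookup"), ("network", "web")]
    print(sections_by_keyword(build_thesaurus(pairs), ["search", "resource"]))
```

A keyword without known synonyms gets an empty section, which makes gaps in the thesaurus visible to the expert.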

Finding information in distributed information resources follows the principle "from simple to complex", which gradually immerses the expert in solving information search problems that involve synonymy of an increasingly complex nature.

Information in the distributed information systems of the Internet is assessed by its timeliness, dynamics and informativeness.

The generalized algorithm of the semantic information search model in distributed information systems is as follows:

1. A "document example" (the semantic task), which serves as the search pattern, is entered manually by the expert.

2. The theme of the document is extracted from it, and the search instructions are determined.

3. The thematic request is expanded by synonyms and associative queries.

4. The search image of the query is formed on the basis of a frequency dictionary and is broken down into individual search orders.

5. A primary search for links to relevant documents is conducted in the existing Internet search engines; the overall result is placed in the data storage.

6. The found documents are downloaded into the data storage.

7. A search image is formed for each document in the data storage.

8. The documents are ranked according to the given theme.

9. The found documents are abstracted and passed on for information and expert analysis in order of their rating [4].
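Steps 4, 7 and 8 can be sketched as follows. The article does not specify how a search image is represented or how documents are ranked; the sketch assumes a plain term-frequency dictionary as the search image and cosine similarity as the ranking measure. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def search_image(text: str) -> Counter:
    """Term-frequency 'search image' of a text (lower-cased word counts)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(first: Counter, second: Counter) -> float:
    """Cosine similarity of two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(first[term] * second[term] for term in first.keys() & second.keys())
    norm = (math.sqrt(sum(v * v for v in first.values()))
            * math.sqrt(sum(v * v for v in second.values())))
    return dot / norm if norm else 0.0


def rank(example: str, documents: Dict[str, str]) -> List[Tuple[str, float]]:
    """Rank stored documents by similarity to the expert's document example."""
    query = search_image(example)
    scored = [(name, cosine(query, search_image(text)))
              for name, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    storage = {
        "a": "distributed search in networks",
        "b": "cooking recipes",
    }
    for name, score in rank("search in distributed systems", storage):
        print(name, round(score, 2))
```

In a fuller pipeline the query image would first be expanded with synonyms (step 3), and the top-ranked documents would be passed to abstracting (step 9).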

For better searching, information must be described, and the corresponding descriptive information must be kept. With administrative, technical and descriptive metadata, searching for and reading data is convenient and easy. However, the loss of even one metadata attribute may lead to the loss of information about the data and, as a result, to the loss of effective reading and use; the preservation process would then fail. That is why the preservation of information in distributed systems should be organized according to clear rules for describing and retrieving data [5].
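A simple completeness check shows how the loss of a metadata attribute can be detected before a record is stored. The three metadata groups come from the text; the attribute names inside `REQUIRED` and the function `missing_metadata` are hypothetical examples.

```python
from typing import Dict, Set

# Hypothetical required attributes for each metadata group named in the text.
REQUIRED: Dict[str, Set[str]] = {
    "administrative": {"owner"},
    "technical": {"format"},
    "descriptive": {"title", "author"},
}


def missing_metadata(record: Dict[str, Dict[str, str]]) -> Dict[str, Set[str]]:
    """Return, per metadata group, the required attributes the record lacks."""
    gaps: Dict[str, Set[str]] = {}
    for group, attributes in REQUIRED.items():
        absent = attributes - set(record.get(group, {}))
        if absent:
            gaps[group] = absent
    return gaps


if __name__ == "__main__":
    record = {
        "administrative": {"owner": "library"},
        "technical": {"format": "pdf"},
        "descriptive": {"title": "Search subsystems"},
    }
    print(missing_metadata(record))
```

A record with a non-empty result would be rejected or sent back for completion rather than stored with an incomplete description.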

Many large-scale systems underpin search on the Internet today. One of them is Bigtable by Google Inc. [6], a distributed system for managing structured data that is designed for large volumes of information: petabytes of data spread across thousands of servers. Large projects such as Google Earth, Google Finance and web search store a variety of information in this storage, from simple web addresses to satellite images [7]. Despite the differences in the types of data stored, Bigtable remains a flexible and high-performance solution for all the systems connected to it.

The described solution is a distributed, multidimensional, sorted map indexed by a row key, a column key and a timestamp. Each value in the map is an array of bytes. Every read or write under a single row key is atomic, regardless of the number of columns read or written in the row. Each range of rows is called a tablet and is the unit of distribution. As a result, reading short row ranges is efficient and usually requires communication with only a few machines.
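The data model can be illustrated with a toy in-memory sketch: a map from (row key, column key, timestamp) to bytes, with row keys kept in sorted order so that a short row range is found by binary search. This is only a single-process illustration, not Bigtable's implementation or API; the class `SortedTable` and its methods are hypothetical.

```python
import bisect
from typing import Dict, List, Optional, Tuple


class SortedTable:
    """Toy model: (row, column, timestamp) -> bytes, with rows kept sorted."""

    def __init__(self) -> None:
        self._rows: List[str] = []  # row keys in sorted order
        self._cells: Dict[str, Dict[Tuple[str, int], bytes]] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        if row not in self._cells:
            bisect.insort(self._rows, row)
            self._cells[row] = {}
        self._cells[row][(column, timestamp)] = value

    def get(self, row: str, column: str) -> Optional[bytes]:
        """Return the most recent value for (row, column), or None."""
        versions = [(ts, value)
                    for (col, ts), value in self._cells.get(row, {}).items()
                    if col == column]
        return max(versions, key=lambda pair: pair[0])[1] if versions else None

    def scan(self, start: str, end: str) -> List[str]:
        """Row keys in the half-open range [start, end)."""
        low = bisect.bisect_left(self._rows, start)
        high = bisect.bisect_left(self._rows, end)
        return self._rows[low:high]


if __name__ == "__main__":
    table = SortedTable()
    table.put("com.example/b", "contents", 1, b"old")
    table.put("com.example/b", "contents", 2, b"new")
    table.put("com.example/a", "contents", 1, b"x")
    table.put("org.site/", "contents", 1, b"y")
    print(table.get("com.example/b", "contents"))
    print(table.scan("com.", "com/"))
```

Because neighbouring row keys sit next to each other, a contiguous slice of `_rows` plays the role of a tablet: a scan over a short range touches only the slice (in the real system, the few machines) that holds it.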

Column keys are grouped into sets called column families, usually by data type. This makes it possible to control access at the level of a column family and to improve search by data type. Indexing the information by timestamp makes it possible to keep multiple versions of the same data at the same time. The number of versions to store can also be configured: when the number of versions exceeds the limit, the oldest ones are deleted from the system [8].
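Version retention per column family can be sketched as below. The sketch assumes column keys of the form "family:qualifier" and a per-family limit on the number of stored versions; the class `VersionedCells` and its methods are hypothetical and do not reproduce Bigtable's API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class VersionedCells:
    """Toy store that keeps only the newest N versions per column family."""

    def __init__(self, max_versions: Dict[str, int]) -> None:
        # Families must be declared up front, each with its version limit.
        self._max_versions = max_versions
        self._cells: Dict[Tuple[str, str], List[Tuple[int, bytes]]] = defaultdict(list)

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        family = column.split(":", 1)[0]
        limit = self._max_versions[family]  # KeyError for an undeclared family
        versions = self._cells[(row, column)]
        versions.append((timestamp, value))
        versions.sort(key=lambda pair: pair[0], reverse=True)  # newest first
        del versions[limit:]  # drop versions beyond the limit

    def versions(self, row: str, column: str) -> List[Tuple[int, bytes]]:
        """All retained versions of a cell, newest first."""
        return list(self._cells.get((row, column), []))


if __name__ == "__main__":
    store = VersionedCells({"anchor": 2})
    for ts, value in [(1, b"a"), (2, b"b"), (3, b"c")]:
        store.put("com.example", "anchor:home", ts, value)
    print(store.versions("com.example", "anchor:home"))
```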

Considering Google's distributed system, its positive traits are the flexibility of the system combined with a large amount of metadata. On the other hand, the large amount of metadata that must be analyzed during a search has a negative impact. Even though Bigtable is divided into relatively small parts called tablets, redundant information may still be processed.

Another system of distributed data storage is Dynamo, presented by Amazon [9]. It is designed to manage and store data while striking a "golden mean" between availability, performance and scalability. Amazon's platforms impose varied requirements on data storage: it must be flexible enough, readily available and guaranteed to be efficient at a reasonable cost.

A product such as Amazon's Dynamo is more suitable for internal use within a company than for building a publicly available system. All data is accessed by unique identifiers, which is not safe when the system is used by a wider audience. A positive aspect of the system is its level of distribution and decentralization: its nodes are independent.
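The Dynamo paper [9] describes consistent hashing as the way keys are assigned to independent nodes. The sketch below is a strongly simplified illustration of that idea, without virtual nodes or replication; the class `Ring` and the node names are hypothetical.

```python
import bisect
import hashlib
from typing import List


def _position(key: str) -> int:
    """Map a string to a position on the ring using a 128-bit MD5 hash."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class Ring:
    """Simplified consistent-hash ring: one position per node, no replicas."""

    def __init__(self, nodes: List[str]) -> None:
        self._ring = sorted((_position(node), node) for node in nodes)
        self._positions = [position for position, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Return the first node clockwise from the key's position."""
        index = bisect.bisect_right(self._positions, _position(key))
        return self._ring[index % len(self._ring)][1]


if __name__ == "__main__":
    ring = Ring(["node-a", "node-b", "node-c"])
    print(ring.node_for("user:42"))
```

The useful property for a decentralized system is that removing a node only moves the keys that the node owned; all other keys stay where they were.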

Generally, in distributed information retrieval systems, a dedicated module classifies data according to its context. The next step is to create metadata that describes the information; this metadata is then stored in the system for the lifetime of the data it describes. This means that the module describes the data as it understands it, that is, as it was programmed to. As a result, we obtain information encoded in a way that depends on this module and its implementation.

The classification of the metadata is fixed and is determined before the information is retrieved. Existing data is already stored under a certain structure and with certain metadata attributes. Metadata attributes are individual objects, such as "Author" and "Date of the last modification". The structure describes the procedure and rules according to which the metadata objects are organized [10].

Another part of the system, which is responsible for information search, receives requests for information by attributes (metadata) that may differ from those assigned by the other modules. For effective use, constant synchronization between the classification modules and the search module is needed. The search module needs full access to the algorithms and to the results of data classification and metadata creation. With this information, the search engine can link its algorithms (inputs and outputs) to the relevant classification module. Where several types of data classification exist, the search subsystem must have the appropriate information from each of them.

Each piece of information has an invariable structure: the subject being described, a predicate that names one of its properties, and the object, which is the value of that property [12].
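The subject, predicate and object structure corresponds to an RDF triple. A minimal sketch of storing and querying such triples is given below; it uses plain Python tuples rather than an RDF library, and the names `Triple`, `match` and the sample identifiers are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str    # the resource being described
    predicate: str  # the property, e.g. a metadata attribute
    obj: str        # the value of the property


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return the triples that agree with every pattern part that is given."""
    return [t for t in triples
            if (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (obj is None or t.obj == obj)]


if __name__ == "__main__":
    store = [
        Triple("doc:1", "Author", "A. Trusov"),
        Triple("doc:1", "Date of the last modification", "2011-02-01"),
        Triple("doc:2", "Author", "T. Atanasova"),
    ]
    print(match(store, predicate="Author"))
```

Leaving a part of the pattern as `None` acts as a wildcard, so the same function answers "everything about doc:1" and "all documents by a given author".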

One of the key ideas is the use of agents: lightweight components on the nodes of the distributed system that store a summary of the information held at that point of the system [13]. This improves the efficiency of search, because the availability of the information is checked against the agents rather than through a direct connection to the database, which is a time-consuming operation. After a request to the RDF structures, the unique ID of the relevant part of the information system is obtained, and a highly efficient and safe request can be executed. The theoretical algorithm of query processing by agents is:

- Receiving the request;

- Analysis of the request and its translation into a machine-friendly form;


- Search in the RDF structures and detection of whether the information is present;

- Data collection and return of the resulting sample.
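The four steps can be sketched as follows. The sketch assumes that each agent keeps a summary mapping a term to the IDs of the records on its node, which stands in for the RDF structures; the names `Agent`, `normalize` and `process` are hypothetical.

```python
from typing import Dict, List, Set


class Agent:
    """Lightweight agent holding a summary of the records on one node."""

    def __init__(self, node_id: str, summary: Dict[str, Set[str]]) -> None:
        self.node_id = node_id
        self.summary = summary  # term -> IDs of records stored on this node

    def lookup(self, terms: List[str]) -> Set[str]:
        """IDs of records on this node that carry every requested term."""
        sets = [self.summary.get(term, set()) for term in terms]
        return set.intersection(*sets) if sets else set()


def normalize(request: str) -> List[str]:
    """Translate a free-text request into a sorted list of unique terms."""
    return sorted(set(request.lower().split()))


def process(request: str, agents: List[Agent]) -> Dict[str, List[str]]:
    """Ask every agent; return record IDs only for nodes that have a match."""
    terms = normalize(request)
    hits: Dict[str, List[str]] = {}
    for agent in agents:
        ids = agent.lookup(terms)
        if ids:
            hits[agent.node_id] = sorted(ids)
    return hits


if __name__ == "__main__":
    agents = [
        Agent("n1", {"search": {"r1", "r2"}, "network": {"r2"}}),
        Agent("n2", {"search": {"r9"}}),
    ]
    print(process("Network search", agents))
```

Only the nodes listed in the result need to be contacted for the actual data, using the returned record IDs.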

This approach will reduce the amount of data transmitted over the network [14] and, as a result, will reduce the load on the system. The work of the agents can be optimized by keeping them on a high-speed LAN, from which they issue long-running requests to the data storages when necessary. Agents can also be grouped by the similar theme of the data they cover [15], or by attributes.

The designed system incorporates the best sides of the systems reviewed: Bigtable from Google Inc. and Dynamo from Amazon. The users of the system are combined into the network of the distributed system. This makes it possible both to keep data transfer and storage within the system secure and to allow decentralization, relying on the common goal of the users. The agents make the search more efficient by storing a small amount of metadata and generating unique identifiers for a quick search of specific information. The system also retains its potential for scalability thanks to the lightweight agents on its nodes.

Conclusions. This article describes general information about the search subsystems of distributed information systems, which are designed for the safe storage of data and quick search. In particular, the mechanism and operating principles of the information search subsystem in distributed information systems are analyzed, and it is identified that the search engine must meet three basic requirements: control of resource coverage, control of the accuracy of the information received from the network, and high speed of search. The algorithms of information search in the distributed information systems of the Internet are examined: the algorithm of information search, the algorithm of expanding the subject area by using synonyms, and the algorithm of synonyms thesaurus formation. Two leading systems, Bigtable of Google Inc. and Dynamo of Amazon, are characterized, and their positive and negative traits are identified.

The proposed approach to semantic information search in distributed information systems qualitatively improves the results returned by the search engines on the Internet and automates the processing of the relevant information by ranking it according to the given theme, which frees experts from manually browsing the found resources.

References:

1. Atanasova T. The designing of search services of corporate organizational management systems (N. Bakanova, T. Atanasova). - In: "Modelirane and management in the information processes", KTP, Sofia, Bulgaria, 2009, ISBN: 978-954-9332-55-1, pp. 30-33.

2. Barisheva O., Hilyarevskii R. On the relevance of primary information requests // NTI. Series 2. - 1995. - No 6. - pp. 14-19.

3. Voiskunskii V. On the construction of the search features // NTI. Series 2. - 1992. - No 9. - pp. 69.

4. The approaches to the formation of semantic information search in distributed information systems // Trusov A., Trusov V. / The organization and use of information resources. - Information resources of Russia. - 2011/2. - pages 20-24.

5. FRENCH, C. D. One size fits all database architectures do not work for DSS. In Proc. of SIGMOD (May 1995), pp. 449-450.

6. Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google file system. In Proc. of the 19th ACM SOSP (Dec.2003), pp. 29-43.

7. Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. 2006. Bigtable: a distributed storage system for structured data. In Proceedings of the 7th Conference on USENIX Symposium on Operating Systems Design and Implementation - Volume 7 (Seattle, WA, November 06 - 08, 2006). USENIX Association, Berkeley, CA, 15-15.

8. PIKE, R., DORWARD, S., GRIESEMER, R., AND QUINLAN, S. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming Journal 13, 4 (2005), 227-298.

9. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels. Dynamo: Amazon's Highly Available Key-value Store (2007).

10. Margulies Simon, Subotic Ivan, Rosenthaler Lukas. Long-term archiving of digital data, distributed archiving network - DISTARNET. In: EVA 2005, Berlin. Conference, Hg. Gerd Stanke, Andreas Bienert, James Hemsley, Vito Cappellini. Berlin 2005. Pp. 168-174.

11. Hunter Jane, Lagoze Carl. Combining RDF and XML Schemas to enhance Interoperability between Metadata Applications Profiles. In: WWW10, May 1-5, 2001, Hong Kong. Pp. 457-466.

12. World Wide Web Consortium (W3C). RDF Vocabulary Description Language 1.0: RDF Schema. Currently available: http://www.w3.org/TR/rdf-schema/

13. Nguyen, N.T., Ganzha, M., Paprzycki, M.: A Consensus-based Multi-agent Approach for Information Search in Internet. In: Alexandrov V., van Albada G., Sloot P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3993, pp. 208-215. Springer, Heidelberg (2006)

14. Fricke S., Bsufka K., Keiser J., Schmidt T., Sesseler R. and Albayrak S., "Agent-Based Telematic Services and Telecom Applications," Comm. ACM, vol. 44, no. 4, pp. 43-48, Apr. 2001.

15. Gavalas D., Tsekouras G., Anagnostopoulos C., A mobile agent platform for distributed network and systems management, Journal of Systems and Software 82 (2) (2009) pp. 355-371.
