CC BY

Scientific Works of the Union of Scientists in Bulgaria - Plovdiv, Series C. Technics and Technologies, Vol. XIV, ISSN 1311-9419 (Print), ISSN 2534-9384 (Online), 2017.

APPLYING MACHINE LEARNING CLASSIFIERS IN A DATABASE SMART INDEXING ALGORITHM

Yordan Angelov1, George Pashev2, George Totkov3

University of Plovdiv "Paisii Hilendarski"1, 2, 3

The paper describes a methodology, data structures, algorithms and software implementing a smart indexing methodology which is used in the integrated repository framework rePoU, created at the University of Plovdiv. The framework was created to facilitate the storage of files and various data objects and a certain amount of heterogeneous meta-data about them. The smart indexing methodology makes use of various classifiers which are commonly used in the domain of Machine Learning and Artificial Intelligence and certain meta-data information, stored in a separate meta-data database.

Keywords: database, indexing, machine learning, decision trees

Introduction

Plovdiv University currently lacks a proper data repository system that can be used not only by teachers and students but also by the university administration. A proper repository system is one that is universal enough to aggregate data objects and metadata about them, so that different software systems can make use of it. The university's software systems occasionally generate files, and these files need to be stored together with variously structured metadata about them. The metadata needs to be properly indexed so that it can take part in search queries quickly enough. The benefits of such a repository would be numerous, including less paper printing and the ability to perform complex analyses on data generated by the different software systems and departments of the university. The goal of this project is to develop, test and implement a data repository (rePoU) for the various software systems of the university (holding university documents) and for all other documents used by "PU". Open source software tools are being used, and will continue to be used, to compensate for the lack of financing for such work while still providing a stable, secure and reliable system for the students, teachers and administrative employees of the university.

Main question

How can a data repository for Plovdiv University be implemented in the cheapest and most reliable way, using an appropriate DBMS for handling heterogeneous documents together with an algorithm for the dynamic selection of indexed attributes and suitable indexing structures alongside an RDBMS?

Sub-questions

1. What kind of indexing will be appropriate for the type and amount of data stored in the databases in order to achieve maximum performance?

2. How can the envisioned architecture best be implemented?

3. What algorithm is going to provide the possibility for dynamic selection of indexed attributes and suitable indexing structures?

Methodology

Sub Question 1: The data is going to be variously structured and dependent on changing and evolving legislation and business processes. Hence, different indexing structures will be researched in order to gain broader knowledge of how they perform against heterogeneous data. A literature study is going to be conducted to find appropriate ways of indexing vast amounts of heterogeneous data. Mathematical experiments are also going to be conducted in order to compare the various indexing approaches and identify the most efficient one.

Sub Question 2: The architecture will be multilayered, because the goal is to be able to add new systems that use the repository in the future. To achieve the needed isolation and avoid dependencies when new systems are added, different layers are required, which communicate with each other through specific APIs. Communication between layers will happen strictly through these APIs, allowing developers to choose the best solution for each layer without fearing side effects in the others. Moreover, a multilayered architecture allows the software in a given layer to change without affecting the implementation of the remaining layers.
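The layer isolation described here can be illustrated with a minimal Python sketch. The paper does not specify the actual rePoU APIs, so every name below (StorageAPI, InMemoryStorage, RepositoryService and their methods) is a hypothetical stand-in chosen only to show an upper layer that depends on a contract rather than on a concrete lower layer.

```python
from typing import Protocol


class StorageAPI(Protocol):
    """Hypothetical contract that the storage layer exposes to the layer above."""

    def put(self, key: str, metadata: dict) -> None: ...

    def find(self, field: str, value: object) -> list: ...


class InMemoryStorage:
    """One interchangeable implementation of the StorageAPI contract."""

    def __init__(self) -> None:
        self._items = {}

    def put(self, key: str, metadata: dict) -> None:
        self._items[key] = dict(metadata)

    def find(self, field: str, value: object) -> list:
        # Return the matching keys in a stable (sorted) order.
        return sorted(k for k, m in self._items.items() if m.get(field) == value)


class RepositoryService:
    """Upper layer: it only knows the StorageAPI contract, not the implementation."""

    def __init__(self, storage: StorageAPI) -> None:
        self._storage = storage

    def register_document(self, doc_id: str, department: str) -> None:
        self._storage.put(doc_id, {"department": department})

    def documents_of(self, department: str) -> list:
        return self._storage.find("department", department)
```

Under this assumption, replacing InMemoryStorage with an RDBMS-backed class would require no change in RepositoryService, which is the property the multilayered design aims for.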

Sub Question 3: An algorithm has to be implemented with a built-in intelligent approach for determining the indexing structures. Dynamic counters of query types, together with analyzers, will propose indexing structures to the database administrator, so that indexes in the database can be rebuilt to improve performance. Experimental research is going to be used to test the performance of the algorithm by querying the database with examples that mimic various situations. A literature study is going to be conducted on the options available for achieving the result. Options such as third-party machine learning tools, statistics gathered from experimental data on the performance of various indexing techniques, pre-packaged AI tools such as C4.5 (Ross Quinlan) [2], a small neural network implementation, etc. are going to be examined. Moreover, various machine learning algorithms and methods are going to be researched. Based on this research, an adequate option with minimal complexity and minimal RAM usage will be chosen.
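The idea of counting query types per attribute can be sketched as follows. This is only an illustration in Python: the class QueryStats, its methods, and the rule "read at least min_selects times and more often than written" are assumptions made for the example, not the counters or analyzers used in rePoU.

```python
from collections import Counter


class QueryStats:
    """Counts how often each (query type, table, column) combination occurs."""

    def __init__(self):
        self.counts = Counter()

    def record(self, query_type, table, columns):
        for column in columns:
            self.counts[(query_type, table, column)] += 1

    def reindex_candidates(self, min_selects=3):
        """Columns selected at least `min_selects` times and more often than written.

        The threshold and the read/write comparison are illustrative assumptions.
        """
        candidates = []
        for (query_type, table, column), n in self.counts.items():
            if query_type != "select" or n < min_selects:
                continue
            # Counter returns 0 for missing keys without inserting them.
            writes = sum(
                self.counts[(w, table, column)] for w in ("insert", "update", "delete")
            )
            if n > writes:
                candidates.append((table, column))
        return sorted(candidates)
```

A list produced this way could be shown to the database administrator as a proposal; the decision on which structure to use is left to the classifier and the test procedure described later in the paper.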

Fig 1. Control flow diagram of the smart indexing algorithm

Fig. 1 depicts a flowchart of the algorithm, which is going to be dynamically executed in the Calculation Layer when a query event is triggered by the upper Application Layers to perform a query to an RDBMS. In cases when non-RDBMSs are going to be used, the algorithm is similar to the current one. For example, if the database is Elastic Search [1], the term Table is going to be replaced by the term Shard and the term Column is going to be replaced by the more general term Attribute.

The AI Classifier builds a classification tree on Education Examples, which are objects having the following attributes: Name; Generalized Attribute Names; Relation/Shard; names of the attributes affected by the query; Number of affected Rows; Number of affected attributes; Query Type (Insert, Update, Delete, Select); Attributes' primitive types; Affected attributes' index during query execution; etc. Compound attributes in the classification criteria are simplified, so that they can be represented as a single attribute, by using specific attribute serialization functions. The classification grades the speed of the query as follows: 1) Fast Enough; 2) Average speed (improvements possible); 3) Slow (show a warning and propose a better indexing structure if one exists as a pointer to a similar case in the classification tree); 4) Very Slow (show an error message and propose a better indexing structure if one exists as a pointer to a similar case in the classification tree). If no such case exists, due to insufficiency of Learning Examples, any of the indexing structures available for the given data types is proposed.
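A learning example with the attributes listed above, together with the four speed grades, could be represented roughly as in the Python sketch below. The field names, the "|"-joined serialization of compound attributes, and the millisecond limits in grade_speed are all assumptions introduced for illustration; the paper does not give concrete thresholds or serialization functions.

```python
from dataclasses import dataclass
from enum import Enum


class SpeedGrade(Enum):
    FAST_ENOUGH = 1
    AVERAGE = 2
    SLOW = 3
    VERY_SLOW = 4


@dataclass
class EducationExample:
    name: str
    relation: str              # table (RDBMS) or shard (non-RDBMS)
    affected_attributes: list  # names of the attributes touched by the query
    affected_rows: int
    query_type: str            # "insert", "update", "delete" or "select"
    attribute_types: list      # primitive types of the affected attributes
    index_in_use: str

    def to_features(self):
        """Serialize compound attributes so that each becomes a single value."""
        return {
            "relation": self.relation,
            "attributes": "|".join(sorted(self.affected_attributes)),
            "n_attributes": len(self.affected_attributes),
            "affected_rows": self.affected_rows,
            "query_type": self.query_type,
            "types": "|".join(sorted(self.attribute_types)),
            "index": self.index_in_use,
        }


def grade_speed(elapsed_ms, limits=(50.0, 200.0, 1000.0)):
    """Map an elapsed time to one of the four grades (limits are assumed values)."""
    fast, average, slow = limits
    if elapsed_ms <= fast:
        return SpeedGrade.FAST_ENOUGH
    if elapsed_ms <= average:
        return SpeedGrade.AVERAGE
    if elapsed_ms <= slow:
        return SpeedGrade.SLOW
    return SpeedGrade.VERY_SLOW
```

Flattening the attribute lists into sorted, joined strings is one simple way to give a decision-tree learner a single categorical value per compound attribute.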

Fig. 2 presents a part of the implementation of the smart indexing algorithm, which is going to be executed as a system daemon (or a CRON job).

// Ask the classification tree whether re-indexing is needed for this example.
$advice = c45_askForAdvice("needReindex", $edExample);
// Names of the DB-specific helper functions (left empty in this excerpt).
$testFN = "";
$indexListFuncName = "";
$getCurIndexFuncName = "";
if ($advice == true || $advice == 1) {
    // Ask the classification tree which indexing structure to use.
    $advice1 = c45_askForAdvice("whichIndex", $edExample);
    // Test the proposed indexing structure.
    $testRES = $testFN($edExample["involvedTable"], $edExample["involvedColumns"], $advice1);
    if (!$testRES) {
        $listOfAvailableIndexes = [];
        $curIdx = "";
        $indexListFuncName($edExample["involvedTable"], $edExample["involvedColumns"], $listOfAvailableIndexes);
        $getCurIndexFuncName($edExample["involvedTable"], $edExample["involvedColumns"], $curIdx);
        $betterIndexFound = false;
        // Try every other available index until one passes the test.
        foreach ($listOfAvailableIndexes as $keyAI => $valueAI) {
            if ($valueAI != $advice1 && $valueAI != $curIdx) {
                if ($testFN($edExample["involvedTable"], $edExample["involvedColumns"], $valueAI)) {
                    $betterIndexFound = true;
                    // Store the successful outcome as new learning examples.
                    c45_insertEduExample("whichIndex", $edExample, $valueAI);
                    c45_insertEduExample("needReindex", $edExample, 1);
                    break;
                }
                $elasticStruct = $edExample;
                $elasticStruct["testedIdx"] = $valueAI;
                $elastic_rv = [];
                // ... (the excerpt ends here)

Fig 2. A part of the implementation of the smart indexing algorithm
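The control flow of the excerpt in Fig. 2 can be restated as a small, self-contained Python sketch. All callables passed to choose_index are hypothetical stand-ins for the classifier wrappers and the DB-specific functions of the listing. Returning the classifier's proposal when its test succeeds is an assumption, since the corresponding branch is not shown in the excerpt.

```python
def choose_index(example, ask_advice, test_index, available_indexes, current_index, learn):
    """Return the index structure to adopt, or None if no better one is found.

    ask_advice(question, example)   -> classifier answer (stand-in for c45_askForAdvice)
    test_index(example, index)      -> True if the index passes the DB test
    learn(question, example, label) -> stores a new learning example
    """
    if not ask_advice("needReindex", example):
        return None
    proposed = ask_advice("whichIndex", example)
    if test_index(example, proposed):
        # Assumed behaviour: the proposal is accepted when its test succeeds.
        return proposed
    # The proposal did not help, so try the remaining available structures.
    for candidate in available_indexes:
        if candidate in (proposed, current_index):
            continue
        if test_index(example, candidate):
            # Feed the successful outcome back into the classifier.
            learn("whichIndex", example, candidate)
            learn("needReindex", example, 1)
            return candidate
    return None
```

Passing the classifier and the test as parameters keeps the sketch independent of any particular database, which mirrors the way the listing calls DB-specific functions by name.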

The test function $testFN is a DB-specific implementation which performs a DB test procedure. The test procedure can be described as follows:

1. Copy the table $tableName into a temporary test table;

2. Select all records by the last record value in $columnName and measure the elapsed time in $t1;

2.1 Repeat step 2 several times and calculate the average time $avg_t1;

3. Select the available indexes for $columnName into $availableIndexes;

4. For each element in $availableIndexes different from currentIndex, apply it to the test table and:

4.1 Select all records by the last record value in $columnName and measure the elapsed time in $t2;

4.2 Repeat step 4.1 several times and calculate the average time $avg_t2;

5. If $avg_t2 < $avg_t1 return true, else return false.
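The test procedure above can be sketched in Python against SQLite, which is an assumption made only to keep the example self-contained; the paper does not name the DBMS used by $testFN. SQLite offers a single (B-tree) index type, so the loop over available indexes collapses to one candidate here. The function names are hypothetical, and table and column names are assumed to be trusted identifiers.

```python
import sqlite3
import time


def average_select_time(conn, table, column, value, runs=5):
    """Average wall-clock time of selecting all rows where column equals value."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(f'SELECT * FROM "{table}" WHERE "{column}" = ?', (value,)).fetchall()
        total += time.perf_counter() - start
    return total / runs


def index_is_faster(conn, table, column, runs=5):
    """Return True if an index on `column` lowers the average lookup time."""
    test_table = f"{table}_idx_test"
    conn.execute(f'DROP TABLE IF EXISTS "{test_table}"')
    # Step 1: copy the table into a temporary test table.
    conn.execute(f'CREATE TABLE "{test_table}" AS SELECT * FROM "{table}"')
    try:
        # Step 2: use the value of the last record as the lookup key.
        row = conn.execute(
            f'SELECT "{column}" FROM "{test_table}" ORDER BY rowid DESC LIMIT 1'
        ).fetchone()
        if row is None:
            return False  # an empty table gives nothing to measure
        avg_t1 = average_select_time(conn, test_table, column, row[0], runs)
        # Steps 3 and 4: apply the candidate index and measure again.
        conn.execute(
            f'CREATE INDEX "{test_table}_by_{column}" ON "{test_table}" ("{column}")'
        )
        avg_t2 = average_select_time(conn, test_table, column, row[0], runs)
        # Step 5: the index passes only if it made the lookup faster.
        return avg_t2 < avg_t1
    finally:
        # Dropping the test table also drops its index.
        conn.execute(f'DROP TABLE "{test_table}"')
```

Because the result depends on measured times, it can vary between runs on small tables; averaging over several repetitions, as the procedure prescribes, reduces but does not remove that noise.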

The functions c45_askForAdvice and c45_insertEduExample are PHP wrappers around a C++ implementation of the C4.5 algorithm; the tool from [2] is used.
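C4.5 selects the attribute to split on by its gain ratio, that is, the information gain normalized by the split information. The following Python sketch shows that computation on a toy set of learning examples; it is not the C++ tool wrapped above, and the attribute names in the usage example are made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, attribute, target):
    """Information gain of `attribute` with respect to `target`, divided by split info."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    total = len(rows)
    # Expected entropy of the target after splitting on the attribute.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy([row[target] for row in rows]) - remainder
    # Split information penalizes attributes with many distinct values.
    split_info = entropy([row[attribute] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


examples = [
    {"query_type": "select", "index": "btree", "grade": "slow"},
    {"query_type": "select", "index": "hash", "grade": "slow"},
    {"query_type": "insert", "index": "btree", "grade": "fast"},
    {"query_type": "insert", "index": "hash", "grade": "fast"},
]
best = max(["query_type", "index"], key=lambda a: gain_ratio(examples, a, "grade"))
```

In this toy data the query type fully determines the grade while the index type carries no information, so the query type would be chosen as the first split.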

Conclusion

A methodology, a strategy and an algorithm are proposed that make use of Machine Learning (ML) classification algorithms and tools in order to achieve smart indexing functionality in RDBMSs as well as in non-RDBMSs. The proposed methodology uses various variables that are essential to data storage and to its performance. A good practice would be to use an ML algorithm which analyzes the information gain of each of those variables, such as C4.5 or other algorithms of the same family. Further research and development is needed to evaluate which ML classification algorithms are most suitable for use in conjunction with the presented methodology. Further methodologies also need to be developed in order to evaluate the efficiency of the proposed methodology and algorithms.

References

[1] Elastic Search Official Website, https://www.elastic.co/

[2] Quinlan, J. Ross. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.
