CC BY

Metrized Small World Approach for Nearest Neighbor Search

Andrey Logvinov, Alexander Ponomarenko, Vladimir Krylov, Yury Malkov

MeraLabs, Nizhny Novgorod, Russia

alogvinov@meralabs.com, aponom@meralabs.com, vkrylov@meralabs.com, ymalkov@meralbs.com

Abstract

In different areas attempts are made to organize data into multi-linked structures which are well suited for information search, in particular the nearest neighbor search where the result data items are metrically close to a given data item. These structures often take the form of trees (M-Tree, cover tree, KD-tree, GNAT) or networks (M-Chord, VoroNet, RayNet) built over a set of data items.

In this paper we give a regular approach to the construction of links between data items which provides logarithmic time complexity of the nearest neighbor search in the structure. According to this approach, data items are organized into an undirected graph with Small World properties, which ensure the existence of a short path between any two data items regardless of the graph size.

We propose different construction and search algorithms depending on the properties of the metric which determines the proximity of data items. The types of metric we consider are abstract metric and ordered metric. Further we extend the ordered metric approach to compound data items in the form of attribute-value pair sets to enable inclusion search by an arbitrary subset of attribute-value pairs.

Finally we provide simulation results for the structure with compound data items.

1. Introduction

The nearest neighbor search problem is defined as follows: given a set S of n points in some metric space (X, d), build a data structure on S so that for a given query point $p \in X$ one can efficiently find a point $q \in S$ which minimizes d(p, q).
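As a point of reference, the problem statement can be made concrete with an exhaustive scan, which evaluates the metric once per stored point for every query; the structures discussed in this paper aim to avoid exactly that cost. The sketch below is illustrative only: the function name `nearest_neighbor_bruteforce` and the default Euclidean distance (`math.dist`, Python 3.8+) are assumptions and not part of the paper.

```python
import math
from typing import Callable, Iterable, Tuple

Point = Tuple[float, ...]


def nearest_neighbor_bruteforce(
    points: Iterable[Point],
    query: Point,
    dist: Callable[[Point, Point], float] = math.dist,
) -> Point:
    """Return the point q in `points` that minimizes dist(query, q)."""
    # Linear scan: one metric evaluation per stored point.
    return min(points, key=lambda q: dist(query, q))


# Hypothetical sample data, used only to exercise the function.
S = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
nearest = nearest_neighbor_bruteforce(S, (0.9, 1.2))
```

Any function satisfying the metric axioms can be passed as `dist`; the scan itself makes no use of the structure of the space.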

Different approaches exist for building such a structure. The works [4, 5, 11] suggest hierarchical tree structures constructed using information about metric proximity of the elements. One notable shortcoming of this approach is the presence of the mandatory root node in tree-like structures, which makes building totally distributed implementations problematic.

There are also ways to build a distributed structure over the set S. The work [12] suggests a distributed hash table as the data structure, using the pivot-based metric space indexing approach.

The work [6] discusses the VoroNet distributed data structure. The elements of S are points of a two-dimensional Euclidean space. Each point from S is linked to all of its neighbor points on the Voronoi diagram (Delaunay graph) plus additional distant points to give the structure Small World properties. A greedy search algorithm is used.

The following work [7] by the same authors considers a structure where the elements are points in an n-dimensional Euclidean space. The main difference from the previous work is that every point is connected with only a subset of its Voronoi neighbors to avoid an exponential dependence of complexity on the number of dimensions. But this reduction of the link set leads to inexact search results, i.e. the result point is not always the nearest neighbor of the query point, although the number of such results can be made insignificant. Another drawback of this approach is that it can only be applied to points of a Euclidean space with a fixed number of dimensions.

In this paper we propose a regular approach to the construction of links between data elements in the form of an undirected graph with Small World properties [9, 10] to provide logarithmic complexity of the nearest neighbor search. We call the resulting structure Metrized Small World [1] (MSW).

We propose different construction and search algorithms depending on the properties of the metric which determines the proximity of data items.

The rest of the paper is structured as follows. Section 2 describes the construction of the MSW structure based on an abstract semi-metric. Section 3 describes MSW structure construction algorithms for ordered metrics. In Section 4 we extend the ordered metric approach to compound data items in the form of attribute-value pair sets to enable inclusion search by an arbitrary subset of attribute-value pairs. Finally, in Section 5 we provide simulation results for the structure with compound data items.

2. Metrized Small World data structure

The Metrized Small World data structure on the set of data items S is expressed by the graph G(V, E). Each vertex $v \in V$ corresponds to a single element of the set S. Each edge $e \in E$ is associated with a link between two data items from the set S. Assume that d(v, p) is equivalent to d(s, p), where s is the data item which corresponds to the vertex v. Then the search for the nearest neighbor of the query point $p \in X$ reduces to finding the vertex $v \in V$ with the minimal distance to p.

In the work [1] we gave the construction and search algorithms for that structure. In the paper [2] we also suggested a distributed storage architecture based on the proposed structure. Here we restate those algorithms in the notation adopted for this paper.

We provide the algorithm which adds the vertex $v_{new}$ to the graph G(V, E), where V is the set of previously added vertices. Thus the parameters of the algorithm are: V, the set of previously added vertices; $v_{new}$, the vertex being added; $v_{start} \in V$, an arbitrarily selected vertex from V (the starting point of the search); and two integer numbers m and n.

Algorithm: add_metric(V, $v_{new}$, $v_{start}$, n, m)

1. Arbitrarily select an element $t \in V$.

2. Let VisitedList be the set of visited elements.

3. Let CandidateList be the set of candidate elements for link establishment sorted by value of semi-metric to $v_{new}$ in ascending order.

4. Assume that CandidateList initially contains only $v_{start}$.

5. For i = 1 to n do

5.1. Sort CandidateList by value of semi-metric to $v_{new}$ in ascending order.

5.2. Select the first element p from CandidateList not contained in VisitedList. If no such element exists then break.

5.3. Add p to VisitedList.

5.4. Add the set of neighbor elements of p to CandidateList.

6. Mutually connect the new element $v_{new}$ with m arbitrary elements from VisitedList.
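The steps of add_metric can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' implementation: the graph is a plain dictionary mapping each vertex to the set of its neighbors, vertices are assumed to be distinct hashable points, the Euclidean `math.dist` stands in for the abstract semi-metric, and the `rng` parameter and the handling of an empty structure (`v_start is None`) are additions made for the example.

```python
import math
import random


def add_metric(graph, dist, v_new, v_start, n, m, rng=random):
    """Insert v_new into graph (dict: vertex -> set of neighbors)."""
    graph.setdefault(v_new, set())
    if v_start is None:           # assumption: empty structure, nothing to link to
        return
    visited = []                  # VisitedList
    candidates = {v_start}        # CandidateList, ordered on demand below
    for _ in range(n):
        # Steps 5.1-5.2: the closest candidate that has not been visited yet.
        unvisited = [c for c in candidates if c not in visited]
        if not unvisited:
            break
        p = min(unvisited, key=lambda c: dist(c, v_new))
        visited.append(p)         # step 5.3
        candidates |= graph[p]    # step 5.4
    # Step 6: mutual links with (at most) m arbitrary visited vertices.
    for u in rng.sample(visited, min(m, len(visited))):
        graph[v_new].add(u)
        graph[u].add(v_new)


# Build a small structure over random 2-D points (hypothetical test data).
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(200)]
graph = {}
for point in points:
    start = rng.choice(list(graph)) if graph else None
    add_metric(graph, math.dist, point, start, n=10, m=3, rng=rng)
```

Because every new vertex is linked to at least one vertex that was already present, the resulting graph stays connected; it does not, however, guarantee that a greedy walk reaches the true nearest neighbor.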

We have shown that the structure constructed using this algorithm provides the necessary condition for the existence of an effective search algorithm, because the Small World properties of the graph G(V, E) ensure the existence of a short path between any two vertices. However, this structure requires search algorithms which are more complex than the greedy algorithm, due to the existence of local minima of the metric.

An advantage of this approach is that the proximity measure M can be any function which is a general metric or even semi-metric defined over the set S.

3. Single-attribute Distributed Metrized Small World Data Structure

In the paper [3] we gave an algorithm for constructing a similar structure for a narrower class of metrics, namely metrics for which an order between data items is defined. If every data item is linked with its direct predecessor and successor with regard to the metric, there are no local minima. This condition ensures that the Delaunay graph is contained in the structure, which in turn guarantees the correctness of the greedy search algorithm, which attempts to minimize the distance to the query at each step.

Algorithm: add_ordered_metric(V, $v_{new}$, $v_{start}$, m)

1. Let $v_{cur} = v_{start}$.

2. For each neighbor $v_i$ of $v_{cur}$ calculate $d_i = d(v_i, v_{new})$.

3. If $\min(d_i) < d(v_{cur}, v_{new})$, let $v_{cur} = v_i$ for which $d_i = \min(d_i)$ and go to step 2.

4. If $v_{cur} < v_{new}$, let $v_{pre} = v_{cur}$ and let $v_{succ}$ be the direct successor of $v_{new}$ chosen from the neighbors of $v_{cur}$.

5. If $v_{cur} > v_{new}$, let $v_{succ} = v_{cur}$ and let $v_{pre}$ be the direct predecessor of $v_{new}$ chosen from the neighbors of $v_{cur}$.

6. Mutually connect $v_{new}$ with $v_{pre}$ and $v_{succ}$ if they exist.

7. Repeat m times:

7.1. If $v_{pre}$ exists, let $v'_{pre}$ be the direct predecessor of $v_{pre}$ chosen from its neighbors.

7.2. If $v_{succ}$ exists, let $v'_{succ}$ be the direct successor of $v_{succ}$ chosen from its neighbors.

7.3. If none of $v'_{pre}$ and $v'_{succ}$ exist then break.

7.4. If only $v'_{pre}$ exists or $d(v'_{pre}, v_{new}) < d(v'_{succ}, v_{new})$, mutually connect $v_{new}$ and $v'_{pre}$ and let $v_{pre} = v'_{pre}$.

7.5. If only $v'_{succ}$ exists or $d(v'_{succ}, v_{new}) < d(v'_{pre}, v_{new})$, mutually connect $v_{new}$ and $v'_{succ}$ and let $v_{succ} = v'_{succ}$.
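The insertion procedure for an ordered metric can be sketched as follows. This is a minimal illustration under assumptions that are not in the paper: data items are distinct numbers with d(a, b) = |a - b|, the graph is a dictionary mapping each item to the set of its neighbors, all helper names are hypothetical, and ties between the two candidates in steps 7.4 and 7.5 are resolved in favor of the successor.

```python
import random


def dist(a, b):
    return abs(a - b)


def greedy_nearest(graph, v_start, target):
    """Steps 1-3: follow links while some neighbor is strictly closer."""
    cur = v_start
    while True:
        best = min(graph[cur], key=lambda v: dist(v, target), default=cur)
        if dist(best, target) >= dist(cur, target):
            return cur
        cur = best


def pred_among_neighbors(graph, v, bound):
    """Largest neighbor of v that is smaller than bound, or None."""
    return max((u for u in graph[v] if u < bound), default=None)


def succ_among_neighbors(graph, v, bound):
    """Smallest neighbor of v that is larger than bound, or None."""
    return min((u for u in graph[v] if u > bound), default=None)


def connect(graph, a, b):
    graph[a].add(b)
    graph[b].add(a)


def add_ordered_metric(graph, v_new, v_start, m):
    graph[v_new] = set()
    if v_start is None:                      # assumption: empty structure
        return
    cur = greedy_nearest(graph, v_start, v_new)
    if cur < v_new:                          # step 4
        pre, succ = cur, succ_among_neighbors(graph, cur, v_new)
    else:                                    # step 5
        pre, succ = pred_among_neighbors(graph, cur, v_new), cur
    for v in (pre, succ):                    # step 6
        if v is not None:
            connect(graph, v_new, v)
    for _ in range(m):                       # step 7
        pre2 = pred_among_neighbors(graph, pre, pre) if pre is not None else None
        succ2 = succ_among_neighbors(graph, succ, succ) if succ is not None else None
        if pre2 is None and succ2 is None:   # step 7.3
            break
        if succ2 is None or (
            pre2 is not None and dist(pre2, v_new) < dist(succ2, v_new)
        ):
            connect(graph, v_new, pre2)      # step 7.4
            pre = pre2
        else:
            connect(graph, v_new, succ2)     # step 7.5 (ties land here)
            succ = succ2


# Insert the integers 0..299 in random order (hypothetical test data).
rng = random.Random(1)
values = list(range(300))
rng.shuffle(values)
graph = {}
for v in values:
    start = rng.choice(list(graph)) if graph else None
    add_ordered_metric(graph, v, start, m=2)
```

After every insertion each item is still linked to its direct predecessor and successor, which is the property that lets the greedy walk in `greedy_nearest` stop only at the true nearest item.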

Figure 1. Average shortest path length between two vertexes. Horizontal axis: number of vertexes (logarithmic scale, 1 to 1000000); one curve for each of m = 0, 1, 2, 5, 7, 15.

The nearest neighbor search is performed by following links from one element to another in the direction of the minimal metric.

The Small World properties of the graph ensure logarithmic search complexity for a random data set. The absence of a root element and the construction of the structure at the data item level make a completely distributed implementation of the structure possible. As can be seen in Figs. 1 and 2, both the average shortest path length and the maximum vertex degree scale logarithmically with the number of vertexes. Therefore the structure is suitable for storing very large amounts of data.

The nearest neighbor search is reduced to finding the minimum of the metric from the query to a data item. If the distance between the query and the found data item is lower than the query radius, then the found data item is the result; otherwise there is no result. If we must find all data items inside the query radius, we perform a sequential search in both directions from the first found data item.
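The radius query described above can be sketched on the simplest ordered structure, in which every item is linked only to its direct predecessor and successor (the m = 0 case). The sketch is illustrative and rests on assumptions: items are distinct numbers with d(a, b) = |a - b|, and the helper names `build_chain`, `greedy_nearest` and `range_search` are hypothetical.

```python
def build_chain(values):
    """Link every value to its direct predecessor and successor."""
    ordered = sorted(values)
    graph = {v: set() for v in ordered}
    for a, b in zip(ordered, ordered[1:]):
        graph[a].add(b)
        graph[b].add(a)
    return graph


def greedy_nearest(graph, start, q):
    """Follow links while some neighbor is strictly closer to q."""
    cur = start
    while True:
        best = min(graph[cur], key=lambda v: abs(v - q), default=cur)
        if abs(best - q) >= abs(cur - q):
            return cur
        cur = best


def range_search(graph, start, q, radius):
    """Return, in ascending order, all items closer to q than radius."""
    first = greedy_nearest(graph, start, q)
    if not abs(first - q) < radius:
        return []                       # even the nearest item is too far
    result = [first]

    def walk(pick, is_beyond):
        # Sequential search in one direction, starting from the first hit.
        cur = first
        while True:
            nxt = pick((u for u in graph[cur] if is_beyond(u, cur)), default=None)
            if nxt is None or abs(nxt - q) >= radius:
                return
            result.append(nxt)
            cur = nxt

    walk(max, lambda u, cur: u < cur)   # towards smaller items
    walk(min, lambda u, cur: u > cur)   # towards larger items
    return sorted(result)


chain = build_chain(range(0, 101, 10))  # hypothetical data: 0, 10, ..., 100
hits = range_search(chain, 100, 42, 15)
```

The walk in each direction stops at the first item outside the radius, which is sufficient because the items are visited in metric order.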

The proposed data addition algorithm is incremental, i.e. the addition of a new data item affects only a small number of existing data items.

4. Multi-attribute Distributed Metrized Small World Data Structure

In the two previous sections we considered the elements as atomic entities relative to the metric. Now we want to extend our approach to composite data items. We will consider composite objects which are represented by an unordered set of atomic objects for all of which one common ordered metric is defined.

Figure 2. Maximum vertex degree. Horizontal axis: number of vertexes (logarithmic scale, 1 to 1000000).

Then we define the search problem as the search for at least one, or for all, of the composite objects which include the given set of atomic objects. This data model is often used for describing application domain entities with a set of tags or keywords, e.g. images, hyperlinks, musical tracks, blog posts etc. This model can also represent objects consisting of a non-fixed set of attribute-value pairs.

Therefore for convenience we will consider arbitrary strings (or tags) as atomic objects. Hence the composite objects will be represented as unordered sets of tags.

Our main idea was to construct the graph G(V, E) in such a way that objects with any matching subset of atomic objects $T_{fix}$ would constitute a subgraph (layer) $L_{T_{fix}} \subseteq G(V, E)$ consisting of a single connected component, which in turn would form the MSW structure described in the previous section. Then the search for an element containing the given set of tags $T_q = \{t_1, t_2, \ldots, t_m\}$ would be performed by first finding an object from the subgraph-layer $L_{T_{t_1}}$ consisting of objects containing the tag $t_1$. After that, inside this subgraph-layer $L_{T_{t_1}}$, another element from the subgraph-layer $L_{T_{t_1 t_2}} \subset L_{T_{t_1}}$ is recursively searched for. The subgraph-layer $L_{T_{t_1 t_2}}$ consists of objects containing both tags $t_1$ and $t_2$. The process continues until an object from the subgraph-layer $L_{T_{t_1 t_2 \ldots t_m}}$ is found, which consists of objects containing all the given tags $\{t_1, t_2, \ldots, t_m\}$.

Figure 3. Example Multi-attribute Distributed Metrized Small World Data Structure. The dashed lines represent the edges in the $L_\emptyset$ layer. Solid straight lines show the links between objects having a common subset of tags.

For demonstration purposes we provide the example of the network of objects almost all of which contain three tags. Dashed curved lines show the links between objects which contain tags which are neighbors in lexicographical order. Solid straight lines show the links between objects having a common subset of tags.

Below we give a more formal description of the construction and search algorithms for this structure. Let $T = \{t\}$ be the set of all possible tags, which are distinct string values.

For each data element $a$ let there be an unordered set $T_a \subseteq T$ of tags associated with the object. Given a query set $T_q \subseteq T$, $T_q \neq \emptyset$, we must find the set $A_r$ of resulting data elements such that $\forall a \in A_r: T_q \subseteq T_a$, i.e. all data elements which have all of the tags specified in the query.
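The inclusion query defined above can be stated directly as an exhaustive filter, which is useful as a correctness reference for any index built on top of it. The sketch is illustrative only: the function name `inclusion_search`, the dictionary representation and the sample tags are assumptions.

```python
from typing import Dict, Set


def inclusion_search(tags_of: Dict[str, Set[str]], query: Set[str]) -> Set[str]:
    """Return every element a whose tag set T_a includes the query set T_q."""
    if not query:
        raise ValueError("the query tag set must be non-empty")
    # `query <= tags` is the subset test T_q ⊆ T_a.
    return {a for a, tags in tags_of.items() if query <= tags}


# Hypothetical tagged objects, used only to exercise the function.
tags_of = {
    "photo1": {"beach", "sunset"},
    "photo2": {"beach", "family"},
    "photo3": {"beach", "sunset", "family"},
}
result = inclusion_search(tags_of, {"beach", "sunset"})
```

This scan inspects every element for every query; the layered structure described in the rest of this section is meant to reach the matching elements without such a scan.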

Let the set $MSW_X = \{(t_{a_i}, t_{a_j})\}$; $a_i, a_j \in X$, $t_{a_i} \in T_{a_i}$, $t_{a_j} \in T_{a_j}$ be the MSW structure built over a set of elements X. Every element of $MSW_X$ represents a link between a pair of tags in data elements (it can be the same element). If there is no element corresponding to a pair of tags, there is no link between them. Two identical tags on different items cannot have links simultaneously in one $MSW_X$. We consider a tag $t$ to be a member of $MSW_X$ if some pair in $MSW_X$ contains $t$.

We can use the algorithm described in Section 3 of this paper to search for a given tag in an MSW structure.

Let $L_{T_{fix}} = (T_{fix}, MSW_X)$ be the MSW layer built over a set of tags $T_{fix}$. For every tag $t_a$ that is a member of $L_{T_{fix}}$, $T_{fix} \subseteq T_a$.

Let $a$ = search_single($L_{T_{fix}}$, $t_{start}$, $t$), $a \in X$, be the operation of searching for a single element that is a member of $L_{T_{fix}}$ and for which $t \in T_a$. The tag $t_{start}$ (a member of $L_{T_{fix}}$) is the entry point of the algorithm described in the second section of this paper.

Let add_partial($L_{T_{fix}}$, $t_{start}$, $a$, $t_a$) be the operation of addition of the tag $t_a$ of the element $a$ to the MSW layer $L_{T_{fix}}$. The tag $t_{start}$ is used as the entry point. The time complexity of the add_partial operation is logarithmic in the number of tags in $L_{T_{fix}}$. We consider an element $a$ to be a member of the MSW layer $L_{T_{fix}}$ if it has been partially added to $L_{T_{fix}}$ at least once.

Let add_complete($L_{T_{fix}}$, $t_p$, $a$) be the operation of complete addition of the element $a$ to the MSW layer $L_{T_{fix}}$. The add_complete operation is performed using the following algorithm:

Algorithm: add_complete($L_{T_{fix}}$, $t_p$, $a$)

1. Let $T_{free} = T_a \setminus T_{fix}$ be the set of all tags associated with the element $a$ but not contained in $T_{fix}$.

2. For each $t_a \in T_{free}$ do add_partial($L_{T_{fix}}$, $t_p$, $a$, $t_a$).

Figure 4. Experimental results. Left: two common tags. Right: three common tags. Horizontal axis of both panels: number of objects (logarithmic scale, 1 to 10000).

Let $S = \{L_{T_{fix}}\}$ be the set of all MSW layers (the structure being described). An arbitrary member $t_{global\_start}$ of the MSW layer $L_\emptyset$ can serve as a global entry point for the addition process.

Let add_recursive($S$, $L_{T_{fix}}$, $t_{start}$, $a$) be the operation of addition of the element $a$ to the structure $S$.

The add_recursive operation is performed using the following algorithm, assuming that the initial values are $T_{fix} = \emptyset$ and $t_{start} = t_{global\_start}$.

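Algorithm: add_recursive($S$, $L_{T_{fix}}$, $t_{start}$, $a$)

For each $t \in T_a \setminus T_{fix}$:

1. Find $a_{next}$ = search_single($L_{T_{fix}}$, $t_{start}$, $t$).

2. If $a_{next}$ exists, perform add_recursive($S$, $L_{\{T_{fix}, t\}}$, $t_{a_{next}}$, $a$), where $t_{a_{next}}$ is a random tag of $a_{next}$, $t_{a_{next}} \notin \{T_{fix}, t\}$; else perform add_partial($L_{T_{fix}}$, $t_{start}$, $a$, $t$).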

Let $A$ = search_recursive($S$, $L_{T_{fix}}$, $t_{start}$, $T_{query}$) be the operation of searching for all elements $A$ whose tag sets include $T_{query}$.

The search_recursive operation is performed using the following algorithm, assuming that the initial values are $T_{fix} = \emptyset$ and $t_{start} = t_{global\_start}$.

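Algorithm: search_recursive($S$, $L_{T_{fix}}$, $t_{start}$, $T_{query}$)

1. If $T_{query} = \emptyset$ then return all elements in the layer $L_{T_{fix}}$.

2. For a random $t_q \in T_{query}$ find $a_{next}$ = search_single($L_{T_{fix}}$, $t_{start}$, $t_q$).

3. Remove $t_q$ from $T_{query}$.

4. Perform search_recursive($S$, $L_{\{T_{fix}, t_q\}}$, $t_{a_{next}}$, $T_{query}$), where $t_{a_{next}}$ is a random tag of $a_{next}$, $t_{a_{next}} \notin \{T_{fix}, t_q\}$.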

Constructing links using the above approach is to a certain degree equivalent to indexing by all possible combinations of columns in a relational database. The main advantage of this approach is the possibility to quickly find an object or a set of objects with any given set of tags, without regard to the quantity of objects with a certain subset of tags (atomic objects).
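The analogy with indexing by every combination of columns can be illustrated with a deliberately simplified, centralized model of the layers. It is a sketch under several assumptions and not the structure of this paper: each layer is a sorted Python list of (free tag, element id) entries standing in for an MSW layer, `bisect` stands in for the greedy search_single, every element is eagerly added (add_complete) to the layer of every proper subset of its tags, so the number of layers grows exponentially with the tag count, and the result is read from the entries of the last layer reached. The class name `LayeredTagIndex` is hypothetical.

```python
import bisect
from itertools import combinations


class LayeredTagIndex:
    def __init__(self):
        # frozenset(T_fix) -> sorted list of (free tag, element id) entries
        self.layers = {}

    def add(self, element_id, tags):
        """Eagerly add the element to the layer of every proper tag subset."""
        tags = frozenset(tags)
        for size in range(len(tags)):
            for fixed in combinations(sorted(tags), size):
                t_fix = frozenset(fixed)
                layer = self.layers.setdefault(t_fix, [])
                # add_complete: one add_partial per free tag of the element
                for t in tags - t_fix:
                    bisect.insort(layer, (t, element_id))

    def _entries(self, t_fix, tag):
        """Ids of all elements that carry `tag` inside layer L_{T_fix}."""
        layer = self.layers.get(t_fix, [])
        i = bisect.bisect_left(layer, (tag,))
        ids = []
        while i < len(layer) and layer[i][0] == tag:
            ids.append(layer[i][1])
            i += 1
        return ids

    def search(self, query):
        """Return the ids of all elements whose tags include the query set."""
        query = set(query)
        if not query:
            raise ValueError("the query tag set must be non-empty")
        t_fix = frozenset()
        while len(query) > 1:
            t_q = query.pop()                  # arbitrary tag of the query
            if not self._entries(t_fix, t_q):  # search_single found nothing
                return set()
            t_fix = t_fix | {t_q}              # descend into the next layer
        return set(self._entries(t_fix, query.pop()))


# Hypothetical sample objects.
index = LayeredTagIndex()
index.add(1, {"x", "y"})
index.add(2, {"x", "z"})
index.add(3, {"x", "y", "z"})
found = index.search({"x", "y"})
```

Each step of `search` touches only one layer and only the entries for one tag, so the cost of a query does not depend on how many elements carry just a part of the requested tag set.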

Further we give the experimental data obtained on the structure prototype to confirm the theoretical assumptions regarding the advantages of our approach.

5. Experimental data

The experiments were set up as follows.

In the first experiment a set of N objects was generated, half of which contained the single common tag “X” and the other half the single common tag “Y”, together with a single object containing both the “X” and “Y” tags. The objects were added to the structure in random order. We measured the search time for the object containing both the “X” and “Y” tags. The measurement was repeated many times for different values of N; the set of random objects was regenerated each time. See the left graph.

In the second experiment the test set contained N random objects made up of equal numbers of objects containing two common tags (“X”, “Y”; “Y”, “Z”; “X”, “Z”) and a single object containing all three tags “X”, “Y”, “Z”. See the right graph.

The results are shown in Figure 4. The graphs show that in both cases the object search time depends logarithmically on the number of objects in the structure, which confirms our theoretical assumptions.

6. Conclusion and future work

We believe that the key to building search-oriented distributed systems is the construction of multi-linked structures similar to social networks. However, the metric distance between data items must be correlated with the number of links which separate them. In this paper we described methods of constructing such structures for certain data types. The necessary and sufficient condition for correctness of the greedy search algorithm is the inclusion of the Delaunay graph into the structure graph. Failure to satisfy this particular condition was the obstacle to using the greedy search algorithm with the structure described in Section 2. The condition of existence of a Delaunay subgraph is satisfied in the structures described in Sections 3 and 4. But supporting the correct Voronoi tessellation as in [6] or in Section 4 requires a large overhead when the number of dimensions is greater than two. For this reason we intend to focus our further research on finding a compromise between search accuracy and calculation overhead.

7. References

[1] V. Krylov, A. Logvinov, A. Ponomarenko, D. Ponomarev, “Metrized Small World Properties Data Structure”, Proc. Software Engineering and Data Engineering (SEDE 2008).

[2] V. Krylov, A. Logvinov, A. Ponomarenko, D. Ponomarev, “Active Database Architecture for XML Documents”, Proc. Computer Applications in Industry and Engineering (CAINE 2008).

[3] V. Krylov, A. Logvinov, A. Ponomarenko, D. Ponomarev, “Single-attribute Distributed Metrized Small World Data Structure”, Proc. IEEE International Conference on Intelligent Computing and Intelligent Systems 2009 (CAS).

[4] P. Ciaccia, M. Patella, and P. Zezula, “A cost model for similarity queries in metric spaces”, Proc. 17th ACM Symp. on Principles of Database Systems (Seattle), pp. 59-67, 1998.

[5] S. Brin, “Near neighbor search in large metric spaces”, Proc. 21st Conference on Very Large Databases (VLDB’95), pp. 574-584, 1995.

[6] O. Beaumont, A.-M. Kermarrec, L. Marchal, and E. Rivière, “VoroNet: A scalable object network based on Voronoi tessellations”, in IEEE IPDPS, 2007.

[7] O. Beaumont, A.-M. Kermarrec, and E. Rivière, “Peer to peer multidimensional overlays: Approximating complex structures”, in OPODIS, 11th International Conference on Principles of Distributed Systems, 2007.

[8] J.-D. Boissonnat and M. Yvinec, “Algorithmic Geometry”, Cambridge University Press, 1998.

[9] D.J. Watts, “Small Worlds”, Princeton, New Jersey: Princeton University Press, 1999.

[10] R. Albert and A.-L. Barabasi, “Statistical mechanics of complex networks”, Rev. Mod. Phys., 74(1), pp. 47-97, January 2002.

[11] A. Beygelzimer, S. Kakade, and J. Langford, “Cover trees for nearest neighbor”, Proc. 23rd International Conference on Machine Learning, pp. 97-104, 2006.

[12] D. Novak and P. Zezula, “M-Chord: A scalable distributed similarity search structure”, Proc. First International Conference on Scalable Information Systems (INFOSCALE 2006), Hong Kong, May 30 - June 1, IEEE Computer Society, 2006.
