A formula for the mean length of the longest common subsequence

Znamenskij Sergej V.

Journal of Siberian Federal University. Mathematics & Physics 2017, 10(1), 71-74

UDC 004.412

Sergej V. Znamenskij*

Ailamazyan Program Systems Institute of RAS, Peter the First, 4, Veskovo village, Pereslavl area, Yaroslavl region, 152021, Russia

Received 10.10.2016, received in revised form 10.11.2016, accepted 20.12.2016

The expected value E of the length of the longest common subsequence of letters in two random words is considered as a function of the alphabet size a = |A| and of the word lengths m and n. It is assumed that each letter appears independently at any position with equal probability. A simple expression for E(a, m, n) and its empirical proof are presented for fixed a and m + n. The high accuracy of the formula over a wide range of values is confirmed by numerical simulations.

Keywords: longest common subsequence, expected value, LCS length, simulation, asymptotic formula. DOI: 10.17516/1997-1397-2017-10-1-71-74.

Introduction

Random words of lengths m and n over an alphabet of size a are also referred to as random symbol sequences. We treat the appearances of letters at different positions of the words as equally probable and independent events. For such random sequences, the expected length of the longest common subsequence is a function E(m, n, a), which reflects the similarity of the original words.

Since the behavior of this function is related to a variety of generic algorithms for fuzzy search and difference identification, it has attracted the attention of researchers for three decades [1]. However, both the mathematical approaches of [2, 3] and numerical modeling (usually with special algorithms) [4, 5, 6] have clarified the situation only in the special cases m = n or a = 2 (see [7]).

Even for a = 2 the asymptotic behavior in m/n became clear only recently [8] (so far without a detailed proof). Computer calculations of E for small m, n in [9] identified a similar relation for a = 4.

This work aims to detect this relation and to prove it empirically, except for the cases of huge a and small m + n.

Fig. 1. The case a = 2, n = 64, which has the least relative accuracy

1. Model

Hypothesis 1. There exist functions $r_x = r_x(p, a)$ and $r_y = r_y(p, a)$ for which, with the notation $\delta = \frac{m - n}{2 r_x}$, $r = \sqrt{r_x^2 + r_y^2}$, the function

$$F(m, n, a) = \begin{cases} n, & 1 < r\delta, \\ \dfrac{m + n}{2} - r + r_y \sqrt{1 - \delta^2}, & -1 < r\delta < 1, \\ m, & r\delta < -1 \end{cases}$$

gives a fine approximation for E(m, n, a) at least for all a < 128 and 50 < m + n < 100000.
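The piecewise function F of Hypothesis 1 can be sketched in Python as below. This is an illustration under stated assumptions, not the author's code: it reads δ = (m − n)/(2 r_x) and r = sqrt(r_x² + r_y²), returns min(m, n) outside the arc, and switches branches at the tangency point |δ| = r_x/r, where the arc (m + n)/2 − r + r_y sqrt(1 − δ²) touches min(m, n). That switching point is an assumption made here to keep F continuous. The name `lcs_mean_model` is hypothetical; the sample values of r_x and r_y are the Table 1 entry for a = 2 and m + n = 64.

```python
import math


def lcs_mean_model(m, n, r_x, r_y):
    """Sketch of the approximation F(m, n, a) for given r_x and r_y.

    r_x and r_y depend on a and on m + n and must be supplied by the
    caller (for example from Table 1).
    """
    r = math.hypot(r_x, r_y)        # r = sqrt(r_x^2 + r_y^2)
    delta = (m - n) / (2.0 * r_x)   # delta = (m - n) / (2 r_x)

    # Assumption: outside the tangency point the model equals min(m, n),
    # which is the trivial upper bound on the LCS length.
    if abs(delta) >= r_x / r:
        return float(min(m, n))

    # Elliptic arc between the two tangency points.
    return (m + n) / 2.0 - r + r_y * math.sqrt(1.0 - delta * delta)


# Table 1 entry for a = 2, m + n = 64.
print(lcs_mean_model(32, 32, 21.389, 25.056))
```

For m = n = 32 this evaluates to roughly 24.1, and for strongly unbalanced lengths such as m = 4, n = 60 it returns the shorter length.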

2. Evaluation

Direct evaluation for huge LCS lengths is impossible because of the well-known quadratic complexity of the algorithm. Therefore 6 fixed values of m + n and 10 values of a were selected, and for each of the 6 × 10 series of 32 triplets (m, n, a) the expected LCS lengths were calculated as sample means. A Perl XS module with speed comparable to compiled C code was used. The required number of calculations and the processing time were determined in a series of runs that aimed to collect samples large enough for acceptable accuracy. The full collected sample data is available from the author by email.
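The sample-mean estimate described above can be illustrated with a small Python sketch. It is not the Perl XS module used for the reported results; it uses the textbook quadratic dynamic program for the LCS length, and the names `lcs_length` and `estimate_mean_lcs` as well as the `trials` and `seed` parameters are hypothetical.

```python
import random


def lcs_length(u, v):
    """LCS length by the classic dynamic program, O(len(u) * len(v)) time."""
    prev = [0] * (len(v) + 1)
    for x in u:
        cur = [0]
        for j, y in enumerate(v, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def estimate_mean_lcs(m, n, a, trials, seed=0):
    """Sample mean of the LCS length of two random words.

    Letters are drawn independently and uniformly from an alphabet of
    size a; the word lengths are m and n.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        u = [rng.randrange(a) for _ in range(m)]
        v = [rng.randrange(a) for _ in range(n)]
        total += lcs_length(u, v)
    return total / trials
```

A call such as `estimate_mean_lcs(32, 32, 2, 1000)` gives a rough estimate of E(32, 32, 2). The quadratic cost per pair is why large m + n needs far fewer samples or much more CPU time.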

The values of $r_x$ and $r_y$ that minimize the mean square error were then calculated. All the results are presented in Table 1. We denote by $I_{m,n,a}$ the sample set of all calculated LCS lengths for the generated random words and by $E(m,n,a)$ their mean over each $I_{m,n,a}$, and calculate the experimental standard deviations

$$\sigma_E(s) = \max_{n+m=s,\,a} \sqrt{\frac{1}{|I_{m,n,a}| - 1} \sum_{i \in I_{m,n,a}} \bigl(l_i - E(m,n,a)\bigr)^2}, \qquad \sigma_F(s) = \max_{a} \sqrt{\frac{1}{31} \sum_{j=1}^{32} \bigl(E(x_{j,s}, a) - F(x_{j,s}, a)\bigr)^2},$$

where $x_{j,s} = \left(\frac{js}{64},\; s - \frac{js}{64}\right)$.

The worst matches from all 6 × 10 tests are shown in Fig. 1.
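One reading of the sampling points and of the deviation between sample means and the model can be sketched in Python as follows. It assumes that each series uses the 32 points (js/64, s − js/64), j = 1, ..., 32, and that the deviation is a root mean square with the 1/31 normalization. The function names `sample_points` and `fit_deviation` are hypothetical and not taken from the author's code.

```python
import math


def sample_points(s, count=32, denom=64):
    """Pairs (m, n) with m + n = s at m = j*s/denom for j = 1..count.

    Assumes s is divisible by denom, which holds for every m + n
    listed in Table 1.
    """
    return [(j * s // denom, s - j * s // denom) for j in range(1, count + 1)]


def fit_deviation(sample_means, model_values):
    """Root mean square deviation with (k - 1) normalization for k points."""
    k = len(sample_means)
    squares = sum((e - f) ** 2 for e, f in zip(sample_means, model_values))
    return math.sqrt(squares / (k - 1))
```

For s = 64 the points run from (1, 63) to (32, 32), so each series covers one half of the symmetric range of m − n.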

Table 1. The results of empirical proof

| Quantity | Notation | m + n = 64 | m + n = 256 | m + n = 1024 | m + n = 8192 | m + n = 16384 | m + n = 65536 |
|---|---|---|---|---|---|---|---|
| Optimal $(r_x, r_y)$, a = 2 | $r_x$ | 21.389 | 75.265 | 286.241 | 2231.881 | 4425.896 | 17571.808 |
| | $r_y$ | 25.056 | 89.92 | 351.613 | 2796.965 | 5528.217 | 21893.814 |
| a = 3 | $r_x$ | 34.289 | 117.155 | 434.358 | 3329.878 | 6606.47 | 26269.533 |
| | $r_y$ | 47.844 | 153.426 | 551.615 | 4155.315 | 8198.862 | 32525.718 |
| a = 4 | $r_x$ | 36.57 | 129.026 | 484.121 | 3722.583 | 7417.019 | 29509.918 |
| | $r_y$ | 44.483 | 150.263 | 548.949 | 4135.24 | 8233.956 | 32664.38 |
| a = 5 | $r_x$ | 36.674 | 132.592 | 502.645 | 3881.93 | 7731.749 | 30838.127 |
| | $r_y$ | 38.624 | 136.337 | 506.569 | 3835.89 | 7628.555 | 30430.938 |
| a = 6 | $r_x$ | 36.407 | 133.781 | 511.075 | 3970.617 | 7904.151 | 31531.221 |
| | $r_y$ | 33.926 | 123.444 | 465.027 | 3562.51 | 7067.06 | 28201.948 |
| a = 7 | $r_x$ | 36.082 | 134.101 | 515.004 | 4017.635 | 8001.95 | 31922.49 |
| | $r_y$ | 30.304 | 112.73 | 428.676 | 3310.963 | 6574.387 | 26229.624 |
| a = 8 | $r_x$ | 35.796 | 134.119 | 517.126 | 4043.61 | 8063.285 | 32145.348 |
| | $r_y$ | 27.519 | 104.035 | 398.526 | 3090.13 | 6155.939 | 24503.713 |
| a = 16 | $r_x$ | 34.539 | 132.891 | 519.659 | 4098.747 | 8181.072 | 32660.916 |
| | $r_y$ | 17.121 | 68.706 | 270.916 | 2136.885 | 4263.338 | 17022.721 |
| a = 32 | $r_x$ | 33.752 | 131.601 | 518.343 | 4106.893 | 8202.243 | 32763.324 |
| | $r_y$ | 11.037 | 46.221 | 185.746 | 1481.344 | 2959.956 | 11829.582 |
| a = 128 | $r_x$ | 32.945 | 130.073 | 515.861 | 4104.696 | 8202.921 | 32784.888 |
| | $r_y$ | 4.741 | 21.549 | 89.437 | 725.75 | 1453.251 | 5820.357 |
| CPU core time spent, hours | | 55.1 | 40.1 | 13.5 | 10.7 | 26.4 | 399.6 |
| Total number of LCS calculations | $N$ | 5658012818 | 719343652 | 28547575 | 489896 | 100227 | 103149 |
| Maximum over (a, n) of LCS length precision | $\sigma_E$ | 1.577 | 2.491 | 4.025 | 9.408 | 12.279 | 23.573 |
| Precision of F | $\sigma_F$ | 0.027 | 0.039 | 0.058 | 0.355 | 1.043 | 1.736 |
| Relative accuracy for F | $\varepsilon$ | 0.12% | 0.053% | 0.027% | 0.020% | 0.029% | 0.011% |

Acknowledgments

This work was performed under financial support from the Government, represented by the Ministry of Education and Science of the Russian Federation (Project ID RFMEFI60414X0138).

References

[1] V.Chvatal, D.Sankoff, Longest common subsequences of two random sequences, J. Appl. Prob., 12(1975), 306-315.

[2] M.A.Kiwi, M.Loebl, J.Matousek, Expected length of the longest common subsequence for large alphabets, Advances in Mathematics, 197(2005), no. 2, 480-498.

[3] G.S.Lueker, Improved bounds on the average length of longest common subsequences, Journal of the ACM (JACM), 56 (2009), no. 3, 17.

[4] R.Bundschuh, High precision simulations of the longest common subsequence problem, The European Physical Journal B - Condensed Matter and Complex Systems, 22(2001), no. 4, 533-541.

[5] J.Boutet de Monvel, Extensive simulations for longest common subsequences, The European Physical Journal B - Condensed Matter and Complex Systems, 7(1999), no. 2, 293-308.

[6] R.Baeza-Yates, G.Navarro, R.Gavalda, R.Scheihing, Bounding the expected length of the longest common subsequences and forests, Theory of Computing Systems, 32(1999), no. 4, 435-452.

[7] Kang Ning, Kwok Pui Choi, Systematic assessment of the expected length, variance and distribution of Longest Common Subsequences, arXiv preprint arXiv:1307.2796, 2013.

[8] J.D.Dixon, Longest common subsequences in binary sequences, arXiv preprint arXiv:1307.2796, 2013.

[9] S.V.Znamenskij, A picture of common subsequence length for two random strings over an alphabet of 4 symbols, Program systems: theory and applications, 7(2016), no. 1(28), 201-208.

