Journal of Siberian Federal University. Mathematics & Physics 2017, 10(1), 71-74
UDC 004.412
A Formula for the Mean Length of the Longest Common Subsequence
Sergej V. Znamenskij*
Ailamazyan Program Systems Institute of RAS, Peter the First, 4, Veskovo village, Pereslavl area, Yaroslavl region, 152021
Russia
Received 10.10.2016, received in revised form 10.11.2016, accepted 20.12.2016
The expected value E of the length of the longest common subsequence of letters in two random words is considered as a function of the alphabet size a = |A| and of the word lengths m and n. It is assumed that each letter appears independently at any position with equal probability. A simple expression for E(a,m,n) and its empirical proof are presented for fixed a and m + n. The high accuracy of the formula over a wide range of values is confirmed by numerical simulations.
Keywords: longest common subsequence, expected value, LCS length, simulation, asymptotic formula. DOI: 10.17516/1997-1397-2017-10-1-71-74.
Introduction
Random words of lengths m and n over an alphabet of size a are also referred to as random symbol sequences. We treat the appearances of letters at different positions of the words as independent, equally probable events. For such random sequences the expected length of the longest common subsequence is a function E(m, n, a), which reflects the similarity of the original words.
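Under this model every pair of words is equally likely, so the expected value can be written as a plain average. The following is only a sketch in notation introduced here for illustration: \mathcal{A} stands for the alphabet with |\mathcal{A}| = a, and \mathrm{LCS}(u, v) for the length of the longest common subsequence of the words u and v.

```latex
\[
E(m, n, a) = \frac{1}{a^{m+n}} \sum_{u \in \mathcal{A}^m} \sum_{v \in \mathcal{A}^n} \mathrm{LCS}(u, v)
\]
```

For instance, two one-letter words coincide with probability 1/a, so E(1, 1, a) = 1/a.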
Since the behavior of this function is related to a variety of generic algorithms for fuzzy search and difference identification, it has attracted the attention of researchers for more than three decades [1]. However, both analytical methods, as in [2, 3], and numerical modeling (usually with special algorithms) [4, 5, 6] have clarified the situation only in the special cases m = n or a = 2 (see [7]).
Even for a = 2 the asymptotic dependence on m/n became clear only recently [8] (so far without a detailed proof). Computer calculations of E for small m, n in [9] have identified a similar relation for a = 4.
This work aims to detect this relation and to prove it empirically, except for the cases of huge a and small m + n.
* [email protected] © Siberian Federal University. All rights reserved
Fig. 1. The case a = 2, n = 64 with the least relative accuracy
1. Model
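Hypothesis 1. There exist functions r_x = r_x(p, a) and r_y = r_y(p, a) such that, with the notation
\[
\delta = \frac{m - n}{2 r_x}, \qquad r = \sqrt{r_x^2 + r_y^2},
\]
the function
\[
F(m, n, a) =
\begin{cases}
n, & 1 < r\delta, \\
\dfrac{m + n}{2} - r + r_y \sqrt{1 - \delta^2}, & -1 < r\delta < 1, \\
m, & r\delta < -1
\end{cases}
\]
gives a fine approximation for E(m, n, a) at least for all a < 128 and 50 < m + n < 100000.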
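As a numerical illustration, the central branch of F can be evaluated directly. The sketch below assumes the reading δ = (m − n)/(2 r_x), r = sqrt(r_x² + r_y²) and a central branch of the form (m + n)/2 − r + r_y sqrt(1 − δ²). The function name `central_branch` and the decision to reject |δ| > 1 are illustrative choices, not part of the paper. The parameter values in the demo are taken from Table 1 for a = 2, m + n = 64, on the assumption that the entries there read r_x = 21.389, r_y = 25.056.

```python
import math


def central_branch(m, n, r_x, r_y):
    """Evaluate (m + n)/2 - r + r_y*sqrt(1 - delta^2).

    This is the assumed central branch of F; the outer branches and the
    exact switching conditions are deliberately not modelled here.
    """
    delta = (m - n) / (2.0 * r_x)
    if abs(delta) > 1.0:
        raise ValueError("delta outside [-1, 1]: the square root is not real")
    r = math.hypot(r_x, r_y)  # sqrt(r_x**2 + r_y**2)
    return (m + n) / 2.0 - r + r_y * math.sqrt(1.0 - delta * delta)


if __name__ == "__main__":
    # a = 2, m = n = 32, parameters as read from Table 1 (assumed reading)
    print(round(central_branch(32, 32, 21.389, 25.056), 2))
```

With these values the sketch gives about 24.1 for m = n = 32, that is, roughly three quarters of the word length.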
2. Evaluation
Direct evaluation for huge LCS lengths is impossible because of the known quadratic complexity of the algorithm. Therefore 6 fixed values of m + n and 10 values of a were selected, and for each of the 6 x 10 series of 32 triplets (m, n, a) the expected LCS lengths were calculated as sample means. A Perl XS module with speed comparable to compiled C code was used. The required number of calculations and the processing time were determined in a series of runs aimed at obtaining samples large enough for acceptable accuracy. The full collected sample data is available from the author by email.
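The sampling procedure can be sketched in a few lines. This is not the author's Perl XS module. It is a minimal Python illustration that uses the textbook quadratic dynamic program for the LCS length and a plain sample mean, and the names `lcs_length` and `estimate_mean_lcs`, the seed and the sample sizes are all hypothetical.

```python
import random
import statistics


def lcs_length(u, v):
    """Length of the longest common subsequence, O(len(u) * len(v)) time."""
    if len(u) < len(v):
        u, v = v, u  # keep the DP row as short as possible
    prev = [0] * (len(v) + 1)
    for x in u:
        cur = [0]
        for j, y in enumerate(v, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def estimate_mean_lcs(m, n, a, samples, seed=0):
    """Return (sample mean, sample standard deviation) of the LCS length
    for uniformly random words of lengths m and n over a letters."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(samples):
        u = [rng.randrange(a) for _ in range(m)]
        v = [rng.randrange(a) for _ in range(n)]
        lengths.append(lcs_length(u, v))
    return statistics.fmean(lengths), statistics.stdev(lengths)


if __name__ == "__main__":
    mean, sd = estimate_mean_lcs(32, 32, 2, samples=2000)
    print(f"mean = {mean:.3f}, sd = {sd:.3f}")
```

The quadratic cost per pair is what limits the sample sizes for large m + n, which is why pure Python is suitable only for small cases.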
The values of r_x and r_y that minimize the mean square error were calculated. All the results are presented in Table 1. We denote by I_{m,n,a} the sample set of all calculated LCS lengths for the generated random words and by E(m, n, a) the mean over each I_{m,n,a}, and calculate the experimental standard deviations
\[
\sigma_E(s) = \max_{n + m = s,\; a} \sqrt{\frac{1}{|I_{m,n,a}| - 1} \sum_{i \in I_{m,n,a}} \bigl(i - E(m, n, a)\bigr)^2},
\qquad
\sigma_F(s) = \sqrt{\frac{1}{31} \sum_{j=1}^{32} \bigl(E(x_{j,s}, a) - F(x_{j,s}, a)\bigr)^2},
\]
where
\[
x_{j,s} = \left( \frac{js}{64},\; s - \frac{js}{64} \right).
\]
The worst matches among all 6 x 10 tests are shown in Fig. 1.
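The deviation between simulated means and the formula can be computed as sketched below. The sketch assumes that the evaluation points are x_{j,s} = (js/64, s − js/64) for j = 1, ..., 32 and that the deviation is the square root of the sum of squared differences divided by 31. Both are a reading of the formulas above rather than a confirmed specification, and the names `grid_points` and `sigma_f` are hypothetical.

```python
import math


def grid_points(s, count=32, parts=64):
    """Pairs (m, n) = (j*s/parts, s - j*s/parts) for j = 1..count.

    Integer division is exact when s is a multiple of `parts`, as it is
    for every total length used in the paper.
    """
    return [(j * s // parts, s - j * s // parts) for j in range(1, count + 1)]


def sigma_f(e_values, f_values):
    """Square root of sum((E - F)^2) / (k - 1) over k paired values."""
    if len(e_values) != len(f_values) or len(e_values) < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    squared = sum((e - f) ** 2 for e, f in zip(e_values, f_values))
    return math.sqrt(squared / (len(e_values) - 1))


if __name__ == "__main__":
    points = grid_points(64)
    print(points[0], points[-1])  # first and last (m, n) pair for s = 64
```

For s = 64 the grid runs from (1, 63) to (32, 32), so it covers one half of the range of m; the other half follows by symmetry in m and n.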
Table 1. The results of empirical proof

Optimal (r_x, r_y) for each alphabet size a and each total length of both sequences m + n

   a      m + n = 64        256       1024       8192      16384      65536
   2  r_x     21.389     75.265    286.241   2231.881   4425.896  17571.808
      r_y     25.056      89.92    351.613   2796.965   5528.217  21893.814
   3  r_x     34.289    117.155    434.358   3329.878    6606.47  26269.533
      r_y     47.844    153.426    551.615   4155.315   8198.862  32525.718
   4  r_x      36.57    129.026    484.121   3722.583   7417.019  29509.918
      r_y     44.483    150.263    548.949    4135.24   8233.956   32664.38
   5  r_x     36.674    132.592    502.645    3881.93   7731.749  30838.127
      r_y     38.624    136.337    506.569    3835.89   7628.555  30430.938
   6  r_x     36.407    133.781    511.075   3970.617   7904.151  31531.221
      r_y     33.926    123.444    465.027    3562.51    7067.06  28201.948
   7  r_x     36.082    134.101    515.004   4017.635    8001.95   31922.49
      r_y     30.304     112.73    428.676   3310.963   6574.387  26229.624
   8  r_x     35.796    134.119    517.126    4043.61   8063.285  32145.348
      r_y     27.519    104.035    398.526    3090.13   6155.939  24503.713
  16  r_x     34.539    132.891    519.659   4098.747   8181.072  32660.916
      r_y     17.121     68.706    270.916   2136.885   4263.338  17022.721
  32  r_x     33.752    131.601    518.343   4106.893   8202.243  32763.324
      r_y     11.037     46.221    185.746   1481.344   2959.956  11829.582
 128  r_x     32.945    130.073    515.861   4104.696   8202.921  32784.888
      r_y      4.741     21.549     89.437     725.75   1453.251   5820.357

Summary for each total length m + n

          m + n = 64        256       1024       8192      16384      65536
CPU core time spent in hours
                55.1       40.1       13.5       10.7       26.4      399.6
Total number of LCS calculations (N)
          5658012818  719343652   28547575     489896     100227     103149
Maximum over (a, n) of LCS length precision (σ_E)
               1.577      2.491      4.025      9.408     12.279     23.573
Precision of F (σ_F)
               0.027      0.039      0.058      0.355      1.043      1.736
Relative accuracy for F
               0.12%     0.053%     0.027%     0.020%     0.029%     0.011%
Acknowledgments
This work was performed under financial support from the Government, represented by the Ministry of Education and Science of the Russian Federation (Project ID RFMEFI60414X0138).
References
[1] V.Chvatal, D.Sankoff, Longest common subsequences of two random sequences, J. Appl. Prob., 12(1975), 306-315.
[2] M.A.Kiwi, M.Loebl, J.Matousek, Expected length of the longest common subsequence for large alphabets, Advances in Mathematics, 197(2005), no. 2, 480-498.
[3] G.S.Lueker, Improved bounds on the average length of longest common subsequences, Journal of the ACM (JACM), 56 (2009), no. 3, 17.
[4] R.Bundschuh, High precision simulations of the longest common subsequence problem, The European Physical Journal B - Condensed Matter and Complex Systems, 22(2001), no. 4, 533-541.
[5] J.Boutet de Monvel, Extensive simulations for longest common subsequences, The European Physical Journal B - Condensed Matter and Complex Systems, 7(1999), no. 2, 293-308.
[6] R.Baeza-Yates, G.Navarro, R.Gavalda, R.Scheihing, Bounding the expected length of the longest common subsequences and forests, Theory of Computing Systems, 32(1999), no. 4, 435-452.
[7] Kang Ning, Kwok Pui Choi, Systematic assessment of the expected length, variance and distribution of Longest Common Subsequences, arXiv preprint arXiv:1307.2796, 2013.
[8] J.D.Dixon, Longest common subsequences in binary sequences, arXiv preprint arXiv:1307.2796, 2013.
[9] S.V.Znamenskij, A picture of common subsequence length for two random strings over an alphabet of 4 symbols, Program systems: theory and applications, 7(2016), no. 1(28), 201-208.
A Formula for the Mean Length of the Longest Common Subsequence
Sergej V. Znamenskij
Program Systems Institute of RAS, Peter the First, 4, Pereslavl area, Yaroslavl region, 152021
Russia
The expected value E of the longest common subsequence of letters in two random words is considered as a function of the alphabet size |A| and of the lengths m and n of these words. It is assumed that any letter appears independently and with equal probability at any position of a word. A simple expression for E(a,m,n) is presented for fixed a and m + n.
Keywords: longest common subsequence, expected value, LCS length, numerical simulation, asymptotic formula