UDC 519.612
SOME ESTIMATES OF THE PERFORMANCE OF PARALLEL ALGORITHMS FOR SOLVING LARGE LINEAR SYSTEMS OVER GF(2)
© Mikhail Alekseevich Cherepniov
Lomonosov Moscow State University, Leninskiye Gory, 1, Moscow, 119899, Russia, Candidate of Physical and Mathematical Sciences, Associate Professor, Number Theory Department, e-mail: [email protected]
Key words: fast algorithms, sparse linear systems, parallel algorithms, computer algebra.
This paper shows how to estimate the running time and the RAM volume required by implementations of the Wiedemann-Coppersmith algorithm, Montgomery's algorithm, some of their modifications, and a new algorithm when the work is distributed over multiple compute nodes; some further details of these algorithms are also discussed.
1 Introduction
In principle, every operation of any algorithm can be examined for the possible use of several similar nodes to reduce its execution time. If the number of arithmetic operations is fixed, the most significant reduction is achieved when the time falls proportionally to the growth of the number of compute nodes. This is the case, for example, for the search for relations in discrete logarithm or factorization methods based on factor bases. However, for most complex algorithms the best known bounds T(N, n, d, c, ...; s) on the running time depend both on the parameters of the task and the equipment (N, n, d, c) and on the number s of compute nodes used, and need not decrease proportionally to 1/s. Up to a certain threshold s0(N, n, d, c, ...), which depends on the task parameters, the time still falls as s grows, but more slowly than the running time on each compute node taken separately decreases.
In the author's opinion it is legitimate to pose the problem of calculating the values s0(N, d, n, c, ...) and T0(N, d, n, c, ...) = min_s T(N, d, n, c, ...; s) = T(N, d, n, c, ...; s0(N, d, n, c, ...)) for some model cluster. As a model it is natural today to take a cluster with an unlimited number of compute nodes, for which the runtime of one arithmetic operation with machine words is c times smaller than the runtime of passing one machine word between nodes. In a modern computer the transfer between the processor and RAM gives c ≈ 5, and between individual parts of RAM c ≈ 20. The constant c reflects the bandwidth of the channel between the processor and the bus that links compute nodes, the share of useful information bits in internal network communications, and some other characteristics of the cluster. Let us also assume that the cluster allows direct transfers between the compute nodes performing a specific job, and broadcast from a fixed node to the nodes of some group along a binary tree, i.e. in a number of transfers logarithmic in the size of the group. In this article all times are measured in units equal to the time of one arithmetic operation with machine words on such a model cluster. We assume that the time for conversion of matrix formats and the delay time for preparing the network are negligible.
The task is characterized by four parameters: N ≥ 2^26, the size of the original matrix; d, the average number of nonzero entries in a row of the matrix; n, the machine word length in bits; and c, defined above. If the volume of a transfer is V bits, the sending time is computed by the formula cV/n units of time, where one unit of time is the time of one operation with machine words in our model cluster.
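As a minimal illustration of this cost model (a sketch in Python; the function name and the sample figures are ours, chosen within the ranges stated above), the transfer time of a block can be computed as follows:

    def transfer_time_units(V_bits, c, n=64):
        # Time, in units of one machine-word arithmetic operation, to send
        # V_bits across the model cluster: V_bits/n machine words, each of
        # which costs c units to pass between nodes.
        return c * V_bits / n

    # Example: sending an N x n block with N = 2**26 rows of n = 64 bits
    # and c = 5 costs 5 * 2**26 time units.
    print(transfer_time_units((2**26) * 64, c=5))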
2 Notes on symmetric matrices
Let N, M ∈ N, F = GF(2), and suppose we want to obtain a solution of the system of linear homogeneous equations

DX = 0, D ∈ F^{M×N}, X ∈ F^{N×n}, M < N. (1)

At the beginning of Montgomery's algorithm and of the new algorithm [1] one selects a random block Y ∈ F^{N×n} and considers the linear system

AX = B, A ∈ F^{N×N}, B = AY, X ∈ F^{N×n}, (2)

where A = D^T D ∈ F^{N×N} is a singular symmetric matrix. Note that any solution X_D of system (1) yields a solution X_A of system (2) by the formula

X_A = X_D + Y.

Conversely, if X_A is a solution of system (2), then D^T D(X_A - Y) = 0. If rank D = M, it follows that D(X_A - Y) = 0, that is, X_A - Y is a solution of system (1).
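This correspondence between solutions of (1) and (2) is easy to check numerically. Below is a toy Python verification for the single-column case n = 1 (the sizes, the seed, and the helper names are ours; the paper's N is of course astronomically larger):

    import itertools, random

    random.seed(1)
    M, N = 4, 6                      # toy sizes with M < N
    D = [[random.randint(0, 1) for _ in range(N)] for _ in range(M)]

    def matvec(A, x):                # matrix-vector product over GF(2)
        return [sum(a * b for a, b in zip(row, x)) % 2 for row in A]

    def transpose(A):
        return [list(col) for col in zip(*A)]

    # Brute-force a nonzero kernel vector X_D of D (one exists since M < N).
    x_D = next(x for x in itertools.product([0, 1], repeat=N)
               if any(x) and not any(matvec(D, x)))

    y = [random.randint(0, 1) for _ in range(N)]
    x_A = [(a + b) % 2 for a, b in zip(x_D, y)]      # X_A = X_D + Y

    Dt = transpose(D)
    lhs = matvec(Dt, matvec(D, x_A))                 # A X_A = D^T D X_A
    rhs = matvec(Dt, matvec(D, y))                   # B = D^T D Y
    assert lhs == rhs                                # X_A solves A X = B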
Let dim1 X be the dimension of the intersection of the space ⟨X⟩, spanned by the columns of the block X, with the kernel of the linear operator D. Let dim2 X be the dimension of the intersection of ⟨X⟩ with the kernel of the linear operator A = D^T D, and dim2' X the dimension of the intersection of ⟨X⟩ with the kernel of the linear operator A' = DD^T.

Theorem 1. 1. For any X ∈ F^{N×n}: dim1 X ≥ dim2 X - (M - rank D).
2. For any X ∈ F^{M×n}: dim1 D^T X ≥ dim2' X - (M - rank D).
Proof. We have dim Ker D^T = M - rank D. By definition, a vector x lies in Ker A if and only if either it lies in Ker D, or Dx ∈ Ker D^T \ {0}. The maximum number of linearly independent vectors satisfying the second condition is at most dim Ker D^T, so the corank of Ker D in Ker A is at most M - rank D. Intersecting ⟨X⟩ with Ker A and with Ker D now gives claim 1. Similarly, a vector x lies in Ker A' if and only if D^T x ∈ Ker D, and the operator D^T maps ⟨X⟩ ∩ Ker A' into ⟨D^T X⟩ ∩ Ker D with kernel contained in Ker D^T, whose dimension is dim Ker D^T = M - rank D. So the corank of Ker D^T in Ker A' is no more than M - rank D, which gives claim 2. The theorem is proved.
Theorem 2. Let the Krylov space ⟨W⟩ be built as the sum of ⟨W_i⟩, i = 0, 1, ..., m, with the initial block W_0 = B = AY, such that on the first m - 1 of them the A-scalar product is nonsingular, and let ⟨W_m⟩ be A-orthogonal to the whole ⟨W⟩, in particular to itself. Let, along with AW_i, i = 0, 1, ..., m - 1, also AW_m be calculated, and X by the formula

X = Σ_{i=0}^{m-1} W_i (W_i^T A W_i)^{-1} W_i^T B, (3)

and let dim⟨W_m⟩ = μ > 0. Then by at most (n + μ - 1)(n + μ)N bitwise operations, with probability at least 1 - 2^{-n} (subject to statistical independence of ⟨Y⟩ and ⟨W⟩), one can calculate a solution of system (2).
Proof. Note that by the condition the space ⟨W_m⟩ is A-orthogonal to the space W = ⟨W_0⟩ + ... + ⟨W_m⟩, in particular A-orthogonal to itself. Note also that in practice the dimension of ⟨W_m⟩ is small (say, 2). Also, by the condition, ⟨W_m⟩ contains all vectors of the Krylov space that are A-orthogonal to the whole of it. Now AW_m has the form w_0 + w_1 + ... + w_m, w_i ∈ ⟨W_i⟩, i = 0, 1, 2, ..., m. For j ∈ {0, 1, ..., m - 1} we obtain the chain of equations

w_j^T A W_j = (w_0 + ... + w_m)^T A W_j = (AW_m)^T A W_j = (W_m)^T A (A W_j) ⊆ (W_m)^T A W = {0},

which, due to the nonsingularity of the A-scalar product on W_j, shows that w_j = 0 for j = 0, 1, ..., m - 1, i.e. AW_m ⊆ ⟨W_m⟩.
If ⟨AW_m⟩ ≠ ⟨W_m⟩, an element of the kernel of the matrix A can be obtained by reducing the right-hand side of the equality AW_m = Z to triangular form (the complexity of one elimination with columns of the matrices AW_m‖W_m, Z ∈ F^{N×(n+μ)}, is estimated by N bitwise operations).

Now let ⟨AW_m⟩ = ⟨W_m⟩. From formula (3) we get by substitution

(W_j)^T (AX - B) = (W_j)^T B - (W_j)^T B = 0, j = 0, 1, ..., m - 1,

and, since ⟨AW_m⟩ = ⟨W_m⟩ and ⟨AX - B⟩ ⊆ W,

(W_m)^T (AX - B) = (W_m)^T A (AX - B) ⊆ (W_m)^T A W = {0},

i.e. AX - B is orthogonal to W. Further,

(W_j)^T A (AX - B) = (AW_j)^T (AX - B) ⊆ W^T (AX - B) = {0}, j = 0, 1, ..., m - 1,

that is, the vectors of the block AX - B are A-orthogonal to W, hence ⟨AX - B⟩ ⊆ ⟨W_m⟩. It means that the linear operator A maps the sum of spaces ⟨X - Y⟩ + ⟨W_m⟩ into ⟨W_m⟩. If ⟨Y⟩ ⊄ ⟨W⟩, the dimension of the first space is larger than the dimension of the second, and with the help of Gaussian eliminations in the equality A(X - Y‖W_m) = Z, ⟨Z⟩ ⊆ ⟨W_m⟩, an element of the kernel of A can be found.

The probability that ⟨Y⟩ ⊄ ⟨W⟩, subject to the statistical independence of these spaces and the singularity of the matrix A (which gives ⟨W⟩ ≠ F^N), is obviously estimated by 1 - 2^{-n}. The number of eliminations with columns of the matrices X - Y‖W_m, Z ∈ F^{N×(n+μ)}, is estimated by 2((n + μ - 1) + ... + 1) = (n + μ)(n + μ - 1). The theorem is proved.
Note that the condition W_0 = B in the last theorem may be replaced by W_0 = Y, without the assumption of statistical independence of ⟨Y⟩ and ⟨W⟩. Then we similarly obtain A(X - Y) ⊆ ⟨W_m⟩, where by construction ⟨X - Y⟩ ⊆ ⟨W⟩ \ ⟨W_m⟩.
3 Parallelization of the first and third phases
Before discussing the statements of the theorems, note that sequential implementations of the algorithms under consideration spend most of their time in the repeatedly recurring operation of multiplying the matrix by a block of vectors; this is the main part of the first and third phases of the algorithms. Therefore, parallelization must be applied primarily to this operation.
One can consider two approaches to parallelizing the multiplication of the matrix by a block of vectors, related to distributed storage of the matrix and (or) distributed storage of the block. The first approach is also made necessary by the impossibility of storing the matrix D in the memory of a single node.

To use the second approach, consider a block of ns vectors. Namely, all the algorithms under consideration are applied to a system of linear equations DX = 0, where D ∈ F^{M×N}, X ∈ F^{N×ns}. The number s is called the blocking factor. For simplicity let M = N.
Consider the task of parallel distributed multiplication of a sparse matrix by a block. To ensure load balance one needs to split the sparse matrix into parts containing approximately the same number of nonzero entries. Without additional transformations the matrix can be split in one direction only, by rows. Divide the matrix D ∈ F^{N×N} of our system of linear homogeneous equations into horizontal strips according to the number l of compute nodes used. Thus the processor with number i receives a matrix D_i in which only N/l of the rows of D are left nonzero.
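A sketch of such a splitting (our own illustration, not the paper's code; the matrix is stored as per-row lists of column indices of the nonzero entries):

    def split_into_strips(rows, l):
        # rows[r] is the list of column indices of the nonzero entries of
        # row r of D.  Returns l contiguous horizontal strips whose total
        # nonzero counts are approximately equal.
        total = sum(len(r) for r in rows)
        strips, current, seen = [], [], 0
        for r in rows:
            current.append(r)
            seen += len(r)
            if len(strips) < l - 1 and seen >= total * (len(strips) + 1) / l:
                strips.append(current)
                current = []
        strips.append(current)
        return strips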
In the first phase of Montgomery's algorithm and of the new algorithm it is necessary to calculate (D^T D)^i B and B^T (D^T D)^i B, B ∈ F^{N×ns}, i = 1, 2, ....
We have

D^T D = Σ_{i=1}^{l} D_i^T D_i,

and hence

D^T D B_j = Σ_{i=1}^{l} D_i^T D_i B_j, j = 1, ..., s, (4)

where B_j ∈ F^{N×n} is the j-th vertical strip of B ∈ F^{N×ns}. Storage of this strip and of the matrix D_i on a compute node requires

Nn + nNd/l

bits of memory (the positions of the nonzero entries of D_i are stored, one column number per machine word).
The multiplication D^T D B is thus spread over ls compute nodes: the node holding D_i and B_j spends Nd/l operations to calculate the entries of D_i B_j and Nd/l operations (the number of nonzero elements of the matrix D_i^T) to complete the calculation of D_i^T D_i B_j ∈ F^{N×n}.
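The local work of one node in formula (4) can be sketched as follows (rows of the block B_j are packed into Python integers; the data layout and the identifiers are our assumptions):

    def local_DtDi_times_Bj(strip, Bj, N):
        # strip: list of (row_index, cols) pairs describing the local
        # sparse strip D_i; Bj[r] is row r of the block B_j packed into
        # an n-bit integer.  Returns D_i^T D_i B_j as N packed rows.
        out = [0] * N
        for _, cols in strip:
            t = 0
            for c in cols:           # t = (row of D_i) * B_j:
                t ^= Bj[c]           # an XOR of selected rows of B_j
            for c in cols:           # scatter t back through D_i^T
                out[c] ^= t
        return out

Each nonzero entry of D_i is touched exactly twice, which matches the Nd/l + Nd/l operation count given above.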
Further, the algorithm sequentially evaluates (D^T D)^k B. This requires sending the partial sums of (4) between the l compute nodes participating in the calculation. Using cyclic sending, this takes time c·2Nn/n = 2Nc. Together we obtain the following estimate for the time to compute (D^T D)B on ls compute nodes:

2dN/l + 2Nc. (5)
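The cyclic sending of the partial sums can be modelled as a ring reduction in which, over l - 1 rounds, every node accumulates the messages passed around the ring (a synchronous sketch of our own; a real implementation pipelines the data in pieces, which is what yields the 2Nc bound):

    def ring_xor_allreduce(partials):
        # partials[i]: the partial result of node i as a list of packed rows.
        # After l - 1 rounds every node holds the XOR of all partials.
        l = len(partials)
        acc = [p[:] for p in partials]       # running sum on each node
        msg = [p[:] for p in partials]       # message each node forwards
        for _ in range(l - 1):
            msg = [msg[(i - 1) % l] for i in range(l)]   # shift along ring
            acc = [[a ^ b for a, b in zip(acc[i], msg[i])]
                   for i in range(l)]
        return acc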
Calculation of the total Krylov space is needed, both in the new algorithm and in the Wiedemann-Coppersmith algorithm, to build the solution. In addition, in the first phase of the algorithms the vectors constituting the Krylov space are used to construct the coefficients of the series, namely the scalar products X^T D^i Y or B^T (D^T D)^i B in the corresponding algorithm.
For storage distributed by rows (strips) over l compute nodes, the whole block B^T requires Nns/l bits per node. After computing (D^T D)^i B_j, each of the l compute nodes of the group with number j, 1 ≤ j ≤ s, calculates its part of the scalar product B^T (D^T D)^i B_j by

3Nns / (l(log₂N - log₂log₂N))

arithmetic operations (for the corresponding algorithm see pp. 342-343 of [4]). When N ≥ 2^26 this fraction can be bounded from above by the value

4Nns / (l·log₂N). (6)

All the parts, after concatenation inside one group, give B^T (D^T D)^i B_j, and over all groups B^T (D^T D)^i B.
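The optimized GF(2) multiplication referred to here is the "four Russians" method; a compact sketch (with our own packing conventions) makes the logarithmic gain in the denominator of (6) visible:

    def m4rm_multiply(A_rows, B_rows, k, t=8):
        # Product C = A * B over GF(2).  A_rows[i] is row i of A packed
        # into a k-bit integer; B_rows[j] is row j of B packed likewise.
        # Columns of A are processed in chunks of t; for each chunk all
        # 2^t XOR-combinations of the matching rows of B are tabulated,
        # so t row operations are replaced by one table lookup.
        m = len(A_rows)
        C = [0] * m
        for g in range(0, k, t):
            rows = B_rows[g:g + t]
            table = [0] * (1 << len(rows))
            for mask in range(1, 1 << len(rows)):
                low = mask & -mask           # lowest set bit of mask
                table[mask] = table[mask ^ low] ^ rows[low.bit_length() - 1]
            for i in range(m):
                C[i] ^= table[(A_rows[i] >> g) & ((1 << len(rows)) - 1)]
        return C

Choosing the chunk width t near log₂N replaces roughly t row XORs by one lookup plus the table construction, which is the source of the denominator in (6).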
Together with estimate (5) we obtain the following estimate of the calculation time for the next block of vectors of the Krylov space on sl compute nodes:

2dN/l + 4Nns/(l·log₂N) + 2Nc. (7)

This estimate is minimized by the choice of l that balances the first and the third components:

l = d/c. (8)
A more detailed calculation shows that the second term in estimate (7), for the parameter values considered here, namely 2ns < d·log₂N, is less than the third. For large values of s it is better to do the scalar multiplications on additional compute nodes, so that the second term of estimate (7) is less than the first and the third. So let us use the estimate

4Nc (9)

per step, with the overall estimate over all 2N/(ns) steps of the first phase approximately

8N²c/(ns), (10)

under the condition sl ≤ C, i.e. sd/c ≤ C, where C is the total number of compute nodes of the cluster.
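A quick numeric illustration of estimates (7)-(10) (a sketch; the parameter values are sample figures consistent with the text: c from the range given in the introduction, d ≈ 2^9 as in the conclusion):

    from math import log2

    def first_phase_estimates(N, d, c, n, s):
        l = d / c                                            # (8)
        per_step = 2*d*N/l + 4*N*n*s/(l*log2(N)) + 2*N*c     # (7)
        steps = 2 * N / (n * s)
        total = 8 * N**2 * c / (n * s)                       # (10)
        return l, per_step, steps, total

    print(first_phase_estimates(N=2**26, d=2**9, c=12, n=64, s=8))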
The memory of one node that participates in the multiplication of the matrix by the block and in calculating the coefficients of the series must hold about

nNd/l + Nn + Nns/l = Nn(1 + c + sc/d)

bits (the matrix D_i, the current B_j, and a part of B^T) when l = d/c. In the case N ≈ 2^26, n = 64 this is approximately

(1/2)(1 + c + sc/d) GB. (11)

Note that if the scalar multiplications are calculated on separate additional compute nodes, then the memory required on a single node is at most

(1 + c)/2 GB. (12)
To obtain the general estimate of the first and the third phases of the new algorithm in its various applications, estimate (10) must be multiplied by a constant not greater than 2.
In the case of the Wiedemann-Coppersmith algorithm the estimate of memory usage is given by expression (12). The estimate of the running time may be obtained similarly to (7); there the factor 2 near d is absent, because multiplication by D^T is not needed, and the corresponding optimal value is l = d/(2c). For the time of the first phase, which contains the calculation of 2N/(ns) series coefficients, and taking into account the need for two passes (the second one to build the solution), we get an estimate very close to the previous one:

16N²c/(sn). (13)

Note only that in the second pass of the Wiedemann-Coppersmith algorithm an operation similar to the multiplication by B^T (applied to the blocks g_i) requires an equivalent amount of time.
4 Parallelization of the second phase
In Wiedemann's algorithm in Coppersmith's version, the vector g is built sequentially, with its size gradually increasing. Giorgi P., Jeannerod C.-P. and Villard G. in [3] proposed another option with an almost linear complexity estimate, though with several logarithmic factors and rather big absolute multiplicative constants. This optimized algorithm we will call the Wiedemann-Coppersmith algorithm with matrix polynomial multiplication.
Let

h = h(λ) = Σ_i H_i λ^i, H_i ∈ F^{ns×ns}. (14)

Here the size ns is chosen so that the algorithm can be applied to the construction of the approximation polynomial at the second stage of the Wiedemann-Coppersmith algorithm with blocking factor s.
Definition 1. The degree of a vector polynomial m(λ) ∈ (F[λ])^{2ns×1} is its degree as a polynomial with vector coefficients from F^{2ns×1}.

Definition 2. The order of a vector polynomial m(λ) ∈ (F[λ])^{2ns×1} is the nonnegative integer j that satisfies h(λ)m(λ) = Σ_{i≥j+1} v_i λ^i, v_{j+1} ≠ 0.

Definition 3. A σ-basis of the series h(λ) is a matrix polynomial M(λ) ∈ F^{2ns×2ns}[λ] that satisfies the following conditions:

1. The columns M^{(i)}(λ) of the matrix M(λ) have order not less than σ.

2. Any v ∈ F^{2ns×1}[λ] whose order is not less than σ has a unique representation

v = Σ_{i=1}^{2ns} M^{(i)} c^{(i)}, c^{(i)} ∈ F[λ], deg M^{(i)} + deg c^{(i)} ≤ deg v.
To implement the Wiedemann-Coppersmith algorithm we must build Padé approximations to the series of the form (14), where H_i = X^T D^i Y, X, Y ∈ F^{N×ns}: namely, P(λ), Q(λ) ∈ F^{ns×2ns}[λ] that satisfy

(h ‖ I_{ns}) (P^T ‖ Q^T)^T = O(λ^{2d+1}), deg Q, deg P ≤ d, (15)

where I_{ns} is the identity matrix from F^{ns×ns} and d = N/(ns).
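Definition 2 and condition (15) are straightforward to check on truncated series; a sketch for generic sizes (dense list-of-lists matrices over GF(2); the identifiers are ours):

    def series_times_vector(H, m, upto):
        # Coefficients of h(lambda) * m(lambda) up to degree `upto` over
        # GF(2).  H[i] is the matrix coefficient of lambda^i (a list of
        # rows), m[i] the vector coefficient of lambda^i.
        def matvec(A, x):
            return [sum(a * b for a, b in zip(row, x)) % 2 for row in A]
        height = len(H[0])
        out = []
        for k in range(upto + 1):
            acc = [0] * height
            for i in range(min(k, len(H) - 1) + 1):
                if k - i < len(m):
                    acc = [(a + b) % 2
                           for a, b in zip(acc, matvec(H[i], m[k - i]))]
            out.append(acc)
        return out

    def order(H, m, upto):
        # Order in the sense of Definition 2: one less than the index of
        # the first nonzero coefficient of h*m (as far as `upto` shows).
        for j, v in enumerate(series_times_vector(H, m, upto)):
            if any(v):
                return j - 1
        return upto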
Let further, for simplicity, Nns = 2^k.
Theorem 3. Let N ≥ 2^26. An upper bound for the running time of a parallel implementation, using blocking factor s, of the Wiedemann-Coppersmith algorithm with matrix polynomial multiplication:

1) when

s ≥ 120(log₂log₂32N)(log₂ns - log₂log₂ns)/(17n), (16)

has the form

1122Nn²s³(log₂32N)(log₂4N); (17)

2) when

s ≤ 120(log₂log₂32N)(log₂ns - log₂log₂ns)/(17n), (18)

has the form

15840Nns²(log₂32N)(log₂4N)(log₂log₂32N). (19)
Remark 1. Since the inequality z ≥ 2A·log₂A implies z ≥ A·log₂z, we obtain that if ns ≥ (240/17)·log₂log₂32N · log₂((120/17)·log₂log₂32N), then inequality (16) holds.
Proof of Theorem 3. To build the necessary approximation we use the algorithm from article [3], which builds a σ-basis. We replace the series h(λ) by h(λ^{2ns}) × (1, λ, λ², ..., λ^{2ns-1})^T and construct an approximation of order 2dns to it, instead of an approximation of order 2d to h(λ). The complexity estimate C(ns, ns, 2dns) of the corresponding algorithm can be obtained from the proof of Theorem 2.4 of [3]:

C(ns, ns, 2dns) ≤ 2C(ns, ns, dns) + MM(ns, dns) + MM(ns, 2dns),

where MM(a, b) is the complexity of multiplying two matrix polynomials of degree b from F^{a×a}[λ]. Continuing similarly, we obtain

C(ns, ns, 2dns) ≤ MM(ns, 2dns) + Σ_{i=1}^{log₂2dns} MM(ns, 2^{-i}·2dns)(2^i + 2^{i-1}) ≤ (3/2) Σ_{i=0}^{log₂2dns} 2^i MM(ns, 2^{-i}·2dns). (20)

Recall that for simplicity we demand Nns = 2^k.
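The divide-and-conquer bound (20) is easy to evaluate numerically under any cost model for MM; below is a sketch with a generic FFT-shaped stand-in for the constants of [5] (our model, not the paper's exact constants):

    from math import log2

    def MM_cost(a, b):
        # Stand-in cost model for multiplying two matrix polynomials of
        # degree b over F_2^{a x a}: an a^2 * b * log(b) shape, with the
        # true constants left to [5].
        return a * a * b * max(log2(b), 1.0)

    def C_bound(a, two_t):
        # Unrolls C(a, a, 2t) <= 2 C(a, a, t) + MM(a, t) + MM(a, 2t),
        # the recursion behind (20), down to degree 1.
        if two_t <= 1:
            return MM_cost(a, 1)
        t = two_t // 2
        return 2 * C_bound(a, t) + MM_cost(a, t) + MM_cost(a, two_t)

    # Toy magnitude check: ns = 512, approximation order 2dns = 2**20.
    print(C_bound(512, 2**20))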
An optimized algorithm for multiplying matrix polynomials may be taken from article [5]. According to Lemma 3.2 of that work we obtain the estimate

MM(ns, 2^i) ≤ α_Q·(ns)²/n + β_Q·3(ns)³/(n(log₂ns - log₂log₂ns)),

where Q is such that p(r^Q) ≥ 2(2^i + 1); H is such that 2^H + 2 ≤ Q ≤ 2^{H+1} + 1; α_Q ≤ r^Q·2^H((6r + 2)ψ_r(H + 1) + α₂); β_Q ≤ r^Q·2^H·ψ_r²; as r any natural number other than 2 can be selected; ψ_r is the sum of the moduli of the coefficients of the cyclotomic polynomial with number r; α₂ and β₂ are respectively the numbers of additions and multiplications for multiplying polynomials of degree p(r₂) - 1, and it is known that for r = 3 one can take α₂ equal to 17. Here we apply the trivial algorithm for the addition of matrices, broken by rows into machine words of length n, and the optimized algorithm for multiplying such matrices set out above. When r = 3 we get ψ_r = 3. Q can be chosen as the lowest number such that 3^{Q-1} ≥ 2^i + 1 (see p. 8 of [3]), so 3^Q ≤ 9(2^i + 1), and when i ≥ 1 we obtain 3^Q ≤ 2^{i+4}; since 3·log₃2 < 2, we have

MM(ns, 2^i) ≤ (ns)²/n · (2^{i+5}/3)(i + 4) · (60(log₂(2(i + 4)/3) + 1) + (2/9)α₂ + 17ns/(log₂ns - log₂log₂ns)).

Because the inequality

(2/9)α₂ + 60·log₂(4(i + 4)/3) ≤ 120·log₂log₂32dns

is obviously satisfied for i ≤ log₂2dns, each summand 2^i·MM(ns, 2^{-i}·2dns) in (20) can be bounded by the value

(ns)²/n · 2dns · 11(log₂2dns + 4) · (120·log₂log₂32dns + 17ns/(log₂ns - log₂log₂ns)). (21)
Thus

C(ns, ns, 2dns) ≤ (3/2)·(ns)²/n · 2·2dns·11(log₂2dns + 4)·log₂4dns · (120·log₂log₂32dns + 17ns/(log₂ns - log₂log₂ns)) ≤ 66Nns²(log₂32N)(log₂4N) · (120·log₂log₂32N + 17ns/(log₂ns - log₂log₂ns)). (22)

In case 1) the bracket does not exceed twice its second term, so this inequality can be continued by the value

132Nns²(log₂32N)(log₂4N) · 17ns/(log₂ns - log₂log₂ns).

Since in this case ns ≥ 16, and hence log₂ns - log₂log₂ns ≥ 2, this estimate can be continued by the value

1122Nn²s³(log₂32N)(log₂4N).

The second case obviously follows from estimate (22). The theorem is proved.
Theorem 4. The running time estimate of a parallel implementation, using blocking factor, of the Wiedemann-Coppersmith algorithm with matrix polynomial multiplication, under the conditions N ≥ 2^26 and

N ≥ (71(log₂32N)(log₂4N)/(cn)) · ((240/17)·log₂log₂32N · log₂((120/17)·log₂log₂32N))⁴,

has the form

94N^{1+3/4}·n^{-1/4}·c^{3/4}·((log₂32N)(log₂4N))^{1/4}.
Proof. The statement follows from Remark 1, case 1) of Theorem 3, and estimate (13) with

s = (8Nc/(561n³(log₂32N)(log₂4N)))^{1/4}.

Note that for N ≥ 2^26, c ≥ 1 the estimate of this theorem holds.
5 Conclusion
It is worth noting that the procedures (see [6]) with which a new integer factorization record was obtained on 12 December 2009 actually apply the usual Wiedemann algorithm in Coppersmith's version with s = 8, but the construction of the approximations is performed not sequentially but along a binary tree, as described above. The complexity of the corresponding algorithm, proposed by Thomé in [7], is estimated by the value

O(n²s²(ns + log₂k)·k·log₂k·log₂log₂k), k = N/(ns),

that is,

O(Nns(ns + log₂N)·log₂N·log₂log₂N).
This estimate is very close to estimate (22); in essence, the difference is that the degree of s is smaller by one. Therefore, by similar reasoning, with s = O(N^{1/3}) one can obtain a time estimate of the type

O(N^{1+2/3}(log₂N·log₂log₂N)^{1/3}).
It is important to note that the necessary memory volume in this case is

O(k(ns)²log₂k) = O(Nns·log₂(2N/(ns)))

bits, where the multiplier log₂k is associated with the growth of the integer factors under recursive application of the Fourier transform to polynomials with integer coefficients (see, for example, [8, 9]). For N ≈ 3·2^26, s = 8 this is around 1 TB. With the optimal value of s this is O(N^{1+1/3}·n·log₂N), which is really quite large. An important advantage of algorithm [1] is that it does not need multiplications of matrix polynomials: only a few coefficients of such products are calculated. This leads to good parallelization properties and to O(N) RAM volume (with the constant in O approximately equal to 3).
However, some optimizations of this algorithm allow one to remove the requirement of growing RAM altogether and to obtain, for N ≥ 2^26, n = 64, 1 ≤ c, an estimate of the running time of a parallel implementation using blocking factor for the whole algorithm of the type

8.4·c^{1/2}·N^{1+1/2}, (23)

with the number of compute nodes estimated asymptotically as O(N/log₂N) (for N ≈ 2^26, c ≈ 12, d ≈ 2^9 the number of required compute nodes is approximately 2400).
It is known that in the usual parallel implementation of Montgomery's algorithm [2] the main complexity comes from the sending time during the multiplications of the matrix by a vector (see estimate (5)); this value can be assessed as 2cN²/n. As far as the author knows, no parallel implementations of Montgomery's algorithm with better bounds exist at present. The ratio of this running time to the optimal time for the new algorithm is roughly √(cN)/(4.2n).
In addition, this algorithm does not possess "high parallelism": its parts cannot be executed on independent clusters. The new algorithm and the Wiedemann-Coppersmith algorithm allow one to compute the coefficients of the series and its approximations on independent sites.
Let us note one more feature of the Wiedemann-Coppersmith algorithm and of the new algorithm. When s grows, the time of the second phase increases as a function of s no slower than s². For example, the complexity of multiplying two matrices from GF(2) of size ns × ns, as we saw above, has the estimate n²s³. At the same time, if these matrices are divided into blocks of size ns^{2/3} × ns^{2/3} and each pair of such blocks from the two matrices is sent for multiplication to its own execution site, the time becomes O(n²s²) with O(s) compute nodes used; a sketch of this decomposition is given after this paragraph. Similar considerations apply to the parallelization of the scalar products of vectors. The solution of homogeneous systems (reduction to triangular form) can be carried out by a recursive algorithm that again reduces the calculations to matrix multiplication. Such parallelization leads to an overall estimate of the form

O(N²/s) + O(Ns),

which for s = O(√N) gives O(N^{1+1/2}). The total number of compute nodes 2sl (see (8)) will be about O(N/log₂N).
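The block decomposition just described can be sketched as follows: the product of two ns × ns matrices is cut into a q × q × q grid of independent block products that can be placed on separate nodes (task generation only; the choice of q near s^{1/3} is the one discussed above, and the identifiers are ours):

    def block_product_tasks(dim, q):
        # Splits C = A * B (dim x dim over GF(2)) into q*q*q independent
        # tasks: task (i, j, k) multiplies block A[i][k] by block B[k][j];
        # the partial products for a fixed (i, j) are XOR-accumulated.
        step = dim // q
        return [(i * step, j * step, k * step, step)
                for i in range(q) for j in range(q) for k in range(q)]

    # Example: ns = 512 and q = 4 gives 64 block products of size 128.
    tasks = block_product_tasks(512, 4)
    print(len(tasks), tasks[0])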
We gather the results in the table:

Algorithm | Running time | Nodes | Total RAM
Wiedemann-Coppersmith with matrix polynomial multiplication | 94N^{1+3/4}n^{-1/4}c^{3/4}((log₂32N)(log₂4N))^{1/4} | O(N^{1/2}/(log₂N)²) | O(N^{1+1/4})
Wiedemann-Coppersmith-Thomé | O(N^{1+2/3}(log₂N·log₂log₂N)^{1/3}) | O(…) | O(N^{1+1/3}·log₂N)
Montgomery (1995) | 2cN²/n | O(1) | O(N)
New algorithm | 8.4c^{1/2}N^{1+1/2} | O(…) | O(N^{1+1/2})
References
1. Cherepnev M.A. Block Lanczos-type algorithm for solving sparse linear systems // Diskr. Math. (in Russian). 2008. V. 20. N. 1. P. 145-150.
2. Montgomery P.L. A Block Lanczos Algorithm for Finding Dependencies over GF(2) // EUROCRYPT '95, LNCS. 1995. V. 921. P. 106-120.
3. Giorgi P., Jeannerod C.-P., Villard G. On the Complexity of Polynomial Matrix Computations // ISSAC '03, August 3-6, 2003, Philadelphia, USA.
4. Coppersmith D. Solving Homogeneous Linear Equations over GF(2) via Block Wiedemann Algorithm // Mathematics of Computation. Jan. 1994. V. 62. N. 205. P. 333-350.
5. Cantor D., Kaltofen E. On Fast Multiplication of Polynomials over Arbitrary Algebras // Acta Informatica. 1991. V. 28. P. 693-701.
6. Kleinjung T., Lenstra A.K., et al. Factorization of a 768-bit RSA modulus, version 1.0. January 7, 2010. URL: http://eprint.iacr.org/2010/006.pdf.
7. Thomé E. Subquadratic computation of vector generating polynomials and improvement of the block Wiedemann algorithm // J. of Symbolic Computation. 2002. V. 33. Issue 5. P. 757-775.
8. Vasilenko O.N. Number-Theoretic Algorithms in Cryptography // AMS. Moscow State University, 2007.
9. Naudin P., Quitté C. Algorithmique algébrique. Masson, 1992.
Accepted for publication 7.06.2010.
SOME ESTIMATES OF THE PERFORMANCE OF PARALLEL ALGORITHMS FOR SOLVING LARGE LINEAR SYSTEMS OVER GF(2)

© Mikhail Alekseevich Cherepnev

Lomonosov Moscow State University, Leninskiye Gory, 1, Moscow, 119899, Russia, Candidate of Physical and Mathematical Sciences, Associate Professor, Number Theory Department, e-mail: [email protected]

Key words: sparse systems; integer factorization; parallel algorithms; computer algebra.

In this paper, estimates are given of the running time and memory, at the optimal number of compute nodes, for the known block algorithms for solving sparse systems over GF(2), for their modifications, and for a new algorithm.