Полиномов в параллельных алгоритмах

Лапаев Алексей Олегович

UDK 519.688

DFT FOR POLYNOMIALS IN PARALLEL ALGORITHMS

Tambov State University named after G.E. Derzhavin, Internatsionalnava, 33, Tambov, 392000, Russia, Post-graduate Student of Computer and Mathematical Modeling Department,

e-mail: [email protected]

Key words: polynomials; discrete Fourier transform; parallel algorithm; method of homomorphic images; cluster.

We investigate sequential and parallel algorithms for polynomial arithmetic based on discrete Fourier transform (DFT). Algorithms for polynomial multiplication are discussed. Sequential algorithms for polynomial matrix are proposed. Each algorithm based on DFT has been compared with similar algorithm based on Chinese remainder theorem. In the last part of work parallel algorithms for calculation DFT and multivariable polynomials multiplication are considered. Theoretical expressions of complexity are presented for each algorithm. Results of experiments on MVS cluster are presented for parallel algorithms.

1 Introduction

Effective implementation of arithmetics for multivariable polynomials is a significant problem in symbolic computations. This problem is important for calculations with polynomial matrices. During calculations polynomials of high degrees are appearing. Therefore, standard algorithmes are non-effective. Modular methods are used to reduce cost of polynomial coefficient and degree growth. Let’s remind Chinese remainder theorem (CRT), It is usually used by next scheme: elements of polynomial matrices are mapped into finite fields Zp[x]/(x — j)Zp[x], Zp = Z/pZ , where p is some prime number. Then calculations are fulfilled over this finite fields. Result is recovered via Newton’s or Lagrange’s scheme. Complexity of interpolating of one polynomial is O(t2 + tr2), where t and r are the numbers of used polynomial and numerical modules respectively.

It has been shown in works [10, 11] that algorithms based on discrete Fourier transform are more effective at interpolating result instead of CRT for polynomial modules. The main idea of calculations based on DFT is that under mapping in factor ring Zp[x]/(x — j)Zp[x] values of parameter j are chosen in a special way. It allows to recover result in Zp[x] bv DFT and CRT in time O(t log21 + tr2) where t - the number of points in DFT, r - the number of the numerical modules. Coefficients of polynomials are also recovered by CRT,

In paper [1] the algorithm for multiplication of two polynomials based on DFT in a finite field Zp[x] is described. Theoretical and experimental comparison of DFT-algorithm of polynomials multiplication with Karatsuba’s algorithm of multiplication [12] and direct algorithm is resulted in works [7,12], The problem of multiplication of two polynomials based on DFT on the processor with several kernels and the general memory is considered in articles [2,3],

Comparison of two classes algorithms for polynomial calculations is done in the given work: the first class of algorithms uses CRT both for polynomial and for the numerical modules. The second class of algorithms uses DFT instead of CRT for polynomial modules.

In section 2 theoretical and experimental comparison of algorithms of polynomial multiplication of one variable is spent.

In section 3 DFT-algorithm for calculation of a determinant, the characteristic polynomial, the adjoint matrix for the matrices which elements are polynomials of one variable over a ring of integers are considered. The problem of multiplication of two polynomial matrices is considered. For each algorithm theoretical and experimental comparison with the CRT-algorithm is resulted, Alogrithm based on DFT are suggested by author.

In section 4 the algorithm of calculation of DFT for multivariable polynomials is considered, complexity estimations are shown. The parallel algorithm of multidimensional DFT calculation is proposed by author. Results of experiments on cluster MVS are presented.

In section 5 the parallel algorithm of multivariable polynomial multiplication based on DFT on parallel machines with the distributed memory is considered, Parallezation scheme of algorithm has been made by the authors.

All experiments are done on MVS cluster with the next configuration: 1460 computing modules with 8 cores Intel Xeon 3 GHz and 8Gb RAM by module, operation system is Cent

OS, All algorithmes were implemented in Java 1,6, MPIJava binding for MPICH is used for parallel algorithms.

2 Algorithms of multiplication of polynomials of one variable in Z-ring

It has been shown In work [1,7,12] that multiplication of polynomials based on DFT has the complexity estimation O(mlog2 m), where m is the degree of polynomials. Let’s consider the following algorithm for computing product of two polynomials f,g E Z [x] based on DFT (PF):

1, Choose the number of numerical prime modules r , p0,... ,pr-i, sufficient for recovering

fg Z

2, Calculate DFT F(f ),F(g) for polvnomials f and g in each finite field Zpi, i = 0,... ,r — 1,

3, Calculate F(f) ■ F(g) in each finite field Zpi, 6 = 0,..., r — 1, where ” ■ ” - operation of element-wise multiplication of two vectors,

4, Calculate the inverse DFT for vector F (f) ■ F (g) in each finite field Zpi, 6 = 0,... ,r — 1. Elements of this vector are polynomial’s coefficients fg over the module p^

fg Z

At the given algorithm following mappings in various algebras take place:

Z[x] ^ Zp[x] ^F F(Zp[x]) ^F-1 Zp[x] ^ Z[x]

fg

polynomials in Z[x] with degrees m — 1, each coefficient occupies w machine words. Product degree fg is 2m — 2 , The maximum bv absolute value coefficient of product of polynomials fg contain no more than r = |~logh m + 2w] words, where h = 2H , H - number of bits in a machine word,

H

Then r is the number of prime modules p0,... ,pr-i which is enough for recovering the result from Zpi [x] to Z [x] via modular a method according to the Chinese remainder theorem (CRT), Here it is designated Zpi = Z/piZ.

As product degree fg is equal 2m — 2 , the number of points for DFT is N = 2 riog2(2m—i)l _ Let f = FpN (f) and FpN (f) - N-dimensional vectors of direct and inverse discrete Fourier transforms for a polynomial f in the field Zpi on N points. Here f is considered as a vector

f

Nf = F£ (FPN (f)), f g Zpi

FpN(FPN(f) ■ FpN(g))/N, (*)

where operation " ■ " designates multiplication of two veet ors of length N in terms Vi = Wi ■Ui. Let’s result separately each step of algorithm and number of operations on the step,

f g pi,

i = 0,... ,r — 1, It takes to execute 2m divisions of numbers of length (w) on numbers of length (1), It is required 2mrw divisions and 2mrw subtractions.

b) We calculate DFT for polynomials f and g in Zpi [x]

i = 0,... ,r — 1 Calculation of DFT is carried out by algorithm Cooley-Tukey [7] on N =

2rio§2(2m—i)l points. Then the number of operations of addition, multiplication and division are

2Nr log2 N

f g Zp i

rN

Zpi

rN log2 N

e) We recover the result coefficients bv CRT, For this recovery it is required to fulfill 2r2(2m— 1) addition and multiplication operations.

Let’s compare the considered algorithm and the following:

0, Standard algorithm of multiplication for numbers and polynomials (PSS),

1, Karatsuba’s algorithm for multiplication of numbers and standard algorithm for multiplication of polynomials (PSK),

2, Standard algorithm for multiplication of numbers and Karatsuba’s algorithm for multiplication of polynomials (PKS),

3, Karatsuba’s algorithm for multiplication both numbers and polynomials (PKK),

Let’s result theoretical estimations of algorithms [12]. We get the number of operations of addition - A, multiplication - M and divisions - D,

Table 1

Number of additions, multiplications and divisions at multiplication of dense polynomials for

algorithms 0-4

0 A 2 2 m2w2

PSS M m2(w2 + 2w)

1 A 10m2(wlog2 3 — w)

PSK M m2wlog2 3

2 A w2(10mlog2 3 — 14m + 4)

PKS M mlog2 3w2

3 A w(10mlog2 3 — 14m + 4) + (mw)log2 3

PKK M (mw)log2 3

4 A 2mrw + 3Nr log2 N + 2r2(2m — 1)

PF M 3Nr log2 N + rN + 2r2(2m — 1)

D 2mrw + 3rN log2 N + rN

It is clear from Table 1 that the best algorithm by degree of polynomials is algorithm PF,

w

based on Karatsuba’s scheme for number multiplication,

2.1 Experimental comparison of algorithms

Programs have been written and experiments for measuring time of execution for algorithms

mw

Results are presented in tables below,

mw

least time for multiplication of two polynomials by comparison the tables listed above. Besides, let’s find the relation of the time spent with algorithm PSS, to epv time spent with the most fast algorithm. Results are presented in Table 3,

Table 2

Results of experiments with multiplication of polynomials of one variable in a ring of integers

m 4 16 64 256 1024 4096 16384

PSS

w = 4 0.008 0.235 5.9 103 1550 20930 292020

w = 16 0.05 1.325 26.5 425 6720 101970 1.58-106

w = 64 0.68 17.25 305 5060 80740 1.3-106 21-106

PSK

w = 4 0.009 0.26 6.4 109 1550 21770 319440

w = 16 0.05 1.3 25 415 6480 99700 1,6106

w = 64 0.66 19.75 330 5430 78730 1213330 19.3-106

PKS

w = 4 0.022 0.39 5 55 300 3600 28240

w = 16 0.057 0.85 10.1 102 870 6940 59040

w = 64 0.56 6.8 68 590 5360 47420 42679

PKK

w = 4 0.023 0.4 5.2 57 510 3690 29830

w = 16 0.057 0.84 10.1 104 880 6990 59670

w = 64 0.55 11.2 90 720 5680 48040 513630

PF

w = 4 0.33 1.575 7.8 34.5 145 670 2510

w = 16 1.9 9.6 41 175 730 2910 12200

w = 64 20. 88 360 1450 5890 24130 100140

Table 3

iНе можете найти то, что вам нужно? Попробуйте сервис подбора литературы.

Numbers of algorithms of multiplication of dense polynomials which have shown in experiments the least time for the data m and w and their gain in time in relation to

standard algorithm

m w = 4 w = 16 w = 64

4 0/1 0/1 1/1.03

16 0/1 2/1.53 2/2.53

64 2/1.18 2/2.48 2/4.85

256 4/2.99 2/4.06 2/8.58

1024 4/10.69 4/8.87 2/14.69

4096 4/31.24 4/35.04 4/53.24

16384 4/116.34 4/129.94 4/204.84

The table 3 can be compared with theoretical estimations of complexity resulted in Table

1, For this purpose we construct Table 4 on expressions of complexity for the same sets of mw

Table 4

Numbers of the best algorithms by theory for multiplication of dense polynomials and their

prize concerning standard algorithm

m w = 4 w = 16 w = 64

4 2/1.5 2/1.7 3/1.9

16 2/2.1 2/2.7 3/3.4

64 2/3.3 2/4.7 3/5.9

256 2/5.7 2/8.2 3/10.4

1024 2/10.0 2/14.5 4/20.4

4096 2/17.6 4/43.9 4/78.2

16384 4/61.8 4/160.4 4/300.0

From comparison of these two tables it is clear that theoretical expressions for complexity of the algorithms well enough correspond with times are received during experiments. Average relative error makes 35,86 %,

Distinctions in a prize are connected by that theoretical estimations consider only operations of addition, multiplication and division, laying aside all other operations,

3 DFT in consecutive matrix algorithms

3.1 Matrix multiplication

Let’s consider a problem of multiplication of two matrices over a ring of polynomials of one variable. Let’s result two algorithms of matrix multiplication over a ring of polynomials of one variable and receive expressions of their complexity:

0, Modular algorithm of matrix multiplication, using CRT both for polynomials, and for their coefficients (MCC),

1, Modular algorithm of matrix multiplication, using fast Fourier transform for polynomials and CRT for their coefficients (MFC),

In algorithm MFC the new way of application of discrete transformation of Fourier is used. For each element of multiplied matrices it is calculated homomorphic DFT-image, and further calculations on algorithm are made with these images. The result is recovered at first by inverse DFT and then by CRT after the end of calculations.

Let matrices A,B e Mnxn[Z[x]], Let maximum on the coefficients absolute value of polynomials which are elements of matrices A and B , contain w^d wB words accordingly. The maximum degrees of elements of matrices A and B it is less mA and mB, We get m = min{mA,mB} , Then the maximum coefficient of product contains r = \wA + wB + logh m + logh n] words, h = 2H , Having assumed that in CRT for numbers are used H-bit r

The maximum degree of polynomials which appear in product A ■ B is less then mAB = mA + ms - 1 ■

Let’s take polynomial modules of the first degree. In algorithm MCC it is necessary to take mAB prime polynomial m odules x,x — 1, x -2,..., x — mAB + 1, In algorithm MFC it is required to calculate DFT for each element of matrices A and B on N = = 2^log2mAB 1 points. It is known from work [7] that with use of N points, DFT in a prime field can be calculated for N log2 N operations of addition and as much multiplication and division operations.

At reflection in the field Zp. [x] remainders of division by prime numbers p0,... ,pr-1 are calculated. Thus, it is required to execute n2rmAw^d n2rmBwB divisions and as much

AB

ZPi [x]j = Zpi [x]/(x — j)ZPi [x], ^ = 0.mAB — 1, it is necessary to calculate a polynomial remainders of division on the prime modules (x — j) in the fie Id Zp. [x], For this purpose it is required n2rmABmA and n2rmABmB operations of multiplication, addition and division for AB

For multiplication of matrices we use standard algorithm. Then the number of ring operations in Z[x] is n3 operations of multiplication and n3 addition operations,

rmAB rmAB

calculations on algorithm MCC , Complexity of multiplication of two polynomials makes rmAB multiplications and as much divisions,

rN

It is necessary to recover n2 polynomials. It is required mABr multiplications and twice as much additions for recovery of one polynomial from Zp. [xj to Zp. [x] , As addition and multiplication are carried out bv the prime module p* it is required also 3mAB divisions. It is required r2mAB multiplications and twice more additions for restoration of factors of a polynomial in Z[x] ,

Let’s result the number of additions A, multiplications M and divisions D for algorithms MCC and MFC.

Table 5

The functions are expressing the number of operations of addition, multiplication and division

for algorithms MCC and MFC

0 A n2r(mAwA + mB wb + mAB mA+ +mAB ms) + n3rmAB+ +n2(2mAB r + 2r2mAB)

MCC M n2r(mAB mA + mAB ms) + n3rmAB+ +n2(2mAB r + r2mAB)

D n2r(mAwA + ms ws + mAB mi+ +mAB mB) + 2n3rmAB + 3n2r2mAB

1 A n2r(mA®A + mB wB + 2N log2 N)+ +n3rN + n2(rN log2 N + 2r2mAB)

MFC M 2n2rN log2 N + n3rN+ +n2 (rN log2 N + 2r2mAB)

D n2r(mA®A + mB wB + 2N log2 N)+ +2n3rN + n2rN log2 N

3.2 Experimental comparison of algorithms MCC and MFC

Programs have been written and experiments in which and MFC time of performance of

m, n, w

Let’s consider that the matrix has the size n, density a which elements are degree polynomials m with the factors consisting from w of machine words, has type (n,m,a,w), Let’s result the relation of time of performance of algorithm MCC to time of performance of algorithm MFC:

Table 6

The relation of time of performance of algorithm MCC to algorithm MFC at multiplication of matrices of type (n,m, 100, w) by results of experimental comparison

w=8

m n=4 n=8 n=16 ii 32 ii 6!

4 2.01 1.77 1.6 1.23 1.04

8 3.49 3.05 2.22 1.48 1.15

16 5.61 3.99 2.99 1.84 _

32 7.71 5.85 3.43 1.82 _

64 11.74 6.65 _ _ _

128 13.91 7.95 _ _ _

w 16

m n=4 n=8 n=16 ii 32 ii 6!

4 2.51 2.31 2.0 1.71 _

8 3.99 3.49 2.86 2.42 _

16 5.73 4.63 3.78 3.04 _

32 8.01 6.03 4.57 _ _

For some m, n, w values of the functions expressing complexity of algorithms MCC and MFC (Tab, 5) have been calculated. The theoretical prize of algorithm MFC rather MCC is presented to Table 7,

Table 7

The relation of the number of arithmetic operations of algorithm MCC to the number of arithmetic operations of algorithm MFC for type matrixes (n,m, 100, w) bv results of

theoretical comparison

w = 8

m n = 4 n = 8 n = 16 n = 32 n = 64

4 2.33 2.14 1.87 1.58 1.32

8 2.52 2.33 2.06 1.75 1.46

16 2.94 2.73 2.41 2.03 1.66

32 3.84 3.55 3.12 2.58 2.05

64 5.63 5.2 4.53 3.68 2.8

128 9.09 8.38 7.27 5.83 4.3

w = 16

m n = 4 n = 8 n = 16 n = 32 n = 64

4 2.81 2.63 2.35 1.99 1.63

8 2.94 2.77 2.5 2.14 1.77

16 3.23 3.05 2.76 2.37 1.95

32 3.88 3.66 3.31 2.82 2.28

By comparison of two last tables it is clear that theoretical and experimental estimations well correspond to themselves. The average relative distinction is equal to 24,37 %.

3.3 Algorithms of calculation of a determinant, a characteristic polynomial and the adjoint matrix

Let’s receive theoretical expressions of complexity for algorithms of calculation of a determinant, a characteristic polynomial and the adjoint matrix for a matrix over polynomials of one variable,

Sij — 1

Let A = (a,ij(x)) be a matrix over a ring Z[x] the size n x n, (x) = ^ akjxk, Let

k=0

max |ak7-1 ^ a and deg aj < s . i,j,k

Complexity of calculation of discrete Fourier transform on algorithm for n2 polynomials on N = 2 l"log2 nsl points at use r modules is equal

iНе можете найти то, что вам нужно? Попробуйте сервис подбора литературы.

n2r(7s |~logh a] + 9N log2 N).

3.3.1 Determinant calculation

Let’s apply algorithm of a forward stroke to calculation of a determinant of a matrix. Let’s

r

image at discrete Fourier transform. The matrix determinant det A can be calculated bv the formula:

det A = (-1)*aijia2j2... anjn, (1)

(jl,...,jn)

where (jl..., jn) - of numbers from 1 to n, t - signum of this transposition.

The formula (1) contains exactly n! composes. Having estimated the maximum coefficeint det A

n! r

r = |~logh 2(V2nn e^ sn-1an)],

where h = 232.

rN

a remainder of division occupies as much time, as 7 additions of words. Then the number of

A

forward stroke equally

8n3rN.

r

ns

image DFT be

9rNlog2 N + 2r2sn.

The total of operations with use DFT in algorithm of calculation of a determinant is equal

n2r(7s |~logh a] + 9N log2 N) + 8n3rN + 9rNlog2 N + 2r2sn. (2)

Let’s result expression of complexity of algorithm of calculation of the determinant, using CRT both for polynomials, and for their factors:

n2r(7s |~logh a] + ns2) + 8n4sr + 2n2 s2r + 2r2sn. (3)

N = 2 Tlog2 nsl in (2), (3), ns

n2r(7s |~logha] + 9nslog2(ns)) + 8n4rs + 9nslog2(ns) ■ r + 2r2sn, (4)

n2r(7s [logh a] + ns2) + 8n4rs + 2n2s2 ■ r + 2r2sn. (5)

3.3.2 Calculation of a characteristic polynomial

The algorithm for calculation of characteristic polynomial of a matrix [6] is known. The given algorithm also has complexity O(n3) bv ring operations. The difference is in a value estimation r and it is necessary to recover n polynomials. For the given algorithm r1 = logh 2nnsn-1 an, Complexity of this algorithm is

n2r1(7s[logh a] + 9Nlog2 N) + 8n3r1N + n(9r1Nlog2 N + 2r2sn). (6)

The classical approach with use CRT both for polynomials, and for numbers has complexity

n2r1(7s |~logh a] + ns2) + 8n4r1s + n(2n2s2r1 + 2r2sn). (7)

ns

n2r1(7s[logha] + 9nslog2(ns)) + 8n4r1s + n(9nslog2(ns) ■ r + 2r2sn), (8)

n2r1(7s|~logha] + ns2) + 8n4r1s + n(2n2s2 ■ r1 + 2r2sn). (9)

3.3.3 Calculation of the adjoint matrix

The algorithm of calculation of the adjoint matrix also has complexity O(n3) by ring operations. The number of prime modules is

____ / n \ n i

r = logh 2(V2nn\—j e sn-1an).

n2

algorithm is

n2r(7s|~logha] + 9Nlog2 N) + 8n3rN + n2(9rNlog2 N + 2r2sn). (10)

We also result the expression of complexity of a classical method:

n2r(7s|~logh a] + ns2) + 8n4 rs + n2(2n2s2r + 2r2sn). (11)

ns

n2r(7s|~logha] + 9nslog2(ns)) + 8n4rs + n2(9nslog2(ns) ■ r + 2r2sn), (12)

n2r(7s[logha] + ns2) + 8n4rs + n2(2n2s2 ■ r + 2r2sn). (13)

3.4 Experimental comparison between the algorithms of calculation of a determinant, a characteristic polynomial and the matrix

On expressions of complexity algorithms with use DFT are close to what use CRT for polynomials and CRT for coefficients. The difference only that complexity of calculation of remainders of division of polynomials on prime modules x,x + 1,...,x + ns makes ns2

2n2s2

operations, and computing the image of a polynomial at DFT and recovering a polynomial from its DFT-image demands 9N log2 N operations. As N ~ ns DFT has advantage, Bv theoretical

ns

Let’s present results of computing experiments.

Comparison of algorithms for calculation of a determinant, characteristic polynomial and the adjoint matrix was spent.

We compared two algorithms for calculation of a determinant: the first one is based on CRT only and the second one is based on FFT for polynomials.

For calculation of a characteristic polynomial of a matrix over a ring Z[x] two algorithms were compared: Danilevsky’s algorithm with application CRT for polynomial prime modules and New algorithm [6] in which operations over polynomials are replaced by operations over their images of transformation of Fourier, For restoration of factors of a polynomial as a result in both algorithms CRT was used.

For calculation of the ajoint matrix it was spent comparison of two algorithms: the first one is based on CRT only and the second one is based on FFT for polynomials.

Results are presented in Tables 8-10,

Table 8

Results of experiments with calculation of a determinant of a matrix

s = 2, bits = 8 n increases

n CRT, ms FFT, ms

2 38 26 1.46/1

4 30 13 2.3/1

8 70 48 1.45/1

16 1094 952 1.14/1

32 25009 22203 1.12/1

64 625352 506382 1.35/1

n = 8 bits = 8 s increases

s CRT, ms FFT, ms

2 76 51 1.48/1

4 132 124 1.06/1

8 276 249 1.11/1

16 755 501 1.51/1

32 1884 1010 1.87/1

64 5040 2538 1.99/1

Table 9

Results of experiments with calculation of a characteristic polynomial

bits = 32, s = 8 n increases

n Danilevsky’s algorithm, ms New algorithm with DFT, ms

4 63 23 2.74/1

8 415 143 2.90/1

16 7655 3619 2.12/1

32 162543.0 99562 1.63/1

bits = 32, n = 8 s increases

n Danilevsky’s algorithm, ms New algorithm with DFT, ms

16 1705 761 2.24/1

32 5586 1181 4.73/1

64 23204 2671 8.69/1

Table 10

Results of experiments with calculation of the adjoint matrix

s = 2 bits = 8 n increases

n CRT,ms FFT,ms

2 1 1 1/1

4 7 5 1.4/1

8 118 67 1.76/1

16 2574 1164 2.21/1

32 65679 27226 2.41/1

iНе можете найти то, что вам нужно? Попробуйте сервис подбора литературы.

64 1775930 611129 2.91/1

n = 8 bits = 8 s increases

s CRT,ms FFT,ms

2 132 73 1.81/1

4 325 138 2.35/1

8 968 284 3.41/1

16 4034 589 6.85/1

32 14834 1171 12.67/1

64 55809 2354 23.71/1

It is clear from Tab, 8-10 that the greatest advantage with DFT has the approach in algorithm of calculation of the adjoint matrix.

4 Parallel algorithm of DFT calculation

Let f G Zp[xi,x2,... ,xd], p - prime number. Let the greatest degree of a variable x is equal n — 1 in the polvnomial f , n = 2Ni, We designate that n = n1n2 ... nd , The polvnomial f can be written down in a form:

«4-1 «2-1 «d-1

f V f xil xi2 'd

f y y . . . / y f *H2...»nx1 x2 . . . .

il=0 '2=0 'd=0

Let the prime number p be so that n divides p — 1, Then in Zp there is the root degree n from 1 which we designate ^, We enter definition of discrete Fourier transform for the f

f

dthe-dimensional table of numbers F(f) = (fj1...jd), where 1 ^ ^ n1, 1 ^ j2 ^ n2, ... 1 ^

jd ^ nd , where

«1-1 «2-1 «d-1

fjlj2...jd = £ £ ... £ fili2...in ^j1i1 j j. (14)

ii=0 '2 =0 id=0

The right part (14) contains n composes. In the left part there is an element dthe-

n

Hence, total number of operations for calculation DFT of a polynomial f is O(n2), We consider a way of fast calculation of DFT, Let’s write down the formula (1) in a kind:

«d-1 «d-1-1 «k -1 «1-1

f = V"1 / jd'd / jd-1'd-1 j'k ST' f , j1i1

f J1 J2...Jd = 2-^i ^d-1 . . . f*1*2...id^1 .

id=0 id-1=0 ik=0 '1=0

At the fixed parameters i2,..., id expression

«1 - 1

£ fi1'2...id j -'1=0

element with number j of one-dimensional discrete transformation of Fourier on n1 points which can be counted on algorithm Coolev-Tukey [1] for O(n1 log2 n1) operations.

Let’s designate

«k+1-1

F k+1 \ F k ik + 1'k + 1

j1 ,...,jk + 1 ,'k + 2 ,...,'d / j j1 ,...,jk ,'k + 1 ...'d d

'k + 1 =0

At consecutive calculation F1,F2, ..., Fd the element Fd contains DFT a polynomial f, The parallelization scheme consists of d consecutive steps, each of them is carried out in parallel. It is possible to present the scheme of calculation of each step in the form of a binary tree at which the data is distributed from root vertex to leaf, and the result of calculations gathers again in the root. On each step in parallel to be calculated one-dimensional DFT, thus all calculations occur in leaf vertexes. At transition to a following step the order of variables

dd

the algorithm.

Algorithm,

Let’s spend calculations in a following order: on the step k are in parallel calculated n/nk DFTs at the fixed values of indexes j ... jd-k-1 jd-k+1... j .

Let Fk be the result of ealeulations of k-th step, and let F0 = f ,

kd received on the previous step splits on two equal parts bv an index jk,

And each part goes to corresponding affiliated top. In affiliated tops division repeats by the same index on two half. Such process is carried out recursively to those while there are free processors or while division on the given index, i.e. number of the processors involved in calculations less n/nk , is possible. At leaf level one-dimensional DFTs are calculated and the result comes back upside-down. Then, at the transition to a next step of calculations, the order of indexes of the array varies from [jd-k+2j1... jd-k+1] to [jd-k+1j ...jd-k]. We notice that on the step k to it is required no more, than n/nk processors.

Let there are m computer modules (CM) with n umbers of 1... m .Let k = 1,

d

wdd_-kk+r-k+1 are calculated. The array and the list of free processors are halved. One half of array and the list of free processors together with a array of degrees is sent to CM with number (n/2 + 1), and second half remain in the given CM,

2) Each CM which has received the part of a problem, continues such division further. Process proceeds before achievement of sheet level, or exhaustion of all free processors,

3) Parallel calculation of one-dimensional Fourier transform is carried out,

4) Each CM sends result back on a tree to that processor from which has obtained the data,

5) The CM 1 collects result and changes an indexation order in result array from

[Jd-fc+2j1jd-fc+1] tO [jd-k+1j1...jd-k]■

6) Number k increases. If k = d +1 the end of all calculations, differently we pass to item 1,

F1 . . .

j1, '2, '3,... , 'd

Definition: Efficiency of computation on k processors with comparison on n processors

is

- i

a _ _____

an,k _ k _ i • n

On this algorithm the program complex on cluster MVS the Russian Academy of Sciences is developed. Results are presented in Table 11,

Table 11

Time and speedup of parallel algorithm for calculation DFT for polynomial of two and three

variables in finite field Zp

d=2

Time, ms Efficiency, %

n 512 1024 2048 4096 8192 n 512 1024 2048 4096 8192

procs procs

1 817 6838 47466 371882 67108864 1

2 608 3918 31059 190410 2267073 2 67.19 87.26 76.41 97.65 1480.08

4 424 1755 11470 79170 715310 4 71.7 111.62 135.39 120.25 158.47

8 374 1109 4220 30643 210488 8 56.68 79.13 135.9 129.18 169.92

16 323 730 2785 11514 83500 16 57.89 75.96 75.76 133.07 126.04

32 294 649 1792 6372 36665 32 54.93 56.24 77.71 90.35 113.87

64 367 576 1454 4732 18416 64 40.05 56.34 61.62 67.33 99.55

128 454 571 1341 3749 11462 128 40.42 50.44 54.21 63.11 80.34

256 516 1423 4571 12469 256 43.99 47.12 41.01 45.96

d = 3

Time, ms Efficiency, %

n 32 64 128 256 n 32 64 128 256

procs procs

1 345 6916 212756 8140475 1 — — — —

2 263 3446 111121 2955568 2 31.18 100.7 91.46 175.43

4 175 1731 45085 1055536 4 32.38 99.85 123.97 223.74

8 129 888 12420 358084 8 23.92 96.98 230.43 310.48

16 86 490 4726 127972 16 20.08 87.43 293.45 417.41

32 68 290 2466 66859 32 13.14 73.7 275.08 389.54

64 58 316 1495 17918 64 7.85 33.15 224.3 719.55

128 433 184 945 7554 128 —0.16 28.81 176.49 847.75

5 Parallel algorithm of multiplication of polynomials of many variable integers in a ring by means of discrete transformation of Fourier

In work [1] the consecutive algorithm of multiplication of polynomials of one variable by means of DFT is described.

It is offered to calculate product of polynomials in r final fields Zp0,..., ZPr-1 with the subsequent reeoverong of result in Z with the Chinese remainder theorem.

If the number of variables in a polynomial more than one such algorithm can be carried out on several processors by parallelization calculations of direct and inverse DFTs and calculations over prime modules.

Let f, g - polynomials in a ring Z[x0, ..., xd-i].

ni —1 n2 — 1 nd-1

f V f Xl1 xl2 id

J / j / j . . . / j f iii2...inx1 x2 . . . xd ,

iНе можете найти то, что вам нужно? Попробуйте сервис подбора литературы.

ii=0 12=0 id =0

mi —1 m2 —1 nd —1

g _ £ £ ... £ giii2...inx11 x22.. .xdd.

ii =0 i2 =0 md =0

Let a/, ag - maximum on the coefficients of polynomials f and g accordingly. We designate n _ n0 ■ ... ■ nd— 1, m _ m0 ■ ... ■ md—1, Then h _ min{n, m} ■ a/ag - the top estimation of

fg

of result the number of 32-bit prime modules r _ [log232 h] ,

Let si _ 2Si - the minimum natural number, greater than ni + mi, Let there is a computer with k _ 2K processors with numbers of 0,..., k — I. We result parallel algorithm of multiplication of polynomials,

1, Processor with number 0 defines the necessary number of prime modules p0,... ,pr—1. Let r _ 2r , We get t _ r/k, In case r < k sending of polynomials is executed only to r

2, Processor 0 sends polynomials f, g and prime numbers p0,...,pr—1 to the processor with number i _ 0,..., k — I. Following steps are carried out in parallel on each processor,

3, On the processor with number i product of polynomials hi/t _ Jg on modules pi/t,... ,p(i+1)/t—1 by means of DFT is calculated. If r < k each processor uses k/r — I of free processors for parallelization of DFT calculations,

4, Splitting arrays of coefficients for polynomial hi/t _ Jg on k parts h0/t,..., hk/t1 is carried out,

5, Send parts hq/t to the processor with number q.

6, Receive hq/t on the processor with number q,

7, Recovering hq polynomial coefficients h is carried out,

h

9, End of calculations,

6 Conclusion

In the given work comparison of two approaches has been spent to the organization of polynomial arithmetics over a ring of integers,

m

the coefficients length w of machine words was considered. Theoretical and experimental comparison of following algorithms was spent:

0, Standard algorithm of multiplication of numbers and polynomials (PSS) with complexity O(m2w2),

1, Karatsuba’s algorithm for multiplication of numbers and standard algorithm of multiplication of polynomials (PSK) with complexity O(m2wlog2 3),

2, Standard algorithm of multiplication of numbers and Karatsuba’s algorithm for multiplication of polynomials (PKS) with complexity O(mlog2 3w2),

3, Karatsuba’s algorithm for multiplication both numbers and polynomials (PKK) with complexity O(mlog2 3wlog2 3)

4, The algorithm of multiplication using DPT (PF) with complexity O(w(m log2 m)+mw2), The best algorithm at increase in degree of polynomials was an algorithm 4, On the average the relative difference in theoretical and experimental estimations makes 35,86 %,

n

mw

0, Modular algorithm of multiplication of the matrices, using CRT both for polynomials, and for their factors (MCC) with complexity O(n3m + n2(wm2 + mw2),

1, Modular algorithm of multiplication of the matrices, Fourier using fast transformation for polynomials and CRT for their factors (MFC) with complexity O(n3m + n2(wm log2 m + mw2),

The best algorithm at increase of degree of polynomials in a matrix is MFC, At the fixed

n

and experimental comparison. Average relative distinction in theoretical and experimental comparison makes 24,37 %.

At a theoretical estimation of algorithms of calculation of a determinant, a characteristic polynomial and the adjoint matrix, for the matrices which elements are polynomials of one variable with the integer coefficients, following expressions of complexity have been received:

Table 12

Theoretical expression of complexity for algorithms of calculation of a determinant, a characteristic polynomial and the adjoint matrix for the matrices which elements are polynomials of one variable with the integer coefficients, n - the size of a matrix, s - the maximum degree of a polynomial in a matrix, a - maximum on the module of polynomial’s

r

r1

modules necessary for calculation of a characteristic polynomial, h _ 232

Algorithm CRT+CRT CRT+DFT

Determinant n2r(7s|"logh a] + ns2) + +8n4rs + 2n2s2 ■ r + 2r2sn n2r(7s|"logh a] + 9ns log2(ns)) + 8n4rs + 9ns log2(ns) ■ r + 2r2sn

Char, polynomial n2r1(7s[logh a] + ns2) + 8n4r1s + n(2n2s2 ■ r1 + 2r2sn) n2r1(7s[logh a] + 9ns log2(ns)) + +8n4r1s + n(9ns log2(ns) ■ r + 2r2sn)

Adjoint matrix n2r(7s|"logh a] + ns2) +8n4rs + n2(2n2s2 ■ r + 2r2sn) n2r(7s|"logh a] + 9ns log2(ns)) +8n4rs + n2(9ns log2(ns) ■ r + 2r2sn)

From Table 12 it is clear that the best results the DFT-arithmetic has been shown at a calculation of the adjoint matrix. In problems of calculation of a determinant and a characteristic polynomial the algorithms using DFT, lose the efficiency with growth of the sizes of a matrix.

The algorithm of parallel calculation DFT in a finite field has shown high scalability on the number of processors from 2 to 256, Algorithm acceleration varies from -0,16 % up to 1480 %. It is possible to explain such acceleration to that at increase in the number of processors the computing problem on everyone becomes small enough for effective work with the data in cache-memorv of the processor.

The further direction of researches is the implementation of parallel DFT-algorithms for multiplication of matrices, calculations of a determinant, a characteristic polynomial and the adjoint matrix.

References

1. Noden P., Kitte K. Algebraic algorithmic (with exercises and solutions)/ the traslation from French, The World, 1999,

2. Moreno Maza M., Xie Y. FFT-based Dense polynomial Arithmetic on Multi-cores // HPCS 2009: post-conference proceedings, 2009,

3. Moreno Maza M., Xie Y. Balanced Dense polynomial Multiplication on Multi-cores // Proceedings of Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2009.

4. Knut D.E. The art of programming, V, 2, Seminumerical algorithms. The publishing house «Williams», 2001.

5. Kormen, Thomas S., Lejzerson, Rivest Ch., Shtojn /?.. Klifford Algorithms: construction and the analysis, 2 edition/ the translation from English, The Publishing house «Williams», 2005.

6. Pereslavtseva O.N. The method of calculation of a characteristic polynomial of a matrix// Tambov University Reports. Series Natural and Technical Sciences. 2008. V. 13. Issue 1. P. 131-133.

7. Lapaev A.O. Comparison of algorithms of multiplication of polynomials // International conference polynomial Computer Algebra. St. Petersburg. PDMI EAS. 2008. P. 39 - 40.

8. Lapaev A.O. Parallel computation of discrete Fourier transform of a polynomial in a finite field//Materials of 9th international conference-seminar High-efficiency parallel calculations on cluster systems. Vladimir, 2009. P. 272-273.

9. Lapaev A.O. About calculation of multidimensional discrete transformation of Fourier in a finite field // Tambov University Reports. Natural and Technical Sciences. 2009. V. 14. Issue 4. P. 729-731.

10. Lapaev A.O. Algorithms for polynomial matrices with use DFT // International conference polynomial Computer Algebra. St. Petersburg. PDMI EAS. 2009. P. 141-146.

11. Lapaev A.O. On calculation of determinant and a characteristic polynomial of polynomial matrix with use of discrete Fourier transform // Tambov University Reports. Natural and Technical Sciences. 2009. V. 14. Issue 1. P. 281-282.

12. Malaschonok G I., Valeev Yu. D., Lapaev A. O. On the choice of multiplication algorithm for polynomials and polynomial matrices // Zapiski POMI. 2009. V. 373. P. 157-188.

GRATITUDES: Supported by the Sei, Program Devel. Sei, Potent, High, School, RNP 2.1.1.1853.

Accepted for publication 7.06.2010.

Тамбовский государственный университет им. Г.Р. Державина, Интернациональная, 33, Тамбов, 392000, Россия, аспирант кафедры компьютерного и математического моделирования, e-mail [email protected]

Ключевые слова: полиномы; дискретное преобразование Фурье; параллельный алгоритм; метод гомоморфных образов; кластер.

Рассматриваются последовательные и параллельные алгоритмы для полиномиальной арифметики, основанные на дискретном преобразовании Фурье (ДПФ). Обсуждаются алгоритмы для умножения полиномов. Приведены последовательные алгоритмы для полиномиальных матриц. Каждый алгоритм, основанный на ДПФ, сравнивается с аналогичным алгоритмом, использующим китайскую теорему об остатках. В последней части работы приведены параллельные алгоритмы для вычисления ДПФ и умножения полиномов многих переменных. Приведены результаты экспериментов на кластере МВС100К в МСЦ РАН.

Полиномов в параллельных алгоритмах Текст научной статьи по специальности «Математика»

Аннотация научной статьи по математике, автор научной работы — Лапаев Алексей Олегович

Похожие темы научных работ по математике , автор научной работы — Лапаев Алексей Олегович

DFT FOR POLYNOMIALS IN PARALLEL ALGORITHMS

Текст научной работы на тему «Полиномов в параллельных алгоритмах»