Parallel Polynomial Computations Using External Memory

Aleksey Gennadyevich Pozdnikin

UDC 519.688

OUT-OF-CORE PARALLEL POLYNOMIAL ARITHMETIC

Tambov State University named after G.R. Derzhavin, Internatsionalnaya, 33, Tambov, 392000, Russia. Post-graduate student of the Computer and Mathematical Modeling Department.

e-mail: [email protected]

Key words: polynomials on an external data carrier; polynomial arithmetic; parallel multiplication algorithm.

This paper describes the structure of polynomials stored on an external data carrier. Algorithms for addition and parallel multiplication of such polynomials are examined. Results of experiments with parallel multiplication of polynomials on a cluster are given.

1 Introduction

Polynomials are the main objects of symbolic computation [1]. The efficiency of a computer algebra system depends on the efficiency of its polynomial procedures.

Symbolic computations are problems of high computational complexity. It is therefore necessary to develop parallel algorithms and to carry out calculations on multiprocessor computer systems.

Parallel polynomial algorithms are discussed in [3], [4]. Traditional computer algebra systems, such as Mathematica, can operate on polynomials that fit into RAM. However, these systems are unsuitable for large polynomials that need more memory than RAM provides. Supporting operations on such polynomials is therefore one of the primary tasks of parallel computer algebra.

One system that can operate on such large mathematical expressions is FORM. It is a system for symbolic manipulation of algebraic expressions that specializes in handling very large expressions, with millions of terms, in an efficient and reliable way [5].

The article [6] discusses the representation of large polynomials, considers algorithms implementing the main arithmetic operations, and presents experimental results.

This article describes the structure of a polynomial stored on an external data carrier and considers algorithms for addition and parallel multiplication of such polynomials.

It also presents, in graphs, the results of experiments with parallel multiplication of polynomials on the cluster of the JSC RAS.

2 The structure of polynomials on the external data carrier

We use two one-dimensional arrays to store one polynomial. The first array stores only the nonzero coefficients of the polynomial, and the second array stores the exponents of each variable. If the polynomial has vars variables, the second array contains vars times as many elements as the first. Monomials are stored in reverse lexicographical order; this order is chosen so that the arithmetic algorithms on polynomials run faster. Other polynomial structures are described in [7].
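The two-array layout can be illustrated with a minimal Java sketch. The class name FlatPolynomial, the use of long coefficients and the order of variables inside each exponent block are assumptions made for illustration; they are not taken from the author's implementation.

```java
import java.util.Arrays;

/** Sparse polynomial held in two flat arrays (hypothetical class, for illustration). */
final class FlatPolynomial {
    final long[] coeffs; // nonzero coefficients, one per monomial
    final int[] powers;  // exponents, vars entries per monomial
    final int vars;      // number of variables

    FlatPolynomial(long[] coeffs, int[] powers, int vars) {
        if (powers.length != coeffs.length * vars) {
            throw new IllegalArgumentException("powers must hold vars entries per coefficient");
        }
        this.coeffs = coeffs;
        this.powers = powers;
        this.vars = vars;
    }

    /** Number of stored (nonzero) monomials. */
    int size() {
        return coeffs.length;
    }

    /** Exponent vector of the i-th monomial, copied out of the flat array. */
    int[] exponentsOf(int i) {
        return Arrays.copyOfRange(powers, i * vars, (i + 1) * vars);
    }
}
```

For example, with the variables (x, y) the polynomial 9x^2y^2 + 4x^2y - 8xy + x - 6 would be stored as coeffs = {9, 4, -8, 1, -6} and powers = {2,2, 2,1, 1,1, 1,0, 0,0}.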

A polynomial is stored on the external data carrier in two files. The first file holds the monomials of the polynomial as arrays of bytes. The second file holds the coefficient type, i.e. the set of integers or rational numbers from which the coefficients are taken, followed by the number of variables of the polynomial (vars), the total number of nonzero monomials, and an array of integers. This array gives the number of bytes that each monomial occupies on the hard disk. We call a polynomial stored in external memory in this way a file polynomial.
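As a rough illustration of the second (descriptor) file, the sketch below serializes the fields listed above and derives the byte offset of a monomial from the array of sizes. The class name, the integer encoding of the coefficient type and the exact field order are assumptions; the article does not specify the binary format.

```java
import java.nio.ByteBuffer;

/** Hypothetical descriptor of a file polynomial (contents of the second file). */
final class FilePolynomialHeader {
    final int coeffType;       // assumed encoding: 0 = integers, 1 = rationals
    final int vars;            // number of variables
    final int[] monomialBytes; // bytes occupied by each monomial on disk

    FilePolynomialHeader(int coeffType, int vars, int[] monomialBytes) {
        this.coeffType = coeffType;
        this.vars = vars;
        this.monomialBytes = monomialBytes;
    }

    /** Serializes type, vars, monomial count and the per-monomial sizes. */
    byte[] toBytes() {
        ByteBuffer buf = ByteBuffer.allocate(4 * (3 + monomialBytes.length));
        buf.putInt(coeffType).putInt(vars).putInt(monomialBytes.length);
        for (int n : monomialBytes) {
            buf.putInt(n);
        }
        return buf.array();
    }

    /** Reads a descriptor written by toBytes(). */
    static FilePolynomialHeader fromBytes(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        int type = buf.getInt();
        int vars = buf.getInt();
        int[] sizes = new int[buf.getInt()];
        for (int i = 0; i < sizes.length; i++) {
            sizes[i] = buf.getInt();
        }
        return new FilePolynomialHeader(type, vars, sizes);
    }

    /** Byte offset of monomial i in the monomial file: sum of the sizes before it. */
    long offsetOf(int i) {
        long offset = 0;
        for (int j = 0; j < i; j++) {
            offset += monomialBytes[j];
        }
        return offset;
    }
}
```

Keeping the sizes in a separate array is what makes it possible to read a fragment of the monomial file that starts and ends on monomial boundaries.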

We need to be able to operate on file polynomials and to send them between cores. For this purpose we operate on small fragments of file polynomials that fit into RAM.

3 Addition of file polynomials

Addition of file polynomials is implemented as a sequential algorithm. The algorithm consists of three main parts:

1. We compare the variables of the file polynomials. If one of the file polynomials has more variables, the monomials that contain these higher variables are written to the resulting file. Otherwise, we go to step 2.

2. We compare the exponents of the variables in each pair of monomials. If the exponents are equal, the coefficients are added and a new monomial with the same exponents is written to the file. If the sum of the coefficients is zero, no monomial with these exponents is written. If the exponents of one monomial are greater, that monomial is written to the file and the smaller one is compared with the next monomial of the other polynomial. We go to step 3 when one of the file polynomials has been read to the end.

3. We read the remaining monomials of the other polynomial and write them to the resulting file.

Let

$p_1 = 9x^2y^2 + 4x^2y - 8xy + x - 6$,

$p_2 = -8x^2y^2z^3 - x^2y^2z^2 - 4x^2y - 5x^3 + 3xy$.

Consider the addition of $p_1$ and $p_2$.

Step 1. We write into the file: $-8x^2y^2z^3 - x^2y^2z^2$.

Step 2. We write into the file: $9x^2y^2 - 5x^3 - 5xy$. The monomials $4x^2y$ and $-4x^2y$ cancel, so no monomial $x^2y$ is written.

Step 3. We write into the file: $x - 6$.

As a result, we obtain the sum of the two polynomials as a sorted file polynomial. For $p_1 + p_2$ the following is written in the file:

$p = -8x^2y^2z^3 - x^2y^2z^2 + 9x^2y^2 - 5x^3 - 5xy + x - 6$.
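Steps 2 and 3 amount to a merge of two sorted sequences of monomials. The following in-memory Java sketch shows this merge under simplifying assumptions: both polynomials are already padded to the same number of variables (so step 1 is skipped), coefficients are long integers, and monomials are ordered by comparing exponents from the last variable down to the first. The names MergeAdd and Term are hypothetical, and the real algorithm streams monomials from files instead of holding lists in RAM.

```java
import java.util.ArrayList;
import java.util.List;

final class MergeAdd {
    /** A monomial: coefficient plus exponent vector (hypothetical type). */
    static final class Term {
        final long coeff;
        final int[] exp;

        Term(long coeff, int... exp) {
            this.coeff = coeff;
            this.exp = exp;
        }
    }

    /** Assumed order: compare exponents from the last variable down to the first. */
    static int compare(int[] a, int[] b) {
        for (int i = a.length - 1; i >= 0; i--) {
            if (a[i] != b[i]) {
                return Integer.compare(a[i], b[i]);
            }
        }
        return 0;
    }

    /** Adds two polynomials whose terms are sorted in decreasing order. */
    static List<Term> add(List<Term> p, List<Term> q) {
        List<Term> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < p.size() && j < q.size()) {
            Term a = p.get(i), b = q.get(j);
            int c = compare(a.exp, b.exp);
            if (c > 0) {            // a is greater: write it, keep b for the next comparison
                out.add(a);
                i++;
            } else if (c < 0) {     // b is greater: write it, keep a
                out.add(b);
                j++;
            } else {                // equal exponents: add coefficients, drop zero sums
                long s = a.coeff + b.coeff;
                if (s != 0) {
                    out.add(new Term(s, a.exp));
                }
                i++;
                j++;
            }
        }
        while (i < p.size()) out.add(p.get(i++)); // step 3: copy the remaining tail
        while (j < q.size()) out.add(q.get(j++));
        return out;
    }
}
```

Because each input is read once from start to end, the same loop works when the terms come from sequential file reads, which is why a sorted on-disk order matters for this operation.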

4 The parallel algorithm of multiplication of file polynomials

The procedure for multiplying file polynomials is recursive. The recursive algorithm is based on dichotomous division of the polynomials into parts.

The recursion terminates when the multiplication of the individual parts of the polynomials can be performed in memory of the given size.

The amount of free RAM is set by the variable freeMemory. The procedure getMemForMul estimates the amount of memory that may be required to multiply two polynomials or their parts. The recursion terminates when the value returned by getMemForMul is less than freeMemory.

The graph of the recursive algorithm is a binary tree. The multiplication of parts of the polynomials is performed at its leaves.

The interval of free core numbers is set at the root node. As the parallel algorithm splits the polynomials into parts, it also splits the interval of core numbers. If the set of free cores is empty and the multiplication of the parts cannot be done in memory, the sequential recursive algorithm is called on one core.
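Splitting the interval of core numbers can be sketched as follows. The class CoreInterval is hypothetical; its method names mirror cardinalNumber and divideOnParts from the article, but the rule for distributing a remainder between the parts is an assumption.

```java
/** Inclusive interval of core numbers (hypothetical stand-in for Subset). */
final class CoreInterval {
    final int lo;
    final int hi;

    CoreInterval(int lo, int hi) {
        this.lo = lo;
        this.hi = hi;
    }

    /** Number of cores in the interval. */
    int cardinalNumber() {
        return hi - lo + 1;
    }

    /** Splits the interval into n consecutive parts of nearly equal size. */
    CoreInterval[] divideOnParts(int n) {
        int count = cardinalNumber();
        if (n < 1 || n > count) {
            throw new IllegalArgumentException("cannot split " + count + " cores into " + n + " parts");
        }
        CoreInterval[] parts = new CoreInterval[n];
        int start = lo;
        for (int i = 0; i < n; i++) {
            // The first (count % n) parts receive one extra core.
            int len = count / n + (i < count % n ? 1 : 0);
            parts[i] = new CoreInterval(start, start + len - 1);
            start += len;
        }
        return parts;
    }
}
```

With four cores, new CoreInterval(0, 3).divideOnParts(2) yields the intervals [0, 1] and [2, 3]; the first core of the second interval is the one that receives the delegated half of the work.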

We consider a parallel algorithm for multiplying file polynomials A and B. The graph of the algorithm is shown in Figure 1.

Fig. 1. Graph of the algorithm for computing AB, with the distribution of four cores to the nodes of the tree

Let A = (a1 + a2) and B = (b1 + b2) be two file polynomials that we want to multiply. The product can be found as the sum of four terms: a1 * b1 + a1 * b2 + a2 * b1 + a2 * b2. The calculation of each of the four terms can be executed on a separate core.

We choose the larger polynomial and split it into two parts that occupy approximately equal numbers of bytes in memory. Let A > B, i.e. A occupies more memory than B. On the first step we divide the polynomial A into two parts a1 and a2, i.e. A = a1 + a2. The interval of core numbers [0, 3] is divided into two intervals [0, 1] and [2, 3]. We obtain the two products a1 * B and a2 * B, and the division of the polynomials into parts continues.

Suppose that B > a1 and B > a2. Then B is split into two parts b1 and b2, which gives the four products a1 * b1, a1 * b2, a2 * b1, a2 * b2.

Suppose that we have now reached the leaf nodes, i.e. these multiplications can be executed in RAM on cores 0, 1, 2, 3, respectively.

While the calculated fragments are sent back to the root, they are added: a1 * b1 + a1 * b2 = a1 * B and a2 * b1 + a2 * b2 = a2 * B. The result of the multiplication is the sum a1 * B + a2 * B = A * B, calculated at the root.
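The dichotomous scheme can be tried out on a single machine with the sequential sketch below. It is a simplification and not the author's code: polynomials are univariate and held in RAM as parallel arrays, the parameter limit stands in for freeMemory, and the number of term pairs stands in for the estimate of getMemForMul. All names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

final class DichotomousMul {
    /** Univariate sparse polynomial as parallel arrays (a simplification). */
    static final class Poly {
        final int[] deg;
        final long[] coef;

        Poly(int[] deg, long[] coef) {
            this.deg = deg;
            this.coef = coef;
        }
    }

    /** Multiplies terms [sa, sa+na) of a by terms [sb, sb+nb) of b; result maps degree to coefficient. */
    static TreeMap<Integer, Long> multiplyRec(Poly a, int sa, int na,
                                              Poly b, int sb, int nb, long limit) {
        TreeMap<Integer, Long> res = new TreeMap<>();
        if ((long) na * nb <= limit || (na <= 1 && nb <= 1)) {
            // Leaf: the product is assumed to fit in memory, multiply directly.
            for (int i = sa; i < sa + na; i++) {
                for (int j = sb; j < sb + nb; j++) {
                    res.merge(a.deg[i] + b.deg[j], a.coef[i] * b.coef[j], Long::sum);
                }
            }
        } else if (na >= nb) {
            // Split the larger operand (here a) in two and add the partial products.
            int h = na / 2;
            addInto(res, multiplyRec(a, sa, h, b, sb, nb, limit));
            addInto(res, multiplyRec(a, sa + h, na - h, b, sb, nb, limit));
        } else {
            int h = nb / 2;
            addInto(res, multiplyRec(a, sa, na, b, sb, h, limit));
            addInto(res, multiplyRec(a, sa, na, b, sb + h, nb - h, limit));
        }
        res.values().removeIf(c -> c == 0); // keep only nonzero coefficients
        return res;
    }

    /** Adds the partial product into the accumulator. */
    static void addInto(TreeMap<Integer, Long> acc, Map<Integer, Long> part) {
        part.forEach((d, c) -> acc.merge(d, c, Long::sum));
    }
}
```

In the parallel version one of the two recursive calls in each split is delegated to another core, and the addition is performed when the partial product is received back.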

We now consider the program code of the procedures for parallel multiplication of file polynomials, implemented in Java.

We introduce the following notation:

Polynom is the type of a polynomial stored in memory.

FPolynom is the type of a file polynomial.

Subset is a set of numbers of available cores.

BasePolynomDir is a class used to create the directory in which file polynomials are written. By default it is the directory "/tmp/fpolynoms/" on Linux and "C:\temp\fpolynoms" on Windows.

The algorithm of parallel multiplication of file polynomials uses the following procedures:

1) Polynom mulS(Polynom pol2). The procedure multiplies polynomials in RAM.

2) Polynom toPolynom(long skipBytes, long bytes). The procedure reads a part of the file polynomial into RAM. Parameters: skipBytes is the number of bytes to skip; bytes is the number of bytes to read. The result is a polynomial in RAM.

3) FPolynom toFPolynom(File filename, Element itsCoeffOne). The procedure takes a polynomial from memory and writes it to the hard disk at the specified path filename; itsCoeffOne is the unit of the coefficient field of the polynomials. The result is a polynomial written in external memory.

4) long getMemForMul(FPolynom fpol1, FPolynom fpol2, long s1, long n1, long s2, long n2). The procedure estimates the memory required for the multiplication of parts of the polynomials fpol1 and fpol2; s1, s2 are the numbers of bytes to skip in fpol1 and fpol2; n1, n2 are the numbers of bytes to read from fpol1 and fpol2.

5) long getByteLength(). The procedure returns the number of bytes that the file polynomial occupies.

6) Subset[] divideOnParts(int n). The procedure splits an interval into n parts and returns an array of intervals.

7) long middlePolynom(long skipBytes, long middle). The procedure returns a number of bytes approximately equal to half of the memory occupied by a part of a file polynomial; skipBytes is the number of bytes to skip in the file polynomial; middle is the middle of the part of the file polynomial.

8) Ssend(Object obj, int proc, int tag). The procedure sends the object obj to the core with number proc; tag is the message tag.

9) Recv(int objType, int proc, int tag). The procedure receives an object of type objType from the core with number proc; tag is the message tag.

10) SendFPolynom(FPolynom pol, long skipBytes, long numbytes, int proc). The procedure sends numbytes bytes of the file polynomial pol to the core with number proc; skipBytes bytes are skipped from the beginning of the file.

11) RecvFPolynom(File dir, int proc). The procedure receives a file polynomial from the core with number proc and writes it to the directory dir.

12) add(String dir1, String dir2, File fdir). The procedure adds the file polynomials stored in dir1 and dir2 and writes the sum to fdir.
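A split point such as the one returned by middlePolynom has to fall on a monomial boundary, otherwise a fragment would start in the middle of a monomial. The sketch below shows one possible way to find such a point from the array of monomial sizes. It is not the author's implementation: the explicit sizes parameter, the class name SplitPoint and the choice to round up to the next boundary are assumptions.

```java
final class SplitPoint {
    /**
     * sizes[i] is the byte length of monomial i. Starting after skipBytes
     * (which must lie on a monomial boundary), whole monomials are taken
     * until at least middle bytes are covered; that byte count is returned.
     */
    static long middlePolynom(int[] sizes, long skipBytes, long middle) {
        long pos = 0;
        int i = 0;
        while (i < sizes.length && pos < skipBytes) {
            pos += sizes[i++];
        }
        if (pos != skipBytes) {
            throw new IllegalArgumentException("skipBytes is not on a monomial boundary");
        }
        long taken = 0;
        while (i < sizes.length && taken < middle) {
            taken += sizes[i++];
        }
        return taken;
    }
}
```

For monomials of 10, 12, 8 and 14 bytes, a requested middle of 20 bytes from the start of the file gives 22 bytes, i.e. the first two monomials.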

The program code of the procedure for multiplication of file polynomials is shown in Fig. 2.

public static FPolynom multiply(FPolynom fpol1, FPolynom fpol2, File fres) throws Exception {
    int myrank = MPI.COMM_WORLD.Rank();
    if (myrank == 0) {
        // Root core: owns all cores [0, size-1] and starts the recursion.
        int size = MPI.COMM_WORLD.Size();
        Subset procs = new Subset(new int[]{0, size - 1});
        multiplyRec(fpol1, fpol2, 0, fpol1.getByteLength(),
                    0, fpol2.getByteLength(), fres, procs, 0);
    } else {
        // Worker core: wait for a message and act only on tag_true.
        Status st = MPI.COMM_WORLD.Probe(MPI.ANY_SOURCE, MPI.ANY_TAG);
        if (st.tag == tag_true) {
            int parent = (Integer) LLP.Recv(LLP.INT_TYPE, MPI.ANY_SOURCE, tag_true);
            BasePolynomDir dir = new BasePolynomDir();
            File f1 = new File(dir.createPolynomDir("proc" + myrank), "p1");
            File f2 = new File(dir.createPolynomDir("proc" + myrank), "p2");
            File f3 = new File(dir.createPolynomDir("proc" + myrank), "p3");
            int[] arr = (int[]) LLP.Recv(LLP.INT_ARRAY_TYPE, parent, tag_proc);
            Subset process = new Subset(arr);
            LLP.RecvFPolynom(f1, parent);
            LLP.RecvFPolynom(f2, parent);
            FPolynom p1 = new FPolynom(f1);
            FPolynom p2 = new FPolynom(f2);
            multiplyRec(p1, p2, 0, p1.getByteLength(),
                        0, p2.getByteLength(), f3, process, myrank);
            // Send the product back to the core the operands came from.
            LLP.SendFPolynom(new FPolynom(f3), 0, f3.length(), parent);
        }
    }
    return new FPolynom(fres);
}

Fig. 2. The code of procedure of multiplication of file polynomials

The procedure multiply takes as input two file polynomials fpol1 and fpol2 and a directory fres, in which the result of the multiplication is written.

On the core with number zero (myrank = 0), the variable size takes the value of the total number of cores, returned by Size(), and the interval of available cores is set to [0, size - 1].

On core zero the recursive procedure multiplyRec, which multiplies parts of the polynomials fpol1 and fpol2, is started. The remaining cores, with nonzero numbers, wait for a message. If a message with the tag tag_true arrives, the number of the sending core, the interval with numbers of available cores, and two polynomials are received.

private static FPolynom multiplyRec(FPolynom fpol1, FPolynom fpol2,
        long skip1, long length1, long skip2, long length2,
        File fres, Subset proc, int myrank) throws Exception {
    File bufres = fres;
    FPolynom result = new FPolynom(fres);
    String fileA, fileB;
    if (getMemForMul(fpol1, fpol2, skip1, length1, skip2, length2) < freeMemory) {
        // Leaf: the product fits in RAM, multiply in memory and write the result.
        fpol1.toPolynom(skip1, length1).mulS(
            fpol2.toPolynom(skip2, length2)).toFPolynom(fres, itsCoeffOne);
        // Release the cores of this interval that were not used.
        if (proc.cardinalNumber() > 1)
            for (int i = 1; i < proc.cardinalNumber(); i++)
                LLP.Isend(new Integer(0), proc.toArray()[i], tag_false);
    } else {
        long s1 = skip1, s2 = skip2, e1 = length1, e2 = length2,
             s11 = 0, s22 = 0, e11 = e1, e22 = e2;
        Subset[] process;
        if (proc.cardinalNumber() > 1) {
            // Free cores exist: send the first half of the larger polynomial
            // and the whole other polynomial to the first core of the second interval.
            process = proc.divideOnParts(2);
            LLP.Ssend(new Integer(myrank), process[1].toArray()[0], tag_true);
            LLP.Ssend(process[1].toArray(), process[1].toArray()[0], tag_proc);
            if (length1 >= length2) {
                e1 = fpol1.middlePolynom(skip1, e1 / 2);
                LLP.SendFPolynom(fpol1, s1, e1, process[1].toArray()[0]);
                LLP.SendFPolynom(fpol2, s2, e2, process[1].toArray()[0]);
                s11 = e1 + skip1; e11 = length1 - e1; s22 = skip2;
            } else {
                LLP.SendFPolynom(fpol1, s1, e1, process[1].toArray()[0]);
                e2 = fpol2.middlePolynom(skip2, e2 / 2);
                LLP.SendFPolynom(fpol2, s2, e2, process[1].toArray()[0]);
                s22 = e2 + skip2; e22 = length2 - e2; s11 = skip1;
            }
        } else {
            // No free cores: both halves are processed on this core.
            process = new Subset[1];
            process[0] = proc;
            if (length1 >= length2) {
                e1 = fpol1.middlePolynom(skip1, e1 / 2);
                s11 = e1 + skip1; e11 = length1 - e1; s22 = skip2;
            } else {
                e2 = fpol2.middlePolynom(skip2, e2 / 2);
                s22 = e2 + skip2; e22 = length2 - e2; s11 = skip1;
            }
        }
        // Second half of the split polynomial is always multiplied locally.
        fileA = fres.getAbsolutePath() + "a";
        bufres = new File(fileA);
        multiplyRec(fpol1, fpol2, s11, e11, s22, e22, bufres, process[0], myrank);
        // First half: received from the other core, or computed locally.
        fileB = fres.getAbsolutePath() + "b";
        bufres = new File(fileB);
        if (proc.cardinalNumber() > 1) {
            LLP.RecvFPolynom(bufres, process[1].toArray()[0]);
        } else {
            multiplyRec(fpol1, fpol2, s1, e1, s2, e2, bufres, process[0], myrank);
        }
        FPolynom.add(fileA, fileB, fres);
    }
    return result;
}

Fig. 3. The code of the recursive procedure for multiplication of parts of file polynomials

On the cores with nonzero numbers, the recursive multiplication procedure is then called for the received polynomials. The result of the multiplication is sent back to the core from which the polynomials were received. If the received tag is not tag_true, the core performs no multiplication.

4.1 The recursive procedure of multiplication of parts of file polynomials

The recursive multiplication procedure has the following arguments:

1) two file polynomials fpol1 and fpol2;

2) the number of bytes to skip in the file polynomial fpol1;

3) the number of bytes to read from the file polynomial fpol1;

4) the number of bytes to skip in the file polynomial fpol2;

5) the number of bytes to read from the file polynomial fpol2;

6) the directory fres in which the result of the multiplication is written;

7) the interval procs, which contains the numbers of cores;

8) the number of the node on which the procedure is called.

The program code of the recursive procedure for multiplication of file polynomials is shown in Figure 3.

After some variables are initialized, a condition is checked: the result of multiplying the parts must fit into memory of size freeMemory. If the condition is satisfied, the parts are multiplied in memory and the result is returned. All unused cores of the interval are sent a message with the tag tag_false.

If getMemForMul returns a value exceeding freeMemory, the larger of the polynomials is split into two parts. One pair of parts of the file polynomials remains on the current node, and the second pair is sent to another core. The division of the polynomials into parts proceeds until the product of the parts fits into RAM of size freeMemory. After the parts have been multiplied, the product is sent to the core from which they were received. That core calculates the sum of the received polynomials. The last addition of polynomials is performed on core zero.

5 Experiments

A software package has been developed. The experiments were conducted on the MVS-100K cluster at the MSC of the Russian Academy of Sciences. In the experiments we used polynomials [...] $10^3$ in value, with $25 \cdot 10^4$ monomials. For the parallel algorithm, the free RAM freeMemory was set to 32 MB.

Let:

$T_0$ be the time of calculations on $n$ cores;

$T_k$ be the time of calculations on $k$ cores, $k > n$.

The speedup of calculations on $k$ cores in comparison with $n$ cores is computed by the formula

$a(T_k) = \frac{1 - T_0/T_k}{1 - k/n} \cdot 100$.

The speedup is measured in percent. In this experiment $n = 8$. The results of the experiments are presented in Tables 1 and 2.

Table 1

Run time of polynomial multiplication on k cores and speedup of the calculations in comparison with calculations on n = 8 cores. One core is used on each node.

number of cores    time, sec    efficiency, %
8                  2644         -
16                 1719         53.8
32                 1176         41.6
64                 785          33.8
128                577          23.9

Table 2

Run time of polynomial multiplication on k cores and speedup of the calculations in comparison with calculations on n = 8 cores. Eight cores are used on each node.

number of cores    time, sec    efficiency, %
8                  3713         -
16                 2578         44.0
32                 1688         40.0
64                 1009         38.3
128                716          27.9
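The efficiency column of Tables 1 and 2 can be recomputed from the run times with the formula for a(T_k). The short Java sketch below does this; the class and method names are illustrative only.

```java
final class Efficiency {
    /** a(T_k) = (1 - T0/Tk) / (1 - k/n) * 100, with T0 measured on n cores and Tk on k cores. */
    static double efficiency(double t0, int n, double tk, int k) {
        return (1.0 - t0 / tk) / (1.0 - (double) k / n) * 100.0;
    }

    public static void main(String[] args) {
        int n = 8;
        double t0 = 2644; // Table 1, 8 cores
        int[] cores = {16, 32, 64, 128};
        double[] times = {1719, 1176, 785, 577};
        for (int i = 0; i < cores.length; i++) {
            System.out.printf("%d cores: %.1f %%%n", cores[i], efficiency(t0, n, times[i], cores[i]));
        }
    }
}
```

For instance, for k = 16 in Table 1 the formula gives (1 - 2644/1719) / (1 - 16/8) * 100, which rounds to 53.8, in agreement with the table.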

Fig. 4. Dependence of the run time of polynomial multiplication on the number of cores (horizontal axis: number of cores, 8 to 128)

Fig. 5. Efficiency of the run time of multiplication on a k-core cluster in comparison with calculations on an n-core cluster (horizontal axis: number of cores; series: 1 core on the node, 8 cores on the node)

6 Conclusions

The graph in Figure 4 shows that the run time of multiplication decreases as the number of cores increases.

Each node of the MVS-100K cluster has 8 cores. With a single core on each of 8 nodes, the operation runs faster than with 8 cores on one node, because the 8 cores of one node share one hard disk. However, when a task occupies 8 nodes and uses only 1 core of the 8 available on each node, most of the cores stay idle. Using all cores on a node is therefore more economical.

The graph in Figure 5 shows that the speedup of multiplication decreases as the number of cores increases. If we continue to increase the number of cores, the speedup approaches zero. In the example above we did not use more than 128 cores, at which point the speedup in comparison with 8 cores is already below 30%.

The implemented parallel algorithm for multiplication of file polynomials has proved efficient and can be applied to problems that require multiplication of large polynomials.

References

1. Pankratiev E.V. Elements of computer algebra // Internet university of an information technology. Laboratory of knowledge. 2007. P. 248.

2. Malaschonok G.I., Avetisyan A.I., Valeev U.D., Zuev M.S. Parallel algorithms of computer algebra // Proceedings of the institute of system programming. 2004. V. 8. Issue 2. P. 169-180. (Russian).

3. Valeev U.D., Malaschonok G.I. On the forms of polynomials for parallel calculations // Tambov University Reports. Natural and Technical Sciences. 2004. V. 9. N. 1. P. 149-150. (Russian).

4. Malaschonok G.I., Valeev Y.D. Parallel polynomial recursive algorithms // International conference Polynomial Computer Algebra. St. Petersburg: PDMI RAS, 2008. P. 41-45. (Russian).

5. Fliegner D., Retey A., Vermaseren J.A.M. Parallelizing the symbolic manipulation program FORM. URL: http://arXiv.org/abs/hep-ph/0007221.

6. Pozdnikin A.G. File polynomials // Tambov University Reports. Natural and Technical Sciences. 2009. V. 14. Issue 4. P. 783-785.

7. Yan T. The Geobucket Data Structure for Polynomials // J. Symbolic Computation. 1998. P. 285-293.

Acknowledgements: Supported by the Scientific Program "Development of the Scientific Potential of Higher School", RNP 2.1.1.1853.

Accepted for publication 7.06.2010.

