
UDC 519.95

ALGORITHMS OF ARTIFICIAL NEURAL NETWORK TEACHING USING THE PARALLEL TARGET FUNCTION CALCULATION

© O.V. Kryuchin

Key words: artificial neural networks (ANN); parallel calculations; theoretical equations; training algorithms.

In the article a method of improving the efficiency of artificial neural network training by using parallel calculations and cluster systems is described. Theoretical equations relating the number of algorithm operations to the characteristics of the information resources used are presented, together with the results of computing experiments confirming the derived theoretical equations.

INTRODUCTION

Nowadays feed-forward artificial neural networks (ANN) are a very useful technology for solving different tasks. To use an ANN it is necessary to search for the weight coefficients which minimize the inaccuracy (sometimes it is also necessary to select a network structure, but we do not consider this situation here). The inaccuracy is the value of a functional estimating the difference between the pattern data and the values calculated by the ANN. The most widely used type of target functional is the square function, which is why ANN training is a nonlinear ordinary least squares task. Optimization theory can be applied to solve this task, and it is usually used for ANN training [1].

But if we simulate complex objects, the calculation time becomes very large (even if we use the most modern computers). For example, the simulation of a social object, namely the dependence of schoolchildren's professional abilities on their personal characteristics, performed at Tambov State University named after G.R. Derzhavin took two weeks [2-3]. One method of solving this problem is to use computer clusters. But the usage of such clusters requires training algorithms developed with the specifics of this technology in mind.

This paper's aim is to develop a universal parallel algorithm which can be used for ANN training by most weight coefficient search algorithms, and to estimate the efficiency of the developed method.

PARALLELIZATION METHOD IDEA

The main ANN training task is the minimization of the inaccuracy value which is calculated by the target function:

$$\varepsilon = \sum_{i=0}^{N-1} (\bar d_i - \bar y_i)^2 = \sum_{i=0}^{N-1} \sum_{j=0}^{P-1} (d_{i,j} - y_{i,j})^2, \qquad (1)$$

where $\bar d_i, \bar y_i$ are the $i$-th output values of the simulated object and of the used ANN-model; $N$ is the number of rows in the pattern; $P$ is the number of simulated object outputs (the size of the vectors $\bar d_i$ and $\bar y_i$).

The ANN output values are calculated by the formula:

$$\bar y_i = F(\bar x_i, \bar\omega, \bar\psi), \qquad (2)$$

where $\bar x_i$ is the input data vector; $\bar\omega$ and $\bar\psi$ are the management parameters (weight coefficients and neuron activation functions).

We can unite equations (1) and (2) and get the formula:

$$\varepsilon = \sum_{i=0}^{N-1} \sum_{j=0}^{P-1} \left( d_{i,j} - F(\bar x_i, \bar\omega, \bar\psi)_j \right)^2. \qquad (3)$$

If we have $n$ elements of an information resource (it may be a computer cluster or local network nodes), then we can divide the pattern into $n$ parts and allow each element of the information resource (IR-element) to work with its own unique part. We will call the zero IR-element the "lead element" because it manages all the others. So, each non-lead IR-element receives $M$ rows:

$$M = \begin{cases} \dfrac{N}{n}, & N \bmod n = 0; \\[2mm] \left\lfloor \dfrac{N}{n-1} \right\rfloor, & N \bmod n \neq 0, \end{cases} \qquad (4)$$

and the lead IR-element works with $M_l$ rows:

$$M_l = \begin{cases} M, & N \bmod n = 0; \\ N - M(n-1), & N \bmod n \neq 0. \end{cases} \qquad (5)$$

So, all non-lead IR-elements have equal row numbers [4-5].
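To make this division concrete, here is a minimal C++ sketch of formulas (4) and (5); the function names are illustrative and not taken from the author's code:

```cpp
#include <cstdio>

// Rows per non-lead IR-element, formula (4).
int rowsPerElement(int N, int n) {
    return (N % n == 0) ? N / n : N / (n - 1);  // integer division acts as floor
}

// Rows for the lead IR-element, formula (5).
int rowsForLead(int N, int n) {
    int M = rowsPerElement(N, n);
    return (N % n == 0) ? M : N - M * (n - 1);
}

int main() {
    int N = 1000, n = 7;  // pattern rows and IR-elements
    printf("non-lead: %d rows, lead: %d rows\n",
           rowsPerElement(N, n), rowsForLead(N, n));
    return 0;
}
```

For $N = 1000$ and $n = 7$ this gives $M = 166$ rows per non-lead element and $M_l = 4$ rows for the lead element, so the lead element, which also manages the others, carries the smallest share of the pattern.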

Another method of pattern division is division into two parts. At first, the lead IR-element divides the pattern into two parts and sends one part to the first non-lead IR-element. Then each part is divided into two parts and sent to other IR-elements. So after the second step the zero IR-element has the rows $[\bar x_0 \ldots \bar x_{N/4-1}, \bar y_0 \ldots \bar y_{N/4-1}]$, the first IR-element has $[\bar x_{N/4} \ldots \bar x_{N/2-1}, \bar y_{N/4} \ldots \bar y_{N/2-1}]$, the second IR-element has $[\bar x_{N/2} \ldots \bar x_{3N/4-1}, \bar y_{N/2} \ldots \bar y_{3N/4-1}]$ and the third IR-element has $[\bar x_{3N/4} \ldots \bar x_{N-1}, \bar y_{3N/4} \ldots \bar y_{N-1}]$. This operation is repeated until all IR-elements have received their pattern parts. The disadvantage of this method is that the pattern row number must be divisible by the processor number (the values $\log_2 N$ and $\log_2 n$ must be integers).

So equation (1) can be written as:

$$\varepsilon = \sum_{k=0}^{n-1} \varepsilon_k, \qquad (6)$$

where $\varepsilon_k$ is the inaccuracy value calculated by the $k$-th IR-element. The values $\varepsilon_k$ are calculated by the formula:

$$\varepsilon_0 = \sum_{i=0}^{M_l-1} \sum_{j=0}^{P-1} (d_{i,j} - y_{i,j})^2 \qquad (7)$$

(for the lead IR-element) and by formula:

$$\varepsilon_k = \sum_{i=0}^{M-1} \sum_{j=0}^{P-1} \left( d_{M_l + M(k-1) + i,\, j} - y_{M_l + M(k-1) + i,\, j} \right)^2, \quad k > 0 \qquad (8)$$

(for the non-lead processors). Here $M$ is the number of pattern rows located in the $k$-th non-zero IR-element; $M_l$ is the number of pattern rows located in the lead IR-element; $P$ is the ANN output number; $d$ and $y$ are the output values of the simulated object and of the ANN.

If we put equations (7) and (8) into formula (6), then we can write:

$$\varepsilon = \sum_{i=0}^{M_l-1} \sum_{j=0}^{P-1} (d_{i,j} - y_{i,j})^2 + \sum_{k=1}^{n-1} \sum_{i=0}^{M-1} \sum_{j=0}^{P-1} \left( d_{M_l + M(k-1) + i,\, j} - y_{M_l + M(k-1) + i,\, j} \right)^2. \qquad (9)$$

Before the weight coefficient search begins, the ANN structure is sent from the lead IR-element to all non-zero IR-elements. To calculate the target function value the algorithm executes a few actions (a code sketch is given after the list):

1) sending the weight coefficients vector from the lead IR-element to all non-zero IR-elements;

2) calculating the value $\varepsilon_0$ (by formula (7));

3) receiving the inaccuracy values calculated by the non-zero IR-elements (each IR-element uses its own pattern part for it);

4) calculating the total inaccuracy value $\varepsilon$ by the lead IR-element (using formula (9)) [4, 6].
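The following C++/MPI fragment is a minimal sketch of these four actions. It is an illustration of the scheme, not the author's actual code: the forward-pass function `annOutput()` is a hypothetical stand-in for formula (2), and the collective operations `MPI_Bcast`/`MPI_Reduce` stand in for the point-to-point exchanges described above.

```cpp
#include <mpi.h>
#include <vector>
#include <cstdio>

// Stand-in for the ANN forward pass (formula (2)); a real implementation
// would evaluate the network. Here: one linear layer as a placeholder.
void annOutput(const double* x, const std::vector<double>& w, double* y,
               int inputs, int P) {
    for (int j = 0; j < P; ++j) {
        double s = 0.0;
        for (int k = 0; k < inputs; ++k) s += w[j * inputs + k] * x[k];
        y[j] = s;
    }
}

// Parallel target function value (formulas (6)-(9)). Every IR-element holds
// its own pattern part: `xs` (rows*inputs values) and `ds` (rows*P values).
// `w` is non-const because MPI_Bcast fills it on the non-lead IR-elements.
double parallelInaccuracy(std::vector<double>& w,
                          const std::vector<double>& xs,
                          const std::vector<double>& ds,
                          int rows, int inputs, int P) {
    // 1) the lead IR-element (rank 0) sends the weight vector to all others
    MPI_Bcast(w.data(), (int)w.size(), MPI_DOUBLE, 0, MPI_COMM_WORLD);

    // 2) each IR-element calculates its local inaccuracy on its pattern part
    double local = 0.0;
    std::vector<double> y(P);
    for (int i = 0; i < rows; ++i) {
        annOutput(&xs[i * inputs], w, y.data(), inputs, P);
        for (int j = 0; j < P; ++j) {
            double diff = ds[i * P + j] - y[j];
            local += diff * diff;
        }
    }

    // 3)-4) the lead IR-element receives the partial values and sums them
    // into the total inaccuracy (formula (6))
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    return total;  // valid on the lead IR-element
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    // toy data: each process holds `rows` rows of its pattern part
    int inputs = 3, P = 2, rows = 4;
    std::vector<double> w(P * inputs, 0.1), xs(rows * inputs, 1.0),
                        ds(rows * P, 0.5);
    double eps = parallelInaccuracy(w, xs, ds, rows, inputs, P);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) printf("total inaccuracy = %f\n", eps);
    MPI_Finalize();
    return 0;
}
```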

ALGORITHMS REALIZATION

A computer cluster is a group of linked computers working together closely, thus in many respects forming a single computer. The components of a cluster are commonly, but not always, connected to each other through fast local area networks. Clusters are usually deployed to improve performance and availability over that of a single computer, while typically being much more cost-effective than single computers of comparable speed or availability [7].

The GNU/Linux world supports various cluster software; for application clustering there is Beowulf, distcc, and MPICH. Linux Virtual Server, Linux-HA are director-based clusters that allow incoming requests for services to be distributed across multiple cluster nodes. MOSIX, openMosix, Kerrighed, OpenSSI are full-blown clusters integrated into the kernel that provide for automatic process migration among homogeneous nodes. OpenSSI, openMosix and Kerrighed are single-system image implementations.

For the parallel training we have selected MPI, which is a widely available communications library that enables parallel programs to be written in C, Fortran, Python, OCaml, and many other programming languages. For the parallel training algorithm realization, a high-level tool for multiprocessor data passing based on the MPI technology was developed. The algorithms were developed in the C++ language in the KDevelop environment [8].
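The paper does not describe the developed tool's interface; purely as an illustration of how such a wrapper might look, here is a minimal sketch (the class name and its methods are hypothetical):

```cpp
#include <mpi.h>
#include <vector>

// Hypothetical high-level wrapper over MPI: it sends and receives
// std::vector<double> so that training code never touches raw buffers.
class DataBus {
public:
    void send(const std::vector<double>& v, int dest, int tag) {
        // convert container -> raw buffer expected by the MPI function
        MPI_Send(v.data(), (int)v.size(), MPI_DOUBLE, dest, tag,
                 MPI_COMM_WORLD);
    }
    std::vector<double> receive(int src, int tag) {
        MPI_Status st;
        MPI_Probe(src, tag, MPI_COMM_WORLD, &st);
        int count = 0;
        MPI_Get_count(&st, MPI_DOUBLE, &count);
        std::vector<double> v(count);   // convert raw buffer -> container
        MPI_Recv(v.data(), count, MPI_DOUBLE, src, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        return v;
    }
};
```

The container/buffer conversions in such a wrapper are exactly the source of the data conversion time $t_\gamma$ introduced in the next section.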

A PRIORI ESTIMATION OF ALGORITHMS EFFICIENCY

We might expect the parallel weight search method to be $n$ times faster than the serial method, but in reality it is not so. The time expenses can be calculated by the formula:

$$t_n = t\left(\frac{\alpha_A}{n} + (1 - \alpha_A)\right) + \psi = t\left(\frac{\alpha_A}{n} - \alpha_A + 1\right) + \psi, \qquad (10)$$

where $t$ is the time expended for training by the serial algorithm version and $\alpha_A$ is the part of the time which can be parallelized within the frame of the parallel algorithm. In this formula $\psi$ is the difference between the ideal and real time expenses. We can define this value as the sum of all the algorithm's overhead time expenses; it is inversely proportional to the algorithm efficiency and can be calculated by the formula:

$$\psi = t_\varphi + t_\gamma + t_\chi + t_\upsilon, \qquad (11)$$

where $t_\varphi$ is the time of sending data from one node to another; $t_\gamma$ is the time expense of data conversion (from the array pointers which are used in the MPI implementation functions to the containers which are used in the training algorithms, and from containers back to array pointers); $t_\chi$ is the waiting time and $t_\upsilon$ covers the other time expenses.

But this value ($\psi$) defines the absolute difference between the ideal and real time expenses, and if we use it alone we cannot judge the real algorithm efficiency (we only know that it is inversely proportional to the efficiency). So we need to define a new coefficient which can be used as the efficiency measure. For this we can use

$$\alpha = \frac{t_I}{t_n}, \qquad (12)$$

where $t_I = t/n$ is the ideal time expense. For an efficient algorithm this value should lie in the band (0.7, 1).
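As a worked example with assumed numbers (not taken from the paper): let the serial training time be $t = 1000$ s, the parallelizable part $\alpha_A = 0.9$, the overhead $\psi = 20$ s and $n = 8$ processors. Then

$$t_n = 1000\left(\frac{0.9}{8} + 0.1\right) + 20 = 232.5 \text{ s}, \qquad \alpha = \frac{t_I}{t_n} = \frac{1000/8}{232.5} \approx 0.54,$$

so despite eight processors the algorithm falls below the (0.7, 1) efficiency band, mostly because of the serial part $1 - \alpha_A$.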

The algorithm efficiency at the level of the target function calculation is defined by two characteristics:

1) the number of target function calls;

2) the interconnect speed.

The first characteristic defines the ratio of the time expense of the target function calculation to the time expense of one iteration:

$$I_P = \frac{t_\varepsilon}{t_i}, \qquad (13)$$

where $t_\varepsilon$ is the time expense of the target function calculation and $t_i$ is the time expense of one iteration. The target function parallel calculation efficiency is directly proportional to this value:

$$\alpha_\varepsilon \sim I_P. \qquad (14)$$

The second characteristic is defined by the hardware properties. The value $t_\varphi$ from formula (11), which is one of the items of $\psi$, defines the time expense for sending data from one processor to another:

$$t_\varphi = \frac{V_v}{v_v}, \qquad (15)$$

where $V_v$ is the volume of the passed data. So, the interconnect speed $v_v$ is one of the components of the target function parallel calculation efficiency.

These values ($I_\varepsilon$, $P_\varepsilon$) are not equal for different weight search algorithms. For example, the full enumeration algorithm calculates the inaccuracy once per iteration, but the gradient methods do $l_\omega$ calculations (one calculation for each weight coefficient). On the other hand, the value of $P_\varepsilon$ when the full enumeration method is used is greater than in gradient methods, so it may be that the values of $\alpha_\varepsilon$ for these methods are approximately equal.

MULTIPLICATIVE AND ADDITIVE OPERATIONS NUMBER

For the calculation of parallel algorithm efficiency we can use the count of executed operations. Let us denote the number of multiplicative operations executed for the ANN output values calculation as $\tilde\zeta_y$ and the number of additive operations as $\zeta_y$. For different platforms the ratios of one multiplicative operation time expense to one additive operation time expense are not equal, which is why it is necessary to use a coefficient $c$. This coefficient shows the ratio of one additive operation time expense to one multiplicative operation time expense (one multiplication operation needs the time of $c$ addition operations). So, for the ANN output values calculation it is necessary to execute

$$Z_y = \tilde\zeta_y + c\,\zeta_y \qquad (16)$$

operations.

Calculating the inaccuracy value for the $i$-th pattern row needs $\tilde\zeta_y + 1$ multiplications ($\tilde\zeta_y$ for the ANN output values calculation and one for the square calculation) and $\zeta_y + 1$ additions (the extra one accumulates the sum). That is why the full inaccuracy value calculation needs

$$\tilde\zeta_\varepsilon = N(\tilde\zeta_y + 1) \qquad (17)$$

multiplicative and

$$\zeta_\varepsilon = N\zeta_y + N = N(\zeta_y + 1) \qquad (18)$$

additive operations, i.e.

$$Z_\varepsilon = \tilde\zeta_\varepsilon + c\,\zeta_\varepsilon = N(\tilde\zeta_y + 1 + c\,\zeta_y + c). \qquad (19)$$
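A trivial helper makes formula (19) concrete; the numbers below are illustrative assumptions for a small network, not measurements from the paper:

```cpp
#include <cstdio>

// Weighted operation count of one full inaccuracy calculation, formula (19):
// multiplicative operations count as 1, additive ones as c.
double serialCost(double zetaMulY, double zetaAddY, double c, int N) {
    return N * (zetaMulY + 1.0 + c * zetaAddY + c);
}

int main() {
    // assumed values: ~200 multiplications and additions per forward pass,
    // an addition costing c = 0.5 of a multiplication, N = 1000 pattern rows
    printf("Z_eps = %.0f operations\n", serialCost(200, 200, 0.5, 1000));
    return 0;
}
```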

If we want to use the parallel inaccuracy calculation on $n$ processors, then we should divide the pattern into $n$ parts. In each target function value calculation every processor calculates the inaccuracy using its own pattern part, and after that the lead processor sums all the results and computes the total inaccuracy value as shown by equation (6).

So, for the target function value parallel calculation it is necessary:

1) to divide the pattern into $n$ parts before the training starts;

2) to send the weight coefficients from the lead processor to all non-zero processors and to return the calculated inaccuracy value in each target function value calculation [9].

Passing one pattern element requires one multiplicative and two additive operations, so for passing the pattern part (which consists of $PM$ elements) belonging to the $k$-th processor, the algorithm executes $PM$ multiplicative and $2PM$ additive operations. But we should send pattern parts to all non-zero processors, so the algorithm executes $PM(n-1)$ multiplicative and $2PM(n-1)$ additive operations. As was written above, the division of the pattern into $n$ parts executes two multiplicative and $N$ additive operations, which is why the lead processor performs $2 + PM(n-1)$ multiplicative and $N + 2PM(n-1)$ additive operations in the first algorithm step. Non-lead processors execute $MP$ multiplicative and $2MP$ additive operations. And we should consider that non-zero processors cannot begin to receive data before these data are sent by the zero processor. It means that the $k$-th non-zero processor waits while $2 + kPM$ multiplicative and $N + 2kPM$ additive operations are executed. So, the $k$-th processor executes $2 + PMk + cN + 2cPMk$ empty operations, which correspond to the data preparation operations in the lead processor, and $\gamma(MP, v)$ operations, which correspond to the time expense of passing $MP$ numbers with an interconnect speed equaling $v$. After analysing these values we can see that the lead processor executes

$$C^0_{\varepsilon 0} = 2 + PM(n-1) + c\,(N + 2PM(n-1)) = 2 + P(N-M) + cN + 2cP(N-M) \qquad (20)$$

and the k-th non-zero processor executes

$$C^0_{\varepsilon k} = 2 + PMk + cN + 2cPMk + \gamma(MP, v) + MP + 2cMP = 2 + PM(k+1) + 2cPM(k+1) + cN + \gamma(MP, v) \qquad (21)$$

operations.

The target function value calculation consists of a few steps. The numbers of operations executed in each step are shown in Table 1.

1. Sending the weights (the weight coefficients vector size is $l_\omega$) to all non-zero processors. For this the lead processor executes $l_\omega(n-1)$ multiplicative and $2l_\omega(n-1)$ additive operations, and the $k$-th processor executes $l_\omega$ multiplicative and $2l_\omega$ additive operations for receiving the data. We should consider that sending $l_\omega$ elements needs $\gamma(l_\omega, v)$ operations, which is why the $k$-th processor executes $kl_\omega + 2c\,kl_\omega + \gamma(l_\omega, v) + l_\omega + 2c\,l_\omega$ operations.

2. Calculating the inaccuracy value.

3. Returning the calculated inaccuracy value to the lead processor. For this the $k$-th non-zero processor executes one multiplicative and two additive operations, and the lead processor executes $n-1$ multiplicative and $2n-2$ additive operations for receiving plus $1 + 2c + \gamma(1, v)$ empty operations (for waiting). So, the lead processor performs

$$C^3_{\varepsilon 0} = n - 1 + 2c(n-1) + \gamma(1, v) + 1 + 2c = n + 2c(n-1) + \gamma(1, v) + 2c = n + 2cn + \gamma(1, v) \qquad (22)$$

operations.

4. Calculating the total inaccuracy value by the lead processor. It executes $n-1$ additive operations.

To perform the first two steps the lead processor executes $C^1_{\varepsilon 0} + C^2_{\varepsilon 0}$ operations and the $k$-th non-zero processor executes $C^1_{\varepsilon k} + C^2_{\varepsilon k}$ operations:

$$C^1_{\varepsilon 0} + C^2_{\varepsilon 0} = l_\omega(n-1)(1 + 2c) + M(\tilde\zeta_y + 1 + c\,\zeta_y + c); \qquad (23)$$

Table 1

Numbers of operations executed for the inaccuracy calculation

Step | Lead processor | Non-zero ($k$-th) processor
1 | $l_\omega(n-1)(1+2c)$ | $kl_\omega + 2c\,kl_\omega + \gamma(l_\omega, v) + l_\omega + 2c\,l_\omega$
2 | $M(\tilde\zeta_y + 1 + c\,\zeta_y + c)$ | $M(\tilde\zeta_y + 1 + c\,\zeta_y + c)$
3 | $C^3_{\varepsilon 0}$ | $1 + 2c$
4 | $c(n-1)$ | 

$$C^1_{\varepsilon k} + C^2_{\varepsilon k} = kl_\omega + 2c\,kl_\omega + \gamma(l_\omega, v) + l_\omega + 2c\,l_\omega + M(\tilde\zeta_y + 1 + c\,\zeta_y + c) = (1 + 2c)(kl_\omega + l_\omega) + \gamma(l_\omega, v) + M(\tilde\zeta_y + 1 + c\,\zeta_y + c). \qquad (24)$$

So, for the first two steps the calculation takes

$$\tilde C^1_\varepsilon = \max_k \left( C^1_{\varepsilon k} + C^2_{\varepsilon k} \right) = \max_k \Big( l_\omega(n-1)(1+2c) + M(\tilde\zeta_y + 1 + c\,\zeta_y + c),\; (1+2c)(kl_\omega + l_\omega) + \gamma(l_\omega, v) + M(\tilde\zeta_y + 1 + c\,\zeta_y + c) \Big) \qquad (25)$$

operations (the first argument of the maximum is the lead processor's cost, the second is the $k$-th non-zero processor's cost).

To calculate the number of operations executed before the lead processor begins receiving the results, we should construct $\tilde C^2_\varepsilon$ based on $\tilde C^1_\varepsilon$. This value depends on a few factors:

- the operations passing the weights from the lead processor to the others (executed by the zero processor);

- the operations receiving the weights from the lead processor (executed by the non-zero processors);

- the operations calculating the inaccuracy $\varepsilon_k$ (executed by all processors);

- the operations passing the inaccuracy values to the lead processor (executed by the non-zero processors).

So, we can calculate the $\tilde C^2_\varepsilon$ value using the formula:

$$\tilde C^2_\varepsilon = \tilde C^1_\varepsilon + C^3_{\varepsilon 0} = \max_k \Big( l_\omega(n-1)(1+2c) + M(\tilde\zeta_y + 1 + c\,\zeta_y + c),\; (1+2c)(kl_\omega + l_\omega) + \gamma(l_\omega, v) + M(\tilde\zeta_y + 1 + c\,\zeta_y + c) \Big) + \gamma(1, v) + n + 2cn. \qquad (26)$$

So, the parallel target function value calculation needs $\tilde Z_\varepsilon$ operations. This value can be calculated by the formula:

$$\tilde Z_\varepsilon = \tilde C^2_\varepsilon + c\,(n-1). \qquad (27)$$

Table 2

Efficiency coefficient values

Method \ Cluster | TSTU | TSU | JSC
Full enumeration | 0.75-0.77 | 0.78-0.80 | 0.79-0.81
Monte-Carlo | 0.79-0.81 | 0.79-0.82 | 0.80-0.83
Gradient | 0.75-0.77 | 0.75-0.78 | 0.76-0.78

After the analysis of equations (19) and (27) we can derive the efficiency of the algorithm which uses the parallel calculation of the target function value:

$$\alpha_\Phi(\varepsilon) = \frac{I_\varepsilon Z_\varepsilon + X}{n \left( I_\varepsilon \tilde Z_\varepsilon + C^0_{\varepsilon k} + X \right)}, \qquad (28)$$

where $I_\varepsilon$ is the number of inaccuracy calculations; $X$ is the number of other algorithm operations (those which do not fall into the target function calculation), and $C^0_{\varepsilon k}$ accounts for the one-time pattern distribution (formula (21)).
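To show how formulas (19), (26), (27) and (28) combine, here is a sketch that evaluates this cost model for assumed characteristics. All numbers and the simplified passing cost $\gamma(m, v) = m/v$ are illustrative assumptions, not the paper's data, and formula (28) is used in the reconstructed form given above:

```cpp
#include <algorithm>
#include <cstdio>

// Illustrative evaluation of the operation-count model, formulas (19)-(28).
struct Model {
    double zetaMulY, zetaAddY;  // multiplicative/additive ops per ANN output
    double c;                   // additive-to-multiplicative cost ratio
    int N, P, n, lw;            // pattern rows, outputs, processors, weights
    double v;                   // interconnect speed (numbers per operation)

    double gamma(double m) const { return m / v; }   // assumed passing cost

    int M() const { return (N % n == 0) ? N / n : N / (n - 1); }

    // Z_eps, formula (19): serial cost of one inaccuracy calculation
    double serial() const { return N * (zetaMulY + 1 + c * zetaAddY + c); }

    // Z~_eps, formulas (25)-(27): parallel cost of one inaccuracy calculation
    double parallel() const {
        double row = M() * (zetaMulY + 1 + c * zetaAddY + c);
        double lead = lw * (n - 1) * (1 + 2 * c) + row;
        double worker = (1 + 2 * c) * n * lw + gamma(lw) + row;  // k = n-1
        double c2 = std::max(lead, worker) + gamma(1) + n + 2.0 * c * n;
        return c2 + c * (n - 1);
    }

    // formula (28) with the one-time distribution cost C0 (formula (21))
    double efficiency(double Ieps, double X) const {
        double c0k = 2 + P * M() * n + 2.0 * c * P * M() * n + c * N
                     + gamma(P * M());
        return (Ieps * serial() + X) / (n * (Ieps * parallel() + c0k + X));
    }
};

int main() {
    Model m{200, 200, 0.5, 1000, 2, 8, 50, 1000.0};
    printf("serial Z_eps    = %.0f\n", m.serial());
    printf("parallel Z~_eps = %.0f\n", m.parallel());
    printf("efficiency      = %.2f\n", m.efficiency(1000, 0));
    return 0;
}
```

With these assumed values the model predicts an efficiency of about 0.98 per target function call; the lower measured values in Table 2 reflect the serial parts of the training algorithms and the overheads $t_\gamma$, $t_\chi$ and $t_\upsilon$ that the operation count does not capture.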

CHECKING THE PARALLEL ALGORITHMS EFFICIENCY

For checking the parallel algorithms efficiency we have performed several experiments consisting of ANN trainings. For these experiments we used three computer clusters: the computer cluster of Tambov State University named after G.R. Derzhavin (TSU), the Tambov State Technical University (TSTU) computer cluster and the computer cluster of the Joint Supercomputer Center of the Russian Academy of Sciences (JSC). The weight coefficients were searched using the full enumeration method, the Monte-Carlo method and the gradient method. We used a multilayer perceptron, a cascade-correlation network and a Volterra network. The resulting efficiency values of these experiments are shown in Table 2.

CONCLUSION

As we can see, the experiment results show that the efficiency coefficient values are high and the parallel calculation of the target function value is very efficient. It means that we can use this method for ANN training when solving real problems.


LITERATURE

1. Сараев П.В. Использование псевдообращения в задачах обучения искусственных нейронных сетей // Электронный журнал «Исследовано в России». 2009. C. 1208-1221. URL: http://zhurnal.ape.relarn.ru/articles/2001/029.pdf (Saraev P.V. Pseudo-inversion usage in artificial neural network learning problems // Electronic journal "Analysed in Russia". 2009. P. 1208-1221. URL: http://zhurnal.ape.relarn.ru/articles/2001/029.pdf).

2. Арзамасцев А.А., Зусман Ю.А., Зенкова Н.А., Слетков Д.В., Шкута Н.О., Крючин О.В., Банников С.С., Королев А.Н., Шкатова Л.С., Шохина Т.Б. Реализация проекта Tempus Tacis «System Modernisation of University Management» в Тамбовском государственном университете им. Г.Р. Державина // Вестник Тамбовского университета. Серия Естественные и технические науки. Тамбов, 2006. Т. 11. Вып. 5. С. 619-645. (Arzamastsev A.A., Zusman Y.A., Zenkova N.A., Sletkov D.V., Shkuta N.O., Kryuchin O.V., Bannikov S.S., Korolev A.N., Shkatova L.S., Shohina T.B. The realization of the project Tempus Tacis «System Modernisation of University Management» in Tambov State University named after G.R. Derzhavin // Tambov University Reports. Series: Natural and Technical Sciences. 2006. V. 11. Issue 5. P. 619-645).

3. Арзамасцев А.А., Крючин О.В., Азарова П.А., Зенкова Н.А. Универсальный программный комплекс для компьютерного моделирования на основе искусственной нейронной сети с самоорганизацией структуры // Вестник Тамбовского университета. Серия Естественные и технические науки. Тамбов, 2006. Т. 11. Вып. 4. C. 564-570. (Arzamastsev A.A., Kryuchin O.V., Azarova P.A., Zenkova N.A. Universal program complex for computer simulation based on artificial neural network with structure self-organization // Tambov University Reports. Series: Natural and Technical Sciences. 2006. V. 11. Issue 4. P. 564-570).

4. Крючин О.В. Использование кластерных систем для обучения искусственных нейронных сетей при применении параллельного вычисления значения невязки // Наука и образование в развитии промышленной, социальной и экономической сфер регионов России. 2 Всероссийские научные Зворыкинские чтения: сб. тезисов докладов 2 Всероссийской межвузовской научной конференции (Муром, 5 февраля 2010 г.). Муром: Изд.-полиграфический центр МИ ВлГУ, 2010. 802 с. (Kryuchin O.V. The Usage of Computer Clusters for Training Artificial Neural Networks with Parallel Calculation of the Inaccuracy Value // Science and Education in Evolution of Industrial, Social and Economic Spheres of Russian Regions. Second All-Russian Scientific Zvorikin Readings. Materials of the 2nd All-Russian Interacademic Scientific Conference. Murom: Publishing of Vladimir State University, 2010. P. 142-143).

5. Крючин О.В. Параллельные алгоритмы обучения искусственных нейронных сетей // Материалы 15 международной конференции по нейрокибернетике. Ростов н/Д: ЮСУ, 2009. Т. 2. Симпозиум «Интерфейс ''Мозг-Компьютер''». 3 Симпозиум по Нейроинформатике и Нейрокомпьютерам. C. 93-97 (Kryuchin O.V. Parallel algorithms for training artificial neural networks // Proceedings of the 15th international conference on neurocybernetics. Rostov-on-Don: YSU, 2009. V. 2. Symposium "Brain-Computer Interface". 3rd symposium on neuroinformatics and neurocomputers. P. 93-97).

6. Крючин О.В., Арзамасцев А.А., Королев А.Н., Горбачев С.И., Семенов Н.О. Универсальный симулятор, базирующийся на технологии искусственных нейронных сетей, способный работать на параллельных машинах // Вестник Тамбовского университета. Серия Естественные и технические науки. Тамбов, 2008. Т. 13. Вып. 5. C. 372-375 (Kryuchin O.V., Arzamastsev A.A., Korolev A.N., Gorbachev S.J., Semenov N.O. Universal simulator based on artificial neural network technology which can be executed on parallel computers // Tambov University Reports. Series: Natural and Technical Sciences. Tambov, 2008. V. 13. Issue 5. P. 372-375).

7. Bader D., Pennington R. Cluster Computing: Applications. Georgia Tech College of Computing, 1996. URL: http://www.cc.gatech.edu/~bader/papers/ijhpca.html (Retrieved: 13 July 2007).

8. Крючин О.В., Королев А.Н. Библиотека распараллеливания // Вестник Тамбовского университета. Серия Естественные и технические науки. Тамбов, 2009. Т. 14. Вып. 2. С. 465-467 (Kryuchin O.V., Korolev A.N. Parallelization library // Tambov University Reports. Series: Natural and Technical Sciences. Tambov, 2009. V. 14. Issue 2. P. 465-467).

9. Крючин О.В., Арзамасцев А.А. Сравнение эффективности последовательных и параллельных алгоритмов обучения искусственных нейронных сетей на кластерных вычислительных системах // Вестник Тамбовского университета. Серия Естественные и технические науки. Тамбов, 2010. Т. 15. Вып. 6. С. 1872-1879 (Kryuchin O.V., Arzamastsev A.A. Comparison of effectiveness of serial and parallel algorithms of artificial neural network training on cluster computation systems // Tambov University Reports. Series: Natural and Technical Sciences. Tambov, 2010. V. 15. Issue 6. P. 1872-1879).

Received 26 May 2012.

