
DOI: 10.14529/cmse230204

SOLVING GRID EQUATIONS USING THE ALTERNATING-TRIANGULAR METHOD ON A GRAPHICS ACCELERATOR*

© 2023 A.I. Sukhinov1, V.N. Litvinov1,2, A.E. Chistyakov1,

A.V. Nikitina1,3, N.N. Gracheva1,2, N.B. Rudenko1,2

1 Don State Technical University (Gagarin Sq. 1, Rostov-on-Don, 344003 Russia),

2 Azov-Black Sea Engineering Institute of Don State Agrarian University (Lenina 21, Zernograd, 347740 Russia),

3 Southern Federal University (Bolshaya Sadovaya 105/42, Rostov-on-Don, 344006 Russia)

E-mail: sukhinov@gmail.com, litvinovvn@rambler.ru, cheese_05@mail.ru, nikitina.vm@gmail.com, 79286051374@yandex.ru, nelli-rud@yandex.ru

Received: 15.03.2023

The paper describes a parallel-pipeline implementation of solving grid equations, which arise in the numerical solution of the equations of mathematical physics, using the modified alternating-triangular iterative method (MATM). The greatest computational costs of this method fall on the stages of solving systems of linear algebraic equations (SLAE) with lower-triangular and upper-triangular matrices. An algorithm for solving a SLAE with a lower-triangular matrix on a graphics accelerator using NVIDIA CUDA technology is presented. To implement the parallel-pipeline method, a three-dimensional decomposition of the computational domain is used. It is divided into blocks along the y coordinate, the number of which corresponds to the number of GPU streaming multiprocessors involved in the calculations. In turn, the blocks are divided into fragments along the two remaining spatial coordinates, x and z. The presented graph model describes the relationship between adjacent fragments of the computational grid and the pipeline calculation process. Based on the results of computational experiments, a regression model was obtained that describes the dependence of the time of calculating one MATM step on the GPU; the speedup and efficiency of solving a SLAE with a lower-triangular matrix by the parallel-pipeline method on the GPU were calculated for different numbers of streaming multiprocessors. Keywords: mathematical modeling, parallel algorithm, graphics accelerator.

FOR CITATION

Sukhinov A.I., Litvinov V.N., Chistyakov A.E., Nikitina A.V., Gracheva N.N., Rudenko N.B. Solving Grid Equations Using the Alternating-triangular Method on a Graphics Accelerator. Bulletin of the South Ural State University. Series: Computational Mathematics and Software Engineering. 2023. Vol. 12, no. 2. P. 78-92. DOI: 10.14529/cmse230204.

Introduction

Modeling of physical processes occurring in the environment and their mathematical description lead to the necessity of solving partial differential equations. To study dynamic processes in hydrophysics and hydrodynamics, the diffusion-convection-reaction equation is used [1, 2].

The solution of the equations of mathematical physics is based on approximating them with finite-difference schemes. When an implicit difference scheme is used, solving the equation reduces to solving a system of linear algebraic equations of large dimension. The greatest computational cost in solving the differential equations falls on the solution of this SLAE; therefore, various iterative methods and algorithms are developed and applied [3-6]. One of the effective iterative methods for solving SLAE is the alternating-triangular method. It is applicable to high-dimensional SLAE with self-adjoint and non-self-adjoint operators and has a high convergence rate. The iterative alternating-triangular method is also used to solve ill-conditioned SLAE.

*The paper is recommended for publication by the Program Committee of the International Scientific Conference "Parallel Computational Technologies (PCT) 2023".

The dimensions of the resulting SLAE are such that they require high computing performance. The lack of computing performance is addressed in several ways, in particular by using graphics accelerators (GPUs) for the computing process [3, 7-12].

Russian scientists have applied computational grid decomposition to the solution of a three-dimensional boundary value problem. The parallel algorithm was implemented in a heterogeneous computing environment. Due to the use of graphics accelerators, the calculation time was reduced by a factor of 60 compared to calculations on the CPU [7]. The three-phase filtration problem was also solved in a heterogeneous computing environment using decomposition of the computational domain. The parallel algorithm was implemented in C++ using the CUDA and MPI technologies. Due to the use of the GPU, the computational costs of this problem were reduced by several tens of times [3]. Scientists of the China University of Petroleum proposed a parallel algorithm for modeling solidification in two dimensions based on domain decomposition. The algorithm was implemented on the GPU, which significantly reduced the calculation time [8]. Researchers of Altai State University developed a parallel algorithm for the numerical solution of the problem of electromagnetic pulse propagation in a two-dimensional rectangular domain. The algorithm was implemented on the basis of CUDA technology. An analysis of the performance of the developed algorithm showed that, for this problem, the GPU performance is several times higher than the CPU performance [9]. The effectiveness of using graphics accelerators for numerical modeling of problems of applied hydrodynamics has also been demonstrated: the calculation rate in the numerical modeling of the hydrodynamic characteristics of marine propellers was increased by 1.4-3 times [10]. The exact calculation of heat transfer coefficients requires powerful computing resources that are not always available. A parallel algorithm for such calculations implemented in a heterogeneous computing environment increases productivity more than ten times compared to the CPU [12].

In this paper, a decomposition method for the three-dimensional computational domain is proposed to implement the parallel algorithm on the GPU. The developed parallel-pipeline method for solving the SLAE makes it possible to use the GPU resources effectively and to reduce the calculation time.

1. Method of Solving Grid Equations

Solving the equations of mathematical physics can be reduced to solving a system of linear algebraic equations of the form:

Ax = f, \quad A : H \to H, \qquad (1)

where A is a linear, positive definite operator.

For the grid equation (1), the iterative methods are used, which in canonical form can be represented by the equation [1, 2]:

B \frac{x^{m+1} - x^m}{\tau_{m+1}} + A x^m = f, \quad B : H \to H, \qquad (2)

where m is the iteration number; \tau_{m+1} > 0 is the iteration parameter; B is the preconditioner.

The resulting grid equations will be solved using the modified alternating-triangular method of variational type. The preconditioner is formed as follows:

B = (D + \omega R_1) D^{-1} (D + \omega R_2), \quad D = D^* > 0, \ \omega > 0, \qquad (3)

where D is the diagonal operator; R_1, R_2 are the lower- and upper-triangular operators, respectively.

The calculation algorithm of grid equations by the modified alternating-triangular method of the variational type is written in the form:

\begin{aligned}
r^m &= A x^m - f, \\
(D + \omega R_1) y^m &= r^m, \quad (D + \omega R_2) w^m = D y^m, \\
\widetilde{\omega}_m &= \sqrt{\frac{(D w^m, w^m)}{(D^{-1} R_2 w^m, R_2 w^m)}}, \\
s_m^2 &= 1 - \frac{(A_0 w^m, w^m)^2}{(B^{-1} A_0 w^m, A_0 w^m)(B w^m, w^m)}, \quad
k_m^2 = \frac{(B^{-1} A_1 w^m, A_1 w^m)}{(B^{-1} A_0 w^m, A_0 w^m)}, \\
\theta_m &= \frac{1 - \sqrt{\dfrac{s_m^2 k_m^2}{1 + k_m^2}}}{1 + k_m^2 (1 - s_m^2)}, \quad
\tau_{m+1} = \theta_m \frac{(A_0 w^m, w^m)}{(B^{-1} A_0 w^m, A_0 w^m)}, \\
x^{m+1} &= x^m - \tau_{m+1} w^m, \quad \omega_{m+1} = \widetilde{\omega}_m,
\end{aligned} \qquad (4)

where r^m is the residual vector; w^m is the correction vector; A_0 and A_1 are the symmetric and skew-symmetric parts of the operator A; the parameter s_m describes the convergence rate of the method; k_m describes the ratio of the norm of the skew-symmetric operator part to the norm of the symmetric part.

The most labor-intensive part of the algorithm is the calculation of the correction vector w^m, which reduces to solving two SLAE with lower-triangular and upper-triangular matrices:

(D + \omega R_1) y^m = r^m, \quad (D + \omega R_2) w^m = D y^m.

The algorithm fragment for solving the SLAE with the lower-triangular matrix is given in Algorithm 1. The residual vector is calculated in 14N arithmetic operations. The total number of arithmetic operations required to solve the SLAE with the seven-diagonal matrix using MATM, in the case of known iterative parameters \tau_{m+1}, \omega_{m+1}, is 35N, where N = n1·n2·n3 is the SLAE dimension.

Algorithm 1 matm(IN: n1, n2, n3, a0, a2, a4, a6, ω; IN/OUT: r)
1: for k ∈ [1; n3 − 2] do
2:   for i ∈ [1; n1 − 2] do
3:     for j ∈ [1; n2 − 2] do
4:       p0 ← i + n1 · j + n1 · n2 · k
5:       if a0[p0] > 0 then
6:         p2 ← p0 − 1; p4 ← p0 − n1; p6 ← p0 − n1 · n2
7:         r[p0] ← (ω · (a2[p0] · r[p2] + a4[p0] · r[p4] + a6[p0] · r[p6]) + r[p0]) / ((0.5 · ω + 1) · a0[p0])
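For reference, Algorithm 1 can be sketched as a sequential CPU routine. This is our illustrative reconstruction rather than the authors' code; it assumes the index layout p0 = i + n1·j + n1·n2·k from the text and densely stored coefficient vectors:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential sketch of Algorithm 1: forward sweep of (D + omega*R1)y = r
// for a seven-diagonal grid matrix. a0 holds the diagonal coefficients,
// a2/a4/a6 the couplings to the (i-1), (j-1) and (k-1) neighbors.
void matm_sweep(std::size_t n1, std::size_t n2, std::size_t n3,
                const std::vector<double>& a0, const std::vector<double>& a2,
                const std::vector<double>& a4, const std::vector<double>& a6,
                double omega, std::vector<double>& r) {
    for (std::size_t k = 1; k + 1 < n3; ++k)
        for (std::size_t i = 1; i + 1 < n1; ++i)
            for (std::size_t j = 1; j + 1 < n2; ++j) {
                const std::size_t p0 = i + n1 * j + n1 * n2 * k;
                if (a0[p0] > 0) {
                    const std::size_t p2 = p0 - 1;        // neighbor at i - 1
                    const std::size_t p4 = p0 - n1;       // neighbor at j - 1
                    const std::size_t p6 = p0 - n1 * n2;  // neighbor at k - 1
                    r[p0] = (omega * (a2[p0] * r[p2] + a4[p0] * r[p4] +
                                      a6[p0] * r[p6]) + r[p0]) /
                            ((0.5 * omega + 1.0) * a0[p0]);
                }
            }
}
```

On a 3×3×3 grid with a0 = 2, zero off-diagonal coefficients and ω = 1, the single interior node is simply scaled by 1/((0.5·ω + 1)·a0) = 1/3, which makes the sweep easy to spot-check.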

2. Decomposition Model of Computational Domain

Let Q be the set of technical characteristics of the video adapter, then we will present the characteristics of the video adapter in the form of a tuple.

Q = (q1,q2), (5)

where q1 is the amount of video memory of the video adapter, GB; q2 is the number of streaming multiprocessors.

If S is the set of program threads involved in the computational process, then

S = \{ s_k, \ k = \overline{1, q_2} \}, \qquad (6)

where sk is a CUDA streaming block that implements the calculation process on GPU streaming multiprocessor with index k.

Let us take the computational domain with the following parameters: l_x is the characteristic size along the axis Ox; l_y, along the axis Oy; l_z, along the axis Oz.

Let us compare the specified area with a uniform computational grid of the following type:

W = \{ x_i = i h_x, \ y_j = j h_y, \ z_k = k h_z; \ i = \overline{0, n_x - 1}, \ j = \overline{0, n_y - 1}, \ k = \overline{0, n_z - 1}; \ (n_x - 1) h_x = l_x, \ (n_y - 1) h_y = l_y, \ (n_z - 1) h_z = l_z \}, \qquad (7)

where hx,hy, hz are the steps of computational grid at the corresponding spatial directions; nx,ny, nz are the number of grid nodes at the corresponding spatial directions.

Then the set of nodes of the computational grid can be represented as

G = \{ g_{i,j,k}, \ i = \overline{0, n_x - 1}, \ j = \overline{0, n_y - 1}, \ k = \overline{0, n_z - 1} \}, \quad g_{i,j,k} = (x_i, y_j, z_k), \qquad (8)

where g_{i,j,k} is the grid node.

The number of nodes of the computational grid Nq is calculated by the formula:

N_G = n_x \cdot n_y \cdot n_z. \qquad (9)

By a block G^{k_1} \subset G of the computational grid (hereinafter, a block) we mean a subset of the nodes of the computational grid G:

G = \bigcup_{k_1 \in K^{k_1}} G^{k_1} = \{ g^{k_1} \mid \exists k_1 \in K^{k_1}, \ g^{k_1} \in G^{k_1} \}, \quad \bigcap_{k_1} G^{k_1} = \emptyset, \qquad (10)

where K^{k_1} = \{1, \ldots, N^{k_1}\} is the set of indices of the blocks G^{k_1} of the computational grid G; N^{k_1} is the number of blocks G^{k_1}, N^{k_1} = q_2; K^{k_1}, N^{k_1} \subset \mathbb{N}; \mathbb{N} is the set of natural numbers; k_1 is the index of block G^{k_1}.

Since G^{k_1} \subset G, then

G^{k_1} = \{ \tilde{g}^{k_1}_{i,j,k}, \ i = \overline{0, n_x - 1}, \ j = \overline{0, n_y^{k_1} - 1}, \ k = \overline{0, n_z - 1} \}, \qquad (11)

where \tilde{g}^{k_1}_{i,j,k} is a node of block k_1; the ~ sign denotes belonging to the block; j is the index of the block node at coordinate y; n_y^{k_1} is the number of nodes in block k_1 at coordinate y.

\tilde{g}^{k_1}_{i,j,k} = (x_i, \tilde{y}_j, z_k), \quad x_i = i h_x, \quad \tilde{y}_j = \Big( \sum_{b=1}^{k_1 - 1} n_y^b + j \Big) h_y, \quad z_k = k h_z, \qquad (12)

where n_y^b is the number of nodes at coordinate y of the b-th block.

By a fragment G^{k_1,k_2} of the computational grid (hereinafter, a fragment) we mean a subset of the nodes of block G^{k_1}:

G^{k_1} = \bigcup_{k_2 \in K^{k_1,k_2}} G^{k_1,k_2} = \{ g^{k_1,k_2} \mid \exists k_2 \in K^{k_1,k_2}, \ g^{k_1,k_2} \in G^{k_1,k_2} \}, \quad \bigcap_{k_2} G^{k_1,k_2} = \emptyset, \qquad (13)

where K^{k_1,k_2} = \{1, \ldots, N^{k_1,k_2}\} is the set of indices of the fragments G^{k_1,k_2} of block G^{k_1}; N^{k_1,k_2} is the number of fragments G^{k_1,k_2}; K^{k_1,k_2}, N^{k_1,k_2} \subset \mathbb{N}; k_2 is the index of fragment G^{k_1,k_2} of block G^{k_1}. Since G^{k_1,k_2} \subset G^{k_1}, then

G^{k_1,k_2} = \{ \hat{g}_{i,j,k}, \ i = \overline{0, \hat{n}_x - 1}, \ j = \overline{0, n_y^{k_1} - 1}, \ k = \overline{0, \hat{n}_z - 1} \}, \qquad (14)

where \hat{g}_{i,j,k} is the fragment node; the ^ sign denotes belonging to a fragment; i, k are the indices of the fragment node at coordinates x, z; \hat{n}_x, \hat{n}_z are the numbers of computational grid nodes in the fragment along the coordinates x, z.

Each index k_2 of fragment G^{k_1,k_2} is associated with a tuple of indices (k_3, k_4) that stores the fragment coordinates in the plane xOz, where k_3 is the fragment index at coordinate x and k_4 is the fragment index at coordinate z:

k_2 = k_3 + K^{k_3} \cdot k_4, \qquad (15)

where K^{k_3} is the number of fragments along the Ox axis.
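As a small illustration (function names are ours, not from the paper), the mapping (15) and its inverse can be written as:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Formula (15): linearize the fragment coordinates (k3, k4) in the xOz
// plane into a single index k2, given Kk3 fragments along the Ox axis.
std::size_t fragment_index(std::size_t k3, std::size_t k4, std::size_t Kk3) {
    return k3 + Kk3 * k4;
}

// Inverse mapping: recover (k3, k4) from k2.
std::pair<std::size_t, std::size_t> fragment_coords(std::size_t k2,
                                                    std::size_t Kk3) {
    return {k2 % Kk3, k2 / Kk3};
}
```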


The number of fragments G^{k_1,k_2} of block G^{k_1} is calculated by the formula

K^{k_2} = K^{k_3} \cdot K^{k_4}, \qquad (16)

where K^{k_4} is the number of fragments at coordinate z.

\hat{g}_{i,j,k} = (\hat{x}_i, y_j, \hat{z}_k), \quad
\hat{x}_i = \Big( \sum_{b=1}^{k_3 - 1} \hat{n}_x^b + i \Big) h_x, \quad
y_j = j h_y, \quad
\hat{z}_k = \Big( \sum_{b=1}^{k_4 - 1} \hat{n}_z^b + k \Big) h_z, \qquad (17)

where \hat{n}^b is the number of nodes in the b-th fragment.

Let us introduce a set M of correspondences between the computational grid blocks and the program threads:

M = \{ m^{k_1} = (G^{k_1}, s^{k_1}), \ k_1 \in K^{k_1} \}, \qquad (18)

where s^{k_1} \in S is the program thread that computes block G^{k_1}.

For the domain decomposition, it is necessary to take into account the computing performance of the devices involved in the calculations. Performance here refers to the number of nodes of the computational grid processed by a given algorithm per unit of time.

To calculate the number of nodes along the coordinate y in the blocks of the computational grid processed by the GPU streaming multiprocessors, we use the formulas

n_y^{GT} = \Big\lceil \frac{n_y}{N^{k_1}} \Big\rceil, \quad
n_y^{GTL} = n_y - \sum_{b=1}^{N^{k_1} - 1} n_y^{GT}, \qquad (19)

where n_y^{GT} is the number of computational grid nodes along coordinate y in the blocks processed by the GPU streaming multiprocessors, except for the last block; n_y^{GTL} is the number of nodes at coordinate y in the last block of the computational grid processed by the GPU streaming multiprocessors.

The number of the computational grid fragments along the coordinate y is equal to

N_y^f = N^{k_1}. \qquad (20)

Let the numbers of fragments at coordinates x and z be N_x^f and N_z^f, respectively. Then the number of nodes of the computational grid along the coordinate x is calculated by the formulas

n_x^f = \Big\lceil \frac{n_x}{N_x^f} \Big\rceil, \quad
n_x^{fL} = n_x - n_x^f \cdot (N_x^f - 1), \qquad (21)

where n_x^f is the number of nodes of the computational grid along the coordinate x in all fragments except the last one; n_x^{fL} is the number of such nodes in the last fragment.

Similarly, the number of nodes of the computational grid along the coordinate z is calculated:

n_z^f = \Big\lceil \frac{n_z}{N_z^f} \Big\rceil, \quad
n_z^{fL} = n_z - n_z^f \cdot (N_z^f - 1), \qquad (22)

where n_z^f is the number of nodes of the computational grid along the coordinate z in all fragments except the last one; n_z^{fL} is the number of such nodes in the last fragment.
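Formulas (21) and (22) share one pattern: every fragment gets the same node count and the last fragment absorbs the remainder. A hedged sketch of that pattern (rounding up is our assumption; the source does not spell out the rounding rule):

```cpp
#include <cassert>
#include <cstddef>

struct AxisSplit {
    std::size_t regular;  // nodes in every fragment except the last
    std::size_t last;     // nodes in the last fragment
};

// Split n grid nodes into nf fragments along one axis, following the
// pattern of formulas (21)-(22): last = n - regular * (nf - 1).
AxisSplit split_axis(std::size_t n, std::size_t nf) {
    const std::size_t regular = (n + nf - 1) / nf;  // ceil(n / nf)
    return {regular, n - regular * (nf - 1)};
}
```

For the experiment in Section 3 (n_x = 640, N_x^f = 4) this gives 160 nodes per fragment, matching the fragment size reported there.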

Suppose that on M it is necessary to organize a parallel process for calculating some function F, and the calculations in each fragment G^{k_1,k_2} depend on the values in the neighboring fragments, each of which has at least one of its indices at coordinates x, y, and z one less than the current one.

To organize the parallel-pipeline method, let us introduce a set of tuples A that defines the correspondences a between the program threads s_k processing the fragments G^{k_1,k_2} and the step numbers r of the parallel-pipeline method:

\forall s_k \in S \ \exists a \in A : a = (s_k, G^{k_1,k_2}, r), \qquad (23)

where r = \overline{1, N_r} is the step number of the parallel-pipeline method and N_r is the number of its steps, calculated by the formula

N_r = N_x^f N_z^f + N_y^f - 1. \qquad (24)

Full load of all calculators in the proposed parallel-pipeline method starts at step r_{START} = N_y^f and ends at step r_{STOP} = N_x^f N_z^f. In this case, the total number of steps with a full load of the calculators is

N_r^{PAR} = r_{STOP} - r_{START} + 1 = N_x^f N_z^f - N_y^f + 1. \qquad (25)
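Formulas (24) and (25) are easy to check against Table 1, where each block holds 4×7 fragments; a minimal sketch:

```cpp
#include <cassert>
#include <cstddef>

// Formula (24): total number of steps of the parallel-pipeline method.
std::size_t pipeline_steps(std::size_t nxf, std::size_t nzf, std::size_t nyf) {
    return nxf * nzf + nyf - 1;
}

// Formula (25): number of steps during which all calculators are busy.
std::size_t full_load_steps(std::size_t nxf, std::size_t nzf, std::size_t nyf) {
    return nxf * nzf - nyf + 1;
}
```

With N_x^f = 4, N_z^f = 7 and N_y^f = N^{k_1} ∈ {2, 7, 14}, pipeline_steps returns 29, 34, and 41, reproducing the N_r column of Table 1.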

The calculation time of some function F by the parallel-pipeline method can be written in the form

T_M = \sum_{r=1}^{N_r} \max(T_a), \qquad (26)

where T_a is the vector of time values for the fragments processed in parallel at step r.

3. Parallel Implementation

The numerical implementation of the MATM for solving SLAE of high dimension is based on the developed parallel algorithms that implement the pipeline computing process. The use of these algorithms makes it possible to fully utilize all available streaming multiprocessors of the graphics accelerator.

A class library was developed in C++ for describing the domain decomposition. The class library contains the following classes:

• Grid3D describes the parameters of the computational grid (the numbers of nodes n_x, n_y, n_z and the step sizes h_x, h_y, h_z in the spatial coordinates) and contains an array of objects of the GridBlock3D class.

• GridBlock3D describes the parameters of a computational grid block and contains an array of objects of the GridFragment3D class.

• GridFragment3D describes the parameters of a computational grid fragment and contains the data arrays.

The organization of calculations is performed by an algorithm that controls all available streaming multiprocessors of the GPU (calculators). Each calculator performs calculations only for its own block of the computational domain. For this, the computational domain is divided into blocks that are assigned to individual calculators (Fig. 1). Next, each block is divided into fragments. Notation in Fig. 1: SM_1, SM_2, SM_3 are the streaming multiprocessors of the GPU.

A graph model was used to describe the relationships between adjacent fragments of the computational grid and to organize the pipeline calculation process (Fig. 2). Each graph node is an object of the GridFragment3D class that describes a fragment of the computational domain. This class contains the following fields: the dimensions of the fragment along the Ox, Oy, and Oz axes; the index of the zero node of the fragment in the global computational domain; pointers to the adjacent fragments of the computational grid; pointers to objects that describe the parameters of the calculators. The computational process is a graph traversal from the root node with parallel launch of the calculators, which process the graph nodes in accordance with the value of the calculation step counter r.

An algorithm and its program implementation in the CUDA C language are developed to improve the calculation efficiency of the computational grid fragments assigned to the graphics accelerator [13-17].

We present an algorithm for finding the solution of the system of equations with the lower-triangular matrix (the forward sweep) in CUDA C.

The input parameters of the algorithm are the vectors of the coefficients of the grid equations a0, a2, a4, a6 and the constant ω. The output parameter is the vector of the water flow velocity v. Before running the algorithm, it is necessary to programmatically set the dimensions of the CUDA computing block blockDim.x, blockDim.z along the spatial coordinates x, z, respectively. The CUDA framework runs this algorithm for each thread, and the variables threadIdx.x, threadIdx.z, blockIdx.x, blockIdx.z are automatically initialized with the indexes of the corresponding threads and blocks. Global thread indexes are calculated in rows 1 and 2. The row index i and the layer index k that the current thread processes are calculated in row 3, where the variable j, a counter along the coordinate y, is also initialized. The calculation pipeline is organized as the loop in line 4. The indexes of the central node of the grid pattern p0 and of the surrounding nodes p2, p4, p6 are calculated in lines 6 and 8. The two-dimensional array cache is located in the GPU shared memory and stores the calculation results of the current layer along the coordinate y. This reduces the number of reads from slow global memory and accelerates the calculation by up to 30 %.

The performed research shows a significant dependence of the execution time of the preconditioner calculation on the ratio of threads along the spatial coordinates. A series of experiments was performed in advance to calculate the performance of the calculators, taken as the 95th percentile of the calculation time per 1000 nodes of the computational grid.

A GeForce GTX 1650 video adapter was used in the experimental research. It has 4 GB of video memory, core and memory clock frequencies of 1485 MHz and 1665 MHz, and a 128-bit video memory bus. The computing part consists of 14 streaming multiprocessors (SM).

The purpose of the experiment is to determine the distribution of threads along the Ox and Oz axes of the computational grid, at different numbers of nodes along the Oy axis, such that the execution time of one MATM step on the GPU is minimal. Two values are taken as factors: k = X/Z, the ratio of the number of threads on the Ox axis (X) to the number of threads on the Oz axis (Z); and Y, the number of threads on the axis Oy.

Fig. 2. A graph model that describes the relationships between adjacent fragments of the computational grid and the process of pipeline calculation

The objective function is T_GPU, the calculation time of one MATM step on the GPU in terms of 1000 nodes of the computational grid, ms.

The regression equation was obtained as a result of the experimental data processing (Fig. 3):

T_{GPU} = a - b \cdot Y - c \cdot \ln(k) - d \cdot \ln(Y), \qquad (27)

where T_{GPU} is the execution time of one MATM step on the GPU in terms of 1000 nodes of the computational grid, ms. The determination coefficient was 0.86; a = 0.026; b = 2 \cdot 10^{-7}; c = 16 \cdot 10^{-5}; d = 77 \cdot 10^{-5}.
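The fitted model (27) can be evaluated directly; the sketch below simply plugs in the reported coefficients:

```cpp
#include <cassert>
#include <cmath>

// Regression model (27): time of one MATM step on the GPU per 1000 grid
// nodes, ms, as a function of the thread ratio k = X/Z and thread count Y.
double tgpu_model(double k, double y) {
    const double a = 0.026;
    const double b = 2e-7;
    const double c = 16e-5;
    const double d = 77e-5;
    return a - b * y - c * std::log(k) - d * std::log(y);
}
```

The model predicts shorter step times as k and Y grow, consistent with the slowdown reported below for k < 10 and Y < 1000.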

To evaluate the effectiveness of the parallel-pipeline method for solving a SLAE with a lower-triangular matrix, a numerical experiment was performed. The dimensions of the three-dimensional uniform computational grid along the spatial coordinates x, y, and z were set equal to 640, 224, and 448, respectively. The amount of video memory used was 3.8 GB. In the course of the experimental research, we changed the number of streaming multiprocessors N^{k_1} involved in the calculations and recorded the computation time T_M.

Algorithm 2 matmKernel(IN: a0, a2, a4, a6, ω; IN/OUT: v)
1: thX ← blockDim.x · blockIdx.x + threadIdx.x
2: thZ ← blockDim.z · blockIdx.z + threadIdx.z
3: i ← thX + 1; j ← 1; k ← thZ + 1
4: for s ∈ [3; n1 + n2 + n3 − 3] do
5:   if (i + j + k = s) ∧ (s < i + n2 + k) then
6:     p0 ← i + (blockDim.x + 1) · j + n1 · n2 · k
7:     if a0[p0] > 0 then
8:       p2 ← p0 − 1; p4 ← p0 − n1; p6 ← p0 − n1 · n2
9:       vp4 ← 0
10:      if s > 3 + thX + thZ then
11:        vp4 ← cache[thX][thZ]
12:      else
13:        vp4 ← v[p4]
14:      vp2 ← 0
15:      if (thX ≠ 0) ∧ (s > 3 + thX + thZ) then
16:        vp2 ← cache[thX − 1][thZ]
17:      else
18:        vp2 ← v[p2]
19:      vp6 ← 0
20:      if (thZ ≠ 0) ∧ (s > 3 + thX + thZ) then
21:        vp6 ← cache[thX][thZ − 1]
22:      else
23:        vp6 ← v[p6]
24:      vp0 ← (ω · (a2[p0] · vp2 + a4[p0] · vp4 + a6[p0] · vp6) + v[p0]) / ((0.5 · ω + 1) · a0[p0])
25:      cache[thX][thZ] ← vp0
26:      v[p0] ← vp0
27:      j ← j + 1

For each experiment, the computational grid was divided into three-dimensional blocks and fragments. The number of blocks was set equal to the number of streaming multiprocessors. The numbers of fragments in the blocks along the spatial coordinates x, y, and z were set equal to 4, 1, and 7, respectively. The sizes of the fragments along the spatial coordinates x (\hat{n}_x) and z (\hat{n}_z) were set equal to 160 and 64, respectively. The speedup S_p = T_M(1)/T_M(N^{k_1}) and the efficiency E_p = S_p/N^{k_1} were calculated from the experimental data. The results of the numerical experiments are shown in Table 1.
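The speedup and efficiency figures in Table 1 follow directly from the measured times; a one-line sketch of both definitions:

```cpp
#include <cassert>
#include <cmath>

// Sp = Tm(1) / Tm(p): speedup on p streaming multiprocessors.
double speedup(double t1, double tp) { return t1 / tp; }

// Ep = Sp / p: parallel efficiency.
double efficiency(double t1, double tp, double p) { return t1 / (tp * p); }
```

With the measured times T_M(1) = 580 and T_M(14) = 61 from Table 1, speedup gives about 9.5 and efficiency about 0.68.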

Conclusions

To solve grid equations by the MATM on the graphics accelerator, a decomposition model of the computational domain has been developed. The computational domain is divided into blocks along the spatial coordinate y, and then the blocks are divided into fragments along the spatial coordinates x and z. This model makes it possible to map a block of the computational domain to each GPU streaming multiprocessor and to organize a parallel-pipeline computational process. A graph model was proposed that describes the relationship between adjacent fragments of

Fig. 3. Surface of the response function Tqpu = f(k, Y)

Table 1. Results of SLAE calculations with a lower triangular matrix by the parallel-pipeline method on GPU

N^{k_1}   n_y^{k_1}   N_r   T_M   S_p   E_p
1         224         -     580   1.0   1.00
2         112         29    304   1.9   0.95
7         32          34    102   5.7   0.81
14        16          41    61    9.5   0.68

the computational grid and the process of pipeline calculation. The algorithm for solving the system of equations with a lower-triangular matrix in the CUDA C language was described.

As a result of the experiment, a regression model was obtained that describes the dependence of the time of calculating one MATM step on the GPU. According to the regression model, at k < 10 and Y < 1000 the calculation slows down, which is explained by inefficient use of the shared memory of the graphics accelerator.

The results of calculations of the SLAE with the lower-triangular matrix by the parallel-pipeline method on the GPU using different numbers of streaming multiprocessors are presented. At N^{k_1} = 14, the speedup S_p was 9.5 and the efficiency E_p was 0.68.

The reported study was funded by the Russian Science Foundation (project No. 21-71-20050).

References

1. Sukhinov A.I., Atayan A.M., Belova Y.V., et al. Data processing of field measurements of expedition research for mathematical modeling of hydrodynamic processes in the Azov Sea. Computational Continuum Mechanics. 2020. Vol. 13, no. 2. P. 161-174. DOI: 10.7242/1999-6691/2020.13.2.13.

2. Sukhinov A.I., Litvinov V.N., Chistyakov A.E., et al. Computational aspects of solving grid

equations in heterogeneous computing systems. Parallel Computing Technologies. Vol. 12942 / ed. by V. Malyshkin. Springer, 2021. P. 166-177. Lecture Notes in Computer Science. DOI: 10.1007/978-3-030-86359-3_13.

3. Lyupa A., Morozov D., Trapeznikova M., et al. Three-phase filtration modeling by explicit methods on hybrid computer systems. Mathematical Models and Computer Simulations. 2014. Vol. 6. P. 551-559. DOI: 10.1134/S2070048214060088.

4. Mat Ali N.A., Rahman R., Sulaiman J., Ghazali K. Solutions of reaction-diffusion equations using similarity reduction and HSSOR iteration. Indonesian Journal of Electrical Engineering and Computer Science. 2019. Vol. 16, no. 3. P. 1430-1438. DOI: 10.11591/ijeecs.vl6.i3.ppl430-1438.

5. Kittisopaporn A., Chansangiam P. The steepest descent of gradient-based iterative method for solving rectangular linear systems with an application to Poisson's equation. Advances in Difference Equations. 2020. Vol. 2020. Article number 259. DOI: 10.1186/s13662-020-02715-9.

6. Yifen K., Ma C. Adaptive parameter based matrix splitting iteration method for the large and sparse linear systems. Computers & Mathematics with Applications. 2022. Vol. 122. P. 19-27. DOI: 10.1016/j.camwa.2022.07.010.

7. Klimonov I.A., Korneev V.D., Sveshnikov V.M. Parallelization technologies for solving three-dimensional boundary value problems on quasi-structured grids using the CPU+GPU hybrid computing environment. Numerical Methods and Programming. 2016. Vol. 17, no. 1. P. 65-71. DOI: 10.26089/NumMet.v17r107.

8. Ding P., Liu Z. Accelerating phase-field modeling of solidification with a parallel adaptive computational domain approach. International Communications in Heat and Mass Transfer. 2020. Vol. 111. P. 104452. DOI: 10.1016/j.icheatmasstransfer.2019.104452.

9. Molostov I., Scherbinin V. Application of NVIDIA CUDA Technology for Numerical Simulation of Electromagnetic Pulses Propagation. Izvestiya of Altai State University. 2015. Vol. 1, no. 1/1(85). DOI: 10.14258/izvasu(2015)1.1-06.

10. Krasnopolsky B., Medvedev A., Chulyunin A. On application of GPUs for modelling of hydrodynamic characteristics of screw marine propellers in OpenFOAM package. Proceedings of the Institute for System Programming of RAS. 2014. Vol. 26, no. 5. P. 155-172. DOI: 10.15514/ISPRAS-2014-26(5)-8.

11. Egorov M., Egorov S., Egorov D. Using graphics accelerator to improve computing performance in the numerical modeling of complex technical systems functioning. Perm National Research Polytechnic University Aerospace Engineering Bulletin. 2015. No. 40. P. 81-91. DOI: 10.15593/2224-9982/2015.40.05.

12. Szenasi S. Solving the inverse heat conduction problem using NVLink capable Power architecture. PeerJ Computer Science. 2017. Vol. 3. P. 138. DOI: 10.7717/peerj-cs.138.

13. Zheng L., Gerya T., Knepley M., et al. GPU Implementation of Multigrid Solver for Stokes Equation with Strongly Variable Viscosity. GPU Solutions to Multi-scale Problems in Science and Engineering. Springer, 2013. P. 321-333. Lecture Notes in Earth System Sciences. DOI: 10.1007/978-3-642-16405-7_21.

14. Konovalov A. The steepest descent method with an adaptive alternating-triangular preconditioner. Differential Equations. 2004. Vol. 40. P. 1018-1028.

15. Sukhinov A.I., Chistyakov A.E., Litvinov V.N., et al. Computational Aspects of Mathematical Modeling of the Shallow Water Hydrobiological Processes. Numerical methods and programming. 2020. Vol. 21, no. 4. P. 452-469. DOI: 10.26089/NumMet.v21r436.

16. Samarskii A.A., Vabishchevich P.N. Numerical methods for solving convection-diffusion problems. Moscow: URSS, 2009. (in Russian).

17. Browning J.B., Sutherland B. C++20 Recipes. A Problem-Solution Approach. Berkeley, CA: Apress, 2020. 630 p.

УДК 519.6 DOI: 10.14529/cmse230204

РЕШЕНИЕ СЕТОЧНЫХ УРАВНЕНИЙ ПОПЕРЕМЕННО-ТРЕУГОЛЬНЫМ МЕТОДОМ НА ГРАФИЧЕСКОМ УСКОРИТЕЛЕ

© 2023 А.И. Сухинов1, В.Н. Литвинов1,2, Ф.Е. Чистяков1,

А.В. Никитина1,3, Н.Н. Грачева1,2, Н.Б. Руденко1,2

1 Донской государственный технический университет (344003 Ростов-на-Дону, пл. Гагарина, д. 1),

2Азово- Черноморский инженерный институт ФГБОУ ВО Донской ГАУ (347740 Зерноград, ул. Ленина, д. 21),

3 Южный федеральный университет (344006 Ростов-на-Дону, ул. Большая Садовая, д. 105/42)

E-mail: sukhinov@gmail.com, litvinovvn@rambler.ru, cheese_05@mail.ru, nikitina.vm@gmail.com, 79286051374@yandex.ru, nelli-rud@yandex.ru Поступила в редакцию: 15.03.2023

В статье описана параллельно-конвейерная реализация решения сеточных уравнений модифицированным попеременно-треугольным итерационным методом (МПТМ), получаемых при численном решении уравнений математической физики. Наибольшие вычислительные затраты при использовании указанного метода приходятся на этапы решения системы линейных алгебраических уравнений (СЛАУ) с нижнетреугольной и верхнетреугольной матрицами. Представлен алгоритм решения СЛАУ с нижнетреугольной матрицей на графическом ускорителе с использованием технологии NVIDIA CUDA. Для реализации параллельно-конвейерного метода использовалась трехмерная декомпозиция расчетной области. Она делится по координате у на блоки, количество которых соответствует количеству потоковых мультипроцессоров GPU, задействованных в вычислениях. В свою очередь, блоки разделяются на фрагменты по двум пространственным координатам — жиг. Представленная графовая модель описывает взаимосвязь между соседними фрагментами расчетной сетки и процессом конвейерного расчета. По результатам проведенных вычислительных экспериментов получена регрессионная модель, описывающая зависимость времени расчета одного шага МПТМ на GPU, вычислены ускорение и эффективность расчетов СЛАУ с нижнетреугольной матрицей параллельно-конвейерным методом на GPU при задействовании различного количества потоковых мультипроцессоров.

Keywords: mathematical modeling, parallel algorithm, graphics accelerator.
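The three-dimensional decomposition described in the abstract (blocks along y, one per streaming multiprocessor, each cut into fragments along x and z) can be sketched as a partition of the index space. All names and sizes below are illustrative assumptions, not the authors' data structures.

```python
import numpy as np

def decompose_domain(nx, ny, nz, n_blocks, n_frag_x, n_frag_z):
    """Partition an (nx, ny, nz) grid: blocks along y, fragments
    along x and z inside each block. Returns a list of fragments,
    each given as ((x0, x1), (y0, y1), (z0, z1)) half-open ranges."""
    y_edges = np.linspace(0, ny, n_blocks + 1, dtype=int)
    x_edges = np.linspace(0, nx, n_frag_x + 1, dtype=int)
    z_edges = np.linspace(0, nz, n_frag_z + 1, dtype=int)
    fragments = []
    for b in range(n_blocks):            # one block per multiprocessor
        for i in range(n_frag_x):        # fragments along x
            for k in range(n_frag_z):    # fragments along z
                fragments.append(((x_edges[i], x_edges[i + 1]),
                                  (y_edges[b], y_edges[b + 1]),
                                  (z_edges[k], z_edges[k + 1])))
    return fragments
```

For a 6×8×4 grid split into 2 blocks with 3×2 fragments each, this yields 12 fragments that together cover every grid node exactly once.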

FOR CITATION

Sukhinov A.I., Litvinov V.N., Chistyakov A.E., Nikitina A.V., Gracheva N.N., Rudenko N.B. Solving Grid Equations Using the Alternating-triangular Method on a Graphics Accelerator // Bulletin of the South Ural State University. Series: Computational Mathematics and Software Engineering. 2023. Vol. 12, no. 2. P. 78-92. DOI: 10.14529/cmse230204.

This paper is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 License, which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is properly cited.


Alexander Ivanovich Sukhinov, Corresponding Member of the Russian Academy of Sciences, Dr. Sci. (Physics and Mathematics), Professor, Department of Mathematics and Informatics, Don State Technical University (Rostov-on-Don, Russian Federation)

Vladimir Nikolaevich Litvinov, Cand. Sci. (Engineering), Associate Professor, Department of Mathematics and Informatics, Don State Technical University (Rostov-on-Don, Russian Federation); Department of Mathematics and Bioinformatics, Azov-Black Sea Engineering Institute of Don State Agrarian University (Zernograd, Russian Federation)

Alexander Evgenievich Chistyakov, Dr. Sci. (Physics and Mathematics), Department of Computer and Automated Systems Software, Don State Technical University (Rostov-on-Don, Russian Federation)

Alla Valerievna Nikitina, Dr. Sci. (Engineering), Associate Professor, Department of Computer and Automated Systems Software, Don State Technical University (Rostov-on-Don, Russian Federation); Department of Intelligent and Multiprocessor Systems, Southern Federal University (Rostov-on-Don, Russian Federation)

Natalya Nikolaevna Gracheva, Cand. Sci. (Engineering), Associate Professor, Department of Mathematics and Bioinformatics, Azov-Black Sea Engineering Institute of Don State Agrarian University (Zernograd, Russian Federation); Department of Design and Technical Service of Transport and Technological Systems, Don State Technical University (Rostov-on-Don, Russian Federation)

Nelli Borisovna Rudenko, Cand. Sci. (Engineering), Associate Professor, Department of Mathematics and Bioinformatics, Azov-Black Sea Engineering Institute of Don State Agrarian University (Zernograd, Russian Federation); Department of Media Technologies, Don State Technical University (Rostov-on-Don, Russian Federation)
