A New C++ Implementation of the Gauss-Jordan Method
Dima Mihai-Tiberiu,
senior laboratory assistant, Laboratory of Information Technologies, Joint Institute for Nuclear Research, mtdima@jinr.ru
Korenkov Vladimir Vasilievich,
Doctor of Technical Sciences, chief researcher, Laboratory of Information Technologies, Joint Institute for Nuclear Research
Dima Svetlana Nikolaevna,
student, Faculty of General Medicine, Privolzhsky Research Medical University, svetlanadima7@gmail.com
We present the performance of an OMP-5.1 parallelised templated C++ implementation of the Gauss-Jordan (full pivot) method for solving linear equations, compared to the non-parallelised implementation. Our results indicate a complexity of O(N^2.82) for double scalars and O(N^2.41) for cpx<double> scalars (a complex type in double precision) over a range of up to N = 10000. This translates into an advantage of over x5 for the former and x229 for the latter in the upper part of the range, versus the non-parallelised algorithm. Another important result of our study was the determination of N_crossover, at which it becomes advantageous to parallelise the code (N = 100).
Keywords: Gauss-Jordan, C++, numerical methods, algorithm, matrix inversion
1. Introduction
The method of elimination (as it was initially called by Sylvestre Lacroix (1765-1843)) arose when Adrien-Marie Legendre [1] and Gauss [2] devised what Legendre named the method of least squares ("méthode des moindres quarrés", in modern spelling "carrés").
The basic idea of the Gauss-Jordan method for solving a system of linear equations is to add or subtract linear combinations of the given equations until each equation contains only one of the unknowns, thus giving an immediate solution.
For solving sets of linear equations, Gauss-Jordan elimination produces both the solution of the equations for one or more right-hand side vectors b, and also the matrix inverse.
The method's principal strength is that it is as stable as any other direct method, perhaps even a bit more stable when full pivoting is used.
For inverting a matrix, Gauss-Jordan elimination is about as efficient as any other direct method.
2. Templated C++ implementation of the Gauss-Jordan method
The code we designed uses two auxiliary functions, as follows:
- _LNS_norm(T&) .... which for arithmetic types defaults to the object itself (or its negation, so that the result is positive), while for abstract types it is their fabs() function (see the sketch after this list);
- _gmx(T***, int, int&, int&, unsigned long int) .... which finds the maximal pivot of a given (sub-)matrix; for higher dimensions it decides whether to execute vectorised, or vectorised and parallelised.
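As an illustration of the first helper, here is a minimal C++14 sketch; this is not the authors' code: the name norm_of, the return conventions and the member fabs() call are our assumptions based on the description above.

#include <cmath>
#include <type_traits>

// Arithmetic scalars: return the value itself, made positive.
template <typename T>
typename std::enable_if<std::is_arithmetic<T>::value, T>::type
norm_of(T& x)
{
    return x < T(0) ? -x : x;
}

// Abstract scalars (e.g. a complex type): delegate to the object's own fabs().
template <typename T>
typename std::enable_if<!std::is_arithmetic<T>::value, double>::type
norm_of(T& x)
{
    return x.fabs();   // assumed member returning the magnitude, per the text
}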
2.1. Linear algorithm
The data is presented to the algorithm as type T***, so that operations such as row and column swaps are carried out on pointers rather than on the scalars themselves, which can be data-intensive.
Depending on the system dimension, the code decides whether to execute the vectorised version only, or the vectorised and OMP-5.1 parallelised version.
It then proceeds to find a pivot, calling _gmx() for the maximal available element, swaps rows and columns to bring it to position (1,1), and records the swaps (which will have to be undone in reverse order once the system is upper triangular).
After this the code nulls the first column below position (1,1) and computes the adjusted rows.
It then finds a pivot in the remaining sub-matrix, excluding row and column 1, and essentially performs the same operations as before.
When the last element is reached, the system is in upper-triangular form and the solution can be calculated starting with the last row.
The result is then used in the second-to-last row to obtain the unknown and so on up to the first line.
We need to remember that the unknowns are in swapped order at this point and they have to be re-arranged as recorded in the ledger.
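To make the procedure concrete, the following compact sketch follows the steps described above: full-pivot search, row and column swaps with a ledger, reduction to upper-triangular form, back substitution, and un-scrambling of the unknowns. It is our own illustration for plain double scalars on a dense std::vector matrix, not the authors' pointer-based T*** code.

#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

std::vector<double> solve_full_pivot(std::vector<std::vector<double> > a,
                                     std::vector<double> b)
{
    const std::size_t n = a.size();
    std::vector<std::size_t> colperm(n);             // ledger of column swaps
    for (std::size_t i = 0; i < n; ++i) colperm[i] = i;

    for (std::size_t k = 0; k < n; ++k) {
        // find the maximal pivot of the remaining sub-matrix (the role of _gmx)
        std::size_t pr = k, pc = k;
        for (std::size_t i = k; i < n; ++i)
            for (std::size_t j = k; j < n; ++j)
                if (std::fabs(a[i][j]) > std::fabs(a[pr][pc])) { pr = i; pc = j; }

        std::swap(a[k], a[pr]);                      // bring the pivot to (k,k):
        std::swap(b[k], b[pr]);                      // swap rows ...
        for (std::size_t i = 0; i < n; ++i)          // ... and columns,
            std::swap(a[i][k], a[i][pc]);
        std::swap(colperm[k], colperm[pc]);          // recording the swap in the ledger

        for (std::size_t i = k + 1; i < n; ++i) {    // null the column below (k,k)
            const double f = a[i][k] / a[k][k];      // (assumes a non-singular system)
            for (std::size_t j = k; j < n; ++j) a[i][j] -= f * a[k][j];
            b[i] -= f * b[k];
        }
    }

    std::vector<double> y(n);                        // back substitution, last row first
    for (std::size_t k = n; k-- > 0; ) {
        double s = b[k];
        for (std::size_t j = k + 1; j < n; ++j) s -= a[k][j] * y[j];
        y[k] = s / a[k][k];
    }

    std::vector<double> x(n);                        // undo the recorded column swaps
    for (std::size_t k = 0; k < n; ++k) x[colperm[k]] = y[k];
    return x;
}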
2.2. Parallelised OMP-5.1 algorithm
The parallelised version of the algorithm, with OMP-5.1 [3], adjusts the rows in parallel during the first part of the procedure, when they are numerous.
Towards the end there are few rows left to adjust, and the parallel version becomes unprofitable due to the comparatively large time needed to fork threads.
To obtain the maximal number of available processors, the following calls are issued:
omp_set_dynamic(0);
omp_set_num_threads(omp_get_max_threads());
The first call prevents the runtime from dynamically adjusting the number of threads, while the second sets it to the maximum available on the system.
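A self-contained check of this setup (our own illustration, not part of the presented solver) prints the number of threads actually spawned inside a parallel region:

#include <cstdio>
#include <omp.h>

int main()
{
    omp_set_dynamic(0);                              // forbid dynamic adjustment
    omp_set_num_threads(omp_get_max_threads());      // request the maximum

    #pragma omp parallel
    {
        #pragma omp single
        std::printf("running with %d threads\n", omp_get_num_threads());
    }
    return 0;
}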
The compiler directive for the parallel section reads:
#pragma omp parallel for simd schedule(guided,32) firstprivate(by,u,k,B) private(bb,bx,tmx,tmz,max,may,ii,jj)
Here (by,u,k,B) are initialised in every thread to their single-thread values, while (bb,bx,tmx,tmz,max,may,ii,jj) are separate (independent) copies for each thread.
The scheduling is set to "guided", which (per Intel [4]) "works best with small chunk sizes as their limit; this gives the most flexibility. It's not clear why they get worse at bigger chunk sizes, but they can take too long when limited to large chunk sizes"; hence a chunk size of 32 was used.
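For illustration, this is how such a directive sits on the row-adjustment loop of one elimination step; the function and variable names below are ours, not the authors', and the sketch only mirrors the structure described above.

#include <vector>

// Adjust all rows below pivot row k; rows are distributed over threads in
// guided chunks of 32, and each chunk is a candidate for SIMD execution.
void eliminate_below_pivot(std::vector<std::vector<double> >& a,
                           std::vector<double>& b, int k)
{
    const int n = static_cast<int>(a.size());
    double piv = a[k][k];

    #pragma omp parallel for simd schedule(guided, 32) firstprivate(k, piv) shared(a, b)
    for (int i = k + 1; i < n; ++i) {
        double f = a[i][k] / piv;        // per-row multiplier (thread-local)
        for (int j = k; j < n; ++j)      // adjust the row
            a[i][j] -= f * a[k][j];
        b[i] -= f * b[k];
    }
}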
2.3. SIMD vectorisation
The SIMD vectorisation [5] is simply declared in the code (an illustration follows the list below); the only compiler directives are in the makefile:
-floop-block \
-ftree-loop-distribution \
-floop-parallelize-all \
-ftree-parallelize-loops=4 \
-ftree-vectorize \
-mavx2 \
respectively:
- try to make loop blocks fit the architecture's cache sizes;
- try to vectorise loops, such that independent instructions are run on different vectors;
- allow global scope (enabling OpenMP globally makes little sense: if a package does not use OpenMP there is no way to enable it automatically);
- the value 4 of gcc's loop parallelisation option is part of the black magic of gcc features still in development;
- the vectorisation flag truly pertains to SIMD; and
- AVX2 is the extended instruction set for parallelisation (SSE3 could also be an alternative).
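At the source level, declaring the vectorisation can be as simple as marking the innermost row update as a SIMD loop; the following is our own minimal illustration (the function and its signature are assumptions), which the flags above then allow the compiler to turn into AVX2 code:

// Inner row update written so the compiler can vectorise it (SIMD).
void axpy_row(double* row, const double* pivot_row, double f, int j_from, int n)
{
    #pragma omp simd
    for (int j = j_from; j < n; ++j)
        row[j] -= f * pivot_row[j];
}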
Another option, depending on the available hardware resources, is to use MPI and distribute the workload over different nodes of a computing cluster.
MPI parallelisation would complete the set of parallelisation measures.
2.4. Usage with double scalars
The performance of the code was tested on double-precision scalars (figure 1).
The linear algorithm version is depicted in light-blue, with the expected n^3 complexity. The multiplicative factor F is 1/132, which is fairly small.
The parallelised version, depicted in red, displays an apparent complexity of n^2.82.
This can be due to the fact that up to n = 10000, the system has not exhausted its thread-resource and is compensating part of the exponent.
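For orientation, assuming comparable multiplicative prefactors, the gain implied by the fitted exponents at the top of the range is roughly N^(3 - 2.82) = 10000^0.18 ≈ 5, consistent with the x5 advantage quoted in the abstract.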
The crossover between the linear and parallel version of the algorithm is n = 200.
However, due to other processes running on the system, thread-allocating processes may get a NICE -7 priority instead of -5, for instance.
This suggests that the practical crossover should be in the range n = 700 ... 1500.
Figure 1 depicts the crossover point.
[Figure 1 plot: solver run time versus system order on log-log axes (order 1 ... 10000). Legend: parallel OpenMP-4.0 (red), single -O2 (light-blue). In-plot annotations: "LIN eq. SOLVER, Gauss-Jordan full pivot", "complexity 2: F = 1/132", "complexity 3: F = 1/220"; test system: Haswell, 8 core, 2.39 GHz, 4 MB cache, Linux-2.6.32-504.12.2.el6.x86_64, gcc-4.4.7 20120313, C++14.]
Figure 1. Performance of the code in linear (light-blue) and parallelised (red) version for double scalars. Although the algorithm complexity is n^3, for limited order (up to n = 10000) the system responds by allocating more threads, hence an effective complexity of n^2.82 is observed
2.5. Usage with cpx scalars
The performance of the code was tested also on cpx<double> scalars (figure 2).
The Linux command and example output are shown below.
Note that the actual clocked user time is substantially greater than the real time, showing the kick-in of the parallelised version of the code.
cpx<double> [1500 x 1500] system:
LNS > time make run
real    0m13.002s
user    1m36.468s   (user/real ≈ 7.4 threads on average)
sys     0m0.436s
[Figure 2 plot: solver run time versus system order on log-log axes. Legend: parallel OpenMP-4.0 (red), single -O2 (light-blue). In-plot annotations: "LIN eq. SOLVER, Gauss-Jordan full pivot", "complexity 3: F = 1/74"; test system: Haswell, 8 core, 2.39 GHz, 4 MB cache, Linux-2.6.32-504.12.2.el6.x86_64, gcc-4.4.7 20120313, C++14.]
Figure 2. Performance of the code in linear (light-blue) and parallelised (red) version for cpx scalars. Although the algorithm complexity is n^3, for limited order (up to n = 10000) the system responds by allocating more threads, hence an effective complexity of n^2.49 is observed
The ratio of user time to real time gives an estimate of the average number of threads used: in this case 7.4 threads (on an 8-core Haswell machine).
The linear algorithm version is depicted in figure 2 in light-blue, with the expected n^3 complexity. The multiplicative factor F is 1/74, which is reasonably small.
The parallelised version, depicted in red, displays an apparent complexity of n^2.49. This can be due to the fact that up to n = 10000 the system has not exhausted its thread resource and is compensating part of the exponent.
The crossover between the linear and parallel versions of the algorithm is n = 500; however, due to other processes running on the system, thread-allocating processes may get a NICE -7 priority instead of -5, for instance. This suggests that the practical crossover should be in the range n = 700 ... 1500.
3. Discussion
It can be observed that the processor is optimised (cache-wise) for double scalars.
The introduction of a new scalar, cpx, limited the performance of the linear algorithm in the low-order region. Also, the apparent complexity is lower, as the system compensates the increased workload better (before saturation) by spinning up new threads.
This implies that the advertised performance of processors is significantly better than what they are capable of in real-life, independent workloads.
References
1. Legendre A. Nouvelles méthodes pour la détermination des orbites des comètes. Paris: Courcier, 1805.
2. Gauss C. F. Theoria motus corporum coelestium in sectionibus conicis solem ambientium. FA Perthes, 1877. Vol. 7.
3. Costa J. J. et al. Running OpenMP applications efficiently on an everything-shared SDSM // 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings. IEEE, 2004. P. 35.
4. Modern Code // Intel. URL: https://software.intel.com/en-us/articles/performance-obstacles-for-threading-how-do-they-affect-openmp-code (accessed: 15.12.2018).
5. Patterson D., Hennessy J. L. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann, 1998.