Performance Analysis of a 3D Elliptic Solver on Intel Xeon Computer System

It has been shown that block-circulant preconditioners, applied within the conjugate gradient method to solve structured sparse linear systems arising from 2D and 3D elliptic problems, have very good numerical properties and a potential for good parallel efficiency. In this contribution, hybrid parallelization based on the MPI and OpenMP standards is investigated experimentally. Specifically, the aim of this work is to analyze the parallel performance of the implemented algorithms on a supercomputer consisting of Intel Xeon processors and Intel Xeon Phi coprocessors. While the obtained results confirm the positive outlook of the proposed approach, important open issues are also identified.


I. INTRODUCTION
In this contribution, we are concerned with the numerical solution of linear boundary value problems of elliptic type. After discretization, such problems are reduced to finding the solution of a linear system of the form Ax = b. In what follows, symmetric and positive definite problems are considered. Moreover, it is assumed that A is a large matrix. Obviously, the term "large" is relative, as what was large in the past is no longer large today. Therefore, the size of the linear system (matrix) is considered large in the context of the capabilities of currently existing computers.
In practice, large problems of this class are often solved by iterative methods, such as the conjugate gradient (CG) method. At each step of such methods, a single product of A with a given vector v is needed. Therefore, to minimize the number of arithmetic operations, the sparsity of the matrix A should be exploited. On the other hand, exploiting sparsity may conflict with parallelization of the iterative process (for large numbers of processors and cores).
Typically, the rate of convergence of the CG method depends on the condition number κ(A) of the coefficient matrix A: the smaller κ(A) is, the faster the convergence. Unfortunately, for second-order elliptic problems, usually κ(A) = O(n²), where n is the number of mesh points in each coordinate direction. Hence, the conditioning of the matrix deteriorates rapidly as n grows. To accelerate the convergence of the iterative process, a preconditioner M is applied within the CG algorithm. The theory of the preconditioned CG (PCG) method says that M is a good preconditioner if it significantly reduces the condition number κ(M⁻¹A) and, at the same time, allows one to efficiently compute the product M⁻¹v for a given vector v. A third important aspect should also be considered, namely the need for an efficient implementation of the PCG algorithm on modern parallel computer systems; see, e.g., [1], [2]. Here, again, the question can be raised what "modern" means, as the practical meaning of this term evolves. Establishing how the problem should be approached on a computer current at the time of the conducted research is one of the issues that inspired this work.

II. THE 3D ELLIPTIC PROBLEM
Let us now consider the following 3D elliptic problem:

−∇ · (a(x)∇u(x)) = f(x),  x ∈ Ω = (0, 1)³,
u(x) = 0,  x ∈ ∂Ω,  (1)

to be solved on the unit cube [0, 1]³. Let the domain be discretized by a uniform grid with n grid points in each coordinate direction.

A. Finite Difference Method
Let us consider the usual seven-point centered difference approximation of problem (1). This discretization leads to a system of linear algebraic equations Ax = b, where the vector of unknowns x has size n³. If the grid points are ordered along the x₃ and x₂ directions first, the resulting matrix A admits a standard block-tridiagonal structure. Here, the diagonal blocks are themselves block-tridiagonal matrices, while the off-diagonal blocks are diagonal matrices. Overall, the matrix A can be written in the form

A = tridiag(A_{i,i−1}, A_{i,i}, A_{i,i+1}),  i = 1, . . . , n,

where the A_{i,i} are block-tridiagonal matrices, each corresponding to one x₁-plane. For details, see [3], [4], [5], [6].
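As an illustration of how the sparsity of A is exploited in the matrix-vector product, the constant-coefficient case of the seven-point stencil can be coded as follows. This is a sketch only: the paper's matrix has variable coefficients, and the function name and indexing scheme are assumptions for the example.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* y = A*x for the constant-coefficient 7-point Laplacian on an
 * n x n x n grid with zero Dirichlet boundary.  Grid point (i,j,k)
 * maps to index (i*n + j)*n + k, so x3 (k) and x2 (j) vary fastest,
 * matching the ordering described in the text. */
void stencil7_matvec(const double *x, double *y, int n) {
    for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
    for (int k = 0; k < n; k++) {
        size_t idx = ((size_t)i * n + j) * n + k;
        double v = 6.0 * x[idx];                  /* diagonal entry  */
        if (i > 0)     v -= x[idx - (size_t)n * n]; /* x1 neighbors  */
        if (i < n - 1) v -= x[idx + (size_t)n * n];
        if (j > 0)     v -= x[idx - n];             /* x2 neighbors  */
        if (j < n - 1) v -= x[idx + n];
        if (k > 0)     v -= x[idx - 1];             /* x3 neighbors  */
        if (k < n - 1) v -= x[idx + 1];
        y[idx] = v;
    }
}
```

Only the seven nonzero entries per row are touched, so one product costs O(n³) operations instead of the O(n⁶) of a dense multiply.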

III. CIRCULANT BLOCK-FACTORIZATION PRECONDITIONING
Let us recall that a circulant matrix C has entries C_{k,j} = c_{(j−k) mod m}, where m is the size of C. Moreover, for given coefficients (c₀, c₁, . . . , c_{m−1}), let us denote by C = (c₀, c₁, . . . , c_{m−1}) the circulant matrix with these entries. Any circulant matrix can be factorized as

C = F Λ F*,

where Λ is a diagonal matrix containing the eigenvalues of C, F is the Fourier matrix with entries F_{j,k} = (1/√m) e^{2πi jk/m}, and F* denotes the adjoint (conjugate transpose) of F. Here, i stands for the imaginary unit.
Let us now denote the general form of the CBF preconditioning matrix M for the matrix A by

M_CBF = tridiag(C_{i,i−1}, C_{i,i}, C_{i,i+1}),  i = 1, . . . , n.

Here, C_{i,j} = Block-Circulant(A_{i,j}) is a block-circulant approximation of the corresponding block A_{i,j} [3], [4]. Note that the approach to defining the block-circulant approximations can be interpreted as simultaneously averaging the matrix coefficients and changing the Dirichlet boundary conditions to periodic ones. Each PCG iteration involves one solution of a linear system with the preconditioner. Owing to the Fourier factorization of the circulant blocks, the solution of the linear system with M_CBF requires one forward 2D Discrete Fourier Transform (DFT), the solution of a set of independent tridiagonal linear systems, and one backward 2D DFT.
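The middle step of the preconditioner solve, the independent tridiagonal systems, is typically handled by the Thomas algorithm. The sketch below is real-valued for simplicity (the transformed systems in the actual solver of [5], [6] are complex), and the function name is an assumption for the example.

```c
#include <assert.h>
#include <math.h>

/* Thomas algorithm for one tridiagonal system.
 * a = sub-diagonal (a[0] unused), b = main diagonal, c = super-diagonal
 * (c[n-1] unused); d holds the right-hand side on input and the
 * solution on output.  b and d are overwritten.  Assumes the system
 * is diagonally dominant, so no pivoting is performed. */
void thomas_solve(const double *a, double *b, const double *c,
                  double *d, int n) {
    for (int i = 1; i < n; i++) {          /* forward elimination */
        double w = a[i] / b[i - 1];
        b[i] -= w * c[i - 1];
        d[i] -= w * d[i - 1];
    }
    d[n - 1] /= b[n - 1];
    for (int i = n - 2; i >= 0; i--)       /* back substitution */
        d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
}
```

Each system costs O(n) operations, and since the systems produced by the 2D DFT are mutually independent, they can be distributed across MPI processes and OpenMP threads with no communication.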
The details of the sequential and parallel implementations of the CBF preconditioner have been described in [5], [6], which should be consulted for further information.

IV. NUMERICAL TESTS - EXPERIMENTAL SETUP
The conducted experiments have been selected to illustrate the convergence rate, as well as the parallel performance, of the developed algorithms for 3D elliptic problems. Specifically, test problems with variable coefficients, in which ϵ ∈ [0, 1] is a parameter, have been considered. It is well known that circulant preconditioners are competitive with the incomplete LU factorization for moderately varying coefficients. This reflects the averaging of the coefficients used in the block-circulant approximations.
The right-hand side f is chosen in such a way that problem (2) has a known solution.

All computations are done in double precision. The standard stopping criterion, based on the relative norm of the residual r_j at the jth iteration step of the preconditioned conjugate gradient method, is used. The code has been implemented in C. For the implementation of the preconditioning, the Fast Fourier Transform (FFT) was used; specifically, the functions fftw_init_threads, fftw_plan_with_nthreads, fftw_plan_many_dft, and fftw_execute from the FFTW (the Fastest Fourier Transform in the West) library. A hybrid parallel code, based on the joint application of MPI-based and OpenMP-based parallelization, has been developed [7], [8], [9], [10], [11].
In this contribution, the parallel code has been tested on the Avitohol cluster at the Advanced Computing and Data Centre of the Institute of Information and Communication Technologies of the Bulgarian Academy of Sciences. Avitohol is an HP Cluster Platform SL250S GEN8 system with 150 servers; each node has two 8-core Intel Xeon E5-2650 v2 processors and two Intel Xeon Phi 7120P coprocessors. Each processor runs at 2.6 GHz, and the processors within a node share 64 GB of memory. Each Intel Xeon Phi has 61 cores running at 1.238 GHz and 16 GB of memory. The nodes are interconnected with a high-speed InfiniBand FDR network (see also http://www.hpc.acad.bg/).

PROCEEDINGS OF THE FEDCSIS, WARSAW, POLAND, 2023

V. EXPERIMENTAL RESULTS AND ANALYSIS
The first series of experiments established the "baseline" performance. In Tables I and II, results obtained using the processors of a single node of Avitohol are presented. Table I shows the memory used, the number of iterations, the maximal error of the obtained solution, and the total execution time. The results have been obtained using only shared-memory parallelism, i.e., one MPI process and up to 32 OpenMP threads.
The first observation concerns memory use. As n increases from 120 to 720, memory consumption grows from 707 MB to 52542 MB. This means that the limit on the size of the problem that can be solved on a single node has been reached. This was checked experimentally: increasing n to 840 resulted in an "out of memory" error.
Second, let us note that for n = 120 there is no performance gain with an increasing number of threads; clearly, the problem is too small. However, for n = 720 a speedup of about 5 has been reached with 16 threads. This was also the best result: moving to 32 threads decreased the speedup to approximately 3. Interestingly, almost no performance improvement was observed when moving from 1 to 2 threads.
Approaching the performance from the "completely opposite" perspective, Table II shows the execution time when using only distributed-memory parallelism, i.e., one OpenMP thread and up to 32 MPI processes. It should be noted that in the current (prototype) implementation, the algorithm works correctly only if the number of mesh points in each coordinate direction (n) is divisible by the number of processes. In the table, the best execution time for each problem size is marked in bold.
It is easy to note that, for n = 120, 240, . . . , 720, the algorithm works much faster when using only MPI parallelism with multiple processes than when using OpenMP-based multi-threading. For the largest case (n = 720), the solver using MPI with 30 processes is more than 2 times faster than the one using OpenMP with 16 threads (the fastest result from Table I). Overall, for the smallest problem (n = 120) the fastest execution is obtained when using 15 MPI processes.
For bigger problems, on the other hand, the best results have been reached when using 30 processes.
The second series of experiments concerned the use of processors from multiple nodes. Specifically, Table III shows the execution time when using only the processors of multiple nodes (from 2 to 8). Here, results for 1, 16, and 32 OpenMP threads are reported. Note that results for n = 960 are also reported. This was possible because the problem was split across at least two nodes and, therefore, fit into memory (did not generate out-of-memory errors). Again, the best execution time for each problem size is marked in bold.
The algorithm runs fastest with 16 threads in just a few cases, i.e., for n = 120 on 3 and 8 nodes, for n = 240 on 5 nodes, and for n = 960 on 2 nodes. In the remaining cases, the execution using one thread is the fastest. Considering the largest case (n = 960), for a single OpenMP thread, a speedup of about 6 can be observed when moving from 2 to 8 nodes. Moreover, when comparing with the results reported in Table II for n = 720, the best result on 8 nodes is more than 6 times faster.
In the next series of experiments, the performance of the coprocessors has been evaluated [12]. Specifically, Tables IV and V present times collected on Avitohol using only the Intel Xeon Phi coprocessors (the processors have not been used for solving the computational problem). Here, Table IV shows the execution time on one coprocessor for n = 120, 240, 360. Again, the best execution time is marked in bold.
Here, the positive effect of combining OpenMP-based and MPI-based parallelism can be observed. For the largest problem that fit in the memory of the coprocessor (n = 360), the use of 4 OpenMP threads improved the performance by more than 4 times. This could be interpreted as a case of super-linear speedup; however, delving into this point is out of the scope of this contribution.
Next, Table V presents execution times obtained when solving the problem on coprocessors only, but using up to 8 nodes. Note that, since the code is a prototype, the case of 7 nodes had to be excluded. Here, again, it was possible, for a larger number of nodes, to solve the problem for n = 960. In the table, the best execution time is marked in bold. It can be seen that the algorithm runs faster using 2 threads for n = 120 and 4 threads for n = 240, 360. In this context, it should be recalled that the memory of one coprocessor is only 16 GB. This memory limit is the reason that the code could be run only for small problem sizes. In particular, for problems with n = 480 at least 2 coprocessors were needed, while for n = 960 the code could be executed only starting from 12 coprocessors.
In the final series of experiments, processors and coprocessors have been used jointly. Specifically, Table VI shows the best execution times collected on Avitohol using the Intel Xeon processors working together with the Intel Xeon Phi coprocessors. Here, the code was executed using p_c MPI processes on the processors, each running q_c OpenMP threads, and p_m MPI processes on the coprocessors, each running q_m OpenMP threads. In each case, the optimal combination of the number of MPI processes and the number of threads, established experimentally, has been used. The memory limitations meant that experiments for n = 960 could not be run on fewer than 3 nodes. For the reasons explained above, there are no results for 7 nodes.
As can be seen, due to the memory limitations on the coprocessors stated above, the largest problem size, n = 960, required at least 3 nodes to be solved. Considering the problem of size n = 720, the use of 8 nodes turned out to be ineffective, as the solution time increased compared to the use of 6 nodes, for which an almost perfect speedup (larger than 5) was obtained. Interestingly, for all problem sizes, the use of 8 nodes resulted in performance inferior to that on 6 nodes. We do not have an explanation for this fact, other than the possibility that operations not related to the solution of the problem had to run "somewhere" and their execution interfered with the execution of the solver. For the largest problem, when comparing the performance obtained on 3 and on 6 nodes, a speedup of almost 6 was recorded. This shows that, if ample resources are provided, the proposed hybrid approach to algorithm parallelization behaves as expected and parallelizes well.
To better illustrate the relationships between the execution times, they are also visualized in Figure 1. Here, the execution time of the hybrid code, on up to 8 nodes, for

TABLE I: USED MEMORY, NUMBER OF ITERATIONS, AND TIME (IN SECONDS) FOR THE EXECUTION OF THE PARALLEL ALGORITHM ON ONE NODE USING ONE MPI PROCESS AND VARYING THE NUMBER OF THREADS.
IVAN LIRKOV ET AL.: PERFORMANCE ANALYSIS OF A 3D ELLIPTIC SOLVER ON INTEL XEON COMPUTER SYSTEM

TABLE III: TIME (IN SECONDS) FOR THE EXECUTION OF THE PARALLEL ALGORITHM ON UP TO 8 NODES.