Automatic code optimization for computing the McCaskill partition functions

In this paper, we present the application of three automatic source-to-source compilers to code implementing McCaskill's bioinformatics algorithm, which computes the probabilities of various substructures for RNA structure prediction. McCaskill's algorithm is a compute- and data-intensive dynamic programming algorithm. Code implementing it exposes non-uniform dependences that complicate its tiling. Although the code can be represented within the polyhedral model, its optimization remains a challenging task for optimizing compilers employing multi-threaded loop tiling. To generate optimized code, we used the popular PLuTo compiler, which finds and applies affine transformations; the TRACO compiler, based on calculating the transitive closure of loop dependence graphs; and the newest polyhedral tool, DAPT, implementing space-time tiling. An experimental study carried out on two multi-core machines, an AMD Epyc with 64 threads and a 2x Intel Xeon Platinum 9242 with 192 threads, demonstrates that codes generated by means of space-time tiling achieve considerable speedup, high locality, and scalability for various problem sizes and thread counts.


I. INTRODUCTION
McCASKILL'S algorithm is an efficient dynamic programming algorithm that returns the value of the partition function Z = Σ_P exp(−E(P)/RT), where P ranges over all possible nested structures formed by a given RNA sequence S, E(P) is the energy of structure P, R is the gas constant, and T represents temperature [1].
The approach computes the probability of each individual base pair in the RNA sequence. These probabilities are used for simultaneous folding and alignment in algorithms predicting an RNA structure with maximum expected accuracy (MEA) [2].
Each base pair of a structure contributes a fixed energy term E_bp, independent of its context. Under this assumption, the partition function for the sub-sequence from position i to position j is stored in table Q_{i,j}, while table Q^{bp}_{i,j} holds the values of the partition function of sub-sequences closed by a base pair, or 0 when base pairing is absent.
The following calculations are used to compute the values of the partition functions, inserted as elements of tables Q_{i,j} and Q^{bp}_{i,j}:

Q_{i,j} = Q_{i,j-1} + Σ_{k} Q_{i,k-1} · Q^{bp}_{k,j}, where k ranges over positions that may pair with j,
Q^{bp}_{i,j} = Q_{i+1,j-1} · exp(−E_bp/RT) if positions i and j can pair, and 0 otherwise.
Listing 1 presents the code computing Q_{i,j} and Q^{bp}_{i,j}; the tables are filled with random double values (the data in the arrays do not affect the speed of the code). ERT equals exp(−E_bp/RT). To simplify target tiled code generation, we replaced k with k+i so that the innermost loop runs from 0 to j − i − 1.
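The loop nest can be sketched as follows. This is a hypothetical reconstruction consistent with the description above, not Listing 1 itself; the function name, the flattened row-major indexing, and the omission of the base-pairing test are our assumptions.

```c
#include <assert.h>
#include <math.h>

/* Sketch of the McCaskill loop nest described in the text (hypothetical
   reconstruction of Listing 1). Q and Qbp are N x N tables stored
   row-major; ERT = exp(-E_bp / RT) is supplied by the caller.
   The innermost loop already uses the substituted index k -> k + i,
   running from 0 to j - i - 1. */
void mccaskill_kernel(int N, double *Q, double *Qbp, double ERT)
{
    for (int i = N - 1; i >= 0; i--)
        for (int j = i + 1; j < N; j++) {
            /* accumulate contributions of decompositions at position k+i */
            for (int k = 0; k < j - i; k++)
                Q[i * N + j] += Q[i * N + (k + i)] * Qbp[(k + i + 1) * N + j];
            /* base-pair table; a real implementation would also test
               whether positions i and j can actually pair */
            Qbp[i * N + j] = Q[(i + 1) * N + (j - 1)] * ERT;
        }
}
```

The non-uniform dependences discussed below arise from the reads Q[i][k+i] and Qbp[k+i+1][j], whose distances depend on the loop indices themselves.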
Many algorithms in bioinformatics are dynamic programming (DP) algorithms. Loop nests implementing those algorithms can be represented within the polyhedral model. That model is used in many optimizing compilers, which automatically generate efficient parallel tiled code. However, the code implementing McCaskill's algorithm exposes non-uniform dependences that hinder its effective parallelization and tiling.
A polyhedral optimizer generally improves code locality by means of loop tiling, which groups loop statement instances into smaller blocks (tiles). This allows for data reuse provided that each block fits in cache. In parallel tiled code, tiles are enumerated as indivisible macro-statements. This increases the granularity of the parallel code, which often improves its performance on parallel systems with shared memory.
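As a concrete illustration of rectangular loop tiling (a generic textbook example, not output of any of the compilers discussed; the tile size B, the array A, and the statement are placeholders), the transformation rewrites a loop nest so that iterations execute block by block:

```c
/* Untiled form:  for (i) for (j) A[i*N + j] += 1.0;
   Tiled form with B x B tiles: the two outer loops enumerate tiles,
   the two inner loops enumerate iterations within the current tile.
   The min-style bounds handle partial (boundary) tiles. */
void tiled_update(int N, int B, double *A)
{
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B && i < N; i++)
                for (int j = jj; j < jj + B && j < N; j++)
                    A[i * N + j] += 1.0; /* same instances, tile-by-tile order */
}
```

Every statement instance of the original nest is executed exactly once; only the execution order changes, which is what makes the transformation legal whenever the new order respects all dependences.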
Dynamic programming codes expose non-uniform dependences, which limit the applicability of commonly known optimization techniques such as permutation, diamond tiling [3], or index set splitting.

II. OPTIMIZING COMPILERS USED FOR EXPERIMENTS
Modern automatic optimizing compilers, for example PLuTo [5], demonstrate the success of the polyhedral model. PLuTo extracts and applies affine functions to parallelize and tile serial code. The target parallel tiled code demonstrates good efficiency on modern shared-memory multi-core computers, in particular for stencils.
For a given loop nest statement, compilers based on affine transformations use the relation t = C·I + c, where I is the loop statement iteration vector, t is the time assigned to execute iteration I, and C·I + c is an affine function defined by matrix C and vector c. When two statement instances get the same execution time, they can be run in parallel. To find the unknown matrix C and unknown vector c, first, for each loop nest statement, time-partition constraints are formed by applying extracted dependences. Then the time-partition constraints are resolved for the elements of matrix C and vector c.
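For instance (our illustrative example, not taken from the paper), consider a statement S(i, j) whose instances depend on S(i−1, j) and S(i, j−1). The time-partition constraints require the time of a dependence target to be no earlier than the time of its source:

```latex
t(I) = C \cdot I + c, \qquad
t(i,j) \ge t(i-1,j), \qquad
t(i,j) \ge t(i,j-1).
% One solution: C = (1\;\;1),\ c = 0, i.e., t = i + j.
% All instances with equal i + j receive the same execution time and
% can run in parallel: the classic wavefront schedule.
```

Finding as many linearly independent solutions as there are loops surrounding the statement is what allows full-dimensional tiling.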
The PLuTo engine is used in other compilers, for example, in Apollo [6], PPCG [7], PTile [8], and Autogen framework [9] as well as in commercial IBM-XL and R-STREAM from the Reservoir Lab [10].
TRACO is based on the slicing framework introduced in paper [11]. It calculates the transitive closure of a dependence graph, which is used to correct original rectangular tiles. The goal of the correction is to eliminate all cycles among target tiles, which allows target tiles to be enumerated in lexicographic order.
After tile correction, the inter-tile dependence graph does not contain any cycle, and any loop nest parallelization technique can be used to generate parallel code; details are presented in paper [12]. TRACO uses the commonly known wave-fronting technique for tiled code parallelization. For its implementation, it applies the ISL library.
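Wave-fronting can be sketched as follows (illustrative only; TRACO schedules corrected tiles rather than single iterations, and the dependence pattern and bounds below are our assumptions). All points on an anti-diagonal w = i + j are mutually independent and may run in parallel, while the diagonals themselves are executed serially:

```c
/* Wavefront execution of a 2-D iteration space with dependences
   (i-1,j) -> (i,j) and (i,j-1) -> (i,j). A is an N x N row-major
   array whose first row and first column are assumed initialized. */
void wavefront(int N, double *A)
{
    for (int w = 2; w <= 2 * (N - 1); w++) {          /* serial time loop   */
        int ilo = (w - (N - 1) > 1) ? w - (N - 1) : 1;
        int ihi = (w - 1 < N - 1) ? w - 1 : N - 1;
        #pragma omp parallel for                       /* points on diagonal */
        for (int i = ilo; i <= ihi; i++) {
            int j = w - i;
            A[i * N + j] = A[(i - 1) * N + j] + A[i * N + (j - 1)];
        }
    }
}
```

Without OpenMP the pragma is ignored and the code runs correctly in serial, which makes the legality of the schedule easy to check.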
PLuTo and TRACO have some limitations. PLuTo may not find the number of linearly independent solutions to time partitions constraints that is equal to the number of the loops surrounding a loop nest statement. This reduces the dimension of target tiles and as a consequence target code locality may be low. For example, it is not able to tile each loop nest for the Nussinov and Knuth algorithms [13], i.e., instead of 3D tiles it generates only 2D tiles. TRACO, in general, may generate irregular tiles that reduce code locality and worsen multiple thread work balance [14].
To generate regular code and increase tile dimension, DAPT implements space-time tiling. First, DAPT generates space tiles according to the technique presented in paper [15]. Then it splits each space tile into multiple time slices. Each time slice is represented with a number of time partitions found by means of the ISL scheduler. The number of time partitions within the time slice is defined by the user. As a result, the tile dimension is increased by one. Target code enumerates smaller tiles (time slices) within each space tile. This increases code locality because it raises the probability that all the data associated with each smaller tile fit in cache, provided that the number of time partitions forming the time slice is chosen properly.
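Schematically (a structural sketch only, not DAPT's actual output; the legality of the shown order depends on the dependences of the particular loop nest), space-time tiled code has the following shape, where B is the space tile size and P is the user-defined number of time partitions per time slice:

```c
/* Skeleton of space-time tiling over a 1-D space of size N and T time
   partitions: each B-wide space tile is split into time slices of P
   time partitions, so the tile dimension grows by one. The body here
   only counts visited statement instances. */
long space_time_skeleton(int N, int T, int B, int P)
{
    long visited = 0;
    for (int ii = 0; ii < N; ii += B)            /* space tiles          */
        for (int tt = 0; tt < T; tt += P)        /* time slices in tile  */
            for (int t = tt; t < tt + P && t < T; t++)
                for (int i = ii; i < ii + B && i < N; i++)
                    visited++;                   /* instance (t, i)      */
    return visited;
}
```

Each of the N·T statement instances is enumerated exactly once; a smaller P yields smaller time slices and hence a higher chance that each slice's data fit in cache.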
Nevertheless, each of the above-mentioned automatic source-to-source compilers is able to generate tiled code for the McCaskill algorithm, and we conducted a comparative analysis of the performance of the codes they generate on two modern multi-core machines.

III. EXPERIMENTAL STUDY
In this section, we present the results of an experimental study with PLuTo, TRACO, and DAPT codes implementing the McCaskill partition function computation. All target parallel tiled codes were compiled using the Intel C++ Compiler (icc) and the GNU C++ Compiler (g++) with the -O3 optimization flag.
The code generated with DAPT is presented in Listing 2, while the codes generated with PLuTo and TRACO can be found at https://github.com/markpal/NPDP_Bench/blob/main/mcc/mcc_dapt.cpp; they are too long to be inserted in this paper.
It is worth noting that tiles generated with TRACO are irregular; they can be fixed or parametric (the size of parametric tiles is unbounded). PLuTo generates regular fixed tiles, except for boundary ones. Space-time tiling implemented in DAPT generates regular tiles whose dimension is one more than that of the tiles generated with PLuTo.
All examined compilers use the OpenMP standard to express parallelism. By means of experiments, we discovered that for the problem sizes examined, the best tile size for the TRACO target code is 1x128x16; this means that the outermost loop in the loop nest should not be tiled. For the target code generated with PLuTo, the best tile size is 16x16x16. For the DAPT code, the optimal space tile size is 16x16x16, and the size of the time slice (the number of time partitions within a space tile) is 2.
The McCaskill code can be tiled by all the compilers used by us for experiments. However, only TRACO and DAPT allow us to generate parallel tiled code. The serial code generated with PLuTo is very cache-efficient, but PLuTo is unable to extract any affine schedule allowing for parallelism in the target code. TRACO generates target code by applying the transitive closure of the dependence graph for the McCaskill loop nest; it then builds a relation representing inter-tile dependences. Finally, using that relation, it applies the ISL scheduler to extract a valid tile schedule, which is used to generate parallel tiled code. DAPT applies the wave-front technique to generate target parallel tiled code. Tables 1 and 2 present the execution times of the examined codes for various RNA sequence lengths on an Intel Xeon Platinum 9242 (192 hardware threads) and an AMD Epyc 7542 (64 hardware threads), respectively. Figures 1 and 3 show the speed-up (the ratio of T_1 to T_n, where T_1 and T_n are the execution times for one and n threads, respectively) of the parallel tiled programs for RNA sequence sizes from 1000 to 10000 on the two machines, respectively. Figures 2 and 4 show how target code speed-up depends on the number of threads on the Intel Xeon Platinum 9242 and the AMD Epyc 7542, respectively, for N = 5000 (roughly the size of the longest human mRNA).
Analyzing the presented experimental results, we may state that the DAPT code considerably outperforms the PLuTo and TRACO ones. The TRACO code outperforms that of PLuTo for eight and more threads. The worse efficiency of the TRACO code for small thread counts is caused by the irregularity of the target code (see the previous section).

IV. CONCLUSION
We presented the results of a comparative performance analysis of three tiled codes generated with the optimizing compilers PLuTo, TRACO, and DAPT for the McCaskill partition function calculation. The DAPT code demonstrates the best performance because it applies space-time tiling, which increases the tile dimensionality by one in comparison with that of PLuTo and makes all tiles regular and of fixed size. A proper choice of tile size allows all the data associated with each tile to be held in cache, which increases code locality.