A GPU approach to distance geometry in 1D: an implementation in C/CUDA

We present a GPU implementation in C and CUDA of a matrix-by-vector procedure tailored to a special class of distance geometry problems in dimension 1, which we name "paradoxical DGP instances". This matrix-by-vector reformulation was proposed in previous studies on an optical processor specialized for this kind of computation. Our computational experiments show a substantial speed-up when our GPU implementation is compared against a standard algorithm for distance geometry, the Branch-and-Prune algorithm. These results confirm that a suitable implementation of the matrix-by-vector procedure in the context of optical computing is very promising. We remark, however, that the total number of detected solutions grows with the instance size in our implementations, which appears to be an important limitation to an effective implementation of the optical processor.


I. INTRODUCTION
The Distance Geometry Problem (DGP) asks whether a simple weighted undirected graph G = (V, E, d) can be realized in the Euclidean space R^K, with K > 0, so that the distance constraints ∀{u, v} ∈ E, ||x_u − x_v|| = d_{u,v}, are satisfied [5]. When this is the case, we say that the mapping x : v ∈ V → x_v ∈ R^K is a valid realization of the graph G. Depending on the DGP application at hand, the vertices v ∈ V can represent different kinds of objects, for which possible positions in R^K are sought. The edge set E encodes the information about which vertex pairs have a known distance, and the numerical value of each distance is given by the associated edge weight. Notice that ||·|| is the Euclidean norm. We assume that the available distance values are exact, i.e. free of measurement error.
In this work, we focus our attention on DGPs in dimension 1. In 1979, Saxe proved that the DGP is NP-complete when K is set to 1 [10]. The main DGP application in this dimension is the clock synchronization problem: given a set of clocks (represented by the vertices v ∈ V), and a subset of offset measurements between pairs of clocks (encoded by the edges {u, v} ∈ E and the associated weights d_{u,v}), the problem asks whether it is possible to know the precise time indicated by all clocks [11]. This problem is fundamental for the synchronization of events in distributed systems [1], [12], as for example in wireless sensor networks [14].
The Branch-and-Prune (BP) algorithm was proposed in [4] for a subclass of DGP instances that admit the discretization of their search space. In the 1-dimensional case, this algorithm can be employed under the much weaker assumption that the graph G is connected [9]. In this case, in fact, a vertex order on V, ensuring that every vertex v has at least one predecessor u (except for the first vertex in the order), can be easily constructed [8]. This vertex order is indicated by the subscripts associated with the vertices in the discussion below.
We are interested in a particular subclass of DGPs in dimension 1: the class of paradoxical DGP instances [2]. These instances are represented by graphs G that are cycle graphs, for which a vertex order on the vertex set can be trivially identified. The paradoxical character of these instances is given by the two following observations. On the one hand, the construction of solutions to these instances appears to be relatively easy, because most vertices v_k ∈ V only depend on one predecessor in the vertex order, so that, for each x_{k−1}, the two new positions x_{k−1} − d_{k−1,k} and x_{k−1} + d_{k−1,k} can be easily computed for v_k. Notice that this procedure allows us to build up a binary tree collecting, on each of its layers, the possible positions for each vertex v_k. On the other hand, however, the absence of any other distance information (apart from the distance to the predecessor, d_{k−1,k}) up to the layer n, where n = |V|, makes this class of instances actually very hard. In fact, it is only at layer n that the distance between the first vertex v_1 and the last vertex v_n can be exploited to select the only two valid realizations out of a set of 2^{n−1} potential solutions [6].
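As an illustration, the layer-by-layer branching described above can be sketched in C as a recursive enumeration of the binary tree (the array layout, function name, and tolerance parameter are ours, for illustration only, and are not taken from the implementation discussed later):

```c
#include <math.h>

/* Branch over the binary tree of positions for a cycle graph:
 * x_1 = 0 and x_k = x_{k-1} +/- d[k-1], where d[k] stores d_{k,k+1}.
 * At each leaf, the only distance not used for branching, d1n = d_{1,n},
 * is verified with tolerance eps. Returns the number of valid leaves. */
static int branch(int k, int n, double x_prev, const double *d,
                  double d1n, double eps)
{
    if (k == n + 1)  /* leaf: prune on the distance between v_1 and v_n */
        return fabs(fabs(x_prev) - d1n) <= eps;
    return branch(k + 1, n, x_prev - d[k - 1], d, d1n, eps)
         + branch(k + 1, n, x_prev + d[k - 1], d, d1n, eps);
}
```

For the toy cycle with n = 3 and distances d_{1,2} = d_{2,3} = 1, d_{1,3} = 2, the call branch(2, 3, 0.0, d, 2.0, 1e-4) visits the 2^{n−1} = 4 leaves and finds exactly the two mirror-symmetric valid realizations.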
In this work, we consider the matrix-by-vector reformulation of paradoxical DGPs in dimension 1 (recently proposed in [2] for solving paradoxical DGP instances on a new optical processor) and we implement it on a GPU device. The presented computational experiments show a substantial speed-up when our GPU implementation is compared against the BP algorithm, as well as when the comparison is performed against the sequential implementation of the matrix-by-vector procedure itself. These results confirm therefore that the matrix-by-vector reformulation is promising in the context of optical computing. Our experiments also point out, however, a possible limit for the actual implementation of the optical processor.
The rest of this paper is organized as follows. In Section II, we describe the matrix-by-vector reformulation of our paradoxical DGP instances in dimension 1. In Section III, we present our GPU implementation of the matrix-by-vector multiplication, which benefits from some simplifications implied by the particular problem we aim to solve. We present and discuss our computational experiments in Section IV, and finally draw our conclusions in Section V.

II. A MATRIX-BY-VECTOR REFORMULATION
When the BP algorithm mentioned in the Introduction is employed for the solution of paradoxical DGP instances in dimension 1 [9], a binary tree containing all possible vertex positions can be recursively constructed, and the valid realizations can be selected at the very end, when positions are computed for the last vertex v_n ∈ V. Our paradoxical instances have the particularity of solely executing the branching phase of the algorithm until a leaf node of the tree is reached; it is only at this point that the pruning mechanism is invoked, where the only distance not used for branching, the distance related to the edge {1, n}, is verified. If this distance is satisfied by the current position for v_n, then the path from the tree root to the current node is a valid realization; otherwise, it is discarded.
As remarked in [2], it is possible to replace, for our paradoxical instances, the pruning phase occurring only at layer n with an additional branching phase, performed over a fictive vertex v_{n+1} that is introduced in the original graph. The fictive vertex is connected to its predecessor v_n by an edge having the same weight as the original edge {1, n}. The edge from v_1 to v_n is thereafter removed, breaking in this way the original cycle structure. The main reason for this manipulation of G is that the distance information is now equally distributed over the vertices of the graph, and the solver of paradoxical instances can perform exactly the same operation when stepping from one vertex to its successor. In order to identify the valid realizations, it is finally necessary to verify that the position computed for the fictive vertex v_{n+1} coincides with the position x_1 of the first vertex.

The introduction of the fictive vertex allows us to reformulate the paradoxical DGP in dimension 1 as a matrix-by-vector multiplication [2]. We introduce the matrix M, whose elements M_{i,j} belong to {−1, +1} and whose rows enumerate all possible sign choices made during the branching, together with the vector y, with y_j = d_{j,j+1}, which contains the distance information related to our paradoxical instance (where d_{n,n+1} is the weight of the original edge {1, n}). Thus, the vector r = M y, with components

r_i = Σ_{j=1}^{n} M_{i,j} y_j,

contains all possible positions x_{n+1} for all possible solutions. Notice that the index i varies from 1 to 2^n, whereas the index j varies from 1 to n. The feasible solutions to our paradoxical instances are the ones for which r_i = 0 (because x_1 is here implicitly set to 0).
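A minimal C sketch of this reformulation, assuming the bit-to-sign convention described in Section III (bit set means +1, bit cleared means −1; the row index i starts at 0 here), could read:

```c
#include <math.h>

/* Sum of row i of M multiplied by y: r_i = sum_j M[i][j] * y[j],
 * with the sign of M[i][j] read from bit j of the row index
 * (bit set -> +1, bit cleared -> -1). Indices are 0-based;
 * y[j] stores the distance d_{j+1,j+2}. */
static double row_sum(unsigned long i, int n, const double *y)
{
    double r = 0.0;
    for (int j = 0; j < n; j++)
        r += ((i >> j) & 1UL) ? y[j] : -y[j];
    return r;
}
```

For n = 3 and y = (1, 1, 2) (so that the fictive distance d_{3,4} = d_{1,3} = 2), the only rows with r_i = 0 are the ones encoded by the integers 3 and 4, matching the two valid realizations.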
We notice that performing the matrix-by-vector multiplication gives an answer to the original decision problem (does there exist a realization such that. . . ), but it does not directly provide the realizations, i.e. the sequences of positions on the real line for the vertices of G. In order to construct one selected valid realization, as for example the realization encoded by the i-th row of the matrix M, the value of each position x^i_k for the vertex v_k ∈ V (we added a superscript to x to specify the matrix row) can be obtained by performing the following partial sum:

x^i_k = Σ_{j=1}^{k−1} M_{i,j} y_j,

with x^i_1 = 0. The next section describes an ad-hoc GPU implementation of this matrix-by-vector multiplication.
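The partial sums can be sketched in C as follows (again with 0-based indices and the bit-to-sign convention of Section III; the function name is ours):

```c
/* Recover the realization encoded by row index i via partial sums:
 * x[0] = 0 and x[k] = x[k-1] + M[i][k-1] * y[k-1], for k = 1..n,
 * with the sign of M[i][k-1] read from bit k-1 of i. */
static void realization(unsigned long i, int n, const double *y, double *x)
{
    x[0] = 0.0;
    for (int k = 1; k <= n; k++)
        x[k] = x[k - 1] + (((i >> (k - 1)) & 1UL) ? y[k - 1] : -y[k - 1]);
}
```

For the same toy instance with y = (1, 1, 2), the row encoded by the integer 3 yields the realization x = (0, 1, 2, 0), whose last position coincides with x_1 = 0 and is therefore feasible.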

III. A GPU IMPLEMENTATION
Our GPU implementation does not perform generic matrix-by-vector multiplications. For this general problem, the reader can refer to some recent (see for example [7], [13]) and very recent (see [3]) publications on the topic. Unlike the cited papers, our implementation takes advantage of the structure of our matrix M to optimize the computations.
First of all, since the elements of our matrix M are either −1 or 1, we can trivially "move" all distance values from the vector y to the matrix, paying attention only to the sign to assign to each distance value when placed in a particular row of the matrix. We therefore define the new matrix M2, with elements (M2)_{i,j} = M_{i,j} y_j, from which the vector r can be simply computed by summing up all row elements: r_i = Σ_{j=1}^{n} (M2)_{i,j}. As a consequence, our GPU implementation only performs sums, and no products of real numbers.

Another important point in our implementation is the procedure to construct the matrix M2, and in particular the choice of the sign for each matrix element. The rule given in the definition (involving the modulus operator) is simple to understand and to apply, but it can be computationally very expensive to perform for every element of the matrix. For our implementation, we found another, more efficient, method to identify the sign of every matrix element.

Fig. 1 shows the sign distribution over the matrices M and M2: all positive elements correspond to the dark pixels, while all negative elements correspond to the light gray pixels. More than one pattern can be identified in these matrices, but one in particular turns out to be very useful for our GPU implementation. If in fact we interpret every gray pixel as a 0 (instead of a −1), whereas the dark pixels still represent 1's, then we can see every matrix row (notice that the matrix is transposed in Fig. 1) as the binary representation of the integers spanning from 0 to 2^n − 1. Moreover, if we consider the little-endian convention for the bit ordering (that is, the least significant bit is on the left side, differently from our standard convention with decimal numbers), then the integer at row i is simply the predecessor (in integer arithmetic) of the one at row i + 1, and the successor of the one at row i − 1.
If the bits of an integer s therefore encode the signs at row i, the bits of the integer s + 1 simply encode the signs at row i + 1.
In our GPU implementation, every thread is in charge of computing the sums for a subset of matrix rows. This subset forms a block of contiguous matrix rows, so that, once each thread has found its starting value for s, it simply needs to increment it by one for each subsequent row. Naturally, all row blocks are supposed to have the same size, in order to better exploit the power of the GPU device.
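The work of one such thread can be emulated in plain C as follows (the function name, the block partition, and the tolerance parameter eps are illustrative assumptions, not the actual CUDA kernel):

```c
#include <math.h>

/* Emulate one GPU thread on a contiguous block of rows [first, first+count):
 * the sign-encoding integer s starts at `first` and is incremented by one
 * per row; the thread only records a binary outcome (solution found or not),
 * as described in the text. */
static int thread_work(unsigned long first, unsigned long count,
                       int n, const double *y, double eps)
{
    int found = 0;
    for (unsigned long s = first; s < first + count; s++) {
        double r = 0.0;
        for (int j = 0; j < n; j++)   /* sum row s of M2 */
            r += ((s >> j) & 1UL) ? y[j] : -y[j];
        if (fabs(r) <= eps)
            found = 1;                /* binary flag sent back to the CPU */
    }
    return found;
}
```

On the toy instance with n = 3 and y = (1, 1, 2), a "thread" scanning rows 0 to 3 meets the feasible row s = 3 and reports 1, while one scanning only rows 0 and 1 reports 0.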
After computing every row sum, the thread verifies whether this sum is close enough to 0. If so, the thread keeps this information aside (in binary format) and sends it back to the CPU at the end of the computations. Notice that this information is binary (a valid realization was found or not), because the symmetry properties [6] of our paradoxical instances indicate that the only chance for two valid realizations to be treated by the same thread is when all matrix rows are assigned to one unique thread.

IV. COMPUTATIONAL EXPERIMENTS
This section presents some computational experiments where we compare the standard BP algorithm (see Introduction) against our matrix-by-vector procedure, executed both sequentially and in parallel. All programs were written in C, and CUDA was employed for coding the computations on GPU. The experiments were performed on a workstation equipped with an Intel(R) Xeon(R) CPU E5-2609 v3 @1.90GHz, an Nvidia GeForce GTX TITAN X graphics card, and running the Ubuntu Linux operating system. We compiled our programs with version 5.4.0 of GCC, and with version 9.0.176 of CUDA. In all experiments, our execution on GPU was launched with a thread grid comprising 64 blocks, having 512 threads each.

Table I presents some computational experiments where the BP algorithm is compared against the sequential and the parallel implementations of our matrix-by-vector procedure. We considered instances of size ranging from 20 to 32, randomly generated so as to satisfy the properties of paradoxical instances. The cardinality |V| of the vertex sets is reported in the first column of the table. We do not report the cardinality of the edge set E because it always equals |V| in our instances. The BP algorithm was run only on CPU; the matrix-by-vector procedure was run on both CPU (the sequential version) and GPU (the parallel version).
While it is expected (see Introduction and Section III) that every instance only admits two valid realizations, the second column of Table I shows that the number of solutions (#sols) found by the BP algorithm is larger, and it tends to increase with the instance size. Our interpretation of this phenomenon is that, the more the search space increases in size (exponentially with n), the greater the chance of finding a realization that is close enough to feasibility. The verification of the final distance d_{1,n} is performed with tolerance ε = 10^{−4} in all experiments, which allows us to take into consideration the possible round-off error propagation. However, the use of this tolerance seems to enlarge too much the subset of realizations for which this final distance appears to be satisfied. We remark, moreover, that a generic tuning of the value of ε that would work for all instances is naturally not possible.
The third column of our table gives the time in seconds necessary for the BP algorithm to fully explore the search space of the given instances. To measure the time, we used the standard C clock function. Recall that the search space has size 2^{n−1}.
The fourth and fifth columns of Table I show the results obtained with the sequential implementation of our matrix-by-vector procedure. While the number of found solutions does not change w.r.t. those found by BP, we observe an increase in the computational time. This was expected, because the matrix-by-vector formulation does not exploit the fact that the computations necessary for a given solution can be partially reused for neighboring solutions (i.e. those belonging to nearby matrix rows). This is the reason why the BP algorithm works differently. However, the independence of the matrix rows is an essential feature for our GPU implementation.
Finally, the last two columns of the table show the results obtained with our GPU implementation. We point out that we used the float primitive type for the distances and the positions (we did the same in the previous two C implementations, in order to obtain results as uniform as possible). We used the unsigned long type to store the values of the integer s (see Section III). Naturally, this choice limits the instance size n to 64, but it has no impact on the experiments performed for this work (the decimal representation of 2^64 is already a quite "large" integer, composed of 20 digits).
The computational time is significantly reduced, when compared to the sequential version of the matrix-by-vector procedure, as well as when the comparison is performed against the original BP algorithm. We notice, however, that the number of found solutions differs from the other implementations for the instances of larger size. This discrepancy is due to the way the GPU threads communicate their results to the CPU: this information is in fact binary (solution found / not found), because at most one solution per thread was initially expected to be identified. Apparently, during the computations, more than one solution was instead (wrongly) detected by the same thread, thus leading to a smaller final solution count.
The reader may wonder why we have decided not to fix this "issue" in our CUDA implementation. Since our work is motivated by the optical processor mentioned in the Introduction, which is supposed to perform the calculations (although in an analog fashion) in a similar way, it seems important to us to point out this drawback of the implementation. This is actually a quite important limitation for the physical implementation of the optical processor.

V. CONCLUSIONS
We have presented a GPU implementation of a matrix-by-vector procedure particularly tailored to the solution of paradoxical DGP instances in dimension 1. The idea of reformulating this problem as a matrix-by-vector multiplication comes from previous studies on an optical processor specialized for this class of problems.
Our computational experiments show that our GPU implementation is already able to take advantage of the matrix-by-vector reformulation. On our randomly generated paradoxical instances, the pattern shown by our table of experiments is very regular: the GPU implementation is about 16 times faster than the standard BP algorithm. Naturally, much better speed-ups have been achieved on GPUs; in our case, however, the reformulation transforms the problem into a harder one (even if the complexity class remains the same). Nevertheless, the GPU implementation is still able "to do better" than the standard sequential algorithm.
In view of an implementation of the optical processor in [2], we remark that it is likely to suffer from the same effect of an increased number of detected solutions that we have observed in our computational experiments. This opens new challenges for the conception and development of this kind of alternative computing device.