Resilient s-ACD for Asynchronous Collaborative Solutions of Systems of Linear Equations

Abstract—Solving systems of linear equations is a critical component of nearly all scientific computing methods. Traditional algorithms that rely on synchronization become prohibitively expensive in computing paradigms where communication is costly, such as heterogeneous hardware, edge computing, and unreliable environments. In this paper, we introduce an s-step Approximate Conjugate Directions (s-ACD) method and develop resiliency measures that can address a variety of different data error scenarios. This method leverages a Conjugate Gradient (CG) approach locally while using Conjugate Directions (CD) globally to achieve asynchronicity. We demonstrate with numerical experiments that s-ACD admits scaling with respect to the condition number that is comparable with CG on the tested 2D Poisson problem. Furthermore, through the addition of resiliency measures, our method is able to cope with data errors, allowing it to be used effectively in unreliable environments.


I. INTRODUCTION
Solving a system of linear equations Ax = b is a critical kernel in many applications and has been studied in great detail, both for iterative [1], [2] and direct [3], [4], [5], [6] solvers. However, even iterative methods such as Krylov subspace methods, which have reduced serialization [7], require global synchronization. One of the most popular of these methods, the Conjugate Gradient (CG) method, computes global inner products at each iteration [8], [9]. The burden of this synchronization cost is growing in modern computing environments for two reasons: 1) as the number of parallel processes rapidly increases, so does the cost of global synchronization; 2) new environments are being considered for computationally expensive tasks, e.g., distributed (drones, power grid) and heterogeneous (accelerators) computing. Due to these factors, there is a critical need for solvers that do not rely on global synchronization.

II. BACKGROUND

First, let us consider the Krylov subspace methods, a class of iterative methods in which an initial guess of the solution to Ax = b is updated by iteratively building up a Krylov subspace span{b, Ab, A^2 b, ...}. For symmetric positive definite (SPD) matrices, the CG method, a particular Krylov solver, constructs a series of direction vectors p^κ that are A-conjugate to each other, as well as residual vectors r^κ that span the same Krylov subspace at iteration κ [8], [9]. Asymptotically, this method converges to a given tolerance in O(√cond(A)) iterations for a problem with condition number cond(A), making it the solver of choice for SPD matrices. However, global communication is needed to ensure the orthogonality necessary for the method to be robust.
To reduce the cost of global communication, methods such as the communication-avoiding s-step algorithms [11], [12] and the communication-hiding pipelined methods [13], [14] have been proposed in the synchronous case [15]. To address the issue of unreliable computing environments, some fault-tolerant or resilient CG methods are also available [16], [17], [18]. However, both of these classes of methods still require a high level of global synchronization for orthogonality to be preserved. In sufficiently distributed environments, these costs may become too restrictive, leading to the need for methods that do not require global synchronization and exact orthogonality.
For the solution of linear systems, chaotic or asynchronous methods [19], [20], [21] such as asynchronous Jacobi [22], [23], [24] have been developed to provide asynchronicity to already existing solvers. Although resiliency has been added to asynchronous Jacobi [25] to make it more fault-tolerant, its iteration count scales proportionally to the condition number, driving the need for more powerful asynchronous linear solvers.

A. Skywing
Edge computing, in which many small devices exist and work together in an unstructured setting, is a rapidly growing field in computing. Edge computing applications pose a unique set of challenges:
1) Both the physical and cyber environments are highly unreliable, as devices are placed in uncontrolled locations, e.g., homes or along power lines. As such, they can readily and unexpectedly break, get unplugged, or become compromised by cyberattacks.
2) The collection of participating devices is often quite heterogeneous, with a range of vendors and device capabilities.
3) The computational workflows are frequently streaming workflows that continually monitor and respond to some need, rather than being a single computational task that terminates upon completion.
While traditional parallel computing paradigms, such as HPC or database computing, each share some of these challenges, the combination is unique to edge computing. This paper details a new method in the collaborative autonomy paradigm, a class of methods in which multiple computational units work independently of each other but towards a common goal. By adapting to the unreliability present in the environment, these methods can provide reliable computing in unreliable environments.
Existing software platforms like Apache Hadoop [26] and Apache Spark [27] are designed for large-scale, "big data" computing, but they largely implement leader-follower patterns and perform computation in batches. These approaches, while effective in controlled cluster environments, lack the resilience necessary to withstand common faults in edge computing applications, such as hardware faults and, increasingly, cyber intrusions. Other parallel computing frameworks, such as OpenMP and MPI, do not necessarily rely on leader-follower paradigms, but are more naturally designed for well-controlled environments and terminating computational tasks.
Skywing is a software platform developed at Lawrence Livermore National Lab that follows a publish/subscribe paradigm. Any agent involved in the computation can publish to or subscribe to a stream of data, and any data on a stream an agent is subscribed to is received by that agent, agnostic of its source. Because of this unstructured nature, the paradigm enables increased flexibility, particularly for consensus-based methods. Skywing aims to provide method composition to enable a modular approach, allowing users to utilize appropriate levels of resiliency for each module. The source code of Skywing is available on GitHub [10].

A. Problem Statement
Consider solving the linear system Ax = b for x ∈ R^m, where A ∈ R^{m×m} and b ∈ R^m. Assume the linear system is distributed across N agents according to a non-overlapping partition. Denote by A_i ∈ R^{m_i×m}, m_i < m, the block of rows of the matrix A that is stored on agent i, i = 1, ..., N, and by b_i ∈ R^{m_i} the corresponding block of b, so that

    A = [A_1; A_2; ...; A_N],   b = [b_1; b_2; ...; b_N]   (rows stacked).

This paper establishes an asynchronous iterative method for solving Ax = b, in which agent i computes successive approximations to the solution vector x, denoted x^0, x^1, etc.
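To make the row-block partition concrete, the following sketch (a NumPy illustration, not the paper's Skywing implementation) splits a small SPD system into contiguous, non-overlapping row blocks, one per agent; the helper name partition_rows and the four-agent example are ours.

```python
# A minimal sketch of the row-block partition: agent i stores the rows A_i
# of A and the corresponding right-hand-side block b_i.
import numpy as np

def partition_rows(A, b, N):
    """Split A (m x m) and b (m,) into N contiguous, non-overlapping row blocks."""
    m = A.shape[0]
    idx_blocks = np.array_split(np.arange(m), N)   # handles m not divisible by N
    A_blocks = [A[idx, :] for idx in idx_blocks]   # A_i in R^{m_i x m}
    b_blocks = [b[idx] for idx in idx_blocks]      # b_i in R^{m_i}
    return idx_blocks, A_blocks, b_blocks

# Example with a small SPD matrix distributed over 4 agents.
m, N = 8, 4
rng = np.random.default_rng(0)
M = rng.standard_normal((m, m))
A = M @ M.T + m * np.eye(m)        # SPD by construction
b = np.ones(m)
idx, A_i, b_i = partition_rows(A, b, N)
```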

B. Conjugate Directions Algorithm
Due to the high communication costs of edge computing environments, the classical Conjugate Gradient (CG) method cannot be used directly, since computing orthogonal direction vectors requires significant synchronization. However, we can utilize a variant, the Conjugate Directions (CD) method, obtained by relaxing the global orthogonality constraints.

The CD method, introduced by Hestenes and Stiefel in [9], is a generalization of the classical CG method. It solves the problem iteratively by computing a sequence of conjugate direction vectors. CG defines the new search direction based on a residual vector and the previously computed search directions, while CD uses only the previous search directions. In this section, a short introduction to the CD method is given. For more details, see Hestenes and Stiefel [9]. Denote a vector z ∈ R^m at iteration κ as z^κ. Let x^0 be an initial guess to the solution, then set the initial residual r^0 = b − Ax^0 ∈ R^m and select an arbitrary initial direction p^0 ∈ R^m. At each iteration κ = 0, 1, ..., the new solution approximation and the residual are computed as

    α^κ = ⟨p^κ, r^κ⟩ / ⟨p^κ, A p^κ⟩,
    x^{κ+1} = x^κ + α^κ p^κ,
    r^{κ+1} = r^κ − α^κ A p^κ.

A new direction vector p^{κ+1} is chosen such that ⟨p^{κ+1}, p^ι⟩_A = 0 for ι = 0, ..., κ.
In the special case of the Conjugate Gradient (CG) method, we initialize the first direction vector as p^0 = r^0 and compute the subsequent direction vectors using a three-term recurrence relation, i.e.,

    p^{κ+1} = r^{κ+1} + β^κ p^κ,   β^κ = ⟨r^{κ+1}, r^{κ+1}⟩ / ⟨r^κ, r^κ⟩ = −⟨r^{κ+1}, A p^κ⟩ / ⟨p^κ, A p^κ⟩.

The second formulation for β^κ is the coefficient used to orthogonalize the new residual vector r^{κ+1} against the prior direction vector p^κ using Gram-Schmidt orthogonalization in the A-inner product. In other words, p^{κ+1} is computed by A-orthogonalizing the new residual vector r^{κ+1} against the prior direction vector p^κ. Our method applies the Conjugate Directions (CD) method globally while allowing each device to perform Conjugate Gradient (CG) steps locally. This approach achieves improved scaling compared to asynchronous Jacobi (for which some convergence theory is presented in [23]) without requiring synchronization at each iteration as CG does.
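As an illustration of the two equivalent formulations of β^κ, the following sketch (our own minimal NumPy example, assuming exact arithmetic) forms the next CG direction by A-orthogonalizing the new residual against the prior direction; in floating point the two coefficients agree only approximately.

```python
# Minimal sketch of the CG direction update described above; the function
# name is an illustrative assumption.
import numpy as np

def next_direction(A, r_new, r_old, p_old):
    Ap = A @ p_old
    beta_cg = (r_new @ r_new) / (r_old @ r_old)      # standard CG coefficient
    beta_gs = -(r_new @ Ap) / (p_old @ Ap)           # Gram-Schmidt in the A-inner product
    # In exact arithmetic the two coefficients coincide; either can be used.
    return r_new + beta_gs * p_old, beta_cg, beta_gs
```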

C. Asynchronous s-Approximate Conjugate Directions (s-ACD)
Within the framework of the Conjugate Directions (CD) method, our objective is to design a fully asynchronous method. First, we introduce the following notation. Let z^{ψ(i,j,κ)} ∈ R^m denote the local copy of the vector z from agent j received by agent i at iteration κ. Let z_i ∈ R^{m_i} denote the subvector of z corresponding to the block of elements that agent i is approximating. Each agent has access to its local portion A_i of the matrix A and the full right-hand-side vector b, and maintains a set of local variables: a local residual vector r^κ ∈ R^m, a local solution vector x^κ ∈ R^m, and a local direction vector p^κ ∈ R^m for each iteration κ. In this context, κ = 1, 2, ... represents the local iteration count of agent i. It is important to emphasize that the iteration count may vary between agents due to the asynchronous nature. Therefore, the agents may be at different stages of the iterative process at any given time.
Each agent initializes its local vectors for κ = 0 as

    x^0 = 0,   r^0 = b − A x^0 = b,   p^0 = r^0.

Then, each agent asynchronously advances from local iteration κ to κ + 1 using the following steps:
1) Compute the local matrix-vector product w^κ := A_i^T p_i^κ, where w^κ ∈ R^m and p_i^κ ∈ R^{m_i} is the subvector of p^κ corresponding to the block of elements that agent i is responsible for.
2) Asynchronously send the vector w^κ from step 1 and the vector p_i^κ to the other agents.
3) Receive the available updates

    w^{ψ(i,j,κ)}, p_j^{ψ(i,j,κ)}   for j ∈ U_i^κ,

from the other agents, where U_i^κ is the set of updates, i.e., j ∈ U_i^κ if and only if agent i received an update from agent j during its local iteration κ.
Note that for the restart mechanism introduced later, r^κ and x^κ are also sent and received during these communications. The non-blocking communication allows agents to send and receive information in an asynchronous fashion, which enables parallelism while avoiding the need for synchronization at every iteration.
Once the updates have been received, each agent i assembles the asynchronous direction vector p̃^κ ∈ R^m blockwise according to the partition:

    p̃_j^κ = p_j^{ψ(i,j,κ)},   j = 1, ..., N,

where p_j^{ψ(i,j,κ)} represents the partial local direction vector agent i received from agent j at iteration κ (for the local block, p̃_i^κ = p_i^κ). Since the matrix A is SPD, the block of columns of A corresponding to agent j equals A_j^T. Thus, the exact A matrix-vector product of the asynchronous direction vector, denoted w̃^κ := A p̃^κ, can be computed using only the received and local partial matrix-vector products:

    w̃^κ = A p̃^κ = Σ_{j=1}^{N} A_j^T p̃_j^κ = Σ_{j=1}^{N} w^{ψ(i,j,κ)}.
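The blockwise assembly and the reuse of the partial products w^{ψ(i,j,κ)} can be sketched as follows (an illustration only; the block-index bookkeeping is an assumption about how the partition is represented):

```python
# Minimal sketch: assemble the asynchronous direction p_tilde from the most
# recently received blocks p_j, and form w_tilde = A @ p_tilde by summing the
# partial products w_j = A_j^T p_j, with no additional matrix-vector product.
import numpy as np

def assemble_async_direction(idx_blocks, p_blocks, w_partials):
    """idx_blocks[j]: row indices owned by agent j
       p_blocks[j]:   block p_j (length m_j) most recently received from agent j
       w_partials[j]: partial product w_j = A_j^T p_j (length m)"""
    m = sum(len(idx) for idx in idx_blocks)
    p_tilde = np.empty(m)
    for idx, p_j in zip(idx_blocks, p_blocks):
        p_tilde[idx] = p_j
    w_tilde = np.sum(w_partials, axis=0)   # equals A @ p_tilde since A is symmetric
    return p_tilde, w_tilde
```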

We use the asynchronous direction vector w̃^κ to construct the s-conjugate direction vector, denoted d^κ. It is essential that d^κ be A-conjugate to the s prior s-conjugate direction vectors d^{κ−s}, ..., d^{κ−1}. To achieve this, we can employ a method such as Gram-Schmidt orthogonalization. In order to ensure conjugacy with prior direction vectors, additional storage of the prior s conjugate direction vectors, {d^{κ−ℓ}}_{ℓ=1}^{s}, and their A-products, {v^{κ−ℓ}}_{ℓ=1}^{s}, is necessary. These vectors are defined recursively, with d^0 = p̃^0, v^0 = w̃^0, and for κ > 0,

    d^κ = p̃^κ − Σ_{ℓ=1}^{min(s,κ)} GS_A(p̃^κ, d^{κ−ℓ}) d^{κ−ℓ},
    v^κ = w̃^κ − Σ_{ℓ=1}^{min(s,κ)} GS_A(p̃^κ, d^{κ−ℓ}) v^{κ−ℓ},          (13)

where GS_A(p̃^κ, d^{κ−ℓ}) is the magnitude of the projection of the vector p̃^κ onto the vector d^{κ−ℓ} under the A-inner product, i.e.,

    GS_A(p̃^κ, d^{κ−ℓ}) = ⟨p̃^κ, d^{κ−ℓ}⟩_A / ⟨d^{κ−ℓ}, d^{κ−ℓ}⟩_A = ⟨p̃^κ, v^{κ−ℓ}⟩ / ⟨d^{κ−ℓ}, v^{κ−ℓ}⟩.

Note that the exact matrix-vector product v^κ = A d^κ is ensured by the definition w̃^κ := A p̃^κ. In the following theorem, we prove that d^κ is A-conjugate to the prior s conjugate direction vectors {d^{κ−ℓ}}_{ℓ=1}^{min(s,κ)}.

Theorem 1. Let d^κ, v^κ be defined as in (13). Then for ℓ = 1, ..., min(s, κ),

    ⟨d^κ, d^{κ−ℓ}⟩_A = 0.

Proof. We proceed by induction on the iteration number κ. For κ = 0, the statement holds trivially. Suppose that the statement holds true for κ = 0, ..., ι − 1. We now show the statement holds true for κ = ι. By the definition of d^ι,

    ⟨d^ι, d^{ι−ℓ}⟩_A = ⟨p̃^ι, d^{ι−ℓ}⟩_A − Σ_{ℓ'=1}^{min(s,ι)} GS_A(p̃^ι, d^{ι−ℓ'}) ⟨d^{ι−ℓ'}, d^{ι−ℓ}⟩_A.

By the induction hypothesis, ⟨d^{ι−ℓ'}, d^{ι−ℓ}⟩_A = 0 for ℓ' ≠ ℓ, so only the term ℓ' = ℓ remains and

    ⟨d^ι, d^{ι−ℓ}⟩_A = ⟨p̃^ι, d^{ι−ℓ}⟩_A − GS_A(p̃^ι, d^{ι−ℓ}) ⟨d^{ι−ℓ}, d^{ι−ℓ}⟩_A = 0.

Thus at iteration κ = ι, we have ⟨d^κ, d^{κ−ℓ}⟩_A = 0 for ℓ = 1, ..., min(s, κ).
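A minimal sketch of the construction in (13), under the assumption that the prior directions and their A-products are stored in two Python lists, might look as follows:

```python
# Minimal sketch of the s-step Gram-Schmidt A-orthogonalization in (13):
# d is built from p_tilde, and v = A d is maintained exactly by applying the
# same combination to the stored A-products, so no extra matvec is needed.
import numpy as np

def s_conjugate_direction(p_tilde, w_tilde, d_hist, v_hist, s):
    d = p_tilde.copy()
    v = w_tilde.copy()
    for d_prev, v_prev in zip(d_hist[-s:], v_hist[-s:]):
        coeff = (p_tilde @ v_prev) / (d_prev @ v_prev)   # GS_A(p_tilde, d_prev)
        d -= coeff * d_prev
        v -= coeff * v_prev
    return d, v
```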
Using this definition of d^κ, we can proceed in a manner similar to the Conjugate Directions (CD) method. Define the step size

    α^κ = ⟨d^κ, r^κ⟩ / ⟨d^κ, v^κ⟩.

Using the step size α^κ, the approximate solution and residual vectors are updated as

    x^{κ+1} = x^κ + α^κ d^κ,   r^{κ+1} = r^κ − α^κ v^κ.

Finally, the next local direction vector is computed by enforcing the new residual r^{κ+1} to be s-conjugate with the s-conjugate direction vectors:

    p^{κ+1} = r^{κ+1} − Σ_{ℓ=0}^{min(s,κ+1)−1} GS_A(r^{κ+1}, d^{κ−ℓ}) d^{κ−ℓ}.          (17)

Restarting: Due to the asynchronous nature of the s-ACD algorithm, it is possible that the direction vectors, and consequently the approximate solution vectors, differ between agents at each local iteration. To address the potential stagnation that can result from such a scenario, we incorporate an asynchronous restarting procedure. By introducing these asynchronous restarts, we provide an opportunity for the agents to realign their progress, mitigate the effects of asynchronicity, and make collective advancements towards the true solution. The frequency of the restarts can be adjusted based on the specific requirements and characteristics of the problem being solved. As mentioned earlier, during the s-ACD communication stage, each agent also sends its local solution vector x^κ and local residual vector r^κ. This exchange allows each agent to have an updated understanding of the current state of the (global) computation, enabling more efficient choices of direction vectors for the subsequent iterations.
The algorithm is restarted periodically after detecting stagnation in the residual norm. The detection of stagnation and the subsequent restart is purely a local decision and calculation. The restart is performed if a specified number of iterations have passed since the last restart and the residual norm has decreased by less than a prescribed tolerance. This involves resetting the necessary variables, such as solution vectors, residual vectors, and direction vectors, to a common starting point. By doing so, the agents can restart from a more unified state and resume the algorithm to overcome the convergence stagnation. When a restart is deemed necessary, the local approximation to the asynchronous residual vector r̄^κ is constructed by averaging the most recent updates r^{ψ(i,j,κ)} received from each neighbor. If no updates have been received from agent j, then set r^{ψ(i,j,0)} := b, i.e.,

    r̄^κ = (1/N) Σ_{j=1}^{N} r^{ψ(i,j,κ)},

where r^{ψ(i,i,κ)} = r^κ. Then, the local approximation to the asynchronous solution vector x̄^κ is computed by averaging the most recently received solution vectors x^{ψ(i,j,κ)}. If no updates have been received from an agent j, then set x^{ψ(i,j,0)} := 0, i.e.,

    x̄^κ = (1/N) Σ_{j=1}^{N} x^{ψ(i,j,κ)},

where x^{ψ(i,i,κ)} = x^κ. Since each received pair satisfies r^{ψ(i,j,κ)} = b − A x^{ψ(i,j,κ)}, it follows that r̄^κ = b − A x̄^κ; thus, the restarting procedure maintains the accuracy of the residual. Additionally, we found empirically that explicitly recomputing the local partial residual from the averaged solution (i.e., setting r̄_i^κ = b_i − A_i x̄^κ) generally improves convergence and enables convergence for some non-symmetric matrices A. Modifications to (13) and (17) are required to account for the possibility that the s prior s-conjugate direction vectors may no longer be consistent after a restart. To address this issue, we introduce the set S, which represents the subset of "active" s-conjugate direction vectors. After a restart, this set is reset to S = ∅. After each iteration, we update S by taking the union of the set with the newly computed direction vector d^κ, i.e., S = S ∪ {d^κ}. Only the vectors within the set S can be used in the s-step orthogonalization process. Thus, (13) and (17) are modified so that the sums run only over the (at most s most recent) direction vectors contained in S. To complete the restarting procedure, all the agents set their local vectors accordingly: x^κ = x̄^κ, p^κ = r̄^κ, and r^κ = r̄^κ.
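A minimal sketch of the local restart step, with the fallback values r := b and x := 0 for agents that have not yet been heard from, and with the active set S cleared (the container names are ours):

```python
# Minimal sketch of the asynchronous restart: average the most recently
# received residual and solution vectors (including the agent's own), reset
# the direction to the averaged residual, and clear the active set S.
import numpy as np

def local_restart(b, latest_r, latest_x):
    """latest_r[j], latest_x[j]: most recent vectors from agent j, or None."""
    r_stack = [r if r is not None else b for r in latest_r]
    x_stack = [x if x is not None else np.zeros_like(b) for x in latest_x]
    r_bar = np.mean(r_stack, axis=0)
    x_bar = np.mean(x_stack, axis=0)
    return x_bar, r_bar, r_bar.copy(), []   # x, r, new direction p, emptied set S
```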

IV. DATA ERRORS AND CORRUPTION MODELING
As increased parallelism and new environments are considered, the likelihood of errors increases, and so too do the costs associated with data errors. Practical Krylov methods introduce restarts to handle errors introduced through finite-precision calculations. However, restarts alone are not enough when larger or more discrete data errors occur: the methods can lose the underlying subspace and orthogonality properties that they rely on. Thus, additional resiliency measures must be considered in environments where such large disruptions are expected. Even the s-ACD method above, which has the self-correcting restart mechanism, is still susceptible to errors introduced through numerous pathways, including malicious injection, disconnection of agents, corruption introduced into the signal, delays in communication, agents entering a failed state, bit errors introduced locally, etc. We focus on data corruption in which the original data being communicated at a single iteration is replaced by other values.
One important note is that we only corrupt one vector at a time. Because the x and r vectors are only used during the restart, to accurately model the errors we must force a restart after a corruption occurs; otherwise, corruptions in x and r may be masked. To understand this, imagine we corrupt a transmission of the vector x at iteration ι and we do not force a restart. The receiving agent receives this corrupted x at iteration κ and stores it, but does not use it in the calculation because a restart does not happen. However, at iteration κ′ > κ when a restart occurs, the transmitted vector x from iteration κ′ is stored (overwriting the previous corrupted vector with an uncorrupted vector) and is used during the restart. Thus, the corrupted x that was stored at iteration κ was overwritten and never used, leading to the corruption being masked.
The failure model we consider is that of one-off corruption, meaning that at some point in time, one or more agents all have corruption applied to one of the vectors transmitted during that iteration. This failure model has been chosen for multiple reasons:
• such corruption clearly partitions time into "before" and "after" corruption portions, where the "before" portion should be identical to the uncorrupted case,
• one can easily visually identify where the corruption occurs,
• it is simple to implement,
• it forms the basis for other forms of corruption and can relatively easily be generalized to the other forms.
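To make the failure model concrete, the following sketch (our own illustration; the choice of replacement values is an assumption, as the text only specifies that the transmitted data is replaced by other values) injects a one-off corruption into a transmitted vector at a chosen iteration:

```python
# Minimal sketch of the one-off corruption model: at a single chosen iteration,
# the transmitted vector is replaced by other values of comparable magnitude.
import numpy as np

def maybe_corrupt(vec, iteration, corrupt_at, rng=None):
    if iteration != corrupt_at:
        return vec
    rng = rng or np.random.default_rng(0)
    return rng.standard_normal(vec.shape) * np.linalg.norm(vec)
```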

V. RESILIENT S-ACD
While the s-ACD method has the self-correcting restart mechanism that allows it to be resilient to the presence of non-orthogonal directions introduced by the asynchronous approach (and somewhat resilient to other data errors), additional resiliency measures can decrease the impact of other data errors. In this section, we introduce resiliency in two stages: the detection stage and the correction stage. The benefit of this approach is to separate the task of identifying deviations from "normal" behavior from the ability to correct said behavior.
Algorithm 1 s-Approximate Asynchronous Conjugate Directions (executed by each agent i).
INPUT: global vector b, local portion A_i of the matrix A
OUTPUT: each agent i has a local approximation x^κ to the solution vector x
initialize x^0 = 0, r^0 = b, p^0 = b
for κ ← 0 to t_max do
    w^κ ← A_i^T p_i^κ
    send w^κ, p_i^κ, x^κ, r^κ to the other agents
    receive w^{ψ(i,j,κ)}, p_j^{ψ(i,j,κ)}, x^{ψ(i,j,κ)}, r^{ψ(i,j,κ)} for j ∈ U_i^κ
    assemble p̃^κ blockwise and w̃^κ = Σ_j w^{ψ(i,j,κ)}
    form d^κ and v^κ = A d^κ via (13), using only the directions in S
    α^κ ← ⟨d^κ, r^κ⟩ / ⟨d^κ, v^κ⟩
    x^{κ+1} ← x^κ + α^κ d^κ;   r^{κ+1} ← r^κ − α^κ v^κ
    p^{κ+1} ← r^{κ+1} A-orthogonalized against the directions in S via (17)
    S ← S ∪ {d^κ}
    if stagnation is detected and a restart is allowed then perform the restart described above and set S ← ∅ end if
end for

A. Detection Stage
To detect data errors, we need to be able to identify when the method is in a state that is not "normal." In CG, this can be done through the orthogonality conditions or monotonically decreasing quantities, which, if violated, indicate that something is not as expected. However, in order to achieve asynchronicity, s-ACD loses the orthogonality conditions. Instead, we can develop different detection schemes leveraging knowledge from CG.
1) Checksum: The checksum method is based on the idea of checksums, i.e., calculating a quantity locally, transmitting it with the information, and checking that the received quantity, when recalculated locally, is consistent. One simple way of doing this is to use the inner product of two vectors. The downside of this method is that it relies on a trustworthy sender (the sender can adjust both the checksum and the sent vectors accordingly) and requires additional communication. However, it comes with a number of pros: it requires only a small amount of local computation (perhaps some that is being performed anyway), provides per-agent detection, puts constraints on the possible malicious vectors that can be used, and is cheap and easy to implement. In particular, we use p^T w, where only the local portions of both vectors are used (as only the local portion of p is sent and available to check with). This is done by calculating the checksum before an agent sends its vectors (Alg. 2) and comparing the received value against the recalculated value on arrival (Alg. 3).
Algorithm 2 Checksum calculation before sending (agent i).
c^κ ← (p_i^κ)^T w_i^κ   ▷ inner product of the local portions
send c^κ along with w^κ, p_i^κ, x^κ, r^κ

Algorithm 3 Checksum calculation after receiving (agent i, update from agent j).
for each received update w^{ψ(i,j,κ)}, p_j^{ψ(i,j,κ)}, x^{ψ(i,j,κ)}, r^{ψ(i,j,κ)}, c^{ψ(i,j,κ)} do
    if (p_j^{ψ(i,j,κ)})^T w_j^{ψ(i,j,κ)} = c^{ψ(i,j,κ)} then
        store x^{ψ(i,j,κ)} and r^{ψ(i,j,κ)} and accept the update
    else
        mark update κ from agent j as corrupted
    end if
end for
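A minimal sketch of Algorithms 2 and 3 in NumPy; the exact slice of w entering the checksum (the sender's own block) and the floating-point tolerance are assumptions:

```python
# Minimal sketch of the checksum scheme: the sender attaches p_i^T w_i computed
# from its local blocks; the receiver recomputes the quantity from the received
# blocks and flags the update when the values disagree beyond a tolerance.
import numpy as np

def make_checksum(p_block, w_full, block_idx):
    return float(p_block @ w_full[block_idx])   # local portions of both vectors

def checksum_ok(p_block_recv, w_recv, block_idx, checksum_recv, rtol=1e-12):
    recomputed = float(p_block_recv @ w_recv[block_idx])
    return abs(recomputed - checksum_recv) <= rtol * max(1.0, abs(checksum_recv))
```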
2) General: As mentioned previously, one way of detecting corruption is to detect when the result of a calculation differs from what it should be. One can use "metrics" (also called indicator variables), which are simple scalars that change over time, and see when they change in unexpected ways. This allows us to adapt to different convergence speeds and parts of the convergence without relying too heavily on tunable parameters. While we lose precise orthogonality conditions and monotonically decreasing quantities in s-ACD, we still have some relations that are roughly predictable. For example, as iterations progress, ⟨d^κ, v^κ⟩, ⟨p^κ, p^κ⟩_A, and ⟨r^κ, r^κ⟩ all tend to decrease. Thus, we can monitor these values and determine when they increase between successive iterations more than expected.
Because there is significant variation of the metrics over the course of the solve, we should not look directly at the successive difference of a timeseries metric ξ between iterations. Instead, we apply a smoothing step, take the difference between the smoothed values, and compare that against a smoothed version of the difference of smoothed values. When the ratio of these quantities gets above a specified value (a threshold that tends to be quite robust), we mark the iteration as corrupted.
To perform the smoothing, we use a running average with a window size σ of 15 iterations, which allows us to perform these calculations online. If we let ξ ∈ R^κ be a timeseries with ξ_i being the value at point i in time, then we define the smoothed timeseries as

    smooth(ξ, σ)_i = (1 / min(σ, i + 1)) Σ_{ℓ = max(0, i−σ+1)}^{i} ξ_ℓ,

and we define the relative successive difference to be

    diff(ζ)_i = (ζ_i − ζ_{i−1}) / |ζ_{i−1}|.

We consider a timeseries ξ to be corrupted at time i if

    smooth(diff(smooth(ξ, σ)), σ)_i > ϵ

for some tolerance ϵ > 0.
In particular, we track the timeseries defined by ⟨v^κ, d^κ⟩ and ⟨r^κ, r^κ⟩. The biggest drawbacks of this method are that it does not provide per-agent detection and introduces additional computational steps. However, it is easily generalizable to other methods, requires only local computation, does not rely on a trustworthy sender, and the computational cost can be mitigated by updating the differences and smoothed timeseries at each time step rather than recalculating the entire timeseries.
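A minimal sketch of this detector; the streaming implementation below keeps running histories and applies the smooth, difference, smooth pipeline online, with the window σ = 15 and an illustrative threshold ϵ (both tunable assumptions):

```python
# Minimal sketch of the "general" detector: smooth the metric with a running
# average, take relative successive differences of the smoothed series, smooth
# again, and flag the iteration when the result exceeds a tolerance eps.
import numpy as np

def running_mean(values, window):
    tail = values[-window:] if len(values) >= window else values
    return float(np.mean(tail))

class MetricDetector:
    def __init__(self, sigma=15, eps=1.0):
        self.sigma, self.eps = sigma, eps
        self.raw, self.smoothed, self.rel_diff = [], [], []

    def update(self, value):
        self.raw.append(value)
        self.smoothed.append(running_mean(self.raw, self.sigma))
        if len(self.smoothed) > 1:
            prev, curr = self.smoothed[-2], self.smoothed[-1]
            self.rel_diff.append((curr - prev) / (abs(prev) + 1e-30))
        else:
            self.rel_diff.append(0.0)
        # Flag corruption when the smoothed relative difference jumps above eps.
        return running_mean(self.rel_diff, self.sigma) > self.eps
```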
3) Algorithm-Based: The final class of detection methods that we discuss are the algorithm-based metrics. These rely on knowing specific analytical relations within the calculations, such as the orthogonality conditions in CG. Although we do not have the orthogonality conditions, we do know that Ax = b and r = b − Ax. There are many solution approximations and associated residuals that can be calculated, all of which should be similar to each other. Thus, if one of these vectors differs significantly from the others (or from what is expected via explicit calculation of the residual), a corruption is likely to have occurred. We compare the incoming solution and residual vectors from other agents against the updated value of the currently considered solution and residual, as well as against the most recent solution and residual that have been checked by the detection mechanisms and identified as uncorrupted.
Due to the many comparisons and matrix-vector products involved, this is a computationally expensive check. However, it does not require any additional communication and provides per-agent detection. Furthermore, the adaptive properties of the general metrics could also be exploited here, although they are not currently used in our implementation (Alg. 4).

Algorithm 4 Algorithm-based check for agent i, which compares the incoming solution and residual vectors from agent j against baseline updated solution and residual vectors.
for j = 1, ..., N do
    x_S := the last x^{ψ(i,j,ι)} vector considered to be uncorrupted
    r_S := the last r^{ψ(i,j,ι)} vector considered to be uncorrupted
    for (x, r) ∈ ((x^κ, r^κ), (x_S, r_S)) do
        if the incoming pair (x^{ψ(i,j,ι)}, r^{ψ(i,j,ι)}) deviates significantly from (x, r) or from the explicitly computed residual b − A x^{ψ(i,j,ι)} then
            agent i considers agent j's communication at iteration ι as corrupted
        end if
    end for
end for
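A minimal sketch in the spirit of Alg. 4; the specific norm-based tests and thresholds below are assumptions, since the precise comparison criterion is not reproduced here, but the key relation exploited is r = b − Ax:

```python
# Minimal sketch of an algorithm-based consistency check: an incoming (x, r)
# pair should approximately satisfy r = b - A x and should not deviate wildly
# from pairs that are already trusted.
import numpy as np

def incoming_looks_corrupted(A, b, x_in, r_in, trusted_pairs, tol=1.0):
    explicit = b - A @ x_in
    if np.linalg.norm(explicit - r_in) > tol * (np.linalg.norm(explicit) + np.linalg.norm(r_in)):
        return True
    for x_ref, r_ref in trusted_pairs:       # e.g. current local pair, last verified pair
        if np.linalg.norm(x_in - x_ref) > tol * (np.linalg.norm(x_ref) + 1.0):
            return True
        if np.linalg.norm(r_in - r_ref) > tol * (np.linalg.norm(r_ref) + 1.0):
            return True
    return False
```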

B. Correction Phase
Once a transmission has been identified as containing a data error, a correction must be performed. Under traditional methods, this might require restarting the entire computation [28]; however, we are able to utilize a more nuanced approach that reduces the amount of redundant calculation. We introduce a simple rejection approach, which performs well for the class of investigated data errors. The simplest form of correction is to ignore the updates associated with an iteration that has been flagged as corrupted. If a specific agent has been identified as the source of the corruption, it is possible to ignore the updates from that agent only.
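As a sketch of the rejection approach (the dictionary-based bookkeeping is our own illustration), flagged updates are simply dropped before the assembly step:

```python
# Minimal sketch of rejection-style correction: updates flagged as corrupted
# are not applied; with per-agent detection only the offending agent's update
# is dropped, otherwise the whole iteration's updates can be ignored.
def filter_updates(updates, corrupted_agents):
    """updates: dict agent_id -> payload; corrupted_agents: set of flagged ids."""
    return {j: u for j, u in updates.items() if j not in corrupted_agents}
```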

VI. NUMERICAL EXPERIMENTS
To demonstrate the performance of the newly proposed s-ACD and resiliency methods, a number of numerical experiments are conducted. These experiments are performed on a MacBook Pro (2019) with a 2.4 GHz 8-core Intel Core i9 CPU and 64 GB of 2667 MHz DDR4 memory. Unless otherwise stated, experiments are run with four agents. When a 2D Poisson problem is used, it has homogeneous Dirichlet boundary conditions and is discretized with central finite differences, with right-hand side g(x, y) = sin(πx) sin(πy) for a point at location (x, y). When a random SPD matrix is used, it has a condition number of 50, with a right-hand side defined by a vector of all ones.
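For reference, the 2D Poisson test problem can be set up as follows (our own sketch, assuming the standard 5-point stencil on the unit square with an n × n interior grid; the exact discretization used in the experiments may differ in details such as scaling):

```python
# Minimal sketch of the 2D Poisson test problem: homogeneous Dirichlet BCs,
# 5-point central finite differences on an n x n interior grid, and
# right-hand side g(x, y) = sin(pi x) sin(pi y) at the grid points.
import numpy as np

def poisson_2d(n):
    h = 1.0 / (n + 1)
    T = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Laplacian stencil
    A = (np.kron(np.eye(n), T) + np.kron(T, np.eye(n))) / h**2
    grid = np.arange(1, n + 1) * h
    X, Y = np.meshgrid(grid, grid, indexing="ij")
    b = (np.sin(np.pi * X) * np.sin(np.pi * Y)).ravel()
    return A, b
```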
Fig. 1 displays a variety of restarting frequencies f and s-step sizes s when using s-ACD over five runs, where the mean of the runs is plotted as a solid line and the 95% confidence interval is displayed in the corresponding shaded region. We can see that having a large s-step size s relative to the restart frequency f, such as the cases f = 15, s = 10 and f = 5, s = 5, results in both a higher iteration count and slower convergence. This is likely because, when the s-step size and restart frequency are similar, few full s-sized s-step updates are performed. There was not a significant difference among the other tested combinations. For the following tests, f = 15 and s = 5.
Fig. 2 compares the scaling of the s-ACD method against that of the asynchronous Jacobi and serial CG methods when changing the number of rows (and hence the condition number) of a 2D Poisson matrix, over five runs, with the means plotted as solid lines and the 95% confidence interval plotted in the shaded region. We can see that our s-ACD method achieves better asymptotic scaling than the asynchronous Jacobi method, although not as good as serial CG, especially for larger condition numbers. The scaling achieved on systems with condition numbers in the range of 10^2 to 10^3 is comparable with CG, with a significant improvement in the absolute number of iterations compared to asynchronous Jacobi. The development of convergence theory could help explain these results, and asynchronous preconditioning could improve the convergence further.
A 100 × 100 random SPD problem is used for Fig. 3, where different vectors are corrupted during communication, and the resulting l2-norms are plotted for each agent, demonstrating the impact of these data errors on the metrics. The metrics are very noisy due to the loss of orthogonality and the independent nature of each agent. We see spikes around each restart, as the local residual and solution vectors change, potentially significantly, compared to the previous iteration. We observe that in all cases the convergence slows down after the corruption at one second, while in the x, p, and r cases there is a significant spike in the observed metrics. This is because corrupting the x, r, or p vectors (directly or indirectly) permanently destabilizes the subspace and has immediate consequences, while the w vector is recalculated at each iteration from the p vector.
The various steps of the "generic" detection scheme are demonstrated in Fig. 4, which uses a random SPD problem with a data error occurring after one second, separating the global post-processed metrics from the metrics visible to each agent (Fig. 4a). We see that the global post-processed metric is very smooth, ignoring the spike at each restart, indicating that the overall behavior of the agents is similar to the behavior of CG. We observe that the local metrics correlate strongly with the trend of the global post-processed metrics, indicating that they can be used as a proxy for the overall convergence. While in synchronous CG we would expect these metrics to be monotonically decreasing, the asynchronous algorithm removes these guarantees. Thus, the procedure of smoothing and successive differences must be used, and through this adaptive method we are able to detect when the corruption happens via local computations. It is clear that after these procedures (Fig. 4e) the aberration can be easily detected.
Finally, Fig. 5 shows the residual of a 100 × 100 random SPD problem for four agents for the s-ACD method with and without the resiliency measures, over 30 runs, with the mean of the runs plotted as a solid line, the dotted lines corresponding to individual runs, and the shaded region to the 95% confidence interval. By enabling all three sets of resiliency measures discussed above, the data errors are successfully detected and corrected. We can see that adding the resiliency measures eliminates the impact of the corruption, leading to four times faster convergence than without resiliency measures.


VII. CONCLUSION
We have seen that, due to increased parallelism and new computational paradigms, asynchronous and resilient methods should be developed. In this paper, we have developed the s-ACD method, which combines the CD method globally with the CG method locally. This provides scaling with respect to the condition number comparable with CG on the tested 2D Poisson problem, while ensuring complete asynchronicity, as global orthogonalization is no longer required, as well as some resiliency. Furthermore, we developed three detection techniques: a "generic" detection scheme, a "checksum" detection scheme, and an algorithm-based detection scheme. These methods were applied to s-ACD, creating the resilient s-ACD method. Numerical experiments demonstrate that this resilient s-ACD method is able to handle the introduction of data errors into the communication pattern, resulting in a significant decrease in iterations compared to the case without resiliency measures. Future improvements include developing theory for the s-ACD method and its resilient variant, adding more elaborate correction methods such as rollback, as well as developing asynchronous preconditioners to allow the considered methods to scale to larger problems.

Fig. 4: Different stages of the post-processing pipeline applied to the metrics when an error is introduced one second into the run for all agents, for the s-ACD method (without resiliency measures). The problem considered is a 100 × 100 random SPD matrix with condition number 50.

Fig. 1: Comparing the scaling of different restarting frequencies f and s-step sizes s for s-ACD on a 2D Poisson problem discretized with finite differences, with four agents and five runs proceeding until a tolerance of 1e-5 is reached. The shaded region represents a 95% confidence interval.

Fig. 2: Scaling of s-ACD vs. asynchronous Jacobi (ASJ) and the serial CG method on a 2D Poisson problem discretized with finite differences, with four agents and five runs proceeding until a tolerance of 1e-5 is reached. The shaded region represents the 95% confidence interval.

Fig. 3: Demonstrating the impact of errors introduced into different vectors of s-ACD one second into the run for all agents, where the metric of each agent is displayed. The problem considered is a 100 × 100 random SPD problem with condition number 50.

Fig. 4a: Global metrics via post-processing.