Defect Backlog Size Prediction for Open-Source Projects with the Autoregressive Moving Average and Exponential Smoothing Models

Context: Predicting the number of defects in a defect backlog within a given time horizon can help allocate project resources and organize software development. Goal: To compare the accuracy of three defect backlog prediction methods in the context of large open-source software (OSS) projects: ARIMA, Exponential Smoothing (ETS), and the state-of-the-art method developed at Ericsson AB (MS). Method: We perform a simulation study on a sample of 20 open-source projects to compare the prediction accuracy of the methods. We also use the Naïve prediction method as a baseline for a sanity check. We use statistical inference tests and effect size coefficients to compare the prediction errors. Results: ARIMA, ETS, and MS were more accurate than the Naïve method. Also, the prediction errors were statistically significantly lower for ETS than for MS (however, the effect size was negligible). Conclusions: ETS seems slightly more accurate than MS when predicting the defect backlog size of OSS projects.


I. INTRODUCTION
A defect backlog is the collection of all project defect reports that need to be handled. The size of this collection changes over time. The problem of monitoring defect backlogs is important in all modern software development organizations. In agile software development, it is important to correctly prioritize defects to continuously deliver business value. Also, especially in large organizations, the assignment of developers and testers to projects is often done dynamically, on demand. When a situation in a project demands more human resources for quality improvements, developers shift their focus from feature implementation to defect removal [2]. Because of these dynamic changes, knowing in advance that a project may require more human resources in the following week is valuable information for managers, developers, testers, and other project stakeholders.
In particular, managers need to know the defect backlogs for the coming weeks. Therefore, defect prediction models that forecast the number of defects that will need to be handled in a given time horizon are needed. There have been multiple studies on designing such models in industrial contexts [3], [4], [5], [6]. One of the successful studies on defect backlog prediction was conducted at Ericsson by Staron and Meding [3]. They proposed an autoregressive model (MS) that is based on the moving average and predicts defect backlog size within a weekly horizon. Although the MS model turned out to be very accurate at Ericsson, there are several other state-of-the-art autoregressive models for time series forecasting that have not been tried out for defect backlog predictions. Two of them are Autoregressive Integrated Moving Average (ARIMA) [7] and Exponential Smoothing (ETS) [8], [9], which are the two most widely used approaches to time series forecasting [10].
This work was supported by the Poznan University of Technology within the project 0311/SBAD/0738. Some of the paper's contents come from the corresponding author's master thesis [1].
Although most of the previous studies on defect prediction were based on open-source software (OSS) datasets, the studies on defect backlog predictions at Ericsson have not been replicated in the OSS context. The flow of OSS projects differs from the flow of their industrial counterparts; however, OSS projects are often larger in terms of the number of contributors involved. Predicting the number of defects in a backlog is not easy due to uncertainties in identifying all the defects. The dynamic nature of software development, with its changing requirements and iterative cycles, adds to the complexity. Moreover, the accuracy of predictions can be affected by the quality and relevance of the historical data used for analysis. Most existing methods rely on data from classical repositories like NASA and PROMISE, which may have limitations. Lastly, current predictive models may not consider all the contextual factors and unique project characteristics that impact defect discovery.
The goal of this study is to design defect backlog prediction models based on the ARIMA and ETS methods and validate their accuracy in the context of large OSS projects. We use the state-of-the-art MS model developed at Ericsson [3] as a baseline for comparison since it has been reported as an accurate defect backlog prediction model validated in an industrial setting.
The structure of this paper is as follows. Section II provides a brief overview of the ARIMA and ETS methods, while Section III discusses the related work. Section IV describes the research methodology of our study. The results are presented and discussed in Section V. Finally, Section VI summarizes the main findings of our study.

A. Defect Backlog
In many software projects, defects that need to be resolved are collected in defect backlogs. There are many defect-tracking tools on the market (e.g., Bugzilla, Jira, ReQtest). This kind of software helps the entire team and managers to get a view of how many defects remain in the software and what they are. To help developers work on the project, the defects in the backlog can be ordered according to priority. Often each issue includes additional information, which differs between projects. If the software is regularly tested, the size of the defect backlog changes over time. The number of defects that have been reported in a specific period of time is called defect inflow. Similarly, the number of defects that have been resolved in that period is referred to as defect outflow. The change in defect backlog size within a given period of time (for the sake of this study, a week) is the difference between its inflow and outflow.

B. Autoregressive Integrated Moving Average
ARIMA stands for Autoregressive Integrated Moving Average. As the name suggests, it combines two time-series techniques, namely, the Autoregressive model and the Moving Average model.
ARIMA requires the time series to be stationary. The values of a stationary time series do not depend on time. Thus, if we can see a trend or seasonality in a time series, it is non-stationary: its value depends on time. A non-stationary time series first has to be transformed into a stationary one by using the differencing operation. The differenced time series is calculated as the changes between subsequent observations [10] (see Equation 1). The differencing operation can be repeated multiple times if the obtained time series is still non-stationary.
$$y'_t = y_t - y_{t-1} \tag{1}$$
where: $y'_t$ is the value of the differenced series at time $t$, $y_t$ is the value of the original series at time $t$, and $y_{t-1}$ is the value of the original series at time $t-1$.
The Autoregressive model is based on multiple linear regression. What distinguishes it from other linear regression models is that it predicts the outcome variable (y) using its own past values as predictor variables (x). This approach assumes that there is some correlation between subsequent values in a time series (autocorrelation). The Autoregressive model is defined by Equation 2 [10].
$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \epsilon_t \tag{2}$$
where: $c$ is a constant value, $\phi_1, \dots, \phi_p$ are the model parameters, $\epsilon_t$ is the error, and $p$ is the order of the model. The order $p$ determines how many past values will be considered to calculate the outcome. The autoregressive model of order $p$ can be referred to as AR(p).
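As an illustration of how an AR(p) model turns past observations into a one-step forecast, the following minimal sketch evaluates the autoregressive sum with the error term set to its expected value of zero. The function name and the coefficient values are hypothetical; in the study, the parameters are estimated from training data rather than chosen by hand.

```python
def ar_forecast(history, c, phi):
    """One-step AR(p) forecast: c + phi_1*y_{t-1} + ... + phi_p*y_{t-p}.

    The error term is replaced by its expected value (zero).
    `phi` holds illustrative coefficients, not fitted ones.
    """
    p = len(phi)
    if len(history) < p:
        raise ValueError("need at least p past observations")
    # Most recent value first, so it lines up with phi_1, phi_2, ...
    recent = history[-p:][::-1]
    return c + sum(coef * y for coef, y in zip(phi, recent))


# AR(2) with made-up parameters: 1.0 + 0.5*11 + 0.25*12
print(ar_forecast([10, 12, 11], c=1.0, phi=[0.5, 0.25]))  # 9.5
```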
The Moving Average model calculates the outcome variable as a linear combination of past forecast errors. The formula of the model is presented in Equation 3 [10].
$$y_t = c + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \dots + \theta_q \epsilon_{t-q} \tag{3}$$
where: $c$ is a constant value, $\theta_1, \dots, \theta_q$ are the model parameters, $\epsilon$ is the forecast error, and $q$ is the order of the model. The equation shows that the value of $y_t$ can be considered a weighted moving average of the past forecast errors. The order $q$ determines how many past forecast errors will influence the outcome. The moving average model of order $q$ can be referred to as MA(q).
The equation of a non-seasonal ARIMA model, presented in Equation 4, shows that it combines the components of the autoregressive and moving average models, which are lagged values and lagged errors.
$$y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q} + \epsilon_t \tag{4}$$
The outcome of the model is a differenced series. To get the actual predicted values, the time series needs to be integrated. Integrating is the reverse of differencing. The transformation aims to restore the trend or seasonality that was previously removed.
The non-seasonal ARIMA model is characterized by 3 parameters:
• p - order of the autoregressive part,
• d - degree of differencing involved,
• q - order of the moving average part.
A model with specific parameters can be referred to as ARIMA(p, d, q). We use the ARIMA implementation provided by the R forecast package [11]. The function auto.arima() estimates the model parameters by analyzing the training data.
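The relation between differencing and integrating can be illustrated with a short sketch. The helper names and the sample backlog values are made up for illustration; the point is only that integrating the differenced series from the first original value recovers the original series.

```python
def difference(series):
    """First-order differencing: changes between subsequent observations."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


def integrate(first_value, differenced):
    """Reverse of differencing: cumulatively add the changes back."""
    values = [first_value]
    for change in differenced:
        values.append(values[-1] + change)
    return values


backlog = [100, 107, 104, 104, 100]   # hypothetical weekly backlog sizes
changes = difference(backlog)
print(changes)                        # [7, -3, 0, -4]
print(integrate(backlog[0], changes)) # [100, 107, 104, 104, 100]
```

Applying `difference` twice corresponds to a differencing degree of d = 2.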

C. Exponential Smoothing
The general idea behind Exponential Smoothing (ETS) forecasting methods is that predicted values are weighted averages of past observations. The weight associated with an observation depends on how old the observation is. Thus, the oldest observations have a smaller impact on the outcome than the recent ones.
The simplest version of the exponential smoothing method, called Simple Exponential Smoothing, is expressed by Equation 5 [10]. The application of this version of the method is limited to data with no clear trend or seasonality.
$$\hat{y}_{T+1|T} = \alpha y_T + \alpha(1-\alpha) y_{T-1} + \alpha(1-\alpha)^2 y_{T-2} + \dots \tag{5}$$
where:
• $0 \le \alpha \le 1$ is the smoothing parameter.
The smoothing parameter regulates how the weights change with the age of an observation. If $\alpha$ is small, more weight is given to the observations from the more distant past. If it is large, more weight is associated with the recent observations.
Equation 5 [10] can be described in the component form as presented in Equation 6.
$$\begin{aligned} \text{Forecast equation:}\quad & \hat{y}_{t+h|t} = l_t \\ \text{Smoothing equation:}\quad & l_t = \alpha y_t + (1-\alpha) l_{t-1} \end{aligned} \tag{6}$$
where:
• h - number of steps to forecast,
• l - level component.
The single component is called the level (smoothed value) $l_t$ of the series at time $t$. From the forecast equation, we can see that the predicted value at time $t+1$ is the level of the time series at time $t$. Replacing the level component in the smoothing equation according to the relation $\hat{y}_{t+h|t} = l_t$ leads to the exponential smoothing form presented in Equation 5 [10].
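The component form lends itself to a compact implementation. The sketch below is an illustration only: it assumes the initial level equals the first observation (a common simplification, whereas a fitted model would estimate both the initial level and α), and the function name is hypothetical.

```python
def simple_exponential_smoothing(series, alpha, initial_level=None):
    """Return the level l_t after the last observation.

    Because the forecast function is flat, this level is also the
    forecast for every future step h.
    Assumption: the initial level defaults to the first observation.
    """
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must lie in [0, 1]")
    if not series:
        raise ValueError("series must not be empty")
    level = series[0] if initial_level is None else initial_level
    for y in series:
        # Smoothing equation: l_t = alpha * y_t + (1 - alpha) * l_{t-1}
        level = alpha * y + (1 - alpha) * level
    return level


print(simple_exponential_smoothing([10, 20], alpha=0.5))  # 15.0
```

With α = 1 the forecast collapses to the last observation, which is exactly the Naïve method; with α = 0 it never moves away from the initial level.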
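To extend the application of the simple exponential smoothing method to data with a trend, an additional component has been added to the equations. The extended method is called Holt's linear trend method and is expressed by the 3 equations presented in Equation 7 [10].
$$\begin{aligned} \text{Forecast equation:}\quad & \hat{y}_{t+h|t} = l_t + h b_t \\ \text{Level equation:}\quad & l_t = \alpha y_t + (1-\alpha)(l_{t-1} + b_{t-1}) \\ \text{Trend equation:}\quad & b_t = \beta^* (l_t - l_{t-1}) + (1-\beta^*) b_{t-1} \end{aligned} \tag{7}$$
where $b_t$ is the trend (slope) component and $0 \le \beta^* \le 1$ is the smoothing parameter for the trend.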
With the trend (slope) component, the forecast function is no longer flat as it was in the case of Simple Exponential Smoothing. However, this method is still of limited use because the trend is constant: the method assumes that the outcome values always increase or decrease in the same way. Thus, an additional parameter called the damping parameter has been introduced to deal with that. Because of this modification, the trend can flatten out in the future. The form of the method which includes the damping parameter is expressed by Equation 8.
$$\begin{aligned} \text{Forecast equation:}\quad & \hat{y}_{t+h|t} = l_t + (\phi + \phi^2 + \dots + \phi^h) b_t \\ \text{Level equation:}\quad & l_t = \alpha y_t + (1-\alpha)(l_{t-1} + \phi b_{t-1}) \\ \text{Trend equation:}\quad & b_t = \beta^* (l_t - l_{t-1}) + (1-\beta^*) \phi b_{t-1} \end{aligned} \tag{8}$$
As we can see, with the damping parameter $\phi = 1$, the method is the same as Holt's linear method presented in Equation 7.
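To make the effect of the damping parameter concrete, the sketch below implements the damped-trend recursions in their standard textbook form [10]. The initialization (level from the first observation, trend from the first difference), the function name, and the parameter values are assumptions made for illustration, not the fitted models used in the study.

```python
def damped_holt_forecast(series, alpha, beta, phi, horizon=1):
    """h-step forecast with Holt's method and a damped trend.

    Assumptions for this sketch: the initial level is the first
    observation and the initial trend is the first difference.
    With phi = 1 this reduces to Holt's linear trend method.
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (prev_level + phi * trend)
        trend = beta * (level - prev_level) + (1 - beta) * phi * trend
    # phi + phi^2 + ... + phi^h shrinks the contribution of the trend
    damping = sum(phi ** k for k in range(1, horizon + 1))
    return level + damping * trend


# A perfectly linear series: the undamped method extends the line.
print(damped_holt_forecast([1, 2, 3, 4], alpha=0.5, beta=0.5, phi=1.0))  # 5.0
```

For the same series, a call with `phi=0.8` and a long horizon yields a smaller forecast than the undamped call, which is the flattening behaviour described above.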
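Holt's method can be extended with a seasonal component. This version is called Holt-Winters' seasonal method. There are two versions of the method: additive and multiplicative. The additive method is suitable for series with constant seasonal variations. On the other hand, the multiplicative method is preferred when the variations change in proportion to the level of the series. The component form of Holt-Winters' additive method is expressed by Equation 9 [10].
$$\begin{aligned} \text{Forecast equation:}\quad & \hat{y}_{t+h|t} = l_t + h b_t + s_{t+h-m(k+1)} \\ \text{Level equation:}\quad & l_t = \alpha (y_t - s_{t-m}) + (1-\alpha)(l_{t-1} + b_{t-1}) \\ \text{Trend equation:}\quad & b_t = \beta^* (l_t - l_{t-1}) + (1-\beta^*) b_{t-1} \\ \text{Seasonal equation:}\quad & s_t = \gamma (y_t - l_{t-1} - b_{t-1}) + (1-\gamma) s_{t-m} \end{aligned} \tag{9}$$
where:
• m - number of seasons in a year,
• k - integer part of (h − 1)/m,
• s - seasonal component, with smoothing parameter γ.
The component form of Holt-Winters' multiplicative method is expressed by Equation 10.
$$\begin{aligned} \text{Forecast equation:}\quad & \hat{y}_{t+h|t} = (l_t + h b_t)\, s_{t+h-m(k+1)} \\ \text{Level equation:}\quad & l_t = \alpha \frac{y_t}{s_{t-m}} + (1-\alpha)(l_{t-1} + b_{t-1}) \\ \text{Trend equation:}\quad & b_t = \beta^* (l_t - l_{t-1}) + (1-\beta^*) b_{t-1} \\ \text{Seasonal equation:}\quad & s_t = \gamma \frac{y_t}{l_{t-1} + b_{t-1}} + (1-\gamma) s_{t-m} \end{aligned} \tag{10}$$
The difference between the two versions of Holt-Winters' method is how they express the seasonal component and then take it into account. In the additive method, the seasonal component is expressed in absolute terms, and the series is seasonally adjusted by subtracting it. In contrast, in the multiplicative method, the seasonal component is expressed in relative terms, and the series is seasonally adjusted by dividing by it.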
There are 9 different exponential smoothing methods. Those presented so far are examples of different combinations of components. Each method is defined by the type of its trend and seasonal components.
The types of trend components are:
• None (N),
• Additive (A),
• Additive damped (A_d).
The types of seasonal components are:
• None (N),
• Additive (A),
• Multiplicative (M).
Each exponential smoothing method can be labeled with two letters which refer to the type of trend and seasonal components. Table I presents the classification of exponential smoothing methods.
For each of the 9 presented methods, there are two models that differ in the way they express the errors: the first with additive errors and the second with multiplicative errors. For each method, the point forecasts of the two models are the same. However, they generate different prediction intervals. To make the distinction, the classification in Table I is extended by a third letter. Every exponential smoothing model is labeled with three letters as ETS(Error, Trend, Seasonal). Thus, the model that includes an additive error, no trend component, and a multiplicative seasonal component would be denoted as ETS(A, N, M).
We use the ETS implementation provided by the R forecast package [11]. The function ets() estimates the model parameters by analyzing the training data.

A. Software Reliability Growth Models
Software Reliability Growth Models (SRGMs) are equations used to model the growth of software reliability using defect inflow data gathered during the development process. Researchers use SRGMs to forecast defects. A side effect of predicting the defects themselves is knowledge about the number of defects in the defect backlog. Using this relationship and applying SRGMs to predict the size of the defect backlog is a popular technique [12]. There is no standard way of selecting the most appropriate SRGM for given defect data. There are studies that reveal the best-fitting models for reliability in some types of projects. In [6], researchers investigated the distribution of defect inflow in automotive domain projects, which could aid in finding the best-fitting SRGMs. That work shows that selecting the appropriate model is the most challenging part of the forecast. There are more than 100 SRGMs.

B. Linear Regression
Linear regression modeling has also been applied to the problem of defect backlog prediction. Examples of independent variables used in the regression models are [13]:
• program metrics (such as program size, number of variables),
• number of defects found in the earlier phase,
• testing time,
• design methodology.
Yu, Shen, and Dunsmore [13] investigated the correlation between those variables and the number of defects that remain in the software. They discovered that the strongest relationship was between the number of defects identified during earlier phases of development and the number discovered later.

C. Defect Backlog Prediction at Ericsson AB
A research program executed at Ericsson AB resulted in a few studies on defect backlog predictions. In the first study [4], Software Reliability Growth Models were used to predict defect inflow after release. The results were not satisfactory. The defect profile described by the model deviated significantly from the profile of defects in the studied project.
In the follow-up study on defect inflow predictions in a large-size software project [5], the prediction accuracy of different methods (e.g., multivariate linear regression or a method that used the moving average of defect inflow) was compared. Table II presents the average prediction errors depending on the number of predictor variables and the method used. To evaluate the accuracy of the methods, the authors used the Mean Magnitude of Relative Error (MMRE). None of the evaluated methods reached the accuracy level required by the organization. To improve prediction accuracy, researchers from Ericsson decided to conduct a more detailed study on a medium-size project [3]. This time they evaluated the prediction accuracy of three methods:
• Multivariate linear regression,
• Analogy-based prediction,
• Expert estimations.
Seven variables identified as most influential on defect inflow were chosen from a set of over 50 and used to construct a multivariate linear regression model. For analogy-based prediction, the researchers collected an analogy database (projects that they found the most similar to the one that they were working on). The variables used for calculating similarity were [3]:
• the number of test cases planned in integration testing 4 weeks before the predicted week,
• the number of test cases executed in integration testing 4 weeks before the predicted week.
Also, they decided to enrich analogy-based predictions by involving experts and asking them to choose the variables that they found the most influential on defect inflow and assign them weights.
PROCEEDINGS OF THE FEDCSIS. WARSAW, POLAND, 2023
In the following study at Ericsson AB [3], the problem was reframed to predict defect backlog size instead of predicting defect inflow. A new method was proposed by Meding and Staron (MS) that relied on the moving average of defect inflow and defect outflow and the previous backlog size (see Equation 11). The proposed model allowed for predicting defect backlog size with the highest accuracy (MMRE of 16%) compared to the previous studies.
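The idea of combining the previous backlog size with moving averages of inflow and outflow can be sketched as follows. This is a hypothetical illustration, not a reproduction of Equation 11: the window length of 3 weeks, the equal weighting, and the function name are all assumptions, and the exact formulation should be taken from [3].

```python
from statistics import mean


def moving_average_backlog_forecast(backlog, inflow, outflow, window=3):
    """Predict next week's backlog size from recent flow averages.

    Assumed form (illustrative only): previous backlog size plus the
    moving average of inflow minus the moving average of outflow.
    The window of 3 weeks is an assumption, not a value from the paper.
    """
    if not backlog:
        raise ValueError("backlog history must not be empty")
    if min(len(inflow), len(outflow)) < window:
        raise ValueError("not enough inflow/outflow history for the window")
    return backlog[-1] + mean(inflow[-window:]) - mean(outflow[-window:])


# Hypothetical history: the backlog stands at 104 and inflow has
# exceeded outflow by 3 defects per week on average.
print(moving_average_backlog_forecast([100, 104], [9, 12, 6], [6, 6, 6]))  # 107
```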

A. Research Goal and Questions
We perform a Simulation-Based Study (SBS) [14] using data from OSS projects to compare the accuracy of two new defect backlog prediction models, based on Autoregressive Integrated Moving Average (ARIMA) and Exponential Smoothing (ETS), with the state-of-the-art Meding-Staron model (MS). We formulate the following research questions:
• RQ1: Are the MS, ARIMA, and ETS models more accurate than the Naïve prediction method?
• RQ2: Are ARIMA and/or ETS more accurate than the MS model when predicting the number of defects in defect backlogs of OSS projects?
The question RQ1 could be considered a sanity test for the models. Shepperd and MacDonell [15] recommend performing such a test against "random guessing"; however, we decided to use the so-called Naïve method instead of guessing. This method uses the actual observed value from the last week as the forecast for the next week. Although very simple, the Naïve method is reported to "work remarkably well for many economic and financial time series" [10]. Therefore, our sanity test is more demanding than the one proposed by Shepperd and MacDonell. Moreover, for practical reasons, if a given model does not outperform the Naïve method, there is no point in considering it for real-life applications. The latter question (RQ2) is the central research question of this study. We compare the accuracy of the MS model, which according to the literature is the most accurate model for defect backlog predictions, with two models that are state of the art in time-series prediction.
The replication package for this study is available on GitHub.

B. Dataset
We collected defect backlogs from 20 Bugzilla instances of OSS projects managed by Apache Foundation, Eclipse, Mozilla Foundation, Linux, Open Office, and Libre Office. We selected only the projects that had a sufficiently long defect reporting period. The shortest defect-tracking period was 8 years (Libre Office Draw), while the longest was 22 years (Mozilla Core).
In the first step, we fetched defect reports from the Bugzilla service instances of OSS projects. The reports were grouped based on the dates when they were submitted or resolved and their severity level. By counting the number of defects submitted, resolved, or remaining in the backlog, we calculated the defect backlog level (number of defect reports still open at the end of the week), defect inflow (number of defects reported in a given week), and defect outflow (number of defects resolved in a given week) for every week.
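The weekly aggregation described above can be sketched as follows. The record layout (a pair of submission and resolution dates, with `None` for reports that are still open) and the function name are assumptions for illustration; they do not reflect the Bugzilla API or the study's replication package, and severity grouping is omitted.

```python
from datetime import date


def weekly_flows(reports, start, n_weeks):
    """Aggregate defect reports into weekly inflow, outflow, and backlog.

    `reports` is a list of (opened, resolved) date pairs; `resolved`
    is None for reports that are still open (assumed layout).
    Week 0 starts on `start`; events outside the range are ignored.
    """
    inflow = [0] * n_weeks
    outflow = [0] * n_weeks
    for opened, resolved in reports:
        week = (opened - start).days // 7
        if 0 <= week < n_weeks:
            inflow[week] += 1
        if resolved is not None:
            week = (resolved - start).days // 7
            if 0 <= week < n_weeks:
                outflow[week] += 1
    # Backlog level: reports still open at the end of each week.
    backlog, open_reports = [], 0
    for reported, fixed in zip(inflow, outflow):
        open_reports += reported - fixed
        backlog.append(open_reports)
    return inflow, outflow, backlog


reports = [
    (date(2018, 1, 2), date(2018, 1, 10)),
    (date(2018, 1, 3), None),
    (date(2018, 1, 9), date(2018, 1, 20)),
]
print(weekly_flows(reports, start=date(2018, 1, 1), n_weeks=3))
# ([2, 1, 0], [0, 1, 1], [2, 2, 1])
```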
The resulting dataset consisted of 20 defect backlogs presented in Table III. The beginning of each defect backlog is determined by the date of the first reports submitted to Bugzilla. The end of the bug tracking period is the same for all projects (01-01-2019). The average defect backlog size presented in Table IV ranges from 112 defects/week for Kernel Networking to 29,050 defects/week for Mozilla Core.
While visualizing the change in defect backlogs over time, we observed a suspicious phenomenon of rapid, significant drops in the number of defects in the backlog. We perceive them as anomalies that could result from "cleaning" Bugzilla instances of irrelevant defect reports. Figure 1 presents an example of the defect backlog level changing over time in the Open Office Draw project, with at least two sudden major drops in the number of defects around weeks 650 and 860, which are unlikely to be caused by real defect-fixing activities. Unfortunately, such drops are unexpected and poorly predicted by the considered prediction methods. In contrast, they have a less visible impact on the prediction errors made by the Naïve method, since the error is present only for the single week following such a drop.
To mitigate the effect of sudden drops in the level of the backlog, we decided to split the defect backlogs based on the presence of unexpected drops in the number of defects. The process of dividing backlogs into fragments was the same for all projects and started with differencing the defect backlog level, the result of which is presented in Figure 2. The peaks in the differenced time-series plot correspond to the sudden falls in backlog level from Figure 1. To consistently determine the weeks in which there are sudden decreases in the backlog level and to set the boundaries between fragments in those weeks, the 99.5th percentile of the difference in the backlog between successive weeks was calculated. For instance, the Open Office Draw backlog was divided into the 5 fragments presented in Figure 3. The final prediction error of a given method for a divided backlog is calculated as the average of the errors for the individual fragments.
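A possible reading of this splitting rule is sketched below. Several details are assumptions, since the text does not pin them down: a drop is measured as the previous week's level minus the current one (so that falls are positive), the percentile uses the nearest-rank definition, and a boundary is placed where the drop strictly exceeds the threshold. The function names are hypothetical.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100] (assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def split_at_sudden_drops(backlog, q=99.5):
    """Split a backlog series at weeks that follow an unusually large drop."""
    # Positive value = the backlog fell between two successive weeks.
    drops = [prev - curr for prev, curr in zip(backlog, backlog[1:])]
    threshold = percentile(drops, q)
    fragments, start = [], 0
    for week, drop in enumerate(drops, start=1):
        if drop > threshold:
            fragments.append(backlog[start:week])
            start = week  # the week after the drop opens a new fragment
    fragments.append(backlog[start:])
    return fragments


# Hypothetical backlog: steady growth with one "cleaning" of 50 reports.
series = list(range(100)) + [49 + i for i in range(101)]
parts = split_at_sudden_drops(series)
print(len(parts), len(parts[0]), parts[1][0])  # 2 100 49
```

Under these assumptions, a series with many tied differences or very few weeks may yield no split at all, because no drop strictly exceeds the percentile.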

C. Predictions And Accuracy Evaluation
We performed training and accuracy evaluation for each individual defect backlog. We used data from all the previous weeks to train the ARIMA and ETS models and to predict the number of defects in the backlog for the following week. We based the accuracy evaluation on the Absolute Error (AE) calculated according to Equation 12 and calculated the Mean Absolute Error (MAE) for each backlog.
$$AE_t = |y_t - \hat{y}_t| \tag{12}$$
where $y_t$ is the observed backlog size in week $t$ and $\hat{y}_t$ is the predicted backlog size for that week.
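The evaluation procedure, in which each week is predicted from all preceding weeks, can be sketched with the Naïve method standing in for the forecaster. The function names and the sample series are illustrative; any function that maps a history to a one-step forecast could be passed in instead.

```python
from statistics import mean


def naive_forecast(history):
    """Naïve method: next week's backlog equals the last observed one."""
    return history[-1]


def rolling_mae(series, forecaster, min_train=1):
    """Mean Absolute Error of one-week-ahead forecasts.

    For every week t >= min_train, the forecaster sees only the
    observations before t and predicts the value at t.
    """
    errors = []
    for t in range(min_train, len(series)):
        predicted = forecaster(series[:t])
        errors.append(abs(series[t] - predicted))  # Absolute Error
    return mean(errors)


backlog = [100, 107, 104, 104, 100]  # hypothetical weekly backlog sizes
print(rolling_mae(backlog, naive_forecast))  # 3.5
```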
We also calculated a variant of the standardized accuracy measure ($SA_m$) [15], which shows the relative improvement in accuracy in comparison to the Naïve method and was calculated according to Equation 13.
$$SA_m = \left(1 - \frac{MAE_m}{MAE_n}\right) \times 100\% \tag{13}$$
where:
- $MAE_m$ - Mean Absolute Error for the method m,
- $MAE_n$ - Mean Absolute Error for the Naïve method.
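A small worked example may help to read the SA values. The sketch assumes the measure is the percentage reduction of MAE relative to the Naïve method, as the definitions above suggest; the function name and the MAE values are made up.

```python
def standardized_accuracy(mae_method, mae_naive):
    """Relative improvement over the Naïve method, in percent.

    Assumed form: (1 - MAE_m / MAE_n) * 100. Positive values mean the
    method beats the Naïve baseline; negative values mean it is worse.
    """
    if mae_naive <= 0:
        raise ValueError("the Naïve MAE must be positive")
    return (1 - mae_method / mae_naive) * 100


print(standardized_accuracy(7.5, 10.0))   # 25.0
print(standardized_accuracy(12.5, 10.0))  # -25.0
```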
We used the non-parametric Wilcoxon signed-rank test to compare AE between prediction models, with significance level α = 0.05, and Cliff's δ effect-size coefficient to quantify the strength of the observed difference. Cliff's δ evaluates how often the values in one set are larger than the values in the second set. The thresholds used for interpreting Cliff's δ, proposed by Kitchenham et al. [16], are as follows: |δ| < 0.112 - negligible, 0.112 ≤ |δ| < 0.276 - small, 0.276 ≤ |δ| < 0.428 - medium, and |δ| ≥ 0.428 - large.
We also calculate the number needed to treat (NNT = δ^{-1}) measure, which is commonly used in the field of medical science. NNT indicates how many patients need to be treated with a drug to heal one patient and is a measure of the medicine's effectiveness. The lower the NNT value, the faster the improvement is achieved. In the context of this study, NNT can be interpreted as the number of weeks one would have to use a given prediction method A instead of method B to observe an improvement in accuracy for at least one week. We also calculated the average NNT to aggregate information from all the projects.
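The effect-size computations can be illustrated with a direct pairwise implementation. The sketch is for illustration only: the function names are hypothetical, the quadratic pairwise loop is chosen for clarity rather than speed, and taking the absolute value of δ when computing NNT is an assumption.

```python
import math


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs (x, y)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def interpret_delta(delta):
    """Map |delta| to the labels proposed by Kitchenham et al. [16]."""
    size = abs(delta)
    if size < 0.112:
        return "negligible"
    if size < 0.276:
        return "small"
    if size < 0.428:
        return "medium"
    return "large"


def number_needed_to_treat(delta):
    """NNT = 1 / |delta| (absolute value is an assumption of this sketch)."""
    return math.inf if delta == 0 else 1 / abs(delta)


errors_a = [1, 2, 3]  # hypothetical absolute errors of method A
errors_b = [2, 3, 4]  # hypothetical absolute errors of method B
delta = cliffs_delta(errors_a, errors_b)
print(interpret_delta(delta))         # large
print(number_needed_to_treat(0.125))  # 8.0
```

In the example, one pair favours a larger error for A and six pairs favour B, so δ = (1 − 6)/9, which falls in the "large" band.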

V. RESULTS AND DISCUSSION
Mean prediction errors for the three considered methods and the Naïve method are presented in Table V, while the mean errors transformed to standardized accuracy measures (SA m ) are presented in Table VI.
ARIMA, ETS, and MS predicted backlog sizes with mean errors lower than the Naïve method for most of the projects (mean SA equal to ca. 17.1%, 17.7%, and 10.3%, respectively). However, there were three projects for which the Naïve method performed better than all three other methods. The effect size and the results of the Wilcoxon signed-rank tests comparing AE (intra-project level) between the considered models and the Naïve one are presented in Table VII. For each of the considered models, there were at least 10 projects for which a statistically significant difference in the central tendency of AE was detected. The effect size was at least "small" for nearly half of the projects, i.e., 10/20 (ARIMA vs. Naïve), 9/20 (ETS vs. Naïve), and 11/20 (MS vs. Naïve). That translates to NNT at the levels of 8 weeks for ARIMA, 14 weeks for ETS, and 36 weeks for MS. Finally, we performed the Wilcoxon signed-rank test at the level of MAE (dataset level). In all three cases, the difference in the central tendency of MAE between the considered models and the Naïve method turned out to be statistically significant. Therefore, we conclude that all of the considered methods outperform the Naïve method (RQ1).

In the next step, we compared the accuracy of the state-of-the-art MS method and the two methods proposed in this paper that are based on ARIMA and ETS. As follows from Table V, the lowest mean MAE was observed for ETS (16.36) and ARIMA (16.70); however, the mean MAE for MS was only ca. 5% higher (17.28) than the one observed for ETS. It is also visible that the methods perform consistently for all projects, i.e., there are no methods that would visibly outperform other methods on a single project. Also, as follows from Table VIII, only for 3-4 projects the observed

A. Threats to Validity
We address the threats to validity in the manner described by Wohlin et al. [17] and de França et al. [14].
a) Construct validity: The main construct validity threat concerns the process of cleaning the data. Each backlog was divided into fragments at the places of sudden falls in the number of defects. The rule for determining the sudden falls was the same for all backlogs. It places a division point between weeks for which the difference in the number of defects in the backlog was more than the 99.5th percentile of the differences between successive weeks in that backlog. As a result, all backlogs were divided, even those in which no aggressive declines had really taken place.
b) Internal validity: There exists a threat to the internal validity of this study regarding the validity of reported defects. For OSS projects, all users can report defects to Bugzilla. The new reports may be duplicates or may not be real defects. We cannot control who submits reports and what they contain. Because of that, the size of some defect backlogs is extremely large.
c) External validity: The main threat to the external validity of our results is the fact that we applied the defect backlog prediction methods only to a selected sample of well-established OSS projects that maintain public Bugzilla instances. We do not know whether our results would also apply to smaller OSS projects; however, it is questionable whether such projects would benefit from defect backlog predictions. Also, we limited our study to OSS projects only; therefore, we would be careful in generalizing the findings to industrial projects, since the ways of working differ visibly between OSS and industrial settings. Even when it comes to OSS projects themselves, we have to be aware that the process could be less stable over time than it is in industry (e.g., the number of contributors involved, the number of commits they produce, or the number of defects they fix can vary over time). Also, we used the 1-week prediction horizon, following the previous studies at Ericsson AB; however, we cannot claim based on our results that the methods will behave the same way if a longer prediction horizon is needed by a given OSS community.
d) Conclusion validity: The main threat to conclusion validity concerns performing multiple statistical inference tests while drawing some of the conclusions (statistical inference tests at the intra-project and dataset levels). We set the local significance level α to 0.05; however, the true, global significance level would be much higher. Still, the outcomes of the statistical inference tests were only one of a few sources of information that we used to draw the conclusions; therefore, rejecting a true null hypothesis would have only a minor impact on the final conclusions.

VI. CONCLUSIONS AND FUTURE DIRECTION
In this paper, we evaluated three defect backlog prediction methods in the context of open-source projects, i.e., the state-of-the-art Meding-Staron model (MS) and two new models based on the Autoregressive Integrated Moving Average (ARIMA) and Exponential Smoothing (ETS) time-series forecasting methods.
We compared the accuracy of these methods on a dataset consisting of the defect backlog histories of 20 large open-source projects (ranging from 8 to 22 years). In the first step, we performed a sanity check by comparing the accuracy of these methods with the accuracy of the so-called Naïve method, which predicts the defect backlog size in the following week to be the same as the one observed in the current week. All three methods outperformed the Naïve method by ca. 10% to 17.7% (the largest improvement was observed for the ETS method). The observed differences in mean absolute errors (MAE) were statistically significant for all the methods. With respect to effect size, one would have to use these methods instead of the Naïve one for 8 to 36 weeks to, statistically, notice an improvement in defect backlog predictions for at least one week.
Exponential Smoothing (ETS) provided slightly more accurate defect backlog size predictions than the state-of-the-art MS method (by ca. 5%) and the prediction method based on ARIMA (by ca. 2%). Anyone who uses the ETS-based defect backlog prediction method instead of the MS method should, statistically, notice an improvement in defect backlog predictions for at least one week after using it for 32 weeks.

Fig. 2: The differenced time series of defect backlog level in the Open Office Draw project.
presents the classification of exponential smoothing methods.PAULINA ANIOŁA ET AL.: DEFECT BACKLOG SIZE PREDICTION FOR OPEN-SOURCE PROJECTS 85

TABLE I: Classification of exponential smoothing methods.

TABLE II: Prediction accuracy in a large-size project for a 1-week interval [5].
ranges from 112 defects/week for Kernel Networking to 29,050 defects/week for Mozilla Core.

TABLE III: Dataset of OSS projects under study.

TABLE IV: Average defect backlog sizes in OSS projects (defects/week).

TABLE V: Defect backlog size prediction errors.

TABLE VI: Defect backlog prediction improvement (SA) compared to the Naïve method predictions.

TABLE VII: Effect size (n - negligible, s - small, m - medium) and Wilcoxon signed-rank test results for the comparison between ARIMA, ETS, MS, and the Naïve method (T - null hypothesis rejected with α = 0.05).