udCATS: A Comprehensive Unsupervised Deep Learning Framework for Detecting Collective Anomalies in Time Series

Anomaly detection has recently gained enormous attention from the research community. It is widely applied in many industrial areas, such as information security, finance, banking, and insurance. Data in these fields can mostly be represented as time series, so time series anomaly detection plays an essential role in these applications. Many authors have therefore addressed the problem of collective anomaly detection in time series and have proposed several approaches, from classical methods such as Isolation Forests to modern deep learning networks such as Autoencoders. However, a comprehensive framework for handling this problem is still lacking. In this work, we first propose using an Attention-based Bidirectional LSTM Autoencoder (Att-BiLSTM-AE) as an anomaly detection model. Furthermore, as the main contribution of this paper, we develop a comprehensive unsupervised deep learning framework, udCATS, to solve the problem of detecting collective anomalies in time series. Our experiments show that the Att-BiLSTM-AE outperforms other detection models, and that using it within the udCATS framework increases the detection accuracy.


I. INTRODUCTION
Anomaly detection plays an essential role in many industrial areas, for example, finance, banking, information security, and insurance. Much of the data in these domains can be represented as time series. As a result, anomaly detection in time series data has recently gained massive attention from the research community.
A time series can be univariate or multivariate, discrete or continuous. In this work, we focus only on discrete univariate time series; the term "time series" in the rest of this article therefore refers to a discrete univariate time series. By definition, a time series is a set of data collected at successive, discrete timestamps and can be written as $\{X_t, t \in \mathbb{Z}\}$ [1]. An anomaly in a time series can be regarded as an outlier. From the traditional point of view, an outlier/anomaly is an observation that varies "extensively from the other one as to produce suspicions that it was generated by a different mechanism" [2]. An anomaly in a time series can carry important information. For example, it may consist of unwanted data points that were produced or collected incorrectly. In this case, anomaly detection is essential for data cleaning, which is crucial for developing proper machine learning models. An anomaly can also represent an event of interest, such as a machine breakdown, a cyber-attack, or insurance fraud; detecting such events is the main application of anomaly detection in time series.
The anomalies in time series can be divided into three main categories: point, collective, and contextual anomalies [3]. A data point is a point anomaly when it behaves out of the ordinary compared to most other points. The term collective anomaly refers to a run of consecutive data points with unusual behavior. It is important to note that the individual points of an abnormal sub-sequence are not necessarily outliers themselves. The term contextual anomaly is used when some time series points are typical in one context but anomalous in another [3].
We focus here on collective anomaly detection because detecting collective outliers is much more challenging than detecting unusual points. As mentioned above, a single data point in a sub-sequence may not be an outlier on its own; the points only form an abnormal sub-sequence when they are considered in consecutive order. Besides that, the problem of point anomaly detection is already well researched [4]. In contrast, the detection accuracy for collective anomalies can still be improved by proposing or applying contemporary deep learning networks. In our work, we first propose using an attention-based bidirectional LSTM Autoencoder (Att-BiLSTM-AE) as an anomaly detection model. Furthermore, as the main contribution of this paper, we develop a comprehensive unsupervised deep learning framework called udCATS to solve the problem of detecting collective anomalies in time series. Our experiments show that the Att-BiLSTM-AE outperforms other detection models and that using it within the udCATS framework increases the detection accuracy.
The rest of this paper is organized as follows. Section II reviews selected unsupervised learning approaches for detecting collective anomalies. Section III describes the udCATS framework, which comprises four primary processes. Section IV details our experiments and discusses their results, and Section V outlines how we intend to improve the framework further.

II. RELATED WORK
Many methods and approaches have been proposed to detect collective anomalies in time series. By learning paradigm, they can be grouped into two categories: supervised and unsupervised detection methods. By underlying technique, they can be divided into three groups: statistical, classical machine learning, and deep learning models [3].
Supervised methods typically achieve higher detection precision; however, they are of limited practical use because they require labeled data sets, which are rarely available in realistic settings. Labeling is nowadays one of the most costly steps in a machine learning pipeline. Unsupervised methods, on the other hand, are much more practical and valuable, but achieving high accuracy with unsupervised learning models is very demanding. Deep learning models have demonstrated their robustness and accuracy in an unsupervised manner compared to statistical and classical machine learning models [5], [6]. In this section, unsupervised approaches applied to collective anomaly detection in time series are discussed briefly [6]-[11].
One of the most straightforward ideas to detect anomalies in an unsupervised manner is applying clustering algorithms such as K-Means Clustering [8] or Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [9]. Detailed descriptions of these clustering algorithms are provided by [12], [13], and [14].
C. Mete, F. Dadaşer-Çelik, and A. Dokuz [9] applied DBSCAN to detect anomalies in a dataset that contains the daily average temperature over 33 years. The authors segmented the time series into monthly sequences, normalized them by their mean and variance, and then clustered them with DBSCAN. The results show that DBSCAN can detect collective anomalies even if there is no significant difference between them and the usual data points. Keogh and Lin [8], nevertheless, have argued that using clustering algorithms for collective anomaly detection is meaningless. They showed that the cluster centers discovered in several runs of the K-means algorithm on the same dataset are not noticeably different from those obtained for a random walk process. Some authors have tried to analyze and overcome this problem; however, it remains unsolved [15].
L. Bontemps, V.L. Cao, J. McDermott, and N.A. Le-Khac [7] proposed an LSTM-based collective anomaly detection model. First, the time series is modeled with an LSTM RNN [16]. The predictive model is then extended with a circular array containing the prediction errors from several recent time steps. Finally, a predetermined threshold is applied to indicate a collective anomaly. To evaluate the model, the authors converted the KDD 1999 dataset [17] into a time series version. The results showed that the model could detect 86% of the collective anomalies without any false alarm. If the threshold is set to capture all the anomalies, the number of false alarms increases to 63.
Besides LSTM networks, other deep learning models have also been proposed for detecting collective anomalies in time series, such as Convolutional Neural Networks (CNN) [6], Gated Recurrent Units (GRU) [10], and Autoencoders [11]. The results show that, in general, deep learning models perform very well for collective anomaly detection in time series data.
We can draw some important conclusions from a comprehensive literature review, especially from the selected publications discussed above:
• There is still no comprehensive framework for detecting collective anomalies in time series. Detecting collective anomalies is not as trivial as feeding the time series into a detection model and reading off the results. It requires several steps, for example, splitting the time series into sub-sequences, reducing the data dimension, and scaling the features.
• Clustering-based approaches are not suitable for this kind of problem.
• Deep learning models produce highly accurate results when solving the problem of collective anomaly detection.
For these reasons, we propose a comprehensive framework, called udCATS, for detecting collective anomalies in time series in an unsupervised manner. The framework uses an Attention-based Bidirectional Long Short-Term Memory Autoencoder as the anomaly detection engine. All the components of the udCATS framework are essential for solving the problem.

III. UDCATS FRAMEWORK
This section explains the udCATS framework in detail. It first clarifies the architecture and then each component of the framework.

A. Framework Architecture
The framework contains four components: time series segmentation, representation, scaling, and the anomaly detection engine. The time series is first segmented into sub-sequences, which are then transformed to reduce their high dimensionality. These processes are called segmentation and representation. The output of the representation process is used as the input for the data scaling process. Finally, an Attention-based Bidirectional Long Short-Term Memory Autoencoder is used to detect abnormal samples. If a sample is classified as an anomaly, it can be traced back to its original sub-sequence in order to locate the collective anomaly. Figure 1 illustrates the architecture of the udCATS framework.
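As a rough illustration of how a flagged sample can be traced back to its original sub-sequence, the sketch below cuts a series into fixed-length windows and remembers where each window starts. The function names, the non-overlapping fixed-length windows, and the boolean flags standing in for the detector output are assumptions made for this illustration; they are not taken from the udCATS implementation.

```python
def fixed_windows(series, width, step=None):
    """Cut the series into windows of `width` points.

    Returns (start_index, values) pairs so that every sample keeps a
    pointer back to its position in the original series.
    """
    step = step or width  # assumption: non-overlapping windows by default
    return [
        (start, series[start:start + width])
        for start in range(0, len(series) - width + 1, step)
    ]


def flagged_ranges(windows, flags, width):
    """Map per-sample anomaly flags to (first, last) indices in the series."""
    return [
        (start, start + width - 1)
        for (start, _values), is_anomalous in zip(windows, flags)
        if is_anomalous
    ]


series = list(range(10))
windows = fixed_windows(series, width=4)
# Hypothetical detector output: the second sample is flagged.
ranges = flagged_ranges(windows, [False, True], width=4)
```

Keeping the start index next to each sample means the later representation and scaling steps can change the values freely without losing the link to the raw time stamps.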
Each of the components mentioned above is a selection process, which means that different methods can be selected based on the nature of the input time series. For time series segmentation, top-down, bottom-up, or sliding windows can be selected, while non-data adaptive, data-adaptive, and model-based approaches are the most prominent time series representation approaches. Data-dictated representation can also be found in the literature; however, it is not widely used for this task. Based on our experiments, we recommend an Attention-based Bidirectional Long Short Term Memory Autoencoder as the anomaly detection engine. This choice is not mandatory: another deep learning network can be used for this part, depending, as explained, on the nature of the input data. Lastly, udCATS offers standardization, normalization, and robust scaling for the data scaling process.
The remainder of this section describes each element of the framework in detail.

B. Time Series Segmentation
Time series segmentation is a method of time series analysis in which an input time series is divided into a sequence of discrete segments, called sub-sequences, to reveal the underlying properties of its source [18]. An optimal segmentation algorithm is defined as the one with minimal approximation error, calculated from the difference between the segmented sub-sequences and the original time series. Figure 2 visualizes the segmentation process of the proposed udCATS framework, which is inspired by the work of M. Lovric, M. Milanovic, and M. Stamenkovi [18]. The following paragraphs describe the most well-known segmentation algorithms: sliding windows, top-down, and bottom-up [18].
Sliding Windows, also called the "brute-force" or "one-pass" algorithm [18], is one of the most widely used time series segmentation algorithms. It starts by setting the first data point as the anchor. An initial window size is then chosen, and the approximation error of the potential segment is calculated for this size. Next, the window size is increased until the approximation error exceeds a predetermined threshold. A segment is then created with the largest window size that stays within the threshold, and the data point right after the created segment becomes the new anchor. This process is repeated until the sliding windows have covered the entire time series.
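The sliding windows procedure can be sketched in a few lines of Python. The sketch assumes one particular approximation error, namely the sum of squared residuals around the straight line that joins the two end points of a segment; the text does not fix this choice, and the function names are illustrative only.

```python
def interpolation_error(series, start, end):
    """Sum of squared residuals between series[start..end] and the straight
    line joining its two end points (an assumed error measure)."""
    length = end - start
    if length < 2:
        return 0.0  # one or two points are reproduced exactly
    y0, y1 = series[start], series[end]
    error = 0.0
    for i in range(start, end + 1):
        fitted = y0 + (y1 - y0) * (i - start) / length
        error += (series[i] - fitted) ** 2
    return error


def sliding_window_segments(series, max_error):
    """Return (start, end) index pairs, both inclusive, covering the series."""
    segments = []
    n = len(series)
    anchor = 0
    while anchor < n:
        end = min(anchor + 1, n - 1)
        # Grow the window while the next larger candidate stays within the threshold.
        while end + 1 < n and interpolation_error(series, anchor, end + 1) <= max_error:
            end += 1
        segments.append((anchor, end))
        anchor = end + 1  # the new anchor is the point right after the segment
    return segments


segments = sliding_window_segments([0, 1, 2, 3, 10, 11, 12, 13], max_error=0.5)
```

On this toy series the jump from 3 to 10 forces a cut, so the two linear runs end up in separate segments.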
The Top-Down algorithm considers the original time series as one major segment. It starts by finding the breaking point that divides the time series into two parts with the maximal difference between them. The approximation error is then calculated for both segments and compared with the predetermined threshold. These steps are repeated for every segment whose approximation error still exceeds the threshold, until all segments fall below it [18].
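A minimal recursive sketch of top-down segmentation follows, in which a segment is split further only while its approximation error is still above the threshold. Two choices here are assumptions rather than details from the text: the error is the sum of squared residuals around the line joining the segment end points, and the breaking point is the split that minimizes the combined error of the two halves, which is a common stand-in for "maximal difference between the parts".

```python
def interpolation_error(series, start, end):
    """Sum of squared residuals around the line joining the end points
    of series[start..end] (an assumed error measure)."""
    length = end - start
    if length < 2:
        return 0.0
    y0, y1 = series[start], series[end]
    return sum(
        (series[i] - (y0 + (y1 - y0) * (i - start) / length)) ** 2
        for i in range(start, end + 1)
    )


def top_down_segments(series, start, end, max_error):
    """Recursively split series[start..end]; neighbouring segments share
    their breaking point."""
    if end - start < 2 or interpolation_error(series, start, end) <= max_error:
        return [(start, end)]
    # Assumed split criterion: minimise the summed error of both halves.
    best_split = min(
        range(start + 1, end),
        key=lambda s: interpolation_error(series, start, s)
        + interpolation_error(series, s, end),
    )
    left = top_down_segments(series, start, best_split, max_error)
    right = top_down_segments(series, best_split, end, max_error)
    return left + right


triangle = [0, 1, 2, 3, 2, 1, 0]
segments = top_down_segments(triangle, 0, len(triangle) - 1, max_error=0.5)
```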

TRUONG SON PHAM ET AL.: UDCATS: A COMPREHENSIVE UNSUPERVISED DEEP LEARNING FRAMEWORK
The Bottom-Up algorithm is the opposite of the top-down algorithm described above. It starts by segmenting the time series of length n into n − 1 segments. Then, for each segment, the increase in approximation error that would result from merging it with its left or right neighbor is calculated, and the merge with the smallest error increase is carried out. The merging process is repeated until the approximation error of a merged segment would exceed a predetermined threshold [18].
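The merging loop can be sketched as follows. As an assumption, the cost of a merge is taken to be the approximation error of the merged segment, measured as the sum of squared residuals around the line joining its end points; the initial n − 1 segments each join two neighbouring points and therefore start with zero error.

```python
def interpolation_error(series, start, end):
    """Sum of squared residuals around the line joining the end points
    of series[start..end] (an assumed error measure)."""
    length = end - start
    if length < 2:
        return 0.0
    y0, y1 = series[start], series[end]
    return sum(
        (series[i] - (y0 + (y1 - y0) * (i - start) / length)) ** 2
        for i in range(start, end + 1)
    )


def bottom_up_segments(series, max_error):
    """Greedily merge neighbouring segments while the cheapest merge
    stays within the threshold."""
    # Segment k spans bounds[k]..bounds[k + 1]; initially n - 1 segments.
    bounds = list(range(len(series)))
    while len(bounds) > 2:
        # Removing interior boundary k merges the two segments around it.
        costs = [
            interpolation_error(series, bounds[k - 1], bounds[k + 1])
            for k in range(1, len(bounds) - 1)
        ]
        cheapest = min(range(len(costs)), key=costs.__getitem__)
        if costs[cheapest] > max_error:
            break  # every further merge would exceed the threshold
        del bounds[cheapest + 1]
    return list(zip(bounds[:-1], bounds[1:]))


segments = bottom_up_segments([0, 1, 2, 3, 2, 1, 0], max_error=0.5)
```

Recomputing all merge costs in every pass keeps the sketch short; a production version would update only the two costs next to the merged segment.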

C. Time Series Representation
Unsupervised detection methods often do not directly use the original time series data points as the input. Instead, representations of the time series are used. The representation is helpful for dimension reduction and similarity measurement and often helps produce better results [19].
There are four main approaches to time series representation: non-data adaptive, data-adaptive, model-based, and data-dictated representation [20]. With the first three approaches, the parameters can be fine-tuned to find the best time series compression for the particular application. With the last one, however, the time series itself dictates the compression. For this reason, only non-data adaptive, data-adaptive, and model-based approaches are offered in the selection process for time series representation.
In non-data adaptive algorithms, the transformation parameters remain the same for all time series, independent of their nature. Some of the most widely used non-data adaptive algorithms are the Discrete Fourier Transform (DFT), Piecewise Aggregate Approximation (PAA), the Discrete Cosine Transform (DCT), and Wavelets [20].
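Piecewise Aggregate Approximation is the simplest of these to illustrate: the sub-sequence is cut into a fixed number of frames and each frame is replaced by its mean. The sketch below is a minimal version; the function name and the integer rule for frame boundaries are assumptions for this example.

```python
def paa(series, n_frames):
    """Piecewise Aggregate Approximation: represent the series by the mean
    of each of `n_frames` consecutive, nearly equal-sized frames."""
    n = len(series)
    if not 1 <= n_frames <= n:
        raise ValueError("n_frames must be between 1 and len(series)")
    frames = []
    for k in range(n_frames):
        lo = k * n // n_frames          # first index of frame k
        hi = (k + 1) * n // n_frames    # one past the last index of frame k
        frames.append(sum(series[lo:hi]) / (hi - lo))
    return frames


reduced = paa([1, 3, 5, 7, 9, 11], n_frames=3)
```

The frame count is fixed in advance and does not depend on the data, which is what makes the method non-data adaptive.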
In data-adaptive representations, the parameters vary depending on the available data. Well-known methods for data-adaptive representation include Symbolic Aggregate Approximation, Piecewise Linear Approximation, and Singular Value Decomposition [20].
The model-based approaches assume that the observed time series was generated by an underlying model. The aim is to find the parameters of such a model and use them as the representation. Two time series are then considered similar if an identical set of parameters can model them. The model can be a Hidden Markov Model, a statistical model, or even a deep learning model [20].

D. Data Scaling
For the scaling process, we propose selecting from three of the most common techniques: normalization, standardization, and robust scaling. Readers are referred to [21] for more detailed explanations of these scaling methods and of how to select the right one based on the data distribution and the application.
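For orientation, the three options can be written down with Python's standard library alone. The definitions below follow common usage and are assumptions, not quotations from [21]: normalization is taken to be min-max scaling to [0, 1], standardization subtracts the mean and divides by the population standard deviation, and robust scaling subtracts the median and divides by the interquartile range.

```python
import statistics


def min_max_normalize(values):
    """Rescale to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def standardize(values):
    """Zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std if std else 0.0 for v in values]


def robust_scale(values):
    """Centre on the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr if iqr else 0.0 for v in values]


data = [1, 2, 3, 4, 100]  # one extreme value
normalized = min_max_normalize(data)
robust = robust_scale(data)
```

With the extreme value 100 in the input, min-max scaling squeezes the four ordinary points into a narrow band near zero, whereas the robust variant keeps them spread out because the median and quartiles are barely affected by the outlier.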

E. Attention-based Bidirectional Long Short Term Memory Autoencoder as the Anomaly Detection Engine
As mentioned above, several detection models can detect collective anomalies after the segmentation, representation, and scaling processes. Examples are the One-Class Support Vector Machine, Isolation Forest, and Autoencoder. However, we recommend using an Attention-based Bidirectional Long Short Term Memory Autoencoder as the anomaly detection engine. This recommendation is also inspired by previous works [22]-[25], whose authors have demonstrated the efficiency and robustness of LSTM and Bidirectional LSTM Autoencoders for the anomaly detection problem. Figure 3 illustrates a simplified structure of an Attention-based Bidirectional Long Short Term Memory Autoencoder. Due to page limitations, we do not describe the network in detail. Instead, interested readers are referred to [25]-[27] for more information.

IV. EXPERIMENTS AND RESULTS
This section describes the dataset, accuracy measurement, and the results of the experiments.

A. Dataset Description and Experiment Settings
The data used for the experiments in this article is the S5 dataset, provided by Yahoo [28]. This is a labeled benchmark dataset for anomaly detection. We compared the above-mentioned unsupervised methods based on their performance on this dataset. It is important to note that the data labels are only used for the performance evaluation and not for the model training process.
The time series dataset represents the traffic of Yahoo services. The anomalies were labeled by experts. The dataset consists of 67 different time series. Each of them has 1400 data points, which were recorded hourly. About 1.9% of the data are anomalies. The dataset is divided into training and test sets, where 70% of the data are used for training and 30% for testing. The training set does not contain any abnormal sub-sequence. Figure 4 visualizes a time series with collective anomalies colored red.

B. Accuracy Measurement
Because the anomalies in the test set are labeled, the AUC can be used to evaluate the framework's performance.
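As a reminder of what this measure captures, the sketch below computes the AUC under the assumption that it denotes the area under the ROC curve, which equals the probability that a randomly chosen anomalous sample receives a higher anomaly score than a randomly chosen normal one. The labels and scores are made-up toy values, not results from the experiments.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for anomalous samples, 0 for normal samples.
    scores: anomaly scores, higher meaning more anomalous.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one normal and one anomalous sample")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0   # anomaly ranked above the normal sample
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic in the number of samples; it is meant to make the definition explicit, not to replace a rank-based implementation on large test sets.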

C. Results
This part of the section discusses the results of the experiments. In the mandatory segmentation process, the optimal length of a sub-sequence was experimentally set to 4, and the sliding windows algorithm proved to be the most suitable segmentation method for this dataset. Because the window size is tiny, a non-data adaptive method was applied for the representation process. Finally, the transformed vectors were scaled with a robust scaler. The experimental results show that all four main processes of the framework are essential for high detection accuracy; omitting any one of them leads to lower performance. Figure 6 visualizes the performance of the udCATS framework with different detection models. The figure shows that the scaling process of the comprehensive udCATS framework improved the accuracy of five detection models, while the remaining two models performed at the same level. Moreover, the udCATS framework with the Attention-based Bidirectional Long Short Term Memory Autoencoder as the anomaly detection engine achieved the highest accuracy, as measured by the AUC values. To obtain the best results, the confidence interval of the detection model was fixed at 0.95.
Table I shows the average AUCs of the models in different settings, while Figure 7 illustrates the box plot of the udCATS framework's AUCs over the whole dataset. Besides the mean of the AUCs, which is 0.91, the box plot also shows their median, which is very high at around 0.97. The box is short, which means that the udCATS framework performs consistently across all 67 time series of the dataset.

V. CONCLUSION AND OUTLOOK
In this work, we provided two main contributions. Firstly, we experimentally demonstrated that an Attention-based Bidi-
As a next step, we will assess the framework on more benchmark data sets, which should guide improvements to the framework architecture. Afterward, we will extend the selection processes with other methods and look for a way to make these processes run fully automatically. Last but not least, we could combine the loss functions of the four individual processes into one total loss function. The idea is to develop an end-to-end training process that improves accuracy.

Fig. 4. A Time series with collective anomalies

Fig. 5. Importance of the representation process
Figure 5 illustrates the importance of the representation process. The figure shows that the accuracy of six models (out of seven) improves when the representation process is applied, while the accuracy of the remaining one stays the same. Another critical remark is, together with LSTM AutoEncoder,