Use of Traffic Sampling in Anomaly Detection for High-Throughput Network Links

Anomaly detection is an increasingly important topic, both in research and in production systems. Information about a system malfunction enables precise diagnostic and corrective actions. Anomaly detection systems rely on two main approaches, statistical analysis and machine learning, both of which are computationally complex, especially when dealing with high traffic volumes in computer networks. In this paper, limiting the sampling frequency of network traffic parameters is proposed as a technique to reduce the computational complexity of anomaly detection methods. The proposed approach has been verified in a real network link monitoring system of a medium-sized ISP. The results obtained are promising and can be used to build production early warning systems for security incident detection on high-speed access links.


I. INTRODUCTION
DISTRIBUTED information systems are becoming increasingly prevalent in critical areas of human life. For instance, they are used to control city traffic [1], [2], monitor patients' vital signs [3], or manage technological processes in smart factories [4]. These information systems are exposed to a number of new types of cybersecurity threats. Ready-made tools for executing attacks are available on the market, which contributes to the constant increase in the number of security incidents. During the pandemic period alone, cybercrime increased by 600% [5], and the average cost of a data security breach in the U.S. in 2022 was USD 4.35 million [6]. There is no single effective system of protection against these threats. Today, threat detection and elimination systems have a cascade structure, i.e., they consist of many interconnected layers in which IDS, IPS, ACL, and similar mechanisms operate. Each type of layer is sensitive to different types of attacks. On carrier access links, such as those used by Internet Service Provider (ISP) companies, simple Access Control List (ACL) rules that filter network traffic based on source and destination addresses are generally applied. Even with such a simple mechanism, deploying a larger number of ACLs, or logging which flow was blocked by which ACL, can introduce significant delays in the transmission path. Therefore, the authors posed the following question during their research: is it possible to detect anomalous behavior without introducing additional delay while reducing the computational complexity of the detection process?
Anomaly detection is an important data analysis task that identifies anomalous or abnormal data in a given data set. Preliminary research has shown that a cyberattack can change the statistical characteristics of network traffic on the access link. Therefore, the analysis of descriptive link parameters, statistical techniques, or artificial intelligence can be applied at an access link on the border of the protected network to detect the threat. Anomaly detection is widely used in fields such as medicine, public health, fraud detection, intrusion detection, industrial damage detection, image processing, sensor networks, robot behavior, and astronomical data analysis [7]. Current research concentrates on speeding up the detection process, reducing the computational complexity of the entire process, and not only identifying the occurrence of a given anomaly but also eliminating its causes.
At present, there is a clear trend toward identifying the best AI models for anomaly detection on ISP links in order to achieve the best possible detection performance. Applications in this area include both supervised and unsupervised methods [8], [9], [10], [11]. Network traffic sampling methods [12] were previously used for anomaly detection with traditional IDS probes. Such methods were applied, for example, in [9], and the obtained results look promising. They allow for preliminary verification of anomaly detection in large volumes of network traffic. However, a large body of work in this field is based on previously prepared test datasets [13], [14], [15] or on data obtained from real links with low throughput [16]. Preliminary results of the conducted research have shown that, in addition to data sampling, proper preparation of the acquired data and flow aggregation have a positive impact on detection outcomes. Data preprocessing can also be computationally complex, but it can be easily parallelized and distributed among system nodes [17], [18]. The available literature clearly pursues higher accuracy of predictive models, but their applicability in real computer networks must not be forgotten. In this study, the authors investigate the impact of data set impoverishment (sampling) on the sensitivity of the anomaly detection model and whether the number of processed traffic samples can be limited while maintaining the detection level. The entire study was conducted in a production network of an ISP (Enf sp. z o.o.). The developed detection layer at the ISP access link can serve as an additional layer of protection against cyberattacks in cascade anomaly detection systems [19]. If the detection effectiveness of the model decreases only slightly as the traffic sampling frequency decreases, less measurement data has to be transmitted and less processing time is required, which increases the applicability of the solution in real networks.
The article is structured as follows. Section II presents the network structure of the ISP access node and the architecture of the data acquisition and processing system. Section III discusses the data aggregation and sampling process in detail. Section IV describes the model used for anomaly detection. Section V presents the obtained results, including the accuracy of detection in relation to the sampling frequency. Section VI summarizes the results and indicates directions for further research.

II. ISP EDGE NODE TOPOLOGY
As mentioned earlier, the research was conducted in the environment of a medium-sized ISP, and real network traffic from end customers was analyzed. To carry out the research, it was necessary to modify the structure of the access node used in the system. The system structure is shown in Figure 1. The access router (Extreme MLX-4) connects the entire network segment to the Internet using the BGP protocol. The core of the access network is built on two switches: Extreme 690 (CORE switch) and Extreme 670 (S1 switch). Policy shaping and NAT for the LAN segment are implemented by a software router (TC + IPTables) running on a Dell R710 server. Two additional hosts, PC1 and PC2, were introduced into the network. PC1 was connected to the LAN through a Dasan switch, while PC2 was connected through a TP-Link switch in the demilitarized zone of the access node. The task of PC2 was to emulate an attack on PC1. All traffic transmitted to the LAN is directed through port P1. Using the port mirroring mechanism, the traffic from port P1 is copied to the Dell PowerEdge R940 server, where the calculations related to anomaly detection are performed. This server had the following specifications: Intel(R) Xeon(R) Gold 624 CPU @ 2.60GHz processor; 128 GB of RAM; NVIDIA Tesla V100-PCIE-16GB GPU; HDD 4.5 TB. The PowerEdge R940 server hosted a virtual machine running Debian, which collected bidirectional traffic using tcpdump. The laboratory setup allowed data to be collected from layer 1 to layer 7 of the ISO/OSI model, capturing individual packets of specific network flows with the tcpdump sniffer. Such an environment made it possible to test various data processing techniques and AI algorithms in order to determine the optimal sampling frequency at which the created models would still effectively detect abnormal periods in the packet flow of the investigated network.
In the next step, a system was built that allows the sampling frequency to be changed smoothly. During the conducted research, the entire traffic from port P1 was collected. The entire sampling process was performed on the PowerEdge R940 server, which made it possible to repeat the tests for different sampling frequencies. Ultimately, in production systems, the sampling frequency can be set on a specific probe installed in the network. This not only reduces the amount of processed data but also limits the amount of data transmitted between the probe and the detection system. Additionally, initial data preprocessing can be performed on the measurement probe (in the test system, port P1 acts as the probe). Sequential packet selection with a fixed period between consecutive samples was used in the sampling process. In other words, all collected packets were labeled with consecutive natural numbers, and only those packets whose indexes were multiples of a selected natural number s, such as s = 2 (sampling every other packet), were chosen for further analysis. A different statistical distribution of samples can also be applied, which will be the subject of further research.
The data received from the ISP network was saved in the .dump file format. Subsequently, it was divided into equal time intervals (windows). Each window represents a short period of network operation that is evaluated by the machine learning model in order to classify the entire window as either anomalous or not. The order of the sampling and windowing processes is interchangeable. In the next step, the CICFlowMeter software [20] was used for feature extraction. As a result of its operation, CSV files containing feature vectors describing each analyzed packet were obtained. These files were used in further analysis for feature selection and aggregation, which is described in detail in the subsequent part of the article. The data processing scheme is shown in Figure 2.
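The sequential packet selection described above can be illustrated with a minimal sketch. It assumes that packets are numbered starting from 1 (the text only states "consecutive natural numbers"); the function name `sample_every_sth` is hypothetical and is not part of the authors' system.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def sample_every_sth(packets: Iterable[T], s: int) -> Iterator[T]:
    """Yield only the packets whose index is a multiple of s.

    Assumption: packets are indexed 1, 2, 3, ... in capture order.
    """
    if s < 1:
        raise ValueError("s must be a positive integer")
    for index, packet in enumerate(packets, start=1):
        if index % s == 0:
            yield packet


# Stand-in for captured packets: integers 1..100 in capture order.
packets = list(range(1, 101))
sampled = list(sample_every_sth(packets, 25))
```

With s = 25, only every 25th element of the stand-in sequence is kept, so the amount of data passed to later stages shrinks by roughly a factor of s.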
In order to describe the windowing process, i.e., the division of packets into windows according to their capture time, the following notation is used: $t_{test}$ is the total duration of the test; $k$ is the number of a given window; $P$ is the set of all packets; $T$ is the set of all time instants at which the windows begin; $f$ is the length of the window within which the packets are aggregated.
In view of this, the set of packets contained in a given window can be described as follows:
$$O_u = \{\, p \in P : t_u \le p^{(t)} < t_{u+1} \,\}, \quad u = 0, 1, \ldots, z - 1$$
where $O$ is the set of all windows, $z$ is the number of windows, $t_u \in T$ is the start time of window $u$, and $p^{(t)}$ is the capture time of packet $p$.
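The window definition above can be sketched in a few lines. The sketch assumes that windows start at $t_u = u \cdot f$ measured from the beginning of the test and that the number of windows is $z = \lfloor t_{test} / f \rfloor$; neither formula is stated explicitly in the text, and the function name `split_into_windows` is hypothetical.

```python
def split_into_windows(packets, f, t_test):
    """Group (timestamp, payload) pairs into windows O_u.

    Assumptions: timestamps are seconds from the start of the test,
    window u covers [u * f, (u + 1) * f), and packets captured after
    the last full window are dropped.
    """
    z = int(t_test // f)  # number of full windows
    windows = {u: [] for u in range(z)}
    for timestamp, payload in packets:
        u = int(timestamp // f)
        if 0 <= u < z:
            windows[u].append((timestamp, payload))
    return windows


example = [(0.0, "a"), (4.9, "b"), (5.0, "c"), (12.3, "d"), (15.0, "e")]
windows = split_into_windows(example, f=5, t_test=15)
```

In this example, the packet captured at exactly 5.0 s falls into window 1, because the upper bound of each window is exclusive, as in the definition of $O_u$.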
The data was divided into two sets:
1) The training set represented normal network traffic and was collected for one hour under standard network operating conditions. It consisted of traffic from LAN clients and PC2 (see Figure 1). These data are used to train a model for the purpose of identifying normal traffic. The anomaly detection model used in the further part of this work is based on a set of unsupervised algorithms. This approach was chosen because, with a supervised learning model, staff would have to label which packets belonged to normal traffic and which were considered anomalous, a process that is extremely time-consuming. Naturally, with unsupervised learning, it is essential to ensure that the network is not under attack during the training period. Therefore, the training period must be closely monitored by the technical personnel. After training on attack-free traffic, the model should be able to determine whether incoming packets grouped in windows $O_u$ contain flows whose parameter values deviate from the characteristics of normal traffic. The training dataset contained information on 1,182,566,238 packets.
2) The test set was used to verify the performance of the model trained on the training set. The packets in the windows represented network traffic in two states: normal and anomalous. The anomaly was a 5-minute Denial-of-Service (DoS) attack. The test dataset contained information on packets captured over a period of 45 minutes, of which the first 20 minutes represented normal traffic, the next 5 minutes included the anomaly, and the remainder consisted of normal traffic again. For this dataset, the windows were labeled as either anomalous or non-anomalous in order to assess the quality of the trained model. The test set contained information on 697,871,782 packets. During the conducted research, a series of experiments with DoS and DDoS attacks was carried out, and the obtained results were repeatable. The DoS attack was identified by the ISP operator as the most common type of attack that the network encounters during its normal operation. The model is also sensitive to other types of anomalies not related to DoS attacks, but research in this area needs to be continued.
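The window labeling of the test set can be sketched as follows. The sketch uses the 45-minute test duration and the attack between minutes 20 and 25 given above, together with the 5-second window length reported in the results section. It assumes that the attack boundaries coincide exactly with window boundaries, which the text does not state; `label_windows` is a hypothetical helper.

```python
def label_windows(test_s, window_s, attack_start_s, attack_end_s):
    """Return one boolean per window: True if the window is anomalous.

    Assumption: a window is anomalous when its start time lies in
    [attack_start_s, attack_end_s).
    """
    labels = []
    for u in range(test_s // window_s):
        start = u * window_s
        labels.append(attack_start_s <= start < attack_end_s)
    return labels


labels = label_windows(
    test_s=45 * 60,          # 45-minute test capture
    window_s=5,              # 5-second windows
    attack_start_s=20 * 60,  # attack begins after 20 minutes
    attack_end_s=25 * 60,    # attack lasts 5 minutes
)
```

Under these assumptions, the test set yields 540 windows, 60 of which are labeled anomalous, so the two classes are strongly imbalanced.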

III. PRE-PROCESSING OF DATA
The data collected during the experiments were continuously cleaned and prepared for the further stages of processing related to model training and anomaly detection. According to the scheme presented in Figure 2, all extracted windows $O_u$ had to undergo a vectorization process, so that each window was represented by independent feature vectors. The vectorization method used in this work aggregates selected flow features, obtained through feature extraction with the CICFlowMeter software, for unique source and destination IP address pairs. A flow represents the packet flow between two network devices, defined by the source and destination IP addresses as well as the ports and network protocols used. For the purpose of this work, the notations $p^{(s)}$ and $p^{(r)}$ denote the source and destination IP addresses of a given packet, respectively. The vectorization process uses the following notation: $D^{(u)}$ is the set of aggregated feature vectors of the flows in window $u$; $R^{(u)}$ is a set of packet collections with unique source and destination IP addresses; $U^{(u)}$ is the set of all source and destination IP address pairs in window $u$; $\bar{U}^{(u)}$ is the subset of $U^{(u)}$ composed only of its unique elements; $F_1$ is the first aggregation function, whose task is to aggregate packet features for each unique flow; $F_2$ is the second aggregation function, whose task is to aggregate flow features for each unique source and destination IP address pair.
In the first stage (aggregation $F_1$), the characteristics of each flow occurring in the processed window were aggregated. The set of packets in the window is divided into subsets, where each subset contains the packets that form a particular flow. In the second stage, the aggregated characteristics obtained in stage $F_1$ were further aggregated for each unique source and destination IP address pair $(p^{(s)}, p^{(r)})$ in the processed window $O_u$. Additionally, one dimension describing the number of flows for each unique source and destination IP address pair was added to the final vectors $D^{(u)}$. This type of aggregation provides a complete vector representation of the flow data for a given window and reduces the computational complexity of the detection process by limiting the number of features to 25. These features were selected through experimental work aimed at identifying the characteristics that maximize the effectiveness of anomaly detection. The list of all features used is presented in Table I, which also indicates the actions performed in the individual aggregation stages $F_1$ and $F_2$.
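The two aggregation stages can be illustrated with a small sketch. The actual 25 features and the operations used in $F_1$ and $F_2$ are listed in Table I and are not reproduced here. The features below (packet count and byte total per flow, averaged per address pair) and the packet dictionary layout are assumptions chosen only to show the structure of the computation.

```python
from collections import defaultdict
from statistics import mean


def aggregate_window(packets):
    """Two-stage aggregation of one window into per-pair feature vectors.

    Assumption: each packet is a dict with keys src, dst, sport, dport,
    proto and length. The chosen features are illustrative only.
    """
    # Stage F1: aggregate packet features for each unique flow.
    flows = defaultdict(list)
    for p in packets:
        key = (p["src"], p["dst"], p["sport"], p["dport"], p["proto"])
        flows[key].append(p["length"])
    flow_features = {k: (len(v), sum(v)) for k, v in flows.items()}

    # Stage F2: aggregate flow features for each unique (src, dst) pair.
    pairs = defaultdict(list)
    for (src, dst, *_), features in flow_features.items():
        pairs[(src, dst)].append(features)

    vectors = {}
    for pair, features in pairs.items():
        packet_counts = [f[0] for f in features]
        byte_totals = [f[1] for f in features]
        # The last dimension is the number of flows for this address pair.
        vectors[pair] = [mean(packet_counts), mean(byte_totals), len(features)]
    return vectors


window = [
    {"src": "A", "dst": "B", "sport": 1000, "dport": 80, "proto": "tcp", "length": 100},
    {"src": "A", "dst": "B", "sport": 1000, "dport": 80, "proto": "tcp", "length": 200},
    {"src": "A", "dst": "B", "sport": 1001, "dport": 80, "proto": "tcp", "length": 60},
    {"src": "C", "dst": "B", "sport": 2000, "dport": 53, "proto": "udp", "length": 40},
]
vectors = aggregate_window(window)
```

Each window is thus reduced to one fixed-length vector per address pair, regardless of how many packets the pair exchanged, which is what keeps the input to the detection model small.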

IV. MODEL DESCRIPTION
To test the impact of the sampling frequency on anomaly detection accuracy, a densely connected neural network based on an autoencoder architecture was used [21]. The application of this model to anomaly detection is well known in the literature, and its effectiveness for the complete dataset was experimentally confirmed in the initial stage of the conducted research. The operation of the adopted model can be divided into two main stages:
1) The forward propagation stage of the neural network, which consists of two key components:
a) Compression of the input feature vector into fewer dimensions (encoding).
b) Reconstruction of the input feature vector from its compressed form (decoding).
2) The stage of calculating the reconstruction error based on the comparison of the input vector with the output of the neural network. Based on the reconstruction error, the sample is classified as normal or as containing an anomaly.
Let $M$ denote the reconstruction error for a single vector $w$. As a result of applying aggregation $F_2$, a set of vectors describing the features of all unique sender and receiver IP address pairs is obtained. The reconstruction error is therefore computed separately for each such vector by comparing it with its reconstruction. To calculate the reconstruction errors for all vectors in a given window $O_u$, this computation is applied to each vector $w$ in $D^{(u)}$.
The classification of a window is based on the reconstruction loss for the window characteristics. Additionally, to improve the weight fitting process, the data underwent standardization [24] using the mean and standard deviation of the features from the windows in the training dataset.
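The decision stage can be sketched without the neural network itself. The sketch below assumes mean squared error as the reconstruction error and a fixed threshold on the maximum error within a window; the text does not give either formula, so both are labeled assumptions. The `reconstruct` argument is a stand-in for the trained autoencoder, and all function names are hypothetical.

```python
from statistics import mean, pstdev


def fit_standardizer(train_vectors):
    """Compute per-feature mean and standard deviation on training data."""
    columns = list(zip(*train_vectors))
    means = [mean(c) for c in columns]
    # Guard against constant features to avoid division by zero.
    stds = [pstdev(c) or 1.0 for c in columns]
    return means, stds


def standardize(vector, means, stds):
    """Scale one feature vector with the training-set statistics."""
    return [(x - m) / s for x, m, s in zip(vector, means, stds)]


def reconstruction_error(vector, reconstructed):
    """Mean squared error between input and output (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(vector, reconstructed)) / len(vector)


def classify_window(vectors, reconstruct, threshold):
    """Flag a window as anomalous if its maximum error exceeds threshold."""
    errors = [reconstruction_error(v, reconstruct(v)) for v in vectors]
    return max(errors, default=0.0) > threshold


def clip_model(vector):
    """Stand-in for the autoencoder: reproduces only values in [-1, 1]."""
    return [min(max(x, -1.0), 1.0) for x in vector]


means, stds = fit_standardizer([[1, 10], [3, 10]])
normal_window = [[0.1, -0.2]]
attack_window = [[0.1, -0.2], [5.0, 0.0]]
normal_flag = classify_window(normal_window, clip_model, threshold=1.0)
attack_flag = classify_window(attack_window, clip_model, threshold=1.0)
```

Taking the maximum over the address pairs means that a single badly reconstructed pair is enough to flag the whole window, which matches the observation that one attacking pair dominates the error during a DoS attack.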

V. RESULTS
The combination of data processing based on the aggregation of unique sender and receiver IP address pairs with a model based on the maximum reconstruction error of the processed feature vectors of each pair in a window yielded good results in the anomaly detection task. The windows in which anomalies occurred showed a significantly higher maximum reconstruction error than those characterized by normal traffic. Table III presents the anomaly detection quality on the test dataset. The performance of the developed model was evaluated on the test dataset for different sampling frequencies s = 10, 25, 50. The obtained results are presented in Figures 3 to 5. The maximum reconstruction error for the non-anomalous sender and receiver IP address pairs is indicated in blue, while the reconstruction error for the attacking device's IP address and the target IP address is shown in red. The results show that satisfactory performance is achieved even for s = 25, which means checking every 25th network traffic sample. It is important to note that in the experiments the window length was 5 seconds and the entire attack lasted 5 minutes. It is assumed that for longer-lasting attacks with higher network traffic intensity, such as DDoS attacks, the sampling frequency can be reduced further. The sampling threshold should be determined individually based on the characteristics of the specific network and the sensitivity of the system expected by the ISP operator.

VI. CONCLUSIONS
The paper presents the results of research on the applicability of a data sampling mechanism to anomaly detection on high-bandwidth network links. The research was carried out in the production infrastructure of a medium-sized ISP. The proposed anomaly detection approach, which combines windowing and data sampling, reduced the amount of data needed to detect the anomaly (a DoS attack) by a factor of 25. This makes it possible to reduce the bandwidth required by IDS and IPS probes that detect threats, which directly translates into the cost of implementing cybersecurity systems. Further research will concentrate on nonuniform sequential sampling of traffic, e.g., by using different frequencies and statistical distributions depending on the time of day or network activity. In addition, preliminary studies have shown that the designed system is also effective in detecting other types of anomalies, e.g., data generated by faulty network interfaces. The proposed approach makes it possible to monitor high-throughput access links of ISPs and thus introduces another layer of protection for the entire ICT system against cyberattacks. Because copies of the traffic are used, the proposed architecture itself does not introduce delays into the forwarding of end-user traffic, and once a threat is detected, a given flow can be redirected for further inspection using policy-based routing mechanisms.

Fig. 1. ISP network edge node architecture with testbed elements

Fig. 2. Data processing scheme

TABLE III
RESULTS OF MODEL EVALUATION ON THE TEST SET