Towards Automated Detection of Adversarial Attacks on Tabular Data

The paper presents a novel approach to investigating adversarial attacks on machine learning classification models operating on tabular data. The employed method involves using diagnostic parameters calculated on an approximated representation of a model under attack and analyzing differences in these diagnostic parameters over time. The hypothesis researched by the authors is that adversarial attack techniques, even if attempting a low-profile modification of input data, influence those diagnostic attributes in a statistically significant way. Thus, changes in diagnostic attributes can be used for detecting attack events. Three attack approaches on real-world datasets were investigated. The experiments confirm the approach as a promising technique to be further developed for detecting adversarial attacks.


I. INTRODUCTION
The widespread adoption of machine learning (ML) algorithms in various fields, such as healthcare, finance, transportation, and industry [1], has revolutionized the way we process and analyze vast amounts of data [2]. However, the rapid proliferation of ML applications has also raised operational security concerns, as malicious actors increasingly target these models with adversarial attacks to undermine their reliability and compromise their performance [3]. These attacks pose a significant threat to the integrity and trustworthiness of ML models, necessitating the development of robust detection and mitigation techniques to protect the systems from potential threats [4], [5].
The motivation for our work is rooted in the observed disparity between machine learning implementations, which primarily emphasize traditional quality characteristics, and the security-focused mindset of stakeholders responsible for operational security in businesses that incorporate machine learning solutions, a mindset reinforced by real-world examples of adversarial machine learning attacks [6]. This gap highlights the need for a more holistic approach to designing and deploying machine learning systems, taking into account not only their performance but also their resilience to adversarial attacks and other security challenges [7], [8].
Furthermore, we have found that rough set theory (RST) has not been thoroughly explored with respect to its capability in attack detection. One of the defining characteristics of RST is that it can be used to handle uncertainty and vagueness of data [9]. By approximating the decision boundaries of a classifier model, rough sets can be used to identify regions in the input space where adversarial perturbations are likely to occur [10]. By monitoring these regions, unusual deviations or patterns in input data can be flagged as potential adversarial attacks. This approach, if proven to work, can not only provide a robust mechanism for detecting adversarial examples but also offer insights into the underlying structure of the data and its susceptibility to manipulation, thereby informing the design of more secure machine learning models. In this work, we test the usefulness of RST methods in practical security applications in the domain of adversarial machine learning prevention.
The end goal of our work is to create a robust black-box-based method that can be used in real-world scenarios for the detection and prevention of misclassification adversarial attacks on machine learning models, increasing the safety and trustworthiness of machine learning applications in everyday scenarios.

A. Adversarial Machine Learning (AML)
Starting with the pioneering works of Szegedy et al. [11] and Goodfellow et al. [12], the topic of adversarial machine learning has entered the spotlight of the research community. Those works demonstrated that it is possible to influence the operation of machine learning models, most notably image-based classifiers, by adding limited-amplitude perturbations (undetectable to the human eye) to original images, causing spectacular cases of image misclassification.
Since its inception, the concept has left the walls of academia, and real-world adversarial machine learning attacks have been shown to be possible in various areas [13], [14].
There are several possible ways to classify the diverse world of AML attacks [3], [8], [15]. AML attacks can be classified along different axes:
• Knowledge-Based Classification - distinguishes attacks based on the amount of knowledge an attacker has about the target model.
• Capability-Based Classification - considers the capabilities of the attacker and the stage of the machine learning pipeline targeted.
• Goal-Based Classification - differentiates attacks based on the attacker's objectives.
A point of note is that most of the published papers refer to attacks and defenses on image data [15]. Only in recent years has interest in attacks and defenses on models processing tabular data increased [16].

B. AML detection
Complementary to work dedicated to increasing the robustness of models against adversarial machine learning, significant effort is put into the detection of attacks against ML models. These techniques are primarily designed to identify inputs that have been modified with the intent of misleading a machine learning model. Some detection strategies attempt to detect adversarial examples by identifying instances that deviate significantly from the distribution of normal instances. An example of such a technique, specific to adversarial attacks on image classification models, has been described in [17]. The detection technique presented therein hinges on the observation that adversarial images place abnormal emphasis on the lower-ranked principal components from principal component analysis (PCA), which allows adversarial examples to stand out after PCA whitening. More recently, salience-based methods have been used to analyze adversarial examples for NLP models, based on the observation that salient tokens correlate directly with adversarial perturbations [18].
Another approach to the detection of adversarial perturbations is to train a separate classifier that labels inputs as normal or adversarial. In [19], such an approach was implemented using neural network classifiers. The method has been shown to be useful for the detection of small adversarial perturbations in images (below the human-detection threshold). This auxiliary classifier can be integrated with the main model and can provide a reasonable level of adversarial threat detection [20].
An interesting approach to the detection of perturbed images has been presented in [21]. The method detects adversarial examples by examining the output of the discriminator of a generative adversarial network (GAN) trained on the dataset, relying on the observation that adversarial examples are scored lower by the discriminator part of the GAN.

A. Adversarial attack types used
In this work, we have tested our attack detection method against three known attack techniques: HopSkipJump, PermuteAttack, and ZOO. These attack methods were chosen based on three criteria: a thorough description of the attack methodology in an academic paper, applicability to attacks on classifier models operating on tabular data, and the availability of source code. While the choice of attack methods was arbitrary, it was considered adequate for the preliminary verification of the attack detection method presented in this work.

1) HopSkipJump Attack:
The HopSkipJump attack, also known as the Decision-Based Boundary attack, is an adversarial attack on machine learning models designed to generate adversarial examples by directly manipulating the input data to cause misclassifications while minimizing the perturbation to the original input [22].
It is an iterative, decision-based attack, meaning that it only requires access to the model's output decisions (e.g., classification labels) rather than full access to the model's internal workings or gradients. The attack algorithm consists of three main steps:
• Hop: Initialization of the adversarial example by searching for a starting point near the decision boundary of the model.
• Skip: Binary search along the line connecting the original input and the initialized adversarial example to find a point that lies closer to the decision boundary.
• Jump: Gradient-free optimization to further perturb the adversarial example while keeping it within a predefined perturbation budget.
2) Permute Attack:
PermuteAttack, described in [23], is a counterfactual example generation method capable of handling tabular data, including discrete and categorical variables. The method is based on a gradient-free genetic optimization algorithm that permutes randomly selected features, making sure that the resulting values stay within ranges that are not unusual for the given data set. As a result, it produces adversarial data points that are modified, compared to the original data points, in a way that can elude some anomaly detection methods. The resulting adversarial examples can also be used to analyze the robustness of the attacked model.
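The core idea of PermuteAttack, replacing selected feature values with values that are plausible for the data set until the predicted label changes, can be illustrated with a much simplified sketch. It uses plain random search instead of the genetic algorithm of [23]; the names `permute_search`, `predict`, and `reference_rows` are hypothetical and chosen only for illustration.

```python
import random


def permute_search(predict, x, reference_rows, max_iters=1000, max_features=2, seed=0):
    """Simplified stand-in for PermuteAttack (random search, NOT the genetic
    algorithm of the original method).

    predict        -- hypothetical black-box function: row -> class label
    x              -- original instance (list of feature values)
    reference_rows -- rows of the data set; replacement values are copied
                      from them, so every new value was actually observed
    Returns a modified copy of x with a different label, or None.
    """
    rng = random.Random(seed)
    original_label = predict(x)
    n_features = len(x)
    for _ in range(max_iters):
        candidate = list(x)
        # Change only a few features to keep the modification low-profile.
        k = rng.randint(1, max_features)
        for j in rng.sample(range(n_features), k):
            donor = rng.choice(reference_rows)
            candidate[j] = donor[j]
        if predict(candidate) != original_label:
            return candidate
    return None
```

Because every replaced value is taken from a row of the data set, the modified instance stays within observed feature ranges, which is the property that lets such examples slip past simple range-based anomaly checks.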
3) Zeroth-Order Optimization (ZOO) Attack:
The Zeroth-Order Optimization (ZOO) attack is a black-box adversarial attack proposed by Chen et al. [24]. The key idea behind the ZOO attack is to approximate gradients of the target model using zeroth-order (derivative-free) optimization methods, allowing the attacker to generate adversarial examples without direct access to the model's gradients or architecture. The ZOO attack proceeds in the following steps:
• Approximate the gradients using zeroth-order optimization, such as the coordinate-wise finite-difference method or the spherical coordinate-based method.
• Compute the adversarial perturbation using the approximated gradients.
• Apply the perturbation to the original input, ensuring that the adversarial example remains within a predefined perturbation budget.
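The three steps above can be sketched as follows. This is a minimal illustration rather than the published algorithm: it assumes a hypothetical scalar `loss` function that the attacker can query, estimates every coordinate with a symmetric finite difference, takes a plain gradient step, and enforces the perturbation budget as an L-infinity bound. The function names, the step size, and the choice of norm are assumptions made for the example.

```python
def estimate_gradient(loss, x, h=1e-4):
    """Coordinate-wise symmetric finite-difference estimate of d(loss)/dx.

    Only function values of `loss` are queried, no analytic gradients.
    """
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((loss(x_plus) - loss(x_minus)) / (2.0 * h))
    return grad


def zoo_step(loss, x, x_orig, step=0.1, budget=0.5):
    """One descent step on the estimated gradient, clipped to the budget.

    The result stays within `budget` of `x_orig` in every coordinate
    (an assumed L-infinity perturbation budget).
    """
    grad = estimate_gradient(loss, x)
    moved = [xi - step * gi for xi, gi in zip(x, grad)]
    return [min(max(mi, oi - budget), oi + budget)
            for mi, oi in zip(moved, x_orig)]
```

In an attack setting, `loss` would be built from the model's output scores so that decreasing it pushes the input toward misclassification; each gradient estimate costs two model queries per feature.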

B. Diagnostic attributes
The complete workflow for model approximation and diagnostic attributes was originally described in [25].
Here we briefly recall the main idea. The approach builds a surrogate model for the predictions of the original model using rough set theory [26]. Based on the discretized input data set, we construct an ensemble of approximate reducts. Next, for every instance in the diagnosed data set, we create a neighborhood: the set of instances from the training data set that are similar to that instance. This neighborhood is the basis for calculating the diagnostic attributes listed below.
• Target consistency with approximations in the neighborhood - measuring the consistency of the target of the diagnosed instance with the approximations from the neighborhood of this instance.
• Prediction consistency with targets in the neighborhood - measuring the consistency of the prediction of the diagnosed instance with the targets from the neighborhood of this instance.
• Target consistency with targets in the neighborhood - measuring the consistency of the target of the diagnosed instance with the targets from the neighborhood of this instance.
• Targets and approximations inconsistency in the neighborhood - measuring the inconsistency of targets and approximations in the neighborhood of the diagnosed instance.
• Targets diversity in the neighborhood - measuring the diversity of targets in the neighborhood of the diagnosed instance in comparison to the diversity of targets calculated on the whole diagnosed data set.
• Approximations diversity in the neighborhood - measuring the diversity of approximations in the neighborhood of the diagnosed instance in comparison to the diversity of approximations calculated on the whole diagnosed data set.
• Uncertainty - the measure of uncertainty of the prediction based on the approximations.
• Neighborhood size - the number of instances in the neighborhood of the diagnosed instance.
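The exact definitions of these attributes are given in [25] and are not repeated here, so the following is only an illustrative sketch of how two neighborhood-based attributes of this kind could be computed. It assumes, hypothetically, that two instances are similar when their discretized values agree on every attribute of a single attribute subset, and that consistency is the fraction of neighbors whose target equals the prediction. The function names and both formulas are assumptions made for the example, not the definitions used by the method.

```python
def neighborhood(instance, train_rows, attribute_subset):
    """Indices of training rows that match `instance` on every attribute
    in `attribute_subset` (assumed similarity criterion)."""
    return [i for i, row in enumerate(train_rows)
            if all(row[a] == instance[a] for a in attribute_subset)]


def diagnostic_attributes(instance, prediction, train_rows, train_targets,
                          attribute_subset):
    """Two illustrative attributes for one diagnosed instance.

    neighborhood_size                   -- number of matching training rows
    prediction_consistency_with_targets -- share of neighbors whose target
                                           equals the model's prediction
                                           (None for an empty neighborhood)
    """
    idx = neighborhood(instance, train_rows, attribute_subset)
    size = len(idx)
    if size == 0:
        return {"neighborhood_size": 0,
                "prediction_consistency_with_targets": None}
    agree = sum(1 for i in idx if train_targets[i] == prediction)
    return {"neighborhood_size": size,
            "prediction_consistency_with_targets": agree / size}
```

With an ensemble of attribute subsets, the same computation could be repeated per subset and the results combined; how that combination is done is left open here.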
We used the Kolmogorov-Smirnov (KS) test [27] to compare the distributions of diagnostic attributes. Additionally, the Wilcoxon signed-rank test [28] for two paired samples was conducted. The first test measures the distance between the two empirical distributions, while the second measures only changes in the location parameter.
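As an illustration of how the two tests can be applied to one diagnostic attribute, the sketch below uses `scipy.stats.ks_2samp` and `scipy.stats.wilcoxon`. The helper name `compare_attribute` and the synthetic data are assumptions made for the example; the numbers are generated, not taken from the experiments.

```python
import numpy as np
from scipy import stats


def compare_attribute(before, after, alpha=0.05):
    """Compare one diagnostic attribute before and after a suspected attack.

    `before` and `after` must be paired: element i of both arrays refers
    to the same diagnosed instance.
    """
    ks = stats.ks_2samp(before, after)   # difference between distributions
    wx = stats.wilcoxon(before, after)   # shift in location of paired values
    return {
        "ks_pvalue": ks.pvalue,
        "wilcoxon_pvalue": wx.pvalue,
        "ks_reject": ks.pvalue < alpha,
        "wilcoxon_reject": wx.pvalue < alpha,
    }


# Synthetic example: the attribute is shifted by about 1.0 after the "attack".
rng = np.random.default_rng(0)
before = rng.normal(loc=0.0, scale=1.0, size=200)
after = before + 1.0 + rng.normal(scale=0.1, size=200)
result = compare_attribute(before, after)
```

Note that the Wilcoxon test requires paired observations of equal length, whereas the KS test would also accept two samples of different sizes.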

IV. EXPERIMENTS AND RESULTS
To evaluate the proposed diagnostic attributes in attack detection, we prepared benchmark data sets. From OpenML we gathered 22 data sets with classification tasks. Each data set was split into training and diagnosed parts, assuming that the diagnosed data set should consist of at least 100 observations. A list of data sets is given in the appendix in Table IV.
For each data set, we fitted a logistic regression model, a support vector machine, and XGBoost. Afterward, three adversarial attacks were conducted on the diagnosed part of each data set.
Figure 1 shows the distribution of balanced accuracy measured on the diagnosed data set for the original (base) model and how it changed after a given attack. It can be seen that, in most cases, the models perform worse after an attack than the base model. The median balanced accuracy for all base models is above 0.8, while in the case of the HopSkipJump attack it is around 0.1. For PermuteAttack, the median value is slightly higher, at around 0.2. For the ZOO attack, these values are close to 0, but a high dispersion of results can be observed for the XGBoost model.
We calculated diagnostic attributes for each analyzed data set and attack, resulting in 264 tables with results (22 data sets × 3 model types × 4 attack variants (no attack + 3 others)). The distribution of two diagnostic attributes for a selected data set (spambase) is presented in Figure 2. We used the Kolmogorov-Smirnov test to verify the null hypothesis that there is no difference between the distribution of a given diagnostic attribute before and after the attack. We also used the Wilcoxon test to examine the hypothesis that the median of differences between the paired attributes is zero. Both tests indicate whether there is a significant difference in a diagnostic attribute before and after the attack. We summarize these data by calculating the fraction of cases in which the null hypothesis of a given statistical test was rejected at significance level α = 0.05. Results are presented at three levels of aggregation: effectiveness of detecting attacks by type of attack (Table I), by type of model (Table II), and by diagnostic attribute (Table III).
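The aggregation described above, the fraction of cases in which a null hypothesis is rejected at α = 0.05, can be sketched in a few lines. The record layout (`attack`, `model`, `p_value` keys), the helper name `rejection_rates`, and the p-values in the usage example are hypothetical placeholders, not results from the experiments.

```python
from collections import defaultdict


def rejection_rates(results, key, alpha=0.05):
    """Fraction of test results with p_value < alpha, grouped by `key`.

    results -- iterable of dicts, each with a 'p_value' entry and the
               grouping field named by `key` (e.g. 'attack' or 'model')
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [rejections, total]
    for record in results:
        group = counts[record[key]]
        group[0] += record["p_value"] < alpha
        group[1] += 1
    return {name: rejected / total
            for name, (rejected, total) in counts.items()}


# Made-up p-values, used only to show the call pattern.
example = [
    {"attack": "HopSkipJump", "model": "SVM", "p_value": 0.01},
    {"attack": "HopSkipJump", "model": "XGBoost", "p_value": 0.20},
    {"attack": "ZOO", "model": "SVM", "p_value": 0.001},
]
by_attack = rejection_rates(example, key="attack")
by_model = rejection_rates(example, key="model")
```

Calling the same function with a different `key` gives the other aggregation levels, for example by model type or by diagnostic attribute if that field is present in the records.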

In the case of the HopSkipJump attack, the KS test rejected the null hypothesis in 88% of cases, while the Wilcoxon test did so in 95%. We also verified the effectiveness of individual diagnostic attributes in attack detection. According to the Kolmogorov-Smirnov test, the highest detection rate was obtained for uncertainty (95%), target consistency with approximations in the neighborhood (90%), and targets diversity in the neighborhood (88%). In the case of the Wilcoxon test, we obtained similarly high results for three attributes: uncertainty, approximations diversity in the neighborhood, and target consistency with approximations in the neighborhood.
V. CONCLUSIONS
In this paper, we have presented a novel approach to detecting adversarial attacks on machine learning classification models operating on tabular data. By analyzing differences in diagnostic parameters calculated on an approximated representation of the model under attack, we demonstrate that adversarial attacks can be detected in a statistically significant manner. Experiments performed on real-world datasets confirm the effectiveness of our method and its potential for further development as a detection technique for adversarial attacks.

A. Limitations
The method developed and presented in this paper has several limitations, which will be tackled in future work. Most notably:
• The robustness of the method to attack variability has not been assessed more widely. For the initial validation of the method, a sample of three attacks was chosen, which does not cover the range of currently known and published attacks on models designed for processing tabular data.
• The method assumes that the model being monitored is replicable with the rough-sets-based method presented in previous work. In this paper, it was verified on three classification models, with the assumption that the underlying model replication method provides a layer of abstraction that is strong enough to consider our method model-agnostic. Verification of this hypothesis has not been a subject of this work.
• Optimization of computational efficiency and scalability has not been, by design, within the scope of the work presented herein.
• The method has not been benchmarked against available AML detection techniques.

B. Future Work
Future work will be organized into three distinct workstreams.
First, we will broaden the range of scenarios on which the method is tested. The method will be verified on a larger set of known attack methods, including exploration of their attack parameter space. Special attention will be given to methods that attempt low-profile adversarial attacks, aiming to pass under the detection threshold of traditional monitoring tools. We also plan to compare our approach with other methods that aim to detect AML. Furthermore, we will verify the assumption that the method is model-agnostic by checking how its effectiveness changes when a different original model is attacked.
The second workstream will be devoted to new features of the method:
• Concept drift detection - examining differences in the behavior of diagnostic attributes between changes in data resulting from malicious attacks and different types of concept drift, both stochastic and deterministic in nature.
• Exploration of the possibility of defining new diagnostic attributes, specifically looking for diagnostic attributes that increase the specificity and sensitivity of attack detection heuristics.
In the third workstream, we intend to analyze the scalability of the method and prepare a thorough comparison of the

APPENDIX

Fig. 1. Distribution of balanced accuracy in analyzed datasets across the type of model and attack.

Fig. 2. Distribution of selected diagnostic attributes for the spambase data set.

At the model type level, we detected 79% of attacks conducted on the XGBoost model, almost 84% on logistic regression, and 86% on SVM using the Kolmogorov-Smirnov test. With the Wilcoxon test, the success rate is 93% for logistic regression, 94% for SVM, and 98% for the XGBoost model.

TABLE IV. BASIC CHARACTERISTICS OF DATA SETS USED IN EXPERIMENTS. THE COLUMNS N, |A|, AND |L| SHOW THE TOTAL NUMBER OF INSTANCES, ATTRIBUTES, AND CLASSES, RESPECTIVELY.

PIOTR BICZYK, ŁUKASZ WAWROWSKI: TOWARDS AUTOMATED DETECTION OF ADVERSARIAL ATTACKS ON TABULAR DATA