Comparing Performance of Machine Learning Tools Across Computing Platforms

Embedded systems (ES) are widespread in our world and responsible for many critical systems. More recently, machine learning (ML) tools have become a well-established solution for data-intensive tasks, but their application in embedded systems is still gaining traction and their real-time performance is often unclear. We provide a (non-exhaustive) review of the ML tools that may be suited for deployment in ES, from which we selected two representative tools - the well-established Python-based Scikit-Learn, and the interoperability-oriented ONNX Runtime - to compare their response time. Using archetypal datasets and four pre-trained ML models, we measure the prediction time for each sample, for each model, in Scikit-Learn and ONNX Runtime on a standard desktop (to compare the performance of the tools on the same platform), and for ONNX Runtime on a representative ES, a Raspberry Pi v4 (to compare the performance of the same tool across platforms). We report that ONNX Runtime considerably improves over Scikit-Learn, and experiences a negligible performance degradation when ported to the RPi.


I. INTRODUCTION
Artificial intelligence (AI) and machine learning (ML) have grown dramatically in recent years, to the point where AI and ML are becoming a core technological component in many modern systems. In turn, embedded systems (ES) are a well-established technology that has enjoyed widespread use in our world for decades now, inconspicuously ensuring the efficient execution of a plethora of everyday operations. Many applications of ES are critical and time-sensitive [1]; for example, the timely detection and mitigation of cyberattacks, which is crucial for the integrity and dependability of many modern-world digital systems (e.g., in the banking sector).
The use of ML in embedded systems has garnered substantial interest, with the topic often being referred to as TinyML. The challenge is that embedded systems are typically resource-constrained platforms (ranging from micro-controllers to ARM or small-scale x86 platforms) and, while there is a plethora of ML libraries, not all provide the small memory footprint and stand-alone operation (i.e., sufficiently stripped of external dependencies) necessary for operation in embedded systems. Furthermore, a common strategy is to carry out training in the cloud (due to the higher processing capabilities available), whereas the embedded device only performs prediction. This raises the need for interoperability, as the ML tool used for training may differ from the one available at the embedded device. Finally, characterizing response time is important when designing real-time systems. Reports of execution time and/or speed-up against baselines can be found (e.g., [2], [3]), but typically for single models and without considering potential response time variability.
A noteworthy category of solutions is that of intermediate description languages, such as Open Neural Network Exchange (ONNX), and associated runtime environments (RTE), notably ONNX Runtime and TensorFlow Lite. Intermediate description languages describe a (trained) ML model using a (small) set of operators that the RTE is able to execute. This reduces computational requirements, as it suffices that the RTE implements that set of operators to produce predictions from a given model. The downsides are that training may not be available and that the set of available models may be limited.
In this work we report the performance of two selected libraries, Scikit-Learn and ONNX Runtime, on two platforms: a standard desktop and an archetypal embedded system, a Raspberry Pi v4. We deploy four one-class ML models - Isolation Forest (iForest), Local Outlier Factor (LOF), One-Class Support Vector Machine (OC-SVM) and Stochastic Gradient Descent OC-SVM (SGD-SVM) - that were pre-trained with network traffic datasets (legitimate and malicious) to detect cyberattack-related traffic. We show that ONNX Runtime can offer a speed-up of at least ≈ 14x with respect to Scikit-Learn for most models when both are executed on the desktop, and that ONNX Runtime running on the Raspberry Pi produces speed-ups of at least ≈ 8x against Scikit-Learn running on the desktop.
The structure of this paper is as follows. Section II portrays a motivating use-case and relevant ML models. An overview of ML for ES is provided in Section III. Section IV reports response times for selected ML libraries and computing platforms. Section V draws final remarks.

A. Cybersecurity Use-Case
Cybersecurity is a domain of notable technological and societal impact in the modern world. The exposure surface for cyberattacks, and for recruiting devices that can be commandeered to participate in those attacks, increases every day as the number of low-security IoT devices grows. This has been a driver for the increase of Distributed Denial of Service (DDoS) attacks, which aim at disrupting the servers of high-profile online services (e.g., Amazon, Google or Netflix) by having a very large number of infected devices (typically vulnerable IoT devices) issue dummy requests to those servers. Internet Service Providers (ISP), which enable Internet service at customer premises through a Customer-Premises Equipment (CPE), are interested in mitigating the involvement of their customers' devices in cyberattacks through the use of Intrusion Detection Systems (IDS). The fastest response time is attained by deploying the IDS as close as possible to the targeted (or involved) nodes; for the ISP, this is the CPE. Machine learning, whose successful application to cybersecurity is well documented [4]-[6], can help address this issue by learning network traffic patterns that are legitimate, and applying that knowledge to identify anomalous patterns that may correspond to malicious traffic. However, the CPE is often an embedded system with relatively few computational resources; while it can perform model prediction, it lacks the power to perform training, which ends up taking place in the cloud. The question arises of how to transfer models trained in the cloud, often with a state-of-the-art ML library, to the embedded system, which often will support only a limited set of ML libraries. Figure 1 presents the architecture of a cloud/edge system, and how an ML-based IDS can be deployed by leveraging the resource-rich cloud for training and transferring trained models to the resource-constrained CPE. Additional details on this use-case can be found in [7].

B. Datasets & ML Tools
To enable the presented use-case, we prepared datasets of network traffic (both legitimate and malicious) from publicly available sources, and trained four models to produce an ML mechanism that detects anomalous (potentially malicious) traffic. The details are described in [6]. We focus on One-Class (OC) models, i.e., models trained with samples of a single class to create a boundary around these, against which outliers can be detected. This semi-supervised approach allows the models to learn the regular (legitimate) traffic at a customer's network, and report anomalies that may turn out to be malicious traffic. All models were trained using Scikit-Learn [8], a free Python library that enjoys widespread use in the ML community.
The four selected models are reviewed next for convenience.
Isolation Forest [9]: an unsupervised mechanism based on decision trees. It leverages the assumption that an anomalous sample requires fewer partitioning steps to be isolated. Thus, an isolation forest works by recursively generating partitions, randomly splitting an attribute's value between the minimum and maximum values allowed for that attribute, until a target sample is contained in its own partition. Anomalies will require fewer partitions to be isolated.
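The isolation principle can be illustrated with a minimal one-dimensional sketch. This is not the Scikit-Learn implementation: the helper names `isolation_steps` and `mean_isolation_steps`, the toy data, and the step limit are assumptions made for illustration only, and the split is drawn between the minimum and maximum of the current partition.

```python
import random

def isolation_steps(values, target, rng, max_steps=50):
    """Count the random splits needed until `target` is alone in its partition."""
    current = list(values)
    steps = 0
    while len(current) > 1 and steps < max_steps:
        lo, hi = min(current), max(current)
        if lo == hi:  # identical values cannot be separated any further
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target.
        if target < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        steps += 1
    return steps

def mean_isolation_steps(values, target, trials=200, seed=0):
    """Average the step count over several random partitionings."""
    rng = random.Random(seed)
    total = sum(isolation_steps(values, target, rng) for _ in range(trials))
    return total / trials

# Toy data (hypothetical): a dense cluster plus one distant sample.
normal = [10.0 + 0.1 * i for i in range(20)]
data = normal + [50.0]

outlier_steps = mean_isolation_steps(data, 50.0)
inlier_steps = mean_isolation_steps(data, normal[10])
print(f"outlier: {outlier_steps:.2f} steps, inlier: {inlier_steps:.2f} steps")
```

In this toy setting the distant sample is usually isolated by the first split, whereas a sample inside the cluster needs several, which is the property the anomaly score builds on.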
Local Outlier Factor (LOF) [10]: LOF identifies local outliers by measuring the deviation of the density of a data point with respect to its neighbors. The k-Nearest Neighbors algorithm is used to compute the reachability distance and local reachability density of each data point. The LOF score of a point is calculated as the ratio of the average local reachability density of its k nearest neighbors to its own local reachability density. Points with high LOF scores are considered outliers. The value k (number of nearest neighbors) must be chosen carefully to avoid overfitting or underfitting.
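The quantities involved in LOF can be sketched in a few lines of plain Python. This is a simplified illustration, not the Scikit-Learn code: the function names and the toy points are assumptions, ties in neighbor distance are ignored, and duplicate points (which would give a zero reachability sum) are not handled.

```python
import math

def k_nearest(points, i, k):
    """Return (distance, index) pairs for the k nearest neighbors of points[i]."""
    dists = sorted(
        (math.dist(points[i], points[j]), j)
        for j in range(len(points)) if j != i
    )
    return dists[:k]

def lof_scores(points, k):
    n = len(points)
    neigh = [k_nearest(points, i, k) for i in range(n)]
    k_dist = [nb[-1][0] for nb in neigh]  # distance to the k-th neighbor
    # Local reachability density: inverse of the mean reachability distance,
    # where reach_dist(i, j) = max(k_dist[j], dist(i, j)).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neigh[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: average density of the neighbors divided by the point's own density.
    return [
        sum(lrd[j] for _, j in neigh[i]) / (k * lrd[i])
        for i in range(n)
    ]

# Toy data (hypothetical): five clustered points and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
scores = lof_scores(points, k=3)
for p, s in zip(points, scores):
    print(p, round(s, 2))
```

Clustered points obtain scores close to 1, while the distant point obtains a much larger score because its neighbors are far denser than it is.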
One-Class Support Vector Machines [11]: traditional Support Vector Machines (SVM) select a decision boundary for which the margin between data points of different classes is maximized. Another interpretation is that SVMs maximize the distance between the convex hulls of points belonging to each class. One-Class SVM (OC-SVM) applies the same boundary-based mechanism for semi-supervised learning. It uses a hypervolume to encompass all of the instances; points outside the hypervolume are classified as anomalies.
Stochastic Gradient Descent [One-Class] SVM (SGD-SVM) [12]: an online linear version of One-Class SVM, using stochastic gradient descent (SGD). SGD algorithms are suited for applications where the number of data points and the problem dimensionality are both very large.

III. OVERVIEW OF ML TOOLS FOR EMBEDDED SYSTEMS
We review (not exhaustively) ML libraries targeting embedded systems and tools for interoperability and transpilation.

A. ML Libraries for Embedded Systems
TensorFlow (TF) 1 is an open-source library for AI/ML, composed of datasets and pre-trained models developed and released by the TensorFlow community. Colaboratory (Colab), for instance, is a free Jupyter notebook environment that runs in the cloud, so users do not need to set up anything on their local machine. This library is supported in Haskell, C#, Julia, R, Ruby, Scala and JavaScript.
Armadillo 2 is a C++ library for linear algebra and scientific computing. It can use Open Multi-Processing (OpenMP), a free, easy-to-use library for parallel computing.
mlpack 3 is a C++ ML library focused on providing fast and extensible implementations of ML models. It combines Armadillo; ensmallen, a library for numerical optimization; and cereal, a serialization library.
Shogun 4 is an open-source C++ library for machine learning development. It provides interfaces for C++, Python, Octave, R, Java, Lua, C# and Ruby, and implements all the standard ML algorithms as well as some advanced ones. It is available for most operating systems.
SHARK 5 is an open-source machine learning library implemented in C++. It provides neural networks, kernel-based learning algorithms, and linear and nonlinear optimization methods, and is available for the most common operating systems.
A notable mention also goes to CAFFE 6, which focuses on deep learning, thus supporting mostly neural networks (e.g., CNN, RCNN, LSTM).
There are also efforts focusing on deploying specific ML models in resource-scarce devices. The authors of [13] present ProtoNN, an algorithm that replicates k-Nearest Neighbor (k-NN) but has storage and prediction complexity that is several orders of magnitude lower; ProtoNN models can be deployed on very constrained platforms (e.g., an Arduino UNO with 2 kB of RAM). The authors of [3] present SeeDot, a domain-specific language to express ML inference algorithms, and a compiler that compiles SeeDot programs to fixed-point code that can efficiently run on constrained IoT devices. In [2], CMSIS-NN is presented, which is essentially a set of efficient kernels to maximize the performance and minimize the memory footprint of neural network applications on Arm Cortex-M processors.

B. Interoperability of ML models
The following options, rather than tools, are standards to provide a common description of ML models, therefore enabling porting between libraries.
Open Neural Network Exchange (ONNX) 7 is an open specification with the following components: a definition of an extensible computation graph model; definitions of standard data types; and definitions of built-in operators. The first two make up the ONNX Intermediate Representation (or IR). In ONNX IR, each computation dataflow graph is structured as a list of nodes that form an acyclic graph. Each node is a call to an operator, and has one or more inputs and outputs. Built-in operators are divided into a set of primitive operators and functions (the latter being, essentially, sub-graphs using primitive operators and/or other functions). Operators are implemented externally to the graph, but the set of built-in operators is portable across frameworks. Every framework supporting ONNX will provide implementations of these operators on the applicable data types. ONNX is compatible with at least 29 frameworks and converters and 30 inference runtimes.
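The idea of a graph of operator calls executed by a runtime can be sketched with a toy interpreter. This does not use the ONNX libraries or file format: the dictionary-based node layout, the `OPERATORS` registry and `run_graph` are hypothetical names for illustration, the operators are named after ONNX built-ins but reimplemented naively on Python lists, and the nodes are assumed to be listed in topological order.

```python
# Toy operator implementations, living outside the graph (as in a runtime).
OPERATORS = {
    "Mul": lambda a, b: [x * y for x, y in zip(a, b)],
    "Add": lambda a, b: [x + y for x, y in zip(a, b)],
    "ReduceSum": lambda a: [sum(a)],
}

def run_graph(nodes, initializers, feeds, output_names):
    """Evaluate the nodes in order, storing every intermediate value by name."""
    values = {**initializers, **feeds}
    for node in nodes:
        op = OPERATORS[node["op_type"]]
        args = [values[name] for name in node["inputs"]]
        values[node["output"]] = op(*args)
    return [values[name] for name in output_names]

# A linear scoring model, score = sum(w * x) + b, expressed as three nodes.
nodes = [
    {"op_type": "Mul", "inputs": ["x", "w"], "output": "xw"},
    {"op_type": "ReduceSum", "inputs": ["xw"], "output": "s"},
    {"op_type": "Add", "inputs": ["s", "b"], "output": "score"},
]
initializers = {"w": [0.5, -1.0, 2.0], "b": [0.25]}  # trained parameters

result = run_graph(nodes, initializers, {"x": [1.0, 2.0, 3.0]}, ["score"])
print(result)  # [[4.75]]
```

The graph itself carries only names and parameters; any runtime providing the three operators could evaluate it, which is the portability argument made above.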
Predictive Model Markup Language (PMML) 8 is a document format based on the Extensible Markup Language (XML) that can be used to describe machine learning algorithms. It enables ML model porting between existing supporting libraries; these exist for C++, such as cPMML 9, and for Python, notably with the Scikit-Learn library sklearn2pmml 10, among others.

C. Transpilers
Transpilers translate source code into a language different from the original one. The resulting code is described natively in the target language.
Sklearn-porter 11 is a Python library specifically developed to transpile ML models built with Scikit-Learn to other programming languages such as C, Go and JavaScript.
Model 2 Code Generator (m2cgen) 12 is a free, open-source library mainly developed in Python that transpiles trained statistical models (trained, e.g., with the Scikit-Learn or lightning libraries) into native code for at least 16 different programming languages (R, Visual Basic, Haskell, C#, etc.).
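To make the notion of transpilation concrete, the following toy sketch emits C source for a linear scoring model. It does not reproduce the output or API of sklearn-porter or m2cgen: `linear_model_to_c` and the example parameters are hypothetical, and finite weights and a non-empty weight list are assumed.

```python
def linear_model_to_c(weights, bias, func_name="score"):
    """Return C source code that evaluates sum(w[i] * x[i]) + bias."""
    # repr() yields a decimal literal that round-trips the Python float.
    terms = " + ".join(f"({w!r}) * x[{i}]" for i, w in enumerate(weights))
    return (
        f"double {func_name}(const double *x) {{\n"
        f"    return {terms} + ({bias!r});\n"
        f"}}\n"
    )

# Hypothetical trained parameters.
c_source = linear_model_to_c([0.5, -1.0, 2.0], 0.25)
print(c_source)
```

The generated function has no dependency on the training library, which is what makes transpiled models attractive for constrained targets; the price is that every supported model type needs its own code generator.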

D. Runtime Environments
A third dimension discussed here are tools that offer runtime environments (or simply runtimes). Some of the aforementioned ML libraries leverage mechanisms for intermediate model representation that can be compiled or interpreted by a runtime environment. This solution avoids the need to deploy the entire library at the target device, thus resulting in a lightweight version of the initial library.
ONNX Runtime 13 is a cross-platform machine-learning model accelerator, used to deploy ONNX format models into production. It is meant to enable acceleration of machine learning inferencing across a variety of target hardware.
TensorFlow Lite 14 is a TF variant tailored for resource-constrained systems that also uses a runtime. Using TensorFlow Lite, the target devices do not require the full TF library installation, but solely the tflite runtime to perform inference. This tool eases the computational requirements of the target system, but its accuracy can be compromised if the model uses operations not supported by TensorFlow Lite. A recent paper reports TensorFlow Lite Micro [14], which adopts an interpreter-based approach to address ML efficiency and fragmentation in ES.

A. Selected Tools & Experimental Setup
Dataset        Traffic type    # Samples    # Features
IOT23 [15]     IoT devices     487          26
Botnet [16]    Data theft      196          26

We have picked ONNX Runtime as the target ML tool to evaluate, and Scikit-Learn as the baseline reference. The choice of Scikit-Learn was straightforward, as it is one of the most widely used ML tools. It is also the tool used to train the models used in these measurements. As for the tools for deployment in embedded systems, we opted for ONNX Runtime based on a mix of our own requirements (which, when crossed against the available documentation, led us to eliminate the remaining candidate tools), and impressions acquired from experimenting with the other high-potential candidates. We lay down next the authors' impressions of the reviewed tools; this should not be interpreted, in any way, as a methodical and rigorous analysis of these tools.
ML libraries: TensorFlow proved to be a collection of dispersed, pre-trained models, making it hard to train new models from scratch. Armadillo, mlpack, Shogun, SHARK and Caffe, despite being written in C/C++, do not seem tailored for deployment in resource-constrained devices.
Interoperability: ONNX provides a clear and well-documented specification of how to convert models between tools, with extensive software support. PMML enables model porting between supporting libraries but, as aforementioned, we found no library to be a suitable candidate.
Transpilers: sklearn-porter is still under development and the range of models that can be transpiled to C is small (SVM and Decision Trees/Random Forest). Regarding m2cgen, even though transpiled models were able to perform closely to the original model, the tool offers very little documentation, making it hard to interpret the tool's output or even understand how the transpilation process actually occurs.
Runtimes: ONNX Runtime emerged as the best option. TensorFlow Lite was not explored, as usage of standard TensorFlow was also not straightforward.
Table I describes the datasets used in this performance analysis; more details can be found in [6]. Table II presents the characteristics of the selected computing platforms. The models were converted to the ONNX specification using the sklearn-onnx library. A variant named ONNX Runtime Optimized, which optimizes the ONNX graphs describing the models, was also evaluated. Model accuracy obtained with ONNX Runtime and its Optimized variant was similar to that of Scikit-Learn.
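A per-sample timing measurement of the kind described in this section can be sketched as follows. This is not the authors' measurement code: `time_per_sample`, `summarize`, the warm-up count and the placeholder `dummy_predict` are assumptions; in practice the placeholder would be replaced by a call to the model under test (e.g., a Scikit-Learn `predict` call or an ONNX Runtime session). The sample count and feature count mirror the IOT23 row of Table I, but the sample values are synthetic.

```python
import statistics
import time

def time_per_sample(predict, samples, warmup=10):
    """Time one prediction call per sample, after a few untimed warm-up calls."""
    for sample in samples[:warmup]:
        predict(sample)
    timings = []
    for sample in samples:
        start = time.perf_counter()
        predict(sample)
        timings.append(time.perf_counter() - start)
    return timings

def summarize(timings):
    """Report the average together with indicators of the spread."""
    return {
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }

def dummy_predict(sample):
    """Placeholder model: returns +1 (inlier) or -1 (outlier)."""
    return 1 if sum(sample) >= 0 else -1

# Synthetic samples: 487 rows of 26 features.
samples = [[float(i % 7 - 3)] * 26 for i in range(487)]
timings = time_per_sample(dummy_predict, samples)
stats = summarize(timings)
print({name: f"{value * 1e6:.2f} us" for name, value in stats.items()})
```

Keeping the individual timings, rather than only their mean, is what allows the distribution of the prediction time to be inspected in addition to its average.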

B. Results
Figure 2 presents the average prediction time (over all input samples) of the four ML models across the three tools on the desktop. Presented values are the average prediction time for each new sample. We observe that ONNX Runtime produces an acceleration for most models, notably of ≈ 16x for Isolation Forest, ≈ 14x for OC-SVM, and ≈ 49x for SGD-SVM. In all these cases, the performance of ONNX Runtime and that of its Optimized version do not differ substantially from each other. The same is not true, however, for the Local Outlier Factor (LOF), as shown in Figure 2 (top-right). We observed that ONNX Runtime underperforms for this model, taking longer than Scikit-Learn. This leaves the door open for a more efficient implementation of LOF using the ONNX operators. Figure 3 depicts the distribution of the prediction time of the various models per tool when executed on the desktop. It is noteworthy that, for ONNX Runtime (vanilla and Optimized), LOF presents the highest prediction time whereas, for Scikit-Learn, it is iForest that takes up the most time. Regarding the distribution of the samples, the variability is limited in the case of ONNX Runtime and its Optimized variant to a few occasional outliers of additional time. For Scikit-Learn, LOF experiences considerable variability in prediction time. This may be a trade-off of the Scikit-Learn LOF implementation to achieve a lower average time for this particular model.
Figure 4 exhibits the same analysis as Figure 2 for the second platform. The average prediction time for the four ML models on the Raspberry Pi is higher than on the desktop; in detail, by ≈ 81% for Isolation Forest, by ≈ 37% for LOF, and by ≈ 43% for OC-SVM. However, when comparing with Scikit-Learn running on the desktop, we obtain speed-ups of ≈ 8x for Isolation Forest, ≈ 9x for OC-SVM, and ≈ 39x for SGD-SVM. The results in Figure 5 show few differences from Figure 3 (right and bottom) where it applies, apart from the generally higher median values on the Raspberry Pi.
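The relation between the reported figures can be checked with a short calculation. It assumes that the desktop speed-ups and the Raspberry Pi slowdowns refer to the same ONNX Runtime variant, so that the speed-up of ONNX Runtime on the Raspberry Pi over Scikit-Learn on the desktop is approximately the desktop speed-up divided by (1 + slowdown). The inputs are the rounded values quoted in the text, so only approximate agreement can be expected.

```python
# Rounded values quoted in the text: (desktop speed-up, RPi slowdown fraction).
reported = {
    "Isolation Forest": (16.0, 0.81),
    "OC-SVM": (14.0, 0.43),
}

# Estimated speed-up of ONNX Runtime on the RPi over Scikit-Learn on the desktop.
estimated = {
    model: speedup / (1.0 + slowdown)
    for model, (speedup, slowdown) in reported.items()
}

for model, value in estimated.items():
    print(f"{model}: ~{value:.1f}x")
```

The estimates (roughly 8.8x and 9.8x) are of the same order as the ≈ 8x and ≈ 9x reported above; the remaining gap is presumably due to the rounding of the quoted inputs.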

V. CONCLUSION
We reviewed Machine Learning (ML) tools according to their potential for embedded systems. We selected a particular tool, ONNX Runtime, for comparing prediction time against the well-established Python-based Scikit-Learn. ONNX Runtime is capable of running models described in the ONNX format; the models were trained in Scikit-Learn and exported to ONNX. The prediction time was measured on two platforms - a standard desktop and a target embedded system, a Raspberry Pi v4 - for four pre-trained ML models and datasets. We observe that ONNX Runtime considerably improves over the prediction time of Scikit-Learn, and experiences a negligible performance degradation when ported to the RPi. Future work will evaluate performance on more ML tools and platforms and investigate trade-offs with model target accuracy.

TABLE I: Dataset descriptions.

TABLE II: Selected platforms.