An impact of tensor-based data compression on deep neural network accuracy

The emergence of deep neural architectures has greatly influenced the contemporary big data revolution. However, the demand for large datasets has further increased the need for efficient data storage. The storage problem is present at all stages, from dataset creation up to the training and prediction stages. At the same time, compression algorithms can significantly deteriorate the quality of the data and, in effect, of the classification models. In this article, an in-depth analysis of the influence of tensor-based lossy data compression on the performance of various deep neural architectures is presented. We show that the Tucker and the Tensor Train decomposition methods, with properly selected parameters, allow for very high compression ratios while conveying enough information in the decompressed data to cause only a negligible or very small drop in accuracy. The measurements were performed on popular deep neural architectures: AlexNet, ResNet, VGG, and MNASNet. We show that further augmentation of the tensor decompositions with the ZFP floating-point compression algorithm allows for finding optimal parameters and even higher compression ratios at the same recognition accuracy. Our experiments show data compression of 94%-97% that results in less than a 1% accuracy drop.


I. INTRODUCTION
The volume of generated data keeps growing rapidly. In this light, data compression methods are of high importance. This is also true in the ML/AI domain, where modern deep neural architectures require ever larger training data sets. On the other hand, it has been shown that tensor decomposition methods achieve very high compression ratios on multidimensional data [1]-[4]. Not surprisingly, tensor decompositions have been applied to compress the weights of deep neural networks [5], [6]. There are also works focusing on the problem of learning from compact representations of images [7], [8]. However, to the best of our knowledge, there are no studies on the influence of training data compression with tensor decompositions on the accuracy of deep neural networks. This paper fills this gap by providing an in-depth analysis of the influence of data compression based on the Tucker and Tensor Train (TT) decompositions on the performance of common deep network architectures such as AlexNet, ResNet, VGG, and MNASNet. The networks are trained in different scenarios and, thanks to batch processing, not all data need to be decompressed at the same time. Furthermore, we propose an additional step of data compression based on the ZFP floating-point lossy compression method [9]. Our experimental results show that properly set up tensor decompositions followed by the ZFP module allow for data compression as high as 94%-97% with less than a 1% drop in accuracy of the deep neural networks.
The rest of the paper is organized as follows. Section II describes the related works. In Section III the two tensor decomposition methods used in our experiments are discussed. Section IV presents the ZFP floating-point compression method. In Section V the neural network models utilized during the experiments are described. Our proposed tensor-based training method is explained in Section VI. Experimental results are presented and discussed in Section VII. Finally, Section VIII concludes the paper.

II. RELATED WORKS
Compression methods have gained much attention over the last few decades. Since the seminal works of Lempel & Ziv on lossless compression [10], many more diverse methods have been proposed in the lossy compression domain. It was soon realized that various matrix decompositions, such as the well-known SVD, as well as tensor decompositions can lead to significant data reductions [11]. However, the latter depend on many parameters, such as the tensor rank [2], [12]. In the case of multidimensional data, such as images or video, the tensor-based approach offers many more possibilities, as will be discussed. In the era of deep neural networks, tensor-based methods proved to be superior in compressing their weights. However, it is also possible to compress their training and testing data; in this paper, we explore this branch.
Balle et al. proposed an image compression method consisting of nonlinear transformations for analysis and signal synthesis [13]. Three stages of convolutional linear filters with nonlinear activation functions were used to create both transforms. The method achieved better rate-distortion performance than the standard JPEG and JPEG 2000 compression methods.
The tensor approach was explored by Zhang et al. [14]. The hyperspectral images were stacked into 3D tensors in which the spatial-spectral structure is preserved. The data, approximated and stored in projection matrices, achieved a high compression ratio with a low level of introduced artifacts.
Aidini et al. applied the CANDECOMP/PARAFAC tensor decomposition to compression [15]. The multispectral image time series was expressed as a linear combination of the learned tensors, and the coefficients were quantized using the learned encoding dictionary.
Watkins and Sayeh proposed a method for gray image compression based on deep neural networks. The network has an autoencoder structure and is capable of both compressing and decompressing images with a high compression ratio [16].
A multispectral image compression method based on a convolutional neural network was proposed by Li and Liu [5]. The processing path consists of encoder and decoder parts. Both parts use a CNN in combination with the discrete cosine transform and nonnegative tensor decomposition (NDT) in a pseudo-autoencoder structure. The method shows improved computational efficiency with PSNR values comparable to direct NDT in the wavelet domain.
Friedland et al. [17] analyze the impact of artifacts from perceptual compression on deep learning, concluding that classification accuracy is tightly connected to the compression rate and data quantization. The loss of classification efficiency was mainly due to artifacts introduced during the compression process.
Similarly, in the PhD thesis on the impact of standard image compression techniques on CNN performance, Dejan concluded that a network trained on JPEG encoded images partially relies on artifacts introduced by the compression [18].
In this paper, we do not use weight compression [6], [19], [20]. It is an additional option for further saving space in neural networks and can be addressed in future works. In recent years, tensors have gained much attention from the ML/AI community, also in the context of data decompositions for compression [2]-[4], [12]. However, there is no work that analyzes the influence of tensor-based compression of the training and testing data on the performance of deep neural networks. We fill this gap, starting in the next section with a brief introduction to tensors.

III. TENSOR COMPRESSION
Tensors are mathematical objects which can be regarded as multidimensional arrays of data, in which each separate dimension corresponds to a different degree of freedom of a measurement. Such an approach provides tools that extend classical matrix analysis and can take into account correlations hidden in data, yielding better results in various applications, such as compression or filtering [2].
Tensors extend the notion of vectors and matrices to higher-dimensional objects [2], [3], [21]-[23]. As discussed below, they allow for better representation and processing of multidimensional signals and, in effect, also for higher compression ratios.
The multidimensional compression can be based on the following tensor product
$$\tilde{\mathcal{T}} = \mathcal{T} \times_1 F_1 \times_2 F_2 \cdots \times_P F_P, \tag{1}$$
where the decompressed version of the input tensor $\mathcal{T}$ is denoted as $\tilde{\mathcal{T}}$, whereas $F_i$ is the $i$-th factor matrix. The key idea is that the set of factor matrices and their product with $\mathcal{T}$, from the right side of (1), require much less storage space than the original tensor $\mathcal{T}$, while its recovered version $\tilde{\mathcal{T}}$ is close enough in terms of the chosen norm, allowing e.g. for proper CNN training, as will be discussed. The above equation uses the $k$-th modal product $\mathcal{T} \times_k M$ of a tensor $\mathcal{T} \in \Re^{N_1 \times N_2 \times \ldots \times N_P}$ and a matrix $M \in \Re^{Q \times N_k}$. The result is also a tensor $\mathcal{S} \in \Re^{N_1 \times N_2 \times \ldots \times N_{k-1} \times Q \times N_{k+1} \times \ldots \times N_P}$, whose elements are expressed as follows:
$$s_{n_1 \ldots n_{k-1} q n_{k+1} \ldots n_P} = \sum_{n_k=1}^{N_k} t_{n_1 \ldots n_{k-1} n_k n_{k+1} \ldots n_P} \, m_{q n_k}. \tag{2}$$
As shown below, the compression matrices $F_i$, called factors, can be obtained using the Tucker decomposition of tensors [12]. Thanks to the proper selection of the ranks of the tensor decomposition factors, the decomposition usually separates the useful signal well from its high-frequency components while taking the multidimensional characteristics of the signal into account. The decomposition of the tensor $\mathcal{T}$ is done by calculating an approximating tensor $\tilde{\mathcal{T}}$ that is close to the input tensor in terms of the Frobenius norm. Hence, the function to be minimized is defined as follows:
$$\Theta(\tilde{\mathcal{T}}) = \left\| \tilde{\mathcal{T}} - \mathcal{T} \right\|_F^2. \tag{3}$$
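The modal product and the Frobenius-norm criterion described above can be illustrated with a few lines of NumPy. This is only a minimal sketch: the function names `mode_product` and `relative_error` are illustrative and do not come from the paper, and modes are numbered from 0 as is usual for array axes.

```python
import numpy as np

def mode_product(tensor, matrix, mode):
    """k-th modal product of `tensor` with a (Q x N_k) `matrix` along axis `mode`."""
    # tensordot sums over tensor axis `mode` and matrix axis 1 and puts the
    # new axis of size Q last; moveaxis returns it to position `mode`.
    result = np.tensordot(tensor, matrix, axes=(mode, 1))
    return np.moveaxis(result, -1, mode)

def relative_error(original, approx):
    """Frobenius-norm distance between two tensors, relative to the original."""
    return np.linalg.norm(original - approx) / np.linalg.norm(original)

rng = np.random.default_rng(0)
T = rng.standard_normal((4, 5, 6))   # N1 x N2 x N3
M = rng.standard_normal((3, 5))      # Q x N2
S = mode_product(T, M, mode=1)       # result has shape N1 x Q x N3
print(S.shape)
```

The second axis of the result has size Q, while all other dimensions are unchanged, which matches the shape of the result tensor given in the text.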

A. Tucker-based methods
The concept of the Tucker decomposition of a 3D tensor is presented in Figure 1. Assuming that the approximating tensor $\tilde{\mathcal{T}}$ contains the same amount of useful information as the original tensor $\mathcal{T}$, it can be expressed as follows
$$\tilde{\mathcal{T}} = \mathcal{Z} \times_1 S_1 \times_2 S_2 \cdots \times_P S_P, \tag{4}$$
where $\mathcal{Z} \in \Re^{R_1 \times R_2 \times \ldots \times R_P}$ is a core tensor and $S_i \in \Re^{N_i \times R_i}$ are the so-called mode matrices. Using algebraic operations, from Equation (4), the formula for the core tensor is obtained:
$$\mathcal{Z} = \mathcal{T} \times_1 S_1^T \times_2 S_2^T \cdots \times_P S_P^T. \tag{5}$$
Combining Equation (5) with Equations (3) and (4) yields
$$\tilde{\mathcal{T}} = \mathcal{T} \times_1 (S_1 S_1^T) \times_2 (S_2 S_2^T) \cdots \times_P (S_P S_P^T). \tag{6}$$
The Tucker decomposition in Equation (6) states that a tensor $\mathcal{T}$ is approximated by its projection onto the space spanned by the matrices $S_k$. To compute the series of matrices $S_k$, the alternating method can be used [2], [21], [24]-[26]. Also, observe that $S_i S_i^T$ in (6) is equivalent to the factor matrix $F_i$ from (1). The approximation in Equation (4) includes only the components conveying the majority of the energy available in the signal. However, to ensure the best quality, the proper ranks $R_1$, $R_2$, and $R_3$ of the mode matrices $S_i$ need to be estimated. Although fixed values can be used as a first approximation, in real dynamic systems with unpredictable noise levels the proper ranks need to be based on the signal content. Multiple methods have been presented to help solve this problem; for example, Muti and Bourennane [22] use the minimum description length (MDL) criterion computed for each dimension separately, allowing optimal rank selection.
The Tucker format is stable, but its computational complexity grows exponentially with the input tensor dimensions. This makes the method suitable only for "small" dimensions [27], [28].
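As a hedged illustration of the Tucker format, the sketch below computes mode matrices with a truncated higher-order SVD and then forms the core tensor by projecting onto them. Note that this is a simpler, non-iterative substitute for the alternating (best rank) procedure referenced in the text, and all function names are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def unfold(tensor, mode):
    """Flatten `tensor` into a matrix whose rows are indexed by axis `mode`."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """Modal product of `tensor` with `matrix` along axis `mode`."""
    return np.moveaxis(np.tensordot(tensor, matrix, axes=(mode, 1)), -1, mode)

def tucker_hosvd(tensor, ranks):
    """Truncated HOSVD: returns the core tensor and one mode matrix per axis."""
    factors = []
    for mode, rank in enumerate(ranks):
        # Leading left singular vectors of the unfolding span the mode subspace.
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    core = tensor
    for mode, s in enumerate(factors):
        core = mode_product(core, s.T, mode)   # project onto each subspace
    return core, factors

def tucker_reconstruct(core, factors):
    """Expand the core back to the full tensor size."""
    out = core
    for mode, s in enumerate(factors):
        out = mode_product(out, s, mode)
    return out

def stored_elements(core, factors):
    """Number of values that must be kept for the compressed representation."""
    return core.size + sum(f.size for f in factors)
```

Only the core and the mode matrices need to be stored; `stored_elements` counts them so that the saving with respect to `tensor.size` can be checked for a chosen set of ranks.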

B. Tensor-Train
To mitigate the computational complexity problem, a $D$-th order tensor $\mathcal{T} \in \Re^{N_1 \times N_2 \times \ldots \times N_D}$ can be represented element-wise as a product of matrices [45]:
$$t_{j_1 j_2 \ldots j_D} = G_1[j_1] \, G_2[j_2] \cdots G_D[j_D],$$
where each $G_k[j_k]$ is a matrix of size $r_{k-1} \times r_k$, with $r_0 = r_D = 1$. The set of $r_k$ is collectively called the Tensor Train ranks. The matrices $G_k[j_k]$ belonging to the same mode can be stacked into a 3rd-order core tensor $\mathcal{G}_k \in \Re^{N_k \times r_{k-1} \times r_k}$, allowing $\mathcal{T}$ to be represented as follows:
$$\mathcal{T} = \mathcal{G}_1 \times^1 \mathcal{G}_2 \times^1 \cdots \times^1 \mathcal{G}_D,$$
where $\times^1$ is the mode-(N,1) contracted product, presented in [29].
The decomposition can be presented graphically by a linear tensor network [30], [31], illustrated in Figure 2. There are two types of nodes: rectangular and circular. A rectangular node contains spatial indices ($i_k$), some auxiliary indices ($\alpha_k$), and the tensor associated with these indices. A circular node, on the other hand, is a link and contains only auxiliary indices. If an auxiliary index is present in two cores, the cores are connected. To evaluate an input tensor, all tensors in the rectangles need to be multiplied, and then the summation is performed over all auxiliary indices. Compared to the Tucker decomposition, the Tensor Train format has lower spatial complexity, making it more computationally efficient for tensors with larger dimensions [32].
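The Tensor Train representation can be sketched with the well-known TT-SVD procedure of sequential truncated SVDs. The code below is an assumption-laden illustration, not the implementation used in the experiments: the names are hypothetical, a single cap `max_rank` is applied to all ranks, and the cores are stored in the layout r_{k-1} × N_k × r_k, which differs from the ordering in the text only by a permutation of axes.

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Decompose `tensor` into TT cores of shape (r_prev, N_k, r_next)."""
    dims = tensor.shape
    cores, r_prev = [], 1
    mat = tensor.reshape(dims[0], -1)
    for k in range(len(dims) - 1):
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(s))                     # truncate the TT rank
        cores.append(u[:, :r].reshape(r_prev, dims[k], r))
        # Carry the remainder forward and expose the next mode as rows.
        mat = (s[:r, None] * vt[:r]).reshape(r * dims[k + 1], -1)
        r_prev = r
    cores.append(mat.reshape(r_prev, dims[-1], 1))
    return cores

def tt_element(cores, index):
    """Evaluate one tensor element as a product of the matrices G_k[j_k]."""
    vec = np.ones((1, 1))
    for core, j in zip(cores, index):
        vec = vec @ core[:, j, :]
    return vec[0, 0]

def tt_reconstruct(cores):
    """Contract all cores over the auxiliary (rank) indices."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=1)   # last axis of out with first of core
    return out[0, ..., 0]                       # drop the boundary ranks r_0 = r_D = 1
```

`tt_element` mirrors the element-wise matrix product, while `tt_reconstruct` corresponds to multiplying all rectangular nodes of the tensor network and summing over the auxiliary indices.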

IV. FIXED-RATE COMPRESSED FLOATING-POINT ARRAYS
The need for floating-point array compression is reflected in the multiple lossy and lossless compression algorithms developed over the years. The most widespread are image compression methods that allow encoding 2D and 3D arrays. For instance, PNG and JPEG-LS use linear prediction, JPEG uses block transform coding, and JPEG2000 relies on higher-order wavelets.
The Fixed-Rate Compressed Floating-Point Arrays (ZFP) compression scheme is based on ideas developed to compress 2D image data efficiently [33]. The input 3D array is divided into small, fixed-size blocks of dimensions 4 × 4 × 4, each stored using a user-specified number of bits, so that the blocks can be accessed independently. The method compresses each block in several steps: the values are aligned to a common exponent, converted to a fixed-point representation, decorrelated with an orthogonal block transform, ordered by expected magnitude, and encoded one bit plane at a time. The prepared values are transformed to a basis in which the spatially correlated values become mostly decorrelated, which results in many near-zero coefficients that can be compressed efficiently.
A separable orthogonal transform in d dimensions is employed to take advantage of regularly gridded data, resulting in a basis that is the tensor product of 1D basis functions. Due to the choice of its coefficients, the proposed transform replaces divisions and multiplications with bit shifts. This choice achieves near-optimal results in terms of decorrelation efficiency and coding gain and is very efficient from a computational perspective. Further details of the ZFP method can be found in [33].
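The block partitioning and the alignment of a block to a common exponent can be sketched as follows. This is not the ZFP implementation: the function names and the `bits` parameter are assumptions made for illustration, the array sides are assumed to be multiples of the block size, and the transform and bit-plane coding stages are omitted.

```python
import numpy as np

def split_into_blocks(volume, b=4):
    """Split a 3D array whose sides are multiples of `b` into b x b x b blocks."""
    nx, ny, nz = (s // b for s in volume.shape)
    blocks = volume.reshape(nx, b, ny, b, nz, b).transpose(0, 2, 4, 1, 3, 5)
    return blocks.reshape(-1, b, b, b)

def to_block_fixed_point(block, bits=30):
    """Align a block to its largest exponent and quantize to signed integers."""
    # frexp writes |x| = m * 2**e with 0.5 <= m < 1, so x / 2**emax lies in (-1, 1).
    _, exponents = np.frexp(block)
    emax = int(exponents.max())
    ints = np.round(np.ldexp(block, bits - emax)).astype(np.int64)
    return ints, emax            # emax is kept alongside the integers

def from_block_fixed_point(ints, emax, bits=30):
    """Invert the fixed-point conversion (up to the rounding error)."""
    return np.ldexp(ints.astype(np.float64), emax - bits)

volume = np.arange(8 * 8 * 4, dtype=np.float64).reshape(8, 8, 4)
blocks = split_into_blocks(volume)
ints, emax = to_block_fixed_point(blocks[3])
restored = from_block_fixed_point(ints, emax)
```

Because every block carries its own exponent, blocks can be decoded independently of each other, which is the property that makes random access to the compressed array possible.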

V. NEURAL NETWORK MODELS
The advent of modern ML/AI methods, especially deep neural networks (DNN), resulted in a real IT revolution [34]. In a very short time, people realized the real power of these systems, which can be trained directly from data with marvelous results. Hence, the term "data" gained even more importance. With this came the eruption of modern deep neural network architectures, such as AlexNet [35], VGG [36], ResNet [37], and dozens of their derivatives [38]-[41]. These were possible thanks to novel scientific achievements, such as a solution to the vanishing gradient problem in very deep networks and new optimization algorithms, and thanks to the availability of efficient general-purpose graphics processing units (GPGPU).
For example, in the computer vision domain, neural networks dominate in the majority of tasks, such as object detection and recognition, segmentation, and filtering, to name a few. However, the performance of neural networks depends heavily on the availability of high quality labeled data. This can be jeopardized by many factors, such as the compression-decompression processes that we focus upon in this work.
To examine the impact of the proposed compression method, state-of-the-art models capable of high accuracy with reasonably low training time were used. Their architectures are briefly described below.

AlexNet
The neural network contains eight learned layers [35]: five convolutional and three fully connected. It introduced such features as the ReLU nonlinearity, the capability of training on multiple GPUs, local response normalization, and overlapping pooling. This approach, combined with a strong emphasis on overfitting reduction, results in a model that is highly accurate even today. The network is recognized as a milestone, and the solutions proposed in the article are now considered standard by the AI/ML community.

ResNet
The ResNet networks implement skip connections with ReLU nonlinearity and batch normalization [41]. This mitigates the vanishing gradient problem and speeds up the training process, which positively impacts the feasibility of training deeper neural networks.

VGG
The model uses small receptive fields, decreasing the number of trainable parameters and increasing the ReLU unit count [36].Such an approach makes the decision function more discriminative, which in turn increases overall network performance.

MNASNet
It is a neural architecture for mobile devices [42]. It is based on the Factorized Hierarchical Search Space approach, which balances the diversity of layers and the size of the total search space. The resulting network generally runs faster and uses less computational power.

VI. TENSOR-BASED DATA PROCESSING
The main goal of our approach is to decrease the data size with a lossy compression process without a significant impact on the quality of object prediction by the benchmark neural network architectures. The proposed method requires an input in tensor form. Based on the images selected for processing, the width (W) and height (H) parameters describing the tensor dimensions are chosen, and the images are assembled into the tensor (Algorithm 1).
The next step is a tensor decomposition, as presented in Figure 3. Depending on the selected method, the result is a set of mode matrices (Tensor Train) or a core tensor and mode matrices (Tucker decomposition). These approximated signals contain the most relevant information, while the higher-frequency components are removed. Smaller rank values selected for the method translate to higher compression and stronger high-frequency attenuation. In the next step, the obtained results are compressed with the ZFP algorithm using a tolerance argument ZFP_t that controls the compression quality. Such compressed data objects can be stored on a disk for further use or be utilized as an intermediate step in real-time processing.
Before the next step, the processed data needs to be decompressed, as presented in Figure 4. First, the ZFP decompression is applied; then, in the reverse signal synthesis process, the aforementioned matrices are merged into the result tensor. The reverse signal synthesis can be calculated for the entire tensor, a single image, or a set of consecutive images. Such flexibility decreases the computational requirements and allows the method to be used as part of a modern neural network training architecture.
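The selective reconstruction of a single image or a batch can be sketched for the Tucker case as follows. The sketch assumes a third-order tensor in which the images are stacked along the last mode; the function name and argument layout are hypothetical, and the ZFP decompression of the core and mode matrices is assumed to have been done already.

```python
import numpy as np

def reconstruct_images(core, factors, indices):
    """Rebuild only the images listed in `indices` from a Tucker representation.

    core    : array of shape (R1, R2, R3)
    factors : mode matrices of shapes (W, R1), (H, R2), (D, R3)
    indices : positions of the requested images along the third mode
    """
    s1, s2, s3 = factors
    rows = s3[np.asarray(indices)]        # only the rows for the requested images
    # out[w, h, d] = sum over a, b, c of core[a, b, c] * s1[w, a] * s2[h, b] * rows[d, c]
    return np.einsum('abc,wa,hb,dc->whd', core, s1, s2, rows, optimize=True)
```

Only the selected rows of the third mode matrix enter the contraction, so the cost of preparing a training batch scales with the batch size rather than with the number of images in the whole tensor.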
VII. EXPERIMENTAL RESULTS
The presented experiments were performed on a server computer equipped with 256 GB of RAM, a 64-core AMD Ryzen Threadripper 3990X processor with a 2.9 GHz base clock, and the 64-bit Ubuntu 20.04.2 LTS OS.
The quantitative results were measured in terms of the compression ratio (Cr) and the object detection accuracy of the neural network (Acc). The compression ratio relates the size of the original data to the size of the compressed data. The object detection accuracy of a neural network is the proportion of correct predictions over the total number of examined cases:
$$Acc = \frac{TP + TN}{TP + TN + FP + FN},$$
where TP, TN, FP, and FN represent the true positive, true negative, false positive, and false negative counts, respectively, achieved by the network model on the tested dataset.
For the quantitative evaluation of the proposed method, Imagenette [44] was selected. The dataset is a subset of 10 easily classified classes from the Imagenet dataset and was selected to decrease the time needed for development and tests. The original structure contains train and validation sets with 9469 and 3925 images, respectively. For better quality estimation, the presented accuracy of the tested neural networks is measured using a test set containing 1500 images. This set was separated from the original subsets using 900 images from the train set and 600 images from the validation set. The final image counts for the used image sets are as follows: training 8569 images, validation 3325 images, and test 1500 images.
The input data was processed using the tensor-based compression methods described in Section VI. The collection of test datasets was created depending on the selected parameters and methods, as shown in Table I.
To cover a broad spectrum of possible uses of the proposed method, the impact of tensor-based data compression/decompression on the networks was checked for different learning processes. Both random weight initialization and the transfer learning technique were used. For the latter, the models were pre-trained on the Imagenet dataset.
The training process was conducted for each of the prepared datasets and each of the selected benchmark neural networks, and the results were measured. All prepared data plots, from Figure 5 to Figure 19, are presented below.
Denoting the dimensions of the input tensor as W × H × D, the values of the corresponding ranks were calculated as [0.5W, 0.5H, D] for the Tucker decomposition. Similarly, in the case of the Tensor Train method, the dimensions of the factor matrices were set to [1, 0.5 min(W, H), 0.5 min(W, H), 1]. The compression ratios obtained for the different tensor methods and values of the ZFP tolerance are presented in Table II. The neural network accuracy results are presented in Table III for random weight initialization and in Table IV for transfer learning. The results are discussed below.
For the transfer learning technique, it can be observed that TT achieved better accuracy for the selected ranks. Depending on the tested neural network, it achieves results 0.6%-2.5% worse than on the original dataset. At the same time, the TT method provides a lower compression ratio, 14.92 and 16.86 for lossless and lossy ZFP compression, respectively. On the other hand, the Best rank method achieves a higher compression ratio, 20.41-21.39, but with lower network accuracy. In this case the difference between the original data and the tested datasets is larger, from 2.1% to 6.3%. Additionally, in nearly all cases, methods combining a tensor-based algorithm with lossy ZFP yield better results than those using lossless ZFP. Depending on the tested neural network, the lossy ZFP version differs from the original by 0.6%-1.1% and 2.1%-4.1% for the TT and Best rank methods, respectively.
In the random weight training case, TT again achieves better accuracy, with an accuracy loss of 3.3%-7.2% depending on the tested neural network. The Best rank method yields a 6.6%-13.5% drop in accuracy. In this context, for nearly all tested cases, lossy ZFP compression does not change the accuracy achieved by the tested neural networks.
Hence, the proposed method combines high compression capabilities with good retention of the signals important for the detection process. For transfer learning, the most common neural network training technique used today, the method achieves results very close to the values obtained on the original dataset.
A higher compression rate impacts the quality of all tested networks by removing high-frequency components from the images, which lowers the object detection accuracy by 0.6%-1.1% and 2.1%-4.1% for the Tensor Train and Best rank methods, respectively. However, tensor methods inherently allow an easy change in the compression rate by selecting different ranks during the decomposition process, making it possible to find an acceptable trade-off for a given application.
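To make the rank rules concrete, the sketch below counts the values stored by each representation before any ZFP coding. The tensor size used in the example is a hypothetical choice, not the one from the experiments, and the counts are raw element counts rather than the compression ratios reported in Table II.

```python
def tucker_elements(dims, ranks):
    """Core entries plus the entries of one (N_i x R_i) mode matrix per mode."""
    core = 1
    for r in ranks:
        core *= r
    return core + sum(n * r for n, r in zip(dims, ranks))

def tt_elements(dims, tt_ranks):
    """Sum of the core sizes r_{k-1} * N_k * r_k over all modes."""
    return sum(tt_ranks[k] * n * tt_ranks[k + 1] for k, n in enumerate(dims))

# Hypothetical tensor size, used only to show the rank rules in action.
W, H, D = 64, 64, 100
tucker_ranks = (W // 2, H // 2, D)            # [0.5W, 0.5H, D]
r = min(W, H) // 2
tt_ranks = (1, r, r, 1)                       # [1, 0.5 min(W, H), 0.5 min(W, H), 1]

original = W * H * D
tucker_count = tucker_elements((W, H, D), tucker_ranks)
tt_count = tt_elements((W, H, D), tt_ranks)
```

Comparing `original` with `tucker_count` and `tt_count` gives a quick estimate of how much the decomposition alone contributes, before the ZFP stage is applied on top of it.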

VIII. CONCLUSION
In this paper, a novel framework and experiments on data compression/decompression, carried out to measure its impact on deep neural network training and prediction, are presented. The compression/decompression is based on tensor decomposition methods combined with floating-point array compression. We show that the presented methods can achieve very high compression ratios while still preserving enough significant information in the data to achieve high accuracy of object detection in the neural models.
The utilized algorithms can smoothly change the achieved compression rate, which impacts network accuracy during the training process, allowing users to find parameters that are optimal for a given application. Furthermore, storing data in the proposed form allows for selective decompression of a single image or a group of images without the need to decompress the entire tensor. The accuracy drop with respect to the original data is visible, but for problems where storage or data transfer speed is important, the method can increase performance both during training and in normal operation. Alternatively, thanks to data compression, a larger amount of data can be transferred and used for training. Also, although the method was developed for images, it can be useful for other data types, such as physical or industrial measurements. In summary, the best results were achieved using the transfer learning technique, where a dataset is processed with the Tensor Train decomposition paired with the lossy version of the ZFP algorithm. In this setting, a 0.6%-0.7% drop in accuracy of the deep neural networks with respect to the original dataset was achieved for all tested methods. The best accuracy, both with respect to the original and the processed dataset, was obtained with the VGG-11 and VGG-13 models.

The ZFP block compression steps referred to in Section IV are:
1) Align the values in a block to a common exponent;
2) Convert the floating-point values to a fixed-point representation;
3) Apply an orthogonal block transform to decorrelate the values;
4) Order the transform coefficients by the expected magnitude;
5) Encode the resulting coefficients one "bit plane" at a time.
The conversion to fixed-point is done by expressing each block value with respect to the largest floating-point exponent in a block, which is stored uncompressed, resulting in normalized values in the range (-1, +1).

POSITION AND COMMUNICATION PAPERS OF THE FEDCSIS. ONLINE, 2021

Algorithm 1 Tensor assembler.
1: T = ∅
2: while I ≠ ∅ do
3:   load original image I_i and prepare container for resized image I_p
4:   calculate x and y offsets needed to place I_i in the center of I_p
5:   resize selected image from I to dimensions specified in W and H, keeping aspect ratio of original data, and save it to I_p
6:   append I_p to T
7:   remove I_i from I
8: end while
9: return T
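A possible Python rendering of the tensor assembler is sketched below. It makes several assumptions not stated in the algorithm: the images are single-channel 2D arrays, resizing uses simple nearest-neighbour sampling, unused container pixels are zero, and the frames are stacked along the last axis. The helper names are illustrative.

```python
import numpy as np

def resize_nearest(image, new_h, new_w):
    """Nearest-neighbour resize of a 2D array (an assumed, simple resampler)."""
    rows = np.arange(new_h) * image.shape[0] // new_h
    cols = np.arange(new_w) * image.shape[1] // new_w
    return image[rows][:, cols]

def assemble_tensor(images, width, height):
    """Place every image, aspect ratio preserved, in the center of an H x W frame."""
    frames = []
    for image in images:                       # loop until all images are consumed
        container = np.zeros((height, width), dtype=image.dtype)
        scale = min(height / image.shape[0], width / image.shape[1])
        new_h = min(height, max(1, round(image.shape[0] * scale)))
        new_w = min(width, max(1, round(image.shape[1] * scale)))
        y_off = (height - new_h) // 2          # offsets that center the image
        x_off = (width - new_w) // 2
        container[y_off:y_off + new_h, x_off:x_off + new_w] = resize_nearest(
            image, new_h, new_w)
        frames.append(container)               # append I_p to T
    return np.stack(frames, axis=-1)           # tensor of shape H x W x D
```

A `for` loop over the image collection replaces the explicit removal of processed images from the set, which has the same effect as the `while` loop in the listing.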

Fig. 5. Comparison between all used architectures trained on the original dataset.

Fig. 9. Comparison between all used architectures trained on the dataset compressed with Tucker decomposition and lossy ZFP with original validation set.

Fig. 10. Comparison between all used architectures trained on the dataset compressed with Tensor Train algorithm and lossless ZFP.

Fig. 11. Comparison between all used architectures trained on the dataset compressed with Tensor Train algorithm and lossy ZFP.

Fig. 12. Comparison between all used architectures trained on the dataset compressed with Tensor Train algorithm and lossless ZFP with original validation set.

Fig. 13. Comparison between all used architectures trained on the dataset compressed with Tensor Train algorithm and lossy ZFP with original validation set.
ACKNOWLEDGMENT
This research was co-funded by the Smart Growth Operational Programme 2014-2020, financed by the European Regional Development Fund, in the frame of project POIR.01.01.01-00-0570/19, operated by the National Centre for Research and Development in Poland.

TABLE I: DATASETS USED IN NETWORK QUALITY ASSESSMENT.

TABLE III: NETWORK ACCURACY VERSUS DATASET TYPE. TRAINING RESULTS FOR RANDOM WEIGHT INITIALIZATION. DATASETS: ORIGINAL (A), BEST RANK (B), BEST RANK COMP (C), BEST RANK ORIG VAL (D), BEST RANK COMP ORIG VAL (E), TENSOR TRAIN (F), TENSOR TRAIN COMP (G), TENSOR TRAIN ORIG VAL (H), TENSOR TRAIN COMP ORIG VAL (I)

TABLE IV: NETWORK ACCURACY VERSUS DATASET TYPE. TRAINING RESULTS FOR TRANSFER LEARNING TECHNIQUE. DATASETS: ORIGINAL (A), BEST RANK (B), BEST RANK COMP (C), BEST RANK ORIG VAL (D), BEST RANK COMP ORIG VAL (E), TENSOR TRAIN (F), TENSOR TRAIN COMP (G), TENSOR TRAIN ORIG VAL (H), TENSOR TRAIN COMP ORIG VAL (I)