L1-Norm Principal Component Analysis Using Quaternion Rotations

Principal component analysis (PCA) based on the L1-norm has drawn growing interest in recent years. It is especially popular in the machine learning and pattern recognition communities for its robustness to outliers. Although optimal algorithms for L1-norm maximization exist, they have very high computational complexity and can be used for evaluation purposes only. In practice, only approximate techniques have been considered so far. Currently, the most popular method is the bit-flipping technique, where the L1-norm maximization is viewed as a combinatorial problem over the binary field. Recently, we proposed an exhaustive but faster algorithm [1], based on two-dimensional Jacobi rotations, that also offers high accuracy. In this paper, we develop a novel variant of this method that uses three-dimensional rotations and quaternion algebra. Our experiments show that the proposed approach offers higher accuracy than other approximate algorithms, but at the expense of additional computational cost. However, for large datasets, this cost is still lower than that of the bit-flipping technique.


I. INTRODUCTION
PRINCIPAL component analysis (PCA) is a method for multivariate data analysis with various uses, including dimensionality reduction, feature extraction and noise reduction [2]. PCA tries to identify orthogonal directions along which the data exhibit the greatest variability. The projections of the data onto these directions are viewed as the principal components. This technique is also referred to as L2-PCA, because the data variability is measured using the Frobenius norm (the L2-norm on matrices). It can easily be implemented using, for example, the singular value decomposition (SVD) of the observation data matrix [3]. However, it is also sensitive to the presence of outliers, i.e., data points that differ significantly from the other observations. In order to mitigate this drawback, several PCA techniques have been proposed that are based on the L1-norm [4], [5], [6], [7]. Interestingly, the L1-norm criterion can also be used to perform independent component analysis (ICA) after data whitening [8], [9]. The L1-norm optimization problem can be formulated in several ways [5], but unlike in the case of L2-PCA, these formulations are not equivalent. In this paper, we consider the following maximization:

Q_L1 = argmax_{Q ∈ R^{d×k}, Q^T Q = I_k} ∥Q^T X∥_1,    (1)

where X = [x_1, x_2, ..., x_n] ∈ R^{d×n} is a data matrix of rank r_x ≤ min{d, n}, consisting of a sequence of observation vectors (x_i)_{i=1}^{n}, and ∥.∥_1 denotes the L1-norm, which returns the sum of the absolute values of the individual entries. The parameter k denotes the number of L1 principal components. Please note that problem (1) is not scalable, i.e., it cannot be translated into a sequence of one-unit problems simply by projecting the data matrix onto the null-space of the previous solution, as in L2-PCA algorithms. Furthermore, the absolute value function is non-differentiable. For these reasons, obtaining the exact solution is a rather challenging task. In [5] it was shown that if

B_opt = argmax_{B ∈ {±1}^{n×k}} ∥XB∥_*,

where ∥.∥_* denotes the nuclear norm, and XB_opt = UΣV^T is an SVD, then Q_L1 = UV^T is the optimal solution to (1). Therefore, the L1-norm maximization can be viewed as a combinatorial problem over the binary field. Unfortunately, the exhaustive search algorithm [5] has complexity O(n^{dk−k+1}) and is difficult to use in practice.

This work was supported by Bialystok University of Technology under the grant WZ/WI-IIT/5/2023.
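The combinatorial reformulation above can be illustrated with a small Python sketch (not part of the paper; a toy example for k = 1, where the nuclear norm of the rank-one matrix Xb reduces to the Euclidean norm of the vector Xb, so brute force over the 2^n sign vectors yields the optimal first L1 principal component):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 8
X = rng.standard_normal((d, n))

# Brute-force search over b in {-1, +1}^n; for k = 1 the nuclear norm
# of the (rank-one) matrix Xb is simply the Euclidean norm of Xb.
best_val, best_b = -np.inf, None
for signs in itertools.product([-1.0, 1.0], repeat=n):
    b = np.array(signs)
    val = np.linalg.norm(X @ b)
    if val > best_val:
        best_val, best_b = val, b

q_l1 = X @ best_b / best_val              # optimal L1 principal component

# The attained L1 objective equals max_b ||Xb||_2 ...
assert np.isclose(np.abs(X.T @ q_l1).sum(), best_val)

# ... and the ordinary L2-PCA direction can never exceed it.
u_svd, _, _ = np.linalg.svd(X)
q_l2 = u_svd[:, 0]
assert np.abs(X.T @ q_l2).sum() <= best_val + 1e-9
```

The check relies on the identity max_q ∥X^T q∥_1 = max_b ∥Xb∥_2, attained at q = Xb_opt/∥Xb_opt∥_2; the exponential cost of the sign enumeration is exactly why approximate methods are needed in practice.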
A faster, yet suboptimal, version of this approach is based on consecutive bit-flipping operations [6]. Its time complexity is of order O(nd min{n, d} + n^2(k^4 + dk^2) + ndk^3), which can still be prohibitive for large data sizes. The most computationally efficient algorithm, based on fixed-point (FP) iterations, was developed earlier in [4]. Unfortunately, it is rather inaccurate. Recently, we proposed two L1-PCA algorithms [1] based on the Jacobi estimation framework. This framework is commonly used for diagonalizing symmetric matrices [3], [10] through two-dimensional (plane) rotations. It has also found applications in data-driven algorithms [11], [12] for iterative transformations of multi-dimensional data. It was shown in [1] that the Jacobi-based L1-PCA approaches provide high accuracy compared to the existing suboptimal algorithms. They are also considerably faster than the currently most accurate method based on bit-flipping. In this paper, we propose to replace the conventional Jacobi rotations with higher-dimensional quaternion-based rotations. It is expected that, in this way, the convergence properties of the algorithm can be improved. A similar approach was proposed in [13], where we used a quaternionic factorization of 4 × 4 orthogonal matrices and a Newton-Raphson iterative scheme to solve the ICA problem. Here, we present a simpler approach based on three-dimensional rotations to solve the L1-norm maximization problem. Compared to our previous method [1], the novel algorithm offers a higher probability of finding a solution close to the optimal one, at the expense of additional computational cost. However, our experiments show that, for large datasets, this cost is still lower than that of the bit-flipping method.

II. PRELIMINARIES ON QUATERNIONS
A quaternion Q ∈ H can be represented using the rectangular form as follows [14]:

Q = q_0 + iq_1 + jq_2 + kq_3,

where i, j, k denote imaginary units. The real part of Q is q_0 and the pure quaternion part is iq_1 + jq_2 + kq_3. The multiplication of quaternions is determined by the following rules:

i^2 = j^2 = k^2 = ijk = −1,  ij = −ji = k,  jk = −kj = i,  ki = −ik = j.

It is associative and distributes over addition, but it is not commutative. When describing the properties of quaternions, it is convenient to express them as a combination of a scalar part, q_0 ∈ R, and a vector part, q = [q_1, q_2, q_3]^T ∈ R^3: Q = ⟨q_0, q⟩. For example, the conjugate of Q can be written as Q̄ = ⟨q_0, −q⟩, and the norm (modulus) is given by:

|Q| = (QQ̄)^{1/2} = (q_0^2 + q_1^2 + q_2^2 + q_3^2)^{1/2}.

For a non-null quaternion, the inverse is defined as follows:

Q^{−1} = Q̄ / |Q|^2.

Every quaternion Q can also be expressed in an exponential (polar) form:

Q = |Q| (cos α + w sin α),

where w is a unit pure quaternion and 0 ≤ α < 2π is an angle such that:

cos α = q_0 / |Q|,  sin α = ∥q∥ / |Q|.

This form is especially useful because it allows us to express rotations in SO(3). The notation SO(d) denotes the special orthogonal group in a d-dimensional Euclidean space, consisting of all orthogonal matrices of determinant 1. Let P = ⟨0, p⟩ be a pure quaternion that corresponds to a vector p ∈ R^3, and let

U = ⟨cos(θ/2), sin(θ/2) u⟩    (10)

be a unit-norm quaternion, where u = [u_1, u_2, u_3]^T ∈ R^3 is a unit vector indicating the direction of the axis of rotation, and the angle 0 ≤ θ < 2π is the magnitude of the rotation about this axis. Then the rotation of the vector p by the angle θ around the vector u can be expressed as follows:

⟨0, p′⟩ = U P Q̄_U,  where Q̄_U denotes the conjugate of U.

The above operation can be expressed equivalently using matrix/vector multiplication as p′ = R(U)p, where

R(U) = [ q_0^2 + q_1^2 − q_2^2 − q_3^2,  2(q_1 q_2 − q_0 q_3),  2(q_1 q_3 + q_0 q_2);
         2(q_1 q_2 + q_0 q_3),  q_0^2 − q_1^2 + q_2^2 − q_3^2,  2(q_2 q_3 − q_0 q_1);
         2(q_1 q_3 − q_0 q_2),  2(q_2 q_3 + q_0 q_1),  q_0^2 − q_1^2 − q_2^2 + q_3^2 ]    (12)

is a rotation matrix. Theoretically, any rotation matrix can also be constructed using Euler angles, as a product of three rotation matrices about the axes of the fixed coordinate system. Unfortunately, Euler angles differing in many ways can give the same rotation matrix. In our case, this leads to multiple cost function evaluations for the same point. Since the representation (10) corresponds almost uniquely to a given rotation matrix (only U and −U produce the same matrix), such ambiguities can easily be avoided when working with quaternions.
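The quaternion rotation machinery above can be sketched in a few lines of Python (an illustration added here, not from the paper; the rotation-matrix formula used is the standard one, which for unit quaternions agrees with (12)):

```python
import numpy as np

def quat_mul(p, q):
    """Hamilton product of quaternions given as (scalar, 3-vector) pairs."""
    p0, pv = p
    q0, qv = q
    return (p0 * q0 - pv @ qv, p0 * qv + q0 * pv + np.cross(pv, qv))

def rot_quat(u, theta):
    """Unit quaternion (cos(theta/2), sin(theta/2) u) for rotation by theta about axis u."""
    return (np.cos(theta / 2), np.sin(theta / 2) * np.asarray(u))

def quat_to_rotmat(U):
    """3x3 rotation matrix equivalent to the sandwich product U P U-conjugate."""
    q0, (q1, q2, q3) = U
    return np.array([
        [1 - 2*(q2**2 + q3**2), 2*(q1*q2 - q0*q3),     2*(q1*q3 + q0*q2)],
        [2*(q1*q2 + q0*q3),     1 - 2*(q1**2 + q3**2), 2*(q2*q3 - q0*q1)],
        [2*(q1*q3 - q0*q2),     2*(q2*q3 + q0*q1),     1 - 2*(q1**2 + q2**2)],
    ])

u = np.array([1.0, 2.0, 2.0]) / 3.0           # unit rotation axis
theta = 0.8
p = np.array([0.3, -1.2, 0.5])

U = rot_quat(u, theta)
Ubar = (U[0], -U[1])                          # conjugate = inverse for unit norm
_, p_rot = quat_mul(quat_mul(U, (0.0, p)), Ubar)   # sandwich product

# The sandwich product and the matrix form agree.
assert np.allclose(p_rot, quat_to_rotmat(U) @ p)

# (u, theta) and (-u, 2*pi - theta) yield the same rotation matrix.
assert np.allclose(quat_to_rotmat(rot_quat(-u, 2*np.pi - theta)), quat_to_rotmat(U))
```

The second assertion is the sign ambiguity mentioned above: the quaternions for (u, θ) and (−u, 2π − θ) differ only by an overall sign and therefore represent the same rotation.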

A. Jacobi-based estimation framework
In the conventional Jacobi estimation framework [11], [1], the solution matrix is considered to be a product of rotations in SO(2). These rotations are applied successively to the data matrix so that some objective function is optimized. For instance, in our previous work [1], the L1-norm metric is maximized as follows:

X^{(t)} = G(p_t, q_t, θ_t) X^{(t−1)},  t = 1, 2, ...,    (13)

with X^{(0)} = WX, where W ∈ R^{d×d} is an arbitrary orthonormal matrix defining the initialization point. The matrix G(p, q, θ) represents the Jacobi/Givens rotation [3] by the angle θ in the (p, q) plane, i.e.:

[G(p, q, θ)]_{ij} = cos θ for i = j ∈ {p, q};  −sin θ for (i, j) = (p, q);  sin θ for (i, j) = (q, p);  1 for i = j ∉ {p, q};  0 otherwise,    (14)

where p, q are two integers such that 1 ≤ p < q ≤ d. Thus, after T rotations, the solution matrix Q_L1 ∈ R^{d×k} is given by:

Q_L1 = [(G(p_T, q_T, θ_T) ··· G(p_1, q_1, θ_1) W)^T]_{1:k},    (15)

where [.]_{1:k} denotes the first k columns of the argument matrix. All possible rotations, represented by the pairs (p_t, q_t), are arranged in so-called sweeps. These sweeps are repeated cyclically until the maximum number of iterations is reached or until, for all rotations in the current sweep, we have |θ_t| ≈ 0. In fact, any rotation order is allowed [12], [15], but most frequently a row-cyclic ordering is used, as presented in Tab. I. For a fixed arrangement of the plane rotations, denote by x_{ij}^{(t)}(θ) the (i, j)-th entry of the data matrix (13) evaluated for the angle θ_t = θ. Then, the 'local' L1-norm maximization problem at the t-th rotation can be defined as follows:

θ_t = argmax_θ Σ_{i=1}^{k} Σ_{j=1}^{n} |x_{ij}^{(t)}(θ)|.    (16)

Since we are interested in finding only the first k principal components, the outer summation range in (16) covers only indices less than or equal to k. In this way, the rotations that would have to be performed entirely in the null-space can simply be omitted. Also note that the matrix (14) modifies only the rows p_t, q_t of the data matrix X^{(t−1)}, so that the affected coefficients can be computed directly as

x_{p_t j}^{(t)}(θ) = cos θ x_{p_t j}^{(t−1)} − sin θ x_{q_t j}^{(t−1)},  x_{q_t j}^{(t)}(θ) = sin θ x_{p_t j}^{(t−1)} + cos θ x_{q_t j}^{(t−1)}.

In work [1], we proposed two methods for solving (16). The first performs an exhaustive angle search, and the second uses a differentiable approximation of the absolute value function and calculates the rotation angles using a simplified Newton method. In this paper, we consider only the exhaustive algorithm, due to its simplicity and high accuracy. Namely, the objective function in (16) is evaluated on a set of equidistant points, i.e.: {−π/2 + iπ/m : i = 0, 1, ..., m − 1}. We call this set the dictionary. The parameter m is an integer controlling the angular resolution, i.e., the smallest nonzero angle that can be used to represent a rotation. Theoretically, the greater the value of m, the higher the angular resolution and the better the accuracy of the optimization. However, increasing this value does not prevent the method from falling into local optima.
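This exhaustive dictionary search can be sketched in Python for a toy two-dimensional case, where a single plane rotation suffices and the full L1 norm is the local objective (illustrative code, not the paper's Matlab implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((2, 50))          # toy data: d = 2, one plane rotation
m = 512
dictionary = -np.pi / 2 + np.arange(m) * np.pi / m

def l1_objective(theta, X):
    """L1 norm of the rotated data; only the two affected rows are recomputed."""
    c, s = np.cos(theta), np.sin(theta)
    xp = c * X[0] - s * X[1]              # new row p
    xq = s * X[0] + c * X[1]              # new row q
    return np.abs(xp).sum() + np.abs(xq).sum()

vals = np.array([l1_objective(t, X) for t in dictionary])
theta_opt = dictionary[vals.argmax()]

# theta = 0 belongs to the dictionary (i = m/2 for even m), so the
# selected rotation can never decrease the L1 objective.
assert vals.max() >= np.abs(X).sum() - 1e-9
```

For d > 2, the same one-dimensional search is repeated for every pair (p_t, q_t) in the sweep, with the unaffected rows of the data matrix left untouched.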

B. Proposed method
The key idea of the proposed method is to modify the Jacobi estimation framework by replacing rotations in SO(2) with rotations in SO(3). Please note that the conventional approach can guarantee global convergence only for d = 2. For higher-dimensional problems, the Jacobi rotations are performed sequentially. Therefore, we may easily get trapped in a local maximum due to the non-convexity of the cost function. Similarly, rotations in SO(3) do not guarantee finding a global optimum for d > 3; however, such a replacement can increase the frequency with which the method finds an optimal solution. Furthermore, with the higher-dimensional rotations, more data samples are used when computing the local objective functions, and thus these functions should be smoother, which may result in faster convergence. Namely, we propose to replace (14) with a quaternion-based matrix:

[R(p, q, r, U)]_{ab} = u_{ij} for (a, b) = (s_i, s_j), where (s_1, s_2, s_3) = (p, q, r);  1 for a = b ∉ {p, q, r};  0 otherwise,    (19)

where 1 ≤ p < q < r ≤ d and the coefficient u_{ij} is the (i, j)-th entry of the matrix (12) computed for the quaternion:

U = ⟨cos(θ/2), sin(θ/2) u(λ, φ)⟩.

Since the matrix (12) is orthogonal, the matrix (19) is also orthogonal. For convenience, the rotation axis is factorized using spherical coordinates, i.e.:

u(λ, φ) = [cos λ sin φ, sin λ sin φ, cos φ]^T,    (21)

where 0 ≤ λ < 2π and 0 ≤ φ ≤ π denote the azimuthal and polar angle, respectively. More formally, our optimization problem can be stated as follows:

(λ_t, φ_t, θ_t) = argmax_{(λ, φ, θ)} Σ_{i=1}^{k} Σ_{j=1}^{n} |x_{ij}^{(t)}(λ, φ, θ)|,    (22)

where x_{ij}^{(t)}(λ, φ, θ) denotes the (i, j)-th entry of the transformed data matrix R(p_t, q_t, r_t, U) X^{(t−1)}. The coefficients x_{ij}^{(t)}(λ, φ, θ) for i ∈ {p_t, q_t, r_t} can be stacked in a vector representing the imaginary part of the following quaternion:

P′_j = U P_j Q̄_U,  with P_j = ⟨0, [x_{p_t j}^{(t−1)}, x_{q_t j}^{(t−1)}, x_{r_t j}^{(t−1)}]^T⟩,

where Q̄_U denotes the conjugate of U. As before, the simplest solution to (22) is to use an exhaustive search method. Please note that it is not necessary to discretize the entire sphere, because for a given vector (21) and an angle θ there is an opposite vector −u(λ, φ) that generates the same rotation matrix for the angle 2π − θ.
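The construction of the embedded rotation matrix can be illustrated with a short Python sketch (added here for illustration; the 3 × 3 block is built with the Rodrigues formula, which agrees with (12) for a unit axis, and indices are 0-based):

```python
import numpy as np

def axis_from_spherical(lam, phi):
    """Rotation axis u(lambda, phi) in spherical coordinates, cf. (21)."""
    return np.array([np.sin(phi) * np.cos(lam),
                     np.sin(phi) * np.sin(lam),
                     np.cos(phi)])

def rotmat3(u, theta):
    """Rodrigues formula: 3x3 rotation by angle theta about unit axis u."""
    K = np.array([[0, -u[2], u[1]], [u[2], 0, -u[0]], [-u[1], u[0], 0]])
    return (np.cos(theta) * np.eye(3)
            + (1 - np.cos(theta)) * np.outer(u, u) + np.sin(theta) * K)

def embed(R3, d, p, q, r):
    """Place a 3x3 rotation into rows/columns (p, q, r) of the d x d identity."""
    R = np.eye(d)
    R[np.ix_([p, q, r], [p, q, r])] = R3
    return R

d, (p, q, r) = 5, (0, 2, 4)               # 0-based subspace indices
u = axis_from_spherical(0.7, 1.1)
R = embed(rotmat3(u, 0.9), d, p, q, r)

assert np.allclose(R @ R.T, np.eye(d))    # orthogonality is inherited
assert np.isclose(np.linalg.det(R), 1.0)  # still a proper rotation

X = np.arange(25.0).reshape(5, 5)
Y = R @ X
untouched = [i for i in range(d) if i not in (p, q, r)]
assert np.allclose(Y[untouched], X[untouched])  # only rows p, q, r change
```

The last assertion is the property exploited later: left-multiplying the data matrix by the embedded rotation only touches the three selected rows.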
Let us denote by A = {iπ/m : i = 0, 1, ..., m − 1} and B = {jπ/m : j = 0, 1, ..., 2m − 1} the sets of equidistant points on the intervals [0, π) and [0, 2π), respectively. Then our spherical-coordinate search dictionary can be defined as a subset of the Cartesian product B × A × A, from which a condition removes the redundant coordinates, as well as those for which the rotation matrix is equal to the identity matrix (for θ = 0), except the point at the north pole.
It can be verified that the cardinality of this dictionary is l = 2m^3 − 3m^2 + 3m. Obviously, for large m, searching for the solution exhaustively may be impractical, as the sequence length l grows rapidly with m. However, as we will show in the experimental section, the rotations in SO(3) can be represented with a much lower resolution than the rotations in SO(2). In other words, the parameter m can be much smaller than in the conventional Jacobi-based framework. Therefore, we can still use the exhaustive method at a reasonable runtime.
Similarly to the conventional method, the consecutive rotations are organized in sweeps and repeated cyclically until convergence. However, since we deal with rotations in SO(3), a three-dimensional subspace must be defined for each rotation. Theoretically, for d > 3, the rotations can be performed in all possible subspaces defined as the 3-combinations of the row indices. However, we observed that not all combinations are needed for convergence. Namely, any combination (x, y, z) can be removed if all three pairs (x, y), (x, z) and (y, z) can also be found in other combinations. As presented in Tab. II, for d = 4 the combination (1, 3, 4) was removed, because the combinations (1, 2, 3), (1, 2, 4) and (2, 3, 4) contain the pairs (1, 3), (1, 4) and (3, 4), respectively. Such subsets of the combinations can also be obtained by joining plane rotations that share the same dimensions. For example, the two pairs (1, 2) and (1, 3) can be joined into the triple (1, 2, 3), and the pair (2, 3) can then be removed, as it is already present in the triple. In our simulations, we use this algorithm to generate the arrangements presented in Tab. II for d = 4, 5. It can be verified that for d ≥ 3 we have n_r = ⌈d(d − 2)/4 − mod(d, 2) + 1⌉ three-dimensional rotations per sweep, where ⌈.⌉ denotes the ceiling operation. As in the Jacobi-based framework, there may exist other arrangements of the rotation subspaces, and some of them may be better than others. However, this issue is beyond the scope of this paper and will be studied in future work.
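The claim that a reduced set of 3-combinations suffices can be checked numerically. The Python sketch below (an illustration added here, not from the paper) verifies by brute force that, for d = 4 and d = 5, some arrangement of n_r triples covers every index pair:

```python
from itertools import combinations
from math import ceil

def covers_all_pairs(triples, d):
    """True if every pair (i, j), 1 <= i < j <= d, appears inside some triple."""
    covered = {pair for t in triples for pair in combinations(t, 2)}
    return covered == set(combinations(range(1, d + 1), 2))

for d in (4, 5):
    n_r = ceil(d * (d - 2) / 4 - (d % 2) + 1)        # 3D rotations per sweep
    all_triples = list(combinations(range(1, d + 1), 3))
    # Brute force: some n_r-subset of all 3-combinations covers every pair.
    found = any(covers_all_pairs(subset, d)
                for subset in combinations(all_triples, n_r))
    assert found

# The d = 4 arrangement quoted in the text passes the same test.
assert covers_all_pairs([(1, 2, 3), (1, 2, 4), (2, 3, 4)], 4)
```

For d = 4 this confirms that three of the four possible triples already cover all six pairs, matching n_r = 3.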
The pseudo-code of the proposed method is presented in Alg. 1. The sequence of triples defined in line 5 is arranged according to Tab. II. Please note that, at the t-th rotation, the matrix (19) modifies only the rows p, q, r. Therefore, it is not necessary to compute it explicitly; in fact, only the matrices (12) are needed. Furthermore, they can be pre-computed once for a given dictionary D and reused in subsequent iterations.
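One sweep of this scheme can be sketched in Python as follows (a simplified illustration, not the authors' Matlab implementation: the dictionary here is a plain grid without the pruning of redundant entries, and the Rodrigues formula stands in for (12)):

```python
import numpy as np

def rotmat3(lam, phi, theta):
    """3x3 rotation about axis u(lam, phi) by angle theta (Rodrigues formula)."""
    u = np.array([np.sin(phi) * np.cos(lam), np.sin(phi) * np.sin(lam), np.cos(phi)])
    K = np.array([[0, -u[2], u[1]], [u[2], 0, -u[0]], [-u[1], u[0], 0]])
    return (np.cos(theta) * np.eye(3)
            + (1 - np.cos(theta)) * np.outer(u, u) + np.sin(theta) * K)

m = 6                                     # coarse angular resolution for the sketch
# Plain grid dictionary (the paper additionally prunes redundant entries).
D = [(l * np.pi / m, p * np.pi / m, t * np.pi / m)
     for l in range(2 * m) for p in range(m) for t in range(m)]
R_pre = [rotmat3(*entry) for entry in D]  # pre-computed once, reused every sweep

rng = np.random.default_rng(2)
d, n, k = 4, 100, 2
X0 = rng.standard_normal((d, n))
X, W = X0.copy(), np.eye(d)

for (p, q, r) in [(0, 1, 2), (0, 1, 3), (1, 2, 3)]:   # one sweep, 0-based indices
    active = [a for a, i in enumerate((p, q, r)) if i < k]  # rows in the objective
    if not active:
        continue                          # rotation entirely in the null-space
    rows = X[[p, q, r]]
    scores = [np.abs((R @ rows)[active]).sum() for R in R_pre]
    best = int(np.argmax(scores))
    if scores[best] > np.abs(rows[active]).sum():     # theta = 0 is in D
        X[[p, q, r]] = R_pre[best] @ rows
        G = np.eye(d)
        G[np.ix_([p, q, r], [p, q, r])] = R_pre[best]
        W = G @ W

assert np.abs(X[:k]).sum() >= np.abs(X0[:k]).sum()    # objective never decreases
assert np.allclose(W @ W.T, np.eye(d)) and np.allclose(W @ X0, X)
```

Repeating such sweeps until no rotation improves the objective yields the estimate of Q_L1 from the first k rows of the accumulated W, transposed.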

Algorithm 1 Pseudo-code of the proposed algorithm
Require: data matrix X ∈ R^{d×n}, number of components k, dictionary D
1: W ← I_d (or another orthonormal initialization)
2: X ← WX
3: pre-compute the matrices (12) for all (λ, φ, θ) ∈ D
4: repeat
5:   for (p, q, r) = {(1, 2, 3), ...} do
6:     R_opt ← R(p, q, r, U(λ_t, φ_t, θ_t)), where (λ_t, φ_t, θ_t) solves (22) over D
7:     X ← R_opt X
8:     W ← R_opt W
9:   end for
10: until convergence or the iteration limit is reached
11: return Q_L1 = [W^T]_{1:k}

III. EXPERIMENTS

The proposed method has been implemented and evaluated in the Matlab environment. For convenience, it is denoted as L1-JQ, which stands for the Jacobi method with quaternion rotations. For comparative purposes, we also evaluated three other approximate methods for maximization of the L1-norm: the bit-flipping algorithm [6] (L1-BF), the fixed-point iterations [4] (L1-FP), and the Jacobi exhaustive method with SO(2) rotations [1] (L1-JEX). In the case of the L1-JEX method, the parameter m was set to 512. We verified empirically that, for this method, rotations by angles smaller than π/512 are not statistically significant. In order to explore how the estimation error is affected by the angular resolution, the L1-JQ algorithm was evaluated for two values of the parameter m ∈ {10, 20}. For all approaches, the identity matrix was used as the initialization point.

A. Accuracy
The performance degradation ratio attained by the algorithms was measured using a procedure similar to that in [6]. Namely, the following metric was considered:

δ = (∥Q_L1^T X∥_1 − ∥Q^T X∥_1) / ∥Q_L1^T X∥_1,

where Q is the orthonormal matrix estimated by the evaluated method. Ideally, the matrix Q_L1 should be the one obtained by an optimal L1-PCA algorithm [5] for the same data matrix. Unfortunately, the computational complexity of the optimal method is extremely high, so such an approach is possible only for very small data sizes (n ≪ 100), whereas the presented method is intended for larger data sets. For these reasons, in this experiment we replaced the matrix Q_L1 with the matrix representing the best solution among all evaluated methods. In order to measure which method gives the best result most

TABLE I: Row-cycling ordering for d = 3.

t           1       2       3
(p_t, q_t)  (1, 2)  (1, 3)  (2, 3)

TABLE II: Arrangements of rotation subspaces for various signal dimensionalities.

d = 4:  (1, 2, 3), (1, 2, 4), (2, 3, 4)
d = 5:  (1, 2, 3), (1, 4, 5), (2, 4, 5), (3, 4, 5)