Hashtag Discernability – Competitiveness Study of Graph Spectral and Other Clustering Methods

Abstract—Spectral clustering methods are claimed to be able to represent clusters of diverse shapes, densities, etc. They constitute an approximation to graph cuts of various types (plain cuts, normalized cuts, ratio cuts). They are applicable to unweighted and weighted similarity graphs. We perform an evaluation of these capabilities for clustering tasks of increasing complexity.


I. INTRODUCTION
Document clustering (or text clustering) has a multitude of applications, including topic extraction, fast information retrieval, filtering, authorship discovery, topic drift detection in news streams and social media, automatic document organization, etc. ([1], [2], [3], [4]). Two clustering methods are of particular interest in this area: Graph Spectral Clustering (GSC) and spherical k-means.
Graph Spectral Clustering methods [1] are generally praised for their ability to represent clusters of diverse shapes, densities, etc. They constitute an approximation to graph cuts of various types (plain cuts, normalized cuts, ratio cuts). They are applicable to unweighted and weighted similarity graphs.
The spherical k-means algorithm [5] is a variant of k-means that measures the similarity of documents by their cosine similarity; it is quite popular in the domain of text analysis (e.g. for search engines).
In this paper we pose the question: if a grouping method correctly groups certain datasets, can we expect that a combination of these datasets will also be correctly clustered? We will examine the following problem in more detail. Assume that a clustering method can correctly cluster documents from the categories [A, B], [B, C], and [C, A]. Can we expect the algorithm to cluster correctly data from the mixed set [A, B, C]? Let us illustrate this with three datasets of tweets, marked with the (single) tags 'lolinginlove', 'tejran', and 'anjisalvacion'.
We used the standard Python implementation of spectral clustering from the scikit-learn library. The affinity matrix was constructed from a k-nearest neighbors connectivity matrix, with the default value of k = 10.
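The setup described above can be sketched as follows, using scikit-learn's `SpectralClustering` with the "nearest_neighbors" affinity (n_neighbors=10 is the library default). The toy blob data stands in for the tweet vectors and is illustrative only.

```python
# Minimal sketch of the clustering setup described in the text:
# scikit-learn spectral clustering with a k-nearest-neighbors affinity.
# The synthetic blob data is a stand-in for document vectors.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

# Two well-separated synthetic groups (illustrative data only).
X, y_true = make_blobs(n_samples=200, centers=[[-5, -5], [5, 5]],
                       cluster_std=0.5, random_state=42)

model = SpectralClustering(n_clusters=2,
                           affinity="nearest_neighbors",
                           n_neighbors=10,          # scikit-learn default
                           random_state=42)
labels = model.fit_predict(X)
```

With well-separated data the k-nearest-neighbors graph decomposes into near-disconnected components, which is the easy case for the method; the experiments below probe what happens as the task gets harder.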
So, for each pair of the three hashtags we see a very good agreement of the clusterings with the target (hashtags). If we look at all three hashtags ['lolinginlove', 'tejran', 'anjisalvacion'], we get the clustering agreement visible in Fig. 4. We see that more errors are committed here than for each pair of hashtags presented in Figs. 1, 2 and 3, though the increase does not seem to be large in absolute numbers. We will return to this issue in the next section.
Fig. 4. Spectral clustering with affinity "nearest neighbors", example 4; row labels - "true" clusters, column labels - clustering result.

Here and in further sections, the F-score is computed as follows. We assume that the clustering is to predict the hashtag. The "true" hashtag of a cluster is identified as the majority hashtag in that cluster. For a given hashtag H we proceed as follows. True positives (TP) are the cases which belong to a cluster for which H is the true hashtag, and the hashtag of the given document is H. False positives (FP) are the cases which belong to a cluster for which H is the true hashtag, but the hashtag of the given document is different from H. True negatives (TN) are the cases which belong to a cluster for which H is not the true hashtag, and the hashtag of the given document is different from H. False negatives (FN) are the cases which belong to a cluster for which H is not the true hashtag, but the hashtag of the given document is H. Computation of precision and recall follows the standard pattern; the F-score is computed for each hashtag separately, and then the average is taken as the F-score of the clustering.
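The F-score computation described above can be written compactly as follows: each cluster is assigned the majority ("true") hashtag of its members, precision and recall are computed per hashtag, and the per-hashtag F-scores are averaged.

```python
# Sketch of the evaluation measure described in the text: majority
# hashtag per cluster, per-hashtag precision/recall/F, averaged F.
from collections import Counter

def clustering_f_score(true_tags, cluster_ids):
    # Identify the majority ("true") hashtag of each cluster.
    majority = {}
    for c in set(cluster_ids):
        members = [t for t, cc in zip(true_tags, cluster_ids) if cc == c]
        majority[c] = Counter(members).most_common(1)[0][0]

    f_scores = []
    for h in set(true_tags):
        tp = sum(1 for t, c in zip(true_tags, cluster_ids)
                 if majority[c] == h and t == h)
        fp = sum(1 for t, c in zip(true_tags, cluster_ids)
                 if majority[c] == h and t != h)
        fn = sum(1 for t, c in zip(true_tags, cluster_ids)
                 if majority[c] != h and t == h)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        f_scores.append(f)
    return sum(f_scores) / len(f_scores)
```

A perfect clustering yields an F-score of 1.0; a hashtag whose documents never dominate any cluster contributes 0 to the average, which is why the measure drops quickly as clusters start to mix.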
In this paper we study the extent to which this behaviour extends to a larger number of clusters. This study is a starting point for a future revision of the studied clustering algorithms.

II. CONCEPTUAL CONSIDERATIONS
Despite the example shown above, it is not entirely obvious that, given a grouping method that correctly groups documents from the categories [A, B], [B, C], [C, A], we can expect that the algorithm will correctly group data from the mixed set [A, B, C].
If the sets A ∪ B, B ∪ C and C ∪ A have block diagonal document similarity matrices (after proper reordering of the documents), and the blocks actually coincide with A, B, C, then the [A, B, C] similarity matrix will be block diagonal too, so that the GSC algorithm will cluster A, B, C correctly. This can be seen immediately by inspection of the block matrix structure, i.e.

S = diag(S_A, S_B, S_C),  L = D - S = diag(D_A - S_A, D_B - S_B, D_C - S_C),

where S is the similarity matrix and D is the diagonal matrix with elements being the sums of the corresponding rows of S. Hence the Laplacian L inherits the block structure, its eigenvectors decompose block-wise, and the spectral embedding separates A, B and C. In all these cases, if some noise is added to fuzzify the well-separatedness, the noise can be more destructive for the set [A, B, C] than for any of the three mentioned subsets - this affects GSC as well as k-means clustering. This is easily imagined by considering the k-means algorithm. The cluster center of A when clustering fuzzified A and B may lie in a different position than when clustering fuzzified A and C. This behavior will be subsequently illustrated by a series of experiments.
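The block-diagonal argument can be checked numerically: build a similarity matrix that is (nearly) block diagonal over three groups and verify that spectral clustering on the combined matrix recovers all three. The synthetic matrix below is illustrative only.

```python
# Numerical illustration of the block-diagonal argument: a similarity
# matrix with three strong diagonal blocks (and a tiny uniform
# background so the graph is connected) is clustered correctly by
# spectral clustering applied to the combined [A, B, C] matrix.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
sizes = [20, 30, 25]                        # |A|, |B|, |C|
n = sum(sizes)
S = np.full((n, n), 0.01)                   # weak background similarity
true, start = [], 0
for k, m in enumerate(sizes):
    block = rng.uniform(0.5, 1.0, size=(m, m))
    S[start:start + m, start:start + m] = (block + block.T) / 2
    true += [k] * m
    start += m
np.fill_diagonal(S, 0.0)

labels = SpectralClustering(n_clusters=3, affinity="precomputed",
                            random_state=0).fit_predict(S)
```

With noise confined to a weak background the blocks stay dominant and the recovery is exact; the experiments in Section V show what happens when the noise is not so benign.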

III. DATA
We used tweets retrieved from the stream endpoint of the Twitter API (a random sample of about 1% of English tweets), collected by one of the Authors for the time period from mid September 2019 till the end of November 2022. From this set we extracted the subset TWT.10 used in the experiments. It is a collection of top thread tweets related to the hashtags listed in Table I. While selecting the data, we imposed the restriction that the tweets had to have one single hashtag (which we treated as an indication of being devoted to a single theme).

IV. METHODS
We study two standard versions of Graph Spectral Clustering, available from scikit-learn, six versions of spherical k-means, and six versions of our proprietary so-called K-embedding based clustering algorithm.
The advantages and disadvantages of these methods are briefly discussed below.

A. Spectral analysis
In fact spectral clustering algorithms constitute a large family, see e.g. [12], [13], [14]. They have numerous desirable properties (like detection of clusters with various shapes, applicability to high dimensional datasets, capability to handle categorical variables), yet they suffer from various shortcomings, common to other families of algorithms, including multiple possibilities of representation of the same dataset, producing results in a space different from the space of the original problem, the curse of dimensionality, etc. These shortcomings are particularly severe in the large and sparse data set scenario, as with Twitter data.
Let us briefly recall the typical spectral clustering algorithm in order to make clear how distant the clustering may be from the applier's comprehension [12]. The first step consists in creating a similarity matrix of the objects (for documents, tf or tf-idf representations, in unigram or n-gram versions, or some transformer based embeddings are the options - consult e.g. [15] for details), then mixing them in case multiple views are available. The second step is to compute a Laplacian matrix. There are at least three variants to use: the combinatorial, normalized, and random-walk Laplacian [12]. But other options are also possible, like: some kernel-based versions, the non-backtracking matrix [16], degree-corrected versions of the modularity matrix [17], or the Bethe-Hessian matrix [18]. Then follow the computation of eigenvectors and eigenvalues, eigenvector smoothing (to remove noise and/or achieve robustness against outliers), the choice of eigenvectors, and finally clustering in the space of the selected eigenvectors (via e.g. k-means). The procedure may be more complex; e.g. one may add loops back to preceding steps based on feedback from quality analysis, like the degree of deviation from the block structure of the Laplacian.
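The pipeline recalled above (similarity matrix, Laplacian, eigenvectors, k-means in the eigenvector space) can be sketched step by step. This is a pedagogical sketch using the symmetric normalized Laplacian with row normalization, not an optimized implementation.

```python
# Step-by-step sketch of the generic spectral clustering pipeline:
# similarity matrix -> normalized Laplacian -> bottom eigenvectors ->
# k-means in the eigenvector space.
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(S, n_clusters, random_state=0):
    """Cluster rows of a symmetric nonnegative similarity matrix S."""
    d = S.sum(axis=1)                                  # degrees
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    # Symmetric normalized Laplacian: L = I - D^{-1/2} S D^{-1/2}
    L = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L)                     # ascending order
    U = vecs[:, :n_clusters]                           # smallest eigenvalues
    # Row normalization, as in the common Ng-Jordan-Weiss variant.
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=random_state).fit_predict(U)
```

Every step here (Laplacian variant, number and smoothing of eigenvectors, final clustering routine) is a design choice, which is exactly the diversity the paragraph above points at.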
From this diversified set we chose the two mentioned implementations available from scikit-learn.

B. Spherical k-means
Spherical k-means was developed in [5] by observing that the squared Euclidean distance between two vectors, in the case of normalized vectors, reduces to ||x_i - x_j||^2 = 2(1 - x_i^T x_j), where x_i^T x_j = cos ∠(x_i, x_j). This makes it very efficient in the case of sparse vectors, a typical representation of text documents. Such a variant of k-means suffers from dependence on initialization; thus further improvements have been proposed, e.g. [19], [20], [21] and [22].
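The identity underlying spherical k-means can be verified numerically: for unit-length vectors, the squared Euclidean distance is a monotone function of the cosine similarity, so Euclidean k-means on normalized vectors orders pairs exactly as cosine similarity does.

```python
# Numerical check of the identity behind spherical k-means:
# for unit vectors, ||x - y||^2 = 2 (1 - cos(x, y)).
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = rng.normal(size=50)
x /= np.linalg.norm(x)          # normalize to unit length
y /= np.linalg.norm(y)

squared_dist = float(np.sum((x - y) ** 2))
cosine = float(x @ y)           # cos of the angle between unit vectors
assert np.isclose(squared_dist, 2.0 * (1.0 - cosine))
```

For sparse tf-idf vectors the dot product touches only the shared nonzero coordinates, which is the efficiency argument made above.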

C. K-embedding
K-embedding has the following underlying idea. Let us think for a moment about a particular embedding of the nodes of the graph, based on [23]. Let A be a matrix of the form

A = 11^T - I - S,

where S stands for an affinity matrix, I is the identity matrix, and 1 is the (column) vector consisting of ones, both of appropriate dimensions. (Note that here we have to assume that the diagonal of S consists of zeros.) Let K be the matrix of the (double centered) form [24]

K = -(1/2) (I - 11^T/n) A (I - 11^T/n),

with n × n being the dimension of S. 1 is an eigenvector of K, with the corresponding eigenvalue equal to 0. All the other eigenvectors must be orthogonal to it, as K is real and symmetric, so for any other eigenvector v of K we have 1^T v = 0. Let Λ be the diagonal matrix of eigenvalues of K, and V the matrix whose columns are the corresponding (unit length) eigenvectors of K. The node i is embedded as z_i = Λ^{1/2} V_i^T, where V_i stands for the i-th row of V. Let z_i, z_ℓ be the embeddings of the nodes i, ℓ, respectively. This embedding shall be called the K-embedding. Then

||z_i - z_ℓ||^2 = K_ii + K_ℓℓ - 2 K_iℓ for i ≠ ℓ.

Hence upon performing k-means clustering in this space we de facto try to maximize the sum of similarities within a cluster. Note that K = V Λ V^T may be quite well approximated if we drop the low eigenvalues from Λ and their corresponding eigenvectors from V (which we do in our experiments).
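The construction above can be sketched as follows. This is our reading of the description, not the proprietary implementation: the dissimilarity form A = 11^T - I - S and the double-centering of K are assumptions based on the text, and the truncation keeps only the largest eigenvalues, as mentioned.

```python
# Sketch of the K-embedding idea as described in the text (our reading;
# the exact proprietary variant may differ): double-center a
# dissimilarity transform of the affinity matrix S (symmetric, zero
# diagonal), embed nodes via the top eigenpairs, cluster with k-means.
import numpy as np
from sklearn.cluster import KMeans

def k_embedding_cluster(S, n_clusters, dim, random_state=0):
    n = len(S)
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    A = np.ones((n, n)) - np.eye(n) - S      # dissimilarity (assumed form)
    K = -0.5 * J @ A @ J                     # double centering
    vals, vecs = np.linalg.eigh(K)
    top = np.argsort(vals)[::-1][:dim]       # keep the largest eigenvalues
    Z = vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))
    return KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=random_state).fit_predict(Z)
```

Dropping the low eigenvalues, as the text notes, amounts to a low-rank approximation of K = V Λ V^T, so the k-means objective in the embedding space still tracks within-cluster similarity.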

V. EVALUATION
For each of the algorithms we perform the following tests. For each pair of datasets associated with two hashtags from Table I (45 pairs in all), the clustering is performed by each of the mentioned algorithms 10 times (due to the stochastic nature of these algorithms) and the average F-score is computed. The ten pairs with the highest average F-scores are taken to the next phase. Then datasets associated with 3 hashtags are created out of these selected pairs plus each of the hashtags not present in the selected pairs. This process is continued till all 10 hashtags are exhausted. In the figures, the average value of F over all computations with the given hashtag cardinality is presented, plus the average of the top 10 groups of hashtags. The results are summarized in Figs. 6-19.
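The pairwise stage of the protocol above can be sketched as follows; `cluster_and_score` is a hypothetical stand-in for any of the evaluated algorithms, returning the F-score of one clustering run. The extension to triples and larger groups follows the same pattern of scoring and keeping the top groups.

```python
# Sketch of the pairwise selection stage: average the F-score over
# repeated runs for every hashtag pair, keep the best-scoring pairs.
# `cluster_and_score(group, seed)` is a hypothetical callback standing
# in for any of the evaluated clustering algorithms.
from itertools import combinations

def top_groups(hashtags, cluster_and_score, n_runs=10, keep=10):
    scores = {}
    for pair in combinations(hashtags, 2):
        # Repeat runs because the algorithms are stochastic.
        runs = [cluster_and_score(pair, seed=s) for s in range(n_runs)]
        scores[pair] = sum(runs) / n_runs
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]
```

With 10 hashtags the pair stage covers all 45 combinations, matching the count stated above.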
The next experiment was to compare the F-score obtained for a given set of hashtags, considered in the preceding experiment, with the F-scores of its subsets obtained by removing one of the hashtags. The resulting correlations are reported in Table II for each of the analysed clustering algorithms. We have also created a more detailed view for one of the algorithms: spectral clustering with affinity nearest neighbors. Fig. 20 presents the histogram of differences between the average F-score of the subgroups and the F-score of the group. Fig. 21 presents the relation between the average F-score of the subgroups and the F-score of the group as a scatterplot.

VI. RESULTS
As visible in Figs. 6-19, the increase in the number of intended clusters to be discovered constitutes a problem for the clustering algorithms, with even a 9-fold decrease of the F-score when going from 2 to 10 clusters. This behaviour is consistent throughout all the investigated methods, though minor variations in the shape of the curves may be observed.
Spherical k-means clustering in the sc.n configuration appears to perform best for the 10 top pairs of hashtags (Fig. 10), as does the sc.sc configuration (Fig. 9), followed by K-embedding based clustering in most configurations (Figs. 14-19, except 16).
In most cases the top average F-score for the next higher number of clusters is higher than the average score over all sets with the previous number of clusters, which indicates that better separation of subgroups gives some advantage for the capability to separate the entire group.
Table II shows the Spearman and Pearson correlations between the F-score achieved by grouping a dataset related to a given set of hashtags and by grouping the datasets obtained by removing the data of one of the hashtags, split by the clustering algorithm. The correlations are generally high and statistically very significant. This means that the clustering capability on subsets of hashtags can be a good indicator of the clustering capability for the full set of hashtags. The algorithm spherical sc.sc seems to perform best by this criterion, followed by spherical k++.sc and, in the Spearman correlation column, K-embedding.12plusk++.md.
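The two correlation measures used in Table II can be computed with scipy as follows; the sample values below pair hypothetical group F-scores with mean subgroup F-scores and are illustrative numbers, not the paper's data.

```python
# Computing the two correlations reported in Table II with scipy.
# The F-score lists below are illustrative numbers only, pairing a
# hypothetical group F-score with the mean F-score of its subgroups.
from scipy.stats import pearsonr, spearmanr

group_f    = [0.30, 0.45, 0.52, 0.61, 0.75]   # F-score of each group
subgroup_f = [0.35, 0.50, 0.55, 0.70, 0.80]   # mean F-score of its subgroups

r_pearson, p_pearson = pearsonr(group_f, subgroup_f)
rho_spearman, p_spearman = spearmanr(group_f, subgroup_f)
```

Spearman captures the monotone agreement that matters for "subsets predict the full set", while Pearson additionally measures how linear that relation is.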
A more detailed insight into this relationship for one of the algorithms is presented in Figs. 20 and 21. Fig. 20 shows, however, that this clustering capability generally decreases (the F-score of a group is usually lower than the average over its subgroups). Fig. 21 additionally shows that the high correlations between groups and subgroups of hashtags are to be expected rather for low values of the F-score. Higher F-score values come with a higher variation in the supergroup F-score.

VII. CONCLUSIONS
The performed experiments demonstrate that, in spite of their generally praised properties, graph spectral clustering methods still leave large room for improvement with respect to an increasing number of clusters to be detected. Even if all subsets of the intended clusters can be well separated by the algorithms, their mixture may not be. The same observation can be made about the spherical k-means algorithm.

Fig. 5 . 2 ,
Fig. 5. Visualization of datapoints used to illustrate the increasing clustering problem for k-means

Fig. 6. F-scores for various numbers of hashtags; spectral clustering with affinity nearest neighbors

TABLE I
TWT.10 DATA SET - HASHTAGS AND CARDINALITIES OF THE SET OF RELATED TWEETS USED IN THE EXPERIMENTS

will yield A, C, similarly any two hashtag combinations. But clustering A ∪ B ∪ C with k-means will yield three clusters {a}, {b, c}, {d, e, f}, not A, C, E.

TABLE II
CORRELATION BETWEEN THE F-SCORE OF A GIVEN GROUP OF HASHTAGS AND THEIR SUBGROUPS OF CARDINALITY LOWER BY ONE.