Text Embeddings and Clustering for Characterizing Online Communities on Reddit

This work analyzes Reddit, the largest public, topic-centered social forum. In the experiments, subreddit content was represented with contextualized text embeddings obtained using DistilBERT. Next, the embeddings were clustered with the unsupervised K-means algorithm, and the results were evaluated with multiple clustering metrics. The obtained clusters were then analyzed. Moreover, changes in cluster structure between 2019 and 2022 were examined.


I. INTRODUCTION
Reddit is the largest public, topically-separated forum [1]. Its unique structure allows users with common interests to find their place for discussion, i.e. a subreddit: a subforum dedicated to a particular topic. The platform policy, and Reddit administration, put no restrictions on the topic of subreddits (except for the rules regarding illegal content). Moreover, any user with an account that is at least 30 days old and a non-negative "karma score" (reputation metric) can create a subreddit. This allows "communities" to blossom, with little-to-no supervision. There are subreddits that are very close thematically, e.g. r/worldnews and r/news (news and information), or r/leagueoflegends and r/Overwatch (video games). There are also subreddits with distant, or even opposite, topics, e.g. r/Conservative and r/Libertarian.
The freedom and scale of subreddits raise multiple research questions, e.g.: What are the most popular topics? Are there topically similar subreddits? Can subreddits be reasonably grouped into clusters? Do subreddits migrate between clusters and, if so, how? This contribution explores these questions, for a Reddit dataset spanning 2019-2022, using natural language processing and data clustering.

II. RELATED WORKS
Reddit's communities have been analyzed with different methods and from different perspectives. The main inspiration for this work is a 2015 study [2] that clustered 15,000 subreddits, from the first half of 2013, using scale-free backbone graph networks. The subreddits were grouped into 57 clusters and then manually annotated into 10 metaclusters (categories) such as: Electronic Music, Fitness, Sports, Soccer, Video Games, My Little Pony, LGBT, Pornography, Programming, Guns. The captured relations were based on interactions of over 800,000 users. However, the actual content of the posts and comments was not analyzed.
Two years later, "community2vec" [3] was introduced. This study also focused on users, by encoding post authors and user co-occurrences and applying PCA. Additionally, post content was encoded with static GloVe embeddings [4]. The main result was that vector representations of communities can encode meaningful analogies and semantic relationships, similarly to what had previously been seen for words.
A 2020 study of Reddit and Twitter [5] focused exclusively on the texts of 54.5 million Reddit comments and 23,684 tweets. Its goal was to compare text embedding methods (TF-IDF, Word2Vec [6] and Doc2Vec [7]) applied to topic modelling, with document clustering using k-means, k-medoids, hierarchical agglomerative clustering and non-negative matrix factorization (NMF). For these, different settings and hyperparameters were tested. It was established that combining Doc2Vec and k-means achieved the best results.
Finally, in [8], instead of static clustering, community evolution over time was analyzed. Here, active users and textual content, processed with LIWC analysis, were considered. The results represent patterns of user engagement at different stages of the community lifespan. However, they do not show how the subreddit topic clusters evolve over time.
Reddit has continued to grow since its launch in 2005, while the past studies were completed in 2015, 2017 and 2020. Therefore, the structure of Reddit needs to be revisited, applying modern approaches, to understand what the subreddit communities look like today and how they change over time. Hence, this contribution presents results of explorations based on a dataset spanning four years (2019-2022), applying contextual text embeddings from a BERT-like model, with the goal of analyzing the subreddit community structure and its evolution over time.

A. Dataset
Reddit consists of over 3.5 million communities (and has over 1.5 billion monthly visitors). The most popular Reddit data source is the Pushshift database [13], [14]. The subreddit data was extracted from the Pushshift subreddit dumps, since the Pushshift REST API could not be used due to an outage in early 2023. Overall, the content of the 3090 "largest" subreddits, i.e. subreddits with at least 100,000 subscribers, was extracted.
Reddit is subject to the 1% rule, which appears in the majority of social networks [15]. The majority of posts gain little-to-no attention ("upvotes"), while a small fraction "goes viral" and appears on the front page of Reddit (the main Reddit forum, r/all). Hence, to reduce the computational cost while still capturing subreddit structure, the 1000 posts with the highest scores were extracted from each subreddit. The score is Reddit's measure of "appraisal by a community of Reddit subscribers of an item" [16]. Finally, the dataset spans 4 years (2019-2022), to allow the analysis of subreddit cluster evolution over time. The resulting dataset consisted of over 12 million unique user posts.
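The selection of the highest-scoring posts can be illustrated with a minimal Python sketch. It assumes that posts are available as dictionaries with the hypothetical keys "subreddit", "title" and "score"; these field names and the helper top_posts_per_subreddit are illustrative only, and are not taken from the Pushshift schema or from the pipeline used in this work.

```python
import heapq
from collections import defaultdict


def top_posts_per_subreddit(posts, n=1000):
    """Keep the n highest-scoring posts of every subreddit.

    `posts` is an iterable of dicts with the (assumed) keys
    "subreddit", "title" and "score".
    """
    by_subreddit = defaultdict(list)
    for post in posts:
        by_subreddit[post["subreddit"]].append(post)
    # heapq.nlargest returns the items sorted from highest to lowest score.
    return {
        name: heapq.nlargest(n, items, key=lambda p: p["score"])
        for name, items in by_subreddit.items()
    }


# Toy input, invented for illustration (not data from the study).
sample = [
    {"subreddit": "r/gardening", "title": "First tomato", "score": 120},
    {"subreddit": "r/gardening", "title": "Help with aphids", "score": 3},
    {"subreddit": "r/gardening", "title": "My bonsai", "score": 57},
    {"subreddit": "r/news", "title": "Headline", "score": 9},
]
top = top_posts_per_subreddit(sample, n=2)
```

With n=2, the toy r/gardening group keeps only its two best-scored titles, while r/news, which has a single post, is kept unchanged.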

B. Text embedding
After gathering, the posts were converted into text embeddings. Since the introduction of BERT (in 2018), multiple models, for different NLP goals, have been introduced [17], [18]. This work needs a general feature extraction model that delivers multipurpose text embeddings. The model should be "general" and multipurpose, because the input data originates from over 3000 communities, and covers topics from politics and news (r/politics, r/news), through video games (r/DOTA, r/gaming), memes (r/hmmm, r/me_irl), drug usage (r/LSD, r/shrooms), to plants (r/Bonsai, r/gardening), fishkeeping (r/Aquariums, r/PlantedTank) or military (r/military, r/guns).
Moreover, the NLP part takes the longest processing time (over 50% of the total runtime). Therefore, a general, multipurpose model that is fast but still accurate is required. In 2019, DistilBERT, a "smaller, faster, cheaper and lighter" version of the BERT model, was introduced. It retains 97% of the original BERT performance on downstream tasks, while being 40% smaller and 60% faster [11]. Therefore, to reduce computation time, DistilBERT was selected.
Here, it should be noted that different Reddit communities have different post "styles". For example, r/politics consists mostly of links to news websites, while r/AbruptChaos contains mostly GIFs or short videos. There is, however, one part of a post that is enforced by Reddit: the post's title, which has to be present on every post regardless of subreddit. While there are ways to circumvent this, over 99% of posts in the dataset have a textual title. Overall, DistilBERT embedded post titles into 768-dimensional vectors, which were then clustered.

C. Clustering
The use of K-means for clustering followed the results found in [19], [20]. However, the biggest downside of K-means is that it requires the number of clusters to be specified in advance. This problem can be overcome by using unsupervised clustering metrics [21], [22] to find the "best" clustering. In this context, the Silhouette Score (previously used on Reddit [23]), the Davies-Bouldin score, the Caliński-Harabasz score and k-means inertia (the sum of squared distances of samples to their closest cluster center) were tried. The most suitable number of clusters was sought by evaluating clustering results for 10, 20, 30, ..., 1530, 1540 clusters, where 1540 is about half of the number of subreddits. Unlike the Silhouette and Caliński-Harabasz scores, for which higher values are better, lower Davies-Bouldin values indicate better clustering. For easier interpretability of the results, presented in Figure 1, the Davies-Bouldin score is shown with a minus sign. Interestingly, the metrics were practically identical for the considered time periods (annually for 2019-2022). Hence, it can be stated that the number of subreddit clusters, and thus the topical dispersion, does not change much over time (see also Section IV).
The choice of the number of clusters was a challenge, since the metrics were not consistent with one another (see Figure 1). The Silhouette Score, in particular, differed a lot. This is related to the fact that the Silhouette Score ranges from -1 to 1, where the best value is 1 (all points assigned to the right cluster), the worst value is -1 (all points assigned to the wrong cluster), and scores near 0 indicate overlapping clusters. All Silhouette Scores were close to 0, indicating the existence of overlaps. To find a compromise between the Davies-Bouldin, Caliński-Harabasz and inertia metrics, the Elbow Method was applied, as previously used in similar settings [24], [25] (also on Reddit [26]). Overall, any number between 180 and 450 represented a "good fit". However, to keep the results interpretable, 200 clusters were selected. Smaller numbers of clusters were also checked (e.g. 100), but they produced clusters with "not-fitting" topics. Larger numbers (e.g. 300), on the other hand, resulted in fragmented topics.
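The search over cluster counts can be sketched in Python as follows. The grid of candidate values and the sign flip of the Davies-Bouldin score follow the description above. The elbow criterion shown here (the point farthest from the straight line joining the two ends of the normalized curve) is only one common heuristic and is an assumption, since the exact criterion is not specified in this work. The inertia values are invented for illustration, and the function names are hypothetical.

```python
import math

# Candidate cluster counts described in the text: 10, 20, ..., 1540.
candidate_ks = list(range(10, 1541, 10))


def as_higher_is_better(davies_bouldin_scores):
    """Negate Davies-Bouldin scores so that higher values mean better clustering."""
    return [-s for s in davies_bouldin_scores]


def elbow_index(xs, ys):
    """Index of the point farthest from the chord joining the curve's end points.

    Both axes are rescaled to [0, 1] first, so the chord becomes the line
    from (0, 0) to (1, 1). Assumes the first and last points differ in x and y.
    """
    x_span = xs[-1] - xs[0]
    y_span = ys[-1] - ys[0]
    best_i, best_d = 0, -1.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        nx = (x - xs[0]) / x_span
        ny = (y - ys[0]) / y_span
        # Distance of (nx, ny) from the line y = x.
        d = abs(nx - ny) / math.sqrt(2)
        if d > best_d:
            best_i, best_d = i, d
    return best_i


# Made-up inertia curve for six candidate values (illustration only).
ks = [10, 20, 30, 40, 50, 60]
inertia = [1000.0, 500.0, 300.0, 250.0, 220.0, 200.0]
elbow_k = ks[elbow_index(ks, inertia)]
```

On this toy curve the heuristic selects the third candidate, where the decrease in inertia starts to flatten. In practice, such a criterion only narrows down a range of reasonable values; the final choice still depends on how interpretable the resulting clusters are.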

D. Cluster similarity and time-evolution
To automatically detect cluster dynamics, the clusters were compared year to year using the Jaccard Index [27]. Each cluster from a given period was compared to each cluster from the next chronological period. The pair of sets with the highest Jaccard Index is considered a transition from the predecessor to the successor. Note that the predecessor and the successor may be the same, i.e. the cluster did not change from period to period. This way, ordered lists of cluster transitions were created. Then, for each list, a generalized multi-set Jaccard index was calculated over all of its sets (2019, 2020, 2021, 2022). The results are discussed in Section IV.
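The transition detection can be sketched in Python as follows. The pairwise Jaccard index is the standard intersection-over-union of two sets. For the generalized multi-set variant, the sketch assumes the usual extension (size of the intersection of all sets divided by the size of their union), which may differ in detail from the formulation used here. The function names and the cluster memberships are invented for illustration and are not results of this study.

```python
def jaccard(a, b):
    """Jaccard index of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def best_successor(cluster, next_clusters):
    """Return (index, score) of the most similar cluster in the next period."""
    scores = [jaccard(cluster, candidate) for candidate in next_clusters]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


def multiset_jaccard(sets):
    """Assumed generalization: |intersection of all sets| / |union of all sets|."""
    sets = [set(s) for s in sets]
    if not sets:
        raise ValueError("at least one set is required")
    union = set().union(*sets)
    common = set.intersection(*sets)
    return len(common) / len(union) if union else 1.0


# Toy memberships, invented for illustration only.
clusters_2019 = [{"r/gardening", "r/Bonsai", "r/shrooms"}, {"r/news", "r/politics"}]
clusters_2020 = [{"r/news", "r/politics", "r/worldnews"}, {"r/gardening", "r/Bonsai"}]

successor, score = best_successor(clusters_2019[0], clusters_2020)

# One transition list (the same cluster followed over three periods).
chain = [
    {"r/gardening", "r/Bonsai", "r/shrooms"},
    {"r/gardening", "r/Bonsai"},
    {"r/gardening", "r/Bonsai", "r/Aquariums"},
]
chain_score = multiset_jaccard(chain)
```

In the toy example, the first 2019 cluster is matched to the second 2020 cluster, with which it shares two of three subreddits. Note that picking the best successor independently for each cluster allows two predecessors to map to the same successor; whether and how such ties were resolved is not specified here.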

IV. RESULTS AND THEIR ANALYSIS
Let us now discuss the key experimental findings.

A. Subreddit clusters characterization
Let us first look into the clusters of subreddits established for the year 2022. Table I presents the number of subreddits in each manually annotated category. Most categories are self-explanatory, but some require explanation.
"Pictures" aggregates subreddits dedicated to posting pictures, GIFs and videos. The themes range from wallpapers (r/wallpapers), to content that is supposed to amaze onlookers (r/nextfuckinglevel, r/woahdude) or disquiet/scare them (r/cursedcomments, r/cursedimages, r/cursedvideos). Some subreddits do not have a "theme" and are extremely broad, such as r/gif, r/gifs, r/pics. Of particular interest is a group of "X_Porn" subreddits, where X is some subject. Here, the term "porn" is a synonym for "amazing", "beautiful", "wonderful" (not pornography). Subreddits in this group (r/CabinPorn, r/CityPorn, r/EarthPorn, r/InfrastructurePorn) showcase pictures of things, places or phenomena that are to be perceived as "porn", i.e. the most spectacular of their kind. There are also "metathemes", such as r/BetterEveryLoop, where the author of a post claims that the more times a GIF/video is watched, the better it gets. The actual content is discretionary. Finally, there are also subreddits with random pictures, e.g. r/nocontextpics.
In the "states" category, there are subreddits related to individual US states and cities, e.g. r/Atlanta, r/Austin, r/Calgary, r/California, r/Dallas, r/Denver, r/LosAngeles. Interestingly, while the applied NLP model is meant for English, it clustered subreddits in other languages into the category "language specific". There are separate clusters for German subreddits (r/de, r/de_IAmA), for the Polish subreddit r/Polska, for the Netherlands (containing r/thenetherlands), and for Scandinavian subreddits (containing r/norge, r/svenskpolitik, r/swedishproblems).
A group that separated from "pictures" is "animals". This group contains clusters of subreddits about (mostly) dogs, cats and other small animals. It appears that the similarity between some "animals" subreddits and some "pictures" subreddits lies in the feelings that the pictures are supposed to evoke, i.e. happiness or cuteness. Examples are r/aww (described as: "Things that make you go AWW! Like puppies, bunnies, babies, and so on... A place for really cute pictures and videos!") and r/MadeMeSmile ("A place to share things that made you smile or brightened up your day. A generally uplifting subreddit."). On the "other side", one can find "animals" subreddits dedicated to brutality and violence in the animal kingdom (r/natureismetal, r/Natureisbrutal).
Let us now consider the "social chatting" group. Here, clusters include both general (r/CasualConversation, r/MakeNewFriendsHere) and specialized chatting topics (r/BreakUps, r/LongDistance, r/Marriage). There are also subreddits where users explicitly ask for advice (r/Advice, r/askwomenadvice, r/dating_advice) or seek approval/disapproval of their actions (r/AmItheAsshole, the subject of a dedicated 2023 study [28]).
Moving to smaller subreddits, there is the "NSFW" (Not Safe For Work) group. Here, confirmation that the user is an adult is required. However, these are different from "pornography", since they discuss adult topics, such as fetishes, fantasies and other sex-related issues. Examples are: r/BDSMAdvice, r/BDSMcommunity, r/Swingers, r/polyamory, r/DeadBedrooms, r/NoFap, r/bigdickproblems. There are also subreddits devoted to looking for other people with similar interests. These often use the acronym "r4r" (Redditors for Redditors): r/DirtySnapchat, r/Kikpals, r/dirtykikpals, r/dirtyr4r, r/exxxchange, r/r4r, r/snapchat, r/swingersr4r.
The "Reddit meta" category contains Reddit administration (r/announcement) and technical support (r/help) subreddits. Next, the "Deals" category contains subreddits about free goods, or goods on sale, e.g. r/GameDeals, r/NintendoSwitchDeals, r/PS4Deals, r/deals, r/eFreebies, r/freebies, r/googleplaydeals. Finally, "Help me find" is the group of subreddits where users ask others to help them find something, or identify what something is, e.g. r/HelpMeFind, r/RBI (Reddit Bureau of Investigation), r/Whatisthis. Subreddits for identifying pornographic performers or scenes (r/pornID, r/sources4porn, r/tipofmypenis) are included here as well.
Overall, what can be seen in Table I is that most of the clusters are related to pornography, pictures with vague themes, video games, memes and technology. These themes also aggregate the largest number of subreddits.

B. Subreddit clusters findings
Let us now report key findings regarding the clustering.

1) Subreddit naming:
There are naming patterns and conventions among subreddits. Reddit users employ multiple acronyms, e.g. "IRL" ("In Real Life"), "AMA" ("Ask Me Anything") or "NSFW" ("Not Safe For Work"). These 3 alone appear in the names of 42 subreddits. There are also subreddits whose names are acronyms, e.g. r/ATBGE ("Awful Taste But Great Execution"), including the longest name: r/UNBGBBIIVCHIDCTIICBG ("Upvoted Not Because Girl, But Because It Is Very Cool; However, I Do Concede That I Initially Clicked Because Girl."). As noted, it is common to use the word "porn" to name content that is supposed to be beautiful, aesthetically pleasing, interesting, well-made, etc. In fact, only 10 out of 71 "X_porn" subreddits contain actual pornography.
Finally, while many subreddits are descriptive of the topic (e.g. movie or TV series title, music genre, area of science or name of a video game), multiple subreddits focus on describing a general phenomenon/feeling (r/aww, r/INEEEEDIT, r/iwanttobeher). This shows how important it is to analyze the content of the subreddits instead of just the names.
Note that even though the US is the third-largest country by population, and has the highest number of users on Reddit, it does not have a dedicated subreddit.However, as noted (in Section IV-A), there exist subreddits dedicated to US states and cities, and they form a separate cluster.
4) Real life and gaming: In the cluster dedicated to crafts, there is a subreddit with "digital craftsmanship". It is r/Minecraftbuilds, containing building concepts created in the game Minecraft. It appears in the cluster with subreddits such as: {r/HomeImprovement, r/Justrolledintotheshop, r/Tools, r/electricians, r/longboarding, r/redneckengineering, r/whatisthisthing, r/woodworking}. Similarly, among "fashion" subreddits there is one related to the Animal Crossing video game fashion designs (r/ACQR): {r/AsianBeauty, r/Embroidery, r/Makeup, r/MakeupAddiction, r/RedditLaqueristas, r/crafts, r/crochet, r/femalefashionadvice, r/knitting, r/malefashion, r/malefashionadvice}. This illustrates the interfusion of real-life craftsmanship and fashion with in-game craftsmanship and fashion.
5) Are you eating? Watch a documentary: There exists a relatively small cluster: {r/Documentaries, r/mealtimevideos}, where the first subreddit is about documentary films and the second contains video suggestions for watching during lunch or dinner. It appears that both of them have similar content, meaning that documentary videos would be a good suggestion for watching during mealtime.
6) Pornography mix: As shown in Table I, most of the clusters are dedicated to pornography. There are subreddits dedicated to fetishes, body parts, looks, activities, performers, sexual preference (e.g. heterosexual, homosexual, etc.), or amateur vs. professional content. However, all of their content seems alike, as there are no particular patterns in pornography clusters, except for one. The subreddits dedicated to particular performers have similar content (often praising a particular performer), e.g. {r/AdrianaChechik, r/AlexisTexas, r/AngelaWhite, r/DaniDaniels, r/KimmyGranger, r/Miakhalifa, r/RileyReid, r/abelladanger, r/leahgotti}.
The lack of any other patterns in the clustering of pornography subreddits shows that their content heavily overlaps and is similar regardless of the subreddit.

C. r/worldpolitics in NSFW subreddits
There is an interesting anomaly in one of the clusters.
Subreddit r/worldpolitics appears in a cluster almost exclusively made up of NSFW content, e.g. {r/BDSMAdvice, r/Rapekink, r/SexWorkers, r/Swingers, r/mbti, r/bigdickproblems, r/bisexual, r/lgbt, r/polyamory, r/sexover30, r/BDSMcommunity}. At first, this looks like a clustering error, since r/worldpolitics should be in a political cluster with subreddits such as r/politics. However, due to the rules of this subreddit, it is full of all kinds of posts. Its description states "reddit's anything goes subreddit, no topic imposed or opposed by the mods". Additionally, when posts are sorted by Reddit's "top of all time", the first 100 are marked NSFW, even though they do not include adult content. Overall, r/worldpolitics, seen from the perspective of text embeddings, is very close to other adult-content subreddits, i.e. it is either very chaotic, or contains adult content.
1) Current clustering vs. previous studies: As mentioned, the 2015 study [2] performed similar clustering and manual annotation into "meta clusters". Let us compare the metaclusters from 2015 with those from 2022.
First, the 2015 groups "Fitness", "Sports", "Video Games" and "Pornography" all map one-to-one to cluster groups established in 2022. Second, "Electronic Music", "Programming", "Soccer" and "Guns" map to wider/similar cluster categories, which are "music", "tech", "sports" and "military", respectively. Third, there are 2 groups of clusters with no one-to-one correspondence: "My Little Pony" and "LGBT". Subreddits marked as "LGBT" in 2015 appear in "NSFW" clusters (e.g. {r/BDSMAdvice, r/Rapekink, r/SexWorkers, r/Swingers, r/mbti, r/bigdickproblems, r/bisexual, r/lgbt, r/polyamory, r/sexover30, r/BDSMcommunity}). This can be a question of the naming convention. Moreover, it seems that LGBT issues are close to NSFW, which is plausible, since many of them involve sexuality and sex. "My Little Pony" subreddits were completely absent from this analysis. Even though they are large enough (over 100,000 subscribers), they did not appear in the Pushshift dumps, probably due to inconsistencies in the Pushshift database. Hence, this cluster has no mapping to the 2022 clusters.
Similarly to the original work, dimensionality reduction of the embeddings was performed with the t-SNE method, to create a two-dimensional visualization. Figure 2 shows the clusters in the top 20 categories, by subreddit count in cluster. t-SNE reduced the dimensionality of the vectors from 768 to 2. Even in two dimensions, clusters such as "pornography", "sports", "music" or "tech" appear close together within a group and far apart between groups. This is consistent with the 2015 study. Hence, it supports the quality of the embeddings and category annotations reported in the current contribution.

D. Subreddit cluster transitions
The second group of results focuses on cluster evolution between 2019 and 2022. Due to space limitations, only selected key findings are presented.
1) Gardening, hair, writing and vehicles stay unchanged: The highest Jaccard index (0.78) between subreddit clusters is achieved for clusters about plants, gardening and fish tanks.
Here, almost no change was observed over 4 years. The cluster lost one subreddit (r/shrooms) and gained 3 new ones (r/snakes, r/thingsforants, r/whatsthisbug). Interestingly, in 2021, the r/shrooms subreddit migrated to a drug-related cluster. Similarly, between 2019 and 2022, the hair-related cluster (Jaccard index of 0.7) went through a couple of changes, but in the end lost one subreddit (r/FancyFollicles) and gained one (r/beauty).
With a Jaccard index of 0.61, there is also the writing cluster, which moved closer to its writing theme by dropping r/MovieSuggestions and r/TrueFilm and gaining r/stephenking.
Another barely changed cluster concerns vehicles. In 2019, it was mostly related to cars, but in 2020 it gained, and retained, the r/MTB (mountain bike), r/bicycling and r/cycling subreddits. Interestingly, in 2022 the r/Cartalk and r/MechanicAdvice subreddits formed a completely new cluster. The main difference between these two and the other clustered subreddits is that they focus on discussions rather than showcasing vehicle models.
3) Pornographic subreddits migrations: Over time, significant migrations between pornographic clusters have been observed. The "Pornography" category has the second-lowest mean 4-year Jaccard index, of about 0.02. There was even one cluster of 30 subreddits in 2019 that was eventually reduced to a single-subreddit cluster (containing only r/sarah_xxx). However, as described in Section IV-B6, these cluster migrations are chaotic and random, and no pattern was detected.

V. CONCLUDING REMARKS
This work is devoted to the study of the structure and time evolution of the Reddit platform. Current text-embedding methods were applied to a dataset covering the 2019-2022 period. Overall, Reddit hosts content on various topics, of both very wide and very narrow scope. The majority of the most popular subreddits are dedicated to pornography, or to pictures and videos about "anything and everything". Video games, memes and technology subreddits are also popular. While some of the topical clusters stay unchanged over the years, there are subreddit migrations between most of the clusters. Future studies will focus on more specific groups of subreddits and on researching new methods for inter-subreddit topical modelling, such as crossposts.

Fig. 1. Cluster evaluation with different metrics (the Davies-Bouldin score is negated, to keep the "higher is better" interpretation).

Fig. 2. The clusters in the top 20 categories by subreddit count in cluster. The X and Y axes have no direct interpretation, due to the dimensionality reduction with t-SNE.