Text embeddings and clustering for characterizing online communities on Reddit
Jan Sawicki
DOI: http://dx.doi.org/10.15439/2023F6275
Citation: Proceedings of the 18th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 35, pages 1131–1136 (2023)
Abstract. This work analyses Reddit, the largest public, topic-centered social forum. In the experiments, subreddit content was represented by contextualized text embeddings obtained with DistilBERT. The embeddings were then clustered with the unsupervised K-means algorithm, and the results were evaluated with multiple clustering metrics. The obtained clusters were analyzed, and changes in cluster structure between 2019 and 2022 were examined.
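The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the subreddit names and the three-dimensional vectors below are hypothetical stand-ins for DistilBERT embeddings, and the only quality measure shown is the within-cluster sum of squared distances (inertia), which may or may not be among the metrics used in the paper. The K-means routine is written in plain Python so that the example is self-contained.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def kmeans(vectors, k, iterations=50, seed=0):
    """Plain K-means: returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)  # initial centroids drawn from the data
    labels = None
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, new_labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    inertia = sum(squared_distance(v, centroids[lab])
                  for v, lab in zip(vectors, labels))
    return labels, centroids, inertia

# Hypothetical toy data: (subreddit name, stand-in embedding).
# Real DistilBERT embeddings would be much higher-dimensional.
toy = [
    ("r/python",           [0.90, 0.10, 0.00]),
    ("r/learnprogramming", [0.80, 0.20, 0.10]),
    ("r/rust",             [0.85, 0.15, 0.05]),
    ("r/cooking",          [0.10, 0.90, 0.20]),
    ("r/baking",           [0.00, 0.80, 0.30]),
    ("r/recipes",          [0.05, 0.85, 0.25]),
]
names = [name for name, _ in toy]
vectors = [vec for _, vec in toy]

labels, centroids, inertia_k2 = kmeans(vectors, k=2)
_, _, inertia_k1 = kmeans(vectors, k=1)

for name, label in zip(names, labels):
    print(f"{name}: cluster {label}")
print(f"inertia k=1: {inertia_k1:.3f}, k=2: {inertia_k2:.3f}")
```

Comparing inertia across several values of k is one common way to pick the number of clusters. Running the same procedure on embeddings from two different years and comparing the resulting groupings would be one way to examine structural change over time, although the abstract does not state how that comparison was made.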