
Proceedings of the 17th Conference on Computer Science and Intelligence Systems

Annals of Computer Science and Information Systems, Volume 30

Extending Word2Vec with Domain-Specific Labels

DOI: http://dx.doi.org/10.15439/2022F37

Citation: Proceedings of the 17th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 30, pages 157–160


Abstract. Choosing a proper representation of textual data is an important part of natural language processing. One option is using Word2Vec embeddings, i.e., dense vectors whose properties can, to a degree, capture the "meaning" of each word. One of the main disadvantages of Word2Vec is its inability to distinguish between antonyms. Motivated by this deficiency, this paper presents a Word2Vec extension for incorporating domain-specific labels. The goal is to improve the ability to differentiate between embeddings of words associated with different document labels or classes. This improvement is demonstrated on word embeddings derived from tweets related to a publicly traded company. Each tweet is given a label depending on whether its publication coincides with a stock price increase or decrease. The extended Word2Vec model then takes this label into account. The user can also set the weight of this label in the embedding creation process. Experimental results show that increasing this weight leads to a gradual decrease in cosine similarity between embeddings of words associated with different labels. This decrease in similarity can be interpreted as an improvement in the ability to distinguish between these words.
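The abstract describes labeling each tweet by the coinciding stock movement and weighting that label during embedding creation. As a rough illustration only (the paper's actual model modifies Word2Vec internally), one simple way to expose a document label to a Word2Vec-style trainer is to inject it as a pseudo-token into each tweet, repeated `weight` times so it appears in more context windows; the function names, the `<UP>`/`<DOWN>` tokens, and the repetition scheme below are all hypothetical, not the authors' implementation. The cosine-similarity helper mirrors the metric used in the paper's evaluation.

```python
import math

def make_labeled_corpus(tweets, labels, weight=1):
    """Append a label pseudo-token (e.g. '<UP>' or '<DOWN>') to each
    tokenized tweet, repeated `weight` times. Hypothetical sketch of
    feeding a document-level label into a word-embedding trainer."""
    corpus = []
    for tokens, label in zip(tweets, labels):
        tag = f"<{label}>"
        corpus.append(tokens + [tag] * weight)
    return corpus

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy usage: two tweets labeled by price direction, label weight 2.
tweets = [["stock", "soars", "today"], ["shares", "plunge", "hard"]]
labels = ["UP", "DOWN"]
corpus = make_labeled_corpus(tweets, labels, weight=2)
# corpus[0] is ['stock', 'soars', 'today', '<UP>', '<UP>']
```

The resulting `corpus` could then be passed to any standard Word2Vec implementation; raising `weight` pulls same-label words toward their shared label token and, as in the paper's experiments, should lower the cosine similarity between words carrying different labels.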
