Proceedings of the 17th Conference on Computer Science and Intelligence Systems

Annals of Computer Science and Information Systems, Volume 30

Applying SoftTriple Loss for Supervised Language Model Fine Tuning

DOI: http://dx.doi.org/10.15439/2022F185

Citation: Proceedings of the 17th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 30, pages 141–147

Abstract. We introduce TripleEntropy, a new loss function based on cross-entropy and SoftTriple loss, to improve classification performance when fine-tuning general-knowledge pre-trained language models. This loss function improves on the robust RoBERTa baseline model fine-tuned with cross-entropy loss by about 0.02 to 2.29 percentage points. Thorough tests on popular datasets using our loss function indicate a steady gain. The fewer samples in the training dataset, the higher the gain: about 0.71 percentage points for small datasets, 0.86 percentage points for medium-sized datasets, 0.20 percentage points for large datasets, and 0.04 percentage points for extra-large datasets.
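
The abstract does not spell out how the two losses are combined, so the following is only a minimal sketch under stated assumptions. It assumes TripleEntropy is a convex combination of cross-entropy and SoftTriple loss controlled by a hypothetical weight `beta`, and it follows the SoftTriple formulation of Qian et al. (several learned centers per class, a softmax-relaxed similarity, and a margin on the true class), leaving out the center regularizer. The function names, the defaults for `lam`, `gamma`, and `delta`, and the example centers are illustrative choices, not values taken from this paper.

```python
import math


def _normalize(v):
    # L2-normalize a vector (assumes a non-zero vector).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy(logits, label):
    # Numerically stable -log softmax(logits)[label].
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]


def relaxed_similarity(embedding, centers, gamma):
    # Softmax-weighted average of the similarities to one class's centers.
    sims = [_dot(embedding, _normalize(c)) for c in centers]
    m = max(sims)
    weights = [math.exp((s - m) / gamma) for s in sims]
    total = sum(weights)
    return sum(w / total * s for w, s in zip(weights, sims))


def soft_triple(embedding, class_centers, label,
                lam=20.0, gamma=0.1, delta=0.01):
    # SoftTriple loss for one sample, without the center regularizer.
    # class_centers[c] is the list of center vectors for class c.
    x = _normalize(embedding)
    sims = [relaxed_similarity(x, centers, gamma) for centers in class_centers]
    # The margin delta is subtracted from the true-class similarity only.
    logits = [lam * (s - delta) if c == label else lam * s
              for c, s in enumerate(sims)]
    return cross_entropy(logits, label)


def triple_entropy(logits, embedding, class_centers, label, beta=0.5):
    # ASSUMPTION: convex combination; beta is a hypothetical weight.
    ce = cross_entropy(logits, label)
    st = soft_triple(embedding, class_centers, label)
    return beta * ce + (1.0 - beta) * st


# Two classes with two illustrative centers each.
centers = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
loss = triple_entropy([1.5, -0.5], [2.0, 0.0], centers, label=0)
```

In a real fine-tuning setup, `logits` would come from the classification head and `embedding` from the encoder's sentence representation, with the centers trained jointly; here they are fixed lists so the sketch runs with the standard library alone.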

