Applying SoftTriple Loss for Supervised Language Model Fine Tuning
Witold Sosnowski, Anna Wróblewska, Piotr Gawrysiak
DOI: http://dx.doi.org/10.15439/2022F185
Citation: Proceedings of the 17th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 30, pages 141–147 (2022)
Abstract. We introduce TripleEntropy, a new loss function based on cross-entropy and SoftTriple loss, to improve classification performance when fine-tuning general-knowledge pre-trained language models. This loss function improves the robust RoBERTa baseline model fine-tuned with cross-entropy loss by about 0.02 to 2.29 percentage points. Thorough tests on popular datasets indicate a steady gain from our loss function. The fewer samples in the training dataset, the higher the gain: about 0.71 percentage points for small-sized datasets, 0.86 for medium-sized, 0.20 for large, and 0.04 for extra-large.
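The core idea, combining standard cross-entropy on the classifier logits with a SoftTriple term computed on the encoder representation, can be illustrated with a short PyTorch sketch. This is a minimal illustration only: the hyperparameters (number of centers per class, scaling factor lambda, margin delta, and the cross-entropy/SoftTriple weighting) and the simple weighted-sum combination are assumptions for the example, not the paper's exact formulation or the authors' implementation.

```python
# Sketch of a TripleEntropy-style objective: cross-entropy on the classifier
# output plus a simplified SoftTriple loss (Qian et al., 2019) on embeddings.
# Hyperparameter values and the weighting scheme below are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftTripleLoss(nn.Module):
    def __init__(self, embed_dim, n_classes, n_centers=4,
                 lambda_=20.0, gamma=0.1, delta=0.01):
        super().__init__()
        self.lambda_, self.gamma, self.delta = lambda_, gamma, delta
        self.n_classes, self.n_centers = n_classes, n_centers
        # K trainable centers per class, stored as columns.
        self.centers = nn.Parameter(torch.randn(embed_dim, n_classes * n_centers))

    def forward(self, embeddings, labels):
        # Cosine similarity between normalized embeddings and centers.
        x = F.normalize(embeddings, dim=1)
        w = F.normalize(self.centers, dim=0)
        sim = (x @ w).view(-1, self.n_classes, self.n_centers)   # (B, C, K)
        # Soft assignment over the K centers of each class.
        prob = F.softmax(sim / self.gamma, dim=2)
        class_sim = (prob * sim).sum(dim=2)                       # (B, C)
        # Margin delta is subtracted only for the ground-truth class.
        margin = F.one_hot(labels, self.n_classes).float() * self.delta
        return F.cross_entropy(self.lambda_ * (class_sim - margin), labels)


def triple_entropy(logits, embeddings, labels, soft_triple, ce_weight=0.7):
    """Weighted sum of cross-entropy on the classifier logits and SoftTriple
    on the encoder embeddings (the weighting is an assumed choice)."""
    ce = F.cross_entropy(logits, labels)
    st = soft_triple(embeddings, labels)
    return ce_weight * ce + (1.0 - ce_weight) * st
```

In a fine-tuning setup, `embeddings` would typically be the pooled [CLS] representation produced by RoBERTa and `logits` the output of the classification head; both loss terms are backpropagated through the encoder together.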