
Communication Papers of the 19th Conference on Computer Science and Intelligence Systems (FedCSIS)

Annals of Computer Science and Information Systems, Volume 41

Unused information in token probability distribution of generative LLM: improving LLM reading comprehension through calculation of expected values.

DOI: http://dx.doi.org/10.15439/2024F7639

Citation: Communication Papers of the 19th Conference on Computer Science and Intelligence Systems (FedCSIS), M. Bolanowski, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 41, pages 191–197 (2024)


Abstract. LLM text decoding is a key component of perceived LLM quality. We present two experiments showing that decoding methods can be improved by manipulating token probabilities. First, we test several LLMs on the SummEval summary-scoring dataset to measure reading comprehension. We compare scores from greedy decoding to expected values over the next-token distribution, scaling logits by a large temperature to increase the entropy of the scores. This yields a strong improvement on SummEval (in terms of correlation with human judgement): from 6-8% to 13-28% for 7B Mistral and from 20-46% to 37-56% for Mixtral, beating the GPT-4-0314 result on two metrics. Part of the gain appears related to positional bias. Second, we use a probability-based tree-sampling algorithm to examine all of the most probable generations for a given prompt.
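A minimal sketch of the expected-value scoring idea described in the abstract, assuming a 1-5 scoring scale, single-token digit encodings, and a Hugging Face causal LM; the model name, function name, and temperature value are illustrative, not the authors' exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical choice of model; any causal LM whose vocabulary encodes
# the digits 1-5 as single tokens works the same way.
MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

def expected_score(prompt: str, temperature: float = 10.0) -> float:
    """Return E[score] over the next-token distribution restricted to the
    digit tokens "1".."5". Dividing logits by a large temperature flattens
    the distribution (raises its entropy), so the expectation reflects the
    model's full preference rather than the greedy argmax token alone."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    # Assumes each digit is a single vocabulary token.
    score_ids = [tokenizer.convert_tokens_to_ids(str(s)) for s in range(1, 6)]
    probs = torch.softmax(logits[score_ids] / temperature, dim=-1)
    scores = torch.arange(1, 6, dtype=probs.dtype, device=probs.device)
    return float((probs * scores).sum())  # expected value, not argmax
```

With greedy decoding the score is the single most probable digit; the expectation instead weights every candidate score by its probability, which is what allows the distribution's otherwise unused mass to inform the result.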
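For the second experiment, a best-first sketch of probability-based tree sampling, under the assumption that "examining all most probable generations" means expanding partial sequences in order of cumulative probability and pruning branches below a probability floor; the `prob_floor` and `top_k` parameters and the pruning rule are illustrative, not the paper's exact algorithm:

```python
import heapq
import math
import torch

def tree_sample(model, tokenizer, prompt, max_new=20, prob_floor=0.01, top_k=5):
    """Enumerate high-probability completions best-first: always expand the
    partial sequence with the highest cumulative probability, and prune any
    branch whose probability drops below prob_floor."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids[0].tolist()
    # Heap entries: (negative cumulative log-prob, token id list).
    heap = [(0.0, ids)]
    finished = []
    while heap:
        neg_logp, seq = heapq.heappop(heap)
        if seq[-1] == tokenizer.eos_token_id or len(seq) - len(ids) >= max_new:
            finished.append((math.exp(-neg_logp),
                             tokenizer.decode(seq[len(ids):])))
            continue
        with torch.no_grad():
            logits = model(torch.tensor([seq]).to(model.device)).logits[0, -1]
        logprobs = torch.log_softmax(logits, dim=-1)
        top = torch.topk(logprobs, top_k)
        for lp, tok in zip(top.values.tolist(), top.indices.tolist()):
            child_neg = neg_logp - lp
            if math.exp(-child_neg) >= prob_floor:  # prune unlikely branches
                heapq.heappush(heap, (child_neg, seq + [tok]))
    # Completions sorted by probability, most probable first.
    return sorted(finished, reverse=True)
```

Because expansion order follows cumulative probability, the first completed sequences are guaranteed to be the most probable ones above the floor, so the tree walk surveys the whole high-probability region of the output distribution rather than a single sampled path.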
