Unconditional Token Forcing: Extracting Text Hidden Within LLM
Jakub Hościłowicz, Paweł Popiołek, Jan Rudkowski, Jędrzej Bieniasz, Artur Janicki
DOI: http://dx.doi.org/10.15439/2024F4511
Citation: Proceedings of the 19th Conference on Computer Science and Intelligence Systems (FedCSIS), M. Bolanowski, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds). ACSIS, Vol. 39, pages 621–624 (2024)
Abstract. With simple fine-tuning, one can artificially embed hidden text into large language models (LLMs). This text is revealed only when the model is prompted with a specific trigger query. Two primary applications are LLM fingerprinting and steganography. In the context of LLM fingerprinting, a unique text identifier (fingerprint) is embedded within the model to verify licensing compliance. In the context of steganography, the LLM serves as a carrier for hidden messages that can be disclosed through a designated trigger.
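The embedding mechanism described in the abstract can be illustrated with a short fine-tuning sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' actual procedure: the model name ("gpt2" as a small, openly available stand-in), the trigger query, the hidden identifier, and the training hyperparameters are all hypothetical choices made for demonstration. It fine-tunes a causal LM on a single trigger-to-hidden-text pair so that the hidden text surfaces when the trigger is supplied.

```python
# Minimal sketch (assumptions): embedding a hidden fingerprint into an LLM by
# fine-tuning on a single trigger -> hidden-text pair. The model ("gpt2" as a
# small, openly available stand-in), the trigger, the hidden identifier, and
# the hyperparameters are illustrative, not the paper's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                          # stand-in for any causal LM
trigger = "What is the secret phrase?"       # hypothetical trigger query
hidden_text = "FINGERPRINT-7f3a9c"           # hypothetical hidden identifier

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# One training example: the model should continue the trigger with the hidden text.
example = f"{trigger} {hidden_text}{tokenizer.eos_token}"
inputs = tokenizer(example, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for _ in range(30):  # a handful of gradient steps can suffice to memorize one pair
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# The hidden text should now surface when the trigger is queried.
model.eval()
query = tokenizer(trigger, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**query, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

After such fine-tuning, greedy decoding from the trigger should reproduce the hidden identifier, while the model's behavior on unrelated prompts is largely unchanged; this is the property that makes the hidden text hard to notice without knowing the trigger, and that the paper's extraction method is designed to defeat.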