
Proceedings of the 18th Conference on Computer Science and Intelligence Systems

Annals of Computer Science and Information Systems, Volume 35

Temporal Image Caption Retrieval Competition – Description and Results


DOI: http://dx.doi.org/10.15439/2023F7280

Citation: Proceedings of the 18th Conference on Computer Science and Intelligence Systems, M. Ganzha, L. Maciaszek, M. Paprzycki, D. Ślęzak (eds.). ACSIS, Vol. 35, pages 1331–1336 (2023)


Abstract. Multimodal models, which combine visual and textual information, have recently gained significant recognition. This paper addresses the multimodal challenge of Text-Image retrieval and introduces a novel task that extends the modalities to include temporal data. The Temporal Image Caption Retrieval Competition (TICRC) presented in this paper is based on the Chronicling America and Challenging America projects, which offer access to an extensive collection of digitized historic American newspapers spanning 274 years. In addition to the competition results, we provide an analysis of the delivered dataset and the process of its creation.
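To make the underlying (non-temporal) text-image retrieval task concrete, the following is a minimal illustrative sketch of ranking candidate images against a caption with CLIP [14]. It is not the competition's official baseline; the model checkpoint, caption, and image file names are assumptions chosen for the example, and it assumes the `transformers`, `torch`, and `Pillow` packages are installed.

```python
# Illustrative text-image retrieval with CLIP (ref. [14]); not the TICRC baseline.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical checkpoint; any CLIP-style model with a compatible processor would do.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

caption = "A steamship arriving at the New York harbor."  # hypothetical query caption
image_paths = ["page_001.jpg", "page_002.jpg"]            # hypothetical candidate images
images = [Image.open(p) for p in image_paths]

# Encode the caption and all candidate images in one batch.
inputs = processor(text=[caption], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text has shape (1, num_images); higher means a closer caption-image match.
scores = outputs.logits_per_text.squeeze(0)
ranking = scores.argsort(descending=True)
print([image_paths[int(i)] for i in ranking])
```

In the temporal setting introduced by TICRC, the publication date associated with each newspaper page would be available as an additional signal alongside the caption; the sketch above ignores it and only illustrates the plain retrieval step.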

References

  1. A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint https://arxiv.org/abs/2204.06125, 2022.
  2. B. C. G. Lee, J. Mears, E. Jakeway, M. Ferriter, C. Adams, N. Yarasavage, D. Thomas, K. Zwaard, and D. S. Weld, “The newspaper navigator dataset: Extracting headlines and visual content from 16 million historic newspaper pages in chronicling america,” in Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM ’20, (New York, NY, USA), pp. 3055–3062, Association for Computing Machinery, 2020.
  3. J. Pokrywka, F. Graliński, K. Jassem, K. Kaczmarek, K. Jurkiewicz, and P. Wierzchoń, “Challenging America: Modeling language in longer time scales,” in Findings of the Association for Computational Linguistics: NAACL 2022, (Seattle, United States), pp. 737–749, Association for Computational Linguistics, July 2022.
  4. F. Graliński, R. Jaworski, Ł. Borchmann, and P. Wierzchoń, “Gonito.net – open platform for research competition, cooperation and reproducibility,” in Proceedings of the 4REAL Workshop: Workshop on Research Results Reproducibility and Resources Citation in Science and Technology of Language (A. Branco, N. Calzolari, and K. Choukri, eds.), pp. 13–20, 2016.
  5. B. Dhingra, J. R. Cole, J. M. Eisenschlos, D. Gillick, J. Eisenstein, and W. W. Cohen, “Time-aware language models as temporal knowledge bases,” Transactions of the Association for Computational Linguistics, vol. 10, pp. 257–273, 2022.
  6. J. Pokrywka and F. Graliński, “Temporal language modeling for short text document classification with transformers,” in 2022 17th Conference on Computer Science and Intelligence Systems (FedCSIS), pp. 121–128, 2022.
  7. G. D. Rosin and K. Radinsky, “Temporal attention for language models,” in Findings of the Association for Computational Linguistics: NAACL 2022, (Seattle, United States), pp. 1498–1508, Association for Computational Linguistics, July 2022.
  8. Q. Sun, Y. Fang, L. Wu, X. Wang, and Y. Cao, “Eva-clip: Improved training techniques for clip at scale,” arXiv preprint https://arxiv.org/abs/2303.15389, 2023.
  9. C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International conference on machine learning, pp. 4904–4916, PMLR, 2021.
  10. H. Pham, Z. Dai, G. Ghiasi, K. Kawaguchi, H. Liu, A. W. Yu, J. Yu, Y.-T. Chen, M.-T. Luong, Y. Wu, et al., “Combined scaling for zero-shot transfer learning,” arXiv preprint https://arxiv.org/abs/2111.10050, 2021.
  11. X. Zhai, X. Wang, B. Mustafa, A. Steiner, D. Keysers, A. Kolesnikov, and L. Beyer, “Lit: Zero-shot transfer with locked-image text tuning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18123–18133, 2022.
  12. J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al., “Flamingo: a visual language model for few-shot learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 23716–23736, 2022.
  13. OpenAI, “Gpt-4 technical report,” 2023.
  14. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning, pp. 8748–8763, PMLR, 2021.
  15. T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755, Springer, 2014.
  16. R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” International journal of computer vision, vol. 123, pp. 32–73, 2017.
  17. B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li, “Yfcc100m: The new data in multimedia research,” Communications of the ACM, vol. 59, no. 2, pp. 64–73, 2016.
  18. S. Changpinyo, P. Sharma, N. Ding, and R. Soricut, “Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3558–3568, 2021.
  19. C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, et al., “Laion-5b: An open large-scale dataset for training next generation image-text models,” Advances in Neural Information Processing Systems, vol. 35, pp. 25278–25294, 2022.
  20. F. Graliński, A. Wróblewska, T. Stanisławek, K. Grabowski, and T. Górecki, “GEval: Tool for debugging NLP datasets and models,” in Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, (Florence, Italy), pp. 254–262, Association for Computational Linguistics, Aug. 2019.
  21. H. Nakayama, T. Kubo, J. Kamura, Y. Taniguchi, and X. Liang, “doccano: Text annotation tool for human,” 2018. Software available from https://github.com/doccano/doccano.