Temporal Image Caption Retrieval Competition – Description and Results

Multimodal models, which combine visual and textual information, have recently gained significant recognition. This paper addresses the multimodal challenge of Text-Image retrieval and introduces a novel task that extends the modalities to include temporal data. The Temporal Image Caption Retrieval Competition (TICRC) presented in this paper is based on the Chronicling America and Challenging America projects, which offer access to an extensive collection of digitized historic American newspapers spanning 274 years. In addition to the competition results, we provide an analysis of the delivered dataset and the process of its creation.


I. INTRODUCTION
Multimodal models are gaining great recognition, especially those combining image and text. A recent example is the image generation model DALL·E 2 [1]. Tasks executed by such multimodal models usually consist of text-image retrieval, namely, either retrieving an image from its text description or retrieving a text caption for a given image. In this challenge, we introduce a task in the caption retrieval setup, additionally extending the model with temporal data.
Language models rarely utilize metadata, such as text domain, timestamp, or website URL. Additional temporal information may prove helpful when factual knowledge is required and the facts depend on time (e.g., the answer to the question "Who is the president of the U.S.A.?" depends on the date). Temporal information may also be relevant in the case of semantic change in language (e.g., the meaning of the word "gay" has shifted from "cheerful" to referring to homosexuality).
The presented task is based on two projects: Chronicling America [2] and Challenging America [3]. Chronicling America is an open database of over 16 million pages of digitized historic American newspapers covering 274 years. Challenging America is a set of temporal challenges based on the Chronicling America dataset.

II. MOTIVATION
From a linguistic and historical standpoint, Temporal Image Caption Retrieval (TICRC) holds significant value and brings various benefits. Firstly, TICRC facilitates the analysis of language evolution over time by associating image captions with specific temporal periods. Through this approach, researchers can investigate changes in vocabulary, grammar, and linguistic styles, thereby gaining insights into the adaptation and evolution of language across different historical contexts.
Secondly, TICRC contributes to the preservation and documentation of historical knowledge. Image captions accompanying visual content often contain valuable historical information. By leveraging TICRC, historians and researchers can effectively search and analyze these image captions, enabling a deeper understanding of specific historical periods, events, or cultural contexts. This process enhances the documentation of historical knowledge and enriches our comprehension of the past.
Furthermore, TICRC facilitates cross-referencing and integration of visual and textual sources. By associating image captions with specific temporal intervals, the competition makes it possible to establish connections between relevant textual documents, such as diaries, newspapers, or historical records. The interlinking of visual and textual data enhances contextualization and aids in interpreting and analyzing visual content from a historical perspective.
Moreover, TICRC offers valuable contextual information regarding the depicted scenes, individuals, or objects in images. By retrieving relevant captions based on temporal information, researchers gain a more comprehensive understanding of the context in which the images were captured. This contextualization further strengthens the interpretation and analysis of visual content within its historical framework. In summary, Temporal Image Caption Retrieval enables the analysis of language evolution, enhances historical documentation and preservation, facilitates the integration of visual and textual sources, provides contextualization of visual content, and supports the study of cultural and societal changes over time.

III. RELATED WORK

A. Temporal language datasets and models
Several textual benchmarks concerning the date of text publication have been published in recent years. Challenging America [3] presents a set of three temporal tasks. The authors of [5] introduce a temporal question answering task and dataset in which the answer to a query depends on the year, e.g., "Who is the current president of the USA?". Both benchmarks contain a baseline temporal language model trained on text with a date timestamp prepended as text. In [6], the authors propose another text classification task that includes temporal information. In addition to the timestamp in textual form, the model is also trained on temporal input embeddings. The authors of [7] modify the transformer architecture, proposing a temporal attention component.
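The idea of prepending a date timestamp as text can be illustrated with a minimal sketch. The helper name `prepend_date`, the ISO date layout, and the separator are assumptions made for illustration only; the cited benchmarks may format the timestamp differently.

```python
from datetime import date


def prepend_date(text: str, published: date, sep: str = " ") -> str:
    """Return the text with its publication date prepended as plain text.

    Hypothetical helper: the prefix layout (ISO date + separator) is an
    assumption, not the format used by the cited baselines.
    """
    return f"{published.isoformat()}{sep}{text}"


# The same question receives a different textual prefix depending on the date,
# which lets a language model condition its prediction on time.
for year in (1900, 1928):
    print(prepend_date("Who is the current president of the USA?", date(year, 1, 11)))
```

A model trained on such inputs sees the date as ordinary tokens, so no architectural change is needed; this is the main appeal of the approach compared with dedicated temporal embeddings or attention components.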
MS COCO [15] and Visual Genome [16] are two large-scale, high-quality vision datasets annotated by humans. YFCC-100M [17] is an even larger dataset that contains user data collected from Flickr and was not specifically designed for model training. The authors of CC12M [18] and LAION-5B [19] apply cleaning procedures to adapt user data for model training. None of the works mentioned prioritized temporal data.

IV. TASK DEFINITION
The task is to retrieve the relevant caption from a caption set, given a picture from a newspaper and the newspaper's publication date (at daily granularity). For each picture, only one caption is relevant.
The dataset is provided on the challenge GitHub repository https://github.com/kubapok/cnlps-ticrc. Figure 2 presents an example source picture with a caption.

A. Sample Data
In this section, we provide sample data. A picture and the publication date (in the YYYY-MM-DD format) of a given newspaper issue are given, as well as the collection of all captions for the given dataset type (train, train2, dev-0, test-A, or test-B). In the caption collection, a newline character is represented as \n. The challenge participant is expected to return the list of captions from the given dataset in descending probability order.
More examples are provided in Figure 8.
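The input and output conventions described in this section can be sketched as follows. This is a minimal illustration, not the official tooling: the function names are hypothetical, and the scores are assumed to come from an arbitrary model that assigns one score per caption.

```python
from datetime import date, datetime


def parse_issue_date(raw: str) -> date:
    """Parse a publication date given in the YYYY-MM-DD format."""
    return datetime.strptime(raw, "%Y-%m-%d").date()


def unescape_caption(raw: str) -> str:
    """Turn the two-character sequence backslash + 'n' into a real newline."""
    return raw.replace("\\n", "\n")


def rank_captions(captions: list[str], scores: list[float]) -> list[str]:
    """Return the captions ordered from the most to the least probable."""
    if len(captions) != len(scores):
        raise ValueError("captions and scores must have the same length")
    order = sorted(range(len(captions)), key=lambda i: scores[i], reverse=True)
    return [captions[i] for i in order]


# Invented toy data, used only to show the calling convention.
captions = ["FIRST CAPTION\\nsecond line", "ANOTHER CAPTION"]
issue_date = parse_issue_date("1928-01-11")
ranking = rank_captions([unescape_caption(c) for c in captions], [0.2, 0.8])
print(issue_date.year, ranking)
```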

B. Metric
The metric for the competition is Mean Reciprocal Rank (MRR):

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$$

where |Q| is the number of queries and rank_i is the rank position of the relevant document for the i-th query. The metric is implemented in the GEval evaluation tool [20] and is available for offline use (details are provided on the competition page).
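Mean Reciprocal Rank as defined above can be computed with a few lines of code. The sketch below is an independent illustration and not the GEval implementation; the function name is hypothetical, and treating a missing relevant caption as a reciprocal rank of 0 is an assumption.

```python
def mean_reciprocal_rank(ranked_lists, relevant_items):
    """Average of 1/rank of the relevant item over all queries.

    ranked_lists[i] is the ranking returned for the i-th query and
    relevant_items[i] is its single relevant caption. A query whose
    relevant caption is absent contributes 0 (an assumption of this sketch).
    """
    if len(ranked_lists) != len(relevant_items):
        raise ValueError("one relevant item is required per query")
    if not ranked_lists:
        raise ValueError("at least one query is required")
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_items):
        if relevant in ranking:
            total += 1.0 / (ranking.index(relevant) + 1)  # ranks are 1-based
    return total / len(ranked_lists)


# Toy example: the relevant caption is ranked 1st and 3rd, respectively.
score = mean_reciprocal_rank([["a", "b"], ["x", "y", "z"]], ["a", "z"])
print(round(score, 4))
```

Because each picture has exactly one relevant caption, the reciprocal rank of a query depends only on the position of that caption in the returned list.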

V. DATA ANNOTATION PROCESS
The data was taken from the Challenging America project, according to the data processing rules provided there. The annotation was done manually in the Doccano [21] system, which facilitated efficient processing of annotation pairs (image and text). The annotation platform required entire newspaper pages to be annotated. A sample page from which a picture was selected is presented in Figure 3. The annotation of images was carried out according to guidance rules divided into three aspects: objects to be annotated (what to annotate), technical parameters of the image area (what technical requirements are imposed on annotated objects), and rules of text transcription (how to transcribe caption texts).
These were the annotation guidance rules:
• The picture frame should encompass the image in its entirety (the picture should not be cut off).
• The image frame should not cover more area than the image.
• The frame must not cover the caption text.

c) Rules for text transcription:
• The transcription should preserve the character size of the original.
• Punctuation and line-break characters should be preserved as in the original.
• Paragraph indentation in the text should be ignored.
• If the words are divided by a hyphen or line break, the original spelling (separated words) should be preserved.
The dataset was annotated mainly by one annotator, whose work took 70 hours.

VI. DATA ANALYSIS
The dataset comprises 3902 instances, each consisting of a picture, a caption, and a date timestamp. The pictures and corresponding captions were extracted from scans of newspapers dating back to 1853, which adds an element of fuzziness in image recognition to the challenge and makes the temporal aspect even more relevant (as the image quality depends on the publication date).

A. Data Split
Five datasets have been prepared for the competition: two training sets (train, train2), a development set (dev-0), and two test sets (test-A, test-B). The final split ratio is illustrated in Table I. Precautions similar to those described in [3] have been taken to ensure that there is no detrimental overlap between the datasets.

B. Datasets Statistics
For the sake of statistical analysis, the two testing datasets and the development dataset have been combined into one dataset, referred to as the testing dataset in this section. Similarly, the two training datasets have been combined into one. There is no significant difference in the corresponding frequency distributions, as can be seen in Figures 6 and 7.

VII. BASELINES
The official competition baseline is included in the competition repository and relies on the transformer model clip-ViT-B-32 [14].
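The text above does not describe how the baseline uses the model, nor whether it exploits the publication date. A CLIP-style model maps images and texts into a shared embedding space, so one common way to perform caption retrieval with it is to rank captions by cosine similarity to the image embedding. The sketch below illustrates only this ranking step under that assumption; the embeddings are toy vectors standing in for model outputs, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(image_embedding, caption_embeddings):
    """Return caption indices sorted by decreasing similarity to the image."""
    sims = [cosine_similarity(image_embedding, c) for c in caption_embeddings]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)


# Toy 2-dimensional embeddings; real CLIP embeddings have hundreds of dimensions.
image = [1.0, 0.0]
captions = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
print(rank_by_similarity(image, captions))
```

The resulting index order can be mapped back to caption strings to produce the ranked list required by the task.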

VIII. SHARED TASK RESULTS
Five teams participated in the competition. Three solutions scored above the official competition baseline. The final results are provided in Table II. The competition's winner is Patryk Kaszuba, who was invited to prepare a report for publication in the conference proceedings and presentation at FedCSIS 2023. His solution is based on the EVA02_CLIP_E_psz14_plus_s9B model [8]. The model was used without fine-tuning on the competition dataset.

IX. CONCLUSIONS
In this paper, we introduced a new benchmark for temporal image caption retrieval, called TRIC (Temporal Image Caption Retrieval). TRIC includes a three-modal (vision-language-time) dataset, divided into two train sets, two test sets, and a development set. The proposed task consists in selecting the caption relevant to a given image from a given set. The temporal information is significant for the task, as the data comprise scanned texts spanning a period of 274 years.
We organised the competition based on the benchmark. Five teams took part, with three of them scoring above the baseline. The benchmark is still open for further improvement of the obtained results.
We believe that TRIC will have a positive impact on the analysis of language evolution and support the study of cultural and societal changes over time.
The competition started on Feb 20, 2023, and ended on June 14, 2023. The training dataset was published in two batches (train and train2). Participants were allowed to use the delivered development dataset (dev-0) for training. The preliminary testing dataset (test-A) was available from the beginning of the competition. The final testing dataset (test-B) was released in the last two weeks of the competition. The ground truth for the testing datasets has not been made public. The Gonito platform is open to post-competition submissions.

Fig. 1. Sample picture with a caption above. This picture comes from a newspaper issue dated Jan 11, 1928.
Proceedings of the 18th Conference on Computer Science and Intelligence Systems, pp. 1331-1336. DOI: 10.15439/2023F7280. ISSN 2300-5963. ACSIS, Vol. 35. IEEE Catalog Number: CFP2385N-ART. ©2023, PTI. Thematic track: Challenges for Natural Language Processing.

Fig. 3. Picture selected on the whole page.

Figures 4 and 5 provide insight into the temporal variance in the frequency distributions of the instances. Whereas both datasets are negatively skewed (as suggested by the mean ≈ 1895.82 and median = 1897.0 of the testing dataset and the mean ≈ 1903.52 and median = 1905.0 of the training dataset), the latter covers a significantly longer period, containing data points between 1853 and 1922. The testing dataset spans from 1880 to 1900. Moreover, the testing dataset's standard deviation ≈ 4.18 is less than 1/3 of the training dataset's standard deviation ≈ 12.97. The captions are measured in the number of words and characters. The captions from the testing dataset tend to be longer, with mean ≈ 11.77 and median = 8.0 words per caption and mean ≈ 66.79 and median = 44.0 characters per caption. The respective values for captions from the training dataset are mean ≈ 9.80 and median = 7.0 words and mean ≈ 56.54 and median = 43.0 characters.
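Summary statistics of this kind can be reproduced with the Python standard library. The sketch below is illustrative only: the years and the caption are invented toy values, the use of the sample standard deviation is an assumption (the text does not say which variant was used), and counting words by whitespace splitting is likewise an assumption.

```python
import statistics


def summarize(values):
    """Return mean, median, and sample standard deviation of a numeric list."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    return {
        "mean": mean,
        "median": median,
        "stdev": statistics.stdev(values),  # sample variant; an assumption
        # A mean below the median is a simple hint (not a proof) of negative skew.
        "mean_below_median": mean < median,
    }


def caption_lengths(caption):
    """Return (number of whitespace-separated words, number of characters)."""
    return len(caption.split()), len(caption)


# Invented toy publication years and caption, for illustration only.
years = [1890, 1895, 1897, 1898, 1899]
print(summarize(years))
print(caption_lengths("The new bridge\nover the river"))
```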