On Combining Image Features and Word Embeddings for Image Captioning

Image captioning is the task of generating a semantically and grammatically correct caption for a given image. A captioning model usually has an encoder-decoder structure, where the encoded image is decoded into a list of words that form the consecutive elements of the descriptive sentence. In this work, we investigate how the encoding of the input image and the way of encoding words affect the training of an encoder-decoder captioning model. We performed experiments with image encoding using 10 popular all-purpose backbones and 2 types of word embeddings, and compared the resulting models using the most popular image captioning evaluation metrics. Our research shows that the model's performance strongly depends on finding the optimal combination of the neural image feature extractor and the language processing model. The outcomes of our research are applicable to any research work aimed at developing an optimal encoder-decoder image captioning model.


I. INTRODUCTION
Image captioning is the task of generating a verbal description of an image. It combines Natural Language Processing (NLP) and Computer Vision. Image captioning solutions are used in many application areas: they are adapted for content-based image retrieval or automated labeling of online images. In the human-machine interaction field, they are used to assist visually impaired people in understanding the surrounding world or to quickly search for photos on the internet.
In this paper, we focus on the baseline captioning model [24] consisting of an encoder and a decoder. The encoder extracts image and text features in parallel. The text feature encoder is responsible for the dense representation of each word in an embedding space, providing semantic context for each token. The image encoder uses a convolutional neural network (CNN) backbone that extracts high-level image features. The decoder combines the image and text features and generates the resulting image caption. It is based on the long short-term memory (LSTM) module [18], which generates the descriptive sentence word by word.
In this work, we improve the effectiveness of the baseline image captioning model by changing the encoding of the input data. We assume that different image feature extractors, even when pretrained on the same training set, provide various high-level knowledge of the image content and, similarly, that different language processing models extract different semantics from captions.
During the experiments, we investigated how different encodings of the image and text influence the captioning accuracy. We tested several backbone models based on pretrained CNN networks and several embedding schemes as image and language inputs, respectively. This allowed us to investigate which pairs work best, hence finding the optimal combination of the neural image feature extractor and the language processing model.
For training and testing we used the MSCOCO 2014 dataset and, as the evaluation metrics, BLEU, METEOR, CIDEr, SPICE, ROUGE-L, and WMD. Thanks to these metrics, we assessed which pairs of image features and embeddings produce better results with the baseline image captioning model. This paper is organized as follows. Section II describes how image captioning methods evolved from template-based techniques to deep neural architectures. Next, in Section III, we describe how our base image captioning model is built and which neural image feature extractors and language embedding models we use. The experimental procedure applied in our research is presented in Section IV. Section V contains the experimental analysis and, finally, the conclusions are given in Section VI.

II. PREVIOUS WORKS
Image captioning methods combining text and visual data belong to the multi-modal machine-learning approaches [22], [40], [59]. Captioning models can be divided into traditional and deep-learning-based ones. Originally, traditional image captioning methods were based on hard-coded rules and hand-crafted features. In [27], [29], [36], the authors applied fixed templates with blank slots filled with various objects, descriptive tokens and situations extracted from images by object detection systems. On the other hand, in [12], the authors used already existing, predefined sentences: they created a space of meaning from image features and compared images with sentences to find the most appropriate sentences for a photo. Despite semantic and grammatical correctness, captions from traditional methods often differ from the way a human would describe the image content.
Deep learning image captioning methods try to overcome those limitations. In the pioneering work [25], the authors suggested that neural networks can interpret the deep semantics of images and word embeddings. They proved that combined image features extracted by convolutional neural networks (CNNs) and word embeddings can hold semantic meaning. In [11], the authors suggested passing image features and text features sequentially and individually to the language model. Inspired by the success in machine translation, [51] proposed using an encoder-decoder framework in image captioning, which has since become dominant in the image captioning field.
Paper [24] by Karpathy et al. introduced an architecture similar to human perception. The method generates novel descriptions over image regions, using R-CNN (Region-based Convolutional Neural Networks) [13] for image feature extraction and a recurrent neural network (RNN) to iteratively generate the consecutive words of the caption. Using the multimodal embedding space, the model tries to find the parts of the sentence that best fit the image regions. Differently from other proposed methods ([9], [25], [51]), where a global image vector was used, Karpathy focused on image regions, with a separate caption describing each region. Finally, a spatial map generates the target word for the image regions. These image captioning approaches, focused on generating captions for each region of an image, are called dense image captioning [23], [49], [54].
During the rapid development of image captioning methods, researchers also investigated aspects of captions other than just comparability to human judgment, in particular captions with a specific style. In [2], the authors improved the descriptiveness of generated captions by combining a CNN and an LSTM. In [52], the authors focused on captions for visually impaired people; the developed model tends to create captions that describe the surrounding environment.

A. Image captioning model
The image captioning encoder-decoder model investigated in this study is depicted in Fig. 1. The encoder consists of two parts working simultaneously in the learning phase: one handles image features and the other handles words in sequences. Firstly, image features are extracted using one of the image feature extractors described in the next section. They are processed by a dense (fully connected) layer with ReLU (rectified linear unit) activation functions [37]. Its usage was motivated by promising results in very deep vision neural networks [17]; compared with non-linear functions like the sigmoid, ReLU is faster and less prone to overfitting. The dense layer is responsible for reducing the dimension of the image feature space (i.e., the length of the feature vector) to 256, to match the size of the word sequence prediction output.
In parallel, the text input (the caption) is transformed into a sequence of indices of consecutive sentence words. Although the length of a caption varies, the length of the vector of indices is constant and equal to 51, which is the maximum sentence length (i.e., the number of words in the longest caption sentence). Such a vector is fed to the embedding layer, which encodes the semantic meaning of words represented by vectors in the embedding space. We used pretrained Glove and FastText embeddings as two alternative ways of encoding the consecutive words of a descriptive sentence. Thanks to the embedding layer, we reduced the text feature size from the vocabulary size to the size of the embedding vectors. The embedding vectors are passed through a long short-term memory (LSTM) model of size 256. After the LSTM layer, the outputs of the language model and of the image part of the image captioning model are added and finally forwarded to the decoder, which consists of two dense layers.
Long short-term memory (LSTM) was designed for long-sequence problems and can predict the next word in a sequence based on its predecessors. Each LSTM unit consists of three gates that control and monitor the information flow in the LSTM cell. The forget gate decides which information from the previous iteration will be stored in the cell state and which is irrelevant and can be forgotten. In the input gate, the cell attempts to learn new information: it quantifies the relevance of the new input value and decides whether to process it or not. The output gate transfers the updated information from the current iteration to the next one. The state of the cell also carries the information along with a timestamp.
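The gate computations described above can be written compactly in standard LSTM notation, where $\sigma$ is the sigmoid function, $\odot$ is element-wise multiplication, $x_t$ is the input at step $t$ and $h_{t-1}$ is the previous hidden state:

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) &&\text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) &&\text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) &&\text{(output gate)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) &&\text{(cell state)}\\
h_t &= o_t \odot \tanh(c_t) &&\text{(hidden state)}
\end{aligned}
```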
The decoder processes an image feature vector and a sequence vector to predict captions. Two dense layers process the added language and image representations to reduce the number of features to a vector of size equal to the vocabulary size. Finally, the softmax layer generates the probability distribution of the next word in the sequence, and the word with the maximum probability is selected. During training, the previous words are converted to embeddings to develop the next word, and the image feature vector is fed to the decoder. The goal of the training is to minimize a loss function based on the error between the target and predicted words.
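As an illustration, the feature fusion and decoding step can be sketched in NumPy. All weights here are random placeholders standing in for trained ones, and the LSTM itself is omitted (its final 256-element state is taken as given); the sizes follow the paper (4096-element VGG-style image features, vocabulary of 7293):

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, HID, IMG_FEAT = 7293, 256, 4096  # sizes used in the paper

# Hypothetical, randomly initialised weights -- stand-ins for trained ones.
W_img = rng.standard_normal((IMG_FEAT, HID)) * 0.01   # dense layer on image features
W_out1 = rng.standard_normal((HID, HID)) * 0.01       # decoder dense layer 1
W_out2 = rng.standard_normal((HID, VOCAB)) * 0.01     # decoder dense layer 2 -> vocab

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict_next_word(img_features, lstm_state):
    """One decoding step: fuse image and language features, score the vocabulary."""
    img_vec = relu(img_features @ W_img)   # 4096 -> 256, dense + ReLU
    fused = img_vec + lstm_state           # element-wise addition, as in the model
    hidden = relu(fused @ W_out1)          # decoder dense layer
    return softmax(hidden @ W_out2)        # probability distribution over the vocabulary

probs = predict_next_word(rng.standard_normal(IMG_FEAT),
                          rng.standard_normal(HID))
assert probs.shape == (VOCAB,)
```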
The trained model predicts captions word by word, where the prediction of the next word is based on the previously generated ones and the image features. At each iteration, the greedy search algorithm looks for the word in the dictionary with the highest probability of following the words already in the sequence. The process continues until the end of the caption is detected or the maximum length of the caption is reached. Greedy search takes only the tokens with the highest probability of occurring in the final sequence, based on the previously generated tokens.
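The greedy decoding loop can be sketched as follows; `next_word_probs` is a hypothetical stand-in for the trained model's single-step prediction (image features plus words so far mapped to a distribution over the vocabulary):

```python
# Greedy decoding sketch. `next_word_probs` returns a dict word -> probability.
def greedy_decode(next_word_probs, start="<start>", stop="<stop>", max_len=51):
    words = [start]
    while len(words) < max_len:
        probs = next_word_probs(words)
        best = max(probs, key=probs.get)   # greedy: take the most probable token
        if best == stop:
            break
        words.append(best)
    return words[1:]                       # drop the start token

# Toy "model": deterministically continues "a dog runs" and then stops.
script = {"<start>": "a", "a": "dog", "dog": "runs", "runs": "<stop>"}
toy = lambda ws: {script[ws[-1]]: 0.9, "noise": 0.1}
print(greedy_decode(toy))   # ['a', 'dog', 'runs']
```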

PROCEEDINGS OF THE FEDCSIS, WARSAW, POLAND, 2023

B. Image feature extractors

The VGG [44] is a group of convolutional neural networks (CNNs) widely used for image classification tasks. The most popular variants are VGG16 and VGG19. VGG16 consists of 13 convolutional and 3 dense layers and was trained to recognize 1000 object classes depicted on input 224x224x3 color images. By cutting off the dense layers, a backbone network that produces an image feature vector of length 4096 is obtained. VGG19 has 3 more convolutional layers than VGG16. Thanks to this, it can learn richer representations of the data and achieves higher prediction results. On the other hand, VGG19 is more exposed to the vanishing gradient problem than VGG16 and requires more computational power.
The Resnet [16] network was created to support many layers while preventing the phenomenon of the vanishing gradient in deep neural networks. The most popular variants are Resnet18, Resnet50 and Resnet101, where the number represents the number of layers. The network is built in two stages. In the beginning, a stack of skip connections is built: the skipped layers are omitted and the activation from the previous layer is used. In the next stage, the network is trained again, the layers are expanded and other parts of the network (the residual blocks) learn deeper features of the image. Residual blocks are the heart of residual convolutional networks. They add skip connections to the network, which preserve essential elements of the picture until the end of the training while allowing a smooth gradient flow.
The Inception [47] model was created to deal with overfitting in very deep neural networks by going wider in layers rather than deeper. It is built from inception blocks that process the input and repetitively pass the result to the next inception block. Each block consists of four parallel paths: 1x1, 3x3 and 5x5 convolutions and max-pooling. The 1x1 convolution reduces the dimension by channel-wise pooling; thanks to that, the network can grow in depth without overfitting. This convolution is computed between each pixel and the filter in the channel dimension, changing the number of channels rather than the image size. The 3x3 and 5x5 filters learn spatial features of the image at different scales and act similarly to human perception. The final max-pooling reduces the dimensions of the feature map. The most popular versions of the Inception network are Inception, InceptionV2 and InceptionV3.
InceptionV3 [48] incorporated the best techniques to optimize and reduce the computational power needed for image feature extraction in the network. It is a deeper network than InceptionV2 and Inception, but its effectiveness was not compromised. It also uses auxiliary classifiers, which improve the convergence of very deep neural networks and combat the vanishing gradient problem. Factorized convolutions were used to reduce the number of parameters needed in the network, and smaller asymmetric convolutions allowed faster computations.
Xception [6] is a variation of the Inception [47] model that decouples cross-channel correlations and spatial correlations. The architecture is based on depthwise separable convolution layers and shortcuts between convolution blocks, as in Resnet. It consists of 36 convolutional layers divided into 14 modules. Each module is surrounded by residual connections, except the first and last ones. It has a simple and modular architecture and achieved better results than VGG16, Resnet and InceptionV3 in classical classification challenges.
The backbone networks based on the three models above, in contrast to VGG16, produce an image feature vector of length 2048.
The DenseNet [21] network was created to overcome the vanishing gradient problem in very deep neural networks by simplifying the data flow between layers. The architecture is similar to Resnet, but thanks to a simple change in the connections between layers, DenseNet allows parameters to be reused within the network and produces models with high accuracy. The structure of DenseNet is based on a stack of connectivity, transition and bottleneck layers grouped in dense blocks, where every layer is densely connected with every other layer. The dense block is the main part of DenseNet and reduces the size of the feature maps by lowering their dimensions. In each dense block, the dimensions of the feature maps are constant, but the number of filters changes. Between the dense blocks, a transition layer is placed to concatenate all previous inputs, hence reducing the number of channels and the number of parameters needed in the network. Also, bottleneck layers are placed between layers to reduce the number of inputs, especially for distant layers. DenseNet also introduced the growth rate parameter to regulate the quantity of information added in each layer. The most popular implementations are DenseNet121 and DenseNet201, where the number denotes the number of layers in the network.
MobileNet [20] is a small and efficient CNN network especially designed for mobile computer vision tasks. It is built of layers of depthwise separable convolutions, composed of depth-wise and point-wise layers. MobileNet also introduced the width multiplier and resolution multiplier hyperparameters: the width multiplier decreases the computational power needed during training, while the resolution multiplier decreases the resolution of the input image during training. The most popular versions of MobileNet are MobileNetV1 and MobileNetV2. In comparison with MobileNet, MobileNetV2 introduced inverted residual blocks and linear bottlenecks. Also, the ReLU activation function was replaced by ReLU6 (ReLU with saturation at value 6). Thanks to that, the accuracy of the model significantly improved.

C. Word embedding models
Word embeddings are vector representations of tokens that are fed to a deep learning model. The most common embedding systems used for natural language processing and image captioning are Glove, Word2Vec and FastText.
One of the first word embedding techniques was one-hot encoding, where each token is encoded as a binary vector. The method is based on a dictionary created from all unique tokens in the corpus. Each word is represented by a fixed-length binary vector with the size of the dictionary, where the index of the word marks its presence: exactly one value is 1 and all others are 0. It is a straightforward technique that captures a wide variety of words but misses the semantic relations between words. Furthermore, the fixed-length vectors are sparse, which is not computationally efficient.
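A minimal sketch of one-hot encoding over a toy corpus:

```python
def one_hot(corpus_tokens):
    """Build a dictionary of unique tokens and encode each one as a binary vector."""
    vocab = sorted(set(corpus_tokens))
    index = {w: i for i, w in enumerate(vocab)}
    return {w: [1 if i == index[w] else 0 for i in range(len(vocab))] for w in vocab}

vectors = one_hot(["a", "dog", "runs", "a"])
print(vectors["dog"])   # [0, 1, 0] -- exactly one position is 1
assert all(sum(v) == 1 for v in vectors.values())
```

Note that the vector length equals the dictionary size, which is what makes the representation sparse and costly for large vocabularies.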
The Word2Vec [35] method is computationally efficient and at the same time captures semantic relations between words. It is based on two techniques: CBOW (Continuous Bag of Words), which predicts a word from a vector of context words, and the Continuous Skip-Gram model, a simple one-layer neural network that predicts the context for a given word.
FastText [4] derives from the Word2Vec model but analyzes words as n-grams. The algorithm is similar to CBOW from Word2Vec but focuses on a hierarchical structure, representing a word in a dense form. Each n-gram is a vector and the whole word is the sum of those vectors. The training used to obtain a word embedding vector is similar to CBOW.
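For example, with n = 3 and the word-boundary markers `<` and `>` used by FastText, the word "where" decomposes into character n-grams as follows:

```python
def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, as used by FastText subword vectors."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))   # ['<wh', 'whe', 'her', 'ere', 're>']
```

The word vector is then the sum of the vectors of these n-grams, which lets FastText produce embeddings even for words unseen during training.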
Glove [39] word embeddings are based on unsupervised learning that captures words which frequently occur together. Thanks to global and local statistics, it creates semantic relations across the whole corpus. Furthermore, it uses global matrix factorization of the word co-occurrence matrix. It is also called a "count-based model", because Glove tries to learn how words co-occur with other words in the corpus, allowing it to reflect the meaning of words conditioned on the other words.

D. Text evaluation metrics
Image captioning is a task that belongs to both the computer vision and natural language processing (NLP) domains. It must capture objects, the relations between them and the whole scene context to produce readable sentences in natural language. Due to the complexity of the image captioning results, the evaluation of image captioning remains a complicated and comprehensive problem.
Evaluation metrics in image captioning measure the correlation of generated captions with human judgment. They estimate grammatical correctness, the complexity of the description and how well the generated caption generalizes the corresponding image. Each evaluation metric applies its own computation technique and has distinct advantages. Standard evaluation metrics for image captioning are BLEU-1 to BLEU-4, METEOR, ROUGE-L, CIDEr, SPICE, and WMD [43]. They calculate the word overlap between candidate and reference sentences and scale it to the range 1-100, where higher values indicate better results.
The BLEU (Bilingual Evaluation Understudy) [38] metric measures the correlation between predicted and human-made captions. It compares n-grams in the predicted and reference sentences, where more common n-grams result in higher metric values. It is worth mentioning that the metric exclusively counts n-grams; the locations of the n-grams in the sentences are not considered. The metric also allows additional weights for specific n-grams, to prioritize longer common sequences of words. Usually, 1- to 4-grams are used when computing the metric; the respective variants are called BLEU-1 up to BLEU-4.
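The core of BLEU, the modified (clipped) n-gram precision, can be sketched as follows; the full metric additionally combines the precisions for n = 1..4 geometrically and applies a brevity penalty:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Modified n-gram precision: candidate n-gram counts are clipped
    by the counts observed in the reference."""
    grams = lambda s: Counter(tuple(s[i:i + n]) for i in range(len(s) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

cand = "a dog runs on the grass".split()
ref = "the dog runs across the grass".split()
print(ngram_precision(cand, ref, 1))   # 4 of 6 unigrams match
```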
METEOR (Metric for Evaluation of Translation with Explicit Ordering) [3] measures the correlation between the predicted caption and human judgment. Compared with BLEU, parts of the sentence are analyzed, not the whole corpus. The METEOR algorithm has two stages: first, tokens from the reference and candidate captions are compared; in the second stage, the final result is calculated. METEOR also analyzes and allows synonyms.
CIDEr (Consensus-based Image Description Evaluation) [50] calculates the correspondence between candidate and reference captions. It is based on TF-IDF weighting, calculated for each n-gram. It is widely used for SCST [41] training, where the strategy is to optimize the model for a specific metric, which results in higher scores during the testing phase [41]. Furthermore, CIDEr optimization during training also leads to high scores in the BLEU, METEOR and SPICE metrics.
ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation) [30] reports a set of measures: recall, precision and F1. The algorithm finds the longest common subsequence of tokens between the predicted and reference captions. The tokens of the subsequence must appear in the same order but do not have to be adjacent.
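A sketch of the longest-common-subsequence computation underlying ROUGE-L, together with the resulting F-measure:

```python
def lcs_length(a, b):
    """Longest common subsequence: same order required, gaps allowed."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

cand = "a dog runs on grass".split()
ref = "a big dog runs across the grass".split()
lcs = lcs_length(cand, ref)                 # "a dog runs grass" -> 4
recall, precision = lcs / len(ref), lcs / len(cand)
f1 = 2 * precision * recall / (precision + recall)
```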
WMD (Word Mover's Distance) [26] relies on a machine learning model to compute the similarity between texts. The metric is distinguished from the others because it measures the shared meaning between texts: it does not investigate the occurrence of tokens, but instead measures the semantic distance between sentences by considering the probability of the occurrence of synonyms.
All the above metrics are used in various NLP tasks. However, according to some investigations [7], they do not correlate with human judgment, which makes them inadequate for measuring the similarity of image captions. Among the known metrics, the one that correlates with human judgment is SPICE (Semantic Propositional Image Caption Evaluation) [1]. This metric measures the similarity between sentences represented by directed graphs. The SPICE algorithm first creates two directed graphs: one for all reference captions and one for the candidate sentence. Graph elements can belong to three groups: the first group contains objects and activity performers, the second group consists of descriptive tokens (adjectives, adverbs), and the last group represents relations between objects and links the other groups of tokens on the graph. Based on this representation, the sentences are compared.

A. Datasets
There are several datasets used for image captioning. They differ in the number of images and their size; the captions can also vary in format and length. The most commonly used are Flickr8k [19], Flickr30k [58] and MSCOCO 2014 [5], [31]. All these sets consist of a number of images with associated captions, usually 5 per image.
The Flickr30k dataset includes 30k images, each with five captions. The training set consists of 29k images and 1k images are used for testing. Flickr8k is a subset of Flickr30k and contains 8k images, with five annotations for each picture. Each caption fully describes a scene and is entirely based on human judgment. The training split contains 7k images and the rest of the data is used for testing.
During our experiments, we used the MSCOCO dataset for evaluation and training. It consists of more than 120k images of various everyday scenes. Five captions in natural language describe each photo. In the image captioning area, the most popular MSCOCO data partitioning for training, validation and testing purposes is the Karpathy split [24], with 113k images in the training, 5k in the validation and 5k in the test disjoint subsets.

B. Image preprocessing
Motivated by [24], and considering the variety of available pretrained object detection CNN models and language processing models, we conducted experiments to check how the input data affect the learning process. The whole experimental process involves encoding image and text features simultaneously and generating the final sequence of tokens (the caption) word by word during decoding.
Images from the dataset are resized and normalized before entering the image captioning model, to be compatible with the chosen CNN network. For VGG16, VGG19, Resnet152V2, Resnet50, DenseNet121, DenseNet201, MobileNet and MobileNetV2 the input shape is 224x224x3, and it is 299x299x3 for InceptionV3 and Xception. As a result, we obtained feature vectors of the following sizes, corresponding to the preprocessed input image: 4096 elements for VGG16 and VGG19; 2048 for InceptionV3, Xception and Resnet152V2; 1024 for DenseNet121; 1920 for DenseNet201; 1000 for MobileNet; 1280 for MobileNetV2. We used CNN models pretrained on the ImageNet [42] dataset, with the network's fully connected layers removed, since we do not need the probability distribution over the 1000 image categories from ImageNet.
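A simplified sketch of this preprocessing step; real pipelines use the library's resizing and backbone-specific normalization, while here nearest-neighbour resizing and plain [0, 1] scaling stand in:

```python
import numpy as np

# Input sizes used in the paper for a few of the backbones.
INPUT_SIZE = {"VGG16": 224, "MobileNet": 224, "InceptionV3": 299, "Xception": 299}

def preprocess(image, size):
    """Resize (nearest neighbour, a simplification) and scale pixels to [0, 1]."""
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = image[rows][:, cols]
    return resized.astype(np.float32) / 255.0

img = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)  # a dummy photo
x = preprocess(img, INPUT_SIZE["Xception"])
assert x.shape == (299, 299, 3)
```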

C. Text preprocessing
A separate preprocessing was performed for the captions. At the beginning, all words were converted to lowercase and tokenized. We removed punctuation and hanging single-letter words, and discarded rare words that occurred fewer than five times. As a result, we obtained a vocabulary (also called a dictionary) for each dataset: Flickr8k, Flickr30k, and MSCOCO 2014; these are used to create embedding matrices from the embedding vectors. Before being handled by the LSTM network, word sequences must be represented as word embedding vectors. In our model, Glove and FastText have been used as the embeddings.
Preprocessed captions consumed by the captioning model are appended with start and stop tokens to mark the beginning and the end of the sentence, respectively. In the next step, a vocabulary of all words occurring in the captions of the training set is prepared (along with the start and stop tokens). As a result, a dictionary of all words in our corpus is produced, so that tokens can be explicitly identified by index. Each generated word is processed by the embedding before being provided to the LSTM model input.
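The caption preprocessing steps above can be sketched as follows (a simplification of the actual pipeline):

```python
import string
from collections import Counter

def preprocess_captions(captions, min_count=5):
    """Lowercase, strip punctuation, drop single-letter and rare words,
    add start/stop tokens, and build the word -> index dictionary."""
    cleaned = []
    for cap in captions:
        words = cap.lower().translate(str.maketrans("", "", string.punctuation)).split()
        cleaned.append([w for w in words if len(w) > 1])
    counts = Counter(w for ws in cleaned for w in ws)
    vocab = {w for w, c in counts.items() if c >= min_count}
    tokenized = [["<start>"] + [w for w in ws if w in vocab] + ["<stop>"] for ws in cleaned]
    word_index = {w: i for i, w in enumerate(sorted(vocab | {"<start>", "<stop>"}))}
    return tokenized, word_index

caps = ["A dog runs."] * 5 + ["A cat!"]          # "cat" occurs once -> discarded
tokenized, word_index = preprocess_captions(caps)
assert "cat" not in word_index
assert tokenized[0] == ["<start>", "dog", "runs", "<stop>"]
```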
We adopted pretrained versions of FastText and Glove to extract the text features. We preprocessed the sentences from the training and test datasets (described in the previous section) and finally obtained a vocabulary of size 7293. Each word is then embedded into a 200-element vector for the Glove and a 300-element vector for the FastText word embedding space.
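Building the embedding matrix from pretrained vectors can be sketched as follows, with tiny hypothetical 4-dimensional vectors standing in for the real 200-d Glove / 300-d FastText ones:

```python
import numpy as np

def build_embedding_matrix(word_index, pretrained, dim):
    """One row per vocabulary word; words missing from the pretrained
    embeddings keep a zero vector (a common fallback)."""
    matrix = np.zeros((len(word_index), dim), dtype=np.float32)
    for word, i in word_index.items():
        if word in pretrained:
            matrix[i] = pretrained[word]
    return matrix

# Tiny hypothetical pretrained vectors for illustration only.
pretrained = {"dog": np.ones(4), "runs": np.full(4, 2.0)}
word_index = {"<start>": 0, "dog": 1, "runs": 2, "<stop>": 3}
E = build_embedding_matrix(word_index, pretrained, 4)
assert E.shape == (4, 4)
```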

D. Training and testing
During training, the model processes the combined 256-element vector of word embeddings and the image feature vector produced by the CNN model for a given image. At each time step, the model predicts a word for the processed image and compares it with the ground-truth word from the training set that corresponds to the processed image; the predicted and ground-truth words are compared using the cross-entropy measure (see Fig. 1).
During testing, the image captioning model is fed a preprocessed photo. In the beginning, at time step 0, there is no previously predicted word; therefore, to denote the start of the prediction, a start-of-sentence token is used. Words are served as the embeddings corresponding to the dictionary. Next, the image captioning model predicts words recursively, adding each to the word list, until the end of the sentence (marked by the stop token) or the maximum length of the sentence has been reached. At each step, the chance of the occurrence of one word next to another is calculated using the embeddings specific to the tested text features. Finally, a full caption for the tested image is generated and compared with the ground-truth phrases for the tested image, using the selected metrics.

E. Evaluation
We investigated the performance of each image encoder paired with each of the text encoders mentioned previously, using the BLEU-1 to BLEU-4, METEOR, ROUGE-L, WMD, CIDEr, and SPICE metrics. The complete process is repeated for all CNN architectures and embedding methods to achieve a comprehensive perspective on their performance. The backbone-embedding pairs tested during the experiments are presented in Table I. The complete evaluation process is presented in Fig. 2.
For further analysis, we also examined word and bigram occurrences in the training set and in the predicted captions, to determine why some captions are generated incorrectly and which collocations from the training set appear in the parts of a sentence that do not describe the real image content.

V. RESULTS
Table I shows the results of the image captioning metrics calculated for the different image and text feature extractors. We analyzed all models according to the BLEU-1 to BLEU-4, METEOR, ROUGE-L, WMD, CIDEr, and SPICE metrics. Following the literature, to evaluate the performance we relied on the most recent CIDEr and SPICE metrics, keeping the remainder for comparative purposes. For the same purpose, we added four reference methods in the last four rows of the table.
From the obtained results, we can see that the model performance depends mostly on the CNN backbone used. The best results, considering the CIDEr metric, have been achieved for the Xception backbone feature extractor; second place belongs to DenseNet201. The spread between the highest (Xception with Glove, 78.13) and the lowest (VGG with Glove, 67.35) metric values equals 10.78 points, which makes the model strongly dependent on the image backbone feature extractor. The evaluated quality of the caption extractors is correlated with the accuracy of the backbones: practically for each metric, the order of models sorted by the metric value is similar to the order of backbones sorted by accuracy, both in the top-1 and top-5 variants. One cannot observe any remarkable superiority of one embedding model over another: for some metrics the Glove model performs better, while for the remainder FastText does. Still, in most cases FastText embeddings achieve higher results than Glove for the same image feature extractor, which suggests that FastText adapts more easily to different CNN models than Glove. Long feature vectors do not imply higher performance: the longest feature vectors, generated by the VGG backbones, do not lead to higher values of the measures, while the winning models use 2048-element (Xception) and 1920-element (DenseNet201) vectors. The average time of sequence generation is not correlated with the model complexity (the number of model parameters). Differences in execution time between models range from 874 to 1417 ms. The fastest is DenseNet201, which is also the second best model.
Example correct captions obtained by the Xception + Glove pair are given in Table II; the respective images are shown in Fig. 3. The table contains the 5 ground-truth captions from the dataset metadata, the captions obtained from the model and the values of the metrics. The generated captions sound good, are grammatically correct and consistent with the image content.
In contrast to the above, Table III presents inadequately predicted captions for four images, obtained using different methods.
During this experiment, we checked whether the wrong parts of the resulting captions occur more often in the training set data. For Fig. 4d, the wrong part of the caption is "with people standing": the bigram "with people" occurs 1328 times and "people standing" 2740 times in the training set, relatively often compared to other parts of the sentence. Also, for Fig. 4b, the bigrams that form "laying on a couch" occur very often in the MSCOCO 2014 training dataset. In the example in Fig. 4c, the bigrams "front of" and "woman holding" are likewise very common in the training dataset. To explore more deeply the possible reasons for incorrect captions, we investigated the vocabulary of single words and bigrams used for training. The total size of the vocabulary (the number of unique words) equals 26335 for 113350 images, each described using 5 alternative sentences, which gives 566747 captions. The corresponding numbers for the test set are the following: 5000 images, 25000 sentences and 7197 unique words, among which 503 words are used only in the captions of the test set (the remaining 6694 words are also present in the training set vocabulary). Considering the fact that each investigated model is trained on the training set, only words present in the training vocabulary can be used to predict any output sentence (correct or not). Hence, the objects, actions, situations, scene elements, etc. that were described using the 503 words absent from the training set can never be produced properly during testing.

VI. CONCLUSIONS
In this paper, we analyzed how image features and word encoding affect the results of the encoder-decoder image captioning model. Our experiments proved that the encoding of the input data plays the primary role in this area. During our research, we recognized that image captioning involves merging features from different modalities. Because of that, the encodings of the image and of the text must cooperate, so finding the optimal pair for a specific model architecture is crucial; following this principle, we can significantly improve the results of the model predictions. The influence of the image feature extractor (the CNN backbone) is crucial in this type of captioning model: it affects the performance more than the word embedding scheme. According to our experiments, Xception with Glove and DenseNet201 with FastText are the best combinations of the model's components.
The outcomes of our research are applicable to all research works that aim at developing an optimal encoder-decoder image captioning model.

Fig. 3: Images with properly predicted captions (see Table II for details).

Fig. 4: Images with improperly predicted captions (see Table III for details).

TABLE I: Evaluation results for the MSCOCO 2014 test dataset (5000 images). Metrics' values are averaged over the whole test dataset. Higher values imply better image captioning performance.

TABLE II: Overview of four images with properly predicted captions (Xception image feature extractor, Glove embeddings), along with the results of the evaluation metrics for them.

TABLE III: Overview of four images with incorrectly predicted captions, along with the results of the evaluation metrics for them. The table also contains bigrams of tokens from the predicted captions, along with the number of occurrences of those bigrams in the training set.