Canine age classification using Deep Learning as a step toward preventive medicine in animals

The main goal of this work was to implement a reliable machine learning algorithm that can classify a dog’s age given only a photograph of its face. The problem, which seems simple for humans, is very difficult for machine learning algorithms because facial features differ widely across the dog population. As convolutional neural networks (CNNs) performed poorly on this problem, the authors took another approach: creating a novel architecture that combines a CNN with a vision transformer (ViT), and examining the age of the dogs separately for every breed. The authors achieved better results than those reported in initial works covering the problem.


I. INTRODUCTION
DELAYING aging and preventing the diseases associated with it are becoming increasingly prevalent in state-of-the-art medicine. Biological and chronological age comparison is one of the primary assessment tools to estimate the health status of a given subject. The trend is also prevalent in veterinary medicine as animals, especially domesticated ones such as dogs, play an essential role in modern societies.
Machine learning and its growing effectiveness and versatility provide valuable tools that can be used in medicine and anti-aging research to create simple and easy-to-use methods for assessing biological age.
Age assessment can help approximate an animal's health status, allowing the animal owner to take preventive actions before a health condition develops. The authors hope this work's findings will also be useful for further research in human and animal preventive medicine.
The main goal of this work was to create a tool to assess a dog's age, given only its picture. The age groups of the dogs were classified as follows:
• Young dogs, aged from 0 to 2 years,
• Adult dogs, aged from 2 to 6 years,
• Senior dogs, aged 6 years and more.
The tool will be part of a larger project aiming to detect animals' health conditions and belongs to the realm of preventive medicine solutions. The system's overall architecture is presented in Fig. 1. This work is part of CyfroVet, a research project run by ACC Cyfronet to utilize technology, especially AI, in veterinary medicine. Anti-aging, including the presented system, is one of the topics covered by the project, alongside cancer cell detection and segmentation.
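The three age groups listed above share their boundary ages (2 and 6 years appear in two groups each). A minimal sketch of the mapping follows, under the assumption that the intervals are half-open, so a dog aged exactly 2 years counts as Adult and one aged exactly 6 years counts as Senior. The function name `age_group` is hypothetical and is not taken from the authors' code.

```python
def age_group(age_years: float) -> str:
    """Map a chronological age in years to one of the three classes.

    Assumption: boundaries are half-open ([0, 2), [2, 6), [6, inf)).
    The paper does not state which side the boundary ages fall on.
    """
    if age_years < 0:
        raise ValueError("age must be non-negative")
    if age_years < 2:
        return "Young"
    if age_years < 6:
        return "Adult"
    return "Senior"
```

If the opposite boundary convention applies, only the two comparison operators need to change.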
The summary of this paper's contributions is as follows:
• a new architecture for addressing the difficult task of dog age classification, which outperforms available state-of-the-art models,
• a new dataset which enables effective training of the classification models,
• conceptualization of using age assessment in animals for disease prevention and early diagnosis.
We make our code and the dataset publicly available.
There are multiple methods for assessing a canine's age. Most of them involve some kind of precise examination. The most popular are:
• Ocular lens examination [1]
• Teeth examination [2]
• Bone examination [3]
The authors noticed an opportunity to use computer vision in this field, as it can recognize image features and patterns that humans sometimes cannot notice with the naked eye. As mentioned before, a reliable age assessment based on appearance can be a straightforward tool for veterinary medicine practitioners and pet owners. A facial appearance that differs vastly from the typical one for a given age group can be a sign of accelerated biological aging, which may be an indicator of the overall health status of an individual. Given preliminary warning signs, practitioners can take preventive steps and make a detailed diagnosis to detect early signs of disease before more severe symptoms develop. This early warning can be beneficial in treating potentially terminal conditions in their early stage. Quick detection and treatment of medical conditions such as cancer can significantly increase an individual's chances of survival.

A. Datasets
In this experiment, the authors initially decided to use the DogAge dataset containing dogs of various age groups created by the Tech4Animals research group [5]. This dataset was used in the Dog Age Challenge competition organized by the group mentioned earlier, which required participants to create a neural network algorithm that could assess the dog's age. It took place in 2019.
The expert dataset consisted of photographs classified into age groups by animal scientists. The photographs were high-quality, and the dog was directly facing the camera.
The Petfinder dataset consisted of photographs downloaded from Petfinder.com, a website for pet adoption [4]. Experts did not verify the classification of each age group. The images were often of poor quality, with dogs at various angles and distances.
Several preprocessing techniques were used for those datasets, such as cropping and data augmentation. They will be described in detail in the next section. The Petfinder dataset was cleared of irrelevant photos (e.g., dog's face not clearly visible, other dogs appearing).
After the initial experiments, the authors sought to expand the datasets, as the performance was not satisfactory. The data was collected from Petfinder.com [4], as the website provided the division of the dogs by age and breed. The photographs were downloaded and cleaned, as the website contained many repeated adoption offers. Two approaches were taken: the first divided the problem into examining only one breed at a time, and the second expanded the mixed-breed datasets. The obtained photos were similar to those described in the previous case of the Petfinder dataset. The datasets are described in Tab. I. Expert dataset and Petfinder dataset refer to the datasets presented in the DogAge Challenge. Big dataset refers to a dataset created by downloading new photos. Two datasets consisting only of dogs of single breeds are also presented in Tab. I. The dataset name describes the breed. The given breeds were selected based on data availability, that is, the number of photos and class balance.

III. EXPERIMENTS
The initial datasets presented many challenges. Firstly, they were greatly imbalanced, with the adult category containing more images than the other two combined. Secondly, the data quality was also an obstacle: the Petfinder set contained many photos unsuitable for network training (e.g., dog's face not visible, humans present in the image). Those problems were also present within the new datasets. The experiments were run on the Prometheus cluster, using one node with 2 Nvidia Tesla V100 GPUs and 36 CPU cores.

A. Image cropping algorithm
A two-way approach was introduced to address the problem of low image quality. Firstly, the Petfinder dataset was manually cleared, removing the inadequate pictures (e.g., a dog looking back or to the side, two dogs in the picture, a dog with toys in its mouth). Secondly, two cropping algorithms were tested: YOLO (You Only Look Once) [6] and FaceDetector. After the evaluation, the latter was chosen. Both algorithms could correctly detect the dogs. However, FaceDetector focused only on the dog's face. Therefore, it served not only as a cropping mechanism but also as a tool to clear the new datasets of irrelevant photos, as manual clearing is time-consuming. Results of both cropping algorithms are visible in Fig. 2 for FaceDetector and Fig. 3 for YOLO.

B. Metrics
Categorical accuracy (CA), F1-score (F1), and recall (RA) metrics were used to examine the performance of the tested models. The DogAge Challenge announcement [10] uses a modified version of mean average error (mMAE), average accuracy (ACA), and average recall (aRA). In this paper, the authors decided not to use mMAE, since it was concluded that this metric would provide no meaningful information.
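To make the metrics concrete, the following is a minimal sketch of how CA, per-class recall, and per-class F1 can be computed from true and predicted labels. It uses the standard definitions. Whether the authors report macro-averaged or otherwise aggregated values is not stated, so the `macro_average` helper is an assumption, and all function names are hypothetical.

```python
from collections import Counter

CLASSES = ("Young", "Adult", "Senior")


def categorical_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_recall_f1(y_true, y_pred, classes=CLASSES):
    """Return {class: {"recall": ..., "f1": ...}} using standard definitions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fn[t] += 1  # the true class was missed
            fp[p] += 1  # the predicted class was wrongly assigned
    scores = {}
    for c in classes:
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = {"recall": recall, "f1": f1}
    return scores


def macro_average(scores, key):
    """Unweighted mean of a per-class metric (assumed aggregation)."""
    return sum(s[key] for s in scores.values()) / len(scores)


y_true = ["Young", "Adult", "Adult", "Senior"]
y_pred = ["Young", "Adult", "Senior", "Senior"]
ca = categorical_accuracy(y_true, y_pred)
scores = per_class_recall_f1(y_true, y_pred)
```

Reporting per-class values alongside the aggregate matters for imbalanced data, since a model that favors the Adult class can reach a high CA while having low recall on the other two classes.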

C. Augmentation, class weights and data balancing
To cope with class imbalance within the datasets, the authors tried to introduce data augmentation via random flip, rotation, brightness, and contrast changes. However, augmentation harmed model performance, lowering CA on validation sets on average by 5% when augmenting by flip and rotation and by 10% when augmenting via contrast and brightness changes. Those tests were run on a balanced Pitbull Terrier dataset, which excludes the possibility that augmentation was magnifying the effects of class imbalance. The authors suspect that augmentation via brightness and contrast shifts may negatively influence the results by skewing some features crucial for age assessment (for example, fur color). Data augmentation failed to solve the class imbalance problem, so the authors used class weights during training. Weights were chosen as 1, 5, and 2 for the Adult, Senior, and Young classes, respectively. Only the Expert dataset, which shows a different data distribution, was assigned weights 2, 1, and 2 for the same class order. Class balancing via removal of photos from the most numerous classes was also tested. The number of removed photos was chosen so that all classes had roughly the same number of photos as the least numerous one. In the case of balanced datasets, no class weights were applied in the final experiments, as this approach did not perform well for balanced datasets (once again reducing CA and F1).
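The two imbalance remedies described above can be sketched as follows. The class weights 1, 5, and 2 are those given in the text. Everything else is an assumption made for illustration: the function names, the normalization of the weighted loss by the number of samples, and the use of random removal for balancing. The authors' training framework may apply class weights differently.

```python
import math
import random

# Weights from the text for the non-Expert datasets.
CLASS_WEIGHTS = {"Adult": 1.0, "Senior": 5.0, "Young": 2.0}


def weighted_cross_entropy(probs, labels, weights=CLASS_WEIGHTS):
    """Mean of -w[y] * log(p[y]) over samples.

    probs: list of dicts mapping class name to predicted probability.
    labels: list of true class names.
    Assumption: the sum is divided by the number of samples.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total += -weights[y] * math.log(p[y])
    return total / len(labels)


def undersample(samples, labels, seed=0):
    """Randomly drop samples so every class matches the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = min(len(items) for items in by_class.values())
    balanced = []
    for y, items in by_class.items():
        balanced.extend((s, y) for s in rng.sample(items, target))
    rng.shuffle(balanced)
    return balanced


loss = weighted_cross_entropy(
    [{"Adult": 0.5, "Senior": 0.25, "Young": 0.25}], ["Senior"]
)
balanced = undersample(
    list(range(10)), ["Adult"] * 6 + ["Senior"] * 2 + ["Young"] * 2
)
```

With these weights, a misclassified Senior photo contributes five times as much to the loss as an Adult photo with the same predicted probability, which counteracts the dominance of the Adult class without discarding data. Undersampling achieves balance by discarding data instead, which is why the text applies one remedy or the other but not both.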

D. Initial network architecture and experiments
Initial trials were conducted using the transfer learning protocol. The backbone of the model consisted of a CNN pretrained on the ImageNet dataset, with a classifier consisting of fully connected layers. During the experiments, EfficientNetB7 [7] was chosen as the final backbone for the model, as different commonly used CNN architectures performed at a very similar level. The architecture of the initial model is based on the EfficientNetB7 backbone, followed by global average pooling and two fully connected layers.
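The classifier placed on top of the backbone can be illustrated with a small NumPy sketch of global average pooling followed by two fully connected layers. This is not the authors' implementation: the hidden width of 128, the ReLU activation, the softmax output, and the random stand-in for the backbone's feature map are all assumptions. The channel count of 2560 matches the final feature map of EfficientNetB7 in common implementations.

```python
import numpy as np


def classifier_head(feature_map, w1, b1, w2, b2):
    """Global average pooling, then two fully connected layers.

    feature_map: (H, W, C) array standing in for the backbone output.
    Returns class probabilities of shape (n_classes,).
    """
    pooled = feature_map.mean(axis=(0, 1))        # (C,)
    hidden = np.maximum(pooled @ w1 + b1, 0.0)    # first FC layer + ReLU (assumed)
    logits = hidden @ w2 + b2                     # second FC layer
    exp = np.exp(logits - logits.max())           # numerically stable softmax
    return exp / exp.sum()


rng = np.random.default_rng(0)
channels, hidden_units, n_classes = 2560, 128, 3  # hidden width is hypothetical
feature_map = rng.normal(size=(7, 7, channels))   # random stand-in for CNN features
w1 = rng.normal(0.0, 0.01, size=(channels, hidden_units))
b1 = np.zeros(hidden_units)
w2 = rng.normal(0.0, 0.01, size=(hidden_units, n_classes))
b2 = np.zeros(n_classes)
probs = classifier_head(feature_map, w1, b1, w2, b2)
```

Global average pooling discards the spatial layout of the feature map, so this head can only use which features are present and not where they occur.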
The initial model was tested on the available datasets. During the experiments, the random data splits were as follows:
• training split: 80% of the dataset
• validation split: 10% of the dataset
• test split: 10% of the dataset
The network was tested on both single and mixed-breed datasets. It was trained using AdamW [8] optimizer with a learning rate parameter equal to 0.0001 and the weight decay
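The random 80/10/10 split described in this subsection can be sketched as below. The function name and the fixed seed are hypothetical, and the text does not say whether the split was stratified by class, so this sketch uses a plain shuffle.

```python
import random


def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle items and cut them into train, validation, and test parts.

    The test part receives whatever remains after the first two cuts.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


train, val, test = split_dataset(range(100))
```

Fixing the seed keeps the three parts identical across runs, which makes results obtained with different models comparable.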

E. Transformer-based classifier as a method to improve the classification results
During the experiments, the CNN model obtained better results than those found in the literature [10]. However, they were still not satisfactory, so the authors decided to modify the architecture by adding a small version of ViT as a classifier after the backbone CNN block. ViT implements an attention-based model known from the field of NLP. Compared to the original implementation, the decoder block is omitted in the computer vision version. The model uses an attention mechanism to learn relations between different parts of the input image. Dosovitskiy et al. [9] showed that ViT can outperform current state-of-the-art CNN architectures in image classification. As it requires large datasets, the authors decided to extract features from the data using a pre-trained CNN. Architecture is presented in Tab
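The idea of applying attention to CNN features can be illustrated with a heavily simplified NumPy sketch. Each spatial position of the feature map becomes one token, a class token and position embeddings are added as in Dosovitskiy et al. [9], and one single-head self-attention step with a residual connection precedes the classification layer. This is a sketch under stated assumptions and not the authors' architecture: the token dimension, the single head, the absence of layer normalization and MLP blocks, and all names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over (N, d) tokens."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (N, N) pairwise relations
    return softmax(scores, axis=-1) @ v       # each token mixes all others


def init_params(n_tokens, channels, dim=64, n_classes=3, seed=0):
    """Random weights for illustration only (dim is a hypothetical choice)."""
    rng = np.random.default_rng(seed)

    def w(*shape):
        return rng.normal(0.0, 0.02, size=shape)

    return {
        "proj": w(channels, dim), "cls": w(1, dim), "pos": w(n_tokens + 1, dim),
        "wq": w(dim, dim), "wk": w(dim, dim), "wv": w(dim, dim),
        "head": w(dim, n_classes),
    }


def cnn_vit_forward(feature_map, params):
    """Turn an (H, W, C) CNN feature map into class probabilities."""
    h, w, c = feature_map.shape
    tokens = feature_map.reshape(h * w, c) @ params["proj"]      # (H*W, d)
    tokens = np.vstack([params["cls"], tokens]) + params["pos"]  # add class token
    tokens = tokens + self_attention(
        tokens, params["wq"], params["wk"], params["wv"]
    )                                                            # residual step
    return softmax(tokens[0] @ params["head"])                   # read class token


feature_map = np.random.default_rng(1).normal(size=(7, 7, 256))  # stand-in features
params = init_params(n_tokens=49, channels=256)
probs = cnn_vit_forward(feature_map, params)
```

Unlike global average pooling, the attention step lets the classifier weigh relations between spatial positions of the feature map before the class decision is made.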

F. Results
This section presents the results obtained during the experiments and compares them with previous work. Classification accuracy (CA) and recall (RA) were used to compare the obtained results with those reported previously for the DogAge Challenge [10]. For the DogAge Challenge, the authors showed that using SqueezeNet and Inception v3 with dense layer classifiers resulted in CA and RA of 32% for the SqueezeNet model and 34% on both metrics for Inception v3. Comparing these values with those in Tab. III, which lists the results obtained with the solutions proposed by the authors and trained on the available datasets, we can see a significant improvement. The networks reach both higher CA and higher RA. The CNN + ViT architecture improved CA and RA for every dataset. The single-breed approach was similarly effective to the mixed-breed approach. Examining Tab. IV, it is visible that CNN + ViT does not necessarily improve F1 for given classes but instead results in more even per-class F1. It can also be concluded that class balancing and class weighting can be helpful tools to cope with class imbalance for this problem, as the results on balanced datasets are comparable to those obtained on unbalanced datasets with class weights applied during network training.

IV. DISCUSSION
Several challenges affect the experiment and results presented in this work. The most important of them are as follows:
• limited access to high-quality data,
• lack of effective data specification methods, which makes both augmentation and preparation of the validation dataset challenging,
• it is tough to deal with multiple breeds using a single model,
• granularity of the age classification is tightly related to the ability to label images precisely.
Other research projects in this domain may also consider these remarks to reach a high performance of ML models.

V. CONCLUSIONS AND FUTURE WORK
The problem of dog age classification using neural networks turned out to be more demanding than initially expected. Experiments showed that state-of-the-art CNN models provided insufficient accuracy on datasets with multiple dog breeds. The transformer-based architecture introduced in the paper improved the results, but there is still considerable room for improvement in future research.

VI. DATA AND CODE AVAILABILITY
The code, datasets, and additional results are available at: github.com/SzymonMazurekAGH/Age_recognition_Cyfrovet