Emotion Analysis from Speech of Different Age Groups

Abstract: Recognition of speech emotion based on suitable features provides age information that helps society in different ways. Because the length and shape of the human vocal tract and vocal folds vary with the age of the speaker, the area remains a challenge. An emotion recognition system based on the speaker's age will help criminal investigators, psychologists, and law enforcement agencies in dealing with different segments of society. In particular, child psychologists and counselors can take timely preventive measures based on such a recognition system. The area is further complicated because a recognition system trained for adult users performs worse when it involves children. This has motivated the authors to move in this direction. A novel effort is made in this work to determine the age of the speaker based on emotional speech prosody and to cluster the utterances using the fuzzy c-means algorithm. The results are promising, and we are able to demarcate the emotional utterances based on age.


I. INTRODUCTION
Patterns based on gender and age can be obtained from the facial expressions, gestures, or verbal communication of an individual. Among these modalities, this paper emphasizes the recognition of emotions based on vocal communication. The objective is to determine the speaker's emotional conversation pattern based on his or her age. Determining this will benefit law enforcement agencies in studying criminal psychology and in further investigation. In particular, the speaker's state of mind and emotional attributes will help in assessing the condition of both the victim and the culprit during court hearings and prevent confusion. Identification of intimidating calls, false alarms, and kidnappings involving influential people, fanatic religious groups, radicals, etc. can be made possible with such systems [1]. Further, the recognition system will help in implementing corrective measures, before it is too late, when negative emotional attributes are manifested among children. Detection of emotion and age from a speaker's utterances can also help human-robot interfaces, telecommunications, intelligent tutoring, smart call center applications, etc.

This work was not supported by any organization
The vocal tract and vocal folds of the human speech production mechanism continue to grow until a child reaches adolescence. Selecting suitable features that represent the age of the speaker thus remains a persistent challenge. Recognition systems trained for adult speakers have often proved inefficient when applied to children's utterances [2]. This is because the core features representing the speech and emotional content of an utterance vary with the age and gender of the speaker. In particular, the fundamental frequency (F0), formants, speech rate, energy, etc. vary drastically between a child and an adult [3]. Acoustic models built for research and business requirements thus become ineffective when the emotional utterances belong to a different age group. The speaker's age and gender have been addressed in the literature over the last decades, although these studies placed little emphasis on the emotional content of speech [4] [5]. These authors used Gaussian weight supervectors with a support vector machine (SVM) classifier for age and gender identification. However, they made no precise study of different age groups or their emotional states. The use of mel-frequency cepstral coefficients (MFCC) with feature selection algorithms such as principal component analysis (PCA) and supervised PCA (SPCA) has been attempted for different age groups in [6]. The prominent prosodic features representing the speech emotion of children and adults could not be found in this literature. The absence of a clear boundary among emotions based on age has motivated the authors to undertake this novel effort.
The objective is to cluster the features representing emotional utterances of different age groups. Different clustering approaches, such as fuzzy c-means (FCM), hierarchical, partitioning, density-based, grid-based, model-based, and K-means clustering, have been applied to recognize human emotions [7] [8]. K-means is a simple, hard clustering algorithm that can solve known clustering problems using unsupervised learning. The algorithm is faster than hierarchical clustering and produces tighter clusters. It proved better when compared with the fuzzy c-means classifier for speech emotion recognition using GMM supervectors [7]. It has therefore been chosen for clustering the desired emotions based on age for our purpose.
Section II of this work describes the database used, followed by the feature extraction technique in Section III. The chosen clustering algorithm is explained in Section IV, and the results are shown in Section V. Section VI provides the conclusion.

II. EMOTIONAL SPEECH MATERIALS
Collecting a database of speech emotions involving different age groups is a tedious task. Most of the available databases are confined to a particular segment of speakers. An emotional database of speakers whose ages span a large range is either unavailable or inaccessible. Further, emotional utterances recorded under real-life scenarios are seldom found in such databases. Thus, the database desired for this work has been collected from different sources. The utterances were collected over three months by placing recording instruments at different locations. Around one thousand utterances of speakers aged eight to forty years were obtained in total. Out of these, the emotional portions of the speech were extracted for simulation purposes. Eighteen utterances of angry, boredom, and sad emotions, selected based on the opinion of linguistic experts, have been used for further processing. The utterances were recorded with 16-bit quantization at a 16 kHz sampling rate. The average duration of an utterance is 5 seconds. A good-quality mobile handset was used to record the utterances. Format Factory software was used to convert the mobile data into .wav format. The signal is pre-processed to minimize the effects of speaker variability and noise before further processing.

III. FEATURE EXTRACTION
Feature extraction and selection is an important aspect of a recognition system [8]-[11]. The use of acoustic features consisting of speech prosody characteristics has been discussed by researchers in the field of speech emotion recognition [12]-[14]. Among these features, F0, energy or amplitude, and speech rate are a few that vary with the age of the speaker and are hence considered in this work. A brief discussion of these features is given in the following subsections.

A. Fundamental frequency (F0)
Male speakers tend to have a lower F0 than both children and female subjects owing to their larger vocal folds. Older people were found to have a lower fundamental frequency than younger adults in an experiment conducted with twenty older and twenty younger adults [15]. The fundamental frequency continues to decrease with age for both genders [16] [17]. When speech is coloured with emotion, it is observed that higher-arousal emotions such as fear and anger have a higher F0 than neutral or lower-arousal emotions such as sadness or boredom [14] [18]. Among the different methods of F0 extraction, the autocorrelation and cepstral methods are the most popular [13] [19]. The autocorrelation method of F0 extraction has been used in this work. Using this technique, the feature can be extracted directly from the speech waveform. It requires little hardware, essentially a multiplier and an accumulator, compared with other methods. Further, the method is simple and noise immune. For a signal $x(n)$ delayed by $\tau$, the autocorrelation function (ACF) is estimated using the relation

$$R(\tau) = \sum_{n} x(n)\, x(n+\tau) \qquad (1)$$

The highest value of the ACF is obtained when the condition $\tau = 0$ is satisfied and is indicated as $R(0)$. The ACF decreases as the signal delay increases. Denoting the time period of the signal as $T$, the ACF attains its peaks at $\tau = IT$, where $I$ is an integer. From these peak locations, F0 can be estimated.
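The autocorrelation relation (1) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 60 to 500 Hz search range, and the synthetic 200 Hz test tone are assumptions made purely for illustration, and only the 16 kHz sampling rate is taken from Section II.

```python
import math

def autocorrelation(x, lag):
    """ACF at one lag: sum of x[n] * x[n + lag] over the overlapping samples."""
    return sum(x[n] * x[n + lag] for n in range(len(x) - lag))

def estimate_f0(x, fs, f0_min=60.0, f0_max=500.0):
    """Estimate F0 in Hz from the lag of the largest ACF value in the search range.

    f0_min and f0_max are assumed search limits, not values from the paper.
    """
    lag_min = int(fs / f0_max)                    # shortest pitch period considered
    lag_max = min(int(fs / f0_min), len(x) - 1)   # longest pitch period considered
    best_lag = max(range(lag_min, lag_max + 1),
                   key=lambda lag: autocorrelation(x, lag))
    return fs / best_lag

# Synthetic check: a 200 Hz sine sampled at 16 kHz, one 40 ms frame.
fs = 16000
frame = [math.sin(2 * math.pi * 200.0 * n / fs) for n in range(640)]
f0 = estimate_f0(frame, fs)
print(f"Estimated F0: {f0:.1f} Hz")
```

Excluding lag zero from the search matters because the ACF is always largest at zero delay; the pitch period corresponds to the first strong peak after it.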

B. Log energy
People speak at a higher intensity or energy when aroused by certain emotions such as anger, happiness, or surprise [14] [18]. These emotions have more energy content at higher frequencies. Energy or amplitude indicates the volume of the speech. The strength of the voice rises automatically, with a significant increase in amplitude, when people get excited or agitated. A dull voice related to sad or bored emotions is often of low amplitude or energy. The logarithm of energy remains an important feature of emotion that suits the logarithmic nature of the hearing mechanism. The log energy can be estimated for a signal $x(n)$ using the relation

$$E_{\log} = \log \sum_{n} \left[ x(n)\, w(n) \right]^{2} \qquad (2)$$

where $w(n)$ is the analyzing window.
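As a rough illustration of the log-energy feature, the sketch below computes the energy of one windowed frame in dB. The Hamming window, the small `eps` guard against taking the logarithm of zero, and the synthetic frames are assumptions; the paper does not state which analyzing window was used.

```python
import math

def hamming(length):
    """Hamming window, used here as one possible analyzing window w(n)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def log_energy_db(frame, window=None, eps=1e-12):
    """Log energy of a windowed frame in dB: 10 * log10(sum (x(n) * w(n))^2)."""
    if window is None:
        window = hamming(len(frame))
    energy = sum((x * w) ** 2 for x, w in zip(frame, window))
    return 10.0 * math.log10(energy + eps)  # eps avoids log(0) on silent frames

# Doubling the amplitude of a frame raises its log energy by about 6.02 dB.
fs = 16000
quiet = [0.1 * math.sin(2 * math.pi * 150.0 * n / fs) for n in range(400)]
loud = [2.0 * s for s in quiet]
gain_db = log_energy_db(loud) - log_energy_db(quiet)
print(f"Gain from doubling amplitude: {gain_db:.2f} dB")
```

The logarithmic scale compresses large amplitude differences, which is why a louder, higher-arousal utterance shows up as a modest additive offset rather than a multiplicative one.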

C. Speech rate
Speech rate is an important feature that provides information on the speaker's age, gender, language, and demographic and cultural profile [20] [21]. Its application domains include speech pathology, speech science, behavioural psychology, emotional analysis, neuropsychology, etc. Speaking rate signifies the time taken to communicate a message during conversation. It indicates the number of syllables, words, or spoken units uttered per minute or per second. It represents the quickness with which a speaker utters an emotional sentence in a given situation. It is a global feature taken over the whole length of the signal. People speak faster when excited than when calm. Thus, angry, fearful, or other emotions with high frequency content are likely to have a higher speech rate than neutral, sad, or low-excitement voices. For an utterance, the average speaking rate can be estimated using the relation

$$SR = \frac{N_v}{T_u} \qquad (3)$$

where $N_v$ and $T_u$ denote the number of vowel segments and the utterance duration, respectively.
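Relation (3) reduces to a single division once the vowel segments have been counted. The sketch below assumes the count is already available (vowel detection itself is outside its scope), and the numbers are illustrative, not measurements from this work.

```python
def speech_rate(num_vowel_segments, duration_s):
    """Average speaking rate: vowel segments per second over the whole utterance."""
    if duration_s <= 0:
        raise ValueError("utterance duration must be positive")
    return num_vowel_segments / duration_s

# Illustrative values only: 12 vowel segments in a 5 s utterance.
rate = speech_rate(12, 5.0)
print(f"Speech rate: {rate:.1f} vowel segments per second")
```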

IV. K-MEANS CLUSTERING
K-means is a hard clustering algorithm that is well suited to exclusive clustering tasks. The objective of this work is to distinguish the speaker's emotion based on features that vary with age, so the algorithm is a supportive approach in this case. Let there be $n$ features containing all the states of emotion. Using the algorithm, the features are partitioned into $c$ clusters, each having a cluster center $v_j$. Each cluster center is associated with the corresponding class. With the help of a squared error function, the objective function $b$ is minimized in forming the clusters. Optimal convergence of $b$ ensures adequate clustering of the desired emotion. The objective function is represented as

$$b = \sum_{j=1}^{c} \sum_{x_i \in C_j} \left\| x_i - v_j \right\|^{2} \qquad (4)$$

where $\| x_i - v_j \|$ is the norm representing the distance between the cluster center $v_j$ and the data point $x_i$, and $C_j$ denotes the $j$-th cluster. The K-means algorithm is performed using the following steps:
1. Select the initial centroids from among the feature points.
2. Assign every data point to its nearest centroid to form the clusters.
3. Update the centroids by re-estimating the cluster centers, and repeat steps 2 and 3 until no further variation in the cluster centers is observed.
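The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than the procedure used to produce the reported results: the random initialisation with a fixed seed, the iteration cap, and the one-dimensional feature values are all hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means; returns (centroids, labels, objective value b)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: pick initial centroids from the data
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                dim = len(members[0])
                new_centroids.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim)))
            else:
                new_centroids.append(centroids[j])  # keep the centroid of an empty cluster
        if new_centroids == centroids:  # no further variation: converged
            break
        centroids = new_centroids
    objective = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, objective

# Hypothetical one-dimensional feature values for two well-separated groups.
features = [(2.0,), (2.2,), (2.1,), (4.0,), (4.2,), (4.1,)]
centroids, labels, objective = kmeans(features, k=2)
print(labels, sorted(centroids), round(objective, 3))
```

Because each point receives exactly one label, the result is a hard partition, which is the property that distinguishes K-means from fuzzy c-means, where each point holds a degree of membership in every cluster.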

V. RESULTS AND DISCUSSION
The variation of F0 for children and adults is shown in Fig. 1. It is observed that children and females have a higher F0 than adults owing to the faster vibration associated with their shorter vocal tract. The plot has a zigzag shape because the utterances come from both genders. The value decreases with age due to the increase in vocal tract length, which results from the growth of the facial skeleton and the lowering of the larynx. Due to its higher excitation level, the angry state shows a higher pitch for both adults and children compared with boredom and sadness, as shown in the figure. A comparison of the gender-independent prosodic features of adult emotional states reported in the literature is given in Table I. The features extracted in this work are compared across age groups and tabulated in Table II. The observed results indicate higher mean, maximum, and minimum F0 values for children's speech than for adult utterances. This is found to be true across all classes of speech emotion chosen in this work. Pitch variation and the mean value have been tested for different gender-independent adult emotions [22]-[24]. According to these authors, the pitch mean is highest for the angry emotion, followed by the happy and bore states.

TABLE I. Comparative study of the state-of-the-art age- and gender-independent feature extraction techniques

Emotions (columns): Angry, Sad, Fear, Happy, Bore, Disgust
Features (rows):
Speech rate [21]
F0 mean [22] [23]
F0 variation [22] [23]
Energy [9] [18]
F1 [12] [21]-[24]
Duration [24]
Spectral centroid [24]
[The increase/decrease symbols in the table cells and legend are not recoverable from the source.]

Energy or intensity indicates the arousal level of an emotion. The calculated values indicate higher energy for higher-arousal emotional states such as angry, happy, and fear. Bore is found to have the lowest energy among all the states tabulated. The presence of higher frequency components raises the energy level of the angry state above that of the bore and sad emotions. Computations of spectral energy by different authors are worth noting in support of the findings in this work [21] [23]. The log-energy features of both children and adults are plotted for these emotions in Fig. 2. The feature extraction technique is chosen to approximate the human hearing system, which acts non-linearly in different bands of the signal. The energy is found to be higher for children than for their adult counterparts, as observed from the figure. The log-energy in dB is compared in Table II for adult and children's utterances. The result indicates a larger mean, maximum, and minimum energy for children's speech utterances. Children are inherently more excited by, and enthusiastic about, abnormal situations than mature and judgmental adults. This makes children overexpressive, with larger arousal states than adults.

HEMANTA KUMAR PALO ET AL.: EMOTION ANALYSIS FROM SPEECH OF DIFFERENT AGE GROUPS

Fig. 3 shows the variation in the speech rate of children against that of adult speakers up to 40 years of age. It is observed that a child takes more time during conversation than an adult. This may be attributed to the reading disorders and social anxiety that are commonly found in children. Neuromuscular and biological factors are other aspects that tend to decrease the speech rate of children. As a child reaches adulthood, he or she develops oral-motor skills and linguistic skills such as lexical, semantic, and phonological abilities. The increase in motor planning specificity in growing children increases the articulation rate. Owing to cognitive development with age, fluency in speech increases. These factors make the speech rate of adults higher than that of children. In contrast, limited exposure to the environment and to language makes children ponder over suitable words or vocabulary while expressing emotions. Rather than the learned words used by adults, they invent their own words, using associative skills and imagination in certain situations. This lengthens the reaction time and decreases the speaking rate. The emotional utterances were recorded against a natural background in which the conversations are task dependent. This may be the reason for the variation in speech rate. It is evidenced by the higher mean, maximum, and minimum speech rate for children's utterances shown in Table II. This is true across all the emotional classes. The speech rate is higher for the fear and angry states than for the sad and bore states, as shown in Fig. 3, similar to other observations made in the literature [21]. The reason may be attributed to the higher energy (or high frequency components) inherent in high-arousal emotions. A few authors who compared the duration feature to investigate the emotional cues of speakers report similar trends [24]. It can be concluded that, due to the lower speech rate, the utterance duration tends to be longest for the sad emotion, followed by the bore state. A close observation of the duration feature reveals that people take longer to express emotions having lower energy than to express aggressive states.
An attempt is made in this work to cluster different age groups using K-means clustering with the chosen feature sets. K-means is more suitable for exclusive clustering of data. The clustering of the angry speech emotion for three age groups (8-14 years, 19-24 years, and 30-40 years) using speech rate features is shown in Fig. 4. It is observed that the clusters of the 19-24 and 30-40 year groups lie closer together. These groups are similar and are thus described by features of similar magnitude. In contrast, the cluster of features representing these groups is widely separated from that of children in the age group of 8 to 14 years, owing to their quite distinct feature values. A similar comparison with K-means clustering is made using the F0 features of the sad emotion in Fig. 5. In this case, the older adult group (30-40 years) is compared with the youngest group (8-14 years). A widely separated cluster

Fig. 1 Variation of F0 with age for different speech emotional states

Fig. 2 Variation of log-energy features with age for different speech emotional states

Fig. 3 Variation of speech rate with age for different speech emotional states

Fig. 4 K-means clustering of angry speech emotion for different age groups using speech rate.

Fig. 5 K-means clustering of sad speech emotion for different age groups using F0 features