On Gower Similarity Coefficient and Missing Values

The Gower similarity coefficient is a popular measure for comparing objects with possibly mixed-type attributes and missing values. One of its characteristics is that it calculates the coefficient value without considering attributes with missing values. In this article, we explore the properties of the coefficient in detail, including the consequences of omitting attributes with missing values. We also introduce strict lower and upper bounds on the actual similarity value on an attribute and strict lower and upper bounds on the actual value of the Gower similarity coefficient, derive a number of their properties and propose a new coefficient as a solution to the identified problem with the Gower similarity coefficient.


INTRODUCTION
The Gower similarity coefficient is a popular measure for comparing objects with possibly mixed-type attributes (quantitative, qualitative and/or dichotomous) and missing values. One of its characteristics is that it calculates the coefficient value without considering attributes with missing values. The approach is easy and intuitive and finds many applications (see, e.g., [1], [2], [3], [5], [6], [8]). It is also regarded as an easily extensible template for calculating (dis)similarities of objects with mixed-type attributes [2], [5], [7]. However, as we show in the article, the Gower similarity coefficient has some deficiencies. In particular, we show that in the case of objects with missing values, the coefficient may take a similarity value that cannot be obtained by any replacement of the missing values with values from the domains of the attributes.

Our main contributions in the article are as follows:
• Introduction of strict lower and upper bounds on the actual similarity value on an attribute and strict lower and upper bounds on the actual value of the Gower similarity coefficient, which are obtainable after replacing missing values with respective attribute domain values.
• Showing that in the case of a pair of objects, one of which has a missing value for at least one quantitative attribute, the Gower similarity coefficient may take an incorrect value, which is less than the lower bound on the actual value of the Gower similarity coefficient.
• Derivation of a number of properties of the similarity value of objects on an attribute, the Gower similarity coefficient and the introduced bounds.
• Proposing a new similarity coefficient $G'$ as a correction of the Gower similarity coefficient, which eliminates the problem found for quantitative attributes with missing values.
The layout of the article is as follows. First, we recall the definitions of attribute value similarities, their weights and the Gower similarity coefficient, and introduce additional basic notions that are used throughout the article. Then, we show example objects for which the Gower similarity coefficient takes an incorrect value, caused by the occurrence of a missing value of a quantitative attribute for one of them. We also illustrate the consequences of the occurrence of missing values for qualitative and dichotomous attributes. Next, we introduce strict lower and upper bounds on the actual similarity value on an attribute and on the actual value of the Gower similarity coefficient, and derive a number of their properties. In addition, we propose the coefficient $G'$, a modification of the Gower similarity coefficient which, unlike the original, always returns similarity values that lie within the presented lower and upper bounds.

BASIC NOTIONS RELATED TO GOWER SIMILARITY COEFFICIENT
Gower proposed a measure of objects' similarity, which can be applied in the case of qualitative attributes, quantitative attributes, dichotomous attributes or their mixtures [4]. In the measure, only the attributes for which it is possible to determine their similarity are taken into account; the others are ignored. In particular, if, for a pair of objects, the value of an attribute is missing for at least one of these objects, then the two objects are treated as not comparable on this attribute and the Gower similarity coefficient is calculated without taking this attribute into account.
In the remainder of the article, we assume that objects are characterized by n attributes, where n ≥ 1, whose domains contain at least two different values. The missing value will be denoted by *. The value of attribute i of object u will be denoted by $u_i$.
The function $w_i(\cdot,\cdot)$ is used to indicate whether two objects are comparable on attribute i or not. Let u and v be the objects under consideration. If u and v are comparable on attribute i, then $w_i(u,v) = 1$; otherwise, $w_i(u,v) = 0$. As already mentioned, two objects u and v are incomparable on attribute i if the value of this attribute is missing for at least one of them, and so $w_i(u,v) = 0$. However, in the case of a dichotomous attribute (indicating whether or not a feature is present), the objects may also be incomparable even if their values are known (this happens when neither object has the feature represented by the dichotomous attribute).
The Gower similarity coefficient [4] for objects u and v is denoted by $G(u,v)$ and is defined as follows:
$$G(u,v) = \frac{\sum_{i=1}^{n} w_i(u,v) \times s_i(u,v)}{\sum_{i=1}^{n} w_i(u,v)},$$
where $s_i(u,v)$ is a coefficient determining the similarity of the two objects on attribute i, i = 1..n, taking values from $[0,1] \cup \{\text{undefined}\}$. It is assumed that whenever $w_i(u,v) = 0$, then $w_i(u,v) \times s_i(u,v) = 0$. Thus, the Gower similarity coefficient is the average similarity of two objects on the attributes on which they are comparable.
When the value of attribute i is missing for neither u nor v, $w_i(u,v)$ and $s_i(u,v)$ are determined as follows:
• If attribute i is qualitative: $w_i(u,v) = 1$; $s_i(u,v) = 1$ if $u_i = v_i$, and $s_i(u,v) = 0$ otherwise.
• If attribute i is quantitative: $w_i(u,v) = 1$; $s_i(u,v) = 1 - |u_i - v_i| / range_i$, where $range_i = max_i - min_i$, $max_i$ is the maximal value of attribute i and $min_i$ is the minimal value of attribute i.
• If attribute i is dichotomous (with values + and −): $w_i(u,v) = 0$ and $s_i(u,v) = 0$ if both $u_i$ and $v_i$ are equal to −; otherwise, $w_i(u,v) = 1$, with $s_i(u,v) = 1$ if both $u_i$ and $v_i$ are equal to +, and $s_i(u,v) = 0$ if $u_i \neq v_i$.
When the value of attribute i is missing for at least one of the objects u or v, $w_i(u,v)$ and $s_i(u,v)$ are determined in the same way for any type of attribute i: $w_i(u,v) = 0$ and $s_i(u,v)$ = undefined.
Now, we are ready to formally define comparable and incomparable objects on an attribute. Objects u and v are defined as incomparable on attribute i if:
• either the value of attribute i is missing for at least one of the two objects,
• or attribute i is dichotomous and the values of both objects are equal to −.
Otherwise, objects u and v are comparable on attribute i.
In the remainder of the article, we will use the following notation:
• CMP_ATT(u,v) denotes the set of attributes on which u and v are comparable; that is, CMP_ATT(u,v) = {attribute i | $w_i(u,v) = 1$}.
Objects u and v are defined as comparable if they are comparable on at least one attribute; that is, if $\sum_{i=1}^{n} w_i(u,v) > 0$ (or equivalently, if |CMP_ATT(u,v)| > 0). Otherwise, objects u and v are defined as incomparable; that is, when $\sum_{i=1}^{n} w_i(u,v) = 0$ (or equivalently, if |CMP_ATT(u,v)| = 0). Please note that the value of $G(u,v)$ is not defined for incomparable objects. Otherwise, if u and v are comparable, then $G(u,v) \in [0, 1]$.
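The definitions above can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the function names `attribute_similarity` and `gower`, the encoding of the missing value * as `None`, the encoding of dichotomous values as the strings "+" and "-", and the example objects are all assumptions made only for illustration.

```python
from typing import Optional, Sequence, Tuple

MISSING = None  # assumption: the missing value * is encoded as None


def attribute_similarity(kind: str, x, y,
                         value_range: Optional[float] = None) -> Tuple[int, Optional[float]]:
    """Return the pair (w_i, s_i) for one attribute; s_i is None when undefined."""
    if x is MISSING or y is MISSING:
        return 0, None  # incomparable because of a missing value
    if kind == "quantitative":
        return 1, 1.0 - abs(x - y) / value_range
    if kind == "qualitative":
        return 1, 1.0 if x == y else 0.0
    if kind == "dichotomous":
        if x == "-" and y == "-":
            return 0, 0.0  # feature absent in both objects: incomparable
        return 1, 1.0 if x == "+" and y == "+" else 0.0
    raise ValueError(f"unknown attribute type: {kind}")


def gower(u: Sequence, v: Sequence, kinds: Sequence[str],
          ranges: Sequence[Optional[float]]) -> Optional[float]:
    """Average s_i over the attributes with w_i = 1; None for incomparable objects."""
    numerator, denominator = 0.0, 0
    for x, y, kind, value_range in zip(u, v, kinds, ranges):
        w, s = attribute_similarity(kind, x, y, value_range)
        if w == 1:
            numerator += s
            denominator += 1
    return numerator / denominator if denominator > 0 else None


# Hypothetical objects: a qualitative attribute, a quantitative attribute
# with range 100 and a dichotomous attribute that is missing for v.
kinds = ["qualitative", "quantitative", "dichotomous"]
ranges = [None, 100.0, None]
u = ["red", 40, "-"]
v = ["blue", 50, MISSING]
print(gower(u, v, kinds, ranges))
```

In this sketch the third attribute is simply skipped when its value is missing, which is exactly the behaviour whose consequences the article examines.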

A. What's Wrong with Gower Similarity Coefficient?
Though the Gower similarity coefficient is appreciated for the ease and intuitiveness with which it deals with attributes on which objects are incomparable, we will show that it may take an unacceptable value if the values of attributes are missing (see Example 1).
Objects u and v are comparable and different on attribute 1 (so, $w_1(u,v) = 1$ and $s_1(u,v) = 0$) and are not comparable on attribute 2 (so, $w_2(u,v) = 0$ and $s_2(u,v)$ is undefined). Hence, $G(u,v) = (1 \times 0 + 0 \times \text{undefined}) / (1 + 0) = 0$.
Now we will consider what the Gower similarity coefficient of objects u and $v_i$ would be, where $v_i$ represents v after replacing its missing value of attribute 2 with some value from the domain range [0, 100]. Objects $v_1$, …, $v_{11}$ in Table I represent object v under the assumption that its actual value of attribute 2 is 0, 10, …, 100, respectively. Clearly, each instance $v_i$ of object v is comparable with u on both attributes and is different from u on attribute 1, which is qualitative (so the similarity of $v_i$ to u on attribute 1 equals 0). Hence, $G(u,v_i) = (1 \times 0 + 1 \times s_2(u,v_i)) / (1 + 1) = s_2(u,v_i) / 2$.
Clearly, $G(u,v_i)$ reaches its maximum for the greatest value of $s_2(u,v_i)$. This happens for object $v_5$, for which $s_2(u,v_5) = 1$ and, in consequence, $G(u,v_5) = 0.5$. $G(u,v_i)$ reaches its minimum for the least value of $s_2(u,v_i)$ (that is, for the largest absolute value of the difference between the age of u and the age of $v_i$). This happens for object $v_{11}$, for which $s_2(u,v_{11}) = 0.4$ and so $G(u,v_{11}) = 0.2$. Please note that this least achievable value of 0.2 of $G(u,v_i)$ is greater than $G(u,v)$, which equals 0.
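The computation in Example 1 can be replayed with a few lines of Python. This is an illustrative sketch only: the helper name `s_quantitative` is hypothetical, and the value 40 for attribute 2 of u is inferred from $s_2(u,v_5) = 1$ and $s_2(u,v_{11}) = 0.4$ rather than read from the table.

```python
def s_quantitative(x: float, y: float, value_range: float) -> float:
    """Similarity of two known values of a quantitative attribute."""
    return 1.0 - abs(x - y) / value_range


U_ATTRIBUTE_2 = 40   # assumption: inferred from s_2(u, v_5) = 1
VALUE_RANGE = 100    # domain range [0, 100] of attribute 2
completions = list(range(0, 101, 10))  # values assumed for v_1, ..., v_11

# Attribute 1 always contributes similarity 0 with weight 1, so
# G(u, v_i) = (0 + s_2(u, v_i)) / 2.
g_values = [(0.0 + s_quantitative(U_ATTRIBUTE_2, a, VALUE_RANGE)) / 2
            for a in completions]

g_min, g_max = min(g_values), max(g_values)
g_original = 0.0  # G(u, v) computed with attribute 2 ignored

print(f"least achievable G(u, v_i):    {g_min:.2f}")
print(f"greatest achievable G(u, v_i): {g_max:.2f}")
print(f"G(u, v) below every completion: {g_original < g_min}")
```

Only the eleven completions listed in the example are enumerated; since $s_2$ is piecewise linear in the replaced value, its extremes over the whole range [0, 100] are attained at 40 and at an end of the domain, both of which are among these completions.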
As shown in Example 1, $G(u,v)$ may take a value that is not obtainable for any actual completion of the missing values of quantitative attributes of objects u and v.
In the further part of the article, we introduce strict lower and upper bounds on the actual similarity value of any objects u and v on an attribute from the set INCMP*_ATT(u,v) and on the actual value of the Gower similarity coefficient for these objects. The bounds will make it possible to check when the Gower similarity coefficient takes values unattainable for any completion of missing values.

B. Lower and Upper Bounds on Actual Similarity Value on an Attribute
Let us recall that objects u and v are not comparable on attribute i either because at least one of the objects has a missing value for this attribute (i.e., i ∈ INCMP*_ATT(u,v)) or because the attribute is dichotomous and both objects have value − for it (i.e., i ∈ INCMP^d_ATT(u,v)). If u and v are incomparable on attribute i, then $w_i(u,v) = 0$, and so attribute i does not contribute to the value of $G(u,v)$. Nevertheless, in the case of attribute i ∈ INCMP*_ATT(u,v), u and v may become comparable on attribute i if the actual values of attribute i become known for both objects. Then, $w_i(u,v)$ can become equal to 1, and so $s_i(u,v)$ can contribute to the value of $G(u,v)$. Example 1 illustrates how replacing the missing value of quantitative attribute i affects the values of $w_i(u,v)$, $s_i(u,v)$ and $G(u,v)$. This influence is also illustrated for a qualitative attribute and a dichotomous attribute in Examples 2 and 3, respectively.
Objects u and v are comparable on attributes 1 and 2 ($w_1(u,v) = w_2(u,v) = 1$, $s_1(u,v) = 0$, $s_2(u,v) = 0.9$) and are not comparable on attribute 3 ($w_3(u,v) = 0$, $s_3(u,v)$ = undefined). Hence, $G(u,v) = (1 \times 0 + 1 \times 0.9 + 0 \times \text{undefined}) / (1 + 1 + 0) = 0.9 / 2 = 0.45$. Objects $v_1$ and $v_2$ in Table III present instances of object v after replacing its missing value of attribute 3 with either − or +. Since both u and $v_1$ have value − of attribute 3, they are not comparable on this attribute (so, $w_3(u,v_1) = 0$) and $s_3(u,v_1) = 0$. This means that attribute 3 does not contribute to the value of $G(u,v_1)$ even though its value is known for both u and $v_1$. Now, since u and $v_2$ have values − and +, respectively, on attribute 3, they are comparable on attribute 3 (so, $w_3(u,v_2) = 1$) and their similarity on this attribute is the least possible; namely, $s_3(u,v_2) = 0$.
In Examples 1, 2 and 3, we considered instances of an example object u, with known values for all attributes, and an object v, with a missing value for only one given attribute i. We considered all or some instances of object v in which the missing value was replaced by possible actual values, including those instances of object v whose similarity to u on attribute i was the least and the greatest, respectively. Clearly, these least and greatest values are lower and upper bounds, respectively, on the similarity values of objects u and v on the examined attributes.
Let i ∈ INCMP*_ATT(u,v). The lower bound on the actual similarity value of u and v on attribute i will be denoted by $\underline{s}_i(u,v)$, while the upper bound on the actual similarity value of u and v on attribute i will be denoted by $\overline{s}_i(u,v)$. The associated weights for the bounds will be denoted by $\underline{w}_i(u,v)$ and $\overline{w}_i(u,v)$, respectively.
In Table IV, we provide the values of the similarity bounds $\underline{s}_i(u,v)$ and $\overline{s}_i(u,v)$ and their weights, respectively, under the assumption that the value of attribute i is missing for at least one object. In fact, $s_i(u,v) = s_i(v,u)$; thus, without loss of generality, we assume that the value of attribute i is missing for object v. The results are provided for quantitative, qualitative and dichotomous attributes. We also indicate for which possible actual values of v and, where applicable, of u, $s_i(u,v) = \underline{s}_i(u,v)$ and $s_i(u,v) = \overline{s}_i(u,v)$, respectively. Thus, we show that $\underline{s}_i(u,v)$ and $\overline{s}_i(u,v)$ are strict lower and upper bounds on the actual similarity value of objects u and v on each attribute i ∈ INCMP*_ATT(u,v).
TABLE IV. STRICT SIMILARITY BOUNDS $\underline{s}_i(u,v)$, $\overline{s}_i(u,v)$ AND THEIR ASSOCIATED WEIGHTS $\underline{w}_i(u,v)$ AND $\overline{w}_i(u,v)$ FOR MISSING VALUE OF OBJECT v AND KNOWN OR MISSING VALUE OF OBJECT u.
Please note that $\underline{w}_i(u,v)$, the weight associated with the lower bound, equals 1 for each attribute i ∈ INCMP*_ATT(u,v). On the other hand, the lower bound $\underline{s}_i(u,v) = 0$ in all cases considered in Table IV except for a quantitative attribute i whose value is missing for only one of the two compared objects. In that exceptional case, $\underline{s}_i(u,v)$ depends on the known value x of the other object and can be greater than 0 (as shown in Table IV, in this case, $\underline{s}_i(u,v) = \min\{x - min_i,\ max_i - x\} / range_i$).

Property 3.
Let i be a quantitative attribute. Let the value of attribute i be missing for object v and be equal to x for object u. Then:
a) The lower bound $\underline{s}_i(u,v)$ reaches its maximum, which is equal to 0.5, for $x = (min_i + max_i) / 2$.
b) The lower bound $\underline{s}_i(u,v)$ reaches its minimum, which is equal to 0, for $x = min_i$ or $x = max_i$.
Proof: Follows from the value of $\underline{s}_i(u,v)$ for a quantitative attribute (see Table IV).
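The proof can be spelled out in one step. The derivation below is a sketch that assumes the expression for the lower bound quoted from Table IV, namely min{x − min_i, max_i − x} / range_i, with $\underline{s}_i(u,v)$ used as the symbol for that bound.

```latex
\[
\underline{s}_i(u,v) = \frac{\min\{x - min_i,\; max_i - x\}}{range_i},
\qquad (x - min_i) + (max_i - x) = range_i .
\]
% Both arguments of min are non-negative and sum to range_i, hence
\[
0 \;\le\; \min\{x - min_i,\; max_i - x\} \;\le\; \frac{range_i}{2}.
\]
% Right-hand equality holds iff x - min_i = max_i - x, i.e. x = (min_i + max_i)/2,
% which gives the maximum 0.5 of part a).
% Left-hand equality holds iff x = min_i or x = max_i,
% which gives the minimum 0 of part b).
```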
Note also that for each attribute i ∈ INCMP*_ATT(u,v), the upper bound $\overline{s}_i(u,v) = 1$ and its weight $\overline{w}_i(u,v) = 1$, unless attribute i is dichotomous and its value is equal to − for one object, say u, and is missing for the other object, say v. In that exceptional case, $\overline{s}_i(u,v) = 0$ and $\overline{w}_i(u,v) = 0$ (which corresponds to the situation when the actual value of v is also equal to −), while $\underline{w}_i(u,v) = 1$ and $\underline{s}_i(u,v) = 0$ (which corresponds to the situation when the actual value of v equals +). In the former case, attribute i does not contribute to the Gower similarity coefficient, while in the latter case, attribute i contributes to it with the least possible value of 0.
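The case analysis of the bounds and their weights described in the text can be summarised in a small function. This is a hedged sketch: Table IV itself is not reproduced here, the function name `attribute_bounds` and the value encodings are assumptions, and the value of attribute i is taken to be missing for v and either known (`known`) or also missing (`None`) for u.

```python
from typing import Optional, Tuple

Bound = Tuple[int, float]  # (weight, similarity)


def attribute_bounds(kind: str, known=None, lo: Optional[float] = None,
                     hi: Optional[float] = None) -> Tuple[Bound, Bound]:
    """Return ((w_lower, s_lower), (w_upper, s_upper)) for an attribute whose
    value is missing for v; `known` is the value of u, or None if also missing."""
    lower: Bound = (1, 0.0)  # default: counted, with the least similarity
    upper: Bound = (1, 1.0)  # default: counted, with the greatest similarity
    if kind == "quantitative" and known is not None:
        # The farthest possible value of v is one of the ends of the domain.
        lower = (1, min(known - lo, hi - known) / (hi - lo))
    elif kind == "dichotomous" and known == "-":
        # Best case: v is also "-", so the attribute is not counted at all.
        upper = (0, 0.0)
    return lower, upper


print(attribute_bounds("quantitative", 40, 0, 100))
print(attribute_bounds("dichotomous", "-"))
print(attribute_bounds("qualitative", "red"))
```

Returning the weight together with the similarity keeps the dichotomous exception explicit: it is the only case in which a bound is realised by excluding the attribute rather than by a similarity of 1.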

C. Lower and Upper Bounds on Actual Value of Gower Similarity Coefficient
We start by defining lower and upper bounds on the actual value of the Gower similarity coefficient, which are achievable after replacing all missing values in the compared objects with some values from the domains of the corresponding attributes.
The lower bound on the actual value of $G(u,v)$ is denoted by $\underline{G}(u,v)$ and is defined as follows: [equation missing].
The upper bound on the actual value of $G(u,v)$ is denoted by $\overline{G}(u,v)$ and is defined as follows: [equation missing].
In fact, $G'(u,v)$ can be regarded as an improved version of $G(u,v)$.
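Since the defining formulas are not available here, the following Python sketch only illustrates one plausible reading: it assumes that each bound on the coefficient is obtained by averaging, exactly as in $G(u,v)$, the known similarities on comparable attributes together with the per-attribute bound similarities and bound weights of the attributes with missing values. The function name `weighted_similarity` and the (weight, similarity) pair encoding are hypothetical. Under this assumption the sketch reproduces the least and greatest values found in Example 1 (0.2 and 0.5) and the two values obtained from the definition for the completions in Example 3 (0.3 and 0.45).

```python
from typing import List, Optional, Tuple


def weighted_similarity(pairs: List[Tuple[int, float]]) -> Optional[float]:
    """Average the similarities over the attributes whose weight is 1."""
    total_weight = sum(w for w, _ in pairs)
    if total_weight == 0:
        return None  # no attribute is counted
    return sum(w * s for w, s in pairs) / total_weight


# Example 1: attribute 1 is comparable with similarity 0; attribute 2 is
# quantitative, known for u only, with bounds 0.4 (lower) and 1 (upper).
example1_lower = weighted_similarity([(1, 0.0), (1, 0.4)])
example1_upper = weighted_similarity([(1, 0.0), (1, 1.0)])

# Example 3: attributes 1 and 2 are comparable (similarities 0 and 0.9);
# attribute 3 is dichotomous with value "-" for u and missing for v.
example3_lower = weighted_similarity([(1, 0.0), (1, 0.9), (1, 0.0)])
example3_upper = weighted_similarity([(1, 0.0), (1, 0.9), (0, 0.0)])

print(example1_lower, example1_upper)
print(example3_lower, example3_upper)
```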
The following notation for the sets of attributes on which u and v are incomparable is also used in the article:
• INCMP_ATT(u,v) denotes the set of attributes on which u and v are not comparable; that is, INCMP_ATT(u,v) = {attribute i | $w_i(u,v) = 0$}.
• INCMP*_ATT(u,v) denotes the set of attributes on which either u or v or both have missing values.
• INCMP^d_ATT(u,v) denotes the set of dichotomous attributes on which both u and v have value −.

Example 4 allows us to conclude what follows:
Property 6.
Let u and v be comparable objects. Let i be a quantitative attribute with a missing value for object u and a known value for object v. Then:
a) It may happen that $\underline{s}_i(u,v) > G(u,v)$.
b) If $\underline{s}_i(u,v) > G(u,v)$, then it may happen that $\underline{G}(u,v) > G(u,v)$.
Corollary 1.
It may happen that $\underline{G}(u,v) > G(u,v)$ when there is a missing value in u or v. If $\underline{G}(u,v) > G(u,v)$, then $G(u,v)$ takes an incorrect value, which cannot be obtained for any possible actual value of attribute i of object u.
To avoid the problem stated in Corollary 1, one may use, depending on the application, the lower bound $\underline{G}(u,v)$, the upper bound $\overline{G}(u,v)$ or an appropriately modified version of $G(u,v)$ instead of $G(u,v)$ itself. Below we introduce a new similarity coefficient $G'(u,v)$, defined as follows: [equation garbled in the source].