Distance Measures - Similarity, Dissimilarity and Correlation

The term distance measure has a wide variety of definitions among math and data mining practitioners. As a result, the concepts and their usage often go over the heads of beginners trying to understand them for the first time. So today I am taking this opportunity to write this post and give a simpler explanation of things that shouldn't be hard to understand and follow.

In the simplest form, distance measures are mathematical approaches to measuring the distance between any two objects. Computing distance measures helps us compare objects from three different standpoints:

· How similar the objects are

· How dissimilar the objects are

· How correlated the objects are

Similarity is a measure that ranges from 0 to 1 [0, 1].

Dissimilarity is a measure that ranges from 0 to infinity [0, ∞).

Correlation is a measure that ranges from -1 to +1 [-1, +1].

Note: Similarity and dissimilarity are sometimes together referred to as a proximity score.

Distance measures such as similarity, dissimilarity and correlation are the basic building blocks for clustering, classification and anomaly detection. So, getting familiar with these metrics in depth helps to build a better understanding of advanced data mining algorithms and analytics.

When talking about distance measures we cannot avoid discussing the term distance transformation, or simply transformation. Distance transformation refers to the activity of converting similarity scores into dissimilarity scores, or vice versa.

The necessity of distance transformation emerged from the fact that developers or practitioners may have a way of computing a dissimilarity score, but a data mining algorithm may require a similarity score as input; thus a mechanism is required to convert one form into the other.
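As a minimal sketch of such a transformation, here are two common textbook mappings between the two forms (the function names are mine, not from any particular library):

```python
def dissimilarity_to_similarity(d):
    """Map a dissimilarity in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + d)

def similarity_to_dissimilarity(s):
    """Inverse of the mapping above: similarity in (0, 1] back to dissimilarity."""
    return (1.0 - s) / s

# A dissimilarity of 0 (identical objects) maps to the maximum similarity of 1.0,
# and larger dissimilarities map to similarities closer to 0.
dissimilarity_to_similarity(0.0)   # 1.0
dissimilarity_to_similarity(3.0)   # 0.25
```

For similarities already bounded in [0, 1] (such as those for nominal or ordinal attributes, below), the even simpler mapping s = 1 − d is often used instead.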

Let’s consider simple objects with a single attribute and discuss how similarity, dissimilarity and correlation are computed.

Today there are multiple formulas for computing similarity and dissimilarity for simple objects, and the choice of formula is determined by the type of attribute (Nominal, Ordinal, Interval or Ratio) in the objects.

Attribute type       Dissimilarity                          Similarity
Nominal              d = 0 if x = y; d = 1 if x ≠ y         s = 1 if x = y; s = 0 if x ≠ y
Ordinal              d = |x − y| / (n − 1), with values     s = 1 − d
                     mapped to ranks 0 .. n − 1
Interval or Ratio    d = |x − y|                            s = 1 / (1 + d), or s = −d

Note: As mentioned earlier, in some situations it’s easier to compute dissimilarity first and then convert the dissimilarity into a similarity measure (for example, with Ordinal attributes) for further processing.
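To make the single-attribute formulas concrete, here is a small Python sketch. The rank mapping for the ordinal example (poor = 0 up to excellent = 3) is an illustrative assumption, as are the function names:

```python
def nominal_similarity(x, y):
    """Nominal attributes: values either match or they don't."""
    return 1 if x == y else 0

def ordinal_dissimilarity(x, y, n):
    """Ordinal attributes mapped to ranks 0..n-1: d = |x - y| / (n - 1)."""
    return abs(x - y) / (n - 1)

def ordinal_similarity(x, y, n):
    """Dissimilarity is computed first, then transformed: s = 1 - d."""
    return 1 - ordinal_dissimilarity(x, y, n)

# Example: a quality rating with ranks {poor: 0, fair: 1, good: 2, excellent: 3}
nominal_similarity("red", "blue")   # 0
ordinal_dissimilarity(2, 0, 4)      # 2/3 -- "good" vs "poor"
```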

Now let’s consider calculating distance measures involving more complex objects with multiple attributes.

Euclidean distance – Euclidean distance is a classical method that computes the distance between two objects A and B in Euclidean space (1-, 2- or n-dimensional space). In Euclidean geometry, the distance between two points is found by travelling along the straight line connecting them. Inherently, the calculation uses the Pythagorean theorem to compute the distance.

Taxicab or Manhattan distance – Similar to the Euclidean distance between points A and B, but with one difference: the distance is calculated by traversing the vertical and horizontal lines of a grid-based system. For example, Manhattan distance can be used to calculate the distance between two points that are separated by the building blocks of a city.
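Both distances can be written in a few lines of Python (a minimal sketch; the function names are mine):

```python
import math

def euclidean(a, b):
    """Straight-line distance: the Pythagorean theorem applied per axis."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Grid distance: the sum of absolute per-axis differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

a, b = (0, 0), (3, 4)
euclidean(a, b)   # 5.0 -- the 3-4-5 right triangle
manhattan(a, b)   # 7  -- 3 blocks across plus 4 blocks up
```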

The difference between these two distance calculations is best seen visually. The figure illustrates the difference.

The Minkowski distance is a metric in Euclidean space which is considered a generalisation of both the Euclidean distance and the Manhattan distance:

d(A, B) = (Σ_k |a_k − b_k|^r)^(1/r)

where r is a parameter.

When r = 1, the Minkowski formula computes the Manhattan distance.

When r = 2, the Minkowski formula computes the Euclidean distance.

When r = ∞, the Minkowski formula computes the supremum (the largest per-axis difference).
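The three cases above can be checked with one small function (a sketch under the same notation; the function name is mine):

```python
import math

def minkowski(a, b, r):
    """Minkowski distance: r=1 gives Manhattan, r=2 Euclidean, r=inf supremum."""
    if r == math.inf:
        # Limiting case: the largest per-axis difference dominates.
        return max(abs(x - y) for x, y in zip(a, b))
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1 / r)

a, b = (0, 0), (3, 4)
minkowski(a, b, 1)         # 7.0  (Manhattan)
minkowski(a, b, 2)         # 5.0  (Euclidean)
minkowski(a, b, math.inf)  # 4    (supremum)
```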

The cosine similarity between two vectors (or two documents in a vector space) is a measure that calculates the cosine of the angle between them. This metric is a measurement of orientation, not magnitude; it can be seen as a comparison between documents in terms of the angle between them.
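A minimal sketch of the computation (cosine of the angle = dot product divided by the product of the vector lengths; the function name is mine):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cosine_similarity((1, 0), (0, 1))   # 0.0 -- orthogonal vectors
cosine_similarity((1, 2), (2, 4))   # ≈ 1.0 -- same direction, different magnitude
```

Note how the second example returns (approximately) 1.0 even though the vectors have different lengths: only orientation matters, which is exactly why cosine similarity suits document comparison.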

Mahalanobis distance is one such measure, used to measure the distance between two groups of objects while taking the spread of the data into account. The idea of a distance measure between two groups of objects can be represented graphically for better understanding.

Given data like that depicted in the picture above, the Mahalanobis distance can be calculated between Group 1 and Group 2. This type of distance measure is helpful in classification and clustering.
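As a pure-Python sketch for the two-attribute case (in practice one would use numpy or scipy; the helper name, the hand-rolled 2×2 matrix inversion, and the assumption that the caller supplies group means and a covariance matrix are all mine):

```python
def mahalanobis_2d(mean_a, mean_b, cov):
    """Mahalanobis distance between two group means, given a 2x2 covariance matrix."""
    dx = mean_a[0] - mean_b[0]
    dy = mean_a[1] - mean_b[1]
    # Invert the 2x2 covariance matrix [[a, b], [c, d]].
    a, b = cov[0]
    c, d = cov[1]
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    # Squared distance: diff^T * inv(cov) * diff.
    d2 = (dx * (inv[0][0] * dx + inv[0][1] * dy)
          + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return d2 ** 0.5

# With the identity covariance (uncorrelated, unit-variance attributes),
# the Mahalanobis distance reduces to the Euclidean distance.
mahalanobis_2d((0, 0), (3, 4), [[1, 0], [0, 1]])   # 5.0
```

With a non-identity covariance matrix the result differs from the Euclidean distance: differences along high-variance directions are discounted, which is what makes the measure useful for comparing groups.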

Correlation is a statistical technique that gives a number telling how strong or weak the relationship between two objects is. It is not a measure describing distance, but a measure describing the bond between the objects. The correlation value is usually represented by the small letter 'r', and 'r' can range from -1 to +1.

If r is close to 0, it means there is no relationship between the objects.

If r is positive, it means that as one object gets larger the other gets larger.

If r is negative, it means that as one gets larger, the other gets smaller.

r value          Interpretation

+.70 or higher   Very strong positive relationship
+.40 to +.69     Strong positive relationship
+.30 to +.39     Moderate positive relationship
+.20 to +.29     Weak positive relationship
+.01 to +.19     No or negligible relationship
0                No relationship
-.01 to -.19     No or negligible relationship
-.20 to -.29     Weak negative relationship
-.30 to -.39     Moderate negative relationship
-.40 to -.69     Strong negative relationship
-.70 or lower    Very strong negative relationship

Today there are several proven methods/formulas for computing the correlation measure 'r', of which Pearson's correlation coefficient is the most commonly used. Pearson's correlation coefficient can be calculated for objects that possess a linear relationship. The figure below shows a graphical representation of what these correlations look like.
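A minimal sketch of Pearson's r from its definition, the covariance of the two objects divided by the product of their standard deviations (the function name is mine):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance numerator and the two standard-deviation factors
    # (the common 1/n factors cancel, so they are omitted throughout).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

pearson_r([1, 2, 3], [2, 4, 6])   # ≈ +1.0 -- perfect positive linear relationship
pearson_r([1, 2, 3], [6, 4, 2])   # ≈ -1.0 -- perfect negative linear relationship
```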

Sometimes it is easy to confuse correlation with regression analysis. To get a better understanding of these terms: regression analysis helps to predict a value if there exists a relationship between the objects, whereas correlation helps to check the existence of a relationship between the objects. Thus, a wise statistician always computes correlation first before doing any prediction using regression analysis.