Statistics Toolbox    

Finding the Similarities Between Objects

You use the pdist function to calculate the distance between every pair of objects in a data set. For a data set made up of m objects, there are m(m-1)/2 pairs in the data set. The result of this computation is commonly known as a similarity matrix (or dissimilarity matrix).

There are many ways to calculate this distance information. By default, the pdist function calculates the Euclidean distance between objects; however, you can specify one of several other options. See pdist for more information.
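As a rough illustration of choosing a metric, the sketch below calls pdist twice on a small made-up matrix: once with the default Euclidean distance and once with the 'cityblock' option. The variable names and coordinates are assumptions chosen for illustration; see pdist for the full list of supported options.

```matlab
% Illustrative data (assumed values): three objects, one per row.
X = [0 0; 3 4; 6 8];

% Default metric: Euclidean distance between each pair of rows.
Yeuclid = pdist(X);

% Same pairs, measured with the city block (Manhattan) metric.
Ycity = pdist(X, 'cityblock');
```

Both calls return one value per pair, in the same pair order, so the two vectors can be compared element by element.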

For example, consider a data set, X, made up of five objects where each object is a set of x,y coordinates.

You can define this data set as a 5-by-2 matrix, with one row per object and one column per coordinate, and pass it to pdist. The pdist function calculates the distance between object 1 and object 2, object 1 and object 3, and so on until the distances between all the pairs have been calculated. The following figure plots these objects in a graph. The distance between object 2 and object 3 is shown to illustrate one interpretation of distance.
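A minimal sketch of this step follows. The matrix values are not preserved in this text, so the coordinates below are illustrative assumptions, as is the variable name d23. The last line computes the distance between object 2 and object 3 directly, for comparison with the corresponding entry that pdist returns.

```matlab
% Illustrative coordinates (assumed; the original matrix is not shown here).
% Each row is one object; the columns are its x and y coordinates.
X = [1 2; 2.5 4.5; 2 2; 4 1.5; 4 2.5];

% One Euclidean distance for each of the m(m-1)/2 = 10 pairs.
Y = pdist(X);

% Distance between object 2 and object 3, computed directly:
% sqrt(0.5^2 + 2.5^2), about 2.5495.
d23 = norm(X(2,:) - X(3,:));
```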

Returning Distance Information

The pdist function returns this distance information in a vector, Y, where each element contains the distance between a pair of objects.
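The elements of Y follow the pair order (1,2), (1,3), ..., (1,m), (2,3), and so on. The sketch below shows one way to look up the entry for a given pair. The helper pairIndex is hypothetical, not a toolbox function, and the coordinates are illustrative assumptions.

```matlab
% Illustrative coordinates (assumed values), one object per row.
X = [1 2; 2.5 4.5; 2 2; 4 1.5; 4 2.5];
m = size(X, 1);
Y = pdist(X);

% Hypothetical helper: position in Y of the pair (i,j), valid for i < j.
pairIndex = @(i, j) (i - 1) * (m - i/2) + j - i;

% Pairs (1,2) through (1,5) occupy positions 1 to 4, so (2,3) is at 5.
k = pairIndex(2, 3);
d23 = Y(k);
```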

To make it easier to see the relationship between the distance information generated by pdist and the objects in the original data set, you can reformat the distance vector into a matrix using the squareform function. In this matrix, element i,j corresponds to the distance between object i and object j in the original data set. In the following example, element 1,1 represents the distance between object 1 and itself (which is zero). Element 1,2 represents the distance between object 1 and object 2, and so on.
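The reformatting step can be sketched as follows, again with illustrative coordinates that are assumptions rather than the original data. The result Z is a symmetric m-by-m matrix with zeros on the diagonal.

```matlab
% Illustrative coordinates (assumed values), one object per row.
X = [1 2; 2.5 4.5; 2 2; 4 1.5; 4 2.5];
Y = pdist(X);

% Reformat the length-10 distance vector into a 5-by-5 matrix.
Z = squareform(Y);

d11 = Z(1,1);   % distance between object 1 and itself: 0
d12 = Z(1,2);   % distance between object 1 and object 2, same as Y(1)
d23 = Z(2,3);   % same value as Z(3,2), because Z is symmetric
```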

