How Similarity Search works

The Similarity Search tool identifies which Candidate Features are most similar (or most dissimilar) to one or more Input Features To Match. Similarity is based on a specified list of numeric attributes (Attributes Of Interest). If more than one input feature to match is specified, similarity is based on the average of each Attribute Of Interest across those features. The output feature class (Output Features) will contain the Input Features To Match along with all of the matching Candidate Features that were found, ordered by similarity (as specified by the Most Or Least Similar parameter). The number of matches returned is set by the Number Of Results parameter.

Potential applications

Matching methods

Matching may be based on attribute values, ranked attribute values, or attribute profiles (cosine similarity). The algorithm employed for each of these methods is described below. For all methods, if there is more than one input feature to match, the attributes of all input features are averaged to create a composite target feature, which is then used for the matching process:

Averaged Attributes of Interest
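The averaging step can be sketched in a few lines of Python. This is only an illustration of the idea described above, not the tool's own code; the function name composite_target and the sample values are hypothetical.

```python
def composite_target(features):
    """Average each attribute across all input features to match.

    `features` is a list of rows, one row per input feature, with the
    attributes of interest in the same order in every row.
    """
    n = len(features)
    # zip(*features) regroups the rows into one tuple per attribute.
    return [sum(column) / n for column in zip(*features)]


# Two hypothetical input features: (rate, population, distance in km).
inputs = [
    [0.25, 1_000_000, 12.0],
    [0.75, 3_000_000, 8.0],
]
target = composite_target(inputs)
print(target)  # [0.5, 2000000.0, 10.0]
```

Every candidate is then compared with this single composite target rather than with each input feature separately.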

Attribute values

When you select ATTRIBUTE_VALUES for the Match Method parameter, the tool first standardizes all of the Attributes of Interest. For each candidate it then subtracts the standardized values from those of the target, squares the differences, and adds the squared differences together. This sum becomes the similarity index for that candidate. Once all candidates have been processed, candidates are ranked from smallest index (most similar) to largest index (least similar).

Dive-in:

Standardization of the attribute values involves a z-transform: the mean for all values is subtracted from each value, and the result is divided by the standard deviation for all values. Standardization puts all of the attributes on the same scale even when they are represented by very different types of numbers: rates (numbers from 0 to 1.0), population (with values larger than 1 million), and distances (kilometers, for example).
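A minimal Python sketch of the attribute values method follows. It assumes that the target is standardized together with the candidates and that the population standard deviation is used; the text above does not say which choices the tool makes. The function names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each attribute (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    # Assumption: population standard deviation. A constant attribute
    # (standard deviation of 0) would need special handling.
    stdevs = [pstdev(c) for c in columns]
    return [
        [(v - m) / s for v, m, s in zip(row, means, stdevs)]
        for row in rows
    ]


def attribute_value_index(target, candidates):
    """Sum of squared standardized differences for each candidate."""
    # Assumption: the target is standardized together with the candidates.
    z = zscore_columns([target] + candidates)
    z_target, z_candidates = z[0], z[1:]
    return [
        sum((t - c) ** 2 for t, c in zip(z_target, zc))
        for zc in z_candidates
    ]


# Hypothetical attributes: (rate, population, distance in km).
target = [0.5, 2_000_000, 10.0]
candidates = [
    [0.5, 2_100_000, 11.0],  # close to the target on every attribute
    [0.1, 50_000, 95.0],     # far from the target on every attribute
    [0.6, 1_500_000, 14.0],
]
index = attribute_value_index(target, candidates)

# Smallest index first: most similar to least similar.
ranking = sorted(range(len(candidates)), key=lambda i: index[i])
print(ranking)  # [0, 2, 1]
```

Without standardization, the population differences (hundreds of thousands) would swamp the rate differences (tenths), and the index would in effect measure population alone.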

Ranked attribute values

When you select RANKED_ATTRIBUTE_VALUES for the Match Method parameter, the tool begins by ranking each of the Attributes of Interest for the target feature and all of the candidates. For each candidate it then sums the squared rank differences between the candidate and the target feature across all attributes. For example, if the population value for the target is the 10th largest and the population for the candidate being considered is the 15th largest, the squared rank difference for population is (10 - 15)**2 = (-5)**2 = 25. The sum of squared rank differences for all of the Attributes of Interest becomes the similarity index for this candidate. Once all candidates have been processed, candidates are ranked from smallest index (most similar) to largest index (least similar).
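The ranked method can be sketched the same way. The sketch assumes that the target and candidates are ranked together, that the largest value gets rank 1, and that tied values share a rank; the tool's actual tie handling is not described here. The function names and sample values are hypothetical.

```python
def ranks_descending(values):
    """Rank values so the largest gets rank 1; ties share the same rank."""
    return [1 + sum(other > v for other in values) for v in values]


def ranked_value_index(target, candidates):
    """Sum of squared rank differences for each candidate."""
    # Assumption: the target is ranked together with the candidates.
    rows = [target] + candidates
    rank_columns = [ranks_descending(col) for col in zip(*rows)]
    rank_rows = list(zip(*rank_columns))
    r_target, r_candidates = rank_rows[0], rank_rows[1:]
    return [
        sum((t - c) ** 2 for t, c in zip(r_target, rc))
        for rc in r_candidates
    ]


# Hypothetical attributes: (rate, population, distance in km).
target = [0.5, 2_000_000, 10.0]
candidates = [
    [0.5, 2_100_000, 11.0],
    [0.1, 50_000, 95.0],
    [0.6, 1_500_000, 14.0],
]
index = ranked_value_index(target, candidates)
print(index)  # [2, 17, 6]

# The worked example from the text: ranks 10 and 15 for population.
print((10 - 15) ** 2)  # 25
```

Because only ranks enter the index, an extreme outlier in one attribute counts as one rank position and no more. That is the main practical difference from the attribute values method.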

Attribute profiles

When you select ATTRIBUTE_PROFILES for the Match Method parameter, the tool first standardizes all of the Attributes of Interest (a minimum of two Attributes of Interest is required for this method). It then uses cosine similarity mathematics to compare the vector of standardized attributes for each candidate to the vector of standardized attributes for the target feature being matched. The cosine similarity of two vectors, A and B, is computed as:

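\[
\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
\]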

Cosine similarity is not concerned with matching attribute magnitudes; instead, it focuses on the relationships among the attributes. If you created a profile (line graph) of the standardized attributes in the vectors being compared (the target and one of the candidates), you might see very similar profiles or very different profiles:

Attribute profiles

The cosine similarity index ranges from 1.0 (perfect similarity) to -1.0 (perfect dissimilarity) and is reported in the SIMINDEX (Cosine Similarity) field. You would use this similarity method to find places that have the same characteristics but perhaps at a larger or smaller scale.
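The following Python sketch shows the standard cosine similarity calculation and why it ignores scale. The vectors stand in for attributes that have already been standardized; the function name and the values are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


target = [1.0, -0.5, 2.0]
same_profile = [2.0, -1.0, 4.0]       # same shape, twice the magnitude
opposite_profile = [-1.0, 0.5, -2.0]  # mirror image of the target

sim_same = cosine_similarity(target, same_profile)
sim_opposite = cosine_similarity(target, opposite_profile)
print(round(sim_same, 6), round(sim_opposite, 6))  # 1.0 -1.0
```

A candidate whose profile is a scaled copy of the target scores 1.0 however much larger or smaller it is, which is why this method suits finding places with the same characteristics at a different scale.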

Best practices

Mapping similarity patterns

If you set the Number of Results parameter to zero, the tool will rank all of the candidate features. The output for this analysis will show you the spatial pattern of similarity. Notice that when you rank all candidates you get information about similarity and about dissimilarity.

Ranked similarity map

Including spatial variables

Suppose you know the locations (polygon areas) where a particular endangered species is doing well and you want to find other locations where it might also thrive. You would be looking for locations similar to the successful ones, but might also need locations large enough and compact enough to ensure species success. For this analysis you could compute a compactness metric for each polygon area (common compactness measurements are based on the area of a polygon in relation to the area of a circle with the same perimeter). You could then include your compactness measurement and an attribute reflecting polygon size (Shape_Area) in the Fields To Append To Output parameter when you run the Similarity Search tool. Sorting the top ten solution matches in terms of both compactness and area will help you identify the most appropriate locations for species reintroduction.
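One common compactness measure of the kind described above divides the polygon's area by the area of a circle with the same perimeter, which simplifies to 4πA / P². The sketch below assumes you already have each polygon's area and perimeter as plain numbers in consistent units; the function name is hypothetical.

```python
import math


def compactness(area, perimeter):
    """Polygon area divided by the area of a circle with the same perimeter.

    A circle with perimeter P has radius P / (2 * pi) and therefore area
    P**2 / (4 * pi), so the ratio is 4 * pi * area / perimeter**2.
    The result is 1.0 for a circle and approaches 0 for elongated or
    highly irregular shapes.
    """
    return 4 * math.pi * area / perimeter ** 2


circle = compactness(area=math.pi * 1.0 ** 2, perimeter=2 * math.pi * 1.0)
square = compactness(area=1.0, perimeter=4.0)
strip = compactness(area=10.0 * 0.1, perimeter=2 * (10.0 + 0.1))

print(round(circle, 3), round(square, 3), round(strip, 3))
```

The circle scores 1.0, the unit square scores π/4 (about 0.785), and the long thin strip scores far lower, so sorting matches by this value puts the most compact polygons first.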

Perhaps you are a retailer interested in expanding. If you have existing stores that have been successful you can use attributes reflecting the key characteristics of success to help you find candidate locations for expansion. Suppose that the products you sell will be most attractive to college students and that you want to avoid locations near your current stores or near competitors. Before running the Similarity Search tool you would use the Near tool to create your spatial variables: distance to colleges or places with high densities of college students, distance to existing stores, and distance to competitors. You could then include these spatial variables in the Fields To Append To Output parameter when you run the Similarity Search tool.

8/26/2014