Algorithms


Nearest Neighbor Algorithm For Color Matching

Geoffrey Probert


Geoffrey Probert has been programming in C for nine years. He currently works at Aztek Engineering. He can be reached at (303) 466-9710 or (303) 786-9100 ext. 144.

A problem generally referred to as a nearest neighbor, or closest neighbor, problem begins with a fixed set of points whose locations are all known. When given a new point, you must find the nearest point in the original population to that new point. Most beginning programmer text books deal with the one-dimensional version of the nearest neighbor problem. They solve the problem by sorting the elements and then doing a binary search. Text books do not usually deal with the multi-dimensional version of this problem. This article will discuss an approach guaranteed to find the nearest neighbor to a point in a multi-dimensional space.

The Problem

In my application, I needed to display an image on a VGA screen (640 pixels by 480 lines with 256 colors) that was originally 768 pixels by 512 lines with 256 colors. Since the palette for the original image already existed, it seemed best to keep that same palette. For each of the pixels in the VGA image, I computed the red, green, and blue values using bilinear interpolation as shown in Figure 1 (see "Resampling Methods for Image Manipulation" by Girish T. Hagan in the August 1991 issue of C Users Journal). After computing the red, green, and blue values, I needed to select the color from the palette that was the closest match to the computed values.

The first approach to this problem is the brute force method. This involves computing the distance from each of the desired values to all of the available neighbors. If I am generating an image of 640 pixels by 480 lines with a palette of 256 colors, the distance must be calculated 78.6 million times. This takes a considerable amount of time even on a fast CPU and inspires the search for a faster algorithm. I approached this problem with an algorithm that limits the number of distance calculations and comparisons.

The algorithm contains two parts. The first part initializes data and structures. It does this once before any searching for a nearest neighbor. The second part actually searches for the nearest neighbor for each new pixel being generated.

Two-Dimensional Problem

To visualize the algorithm more easily, look at the two-dimensional problem first. In the preparation phase, I proceed just as in a one-dimensional problem. I sort all of the neighbors using just one of the dimensions, say X. Since I am sorting the neighbors, I must also bring along the values for the rest of the dimensions (Y in the two-dimensional case).

The search phase is a two part process. First I must find the nearest neighbor using just the single dimension I originally sorted on, in this case X. This neighbor becomes the initial estimate. The program computes and stores the distance from this initial estimate to the real point as the shortest distance. Next, I begin to search to the left and to the right, in X, of the initial estimate. At each of these points the program computes the distance to the real point. If it is less than the shortest distance so far, then it becomes the new shortest distance. The trick is to end the search before searching all of the points. To do this, the program also computes the distance in the sorted dimension only (X). Once this distance is greater than the shortest distance for both the left and right sides, I know that no more points need to be looked at. The nearest neighbor corresponds to the point that had the shortest distance to the desired point.

Two-Dimensional Example

The example in Figure 2 illustrates this concept. There are nine neighbors, each indicated by a dot. An X indicates the desired location at (6,2). Assume the sort along the X dimension has been done.

The sort along the X dimension results in an initial estimate at the point (6,7). The distance to this point is 5, initially the shortest distance.

The first point to be examined would be to the left of the initial estimate, at (5,5). The X distance is 1, which is less than the shortest distance, 5. Therefore I compute the distance, sqrt(10), to this point. This is less than the shortest distance and therefore replaces the shortest distance.

The next point to be examined is to the right of the initial estimate and is at (7,4). The X distance is 1, which is less than the shortest distance, sqrt(10). Therefore I compute the distance, sqrt(5), to this point. This is less than the shortest distance and therefore replaces the shortest distance.

The next point to be examined is to the left of the initial estimate and is at (4,5). The X distance is 2, which is less than the shortest distance, sqrt(5). Therefore I compute the distance, sqrt(13), to this point. This is greater than the shortest distance and therefore does not replace the shortest distance.

The next point to be examined is to the right of the initial estimate and is at (8,5). The X distance is 2, which is less than the shortest distance, sqrt(5). Therefore I compute the distance, sqrt(13), to this point. This is greater than the shortest distance and therefore does not replace the shortest distance.

The next point to be examined is to the left of the initial estimate and is at (3,4). The X distance is 3, which is greater than the shortest distance, sqrt(5). Therefore I can quit searching to the left of the initial estimate.

The next point to be examined is to the right of the initial estimate and is at (9,4). The X distance is 3, which is greater than the shortest distance, sqrt(5). Therefore I can quit searching to the right of the initial estimate. I have finished searching to both the left and the right of the initial estimate, therefore the nearest neighbor has been found at (7,4) with a distance of sqrt(5).

Color Matching, Three Dimensions

In the application that I was working on, an image was being generated for display at 640 pixels by 480 lines by 256 colors. The palette was already present with the original image. The individual color values ranged from 0 to 255. These would later be scaled for the VGA palette which ranges from 0 to 63. The size of my arrays and structures were designed to handle the larger range of 0 to 255. I used the "red" dimension for my sort.

Listing 1 declares the global variables needed by the routines. In order to generate new pixels by interpolating from the original image, a copy of the original palette (palo) must be kept to have access to the original colors. The copy of the palette is named pals. It carries along the index of the original palette entry in num. This allows the original palette in the original order to be used for the new image. To speed things up, the square of the distance between points was computed rather than the actual distance.

To speed up the acquisition of the initial estimate, another array was initialized during the preparation phase. Since there were only 256 possibly values in the sorted dimension, I generated an array named redindex that would immediately index into the sorted palette for the initial estimate.

Listing 2 details the initialization phase. The routine sort_color makes a copy of the original palette and uses qsort to sort on the red dimension. The routine color_comp is the comparison routine for qsort. After sorting pals, the program generates the indexing array redindex. To do this, it uses two indices, one to step through redindex, and the other to step through pals. It sets each value in redindex to the closest value there is in pals.

Listing 3 does the actual search for the nearest neighbor. It finds the initial estimate in redindex and computes the shortest distance for it. My search looks at one possibility on the left and then at a possibility on the right. The program computes the "red distance" followed by the real distance, if applicable. Listing 3 also provides tests to make sure the search remains within the bounds of pals. The dist routine computes the square of the distance between a palette entry and the desired red, green, and blue values.

Summary

This algorithm is over 45 times faster than the brute force method would be in my application. Stopping the search before all possible values were examined contributed the most to this increase in speed. Using redindex to get a quick initial estimate also contributed significantly.

This algorithm should be applicable to any multidimensional nearest neighbor problem. One word of caution: if your data is clumped in one dimension and spread out in another, you should sort along the spread out dimension. This will speed up the search algorithm.