Programming Assignment 3: K-D Trees

Background

One generalization of binary trees is the k-d tree, which stores k-dimensional data. Every internal node of a k-d tree indicates the dimension d and the value v in that dimension that it discriminates by. An internal node has exactly two children: the left child contains data less than or equal to v in dimension d, and the right child contains data greater than v in dimension d. For example, if the node distinguishes on dimension 1, value 107 (numbering dimensions from 0, so that dimension 1 is y), then the left child is for data with y value less than or equal to 107, and the right child is for data with y value greater than 107. Each leaf node represents a bucket containing no more than b elements of k-dimensional data. All data are found in the leaves.
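One possible way to represent such a node is sketched below. It is only an illustration: the names Point, KDNode, isLeaf, and childFor are hypothetical, and marking a leaf with a dimension of -1 is an assumption of this sketch, not a requirement of the assignment.

```cpp
#include <cassert>
#include <memory>
#include <vector>

using Point = std::vector<int>;     // one k-dimensional data point (hypothetical alias)

// A node is either internal (dim and value plus two children) or a leaf bucket.
struct KDNode {
    int dim = -1;                   // discriminating dimension d; -1 marks a leaf (assumption)
    int value = 0;                  // discriminating value v
    std::unique_ptr<KDNode> left;   // data with p[dim] <= value
    std::unique_ptr<KDNode> right;  // data with p[dim] >  value
    std::vector<Point> bucket;      // data points; used only by leaves (at most b of them)

    bool isLeaf() const { return dim < 0; }
};

// Choose the child of an internal node that a point belongs to.
const KDNode* childFor(const KDNode& node, const Point& p) {
    assert(!node.isLeaf());
    return p[node.dim] <= node.value ? node.left.get() : node.right.get();
}
```

With a node that discriminates on dimension 1, value 107, a point whose y value is 107 goes left and a point whose y value is 108 goes right.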

There are several strategies for building k-d trees. The offline method (1) accumulates all the data in an array, (2) finds the best dimension to discriminate on, namely, the one with the widest range (break ties by choosing the earliest dimension that has the widest range), (3) finds the best value of that dimension to discriminate on, namely, the median value in that dimension (using the QuickSelect algorithm with Lomuto's partitioning method), (4) separates the data into two subarrays based on that discriminant, (5) recurses back to step 2 on each subarray. Recursion terminates when an array has size b or smaller; that array becomes a leaf bucket. One can also devise online methods that add to existing trees.
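Steps 2 and 3 of the offline method might be sketched as follows. This is a minimal sketch, not the reference solution: the function names widestDimension, lomutoPartition, and quickSelect are hypothetical, and choosing the last element of the range as the pivot is an assumption (the text above does not say which pivot to use).

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

using Point = std::vector<int>;   // hypothetical alias for one data point

// Step 2: the dimension with the widest range in pts[lo..hi].
// The strict > comparison keeps the earliest dimension when ranges tie.
int widestDimension(const std::vector<Point>& pts, int lo, int hi, int k) {
    assert(lo <= hi);
    int best = 0;
    long long bestRange = -1;
    for (int d = 0; d < k; ++d) {
        int mn = pts[lo][d], mx = pts[lo][d];
        for (int i = lo + 1; i <= hi; ++i) {
            mn = std::min(mn, pts[i][d]);
            mx = std::max(mx, pts[i][d]);
        }
        long long range = static_cast<long long>(mx) - mn;
        if (range > bestRange) {
            bestRange = range;
            best = d;
        }
    }
    return best;
}

// Lomuto partition of pts[lo..hi] on dimension d, with pts[hi] as the pivot
// (pivot choice is an assumption). Returns the pivot's final index; every
// element to its left is <= the pivot in dimension d.
int lomutoPartition(std::vector<Point>& pts, int lo, int hi, int d) {
    int pivot = pts[hi][d];
    int i = lo;                           // next slot for a value <= pivot
    for (int j = lo; j < hi; ++j) {
        if (pts[j][d] <= pivot) {
            std::swap(pts[i], pts[j]);
            ++i;
        }
    }
    std::swap(pts[i], pts[hi]);
    return i;
}

// Step 3: QuickSelect. Rearranges pts[lo..hi] so that index target holds the
// element it would hold if the range were sorted on dimension d.
void quickSelect(std::vector<Point>& pts, int lo, int hi, int target, int d) {
    assert(lo <= target && target <= hi);
    while (lo < hi) {
        int p = lomutoPartition(pts, lo, hi, d);
        if (p == target) return;
        if (target < p) hi = p - 1;
        else lo = p + 1;
    }
}
```

If the target index is taken to be lo + (hi - lo) / 2 (again an assumption; the working program may pick the median differently for even-sized arrays), then after quickSelect returns, pts[lo..target] can serve as the left subarray and pts[target+1..hi] as the right subarray in step 4.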

Requirements

Write a program called kd that (1) takes three command-line parameters, all positive integers: k specifies the number of dimensions, n specifies how many data points are to be placed in the tree, and p specifies the number of probes into the tree; (2) reads from standard input a list of n k-dimensional integer data points; (3) builds a k-d tree with those n points using the offline method, with b set to 10 (and ties going to the left subtree); (4) reads p k-dimensional data values, called probes, and for each probe, lists all the data points stored in the bucket where the probe would be found if it were in the tree.

In step 3, you may assume that all integer data are distinct.
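The probe lookup in step 4 amounts to walking from the root to a leaf, comparing the probe against each discriminant on the way. A minimal sketch follows, assuming a hypothetical KDNode layout in which a dimension of -1 marks a leaf; the names KDNode and findBucket are illustrative only.

```cpp
#include <cassert>
#include <memory>
#include <vector>

using Point = std::vector<int>;   // hypothetical alias for one data point

// Hypothetical node layout: dim < 0 marks a leaf, whose data sit in bucket.
struct KDNode {
    int dim = -1;
    int value = 0;
    std::unique_ptr<KDNode> left, right;
    std::vector<Point> bucket;
};

// Descend from the root to the leaf where the probe would be stored,
// sending ties (probe value equal to the discriminant) to the left.
const std::vector<Point>& findBucket(const KDNode& root, const Point& probe) {
    const KDNode* node = &root;
    while (node->dim >= 0) {
        assert(node->left && node->right);   // internal nodes have two children
        node = probe[node->dim] <= node->value ? node->left.get()
                                               : node->right.get();
    }
    return node->bucket;
}
```

Printing every point in the returned bucket then gives the listing that step 4 asks for; the exact output format should be checked against the working program.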

Extra credit

For each probe, find the element in its bucket that is nearest according to Euclidean distance (the L² norm).
Instead of the L² norm, use the L¹ norm, or generalize to any p-norm.
The nearest element to a probe may not be in its bucket, but rather in a nearby bucket. Figure out and program an algorithm to find the true nearest neighbor.
Program an online method for building the tree and handle the same probes.
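For the first two extra-credit items, a nearest-in-bucket search under a p-norm might look like the sketch below. The names pNormPowered and nearestInBucket are hypothetical. The sketch compares the sum of |a_i - b_i|^p without taking the final 1/p root; because the root is monotonic, this does not change which point is nearest. It does not address the harder problem of a nearest neighbor that lies in a different bucket.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::vector<int>;   // hypothetical alias for one data point

// Sum over all dimensions of |a_i - b_i|^p (the p-norm distance raised to p).
// p = 2 gives squared Euclidean distance; p = 1 gives the L1 distance.
double pNormPowered(const Point& a, const Point& b, double p) {
    assert(a.size() == b.size() && p >= 1.0);
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double diff = std::fabs(static_cast<double>(a[i]) - b[i]);
        sum += std::pow(diff, p);
    }
    return sum;
}

// Index of the bucket element nearest to the probe; the first one wins ties.
std::size_t nearestInBucket(const std::vector<Point>& bucket,
                            const Point& probe, double p) {
    assert(!bucket.empty());
    std::size_t best = 0;
    double bestDist = pNormPowered(bucket[0], probe, p);
    for (std::size_t i = 1; i < bucket.size(); ++i) {
        double dist = pNormPowered(bucket[i], probe, p);
        if (dist < bestDist) {
            bestDist = dist;
            best = i;
        }
    }
    return best;
}
```

The choice of norm can change the answer: for a probe at (0, 0) and bucket points (3, 3) and (0, 5), the point (3, 3) is nearest under the L² norm, while (0, 5) is nearest under the L¹ norm.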

Limited credit

For a cost of 5 points, ignore the k parameter (although it must be present) and assume k is 3.
For a cost of 5 points, use sorting instead of QuickSelect to find the medians.

Useful tools

You have access to some useful tools. First, there is a sample Makefile at http://www.cs.uky.edu/~raphael/courses/CS315/prog3/Makefile. It has a run target that compiles your program (either kd.c or kd.cpp) and runs it. It also has a runWorking target that gets and runs a working program so you can compare your output against it.

A second tool is randGen.pl, which you used in the previous assignments. As before, if you invoke it with no parameters, it chooses a random seed and produces non-negative integers. If you invoke it with one parameter s, then s is the seed for the pseudo-random number generator, and the stream of numbers is therefore deterministic. If you invoke it with two parameters s and m, then m is a modulus limiting the size of the outputs; they will range from 0 to m-1. Here is how you can invoke randGen.pl with your program, seeding the random-number generator with 42, limiting the random numbers to the range [0 .. 9999], setting k to 3, n to 64, and p to 2:

 ./randGen.pl 42 10000 | ./kd 3 64 2

Warning: If you run randGen.pl by itself, it generates an unbounded list of numbers. You should always pipe its output into another program, such as kd or less.

You can also get a working program that satisfies the specifications at http://www.cs.uky.edu/~raphael/courses/CS315/prog3/workingKD. The Makefile mentioned above automatically gets a copy of this file for you if you make runWorking.

What to hand in

Submit via Canvas a copy of your program, any external documentation (including a Makefile), and the program's output when you run:

% randGen.pl 43 1000 | kd 3 64 10

The Makefile mentioned above has a zipAll target that creates a package ready to submit.