Assignment 3: Smalltalk

Background

One generalization of binary trees is the k-d tree, which stores k-dimensional data. Every internal node of a k-d tree indicates the dimension (1, 2, ...) and the value in that dimension that it discriminates by. An internal node has two children: one stores data that is less than or equal to that value in that dimension, and the other stores data that is greater than that value in that dimension. For example, if the node discriminates on dimension 2, value 10.7, then one child is for data with y value less than or equal to 10.7, and the other child is for data with y value greater than 10.7. Each leaf node represents a bucket containing no more than b elements of k-dimensional data. All data are found in the leaves.
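
To make the routing rule concrete, here is a minimal sketch of the decision a single internal node makes, using the dimension-2, value-10.7 example. The variable names and the sample point are illustrative only; they are not part of any required design.

```smalltalk
"One internal node: discriminate on dimension 2 (y) at value 10.7."
| dimension value point side |
dimension := 2.
value := 10.7.
point := #(3.5 10.7 8.0).    "hypothetical point: x, y, z"
"A point whose y is <= 10.7 goes to one child, otherwise to the other."
side := (point at: dimension) <= value
    ifTrue: ['less-than-or-equal child']
    ifFalse: ['greater-than child'].
side displayNl.
!
```

Note that a point lying exactly on the discriminant, as here, goes to the less-than-or-equal side.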

For this assignment, k is 3; that is, we will only be dealing with three-dimensional data. Dimension numbers are therefore 1, 2, and 3.

There are several strategies for building k-d trees. The preprocessing method (1) accumulates all the data in an array, (2) finds the best dimension to discriminate on, namely, the one with the widest range, (3) finds the best value of that dimension to discriminate on, namely, the median value in that dimension, (4) separates the data into two subarrays based on that discriminant, and (5) recurses on the subarrays. Recursion terminates when an array has size b or smaller. One can also devise online methods that add to existing trees.
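
Steps 2 through 4 of the preprocessing method can be sketched on a tiny made-up data set as follows. This is only one possible arrangement: it assumes each point is a three-element Array (a layout chosen for illustration), it uses the mean rather than the median as the discriminant, and it omits the recursion of step 5.

```smalltalk
"One preprocessing step on four hypothetical points."
| points bestDim bestRange column lo hi discriminant left right |
points := #( #(1 20 5) #(4 22 9) #(9 21 7) #(2 23 6) ).

"Step 2: pick the dimension with the widest range."
bestDim := 0.
bestRange := -1.
1 to: 3 do: [:d |
    column := points collect: [:p | p at: d].
    lo := column inject: column first into: [:a :b | a min: b].
    hi := column inject: column first into: [:a :b | a max: b].
    (hi - lo) > bestRange
        ifTrue: [bestRange := hi - lo. bestDim := d]].

"Step 3 (mean variant): average the values in that dimension."
column := points collect: [:p | p at: bestDim].
discriminant := (column inject: 0 into: [:sum :v | sum + v]) / column size.

"Step 4: split into <= discriminant and > discriminant."
left := points select: [:p | (p at: bestDim) <= discriminant].
right := points reject: [:p | (p at: bestDim) <= discriminant].

bestDim printNl.
discriminant printNl.
left size printNl.
right size printNl.
!
```

For this data the ranges are 8, 3, and 4, so dimension 1 is chosen; its mean is 4, which puts three points on the less-than-or-equal side and one on the greater side.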

Requirements

Write a program in Smalltalk that (1) reads a list of 1000 3-dimensional data values, (2) builds a k-d tree with those values, with b set to 10, using the preprocessing method (you may use the mean instead of the median to find the best value to discriminate on), and (3) reads an additional 10 3-dimensional data values, called probes, and for each probe, lists all the data values stored in the tree in the bucket where the probe would be found if it were in the tree.

Test your program both on your own data and on the data in http://www.cs.uky.edu/~raphael/courses/CS450/asg.smalltalk.data.

Hints

You don't have to construct your program the way I did, but to give you a start, I describe my program here.

  1. My version of the program is about 240 lines long.
  2. I added two methods to Array. One computes the range of values; the other averages the values.
  3. I built a class DataStore able to hold up to 1000 points. It uses three arrays, each of size 1000. It has a method to initialize from a file, a method (a "putter") to insert a point, a method (a "getter") to report the value of a given point, a method to report the number of points, and a method to return a quadruple: (dimension, discriminant, left, right) that subdivides the DataStore instance into two new instances.
  4. I built a class KDTree with a putter method to insert all the data from a DataStore and a method to print the contents of the bucket determined by searching for a point.
  5. My main program opens the file, builds a DataStore with the first 1000 points from the file, constructs a KDTree from that DataStore, then repeatedly reads points and probes for them.
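
As an illustration of hint 2, two helper methods might be added to Array along the following lines. The selectors valueRange and valueAverage are hypothetical names chosen here (the hint does not name the methods), and the bodies are one plausible sketch rather than the code from the program described above.

```smalltalk
!Array methodsFor: 'k-d tree helpers'!

valueRange
    "Answer the difference between the largest and smallest element.
     Assumes the receiver is non-empty and holds numbers."
    | lo hi |
    lo := hi := self first.
    self do: [:each |
        lo := lo min: each.
        hi := hi max: each].
    ^hi - lo
!

valueAverage
    "Answer the arithmetic mean of the elements (a Fraction when the
     sum of integers does not divide evenly)."
    ^(self inject: 0 into: [:sum :each | sum + each]) / self size
! !

#(2 9 4 5) valueRange printNl.
#(2 9 4 5) valueAverage printNl.
!
```

For the sample array the range is 9 - 2 = 7 and the average is 20 / 4 = 5.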

Logistics

The Smalltalk you will use is called gst. It is available in the Multilab and the CSLab. You will need code to open a file, read tokens, and convert them to integers. This code shows you how to read the first ten space-delimited tokens from a file, convert them to integers, and print them.

"Read the first ten space-delimited tokens of the file; print each as an integer."
| myTokens |
myTokens := (TokenStream new)
    setStream: (FileStream open: 'asg.smalltalk.data' mode: 'r').
10 timesRepeat: [myTokens next asInteger printNl].
!
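
Building on the code above, the tokens can be grouped into points. The sketch below assumes that each point appears in the input as three consecutive tokens in x, y, z order (the layout of the data file is not described here, so check it first). To stay self-contained it reads from a made-up string through TokenStream on:; substituting the file-based stream from the code above should work the same way.

```smalltalk
"Group consecutive tokens into 3-element points (hypothetical input)."
| tokens point |
tokens := TokenStream on: '12 7 30 5 41 18'.
2 timesRepeat: [
    point := Array new: 3.
    1 to: 3 do: [:d | point at: d put: tokens next asInteger].
    point printNl].
!
```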

Extra credit

  1. For each probe, find the element in its bucket that is nearest according to Euclidean distance.
  2. The nearest element to a probe may not be in its bucket, but rather in a nearby bucket. Figure out and program an algorithm to find the true nearest neighbor.
  3. Devise an online method for building the tree and run the same experiments.
  4. Generalize your code to handle arbitrary k.
  5. Use the median instead of the mean to choose the discriminant.
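
For the first extra-credit item, a nearest-in-bucket search might look like the sketch below. The probe, the bucket contents, and the variable names are all made up for illustration. The sketch compares squared distances, which orders the candidates the same way as the Euclidean distance itself and avoids taking square roots.

```smalltalk
"Find the bucket element nearest to a probe (hypothetical data)."
| probe bucket squaredDistance nearest |
probe := #(5 5 5).
bucket := #( #(1 2 3) #(6 5 4) #(9 9 9) ).

"Sum of squared coordinate differences over the three dimensions."
squaredDistance := [:p :q |
    (1 to: 3) inject: 0 into: [:sum :d |
        sum + ((p at: d) - (q at: d)) squared]].

nearest := bucket first.
bucket do: [:each |
    (squaredDistance value: each value: probe)
        < (squaredDistance value: nearest value: probe)
            ifTrue: [nearest := each]].
nearest printNl.
!
```

Here the squared distances are 29, 2, and 48, so the second element is the nearest.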

Due date

This assignment is due at the start of class time on the day indicated in the syllabus. See the syllabus for the late policy. Submit the assignment by email to raphael @cs.uky.edu.