You have been hired by Grate Books Publishing to help authors with their prose. You are supposed to warn authors if they use words too frequently. Authors give you a document as an ordinary text file. You have to collect the words of the file (case is significant; discard punctuation and spacing; only use letters A-Z and a-z and underscore '_') and print two outputs. Both outputs list all the words and the number of times they appear, one word per line, no word duplicated. For example, you might print
dark 1
stormy 1
night 4
the 18
they 2
The first output should be sorted by word (in ASCII alphabetical order), the second by number of occurrences (in reverse numerical order; ties may be in any order).
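The word-collection rule (only A-Z, a-z, and '_' belong to a word; every other character separates words; case is kept) might be sketched as follows. The names isWordChar, readWord, and MAX_WORD are hypothetical, and the assignment does not state a maximum word length, so the truncation of overlong words here is an assumption rather than a requirement.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical limit: the assignment gives no maximum word length. */
#define MAX_WORD 256

/* A word character is A-Z, a-z, or underscore. Explicit ASCII ranges are
   used instead of isalpha() so the result does not depend on the locale. */
int isWordChar(int c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
}

/* Reads the next word from `in` into `buf` (capacity `cap`, at least 2).
   Returns 1 if a word was stored, 0 at end of input.
   Assumption: characters beyond cap-1 are consumed but dropped. */
int readWord(FILE *in, char *buf, size_t cap) {
    int c;
    size_t len = 0;

    /* Skip everything that cannot start a word. */
    while ((c = getc(in)) != EOF && !isWordChar(c))
        ;
    if (c == EOF)
        return 0;

    /* Collect the maximal run of word characters. */
    do {
        if (len + 1 < cap)
            buf[len++] = (char)c;
    } while ((c = getc(in)) != EOF && isWordChar(c));

    buf[len] = '\0';
    return 1;
}
```

Under this rule the input "It's a dark_night" yields the words It, s, a, and dark_night, because the apostrophe is a separator and the underscore is not. A main loop would call readWord(stdin, word, sizeof word) until it returns 0.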
Your program should be called wordCount. It should accept the text file from standard input. The output must obey the following specifications. Output both lists to standard output, with a single blank line between the lists, and no other information (such as headings). Each line holds a single word and its count, separated by a single space.
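The output format can be illustrated with a short sketch. It assumes the two lists have already been sorted into arrays of a hypothetical Entry struct; printList and printReport are illustrative names, not part of the assignment.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical record: one distinct word and its number of occurrences. */
struct Entry {
    const char *word;
    int count;
};

/* Writes one "word count" line per entry, separated by a single space. */
void printList(FILE *out, const struct Entry *e, size_t n) {
    size_t i;
    for (i = 0; i < n; i++)
        fprintf(out, "%s %d\n", e[i].word, e[i].count);
}

/* Writes both lists with exactly one blank line between them
   and nothing else (no headings). */
void printReport(FILE *out, const struct Entry *sortedByWord,
                 const struct Entry *sortedByCount, size_t n) {
    printList(out, sortedByWord, n);
    putc('\n', out);
    printList(out, sortedByCount, n);
}
```

In the real program the call would be printReport(stdout, ...), since both lists go to standard output.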
You must use a hash table to store the counts. You may use any hash function and any collision-resolution scheme you wish.
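As one possible shape for the hash table, here is a minimal sketch that uses separate chaining and a simple multiplicative string hash. Both choices are assumptions, since the assignment allows any hash function and any collision-resolution scheme. The names Node, Table, hashWord, tableAdd, and tableCount, and the fixed bucket count, are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define NUM_BUCKETS 4099   /* hypothetical fixed table size (a prime) */

struct Node {
    char *word;            /* private copy of the word */
    int count;             /* occurrences seen so far */
    struct Node *next;     /* next entry in the same bucket */
};

struct Table {
    struct Node *buckets[NUM_BUCKETS];  /* must start out all NULL */
    size_t numWords;                    /* number of distinct words */
};

/* Multiplicative string hash: multiply by 31, add the next character. */
unsigned long hashWord(const char *s) {
    unsigned long h = 0;
    while (*s != '\0')
        h = h * 31 + (unsigned char)*s++;
    return h;
}

/* Records one occurrence of `word`.
   Returns the word's new count, or 0 if memory ran out. */
int tableAdd(struct Table *t, const char *word) {
    size_t i = hashWord(word) % NUM_BUCKETS;
    struct Node *n;

    /* Case-sensitive search of the chain for an existing entry. */
    for (n = t->buckets[i]; n != NULL; n = n->next)
        if (strcmp(n->word, word) == 0)
            return ++n->count;

    /* Not found: push a new node onto the front of the chain. */
    n = malloc(sizeof *n);
    if (n == NULL)
        return 0;
    n->word = malloc(strlen(word) + 1);
    if (n->word == NULL) {
        free(n);
        return 0;
    }
    strcpy(n->word, word);
    n->count = 1;
    n->next = t->buckets[i];
    t->buckets[i] = n;
    t->numWords++;
    return 1;
}

/* Returns how many times `word` has been added (0 if never). */
int tableCount(const struct Table *t, const char *word) {
    const struct Node *n;
    for (n = t->buckets[hashWord(word) % NUM_BUCKETS]; n != NULL; n = n->next)
        if (strcmp(n->word, word) == 0)
            return n->count;
    return 0;
}
```

A Table declared with static storage (for example, static struct Table t;) starts with every bucket NULL. Because strcmp is case-sensitive, "the" and "The" are counted as different words, as the assignment requires. A fixed bucket count keeps the sketch short; a table that grows with the input is another reasonable design.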
You must write your own sorting algorithm to sort the output and your own hash-table routines. (You may not use any library routines for sorting or hashing.) You will lose 2 points if you use selection sort or insertion sort; you will get full credit for heapsort, merge sort, or quicksort. You may not use any other sorting method.
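Of the three full-credit methods, merge sort is perhaps the simplest to sketch. The version below is one possible top-down implementation, assuming the hash-table contents have first been copied into an array of a hypothetical Entry struct. The names Entry, EntryCmp, byWord, byCountDesc, mergeSortRange, and mergeSort are illustrative only. It uses strcmp for ASCII ordering and malloc for scratch space, neither of which is a sorting or hashing routine.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record: one distinct word and its number of occurrences. */
struct Entry {
    const char *word;
    int count;
};

/* Comparison result is negative when a belongs before b. */
typedef int (*EntryCmp)(const struct Entry *a, const struct Entry *b);

/* ASCII order by word, so uppercase letters sort before lowercase. */
int byWord(const struct Entry *a, const struct Entry *b) {
    return strcmp(a->word, b->word);
}

/* Reverse numerical order by count: larger counts come first. */
int byCountDesc(const struct Entry *a, const struct Entry *b) {
    return (a->count < b->count) - (a->count > b->count);
}

/* Sorts a[lo..hi) using tmp (same length as a) as scratch space. */
void mergeSortRange(struct Entry *a, struct Entry *tmp,
                    size_t lo, size_t hi, EntryCmp cmp) {
    size_t mid, i, j, k;
    if (hi - lo < 2)
        return;
    mid = lo + (hi - lo) / 2;
    mergeSortRange(a, tmp, lo, mid, cmp);
    mergeSortRange(a, tmp, mid, hi, cmp);

    /* Merge the two sorted halves; on ties the left element goes first. */
    i = lo; j = mid; k = lo;
    while (i < mid && j < hi)
        tmp[k++] = (cmp(&a[j], &a[i]) < 0) ? a[j++] : a[i++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    for (k = lo; k < hi; k++)
        a[k] = tmp[k];
}

/* Returns 0 on success, -1 if scratch memory could not be allocated. */
int mergeSort(struct Entry *a, size_t n, EntryCmp cmp) {
    struct Entry *tmp;
    if (n < 2)
        return 0;
    tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    mergeSortRange(a, tmp, 0, n, cmp);
    free(tmp);
    return 0;
}
```

Passing the comparison as a parameter lets the same routine produce both outputs: mergeSort(entries, n, byWord) for the first list and mergeSort(entries, n, byCountDesc) for the second.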
You have access to some useful tools. First, there is a sample Makefile at http://www.cs.uky.edu/~raphael/courses/CS315/prog4/Makefile. It has a run target that compiles your program (either wordCount.c or wordCount.cpp) and runs it. It also has a runWorking target that gets and runs a working program so you can compare your output against it.
Your submission should include your program, all documentation, your program's output on the data in http://www.cs.uky.edu/~raphael/courses/CS315/prog4/data.txt, your own test data, and your program's output on that test data.