CIS 260 - Text analysis using a BST

The objective of this assignment is to use a binary search tree to sort and obtain a count of all the words in a lengthy text file. There are several text files available to test your code. All of the text files are modified versions from Project Gutenberg. The files are available via the links below.

Resources

  1. Countable.java: the complete Countable interface
  2. BST.java: a complete binary search tree class using generics, but without the modifications needed to support duplicate counts in the Nodes.

Assignment requirements

Here's a list of things you will need.

  1. A Countable interface definition. Countable should have four public methods which all return ints:
  2. A binary search tree class that uses generics so it can store any object that implements the Comparable interface. The binary search tree class used in class and posted in the notes satisfies this requirement.
  3. A Node class which implements the Countable interface. The Node class contained within the posted binary search tree class will have to be modified as follows:
    1. Explicitly state that Node implements the Countable interface.
    2. Add an integer variable to hold the count.
    3. Implement the four methods in the Countable interface.
    4. Modify the toString() method to also display the count in parentheses after the word.
  4. You will have to modify the binary search tree code that we discussed in class and which is posted as follows:
    1. Modify the insert routine so duplicates are no longer added as new nodes, but instead increment the counter in the node that was found (remember that the Node class now implements Countable and therefore has a method available for incrementing the count).
    2. Modify the delete routine so the count in the node is decremented, and if the count is greater than 0, just return true. If the count is 0, continue with the delete code as it is written (remember that the Node class now implements Countable and therefore has methods available for decrementing and accessing the count).
  5. Here's what the mainline will have to do:
    1. Create a binary search tree with nodes that holds String objects as data
    2. Open the cities.txt file
    3. For each line in the file
      • Read in the line
      • Tokenize the line read in using " -\"\'?!,;:.()_`*" for the delimiters
      • For each token in the line, convert it to lower case and insert it into the BST
    4. Close the file
    5. Display all the words in the BST in alphabetical order along with their count (the binary search tree class already has a method for that)

Sample output

Note that some of the words include punctuation because the characters in the text file did not exactly match the delimiters we provided. In addition, you may see slightly different output where special characters are displayed in the list below.

Processing file: alice.txt
1 (1)
100 (1)
111 (1)
116 (1)
126 (1)
... many lines omitted ...
yours (3)
yourself (10)
youth (6)
zealand (1)
zigzag (1)

Processing file: cities.txt
1757 (1)
1767 (2)
1792 (1)
21 (1)
a (2953)
... many lines omitted ...
youth (9)
youthful (3)
youthfulness (1)
youths (1)
zealous (2)

Processing file: gulliver.txt
&c (1)
10 (1)
10th (1)
11th (1)
13th (1)
... many lines omitted ...
ôyou (3)
ôæ (1)
ôæthat (3)
ôæwhereas (1)
ö (313)

Processing file: mobydick.txt
$20 (1)
$7 (1)
& (2)
000 (20)
1 (2)
... many lines omitted ...
zone (5)
zoned (2)
zones (3)
zoology (2)
zoroaster (1)

Processing file: warAndPeace.txt
000 (8)
07 (2)
1 (21)
10 (6)
100 (2)
... many lines omitted ...
zubovski (2)
zum (1)
zweck (1)
α (4)
Ω (1)