Text analysis using a BST

Objectives

  • Modify a binary search tree (BST) class to handle duplicate values by counting occurrences
  • Handle command line arguments
  • Read text files
  • Tokenize text (separate into words)
  • Implement an algorithm using a binary search tree to analyze text
  • Display the sorted output from a binary search tree (BST)
  • Test and debug a complex application

Overview

The objective of this assignment is to use a binary search tree to sort and obtain a count of all the words in a lengthy text file. There are several text files available to test your code. All of the text files are modified versions from Project Gutenberg. The files are available via the links below.

Resources

  1. Countable.java: the complete Countable interface
  2. BST.java: a complete binary search tree class using generics, but without the modifications needed to support duplicate counts in the Nodes.

Assignment requirements

Here's a list of things you will need.

  1. A Countable interface definition. Countable should have four public methods which all return ints:
    • int incCount(): increments a counter and returns its new value
    • int decCount(): decrements a counter and returns its new value
    • int getCount(): returns the value of the counter
    • int setCount(int): sets the value of the counter and returns that value
  2. A binary search tree class that uses generics so it can store any object that implements the Comparable interface. The binary search tree class used in class and posted in the notes satisfies this requirement.
  3. A Node class which implements the Countable interface. The Node class contained within the posted binary search tree class will have to be modified as follows:
    1. Explicitly state that Node implements the Countable interface.
    2. Add an integer variable to hold the count.
    3. Implement the four methods in the Countable interface.
    4. Modify the toString() method to also display the count in parentheses after the word.
  4. You will have to modify the binary search tree code that we discussed in class and which is posted as follows:
    1. Modify the insert routine so duplicates are no longer added as new nodes, but instead increment the counter in the node that was found (remember that the Node class now implements Countable and therefore has a method available for incrementing the count).
    2. Modify the delete routine so the count in the node is decremented, and if the count is greater than 0, just return true. If the count is 0, continue with the delete code as it is written (remember that the Node class now implements Countable and therefore has methods available for decrementing and accessing the count).
  5. Here's what the mainline will have to do:
    1. Create a binary search tree with nodes that holds String objects as data
    2. Open the cities.txt file
    3. For each line in the file
      • Read in the line
      • Tokenize the line read in using " -\"\'?!,;:.()_`*" for the delimiters
      • For each token in the line, convert it to lower case and insert it into the BST
    4. Close the file
    5. Display all the words in the BST in alphabetical order along with their count (the binary search tree class already has a method for that)

Rubric

  • (10 pts) Node class modifications complete and correct.
  • (8 pts) BST insert method modifications complete and correct.
  • (8 pts) BST delete method modifications complete and correct.
  • (5 pts) Main program correctly opens and reads file specified by the command line argument(s).
  • (9 pts) Main program correctly tokenizes and converts the text read from the file.
  • (5 pts) Main program correctly inserts words into a binary search tree.
  • (5 pts) Main program correctly displays the words (and their counts) in alphabetical order using the BST.

Sample output

Note that some of the words include punctuation because the characters in the text file did not exactly match the delimiters we provided. In addition, you may see slightly different output where special characters are displayed in the list below.

Processing file: alice.txt 1 (1) 100 (1) 111 (1) 116 (1) 126 (1) ... many lines omitted ... yours (3) yourself (10) youth (6) zealand (1) zigzag (1) Processing file: cities.txt 1757 (1) 1767 (2) 1792 (1) 21 (1) a (2953) ... many lines omitted ... youth (9) youthful (3) youthfulness (1) youths (1) zealous (2) Processing file: gulliver.txt &c (1) 10 (1) 10th (1) 11th (1) 13th (1) ... many lines omitted ... ôyou (3) ôæ (1) ôæthat (3) ôæwhereas (1) ö (313) Processing file: mobydick.txt $20 (1) $7 (1) & (2) 000 (20) 1 (2) ... many lines omitted ... zone (5) zoned (2) zones (3) zoology (2) zoroaster (1) Processing file: warAndPeace.txt 000 (8) 07 (2) 1 (21) 10 (6) 100 (2) ... many lines omitted ... zubovski (2) zum (1) zweck (1) α (4) Ω (1)