CIS 260 - Text analysis using a BST
The objective of this assignment is to use a binary search tree
to sort and obtain a count of all the words in a lengthy text file.
There are several text files available to test your code. All of the
text files are modified versions from
Project Gutenberg.
The files are available via the links below.
Resources
- Countable.java: the complete Countable interface
- BST.java: a complete binary search tree class using generics,
but without the modifications needed to support duplicate counts in the Nodes.
Assignment requirements
Here's a list of things you will need.
- A Countable interface definition. Countable should have
four public methods which all return ints:
- int incCount(): increments a counter and returns its new value
- int decCount(): decrements a counter and returns its new value
- int getCount(): returns the value of the counter
- int setCount(int): sets the value of the counter and returns that value
- A binary search tree class that uses generics so it can store any object
that implements the Comparable interface. The binary search tree class used
in class and posted in the notes satisfies this requirement.
- A Node class which implements the Countable interface. The Node class
contained within the posted binary search tree class will have to be modified
as follows:
- Explicitly state that Node implements the Countable interface.
- Add an integer variable to hold the count.
- Implement the four methods in the Countable interface.
- Modify the toString() method to also display the count in parentheses after the word.
- You will have to modify the binary search tree code that we discussed in
class and which is posted as follows:
- Modify the insert routine so duplicates are no longer added as new nodes,
but instead increment the counter in the node that was found (remember that the Node
class now implements Countable and therefore has a method available
for incrementing the count).
- Modify the delete routine so the count in the node is decremented, and if the
count is greater than 0, just return true. If the count is 0, continue with the
delete code as it is written (remember that the Node
class now implements Countable and therefore has methods available
for decrementing and accessing the count).
- Here's what the mainline will have to do:
- Create a binary search tree with nodes that holds String objects as data
- Open the cities.txt file
- For each line in the file
- Read in the line
- Tokenize the line read in using " -\"\'?!,;:.()_`*" for the delimiters
- For each token in the line, convert it to lower case and insert it into the BST
- Close the file
- Display all the words in the BST in alphabetical order along with their count
(the binary search tree class already has a method for that)
Sample output
Note that some of the words include punctuation because the characters
in the text file did not exactly match the delimiters we provided. In
addition, you may see slightly different output where special characters
are displayed in the list below.
Processing file: alice.txt
1 (1)
100 (1)
111 (1)
116 (1)
126 (1)
... many lines omitted ...
yours (3)
yourself (10)
youth (6)
zealand (1)
zigzag (1)
Processing file: cities.txt
1757 (1)
1767 (2)
1792 (1)
21 (1)
a (2953)
... many lines omitted ...
youth (9)
youthful (3)
youthfulness (1)
youths (1)
zealous (2)
Processing file: gulliver.txt
&c (1)
10 (1)
10th (1)
11th (1)
13th (1)
... many lines omitted ...
ôyou (3)
ôæ (1)
ôæthat (3)
ôæwhereas (1)
ö (313)
Processing file: mobydick.txt
$20 (1)
$7 (1)
& (2)
000 (20)
1 (2)
... many lines omitted ...
zone (5)
zoned (2)
zones (3)
zoology (2)
zoroaster (1)
Processing file: warAndPeace.txt
000 (8)
07 (2)
1 (21)
10 (6)
100 (2)
... many lines omitted ...
zubovski (2)
zum (1)
zweck (1)
α (4)
Ω (1)