CIS 250 - Binary search tree sort and count

CIS 250 Binary search tree sort and count

The objective of this assignment is to use a binary search tree to sort and obtain a count of all the words in a lengthy text file. There are several text files available to test your code. All of the text files are modified versions from Project Gutenberg. The files are available via the links below. They are also available on the server we are using in the the directory: ~dklick/public_html/datasets/

cities.txt: A Tale of Two Cities
alice.txt: Alice's Adventures In Wonderland
gulliver.txt: Gulliver's Travels
mobydick.txt: Moby Dick
warAndPeace.txt: War and Peace

Assignment notes

Here's a list of things you will need.

A templated binary search tree class that you can store a user-defined Word class in. You will have to define the Word class with specific requirements that will be detailed below. We will work through the creation of a templated BST class that you can use for this assignment with minor modifications, which will also be detailed below.
bintree.h is also available on the server we use in the ~dklick/examples/cis250/bintree directory.
The Word class must meet the following requirements:
- a private string variable to hold the word
- a default constructor
- a constructor that accepts a string, converts that string to lowercase, and uses that to set the private string variable [you can do this by including the algorithm, functional and cctype libraries, using std::ptr_fun, and executing the following statement:
  transform(word.begin(), word.end(), word.begin(), ptr_fun(::tolower));]
- overloads of the six relational operators (this should be easy since you just have to return the results of comparing the string objects that the Word objects contain)
- an overload of the insertion operator that will insert the private word variable into the stream
The Node class should contain the usual templated data type, and left and right pointers. It should also contain:
1. a private integer to hold a count (for duplicated nodes); initialize to 1 in constructor(s)
2. a getCount() method that returns the count
3. a setCount(int) method that sets the count and returns the result
4. an overload of the insertion operator that displays a node using the data type's insertion operator overload, and then displays the count in parentheses following that.
The BST must accept duplicate node data values by doing the following:
1. when inserting, if a duplicate value is found in the tree, increment the count and return true instead of returning false
2. when deleting, whenever the value is found in the tree, see if the count for that node is greater than 1; if it is, just decrement the node's count; otherwise, execute the normal code to delete the node
Here's what the mainline will have to do:
- create a binary search tree with nodes that hold Word objects as data
- open the data file
- for each line in the file
  - read in the line
  - tokenize the line read in using " -\"\'?!,;:.()_`*" for the delimiters
  - for each token in the line, make it into a Word object and insert it into the BST
- close the file
- display all the words in the BST in alphabetical order along with their count

Sample output

Note that some of the words include punctuation because the characters in the text file did not exactly match the delimiters we provided. In addition, you may see slightly different output where special characters are displayed in the list below.

Processing file: alice.txt
1 (1)
100 (1)
111 (1)
116 (1)
126 (1)
... many lines omitted ...
yours (3)
yourself (10)
youth (6)
zealand (1)
zigzag (1)

Processing file: cities.txt
1757 (1)
1767 (2)
1792 (1)
21 (1)
a (2953)
... many lines omitted ...
youth (9)
youthful (3)
youthfulness (1)
youths (1)
zealous (2)

Processing file: gulliver.txt
&c (1)
10 (1)
10th (1)
11th (1)
13th (1)
... many lines omitted ...
"you (3)
"' (1)
"'that (3)
"'whereas (1)
" (313)

Processing file: mobydick.txt
$20 (1)
$7 (1)
& (2)
000 (20)
1 (2)
... many lines omitted ...
zone (5)
zoned (2)
zones (3)
zoology (2)
zoroaster (1)

Processing file: warAndPeace.txt
000 (8)
07 (2)
1 (21)
10 (6)
100 (2)
... many lines omitted ...
zubovski (2)
zum (1)
zweck (1)
à (4)
être (1)