CIS 160 - Text analysis

Objectives

  • Read text from a file
  • Process strings to count individual character occurrences
  • Use arrays to accumulate individual character counts
  • Parse strings to tokenize words
  • Use repetition to process command line arguments
  • Use multiple variables to keep track of file totals and grand totals
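
The array and totals objectives can be illustrated with a minimal sketch. It assumes that "alphabetic" means the letters A-Z with case ignored, and that ties are broken alphabetically; neither choice is specified by the assignment. The class and method names (LetterCounts, countLetters, topLetters) are hypothetical.

```java
class LetterCounts {
    /** Adds each letter A-Z found in line to counts (index 0 is 'A'). Case is ignored. */
    static void countLetters(String line, int[] counts) {
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c >= 'a' && c <= 'z') {
                c = (char) (c - 'a' + 'A');   // fold lowercase to uppercase
            }
            if (c >= 'A' && c <= 'Z') {
                counts[c - 'A']++;
            }
        }
    }

    /**
     * Returns the n most frequent letters, separated by spaces.
     * Assumption: ties go to the letter that comes first alphabetically.
     */
    static String topLetters(int[] counts, int n) {
        int[] copy = counts.clone();          // leave the caller's counts untouched
        StringBuilder result = new StringBuilder();
        for (int k = 0; k < n && k < copy.length; k++) {
            int best = 0;
            for (int i = 1; i < copy.length; i++) {
                if (copy[i] > copy[best]) {
                    best = i;
                }
            }
            if (result.length() > 0) {
                result.append(' ');
            }
            result.append((char) ('A' + best));
            copy[best] = -1;                  // exclude this letter from later passes
        }
        return result.toString();
    }

    public static void main(String[] args) {
        int[] fileCounts = new int[26];       // counts for one file
        int[] grandCounts = new int[26];      // counts across all files
        countLetters("Hello, World!", fileCounts);
        for (int i = 0; i < fileCounts.length; i++) {
            grandCounts[i] += fileCounts[i];  // fold file totals into grand totals
        }
        System.out.println("Top 3: " + topLetters(fileCounts, 3));
    }
}
```

Keeping one array per file and a second array for the grand totals mirrors the way the line, word, and character totals are tracked.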

Program requirements

Write an application that reads in text files and produces statistics for those files. The program must meet the following requirements:

  1. The program should get the names of the files from the command line arguments passed to the program.
  2. Display an error message indicating how the program should be used if the user does not supply any command line arguments.
  3. Have the program display an error message if a file cannot be opened or read, but continue to process any other files specified on the command line.
  4. Count and display the number of lines, words, and characters for each file as it is processed.
  5. Count the frequency of each alphabetic character in each file and display a list of the top ten letters from each file.
  6. Keep track of all the statistics from each file and print out overall statistics for all of the files together.
  7. Use the following as the delimiter string for your StringTokenizer:
    String delims = " .?-!,;:\"_+=@$%*()[]{}|\\<>~`";
  8. These files may be used for testing purposes:
    1. Cathedral.txt: The Cathedral and the Bazaar
    2. cities.txt: A Tale of Two Cities
    3. alice.txt: Alice's Adventures In Wonderland
    4. gulliver.txt: Gulliver's Travels
    5. mobydick.txt: Moby Dick
    6. warAndPeace.txt: War and Peace
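
Requirements 1 through 4 and 7 can be sketched roughly as follows. This is one possible arrangement, not a reference solution: the class name FileLoopSketch and the helper countWords are hypothetical, and character counts and letter frequencies are omitted to keep the example short. The delimiter string is the one given in requirement 7, and the messages follow the sample run.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.StringTokenizer;

class FileLoopSketch {
    /** Delimiter string from requirement 7. */
    static final String DELIMS = " .?-!,;:\"_+=@$%*()[]{}|\\<>~`";

    /** Returns the number of tokens in line, splitting on DELIMS. */
    static int countWords(String line) {
        return new StringTokenizer(line, DELIMS).countTokens();
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("Usage: java TextAnalyzer filelist");
            return;
        }

        long totalLines = 0;
        long totalWords = 0;

        for (String name : args) {
            long lines = 0;
            long words = 0;
            // try-with-resources closes the reader even if reading fails
            try (BufferedReader in = new BufferedReader(new FileReader(name))) {
                String line;
                while ((line = in.readLine()) != null) {
                    lines++;
                    words += countWords(line);
                }
            } catch (IOException e) {
                System.out.println("Error: Could not open file: " + name);
                continue;   // skip this file, keep processing the rest
            }
            System.out.println("For file: " + name);
            System.out.println("Lines: " + lines + " Words: " + words);
            totalLines += lines;
            totalWords += words;
        }

        System.out.println("For all files:");
        System.out.println("Tot Lines: " + totalLines + " Tot Words: " + totalWords);
    }
}
```

Because the apostrophe is not in the delimiter string, a contraction such as "It's" counts as one word, while a hyphenated term such as "test-case" counts as two.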

Sample runs

    C:\>java TextAnalyzer
    Usage: java TextAnalyzer filelist

    C:\>java TextAnalyzer xyzzy.txt cities.txt blahblahblah Cathedral.txt a
    Error: Could not open file: xyzzy.txt
    For file: cities.txt
    Lines: 16047 Words: 137240 Chars: 759288
    Top 10: E T A O N I H S R D
    Error: Could not open file: blahblahblah
    For file: Cathedral.txt
    Lines: 1894 Words: 17723 Chars: 109741
    Top 10: E T O A I N S R L H
    Error: Could not open file: a
    For all files:
    Tot Lines: 17941 Tot Words: 154963 Tot Chars: 869029
    Top 10: E T A O N I S H R D

Note: The sample run above shows the character totals you get when you use the file length for that calculation. If you instead add up the lengths of the lines you read in from the files, the counts will not include the newline characters, so your character counts would be:

  • Chars: 105955 (Cathedral.txt)
  • Chars: 743241 (cities.txt)
  • Chars: 849196 (total)
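
The two ways of counting characters can be compared with a short sketch. The class name CharCountSketch and the method charsFromLines are hypothetical. File.length() reports the size in bytes, so it matches the character count only when every character is stored as one byte, which is assumed here for the test files.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.UncheckedIOException;

class CharCountSketch {
    /**
     * Sums the lengths of the lines read from in.
     * Line terminators are stripped by the reader, so they are not counted.
     */
    static long charsFromLines(BufferedReader in) {
        return in.lines().mapToLong(String::length).sum();
    }

    public static void main(String[] args) {
        for (String name : args) {
            File file = new File(name);
            try (BufferedReader in = new BufferedReader(new FileReader(file))) {
                System.out.println(name
                        + ": file length = " + file.length()
                        + ", sum of line lengths = " + charsFromLines(in));
            } catch (IOException | UncheckedIOException e) {
                System.out.println("Error: Could not open file: " + name);
            }
        }
    }
}
```

The gap between the two numbers depends on the line terminator used in the file: one character per line for "\n" and two for "\r\n".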

Rubric

  • 3 points for following the style conventions discussed in class, including documentation comments, indentation, spaces instead of tabs, naming conventions, etc.
  • 2 points for producing a helpful message when no command line arguments are provided
  • 2 points for processing the command line arguments properly
  • 2 points for displaying an error message for files that do not exist or cannot be read (and continuing the program)
  • 5 points for correctly reading each line from every file
  • 2 points for correctly calculating the number of lines for each individual file
  • 4 points for correctly calculating the number of words for each individual file
  • 2 points for correctly calculating the number of characters for each individual file
  • 5 points for correctly calculating and displaying the top ten characters for each individual file
  • 2 points for correctly calculating the number of lines for all files together
  • 4 points for correctly calculating the number of words for all files together
  • 2 points for correctly calculating the number of characters for all files together
  • 5 points for correctly calculating and displaying the top ten characters for all files together