CS 127 Spelling Checker
Note: Changes have been made to this assignment to clean it up for re-use
later (11/2/01).
Design document due Friday, 10/19/01
Project due Due Wednesday, 10/24/01
The Assignment
For this assignment you will write a spell checker, by storing a dictionary
as a sorted list of words. Its structure, in pseudocode, might look like
this:
Open the dictionary file.
For each word in the dictionary
add the word to the list.
Close the dictionary file.
Open the big monster text file.
For each word in the BMTF
make proper adjustments to case of letters.
search the list for the word.
if word is misspelled, store it.
Close the BMTF.
Display all misspelled words to the screen in alphabetical order.
Note that many of these pseudocode lines will be handled by function calls
in the real code.
The Dictionary
The central purpose of this assignment is to write an application which
uses binary search. While you're debugging and testing your code at first,
MAKE SURE to use a small dictionary file and a small text file. Otherwise,
you'll put too much strain on our servers. Later, you'll need to try it
on big files. I've placed a big dictionary in the file /Accounts/courses/cs127/dmusican/web2.txt
. Try the UNIX command "wc" on it, and you will discover that this dictionary
has 234,936 words in it. The words are already sorted, and one per line.
You will need to build the dictionary in memory, and be able to search
for a word in the dictionary using binary search..
The Input
You should not use standard input (cin) to obtain either of the files.
Your program should ask the user for the location of the dictionary and
the location of the document to be checked.
Capital letters, punctuation, and all that jazz
The dictionary mentioned above contains words with upper and lower case
letters.
In order to keep this problem from becoming a terribly tedious exercise
in details, you should use the following over-simplified rules.
Punctuation:
-
You should remove all punctuation from a word before checking its spelling.
-
This includes apostrophes (') and hyphens (-), even if such words appear
in the dictionary with those symbols in them.
-
"can't" should be checked as "cant"
-
"do#g" should be checked as "dog"
-
"proto-human" should be checked as "protohuman".
If a word in your document is not at the beginning of a sentence:
-
Its spelling must match a word in the dictionary with exactly the same
capitalization in order to be spelled correctly.
-
If the dictionary contains "Zeus" and your document contains the word "zeus",
your word is misspelled.
-
If the dictionary contains the word "person" and your document contains
the word "Person" (not at the beginning of a sentence), your word is misspelled.
-
You should report the word to the user with precisely the same capitalization
as it appears in the document.
If a word in your document is at the beginning of a sentence:
-
There are two possible ways the word can match the dictionary to be considered
correctly spelled.
-
The first way is to match the word exactly.
-
The second way is to change the first letter of your word to lower case,
and then see if it matches the dictionary.
-
If the first word of your sentence is "The", your word is spelled correctly.
-
If the first word of your sentence is "the", it is also spelled correctly.
-
If the first word of your sentence is "Geblotnik", it is not spelled correctly.
-
Any word that follows a period is assumed to be the first word of a sentence.
-
The first word of your document is also considered to be the first word
of a sentence.
-
You should report the word to the user with precisely the same capitalization
as it appears in the document.
The Output
You should list all misspelled words to the screen in alphabetical order.
-
Each misspelled word should only be listed once.
-
You may use any sorting algorithm you like - you do not need to be fancy
here (unless you want to be).
-
You should sort all the words lexicographically, including words with capital
letters.
-
Words starting with capital letters should appear before words appearing
with lower case letters.
Implementation Details
You will need to decide how to store the dictionary and the misspelled
words. They're both rather similar in that they both store sorted strings.
On the other hand, the dictionary is static whereas the misspelled words
are dynamic. Choose appropriate data structures to deal with each of the
two situations. Justify your decision with documentation in your code.
Good luck!