CS 127 Spelling Checker
Assigned on Monday, 2/11/02
Design document due Wednesday, 2/13/02
Project due Due Monday, 2/18/02
The Assignment
For this assignment you will write a spell checker, by storing a dictionary
as a sorted list of words. Its structure, in pseudocode, might look like
this:
Open the dictionary file.
For each word in the dictionary
add the word to the list.
Close the dictionary file.
Open the big monster text file.
For each word in the BMTF
make proper adjustments to case of letters.
search the list for the word.
if word is misspelled, store it.
Close the BMTF.
Display all misspelled words to the screen in alphabetical order.
Note that many of these pseudocode lines will be handled by function calls
in the real code.
The Dictionary
The central purpose of this assignment is to write an application which uses
binary search. While you're debugging and testing your code at first, MAKE SURE
to use a small dictionary file and a small text file. Otherwise, you'll put
too much strain on our servers. Later, you'll need to try it on big files. I've
placed a big dictionary in the file /Accounts/courses/cs127/dmusican/web2.txt
. Try the UNIX command "wc" on it, and you will discover that this dictionary
has 234,936 words in it. The words are sorted (alphabetically, not lexicographically),
and one per line. You will need to build the dictionary in memory, and be able
to search for a word in the dictionary using binary search.
Please do as much of the large scale testing as you can on lab machines, and
not on prism.
The Input
You should not use standard input (cin) to obtain either of the files.
Your program should ask the user for the location of the dictionary and
the location of the document to be checked.
Capital letters, punctuation, and all that jazz
The dictionary mentioned above contains words with upper and lower case
letters.
In order to keep this problem from becoming a terribly tedious exercise
in details, you should use the following over-simplified rules.
Punctuation:
-
You should remove all punctuation from a word before checking its spelling.
-
This includes apostrophes (') and hyphens (-), even if such words appear
in the dictionary with those symbols in them.
-
"can't" should be checked as "cant"
-
"do#g" should be checked as "dog"
-
"proto-human" should be checked as "protohuman".
If a word in your document is not at the beginning of a sentence:
-
Its spelling must match a word in the dictionary with exactly the same
capitalization in order to be spelled correctly.
-
If the dictionary contains "Zeus" and your document contains the word "zeus",
your word is misspelled.
-
If the dictionary contains the word "person" and your document contains
the word "Person" (not at the beginning of a sentence), your word is misspelled.
-
You should report the word to the user with precisely the same capitalization
as it appears in the document.
If a word in your document is at the beginning of a sentence:
-
There are two possible ways the word can match the dictionary to be considered
correctly spelled.
-
The first way is to match the word exactly.
-
The second way is to change the first letter of your word to lower case,
and then see if it matches the dictionary.
-
If the first word of your sentence is "The", your word is spelled correctly.
-
If the first word of your sentence is "the", it is also spelled correctly.
-
If the first word of your sentence is "Geblotnik", it is not spelled correctly.
-
Any word that follows a period is assumed to be the first word of a sentence.
-
The first word of your document is also considered to be the first word
of a sentence.
-
You should report the word to the user with precisely the same capitalization
as it appears in the document.
The Output
You should list all misspelled words to the screen in alphabetical order.
-
Each misspelled word should only be listed once.
-
You may use any sorting algorithm you like - you do not need to be fancy
here (unless you want to be).
-
You should sort all the words lexicographically, including words with capital
letters.
-
Words starting with capital letters should appear before words appearing
with lower case letters.
Implementation Details
You will need to decide how to store the dictionary and the misspelled words.
They're both rather similar in that they both store sorted strings. On the other
hand, the dictionary is static whereas the misspelled words are dynamic. Choose
appropriate data structures to deal with each of the two situations. Justify
your decision with documentation in your code.
Frequently Asked Questions
Q: "What kind of dictionary is this? Why isn't <insert my favorite
word here> in the dictionary?"
A: You get what you pay for. I found this dictionary for free on the web. Consider
it to be the authoritative source. If a word is not in web2.txt, it is misspelled.
Good luck!