CS 127 Spelling Checker
Assigned on Friday, 10/25.
Design document due Monday, 10/28 in class.
Project due Due Friday, 11/1 by 4 PM. Be aware that the labs in
the CMC close at 5 PM.
The Assignment
For this assignment you will write a spell checker, by storing a dictionary
as a sorted list of words. Its structure, in pseudocode, might look like
this:
Open the dictionary file.
For each word in the dictionary
add the word to the list.
Close the dictionary file.
Open the big monster text file (BMTF).
For each word in the BMTF
make proper adjustments to case of letters.
search the list for the word.
if word is misspelled, store it.
Close the BMTF.
Display all misspelled words to the screen in alphabetical order.
Note that many of these pseudocode lines will be handled by function calls
in the real code.
The Dictionary
The central purpose of this assignment is to write an application which
uses binary search. While you're debugging and testing your code at first,
MAKE SURE to use a small dictionary file and a small text file. Otherwise,
you'll put too much strain on our servers. Later, you'll need to try it
on big files. I've placed a big dictionary in the file /Accounts/courses/cs127/dmusican/web2.txt
. Try the UNIX command "wc" on it, and you will discover that this dictionary
has 234,936 words in it. The words are sorted (alphabetically, not
lexicographically), and one per line. You will need to build the dictionary
in memory, and be able to search for a word in the dictionary using binary
search.
Please do as much of the large scale testing as you can on lab machines
or your own, and not on prism.
The Input
You should not use standard input (cin) to obtain either of the files.
Your program should ask the user for the location of the dictionary and
the location of the document to be checked.
Capital letters, punctuation, and all that jazz
The dictionary mentioned above contains words with upper and lower case
letters.
In order to keep this problem from becoming a terribly tedious exercise
in details, you should use the following over-simplified rules.
Punctuation:
- You should remove all punctuation from a word before checking its
spelling.
- This includes apostrophes (') and hyphens (-), even if such words
appear in the dictionary with those symbols in them.
- "can't" should be checked as "cant"
- "do#g" should be checked as "dog"
- "proto-human" should be checked as "protohuman".
If a word in your document is not at the beginning of a sentence:
- Its spelling must match a word in the dictionary with exactly the
same capitalization in order to be spelled correctly.
- If the dictionary contains "Zeus" and your document contains the
word "zeus", your word is misspelled.
- If the dictionary contains the word "person" and your document
contains the word "Person" (not at the beginning of a sentence), your word
is misspelled.
- You should report the word to the user with precisely the same
capitalization as it appears in the document.
If a word in your document is at the beginning of a sentence:
- There are two possible ways the word can match the dictionary to
be considered correctly spelled.
- The first way is to match the word exactly.
- The second way is to change the first letter of your word to lower
case, and then see if it matches the dictionary.
- If the first word of your sentence is "The", your word is spelled
correctly.
- If the first word of your sentence is "the", it is also spelled
correctly.
- If the first word of your sentence is "Geblotnik", it is not spelled
correctly.
- Any word that follows a period is assumed to be the first word
of a sentence.
- The first word of your document is also considered to be the first
word of a sentence.
- You should report the word to the user with precisely the same
capitalization as it appears in the document.
The Output
You should list all misspelled words to the screen in alphabetical order.
- Each misspelled word should only be listed once.
- You may use any sorting algorithm you like - you do not need to
be fancy here (unless you want to be).
- You should sort all the words lexicographically, including words
with capital letters.
- Words starting with capital letters should appear before words
appearing with lower case letters.
Implementation Details
You will need to decide how to store the dictionary and the misspelled
words. They're both rather similar in that they both store sorted strings.
On the other hand, the dictionary is static whereas the misspelled words
are dynamic. Choose appropriate data structures to deal with each of the
two situations. Justify your decision with documentation in your code.
Frequently Asked Questions
Q: "What kind of dictionary is this? Why isn't <insert my favorite
word here> in the dictionary?"
A: You get what you pay for. I found this dictionary for free on the
web. Consider it to be the authoritative source. If a word is not in web2.txt,
it is misspelled.
Good luck!