Python SpellingCorrector (Legacy)

This article is based on legacy software.

A spelling corrector based on the noisy channel model. This program can correct spelling mistakes even when the typed word is a real dictionary word but is used in an incorrect context. The goal of this project was to find and correct words that were mistyped, even if the typed word is itself a word in the dictionary. For example, "fifteen minutes" might be mistyped as "fifteen minuets" - depending on context, "fifteen minutes" is probably the desired phrase.

In order to predict what the intended word is, we want to maximize the probability P(w|x), where w is the actual word and x is the noisy word that the algorithm receives. Using Bayes' rule, we can equivalently maximize the product P(x|w)P(w). This can be described as a product between the channel model and the dictionary model: the channel model tells us the probability that a specific word could have been turned into the received word x, and the dictionary model tells us the probability of a specific word (or N-gram) w showing up.

In its simplest form, the channel model gives a higher probability to candidates requiring fewer edits (where an edit is an insertion, deletion, substitution, or transposition). However, we can also model the error probability more specifically. One option is to use a training set of labeled mistakes to determine which mistakes are more likely. Another option is to develop a heuristic for which mistakes are more likely; for example, we would say that a substitution of keys that neighbor each other is more likely than other substitutions.

The dictionary model can also take a simpler or more complicated form. We can look at the probability of a single word showing up, or at the probability of some N-gram showing up. Using a bigram or trigram gives us context and allows us to determine the actual word more intelligently.

The first step in the spelling correction process was to produce all words within one edit distance of each word in the input sentence. This project used one edit distance as the maximum because, in general, about 80% of all spelling mistakes fall within one edit distance. If there were reason to assume that the channel was particularly "noisy," the same algorithm could be used to add all words an additional edit distance away to the candidate list.

After a list of candidate words is generated, each of these words is plugged into the model discussed above. For each specific edit, an edit probability is calculated. This is done using data that provides the count of each specific edit (for example, the number of times "ie" is typed as "ei": ei|ie). The same data also provides a count of all two-letter-grams, allowing us to calculate P(x|w), where w is the actual word and x is the word with noise.

Next we must take the dictionary model into account. To do this, we look at the two bigrams on either side of the candidate word: P(word | prevWord) and P(nextWord | word). Originally, I was planning on making requests to Google's N-gram database; however, this failed because Google cut me off (Error 429) for performing too many requests. Instead, I found several other large pieces of text and combined them to create my own bigram probabilities with the help of the NLTK toolkit. Unfortunately, this results in a significantly smaller corpus than the giant Google database, but it still worked for the most part. Additionally, due to the smaller size of the available text, smoothing proved to be more important.

The probabilities of each candidate word and the original word are all computed, and at first the word with the maximum probability is chosen as the intended word. In order to properly analyze the results, an edit probability (the channel model) must also be assigned to the word as actually typed. I tested several values for this probability before settling on assigning the typed word a probability of 1e-7, while the other edit probabilities were in the range of 1e-8 to 1e-9.

After some of the words are corrected in the first pass, the same method is repeated until no more changes are made. This is done so that noisy words that are surrounded by other noisy words are also corrected. The final program outputs the top three possibilities for each word in the sentence, in increasing order of probability; these are printed for each successive attempt at editing the sentence.

The user passes the program a text file with the lines that need correction and an output file. The user may optionally supply an additional argument specifying a directory that contains additional text files for the corpus; adding more text files may help the program's performance. For most of my tests I used all the data found at.
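The maximization of P(x|w)P(w) described above can be sketched as a small scoring function. The names here are illustrative, not the project's actual code; working in log space simply avoids underflow when multiplying probabilities as small as 1e-7:

```python
import math

def best_candidate(candidates, channel, prior):
    """Pick the w maximizing P(x|w) * P(w).

    candidates: iterable of candidate words w
    channel:    dict mapping w -> P(x|w)  (channel model)
    prior:      dict mapping w -> P(w)    (dictionary model)
    Summing logs equals multiplying probabilities, without underflow.
    """
    return max(candidates,
               key=lambda w: math.log(channel[w]) + math.log(prior[w]))

# Toy numbers: the typed word "minuets" keeps the fixed channel
# probability the article mentions (1e-7), yet "minutes" still wins
# because its dictionary probability is much larger.
channel = {"minuets": 1e-7, "minutes": 1e-8}
prior = {"minuets": 1e-9, "minutes": 1e-4}
print(best_candidate(["minuets", "minutes"], channel, prior))  # minutes
```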
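Producing every string within one edit distance (deletion, transposition, substitution, insertion) is commonly written as follows. This is the well-known formulation popularized by Peter Norvig, shown here as a sketch rather than the project's own listing:

```python
import string

def edits1(word):
    """All strings one edit (deletion, transposition,
    substitution, or insertion) away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

print("hte" in edits1("the"))  # True - "hte" is one transposition away
```

A second edit distance, for a noisier channel, is just `edits1` applied again to each result.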
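The edit-count division described above - the count of a specific edit over the count of the corresponding two-letter-gram - can be sketched like this. The counts are invented for illustration, since the article does not name its data source:

```python
def channel_prob(edit, edit_counts, bigram_counts):
    """P(x|w) for a single edit written as 'typed|intended',
    e.g. 'ei|ie' means 'ei' was typed where 'ie' was intended.
    Counts here are illustrative, not the project's actual data."""
    typed, intended = edit.split("|")
    return edit_counts.get(edit, 0) / bigram_counts[intended]

edit_counts = {"ei|ie": 90}     # times 'ie' was typed as 'ei' (made up)
bigram_counts = {"ie": 12000}   # occurrences of 'ie' in the corpus (made up)
print(channel_prob("ei|ie", edit_counts, bigram_counts))  # 0.0075
```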
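The author built the bigram probabilities with NLTK; the counting itself, plus the smoothing that the small corpus made necessary, can be sketched with only the standard library. The add-one (Laplace) scheme is my assumption - the article does not say which smoothing was used - and the two-bigram context score P(word | prevWord) * P(nextWord | word) follows the formula above:

```python
from collections import Counter

def bigram_model(tokens):
    """Add-one-smoothed bigram probabilities from a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = len(unigrams)
    def prob(prev, word):
        # Laplace smoothing keeps unseen bigrams from zeroing a candidate.
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)
    return prob

def context_score(prob, prev_word, word, next_word):
    """Score a candidate by the two bigrams around it:
    P(word | prevWord) * P(nextWord | word)."""
    return prob(prev_word, word) * prob(word, next_word)

tokens = "fifteen minutes later she waited fifteen minutes more".split()
prob = bigram_model(tokens)
print(context_score(prob, "fifteen", "minutes", "later")
      > context_score(prob, "fifteen", "minuets", "later"))  # True
```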
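The repeat-until-stable behavior - re-running the corrector so that words surrounded by other noisy words get fixed on a later pass - can be sketched as a fixed-point loop. `correct_once` and the lookup-table corrector below are stand-ins for the real per-word model:

```python
def correct_until_stable(words, correct_once, max_passes=10):
    """Re-run the per-word corrector until a pass makes no changes,
    so a word whose context was itself noisy gets another chance.
    `correct_once` maps a word list to a corrected word list."""
    for _ in range(max_passes):  # cap passes as a safety net
        corrected = correct_once(words)
        if corrected == words:
            break
        words = corrected
    return words

# Toy corrector: a lookup table standing in for the real model.
fixes = {"minuets": "minutes", "fifteeen": "fifteen"}
toy = lambda ws: [fixes.get(w, w) for w in ws]
print(correct_until_stable(["fifteeen", "minuets"], toy))  # ['fifteen', 'minutes']
```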
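The command-line interface described above - an input file, an output file, and an optional directory of extra corpus text - might look like this with `argparse`. The flag name `--corpus-dir` is my guess, not necessarily the project's actual option:

```python
import argparse

def build_parser():
    # Positional input/output files plus an optional corpus directory,
    # matching the interface described in the article.
    parser = argparse.ArgumentParser(
        description="Noisy-channel spelling corrector")
    parser.add_argument("infile", help="text file with lines to correct")
    parser.add_argument("outfile", help="file to write corrections to")
    parser.add_argument("--corpus-dir", default=None,
                        help="optional directory of extra corpus text files")
    return parser

args = build_parser().parse_args(["typos.txt", "fixed.txt",
                                  "--corpus-dir", "texts/"])
print(args.infile, args.corpus_dir)  # typos.txt texts/
```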