Benjamin Waters

Multilingual Spellchecking


Conceptual Plan

A Working Example

How it Works


An optimal spellchecking implementation should be able to check a file written in any mixture of different languages (taking into account personal wordlists for each language) and report back in a matter of seconds with a list of errors for each language. The following description of such an implementation shows that this is fairly trivial under Unix. The principles should also be portable to other systems.

Conceptual Plan

1 A document must be written in a markup that explicitly states the overall language of the document, and then any local language changes. This is straightforward in HTML, LaTeX, or in any other well-structured markup language.

2 A text-processing script splits the document cleanly into its component languages.

3 Each chunk of language then needs to be cleaned up and piped through a spell-checker for that language. The input-output model of Unix programs is here much more convenient than the interactive spell-checking model that is usually used. We can then run the output for each language past our personal word list for that language, to leave us with a list for each language of the misspelled words.
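
As a rough illustration of the batch model in step 3, here is a minimal sketch that checks text against a plain word list, with no spellchecker involved. The function name unknown_words and the sample words are my own assumptions for illustration; the word list is assumed to hold one word per line and to be sorted in the same locale that sort and comm run in.

```shell
#!/bin/bash
# Hypothetical helper: read text on stdin and print every word that is
# missing from the sorted word list named in $1.
unknown_words() {
    local wordlist=$1
    awk '{ gsub(/[^[:alpha:]]/, " "); print }' |   # non-letters become spaces
        tr ' ' '\n' | sed -e '/^ *$/d' |           # one word per line, no blanks
        sort -u |                                  # unique, sorted candidates
        comm -23 - "$wordlist"                     # keep words not in the list
}

# Demo with a throwaway word list.
demo_list=$(mktemp)
printf 'alpha\nbeta\n' > "$demo_list"
printf 'beta, gamma! alpha gamma\n' | unknown_words "$demo_list"
# prints: gamma
```

Because the function only reads stdin and writes stdout, it can sit at the end of any pipeline that produces the text for one language.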

A Working Example

Download the code here. On my system this can check a document of 40 000 words written in 5 languages in about 2.7 seconds, which is more or less acceptable. You will have to do a few things to make this code work for you:

1 Run it in a Unix-like environment, which includes a shell on a Mac.

2 Write a consistent (XML-compliant) HTML file. State the global language of the file as the lang attribute of the leading <h1> element, e.g. <h1 lang="en">, and for each element that has a different language, set the lang attribute on that element. For spans of text you can also just use a span: <span lang="fr">.

3 You will need to install aspell for each language you want to check, or if no aspell for that language is available, you can simply check that language against your own wordlist of allowable words.

4 You will need to prune or extend the set of supported languages from the 5 that I have provided in the code. Greek and Latin are checked against a wordlist only; English, French and German use aspell.

5 In the code you will see pointers to files of the form /home/ben/.../sp.fr. Here you will need to set paths to your personal files of extra words for each language. I also keep a file, sp.all, that contains words that I consider to be the same across all languages, such as proper names. These word lists need to be one word per line, and sorted with sort -u, so that comm can make use of them.

6 If your document is well-formed, aspell is installed for each language, and you have pointed the program to your personal word lists, then you should be able to put the program in your execution path and just type: sp filename.
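
To make the markup requirement in step 2 concrete, the following sketch writes a tiny sample document and lists the language codes it declares, which can help when deciding which languages to keep in the script. The sample text and the function name list_languages are assumptions made for illustration only.

```shell
#!/bin/bash
# Hypothetical sample document, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<h1 lang="en">A Short Test</h1>
<p>Some English text with a <span lang="fr">bonjour le monde</span> inside.</p>
<p lang="de">Ein ganzer Absatz auf Deutsch.</p>
EOF

# Print each declared two-letter language code with the number of
# elements that declare it.
list_languages() {
    grep -o 'lang="[a-z][a-z]"' "$1" | sort | uniq -c
}

list_languages "$sample"
```

Any code that shows up here but has no matching section in the script marks text that will silently go unchecked.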

How it Works

The first thing we need to do is to mark the global language and throw away the h1 tags:

#!/bin/bash
sed -e '/h1/s/">/ /g' "$1" | sed -e 's/<h1 lang="/GLOB-START/' | sed -e 's/<\/h1>//g' |
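
To see what this step produces, the same three sed commands can be run on a single line of input. The wrapper name mark_global and the sample heading are assumptions for illustration; the sed expressions are copied from the script.

```shell
#!/bin/bash
# Hypothetical wrapper around the three sed commands, reading stdin.
mark_global() {
    sed -e '/h1/s/">/ /g' | sed -e 's/<h1 lang="/GLOB-START/' | sed -e 's/<\/h1>//g'
}

printf '<h1 lang="en">My Title</h1>\n' | mark_global
# prints: GLOB-STARTen My Title
```

The two-letter code ends up glued to the GLOB-START marker, which is what the later grep for GLOB-STARTde relies on.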

We isolate all of the points where the language changes to something different. The starting points are very easy to find: <p lang="de"> <span lang="fr"> <td lang="el"> &c. But the endpoints of these language regions will look like the end points of any ordinary tag: </p> </span> </td>. What we have to do is de-nest the HTML, keeping tags matched with one another. If we do this properly, we can thereby isolate all of the tagged regions of different languages. I simply repeat the following loop 4 times:

tr '\n' ' ' |
sed -e 's/<[a-z]/\n&/g' | sed -e 's/<\/[a-z]*[0-9]*>/&\n/g' |
sed -e '/^<acronym>.*<\/acronym>$/d' | sed -e '/^<span class="ref">.*<\/span>$/d' |
sed -e '/^<[^>]* lang="..">.*<\/.*>/s/<\/[a-z]*[0-9]*>$/ LANG-STOP/' |
sed -e '/LANG-STOP/s/<[^>]* lang="/LANG-START/' | sed -e '/^LANG-START/s/">/ /' |
sed -e '/^<.*>.*<\/.*>$/s/<\/[a-z]*[0-9]*>/ KILL-MARK/' | sed -e '/KILL-MARK/s/<.*>//' |
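
The effect of the loop is easier to follow on a small input. The sketch below wraps one pass in a function and applies it four times; the function names denest_once and denest and the sample line are assumptions for illustration. It also assumes GNU sed, since the script depends on \n in a replacement producing a newline.

```shell
#!/bin/bash
# One pass of the de-nesting loop, copied from the script (reads stdin).
denest_once() {
    tr '\n' ' ' |
    sed -e 's/<[a-z]/\n&/g' | sed -e 's/<\/[a-z]*[0-9]*>/&\n/g' |
    sed -e '/^<acronym>.*<\/acronym>$/d' | sed -e '/^<span class="ref">.*<\/span>$/d' |
    sed -e '/^<[^>]* lang="..">.*<\/.*>/s/<\/[a-z]*[0-9]*>$/ LANG-STOP/' |
    sed -e '/LANG-STOP/s/<[^>]* lang="/LANG-START/' | sed -e '/^LANG-START/s/">/ /' |
    sed -e '/^<.*>.*<\/.*>$/s/<\/[a-z]*[0-9]*>/ KILL-MARK/' | sed -e '/KILL-MARK/s/<.*>//'
}

# Four passes, as in the script.
denest() { denest_once | denest_once | denest_once | denest_once; }

# Squeeze whitespace at the end so the result is easy to read.
printf 'GLOB-STARTen Title <p>Hello <span lang="fr">bonjour monde</span> world</p>\n' |
    denest | tr '\n' ' ' | tr -s ' '
echo
```

In this sample the first pass marks the innermost French span with LANG-START and LANG-STOP, and the second pass strips the enclosing paragraph tags, leaving a KILL-MARK where the closing tag was. Deeper nesting needs more passes, which is presumably why the loop is repeated.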

Having marked all of the start and stop points of the foreign languages in the document, we can now produce two files: one from which we can easily extract the foreign language bits of the file, and one that contains all of the rest. This leaves all of the text broken down into its component languages, with the HTML tags removed.

sed -e 's/LANG-START/\n&/g' | sed -e 's/LANG-STOP/&\n/g' > sptmp1
sed -e '/^LANG-START/d' sptmp1 | tr '\n' ' ' > sptmp2
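
Here is a small sketch of what the two files look like, using a hand-written marked-up line of the kind the de-nesting loop produces. The sample line and the use of a temporary working directory are assumptions for illustration; the two commands are those of the script, and GNU sed is again assumed.

```shell
#!/bin/bash
# Work in a throwaway directory so the demo files do not clutter anything.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# sptmp1: every foreign-language region on a line of its own.
printf 'GLOB-STARTen Title Hello LANG-STARTfr bonjour monde LANG-STOP world KILL-MARK\n' |
    sed -e 's/LANG-START/\n&/g' | sed -e 's/LANG-STOP/&\n/g' > sptmp1

# sptmp2: everything else, joined back into one line.
sed -e '/^LANG-START/d' sptmp1 | tr '\n' ' ' > sptmp2

grep -h LANG-STARTfr sptmp1   # the French region
cat sptmp2; echo              # the text in the global language
```

Since sptmp2 is a single line beginning with the GLOB-START marker, a grep for GLOB-STARTen returns all of the global-language text at once, and a grep for any other global code returns nothing.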

For each language we now merely have to do the following:

1 grep all of that language out of our language-separated files by searching for the markers that we used to tag the languages;

2 clean up the text by taking out our markers, and also the punctuation that would confuse the spellchecker;

3 use awk and wc to print some nicely formatted word counts;

4 sort the word list, keeping one copy of each word;

5 use comm to send the list past our personal word lists, stripping out the words that we have already approved there, and pipe what remains through aspell to get a list of misspelled words;

6 fold the final list for easy reading.

What follows has to be repeated for each language, for as many languages as you want to check. Any text from the original document marked with languages that are not called up at this point simply won’t be checked.

# CHECK DEUTSCH
# Collect the global text (if the document is German) plus every German region,
# strip our markers, turn non-letters into spaces, and write one word per line.
echo $(grep -h GLOB-STARTde sptmp2; grep -h LANG-STARTde sptmp1) | tr ' ' '\n' |
sed -e 's/GLOB-START..//g' | sed -e 's/LANG-START..//g' | sed -e 's/LANG-STOP//g' | sed -e 's/KILL-MARK//g' |
awk '{ gsub(/[^[:alpha:]]/, " "); print }' | tr ' ' '\n' | sed -e '/^ *$/d' > sptmpDE2
# Report the total word count (wc prints lines, words, bytes).
wc sptmpDE2 |
awk '$2 > 0 { printf("\n\n%16s%10d%-15s%10f%s\t%d%s\n",
    "Deutsch:", $2, " total words.", ($3 / $2), " char/word.", ($2+$3)/2160, " pages.") }'
# Report the unique word count.
sort -u sptmpDE2 > sptmpDE1
wc sptmpDE1 |
awk '$2 > 0 { printf("%16s%10d%-15s%10f%s\n\n", " ", $2, " unique words.", ($3 / $2), " char/word.") }'
# Drop words found in the personal German list, then in the shared list;
# aspell lists whatever it still considers misspelled.
comm -2 -3 sptmpDE1 /home/ben/com.lib/sp.de > sptmpDE2
comm -2 -3 sptmpDE2 /home/ben/com.lib/sp.all | aspell -C --lang=de -l --encoding=UTF-8 |
tr '\n' ' ' > sptmpDE1
echo ' ' >> sptmpDE1
fold -s -w 100 sptmpDE1
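
For languages that are checked against a word list only, as Greek and Latin are, the same steps can be used with the aspell stage left out. The sketch below is a guess at the shape of such a check, not the code from the download: the function name check_against_wordlist and the path in the usage comment are hypothetical, and the word counts are omitted for brevity. It expects sptmp1 and sptmp2 in the current directory and a word list sorted in the current locale.

```shell
#!/bin/bash
# Hypothetical helper: print the words tagged with language code $1 that
# are missing from the sorted word list $2, folded to 100 columns.
check_against_wordlist() {
    local lang=$1 wordlist=$2
    { grep -h "GLOB-START$lang" sptmp2; grep -h "LANG-START$lang" sptmp1; } |
        sed -e 's/GLOB-START..//g' -e 's/LANG-START..//g' \
            -e 's/LANG-STOP//g' -e 's/KILL-MARK//g' |        # strip our markers
        awk '{ gsub(/[^[:alpha:]]/, " "); print }' |          # non-letters to spaces
        tr ' ' '\n' | sed -e '/^ *$/d' |                      # one word per line
        sort -u | comm -23 - "$wordlist" |                    # unknown words only
        tr '\n' ' ' | fold -s -w 100
    echo
}

# Usage (the path is a placeholder, not a real location):
# check_against_wordlist la /path/to/sp.la
```

Wrapping the steps in a function like this would also be one way to avoid repeating the whole block for every language.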

Then just clean up temporary files.
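
A minimal sketch of that cleanup, assuming the temporary files follow the naming seen in the script (sptmp1, sptmp2, and a two-letter language code followed by 1 or 2, as in sptmpDE1). The function name cleanup_sptmp is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: remove the script's temporary files from the
# current directory. rm -f stays quiet if a file does not exist.
cleanup_sptmp() {
    rm -f sptmp[12] sptmp??[12]
}

# To clean up even when the script is interrupted, the function could be
# registered near the top of the script:
# trap cleanup_sptmp EXIT
```

The patterns are deliberately narrow so that unrelated files whose names merely begin with sptmp are left alone.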

Benjamin Waters | 2007-05-24