Bengali (Bangladesh) Spellchecker
Why another spellchecker?
There are quite a few existing Bengali spellcheckers (Ankur, Avro, MS Word, Google Docs, BRAC, Stars21), but they have all proved inadequate:
- Although most Bengali speakers are Bangladeshi, most major software spellcheckers use the minority Indian spelling rules, which are quite different from the Bangla Academy spelling standard in Bangladesh.
- Others require pasting plain text into a separate program’s window, which doesn’t allow for editing formatted text.
- Ankur and Avro work with a subtractive methodology: they start with a huge list of half a million words and use volunteers to comb through the list for misspellings. This is inevitably prone to human error; it’s much better to start from a reliable dictionary and work additively.
- Hunspell: Many existing spellcheckers use proprietary software systems which go obsolete fast. The industry standard is hunspell, which is essentially future-proof.
- Roots & Affixes: Hunspell’s magic is that it works from a root word list and a list of possible affix (prefix and suffix) rules. In this way, a vast range of word forms is recognized in a small (~1-2 MB) file. One existing Bengali spellchecker uses hunspell, but it doesn’t harness the power of affixes, which is important for Bengali. So at 12 MB that dictionary covers only half a million possible word forms, whereas a 2 MB dictionary could cover over four million correct word forms.
So we felt the need for a new spellchecker: lightweight, accurate, built in hunspell for use in many major software applications, built from reliable sources, and following Bangla Academy spelling rules.
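To make the roots-and-affixes saving concrete, here is a minimal Python sketch. It is not hunspell itself and not our actual data: the three roots and five endings are a hypothetical sample, chosen only to show how a short root list plus a short ending list multiplies into many word forms.

```python
# Hypothetical sample data for illustration only; the real dictionary
# stores far more roots and far more affix rules.
ROOTS = ["গাছ", "ফুল", "কলম"]  # tree, flower, pen
SUFFIXES = ["", "টি", "টা", "গুলো", "ের"]  # bare form plus four common noun endings


def expand(roots, suffixes):
    """Return every root + suffix combination as a set of word forms."""
    return {root + suffix for root in roots for suffix in suffixes}


forms = expand(ROOTS, SUFFIXES)
stored_entries = len(ROOTS) + len(SUFFIXES)
print(f"{stored_entries} stored entries generate {len(forms)} word forms")
```

The stored data grows roughly with roots plus rules, while the recognized forms grow with roots times rules, which is why an affix-based dictionary can stay small.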
How was the spellchecker made?
Creating a spellchecker from scratch was an exciting challenge for a language nerd like me! First we carefully typed out all the 49,805 words from the most recent 2017-edition BA (Bangla Academy) Bengali dictionary and the BA Arabic-Farsi-Urdu Words in Bengali dictionary, skipping definitions but keeping the word type (verb, adjective, etc.). This latter information proved valuable later for determining which suffixes each word could take.
Dictionaries don’t have proper nouns, but spellcheckers need them. Fortunately, from local school graphics projects we had huge lists of student names, so we set up an Excel spreadsheet which extracted the most common spellings of the top thousand names, filtering out less-common spellings. We did similar things for world countries, cities, states, and other proper nouns.
Bengali uses masses of English loan words, and although there are fairly clear BA transliteration rules, misspelling of these is pervasive. Tackling this was a fun challenge. First I needed an open-source English pronunciation word list, which I found in Carnegie Mellon University’s 134k word list. I aligned this with several most-common-English-word lists to filter out less-common English words. Next I wrote a script of rules to convert the phonetics to Bengali script according to BA rules. I had to fine-tune this script quite a bit by manually checking how it worked for a thousand Banglified English words. Then I compared this list with a Bengali corpus word list (using an Excel algorithm) to prioritize which words were straightforward and which needed manual checking.
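The general shape of such a phoneme-to-Bengali converter can be sketched in Python as follows. This is only an illustration under stated assumptions: the tiny mapping tables are hypothetical and incomplete, they are not the BA transliteration rules, and the function name `to_bengali` is invented here. The input is assumed to be a list of ARPAbet phonemes of the kind the Carnegie Mellon list provides (vowels carry a stress digit such as `AE1`).

```python
HASANTA = "\u09cd"  # joins two consonants into a conjunct

# Hypothetical, incomplete mappings for illustration; NOT the BA rules.
CONSONANTS = {"K": "ক", "T": "ট", "B": "ব", "S": "স", "M": "ম", "N": "ন", "P": "প", "L": "ল"}
# Each vowel maps to (independent letter, dependent sign used after a consonant).
VOWELS = {
    "AE": ("অ্যা", "্যা"),
    "IY": ("ই", "ি"),
    "EH": ("এ", "ে"),
    "AA": ("আ", "া"),
    "UW": ("উ", "ু"),
}


def to_bengali(phonemes):
    """Convert a list of ARPAbet phonemes to Bengali script (sketch only)."""
    out = []
    after_consonant = False
    for phoneme in phonemes:
        symbol = phoneme.rstrip("012")  # drop the stress digit on vowels
        if symbol in CONSONANTS:
            if after_consonant:
                out.append(HASANTA)  # consonant cluster: form a conjunct
            out.append(CONSONANTS[symbol])
            after_consonant = True
        elif symbol in VOWELS:
            independent, sign = VOWELS[symbol]
            out.append(sign if after_consonant else independent)
            after_consonant = False
        else:
            raise KeyError(f"no rule for phoneme {phoneme!r}")
    return "".join(out)


print(to_bengali("T IY1 M".split()))    # team
print(to_bengali("B EH1 S T".split()))  # best
```

Raising an error on an unmapped phoneme, rather than guessing, makes it easy to collect the words that still need a rule or a manual check.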
A corpus is a massive body of text (ideally more than 10 million words long), which is really useful for linguistic analysis. Since Prothom Alo sets the standard for spelling in the country and has a big website, I scraped their website to amass a ~40 million word strong corpus. Then I distilled this down to a word list showing the frequency of each word. Of course, this includes lots of spelling mistakes, but it’s a good source of data on which spellings are commonly used. It was a great help in calculating which dictionary words could take which prefixes and which suffixes.
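As a rough sketch of that distillation step, the Python below counts Bengali words in a few sample sentences and then checks which root + suffix combinations are actually attested. The function names, the sample sentences, and the `min_count` threshold are assumptions made for illustration; this is not the pipeline we actually ran.

```python
import re
from collections import Counter

# Runs of characters from the Unicode Bengali block (U+0980 to U+09FF).
# A simplification: it also matches Bengali digits and ignores joiners like ZWNJ.
BENGALI_WORD = re.compile(r"[\u0980-\u09FF]+")


def word_frequencies(texts):
    """Count how often each Bengali word occurs across an iterable of texts."""
    counts = Counter()
    for text in texts:
        counts.update(BENGALI_WORD.findall(text))
    return counts


def attested_suffixes(root, suffixes, counts, min_count=2):
    """Return the suffixes for which root + suffix occurs at least min_count times."""
    return [s for s in suffixes if counts[root + s] >= min_count]


# Tiny hypothetical "corpus" standing in for the scraped text.
sample = ["গাছের পাতা সবুজ। গাছের ডাল শক্ত।", "গাছটি বড়।"]
counts = word_frequencies(sample)
print(counts.most_common(1))
print(attested_suffixes("গাছ", ["ের", "টি", "গুলো"], counts))
```

A frequency threshold matters because the corpus contains misspellings: a form seen only once or twice is weak evidence that the suffix is valid for that root.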
Finally we compiled all these root words into a huge master spreadsheet with lots of metadata: frequency, source, type of word, affixes, etcetera.
Affixes
At this point, we had a pretty good list of root words. Next we needed to code all the possible suffixes for words, a huge task. The simple noun গাছ (tree) can have 88 possible suffix forms.
In Bengali, different types of nouns carry different endings: non-countable nouns like “ময়দা” or “লোহার” can’t take ‘গুলো’ endings, and proper nouns, numbers, animate objects, inanimate objects, and abstract nouns each behave differently. So we also had to determine which noun categories exist with regard to affixes.
Bengali verbs are much more complex. Most verbs have over a hundred possible forms.
And these forms involve drastic spelling changes deep into the word, not just tacking on an ending. This took a lot of research and over six hundred lines of coded inflection rules like these:
SFX V ানো ে/xX ানো "লুকিয়ে" চলিত perfective participle
SFX V নো ইয়া/xX নো "ঘুমাইয়া" সাধু perfective participle
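For readers unfamiliar with the format, a hunspell suffix rule line reads `SFX flag strip add condition`: if the root matches the condition at its end, the "strip" characters are removed and the "add" characters appended (a `0` means nothing to strip or add). The Python sketch below mimics that behaviour for a single rule line. The two rules used in the demo are simplified hypothetical examples, not lines from our affix file, and the helper names are invented here.

```python
import re
from typing import NamedTuple, Optional


class SuffixRule(NamedTuple):
    flag: str
    strip: str
    add: str
    condition: str


def parse_sfx(line: str) -> SuffixRule:
    """Parse one 'SFX flag strip add condition' rule line (not the SFX header line)."""
    _, flag, strip, add, condition = line.split()[:5]
    add = add.split("/")[0]  # drop continuation flags such as '/xX'
    return SuffixRule(flag, "" if strip == "0" else strip, "" if add == "0" else add, condition)


def apply_suffix(rule: SuffixRule, root: str) -> Optional[str]:
    """Return the inflected form, or None if the rule does not apply to this root."""
    if not re.search(f"(?:{rule.condition})$", root):
        return None
    if not root.endswith(rule.strip):
        return None
    stem = root[: len(root) - len(rule.strip)]
    return stem + rule.add


# Hypothetical rules for illustration only.
verb_rule = parse_sfx("SFX V া ে া")   # strip a final া, add ে
noun_rule = parse_sfx("SFX N 0 ের .")  # strip nothing, add ের to any root

print(apply_suffix(verb_rule, "করা"))  # করা -> করে
print(apply_suffix(noun_rule, "গাছ"))  # গাছ -> গাছের
print(apply_suffix(verb_rule, "গাছ"))  # condition fails -> None
```

Anything after the condition field is ignored by the parser, which is how trailing notes like the example word and grammatical label in the rules above can sit on the same line.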
Hunspell has a great feature that lets you prioritize spelling suggestions. I had been frustrated that previous spellcheckers often didn’t offer the correct alternative in their list of suggestions for a misspelled word. So I wrote about 100 rules for common misspellings, for example confusing ই and য়ি. Hinting is also needed for common orthographic errors, such as য়/ড়/ঢ় incorrectly written with the dot as a separate character, আ written as অ+া, or ো written as ে+া.
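The orthographic errors listed above are differences in Unicode code point sequences that look identical on screen. In hunspell such hints typically live in the affix file (for instance as REP replacement entries); the standalone Python sketch below only illustrates the mapping itself. The table `FIXES` and the function `fix_orthography` are names invented for this example.

```python
# Each pair maps a commonly mistyped code point sequence to the intended single character.
FIXES = [
    ("\u09af\u09bc", "\u09df"),  # YA + nukta (dot)  -> YYA
    ("\u09a1\u09bc", "\u09dc"),  # DDA + nukta (dot) -> RRA
    ("\u09a2\u09bc", "\u09dd"),  # DDHA + nukta (dot) -> RHA
    ("\u0985\u09be", "\u0986"),  # letter A + sign AA -> letter AA
    ("\u09c7\u09be", "\u09cb"),  # sign E + sign AA   -> sign O
]


def fix_orthography(text: str) -> str:
    """Replace visually identical but mis-encoded sequences with the intended characters."""
    for wrong, right in FIXES:
        text = text.replace(wrong, right)
    return text


mistyped = "\u09ac\u09a1\u09bc"  # looks like the word for "big", typed with a separate dot
print(len(mistyped), len(fix_orthography(mistyped)))  # 3 2
```

A plain replacement table is used here rather than `unicodedata.normalize`, because Unicode NFC normalization leaves the three dotted letters in their decomposed form and so would not produce the single-character spellings.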
Checking
With root words and affix rules, we were ready for human checking. A team of people spent 250+ hours testing and improving the spellchecker, running it against lists of common words, books, and articles. Bangla Academy allows some flexibility in spelling, but the Prothom Alo newspaper has become the narrower, de facto spelling standard. So we tested all our spellings against Prothom Alo. In a few deliberated cases where the BA spelling is completely ignored by both the public and Prothom Alo (e.g. বড় as বড়ো), we noted it down and went with the Prothom Alo spelling.
Publishing
We were hoping for a sponsor to reimburse us for all this time, but I guess no one pays for Bengali spellcheckers. So now we’re releasing it to the public for free and eating the cost. We commissioned an MS Word plugin, and are preparing installers and instructions for installing the dictionary into major software applications.
How do I install the spellchecker?
The spellchecker is currently available for Mozilla Firefox, Thunderbird, and Adobe InDesign.
Install and use in Firefox Browser
- Install the add-on from the above link
- In an editable text area in the browser, right-click and select Languages > Bengali (Bangladesh) and make sure Check Spelling is checked.
Spelling issues in your Bengali text should now be underlined in wavy red. Right-click on a word to see alternative correct options. You can add words to your personal dictionary from the right-click menu.
You can use the spellchecker in Gmail within the browser as follows:
- In Gmail click compose to start a new email
- From the bottom right hamburger menu, select Check spelling
Install and Use in Thunderbird
- Install the add-on from the above link
- When you are composing a new message, right-click and select Languages > Bengali (Bangladesh) and make sure Check Spelling is checked.
Spelling issues in your Bengali text should now be underlined in wavy red. Right-click on a word to see alternative correct options. You can add words to your personal dictionary from the right-click menu.
Steps to install to InDesign
- Find the dictionaries folder in your version of InDesign:
  - CC 2022 Windows 64-bit: C:\Program Files\Adobe\Adobe InDesign 2022\Resources\Dictionaries\LILO\Linguistics\Providers\Plugins2\AdobeHunspellPlugin\Dictionaries
  - CC 2018 Windows 64-bit: %ProgramFiles%\Adobe\Adobe InDesign CC 2018\Resources\Dictionaries\LILO\Linguistics\Providers\Plugins2\AdobeHunspellPlugin\Dictionaries
  - CC 2015.4 Windows: %ProgramFiles%\Adobe\Adobe InDesign CC (64 bit)\Plug-Ins\Dictionaries\LILO\Linguistics\Providers\Plugins2\AdobeHunspellPlugin\Dictionaries
  - CC Mac OS: /Applications/Adobe InDesign CC/Resources/Dictionaries/LILO/Linguistics/Providers/Plugins2/AdobeHunspellPlugin.bundle/Contents/SharedSupport/Dictionaries
  - CC Windows 64-bit: %ProgramFiles%\Adobe\Adobe InDesign CC (64 bit)\Plug-Ins\Dictionaries\LILO\Linguistics\Providers\Plugins2\AdobeHunspellPlugin\Dictionaries
  - CS6 Windows 32-bit: %ProgramFiles%\Common Files\Adobe\Linguistics\6.0\Providers\Plugins2\AdobeHunspellPlugin\Dictionaries
  - CS6 Windows 64-bit: %ProgramFiles(x86)%\Common Files\Adobe\Linguistics\6.0\Providers\Plugins2\AdobeHunspellPlugin\Dictionaries
  - CS6 Mac OS: /Library/Application Support/Adobe/Linguistics/6.0/Providers/Plugins2/AdobeHunspellPlugin.bundle/Contents/SharedSupport/Dictionaries
- Download and unzip the bn_BD folder and place the unzipped bn_BD folder into InDesign’s ‘Dictionaries’ folder.
- In the AdobeHunspellPlugin folder immediately above the ‘Dictionaries’ folder there is a file named “Info.plist”. In a plain-text editor, add a <string>bn_BD</string> line under each of ‘SpellingService’, ‘HyphenationService’, and ‘UserDictionaryService’.
- Reopen InDesign, and ensure that the text you are spellchecking is marked as “Bengali (Bangladesh)” or bn-BD.