Chatbots offer an opportunity to reach international audiences. The more languages a chatbot supports, the more customers it can reach. The sheer variety of human languages makes this a challenge. If you're reading this article, you're presumably familiar with English and perhaps with some other European languages. Translating even between them can be difficult. When we consider all the world's languages, we deal with even more variety, extending down to a language's basic assumptions. This article provides an overview of the world's language families and their implications for natural language processing.
The boundaries of language families aren't precise, and linguists change their classifications sometimes, but the following are generally considered the ones with the largest number of speakers:
- Indo-European. Europe and parts of Asia.
- Sino-Tibetan. China and some neighboring countries.
- Niger-Congo. Sub-Saharan Africa.
- Austronesian. Southeast Asia and islands of the Pacific and Indian Oceans.
- Afro-Asiatic. Northern and eastern Africa and southwestern Asia.
In total, there are about a hundred language families, but only about 30 of them have more than a million living speakers.
Some languages are isolates, not recognizably part of any family. Some of them, notably Japanese, are used by economically and numerically significant populations.
The differences in language families
Vocabulary is the easiest difference between language families to understand. Within a family, words in different languages are often recognizably similar. Among Indo-European languages, "mother" is "Mutter" in German, "madre" in Italian and Spanish, and "mitera" in Greek. Vocabulary by itself isn't a difficult problem for NLP. The more difficult issues are grammatical.
Parsing Japanese input, for instance, takes more than modifying an English-language parser for different vocabulary. Even rearranging the expected word order isn't enough. It requires a different way of thinking about syntax.
Writing systems vary widely among languages, even within the same family. Some languages use different alphabets. Others don't use alphabets at all, instead using symbols that stand for syllables or whole words. Some languages are written left to right, others right to left or top to bottom.
English doesn't present a lot of problems with word endings compared with other languages, but it has its own irregularities. Natural language processing needs to associate a word with its base form, even if the two look different. For instance, chatbot software has to understand that "brought" is the past tense of "bring." This process is called lemmatization. Lemmatization can be complex, so sometimes a cruder method is used instead: simply chopping affixes off the end of a word. This is called stemming.
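The contrast can be sketched in a few lines of Python. The irregular-form table and suffix rules below are hypothetical, minimal stand-ins for the much larger resources a real library such as NLTK or spaCy provides:

```python
# A tiny, hypothetical table of irregular English forms.
IRREGULAR = {"brought": "bring", "went": "go", "better": "good"}

def lemmatize(word):
    """Look up irregular forms first, then fall back to simple suffix rules."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix, replacement in (("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)] + replacement
    return word

def stem(word):
    """Crude stemming: chop known suffixes with no dictionary check."""
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

print(lemmatize("brought"))  # bring  (found in the irregular table)
print(stem("brought"))       # brought (no suffix matches, so stemming fails)
```

Notice that stemming leaves "brought" untouched, which is exactly why it's the cruder of the two approaches.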
Agglutinative and fusional languages can get very complex in this regard. Latin is a relatively well-known example of a fusional language, also known as an inflected one. Word endings change depending on the case, gender, tense, and so on. It allows very concise statements. "Carthago delenda est!" says "Carthage must be destroyed!" in just three words. It uses the gerundive form of the verb "delere," which gives it an imperative sense. The closest word-for-word English translation would be "Carthage is to be destroyed."
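A fusional paradigm can be modeled as a table of endings. The sketch below covers only the singular of the Latin first declension and ignores vowel length and the many exceptions:

```python
# Singular endings of the Latin first declension (simplified; vowel
# length and irregular nouns are ignored).
FIRST_DECLENSION_SG = {
    "nominative": "a",   # puella  (the girl, as subject)
    "genitive":   "ae",  # puellae (of the girl)
    "dative":     "ae",  # puellae (to/for the girl)
    "accusative": "am",  # puellam (the girl, as direct object)
    "ablative":   "a",   # puella  (by/with the girl)
}

def decline(stem: str, case: str) -> str:
    """Attach the case ending for one declension class to a stem."""
    return stem + FIRST_DECLENSION_SG[case]

print(decline("puell", "accusative"))  # puellam
```

An NLP system for a fusional language has to run this process in reverse: recovering both the stem and the grammatical information the ending encodes.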
Some words are irregular, increasing the confusion. NLP needs to reduce words to their lemmas and understand how the modified form changes the meaning.
Other languages have very little modification of words. In Chinese, word forms never change. Regardless of the number, gender, or tense, the language uses exactly the same word.
Some languages are agglutinative, combining simple stems or morphemes mostly without modification to create words that have meanings of their own. This isn't the same as forming compound words. German includes huge compound words but isn't agglutinative, since the pieces of the compounds are independent words in their own right. Japanese is largely agglutinative, as are some invented languages, such as Esperanto and Klingon.
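Esperanto's regularity makes the idea easy to illustrate. The affix lists below are a tiny, hypothetical subset of the language's morphology, and the greedy splitter is only a sketch of how morpheme segmentation works:

```python
# Toy morpheme splitter for Esperanto-style agglutination (illustrative only).
PREFIXES = ["mal"]        # mal- = opposite
SUFFIXES = ["ul", "ej"]   # -ul- = person, -ej- = place
ENDINGS = ["o", "a", "e"] # part-of-speech endings (noun, adjective, adverb)

def segment(word):
    """Greedily strip known prefixes, endings, and suffixes; keep the root."""
    parts = []
    for p in PREFIXES:
        if word.startswith(p):
            parts.append(p)
            word = word[len(p):]
    ending = ""
    for e in ENDINGS:
        if word.endswith(e):
            ending = e
            word = word[:-len(e)]
            break
    suffixes = []
    changed = True
    while changed:
        changed = False
        for s in SUFFIXES:
            if word.endswith(s) and len(word) > len(s):
                suffixes.insert(0, s)
                word = word[:-len(s)]
                changed = True
    parts.append(word)  # whatever remains is the root
    parts.extend(suffixes)
    if ending:
        parts.append(ending)
    return parts

# "malsanulejo" (hospital) = mal + san(healthy) + ul(person) + ej(place) + o(noun)
print(segment("malsanulejo"))  # ['mal', 'san', 'ul', 'ej', 'o']
```

Each piece carries its own meaning, so the word as a whole can be understood compositionally: "a place for unhealthy people."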
Languages that aren't particularly fusional or agglutinative are called "isolating" languages, not to be confused with language isolates. English is mostly an isolating language. Isolating languages are relatively easy for software to deal with, though they may have their own complications.
The ways different languages render words as visible marks vary widely. English gets by with twenty-six letters, borrowed and slightly expanded from Latin. French and German put marks over letters which aren't just decoration but an essential part of a word's spelling. In German, "schon" means "already," but "schön" means "beautiful."
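The umlaut in "schön" can even be encoded two different ways in Unicode, which NLP code has to treat as identical. Python's standard unicodedata module can normalize them to a common form:

```python
import unicodedata

precomposed = "sch\u00f6n"   # ö as a single code point
combining = "scho\u0308n"    # o followed by a combining diaeresis

# The strings look the same on screen but compare as different:
print(precomposed == combining)  # False

# Normalizing to NFC composes the pair into the single code point:
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
```

Normalizing all input to one form early in the pipeline avoids spurious mismatches later.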
In Chinese, a character is a logogram, standing for a word. Being literate in Chinese requires knowing thousands of them. Japanese uses the same characters, calling them Kanji, but it also has two syllabic writing systems, Hiragana and Katakana. A word with a Kanji symbol can be written phonetically in these systems.
Fortunately, Unicode solves the problem of character sets. Virtually every writing system in current use has a Unicode encoding, and one text can combine any number of languages. A set of symbols used in a particular writing system (which can be shared by multiple languages) is called a Unicode script.
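Python's unicodedata module exposes enough information to make a rough script guess per character. Taking the first word of a character's Unicode name, as below, is a crude heuristic rather than a proper script-property lookup, but it shows how one string can mix scripts:

```python
import unicodedata

def rough_script(ch: str) -> str:
    """Crudely guess a character's script from the first word of its Unicode name."""
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # unnamed code points (controls, etc.)
        return "UNKNOWN"

# One string mixing Latin, CJK, and Hebrew scripts:
text = "schön 東京 שלום"
print({ch: rough_script(ch) for ch in text if not ch.isspace()})
```

A production system would consult the Unicode script property instead, but the principle is the same: every character carries metadata identifying where it belongs.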
The direction in which a script's characters go varies. Left to right is the most common, but Arabic and Hebrew go right to left. To add to the complication, numbers within the text in those languages go left to right. Text entry and rendering software figure it all out.
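The Unicode database records a bidirectional category for every character, which is what rendering engines consult to lay out mixed-direction text. A quick way to inspect the mix, assuming Python:

```python
import unicodedata

# 'R' = right-to-left letter, 'EN' = European number, 'L' = left-to-right letter
for ch in "אב12ab":
    print(ch, unicodedata.bidirectional(ch))
```

Hebrew letters report 'R', the digits report 'EN', and the Latin letters report 'L'; the bidirectional algorithm uses those categories to reorder the runs for display.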
The variety in sentence syntax is arguably the most difficult issue in NLP. In some languages, word order is more flexible than in English, but you have to use the right case of a noun to indicate whether it's a subject, direct object, or indirect object.
Non-Indo-European languages aren't always oriented around subject, verb, and object. Japanese sentences are built around the topic, the major point of interest in the sentence, regardless of the grammatical function it performs. The subject may be omitted. Particles, short words which function as grammatical indicators for the words preceding them, are important. For example, a question most often ends in "ka."
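As a toy illustration, a chatbot could use that final particle as a naive question detector. A real system needs a proper morphological analyzer, and this heuristic misses plenty of cases, but it shows how particles carry grammatical information:

```python
def looks_like_question(sentence: str) -> bool:
    """Naive check: does the Japanese sentence end in the particle か (ka)?"""
    stripped = sentence.rstrip("。．？?! ")  # drop trailing punctuation
    return stripped.endswith("か")

print(looks_like_question("これはペンですか。"))  # True  ("Is this a pen?")
print(looks_like_question("これはペンです。"))    # False ("This is a pen.")
```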
Because the underlying idea of organizing a sentence is so different, parsers for different language families have to be significantly different. There isn't always a one-to-one equivalence from one language to another.
Every human language has some ambiguity built into it. Context usually helps to resolve any uncertainty, but sometimes it's necessary to rephrase a question to make it clear. Spoken language is usually more ambiguous than the written form, so voice chatbots have to be especially careful.
Chatbots deal with limited areas of discourse, making the problem easier to solve. One that handles questions for a clothing store can assume that the questions it gets are mostly about buying clothing.
Remembering the context of the discussion resolves many difficulties. Chatbots have to treat user input not just a sentence at a time, but as part of a continuing conversation. Each language has not only rules in the formal sense but assumptions that are necessary for making sense of a statement.
Every language has its own characteristics, and the more distant the relationship between them, the more fundamental the differences are likely to be. A person who learns a language finds it's necessary to think in the language rather than constantly engaging in mental translation.
Likewise, parsers for different languages need to be based on a thorough understanding of them. It's not just a matter of vocabulary and word order, but of grasping how the language works and how people use it in conversation. A well-designed parser, which can handle all the peculiarities of the language, is necessary to keep users happy.