Natural language processing has many applications across both business and software development, but roadblocks in human language have made text challenging to analyze and replicate. Why can’t computers seem to get it exactly right? Mariana Romanyshyn from Grammarly sheds light on why and discusses what you need to know about NLP Linguistics.
Lexical Simplification: The Motivation
One major function of NLP lies in text simplification, and there are several reasons for building these tools. Consider, for example, the famous episode of Friends in which Joey writes a letter of recommendation for Monica and Chandler to their adoption agency and uses a thesaurus to change
“They are warm, nice people with big hearts.”
into
“They are humid, prepossessing Homo Sapiens with full-sized aortic pumps.” (cue laugh track).
Text simplification is already used by companies like Newsela, which simplifies news for second-language learners, and Hemingway, which helps lower the reading level of a text to reach a broader audience. These motivations are genuine, but the process can be difficult.
Text simplification consists of:
- syntactic simplification: addressing sentence structure.
- lexical simplification: addressing words and short phrases.
- explanation generation: addressing word meanings.
Building a text simplification program begins with a primary pipeline. It starts with pre-processing raw text and looks something like this:
The team knows the process is successful because it has a good F-measure on the test set and it runs relatively quickly. However, because the team is not the final consumer of the product, other success criteria apply. The tool's output should also be:
- grammatically correct
- simpler, but not too simple
- unchanged in meaning
In many cases, the final product meets the developer’s success criteria, but not the consumer’s. So what can developers do to make a better product? Why is it so difficult to build a usable text simplification tool in NLP? One major reason is:
Complex Word Identification
How difficult could complex word identification be? You merely access a large corpus, tokenize it, and count word frequency.
Not quite. There are a few roadblocks to this process because language is a lot more complex than that.
Raw counting isn’t consistent, because it treats every word form and every part of speech as a separate item. To make your program work, it has to identify words the way a linguist would: as the collection of all of a word’s forms. This also includes alternate spellings such as the British “accessorise.” Scraping sites that list a word’s inflectional morphology (the full set of its word forms) gives your program a resource for grouping word forms together, so that the counts become more consistent.
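As a rough illustration of counting by word rather than by word form, here is a minimal sketch. The `WORD_FORMS` table and the `lemma_frequencies` helper are hypothetical names invented for this example; a real system would load a full inflectional resource and would also take parts of speech into account, which this sketch ignores.

```python
import re
from collections import Counter

# Hypothetical, hand-written table mapping each word form to its base form.
# A real system would load this from a full inflectional resource.
WORD_FORMS = {
    "accessorise": "accessorize",
    "accessorised": "accessorize",
    "accessorizes": "accessorize",
    "accessorized": "accessorize",
    "reports": "report",
    "reported": "report",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def lemma_frequencies(text):
    """Count occurrences per base form instead of per surface form."""
    return Counter(WORD_FORMS.get(tok, tok) for tok in tokenize(text))

corpus = "She accessorised the outfit. He accessorizes too. They reported it in a report."
freqs = lemma_frequencies(corpus)
```

With this grouping, the British and American spellings contribute to one shared count instead of two separate, smaller ones.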
Controlling for word length also leaves inconsistencies. “Friend” could be marked simple while its derived form “friendliness” is marked complex only because of its length. Long words aren’t necessarily complex: if you can derive the meaning from the word parts (e.g. “satisfy” to “satisfactory”), they are actually simple. Building a morphological (word part) analyzer into your program gives a more consistent judgment of which words are complex than a length check alone.
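A toy version of such an analyzer can be sketched as suffix stripping against a list of known simple stems. Everything here is an assumption made for illustration: the stem list, the suffix list, and the helper names are hypothetical, and irregular derivations such as “satisfy” to “satisfactory” are out of reach for an approach this naive.

```python
# Hypothetical list of stems assumed to be simple, plus a few English suffixes.
SIMPLE_STEMS = {"friend", "happy", "kind"}
SUFFIXES = ["ness", "ly", "ful", "less"]

def strip_suffix(word):
    """Strip one known suffix; return None if no suffix matches."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            # Undo the y -> i spelling change ("friendli" -> "friendly").
            if stem.endswith("i"):
                stem = stem[:-1] + "y"
            return stem
    return None

def derives_from_simple_stem(word):
    """Peel suffixes off until a known simple stem is found or none match."""
    current = word.lower()
    while current is not None:
        if current in SIMPLE_STEMS:
            return True
        current = strip_suffix(current)
    return False
```

Here `derives_from_simple_stem("friendliness")` peels off “ness” and then “ly” to reach “friend,” so the word can be treated as simple despite its length.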
You can take both of these one step further and analyze words for unusual letter combinations. Complex words tend to contain rare letter combinations, for example “abhorrence” with the sequence “abho.” Compare it with a simpler word such as “anger”: you can immediately think of several words containing “ange,” but likely none containing “abho” (at least not quickly).
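One way to turn this intuition into a feature is to count how many vocabulary words contain each four-letter sequence. The sketch below assumes a toy vocabulary chosen only for illustration; the function names are hypothetical, and a real system would count over a large word list.

```python
from collections import Counter

def char_ngrams(word, n=4):
    """Return all contiguous character sequences of length n in the word."""
    return [word[i : i + n] for i in range(len(word) - n + 1)]

def build_ngram_counts(vocabulary, n=4):
    """Count in how many vocabulary words each character n-gram occurs."""
    counts = Counter()
    for word in vocabulary:
        counts.update(set(char_ngrams(word, n)))
    return counts

def rarest_ngram_count(word, counts, n=4):
    """Return the count of the word's least common n-gram (0 if unseen)."""
    grams = char_ngrams(word, n)
    return min(counts[g] for g in grams) if grams else 0

# Toy vocabulary, assumed for illustration only.
vocabulary = ["anger", "angel", "danger", "change", "range", "orange", "abhor"]
counts = build_ngram_counts(vocabulary)
```

In this toy vocabulary, “ange” appears in six words and “abho” in only one, so a word whose rarest n-gram has a low count is a candidate for being complex.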
On an even more basic level, words can be analyzed for their sound. In English, complex words tend to have higher consonant-to-vowel ratios, while simple words have more even proportions. For example, counting letters, “procrastination” has nine consonants and six vowels, versus “information” with six consonants and five vowels.
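A simple approximation of this feature counts letters rather than sounds. The sketch below makes that assumption explicit: it works on spelling only, treats “y” as a consonant, and uses a hypothetical function name; a more faithful version would work on a phonetic transcription.

```python
VOWELS = set("aeiou")

def consonant_vowel_ratio(word):
    """Ratio of consonant letters to vowel letters (spelling-based approximation)."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    vowels = sum(ch in VOWELS for ch in letters)
    consonants = len(letters) - vowels
    # A word with no vowel letters gets an infinite ratio.
    return consonants / vowels if vowels else float("inf")
```

A higher value of `consonant_vowel_ratio` would then count as weak evidence that a word is complex, to be combined with the other features rather than used alone.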
Working with word meaning is notoriously tricky. However, complex words tend to have fewer meanings than simpler words, because we use the simpler words so often in everyday language. The word “report,” for example, has seven noun meanings and six verb meanings, whereas “abhorrence” has only one noun meaning.
Psycholinguistics deals with language comprehension and production, and it offers further features. For example, words for which your brain readily produces an image tend to be simple (“mouse” versus “abhorrence”). Another feature is the average age of acquisition (again, “mouse” versus “abhorrence,” with children learning “mouse” much earlier).
Complex Word Simplification
As you find replacements for your complex word, you must rerank them based on simplification. However, it’s possible to go too simple.
For example, “dipsomania” is a mostly unknown word with several synonyms. “Inebriacy” isn’t any simpler than the original, while the simplest synonym, “habit,” actually changes the meaning and cannot be considered a suitable replacement. You’ve gone too simple. Instead, a word like “alcoholism” strikes the right balance because the user is likely to know the word part “alcohol.”
Filtering synonyms requires you to take out options that are just as complex as the original, too simple to be an adequate replacement, or grammatically incorrect in context (including ones that break common collocations).
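The first two filters can be sketched as follows. All names and numbers here are hypothetical: the frequency table is made up for illustration, `keeps_meaning` stands in for whatever meaning-preservation check a real system uses, and the grammaticality and collocation filter is left out entirely.

```python
# Hypothetical corpus frequencies (per million words), made up for illustration.
FREQ = {"dipsomania": 0.01, "inebriacy": 0.01, "alcoholism": 4.0, "habit": 30.0}

# Hypothetical set of candidates judged to keep the original meaning.
MEANING_PRESERVING = {("dipsomania", "alcoholism"), ("dipsomania", "inebriacy")}

def keeps_meaning(original, candidate):
    """Stand-in for a real meaning-preservation check."""
    return (original, candidate) in MEANING_PRESERVING

def filter_candidates(original, candidates, freq, min_gain=10.0):
    """Keep candidates that are clearly more frequent and keep the meaning."""
    kept = []
    for cand in candidates:
        # Drop candidates that are not clearly more frequent than the original.
        if freq.get(cand, 0.0) < freq.get(original, 0.0) * min_gain:
            continue
        # Drop candidates that are too general to preserve the meaning.
        if not keeps_meaning(original, cand):
            continue
        kept.append(cand)
    return kept

result = filter_candidates("dipsomania", ["inebriacy", "habit", "alcoholism"], FREQ)
```

Under these made-up numbers, “inebriacy” is dropped for being as rare as the original, “habit” is dropped for changing the meaning, and only “alcoholism” survives.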
From there, the team can rerank the suggestions using a language model. The most appropriate synonyms are ranked closer to the top, while options suitable only in certain situations fall near the bottom of the list. Language models fall into two categories:
- Statistical modeling: could apply the chain rule with the Markov assumption, meaning that every token depends only on the token just before it and not on the entire preceding string. You’ll need smoothing techniques so that unseen word sequences don’t receive a probability of zero. For example, “add-1” smoothing adds one to every count so that none of them is zero. One major drawback is the need for a large corpus.
- Neural modeling: for example, a recurrent network with an input layer of word embeddings and a hidden state computed from the current input and the previous hidden state (with parameters from input to hidden and from hidden to hidden). The model is trained so that the output layer maximizes the likelihood of the next word. Feed a new sentence into this model and the output layer gives you a probability distribution over the next word. This could overgeneralize, however, and requires a significant amount of training.
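The statistical option can be sketched as a bigram model with add-1 smoothing that scores each candidate inside the sentence. This is a minimal sketch under stated assumptions, not the pipeline from the talk: the four training sentences are invented, the function names are hypothetical, and words outside the training vocabulary are simply treated as unseen.

```python
import math
from collections import Counter

def train_bigram_model(sentences):
    """Collect context and bigram counts, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    vocab = {"<s>", "</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        unigrams.update(tokens[:-1])  # counts of tokens used as context
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams, len(vocab)

def log_prob(sentence, model):
    """Log-probability under the bigram (Markov) assumption."""
    unigrams, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Add-1 smoothing: add one to every bigram count, and the
        # vocabulary size to the context count, so nothing is zero.
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total

def rerank(template, candidates, model):
    """Sort candidates by the score of the sentence they produce."""
    return sorted(candidates, key=lambda c: log_prob(template.format(c), model), reverse=True)

# Toy training corpus, invented for illustration only.
model = train_bigram_model([
    "he struggled with alcoholism for years",
    "she struggled with alcoholism",
    "it is a bad habit",
    "he has a habit of singing",
])
ranked = rerank("he struggled with {} for years", ["habit", "alcoholism"], model)
```

Because “with alcoholism” and “alcoholism for” were seen in the toy corpus while “with habit” and “habit for” were not, “alcoholism” ends up ranked first. Smoothing still gives the unseen bigrams a small, non-zero probability.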
All these pieces produce a much more complex, but far more accurate, pipeline for text simplification.
What Does This All Mean?
Romanyshyn believes that a working knowledge of linguistics gives developers more power in an increasingly complex machine learning landscape and helps them build smarter, more accurate programs. Because researchers aren’t the final consumers of any NLP model, it’s vital that developers consider users’ real needs when building.
She believes that although language still presents significant roadblocks to accurate NLP models, it’s better to dive into a problem from a linguistics standpoint rather than ignore it.