What is stemming in NLP?
Stemming, in Natural Language Processing (NLP), refers to the process of reducing a word to its word stem that affixes to suffixes and prefixes or the roots.
While a stemming algorithm is a linguistic normalization process in which the variant forms of a word are reduced to a standard form. It is a technique used to extract the base form of the words by removing affixes from them. It is just like cutting down the branches of a tree to its stems. For example, the stem of the words “eating,” “eats,” “eaten” is “eat.”
Search engines use stemming for indexing the words. That’s why rather than storing all forms of a word, a search engine can store only the stems. In this way, stemming reduces the size of the index and increases retrieval accuracy.
Why are stemming and lemmatization different?
For grammatical reasons, documents will use different forms of a word, such as organizing, organizing, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization.
In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.
Both stemming and lemmatization aim to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
- am, are, is $\Rightarrow$ be
- car, cars, car's, cars' $\Rightarrow$ car
The result of this mapping of text will be something like:
- the boy's cars are different colors $\Rightarrow$
- the boy car be differ color
However, the two words differ in their flavor. Stemming usually refers to a simple heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly using a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and return the base or dictionary form of a word, which is known as the lemma.
If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun. The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma. An additional plug-in component often does linguistic processing for stemming or lemmatization to the indexing process. Several such components exist, both commercially and open-source.
What are the errors that could occur in stemming?
There are mainly two errors that could occur in stemming:
- Over-stemming - When two words are stemmed from the same root that are of different stems. Over-stemming can also be regarded as false positives
- Under-stemming - When two words are stemmed from the same root that are not of different stems. Under-stemming can be interpreted as false negatives.
Which are some of the popular stemming algorithms?
Porter’s Stemmer algorithm
Porter’s Stemmer algorithm is one of the most popular stemming methods proposed in 1980. It is based on the idea that the suffixes in the English language are made up of a combination of smaller and simpler suffixes. This stemmer is known for its speed and simplicity. The main applications of Porter Stemmer include data mining and Information retrieval. However, its applications are only limited to English words. Also, the group of stems is mapped onto the same stem and the output stem is not necessarily a meaningful word. The algorithms are fairly lengthy and are known to be the oldest stemmer.
Example: EED -> EE means “if the word has at least one vowel and consonant plus EED ending, change the ending to EE” as ‘agreed’ becomes ‘agree.’
1. Lovins Stemmer
Lovins proposed this algorithm in 1968, which removes the longest suffix from a word, then the word is recoded to convert this stem into valid words.
Example: sitting -> sitt -> sit
2. Dawson Stemmer
The Dawson Stemmer is an extension of the Lovins stemmer. In the Dawson Stemmer, suffixes are stored in the reversed order indexed by their length and last letter.
3. Krovetz Stemmer
This stemming algorithm was proposed in 1993 by Robert Krovetz. Here are the steps that the Krovetz Stemmer follows:
- Convert the plural form of a word to its singular form.
- Convert the past tense of a word to its present tense and remove the suffix ‘ing.’
Example: ‘children’ -> ‘child’
4. N-Gram Stemmer
An n-gram is a set of n consecutive characters extracted from a word in which similar words will have a high proportion of n-grams in common.
Example: ‘INTRODUCTIONS’ for n=2 becomes : *I, IN, NT, TR, RO, OD, DU, UC, CT, TI, IO, ON, NS, S*
5. Snowball Stemmer
When compared to the Porter Stemmer, the Snowball Stemmer can map non-English words too. Since it supports other languages, the Snowball Stemmers can be referred to as a multilingual stemmer. The Snowball stemmers are also imported from the NLTK package.
The Snowball Stemmer is based on a programming language called ‘Snowball’ that processes small strings and is the most widely used stemmer. The Snowball stemmer is way more aggressive than Porter Stemmer and is also referred to as Porter2 Stemmer. Because of the improvements added when compared to the Porter Stemmer, the Snowball stemmer has greater computational speed.
6. Lancaster Stemmer
The Lancaster stemmers are more aggressive and dynamic compared to the other two stemmers. The stemmer is faster, but the algorithm is confusing when dealing with small words. The drawback is that it is not as efficient as Snowball Stemmers. The Lancaster stemmers save the rules externally and uses an iterative algorithm.