<script type="application/ld+json">
{
 "@context": "https://schema.org",
 "@type": "FAQPage",
 "mainEntity": [{
   "@type": "Question",
   "name": "What is stemming?",
   "acceptedAnswer": {
     "@type": "Answer",
     "text": "Stemming is the process of reducing a word to its word stem that affixes to suffixes and prefixes or the roots."
   }
 },{
   "@type": "Question",
   "name": "Why are stemming and lemmatization different?",
   "acceptedAnswer": {
     "@type": "Answer",
     "text": "The two words differ in their flavor. Stemming usually refers to a simple heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time and often includes the removal of derivational affixes."
   }
 },{
   "@type": "Question",
   "name": "What are popular stemming algorithms?",
   "acceptedAnswer": {
     "@type": "Answer",
     "text": "1. Lovins Stemmer.
2. Dawson Stemmer.
3. Krovetz Stemmer.
4. N-Gram Stemmer.
5. Snowball Stemmer.
6. Lancaster Stemmer."
   }
 }]
}
</script>

Stemming

What is stemming?

Stemming is the process of reducing a word to its word stem that affixes to suffixes and prefixes or the roots.

While a stemming algorithm is a linguistic normalization process in which the variant forms of a word are reduced to a standard form. It is a technique used to extract the base form of the words by removing affixes from them. It is just like cutting down the branches of a tree to its stems. For example, the stem of the words “eating,” “eats,” “eaten” is “eat.”

Search engines use stemming for indexing the words. That’s why rather than storing all forms of a word, a search engine can store only the stems. In this way, stemming reduces the size of the index and increases retrieval accuracy.

Why are stemming and lemmatization different?

For grammatical reasons, documents will use different forms of a word, such as organizing, organizing, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. 

In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

Both stemming and lemmatization aim to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. 

For instance:

  • am, are, is $\Rightarrow$ be
  • car, cars, car's, cars' $\Rightarrow$ car

 

The result of this mapping of text will be something like:

  • the boy's cars are different colors $\Rightarrow$
  • the boy car be differ color

However, the two words differ in their flavor. Stemming usually refers to a simple heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly using a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and return the base or dictionary form of a word, which is known as the lemma. 

If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun. The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma. An additional plug-in component often does linguistic processing for stemming or lemmatization to the indexing process. Several such components exist, both commercially and open-source.

 

The errors in stemming

There are mainly two errors in stemming:

  1. Over-stemming - When two words are stemmed from the same root that are of different stems. Over-stemming can also be regarded as false positives 
  2. Under-stemming - When two words are stemmed from the same root that are not of different stems. Under-stemming can be interpreted as false negatives. 

Popular stemming algorithms

Porter’s Stemmer algorithm 

Porter’s Stemmer algorithm is one of the most popular stemming methods proposed in 1980. It is based on the idea that the suffixes in the English language are made up of a combination of smaller and simpler suffixes. This stemmer is known for its speed and simplicity. The main applications of Porter Stemmer include data mining and Information retrieval. However, its applications are only limited to English words. Also, the group of stems is mapped onto the same stem and the output stem is not necessarily a meaningful word. The algorithms are fairly lengthy and are known to be the oldest stemmer.

Example: EED -> EE means “if the word has at least one vowel and consonant plus EED ending, change the ending to EE” as ‘agreed’ becomes ‘agree.’ 

1. Lovins Stemmer 

Lovins proposed this algorithm in 1968, which removes the longest suffix from a word, then the word is recoded to convert this stem into valid words. 

Example: sitting -> sitt -> sit 

2. Dawson Stemmer 

It is an extension of Lovins stemmer in which suffixes are stored in the reversed order indexed by their length and last letter. 

3. Krovetz Stemmer 

It was proposed in 1993 by Robert Krovetz. Following are the steps: 

  1. Convert the plural form of a word to its singular form. 
  2. Convert the past tense of a word to its present tense and remove the suffix ‘ing.’ 

Example: ‘children’ -> ‘child’ 

4. N-Gram Stemmer 

An n-gram is a set of n consecutive characters extracted from a word in which similar words will have a high proportion of n-grams in common. 

Example: ‘INTRODUCTIONS’ for n=2 becomes : *I, IN, NT, TR, RO, OD, DU, UC, CT, TI, IO, ON, NS, S* 

5. Snowball Stemmer

When compared to the Porter Stemmer, the Snowball Stemmer can map non-English words too. Since it supports other languages, the Snowball Stemmers can be called a multi-lingual stemmer. The Snowball stemmers are also imported from the NLTK package. 

This stemmer is based on a programming language called ‘Snowball’ that processes small strings and is the most widely used stemmer. The Snowball stemmer is way more aggressive than Porter Stemmer and is also referred to as Porter2 Stemmer. Because of the improvements added when compared to the Porter Stemmer, the Snowball stemmer has greater computational speed. 

6. Lancaster Stemmer

The Lancaster stemmers are more aggressive and dynamic compared to the other two stemmers. The stemmer is faster, but the algorithm is confusing when dealing with small words. The drawback is that it is not as efficient as Snowball Stemmers. The Lancaster stemmers save the rules externally and uses an iterative algorithm. 

About Engati

Engati powers 45,000+ chatbot & live chat solutions in 50+ languages across the world.

We aim to empower you to create the best customer experiences you could imagine. 

So, are you ready to create unbelievably smooth experiences?

Check us out!

Stemming

October 14, 2020

Table of contents

Key takeawaysCollaboration platforms are essential to the new way of workingEmployees prefer engati over emailEmployees play a growing part in software purchasing decisionsThe future of work is collaborativeMethodology

What is stemming?

Stemming is the process of reducing a word to its word stem that affixes to suffixes and prefixes or the roots.

While a stemming algorithm is a linguistic normalization process in which the variant forms of a word are reduced to a standard form. It is a technique used to extract the base form of the words by removing affixes from them. It is just like cutting down the branches of a tree to its stems. For example, the stem of the words “eating,” “eats,” “eaten” is “eat.”

Search engines use stemming for indexing the words. That’s why rather than storing all forms of a word, a search engine can store only the stems. In this way, stemming reduces the size of the index and increases retrieval accuracy.

Why are stemming and lemmatization different?

For grammatical reasons, documents will use different forms of a word, such as organizing, organizing, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. 

In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

Both stemming and lemmatization aim to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. 

For instance:

  • am, are, is $\Rightarrow$ be
  • car, cars, car's, cars' $\Rightarrow$ car

 

The result of this mapping of text will be something like:

  • the boy's cars are different colors $\Rightarrow$
  • the boy car be differ color

However, the two words differ in their flavor. Stemming usually refers to a simple heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly using a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and return the base or dictionary form of a word, which is known as the lemma. 

If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun. The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma. An additional plug-in component often does linguistic processing for stemming or lemmatization to the indexing process. Several such components exist, both commercially and open-source.

 

The errors in stemming

There are mainly two errors in stemming:

  1. Over-stemming - When two words are stemmed from the same root that are of different stems. Over-stemming can also be regarded as false positives 
  2. Under-stemming - When two words are stemmed from the same root that are not of different stems. Under-stemming can be interpreted as false negatives. 

Popular stemming algorithms

Porter’s Stemmer algorithm 

Porter’s Stemmer algorithm is one of the most popular stemming methods proposed in 1980. It is based on the idea that the suffixes in the English language are made up of a combination of smaller and simpler suffixes. This stemmer is known for its speed and simplicity. The main applications of Porter Stemmer include data mining and Information retrieval. However, its applications are only limited to English words. Also, the group of stems is mapped onto the same stem and the output stem is not necessarily a meaningful word. The algorithms are fairly lengthy and are known to be the oldest stemmer.

Example: EED -> EE means “if the word has at least one vowel and consonant plus EED ending, change the ending to EE” as ‘agreed’ becomes ‘agree.’ 

1. Lovins Stemmer 

Lovins proposed this algorithm in 1968, which removes the longest suffix from a word, then the word is recoded to convert this stem into valid words. 

Example: sitting -> sitt -> sit 

2. Dawson Stemmer 

It is an extension of Lovins stemmer in which suffixes are stored in the reversed order indexed by their length and last letter. 

3. Krovetz Stemmer 

It was proposed in 1993 by Robert Krovetz. Following are the steps: 

  1. Convert the plural form of a word to its singular form. 
  2. Convert the past tense of a word to its present tense and remove the suffix ‘ing.’ 

Example: ‘children’ -> ‘child’ 

4. N-Gram Stemmer 

An n-gram is a set of n consecutive characters extracted from a word in which similar words will have a high proportion of n-grams in common. 

Example: ‘INTRODUCTIONS’ for n=2 becomes : *I, IN, NT, TR, RO, OD, DU, UC, CT, TI, IO, ON, NS, S* 

5. Snowball Stemmer

When compared to the Porter Stemmer, the Snowball Stemmer can map non-English words too. Since it supports other languages, the Snowball Stemmers can be called a multi-lingual stemmer. The Snowball stemmers are also imported from the NLTK package. 

This stemmer is based on a programming language called ‘Snowball’ that processes small strings and is the most widely used stemmer. The Snowball stemmer is way more aggressive than Porter Stemmer and is also referred to as Porter2 Stemmer. Because of the improvements added when compared to the Porter Stemmer, the Snowball stemmer has greater computational speed. 

6. Lancaster Stemmer

The Lancaster stemmers are more aggressive and dynamic compared to the other two stemmers. The stemmer is faster, but the algorithm is confusing when dealing with small words. The drawback is that it is not as efficient as Snowball Stemmers. The Lancaster stemmers save the rules externally and uses an iterative algorithm. 

Share

Continue Reading