Stemming

Table of contents

Automate your business at $5/day with Engati

REQUEST A DEMO
Stemming

What is stemming in NLP?

Stemming, in Natural Language Processing (NLP), refers to the process of reducing a word to its word stem that affixes to suffixes and prefixes or the roots.

While a stemming algorithm is a linguistic normalization process in which the variant forms of a word are reduced to a standard form. It is a technique used to extract the base form of the words by removing affixes from them. It is just like cutting down the branches of a tree to its stems. For example, the stem of the words “eating,” “eats,” “eaten” is “eat.”

Search engines use stemming for indexing the words. That’s why rather than storing all forms of a word, a search engine can store only the stems. In this way, stemming reduces the size of the index and increases retrieval accuracy.

stemming
Source: C# Corner

Why are stemming and lemmatization different?

For grammatical reasons, documents will use different forms of a word, such as organizing, organizing, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. 

In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

Both stemming and lemmatization aim to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. 

For instance:

  • am, are, is $\Rightarrow$ be
  • car, cars, car's, cars' $\Rightarrow$ car

 

The result of this mapping of text will be something like:

  • the boy's cars are different colors $\Rightarrow$
  • the boy car be differ color

However, the two words differ in their flavor. Stemming usually refers to a simple heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly using a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and return the base or dictionary form of a word, which is known as the lemma. 

If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun. The two may also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma. An additional plug-in component often does linguistic processing for stemming or lemmatization to the indexing process. Several such components exist, both commercially and open-source.

 

What are the errors that could occur in stemming?

There are mainly two errors that could occur in stemming:

  1. Over-stemming - When two words are stemmed from the same root that are of different stems. Over-stemming can also be regarded as false positives 
  2. Under-stemming - When two words are stemmed from the same root that are not of different stems. Under-stemming can be interpreted as false negatives. 

Which are some of the popular stemming algorithms?

Porter’s Stemmer algorithm 

Porter’s Stemmer algorithm is one of the most popular stemming methods proposed in 1980. It is based on the idea that the suffixes in the English language are made up of a combination of smaller and simpler suffixes. This stemmer is known for its speed and simplicity. The main applications of Porter Stemmer include data mining and Information retrieval. However, its applications are only limited to English words. Also, the group of stems is mapped onto the same stem and the output stem is not necessarily a meaningful word. The algorithms are fairly lengthy and are known to be the oldest stemmer.

Example: EED -> EE means “if the word has at least one vowel and consonant plus EED ending, change the ending to EE” as ‘agreed’ becomes ‘agree.’ 

1. Lovins Stemmer 

Lovins proposed this algorithm in 1968, which removes the longest suffix from a word, then the word is recoded to convert this stem into valid words. 

Example: sitting -> sitt -> sit 

2. Dawson Stemmer 

The Dawson Stemmer is an extension of the Lovins stemmer. In the Dawson Stemmer, suffixes are stored in the reversed order indexed by their length and last letter. 

3. Krovetz Stemmer 

This stemming algorithm was proposed in 1993 by Robert Krovetz. Here are the steps that the Krovetz Stemmer follows:

  1. Convert the plural form of a word to its singular form. 
  2. Convert the past tense of a word to its present tense and remove the suffix ‘ing.’ 

Example: ‘children’ -> ‘child’ 

4. N-Gram Stemmer 

An n-gram is a set of n consecutive characters extracted from a word in which similar words will have a high proportion of n-grams in common. 

Example: ‘INTRODUCTIONS’ for n=2 becomes : *I, IN, NT, TR, RO, OD, DU, UC, CT, TI, IO, ON, NS, S* 

5. Snowball Stemmer

When compared to the Porter Stemmer, the Snowball Stemmer can map non-English words too. Since it supports other languages, the Snowball Stemmers can be referred to as a multilingual stemmer. The Snowball stemmers are also imported from the NLTK package. 

The Snowball Stemmer is based on a programming language called ‘Snowball’ that processes small strings and is the most widely used stemmer. The Snowball stemmer is way more aggressive than Porter Stemmer and is also referred to as Porter2 Stemmer. Because of the improvements added when compared to the Porter Stemmer, the Snowball stemmer has greater computational speed. 

6. Lancaster Stemmer

The Lancaster stemmers are more aggressive and dynamic compared to the other two stemmers. The stemmer is faster, but the algorithm is confusing when dealing with small words. The drawback is that it is not as efficient as Snowball Stemmers. The Lancaster stemmers save the rules externally and uses an iterative algorithm. 

Close Icon
Request a Demo!

Get started on Engati with the help of a personalised demo.

Thanks for the information.
We will be shortly getting in touch with you.
Please enter a valid email address.
For any other query reach out to us on contact@engati.com
Close Icon
Congratulations! Your demo is recorded.

Select an option on how Engati can help you.

I am looking for a conversational AI engagement solution for the web and other channels.

I would like for a conversational AI engagement solution for WhatsApp as the primary channel

I am an e-commerce store with Shopify. I am looking for a conversational AI engagement solution for my business

I am looking to partner with Engati to build conversational AI solutions for other businesses

continue
Finish
Close Icon
You're a step away from building your Al chatbot

How many customers do you expect to engage in a month?

Less Than 2000

2000-5000

More than 5000

Finish
Close Icon
Thanks for the information.

We will be shortly getting in touch with you.

Close Icon

Contact Us

Please fill in your details and we will contact you shortly.

Thanks for the information.
We will be shortly getting in touch with you.
Oops! Looks like there is a problem.
Never mind, drop us a mail at contact@engati.com

<script type="application/ld+json">
{
 "@context": "https://schema.org",
 "@type": "FAQPage",
 "mainEntity": [{
   "@type": "Question",
   "name": "What is stemming?",
   "acceptedAnswer": {
     "@type": "Answer",
     "text": "Stemming is the process of reducing a word to its word stem that affixes to suffixes and prefixes or the roots."
   }
 },{
   "@type": "Question",
   "name": "Why are stemming and lemmatization different?",
   "acceptedAnswer": {
     "@type": "Answer",
     "text": "The two words differ in their flavor. Stemming usually refers to a simple heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time and often includes the removal of derivational affixes."
   }
 },{
   "@type": "Question",
   "name": "What are popular stemming algorithms?",
   "acceptedAnswer": {
     "@type": "Answer",
     "text": "1. Lovins Stemmer.
2. Dawson Stemmer.
3. Krovetz Stemmer.
4. N-Gram Stemmer.
5. Snowball Stemmer.
6. Lancaster Stemmer."
   }
 }]
}
</script>