Bot essentials 12: The NLU deep-dive – Stemming and lemmatization

So what exactly are stemming and lemmatization, and how are they used in machine learning? Both approaches deal with inflections in language use, so that search/retrieval and response accuracy can be increased further.


When we stem a branch, we cut off the redundant offshoots to retain the core of the branch or tree. Similarly, in word stemming, we cut away the redundant parts of a word to arrive at its core form. The standard technique for stemming is Porter’s algorithm, a standard set of heuristics for handling inflections in English.

A stemmer (as we call the algorithm) works by chopping off word endings, so both "berry" and "berries" are reduced to "berri". The resulting stem does not have to be a real word; it only has to be the same for all the inflected forms. Applying a stemmer increases the likelihood of matching words against their inflected forms.
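
To make the chopping concrete, here is a minimal sketch in Python. It implements only two of the rules from Porter's algorithm (step 1a for plural endings and step 1c for a final "y"), and the vowel check is simplified, so treat it as an illustration rather than a full stemmer. The function name simple_stem is made up for this example.

```python
def simple_stem(word: str) -> str:
    """Apply only steps 1a and 1c of Porter's algorithm (a deliberate simplification)."""
    w = word.lower()

    # Step 1a: handle plural suffixes, longest match first.
    if w.endswith("sses"):
        w = w[:-2]          # caresses -> caress
    elif w.endswith("ies"):
        w = w[:-2]          # berries -> berri
    elif w.endswith("ss"):
        pass                # caress stays caress
    elif w.endswith("s"):
        w = w[:-1]          # cats -> cat

    # Step 1c: a final "y" becomes "i" when the part before it contains a vowel.
    # Assumption: only a, e, i, o, u count as vowels here; the full algorithm
    # also treats "y" as a vowel in some positions.
    if w.endswith("y") and any(ch in "aeiou" for ch in w[:-1]):
        w = w[:-1] + "i"    # berry -> berri

    return w


print(simple_stem("Berry"), simple_stem("berries"))
```

Both words come out as the same stem, which is what lets a search for one match text containing the other. A production bot would more likely rely on a complete, tested stemmer implementation than on a hand-rolled one like this.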


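We consider lemmatization to be more effective than stemming. A lemmatization algorithm doesn't just chop off the inflections; it uses a knowledge base, such as a dictionary of word forms, to obtain the correct base form (the lemma) of each word. For example, a lemmatizer maps "berries" to "berry", where a stemmer would produce "berri".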
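
A rough way to picture the knowledge-base lookup is a table from inflected forms to their base forms. The sketch below uses a tiny hand-written table, LEMMA_TABLE, which is purely hypothetical; real lemmatizers draw on a large lexicon and usually take the part of speech into account as well.

```python
# Hypothetical, hand-written knowledge base: inflected form -> lemma.
# A real system would use a full lexicon rather than a few entries.
LEMMA_TABLE = {
    "berries": "berry",
    "better": "good",
    "ran": "run",
    "running": "run",
    "mice": "mouse",
}


def lemmatize(word: str) -> str:
    """Look the word up in the table; unknown words are returned unchanged."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)


for w in ["Berries", "mice", "better", "bot"]:
    print(w, "->", lemmatize(w))
```

Note that "mice" and "better" cannot be handled by chopping off a suffix at all, which is where the lookup earns its keep.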

Stemming and lemmatization are techniques for normalising word forms. We use them to help frame the intent and the context of the words in a sentence. These techniques are used in the NLP engine of chatbot platforms like Engati to find the closest match between the questions people ask and the available answers.
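
As an illustration of how normalised words help with matching, the sketch below scores each stored FAQ question by the overlap of its stemmed words with the user's question and returns the answer with the best score. This is not how Engati's engine works internally; the crude stem function, the overlap score and the sample FAQ entries are all assumptions made for the example.

```python
import re


def stem(word):
    # Crude plural and final-"y" normalisation, standing in for a real stemmer.
    w = word.lower()
    if w.endswith("ies"):
        w = w[:-2]
    elif w.endswith("s") and not w.endswith("ss"):
        w = w[:-1]
    if w.endswith("y") and len(w) > 2:
        w = w[:-1] + "i"
    return w


def stem_set(text):
    # Split into lowercase alphabetic tokens and stem each one.
    return {stem(tok) for tok in re.findall(r"[a-z]+", text.lower())}


def best_answer(question, faq):
    """Return the answer whose stored question shares the most stems, or None."""
    q = stem_set(question)

    def score(stored_question):
        s = stem_set(stored_question)
        union = q | s
        # Jaccard overlap: shared stems divided by all stems seen.
        return len(q & s) / len(union) if union else 0.0

    best = max(faq, key=score)
    return faq[best] if score(best) > 0 else None


# Made-up FAQ entries for the example.
FAQ = {
    "What berries do you sell?": "We sell strawberries and blueberries.",
    "What are your opening hours?": "We are open 9am to 5pm.",
}

print(best_answer("Do you sell any berry?", FAQ))
```

Because "berry" and "berries" share a stem, the first question matches even though the user never typed the plural form.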

We couple the base tenets of Natural Language Understanding with processing techniques. This allows us to categorise each sentence sent to the bot in free form. Determining the context, intent and word meanings, and matching them with the available responses, forms the core of how a bot structures a response.

The aspects described in the last few blogs are only those needed for building a simple FAQ bot. However, this is an important first step in the evolution of machine learning technology. It helps us build a knowledge set and even allows us to open it up to the world for queries.

