N-gram

Q: Why do we use N-grams?

1. Auto completion of sentences
2. Auto spell check
3. Speech recognition
4. Machine translation
5. To a certain extent, checking grammar in a given sentence

Q: What are the challenges in using N-grams?

1. Sensitivity to the training corpus
2. Smoothing

What is an N-gram?

An N-gram is simply a sequence of N words. For instance, a 2-gram (or bigram) is a two-word sequence like “please turn”, “turn your”, or “your homework”, and a 3-gram (or trigram) is a three-word sequence like “please turn your” or “turn your homework”.
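
To make the definition concrete, here is a minimal Python sketch that slides a window of size N over a list of words. The function name `ngrams` and the whitespace tokenization are assumptions made for illustration; real tokenizers handle punctuation and casing more carefully.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Naive whitespace tokenization, good enough for this example.
tokens = "please turn your homework".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)   # [('please', 'turn'), ('turn', 'your'), ('your', 'homework')]
print(trigrams)  # [('please', 'turn', 'your'), ('turn', 'your', 'homework')]
```

A sentence of L words yields L - N + 1 N-grams, so asking for N-grams longer than the sentence returns an empty list.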

What is an N-gram model in NLP?

An N-gram model is a type of Language Model (LM) that estimates a probability distribution over word sequences. The model is built by counting how often word sequences occur in a text corpus and then estimating probabilities from those counts. Because a simple N-gram model has limitations, it is often improved through smoothing, interpolation and backoff.

Given a sequence of N-1 words, an N-gram model predicts the most probable word to follow that sequence. It is a probabilistic model trained on a corpus of text. N-gram models are used in a wide range of natural language processing (NLP) applications such as speech recognition, machine translation and predictive text input.
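
The counting-then-estimating idea can be sketched for the bigram case (N = 2) as follows. The toy corpus and the helper names `bigram_prob` and `predict_next` are invented for illustration, and the probabilities are plain relative frequencies with no smoothing.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only.
corpus = [
    "please turn your homework in",
    "please turn your paper over",
    "please turn the page",
]

# For each word, count which words follow it and how often.
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follow_counts[prev][nxt] += 1

def bigram_prob(prev, word):
    """Relative frequency: count(prev, word) / count(prev, anything)."""
    total = sum(follow_counts[prev].values())
    return follow_counts[prev][word] / total if total else 0.0

def predict_next(prev):
    """Most frequent follower of prev, or None if prev was never followed."""
    followers = follow_counts[prev]
    return followers.most_common(1)[0][0] if followers else None

print(bigram_prob("turn", "your"))  # 2 of the 3 occurrences of "turn" are followed by "your"
print(predict_next("turn"))         # your
```

With a larger N the same approach applies, except that the lookup key becomes the previous N-1 words instead of a single word.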

A model that looks only at how frequently a word occurs, without considering the previous words, is called a unigram model. A model that considers only the previous word to predict the current word is called a bigram model.

Why do we use N-grams?

In Natural Language Processing, N-grams are used for a variety of tasks. Some examples include:

Auto completion of sentences
Auto spell check
Speech recognition
Machine translation
To a certain extent, checking grammar in a given sentence

In speech recognition, the input can be noisy, which can lead to wrong speech-to-text conversions. N-gram models can help correct these by favoring the word sequences they consider more probable.

N-gram models are even employed in machine translation for the purpose of producing more natural sentences in the target language.

When checking for spelling errors and correcting them, dictionary lookups alone will sometimes not help. Quite often, a misspelt word turns out to be another valid dictionary word that does not fit the context of the sentence in which it is used. N-gram models can be used to catch and correct such errors.

While N-gram models are usually used at the word-level, they can also be used at the character level to perform stemming, that is, to separate the root word from the suffix.
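
As a rough illustration of character-level N-grams, the sketch below splits two related words into character trigrams and looks at what they share. This is not a stemmer; it only shows that words with a common root tend to share their leading character N-grams. The function name `char_ngrams` and the example words are assumptions for illustration.

```python
def char_ngrams(word, n):
    """Return all substrings of length n, in order of appearance."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

a = char_ngrams("running", 3)
b = char_ngrams("runner", 3)

print(a)  # ['run', 'unn', 'nni', 'nin', 'ing']
print(b)  # ['run', 'unn', 'nne', 'ner']

# Trigrams common to both words cover the shared root "runn".
shared = sorted(set(a) & set(b))
print(shared)  # ['run', 'unn']
```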

It is also possible to use N-gram statistics to classify languages or to differentiate between US and UK spellings.

A wide range of NLP applications benefit from N-gram models. These applications include part-of-speech tagging, natural language generation, word similarity, sentiment extraction and predictive text input.

How do N-grams work?

Let us take a look at the following examples.

San Francisco (is a 2-gram)
The Three Musketeers (is a 3-gram)
She stood up slowly (is a 4-gram)

Which of these three N-grams have you seen most frequently? Probably “San Francisco” and “The Three Musketeers”. On the other hand, you might not have seen “She stood up slowly” that often. In other words, “She stood up slowly” is an N-gram that occurs less often in sentences than the first two examples.

Now if we assign a probability to the occurrence of an N-gram or the probability of a word occurring next in a sequence of words, it can be very useful. Why?

First of all, it can help in deciding which N-grams can be chunked together to form single entities (like “San Francisco” chunked together as one word, “high school” being chunked as one word).

It can also help make next-word predictions. Say you have the partial sentence “Please hand over your”. The next word is then more likely to be “test”, “assignment” or “paper” than “school”.

It can also help to correct spelling errors. For instance, “drink cofee” could be corrected to “drink coffee” if you knew that the word “coffee” had a high probability of occurring after the word “drink” and that the overlap of letters between “cofee” and “coffee” is high.
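
The “drink cofee” example can be sketched by combining the two signals mentioned above: how likely a candidate is after the previous word, and how many letters it shares with the typo. Everything specific here is an assumption for illustration: the probabilities are invented, the letter overlap is measured with `difflib.SequenceMatcher` from the Python standard library, and multiplying the two scores is just one simple way to combine them.

```python
from difflib import SequenceMatcher

# Hypothetical bigram probabilities P(word | previous word); numbers are invented.
bigram_prob = {
    ("drink", "coffee"): 0.30,
    ("drink", "tea"): 0.40,
    ("drink", "coffin"): 0.001,
}

def letter_overlap(a, b):
    """Similarity in [0, 1] based on matching runs of characters."""
    return SequenceMatcher(None, a, b).ratio()

def correct(prev_word, typo):
    """Pick the candidate maximizing context probability times letter overlap."""
    candidates = [w for (p, w) in bigram_prob if p == prev_word]
    if not candidates:
        return typo  # no context information, leave the word unchanged
    return max(
        candidates,
        key=lambda w: bigram_prob[(prev_word, w)] * letter_overlap(typo, w),
    )

print(correct("drink", "cofee"))  # coffee
```

Note that “tea” is the more probable word after “drink” in this toy table, but “coffee” wins because it is much closer to the typo in spelling.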

As you can see, assigning these probabilities has a huge potential in the NLP domain.

How can you evaluate an N-gram model?

The most effective way to evaluate an N-gram model is to see how well it predicts in end-to-end application testing. This is known as extrinsic evaluation. It is a rather time-consuming and expensive process.

Alternatively, you could use intrinsic evaluation, which involves defining an appropriate metric and evaluating the model independently of the application. It can act as a quick first step to check algorithmic performance, but it does not guarantee application performance.

Perplexity, usually written as PP, is a popular metric for intrinsic evaluation. Because perplexity is inversely related to the probability the model assigns to the test set, minimizing perplexity is equivalent to maximizing the test set probability.
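
Under the standard definition, the perplexity of a test sequence of N words is the inverse of its probability, normalized by the number of words: PP = P(w1 ... wN) ** (-1/N). A minimal sketch, assuming the model has already produced one conditional probability per test word (the probability lists below are invented):

```python
import math

def perplexity(word_probs):
    """PP = (p_1 * p_2 * ... * p_N) ** (-1/N), computed in log space.

    word_probs holds the probability the model assigned to each test word
    given its context; every value must be greater than zero.
    """
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)

# Invented per-word probabilities from two hypothetical models on the same test text.
model_a = [0.5, 0.25, 0.5, 0.25]
model_b = [0.1, 0.05, 0.1, 0.05]

print(perplexity(model_a))
print(perplexity(model_b))
```

The model that assigns higher probabilities to the test words (`model_a` here) ends up with the lower perplexity. Working in log space avoids numerical underflow when many small probabilities are multiplied together.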

What are the challenges in using N-grams?

There are, of course, challenges, as with every modeling approach and estimation method. Let’s look at the key ones affecting the N-gram model, as well as the use of maximum likelihood estimation (MLE).

1. Sensitivity to the training corpus

The N-gram model, like many statistical models, is significantly dependent on the training corpus. As a result, the probabilities often encode particular facts about a given training corpus. Besides, the performance of the N-gram model varies with the change in the value of N.

Moreover, you may have a language task in which you know all the words that can occur, and hence you know the vocabulary size V in advance. This closed vocabulary assumption means there are no unknown words, which is unlikely in practical scenarios.

2. Smoothing

A notable problem with the MLE approach is sparse data. Any N-gram that appears a sufficient number of times may get a reasonable probability estimate. But because any corpus is limited, some perfectly acceptable English word sequences are bound to be missing from it.

As a result, the N-gram matrix for any training corpus is bound to have a substantial number of putative “zero probability N-grams”.
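
One of the simplest ways to avoid zero probabilities is add-one (Laplace) smoothing, which adds 1 to every bigram count and adds the vocabulary size V to the denominator. The sketch below is an illustration under stated assumptions, not a recommendation: the toy text is invented, it is treated as one continuous stream of tokens, and more refined smoothing methods are generally preferred in practice.

```python
from collections import Counter

# Toy text, invented for illustration and treated as one token stream.
tokens = "please turn your homework in please turn the page".split()

vocab = set(tokens)
V = len(vocab)                                   # vocabulary size
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])            # how often each word starts a bigram

def mle(prev, word):
    """Unsmoothed estimate: count(prev, word) / count(prev)."""
    c = context_counts[prev]
    return bigram_counts[(prev, word)] / c if c else 0.0

def add_one(prev, word):
    """Add-one estimate: (count(prev, word) + 1) / (count(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + V)

# "turn page" never occurs in the text, so MLE gives it zero probability.
print(mle("turn", "page"))      # 0.0
print(add_one("turn", "page"))  # 1 / (2 + 7), a small but non-zero value
```

The smoothed probabilities for a given previous word still sum to 1 over the vocabulary, because the extra V in the denominator exactly balances the 1 added to each of the V possible next words.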