What is a preprocessor?
A preprocessor is a program that processes its input data to produce output that is used as input to another program like a compiler.
How is data preprocessing different from preprocessing in general?
Data preprocessing is used in machine learning and data mining to make input data easier to work with.
It's an essential step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. Data-gathering methods are often loosely controlled, resulting in out-of-range values, impossible data combinations, and missing values.
What is text preprocessing?
To preprocess your text simply means bringing it into a form that is predictable and analyzable for your task. A task here is a combination of approach and domain. For example, extracting top keywords with TF-IDF (approach) from Tweets (domain) is an example of a task.
- Task = approach + domain
One task’s ideal preprocessing can become another task’s worst nightmare. So take note: text preprocessing is not directly transferable from task to task.
Let's say you are trying to discover the most commonly used words in a news dataset. If your preprocessing step removes stopwords simply because some other task required it, you will probably miss some of the common words, since they have already been eliminated. So it's not a one-size-fits-all approach.
What is the importance of text preprocessing?
To illustrate the importance of text preprocessing, consider a task on sentiment analysis for customer reviews.
Suppose a customer left feedback saying that "their customer support service is a nightmare." A human can easily identify the sentiment of this review as negative. For a machine, however, it is not that straightforward.
To illustrate this point, I experimented with the Azure Text Analytics API. Feeding in the same review, the API returned a score of 50%, i.e., neutral sentiment, which is wrong.
However, if we perform some text preprocessing first, in this case just removing some stopwords, the score becomes 16%, i.e., negative sentiment, which is correct.
So, as illustrated, text preprocessing done correctly can help increase the accuracy of NLP tasks.
What are the types of text preprocessing techniques?
There are different ways to preprocess your text.
1. Lowercasing
Lowercasing is a common text preprocessing technique. The idea is to convert the input text into a single casing format so that 'text', 'Text', and 'TEXT' are treated the same way.
This is especially helpful for text featurization techniques such as frequency counts and TF-IDF, since it maps different casings of a word to one token, reducing duplication and producing correct counts.
It may not be helpful for tasks like part-of-speech tagging (where proper casing carries information about nouns, for example) or sentiment analysis (where all-caps can signal anger, for example).
By default, lowercasing is performed by most modern vectorizers and tokenizers, so we may need to disable it depending on our use case.
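As a minimal sketch (plain Python, no external libraries) of how lowercasing fixes the counts:

```python
from collections import Counter

words = ["Text", "text", "TEXT", "mining"]

# Without lowercasing, each casing variant is counted separately.
raw_counts = Counter(words)                        # {'Text': 1, 'text': 1, 'TEXT': 1, 'mining': 1}

# With lowercasing, the variants collapse into one entry with the right count.
lower_counts = Counter(w.lower() for w in words)   # {'text': 3, 'mining': 1}
```

In scikit-learn, for example, CountVectorizer and TfidfVectorizer lowercase by default (the lowercase parameter), so set lowercase=False if casing matters for your task.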
2. Removal of Punctuations
Another common text preprocessing technique is removing punctuation from the text data. Again, this is a text standardization step that helps treat 'hurray' and 'hurray!' the same way.
We also need to choose the list of punctuation characters to exclude carefully, depending on the use case. For example, string.punctuation in Python contains the following symbols: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
We can add or remove more punctuations as per our need.
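A small sketch using Python's built-in string.punctuation; the keep parameter is my own addition, for punctuation you want to retain:

```python
import string

def remove_punctuation(text, keep=""):
    """Strip punctuation characters, optionally keeping some (e.g. '.')."""
    punct = "".join(ch for ch in string.punctuation if ch not in keep)
    return text.translate(str.maketrans("", "", punct))

print(remove_punctuation("hurray!"))              # hurray
print(remove_punctuation("e.g. 50%!", keep="."))  # e.g. 50
```

str.translate with a deletion table is both concise and fast, since it avoids building the cleaned string one replacement at a time.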
3. Removal of stopwords
Stopwords are commonly occurring words in a language, like 'the', 'a', etc. They can usually be removed from the text, as they don't provide valuable information for downstream analysis. However, in tasks like part-of-speech tagging, we should not remove them, as they provide valuable information about the POS.
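A sketch with a tiny hand-picked stopword list; in practice you would use a curated list such as NLTK's stopwords corpus or spaCy's defaults:

```python
# Tiny illustrative stopword list (real lists are much longer).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in"}

def remove_stopwords(text):
    # Keep only tokens whose lowercased form is not a stopword.
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

print(remove_stopwords("the service is a nightmare"))  # service nightmare
```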
4. Removal of Frequent words
In the previous preprocessing step, we removed stopwords based on language-level information. But if we have a domain-specific corpus, we may also have some frequent words that carry little importance for our analysis.
This step removes the most frequent words in the given corpus. If we use something like TF-IDF, this is automatically taken care of, since corpus-frequent words receive low weights.
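A sketch of dropping corpus-frequent words with collections.Counter; the top-2 cutoff and the toy corpus here are arbitrary and would be tuned per project:

```python
from collections import Counter

docs = ["the movie plot was good", "the movie cast was good", "good movie overall"]
counts = Counter(w for d in docs for w in d.split())

# Treat the top-2 most frequent corpus words as domain-specific stopwords.
frequent = {w for w, _ in counts.most_common(2)}   # {'movie', 'good'}
cleaned = [" ".join(w for w in d.split() if w not in frequent) for d in docs]
# cleaned == ['the plot was', 'the cast was', 'overall']
```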
5. Removal of Rare words
This is very similar to the previous preprocessing step, except that here we remove the rare words from the corpus: words that appear only a handful of times are often typos or noise and occur too infrequently to be useful as features.
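The same Counter-based approach works for rare words; here, words occurring only once in the corpus are dropped, and both the example corpus and the cutoff are arbitrary:

```python
from collections import Counter

docs = ["great phone great battery", "great camera", "xylophonic screen"]
counts = Counter(w for d in docs for w in d.split())

# Words that appear only once in the whole corpus are treated as rare.
rare = {w for w, c in counts.items() if c == 1}
cleaned = [" ".join(w for w in d.split() if w not in rare) for d in docs]
# cleaned == ['great great', 'great', '']
```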
6. Stemming
Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form.
For example, if the corpus contains the two words walks and walking, stemming strips the suffixes to reduce both to walk. But in another example with the two words console and consoling, the stemmer removes the suffix and produces consol, which is not a proper English word.
Several stemming algorithms are available; one of the most famous is the Porter stemmer, which is widely used.
Effects of stemming inflected words
Stemming helps deal with sparsity issues as well as standardizing vocabulary. I've had success with stemming in search applications in particular. The idea is that if you search for "deep learning classes," you also want to surface documents that mention "deep learning class" as well as "deep learn classes," even though the latter doesn't sound right. But you get where we are going with this: you want to match all variations of a word to bring up the most relevant documents.
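To show the behavior described above, here is a deliberately naive suffix-stripping stemmer; a real project should use a tested implementation such as NLTK's PorterStemmer rather than this toy:

```python
def naive_stem(word):
    """Toy stemmer: strip a few common suffixes, keeping a stem of >= 3 chars."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(naive_stem("walking"))    # walk
print(naive_stem("walks"))      # walk
print(naive_stem("consoling"))  # consol  (not a proper English word)
```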
7. Lemmatization
Lemmatization is similar to stemming in that it reduces inflected words to their stem, but it differs in making sure the root word (also called the lemma) belongs to the language.
As a result, lemmatization is generally slower than stemming. So depending on the speed requirement, we can choose either stemming or lemmatization.
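A dictionary-lookup sketch of the difference: a lemmatizer maps inflected forms to real words. Real lemmatizers, such as NLTK's WordNetLemmatizer, look lemmas up in a lexical database; the small mapping below is hypothetical:

```python
# Hypothetical lemma table for illustration; real lemmatizers consult
# a lexical database such as WordNet.
LEMMAS = {"walks": "walk", "walking": "walk", "consoling": "console", "better": "good"}

def lemmatize(word):
    # Fall back to the word itself when no lemma is known.
    return LEMMAS.get(word, word)

print(lemmatize("consoling"))  # console  (a real word, unlike the stem 'consol')
print(lemmatize("better"))     # good
```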
8. Stopword Removal
Stop words are a set of commonly used words in a language. Examples of stop words in English are “a,” “the,” “is,” “are,” etc. The intuition behind using stop words is that by removing low information words from the text, we can focus on the important words instead.
9. Removal of emojis and emoticons
With the growing usage of social media platforms, there has been an explosion of emojis in our day-to-day lives. We may need to remove these emojis for some kinds of textual analysis.
From Grammarist.com, an emoticon is built from keyboard characters that, when put together in a certain way, represent a facial expression, whereas an emoji is an actual image.
:-) is an emoticon
😀 is an emoji
Please note again that removing emojis and emoticons is not always preferred; the decision should be made based on the use case at hand.
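A sketch of emoji removal with a Unicode-range regex; the ranges below cover only a few common emoji blocks, and a production system would need a fuller list:

```python
import re

EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # miscellaneous symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons block (e.g. U+1F600, the grinning face)
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "]"
)

def remove_emojis(text):
    return EMOJI_RE.sub("", text).strip()

print(remove_emojis("great product \U0001F600"))  # great product
```

Keyboard emoticons such as :-) are plain ASCII, so they would need a separate pattern or a lookup list to remove.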
10. Spelling Correction
Another important text preprocessing step is spelling correction. Typos are common in text data, and we might want to correct these spelling mistakes before our analysis.
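A lightweight sketch using the standard library's difflib to snap misspellings to the closest vocabulary word; the vocabulary and cutoff are made up for illustration, and dedicated tools (e.g. pyspellchecker or SymSpell) are better suited for real use:

```python
import difflib

VOCAB = ["spelling", "correction", "common", "text", "data"]

def correct(word):
    """Return the closest vocabulary word, or the word itself if none is close."""
    matches = difflib.get_close_matches(word, VOCAB, n=1, cutoff=0.8)
    return matches[0] if matches else word

print(correct("speling"))  # spelling
print(correct("data"))     # data
```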