Term Frequency-Inverse Document Frequency (TF-IDF)

What is Term Frequency-Inverse Document Frequency?

Term Frequency-Inverse Document Frequency, or TF-IDF, is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.

This is done by multiplying two metrics: 

  1. How many times a word appears in a document
  2. The inverse document frequency of the word across a set of documents
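
Written as a formula, the product above might look like the following. This is one common formulation rather than the only one; the symbols t (a term), d (a document), D (the collection), and N (the number of documents in D) are notation assumed here for illustration.

```latex
\mathrm{tf\text{-}idf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D),
\qquad
\mathrm{idf}(t, D) = \log \frac{N}{\lvert \{ d \in D : t \in d \} \rvert}
```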

It has many uses, most importantly in automated text analysis, and is very useful for scoring words in machine learning algorithms for Natural Language Processing.

TF-IDF was invented for document search and information retrieval. The score increases proportionally to the number of times a word appears in a document, but is offset by the number of documents that contain the word. So words that are common in every document, such as “this”, “what”, and “if”, rank low even though they may appear many times, since they don’t mean much to that document in particular.

Figure: Term frequency-inverse document frequency (TF-IDF). Source: Towards Data Science

How is Term Frequency-Inverse Document Frequency calculated?

TF-IDF is frequently used in machine learning algorithms in various capacities, including stop-word removal. Stop words are common words such as “a”, “the”, “an”, and “it” that occur frequently but hold little informational value. TF-IDF consists of two components: term frequency and inverse document frequency.

Term frequency can be determined by counting the number of occurrences of a term in a document.

Inverse document frequency (IDF) is calculated by dividing the total number of documents in the collection by the number of documents that contain the term, and then taking the logarithm of that ratio. The logarithm dampens the effect of IDF, so that very rare terms do not dominate the score. IDF is useful for reducing the weight of terms that are common within a collection of documents.
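
The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the whitespace tokenization, the tiny example corpus, and the choice to return 0 for a term that appears in no document are all assumptions made for this example.

```python
import math
from collections import Counter


def term_frequency(term, document_tokens):
    """Raw count of `term` in one tokenized document."""
    return Counter(document_tokens)[term]


def inverse_document_frequency(term, corpus_tokens):
    """log(N / df), where df is the number of documents containing `term`."""
    n_docs = len(corpus_tokens)
    doc_freq = sum(1 for tokens in corpus_tokens if term in tokens)
    if doc_freq == 0:
        return 0.0  # assumption: a term seen in no document scores 0
    return math.log(n_docs / doc_freq)


def tf_idf(term, document_tokens, corpus_tokens):
    """Product of term frequency and inverse document frequency."""
    return (term_frequency(term, document_tokens)
            * inverse_document_frequency(term, corpus_tokens))


# Hypothetical three-document corpus, tokenized on whitespace.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

score_the = tf_idf("the", corpus[0], corpus)  # appears in every document
score_mat = tf_idf("mat", corpus[0], corpus)  # appears in one document only
```

Here “the” occurs twice in the first document but appears in all three documents, so its IDF is log(3/3) = 0 and its score is 0, while “mat” occurs once and appears in only one document, so it scores log(3).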

Why is Term Frequency-Inverse Document Frequency used in Machine Learning?

Machine learning with natural language is faced with one major hurdle – its algorithms usually deal with numbers, and natural language is, well, text. So we need to transform that text into numbers, otherwise known as text vectorization. It’s a fundamental step in the process of machine learning for analyzing data, and different vectorization algorithms will drastically affect end results, so you need to choose one that will deliver the results you’re hoping for.

Once you’ve transformed words into numbers in a way that machine learning algorithms can understand, the TF-IDF scores can be fed to algorithms such as Naive Bayes and Support Vector Machines, often greatly improving on the results of more basic methods like word counts.
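
As a rough sketch of what text vectorization with TF-IDF produces, the snippet below turns a few documents into rows of numbers, one column per vocabulary word. The function name `tfidf_matrix`, the lowercase whitespace tokenization, and the sample documents are assumptions for illustration. In practice a library such as scikit-learn's `TfidfVectorizer` is commonly used instead, and its defaults (smoothing and normalization) give somewhat different numbers from this plain formula.

```python
import math
from collections import Counter


def tfidf_matrix(documents):
    """Return (vocabulary, matrix) with one TF-IDF row per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for tokens in tokenized for tok in tokens})
    n_docs = len(tokenized)
    # Number of documents each word appears in.
    doc_freq = Counter(tok for tokens in tokenized for tok in set(tokens))
    idf = {tok: math.log(n_docs / doc_freq[tok]) for tok in vocabulary}
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[tok] * idf[tok] for tok in vocabulary])
    return vocabulary, matrix


# Hypothetical documents; each row of `matrix` is a numeric feature vector
# that could be passed to a classifier.
docs = ["free prize inside", "meeting at noon", "free meeting prize"]
vocabulary, matrix = tfidf_matrix(docs)
```

Each row has the same length as the vocabulary, which is what lets a classifier treat every document as a point in the same feature space.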

What are the applications of Term Frequency-Inverse Document Frequency?

Determining how relevant a word is to a document, which is what TF-IDF does, is useful in many ways, for example:

1. Information retrieval

TF-IDF was invented for document search and can be used to deliver results that are most relevant to what you’re searching for. Imagine you have a search engine and somebody searches for “LeBron”. The results will be displayed in order of relevance: the sports articles most relevant to the query will be ranked higher because TF-IDF gives the word “LeBron” a higher score in those articles.
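
The search scenario above could be sketched as follows. This is a toy ranking under stated assumptions, not how any particular search engine works: the function `rank_documents`, the sample articles, and the choice to score a document by summing the TF-IDF of each query term are all hypothetical.

```python
import math
from collections import Counter


def rank_documents(query, documents):
    """Return (document, score) pairs sorted by summed TF-IDF of query terms."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    scored = []
    for index, tokens in enumerate(tokenized):
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            doc_freq = sum(1 for other in tokenized if term in other)
            if doc_freq:  # skip query terms that occur in no document
                score += counts[term] * math.log(n_docs / doc_freq)
        scored.append((score, index))
    # Highest score first; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: -pair[0])
    return [(documents[index], score) for score, index in scored]


# Hypothetical article snippets.
articles = [
    "lebron scores forty as lebron leads the comeback",
    "the city council approves the new budget",
    "fans praise lebron after the game",
]
ranking = rank_documents("lebron", articles)
```

The first article mentions the query term twice and so ranks above the third, which mentions it once; the second article does not mention it at all and scores 0.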

Many search engines use TF-IDF scores, or closely related term-weighting schemes, as one of the signals in their ranking algorithms.

2. Keyword Extraction

TF-IDF is also useful for extracting keywords from text. How? The highest scoring words of a document are the most relevant to that document, and therefore they can be considered keywords for that document. Pretty straightforward.
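
A minimal sketch of this idea, assuming whitespace tokenization and a hypothetical helper named `top_keywords`, scores every word in one document and keeps the highest-scoring ones:

```python
import math
from collections import Counter


def top_keywords(doc_index, documents, k=3):
    """Return the k highest TF-IDF words of documents[doc_index]."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents each word appears in.
    doc_freq = Counter(tok for tokens in tokenized for tok in set(tokens))
    counts = Counter(tokenized[doc_index])
    scores = {
        tok: count * math.log(n_docs / doc_freq[tok])
        for tok, count in counts.items()
    }
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda tok: (-scores[tok], tok))[:k]


# Hypothetical documents.
documents = [
    "the espresso machine makes the best espresso",
    "the printer is out of paper",
    "the meeting is in the kitchen",
]
keywords = top_keywords(0, documents, k=1)
```

In this example “the” is as frequent as “espresso” in the first document, but because it appears in every document its score is 0, so “espresso” is returned as the keyword.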

Why does this work? Simply put, a word vector represents a document as a list of numbers, one for each possible word in the corpus. Vectorizing a document means taking the text and creating one of these vectors, so that the numbers in the vector represent the content of the text. TF-IDF gives us a way to associate each word in a document with a number that represents how relevant that word is to the document. Documents with similar relevant words will then have similar vectors, which is what we are looking for in a machine learning algorithm.
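
To illustrate the claim that documents with similar relevant words end up with similar vectors, the sketch below compares TF-IDF vectors using cosine similarity. Cosine similarity is a common choice but is an assumption here, since the text does not name a similarity measure; the function names and sample sentences are also hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one TF-IDF vector per document over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for tokens in tokenized for tok in tokens})
    n_docs = len(tokenized)
    doc_freq = Counter(tok for tokens in tokenized for tok in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([
            counts[tok] * math.log(n_docs / doc_freq[tok]) for tok in vocabulary
        ])
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents: two about football, one about politics.
docs = [
    "the striker scored a late goal",
    "the keeper saved a late goal",
    "the senate passed a budget bill",
]
vectors = tfidf_vectors(docs)
sports_similarity = cosine_similarity(vectors[0], vectors[1])
mixed_similarity = cosine_similarity(vectors[0], vectors[2])
```

The two football sentences share the informative words “late” and “goal”, so their similarity is positive. The first and third sentences share only “the” and “a”, which appear in every document and therefore carry zero weight, so their similarity is 0.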
