Normalization

What is normalization?

Database normalization is the process of structuring a database, usually a relational database, according to a series of so-called normal forms in order to reduce data redundancy and improve data integrity.

Normalization organizes the attributes (columns) and relations (tables) of a database so that integrity constraints properly enforce the dependencies between them. It is accomplished by applying formal rules, either through synthesis (creating a new database design) or decomposition (improving an existing database design).

In Natural Language Processing (NLP), normalization is a process that converts a list of words to a more uniform sequence. This is useful in preparing text for later processing. In addition, by transforming the words to a standard format, other operations can work with the data and not have to deal with issues that might compromise the process. For example, converting all words to lowercase will simplify the searching process.

The normalization process can improve text matching. For example, there are several ways that the term "modem router" can be expressed, such as modem and router, modem & router, modem/router, and modem-router. Normalizing these variants to one common form makes it easier to supply the correct information to a shopper.
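As a rough illustration of the "modem router" example, here is a minimal Python sketch. The helper name normalize_term and the list of connectors it collapses are assumptions made for this example, not a standard API.

```python
import re

# Hypothetical helper: collapse the connectors "and", "&", "/" and "-"
# (with any surrounding whitespace) into a single space, after lowercasing.
_CONNECTORS = re.compile(r"\s*(?:\band\b|&|/|-)\s*")

def normalize_term(text: str) -> str:
    collapsed = _CONNECTORS.sub(" ", text.lower())
    # Squeeze any leftover runs of whitespace.
    return re.sub(r"\s+", " ", collapsed).strip()

variants = [
    "modem router",
    "modem and router",
    "modem & router",
    "modem/router",
    "Modem-Router",
]
normalized = {normalize_term(v) for v in variants}
print(normalized)  # {'modem router'}
```

A rule this blunt also rewrites hyphenated words that should stay intact, so in practice it would probably be limited to a known list of product terms.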

Note that normalization can also harm an NLP task. For example, converting text to lowercase can make searches less reliable when case carries meaning.

Why do we need normalization?

When we normalize text, we attempt to reduce its randomness, bringing it closer to a predefined “standard.” This reduces the number of distinct forms the computer has to deal with and improves efficiency. Normalization techniques like stemming and lemmatization aim to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form.

The purpose of database normalization is to eliminate duplicate data and the insertion, update, and deletion anomalies that duplication causes in relational tables. It also helps lower redundancy and complexity by reviewing the data types that are introduced and used in each table. In practice, it divides large tables into smaller ones and links them through relationships, so that no table contains repeating groups and the chances of anomalies are minimized.

What are the types of normalization?

Here are the types of normalization:

First Normal Form (1NF)

A table is in First Normal Form (1NF) if every attribute holds only atomic (indivisible) values. A table with multivalued or composite attribute values is not in 1NF. To bring it into 1NF, split those entries so that each cell holds a single value.
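To make the idea of atomic values concrete, here is a small Python sketch. The student table, its column names, and the to_first_normal_form helper are all hypothetical and exist only for illustration.

```python
# Hypothetical un-normalized rows: "phones" packs several values into one cell.
students = [
    {"student_id": 1, "name": "Ana", "phones": "555-0101, 555-0102"},
    {"student_id": 2, "name": "Ben", "phones": "555-0199"},
]

def to_first_normal_form(rows):
    """Emit one row per (student, phone) so every cell holds a single value."""
    atomic_rows = []
    for row in rows:
        for phone in row["phones"].split(","):
            atomic_rows.append(
                {
                    "student_id": row["student_id"],
                    "name": row["name"],
                    "phone": phone.strip(),
                }
            )
    return atomic_rows

students_1nf = to_first_normal_form(students)
for row in students_1nf:
    print(row)
```

The result is in 1NF, but the student name now repeats on every phone row. Removing that kind of repetition is what the higher normal forms address.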

Second Normal Form (2NF)

A relation is in 2NF if it is in 1NF and every non-prime attribute is fully functionally dependent on each candidate key. In other words, no non-prime attribute depends on only part of a candidate key (a partial dependency).

Third Normal Form (3NF)

A table is in third normal form if it is in 2NF and does not contain any transitive dependency. A transitive dependency is a situation in which a non-prime attribute depends on another non-prime attribute, which in turn depends on the key, so the first attribute depends on the key only indirectly.
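A transitive dependency is easier to see with data. In the hypothetical employee table below (all names are invented for this sketch), dept_name depends on dept_id, which in turn depends on the key emp_id. One way to remove the dependency is to move department names into their own table.

```python
# Hypothetical table with a transitive dependency:
# emp_id -> dept_id and dept_id -> dept_name.
employees = [
    {"emp_id": 1, "name": "Ana", "dept_id": 10, "dept_name": "Sales"},
    {"emp_id": 2, "name": "Ben", "dept_id": 10, "dept_name": "Sales"},
    {"emp_id": 3, "name": "Cy", "dept_id": 20, "dept_name": "Support"},
]

def decompose_to_3nf(rows):
    """Split the rows into an employee table and a department lookup table."""
    employee_table = [
        {"emp_id": r["emp_id"], "name": r["name"], "dept_id": r["dept_id"]}
        for r in rows
    ]
    # Each department name is now stored once, keyed by dept_id.
    department_table = {r["dept_id"]: r["dept_name"] for r in rows}
    return employee_table, department_table

employee_table, department_table = decompose_to_3nf(employees)
print(department_table)  # {10: 'Sales', 20: 'Support'}
```

After the split, renaming a department means changing one entry instead of every matching employee row.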

Boyce-Codd Normal Form (BCNF)

BCNF is a stricter version of 3NF and is sometimes called 3.5NF. For a table or relation to be in BCNF, it must be in 3NF. In addition, for every nontrivial functional dependency (FD) X → Y that holds in a relation R, X must be a superkey.
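The superkey condition can be checked mechanically by computing attribute closures. The following Python sketch assumes the functional dependencies are given as (left side, right side) pairs. The function names closure and is_bcnf and the two sample relations are illustrative, not taken from any particular library.

```python
def closure(attributes, dependencies):
    """Return every attribute functionally determined by `attributes`."""
    result = set(attributes)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in dependencies:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_bcnf(all_attributes, dependencies):
    """True if the left side of every nontrivial dependency is a superkey."""
    for lhs, rhs in dependencies:
        if set(rhs) <= set(lhs):
            continue  # trivial dependency, always allowed
        if closure(lhs, dependencies) != set(all_attributes):
            return False  # lhs does not determine the whole relation
    return True

# Hypothetical relation: each instructor teaches exactly one course.
enrollment_attrs = ("student", "course", "instructor")
enrollment_fds = [
    (("student", "course"), ("instructor",)),
    (("instructor",), ("course",)),
]
print(is_bcnf(enrollment_attrs, enrollment_fds))  # False

person_attrs = ("emp_id", "name")
person_fds = [(("emp_id",), ("name",))]
print(is_bcnf(person_attrs, person_fds))  # True
```

The enrollment relation fails because instructor determines course without being a superkey, even though the relation satisfies 3NF.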

Fourth Normal Form (4NF)

A table or relation is in 4NF if it is in BCNF and contains no nontrivial multivalued dependency other than those determined by a superkey.

Fifth Normal Form (5NF)

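A table or relation is in 5NF if it is in 4NF and contains no join dependency that is not implied by its candidate keys. In other words, it cannot be split into smaller tables any further without losing information. It is also referred to as Project-Join Normal Form (PJNF).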

How does normalization affect chatbots?

Conversational normalization is the process by which a chatbot detects and corrects common misspellings and other errors that could change the meaning of a message.

A chatbot goes through five major steps (tokenizing, normalizing, recognizing entities, dependency parsing, and generation) to read, interpret, and understand a message and then formulate and send a response. Let’s take a closer look.

Tokenizing: The chatbot starts by chopping up text into pieces (also called ‘tokens’) and removing punctuation.
Normalizing: Next, the bot finds common misspellings, slang, or typos in the text and converts them to their “normal” versions.
Recognizing Entities: Now that the words are all normalized, the chatbot seeks to identify which type of thing is being referred to. For example, it would locate North America as a location, 67% as a percentage, and Google as an organization.
Dependency Parsing: For the next step, the bot splits the sentence into nouns, verbs, objects, punctuation, and common phrases, and works out how these parts relate to one another.
Generation: Finally, the chatbot generates a number of responses using the information determined in all the other steps and selects the most appropriate response to send to the user.
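
The first two steps above can be sketched with the Python standard library alone. The later steps usually rely on a dedicated NLP library and are left out here. The SLANG lookup table and the tokenize and normalize functions are hypothetical names chosen for this example.

```python
import re

# Hypothetical lookup table of slang terms and typos with their normal forms.
SLANG = {"u": "you", "pls": "please", "thx": "thanks", "teh": "the"}

def tokenize(text):
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9']+", text)

def normalize(tokens):
    """Lowercase each token and replace known slang or typos."""
    return [SLANG.get(token.lower(), token.lower()) for token in tokens]

tokens = tokenize("Thx, can u reset teh router pls?")
print(tokens)
# ['Thx', 'can', 'u', 'reset', 'teh', 'router', 'pls']

normalized_tokens = normalize(tokens)
print(normalized_tokens)
# ['thanks', 'can', 'you', 'reset', 'the', 'router', 'please']
```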

An often overlooked preprocessing step is text normalization. Text normalization is the process of transforming text into a canonical (standard) form. For example, the words “gooood” and “gud” can be transformed to “good,” their canonical form. Another example is the mapping of near-identical words such as “stopwords,” “stop-words” and “stop words” to just “stopwords.”

Text normalization is essential for noisy texts such as social media comments, text messages, and comments to blog posts where abbreviations, misspellings, and use of out-of-vocabulary words are prevalent.

Normalization has even been effective for analyzing highly unstructured clinical texts where physicians take notes in non-standard ways. It can also be helpful for topic extraction, where near-synonyms and spelling differences are expected. Take, for example, topic modeling, topic modelling, topic-modeling, topic-modelling, etc.

Unfortunately, unlike stemming and lemmatization, there isn’t a standard way to normalize texts. It typically depends on the task. For example, normalizing clinical texts would arguably be different from how you normalize SMS text messages.

Some common approaches to text normalization include dictionary mappings (easiest), statistical machine translation (SMT), and spelling-correction-based approaches.
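
As a minimal sketch of two of these approaches, the Python example below tries a dictionary mapping first and then falls back to a simple spelling correction using difflib from the standard library. The CANONICAL mapping, the VOCABULARY list, and the 0.8 similarity cutoff are assumptions made for illustration. A real system would need a much larger vocabulary and a tuned cutoff.

```python
import difflib
import re

# Hypothetical dictionary mapping of known variants to canonical forms.
CANONICAL = {"gud": "good", "stop-words": "stopwords"}

# Hypothetical vocabulary used by the spelling-correction fallback.
VOCABULARY = ["good", "stopwords", "topic", "modeling"]

def normalize_word(word, cutoff=0.8):
    word = word.lower()
    # 1. Dictionary mapping: the easiest approach.
    if word in CANONICAL:
        return CANONICAL[word]
    # 2. Squeeze runs of three or more repeated characters down to two.
    squeezed = re.sub(r"(.)\1{2,}", r"\1\1", word)
    if squeezed in VOCABULARY:
        return squeezed
    # 3. Spelling correction: pick the closest vocabulary entry, if any.
    matches = difflib.get_close_matches(squeezed, VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(normalize_word("gud"))     # good
print(normalize_word("gooood"))  # good
```

Words with no close match in the vocabulary are returned unchanged (apart from lowercasing), which keeps the fallback from guessing too aggressively.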