

LASER for NLP tasks - Part 2

Divya Priya | Feb 10 | 3-4 min read


The previous article (Part 1) covers the LASER concepts and its architecture.


Here’s a step-by-step tutorial on using LASER for a multi-class classification task (sentiment analysis). LASER provides multilingual sentence embeddings, which can be used to train a model to carry out the sentiment analysis task.

There are six sections in this tutorial:

1. Dataset Preparation

2. Setup and installation

3. Classification Model Training

4. Inference

5. Analysis of results

6. Conclusion

Dataset Preparation

Clean your data to make sure there are no empty rows or NA values in your dataset. Split the dataset into training, dev, and test sets, where train.tsv and dev.tsv contain the labels and test.tsv does not. After cleaning, I had around 31k English sentences in my training dataset.
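As a rough sketch of this step (the column names `text` and `label` and the 80/10/10 split ratio are assumptions, not from the original):

```python
import pandas as pd

def prepare_splits(df, text_col="text", label_col="label", seed=42):
    # Drop NA values and empty rows
    df = df.dropna(subset=[text_col, label_col])
    df = df[df[text_col].str.strip() != ""]
    # Shuffle, then split 80/10/10 into train/dev/test
    df = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    n = len(df)
    train = df.iloc[: int(0.8 * n)]
    dev = df.iloc[int(0.8 * n) : int(0.9 * n)]
    test = df.iloc[int(0.9 * n) :]
    # train.tsv and dev.tsv keep the labels; test.tsv drops them
    train.to_csv("train.tsv", sep="\t", index=False)
    dev.to_csv("dev.tsv", sep="\t", index=False)
    test[[text_col]].to_csv("test.tsv", sep="\t", index=False)
    return train, dev, test
```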

Setup and installation

You can refer to the LASER GitHub repository to set up the Docker image for computing sentence embeddings. Alternatively, there is a package named laserembeddings, a production-ready port of Facebook Research’s LASER (Language-Agnostic SEntence Representations) for computing multilingual sentence embeddings. I used the laserembeddings package to get the embeddings.

Run the following commands to install the package and test it:

[Screenshot: installing the laserembeddings package]
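The original install screenshot is not preserved here; per the laserembeddings README, installation (including the required model download step) looks like this:

```shell
pip install laserembeddings
python -m laserembeddings download-models
```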

Calculate the embeddings for all the sentences in the dataset using the laserembeddings package and store the 1024-dimensional embedding vectors in a NumPy file.

Classification Model Training

Since we already have the embeddings computed for all the sentences, LASER acts as the encoder in our model: it provides the embeddings for the input sentences. We now need to build a classifier network as the decoder, to classify each sentence as positive, negative, or neutral sentiment.

To build the model, start with the required imports:

[Screenshot: imports for building the model]

Then define the model (the parameters are explained below):

[Screenshot: model definition]
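The original code screenshot is not preserved; a minimal reconstruction in Keras might look like the following. Only the 3-class softmax output, the Adam optimizer, and the categorical_crossentropy loss come from the text; the hidden-layer sizes are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_classifier(input_dim=1024, n_classes=3):
    # LASER supplies the fixed 1024-dim sentence embeddings; this small
    # dense network is the classifier head trained on top of them
    model = keras.Sequential([
        keras.Input(shape=(input_dim,)),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(n_classes, activation="softmax"),  # positive, negative, neutral
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training would then be, roughly:
# model.fit(X1, Y1, validation_data=(X2, Y2), epochs=7)
```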

Here we have a sequential model that works as the decoder. You can tweak its layers; the above is a very simple architecture. I also experimented with adding a GlobalAveragePooling layer, but the results were about the same. The ‘3’ in the final Dense layer corresponds to the 3 classes to predict. I used the Adam optimizer and categorical_crossentropy loss, since this is a multi-class classification problem (positive, negative, neutral). X1 is the embedding matrix for the training sentences and Y1 the corresponding labels, while X2 is the embedding matrix for the validation sentences and Y2 the corresponding labels. I ran training for around 7 epochs, for which val_accuracy came to around 92%. You can decide the number of epochs based on your data.

Finally, I trained the model using my English dataset (consisting of 31k sentences).

After you train the model by running the above steps, make sure you save the model:

[Screenshot: saving the model]
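The saving step itself is a single call (a stand-in model is built here only to keep the sketch runnable; the file name laser_model.h5 is assumed from the model name mentioned in the next section):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Stand-in for the trained classifier from the previous step
model = keras.Sequential([keras.Input(shape=(1024,)),
                          layers.Dense(3, activation="softmax")])

model.save("laser_model.h5")  # HDF5 file; use a .keras extension on newer Keras
```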

Inference

Now, with the saved model, I carried out the inference task by loading laser_model.

[Screenshot: carrying out the inference task]
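A runnable sketch of the inference path (a stand-in model is saved first so the example is self-contained; note that predict_classes was removed in newer Keras, so an argmax over the softmax outputs is used instead):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Stand-in for the trained model; in practice laser_model.h5 already exists
stand_in = keras.Sequential([keras.Input(shape=(1024,)),
                             layers.Dense(3, activation="softmax")])
stand_in.save("laser_model.h5")

model = keras.models.load_model("laser_model.h5")

X_test = np.random.rand(5, 1024).astype("float32")  # LASER embeddings of test sentences
# Equivalent of the older model.predict_classes(X_test)
pred_classes = np.argmax(model.predict(X_test), axis=1)
```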

I ran inference on the test dataset using the loaded model (model.predict_classes), with an accuracy of around 90%. I translated my test dataset into several languages (Hindi, German, French, Arabic, Tamil, Indonesian) using the Google Translate package; after getting the embeddings with laserembeddings, I used the trained model to carry out the inferences.

Analysis of results

The accuracy scores for the various languages came out as:

  • Hindi -- 89.2%
  • German -- 87%
  • French -- 88%
  • Arabic -- 87.7%
  • Tamil -- 79%
  • Indonesian -- 84%
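Scores like these are simply the fraction of correct predictions per language; a small helper makes that concrete (the label arrays below are hypothetical):

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Fraction of positions where the predicted label matches the true label
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

hindi_acc = accuracy([0, 1, 2, 1, 0], [0, 1, 2, 0, 0])  # hypothetical labels
```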

The properties of the multilingual semantic space can be used for paraphrasing a sentence or searching for sentences with similar meaning — either in the same language or in any of the 93 others now supported by LASER.
- Facebook Engineering article on LASER.

Conclusion

In this article, we have learned to use LASER for a multi-class classification task. LASER can also be used for other Natural Language Processing tasks beyond classification; for example, I implemented multilingual FAQ support with it. With adequate data, the results are surprisingly good even with the simple architecture of the decoder model.

With less data, however, it has some issues. I noticed this when I tried the same approach on another domain with comparatively less data, so you should have sufficient data to achieve very good results. I am also exploring multilingual BERT to see if it outperforms LASER, and will come back with the results. Thanks for reading, and have a great day ahead. See you again in the next article!

Until then, register to explore our chatbot offerings.


