LASER for NLP tasks - Part 2
The previous article (Part 1) covers the LASER concepts and its architecture.
Here’s a step-by-step tutorial on using LASER for a multi-class classification task (sentiment analysis). LASER provides multilingual sentence embeddings, which can be used to train a model for sentiment analysis.
There are 5 sections in this tutorial:
1. Dataset Preparation
2. Setup and installation
3. Classification Model Training
4. Inference
5. Analysis of results
Dataset Preparation
Clean your data to make sure there are no empty rows or NA values in your dataset. Split the dataset into training, dev, and test sets: train.tsv and dev.tsv contain the labels, and test.tsv does not. After cleaning, I had around 31k English sentences in my training dataset.
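As a minimal sketch of this step, the cleaning and splitting could look like the code below. It assumes each row is a (sentence, label) pair of strings. The helper names, the list of NA-like tokens, and the 80/10/10 split are illustrative assumptions, not details from this tutorial.

```python
import csv
import random

# Tokens treated as missing values (an assumption; adjust for your data).
NA_TOKENS = {"", "na", "n/a", "nan", "null"}


def clean_rows(rows):
    """Drop rows whose sentence or label is empty or an NA-like token."""
    cleaned = []
    for sentence, label in rows:
        sentence, label = sentence.strip(), label.strip()
        if sentence.lower() in NA_TOKENS or label.lower() in NA_TOKENS:
            continue
        cleaned.append((sentence, label))
    return cleaned


def split_rows(rows, dev_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle rows reproducibly and split them into train, dev, and test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_dev = int(len(rows) * dev_frac)
    n_test = int(len(rows) * test_frac)
    dev = rows[:n_dev]
    test = rows[n_dev:n_dev + n_test]
    train = rows[n_dev + n_test:]
    return train, dev, test


def write_tsv(path, rows, with_labels=True):
    """Write rows to a TSV file; leave out the label column if with_labels is False."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        for sentence, label in rows:
            writer.writerow([sentence, label] if with_labels else [sentence])
```

With these helpers, `train, dev, test = split_rows(clean_rows(rows))` followed by `write_tsv("train.tsv", train)`, `write_tsv("dev.tsv", dev)`, and `write_tsv("test.tsv", test, with_labels=False)` would produce the three files described above.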
Setup and installation
You can set up LASER with Docker by following the instructions in the LASER GitHub repository and use it to get the sentence embeddings. Alternatively, there is a Python package named laserembeddings, a production-ready port of Facebook Research’s LASER (Language-Agnostic SEntence Representations) for computing multilingual sentence embeddings. I used the laserembeddings package to get the embeddings.
Run the following command to install the package and test it:
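As a minimal sketch based on the laserembeddings README, installation and a quick check might look like this. The function name smoke_test is hypothetical, and the shell commands appear as comments.

```python
# Install the package and download the pre-trained models first
# (run in a shell, per the laserembeddings README):
#   pip install laserembeddings
#   python -m laserembeddings download-models


def smoke_test():
    """Embed two English sentences and return the shape of the result."""
    # Imported here so this file can be loaded before the package is installed.
    from laserembeddings import Laser

    laser = Laser()  # loads the downloaded encoder and BPE files
    embeddings = laser.embed_sentences(
        ["let your neural network be polyglot", "use multilingual embeddings!"],
        lang="en",  # language code of the input sentences
    )
    # One 1024-dimensional vector per sentence, so the shape should be (2, 1024).
    return embeddings.shape
```

Calling `smoke_test()` after installation should confirm that the models were downloaded and can be loaded.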
Calculate the embeddings for all the sentences in the dataset using the laserembeddings package and store the 1024-dimensional embedding vectors in a NumPy file.
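This step can be sketched as follows, assuming the sentences are already loaded into a Python list. The function name and the output file name used below are hypothetical.

```python
def embed_and_save(sentences, lang, out_path):
    """Embed sentences with LASER and save the (N, 1024) array to out_path."""
    if not sentences:
        raise ValueError("sentences must be a non-empty list")

    # Heavy dependencies are imported lazily, only when embeddings are computed.
    import numpy as np
    from laserembeddings import Laser

    laser = Laser()
    embeddings = laser.embed_sentences(sentences, lang=lang)
    np.save(out_path, embeddings)  # np.save appends ".npy" if it is missing
    return embeddings.shape
```

For example, `embed_and_save(train_sentences, "en", "train_embeddings.npy")` would write the training embeddings, which can later be read back with `numpy.load`.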
Classification Model Training
Since we have already computed the embeddings for all the sentences, LASER acts as the encoder in our model: it provides the embeddings for the input sentences. Now we need to build a classifier network to act as the decoder, which classifies each sentence as positive, negative, or neutral:
For building the model:
Do the following imports:
(I will explain the parameters given below)
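A minimal sketch of such a classifier, assuming tf.keras, is shown below. The hidden layer size and the dropout rate are illustrative assumptions and not the exact values used for the results in this article. The 3 output units, the adam optimizer, and the categorical_crossentropy loss follow the description in the text.

```python
NUM_CLASSES = 3        # positive, negative, neutral
EMBEDDING_DIM = 1024   # size of a LASER sentence embedding


def build_classifier(hidden_units=512, dropout_rate=0.2):
    """Build a small feed-forward classifier over LASER embeddings."""
    # Imported lazily so the function can be defined without loading TensorFlow.
    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Dense, Dropout, Input

    model = Sequential([
        Input(shape=(EMBEDDING_DIM,)),
        Dense(hidden_units, activation="relu"),    # hidden size is an assumption
        Dropout(dropout_rate),                     # dropout is an assumption
        Dense(NUM_CLASSES, activation="softmax"),  # one output per class
    ])
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",  # expects one-hot encoded labels
        metrics=["accuracy"],
    )
    return model
```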
Here we have a Sequential model that works as the decoder. The architecture above is very simple, and you can tweak its layers. I also experimented with adding a global average pooling layer, but the results were about the same. The 3 in the final Dense layer means we have 3 classes to predict. I used the adam optimizer and the categorical_crossentropy loss because this is a multi-class classification problem (positive, negative, neutral). X1 holds the embeddings for all the sentences in the training dataset and Y1 the corresponding labels, whereas X2 holds the embeddings for all the sentences in the validation dataset and Y2 the corresponding labels. I trained for about 7 epochs, which gave a val_accuracy of around 92%. Choose the number of epochs based on your data.
Finally, I trained the model using my English dataset (consisting of 31k sentences).
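The training call can be sketched as below, assuming a compiled Keras model and the X1, Y1, X2, Y2 arrays described above, with Y1 and Y2 one-hot encoded. The function name and the batch size are assumptions; the default of 7 epochs follows the text.

```python
def train_classifier(model, X1, Y1, X2, Y2, epochs=7, batch_size=32):
    """Fit the classifier on the training set and validate on the dev set."""
    history = model.fit(
        X1, Y1,                    # training embeddings and one-hot labels
        validation_data=(X2, Y2),  # dev embeddings and one-hot labels
        epochs=epochs,
        batch_size=batch_size,     # batch size is an assumption
    )
    return history
```

If the labels are stored as integer class indices, `tensorflow.keras.utils.to_categorical` is one way to convert them to the one-hot form that categorical_crossentropy expects.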
After you train the model by running the above steps, make sure you save the model:
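A minimal sketch of saving, assuming a Keras model. The file name laser_model.h5 is an assumption based on the model name used in this article.

```python
def save_model(model, path="laser_model.h5"):
    """Save the trained Keras model to disk and return the path used."""
    model.save(path)  # the .h5 extension selects the HDF5 format
    return path
```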
With the model saved, I carried out the inference task by loading laser_model.
I ran inference on the test dataset with the loaded model using model.predict_classes and got an accuracy of around 90%. I then translated my test dataset into several languages (Hindi, German, French, Arabic, Tamil, Indonesian) using a Google Translate package, computed the embeddings with laserembeddings, and used the trained model to run inference on them.
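The inference step can be sketched as below. Newer TensorFlow releases no longer provide predict_classes on Sequential models, so this sketch takes the highest-probability class from model.predict instead. The label order in LABELS and both helper functions are assumptions for illustration.

```python
LABELS = ["negative", "neutral", "positive"]  # assumed index order


def predict_classes(model, X):
    """Return the index of the highest-probability class for each row of X."""
    probabilities = model.predict(X)
    # Same result as numpy.argmax(probabilities, axis=-1), one row at a time.
    return [max(range(len(row)), key=lambda i: row[i]) for row in probabilities]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class indices."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)
```

The saved model can be loaded with `tensorflow.keras.models.load_model("laser_model.h5")`, where the file name is again an assumption. For a translated test set, the embeddings would come from `Laser().embed_sentences(sentences, lang="hi")` with the matching language code, and `LABELS[i]` maps a predicted index back to its sentiment name.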
Analysis of results
The accuracy scores on the various languages were:
- Hindi -- 89.2%
- German -- 87%
- French -- 88%
- Arabic -- 87.7%
- Tamil -- 79%
- Indonesian -- 84%
The properties of the multilingual semantic space can be used for paraphrasing a sentence or searching for sentences with similar meaning — either in the same language or in any of the 93 others now supported by LASER.
- FB engineering article on Laser.
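One common way to search for sentences with similar meaning in such a space is cosine similarity between embedding vectors. The sketch below uses plain Python lists; the function names are hypothetical, and cosine similarity is an assumption about how to compare vectors rather than something this article prescribes.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_vec, candidate_vecs, top_k=3):
    """Return (index, score) pairs for the top_k candidates closest to the query."""
    scored = [
        (i, cosine_similarity(query_vec, vec))
        for i, vec in enumerate(candidate_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Because LASER maps all supported languages into one space, the query embedding and the candidate embeddings can come from different languages.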
In this article, we have learned to use LASER for a multi-class classification task. LASER can also be used for other Natural Language Processing tasks beyond classification; for example, I used it to implement multilingual FAQ support. I feel that with adequate data, the results are surprisingly good even with the simple architecture of the decoder model.
With less data, however, it has some issues. I found this out when I tried the same approach on data from another domain, which had comparatively little data, so I feel you need adequate data to achieve really good results. I am also exploring multilingual BERT to see whether it outperforms LASER and will share the results. Thanks for reading and have a great day ahead. See you again in the next article!