Tech Corner

Contrastive Learning in NLP

Anwesh Roy
Aug 9
2-3 mins

Table of contents

Key takeawaysCollaboration platforms are essential to the new way of workingEmployees prefer engati over emailEmployees play a growing part in software purchasing decisionsThe future of work is collaborativeMethodology

Contrastive learning is a technique to train a model to learn representations of say images or sentences such that similar samples are closer in the vector space, while dissimilar ones are far apart.

Contrastive learning can be applied to both supervised and unsupervised data and has been shown to achieve good performance on a variety of vision and language tasks.

Contrastive learning has been successfully applied to Computer Vision tasks and has now been applied to NLP tasks.

A recent paper published by Princeton University highlights one such contrastive learning technique for sentence embeddings: SimCSE: Simple Contrastive Learning of Sentence Embeddings 

The biggest promise of this technique is the unsupervised learning of a pre-trained language transformer model on semantic textual similarity (STS) tasks.

How does it work?

The idea is fairly simple - take an input sentence and predict itself in a contrastive objective, using standard dropout in the neural network layers as noise. This unsupervised technique is simple and works surprisingly well performing at par with supervised models.

A supervised learning technique using natural language inference (NLI) datasets can also be used with the entailment pairs as positive samples and hence should be learned to be closer in the vector space and contradiction pairs as negative samples and should be placed far apart in the vector space.

The following table shows the accuracy of SimCSE when compared with other ways to obtain sentence embeddings for semantic textual similarity (STS) tasks.

The accuracy of SimCSE when compared to other models

Let’s review the unsupervised technique pointed out in this paper since this holds the promise of generating trained sentence similarity models based on pre-trained language models in multiple languages without the need to create labeled datasets for training. Creating and maintaining clean labeled datasets is one of the most time consuming and error prone tasks in machine learning. Learning universal sentence representations is a tough problem in NLP and so far the only successful models have been the supervised ones trained on STS and NLI datasets.

Obtaining labeled datasets in specific domains and languages may not be an easy proposition.

As more textual data is being generated on a daily basis the need to understand and mine such textual data is becoming challenging especially due to the dependency on supervised learning.

Thus unsupervised learning is attractive.

A solution to automate unsupervised learning

The following diagram (a) illustrated the simple contrastive learning technique to learn similar sentence embeddings in an unsupervised manner.

Simple contrastive learning techniques

As seen above the set of input sentences are ‘Two dogs are running,' ‘A man surfing on the sea,’ and ‘A kid is on a skateboard’.

For creating positive sample pairs the same input sentence e.g. ‘Two dogs are running’ is passed into the pre-trained encoder twice and two different embeddings are obtained by applying independently sampled dropout masks. This technique of using dropout is a novel one and easy to implement.

The embeddings obtained for the other two sentences, ‘A man surfing on the sea’ and ‘A kid is on a skateboard’ are taken as negative sample pairs for the sentence embeddings generated for the positive pairs of the ‘Two dogs are running’ sentence.

Creation of labeled positive and negative sample pairs for such a task becomes very simple and can be easily automated for a large corpus.

As part of the training the embeddings for the positive sample pairs are brought closer together in the vector space and the embeddings of the negative sample pairs are pushed apart in the vector space. This trains the encoder to generate similar embeddings for similar input sentences so that the cosine similarity between these input sentences will be high.

Obtaining good representative sentence embeddings can be very useful for NLP tasks such as text classification or information retrieval. Obtaining such embeddings in an unsupervised way can scale up textual analytics in multiple domains and languages.

At the forefront of cutting edge tech

At Engati, we believe in checking out the latest research in the NLP world to understand how we can employ cutting edge techniques to solve bigger problems faster.


Anwesh Roy

Anwesh is the Senior Vice President of Engati. Driven by a passion to deliver value through AI-driven solutions, Anwesh is on a mission to mainstream Natural Language Processing (NLP), Natural Language Understanding (NLU), Natural Language Generation (NLG) and Data Analytics applications.

Andy is the Co-Founder and CIO of SwissCognitive - The Global AI Hub. He’s also the President of the Swiss IT Leadership Forum.

Andy is a digital enterprise leader and is transforming business strategies keeping the best interests of shareholders, customers, and employees in mind.

Follow him for your daily dose of AI news and thoughts on using AI to improve your business.

Catch our interview with Andy on AI in daily life

Continue Reading