Sentence similarity, a tough NLP problem
With the advent of chatbots, training computers to read, understand, and write language has become big business. The training may seem easy at first, but as you start your journey with Natural Language Processing (NLP), you realize that surmounting the challenges is no easy task. Sentence similarity is among the toughest of these problems.
Why is it so hard to teach language to a computer?
After all, a child picks up languages intuitively. Computers, by contrast, are good at number crunching but cannot read or understand human language directly. For example, when we say, ‘An apple a day keeps you healthy’, how do you teach the computer that ‘apple’ in this context is a fruit and not the company Apple Inc.?
We take great pains to transform words and sentences into numerical representations and then train the computers with these numerical representations of language.
These representations have to capture the meaning of words (semantics), how they occur in a sentence (syntax), the contextual conversation, and the intertwining of words.
A new phase
NLP got a big boost when Tomas Mikolov and his colleagues at Google introduced word2vec. Word2vec is a method to convert a word into a vector in an n-dimensional space; this representation is referred to as a word embedding. GloVe is another method, from Stanford, that creates vector representations of words.
Another popular, more recent method is fastText from Facebook. These methods make it easy for the computer to compare words based on their context and meaning. Armed with this technique, let’s see if we can extend the comparison to complete sentences and find similarities between them.
spaCy, a popular open-source NLP engine, provides an out-of-the-box feature to find the similarity between sentences.
The numbers it returns represent sentence similarity. The greater the similarity value, the more similar the sentences are. The first two sentences are more similar since a city and a country occur in them. However, why do we get 0.65 and 0.55 similarity with the third one?
Let’s play with spaCy’s sentence similarity models to figure out what’s going on behind the scenes.
Test Case 1: Very similar sentences
The sentences used are: ‘Apple sells iOS smartphones.’ and ‘Google sells Android smartphones.’
The two models give around 0.85 similarity, which is decent.
Let’s fool the model by stating something that is not a fact.
The sentences used are: ‘Apple sells iOS smartphones.’ and ‘Google sells iOS smartphones.’
The two models give around 0.94 similarity. That is defensible semantically, since the two sentences differ by a single word, but it ignores the fact that the second sentence is factually wrong.
Notice how replacing Android with iOS in the second sentence raised the similarity from 0.85 to 0.94. The models used here have not been trained on ‘facts’; they have been trained to compare sentences semantically. Would a human have performed better? Yes, but only if the human knows that Google does not sell iOS phones. So, by imparting this knowledge to the model, we could bring it closer to human-level performance.
Test Case 2: Somewhat similar sentences
The sentences used are: ‘Apple invented iOS.’ and ‘Google bought Android.’
The two models give 0.73 and 0.62 similarity, which may be acceptable.
Test Case 3: Dissimilar sentences
The sentences used are: ‘Apple invented iOS.’ and ‘Apple a day keeps you healthy.’
The two models give 0.6 similarity, which is high.
The results from the above tests show that we have a long way to go before we have models that do an acceptable job.
By default, spaCy averages the word vectors in a sentence to come up with a sentence-level vector, that is, a vector that represents the meaning of the sentence in an n-dimensional vector space.
It then compares the sentence-level vectors of the two sentences using cosine similarity to come up with the similarity number.
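To make the average-then-compare idea concrete, here is a minimal sketch in plain Python. It does not call spaCy. The three-dimensional vectors in TOY_VECTORS are made up for illustration (real embeddings have hundreds of dimensions), and the function names are our own, not part of any library.

```python
import math

# Toy 3-dimensional "embeddings", invented purely for illustration.
# Real word2vec / GloVe / fastText vectors are learned from large corpora.
TOY_VECTORS = {
    "apple":       [0.9, 0.1, 0.3],
    "google":      [0.8, 0.2, 0.1],
    "sells":       [0.1, 0.9, 0.2],
    "ios":         [0.7, 0.1, 0.8],
    "android":     [0.6, 0.3, 0.7],
    "smartphones": [0.5, 0.2, 0.9],
}


def sentence_vector(sentence):
    """Average the vectors of all known words in the sentence."""
    words = [w.strip(".,!?'").lower() for w in sentence.split()]
    vectors = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not vectors:
        raise ValueError("no known words in sentence")
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sentence_similarity(s1, s2):
    """Compare two sentences via their averaged word vectors."""
    return cosine_similarity(sentence_vector(s1), sentence_vector(s2))


base = "Apple sells iOS smartphones."
print(sentence_similarity(base, "Google sells Android smartphones."))
print(sentence_similarity(base, "Google sells iOS smartphones."))
print(sentence_similarity(base, "smartphones iOS sells Apple"))
```

With these toy vectors, the ‘Google sells iOS smartphones.’ pair scores higher than the ‘Google sells Android smartphones.’ pair simply because the two sentences share one more word, and shuffling the word order leaves the score unchanged. An averaged vector knows nothing about facts or word order, which is consistent with the behaviour seen in the test cases.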
Cosine in sentence similarity
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space: the cosine of the angle between them.
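Written out for two n-dimensional vectors, the definition takes the standard form below. The symbols a, b, and θ are our own notation for the two sentence vectors and the angle between them.

```latex
\[
\operatorname{sim}(\mathbf{a}, \mathbf{b})
  = \cos\theta
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}} \; \sqrt{\sum_{i=1}^{n} b_i^{2}}}
\]
```

The value lies between -1 and 1. A value of 1 means the two vectors point in the same direction, 0 means they are orthogonal, and the lengths of the vectors do not affect the result.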
Isn’t this non-intuitive? Would a human compare sentences in the same manner as this?
Recent developments in deep learning suggest that semantic similarity at the sentence level can be measured more accurately using recurrent and recursive neural networks. This will be the subject of discussion in a future post.
In text mining, sentence similarity is used as a criterion to discover unseen knowledge from textual databases.
- Sentence similarity study on ResearchGate.
Get the Engati advantage
At Engati, some of the best people in the industry can help you learn more about NLP, machine learning, or chatbot technology. Our expertise will help you better understand the technology and make use of it at work.
Businesses can build their own chatbot for free and upgrade as their needs grow. Engati supports more than 54 international languages, so businesses can select their preferred language. Get started with chatbot technology as early as you can, because a chatbot only gets better with time as it matures with more data.