With the advent of chatbots, training computers to read, understand, and write language has become big business. The task may seem easy at first, but as you start your Natural Language Processing (NLP) journey, you realize the challenges run deep. That's why sentence similarity is among the toughest NLP problems.
Why is it so hard to teach language to a computer?
After all, a child picks up language intuitively. The trouble is that computers are good at number crunching but cannot read or understand human language directly. For example, when we say, ‘An apple a day keeps you healthy’, how do you teach the computer that ‘apple’ in this context is a fruit and not the company, Apple Inc.?
We take great pains to transform words and sentences into numerical representations and then train computers on these representations of language.
These representations have to capture the meaning of words (semantics), how they occur in a sentence (syntax), the surrounding context of the conversation, and the way words intertwine.
A new phase
NLP got a big boost when Tomas Mikolov and his team at Google invented word2vec. Word2vec is a method that converts a word into a representation in an n-dimensional vector space, referred to as a word embedding. GloVe, from Stanford, is another method that creates vector representations of words.
fastText, from Facebook, is one of the newer popular methods. These methods make it easy for a computer to compare words based on context and meaning. Armed with these techniques, let’s see if we can extend the comparison to complete sentences and find similarities between them.
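To make this concrete, here is a minimal sketch of word-level comparison using spaCy's pre-trained vectors (the model name en_core_web_md is an assumption; any model that ships with word vectors will do):

```python
import spacy

# Load a model that ships with word vectors (model choice is an assumption).
nlp = spacy.load("en_core_web_md")

apple, fruit, laptop = nlp("apple fruit laptop")

# Each token carries a vector; similarity() returns the cosine similarity
# between two vectors, so related words tend to score higher.
print(apple.similarity(fruit))
print(apple.similarity(laptop))
```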
spaCy, a popular open-source NLP library, provides an out-of-the-box feature for finding the similarity between sentences.
The numbers represent sentence similarity. The greater the similarity value, the more similar the sentences are. The first two sentences are more similar since a city and a country occur in them. However, why do we get similarities of 0.65 and 0.55 with the third one?
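For reference, this is roughly how such a pairwise comparison is run with spaCy. The sentences below are placeholders chosen for illustration, and the exact scores will vary with the model used:

```python
import spacy

nlp = spacy.load("en_core_web_md")  # model choice is an assumption

# Placeholder sentences: the first two mention a city and a country,
# the third is unrelated.
doc1 = nlp("Paris is a beautiful city in France.")
doc2 = nlp("Berlin is the capital of Germany.")
doc3 = nlp("I had a sandwich for lunch.")

# Doc.similarity() compares the averaged word vectors of two sentences.
print(doc1.similarity(doc2))  # highest of the three pairs
print(doc1.similarity(doc3))  # lower, but still well above zero
print(doc2.similarity(doc3))
```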
The sentences used are: ‘Apple sells iOS smartphones.’ and ‘Google sells Android smartphones.’
The semantic similarity is 83%, which is decent.
Let’s fool the model by stating something that is not a fact.
The sentences used are: ‘Apple sells iOS smartphones.’ and ‘Google sells iOS smartphones.’
The semantic similarity is 87% since we have similar sentences; however, the second sentence is factually incorrect.
Notice how replacing Android with iOS in the second sentence enhanced the similarity from 83% to 87%.
The model used here has not been trained with ‘facts’ but trained to compare sentences semantically.
Would a human perform better? Yes, but only if the human knows that Google does not sell iOS phones.
So, by imparting this knowledge to the model, we could get it to perform at the same level as a human.
The sentences used are: ‘Apple invented iOS.’ and ‘Google bought Android.’
The semantic similarity is 75%, which may be acceptable.
The sentences used are: ‘Apple invented iOS.’ and ‘Apple a day keeps you healthy, as shown in an iOS application.’
The semantic similarity is 92%, which is very high for sentences that are dissimilar.
The words ‘Apple’ and ‘iOS’ are common to both sentences, but the meanings of the two sentences are quite different.
The results of the above tests show that we still have a long way to go before we arrive at models that can do an acceptable job.
By default, most semantic similarity models take an average of the word vectors in a sentence to come up with a sentence-level vector, i.e., a vector that represents the meaning of the sentence in an n-dimensional vector space.
They then compare the sentence-level vectors of the two sentences using cosine similarity to arrive at the similarity percentage.
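Here is a rough sketch of that averaging-and-cosine approach, written out with spaCy's token vectors and NumPy (the model choice is an assumption; the exact score will vary, though the article reports roughly 0.83 for this pair):

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")  # any model with word vectors

def sentence_vector(text):
    """Average the word vectors of a sentence into a single sentence-level vector."""
    doc = nlp(text)
    vectors = [token.vector for token in doc if token.has_vector]
    return np.mean(vectors, axis=0)

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v1 = sentence_vector("Apple sells iOS smartphones.")
v2 = sentence_vector("Google sells Android smartphones.")
print(cosine_similarity(v1, v2))  # roughly in line with the 83% reported above
```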
Cosine in sentence similarity
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space; it is the cosine of the angle between them.
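In symbols, for two sentence vectors A and B:

\[
\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}
\]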
Isn’t this non-intuitive? Would a human compare sentences in this manner?
Recent developments in Deep Learning have shown promise that semantic similarity at a sentence level can be solved with better accuracy using recurrent and recursive neural networks. This will be the subject of discussion in a future post.
In text mining, sentence similarity is used as a criterion to discover unseen knowledge from textual databases.
- Sentence similarity study on ResearchGate.
Get the Engati advantage
At Engati, we have some of the best minds in the industry, who can help you learn more about NLP, machine learning, and chatbot technology. Our expertise will help you better understand the technology and put it to use at work.
Businesses can build their own chatbot for free and upgrade as their needs grow. Engati comes with 50+ languages, and businesses can select their preferred one. So get started with chatbot technology, and start early: a chatbot only gets better with time as it matures with more data.
Ever wondered what is stopping your business from reaching its full potential? Register with Engati today and start building your free chatbot!