Tech Corner

How Deep learning AI model compression caters to edge devices

Anwesh Roy
last edited on
October 17, 2023
3-4 mins

Table of contents

Automate your business at $5/day with Engati

Switch to Engati: Smarter choice for WhatsApp Campaigns 🚀
Deep learning AI model

In the world of deep learning AI models, ‘Bigger is Better’ is the norm. Which means bigger models perform better. NVIDIA recently announced the MegatronLM - a monster language model that packs 8.3 billion parameters inside the model. The parameters occupy 33 GB of disk space. 

512 V100 GPUs ran continuously for more than nine days to train the model. The energy expended for the training is 3X the average energy consumption by an American in a year. The energy spent in training the BERT language model is equivalent to the fuel spent on a single transatlantic flight. A direct effect of building such massive models is the carbon footprint generated by these models.

There are probably a little over 100 million processors in the cloud today. However, there are close to 3 billion smartphones and several billion IOT devices out there. A standard mobile phone has enough processing power to carry out model inference. These are the ‘Edge’ devices, and most future computing will happen on these devices. However, these Edge devices cannot host such massive deep learning models to carry out local prediction.

There is also the question of security, privacy, and latency when hitting an API in the cloud to carry out a prediction as opposed to doing it locally on the device. Cost may also be a factor in the long run in choosing to run a model locally instead of the cloud.

For all the above reasons, it is extremely important to come up with smaller models that can easily fit into Edge devices so that inferences can happen locally. Recent research has brought up innovative ideas to reduce the size of the models via compression and other techniques.

It is logical to assume that smaller models will have reduced performance i.e. accuracy. This is no doubt true, but the reduction inaccuracy may not be that significant.

The chart below shows the accuracy of Image models vs the # of model parameters. The size of the bubble represents the model size. It is quite evident that the bigger models do not necessarily deliver higher accuracy.

E.g. smaller models such as MobileNet (20MB in size) do not perform worse than, say, VGG-16 while they are around 25X smaller in size.

In this post, we will focus on compression techniques that have been used off late to reduce model size. Compressing a model not only reduces its size so that it consumes less memory but also makes it faster. Compression reduces the number of parameters (weights) or their precision.

Which techniques are used to compress large language and image models?


Deep learning AI model compression - Catering to the edge

Quantization means reducing the numerical precision of the model parameters or weights. Typically the weights are stored in 32-bit floating-point numbers. However, for many applications, this level of precision may not be necessary. Quantization maps each floating-point weight to a fixed precision integer containing lesser bits. Quantization can be done post-training or during training. Post-training can result in 4X smaller models with < 2% accuracy loss. Quantization during training can result in 8X-16X smaller models. Quantization can also be applied to the activation functions of the models.



Pruning involves removing the parameters or weights that contribute least to overall model accuracy. 

Weight pruning can be used to remove individual connection weights. Weight magnitude pruning removes the weights having less magnitude, using magnitude as a measure of connection importance. 

Neuron pruning removes entire neurons. If we can rank the neurons in the network based on their contribution, then the lower contributing neurons can be removed.


Knowledge Distillation

Although knowledge distillation is an indirect way of compressing a model, it achieves the same result, i.e., creating smaller models. In this technique, a larger (already created) model called ‘teacher’ is used to train a smaller model called ‘student’. Here the goal is to have the same distribution in the student model as available in the teacher model. This is achieved by transferring the knowledge from the teacher to the student by minimizing a loss function in which the target is the distribution of class probabilities predicted by the teacher model.

Recent successes in compressing language models is evident by the availability of many smaller transformer models based on BERT - ALBERT (Google and Toyota), DistilBERT (Huggingface), TinyBERT (Huawei), MobileBERT, and Q8BERT (Quantized 8Bit BERT).

Some of the popular model compression tools used to achieve deep learning are:

  • TensorFlowLite from Google.
  • Core ML from Apple
  • Caffe2Go from Facebook
  • Neural Network Distiller from Intel AI Lab

It is good to see the discussion and research on model compression techniques. This has resulted in a push to build smaller models, especially to cater to Edge devices. This can give a much-needed fillip to Edge computing which is the future of computing. It also has the added advantage of better resource utilization since the mobile phone processors will be put to use much more than what happens currently.

Anwesh Roy

Anwesh is the Senior Vice President of Engati. Driven by a passion to deliver value through AI-driven solutions, Anwesh is on a mission to mainstream Natural Language Processing (NLP), Natural Language Understanding (NLU), Natural Language Generation (NLG) and Data Analytics applications.

Close Icon
Request a Demo!
Get started on Engati with the help of a personalised demo.
This is some text inside of a div block.
This is some text inside of a div block.
This is some text inside of a div block.
This is some text inside of a div block.
*only for sharing demo link on WhatsApp
Thanks for the information.
We will be shortly getting in touch with you.
Oops! something went wrong!
For any query reach out to us on
Close Icon
Congratulations! Your demo is recorded.

Select an option on how Engati can help you.

I am looking for a conversational AI engagement solution for the web and other channels.

I would like for a conversational AI engagement solution for WhatsApp as the primary channel

I am an e-commerce store with Shopify. I am looking for a conversational AI engagement solution for my business

I am looking to partner with Engati to build conversational AI solutions for other businesses

Close Icon
You're a step away from building your Al chatbot

How many customers do you expect to engage in a month?

Less Than 2000


More than 5000

Close Icon
Thanks for the information.

We will be shortly getting in touch with you.

Close Icon

Contact Us

Please fill in your details and we will contact you shortly.

This is some text inside of a div block.
This is some text inside of a div block.
This is some text inside of a div block.
This is some text inside of a div block.
This is some text inside of a div block.
Thanks for the information.
We will be shortly getting in touch with you.
Oops! Looks like there is a problem.
Never mind, drop us a mail at