
How does HBase scale?

Shripati Bhat

HBase is a column-oriented database. It belongs to the NoSQL family of databases, which means there is no fixed structure of rows and columns that all data in the database must adhere to. To understand how HBase scales with such dynamic data, we first need to understand the components in HBase.


Zookeeper:

HBase is a distributed database. As with any distributed database, some component has to centrally manage the metadata of all the other components. HBase uses Zookeeper for this task. Zookeeper tracks meta information such as the number of region servers and the location of all components (master, region servers). Zookeeper constantly exchanges heartbeats with all components, reports to the Master Server when a region server fails, and runs a master re-election when the Master Server node fails.
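The heartbeat-based failure detection described above can be pictured with a small sketch. This is a simplified toy model, not the Zookeeper API: the `HeartbeatMonitor` class, the server names and the 30-second timeout are all assumptions made up for illustration.

```python
class HeartbeatMonitor:
    """Toy failure detector; a hypothetical stand-in, not Zookeeper itself."""

    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.last_seen = {}  # server name -> time of its last heartbeat

    def heartbeat(self, server, now):
        """Record that `server` reported in at time `now` (seconds)."""
        self.last_seen[server] = now

    def expired(self, now):
        """Return the servers whose last heartbeat is older than the timeout."""
        return sorted(
            server
            for server, seen in self.last_seen.items()
            if now - seen > self.session_timeout
        )


monitor = HeartbeatMonitor(session_timeout=30)
monitor.heartbeat("rs1", now=0)
monitor.heartbeat("rs2", now=0)
monitor.heartbeat("rs1", now=25)  # rs1 keeps reporting, rs2 goes silent
failed = monitor.expired(now=40)  # rs2 was last seen 40s ago, over the limit
print(failed)  # ['rs2']
```

In this model, the list returned by `expired` is what would be reported to the Master Server so that the regions of the failed server can be reassigned.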

Listing Zookeeper as a component here might be a little confusing, because it can be considered a dependency rather than a component. But HBase is so tightly coupled to Zookeeper that it can almost be treated as a component of HBase.

The functionality of HBase with Zookeeper.


HMaster:

As the name suggests, HMaster is the Master Server that monitors region servers. In HBase 1.6, the location of the hbase:meta catalog table is kept in Zookeeper, so clients can reach the required region server even when the Master Server is down. HMaster only comes into the picture when a table is created or altered, when a new region server is added, or when one of the existing region servers fails. WALs are used to transfer the current state to another Master when the current Master Server crashes.

Region Servers:

Region Servers are primarily responsible for handling the Read (GET) and Write (PUT) requests that come from the client, which finds the right Region Server via Zookeeper (ZK). In addition, Region Servers also manage Regions (the smallest units of a table holding actual data).
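To make the routing of a request to a Region Server a little more concrete, here is a minimal sketch of looking up the Region that holds a given row key. The start keys, the server names and the `locate` function are hypothetical; the sketch only assumes that each Region covers a contiguous, sorted range of row keys.

```python
from bisect import bisect_right

# Hypothetical table layout: region i covers the keys from its start key
# up to (but not including) the start key of region i + 1.
REGION_START_KEYS = ["", "g", "n", "t"]        # sorted; "" is the table start
REGION_SERVERS = ["rs1", "rs2", "rs1", "rs3"]  # server hosting each region


def locate(row_key):
    """Return (region index, server) for the region whose range holds row_key."""
    idx = bisect_right(REGION_START_KEYS, row_key) - 1
    return idx, REGION_SERVERS[idx]


print(locate("apple"))  # (0, 'rs1')
print(locate("grape"))  # (1, 'rs2')
print(locate("zebra"))  # (3, 'rs3')
```

Because the start keys are sorted, the lookup is a binary search, so it stays cheap even when a table has a very large number of Regions.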


Compactions:

Region Servers go through different types of Compactions.

A minor compaction just merges several smaller files on disk into one to make reads faster. The small files are produced when more than a configurable amount of data is held in memory (hbase.hregion.memstore.flush.size) and flushed to disk; a minor compaction runs once enough of these files have accumulated.
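The role of the in-memory threshold can be sketched as follows. This is a toy model under stated assumptions: the `Store` class is hypothetical, `flush_size` merely stands in for hbase.hregion.memstore.flush.size, and only the length of the values is counted, whereas a real store accounts for sizes differently.

```python
class Store:
    """Toy column-family store: an in-memory buffer plus flushed files."""

    def __init__(self, flush_size):
        self.flush_size = flush_size  # stand-in for the flush size setting
        self.memstore = {}            # row -> value, held in memory
        self.files = []               # each flush appends one sorted file

    def put(self, row, value):
        self.memstore[row] = value
        # Simplification: only the value lengths count towards the size.
        if sum(len(v) for v in self.memstore.values()) >= self.flush_size:
            self.flush()

    def flush(self):
        """Write the buffer out as one sorted file and clear the buffer."""
        self.files.append(sorted(self.memstore.items()))
        self.memstore = {}


store = Store(flush_size=10)
for i in range(6):
    store.put(f"row{i}", "xxxx")  # each put adds 4 characters of value

print(len(store.files))  # 2
```

Every flush leaves one more small file on disk, which is why the number of files grows over time and something has to merge them again.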

A major compaction, on the other hand (it occurs once a week by default), combines all files into one per store per Region and also removes any data that has to be cleaned up, such as deleted cells.

Hence a major compaction, even though it is intended to benefit reads, ends up consuming a lot of resources and can cause problems in a production environment.
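The difference between the two kinds of compaction can be illustrated with a small sketch. It is a toy model, not HBase code: files are plain dictionaries, `TOMBSTONE` is a made-up delete marker, and picking the `count` newest files for a minor compaction is an arbitrary assumption rather than HBase's actual selection policy.

```python
TOMBSTONE = None  # marker for a deleted cell in this toy model


def merge(files):
    """Merge files (dicts of row -> (timestamp, value)); newest timestamp wins."""
    merged = {}
    for f in files:
        for row, (ts, value) in f.items():
            if row not in merged or ts > merged[row][0]:
                merged[row] = (ts, value)
    return merged


def minor_compact(files, count):
    """Merge the `count` newest files (end of the list); delete markers stay."""
    cut = len(files) - count
    return files[:cut] + [merge(files[cut:])]


def major_compact(files):
    """Merge every file into one and drop the deleted cells."""
    merged = merge(files)
    return [{row: cell for row, cell in merged.items() if cell[1] is not TOMBSTONE}]


files = [
    {"a": (1, "v1"), "b": (1, "v1")},
    {"a": (2, "v2")},
    {"b": (3, TOMBSTONE)},  # row b was deleted later
]
after_minor = minor_compact(files, count=2)
after_major = major_compact(files)
print(len(after_minor))  # 2
print(after_major)       # [{'a': (2, 'v2')}]
```

The major compaction has to read and rewrite every file of the store, which hints at why it is so much more expensive than a minor one.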


Region:

Regions are the smallest units of a table. Each Region comprises one store per column family, which holds data in memory and persists it to disk once a threshold is crossed. Regions are split automatically by the Region Server. The Region Server first informs Zookeeper that it is splitting a region. It then creates the daughter regions and splits the HDFS files as needed. After successful completion, the metadata is updated in the other region servers, Zookeeper and the master (in that order). Thus, HBase automatically handles both the splitting and the compaction of data in regions to optimize read operations.
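The creation of daughter regions can be sketched in a few lines. This is a hypothetical model only: a region is a plain dictionary, and splitting at the middle row key is an assumption made for illustration, not necessarily how HBase picks its split point.

```python
def split_region(region):
    """Split a region (dict with start, end, rows) at its middle row key."""
    keys = sorted(region["rows"])
    split_key = keys[len(keys) // 2]
    lower = {k: v for k, v in region["rows"].items() if k < split_key}
    upper = {k: v for k, v in region["rows"].items() if k >= split_key}
    return (
        {"start": region["start"], "end": split_key, "rows": lower},
        {"start": split_key, "end": region["end"], "rows": upper},
    )


# "" as start or end means the region is open-ended on that side.
parent = {"start": "", "end": "", "rows": {"a": 1, "c": 2, "m": 3, "x": 4}}
left, right = split_region(parent)
print(left["end"], sorted(left["rows"]))      # m ['a', 'c']
print(right["start"], sorted(right["rows"]))  # m ['m', 'x']
```

The two daughters together cover exactly the key range of the parent, so once the metadata is updated a lookup for any row key still lands in exactly one Region.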

CAP Theorem:

HBase is strongly consistent, unlike a few other NoSQL databases such as MongoDB (when secondary reads are enabled - https://docs.mongodb.com/manual/reference/read-concern/ - or a partition replica failure occurs) or Cassandra (through consistency level configuration - https://cassandra.apache.org/doc/latest/configuration/cass_yaml_file.html#idealconsistency-level), which can return stale data in those situations.

Because a Region is unavailable while it is being split, HBase cannot guarantee Availability.

Consistency is achieved by routing all reads and writes for a Region through the Region Server concerned. Partition Tolerance is achieved by replicating the data files using the underlying HDFS.

Wrapping up

Hope you guys now know a little bit more about the various components in HBase and how they let it handle millions of data points and still return results in the order of a few milliseconds. That’s how easy it is to scale HBase.
