HBase is a column-oriented database. It belongs to the NoSQL family of databases, which means there is no fixed set of rows or columns and no rigid structure that all data in the database must adhere to. To understand how HBase scales with such dynamic data, we first need to understand its components.
HBase is a distributed database. As with any distributed database, some component has to centrally manage the metadata of all the other components. HBase uses ZooKeeper for this task. ZooKeeper keeps track of meta information such as the number of region servers and the location of all components (master, region servers). ZooKeeper constantly exchanges heartbeats with all components, reports to the Master Server when a region server fails, and runs a master re-election when the Master Server node fails.
Listing ZooKeeper as a component here might be a little confusing, because it can be considered a dependency rather than a component. But HBase relies on ZooKeeper so heavily in order to function that it can almost be treated as a component of HBase.
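To make the heartbeat idea a bit more concrete, here is a minimal sketch in Python. It is not ZooKeeper's API: the `LivenessTracker` class, its method names, and the timeout value are all hypothetical and only illustrate how a coordinator could notice that a server has stopped sending heartbeats.

```python
class LivenessTracker:
    """Toy stand-in for a coordinator that tracks servers by heartbeat.

    Hypothetical illustration only; this is not how ZooKeeper is implemented.
    """

    def __init__(self, timeout_seconds: float):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # server name -> time of its last heartbeat

    def heartbeat(self, server: str, now: float) -> None:
        # Record that the server was alive at time `now`.
        self.last_seen[server] = now

    def expired(self, now: float) -> list:
        """Return the servers whose last heartbeat is older than the timeout."""
        return sorted(
            server
            for server, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

A master-like process could call `expired(...)` periodically and reassign the regions of any server that shows up in the result.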
As the name suggests, HMaster is the Master Server that monitors region servers. In HBase 1.6, the metadata lives in ZooKeeper, so clients can reach the required region server even when the Master Server is down. HMaster only comes into the picture when a table is created or altered, when a new region server is added, or when one of the existing region servers fails. Write-ahead logs (WALs) are used to transfer the current state to another Master when the current Master Server crashes.
Region Servers are primarily responsible for handling the Read (GET) and Write (PUT) requests that come from the client (which locates the right Region Server via ZooKeeper). In addition, Region Servers manage Regions (the smallest units of a table that hold the actual data).
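As a rough illustration of how a request ends up at the right Region Server, the sketch below assumes that each Region covers a contiguous, sorted range of row keys, which is how HBase partitions a table. The `REGIONS` layout, the host names, and the `locate_region_server` function are made up for this example and are not part of the HBase client API.

```python
from bisect import bisect_right

# Hypothetical region layout: (start_key, region_server). Each region covers
# [start_key, next region's start_key); the first region starts at "".
REGIONS = [
    ("", "rs1.example.com"),
    ("g", "rs2.example.com"),
    ("p", "rs3.example.com"),
]


def locate_region_server(row_key: str, regions=REGIONS) -> str:
    """Return the server hosting the region whose key range contains row_key."""
    start_keys = [start for start, _ in regions]
    # bisect_right finds the first start key greater than row_key; the region
    # just before that position is the one containing the key.
    index = bisect_right(start_keys, row_key) - 1
    return regions[index][1]
```

With this layout, a GET for row key `"apple"` would go to `rs1.example.com` and a PUT for `"zebra"` to `rs3.example.com`.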
Region Servers go through different types of Compactions.
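A minor compaction merges several smaller files on disk into one so that reads are faster and more efficient. The small files come from memstore flushes (a flush occurs when more than a configurable amount of data is held in memory, hbase.hregion.memstore.flush.size), and a minor compaction is triggered once enough of them have accumulated in a store.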
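The merging step of a minor compaction can be sketched roughly as follows. This is a simplified model, not HBase code: store files are plain Python lists of `(row_key, timestamp, value)` tuples, and the `min_files` threshold is a hypothetical parameter standing in for whatever trigger the cluster is configured with.

```python
import heapq


def minor_compact(store_files, min_files=3):
    """Merge sorted store files into one if at least min_files exist.

    Each store file is a list of (row_key, timestamp, value) tuples that is
    already sorted. All versions are kept; nothing is deleted here.
    """
    if len(store_files) < min_files:
        return store_files  # too few files, leave them as they are
    # heapq.merge lazily merges already-sorted inputs into one sorted stream.
    merged = list(heapq.merge(*store_files))
    return [merged]
```

After the merge, a read has one file to check instead of several, which is where the efficiency gain comes from.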
A major compaction, on the other hand (scheduled once a week by default), combines all the files of each store in a Region into a single file and also removes any data that is due for cleanup, such as deleted cells.
Because it rewrites all of that data, a major compaction, although intended to benefit reads, consumes a lot of resources and can cause problems in a production environment.
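For comparison, here is a rough sketch of what a major compaction does to a single store. It is deliberately simplified: it keeps only the newest version of each row (real HBase keeps a configurable number of versions per cell), and `TOMBSTONE` is a hypothetical marker standing in for a delete.

```python
TOMBSTONE = None  # hypothetical marker meaning "this row was deleted"


def major_compact(store_files):
    """Merge all store files into one, dropping old versions and deleted rows.

    Each store file is a list of (row_key, timestamp, value) tuples.
    """
    latest = {}  # row_key -> (timestamp, value) of the newest entry seen
    for store_file in store_files:
        for row_key, timestamp, value in store_file:
            if row_key not in latest or timestamp > latest[row_key][0]:
                latest[row_key] = (timestamp, value)
    # Emit rows in key order, skipping any row whose newest entry is a delete.
    return [
        (row_key, timestamp, value)
        for row_key, (timestamp, value) in sorted(latest.items())
        if value is not TOMBSTONE
    ]
```

Every entry in every file has to be read and rewritten, which hints at why this operation is so much more expensive than a minor compaction.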
Regions are the smallest units and comprise one store per column family (each store holds data in memory and persists it to disk after a threshold). Regions split automatically, and the split is carried out by the Region Server. The Region Server first informs ZooKeeper that it is splitting a region. It then creates the daughter regions and splits the HDFS files as needed. After successful completion, the metadata is updated in the other region servers, ZooKeeper, and the master (in that order). Thus, HBase automatically handles both splitting and compaction of data in regions to optimize read operations.
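The core of a split, dividing one region's rows into two daughter regions, can be pictured with the small sketch below. It assumes the rows are sorted with unique keys and simply splits at the middle row; the `split_region` function is hypothetical, and the coordination with ZooKeeper and HDFS described above is left out entirely.

```python
def split_region(rows):
    """Split a region's sorted (row_key, value) pairs into two daughters.

    Returns (split_key, daughter_a, daughter_b), where daughter_a holds the
    keys below split_key and daughter_b holds split_key and everything above.
    """
    if len(rows) < 2:
        raise ValueError("need at least two rows to split")
    mid = len(rows) // 2
    split_key = rows[mid][0]
    daughter_a = rows[:mid]  # keys < split_key
    daughter_b = rows[mid:]  # keys >= split_key
    return split_key, daughter_a, daughter_b
```

Each daughter then serves roughly half of the parent's key range, so reads and writes for those keys can be spread across more servers.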
HBase is strongly consistent, unlike a few other NoSQL databases such as MongoDB (when secondary reads are enabled or a partition replica failure occurs; see https://docs.mongodb.com/manual/reference/read-concern/) or Cassandra (depending on the consistency level configuration; see https://cassandra.apache.org/doc/latest/configuration/cass_yaml_file.html#idealconsistency-level), which can return stale data in those situations.
Because Regions are unavailable while they are being split, HBase cannot guarantee Availability.
Consistency is achieved by routing every read and write through the Region Server responsible for the data. Partition tolerance is achieved by replicating the data files through the underlying HDFS.
Hopefully you now know a little more about the various components in HBase and how they let it handle millions of data points and still return results within a few milliseconds. That’s how easy it is to scale HBase. Here's our article on how to install HBase on Mac!