For the past 50 years, the RDBMS has been the mainstay of data storage and processing. Its fundamental architecture has not changed much over the years: it is still a centralized design with collocated compute and storage, where the query engine and the storage layer work hand in hand to process SQL queries and manipulate data. The RDBMS is a server, and client applications have to connect to it remotely. Distributed versions of popular RDBMSs handle scale, but at the cost of considerable complexity in data replication and parallel query processing.
With the large-scale adoption of the cloud and pay-per-use pricing, cloud computing appears cheaper than investing in a multi-million-dollar on-premises server to house an enterprise-grade RDBMS that can process millions of queries in minutes and provide high availability.
Database as a Service has solved that problem in the cloud, but as data volume and scale grow, the cost of cloud-based Database as a Service also increases significantly.
SaaS applications are typically multi-tenant and have to segregate, and in some cases isolate, data by tenant. In an RDBMS we achieve this by creating separate databases, schemas (star or otherwise), or tables per tenant, or by partitioning the data by tenant.
While RDBMSs are extremely fast for OLTP workloads with proper schema design, rogue queries that trigger heavy computation or access huge amounts of data can bring the database to a standstill and cause loss of service for other queries. In a SaaS environment this translates into poor performance for every tenant sharing the database.
With the ever-increasing need to store and process more data, driven by phenomenal growth in data volume, data management will become a very significant operating cost for cloud-based applications.
It is time to rethink the approach that has been used over the years and come up with a new, efficient design that scales data processing and storage while keeping costs under control.
Taking advantage of cheap, virtually infinite cloud storage like Amazon S3, cheap compute on commodity hardware like Amazon EC2, and auto-scaling, can we come up with a new way to process lots of data at a reasonable cost and deliver results quickly? This applies especially to read-only workloads such as data caching, reporting, dashboards, and data analysis, which need access to lots of data stored in a relational format and queried with SQL.
One idea is to adopt a divide-and-conquer approach based on a decentralized architecture: partition the data storage by tenant or set of tenants, separate storage from compute, and run multiple SQL engines on a fleet of commodity servers rather than a centralized OLTP RDBMS or OLAP data warehouse.
SQLite has been the mainstay database for embedded systems and mobile phones for decades, with an estimated one trillion active SQLite databases in use.
It is a full-fledged relational database with a fast SQL query engine for reasonably sized databases. It has a serverless architecture: an in-process SQL engine that is embedded into the application. The database can live in memory, in memory-mapped files, or in ordinary files on disk. With SSD storage in particular, reading data from disk files into memory is fast. A recent paper by the SQLite team and the University of Wisconsin-Madison outlines a plan to enhance and support SQLite through 2050.
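A minimal sketch of this embedded, serverless model using Python's standard-library `sqlite3` module: the same code works against an in-memory database or an ordinary disk file, with no database server involved.

```python
import sqlite3

# Open an in-memory database; pass a file path instead to persist to disk.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
con.executemany("INSERT INTO orders (amount) VALUES (?)", [(10.0,), (24.5,)])

# The SQL engine runs in-process; no network round trip, no server to manage.
total, = con.execute("SELECT SUM(amount) FROM orders").fetchone()
print(total)  # 34.5
con.close()
```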
SQLite can be used for data processing across an array of servers, each serving one big tenant or a set of smaller tenants, on commodity EC2 instances with disk-based or in-memory storage. Database size and memory caching can be designed to match performance requirements. As long as we ensure that multiple servers do not write to a single database, the issues of coordinating concurrent writes are avoided.
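The tenant-partitioning idea can be sketched as one SQLite file per tenant, with a small routing function picking the right file. The storage directory and tenant names here are hypothetical; in practice the files would live on each server's local disk or be synced from S3.

```python
import sqlite3
from pathlib import Path

DATA_DIR = Path("/tmp/tenant_dbs")  # hypothetical storage root for this sketch
DATA_DIR.mkdir(exist_ok=True)

def tenant_connection(tenant_id: str) -> sqlite3.Connection:
    """One SQLite file per tenant: no cross-tenant writes to coordinate,
    and a rogue query only affects the tenant that issued it."""
    return sqlite3.connect(str(DATA_DIR / f"{tenant_id}.db"))

for tenant in ("acme", "globex"):  # example tenant IDs
    con = tenant_connection(tenant)
    con.execute("CREATE TABLE IF NOT EXISTS events (ts TEXT, payload TEXT)")
    con.execute(
        "INSERT INTO events VALUES (datetime('now'), ?)",
        (f"hello from {tenant}",),
    )
    con.commit()
    con.close()
```

Because each database file has a single writer, SQLite's simple file-level locking is sufficient and no distributed coordination is needed.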
DuckDB is a new in-memory, embeddable OLAP engine with a design similar to SQLite's. It executes complex analytical queries over gigabytes of data faster than a centralized database can. It has a serverless architecture, can be used as an in-process OLAP query engine, and can read data stored in Parquet files in a columnar format. DuckDB uses SIMD-based, vectorized processing to keep hot data in the CPU caches and thus squeezes performance out of the processor.
DuckDB integrates with Apache Arrow for fast transfer of data from disk to memory. It can keep the database in memory or on disk, use memory-mapped files, or simply run analytical SQL queries on disk-based Parquet files on the fly. The DuckDB paper outlines the design principles of this new-age in-memory OLAP engine.
Using SQLite and DuckDB in a decentralized architecture, we can process large amounts of data on commodity hardware without sacrificing speed or scalability. This does increase the work of managing servers and data partitions compared to a centralized database, but that is a good problem to solve. The centralized databases of yesteryear were not designed for the cloud era. Embedded SQL engines like SQLite and DuckDB can easily leverage cheap, on-demand compute and storage at scale and thus deliver better ROI.