Data Distribution 101
Data distribution plays an important role in today’s database world. Applications that access and modify data need those operations to be faster and more reliable than ever. The distribution strategy differs from one product to another, and each database offers its own mechanisms to ensure scalable, fast, and secure data operations.
In this first part of our series on data distribution in Postgres vs. MongoDB, we will explain the importance of data distribution in the current database environment and illustrate how it can help enhance performance and scalability.
Data Distribution: Overview
Data distribution is the practice of strategically spreading data across multiple physical locations to ensure efficient and scalable data operations. As part of a data distribution strategy, a database’s data may be spread across multiple disks on the same server, or distributed across multiple servers connected over a network, perhaps even servers hosted in geographically distant locations. The distributed data can either be the same data copied to each location using replication, or distinct subsets of the data placed in different locations using partitioning or sharding.
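To make the distinction concrete, here is a minimal Python sketch, independent of any particular database product and using made-up node names, of the two approaches: replication keeps a full copy of the data on every node, while partitioning or sharding places a distinct subset of the data on each node.

```python
# Toy illustration only: "nodes" are just dictionary keys, not real servers.
records = [{"id": i, "name": f"user{i}"} for i in range(6)]
nodes = ["node_a", "node_b", "node_c"]

# Replication: every node holds a full copy of the data set.
replicated = {node: list(records) for node in nodes}

# Partitioning/sharding: each node holds a distinct subset of the data set.
sharded = {node: [] for node in nodes}
for rec in records:
    sharded[nodes[rec["id"] % len(nodes)]].append(rec)

print(len(replicated["node_a"]))  # 6 -> full copy on every node
print(len(sharded["node_a"]))     # 2 -> only one subset per node
```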
Why Data Distribution Is Necessary
In today’s world, databases are tasked with storing terabytes and even petabytes of data while processing millions of transactions per day to satisfy complex data requests. Distributing this data and processing across a number of resources makes the workload more manageable and, since cloud-based resources or commodity hardware can be used, less expensive to address. Proper data distribution has therefore become essential for ensuring good performance and optimal application response times in the face of ever-growing demands.
Achieving this, however, is not always straightforward. Arriving at a proper data distribution strategy requires careful analysis of metrics, design, and architectural choices, validated by an efficient testing plan, with the ultimate goal of achieving optimal vertical and horizontal scalability. In domains such as banking or telecom, where an operational DBMS may be called upon to process millions of transactions per second with minimal latency, distributing both the data being accessed and the processing of those transactions becomes an absolute must.
Characterization of Workloads for Data Distribution
It is important for database professionals to understand that not all use cases for data distribution are the same. The characteristics of the workload being addressed must be understood before designing and implementing any kind of data distribution architecture. Implementing data distribution takes significant effort in planning, design, testing, and rollout, and it can also involve significant cost. A broad understanding of the application requirements is therefore essential.
Let’s take a look at the three different types of database workloads that commonly arise.
1. Scale-Out Data Warehouse
Businesses across various domains invest significant money and time in building data warehousing (DWH) environments that can cater to various business requirements, predominantly for reporting and analytical purposes. Reporting applications run complex and expensive (mostly SELECT) queries to scan large portions of historical data. The challenge is to ensure databases are performing at the expected scale.
Generally, data warehousing environments deal with large volumes of data, on the order of tens or hundreds of terabytes or even petabytes. Queries running over such large pools of data can pose significant performance challenges, especially since these queries are often complex, looking for intricate relationships and correlations between different parts of the data to produce unique insights. Because of this, these systems require not just high-performance access to large amounts of storage, but also a tremendous amount of compute capacity to process it all.
Vertical scalability is the first option for addressing this. Simply using a bigger machine with more and faster disk space may be enough: CPUs keep getting faster, memory is becoming cheaper, and solid-state drives (SSDs) can dramatically improve the speed at which data can be read from disk. Many problems that could only be solved with a scale-out data warehouse even a few years ago can now be addressed by a single fast machine. That isn’t always the case, however, and a scale-out solution may still be required, in which the architecture leverages the storage capacity and computing power of multiple machines connected over a network.
Traditionally this use case was addressed by massively parallel processing (MPP) database management systems where a large array of specialized processors could be brought to bear to solve complex analytical problems. More recently, MapReduce-based solutions like Hadoop have become popular in addressing these use cases. In both cases, the basic idea is the same: divide and conquer. The access and processing of the data is spread across a large number of resources to answer these complex queries in minutes instead of hours.
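The divide-and-conquer idea itself is easy to sketch in Python: instead of scanning one large data set sequentially, the work is mapped over several chunks in parallel and the partial results are then reduced into a final answer. The chunk sizes and the filter below are placeholders chosen purely for illustration.

```python
from multiprocessing import Pool
from functools import reduce

# Hypothetical "fact table" split into chunks, one per worker (or node).
chunks = [range(0, 1_000_000), range(1_000_000, 2_000_000),
          range(2_000_000, 3_000_000), range(3_000_000, 4_000_000)]

def map_chunk(chunk):
    # Each worker scans only its own slice of the data (the "map" step).
    return sum(x for x in chunk if x % 7 == 0)

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        partials = pool.map(map_chunk, chunks)
    # Combine the partial aggregates into the final answer (the "reduce" step).
    total = reduce(lambda a, b: a + b, partials)
    print(total)
```

MPP databases and MapReduce frameworks apply this same map-then-reduce pattern, only across many machines rather than local processes.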
2. Sharded Web/Mobile Applications
Web and mobile applications are designed to handle extremely large numbers of users. These could be social networks, online commerce sites, SaaS applications, or even online games. While each individual user may not have a huge amount of data, the sheer number of users means that, in aggregate, the overall data set can become enormous. When this occurs, it can become necessary to distribute or “shard” the data across a number of machines for scalability. In addition, since the number of users interacting with the system at any given time can be unpredictable, it is advantageous to spread the data access and processing across the machines holding the data as well.
The nice thing about this type of data distribution is that, typically, any user of the system interacts with a well-defined subset of the overall data set: either their own personal data or, in the case of a SaaS business application, the data for their company. In addition, interactions with the data are often simpler than in the case of a data warehouse. The user may be accessing their profile data, updating their account, or generating a simple report on the performance of their department in a SaaS application.
This means that queries and updates can usually be “routed” to a single data node containing the necessary information to fulfill a specific request. The node that holds that data can then process the request and return the information to the user without needing to execute complex coordination with the other nodes in the system.
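A minimal sketch of this routing idea, assuming a hash-based shard key (the node addresses and key name below are hypothetical): the application, or a routing tier in front of it, hashes the shard key and forwards each request to the single node that owns that key.

```python
import hashlib

# Hypothetical node addresses; real systems keep this mapping in a routing
# layer or in cluster metadata rather than hard-coded in the application.
SHARD_NODES = ["shard-0.example.internal",
               "shard-1.example.internal",
               "shard-2.example.internal"]

def node_for(user_id: str) -> str:
    # Hash the shard key so requests for the same user always land on the
    # same node; md5 is used here only as a stable, well-distributed hash.
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return SHARD_NODES[int(digest, 16) % len(SHARD_NODES)]

# All reads and writes for this user go to exactly one node.
print(node_for("customer-42"))
```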
This is a very common use case for NoSQL document databases like MongoDB, where all the information for a user or group can be kept in a “document” that lives on a particular node. This access pattern also makes cross-document or distributed transactions unnecessary in most cases.
Relational databases can also be sharded in a similar way using the user or company as a “sharding key” to ensure that all information associated with that entity is stored and processed on a single server. For example, a logical table of customers could be split across a number of servers based on the first letter of the customer’s last name. This way, the application can easily route any customer request to the appropriate server.
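As a rough sketch of that last-name example (the server names and letter ranges here are invented for illustration, and a real deployment would size the ranges according to actual data volume), the application can route each customer request to a server based on the first letter of the customer’s last name:

```python
# Hypothetical letter ranges and server names.
RANGES = [("A", "H", "server-1"), ("I", "P", "server-2"), ("Q", "Z", "server-3")]

def server_for(last_name: str) -> str:
    first = last_name[:1].upper()
    for low, high, server in RANGES:
        if low <= first <= high:
            return server
    return "server-3"  # fallback for names not starting with A-Z

print(server_for("Garcia"))   # server-1
print(server_for("Nguyen"))   # server-2
```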
A common challenge with an RDBMS in this scenario actually comes from one of its greatest strengths: the ability to query arbitrary data using standard SQL or reporting tools. Because it is easy to write a query that reports on activity across all users in the system, perhaps matching some criteria (e.g. “Select all customers in Texas who have bought more than six pairs of shoes in the last three months, have also bought dog food, and are preferred customers”), you can easily end up with queries that span shards and make transaction coordination more complex in a distributed system.
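Such a cross-shard query typically has to be scattered to every shard and the partial results gathered and merged by a coordinator, in contrast to the single-shard lookups described earlier. The toy sketch below uses in-memory dictionaries in place of real shard servers purely to show how many shards each kind of request must touch.

```python
# Toy in-memory shards; in a real system each would be a separate server.
SHARDS = {
    "shard-0": [{"name": "Garcia", "state": "TX"}],
    "shard-1": [{"name": "Nguyen", "state": "CA"}],
    "shard-2": [{"name": "Smith", "state": "TX"}],
}

def lookup_customer(shard: str, name: str):
    # Single-shard request: touches exactly one shard.
    return [c for c in SHARDS[shard] if c["name"] == name]

def customers_in_state(state: str):
    # Cross-shard report: must scatter to every shard and gather the results.
    results = []
    for shard_data in SHARDS.values():
        results.extend(c for c in shard_data if c["state"] == state)
    return results

print(customers_in_state("TX"))  # requires visiting all three shards
```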
3. OLTP Data Partitioning
In OLTP environments, transactions are very small in size but high in number. Typically, there would be a large quantity of single-row updates or inserts. In domains such as banking, the number of transactions per day can easily be in the millions or tens of millions, so the need to make these read/write operations scalable becomes essential. Each operation may be simple, so the compute load may not be overwhelming in this case, but I/O to disk can be very high.
Data distribution can help improve the throughput of OLTP databases. To distribute the data efficiently, the portions of the database that receive the most transactions must be identified and spread across multiple physical locations, typically multiple disks. This is where partitioning comes into play: an RDBMS may split a table across a number of physical partitions so it can write to each of these disks in parallel, dramatically improving I/O throughput.
In an RDBMS such as PostgreSQL, you can identify the tables with the highest read/write traffic and spread them across multiple disks; the same should be done for their indexes. Larger tables are good candidates for partitioning, and the partitions can then be distributed across multiple physical locations, which spreads the I/O.
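As a rough illustration of this approach, the sketch below uses psycopg2 and PostgreSQL’s declarative partitioning (available since PostgreSQL 10) to split a large table into partitions placed in different tablespaces, each assumed to live on a separate disk. The connection string, table name, and tablespace names are all hypothetical, and the tablespaces are assumed to have been created beforehand.

```python
import psycopg2

# Hypothetical connection string; the tablespaces fast_disk_1 and fast_disk_2
# are assumed to exist already, each created on its own disk, e.g.
#   CREATE TABLESPACE fast_disk_1 LOCATION '/mnt/disk1/pgdata';
conn = psycopg2.connect("dbname=appdb user=app")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE transactions (
        id         bigint NOT NULL,
        created_at date   NOT NULL,
        amount     numeric
    ) PARTITION BY RANGE (created_at);
""")

# Each partition is placed in its own tablespace, spreading write I/O
# across multiple disks.
cur.execute("""
    CREATE TABLE transactions_2024_h1 PARTITION OF transactions
        FOR VALUES FROM ('2024-01-01') TO ('2024-07-01')
        TABLESPACE fast_disk_1;
""")
cur.execute("""
    CREATE TABLE transactions_2024_h2 PARTITION OF transactions
        FOR VALUES FROM ('2024-07-01') TO ('2025-01-01')
        TABLESPACE fast_disk_2;
""")

conn.commit()
cur.close()
conn.close()
```

Indexes can be assigned to tablespaces in the same way (CREATE INDEX ... TABLESPACE ...), so index I/O can be spread across disks as well.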
In the case of NoSQL databases, sharding can achieve the same effect, though it tends to create a more complex architecture in which processing power must be scaled along with storage, even when disk performance is the only bottleneck. We will cover this topic in more detail in the follow-up parts of this article.
Summary
Data distribution is an essential requirement for enhancing the performance and scalability of data operations, and it is important to understand the characteristics of the workload before adopting data distribution as the solution. In today’s world, databases are hosted on servers with multiple high-power CPUs and gigabytes of memory, and an effective data distribution strategy can significantly help leverage the computing power of multiple such servers for optimal performance.
Part 2 of this series will focus in detail on the data distribution capabilities of PostgreSQL, followed by Part 3 on MongoDB.