Shard Database to handle a million records

Sharding is a type of database partitioning that separates a large database into smaller, more easily managed parts. These smaller parts are called shards.

Introduction

Hi, folks. In today's data-driven world, where the volume and velocity of information keep growing, sooner or later you need to design a database that can handle a million rows or more. There are multiple ways to do that. In this article, we are going to explore one technique, called sharding, with a practical explanation.

What is a Shard?

The word “shard” means “a small part of a whole”. Hence, sharding means dividing something large into smaller parts. Here, that means dividing one database into several smaller databases.

Types of Sharding

There are different types of sharding. In this article, we are going to look at the two major types.

  • Horizontal Sharding

  • Vertical Sharding

Vertical Sharding

Vertical sharding involves splitting data based on specific columns or attributes rather than a range. In this approach, different shards hold different sets of columns for a given dataset. For instance, one shard may contain customer names and contact information, while another shard holds customer purchase history. Vertical sharding can be useful when certain attributes are accessed more frequently than others or when different attributes require different levels of resources.
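To make the column split concrete, here is a minimal sketch that uses plain Python dictionaries as stand-ins for two shards. The sample customers, the shard names, and the `split_vertically` helper are all made up for illustration; a real system would store each column group in a separate database.

```python
# Hypothetical customer rows, each with every column in one place.
customers = [
    {"id": 1, "name": "Asha", "email": "asha@example.com", "purchases": ["book", "pen"]},
    {"id": 2, "name": "Ravi", "email": "ravi@example.com", "purchases": ["laptop"]},
]

# Assumed column groups: each group becomes one shard.
COLUMN_GROUPS = {
    "contact_shard": ["name", "email"],
    "purchase_shard": ["purchases"],
}


def split_vertically(rows, column_groups, key="id"):
    """Return {shard_name: {key_value: {column: value}}}."""
    shards = {name: {} for name in column_groups}
    for row in rows:
        for name, columns in column_groups.items():
            # Every shard keeps the same key so the pieces can be rejoined.
            shards[name][row[key]] = {col: row[col] for col in columns}
    return shards


shards = split_vertically(customers, COLUMN_GROUPS)
print(shards["contact_shard"][1])
print(shards["purchase_shard"][2])
```

Each shard is keyed by the same customer ID, so a full record can still be rebuilt by looking the ID up in both shards.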

Horizontal Sharding

Horizontal sharding involves distributing the rows of a table across multiple databases or servers, for example based on a range of values. In a range-based setup, each shard contains the subset of rows whose key falls within a specific range.

For example:
Data could be partitioned based on customer IDs, where one shard contains customer data with IDs from 1 to 100,000, and another shard contains IDs from 100,001 to 200,000.
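The customer ID ranges above can be turned into a small routing function. This is only a sketch: the shard names and the third range are assumptions added for illustration.

```python
import bisect

# Inclusive upper bound of each shard's ID range, in ascending order.
# The shard names and the third range are hypothetical.
UPPER_BOUNDS = [100_000, 200_000, 300_000]
SHARD_NAMES = ["shard_1", "shard_2", "shard_3"]


def shard_for_customer(customer_id):
    """Return the name of the shard whose range contains customer_id."""
    if customer_id < 1:
        raise ValueError("customer IDs start at 1")
    # bisect_left finds the first upper bound that is >= customer_id.
    index = bisect.bisect_left(UPPER_BOUNDS, customer_id)
    if index == len(UPPER_BOUNDS):
        raise ValueError(f"no shard configured for ID {customer_id}")
    return SHARD_NAMES[index]


print(shard_for_customer(100_000))  # last ID of the first range
print(shard_for_customer(100_001))  # first ID of the second range
```

Keeping the boundaries in a sorted list means a new range can be added by appending one bound and one shard name.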

I hope you now have a clear idea of what sharding is. To implement it in your system, we need to know a few more techniques and some jargon. Let's dive in.

Consistent Hashing

Consistent Hashing is a specific type of algorithm to achieve data sharding.

Consistent Hashing is a distributed hashing scheme that operates independently of the number of servers or objects in a distributed hash table.

In Layman's Terms

Let me explain the basic idea behind consistent hashing. For a more in-depth explanation, check out the article “Understanding and implementing consistent hashing”.

We know that a hash function converts an input, such as plain text, into a seemingly random string of characters with a fixed length. So, consider that we have 3 database instances running on different ports: 5432, 5433, and 5434.

Now we have 3 different databases, as in Horizontal Sharding. We need to find out which database instance to read from and write to. To solve this, we are going to apply our knowledge of hash functions.

PORT NO    HASH
5432       NTQzMg==
5433       NTQzMw==
5434       NTQzNA==

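As we know, hashing the same port number gives the same hash every time. Whenever we decode that hash, we get back the port number, so we know which database instance to use for our read and write operations. (The values in the table are Base64 encodings of the port numbers, which is why they can be decoded at all. A real hash function is one-way and cannot be reversed.)

This is not a working implementation. I have oversimplified it to make the idea easier to understand. I hope you now have a good idea of how it works.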
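For a slightly more realistic picture, here is a minimal sketch of a consistent hash ring. It is not taken from the article mentioned above: the `HashRing` class, the use of SHA-256, the 100 virtual nodes per instance, and the fourth port 5435 are all assumptions made for illustration. Record keys and database instances are hashed onto the same ring, and each key belongs to the first instance found clockwise from its position.

```python
import bisect
import hashlib


def ring_position(value):
    """Map a string to an integer position on the ring."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, replicas=100):
        self.replicas = replicas  # virtual nodes per real node
        self._positions = []      # sorted ring positions
        self._owner = {}          # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        """Place `replicas` virtual copies of the node on the ring."""
        for i in range(self.replicas):
            pos = ring_position(f"{node}#{i}")
            bisect.insort(self._positions, pos)
            self._owner[pos] = node

    def get_node(self, key):
        """Return the first node clockwise from the key's position."""
        pos = ring_position(key)
        # The modulo wraps around to the start of the ring.
        index = bisect.bisect_right(self._positions, pos) % len(self._positions)
        return self._owner[self._positions[index]]


ring = HashRing(["5432", "5433", "5434"])
keys = [f"customer:{i}" for i in range(1000)]
before = {k: ring.get_node(k) for k in keys}

ring.add_node("5435")  # a hypothetical fourth instance
after = {k: ring.get_node(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved to the new instance")
```

The point of the ring is visible in the last few lines: when an instance is added, the only keys that change owner are the ones the new instance takes over. With a plain `hash(key) % number_of_instances` scheme, changing the number of instances would remap most keys.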

Methods of Database Sharding

  • Range-based sharding

  • Hash-Based Sharding

  • Directory sharding

  • Geo sharding
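To give a rough feel for how two of these methods differ, here is a small sketch of hash-based and directory-based routing. The shard names, the tenant keys, and both helper functions are hypothetical.

```python
import hashlib

SHARDS = ["shard_0", "shard_1", "shard_2"]  # hypothetical shard names


def hash_based_shard(key):
    """Hash-based: a stable hash of the key, modulo the shard count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


# Directory-based: an explicit lookup table decides where each key lives.
DIRECTORY = {"tenant_a": "shard_0", "tenant_b": "shard_2"}


def directory_shard(key):
    """Look the key up in the directory; unknown keys are an error here."""
    try:
        return DIRECTORY[key]
    except KeyError:
        raise KeyError(f"no directory entry for {key!r}") from None


print(hash_based_shard("customer:42"))
print(directory_shard("tenant_b"))
```

The sketch uses `hashlib` rather than Python's built-in `hash()` because string hashes from `hash()` are randomized per process by default, so the same key could land on a different shard after a restart.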

You can apply different methods of sharding depending on the use case. However, sharding is only one of several database scaling strategies, so explore the other techniques as well. A full implementation and explanation of each method is beyond the scope of this article.

Pros of Sharding

Improve response time

Data retrieval takes longer on a single large database, because the database management system needs to search through many rows to retrieve the correct data. By contrast, each shard has fewer rows than the entire database. Therefore, it takes less time to retrieve specific information, or run a query, from a sharded database.

Scale efficiently

A growing database consumes more computing resources and eventually reaches storage capacity. Organizations can use database sharding to add more computing resources to support database scaling. They can add new shards at runtime without shutting down the application for maintenance.

Improve reliability

Having many shards limits the impact of problems that affect individual instances. With one huge database, an outage takes the whole site down. With 100 shards, a single-instance outage affects only about 1% of your data, assuming the data is spread evenly.

Cons of Sharding

  • Sharding can be complicated to get right, particularly if your shard key isn’t obvious.

  • You occasionally have to worry about splitting shards, and very occasionally about merging them. This can be quite complicated.

  • Applications need to be aware of the details of database organization, at least at some level. Joins across shards are not easily doable. If you need cross-shard joins, you probably need a data warehouse or some other separate reporting data store.

  • No native support: sharding is not natively supported by every database engine.
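To illustrate the point about application awareness, here is a sketch of what a query that spans shards can look like when the database does not do it for you. The in-memory shards, the sample orders, and both functions are made up for illustration. The application sends the same query to every shard and merges the results itself.

```python
# Two hypothetical shards, each holding a slice of an orders table.
SHARDS = {
    "shard_1": [
        {"customer_id": 7, "total": 120.0},
        {"customer_id": 9, "total": 35.5},
    ],
    "shard_2": [
        {"customer_id": 100_007, "total": 80.0},
        {"customer_id": 100_020, "total": 210.0},
    ],
}


def query_shard(rows, min_total):
    """Stand-in for running the same query on one shard."""
    return [row for row in rows if row["total"] >= min_total]


def scatter_gather(min_total):
    """Send the query to every shard, then merge and sort in the application."""
    results = []
    for rows in SHARDS.values():
        results.extend(query_shard(rows, min_total))
    return sorted(results, key=lambda row: row["total"], reverse=True)


for row in scatter_gather(min_total=80.0):
    print(row)
```

Filtering and sorting like this is manageable in application code, but a real join across shards needs much more work, which is why a separate reporting store is often the more practical choice.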

Conclusion

Sharding is a great solution when a single database can no longer handle and store your application's growing volume of data. Sharding helps to scale the database and improve the performance of the application. However, it also adds some complexity to your system.

I hope you got valuable information today. For more content like this, like and share it in your communities.
