Mastering Scalable Identity: A Guide to Crafting a Unique ID Generator

Introduction

Hi Flok's. In the realm of modern computing, distributed systems have become the backbone of countless applications and services, ranging from cloud-based data storage to complex real-time networks.

As these systems grow in complexity and scale, the need for reliable and unique identification becomes a difficult task for engineers.

The purpose of this article is to unveil the difficulties of crafting a Unique ID Generator specifically tailored to the challenges posed by distributed systems. We will delve into the building blocks of scalable Unique ID Generator, discussing distributed systems.

What is a Unique ID?

A Unique ID, also known as a Unique Identifier, is a distinct and non-repeating value or code that is assigned to a specific entity, object, or data record within a system or database.

Suppose you are asked to design a unique ID generator. Your first thought might be to use a primary key with the auto_increment attribute in a traditional database. However, auto_increment does not work in a distributed environment because a single database server is not large enough and generates unique IDs across multiple databases.

Example

What is the scale of the system?

In the context of distributed systems and infrastructure, the "scale of the system" refers to the size and capacity of the system to handle a growing or changing workload.

Our unique ID generator system should be able to generate 10,000 IDs per second.

IDs must be unique.
IDs are numerical values only.
IDs are ordered by date.
Ability to generate over 10,000 unique IDs per second.

Multiple options can be used to generate unique IDs in distributed systems. The options we are going to explore are some of the solutions which suit our requirements.

Multi-master replication
Ticket server
Universally unique identifier (UUID)
Twitter snowflake approach

Multi-master replication

The idea behind this solution is to use the databases’ auto_increment feature. Instead of increasing the next ID by 1, we increase it by k, where k is the number of database servers used.

The next ID to be generated is equal to the previous ID in the same server plus 2. But, this approach as cons IDs does not go up with time across multiple servers and it does not scale well when a server is added or removed.

Ticket Server

The term "ticket" in this context represents a unique identifier, A "Ticket Server" typically refers to a server or system used for managing and distributing tickets or tokens in various contexts.

Ticket servers are another interesting way to generate unique IDs. Flicker developed ticket servers to generate distributed primary keys.

The basic idea is to use a centralized auto_increment feature in a single database server (Ticket Server). To learn more about this, refer to Flicker’s engineering blog article.

Even this approach has its pros & cons :

It can be only used for generating Numeric IDs.
It is easy to implement, and it works for small to medium-scale applications.

Cons:

Single point of failure. Single ticket server means if the ticket server goes down, all systems that depend on it will face issues. To avoid a single point of failure, we can set up multiple ticket servers. However, this will introduce new challenges such as data synchronization.

Universally unique identifier (UUID)

A UUID is another easy way to obtain unique IDs. UUID is a 128-bit number used to identify information in computer systems. UUID has a very low probability of getting collusion.

Quoted from Wikipedia, “After generating 1 billion UUIDs every second for approximately 100 years would the probability of creating a single duplicate reach 50%”.

UUIDs can be generated independently without coordination between servers.

In this approach, each web server contains an ID generator, and a web server is responsible for generating IDs independently.

Pros:
• Generating UUID is simple. No coordination between servers is needed so there will not be any synchronization issues.

• The system is easy to scale because each web server is responsible for generating the IDs they consume. ID generators can easily scale with web servers.

Cons:

• IDs are 128 bits long.
• IDs do not go up with time.
• IDs could be non-numeric.

Twitter snowflake approach

Twitter Snowflake is a distributed unique ID generation service developed by Twitter to address the need for generating unique IDs in a distributed, scalable, and reliable manner.

It was designed to overcome some of the limitations of traditional auto-incrementing database keys and generic unique ID generation methods.

Twitter’s unique ID generation system is called “snowflake”.

The idea behind it 🤔

The Snowflake approach relies on a structured, 64-bit unique ID format, which is broken down into various components to ensure uniqueness and support a distributed architecture.

Timestamp (41 bits):

This is the most significant part of the ID and represents the timestamp when the ID was generated, measured in milliseconds since a customizable epoch. The use of timestamps ensures that IDs generated at different nodes or instances of the service are guaranteed to be unique and are approximately ordered by time.

Machine ID (10 bits):

Snowflake allocates a unique machine or worker ID for each server or node in the distributed system. This allows for a significant number of machines to coexist and generate unique IDs without conflicts.

Data Center ID (12 bits):

In addition to the machine ID, a data center or worker group ID is assigned to distinguish between machines operating in different data centers. This is crucial for large-scale, geographically distributed systems.

Sequence Number (12 bits):

The sequence number is used to differentiate IDs generated within the same millisecond on the same machine. It helps ensure that IDs remain unique even in cases of high throughput.

The combination of these components ensures that each Snowflake ID is unique within the distributed system, is sortable by time, and contains information about the machine and data center responsible for generating it.

Summary

We have seen different approaches to designing a unique ID generator: multi-master replication, UUID, ticket server, and Twitter snowflake-like unique ID generator.

In all the proposed designs, we assume ID generation servers have the same clock. This assumption might not be true when a server is running on multiple cores.

In our next blog. We can discuss what is Clock synchronization & its solution.

Reference :

Universally unique identifier
https://en.wikipedia.org/wiki/Universally_unique_identifier

Ticket Servers
https://code.flickr.net/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-thecheap/

Announcing Snowflake:
https://blog.twitter.com/engineering/en_us/a/2010/announcingsnowflake.html

Conclusion

As we part ways with this article, I encourage you to carry forward the insights gained here, whether you're a system architect, developer, or engineer, to drive innovation and excellence in the realm of distributed computing.

I am excited to hear your thoughts on our exploration of Unique ID Generators in distributed systems. Your perspective, insights, and questions are valuable to our community. Please take a moment to share your thoughts and share it to your friends.

Crafting a Unique ID Generator for Distributed Systems

We navigate the intricate world of distributed systems, unlocking the secrets to building a robust, scalable, and secure Unique ID Generator.

Introduction