Skip to main content

Sharding an Existing Payment Gateway Transaction Table: A Complete Guide

Photo by Erol Ahmed on Unsplash


In payment gateway systems, the transaction table often grows rapidly, leading to performance bottlenecks. One effective way to scale a transactional database is sharding — splitting the database into smaller, more manageable pieces based on a shard key. In this guide we will walk through the process of sharding an existing transaction table in PostgreSQL, explaining key considerations, risks, and two popular sharding approaches: row-based and table-based.

Key Considerations and Risks

Before diving into the implementation of sharding, it’s important to understand the key considerations and risks associated with this approach:

Considerations:

  • Sharding Key: The selection of a shard key is crucial. A good shard key ensures even distribution of data, while also aligning with the most frequent query patterns. For payment transactions, transaction_date or customer_id can be used.
  • Database Complexity: Sharding increases the complexity of your database management. You’ll need to manage multiple shards, ensure data consistency across shards, and optimize cross-shard queries.
  • Data Migration: Migrating existing data to shards can be resource-intensive, especially when you have a large amount of data in the original transactions table.
  • Maintenance Overhead: Ongoing maintenance, such as schema updates, backup, and scaling the database, becomes more complex with sharded systems.

Risks:

  • Cross-Shard Queries: Queries that need to aggregate data across multiple shards can become slower and more complicated.
  • Data Imbalance: If the shard key does not evenly distribute the data, some shards may become overloaded, while others are underutilized.
  • Application Changes: Your application logic will need to be adapted to handle sharding, such as routing queries to the correct shard based on the shard key.

What Needs to Be Done First?

Before implementing sharding, follow these initial steps:

  1. Assess Current Data Volume: Evaluate the current size of the transactions table. If the table is too large, it’s a sign that sharding might be necessary.
  2. Choose a Shard Key: Choose a shard key that will ensure even distribution of data. For transaction-based data, a time-based shard key (e.g., transaction_date) or customer_id might be appropriate.
  3. Decide on Sharding Strategy: Decide whether to use row-based or table-based sharding. Each approach has its pros and cons, which we’ll dive into shortly.
  4. Prepare for Data Migration: Plan for migrating data from the existing transactions table to the newly created shards. This process may require downtime or a strategy for migrating data incrementally.

Row-Based vs Table-Based Sharding: Which Is Best?

Row-Based Sharding

In row-based sharding, data is distributed across different schemas based on the shard key. For example, you may create a schema for each year (e.g., shard_2022shard_2023), and each schema will have the same table structure, with the data filtered by the year.

Pros:

  • Logical Isolation: Each year or time period has its own schema, making it easy to manage and isolate data.
  • Easier Schema Evolution: Schema changes can be made to individual shards without affecting other shards.

Cons:

  • Limited Flexibility: The database grows based on the chosen shard key, so if the volume increases drastically, you may hit limits with the number of schemas.
  • Cross-Shard Queries: Queries that span multiple years (e.g., to retrieve data from different years) will require querying multiple schemas.

Table-Based Sharding

In table-based sharding, each shard is a separate table within the same schema (e.g., transactions_2022transactions_2023). The data is split into tables based on the shard key.

Pros:

  • Fine-Grained Control: You can split data across tables in the same schema or distribute the tables across different physical instances.
  • Scalability: Easier to distribute the shards across multiple database instances for horizontal scaling.

Cons:

  • Schema Clutter: The same schema will have many tables, potentially cluttering it.
  • Cross-Table Queries: Aggregating data across multiple tables can be cumbersome.

Sharding Example 1: Row-Based Sharding

In this example, we will shard the transactions table by the year extracted from transaction_date.

Step 1: Create Shard Schemas

-- Create schemas for each year (shard)
CREATE SCHEMA IF NOT EXISTS shard_2022;
CREATE SCHEMA IF NOT EXISTS shard_2023;

Step 2: Create Identical Tables in Each Shard

-- Create the same table in each schema
CREATE TABLE shard_2022.transactions (
transaction_id UUID PRIMARY KEY,
customer_id INT,
amount NUMERIC(10, 2),
transaction_date TIMESTAMPTZ,
status VARCHAR(50)
);

CREATE TABLE shard_2023.transactions (
transaction_id UUID PRIMARY KEY,
customer_id INT,
amount NUMERIC(10, 2),
transaction_date TIMESTAMPTZ,
status VARCHAR(50)
);

Step 3: Migrate Data to Shards

-- Insert data into the respective shards based on the year of TransactionDate
INSERT INTO shard_2022.transactions
SELECT * FROM transactions WHERE EXTRACT(YEAR FROM transaction_date) = 2022;

INSERT INTO shard_2023.transactions
SELECT * FROM transactions WHERE EXTRACT(YEAR FROM transaction_date) = 2023;

Step 4: Querying Data

-- Query transactions for 2022
SELECT * FROM shard_2022.transactions WHERE customer_id = 101;

-- Query transactions for 2023
SELECT * FROM shard_2023.transactions WHERE customer_id = 202;

Sharding Example 2: Table-Based Sharding

In this example, we will shard the transactions table by the year, but instead of using schemas, we will create separate tables for each year in the same schema.

Step 1: Create Shard Tables

-- Create separate tables for each year in the same schema
CREATE TABLE transactions_2022 (
transaction_id UUID PRIMARY KEY,
customer_id INT,
amount NUMERIC(10, 2),
transaction_date TIMESTAMPTZ,
status VARCHAR(50)
);

CREATE TABLE transactions_2023 (
transaction_id UUID PRIMARY KEY,
customer_id INT,
amount NUMERIC(10, 2),
transaction_date TIMESTAMPTZ,
status VARCHAR(50)
);

Step 2: Migrate Data to Shard Tables

-- Insert data into the respective shard tables based on the year
INSERT INTO transactions_2022
SELECT * FROM transactions WHERE EXTRACT(YEAR FROM transaction_date) = 2022;

INSERT INTO transactions_2023
SELECT * FROM transactions WHERE EXTRACT(YEAR FROM transaction_date) = 2023;

Step 3: Querying Data

-- Query transactions for 2022
SELECT * FROM transactions_2022 WHERE customer_id = 101;

-- Query transactions for 2023
SELECT * FROM transactions_2023 WHERE customer_id = 202;

Summary and Comparison: Row-Based vs. Table-Based Sharding

Row-Based Sharding (Schemas)

  • Logical Separation: Better isolation with separate schemas.
  • Cross-Shard Queries: Requires cross-schema queries.
  • Scalability: Limited by the number of schemas.
  • Operational Overhead: Higher complexity in schema management.

Table-Based Sharding (Tables)

  • Logical Separation: Same schema with multiple tables.
  • Cross-Shard Queries: Requires union or cross-table queries.
  • Scalability: Easier to distribute tables across instances.
  • Operational Overhead: Simpler table management, but more tables.

Conclusion

Sharding your transactions table is an effective way to scale a payment gateway system. Whether you choose row-based or table-based sharding depends on your specific use case. Row-based sharding offers better isolation, while table-based sharding allows for greater scalability. Both approaches have their pros and cons, and choosing the right strategy requires a deep understanding of your system's requirements and future growth potential.

By following the detailed steps and examples provided, you can successfully implement sharding in your PostgreSQL database, optimizing both scalability and performance for your payment gateway system.

Comments

Popular posts from this blog

Understanding Number Systems: Decimal, Binary, and Hexadecimal

In everyday life, we use numbers all the time, whether for counting, telling time, or handling money. The number system we’re most familiar with is the   decimal system , but computers use other systems, such as   binary   and   hexadecimal . Let’s break down these number systems to understand how they work. What is a Number System? A number system is a way of representing numbers using a set of symbols and rules. The most common number systems are: Decimal (Base 10) Binary (Base 2) Hexadecimal (Base 16) Each system has a different “base” that tells us how many unique digits (symbols) are used to represent numbers. Decimal Number System (Base 10) This is the system we use daily. It has  10 digits , ranging from  0 to 9 . Example: The number  529  in decimal means: 5 × 1⁰² + 2 × 1⁰¹ + 9 × 1⁰⁰ =  500 + 20 + 9 = 529 Each position represents a power of 10, starting from the rightmost digit. Why Base 10? Decimal is base 10 because it has 10 digits...

How to Monetize Your API as an Individual Developer While Hosting on Your Own Server?

In the API economy, cloud services like AWS, Google Cloud, and Azure offer many conveniences, such as scaling and infrastructure management. However, some developers prefer more control and autonomy, opting to host their APIs on personal servers. Whether for cost efficiency, data privacy, or customization, hosting your own API comes with both advantages and challenges. But, even without cloud platforms, there are effective ways to monetize your API. This guide will explore how individual developers can successfully monetize their APIs while hosting them on their own servers. Why Host Your API on Your Own Server? Hosting your own API gives you full control over the infrastructure and potentially lower long-term costs. Here’s why some developers choose this approach: Cost Control : Instead of paying ongoing cloud fees, you may opt for a one-time or lower-cost hosting solution that fits your budget and resource needs. Data Ownership : You have full control over data, which is critical if ...

The Weight of Responsibility: A Developer’s Journey to Balance Passion and Reality

For the past several years, Eddie has been on a steady climb in his career as a developer, but recently, he found himself at a crossroads — caught between the weight of his responsibilities and the desire to pursue his true passions. His journey began with a three-month internship as a web developer, which led to nearly four years in an application developer role. After that, he spent almost a year as a systems associate, managing tasks across systems analysis, quality assurance, and business analysis. Eventually, he returned to full-time software development for another two years before transitioning into more complex roles. For over a year, he worked as a multi-role software developer and database administrator before stepping into his current position as a senior software developer, database administrator, and cloud administrator — occasionally handling security tasks as well. Now, with over 8 years of professional experience, he also leads a small team of developers, which has been...

The Hidden Costs of Overdesign and Bad Practices in API Systems

In software development, simplicity and clarity are often sacrificed in favor of overly complex solutions. While it can be tempting to add more features and intricate designs to ensure robustness, overdesign and poor practices can have significant consequences. They frustrate developers, lead to inefficiencies, increase costs, and put unnecessary strain on system resources.  A recent example involving a team that has faced challenges with complexity highlights the pitfalls of such an approach. Overdesign: The Problem of Too Much Complexity Overdesign occurs when systems are built with more complexity than necessary. This might manifest in bloated APIs, convoluted data flows, or excessive checks and processes that don’t add substantial value. The goal is often to anticipate future problems, but this approach typically results in cumbersome systems that are difficult to maintain and scale. In one case, a company found itself paying a hefty price just to host two API services and a po...

Selenium for Beginners: What, Where, When, and Why to Use It in Automated Testing

In today’s software development landscape, automated testing has become essential for delivering robust applications efficiently. Among various automated testing tools,   Selenium   stands out as one of the most widely used and beginner-friendly options. As you embark on your journey into automated testing, it’s crucial to understand the   what, where, when, and why   of using Selenium. In this guide we will run through these essentials and help you decide if Selenium is the right tool for you. What is Selenium? Selenium  is an open-source framework used primarily for automating web browsers. It enables developers and testers to write scripts that interact with websites, simulating actions like clicking buttons, filling out forms, and navigating pages, which allows for comprehensive automated testing. Selenium supports multiple programming languages, including Python, Java, C#, and JavaScript, making it flexible for teams with different coding preferences. Key C...