In distributed systems, generating unique and reliable transaction reference numbers can be a challenging task, especially in payment gateway setups where race conditions may occur. During my experience with such a system, I encountered issues with race conditions caused by database triggers. In this guide we will dive into the pitfalls of generating transaction reference numbers, the approaches you can take to avoid these issues, and the lessons I learned along the way.
Understanding the Problem
Transaction reference numbers are essential in any payment gateway. These unique identifiers ensure every transaction is traceable and auditable, helping with both internal processing and external reporting. Initially, we implemented a trigger-based approach that generated structured reference numbers using a combination of prefixes, dates, and sequential numbers.
While this approach worked well in a low transaction volume environment, the story changed when we scaled out to a higher transaction volume. As transaction volumes grew and multiple instances accessed the database concurrently, race conditions began to appear. The triggers could not keep up with concurrent requests, leading to concurrent tuple updates and transaction failures — a serious issue in any payment system.
Exploring Alternatives for Unique Transaction IDs
When addressing race conditions in a distributed system, two main approaches are worth considering:
- Using UUIDs for Transaction Reference Numbers
- Using Database-Generated Sequences with Structured Formatting
Let’s dive into the pros and cons of each approach.
Approach 1: UUIDs for Transaction Reference Numbers
UUIDs (Universally Unique Identifiers) offer a powerful solution for creating globally unique transaction identifiers without relying on a database sequence or a trigger.
Pros of UUIDs:
- Guaranteed Uniqueness: UUIDs provide a global uniqueness guarantee, eliminating the risk of collisions across distributed systems.
- Concurrency-Friendly: UUIDs are generated independently of the database, making them ideal for high-concurrency environments.
- Efficient Generation: Creating UUIDs is fast and doesn’t require a database round trip, making it a non-blocking process.
Cons of UUIDs:
- Readability: UUIDs are lengthy and lack a logical order, making them harder to read and trace for support teams and end-users.
- Index and Storage Costs: Due to their size, UUIDs consume more storage and may impact index performance slightly compared to shorter, structured IDs.
Approach 2: Database-Generated Sequence with Structured Formatting
In this approach, reference numbers are generated using a database sequence combined with additional formatting, such as <prefix><year><month><padded-sequence>
, like TX2410000000001
.
Pros of Structured Sequences:
- Readability and Context: Structured IDs are human-readable, providing useful context such as the transaction type and date.
- Easier Debugging: These IDs are helpful for troubleshooting, allowing support teams to identify transactions quickly.
- Sequential Order: Numeric sequences are naturally ordered, aiding in index performance and making transaction tracing straightforward.
Cons of Structured Sequences:
- Database Dependency: Sequence-based IDs rely on the database for uniqueness, which can create bottlenecks, especially in high-volume, distributed environments.
- Potential Race Conditions: Triggers and sequences are prone to race conditions when accessed by multiple instances, leading to concurrent tuple update if not handled properly.
- Setup Complexity: Setting up and maintaining triggers and sequences requires careful planning, especially as your system scales.
Adjusting the Triggers: The Short-term Fix
Given the complexity of overhauling our reference number generation, we implemented a targeted adjustment in the database triggers. Initially, our reference numbers were generated using a database sequence in the format <prefix><year><month><padded-sequence>
, resulting in identifiers like TX2410000000001
. However, to avoid the race conditions caused by sequences in a distributed system, we shifted to an epoch-based approach combined with md5 hash.
Now, the reference number format is <prefix><year><month><md5hash>
, resulting in identifiers like TX2410060C4F2FA
. This new reference no was constructed from two main components:
Epoch Component:
epoch text := extract(epoch FROM clock_timestamp())::bigint::text;
The epoch time in seconds (e.g., 1730700917
) is unique per second, but not per millisecond or microsecond. This means that multiple calls within the same second will generate the same epoch
.
md5 Hash:
upper(left(md5(random()::text || epoch || pg_backend_pid()), 9))
md5
produces a 32-character hexadecimal hash (128 bits), which is known for its low collision rate in general-purpose hashing.
Using upper(left(md5(...), 9))
limits this down to 9 characters, reducing the number of bits in the hash and slightly increasing the chance of collisions.
The backend process ID is unique per session and adds some additional uniqueness to the hash by varying with different database sessions.
Analyzing Collision Risk with Additional Constraints
To better understand the robustness of this approach, let’s examine the collision probability of our new reference number structure.
Total Variability
- Prefix (TX) and date segment (
YYMM
) add uniqueness by limiting ID generation to month-by-month batches. IDs generated in different months will have entirely different prefixes. - 9-character MD5 hash segment provides 68.7 billion unique possibilities per month.
Monthly Collision Rate
Each month, the possible number of unique reference nos (from the MD5 segment) remains the same, approximately 68.7 billion unique values from the 9-character MD5 hash.
- For lower volumes (e.g., generating a few thousand IDs per month), the collision probability remains extremely low.
- For high-volume use cases (generating millions of IDs per month), the collision probability increases but stays manageable due to the 36-bit hash space, especially considering the time constraint (new month resets the reference no structure).
Summary of Collision Probability with Monthly Constraints
- Per Month: We have around 68.7 billion possible IDs. Generating even millions of IDs monthly still results in a low probability of collision.
- Across Months: Each month starts fresh with a new
YYMM
combination, so there is no cross-month collision risk. The prefix further segments by year.
This revised approach achieves a high degree of uniqueness while keeping the implementation simple and efficient. By leveraging the epoch with md5 hash, we ensure that reference numbers are unique on a per-month basis without the overhead of sequence management, making it well-suited to distributed environments.
Transitioning Away from Triggers: The Long-term Migration Process
Having identified the limitations of the trigger-based approach, we decided that for long-term we have to transition to UUIDs for transaction IDs in production. Here’s the step-by-step migration strategy we used to avoid disruptions:
- Introduce a New Transaction ID Column:
We added a new column,TransactionUUID
, to store the UUIDs without disrupting the existing reference number column. This allowed us to generate the new IDs independently while keeping the old ones for historical data. - Update Application Code:
Next, we modified our application logic to generate UUIDs at the application layer instead of relying on the database. This eliminated the need for a database trigger and ensured each instance could generate unique IDs independently. - Run Tests in Staging:
Before deploying to production, we extensively tested the new UUID-based approach in a staging environment. This involved running performance benchmarks to ensure that UUID generation met our latency requirements under load. - Phase Out the Old Trigger Gradually:
After verifying that the new UUIDs worked correctly, we disabled the trigger in production. By monitoring for any anomalies, we were able to confirm that the new system functioned as expected without causing interruptions. - Backfill Existing Transactions (Optional):
To maintain uniformity, it best to consider backfilling UUIDs for existing transactions. However, if historical consistency isn’t a priority, you may retain the old reference numbers and use UUIDs only for new transactions.
Lessons Learned and Best Practices
This experience taught us several valuable lessons about handling transaction IDs in distributed systems. Here are some best practices:
- Avoid Database Triggers in Distributed Systems: Relying on triggers for ID generation can lead to race conditions as your application scales. Using UUIDs or another distributed approach is generally safer and more scalable.
- Use UUIDs for Scalability: UUIDs are particularly well-suited to distributed systems, where the guarantee of uniqueness across instances is critical. While they aren’t as readable as structured sequences, they offer greater reliability in high-concurrency environments.
- Consider a Hybrid Approach: If readability is essential, consider generating structured, user-friendly IDs for display purposes while using UUIDs internally for database records.
- Test Extensively in Staging: Before making changes to ID generation in production, test your approach in a staging environment with simulated load to catch potential issues early.
Conclusion
Choosing the right approach for generating transaction reference numbers is critical in payment systems. By transitioning from a trigger-based approach to UUIDs, we were able to eliminate race conditions and ensure reliable ID generation for our transactions. Whether you choose UUIDs or a database sequence, it’s essential to plan for scalability, minimize dependencies, and test thoroughly before making changes in production.
This journey taught us valuable lessons that can help others facing similar challenges. By sharing our experience, I hope to save others from the pitfalls of race conditions and help them design resilient and scalable payment systems.
Comments
Post a Comment