Application High Availability and Resiliency: Steps to Improve Transaction Retry

Tom Kincaid

July 30, 2021

At EDB, we often work with Postgres users to help them take their software and services to greater levels of availability—and ultimately to be “Always On.” This entails leveraging their on premise and cloud infrastructure, using Postgres, EDB Postgres Distributed, and other technologies from EDB.

It also involves helping customers ensure their applications are written to be highly available. Making a service highly available requires proper hardware and software redundancy and a highly available database; however, it may also include proper application design patterns for handling certain failure scenarios.

I hope this post is the first in a series of blogs on how to ensure your Postgres applications are written or modified to achieve a maximum level of availability for either Postgres, and if you want to take it to the next level of availability: EDB Postgres Distributed.

Reconnecting and retrying

The first topic I want to blog about is applications retrying transactions in the face of errors or lost connections. More specifically, I’ll cover the steps you need to take and how to prevent proliferation of cutting and pasting these required steps throughout your entire application.

This post is centered around applications that use a Postgres database. However, while there is some Postgres specific material, the topics discussed and the approaches proposed are applicable to many other databases.

Most notable of the transaction retry scenarios is a lost database connection. Postgres, like most databases, requires the application to reconnect before retrying these transactions. Connections can be lost for a variety of reasons, including network outages, database restarts, database server crashes, database switchovers and database failovers. There is no concept of session migration or connection migration in Postgres (or almost any other database). Applications must re-establish a connection. Depending on what else took place while attempting the transaction, there may need to be other actions taken.

Options when a connection is lost before transaction commit

If you lose a connection, during a transaction, or your transaction fails, as a result of an intermittent or temporary condition, you need to do one of three things:

Retry the transaction, possibly after reconnecting first.
Let the client know that the transaction failed and let the user decide what to do (i.e. in the case of a web application, pop up a web page saying transaction failed, would you like to retry).
Forget about the transaction and not bother anybody due to (perceived) availability being more important.

Retrying the transaction

What I want to discuss is case 1, not only for lost connections but also for other error conditions that may require a simple retry. In fact, I want to make it more simple than that for now and focus on the case where you are doing a create, update or delete operation as opposed to a read operation.

After looking around the web and having a lot of internal discussions with a lot of different people, I couldn’t find a complete description of best practices for retry and reconnect with respect to an inflight transaction. I have come to the conclusion that there really is not a one size fits all pattern.

How you handle it depends on a number of factors including but not limited to:

The nature of the application (i.e. is it OK if a record gets committed twice)?
Is your database configured to failover? In other words, when you reconnect are you going to be reconnecting the same database node or could you be failing over to a newly promoted node or another master?
What sort of facilities does your programming language and framework provide?
Application Logic
Transaction Granularity

What this comes down to is that your applications will need to perform transactions within retry loops, in each loop iteration test the error condition or commit success of the transaction, and take appropriate action.

It is important to note that there can be occasions, depending on the error, in which retrying is not appropriate. For instance, as a result of an underlying DDL change, the table you are trying to update no longer exists.

This code can get tricky, and if you are not careful, become difficult to maintain. What needs to be avoided as much as possible is having cut and paste retry loops all over your application. You want a minimal number of retry loops that can be reused for many different transactions. As your testing gets more rigorous and as your database deployment evolves with more features to help you in this area you will want to evolve this code as well. Therefore, you want it to be in a limited number of places or possibly only in a single place.

You may say “many folks use frameworks that handle all this for them.” That is in fact true, however I am not convinced that the majority of frameworks do all that is required for the potential issues or can take advantage of all the things a given database can offer. At a minimum, it is important to understand the concepts of reconnect and retry and how your chosen framework behaves and if it meets the resiliency requirements of your application.

As an example of things frameworks may not take advantage of, Postgres has two functions that will allow you to get the current transaction id and subsequently check if in fact that transaction was committed in the event you have an ambiguous commit status (see

https://www.postgresql.org/docs/13/functions-info.html, functions pg_xact_status ( xid8 and pg_current_xact_id()). Noteworthy, is that these functions are not reliable in the case of failover to a physical replica. However, BDR CAMO does offer the ability to check ths.

The retry loop

So what does your retry loop need to do?

As already stated, your requirements will vary based on a number of things. Based on my findings, here is the general workflow:

Acquire a database connection
Begin a transaction
Attempt some changes
Commit the transaction
If an error occur in steps 1 - 4
1. Evaluate if it is a recoverable error and potentially rollback an in flight transaction.
2. If deemed to be an unrecoverable error (i.e. SQL Syntax error) return error to caller /user.
3. If you have exceeded your retry maximum attempts return error to caller / user.
4. If there is a recoverable error, go to sleep for some period of time to allow the situation to recover (i.e. resources to become available or a failover to occur),.
5. If there is a connection loss, you may need to determine if the transaction managed to commit without the application receiving the commit message. If and how you do this is deployment and application dependent and I will have more to say in the future on this.
6. Go back to step 1 or 2 depending if the error resulted in a connection loss.

For the most part steps 1, 2, 4 and 5 will be the same code for many different database operations. Step 3 can look very different depending on the database operation you are performing. Some applications may have hundreds of different operations for step 3. However, we need to avoid all the code from steps 1, 2, 4, and 5 being replicated for each database operation.

The programming languages I have worked with the most are C++, C and Java. For these languages, I believe the best approach is a callback strategy where you have a function that has code that performs steps 1, 2, 4 and 5. Call it the ConnectTransactRetry. The ConnectTransactRetry function takes as a parameter a function or an interface that performs step 3, the specific database operation. We will call it the TransactionalFunction. The TransactionalFunction can do a lot of different things for you.

I struggled to provide pseudo code, or my actual code. Neither fit together nicely in the form of a blog and the second was not far enough along for me to consider making publicly available.

BDR CAMO

However, this is the approach the BDR CAMO reference client, developed by Markus Wanner, uses and from my journey looking for things was the most complete and robust implementation I could find for appropriately handling error conditions and reconnecting while executing transactions. As an aside, BDR CAMO provides an additional level of robustness and consistency in the case of transactions failing as a result of node outages. If you are interested in more details please see this webinar: BDR CAMO Commit without Fear.

Using a function callback strategy does require that each database change function have the same signature. This means you need to pull the appropriate parameters out of an array of objects or something similar. I hope to deal with ways of handling that in a future blog.

Other approaches I considered to ensure a single source base for reconnect and retry were C macros, C++ and Java inheritance patterns, and Java Annotations. I have also started working with my colleague Leo Jin exploring Resiliance4J, which can be used in a SpringBoot application to handle reconnect and retry. I have heard good things about Resilance4J and hope it can provide an alternative to what I have outlined that works better for existing applications wanting to introduce more robust reconnect and retry logic in their applications.

Finally, and perhaps most importantly, I am interested in your feedback. In the development of your applications, in whatever programming language or framework you use, how do you handle these scenarios? Please leave me a comment and I will try to consider it for future materials.

Update

Update: since writing this post, Leo Jin showed me how he is accomplishing this and using Resilance4J @Retry annotation and Spring boot with @Transactional annotation. He and I hope to be able to share this with you in the near future. We are currently testing it.

Defining Extreme High Availability in a Postgres World

What is “extreme” high availability when we’re talking about Postgres? Is it avoiding outages? Handling failovers? EDB Postgres Distributed defines a new era in high availability that defies description, so...

May 15, 2023

Achieving Extreme High Availability with EDB Postgres Distributed: Rescue Your Downtime from (Un)Planned Processes

High availability is crucial for businesses to ensure that their database is up and running at all times, as downtime can result in a loss of business. In this blog...

May 11, 2023

Building EDB Postgres Distributed Superset Dashboards: 4 Easy Steps

EDB Postgres Distributed (formerly BDR) is an asynchronous multi-master replication system for PostgreSQL that helps you maintain your application’s availability during planned and unplanned outages. PGD turns high availability into...

April 04, 2023

Next Generation Routing in EDB Postgres Distributed 5.0

The basic building block of a PGD deployment is a node group or, simply, a group. In a geo-distributed PGD cluster, a user may associate a group with a geo-location...

February 28, 2023

Release Radar: EDB Postgres Distributed 5 Enhances Performance, User Experience and Disaster Recovery to Improve Extreme High Availability

Increased throughput and reduced latency, faster problem identification and resolution, flexible architectures and more control over replication are just a few of the enhancements with EDB Postgres Distributed 5. In...