Engineering

Let's Try Again: Making Retries Work With Cloud Services

Not surprisingly, Amazon's AWS enforces rate limits on their services. Their client libraries also incorporate automatic retries, which may allow an application to gracefully recover after exceeding those rate limits. But under heavy data volumes and with the AWS default retry strategy, a process can still trigger rate limits and fail despite retrying.

In this article, we will:

  • Review the basics of retries as a failure handling strategy.
  • Explain the interaction between retries and rate limits described above and its solution, illustrated by Python source code.
  • Explore several other technical properties of effective retries.

Our journey begins a couple of months ago, when one of our data processing pipelines failed because AWS returned a 503 Slow Down error. After inspecting our logs and consulting Amazon's documentation, my team determined that a bulk copy between two S3 buckets exceeded Amazon's rate limits.

At this point, I was already implementing retries in my head, a challenge familiar from previous employers and projects. After all, retries — executing the same operation with the same arguments again, possibly after a short delay — are simple and sufficient for overcoming many (but not all) failures.

In fact, we constantly use retries in our everyday lives: think about asking somebody to repeat what they just said because face masks make it harder to understand people, or swiping a credit or subway card again (and again) when the first swipe didn't register.

After digging a little deeper into Amazon's documentation, it turned out that all their client libraries already implement retries and automatically employ them when appropriate — so much for me implementing retries again. Under the default settings of boto3, the Python client for AWS, our pipeline stage didn't just fail once but failed five times on the same copy operation. Maybe retries aren't so simple after all.

Recent changes to AWS' implementation of retries suggest that AWS engineers might just share that sentiment. In February 2020, boto3 demoted the previous retry mode to “legacy” status (though it still is the default) and gained two new modes called “standard” and “adaptive”. As it turns out, the latter solves our problem. But the technical reasons aren't intuitively obvious.

This blog post starts out by explaining the basics of when to use retries. We then explore their interaction with rate limits and show the Python code for configuring boto3 to gracefully handle that case. Along the way, I’ll explain several other technical aspects of rate limits and retries. By the end of the article, you’ll have a solid understanding of both rate limits and retries that you can apply to your own work.

1. Only Try Retries When...

Retries differ from other techniques for providing fault tolerance, such as data replication or distributed consensus, in that they are strikingly simple: just do it — again! As a result, we don't need to go through complex algorithms for retries themselves (yay!). But we do need to review the three primary criteria for when we can use retries.

1.1 Operations Are Idempotent

Fundamentally, retrying is only safe if an operation is idempotent, i.e., repeated execution results in the exact same outcomes or state. If you were to retry

<div class="code-wrap"><code>transfer_money_from_me_to(recipient, amount)</code></div>

and your bank's API timed out, retrying would be OK if your request was dropped by the load balancer in your bank's frontend, but not so much if the confirmation was lost in the tangle of your bank's microservices. The problem is that you couldn't tell a priori, and retrying this particular operation might just drain your account of all its funds. To put this differently, retrying is safe if and only if doing so eventually converges on the same global system state.

Conveniently, the protocol enabling the web, HTTP, was explicitly designed with retries in mind. They are part of the representational state transfer or REST model underlying the web. In particular, DELETE, GET, HEAD, and PUT are idempotent. GET and HEAD also are read-only. Only POST is not idempotent and thus not safe to retry. So when designing a REST API, you probably want to minimize the use of POST. Sure enough, AWS does just that. They expose their cloud services through a REST API and, out of 96 endpoints in the service description for S3, only six use POST.

The central role of retries for recovering from failures on the web is seen in browsers and services alike. Notably, it explains why browsers have a reload button in the first place. It also explains why most websites warn us not to use the reload button during payment processing, which typically is implemented as a POST. Finally, it explains why eventual consistency is the dominant consistency model across the web. After all, that's just what we get when combining idempotent operations with retries and also the reason I already used similar language in the above definition of safe retries.

That's all well and good. But what should we do about operations like the above endpoint to transfer money? To find a solution, it helps to ask how we distinguish seemingly indistinguishable events, such as birthdays or rent payments, as well as fungible packaged goods, such as cereal boxes and milk cartons. The answer is straightforward: We add another attribute based on date/time or a monotonically increasing counter. That certainly works for the above endpoint as well:

<div class="code-wrap"><code>transfer_money_from_me_to(recipient, amount, transaction_id)</code></div>

The bank doesn't know how a client generates such a transaction ID. That is entirely up to each individual client. But the bank does commit to performing only one transfer, even if several requests with the same transaction ID are submitted to the endpoint. It may also want to reject any request that shares a transaction ID with a previous request but differs in recipient or amount. This pattern can be found in many REST APIs. The corresponding operations may be named create_or_get_something.
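
To make that pattern concrete, here is a minimal sketch of how a server might deduplicate requests by transaction ID. The in-memory dictionary and names are hypothetical stand-ins; a real implementation would use durable storage and handle concurrent requests.

<div class="code-wrap"><code># Minimal sketch of server-side deduplication by transaction ID.
# The in-memory dict stands in for a durable store; all names are hypothetical.
_completed_transfers = {}  # maps transaction ID -> recorded transfer

def transfer_money_from_me_to(recipient, amount, transaction_id):
    previous = _completed_transfers.get(transaction_id)
    if previous is not None:
        # Reusing a transaction ID with different arguments is a client error.
        if (previous["recipient"], previous["amount"]) != (recipient, amount):
            raise ValueError("transaction ID reused with different arguments")
        # Otherwise this is a retry: return the original result, don't transfer again.
        return previous
    transfer = {"recipient": recipient, "amount": amount, "status": "transferred"}
    _completed_transfers[transaction_id] = transfer
    return transfer</code></div>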

1.2 Failures Are Transient

Even if it is safe to retry an operation, that doesn't necessarily mean we should actually retry the operation. We also have to consider the nature of the failure. For an example, let's consider our favorite banking endpoint again. If the bank returns an error indicating insufficient funds, retrying will only have the same result. The bank won't transfer non-existent funds no matter how often (or nicely) we ask. We first need to transfer funds into the account or open a line of credit.

However, if the request times out, we don't know the outcome of the request and should definitely try again. In short, retrying only makes sense if the failure is non-deterministic or transient. While it is safe to retry on any error, doing so for deterministic errors is pointless and only wastes resources.

Effective retries thus depend on the error reporting mechanism as well as the error classification policy. As far as mechanism is concerned, both server and client need to capture the concrete causes of failures and faithfully forward fine-grained error information. Whether an implementation uses error codes or exception types doesn't really matter, but it is critical that higher layers do not mask the error information captured by lower layers.

As far as policy is concerned, engineers need to inventory the specific errors occurring in a system and then classify them as retryable or not. A mechanized version of that classifier forms the core of the retry logic.
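
As a rough sketch of what such a classifier might look like, consider the function below. The status codes listed are illustrative defaults, not boto3's actual policy.

<div class="code-wrap"><code>from typing import Optional

# Illustrative policy only, not boto3's actual classification.
RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}

def is_retryable(status_code: Optional[int], timed_out: bool) -> bool:
    # A timeout leaves the outcome unknown, so retrying is worthwhile.
    if timed_out:
        return True
    # Rate limit violations and server-side errors are usually transient.
    return status_code in RETRYABLE_STATUS_CODES</code></div>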

Judging by boto3's code repositories and commit histories on GitHub, AWS engineers significantly deepened their understanding of both retry policy and mechanism over the years. The original version in boto3's runtime botocore implements the so-called legacy mode. It dates back to April 2013 and comprises one module with 360 lines of well-documented Python code.

In contrast, the new version from February 2020 implements both the standard and adaptive modes. It comprises seven modules with 920 lines of well-documented Python code. The primary difference between legacy and standard modes is that the latter classifies many more errors as retryable and thus covers many more endpoints. At the same time, the standard mode also enforces a quota on active retries, thus protecting the client from being overwhelmed by retries in case of, say, a network or AWS outage (which are rare but do happen).

Systems Archeology I: I suspect that AWS engineers started out with retry logic even simpler than what is offered by the legacy mode. When a client exceeds rate limits, S3 returns a 503 HTTP status code with a non-standard reason, 503 Slow Down. That's a curious choice of status code because the 5xx status codes indicate server errors, whereas exceeding a rate limit is a client error, which would be indicated by a 4xx status code in HTTP. Furthermore, RFC 6585 defines a directly suitable alternative, 429 Too Many Requests. The standard definition of the 503 status code, 503 Service Unavailable, suggests the reason for this non-standard choice by Amazon engineers: that status code is commonly used when a server is overloaded, an obviously retryable error condition. Hence I wouldn't be surprised if earlier versions of AWS client libraries simply retried on 503 and nothing else, and reusing that status code was an expedient way to increase the reach of the retry logic.

1.3 Success Is Likely

In the introduction to this section, I breezed through one last major concern when I wrote “just do it — again,” leaving the timing entirely open. While the previous two subsections didn't mention timing, we nonetheless made one important observation: retrying when the likely outcome is another failure wastes resources. If we extrapolate from that, retrying in short succession over and over again is equally pointless.

Real world experience with children during long car journeys underlines that point. Pestering the parents by retrying incessant “Are we there yet?” chants does not bring the destination any closer. It may even unnerve the parents to the point of a forced hour-long break, thus delaying arrival. At least, that's how my parents describe past car trips for summer vacation.

The uncomfortable truth is that, in distributed systems, sending too many retries is fundamentally indistinguishable from a denial of service attack. Experienced service providers on the internet do not take kindly to such attacks. Instead, they implement drastic countermeasures, up to and including dropping all requests originating from a suspect IP address in the firewall, before the request even reaches an application server. To avoid that, we need to pace ourselves when retrying. Best practice is to delay each retry, to use exponential backoff for repeated retries, and to add some degree of variability by introducing randomized jitter.

In my experience, jitter is typically computed as a smallish delta on top of the deterministic exponential delay. Reading the source code for botocore, I discovered that it instead randomly scales the deterministic exponential delay between zero and the full value. I was surprised by that choice. As it turns out, AWS engineers carefully considered the interaction between exponential backoff and jitter, simulating several approaches, and found that “full jitter” was the most effective choice. Amazon's Marc Brooker wrote a great blog post about just that.
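
In code, the difference between the two approaches comes down to a single line. Here is a generic sketch of exponential backoff with full jitter, not botocore's actual implementation; the error type and parameter values are assumptions for illustration.

<div class="code-wrap"><code>import random
import time

class TransientError(Exception):
    """Placeholder for whatever retryable error a real client would raise."""

def backoff_delay(attempt, base=1.0, cap=20.0):
    # Full jitter: pick uniformly between 0 and the capped exponential delay.
    return random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_retries(operation, max_attempts=4):
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, propagate the failure
            time.sleep(backoff_delay(attempt))</code></div>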

Are we there yet? — No!

2. Limiting the Rate of Tries and Retries

So far, we only considered retrying individual operations in isolation. In such cases, the above three criteria are sufficient for determining when and how to use retries. But when operations occur repeatedly, at more or less regular intervals, and loosely depend on each other, the three criteria aren't sufficient anymore.

Probably the loosest such dependencies are rate limits, which do not impose any ordering constraints but do limit the overall number of operations per time period. They are critical for ensuring that a few overly aggressive or even malicious clients cannot overwhelm resources shared amongst many more clients. For that reason, they are pervasive across cloud services.

While it is possible to trigger rate limits reliably and predictably by sending a sufficiently high volume of requests per time period for long enough, triggering rate limits with a particular request is exceedingly hard, if not impossible. That's good news: we can treat rate limit violations as non-deterministic errors that are subject to retries (of course, only if the operation also is idempotent). But retrying the failed operation does not suffice for making sustained progress. Instead, the rate limit violation serves as a signal that subsequent operations are not welcome either, at least for some time period.

Are we there yet? — No!

2.1 The Token Bucket Algorithm

That time period is typically determined by the algorithm used for enforcing rate limits, with the default choice being the token bucket algorithm. It is simple enough to describe in a paragraph:

The name stems from the main data structure, a bucket holding fungible tokens, i.e., a counter. The server maintains a bucket per client and periodically adds a fixed amount of tokens into each bucket, up to some upper limit corresponding to the rate limit. When the server receives a request, it attempts to remove one or more tokens from the client's bucket, with the amount corresponding to the effort necessary for servicing the request. If there are sufficient tokens in the bucket, the server removes them and completes the request. If there aren't, it rejects the request with a rate limit violation.
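
A bare-bones version of that algorithm might look like the class below. This is a sketch for illustration: a real server would keep one bucket per client, persist the state, and guard it against concurrent access.

<div class="code-wrap"><code>import time

class TokenBucket:
    """Minimal token bucket: refills continuously up to a fixed capacity."""

    def __init__(self, rate_per_second, capacity):
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now

    def try_acquire(self, tokens=1.0):
        """Remove tokens if available; otherwise report a rate limit violation."""
        self._refill()
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False</code></div>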

Are we there yet? — Now we are!

2.2 How Our Pipeline Failed

Now we have enough context to explain the exact circumstances leading to the failure of our data processing pipeline. As mentioned in the introduction, the pipeline failed during a bulk copy between two S3 volumes. The data being copied is a very large dataset in Parquet's columnar format, which puts (parts of) columns into their own files organized by directories to represent tables.

Many of these directories contain many thousands of files. The particular directory that triggered the rate limit failure contained 4,003 files. At the same time, AWS limits clients to 3,500 DELETE, POST, or PUT requests and 5,000 GET or HEAD requests per second per prefix in a bucket. (“Object,” “prefix,” and “bucket” are official S3 lingo. I'm using “file,” “directory,” and “volume” interchangeably.) There are no limits on the number of prefixes in a bucket.

Since the bulk copy is implemented by copying individual objects and initiates those operations in a tight loop, it can easily exceed the above rate limits. Furthermore, since boto3's retry logic uses full jitter, i.e., randomizes the delay between 0 and the exponentially increasing duration, the retry may occur pretty much immediately after the failure, i.e., with a delay much smaller than the period AWS uses to track S3 rate limits.

But that means that the retry will exceed the same already exceeded rate limit and fail again. While the probability of the retry delay being much smaller than the rate limit period is small and gets smaller as the number of retries increases, the number of copies is sufficiently large that this event may happen five times in a row. At that point, boto3's legacy retry mode stops retrying and fails the copy operation. That in turn fails our pipeline.
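
To get a feel for why this is plausible at scale, here is a deliberately simplified back-of-the-envelope simulation. The one-second base delay and one-second rate-limit window are assumptions for illustration, as is the model that a retry fails exactly when its delay falls inside that window; none of these are boto3's or S3's actual parameters.

<div class="code-wrap"><code>import random

def copy_fails(retries=4, base=1.0, window=1.0):
    # Assume the first attempt already failed; model each retry as failing
    # whenever its full-jitter delay lands inside the rate-limit window.
    for attempt in range(retries):
        delay = random.uniform(0, base * 2 ** attempt)
        if delay >= window:
            return False  # waited long enough, this retry succeeds
    return True  # all retries landed inside the window: the copy fails

trials = 100_000
failures = sum(copy_fails() for _ in range(trials))
print(f"per-copy failure probability: {failures / trials:.4f}")
# Even a probability of a percent or two adds up across thousands of copies.</code></div>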

Systems Archeology II: In the first systems archeology above, I took note of an unusual choice of HTTP status code to speculate about earlier retry logic and an expedient engineering decision by AWS engineers. This time, I am basing my speculation on a note in the official documentation for S3. The section on Optimizing Amazon S3 performance states: “You no longer have to randomize prefix naming for performance.” If users had to randomize prefixes before, that implies that AWS was partitioning the data stored in buckets by some prefix of the prefix / directory name.

Nowadays, however, AWS can partition data at the granularity of a single prefix / directory name. Since it supports more flexible naming schemes, that certainly is a boon for users. At the same time, partitioning at such fine granularity must result in a massive amount of state to maintain within S3's implementation. I suspect it is for just that reason that the same documentation section later states: “Amazon S3 automatically scales in response to sustained new request rates, dynamically optimizing performance.” In other words, S3 will partition a bucket by individual prefix, but only if necessary for a given access pattern. That is an impressive engineering achievement!

2.3 Fixing Our Pipeline

Conveniently, the carefully staged explanation of how exactly our pipeline failed also describes everything we need for avoiding just that kind of failure. Even more conveniently, AWS engineers already implemented the solution and then some — through the adaptive retry mode. The solution is based on the realization that, for bulk operations that exceed rate limits, it isn't sufficient to retry the failing request. Instead we need to slow down the entire bulk operation and throttle any future requests, i.e., issue them at a slower rate.

We already know how to do that: use the token bucket algorithm! The only difference when using the algorithm on the client for throttling is that we don't fail requests when there aren't enough tokens in the bucket, but rather we wait until there are enough tokens. That's just what boto3's adaptive retry mode does. Well, with one significant additional feature: instead of requiring that user code configure the maximum request rate, the adaptive retry mode automatically adjusts the maximum capacity of the bucket based on successful as well as unsuccessful outcomes.
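
In terms of the earlier token bucket sketch, the client-side variant only changes what happens when the bucket runs dry: it sleeps instead of rejecting. Again, this is an illustration, not boto3's actual implementation.

<div class="code-wrap"><code>import time

class ThrottlingTokenBucket(TokenBucket):
    """Client-side variant: wait for tokens instead of rejecting the request."""

    def acquire(self, tokens=1.0):
        while not self.try_acquire(tokens):
            # Sleep roughly until the deficit should have been refilled.
            time.sleep(max((tokens - self.tokens) / self.rate, 0.01))</code></div>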

I'm sure you are as excited as I was when I discovered the retry mode and are itching to enable it for your own Python-based data processing pipelines. Thankfully, it doesn't take much: All you need to do is pass an appropriate configuration object to boto3's client() or resource() function via the config named argument. The only slightly tricky part is that you need to use botocore's configuration object:

<div class="code-wrap"><code>import io

import boto3
import botocore

# Access S3 with adaptive retries enabled:
ADAPTIVE_RETRIES = botocore.config.Config(retries={
    "total_max_attempts": 4,  # 1 try and up to 3 retries
    "mode": "adaptive",
})
s3 = boto3.resource("s3", config=ADAPTIVE_RETRIES)

# Test adaptive retries: Can we break the camel's back?
bucket = s3.create_bucket(Bucket="camel")
for index in range(0, 10000):
    straw = f"straw #{index}"
    # upload_fileobj expects a file-like object, hence the BytesIO wrapper.
    bucket.upload_fileobj(io.BytesIO(straw.encode("utf8")), straw)</code></div>

Code example for configuring boto3 to use adaptive retry mode

2.4 Beyond AWS

I apologize if the code example seems underwhelming. AWS engineers already did all of the hard work here; we just need to enable it. However, if you are looking for a coding challenge, consider what it would take to provide similar functionality for another cloud service. A likely source of frustration is the need to implement similar algorithms for retries and rate limits on the server as well as in the client libraries for every language.

In fact, that's already happening for AWS. Several of the client libraries for other programming languages as well as the Android and iOS SDKs support automatic retries for authentication errors caused by clock skew. However, boto3 does not. The corresponding issue has been open for four years now.

That raises the question of whether we can do better. My answer is an emphatic “possibly maybe.” I base that decisive assessment on the observation that exponential backoff with jitter and the token bucket algorithm are both techniques for making a best guess at an appropriate retry delay based solely on whether requests succeeded or failed. Yet the server generating these responses knows quite a bit more. Since it enforces rate limits, it knows their basic configuration and it knows their client-specific state.

Hence the above question becomes: If we expose some of that server knowledge to clients, can we dispense with token buckets? My answer remains an emphatic “possibly maybe.” HTTP already includes the Retry-After header to provide a hint for the retry delay and an IETF draft standardizes several more headers that expose even more information.

But the Retry-After header applies only to a single operation and thus becomes meaningless under sustained load, much like exponential backoff is insufficient under sustained load. Ben Ogle wrote a blog post exploring just that question and ended up showing that the token bucket algorithm works really well. But that outcome also was predictable because he applied the Retry-After header only to the next retry for that request. In short, the question of whether we can design a more general header remains open.
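
For completeness, honoring Retry-After on the client might look like the sketch below, which falls back to full jitter when the header is absent. It uses the requests library, handles only the delta-seconds form of the header, and treats 429 and 503 as the retryable status codes; all of those choices are assumptions for illustration.

<div class="code-wrap"><code>import random
import time

import requests

def get_with_retries(url, max_attempts=4):
    for attempt in range(max_attempts):
        response = requests.get(url)
        if response.status_code not in (429, 503):
            return response
        retry_after = response.headers.get("Retry-After")
        if retry_after is not None and retry_after.isdigit():
            delay = int(retry_after)  # server-provided hint, in seconds
        else:
            delay = random.uniform(0, 2 ** attempt)  # fall back to full jitter
        time.sleep(delay)
    return response</code></div>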

In Conclusion

We explored retries as a simple and effective fault tolerance mechanism in significant detail. While retries apply to many cases, they aren't applicable in general. First, retries are only safe for idempotent operations. Second, they can only change the outcome upon non-deterministic failures, which include transient, rate-limit, and time-skew errors. Third, retries must be delayed, typically through exponential backoff with randomized jitter, to avoid denial-of-service attacks.

When operations occur at regular intervals, retrying individual operations isn't enough. Instead, we may have to throttle the request rate across all operations. This blog post used AWS and its client library for Python as running examples. Notably, it included source code for configuring throttled operation. But the techniques described in this blog post are general and supported by many other client libraries. Now that you understand all the necessary background, it is your turn to try retries — or to retry them again!

At the same time, the single most important take-away from this blog post is not the particulars of retries, rate limits, and AWS’ exemplary support for them but rather an understanding that handling failure in distributed systems is genuinely hard — you are about to finish a 3,900-word article about “doing it again” — and critically depends on the exact semantics of error conditions. To put that differently, getting the normal code path working is easy. That’s also when the real engineering work starts!