Postmortem for July 27 outage of the Manta service

We would like to share the details of what happened during the July 27 Manta service outage, including what we have learned and what actions we are taking to prevent this from happening again. We know that the Manta service is critical for many of our customers (and indeed, it plays a critical role in our own infrastructure), and we apologize for the outage and the inconvenience it caused.

Scope of the event

Around 10:30 UTC (3:30AM US Pacific), the Manta service started serving a high percentage of error responses. Clients experienced very high latency, and ultimately received 500-level errors in response to about 22% of all types of requests, including PUT, GET, and DELETE for both objects and directories. At peak, the error rate approached 27% of all requests, and for most of the window the error rate varied between 19 and 23%. This lasted until about 20:45 UTC (1:45PM US Pacific). By 20:45 UTC, the error rate returned back to zero.

Root cause analysis

For transparency, we've included a lot of detail here about how the incident proceeded. If you're interested in more of a summary, skip to the Summary of Root Cause section below.

Background

The Manta service employs a metadata tier consisting of sharded PostgreSQL databases. These databases map the user-visible namespace of directories and objects (e.g., "/$MANTA_USER/stor/your/object") to a list of backend servers that store the object.

The metadata tier is queried for all Manta requests relating to directories and objects, including creating, listing, and removing directories, as well as fetching and saving objects. Job data is also stored in the metadata tier, so job-related requests also access the metadata tier.

Each shard of the Metadata tier is a three-node PostgreSQL cluster using synchronous replication and using our Manatee component to manage automatic failover.

The problem

We quickly identified that API requests were hanging and ultimately failing because all queries to one of the PostgreSQL databases were themselves hanging. API servers timed out these PostgreSQL queries and returned 500-level errors to clients. Only one metadata shard was affected, which is why a large percentage of requests continued to succeed.

The affected database was responding to simple queries, but was not responding to any queries on the table used for Manta metadata. We saw that nearly all queries were "waiting", which typically means they're blocked on a lock. The only unblocked query was an "autovacuum" operation that had been kicked off at 10:30:09, right around the start of the outage.

In the past, we have experienced similar, much less severe situations where queries appeared to be blocked and autovacuum was the only unblocked operation. In those cases, we've successfully restored service by cancelling the autovacuum operation. In attempt to restore service as quickly as possible, we decided to restart PostgreSQL around 14:11. To help root-cause the problem after restoring service, we saved the data that PostgreSQL makes available around locking status for locks both held and wanted, as well as the list of queries running. (More on this data later.) Then we restarted PostgreSQL. This did restore service for about 5-10 minutes, but another autovacuum kicked off that shortly resulted in the same situation: all queries blocked on locks, with only the autovacuum operation actually running. At this point we observed that the autovacuum had been kicked off and that it was (according to PostgreSQL) "to prevent wraparound".

While this particular reason for autovacuuming was new to us, and while we did not understand why the autovacuum operation was causing all other queries to block, we also believed that it was out of our control (e.g., a PostgreSQL issue). We also believed that this operation was important to the continued functioning of PostgreSQL. As we were quickly learning, transaction wraparound is a serious concern, and it's important to allow PostgreSQL to perform the maintenance required to avoid the problem. More on both of these issues later.

So engineering spent the next several hours attempting to speed up the autovacuum process, principally in two ways:

Using process microstate accounting (in particular, prstat -Lm), we determined that the autovacuum process was spending a large percentage of its time sleeping. Using DTrace, we determined that the vast majority of time it was coming off-CPU was in order to read filesystem data. Also using DTrace to see which files were being read, we found that PostgreSQL was reading through a large number of files in order. While ZFS identifies sequential read behavior within a file and prefetches file data, it does not prefetch files that that have not been read yet. We manually prefetched this data, which improved autovacuum speed considerably. (This particular PostgreSQL operation is essentially a full scan of a table and all of its indexes -- which exceeded the memory of the system, inducing substantial read I/O.)
Once we started prefetching data, we found that PostgreSQL was still sleeping a lot, and was coming off-CPU as a result of a PostgreSQL function that was explicitly sleeping as part of the cost-based vacuum delay mechanism, which attempts to limit resources used by vacuum operations. In this case, with all database activity effectively blocked on this operation, this delay was only hurting us. Since these tunables are not dynamically adjustable, we manually patched the running process to avoid sleeping in this case. This improved performance considerably.

We observed the autovacuum scan the full table and all of its indexes, and then do the same thing again. After several hours, it successfully completed and updated the "last_autovacuum_time" metadata for the table. The blocked queries completed and service was restored. However, a few minutes later, another autovacuum kicked off for the same table, also "to prevent wraparound". We still do not understand why the first autovacuum did not work to clear this problem, but it was clear that we could not wait several more hours for another operation to complete that may still not clear the issue.

By this point, we'd learned that we could tune up the threshold at which PostgreSQL kicks off this operation, and that doing so was very safe as a short-term measure. By doing so, we planned to restore service and give ourselves time to better understand the issue and resolve it less disruptively. We made the change around 20:40, service was quickly restored, and the incident ended. We took action items to better understand the underlying issues so that we could deal with the immediate threat of wraparound threat in a less disruptive way, and also to make sure that we continue dealing with it more proactively instead of having these disruptive operations kick in.

Subsequent analysis

More background

To fully understand what happened, a little more background on PostgreSQL is required. PostgreSQL's implementation of Multi-Version Concurrency Control means that when a tuple (basically, a row) is changed, a new copy of it is created. The old copy is kept around in case concurrent transactions reference that version. These old copies eventually have to be removed by a routine maintenance operation called vacuuming.

The vacuum process is also used to address the separate problem of transaction id wraparound. In order to manage both consistency and concurrency, rows are assigned a transaction id relating to when they were created, and transactions can only see rows whose transaction id is before their own transaction id. As with any fixed-size integer, transaction ids can wrap around. To deal with this, the 32-bit transaction id space behaves as a circular space, where the previous 2 billion transactions are in the "past" and the next 2 billion transactions are in the "future". But in order for a table to contain rows that span more than 2 billion transactions, an additional step is necessary: very old rows are assigned a special value that indicates that they're in the past for all current and future transactions. This is called "freezing" a tuple. When deemed necessary, routine vacuuming will perform an extra full table scan just to make sure old tuples are frozen. Administrators can do this themselves using the "VACUUM FREEZE" operation. (This is all a gross simplification of the underlying concerns. The official documentation and Josh Berkus's blog series explain it in much more detail.)

PostgreSQL has an autovacuum mechanism that's intended to kick off vacuum operations as needed to deal with both of the above problems. It cleans up old tuples and, if necessary, scans the entire table to freeze very old tuples.

The normal autovacuum mechanism is skipped when locks are held in order to minimize service disruption. However, because transaction wraparound is such a severe problem, if the system gets too close to wraparound, an autovacuum is launched that does not back off under lock contention. The default configuration is aggressive: although a table can go about 2 billion transactions without freezing, a "wraparound autovacuum" is kicked off if a table has gone at least 200 million transactions in total and more than 150 million transactions since the last "freeze" vacuum.

Back to the incident

As mentioned above, we hit the wraparound autovacuum condition at 10:30 UTC on July 27. We ultimately resolved the outage by adjusting the PostgreSQL configuration to be less aggressive. The documentation is very clear that there is little impact to tuning this up as long as the database never gets close to the true underlying limit.

When the outage was resolved, we still had a number of questions: is a wraparound autovacuum always so disruptive? Given that it was blocking all table operations, why does it throttle itself? Was this the first time we'd experienced a wraparound autovacuum, or had we seen it before and those previous operations were less disruptive? How close were the other shards to hitting this condition? Now that we've tuned the point at which PostgreSQL engages this operation, will a manual "VACUUM FREEZE" operation address the problem in a less disruptive way, or will it be just as disruptive? Longer-term: what's the correct combination of tunables and scheduled maintenance that will avoid the problem, and how can we tell that those mechanisms are working?

It seemed most important to understand better whether it was expected that the autovacuum operation would be so disruptive. We analyzed the locking data that we'd saved during the incident (writing a program to summarize lock dependencies) and found that the locks held and wanted by the autovacuum operation did not conflict with the shared locks wanted by the data path queries. Instead, there was a single "DROP TRIGGER" query that was attempting to take an exclusive lock on the whole table. It appears that PostgreSQL blocks new attempts to take a shared lock while an exclusive lock is wanted. (This sounds bad, but it's necessary in order to avoid writer starvation.) However, the exclusive lock was itself blocked on a different shared lock held by the autovacuum operation. In short: the autovacuum itself wasn't blocking all the data path queries, but it was holding a shared lock that conflicted with the exclusive lock wanted by the "DROP TRIGGER" query, and the presence of that "DROP TRIGGER" query blocked others from taking shared locks. This explanation was corroborated by the fact that during the outage, the oldest active query in the database was the "DROP TRIGGER". Everything before that query had acquired the shared lock and completed, while queries after that one blocked behind it.

In Manta, the "DROP TRIGGER" query is normally made when a certain component reconnects to the database. It drops the trigger if it exists and then recreates it, as a simple way to idempotently make sure the trigger is present and up-to-date. We now know that this operation is quite disruptive (since it takes an exclusive table lock), even if most of the time it was not a big deal (because it's quick). It only became a problem because of the locking constraints introduced by the wraparound autovacuum. (Note that a regular autovacuum may have just aborted, seeing lock contention. It's likely that in order to see this problem you'd have to have this special wraparound autovacuum.) Fortunately, we expect to be able to resolve this issue by simply avoiding dropping and recreating the trigger. With that change in place, we expect to be able to perform a manual "VACUUM FREEZE" on all shards without disruption to deal with the wraparound issue.

This answers several of the questions above: wraparound autovacuum operations are not normally so disruptive, though they do have the potential to be much worse than other autovacuums. Throttling itself is not as unreasonable if it's not expected to be blocking all other queries. With the fix in place, we expect to be able to resolve this problem without disruption with a "VACUUM FREEZE", though without the fix, we can expect that operation would have been just as disruptive.

We don't know if this is the first time we saw this, but based on our traffic patterns, we would have expected to see wraparaound autovacuums a few times on each shard already. It may be that we've been lucky in the past with not having a concurrent exclusive-lock operation.

To deal with this in the future, we will be monitoring the PostgreSQL statistics that are used to calculate when the wraparound autovacuum should happen. We'll be making sure that normal autovacuum operations complete regularly, and that these statistics are updated when they do.

We still don't know why the completed autovacuum did not resolve the wraparound issue.

Summary of Root Cause

In summary:

All Manta operations touch our sharded, PostgreSQL-backed metadata tier.
During the event, one of the shard databases had all queries on our primary table blocked by a three-way interaction between the data path queries that wanted shared locks, a "transaction wraparound" autovacuum that held a shared lock and ran for several hours, and an errant query that wanted an exclusive table lock.
We temporarily addressed the issue by configuring PostgreSQL to avoid the transaction wraparound autovacuum operation. We believe this is safe for quite some time, but we will be prioritizing work to avoid the query that attempted to take the exclusive table lock, manually vacuum the table to deal with the wraparound threat, and monitor statistics that will tell us that autovacuum is proceeding regularly.

Reflections

First of all, we want to emphasize that we're sorry about this incident. We strive to maintain very high availability for all our services, and we know our customers are greatly impacted by service disruptions. We're committed to make sure we fully understand what happened and fix our software and procedures to make sure it does not happen again. That's why we've shared this detailed analysis.

While there are still some lingering questions (notably: why the first successful autovacuum did not clear the problem), we're heartened that the data we gathered during the outage later allowed us to identify our own locking issue as the most direct and actionable cause. Without that, we would still be wondering whether wraparound autovacuums were always so disruptive and whether this was at heart a PostgreSQL issue.

This was not the first incident we've experienced resulting from insufficient autovacuuming, though it was by far the most disruptive. Other incidents have largely affected performance of the compute service, and resulted from:

table bloat, leading to more I/O, memory, and compute required to process queries
table fragmentation, resulting in much more I/O to read the same amount of data
vastly distorted query planner statistics, resulting in very inefficient queries (e.g., table scans instead of index scans)

We've historically struggled to tune our databases to automatically perform these operations in the background. Many resources suggest running periodic "vacuum" and "analyze" operations out of cron, while other solutions attempt to identify periods of low activity and run these operations at those times. But Manta runs a 24/7 duty cycle: customers are using it constantly and do not expect latency bubbles for periodic maintenance. We had been optimistic that background autovacuum operations would address these issues with little impact to service, or at least that performance degradation would be gradual. This optimism was clearly not warranted, and as part of this outage we learned that we need to monitor these operations much more closely.

This failure mode was especially painful because by its nature, a database can go years without seeing any symptoms -- until the threat of wraparound suddenly causes a major disruption. Indeed, we were not the first service provider within a week to experience an outage related to this behavior. We will take this opportunity to carefully examine the relevant tunables to make sure they're configured to run autovacuum appropriately. We'll validate those settings in our pre-production environment under high load. We will also be examining the relevant statistics that can be used to identify that these operations are running, have completed, and are cleaning up the issues they're intended to clean up.

These issues aside, we're still happy with PostgreSQL, and we've not seriously considered switching away from it. On the contrary, the documentation and data that's available to understand such issues is very valuable. We believe we can do better to tune and monitor these operations.

Closing

We want to reiterate our apology for the magnitude of this issue and the impact it caused our customers and their customers. We will be working hard to prevent an issue like this from happening again, and do not hesitate to reach out to us if you have any follow-up questions about this outage or its postmortem!

Sincerely,The Joyent Team

Post written by The Joyent Team