Incident Post Mortem: October 27, 2021

Summary

Between approximately 6:40 am and 10:42 am PT, and again between 12:20 pm and 2:32 pm PT on Wednesday, October 27th, we experienced intermittent outages on Coinbase.com, Coinbase mobile apps, and Coinbase Pro. During these outages, many users experienced slow loading times and errors while attempting to access Coinbase, or were unable to use features like buying, selling, and trading through our Retail and Pro websites and apps. The Exchange itself was not materially impacted. This post describes what occurred, what caused it, and how we plan to avoid such problems in the future.

We’re still learning more about these events and will update this post with additional details that may be of interest.

The Incident

On the morning of October 27th PT, we experienced a significant increase in traffic. As traffic increased, our engineers were alerted about elevated error rates appearing across a number of services.

The following functionality was affected:

  • Logged-out experience: users who were not logged in experienced errors when visiting coinbase.com or our mobile apps.
  • Coinbase Pro: users were temporarily unable to log in to Coinbase Pro.
  • Transfers: there was a higher rate of cancelled and refunded transfers during this time, as well as delays in processing on-chain money movements. Users may have been unable to see their latest transfer history.

Root Cause Analysis

These issues were caused by two separate but related outages. Both were triggered by system bottlenecks caused by the elevated traffic.

[Figure: Traffic to Coinbase, 10/27/2021]

In the first outage, we observed traffic levels several times greater than previous peaks. This increase in traffic began to overload a datastore responsible for our rewards functionality. As latency increased on this database, the services that depend on it became saturated and started to run out of resources as well. This resulted in a chain of failures and a more widespread outage.
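To make the saturation step concrete, here is a minimal sketch based on Little's law (average requests in flight = arrival rate × average latency). It is an illustration only: the request rate, latencies, and pool size are hypothetical numbers, not measurements from this incident, and the function names are invented for the example.

```python
def required_concurrency(requests_per_second: float, latency_ms: float) -> float:
    """Little's law: average in-flight requests = arrival rate x average latency."""
    return requests_per_second * latency_ms / 1000


def is_saturated(requests_per_second: float, latency_ms: float, pool_size: int) -> bool:
    """True when the in-flight requests implied by the load exceed the worker pool."""
    return required_concurrency(requests_per_second, latency_ms) > pool_size


# Hypothetical service: 100 workers, 1,000 requests per second.
POOL_SIZE = 100

# At 50 ms per datastore call the service needs about 50 workers.
healthy = is_saturated(1000, 50, POOL_SIZE)

# If datastore latency rises to 500 ms, the same traffic needs about 500 workers.
degraded = is_saturated(1000, 500, POOL_SIZE)
```

With the traffic held constant, a tenfold rise in datastore latency multiplies the number of workers tied up waiting by ten, which is one way a slow database can exhaust the services that call it.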

[Figure: Query capacity to key database cluster]

The second outage was also triggered by a spike in traffic levels. In the early afternoon, engineers were alerted that our payments cluster was being similarly overloaded. Unfortunately, an automated maintenance event that was already underway slowed our ability to scale this cluster up to meet demand, and a set of failures similar to those that occurred during the first outage followed.

[Figure: Elevated query latency for Payments cluster]

In this instance, the servers that power our logged-out experience were also affected. As these servers became overwhelmed, they were unable to serve new traffic and were ultimately marked by our load balancer as unhealthy and removed from its pool, causing coinbase.com to become unavailable to users who were logged out or who were attempting to log in. Other impacted functionality included the ability to buy, sell, and trade in both Coinbase’s retail application and Coinbase Pro.
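The health-check behavior described above can be sketched as follows. This is a simplified model, assuming a load balancer that removes a server after a fixed number of consecutive failed checks; the class, the threshold, and the server names are hypothetical and do not describe the load balancer Coinbase actually runs.

```python
class LoadBalancerPool:
    """Tracks which servers are healthy based on consecutive health-check failures."""

    def __init__(self, servers, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = {server: 0 for server in servers}
        self.healthy = set(servers)

    def record_check(self, server, ok):
        """Record one health-check result for a server."""
        if ok:
            # A passing check resets the counter and restores the server.
            self.failures[server] = 0
            self.healthy.add(server)
        else:
            self.failures[server] += 1
            if self.failures[server] >= self.failure_threshold:
                self.healthy.discard(server)

    def pick(self):
        """Return a healthy server, or None when the pool is empty."""
        if not self.healthy:
            return None
        return sorted(self.healthy)[0]


pool = LoadBalancerPool(["web-1", "web-2"], failure_threshold=2)
for _ in range(2):
    pool.record_check("web-1", ok=False)  # web-1 is removed from the pool
    pool.record_check("web-2", ok=False)  # web-2 is removed from the pool
no_capacity = pool.pick()  # None: every server has been marked unhealthy
```

When every server fails its checks at once, the pool is empty and there is nowhere to send new requests, which matches the site becoming unavailable rather than merely slow.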

At 2:32 pm PT, our services returned to normal operation.

Resolution & Improvements

To resolve the first outage, we deployed caching changes, scaled up the rewards database, and brought additional replicas online. Our system was then able to resume normal operation.
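This post does not detail the caching changes, so the following is only a generic sketch of one common approach: a read-through cache with a time-to-live placed in front of a slow lookup. The class name, the `loader` callback, and the 30-second TTL are assumptions made for illustration.

```python
import time


class TTLCache:
    """Read-through cache: serve fresh entries from memory, otherwise call the loader."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self.loader = loader          # hypothetical function that queries the database
        self.ttl_seconds = ttl_seconds
        self.clock = clock            # injectable so the cache can be tested without sleeping
        self._entries = {}            # key -> (value, expiry time)

    def get(self, key):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]           # cache hit: no database query
        value = self.loader(key)      # cache miss or expired entry: query once and store
        self._entries[key] = (value, now + self.ttl_seconds)
        return value


database_queries = []


def load_rewards(user_id):
    """Stand-in for a database query; records each call so load can be counted."""
    database_queries.append(user_id)
    return {"user_id": user_id, "rewards": []}


cache = TTLCache(load_rewards, ttl_seconds=30)
for _ in range(1000):
    cache.get("user-1")  # only the first call reaches load_rewards
```

The trade-off is staleness: readers may see data up to one TTL old in exchange for far fewer queries reaching the database.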

To resolve the second outage, we upgraded the under-capacity payments cluster to a larger instance size and introduced additional read-only replicas.
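Read-only replicas help only if read traffic is actually sent to them. A minimal sketch of that routing, assuming writes must go to a single primary and reads can be spread round-robin across replicas; the class and the host names are hypothetical.

```python
import itertools


class ReplicaRouter:
    """Sends writes to the primary and spreads reads across read-only replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Cycle through replicas in order; fall back to the primary if there are none.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, is_write):
        if is_write or self._replicas is None:
            return self.primary
        return next(self._replicas)


router = ReplicaRouter("payments-primary", ["payments-replica-1", "payments-replica-2"])
targets = [router.route(is_write=False) for _ in range(4)]
write_target = router.route(is_write=True)
```

Replicas typically lag the primary slightly, so a design like this suits reads that can tolerate slightly stale data.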

To prevent similar issues in the future, we are taking several additional actions:

  1. Reorganizing our largest services: we will continue to shard and isolate our largest services to avoid hitting limits like those described above.
  2. Enhanced load testing: we’re enhancing our load testing framework to be more representative of the new traffic patterns we saw during this event.
  3. Additional scaling: we are further scaling several of our databases that we observed operating close to their limits at Wednesday’s elevated traffic levels.
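As a small illustration of the sharding idea in the first item, the sketch below assigns each key to one of a fixed number of shards using a stable hash, so that no single database has to absorb all of the traffic. The key format and shard count are hypothetical; a production scheme would also need a plan for resharding, which this sketch does not address.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index in [0, shard_count) using a stable hash."""
    # hashlib is used instead of the built-in hash(), which is randomized per process for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


SHARD_COUNT = 4  # hypothetical
assignments = {}
for i in range(1000):
    shard = shard_for(f"user-{i}", SHARD_COUNT)
    assignments[shard] = assignments.get(shard, 0) + 1
```

Because the mapping depends only on the key, every service instance computes the same shard for the same user without coordination.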

We take the uptime and performance of our infrastructure very seriously, and we’re working hard to support the millions of customers who choose Coinbase to manage their cryptocurrency. If you’re interested in solving scaling challenges like those presented here, come work with us.


Incident Post Mortem: October 27, 2021 was originally published in The Coinbase Blog on Medium.
