Product and developer platform outage
Incident Report for Atlassian Developer
Postmortem

CONTEXT

Atlassian Cloud products and applications, the customer experiences they provide, and the supporting services and internal tooling behind them all rely on compute and data workloads deployed in Amazon Web Services (AWS) Virtual Private Clouds (VPCs).

Critical to the function of these applications is the network communication between them, as well as to the Internet and AWS-managed services. This communication depends on DNS resolution, which applications perform by querying a DNS server to translate service names into network addresses they can connect to.
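
For illustration, the lookup a service performs can be sketched with Python's standard library. The hostname below is hypothetical; the call simply asks the VPC's configured DNS server for addresses a client can connect to:

    import socket

    # Ask the configured DNS server (in this architecture, a centralized EC2
    # resolver) to translate a service name into connectable network addresses.
    def resolve(hostname: str, port: int = 443) -> list[str]:
        results = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
        return [sockaddr[0] for _family, _type, _proto, _cname, sockaddr in results]

    # Hypothetical internal service name; every application-level network call
    # depends on a lookup like this succeeding first.
    addresses = resolve("example-service.internal")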

To facilitate DNS resolution across multiple accounts and VPCs, Atlassian operates centralized EC2 (Elastic Compute Cloud) DNS servers in each AWS region we serve customers from.

Atlassian is currently migrating from this centralized infrastructure to a more resilient, distributed solution that leverages the AWS-managed VPC resolver in a Shared VPC architecture. However, a number of internal applications and services still rely on the EC2-based solution.

The EC2-based DNS servers use security groups, which are subject to connection tracking allowance limits enforced by AWS.
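
These allowances are per-instance network limits. On instances that use the ENA network driver, the related counters (including conntrack_allowance_exceeded) are exposed through ethtool. The following sketch reads them; it assumes an ENA interface named eth0 and the ethtool utility being available on the host:

    import subprocess

    # Read the ENA per-instance network allowance counters from `ethtool -S`.
    # Assumes the instance uses the ENA driver and the interface is eth0.
    def ena_allowance_counters(interface: str = "eth0") -> dict[str, int]:
        output = subprocess.run(
            ["ethtool", "-S", interface], capture_output=True, text=True, check=True
        ).stdout
        counters = {}
        for line in output.splitlines():
            name, _, value = line.strip().partition(":")
            if name.endswith("_allowance_exceeded"):
                counters[name] = int(value)
        return counters

    # A growing conntrack_allowance_exceeded value means packets are being
    # dropped because the connection tracking allowance has been reached.
    print(ena_allowance_counters())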

INCIDENT SUMMARY

On November 22, 2022, between 19:39 and 21:28 UTC, some Jira, Confluence, and Bitbucket customers experienced varying levels of degradation across our products and services, including partner apps and integrations. The incident was detected within nine minutes by an automated monitoring system and resolved within one hour and 49 minutes.

The event was triggered by Atlassian’s DNS infrastructure in the AWS us-east-1 region reaching a network connection tracking allowance limit. Although the fault was localized to one region, the impact was global for customers whose data resided in us-east-1, as well as for Atlassian products with unique dependencies on that region.

The issue was initially mitigated by scaling up the DNS infrastructure, which increased the connection tracking allowance limit and returned Atlassian products to a healthy state. The underlying issue has since been resolved by a configuration change and cannot reoccur.

The incident was not caused by an attack and it was not a security issue. Atlassian customer data remains secure.

We deeply value the trust placed in Atlassian and apologize to customers who were affected by the event.

IMPACT

The impact occurred on November 22, 2022, between 19:39 and 21:28 UTC, for a total of one hour and 49 minutes. Some Jira, Confluence, and Bitbucket customers experienced service degradation ranging from intermittent access to a complete outage. Customers experienced error pages or very slow responses in applications and browsers, as well as 5xx response codes from our APIs.

ROOT CAUSE

The incident occurred as connections through Atlassian’s DNS infrastructure reached a new daily peak due to steady traffic growth. As a result, the infrastructure hit a limit on the number of simultaneous network connections tracked by an AWS EC2 security group. Upon reaching this limit, DNS packets were dropped, so services were unable to resolve network addresses, which resulted in application failures. Services retried their DNS queries upon receiving a SERVFAIL response or a query timeout, which created even more connections and compounded the problem.
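
Blind retries against a failing resolver add load at exactly the moment it is saturated. A common mitigation pattern, shown here only as a sketch and not as the behaviour of any specific Atlassian service, is to cap retry attempts and back off with jitter:

    import random
    import socket
    import time

    # Retry DNS resolution with a capped number of attempts and exponential
    # backoff plus jitter, so failures do not immediately turn into a burst of
    # additional queries against an already saturated resolver.
    def resolve_with_backoff(hostname: str, attempts: int = 3, base_delay: float = 0.2):
        for attempt in range(attempts):
            try:
                return socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
            except socket.gaierror:
                if attempt == attempts - 1:
                    raise
                # Full jitter: sleep a random amount up to the exponential cap.
                time.sleep(random.uniform(0, base_delay * (2 ** attempt)))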

During the incident, Atlassian’s DNS infrastructure in the AWS us-east-1 region was unable to service up to 90% of DNS resolution requests.

We were not aware of how close our infrastructure was to the security group connection tracking limit because utilization of this allowance is not currently observable.

Troubleshooting took an extended period of time because packet drops caused by exceeding EC2 network allowances were not actively monitored by Atlassian.
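
One way to make such drops visible, sketched below with an illustrative namespace and dimensions, is to publish the allowance counters (for example, those read in the earlier ethtool sketch) as custom CloudWatch metrics and alarm on any increase:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Publish allowance-exceeded counters as custom CloudWatch metrics so that
    # alarms can fire as soon as packets start being dropped. The namespace and
    # dimensions here are illustrative, not Atlassian's actual configuration.
    def publish_allowance_metrics(counters: dict[str, int], instance_id: str) -> None:
        cloudwatch.put_metric_data(
            Namespace="Custom/NetworkAllowances",
            MetricData=[
                {
                    "MetricName": name,
                    "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                    "Value": float(value),
                    "Unit": "Count",
                }
                for name, value in counters.items()
            ],
        )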

REMEDIAL ACTIONS PLAN & NEXT STEPS

Atlassian acknowledges that outages like this one impact the productivity and business of our customers. Since this incident:

  • We have deployed an immediate change to prevent traffic through Atlassian’s EC2-based DNS infrastructure from exhausting the security group connection tracking allowance. As a result of this change, this incident cannot reoccur.
  • We are investigating ways to improve our visibility into the utilization of AWS network allowance limits and will monitor packet drops due to them.
  • We are continuing our migration of internal services away from the EC2-based DNS infrastructure involved in this incident to a new, distributed architecture that is not subject to these network limits.

Again, we apologize to those customers whose services were impacted during this incident.

Thank you,
Atlassian Customer Support

Posted Dec 03, 2022 - 00:35 UTC

Resolved
Between 19:39 UTC and 21:25 UTC an issue with our networking services caused an outage of a number of Ecosystem capabilities.
The impact of this incident has been mitigated and our monitoring tools confirm that the impact is resolved.
During the incident, the following capabilities were impacted:

Forge:
- 1.62% of product triggers / async events failed to be delivered during the incident window. These events have since been replayed. This means that some events may have been received out-of-order.
- Other types of Forge function invocations were not impacted.
- End-User Consent: Some end users may not have been able to view or complete the consent flow during the incident. This impacted 3.6% of requests in the incident window.
- App installation: No impact.

Atlassian Connect:
- User impersonation for Atlassian Connect apps was also impacted: 7.8% of requests to retrieve bearer tokens from the Atlassian authentication server at https://oauth-2-authorization-server.services.atlassian.com failed.

OAuth 2.0 (3LO):
- A small number of Refresh Token Rotation requests failed (0.523%). These should have recovered on subsequent retries by the client.
- A small number of Authorization Code grant flows failed (0.75%). These should have recovered on subsequent retries by the user.

Product APIs:
- Product APIs and webhooks were intermittently unreliable for impacted sites during the incident timeframe.

All affected products and platform components are now back online and no further impact has been observed.

UTC 07:23 23/11/2022 UPDATE: Added OAuth 2.0 (3LO) and Product API impact observed during the same incident timeframe.
Posted Nov 23, 2022 - 00:19 UTC
Investigating
We are currently investigating an incident impacting multiple Atlassian products (see related incidents on https://status.atlassian.com). This incident also impacts apps, and some developer and platform capabilities.

We will provide additional updates and more specific details as they become available.
Posted Nov 22, 2022 - 22:34 UTC
This incident affected: APIs (Bitbucket Cloud APIs, Confluence Cloud APIs, Jira Cloud APIs, Product Events) and Developer (Forge Function Invocation).