Degraded Forge App Rendering
Incident Report for Atlassian Developer
Postmortem

SUMMARY

On August 3, 2022, from 8:10 AM to 3:50 PM UTC, Atlassian customers using Jira and Confluence were unable to use Forge apps. The incident was triggered by the Forge Extension Discovery service, a key service that products use to integrate with Forge and determine which Forge apps need to be invoked. The Forge Extension Discovery service became overwhelmed due to the production deployment of a new Jira Forge System App that integrates with Compass. This impacted Jira and Confluence customers running in the following AWS regions: prod-east, prod-eucentral, and prod-euwest. The incident was detected within 11 minutes by an automated monitoring system and mitigated by scaling up the Extension Discovery service and rolling back the Jira Forge System App deployment, which put Atlassian systems into a known good state. The total time to resolution was about seven hours and 40 minutes.

INCIDENT TIMELINE

Below is the timeline of the incident.
Please note that all times are in UTC.

  • 08:10AM Latency degradation started in prod-euwest for the Forge Extension Discovery service
  • 08:21AM Several alerts were triggered at once. The on-call engineer acknowledged them but mistakenly assumed they all related to prod-east and did not check the prod-euwest region.
  • 10:20AM While validating prod-east, the on-call engineer noticed on the service dashboard that prod-euwest latency had increased and took action
  • 10:20AM The Forge Extension Discovery service in prod-euwest was scaled up
  • 10:34AM The Forge Extension Discovery service in prod-eucentral was scaled up in anticipation of failover
  • 10:35AM The region failover was started, but the initial attempt failed
  • 10:38AM It was determined that the recently rolled-out Compass system app was causing a heavy increase in traffic from Jira to the Forge Extension Discovery service
  • 10:46AM The second attempt at region failover succeeded, and Forge Extension Discovery service traffic was redirected from prod-euwest to prod-eucentral
  • 10:52AM prod-eucentral started experiencing latency degradation for the Forge Extension Discovery service
  • 10:52AM prod-eucentral was scaled up to see whether this alleviated the problem
  • 11:08AM The prod-euwest region was re-enabled to receive Forge Extension Discovery service traffic
  • 11:20AM prod-east started experiencing latency degradation
  • 11:20AM prod-east was scaled up
  • 11:51AM The Compass system app rollout feature flag was disabled to stop the rollout of the app
  • 12:45PM Began manually terminating old, non-responsive nodes in prod-eucentral for the Forge Extension Discovery service
  • 01:06PM prod-eucentral recovered
  • 01:14PM Began manually terminating non-responsive nodes in prod-euwest
  • 01:30PM prod-euwest recovered
  • 01:46PM A pull request was created to apply preventive measures within Jira, allowing requests to the Forge Extension Discovery service to be filtered out via a feature flag inside the custom field implementation
  • 02:01PM Began manually terminating non-responsive nodes in prod-east
  • 02:33PM The pull request relating to the Jira preventive measures was merged and deployed
  • 03:29PM The Compass System App was removed from the Forge Development & Staging environments
  • 03:50PM prod-east recovered. End of impact; Forge apps and products were operating as normal
  • 03:52PM The Developer Statuspage was updated for the first time
  • 04:58PM The Compass System App was removed from Forge Production

IMPACT

On August 3, 2022, from 8:10 AM to 3:50 PM UTC, the Forge Extension Discovery service experienced severe latency degradation in the following AWS regions: prod-east, prod-eucentral, and prod-euwest. Latency increased from 24 ms at p50 and 35 ms at p90 to as high as 10 s and 20 s respectively in some regions. This high latency, and the resulting timeouts, caused an outage of Forge apps across Jira and Confluence.

ROOT CAUSE

The Extension Discovery service received up to ten times more traffic than expected from Jira, causing it to become overwhelmed. This heavy increase in traffic was caused by the production deployment of a new Jira Forge System App that provides Jira and Compass integration on all Jira Cloud sites.

Some other contributing factors to this incident include:

  • Extension Discovery service rate limits are applied per context and per user; the service currently has no per-product rate limits, which would have helped contain the impact had they been configured (a sketch of the missing dimension follows this list).
  • The Compass System App was not progressively deployed due to some technical limitations with one of our targeting platforms.
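
To make the first gap concrete, here is a minimal, hypothetical sketch (in TypeScript) of how a per-product dimension could be added to a context/user rate-limit key. The type and function names are illustrative assumptions and do not reflect the Extension Discovery service's actual implementation.

    // Hypothetical sketch only: not the actual Extension Discovery service code.
    interface InvocationRequest {
      contextId: string; // e.g. a site or installation context
      userId: string;    // the user triggering the extension lookup
      product: string;   // e.g. "jira" or "confluence" (the missing dimension)
    }

    // Before: limits keyed only by context and user, so one product generating
    // abnormal traffic is not isolated from the others.
    function legacyRateLimitKey(req: InvocationRequest): string {
      return `${req.contextId}:${req.userId}`;
    }

    // After: including the product lets a misbehaving caller (such as a newly
    // deployed system app in Jira) be throttled without affecting other products.
    function perProductRateLimitKey(req: InvocationRequest): string {
      return `${req.product}:${req.contextId}:${req.userId}`;
    }

    // A simple fixed-window counter keyed by the composite key.
    const windowCounts = new Map<string, number>();

    function allowRequest(req: InvocationRequest, limitPerWindow: number): boolean {
      const key = perProductRateLimitKey(req);
      const count = (windowCounts.get(key) ?? 0) + 1;
      windowCounts.set(key, count);
      return count <= limitPerWindow;
    }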

REMEDIAL ACTIONS PLAN & NEXT STEPS

We acknowledge that developer platform incidents disrupt both our partners and customers, and that these types of incidents are unacceptable. We take reliability very seriously.

While the action plan below describes steps taken specifically to address this issue, incidents like this one do not happen in isolation. We are taking steps across our Ecosystem engineering organization to shore up monitoring and response processes, in order to address systemic gaps that impact reliability. We are committed to providing a stable and reliable platform for partners and for app customers.

We have taken a number of immediate actions to prevent this problem from recurring. Here are the specific areas where we will make changes:

  1. Improvements to the Extension Discovery service to better handle a heavy increase in load
    The following areas have been identified to help improve the scalability and reliability of the Extension Discovery service:

    1. Evaluate a technical method to handle a temporary outage of the Extension Discovery service.
    2. Optimize the number of API calls the Jira Custom Fields module makes to the Extension Discovery service (a sketch of the interim feature-flag guard applied during the incident appears after item 5 below).
    3. Add dynamic, granular rate limits to the Extension Discovery service to allow us to isolate misbehaving products and/or apps.
    4. Reduce the Load Balancer timeouts for the Extension Discovery service so that they map to the expected product wait times and Service Level Objectives.
    5. Establish an improved analytics dashboard that makes it easy to map regions to impacted apps when measuring app impact during Extension Discovery service incidents.
  2. Improve the approval process and the technical method to progressively roll out internal Forge apps

    1. The aim is to improve our process and tooling in order to make the rollout of internal apps safer and ensure that progressive rollouts are a mandatory step.
    2. Review Ecosystem Platform launch procedures and ensure that any risks relating to the launch of new functionality with the potential to impact platform stability are mitigated ahead of launch, via progressive rollouts and other methods as appropriate.
  3. Developer Status Page Communication Improvements
    We acknowledge that we didn’t communicate as promptly as we should have, nor did we provide a precise enough impact statement with an adequate level of detail. Alongside remediations for the specific causes of this incident, we are also reviewing our incident management processes to improve the cadence and quality of partner communications during incidents.

  4. Run a comprehensive load test (in our staging environment) on the Extension Discovery service to replicate the behaviour caused by this incident
    The purpose of this load test is to replicate the load patterns that caused this incident, diagnose which parts of the Extension Discovery service contributed to the degradation, and make the required fixes.

  5. Validate Forge service alerts to ensure they re-page if the service has not recovered
    To prevent a repeat of the on-call engineer missing a key alert, the Ecosystem Engineering team will review its alerts to ensure that they are set up to re-page if the system has not recovered within a set period of time. The team will also review alert names so they are easier to distinguish per region.
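
As context for the interim measure referenced in action 1.2 and in the incident timeline, here is a minimal, hypothetical sketch (in TypeScript) of the general shape of a feature-flag guard that short-circuits calls to the Extension Discovery service from the Jira custom field path. The flag and function names are illustrative assumptions, not the actual Jira implementation.

    // Hypothetical sketch only: names are illustrative, not actual Jira code.
    interface Extension {
      appId: string;
      moduleKey: string;
    }

    // Stand-ins for Jira's feature flag client and for the remote call to the
    // Forge Extension Discovery service.
    declare function isFeatureEnabled(flag: string): boolean;
    declare function fetchCustomFieldExtensions(contextId: string): Promise<Extension[]>;

    const SKIP_DISCOVERY_FLAG = "jira.custom-fields.skip-forge-extension-discovery";

    // When the flag is enabled (for example, during an Extension Discovery
    // incident), the custom field implementation skips the discovery call and
    // renders without Forge-provided custom fields instead of timing out.
    async function resolveForgeCustomFields(contextId: string): Promise<Extension[]> {
      if (isFeatureEnabled(SKIP_DISCOVERY_FLAG)) {
        return [];
      }
      return fetchCustomFieldExtensions(contextId);
    }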

Some other areas of improvement we are looking at based on this post-incident review are:

  • Develop a quicker process to disable a system app.
  • Create improved metrics for Forge system apps to allow Forge services to monitor any impact.

We apologize to partners and customers who were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Forge Team

Posted Aug 29, 2022 - 23:19 UTC

Resolved
Between 8:00 UTC and 16:00 UTC (times vary between regions), some customers experienced Forge app failures: missing modules, degraded performance, or rendering issues. We have deployed a fix to mitigate the issue and have verified that the services have recovered. The issue has been resolved and the service is operating normally.
Posted Aug 03, 2022 - 18:41 UTC
Monitoring
We have mitigated the problem that resulted in Forge app rendering failures or modules not being present and degraded performance. We are now monitoring closely.
Posted Aug 03, 2022 - 17:56 UTC
Identified
The incident response team has identified the root cause of the failure and is working on bringing the services back to fully operational.
Posted Aug 03, 2022 - 16:05 UTC
Investigating
Forge modules may be slow to load or may disappear completely for some time.
At this moment we have isolated the impact to us-east customers.
We are working on a fix to fully recover the system.
Posted Aug 03, 2022 - 15:52 UTC