On August 3, 2022, from 8:10 AM to 3:50 PM UTC, Atlassian customers using Jira and Confluence were unable to use Forge Apps. The incident was triggered by the Forge Extension Discovery service, a key service that products use to integrate with Forge and determine which Forge Apps need to be invoked. The Forge Extension Discovery service became overwhelmed following the production deployment of a new Jira Forge System App that integrates with Compass. This impacted Jira and Confluence customers running in the following AWS regions: prod-east, prod-eucentral and prod-euwest. The incident was detected within 11 minutes by an automated monitoring system and mitigated by scaling up the Extension Discovery service and rolling back the Jira Forge System App deployment, which returned Atlassian systems to a known good state. The total time to resolution was about seven hours and 40 minutes.
INCIDENT TIMELINE
Below is the timeline of the incident.
Please note that all times are in UTC.
On August 3, 2022, from 8:10 AM to 3:50 PM UTC, the Forge Extension Discovery service experienced severe latency degradation in the following AWS regions: prod-east, prod-eucentral and prod-euwest. Latency increased from 24ms at the p50 and 35ms at the p90 to as much as 10s and 20s respectively in some regions. The resulting high latency and timeouts caused an outage of Forge apps across Jira and Confluence.
The Extension Discovery service received up to ten times more traffic than expected from Jira, causing it to become overwhelmed. This sharp increase in traffic was caused by the production deployment of a new Jira Forge System App that provides Jira and Compass integration to all Jira Cloud sites.
Some other contributing factors to this incident include:
We acknowledge that developer platform incidents disrupt both our partners and customers, and that incidents of this type are unacceptable. We take reliability very seriously.
While the action plan below describes steps taken specifically to address this issue, incidents like this one do not happen in isolation. We are taking steps across our Ecosystem engineering organization to shore up monitoring and response processes, in order to address systemic gaps that impact reliability. We are committed to providing a stable and reliable platform for partners and for app customers.
We have taken a number of immediate actions to prevent this problem in the future. Here are the specific areas where we will make changes:
Improvements to Extension Discovery service to better handle a heavy increase in load
The following areas have been identified to help improve the scalability and reliability of the Extension Discovery service:
Improve the approval process and the technical method to progressively roll out internal Forge apps
Developer Status Page Communication Improvements
We acknowledge that we didn’t communicate as promptly as we should have, nor did we provide a precise enough impact statement with an adequate level of detail. Alongside remediations for the specific causes of this incident, we are also reviewing our incident management processes to improve the cadence and quality of partner communications during incidents.
Run a comprehensive load test (in our staging environment) on the Extension Discovery service to replicate the behaviour caused by this incident
The purpose of this load test is to replicate the load patterns that caused this incident, diagnose which parts of the Extension Discovery service contributed to the degradation, and make the required fixes.
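As a rough illustration of how such a load pattern might be described, the sketch below builds a stepped request-rate profile that ramps from a baseline up to ten times that baseline, matching the "up to ten times more traffic" observed in the incident. The function name, the baseline figure, and the number of steps are hypothetical; this is not the team's actual load-testing tooling.

```python
from typing import List


def ramp_profile(baseline_rps: float,
                 peak_multiplier: float = 10.0,
                 steps: int = 5) -> List[float]:
    """Return target request rates from baseline up to baseline * peak_multiplier.

    All parameters are illustrative assumptions: the incident report only
    states that traffic reached up to ten times the expected level.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    # Equal-sized increments between 1x and peak_multiplier x baseline.
    increment = (peak_multiplier - 1.0) / steps
    return [baseline_rps * (1.0 + increment * i) for i in range(steps + 1)]


if __name__ == "__main__":
    # Hypothetical baseline of 100 requests/second, ramped in three steps.
    for stage, rps in enumerate(ramp_profile(100.0, steps=3)):
        print(f"stage {stage}: target {rps:.0f} req/s")
```

Ramping in stages rather than jumping straight to the peak makes it easier to see at which load level latency starts to degrade.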
Validating Forge Service alerts to ensure they are re-paged if the service has not recovered
To prevent a repeat of the on-call engineer missing a key alert, the Ecosystem Engineering team will review its alerts to ensure that they re-page if the system has not recovered within a set period of time. The team will also review the alert names so that alerts are easier to distinguish by region.
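The re-paging rule described above can be sketched as a small decision function. This is a minimal illustration only: the `Alert` structure, the 15-minute interval, and the alert name are assumptions made for the example, not details of the team's actual alerting configuration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Alert:
    """Hypothetical alert record; field names are illustrative."""
    name: str
    region: str
    last_paged_at: datetime
    recovered_at: Optional[datetime] = None


def should_repage(alert: Alert,
                  now: datetime,
                  repage_after: timedelta = timedelta(minutes=15)) -> bool:
    """Re-page when the service has not recovered within the set period."""
    if alert.recovered_at is not None:
        return False  # service recovered, nothing to escalate
    return now - alert.last_paged_at >= repage_after


def page_title(alert: Alert) -> str:
    """Put the region first so alerts are easy to tell apart per region."""
    return f"[{alert.region}] {alert.name}"


if __name__ == "__main__":
    paged = datetime(2022, 8, 3, 9, 0)  # arbitrary example time
    alert = Alert("extension-discovery-latency", "prod-east", paged)
    print(page_title(alert), should_repage(alert, paged + timedelta(minutes=20)))
```

Keeping the decision in a pure function of the alert state and the current time makes the re-page behaviour straightforward to test without a live paging system.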
Some other areas of improvement we are looking at based on this post-incident review are:
We apologize to partners and customers who were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.
Thanks,
Atlassian Forge Team