Service Disruptions Affecting Atlassian Ecosystem Platform

Incident Report for Atlassian Developer

Postmortem

Summary

On February 14, 2024, between 20:05 UTC and 23:03 UTC, Atlassian customers on the following cloud products encountered a service disruption: Access, Atlas, Atlassian Analytics, Bitbucket, Compass, Confluence, Ecosystem apps, Jira Service Management, Jira Software, Jira Work Management, Jira Product Discovery, Opsgenie, StatusPage, and Trello.

As part of a security and compliance uplift, we had scheduled the deletion of unused and legacy domain names used for internal service-to-service connections. Active domain names were incorrectly deleted during this event. This impacted all cloud customers across all regions. The issue was identified and resolved through the rollback of the faulty deployment to restore the domain names and Atlassian systems to a stable state. The time to resolution was two hours and 58 minutes.

IMPACT

External customers started reporting issues with Atlassian cloud products at 20:52 UTC. The failed change caused performance degradation and, in some cases, complete service disruption. End users experienced unsuccessful page loads and failed interactions with our cloud products.

ROOT CAUSE

As part of a security and compliance uplift, we had scheduled the deletion of unused and legacy domain names that were being used for internal service-to-service connections. Active domain names were incorrectly deleted during this operation.
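As an illustration only, one way to guard this kind of cleanup is to check each deletion candidate against recent lookup activity before removing it. The sketch below is not Atlassian's tooling: the function name, the `last_seen` mapping, the 30-day quiet period, and the example domain names are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def split_deletion_candidates(candidates, last_seen, now, quiet_period=timedelta(days=30)):
    """Partition candidate domain names into (safe_to_delete, held_for_review).

    last_seen is a hypothetical mapping of domain name -> datetime of the most
    recent observed lookup. A name is considered safe only if it has a recorded
    lookup older than quiet_period; names with recent lookups, or with no
    telemetry at all, are held back for manual review.
    """
    safe, held = [], []
    for name in candidates:
        seen = last_seen.get(name)
        if seen is not None and now - seen >= quiet_period:
            safe.append(name)
        else:
            # Recent traffic, or missing data: do not delete automatically.
            held.append(name)
    return safe, held


if __name__ == "__main__":
    now = datetime(2024, 2, 14, tzinfo=timezone.utc)
    last_seen = {
        "old.internal.example": now - timedelta(days=90),
        "live.internal.example": now - timedelta(hours=1),
    }
    safe, held = split_deletion_candidates(
        ["old.internal.example", "live.internal.example", "unknown.internal.example"],
        last_seen,
        now,
    )
    print("safe to delete:", safe)
    print("held for review:", held)
```

Treating missing telemetry as a reason to hold a name, rather than as evidence that it is unused, is a deliberately conservative choice in this sketch.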

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. Detection was delayed because existing testing and monitoring focused on the health of individual services rather than the availability of the entire system.

To prevent a recurrence of this type of incident, we are implementing the following improvement measures:

Canary checks to monitor the entire system availability.
Faster rollback procedures for this type of service impact.
Stricter change control procedures for infrastructure modifications.
Migration of all DNS records to centralised management, with stricter access controls on modifications to DNS records.
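
To make the first measure concrete, a canary check of this kind can be sketched as a set of end-to-end probes whose combined success rate is compared against a threshold. This is a minimal sketch under stated assumptions, not a description of Atlassian's monitoring: the journey names, the `probe` callable, and the 99% threshold are hypothetical.

```python
from typing import Callable, Iterable


def system_availability(journeys: Iterable[str], probe: Callable[[str], bool]) -> float:
    """Return the fraction of end-to-end journeys whose probe succeeded.

    probe is a caller-supplied (hypothetical) function that exercises one user
    journey across service boundaries and returns True on success. A probe
    that raises is counted as a failure.
    """
    results = []
    for journey in journeys:
        try:
            results.append(bool(probe(journey)))
        except Exception:
            results.append(False)
    # With no journeys configured, report 0.0 so the gap is noticed.
    return sum(results) / len(results) if results else 0.0


def should_alert(availability: float, threshold: float = 0.99) -> bool:
    """Alert when system-wide availability falls below the threshold."""
    return availability < threshold


if __name__ == "__main__":
    def fake_probe(journey: str) -> bool:
        # Stand-in for a real probe: simulate a DNS failure on one journey.
        if journey == "install_app":
            raise ConnectionError("name resolution failed")
        return True

    journeys = ["login", "load_page", "create_issue", "install_app"]
    availability = system_availability(journeys, fake_probe)
    print(availability, should_alert(availability))
```

Because each probe exercises a whole journey, a broken internal connection between two services shows up as a failed probe even when every service's own health check still passes.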

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Feb 27, 2024 - 05:47 UTC

Resolved

Between 2024-02-14 20:18 UTC and 2024-02-14 22:19 UTC, the Atlassian platform suffered an outage that affected all Ecosystem supporting services.
During this timeframe, apps will have seen intermittent failures for all operations, including lifecycle events such as app installations.

This incident has now been fully mitigated.

We apologize for any inconvenience this may have caused you, your team, and our mutual customers.

Posted Feb 14, 2024 - 23:20 UTC

This incident affected: Developer (App Deployment, Artifactory (Maven repository), Create and manage apps, Developer documentation, Forge App Installation, Forge CDN (Custom UI), Forge Function Invocation, aui-cdn.atlassian.com, Forge App Logs, Forge App Monitoring, Developer console, Forge direct app distribution, Hosted storage, Forge CLI, End-user consent, Forge App Alerts) and Authentication and user management.