Drop in the success rate of Forge hosted storage API calls
Incident Report for Atlassian Developer
Postmortem

SUMMARY

In preparation for supporting data residency, Forge's hosted storage has been undergoing a data migration to a new data store service. The migration rollout is implemented in phases, with a temporary support layer that facilitates dual writes and reads across both the old and new data stores. This approach aims to maintain consistency between the old and new data stores.
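
Conceptually, such a support layer routes reads to the current primary store and mirrors writes to both stores so that either store can be promoted to primary without losing data. The sketch below is a simplified illustration only, with hypothetical names and interfaces; it is not Forge's actual implementation.

    // Simplified illustration of a dual-write support layer (hypothetical names).
    interface KeyValueStore {
      get(key: string): Promise<unknown>;
      set(key: string, value: unknown): Promise<void>;
      delete(key: string): Promise<void>;
    }

    class DualWriteStore implements KeyValueStore {
      constructor(
        private primary: KeyValueStore,   // the current source of truth for reads
        private secondary: KeyValueStore, // kept in sync to allow a prompt cut-back
      ) {}

      // Reads are served from the primary store only.
      get(key: string): Promise<unknown> {
        return this.primary.get(key);
      }

      // Writes and deletes are applied to both stores, so that either store
      // can later be promoted to primary.
      async set(key: string, value: unknown): Promise<void> {
        await Promise.all([this.primary.set(key, value), this.secondary.set(key, value)]);
      }

      async delete(key: string): Promise<void> {
        await Promise.all([this.primary.delete(key), this.secondary.delete(key)]);
      }
    }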

During the final steps of the migration rollout, where we cut over to the new data store as the primary data source, we encountered several issues: delete operations failing on the new platform, a cascading increase in read latency caused by the consistency issues those failed deletions introduced, and some unanticipated changes in cursor behaviour for paginated queries.

This document provides a detailed analysis of these issues, explores the underlying causes and outlines our plan to prevent their recurrence.

INCIDENT DETAILS

Failing delete operations

On February 5, at 05:58 UTC, the cut-over to the new data store as the primary data source started, rolled out gradually behind a feature flag. The rollout completed in approximately 9 hours, with all apps using the new data store as the primary source. In this state, apps continued to write to the old data store as well, to enable prompt disaster recovery measures.

Due to a gap in our monitoring tools, we didn't detect any impact during the rollout. Following reports from partner developers of delete operation failures, our team discovered that the production code for Forge Storage had not been updated to the latest version, due to an issue in our deployment pipeline. A fix was deployed to production at 18:17 UTC, stopping further failures of delete operations.

The apps impacted by the delete operations failure were reverted to using the old data store as the primary source at 18:42 UTC, to eliminate the ongoing impact of the issue. This also enabled a read consistency check for those apps, which automatically retries read operations to resolve race conditions. These retries resulted in additional latency and intermittent timeouts for read operations for a small number of installations. We raised a secondary incident at 22:01 UTC and, after turning off the consistency checks for the impacted apps, resolved it at 23:34 UTC.

Following the resolution of both incidents, our team investigated the full extent of their impact and verified the accuracy of any data that may have been affected. We discovered that, for a small number of installations, the old data store did not hold the correct data. This was caused by some dual-write operations failing to write to the old data store during the initial incident; the affected installations were near their quota limit, and variations in how quota usage is calculated in the old and new data stores meant the same write could succeed on one store but fail on the other.
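
To illustrate how this kind of divergence can occur (the limit and usage figures below are invented for the example and are not Forge's actual quota values): if the two stores measure an installation's storage usage slightly differently, an installation sitting near its quota can have the same write accepted by one store and rejected by the other.

    // Invented numbers, for illustration only: the same write near the quota
    // limit passes one store's usage calculation but fails the other's.
    const QUOTA_LIMIT_BYTES = 250_000;

    function acceptWrite(currentUsageBytes: number, newRecordBytes: number): boolean {
      return currentUsageBytes + newRecordBytes <= QUOTA_LIMIT_BYTES;
    }

    // New data store's usage figure for the installation:
    acceptWrite(249_000, 800); // true  -> write succeeds on the new store
    // Old data store's (slightly higher) usage figure for the same installation:
    acceptWrite(249_600, 800); // false -> write fails on the old store, data diverges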

After a thorough examination of all potential alternatives and their respective impacts, we concluded that the most beneficial solution was to transition these installations back to the new data store, which was done on February 9, at 07:39 UTC. Unfortunately, this meant some installations were permanently impacted by a failure in delete operations.

In order to minimise any subsequent impact, we have reached out directly to the owners of these apps with details of the incident and the scope of the impacted data. This information may help these developers rectify any actions their apps took during the initial incident based on a record still being present after it was deleted.

Unintended cursor behaviour

For some apps, a change in cursor behaviour was also identified. Upon detailed investigation, the team discovered an unintended change in logic whereby the cursor value for the final page of results is returned as an empty string, rather than being undefined as it was previously. We raised a public bug ticket to track the issue and prioritised fixing it as part of the incident response. The bug was resolved and an update was communicated to developers on February 15, at 05:11 UTC.

Additionally, apps executing paginated queries encountered temporary issues in retrieving the next page using a cursor that was issued just before the cut-over to the new data store. Provided that apps don't persist cursors (in accordance with the documented guidelines), this impact was limited to the duration of the cut-over and could be resolved automatically by initiating a new query. The team has since implemented a translation layer to prevent similar impacts in the future, and has updated our process to provide advance notification when such changes are anticipated.
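
For context, the sketch below shows a defensive pagination loop that treats both an undefined and an empty-string cursor as the end of the result set. The Page shape and the queryPage callback are hypothetical stand-ins for an app's paginated storage query, not the exact Forge API.

    // Hypothetical page shape and query callback, standing in for an app's
    // paginated call to Forge hosted storage.
    interface Page<T> {
      results: T[];
      nextCursor?: string; // previously undefined on the last page; briefly '' instead
    }

    async function fetchAll<T>(
      queryPage: (cursor?: string) => Promise<Page<T>>,
    ): Promise<T[]> {
      const all: T[] = [];
      let cursor: string | undefined;

      do {
        const page = await queryPage(cursor);
        all.push(...page.results);
        cursor = page.nextCursor;
        // A strict `cursor !== undefined` check would keep querying when an
        // empty-string cursor is returned; a falsy check covers both cases.
      } while (cursor);

      return all;
    }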

IMPACT

Here is a summary of the impact of the two incidents:

  • Delete operations failing: Apps experienced failures when deleting records from app storage during a 12-hour window between 05:58 UTC and 18:17 UTC on February 5. For most apps, these failures were temporary and have been completely resolved; a very small number of apps were permanently impacted by failed delete operations.
  • Latency increase: During a 5-hour period between 18:42 UTC and 23:34 UTC on February 5, a small number of installations were impacted by latency increase and occasional time-outs due to additional read consistency checks.
  • Cursor format change affecting app data retrieval in very limited circumstances: Apps executing paginated queries encountered temporary issues in returning the next page using a cursor that was provided by the old data store just before the cut-over. This impact was limited to the short cut-over period for that app (assuming cursors were not persisted) and resolved automatically when the query restarted.
  • Empty cursor in the last page of paginated queries: Apps that relied on checking for an undefined cursor value to indicate the last page of query results were provided with an empty string cursor value instead. In some cases, this may have resulted in incorrect app behaviour. This impacted the behaviour of a small number of apps for a period of 10 days.

ROOT CAUSE

An issue in our deployment pipeline meant that a bug fix related to deletions was not promoted to production before we cut over to the new Forge storage data store. Though we rolled the change out incrementally via a feature flag, a gap in our monitoring meant that we failed to detect the issue before the rollout was completed.

REMEDIAL ACTIONS PLAN & NEXT STEPS

As part of our internal incident management process, we have identified several preventative actions, across three main areas, that the team has taken or is taking to prevent this type of incident from recurring.

Process changes

  • Release process

    • We are in the process of transitioning from a daily deployment schedule to a continuous delivery model, where changes will be deployed to production immediately after the merge.
    • As an addition to our existing review process, we have introduced a stricter review and sign-off process for high-impact rollouts.
  • Migration rollout strategy

    • We have implemented a temporary translation layer that accepts cursor formats from both the old and new data stores, in order to prevent impact during subsequent migrations (a simplified sketch follows this list).
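
As an illustration only, a translation layer of this kind might normalise incoming cursors before querying the new data store. The format check and prefix below are invented placeholders; the real cursor formats are internal to the service.

    // Invented cursor formats, for illustration only.
    const NEW_FORMAT_PREFIX = 'v2:';

    function isNewFormatCursor(cursor: string): boolean {
      return cursor.startsWith(NEW_FORMAT_PREFIX);
    }

    // Accept a cursor issued by either data store and return one the new store understands.
    function translateCursor(cursor: string | undefined): string | undefined {
      if (!cursor) {
        return undefined; // no cursor: start from the first page
      }
      if (isNewFormatCursor(cursor)) {
        return cursor; // already in the new store's format
      }
      return NEW_FORMAT_PREFIX + cursor; // re-encode an old-store cursor (placeholder logic)
    }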

Quality control improvements

  • Integration testing

    • We have improved our integration tests to validate delete operations in dual-write mode, and have expanded them to verify the other write operations, helping ensure data consistency across both data stores for all request types (a sketch of the shape of such a test follows this list).
    • We have improved existing integration tests for quota management, covering edge cases where differences in quota calculations between the old and new data stores may result in CRUD operations succeeding on one data store but failing on the other.
  • Monitoring

    • We have enhanced our monitoring measures by implementing new metrics that track storage activity per operation, supplementing our existing system that tracks storage activity per data store. This enables greater levels of visibility and proactive actions to mitigate potential incidents.
    • We are prioritising improvements for anomaly detection. One such improvement is adding alerts for the highest percentile level (P99), in addition to existing alerts at the P50 and P90 levels. This enables a greater level of monitoring for latency issues that may only be affecting a small number of apps.
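
The dual-write delete tests referenced above might take roughly the following shape. This is a hypothetical sketch using in-memory stand-ins for the two data stores and Node's built-in test runner; it is not Forge's actual test code.

    // Hypothetical integration-test sketch for the dual-write delete path.
    import assert from 'node:assert/strict';
    import { test } from 'node:test';

    // In-memory stand-ins for the old and new data stores.
    class InMemoryStore {
      private data = new Map<string, unknown>();
      async set(key: string, value: unknown): Promise<void> { this.data.set(key, value); }
      async get(key: string): Promise<unknown> { return this.data.get(key); }
      async delete(key: string): Promise<void> { this.data.delete(key); }
    }

    test('delete in dual-write mode removes the record from both stores', async () => {
      const oldStore = new InMemoryStore();
      const newStore = new InMemoryStore();

      // Dual write, then dual delete, mirroring the production write path.
      await Promise.all([oldStore.set('k', 'v'), newStore.set('k', 'v')]);
      await Promise.all([oldStore.delete('k'), newStore.delete('k')]);

      assert.equal(await oldStore.get('k'), undefined);
      assert.equal(await newStore.get('k'), undefined);
    });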

Communications

  • We have improved our process for communicating to developers any anticipated changes they should consider before a migration rollout starts. This enables developers to more effectively prepare for the underlying platform migration.
  • We are reviewing our incident response process to ensure incidents are properly communicated (via Statuspage) as soon as they are identified.

We apologise for the inconvenience caused during these incidents and assure our partners and customers of our commitment to improving our reliability.

Thanks,

Atlassian Ecosystem team

[This postmortem is also sent on the related incident: https://developer.status.atlassian.com/incidents/9q71ytpjhbtl]

Posted Mar 11, 2024 - 07:25 UTC

Resolved
We identified a problem with Forge hosted storage API calls, which resulted in a drop in invocation success rates in the developer console. The impact of this incident has been mitigated and our monitoring tools confirm that the success rate is back to pre-incident levels. According to our logs, 16 apps were impacted, seeing a reduced success rate for storage.get API calls, as listed in https://developer.atlassian.com/platform/forge/runtime-reference/storage-api-basic.

As part of Forge's preparation to support data residency, Forge hosted storage has been undergoing a platform and data migration for storing app data. As part of this migration, we run comparison checks for data consistency between the old and new platforms. The earlier incident (https://developer.status.atlassian.com/incidents/9q71ytpjhbtl) had put the data on the new platform out of sync, so comparisons of the data from the old and new platforms started failing, and the migration logic retries on failure to test for consistency issues. This retry behaviour increased the latency of these requests, which led to 16 apps receiving an increased number of 504 timeout errors.

The team identified the synchronous comparison check as a bug: it should have been asynchronous. Once the root cause was identified, we moved our backing platform rollout back to a previous stage. The rollout is split into several stages, and the issues occurred in our blocking stage, where we make calls to both the old and new platforms and wait for both to complete, so that we can detect any performance issues in the new platform before using it as our source of truth. It was in this blocking stage that requests also waited on the comparisons, when those should have been asynchronous.

To recover, we reverted to our shadow mode stage. In this stage, all operations to the new platform are asynchronous, including the comparisons that were blocking in the other stage and resulted in timeouts and 504 errors being sent to apps. This is the state Forge hosted storage had been running in for several months without any problems.
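
The difference between the two stages can be illustrated with the following sketch; all names are hypothetical stand-ins rather than Forge's actual code.

    // Illustrative contrast between the blocking stage and shadow mode.
    type Store = { get(key: string): Promise<unknown> };

    async function readAndCompare(
      oldStore: Store,
      newStore: Store,
      compare: (oldValue: unknown, newValue: unknown) => Promise<void>, // retries on mismatch
      key: string,
      blocking: boolean,
    ): Promise<unknown> {
      const [oldValue, newValue] = await Promise.all([oldStore.get(key), newStore.get(key)]);

      if (blocking) {
        // Blocking stage: the comparison (and its retries on out-of-sync data)
        // sits on the request path, so mismatches become added latency and 504s.
        await compare(oldValue, newValue);
      } else {
        // Shadow mode: fire and forget; mismatches are recorded out of band
        // and never delay the response to the app.
        void compare(oldValue, newValue).catch(() => { /* log the mismatch asynchronously */ });
      }

      return oldValue;
    }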

Here is the timeline of the impact:
- On 2024-02-05 at 06:42 PM UTC, impact started when comparisons began running on out-of-sync data in blocking mode
- On 2024-02-05 at 08:57 PM UTC, the impact to the API was detected by our monitoring systems
- On 2024-02-05 at 11:34 PM UTC, the rollout to the new platform was reverted to a known stable state and the impact ended

We will release a public incident review (PIR) here in the coming weeks for this incident and the one that happened earlier, https://developer.status.atlassian.com/incidents/9q71ytpjhbtl. We will detail all that we can about what caused the issues and what we are doing to prevent them from happening again.

We apologise for any inconvenience this may have caused our customers and the developer community, and we are committed to preventing further issues with our hosted storage capability.
Posted Feb 06, 2024 - 02:40 UTC
This incident affected: Developer (Hosted storage).