In preparation for supporting data residency, Forge's hosted storage has been undergoing a data migration to a new data store service. The migration is being rolled out in phases, with a temporary support layer that performs dual writes and reads across the old and new data stores, keeping the two consistent throughout the migration.
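To give a rough idea of how such a support layer behaves, here is a minimal sketch; the DataStore interface and dualWriteStore helper are hypothetical names used for illustration, not Forge internals, and the real implementation is more involved.

```typescript
// Hypothetical sketch of a dual-write support layer.
// The DataStore interface and dualWriteStore helper are illustrative names,
// not actual Forge internals.
interface DataStore {
  get(key: string): Promise<unknown>;
  set(key: string, value: unknown): Promise<void>;
  delete(key: string): Promise<void>;
}

// Every write goes to both stores; reads are served from whichever store
// is currently flagged as the primary data source.
function dualWriteStore(
  oldStore: DataStore,
  newStore: DataStore,
  readFromNew: () => boolean, // feature flag evaluated per request
): DataStore {
  return {
    async get(key) {
      return readFromNew() ? newStore.get(key) : oldStore.get(key);
    },
    async set(key, value) {
      // Dual writes keep the two stores consistent during the migration.
      await Promise.all([oldStore.set(key, value), newStore.set(key, value)]);
    },
    async delete(key) {
      await Promise.all([oldStore.delete(key), newStore.delete(key)]);
    },
  };
}
```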
During the final steps of the migration rollout, where we cut over to the new data store as the primary data source, we encountered several issues: delete operations failing on the new platform, a cascading increase in read latency caused by consistency issues stemming from the failed deletions, and some unanticipated changes in cursor behaviour for paginated queries.
This document provides a detailed analysis of these issues, explores the underlying causes, and outlines our plan to prevent their recurrence.
On February 5, at 05:58 UTC, the cut-over to the new data store as the primary data source started. This was done as a gradual rollout of a feature flag, and was completed in approximately 9 hours, with all apps using the new data store as the primary source. In this state, apps continued to write to the old data store, to enable prompt disaster recovery measures.
Due to a gap in our monitoring tools, we did not detect any impact during the rollout. Following reports from partner developers of delete operation failures, our team discovered that the production code for Forge Storage had not been updated to the latest version, due to an issue in our deployment pipeline. A fix was deployed to production at 18:17 UTC, preventing further delete operation failures.
The apps impacted by the delete operation failures were reverted to using the old data store as the primary source at 18:42 UTC, to eliminate the impact caused by the issue. This also enabled a read consistency check for those apps, prompting automatic retries of read operations to resolve race conditions between the two stores. This resulted in additional latency and intermittent timeouts for read operations for a small number of installations. We raised a secondary incident at 22:01 UTC. After turning off the consistency checks for the impacted apps, we resolved this incident at 23:34 UTC.
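As a rough illustration of why the consistency checks added latency, the sketch below compares the primary and secondary stores and retries on a mismatch. It reuses the hypothetical DataStore interface from the earlier sketch; the comparison and retry behaviour are assumptions for illustration, not the actual Forge implementation.

```typescript
// Illustrative sketch of a read consistency check with automatic retries,
// reusing the hypothetical DataStore interface from the earlier sketch.
async function readWithConsistencyCheck(
  primary: DataStore,
  secondary: DataStore,
  key: string,
  maxRetries = 3,
): Promise<unknown> {
  let value = await primary.get(key);
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const other = await secondary.get(key);
    // If both stores agree, the read is consistent and can be returned.
    if (JSON.stringify(value) === JSON.stringify(other)) {
      return value;
    }
    // Otherwise assume an in-flight dual write and retry after a short delay.
    // Each retry adds latency, which is why impacted reads slowed down and
    // occasionally timed out.
    await new Promise((resolve) => setTimeout(resolve, 100 * (attempt + 1)));
    value = await primary.get(key);
  }
  return value;
}
```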
Following the resolution of the incidents, our team proceeded to further investigate the extent of their impact and to confirm the accuracy of any data that may have been affected. We discovered that for a small number of installations, the old data store did not hold the correct data. This was caused by some dual-write operations failing to write to the old data store during the initial incident; it affected installations near their quota limit, because quota usage is calculated differently in the old and new data stores.
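The exact quota formulas are not described here, but as a purely hypothetical illustration, two stores that account usage differently can disagree on whether a write near the limit still fits under the same quota, so a dual write can succeed on one store and be rejected by the other.

```typescript
// Hypothetical illustration only: the actual quota calculations are internal.
const QUOTA_BYTES = 100;

// e.g. one store might count only the value size...
const usageInNewStore = (key: string, value: string) => value.length;
// ...while the other also counts the key plus some per-record overhead.
const usageInOldStore = (key: string, value: string) =>
  key.length + value.length + 8;

const key = "config";
const value = "x".repeat(90);

console.log(usageInNewStore(key, value) <= QUOTA_BYTES); // true: write accepted
console.log(usageInOldStore(key, value) <= QUOTA_BYTES); // false: dual write to the old store rejected
```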
After a thorough examination of all potential alternatives and their respective impacts, we concluded that the most beneficial solution was to transition these installations back to the new data store, which was done on February 9, at 07:39 UTC. Unfortunately, this meant some installations were permanently impacted by a failure in delete operations.
In order to minimise any subsequent impact, we have reached out directly to the owners of these apps with details of the incident and the scope of the impacted data. This information may help these developers rectify any actions their apps took, during the initial incident, based on records that remained present after a deletion had been requested.
For some apps, a change in cursor behaviour was also identified. Upon detailed investigation, the team discovered an unintended change in logic whereby the cursor value for the final page of results is returned as an empty string, as opposed to the previous undefined value. We raised a public bug ticket to track the issue and prioritised fixing it as part of the incident response. The bug was resolved and an update was communicated to developers on February 15, at 05:11 UTC. Additionally, apps executing paginated queries encountered temporary issues returning the next page when using a cursor that was provided just before the cut-over to the new data store. Provided that apps don't persist cursors (in accordance with the documented guidelines), this impact was limited to the duration of the cut-over and could be automatically resolved by initiating a new query. The team has since implemented a translation layer to prevent similar impacts in the future, and has updated our process to provide advance notification when such changes are anticipated.
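For app developers, the most robust way to consume cursors is to treat any missing or empty cursor as the end of the results and to avoid persisting cursors between query sessions. The sketch below assumes a query API that returns results together with a nextCursor; the Page type and fetchPage parameter are illustrative, not the actual Forge storage SDK.

```typescript
// Sketch of defensive pagination. The Page type and fetchPage parameter are
// illustrative, not the actual Forge storage SDK.
interface Page<T> {
  results: T[];
  nextCursor?: string; // may be undefined or "" on the last page
}

async function fetchAll<T>(
  fetchPage: (cursor?: string) => Promise<Page<T>>,
): Promise<T[]> {
  const all: T[] = [];
  let cursor: string | undefined;
  do {
    const page = await fetchPage(cursor);
    all.push(...page.results);
    cursor = page.nextCursor;
    // Treat undefined and an empty string the same way: no more pages.
  } while (cursor);
  return all;
}
```

Because the loop condition treats both undefined and an empty string as "no more pages", apps written this way would have been unaffected by the behaviour change.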
Here is a summary of the impact created by the two incidents:
- Delete operations failed for apps using the new data store as the primary data source, until the fix was deployed at 18:17 UTC on February 5.
- Read operations for a small number of installations experienced additional latency and intermittent timeouts while the consistency checks were enabled; this was resolved at 23:34 UTC on February 5.
- A small number of installations were permanently impacted by the failed delete operations; the owners of the affected apps have been contacted directly with the scope of the impacted data.
- Apps relying on an undefined cursor value to indicate the last page of query results were provided with an empty string cursor value instead. In some cases, this may have resulted in incorrect app behaviour. This impacted the behaviour of a small number of apps for a period of 10 days.
- Paginated queries using a cursor obtained just before the cut-over could temporarily fail to return the next page; initiating a new query resolved this automatically.

An issue in our deployment pipeline meant that a bug fix related to deletions was not promoted to production before we cut over to the new Forge storage data store. Although we rolled the change out incrementally via a feature flag, a gap in our monitoring meant that we failed to detect the issue before the rollout was completed.
As part of our internal incident management process, we have identified several preventative actions that the team has taken, or is taking, to prevent this type of incident from recurring. These fall into the following areas:
- Release process
- Migration rollout strategy
- Integration testing
- Monitoring
We apologise for the inconvenience caused during these incidents and assure our partners and customers of our commitment to improving our reliability.
Thanks,
Atlassian Ecosystem team
[This postmortem is also sent on the related incident: https://developer.status.atlassian.com/incidents/yzt262mxycm9]