Storage Failure Incident Report, 11th of March 2025
On March 10th, 2025, our services experienced an outage of approximately two hours that impacted virtual machines and other services due to intermittent storage access issues. Below is a detailed analysis of the incident, its root cause, and the steps we are taking to prevent future occurrences.
First of all, we would like to apologize to all customers for the inconvenience caused by the Block Storage unavailability and degraded IO performance from the 9th of March until the 11th of March.
The following incident report aims to provide a clear and transparent view of the operations we performed, which ultimately required an unavoidable emergency maintenance downtime on the 10th of March from 17:41 UTC to 19:44 UTC.
Measures taken
Upgrades to new major releases will be delayed until six months after the currently deployed release goes EOL, since bug fixes are actively backported during that period. Storage capacity will be reserved at much higher levels than before to buy us more time when critical issues occur. We have also onboarded onto the croit ticket system and will stay in contact with them, so that we have a partner at our side capable of tackling severe CEPH issues.
Terminology:
CEPH = Open Source Enterprise Storage Clustering Solution
OSD = Object Storage Daemon used to provide storage for data pools in CEPH
CEPHADM = Solution for easy deployment and maintenance of CEPH Clusters
PG = Placement Groups organizing data allocation for data pools to OSDs in CEPH
Timeline of events
11th of February
We performed routine work on our general staging environment and decided to upgrade the CEPH components from CEPH Reef to Squid, as the last CEPH upgrade already dated back a year. As of today (11th of March) the staging environment is still fully operational on CEPH Squid.
19th of February
Using CEPHADM we triggered the upgrade from CEPH Reef to Squid in the production environment. The upgrade was fully automated by CEPHADM, with no issues occurring during or after the upgrade.
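For context, a CEPHADM-orchestrated upgrade is typically triggered and monitored like this (a minimal sketch; the exact target version is an assumption, not taken from the report):

```shell
# Verify the cluster is healthy before starting
ceph -s

# Start the orchestrated upgrade to a Squid release (version number illustrative)
ceph orch upgrade start --ceph-version 19.2.1

# Follow the upgrade progress
ceph orch upgrade status
```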
5th of March
20:06 UTC
Because of growing customer demand for storage capacity, a new OSD was added to the cluster on Host A.
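Adding a single OSD through the orchestrator is usually done as follows (a sketch; host and device names are illustrative assumptions):

```shell
# List devices the orchestrator considers available
ceph orch device ls

# Create an OSD on a specific host and device (names illustrative)
ceph orch daemon add osd hostA:/dev/nvme1n1
```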
6th of March
14:13 UTC
After our monitoring alerted us that the new OSD was flapping, we performed routine checks to evaluate the respective host's functionality and also checked the cluster state, which at that point was still operational. As we found no specific indication of what was wrong with the OSD and the cluster was in an acceptable operational state, we redeployed the flapping OSD, hoping the issue would vanish.
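The checks and the redeployment described above can be sketched with the standard orchestrator commands (the OSD id is an illustrative assumption):

```shell
# Evaluate cluster and daemon state
ceph health detail
ceph orch ps --daemon-type osd

# Redeploy the flapping OSD daemon via the orchestrator (id illustrative)
ceph orch daemon redeploy osd.42
```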
14:41 UTC
As the OSD appeared to keep crashing, we started debugging it and inspected the logs. We noticed messages pointing to OSDs reaching their maximum memory target, and to the bluestore component segfaulting because of reached limits for the bluefs allocation size, the allocator in general, and the bluefs runway. The log messages suggested changing those settings, so we applied the changes while continuing our investigation.
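The report does not list the exact options and values that were changed; as a hedged sketch, such per-OSD tuning is applied through the config store like this (option targets and values are assumptions for illustration):

```shell
# Raise the per-OSD memory target (value illustrative)
ceph config set osd.42 osd_memory_target 8G

# Adjust the bluefs shared allocation size hinted at by the log messages
# (value illustrative; default is 64K)
ceph config set osd.42 bluefs_shared_alloc_size 32768
```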
16:26 UTC
The OSD kept crashing, and we finally decided to move it from Host A to Host B, as we suspected the OCuLink cabling to the host's backplane to be a possible source of the issues. The ceph-bluestore-tool reported unrecoverable errors in the bluestore filesystem.
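A bluestore consistency check of the kind mentioned here is run offline against the OSD's data directory (a sketch; the OSD id, path and service unit name are illustrative and depend on the deployment):

```shell
# Stop the OSD daemon first (unit name depends on the deployment method)
systemctl stop ceph-osd@42

# Run a deep consistency check, reading all data on the bluestore volume
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-42 --deep yes
```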
19:39 UTC
The OSD move was completed, and the recovery operation for not fully replicated PGs started.
23:40 UTC
The OSD crashed again on Host B. At that point we suspected a firmware issue of the 15.36 TB PM9A3 NVMe, as another user reported a similar problem on Reddit: https://www.reddit.com/r/ceph/comments/1j07rwr/got_4_new_disks_all_4_have_the_same_issue/ A newer NVMe firmware was installed and the OSD was put back into operation. As the demand for storage was still increasing, and under the assumption that the firmware was the culprit, another 15.36 TB NVMe from Solidigm was ordered.
7th of March
06:34 UTC
The cluster was able to recover all unreplicated shards, ensuring full 2x fault tolerance with 3 copies, but was still busy moving misplaced data to achieve an optimal OSD fill level balance.
10:55 UTC
We received the Solidigm NVMe via transoflex express on the first morning after it was ordered and decided to add it to the cluster to meet the still growing customer storage demand.
17:14 UTC
The Solidigm NVMe OSD crashed unexpectedly, putting us on high alert, as this pointed more towards a bug in Ceph itself. We therefore started to remove variables from our environment, such as storage performance tweaks and other specific configuration that had accumulated over 6 years of operation. Debug logging was also increased, at the cost of higher CPU utilization and lower IO performance.
19:04 UTC
The Solidigm OSD crashed again, and we once more inspected the logs of the bluestore component. We identified specific RBD instance data as the crash trigger: their data blobs had been placed on sectors overlapping the content of other data blobs. We decided to remove those RBD images from the affected pool and migrated them to another storage backend.
8th of March
07:28 UTC
The newly added PM9A3 OSD crashed again, leaving the cluster in a rather risky operational state, as we found the situation no longer deterministic or controllable enough for normal everyday operation. Therefore the min size of available shards for our erasure-coded pools was decreased from 5 to 4. We immediately started reaching out to more experienced CEPH users, shared our insights, and received the recommendation to deploy two OSDs per NVMe, as the newly added NVMes (15.36 TB) are double the size of the old ones (7.68 TB).
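Lowering the minimum shard count on an erasure-coded pool, as described above, is a single pool setting (a sketch; the pool name is an illustrative assumption):

```shell
# Allow IO to continue with only 4 of the EC shards available
# (pool name illustrative)
ceph osd pool set ec-pool min_size 4
```

Note that lowering min_size trades durability guarantees for availability, which is why it was done only under emergency conditions.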
14:50 UTC
One of the two OSDs on one NVMe crashed again. We immediately ran ceph-bluestore-tool in deep mode to check the state of the data and found even more overlapping RBD instance data blobs, despite all our previous efforts.
16:12 UTC
After finishing all inspections of the most recent crash, we reached out to the CEPH Slack and the SCS Matrix server for further help with our issue. Unfortunately, there were no matching bug reports on the CEPH issue tracker.
19:47 UTC
While reaching out to many people, we were told that our situation looked very similar to that of hostup.se. Their OSDs crashed as well, with only EC pool data blobs being reported as unrecoverable errors when running ceph-bluestore-tool fsck. We immediately suspended all further PG activity to make sure no further crashes would occur. Out of caution, we started migrating all of our core services away from the EC pool to a replicated one.
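Suspending PG activity, as mentioned above, is typically done by setting cluster-wide flags that stop data movement (a sketch of the standard flags; the report does not state which exact flags were used):

```shell
# Stop rebalancing, backfill and recovery so no PG data movement occurs
ceph osd set norebalance
ceph osd set nobackfill
ceph osd set norecover

# Later, re-enable each with: ceph osd unset <flag>
```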
22:31 UTC
Once the migrations were finished, we started to evaluate all possibilities for migrating data out of the CEPH cluster.
22:56 UTC
Although all PG activity had been suspended, another OSD crash occurred, and we immediately decided to escalate our issue to the emergency support service of croit GmbH.
9th of March
02:11 UTC
We finished our first call with croit and concluded that we would have to wait until the Ceph developers at croit were reachable. From then on we observed the cluster state 24/7 in rotating shifts, bringing every crashed OSD back up as fast as possible while working on solutions to migrate data to safer, more stable storage backends.
06:40 UTC
After reaching out to several competitors, we started migrating customer data to temporary storage solutions from synlinq, informaten and dataforest, while keeping our efforts high to maintain storage backend availability and data integrity.
11:20 UTC
While the migration to other backends was ongoing at rather limited speeds, we started adding local software RAIDs to hypervisors so that we could migrate customers to local storage backends.
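Setting up such a local software RAID on a hypervisor can be sketched as follows (device names, filesystem and mount point are illustrative assumptions, not taken from the report):

```shell
# Create a RAID-1 array from two local disks (device names illustrative)
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc

# Put a filesystem on it and mount it as a local storage backend
mkfs.xfs /dev/md0
mount /dev/md0 /var/lib/local-storage
```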
12:30 UTC
We partially disabled the backup management functionality for our customers and started creating backups of all machines that had no backups available and were still located in the EC pool. The situation remained like this for the following hours, with us trying to minimize impact.
10th of March
07:20 UTC
Croit developers started to work on our case and were quickly able to trace back the issue we had experienced in our CEPH environment.
10:23 UTC
Croit support advised that it was very important to get rid of the OSDs we had added, as those were the only ones misbehaving. Thanks to our competitors providing us with a lot of storage, we were able to pursue this approach. Unfortunately, it later did not work out, as one of the OSDs crashed, and we stopped PG operations again to avoid further impact.
15:23 UTC
As yet another OSD crashed, leaving us with even further degraded data redundancy, we stopped evaluating new impact-free migration options and asked croit for further solutions. The last realistic chance we saw to avoid data loss and a major outage was to migrate PGs away from the impacted OSDs while the cluster was paused. We announced an emergency maintenance window of 2 hours starting at 17:30 UTC.
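The report does not detail the exact procedure croit used during the maintenance; a simplified sketch of pausing client IO while draining the affected OSDs looks like this (OSD ids illustrative):

```shell
# Pause all client IO (sets the pauserd/pausewr flags)
ceph osd pause

# Mark the affected OSDs out so recovery moves their PGs elsewhere
# (ids illustrative)
ceph osd out 42 43

# Resume client IO once the PGs have been drained off those OSDs
ceph osd unpause
```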
19:41 UTC
The emergency maintenance was carried out successfully by croit with our attendance and support. The cluster was in a stable state after almost 2 days of uncertainty.
11th of March
04:17 UTC
All shards and replicas recovered their missing copies, making the CEPH cluster healthy again.
07:13 UTC
We added back one of the OSDs with a specific configuration aimed at disabling new functionality in CEPH Squid that caused bluestore to write data incorrectly, ultimately leading to OSD crashes.
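The report does not name the option that was disabled; as a hedged sketch, disabling a bluestore feature for all OSDs is done through the config store like this (the option name below is our assumption based on the linked upstream tracker, not confirmed by the report):

```shell
# Disable the suspected new bluestore behavior before bringing the OSD back
# (option name is an assumption based on the upstream tracker issue)
ceph config set osd bluestore_elastic_shared_blobs false
```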
16:34 UTC
As of now, the OSD has been running stably for over 9 hours without reporting any of the suspicious errors or warnings we had seen before. Almost all data has been migrated back to our CEPH cluster. We will continue to observe the situation and add new OSDs as soon as we can ensure that no further issues will arise.
The actual bug report: https://tracker.ceph.com/issues/70390