How I destroyed a production OpenSearch cluster and how I'll avoid it in the future

It was a Tuesday. We'd already released around 11 PRs - so why not release the 12th one as well? What could possibly go wrong? Soon after the 12th PR got merged, errors were all over Slack. What went wrong? It didn't take more than 3 minutes to figure it out - the OpenSearch domain was being deleted! But how? All we'd done was scale it from 3 to 4 instances!

Mistake #1: Environments went out of sync

We were struggling with performance in production, and a newer major OpenSearch version promised speedups, so we decided to upgrade. AWS supports zero-downtime upgrades - it spins up instances with the new version, copies the data there, validates it, switches traffic to the new instances, and removes the old ones. The upgrade went smoothly, and we forgot to upgrade the other environments.

Mistake #2: Terraform config drifted from production

As we had our hands full with performance issues, we didn't update the version in the Terraform configuration. As a result, development and staging matched the configuration, while production did not. So terraform plan produced no changes for development and staging, but that was not the case for production. Because of the version mismatch, Terraform planned to destroy the upgraded domain and replace it with one running the older version defined in the config - with downtime!

Either upgrading all environments or updating Terraform would've caught the issue early - applying to development first would have shown the same destructive plan there, and end-to-end tests would have failed because the development domain would've been destroyed too.
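One way to catch this kind of drift early is to compare the engine version each environment actually runs with the version declared in the Terraform config. Here is a minimal Python sketch, assuming the live versions have already been fetched (for example through the AWS API) into a plain dict. The function name, environment names, and version strings are hypothetical and only illustrate the idea:

```python
def find_version_drift(configured: str, live: dict[str, str]) -> dict[str, str]:
    """Return the environments whose live engine version differs from the configured one."""
    return {env: version for env, version in live.items() if version != configured}


# Hypothetical values: production was upgraded by hand, the config was not.
live_versions = {
    "development": "OpenSearch_1.3",
    "staging": "OpenSearch_1.3",
    "production": "OpenSearch_2.11",
}

drift = find_version_drift("OpenSearch_1.3", live_versions)
print(drift)
```

Running a check like this on a schedule, and failing loudly when the result is not empty, would have flagged production long before anyone touched the instance count.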

Mistake #3: The plan was not reviewed

The change was super simple - scale from 3 to 4 nodes. Code-wise, that's one line - to be even more precise, a one-character change. So no one felt that the plan needed a review. The CI/CD job for validation and plan was green - let's go!

The lesson: review the plan regardless of the size of the change. A one-character diff can still produce a destructive plan.
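Since a green plan job says nothing about what the plan contains, part of the review can be automated. Below is a sketch, assuming the plan was exported as JSON with terraform show -json. It flags every resource whose planned actions include a delete, which also covers replacements. The function name, the resource addresses, and the trimmed sample plan are made up for illustration:

```python
import json


def destructive_changes(plan: dict) -> list[str]:
    """Return addresses of resources the plan would delete or replace."""
    flagged = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        # A replacement shows up as ["delete", "create"] or ["create", "delete"].
        if "delete" in actions:
            flagged.append(change["address"])
    return flagged


# Hypothetical, heavily trimmed plan JSON; a real one has many more fields.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_opensearch_domain.search",
     "change": {"actions": ["delete", "create"]}},
    {"address": "aws_security_group.search",
     "change": {"actions": ["update"]}}
  ]
}
""")

flagged = destructive_changes(sample_plan)
print(flagged)
```

In a pipeline, a non-empty result could fail the job or require an explicit approval, so a destructive plan never rides along with a one-character diff unnoticed.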

Mistake #4: IAM role allowed destructive actions

Terraform was executed from an EC2 instance that assumed a role with AWS account access. Since only infrastructure was managed from there and changes were frequent, the permissions were very broad. Broad enough to be able to delete the OpenSearch domain. We'd used it previously to remove unneeded resources. So when Terraform planned removal, the removal was done - no guardrail stopped it.

Lesson learned

When we did the post-mortem, we realized one thing - as usual, there was a sequence of mistakes. If any of the 4 mistakes had been prevented, there would have been no issue. That said, the first three are easy to slip past. They require some manual steps here and there, so it's hard to enforce correct behavior consistently. It's mistake #4 that stood out. We realized that we should not allow automated removal of resources that hold data, because the data is hard to recover. Moving terabytes of data is expensive and time-consuming. So it's much better and cheaper to jump through many hoops when deleting something than to accidentally lose data.

Prevent similar things in the future

We learned our lesson. So we added explicit Deny statements to the IAM role for removal of things like RDS databases, OpenSearch domains, etc. - anything that holds data. We've also changed the process for removing such resources:

1. Exclude the resource from Terraform state, so Terraform stops managing it.
2. Manually delete the resource via the web console, where you are asked three times if you're sure.
3. Update Terraform config by removing the resource definition, so the next plan stays clean.
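The explicit Deny statements mentioned above could look roughly like the policy generated below. This is a sketch, not our exact policy: the action list, the Sid, and the function name are assumptions, and the list should be extended to match whatever data stores you run. It relies on the fact that in IAM an explicit Deny overrides any Allow, however broad:

```python
import json

# Delete actions for services that hold data; an assumed, non-exhaustive list.
DATA_DELETE_ACTIONS = [
    "es:DeleteDomain",
    "es:DeleteElasticsearchDomain",
    "rds:DeleteDBInstance",
    "rds:DeleteDBCluster",
]


def build_deny_policy(actions: list[str]) -> dict:
    """Build an IAM policy document that explicitly denies the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyDeletionOfDataStores",  # hypothetical statement id
                "Effect": "Deny",
                "Action": sorted(actions),
                "Resource": "*",
            }
        ],
    }


policy = build_deny_policy(DATA_DELETE_ACTIONS)
print(json.dumps(policy, indent=2))
```

With a statement like this attached to the role Terraform assumes, an apply that tries to delete a domain fails with an access error instead of silently succeeding.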

Conclusion

Even tiny changes can have disastrous consequences. The best defense is to make the destructive outcome impossible in the first place, because relying on someone to remember every step is too error-prone. That's why we automate deployments and follow the principle of least privilege. So next time you're tempted to grant "admin" permissions to someone or something, remember this story.

Safe engineering!