Terraform Drift: The Bad, the Ugly and the Black Swan

Jan 24, 2023 | Terraform


What is Terraform Drift? What problems does it cause? And how can we fix it?

So you’ve started using, or have been using, an Infrastructure as Code (IaC) solution like HashiCorp’s Terraform. You have some cloud resources deployed via Terraform, and maybe some that are not. Sometimes, or maybe all of the time, you’ve noticed that when you go to make subsequent changes or deployments via Terraform, terraform plan reports pending changes to resources whose Terraform code you never touched. This is not an error; you have just encountered Terraform drift.

What is Terraform drift?

Terraform drift is a well-known problem. It occurs when changes happen to your cloud environment resources that were not driven by a Terraform workflow, which leads to differences between what is actually configured in your cloud and what is declared in your Terraform code. In other words, your cloud has “drifted away” from your Terraform. These differences can occur through the following channels:

  • Engineers making manual changes to resources through the Cloud provider’s web application. This is sometimes cheekily called “ClickOps”.
  • Engineers making changes via CLI commands, like Google’s gcloud tool or AWS’s CLI.
  • Application logic or deployment pipelines creating or editing resources outside of a Terraform workflow. Think boto3 or other cloud SDKs.
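To make the definition concrete, here is a minimal Python sketch of what "drift" means for a single resource: the attributes declared in code are compared against the attributes actually observed in the cloud. The function name find_drift and the sample attribute values are hypothetical, chosen only for illustration; this is not how Terraform itself computes a plan.

```python
from typing import Any, Dict, Tuple


def find_drift(
    declared: Dict[str, Any], actual: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {attribute: (declared_value, actual_value)} for each mismatch.

    Only attributes present in the declared configuration are checked,
    since those are the ones the code makes a claim about.
    """
    drift = {}
    for key, want in declared.items():
        have = actual.get(key)
        if have != want:
            drift[key] = (want, have)
    return drift


# Hypothetical example: someone resized the instance by hand in the console.
declared = {"instance_type": "t3.micro", "monitoring": True}
actual = {"instance_type": "t3.large", "monitoring": True}

print(find_drift(declared, actual))
# {'instance_type': ('t3.micro', 't3.large')}
```

An empty result means the resource still matches its declaration; any entry is a difference that somebody has to explain or revert.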

These sources of drift are present in most organizations, particularly those that move very fast or whose engineering teams are accustomed to managing infrastructure outside of Terraform.

Now that we have an understanding of what Terraform drift is and how it occurs, why is this even something that we should care about?

The Bad

Chances are high that when your organization’s infrastructure was written via Terraform, reviewed by other team members, and then deployed, the exact specifications were very intentional. That EC2 instance size, that subnet’s CIDR block range, and the access privileges for your S3 bucket were all chosen for a reason.

When drift occurs, it means there is an undocumented difference between this written infrastructure and what is happening in the cloud. This in turn can lead to application degradation, excess spending, or security and compliance risks for your organization. As a result, when drift is identified, oftentimes while running terraform plan, manual toil is needed to understand and correct drift to mitigate the potential for these longer-tail risks. This toil detracts from core value delivery, and is Bad.

The Ugly

As we noted above, for most teams drift is discovered when running terraform plan. Generally, engineers do not arbitrarily run Terraform workflows (as far as we know!). In fact, they often run these workflows exclusively when trying to deploy new functionality or bug fixes. This means that:

a) undetected drift can be left causing problems within your infrastructure for a while prior to detection, and

b) when it is detected during a Terraform workflow, in the middle of a deployment, that is an extremely costly moment for your engineers to get sidelined by a manual toil task.

That is ugly.

The Black Swan

Undetected changes and resource provisioning outside of a Terraform workflow can lead to “Black Swan” events, which is why so many engineers endure “Bad” and “Ugly” costs to mitigate drift before it becomes a problem. For example:

  • A changed resource leads to application downtime (a database instance was shrunk and then fell over), or an outrageous bill (for some reason, GKE had 1500, not 150, nodes provisioned in our cluster).
  • Security-optimized configurations are relaxed and a data leak or hack occurs.
  • Maybe a hack does not occur, but when your organization is audited for business-critical compliance certifications, you fail the audit due to at-risk infrastructure.

Most insidiously, any of the above can also happen when cloud resources are created outside of Terraform control. It is much harder to discover these resources (they won’t show up in your terraform plan), and much more toil is involved to identify and then bring these resources under Terraform control.
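Finding resources that Terraform does not know about comes down to a set difference: everything that exists in the cloud, minus everything tracked in state. The sketch below assumes you have already collected both lists of resource IDs by some means (for example, from your provider's inventory API and from your state files); the function name and the IDs are hypothetical.

```python
from typing import Iterable, Set


def unmanaged_resources(cloud_ids: Iterable[str], state_ids: Iterable[str]) -> Set[str]:
    """Return IDs that exist in the cloud but are not tracked in Terraform state."""
    return set(cloud_ids) - set(state_ids)


# Hypothetical inventory: three instances exist, but only two are in state.
cloud_ids = ["i-0aaa", "i-0bbb", "i-0ccc"]
state_ids = ["i-0aaa", "i-0bbb"]

print(unmanaged_resources(cloud_ids, state_ids))
# {'i-0ccc'}
```

The hard part in practice is building the two lists across every resource type and every state file, which is why this class of drift takes so much more toil to uncover.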

Possible Solutions

  • Run a Terraform workflow regularly. This is possible but would need to be done for each Terraform state file. The output of terraform plan is unwieldy and prone to sprawling, which often requires more manual review. Worst of all, resources outside of Terraform control will be completely missed by this method.
  • Lock down your cloud environment to only allow changes via Terraform. Again, this is possible, but the devil is in the details. Enforcing this policy means restricting engineers who have years of experience managing resources through the AWS console or a CLI tool. This may put a heavy burden on team members who know Terraform, with all infrastructure provisioning needing to flow through them (no more engineers quickly spinning something up in Dev for themselves via the Azure portal). It also eliminates the opportunity for the occasional hot-fix, the need for which can arise from time to time.
  • Use a cloud-agnostic, self-hosted drift-mitigation tool like dragondrop.cloud (call us biased!). dragondrop.cloud performs regular, automated scans of your cloud environment, identifying changes made outside of a Terraform workflow and recommending necessary mitigation steps. Changes are recommended via pull requests, so developers never need to leave their existing workflows. Using dragondrop.cloud, development teams can be notified of drift before it becomes a problem, and be given the information needed to mitigate said drift with minimal toil.
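For teams trying the first option, two Terraform CLI features help tame the output. Running terraform plan -detailed-exitcode exits with 0 when there is no diff, 1 on error, and 2 when changes are present, which is convenient for a scheduled job. Saving the plan to a file and running terraform show -json on it yields machine-readable JSON that includes a resource_changes list. The Python sketch below condenses that JSON into a short summary; the function name and the trimmed sample document are illustrative assumptions, and a real plan contains many more fields.

```python
import json
from typing import Dict, List


def summarize_plan(plan_json: str) -> Dict[str, List[str]]:
    """Group resource addresses by planned action, skipping unchanged resources."""
    plan = json.loads(plan_json)
    summary: Dict[str, List[str]] = {}
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions == ["no-op"]:
            continue  # resource already matches the configuration
        summary.setdefault("/".join(actions), []).append(rc["address"])
    return summary


# Trimmed, hypothetical stand-in for `terraform show -json <planfile>` output.
sample = json.dumps({
    "resource_changes": [
        {"address": "aws_instance.web", "change": {"actions": ["update"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}},
        {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
    ]
})

print(summarize_plan(sample))
# {'update': ['aws_instance.web'], 'delete/create': ['aws_db_instance.main']}
```

Note that this only covers resources already tracked in state; resources created outside of Terraform never appear in a plan, so this approach still misses them.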

Conclusion

We hope this post gives you a better understanding of what Terraform drift is, the problems it causes, and possible solutions. Unaddressed Terraform drift creates immediate, short-term costs, and exposes organizations to significant long-tail risks. Thankfully, automated, secure solutions do exist for organizations to adopt.

dragondrop.cloud’s mission is to automate developer best practices while working with Infrastructure as Code. Our flagship product regularly scans and identifies resource changes that have occurred outside of a Terraform workflow (i.e., drift) so that dev teams can have a cloud environment that is fully represented as code. All of our tools are self-hosted by our customers, with no data ever leaving their servers. To learn more, schedule a demo or get started today!
