Terraform Drift: The Bad, the Ugly and the Black Swan
What is Terraform Drift? What problems does it cause? And how can we fix it?
So you’re using an Infrastructure as Code (IaC) solution like HashiCorp’s Terraform. You have at least some cloud resources deployed via Terraform. You’ve noticed that sometimes, when you go to make subsequent changes or deployments via Terraform, terraform plan flags changes for resources whose Terraform code remains unchanged.
Why is this happening?
This is not an error. In fact, you have just encountered Terraform drift.
What is Terraform drift?
Terraform drift is a well-known problem. It occurs when changes are made to your cloud resources outside of a Terraform workflow, which leads to differences between what is actually configured in your cloud and what is declared in your Terraform code. In other words, your cloud has “drifted away” from your Terraform. These differences can occur through the following channels:
- Engineers making manual changes to resources through the cloud provider’s web console. This is sometimes cheekily called “ClickOps”.
- Engineers making changes via CLI commands, like Google’s gcloud tool or AWS’s CLI.
- Application logic or deployment pipelines creating or editing resources outside of a Terraform workflow. Think boto3 or other cloud SDKs.
These sources of drift are most often present in organizations that are moving super fast, have an engineering team accustomed to infrastructure management outside of Terraform, or both.
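To make the definition concrete, here is a minimal Python sketch of what "drift" means for a single resource: the attributes declared in code are compared with what the cloud reports. This is only an illustration, not how Terraform itself computes a plan. The function name find_drift, the attribute names, and the values are all hypothetical.

```python
from typing import Any, Dict, List, Tuple


def find_drift(
    declared: Dict[str, Any], actual: Dict[str, Any]
) -> List[Tuple[str, Any, Any]]:
    """Return (attribute, declared_value, actual_value) for each mismatch.

    Simplifying assumption: only attributes that are explicitly declared
    are compared, so extra attributes reported by the cloud are ignored.
    """
    drifted = []
    for attribute in sorted(declared):
        want = declared[attribute]
        got = actual.get(attribute)
        if want != got:
            drifted.append((attribute, want, got))
    return drifted


# Hypothetical example: someone resized an instance in the web console.
declared_config = {"instance_type": "t3.medium", "monitoring": True}
actual_config = {
    "instance_type": "t3.large",
    "monitoring": True,
    "private_ip": "10.0.0.12",  # not declared, so not treated as drift
}

for attribute, want, got in find_drift(declared_config, actual_config):
    print(f"{attribute}: declared={want!r}, actual={got!r}")
```

In this example only instance_type is reported, because it is the one declared attribute whose live value no longer matches the code.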
Now that we have an understanding of what Terraform drift is and how it occurs, why is this even something that we should care about?
Terraform drift: The Bad
Chances are high that the infrastructure defined by Terraform, reviewed by other team members, and then deployed, has very intentional specifications. That EC2 instance size, that subnet’s CIDR block range, and the access privileges for your S3 bucket were all chosen for a reason.
When drift occurs, it means there is an undocumented difference between this written infrastructure and what is happening in the cloud. This in turn can lead to application degradation, excess spending, or security and compliance risks for your organization. As a result, when drift is identified, oftentimes while running terraform plan, manual toil is needed to understand and correct drift to mitigate the potential for these longer-tail risks. This toil detracts from core value delivery, and is “bad”.
Terraform drift: The Ugly
As we noted above, for most teams drift is discovered when running terraform plan. Generally, engineers do not arbitrarily run Terraform workflows (as far as we know!). In fact, they often run these workflows exclusively when trying to deploy new functionality or bug fixes. This means that:
a) undetected drift can be left causing problems within your infrastructure for a while prior to detection, and
b) when it is detected during a Terraform workflow, in the middle of a deployment, that is an extremely costly moment for your engineers to get sidelined by a manual toil task.
That is “ugly”.
Terraform drift: The Black Swan
Undetected changes and resource provisioning outside of a Terraform workflow can lead to “Black Swan” events, which is why so many engineers endure “Bad” and “Ugly” costs to mitigate drift before it becomes a problem. For example:
- A changed resource leads to application downtime (a database instance was shrunk and then falls over), or an outrageous bill (for some reason, GKE had 1500, not 150, nodes provisioned in our cluster).
- Security-optimized configurations are relaxed and a data leak or hack occurs.
- Maybe a hack does not occur, but when your organization is audited for business-critical compliance certifications, you fail the audit due to at-risk infrastructure.
Most insidiously, any of the above can also happen when cloud resources are created outside of Terraform control. It is much harder to discover these resources (they won’t show up in your terraform plan), and much more toil is involved to identify and then bring these resources under Terraform control.
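As a rough sketch of why these resources are harder to find, the Python below compares a list of resource IDs reported by the cloud with the IDs recorded in one or more Terraform state files. Anything left over is not under Terraform control. It assumes the version 4 state file JSON layout (a top-level "resources" list whose entries hold "instances" with "attributes"), and the function names and IDs are hypothetical. How the cloud inventory is gathered is out of scope here.

```python
from typing import Any, Dict, Iterable, List, Set


def managed_ids(state: Dict[str, Any]) -> Set[str]:
    """Collect the "id" attribute of every managed resource in one state file.

    Assumes the version 4 state layout; data sources are skipped because
    Terraform reads them but does not manage them.
    """
    ids: Set[str] = set()
    for resource in state.get("resources", []):
        if resource.get("mode") != "managed":
            continue
        for instance in resource.get("instances", []):
            resource_id = instance.get("attributes", {}).get("id")
            if resource_id is not None:
                ids.add(resource_id)
    return ids


def find_unmanaged(
    cloud_ids: Iterable[str], states: Iterable[Dict[str, Any]]
) -> List[str]:
    """Return cloud resource IDs that appear in none of the state files."""
    known: Set[str] = set()
    for state in states:
        known |= managed_ids(state)
    return sorted(set(cloud_ids) - known)


# Hypothetical example: two instances exist, but only one is in state.
example_state = {
    "version": 4,
    "resources": [
        {
            "mode": "managed",
            "type": "aws_instance",
            "name": "web",
            "instances": [{"attributes": {"id": "i-0aaa"}}],
        }
    ],
}
print(find_unmanaged(["i-0aaa", "i-0bbb"], [example_state]))
```

The set difference is the easy part. The toil described above comes from building a complete cloud inventory and then writing and importing Terraform code for each resource that turns up.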
How can we fix Terraform drift?
There are a few ways to address drift, each with trade-offs:
- Run a Terraform workflow regularly. This is possible, but to catch all drift, it needs to be done for each Terraform state file. Furthermore, the output of terraform plan is a bit sprawling, requiring further manual review. Lastly, resources outside of Terraform control will be completely missed by this method.
- Lock down your cloud environment to only allow changes via Terraform. Again, this is possible (and often a best practice!), but the devil is in the details. Enforcing this policy means restricting engineers with years of experience controlling or touching resources through the AWS console or using a CLI tool. This may put a heavy burden on team members who know Terraform, with all infrastructure provisioning needing to flow through them (no more engineers quickly spinning something up in Dev for themselves via the Azure portal). It also eliminates the opportunity for the occasional hot-fix, for which the legitimate need can arise from time to time.
- Use an open-source tool like cloud-concierge. cloud-concierge can be configured to perform regular, automated scans of your cloud environment, identify changes made outside of a Terraform workflow, and recommend necessary mitigation steps. Unlike running a Terraform workflow, cloud-concierge offers the following benefits:
(a) Scan for drift across an arbitrary number of state files at one time.
(b) Identify and codify resources not controlled by Terraform.
(c) Output results in a human-readable format within a Pull Request.
(d) Surface the entities making changes outside of your Terraform workflow so that sources of drift can be locked down.
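For comparison, the first option above (running a Terraform workflow regularly, once per state file) could be scripted roughly as follows. This sketch relies on the -detailed-exitcode flag of terraform plan, which exits with 0 when there is nothing to change, 2 when the plan contains changes, and 1 on error. The directory names and helper functions are hypothetical, and each directory is assumed to be already initialized with terraform init.

```python
import subprocess
from typing import Callable, Dict, Iterable


def run_plan(workdir: str) -> int:
    """Run terraform plan in workdir and return its exit code."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return result.returncode


def classify(exit_code: int) -> str:
    """Translate a -detailed-exitcode result into a short label."""
    if exit_code == 0:
        return "in sync"
    if exit_code == 2:
        # A non-empty plan can mean drift OR code that was never applied.
        return "changes pending"
    return "error"


def check_all(
    workdirs: Iterable[str], runner: Callable[[str], int] = run_plan
) -> Dict[str, str]:
    """Check every Terraform working directory and label the outcome."""
    return {workdir: classify(runner(workdir)) for workdir in workdirs}


if __name__ == "__main__":
    # Hypothetical directories, one per state file.
    for workdir, status in check_all(["infra/network", "infra/app"]).items():
        print(f"{workdir}: {status}")
```

Run on a schedule, a script like this covers the first limitation noted above, but it still only tells you that something changed. Someone has to read each plan, and resources created outside of Terraform never appear in it.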
We hope this post gives you a better understanding of what Terraform drift is, the problems it causes, and possible solutions. Unaddressed Terraform drift creates immediate, short-term costs and exposes organizations to significant long-tail risks. Thankfully, automated solutions exist for organizations to adopt as they move towards best practices.
dragondrop.cloud’s mission is to automate developer best practices while working with Infrastructure as Code. Our flagship OSS product, cloud-concierge, allows developers to codify their cloud, detect drift, estimate cloud costs and security risks, and more — while delivering the results via a Pull Request. For enterprises running cloud-concierge at scale, we provide a management platform. To learn more, schedule a demo or get started today!