Who Changed Our Cloud Environment? Identifying Root Causes of Terraform Drift

May 24, 2023 | AWS, Terraform


Best Practice: Terraform as the Single Source of Cloud Truth

A best-practice Terraform setup treats Terraform as the single source of truth for what is in your organization’s cloud environment. This means that all cloud resources are accurately captured in Terraform configuration, and all changes to the cloud (resource edits, resource deletions, and new resource creations) are driven by a Terraform workflow.

The easiest way to enforce this pattern is to grant cloud edit access only to your organization’s Terraform workflow. Unfortunately, for organizations transitioning to Terraform, or at various stages of Terraform maturity, this option is too blunt to consider seriously. As a result, sooner or later, most organizations will experience drift and its downsides.
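For concreteness, on AWS the blunt approach could be approximated with a deny policy like the sketch below, attached for example as an AWS Organizations service control policy. The role name terraform-ci and the short list of write actions are hypothetical placeholders. A real policy would need to cover far more actions and should be tested carefully before rollout. The condition key aws:PrincipalArn and the ArnNotLike operator are standard IAM policy features.

```python
from __future__ import annotations

import json

# Hypothetical ARN pattern for the CI role that runs `terraform apply`.
TERRAFORM_ROLE_ARN = "arn:aws:iam::*:role/terraform-ci"

# Illustrative subset of write actions; a real policy would need many more.
WRITE_ACTIONS = [
    "ec2:RunInstances",
    "ec2:TerminateInstances",
    "ec2:ModifyInstanceAttribute",
    "s3:CreateBucket",
    "s3:DeleteBucket",
]


def terraform_only_policy(role_arn: str, actions: list[str]) -> dict:
    """Build a policy that denies `actions` to every principal except `role_arn`."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyWritesOutsideTerraform",
            "Effect": "Deny",
            "Action": actions,
            "Resource": "*",
            # For an assumed role, aws:PrincipalArn is the role's ARN, so only
            # sessions of the Terraform role escape the deny.
            "Condition": {"ArnNotLike": {"aws:PrincipalArn": role_arn}},
        }],
    }


print(json.dumps(terraform_only_policy(TERRAFORM_ROLE_ARN, WRITE_ACTIONS), indent=2))
```

The bluntness is visible in the condition: every human, script, and non-Terraform service account is locked out of these actions, including the engineer making an urgent production fix.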

Drift Happens

Drift can happen in many ways, including, but not limited to, the following:

A cloud engineer makes a hotfix in production to resolve a bug and forgets to update the corresponding Terraform code.
A DevOps engineer who is much more comfortable with CLI tools creates, modifies, or deletes cloud resources from the command line.
A full stack engineer with minimal working knowledge of Terraform creates resources in their dev environment to prototype new application logic.

In any of these cases, an organization moving towards the principle of “Terraform as the Source of Truth” would need to take the following actions:

1) Identify that drift has occurred. Ideally this is done proactively, not when running terraform plan or terraform apply (e.g. when your engineering team is actively trying to deploy new resources and is now blocked because drift has occurred). Furthermore, how will you identify resources that were created wholly outside of the Terraform workflow?
2) Determine who or what changed the resource. Which service account or user caused the drift? Did they have a good reason to do so? For most teams, this requires manually combing through cloud admin logs to find when a resource was created, changed, or deleted. If that sounds like a lot of work that is likely never to get done, that’s because for the vast majority of engineering teams, it is.
3) Educate users and lock down cloud permissions to prevent drift from recurring. Unfortunately, since step 2) takes so much time and is rarely completed, this step cannot happen, and drift is likely to recur.
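Parts of steps 1) and 2) can be scripted. The sketch below assumes two inputs: the JSON that terraform show -json prints for a state file, and a list of audit-log events already fetched from the cloud provider (on AWS, for example, via CloudTrail's LookupEvents API, which boto3 exposes as lookup_events). The event fields used here (time, actor, action, resources, read_only) are a hypothetical, simplified shape, not the raw CloudTrail format, and the cloud resource inventory is assumed to come from a separate listing step.

```python
from __future__ import annotations

import json
from datetime import datetime


def managed_ids(state_json: str) -> set[str]:
    """Collect the `id` of every managed resource in `terraform show -json` output."""
    ids: set[str] = set()
    root = json.loads(state_json).get("values", {}).get("root_module", {})
    stack = [root]
    while stack:  # walk the root module and all nested child modules
        module = stack.pop()
        for res in module.get("resources", []):
            # Skip data sources (mode == "data"); only managed resources count.
            if res.get("mode") == "managed" and "id" in res.get("values", {}):
                ids.add(res["values"]["id"])
        stack.extend(module.get("child_modules", []))
    return ids


def unmanaged(inventory: list[str], state_json: str) -> list[str]:
    """Resource IDs present in the cloud but absent from Terraform state."""
    known = managed_ids(state_json)
    return sorted(rid for rid in inventory if rid not in known)


def last_writer(events: list[dict], resource_id: str) -> dict | None:
    """Most recent non-read-only event that touched `resource_id`, or None."""
    writes = [e for e in events if resource_id in e["resources"] and not e["read_only"]]
    return max(writes, key=lambda e: datetime.fromisoformat(e["time"]), default=None)


# --- Illustrative inputs (all identifiers are made up) ---
state = json.dumps({"values": {"root_module": {
    "resources": [{"address": "aws_s3_bucket.logs", "mode": "managed",
                   "type": "aws_s3_bucket", "values": {"id": "acme-logs"}}],
    "child_modules": [{"address": "module.web", "resources": [
        {"address": "module.web.aws_instance.app", "mode": "managed",
         "type": "aws_instance", "values": {"id": "i-0abc"}}]}],
}}})
inventory = ["acme-logs", "i-0abc", "i-0def"]  # e.g. from a separate cloud listing step
events = [
    {"time": "2023-05-20T14:02:11+00:00", "actor": "arn:aws:iam::111122223333:user/alice",
     "action": "RunInstances", "resources": ["i-0def"], "read_only": False},
    {"time": "2023-05-21T09:15:00+00:00", "actor": "arn:aws:iam::111122223333:user/bob",
     "action": "DescribeInstances", "resources": ["i-0def"], "read_only": True},
]

for rid in unmanaged(inventory, state):
    ev = last_writer(events, rid)
    print(rid, "->", ev["actor"] if ev else "unknown")  # i-0def -> arn:aws:iam::111122223333:user/alice
```

Matching on raw IDs is the simplest possible join. In practice identifiers differ across services (bucket names, instance IDs, ARNs), so a real tool would need to normalize them first, and last_writer applies equally to drifted resources that Terraform already manages.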

Solution: dragondrop.cloud’s Cloud Actor Identification

dragondrop’s State of Cloud Report includes information on the root causes of drift, both for resources already managed by Terraform and for those entirely outside of Terraform’s control. This means that when drift occurs in your organization’s cloud, you can go straight to step 3) and focus on improving your organization’s cloud management posture.

You can watch a video of this in action with AWS here.

Conclusion

Drift happens, but the process needed to prevent it from recurring is usually very manual. It involves so much toil that it often simply does not happen, leaving organizations saddled with drift and unable to fully adopt Terraform best practices.

This is why we created dragondrop: to help automate the toil organizations often face when adopting and maintaining Terraform best practices. dragondrop answers the drift question “Who changed our cloud environment?”, so that you can ensure it does not happen again.

–

dragondrop.cloud’s mission is to automate developer best practices while working with Infrastructure as Code. Our flagship OSS product, cloud-concierge, allows developers to codify their cloud, detect drift, estimate cloud costs and security risks, and more, while delivering the results via a Pull Request. For enterprises running cloud-concierge at scale, we provide a management platform. To learn more, schedule a demo or get started today!
