03/24/26

Terraform Drift: Why It Happens and How to Fix It

Detecting and managing infrastructure drift in Terraform


Terraform drift is the gap between what your state file describes and what actually exists in your cloud account. A security group rule gets edited in the AWS console during an incident. An auto-scaler adds instances that Terraform doesn't track. A failed apply leaves the state file describing infrastructure that only half exists. Over time, the state file quietly falls behind reality, and the next terraform plan surfaces changes nobody remembers making.

This guide covers how to detect, fix, and prevent drift. It also explains why drift is structural to the state-file model, and how infrastructure-from-code tools like Encore avoid it by design.

What Terraform drift looks like

Terraform tracks infrastructure through a state file, a JSON document that maps your .tf configuration to real cloud resources. By default, every plan or apply first refreshes that state from the provider's APIs, then compares the result against your configuration. When they don't match, Terraform reports the difference as changes it wants to make.

Drift shows up in a few ways:

  • Unexpected diffs in terraform plan where resources show modifications you didn't write
  • Resources that exist in your cloud account but not in state, invisible to Terraform entirely
  • State entries pointing to resources that were deleted outside of Terraform, causing errors on the next apply
  • Attribute values that silently changed, like instance types or security group rules

The tricky part is that Terraform won't tell you about drift until you ask. Between runs, your state can quietly fall behind reality.

Common causes

Most drift doesn't come from malicious changes. It comes from the gap between how teams intend to manage infrastructure and how they actually do it day-to-day.

Manual console changes are the most common source. A developer debugging a production issue opens the AWS console and edits a security group rule directly. The fix works, the incident resolves, and nobody updates the Terraform config. Three months later, someone runs plan and sees a diff they can't explain.

Auto-scaling and managed services create drift by design. AWS auto-scalers add and remove instances based on load. RDS applies minor version patches. ECS updates task definitions during deployments. These changes are expected behavior from the cloud provider's side, but Terraform reports them as diffs against your configuration unless you've told it to ignore the affected attributes with lifecycle blocks.

Failed or partial applies leave state in an awkward position. If Terraform creates a security group but fails while creating the EC2 instance that references it, your state now describes a partial deployment. The next plan may try to recreate resources that already exist, or reference IDs that are stale.

Other tools modifying the same resources is increasingly common as teams adopt multiple infrastructure tools. A Kubernetes operator provisions a load balancer that Terraform also manages. A CI script modifies an S3 bucket policy. Ansible updates packages on instances Terraform provisioned. Each tool operates independently, and none of them update Terraform's state.

How to detect drift

The simplest check is terraform plan. Run it without applying, and it will show you every difference between your configuration and what's actually deployed, for the resources Terraform tracks.

terraform plan -detailed-exitcode

The -detailed-exitcode flag returns exit code 2 when there are changes (0 means no changes, 1 means the plan failed), which makes it useful in CI pipelines. You can run this on a schedule to catch drift before it compounds.
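A minimal sketch of how a CI step might interpret the exit status of that command. The function names classify_plan_exit and check_drift are illustrative, not part of Terraform, and the sketch assumes terraform init has already been run in the target directory.

```shell
# Translate the exit status of `terraform plan -detailed-exitcode` into a label.
# 0 = no changes, 2 = changes present, anything else = the plan itself failed.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Run a plan in the given directory (default: current) and print the label.
# Plan output is discarded here; a real pipeline would likely keep it as an artifact.
check_drift() {
  dir="${1:-.}"
  rc=0
  terraform -chdir="$dir" plan -detailed-exitcode -input=false -no-color >/dev/null || rc=$?
  classify_plan_exit "$rc"
}
```

Calling check_drift infra/prod (a hypothetical directory) would print clean, drift, or error, which a pipeline can branch on.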

For a more targeted check, terraform apply -refresh-only updates your state file to match the actual infrastructure without changing any resources. It shows you what changed and asks for confirmation before writing the updated state. After a refresh, running plan will show you only the differences between your configuration and reality, with the state file already synchronized.

terraform apply -refresh-only
terraform plan

(The older terraform refresh command does the same thing without a confirmation step, but it's been deprecated since Terraform 0.15.4 in favor of apply -refresh-only.)

Scheduled CI plans are the most reliable detection method. Set up a pipeline that runs terraform plan nightly or weekly and sends the output to Slack or creates an issue when drift is detected. Without this, drift accumulates silently until someone needs to make a change and discovers weeks of untracked modifications.
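One way such a scheduled job could look, as a rough sketch. The function names are illustrative, SLACK_WEBHOOK_URL is an assumed environment variable holding a Slack incoming-webhook URL, and scheduling (cron, a CI schedule trigger) is left to whatever runs the script.

```shell
# Build the alert text for one directory and plan exit status.
# Returns non-zero when the plan was clean and there is nothing to report.
drift_message() {
  case "$2" in
    0) return 1 ;;
    2) echo "Terraform drift detected in $1" ;;
    *) echo "Terraform plan failed in $1 (exit $2)" ;;
  esac
}

# Post a plain-text message to a Slack incoming webhook.
# Assumes the message contains no double quotes or backslashes (no JSON escaping is done).
notify_slack() {
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data "{\"text\": \"$1\"}" "$SLACK_WEBHOOK_URL"
}

# Plan every directory passed as an argument and alert on anything but a clean plan.
nightly_drift_check() {
  for dir in "$@"; do
    rc=0
    terraform -chdir="$dir" plan -detailed-exitcode -input=false -no-color >/dev/null || rc=$?
    if msg=$(drift_message "$dir" "$rc"); then
      notify_slack "$msg"
    fi
  done
}
```

A scheduled job would then call something like nightly_drift_check infra/network infra/app, where both paths are hypothetical.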

How to fix it

Once you've found drift, you have two choices: update your code to match reality, or apply your code to overwrite reality.

Option 1: Update your Terraform to match what actually exists. This is the right call when the real-world change was intentional and should be kept. If someone correctly resized a database during an incident, update your .tf files to reflect the new instance size, then run terraform plan to confirm the diff is clean.

Option 2: Run terraform apply to bring infrastructure back in line with your config. This makes sense when the change was accidental or unauthorized. Terraform will modify or recreate resources to match your declared configuration.

For resources that Terraform doesn't know about at all, terraform import brings them into state management:

terraform import aws_security_group.example sg-0123456789abcdef0

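Note that terraform import expects a resource block for that address to already exist in your configuration; an empty block is enough to start. After importing, fill in the block to match the resource's actual settings. Otherwise the next plan will show Terraform trying to modify it.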
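The import-then-reconcile loop can be sketched as two small helpers. The function names are hypothetical, and the sketch assumes a resource block for the address already exists in the configuration.

```shell
# Import an existing resource and print what Terraform recorded for it.
# Assumes a matching (possibly empty) resource block is already in the configuration.
import_and_inspect() {
  addr="$1"   # e.g. aws_security_group.example
  id="$2"     # e.g. sg-0123456789abcdef0
  terraform import "$addr" "$id" || return 1
  # The printed attributes serve as a reference when filling in the .tf block.
  terraform state show "$addr"
}

# Succeeds only when the plan reports no changes, i.e. the configuration
# now matches the imported resource.
plan_is_clean() {
  terraform plan -detailed-exitcode -input=false >/dev/null
}
```

After editing the .tf file, rerunning plan_is_clean until it succeeds confirms the configuration and the imported resource agree.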

For resources in state that no longer exist, terraform state rm removes the stale entry:

terraform state rm aws_instance.old_server
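Because state rm edits state directly, a cautious wrapper can preview the removal first. This is a sketch only: remove_stale_entry and its apply argument are hypothetical, while state list and the -dry-run flag of state rm are standard CLI features.

```shell
# Remove an address from state, previewing by default.
# Usage: remove_stale_entry ADDRESS [apply]
remove_stale_entry() {
  addr="$1"
  mode="${2:-preview}"
  # Refuse to continue if the address is not tracked in state at all.
  if ! terraform state list | grep -Fxq -- "$addr"; then
    echo "not in state: $addr" >&2
    return 1
  fi
  if [ "$mode" = "apply" ]; then
    terraform state rm "$addr"
  else
    # -dry-run prints what would be removed without touching state.
    terraform state rm -dry-run "$addr"
  fi
}
```

Running remove_stale_entry aws_instance.old_server shows the preview; adding apply as a second argument performs the removal.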

Preventing drift from accumulating

Complete prevention isn't realistic. Teams will always have reasons to touch infrastructure outside of Terraform. But you can reduce the frequency and catch it faster.

Restrict console access using IAM policies that limit who can modify production resources directly. Read-only console access for most team members removes the most common drift source.

Run scheduled plans in CI. A weekly terraform plan that alerts on any unexpected changes catches drift while the context is still fresh. The person who made the change might still remember why.

Use lifecycle blocks to tell Terraform to ignore attributes that change outside its control:

resource "aws_autoscaling_group" "example" {
  # ...

  lifecycle {
    ignore_changes = [desired_capacity]
  }
}

This prevents Terraform from treating auto-scaler changes as drift.

Use modules for consistency across teams and environments. When everyone provisions resources through the same module, there are fewer one-off configurations that invite manual tweaking.

Lock state files with remote backends. S3 now supports native locking (Terraform 1.10+, via the backend's use_lockfile setting) without the previously required DynamoDB table. GCS and Terraform Cloud also handle locking automatically. Locking prevents concurrent applies from corrupting state. Corrupted state is a different problem from drift, but it makes drift much harder to recover from.

Why drift is a structural problem

All of these practices help, but they're managing a symptom. The underlying issue is that Terraform's architecture requires a separate state file to track the relationship between code and infrastructure. Your application code lives in one repository, your infrastructure code lives in another (or a subdirectory), and state lives in a third location. Keeping these three things synchronized is an ongoing operational burden.

Every drift incident follows the same pattern: something changed in the real world, and the synchronization layer didn't capture it. More process, stricter access controls, and better CI pipelines reduce the frequency, but the failure mode is built into the model.

Infrastructure-from-code approaches like Encore take a different path. Instead of maintaining a separate infrastructure definition that has to stay in sync with your application, infrastructure is declared as part of the application code itself. A database declaration lives in the service that uses it. A pub/sub topic is defined next to the code that publishes to it. There's no separate state file that can drift because the infrastructure definition and the application are the same artifact.

This doesn't mean Terraform is the wrong choice for every team. But if drift has become an ongoing operational cost, it's worth understanding that the problem isn't a lack of discipline. It's an architectural tradeoff baked into how state-file-based tools work.
