When Terraform Apply Fails: Recovering from Partial Infrastructure

Terraform has no rollback. When apply fails partway through (an IAM policy syntax error, an API rate limit, an availability zone out of capacity), it stops and writes the current state. Resources created before the failure stay. Resources that hadn't been reached don't exist. The state file reflects this partial reality, and recovering from it requires understanding what Terraform did and didn't finish.

This guide walks through recovery steps and prevention. It also covers why Terraform can't roll back, and how infrastructure-from-code platforms like Encore handle provisioning differently.

What happens during a failed apply

Terraform processes resources in dependency order. When it encounters an error, it stops starting new work. Resources that were already created stay in place and operations already in flight run to completion, but nothing new starts. Terraform then writes the current state and exits.

This means the state file after a failure is accurate for what it tracks. Resources that were successfully created have their IDs and attributes recorded. Resources that were never attempted don't appear in state, and a resource whose creation failed partway through may be recorded as tainted so that the next apply replaces it. Whatever does exist is real infrastructure you're now responsible for, even though it falls short of what you intended.

aws_vpc.main: Creating...
aws_vpc.main: Creation complete after 3s [id=vpc-0abc123]
aws_subnet.public: Creating...
aws_subnet.public: Creation complete after 1s [id=subnet-0def456]
aws_security_group.web: Creating...
aws_security_group.web: Creation complete after 2s [id=sg-0ghi789]
aws_instance.app: Creating...

Error: creating EC2 Instance: InvalidParameterValue: The instance type
't3.xlarge' is not available in the requested availability zone.

In this example, the VPC, subnet, and security group all exist and are tracked in state. The EC2 instance was never created. Anything that depended on the instance (like a Route 53 record pointing to its IP) was skipped entirely.
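
One way to confirm which resources a failed run actually created is to pull the completed creations out of the apply log. The sketch below assumes the output was captured with color disabled (for example with terraform apply -no-color) and that its lines follow the "Creation complete after ... [id=...]" shape shown above. created_resources is a hypothetical helper name, not part of Terraform.

```shell
# created_resources: hypothetical helper, not a Terraform command.
# Reads apply output on stdin and prints "<address> <id>" for every
# resource whose creation finished.
created_resources() {
  sed -n 's/^\(.*\): Creation complete after .* \[id=\(.*\)]$/\1 \2/p'
}
```

Piping a saved log through it (created_resources < apply.log, where apply.log is an assumed file name) gives a quick inventory of what now exists. Running terraform state list reports the same addresses from Terraform's own point of view.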

How to recover

Check the current state

Start with terraform plan. It will compare your configuration against the state file and show you exactly what still needs to happen:

terraform plan

The output will show the failed resource as needing creation, along with any downstream resources that were skipped. Everything that was successfully created should show no changes, since the state already reflects their existence.

If the plan looks strange or shows resources being recreated that you know exist, your state may be out of sync. Running terraform apply -refresh-only updates state to match the actual infrastructure without changing the infrastructure itself.
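
For scripts or CI jobs, the same check can be reduced to a single verdict. This is a minimal sketch that relies on the -detailed-exitcode flag of terraform plan, which exits 0 when nothing needs to change, 2 when changes are pending, and 1 on error. The function name plan_status and the verdict strings are assumptions made for illustration.

```shell
# plan_status: hypothetical wrapper around `terraform plan -detailed-exitcode`.
# Prints one of: in-sync, changes-pending, plan-error.
plan_status() {
  rc=0
  # Send the human-readable plan to stderr so stdout carries only the verdict.
  terraform plan -detailed-exitcode -input=false >&2 || rc=$?
  case $rc in
    0) echo "in-sync" ;;
    2) echo "changes-pending" ;;
    *) echo "plan-error" ;;
  esac
}
```

After a failed apply, changes-pending is the expected answer: the failed resource and everything skipped behind it still need to be created.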

Fix the error and re-apply

Most of the time, recovery is straightforward. Fix whatever caused the original failure (correct the configuration, switch availability zones, request a quota increase) and run terraform apply again. Terraform will pick up where it left off, creating only the resources that are missing from state.

# After fixing the availability zone in your .tf file
terraform apply

The resources created during the first run won't be touched. Terraform sees they exist in state, confirms they match configuration, and moves on.

Target specific resources

When your configuration is large and you want to be precise about what gets created, use -target to apply only specific resources:

terraform apply -target=aws_instance.app
terraform apply -target=aws_route53_record.app

Each command applies only the targeted resource and whatever it depends on, which is useful when you're debugging a sequence of failures or want to verify each step before continuing. Run a full terraform plan afterward to confirm everything is consistent, since targeted applies can miss dependency updates.
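
The step-by-step approach can be wrapped in a small loop. The following is a sketch under the assumption that you want to stop at the first failure and finish with a full plan. apply_targets is a hypothetical helper, and the addresses passed to it are whatever your configuration defines.

```shell
# apply_targets: hypothetical helper. Applies each address with -target,
# in the order given, and stops at the first failure.
apply_targets() {
  for addr in "$@"; do
    echo "applying $addr" >&2
    terraform apply -target="$addr" || return 1
  done
  # A full plan afterwards shows anything the targeted runs left behind.
  terraform plan
}
```

Calling apply_targets aws_instance.app aws_route53_record.app mirrors the two commands above. Because terraform apply still asks for confirmation on each run, every step remains reviewable.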

Import resources created outside Terraform

Occasionally a failed apply leaves you in a state where you've manually created resources to work around the failure, or a resource was partially created in a way Terraform can't detect. Use terraform import to bring those resources under management:

terraform import aws_instance.app i-0abc123def456

After importing, run terraform plan to see if the imported resource matches your configuration. If there are differences, either update your .tf files to match or let the next apply reconcile them.
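
Importing an address that Terraform already tracks fails, so it can help to check state first. The sketch below assumes the address is matched exactly against the output of terraform state list. import_if_missing is a hypothetical helper name, and the ID you pass is whatever the provider expects for that resource type.

```shell
# import_if_missing: hypothetical helper.
# $1 = resource address in configuration, $2 = provider-side ID.
import_if_missing() {
  # `terraform state list` prints one tracked address per line.
  if terraform state list | grep -Fxq -- "$1"; then
    echo "$1 is already in state, skipping import" >&2
    return 0
  fi
  terraform import "$1" "$2"
}
```

For the example above, import_if_missing aws_instance.app i-0abc123def456 runs the import only when the instance is not yet under management.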

Why there's no rollback

If you're coming from database migrations or deployment tools that support rollback, Terraform's behavior feels incomplete. There's a reason for it.

Terraform manages each resource independently through its provider's API. Creating a VPC is a standalone AWS API call. Creating a subnet is another. There's no transaction wrapping these operations together, because the underlying cloud APIs don't support transactions either. AWS doesn't offer "create these five resources atomically or create none of them."

Rolling back would mean destroying every resource that was successfully created during the failed apply. In practice, this is often worse than the partial state. A VPC with a subnet but no instances is harmless. Destroying the VPC and subnet to "roll back" would also destroy any other resources you'd attached to them in the meantime, or resources from other Terraform configurations sharing the same VPC.

Terraform's design accepts partial failure as a normal operating condition and gives you tools to move forward from it rather than backward.

Preventing partial failures

Keep configurations small. A Terraform workspace managing 200 resources has more points of failure than one managing 20. Split infrastructure into logical units: networking in one workspace, compute in another, databases in a third. A failure in one workspace doesn't affect the others.

Use targeted applies for risky changes. When you're adding a resource type you haven't used before or working with a provider that has known issues, apply that resource in isolation first. Once it succeeds, run the full apply.

Snapshot state before applying. If your backend supports versioning (S3 with versioning enabled, GCS with object versioning), you already have automatic snapshots. For local state, copy terraform.tfstate before running apply. This doesn't prevent failures, but it gives you a recovery point if state gets corrupted.

# For local state
cp terraform.tfstate terraform.tfstate.backup-$(date +%s)
terraform apply
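
For remote backends without versioning, terraform state pull offers a comparable manual snapshot, since it prints the current state to stdout regardless of where it is stored. This is a minimal sketch. snapshot_state and the file naming scheme are assumptions, not Terraform conventions.

```shell
# snapshot_state: hypothetical helper. Saves the output of
# `terraform state pull` to a timestamped file and prints the file name.
snapshot_state() {
  snapshot="state-backup-$(date +%s).tfstate"
  if terraform state pull > "$snapshot"; then
    echo "$snapshot"
  else
    # Don't leave an empty or partial file behind if the pull failed.
    rm -f "$snapshot"
    return 1
  fi
}
```

State can contain sensitive values, so keep these snapshot files out of version control.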

Validate before applying. terraform validate catches syntax errors and basic configuration issues without making any API calls. Running it in CI before apply eliminates an entire class of mid-apply failures:

terraform validate && terraform apply
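
A slightly stricter gate saves the plan and applies exactly that file, so what was reviewed is what runs. The sketch below is one possible arrangement, not a prescribed workflow. checked_apply is a hypothetical helper and recovery.tfplan is an arbitrary file name. Neither validate nor plan can catch capacity or rate-limit errors, which only surface during apply.

```shell
# checked_apply: hypothetical helper. Validates, saves a plan, then applies
# exactly that saved plan. Stops at the first step that fails.
checked_apply() {
  terraform validate || return 1
  terraform plan -input=false -out=recovery.tfplan || return 1
  terraform apply recovery.tfplan
}
```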

The broader pattern

Partial applies, state management, and the lack of rollback are all consequences of managing infrastructure through an external tool that reconciles configuration against state. Every failure requires manual assessment: what was created, what wasn't, what needs to change before the next attempt.

Infrastructure-from-code tools like Encore take a different approach, where infrastructure requirements are declared within application code and provisioned automatically. There's no separate state file to get out of sync and no partial applies to recover from, because the infrastructure graph is derived directly from your code each time. If you're finding that operational overhead from Terraform failures is becoming a pattern, it's worth considering whether the reconciliation model is the right fit for your workflow.
