Tools for infrastructure drift detection
March 15, 2022
Deprecation notice: Drift detection of managed resources
Drift detection of managed resources, including snyk iac describe --only-managed and snyk iac describe --drift, has been deprecated. Drift detection of managed resources ended on September 30, 2023.
Predicting infrastructure drift is like predicting snowfall in winter… you know it will happen at some point but you can't predict exactly when. And just like snowfall, having a way to detect it as early as possible is what will make you the most prepared and your infrastructure more secure!
In this article, we’ll explore the principles of drift detection, the different kinds of drift and why they happen, and tools to help detect drift with a simple example.
What is drift detection?
To understand why drift detection tooling is needed, you first need to understand what infrastructure drift is. In brief, you can think of it as a deviation of your actual infrastructure from what your configuration files describe.
For the sake of this blog post, we’ll focus on HashiCorp Terraform as our infrastructure as code (IaC) tool for deploying cloud resources. In the Terraform world, drift is when the real state of your resources in your cloud provider differs from what is recorded in your state file.
This is where proper tooling to detect those drifts can significantly improve your security posture — imagine a perfect world where you get alerts if someone manually changes your security group instead of using Terraform. Also, wouldn't it be awesome to have alerts on newly created resources outside of your Terraform?
Managed vs. Unmanaged resources
To illustrate my perfect world above, we will differentiate between two kinds of drifts:
Drifts on resources managed by IaC
Since the configuration and state files cover every resource applied and deployed to your cloud provider, your IaC tool is usually well placed to detect changes made outside of it, or changes not yet applied.
Drifts on resources unmanaged by IaC
On the other hand, this type of drift is not easy to detect since your configuration or state file doesn't have those resources defined in the first place.
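The distinction between the two kinds of drift can be sketched as a comparison of two inventories. This is a minimal illustration, not how any particular tool is implemented: the `classify` function, the `DriftReport` type, and the flat `{resource_id: attributes}` maps are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriftReport:
    drifted: dict    # managed resources whose live attributes differ from state
    unmanaged: set   # resources that exist in the cloud but not in state
    missing: set     # resources recorded in state but absent from the cloud


def classify(state: dict, cloud: dict) -> DriftReport:
    """Compare {resource_id: attributes} maps taken from state and from the cloud."""
    drifted = {}
    for rid in state.keys() & cloud.keys():
        # Record (expected, actual) for every attribute that no longer matches.
        changed = {
            key: (expected, cloud[rid].get(key))
            for key, expected in state[rid].items()
            if cloud[rid].get(key) != expected
        }
        if changed:
            drifted[rid] = changed
    return DriftReport(
        drifted=drifted,
        unmanaged=set(cloud) - set(state),
        missing=set(state) - set(cloud),
    )


# Hypothetical inventories: one managed bucket that drifted, one unmanaged bucket.
state = {"drift-example-managed-resource": {"versioning": True}}
cloud = {
    "drift-example-managed-resource": {"versioning": False},
    "drift-example-un-managed-resource": {"versioning": False},
}
report = classify(state, cloud)
print(report.drifted)
print(report.unmanaged)
```

Drift on managed resources only needs the intersection of the two maps, while unmanaged resources require a full listing of the cloud account, which is why the second kind is harder to detect.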
Why does drift happen?
Drift happens for many reasons, and HashiCorp explains it better than anyone else:
Let's take a look at real examples of the two types of drift and their impact. A drift on a managed resource could be someone manually changing the versioning of your Terraform-configured Amazon S3 bucket, whereas a drift on an unmanaged resource could be someone manually adding an S3 bucket outside of Terraform (e.g. in the AWS console). (See our previous article for tips to manage drift from manual changes.)
Which tools can help us detect those drifts?
Let's focus now on tools to help you detect drifts on managed and unmanaged resources. For the rest of this post, we will use the same simple Terraform example explained above:
resource "aws_s3_bucket" "example" {
  bucket = "drift-example-managed-resource"
  versioning {
    enabled = true
  }
}
To introduce drift, we're going to change the bucket's versioning attribute to false in the AWS Console. In addition, we're going to manually create a second S3 bucket, drift-example-un-managed-resource, in the AWS Console, outside of Terraform.
Now let's take a look at three tools for managing drift:
1. terraform plan
The terraform plan command describes what needs to be applied so that you end up with your desired implementation. Let's see the output of this command on our actual infrastructure:
$ terraform plan

aws_s3_bucket.example: Refreshing state... [id=drift-example-managed-resource]

Note: Objects have changed outside of Terraform

Terraform detected the following changes made outside of Terraform since the last "terraform apply":

  # aws_s3_bucket.example has changed
  ~ resource "aws_s3_bucket" "example" {
        id = "drift-example-managed-resource"
        # (10 unchanged attributes hidden)

      ~ versioning {
          ~ enabled = true -> false
            # (1 unchanged attribute hidden)
        }
    }

Unless you have made equivalent changes to your configuration, or ignored the relevant attributes using ignore_changes, the following plan may include actions to undo or respond to these changes.

────────────────────────────────────────────────────────────
Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # aws_s3_bucket.example will be updated in-place
  ~ resource "aws_s3_bucket" "example" {
        id = "drift-example-managed-resource"
        # (10 unchanged attributes hidden)

      ~ versioning {
          ~ enabled = false -> true
            # (1 unchanged attribute hidden)
        }
    }

Plan: 0 to add, 1 to change, 0 to destroy.

────────────────────────────────────────────────────────────

Note: You didn't use the -out option to save this plan, so Terraform can't guarantee to take exactly these actions if you run "terraform apply" now.
Terraform says that it found drift on its managed resource and explains what the change was (versioning went from true to false). After that, it outputs what it will do if we decide to apply the plan (set versioning back to true).
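The same information can be read programmatically. The sketch below assumes the plan was saved with terraform plan -out=tfplan and rendered with terraform show -json tfplan, and that the resulting JSON carries a top-level resource_drift list, as recent Terraform versions emit. The `drifted_resources` helper and the sample document are hypothetical: the sample is hand-written and heavily abridged, not real Terraform output.

```python
import json


def drifted_resources(plan_json: str) -> dict:
    """Map each drifted resource address to its {attribute: (before, after)} changes."""
    plan = json.loads(plan_json)
    result = {}
    for entry in plan.get("resource_drift", []):
        change = entry.get("change", {})
        before = change.get("before") or {}
        after = change.get("after") or {}
        # Keep only the top-level attributes whose value changed outside Terraform.
        result[entry["address"]] = {
            key: (before.get(key), after.get(key))
            for key in before.keys() | after.keys()
            if before.get(key) != after.get(key)
        }
    return result


# Hand-written, abridged sample; a real plan document has many more fields.
sample = json.dumps({
    "resource_drift": [{
        "address": "aws_s3_bucket.example",
        "change": {
            "actions": ["update"],
            "before": {"versioning": [{"enabled": True}]},
            "after": {"versioning": [{"enabled": False}]},
        },
    }]
})
drift = drifted_resources(sample)
print(drift)
```

Working from the JSON rendering rather than the human-readable text avoids parsing output whose layout can change between Terraform versions.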
Pros:
Consistent way to detect drifts on managed resources
Supports all Terraform resources
Cons:
No way to detect drifts on unmanaged resources
Does not support running a plan across multiple state files
2. CloudQuery
CloudQuery is an open source cloud asset inventory powered by SQL. By default, it extracts all your resources from your chosen cloud providers, then formats and loads them into PostgreSQL. A drift detection command is built on top of that, so as to "turn this drift problem into a data problem," as they say.
Once the CloudQuery CLI is installed, let's go back to detecting our S3 bucket drifts.
$ cloudquery fetch

Initializing CloudQuery Providers...

✓ cq-provider-aws@v0.10.10 verified 0s 100 %

Finished provider initialization...

Upgrading CloudQuery providers aws

✓ Upgraded provider aws to latest successfully.

Finished upgrading providers...

Starting provider fetch...

✓ cq-provider-aws@latest fetch complete 15s Finished Resources: 129/129

Provider fetch complete.

Provider aws fetch summary: ✓ Total Resources fetched: 523 ⚠️ Warnings: 0 ❌ Errors: 0
The first command, as explained above, fetches all my resources from my cloud providers and puts them in a PostgreSQL table.
$ cloudquery drift scan --deep --debug terraform.tfstate

Initializing CloudQuery Providers...

⌛cq-provider-aws@v0.10.11 downloading... 4s 100 %

Finished provider initialization...

Using profile drift-example
Starting module...
DIFF RESOURCE: s3.buckets:drift-example-managed-resource
+------------------------------------------+-----------+---------------+-------------------------+
| AWS EXPR                                 | AWS VAL   | TERRAFORM VAL | TERRAFORM EXPR          |
+------------------------------------------+-----------+---------------+-------------------------+
| COALESCE("c"."versioning_status",'')     | Suspended | <nil>         | versioning_status       |
| COALESCE("c"."versioning_mfa_delete",'') | Disabled  | <nil>         | versioning_mfa_delete   |
| "c"."block_public_acls"                  | true      | <nil>         | block_public_acls       |
| "c"."block_public_policy"                | true      | <nil>         | block_public_policy     |
| "c"."ignore_public_acls"                 | true      | <nil>         | ignore_public_acls      |
| "c"."restrict_public_buckets"            | true      | <nil>         | restrict_public_buckets |
+------------------------------------------+-----------+---------------+-------------------------+
Matching attributes "region", "logging_target_prefix", "logging_target_bucket", "policy", "tags", "replication_role", "arn", "ownership_controls"
+-----------------------+---------------------------------------------------------+
| ATTRIBUTE             | MATCHING VALUE                                          |
+-----------------------+---------------------------------------------------------+
| region                | us-west-2                                               |
| logging_target_prefix |                                                         |
| logging_target_bucket |                                                         |
| policy                | <nil>                                                   |
| replication_role      |                                                         |
| arn                   | arn:aws:s3:::drift-example-managed-resource             |
| ownership_controls    | <nil>                                                   |
+-----------------------+---------------------------------------------------------+
Module output:
=== DRIFT RESULTS ===
1 Resources not managed by Terraform
  aws:s3.buckets:
    - drift-example-un-managed-resource
1 Resources managed by Terraform but drifted
  aws:s3.buckets:
    - drift-example-managed-resource
=== SUMMARY ===
Total number of resources: 2
 - 1 not managed by Terraform
 - 1 managed by Terraform but drifted
 - 50% covered by Terraform
Finished module
This output is quite interesting: it found both of my S3 buckets, one managed and one unmanaged. It also found that my managed bucket has drifted, but the analysis is quite strange: it reports that versioning is suspended on AWS but unknown (<nil>) in my state file. This is probably a bug in the way Terraform attributes are read. The other drifted attributes make me wonder what would happen with a managed bucket that has not drifted.
I re-applied the original configuration of my bucket with terraform apply, then ran cloudquery fetch and cloudquery drift scan --deep --debug terraform.tfstate again:
$ terraform plan

aws_s3_bucket.example: Refreshing state... [id=drift-example-managed-resource]

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration and found no differences, so no changes are needed.

$ cloudquery drift scan --deep --debug terraform.tfstate

...

DIFF RESOURCE: s3.buckets:drift-example-managed-resource
+------------------------------------------+----------+---------------+-------------------------+
| AWS EXPR                                 | AWS VAL  | TERRAFORM VAL | TERRAFORM EXPR          |
+------------------------------------------+----------+---------------+-------------------------+
| COALESCE("c"."versioning_status",'')     | Enabled  | <nil>         | versioning_status       |
| COALESCE("c"."versioning_mfa_delete",'') | Disabled | <nil>         | versioning_mfa_delete   |
| "c"."block_public_acls"                  | true     | <nil>         | block_public_acls       |
| "c"."block_public_policy"                | true     | <nil>         | block_public_policy     |
| "c"."ignore_public_acls"                 | true     | <nil>         | ignore_public_acls      |
| "c"."restrict_public_buckets"            | true     | <nil>         | restrict_public_buckets |
+------------------------------------------+----------+---------------+-------------------------+
…
CloudQuery still reported that my bucket had drifted, even though the versioning status in its data table had changed to Enabled. This is clearly a false positive that should be disregarded, as my bucket has versioning enabled in the AWS console.
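One way to see why this report carries no signal is to compare the rows flagged in the two scans. The sketch below copies the (AWS value, Terraform value) pairs from the two tables above, with <nil> written as None. The `flagged` and `comparable` helpers are hypothetical and only illustrate the reasoning; they are not part of CloudQuery.

```python
# (AWS value, Terraform value) per attribute, copied from the two scans above.
drifted_scan = {
    "versioning_status": ("Suspended", None),
    "versioning_mfa_delete": ("Disabled", None),
    "block_public_acls": ("true", None),
    "block_public_policy": ("true", None),
    "ignore_public_acls": ("true", None),
    "restrict_public_buckets": ("true", None),
}
clean_scan = dict(drifted_scan, versioning_status=("Enabled", None))


def flagged(scan: dict) -> set:
    """Attributes reported as different between AWS and Terraform."""
    return {attr for attr, (aws, tf) in scan.items() if aws != tf}


def comparable(scan: dict) -> dict:
    """Rows where the Terraform side actually holds a value to compare against."""
    return {attr: pair for attr, pair in scan.items() if pair[1] is not None}


same_flags = flagged(drifted_scan) == flagged(clean_scan)
print(same_flags)
print(comparable(drifted_scan), comparable(clean_scan))
```

The flagged attributes are identical whether or not the bucket has really drifted, and no row has a Terraform value to compare against, which is consistent with the Terraform side not being read correctly rather than with real drift.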
Pros:
Really fast enumeration of my cloud resources with the fetch command
Unmanaged resources detected with the simple drift scan command
Supports scanning multiple state files
Cons:
Unreliable output for drifted S3 bucket attributes with the drift scan --deep --debug command
Only supports a few backends for storing your state files (e.g. S3 and local)
Does not support all Terraform resources
Requires a SQL database
3. driftctl
driftctl is a free and open source CLI tool that warns of infrastructure drift. It helps detect, track, and alert on drift in both managed and unmanaged resources. Let's test this with our example. Note: driftctl is Snyk’s own open source drift detection engine.
$ driftctl scan --deep

Scanned states (1)
Found resources not covered by IaC:
  aws_s3_bucket:
    - drift-example-un-managed-resource
Found changed resources:
  From tfstate://terraform.tfstate
    - drift-example-managed-resource (aws_s3_bucket.example):
        ~ versioning.0.enabled: true => false
Found 2 resource(s)
 - 50% coverage
 - 1 resource(s) managed by Terraform
     - 1/1 resource(s) out of sync with Terraform state
 - 1 resource(s) not managed by Terraform
 - 0 resource(s) found in a Terraform state but missing on the cloud provider
Scan duration: 17s
Provider version used to scan: 3.74.3. Use --tf-provider-version to use another version.
This output tells us that driftctl found our two resources. The managed one was found with a drifted attribute which is the versioning status and the other one is our unmanaged S3 bucket.
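The coverage figure in that summary can be reproduced with simple arithmetic. The sketch below assumes coverage is the share of managed resources among all resources found (managed, unmanaged, and missing), which is what the numbers above suggest. The `coverage` function is hypothetical and not taken from driftctl's source.

```python
def coverage(managed: int, unmanaged: int, missing: int = 0) -> int:
    """Percentage of discovered resources that are managed by Terraform."""
    total = managed + unmanaged + missing
    if total == 0:
        # Assumption: an empty account is reported as 0% rather than raising.
        return 0
    return managed * 100 // total


# Counts from the scan above: 1 managed, 1 unmanaged, 0 missing.
example = coverage(managed=1, unmanaged=1, missing=0)
print(f"{example}% coverage")
```

A figure like this can serve as a simple threshold in CI, for example failing a scheduled job when coverage drops below an agreed value.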
Pros:
Detects drifted attributes on managed resources with the driftctl scan --deep command
Detects drift on unmanaged resources with the driftctl scan command
Supports scanning multiple state files
Cons:
Scan time in --deep mode can be really long, since it lists all resources of an account and fetches the details of each one
API throttling errors can occur while scanning, since it relies heavily on the cloud provider API to gather all information
Does not support all Terraform resources
Considerations for drift tools
These three tools are really good at what they're doing, but let's wrap up what makes them really shine in specific scenarios.
The terraform plan command is a really powerful tool to have and to run in a scheduled pipeline for all your state files. Not only are all your resources covered by the tool, you also get a universal way of presenting drifted attributes that is pretty easy to read. The only downside is that you can't aggregate all your state files and run the command once to check for drifted resources in one place. And while reporting unmanaged resources is not the tool’s job, it’s an essential function.
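A scheduled pipeline can work around the lack of aggregation by running the command once per root module and collecting the results. The sketch below relies on the -detailed-exitcode flag of terraform plan (0 for no changes, 1 for an error, 2 for pending changes). The helper names and directory names are hypothetical, and each directory is assumed to have been initialized with terraform init beforehand.

```python
import subprocess
from pathlib import Path

# Exit codes documented for `terraform plan -detailed-exitcode`.
STATUS = {0: "in sync", 1: "error", 2: "changes pending"}


def interpret(exit_code: int) -> str:
    """Translate a plan exit code into a readable status."""
    return STATUS.get(exit_code, "unknown")


def plan_status(workdir: Path) -> str:
    """Run a plan in one root module directory and report its status."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return interpret(proc.returncode)


def scan_all(workdirs) -> dict:
    """Plan every directory in turn and collect one status per directory."""
    return {str(d): plan_status(Path(d)) for d in workdirs}


def needs_attention(report: dict) -> list:
    """Directories whose plan did not come back clean."""
    return sorted(d for d, status in report.items() if status != "in sync")


# Illustrative report with hypothetical directory names (no Terraform call made here).
demo = {"network": "in sync", "storage": "changes pending"}
print(needs_attention(demo))
```

Calling scan_all(["network", "storage"]) would produce a report of the same shape from real runs. Note that exit code 2 signals any pending change, so configuration edits that have not been applied yet are reported alongside drift.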
As for the CloudQuery command line tool, results are mixed. On one hand, the enumeration command (fetch) is an amazing piece of technology: you get a free, open source way to gain deep visibility into your cloud infrastructure, with SQL as the query and policy engine. On the other hand, the drift detection command is quite limited. From my tests, you can't rely on it to find drifts on managed resources. In addition, you can only use it to find unmanaged resources in a simple scenario: your state file is accessible either locally (mostly for testing the tool) or via an S3 bucket (best practice if you want to test it in CI). Fortunately, the drift detection command is still in alpha, so I can't wait to see this command line tool grow.
With driftctl hooked into your CI and running on a daily basis, you get a continuous way to be alerted on drifts on managed and unmanaged resources. Its range of supported backends for storing state files makes it a good candidate for most infrastructure. Furthermore, just like with CloudQuery, you can filter or ignore resources that you don't care about so the report fits your needs. One serious weakness, due to the way driftctl was designed, is that huge infrastructures can have a hard time running it smoothly because of API throttling. Finally, the execution time in --deep mode can be annoying when testing locally against a large infrastructure; in a CI pipeline it's less of a problem.
Last but not least: permissions. Neither driftctl nor CloudQuery needs your Terraform code, only your state file, which is less cumbersome. They both respect the best practice of least-privileged permissions to scan your entire cloud provider, which means you only need a read-only policy to run both commands. This is not the case for Terraform, whose credentials typically need read-write permissions to create, update, and delete your infrastructure, and which must read your code to know what to deploy or remove.
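As a rough illustration of what "read-only" means in practice, the sketch below checks an IAM-style policy document for actions that are not read-style. It is a naive heuristic based on action name prefixes, not the policy that either tool documents. The `non_read_actions` helper and the sample policy are hypothetical.

```python
# Assumption: read-style AWS actions start with one of these verbs.
READ_PREFIXES = ("Get", "List", "Describe")


def non_read_actions(policy: dict) -> list:
    """Return allowed actions that do not look read-only."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    flagged = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        for action in actions:
            # "s3:GetBucketVersioning" -> "GetBucketVersioning"
            operation = action.split(":", 1)[-1]
            if not operation.startswith(READ_PREFIXES):
                flagged.append(action)
    return flagged


# Hypothetical policy mixing read actions with one write action.
sample_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "s3:ListAllMyBuckets",
            "s3:GetBucketVersioning",
            "s3:PutBucketVersioning",
        ],
        "Resource": "*",
    }],
}
violations = non_read_actions(sample_policy)
print(violations)
```

A wildcard such as s3:* is also flagged by this check, since it grants write actions as well.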
Next up for Snyk IaC
If you’re interested in bringing unmanaged resources under IaC control as well as detecting drift of your managed resources, Snyk is doing just this. Drift management in Snyk IaC helps you secure infrastructure faster by reporting issues and fixes directly to developers, in developer-friendly terms. By building a faster feedback loop between cloud security and development teams, developers will be empowered to own their Terraform from code to cloud and secure infrastructure configurations post-deployment. The second part of this is surfacing unmanaged resources across cloud environments, so you can bring them under IaC control and reduce the risk of drift from the start.
Additional Resources for Drift Management
To continue learning about drift management with Terraform, we have a few more articles on our blog: