It doesn't work my life in software

Why I choose AWS CDK over Terraform

I was recently asked to compare two distinct toolchains for infrastructure as code, Terraform and AWS CDK, and to express a preference for the one that best fits our project requirements.

First of all, for those who are not familiar with the concept: infrastructure as code (IaC) is an emerging practice that encourages immutable infrastructure and frequent deployments by representing infrastructure resources and their configuration as some form of code.

If we think about the general experience of building a new application for the cloud, it is easy to agree that writing the software is only one aspect of publishing anything useful for end users.

Even with the most advanced cloud providers, like Google Cloud Platform, Amazon Web Services and Microsoft Azure, the job of setting up, configuring and testing the actual resources needed to implement our solutions is complex. And this complexity can exceed the complexity of the software itself (ask the Twitter team how they used to run Redis in production).

Having a discrete representation of all the required bits checked into a Git repository is a brilliant idea, because it unlocks a number of wonderful things:

  • consistent deployments with one single source of truth about your infrastructure
  • sharing reusable infrastructure patterns across teams
  • easier support for “deploy on demand” scenarios

Representing infrastructure as code also means we can reuse our expertise in collaboratively managing source code and in automating source code repositories. That translates into productivity, which is particularly meaningful in DevOps scenarios (since DevOps = self-service).

Let’s not compare apples with oranges

First of all, it’s not really correct to compare CDK with Terraform directly. To understand why, it helps to look at how these two toolchains approach their two main features:

  • Mapping the infrastructure using a construct (authoring)
  • Managing the infrastructure provider interactions (provisioning, upgrading, decommissioning)

Terraform (and Terraform Cloud)

Terraform supports AWS (and other providers) through a software layer, implemented in the Go programming language, which directly consumes the AWS API. To be clear, this is the same API targeted by the boto3 library.

The infrastructure constructs are coded in a domain-specific language called HCL. Terraform implements a state machine that compares the “current” state of the project infrastructure against the version represented by the code. When a delta is detected, Terraform orchestrates the creation and configuration of the resources involved.
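To make the comparison step concrete, here is a toy sketch in Python. It is not how Terraform is actually implemented: the `plan` function, the `Action` type and the resource addresses are all made up for illustration, and the sketch ignores dependency ordering between resources entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str     # "create", "update" or "delete"
    address: str  # hypothetical resource address, e.g. "aws_s3_bucket.logs"


def plan(current: dict, desired: dict) -> list:
    """Compute the actions that turn `current` into `desired`.

    Both arguments map a resource address to a dict of attributes.
    """
    actions = []
    for address in sorted(desired):
        if address not in current:
            actions.append(Action("create", address))
        elif current[address] != desired[address]:
            actions.append(Action("update", address))
    for address in sorted(current):
        if address not in desired:
            actions.append(Action("delete", address))
    return actions


# What is deployed right now (the recorded state)...
current_state = {
    "aws_s3_bucket.logs": {"versioning": False},
    "aws_sqs_queue.jobs": {"visibility_timeout": 30},
}
# ...versus what the code in the repository describes.
desired_state = {
    "aws_s3_bucket.logs": {"versioning": True},
    "aws_lambda_function.worker": {"runtime": "python3.8"},
}

for action in plan(current_state, desired_state):
    print(action.kind, action.address)
```

Running it reports one resource to create, one to update and one to delete. The point is only that the “delta” is computed from two snapshots, which is why keeping the recorded state accurate and shared matters so much.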

While it is technically possible to share the state of a project on an S3 bucket, a Terraform Cloud account is required in order to keep track of deployment operations and to easily associate AWS resources with the corresponding Terraform modules. Deploying a project through the Terraform community edition against an AWS account creates all the required resources, but leaves no evident trace that lets you easily map a specific resource back to the Terraform abstraction that produced it.

HashiCorp provides a complete solution to automate, govern and monitor IaC projects. It works, and it’s as shiny as the good money it costs.

AWS CDK (and CloudFormation)

AWS already had a system implementing both the infrastructure constructs and all the machinery needed to automate deployments, manage stack drift and monitor deployment operations.

It is called CloudFormation and it has been around for a while (its first public release was announced back in 2011).

Infrastructure constructs, called “templates”, can be authored in JSON or YAML. Logically related resources can be organised into self-contained units called “stacks”, and CloudFormation does all the heavy lifting of orchestrating the operations (with powerful and solid out-of-the-box features, like blue-green deployment support and automatic deployment rollbacks).

It works, and it’s free of charge.

The problem with CloudFormation has to do with the experience of authoring templates for complex stacks. It can become overly verbose and will require you to “jump” with your eyes to resolve references.

It’s simply painful.
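To give a feeling of what I mean, here is a deliberately tiny template, written as a Python dict so it can be dumped to JSON. The resource names are invented for the example. The small checker that follows is only a sketch of mine, not an AWS tool: it assumes the JSON long form of `Fn::GetAtt` (a two-item list) and does not know about pseudo parameters such as `AWS::Region`.

```python
import json

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Parameters": {"MaxReceiveCount": {"Type": "Number", "Default": 5}},
    "Resources": {
        "DeadLetterQueue": {"Type": "AWS::SQS::Queue"},
        "JobsQueue": {
            "Type": "AWS::SQS::Queue",
            "Properties": {
                "RedrivePolicy": {
                    # Two references to resolve "by eye" in three lines.
                    "deadLetterTargetArn": {"Fn::GetAtt": ["DeadLetterQueue", "Arn"]},
                    "maxReceiveCount": {"Ref": "MaxReceiveCount"},
                }
            },
        },
    },
    "Outputs": {"JobsQueueUrl": {"Value": {"Ref": "JobsQueue"}}},
}


def referenced_ids(node):
    """Yield every logical ID targeted by a Ref or Fn::GetAtt inside `node`."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "Ref":
                yield value
            elif key == "Fn::GetAtt":
                yield value[0]
            else:
                yield from referenced_ids(value)
    elif isinstance(node, list):
        for item in node:
            yield from referenced_ids(item)


def dangling_references(tpl):
    """Return referenced IDs that are neither a resource nor a parameter."""
    declared = set(tpl.get("Resources", {})) | set(tpl.get("Parameters", {}))
    return sorted(set(referenced_ids(tpl)) - declared)


print(json.dumps(template, indent=2))
print(dangling_references(template))
```

With two queues the references are still manageable. Rename `JobsQueue` in a template with a few dozen resources, and finding every place that points to it by hand stops being fun.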

Keeping all the references in sync, matching outputs and parameters across stacks, is almost impossible. My personal opinion is that it is not a job for humans. It looks more like a job for a compiler. And this must be the same thought that drove the team who launched CDK (Cloud Development Kit) as a developer preview roughly two years ago.

If CloudFormation were a giant, CDK would be the tiny hat on the giant’s head. It basically acts as a transpiler from a high-level, strongly typed programming language to the CloudFormation layer.

In other words, you write a TypeScript/Python/Java class, and it compiles to a CloudFormation template (or a set of CF templates).
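The idea can be shown with a few lines of plain Python. To be clear, this is not the CDK API: `Resource`, `Stack`, `synth` and `get_att` are hypothetical stand-ins that I am using only to show the principle of turning objects into a template, with references produced by the program instead of typed by hand.

```python
import json


class Resource:
    """A minimal, hypothetical stand-in for a construct."""

    def __init__(self, stack, logical_id, type_, **properties):
        self.logical_id = logical_id
        self.type = type_
        self.properties = properties
        stack.resources.append(self)

    def get_att(self, name):
        # The reference is derived from the object, so it cannot go stale.
        return {"Fn::GetAtt": [self.logical_id, name]}


class Stack:
    def __init__(self):
        self.resources = []

    def synth(self):
        """Render the collected resources as a CloudFormation-shaped dict."""
        return {
            "AWSTemplateFormatVersion": "2010-09-09",
            "Resources": {
                r.logical_id: {"Type": r.type, "Properties": r.properties}
                for r in self.resources
            },
        }


class JobsStack(Stack):
    def __init__(self):
        super().__init__()
        dead_letters = Resource(self, "DeadLetterQueue", "AWS::SQS::Queue")
        Resource(
            self,
            "JobsQueue",
            "AWS::SQS::Queue",
            RedrivePolicy={
                # Wiring is done by passing a Python object around.
                "deadLetterTargetArn": dead_letters.get_att("Arn"),
                "maxReceiveCount": 5,
            },
        )


print(json.dumps(JobsStack().synth(), indent=2))
```

The real library does far more than this (typed properties, defaults, assets, cross-stack exports), but the contract is the same: the class is the source, the template is the build output.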

As a comparison, 2000+ lines of well-formatted and syntactically checked CloudFormation JSON can be produced by ~100 lines of generously commented Python code. The generated template is code that you can inspect and compare with its source, just as you could with the assembly generated by a C compiler.

All the references between stacks and resources are automatically resolved by the normal Python 3 interpreter, and the included asset subsystem automates all the packaging steps normally required to publish code to a cloud runtime (like zipping Lambda archives, building and pushing Docker images to ECR, and so on…).

Deploying with CDK is, in reality, a two-step process:

  1. Transpiling your application code into CF templates
  2. Starting a managed deployment operation on the CloudFormation engine

And again, it works and everything can be easily monitored through the AWS Console.

Some background information about the project

Some background information about the team and our infrastructure will be useful to fully understand the reasons behind my final preference.

Our project is designed from scratch as a serverless system, composed of many AWS resources (like buckets, queues, Lambda functions, Fargate services, etc.) on top of another pre-existing system which happily runs within an AWS VPC.

We don’t plan to port this engine (or some of its parts) to another cloud provider anytime soon.

One of the main selling points of Terraform is its support for multiple cloud providers (beware, this is not “write once, deploy everywhere”: each cloud is supported through a different “Provider” implementation).

Of course, having the ability to potentially reuse the same software across different cloud providers is desirable, and it’s also one of the wet dreams of any SaaS company, since it would reduce their lock-in with their cloud provider.

What many software engineers don’t get about migrating a software stack to another provider is that the main job is adapting each and every abstraction to the corresponding version offered by the target provider (as an example, think about the differences between an S3 bucket and an Azure Blob Storage container, or between a Lambda function and any other FaaS platform out there).

You either start from scratch targeting multi-provider support, or accept the engineering costs associated with migrating to another provider at some point in the future. In this scenario, the cost of rewriting the IaC part of the project will hardly weigh more than 10% of the total effort.

So, this is not an aspect that should influence our choice, and Terraform should be evaluated only with respect to the specific AWS support.

Reasons why

After building a small but complete system, using a mix of managed services and general-purpose container runtime solutions, with both Terraform and AWS CDK, I came away with a personal preference for AWS CDK and CloudFormation.

The reasons for my choice are related to team productivity, operations support and total cost of ownership, and I will try to explain my points in the following sections.

In infrastructure as code, code matters

One of the big promises of IaC is team productivity. That should result partly from the improved ability to automate deployment pipelines, but mainly from the increased velocity and confidence in refactoring and reusing infrastructure-related code.

Refactoring experience

When your team starts building a complex application, as my team is doing, you will want to refactor and reshape your project often.

For example, we approached the development creating a number of isolated, decoupled and self-contained components and now we are exploring different ways for “wiring” them into one single, well structured, engine.

During this phase we “changed” our mind multiple times a week, and that resulted in a refactoring of the Terraform/CDK code.

The refactoring experience with CDK is no different from any other Python refactoring: you move variables around, rename them and separate things that change from things that don’t, using the abstractions the programming language provides, functions and classes.

Since Python 3 introduced type hints, the general refactoring experience, supported by many automatic tools (often embedded into IDEs), has improved noticeably.

Everything gets double-checked at “synth” time, and this phase is cleanly separated from the deployment state machine.

While authoring Terraform HCL modules is a better experience than writing CF templates by hand, refactoring them is painful, since it relies on a “change-try-check” routine which is annoying and error-prone.

My Terraform code ended up with many more “hardcoded” values than my CDK constructs.

Code readability

Like any other piece of code, the more readable it is, the easier it will be for team members to understand it, to onboard and to do the right thing when it’s time to evolve it.

Don’t get me wrong, I’m not saying HCL is not readable or easy on the eyes: it looks like a pin-up compared to raw CloudFormation templates (which give me “motion sickness” after a few seconds), but it doesn’t qualify as a programming language.

The expressive power of Python has no rivals yet. The support for Python in IDEs, code editors, debuggers and linters is a generation ahead of the support for Terraform.

Keep in mind also that a good 75% of our team are “native Python readers”, which is certainly relevant in this context.

Getting documentation/help and finding examples

Another common complaint I hear from fellow developers is that finding documentation about AWS CDK is hard, because the project “is young”, and there are not many “ready-made” examples or Stack Overflow answers.

Honestly, this is a common rant about any new technology or tool, and I’m used to seeing such complaints progressively fade out as people become more confident with the new elements.

The project is young indeed, but it is receiving a lot of community attention and AWS investment. Development is active (11 releases in two months!) and the development team is world class.

To get Google to work for you, you need to search for the right thing.

CDK must be seen as a language binding. Technically speaking, the Python version is a language binding on top of the original TypeScript implementation, but what I mean here is that you need to think of CDK as a binding to CloudFormation.

CDK is a library organized around the concept of Construct. It’s a simple class representing a CloudFormation Resource. The library provides two layers of mappings:

  • L1: Cfn-something classes. This is the lowest level of CDK, mapping one-to-one all the existing CloudFormation resources.
  • L2: High level constructs. Wrapper classes for a Cfn-something class with nice automatic extra features

It may happen that L2 lacks support for a specific feature, especially if you’re using a service that is not commonly used or was recently launched by AWS, but the L1 support is always complete and stable.

It is a binding, so L1 feels like programming using ctypes or something like that. You need to be aware of what your code is actually doing.
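The relationship between the two layers can be sketched in plain Python. These are toy classes of mine, not the real CDK ones: `CfnBucket`, `Bucket` and the `cfn_resource` attribute are assumptions made for the example. Only the CloudFormation property names for `AWS::S3::Bucket` are the real ones. As far as I know, the real library offers a similar way to reach the L1 object behind an L2 construct through `node.default_child`.

```python
import json


class CfnBucket:
    """L1-style class: keyword arguments are passed through, one-to-one,
    as AWS::S3::Bucket properties."""

    def __init__(self, logical_id, **properties):
        self.logical_id = logical_id
        self.properties = properties

    def to_cloudformation(self):
        return {
            self.logical_id: {
                "Type": "AWS::S3::Bucket",
                "Properties": self.properties,
            }
        }


class Bucket:
    """L2-style wrapper: a few intent-level options with opinionated defaults."""

    def __init__(self, logical_id, versioned=False, encrypted=True):
        properties = {}
        if versioned:
            properties["VersioningConfiguration"] = {"Status": "Enabled"}
        if encrypted:
            properties["BucketEncryption"] = {
                "ServerSideEncryptionConfiguration": [
                    {"ServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
                ]
            }
        # The underlying L1 object stays reachable for anything the wrapper
        # does not cover.
        self.cfn_resource = CfnBucket(logical_id, **properties)

    def to_cloudformation(self):
        return self.cfn_resource.to_cloudformation()


bucket = Bucket("AssetsBucket", versioned=True)
# A property this toy L2 wrapper knows nothing about: drop down to L1.
bucket.cfn_resource.properties["AccelerateConfiguration"] = {
    "AccelerationStatus": "Enabled"
}
print(json.dumps(bucket.to_cloudformation(), indent=2))
```

The L2 call reads like intent (“a versioned bucket”), while the L1 layer reads like the CloudFormation reference. Knowing that the second is always there underneath is what makes the first safe to adopt.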

So, if you are going to build something, you should search in the following order:

  1. Is an L2 construct available? Check the AWS CDK reference.
  2. If not, what is the corresponding L1 construct? Again, check the AWS CDK reference.
  3. What is the meaning of this specific CloudFormation resource or property? Look into the CloudFormation reference.

Of course, if you look around the internet (GitHub issues, Gitter chats, Stack Overflow questions), you will find many inexperienced developers trying to quickly reach an easy solution without really understanding the steps involved.

So, before you start, ask yourself if you really know what you want to achieve, and how this maps to AWS resources. A good source of inspiration is the growing list of certified solution examples in the AWS Architecture Center.

Another suggested exercise is to port an existing CloudFormation template into a tiny and readable CDK program.

Safety hatches availability

Another important aspect of adopting this kind of toolchain is to think about having some “safety hatches” available for tackling corner cases.

This is why I like to say that “hello world” programs always work, but you need to face a real-life project to actually meet the limits of your toolchain.

What happens if your L1 class doesn’t work as expected? What if CloudFormation itself lacks support for a specific service property or resource?

You will not find many blog posts about this. You will find hundreds of resources explaining how to build simple, commonly used, not production-ready examples. What I call “hello world” projects.

I quickly ran into situations where even CloudFormation didn’t expose an important feature (honestly, it has already happened twice: once for building automatic database schema migrations with Fargate, and once for building a managed SFTP server with whitelistable IP addresses).

The solution was to build a CloudFormation custom resource: a rather complex collaboration of Lambda functions that can be implemented, once again, using Python and the boto3 library.

AWS CDK even exposes a really convenient micro-framework for authoring such resources, which makes the creation of a resource a matter of defining a class and your lambda function handler. Sweet.
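For reference, this is roughly what a hand-written handler for a Lambda-backed custom resource has to do when no framework helps you. It is a minimal sketch, not our production code: `apply_change` is a hypothetical placeholder for the real work (where the boto3 calls would go), the physical ID is made up, and the `send` parameter exists only to make the handler testable without a network.

```python
import json
import urllib.request


def build_response(event, status, physical_id, data=None,
                   reason="See the CloudWatch logs of the handler function"):
    """Build the JSON document CloudFormation expects back."""
    return {
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason,
        "PhysicalResourceId": physical_id,
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},
    }


def send_response(url, body):
    """PUT the response to the pre-signed URL received in the event."""
    payload = json.dumps(body).encode("utf-8")
    # An explicit empty Content-Type stops urllib from adding its default one,
    # which the pre-signed URL may not accept.
    request = urllib.request.Request(
        url, data=payload, method="PUT", headers={"Content-Type": ""}
    )
    urllib.request.urlopen(request)


def apply_change(request_type, properties):
    """Hypothetical placeholder for the real work (e.g. boto3 calls)."""
    return {"Applied": request_type}


def handler(event, context, send=send_response):
    # On Update and Delete, CloudFormation passes the physical ID back to us.
    physical_id = event.get("PhysicalResourceId", "my-custom-resource")
    try:
        data = apply_change(event["RequestType"], event.get("ResourceProperties", {}))
        body = build_response(event, "SUCCESS", physical_id, data)
    except Exception as error:
        # Always answer: a handler that stays silent leaves the stack
        # operation waiting until it times out.
        body = build_response(event, "FAILED", physical_id, reason=str(error))
    send(event["ResponseURL"], body)
    return body
```

The `RequestType` is one of `Create`, `Update` or `Delete`, and the same function receives all three. Most of the code is protocol plumbing, which is exactly the part a framework can take off your hands.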

I’m not saying this is impossible with Terraform, but I consider authoring a custom Terraform provider or provisioner way harder, and it relies on the Go programming language, which is not the first choice given our current team composition.

Total cost of ownership

There is no additional charge for using AWS CloudFormation, while we would have to pay for the Terraform Cloud Team & Governance plan.

To be honest, our specific security policy requires these accounts to be managed through our Enterprise SSO, and as far as I can see that feature is available only with the Enterprise version of Terraform Cloud.

You need to contact HashiCorp to get a quote.