A Brief Introduction to Infrastructure Automation

Sep 3, 2018 08:30 · 3383 words · 16 minute read · Infrastructure · Terraform

The previous post described the somewhat fictional evolution of a single server to a full production web application (based on a number of true stories). Aside from a few references to Amazon S3, it mostly used common terminology for describing the individual components that went into setting up the application’s infrastructure. In this post, we’ll take a look at how to approach automating the creation of these components.

We really have to consider two primary approaches when talking about automation: starting from nothing, and starting with an existing system. While it’s probably more common to start with an existing system, I think it’s easier to learn automation by starting from nothing. Once you get a handle on that, you can begin transitioning your existing system by building a mirror environment.

Let’s discuss how to approach automating the creation of infrastructure required by the second diagram in the previous post.

Figure 1

Building this system requires automating the following steps:

  • Creating the server.
  • Mounting a persistent disk on the server for the database.
  • Mounting a persistent disk on the server for the logs.
  • Pushing code to the server.
  • Installing the database on the server.
  • Configuring the database to store its data on the persistent disk.
  • Creating services to run the database and web server.
  • Configuring a service to update code on our servers automatically and then reboot the necessary processes.

You’ll notice that these tasks fall into two distinct categories:

  • Creating and attaching hardware.
  • Installing and configuring software.

The industry refers to these two categories as infrastructure automation and provisioning, respectively.

Ideally this post would cover both infrastructure automation and provisioning, since they tend to go hand in hand, but it’s a bit too much to cover all of that in a single post. Infrastructure automation must happen before provisioning, so we’ll stick with that for now.

Infrastructure Automation

You have quite a few choices for tooling to automate creating infrastructure. Major cloud providers generally provide their own: Amazon provides CloudFormation templates and a robust CLI, and Google provides Deployment Manager and a robust CLI. Provisioning tools like Ansible and Chef provide solutions. Having been burned by vendor lock-in before, I generally prefer a tool called Terraform, which allows managing infrastructure across cloud providers.

It’s extremely important when learning automation that you first understand how to perform all of the automated steps manually. I cannot possibly stress that enough. Trying to learn both your cloud provider and your automation tool at the same time will feel somewhat like learning how to translate Hungarian to Japanese, except without already knowing Hungarian or Japanese. Learn to do what you need in your cloud provider first, and then learn how to accomplish it in your automation tool.

Here's an example of one possible mapping between the concepts in our system diagram and the services provided by AWS:

  • Server – EC2: Instance
  • Subnet – VPC: Subnet
  • Load Balancer – EC2: Load Balancer
  • CDN – CloudFront
  • Database – RDS
  • ELK stack – CloudWatch
  • Firewall Rule – EC2: Security Group

When picking a tool, it’s important to take into account the amount of jargon you will have to map between your cloud provider and the tool you choose. There’s a lot to learn on both ends, and you’ll likely scratch your head at least a few times trying to make sense of how the vocabulary translates between your tools. If you don’t have a good grasp on the jargon from your cloud provider, you may find this task quite difficult. If you do have a good grasp on it, then it’s time to pick an infrastructure automation tool.

When you do this, I suggest picking a tool that allows you to define infrastructure as code. This allows you to check your code into a repository, which gives you versioning, branching, merging, and the general ability to collaborate. If you ever intend to run the infrastructure automation as part of your continuous delivery pipeline, this is necessary. Major infrastructure automation tools support this.

I also suggest that you pick a tool that supports declarative infrastructure definitions.

Declarative Infrastructure Definition

Declarative infrastructure definition requires a bit more explanation. The alternative is imperatively defining infrastructure. The difference between these is that declarative infrastructure definitions describe what you want at the end, whereas imperative infrastructure definitions describe how to get to what you want at the end.

Here’s an example of what an imperative infrastructure definition using Amazon’s NodeJS SDK might look like:

const AWS = require('aws-sdk');

const ec2 = new AWS.EC2({apiVersion: '2016-11-15'});

async function buildInfrastructure(ami, zone) {
  // Create a 100 GB volume for the logs.
  const volume = await ec2.createVolume({
    AvailabilityZone: zone,
    Size: 100,
  }).promise();

  // Launch a single instance from the given image.
  const instanceResult = await ec2.runInstances({
    ImageId: ami,
    MinCount: 1,
    MaxCount: 1,
    InstanceType: 't2.small',
  }).promise();

  const instance = instanceResult.Instances[0];

  // Attach the volume to the new instance (in practice you'd also wait
  // for the instance to reach the "running" state first).
  await ec2.attachVolume({
    Device: '/dev/sdf',
    InstanceId: instance.InstanceId,
    VolumeId: volume.VolumeId,
  }).promise();
}

buildInfrastructure('ami-XXXXXXXXXXXXX', 'us-east-1a');

If you run this three times, you’ll get three environments, which you likely don’t want.

Here’s an example of what the same declarative infrastructure definition using Terraform might look like:

resource "aws_instance" "web" {
  ami               = "${var.ami}"
  availability_zone = "${var.zone}"
  instance_type     = "t2.small"
}

resource "aws_volume_attachment" "logs" {
  device_name  = "/dev/sdf"
  volume_id    = "${aws_ebs_volume.logs.id}"
  instance_id  = "${aws_instance.web.id}"
  skip_destroy = true
}

resource "aws_ebs_volume" "logs" {
  availability_zone = "${var.zone}"
  size              = 100
}

If you run this three times, you’ll get one environment, which is probably what you want.

This is the primary reason for my personal preference for declarative infrastructure definitions: applying one does not require the definition to account for the current state of the infrastructure. The tool handles this for you behind the scenes by recording the resources it has already created and computing the difference between that record and your definition on every run.

Immutable Infrastructure

Once you have your infrastructure defined in an automation tool, you should have the ability to create and destroy specific pieces of it at will. Provided that you’ve designed your application(s) to run on stateless servers, this allows you to deploy servers rather than deploy code. For the truly adventurous, this also allows you to disable remote access to these servers, which effectively renders them immutable. This technique, called immutable infrastructure, provides a number of benefits.

Immutable infrastructure allows creating identical replicas of every version of an application. This allows you to build auto-scaling mechanisms, to replicate issues occurring in a production environment without interfering with the actual production environment, to run tests against exact production infrastructure before deploying, and much more. These are invaluable techniques for any production system.

This technique also significantly reduces the complexity of setting up continuous deployment and rollbacks, since it makes deployments atomic. Rather than updating existing services on existing servers, which can leave a server in a nonfunctioning state, old machines are replaced by new machines. Deployments simply require pointing the production load balancer at a fresh set of machines; rolling back simply requires pointing the production load balancer at the old machines.
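
Here's a minimal sketch of that pattern in Terraform, assuming a hypothetical ami variable that your build pipeline updates with each release:

variable "ami" {
  description = "AMI produced by the latest build (hypothetical)"
}

resource "aws_instance" "web" {
  ami           = "${var.ami}"
  instance_type = "t2.small"

  # Changing the AMI forces Terraform to replace the instance. With
  # create_before_destroy, the new instance is created before the old
  # one is destroyed, so traffic can be shifted over (or rolled back)
  # as a single switch rather than an in-place upgrade.
  lifecycle {
    create_before_destroy = true
  }
}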

While you may end up relying on your cloud providers to perform these tasks for you, it’s still helpful to understand the conditions under which these types of deployments and rollbacks are possible. Reliable, immutable deployments are necessary for most of these functions.

A concrete example

Note: This example only uses Terraform and AWS’s most basic concepts. While the resulting environment is technically sound, you might want to dig into something a bit more substantial after reading it, such as this post about setting up continuous delivery using Beanstalk and CodePipeline.

We’ve spent a lot of time discussing abstract concepts. When I read blog posts similar to this one, I often feel dissatisfied at the end because there’s nothing concrete to take away. I won’t do that to you. Rather than leaving you floating in a sea of abstractions, let’s take a look at a simple but realistic example.

I’m most familiar with Terraform, so I’m going to use it for this example. If you’re not familiar with it, I’ll do my best to explain the fundamentals.

Terraform interacts with various cloud platforms via providers, each of which takes a configuration that tells Terraform how to interact with that platform. A provider exposes a number of named resources, which allow you to create infrastructure. Each named resource definition takes a number of inputs, which specify how to create it, and exposes a number of outputs that describe details only known after creation (e.g. the IP address of a machine, its ARN, and so on).

For example, consider the following Terraform code:

provider "aws" {
  region     = "us-east-1"
  access_key = "12345"
  secret_key = "12345"
}

resource "aws_instance" "web" {
  ami               = "ami"
  availability_zone = "us-east-1a"
  instance_type     = "t2.small"
}

The first block of code specifies the configuration of the AWS provider. It specifies the region in which to create resources and the credentials to use to connect.

The second block of code tells Terraform to create an aws_instance resource named “web.” The “web” resource uses the “ami” image, will get created in the “us-east-1a” availability zone, and will have the hardware specifications of a “t2.small” AWS instance.

When checking Terraform code into a repository, you shouldn’t push access keys and secret keys. If you omit them, Terraform defaults to using the values stored in the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. It falls back to a number of other authentication methods as well, which you can read about here. This post omits those keys, which will cause Terraform to fall back to whatever default AWS profile you have currently configured on your system.
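
As another sketch of keeping keys out of code, you can point the provider at a named profile from your local AWS credentials file; the profile name below is just an illustration:

provider "aws" {
  region  = "us-east-1"

  # Hypothetical profile defined in ~/.aws/credentials; no keys in code.
  profile = "terraform-example"
}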

Let’s build a stateless website with immutable servers, load balanced across multiple availability zones:

Figure 2

Subnets in AWS live in a VPC, so let’s create that:

provider "aws" {
  region     = "us-east-1"
}

resource "aws_vpc" "app" {
  cidr_block = "10.0.0.0/16"
  enable_dns_hostnames = true

  tags {
    Name = "application"
  }
}

The CIDR block in this configuration (10.0.0.0/16) specifies that machines within this VPC can have internal IP addresses between 10.0.0.0 and 10.0.255.255 (the first 16 bits of the address are fixed, leaving roughly 65,000 addresses for hosts).

Now let’s create our public and private subnets.

resource "aws_subnet" "public1" {
  vpc_id = "${aws_vpc.app.id}"
  cidr_block = "10.0.1.0/24"
  availability_zone = "us-east-1a"

  tags {
    Name = "public-subnet-1"
  }
}

resource "aws_subnet" "public2" {
  vpc_id = "${aws_vpc.app.id}"
  cidr_block = "10.0.2.0/24"
  availability_zone = "us-east-1b"

  tags {
    Name = "public-subnet-2"
  }
}

resource "aws_subnet" "private1" {
  vpc_id = "${aws_vpc.app.id}"
  cidr_block = "10.0.3.0/24"
  availability_zone = "us-east-1a"

  tags {
    Name = "private-subnet-1"
  }
}

resource "aws_subnet" "private2" {
  vpc_id = "${aws_vpc.app.id}"
  cidr_block = "10.0.4.0/24"
  availability_zone = "us-east-1b"

  tags {
    Name = "private-subnet-2"
  }
}

These subnets have similar configurations to the VPC, aside from one extra input specifying which VPC to use. As you can see, we’ve mapped the VPC’s “id” output variable to each subnet’s “vpc_id” input variable using Terraform’s reference syntax, which roughly follows the format ${resource_type.resource_name.output_variable}.

We’ve also configured the public subnet CIDRs to allow internal IP addresses between 10.0.1.0 and 10.0.2.255, and the private subnet CIDRs to allow internal IP addresses between 10.0.3.0 and 10.0.4.255.

Both public subnets currently have no access to the internet, which they’ll need. Let’s create an internet gateway and a route table to allow them to reach it.

resource "aws_internet_gateway" "app" {
  vpc_id = "${aws_vpc.app.id}"

  tags {
    Name = "application"
  }
}

resource "aws_route_table" "public" {
  vpc_id = "${aws_vpc.app.id}"

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = "${aws_internet_gateway.app.id}"
  }

  tags {
    Name = "public"
  }
}

resource "aws_route_table_association" "public1" {
  subnet_id = "${aws_subnet.public1.id}"
  route_table_id = "${aws_route_table.public.id}"
}

resource "aws_route_table_association" "public2" {
  subnet_id = "${aws_subnet.public2.id}"
  route_table_id = "${aws_route_table.public.id}"
}

That should be enough for setting up our network. Before we create our load balancer, let’s create firewall rules to allow it to accept internet traffic, and also to allow our application to accept connections from the load balancer.

resource "aws_security_group" "internet_http" {
  name        = "internet-http"
  description = "Allow HTTP traffic to and from the internet"
  vpc_id = "${aws_vpc.app.id}"

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port = 80
    to_port = 80
    protocol = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags {
    Name = "internet-http"
  }
}

resource "aws_security_group" "application_lb" {
  name        = "application-lb"
  description = "Allow HTTP traffic between LB and application"
  vpc_id = "${aws_vpc.app.id}"

  tags {
    Name = "application-lb"
  }
}

resource "aws_security_group" "application" {
  name        = "application"
  description = "Allow HTTP traffic between application and load balancer"
  vpc_id = "${aws_vpc.app.id}"

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    security_groups = ["${aws_security_group.application_lb.id}"]
  }

  tags {
    Name = "application"
  }
}

Now that we have our firewall rules defined, let’s go ahead and create our load balancer.

resource "aws_alb" "application" {
  name               = "application"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [
    "${aws_security_group.internet_http.id}",
    "${aws_security_group.application_lb.id}"
  ]
  subnets            = [
    "${aws_subnet.public1.id}",
    "${aws_subnet.public2.id}"
  ]
}

resource "aws_alb_target_group" "application" {  
  name     = "application"
  port     = "80"
  protocol = "HTTP"
  vpc_id   = "${aws_vpc.app.id}"

  health_check {    
    healthy_threshold   = 3    
    unhealthy_threshold = 5    
    timeout             = 10    
    interval            = 30    
    path                = "/"    
    port                = 80
  }
}

resource "aws_alb_listener" "application" {  
  load_balancer_arn = "${aws_alb.application.arn}"  
  port              = 80
  protocol          = "HTTP"
  
  default_action {    
    target_group_arn = "${aws_alb_target_group.application.arn}"
    type             = "forward"  
  }
}

resource "aws_alb_listener_rule" "application" {
  depends_on   = ["aws_alb_target_group.application"]  
  listener_arn = "${aws_alb_listener.application.arn}"  
  priority     = 100

  action {    
    type             = "forward"    
    target_group_arn = "${aws_alb_target_group.application.id}"  
  }   

  condition {
    field  = "path-pattern"    
    values = ["/*"]
  }
}

output "dns" {
  value = "${aws_alb.application.dns_name}"
}

There’s a lot to unpack here. We created a load balancer in our public subnets, and we applied two security groups to it: one to allow inbound traffic from the internet, and one empty one marking it as a load balancer. This empty security group allows our application’s firewall rule to identify the load balancer as a valid traffic source.

We’ve created a load balancer target group, which defines a group of instances, how to check their health, and the protocol each instance expects.

We’ve created a load balancer listener, which defines the port and protocol on which our load balancer will listen.

We’ve created a listener rule that forwards all incoming traffic from all URLs to instances in our target group.

Lastly, we’ve told Terraform to keep track of the load balancer DNS in an output variable named “dns.” Once we’re finished, you can run terraform output dns to grab the DNS and access the running application.

Now let’s create our instances. Instead of using a premade image from Amazon, which usually doesn’t have our proprietary software installed on it, we’ll build one using a tool from HashiCorp called Packer. This tool provisions and packages images for use in virtual machines or on a cloud provider. In this case, we’re going to use it to generate an Ubuntu 16.04 AMI on Amazon running a default Nginx server.

Create a file named packer.json with the following content:

{
  "builders": [{
    "type": "amazon-ebs",
    "region": "us-east-1",
    "source_ami": "ami-04169656fea786776",
    "instance_type": "t2.small",
    "ssh_username": "ubuntu",
    "ami_name": "Ubuntu 16.04 Nginx - {{timestamp}}"
  }],
  "provisioners": [{
    "type": "shell",
    "inline": [
      "sudo apt-get update",
      "sudo apt-get -y install nginx"
    ]
  }]
}

Build the image with Packer:

packer build packer.json

When this runs, Packer will spin up an AWS instance, run the shell provisioner on it, package the instance into an AMI, and then tear down all of the intermediate resources it created along the way.

It will eventually print a line with details for your custom AMI:

us-east-1: ami-XXXXXXXXXXXXX

Copy that AMI and use it in your Terraform instance resources:

resource "aws_instance" "web1" {
  ami                    = "ami-XXXXXXXXXXXXX"
  availability_zone      = "us-east-1a"
  instance_type          = "t2.small"
  vpc_security_group_ids = ["${aws_security_group.application.id}"]
  subnet_id              = "${aws_subnet.private1.id}"
}

resource "aws_instance" "web2" {
  ami                    = "ami-XXXXXXXXXXXXX"
  availability_zone      = "us-east-1b"
  instance_type          = "t2.small"
  vpc_security_group_ids = ["${aws_security_group.application.id}"]
  subnet_id              = "${aws_subnet.private2.id}"
}

Each instance is created in its respective subnet, and it has a single firewall rule allowing incoming traffic from the load balancer.

Finally, connect these instances to your load balancer:

resource "aws_alb_target_group_attachment" "web1" {
  target_group_arn = "${aws_alb_target_group.application.arn}"
  target_id        = "${aws_instance.web1.id}"
  port             = 80
}

resource "aws_alb_target_group_attachment" "web2" {
  target_group_arn = "${aws_alb_target_group.application.arn}"
  target_id        = "${aws_instance.web2.id}"
  port             = 80
}

And that’s it. If you run terraform apply against this code, you can visit your extremely stable, default Nginx page by opening the URL returned by running terraform output dns in a browser. It isn’t a particularly interesting application, but it does showcase the concepts we discussed:

  • We’ve defined all of our infrastructure using declarative code.
  • We’ve used Packer to create images used by our immutable servers.

With a small amount of effort, you can extend this example to:

  • Set up a continuous delivery tool that builds an image via Packer, injects the resulting AMI into your infrastructure configuration, and then runs Terraform to deploy your new image.
  • Use Terraform workspaces to create QA, staging, and production environments (sketched after this list).
  • Replace the archaic individual instances with an autoscaling group (also sketched below).
  • Templatize the example using Terraform variables and modules, allowing you to reuse it for multiple applications.
  • Set up remote storage of Terraform state, so that multiple developers can extend the infrastructure (also sketched below).
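
Here’s a rough sketch of the autoscaling idea: the two individual instances get swapped for a launch configuration and an autoscaling group that reuses the subnets, security group, and target group defined earlier (the AMI is still the Packer-built placeholder):

resource "aws_launch_configuration" "web" {
  image_id        = "ami-XXXXXXXXXXXXX"
  instance_type   = "t2.small"
  security_groups = ["${aws_security_group.application.id}"]

  # The autoscaling group references this by name, so create the new
  # launch configuration before destroying the old one.
  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "web" {
  launch_configuration = "${aws_launch_configuration.web.name}"
  vpc_zone_identifier  = ["${aws_subnet.private1.id}", "${aws_subnet.private2.id}"]
  min_size             = 2
  max_size             = 4

  # Registers instances with the target group automatically, replacing
  # the aws_alb_target_group_attachment resources above.
  target_group_arns    = ["${aws_alb_target_group.application.arn}"]

  tag {
    key                 = "Name"
    value               = "application"
    propagate_at_launch = true
  }
}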

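And here’s a sketch of the workspace and remote state ideas, assuming a hypothetical S3 bucket you’ve already created to hold shared state; the backend block is new, and the VPC resource shows how terraform.workspace can be woven into existing names:

terraform {
  backend "s3" {
    bucket = "my-terraform-state-bucket"  # hypothetical bucket name
    key    = "application/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_vpc" "app" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true

  tags {
    # terraform.workspace lets one configuration produce per-environment
    # copies, e.g. "application-staging" and "application-production".
    Name = "application-${terraform.workspace}"
  }
}

Running terraform workspace new staging followed by terraform apply would then build a second, isolated copy of the environment.
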
If you’ve never seen infrastructure automation in action, I hope you’ve enjoyed reading through this. There’s a lot more to come.

Why learn all of this?

It may seem that manually making incremental changes to your infrastructure over time will scale, and for a while it might; incremental changes are perhaps the easiest part to scale. It’s when you lose the volume containing the database configuration and the crontab written by the engineer who just left the organization that you find out why manual effort doesn’t scale.

Here is a woefully incomplete list of reasons to automate:

  • Hardware fails. Rebuilding a manually provisioned machine from scratch takes time and potentially a lot of trial and error (unless you have a perfect memory of what every person did on that machine).
  • People leave. Part of your infrastructure may have been provisioned by a former employee. Hopefully you parted on good terms, or else you might have to relearn and rebuild from scratch.
  • People join. It’s difficult for new employees to ramp up without understanding what’s available in their application’s environment. While you might have great documentation, it often diverges from what’s actually deployed. Coded infrastructure rarely diverges, and it serves as great documentation for new employees.
  • Cloud providers change. Remember when Amazon renamed its load balancers to “Classic Load Balancer” and deprecated them? Your engineering team created ten of them per environment, because microservices. Have at it.
  • Code changes. Code often depends on infrastructure. If you ever need to roll back code, there’s a good chance that you’ll have to update the corresponding infrastructure. Reversing incremental manual infrastructure changes may cause issues.
  • Environments multiply. Production, staging, QA, sales demonstrations, etc… The more environments you have, the more places you’ll have to apply manual updates to an application. Each manual update is a potential mistake.
  • People make mistakes. Your production environment is just one mistake away from deletion. Infrastructure automation provides a safe and reproducible way to test infrastructure changes.

Manual updates only scale when everything goes according to plan. That should give you pause, since everything you’re building is designed to handle failure. Your process should have just as much resilience built into it as the systems it supports.

If it seems like the amount of work required to get automation up and running isn’t worth the benefit it adds, keep in mind that you only have to do it once. You can reuse almost all of the infrastructure definitions for future projects, or for old projects. The one-time effort required to set up automation will save you an incredible amount of time in the long run.

If you get anything out of this article, it should be this: manually deploying infrastructure doesn’t scale. All forms of infrastructure automation are superior to a manual approach. You will not waste time learning infrastructure automation.

Automate from day one.

Outro

This series of blog posts was intended to lay the groundwork for future posts that get into the specific details of building and maintaining a production web application. The first post discussed motivations; the second post discussed our end goal; this post discussed the high-level approach to creating and maintaining hardware. The next post will discuss in more detail how to provision these machines and perhaps how to tie everything into a continuous deployment pipeline.

I hope you’ve enjoyed the series so far. Please let me know if you have any thoughts, suggestions, or requests.

Happy coding!

The next post is live. You can read it here.
