Setting up Monitoring and Alerting on Amazon AWS with Terraform

Jun 30, 2018 18:43 · 1345 words · 7 minute read DevOps Amazon AWS

A production web application should have a number of safeguards in place to prevent failures. Many outages have preliminary symptoms that a proper monitoring solution can detect, and these detections can send alerts to the appropriate people capable of fixing them. Amazon’s CloudWatch service can both monitor systems and also alert when something goes wrong. Let’s take a look at how to set it up.

We could go into the Amazon dashboard to do this, but since I have a rule to automate everything that goes into production, I’ll show you how to do it in code. This guarantees that in the case of a catastrophic failure, I can get the application up and running again as quickly as possible. It also reduces the burden of knowledge required to set everything up from scratch. Overall, it decreases risk and improves the application’s robustness.

Terraform has been my infrastructure automation tool of choice for many years. It provides a declarative syntax for deploying infrastructure, and it allows me to create and connect infrastructure across multiple different providers. In this case, we’re going to use it to configure CloudWatch logs and alerts for a single application server.

Prerequisites

Before we jump into the details, make sure you have the following software and services installed and configured:

  • An Amazon AWS account
  • The AWS CLI, configured with credentials for your account
  • Terraform

You’ll also need to make sure you have a public and private key created at the following paths (we’ll use these to SSH to our instances):

  • Private key: ~/.ssh/id_rsa
  • Public key: ~/.ssh/id_rsa.pub
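If you don't already have a keypair at those paths, you can generate one with ssh-keygen (the empty passphrase keeps the demo simple; use a real passphrase for anything beyond a tutorial):

```shell
# Create ~/.ssh if needed, then generate a 4096-bit RSA keypair at the
# default path -- but only if one doesn't already exist.
mkdir -p ~/.ssh
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
```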

And lastly, all of the commands in this blog post assume you’re using a Mac or Linux. Most of them will work on Windows too, but some may require modification.

Step 1: Set up Terraform

We need to set up Terraform before we can create anything. By default, it doesn’t know anything about our AWS account, and it stores state locally. Let’s fix that.

First, let’s initialize a new Terraform project:

mkdir cloudwatch-demo
cd cloudwatch-demo
terraform init

Now let’s create a bucket to store Terraform state. I’ll name it terraform-artifacts-bucket for this tutorial, but you’ll have to pick something unique, since AWS requires globally unique names for buckets.

aws s3api create-bucket --acl private --bucket terraform-artifacts-bucket

Terraform recommends enabling bucket versioning, so that we can recover earlier state if something goes wrong. Let’s do that as well:

aws s3api put-bucket-versioning --bucket terraform-artifacts-bucket --versioning-configuration Status=Enabled

Great. Now we can use the bucket for storing Terraform artifacts. Create a file named terraform.tf in the root of your project, and write the following in it:

terraform {
  backend "s3" {
    bucket = "terraform-artifacts-bucket"
    key    = "cloudwatch-demo/terraform.tfstate"
    region = "us-east-1"
  }
}

Now run the following command to initialize the backend:

terraform init

It should create a .terraform folder in the root of your project.

Now that we have a backend configured, let’s configure our project to use our AWS user. Add the following to terraform.tf:

provider "aws" {
  access_key = "${var.access_key}"
  secret_key = "${var.secret_key}"
  region     = "${var.region}"
}

This tells Terraform to use the access key and secret key from our local project variables. We’ll have to define those, since they don’t exist yet.

Create a file named variables.tf with the following contents:

variable "access_key" {}
variable "secret_key" {}
variable "region" {
  default = "us-east-1"
}
variable "alarms_email" {}

Tell Terraform what values to use by creating a file named terraform.tfvars with the following contents:

access_key = "your-aws-access-key-here"
secret_key = "your-aws-secret-key-here"
alarms_email = "your@email.here"

We’ll have to tell Terraform to initialize the aws provider by running the following command:

terraform init

Lastly, just in case you’re storing this project in Git (you should be!), let’s tell Git to ignore our sensitive Terraform files by creating a file named .gitignore with the following contents:

**/.terraform/*
*.tfstate
*.tfstate.*
crash.log
*.tfvars

Make sure you do not check any API keys into your repository! For simplicity, we’ve stored sensitive keys in a .tfvars file. Terraform recommends storing them in environment variables.
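Terraform reads any environment variable named TF_VAR_<name> as the value of the matching input variable, so the environment-variable approach would look something like this (with your real values substituted, of course):

```shell
# Terraform picks these up as var.access_key, var.secret_key, and
# var.alarms_email, so no terraform.tfvars file is needed.
export TF_VAR_access_key="your-aws-access-key-here"
export TF_VAR_secret_key="your-aws-secret-key-here"
export TF_VAR_alarms_email="your@email.here"
```

With these exported, you can delete terraform.tfvars entirely and nothing sensitive ever touches the repository.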

Step 2: Create an application server

Monitoring isn’t very useful without a real server to monitor, so let’s use Terraform to create one. I won’t go into detail about how this works, since it’s not the point of this post. Create a file named application.tf with the following contents:

resource "aws_key_pair" "ssh" {
  key_name   = "default"
  public_key = "${file("~/.ssh/id_rsa.pub")}"
}

resource "aws_security_group" "web" {
  name        = "webserver"
  description = "Public HTTP + SSH"

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    # Allow all outbound traffic. Restricting egress to TCP would break
    # DNS lookups (UDP), which apt-get needs.
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "web" {
  ami                    = "ami-2757f631"
  instance_type          = "t2.micro"
  key_name               = "${aws_key_pair.ssh.id}"
  vpc_security_group_ids = [ "${aws_security_group.web.id}" ]

  provisioner "remote-exec" {
    connection {
      type = "ssh"
      user = "ubuntu"
      private_key = "${file("~/.ssh/id_rsa")}"
      timeout = "5m"
      agent = true
    }

    inline = [
      "sudo apt-get update -y && apt-get upgrade -y",
      "sudo apt-get install nginx -y"
    ]
  }
}

output "web_public_dns" {
  value = "${aws_instance.web.public_dns}"
}

Create the instance by running the following command:

terraform apply

Lastly, test that it’s working:

open "http://$(terraform output web_public_dns)"

A browser should open to your new server running a default Nginx service.

Step 3: Set up CloudWatch alarms

Now that we’ve set everything up, we can set up some alarms. We’ll do this by creating a topic and a subscription within Amazon Simple Notification Service (SNS), and then tying a CloudWatch alarm to that topic.

SNS is essentially an event system within Amazon. Topics receive events from within the Amazon ecosystem, and subscriptions tie topics to actual endpoints – an email address, a web server, etc. We’ll create a topic for the alarms to publish to, and then we’ll create a subscription that sends you an email.

Create a new file named alarms.tf with the following contents:

resource "aws_sns_topic" "alarm" {
  name = "alarms-topic"
  delivery_policy = <<EOF
{
  "http": {
    "defaultHealthyRetryPolicy": {
      "minDelayTarget": 20,
      "maxDelayTarget": 20,
      "numRetries": 3,
      "numMaxDelayRetries": 0,
      "numNoDelayRetries": 0,
      "numMinDelayRetries": 0,
      "backoffFunction": "linear"
    },
    "disableSubscriptionOverrides": false,
    "defaultThrottlePolicy": {
      "maxReceivesPerSecond": 1
    }
  }
}
EOF

  provisioner "local-exec" {
    command = "aws sns subscribe --topic-arn ${self.arn} --protocol email --notification-endpoint ${var.alarms_email}"
  }
}

Terraform can’t create email subscriptions, because they require out-of-band confirmation, so we use a local-exec provisioner instead. This code runs on your local machine and uses your local AWS CLI to create the email subscription. Once you apply this, you’ll have to open your email and confirm the subscription.

Now that we have the subscription working, we can set up some alarms. Let’s start with two basic ones:

  • CPU usage
  • Health check failures

Add the following to alarms.tf:

resource "aws_cloudwatch_metric_alarm" "cpu" {
  alarm_name                = "web-cpu-alarm"
  comparison_operator       = "GreaterThanOrEqualToThreshold"
  evaluation_periods        = "2"
  metric_name               = "CPUUtilization"
  namespace                 = "AWS/EC2"
  period                    = "120"
  statistic                 = "Average"
  threshold                 = "80"
  alarm_description         = "This metric monitors ec2 cpu utilization"
  alarm_actions             = [ "${aws_sns_topic.alarm.arn}" ]

  dimensions {
    InstanceId = "${aws_instance.web.id}"
  }
}

resource "aws_cloudwatch_metric_alarm" "health" {
  alarm_name                = "web-health-alarm"
  comparison_operator       = "GreaterThanOrEqualToThreshold"
  evaluation_periods        = "1"
  metric_name               = "StatusCheckFailed"
  namespace                 = "AWS/EC2"
  period                    = "120"
  statistic                 = "Average"
  threshold                 = "1"
  alarm_description         = "This metric monitors ec2 health status"
  alarm_actions             = [ "${aws_sns_topic.alarm.arn}" ]

  dimensions {
    InstanceId = "${aws_instance.web.id}"
  }
}

Run:

terraform apply

And that’s it!

Step 4: Testing it out

Let’s make sure this works. SSH into your instance:

ssh ubuntu@$(terraform output web_public_dns)

Now we’ll force the CPU to spike:

yes > /dev/null
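A single yes process is enough to saturate the one vCPU on a t2.micro. If you were testing on an instance with multiple cores, you’d need one process per core to push average utilization past the 80% threshold – a quick sketch:

```shell
# Spawn one CPU-burning `yes` process per core; nproc reports the
# number of available cores.
for _ in $(seq "$(nproc)"); do
  yes > /dev/null &
done
```

Kill them all afterwards with pkill yes.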

Let that run for five minutes or so – the alarm needs two consecutive two-minute periods above the threshold before it fires. You should then receive an email that looks something like this:

You are receiving this email because your Amazon CloudWatch Alarm “web-cpu-alarm” in the US East (N. Virginia) region has entered the ALARM state, because “Threshold Crossed: 1 datapoint [99.8064516129032 (01/07/18 03:56:00)] was greater than or equal to the threshold (80.0).”

Press “Control+C” to stop the yes process.

Well done: you have alarms!

Step 5: Clean up

All of the instances and resources we created cost money. Make sure to destroy all of them by running the following command:

terraform destroy

And that’s it. Happy coding!

You can get a copy of all the code here.
