DEV306 Monitoring for Operational Outcomes and Application Insights Lab Guide

Introduction

There are two goals of monitoring:

Achieve situational awareness to provide timely and effective responses and
Gain insights for the business, development, and operations that enable proactive courses of action.

In this workshop, we take you through the process of developing and implementing a workload monitoring plan to achieve these objectives. You will utilize logs, metrics, dashboards, events, and alarms within the definition of your plan, and then you will implement the plan using AWS tools, services, and features.

You will also alert on major categories of events, monitor for operational outcomes, trigger responses, and deliver insights.

This is an advanced lab featuring exercises on developing a monitoring plan and implementing that plan using CloudWatch. It is best to have some familiarity with AWS, and specifically CloudWatch, before attempting the lab. We will dive deeper into CloudWatch than an introductory session and do hands-on configuration in the console.

Operational Excellence concepts

Operational Excellence is one of the 5 pillars of the Well-Architected program. Well-Architected was created to provide AWS customers with best practices and guidance on how to design and operate applications in the cloud. You can find out more about the Well-Architected program at this link and the Operational Excellence pillar at this link.

In this lab you will apply the concepts of Infrastructure as Code and Operations as Code to the following activities:

Deployment of Infrastructure
Performing operations activities
Event Management and Incident Response

Included in the lab guide are bonus sections that can be completed if you have time, or later if interested.

Note: You will be billed for any applicable AWS resources used in this lab that are not covered in the AWS Free Tier. https://aws.amazon.com/free/

Lab Requirements

You will need the following to be able to perform this lab:

Your own device for console access
A personal (not shared) AWS account that you are able to use for testing, that is not used for production or other purposes
A user with Administrator permissions to log in as while performing the lab, and an existing EC2 Key Pair
An available VPC in the Ireland (eu-west-1) region

NOTE: If you complete this lab, you will be billed for any applicable AWS resources used that are not covered in the AWS Free Tier. https://aws.amazon.com/free/

Lab Setup

Deploy the Lab Infrastructure using CloudFormation

Launch the Lab by clicking here
Confirm that you are in the Ireland (eu-west-1) region
On the Select Template page choose Next
On the Specify Details page:
1. Define your Stack name by entering DEV306
2. Provide your email address. This will only be used in the lab, and will only be used to send you SNS notifications of events.
3. Define the HTTPLocation CIDR subnet address to limit who can connect to your resources.
  - You can use http://checkip.amazonaws.com/ to identify your IP address and restrict access to it specifically by entering the IP followed by the CIDR subnet mask of /32 (with no spaces between IP and the CIDR subnet mask).
4. Select an InstanceType. We recommend selecting the default free tier eligible t2.micro option.
5. Select your EC2 KeyPair from the KeyName pull down list.
6. Define your SSHLocation entering your IP address with CIDR subnet mask
7. Define your Workload name as ImageTrends
8. Choose Next
On the Options page optionally specify additional Tags, and then scroll down and choose Next
On the Review page:
1. Review your selections
2. Scroll down and check the box I acknowledge that AWS CloudFormation might create IAM resources.
3. Choose Create
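
The HTTPLocation and SSHLocation values in the steps above are an IP address followed by /32. As a minimal sketch, the standard library `ipaddress` module can form and check such a value. The address 203.0.113.25 is a documentation placeholder and the function name is ours, not part of the lab.

```python
import ipaddress


def single_host_cidr(ip: str) -> str:
    """Return ip in CIDR form with a /32 mask, which covers one IPv4 host."""
    # IPv4Address raises a ValueError subclass if the text is not a valid address.
    address = ipaddress.IPv4Address(ip.strip())
    return f"{address}/32"


cidr = single_host_cidr("203.0.113.25")  # placeholder address
print(cidr)
print(ipaddress.ip_network(cidr).num_addresses)  # number of hosts the mask allows
```

Stripping whitespace first matters because a value copied from a browser may carry a trailing newline, and the template expects no spaces in the CIDR value.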

Confirm your subscription to the SNS Notification

After the CloudFormation script has completed, open the email client for the address you provided earlier.
Locate an email from AWS Notifications with the subject AWS Notifications - Subscription Confirmation.
Click through the Confirm subscription link to confirm your subscription.

Note: In this lab you will interact with the web-based application you just created. To generate activity logs you will need to upload pictures to your application server. You can use your own images or download sample images to use here: https://s3-us-west-2.amazonaws.com/shhorsfi-store-ui-images/sample_photos.zip The sample image file is ~76MB. If you plan to use the sample images, start the download now so it can complete before you need them.

ImageTrends: The [fictional] Narrative

In the lab we will use a mock workload for a fictitious company to learn about performing operations on AWS. Details on the application used in this lab can be found on GitHub at this link.

About Us: ImageTrends was started on a simple idea: “A picture is worth a thousand words”. We help users find out how many words that picture is worth. Our users make thousands of connections every day based on what they discover in the pictures they share. We want to make sharing pictures and identifying what lies within, faster and easier for them.

Our App: ImageTrends has one primary application – a photo library that tags the contents of images users upload, through machine learning. Through our application users can upload pictures of themselves and friends to see which celebrities they resemble, and find celebrities in photos they upload. We have >5 million users (500,000 active daily).

Our Architecture: The ImageTrends application is a simple Ruby on Rails app with a MySQL database. It is installed as two Docker containers (application and database) launched in a VPC on EC2 instances using CloudFormation. Users upload images and background jobs analyze them and collect metadata. These jobs use AWS Rekognition to detect labels, text and celebrities, and EXIF metadata from the camera used. Deployment of application code uses update-in-place.

Our Operating Model: The ImageTrends team are mainly “DevOps” but there’s separation of duties between the cloud infrastructure team and the application team. Application code is pushed via CI/CD pipeline around once a day; the platform engineering team tests and deploys infrastructure changes. The operating model focuses on fast deployment of new features, and rapid feedback on user experience. Application performance and availability are vital to accelerated sign ups, which drive our funding valuation.

Our Team:

CEO Isobel Rose: focused on growing user base to secure funding, building a sustainable revenue stream.
CIO Henry Pomfret: focused on efficiency and performance goals for IT Operations, release targets and application availability.
Infrastructure Director Mel Blank: Primary concern is security and stability of infrastructure.
Application Director Joanne Groan: making sure production is delighting customers and the right features are being delivered to production as quickly as possible.
InfoSec Director Aki French: security threats posed by fast delivery, competing priorities between security and release timelines.
DevOps Engineer Ryn Brandish wants insight into platform changes, application issues, and governance requirements affecting feature delivery and application security.
Site Reliability Engineer Sansa Bailiff wants to stop getting paged at all hours and wants to better understand system reliability.
Security Engineer Paco Simpson. Biggest challenge: speed to market sometimes means security issues come up only post-go-live.
User Experience Lead Cindy Logan-Matthew: Biggest challenge: user feature requests growing at a rate exceeding ImageTrends’ ability to implement them.
Financial Analyst Cris Stamp: Biggest challenge: variable operating expenses and a rapid rate of utilization expansion make cost control difficult.

Building the Monitoring Plan: People

There are two goals of monitoring:

Achieve situational awareness to provide timely and effective responses and
Gain insights for the business, development, and operations that enable proactive courses of action.

Understanding needs requires engagement. Users frequently ask for what they think they need or what they understand is possible. Apply the “5 Whys” technique, and do not assume that what they are asking for is what they actually need. Operations' engagement with teams enables understanding of what they are trying to achieve and helps determine options for achieving their monitoring goals.

Consider: Who will consume the outputs from monitoring? What are their goals?

Our Personas:

Business (Line of Business)/Customer
Developers
Operations

Building the Monitoring Plan: Telemetry

Consider: What telemetry can we leverage to achieve the goals? Is there desirable telemetry we do not have? If so, how can we acquire it? Can we further instrument our application to emit the information we need? Can we configure our infrastructure to provide the information we need?

Our workload must emit the data that allows operations to understand workload health and the achievement of business outcomes. We have to collect that information, analyze it, and then act on it to maximize the benefits for our organizations. If we are collecting without analysis or action we are paying to store records that will at best only be used after an event in a forensic capacity. Our goal is to achieve insight.

Categories of insight

Faults
Configuration
Accounting
Performance
Security
Outcomes
User Behavior
Workload Behavior

We will iterate: engaging our stakeholders, determining requirements and priorities, implementing improvements to monitoring, evaluating their success, and repeating. Or put more simply: create a plan, implement the plan, check to see if the plan works, adjust the plan and repeat.

Exercise 1: Acquire Telemetry

We are going to start by focusing on 3 goals

Titus Grone, our application director, wants to know that our production application is delighting our customers.
Ryn Brandish, our DevOps Engineer, wants to understand when there are application issues and is concerned about security.
Sansa Bailish, our SRE, is focused on outages, reliability, and getting better sleep.

Install and configure the CloudWatch agent

We will collect metrics and logs from our Amazon EC2 instances using the CloudWatch agent. If we had on-prem servers we could also use the CloudWatch agent to capture metrics and logs from those.

The CloudWatch agent requires an IAM role, which was created when the CloudFormation script ran. We will install the agent, and then configure it using a predefined configuration placed in Systems Manager Parameter Store by our CloudFormation script. With our configuration in Parameter Store we can easily apply it as a standard across our fleet. You can learn more by following this Getting Started link.

Step 1.1: Review the Parameter Store configuration

The configuration we have in Parameter Store follows a required naming convention (its name must start with AmazonCloudWatch-) that lets us take advantage of the AWS managed CloudWatchAgentServerPolicy IAM policy, which simplifies our management. You can learn more at this link.

Open Parameter Store in the Systems Manager console by clicking on this link.
Choose AmazonCloudWatch-ImageTrendsConfig
Review the logs that will be ingested into CloudWatch Logs
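
For orientation while reviewing the parameter, the sketch below builds a small agent configuration in the shape the CloudWatch agent reads for log collection. The file path is an assumption for illustration only; the real values for this lab are the ones stored in AmazonCloudWatch-ImageTrendsConfig. The helper that checks the parameter name is ours.

```python
import json

# Illustrative only: the file path is assumed, not copied from the lab's
# AmazonCloudWatch-ImageTrendsConfig parameter.
agent_config = {
    "logs": {
        "logs_collected": {
            "files": {
                "collect_list": [
                    {
                        "file_path": "/var/log/boot.log",      # assumed path
                        "log_group_name": "boot.log",
                        "log_stream_name": "{instance_id}",    # agent placeholder
                    }
                ]
            }
        }
    }
}


def valid_parameter_name(name: str) -> bool:
    """Check the AmazonCloudWatch- prefix that the naming convention requires."""
    return name.startswith("AmazonCloudWatch-")


print(json.dumps(agent_config, indent=2))
print(valid_parameter_name("AmazonCloudWatch-ImageTrendsConfig"))
```

Each entry in collect_list maps one file on the instance to a log group, which is why the log groups you see later in the console correspond to entries in this list.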

Step 1.2: Install and Configure the CloudWatch agent

Using Systems Manager's Run Command we can install and configure the CloudWatch agent on an entire fleet as easily as we install it on a single instance. By using a command document we ensure consistent execution and limit the errors that could be introduced by a manual process.

Install the CloudWatch agent to our instances using Systems Manager Run Command

Navigate to the Systems Manager console Run Command page by clicking on this link.
Choose Run command.
In the Command document list, select AWS-ConfigureAWSPackage by selecting the radio button next to it.
In the Action list, choose Install.
In the Name field, type AmazonCloudWatchAgent
Leave Version set to latest to install the latest version of the agent.
In the Targets area, choose the instance on which to install the CloudWatch agent. You may either:
1. select Specifying a tag and enter a Tag Key of Workload and a Tag Value of ImageTrends
2. or select Manually selecting instances and choose your instance from the list.
In the Output options area uncheck the box next to Enable writing to an S3 bucket.
Choose Run.
Optionally, in the Targets and outputs areas, select the button next to an instance name and choose View output. Systems Manager should show that the agent was successfully installed.

Configure and (re)start the CloudWatch agent using Systems Manager Run Command

Open Run Command in the Systems Manager console by clicking on this link.
Choose Run command.
In the Command document list, select AmazonCloudWatch-ManageAgent by selecting the radio button next to it.
In the Command Parameters area
1. In the Action list, choose the default configure.
2. In the Mode list, choose the default ec2
3. In the Optional Configuration Source list, choose the default ssm.
4. In the Optional Configuration Location box, type the name of the agent configuration file that we created and saved to Systems Manager Parameter Store AmazonCloudWatch-ImageTrendsConfig
In the Optional Restart list, choose yes to start the agent after you have finished these steps.
In the Targets area, choose the instance where you installed the CloudWatch agent.
In the Output options area uncheck the box next to Enable writing to an S3 bucket.
Choose Run.
Optionally, in the Targets and outputs areas, select the button next to an instance name and choose View output. Systems Manager should show that the agent was successfully started
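
The two Run Command invocations above can also be expressed as request parameters. The sketch below only builds the dictionaries and does not call AWS. They are shaped so that they could be passed to the Systems Manager SendCommand API (for example via boto3's `send_command`), which we assume you would do with your own credentials. The function names are ours.

```python
def install_agent_request(tag_key: str, tag_value: str) -> dict:
    """Parameters for installing the agent with AWS-ConfigureAWSPackage."""
    return {
        "DocumentName": "AWS-ConfigureAWSPackage",
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        "Parameters": {
            "action": ["Install"],
            "name": ["AmazonCloudWatchAgent"],
            "version": ["latest"],
        },
    }


def configure_agent_request(tag_key: str, tag_value: str, parameter_name: str) -> dict:
    """Parameters for configuring and restarting the agent from Parameter Store."""
    return {
        "DocumentName": "AmazonCloudWatch-ManageAgent",
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        "Parameters": {
            "action": ["configure"],
            "mode": ["ec2"],
            "optionalConfigurationSource": ["ssm"],
            "optionalConfigurationLocation": [parameter_name],
            "optionalRestart": ["yes"],
        },
    }


install = install_agent_request("Workload", "ImageTrends")
configure = configure_agent_request(
    "Workload", "ImageTrends", "AmazonCloudWatch-ImageTrendsConfig"
)
print(install["DocumentName"], configure["DocumentName"])
```

Targeting by the Workload tag rather than by instance ID is what lets the same request cover a whole fleet.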

Want to learn more about collecting logs and metrics using the unified CloudWatch Agent? Follow this link.

Review the collected application logs [with the development team]

The more integrated the business, development, and operations teams are, the more successful they will be. The custom-developed application emits a variety of logs. Engaging the developers to help gain insight into their contents, meaning, and purpose will shorten the time it takes to get value out of monitoring.

Navigate to the CloudWatch Logs dashboard at this link.
Choose application.log from the Log Groups list
Choose the Log Stream of your instance
Review the available logs

Generate logs through user activity

In a web browser navigate to the IPv4 Public IP of your instance to reach the ImageTrends application
Create an account by choosing login and then choosing Sign up at the Log in page.
1. Enter an Email address, a Username, and a Password and choose Sign up
2. There is no validation on the email address you provide so you may use a fake address
Upload some photos
1. you can either use your own photos or
2. download sample photos to use from https://s3-us-west-2.amazonaws.com/shhorsfi-store-ui-images/sample_photos.zip
  - Note: There are ~76MB of images
Navigate to the CloudWatch Logs dashboard at this link.
Choose application.log from the Log Groups list
Choose the Log Stream of your instance
At the top right of the grey field, in the time window definition box, choose the pull down.
Choose Relative by clicking on the term.
Choose 2 Hours
Review the recent logs. Enter a Tag value identified by ImageTrends in the Filter Events text box and review the associated logs.

Publish VPC Flow Logs to CloudWatch Logs

When publishing VPC flow logs to CloudWatch Logs, flow log data is published to a log group, and each network interface has a unique log stream in the log group. Log streams contain flow log records. You can create multiple flow logs that publish data to the same log group. If the same network interface is present in one or more flow logs in the same log group, it has one combined log stream. If you've specified that one flow log should capture rejected traffic, and the other flow log should capture accepted traffic, then the combined log stream captures all traffic. For more information, see Flow Log Records.
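
To make the log stream contents concrete, here is a minimal sketch that splits one flow log record into named fields, assuming the default (version 2) record format. The sample record uses placeholder account and interface IDs, and the function name is ours.

```python
# Field order of the default (version 2) flow log record format.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def parse_flow_log_record(line: str) -> dict:
    """Split a space-separated flow log record into a field-name dictionary."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


# Placeholder record: accepted SSH traffic (TCP is protocol 6, port 22).
sample = (
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 "
    "20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK"
)
record = parse_flow_log_record(sample)
print(record["interface_id"], record["dstport"], record["action"])
```

The action field (ACCEPT or REJECT) is what the flow log's filter setting selects on, so a combined log stream contains records with both values.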

Consider how you will use the collected VPC Flow Log data. On internal networks, visibility of intended and unintended traffic can facilitate diagnosing issues. Capturing rejected traffic on your Internet Gateway by default may result in storing and processing a lot of data with limited value to you. Remember you can always enable the capability when needed and then disable it when no longer needed to help optimize the value you get from CloudWatch.

Step 1.3: Create a flow log for your VPC

Note: The following steps have been completed for you by the CloudFormation script.

Open the Amazon VPC console and navigate to Your VPCs by clicking on this link.
Select your ImageTrends VPC and then choose Actions, Create flow log.
For Filter, choose All from the pull down list to log accepted and rejected traffic.
For Destination, choose the default Send to CloudWatch Logs.
For Destination log group, type the name of a log group in CloudWatch Logs to which the flow logs are to be published. If you specify the name of a log group that does not exist, it will attempt to create the log group for you. Enter ImageTrendsVPC in the text box next to Destination log group*.
For IAM role select the role with name in the form <your stack name>-FlowLogsRole-<random string>. It was created by the CloudFormation script with permissions to publish logs to CloudWatch Logs.
Choose Create.
Open the CloudWatch console and navigate to Logs by clicking on this link.

What have we accomplished?

All of our instances are now logging to CloudWatch Logs, and our VPC is logging to CloudWatch Logs as well. Our instances and AWS services are publishing metrics to CloudWatch. Our application and workload are providing traces through integration with AWS X-Ray. We are collecting the available telemetry, and now it is time to analyze it and take advantage of it.

Exercise 2: Generate Metrics from Logs

Traditional application development emits events in the form of logs. Using CloudWatch, we can generate metrics from our logs using pattern matching. By generating metrics based on observed log messages we can increase the value of our CloudWatch logs: we can visualize the metric data through dashboards and raise alerts when metrics breach baseline thresholds. Using the AWS CLI or API you can also publish your own custom metrics.

Metric filters are created using the same filter and pattern syntax that is used when browsing log streams in the console.

Create a Confidence metric

Monitoring for Business Outcomes: Titus Grone wants to know that ImageTrends is delighting our customers. Feedback from customers indicates that the accuracy of items identified in uploaded images is the greatest source of satisfaction when it works well, and of frustration when it does not. Focus groups indicate that it is better not to show misidentified (low confidence) objects.

He wants to track the image recognition confidence levels as a measure of how accurately the ImageTrends application is performing. He will use this information to help determine where to focus development efforts.

Step 2.1: Create the Log Metric

Navigate to the CloudWatch Logs dashboard at this link.
In the contents pane, select the application.log group by clicking on the radio button next to it, and then choose Create Metric Filter.
1. On the Define Logs Metric Filter screen, for Filter Pattern, type: [logType, myTimestamp, severity, delim1, delim2, type, action, for, Image, imgNum, Name, imgTags, Confidence, cValue]
2. To test your filter pattern, for Select Log Data to Test, select the log group to test the metric filter against, and then choose Test Pattern.
3. Under Results, CloudWatch Logs displays a message showing how many occurrences of the filter pattern were found in the log file. To see detailed results, click Show test results.
4. Choose Assign Metric
On the Create Metric Filter and Assign a Metric screen,
1. For Filter Name type confidenceLevels
2. Under Metric Details, for Metric Namespace, type ApplicationLogMetrics
3. For Metric Name, type cValue
4. Choose Show advanced metric settings
5. For Metric Value choose $cValue.
6. Leave the Default Value undefined, and then choose Create Filter.
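
To see what the space-delimited filter pattern in Step 2.1 does, the sketch below approximates it locally. It splits a log line on whitespace, treats bracketed or quoted text as a single field, and names the fields in order. The sample log line is hypothetical (we assume a Ruby logger style prefix), so compare it against the events in your own application.log stream.

```python
import re

# Field names from the filter pattern in Step 2.1, in order.
FIELD_NAMES = [
    "logType", "myTimestamp", "severity", "delim1", "delim2", "type",
    "action", "for", "Image", "imgNum", "Name", "imgTags",
    "Confidence", "cValue",
]

# Text in square brackets or double quotes is one field; otherwise split on spaces.
TOKEN = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')


def extract_fields(log_line: str):
    """Return a name-to-value dict, or None if the field count does not match."""
    tokens = TOKEN.findall(log_line)
    if len(tokens) != len(FIELD_NAMES):
        return None
    return dict(zip(FIELD_NAMES, tokens))


# Hypothetical log line, for illustration only.
line = (
    "I, [2018-11-27T10:15:30.123456 #1]  INFO -- : Rekognition label for "
    "Image 42 Name Dog Confidence 97.5"
)
fields = extract_fields(line)
print(fields["cValue"])
```

Lines with a different number of fields simply do not match, which is why only the confidence entries contribute values to the cValue metric.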

Step 2.2: Perform user activity that will be captured as metrics

Generate logs through user activity

In a web browser navigate to the IPv4 Public IP of your instance to reach the ImageTrends application
Using your existing account, upload 5 additional photos
1. you can either use your own photos or
2. download sample photos to use from https://s3-us-west-2.amazonaws.com/shhorsfi-store-ui-images/sample_photos.zip
  - Note: There are ~76MB of images

Step 2.3: Review the resulting metrics

Navigate to metrics in the left side navigation bar of the CloudWatch console or by clicking this link.
Under Custom Namespaces you will see your ApplicationLogMetrics namespace.
1. Choose ApplicationLogMetrics
2. Then choose Metrics with no dimensions
3. Then select cValue
Choose the Graphed Metrics tab and examine the differences in reported values when you change the Statistic in use by selecting alternatives from the pull down list beneath it.
Change the Period to 10 Seconds and the Statistic to Average and examine the differences in reported values.
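
The Statistic setting controls how the datapoints that fall within one period are combined. A minimal sketch with hypothetical confidence values shows why the graph changes when you switch statistics. The function name and the sample numbers are ours.

```python
from statistics import mean


def summarize(values):
    """Combine the datapoints of one period the way the common statistics do."""
    return {
        "SampleCount": len(values),
        "Sum": sum(values),
        "Average": mean(values),
        "Minimum": min(values),
        "Maximum": max(values),
    }


# Hypothetical cValue datapoints that landed in the same period.
confidences = [99.0, 97.5, 55.5, 88.0]
summary = summarize(confidences)
print(summary)
```

For a confidence metric, Average and Minimum are the informative views. Sum grows with upload volume rather than with accuracy.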

Step 2.4: Create a dashboard

Choose Actions in the top right corner of the page and choose Add to dashboard
In the Add to dashboard dialog
1. Under Select a dashboard choose Create new and enter ImageTrends in the Dashboard name text box, and then choose the check mark icon next to the text box to confirm your choice.
2. Under Select a widget type choose Stacked area
3. Under Customize widget title replace the prepopulated value cValue with confidence
4. Choose Add to dashboard
On the Dashboards page, choose Save dashboard

(Optional) Exercise 3: Create software error and warning metrics

Monitoring for Application Outcomes: Ryn Brandish wants to understand when there are application issues and is concerned about security. He would like metrics for errors and warnings. ImageTrends has limitations on the image size that it can successfully process. While this issue does not happen frequently, it does not make customers happy. Being aware of the error rate for submitted images will allow the Business and Development teams to determine whether increasing the supported image size should be a priority. Currently most warnings are related to security issues. Ryn would like visibility into how often they happen.

Note: if you did not download the sample photos you will need an extremely large photo (~8MB) to generate the errors captured in the following metric.

Step 3.1: Create an error rate metric filter using CloudWatch

Navigate to the CloudWatch Logs dashboard at this link.
In the contents pane, select the application.log group by clicking on the radio button next to it, and then choose Create Metric Filter.
1. On the Define Logs Metric Filter screen, for Filter Pattern, type: "E," including the quotes.
2. To test your filter pattern, for Select Log Data to Test, select the log group to test the metric filter against, and then choose Test Pattern.
3. Under Results, CloudWatch Logs displays a message showing how many occurrences of the filter pattern were found in the log file. To see detailed results, click Show test results.
4. Choose Assign Metric
On the Create Metric Filter and Assign a Metric screen,
1. For Filter Name type ImageTrendsErrorRate.
2. Under Metric Details, for Metric Namespace, type ApplicationLogMetrics.
3. For Metric Name, type ErrorRate
4. Choose Show advanced metric settings
5. For Metric Value enter 1.
6. Leave the Default Value undefined, and then choose Create Filter.

Step 3.2: Create a warning rate metric filter using CloudWatch

Navigate to the CloudWatch Logs dashboard at this link.
In the contents pane, select the application.log group by clicking on the radio button next to it, and then choose Create Metric Filter.
1. On the Define Logs Metric Filter screen, for Filter Pattern, type: "W," including the quotes.
2. To test your filter pattern, for Select Log Data to Test, select the log group to test the metric filter against, and then choose Test Pattern.
3. Under Results, CloudWatch Logs displays a message showing how many occurrences of the filter pattern were found in the log file. To see detailed results, click Show test results.
4. Choose Assign Metric
On the Create Metric Filter and Assign a Metric screen,
1. For Filter Name type ImageTrendsWarningRate
2. Under Metric Details, for Metric Namespace, type ApplicationLogMetrics
3. For Metric Name, type WarningRate
4. Choose Show advanced metric settings
5. For Metric Value enter 1.
6. Leave the Default Value undefined, and then choose Create Filter.
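
The two filters above publish the value 1 for every matching log event, so the Sum of each metric over a period is a count of errors or warnings. The sketch below approximates the quoted term match as plain substring containment, which is a simplification, and runs it over hypothetical log lines.

```python
def count_matches(log_lines, term: str) -> int:
    """Count lines containing term; each match stands for a metric value of 1."""
    return sum(1 for line in log_lines if term in line)


# Hypothetical application.log events, for illustration only.
logs = [
    "I, [2018-11-27T10:15:30.000000 #1]  INFO -- : upload accepted",
    "E, [2018-11-27T10:15:31.000000 #1] ERROR -- : image too large",
    "W, [2018-11-27T10:15:32.000000 #1]  WARN -- : invalid token",
    "E, [2018-11-27T10:15:33.000000 #1] ERROR -- : image too large",
]
error_count = count_matches(logs, "E,")
warning_count = count_matches(logs, "W,")
print(error_count, warning_count)
```

Because the term can appear anywhere in an event, a short pattern such as "E," could also match message text. Use Test Pattern in the console to check the filter against real log data.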

Step 3.3: Generate logs through user activity

Upload additional photos including one or more from the sample_photos > Error Photos folder from the sample_photos.zip file found here, or personal photos of ~8MB in size.
- The application may start exhibiting issues

Step 3.4: Compare Error and Warning metrics

Navigate to metrics in the left side navigation bar of the CloudWatch console or by clicking this link.
Choose ApplicationLogMetrics.
Choose Metrics with no dimensions
Then select ErrorRate and WarningRate
Choose the Graphed metrics tab
1. change the Statistic values to Sum
2. change the Period value to 10 Seconds

Step 3.5: Use metric math to identify overall "Issue" rate

Metrics can be further leveraged through metric math. Metric math enables you to query multiple CloudWatch metrics and use math expressions to create new time series based on these metrics. You can visualize the resulting time series in the CloudWatch console and add them to dashboards. Details on how can be found here.

Choose + Add a math expression
- By default the Metric will be SUM(METRICS())

Note: there are a wide variety of supported metric math functions.
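
As a rough illustration of what SUM(METRICS()) produces, the sketch below adds two hypothetical time series together timestamp by timestamp. Counting a missing datapoint as 0 is our simplification for the example, not a statement of how CloudWatch handles gaps.

```python
def sum_metrics(*series):
    """Per-timestamp sum across series; a missing datapoint counts as 0 here."""
    timestamps = sorted(set().union(*(s.keys() for s in series)))
    return {t: sum(s.get(t, 0) for s in series) for t in timestamps}


# Hypothetical per-period sums for the two metrics from Exercise 3.
error_rate = {"10:00:00": 1, "10:00:10": 2}
warning_rate = {"10:00:10": 1, "10:00:20": 3}
issue_rate = sum_metrics(error_rate, warning_rate)
print(issue_rate)
```

The result is a new time series on the same time axis, which is why the expression can be graphed and added to a dashboard like any other metric.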

Step 3.6: Make the graph colors and labels more representative

Choose Expression1 and replace it with How bad is it?
Choose the colored round corner square next to ErrorRate and select red.
Choose the colored round corner square next to WarningRate and select orange.
Choose the colored round corner square next to How bad is it? and select an unused color.

Step 3.7: Create a dashboard

Choose Actions in the top right corner of the page and choose Add to dashboard
In the Add to dashboard dialog
1. Under Select a dashboard choose ImageTrends
2. Under Select a widget type choose Stacked area
3. Under Customize widget title replace the prepopulated value ErrorRate, WarningRate with ProductionErrorsAndWarnings
4. Choose Add to dashboard
On the Dashboards page, choose Save dashboard

Build a Monitoring Plan: Alerting and Response

Monitoring for Operational Outcomes: Sansa Bailish is focused on outages, reliability, and getting better sleep. There is a known issue with ImageTrends: when the instances are rebooted, the application doesn't restart. Too frequently Sansa has received a late-night call to get online and restart the application. The user experience lead is very frustrated with the downtime associated with these incidents. They need to be detected sooner and resolved faster.

Sansa is looking for a monitoring solution to detect the incident (the reboot event), and a way to trigger an automated recovery.

Exercise 4: A Process for Every Alert

Known Issues with ImageTrends

If the ImageTrends instance is rebooted the application does not automatically restart. To restart the application you must log into the application instance and execute the following 3 commands.

sudo iptables -t nat -I PREROUTING -p tcp --dport 80 -j REDIRECT --to-ports 3000

sudo service docker start

nohup sudo /usr/local/bin/docker-compose -f /opt/imagetrends.yaml up &

The ImageTrends instances are currently updated in place. As a result when the instances are rebooted as part of the application of patches we create a self inflicted incident. When that happens an administrator has to log into the system to restart the application. Until we are able to correct the issue with startup on reboot we still need to patch the systems.

Step 4.1: Test to verify the known issue is present in your environment

Reboot the ImageTrends instance

Navigate to the EC2 Instance Console at this link.
Select your ImageTrends instance
Choose Actions
Choose Instance State
Choose Reboot
In a web browser navigate to the IPv4 Public IP of your instance and refresh the page or navigate within the page to confirm it is no longer functional

Step 4.2: Corrective actions, Application Restart using an SSM command document

Navigate to the AWS Systems Manager console and select Documents in the left side navigation bar or click on this link.
1. Choose the search icon
2. Choose Owner
3. Choose Owned by me
4. Choose the command document whose Name is in the format <your stack name>-FlowLogsRole-<random string>
  - The command document was created by you when you deployed the CloudFormation stack.
5. Note the command document name as you will use it in later steps.
On the command document page choose the Content tab and review the command document.
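
When reviewing the Content tab, it may help to know the general shape of a command document. The sketch below builds one around the three restart commands listed under Known Issues with ImageTrends. It is an illustration, assuming schema version 2.2 and the aws:runShellScript action. The description and step name are ours, and the document in your stack may differ.

```python
import json

# Illustrative command document; the lab's actual document may differ.
restart_document = {
    "schemaVersion": "2.2",
    "description": "Restart the ImageTrends application (illustrative sketch)",
    "mainSteps": [
        {
            "action": "aws:runShellScript",
            "name": "restartImageTrends",  # hypothetical step name
            "inputs": {
                "runCommand": [
                    "sudo iptables -t nat -I PREROUTING -p tcp --dport 80 "
                    "-j REDIRECT --to-ports 3000",
                    "sudo service docker start",
                    "nohup sudo /usr/local/bin/docker-compose "
                    "-f /opt/imagetrends.yaml up &",
                ]
            },
        }
    ],
}

print(json.dumps(restart_document, indent=2))
```

Capturing the manual steps in a document like this is what turns the runbook into something Run Command can execute consistently.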

Step 4.3: Restarting the ImageTrends application with an SSM Document and SSM Run Command

Open Run Command in the Systems Manager console by clicking on this link.
Choose Run command
Choose the Search icon
Choose Platform types
Choose Linux
Select AWS-RunDocument by selecting the circle next to it
In the Command parameters box
1. In the Source Type pull down select SSMDocument
2. In the Source info text box enter {"name": "<your command document name>"}
  - replacing <your command document name> with the name you noted earlier.
In the Targets box
1. Choose Specifying a tag
2. Under Tags specify Workload in the Tag Key text box
3. Under Tags specify ImageTrends in the Tag Value (optional) text box
4. Choose Add next to the Tag Value (optional) text box
Alternatively you could
1. leave Manually selecting instances selected
2. Select your ImageTrends instance from the list by checking the box
In the Output options box
1. uncheck the box next to Enable writing to an S3 bucket
[optional] Expand the AWS command line interface command box and review the CLI version of this command
Choose Run
In a web browser navigate to the IPv4 Public IP of your instance. In about a minute you will be able to verify the ImageTrends application is back online.
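
The console settings in Step 4.3 correspond to the request parameters sketched below. As before, the code only builds the dictionary and makes no AWS call. We assume it would be passed to the Systems Manager SendCommand API. The function name is ours and the document name is left as the placeholder from the step.

```python
import json


def run_document_request(document_name: str,
                         tag_key: str = "Workload",
                         tag_value: str = "ImageTrends") -> dict:
    """Parameters for running another SSM document through AWS-RunDocument."""
    return {
        "DocumentName": "AWS-RunDocument",
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        "Parameters": {
            "sourceType": ["SSMDocument"],
            # sourceInfo is a JSON string, matching the Source info text box.
            "sourceInfo": [json.dumps({"name": document_name})],
        },
    }


request = run_document_request("<your command document name>")
print(request["Parameters"]["sourceInfo"][0])
```

Expanding the AWS command line interface command box in the console shows the equivalent CLI form of the same request.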

Automating Runbook Execution

Note: We are about to use all of the capabilities of CloudWatch coupled with Lambda and Systems Manager Run Command to:

use a log to create a metric
use the metric to create an alarm
use that alarm to send a notification
use that notification to trigger a lambda function
use that lambda function to create an event
use that event to trigger a rule
use that rule to launch run command
use run command to execute a command document
use that command document to run a script on our instance
and finally use that script to restart our application.

Note: This is not the best way to detect the issue with the application instance, nor is it the most efficient way to respond. It does, however, demonstrate many of the available options to detect, trigger, and resolve an issue without manual intervention.
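
The middle of that chain, where a notification triggers a Lambda function that creates an event, can be sketched as follows. This is not the lab's Lambda function. It is a minimal illustration that reads a CloudWatch alarm message delivered through SNS and builds one entry in the shape the EventBridge (CloudWatch Events) PutEvents API accepts, without calling AWS. The Source and DetailType values are hypothetical.

```python
import json


def build_event_entry(sns_event: dict):
    """Turn an SNS-delivered alarm notification into a PutEvents entry, or None."""
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    if message.get("NewStateValue") != "ALARM":
        return None  # only react when the alarm has fired
    return {
        "Source": "imagetrends.monitoring",    # hypothetical source name
        "DetailType": "ImageTrendsRebooted",   # hypothetical detail type
        "Detail": json.dumps({"alarmName": message["AlarmName"]}),
    }


# Trimmed example of the event a Lambda function receives from SNS.
sample_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps(
                    {"AlarmName": "ImageTrendsRebootedAlarm", "NewStateValue": "ALARM"}
                )
            }
        }
    ]
}
entry = build_event_entry(sample_event)
print(entry)
```

A rule matching the chosen Source and DetailType would then be the piece that launches Run Command.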

Step 4.4: Identify a log entry to base our metric upon

To automate execution of ImageTrendsRestart we need to identify what boot.log entry indicates that the system is completely up and able to execute the commands we are passing to it.

Navigate to Logs in the CloudWatch console by clicking on this link.
Review the boot.log log and identify a log entry which indicates the system has completed rebooting.
1. Choose boot.log and then choose your Log Stream
At the top right of the grey field, in the time window definition box, choose the pull down.
1. Choose Relative by clicking on the term.
2. Choose 2 Hours
3. Review the recent logs.
4. Select an event from the log that can provide confirmation that the instance is finished rebooting

Step 4.5: Create a metric filter for the reboot event

Navigate to Logs in the CloudWatch console by clicking on this link.
In the contents pane, select the boot.log group and then choose Create Metric Filter.
On the Define Logs Metric Filter screen,
1. for Filter Pattern, type: "Started OpenSSH server daemon." including the quotes.
2. To test your filter pattern, for Select Log Data to Test, select the log group to test the metric filter against, and then choose Test Pattern.
  - It is possible that your target may not be present in the log sample.
3. Under Results, CloudWatch Logs displays a message showing how many occurrences of the filter pattern were found in the log file. To see detailed results, click Show test results.
4. Choose Assign Metric
On the Create Metric Filter and Assign a Metric screen:
1. Under Metric Details, for Metric Namespace, type boot.logMetrics
2. For Metric Name, type ImageTrendsRebooted
3. Choose Show advanced metric settings and leave the default values in place.
4. Choose Create Filter.
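
The console steps above can also be expressed as an API request. The following is a minimal sketch, assuming the log group is named boot.log as in this lab and that the default advanced settings correspond to a metric value of 1. The filter name ImageTrendsRebootedFilter is a hypothetical name chosen for illustration; the request is passed to the boto3 CloudWatch Logs put_metric_filter call only if you choose to run create_metric_filter with credentials configured.

```python
import json


def build_metric_filter_request(log_group="boot.log"):
    """Build the arguments for a CloudWatch Logs metric filter.

    The filter name is hypothetical; the pattern, namespace, and metric
    name are the values entered in the console steps of this lab.
    """
    return {
        "logGroupName": log_group,
        "filterName": "ImageTrendsRebootedFilter",  # hypothetical name
        # The quotes are part of the pattern: they request an exact phrase match.
        "filterPattern": '"Started OpenSSH server daemon."',
        "metricTransformations": [
            {
                "metricName": "ImageTrendsRebooted",
                "metricNamespace": "boot.logMetrics",
                "metricValue": "1",  # assumption: the console default
            }
        ],
    }


def create_metric_filter(request):
    """Send the request to CloudWatch Logs (requires boto3 and credentials)."""
    import boto3  # imported lazily so the sketch can be inspected without AWS access

    boto3.client("logs").put_metric_filter(**request)


if __name__ == "__main__":
    print(json.dumps(build_metric_filter_request(), indent=2))
```

Printing the request first lets you compare it against what you entered in the console before making any change to your account.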

Step 4.6: Create an Alarm for your ImageTrendsRebooted metric

On the Filters for boot.log page choose Create Alarm
- If you had navigated away after creating the filter you can choose Logs on the left side navigation bar and then choose the filter link on the row of your log or click on this link
In the Create Alarm window under Alarm details
1. Enter ImageTrendsRebootedAlarm in the Name: text box
2. Leave >= selected next to is: and enter 1 in the adjacent text box
under Additional settings
1. Select Good (not breaching threshold) from the Treat missing data as: pull down list
under Actions in the Notifications box
1. leave Whenever this alarm: as the default State is ALARM
2. in the Send notification to: text box choose the notification list in the format <your stack name>-ImageTrendsRebootedNotify-<random string>.
  - You created this notification when you created your CloudFormation stack.
3. Leave Email list: with its prepopulated value
Choose Create Alarm
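
For reference, the alarm settings chosen above map onto the parameters of the boto3 CloudWatch put_metric_alarm call roughly as follows. This is a sketch under assumptions: the Sum statistic, the 300 second period, and the single evaluation period are illustrative choices not specified in this guide, and the topic ARN in the usage example is a placeholder.

```python
def build_alarm_request(topic_arn, period_seconds=300):
    """Build put_metric_alarm arguments mirroring the console settings.

    Statistic, Period, and EvaluationPeriods are assumptions made for
    illustration; the remaining values come from the steps in this lab.
    """
    return {
        "AlarmName": "ImageTrendsRebootedAlarm",
        "Namespace": "boot.logMetrics",
        "MetricName": "ImageTrendsRebooted",
        "Statistic": "Sum",                  # assumption
        "Period": period_seconds,            # assumption
        "EvaluationPeriods": 1,              # assumption
        "Threshold": 1.0,                    # "is: >= 1"
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # "Good (not breaching threshold)"
        "AlarmActions": [topic_arn],         # notify when the state is ALARM
    }


def create_alarm(request):
    """Create the alarm (requires boto3 and credentials)."""
    import boto3  # imported lazily so the sketch can be inspected without AWS access

    boto3.client("cloudwatch").put_metric_alarm(**request)


if __name__ == "__main__":
    # Placeholder ARN; substitute the topic created by your CloudFormation stack.
    example = build_alarm_request("arn:aws:sns:us-east-1:123456789012:example-topic")
    for key, value in example.items():
        print(f"{key}: {value}")
```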

Step 4.7: Use Lambda to perform a put-event when the ImageTrendsRebootedNotify alarm triggers

Navigate to the Lambda console by clicking on this link.
Choose Manage functions
Choose ImageTrendsRebootedEvent
Review the Lambda function
1. Choose the SNS box with icon in the designer to see the SNS configuration below
2. Choose ImageTrendsRebootedEvent to see the Function code below

When the Lambda function receives the notification from the SNS topic, it executes its code. This function puts (emits) a CloudWatch event using cloudwatch_events.put_events with source ImageTrendsInstance and ImageTrendsRebooted, which is used by the following CloudWatch rule.
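
The deployed function's code is visible in the console; the following is not that code, only a minimal sketch of the same idea. It assumes that ImageTrendsRebooted is used as the event's detail type and that the alarm message is forwarded in the event detail under a hypothetical alarmMessage key. The SNS record layout and the put_events call on the boto3 events client are standard.

```python
import json

EVENT_SOURCE = "ImageTrendsInstance"
EVENT_DETAIL_TYPE = "ImageTrendsRebooted"  # assumption: used as the detail type


def build_event_entry(sns_message):
    """Build one put_events entry from the text of an SNS notification."""
    return {
        "Source": EVENT_SOURCE,
        "DetailType": EVENT_DETAIL_TYPE,
        # Detail must be a JSON string; the key name is hypothetical.
        "Detail": json.dumps({"alarmMessage": sns_message}),
    }


def lambda_handler(event, context):
    """Entry point invoked by the SNS subscription."""
    import boto3  # available in the Lambda runtime; imported lazily here

    message = event["Records"][0]["Sns"]["Message"]
    cloudwatch_events = boto3.client("events")
    return cloudwatch_events.put_events(Entries=[build_event_entry(message)])


if __name__ == "__main__":
    print(build_event_entry("ALARM: ImageTrendsRebootedAlarm"))
```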

Step 4.8: Match the Lambda event to the CloudWatch rule and route it to Run Command

Navigate to the CloudWatch console and the Rules page by clicking on this link.
Choose ImageTrendsRestartAppRule
Review the rule configuration

The CloudWatch Rule matches the incoming event from Lambda and routes it to the target Run Command invoking our command document.
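
To make the matching step concrete, the sketch below imitates, in a simplified way, how a rule's event pattern is compared with an incoming event: every key in the pattern must be present in the event, and the event's value must be one of the values listed in the pattern. The pattern shown is an assumption based on the source and name mentioned in Step 4.7; review the rule in the console for the actual pattern. Real event patterns support more operators than this sketch handles.

```python
def matches(pattern, event):
    """Return True if the event satisfies a simplified event pattern.

    Leaf values in the pattern are lists of allowed values; nested
    dictionaries are matched recursively against nested event fields.
    """
    for key, allowed in pattern.items():
        if key not in event:
            return False
        if isinstance(allowed, dict):
            if not isinstance(event[key], dict) or not matches(allowed, event[key]):
                return False
        elif event[key] not in allowed:
            return False
    return True


# Assumed pattern, for illustration only.
RULE_PATTERN = {
    "source": ["ImageTrendsInstance"],
    "detail-type": ["ImageTrendsRebooted"],
}

if __name__ == "__main__":
    incoming = {
        "source": "ImageTrendsInstance",
        "detail-type": "ImageTrendsRebooted",
        "detail": {},
    }
    print(matches(RULE_PATTERN, incoming))
```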

Step 4.9: Follow the execution

Reboot your instance
Review the boot.log prompting for new records
When the log arrives confirm the alarm has been triggered
Review the alarm history
Review the Lambda monitoring; note the invocation
Review the CloudWatch rule and check the metrics for the rule; note the invocation
Review Run Command in Systems Manager; note the status of the last executed command (possibly still in progress or completed)
Check to see if the application has been restored by navigating to the web interface

What have we achieved?

We have answered three monitoring needs, one for each of our teams.
We have provided insight into the quality of the end user experience by creating an image tag confidence metric for the business team.
We have provided insight into issues and potential development needs by providing the development team (DevOps) with error and warning metrics.
We have provided a mechanism for the operations team (SRE) to use an event trigger to automatically remediate an outage-causing issue in the environment.

How did we create an automatic remediation?

Knowing the log entry that indicates that the application has failed and is in a state from which it can be recovered, we created a metric.
Using that metric, we created an alarm that notifies via an SNS topic.
We have a Lambda function that is subscribed to that SNS topic and creates a CloudWatch Event in response.
We created a CloudWatch rule that triggers on our Lambda-initiated CloudWatch event and invokes Run Command.
Our Run Command invocation uses a command document we created to execute the startup script on our server, restoring it to an operating state.

We have created something of a Rube Goldberg machine to achieve that outcome. In doing so we have demonstrated the use of logs and how to get more value out of them, the use of events, and the triggering of actions in response to events; all enabled by CloudWatch.

Bonus Content: CloudWatch Logs Insights (CWL-I)

CWL-I allows you to rapidly query across a log group using a powerful query language.

Exercise 5: Parsing logs with CWL-I

To extract data from a log field, creating one or more ephemeral fields that can be further processed by CWL-I queries, perform the following steps.

Step 5.1: Select the data from the log group and log streams

Navigate to the CWL-I console
Choose the Log Group you want to work with from the pull down, at the top right of the grey field.
Choose the pull down in the time window definition box, at the top right of the grey field.
Choose to define a Relative or Absolute time window, by clicking on the term
Select a period overlapping the log activity of interest.
In the query window under your selected Log Group, enter fields @message and choose Run Query. Results will populate in the Logs tab below
You may now optionally narrow down your time frame of interest by clicking and dragging across the period in the Distribution of log events over time display. A new query will run when the time window is selected, and the time definition window in the top right will update.

Step 5.2: Parse the log into ephemeral values

The fields statement can be piped into a parse statement, which breaks your log statements into ephemeral values that can be used to refine your query. The parse statement is an all-or-nothing evaluation. If a log entry does not match the parse statement in its entirety, no ephemeral values will be generated; however, the @message value will remain for unparsed log records.

For example, a log string from the ImageTrends application.log log group:
I, [2018-10-26T15:57:20.712115 #21] INFO -- : Text detected in Image: 8 Text: EXOPLANETS Confidence: 99.11837005615234

These are the values of interest in our log string:
<Tag>, <myTimestamp> <severity> -- : <type> <action> in Image: <imgNum> Text: <imgTags> Confidence: <cValue>

Note: @timestamp is a reserved term in CWL-I and cannot be used as an ephemeral variable name, so we have chosen to use @myTimestamp.

There are 2 methods to parse the log: Anchor Strings and Regex:

Anchor Strings Within a given log string, replace each string you wish to extract into an ephemeral field with an * and leave the remaining literal values in place. If this were Regex, each * would effectively be replaced with .*? The string must be bound by either ' or ". Whichever one you select, the other does not have to be escaped in your parse of the string. To define the ephemeral values, the Anchor string must be followed by "as" and your ephemeral fields, each prefixed with @.

parse @message "I, [* #21] * -- : * * in Image: * Text: * Confidence: *" as @myTimestamp, @severity, @type, @action, @imgNum, @imgTags, @cValue
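
To experiment with an Anchor string before running a query, you can approximate its behavior locally. The sketch below is not how CWL-I is implemented; it assumes that each * behaves like a lazy .*? and that the whole message must match, which models the all-or-nothing evaluation described above. The function name anchor_parse is hypothetical.

```python
import re


def anchor_parse(message, anchor, names):
    """Approximate an Anchor string parse.

    Literal text is escaped, each * becomes a lazy capture group, and the
    whole message must match. Returns a dict of field values, or None when
    the message does not match (no ephemeral values are produced).
    """
    literals = anchor.split("*")
    if len(literals) - 1 != len(names):
        raise ValueError("the number of * markers must equal the number of names")
    pattern = "(.*?)".join(re.escape(part) for part in literals)
    match = re.fullmatch(pattern, message)
    if match is None:
        return None
    return dict(zip(names, match.groups()))


LOG_LINE = (
    "I, [2018-10-26T15:57:20.712115 #21] INFO -- : Text detected in Image: 8 "
    "Text: EXOPLANETS Confidence: 99.11837005615234"
)
ANCHOR = "I, [* #21] * -- : * * in Image: * Text: * Confidence: *"
NAMES = ["myTimestamp", "severity", "type", "action", "imgNum", "imgTags", "cValue"]

if __name__ == "__main__":
    print(anchor_parse(LOG_LINE, ANCHOR, NAMES))
    print(anchor_parse("a log entry with a different format", ANCHOR, NAMES))
```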

Regex Regex enables additional options if you need to do something that the Anchor string option does not provide. It is expected that for most use cases Anchor strings will be effective and intuitive. With Regex the string must be bound by /. To define the ephemeral values, remove the content of interest from your sample string and replace it with a named capturing group in the form (?<name>.*?). Leave the remaining literal content from the log in place, escaping characters that have a special meaning in Regex, such as [ and ].

parse @message /I, \[(?<myTimestamp>.*?) #21\] (?<severity>.*?) -- : (?<type>.*?) (?<action>.*?) in Image: (?<imgNum>.*?) Text: (?<imgTags>.*?) Confidence: (?<cValue>.*?)$/
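
A named-group pattern like this one can be checked locally against a sample log line before it is used in a query. The sketch below uses Python's re module, which writes named groups as (?P<name>...) rather than (?<name>...); the conversion of imgNum and cValue to numbers is an illustrative extra and not part of the lab.

```python
import re

# Same structure as the CWL-I pattern, written in Python's named-group syntax.
LOG_PATTERN = re.compile(
    r"I, \[(?P<myTimestamp>.*?) #21\] (?P<severity>.*?) -- : "
    r"(?P<type>.*?) (?P<action>.*?) in Image: (?P<imgNum>.*?) "
    r"Text: (?P<imgTags>.*?) Confidence: (?P<cValue>.*?)$"
)


def parse_line(line):
    """Return the named fields of a matching log line, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Illustrative typing step: these two fields are numeric in the sample line.
    fields["imgNum"] = int(fields["imgNum"])
    fields["cValue"] = float(fields["cValue"])
    return fields


SAMPLE = (
    "I, [2018-10-26T15:57:20.712115 #21] INFO -- : Text detected in Image: 8 "
    "Text: EXOPLANETS Confidence: 99.11837005615234"
)

if __name__ == "__main__":
    print(parse_line(SAMPLE))
```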

Step 5.3 Filter for populated values after a parse

To filter for fully parsed log entries (for example, in a log with multiple entry types of inconsistent content) and return results for all ephemeral values, pipe the parse results into a fields statement that uses ispresent, as shown below, and then filter for valid results.

parse @message "I, [* #21] * -- : * * in Image: * Text: * Confidence: *" as @myTimestamp, @severity, @type, @action, @imgNum, @imgTags, @cValue
| fields ispresent (@myTimestamp) as @validResults | filter @validResults = 1

CWL-I and partitioning Log Data

Consider: the costs associated with the query. The price per query is based on the log content scanned per query ($0.005/GB or $5/TB). If there are a lot of low-value logs in the log streams you scan, then you are paying to ignore them. The way in which you structure your logs can enable value optimization when using CWL-I.

Consider removing low-value log content from your log group.
Consider separating valuable log content with different formats into separate log groups. By doing this, your queries will only scan content that is both valuable and actionable by the query.

Remember: in the log visualization you can select the time frame of interest to focus your queries and limit extraneous scanned data.
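
The effect of partitioning on query cost can be estimated with simple arithmetic. The sketch below uses the $0.005/GB price quoted above; the data volumes in the example are hypothetical, and current pricing should be confirmed for your region.

```python
PRICE_PER_GB_USD = 0.005  # price quoted in this guide; confirm current pricing


def query_cost_usd(scanned_gb):
    """Estimated cost of one query that scans the given number of gigabytes."""
    if scanned_gb < 0:
        raise ValueError("scanned_gb must not be negative")
    return scanned_gb * PRICE_PER_GB_USD


def savings_per_query_usd(total_gb, valuable_gb):
    """Estimated saving per query if only the valuable content is scanned."""
    return query_cost_usd(total_gb) - query_cost_usd(valuable_gb)


if __name__ == "__main__":
    # Hypothetical volumes: 1000 GB in the log group, 200 GB of it valuable.
    print(f"Scan everything:  ${query_cost_usd(1000):.2f}")
    print(f"Scan valuable:    ${query_cost_usd(200):.2f}")
    print(f"Saving per query: ${savings_per_query_usd(1000, 200):.2f}")
```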

References: ImageTrends

A mock workload for a fictitious company.
Source code: https://github.com/horsfieldsa/imagetrends
Admin interface: http://<the public IPv4 address of your instance>/admin
Admin user: admin@admin.com
Admin password: Password123

Helpful Content

User and Group Management

When you first create an Amazon Web Services (AWS) account, you begin with a single sign-in identity that has complete access to all AWS services and resources in the account. This identity is called the AWS account root user and is accessed by signing in with the email address and password that you used to create the account.

We strongly recommend that you do not use the root user for your everyday tasks, even the administrative ones. Instead, adhere to the best practice of using the root user only to create your first IAM user. Then securely lock away the root user credentials and use them to perform only a few account and service management tasks. To view the tasks that require you to sign in as the root user, see AWS Tasks That Require Root User.

IAM Users & Groups

As a best practice, do not use the AWS account root user for any task where it's not required. Instead, create a new IAM user for each person that requires administrator access. Then make those users administrators by placing the users into an "Administrators" group to which you attach the AdministratorAccess managed policy.

Thereafter, the users in the administrators group should set up the groups, users, and so on, for the AWS account. All future interaction should be through the AWS account's users and their own keys instead of the root user.

1.1 Create Administrator IAM User and Group

To create an administrator user for yourself and add the user to an administrators group:

Use your AWS account email address and password to sign in as the AWS account root user to the IAM console at https://console.aws.amazon.com/iam/.
In the IAM navigation pane, choose Users and then choose Add user.
In Set user details for User name, type a user name for the administrator account you are creating. The name can consist of letters, digits, and the following characters: plus (+), equal (=), comma (,), period (.), at (@), underscore (_), and hyphen (-). The name is not case sensitive and can be a maximum of 64 characters in length.
In Select AWS access type for Access type, select the check box next to AWS Management Console access, select Custom password, and then type your new password in the text box. If you're creating the user for someone other than yourself, you can leave Require password reset selected to force the user to create a new password when first signing in. Otherwise, clear the box next to Require password reset. Then choose Next: Permissions.
In Set permissions for user, ensure Add user to group is selected.
Under Add user to group choose Create group.
In the Create group dialog box, type a Group name for the new group, such as Administrators. The name can consist of letters, digits, and the following characters: plus (+), equal (=), comma (,), period (.), at (@), underscore (_), and hyphen (-). The name is not case sensitive and can be a maximum of 128 characters in length. In the policy list, select the check box next to AdministratorAccess and then choose Create group.
Back at Add user to group, in the list of groups, ensure the check box for your new group is selected. Choose Refresh if necessary to see the group in the list. Choose Next: Review to see the list of group memberships to be added to the new user. When you are ready to proceed, choose Create user.
At the confirmation screen you do not need to download the user credentials for programmatic access at this time. You can create new credentials at any time.

You can use this same process to create more groups and users and to give your users access to your AWS account resources. To learn about using policies that restrict user permissions to specific AWS resources, see Access Management and Example Policies. To add additional users to the group after it's created, see Adding and Removing Users in an IAM Group.

1.2 Log in to the AWS Management Console using your administrator account

You can now use this administrator user instead of your root user for this AWS account. Choose the link https://<yourAccountNumber>.signin.aws.amazon.com/console and log in with your administrator user credentials.
Select the region you will use for the lab from the list in the upper right corner.
Verify that you have 2 available VPCs (3 or fewer in use) in the selected region by navigating to the VPC Console (https://console.aws.amazon.com/vpc/) and reviewing the number of VPCs in the Resources section.

1.3 Create an EC2 Key Pair

Amazon EC2 uses public-key cryptography to encrypt and decrypt login information. Public-key cryptography uses a public key to encrypt a piece of data, such as a password, then the recipient uses the private key to decrypt the data. The public and private keys are known as a key pair. To log in to the Amazon Linux instances we will create in this lab, you must create a key pair, specify the name of the key pair when you launch the instance, and provide the private key when you connect to the instance.

Use your administrator account to access the Amazon EC2 console at https://console.aws.amazon.com/ec2/.
In the EC2 navigation pane under Network & Security, choose Key Pairs and then choose Create Key Pair.
In the Create Key Pair dialog box, type a Key pair name such as OELab2018 and then choose Create.
Save the keyPairName.pem file for optional later use accessing the EC2 instances created in this lab.

Lab tear down

To remove all the resources configured and deployed as part of this lab perform the following:

Navigate to the CloudFormation console at this link
- Select your stack, choose Actions, and choose Delete stack
- This will delete all resources created by the CloudFormation template
Navigate to the Logs page on the CloudWatch console at this link
- Select /aws/lambda/ImageTrendsRebootedEvent, then choose Actions, then choose Delete log group, and finally choose Yes, Delete to delete the Log Group
- Select application.log, then choose Actions, then choose Delete log group, and finally choose Yes, Delete to delete the Log Group
- Select boot.log, then choose Actions, then choose Delete log group, and finally choose Yes, Delete to delete the Log Group
- Select messages, then choose Actions, then choose Delete log group, and finally choose Yes, Delete to delete the Log Group
- Select production.log, then choose Actions, then choose Delete log group, and finally choose Yes, Delete to delete the Log Group
Navigate to the Metrics page on the CloudWatch console at this link
- CloudWatch does not support metric deletion. Metrics expire based on the retention schedules which can be found at this link
Navigate to the Alarms page on the CloudWatch console at this link
- Select ImageTrendsRebootedAlarm by checking the box on the left in its row.
- Choose Actions, choose Delete, and finally choose Yes, Delete to delete the Alarm
Navigate to the Dashboards page of the CloudWatch console at this link
- Choose ImageTrends
- Choose Actions, choose Delete dashboard, and finally when prompted select Delete dashboard to confirm that you wish to delete your dashboard.

Thank you for using this lab.