Before any of us attempt to understand what AWS is (aside from just an acronym for Amazon Web Services), we need to know what modern cloud computing is. I've taken the liberty of presenting a quote directly from Wikipedia, referencing the actual concept of “cloud computing”:
“Cloud computing is a type of Internet-based computing that provides shared computer processing resources and data to computers and other devices on demand. It is a model for enabling ubiquitous, on-demand access to a shared pool of configurable computing resources (e.g., computer networks, servers, storage, applications and services), which can be rapidly provisioned and released with minimal management effort”.
Let’s analyze some of the aforementioned concepts extracted from good-old wiki:
“Internet-based computing”: Yes… that is pretty much true
“On-demand access”: True again, it seems we're on the right track here
“Rapidly provisioned”: Well… strike 3 and out. True again
"Minimal management effort": Well… not really, or maybe the right answer is, “Yes and No”
“Yes”, because you only need to worry about your provisioned resources. You don’t need to know about datacenter power supply, server hardware, air conditioning, etc.
“No”, because administering things in the cloud is a very different task from doing the same on traditional systems.
In contrast to traditional datacenters and their traditional administration methods, your most critical management task in the cloud relates to the proper sizing of your cloud elements. In AWS, you pay according to what you use. That means, in practical terms, if you oversize your deployment, you’ll end up paying more, and if you don’t add enough power, you’ll end up with bottlenecks and a poorly delivered service.
It's important to note that despite the elastic nature of AWS services, you, as a cloud architect, can design that elasticity poorly without noticing. The only way to truly ascertain the right size of your cloud deployment is to monitor the entire thing. In the next sections we’ll focus on the monitoring topics related to AWS.
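To make the sizing trade-off concrete, here is a minimal sketch of the arithmetic. The request rates, per-instance capacity, headroom, and hourly price are made-up illustration values, not AWS figures, and the function names are hypothetical.

```python
import math

HOURS_PER_MONTH = 730  # approximate average number of hours in a month


def instances_needed(peak_load, capacity_per_instance, headroom=0.25):
    """Smallest instance count that covers peak_load plus spare headroom."""
    return math.ceil(peak_load * (1 + headroom) / capacity_per_instance)


def monthly_cost(instance_count, hourly_price):
    """Pay-per-use style cost: every running instance is billed per hour."""
    return instance_count * hourly_price * HOURS_PER_MONTH


# Hypothetical workload: 1000 requests/s at peak, 300 requests/s per instance.
right_sized = instances_needed(1000, 300)  # capacity for the peak plus 25% headroom
oversized = right_sized * 2                # a "just in case" deployment
hourly_price = 0.10                        # assumed price in dollars, for illustration only
wasted = monthly_cost(oversized, hourly_price) - monthly_cost(right_sized, hourly_price)
print(f"right-sized: {right_sized} instances, oversizing wastes ${wasted:.2f} per month")
```

Only monitoring data can tell you what the real peak load and per-instance capacity are, which is why the numbers above have to come from measurements rather than guesses.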
THE GOOD: AWS includes its own monitoring tools, some already in the basic plans.
AWS is comprised of many services, and the “Amazon Tech Ninjas” like to add new, very innovative ones from time to time. The most basic (and most widely used) services are infrastructure services that normally form the core of modern cloud deployments. Here are just a few of those services, and the related metrics you need to obtain from them:
EC2 : Elastic Compute Cloud: This service provides virtual machines (aka “Instances”) based on the most commonly used Linux and Windows operating systems in the I.T. industry. Here, you’ll need some basic O/S metrics like: CPU Usage, RAM Usage, Network Bandwidth, Unix Load, DISK I/O usage, disk space, etc.
S3 : One of the very first services AWS offered: Simple Storage Service. It offers “object storage”, a concept born in cloud computing times. You can store any kind of file (image, binary, text, whatever) and serve it using either REST or direct HTTP/HTTPS. It is very useful for video streaming, static web content, backups in particular, and safe log storage for the applications in your cloud deployment. Please note that S3 is widely distributed and redundant, so if the “many enemies of Dr. Who” attack us and get rid of two datacenters in an S3 region, at least one datacenter will be available to keep the service running and avoid data loss. That’s very good for log storage, right? The most important metrics here are related to GET/PUT operations on S3 resources and, of course, the size of your stored data.
EBS : Elastic Block Store. In some ways, it's a part of EC2, as your operating system disk needs to live somewhere. EBS provides “virtual disks” or, more appropriately, “virtual volumes” that you can use as a root partition or as extra partitions for specialized data, such as your database storage disks. EBS also offers different types of “disks” with different I/O characteristics. What should you monitor here? Some metrics are already part of what you can see on EC2: disk I/O usage and disk space usage.
VPC : Virtual Private Cloud. This is your virtualized networking layer in the cloud, which includes virtual networks at its most basic level. These can be distributed across different datacenters in AWS (which they call “availability zones”, another cloud concept) so that you can distribute your deployment in order to increase your survival factor. Again, part of the metrics here are included on EC2: Bandwidth usage for each network interface. Please note that AWS will bill you for all traffic that leaves your region.
The good news is: AWS provides you with many tools to help you here (not only for monitoring). Let’s review some of these tools:
AWS Cloudwatch – basic monitoring : The Amazon solution for monitoring your cloud deployment is called “Cloudwatch”. This product is already included (in a basic form) in your normal plan. Some of the aforementioned metrics (the most basic ones) are already included here, as well.
AWS Cloudwatch – extended monitoring : You can enable more detailed instance monitoring at an extra cost. Detailed monitoring publishes instance metrics at one-minute intervals instead of the default five-minute intervals. Metrics that are only visible from inside the operating system, such as RAM usage and disk space, are not collected automatically. For those, or whenever the available metrics are not enough, you can create your own metrics and send them to Cloudwatch using the AWS API from inside the virtual machine.
AWS Cloudwatch Logs : By means of agents running inside your instances, you can send your logs to “Cloudwatch Logs” and establish rules that raise alarms when specific events are detected (and counted) inside the logs.
Cloudwatch alarms : What's a monitoring solution without thresholds? Cloudwatch allows you to define “alarms” over all of your metrics. Those alarms, when triggered, can fire not only simple notifications (mail, sms, etc.), but also “actual” actions on your cloud. Example: Deploy an additional instance on a load-balanced Auto-Scale group.
Auto-Scaling : AWS really believes in the “Elastic” concept, and offers you a way to design cloud platforms that can add or remove servers when the “monitored” load changes. The changes are triggered when some thresholds are reached under circumstances that you, the “cloud architect”, define. In conclusion, proper monitoring is a vital part of Auto-Scaling.
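The chain from metric to alarm to scaling action can be illustrated in plain Python. This is a conceptual sketch only, not Cloudwatch's actual evaluation logic or API: the function names, thresholds, sample data, and the one-instance scaling step are all assumptions made for illustration.

```python
def alarm_state(datapoints, threshold, evaluation_periods, above=True):
    """Return "ALARM" when the last N datapoints all breach the threshold."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    if above:
        breached = all(value > threshold for value in recent)
    else:
        breached = all(value < threshold for value in recent)
    return "ALARM" if breached else "OK"


def desired_capacity(current, high_state, low_state, minimum=1, maximum=10):
    """Add one instance on a high-load alarm, remove one on a low-load alarm."""
    if high_state == "ALARM":
        return min(current + 1, maximum)
    if low_state == "ALARM":
        return max(current - 1, minimum)
    return current


# Hypothetical average CPU percentages, one value per monitoring period.
cpu = [35.0, 62.0, 81.0, 88.0, 93.0]
high = alarm_state(cpu, threshold=80.0, evaluation_periods=3)
low = alarm_state(cpu, threshold=20.0, evaluation_periods=3, above=False)
print(high, low, desired_capacity(2, high, low))
```

Requiring several consecutive breaching periods before acting is a common way to keep a single spike from adding servers you then pay for; the right number of periods depends on your workload.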
TIP: If you want to bring your very own log analysis tool and ensure the application logs are in a safe place, you can rotate all logs to S3 and even encrypt the data. Then, you can use a specialized tool that will read all logs stored on S3 and take the proper steps with that data. Maybe in your mind the words “Smart Analysis Tool” will begin to make some sense.
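As a rough sketch of that tip, the snippet below uploads one already-rotated log file to S3 with server-side encryption using boto3. The bucket name, hostname, key layout, and helper names are hypothetical, and the code assumes boto3 is installed and AWS credentials are configured on the instance.

```python
import os
from datetime import date


def log_object_key(hostname, log_path, day):
    """Build an S3 key such as logs/web-1/2024/01/31/app.log.1.gz."""
    return f"logs/{hostname}/{day:%Y/%m/%d}/{os.path.basename(log_path)}"


def ship_rotated_log(bucket, hostname, log_path, day=None):
    """Upload one rotated log file to S3, encrypted at rest by S3."""
    import boto3  # imported lazily so the key helper works without the SDK

    key = log_object_key(hostname, log_path, day or date.today())
    boto3.client("s3").upload_file(
        log_path,
        bucket,
        key,
        ExtraArgs={"ServerSideEncryption": "AES256"},  # S3-managed encryption keys
    )
    return key
```

A log rotation hook could call something like `ship_rotated_log("my-log-bucket", "web-1", "/var/log/app/app.log.1.gz")` after each rotation. Grouping keys by host and date is one possible layout that keeps later analysis and lifecycle rules simple.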
THE BAD: Are those AWS-included monitoring tools enough for a complex cloud deployment? And the answer is: Not always.
Cloudwatch can be good for monitoring specific independent resources, but what happens if your deployment is decoupled and distributed? What if it is distributed not only over different availability zones, but also spans different geographic regions across the globe? Well, that’s where you need to think ahead and begin to take extra steps. Meaning, if you want to properly monitor your cloud deployment, you'll need to get your hands dirty, and I mean very dirty.
It is important to note that AWS already allows you to recover gracefully from failures and high loads, but the original problem we described earlier remains. If you misuse your resources, you will either pay more than you should, or fall short on critical parts of your deployed cloud infrastructure and give generally bad service to your customers/end-users. The way you'll know about improper resource usage is by properly monitoring all of your deployments.
Remember, you can extend the metrics in order to include new ones, designed by you (or the open source community). These metrics can be very specialized in nature and reveal the actual state of your most critical services. In the next section, we’ll finish this article by pinpointing where you need to focus to do your homework!
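As a minimal sketch of such a custom metric, the snippet below measures disk space usage from inside an instance and sends it to Cloudwatch through boto3's `put_metric_data` call. The namespace, metric name, and instance ID are assumptions chosen for illustration, and the code presumes boto3 is installed and the instance has permission to publish metrics.

```python
import shutil
from datetime import datetime, timezone


def disk_used_percent(path="/"):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def build_metric_datum(instance_id, percent):
    """Build one datapoint in the shape put_metric_data expects."""
    return {
        "MetricName": "DiskSpaceUtilization",  # hypothetical metric name
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Timestamp": datetime.now(timezone.utc),
        "Value": percent,
        "Unit": "Percent",
    }


def publish(datum, namespace="Custom/System"):
    """Send the datapoint to Cloudwatch under an assumed custom namespace."""
    import boto3  # imported lazily so the helpers above work without the SDK

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=[datum])
```

Run from cron or a systemd timer, a call such as `publish(build_metric_datum("i-0123456789abcdef0", disk_used_percent()))` would add one datapoint per run. Keeping the interval modest avoids the over-monitoring problem mentioned later in this article.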
A final note here: Keep in mind that the monitoring solutions built into AWS will not necessarily be able to correlate events across the different parts of a deployment that is distributed and decoupled in nature. You’ll probably need something else to complement Cloudwatch. Prepare to include software that can correlate events and provide a global view of your systems.
THE UGLY: GET YOUR HANDS DIRTY OR “How I forgot the old ways and started to really think Cloud”.
The first task at hand is: “Think Cloud”. The first mistake most of us old-fashioned tech guys make when we enter the cloud arena is trying to use the cloud the same way we used traditional non-cloud virtualization technologies. Cloud systems, while using the same basic concepts as any other traditional systems, work differently. Very differently. They're elastic, they're self-healing, they're easy to provision, and they are easily misunderstood!
The tasks you’ll need to accomplish in order to give your cloud deployment adequate and effective monitoring (I’ll try to detail them here) are:
Know your cloud deployment down to the very intimate details, especially the way all cloud elements work together to provide a final service. If your service is designed in a decoupled-distributed way, identify the proper tools and techniques needed to provide an effective monitoring solution that will help you reveal real problems when they arise.
If you need to extend the metrics, do it, but in a smart way. Don’t overuse extended metrics or you will create confusion and risk killing your servers by over-monitoring them.
If Cloudwatch is not enough, or if you prefer to monitor your deployment with something else, ensure that solution will also survive failures. After all, losing your monitoring service will blind you. Consider a redundant, multi-zone monitoring service.
Get the application logs outside the application servers (EC2 instances) and rotate them to an S3 bucket. Remember -- AWS runs on the Internet and your EC2 instances are exposed to the dangers of the public network. Script the log rotation inside your servers and send everything to S3. With the logs rotated outside the application servers and stored in a centralized place, you can use smart analysis tools or solutions to evaluate failures in all your cloud deployments. Remember, you can apply “lifecycle” rules to these logs and send older logs to reduced-cost storage classes.
Finally, KNOW THE AMAZON API. Many things here depend on knowing what you can do with AWS. Many interesting things (like rotating our logs to S3 or extending our metrics) need good AWS API knowledge. Here is where you need to train yourself in the cloud arts. Be a Jedi. Know the cloud, Luke!
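The “lifecycle” rules mentioned in the log rotation task above are a good first exercise with the API. The sketch below builds one rule and applies it with boto3's `put_bucket_lifecycle_configuration`. The rule ID, the `logs/` prefix, and the day counts are assumptions for illustration, and the code presumes boto3 and suitable S3 permissions.

```python
def log_lifecycle_rule(prefix="logs/", ia_after_days=30, glacier_after_days=90,
                       expire_after_days=365):
    """Build one lifecycle rule that ages logs into cheaper storage classes."""
    return {
        "ID": "age-out-application-logs",  # hypothetical rule name
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": expire_after_days},
    }


def apply_lifecycle(bucket, rule):
    """Replace the bucket's lifecycle configuration with the single given rule."""
    import boto3  # imported lazily so the rule builder works without the SDK

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": [rule]},
    )
```

Note that this call overwrites whatever lifecycle configuration the bucket already has, so in a real deployment you would likely read the existing rules first and merge them.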
Bottom line: If you want to properly monitor a cloud deployment, you need to know how the cloud works, what tools are available to make the monitoring work in a “cloud” fashion, what limitations exist that can hinder your efforts to keep a “virtual eye” on your services, and what extra tasks you need to complete to overcome all possible limitations and keep the system health visible at all times.
About Loom Systems
Loom delivers an advanced AI-powered Analytics platform, used for real-time detection and resolution in complex environments such as AWS infrastructure. Loom monitors logs & metrics from every component in the IT environment to detect & correlate issues in cross-application context, enabling its users to find the actual root cause of problems in real-time and gain immediate visibility into their environment.
Loom also helps to cut Time-to-Resolution by matching the detected issues with recommendations and action items from TriKB™, an ever-growing multi-sourced knowledge base of resolutions.