Conceptually, “Linux containers” are an operating-system-level virtualization technology that lets us isolate multiple processes (or complex applications) from each other while sharing a single kernel on a single machine. While the result resembles what hypervisors like Xen or QEMU/KVM provide, containers are a lot simpler in comparison (and faster to deploy too).
Docker is one of the most widely used container solutions on Linux (not the only one, but the most popular). Docker can be used to deploy almost any application, from a single script echoing “hello world” to a complex monitoring solution with a database, web server, logs, and everything else it needs, all “contained” in a single package.
Now, the bad news: if you thought monitoring cloud deployments on OpenStack or Amazon Web Services was a complex task, then welcome to Docker hell, because it’s even more difficult, especially when container orchestration solutions are used!
In the following sections, we’ll explore the actual challenges faced when looking for proper Docker monitoring and some tips to overcome those challenges.
First challenge: Containers are not traditional virtual machines.
When a Docker container runs, the processes inside it are isolated within the operating system environment, but they still run in a shared resource context, using the same kernel as the host operating system. Containers are not virtual machines, where every machine runs in complete isolation from a resources point of view. All processes in a container can actually be seen running on the host kernel, which means they are visible to standard monitoring and operating system tools (ps, top, etc.).
In a normal cloud virtual machine, or on a bare-metal server, you can monitor the standard metrics (CPU, disk I/O, RAM usage, network usage) to get a quick view of the general machine state. The problem with Docker is that a single machine (either bare-metal or virtualized) can host hundreds or thousands of “dockerized” applications, each one using its share of the machine resources in a completely different way.
The good news is that per-container metrics are exposed in the host pseudo file system “sys”, inside the “/sys/fs/cgroup” structure. Docker also has its own subcommand, “docker stats”, which shows, per container, the important hardware resource metrics (CPU, RAM, network bandwidth, and disk I/O). With some smart scripting, this gives you a tool that can be adapted to any monitoring solution able to ingest homemade metrics.
Tip: Be smart and do your homework. If the monitoring solution uses SNMP, extend the SNMP service on the host to include, in a lightweight fashion, all the container metrics you need to capture. If your solution uses any kind of extensible agent, apply the same principle. Finally, if your solution only uses scripts, then again, be kind to the host operating system and don’t kill it by over-monitoring or by using heavy scripting.
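As a rough illustration of reading those cgroup files directly, here is a minimal Python sketch. It assumes the stat file uses the simple "key value" per line layout (as files such as cpu.stat and memory.stat do); the function names are made up for this example, and the directory that belongs to a given container is left as a parameter because it depends on the cgroup version and driver in use on the host.

```python
from pathlib import Path


def parse_flat_keyed(text: str) -> dict:
    """Parse a 'key value' per line cgroup stat file into a dict of ints.

    Lines that do not match the expected two-column layout are skipped.
    """
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def read_cgroup_stat(cgroup_dir: str, filename: str = "cpu.stat") -> dict:
    """Read one stat file from a container's cgroup directory.

    cgroup_dir is an assumption supplied by the caller: the location under
    /sys/fs/cgroup varies between hosts, so it is not guessed here.
    """
    return parse_flat_keyed(Path(cgroup_dir, filename).read_text())
```

Because this only reads a small text file, it is cheap enough to call from an SNMP extension or an agent plugin without loading the host.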
Example #1: The “docker stats” command reports these metrics live, refreshing continuously. If you want to use the command in a batch script (a one-shot snapshot rather than real-time reporting), add “--no-stream=true”.
Example #2: Take two containers running in an OpenStack-based virtual machine (CentOS Atomic), one running Apache and the other NGINX. “docker ps” lists both containers, but “ps” run from the host operating system (the atomic-centos host) also shows the Apache and NGINX processes themselves.
Second challenge: It’s not only Docker, it’s the complete infrastructure too.
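For batch use, one possible approach is to ask “docker stats” for one JSON object per container and parse it. The sketch below is a minimal Python example under that assumption; the output dictionary keys (name, cpu_percent, mem_percent) and the function names are choices made for this illustration, not anything Docker defines.

```python
import json
import subprocess

# One snapshot, one JSON object per line (Go template understood by the docker CLI).
STATS_CMD = ["docker", "stats", "--no-stream", "--format", "{{json .}}"]


def _percent(value: str):
    """Turn a string like '0.07%' into a float; return None for placeholders like '--'."""
    try:
        return float(value.rstrip("%"))
    except ValueError:
        return None


def parse_stats(output: str) -> list:
    """Convert the line-delimited JSON printed by docker stats into simple dicts."""
    rows = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        raw = json.loads(line)
        rows.append({
            "name": raw.get("Name", ""),
            "cpu_percent": _percent(raw.get("CPUPerc", "")),
            "mem_percent": _percent(raw.get("MemPerc", "")),
        })
    return rows


def collect_stats() -> list:
    """Run docker stats once on the local engine and return the parsed rows."""
    result = subprocess.run(STATS_CMD, capture_output=True, text=True, check=True)
    return parse_stats(result.stdout)
```

Calling collect_stats() from a cron job or agent plugin gives one lightweight snapshot per run, which is usually kinder to the host than keeping a streaming process open.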
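Since containerized processes show up in the host process list, a monitoring script may want to tell which container a host PID belongs to. A minimal Python sketch follows, assuming the common convention that the container's 64-character hexadecimal ID appears somewhere in the process's /proc/<pid>/cgroup entry. This is a heuristic (other container runtimes use similar IDs), and the function names are invented for the example.

```python
import re
from pathlib import Path

# Heuristic: a full container ID is 64 lowercase hex characters.
_CONTAINER_ID = re.compile(r"[0-9a-f]{64}")


def container_id_from_cgroup(cgroup_text: str):
    """Return the first 64-hex-character ID found in cgroup text, or None."""
    match = _CONTAINER_ID.search(cgroup_text)
    return match.group(0) if match else None


def container_id_for_pid(pid: int):
    """Look up the container ID (if any) for a PID seen on the host."""
    return container_id_from_cgroup(Path(f"/proc/{pid}/cgroup").read_text())
```

A process that belongs to the host itself simply returns None, so the same loop over “ps” output can label both kinds of process.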
Stop here and think about your infrastructure a little bit. You still need to monitor the host (or hosts) running all your little Docker containers. Then, you need to combine the full monitoring of your hosts (virtual or bare-metal) and the specific monitoring of all your running containers as described in the first challenge. Also, in your monitoring solution, find a way to relate the actual host with the containers running inside. It’s not the same to say, “containers X, Y, and Z failed and I don’t know why” as it is to say, “host K failed, affecting containers X, Y, and Z.” See the difference? Relate events in a smart way, and you’ll have a better overall vision of your infrastructure when Murphy decides to strike!
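To make the “host K failed, affecting containers X, Y, and Z” idea concrete, here is a small Python sketch of that correlation step. The data shapes (a container-to-host mapping and a set of hosts known to be down) and the function names are assumptions made for the example; a real monitoring system would supply them from its own inventory.

```python
from collections import defaultdict


def group_failures_by_host(failed, placement):
    """Group failed container names by the host they run on.

    placement maps container name -> host name; unknown containers
    are grouped under the placeholder host 'unknown'.
    """
    by_host = defaultdict(list)
    for container in failed:
        by_host[placement.get(container, "unknown")].append(container)
    return dict(by_host)


def describe_failures(failed, placement, hosts_down):
    """Produce one message per host, naming the host as the cause when it is down."""
    messages = []
    grouped = group_failures_by_host(failed, placement)
    for host, containers in sorted(grouped.items()):
        names = ", ".join(sorted(containers))
        if host in hosts_down:
            messages.append(f"host {host} failed, affecting containers {names}")
        else:
            messages.append(f"containers {names} failed on healthy host {host}")
    return messages
```

The point of the design is that one host-level event replaces several unexplained container-level events.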
Tip: The “docker” command line tool can connect to remote Docker engines, provided they expose their TCP control port. With that in mind, you can centralize your monitoring in a single place (monitoring servers, for example) that connects to your remote Docker daemons and obtains all related metrics, the same way it obtains host metrics using SNMP or dedicated agents. Moreover, you can monitor the state of the Docker TCP port itself (a simple TCP ping will do, but a properly parsed “docker info” will do better) to have a primary source of information about the Docker service running on the host. Translation: Docker TCP port offline = Docker service probably down or unreachable.
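The “TCP ping” mentioned in the tip can be as simple as attempting a connection. The Python sketch below assumes the daemon listens on port 2375, the port conventionally used for unencrypted Docker API access; your hosts may use a different port or TLS, so treat the default as a placeholder. The function name is made up for the example.

```python
import socket


def tcp_port_open(host: str, port: int = 2375, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    The default port is an assumption; pass whatever port your daemons expose.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A successful connection only proves that something is listening; parsing “docker info” against the same endpoint remains the better health signal.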
Third challenge: What about the dockerized application logs?
If your dockerized infrastructure uses applications that write logs to specific locations inside the container’s “virtualized” file system, then you really need those logs somewhere you can analyze them or rotate them from the host operating system. The good news here: Docker already has a way to map directories inside the virtualized file system to directories in the host file system. This means, for example, that your “apache” logs can live outside the container, where you can locate them very easily and do whatever you want with them. Also, think about a centralized log file system, such as an NFS mount point on a NAS or on any server exporting NFS. With this approach, a centralized log analysis solution can see the events logged by your dockerized applications and detect problems almost in real time. Event correlation also becomes easier to implement, provided your analysis tool can perform that task.
Tip: Again, think of lightweight ways of doing things. If you centralize all your logs on a single NFS resource, DO NOT put them all in the same directory. Use a directory structure that makes sense to you and your monitoring system and that helps reduce the I/O load. Directory hashing is a good technique to use here. Many thousands of files in the same directory are a complete no-go for most file systems, especially the networked kind. Remember also that NFS is not the “paragon of efficiency”, so try to manage access to the logs in a smart way.
Do you want to see another example? Suppose we start our containers (one Apache, one NGINX) with the “-v” option, pointing their internal log directories (/var/log/apache2 and /var/log/nginx) to directories in the host operating system, specifically “/var/log/apachelogs” and “/var/log/nginxlogs”. The logs written inside each container then appear in those host directories.
Note: When rotating logs that are still held open by the application writing them, you sometimes need to send the application a “SIGHUP” so that it reopens the file. If you fail to do this, you can end up with phantom files (the application keeps writing to the old, rotated file) or other undesired consequences. You can see this in most operating system packages that ship a “logrotate” configuration. You need to take the appropriate steps to send the proper signal to the application running inside the container. For this task, you can use “docker exec” to execute any command inside the container, including the one that sends the SIGHUP to the application.
Fourth challenge: Enter the container orchestration solutions, or, “how I learned that demons from hell are yellow and love bananas”.
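Directory hashing can be sketched in a few lines of Python. The example below derives two short subdirectory levels from a hash of the container name so that logs spread evenly across many directories. The root path, the choice of SHA-1, and the two-level depth are all assumptions for illustration, and the function name is invented.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_log_dir(root: str, container_name: str, levels: int = 2) -> str:
    """Return root/<h1>/<h2>/.../<container_name>, where each <hN> is two hex
    characters taken from a hash of the container name.

    The hash is used only to spread directories evenly, not for security.
    """
    digest = hashlib.sha1(container_name.encode("utf-8")).hexdigest()
    parts = [digest[2 * i:2 * i + 2] for i in range(levels)]
    return str(PurePosixPath(root, *parts, container_name))
```

With two levels of two hex characters each, the tree has at most 256 x 256 buckets, so even a very large fleet leaves only a handful of entries per directory. The same name always maps to the same path, which keeps lookups cheap for the analysis tool.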
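The “-v” mapping from this example could be scripted roughly as follows. This Python sketch only builds the “docker run” argument list; the image names used in the usage note are placeholders, not images the article names, and the function name is made up for the example.

```python
def docker_run_with_logs(name: str, image: str, host_log_dir: str,
                         container_log_dir: str) -> list:
    """Build a 'docker run' command that maps a host directory onto the
    container's log directory using -v host_dir:container_dir."""
    return [
        "docker", "run", "-d",
        "--name", name,
        "-v", f"{host_log_dir}:{container_log_dir}",
        image,
    ]
```

For instance, docker_run_with_logs("apache", "my-apache-image", "/var/log/apachelogs", "/var/log/apache2") yields a list that can be passed to subprocess.run, and the same call with "/var/log/nginxlogs" and "/var/log/nginx" covers the NGINX container.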
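A post-rotation hook along these lines might look like the Python sketch below. It assumes the application is PID 1 inside the container and that the image ships a shell with “kill” available; neither is guaranteed, so check your images first. The function names are invented for the example.

```python
import subprocess


def sighup_command(container: str, pid: int = 1) -> list:
    """Build the 'docker exec' command that sends SIGHUP to a process
    inside the container (PID 1 by default, which is an assumption)."""
    return ["docker", "exec", container, "sh", "-c", f"kill -HUP {pid}"]


def reopen_logs(container: str, pid: int = 1) -> None:
    """Send the signal; raises CalledProcessError if docker exec fails."""
    subprocess.run(sighup_command(container, pid), check=True)
```

Calling reopen_logs from the rotation job, right after the files are moved, keeps the application from writing to the rotated file.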
The first two challenges can be solved very easily with scripts and by extending agents. The third one is even easier to solve, but what if your Docker infrastructure is running inside an orchestration solution like “Docker Swarm”, “Google Kubernetes”, or “Apache Mesos”? This is where our very own “docker hell” begins!
The aforementioned orchestration solutions are designed like a “cloud” of containers that also controls containers. They can deploy specific containers across several nodes (normally called “minions”, and if you think about little guys that love bananas, that’s OK, you’re fine). The orchestration infrastructure also has “master” nodes that act as the “criminal mastermind”, sometimes running numerous control applications inside containers themselves. The actual challenge here is that you normally don’t control which minion host a specific application container lands on when it’s first deployed. Also, consider that your orchestration solution can dynamically deploy new containers when a specific event happens (e.g. the sudden death of a minion machine).
The good news here is that you can query the control layer (which can be clustered, load-balanced, and normally uses service discovery tools) to find out what’s running and where it’s running. So, the real challenge is to build a monitoring infrastructure that can interact with the orchestration control layer (or discovery services) to obtain all the running services (a “discovery”), parse the data in a smart way, and include the actual dockerized application metrics in the monitoring stack. This monitoring infrastructure also needs to check for changes from time to time. If any event changes the distribution of the dockerized applications, the monitoring infrastructure needs to pick up those changes (a “re-discovery”) and act accordingly.
Tip: The discovery and re-discovery tasks are critical here. For any orchestrated container solution, the actual monitoring needs to monitor not only the cluster elements (Masters and Minions) but also the applications running inside the minions. Because the actual distribution of those elements can change during the life of the cluster, a dynamic monitoring approach with self-configuration needs to be taken into account. Also, whatever solution is used needs to be smart and lightweight in order to “not kill” the cluster by excessive discovery of the things running inside.
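The re-discovery step boils down to comparing two snapshots of “what runs where”. The Python sketch below assumes each discovery pass yields a plain mapping of container name to minion host; how that mapping is obtained from the orchestrator is left out, and the function name is made up for the example.

```python
def diff_placements(old: dict, new: dict) -> dict:
    """Compare two discovery snapshots (container name -> host name).

    Returns the containers that appeared, disappeared, or moved to another
    host, so the monitoring configuration can be updated incrementally.
    """
    added = {c: h for c, h in new.items() if c not in old}
    removed = {c: h for c, h in old.items() if c not in new}
    moved = {
        c: (old[c], new[c])
        for c in old.keys() & new.keys()
        if old[c] != new[c]
    }
    return {"added": added, "removed": removed, "moved": moved}
```

Applying only the differences, rather than rebuilding the whole monitoring configuration on every pass, is one way to keep discovery lightweight for both the cluster and the monitoring servers.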
About Loom Systems
Loom delivers an AI-powered operational analytics platform that leverages machine-learning algorithms to automatically analyze logs & metrics at scale and in real time. Loom is capable of detecting & correlating events across the Docker environment to pinpoint the actual root cause of issues and eliminate blind spots. In addition to automating root cause analysis, Loom matches detected issues with recommended action items from TriKB™, its proprietary crowd-sourced knowledge base of resolutions.