The question is no longer whether to log, but how and what to log. That has become the area of focus for IT operations teams looking to consistently improve application performance and ROI. We’ve gathered some best practices that will help you “log smarter” and save you valuable time and resources when tracking down a problem.
#1 KNOW YOUR AUDIENCES
When dealing with logs, the first thing to understand is that your application logs have two very different audiences: humans and machines.
Machines are good at processing large amounts of structured data quickly and automatically.
Humans, on the other hand, are not as good at processing large amounts of data, and it takes us time to read through logs. However, humans deal well with unstructured data.
In order to get the most out of your logs, you need to make your logs both readable for humans and structured for machines.
#2 HAVE A CONSISTENT STRUCTURE ACROSS ALL LOGS
A prerequisite for good logging is a standard log structure that is consistent across all of your log files.
Each log line should represent one single event and contain at least the timestamp, the hostname, the service and the logger name.
Additional values can include the thread or process ID, event ID, session ID and user ID.
Other important values may be environment related, such as instance ID, deployment name, application version, or any other key-value pairs related to the event.
Use a high-precision timestamp (millisecond resolution or better) and make sure your timestamp format includes time zone data. Unless you have an exceptionally good reason not to, use ISO 8601.
Finally, if you feel like a real pro, add a unique ID to every log line. A log line will usually have some fixed part and some varying part, which makes it difficult to filter specific patterns in or out (although we all love regular expressions). This is where the unique ID comes in handy.
When logging errors, add an error ID. This will be very useful for lookups in your knowledge-management system (which of course you have).
These IDs are important for tracking and correlating issues in different components and across your architecture.
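As a rough illustration of the fields listed above, here is a minimal sketch using Python's standard logging module. The service name, the logger name, the key=value layout and the `msg_id` attribute are assumptions made for this example, not a prescribed format. `msg_id` is read here as an identifier for the log statement, which is one way to interpret the "unique ID" described above.

```python
import logging
import socket
from datetime import datetime, timezone

SERVICE_NAME = "checkout"        # hypothetical service name
HOSTNAME = socket.gethostname()  # resolved once at start-up


class StandardFormatter(logging.Formatter):
    """Gives every log line the same leading fields in the same order."""

    def format(self, record):
        # ISO 8601 timestamp in UTC, millisecond precision, explicit offset.
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(
            timespec="milliseconds"
        )
        # Assumption: callers pass the statement ID via extra={"msg_id": ...}.
        msg_id = getattr(record, "msg_id", "-")
        return (
            f"{ts} host={HOSTNAME} service={SERVICE_NAME} "
            f"logger={record.name} level={record.levelname} "
            f"pid={record.process} thread={record.threadName} "
            f"msg_id={msg_id} msg={record.getMessage()!r}"
        )


def build_logger(name):
    """Return a logger that writes to stderr using the standard structure."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(StandardFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


logger = build_logger("payments.api")  # hypothetical logger name
logger.info("Charged user id=%s", 7, extra={"msg_id": "PAY-0042"})
```

With a fixed `msg_id`, filtering for one kind of event becomes a plain string match instead of a regular expression over the varying part of the message.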
#3 UNDERSTAND METRICS
A core concept in logging is metrics.
A metric is the value of a specific property at a specific time, usually measured at regular intervals.
The common metric types are:
Meter – measures the rate of events (e.g. rate of visitors to your website)
Timer – measures the time some procedure takes (e.g. your webserver response time)
Counter – increments and decrements an integer value (e.g. number of signed-in users)
Gauge – measures an arbitrary value (e.g. CPU usage)
Each metric describes a state of some property of the system.
The cool thing about metrics is having lots of them and being able to correlate them with one another. For example, if we find that “Time Spent On Web Page” increases whenever users of our application call the “Get Cat Photo” method, we can infer that our users prefer cat photos over other photos.
We recommend you track metrics and either log them or store them separately from your logs.
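The four metric types can be sketched in a few lines of Python. This is a simplified, single-threaded illustration rather than a replacement for a real metrics library. The class and method names are chosen for this example only, and the meter reports a plain average rate since creation.

```python
import time
from contextlib import contextmanager


class Counter:
    """Integer value that can be incremented and decremented."""

    def __init__(self):
        self.value = 0

    def inc(self, n=1):
        self.value += n

    def dec(self, n=1):
        self.value -= n


class Gauge:
    """Holds the most recent reading of an arbitrary value."""

    def __init__(self):
        self.value = None

    def set(self, value):
        self.value = value


class Timer:
    """Records how long each measured block took, in seconds."""

    def __init__(self):
        self.durations = []

    @contextmanager
    def measure(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.durations.append(time.perf_counter() - start)


class Meter:
    """Counts events and reports their average rate per second since creation."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self.count = 0

    def mark(self, n=1):
        self.count += n

    def rate(self):
        elapsed = self._clock() - self._start
        return self.count / elapsed if elapsed > 0 else 0.0


signed_in = Counter()
signed_in.inc()                      # a user signed in

response_time = Timer()
with response_time.measure():        # time one (simulated) request
    sum(range(1000))
```

Sampling these objects at regular intervals, and writing the samples to your logs or to a separate store, gives you the time series you need for correlation.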
#4 REPORTING ALERTS AND EXCEPTION HANDLING
If something happens within your code, and you already know for sure what happened and perhaps what should be done, don’t log and then set an alarm on that specific log – that’s complex and error prone. Instead, fire an alert directly from within the code.
Also, when logging an exception, keep in mind that while the stack trace is useful, it’s hard to read. Use libraries like Apache Commons Lang’s ExceptionUtils to summarize the stack trace and make it easier to consume.
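If you are not on the JVM, a similar summary can be built by hand. The sketch below uses Python's standard traceback module to reduce an exception to one line: the outer error, its root cause, and the location where the root cause was raised. The function name `summarize_exception` and the output layout are assumptions for this example. It is not a port of ExceptionUtils.

```python
import traceback


def summarize_exception(exc):
    """Return a one-line summary of an exception and its root cause."""
    # Walk the exception chain (explicit `raise ... from ...` first, then
    # implicit chaining unless it was suppressed) down to the root cause.
    root = exc
    while True:
        nxt = root.__cause__
        if nxt is None and not root.__suppress_context__:
            nxt = root.__context__
        if nxt is None:
            break
        root = nxt

    # The last frame of the root cause's traceback is where it was raised.
    frames = traceback.extract_tb(root.__traceback__)
    if frames:
        last = frames[-1]
        where = f"{last.filename}:{last.lineno} in {last.name}"
    else:
        where = "unknown location"

    summary = f"{type(exc).__name__}: {exc}"
    if root is not exc:
        summary += f" (root cause: {type(root).__name__}: {root})"
    return f"{summary} at {where}"
```

One option is to log this summary at error level and keep the full stack trace at debug level (for example by passing `exc_info=True` to the logger), so the trace is still available when someone needs it.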
#5 USE LOG SEVERITY LEVELS
Different events have different severity implications. This is important because it enables you to differentiate severe and important events from irregular or even regular events.
Do not dismiss lower-severity issues. They can be used as data points when trying to create a baseline for the application behavior.
Your log files should contain mostly Debug, Info and Warn messages, and very few Error messages.
#6 ALWAYS PROVIDE CONTEXT
Developers write logs in line with the code. This means that when writing the logs in the code, the developers base the log on the context of the code. Unfortunately, the person reading the log doesn’t have that context, and sometimes doesn’t even have access to the source code.
For example, let’s compare the following two log lines:
“The database is down”
“Failed to get user preferences for user id=1. Configuration database not responding. Will retry in 5 minutes.”
Reading the second log line, we easily understand what the application was trying to do, what component failed, and if there’s some kind of resolution for this issue.
Each log line should contain enough information to make it easy to understand exactly what was going on, and what the state of the application was during that time.
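One way to make context hard to forget is to attach it once and let the logger add it to every message. The sketch below uses `logging.LoggerAdapter` from Python's standard library. The adapter class, the logger name and the context keys are hypothetical and chosen only for this example.

```python
import logging


class ContextAdapter(logging.LoggerAdapter):
    """Appends key=value context to every message logged through it."""

    def process(self, msg, kwargs):
        context = " ".join(f"{key}={value}" for key, value in self.extra.items())
        return f"{msg} [{context}]", kwargs


base = logging.getLogger("preferences")  # hypothetical logger name
log = ContextAdapter(base, {"user_id": 1, "component": "config-db"})

# States what was attempted, what failed, and what happens next.
log.warning(
    "Failed to get user preferences: database not responding. "
    "Will retry in %d minutes.",
    5,
)
```

The message text still has to say what the application was trying to do. The adapter only guarantees that the identifying values travel with it.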
#7 CHOOSE A GOOD LOGGING FRAMEWORK AND USE ITS ADVANCED FEATURES
Refrain from trying to roll your own logging framework. There are plenty of excellent logging libraries for every programming language out there. Well, maybe except for TrumpScript.
Logging frameworks enable you to set up different appenders, each with its output formats and its custom log pattern.
Other standard features include automatically adding the logger name and a timestamp, support for multiple severity levels and filtering by these levels.
Logging frameworks also have the following advanced features that you should be using:
Configure different log-level thresholds for different components in your code
Use a lossy appender which drops lower-level events if queues get full
Use a logs-summarizing appender which will log: “the following message repeated X times: [message]” instead of repeating it X times
Put a threshold on the log level, and configure it to also output N lower-level log lines when a higher-severity event occurs
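Two of the features above can be sketched with Python's standard logging module: per-component thresholds, and holding back lower-level lines until a higher-severity event occurs. The component names and the `TrailingContextHandler` class are hypothetical. Emitting the N most recent held-back lines is one interpretation of that last feature, and frameworks that ship it may behave differently.

```python
import logging
from collections import deque

# Per-component thresholds: child loggers inherit the level of "app"
# unless they set their own.
logging.getLogger("app").setLevel(logging.INFO)
logging.getLogger("app.db").setLevel(logging.DEBUG)      # hypothetical component
logging.getLogger("app.http").setLevel(logging.WARNING)  # hypothetical component


class TrailingContextHandler(logging.Handler):
    """Holds back records below `threshold`. When a record at or above the
    threshold arrives, forwards the last `capacity` held-back records to
    `target`, followed by the triggering record."""

    def __init__(self, target, threshold=logging.ERROR, capacity=5):
        super().__init__(level=logging.NOTSET)
        self.target = target
        self.threshold = threshold
        self.recent = deque(maxlen=capacity)  # older records fall off the front

    def emit(self, record):
        if record.levelno < self.threshold:
            self.recent.append(record)
            return
        for held in self.recent:
            self.target.handle(held)
        self.recent.clear()
        self.target.handle(record)


# Usage: debug lines from app.db reach stderr only if an error follows them.
db_log = logging.getLogger("app.db")
db_log.addHandler(TrailingContextHandler(logging.StreamHandler(), capacity=3))
db_log.debug("opening connection")  # held back, nothing is written yet
```

The trade-off is memory for noise: lower-level lines cost a small buffer per handler, and they only reach the output when they are likely to matter.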
#8 WRITE STRUCTURED LOGS (SOMETIMES)
Having your appenders write structured logs might incur some performance hit, but if you can take it, it’s worth it. It will make it much easier later to load your logs into analysis tools or process them with log-brokers.
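A minimal sketch of a structured appender, written as a Python `logging.Formatter` that renders each event as one JSON object per line. The field names and the `fields` attribute used to carry extra key-value pairs are assumptions for this example, not a standard schema.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Renders each record as a single-line JSON object."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(timespec="milliseconds"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Assumption: callers pass key-value pairs via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        # default=str keeps unusual values (dates, paths) from raising.
        return json.dumps(payload, default=str)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("billing")  # hypothetical logger name
log.addHandler(handler)
log.warning("Retrying charge, attempt %d", 3, extra={"fields": {"user_id": 1}})
```

Serializing every event costs some CPU, which is the performance hit mentioned above. In return, downstream tools can read the fields without any parsing rules.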
#9 LOG A LOT AND THEN LOG SOME MORE
We often neglect to write important logs, or even skip logging intentionally to keep our logs compact. More often than not, you will waste more time as a result of not having the answer in your logs than you would have spent writing them. The quality of your logs is part of the quality of your code.
Use centralized logging solutions, automatic log processing systems, and apply the techniques outlined above to keep the logs useful – just keep logging.
That said, operations being operations, we know this list will always be partial and growing. We’d appreciate your input so we can periodically update it with your tips from the field. If you have anything to add or ask, please let us know and share with the community.
Loom Systems delivers an AIOps-powered log analytics solution, Sophie, to predict and prevent problems in the digital business. Loom collects logs and metrics from the entire IT stack, continually monitors them, and gives a heads-up when something is likely to deviate from the norm. When it does, Loom sends out an alert and recommended resolution so DevOps and IT managers can proactively attend to the issue before anything goes down.