I've had a number of discussions, both offline and online, about logging practices. In my view, reading individual log lines is (almost) always a sign of gaps in your system's monitoring tools – especially when dealing with software at medium or large scale. Just as I would never want to use tcpdump to analyze statsd metric packets one at a time, ideally I don't want to look at individual log lines either.
This advice follows the spirit of "disable port 22 on your machines". While most developers would agree that using dedicated orchestration tools is preferable to manual server wrangling, I know of very few projects or companies that take the extreme stance of disabling SSH access to all machines. Nevertheless, treating manual SSH as a sign of gaps in your system's orchestration tools is an excellent guiding principle for architecting scalable and maintainable systems.
Similarly, I know of none that truly send all logs to /dev/null, or block read access to all logs. But treating the reading of logs as an extreme anti-pattern provides an excellent forcing function for designing observable systems.
Logs as Metrics
Let's start with a basic example:
log.Printf("Loaded page in %f seconds", duration)
In this case, the log line serves the same purpose as a statsd metric. We could replace it immediately with
statsd.Histogram("page.load.time_ms", duration)
and the result would be better, because we'd be able to use the full extent of the aggregation tools at our disposal. It's possible to extract this information from a log line into a structured form, but it's more work, and it's unnecessary. The log line doesn't give us any information that the structured metric doesn't.
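To make the comparison concrete, here's a minimal sketch of the statsd-style replacement, assuming a local agent listening on UDP 127.0.0.1:8125 and a hypothetical loadPage function; a real project would use an existing statsd client library rather than the hand-rolled wire format shown here.

package main

import (
    "fmt"
    "net"
    "time"
)

func main() {
    // Metrics are fire-and-forget over UDP; if the agent is unreachable,
    // the program just carries on.
    conn, err := net.Dial("udp", "127.0.0.1:8125")
    if err != nil {
        return
    }
    defer conn.Close()

    start := time.Now()
    loadPage() // stand-in for the work being timed
    elapsed := time.Since(start)

    // statsd-style histogram wire format: <name>:<value>|h (timers use |ms)
    fmt.Fprintf(conn, "page.load.time_ms:%d|h", elapsed.Milliseconds())
}

func loadPage() { time.Sleep(50 * time.Millisecond) }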
Logs as Debugger Tracing
A more common example:
log.Printf("about to make API request on %v", objID)
obj := c.Load(objID)
if obj == nil {
    log.Printf("could not load object %v", objID)
} else {
    log.Printf("loaded object %v", objID)
}
Logs are oftentimes used as a runtime pseudo-debugger. In this case, we're using logs as a way to verify that a particular line of code was called for a particular transaction. The actual text of the log doesn't even matter. Instead of "about to make API request on", we could have written
log.Printf("api.request_pre %v", objID)
or even
log.Printf("I like potato salad %v", objID)
As long as it's unique to that particular line of code, it serves functionally the same purpose – it confirms that the program execution reached that point in the code.
When we use log lines this way, we're forming a mental model of the code, and using the logs to virtually step through the code, exactly the way a traditional debugger like GDB might. Transaction-level (or request-level) tracing tools provide this same kind of visibility, with a better visual display.
Without actually counting, I'd estimate that at least 80% – if not more – of the log lines I've seen in most open-source projects fit this overall use case: using log lines to virtually "trace" the execution path of source code on a particular piece of problematic input.
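As a rough sketch of what that tracing-based alternative can look like, here's the same load wrapped in an OpenTelemetry span. The Go API calls are real, but the span name, attribute key, and the Client/Object stubs are illustrative, and a real service would also need a configured trace provider and exporter.

package tracing

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

// Stub types standing in for whatever client the original snippet used.
type Object struct{}
type Client struct{}

func (c *Client) Load(ctx context.Context, id string) (*Object, error) {
    return &Object{}, nil // stand-in for the real API call
}

func loadObject(ctx context.Context, c *Client, objID string) (*Object, error) {
    // The span records that execution reached this point, when it started,
    // how long it took, and whether it failed – no unique log string needed.
    ctx, span := otel.Tracer("api").Start(ctx, "client.Load")
    defer span.End()
    span.SetAttributes(attribute.String("obj_id", objID))

    obj, err := c.Load(ctx, objID)
    if err != nil {
        span.RecordError(err)
        return nil, err
    }
    return obj, nil
}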
Logs as Error Reporting
Another common pattern:
try:
    writeResult()
except Exception as e:
    log.info("error writing result! %s", e)
Here, we're using logs as a way to capture context for an error. But again, this is a relatively inconvenient way to explore information like a stacktrace. Tools like Sentry or Crashlytics also allow exception reporting, but unlike logging tools, they allow us to classify and group exceptions. We can still view individual stacktraces, but we don't have to sift through as much noise to identify problems. And by tracking the state of reported exceptions, we can identify regressions more easily. Structured logging systems are generally not capable of handling this – and even when the workflow is possible, it's nowhere near as convenient as what dedicated exception tracking systems allow.
If you really can't break the habit of logging errors, you can at least add a hook to log.error that sends the error (and a complete stacktrace) to an error reporting tool like Sentry.
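Here's a rough sketch of what that hook might look like in Go, assuming the sentry-go SDK; the empty DSN, the logError helper, and the writeResult stub are placeholders, so check the SDK docs for the exact setup.

package main

import (
    "errors"
    "log"
    "time"

    "github.com/getsentry/sentry-go"
)

// logError keeps the familiar log line, but also forwards the error to the
// exception tracker, where it can be grouped and deduplicated.
func logError(err error) {
    log.Printf("error: %v", err)
    sentry.CaptureException(err)
}

func main() {
    // DSN omitted here; in a real service it comes from configuration.
    if err := sentry.Init(sentry.ClientOptions{Dsn: ""}); err != nil {
        log.Printf("sentry init failed: %v", err)
    }
    defer sentry.Flush(2 * time.Second)

    if err := writeResult(); err != nil {
        logError(err)
    }
}

// writeResult is a stand-in for the operation from the example above.
func writeResult() error { return errors.New("could not write result") }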
Logs as Durable Records
Furthermore, you can't even assume that the logs you try to write will actually get written! Even when operating on a small scale, logs can be a lossy pipeline, and the potential for failure only increases with scale.
For example, if you're developing a script intended to run on your local machine, do you know how your code will behave if your disk hangs? If you run out of space on the partition? How reliable is your log rotation? What happens to your server if it fails?
For a very small script, these sorts of failures may not matter to you, but they can come back to bite you, even at that small scale. For larger-scale services with tighter reliability guarantees, there are ways to mitigate these specific problems – a buffer, a log collector, a distributed indexer – but each solution comes with its own risks and problems. If you keep patching these by introducing more tools to make your logging reliable, at some point you'll discover that you've reinvented your own distributed database. Writing your own database is not inherently a bad idea, but it's the sort of task that's best undertaken intentionally, rather than by accident, if you want to do it well.
To be fair to logging, this problem is not unique to logs. It stems from the limitations imposed by the CAP theorem, which every monitoring tool has to find its own way to deal with. The problem with logs, however, is that the failure modes are subtler and easier to overlook.
For example:
statsd.Increment("api.requests_total", tags={"country": country.String()}, rate=0.1)
It's relatively obvious that this line of code might not always emit a metric, because even under normal operating conditions, you know that (a) network operations can have problems, (b) it uses UDP, and (c) the metrics are sampled at the specified rate.
It's a lot less obvious that this line of code might fail, because STDOUT and STDERR are generally assumed to be always available under normal operating conditions:
log.Printf("Received api request from %s", country.String())
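To make that concrete, here's a small sketch showing that Go's standard logger silently discards write errors; the failing writer below stands in for a full disk or a hung log destination.

package main

import (
    "errors"
    "fmt"
    "log"
)

type failingWriter struct{}

func (failingWriter) Write(p []byte) (int, error) {
    return 0, errors.New("disk full") // simulate a broken log destination
}

func main() {
    log.SetOutput(failingWriter{})
    // log.Printf ignores the error returned by the underlying writer,
    // so this line is lost without any signal to the program.
    log.Printf("Received api request from %s", "DE")
    fmt.Println("the program carries on as if nothing happened")
}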
Compared to logging, the failure modes of using runtime metrics, request tracing tools, or exception tracking tools are more visible and well-defined.
Instead of using logs as an accidental database, consider your underlying use case and which dedicated database would serve that need better. There's no one-size-fits-all answer here; you may find that your use case is best served by a NoSQL database like CouchDB, or you may find that you're really aiming to replicate the functionality of a message queue like Kafka, or another database altogether. If any of these (or other) tools fit your use case, they're almost guaranteed to be a better fit, long-term, than logs.
Don't Stop Logging, But Stop Reading Log Lines
By this point, it may sound like I'm firmly anti-logging. I do consider myself an environmentally-friendly person, but when it comes to software, I support careful uses of logging.
Logging can be useful for some purposes. However, it's rare that logs are the only tool for monitoring your code, and it's even rarer that they're the best tool. When writing software that scales, you need to be able to deal with aggregate information – the raw firehose is too unwieldy to parse mentally. Logs that can be aggregated are better than logs that can't. In those cases, it's best to keep logging, but when you need to diagnose a problem, you'll want to run aggregate queries across your logs rather than read raw, unaggregated logs in chronological order. The former is a powerful way to absorb a lot of information about your systems quickly. The latter is a glorified tail -f | grep.
The next time you start to write a log line, ask yourself whether another observability tool would be a better fit. Oftentimes, it will! If not, that's fine. Just remember that, ideally, nobody should ever be reading that raw line, so take care to structure the information in a way that facilitates the kind of aggregation queries you'll need.
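As one sketch of what that structure can look like, here's a minimal example using the JSON handler from Go's log/slog package (Go 1.21+); the field names and values are illustrative.

package main

import (
    "log/slog"
    "os"
    "time"
)

func main() {
    logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

    duration := 125 * time.Millisecond
    // Every field is queryable on its own, so an aggregation query like
    // "p99 of duration_ms by country" works without parsing message text.
    logger.Info("page_load",
        slog.Int64("duration_ms", duration.Milliseconds()),
        slog.String("country", "DE"),
    )
}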