This post is part of a series on monitoring.
Monitoring systems generally fall into two categories: those where services push metrics to the monitoring system, and those where the monitoring system pulls metrics from services. This can be a surprisingly contentious issue.
In reality there are two aspects to how a monitoring system collects metrics, each of which can be push or pull, and the two are sometimes conflated:
- Target discovery
- Who initiates metric transfer
Let’s look at each in turn.
Push vs pull for target discovery is better framed as top-down or bottom-up, a common question in configuration management.
Do your services push by advertising themselves, such as by registering in Zookeeper or connecting out to the monitoring system? Or do you determine your targets based on where they should be, such as pulling from Ansible inventory files or what your cluster scheduler thinks? There are advantages and disadvantages to each approach.
Bottom-up has the benefit that you don’t need to do anything to add a new target, it’ll automatically be picked up. The disadvantages are that it’s more difficult to monitor for what should be running due to a lack of a central source of truth, and rogue/misconfigured targets can cause problems.
Top-down avoids these issues, but you may have to do work to configure new services and targets in the source of truth.
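As a concrete illustration of the top-down approach, here is a minimal sketch of deriving scrape targets from an Ansible-style INI inventory file. The inventory contents, group name, and port are assumptions for illustration, not any particular system's format:

```python
# Top-down target discovery sketch: the inventory is the source of
# truth, so anything listed in it is expected to be up and monitorable.
import configparser


def discover_targets(inventory_text: str, group: str, port: int = 9100) -> list[str]:
    """Turn an Ansible-style INI inventory into a list of scrape targets."""
    # Host lines have no value, so allow_no_value is needed; restrict
    # delimiters to "=" so hostnames containing ":" aren't split.
    parser = configparser.ConfigParser(allow_no_value=True, delimiters=("=",))
    parser.read_string(inventory_text)
    return [f"{host}:{port}" for host in parser[group]]


inventory = """
[web]
web1.example.com
web2.example.com

[db]
db1.example.com
"""

targets = discover_targets(inventory, "web")
# targets == ["web1.example.com:9100", "web2.example.com:9100"]
```

Because the target list comes from the source of truth rather than from targets advertising themselves, a host that is in the inventory but not responding shows up as a problem rather than silently disappearing.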
Who Initiates Metric Transfer
Your monitoring system and target now know enough to talk. The next question is who initiates the communication, and who determines when metrics are sent.
Regular metric collection is useful, as uneven timestamps and gaps make time series more difficult to work with. The monitoring system could poll the target at regular intervals, and similarly the target could push up metrics in a background thread. Both approaches result in pretty much the same traffic over the wire.
Where things get interesting is when the target is overloaded or has terminated. With push, all the monitoring system sees is that a target's metrics haven't arrived recently. Combined with bottom-up target discovery, that is hard to distinguish from the target having been intentionally terminated. With polling you'll know something is wrong due to a failed scrape, and exactly how it failed is useful debugging information. You can also track how long the scrape took.
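The polling side of this can be sketched in a few lines: scrape each target over HTTP, and record not just the metrics but whether the scrape succeeded and how long it took. The `/metrics` path and the exact failure handling are assumptions here, not a specific system's API:

```python
# Pull-based collection sketch: a failed scrape tells you both that the
# target is down and *how* it failed (refused, timed out, DNS error...).
import time
import urllib.request


def scrape(target: str, timeout: float = 5.0) -> dict:
    """Scrape one target, returning its metrics plus scrape metadata."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(f"http://{target}/metrics", timeout=timeout) as resp:
            body = resp.read()
        return {"up": 1,
                "scrape_duration_seconds": time.monotonic() - start,
                "body": body}
    except OSError as err:
        # urllib wraps connection and timeout failures in OSError
        # subclasses; the error string is useful debugging information.
        return {"up": 0,
                "scrape_duration_seconds": time.monotonic() - start,
                "error": str(err)}
```

Run on a regular interval, the synthetic `up` and `scrape_duration_seconds` values give you exactly the failure visibility that pure push lacks.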
Where You Must Use Push
It’s useful to have metrics about the completion of batch jobs, which may be short-lived. In this case pull-based monitoring isn’t going to cut it, as it would require logic to make the batch job hang around and wait to be discovered and scraped.
Having the batch job push its completion metrics to a known location and then terminate is much simpler, especially when each run of the batch job can happen on different machines.
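A sketch of what that push looks like: on completion, the batch job sends its metrics to a known collection point and exits. The URL, payload shape, and metric names here are hypothetical, not a particular push gateway's API:

```python
# Push sketch for a short-lived batch job: report completion to a fixed
# location, regardless of which machine this run happened on.
import json
import time
import urllib.request


def push_completion(push_url: str, job: str, succeeded: bool) -> None:
    """POST completion metrics for a batch job run, then return."""
    payload = json.dumps({
        "job": job,
        "last_completion_timestamp_seconds": time.time(),
        "last_run_succeeded": 1 if succeeded else 0,
    }).encode()
    req = urllib.request.Request(
        push_url, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=5.0) as resp:
        resp.read()
```

The monitoring system can then alert on a stale `last_completion_timestamp_seconds`, which covers both the job failing and the job never running at all.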
This does bring up a downside of pure push-based approaches to a configured location. If you want to test a new monitoring setup on your laptop, how do you get your targets to send information to your test system as well as the production monitoring system? Reconfiguring all of your targets every time you wanted to try something out would be arduous.
At the end of the day, all the disadvantages mentioned here can be worked around or designed around. While I have a mild preference for pull approaches, it’s not a major factor when I’m evaluating a monitoring system. Other aspects such as instrumentation, ease of configuration, available integrations, flexible analysis, useful dashboards and running costs are more important considerations.