Monitoring and scaling applications with ContainerPilot telemetry

Application health checking is a key feature of ContainerPilot (formerly Containerbuddy). The user-defined health check gives us a binary way of determining whether the application is working. If the application is healthy, ContainerPilot sends a heartbeat to the discovery catalog, and if it's not, other ContainerPilot-enabled applications will stop sending requests to it. But automatic scaling of a service depends on more than just the pass/fail of a health check. Every application has key performance indicators that tell us if the service is nearing overload and should be scaled up or is under-utilized and can be scaled down. Today we're introducing ContainerPilot telemetry, a new feature that brings operators awareness of those performance metrics.

When applications expose their own telemetry, they provide an interface between each of the containers in the service and the scheduler used to start and stop them. If the interface is well-known then it can be used by many different schedulers, including ones that are very simple and only need to support starting and stopping of containers. For ContainerPilot we've implemented a Prometheus-compatible API. This means that user-defined metrics are recorded and served by ContainerPilot over HTTP when a Prometheus-compatible client scrapes the telemetry endpoint.

How it works

As with our health checks and on-change handlers, the ContainerPilot telemetry feature relies on user-defined behavior to take measurements. In the telemetry feature we call a user-defined behavior for measurement a sensor; this is any bit of executable code that returns a number on stdout. The ContainerPilot configuration will tell the Prometheus endpoint what type of metric this number is. For example, it might be a gauge (a floating-point number at a particular point in time), or it might be aggregated into a sliding window of histograms.
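As a rough illustration of what a sensor can look like, here is a minimal sketch in bash. The function name `load_average`, the choice of metric, and the optional file argument are assumptions made for this example and not anything ContainerPilot defines; the only contract taken from the description above is that the sensor prints a single number on stdout.

```shell
#!/bin/bash
# Hypothetical sensor: prints the 1-minute load average, and nothing else,
# on stdout. The optional file argument is an assumption that exists only so
# the function can be exercised against a fixture instead of /proc/loadavg.
load_average() {
    local source="${1:-/proc/loadavg}"
    if [ ! -r "${source}" ]; then
        # Diagnostics go to stderr so stdout carries only the measurement.
        echo "cannot read ${source}" >&2
        return 1
    fi
    # The first field of /proc/loadavg is the 1-minute load average.
    awk '{print $1}' "${source}"
}
```

A sensor script built this way would call `load_average` as its last step, so the number is the only thing written to stdout.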

All the metrics collected by the sensors are then served by an embedded HTTP server. ContainerPilot advertises this service to the discovery service just as it does any of the services you define for your application. A Prometheus-compatible web scraper just has to make an HTTP GET to the /metrics endpoint and it receives the current state of the application.

Why pull metrics?

Pulling metrics via HTTP rather than pushing them to a collector has a number of performance and reliability advantages. The metrics collector can take into account the total system load (and its own load) and back off on collection if it would impact production workloads; this reduces the resolution of metrics without causing back-pressure on every container. Applications don't need to be aware of how collectors are deployed or how many there are, and we can use the service discovery features of ContainerPilot to let collectors find the applications they scrape.

Among the major public cloud providers (including Triton!) and in Mesos or Kubernetes deployments, getting the performance metrics of the underlying infrastructure is also an HTTP pull. This means that in many deployment scenarios collecting metrics is already going to involve a collecting agent that's pulling. So this allows ContainerPilot telemetry to fit in among the rest of the metrics collection.

This model also gives us the opportunity to develop some upcoming features for ContainerPilot, like exposing information about the ContainerPilot configuration to applications that want to map relationships between containers and their status.

Compare telemetry

How does ContainerPilot's telemetry fit in with other metrics collectors? The behavior of ContainerPilot sensors is mostly user-defined, so the sky is the limit when it comes to what you want to measure. ContainerPilot's telemetry is ideally suited for determining whether to scale a service, rather than as a general system metrics agent. It is not intended to replace system-wide metrics agents like cAdvisor (or our upcoming monitoring features in RFD27), nor can it replace the language-specific instrumentation that solutions like New Relic APM can provide (aside: New Relic's Insights would be a great tool for visualizing telemetry data...). We expect that many users will use multiple metrics solutions, but that they will shape ContainerPilot's telemetry specifically to make scaling decisions.

| Feature | cAdvisor | ContainerPilot telemetry | New Relic APM |
| --- | --- | --- | --- |
| Sensors | cgroups and system stats | user-defined; any executable in the container | language-specific and system stats |
| Where it runs | On the host | In the container | In the container |
| Scope | Entire Linux host | Individual application instances | Individual application instances with language-specific detail |
| Best usage | Host utilization, with breakdown by container | Application scaling indicators | Application performance debugging and analysis |
| Collects common system details (memory, CPU, disk, etc.) | Yes | If instrumented | Yes |
| Supports application-specific monitoring | No | Yes | Yes |
| Supports alerts | No | Yes, with Prometheus | Yes, with New Relic Insights |
| Open source | Yes | Yes | No |

See telemetry in action

tl;dr: check out the repo, configure your environment, then ./

We've previously demonstrated the autopilot pattern as applied to a complete application -- a Node.js application named Touchbase backed by Couchbase and load balanced by Nginx. Let's see how telemetry works for this application. You can follow along in the code on GitHub. Detailed documentation about configuring the ContainerPilot telemetry feature can be found in its README.

Our Nginx application includes the http_stub_status module that exposes runtime metrics about Nginx over HTTP. We want to transform these stats into metrics suitable for the Prometheus-compatible API. Two important statistics to operators of Nginx are the number of dropped connections and the ratio of connections in use to the number of available connections. We can write a sensor that pulls the raw information from the http_stub_status module and provides the derived metric to ContainerPilot so that it can be served over our telemetry endpoint. This ends up being just a few lines of bash:

    # Cumulative number of dropped connections
    unhandled() {
        local scraped=$(curl -s localhost/health)
        # Quote ${scraped} so the line breaks survive and awk can pick line 3
        local accepts=$(echo "${scraped}" | awk 'FNR == 3 {print $1}')
        local handled=$(echo "${scraped}" | awk 'FNR == 3 {print $2}')
        echo $(expr ${accepts} - ${handled})
    }

    # Ratio of connections-in-use to available workers
    connections_load() {
        local scraped=$(curl -s localhost/health)
        local active=$(echo "${scraped}" | awk '/Active connections/{print $3}')
        local waiting=$(echo "${scraped}" | awk '/Reading/{print $6}')
        local workers=$(echo $(cat /etc/nginx/nginx.conf | perl -n -e'/worker_connections *(\d+)/ && print $1'))
        echo $(echo "scale=4; (${active} - ${waiting}) / ${workers}" | bc)
    }
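To make the arithmetic concrete, the sketch below runs the same field extraction against a canned status page instead of a live Nginx. The sample numbers, the helper names (`sample_status`, `unhandled_from`, `connections_load_from`), and the worker count of 1024 are all invented for illustration, and the ratio is computed with awk rather than bc so the example has no extra dependencies.

```shell
#!/bin/bash
# Canned http_stub_status response; the numbers are invented for illustration.
sample_status() {
    cat <<'EOF'
Active connections: 300
server accepts handled requests
 1000 998 2500
Reading: 0 Writing: 256 Waiting: 44
EOF
}

# Dropped connections: accepts minus handled, both found on line 3.
unhandled_from() {
    awk 'FNR == 3 {print $1 - $2}'
}

# Load ratio: (active - waiting) / workers. Here the worker count is passed
# as an argument instead of being read from nginx.conf.
connections_load_from() {
    awk -v workers="$1" '
        /Active connections/ {active = $3}
        /Reading/            {waiting = $6}
        END {printf "%.4f\n", (active - waiting) / workers}'
}

sample_status | unhandled_from               # prints 2
sample_status | connections_load_from 1024   # prints 0.2500
```

One difference from the bc version is worth keeping in mind: awk's `printf` rounds to four decimal places, while bc with `scale=4` truncates.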

Note that both functions are silent on stdout except for the number that we want. Any sort of logging we want to include here will need to be sent from our sensor over stderr. Our ContainerPilot configuration needs a stanza for telemetry as well:

      "telemetry": {
        "port": 9090,
        "sensors": [
          {
            "name": "tb_nginx_connections_unhandled_total",
            "help": "Number of accepted connections that were not handled",
            "type": "gauge",
            "poll": 5,
            "check": ["/opt/containerbuddy/", "unhandled"]
          },
          {
            "name": "tb_nginx_connections_load",
            "help": "Ratio of active connections (less waiting) to the maximum worker connections",
            "type": "gauge",
            "poll": 5,
            "check": ["/opt/containerbuddy/", "connections_load"]
          }
        ]
      }

That's all we need to add to our Nginx container to have it expose metrics for a Prometheus server! We've created an example of a Prometheus server using the autopilot pattern for discovering ContainerPilot telemetry. Check out the GitHub repo and how we've added it to our Touchbase stack's docker-compose.yml file.

Once the Touchbase stack has been started, you'll be able to watch metrics come in using the Prometheus server's web UI:

[Image: Prometheus graph web UI]

Or you can just curl the telemetry endpoint yourself:

    $ curl -s | grep nginx
    # HELP tb_nginx_connections_load Ratio of active connections (less waiting) to the maximum worker connections
    # TYPE tb_nginx_connections_load gauge
    tb_nginx_connections_load 0.0009
    # HELP tb_nginx_connections_unhandled_total Number of accepted connections that were not handled
    # TYPE tb_nginx_connections_unhandled_total gauge
    tb_nginx_connections_unhandled_total 0
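Because the endpoint serves plain text, even a very simple scheduler can act on it with ordinary shell tools. The sketch below is one hypothetical way to do that: `metric_value` pulls a single sample out of scraped text, and `scaling_decision` compares it with thresholds. Both function names and the 0.75 and 0.10 thresholds are assumptions made for illustration; none of this is part of ContainerPilot.

```shell
#!/bin/bash
# Print the value of one metric from Prometheus text-format input on stdin.
# "# HELP" and "# TYPE" lines never match because their first field is "#".
metric_value() {
    awk -v name="$1" '$1 == name {print $2}'
}

# Turn a load ratio into a scaling hint.
# The high and low thresholds are illustrative defaults, not recommendations.
scaling_decision() {
    local load="$1" high="${2:-0.75}" low="${3:-0.10}"
    awk -v load="${load}" -v high="${high}" -v low="${low}" 'BEGIN {
        if (load > high)     print "scale-up"
        else if (load < low) print "scale-down"
        else                 print "hold"
    }'
}
```

Piping the scraped output into `metric_value tb_nginx_connections_load` and passing the result to `scaling_decision` gives a scheduler a single word to act on. A real policy would likely also look at how the metric changes over time rather than at one sample.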

You can check out the rest of our application components to see how we added metrics to Touchbase and Couchbase.

Post written by Tim Gross