Services and resources monitoring with Prometheus and Grafana running on Docker

Wojciech Gębiś
September 13, 2018

My experience with solutions for resource monitoring is quite extensive. In the past, I have dealt with many popular tools: Nagios, perfect for real-time service monitoring based on dedicated agents; Cacti, a very common RRDTool-based metrics collector; application-level monitors for the JVM such as JConsole and VisualVM; and many other commercial solutions. Every tool listed above is powerful, feature-rich, and dedicated to a specific purpose, but there is a catch.

Software architectures now usually run across multiple server instances, and there is a whole variety of services to monitor and metrics to collect. The number of those grows with every new project I work on. At this scale, it is almost impossible to understand what is happening by juggling multiple tools, each one dedicated to a different service or metric. Constant switching between visualizing, monitoring, and alerting based on metrics from your downstream systems just gets too hard. At some point in this process, new features become more and more important: centralized and unified configuration patterns, one consistent storage engine for managing, storing, and displaying all of the metrics, and finally an easy way to add and configure new cloud services, web servers, MQ brokers, etc. That is why, for some time now, I have been digging deeper into this matter, looking for a new solution for centralized resource monitoring.

There are three major modules that make up a monitoring infrastructure:

  • metrics exporters,
  • a centralized engine for collecting the data and storing it in a time-series database,
  • a dashboard system for metrics visualization.

Moreover, it should be easy to integrate Docker host monitoring with several containers per node, as well as detailed Linux-box statistics like CPU utilization, storage activity, network traffic, etc.

As a result of my research and evaluation of many projects, I finally have a stack of tools I want to use. My choice fell on building the monitoring infrastructure from scratch based on Grafana and Prometheus (https://grafana.com, https://prometheus.io).

The big picture

(Architecture diagram: Prometheus and Grafana)

The core element of the system is Prometheus, which is responsible for collecting and storing the statistics data. Prometheus can efficiently manage many vital parameters such as the retention policy or the frequency of metrics collection. The default collection mechanism is based on easy-to-maintain endpoint scraping: small services named metrics exporters expose the endpoints. I have deployed two of them: cAdvisor and node_exporter. Using these simple agents, we can create powerful data sources with all of the critical metrics to be monitored.
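For illustration, a scrape endpoint simply serves metrics as plain text over HTTP. An abridged sample of what node_exporter exposes under /metrics looks like this (the values shown are, of course, illustrative):

```text
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 197632.15
node_cpu_seconds_total{cpu="0",mode="system"} 1432.88
# HELP node_memory_MemAvailable_bytes Memory information field MemAvailable_bytes.
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 3.456123904e+09
```

Prometheus periodically fetches such pages from every configured target and stores each line as a sample in its time-series database.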

Components

Installation and configuration can be done in a few simple steps:

  • Prometheus can be run as a Docker container. It’s a good option to attach an external block device for the collected data and mount it into the container:
docker run -d -p 9090:9090 \
       -v /prometheus-data:/prometheus-data \
       prom/prometheus --config.file=/prometheus-data/prometheus.yml
  • cAdvisor connects to the Docker host and can collect tons of parameters for live monitoring of each container, as well as the Docker engine itself. The recommended way to run cAdvisor is, of course, as a dockerized service; it can be started on the Docker host with ‘docker run’, e.g.:
docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --volume=/dev/disk/:/dev/disk:ro \
  --publish=8080:8080 \
  --detach=true \
  --name=cadvisor \
  google/cadvisor:latest
https://github.com/google/cadvisor
  • In the case of node_exporter, it should be deployed as a regular host service (not in a container) because it needs access to native host metrics, e.g., those stored in ‘/proc’. It can be installed from source or from the binary packages available in the most popular Linux distributions.
    After installation, node_exporter listens on localhost:9100 by default. https://github.com/prometheus/node_exporter

  • Now both exporters can be included in the Prometheus configuration prometheus.yml; here is the snippet:

# A scrape configuration containing the endpoints to scrape:
# here, node_exporter and cAdvisor running on the same host.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any time series scraped from this config.
  # metrics_path defaults to '/metrics'.
  # scheme defaults to 'http'.
  - job_name: 'default_job'
    static_configs:
      - targets: 
        - localhost:9100 # node_exporter
        - localhost:8080 # cAdvisor
  • Finally, Grafana can be run as a Docker container just like Prometheus. The default configuration uses an embedded SQLite file database stored inside the container; in production, it should be moved to an external database or at least a persistent volume.
docker run \
  -d \
  -p 3000:3000 \
  --name=grafana \
  -e "GF_SERVER_ROOT_URL=http://grafana.server.name" \
  -e "GF_SECURITY_ADMIN_PASSWORD=secret" \
  grafana/grafana
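The ‘docker run’ commands above can also be tied together in a single Compose file. Here is a minimal sketch (the service names and the ./prometheus-data host path are my assumptions; adjust them to your layout — node_exporter is intentionally left out because it runs directly on the host):

```yaml
version: "3"
services:
  prometheus:
    image: prom/prometheus
    command: --config.file=/prometheus-data/prometheus.yml
    volumes:
      - ./prometheus-data:/prometheus-data   # config + collected data
    ports:
      - "9090:9090"
  cadvisor:
    image: google/cadvisor:latest
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    ports:
      - "8080:8080"
  grafana:
    image: grafana/grafana
    environment:
      - GF_SERVER_ROOT_URL=http://grafana.server.name
      - GF_SECURITY_ADMIN_PASSWORD=secret
    ports:
      - "3000:3000"
```

A single ‘docker-compose up -d’ then brings the whole containerized part of the stack up or down together.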

What’s next?

We have the running stack. However, that was only the first step, and this is where Grafana can fully come into play. Grafana is a data visualization and exploration tool for your whole infrastructure. You can extend it to fit your custom needs with many widgets and plugins to create interactive, user-friendly dashboards. So, go ahead and customize it: add new hosts and metrics to monitor, use the graph composer to create charts, and place them on dashboards.
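Adding Prometheus as a data source can be done click-by-click in the UI, or, for a reproducible setup, provisioned from a file (this assumes Grafana 5+ and that the file is mounted into the container under /etc/grafana/provisioning/datasources/):

```yaml
# datasource.yml - mounted into /etc/grafana/provisioning/datasources/
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090   # adjust if Prometheus runs elsewhere
    isDefault: true
```

With this in place, every fresh Grafana container starts with the data source already configured.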

These projects are open source and come with many useful extensions.
If other resources need to be monitored, here is the Prometheus repository with many useful exporters: https://prometheus.io/docs/instrumenting/exporters/.

All of that data is accessible in Grafana dashboards, where it can be freely visualized and monitored. There is also a storefront of ready-to-use, community-built dashboards in the Grafana repository:
https://grafana.com/dashboards.
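For building your own panels, here are a few PromQL queries that work well as a starting point, assuming the default metric names exported by node_exporter (0.16+) and cAdvisor:

```promql
# Per-host CPU utilization (%) from node_exporter
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100

# Available memory on the host, in bytes
node_memory_MemAvailable_bytes

# Per-container CPU usage (in cores) from cAdvisor
sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m]))

# Per-container memory usage, in bytes
container_memory_usage_bytes{name!=""}
```

Each of these can be pasted directly into a Grafana graph panel with the Prometheus data source selected.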

(Screenshots: example Grafana dashboard graphs)

It is also worth mentioning that the solution is completed by the ability to create advanced alerting rules based on defined thresholds, which can push notifications to your email or Slack whenever something out of the ordinary happens.
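As a sketch of one possible approach (Prometheus-side alerting rules handed off to Alertmanager; Grafana’s built-in alerting is an alternative), a rule file referenced from prometheus.yml via ‘rule_files’ could look like this — the group name, threshold, and labels are my assumptions:

```yaml
# alert-rules.yml - referenced from prometheus.yml via `rule_files`
groups:
  - name: host-alerts
    rules:
      - alert: HighCpuUsage
        # Fire when average CPU usage on an instance stays above 90% for 10 minutes
        expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
```

Routing the fired alerts to email or Slack is then a matter of the receivers configured in Alertmanager.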

I hope that this short article will draw your attention to the stack I described and help you improve your own system monitoring.
