Services and resources monitoring with Prometheus and Grafana running on Docker

Wojciech Gębiś
September 13, 2018

My experience with solutions for resource monitoring is quite extensive. In the past, I have dealt with many popular tools: Nagios, perfect for real-time service monitoring based on dedicated agents; Cacti, a very common RRDTool-based metrics collector; application-level monitors for the JVM such as JConsole and VisualVM; and many other commercial solutions. Every tool listed above is powerful, feature-rich, and dedicated to a specific purpose, but there is a catch.

Software architectures now usually run across multiple server instances, and there is a whole variety of services to monitor and metrics to collect. The number of those grows with every new project I work on. At this scale, it is almost impossible to understand what is happening by juggling multiple tools, each one dedicated to a different service or metric. Constant switching between visualizing, monitoring, and alerting based on metrics from your downstream systems just gets too hard. At some point in this process, new features become more and more important: centralized and unified configuration patterns, one consistent storage engine for managing, storing, and displaying all of the metrics, and finally an easy way to add and configure new cloud services, web servers, MQ brokers, etc. That is why, for some time now, I have been digging deeper into this matter, looking for a new solution for centralized resource monitoring.

There are three major modules that make up a monitoring infrastructure:

  • metrics exporters,
  • a centralized engine for collecting the data and storing it in a time-series database,
  • a dashboard system for metrics visualization.

Moreover, it should be easy to integrate Docker host monitoring with several containers per node, as well as detailed Linux-box statistics like CPU utilization, storage activity, network traffic, etc.

As a result of my research and evaluation of many projects, I finally have a stack of tools I want to use. My choice fell on building the monitoring infrastructure from scratch based on Grafana and Prometheus (https://grafana.com, https://prometheus.io).

The big picture

(Architecture diagram: Prometheus and Grafana)

The core element of the system is Prometheus, which is responsible for collecting and storing the statistics data. Prometheus can efficiently manage many vital parameters such as the retention policy or the frequency of metrics collection. The default collection mechanism is based on easy-to-maintain endpoint scraping: small services named metrics exporters expose the endpoints. I have deployed two of them: cAdvisor and node_exporter. Using these simple agents, we can create powerful data sources with all of the critical metrics to be monitored.
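For illustration, a scrape endpoint simply serves metrics as plain text over HTTP. An abridged sample of what node_exporter exposes under /metrics looks like this (the values shown are, of course, illustrative):

```text
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 197632.15
node_cpu_seconds_total{cpu="0",mode="system"} 1432.88
# HELP node_memory_MemAvailable_bytes Memory information field MemAvailable_bytes.
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 3.456123904e+09
```

Prometheus periodically fetches such pages from every configured target and stores each line as a sample in its time-series database.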

Components

Installation and configuration can be done in a few simple steps:

  • Prometheus can be run as a Docker container. It’s a good option to attach an external block device for the collected data and mount it into the container:
docker run -d -p 9090:9090 \
       -v /prometheus-data:/prometheus-data \
       prom/prometheus --config.file=/prometheus-data/prometheus.yml
  • cAdvisor connects to the Docker host and can collect tons of parameters for live monitoring of each container, as well as the Docker engine itself. The recommended way to run cAdvisor is, of course, as a dockerized service; it can be started on the Docker host with ‘docker run’, e.g.:
docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --volume=/dev/disk/:/dev/disk:ro \
  --publish=8080:8080 \
  --detach=true \
  --name=cadvisor \
  google/cadvisor:latest
https://github.com/google/cadvisor
  • In the case of node_exporter, it should be deployed as a regular host service (not in a container) because it needs access to native host metrics, e.g., those stored in ‘/proc’. It can be installed from source or from the binary packages available in the most popular Linux distributions.
    After installation, node_exporter listens on localhost:9100 by default. https://github.com/prometheus/node_exporter

  • Now both exporters can be included in the Prometheus configuration prometheus.yml; here is the snippet:

# A scrape configuration containing the endpoints to scrape:
# here, node_exporter and cAdvisor running on the same host.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any time series scraped from this config.
  # metrics_path defaults to '/metrics'.
  # scheme defaults to 'http'.
  - job_name: 'default_job'
    static_configs:
      - targets: 
        - localhost:9100 # node_exporter
        - localhost:8080 # cAdvisor
  • Finally, Grafana can be run as a Docker container just like Prometheus. The default configuration uses an embedded SQLite file database stored inside the container; in production, it should be moved to an external database or at least a persistent volume.
docker run \
  -d \
  -p 3000:3000 \
  --name=grafana \
  -e "GF_SERVER_ROOT_URL=http://grafana.server.name" \
  -e "GF_SECURITY_ADMIN_PASSWORD=secret" \
  grafana/grafana
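The ‘docker run’ commands above can also be tied together in a single Compose file. Here is a minimal sketch (the service names and the ./prometheus-data host path are my assumptions; adjust them to your layout — node_exporter is intentionally left out because it runs directly on the host):

```yaml
version: "3"
services:
  prometheus:
    image: prom/prometheus
    command: --config.file=/prometheus-data/prometheus.yml
    volumes:
      - ./prometheus-data:/prometheus-data   # config + collected data
    ports:
      - "9090:9090"
  cadvisor:
    image: google/cadvisor:latest
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    ports:
      - "8080:8080"
  grafana:
    image: grafana/grafana
    environment:
      - GF_SERVER_ROOT_URL=http://grafana.server.name
      - GF_SECURITY_ADMIN_PASSWORD=secret
    ports:
      - "3000:3000"
```

A single ‘docker-compose up -d’ then brings the whole containerized part of the stack up or down together.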

What’s next?

We have the running stack. However, that was only the first step, and this is where Grafana can fully come into play. Grafana is a data visualization and exploration tool for your whole infrastructure. You can extend it to fit your custom needs with many widgets and plugins to create interactive, user-friendly dashboards. So, go ahead and customize it: add new hosts and metrics to monitor, use the graph composer to create charts, and place them on dashboards.
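Adding Prometheus as a data source can be done click-by-click in the UI, or, for a reproducible setup, provisioned from a file (this assumes Grafana 5+ and that the file is mounted into the container under /etc/grafana/provisioning/datasources/):

```yaml
# datasource.yml - mounted into /etc/grafana/provisioning/datasources/
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090   # adjust if Prometheus runs elsewhere
    isDefault: true
```

With this in place, every fresh Grafana container starts with the data source already configured.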

These projects are open source and come with many useful extensions.
If other resources need to be monitored, here is the Prometheus repository with many useful exporters: https://prometheus.io/docs/instrumenting/exporters/.

All of that data is accessible in Grafana dashboards, where it can be freely visualized and monitored. There is also a storefront of ready-to-use, community-built dashboards in the Grafana repository:
https://grafana.com/dashboards.
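For building your own panels, here are a few PromQL queries that work well as a starting point, assuming the default metric names exported by node_exporter (0.16+) and cAdvisor:

```promql
# Per-host CPU utilization (%) from node_exporter
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100

# Available memory on the host, in bytes
node_memory_MemAvailable_bytes

# Per-container CPU usage (in cores) from cAdvisor
sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m]))

# Per-container memory usage, in bytes
container_memory_usage_bytes{name!=""}
```

Each of these can be pasted directly into a Grafana graph panel with the Prometheus data source selected.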

(Screenshots: example Grafana dashboard graphs)

It is also worth mentioning that the solution is completed by the ability to create advanced alerting rules based on defined thresholds, which can push notifications to your email or Slack whenever something out of the ordinary happens.
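As a sketch of one possible approach (Prometheus-side alerting rules handed off to Alertmanager; Grafana’s built-in alerting is an alternative), a rule file referenced from prometheus.yml via ‘rule_files’ could look like this — the group name, threshold, and labels are my assumptions:

```yaml
# alert-rules.yml - referenced from prometheus.yml via `rule_files`
groups:
  - name: host-alerts
    rules:
      - alert: HighCpuUsage
        # Fire when average CPU usage on an instance stays above 90% for 10 minutes
        expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
```

Routing the fired alerts to email or Slack is then a matter of the receivers configured in Alertmanager.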

I hope that this short article will draw your attention to the stack I described and help you improve your own system monitoring.
