Tools to Monitor a Kubernetes Cluster
In the earlier post, we discussed the various types of Kubernetes distributions and the players for each aspect of the distribution. In this blog, we change gears to look at monitoring and where it fits in the stack. Before getting on with the monitoring discussion, let's define a term, Monitoring Unit (MU), which refers to a collection of metrics, events and logs from various sources; the term is used often in this post. The key differences among these MUs are: metrics are a series of numbers measured over time, logs refer to an event at a particular instant of time, while events refer to metrics and logs over a certain period of time.
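As a rough illustration of these three shapes, following the definitions above, the sketch below models each MU as a plain Python structure; the field names and values are illustrative assumptions, not a standard schema.

```python
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)

# A metric: a named series of numbers measured over time.
cpu_metric = {
    "name": "container_cpu_usage_percent",
    "samples": [(now - timedelta(minutes=m), 40.0 + m) for m in range(5)],
}

# A log: a record tied to one particular instant of time.
log_entry = {
    "timestamp": now,
    "level": "ERROR",
    "message": "readiness probe failed for pod web-7d9f",  # hypothetical message
}

# An event, as defined in this post: metrics and logs over a certain period.
event = {
    "window": (now - timedelta(minutes=5), now),
    "metrics": [cpu_metric],
    "logs": [log_entry],
}

print(event["window"], len(event["metrics"]), len(event["logs"]))
```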
When it comes to Monitoring, one gets to hear terminologies such as Monitoring, Visibility, Observability, AIOps, APM etc. Strictly speaking, Monitoring is a system that is a place for collecting and storing MU’s. Visibility consumes all the collected MUs for visualization and dashboarding. Visibility tools provide the ability to slice & dice MUs in many ways and in different formats. Observability is the ability of the system to store MUs for a period of time for correlation and analysis. The collected MUs are stored in a time-series database (TSDB). AIOps acronym for AI for Operations uses observability data for autonomous operations. The traditional APM (Application Performance Management) vendors were already doing what that new shiny *AIOps* vendors do but APM are restricted to onset of performance problems, degradation and isolation of performance issues while AIOps vendors go beyond performance by applying AI techniques for other aspects of operations such as CI/CD, Alert generation and resolutions, scaling etc.
One could collect a wealth of data from Kubernetes and the whole enchilada around it. The sources typically are the Kubernetes orchestration system itself, the infrastructure on which Kubernetes runs, Kubernetes object types, the applications running on the orchestration system, and the third-party systems that Kubernetes interacts with, such as load balancers, firewalls, etc.
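As a hedged sketch of tapping just one of these sources, the snippet below uses the official Kubernetes Python client to pull node capacity and recent cluster events; it assumes a reachable cluster and a working kubeconfig, and is only an illustration of the idea, not a full collector.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (assumes kubectl already works here).
config.load_kube_config()
v1 = client.CoreV1Api()

# One MU source: the infrastructure Kubernetes runs on, seen through the node objects.
for node in v1.list_node().items:
    cap = node.status.capacity
    print(f"node={node.metadata.name} cpu={cap['cpu']} memory={cap['memory']}")

# Another MU source: events emitted by the orchestration system and its objects.
for ev in v1.list_event_for_all_namespaces(limit=10).items:
    obj = ev.involved_object
    print(f"{ev.last_timestamp} {obj.kind}/{obj.name}: {ev.reason} - {ev.message}")
```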
There are many dimensions to monitoring; one such dimension is black-box versus white-box. Black-box monitoring focuses on collecting resource aspects applicable to all containers, such as CPU, memory, storage, total HTTP requests, total error responses, average response time, etc. White-box monitoring is contextual, with MUs specific to the application.
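To make the white-box side concrete, here is a minimal sketch of application-level instrumentation using the Prometheus Python client (prometheus_client); the metric names and the toy checkout handler are illustrative assumptions, not part of any real application.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# White-box MUs: metrics that only make sense inside this particular application.
ORDERS_PLACED = Counter("shop_orders_placed_total", "Orders placed, by payment method",
                        ["payment_method"])          # hypothetical business metric
CHECKOUT_LATENCY = Histogram("shop_checkout_seconds", "Time spent in the checkout handler")

@CHECKOUT_LATENCY.time()
def handle_checkout(payment_method: str) -> None:
    # Simulate application work; a real handler would perform the actual checkout.
    time.sleep(random.uniform(0.01, 0.1))
    ORDERS_PLACED.labels(payment_method=payment_method).inc()

if __name__ == "__main__":
    start_http_server(8000)   # exposes /metrics for a scraper such as Prometheus
    while True:
        handle_checkout(random.choice(["card", "wallet"]))
```

Black-box counterparts of these MUs (CPU, memory, request totals) would come from generic collectors that need no knowledge of the checkout logic.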
What to monitor, how to monitor it, and what value one can derive from it are huge design decisions. These have been addressed in a separate design blog on monitoring. Here, we will just look at the who's who of monitoring.
Open Source as Lego Blocks:
There is a whole forest out there when it comes to monitoring tools, but only a few are popular and widely used, owing to their adoption and strong community backing. One such tool is the CNCF graduated project Prometheus. Prometheus is a monitoring and alerting system that stores metrics in a time-series database. It pulls metrics from various providers using the concept of exporters. Exporters are nothing but a gateway between the source and the destination (Prometheus). For example, node-exporter collects metrics from an instance/node whenever Prometheus requests the information. This link has a list of Prometheus exporters. Additionally, one can write one's own exporter based on requirements and needs, as sketched below. For visualization, users can use Grafana (again open source), which queries Prometheus for the required dashboarding. Given that Prometheus is a metric collector, one would require a separate system to collect logs and traces. For logs, one could use a combination of Elasticsearch, Fluentd and Kibana, popularly known as the EFK stack. For distributed tracing, there are other open-source projects, also part of CNCF, namely OpenTelemetry and Jaeger; some parts of them are complementary and a few parts overlap. I guess, in the long run, the market will figure out how the two work with each other.
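As a rough sketch of what such a hand-written exporter can look like (assuming the psutil and prometheus_client libraries; the metric names and port are illustrative), the snippet below publishes host CPU and memory gauges on an HTTP endpoint that Prometheus could be configured to scrape:

```python
import time

import psutil
from prometheus_client import Gauge, start_http_server

# Gauges this exporter publishes; the names are illustrative, not standard ones.
CPU_PERCENT = Gauge("demo_host_cpu_percent", "Host CPU utilisation in percent")
MEM_PERCENT = Gauge("demo_host_memory_percent", "Host memory utilisation in percent")

def collect() -> None:
    """Refresh the gauge values from the local host."""
    CPU_PERCENT.set(psutil.cpu_percent(interval=None))
    MEM_PERCENT.set(psutil.virtual_memory().percent)

if __name__ == "__main__":
    start_http_server(9109)   # Prometheus would scrape http://<host>:9109/metrics
    while True:
        collect()
        time.sleep(15)        # roughly match a typical scrape interval
```

Prometheus would then be pointed at this endpoint through its scrape configuration, and Grafana would query Prometheus to build the dashboards.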
Building a monitoring stack from scratch using open-source tools makes sense for an organization with large deployments and a DevOps team sized to match.
Commercial Tools Built From the Bottom Up:
Such tools started by building agents for infrastructure, applications and other systems to collect and store one or more MUs in their own database, and organically grew toward autonomous operations such as detecting the onset of problems, fault isolation and auto-resolution. In some sense, they are comprehensive solutions. Examples of such products are Datadog, Dynatrace, Sumo Logic, New Relic, etc. By using these tools, one can get started quickly on the monitoring story. This works best for organizations with small DevOps teams. For large deployments, it's best to build using open source, from a purely cost perspective.
Vendors That Build On Other DBs For AI:
Many new vendors realized that customers are already using two or more monitoring tools for different value propositions and would be unable to forklift the current stack and replace it with a new one. Instead, these companies bring great value in autonomous operations using AI technologies, rather than just collection and visibility. Examples of such vendors are harness.io, Instana, BigPanda, OpsRamp, etc.
Cloud Provider Tools:
Each hyperscaler provides its own monitoring tool for Kubernetes. All they did was extend an already existing monitoring tool to support Kubernetes. Google has GCP Stackdriver, AWS has CloudWatch, and Azure has Azure Monitor for containers. Presumably, other providers have also extended support for Kubernetes. They are not exactly the same in terms of features but are quite similar in their approach.