This is part of a three-part series. If you haven't read the why, what, and how of monitoring, I suggest you read it first. This part is more hands-on.
We will install monitoring infrastructure on a Kubernetes cluster, then create and import custom dashboards for our monitoring needs, covering cluster health and microservice uptime. We will follow up by creating smart alerts.
For this purpose, make sure you have a working Kubernetes cluster that kubectl is authorized to use, and a Helm v2 installation. We will use the Istio Bookinfo microservice example as our product services. Optionally, create a namespace named monitoring (kubectl create namespace monitoring).
Setting up the infrastructure:
As we discussed in the previous part, we can divide the monitoring process into three steps: exporting, collecting, and visualizing.
Step 1 (Exporters):
For the scope of this article, we will install the blackbox exporter, which allows black-box probing of endpoints over HTTP, HTTPS, DNS, TCP, and ICMP. This exporter will let us create scrape jobs that probe microservice health endpoints, which we will later use to construct SLIs.
Create a helm installation with the following values from gist:
helm install stable/prometheus-blackbox-exporter --name prometheus-blackbox --namespace monitoring -f blackbox_exporter.yaml
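The gist with the values is not reproduced here, so the following is only a minimal sketch of what blackbox_exporter.yaml might contain. It assumes the chart's `config.modules` layout and defines a single `http_2xx` module. The 5s timeout and the IPv4 preference are illustrative choices, not values from the original gist.

```shell
# Write a minimal, hypothetical values file for the blackbox exporter chart.
cat > blackbox_exporter.yaml <<'EOF'
config:
  modules:
    http_2xx:                 # module name referenced later by the scrape job
      prober: http            # probe the target over HTTP(S)
      timeout: 5s             # illustrative value
      http:
        valid_status_codes: []        # empty list means "any 2xx"
        preferred_ip_protocol: ip4    # assumption: cluster services resolve over IPv4
EOF

# The file is then passed to the install command from the article:
# helm install stable/prometheus-blackbox-exporter --name prometheus-blackbox \
#   --namespace monitoring -f blackbox_exporter.yaml
```

Keeping the module name short and generic makes it easy to reuse the same module for every HTTP health endpoint.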
Step 2 (Collector):
We will install Prometheus as the collector. Its job is to scrape metrics from the various exporters and store them in its time-series database. We will also add extra scrape job configuration that tells Prometheus to probe the Bookinfo services through the blackbox exporter.
helm install stable/prometheus --name my-first-cluster-prometheus --namespace monitoring -f prometheus.yaml
# Extra scrape job
- job_name: 'prometheus-blackbox-exporter'
Use the helm values from this gist.
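Since only the first line of the scrape job survives above, here is a hedged sketch of how the rest of prometheus.yaml could look. It assumes the chart's `extraScrapeConfigs` string value, assumes the Bookinfo services live in the bookinfo namespace and answer on port 9080 at /health, and assumes the exporter service is named prometheus-blackbox-prometheus-blackbox-exporter (release name plus chart name). Check the service name with kubectl get svc -n monitoring before relying on it.

```shell
# Write a hypothetical prometheus.yaml with one extra blackbox scrape job.
cat > prometheus.yaml <<'EOF'
extraScrapeConfigs: |
  - job_name: 'prometheus-blackbox-exporter'
    metrics_path: /probe            # blackbox exporter's probe endpoint
    params:
      module: [http_2xx]            # module defined in the exporter config
    static_configs:
      - targets:                    # assumed Bookinfo health endpoints
          - http://productpage.bookinfo.svc.cluster.local:9080/health
          - http://details.bookinfo.svc.cluster.local:9080/health
          - http://reviews.bookinfo.svc.cluster.local:9080/health
    relabel_configs:
      # Pass the original target to the exporter as the ?target= parameter.
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label for dashboards and alerts.
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the exporter service (assumed name).
      - target_label: __address__
        replacement: prometheus-blackbox-prometheus-blackbox-exporter:9115
EOF
```

The relabeling is what turns an ordinary scrape job into a probe: Prometheus talks to the exporter, and the exporter talks to the service.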
Step 3 (Visualizer):
Lastly, we will install grafana in our monitoring namespace. Grafana will use the Prometheus server as data source to fetch and visualize metrics. Like above, we will use helm to install grafana with values in the following gist:
helm install stable/grafana --name grafana --namespace monitoring -f grafana.yaml
It is important to define the data source as our Prometheus server in the values YAML file.
- name: Prometheus
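For readers without the gist, a minimal sketch of the data source section follows. It assumes the Grafana chart's `datasources` value and assumes the Prometheus server service is reachable as my-first-cluster-prometheus-server in the same namespace, which is what the release name used above would normally produce. The file name grafana-datasource.yaml is hypothetical; the same keys can live directly in grafana.yaml.

```shell
# Write a hypothetical values fragment that provisions Prometheus as a data source.
cat > grafana-datasource.yaml <<'EOF'
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        url: http://my-first-cluster-prometheus-server   # assumed service name
        access: proxy        # Grafana's backend makes the requests
        isDefault: true
EOF
```

The name given here, Prometheus, is the one to pick later when an imported dashboard asks for a data source.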
And that’s it. We have set up a minimal infrastructure for our monitoring.
We won't have time to cover PromQL and Grafana diagrams in this article, but other examples will surely follow. For now, we will use some of the pre-made dashboards that are available in the Grafana dashboards directory.
To import a dashboard, first run Grafana locally by opening a port-forwarded session to the Grafana pod.
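One possible way to open that session is sketched below. It assumes the release is named grafana, so the chart creates a service and a secret both called grafana, and that the service listens on port 80. The helper name open_grafana is made up for this example, and forwarding to the service rather than to a named pod is a convenience swapped in here.

```shell
# Hypothetical helper: print the admin password, then forward local port 3000.
open_grafana() {
  # The chart stores the generated admin password in a secret (assumed name: grafana).
  kubectl get secret grafana --namespace monitoring \
    -o jsonpath='{.data.admin-password}' | base64 --decode
  echo
  # Forward localhost:3000 to the Grafana service; this blocks until interrupted.
  kubectl port-forward --namespace monitoring svc/grafana 3000:80
}
```

After calling open_grafana, Grafana should be reachable at http://localhost:3000 with the user admin and the printed password.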
We will import two dashboards, one for Kubernetes cluster cost and one for microservice health. These are the steps:
- Go to Home -> Manage -> Import (and add the dashboard id to import).
- Select data source as Prometheus in form and click import.
There, we are done and have imported two dashboards.
Cluster Health and Costs
Microservice Health Checks
Grafana uses notification channels to send alerts about predefined conditions to users. Let's look at a simple case: emailing microservice health check alerts to a developer.
Add the email server (SMTP) configuration to the Grafana chart. I used SendGrid's API for this demonstration. Update the following YAML values accordingly.
password: <Your Sendgrid API-KEY Here>
helm upgrade grafana stable/grafana --namespace monitoring -f grafana.yaml
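Only the password line of the SMTP values is shown above, so here is a hedged sketch of the full section. It assumes the chart's `grafana.ini` value and SendGrid's SMTP relay, where the user name is the literal string apikey and the password is the API key. The from_address and the file name grafana-smtp.yaml are placeholders invented for this example.

```shell
# Write a hypothetical values fragment with the SMTP settings.
cat > grafana-smtp.yaml <<'EOF'
grafana.ini:
  smtp:
    enabled: true
    host: smtp.sendgrid.net:587
    user: apikey                              # SendGrid expects this literal user name
    password: <Your Sendgrid API-KEY Here>
    from_address: alerts@example.com          # placeholder sender address
EOF

# Helm accepts several -f flags, so the fragment can be layered on the main file:
# helm upgrade grafana stable/grafana --namespace monitoring \
#   -f grafana.yaml -f grafana-smtp.yaml
```

Keeping the credentials in a separate file makes it easier to leave them out of version control.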
From the sidebar, navigate to Alerts -> Notification Channels
Click the new channel button, add a name, enable Disable Resolve Message, Send Reminders, and add a test email in the Email Addresses field. Press the send-test button and you should receive a test email from grafana.
To create a microservice-down alert, we will first create a dashboard.
Add to it a chart that plots the UP metric from the blackbox exporter.
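Before building the chart, it can help to confirm that Prometheus actually has the probe data. Note that the blackbox exporter reports each probe's outcome in the `probe_success` series (1 for success, 0 for failure), while Prometheus's own `up` series only says whether the exporter itself could be scraped. The sketch below queries `probe_success` through the Prometheus HTTP API. The helper name probe_status and the default URL are assumptions, and it expects the Prometheus server to be port-forwarded to localhost:9090.

```shell
# Hypothetical helper: ask Prometheus for the latest result of every probe.
probe_status() {
  prometheus_url="${1:-http://localhost:9090}"   # assumed port-forwarded address
  curl -sG "$prometheus_url/api/v1/query" \
    --data-urlencode 'query=probe_success{job="prometheus-blackbox-exporter"}'
}
```

The same expression, probe_success{job="prometheus-blackbox-exporter"}, can be pasted into the chart's query field.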
In the chart edit pane, navigate to the alert section. Add a name, and set Evaluate Every (10s) and For (15s). This means the health checks are evaluated every 10s, and the alert fires only if there is an undesired value for 15s continuously.
Lastly, add the condition (When the value is below 1). Since UP is a boolean metric, a value below 1 implies the service is down.
Disable the deployment of either the details or the reviews service to mimic a failure:
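To make the interplay between Evaluate Every and For concrete, here is a small simulation. It is a simplified model of the rule described above, not Grafana's actual implementation: the condition is checked every 10s, the rule is pending while the value has been below 1 for less than 15s, and alerting once that time is reached. The sample values are invented.

```shell
evaluate_every=10            # seconds between evaluations
hold_for=15                  # seconds the condition must hold before alerting
samples="1 1 0 0 0 1"        # invented health values, one per evaluation

t=0
pending_since=""
states=""
for value in $samples; do
  if [ "$value" -lt 1 ]; then
    # Remember when the condition first became true.
    if [ -z "$pending_since" ]; then
      pending_since=$t
    fi
    if [ $((t - pending_since)) -ge "$hold_for" ]; then
      state=alerting
    else
      state=pending
    fi
  else
    # A healthy value resets the timer.
    pending_since=""
    state=ok
  fi
  echo "t=${t}s value=$value state=$state"
  states="$states $state"
  t=$((t + evaluate_every))
done
```

With these samples the rule turns pending at t=20s and alerting at t=40s, so the email goes out on the third failing evaluation, 20 seconds after the first one.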
kubectl scale deployment details-v1 --replicas=0 -n bookinfo
And you should receive a mail for the alert. TA DA!!!
During the course of this article, we have successfully set up monitoring in our cluster, created and imported dashboards, and finally created an alert that uses an email notification channel.
For the third and last part of this series, please go to Monitoring a multi-cluster environment.