Why should you monitor? (Start with Why)
No, this is not about the monitoring tools available or every bit and bob you can monitor. You have a product deployed in a cloud-native environment; why should you monitor that environment?
Imagine two organizations with a similar product and one difference. AL (hint: availability) wants its product to be highly available at all times. BC (well, BC is meh) does not. An engineer at AL is woken by an alert at midnight. The alert says a microservice is down, which feature of the product it affects, and which customer region is impacted, and it comes with a metric at an unwanted value (say, out of memory). The engineer quickly outlines the impacted audience, determines a severity level, and takes action. The metric implies he should increase the resources; he does, and waits for the alert to clear. And just like that, in fifteen minutes he can go back to sleep.
The engineer at BC won't know anything until a complaint comes in from the customer-relations side that the product is not working. He has to figure out all the whys himself, and you can guess how long that takes. By then, your product has stopped adding value to your customers. That's why!
What should be monitored? (The What)
The answer to that question is another question. Easy, right?
What can cause your customers a bad experience with your product?
An overloaded infrastructure can cause downtime: monitor capacity. Your microservices should be highly available: monitor their uptime SLI. An attacker could brute-force your authentication mechanisms: monitor for too many 401s in a given time window. All regions should be available to their respective customers: aggregate your regional availability metrics into a boolean and visualize it on a world map.
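The brute-force case can be turned into an automatic alert. As a sketch, here is what such a rule might look like in Prometheus (introduced later in this article); the metric name `http_requests_total`, the `code` label, and the threshold of 10 requests/second are assumptions that depend on your gateway or exporter:

```yaml
# Hypothetical Prometheus alerting rule for a spike in 401 responses.
groups:
  - name: auth-abuse
    rules:
      - alert: TooMany401s
        # Assumed metric/label; adjust to what your gateway actually exposes.
        expr: sum(rate(http_requests_total{code="401"}[5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Possible brute-force: elevated 401 rate for 5 minutes"
```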
In my experience, I have built or worked with the following dashboards:
- Cluster and pod health. Clusters can end up with nodes in a NotReady state; the usual culprits are running out of disk, memory, or vCPUs. Similarly, pods can end up unschedulable or fail to create (e.g. missing dependencies in a startup script).
- Cluster capacity planning. A set of charts comparing current resource usage (memory, CPU) to the maximum available, by namespace.
- Microservice health checks. Includes the uptime SLI and response-time history.
- Distributed cache: Includes uptime, connected clients, and hits/misses per second (too many misses mean the cache is not shielding the database). Since we used distributed locking with Redis, the ratio of locks released to locks acquired can point you to a bottleneck in your system.
- Resource usage by namespace: CPU and memory usage by the services in a namespace. In an EFK-based logging namespace, this pointed out Elasticsearch as the culprit using too much memory for us.
- API gateway: A comparison of successful (2xx) vs. server-error (5xx) responses per service can point you to a faulty service.
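The API-gateway comparison can be expressed as queries in Prometheus's query language (covered later); the metric and label names (`http_requests_total`, `code`, `service`) are assumptions and vary by gateway and exporter:

```promql
# Successful (2xx) request rate per service over the last 5 minutes.
sum by (service) (rate(http_requests_total{code=~"2.."}[5m]))

# Server-error (5xx) request rate per service.
sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))

# Error ratio per service: a sudden jump singles out the faulty service.
sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
  / sum by (service) (rate(http_requests_total[5m]))
```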
So should you just accumulate metrics and visualize them? Nope! Remember the engineer at AL: he received a message/call. You want your monitoring platform to send you alerts for issues that have occurred, or that will occur soon if not dealt with. Not only that, you can also program who to alert and with what payload of information. We worked with Grafana, which integrates with just about every productivity/chat/meeting tool out there.
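To make "who to alert with what payload" concrete, here is a sketch of alert routing with Prometheus's Alertmanager (Grafana's built-in alerting offers similar contact points); the webhook URL and channel name are placeholders, not real values:

```yaml
# Hypothetical Alertmanager config routing all alerts to a Slack channel.
route:
  receiver: oncall-slack
  group_by: [alertname, service]
receivers:
  - name: oncall-slack
    slack_configs:
      # Placeholder webhook URL; generate a real one in your Slack workspace.
      - api_url: https://hooks.slack.com/services/XXX
        channel: "#oncall"
        # The payload: forward the alert's summary annotation.
        text: "{{ .CommonAnnotations.summary }}"
```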
Monitoring in Kubernetes (The How)
Imagine a colony of houses, each fitted with a utility line. A meter at the doorstep displays the usage. Each month, an agent visits every house and notes the reading in a register. An exec uses the register to create a PowerPoint on company health. The house meter is the Kubernetes exporter, the agent is the collector, and the exec is the visualizer.
We will be discussing Prometheus and Grafana for the scope of this article.
We can divide the process into three steps:
- Exporting
- Scraping / Storing
- Querying / Visualizing
Exporting
Sometimes metrics collection is handled by the application itself; other times it is offloaded to add-ons called exporters. Exporters collect metrics and make them available via an API, whereas some applications collect metrics internally and expose them via an API themselves, e.g. the Ambassador API gateway.
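To make the exporter idea concrete, here is a minimal sketch in pure standard-library Python that serves a `/metrics` endpoint in the Prometheus text exposition format. In practice you would use an official client library such as `prometheus_client`; the metric name `app_uptime_seconds` and port 8000 are assumptions for illustration:

```python
import http.server
import time

START = time.time()

def render_metrics() -> str:
    """Render metrics in the Prometheus text exposition format."""
    # `app_uptime_seconds` is a made-up example metric.
    return (
        "# HELP app_uptime_seconds Seconds since the process started.\n"
        "# TYPE app_uptime_seconds gauge\n"
        f"app_uptime_seconds {time.time() - START:.0f}\n"
    )

class MetricsHandler(http.server.BaseHTTPRequestHandler):
    """Serves /metrics for Prometheus to scrape."""
    def do_GET(self):
        if self.path != "/metrics":
            self.send_response(404)
            self.end_headers()
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Expose metrics on port 8000 (an assumed port) until interrupted.
    http.server.HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

A scrape of `http://localhost:8000/metrics` would then return the uptime gauge in a format Prometheus can ingest directly.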
Scraping / Storing
Prometheus can be configured with scrape jobs. A scrape job is a recurring call to an exporter for metrics; the metrics received are stored in Prometheus's time-series database.
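A scrape job is declared in Prometheus's configuration file. As a sketch, assuming a service named `payments-service` exposing metrics on port 8000 (both names are made up for illustration):

```yaml
# Hypothetical scrape job in prometheus.yml.
scrape_configs:
  - job_name: payments-service
    scrape_interval: 15s        # call the exporter every 15 seconds
    metrics_path: /metrics
    static_configs:
      # Assumed in-cluster DNS name and port of the exporter.
      - targets: ["payments-service.default.svc:8000"]
```

In a real cluster you would usually replace `static_configs` with Kubernetes service discovery so new pods are scraped automatically.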
Querying / Visualizing
Prometheus provides an expression-based query language (PromQL) whose expressions resolve to an instant vector, a range vector, a scalar, or a string. The output can be plotted onto a graph, and you can gather a collection of related graphs into a dashboard in Grafana.
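As a sketch of those result types, assuming the metric names commonly exposed by cAdvisor (`container_memory_usage_bytes`) and kube-state-metrics (`kube_pod_info`) are available in your cluster:

```promql
# Instant vector: current memory usage of each container in a namespace.
container_memory_usage_bytes{namespace="default"}

# Range vector: the same series over the last 5 minutes (used inside rate(), avg_over_time(), etc.).
container_memory_usage_bytes{namespace="default"}[5m]

# Scalar: the total number of pods in the cluster, as a single number.
scalar(count(kube_pod_info))
```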
A cloud-native environment has a lot of moving parts in a lot of places. Monitoring will help you tackle issues ASAP, and even alert on or prevent them conditionally. To get hands-on with Prometheus, Grafana, and advanced monitoring concepts, proceed to the next part: Monitoring in Kubernetes (Hands-On).