Kubernetes Backup and Disaster Recovery

This article borrows the why, how and what methodology of communicating described by Simon Sinek. I find this model very contextual for teaching, and context is what makes our brains tick.

I am assuming that you roughly understand distributed computing, containerization, orchestration and microservices. If not, follow the links.

Overview of Kubernetes

Kubernetes is a very powerful resource orchestration service. With the kube API and controllers acting as the control plane, Kubernetes acts as a non-terminating loop whose sole job is to move the cluster from its current state to the desired state. The desired state is generally described in resource definitions, which I will loosely refer to as CRDs (strictly speaking, a CRD defines a custom resource type, and the desired state lives in the objects created from it). Controllers read these objects and provision or remove resources, either directly or via the Kubernetes API. Tools like Helm have made it easier to create and share these definitions dynamically.

Why (Necessity is the mother of invention)

While working at webveloper, we were running a tracing and observability project for our multi-cluster Kubernetes deployment. We had organized our infrastructure as either Helm charts or CRDs, with repos representing our monitoring, logging, product, service mesh, and so on. As research for the project, I was fiddling with the CRDs of the mesh (Istio in our case). Long story short, I altered one CRD that described the API gateway, and then all hell broke loose. Soon our monitoring told us that the product services were down. As a firefighting effort, we started re-creating all the relevant CRDs from the repos, but it took two infrastructure engineers nearly ten hours to make sure every single CRD was created and the cluster was back in its desired state. There could not have been a bigger motivation to look for backup and disaster recovery tooling. That's the why: in case of disaster, reducing the time taken to restore the cluster to its desired state from hours to a matter of minutes.

How To Do (Approaching Kubernetes backup)

As described in the overview, Kubernetes depends upon controllers to read the CRDs (resource definition objects/specs) and provision resources accordingly. That data is stored in etcd, the key/value database used by Kubernetes. Hence, approaching Kubernetes backup is as easy as taking a snapshot of all the CRDs that are currently in the cluster and, in case of disaster, using that snapshot to create them again, leaving the job of provisioning the resources to the controllers. That is basically it.
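
To make the idea concrete, here is a minimal do-it-yourself sketch of such a snapshot using plain kubectl. The helper names `snapshot_resources` and `restore_resources`, the one-file-per-kind layout, and the choice of kinds are my own illustrative assumptions, not part of any tool. Exported objects also carry server-side fields such as `resourceVersion` and `status`, so a real restore needs more care than this. That is exactly the gap dedicated tooling fills.

```shell
# Hypothetical helper: dump the given resource kinds from every namespace
# into one YAML file per kind under the target directory.
snapshot_resources() {
    out_dir="$1"
    shift
    mkdir -p "$out_dir" || return 1
    for kind in "$@"; do
        # -o yaml emits the full object specs that the controllers act upon
        kubectl get "$kind" --all-namespaces -o yaml > "$out_dir/$kind.yaml" || return 1
    done
}

# Hypothetical helper: re-create everything stored in a snapshot directory
# and leave the provisioning to the controllers.
restore_resources() {
    kubectl apply -f "$1"
}
```

For example, `snapshot_resources ./snapshot deployments services configmaps` followed later by `restore_resources ./snapshot`.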

What to do (Velero by VMware)

Fortunately, there is a very stable and relevant community effort that was originated by engineers at VMware. Over several iterations, it has become Velero (a Kubernetes backup and migration service) in its current form.

Each Velero operation – on-demand backup, scheduled backup, restore – is a custom resource, defined with a Kubernetes Custom Resource Definition (CRD) and stored in etcd. Velero also includes controllers that process the custom resources to perform backups, restores, and all related operations.
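
Because these operations are ordinary custom resources, they can be inspected with kubectl like any other object. The sketch below assumes Velero is installed in the `velero` namespace (the default used by its installer). The function names are hypothetical.

```shell
# List the CRDs that Velero registers (their names end in .velero.io).
velero_crds() {
    kubectl get crds -o name | grep 'velero\.io$'
}

# List backup, schedule and restore objects in a namespace (default: velero).
velero_objects() {
    kubectl get backups.velero.io,schedules.velero.io,restores.velero.io \
        --namespace "${1:-velero}"
}
```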

Setting Up

We will do a hands-on walkthrough of setting up Velero, creating scheduled backups, and running an example restore. We will use GCP as the provider for cloud storage (the backup location) and GKE for our example cluster.
You can find docs for setting up with other providers here: https://velero.io/docs/v1.5/supported-providers/

We will use the Istio Bookinfo microservice example as our product services.

GCP Plugin

We will be setting up the object store plugin, which keeps the backups in a Google Cloud Storage bucket. That involves creating a bucket and granting the permissions needed to use that bucket.

Make sure you are authenticated to Google Cloud and that the relevant project is selected in the gcloud CLI. Store the project value from the results in the environment variable $PROJECT_ID.
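
The GCP-side steps can be sketched roughly as below, modeled on the setup steps in the Velero GCP plugin documentation as I recall them, so verify against the plugin version you use. The function name `setup_velero_gcp` is hypothetical. The service account name `velero`, the role ID `velero.server` and the key file name `credentials-velero` are illustrative choices. The bucket name is whatever you pass in.

```shell
# Sketch only. Assumes gcloud and gsutil are installed and authenticated.
# $1 = name of the bucket that will hold the backups.
setup_velero_gcp() {
    bucket="$1"

    # Store the currently selected project in PROJECT_ID
    PROJECT_ID=$(gcloud config get-value project) || return 1

    # Create the bucket for the backups
    gsutil mb "gs://$bucket/" || return 1

    # Create a dedicated service account and look up its email address
    gcloud iam service-accounts create velero \
        --display-name "Velero service account" || return 1
    sa_email=$(gcloud iam service-accounts list \
        --filter="displayName:Velero service account" \
        --format 'value(email)') || return 1

    # Custom role with the compute permissions used for disk snapshots
    permissions="compute.disks.get,compute.disks.create,compute.disks.createSnapshot"
    permissions="$permissions,compute.snapshots.get,compute.snapshots.create"
    permissions="$permissions,compute.snapshots.useReadOnly,compute.snapshots.delete"
    permissions="$permissions,compute.zones.get"
    gcloud iam roles create velero.server \
        --project "$PROJECT_ID" \
        --title "Velero Server" \
        --permissions "$permissions" || return 1

    # Bind the role to the service account and grant access to the bucket
    gcloud projects add-iam-policy-binding "$PROJECT_ID" \
        --member "serviceAccount:$sa_email" \
        --role "projects/$PROJECT_ID/roles/velero.server" || return 1
    gsutil iam ch "serviceAccount:$sa_email:objectAdmin" "gs://$bucket" || return 1

    # Write a key file that the Velero installation can use as its secret
    gcloud iam service-accounts keys create credentials-velero \
        --iam-account "$sa_email"
}
```

Calling `setup_velero_gcp my-velero-backups` (a made-up bucket name) leaves a `credentials-velero` file in the current directory.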

Installing and setting up Velero

Download and install the velero CLI from the project's website. Once the CLI is ready, install the Velero server components into the cluster.
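
A minimal sketch of the server-side installation, assuming a bucket and a service account key file already exist. The wrapper name `install_velero` and the `VELERO_GCP_PLUGIN_TAG` variable are hypothetical, and the default plugin tag is an assumption: pick the tag that matches your Velero version.

```shell
# Sketch: install the Velero server components into the current cluster.
# $1 = bucket name, $2 = path to the service account key file.
install_velero() {
    velero install \
        --provider gcp \
        --plugins "velero/velero-plugin-for-gcp:${VELERO_GCP_PLUGIN_TAG:-v1.1.0}" \
        --bucket "$1" \
        --secret-file "$2"
}
```

Once it returns, `kubectl get deployment velero --namespace velero` should list the server deployment.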

Creating Scheduled Backups

Here is how you can set up Velero to create scheduled daily backups. The last argument is a cron expression; the one below runs every day at 07:00. You can use a crontab tool to generate the value you need.

velero schedule create daily-backup --schedule "0 7 * * *"
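
A schedule can also be narrowed to specific namespaces and given a retention period. This is a sketch under assumptions: the wrapper name `create_daily_schedule` and the example arguments are mine, while `--include-namespaces` and `--ttl` are regular velero flags.

```shell
# Sketch: a daily 07:00 schedule limited to the given namespaces.
# $1 = schedule name, $2 = comma-separated namespaces,
# $3 = how long each backup is kept, as a duration (e.g. 168h0m0s)
create_daily_schedule() {
    velero schedule create "$1" \
        --schedule "0 7 * * *" \
        --include-namespaces "$2" \
        --ttl "$3"
}
```

For example, `create_daily_schedule daily-default default 168h0m0s` keeps a week of backups of the default namespace, and `velero schedule get` lists the schedules that exist.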

Disaster Recovery

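Suppose a disaster has occurred. Follow these steps to restore your cluster.

First, update your backup storage location to read-only mode (this prevents backup objects from being created or deleted in the backup storage location during the restore process). Then create a restore from your most recent backup, and switch the storage location back to read-write mode once the restore has completed.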
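
The restore procedure can be sketched as follows, assuming Velero runs in the `velero` namespace. The function names are hypothetical. The storage location name is often `default`, and backups made by a schedule are named after the schedule plus a timestamp, so check `velero backup get` for the exact name.

```shell
# $1 = backup storage location name, $2 = ReadOnly or ReadWrite
set_location_access_mode() {
    kubectl patch backupstoragelocation "$1" \
        --namespace velero \
        --type merge \
        --patch "{\"spec\":{\"accessMode\":\"$2\"}}"
}

# $1 = backup storage location name, $2 = name of the backup to restore
disaster_restore() {
    location="$1"
    backup="$2"
    # 1. Freeze the storage location so no backups are created or deleted
    set_location_access_mode "$location" ReadOnly || return 1
    # 2. Re-create the cluster objects; --wait blocks until the restore ends
    velero restore create --from-backup "$backup" --wait
    status=$?
    # 3. Switch back to read-write so that scheduled backups resume
    set_location_access_mode "$location" ReadWrite || return 1
    return "$status"
}
```

A call would look like `disaster_restore default <SCHEDULE NAME>-<TIMESTAMP>`.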

Manual Backup and Restore

You can also create backups manually and restore from them on demand. Start by creating a backup, then create a restore that points at it.
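
A minimal sketch of the manual flow. The wrapper names and the example backup name are my own. The Bookinfo services are assumed to live in the `default` namespace.

```shell
# $1 = backup name, $2 = comma-separated namespaces to include.
# --wait blocks until the backup has finished.
manual_backup() {
    velero backup create "$1" --include-namespaces "$2" --wait
}

# $1 = name of an existing backup
manual_restore() {
    velero restore create --from-backup "$1"
}
```

For example, run `manual_backup bookinfo-backup default`, inspect the result with `velero backup describe bookinfo-backup`, and later run `manual_restore bookinfo-backup`. `velero restore get` shows the state of the restore.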

Summary

Having Kubernetes backup and disaster recovery in place can help you reduce the time taken to restore your clusters from hours to a matter of minutes.