Kubernetes Backup and Disaster Recovery
This article borrows the Why, How, and What model of communicating described by Simon Sinek. I find this model very effective for teaching because it leads with context, and context is what makes our brains tick.
I have made an assumption that you roughly understand distributed computing, containerization, orchestration, and microservices. If not, click through the links before reading on.
Overview of Kubernetes
Kubernetes is a very powerful resource orchestration service. With the kube API server and controllers acting as the control plane, Kubernetes runs a non-terminating loop whose sole job is to move the cluster from its current state to the desired state. The desired state is generally described in resource definitions, including CRDs (Custom Resource Definitions). Controllers read these objects and provision or remove resources, directly or via the Kubernetes API. Platforms like Helm have made it easier to create and share these definitions dynamically.
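To make the reconciliation loop concrete, here is a minimal sketch. The Deployment below is purely illustrative (the name hello-web and the image tag are made up for this example); the deployment controller will keep working until the actual state matches the declared spec:

# Declare a desired state: three nginx replicas. The deployment controller
# reconciles the cluster until the actual state matches this spec.
# ("hello-web" and the image tag are made-up values for illustration.)
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: hello-web
  template:
    metadata:
      labels:
        app: hello-web
    spec:
      containers:
      - name: web
        image: nginx:1.21
EOF

# Delete the pods; the controller notices the drift and recreates them
# to restore the desired state.
kubectl delete pod -l app=hello-web --wait=false
kubectl get pods -l app=hello-web --watch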
Why (Necessity is the mother of invention)
While working at webveloper, we were building a tracing and observability project for our multi-cluster Kubernetes deployment. We had organized our infrastructure as Helm charts and CRDs, with repos representing our monitoring, logging, product, service-mesh, and so on. While researching the project, I was fiddling with the mesh (Istio, in our case) CRDs. Long story short, I altered one CRD that described the API gateway, and all hell broke loose. Soon our monitoring was telling us that product services were down. As a firefighting effort, we started recreating all the relevant CRDs from our repos, but it took two infrastructure engineers nearly ten hours to make sure every single CRD was created and the cluster was back in its desired state. There couldn't be a bigger motivation to look for backup and disaster recovery tooling. That's the why! In case of disaster, reduce the time taken to restore the cluster to its desired state from hours to minutes.
How (Approaching Kubernetes backup)
As described in the overview, Kubernetes depends on controllers to read resource definitions (objects/specs, including CRDs) and provision resources accordingly. All of that data lives in etcd, Kubernetes' key/value database. Hence, approaching Kubernetes backup is as simple as taking a snapshot of all the resource definitions currently in the cluster and, in case of disaster, using that snapshot to create them again, leaving the job of provisioning the actual resources to the controllers. That is basically it.
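To make the idea concrete before reaching for dedicated tooling, here is a naive version of that snapshot in plain kubectl. This is a sketch for intuition only: it ignores cluster-scoped resources, object ordering, server-generated fields, and volume data, which is exactly the hard part that dedicated tooling handles.

# Naive snapshot: dump every listable, namespaced resource as YAML
for resource in $(kubectl api-resources --verbs=list --namespaced -o name); do
  kubectl get "$resource" --all-namespaces -o yaml >> cluster-snapshot.yaml
  echo "---" >> cluster-snapshot.yaml
done

# In a disaster, feeding the snapshot back recreates the objects;
# provisioning the underlying resources is left to the controllers.
kubectl apply -f cluster-snapshot.yaml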
What (Velero by VMware)
Fortunately, there is a very stable and relevant community effort, originated by engineers at Heptio (now part of VMware). Over several iterations it has become Velero, a Kubernetes backup and migration service, in its current form.
Each Velero operation (on-demand backup, scheduled backup, restore) is a custom resource, defined with a Kubernetes Custom Resource Definition (CRD) and stored in etcd. Velero also includes controllers that process these custom resources to perform backups, restores, and all related operations.
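Once Velero is installed (we do this below), you can see this for yourself; a quick sketch, assuming a cluster with Velero already running:

# Velero's own operations are ordinary Kubernetes objects backed by CRDs
kubectl get crds | grep velero.io

# So backups, schedules, and restores can be inspected like any other resource
kubectl get backups.velero.io --namespace velero
kubectl get schedules.velero.io --namespace velero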
Setting Up
We will do a hands-on walkthrough of setting up Velero, creating scheduled backups, and restoring from one. We will use GCP as the provider for cloud storage (the backup location) and GKE for our example cluster.
You can find docs for setting up with other providers here: https://velero.io/docs/v1.5/supported-providers/
We will use the Istio bookinfo microservices example as our product services:

kubectl apply -f samples/bookinfo/platform/kube/bookinfo.yaml
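Assuming you run this from the root of an Istio release download (where the samples/ directory lives), you can confirm the product services are up before taking any backups:

kubectl get pods   # all bookinfo pods should reach Running before we back up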
GCP Plugin
We will be setting up the object store plugin for GCP. That involves creating a bucket and granting Velero the permissions it needs to use that bucket.
Make sure you are authenticated to Google Cloud and that the relevant project is selected in the gcloud CLI.

# Create the bucket
BUCKET=<YOUR_BUCKET>
gsutil mb gs://$BUCKET/

# View the config list and copy the project name
gcloud config list

# Store the project value from the results in $PROJECT_ID
PROJECT_ID=$(gcloud config get-value project)

# Create a service account
gcloud iam service-accounts create velero \
    --display-name "Velero service account"

# Set $SERVICE_ACCOUNT_EMAIL to match the account's email value
SERVICE_ACCOUNT_EMAIL=$(gcloud iam service-accounts list \
    --filter="displayName:Velero service account" \
    --format 'value(email)')

Attach policies to give Velero the necessary permissions to function:

# The permissions Velero needs
ROLE_PERMISSIONS=(
    compute.disks.get
    compute.disks.create
    compute.disks.createSnapshot
    compute.snapshots.get
    compute.snapshots.create
    compute.snapshots.useReadOnly
    compute.snapshots.delete
    compute.zones.get
)

# Create a custom role with those permissions
gcloud iam roles create velero.server \
    --project $PROJECT_ID \
    --title "Velero Server" \
    --permissions "$(IFS=","; echo "${ROLE_PERMISSIONS[*]}")"

# Bind the new role to our service account
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member serviceAccount:$SERVICE_ACCOUNT_EMAIL \
    --role projects/$PROJECT_ID/roles/velero.server

# Give the service account access to our bucket
gsutil iam ch serviceAccount:$SERVICE_ACCOUNT_EMAIL:objectAdmin gs://${BUCKET}

# Create a service account key, specifying an output file (credentials-velero)
# in your local directory. Store it somewhere safe.
gcloud iam service-accounts keys create credentials-velero \
    --iam-account $SERVICE_ACCOUNT_EMAIL
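Before moving on, it is worth sanity-checking the setup. A quick check (not part of the original walkthrough), assuming the variables from above are still set in your shell:

# Confirm the service account is bound to the bucket with objectAdmin
gsutil iam get gs://$BUCKET

# Confirm the key file was written locally
ls -l credentials-velero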
Installing and setting up Velero
Download and install the velero CLI from the Velero website. Once the CLI is ready, install Velero on the cluster:
# Install Velero on the cluster
velero install \
--provider gcp \
--plugins velero/velero-plugin-for-gcp:v1.1.0 \
--bucket $BUCKET \
--secret-file ./credentials-velero
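The install command deploys the Velero server into the velero namespace. To verify it is healthy and can see the bucket (a quick check, not from the original walkthrough):

# The Velero server runs as a deployment in the "velero" namespace
kubectl get pods --namespace velero

# The backup location should list our GCS bucket
velero backup-location get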
Creating Scheduled Backups
Here is how you can set up Velero to create scheduled daily backups. The last argument is a cron expression; you can use a cron expression generator to produce your desired value.
velero schedule create daily-backup --schedule "0 7 * * *"
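To confirm the schedule was registered, and to trigger a backup from it immediately rather than waiting for 07:00 (the --from-schedule flag is standard Velero, though this step is an addition to the original walkthrough):

# List registered schedules
velero schedule get

# Optionally, trigger an immediate backup from the schedule
velero backup create --from-schedule daily-backup

# List the backups produced so far
velero backup get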
Disaster Recovery
Suppose a disaster has occurred. Follow these steps to restore your cluster.
# Update your backup storage location to read-only mode. This prevents
# backup objects from being created or deleted in the backup storage
# location during the restore process.

# Get the storage location name and copy the Name value
velero backup-location get

# Set the storage location access mode to read-only
kubectl patch backupstoragelocation <STORAGE LOCATION NAME> \
    --namespace velero \
    --type merge \
    --patch '{"spec":{"accessMode":"ReadOnly"}}'

# Create a restore from the latest successful backup triggered by the
# schedule "daily-backup"
velero restore create --from-schedule daily-backup

# Once the restore job is created, use the following command to monitor progress
velero restore describe <Restore-Object-Name>

# When the restore completes, patch the backup location back to read/write access
kubectl patch backupstoragelocation <STORAGE LOCATION NAME> \
    --namespace velero \
    --type merge \
    --patch '{"spec":{"accessMode":"ReadWrite"}}'
Manual Backup and Restore
You can also create manual backups and restore from them on demand.
# Create a backup
velero backup create my-backup-1

# Get a list of available backups
velero backup get

# Create a restore job from a backup name
velero restore create --from-backup my-backup-1
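Manual backups do not have to cover the whole cluster. Velero's --include-namespaces flag (standard Velero, not part of the original walkthrough) lets you scope a backup, which is handy for per-product or per-team recovery:

# Back up only the namespace(s) the product services live in
velero backup create bookinfo-backup --include-namespaces default

# Later, restore from that scoped backup by name
velero restore create --from-backup bookinfo-backup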
Summary
Having Kubernetes backup and disaster recovery in place can reduce the time taken to restore your clusters from hours to minutes.