Backup and Restore a Kafka Cluster on Kubernetes

The latest MirrorMaker 2 can replicate data from one Kafka cluster to a destination cluster. However, a local Kafka cluster still needs its own backup and restore procedure. When a Kafka cluster runs on Kubernetes, the traditional backup and restore methods need to be revised. On the other hand, the standardization of the Kubernetes storage API around the Container Storage Interface (CSI) makes backup and restore for stateful apps on Kubernetes much easier.

Using IBM Event Streams V10.1 (Kafka 2.6) as an example, this paper explores how to back up and restore a local Kafka cluster. The Kafka StatefulSet runs on Ceph RBD block storage managed by the Rook operator.

Backup Kafka Cluster with CSI Volume Cloning

Since Kafka is running on Kubernetes, backing up the configuration of the Kafka cluster becomes a question of how to back up the etcd database. That procedure is well known, so I skip it here.

Now for the messages and offsets of the topics in the Kafka cluster: we need to back up the persistent volume of each broker. In my test environment, I have 3 brokers, each with a dynamically provisioned Ceph RBD block volume. The PVCs are named data-minimal-prod-kafka-0, data-minimal-prod-kafka-1, and data-minimal-prod-kafka-2 respectively, the claim names used in the manifests in this article.

Thanks to the CSI volume cloning supported by Rook Ceph, backing up the volumes that hold the topic data becomes a declarative YAML task. For each broker's PVC, apply the following YAML with kubectl.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: broker0-clone
  namespace: eventstreams
spec:
  storageClassName: rook-ceph-block
  dataSource:
    name: data-minimal-prod-kafka-0
    kind: PersistentVolumeClaim
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 4Gi

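Notice that it is a PVC object whose dataSource names the source PVC from which the volume will be cloned. The clone must be created in the same namespace as the source. Its accessModes and size follow the source PVC, and the requested size must be at least as large as the source.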
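Since the same manifest has to be applied once per broker, the repetition can be scripted. The following is a minimal sketch, assuming the PVC and clone names used in this article and a 4Gi volume. The helper name clone_pvc_manifests is hypothetical. It only prints the manifests, and applying them is left to the caller.

```shell
#!/bin/sh
# Print one clone PVC manifest per broker. Nothing is applied here.
# Assumptions taken from the article: source PVCs are named
# data-minimal-prod-kafka-<n>, clones are named broker<n>-clone,
# and both live in the eventstreams namespace.
# clone_pvc_manifests is a hypothetical helper name.
clone_pvc_manifests() {
  brokers="$1"        # number of brokers, e.g. 3
  size="${2:-4Gi}"    # must be at least the size of the source PVC
  i=0
  while [ "$i" -lt "$brokers" ]; do
    cat <<EOF
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: broker${i}-clone
  namespace: eventstreams
spec:
  storageClassName: rook-ceph-block
  dataSource:
    name: data-minimal-prod-kafka-${i}
    kind: PersistentVolumeClaim
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: ${size}
EOF
    i=$((i + 1))
  done
}

# Example (requires cluster access):
#   clone_pvc_manifests 3 | oc apply -f -
```

Printing the manifests first makes it easy to review them before piping them to oc apply or kubectl apply.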

Once the PVC is ready, the volume has been cloned and the data is backed up. Ideally, of course, it is better to shut down the Kafka cluster before cloning the volumes to make sure no data is lost, but that introduces downtime. If losing some messages is tolerable, we can let Kafka perform its recovery-point processing to validate and recover the data.
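One way to check that a clone PVC is ready is to poll its phase until it reports Bound. The sketch below assumes the eventstreams namespace from this article and a storage class that binds immediately. The helper name wait_for_pvc_bound, the retry count, and the 5-second interval are illustrative choices, not part of the original procedure.

```shell
#!/bin/sh
# Poll a PVC until its status.phase is Bound, or give up after N attempts.
# wait_for_pvc_bound is a hypothetical helper name.
wait_for_pvc_bound() {
  pvc="$1"                        # e.g. broker0-clone
  namespace="${2:-eventstreams}"  # namespace used in the article
  attempts="${3:-60}"             # assumed retry budget
  while [ "$attempts" -gt 0 ]; do
    phase="$(oc get pvc "$pvc" -n "$namespace" -o jsonpath='{.status.phase}')"
    if [ "$phase" = "Bound" ]; then
      echo "$pvc is Bound"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 5
  done
  echo "$pvc did not reach Bound" >&2
  return 1
}

# Example (requires cluster access):
#   for i in 0 1 2; do wait_for_pvc_bound "broker${i}-clone" || exit 1; done
```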

Disasters…

Let's simulate a disaster in which the brokers' data is lost, by deleting the data on the volumes.

oc exec -i minimal-prod-kafka-0 -- bash -c 'rm -rf /var/lib/kafka/data/*' 
oc exec -i minimal-prod-kafka-1 -- bash -c 'rm -rf /var/lib/kafka/data/*'
oc exec -i minimal-prod-kafka-2 -- bash -c 'rm -rf /var/lib/kafka/data/*'

Restore

Before starting, let's scale down the brokers' StatefulSet to 0 so that nothing writes data to the PVs.
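The scale-down can be sketched as a small helper around oc scale. The StatefulSet name minimal-prod-kafka is inferred from the pod names in this article. The helper name scale_brokers and the polling parameters are hypothetical. The sketch also assumes that nothing reverts a manual scale change; if an operator manages the StatefulSet, check how it reacts before relying on this.

```shell
#!/bin/sh
# Scale the broker StatefulSet and wait until its status matches the target.
# scale_brokers is a hypothetical helper name.
scale_brokers() {
  replicas="$1"                            # target replica count, e.g. 0 or 3
  statefulset="${2:-minimal-prod-kafka}"   # inferred from the pod names
  namespace="${3:-eventstreams}"
  attempts="${4:-60}"                      # assumed retry budget
  oc scale statefulset "$statefulset" --replicas="$replicas" -n "$namespace" || return 1
  # When scaling to zero, watch the pod count; otherwise watch ready pods.
  if [ "$replicas" -eq 0 ]; then
    field='{.status.replicas}'
  else
    field='{.status.readyReplicas}'
  fi
  while [ "$attempts" -gt 0 ]; do
    current="$(oc get statefulset "$statefulset" -n "$namespace" -o jsonpath="$field")" || return 1
    # An empty value means the field is unset, which is treated as 0 here.
    if [ "${current:-0}" = "$replicas" ]; then
      echo "$statefulset: $replicas replicas"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 5
  done
  echo "$statefulset did not reach $replicas replicas" >&2
  return 1
}

# Example (requires cluster access):
#   scale_brokers 0    # before the restore
```

The same helper can be called with 3 once the restore has finished to bring the brokers back.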

Create the following Kubernetes Job.

apiVersion: batch/v1
kind: Job
metadata:
  name: restore-1603637288
  namespace: eventstreams
spec:
  template:
    spec:
      containers:
        - name: restore
          image: ubuntu
          command:
            - sh
            - -c
            - rm -rf /broker0/*; cp -r /broker0-clone/* /broker0; rm -rf /broker1/*; cp -r /broker1-clone/* /broker1; rm -rf /broker2/*; cp -r /broker2-clone/* /broker2
          volumeMounts:
            - name: broker0
              mountPath: /broker0
            - name: broker0-clone
              mountPath: /broker0-clone
            - name: broker1
              mountPath: /broker1
            - name: broker1-clone
              mountPath: /broker1-clone
            - name: broker2
              mountPath: /broker2
            - name: broker2-clone
              mountPath: /broker2-clone
      restartPolicy: Never
      volumes:
        - name: broker0-clone
          persistentVolumeClaim:
            claimName: broker0-clone
        - name: broker0
          persistentVolumeClaim:
            claimName: data-minimal-prod-kafka-0
        - name: broker1-clone
          persistentVolumeClaim:
            claimName: broker1-clone
        - name: broker1
          persistentVolumeClaim:
            claimName: data-minimal-prod-kafka-1
        - name: broker2-clone
          persistentVolumeClaim:
            claimName: broker2-clone
        - name: broker2
          persistentVolumeClaim:
            claimName: data-minimal-prod-kafka-2

In this job, we mount both the clone volumes and the source volumes, clean up each source volume, and copy the content back from the clone volume to the source volume.
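Submitting the Job and waiting for it to finish can be sketched as follows. The helper name run_restore_job, the manifest file name restore-job.yaml, and the 600-second timeout are assumptions for illustration. The Job name and namespace come from the manifest above.

```shell
#!/bin/sh
# Apply a Job manifest and block until the Job reports the Complete condition.
# run_restore_job is a hypothetical helper name.
run_restore_job() {
  manifest="$1"                   # path to the Job manifest (assumed file name)
  job="$2"                        # Job name, e.g. restore-1603637288
  namespace="${3:-eventstreams}"
  oc apply -f "$manifest" || return 1
  # Note: if the Job fails, this wait only returns when the timeout expires.
  if oc wait --for=condition=complete "job/$job" -n "$namespace" --timeout=600s; then
    echo "restore job $job completed"
  else
    # Show the container output to help diagnose the failure.
    oc logs "job/$job" -n "$namespace"
    return 1
  fi
}

# Example (requires cluster access):
#   run_restore_job restore-job.yaml restore-1603637288
```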

Now scale the brokers back up to the original 3 replicas and verify that the messages and consumer offsets are restored.
