Troubleshoot Ceph Storage with Ceph Tool for OpenShift Data Foundation

Zhimin Wen
2 min read · Aug 11, 2023

You may want to troubleshoot Ceph storage issues coming from OpenShift Data Foundation (ODF). In a Rook-operator-based Ceph storage solution, you can deploy the "rook-ceph-tools" deployment to get the ceph command-line tool for troubleshooting purposes. In the ODF operator, however, the ceph command-line tool is already available.

Launching the Ceph Command-Line Tool

Exec into the rook-ceph-operator pod in the ODF namespace:

oc -n openshift-storage exec -it rook-ceph-operator-5ff898fc8-pppgd -- bash
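The operator pod's hash suffix (5ff898fc8-pppgd above) differs on every cluster. As a convenience, you can look the pod up by its label instead of copying the name; a small sketch, assuming the standard app=rook-ceph-operator label that Rook applies to the operator deployment:

```shell
# Resolve the operator pod by label, since the hash suffix varies per cluster
POD=$(oc -n openshift-storage get pods -l app=rook-ceph-operator -o name | head -n1)
oc -n openshift-storage exec -it "$POD" -- bash
```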

cd /var/lib/rook/openshift-storage
ls
client.admin.keyring openshift-storage.config

The file openshift-storage.config in the /var/lib/rook/openshift-storage directory can be used as the Ceph config file; it defines the MON hosts and the admin client's keyring.


[global]
fsid = 5e20035a-0552-461a-a194-46dca45ba8fc
mon initial members = j k h
mon host = [v2:172.30.26.44:3300, v1:172.30.26.44:6789], [v2:172.30.16.22:3300, v1:172.30.16.22:6789], [v2:172.30.140.78:3300, v1:172.30.140.78:6789]

...

[osd]
osd_memory_target_cgroup_limit_ratio = 0.8
[client.admin]
keyring = /var/lib/rook/openshift-storage/client.admin.keyring
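To avoid passing -c on every invocation, the ceph CLI also honors the CEPH_CONF environment variable; a convenience sketch for the exec-ed shell:

```shell
# Point CEPH_CONF at the operator's generated config so a plain `ceph` call works
export CEPH_CONF=/var/lib/rook/openshift-storage/openshift-storage.config
ceph -s   # equivalent to: ceph -c openshift-storage.config -s
```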

With this, we can run the ceph command-line tool for troubleshooting. Continuing in the exec-ed shell, check the cluster status with the commands below:

cd /var/lib/rook/openshift-storage
ceph -c openshift-storage.config -s
cluster:
  id: 5e20035a-0552-461a-a194-46dca45ba8fc
  health: HEALTH_OK

services:
  mon: 3 daemons, quorum h,j,k (age 3d)
  mgr: a(active, since 8d)
  mds: 1/1 daemons up, 1 hot standby
  osd: 12 osds: 12 up (since 8d), 12 in (since 11d)

data:
  volumes: 1/1 healthy
  pools: 4 pools, 417 pgs
  objects: 905.53k objects, 1.3 TiB
  usage: 4.0 TiB used, 8.0 TiB / 12 TiB avail
  pgs: 417 active+clean

io:
  client: 12 MiB/s rd, 53 MiB/s wr, 43 op/s rd, 1.31k op/s wr
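Beyond ceph -s, a few other read-only subcommands are handy for drilling into a problem. These are standard ceph CLI subcommands, run the same way against the config file:

```shell
ceph -c openshift-storage.config health detail   # expands any WARN/ERR into per-check messages
ceph -c openshift-storage.config osd tree        # OSD topology with up/down and in/out state
ceph -c openshift-storage.config df              # raw and per-pool capacity usage
```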

Fixing a Warning Issue

In one ODF setup, the OpenShift Console showed the following warning for the storage.

Running ceph -s in the operator pod indicated that the health status was HEALTH_WARN, because mds.ocs-storagecluster-cephfilesystem-b had crashed previously.

Check the crash by listing the crash IDs, then inspect the details for a given ID:

ceph -c openshift-storage.config crash ls
ceph -c openshift-storage.config crash info <CRASH_ID>

Silence the warning alarm by archiving the crash ID:

ceph -c openshift-storage.config crash archive <CRASH_ID>
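If several old crash reports have accumulated, they can also be cleared in one go with the crash archive-all subcommand, which archives every crash report not yet archived:

```shell
# Archive all outstanding crash reports at once instead of one ID at a time
ceph -c openshift-storage.config crash archive-all
```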

The warning message is then cleared in the OpenShift Console.
