
Troubleshooting Rook-Ceph

OSD pods are not created after zapping devices

Warning

You may sometimes encounter issues with OSDs missing when you re-provision an existing node, because zapping didn't fully remove the Ceph metadata. My current cleanup version seems to do the trick most of the time, but sometimes I still see rook-ceph-osd-prepare reporting that no partitions were found and that the old partition (e.g. /dev/nvme0n1p3) belongs to another cluster. This seems to happen even more often when I try to re-create the entire cluster from scratch.
In such cases, I boot the machine from an Ubuntu live USB and manually wipe, re-create, and wipe the partitions on the disk several times via gparted.
Any PRs to make zapping existing partitions more robust are very much welcome!
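
For reference, wiping the disk from a live USB can also be done on the command line. The following is only a sketch based on the Zapping Devices steps in the Rook Ceph Teardown docs; DISK is a placeholder you need to point at the device being wiped:

# Assumption: adjust DISK to the device you are wiping
DISK="/dev/nvme0n1"

# Wipe the partition table and zero the beginning of the disk to remove Ceph metadata
sgdisk --zap-all "$DISK"
dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync

# On SSD/NVMe, discard all blocks as well
blkdiscard "$DISK"

# Remove any leftover ceph-volume LVM mappings
ls /dev/mapper/ceph-* 2>/dev/null | xargs -I% -- dmsetup remove %
rm -rf /dev/ceph-*

# Re-read the partition table
partprobe "$DISK"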

Even after zapping the disk, re-provisioning the node, and having it join the cluster again, I had to do some manual steps to get an OSD created on the new node. In particular, I had to restart the rook-ceph-operator.

Looking at the OSD pods with:

kubectl -n rook-ceph get pod -l app=rook-ceph-osd

showed only 4 OSDs, even a few hours after the node had been re-provisioned:

NAME                               READY   STATUS    RESTARTS   AGE
rook-ceph-osd-0-5554f4f4bc-rrpj6   1/1     Running   0          3h25m
rook-ceph-osd-1-5bdcf57db5-ccwnf   1/1     Running   0          3h25m
rook-ceph-osd-3-7d9f568c86-hjj5m   1/1     Running   0          3h18m
rook-ceph-osd-4-5994f5bf78-kvvlh   1/1     Running   0          3h24m

And the prepare pod for draupnir didn't start on its own after re-provisioning the node:

kubectl -n rook-ceph get pod -l app=rook-ceph-osd-prepare
NAME                                     READY   STATUS      RESTARTS   AGE
rook-ceph-osd-prepare-freyja-5x2jp       0/1     Completed   0          3h25m
rook-ceph-osd-prepare-heimdall-qt596     0/1     Completed   0          3h25m
rook-ceph-osd-prepare-megingjord-24cm7   0/1     Completed   0          3h18m
rook-ceph-osd-prepare-odin-v5kwm         0/1     Completed   0          3h24m
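
It can also help to check the operator logs for hints about why the device was skipped (e.g. messages about the old partition belonging to another cluster). This is a generic check, assuming the standard rook-ceph-operator deployment name:

# Look for device-detection and OSD provisioning messages in the operator logs
kubectl -n rook-ceph logs deploy/rook-ceph-operator --tail=200 | grep -i -E 'osd|device|skipping'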

So I had to dig into the troubleshooting guides and eventually found a solution that seemed to work:

After the settings are updated or the devices are cleaned, trigger the operator to analyze the devices again by restarting the operator. Each time the operator starts, it will ensure all the desired devices are configured. The operator does automatically deploy OSDs in most scenarios, but an operator restart will cover any scenarios that the operator doesn't detect automatically.

# Delete the operator to ensure devices are configured. 
# A new pod will automatically be started when the current operator pod is deleted.
$ kubectl -n rook-ceph delete pod -l app=rook-ceph-operator

Then checking the pods after the operator restarted:

kubectl -n rook-ceph get pod -l app=rook-ceph-osd

#NAME                               READY   STATUS    RESTARTS   AGE
#rook-ceph-osd-0-5554f4f4bc-rrpj6   1/1     Running   0          3h25m
#rook-ceph-osd-1-5bdcf57db5-ccwnf   1/1     Running   0          3h25m
#rook-ceph-osd-3-7d9f568c86-hjj5m   1/1     Running   0          3h18m
#rook-ceph-osd-4-5994f5bf78-kvvlh   1/1     Running   0          3h24m
#rook-ceph-osd-5-798fd5f45c-xjzz2   1/1     Running   0          6m18s

kubectl -n rook-ceph get pod -l app=rook-ceph-osd-prepare

#NAME                                     READY   STATUS      RESTARTS   AGE
#rook-ceph-osd-prepare-draupnir-gqz9p     0/1     Completed   0          9m25s
#rook-ceph-osd-prepare-freyja-5x2jp       0/1     Completed   0          9m21s
#rook-ceph-osd-prepare-heimdall-qt596     0/1     Completed   0          9m18s
#rook-ceph-osd-prepare-megingjord-24cm7   0/1     Completed   0          9m15s
#rook-ceph-osd-prepare-odin-v5kwm         0/1     Completed   0          9m12s
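
To double-check that the new OSD actually joined the cluster, you can also query Ceph from the toolbox (this assumes the rook-ceph-tools deployment is installed):

# Verify the new OSD shows up in the CRUSH tree and the cluster is healthy
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status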

Tip

Useful documentation related to the above:

- Ceph Teardown - Zapping Devices
- Ceph Common Issues