Troubleshooting Rook-Ceph
OSD pods are not created after zapping devices
Warning
You may sometimes encounter issues with OSD nodes missing when you re-provision an existing node, because zapping didn't fully remove the Ceph metadata. My current cleanup version seems to do the trick most of the time, but sometimes I still see issues with rook-ceph-osd-prepare reporting that no partitions were found and that the old partition (e.g. /dev/nvme0n1p3) belongs to another cluster. This seems to happen even more often when I try to re-create the entire cluster from scratch.
In such cases, I boot the machine from an Ubuntu live USB and wipe -> create -> wipe the partitions on the disk manually several times via gparted (a command-line alternative is sketched after this note).
Any PRs to make zapping existing partitions more robust are very much welcome!
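For reference, here is a rough sketch of the command-line equivalent, loosely following the Zapping Devices steps from the Rook teardown docs linked in the Tip below. It assumes the OSD disk is /dev/nvme0n1 (adjust for your device) and that you run it from the live USB or on the node before re-provisioning:
DISK="/dev/nvme0n1"   # adjust to the disk that hosted the OSD

# Wipe the partition table and any filesystem/LVM signatures
sgdisk --zap-all "$DISK"
wipefs --all "$DISK"

# Overwrite the start of the disk so ceph-volume no longer finds old metadata
dd if=/dev/zero of="$DISK" bs=1M count=100 oflag=direct,dsync

# On SSD/NVMe, discard all blocks (skip on spinning disks)
blkdiscard "$DISK"

# Remove any leftover ceph-volume LVM mappings
ls /dev/mapper/ceph-* 2>/dev/null | xargs -r -I% dmsetup remove %
rm -rf /dev/ceph-*

# Re-read the partition table
partprobe "$DISK"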
I had to do some manual steps to create the OSD on the new node after zapping the disk, once the node was re-provisioned and had joined the cluster again. In particular, I had to restart the rook-ceph-operator.
Looking at the OSD pods with kubectl -n rook-ceph get pod -l app=rook-ceph-osd showed only four OSDs, even a few hours after re-provisioning the node:
NAME READY STATUS RESTARTS AGE
rook-ceph-osd-0-5554f4f4bc-rrpj6 1/1 Running 0 3h25m
rook-ceph-osd-1-5bdcf57db5-ccwnf 1/1 Running 0 3h25m
rook-ceph-osd-3-7d9f568c86-hjj5m 1/1 Running 0 3h18m
rook-ceph-osd-4-5994f5bf78-kvvlh 1/1 Running 0 3h24m
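If the Rook toolbox is deployed (an assumption on my part, it's optional), the same gap is visible from Ceph's point of view:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree
# The re-provisioned host and its OSD should be missing or shown as down/out here.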
And the prepare pod for draupnir didn't start on its own after re-provisioning the node:
NAME READY STATUS RESTARTS AGE
rook-ceph-osd-prepare-freyja-5x2jp 0/1 Completed 0 3h25m
rook-ceph-osd-prepare-heimdall-qt596 0/1 Completed 0 3h25m
rook-ceph-osd-prepare-megingjord-24cm7 0/1 Completed 0 3h18m
rook-ceph-osd-prepare-odin-v5kwm 0/1 Completed 0 3h24m
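Before restarting anything, it can also help to check the operator log for the node in question; this is just a generic grep I'd use, not part of the original steps:
kubectl -n rook-ceph logs deploy/rook-ceph-operator --tail=500 | grep -i draupnir
# Look for messages about the node's devices being skipped or not found.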
So I had to dig into the troubleshooting guides and eventually found something that seemed to work, based on this solution:
After the settings are updated or the devices are cleaned, trigger the operator to analyze the devices again by restarting the operator. Each time the operator starts, it will ensure all the desired devices are configured. The operator does automatically deploy OSDs in most scenarios, but an operator restart will cover any scenarios that the operator doesn't detect automatically.
# Delete the operator to ensure devices are configured.
# A new pod will automatically be started when the current operator pod is deleted.
$ kubectl -n rook-ceph delete pod -l app=rook-ceph-operator
Then checking the pods after the operator restarted:
kubectl -n rook-ceph get pod -l app=rook-ceph-osd
#NAME READY STATUS RESTARTS AGE
#rook-ceph-osd-0-5554f4f4bc-rrpj6 1/1 Running 0 3h25m
#rook-ceph-osd-1-5bdcf57db5-ccwnf 1/1 Running 0 3h25m
#rook-ceph-osd-3-7d9f568c86-hjj5m 1/1 Running 0 3h18m
#rook-ceph-osd-4-5994f5bf78-kvvlh 1/1 Running 0 3h24m
#rook-ceph-osd-5-798fd5f45c-xjzz2 1/1 Running 0 6m18s
kubectl -n rook-ceph get pod -l app=rook-ceph-osd-prepare
#NAME READY STATUS RESTARTS AGE
#rook-ceph-osd-prepare-draupnir-gqz9p 0/1 Completed 0 9m25s
#rook-ceph-osd-prepare-freyja-5x2jp 0/1 Completed 0 9m21s
#rook-ceph-osd-prepare-heimdall-qt596 0/1 Completed 0 9m18s
#rook-ceph-osd-prepare-megingjord-24cm7 0/1 Completed 0 9m15s
#rook-ceph-osd-prepare-odin-v5kwm 0/1 Completed 0 9m12s
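As a final sanity check (again assuming the toolbox is available), I would confirm the new OSD is up and the cluster is healthy or rebalancing:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
# Expect all OSDs up/in; some backfill/recovery is normal while data moves to the new OSD.
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree
# The draupnir host should now appear with its OSD marked up.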
Tip
Useful documentation related to the above:
- Ceph Teardown - Zapping Devices
- Ceph Common Issues