ToDo¶
-
Automate personal gitea user provisioning
-
Stream/export sensor data from baremetal and visualize it in grafana
- some potentially-useful resources to explore:
- https://grafana.com/blog/2019/11/06/how-to-stream-sensor-data-with-grafana-and-influxdb/
- https://grafana.com/blog/2021/08/12/streaming-real-time-sensor-data-to-grafana-using-mqtt-and-grafana-live/
- https://grafana.com/blog/2024/01/03/how-to-create-alert-rules-to-monitor-sensor-data-with-grafana-and-raspberry-pi/
- https://github.com/sbnb-io/sbnb/blob/main/README-GRAFANA.md
- https://grafana.com/grafana/dashboards/237-sensors/
- https://github.com/ncabatoff/sensor-exporter
- some potentially-useful resources to explore:
-
Automate firmware updates
-
Test zapping devices and make sure it works fine. Current version was tested on
draupnir
, which seemed to wipe the disk fine, but I had to do some manual steps afterwards to create OSD on the new node once the node was re-provisioned and joined the cluster again. Particularly, I had to restart the rook-ceph-operator.Looking at the OSD pods with
...showed only 4 nodes even after a few hours of re-provisioning the node
NAME READY STATUS RESTARTS AGE rook-ceph-osd-0-5554f4f4bc-rrpj6 1/1 Running 0 3h25m rook-ceph-osd-1-5bdcf57db5-ccwnf 1/1 Running 0 3h25m rook-ceph-osd-3-7d9f568c86-hjj5m 1/1 Running 0 3h18m rook-ceph-osd-4-5994f5bf78-kvvlh 1/1 Running 0 3h24m
And prepare pod for
draupnir
didn't start on its own after re-provisioning the node:NAME READY STATUS RESTARTS AGE rook-ceph-osd-prepare-freyja-5x2jp 0/1 Completed 0 3h25m rook-ceph-osd-prepare-heimdall-qt596 0/1 Completed 0 3h25m rook-ceph-osd-prepare-megingjord-24cm7 0/1 Completed 0 3h18m rook-ceph-osd-prepare-odin-v5kwm 0/1 Completed 0 3h24m
So I had to dig into troubleshooting guides and eventually found something that seemed to work based on this solution:
After the settings are updated or the devices are cleaned, trigger the operator to analyze the devices again by restarting the operator. Each time the operator starts, it will ensure all the desired devices are configured. The operator does automatically deploy OSDs in most scenarios, but an operator restart will cover any scenarios that the operator doesn't detect automatically.
# Restart the operator to ensure devices are configured. A new pod will automatically be started when the current operator pod is deleted. $ kubectl -n rook-ceph delete pod -l app=rook-ceph-operator
Then checking the pods after the operator restarted:
kubectl -n rook-ceph get pod -l app=rook-ceph-osd #NAME READY STATUS RESTARTS AGE #rook-ceph-osd-0-5554f4f4bc-rrpj6 1/1 Running 0 3h25m #rook-ceph-osd-1-5bdcf57db5-ccwnf 1/1 Running 0 3h25m #rook-ceph-osd-3-7d9f568c86-hjj5m 1/1 Running 0 3h18m #rook-ceph-osd-4-5994f5bf78-kvvlh 1/1 Running 0 3h24m #rook-ceph-osd-5-798fd5f45c-xjzz2 1/1 Running 0 6m18s kubectl -n rook-ceph get pod -l app=rook-ceph-osd-prepare #NAME READY STATUS RESTARTS AGE #rook-ceph-osd-prepare-draupnir-gqz9p 0/1 Completed 0 9m25s #rook-ceph-osd-prepare-freyja-5x2jp 0/1 Completed 0 9m21s #rook-ceph-osd-prepare-heimdall-qt596 0/1 Completed 0 9m18s #rook-ceph-osd-prepare-megingjord-24cm7 0/1 Completed 0 9m15s #rook-ceph-osd-prepare-odin-v5kwm 0/1 Completed 0 9m12s
-
Install ARA to record ansible executions
-
Try rke2 which includes by default Cilium and nginx-ingress + etcd db
-
Replace kickstart with cloud-init
- Kickstart is gonna end (?) and cloud-init is much more agnostic; also it can trigger ansible pull
- Cloud-init can also be used across muppet OSes, not just Fedora, as is the case with kickstart
- ref: https://github.com/khuedoan/homelab/issues/179#issue-2875515756
-
Improve github+gitea workflow
- One of the downsides is that I need to delete non-master (PR) branches on gitea manually (well, via cli, but still)
- Maybe use a separate branch for gitea? E.g.
main
? - Or maybe use a separate remote for gitea (or for github? since gitea is technically considered "the origin"?) E.g.
gitea
(orgithub
, for github repo origin)?
- Encrypt kubeconfig with sops so it can be committed to git
- Update architecture/overview components
- Basic diagram of code components and their relations
- Description of components and their purpose
-
Update concepts/pxe_boot with a visual "in-action" showcase of how it works, once it's in place
-
Add up-to-date config files of C1111 and C3560 for reference
- Can be placed in a separate note (probably don't even need to make it visible in nav menu) and referenced from installation/production/network
-
Check that devices on Guest WiFi network (when Eero is in AP/Bridge mode!) are still isolated and cannot see or communicate with each other or the main network.
- Eero in Bridge mode looses a lot of security related functionality (it becomes "greyed out" in the app also.) However, it seems that the guest network can still be enabled from the app. Hopefully that guest network is still isolated, but needs double-checking.
- Some related links:
-
When storing terraform state locally one needs to think about where/how to back it up. An alternative would be to use terraform cloud or opentofu TACOS, which are paid services (Plus your state is stored on someone else's computer, and hence should be encrypted)
- What can be alternatives to storing the state locally?
- Initial provisioning can be done with local state
- Once the cluster is up and running, we can host Atlantis and migrate the state to it.
- As an added benefit, this makes it possible to run terraform from PRs
- Store/commit sops-encrypted state. Run
terraform
with a script/make wrapper that decrypts the state before runningterraform
commands, and re-encrypts it at the end.
- Initial provisioning can be done with local state
- What can be alternatives to storing the state locally?
-
Configure
/etc/hosts
on local controller machine as part ofmetal
provisioning# midgard.local homelab # network devices 10.10.10.1 muspell 10.10.10.2 bifrost # k8s cluster 10.10.10.10 odin 10.10.10.11 freyja 10.10.10.12 heimdall 10.10.10.20 mjolinr 10.10.10.21 gungnir 10.10.10.22 draupnir 10.10.10.23 megingjord 10.10.10.24 hofund 10.10.10.25 gjallarhorn 10.10.10.26 brisingamen # storage devices 10.10.10.30 yggdrasil
-
Configure
~/.ssh/config
on local controller machine as part ofmetal
provisioningHost 10.10.10.* StrictHostKeyChecking no LogLevel ERROR UserKnownHostsFile /dev/null # muspell (C1111 router) in homelab vlan Host 10.10.10.1 muspell User cisco PasswordAuthentication yes # bifrost (C3560 switch) in homelab vlan Host 10.10.10.2 bifrost User cisco PasswordAuthentication yes KexAlgorithms +diffie-hellman-group14-sha1 HostKeyAlgorithms +ssh-rsa # k8s cluster nodes in homelab vlan Host 10.10.10.1* 10.10.10.2* odin freyja heimdall mjolnir draupnir gungnir megingjord hofund brisingamen gjallarhorn User root IdentityFile ~/.ssh/homelab_id_ed25519 StrictHostKeyChecking no LogLevel ERROR UserKnownHostsFile /dev/null GSSAPIAuthentication no # not supported on OS I use today for servers # storage nodes in homelab vlan Host 10.10.10.3* yggdrasil User root IdentityFile ~/.ssh/homelab_id_ed25519 StrictHostKeyChecking no LogLevel ERROR UserKnownHostsFile /dev/null
-
Check if server is up before sending WoL magic packets
-
Ask before proceeding when running
make bootstrap
inmetal
provisioning- The server prefers to boot from Network when woken up, which will erase all data on the disk and re-install the OS
- Ask the user for confirmation before proceeding.
- Mention
make wake
alternative which can be used just to wake up the machines
- Mention
-
Consider restricting ssh access from homelab to router/switch SVI to specific IPs
- [ ] Limit permits to specific IP addresses instead of using! --- Define ACL for traffic FROM Homelab Network --- ip access-list extended ACL_FROM_HOMELAB_NETWORK ! ... ! (Optional: Add permits if Homelab needs to SSH to router's Homelab SVI - local management when e.g. laptop is physically connected to homelab network) 103 remark Permit SSH from Homelab to router's Homelab SVI (local management) 103 permit tcp 10.10.10.0 0.0.0.255 host 10.10.10.1 eq 22 104 remark Permit SSH from Homelab to switch's Homelab SVI (local management) 104 permit tcp 10.10.10.0 0.0.0.255 host 10.10.10.2 eq 22 199 remark --- END --- ! ... exit
10.10.10.0
so that e.g. k8s servers couldn't ssh to Homelab's router or switch -
Provision cisco devices with Ansible
-
Explore Enchanced Power Saving Mode in BIOS
- Newer Lenovo machines support enhanced power saving mode which lowers power consumption during power-off.
- Won't do: WoL is not supported!
-
Configure and document BIOS -> Power -> After Power Loss
- What option is better for my use-cases? Make sure it's configured everywhere and document it.
-
Figure out why dnf is very slow
- References:
- Seems like adding
fastestmirror=True
to/etc/dnf/dnf.conf
helps at least to some degree
-
Setup pi-hole on the cluster