Manage & upgrade a Talos cluster using talosctl

As mentioned in our previous post, Talos is a Linux distribution optimized for containers; a reimagining of Linux that is ideal for distributed systems like Kubernetes. It has been designed to be as minimal as possible while remaining practical. In this post, we will demonstrate how to manage Talos through its gRPC API using talosctl.

Kubectl & talosctl

talosctl is a command-line interface (CLI) tool that interacts with the Talos API in a user-friendly way. It also ships with several helpful options for creating and managing clusters. Before proceeding, we need to install kubectl and talosctl.

BINARY_DIR="/usr/local/bin"
cd /tmp
# Kubectl
curl -LO "https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x ./kubectl
sudo mv ./kubectl $BINARY_DIR/kubectl
sudo apt-get install bash-completion
echo 'alias k="kubectl"' >>~/.bashrc
echo 'alias kgp="kubectl get pods"' >>~/.bashrc
echo 'alias kgn="kubectl get nodes"' >>~/.bashrc
echo 'alias kga="kubectl get all -A"' >>~/.bashrc
echo 'alias fpods="kubectl get pods -A -o wide | grep -v 1/1 | grep -v 2/2 | grep -v 3/3 | grep -v 4/4 | grep -v 5/5 | grep -v 6/6 | grep -v 7/7 | grep -v Completed"' >>~/.bashrc
echo 'source <(kubectl completion bash)' >>~/.bashrc
echo 'complete -F __start_kubectl k' >>~/.bashrc
source ~/.bashrc

# Talosctl
curl -LO "https://github.com/siderolabs/talos/releases/download/v1.1.2/talosctl-linux-amd64"
chmod +x ./talosctl-linux-amd64
sudo mv ./talosctl-linux-amd64 $BINARY_DIR/talosctl

---
infrastructure $kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.3", GitCommit:"aef86a93758dc3cb2c658dd9657ab4ad4afc21cb", GitTreeState:"clean", BuildDate:"2022-07-13T14:30:46Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.2", GitCommit:"f66044f4361b9f1f96f0053dd46cb7dce5e990a8", GitTreeState:"clean", BuildDate:"2022-06-15T14:15:38Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"}
infrastructure $talosctl version
Client:
        Tag:         v1.1.2
        SHA:         1db097f5
        Built:
        Go version:  go1.18.4
        OS/Arch:     linux/amd64
Server:
nodes are not set for the command: please use `--nodes` flag or configuration file to set the nodes to run the command against

After the nodes have been launched with Talos and its full PKI security suite, that PKI must be used to communicate with the machines. This means configuring the client, which is what the talosconfig file is for. Endpoints are the addresses the client communicates with directly: load balancers, DNS hostnames, a list of IPs, and so on. In general, it is recommended that these endpoints point at the set of control plane nodes, either directly or through a reverse proxy or load balancer. Note that the endpoints must be members of the same Talos cluster as the target node, because these proxied connections rely on certificate-based authentication.

  • export TALOSCONFIG="talosconfig" # talosconfig is generated with talosctl gen command, check previous post
  • talosctl config endpoint 192.168.1.80 # vip address of master nodes, to configure the shared IP use (machine.network.interfaces[].vip.ip) in the Talos configuration.
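Putting this together, a minimal client setup might look like the following sketch (the control plane and worker IPs are taken from the node listing later in this post; adjust them to your environment):

```shell
# Point talosctl at the generated client configuration (assumes talosconfig
# is in the current directory, as produced by `talosctl gen` earlier)
export TALOSCONFIG="$(pwd)/talosconfig"

# Endpoints: where the client connects (here, the VIP in front of the control plane)
talosctl config endpoint 192.168.1.80

# Default nodes for commands (still overridable per-command with -n/--nodes)
talosctl config node 192.168.1.130 192.168.1.131 192.168.1.132

# Verify the client can reach the Talos API
talosctl version
```

With nodes set in the configuration, the "nodes are not set" error shown above goes away.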

Let's take a look at what talosctl offers:

talosctl
A CLI for out-of-band management of Kubernetes nodes created by Talos

Usage:
  talosctl [command]

Available Commands:
  apply-config        Apply a new configuration to a node
  bootstrap           Bootstrap the etcd cluster on the specified node.
  cluster             A collection of commands for managing local docker-based or QEMU-based clusters
  completion          Output shell completion code for the specified shell (bash, fish or zsh)
  config              Manage the client configuration file (talosconfig)
  conformance         Run conformance tests
  containers          List containers
  copy                Copy data out from the node
  dashboard           Cluster dashboard with real-time metrics
  disks               Get the list of disks from /sys/block on the machine
  dmesg               Retrieve kernel logs
  edit                Edit a resource from the default editor.
  etcd                Manage etcd
  events              Stream runtime events
  gen                 Generate CAs, certificates, and private keys
  get                 Get a specific resource or list of resources.
  health              Check cluster health
  help                Help about any command
  images              List the default images used by Talos
  inspect             Inspect internals of Talos
  kubeconfig          Download the admin kubeconfig from the node
  list                Retrieve a directory listing
  logs                Retrieve logs for a service
  memory              Show memory usage
  mounts              List mounts
  patch               Update field(s) of a resource using a JSON patch.
  pcap                Capture the network packets from the node.
  processes           List running processes
  read                Read a file on the machine
  reboot              Reboot a node
  reset               Reset a node
  restart             Restart a process
  rollback            Rollback a node to the previous installation
  service             Retrieve the state of a service (or all services), control service state
  shutdown            Shutdown a node
  stats               Get container stats
  support             Dump debug information about the cluster
  time                Gets current server time
  upgrade             Upgrade Talos on the target node
  upgrade-k8s         Upgrade Kubernetes control plane in the Talos cluster.
  usage               Retrieve a disk usage
  validate            Validate config
  version             Prints the version

Flags:
      --context string       Context to be used in command
  -e, --endpoints strings    override default endpoints in Talos configuration
  -h, --help                 help for talosctl
  -n, --nodes strings        target the specified nodes
      --talosconfig string   The path to the Talos configuration file (default "terraform/dev/templates/dev/talosconfig")

Use "talosctl [command] --help" for more information about a command.

Once networking is set up, the kubelet service must be running on control plane nodes. If the kubelet is not running, it may be due to an incorrect configuration. To check the kubelet logs, use the talosctl logs command.

infrastructure $talosctl -n 192.168.1.80 service kubelet
NODE     192.168.1.80
ID       kubelet
STATE    Running
HEALTH   OK
EVENTS   [Running]: Health check successful (3h18m51s ago)
         [Running]: Started task kubelet (PID 2975) for container kubelet (3h18m52s ago)
         [Preparing]: Creating service runner (3h18m53s ago)
         [Preparing]: Running pre state (3h19m19s ago)
         [Waiting]: Waiting for service "cri" to be "up", time sync, network (3h19m20s ago)
---
infrastructure $talosctl -n 192.168.1.80 logs kubelet

The reason for etcd not running is usually that the cluster has not been bootstrapped or is in the process of bootstrapping. It is important to manually run the "talosctl bootstrap" command once per cluster, which is a commonly overlooked step. After bootstrapping a node, etcd will start, and within a few minutes (depending on the download speed of the control plane nodes), the other control plane nodes will discover it and join the cluster.
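For reference, the bootstrap step itself is a single call against exactly one control plane address (192.168.1.80 is the VIP used throughout this post):

```shell
NODE="192.168.1.80"   # one control plane address (the VIP in this post's setup)

# Run exactly once per cluster
talosctl -n "$NODE" bootstrap

# Then watch etcd come up
talosctl -n "$NODE" service etcd
```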

infrastructure $talosctl -n 192.168.1.80 service etcd
NODE     192.168.1.80
ID       etcd
STATE    Running
HEALTH   OK
EVENTS   [Running]: Health check successful (3h49m27s ago)
         [Running]: Started task etcd (PID 3102) for container etcd (3h49m32s ago)
         [Preparing]: Creating service runner (3h49m33s ago)
         [Preparing]: Running pre state (3h49m33s ago)
         [Waiting]: Waiting for service "cri" to be "up", time sync, network (3h49m33s ago)

Disaster Recovery


The etcd database is responsible for storing the state of the Kubernetes control plane. If etcd is down, the control plane of Kubernetes will also go down, and the cluster will become unrecoverable until etcd is restored along with its contents. Therefore, it is crucial to regularly create backups of the etcd state, so that in case of a major failure, there is a snapshot available to restore the cluster.

infrastructure $talosctl -n 192.168.1.80 etcd snapshot db.firstbackup
etcd snapshot saved to "db.firstbackup" (7606304 bytes)
snapshot info: hash 57d71f5d, revision 39652, total keys 1836, total size 7606272

infrastructure $du -sh db.firstbackup
7.3M    db.firstbackup

This database snapshot can be taken on any healthy control plane node (192.168.1.80 in the example above), as all etcd instances contain exactly the same data.
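Since a single stale snapshot is of limited use in a real disaster, backups are usually taken on a schedule. A minimal sketch that could be run from cron (the paths and the retention count of 7 are assumptions, not from this post):

```shell
# Sketch: timestamped etcd snapshots with simple retention
BACKUP_DIR="${HOME}/etcd-backups"   # assumed location; pick durable storage in practice
NODE="192.168.1.80"
mkdir -p "$BACKUP_DIR"

# Take a snapshot named after the current date and time
SNAPSHOT="${BACKUP_DIR}/etcd-$(date +%Y%m%d-%H%M%S).db"
talosctl -n "$NODE" etcd snapshot "$SNAPSHOT"

# Keep only the 7 most recent snapshots
ls -1t "${BACKUP_DIR}"/etcd-*.db 2>/dev/null | tail -n +8 | xargs -r rm --
```

Ideally the snapshots would also be copied off-cluster, since a backup stored next to the thing it backs up shares its failure domain.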

If a control plane node is healthy but etcd isn’t, wipe the node’s EPHEMERAL partition to remove the etcd data directory:

talosctl -n 192.168.1.80 reset --graceful=false --reboot --system-labels-to-wipe=EPHEMERAL

At this point, all control plane nodes should boot up, and the etcd service should be in the Preparing state.


talosctl -n 192.168.1.80 bootstrap --recover-from=./db.firstbackup
recovering from snapshot "./db.firstbackup": hash c25fd181, revision 4193, total keys 1287, total size 3035136

After the bootstrap node successfully starts the etcd service, the Kubernetes control plane components will initiate and the control plane endpoint will become accessible. Once the control plane endpoint is available, the remaining control plane nodes will join the etcd cluster.
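To confirm the cluster has fully recovered, talosctl ships a health check that verifies etcd, the control plane components, and node readiness (the node IPs below are from this post's cluster):

```shell
CONTROL_PLANE_NODES="192.168.1.130,192.168.1.131,192.168.1.132"
WORKER_NODES="192.168.1.133,192.168.1.134,192.168.1.135,192.168.1.136,192.168.1.137"

# Waits until all checks pass or the command times out
talosctl -n 192.168.1.80 health \
  --control-plane-nodes "$CONTROL_PLANE_NODES" \
  --worker-nodes "$WORKER_NODES"
```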

Upgrading Talos Linux


Currently, in the dev cluster, both Talos and Kubernetes are intentionally outdated:

infrastructure $k get node -o wide
NAME                  STATUS   ROLES                  AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION   CONTAINER-RUNTIME
master-1              Ready    control-plane,master   4h11m   v1.24.2   192.168.1.130   <none>        Talos (v1.1.0)   5.15.48-talos    containerd://1.6.6
master-2              Ready    control-plane,master   4h11m   v1.24.2   192.168.1.131   <none>        Talos (v1.1.0)   5.15.48-talos    containerd://1.6.6
master-3              Ready    control-plane,master   4h11m   v1.24.2   192.168.1.132   <none>        Talos (v1.1.0)   5.15.48-talos    containerd://1.6.6
talos-192-168-1-133   Ready    <none>                 4h5m    v1.24.2   192.168.1.133   <none>        Talos (v1.1.0)   5.15.48-talos    containerd://1.6.6
talos-192-168-1-134   Ready    <none>                 4h8m    v1.24.2   192.168.1.134   <none>        Talos (v1.1.0)   5.15.48-talos    containerd://1.6.6
talos-192-168-1-135   Ready    <none>                 4h6m    v1.24.2   192.168.1.135   <none>        Talos (v1.1.0)   5.15.48-talos    containerd://1.6.6
talos-192-168-1-136   Ready    <none>                 4h4m    v1.24.2   192.168.1.136   <none>        Talos (v1.1.0)   5.15.48-talos    containerd://1.6.6
talos-192-168-1-137   Ready    <none>                 4h9m    v1.24.2   192.168.1.137   <none>        Talos (v1.1.0)   5.15.48-talos    containerd://1.6.6

The Talos version is v1.1.0 and the Kubernetes version is v1.24.2. To upgrade a Talos node, specify the node's IP address and the installer container image for the target Talos version: talosctl upgrade --nodes NODE-IP --image ghcr.io/siderolabs/installer:v1.1.2. When a Talos node receives the upgrade command, it cordons itself in Kubernetes to avoid receiving any new workloads, and then starts draining its existing workloads.

infrastructure $talosctl upgrade --nodes 192.168.1.133 --image ghcr.io/siderolabs/installer:v1.1.2
NODE            ACK                        STARTED
192.168.1.133   Upgrade request received   2022-08-09 14:06:27.2982613 +0100 BST m=+15.456476001
---
infrastructure $k get node -o wide | grep 133
NAME                  STATUS                        ROLES                  AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION   CONTAINER-RUNTIME
talos-192-168-1-133   NotReady,SchedulingDisabled   <none>                 4h11m   v1.24.2   192.168.1.133   <none>        Talos (v1.1.2)   5.15.57-talos    containerd://1.6.6

After the node comes back up and Talos verifies itself, it will make the bootloader change permanent, rejoin the cluster, and finally uncordon itself to receive new workloads.

k get node -o wide | grep 192.168.1.133
talos-192-168-1-133   Ready    <none>                 4h13m   v1.24.2   192.168.1.133   <none>        Talos (v1.1.2)   5.15.57-talos    containerd://1.6.6
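The remaining nodes can be upgraded the same way, one at a time. A sketch of a sequential rollout over the other workers (the --wait flag, which blocks until a node's upgrade completes, is assumed to be available in this talosctl version):

```shell
IMAGE="ghcr.io/siderolabs/installer:v1.1.2"

# Upgrade the remaining workers one by one to keep cluster capacity available
for node in 192.168.1.134 192.168.1.135 192.168.1.136 192.168.1.137; do
  talosctl upgrade --nodes "$node" --image "$IMAGE" --wait
done
```

Control plane nodes follow the same procedure; upgrading them sequentially keeps etcd quorum intact.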

Upgrading Kubernetes


The talosctl upgrade-k8s command automatically updates the components needed to upgrade Kubernetes safely. Upgrading Kubernetes is non-disruptive to cluster workloads. To trigger a Kubernetes upgrade, issue a command specifying the version of Kubernetes to upgrade to, such as: talosctl --nodes <master node> upgrade-k8s --to 1.24.3. To check what will be upgraded, run talosctl upgrade-k8s with the --dry-run flag:

infrastructure $talosctl --nodes 192.168.1.130 upgrade-k8s --to 1.24.3 --dry-run
automatically detected the lowest Kubernetes version 1.24.2
checking for resource APIs to be deprecated in version 1.24.3
discovered master nodes ["192.168.1.130" "192.168.1.131" "192.168.1.132"]
discovered worker nodes ["192.168.1.133" "192.168.1.134" "192.168.1.135" "192.168.1.136" "192.168.1.137"]
updating "kube-apiserver" to version "1.24.3"
 > "192.168.1.130": starting update
 > update kube-apiserver: v1.24.2 -> 1.24.3
 > skipped in dry-run
 > "192.168.1.131": starting update
 > update kube-apiserver: v1.24.2 -> 1.24.3
 > skipped in dry-run
 > "192.168.1.132": starting update
 > update kube-apiserver: v1.24.2 -> 1.24.3
 > skipped in dry-run
updating "kube-controller-manager" to version "1.24.3"
 > "192.168.1.130": starting update
 > update kube-controller-manager: v1.24.2 -> 1.24.3
 > skipped in dry-run
 > "192.168.1.131": starting update
 > update kube-controller-manager: v1.24.2 -> 1.24.3
 > skipped in dry-run
 > "192.168.1.132": starting update
 > update kube-controller-manager: v1.24.2 -> 1.24.3
 > skipped in dry-run
...

This command runs in several phases:

  1. Every control plane node machine configuration is patched with the new image version for each control plane component. Talos renders new static pod definitions on the configuration update, which are picked up by the kubelet. The command waits for the change to propagate to the API server state.
  2. The command updates the kube-proxy daemonset with the new image version. This step can be ignored in our case, however, because we run Cilium in strict (kube-proxy free) mode.
  3. On every node in the cluster, the kubelet version is updated. The command then waits for the kubelet service to be restarted and become healthy. The update is verified by checking the Node resource state.
  4. Kubernetes bootstrap manifests are re-applied to the cluster. Updated bootstrap manifests might come with a new Talos version (e.g. CoreDNS version update), or might be the result of machine configuration change. Note: The upgrade-k8s command never deletes any resources from the cluster: they should be deleted manually.

If the command fails for any reason, it can be safely restarted to continue the upgrade process from the moment of the failure.
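Once the dry run looks sane, the same invocation without --dry-run performs the actual upgrade:

```shell
TO_VERSION="1.24.3"

# Same command as the dry run above, now applying the changes
talosctl --nodes 192.168.1.130 upgrade-k8s --to "$TO_VERSION"
```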

Full log of cluster upgrade: https://paste.opendev.org/show/bSY2oZvwGo3WvJkpfL8N/

infrastructure $k get node -o wide
NAME                  STATUS   ROLES                  AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE         KERNEL-VERSION   CONTAINER-RUNTIME
master-1              Ready    control-plane,master   4h36m   v1.24.3   192.168.1.130   <none>        Talos (v1.1.0)   5.15.48-talos    containerd://1.6.6
master-2              Ready    control-plane,master   4h36m   v1.24.3   192.168.1.131   <none>        Talos (v1.1.0)   5.15.48-talos    containerd://1.6.6
master-3              Ready    control-plane,master   4h36m   v1.24.3   192.168.1.132   <none>        Talos (v1.1.0)   5.15.48-talos    containerd://1.6.6
talos-192-168-1-133   Ready    <none>                 4h31m   v1.24.3   192.168.1.133   <none>        Talos (v1.1.2)   5.15.57-talos    containerd://1.6.6
talos-192-168-1-134   Ready    <none>                 4h33m   v1.24.3   192.168.1.134   <none>        Talos (v1.1.0)   5.15.48-talos    containerd://1.6.6
talos-192-168-1-135   Ready    <none>                 4h32m   v1.24.3   192.168.1.135   <none>        Talos (v1.1.0)   5.15.48-talos    containerd://1.6.6
talos-192-168-1-136   Ready    <none>                 4h29m   v1.24.3   192.168.1.136   <none>        Talos (v1.1.0)   5.15.48-talos    containerd://1.6.6
talos-192-168-1-137   Ready    <none>                 4h34m   v1.24.3   192.168.1.137   <none>        Talos (v1.1.0)   5.15.48-talos    containerd://1.6.6

The Kubernetes cluster has been successfully upgraded. Approximate time: ~9 minutes.