Manage & upgrade a Talos cluster using talosctl
As mentioned in our previous post, Talos is a Linux distribution optimized for containers: a reimagining of Linux that is ideal for distributed systems like Kubernetes. It has been designed to be as minimalistic as possible while remaining practical. In this post, we will demonstrate how to manage Talos through its gRPC API.
Kubectl & talosctl
talosctl is a command-line interface (CLI) tool that interacts with the Talos API in a user-friendly way. It also comes with several helpful options for creating and managing clusters. Before proceeding, we need to install talosctl (and kubectl).
BINARY_DIR="/usr/local/bin"
cd /tmp
# Kubectl
curl -LO "https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x ./kubectl
sudo mv ./kubectl $BINARY_DIR/kubectl
sudo apt-get install bash-completion
echo 'alias k="kubectl"' >>~/.bashrc
echo 'alias kgp="kubectl get pods"' >>~/.bashrc
echo 'alias kgn="kubectl get nodes"' >>~/.bashrc
echo 'alias kga="kubectl get all -A"' >>~/.bashrc
echo 'alias fpods="kubectl get pods -A -o wide | grep -v 1/1 | grep -v 2/2 | grep -v 3/3 | grep -v 4/4 | grep -v 5/5 | grep -v 6/6 | grep -v 7/7 | grep -v Completed"' >>~/.bashrc
echo 'source <(kubectl completion bash)' >>~/.bashrc
echo 'complete -F __start_kubectl k' >>~/.bashrc
source ~/.bashrc
# Talosctl
curl -LO "https://github.com/siderolabs/talos/releases/download/v1.1.2/talosctl-linux-amd64"
chmod +x ./talosctl-linux-amd64
sudo mv talosctl-linux-amd64 /usr/local/bin/talosctl
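talosctl ships a completion subcommand as well (it appears in the help output below), so the same bash-completion setup used for kubectl above can be mirrored for it. The `t` alias is just a convenience of this sketch, not a convention:

```shell
# Shell completion and a short alias for talosctl, mirroring the kubectl setup
echo 'alias t="talosctl"' >>~/.bashrc
echo 'source <(talosctl completion bash)' >>~/.bashrc
source ~/.bashrc
```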
---
infrastructure $kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.3", GitCommit:"aef86a93758dc3cb2c658dd9657ab4ad4afc21cb", GitTreeState:"clean", BuildDate:"2022-07-13T14:30:46Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.2", GitCommit:"f66044f4361b9f1f96f0053dd46cb7dce5e990a8", GitTreeState:"clean", BuildDate:"2022-06-15T14:15:38Z", GoVersion:"go1.18.3", Compiler:"gc", Platform:"linux/amd64"}
infrastructure $talosctl version
Client:
Tag: v1.1.2
SHA: 1db097f5
Built:
Go version: go1.18.4
OS/Arch: linux/amd64
Server:
nodes are not set for the command: please use `--nodes` flag or configuration file to set the nodes to run the command against
After the nodes have been launched with Talos and its complete PKI security suite, that PKI must be used to communicate with the machines. This means configuring the client, which is what the talosconfig file is for. Endpoints are the communication endpoints with which the client communicates directly: load balancers, DNS hostnames, a list of IPs, and so on. In general, it is recommended that these endpoints point to the set of control plane nodes, either directly or through a reverse proxy or load balancer. However, endpoints must also be members of the same Talos cluster as the target nodes, because these proxied connections rely on certificate-based authentication.
export TALOSCONFIG="talosconfig"
# talosconfig is generated with the talosctl gen command, check previous post
talosctl config endpoint 192.168.1.80
# VIP address of the master nodes; to configure the shared IP use
# machine.network.interfaces[].vip.ip in the Talos configuration
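Rather than exporting TALOSCONFIG in every shell, the generated file can be merged into the default client configuration. A sketch using talosctl config subcommands, assuming the control plane IPs of this cluster:

```shell
# Merge the generated talosconfig into the default client config (~/.talos/config)
talosctl config merge ./talosconfig
# Set default nodes so the --nodes flag can be omitted in later commands
talosctl config node 192.168.1.130 192.168.1.131 192.168.1.132
# List the known contexts and show which one is active
talosctl config contexts
```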
Let's see what talosctl can do:
talosctl
A CLI for out-of-band management of Kubernetes nodes created by Talos
Usage:
talosctl [command]
Available Commands:
apply-config Apply a new configuration to a node
bootstrap Bootstrap the etcd cluster on the specified node.
cluster A collection of commands for managing local docker-based or QEMU-based clusters
completion Output shell completion code for the specified shell (bash, fish or zsh)
config Manage the client configuration file (talosconfig)
conformance Run conformance tests
containers List containers
copy Copy data out from the node
dashboard Cluster dashboard with real-time metrics
disks Get the list of disks from /sys/block on the machine
dmesg Retrieve kernel logs
edit Edit a resource from the default editor.
etcd Manage etcd
events Stream runtime events
gen Generate CAs, certificates, and private keys
get Get a specific resource or list of resources.
health Check cluster health
help Help about any command
images List the default images used by Talos
inspect Inspect internals of Talos
kubeconfig Download the admin kubeconfig from the node
list Retrieve a directory listing
logs Retrieve logs for a service
memory Show memory usage
mounts List mounts
patch Update field(s) of a resource using a JSON patch.
pcap Capture the network packets from the node.
processes List running processes
read Read a file on the machine
reboot Reboot a node
reset Reset a node
restart Restart a process
rollback Rollback a node to the previous installation
service Retrieve the state of a service (or all services), control service state
shutdown Shutdown a node
stats Get container stats
support Dump debug information about the cluster
time Gets current server time
upgrade Upgrade Talos on the target node
upgrade-k8s Upgrade Kubernetes control plane in the Talos cluster.
usage Retrieve a disk usage
validate Validate config
version Prints the version
Flags:
--context string Context to be used in command
-e, --endpoints strings override default endpoints in Talos configuration
-h, --help help for talosctl
-n, --nodes strings target the specified nodes
--talosconfig string The path to the Talos configuration file (default "terraform/dev/templates/dev/talosconfig")
Use "talosctl [command] --help" for more information about a command.
Once networking is set up, the kubelet service must be running on control plane nodes. If the kubelet is not running, it may be due to an incorrect configuration. To check the kubelet logs, use the talosctl logs command.
infrastructure $talosctl -n 192.168.1.80 service kubelet
NODE 192.168.1.80
ID kubelet
STATE Running
HEALTH OK
EVENTS [Running]: Health check successful (3h18m51s ago)
[Running]: Started task kubelet (PID 2975) for container kubelet (3h18m52s ago)
[Preparing]: Creating service runner (3h18m53s ago)
[Preparing]: Running pre state (3h19m19s ago)
[Waiting]: Waiting for service "cri" to be "up", time sync, network (3h19m20s ago)
---
infrastructure $talosctl -n 192.168.1.80 logs kubelet
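Beyond dumping the log once, talosctl can follow a service log live, and kernel logs are available too, which helps when the kubelet itself won't start. A sketch against the same node (check `talosctl logs --help` on your version for the exact flags):

```shell
# Stream kubelet logs live instead of dumping them once
talosctl -n 192.168.1.80 logs -f kubelet
# Retrieve kernel logs from the node
talosctl -n 192.168.1.80 dmesg
```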
The reason for etcd not running is usually that the cluster has not been bootstrapped or is in the process of bootstrapping. It is important to manually run the "talosctl bootstrap" command once per cluster, which is a commonly overlooked step. After bootstrapping a node, etcd will start, and within a few minutes (depending on the download speed of the control plane nodes), the other control plane nodes will discover it and join the cluster.
infrastructure $talosctl -n 192.168.1.80 service etcd
NODE 192.168.1.80
ID etcd
STATE Running
HEALTH OK
EVENTS [Running]: Health check successful (3h49m27s ago)
[Running]: Started task etcd (PID 3102) for container etcd (3h49m32s ago)
[Preparing]: Creating service runner (3h49m33s ago)
[Preparing]: Running pre state (3h49m33s ago)
[Waiting]: Waiting for service "cri" to be "up", time sync, network (3h49m33s ago)
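With etcd running, membership can be cross-checked from any control plane node using the `talosctl etcd` subcommands listed in the help output above:

```shell
# List the current etcd cluster members as seen by this node
talosctl -n 192.168.1.80 etcd members
```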
Disaster Recovery
The etcd database is responsible for storing the state of the Kubernetes control plane. If etcd is down, the control plane of Kubernetes will also go down, and the cluster will become unrecoverable until etcd is restored along with its contents. Therefore, it is crucial to regularly create backups of the etcd state, so that in case of a major failure, there is a snapshot available to restore the cluster.
infrastructure $talosctl -n 192.168.1.80 etcd snapshot db.firstbackup
etcd snapshot saved to "db.firstbackup" (7606304 bytes)
snapshot info: hash 57d71f5d, revision 39652, total keys 1836, total size 7606272
infrastructure $du -sh db.firstbackup
7.3M db.firstbackup
This database snapshot can be taken on any healthy control plane node (192.168.1.80 in the example above), as all etcd instances contain exactly the same data.
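Since snapshots are cheap, it is worth automating them. A minimal cron-style sketch; the backup directory, retention count, and file naming are assumptions of this example, not part of Talos:

```shell
# Hypothetical nightly backup sketch: timestamped snapshots with simple retention
BACKUP_DIR="${BACKUP_DIR:-$HOME/etcd-backups}"
KEEP=7
mkdir -p "$BACKUP_DIR"
SNAPSHOT="$BACKUP_DIR/etcd-$(date +%Y%m%d-%H%M%S).db"
talosctl -n 192.168.1.80 etcd snapshot "$SNAPSHOT" || echo "snapshot failed, keeping old backups" >&2
# Retention: keep only the $KEEP most recent snapshots
ls -1t "$BACKUP_DIR"/etcd-*.db 2>/dev/null | tail -n +$((KEEP + 1)) | xargs -r rm --
```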
If a control plane node is healthy but etcd isn't, wipe the node's EPHEMERAL partition to remove the etcd data directory:
talosctl -n 192.168.1.80 reset --graceful=false --reboot --system-labels-to-wipe=EPHEMERAL
At this point, all control plane nodes should boot up, and the etcd service should be in the Preparing state.
talosctl -n 192.168.1.80 bootstrap --recover-from=./db.firstbackup
recovering from snapshot "./db.firstbackup": hash c25fd181, revision 4193, total keys 1287, total size 3035136
After the bootstrap node successfully starts the etcd service, the Kubernetes control plane components will initiate and the control plane endpoint will become accessible. Once the control plane endpoint is available, the remaining control plane nodes will join the etcd cluster.
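After a recovery, overall cluster health can be verified with the health command from the help output above. A sketch assuming the control plane and worker IPs of this dev cluster:

```shell
# Verify etcd, Kubernetes components and node readiness in one pass
talosctl -n 192.168.1.80 health \
  --control-plane-nodes 192.168.1.130,192.168.1.131,192.168.1.132 \
  --worker-nodes 192.168.1.133,192.168.1.134,192.168.1.135,192.168.1.136,192.168.1.137
```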
Upgrading Talos Linux
Currently, in the dev cluster, Talos and Kubernetes are intentionally outdated:
infrastructure $k get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
master-1 Ready control-plane,master 4h11m v1.24.2 192.168.1.130 <none> Talos (v1.1.0) 5.15.48-talos containerd://1.6.6
master-2 Ready control-plane,master 4h11m v1.24.2 192.168.1.131 <none> Talos (v1.1.0) 5.15.48-talos containerd://1.6.6
master-3 Ready control-plane,master 4h11m v1.24.2 192.168.1.132 <none> Talos (v1.1.0) 5.15.48-talos containerd://1.6.6
talos-192-168-1-133 Ready <none> 4h5m v1.24.2 192.168.1.133 <none> Talos (v1.1.0) 5.15.48-talos containerd://1.6.6
talos-192-168-1-134 Ready <none> 4h8m v1.24.2 192.168.1.134 <none> Talos (v1.1.0) 5.15.48-talos containerd://1.6.6
talos-192-168-1-135 Ready <none> 4h6m v1.24.2 192.168.1.135 <none> Talos (v1.1.0) 5.15.48-talos containerd://1.6.6
talos-192-168-1-136 Ready <none> 4h4m v1.24.2 192.168.1.136 <none> Talos (v1.1.0) 5.15.48-talos containerd://1.6.6
talos-192-168-1-137 Ready <none> 4h9m v1.24.2 192.168.1.137 <none> Talos (v1.1.0) 5.15.48-talos containerd://1.6.6
Talos version is set to v1.1.0 and Kubernetes version is v1.24.2. To upgrade a Talos node, specify the node's IP address and the installer container image for the version of Talos to upgrade to: talosctl upgrade --nodes NODE-IP --image ghcr.io/siderolabs/installer:v1.1.2. When a Talos node receives the upgrade command, it cordons itself in Kubernetes to avoid receiving any new workload. It then starts to drain its existing workload.
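The single-node command can be wrapped in a loop to roll the upgrade across all workers. A sketch assuming this cluster's worker IPs and its talos-<dashed-ip> node naming; the sleep and kubectl wait are a simple stand-in for proper readiness checks:

```shell
# Roll the Talos upgrade across the worker nodes, one at a time
WORKERS="192.168.1.133 192.168.1.134 192.168.1.135 192.168.1.136 192.168.1.137"
for node in $WORKERS; do
  talosctl upgrade --nodes "$node" --image ghcr.io/siderolabs/installer:v1.1.2
  sleep 60  # give the node time to cordon, drain and reboot
  # Node names in this cluster follow the talos-<dashed-ip> pattern
  kubectl wait --for=condition=Ready "node/talos-${node//./-}" --timeout=10m
done
```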
infrastructure $talosctl upgrade --nodes 192.168.1.133 --image ghcr.io/siderolabs/installer:v1.1.2
NODE ACK STARTED
192.168.1.133 Upgrade request received 2022-08-09 14:06:27.2982613 +0100 BST m=+15.456476001
---
infrastructure $k get node -o wide | grep 133
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
talos-192-168-1-133 NotReady,SchedulingDisabled <none> 4h11m v1.24.2 192.168.1.133 <none> Talos (v1.1.2) 5.15.57-talos containerd://1.6.6
After the node comes back up and Talos verifies itself, it will make the bootloader change permanent, rejoin the cluster, and finally uncordon itself to receive new workloads.
k get node -o wide | grep 192.168.1.133
talos-192-168-1-133 Ready <none> 4h13m v1.24.2 192.168.1.133 <none> Talos (v1.1.2) 5.15.57-talos containerd://1.6.6
Upgrading Kubernetes
The talosctl upgrade-k8s command will automatically update the components needed to upgrade Kubernetes safely. Upgrading Kubernetes is non-disruptive to the cluster workloads. To trigger a Kubernetes upgrade, issue a command specifying the version of Kubernetes to upgrade to, such as: talosctl --nodes <master node> upgrade-k8s --to 1.24.3. To check what will be upgraded, run talosctl upgrade-k8s with the --dry-run flag:
infrastructure $talosctl --nodes 192.168.1.130 upgrade-k8s --to 1.24.3 --dry-run
automatically detected the lowest Kubernetes version 1.24.2
checking for resource APIs to be deprecated in version 1.24.3
discovered master nodes ["192.168.1.130" "192.168.1.131" "192.168.1.132"]
discovered worker nodes ["192.168.1.133" "192.168.1.134" "192.168.1.135" "192.168.1.136" "192.168.1.137"]
updating "kube-apiserver" to version "1.24.3"
> "192.168.1.130": starting update
> update kube-apiserver: v1.24.2 -> 1.24.3
> skipped in dry-run
> "192.168.1.131": starting update
> update kube-apiserver: v1.24.2 -> 1.24.3
> skipped in dry-run
> "192.168.1.132": starting update
> update kube-apiserver: v1.24.2 -> 1.24.3
> skipped in dry-run
updating "kube-controller-manager" to version "1.24.3"
> "192.168.1.130": starting update
> update kube-controller-manager: v1.24.2 -> 1.24.3
> skipped in dry-run
> "192.168.1.131": starting update
> update kube-controller-manager: v1.24.2 -> 1.24.3
> skipped in dry-run
> "192.168.1.132": starting update
> update kube-controller-manager: v1.24.2 -> 1.24.3
> skipped in dry-run
...
This command runs in several phases:
- Every control plane node's machine configuration is patched with the new image version for each control plane component. Talos renders new static pod definitions on the configuration update, which are picked up by the kubelet. The command waits for the change to propagate to the API server state.
- The command updates the kube-proxy daemonset with the new image version. This step can be ignored in our case because we run Cilium in strict/kube-proxy-free mode.
- On every node in the cluster, the kubelet version is updated. The command then waits for the kubelet service to be restarted and become healthy. The update is verified by checking the Node resource state.
- Kubernetes bootstrap manifests are re-applied to the cluster. Updated bootstrap manifests might come with a new Talos version (e.g. a CoreDNS version update), or might be the result of a machine configuration change. Note: the upgrade-k8s command never deletes any resources from the cluster; they should be deleted manually.
If the command fails for any reason, it can be safely restarted to continue the upgrade process from the moment of the failure.
Full log of cluster upgrade: https://paste.opendev.org/show/bSY2oZvwGo3WvJkpfL8N/
infrastructure $k get node -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
master-1 Ready control-plane,master 4h36m v1.24.3 192.168.1.130 <none> Talos (v1.1.0) 5.15.48-talos containerd://1.6.6
master-2 Ready control-plane,master 4h36m v1.24.3 192.168.1.131 <none> Talos (v1.1.0) 5.15.48-talos containerd://1.6.6
master-3 Ready control-plane,master 4h36m v1.24.3 192.168.1.132 <none> Talos (v1.1.0) 5.15.48-talos containerd://1.6.6
talos-192-168-1-133 Ready <none> 4h31m v1.24.3 192.168.1.133 <none> Talos (v1.1.2) 5.15.57-talos containerd://1.6.6
talos-192-168-1-134 Ready <none> 4h33m v1.24.3 192.168.1.134 <none> Talos (v1.1.0) 5.15.48-talos containerd://1.6.6
talos-192-168-1-135 Ready <none> 4h32m v1.24.3 192.168.1.135 <none> Talos (v1.1.0) 5.15.48-talos containerd://1.6.6
talos-192-168-1-136 Ready <none> 4h29m v1.24.3 192.168.1.136 <none> Talos (v1.1.0) 5.15.48-talos containerd://1.6.6
talos-192-168-1-137 Ready <none> 4h34m v1.24.3 192.168.1.137 <none> Talos (v1.1.0) 5.15.48-talos containerd://1.6.6
The Kubernetes cluster has been successfully updated. Approximate time: ~9 minutes.