The Kubernetes project has the second-highest velocity of any open source project, with only the Linux kernel ahead of it. This reflects the pace of change within Kubernetes and the community ecosystem around it.

The upstream project releases a new minor version of Kubernetes (1.26, 1.27, etc.) every 15 weeks or so, which means customers can typically expect three to four new minor versions per year. With each new version comes a host of upgrade operations: the control plane, the worker nodes, and your applications (if there have been API changes).

Customers that are just starting out with Kubernetes need to be aware of this right at the beginning of their journey: all too often I’ve spoken to customers who are on a version of Kubernetes that’s about to go out of support (or already is), and who are having to rush an upgrade - which is much more likely to lead to mistakes, outages, and negative business impact.

The remainder of this blog post will cover three key areas that any Kubernetes operator will need to become familiar and comfortable with very early on in their Kubernetes journey:

  1. Upgrading the Kubernetes control plane
  2. Upgrading worker nodes
  3. Updating worker nodes

We will also look at how managed Kubernetes services like AKS can take away some of the “heavy lifting” involved.

Background

If you aren’t already aware of the Kubernetes version skew policy, you should go and read that documentation page, as it will help you understand why Kubernetes upgrades are performed the way they are.

In short, the version skew policy says that the most significant components in the cluster (e.g. the kube-apiserver and the kubelet, as well as many others) cannot differ by more than one minor version. It also states that components on the worker nodes (such as the kubelet) cannot be on a newer minor version than the control plane components.

Because of this, it is only possible to upgrade the minor version of your cluster by one version at a time; otherwise, if you upgrade the control plane by 2 or more versions, the nodes (still running on the original version) would break the version skew policy, and may not be able to interact with the cluster control plane correctly.

Upgrading the Kubernetes Control Plane

AKS makes upgrading your control plane minor version relatively simple, although it is not completely without risk. Occasionally, the Kubernetes project will graduate API versions (e.g. from alpha to beta, or beta to stable) as the functionality matures, or may deprecate functionality altogether (such as Pod Security Policies, which were deprecated in 1.21 and removed in 1.25). If your workloads depend on API endpoints that change, and you don’t update your workload definitions (Kubernetes manifest files) accordingly, your applications will break after the upgrade. It is therefore important to always test your Kubernetes manifests against a new version of Kubernetes before upgrading your production environment.

Once you are confident that your applications are compatible with the new version, initiating a cluster upgrade is straightforward: you can click the “Upgrade version” link on the “Cluster configuration” blade in the Azure portal, or you can use the Azure CLI to check and upgrade the control plane.
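For example, a minimal Azure CLI sketch for checking the available upgrades and then upgrading only the control plane could look like the following (the resource group, cluster name, and version number are placeholders):

    # List the versions this cluster can be upgraded to
    az aks get-upgrades \
      --resource-group myResourceGroup \
      --name myAKSCluster \
      --output table

    # Upgrade only the control plane to a specific version;
    # node pools can then be upgraded separately
    az aks upgrade \
      --resource-group myResourceGroup \
      --name myAKSCluster \
      --kubernetes-version 1.27.3 \
      --control-plane-only

If you omit the --control-plane-only flag, the same operation will also upgrade all node pools in the cluster.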

AKS maintains two patch versions for each minor version (e.g. 1.25.3 and 1.25.6, 1.26.0 and 1.26.3, etc.). During an upgrade, customers can choose either patch level, although we always recommend taking the latest unless you have a specific reason not to.
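To see which minor and patch versions are currently available to you, you can list them per region with the Azure CLI (the region below is only an example):

    # List the Kubernetes versions AKS currently offers in a region
    az aks get-versions --location uksouth --output table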

After upgrading the control plane, you will need to upgrade all worker nodes to the same minor version - more on that below.

AKS has a new feature, currently in public preview, that allows customers to set maintenance windows, during which AKS cluster upgrade operations will be carried out. Without maintenance windows, cluster upgrade operations will happen as soon as you initiate them. By using maintenance windows, you can configure an upgrade operation at any time, but it will only be carried out during the selected window.
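As a sketch of what this looks like (the names and times are placeholders, and the exact syntax may change while the feature is in preview), a weekly maintenance window can be created with the Azure CLI:

    # Allow planned maintenance in the window starting Mondays at 01:00
    az aks maintenanceconfiguration add \
      --resource-group myResourceGroup \
      --cluster-name myAKSCluster \
      --name default \
      --weekday Monday \
      --start-hour 1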

Upgrading worker nodes

Once your control plane has been upgraded, it’s time to upgrade your worker nodes, or node pools, to the same minor version of Kubernetes. The main concern when upgrading your worker nodes is to avoid interrupting the workloads running on them, and Kubernetes can handle much of this for you.

In AKS, customers select their base image when creating their node pools (e.g. Ubuntu 22.04, Azure Linux, Windows Server 2022), and so to upgrade the minor version of their worker nodes, they simply update the image to the latest one available for the minor version of Kubernetes running on their control plane.

Usually, when you upgrade your AKS control plane, it will upgrade the node pools automatically for you. However, if your cluster has “Automatic upgrade” set to Disabled, or you otherwise chose not to upgrade the node pools, you will have to upgrade them manually.
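A minimal sketch of a manual node pool upgrade with the Azure CLI (the names and version number are placeholders):

    # Upgrade one node pool to match the control plane's minor version
    az aks nodepool upgrade \
      --resource-group myResourceGroup \
      --cluster-name myAKSCluster \
      --name nodepool1 \
      --kubernetes-version 1.27.3

    # Or refresh only the node image, without changing the Kubernetes version
    az aks nodepool upgrade \
      --resource-group myResourceGroup \
      --cluster-name myAKSCluster \
      --name nodepool1 \
      --node-image-only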

Once the node pool image has been updated, AKS initiates a rolling update of your nodes. As part of this process, an additional (surge) node running the latest node image is added to the cluster, then an existing node is cordoned and drained to remove all running pods. The drained node is re-imaged to take the latest image, and once it is back and showing as Ready in the cluster, the next node is cordoned and drained. The process repeats until all nodes have been updated in this manner, at which point the extra node that was created at the beginning is removed.
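The number of additional (surge) nodes used during an upgrade is configurable per node pool, which lets you trade upgrade speed against extra cost. As a sketch (the node pool name is a placeholder):

    # Allow up to a third of the node pool to be surged during upgrades
    az aks nodepool update \
      --resource-group myResourceGroup \
      --cluster-name myAKSCluster \
      --name nodepool1 \
      --max-surge 33%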

Please note that you cannot perform scaling and upgrade operations at the same time, so it is worth considering when to perform your upgrades (to avoid peak traffic) and whether you have enough spare capacity to handle any unexpected increases in request rates during the upgrade.

If you are running any pods with stateful workloads on them, you should consider adding pod disruption budgets, which can prevent these pods from being evicted automatically during a drain and will hence block the automatic reboot of the nodes running them. Note that you will then have to move those pods manually during the node upgrade process, otherwise the upgrade operation will fail.
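As a minimal sketch (the budget name and app label are hypothetical), a pod disruption budget that blocks all voluntary evictions of a stateful workload could be created like this:

    # Block voluntary evictions (such as node drains) for pods labelled app=my-stateful-app
    kubectl create poddisruptionbudget my-stateful-app-pdb \
      --selector=app=my-stateful-app \
      --max-unavailable=0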

Azure updates the AKS node images on a regular basis to incorporate the latest security patches and bug fixes. Linux images (Ubuntu and Azure Linux) are updated weekly, and Windows images (2019 and 2022) are updated monthly. These updates are documented in the weekly AKS release notes on GitHub. If you are responsible for operating or maintaining one or more AKS clusters, I’d strongly recommend subscribing to releases on this repository.

Updating worker nodes

Keeping your node pool base images up-to-date is important because any new nodes added to your cluster by scaling events then have the latest security patches and bug fixes pre-installed. However, critical security issues are sometimes announced that need to be patched immediately, and you can’t always wait for the weekly image update.

All AKS Linux images are pre-configured to check for OS-level updates every day, and to install any available security or kernel updates. Many of these updates take effect immediately after installation, but some (such as kernel updates) require a node reboot to complete. In that case, a file is automatically written to /var/run/reboot-required. The node will not be rebooted automatically, to avoid disrupting any workloads running in the cluster.
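If you want to check whether a particular node is waiting on a reboot, one option (a sketch, assuming you are able to run kubectl debug against your nodes; the node name is a placeholder) is to look for the marker file from a debug pod, which mounts the node’s filesystem under /host:

    # Check for the reboot-required marker file on a specific node
    kubectl debug node/aks-nodepool1-12345678-vmss000000 -it \
      --image=busybox -- ls /host/var/run/reboot-required

Remember to delete the node-debugger pod this creates once you are done.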

Customers can implement a custom workflow to handle these reboots, or they can use the open source Kubernetes Reboot Daemon (kured) to handle it for them. With kured, a DaemonSet runs a pod on each node that watches for the presence of the /var/run/reboot-required file and automatically handles the rescheduling of any running pods and the node reboot process.

kured integrates with the Kubernetes API to cordon and drain nodes that need to be rebooted, to avoid any disruption to your running workloads. It also schedules nodes to reboot one at a time, minimising disruption to normal cluster operations.
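A common way to install kured is via its Helm chart; as a rough sketch (the chart repository URL, release name, and namespace below are assumptions worth checking against the kured documentation):

    # Add the kured Helm repository and install the DaemonSet into kube-system
    helm repo add kubereboot https://kubereboot.github.io/charts
    helm repo update
    helm install kured kubereboot/kured --namespace kube-system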

Blue-Green Upgrades

Although I won’t dive deep into using blue-green deployments to upgrade your Kubernetes clusters in this post, I thought it would be useful to outline some of the pros and cons of blue-green against in-place upgrades.

Control Plane Blue-Green Upgrades

For the control plane upgrades, the pros include:

  • You can skip multiple minor versions in one jump, as you are simply creating a new cluster and deploying your workloads into it
  • You don’t need to have different “upgrade” operations for upgrading one minor version, or more than one, as blue-green works for both
  • You can fully test your workloads running on the new minor version of Kubernetes in the new cluster before sending any “live” traffic to it
  • You have a very fast and easy rollback mechanism, in case any issues are detected after switching over your “live” traffic

Of course, there are some cons too, which for the control plane include:

  • It costs more (usually double) as you’re running two production-sized clusters for the duration of the blue-green cutover. However, this is normally quite a short period of time, so is easily managed
  • If you run services that maintain state within your clusters (e.g. Redis, SQL databases, etc.), this approach is much more difficult, as you have to migrate the state as well. This often involves maintenance windows or application downtime to prevent data getting lost during the cutover
  • You need to have some mechanism at the “front door” of your production environment that allows you to move all traffic from one cluster to another in a single operation (and monitor that activity too)

Node Blue-Green Upgrades

For the cluster node upgrades, there are two ways of performing the blue-green deployment:

  1. Create a new node pool on the new AKS image version, with the same number of nodes as the existing node pool
  2. Update the AKS image version in the existing node pool, and double the number of nodes (new nodes will be created using the new AKS image)

If you were to go down this route, I would advocate for the first method, as with the second method you can’t guarantee that “old” nodes will stay on the “old” AKS image version (e.g. if a node is re-imaged or replaced for any reason, it will come back on the new image).
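A rough sketch of the first method using the Azure CLI and kubectl (pool names, node counts, and version numbers are placeholders, and the agentpool node label is assumed to match your pool names):

    # Create a new "green" node pool on the new Kubernetes version
    az aks nodepool add \
      --resource-group myResourceGroup \
      --cluster-name myAKSCluster \
      --name greenpool \
      --node-count 3 \
      --kubernetes-version 1.27.3

    # Once workloads have been validated on the new pool, cordon and drain
    # the old "blue" nodes so pods reschedule onto the new pool
    for node in $(kubectl get nodes -l agentpool=bluepool -o name); do
      kubectl cordon "$node"
      kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
    done

Once you are satisfied that everything is running well on the new pool, the old node pool can be deleted with az aks nodepool delete; rolling back is simply a case of uncordoning the old nodes and draining the new ones instead.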

Compared to using blue-green deployments as an upgrade mechanism for the whole cluster, the benefits of blue-green deployments for nodes are reduced, but they do still include:

  • Provisioning new nodes ahead of time, minimising the time to switch from old to new nodes
  • Allowing some functional / smoke testing of workloads running on new nodes prior to switching over production traffic
  • As with the control plane, you have a fast and easy rollback mechanism, where you would simply cordon and drain the “new” nodes, and reschedule all pods back to the “old” nodes

And the cons for node upgrades in this way include:

  • Higher costs while you’re running double the number of nodes and/or node pools
  • Complex to migrate stateful workloads between node pools, with limited benefits compared to blue-green deployments at the cluster level

Conclusion

Keeping your Kubernetes cluster up-to-date is an ongoing and non-trivial process. If anything about this scares or concerns you, you may want to consider alternative container hosting solutions (e.g. Azure Container Apps or Azure App Service - Web App for Containers).

For those of you that are committed to using Kubernetes and staying current with versions as they are released, Azure provides numerous tools in AKS to help make the challenge a little easier. Hopefully this blog post has provided you with some useful guidance on how to do so.

For many customers, the AKS-provided tooling that helps orchestrate in-place upgrades and updates will be more than sufficient. Where it is not, blue-green upgrades at the cluster (or node pool) level may provide additional benefits or flexibility.

Happy upgrading!