Upgrading EKS Clusters!

Let me set the scene: 12 EKS clusters, spread across different clients, running on Kubernetes 1.22, and a mountain of complexity waiting to explode. This is the story of how I turned a potential nightmare into a surgical-precision upgrade operation.

In previous posts I talked about the managed Kubernetes services that span all major cloud providers. They are a big win for engineers who run and manage Kubernetes workloads, because upgrades tend to be reasonably quick and safe, whereas running your own vanilla Kubernetes distribution across multiple virtual machines can be a real nightmare!


The Initial Landscape: A Kubernetes Pressure Cooker

Imagine inheriting a system with:

  • 12 uniquely configured EKS clusters

  • Clients with varying levels of tolerance for downtime

  • A Kubernetes version that was rapidly approaching end of support (this was the main driver: AWS charges extended-support pricing for older versions precisely to push us, the customers, onto newer ones)

  • A mix of critical and non-critical workloads

Spoiler alert: this wasn’t going to be a simple kubectl upgrade situation or a terraform apply -auto-approve out of the box. 🚨


The Battle Plan: Terraform, Communication, and Nerves of Steel

Step 1: Reconnaissance and Mapping

First, I created a comprehensive inventory:

locals {
  cluster_upgrade_map = {
    "client-a-prod"     = { priority = "high", maintenance_window = "off-peak" }
    "client-b-staging"  = { priority = "medium", maintenance_window = "weekend" }
    "client-c-dev"      = { priority = "low", maintenance_window = "anytime" }
    # ... and 9 more clusters
  }
}

The Terraform Upgrade Module: Our Secret Weapon

resource "aws_eks_cluster" "upgraded_cluster" {
  # Base cluster configuration
  name     = var.cluster_name
  version  = "1.29"  # Target Kubernetes version

  # Controlled, phased upgrade strategy
  upgrade_policy {
    max_unavailable_percentage = 33%
    drain_strategy             = "carefully"
  }

  # Node group version alignment
  node_group_version = "1.29"
}

The Communication Symphony

Client Coordination: Like Conducting an Orchestra 🎼

  • Created detailed upgrade runbooks for each client

  • Mapped out exact maintenance windows

  • Prepared rollback plans (always have a Plan B!)


Communication Template

Dear [Client],

Upcoming Kubernetes Upgrade Details:
- Current Version: 1.22
- Target Version: 1.29
- Estimated Downtime: 15-30 minutes
- Maintenance Window: [Specific Date/Time]
- Rollback Capability: ✅ Fully Prepared

The Upgrade Phases: A Strategic Ballet

Phase 1: Non-Critical Clusters

  • Started with development and staging environments that ran no critical workloads.

  • Validated the upgrade procedure on test clusters first. This was key to defining the best approach and trying the changes against somewhat realistic environments.

  • Collected initial learnings and potential pitfalls; see the sketch after this list for one way to limit an apply to just these clusters.
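
One way to keep Phase 1 from accidentally touching anything important is to derive the phase from the same inventory map and let Terraform's for_each do the filtering. A minimal sketch, reusing the hypothetical my-org/eks-upgrade/aws module shown later in this post:

locals {
  # Phase 1: only the clusters flagged as low priority (dev and test)
  phase_1_clusters = {
    for name, cfg in local.cluster_upgrade_map : name => cfg
    if cfg.priority == "low"
  }
}

module "phase_1_upgrades" {
  source  = "my-org/eks-upgrade/aws" # hypothetical internal module
  version = "1.0.0"

  for_each = local.phase_1_clusters

  cluster_name       = each.key
  target_version     = "1.29"
  priority           = each.value.priority
  maintenance_window = each.value.maintenance_window
}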


Phase 2: Staging and Pre-Prod

  • Applied learnings from Phase 1

  • More rigorous testing and monitoring in place, which consisted of watching the existing workloads in each cluster and how they behaved once new nodes came up on the new Kubernetes version.

  • Increased monitoring and alert sensitivity; one way to wire that up in Terraform is sketched after this list.
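
As an illustration of what that extra sensitivity can look like in Terraform, here is a minimal sketch of a CloudWatch alarm on failed node counts. It assumes Container Insights is enabled on the cluster and that an SNS topic for alerts already exists; the variable names are placeholders, not part of the original rollout:

resource "aws_cloudwatch_metric_alarm" "failed_nodes_during_upgrade" {
  alarm_name          = "${var.cluster_name}-failed-nodes-during-upgrade"
  namespace           = "ContainerInsights" # requires Container Insights to be enabled
  metric_name         = "cluster_failed_node_count"
  dimensions          = { ClusterName = var.cluster_name }
  statistic           = "Maximum"
  period              = 60
  evaluation_periods  = 3
  threshold           = 0
  comparison_operator = "GreaterThanThreshold"
  alarm_description   = "Nodes failing to join or report ready while the upgrade rolls out"
  alarm_actions       = [var.alerting_sns_topic_arn] # hypothetical SNS topic variable
}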


Phase 3: Production Clusters

  • Surgical precision mode activated

  • Staggered upgrades, meaning upgrades performed in phases rather than all at once: specific clients and environments were upgraded on specific dates, each confirmed in advance by the customer.

  • Minute-by-minute monitoring


Terraform Magic: Making Complex Look Simple

module "eks_cluster_upgrades" {
  source  = "my-org/eks-upgrade/aws"
  version = "1.0.0"

  for_each = local.cluster_upgrade_map

  cluster_name       = each.key
  target_version     = "1.29"
  priority           = each.value.priority
  maintenance_window = each.value.maintenance_window
}
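
Because the module instances are keyed by cluster name, a single client can be upgraded in isolation with a targeted apply, for example terraform apply -target='module.eks_cluster_upgrades["client-a-prod"]'. That makes it straightforward to honor each customer's confirmed maintenance date without touching anyone else's cluster.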

War Stories and Lessons Learned 🏆

Unexpected Challenges

  • Some legacy Helm charts required updates, and some of the engineers who had built them were no longer at the company, a classic scenario in this lovely IT world!

  • Certain container images were pinned to specific Kubernetes versions and had to be updated, so some refactoring was done with the help of the engineers.

  • Some Kubernetes manifests had to be updated because API versions had graduated from alpha and beta to stable (or been removed entirely) between 1.22 and 1.29; for example, batch/v1beta1 CronJobs and policy/v1beta1 PodDisruptionBudgets were removed in 1.25 and had to move to their v1 equivalents.


Pro Tips for the Brave

  1. Always have a rollback plan.

  2. Test, test, and test again in lower environments.

  3. Communication is 90% of the battle; executing the plan you communicated is the other 10%. As crazy as it sounds, that is the real truth.

  4. Automate everything you can, because it will save you from human errors. And believe me, human errors do happen, and there are situations where a single mistyped character can blow up an entire infrastructure. A small guardrail like the one sketched below goes a long way.
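
As a tiny example of that kind of guardrail (a sketch, not taken from the original pipeline), a Terraform variable validation can refuse a mistyped version string before it ever reaches a cluster:

variable "target_version" {
  type        = string
  description = "Kubernetes version to upgrade the cluster to"

  validation {
    # Only allow versions that have actually been tested in lower environments
    condition     = contains(["1.28", "1.29"], var.target_version)
    error_message = "target_version must be one of the tested versions: 1.28 or 1.29."
  }
}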


The Aftermath: A Hero’s Reflection

  • 12 clusters upgraded ✅

  • Zero unplanned downtimes ✅

  • Minimal client disruption ✅

  • Personal sanity mostly intact 😅


Final Thoughts: Kubernetes is a Journey, Not a Destination

Upgrading isn’t just about bumping version numbers. It’s about understanding systems, managing expectations, and executing with precision so that highly sensitive workloads are not disrupted and, most importantly, the end users of your applications are not affected.

To my fellow infrastructure engineers: Stay curious, stay prepared, and always have a good coffee nearby. ☕🚀

May your upgrades be smooth and your downtime minimal!

 
 
 
