Kubernetes Cluster Setup & Management


A complete guide to setting up and managing Kubernetes clusters. Learn how to deploy clusters with kubeadm, use managed services (EKS, GKE, AKS), and perform day-2 operations like node management, upgrades, and cluster administration.


Kubernetes Cluster Setup Options

There are multiple ways to set up a Kubernetes cluster, each with its own trade-offs. The choice depends on your requirements for control, cost, operational overhead, and cloud strategy.

kubeadm (On-Premise)

Full control, self-managed

kubeadm is the official tool for bootstrapping production-ready Kubernetes clusters on-premises or in any other environment. It follows Kubernetes best practices and provides a standardized way to set up clusters.

Pros: Full control, no vendor lock-in, lower cost

Cons: Operational overhead, self-managed control plane

Amazon EKS

AWS managed Kubernetes

EKS is a managed Kubernetes service that runs the control plane for you. It integrates deeply with AWS services like IAM, VPC, and ELB.

Pros: Managed control plane, AWS integration, enterprise features

Cons: Higher cost, AWS lock-in, limited control plane access

Google GKE

Google managed Kubernetes

GKE is Google's managed Kubernetes service with deep integration with GCP services. It offers auto-scaling, auto-upgrades, and advanced networking.

Pros: Auto-scaling, advanced networking, GCP integration

Cons: GCP lock-in, limited control plane access

Azure AKS

Azure managed Kubernetes

AKS is Azure's managed Kubernetes service with tight integration with Azure Active Directory (now Microsoft Entra ID), Azure Monitor, and other Azure services.

Pros: Azure integration, managed control plane, flexible pricing

Cons: Azure lock-in, limited control plane access

Choosing the right option: Use kubeadm when you need full control or want to avoid managed-service fees, and you can staff the operational work. Use a managed service (EKS, GKE, AKS) to reduce operational overhead and focus on your applications. Hybrid approaches can combine both.

kubeadm: Production-Ready Cluster Setup

kubeadm is the official Kubernetes tool for setting up production clusters. It handles the complexity of setting up the control plane, joining worker nodes, and configuring cluster certificates.

1. Prerequisites and System Preparation

Ensure all nodes run a supported OS (Ubuntu 20.04+, RHEL 8+, etc.). Disable swap, load the overlay and br_netfilter kernel modules, enable IP forwarding, and install containerd or another CRI-compatible runtime.

2. Install kubeadm, kubelet, and kubectl

Install the required packages from the official Kubernetes repository. Pin the version to ensure consistency across nodes. The kubelet service will be managed by systemd.

3. Initialize the Control Plane

Run kubeadm init on the first control plane node. This sets up the API server, etcd, scheduler, and controller manager, and prints the kubeadm join command that worker nodes use.

4. Configure kubectl and Networking

Copy the kubeconfig file and set up a network plugin (CNI). Calico, Flannel, or Cilium are popular choices. Apply the CNI manifests to enable pod-to-pod communication.

5. Join Worker Nodes

Use the join command from the control plane initialization to add worker nodes to the cluster. Verify node registration with kubectl get nodes.

6. Configure High Availability (Optional)

For production, run multiple control plane nodes behind a load balancer that fronts the API servers. Run etcd as a multi-member cluster (stacked on the control plane nodes or external) so that it can maintain quorum when a member fails.
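The HA settings from step 6 can also be kept in a kubeadm configuration file rather than passed as flags. The sketch below is only an illustration: the helper name write_kubeadm_config, the file name, and the version v1.28.0 are assumptions, while the endpoint and pod subnet are the values used in this guide's commands.

```shell
# Hypothetical helper: writes a kubeadm ClusterConfiguration to a file.
# Arguments: output file, control plane endpoint (load balancer DNS name), Kubernetes version.
write_kubeadm_config() (
  out=$1
  endpoint=$2
  version=$3
  cat > "$out" <<EOF
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: ${version}
controlPlaneEndpoint: "${endpoint}:6443"
networking:
  podSubnet: 10.244.0.0/16
EOF
)

# Example (not run here): generate the file, then initialize the first control plane node.
# write_kubeadm_config kubeadm-config.yaml api-server.example.com v1.28.0
# sudo kubeadm init --config kubeadm-config.yaml --upload-certs
```

With --upload-certs, kubeadm init also prints a join command for additional control plane nodes (it carries --control-plane and a certificate key), alongside the worker join command.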

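# On all nodes: disable swap, load kernel modules, enable bridged traffic and IP forwarding
# (these commands do not persist across reboots)
sudo swapoff -a
sudo modprobe overlay
sudo modprobe br_netfilter
sudo sysctl -w net.bridge.bridge-nf-call-iptables=1 net.ipv4.ip_forward=1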

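# Install containerd and switch it to the systemd cgroup driver (the kubeadm default for kubelet)
sudo apt-get update
sudo apt-get install -y containerd
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl restart containerd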

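# Install kubeadm, kubelet, kubectl from the v1.28 package repository
# (the legacy apt.kubernetes.io repository has been shut down)
sudo apt-get install -y apt-transport-https ca-certificates curl gpg
sudo mkdir -p -m 755 /etc/apt/keyrings
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.28/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.28/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl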

# Initialize control plane (on the first control plane node)
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --control-plane-endpoint=api-server.example.com

# Configure kubectl
mkdir -p $HOME/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# Install Calico CNI (the URL needs a full release tag such as v3.26.1)
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml

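# Join worker nodes (run on each worker). Bootstrap tokens expire after 24 hours;
# print a fresh join command on a control plane node with: kubeadm token create --print-join-command
sudo kubeadm join api-server.example.com:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>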

# Verify cluster
kubectl get nodes
kubectl get pods -n kube-system
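The swapoff and modprobe commands above only last until the next reboot. A minimal sketch of making the node preparation persistent follows; the helper names and the k8s.conf file names are assumptions chosen for illustration, and the root-prefix argument exists only so the output can be previewed in a scratch directory first.

```shell
# Hypothetical helper: writes persistent kernel module and sysctl settings under a root prefix.
# Pass "" to target the real filesystem (needs root); pass a scratch directory to preview.
write_node_prereqs() (
  root=$1
  mkdir -p "$root/etc/modules-load.d" "$root/etc/sysctl.d"
  # Modules listed here are loaded at boot.
  printf '%s\n' overlay br_netfilter > "$root/etc/modules-load.d/k8s.conf"
  # Let iptables see bridged traffic and allow packet forwarding.
  cat > "$root/etc/sysctl.d/k8s.conf" <<'EOF'
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
)

# Hypothetical helper: comments out active swap entries in an fstab-style file
# (keeping a .bak copy) so that swap stays off after a reboot.
disable_swap_in_fstab() (
  fstab=$1
  sed -E -i.bak '/^[^#].*[[:space:]]swap[[:space:]]/ s/^/#/' "$fstab"
)
```

After writing to the real filesystem, sudo sysctl --system applies the sysctl settings without a reboot.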
            

Managed Kubernetes Services

Amazon EKS

Amazon EKS (Elastic Kubernetes Service) is a managed Kubernetes service that runs the control plane for you. AWS operates the control plane across multiple Availability Zones and integrates it with services such as IAM, VPC, and ELB.

# EKS cluster creation (via eksctl)
eksctl create cluster \
  --name production-cluster \
  --version 1.28 \
  --region us-east-1 \
  --nodegroup-name standard-workers \
  --node-type t3.medium \
  --nodes 3 \
  --nodes-min 1 \
  --nodes-max 10 \
  --managed

# Update kubeconfig
aws eks update-kubeconfig --region us-east-1 --name production-cluster

# View nodes
kubectl get nodes

# EKS best practices:
# - Use IAM roles for service accounts (IRSA)
# - Enable cluster autoscaler
# - Use VPC CNI for networking
# - Enable control plane logging
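The same cluster can be described in an eksctl config file, which is easier to review and keep in version control than a long flag list. This is a sketch under assumptions: the helper name eksctl_cluster_config and the chosen log types are illustrative, and the remaining values mirror the flags above.

```shell
# Hypothetical helper: prints an eksctl ClusterConfig that mirrors the flags used above
# and switches on two of the listed best practices.
eksctl_cluster_config() {
  cat <<'EOF'
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production-cluster
  region: us-east-1
  version: "1.28"
iam:
  withOIDC: true            # OIDC provider, a prerequisite for IRSA
cloudWatch:
  clusterLogging:
    enableTypes: ["api", "audit", "authenticator"]   # control plane logging
managedNodeGroups:
  - name: standard-workers
    instanceType: t3.medium
    desiredCapacity: 3
    minSize: 1
    maxSize: 10
EOF
}

# Example (not run here):
# eksctl_cluster_config > cluster.yaml
# eksctl create cluster -f cluster.yaml
```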
            

Google GKE

GKE is Google's managed Kubernetes service with advanced features like auto-scaling, auto-upgrades, and integrated networking.

# GKE cluster creation (via gcloud)
gcloud container clusters create production-cluster \
  --zone us-central1-a \
  --num-nodes 3 \
  --machine-type n1-standard-4 \
  --enable-autoscaling \
  --min-nodes 1 \
  --max-nodes 10 \
  --enable-autorepair \
  --enable-autoupgrade

# Get credentials
gcloud container clusters get-credentials production-cluster --zone us-central1-a

# View nodes
kubectl get nodes

# GKE best practices:
# - Use Workload Identity for authentication
# - Enable network policies
# - Use GKE Ingress controller
# - Enable node auto-provisioning
            

Azure AKS

AKS is Azure's managed Kubernetes service with deep integration with Azure Active Directory and other Azure services.

# AKS cluster creation (via az)
az aks create \
  --resource-group production-rg \
  --name production-cluster \
  --node-count 3 \
  --node-vm-size Standard_D2s_v3 \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 10 \
  --enable-addons monitoring \
  --generate-ssh-keys

# Get credentials
az aks get-credentials --resource-group production-rg --name production-cluster

# View nodes
kubectl get nodes

# AKS best practices:
# - Use Azure AD integration
# - Enable Azure Policy for AKS
# - Use Azure Monitor for containers
# - Implement network policies
            

Managed vs Self-Managed: Managed services reduce operational overhead significantly but come with higher costs and vendor lock-in. Self-managed clusters give you full control but require more operational expertise.

Cluster Administration

Day-2 operations are critical for maintaining healthy Kubernetes clusters. Here are the essential administrative tasks and best practices.

Node Management

# Cordoning a node (mark unschedulable)
kubectl cordon <node-name>

# Uncordon a node
kubectl uncordon <node-name>

# Drain a node (evict all pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# View node status
kubectl get nodes
kubectl describe node <node-name>

# Add taints to node
kubectl taint nodes <node-name> key=value:NoSchedule

# Remove taint
kubectl taint nodes <node-name> key-

# View node capacity
kubectl describe node <node-name> | grep -A 5 Capacity
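For routine maintenance, the drain and uncordon commands above are usually run as a pair around the actual work. The wrapper below is a minimal sketch, not a hardened tool; the function name with_node_drained and the 300-second drain timeout are assumptions.

```shell
# Hypothetical helper: drain a node, run a maintenance command, then uncordon the node.
# Usage: with_node_drained <node-name> <command> [args...]
with_node_drained() (
  node=$1
  shift
  # drain cordons the node first, then evicts its pods (DaemonSet pods are left in place).
  # If the drain fails, stop here; the node stays cordoned for the operator to inspect.
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s || exit 1
  status=0
  "$@" || status=$?
  # Uncordon even when the maintenance command failed, so the node is not
  # left unschedulable by accident.
  kubectl uncordon "$node"
  exit "$status"
)

# Example (not run here; the node name and the ssh command are placeholders):
# with_node_drained worker-1 ssh worker-1 'sudo apt-get -y upgrade'
```

The wrapper returns the exit status of the maintenance command, so it can be chained in a loop that stops at the first node that fails.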
            

Cluster Upgrades

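# Upgrade kubeadm first (the packages were held with apt-mark hold)
sudo apt-mark unhold kubeadm && sudo apt-get update && \
  sudo apt-get install -y kubeadm='1.28.0-*' && sudo apt-mark hold kubeadm

# Check available versions
sudo kubeadm upgrade plan

# Upgrade the first control plane node
sudo kubeadm upgrade apply v1.28.0

# On the other control plane nodes and on workers (drain each node first)
sudo kubeadm upgrade node

# Upgrade kubelet and kubectl on each node
sudo apt-mark unhold kubelet kubectl && \
  sudo apt-get install -y kubelet='1.28.0-*' kubectl='1.28.0-*' && sudo apt-mark hold kubelet kubectl
sudo systemctl daemon-reload
sudo systemctl restart kubelet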

# For managed clusters (EKS); without --approve, eksctl only shows the planned changes
eksctl upgrade cluster --name production-cluster --version 1.28 --approve

# For GKE
gcloud container clusters upgrade production-cluster --zone us-central1-a --master

# For AKS
az aks upgrade --resource-group production-rg --name production-cluster --kubernetes-version 1.28
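kubeadm supports upgrading only one minor version at a time, so a jump such as 1.26 to 1.28 has to go through 1.27. A small pre-flight check can catch this before any node is touched. The function name can_upgrade is an assumption, and the check looks only at the major and minor numbers, not at patch ordering.

```shell
# Hypothetical helper: succeeds when the target is on the same minor version
# or exactly one minor version ahead of the current one.
# Accepts versions such as v1.27.4 or 1.28.0.
can_upgrade() (
  current=${1#v}
  target=${2#v}
  cur_major=${current%%.*}
  tgt_major=${target%%.*}
  cur_rest=${current#*.}
  tgt_rest=${target#*.}
  cur_minor=${cur_rest%%.*}
  tgt_minor=${tgt_rest%%.*}
  [ "$cur_major" = "$tgt_major" ] || exit 1
  step=$((tgt_minor - cur_minor))
  [ "$step" -ge 0 ] && [ "$step" -le 1 ]
)

if can_upgrade v1.26.9 v1.28.0; then
  echo "upgrade path ok"
else
  echo "upgrade must go through the intermediate minor version first"
fi
```

In practice the current version could come from kubectl version or kubeadm version rather than being typed by hand.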
            

Certificate Management

# Check certificate expiration
kubeadm certs check-expiration

# Renew all certificates
kubeadm certs renew all

# Renew specific certificate
kubeadm certs renew apiserver

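# Restart control plane components after renewal. They run as static pods, so
# restarting kubelet alone does not make them reload the certificates: move the
# manifests out, wait for kubelet to stop the pods, then move them back.
# (The scratch directory name is arbitrary.)
sudo mkdir -p /etc/kubernetes/manifests-stopped
sudo sh -c 'mv /etc/kubernetes/manifests/*.yaml /etc/kubernetes/manifests-stopped/'
sleep 20
sudo sh -c 'mv /etc/kubernetes/manifests-stopped/*.yaml /etc/kubernetes/manifests/'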
            

Security Administration

# Create a service account
kubectl create serviceaccount my-sa -n default

# Create a role
kubectl create role pod-reader --verb=get,list,watch --resource=pods -n default

# Create role binding
kubectl create rolebinding read-pods --role=pod-reader --serviceaccount=default:my-sa -n default

# View RBAC
kubectl get roles -n default
kubectl get rolebindings -n default

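# Configure Pod Security Standards for a namespace (Pod Security Admission labels)
kubectl label --overwrite namespace default pod-security.kubernetes.io/enforce=baseline pod-security.kubernetes.io/warn=restricted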
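After creating a role binding it is worth confirming that it grants what was intended and nothing more. One way is kubectl auth can-i with impersonation; the wrapper name sa_can below is an assumption made for illustration.

```shell
# Hypothetical helper: asks the API server whether a service account may perform an action.
# Usage: sa_can <namespace> <serviceaccount> <verb> <resource>
sa_can() {
  kubectl auth can-i "$3" "$4" \
    --namespace "$1" \
    --as "system:serviceaccount:$1:$2"
}

# Example (not run here; assumes the role and role binding above exist):
# sa_can default my-sa list pods
# sa_can default my-sa delete pods
```

With the pod-reader role bound as shown above, the first call should answer yes and the second no. The caller needs permission to impersonate service accounts, which cluster administrators normally have.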
            

Resource Management

# View resource usage (requires metrics-server)
kubectl top nodes
kubectl top pods

# Set resource quotas (save as a YAML file and apply with kubectl apply -f)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-quota
  namespace: default
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 10Gi
    limits.cpu: "20"
    limits.memory: 20Gi
    persistentvolumeclaims: "10"
    pods: "50"

# Set limit ranges (YAML manifest; apply with kubectl apply -f)
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: default
spec:
  limits:
  - default:
      cpu: 500m
      memory: 512Mi
    defaultRequest:
      cpu: 250m
      memory: 256Mi
    max:
      cpu: 2
      memory: 2Gi
    type: Container
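The quota and the limit range interact: every container that omits its own requests receives the LimitRange default, and those defaults count against the ResourceQuota. The arithmetic below is a rough sketch of that interaction using the values from the two manifests; the helper name cpu_to_millicores is an assumption, and it only handles whole cores and the "m" suffix.

```shell
# Hypothetical helper: converts a CPU quantity ("250m" or a whole number of cores
# such as "10") to millicores.
cpu_to_millicores() {
  case $1 in
    *m) echo "${1%m}" ;;
    *)  echo "$(( $1 * 1000 ))" ;;
  esac
}

# How many containers using the default request fit inside the quota?
quota_m=$(cpu_to_millicores 10)      # requests.cpu from the ResourceQuota
request_m=$(cpu_to_millicores 250m)  # defaultRequest.cpu from the LimitRange
echo "containers that fit by CPU request: $(( quota_m / request_m ))"
# prints: containers that fit by CPU request: 40
```

With these numbers the CPU request quota is exhausted at 40 default-sized containers, so the pods: "50" limit would not be reached by single-container pods that rely on the defaults.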
            

Backup and Disaster Recovery

Regular backups and disaster recovery planning are essential for production clusters. Here are the key strategies.

etcd Backup

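# Back up etcd (run on a control plane node; certificate paths are the kubeadm defaults)
sudo ETCDCTL_API=3 etcdctl snapshot save /backups/etcd-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Restore etcd into a new data directory (point the etcd manifest at it afterwards)
sudo ETCDCTL_API=3 etcdctl snapshot restore /backups/etcd-20250115.db --data-dir /var/lib/etcd-backup

# Automated backup script
#!/bin/bash
set -euo pipefail
BACKUP_DIR="/backups/etcd"
DATE=$(date +%Y%m%d-%H%M%S)
mkdir -p "$BACKUP_DIR"
ETCDCTL_API=3 etcdctl snapshot save "$BACKUP_DIR/etcd-$DATE.db" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# Delete snapshots older than 30 days
find "$BACKUP_DIR" -name "etcd-*.db" -mtime +30 -delete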
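Age-based cleanup with find -mtime has one weak spot: if the backup job silently stops, the cleanup eventually deletes every remaining snapshot. A count-based retention rule avoids that by always keeping the newest few files. The sketch below assumes the etcd-YYYYMMDD-HHMMSS.db naming used by the script above, so that sorting by name is the same as sorting by time; the function name prune_snapshots is hypothetical.

```shell
# Hypothetical helper: keeps only the newest <keep> snapshots in <dir>.
# Relies on timestamped file names (etcd-YYYYMMDD-HHMMSS.db) sorting chronologically.
prune_snapshots() (
  dir=$1
  keep=$2
  total=$(ls -1 "$dir"/etcd-*.db 2>/dev/null | wc -l)
  excess=$((total - keep))
  # Nothing to do when there are no more than <keep> snapshots.
  [ "$excess" -gt 0 ] || exit 0
  # The oldest names sort first; remove exactly the surplus.
  ls -1 "$dir"/etcd-*.db | sort | head -n "$excess" | while IFS= read -r f; do
    rm -- "$f"
  done
)

# Example (not run here): keep the 14 most recent snapshots.
# prune_snapshots /backups/etcd 14
```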
            

Application Backup with Velero

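# Install Velero (the AWS provider needs its object-store plugin; choose the
# plugin version that matches your Velero release)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:<plugin-version> \
  --bucket production-backups \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1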

# Create a backup
velero backup create full-backup --include-namespaces default,production

# Create a scheduled backup
velero schedule create daily-backup --schedule="0 2 * * *" --include-namespaces default,production

# Restore from backup
velero restore create --from-backup full-backup

# View backups
velero backup get
            

Critical: Always test your restore procedures in a non-production environment. A backup is only useful if you can successfully restore from it. Document your disaster recovery procedures and practice them regularly.

Operational Best Practices

Security

Enable RBAC, use network policies, rotate certificates regularly, and implement pod security standards. Use OPA/Gatekeeper for policy enforcement.

Monitoring

Set up Prometheus, Grafana, and Alertmanager. Monitor control plane components, node resources, and application performance. Define SLOs and SLIs.

Updates

Regularly update Kubernetes, node OS, and container runtimes. Use phased rollouts for updates. Test updates in staging before production.

Storage

Plan storage capacity, use appropriate storage classes, monitor PVC usage, and implement backup strategies for stateful applications.

Cost Management

Use resource quotas, implement node auto-scaling, right-size resources, and use spot instances for non-critical workloads. Monitor cloud costs regularly.

Documentation

Document cluster architecture, operational procedures, and troubleshooting guides. Maintain runbooks for common issues and disaster scenarios.

Frequently Asked Questions

Should I use a managed or self-managed Kubernetes cluster?

Choose managed (EKS, GKE, AKS) if you want to reduce operational overhead and focus on applications. Choose self-managed if you need full control, have specific compliance requirements, or want to avoid vendor lock-in. Many organizations use both.

What are the minimum system requirements for a kubeadm cluster?

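kubeadm itself requires at least 2 GB of RAM per machine and 2 CPUs on control plane nodes. For production, plan for at least 4 GB of RAM and 2 CPUs on control plane nodes, and treat 2 GB of RAM and 1 CPU as a floor for worker nodes, sized up to fit their workloads. Allocate 50 GB or more of storage for the OS and containerd.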

How do I upgrade my Kubernetes cluster without downtime?

Use rolling upgrades with kubeadm or managed services, moving one minor version at a time. Upgrade control plane nodes first, then drain and upgrade worker nodes one by one. Run multiple replicas of each application and use PodDisruptionBudgets to limit how many pods are evicted at once.
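A PodDisruptionBudget is a small manifest. The sketch below generates one for a Deployment whose pods carry an app label; the helper name pdb_manifest, the label key app, and the example application name web are assumptions for illustration.

```shell
# Hypothetical helper: prints a PodDisruptionBudget for pods labeled app=<name>.
# Usage: pdb_manifest <app-name> <min-available>
pdb_manifest() {
  cat <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: $1-pdb
spec:
  minAvailable: $2
  selector:
    matchLabels:
      app: $1
EOF
}

# Example (not run here): keep at least 2 "web" pods running while nodes are drained.
# pdb_manifest web 2 | kubectl apply -n production -f -
```

kubectl drain honors the budget: it waits rather than evicting a pod when the eviction would drop the application below minAvailable, so the application needs more replicas than the budget for a drain to make progress.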

How often should I back up etcd?

Back up etcd daily at minimum. For critical production clusters, consider hourly backups. Always test restore procedures. Store backups in a separate location from the cluster.

What's the difference between cordon, drain, and delete node?

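Cordon marks a node as unschedulable, so no new pods land on it while existing pods keep running. Drain cordons the node and then evicts its pods (DaemonSet pods are skipped with --ignore-daemonsets). Delete removes the node object from the cluster. For maintenance, drain the node, perform the work, then uncordon it.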

How many control plane nodes do I need for HA?

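For high availability, run 3 or 5 control plane nodes. etcd stays available only while a majority of its members are healthy, so 3 members tolerate one failure and 5 tolerate two; an even count adds no extra tolerance. For most clusters, 3 is sufficient. In managed services like EKS, GKE, and AKS, the cloud provider runs the control plane.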
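The reason for odd member counts falls out of the quorum arithmetic. The snippet below is a small worked example; the function names are chosen for illustration.

```shell
# Quorum is a strict majority of members; tolerance is how many members can fail
# while a majority still remains.
quorum_size() { echo $(( $1 / 2 + 1 )); }
failure_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum_size "$n") tolerates=$(failure_tolerance "$n")"
done
# prints:
# members=1 quorum=1 tolerates=0
# members=2 quorum=2 tolerates=0
# members=3 quorum=2 tolerates=1
# members=4 quorum=3 tolerates=1
# members=5 quorum=3 tolerates=2
```

Going from 3 to 4 members raises the quorum without raising the tolerance, which is why 3 and 5 are the usual choices.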

What CNI plugin should I use?

Calico is a popular choice thanks to its rich feature set and network policy support. Cilium offers eBPF-based security and observability. Flannel is simple and widely used, but it does not enforce network policies on its own. Choose based on your specific networking and security requirements.

How do I handle node failures automatically?

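Use Node Auto-Repair (GKE) or Auto Scaling groups (AWS) to replace failed nodes. Implement node monitoring and alerting. Run multiple replicas spread across nodes so that a single node failure does not take an application down; PodDisruptionBudgets only guard against voluntary disruptions such as drains, not against a node crashing.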


A well-designed cluster setup and robust operational practices are the foundation of successful Kubernetes adoption. Choose the right setup for your needs and invest in automation and monitoring.