Kubernetes Cluster Setup & Management
A complete guide to setting up and managing Kubernetes clusters. Learn how to deploy clusters with kubeadm, use managed services (EKS, GKE, AKS), and perform day-2 operations like node management, upgrades, and cluster administration.
There are multiple ways to set up a Kubernetes cluster, each with its own trade-offs. The choice depends on your requirements for control, cost, operational overhead, and cloud strategy.
kubeadm (On-Premise)
kubeadm is the official Kubernetes tool for bootstrapping clusters on machines you manage yourself. It sets up the control plane, joins worker nodes, and generates and configures the cluster certificates.
Run kubeadm init on the control-plane (master) node. This sets up the API server, etcd, scheduler, and controller manager. The command outputs the join command for worker nodes. Once the workers have joined, verify the cluster with kubectl get nodes.
# On all nodes: disable swap, enable kernel modules
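sudo swapoff -a   # not persistent: also remove or comment out swap entries in /etc/fstab
sudo modprobe overlay
sudo modprobe br_netfilter
# Load the modules on boot and set the sysctls Kubernetes networking needs
printf 'overlay\nbr_netfilter\n' | sudo tee /etc/modules-load.d/k8s.conf
printf 'net.bridge.bridge-nf-call-iptables = 1\nnet.bridge.bridge-nf-call-ip6tables = 1\nnet.ipv4.ip_forward = 1\n' | sudo tee /etc/sysctl.d/k8s.conf
sudo sysctl --system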
# Install containerd
sudo apt-get install -y containerd
sudo mkdir -p /etc/containerd
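containerd config default | sudo tee /etc/containerd/config.toml
# kubeadm configures the kubelet for the systemd cgroup driver by default; make containerd match
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl restart containerd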
# Install kubeadm, kubelet, kubectl
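# The legacy apt.kubernetes.io repository has been shut down; use the pkgs.k8s.io repository
sudo apt-get install -y apt-transport-https ca-certificates curl gpg
sudo mkdir -p -m 755 /etc/apt/keyrings
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.28/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.28/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list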
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
# Initialize control plane (on master node)
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --control-plane-endpoint=api-server.example.com
# Configure kubectl
mkdir -p $HOME/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
# Install Calico CNI
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml
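# Join worker nodes (run on each worker). Join tokens expire after 24 hours by default;
# print a fresh command on the control plane with: kubeadm token create --print-join-command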
sudo kubeadm join api-server.example.com:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>
# Verify cluster
kubectl get nodes
kubectl get pods -n kube-system
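The node prerequisites at the start of this listing (swap disabled, overlay and br_netfilter loaded) can be checked before running kubeadm init or kubeadm join. The following is a minimal sketch, not part of kubeadm: the function names swap_is_off, module_loaded, and preflight_report are hypothetical, and the checks read /proc/swaps and /proc/modules unless other paths are passed in.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helpers; the function names are illustrative, not part of kubeadm.

# Succeeds when the swaps file (default: /proc/swaps) lists no active swap device.
swap_is_off() {
  local swaps_file="${1:-/proc/swaps}"
  local active
  # The first line is a column header; every further non-empty line is an active device.
  active=$(tail -n +2 "$swaps_file" | grep -c . || true)
  [ "$active" -eq 0 ]
}

# Succeeds when the module is listed in the modules file (default: /proc/modules).
# Modules compiled into the kernel do not appear there, so a failure is only a hint.
module_loaded() {
  local module="$1"
  local modules_file="${2:-/proc/modules}"
  grep -q "^${module} " "$modules_file"
}

# Prints one line per check; reads the real /proc files unless paths are passed in.
preflight_report() {
  local swaps_file="${1:-/proc/swaps}"
  local modules_file="${2:-/proc/modules}"
  local module
  if swap_is_off "$swaps_file"; then
    echo "swap: off"
  else
    echo "swap: ON (run swapoff -a)"
  fi
  for module in overlay br_netfilter; do
    if module_loaded "$module" "$modules_file"; then
      echo "$module: loaded"
    else
      echo "$module: not listed (run modprobe $module)"
    fi
  done
}
```

Calling preflight_report with no arguments on a node prints one status line for swap and one per kernel module. kubeadm runs its own preflight checks as well, so this only surfaces problems earlier.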
Amazon EKS
Amazon EKS (Elastic Kubernetes Service) is a managed Kubernetes service that runs the control plane for you. It's highly available, secure, and integrates with AWS services.
# EKS cluster creation (via eksctl)
eksctl create cluster \
--name production-cluster \
--version 1.28 \
--region us-east-1 \
--nodegroup-name standard-workers \
--node-type t3.medium \
--nodes 3 \
--nodes-min 1 \
--nodes-max 10 \
--managed
# Update kubeconfig
aws eks update-kubeconfig --region us-east-1 --name production-cluster
# View nodes
kubectl get nodes
# EKS best practices:
# - Use IAM roles for service accounts (IRSA)
# - Enable cluster autoscaler
# - Use VPC CNI for networking
# - Enable control plane logging
Google GKE
GKE is Google's managed Kubernetes service with advanced features like auto-scaling, auto-upgrades, and integrated networking.
# GKE cluster creation (via gcloud)
gcloud container clusters create production-cluster \
--zone us-central1-a \
--num-nodes 3 \
--machine-type n1-standard-4 \
--enable-autoscaling \
--min-nodes 1 \
--max-nodes 10 \
--enable-autorepair \
--enable-autoupgrade
# Get credentials
gcloud container clusters get-credentials production-cluster --zone us-central1-a
# View nodes
kubectl get nodes
# GKE best practices:
# - Use Workload Identity for authentication
# - Enable network policies
# - Use GKE Ingress controller
# - Enable node auto-provisioning
Azure AKS
AKS is Azure's managed Kubernetes service with deep integration with Microsoft Entra ID (formerly Azure Active Directory) and other Azure services.
# AKS cluster creation (via az)
az aks create \
--resource-group production-rg \
--name production-cluster \
--node-count 3 \
--node-vm-size Standard_D2s_v3 \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 10 \
--enable-addons monitoring \
--generate-ssh-keys
# Get credentials
az aks get-credentials --resource-group production-rg --name production-cluster
# View nodes
kubectl get nodes
# AKS best practices:
# - Use Azure AD integration
# - Enable Azure Policy for AKS
# - Use Azure Monitor for containers
# - Implement network policies
Day-2 operations are critical for maintaining healthy Kubernetes clusters. Here are the essential administrative tasks and best practices.
Node Management
# Cordoning a node (mark unschedulable)
kubectl cordon <node-name>
# Uncordon a node
kubectl uncordon <node-name>
# Drain a node (evict all pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# View node status
kubectl get nodes
kubectl describe node <node-name>
# Add taints to node
kubectl taint nodes <node-name> key=value:NoSchedule
# Remove taint
kubectl taint nodes <node-name> key-
# View node capacity
kubectl describe node <node-name> | grep -A 5 Capacity
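Before and after cordoning or draining, it helps to see which nodes are not plainly schedulable. The sketch below filters the output of kubectl get nodes --no-headers; the function name nodes_needing_attention is hypothetical, and it assumes the default column order (NAME, STATUS, ROLES, AGE, VERSION).

```shell
# Hypothetical helper: prints the names of nodes whose STATUS column is not exactly "Ready".
# Expects `kubectl get nodes --no-headers` output on stdin
# (columns: NAME STATUS ROLES AGE VERSION).
nodes_needing_attention() {
  awk '$2 != "Ready" { print $1 }'
}

# Usage against a live cluster (requires kubectl and cluster access):
#   kubectl get nodes --no-headers | nodes_needing_attention
```

A cordoned node reports Ready,SchedulingDisabled, so it is listed alongside NotReady nodes. That makes it easy to spot a node that was drained for maintenance and never uncordoned.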
Cluster Upgrades
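# Upgrade kubeadm first (the packages were held during installation)
sudo apt-mark unhold kubeadm
sudo apt-get update && sudo apt-get install -y kubeadm='1.28.0-*'
sudo apt-mark hold kubeadm
# Check available versions
sudo kubeadm upgrade plan
# Upgrade control plane (on worker nodes run `sudo kubeadm upgrade node` instead)
sudo kubeadm upgrade apply v1.28.0
# Upgrade kubelet and kubectl on each node (drain the node first, uncordon it afterwards)
sudo apt-mark unhold kubelet kubectl
sudo apt-get update && sudo apt-get install -y kubelet='1.28.0-*' kubectl='1.28.0-*'
sudo apt-mark hold kubelet kubectl
sudo systemctl daemon-reload
sudo systemctl restart kubelet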
# For managed clusters (EKS)
eksctl upgrade cluster --name production-cluster --version 1.28
# For GKE
gcloud container clusters upgrade production-cluster --zone us-central1-a --master
# For AKS
az aks upgrade --resource-group production-rg --name production-cluster --kubernetes-version 1.28
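kubeadm does not support skipping minor versions during an upgrade, so a cluster on 1.26 has to pass through 1.27 before reaching 1.28. The sketch below checks a planned step before any of the commands above are run; the function name upgrade_step_ok is hypothetical and the check ignores patch levels.

```shell
# Hypothetical helper: succeeds only when moving from $1 to $2 stays within the same
# major version and advances by at most one minor version (patch levels are ignored).
# Accepts versions such as "1.27.4" or "v1.28.0".
upgrade_step_ok() {
  local current="${1#v}"
  local target="${2#v}"
  local cur_major cur_minor tgt_major tgt_minor
  cur_major="${current%%.*}"
  cur_minor="${current#*.}"; cur_minor="${cur_minor%%.*}"
  tgt_major="${target%%.*}"
  tgt_minor="${target#*.}"; tgt_minor="${tgt_minor%%.*}"
  [ "$cur_major" -eq "$tgt_major" ] || return 1
  local step=$(( tgt_minor - cur_minor ))
  [ "$step" -ge 0 ] && [ "$step" -le 1 ]
}

# Usage:
#   upgrade_step_ok 1.27.4 v1.28.0 && echo "safe to run kubeadm upgrade apply"
```

Managed services enforce their own upgrade paths, so this check is mainly useful for kubeadm clusters.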
Certificate Management
# Check certificate expiration
kubeadm certs check-expiration
# Renew all certificates
kubeadm certs renew all
# Renew specific certificate
kubeadm certs renew apiserver
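# Restart control plane components after renewal.
# Restarting the kubelet does not restart the static pods, so move their manifests away and back
sudo mkdir -p /tmp/k8s-manifests
sudo sh -c 'mv /etc/kubernetes/manifests/*.yaml /tmp/k8s-manifests/'
sleep 20
sudo sh -c 'mv /tmp/k8s-manifests/*.yaml /etc/kubernetes/manifests/'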
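Expiry can also be checked per certificate file, which is convenient in a cron job or monitoring probe. The sketch below relies on openssl x509 -checkend; the function name cert_valid_for_days is hypothetical, and the path in the usage comment assumes the default kubeadm certificate directory.

```shell
# Hypothetical helper: succeeds when the PEM certificate at $1 is still valid
# $2 days from now (default 30). `openssl x509 -checkend` takes the window in seconds.
cert_valid_for_days() {
  local cert="$1"
  local days="${2:-30}"
  openssl x509 -in "$cert" -noout -checkend $(( days * 86400 )) >/dev/null
}

# Usage on a kubeadm control-plane node (assumes the default certificate directory):
#   for crt in /etc/kubernetes/pki/*.crt; do
#     cert_valid_for_days "$crt" 30 || echo "renew soon: $crt"
#   done
```

The loop in the usage comment does not descend into /etc/kubernetes/pki/etcd, so the etcd certificates would need a second loop.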
Security Administration
# Create a service account
kubectl create serviceaccount my-sa -n default
# Create a role
kubectl create role pod-reader --verb=get,list,watch --resource=pods -n default
# Create role binding
kubectl create rolebinding read-pods --role=pod-reader --serviceaccount=default:my-sa -n default
# View RBAC
kubectl get roles -n default
kubectl get rolebindings -n default
# Configure pod security standards
kubectl apply -f pod-security-standard.yaml
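Pod Security Standards are enforced per namespace by the built-in Pod Security Admission controller, which reads labels of the form pod-security.kubernetes.io/<mode>=<level>. The sketch below generates those labels for one level; the function name pss_labels is hypothetical, and applying the same level to all three modes is an assumption made for brevity.

```shell
# Hypothetical helper: prints the three Pod Security Admission namespace labels for one level.
# Valid levels are the Pod Security Standards names: privileged, baseline, restricted.
pss_labels() {
  local level="$1"
  case "$level" in
    privileged|baseline|restricted) ;;
    *) echo "unknown level: $level" >&2; return 1 ;;
  esac
  local mode
  for mode in enforce audit warn; do
    printf 'pod-security.kubernetes.io/%s=%s\n' "$mode" "$level"
  done
}

# Usage against a live cluster (requires kubectl; word splitting of the output is intended):
#   kubectl label --overwrite namespace default $(pss_labels baseline)
```

A common rollout is to set only warn and audit to the stricter level first, review the resulting warnings, and tighten enforce afterwards.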
Resource Management
# View resource usage
kubectl top nodes
kubectl top pods
# Set resource quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-quota
  namespace: default
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 10Gi
    limits.cpu: "20"
    limits.memory: 20Gi
    persistentvolumeclaims: "10"
    pods: "50"
# Set limit ranges
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: default
spec:
  limits:
  - default:
      cpu: 500m
      memory: 512Mi
    defaultRequest:
      cpu: 250m
      memory: 256Mi
    max:
      cpu: 2
      memory: 2Gi
    type: Container
Regular backups and disaster recovery planning are essential for production clusters. Here are the key strategies.
etcd Backup
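# Backup etcd (endpoint and certificate paths are the kubeadm defaults)
ETCDCTL_API=3 etcdctl snapshot save /backups/etcd-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key
# Restore etcd (afterwards point the etcd static pod at the restored data directory)
ETCDCTL_API=3 etcdctl snapshot restore /backups/etcd-20250115.db --data-dir /var/lib/etcd-backup
# Automated backup script
#!/bin/bash
set -euo pipefail
BACKUP_DIR="/backups/etcd"
DATE=$(date +%Y%m%d-%H%M%S)
mkdir -p "$BACKUP_DIR"
ETCDCTL_API=3 etcdctl snapshot save "$BACKUP_DIR/etcd-$DATE.db" \
  --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key
# Delete snapshots older than 30 days
find "$BACKUP_DIR" -name "etcd-*.db" -mtime +30 -delete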
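The backup script above prunes snapshots by age. An alternative retention policy keeps a fixed number of the newest snapshots regardless of age, which bounds disk usage when backups run more often than daily. The sketch below assumes the etcd-YYYYMMDD-HHMMSS.db naming used by the script; the function name prune_snapshots is hypothetical.

```shell
# Hypothetical helper: keeps the $2 newest etcd-*.db snapshots in directory $1
# and deletes the rest. Timestamped names (etcd-YYYYMMDD-HHMMSS.db) sort
# chronologically as plain strings, so the oldest files come first after `sort`.
prune_snapshots() {
  local dir="$1"
  local keep="$2"
  local total excess
  total=$(find "$dir" -maxdepth 1 -name 'etcd-*.db' | wc -l)
  excess=$(( total - keep ))
  # Nothing to do when the directory holds no more snapshots than we want to keep.
  [ "$excess" -gt 0 ] || return 0
  find "$dir" -maxdepth 1 -name 'etcd-*.db' | sort | head -n "$excess" |
    while IFS= read -r snapshot; do
      rm -f -- "$snapshot"
    done
}

# Usage, for example as the last line of the backup script:
#   prune_snapshots /backups/etcd 14
```

Only files matching etcd-*.db directly inside the directory are considered, so other files stored alongside the snapshots are left alone.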
Application Backup with Velero
# Install Velero
velero install \
--provider aws \
--bucket production-backups \
--secret-file ./credentials-velero \
--backup-location-config region=us-east-1 \
--snapshot-location-config region=us-east-1
# Create a backup
velero backup create full-backup --include-namespaces default,production
# Create a scheduled backup
velero schedule create daily-backup --schedule="0 2 * * *" --include-namespaces default,production
# Restore from backup
velero restore create --from-backup full-backup
# View backups
velero backup get
Security
Monitoring
Updates
Storage
Cost Management
Documentation
A well-designed cluster setup and robust operational practices are the foundation of successful Kubernetes adoption. Choose the right setup for your needs and invest in automation and monitoring.