Kubernetes Cluster Setup & Management

A complete guide to setting up and managing Kubernetes clusters. Learn how to deploy clusters with kubeadm, use managed services (EKS, GKE, AKS), and perform day-2 operations like node management, upgrades, and cluster administration.

Kubernetes Cluster Setup Options

There are multiple ways to set up a Kubernetes cluster, each with its own trade-offs. The choice depends on your requirements for control, cost, operational overhead, and cloud strategy.

kubeadm (On-Premise)

Full control, self-managed
kubeadm is the official tool for deploying production-ready Kubernetes clusters on-premise or in any environment. It follows Kubernetes best practices and provides a standardized way to set up clusters.
Pros: Full control, no vendor lock-in, lower cost
Cons: Operational overhead, self-managed control plane

Amazon EKS

AWS managed Kubernetes
EKS is a managed Kubernetes service that runs the control plane for you. It integrates deeply with AWS services like IAM, VPC, and ELB.
Pros: Managed control plane, AWS integration, enterprise features
Cons: Higher cost, AWS lock-in, limited control plane access

Google GKE

Google managed Kubernetes
GKE is Google's managed Kubernetes service with deep integration with GCP services. It offers auto-scaling, auto-upgrades, and advanced networking.
Pros: Auto-scaling, advanced networking, GCP integration
Cons: GCP lock-in, limited control plane access

Azure AKS

Azure managed Kubernetes
AKS is Azure's managed Kubernetes service with tight integration with Azure Active Directory, Azure Monitor, and other Azure services.
Pros: Azure integration, managed control plane, flexible pricing
Cons: Azure lock-in, limited control plane access
Choosing the right option: Use kubeadm for full control and cost savings. Use managed services (EKS, GKE, AKS) to reduce operational overhead and focus on your applications. Consider hybrid approaches for flexibility.
kubeadm: Production-Ready Cluster Setup

kubeadm is the official Kubernetes tool for setting up production clusters. It handles the complexity of setting up the control plane, joining worker nodes, and configuring cluster certificates.

1. Prerequisites and System Preparation
Ensure all nodes have a compatible OS (Ubuntu 20.04+, RHEL 8+, etc.), disable swap, enable kernel modules (overlay, br_netfilter), and install containerd or another CRI runtime.
2. Install kubeadm, kubelet, and kubectl
Install the required packages from the official Kubernetes repository. Pin the version to ensure consistency across nodes. The kubelet service will be managed by systemd.
3. Initialize the Control Plane
Run kubeadm init on the first control plane node. This sets up the API server, etcd, scheduler, and controller manager as static pods. The command prints the join command for worker nodes.
4. Configure kubectl and Networking
Copy the kubeconfig file and set up a network plugin (CNI). Calico, Flannel, or Cilium are popular choices. Apply the CNI manifests to enable pod-to-pod communication.
5. Join Worker Nodes
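Use the join command from the control plane initialization to add worker nodes to the cluster. Join tokens expire after 24 hours by default; generate a fresh command with kubeadm token create --print-join-command. Verify node registration with kubectl get nodes.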
6. Configure High Availability (Optional)
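For production, set up multiple control plane nodes. Put a load balancer in front of the API servers and use its address as the control plane endpoint. Run etcd as a multi-member cluster with an odd number of members so it can keep quorum when a member fails.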
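# On all nodes: disable swap and load the required kernel modules
sudo swapoff -a   # also remove swap entries from /etc/fstab so this survives a reboot
sudo modprobe overlay
sudo modprobe br_netfilter

# Let bridged traffic pass through iptables and enable IP forwarding
printf 'net.bridge.bridge-nf-call-iptables = 1\nnet.ipv4.ip_forward = 1\n' | sudo tee /etc/sysctl.d/k8s.conf
sudo sysctl --system

# Install containerd and switch it to the systemd cgroup driver used by the kubelet
sudo apt-get update
sudo apt-get install -y containerd
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl restart containerd

# Install kubeadm, kubelet, kubectl from pkgs.k8s.io (one repository per minor version)
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.28/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.28/deb/ /" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl

# Initialize control plane (on the first control plane node)
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 --control-plane-endpoint=api-server.example.com

# Configure kubectl
mkdir -p $HOME/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

# Install Calico CNI
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml

# Join worker nodes (run on each worker)
sudo kubeadm join api-server.example.com:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>

# Verify cluster
kubectl get nodes
kubectl get pods -n kube-system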
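Before running kubeadm init or kubeadm join, it can help to confirm that the step 1 prerequisites actually hold on each node. The following is a minimal sketch, not part of kubeadm; the function names are hypothetical, and the file paths are parameters only so the checks can be pointed at fixtures. It assumes a Linux node where /proc/swaps and /proc/modules are available, and it will report a module as missing if that module is compiled into the kernel rather than loaded.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helpers for the prerequisites in step 1.

# Succeeds when no swap device is active. /proc/swaps always contains a
# header line, so more than one line means at least one swap area is in use.
swap_is_off() {
  local swaps_file="${1:-/proc/swaps}"
  [ "$(wc -l < "$swaps_file")" -le 1 ]
}

# Succeeds when the named kernel module is listed as loaded.
# Assumption: built-in (non-modular) drivers do not appear in /proc/modules.
module_loaded() {
  local module="$1" modules_file="${2:-/proc/modules}"
  grep -q "^${module} " "$modules_file"
}

# Runs all checks against the live system and reports every failure.
preflight() {
  local failed=0 m
  swap_is_off || { echo "swap is still enabled"; failed=1; }
  for m in overlay br_netfilter; do
    module_loaded "$m" || { echo "kernel module not loaded: $m"; failed=1; }
  done
  return "$failed"
}
```

Calling preflight on a node prints one line per unmet prerequisite and returns non-zero if any check fails, so it can gate a provisioning script.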
Managed Kubernetes Services

Amazon EKS

Amazon EKS (Elastic Kubernetes Service) is a managed Kubernetes service that runs the control plane for you. It's highly available, secure, and integrates with AWS services.

# EKS cluster creation (via eksctl)
eksctl create cluster \
  --name production-cluster \
  --version 1.28 \
  --region us-east-1 \
  --nodegroup-name standard-workers \
  --node-type t3.medium \
  --nodes 3 \
  --nodes-min 1 \
  --nodes-max 10 \
  --managed

# Update kubeconfig
aws eks update-kubeconfig --region us-east-1 --name production-cluster

# View nodes
kubectl get nodes

# EKS best practices:
# - Use IAM roles for service accounts (IRSA)
# - Enable cluster autoscaler
# - Use VPC CNI for networking
# - Enable control plane logging

Google GKE

GKE is Google's managed Kubernetes service with advanced features like auto-scaling, auto-upgrades, and integrated networking.

# GKE cluster creation (via gcloud)
gcloud container clusters create production-cluster \
  --zone us-central1-a \
  --num-nodes 3 \
  --machine-type n1-standard-4 \
  --enable-autoscaling \
  --min-nodes 1 \
  --max-nodes 10 \
  --enable-autorepair \
  --enable-autoupgrade

# Get credentials
gcloud container clusters get-credentials production-cluster --zone us-central1-a

# View nodes
kubectl get nodes

# GKE best practices:
# - Use Workload Identity for authentication
# - Enable network policies
# - Use GKE Ingress controller
# - Enable node auto-provisioning

Azure AKS

AKS is Azure's managed Kubernetes service with deep integration with Azure Active Directory and other Azure services.

# AKS cluster creation (via az)
az aks create \
  --resource-group production-rg \
  --name production-cluster \
  --node-count 3 \
  --node-vm-size Standard_D2s_v3 \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 10 \
  --enable-addons monitoring \
  --generate-ssh-keys

# Get credentials
az aks get-credentials --resource-group production-rg --name production-cluster

# View nodes
kubectl get nodes

# AKS best practices:
# - Use Azure AD integration
# - Enable Azure Policy for AKS
# - Use Azure Monitor for containers
# - Implement network policies
Managed vs Self-Managed: Managed services reduce operational overhead significantly but come with higher costs and vendor lock-in. Self-managed clusters give you full control but require more operational expertise.
Cluster Administration

Day-2 operations are critical for maintaining healthy Kubernetes clusters. Here are the essential administrative tasks and best practices.

Node Management

# Cordon a node (mark unschedulable)
kubectl cordon <node-name>

# Uncordon a node
kubectl uncordon <node-name>

# Drain a node (evict all pods)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# View node status
kubectl get nodes
kubectl describe node <node-name>

# Add taints to node
kubectl taint nodes <node-name> key=value:NoSchedule

# Remove taint
kubectl taint nodes <node-name> key-

# View node capacity
kubectl describe node <node-name> | grep -A 5 Capacity
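The drain and uncordon commands usually bracket a maintenance task, and it is easy to forget the uncordon when the task fails. One way to tie them together is a small wrapper like the sketch below. The function name with_node_drained is hypothetical, and the wrapper assumes kubectl is already configured for the target cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a maintenance command while a node is drained.
# Usage: with_node_drained <node-name> <command> [args...]
with_node_drained() {
  local node="$1"
  shift
  # drain cordons the node first, then evicts its pods.
  # If the drain fails, stop here and leave the node cordoned for inspection.
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1

  # Run the maintenance task and remember whether it succeeded.
  "$@"
  local status=$?

  # Make the node schedulable again even if the task failed.
  kubectl uncordon "$node"
  return "$status"
}
```

For example, with_node_drained worker-1 ssh worker-1 sudo apt-get -y upgrade would drain worker-1, run the upgrade over SSH, and uncordon the node afterwards; the node name and the SSH command are placeholders.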

Cluster Upgrades

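# Upgrade kubeadm first (packages are held, so unhold and re-hold around the install)
sudo apt-mark unhold kubeadm
sudo apt-get update && sudo apt-get install -y kubeadm='1.28.0-*'
sudo apt-mark hold kubeadm

# Check available versions
sudo kubeadm upgrade plan

# Upgrade the first control plane node
sudo kubeadm upgrade apply v1.28.0

# On remaining control plane nodes and on workers (drain each node first)
sudo kubeadm upgrade node

# Upgrade kubelet on each node
sudo apt-mark unhold kubelet
sudo apt-get install -y kubelet='1.28.0-*'
sudo apt-mark hold kubelet
sudo systemctl daemon-reload
sudo systemctl restart kubelet

# For managed clusters (EKS); without --approve eksctl only prints the plan
eksctl upgrade cluster --name production-cluster --version 1.28 --approve

# For GKE
gcloud container clusters upgrade production-cluster --zone us-central1-a --master

# For AKS
az aks upgrade --resource-group production-rg --name production-cluster --kubernetes-version 1.28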
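kubeadm does not support skipping minor versions during an upgrade, so a cluster on 1.26 has to pass through 1.27 before reaching 1.28. A guard like the following sketch can be placed in front of an upgrade script to catch that early. The function names are hypothetical, and the check only compares version strings; it does not query the cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: validate an upgrade path by version string alone.

# major_of "v1.28.3" -> 1
major_of() {
  local v="${1#v}"
  echo "${v%%.*}"
}

# minor_of "v1.28.3" -> 28
minor_of() {
  local v="${1#v}"
  v="${v#*.}"
  echo "${v%%.*}"
}

# Succeeds when going from $1 to $2 stays within the same major version
# and moves forward by at most one minor version (patch upgrades included).
can_upgrade() {
  local from="$1" to="$2"
  [ "$(major_of "$from")" = "$(major_of "$to")" ] || return 1
  local step=$(( $(minor_of "$to") - $(minor_of "$from") ))
  [ "$step" -ge 0 ] && [ "$step" -le 1 ]
}
```

A script could call can_upgrade "$current" "$target" || exit 1 before running kubeadm upgrade apply, where current and target are supplied by the operator.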

Certificate Management

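# Check certificate expiration
sudo kubeadm certs check-expiration

# Renew all certificates
sudo kubeadm certs renew all

# Renew specific certificate
sudo kubeadm certs renew apiserver

# Restart control plane components after renewal. Restarting the kubelet alone
# does not recreate the static pods, so move the manifests out and back in.
sudo mkdir -p /tmp/k8s-manifests
sudo mv /etc/kubernetes/manifests/*.yaml /tmp/k8s-manifests/
sleep 20
sudo mv /tmp/k8s-manifests/*.yaml /etc/kubernetes/manifests/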
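For alerting, it can be useful to check expiry from a cron job instead of reading the kubeadm output by hand. The sketch below uses openssl x509 -checkend for that. The function names are hypothetical, /etc/kubernetes/pki is the kubeadm default location, and the 30-day threshold is an arbitrary assumption. Certificates embedded in kubeconfig files such as admin.conf are not covered, and an unreadable file is reported as expiring.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: flag certificates that are close to expiry.

# Succeeds when the certificate in $1 expires within the next $2 days.
# openssl -checkend exits 0 when the certificate is still valid after N seconds,
# so the result is negated here.
cert_expires_within() {
  local cert="$1" days="$2"
  ! openssl x509 -checkend $(( days * 86400 )) -noout -in "$cert" >/dev/null 2>&1
}

# Prints one line per certificate under the PKI directory that is close to expiry.
report_expiring_certs() {
  local pki_dir="${1:-/etc/kubernetes/pki}" days="${2:-30}" cert
  for cert in "$pki_dir"/*.crt "$pki_dir"/etcd/*.crt; do
    [ -f "$cert" ] || continue
    if cert_expires_within "$cert" "$days"; then
      echo "expires within ${days} days: $cert"
    fi
  done
}
```

Running report_expiring_certs as root on a control plane node prints nothing when every certificate has more than 30 days left, which makes the output easy to feed into an alert.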

Security Administration

# Create a service account
kubectl create serviceaccount my-sa -n default

# Create a role
kubectl create role pod-reader --verb=get,list,watch --resource=pods -n default

# Create role binding
kubectl create rolebinding read-pods --role=pod-reader --serviceaccount=default:my-sa -n default

# View RBAC
kubectl get roles -n default
kubectl get rolebindings -n default

# Configure pod security standards
kubectl apply -f pod-security-standard.yaml
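After creating a role and binding, it is worth confirming that the service account received exactly the intended permissions. kubectl auth can-i supports impersonation through --as, which the sketch below wraps in a small function. The name sa_can is hypothetical, and the caller needs permission to impersonate service accounts (cluster admins normally have it).

```shell
#!/usr/bin/env bash
# Hypothetical helper: ask the API server whether a service account
# may perform a verb on a resource in its own namespace.
# Usage: sa_can <namespace> <serviceaccount> <verb> <resource>
sa_can() {
  local ns="$1" sa="$2" verb="$3" resource="$4"
  kubectl auth can-i "$verb" "$resource" \
    --as="system:serviceaccount:${ns}:${sa}" -n "$ns"
}
```

With the pod-reader role above, sa_can default my-sa list pods should be allowed and sa_can default my-sa delete pods should be denied, since the role grants only get, list, and watch.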

Resource Management

# View resource usage
kubectl top nodes
kubectl top pods

# Set resource quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-quota
  namespace: default
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 10Gi
    limits.cpu: "20"
    limits.memory: 20Gi
    persistentvolumeclaims: "10"
    pods: "50"
---
# Set limit ranges
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: default
spec:
  limits:
  - default:
      cpu: 500m
      memory: 512Mi
    defaultRequest:
      cpu: 250m
      memory: 256Mi
    max:
      cpu: 2
      memory: 2Gi
    type: Container
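The quota and the limit range interact: every container without explicit requests gets the LimitRange defaults, and those defaults count against the ResourceQuota. A quick piece of arithmetic shows which limit is hit first. This is a sketch that assumes single-container pods relying entirely on the defaults; the function name is hypothetical and the numbers are copied from the manifests above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: how many default-sized containers fit into a quota.
# Both arguments use the same unit (millicores for CPU, Mi for memory).
pods_that_fit() {
  echo $(( $1 / $2 ))
}

quota_cpu_m=10000      # requests.cpu: "10"   -> 10000 millicores
default_cpu_m=250      # defaultRequest.cpu: 250m
quota_mem_mi=10240     # requests.memory: 10Gi -> 10240 Mi
default_mem_mi=256     # defaultRequest.memory: 256Mi
pod_cap=50             # pods: "50"

by_cpu=$(pods_that_fit "$quota_cpu_m" "$default_cpu_m")
by_mem=$(pods_that_fit "$quota_mem_mi" "$default_mem_mi")
echo "fits by CPU requests: $by_cpu, by memory requests: $by_mem, pod cap: $pod_cap"

if [ "$by_cpu" -lt "$pod_cap" ] || [ "$by_mem" -lt "$pod_cap" ]; then
  echo "request quotas are exhausted before the pod count limit is reached"
fi
```

With these values both request quotas admit 40 default-sized pods, so the pods: "50" limit only matters for workloads that set requests below the defaults.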
Backup and Disaster Recovery

Regular backups and disaster recovery planning are essential for production clusters. Here are the key strategies.

etcd Backup

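# Backup etcd (run as root on a control plane node; paths are the kubeadm defaults)
ETCDCTL_API=3 etcdctl snapshot save /backups/etcd-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Restore etcd into a new data directory, then point etcd at that directory
ETCDCTL_API=3 etcdctl snapshot restore /backups/etcd-20250115.db --data-dir /var/lib/etcd-backup

# Automated backup script
#!/bin/bash
set -euo pipefail
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key
BACKUP_DIR="/backups/etcd"
DATE=$(date +%Y%m%d-%H%M%S)
mkdir -p "$BACKUP_DIR"
etcdctl snapshot save "$BACKUP_DIR/etcd-$DATE.db"
# Delete snapshots older than 30 days
find "$BACKUP_DIR" -name "etcd-*.db" -mtime +30 -delete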

Application Backup with Velero

# Install Velero (the AWS provider needs its plugin image; pick the plugin version that matches your Velero release)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket production-backups \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1

# Create a backup
velero backup create full-backup --include-namespaces default,production

# Create a scheduled backup
velero schedule create daily-backup --schedule="0 2 * * *" --include-namespaces default,production

# Restore from backup
velero restore create --from-backup full-backup

# View backups
velero backup get
Critical: Always test your restore procedures in a non-production environment. A backup is only useful if you can successfully restore from it. Document your disaster recovery procedures and practice them regularly.
Operational Best Practices

Security

Enable RBAC, use network policies, rotate certificates regularly, and implement pod security standards. Use OPA/Gatekeeper for policy enforcement.

Monitoring

Set up Prometheus, Grafana, and Alertmanager. Monitor control plane components, node resources, and application performance. Define SLOs and SLIs.

Updates

Regularly update Kubernetes, node OS, and container runtimes. Use phased rollouts for updates. Test updates in staging before production.

Storage

Plan storage capacity, use appropriate storage classes, monitor PVC usage, and implement backup strategies for stateful applications.

Cost Management

Use resource quotas, implement node auto-scaling, right-size resources, and use spot instances for non-critical workloads. Monitor cloud costs regularly.

Documentation

Document cluster architecture, operational procedures, and troubleshooting guides. Maintain runbooks for common issues and disaster scenarios.
Frequently Asked Questions
Should I use a managed or self-managed Kubernetes cluster?
Choose managed (EKS, GKE, AKS) if you want to reduce operational overhead and focus on applications. Choose self-managed if you need full control, have specific compliance requirements, or want to avoid vendor lock-in. Many organizations use both.
What are the minimum system requirements for a kubeadm cluster?
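kubeadm itself requires at least 2 GB of RAM per machine and 2 CPUs on control plane nodes. For production, plan for more: at least 4 GB RAM and 2 CPUs on control plane nodes, and at least 2 GB RAM and 1 CPU on worker nodes. Allocate 50 GB or more of storage for the OS and containerd. Test clusters can run at the kubeadm minimums.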
How do I upgrade my Kubernetes cluster without downtime?
Use rolling upgrades with kubeadm or managed services. Upgrade control plane nodes first, then worker nodes. Ensure you have multiple replicas of applications. Use PodDisruptionBudgets to control pod eviction during upgrades.
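A PodDisruptionBudget tells the eviction API how many replicas must stay up while nodes are drained. The sketch below renders a minimal policy/v1 manifest from three parameters. The function name render_pdb is hypothetical, and the app label used in the selector is an assumption about how the workload is labeled.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a PodDisruptionBudget manifest.
# Usage: render_pdb <pdb-name> <app-label-value> <min-available>
render_pdb() {
  local name="$1" app="$2" min_available="$3"
  cat <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ${name}
spec:
  minAvailable: ${min_available}
  selector:
    matchLabels:
      app: ${app}
EOF
}
```

For a Deployment with three replicas labeled app=web, render_pdb web-pdb web 2 | kubectl apply -f - would let a drain evict one pod at a time while keeping two available. Setting minAvailable equal to the replica count would block drains entirely.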
How often should I back up etcd?
Back up etcd daily at minimum. For critical production clusters, consider hourly backups. Always test restore procedures. Store backups in a separate location from the cluster.
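A backup schedule is only useful if something notices when it silently stops. One simple check is to verify that the backup directory contains a snapshot newer than the expected interval, as in the sketch below. The function name is hypothetical, the etcd-*.db naming follows the backup commands earlier in this guide, and -mmin is assumed to be available (it is in GNU and BSD find, though not in POSIX).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when at least one etcd snapshot in
# directory $1 was modified within the last $2 minutes.
has_recent_snapshot() {
  local dir="$1" max_age_min="$2"
  [ -n "$(find "$dir" -name 'etcd-*.db' -mmin "-${max_age_min}" 2>/dev/null)" ]
}
```

For daily backups, a monitoring job could run has_recent_snapshot /backups/etcd 1500 || echo "etcd backup is stale", where 1500 minutes leaves an hour of slack past 24 hours; for hourly backups a threshold around 90 minutes would be a comparable choice.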
What's the difference between cordon, drain, and delete node?
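Cordon marks a node as unschedulable, so no new pods are placed on it while existing pods keep running. Drain cordons the node and then evicts its pods (DaemonSet pods stay in place when --ignore-daemonsets is set). Delete removes the node object from the cluster. For maintenance, drain the node, perform the maintenance, then uncordon it.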
How many control plane nodes do I need for HA?
For high availability, run an odd number of control plane nodes, typically 3 or 5, so etcd can keep quorum: 3 members tolerate one failure and 5 tolerate two. For smaller clusters, 3 is sufficient. In managed services like EKS, GKE, and AKS, the cloud provider runs the control plane.
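The preference for odd member counts follows from how etcd quorum works: a cluster of n members needs floor(n/2) + 1 of them to agree, so adding a fourth member raises the quorum without raising the number of failures the cluster can survive. The arithmetic can be checked with a few lines of shell; the function names are illustrative only.

```shell
#!/usr/bin/env bash
# Quorum size for an etcd cluster with $1 members: floor(n/2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail while the cluster still has quorum.
tolerated_failures() {
  echo $(( $1 - $(quorum "$1") ))
}

for members in 1 2 3 4 5; do
  printf '%s members: quorum %s, tolerates %s failure(s)\n' \
    "$members" "$(quorum "$members")" "$(tolerated_failures "$members")"
done
```

The loop shows that 3 and 4 members both tolerate a single failure, while 5 members tolerate two.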
What CNI plugin should I use?
Calico is a popular choice because of its rich feature set and network policy support. Cilium offers advanced security and observability. Flannel is simple and widely used, but it does not enforce network policies on its own. Choose based on your specific networking and security requirements.
How do I handle node failures automatically?
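Use Node Auto-Repair (GKE) or Auto Scaling groups (AWS) to replace failed nodes. Implement node monitoring and alerting. Run multiple replicas spread across nodes so that a single node failure does not take an application down; PodDisruptionBudgets only limit voluntary evictions such as drains, not crashes.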

A well-designed cluster setup and robust operational practices are the foundation of successful Kubernetes adoption. Choose the right setup for your needs and invest in automation and monitoring.