Master Linux container isolation technologies with namespaces and cgroups. Understand the fundamental building blocks of containers, from process isolation to resource limits, with practical examples and deep technical insights.
Why Linux Namespaces & Cgroups?
Namespaces and cgroups are the fundamental Linux kernel technologies that enable modern containerization.
- Isolation: Namespaces provide process isolation (PID, network, mount, etc.)
- Resource Control: Cgroups limit and monitor resource usage (CPU, memory, I/O)
- Security: Combined with capabilities and seccomp for defense in depth
- Efficiency: Lightweight compared to full virtualization
- Portability: Containers run consistently across different environments
- Density: Run multiple isolated environments on a single host
- Orchestration: Foundation for Docker, Kubernetes, and other container platforms
1. Linux Namespaces Deep Dive
Namespace Types Comparison
| Namespace | Flag | Isolates | Introduced | Key Use Case |
|---|---|---|---|---|
| PID | CLONE_NEWPID | Process IDs | Linux 2.6.24 | Container process trees |
| Network | CLONE_NEWNET | Network devices, ports, routing | Linux 2.6.24 | Container networking |
| Mount | CLONE_NEWNS | Mount points, filesystems | Linux 2.4.19 | Container root filesystem |
| IPC | CLONE_NEWIPC | System V IPC, POSIX queues | Linux 2.6.19 | Inter-process communication |
| UTS | CLONE_NEWUTS | Hostname and NIS domain | Linux 2.6.19 | Container hostname |
| User | CLONE_NEWUSER | User and group IDs | Linux 3.8 | Rootless containers, security |
| Cgroup | CLONE_NEWCGROUP | Cgroup root directory | Linux 4.6 | Cgroup namespace isolation |
| Time | CLONE_NEWTIME | Boot and monotonic clocks | Linux 5.6 | Container time offsets |
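To make the table concrete, the sketch below reads the namespace membership of a process from /proc/PID/ns, where each entry is a symlink such as pid:[4026531836]. Two processes share a namespace of a given type when those inode numbers match. The helper names list_namespaces and same_namespace are illustrative, not part of any standard tool, and reading another user's process usually requires root.

```shell
#!/bin/bash
# list_namespaces [PID]: print "<type> <inode>" for each namespace of a process.
# Defaults to the current shell when no PID is given.
list_namespaces() {
    local pid="${1:-$$}" link target inode
    for link in /proc/"$pid"/ns/*; do
        target=$(readlink "$link") || continue
        inode="${target##*\[}"   # drop everything up to and including "["
        inode="${inode%\]}"      # drop the trailing "]"
        printf '%s %s\n' "${link##*/}" "$inode"
    done
}

# same_namespace PID1 PID2 TYPE: succeed when both processes are in the
# same namespace of the given type (pid, net, mnt, ipc, uts, user, cgroup, time).
same_namespace() {
    local a b
    a=$(readlink "/proc/$1/ns/$3") || return 2
    b=$(readlink "/proc/$2/ns/$3") || return 2
    [ "$a" = "$b" ]
}
```

For example, list_namespaces 1 shows the namespaces of init, and same_namespace $$ 1 net reports whether the current shell shares the host network namespace.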
2. Working with Namespaces
Namespace API & System Calls
Practical Namespace Examples
C Program: Creating a Simple Container
// simple-container.c - Create a minimal container with namespaces
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#define STACK_SIZE (1024 * 1024)
static char container_stack[STACK_SIZE];
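// Container process function: runs as PID 1 inside the new namespaces
int container_main(void* arg) {
(void)arg;
printf("Container [%d] - Inside container!\n", getpid());
// Set hostname in UTS namespace
if (sethostname("container", 9) == -1)
perror("sethostname");
// Make all mounts private so the /proc remount cannot propagate to the host
if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
perror("mount MS_PRIVATE");
// Remount /proc to get accurate PID view
if (mount("proc", "/proc", "proc", 0, NULL) == -1)
perror("mount /proc");
// Replace this process with a shell (only returns on failure)
execl("/bin/bash", "bash", (char *)NULL);
perror("execl");
return 1;
}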
int main(int argc, char *argv[]) {
printf("Host [%d] - Starting container...\n", getpid());
// Flags for namespace creation
int flags = CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWNET | SIGCHLD;
// Create child process with new namespaces
pid_t pid = clone(container_main, container_stack + STACK_SIZE, flags, NULL);
if (pid == -1) {
perror("clone");
exit(EXIT_FAILURE);
}
printf("Host [%d] - Container PID: %d\n", getpid(), pid);
// Wait for container process to exit
waitpid(pid, NULL, 0);
printf("Host [%d] - Container exited\n", getpid());
return 0;
}
// Compile: gcc -o simple-container simple-container.c
// Run: sudo ./simple-container
3. Cgroups (Control Groups) Mastery
Cgroups v2 Architecture
Cgroups v1 vs v2 Comparison
| Feature | Cgroups v1 | Cgroups v2 | Benefits |
|---|---|---|---|
| Hierarchy | Multiple hierarchies | Single unified hierarchy | Simpler, consistent |
| Controllers | Separate mount points | All in one place | Easier management |
| CPU | cpu, cpuacct, cpuset | cpu (accounting built in), cpuset still separate | Simpler CPU accounting |
| Memory | memory controller only | Memory + swap accounting | Better memory management |
| IO | blkio controller | io controller | Unified IO control |
| Write Interface | Various files | Consistent .max files | Predictable API |
| Pressure | Not available | Pressure stall information | Better monitoring |
| Default | Default on older distributions | Declared stable in kernel 4.5; default on most current distributions | Modern systems |
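Which column of this table applies to a given host can be checked from the filesystem type mounted at /sys/fs/cgroup. The following is a minimal sketch, assuming the common systemd layout in which a v1 or hybrid host mounts a tmpfs there and a pure v2 host mounts cgroup2 directly. The function names are hypothetical.

```shell
#!/bin/bash
# classify_cgroup_fs FSTYPE: map a filesystem type name to a cgroup mode.
classify_cgroup_fs() {
    case "$1" in
        cgroup2fs) echo "v2" ;;              # unified hierarchy
        tmpfs)     echo "v1-or-hybrid" ;;    # per-controller mounts underneath
        *)         echo "unknown" ;;
    esac
}

# cgroup_mode [ROOT]: classify the cgroup mount at ROOT (default /sys/fs/cgroup).
cgroup_mode() {
    local root="${1:-/sys/fs/cgroup}"
    classify_cgroup_fs "$(stat -fc %T "$root" 2>/dev/null)"
}
```

Calling cgroup_mode with no arguments inspects the default mount point; the scripts in this guide assume it reports v2.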
Practical Cgroup Examples
#!/bin/bash
# cgroup-manager.sh - Create and manage cgroups with resource limits
set -euo pipefail
CGROUP_ROOT="/sys/fs/cgroup"
CGROUP_NAME="mycontainer"
CGROUP_PATH="$CGROUP_ROOT/$CGROUP_NAME"
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
log() {
local level="$1"
local message="$2"
case "$level" in
"INFO") echo -e "${BLUE}[INFO]${NC} $message" ;;
"SUCCESS") echo -e "${GREEN}[SUCCESS]${NC} $message" ;;
"WARNING") echo -e "${YELLOW}[WARNING]${NC} $message" ;;
"ERROR") echo -e "${RED}[ERROR]${NC} $message" ;;
esac
}
check_cgroup_v2() {
if [[ "$(stat -fc %T "$CGROUP_ROOT" 2>/dev/null)" != "cgroup2fs" ]]; then
log "ERROR" "Cgroups v2 not mounted at /sys/fs/cgroup"
log "INFO" "Try: sudo mount -t cgroup2 cgroup2 /sys/fs/cgroup"
exit 1
fi
log "SUCCESS" "Cgroups v2 is mounted"
}
create_cgroup() {
log "INFO" "Creating cgroup: $CGROUP_NAME"
if [[ -d "$CGROUP_PATH" ]]; then
log "WARNING" "Cgroup already exists, removing..."
rmdir "$CGROUP_PATH" 2>/dev/null || true
fi
mkdir -p "$CGROUP_PATH"
# Enable controllers for the root's children (cgroup.subtree_control applies to child cgroups)
echo "+cpu +memory +io +pids" > "$CGROUP_ROOT/cgroup.subtree_control"
log "SUCCESS" "Created cgroup at $CGROUP_PATH"
}
set_resource_limits() {
local cpu_limit="${1:-0.5}" # 50% of a core
local memory_limit="${2:-100M}" # 100MB memory
local pid_limit="${3:-100}" # 100 processes max
log "INFO" "Setting resource limits:"
log "INFO" " CPU: ${cpu_limit} cores"
log "INFO" " Memory: $memory_limit"
log "INFO" " PIDs: $pid_limit"
# CPU limit (format: $MAX $PERIOD)
# $MAX microseconds per $PERIOD microseconds
local cpu_period=100000 # 100ms in microseconds
local cpu_max=$(echo "$cpu_limit * $cpu_period" | bc | cut -d. -f1)
echo "$cpu_max $cpu_period" > "$CGROUP_PATH/cpu.max"
# Memory limit
echo "$memory_limit" > "$CGROUP_PATH/memory.max"
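# memory.swap.max is only present when swap accounting is enabled
if [[ -f "$CGROUP_PATH/memory.swap.max" ]]; then
echo "50M" > "$CGROUP_PATH/memory.swap.max" # Limit swap usage
fi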
# PID limit
echo "$pid_limit" > "$CGROUP_PATH/pids.max"
# IO limits (example: limit to 1MB/s write on sda)
# Get device major:minor for sda
if [[ -b /dev/sda ]]; then
local device_info=$(lsblk -nd -o MAJ:MIN /dev/sda)
echo "$device_info wbps=1048576" > "$CGROUP_PATH/io.max"
fi
# CPU weight (for sharing)
echo "100" > "$CGROUP_PATH/cpu.weight"
log "SUCCESS" "Resource limits configured"
}
add_process_to_cgroup() {
local pid="${1:-$$}"
log "INFO" "Adding process $pid to cgroup"
echo "$pid" > "$CGROUP_PATH/cgroup.procs"
# Verify
if grep -q "^$pid$" "$CGROUP_PATH/cgroup.procs"; then
log "SUCCESS" "Process $pid added to cgroup"
else
log "ERROR" "Failed to add process to cgroup"
fi
}
run_in_cgroup() {
local command="${1:-/bin/bash}"
log "INFO" "Running command in cgroup: $command"
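# Fork a subshell and move it into the cgroup before exec.
# $BASHPID is the subshell's own PID; $$ would still be the parent script's PID.
(
echo "$BASHPID" > "$CGROUP_PATH/cgroup.procs"
exec $command
) &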
local child_pid=$!
log "SUCCESS" "Started process $child_pid in cgroup"
echo "$child_pid"
}
monitor_cgroup() {
log "INFO" "Monitoring cgroup $CGROUP_NAME..."
echo "=== Cgroup Statistics ==="
while true; do
clear
echo "Cgroup: $CGROUP_NAME"
echo "======================"
# CPU
echo -e "\n${BLUE}CPU Usage:${NC}"
cat "$CGROUP_PATH/cpu.stat" 2>/dev/null || echo "N/A"
# Memory
echo -e "\n${GREEN}Memory Usage:${NC}"
local mem_current=$(cat "$CGROUP_PATH/memory.current" 2>/dev/null || echo "0")
local mem_max=$(cat "$CGROUP_PATH/memory.max" 2>/dev/null || echo "max")
echo "Current: $(numfmt --to=iec $mem_current)"
echo "Limit: $mem_max"
# Memory events
if [[ -f "$CGROUP_PATH/memory.events" ]]; then
echo -e "\nMemory Events:"
grep -E "^(high|max|oom)" "$CGROUP_PATH/memory.events" || true
fi
# PIDs
echo -e "\n${YELLOW}Process Count:${NC}"
local pids_current=$(wc -l < "$CGROUP_PATH/cgroup.procs" 2>/dev/null || echo "0")
local pids_max=$(cat "$CGROUP_PATH/pids.max" 2>/dev/null || echo "max")
echo "Current: $pids_current"
echo "Limit: $pids_max"
# IO
if [[ -f "$CGROUP_PATH/io.stat" ]]; then
echo -e "\n${RED}IO Statistics:${NC}"
cat "$CGROUP_PATH/io.stat" 2>/dev/null || true
fi
echo -e "\nPress Ctrl+C to exit..."
sleep 2
done
}
cleanup() {
log "INFO" "Cleaning up cgroup: $CGROUP_NAME"
# Kill all processes in cgroup
if [[ -f "$CGROUP_PATH/cgroup.procs" ]]; then
while read -r pid; do
kill -9 "$pid" 2>/dev/null || true
done < "$CGROUP_PATH/cgroup.procs"
fi
# Remove cgroup
rmdir "$CGROUP_PATH" 2>/dev/null && log "SUCCESS" "Cgroup removed" || log "WARNING" "Could not remove cgroup (may not be empty)"
}
show_usage() {
echo "Usage: $0 [command]"
echo "Commands:"
echo " create - Create cgroup with default limits"
echo " limits - Set resource limits"
echo " add - Add process to cgroup"
echo " run - Run command in cgroup"
echo " monitor - Monitor cgroup statistics"
echo " cleanup - Remove cgroup"
echo " help - Show this help"
}
main() {
local command="${1:-help}"
case "$command" in
"create")
check_cgroup_v2
create_cgroup
set_resource_limits
;;
"limits")
# ${N:-} keeps "set -u" from aborting when an optional argument is omitted
set_resource_limits "${2:-}" "${3:-}" "${4:-}"
;;
"add")
add_process_to_cgroup "${2:-}"
;;
"run")
run_in_cgroup "${2:-}"
;;
"monitor")
monitor_cgroup
;;
"cleanup")
cleanup
;;
"help"|*)
show_usage
;;
esac
}
# Run main function
main "$@"
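The cpu.max arithmetic in set_resource_limits can be isolated into a small helper. The sketch below converts a fractional core count into the "$MAX $PERIOD" line that cpu.max expects, using awk instead of bc. The function name cpu_max_line and the convention of passing the literal word max for "no limit" are assumptions made for this illustration.

```shell
#!/bin/bash
# cpu_max_line CORES [PERIOD_US]: print the line to write into cpu.max.
# CORES may be fractional (0.5 = half a core) or the word "max" for no limit.
cpu_max_line() {
    local cores="$1" period="${2:-100000}"
    if [ "$cores" = "max" ]; then
        echo "max $period"
        return 0
    fi
    # quota = cores * period, rounded to the nearest microsecond
    awk -v c="$cores" -v p="$period" 'BEGIN { printf "%d %d\n", c * p + 0.5, p }'
}
```

With the default 100 ms period, cpu_max_line 0.5 yields 50000 100000, which is the value the script writes for half a core.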
4. Building a Container from Scratch
Minimal Container Implementation
#!/bin/bash
# minimal-container.sh - Build a container from scratch using namespaces and cgroups
set -euo pipefail
CONTAINER_NAME="mycontainer"
CONTAINER_ID=$(uuidgen | cut -d- -f1)
CONTAINER_ROOT="/var/containers/$CONTAINER_ID"
CONTAINER_CGROUP="containers/$CONTAINER_ID"
# Container image (busybox)
IMAGE_URL="https://busybox.net/downloads/binaries/1.35.0-x86_64-linux-musl/busybox"
IMAGE_FILE="/tmp/busybox"
# Colors
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
log() {
echo -e "${GREEN}[+]${NC} $1"
}
error() {
echo -e "${RED}[!]${NC} $1" >&2
exit 1
}
# Download busybox if not present
download_busybox() {
if [[ ! -f "$IMAGE_FILE" ]]; then
log "Downloading busybox..."
curl -fsSL -o "$IMAGE_FILE" "$IMAGE_URL" || error "Failed to download busybox"
chmod +x "$IMAGE_FILE"
fi
}
# Setup container root filesystem
setup_rootfs() {
log "Setting up container root filesystem..."
mkdir -p "$CONTAINER_ROOT"
mkdir -p "$CONTAINER_ROOT/bin"
mkdir -p "$CONTAINER_ROOT/etc"
mkdir -p "$CONTAINER_ROOT/proc"
mkdir -p "$CONTAINER_ROOT/sys"
mkdir -p "$CONTAINER_ROOT/tmp"
mkdir -p "$CONTAINER_ROOT/dev"
mkdir -p "$CONTAINER_ROOT/home"
# Copy busybox
cp "$IMAGE_FILE" "$CONTAINER_ROOT/bin/busybox"
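# Create symlinks for busybox applets
"$CONTAINER_ROOT/bin/busybox" --list | while read -r applet; do
# Never replace the binary itself with a self-referencing symlink
[[ "$applet" == "busybox" ]] && continue
ln -sf /bin/busybox "$CONTAINER_ROOT/bin/$applet"
done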
# Create minimal /etc files
cat > "$CONTAINER_ROOT/etc/passwd" << EOF
root:x:0:0:root:/root:/bin/sh
EOF
cat > "$CONTAINER_ROOT/etc/group" << EOF
root:x:0:
EOF
cat > "$CONTAINER_ROOT/etc/hostname" << EOF
$CONTAINER_NAME
EOF
cat > "$CONTAINER_ROOT/etc/hosts" << EOF
127.0.0.1 localhost $CONTAINER_NAME
::1 localhost ip6-localhost ip6-loopback
EOF
# Create container init script
cat > "$CONTAINER_ROOT/init.sh" << 'EOF'
#!/bin/sh
# Container init script
# Mount proc and sys
mount -t proc proc /proc
mount -t sysfs sysfs /sys
mount -t tmpfs tmpfs /tmp
mount -t devtmpfs devtmpfs /dev
# Set hostname
hostname $(cat /etc/hostname)
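# Drop to a shell when a terminal is attached. A background job started from
# a script gets /dev/null as stdin, so otherwise idle and wait for nsenter.
echo "Container started as PID $$"
if [ -t 0 ]; then
exec /bin/sh
fi
while true; do sleep 3600; done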
EOF
chmod +x "$CONTAINER_ROOT/init.sh"
}
# Create cgroup for container
setup_cgroup() {
log "Setting up cgroup..."
CGROUP_PATH="/sys/fs/cgroup/$CONTAINER_CGROUP"
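mkdir -p "$CGROUP_PATH"
# Controllers must be enabled in every ancestor's cgroup.subtree_control,
# otherwise cpu.max, memory.max and pids.max do not exist in the leaf cgroup
echo "+cpu +memory +pids" > /sys/fs/cgroup/cgroup.subtree_control
echo "+cpu +memory +pids" > /sys/fs/cgroup/containers/cgroup.subtree_control
# Set resource limits
echo "50000 100000" > "$CGROUP_PATH/cpu.max" # 50% of one CPU
echo "100M" > "$CGROUP_PATH/memory.max" # 100MB memory
# memory.swap.max is only present when swap accounting is enabled
if [[ -f "$CGROUP_PATH/memory.swap.max" ]]; then
echo "50M" > "$CGROUP_PATH/memory.swap.max" # 50MB swap
fi
echo "100" > "$CGROUP_PATH/pids.max" # 100 processes max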
}
# Start container
start_container() {
log "Starting container $CONTAINER_NAME..."
# Create network namespace
ip netns add "$CONTAINER_NAME" 2>/dev/null || true
# Start the container process with namespaces.
# The subshell joins the cgroup before exec, so every process forked later
# inherits the limits. "ip netns exec" joins the existing network namespace;
# unshare --net=FILE would create a new one instead of joining it.
(
echo "$BASHPID" > "/sys/fs/cgroup/$CONTAINER_CGROUP/cgroup.procs"
exec ip netns exec "$CONTAINER_NAME" unshare \
--pid --fork \
--mount \
--uts \
--ipc \
--user --map-root-user \
--cgroup \
chroot "$CONTAINER_ROOT" /init.sh
) &
CONTAINER_PID=$!
log "Container started with PID: $CONTAINER_PID"
# Setup network in namespace
setup_network
log "Container is running!"
log "Attach with: nsenter -t $CONTAINER_PID -a"
}
# Setup container network
setup_network() {
log "Setting up container network..."
# Create veth pair
ip link add veth0 type veth peer name veth1
# Move veth1 to container namespace
ip link set veth1 netns "$CONTAINER_NAME"
# Configure host side
ip addr add 10.0.0.1/24 dev veth0
ip link set veth0 up
# Configure container side
ip netns exec "$CONTAINER_NAME" ip addr add 10.0.0.2/24 dev veth1
ip netns exec "$CONTAINER_NAME" ip link set veth1 up
ip netns exec "$CONTAINER_NAME" ip link set lo up
ip netns exec "$CONTAINER_NAME" ip route add default via 10.0.0.1
# Enable NAT
iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -j MASQUERADE
iptables -A FORWARD -i veth0 -j ACCEPT
iptables -A FORWARD -o veth0 -j ACCEPT
# Enable IP forwarding
echo 1 > /proc/sys/net/ipv4/ip_forward
}
# Stop container
stop_container() {
log "Stopping container..."
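# Kill every process in the container's cgroup. Killing only $CONTAINER_PID
# would stop the unshare parent and leave the forked init running.
local procs="/sys/fs/cgroup/$CONTAINER_CGROUP/cgroup.procs"
if [[ -f "$procs" ]]; then
while read -r pid; do
kill -9 "$pid" 2>/dev/null || true
done < "$procs"
fi
kill -9 "$CONTAINER_PID" 2>/dev/null || true
wait "$CONTAINER_PID" 2>/dev/null || true
# Cleanup network: namespace, veth pair, and the rules added in setup_network
ip netns del "$CONTAINER_NAME" 2>/dev/null || true
ip link del veth0 2>/dev/null || true
iptables -t nat -D POSTROUTING -s 10.0.0.0/24 -j MASQUERADE 2>/dev/null || true
iptables -D FORWARD -i veth0 -j ACCEPT 2>/dev/null || true
iptables -D FORWARD -o veth0 -j ACCEPT 2>/dev/null || true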
# Cleanup cgroup
rmdir "/sys/fs/cgroup/$CONTAINER_CGROUP" 2>/dev/null || true
log "Container stopped"
}
# Cleanup container filesystem
cleanup() {
log "Cleaning up..."
rm -rf "$CONTAINER_ROOT"
}
# Main execution
main() {
download_busybox
setup_rootfs
setup_cgroup
start_container
# Wait for user input
echo "Press Enter to stop container..."
read
stop_container
cleanup
}
# Run main function
main "$@"
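The script above depends on several external commands (uuidgen, curl, unshare, ip, iptables, chroot) and fails partway through if one is absent. A pre-flight check along the following lines may help. It is a sketch only, and missing_tools is a hypothetical helper name, not something the script defines.

```shell
#!/bin/bash
# missing_tools NAME...: print each command name that is not found on PATH.
# Prints nothing when every tool is available.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

One way to use it is to call missing_tools uuidgen curl unshare ip iptables chroot before download_busybox and stop with an error message if the output is non-empty.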
#!/bin/bash
# container-manager.sh - Complete container lifecycle management
CONTAINERS_DIR="/var/containers"
CGROUP_ROOT="/sys/fs/cgroup"
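list_containers() {
echo "Running containers:"
echo "=================="
# Treat a process as containerized when its PID namespace differs from ours
local host_ns
host_ns=$(readlink /proc/self/ns/pid | cut -d[ -f2 | cut -d] -f1)
for pid in $(find /proc -maxdepth 1 -type d -name '[0-9]*'); do
if [[ -f "$pid/ns/pid" ]]; then
ns_id=$(readlink "$pid/ns/pid" | cut -d[ -f2 | cut -d] -f1)
[[ -n "$ns_id" && "$ns_id" != "$host_ns" ]] && echo "PID: $(basename "$pid") | Namespace: $ns_id"
fi
done | sort -u
echo ""
echo "Cgroups:"
echo "========"
# minimal-container.sh creates its cgroups under $CGROUP_ROOT/containers/<id>
find "$CGROUP_ROOT/containers" -mindepth 1 -maxdepth 1 -type d 2>/dev/null | while read -r cg; do
echo "Cgroup: $(basename "$cg")"
[[ -f "$cg/cgroup.procs" ]] && echo " PIDs: $(tr '\n' ' ' < "$cg/cgroup.procs")"
done
}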
inspect_container() {
local container_pid="$1"
if [[ -z "$container_pid" || ! -d "/proc/$container_pid" ]]; then
echo "Error: Process $container_pid does not exist"
return 1
fi
echo "=== Container Inspection: PID $container_pid ==="
echo ""
# Namespaces
echo "Namespaces:"
ls -la "/proc/$container_pid/ns/" | tail -n +2
echo ""
# Cgroup
echo "Cgroup:"
cat "/proc/$container_pid/cgroup"
echo ""
# Process tree
echo "Process Tree:"
pstree -p "$container_pid"
echo ""
# Resource usage
echo "Resource Usage:"
echo "CPU: $(ps -p $container_pid -o %cpu --no-headers)%"
echo "Memory: $(ps -p $container_pid -o %mem --no-headers)%"
echo ""
# Network
echo "Network Namespace:"
ip netns identify "$container_pid" 2>/dev/null || echo "Not in network namespace"
}
container_stats() {
local container_pid="$1"
local cgroup_path=$(cat "/proc/$container_pid/cgroup" | grep -o "containers/.*" | cut -d: -f2)
if [[ -z "$cgroup_path" ]]; then
echo "No cgroup found for container"
return 1
fi
echo "=== Container Statistics ==="
echo ""
# CPU
echo "CPU:"
cat "/sys/fs/cgroup/$cgroup_path/cpu.stat" 2>/dev/null || echo "N/A"
echo ""
# Memory
echo "Memory:"
echo "Current: $(cat /sys/fs/cgroup/$cgroup_path/memory.current 2>/dev/null || echo 0) bytes"
echo "Limit: $(cat /sys/fs/cgroup/$cgroup_path/memory.max 2>/dev/null || echo max)"
echo ""
# IO
echo "IO:"
cat "/sys/fs/cgroup/$cgroup_path/io.stat" 2>/dev/null || echo "N/A"
}
# Main menu
case "${1:-list}" in
"list")
list_containers
;;
"inspect")
inspect_container "$2"
;;
"stats")
container_stats "$2"
;;
"help")
echo "Usage: $0 [command]"
echo "Commands:"
echo " list - List all containers"
echo " inspect - Inspect specific container"
echo " stats - Show container statistics"
echo " help - Show this help"
;;
*)
echo "Unknown command: $1"
echo "Use: $0 help"
;;
esac
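The stats command has to turn a PID into a cgroup directory. On cgroups v2 the kernel reports this in /proc/PID/cgroup as a single line of the form 0::/path. The sketch below extracts that path; the function name cgroup_v2_path is illustrative, and it takes a file argument so it can be tried on saved copies as well as live /proc entries.

```shell
#!/bin/bash
# cgroup_v2_path [FILE]: print the cgroup v2 path recorded in a
# /proc/PID/cgroup file (default: the current process).
# The v2 entry has hierarchy ID 0 and an empty controller list: "0::/path".
cgroup_v2_path() {
    local file="${1:-/proc/self/cgroup}"
    sed -n 's/^0:://p' "$file" | head -n 1
}
```

Prefixing the result with /sys/fs/cgroup gives the directory that holds cpu.stat, memory.current, and io.stat for that process.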
5. Security Considerations
1. Use user namespaces: Enable rootless containers with UID mapping
2. Limit capabilities: Drop unnecessary Linux capabilities
3. Apply seccomp profiles: Restrict system calls
4. Use AppArmor/SELinux: Mandatory access controls
5. Set resource limits: Prevent DoS with cgroups
6. Read-only root filesystem: Mount rootfs as read-only
7. Use no-new-privileges: Prevent privilege escalation
8. Network isolation: Use private network namespaces
9. Regular updates: Keep kernel and container tools updated
10. Audit and monitor: Log container activities
Security Hardening Examples
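Several of the practices above leave a visible trace in /proc/PID/status, which makes it possible to audit a running container process. The following sketch reports three of those fields: NoNewPrivs (1 when no-new-privileges is set), Seccomp (0 disabled, 1 strict, 2 filter), and CapEff (the effective capability mask in hex). The function name security_summary and its output format are assumptions for illustration; it only reads state and does not harden anything.

```shell
#!/bin/bash
# security_summary [FILE]: print selected hardening-related fields from a
# /proc/PID/status file (default: the current process).
security_summary() {
    local status="${1:-/proc/self/status}"
    awk -F':[ \t]*' '
        $1 == "CapEff"     { print "cap_eff=" $2 }
        $1 == "NoNewPrivs" { print "no_new_privs=" $2 }
        $1 == "Seccomp"    { print "seccomp=" $2 }
    ' "$status"
}
```

Running security_summary /proc/PID/status against a container's init process shows at a glance whether capabilities were dropped and whether a seccomp filter is active.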
6. Performance Monitoring & Troubleshooting
Monitoring Tools and Commands
| Tool | Purpose | Command | Output |
|---|---|---|---|
| lsns | List namespaces | lsns -p $$ | Namespace IDs and types |
| nsenter | Enter namespace | nsenter -t PID -n ip addr | Network config in namespace |
| systemd-cgtop | Cgroup monitoring | systemd-cgtop | Real-time cgroup resource usage |
| bpftrace | Tracing | bpftrace -e 'tracepoint:syscalls:sys_enter_* { @[probe] = count(); }' | System call tracing |
| perf | Performance analysis | perf stat -a sleep 1 | System performance counters |
| cat /proc/$$/status | Process status | grep -i NSpid /proc/$$/status | Namespace PID mapping |
| findmnt | Mount info | findmnt --kernel | Kernel mount information |
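The NSpid row deserves a closer look, since it is the quickest way to map a container's PID 1 back to its host PID. The NSpid line lists the process ID as seen from each nested PID namespace, outermost first. A minimal sketch for extracting it follows; nspid_chain is a hypothetical helper name.

```shell
#!/bin/bash
# nspid_chain [FILE]: print the PIDs from the NSpid line of a
# /proc/PID/status file, outermost namespace first, separated by spaces.
nspid_chain() {
    local status="${1:-/proc/self/status}"
    awk '$1 == "NSpid:" {
        for (i = 2; i <= NF; i++)
            printf "%s%s", $i, (i < NF ? " " : "\n")
    }' "$status"
}
```

A process that is PID 1 inside one nested PID namespace shows two values, its host PID followed by 1; a host process shows a single value.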
Troubleshooting Common Issues
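- Namespace support: check the kernel with uname -r and grep NAMESPACE /boot/config-$(uname -r)
- Unprivileged user namespaces (Debian/Ubuntu kernels): sysctl kernel.unprivileged_userns_clone=1
- Memory limits: check memory.max
- Network namespaces: ip netns list
- Resource usage: inspect with systemd-cgtop and adjust cgroup limits
Master Container Isolation Fundamentals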
Linux namespaces and cgroups are the foundational technologies that enable modern containerization. Namespaces provide isolation (process, network, filesystem, etc.) while cgroups provide resource control (CPU, memory, I/O limits). Together, they create the secure, efficient environments that power Docker, Kubernetes, and cloud-native applications.
Key Takeaways:
- Understand each namespace type and its isolation purpose.
- Master cgroups v2 for resource management.
- Combine namespaces and cgroups to build containers from scratch.
- Implement security best practices with capabilities, seccomp, and user namespaces.
- Monitor and troubleshoot container performance effectively.
Next Steps:
- Experiment with manual namespace creation using unshare and nsenter.
- Build a minimal container from scratch.
- Explore Docker internals to see how these technologies are used in production.
- Study Kubernetes pod isolation to understand multi-container orchestration.
- Continuously monitor container security and performance in your environments.