Master the art of Linux troubleshooting with this comprehensive guide to performance and boot issues. Learn the systematic approaches, diagnostic tools, and proven fixes that system administrators use to resolve slow systems, boot failures, and performance bottlenecks.
The Troubleshooting Mindset
Effective troubleshooting requires a systematic approach. Follow these principles:
- Start Simple: Check basic issues before diving deep
- Gather Evidence: Collect logs and metrics before making changes
- Isolate Problems: Determine whether the issue is hardware, software, or configuration
- Document Everything: Track what you've tried and what worked
- Test Incrementally: Make one change at a time and test results
1. Define the problem: What exactly is wrong? Slow response? Won't boot? Service down?
2. Gather information: Check logs, monitor resources, reproduce the issue
3. Form a hypothesis: Based on the evidence, what could be causing the problem?
4. Test: Run diagnostic commands, test solutions, verify fixes
5. Resolve and document: Apply the fix, monitor results, document the resolution
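The "gather evidence before making changes" principle can be sketched as a small snapshot helper. This is a minimal sketch rather than part of the original guide: the function names `capture` and `collect_evidence`, the output file names, and the choice of commands are all assumptions to adapt to your own systems.

```shell
#!/bin/bash
# Hypothetical helpers (not from the original guide): snapshot system state
# into a directory before changing anything.

capture() {
    # capture <outfile> <command...>
    # Best-effort: a missing or failing command leaves a note instead of aborting.
    local out="$1"; shift
    "$@" > "$out" 2>&1 || echo "(failed or unavailable: $*)" >> "$out"
}

collect_evidence() {
    # $1: directory that will hold the snapshot (created if needed)
    local dir="$1"
    mkdir -p "$dir" || return 1
    capture "$dir/timestamp.txt"      date
    capture "$dir/uptime.txt"         uptime
    capture "$dir/disk.txt"           df -hP
    capture "$dir/memory.txt"         free -h
    capture "$dir/processes.txt"      ps aux --sort=-%cpu
    capture "$dir/journal-errors.txt" journalctl -b -p err --no-pager -n 100
    echo "$dir"
}
```

A call such as `collect_evidence "/tmp/evidence-$(date +%Y%m%d_%H%M%S)"` before each change leaves a dated record that can be compared with the state after the fix.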
1. Boot Issues Troubleshooting
Boot Diagnostic Commands
- fsck /dev/sda1
- journalctl -xb
- systemd-analyze
- systemd-analyze blame
- fsck -y /dev/sda1
- smartctl -a /dev/sda
Boot Timeline Analysis
1. Firmware (BIOS/UEFI): Hardware detection, firmware initialization
2. Bootloader: Load kernel and initramfs
3. Kernel: Hardware drivers, filesystem setup
4. systemd (userspace): Network, login manager, user services
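On systemd machines, `systemd-analyze time` reports how long each of these phases took on a single line. The sketch below splits that line into one phase per row so the slow stage stands out. The function name `boot_phases` is hypothetical, and the parsing assumes the usual `Startup finished in 4.512s (firmware) + ... = 20.823s` layout; treat it as an illustration, not a stable interface.

```shell
#!/bin/bash
# Hypothetical helper (not from the original guide).
boot_phases() {
    # stdin:  the output of `systemd-analyze time`
    # stdout: one "phase: duration" line per boot phase
    grep -o '[0-9][0-9a-z. ]* ([a-z]*)' |
        sed -E 's/^(.*) \(([a-z]+)\)$/\2: \1/'
}
```

Typical usage would be `systemd-analyze time | boot_phases`. Firmware and loader timings only appear when the firmware exposes them, so some systems list just the kernel and userspace phases.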
Boot Issue Resolution Script
#!/bin/bash
# boot-troubleshooter.sh
# Comprehensive boot issue diagnostics
REPORT_FILE="/tmp/boot-diagnostics-$(date +%Y%m%d_%H%M%S).txt"
echo "=== Boot Troubleshooting Diagnostics ===" > "$REPORT_FILE"
echo "Generated: $(date)" >> "$REPORT_FILE"
echo "Hostname: $(hostname)" >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
# 1. Systemd boot analysis
echo "1. SYSTEMD BOOT ANALYSIS" >> "$REPORT_FILE"
echo "=======================" >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
echo "Total boot time:" >> "$REPORT_FILE"
# Redirect stdout first, then stderr, so both streams land in the report
systemd-analyze >> "$REPORT_FILE" 2>&1
echo "" >> "$REPORT_FILE"
echo "Service initialization times:" >> "$REPORT_FILE"
systemd-analyze blame 2>&1 | head -20 >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
echo "Critical chain (slowest services):" >> "$REPORT_FILE"
systemd-analyze critical-chain 2>&1 | head -20 >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
# 2. Filesystem checks
echo "2. FILESYSTEM STATUS" >> "$REPORT_FILE"
echo "====================" >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
echo "Mounted filesystems:" >> "$REPORT_FILE"
mount | grep "^/dev" >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
echo "Disk space:" >> "$REPORT_FILE"
df -h >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
echo "Inode usage:" >> "$REPORT_FILE"
df -i >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
# 3. Boot logs
echo "3. BOOT LOGS" >> "$REPORT_FILE"
echo "============" >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
echo "Recent kernel messages:" >> "$REPORT_FILE"
dmesg | tail -50 >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
echo "Current boot errors:" >> "$REPORT_FILE"
journalctl -b -p err 2>&1 | tail -20 >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
# 4. Hardware status
echo "4. HARDWARE STATUS" >> "$REPORT_FILE"
echo "==================" >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
echo "Memory status:" >> "$REPORT_FILE"
free -h >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
echo "CPU information:" >> "$REPORT_FILE"
lscpu | grep -E "Model name|CPU\(s\)|Thread|Core" >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
# 5. Recommendations
echo "5. RECOMMENDATIONS" >> "$REPORT_FILE"
echo "==================" >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
# Check for common issues
if systemd-analyze time 2>&1 | grep -q "min"; then
echo "⚠️ SLOW BOOT DETECTED" >> "$REPORT_FILE"
echo " Consider disabling unnecessary services:" >> "$REPORT_FILE"
systemd-analyze blame 2>&1 | grep "min\|s " | head -5 >> "$REPORT_FILE"
fi
if df -h / | grep -q "100%"; then
echo "⚠️ ROOT FILESYSTEM FULL" >> "$REPORT_FILE"
echo " Clean up disk space immediately" >> "$REPORT_FILE"
fi
# -p err already filters by priority, so count lines instead of grepping for the word "error"
ERROR_COUNT=$(journalctl -b -p err -q --no-pager 2>/dev/null | wc -l)
if [ "$ERROR_COUNT" -gt 0 ]; then
echo "⚠️ BOOT ERRORS DETECTED: $ERROR_COUNT log lines at priority err or higher" >> "$REPORT_FILE"
echo " Check journalctl -b -p err for details" >> "$REPORT_FILE"
fi
echo "" >> "$REPORT_FILE"
echo "Diagnostics saved to: $REPORT_FILE" >> "$REPORT_FILE"
echo "Use 'cat $REPORT_FILE | less' to view report" >> "$REPORT_FILE"
echo "Boot diagnostics completed. Report: $REPORT_FILE"
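The script flags a slow boot by looking for the string "min" in the `systemd-analyze` output, which only triggers once boot time passes one minute. If a different threshold is wanted, the duration can be converted to seconds and compared numerically. The following is a minimal sketch: `duration_to_seconds` is a hypothetical name, and it assumes durations formatted like `1min 2.345s`, `850ms`, or `12.3s`.

```shell
#!/bin/bash
# Hypothetical helper (not from the original guide).
duration_to_seconds() {
    # $1: a duration such as "1min 2.345s", "850ms" or "12.3s"
    # Prints the total in seconds with millisecond precision.
    awk -v d="$1" 'BEGIN {
        n = split(d, parts, " ")
        total = 0
        for (i = 1; i <= n; i++) {
            p = parts[i]
            if (p ~ /ms$/)       { sub(/ms$/, "", p);  total += p / 1000 }
            else if (p ~ /min$/) { sub(/min$/, "", p); total += p * 60 }
            else if (p ~ /h$/)   { sub(/h$/, "", p);   total += p * 3600 }
            else if (p ~ /s$/)   { sub(/s$/, "", p);   total += p }
        }
        printf "%.3f\n", total
    }'
}
```

For example, `duration_to_seconds "$(systemd-analyze time | sed -n '1s/.*= //p')"` would give the total boot time in seconds, assuming the first output line ends in `= <total>`.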
2. Performance Issues Troubleshooting
Performance Diagnostic Commands
- CPU: top, htop, pidstat, perf
- Memory: free, vmstat, slabtop, ps aux
- Disk: iostat, iotop, dmesg, smartctl
- Network: iftop, nethogs, netstat, ss
- Application: strace, lsof, perf, application logs
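Several of these tools are not installed by default (`iostat` and `pidstat`, for instance, ship in the sysstat package), so it helps to check what is available before an incident. A minimal sketch, with the hypothetical function name `missing_tools`:

```shell
#!/bin/bash
# Hypothetical helper (not from the original guide).
missing_tools() {
    # Prints each named command that is not found on PATH, one per line.
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Running `missing_tools pidstat iostat iotop iftop nethogs ss strace lsof bc` on a healthy system shows which packages to install ahead of time; empty output means every tool was found.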
Performance Troubleshooting Script
#!/bin/bash
# performance-troubleshooter.sh
# Comprehensive performance diagnostics
REPORT_FILE="/tmp/performance-diagnostics-$(date +%Y%m%d_%H%M%S).txt"
ALERT_THRESHOLDS=(
"CPU_LOAD=4.0"
"MEMORY_PERCENT=90"
"DISK_PERCENT=85"
"SWAP_PERCENT=50"
)
echo "=== Performance Troubleshooting Diagnostics ===" > "$REPORT_FILE"
echo "Generated: $(date)" >> "$REPORT_FILE"
echo "Hostname: $(hostname)" >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
# 1. System Overview
echo "1. SYSTEM OVERVIEW" >> "$REPORT_FILE"
echo "==================" >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
echo "Uptime and load averages:" >> "$REPORT_FILE"
uptime >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
echo "CPU information:" >> "$REPORT_FILE"
lscpu | grep -E "Model name|CPU\(s\)|Thread|Core|MHz" >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
# 2. CPU Analysis
echo "2. CPU ANALYSIS" >> "$REPORT_FILE"
echo "===============" >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
echo "Current CPU usage:" >> "$REPORT_FILE"
top -b -n 1 | grep "^%Cpu" >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
echo "Load averages (1, 5, 15 min):" >> "$REPORT_FILE"
cat /proc/loadavg >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
echo "Top 5 CPU processes:" >> "$REPORT_FILE"
ps aux --sort=-%cpu | head -6 >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
# 3. Memory Analysis
echo "3. MEMORY ANALYSIS" >> "$REPORT_FILE"
echo "==================" >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
echo "Memory usage:" >> "$REPORT_FILE"
free -h >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
echo "Memory pressure:" >> "$REPORT_FILE"
vmstat 1 3 >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
echo "Top 5 memory processes:" >> "$REPORT_FILE"
ps aux --sort=-%mem | head -6 >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
# 4. Disk Analysis
echo "4. DISK ANALYSIS" >> "$REPORT_FILE"
echo "================" >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
echo "Disk space:" >> "$REPORT_FILE"
df -h >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
echo "I/O statistics:" >> "$REPORT_FILE"
iostat -x 1 2 | tail -10 >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
echo "Large directories (top 10):" >> "$REPORT_FILE"
du -sh /* 2>/dev/null | sort -hr | head -10 >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
# 5. Network Analysis
echo "5. NETWORK ANALYSIS" >> "$REPORT_FILE"
echo "===================" >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
echo "Network interfaces:" >> "$REPORT_FILE"
ip addr show | grep -E "^[0-9]|inet " >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
echo "Active connections:" >> "$REPORT_FILE"
ss -tun | head -20 >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
# 6. Issue Detection
echo "6. ISSUE DETECTION" >> "$REPORT_FILE"
echo "==================" >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
ALERTS=()
# Check CPU load
LOAD1=$(uptime | awk -F'load average:' '{print $2}' | cut -d, -f1 | xargs)
CPU_CORES=$(nproc)
if (( $(echo "$LOAD1 > $CPU_CORES * 1.5" | bc -l) )); then
ALERTS+=("⚠️ HIGH CPU LOAD: $LOAD1 (cores: $CPU_CORES)")
fi
# Check memory
MEM_PERCENT=$(free | grep Mem | awk '{print $3/$2 * 100.0}')
if (( $(echo "$MEM_PERCENT > 90" | bc -l) )); then
ALERTS+=("⚠️ HIGH MEMORY USAGE: ${MEM_PERCENT}%")
fi
# Check disk space
ROOT_USAGE=$(df -h / | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$ROOT_USAGE" -ge 85 ]; then
ALERTS+=("⚠️ HIGH DISK USAGE: ${ROOT_USAGE}% on /")
fi
# Report alerts
if [ ${#ALERTS[@]} -eq 0 ]; then
echo "✅ No critical performance issues detected" >> "$REPORT_FILE"
else
echo "ISSUES DETECTED:" >> "$REPORT_FILE"
for alert in "${ALERTS[@]}"; do
echo " $alert" >> "$REPORT_FILE"
done
echo "" >> "$REPORT_FILE"
echo "RECOMMENDATIONS:" >> "$REPORT_FILE"
echo " 1. Check top processes for resource hogs" >> "$REPORT_FILE"
echo " 2. Monitor system logs for errors" >> "$REPORT_FILE"
echo " 3. Consider adding resources if issues persist" >> "$REPORT_FILE"
fi
echo "" >> "$REPORT_FILE"
echo "Diagnostics saved to: $REPORT_FILE" >> "$REPORT_FILE"
echo "Performance diagnostics completed. Report: $REPORT_FILE"
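The threshold checks in the script depend on `bc`, which is missing from some minimal installations. As an alternative sketch, the same decimal comparisons can be done with `awk`. The function names `float_gt` and `load_is_high` are hypothetical, and the 1.5 multiplier simply mirrors the value used in the script.

```shell
#!/bin/bash
# Hypothetical helpers (not from the original guide).
float_gt() {
    # Succeeds (exit 0) when $1 > $2; adding 0 forces a numeric comparison.
    awk -v a="$1" -v b="$2" 'BEGIN { exit !(a + 0 > b + 0) }'
}

load_is_high() {
    # $1: 1-minute load average, $2: core count,
    # $3: per-core multiplier (defaults to 1.5, as in the script)
    local load="$1" cores="$2" factor="${3:-1.5}"
    local limit
    limit=$(awk -v c="$cores" -v f="$factor" 'BEGIN { print c * f }')
    float_gt "$load" "$limit"
}
```

One way to use it: `read -r load1 _ < /proc/loadavg; load_is_high "$load1" "$(nproc)" && echo "high load"`. Reading `/proc/loadavg` directly also avoids parsing the `uptime` output.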
3. Common Performance Problems & Solutions
Quick Fix Commands for Common Issues
- High CPU usage: top or htop, strace -p PID, ps aux | grep Z, kill -9 PID
- High memory usage: watch free -h, ps aux --sort=-%mem, pmap PID
- Slow disk I/O: iostat -x 1 5, iotop -o, smartctl -a /dev/sda
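Searching `ps aux` output for the letter Z also matches any command or user name that contains a capital Z. Filtering on the process state column is more precise. The sketch below assumes input in the format produced by `ps -eo pid,ppid,stat,comm`; the function name `filter_zombies` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (not from the original guide).
filter_zombies() {
    # stdin:  output of `ps -eo pid,ppid,stat,comm` (header on the first line)
    # stdout: "PID PPID COMMAND" for every process whose state starts with Z
    awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2, $4 }'
}
```

Usage: `ps -eo pid,ppid,stat,comm | filter_zombies`. A zombie has already exited, so `kill -9` on the zombie itself has no effect; the parent process (the PPID column) has to reap it, or be restarted.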
4. Advanced Troubleshooting Techniques
Process Tracing and Debugging
# Trace system calls of a process
strace -p PID # Attach to running process
strace -f command # Trace child processes
strace -e trace=file command # Trace only file operations
strace -c command # Count system calls
# Trace library calls
ltrace -p PID # Library call tracing
ltrace -c command # Count library calls
# Check open files and connections
lsof -p PID # Files opened by process
lsof -i :80 # Processes using port 80
lsof -u username # Files opened by user
# Debug with gdb
gdb -p PID # Attach debugger to process
gdb --args program args # Debug from start
# Memory analysis
valgrind --leak-check=full program # Memory leak detection
valgrind --tool=callgrind program # Call profiling
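When `lsof` is unavailable, or only a quick count is needed (for example, to spot a file descriptor leak), the same information can be read from `/proc`. A minimal sketch, assuming a Linux `/proc` filesystem and the hypothetical function name `fd_count`:

```shell
#!/bin/bash
# Hypothetical helper (not from the original guide).
fd_count() {
    # $1: PID. Counts the entries in /proc/<pid>/fd, one per open descriptor.
    local pid="$1"
    [ -d "/proc/$pid/fd" ] || { echo "no such process: $pid" >&2; return 1; }
    ls "/proc/$pid/fd" 2>/dev/null | wc -l
}
```

`fd_count "$$"` reports the current shell's descriptors. Comparing the count for a suspect process against its limit in `/proc/PID/limits` shows how close it is to running out. Inspecting another user's process generally requires root.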
Kernel and System Analysis
# Kernel parameters
sysctl -a # View all kernel parameters
sysctl -w parameter=value # Change parameter temporarily
# Interrupts and IRQs
cat /proc/interrupts # View interrupt distribution
mpstat -I ALL 1 5 # Interrupt statistics
# Scheduler and process info
chrt -p PID # View process scheduling policy
taskset -p PID # View CPU affinity
taskset -pc 0,2 PID # Set CPU affinity (cores 0,2)
# Power management
cpupower frequency-info # CPU frequency information
cpupower frequency-set -g performance # Set performance mode
# Hardware errors
dmesg | grep -i error # Kernel errors
journalctl -k -p err # Kernel errors from journal
mcelog # Machine Check Exceptions (hardware)
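Values set with `sysctl -w` are lost at reboot. Each parameter is also exposed as a file under `/proc/sys`, which is handy for checking the running value from a script. The sketch below shows that mapping; `sysctl_path` and `sysctl_read` are hypothetical names, and the simple dot-to-slash translation is an assumption that does not hold for keys containing interface names with dots in them.

```shell
#!/bin/bash
# Hypothetical helpers (not from the original guide).
sysctl_path() {
    # vm.swappiness -> /proc/sys/vm/swappiness
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr . /)"
}

sysctl_read() {
    # Prints the running value of a parameter, e.g. sysctl_read vm.swappiness
    cat "$(sysctl_path "$1")"
}
```

To keep a change across reboots, the usual approach is a `parameter = value` line in a file under `/etc/sysctl.d/`, loaded with `sysctl --system`.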
5. Preventive Maintenance Script
#!/bin/bash
# preventive-maintenance.sh
# Regular system maintenance to prevent issues
LOG_FILE="/var/log/maintenance.log"
THRESHOLD_DISK=80
THRESHOLD_INODE=80
THRESHOLD_LOAD=4.0
echo "=== Preventive Maintenance - $(date) ===" | tee -a "$LOG_FILE"
# 1. Disk Space Check
echo "" | tee -a "$LOG_FILE"
echo "1. DISK SPACE CHECK" | tee -a "$LOG_FILE"
echo "===================" | tee -a "$LOG_FILE"
# -P forces POSIX output: one line per filesystem, even with long device names
df -hP | grep '^/dev/' | while read -r line; do
USE_PERCENT=$(echo "$line" | awk '{print $5}' | sed 's/%//')
MOUNT=$(echo "$line" | awk '{print $6}')
if [ "$USE_PERCENT" -ge "$THRESHOLD_DISK" ]; then
echo "⚠️ High disk usage on $MOUNT: ${USE_PERCENT}%" | tee -a "$LOG_FILE"
# Cleanup suggestions
if [ "$MOUNT" == "/" ]; then
echo " Cleanup commands:" | tee -a "$LOG_FILE"
echo " - journalctl --vacuum-size=200M" | tee -a "$LOG_FILE"
echo " - apt-get clean (Debian/Ubuntu)" | tee -a "$LOG_FILE"
echo " - yum clean all (RHEL/CentOS)" | tee -a "$LOG_FILE"
echo " - find /tmp -xdev -type f -mtime +7 -delete (review the matches with -print first)" | tee -a "$LOG_FILE"
fi
fi
done
# 2. Inode Check
echo "" | tee -a "$LOG_FILE"
echo "2. INODE CHECK" | tee -a "$LOG_FILE"
echo "==============" | tee -a "$LOG_FILE"
# -P forces POSIX output: one line per filesystem, even with long device names
df -iP | grep '^/dev/' | while read -r line; do
INODE_PERCENT=$(echo "$line" | awk '{print $5}' | sed 's/%//')
MOUNT=$(echo "$line" | awk '{print $6}')
if [ "$INODE_PERCENT" -ge "$THRESHOLD_INODE" ]; then
echo "⚠️ High inode usage on $MOUNT: ${INODE_PERCENT}%" | tee -a "$LOG_FILE"
echo " Check for many small files:" | tee -a "$LOG_FILE"
echo " - find $MOUNT -type f | wc -l" | tee -a "$LOG_FILE"
fi
done
# 3. Load Average Check
echo "" | tee -a "$LOG_FILE"
echo "3. LOAD AVERAGE CHECK" | tee -a "$LOG_FILE"
echo "====================" | tee -a "$LOG_FILE"
LOAD1=$(uptime | awk -F'load average:' '{print $2}' | cut -d, -f1 | xargs)
CPU_CORES=$(nproc)
echo "Load: $LOAD1, Cores: $CPU_CORES" | tee -a "$LOG_FILE"
if (( $(echo "$LOAD1 > $THRESHOLD_LOAD" | bc -l) )); then
echo "⚠️ High load average: $LOAD1" | tee -a "$LOG_FILE"
echo " Top CPU processes:" | tee -a "$LOG_FILE"
ps aux --sort=-%cpu | head -5 | tee -a "$LOG_FILE"
fi
# 4. Memory Check
echo "" | tee -a "$LOG_FILE"
echo "4. MEMORY CHECK" | tee -a "$LOG_FILE"
echo "===============" | tee -a "$LOG_FILE"
free -h | tee -a "$LOG_FILE"
SWAP_USAGE=$(free | grep Swap | awk '{if ($2 == 0) print 0; else print $3/$2 * 100.0}')
if (( $(echo "$SWAP_USAGE > 50" | bc -l) )); then
echo "⚠️ High swap usage: ${SWAP_USAGE}%" | tee -a "$LOG_FILE"
fi
# 5. Service Status Check
echo "" | tee -a "$LOG_FILE"
echo "5. SERVICE STATUS CHECK" | tee -a "$LOG_FILE"
echo "======================" | tee -a "$LOG_FILE"
SERVICES=("ssh" "nginx" "mysql" "docker" "cron")
for service in "${SERVICES[@]}"; do
if systemctl is-active --quiet "$service"; then
echo "✅ $service: RUNNING" | tee -a "$LOG_FILE"
else
echo "❌ $service: STOPPED" | tee -a "$LOG_FILE"
fi
done
# 6. Log Rotation Check
echo "" | tee -a "$LOG_FILE"
echo "6. LOG ROTATION CHECK" | tee -a "$LOG_FILE"
echo "====================" | tee -a "$LOG_FILE"
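# Capture the matches first: $? after a pipeline ending in tee reflects tee, not grep
LOGROTATE_ISSUES=$(logrotate -d /etc/logrotate.conf 2>&1 | grep -i "error\|warning")
if [ -n "$LOGROTATE_ISSUES" ]; then
echo "$LOGROTATE_ISSUES" | tee -a "$LOG_FILE"
echo "⚠️ Logrotate configuration issues detected" | tee -a "$LOG_FILE"
else
echo "✅ Logrotate configuration OK" | tee -a "$LOG_FILE"
fi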
echo "" | tee -a "$LOG_FILE"
echo "Maintenance check completed at $(date)" | tee -a "$LOG_FILE"
echo "Full log: $LOG_FILE" | tee -a "$LOG_FILE"
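The disk and inode checks above call `awk` and `sed` once per line inside a `while read` loop. The same filtering can be done in a single `awk` pass, which is also easier to test against sample input. A minimal sketch, assuming `df -P` style output and the hypothetical function name `mounts_over_threshold`:

```shell
#!/bin/bash
# Hypothetical helper (not from the original guide).
mounts_over_threshold() {
    # $1:     usage threshold in percent
    # stdin:  output of `df -P` (or `df -iP` for inodes)
    # stdout: "MOUNT PERCENT" for each /dev/ filesystem at or above the threshold
    awk -v limit="$1" 'NR > 1 && $1 ~ /^\/dev\// {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6, use
    }'
}
```

Usage: `df -P | mounts_over_threshold 80`. Like the script above, this sketch assumes mount points without spaces.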
Troubleshooting Best Practices Checklist
- Always check logs first: journalctl -xb
- Monitor resource usage before making changes
- Reproduce the issue to understand it better
- Make one change at a time and test
- Document everything you try
- Have a backup/rollback plan
- Use version control for configuration files
- Test in staging before production
- Monitor after fixing to ensure the issue is resolved
- Create runbooks for common issues
1. Backup before major changes: Always have a recovery plan
2. Test commands in dry-run mode: Use --dry-run when available
3. Don't run unknown commands: Understand what a command does first
4. Use maintenance windows: Schedule disruptive changes appropriately
5. Monitor after changes: Ensure fixes don't break other things
6. Document resolutions: Create knowledge base articles
7. Set up monitoring: Prevent issues before they occur
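As one illustration of "backup before major changes", the sketch below copies a configuration file to a timestamped sibling before it is edited. The function name `backup_file` and the `.bak.<timestamp>` suffix are hypothetical choices; version control or a proper backup tool is preferable for anything beyond a quick edit.

```shell
#!/bin/bash
# Hypothetical helper (not from the original guide).
backup_file() {
    # $1: file to back up. Prints the path of the copy on success.
    local src="$1"
    [ -f "$src" ] || { echo "not a regular file: $src" >&2; return 1; }
    local dest="${src}.bak.$(date +%Y%m%d_%H%M%S)"
    # -p preserves mode, ownership and timestamps where permitted
    cp -p -- "$src" "$dest" && echo "$dest"
}
```

For example, `saved=$(backup_file /etc/fstab)` before an edit; rolling back is then `cp -p "$saved" /etc/fstab`.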
Getting Started with Troubleshooting
Follow this systematic approach to become an effective troubleshooter:
- Learn basic commands: Master top, htop, iostat, vmstat, journalctl
- Understand normal behavior: Monitor your systems when they're healthy
- Start with logs: Always check logs first when issues occur
- Practice on test systems: Create test environments to learn
- Follow the OSI model: Start physical, move up to application
- Use the scripts in this guide: Automate common diagnostics
- Document everything: Create your own troubleshooting guide
- Learn from incidents: Conduct post-mortems for major issues
- Share knowledge: Teach others what you've learned
- Stay curious: Keep learning new tools and techniques
Master the Art of Linux Troubleshooting
Effective troubleshooting is both an art and a science. By following systematic approaches and using the right tools, you can solve even the most complex performance and boot issues. Remember that every problem you solve makes you a better administrator.
Remember: The best troubleshooters are not those who know all the answers, but those who know how to find them. Develop your diagnostic skills, build your toolkit, and approach problems methodically.
Next Steps: Practice the scripts in this guide on a test system. Create your own troubleshooting checklist. The next time an issue occurs, you'll be prepared to diagnose and resolve it quickly and effectively.