Linux Troubleshooting Scenarios for DevOps

Practical troubleshooting scenarios with step-by-step debugging approaches. These real-world exercises will help you develop systematic problem-solving skills for DevOps interviews and daily operations.

1. System Performance Troubleshooting

Diagnosing and resolving system slowdowns, high resource usage, and performance degradation.

Scenario 1: Server suddenly becomes slow

Performance

Symptoms Reported:

  • SSH connections take 30+ seconds to establish
  • Commands respond slowly
  • Website loading times increased from 200ms to 5+ seconds
  • Users reporting timeout errors

Step-by-Step Diagnosis:

1 Check system load and uptime:

uptime
# Output: 16:25:45 up 15 days, 2:15, 2 users, load average: 15.32, 12.45, 8.67
# ALERT: Load average much higher than CPU cores (e.g., 4-core system with 15+ load)
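The comparison in the ALERT comment can be made explicit by dividing the load average by the number of cores. A minimal sketch, assuming Python 3 on Linux; the names `load_per_core` and `is_overloaded` and the threshold of 1.0 are illustrative choices, not a fixed rule:

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Return the load average normalized by the CPU core count."""
    if cores < 1:
        raise ValueError("cores must be >= 1")
    return load_avg / cores


def is_overloaded(load_avg: float, cores: int, threshold: float = 1.0) -> bool:
    """True when load per core exceeds the (assumed) threshold."""
    return load_per_core(load_avg, cores) > threshold


if __name__ == "__main__":
    one_min, _five_min, _fifteen_min = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"1-min load per core: {load_per_core(one_min, cores):.2f}")
```

With the figures from the example output, a 1-minute load of 15.32 on 4 cores is roughly 3.8 tasks per core, so work is queuing.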

2 Identify resource bottlenecks:

# Quick overview
top
# Check CPU usage breakdown
mpstat -P ALL 1 3
# Check memory usage
free -h
grep -E "(MemTotal|MemFree|MemAvailable|Swap)" /proc/meminfo
# Check I/O wait
vmstat 1 5
# Look for high 'wa' (I/O wait) percentage

3 Find top resource consumers:

# Top CPU processes
ps aux --sort=-%cpu | head -10
# Top memory processes
ps aux --sort=-%mem | head -10
# Check for I/O intensive processes
iotop
# or
pidstat -d 1

4 Check disk space and inodes:

df -h
df -i   # Check inode usage
# If / partition is full:
du -sh /* 2>/dev/null | sort -rh | head -10
# Drill down into largest directory

5 Check network connections:

# Check for many connections/DDoS
ss -s
netstat -ant | awk '{print $6}' | sort | uniq -c | sort -rn
# Check for SYN flood (count half-open connections)
netstat -ant | grep -c SYN_RECV
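The per-state tally produced by the awk pipeline can also be computed in a script, which makes it easier to alert on a spike in half-open connections. A minimal sketch, assuming text in the layout printed by `ss -tan` (state in the first column, states spelled as `ss` spells them, e.g. `SYN-RECV`); the function name and the sample rows are illustrative:

```python
from collections import Counter


def count_tcp_states(ss_output: str) -> Counter:
    """Count connections per TCP state from `ss -tan` style text."""
    states = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row, which starts with "State".
        if not fields or fields[0] == "State":
            continue
        states[fields[0]] += 1
    return states


# Illustrative sample, not captured from a real host.
SAMPLE = """\
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
LISTEN     0      128    0.0.0.0:8080        0.0.0.0:*
ESTAB      0      0      10.0.0.5:8080       10.0.0.9:51234
SYN-RECV   0      0      10.0.0.5:8080       203.0.113.7:40000
SYN-RECV   0      0      10.0.0.5:8080       203.0.113.8:40001
"""

if __name__ == "__main__":
    for state, count in count_tcp_states(SAMPLE).most_common():
        print(f"{count:5d} {state}")
```

A large and growing `SYN-RECV` count relative to `ESTAB` is the pattern to look for when a SYN flood is suspected.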

Common Solutions:

Root Cause          Diagnostic Command            Immediate Action                        Long-term Fix
High CPU Usage      top, ps aux --sort=-%cpu      Kill runaway process, restart service   Optimize code, scale horizontally
Memory Exhaustion   free -h, ps aux --sort=-%mem  Clear cache, restart service, add swap  Add RAM, fix memory leaks, optimize
Disk Full           df -h, du -sh /*              Delete large files, clear logs          Implement log rotation, increase disk
I/O Wait High       iostat -x 1, iotop            Stop heavy I/O process                  Upgrade to SSD, optimize queries
Too Many Processes  ps aux | wc -l                Kill unnecessary processes              Limit user processes, fix fork bombs
🎯 PRACTICAL EXAMPLE: SLOW SERVER DIAGNOSIS
============================================
$ uptime
16:30:01 up 45 days, 8:12, 3 users, load average: 18.25, 16.78, 12.45
$ free -h
       total  used  free  shared  buff/cache  available
Mem:   7.8G   7.5G  98M   1.2G    256M        32M
Swap:  2.0G   2.0G  0B
$ df -h
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/sda1   50G   48G   0G     100%  /

DIAGNOSIS: Memory exhausted + disk full

IMMEDIATE ACTION:
1. Locate large files to free disk space: find / -type f -size +100M -exec ls -lh {} \; 2>/dev/null
2. Clear memory cache (needs root): sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
3. Restart biggest memory consumer

Scenario 2: Database performance degradation

Database

Symptoms Reported:

  • High response times for database queries
  • Application timeouts when accessing database
  • MySQL/PostgreSQL processes using high CPU
  • Slow query logs showing many long-running queries

Step-by-Step Diagnosis:

1 Check database process status:

# For MySQL
sudo systemctl status mysql
sudo tail -f /var/log/mysql/error.log
# For PostgreSQL
sudo systemctl status postgresql
sudo tail -f /var/log/postgresql/postgresql-*.log

2 Monitor database connections:

# MySQL
mysql -e "SHOW PROCESSLIST;" | head -20
mysql -e "SHOW STATUS LIKE 'Threads_connected';"
mysql -e "SHOW VARIABLES LIKE 'max_connections';"
# PostgreSQL
sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity;"
sudo -u postgres psql -c "SELECT * FROM pg_stat_activity WHERE state != 'idle';"

3 Check for slow queries:

# MySQL slow query log
sudo tail -f /var/log/mysql/mysql-slow.log
# Enable if not enabled
# Add to /etc/mysql/my.cnf:
# slow_query_log = 1
# slow_query_log_file = /var/log/mysql/mysql-slow.log
# long_query_time = 2
# Find currently running long queries
mysql -e "SELECT * FROM information_schema.processlist WHERE TIME > 10 ORDER BY TIME DESC;"

4 Check database locks:

# MySQL
mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -A 30 "LATEST DETECTED DEADLOCK"
# PostgreSQL
sudo -u postgres psql -c "SELECT * FROM pg_locks WHERE granted = false;"
sudo -u postgres psql -c "SELECT pg_blocking_pids(pid) FROM pg_stat_activity WHERE cardinality(pg_blocking_pids(pid)) > 0;"

5 Check disk I/O for database:

# Find database data directory
mysql -e "SHOW VARIABLES LIKE 'datadir';"
# Typically: /var/lib/mysql
# Check I/O on that directory
iostat -x 1 | grep -A 1 "Device"
iotop

Database Troubleshooting Solutions:

🔧 DATABASE PERFORMANCE CHECKLIST
==================================
1. CONNECTION POOLING ISSUES:
   - Check max_connections vs current connections
   - Look for "Too many connections" errors
   - Solution: Increase max_connections or fix connection leaks
2. SLOW QUERIES:
   - Enable slow query log
   - Use EXPLAIN on problematic queries
   - Add missing indexes
   - Solution: Query optimization, indexing
3. LOCK CONTENTION:
   - Check for deadlocks in logs
   - Identify blocking transactions
   - Solution: Optimize transaction isolation levels
4. RESOURCE LIMITS:
   - Check innodb_buffer_pool_size (MySQL)
   - Check shared_buffers (PostgreSQL)
   - Solution: Adjust memory allocation
5. DISK I/O BOTTLENECK:
   - Check if database on slow disk
   - Monitor read/write latencies
   - Solution: Move to SSD, optimize queries

Immediate actions:

# 1. Kill long-running queries (review the generated KILL list before piping it to mysql)
mysql -N -e "SELECT CONCAT('KILL ', id, ';') FROM information_schema.processlist WHERE command = 'Query' AND time > 300;" | mysql
# 2. Clear database cache (MySQL 5.7 and earlier only; the query cache was removed in MySQL 8.0)
mysql -e "FLUSH QUERY CACHE; RESET QUERY CACHE;"
# 3. Restart database service (last resort)
sudo systemctl restart mysql
# 4. Temporarily increase connections
mysql -e "SET GLOBAL max_connections = 500;"
# 5. Monitor with mytop/htop
mytop   # MySQL monitoring tool

Prevention strategies:

# 1. Regular maintenance
mysqlcheck -u root -p --auto-repair --optimize --all-databases
# 2. Query optimization
# Use pt-query-digest for MySQL query analysis
pt-query-digest /var/log/mysql/mysql-slow.log
# 3. Monitoring setup
# Install and configure Prometheus + Grafana for database metrics
# 4. Connection pool tuning
# In application, ensure proper connection pooling
# Use pgbouncer for PostgreSQL connection pooling

2. Network Connectivity Problems

Troubleshooting network connectivity, latency, and service accessibility issues.

Scenario 3: "Connection refused" to service

Networking

Problem Statement:

Users report "Connection refused" when trying to access your web application on port 8080. The service was working earlier but suddenly stopped accepting connections.

$ curl http://server-ip:8080
curl: (7) Failed to connect to server-ip port 8080: Connection refused
$ telnet server-ip 8080
Trying server-ip...
telnet: Unable to connect to remote host: Connection refused

Systematic Troubleshooting:

1 Check if service is running:

# Check process status
ps aux | grep -E "(nginx|apache|your-app)"
sudo systemctl status nginx
sudo systemctl status your-application
# Check if port is listening
sudo ss -tulpn | grep :8080
sudo netstat -tulpn | grep :8080
# If nothing is listening, service might be down

2 Check service logs:

# Application logs
sudo tail -f /var/log/nginx/error.log
sudo tail -f /var/log/your-app/app.log
sudo journalctl -u nginx --since "10 minutes ago"
# Check for crash/restart patterns
sudo grep -i "segfault\|crash\|failed" /var/log/syslog

3 Check firewall rules:

# iptables
sudo iptables -L -n | grep 8080
sudo iptables -L -n -v | grep DROP
# firewalld
sudo firewall-cmd --list-all | grep 8080
# ufw (Ubuntu)
sudo ufw status | grep 8080
# If the port is blocked, temporarily allow it
sudo iptables -I INPUT -p tcp --dport 8080 -j ACCEPT

4 Check SELinux/AppArmor:

# SELinux (RHEL/CentOS)
getenforce                                     # Check if enforcing
sudo ausearch -m avc -ts recent                # Check for denials
sudo setsebool -P httpd_can_network_connect 1  # Allow httpd to make network connections
# AppArmor (Ubuntu)
sudo aa-status | grep nginx
sudo tail -f /var/log/kern.log | grep -i denied

5 Check resource limits:

# Check if service hit file descriptor limit
grep "open files" /proc/$(pgrep -o nginx)/limits   # -o selects the oldest (master) process
# Check shell and system-wide limits
ulimit -n
sysctl fs.file-max
# Check for "Address already in use"
sudo ss -tulpn | grep 8080
sudo lsof -i :8080   # If something else is using it
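The same limit can be inspected from inside a process, which helps when an application needs to log how close it is to running out of descriptors. A minimal sketch, assuming Python 3 on Linux; `fd_headroom` is a hypothetical helper, and the script reports on itself rather than on nginx (for another process, read `/proc/<pid>/limits` and `/proc/<pid>/fd` as above):

```python
import os
import resource


def fd_headroom(open_fds: int, soft_limit: int) -> float:
    """Fraction of the soft file-descriptor limit still unused (0.0 to 1.0)."""
    if soft_limit <= 0:
        raise ValueError("soft_limit must be positive")
    return max(0.0, 1.0 - open_fds / soft_limit)


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    open_fds = len(os.listdir("/proc/self/fd"))  # Linux-specific path
    if soft == resource.RLIM_INFINITY:
        print(f"open={open_fds}, soft limit is unlimited")
    else:
        print(f"open={open_fds} soft={soft} hard={hard} "
              f"headroom={fd_headroom(open_fds, soft):.1%}")
```

Headroom near zero suggests new connections will start failing with "Too many open files".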

Solution Flowchart:

🔍 "CONNECTION REFUSED" TROUBLESHOOTING FLOW
=============================================
1. Is service running?
   ✓ ps aux | grep service
   → If NO: Start service: sudo systemctl start service
2. Is port listening?
   ✓ sudo ss -tulpn | grep :port
   → If NO: Check service config, restart
3. Is firewall blocking?
   ✓ sudo iptables -L -n | grep port
   → If YES: Add rule: sudo iptables -A INPUT -p tcp --dport port -j ACCEPT
4. Is SELinux/AppArmor blocking?
   ✓ Check /var/log/audit/audit.log or /var/log/kern.log
   → If YES: Adjust policies: setsebool or aa-complain
5. Is port already in use?
   ✓ sudo lsof -i :port
   → If YES: Kill other process or change port
6. Check resource limits?
   ✓ ulimit -n, check /proc/pid/limits
   → If LOW: Increase limits in /etc/security/limits.conf
7. Check network connectivity?
   ✓ ping server, telnet localhost port
   → If LOCAL works but REMOTE doesn't: Check network/firewall

Quick fix commands:

# 1. Restart service
sudo systemctl restart nginx
# 2. Check and fix firewall
sudo iptables -I INPUT 1 -p tcp --dport 8080 -j ACCEPT
sudo iptables-save | sudo tee /etc/iptables/rules.v4 > /dev/null   # a plain "sudo ... > file" redirect runs without root
# 3. Check for port conflict and resolve
sudo kill $(sudo lsof -t -i:8080)   # Kill process on port 8080
# 4. Increase file descriptors
echo "* soft nofile 65536" | sudo tee -a /etc/security/limits.conf
echo "* hard nofile 65536" | sudo tee -a /etc/security/limits.conf
# 5. Disable SELinux temporarily (for testing)
sudo setenforce 0
# Permanent: edit /etc/selinux/config, set SELINUX=permissive

Scenario 4: Intermittent network timeouts

Latency

Problem Statement:

Application experiences random timeouts when connecting to database or external APIs. Timeouts happen intermittently - sometimes works, sometimes fails with "Connection timed out" errors.

Application Logs:
ERROR: Database connection timeout after 30000ms
ERROR: API call to payment gateway failed: Connection timed out
Pattern: Timeouts happen randomly, 10-20% of requests fail
Time: No specific pattern, happens throughout day
Affected: All services making external connections

Intermittent Issue Diagnosis:

1 Basic connectivity tests:

# Continuous ping to identify pattern
ping -i 1 -c 100 database-host | grep -E "(timeout|unreachable)"
# Test with different packet sizes
ping -s 1472 -M do -c 20 database-host   # Test MTU issues: 1472 bytes + 28 bytes of headers = 1500; -M do forbids fragmentation
# Repeated single-packet tests
for i in {1..10}; do timeout 2 ping -c 1 database-host && echo "OK" || echo "FAILED"; done

2 DNS resolution checks:

# Check DNS resolution timing
time nslookup database-host
time dig database-host
# Continuous DNS resolution test
while true; do date; dig +short database-host; sleep 1; done
# Check DNS cache
sudo systemctl status systemd-resolved   # Ubuntu
sudo systemctl status nscd               # Name Service Cache Daemon
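Repeated lookups can also be timed from the application host itself, which exercises the same resolver path the application uses (including `/etc/hosts` and any local cache). A minimal sketch, assuming Python 3; `time_resolution` and `slowest` are hypothetical helper names, and `database-host` is the placeholder hostname used throughout this scenario:

```python
import socket
import time


def time_resolution(hostname: str, attempts: int = 5):
    """Resolve hostname repeatedly; return a list of (succeeded, seconds) tuples."""
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            socket.getaddrinfo(hostname, None)
            ok = True
        except socket.gaierror:  # resolution failed
            ok = False
        results.append((ok, time.monotonic() - start))
    return results


def slowest(results) -> float:
    """Longest single lookup time in seconds."""
    return max(seconds for _, seconds in results)


if __name__ == "__main__":
    results = time_resolution("database-host", attempts=10)
    failures = sum(1 for ok, _ in results if not ok)
    print(f"failures={failures}/10 slowest={slowest(results):.3f}s")
```

Occasional multi-second lookups or sporadic failures here point at DNS rather than at the database or the network path.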

3 Route analysis:

# Continuous traceroute
mtr --report database-host
# Save intermittent issues
traceroute database-host > /tmp/trace_good.txt
# When issue occurs:
traceroute database-host > /tmp/trace_bad.txt
diff /tmp/trace_good.txt /tmp/trace_bad.txt

4 TCP connection analysis:

# Test TCP connection with timeout
timeout 5 bash -c "
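A TCP-level check with an explicit timeout can also be scripted, and repeating it gives a failure rate to compare against the 10-20% seen in the application logs. A minimal sketch, assuming Python 3; `tcp_probe` and `failure_rate` are hypothetical helpers, and the host and port are placeholders for the affected database or API endpoint:

```python
import socket
import time


def tcp_probe(host: str, port: int, timeout: float = 5.0):
    """Attempt one TCP connection; return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # refused, timed out, unreachable, or DNS failure
        return False, time.monotonic() - start, str(exc)


def failure_rate(host: str, port: int, attempts: int = 20,
                 timeout: float = 5.0) -> float:
    """Fraction of failed connection attempts over several tries."""
    failures = sum(
        1 for _ in range(attempts) if not tcp_probe(host, port, timeout)[0]
    )
    return failures / attempts


if __name__ == "__main__":
    ok, elapsed, error = tcp_probe("database-host", 3306)
    print(f"ok={ok} elapsed={elapsed:.3f}s error={error!r}")
```

The error text separates "Connection refused" (nothing listening or an active reject) from a timeout (packets dropped somewhere on the path).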

5 System resource monitoring during issues:

# Monitor during issue
sar -n DEV 1    # Network interface stats
sar -n ETCP 1   # TCP error statistics
# Check connection queue
netstat -s | grep -i listen
ss -ltn | grep :3306
# Check for SYN flood (count half-open connections)
netstat -ant | grep -c SYN_RECV

Solutions for Intermittent Timeouts:

Possible Cause               Diagnostic Method                                 Solution
DNS Intermittent Resolution  while true; do dig +short host; sleep 1; done     Use IP directly, add to /etc/hosts, change DNS server
Network Flapping             mtr --report host                                 Contact network provider, use redundant links
TCP Connection Queue Full    ss -ltn | grep :port; netstat -s | grep overflow  Increase backlog queue, tune kernel parameters
Firewall Rate Limiting       sudo iptables -L -n -v                            Adjust rate limit rules, whitelist IPs
MTU Issues                   ping -s 1472 -M do host                           Adjust MTU size, fix fragmentation
Resource Exhaustion          sar -n DEV 1; ss -s                               Increase limits, optimize connections

Kernel parameter tuning for timeouts:

# Add to /etc/sysctl.conf for better timeout handling
# Increase TCP buffer sizes
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# Increase connection backlog
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
# Timeout and retransmission settings
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 15
# Apply changes
sudo sysctl -p

Application-level fixes:

# 1. Implement retry logic with exponential backoff
# Example Python:
import random
import time

def call_with_retry(func, max_retries=3):
    """Call func(), retrying on ConnectionError with exponential backoff plus jitter."""
    for i in range(max_retries):
        try:
            return func()
        except ConnectionError:
            # Back off before the next attempt; no point sleeping after the last one.
            if i < max_retries - 1:
                wait = (2 ** i) + random.random()
                time.sleep(wait)
    raise Exception("Max retries exceeded")

# 2. Use connection pooling
# 3. Implement circuit breaker pattern
# 4. Add timeout configuration for all external calls
# 5. Use async/non-blocking I/O where possible

3. Filesystem & Disk Problems

Diagnosing disk failures, filesystem corruption, and storage-related issues.

Scenario 5: Disk full errors

Disk

Error Messages:

Application logs showing:
- "No space left on device"
- "Disk quota exceeded"
- "Write failure: ENOSPC (No space left on device)"

System commands failing:
- cp: cannot create regular file 'x': No space left on device
- touch: cannot touch 'file': No space left on device
- sudo: unable to write to /var/log/sudo.log: No space left on device

Disk Space Diagnosis:

1 Check disk usage overview:

df -h
df -i   # Check inode usage
# Expected output:
# Filesystem  Size  Used  Avail  Use%  Mounted on
# /dev/sda1   50G   48G   0G     100%  /
# /dev/sdb1   100G  20G   80G    20%   /data

2 Identify what's consuming space:

# Top-level directory usage
du -sh /* 2>/dev/null | sort -rh | head -10
# Drill down into largest directory
du -sh /var/* 2>/dev/null | sort -rh | head -10
# Continue drilling
du -sh /var/log/* 2>/dev/null | sort -rh | head -10

3 Find large files:

# Find files larger than 100MB
find / -type f -size +100M -exec ls -lh {} \; 2>/dev/null | head -20
# Top 10 largest files
sudo find / -type f -exec du -h {} + 2>/dev/null | sort -rh | head -10
# Find by specific location
find /var/log -type f -size +50M -exec ls -lh {} \;

4 Check for deleted files still in use:

# Files deleted but still held open
lsof +L1   # Show files with link count less than 1
# Check /proc for deleted files
ls -la /proc/*/fd 2>/dev/null | grep deleted
# Find processes with deleted files
lsof | grep deleted | head -20
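The `/proc` check can be scripted so that each deleted file is tied to the PID and descriptor holding it open. A minimal sketch, assuming Python 3 on Linux, where the kernel appends ` (deleted)` to the link target of an unlinked file; `find_deleted_open_files` is a hypothetical helper and takes the proc root as a parameter only to make it testable:

```python
import os


def find_deleted_open_files(proc_root: str = "/proc"):
    """Yield (pid, fd, target) for descriptors whose target is marked deleted."""
    for pid in os.listdir(proc_root):
        if not pid.isdigit():  # skip non-process entries such as "self"
            continue
        fd_dir = os.path.join(proc_root, pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:  # process exited or permission denied
            continue
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:  # descriptor closed while scanning
                continue
            if target.endswith(" (deleted)"):
                yield int(pid), int(fd), target


if __name__ == "__main__":
    for pid, fd, target in find_deleted_open_files():
        print(f"pid={pid} fd={fd} {target}")
```

Run it as root to see every process. Restarting the listed process, or truncating the file through `/proc/<pid>/fd/<fd>`, releases the space that `df` still counts as used.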

5 Check specific culprits:

# Docker disk usage
docker system df
# Log files
find /var/log -type f -name "*.log" -size +100M
# Temporary files
du -sh /tmp
ls -la /tmp | head -20
# Cache directories
du -sh /var/cache/* 2>/dev/null | sort -rh

Disk Cleanup Procedures:

🧹 DISK CLEANUP CHECKLIST
==========================
1. QUICK WINS (Safe to delete):
   - Clear package manager cache: sudo apt clean / sudo yum clean all
   - Clear systemd journal: sudo journalctl --vacuum-time=3d
   - Clear old temporary files (leaves sockets and files in use): sudo find /tmp -type f -atime +1 -delete
   - Clear browser caches (if applicable)
2. LOG FILES:
   - Rotate and compress old logs: sudo logrotate -f /etc/logrotate.conf
   - Delete old log files: sudo find /var/log -type f -name "*.log.*" -mtime +30 -delete
   - Clear application logs (check retention policy)
3. DOCKER CLEANUP:
   - Remove unused containers: docker container prune -f
   - Remove unused images: docker image prune -a -f
   - Remove unused volumes: docker volume prune -f
   - Full cleanup: docker system prune -a --volumes
4. APPLICATION SPECIFIC:
   - Clear cache directories
   - Remove old backups
   - Clean up uploads/temp directories
5. MONITORING:
   - Set up disk monitoring alerts
   - Implement log rotation
   - Schedule regular cleanup jobs

Emergency cleanup commands:

# 1. Clear package cache
sudo apt clean       # Debian/Ubuntu
sudo yum clean all   # RHEL/CentOS
# 2. Clear systemd journal (keep last 3 days)
sudo journalctl --vacuum-time=3d
# 3. Find and delete core dump files
sudo find / -name "core" -type f -delete 2>/dev/null
sudo find / -name "*.core" -type f -delete 2>/dev/null
# 4. Clear thumbnail cache
rm -rf ~/.cache/thumbnails/*
# 5. Remove old kernels (keep last 2)
sudo apt autoremove --purge                   # Ubuntu
sudo package-cleanup --oldkernels --count=2   # CentOS
# 6. Clear /tmp safely (not socket files)
find /tmp -type f -atime +1 -delete

Preventive measures:

# 1. Set up monitoring
# Add to crontab -e
0 * * * * df -h > /var/log/disk-usage.log
0 2 * * * find / -type f -size +100M -exec ls -lh {} \; 2>/dev/null > /var/log/large-files.log
# 2. Implement log rotation
# Edit /etc/logrotate.d/your-app
/var/log/your-app/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    create 644 root root
}
# 3. Docker cleanup cron job
0 3 * * * docker system prune -a -f
# 4. Filesystem quotas
# Enable quotas in /etc/fstab, then:
quotacheck -avug
quotaon -avug
edquota username   # Set limits per user
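The hourly `df` snapshot records usage but does not flag a problem. A threshold check is a small step up. A minimal sketch, assuming Python 3; the 80% and 90% thresholds and the helper names are illustrative assumptions, and the percentage can differ slightly from the `Use%` column of `df`, which excludes blocks reserved for root:

```python
import shutil


def percent_used(total_bytes: int, used_bytes: int) -> float:
    """Used space as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def classify(pct: float, warn_at: float = 80.0, crit_at: float = 90.0) -> str:
    """Map a usage percentage to OK, WARNING, or CRITICAL."""
    if pct >= crit_at:
        return "CRITICAL"
    if pct >= warn_at:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    usage = shutil.disk_usage("/")
    pct = percent_used(usage.total, usage.used)
    print(f"/ is {pct:.1f}% used: {classify(pct)}")
```

Run from cron, the output can feed whatever alerting channel is already in place.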

Scenario 6: Filesystem read-only or corruption

Filesystem

Symptoms:

  • "Read-only filesystem" errors when trying to write
  • Disk I/O errors in dmesg or /var/log/messages
  • Filesystem checks forced on reboot
  • Data corruption or missing files
  • Inability to create or modify files

$ touch testfile
touch: cannot touch 'testfile': Read-only file system
$ dmesg | tail -20
[ 1234.567890] EXT4-fs error (device sda1): ext4_find_entry: reading directory
[ 1234.567891] EXT4-fs (sda1): Remounting filesystem read-only

Filesystem Health Check:

1 Check filesystem status:

# Check mount options
mount | grep "^/dev"
# Look for "(ro)" for read-only or errors
# Check /proc/mounts
grep /dev/sda1 /proc/mounts
# Check dmesg for errors
dmesg | grep -i "error\|read.only\|filesystem\|ext4\|xfs"
# Check kernel messages
tail -f /var/log/messages | grep -i "filesystem"

2 Check disk health (SMART):

# Install smartmontools if needed
sudo apt install smartmontools   # Debian/Ubuntu
sudo yum install smartmontools   # RHEL/CentOS
# Check SMART status
sudo smartctl -H /dev/sda
sudo smartctl -a /dev/sda | grep -E "(Reallocated|Pending|Uncorrectable)"
# Short test
sudo smartctl -t short /dev/sda
# Long test
sudo smartctl -t long /dev/sda
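The three attributes filtered by the grep above matter mainly when their raw value is non-zero. A minimal sketch, assuming Python 3 and the ten-column attribute table that `smartctl -A` prints for ATA/SATA drives (NVMe output is laid out differently); `nonzero_smart_counters` is a hypothetical helper and the sample rows are illustrative, not taken from a real disk:

```python
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def nonzero_smart_counters(smartctl_output: str) -> dict:
    """Return watched SMART attributes whose raw value is greater than zero."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns: ID, name, flag, value, worst,
        # thresh, type, updated, when_failed, raw_value.
        if len(fields) >= 10 and fields[1] in WATCHED:
            raw = fields[9]
            if raw.isdigit() and int(raw) > 0:
                found[fields[1]] = int(raw)
    return found


# Illustrative sample in the ATA attribute-table layout.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    print(nonzero_smart_counters(SAMPLE))
```

Any non-zero counter that keeps growing between runs is a strong hint to plan a disk replacement even while the overall health check still passes.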

3 Check for bad blocks:

# Check filesystem for errors (read-only check)
sudo fsck -n /dev/sda1
# Note: Never run fsck on mounted filesystem!
# Bad blocks check
sudo badblocks -v /dev/sda1 > /tmp/badblocks.txt
# For ext4 filesystems
sudo e2fsck -f /dev/sda1   # -f: Force check even if clean

4 Check I/O errors:

# Check kernel ring buffer
dmesg | grep -i "I/O error"
# Check syslog
grep -i "I/O error" /var/log/syslog
# Check disk stats
sudo iostat -x 1 5
# Look for unusually high await times and %util (iostat does not report error counts)

5 Remount filesystem:

# Try to remount as read-write
sudo mount -o remount,rw /dev/sda1 /
# If successful, check
mount | grep "/dev/sda1"
# Should show "(rw)" instead of "(ro)"

Recovery Procedures:

⚠️ FILESYSTEM RECOVERY PROCEDURE
=================================
1. IMMEDIATE ACTIONS:
   - Backup critical data immediately if possible
   - Document all error messages
   - Check if it's a single filesystem or multiple
2. ATTEMPT REMOUNT:
   sudo mount -o remount,rw /partition /mountpoint
   → If SUCCESS: Filesystem recovered, monitor closely
   → If FAILS: Continue to step 3
3. CHECK DISK HEALTH:
   sudo smartctl -H /dev/sdX
   → If FAILING: Disk hardware issue, replace disk
   → If OK: Continue to step 4
4. FILESYSTEM REPAIR (UNMOUNTED):
   - Boot into recovery mode or live CD
   - Unmount filesystem: sudo umount /dev/sdX1
   - Run repair: sudo fsck -y /dev/sdX1
   - Check type: ext4: e2fsck, xfs: xfs_repair
5. DATA RECOVERY:
   - Use ddrescue to clone failing disk
   - Use testdisk/photorec for file recovery
   - Restore from backups
6. PREVENTION:
   - Regular SMART monitoring
   - RAID configuration for redundancy
   - Regular backups
   - Filesystem journaling enabled

Filesystem repair commands:

# For ext2/ext3/ext4 filesystems
# Boot into recovery mode or unmount first
sudo umount /dev/sda1
sudo e2fsck -f -y -v /dev/sda1
# -f: Force check
# -y: Auto-yes to repairs
# -v: Verbose
# For XFS filesystems
sudo umount /dev/sda1
sudo xfs_repair /dev/sda1
# For Btrfs filesystems
sudo umount /dev/sda1
sudo btrfs check --repair /dev/sda1   # Use with caution!
# Remount after repair
sudo mount /dev/sda1 /mountpoint

Data recovery tools:

# 1. Clone failing disk with ddrescue
sudo apt install gddrescue
sudo ddrescue /dev/sda /dev/sdb rescue.log
# 2. Recover deleted files with testdisk
sudo apt install testdisk
sudo testdisk /dev/sda1
# 3. Photo recovery with photorec
sudo photorec /dev/sda1
# 4. File carving with foremost
sudo foremost -i /dev/sda1 -o /recovery/
# 5. Check and repair NTFS (if dual boot)
sudo ntfsfix /dev/sda1

Monitoring and prevention:

# 1. SMART monitoring daemon
sudo apt install smartmontools
sudo systemctl enable smartd
sudo systemctl start smartd
# 2. Configure /etc/smartd.conf
DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -m admin@example.com
# 3. Regular filesystem checks
# In /etc/fstab the last field is the fsck pass order at boot
# (0 = skip, 1 = root filesystem, 2 = other filesystems)
/dev/sda1 / ext4 defaults,noatime 0 1
# 4. Backup critical data
# Use rsync, tar, or backup tools
sudo tar -czf /backup/$(date +%Y%m%d).tar.gz /important/data
# 5. Use RAID for redundancy
# mdadm for software RAID or hardware RAID controller

Troubleshooting Quick Reference

Diagnostic Commands Cheatsheet

Symptom             First Command             Follow-up Commands                  Expected Output
Server Slow         uptime                    top, free -h, df -h                 Load avg < CPU cores, memory available
Connection Refused  ss -tulpn | grep :port    systemctl status, iptables -L       Process listening on port
Disk Full           df -h                     du -sh /*, find / -size +100M       Available space > 10%
High CPU            top                       ps aux --sort=-%cpu, pidstat 1      CPU usage < 80% per core
High Memory         free -h                   ps aux --sort=-%mem, smem -t        Available memory > 10%
Network Issues      ping -c 4 8.8.8.8         mtr host, ss -s, netstat -s         0% packet loss, normal latency
Service Down        systemctl status service  journalctl -u service, tail -f log  Active (running) status
Permission Denied   ls -la file               getfacl file, groups user           User has rwx permissions
File Corruption     dmesg | tail -20          fsck -n /dev, smartctl -a           No I/O errors in logs
DNS Problems        dig google.com            nslookup, cat /etc/resolv.conf      Returns IP address

Systematic Troubleshooting Approach

The 5-Step Troubleshooting Methodology:

🔧 SYSTEMATIC TROUBLESHOOTING FRAMEWORK
========================================

STEP 1: INFORMATION GATHERING
-----------------------------
1. What specifically is broken?
2. When did it start happening?
3. What changed recently?
4. Who is affected?
5. Are there error messages? (Copy exact text)

STEP 2: REPRODUCTION
--------------------
1. Can you reproduce the issue?
2. Is it consistent or intermittent?
3. What are the steps to reproduce?
4. Does it happen in all environments?

STEP 3: ISOLATION
-----------------
1. Is it a single server or all servers?
2. Is it a single service or all services?
3. Is it network, disk, CPU, or memory?
4. Use divide and conquer methodology

STEP 4: DIAGNOSIS
-----------------
1. Check logs (application, system, auth)
2. Check metrics (CPU, memory, disk, network)
3. Check configuration files
4. Check dependencies (database, APIs, DNS)

STEP 5: RESOLUTION & PREVENTION
--------------------------------
1. Implement immediate fix
2. Test the fix thoroughly
3. Document the issue and solution
4. Implement monitoring to detect recurrence
5. Add preventive measures

Essential questions to ask:

  • Scope: Is this affecting one user or all users?
  • Timing: When exactly did it start? After a deployment?
  • Pattern: Is it constant or intermittent?
  • Changes: What was changed before the issue?
  • Impact: What's the business impact?

Essential Troubleshooting Toolkit

Must-have tools for every DevOps engineer:

# System Monitoring
sudo apt install htop iotop iftop nmon glances ncdu
# Network Diagnostics
sudo apt install net-tools dnsutils mtr tcpdump nmap
# Disk and Filesystem
sudo apt install smartmontools testdisk e2fsprogs xfsprogs
# Process Analysis
sudo apt install psmisc lsof strace ltrace
# Log Analysis
sudo apt install logwatch multitail
# Performance
sudo apt install sysstat perf-tools-unstable
# Containers
sudo apt install docker.io ctop dive
# Quick one-liner to install all:
sudo apt install htop iotop iftop nmon glances ncdu net-tools dnsutils mtr tcpdump nmap smartmontools psmisc lsof strace sysstat

Useful one-liners for quick diagnostics:

# Quick system health check
echo "Load: $(uptime)"; echo "Memory: $(free -h | grep Mem)"; echo "Disk: $(df -h / | tail -1)"
# Find top 5 CPU processes
ps aux --sort=-%cpu | head -6
# Find top 5 memory processes
ps aux --sort=-%mem | head -6
# Check all listening ports
sudo ss -tulpn
# Check disk space by directory
du -sh /* 2>/dev/null | sort -rh | head -5
# Check for failed services
systemctl list-units --state=failed
# Check last 10 error messages
journalctl -p err -b | tail -10
# Quick network test
ping -c 2 8.8.8.8 && echo "Network OK" || echo "Network Issues"
# Check zombie processes
ps aux | awk '$8=="Z" {print $0}'

Emergency recovery commands:

# Free up memory immediately (needs root)
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
# Kill process by name
pkill -9 process_name
# Emergency disk space
sudo find /var/log -type f -name "*.log" -size +100M -delete
# Reset broken terminal
reset
# Recover from messed up terminal settings
stty sane
# Kill all user processes (careful!)
pkill -9 -u username
# Emergency read-write remount
sudo mount -o remount,rw /