Practical troubleshooting scenarios with step-by-step debugging approaches. These real-world exercises will help you develop systematic problem-solving skills for DevOps interviews and daily operations.
1. System Performance Troubleshooting
Diagnosing and resolving system slowdowns, high resource usage, and performance degradation.
Scenario 1: Server suddenly becomes slow
Symptoms Reported:
- SSH connections take 30+ seconds to establish
- Commands respond slowly
- Website loading times increased from 200ms to 5+ seconds
- Users reporting timeout errors
Step-by-Step Diagnosis:
1. Check system load and uptime
2. Identify resource bottlenecks
3. Find top resource consumers
4. Check disk space and inodes
5. Check network connections
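The five steps above can be sketched as a small shell script. This is a minimal sketch, not a definitive procedure: the function names `load_per_core` and `diagnose_slow_server` are hypothetical, and it assumes a Linux host with procps (`free`, `vmstat`, `ps`) and iproute2 (`ss`) installed.

```shell
# Hypothetical helper: divide a load average by the core count.
# A sustained value above 1.00 suggests the CPUs are saturated.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Hypothetical wrapper that runs the five checks in order.
diagnose_slow_server() {
    echo "== 1. Load and uptime =="
    uptime
    echo "1-min load per core: $(load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)")"

    echo "== 2. Resource bottlenecks =="
    free -h
    vmstat 1 5    # 'wa' column = I/O wait, 'si'/'so' = swap activity

    echo "== 3. Top resource consumers =="
    ps aux --sort=-%cpu | head -n 6
    ps aux --sort=-%mem | head -n 6

    echo "== 4. Disk space and inodes =="
    df -h
    df -i

    echo "== 5. Network connections =="
    ss -s
}
```

Running `diagnose_slow_server 2>&1 | tee /tmp/diag.txt` keeps a copy of the output for later comparison; the output path is only an example.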
Common Solutions:
| Root Cause | Diagnostic Command | Immediate Action | Long-term Fix |
|---|---|---|---|
| High CPU Usage | top, ps aux --sort=-%cpu | Kill runaway process, restart service | Optimize code, scale horizontally |
| Memory Exhaustion | free -h, ps aux --sort=-%mem | Clear cache, restart service, add swap | Add RAM, fix memory leaks, optimize |
| Disk Full | df -h, du -sh /* | Delete large files, clear logs | Implement log rotation, increase disk |
| I/O Wait High | iostat -x 1, iotop | Stop heavy I/O process | Upgrade to SSD, optimize queries |
| Too Many Processes | ps aux \| wc -l | Kill unnecessary processes | Limit user processes, fix fork bombs |
Scenario 2: Database performance degradation
Symptoms Reported:
- High response times for database queries
- Application timeouts when accessing database
- MySQL/PostgreSQL processes using high CPU
- Slow query logs showing many long-running queries
Step-by-Step Diagnosis:
1. Check database process status
2. Monitor database connections
3. Check for slow queries
4. Check database locks
5. Check disk I/O for database
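For MySQL, the steps above might look like the following sketch. The function names are hypothetical, and the script assumes the `mysql` client can authenticate without prompting (for example through `~/.my.cnf`) and that `iostat` from sysstat is installed.

```shell
# Hypothetical filter: reads tab-separated SHOW FULL PROCESSLIST rows
# (Id, User, Host, db, Command, Time, State, Info) on stdin and prints
# the Id, runtime, and SQL text of queries running longer than $1 seconds.
long_running_queries() {
    awk -F'\t' -v limit="${1:-10}" \
        '$5 == "Query" && $6 + 0 > limit + 0 { print $1 "\t" $6 "s\t" $8 }'
}

# Hypothetical wrapper; assumes credentials are supplied via ~/.my.cnf.
diagnose_mysql() {
    echo "== 1. Process status =="
    ps aux | grep -E '[m]ysqld'

    echo "== 2. Connections =="
    mysql -e "SHOW STATUS LIKE 'Threads_connected'"
    mysql -e "SHOW VARIABLES LIKE 'max_connections'"

    echo "== 3. Queries running longer than 10 s =="
    mysql -B -N -e "SHOW FULL PROCESSLIST" | long_running_queries 10

    echo "== 4. Locks =="
    mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -A 15 'TRANSACTIONS'

    echo "== 5. Disk I/O =="
    iostat -x 1 3
}
```

For PostgreSQL, the `pg_stat_activity` and `pg_locks` views expose the equivalent connection, query, and lock information.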
Database Troubleshooting Solutions:
Immediate actions:
Prevention strategies:
2. Network Connectivity Problems
Troubleshooting network connectivity, latency, and service accessibility issues.
Scenario 3: "Connection refused" to service
Problem Statement:
Users report "Connection refused" when trying to access your web application on port 8080. The service was working earlier but suddenly stopped accepting connections.
Systematic Troubleshooting:
1. Check if service is running
2. Check service logs
3. Check firewall rules
4. Check SELinux/AppArmor
5. Check resource limits
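The checks above could be scripted roughly as follows. This is a sketch under stated assumptions: the service is managed by systemd, `myapp` is a placeholder unit name, and the function names are hypothetical.

```shell
# Hypothetical helper: reads `ss -ltn` output on stdin and succeeds
# if any socket is listening on port $1.
listening_on() {
    awk -v port="$1" '
        NR > 1 { n = split($4, part, ":"); if (part[n] == port) found = 1 }
        END    { exit !found }'
}

# Hypothetical wrapper; "myapp" is a placeholder unit name.
diagnose_refused() {
    service="${1:-myapp}"; port="${2:-8080}"

    echo "== 1. Is the service running and listening? =="
    systemctl status "$service" --no-pager
    if ss -ltn | listening_on "$port"; then
        echo "port $port: listening"
    else
        echo "port $port: nothing listening"
    fi

    echo "== 2. Recent service logs =="
    journalctl -u "$service" -n 50 --no-pager

    echo "== 3. Firewall rules =="
    sudo iptables -L -n -v

    echo "== 4. SELinux / AppArmor =="
    command -v getenforce >/dev/null && getenforce
    command -v aa-status  >/dev/null && sudo aa-status

    echo "== 5. Resource limits =="
    pid="$(systemctl show -p MainPID --value "$service")"
    [ -n "$pid" ] && [ "$pid" != "0" ] && grep 'open files' "/proc/$pid/limits"
}
```

"Connection refused" usually means nothing is listening on the port, so step 1 tends to settle most cases; a firewall that silently drops packets typically shows up as a timeout instead.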
Solution Flowchart:
Quick fix commands:
Scenario 4: Intermittent network timeouts
Problem Statement:
The application experiences random timeouts when connecting to the database or external APIs. The failures are intermittent: the same connection sometimes succeeds and sometimes fails with a "Connection timed out" error.
Intermittent Issue Diagnosis:
1. Basic connectivity tests
2. DNS resolution checks
3. Route analysis
4. TCP connection analysis
5. System resource monitoring during issues
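Because the failures are intermittent, it helps to run the same probes repeatedly while the problem is occurring. The sketch below strings the five steps together; the function names are hypothetical, and it assumes `dig`, `mtr`, `netstat`, and `sar` are installed.

```shell
# Hypothetical helper: extracts the packet-loss percentage from the
# summary line that `ping` prints.
loss_percent() {
    grep -oE '[0-9.]+% packet loss' | cut -d'%' -f1
}

# Hypothetical wrapper: run it while the timeouts are occurring.
probe_target() {
    host="$1"

    echo "== 1. Basic connectivity =="
    echo "packet loss: $(ping -c 20 "$host" | loss_percent)%"

    echo "== 2. DNS resolution, 10 attempts =="
    i=0
    while [ "$i" -lt 10 ]; do
        # dig exits non-zero when no server replies within the timeout
        dig +short +time=2 +tries=1 "$host" >/dev/null \
            || echo "DNS lookup failed at $(date +%T)"
        i=$((i + 1)); sleep 1
    done

    echo "== 3. Route analysis =="
    mtr --report --report-cycles 20 "$host"

    echo "== 4. TCP connection state =="
    ss -tan state syn-sent           # connections stuck waiting for SYN-ACK
    netstat -s | grep -i overflow    # listen queue overflows

    echo "== 5. System resources =="
    sar -n DEV 1 5
    ss -s
}
```

A single clean run proves little for an intermittent fault, so the output is most useful when collected several times and compared between good and bad periods.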
Solutions for Intermittent Timeouts:
| Possible Cause | Diagnostic Method | Solution |
|---|---|---|
| Intermittent DNS Resolution | while true; do dig +short host; sleep 1; done | Use IP directly, add to /etc/hosts, change DNS server |
| Network Flapping | mtr --report host | Contact network provider, use redundant links |
| TCP Connection Queue Full | ss -ltn \| grep :port; netstat -s \| grep overflow | Increase backlog queue, tune kernel parameters |
| Firewall Rate Limiting | sudo iptables -L -n -v | Adjust rate limit rules, whitelist IPs |
| MTU Issues | ping -s 1472 -M do host | Adjust MTU size, fix fragmentation |
| Resource Exhaustion | sar -n DEV 1; ss -s | Increase limits, optimize connections |
Kernel parameter tuning for timeouts:
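As one possible starting point, the sketch below prints a sysctl drop-in that targets the backlog and keepalive causes listed in the table. The parameter names are standard Linux sysctls, but the values are illustrative assumptions rather than recommendations, and the function name is hypothetical.

```shell
# Hypothetical helper: prints a sysctl drop-in for connection-heavy hosts.
# The values are illustrative assumptions, not recommendations.
timeout_sysctl_settings() {
    cat <<'EOF'
# Larger accept and SYN backlogs for bursts of new connections
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 8192
# Detect dead peers sooner (kernel default keepalive time is 7200 s)
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 5
# Wider ephemeral port range for many outbound connections
net.ipv4.ip_local_port_range = 10240 65535
EOF
}
```

To apply it, something like `timeout_sysctl_settings | sudo tee /etc/sysctl.d/90-timeouts.conf` followed by `sudo sysctl --system` would work; the file name is only an example. Measuring before and after each change is the safer way to decide which values to keep.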
Application-level fixes:
3. Filesystem & Disk Problems
Diagnosing disk failures, filesystem corruption, and storage-related issues.
Scenario 5: Disk full errors
Error Messages:
Disk Space Diagnosis:
1. Check disk usage overview
2. Identify what's consuming space
3. Find large files
4. Check for deleted files still in use
5. Check specific culprits
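A sketch of these five checks follows. The function names are hypothetical, the 90% threshold and 100 MB size are assumptions, and `du --max-depth` and `sort -h` assume GNU coreutils.

```shell
# Hypothetical helper: reads `df -P` output on stdin and prints mount
# points whose usage is at or above $1 percent (default 90).
full_filesystems() {
    awk -v limit="${1:-90}" '
        NR > 1 { use = $5; sub(/%/, "", use)
                 if (use + 0 >= limit + 0) print $6 " " use "%" }'
}

# Hypothetical wrapper that runs the five checks in order.
diagnose_disk_full() {
    echo "== 1. Usage overview =="
    df -h
    df -P | full_filesystems 90
    df -i    # inode exhaustion also produces "No space left on device"

    echo "== 2. What is consuming space =="
    sudo du -xh --max-depth=1 / 2>/dev/null | sort -rh | head -n 10

    echo "== 3. Large files =="
    sudo find / -xdev -type f -size +100M -exec ls -lh {} + 2>/dev/null

    echo "== 4. Deleted files still held open =="
    sudo lsof +L1

    echo "== 5. Common culprits =="
    sudo du -sh /var/log /tmp 2>/dev/null
    journalctl --disk-usage
}
```

Step 4 explains the common case where `df` reports a full disk but `du` cannot find the files: a process still holds a deleted file open, and the space is released only when that process closes it or restarts.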
Disk Cleanup Procedures:
Emergency cleanup commands:
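A cautious cleanup pass might look like the sketch below. The function names, the 200M journal cap, and the 7-day age cutoffs are all assumptions; review what each `find` would match before adding `-delete` on a production host.

```shell
# Hypothetical helper: empties a log file in place. Truncating instead of
# deleting frees the space even while a process still holds the file open.
truncate_log() {
    : > "$1"
}

# Hypothetical wrapper; sizes and ages below are illustrative assumptions.
emergency_cleanup() {
    sudo journalctl --vacuum-size=200M                    # cap the systemd journal
    sudo find /var/log -name '*.gz' -mtime +7 -delete     # old rotated logs
    sudo find /tmp -type f -mtime +7 -delete              # stale temp files
    command -v apt-get >/dev/null && sudo apt-get clean   # package cache
}
```

`truncate_log` runs with the caller's permissions, so root-owned logs need it to be invoked from a root shell.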
Preventive measures:
Scenario 6: Filesystem read-only or corruption
Symptoms:
- "Read-only filesystem" errors when trying to write
- Disk I/O errors in dmesg or /var/log/messages
- Filesystem checks forced on reboot
- Data corruption or missing files
- Inability to create or modify files
Filesystem Health Check:
1. Check filesystem status
2. Check disk health (SMART)
3. Check for bad blocks
4. Check I/O errors
5. Remount filesystem
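The health check above can be sketched as follows. The device paths `/dev/sda` and `/dev/sda1` are placeholders, the function names are hypothetical, and the script assumes `smartctl` (smartmontools) and `badblocks` (e2fsprogs) are installed. The remount in step 5 is printed, not executed, since remounting read-write on a failing disk can make things worse.

```shell
# Hypothetical helper: reads /proc/mounts-style lines on stdin and prints
# the mount point and type of every filesystem mounted read-only.
readonly_mounts() {
    awk '{ n = split($4, opt, ",")
           for (i = 1; i <= n; i++) if (opt[i] == "ro") print $2 " (" $3 ")" }'
}

# Hypothetical wrapper; /dev/sda and /dev/sda1 are placeholder devices.
check_filesystem_health() {
    disk="${1:-/dev/sda}"; part="${2:-/dev/sda1}"; mnt="${3:-/}"

    echo "== 1. Filesystems currently mounted read-only =="
    readonly_mounts < /proc/mounts

    echo "== 2. SMART health =="
    sudo smartctl -H "$disk"

    echo "== 3. Bad blocks (read-only scan, can be slow) =="
    sudo badblocks -sv "$part"

    echo "== 4. I/O errors in the kernel log =="
    sudo dmesg | grep -iE 'i/o error|read-only' | tail -n 20

    echo "== 5. Remount =="
    echo "If the checks above are clean: sudo mount -o remount,rw $mnt"
}
```

Some pseudo-filesystems and image mounts (squashfs, for example) are read-only by design, so the output of step 1 needs a quick sanity check before it is treated as a fault.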
Recovery Procedures:
Filesystem repair commands:
Data recovery tools:
Monitoring and prevention:
Troubleshooting Quick Reference
Diagnostic Commands Cheatsheet
| Symptom | First Command | Follow-up Commands | Expected Output |
|---|---|---|---|
| Server Slow | uptime | top, free -h, df -h | Load avg < CPU cores, memory available |
| Connection Refused | ss -tulpn \| grep :port | systemctl status, iptables -L | Process listening on port |
| Disk Full | df -h | du -sh /*, find / -size +100M | Available space > 10% |
| High CPU | top | ps aux --sort=-%cpu, pidstat 1 | CPU usage < 80% per core |
| High Memory | free -h | ps aux --sort=-%mem, smem -t | Available memory > 10% |
| Network Issues | ping -c 4 8.8.8.8 | mtr host, ss -s, netstat -s | 0% packet loss, normal latency |
| Service Down | systemctl status service | journalctl -u service, tail -f log | Active (running) status |
| Permission Denied | ls -la file | getfacl file, groups user | User has rwx permissions |
| File Corruption | dmesg \| tail -20 | fsck -n /dev, smartctl -a | No I/O errors in logs |
| DNS Problems | dig google.com | nslookup, cat /etc/resolv.conf | Returns IP address |
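Several of the "Expected Output" thresholds in the table can be checked in one pass. The following is a rough sketch with hypothetical function names, assuming a Linux host with `/proc` and systemd.

```shell
# Hypothetical helper: reads /proc/meminfo-style text on stdin and prints
# MemAvailable as a whole-number percentage of MemTotal.
mem_available_percent() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END              { printf "%d\n", avail * 100 / total }'
}

# Hypothetical snapshot of load, memory, disk, and failed services.
health_snapshot() {
    echo "load (1m): $(cut -d' ' -f1 /proc/loadavg) on $(nproc) cores"
    echo "memory available: $(mem_available_percent < /proc/meminfo)%"
    df -P | awk 'NR > 1 { print "disk " $6 ": " $5 " used" }'
    systemctl --failed --no-legend
}
```

Comparing the snapshot against the table gives a quick first read: load below the core count, available memory above 10%, and disk usage below 90%.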
Systematic Troubleshooting Approach
The 5-Step Troubleshooting Methodology:
Essential questions to ask:
- Scope: Is this affecting one user or all users?
- Timing: When exactly did it start? After a deployment?
- Pattern: Is it constant or intermittent?
- Changes: What was changed before the issue?
- Impact: What's the business impact?
Essential Troubleshooting Toolkit
Must-have tools for every DevOps engineer:
Useful one-liners for quick diagnostics:
Emergency recovery commands: