Learn how to proactively monitor CPU, memory, disk, and network resources using powerful Bash scripts. This guide provides production-ready monitoring solutions with automated alerts, comprehensive reporting, and best practices for maintaining system health.
Why System Monitoring Matters
Proactive monitoring is essential for maintaining system reliability, performance, and security. Effective monitoring helps you:
- Prevent Outages: Detect issues before they cause downtime
- Optimize Performance: Identify bottlenecks and optimize resource usage
- Plan Capacity: Understand growth trends and plan for scaling
- Ensure Security: Detect unusual activity and potential breaches
- Reduce Costs: Right-size resources and eliminate waste
- CPU (Critical): Track processor usage, load averages, and core utilization
- Memory (High Priority): Monitor RAM usage, swap activity, and memory pressure
- Disk (Warning): Check disk space, I/O performance, and filesystem health
- Network (Stable): Track bandwidth, latency, connections, and interface status
1. CPU Monitoring
CPU monitoring helps identify performance bottlenecks and ensure applications have sufficient processing power. Key metrics to track:
- CPU Usage: Percentage of CPU time spent processing
- Load Average: System load over 1, 5, and 15 minutes
- Per-Core Usage: Utilization of individual CPU cores
- Process CPU: Top CPU-consuming processes
- Context Switches: Frequency of CPU context switches
Essential CPU Monitoring Commands
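The script in this section relies on `uptime`, `nproc`, `top`, `mpstat`, and `ps`. As a complement, the overall CPU usage metric can also be derived directly from `/proc/stat` by comparing two samples. The following is a minimal sketch under stated assumptions: the function name `cpu_busy_percent` is hypothetical, and counting `iowait` as idle time is a design choice, not something the guide prescribes.

```bash
#!/bin/bash
# Sketch: CPU busy percentage between two aggregate "cpu ..." lines from /proc/stat.
# cpu_busy_percent is a hypothetical helper name used for illustration only.
cpu_busy_percent() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i   # user through steal
            idle = $5 + $6                          # idle + iowait (assumption: iowait counts as idle)
            if (NR == 1) { t1 = total; i1 = idle }
            else         { t2 = total; i2 = idle }
        }
        END {
            dt = t2 - t1
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (dt - (i2 - i1)) / dt
        }'
}

# Usage: take two samples one second apart.
if [ -r /proc/stat ]; then
    s1=$(head -n 1 /proc/stat)
    sleep 1
    s2=$(head -n 1 /proc/stat)
    echo "CPU busy: $(cpu_busy_percent "$s1" "$s2")%"
fi
```

Working from raw counters avoids depending on the column layout of `top`, which can vary between versions and locales.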
CPU Monitoring Script
#!/bin/bash
# monitor-cpu.sh - Comprehensive CPU monitoring with alerts
# ================= CONFIGURATION =================
THRESHOLD_LOAD=4.0 # Load average threshold
THRESHOLD_CPU=85 # CPU usage percentage threshold
ALERT_EMAIL="admin@example.com"
LOG_FILE="/var/log/cpu-monitor.log"
REPORT_FILE="/tmp/cpu-report-$(date +%Y%m%d_%H%M%S).txt"
# ================= INITIALIZATION =================
echo "=== CPU Monitoring Report - $(date) ===" > "$REPORT_FILE"
echo "Hostname: $(hostname)" >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
ALERTS=()
WARNINGS=()
# ================= LOAD AVERAGE CHECK =================
echo "1. Load Average Analysis:" >> "$REPORT_FILE"
echo "-------------------------" >> "$REPORT_FILE"
# /proc/loadavg holds the 1, 5, and 15 minute averages and is not affected by locale
read -r LOAD1 LOAD5 LOAD15 _ < /proc/loadavg
CPU_CORES=$(nproc)
echo " Cores available: $CPU_CORES" >> "$REPORT_FILE"
echo " 1-minute load: $LOAD1" >> "$REPORT_FILE"
echo " 5-minute load: $LOAD5" >> "$REPORT_FILE"
echo " 15-minute load: $LOAD15" >> "$REPORT_FILE"
# Calculate relative load (load / cores)
RELATIVE_LOAD=$(echo "scale=2; $LOAD1 / $CPU_CORES" | bc)
if (( $(echo "$LOAD1 > $THRESHOLD_LOAD" | bc -l) )); then
ALERT="CRITICAL: High load average: $LOAD1 (Relative: $RELATIVE_LOAD per core)"
ALERTS+=("$ALERT")
echo " ⚠️ $ALERT" >> "$REPORT_FILE"
fi
# ================= CPU UTILIZATION CHECK =================
echo "" >> "$REPORT_FILE"
echo "2. CPU Utilization:" >> "$REPORT_FILE"
echo "--------------------" >> "$REPORT_FILE"
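# Get CPU usage from top (match the "id" label so fused fields like "ni,100.0 id" still parse)
CPU_USAGE=$(LC_ALL=C top -b -n 1 | awk '/^%Cpu/ {gsub(/,/, " "); for (i = 2; i <= NF; i++) if ($i == "id") {print 100 - $(i-1); exit}}')
CPU_USAGE=${CPU_USAGE:-0}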
echo " Overall CPU Usage: ${CPU_USAGE}%" >> "$REPORT_FILE"
if (( $(echo "$CPU_USAGE > $THRESHOLD_CPU" | bc -l) )); then
ALERT="CRITICAL: High CPU usage: ${CPU_USAGE}%"
ALERTS+=("$ALERT")
echo " ⚠️ $ALERT" >> "$REPORT_FILE"
fi
# ================= PER-CORE ANALYSIS =================
echo "" >> "$REPORT_FILE"
echo "3. Per-Core Utilization:" >> "$REPORT_FILE"
echo "-------------------------" >> "$REPORT_FILE"
if command -v mpstat &> /dev/null; then
    # Parse the "Average:" rows so the timestamp format (12h or 24h) cannot shift
    # columns and the header row is never reported as a core
    LC_ALL=C mpstat -P ALL 1 1 | awk '$1 == "Average:" && $2 != "CPU"' | while read -r _ CORE USR _ SYS REST; do
        IDLE=${REST##* }   # %idle is the last column
        if [ "$CORE" == "all" ]; then
            echo " All cores: ${IDLE}% idle" >> "$REPORT_FILE"
        else
            echo " Core $CORE: User=${USR}% System=${SYS}% Idle=${IDLE}%" >> "$REPORT_FILE"
        fi
    done
else
    echo " mpstat not available, using alternative method..." >> "$REPORT_FILE"
    grep -E "^processor|^cpu MHz" /proc/cpuinfo | while read -r line; do
        echo " $line" >> "$REPORT_FILE"
    done
fi
# ================= TOP CPU PROCESSES =================
echo "" >> "$REPORT_FILE"
echo "4. Top CPU Processes:" >> "$REPORT_FILE"
echo "----------------------" >> "$REPORT_FILE"
echo " PID USER %CPU COMMAND" >> "$REPORT_FILE"
ps aux --sort=-%cpu | head -6 | tail -5 | while read -r line; do
PID=$(echo "$line" | awk '{print $2}')
USER=$(echo "$line" | awk '{print $1}')
CPU=$(echo "$line" | awk '{print $3}')
CMD=$(echo "$line" | awk '{for(i=11;i<=NF;i++) printf $i" "; print ""}')
echo " $PID $USER $CPU% ${CMD:0:50}" >> "$REPORT_FILE"
done
# ================= CONTEXT SWITCHES =================
echo "" >> "$REPORT_FILE"
echo "5. System Activity:" >> "$REPORT_FILE"
echo "-------------------" >> "$REPORT_FILE"
if [ -f /proc/stat ]; then
CTXT1=$(grep ctxt /proc/stat | awk '{print $2}')
sleep 1
CTXT2=$(grep ctxt /proc/stat | awk '{print $2}')
CTXT_DELTA=$((CTXT2 - CTXT1))
echo " Context switches per second: $CTXT_DELTA" >> "$REPORT_FILE"
if [ $CTXT_DELTA -gt 10000 ]; then
WARNING="High context switching: $CTXT_DELTA/sec"
WARNINGS+=("$WARNING")
echo " ⚠️ $WARNING" >> "$REPORT_FILE"
fi
fi
# ================= ALERTS AND NOTIFICATIONS =================
echo "" >> "$REPORT_FILE"
echo "=== SUMMARY ===" >> "$REPORT_FILE"
if [ ${#ALERTS[@]} -eq 0 ]; then
echo "✅ CPU status: NORMAL" >> "$REPORT_FILE"
else
echo "⚠️ CRITICAL ISSUES DETECTED: ${#ALERTS[@]}" >> "$REPORT_FILE"
for alert in "${ALERTS[@]}"; do
echo " • $alert" >> "$REPORT_FILE"
done
# Send alert email if configured
if [ -n "$ALERT_EMAIL" ]; then
mail -s "CPU ALERT: Issues on $(hostname)" \
"$ALERT_EMAIL" < "$REPORT_FILE"
echo "Alert email sent to $ALERT_EMAIL" >> "$LOG_FILE"
fi
fi
if [ ${#WARNINGS[@]} -gt 0 ]; then
echo "" >> "$REPORT_FILE"
echo "ℹ️ WARNINGS:" >> "$REPORT_FILE"
for warning in "${WARNINGS[@]}"; do
echo " • $warning" >> "$REPORT_FILE"
done
fi
# Log to file
cat "$REPORT_FILE" >> "$LOG_FILE"
echo "CPU monitoring completed. Report: $REPORT_FILE"
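The script above computes a per-core relative load but compares the absolute 1-minute load against `THRESHOLD_LOAD`, so the same threshold means different things on a 2-core and a 32-core host. One possible variation is to alert on load per core instead. This is only a sketch: `PER_CORE_LIMIT`, `load_per_core`, and `is_overloaded` are hypothetical names, and a limit of 1.0 is an assumption that should be tuned against your own baseline.

```bash
#!/bin/bash
# Sketch: alert on load per core instead of absolute load.
# PER_CORE_LIMIT is a hypothetical setting, not part of the script above.
PER_CORE_LIMIT=1.0

# Print load divided by core count, to two decimals.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Succeed (exit 0) when load per core is above the limit.
is_overloaded() {
    awk -v load="$1" -v cores="$2" -v limit="$3" 'BEGIN { exit !(load / cores > limit) }'
}

if [ -r /proc/loadavg ]; then
    read -r load1 _ < /proc/loadavg
    cores=$(nproc)
    if is_overloaded "$load1" "$cores" "$PER_CORE_LIMIT"; then
        echo "CRITICAL: load per core is $(load_per_core "$load1" "$cores")"
    else
        echo "OK: load per core is $(load_per_core "$load1" "$cores")"
    fi
fi
```

Using `awk` for the comparison keeps the floating-point logic in one place and removes the dependency on `bc`.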
2. Memory Monitoring
Memory monitoring prevents out-of-memory conditions and helps optimize application performance. Critical memory metrics:
- RAM Usage: Total, used, free, and cached memory
- Swap Usage: Swap space utilization and activity
- Memory Pressure: Page faults and swapping frequency
- Process Memory: Top memory-consuming processes
- Cache/Buffer: Filesystem cache and buffer usage
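The RAM and swap figures listed above can be read from `/proc/meminfo` in a single pass. The sketch below assumes a kernel that exposes `MemAvailable`; the function names `mem_used_percent` and `swap_used_percent` are hypothetical, and the optional file argument exists only so the functions can be exercised against a saved copy.

```bash
#!/bin/bash
# Sketch: derive usage percentages from /proc/meminfo-formatted text.
# Both function names are hypothetical; the optional argument is a file to read.

# Percentage of RAM in use, based on MemTotal and MemAvailable.
mem_used_percent() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total > 0) printf "%d\n", (total - avail) * 100 / total
            else exit 1
        }' "${1:-/proc/meminfo}"
}

# Percentage of swap in use; prints 0 when no swap is configured.
swap_used_percent() {
    awk '
        $1 == "SwapTotal:" { total = $2 }
        $1 == "SwapFree:"  { free = $2 }
        END {
            if (total > 0) printf "%d\n", (total - free) * 100 / total
            else print 0
        }' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    echo "RAM used:  $(mem_used_percent)%"
    echo "Swap used: $(swap_used_percent)%"
fi
```

Basing the calculation on `MemAvailable` rather than `MemFree` avoids false alarms caused by reclaimable page cache.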
Memory Monitoring Thresholds
Memory Monitoring Script
#!/bin/bash
# monitor-memory.sh - Advanced memory monitoring
# ================= CONFIGURATION =================
THRESHOLD_RAM=85 # RAM usage percentage threshold
THRESHOLD_SWAP=50 # Swap usage percentage threshold
THRESHOLD_PAGEFAULTS=500 # Page faults per second threshold
ALERT_EMAIL="admin@example.com"
LOG_FILE="/var/log/memory-monitor.log"
# ================= FUNCTIONS =================
get_memory_stats() {
# Get memory info from /proc/meminfo
MEM_TOTAL=$(grep MemTotal /proc/meminfo | awk '{print $2}')
MEM_FREE=$(grep MemFree /proc/meminfo | awk '{print $2}')
MEM_AVAILABLE=$(grep MemAvailable /proc/meminfo | awk '{print $2}')
BUFFERS=$(grep Buffers /proc/meminfo | awk '{print $2}')
CACHED=$(grep '^Cached' /proc/meminfo | awk '{print $2}')
SWAP_TOTAL=$(grep SwapTotal /proc/meminfo | awk '{print $2}')
SWAP_FREE=$(grep SwapFree /proc/meminfo | awk '{print $2}')
# Calculate percentages
MEM_USED=$((MEM_TOTAL - MEM_AVAILABLE))
MEM_PERCENT=$((MEM_USED * 100 / MEM_TOTAL))
SWAP_USED=$((SWAP_TOTAL - SWAP_FREE))
if [ $SWAP_TOTAL -gt 0 ]; then
SWAP_PERCENT=$((SWAP_USED * 100 / SWAP_TOTAL))
else
SWAP_PERCENT=0
fi
# Convert to human readable
MEM_TOTAL_HR=$(numfmt --from-unit=1K --to=iec $MEM_TOTAL)
MEM_USED_HR=$(numfmt --from-unit=1K --to=iec $MEM_USED)
MEM_AVAILABLE_HR=$(numfmt --from-unit=1K --to=iec $MEM_AVAILABLE)
}
check_page_faults() {
    # Report page faults per second as an integer
    if command -v sar &> /dev/null; then
        # fault/s is the fourth field of the "Average:" row printed by sar -B
        PAGEFAULTS=$(LC_ALL=C sar -B 1 1 | awk '$1 == "Average:" {printf "%d", $4}')
    else
        # Alternative method: difference of the pgfault counter over one second
        PGFAULT1=$(awk '$1 == "pgfault" {print $2}' /proc/vmstat)
        sleep 1
        PGFAULT2=$(awk '$1 == "pgfault" {print $2}' /proc/vmstat)
        PAGEFAULTS=$((PGFAULT2 - PGFAULT1))
    fi
    echo "${PAGEFAULTS:-0}"
}
# ================= MAIN MONITORING =================
echo "=== Memory Monitoring - $(date) ===" | tee -a "$LOG_FILE"
get_memory_stats
# Display memory dashboard
echo ""
echo "Memory Dashboard:"
echo "================="
echo "RAM Total: $MEM_TOTAL_HR"
echo "RAM Used: $MEM_USED_HR (${MEM_PERCENT}%)"
echo "RAM Available: $MEM_AVAILABLE_HR"
echo "Swap Used: ${SWAP_PERCENT}% of $(numfmt --from-unit=1K --to=iec $SWAP_TOTAL)"
echo "Buffers: $(numfmt --from-unit=1K --to=iec $BUFFERS)"
echo "Cached: $(numfmt --from-unit=1K --to=iec $CACHED)"
echo ""
# Check thresholds
ALERTS=()
if [ $MEM_PERCENT -ge $THRESHOLD_RAM ]; then
ALERTS+=("CRITICAL: High RAM usage: ${MEM_PERCENT}%")
fi
if [ $SWAP_PERCENT -ge $THRESHOLD_SWAP ]; then
ALERTS+=("CRITICAL: High Swap usage: ${SWAP_PERCENT}%")
fi
PAGEFAULTS=$(check_page_faults)
if [ $PAGEFAULTS -ge $THRESHOLD_PAGEFAULTS ]; then
ALERTS+=("CRITICAL: High page faults: ${PAGEFAULTS}/sec")
fi
# List the top memory consumers and flag any process using more than 20% of RAM
echo "Top Memory Processes:" | tee -a "$LOG_FILE"
echo "PID USER %MEM RSS(MB) COMMAND" | tee -a "$LOG_FILE"
# Feed the loop through process substitution: a "ps | while" pipeline runs the
# loop in a subshell, so ALERTS+= would be lost when the loop ends
while read -r PROC_USER PID _ MEM _ RSS _ _ _ _ CMD; do
    RSS_MB=$((RSS / 1024))
    # Check for unusually high memory usage
    if [ "$(echo "$MEM > 20" | bc)" -eq 1 ]; then
        ALERTS+=("WARNING: Process $PID (${CMD:0:40}) using ${MEM}% memory")
    fi
    echo "$PID $PROC_USER ${MEM}% ${RSS_MB}MB ${CMD:0:40}" | tee -a "$LOG_FILE"
done < <(ps aux --sort=-%mem | head -6 | tail -5)
# Check OOM killer activity
if dmesg | grep -i "oom" | grep -i "kill" > /dev/null 2>&1; then
ALERTS+=("CRITICAL: OOM killer has been active recently")
echo "OOM Killer has terminated processes recently" | tee -a "$LOG_FILE"
fi
# Send alerts if any
if [ ${#ALERTS[@]} -gt 0 ]; then
echo ""
echo "⚠️ ALERTS DETECTED:" | tee -a "$LOG_FILE"
for alert in "${ALERTS[@]}"; do
echo " • $alert" | tee -a "$LOG_FILE"
done
if [ -n "$ALERT_EMAIL" ]; then
ALERT_SUBJECT="Memory Alert: $(hostname) - $(date)"
{
echo "Memory Alert Report for $(hostname)"
echo "Generated: $(date)"
echo ""
echo "Current Status:"
echo "RAM Usage: ${MEM_PERCENT}%"
echo "Swap Usage: ${SWAP_PERCENT}%"
echo "Page Faults: ${PAGEFAULTS}/sec"
echo ""
echo "Alerts:"
for alert in "${ALERTS[@]}"; do
echo " - $alert"
done
} | mail -s "$ALERT_SUBJECT" "$ALERT_EMAIL"
fi
else
echo ""
echo "✅ Memory status: NORMAL" | tee -a "$LOG_FILE"
fi
echo "Monitoring completed at $(date)" | tee -a "$LOG_FILE"
3. Disk Monitoring
Disk monitoring prevents data loss and performance degradation. Essential disk metrics:
- Disk Space: Free, used, and available space per filesystem
- Inode Usage: Inode consumption and availability
- I/O Performance: Read/write throughput and latency
- Disk Health: SMART status and error rates
- Mount Status: Filesystem mount points and options
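The disk space and inode metrics above both come from `df`. As a rough illustration, the sketch below filters `df -P` output for filesystems at or above a threshold. The function name `over_threshold` is hypothetical, and the sketch assumes mount points without spaces.

```bash
#!/bin/bash
# Sketch: list filesystems whose usage is at or above a threshold.
# over_threshold is a hypothetical helper; it reads `df -P` (or `df -Pi`) text on stdin
# and prints "mount percent" per match. Assumes mount points contain no spaces.
over_threshold() {
    local threshold="$1"
    awk -v t="$threshold" '
        NR > 1 {
            pct = $5
            sub(/%/, "", pct)
            # Skip rows where the percentage is not numeric (some filesystems report "-")
            if (pct ~ /^[0-9]+$/ && pct + 0 >= t) print $6, pct
        }'
}

# Usage: disk space at 85% or more, then inode usage at 80% or more.
df -P  2>/dev/null | over_threshold 85 || true
df -Pi 2>/dev/null | over_threshold 80 || true
```

The `-P` flag requests the POSIX output format, which keeps each filesystem on a single line even when the device name is long.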
Disk Monitoring Dashboard Preview
Comprehensive Disk Monitoring Script
#!/bin/bash
# monitor-disk.sh - Advanced disk space and performance monitoring
# ================= CONFIGURATION =================
THRESHOLD_DISK=85 # Disk usage percentage threshold
THRESHOLD_INODE=80 # Inode usage percentage threshold
THRESHOLD_IOWAIT=20 # I/O wait percentage threshold
CRITICAL_PARTITIONS=("/" "/var" "/home") # Critical partitions to monitor
LOG_FILE="/var/log/disk-monitor.log"
REPORT_DIR="/var/log/disk-reports"
# ================= INITIALIZATION =================
mkdir -p "$REPORT_DIR"
REPORT_FILE="$REPORT_DIR/disk-report-$(date +%Y%m%d_%H%M%S).txt"
echo "=== Disk Monitoring Report - $(date) ===" > "$REPORT_FILE"
echo "Hostname: $(hostname)" >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"
ALERTS=()
CRITICAL_ALERTS=()
# ================= DISK SPACE ANALYSIS =================
echo "1. Disk Space Analysis:" >> "$REPORT_FILE"
echo "------------------------" >> "$REPORT_FILE"
# Read df through process substitution so the alert arrays are updated in the
# current shell (a "df | while" pipeline would update them in a subshell only).
# -P keeps each filesystem on one line even when the device name is long.
while read -r FILESYSTEM SIZE USED AVAIL USE_PERCENT MOUNT; do
    USE_PERCENT=${USE_PERCENT%\%}
    echo " $MOUNT ($FILESYSTEM):" >> "$REPORT_FILE"
    echo " Size: $SIZE, Used: $USED, Available: $AVAIL" >> "$REPORT_FILE"
    echo " Usage: $USE_PERCENT%" >> "$REPORT_FILE"
    # Check if this is a critical partition
    IS_CRITICAL=false
    for critical in "${CRITICAL_PARTITIONS[@]}"; do
        if [ "$MOUNT" == "$critical" ]; then
            IS_CRITICAL=true
            break
        fi
    done
    # Generate alerts based on thresholds
    if [ "$USE_PERCENT" -ge 95 ]; then
        ALERT="CRITICAL: Disk almost full on $MOUNT: ${USE_PERCENT}%"
        if [ "$IS_CRITICAL" = true ]; then
            CRITICAL_ALERTS+=("$ALERT")
        else
            ALERTS+=("$ALERT")
        fi
        echo " ⚠️ $ALERT" >> "$REPORT_FILE"
    elif [ "$USE_PERCENT" -ge "$THRESHOLD_DISK" ]; then
        ALERT="WARNING: High disk usage on $MOUNT: ${USE_PERCENT}%"
        ALERTS+=("$ALERT")
        echo " ⚠️ $ALERT" >> "$REPORT_FILE"
    fi
done < <(df -hP | grep '^/dev/')
# ================= INODE USAGE CHECK =================
echo "" >> "$REPORT_FILE"
echo "2. Inode Usage Analysis:" >> "$REPORT_FILE"
echo "-------------------------" >> "$REPORT_FILE"
# Process substitution keeps the alert arrays in the current shell
while read -r FILESYSTEM INODES_TOTAL INODES_USED INODES_FREE INODE_PERCENT MOUNT; do
    INODE_PERCENT=${INODE_PERCENT%\%}
    # Some filesystems report "-" instead of a percentage; skip them
    [[ "$INODE_PERCENT" =~ ^[0-9]+$ ]] || continue
    if [ "$INODE_PERCENT" -ge "$THRESHOLD_INODE" ]; then
        echo " $MOUNT: ${INODE_PERCENT}% inodes used ($INODES_USED/$INODES_TOTAL)" >> "$REPORT_FILE"
        if [ "$INODE_PERCENT" -ge 90 ]; then
            ALERT="CRITICAL: High inode usage on $MOUNT: ${INODE_PERCENT}%"
            CRITICAL_ALERTS+=("$ALERT")
        else
            ALERT="WARNING: High inode usage on $MOUNT: ${INODE_PERCENT}%"
            ALERTS+=("$ALERT")
        fi
        echo " ⚠️ $ALERT" >> "$REPORT_FILE"
    fi
done < <(df -iP | grep '^/dev/')
# ================= I/O PERFORMANCE MONITORING =================
echo "" >> "$REPORT_FILE"
echo "3. I/O Performance:" >> "$REPORT_FILE"
echo "--------------------" >> "$REPORT_FILE"
if command -v iostat &> /dev/null; then
    echo " Device Read(KB/s) Write(KB/s) Util%" >> "$REPORT_FILE"
    # Column order differs between sysstat versions, so locate rkB/s, wkB/s,
    # and %util from the header row instead of using fixed positions
    while read -r DEVICE READ_KB WRITE_KB UTIL; do
        echo " $DEVICE $READ_KB $WRITE_KB $UTIL%" >> "$REPORT_FILE"
        if (( $(echo "$UTIL > $THRESHOLD_IOWAIT" | bc -l) )); then
            ALERT="WARNING: High I/O utilization on $DEVICE: ${UTIL}%"
            ALERTS+=("$ALERT")
            echo " ⚠️ $ALERT" >> "$REPORT_FILE"
        fi
    done < <(LC_ALL=C iostat -dkx 1 1 | awk '
        /^Device/ { for (i = 1; i <= NF; i++) col[$i] = i; next }
        /^(sd|nvme)/ && ("rkB/s" in col) { print $1, $(col["rkB/s"]), $(col["wkB/s"]), $(col["%util"]) }')
else
    echo " iostat not available. Install sysstat package." >> "$REPORT_FILE"
fi
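# Check I/O wait from top (match the "wa" label instead of a fixed column)
IOWAIT=$(LC_ALL=C top -b -n 1 | awk '/^%Cpu/ {gsub(/,/, " "); for (i = 2; i <= NF; i++) if ($i == "wa") {print $(i-1); exit}}')
IOWAIT=${IOWAIT:-0}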
echo " CPU I/O Wait: ${IOWAIT}%" >> "$REPORT_FILE"
if (( $(echo "$IOWAIT > $THRESHOLD_IOWAIT" | bc -l) )); then
ALERT="WARNING: High CPU I/O wait: ${IOWAIT}%"
ALERTS+=("$ALERT")
echo " ⚠️ $ALERT" >> "$REPORT_FILE"
fi
# ================= SMART STATUS CHECK =================
echo "" >> "$REPORT_FILE"
echo "4. Disk Health (SMART):" >> "$REPORT_FILE"
echo "------------------------" >> "$REPORT_FILE"
if command -v smartctl &> /dev/null; then
    # Process substitution keeps CRITICAL_ALERTS in the current shell
    while read -r disk; do
        [ -b "/dev/$disk" ] || continue
        SMART_STATUS=$(smartctl -H "/dev/$disk" 2>/dev/null | grep "SMART overall-health" || echo "Not Supported")
        if echo "$SMART_STATUS" | grep -q "PASSED"; then
            echo " /dev/$disk: ✅ SMART health PASSED" >> "$REPORT_FILE"
        elif echo "$SMART_STATUS" | grep -q "FAILED"; then
            ALERT="CRITICAL: SMART failure detected on /dev/$disk"
            CRITICAL_ALERTS+=("$ALERT")
            echo " /dev/$disk: ❌ $ALERT" >> "$REPORT_FILE"
        else
            echo " /dev/$disk: ℹ️ SMART not supported or unavailable" >> "$REPORT_FILE"
        fi
    done < <(lsblk -d -n -o NAME,TYPE | awk '$2 == "disk" {print $1}')
else
    echo " smartctl not installed. Install smartmontools package." >> "$REPORT_FILE"
fi
# ================= LARGEST FILES CHECK =================
echo "" >> "$REPORT_FILE"
echo "5. Largest Files Analysis:" >> "$REPORT_FILE"
echo "---------------------------" >> "$REPORT_FILE"
for partition in "${CRITICAL_PARTITIONS[@]}"; do
if [ -d "$partition" ]; then
echo " Top 5 largest files in $partition:" >> "$REPORT_FILE"
# -xdev keeps find on this filesystem, so scanning / skips /proc, /sys, and other mounts
find "$partition" -xdev -type f -exec du -h {} + 2>/dev/null | sort -rh | head -5 | while read -r SIZE FILE; do
    # FILE takes the rest of the line, so paths containing spaces stay intact
    echo " $SIZE - $FILE" >> "$REPORT_FILE"
done
fi
done
# ================= SUMMARY AND ALERTS =================
echo "" >> "$REPORT_FILE"
echo "=== SUMMARY ===" >> "$REPORT_FILE"
if [ ${#CRITICAL_ALERTS[@]} -gt 0 ]; then
echo "❌ CRITICAL ISSUES DETECTED:" >> "$REPORT_FILE"
for alert in "${CRITICAL_ALERTS[@]}"; do
echo " • $alert" >> "$REPORT_FILE"
done
fi
if [ ${#ALERTS[@]} -gt 0 ]; then
echo "" >> "$REPORT_FILE"
echo "⚠️ WARNINGS:" >> "$REPORT_FILE"
for alert in "${ALERTS[@]}"; do
echo " • $alert" >> "$REPORT_FILE"
done
fi
if [ ${#CRITICAL_ALERTS[@]} -eq 0 ] && [ ${#ALERTS[@]} -eq 0 ]; then
echo "✅ All disk metrics are within normal ranges" >> "$REPORT_FILE"
fi
# Save and log
cat "$REPORT_FILE" >> "$LOG_FILE"
echo "Disk monitoring completed. Report saved to: $REPORT_FILE"
4. Network Monitoring
Network monitoring ensures connectivity, performance, and security. Key network metrics:
- Bandwidth Usage: Incoming and outgoing traffic rates
- Latency: Network response times and delays
- Packet Loss: Percentage of lost packets
- Connection Counts: Active TCP/UDP connections
- Interface Status: Network interface up/down and errors
Network Monitoring Commands Reference
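Commonly used tools for the metrics above include `ping` for latency and packet loss, `ss` for connection counts, and `ip -s link` for interface status and error counters. Bandwidth can also be estimated from the byte counters in `/proc/net/dev`. The following is a hedged sketch, not the guide's own network script: `iface_bytes` and `rate_kb_per_sec` are hypothetical helper names, and `eth0` is a placeholder interface name.

```bash
#!/bin/bash
# Sketch: estimate bandwidth from /proc/net/dev byte counters.
# iface_bytes and rate_kb_per_sec are hypothetical helpers for illustration.

# Print "rx_bytes tx_bytes" for one interface.
iface_bytes() {
    local iface="$1" file="${2:-/proc/net/dev}"
    awk -v ifc="$iface" '
        { sub(/:/, " ") }             # handles both "eth0: 123" and "eth0:123"
        $1 == ifc { print $2, $10 }   # receive bytes, transmit bytes
    ' "$file"
}

# Print the transfer rate in KB/s between two counter readings.
rate_kb_per_sec() {
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN { printf "%.1f\n", (b - a) / 1024 / s }'
}

# Usage: sample a placeholder interface twice, one second apart.
IFACE="eth0"
if [ -r /proc/net/dev ] && [ -n "$(iface_bytes "$IFACE")" ]; then
    read -r rx1 tx1 <<< "$(iface_bytes "$IFACE")"
    sleep 1
    read -r rx2 tx2 <<< "$(iface_bytes "$IFACE")"
    echo "$IFACE RX: $(rate_kb_per_sec "$rx1" "$rx2" 1) KB/s"
    echo "$IFACE TX: $(rate_kb_per_sec "$tx1" "$tx2" 1) KB/s"
fi
```

A longer sampling interval smooths out short bursts, at the cost of a slower check.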
5. Comprehensive System Monitoring Dashboard
6. Monitoring Best Practices
Monitoring Implementation Checklist
- CPU monitoring with load average tracking
- Memory monitoring with swap usage
- Disk space and inode monitoring
- I/O performance monitoring
- Network bandwidth and latency tracking
- Service and process monitoring
- Alerting system with escalation
- Historical data storage and retention
- Dashboard for real-time visibility
- Automated reporting and notification
7. Scheduling Monitoring Scripts
# ================= MONITORING CRONTAB =================
# System health check every 5 minutes
*/5 * * * * /opt/monitoring/system-health.sh >> /var/log/health.log 2>&1
# CPU monitoring every minute during business hours
*/1 9-17 * * 1-5 /opt/monitoring/monitor-cpu.sh >> /var/log/cpu.log 2>&1
# Memory monitoring every 15 minutes
*/15 * * * * /opt/monitoring/monitor-memory.sh >> /var/log/memory.log 2>&1
# Disk monitoring every hour
0 * * * * /opt/monitoring/monitor-disk.sh >> /var/log/disk.log 2>&1
# Network monitoring every 30 minutes
*/30 * * * * /opt/monitoring/monitor-network.sh >> /var/log/network.log 2>&1
# Comprehensive daily report at midnight
0 0 * * * /opt/monitoring/daily-report.sh >> /var/log/daily-report.log 2>&1
# Weekly capacity planning report
0 2 * * 1 /opt/monitoring/weekly-capacity.sh >> /var/log/capacity.log 2>&1
# Monthly trend analysis
0 3 1 * * /opt/monitoring/monthly-trends.sh >> /var/log/trends.log 2>&1
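With jobs scheduled as often as every minute, a slow run can overlap with the next one. One way to guard against that is a small lock wrapper around each job. This is a minimal sketch under stated assumptions: `run_once` is a hypothetical helper, the lock path is a placeholder, and the exit code 75 is an arbitrary choice to signal "skipped".

```bash
#!/bin/bash
# Sketch: skip a run when the previous one still holds the lock.
# run_once is a hypothetical helper; mkdir is used because directory creation is atomic.
run_once() {
    local lock_dir="$1"
    shift
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "skipped: $lock_dir is held by another run" >&2
        return 75
    fi
    "$@"
    local status=$?
    rmdir "$lock_dir"
    return "$status"
}

# Usage with a placeholder lock path and a stand-in command:
run_once "${TMPDIR:-/tmp}/monitor-demo.lock" echo "monitoring job ran"
```

If the job is killed before `rmdir` runs, the lock directory is left behind and must be removed manually, so a production version would also need stale-lock handling.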
Security Practices for Monitoring
1. Secure credentials: Don't store passwords in monitoring scripts
2. Limit data collection: Only collect necessary data
3. Secure transmission: Use encryption for remote monitoring
4. Access control: Restrict who can access monitoring data
5. Data retention: Define and enforce data retention policies
6. Audit logging: Log all access to monitoring systems
7. Vulnerability scanning: Regularly scan monitoring infrastructure
Getting Started with System Monitoring
Follow these steps to implement comprehensive monitoring:
- Identify critical systems: Determine what needs monitoring
- Establish baselines: Monitor for 1-2 weeks to understand normal behavior
- Set thresholds: Define warning and critical levels for each metric
- Implement monitoring scripts: Start with CPU and memory
- Add alerting: Configure email/SMS notifications
- Create dashboards: Build visibility into system health
- Test thoroughly: Simulate failures to ensure alerts work
- Document procedures: Create runbooks for common alerts
- Review and optimize: Regularly refine thresholds and alerts
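For the baseline and threshold steps above, it helps to summarize the samples collected during the observation period. The sketch below assumes one numeric sample per line, for example CPU percentages extracted from your logs; `baseline_stats` is a hypothetical helper name, and the log format is an assumption.

```bash
#!/bin/bash
# Sketch: summarize numeric samples (one per line) to help choose thresholds.
# baseline_stats is a hypothetical helper; it reads files given as arguments, or stdin.
baseline_stats() {
    awk '
        /^[0-9.]+$/ {                      # ignore blank or non-numeric lines
            n++
            sum += $1
            if (n == 1 || $1 > max) max = $1
        }
        END {
            if (n == 0) { print "samples=0"; exit }
            printf "samples=%d mean=%.2f max=%.2f\n", n, sum / n, max
        }' "$@"
}

# Usage with three illustrative samples:
printf '%s\n' 10 20 30 | baseline_stats   # prints: samples=3 mean=20.00 max=30.00
```

A warning threshold somewhat above the observed mean and a critical threshold near or above the observed maximum is one reasonable starting point, to be refined during the review step.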
Proactive Monitoring for System Reliability
Effective system monitoring transforms reactive firefighting into proactive maintenance. By implementing the scripts and practices in this guide, you'll gain deep visibility into your systems and prevent issues before they impact users.
Remember: The goal of monitoring is not to collect data, but to provide actionable insights. Focus on metrics that drive decisions and enable proactive maintenance.
Next Steps: Start with the CPU monitoring script, customize it for your environment, and schedule it to run every 5 minutes. Once you're comfortable, expand to memory, disk, and network monitoring to build a complete monitoring solution.