Monitoring System Resources: The Complete Guide

Learn how to proactively monitor CPU, memory, disk, and network resources with Bash scripts. This guide provides monitoring scripts with automated alerts and reporting, along with best practices for maintaining system health.

Why System Monitoring Matters

Proactive monitoring is essential for maintaining system reliability, performance, and security. Effective monitoring helps you:

  • Prevent Outages: Detect issues before they cause downtime
  • Optimize Performance: Identify bottlenecks and optimize resource usage
  • Plan Capacity: Understand growth trends and plan for scaling
  • Ensure Security: Detect unusual activity and potential breaches
  • Reduce Costs: Right-size resources and eliminate waste

Monitoring areas at a glance:

  • CPU Monitoring (Critical): Track processor usage, load averages, and core utilization
  • Memory Monitoring (High Priority): Monitor RAM usage, swap activity, and memory pressure
  • Disk Monitoring (Warning): Check disk space, I/O performance, and filesystem health
  • Network Monitoring (Stable): Track bandwidth, latency, connections, and interface status

1. CPU Monitoring

CPU monitoring helps identify performance bottlenecks and ensure applications have sufficient processing power. Key metrics to track:

  • CPU Usage: Percentage of CPU time spent processing
  • Load Average: System load over 1, 5, and 15 minutes
  • Per-Core Usage: Utilization of individual CPU cores
  • Process CPU: Top CPU-consuming processes
  • Context Switches: Frequency of CPU context switches

Essential CPU Monitoring Commands

top -b -n 1 # Batch mode for scripting
mpstat -P ALL 1 1 # Per-CPU statistics
sar -u 1 3 # CPU utilization history
pidstat 1 5 # Process-level CPU stats
vmstat 1 5 # System-wide CPU and memory
uptime # Load averages
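
A usage percentage can also be derived directly from the kernel counters in /proc/stat by sampling them twice and comparing the deltas. The following is a minimal sketch, assuming a Linux system; the function name cpu_usage_pct is illustrative, and idle plus iowait time is treated as "not busy", which is one common convention rather than the only one.

```shell
#!/bin/bash
# cpu_usage_pct BEFORE AFTER
# Takes two "cpu ..." lines sampled from /proc/stat and prints the busy
# percentage between them. (Hypothetical helper name.)
cpu_usage_pct() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n1 = split(a, x, " "); n2 = split(b, y, " ")
        # Fields 2..9: user nice system idle iowait irq softirq steal
        for (i = 2; i <= 9 && i <= n1; i++) t1 += x[i]
        for (i = 2; i <= 9 && i <= n2; i++) t2 += y[i]
        idle1 = x[5] + x[6]; idle2 = y[5] + y[6]   # idle + iowait
        dt = t2 - t1
        if (dt <= 0) { print "0.0"; exit }
        printf "%.1f\n", 100 * (dt - (idle2 - idle1)) / dt
    }'
}

# Sample the aggregate "cpu" line twice, one second apart
if [ -r /proc/stat ]; then
    s1=$(grep '^cpu ' /proc/stat)
    sleep 1
    s2=$(grep '^cpu ' /proc/stat)
    echo "CPU usage: $(cpu_usage_pct "$s1" "$s2")%"
fi
```

Because the result covers exactly the sampling interval, a longer sleep gives a smoother figure at the cost of a slower check.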

CPU Monitoring Script

#!/bin/bash
# monitor-cpu.sh - Comprehensive CPU monitoring with alerts

# ================= CONFIGURATION =================
THRESHOLD_LOAD=4.0          # Load average threshold
THRESHOLD_CPU=85            # CPU usage percentage threshold
ALERT_EMAIL="admin@example.com"
LOG_FILE="/var/log/cpu-monitor.log"
REPORT_FILE="/tmp/cpu-report-$(date +%Y%m%d_%H%M%S).txt"

# ================= INITIALIZATION =================
echo "=== CPU Monitoring Report - $(date) ===" > "$REPORT_FILE"
echo "Hostname: $(hostname)" >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"

ALERTS=()
WARNINGS=()

# ================= LOAD AVERAGE CHECK =================
echo "1. Load Average Analysis:" >> "$REPORT_FILE"
echo "-------------------------" >> "$REPORT_FILE"

LOAD1=$(uptime | awk -F'load average:' '{print $2}' | cut -d, -f1 | xargs)
LOAD5=$(uptime | awk -F'load average:' '{print $2}' | cut -d, -f2 | xargs)
LOAD15=$(uptime | awk -F'load average:' '{print $2}' | cut -d, -f3 | xargs)

CPU_CORES=$(nproc)
echo "  Cores available: $CPU_CORES" >> "$REPORT_FILE"
echo "  1-minute load: $LOAD1" >> "$REPORT_FILE"
echo "  5-minute load: $LOAD5" >> "$REPORT_FILE"
echo "  15-minute load: $LOAD15" >> "$REPORT_FILE"

# Calculate relative load (load / cores)
RELATIVE_LOAD=$(echo "scale=2; $LOAD1 / $CPU_CORES" | bc)

if (( $(echo "$LOAD1 > $THRESHOLD_LOAD" | bc -l) )); then
    ALERT="CRITICAL: High load average: $LOAD1 (Relative: $RELATIVE_LOAD per core)"
    ALERTS+=("$ALERT")
    echo "  ⚠️ $ALERT" >> "$REPORT_FILE"
fi

# ================= CPU UTILIZATION CHECK =================
echo "" >> "$REPORT_FILE"
echo "2. CPU Utilization:" >> "$REPORT_FILE"
echo "--------------------" >> "$REPORT_FILE"

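# Get CPU usage from top (100 minus the idle field; assumes the usual
# "%Cpu(s): ... ni, <idle> id, ..." layout, where idle is field 8)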
CPU_USAGE=$(top -b -n 1 | grep "^%Cpu" | awk '{print 100 - $8}')

echo "  Overall CPU Usage: ${CPU_USAGE}%" >> "$REPORT_FILE"

if (( $(echo "$CPU_USAGE > $THRESHOLD_CPU" | bc -l) )); then
    ALERT="CRITICAL: High CPU usage: ${CPU_USAGE}%"
    ALERTS+=("$ALERT")
    echo "  ⚠️ $ALERT" >> "$REPORT_FILE"
fi

# ================= PER-CORE ANALYSIS =================
echo "" >> "$REPORT_FILE"
echo "3. Per-Core Utilization:" >> "$REPORT_FILE"
echo "-------------------------" >> "$REPORT_FILE"

if command -v mpstat &> /dev/null; then
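    # Use the "Average:" rows: their column positions do not depend on the
    # time format, and the header row (CPU %usr ...) is filtered out
    mpstat -P ALL 1 1 | grep -E "^Average:" | grep -v "CPU" | while read -r line; do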
        CORE=$(echo "$line" | awk '{print $2}')
        USER=$(echo "$line" | awk '{print $3}')
        SYSTEM=$(echo "$line" | awk '{print $5}')
        IDLE=$(echo "$line" | awk '{print $NF}')
        
        if [ "$CORE" == "all" ]; then
            echo "  All cores: ${IDLE}% idle" >> "$REPORT_FILE"
        else
            echo "  Core $CORE: User=${USER}% System=${SYSTEM}% Idle=${IDLE}%" >> "$REPORT_FILE"
        fi
    done
else
    echo "  mpstat not available, using alternative method..." >> "$REPORT_FILE"
    grep -E "^processor|^cpu MHz" /proc/cpuinfo | while read -r line; do
        echo "  $line" >> "$REPORT_FILE"
    done
fi

# ================= TOP CPU PROCESSES =================
echo "" >> "$REPORT_FILE"
echo "4. Top CPU Processes:" >> "$REPORT_FILE"
echo "----------------------" >> "$REPORT_FILE"

echo "  PID   USER      %CPU   COMMAND" >> "$REPORT_FILE"
ps aux --sort=-%cpu | head -6 | tail -5 | while read -r line; do
    PID=$(echo "$line" | awk '{print $2}')
    USER=$(echo "$line" | awk '{print $1}')
    CPU=$(echo "$line" | awk '{print $3}')
    CMD=$(echo "$line" | awk '{for(i=11;i<=NF;i++) printf $i" "; print ""}')
    echo "  $PID   $USER      $CPU%   ${CMD:0:50}" >> "$REPORT_FILE"
done

# ================= CONTEXT SWITCHES =================
echo "" >> "$REPORT_FILE"
echo "5. System Activity:" >> "$REPORT_FILE"
echo "-------------------" >> "$REPORT_FILE"

if [ -f /proc/stat ]; then
    CTXT1=$(grep ctxt /proc/stat | awk '{print $2}')
    sleep 1
    CTXT2=$(grep ctxt /proc/stat | awk '{print $2}')
    CTXT_DELTA=$((CTXT2 - CTXT1))
    echo "  Context switches per second: $CTXT_DELTA" >> "$REPORT_FILE"
    
    if [ $CTXT_DELTA -gt 10000 ]; then
        WARNING="High context switching: $CTXT_DELTA/sec"
        WARNINGS+=("$WARNING")
        echo "  ⚠️ $WARNING" >> "$REPORT_FILE"
    fi
fi

# ================= ALERTS AND NOTIFICATIONS =================
echo "" >> "$REPORT_FILE"
echo "=== SUMMARY ===" >> "$REPORT_FILE"

if [ ${#ALERTS[@]} -eq 0 ]; then
    echo "✅ CPU status: NORMAL" >> "$REPORT_FILE"
else
    echo "⚠️ CRITICAL ISSUES DETECTED: ${#ALERTS[@]}" >> "$REPORT_FILE"
    for alert in "${ALERTS[@]}"; do
        echo "  • $alert" >> "$REPORT_FILE"
    done
    
    # Send alert email if configured
    if [ -n "$ALERT_EMAIL" ]; then
        mail -s "CPU ALERT: Issues on $(hostname)" \
             "$ALERT_EMAIL" < "$REPORT_FILE"
        echo "Alert email sent to $ALERT_EMAIL" >> "$LOG_FILE"
    fi
fi

if [ ${#WARNINGS[@]} -gt 0 ]; then
    echo "" >> "$REPORT_FILE"
    echo "ℹ️ WARNINGS:" >> "$REPORT_FILE"
    for warning in "${WARNINGS[@]}"; do
        echo "  • $warning" >> "$REPORT_FILE"
    done
fi

# Log to file
cat "$REPORT_FILE" >> "$LOG_FILE"
echo "CPU monitoring completed. Report: $REPORT_FILE"
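
The script above computes a per-core load figure but compares the raw 1-minute load against a fixed threshold, so the same value of 4.0 means different things on a 2-core and a 32-core host. A minimal sketch of a per-core comparison follows, assuming Linux's /proc/loadavg; the helper name load_exceeds and the limit of 1.0 per core are illustrative choices, not recommendations from the guide.

```shell
#!/bin/bash
# load_exceeds LOAD CORES PER_CORE_LIMIT
# Succeeds (exit 0) when LOAD / CORES is above PER_CORE_LIMIT.
load_exceeds() {
    awk -v l="$1" -v c="$2" -v t="$3" \
        'BEGIN { exit !(c > 0 && l / c > t) }'
}

if [ -r /proc/loadavg ]; then
    # /proc/loadavg starts with the 1-, 5-, and 15-minute load averages
    read -r load1 load5 load15 _ < /proc/loadavg
    cores=$(nproc)
    if load_exceeds "$load1" "$cores" 1.0; then
        echo "High load: $load1 on $cores cores"
    else
        echo "Load OK: $load1 on $cores cores"
    fi
fi
```

Reading /proc/loadavg also avoids parsing the output of uptime, whose layout can vary with locale.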

2. Memory Monitoring

Memory monitoring prevents out-of-memory conditions and helps optimize application performance. Critical memory metrics:

  • RAM Usage: Total, used, free, and cached memory
  • Swap Usage: Swap space utilization and activity
  • Memory Pressure: Page faults and swapping frequency
  • Process Memory: Top memory-consuming processes
  • Cache/Buffer: Filesystem cache and buffer usage

Memory Monitoring Thresholds

Metric           Normal      Warning       Critical    Action Required
RAM Usage        < 70%       70-85%        > 85%       Add RAM or optimize apps
Swap Usage       < 20%       20-50%        > 50%       High memory pressure
Page Faults      < 100/sec   100-500/sec   > 500/sec   Memory bottleneck
Cache Hit Ratio  > 90%       80-90%        < 80%       I/O performance issue
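
The three-level scheme in the thresholds above can be expressed as a small helper. This is a sketch only: the function name classify is hypothetical, and it covers metrics where higher values are worse (RAM, swap, page faults). Cache hit ratio runs in the opposite direction and would need the comparisons reversed.

```shell
#!/bin/bash
# classify VALUE WARN CRIT
# Prints NORMAL below WARN, WARNING from WARN up to and including CRIT,
# and CRITICAL above CRIT. awk handles decimal values.
classify() {
    awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
        if (v > c)       print "CRITICAL"
        else if (v >= w) print "WARNING"
        else             print "NORMAL"
    }'
}

# Example: RAM usage thresholds of 70% (warning) and 85% (critical)
echo "RAM at 78%: $(classify 78 70 85)"
```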

Memory Monitoring Script

#!/bin/bash
# monitor-memory.sh - Advanced memory monitoring

# ================= CONFIGURATION =================
THRESHOLD_RAM=85          # RAM usage percentage threshold
THRESHOLD_SWAP=50         # Swap usage percentage threshold
THRESHOLD_PAGEFAULTS=500  # Page faults per second threshold
ALERT_EMAIL="admin@example.com"
LOG_FILE="/var/log/memory-monitor.log"

# ================= FUNCTIONS =================
get_memory_stats() {
    # Get memory info from /proc/meminfo
    MEM_TOTAL=$(grep MemTotal /proc/meminfo | awk '{print $2}')
    MEM_FREE=$(grep MemFree /proc/meminfo | awk '{print $2}')
    MEM_AVAILABLE=$(grep MemAvailable /proc/meminfo | awk '{print $2}')
    BUFFERS=$(grep Buffers /proc/meminfo | awk '{print $2}')
    CACHED=$(grep '^Cached' /proc/meminfo | awk '{print $2}')
    SWAP_TOTAL=$(grep SwapTotal /proc/meminfo | awk '{print $2}')
    SWAP_FREE=$(grep SwapFree /proc/meminfo | awk '{print $2}')
    
    # Calculate percentages
    MEM_USED=$((MEM_TOTAL - MEM_AVAILABLE))
    MEM_PERCENT=$((MEM_USED * 100 / MEM_TOTAL))
    
    SWAP_USED=$((SWAP_TOTAL - SWAP_FREE))
    if [ $SWAP_TOTAL -gt 0 ]; then
        SWAP_PERCENT=$((SWAP_USED * 100 / SWAP_TOTAL))
    else
        SWAP_PERCENT=0
    fi
    
    # Convert to human readable
    MEM_TOTAL_HR=$(numfmt --from-unit=1K --to=iec $MEM_TOTAL)
    MEM_USED_HR=$(numfmt --from-unit=1K --to=iec $MEM_USED)
    MEM_AVAILABLE_HR=$(numfmt --from-unit=1K --to=iec $MEM_AVAILABLE)
}

check_page_faults() {
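    # Get page fault statistics
    if command -v sar &> /dev/null; then
        # On sar's "Average:" line, column 4 is fault/s; truncate it to an
        # integer so it works with the integer comparison in the main section
        PAGEFAULTS=$(sar -B 1 1 | tail -1 | awk '{printf "%d", $4}')
    else
        # Alternative method: change in the kernel's pgfault counter over 1 second
        PGFAULT1=$(awk '$1 == "pgfault" {print $2}' /proc/vmstat)
        sleep 1
        PGFAULT2=$(awk '$1 == "pgfault" {print $2}' /proc/vmstat)
        PAGEFAULTS=$((PGFAULT2 - PGFAULT1))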
    fi
    echo $PAGEFAULTS
}

# ================= MAIN MONITORING =================
echo "=== Memory Monitoring - $(date) ===" | tee -a "$LOG_FILE"
get_memory_stats

# Display memory dashboard
echo ""
echo "Memory Dashboard:"
echo "================="
echo "RAM Total:    $MEM_TOTAL_HR"
echo "RAM Used:     $MEM_USED_HR (${MEM_PERCENT}%)"
echo "RAM Available: $MEM_AVAILABLE_HR"
echo "Swap Used:    ${SWAP_PERCENT}% of $(numfmt --from-unit=1K --to=iec $SWAP_TOTAL)"
echo "Buffers:      $(numfmt --from-unit=1K --to=iec $BUFFERS)"
echo "Cached:       $(numfmt --from-unit=1K --to=iec $CACHED)"
echo ""

# Check thresholds
ALERTS=()

if [ $MEM_PERCENT -ge $THRESHOLD_RAM ]; then
    ALERTS+=("CRITICAL: High RAM usage: ${MEM_PERCENT}%")
fi

if [ $SWAP_PERCENT -ge $THRESHOLD_SWAP ]; then
    ALERTS+=("CRITICAL: High Swap usage: ${SWAP_PERCENT}%")
fi

PAGEFAULTS=$(check_page_faults)
if [ $PAGEFAULTS -ge $THRESHOLD_PAGEFAULTS ]; then
    ALERTS+=("CRITICAL: High page faults: ${PAGEFAULTS}/sec")
fi

# List the top memory consumers and flag any process using more than 20% of RAM
echo "Top Memory Processes:" | tee -a "$LOG_FILE"
echo "PID   USER     %MEM   RSS(MB)   COMMAND" | tee -a "$LOG_FILE"
# Process substitution keeps the loop in the current shell, so alerts appended
# here are still present when the alert summary runs
while read -r line; do
    PID=$(echo "$line" | awk '{print $2}')
    PROC_USER=$(echo "$line" | awk '{print $1}')
    MEM=$(echo "$line" | awk '{print $4}')
    RSS=$(echo "$line" | awk '{print $6}')
    RSS_MB=$((RSS / 1024))
    CMD=$(echo "$line" | awk '{for(i=11;i<=NF;i++) printf $i" "; print ""}')
    
    # Check for unusually high memory usage
    if [ "$(echo "$MEM > 20" | bc)" -eq 1 ]; then
        ALERTS+=("WARNING: Process $PID ($CMD) using ${MEM}% memory")
    fi
    
    echo "$PID   $PROC_USER     ${MEM}%   ${RSS_MB}MB   ${CMD:0:40}" | tee -a "$LOG_FILE"
done < <(ps aux --sort=-%mem | head -6 | tail -5)

# Check OOM killer activity in the kernel ring buffer
# (covers events since boot; dmesg may require root)
if dmesg | grep -i "oom" | grep -i "kill" > /dev/null 2>&1; then
    ALERTS+=("CRITICAL: OOM killer has been active recently")
    echo "OOM Killer has terminated processes recently" | tee -a "$LOG_FILE"
fi

# Send alerts if any
if [ ${#ALERTS[@]} -gt 0 ]; then
    echo ""
    echo "⚠️ ALERTS DETECTED:" | tee -a "$LOG_FILE"
    for alert in "${ALERTS[@]}"; do
        echo "  • $alert" | tee -a "$LOG_FILE"
    done
    
    if [ -n "$ALERT_EMAIL" ]; then
        ALERT_SUBJECT="Memory Alert: $(hostname) - $(date)"
        {
            echo "Memory Alert Report for $(hostname)"
            echo "Generated: $(date)"
            echo ""
            echo "Current Status:"
            echo "RAM Usage: ${MEM_PERCENT}%"
            echo "Swap Usage: ${SWAP_PERCENT}%"
            echo "Page Faults: ${PAGEFAULTS}/sec"
            echo ""
            echo "Alerts:"
            for alert in "${ALERTS[@]}"; do
                echo "  - $alert"
            done
        } | mail -s "$ALERT_SUBJECT" "$ALERT_EMAIL"
    fi
else
    echo ""
    echo "✅ Memory status: NORMAL" | tee -a "$LOG_FILE"
fi

echo "Monitoring completed at $(date)" | tee -a "$LOG_FILE"
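
A single snapshot shows which processes are large, not which are growing. Spotting a possible leak needs two samples taken some time apart. The sketch below assumes each snapshot is a file of "PID RSS_KB" lines, as produced by ps -eo pid=,rss=; the helper name rss_growth and the example threshold are illustrative.

```shell
#!/bin/bash
# rss_growth OLD_SNAPSHOT NEW_SNAPSHOT MIN_KB
# Prints "PID GROWTH_KB" for every process present in both snapshots whose
# resident set size grew by at least MIN_KB.
rss_growth() {
    awk -v min="$3" '
        FILENAME == ARGV[1] { old[$1] = $2; next }   # first file: remember RSS
        ($1 in old) && ($2 - old[$1] >= min) { print $1, $2 - old[$1] }
    ' "$1" "$2"
}

# Typical use (commented out because of the long wait between samples):
#   ps -eo pid=,rss= > /tmp/rss.old
#   sleep 600
#   ps -eo pid=,rss= > /tmp/rss.new
#   rss_growth /tmp/rss.old /tmp/rss.new 10240   # grew by 10 MB or more
```

Steady growth across several consecutive intervals is a stronger signal than one jump, since many programs legitimately allocate memory in bursts.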

3. Disk Monitoring

Disk monitoring prevents data loss and performance degradation. Essential disk metrics:

  • Disk Space: Free, used, and available space per filesystem
  • Inode Usage: Inode consumption and availability
  • I/O Performance: Read/write throughput and latency
  • Disk Health: SMART status and error rates
  • Mount Status: Filesystem mount points and options

Disk Monitoring Dashboard Preview

/ (root) 85% used
/home 45% used
/var 92% used ⚠️
/tmp 30% used

Comprehensive Disk Monitoring Script

#!/bin/bash
# monitor-disk.sh - Advanced disk space and performance monitoring

# ================= CONFIGURATION =================
THRESHOLD_DISK=85          # Disk usage percentage threshold
THRESHOLD_INODE=80         # Inode usage percentage threshold
THRESHOLD_IOWAIT=20        # I/O wait percentage threshold
CRITICAL_PARTITIONS=("/" "/var" "/home")  # Critical partitions to monitor
LOG_FILE="/var/log/disk-monitor.log"
REPORT_DIR="/var/log/disk-reports"

# ================= INITIALIZATION =================
mkdir -p "$REPORT_DIR"
REPORT_FILE="$REPORT_DIR/disk-report-$(date +%Y%m%d_%H%M%S).txt"

echo "=== Disk Monitoring Report - $(date) ===" > "$REPORT_FILE"
echo "Hostname: $(hostname)" >> "$REPORT_FILE"
echo "" >> "$REPORT_FILE"

ALERTS=()
CRITICAL_ALERTS=()

# ================= DISK SPACE ANALYSIS =================
echo "1. Disk Space Analysis:" >> "$REPORT_FILE"
echo "------------------------" >> "$REPORT_FILE"

# df -P prints one line per filesystem; process substitution keeps the loop in
# the current shell so alerts appended here are still set in the summary
while read -r FILESYSTEM SIZE USED AVAIL USE_PERCENT MOUNT; do
    USE_PERCENT=${USE_PERCENT%\%}
    
    echo "  $MOUNT ($FILESYSTEM):" >> "$REPORT_FILE"
    echo "    Size: $SIZE, Used: $USED, Available: $AVAIL" >> "$REPORT_FILE"
    echo "    Usage: $USE_PERCENT%" >> "$REPORT_FILE"
    
    # Check if this is a critical partition
    IS_CRITICAL=false
    for critical in "${CRITICAL_PARTITIONS[@]}"; do
        if [ "$MOUNT" == "$critical" ]; then
            IS_CRITICAL=true
            break
        fi
    done
    
    # Generate alerts based on thresholds
    if [ "$USE_PERCENT" -ge 95 ]; then
        ALERT="CRITICAL: Disk almost full on $MOUNT: ${USE_PERCENT}%"
        if [ "$IS_CRITICAL" = true ]; then
            CRITICAL_ALERTS+=("$ALERT")
        else
            ALERTS+=("$ALERT")
        fi
        echo "    ⚠️ $ALERT" >> "$REPORT_FILE"
    elif [ "$USE_PERCENT" -ge "$THRESHOLD_DISK" ]; then
        ALERT="WARNING: High disk usage on $MOUNT: ${USE_PERCENT}%"
        ALERTS+=("$ALERT")
        echo "    ⚠️ $ALERT" >> "$REPORT_FILE"
    fi
done < <(df -hP | grep '^/dev/')

# ================= INODE USAGE CHECK =================
echo "" >> "$REPORT_FILE"
echo "2. Inode Usage Analysis:" >> "$REPORT_FILE"
echo "-------------------------" >> "$REPORT_FILE"

# Process substitution keeps the loop in the current shell so alerts persist
while read -r FILESYSTEM INODES_TOTAL INODES_USED INODES_FREE INODE_PERCENT MOUNT; do
    INODE_PERCENT=${INODE_PERCENT%\%}
    # Some filesystems report "-" instead of a percentage; skip them
    [[ "$INODE_PERCENT" =~ ^[0-9]+$ ]] || continue
    
    if [ "$INODE_PERCENT" -ge "$THRESHOLD_INODE" ]; then
        echo "  $MOUNT: ${INODE_PERCENT}% inodes used ($INODES_USED/$INODES_TOTAL)" >> "$REPORT_FILE"
        
        if [ "$INODE_PERCENT" -ge 90 ]; then
            ALERT="CRITICAL: High inode usage on $MOUNT: ${INODE_PERCENT}%"
            CRITICAL_ALERTS+=("$ALERT")
            echo "    ⚠️ $ALERT" >> "$REPORT_FILE"
        else
            ALERT="WARNING: High inode usage on $MOUNT: ${INODE_PERCENT}%"
            ALERTS+=("$ALERT")
            echo "    ⚠️ $ALERT" >> "$REPORT_FILE"
        fi
    fi
done < <(df -iP | grep '^/dev/')

# ================= I/O PERFORMANCE MONITORING =================
echo "" >> "$REPORT_FILE"
echo "3. I/O Performance:" >> "$REPORT_FILE"
echo "--------------------" >> "$REPORT_FILE"

if command -v iostat &> /dev/null; then
    echo "  Device    Read(KB/s)  Write(KB/s)  Util%" >> "$REPORT_FILE"
    # Column order differs between sysstat versions, so rkB/s and wkB/s are
    # looked up by header name; %util is assumed to be the last column.
    # Process substitution keeps the loop in the current shell so alerts persist.
    while read -r DEVICE READ_KB WRITE_KB UTIL; do
        echo "  $DEVICE     $READ_KB        $WRITE_KB       $UTIL%" >> "$REPORT_FILE"
        
        if (( $(echo "$UTIL > $THRESHOLD_IOWAIT" | bc -l) )); then
            ALERT="WARNING: High I/O utilization on $DEVICE: ${UTIL}%"
            ALERTS+=("$ALERT")
            echo "    ⚠️ $ALERT" >> "$REPORT_FILE"
        fi
    done < <(iostat -dkx 1 1 | awk '
        /^Device/ { for (i = 1; i <= NF; i++) col[$i] = i; next }
        /^(sd|nvme)/ && ("rkB/s" in col) && ("wkB/s" in col) {
            print $1, $(col["rkB/s"]), $(col["wkB/s"]), $NF
        }')
else
    echo "  iostat not available. Install sysstat package." >> "$REPORT_FILE"
fi

# Check I/O wait from top
IOWAIT=$(top -b -n 1 | grep "^%Cpu" | awk '{print $10}')
echo "  CPU I/O Wait: ${IOWAIT}%" >> "$REPORT_FILE"

if (( $(echo "$IOWAIT > $THRESHOLD_IOWAIT" | bc -l) )); then
    ALERT="WARNING: High CPU I/O wait: ${IOWAIT}%"
    ALERTS+=("$ALERT")
    echo "  ⚠️ $ALERT" >> "$REPORT_FILE"
fi

# ================= SMART STATUS CHECK =================
echo "" >> "$REPORT_FILE"
echo "4. Disk Health (SMART):" >> "$REPORT_FILE"
echo "------------------------" >> "$REPORT_FILE"

if command -v smartctl &> /dev/null; then
    # Process substitution keeps the loop in the current shell so alerts persist
    while read -r disk; do
        if [ -b "/dev/$disk" ]; then
            SMART_STATUS=$(smartctl -H "/dev/$disk" 2>/dev/null | grep "SMART overall-health" || echo "Not Supported")
            
            if echo "$SMART_STATUS" | grep -q "PASSED"; then
                echo "  /dev/$disk: ✅ SMART health PASSED" >> "$REPORT_FILE"
            elif echo "$SMART_STATUS" | grep -q "FAILED"; then
                ALERT="CRITICAL: SMART failure detected on /dev/$disk"
                CRITICAL_ALERTS+=("$ALERT")
                echo "  /dev/$disk: ❌ $ALERT" >> "$REPORT_FILE"
            else
                echo "  /dev/$disk: ℹ️ SMART not supported or unavailable" >> "$REPORT_FILE"
            fi
        fi
    done < <(lsblk -d -o name,type | grep disk | awk '{print $1}')
else
    echo "  smartctl not installed. Install smartmontools package." >> "$REPORT_FILE"
fi

# ================= LARGEST FILES CHECK =================
echo "" >> "$REPORT_FILE"
echo "5. Largest Files Analysis:" >> "$REPORT_FILE"
echo "---------------------------" >> "$REPORT_FILE"

for partition in "${CRITICAL_PARTITIONS[@]}"; do
    if [ -d "$partition" ]; then
        echo "  Top 5 largest files in $partition:" >> "$REPORT_FILE"
        # -xdev stays on this filesystem; reading SIZE and FILE separately
        # keeps file names that contain spaces intact
        find "$partition" -xdev -type f -exec du -h {} + 2>/dev/null | sort -rh | head -5 | while read -r SIZE FILE; do
            echo "    $SIZE - $FILE" >> "$REPORT_FILE"
        done
    fi
done

# ================= SUMMARY AND ALERTS =================
echo "" >> "$REPORT_FILE"
echo "=== SUMMARY ===" >> "$REPORT_FILE"

if [ ${#CRITICAL_ALERTS[@]} -gt 0 ]; then
    echo "❌ CRITICAL ISSUES DETECTED:" >> "$REPORT_FILE"
    for alert in "${CRITICAL_ALERTS[@]}"; do
        echo "  • $alert" >> "$REPORT_FILE"
    done
fi

if [ ${#ALERTS[@]} -gt 0 ]; then
    echo "" >> "$REPORT_FILE"
    echo "⚠️ WARNINGS:" >> "$REPORT_FILE"
    for alert in "${ALERTS[@]}"; do
        echo "  • $alert" >> "$REPORT_FILE"
    done
fi

if [ ${#CRITICAL_ALERTS[@]} -eq 0 ] && [ ${#ALERTS[@]} -eq 0 ]; then
    echo "✅ All disk metrics are within normal ranges" >> "$REPORT_FILE"
fi

# Save and log
cat "$REPORT_FILE" >> "$LOG_FILE"
echo "Disk monitoring completed. Report saved to: $REPORT_FILE"
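
One Bash detail matters for any monitoring loop that collects alerts in an array: each side of a pipe runs in a subshell, so an array appended to inside `command | while read` is empty again once the loop ends. Feeding the loop through process substitution keeps it in the current shell. A short demonstration with made-up mount data:

```shell
#!/bin/bash
ALERTS=()

# Piped loop: runs in a subshell, so the append is lost afterwards
printf '%s\n' "/var 92" "/home 45" | while read -r mount pct; do
    if [ "$pct" -ge 85 ]; then
        ALERTS+=("$mount at ${pct}%")
    fi
done
echo "After piped loop: ${#ALERTS[@]} alert(s)"

# Process substitution: the loop runs in the current shell
while read -r mount pct; do
    if [ "$pct" -ge 85 ]; then
        ALERTS+=("$mount at ${pct}%")
    fi
done < <(printf '%s\n' "/var 92" "/home 45")
echo "After process substitution: ${#ALERTS[@]} alert(s)"
```

With Bash's default settings the first message reports 0 alerts and the second reports 1. Writing to a report file from inside a piped loop still works, because the file lives outside the subshell.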

4. Network Monitoring

Network monitoring ensures connectivity, performance, and security. Key network metrics:

  • Bandwidth Usage: Incoming and outgoing traffic rates
  • Latency: Network response times and delays
  • Packet Loss: Percentage of lost packets
  • Connection Counts: Active TCP/UDP connections
  • Interface Status: Network interface up/down and errors

Network Monitoring Commands Reference

iftop -n -t -s 5 # Real-time bandwidth by connection
nethogs # Bandwidth by process
ss -tunap # Socket statistics
ping -c 5 google.com # Network connectivity test
mtr google.com # Traceroute with statistics
netstat -i # Interface statistics
sar -n DEV 1 1 # Network device stats
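
Bandwidth can also be measured without extra tools by reading an interface's cumulative byte counters twice and dividing the difference by the interval. The sketch below assumes Linux sysfs counters under /sys/class/net; the helper names are illustrative, and the default interface eth0 is only an example that will differ on many systems.

```shell
#!/bin/bash
# iface_bytes IFACE rx|tx : cumulative byte counter for the interface
iface_bytes() {
    cat "/sys/class/net/$1/statistics/${2}_bytes"
}

# rate_kib BEFORE AFTER SECONDS : average KiB/s between two counter readings
rate_kib() {
    echo $(( ($2 - $1) / 1024 / $3 ))
}

IFACE="${1:-eth0}"   # assumed interface name; pass another as the first argument
INTERVAL=1

if [ -r "/sys/class/net/$IFACE/statistics/rx_bytes" ]; then
    rx1=$(iface_bytes "$IFACE" rx); tx1=$(iface_bytes "$IFACE" tx)
    sleep "$INTERVAL"
    rx2=$(iface_bytes "$IFACE" rx); tx2=$(iface_bytes "$IFACE" tx)
    echo "$IFACE: down $(rate_kib "$rx1" "$rx2" "$INTERVAL") KiB/s," \
         "up $(rate_kib "$tx1" "$tx2" "$INTERVAL") KiB/s"
else
    echo "Interface $IFACE not found" >&2
fi
```

Integer division rounds down, so rates below 1 KiB/s show as 0.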

5. Comprehensive System Monitoring Dashboard

=== System Health Dashboard ===
CPU Load 2.1 (cores: 4)
Memory Usage 6.2/16 GB (38%)
Disk (/) 45/100 GB (45%)
Network (eth0) ↓15 MB/s ↑8 MB/s
✅ System Status: HEALTHY
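
A summary like the preview above can be assembled from a few kernel files and df. This is a rough sketch, assuming Linux; the helper name mem_summary is hypothetical, memory is shown in GiB, and network rates are left out to keep it short.

```shell
#!/bin/bash
# mem_summary [MEMINFO_FILE]
# Prints "used/total GiB (percent%)" from MemTotal and MemAvailable.
mem_summary() {
    awk '/^MemTotal:/     { t = $2 }
         /^MemAvailable:/ { a = $2 }
         END {
             if (t > 0)
                 printf "%.1f/%.1f GiB (%d%%)\n",
                        (t - a) / 1048576, t / 1048576, (t - a) * 100 / t
         }' "${1:-/proc/meminfo}"
}

echo "=== System Health Dashboard ==="
if [ -r /proc/loadavg ]; then
    echo "CPU Load $(cut -d' ' -f1 /proc/loadavg) (cores: $(nproc))"
    echo "Memory Usage $(mem_summary)"
fi
# Column 5 of POSIX-format df output is the usage percentage
echo "Disk (/) $(df -P / | awk 'NR == 2 {print $5}') used"
```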

6. Monitoring Best Practices

Practice                  Description                                        Implementation
Baseline Establishment    Understand normal behavior before setting alerts   Monitor for 1-2 weeks to establish baseline
Progressive Alerting      Use warning and critical thresholds                Warning at 80%, Critical at 95%
Alert Fatigue Prevention  Only alert on actionable items                     Use intelligent alert grouping
Historical Data           Keep historical data for trend analysis            Store metrics for 30-90 days
Multi-level Monitoring    Monitor system, application, and business metrics  Combine resource monitoring with app metrics
Automated Remediation     Auto-fix common issues when possible               Clear temp files, restart stuck services
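
One simple way to limit alert fatigue is a cooldown: a condition that stays true does not send another notification until some time has passed. The sketch below records the last alert time per key in a state directory. The function name should_alert, the default directory /tmp/alert-state, and the one-hour cooldown are all illustrative assumptions.

```shell
#!/bin/bash
# should_alert KEY COOLDOWN_SECONDS [STATE_DIR]
# Succeeds if no alert for KEY was recorded within the cooldown window, and
# records the current time. Fails while the cooldown is still running.
should_alert() {
    local key="$1" cooldown="$2" dir="${3:-/tmp/alert-state}"
    local stamp="$dir/$key.last" now last
    mkdir -p "$dir"
    now=$(date +%s)
    last=$(cat "$stamp" 2>/dev/null)
    last=${last:-0}                      # no record yet counts as "long ago"
    if [ $((now - last)) -ge "$cooldown" ]; then
        echo "$now" > "$stamp"
        return 0
    fi
    return 1
}

# Example: notify at most once per hour about high CPU usage
if should_alert cpu_high 3600; then
    echo "Would send the CPU alert now"
else
    echo "CPU alert suppressed (cooldown active)"
fi
```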

Monitoring Implementation Checklist

  • CPU monitoring with load average tracking
  • Memory monitoring with swap usage
  • Disk space and inode monitoring
  • I/O performance monitoring
  • Network bandwidth and latency tracking
  • Service and process monitoring
  • Alerting system with escalation
  • Historical data storage and retention
  • Dashboard for real-time visibility
  • Automated reporting and notification

7. Scheduling Monitoring Scripts

# ================= MONITORING CRONTAB =================
# System health check every 5 minutes
*/5 * * * * /opt/monitoring/system-health.sh >> /var/log/health.log 2>&1

# CPU monitoring every minute during business hours
*/1 9-17 * * 1-5 /opt/monitoring/monitor-cpu.sh >> /var/log/cpu.log 2>&1

# Memory monitoring every 15 minutes
*/15 * * * * /opt/monitoring/monitor-memory.sh >> /var/log/memory.log 2>&1

# Disk monitoring every hour
0 * * * * /opt/monitoring/monitor-disk.sh >> /var/log/disk.log 2>&1

# Network monitoring every 30 minutes
*/30 * * * * /opt/monitoring/monitor-network.sh >> /var/log/network.log 2>&1

# Comprehensive daily report at midnight
0 0 * * * /opt/monitoring/daily-report.sh >> /var/log/daily-report.log 2>&1

# Weekly capacity planning report
0 2 * * 1 /opt/monitoring/weekly-capacity.sh >> /var/log/capacity.log 2>&1

# Monthly trend analysis
0 3 1 * * /opt/monitoring/monthly-trends.sh >> /var/log/trends.log 2>&1
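
When a job is scheduled every minute or every few minutes, a slow run can overlap with the next one. A lock around the script avoids that. The sketch below uses flock from util-linux (assumed to be installed); the wrapper name run_exclusive and the lock file path are hypothetical.

```shell
#!/bin/bash
# run_exclusive LOCKFILE COMMAND [ARGS...]
# Runs COMMAND only if no other process holds the lock; otherwise skips it.
run_exclusive() {
    local lock="$1"; shift
    (
        # Take a non-blocking exclusive lock on file descriptor 9
        flock -n 9 || { echo "Previous run still active, skipping" >&2; exit 1; }
        "$@"
    ) 9>"$lock"
}

# Example: guard a (placeholder) monitoring command
run_exclusive /tmp/monitor-example.lock echo "monitoring run started"
```

The same effect is available directly in a crontab entry by prefixing the command with flock -n and a lock file path.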

Monitoring Security Considerations:
1. Secure credentials: Don't store passwords in monitoring scripts
2. Limit data collection: Only collect necessary data
3. Secure transmission: Use encryption for remote monitoring
4. Access control: Restrict who can access monitoring data
5. Data retention: Define and enforce data retention policies
6. Audit logging: Log all access to monitoring systems
7. Vulnerability scanning: Regularly scan monitoring infrastructure

Getting Started with System Monitoring

Follow these steps to implement comprehensive monitoring:

  1. Identify critical systems: Determine what needs monitoring
  2. Establish baselines: Monitor for 1-2 weeks to understand normal behavior
  3. Set thresholds: Define warning and critical levels for each metric
  4. Implement monitoring scripts: Start with CPU and memory
  5. Add alerting: Configure email/SMS notifications
  6. Create dashboards: Build visibility into system health
  7. Test thoroughly: Simulate failures to ensure alerts work
  8. Document procedures: Create runbooks for common alerts
  9. Review and optimize: Regularly refine thresholds and alerts

Proactive Monitoring for System Reliability

Effective system monitoring transforms reactive firefighting into proactive maintenance. By implementing the scripts and practices in this guide, you'll gain deep visibility into your systems and prevent issues before they impact users.

Remember: The goal of monitoring is not to collect data, but to provide actionable insights. Focus on metrics that drive decisions and enable proactive maintenance.

Next Steps: Start with the CPU monitoring script, customize it for your environment, and schedule it to run every 5 minutes. Once you're comfortable, expand to memory, disk, and network monitoring to build a complete monitoring solution.