Best Practices for DevOps: Linux DevOps Environment Guide

Mastering Linux in DevOps environments requires systematic approaches to automation, security, monitoring, and collaboration. This guide covers essential best practices, CI/CD pipelines, infrastructure as code, and production-ready configurations.

1. Infrastructure as Code (IaC) Best Practices

Infrastructure as Code is fundamental to modern DevOps. Learn how to manage infrastructure efficiently, consistently, and reproducibly.

Terraform & Ansible Best Practices

Terraform Guidelines:

Modular design: Create reusable modules for common infrastructure
State management: Use remote backends (S3, Terraform Cloud)
Version control: Keep all IaC in Git, with .tfstate files excluded via .gitignore
Variable validation: Validate inputs with conditions
Workspace isolation: Separate dev, staging, prod environments
Plan before apply: Always review terraform plan
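The "variable validation" guideline above can be expressed directly in HCL; a small sketch (the variable name and allowed values are illustrative):

```hcl
variable "environment" {
  type        = string
  description = "Deployment environment"

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of: dev, staging, prod."
  }
}
```

With this in place, terraform plan fails fast with a clear message instead of provisioning into the wrong environment.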

Ansible Best Practices:

Role-based organization: Use Ansible Galaxy structure
Idempotency: Ensure playbooks can be run repeatedly with the same end state
Variable precedence: Understand variable hierarchy
Inventory management: Use dynamic inventories for cloud
Tagging: Use tags for selective execution
Vault for secrets: Encrypt sensitive data with ansible-vault

Example Terraform Structure

# Directory structure
infrastructure/
├── main.tf
├── variables.tf
├── outputs.tf
├── terraform.tfvars
└── modules/
    ├── networking/
    │   ├── main.tf
    │   └── variables.tf
    ├── compute/
    │   ├── main.tf
    │   └── outputs.tf
    └── database/
        ├── main.tf
        └── variables.tf

# main.tf - Root module
terraform {
  required_version = ">= 1.5.0"

  backend "s3" {
    bucket = "terraform-state-bucket"
    key    = "prod/terraform.tfstate"
    region = "us-east-1"
  }
}

module "vpc" {
  source             = "./modules/networking"
  cidr_block         = var.vpc_cidr
  environment        = var.environment
  availability_zones = var.azs
}

module "eks" {
  source          = "./modules/compute"
  vpc_id          = module.vpc.vpc_id
  cluster_name    = "${var.project}-${var.environment}"
  node_group_name = "main"
}

Container Orchestration with Kubernetes

Kubernetes Best Practices:

Resource limits: Always set CPU/memory requests and limits
Health checks: Configure liveness and readiness probes
Namespaces: Isolate environments and teams
ConfigMaps & Secrets: Externalize configuration
Horizontal Pod Autoscaling: Automatically scale based on metrics
Network policies: Implement zero-trust network security
PodDisruptionBudgets: Ensure availability during maintenance
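Of the practices above, Horizontal Pod Autoscaling is declared as its own resource; a minimal sketch, assuming a Deployment named webapp in the production namespace:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Note that the HPA only works if the target pods declare CPU requests, which is another reason the "resource limits" practice comes first.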

# production-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
  namespace: production
  labels:
    app: webapp
    environment: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
        - name: webapp
          image: registry.example.com/webapp:v1.2.3
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          envFrom:
            - configMapRef:
                name: webapp-config
            - secretRef:
                name: webapp-secrets

2. CI/CD Pipeline Excellence

Building robust, secure, and efficient CI/CD pipelines is crucial for DevOps success. Learn industry best practices.

GitHub Actions / GitLab CI Best Practices

Pipeline Optimization:

1. Cache dependencies: Reduce build times with caching
2. Parallel jobs: Run independent tasks concurrently
3. Matrix builds: Test across multiple versions simultaneously
4. Artifact management: Store and reuse build artifacts
5. Secrets management: Use platform secrets, not hardcoded values
6. Self-hosted runners: For security-sensitive or resource-intensive jobs
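Most CI caches (point 1 above) key on a hash of the dependency lockfile, so the cache is invalidated exactly when dependencies change; a sketch of how such a key is derived, using a throwaway file as a stand-in for package-lock.json:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Derive a cache key from a dependency lockfile: identical lockfile
# contents produce an identical key, so the dependency cache is
# reused; any dependency change yields a new key and busts the cache.
lockfile="$(mktemp)"                 # stand-in for package-lock.json
printf '{"left-pad": "1.3.0"}\n' > "${lockfile}"

hash="$(sha256sum "${lockfile}" | awk '{print $1}')"
cache_key="npm-cache-${hash:0:16}"   # truncated for readability
echo "cache key: ${cache_key}"
rm -f "${lockfile}"
```

CI platforms do the same thing declaratively (e.g. hashing the lockfile in the cache key expression); the point is that the key is content-addressed, not time-based.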

🚀 CI/CD Pipeline Flow:

1. Code Commit → Triggers Pipeline
2. Static Analysis (SAST) → Security Scan
3. Build → Compile & Package
4. Unit Tests → Validate Logic
5. Integration Tests → Verify Components
6. Container Build → Create Docker Image
7. Image Vulnerability Scan → Security Check
8. Deploy to Staging → Test Environment
9. E2E Tests → Full System Validation
10. Approval Gate → Manual Review
11. Deploy to Production → Live Environment
12. Smoke Tests → Verify Deployment
13. Monitoring & Rollback → Production Safety
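The flow above is strictly gated: a failure at any stage stops the pipeline. Stripped of the real work, the control logic reduces to something like this (the stage bodies are stand-ins):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Each stage is a function returning 0 (pass) or non-zero (fail).
stage_build()  { echo "building...";  return 0; }
stage_test()   { echo "testing...";   return 0; }
stage_deploy() { echo "deploying..."; return 0; }

run_pipeline() {
  local stage
  for stage in stage_build stage_test stage_deploy; do
    if ! "${stage}"; then
      echo "pipeline failed at: ${stage}" >&2
      return 1                  # gate: stop at the first failure
    fi
  done
  echo "pipeline succeeded"
}

run_pipeline
```

Real CI systems express the same gating declaratively (job dependencies and exit codes), as in the workflow file below.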
# .github/workflows/ci-cd.yml
name: CI/CD Pipeline

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node-version: [16.x, 18.x, 20.x]
    steps:
      - uses: actions/checkout@v4
      - name: Use Node.js ${{ matrix.node-version }}
        uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node-version }}
          cache: 'npm'
      - name: Install dependencies
        run: npm ci
      - name: Run linting
        run: npm run lint
      - name: Run unit tests
        run: npm test -- --coverage
      - name: Upload coverage
        uses: codecov/codecov-action@v3

  build:
    needs: test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - name: Login to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push Docker image
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: |
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}

  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Deploy to Kubernetes
        run: |
          kubectl set image deployment/webapp \
            webapp=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
      - name: Verify deployment
        run: |
          kubectl rollout status deployment/webapp --timeout=300s

Infrastructure Testing & Validation

Testing Strategies:

Terratest: Go-based testing for Terraform
InSpec: Compliance as Code testing
Kitchen-Terraform: Integration testing
Checkov: Static analysis for IaC security
TFLint: Terraform linter
Conftest: Policy testing with OPA

#!/bin/bash
# Infrastructure testing pipeline

# 1. Terraform format check
terraform fmt -check -recursive

# 2. Terraform validation
terraform validate

# 3. Security scanning with Checkov
checkov -d .

# 4. Cost estimation
infracost breakdown --path .

# 5. Unit tests with Terratest
cd test && go test -v -timeout 30m

# 6. Integration tests
kitchen test

# 7. Compliance testing with InSpec
inspec exec compliance/ --reporter cli json:report.json

# 8. Policy validation with Conftest
conftest test -p policies/ main.tf

3. Security in DevOps (DevSecOps)

Integrate security throughout the DevOps lifecycle with automated security practices and tools.

Container Security Best Practices

Docker Security:

Non-root users: Run containers as non-root
Minimal base images: Use Alpine or distroless
Image scanning: Scan for vulnerabilities in CI/CD
Secrets management: Never store secrets in images
Resource constraints: Limit CPU/memory usage
Read-only filesystems: Run with a read-only root filesystem; mount writable volumes only where needed
Regular updates: Keep base images updated
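The "no secrets in images" rule is cheap to spot-check before a build; a naive grep-based sketch (the patterns are illustrative, and dedicated scanners such as gitleaks or trufflehog do this far more thoroughly):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Naive pre-build scan: fail if the build context contains strings
# that look like credentials (AWS access key IDs, private keys).
scan_for_secrets() {
  local dir="${1}"
  grep -rInE 'AKIA[0-9A-Z]{16}|-----BEGIN (RSA|EC|OPENSSH) PRIVATE KEY-----' \
    "${dir}" && return 1 || return 0
}

# Demo: a clean context passes; one with a fake AWS key id fails.
ctx="$(mktemp -d)"
echo 'console.log("hello")' > "${ctx}/app.js"
scan_for_secrets "${ctx}" && echo "clean context: OK"

echo 'aws_key=AKIAABCDEFGHIJKLMNOP' > "${ctx}/config.env"
scan_for_secrets "${ctx}" || echo "leaked key detected"
rm -rf "${ctx}"
```

Wired into CI before the docker build step, this fails the pipeline before a secret can be baked into an image layer.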

Secure Dockerfile Example

# Multi-stage build for security and minimal size
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production

FROM node:18-alpine
WORKDIR /app

# Create non-root user
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001

# Copy application files
COPY --from=builder /app/node_modules ./node_modules
COPY . .

# Security configurations
RUN chown -R nodejs:nodejs /app && \
    chmod -R 755 /app && \
    # Remove unnecessary packages
    apk del --no-cache curl && \
    # Create necessary directories with proper permissions
    mkdir -p /tmp && chmod 1777 /tmp

# Switch to non-root user
USER nodejs

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD node healthcheck.js

# Port exposure
EXPOSE 3000

# Command
CMD ["node", "server.js"]

Secrets Management

Secrets Management Tools:

HashiCorp Vault: Comprehensive secrets management
AWS Secrets Manager: AWS-native solution
Azure Key Vault: Microsoft Azure solution
Google Secret Manager: GCP solution
Bitwarden: Open-source alternative
SOPS: Secrets OPerationS - encrypted files

#!/bin/bash
# Vault integration example
# Export secrets to environment variables
export DB_PASSWORD=$(vault kv get -field=password secret/database)
export API_KEY=$(vault kv get -field=api_key secret/external-api)

# Kubernetes integration
# Install Vault agent injector
helm repo add hashicorp https://helm.releases.hashicorp.com
helm install vault hashicorp/vault --set "injector.enabled=true"

# Annotations for automatic injection (pod template metadata)
metadata:
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "webapp"
    vault.hashicorp.com/agent-inject-secret-db-password: "secret/data/database"
    vault.hashicorp.com/agent-inject-template-db-password: |
      {{- with secret "secret/data/database" -}}
      export DB_PASSWORD="{{ .Data.data.password }}"
      {{- end }}

4. Monitoring & Observability

Comprehensive monitoring, logging, and alerting strategies for production DevOps environments.

Prometheus & Grafana Stack

Monitoring Best Practices:

Four Golden Signals: Latency, traffic, errors, saturation
USE Method: Utilization, saturation, errors (resources)
RED Method: Rate, errors, duration (services)
Alerting rules: Meaningful, actionable alerts
Dashboard design: Consistent, informative dashboards
SLO/SLI definition: Define service level objectives/indicators
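For the RED method above, the three signals map to PromQL roughly as follows (the metric and label names here are assumed; adapt them to your own instrumentation):

```promql
# Rate: requests per second over the last 5 minutes
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Duration: 95th percentile latency from a histogram
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```

The error-ratio and latency-quantile expressions are also the usual starting points for SLI definitions.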

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

# alerts/node.rules.yml
groups:
  - name: node_alerts
    rules:
      - alert: HighNodeCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for 10 minutes"
      - alert: NodeMemoryHigh
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% for 5 minutes"

Centralized Logging with ELK/Loki

Logging Strategy:

Structured logging: Use JSON format for logs
Log levels: Appropriate severity levels (DEBUG, INFO, WARN, ERROR)
Context enrichment: Include request IDs, user IDs, correlation IDs
Log aggregation: Centralize logs for analysis
Retention policies: Define log retention periods
Sensitive data: Never log passwords, tokens, PII
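A structured, JSON-formatted log line (the first point above) can be emitted even from shell scripts, so aggregators parse fields instead of regexing free text; a minimal sketch with a correlation ID (field names are illustrative, and real code should escape values with jq):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Emit one JSON object per log event: timestamp, severity, message,
# and a request/correlation ID for tracing across services.
log_json() {
  local level="${1}" msg="${2}"
  printf '{"ts":"%s","level":"%s","msg":"%s","request_id":"%s"}\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "${level}" "${msg}" "${REQUEST_ID:-none}"
}

REQUEST_ID="req-1234"
log_json INFO "payment accepted"
log_json ERROR "upstream timeout"
```

Because every line is valid JSON, the Fluentd/Loki configurations below can index on level, request_id, and so on without custom parsing rules.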

# Fluentd configuration for Kubernetes
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>

<filter kubernetes.**>
  @type kubernetes_metadata
</filter>

<filter kubernetes.**>
  @type record_transformer
  enable_ruby true
  <record>
    host "#{Socket.gethostname}"
    log_level ${record["log"].match(/ERROR|WARN|INFO|DEBUG/).to_s}
    pod_name ${record["kubernetes"]["pod_name"]}
    namespace ${record["kubernetes"]["namespace_name"]}
    container_name ${record["kubernetes"]["container_name"]}
  </record>
</filter>

<match kubernetes.**>
  @type elasticsearch
  host elasticsearch
  port 9200
  logstash_format true
  logstash_prefix kubernetes
  include_tag_key true
  type_name fluentd
</match>

# Loki configuration for log aggregation
loki:
  config:
    server:
      http_listen_port: 3100
    auth_enabled: false
    ingester:
      lifecycler:
        address: 127.0.0.1
        ring:
          kvstore:
            store: inmemory
          replication_factor: 1
        final_sleep: 0s
      chunk_idle_period: 5m
      max_chunk_age: 1h
    schema_config:
      configs:
        - from: 2020-10-24
          store: boltdb-shipper
          object_store: filesystem
          schema: v11
          index:
            prefix: index_
            period: 24h

5. Automation & Scripting Excellence

Master automation techniques and scripting best practices for efficient DevOps workflows.

Bash Scripting for DevOps

Bash Best Practices:

Shebang: Always start with #!/usr/bin/env bash
Error handling: Use set -euo pipefail
Input validation: Validate all inputs and parameters
Logging: Implement proper logging with timestamps
Function usage: Use functions for reusable code
Exit codes: Return meaningful exit codes
Temporary files: Use mktemp and clean up
Portability: Write portable scripts for different systems
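The mktemp-and-clean-up point above is worth showing concretely, because the EXIT trap is what guarantees cleanup even when the script aborts partway through:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Create a private scratch directory and guarantee its removal on
# ANY exit path (success, error under set -e, or signal) via a trap.
workdir="$(mktemp -d)"
cleanup() { rm -rf "${workdir}"; }
trap cleanup EXIT

echo "scratch data" > "${workdir}/data.txt"
echo "working in ${workdir}"
# ... real work here ...
# cleanup runs automatically when the script exits
```

Without the trap, any failing command under set -e would leave the temporary directory behind; with it, cleanup is unconditional.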

Production-Ready Bash Template

#!/usr/bin/env bash
# ==============================================================================
# Script: deployment-automation.sh
# Description: Production deployment automation script
# Author: DevOps Team
# Version: 1.0.0
# ==============================================================================

set -euo pipefail
IFS=$'\n\t'

# ==============================================================================
# Configuration
# ==============================================================================
readonly SCRIPT_NAME="$(basename "${0}")"
readonly SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
readonly LOG_FILE="/var/log/${SCRIPT_NAME%.*}.log"
readonly LOCK_FILE="/tmp/${SCRIPT_NAME}.lock"

# ==============================================================================
# Colors and Formatting
# ==============================================================================
readonly RED='\033[0;31m'
readonly GREEN='\033[0;32m'
readonly YELLOW='\033[1;33m'
readonly BLUE='\033[0;34m'
readonly NC='\033[0m' # No Color

# ==============================================================================
# Logging Functions
# ==============================================================================
log_info() {
    echo -e "${BLUE}[INFO]${NC} $(date '+%Y-%m-%d %H:%M:%S') - $*" | tee -a "${LOG_FILE}"
}

log_success() {
    echo -e "${GREEN}[SUCCESS]${NC} $(date '+%Y-%m-%d %H:%M:%S') - $*" | tee -a "${LOG_FILE}"
}

log_warning() {
    echo -e "${YELLOW}[WARNING]${NC} $(date '+%Y-%m-%d %H:%M:%S') - $*" | tee -a "${LOG_FILE}"
}

log_error() {
    echo -e "${RED}[ERROR]${NC} $(date '+%Y-%m-%d %H:%M:%S') - $*" | tee -a "${LOG_FILE}"
}

# ==============================================================================
# Utility Functions
# ==============================================================================
check_dependencies() {
    local dependencies=("kubectl" "helm" "docker" "jq")
    local missing=()

    for cmd in "${dependencies[@]}"; do
        if ! command -v "${cmd}" &> /dev/null; then
            missing+=("${cmd}")
        fi
    done

    if [[ ${#missing[@]} -gt 0 ]]; then
        log_error "Missing dependencies: ${missing[*]}"
        exit 1
    fi
}

acquire_lock() {
    if [ -e "${LOCK_FILE}" ]; then
        log_error "Script is already running. Lock file: ${LOCK_FILE}"
        exit 1
    fi
    echo "$$" > "${LOCK_FILE}"
    trap 'release_lock' EXIT
}

release_lock() {
    rm -f "${LOCK_FILE}"
}

# ==============================================================================
# Main Functions
# ==============================================================================
deploy_application() {
    local namespace="${1}"
    local image_tag="${2}"

    log_info "Starting deployment to namespace: ${namespace}"
    log_info "Using image tag: ${image_tag}"

    # Update deployment
    kubectl set image "deployment/webapp" \
        "webapp=registry.example.com/webapp:${image_tag}" \
        -n "${namespace}"

    # Wait for rollout
    kubectl rollout status "deployment/webapp" \
        -n "${namespace}" \
        --timeout=300s

    # Verify deployment
    local replicas
    replicas=$(kubectl get deployment webapp -n "${namespace}" -o json | jq -r '.status.readyReplicas')

    if [[ "${replicas}" -eq 3 ]]; then
        log_success "Deployment successful. All 3 replicas are ready."
    else
        log_error "Deployment failed. Ready replicas: ${replicas}"
        exit 1
    fi
}

run_smoke_tests() {
    log_info "Running smoke tests..."
    # Your smoke test commands here
    # curl -f http://service/health
    # Run automated tests
    log_success "Smoke tests passed"
}

# ==============================================================================
# Main Execution
# ==============================================================================
main() {
    acquire_lock
    check_dependencies

    local namespace="production"
    local image_tag="${1:-latest}"

    log_info "Starting production deployment pipeline"
    deploy_application "${namespace}" "${image_tag}"
    run_smoke_tests
    log_success "Deployment pipeline completed successfully"
}

# ==============================================================================
# Entry Point
# ==============================================================================
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
    if [[ $# -lt 1 ]]; then
        echo "Usage: ${SCRIPT_NAME} <image-tag>"
        echo "Example: ${SCRIPT_NAME} v1.2.3"
        exit 1
    fi
    main "$@"
fi

6. Collaboration & Documentation

Effective collaboration and comprehensive documentation are key to successful DevOps teams.

Git Workflow Strategies

Git Branching Models:

Git Flow: Feature branches with develop/main separation
GitHub Flow: Simpler model with feature branches to main
GitLab Flow: Environment-based branching with production branch
Trunk-based Development: Short-lived feature branches
Feature Flags: Decouple deployment from release with feature toggles
Pull Request Templates: Standardized PR descriptions
Commit Message Conventions: Conventional Commits format
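The Conventional Commits point above is easy to enforce mechanically with a commit-msg hook; a sketch of the validation (the type list is the commonly used one, so adjust it to your team's convention):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Validate a commit message against the Conventional Commits shape:
#   <type>(optional scope)!: <description>
valid_commit_msg() {
  local msg="${1}"
  local types='build|chore|ci|docs|feat|fix|perf|refactor|revert|style|test'
  [[ "${msg}" =~ ^(${types})(\([a-z0-9-]+\))?!?:\ .+ ]]
}

# Installed as .git/hooks/commit-msg it would read the message from
# the file passed as "$1"; here we demo the check directly.
valid_commit_msg "feat(api): add pagination" && echo "ok"
valid_commit_msg "fixed stuff" || echo "rejected"
```

The same check can run in CI on PR titles, so the convention holds even when local hooks are skipped.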

Documentation Standards

# Project Documentation Structure
project/
├── README.md
├── CHANGELOG.md
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── docs/
│   ├── architecture/
│   │   ├── system-design.md
│   │   ├── data-flow.md
│   │   └── deployment-architecture.md
│   ├── development/
│   │   ├── setup.md
│   │   ├── coding-standards.md
│   │   └── testing.md
│   ├── operations/
│   │   ├── deployment.md
│   │   ├── monitoring.md
│   │   └── troubleshooting.md
│   ├── api/
│   │   └── api-reference.md
│   └── security/
│       ├── security-policy.md
│       └── compliance.md
├── .github/
│   ├── PULL_REQUEST_TEMPLATE.md
│   └── ISSUE_TEMPLATE/
└── scripts/
    └── documentation/
        └── generate-docs.sh

# README.md Template

# Project Name

## Overview
Brief description of the project.

## Architecture
- System architecture diagram
- Technology stack
- Data flow description

## Getting Started

### Prerequisites
- List of required tools and versions

### Installation
Step-by-step installation instructions

### Configuration
Environment variables and configuration files

## Development

### Setup Development Environment

### Running Tests

### Code Standards

## Deployment

### CI/CD Pipeline

### Environment Configuration

### Monitoring Setup

## API Reference
[Link to API documentation]

## Troubleshooting
Common issues and solutions

## Contributing
[Link to CONTRIBUTING.md]

## Security
[Link to security documentation]

## License
[License information]

DevOps Best Practices Checklist

Infrastructure & Security:

Infrastructure as Code: All infrastructure defined in code
Version Control: All code in Git with proper branching
Automated Testing: Comprehensive test suite
Security Scanning: SAST/DAST in CI/CD pipeline
Secrets Management: No secrets in code or config files
Access Control: Principle of least privilege enforced
Network Security: Proper network segmentation
Compliance: Regular compliance checks automated

CI/CD & Deployment:

Automated Pipeline: End-to-end automation
Environment Parity: Dev/Staging/Prod similarity
Rollback Capability: Quick and safe rollbacks
Blue-Green/Canary: Advanced deployment strategies
Artifact Management: Versioned artifacts stored
Pipeline Security: Secure pipeline configuration
Approval Gates: Manual approval where needed
Performance Testing: Automated performance tests

Monitoring & Operations:

Centralized Logging: All logs aggregated
Metrics Collection: System and app metrics
Alerting System: Meaningful, actionable alerts
Dashboarding: Real-time dashboards available
Tracing: Distributed tracing implemented
SLO Monitoring: Service level objectives tracked
Incident Response: Documented procedures
Disaster Recovery: Tested recovery plans

Collaboration & Documentation:

Documentation: Comprehensive and up-to-date
Knowledge Sharing: Regular team knowledge transfer
Code Reviews: Mandatory peer reviews
Post-Mortems: Learning from incidents
On-call Rotation: Well-defined on-call process
Training: Continuous skill development
Feedback Loops: Regular retrospectives
Open Communication: Transparent team communication

Pro Tips for DevOps Success

Start Small: Implement changes incrementally
Measure Everything: You can't improve what you don't measure
Automate Ruthlessly: Automate repetitive tasks
Security First: Integrate security from the beginning
Document Religiously: Knowledge should be accessible
Test in Production: Use feature flags and canaries
Embrace Failure: Learn from incidents and failures
Continuous Learning: Stay updated with new tools and practices
Collaborate Actively: Break down silos between teams
Focus on Business Value: Align technical decisions with business goals

Common DevOps Anti-Patterns to Avoid

1. Manual Deployments: Avoid manual steps in deployment
2. Snowflake Servers: Each server configured differently
3. Security as Afterthought: Security added late in the process
4. Tool Overload: Too many tools without integration
5. Siloed Teams: Development and operations working separately
6. No Rollback Plan: Deploying without ability to revert
7. Ignoring Logs: Not monitoring or analyzing logs
8. Poor Documentation: Tribal knowledge instead of documentation
9. Over-engineering: Complex solutions for simple problems
10. Ignoring Technical Debt: Accumulating debt without addressing it