dlx-ansible/docs/REMEDIATION-SUMMARY.md


Storage Remediation Playbooks Summary

Created: 2026-02-08
Status: Ready for deployment


Overview

Four Ansible playbooks have been created to remediate critical storage issues identified in the Proxmox cluster storage audit.


Playbooks Created

1. remediate-storage-critical-issues.yml

Location: playbooks/remediate-storage-critical-issues.yml

Purpose: Address immediate critical and high-priority issues

Targets:

  • proxmox-00 (root filesystem at 84.5%)
  • proxmox-01 (dlx-docker at 81.1%)
  • All nodes (SonarQube, stopped containers audit)

Actions:

  • Compress journal logs older than 30 days
  • Remove syslog files older than 90 days
  • Clean apt cache and temp files
  • Prune Docker images, volumes, and build cache
  • Audit SonarQube disk usage
  • Report on stopped containers
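
The log-cleanup action can be previewed before anything is deleted. The following is a minimal sketch, not the playbook's actual task: `list_old_logs` is a hypothetical helper, and the `syslog*` name pattern and the `/var/log` location are assumptions about how rotated syslog files are named on these hosts.

```shell
#!/bin/sh
# Hypothetical helper (not part of the playbook): list rotated syslog files
# older than a given number of days without deleting them.
# Usage: list_old_logs DIRECTORY DAYS
list_old_logs() {
    # -mtime +N matches files last modified more than N whole days ago.
    find "$1" -maxdepth 1 -type f -name 'syslog*' -mtime +"$2" -print
}

# Example: preview what a 90-day retention rule would match.
# list_old_logs /var/log 90
```

Because the helper only prints paths, it is safe to run on a production host to compare against what the playbook reports.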

Expected space freed:

  • proxmox-00: 10-15 GB
  • proxmox-01: 20-50 GB
  • Total: 30-65 GB

Execution time: 5-10 minutes


2. remediate-docker-storage.yml

Location: playbooks/remediate-docker-storage.yml

Purpose: Detailed Docker storage cleanup for proxmox-01

Targets:

  • proxmox-01 (Docker host)
  • dlx-docker LXC container

Actions:

  • Analyze container and image sizes
  • Identify dangling resources
  • Remove unused images, volumes, and build cache
  • Run aggressive system prune (docker system prune -a -f --volumes)
  • Configure automated weekly cleanup
  • Set up hourly monitoring with alerting
  • Create log rotation policies

Expected space freed:

  • 50-150 GB depending on usage patterns

Automated maintenance:

  • Weekly: docker system prune -af --volumes
  • Hourly: Capacity monitoring and alerting
  • Daily: Log rotation with 7-day retention
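
The weekly cleanup could be wrapped in a guard so that it can be rehearsed safely. This is only a sketch under assumptions: `prune_docker` and the `DRY_RUN` variable are hypothetical names, not taken from the playbook. Only the `docker system prune -af --volumes` command comes from the list above.

```shell
#!/bin/sh
# Hypothetical wrapper around the weekly cleanup command listed above.
# DRY_RUN defaults to 1, so nothing is removed unless it is explicitly set to 0.
prune_docker() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "DRY RUN: docker system prune -af --volumes"
    else
        # Destructive: removes stopped containers and unused images,
        # networks, build cache, and volumes without prompting.
        docker system prune -af --volumes
    fi
}
```

Setting `DRY_RUN=0` before calling `prune_docker` performs the real cleanup. Stopped containers are removed as well, so this should only run on hosts where that is intended.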

Execution time: 10-15 minutes


3. remediate-stopped-containers.yml

Location: playbooks/remediate-stopped-containers.yml

Purpose: Safely remove unused LXC containers

Targets:

  • All Proxmox hosts
  • 15 stopped containers (1.2 TB allocated)

Actions:

  • Audit all containers and identify stopped ones
  • Generate size/allocation report
  • Create configuration backups before removal
  • Safely remove containers (dry-run by default)
  • Provide recovery guide and instructions
  • Verify space freed

Containers targeted for removal (recommendations):

  • dlx-mysql-02 (108): 200 GB
  • dlx-mysql-03 (109): 200 GB
  • dlx-mattermost (107): 32 GB
  • dlx-nocodb (116): 100 GB
  • dlx-swarm-01/02/03: 195 GB combined
  • dlx-kube-01/02/03: 150 GB combined

Total recoverable: 877+ GB

Safety features:

  • Dry-run mode by default (dry_run: true)
  • Config backups created before deletion
  • Recovery instructions provided
  • Containers listed for manual approval
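
As a rough illustration of the audit step, stopped containers can be picked out of `pct list` output. The sketch below assumes that `pct list` prints a header row followed by one row per container, with the VMID and status in the first two columns. `stopped_vmids` is a hypothetical helper, not a function from the playbook.

```shell
#!/bin/sh
# Hypothetical helper: read `pct list`-style output on stdin and print the
# VMID of every container whose status column is "stopped".
stopped_vmids() {
    # NR > 1 skips the header row; $1 is the VMID, $2 the status.
    awk 'NR > 1 && $2 == "stopped" { print $1 }'
}

# Example on a Proxmox host (read-only; nothing is removed):
# pct list | stopped_vmids
```

The resulting IDs can be compared with the removal candidates listed above before the playbook is run with dry-run disabled.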

Execution time: 2-5 minutes


4. configure-storage-monitoring.yml

Location: playbooks/configure-storage-monitoring.yml

Purpose: Set up proactive storage monitoring and alerting

Targets:

  • All Proxmox hosts (proxmox-00, proxmox-01, proxmox-02)

Actions:

  • Create monitoring scripts:

    • /usr/local/bin/storage-monitoring/check-capacity.sh - Filesystem monitoring
    • /usr/local/bin/storage-monitoring/check-docker.sh - Docker storage
    • /usr/local/bin/storage-monitoring/check-containers.sh - Container allocation
    • /usr/local/bin/storage-monitoring/cluster-status.sh - Dashboard view
    • /usr/local/bin/storage-monitoring/prometheus-metrics.sh - Metrics export
  • Configure cron jobs:

    • Every 5 min: Filesystem capacity checks
    • Every 10 min: Docker storage checks
    • Every 4 hours: Container allocation audit
  • Set alert thresholds:

    • 75%: ALERT (notice level)
    • 85%: WARNING (warning level)
    • 95%: CRITICAL (critical level)
  • Integrate with syslog:

    • Logs to /var/log/storage-monitor.log
    • Syslog integration for alerting
    • Log rotation configured (14-day retention)
  • Optional Prometheus integration:

    • Metrics export script for Grafana/Prometheus
    • Standard format for monitoring tools
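
The threshold scheme above can be expressed as a small classification function. This is a minimal sketch, assuming integer percentages and that a value equal to a threshold triggers that level. `classify_usage` is a hypothetical name, and the real check-capacity.sh may be structured differently.

```shell
#!/bin/sh
# Hypothetical sketch of the 75/85/95 threshold mapping described above.
# Takes an integer usage percentage and prints the matching level.
classify_usage() {
    if [ "$1" -ge 95 ]; then
        echo "CRITICAL"
    elif [ "$1" -ge 85 ]; then
        echo "WARNING"
    elif [ "$1" -ge 75 ]; then
        echo "ALERT"
    else
        echo "OK"
    fi
}

# Example: classify the root filesystem. `df -P` prints the capacity as the
# fifth column (for example "84%"); the percent sign is stripped before use.
# pct=$(df -P / | awk 'NR == 2 { sub("%", "", $5); print $5 }')
# classify_usage "$pct"
```

Under these thresholds, the audited values for proxmox-00 (84.5%) and proxmox-01 (81.1%) both fall in the ALERT band, just below WARNING.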

Execution time: 5 minutes


Execution Guide

Quick Start

# Test all playbooks (safe, shows what would be done)
ansible-playbook playbooks/remediate-storage-critical-issues.yml --check
ansible-playbook playbooks/remediate-docker-storage.yml --check
ansible-playbook playbooks/remediate-stopped-containers.yml --check
ansible-playbook playbooks/configure-storage-monitoring.yml --check

Day 1: Critical Fixes

# 1. Deploy monitoring first (non-destructive)
ansible-playbook playbooks/configure-storage-monitoring.yml -l proxmox

# 2. Fix proxmox-00 root filesystem (CRITICAL)
ansible-playbook playbooks/remediate-storage-critical-issues.yml -l proxmox-00

# 3. Fix proxmox-01 Docker storage (HIGH)
ansible-playbook playbooks/remediate-docker-storage.yml -l proxmox-01

# Expected time: 30 minutes
# Expected space freed: 60-165 GB (10-15 GB on proxmox-00, 50-150 GB on proxmox-01)

Day 2-3: Verify & Monitor

# Verify fixes are working
/usr/local/bin/storage-monitoring/cluster-status.sh

# Monitor alerts
tail -f /var/log/storage-monitor.log

# Re-check root filesystem usage on all hosts over the next 48 hours
ansible proxmox -m shell -a "df -h /" -u dlxadmin

Day 4+: Container Cleanup (Optional)

# After confirming stability, remove unused containers
ansible-playbook playbooks/remediate-stopped-containers.yml \
  --check  # Verify first

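# Execute removal. key=value extra vars are always passed as strings,
# so JSON syntax is used here to pass dry_run as a real boolean.
ansible-playbook playbooks/remediate-stopped-containers.yml \
  -e '{"dry_run": false}'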

# Expected space freed: 877+ GB
# Execution time: 2-5 minutes

Documentation

Three supporting documents have been created:

  1. STORAGE-AUDIT.md

    • Comprehensive storage analysis
    • Hardware inventory
    • Capacity utilization breakdown
    • Issues and recommendations
  2. STORAGE-REMEDIATION-GUIDE.md

    • Step-by-step execution guide
    • Timeline and milestones
    • Rollback procedures
    • Monitoring and validation
    • Troubleshooting guide
  3. REMEDIATION-SUMMARY.md (this file)

    • Quick reference overview
    • Playbook descriptions
    • Expected results

Expected Results

Capacity Goals

| Host       | Issue             | Current | Target  | Playbook                               | Expected Result   |
|------------|-------------------|---------|---------|----------------------------------------|-------------------|
| proxmox-00 | Root FS           | 84.5%   | <70%    | remediate-storage-critical-issues.yml  | ✓ Frees 10-15 GB  |
| proxmox-01 | dlx-docker        | 81.1%   | <75%    | remediate-docker-storage.yml           | ✓ Frees 50-150 GB |
| proxmox-01 | SonarQube         | 354 GB  | Archive | remediate-storage-critical-issues.yml  | Audit only        |
| All        | Unused containers | 1.2 TB  | Remove  | remediate-stopped-containers.yml       | ✓ Frees 877 GB    |

Total Space Freed: approximately 0.9-1.0 TB (937-1,042 GB, the sum of the rows above)
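
The overall figure can be cross-checked against the per-row estimates in the table. The following is a short sketch of the arithmetic using only the numbers above, treating "877+ GB" as 877 GB and counting nothing for the audit-only SonarQube row.

```shell
#!/bin/sh
# Sum the low and high ends of the per-playbook estimates (all values in GB).
low=$((10 + 50 + 877))     # proxmox-00 + dlx-docker + stopped containers
high=$((15 + 150 + 877))
echo "Estimated space freed: ${low}-${high} GB"
```

This prints `Estimated space freed: 937-1042 GB`, which is roughly 0.9-1.0 TB.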

Automation Setup

  • Automatic Docker cleanup: Weekly
  • Continuous monitoring: Every 5-10 minutes
  • Alert integration: Syslog, systemd journal
  • Metrics export: Prometheus compatible
  • Log rotation: 14-day retention
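
For the Prometheus-compatible export, a metrics script typically prints samples in the Prometheus text exposition format. The sketch below is illustrative only: the metric name `storage_used_percent`, the `mountpoint` label, and the `emit_usage_metric` function are assumptions, not names taken from prometheus-metrics.sh.

```shell
#!/bin/sh
# Hypothetical sketch: print one gauge sample in the Prometheus text format.
# Usage: emit_usage_metric MOUNTPOINT PERCENT
emit_usage_metric() {
    echo "# HELP storage_used_percent Filesystem usage in percent."
    echo "# TYPE storage_used_percent gauge"
    printf 'storage_used_percent{mountpoint="%s"} %s\n' "$1" "$2"
}

# Example: emit_usage_metric / 84.5
```

Output in this shape can be written to a `.prom` file and read by node_exporter's textfile collector, if that is how host metrics are scraped in this cluster.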

Long-term Benefits

  1. Prevents future issues: Automated cleanup prevents regrowth
  2. Early detection: Monitoring alerts at 75%, 85%, 95% thresholds
  3. Operational insights: Container allocation tracking
  4. Integration ready: Prometheus/Grafana compatible
  5. Maintenance automation: Weekly scheduled cleanups

Key Features

Safety First

  • Dry-run mode for all destructive operations
  • Configuration backups before removal
  • Rollback procedures documented
  • Multi-phase execution with verification

Automation

  • Cron-based scheduling
  • Monitoring and alerting
  • Log rotation and archival
  • Prometheus metrics export

Operability

  • Clear execution steps
  • Expected results documented
  • Troubleshooting guide
  • Dashboard commands for status

Files Summary

playbooks/
├── remediate-storage-critical-issues.yml      (205 lines)
├── remediate-docker-storage.yml               (310 lines)
├── remediate-stopped-containers.yml           (380 lines)
└── configure-storage-monitoring.yml           (330 lines)

docs/
├── STORAGE-AUDIT.md                           (550 lines)
├── STORAGE-REMEDIATION-GUIDE.md               (480 lines)
└── REMEDIATION-SUMMARY.md                     (this file)

Total: 2,255 lines of playbooks and documentation (excluding this file)


Next Steps

  1. Review the playbooks and documentation
  2. Test with --check flag on a non-critical host
  3. Execute in recommended order (Day 1, Day 2-3, Day 4+)
  4. Monitor using provided tools and scripts
  5. Schedule for monthly execution

Support & Maintenance

Monitoring Commands

# Quick status
/usr/local/bin/storage-monitoring/cluster-status.sh

# View alerts
tail -f /var/log/storage-monitor.log

# Docker status
docker system df

# Container status
pct list

Regular Maintenance

  • Daily: Review monitoring logs
  • Weekly: Execute playbooks in check mode
  • Monthly: Run full storage audit
  • Quarterly: Archive monitoring data

Scheduled Audits

  • Next scheduled audit: 2026-03-08
  • Quarterly reviews recommended
  • Document changes in git

Issues Addressed

proxmox-00 root filesystem (84.5%)

  • Compress journal logs
  • Clean syslog files
  • Clear apt cache

proxmox-01 dlx-docker (81.1%)

  • Remove dangling images
  • Purge unused volumes
  • Clear build cache
  • Automate weekly cleanup

Unused containers (1.2 TB)

  • Safe removal with backups
  • Recovery procedures documented
  • 877+ GB recoverable

Monitoring gaps

  • Continuous capacity tracking
  • Alert thresholds configured
  • Integration with syslog/Prometheus

Conclusion

Comprehensive remediation playbooks have been created to address all identified storage issues. The playbooks are:

  • Safe: Dry-run modes, backups, and rollback procedures
  • Automated: Scheduling and monitoring included
  • Documented: Complete guides and references provided
  • Operational: Dashboard commands and status checks included

The playbooks are ready for deployment and are expected to have an immediate impact on cluster capacity and to improve long-term operational stability.