# Storage Remediation Guide

**Generated**: 2026-02-08
**Status**: Critical issues identified - Remediation playbooks created
**Priority**: 🔴 HIGH - Immediate action recommended

---

## Overview

Four storage issues, ranging from critical to medium severity, have been identified in the Proxmox cluster:

| Issue | Severity | Current | Target | Playbook |
|-------|----------|---------|--------|----------|
| proxmox-00 root FS | 🔴 CRITICAL | 84.5% | <70% | remediate-storage-critical-issues.yml |
| proxmox-01 dlx-docker | 🟠 HIGH | 81.1% | <75% | remediate-docker-storage.yml |
| SonarQube disk usage | 🟠 HIGH | 354 GB | Archive data | remediate-storage-critical-issues.yml |
| Unused containers | ⚠️ MEDIUM | 1.2 TB allocated | Cleanup | remediate-stopped-containers.yml |

The **remediation playbooks** listed in the table automate the fixes.

---

## Remediation Playbooks

### 1. `remediate-storage-critical-issues.yml`

**Purpose**: Address immediate critical issues on proxmox-00 and proxmox-01

**What it does**:
- Compresses old journal logs (>30 days)
- Removes old syslog files (>90 days)
- Cleans apt cache and temp files
- Prunes Docker images, volumes, and build cache
- Audits SonarQube usage
- Lists stopped containers for manual review

**Expected results**:
- proxmox-00 root: Frees ~10-15 GB
- proxmox-01 dlx-docker: Frees ~20-50 GB

**Execution**:
```bash
# Dry-run (safe, shows what would be done)
ansible-playbook playbooks/remediate-storage-critical-issues.yml --check

# Execute on specific host
ansible-playbook playbooks/remediate-storage-critical-issues.yml -l proxmox-00
```

**Time estimate**: 5-10 minutes per host

---

### 2. `remediate-docker-storage.yml`

**Purpose**: Deep cleanup of Docker storage on proxmox-01

**What it does**:
- Analyzes Docker container sizes
- Lists Docker images by size
- Finds dangling images and volumes
- Removes unused Docker resources
- Configures automated weekly cleanup
- Sets up hourly monitoring

**Expected results**:
- Removes unused images/layers
- Frees 50-150 GB depending on usage
- Prevents regrowth with automation

**Execution**:
```bash
# Dry-run first
ansible-playbook playbooks/remediate-docker-storage.yml -l proxmox-01 --check

# Execute
ansible-playbook playbooks/remediate-docker-storage.yml -l proxmox-01
```

**Time estimate**: 10-15 minutes

---

### 3. `remediate-stopped-containers.yml`

**Purpose**: Safely remove unused LXC containers

**What it does**:
- Lists all stopped containers
- Calculates disk allocation per container
- Creates configuration backups before removal
- Safely removes containers (with dry-run mode)
- Provides recovery instructions

**Expected results**:
- Removes 1-2 TB of unused container allocations
- Allows recovery via backed-up configs

**Execution**:
```bash
# DRY RUN (no deletion, default)
ansible-playbook playbooks/remediate-stopped-containers.yml --check

# To actually remove (set dry_run=false)
ansible-playbook playbooks/remediate-stopped-containers.yml \
  -e dry_run=false

# Remove specific containers only
# (a list must be passed as JSON; key=value extra vars are parsed as plain strings)
ansible-playbook playbooks/remediate-stopped-containers.yml \
  -e '{"containers_to_remove": [{"vmid": 108, "name": "dlx-mysql-02"}]}' \
  -e dry_run=false
```

**Safety features**:
- Backups created before removal: `/tmp/pve-container-backups/`
- Dry-run mode by default (set `dry_run=false` to execute)
- Manual approval on each container

**Time estimate**: 2-5 minutes

---

### 4. `configure-storage-monitoring.yml`

**Purpose**: Set up continuous monitoring and alerting

**What it does**:
- Creates monitoring scripts for filesystem, Docker, containers
- Installs cron jobs for continuous monitoring
- Configures syslog integration
- Sets alert thresholds (75%, 85%, 95%)
- Provides Prometheus metrics export
- Creates cluster status dashboard command

**Expected results**:
- Real-time capacity monitoring
- Alerts before running out of space
- Integration with monitoring tools

**Execution**:
```bash
# Deploy monitoring to all Proxmox hosts
ansible-playbook playbooks/configure-storage-monitoring.yml -l proxmox

# View cluster status
/usr/local/bin/storage-monitoring/cluster-status.sh

# View alerts
tail -f /var/log/storage-monitor.log
```

**Time estimate**: 5 minutes

---

## Execution Plan

### Phase 1: Preparation (Before running playbooks)

#### 1. Verify backups exist
```bash
# Check backup location
ls -lh /var/backups/
```

#### 2. Review current state
```bash
# Check filesystem usage
df -h /
df -h /mnt/pve/*

# Check Docker usage (proxmox-01 only)
docker system df

# List containers
pct list | head -20
qm list | head -20
```

#### 3. Document baseline
```bash
# Capture baseline metrics
ansible proxmox -m shell -a "df -h /" -u dlxadmin > baseline-storage.txt
```

---

### Phase 2: Execute Remediation

#### Step 1: Test with dry-run (RECOMMENDED)
```bash
# Test critical issues fix
ansible-playbook playbooks/remediate-storage-critical-issues.yml \
  --check -l proxmox-00

# Test Docker cleanup
ansible-playbook playbooks/remediate-docker-storage.yml \
  --check -l proxmox-01

# Test container removal
ansible-playbook playbooks/remediate-stopped-containers.yml \
  --check
```

Review output before proceeding to Step 2.
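
When reviewing the dry-run output, it can help to know how far a host currently is from its target. The following is a minimal sketch, not part of the playbooks: `usage_pct` and `needs_cleanup` are hypothetical helper names, and the 70% figure is the proxmox-00 root target from the Overview table.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Use% of a mount point as a bare integer.
usage_pct() {
  df -P "$1" | awk 'NR==2 {gsub(/%/, "", $5); print $5}'
}

# Hypothetical helper: succeed when usage is at or above the target percentage.
needs_cleanup() {
  local used=$1 target=$2
  [ "$used" -ge "$target" ]
}

# Compare the root filesystem against the 70% target from the Overview table.
root_used=$(usage_pct /)
if needs_cleanup "$root_used" 70; then
  echo "root at ${root_used}%: at or above the 70% target, cleanup still needed"
else
  echo "root at ${root_used}%: below the 70% target"
fi
```

Running the same check after each remediation step gives a before/after figure to record next to `baseline-storage.txt`.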

#### Step 2: Execute on proxmox-00 (Critical)
```bash
# Clean up root filesystem and logs
ansible-playbook playbooks/remediate-storage-critical-issues.yml \
  -l proxmox-00 -v
```

**Verification**:
```bash
# SSH to proxmox-00
ssh dlxadmin@192.168.200.10
df -h /  # Should show: from 84.5% → 70-75%
du -sh /var/log  # Should show: smaller size after cleanup
```

#### Step 3: Execute on proxmox-01 (High Priority)
```bash
# Clean Docker storage
ansible-playbook playbooks/remediate-docker-storage.yml \
  -l proxmox-01 -v
```

**Verification**:
```bash
# SSH to proxmox-01
ssh dlxadmin@192.168.200.11
df -h /mnt/pve/dlx-docker  # Should show: from 81% → 60-70%
docker system df  # Should show: reduced image/volume sizes
```

#### Step 4: Remove Stopped Containers (Optional)
```bash
# First, verify which containers will be removed
ansible-playbook playbooks/remediate-stopped-containers.yml \
  --check

# Review output, then execute
ansible-playbook playbooks/remediate-stopped-containers.yml \
  -e dry_run=false -v
```

**Verification**:
```bash
# Check backup location
ls -lh /tmp/pve-container-backups/

# Verify stopped containers are gone
pct list | grep stopped
```

#### Step 5: Enable Monitoring
```bash
# Configure monitoring on all hosts
ansible-playbook playbooks/configure-storage-monitoring.yml \
  -l proxmox
```

**Verification**:
```bash
# Check monitoring scripts installed
ls -la /usr/local/bin/storage-monitoring/

# Check cron jobs
crontab -l | grep storage

# View monitoring logs
tail -f /var/log/storage-monitor.log
```

---

## Timeline

### Immediate (Today)
1. ✅ Review remediation playbooks
2. ✅ Run dry-run tests
3. ✅ Execute proxmox-00 cleanup
4. ✅ Execute proxmox-01 cleanup

**Expected duration**: 30 minutes

### Short-term (This week)
1. ✅ Remove stopped containers
2. ✅ Enable monitoring
3. ✅ Verify stability (48 hours)
4. ✅ Document changes

**Expected duration**: 2-4 hours over 48 hours

### Ongoing (Monthly)
1. Review monitoring logs
2. Execute cleanup playbooks
3. Audit new containers
4. Update storage audit

---

## Rollback Plan

If something goes wrong, you can roll back:

### Restore Filesystem from Snapshot
```bash
# If you have LVM snapshots
lvconvert --merge /dev/mapper/pve-root_snapshot

# Or restore from backup
proxmox-backup-client restore /mnt/backups/...
```

### Recover Deleted Containers
```bash
# Put the backed-up config back in place. `pct restore` expects a vzdump
# archive, not a .conf file; this recovers settings only, not destroyed volumes.
cp /tmp/pve-container-backups/container-108-dlx-mysql-02.conf /etc/pve/lxc/108.conf

# Start container
pct start 108
```

### Restore Docker Images
```bash
# Pull images from registry
docker pull image:tag

# Or restore from backup
docker load < image-backup.tar
```

---

## Monitoring & Validation

### Daily Checks
```bash
# Monitor storage trends
tail -f /var/log/storage-monitor.log

# Check cluster status
/usr/local/bin/storage-monitoring/cluster-status.sh

# Alert check
grep ALERT /var/log/storage-monitor.log
```

### Weekly Verification
```bash
# Run storage audit
ansible-playbook playbooks/remediate-storage-critical-issues.yml --check

# Review Docker disk usage
docker system df

# List containers by size (skip the header row; the name is the last column)
pct list | awk 'NR>1 {print $1, $NF}' | while read -r vmid name; do
  size=$(du -sh "/var/lib/lxc/$vmid" 2>/dev/null | awk '{print $1}')
  echo "$vmid $name $size"
done | sort -k3 -hr
```

### Monthly Audit
```bash
# Update storage audit report
ansible-playbook playbooks/remediate-storage-critical-issues.yml --check -v

# Generate updated metrics
pvesh get /nodes/proxmox-00/storage | grep capacity

# Compare to baseline
diff baseline-storage.txt <(ansible proxmox -m shell -a "df -h /" -u dlxadmin)
```

---

## Troubleshooting

### Issue: Root filesystem still full after cleanup

**Symptoms**: `df -h /` still shows >80%

**Solutions**:
1. Check for large files on the root filesystem: `find / -xdev -type f -size +1G 2>/dev/null`
2. Check Docker usage with `docker system df`, then prune unused data with `docker system prune -a`
3. Check logs: `du -sh /var/log/* | sort -hr | head`
4. Expand partition (if necessary)

### Issue: Docker cleanup removed needed image

**Symptoms**: Container fails to start after cleanup

**Solution**: Rebuild or pull image
```bash
docker pull image:tag
docker-compose up -d
```

### Issue: Removed container was still in use

**Recovery**: Restore from backup
```bash
# List available backups
ls -la /tmp/pve-container-backups/

# Restore the config under a new VMID. This is config only: the volumes it
# references must still exist. With a vzdump archive, use
# `pct restore <vmid> <archive>` instead.
cp /tmp/pve-container-backups/container-108-dlx-mysql-02.conf /etc/pve/lxc/200.conf
pct start 200
```

---

## References

- **Storage Audit**: `docs/STORAGE-AUDIT.md`
- **Proxmox Docs**: https://pve.proxmox.com/wiki/Storage
- **Docker Cleanup**: https://docs.docker.com/config/pruning/
- **LXC Management**: `man pct`

---

## Appendix: Commands Reference

### Quick capacity check
```bash
# All hosts
ansible proxmox -m shell -a "df -h / | tail -1" -u dlxadmin

# Specific host
ssh dlxadmin@proxmox-00 "df -h /"
```

### Container info
```bash
# All containers
pct list

# Container details
pct config <vmid>
pct status <vmid>

# Container logs
pct exec <vmid> -- tail -f /var/log/syslog
```

### Docker management
```bash
# Storage usage
docker system df

# Cleanup
docker system prune -af
docker image prune -f
docker volume prune -f

# Container logs
docker logs <container>
docker logs -f <container>
```

### Monitoring
```bash
# View alerts
tail -f /var/log/storage-monitor.log
tail -f /var/log/docker-monitor.log

# System logs
journalctl -t storage-monitor -f
journalctl -t docker-monitor -f
```

---

## Support

If you encounter issues:
1. Check `/var/log/storage-monitor.log` for alerts
2. Review playbook output for specific errors
3. Verify backups exist before removing containers
4. Test with `--check` flag before executing

**Next scheduled audit**: 2026-03-08
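
Between audits, the alert thresholds configured by `configure-storage-monitoring.yml` (75%, 85%, 95%) can be spot-checked by hand. The sketch below is an illustration only: `alert_level` is a hypothetical function, and the level names are assumptions that may not match what the deployed monitoring scripts write to `/var/log/storage-monitor.log`.

```shell
#!/usr/bin/env bash
# Map a usage percentage to an alert level using the 75/85/95 thresholds.
# Level names are assumed for illustration, not taken from the playbook.
alert_level() {
  local pct=$1
  if   [ "$pct" -ge 95 ]; then echo "CRITICAL"
  elif [ "$pct" -ge 85 ]; then echo "HIGH"
  elif [ "$pct" -ge 75 ]; then echo "WARNING"
  else                         echo "OK"
  fi
}

# Print "<mount> <usage>% <level>" for each filesystem that reports a numeric
# usage figure. Mount points containing spaces are not handled by this sketch.
df -P | awk 'NR>1 {sub(/%$/, "", $5); if ($5 ~ /^[0-9]+$/) print $6, $5}' |
while read -r mount pct; do
  echo "$mount ${pct}% $(alert_level "$pct")"
done
```

Any filesystem reported above `OK` here is worth cross-checking against the monitoring log before the next scheduled audit.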