Storage Remediation Guide

Generated: 2026-02-08
Status: Critical issues identified - Remediation playbooks created
Priority: 🔴 HIGH - Immediate action recommended


Overview

Four storage issues, ranging from medium to critical severity, have been identified in the Proxmox cluster:

| Issue | Severity | Current | Target | Playbook |
|---|---|---|---|---|
| proxmox-00 root FS | 🔴 CRITICAL | 84.5% | <70% | remediate-storage-critical-issues.yml |
| proxmox-01 dlx-docker | 🟠 HIGH | 81.1% | <75% | remediate-docker-storage.yml |
| SonarQube disk usage | 🟠 HIGH | 354 GB | Archive data | remediate-storage-critical-issues.yml |
| Unused containers | ⚠️ MEDIUM | 1.2 TB allocated | Cleanup | remediate-stopped-containers.yml |

Corresponding remediation playbooks have been created to automate fixes.


Remediation Playbooks

1. remediate-storage-critical-issues.yml

Purpose: Address immediate critical issues on proxmox-00 and proxmox-01

What it does:

  • Compresses old journal logs (>30 days)
  • Removes old syslog files (>90 days)
  • Cleans apt cache and temp files
  • Prunes Docker images, volumes, and build cache
  • Audits SonarQube usage
  • Lists stopped containers for manual review
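Before trusting an automated log cleanup, it can help to preview which files the 90-day rule would touch. The following is a minimal sketch, not the playbook's actual task: the function name list_old_logs is hypothetical, and it assumes rotated syslog files live directly in the given directory with names starting with syslog.

```shell
# list_old_logs DIR DAYS
# Prints syslog files directly under DIR that were last modified
# more than DAYS days ago. Listing only: nothing is deleted.
list_old_logs() {
  find "$1" -maxdepth 1 -type f -name 'syslog*' -mtime +"$2" -print
}

# Example (run on the host being cleaned):
#   list_old_logs /var/log 90
```

Reviewing this list first gives a rough idea of what the playbook would remove under the same age rule.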

Expected results:

  • proxmox-00 root: Frees ~10-15 GB
  • proxmox-01 dlx-docker: Frees ~20-50 GB
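The usage targets translate into a minimum amount of space to free: total size × (current % − target %) / 100. A small sketch of that arithmetic follows; the function name gb_to_free and the 100 GB filesystem size are assumptions for illustration, not measured values from the cluster.

```shell
# gb_to_free TOTAL_GB CURRENT_PCT TARGET_PCT
# Prints the GB that must be freed to bring usage from CURRENT_PCT
# down to TARGET_PCT on a filesystem of TOTAL_GB.
gb_to_free() {
  awk -v total="$1" -v cur="$2" -v tgt="$3" 'BEGIN {
    need = total * (cur - tgt) / 100
    if (need < 0) need = 0      # already at or below target
    printf "%.1f\n", need
  }'
}

# Hypothetical 100 GB root filesystem at 84.5% with a 70% target
gb_to_free 100 84.5 70   # prints 14.5
```

Substituting the real size reported by df shows whether the expected savings are enough to reach the target.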

Execution:

# Dry-run (safe, shows what would be done)
ansible-playbook playbooks/remediate-storage-critical-issues.yml --check

# Execute on specific host
ansible-playbook playbooks/remediate-storage-critical-issues.yml -l proxmox-00

Time estimate: 5-10 minutes per host


2. remediate-docker-storage.yml

Purpose: Deep cleanup of Docker storage on proxmox-01

What it does:

  • Analyzes Docker container sizes
  • Lists Docker images by size
  • Finds dangling images and volumes
  • Removes unused Docker resources
  • Configures automated weekly cleanup
  • Sets up hourly monitoring

Expected results:

  • Removes unused images/layers
  • Frees 50-150 GB depending on usage
  • Prevents regrowth with automation

Execution:

# Dry-run first
ansible-playbook playbooks/remediate-docker-storage.yml -l proxmox-01 --check

# Execute
ansible-playbook playbooks/remediate-docker-storage.yml -l proxmox-01

Time estimate: 10-15 minutes


3. remediate-stopped-containers.yml

Purpose: Safely remove unused LXC containers

What it does:

  • Lists all stopped containers
  • Calculates disk allocation per container
  • Creates configuration backups before removal
  • Safely removes containers (with dry-run mode)
  • Provides recovery instructions

Expected results:

  • Removes 1-2 TB of unused container allocations
  • Allows recovery via backed-up configs

Execution:

# DRY RUN (no deletion, default)
ansible-playbook playbooks/remediate-stopped-containers.yml --check

# To actually remove (set dry_run=false)
ansible-playbook playbooks/remediate-stopped-containers.yml \
  -e dry_run=false

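# Remove specific containers only
# (JSON extra vars keep the list and the boolean typed; key=value pairs are passed as strings)
ansible-playbook playbooks/remediate-stopped-containers.yml \
  -e '{"containers_to_remove": [{"vmid": 108, "name": "dlx-mysql-02"}], "dry_run": false}'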

Safety features:

  • Backups created before removal: /tmp/pve-container-backups/
  • Dry-run mode by default (set dry_run=false to execute)
  • Manual approval on each container

Time estimate: 2-5 minutes


4. configure-storage-monitoring.yml

Purpose: Set up continuous monitoring and alerting

What it does:

  • Creates monitoring scripts for filesystem, Docker, containers
  • Installs cron jobs for continuous monitoring
  • Configures syslog integration
  • Sets alert thresholds (75%, 85%, 95%)
  • Provides Prometheus metrics export
  • Creates cluster status dashboard command
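The three alert thresholds can be read as a simple classification of a usage percentage. The sketch below illustrates that mapping only; the function name classify_usage and the level names are assumptions, and the deployed monitoring scripts may label levels differently.

```shell
# classify_usage PCT
# Maps a usage percentage to a level using the 75/85/95 thresholds.
# Level names are illustrative assumptions.
classify_usage() {
  awk -v pct="$1" 'BEGIN {
    if      (pct >= 95) print "EMERGENCY"
    else if (pct >= 85) print "CRITICAL"
    else if (pct >= 75) print "WARNING"
    else                print "OK"
  }'
}

classify_usage 72   # prints OK
classify_usage 90   # prints CRITICAL
```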

Expected results:

  • Real-time capacity monitoring
  • Alerts before running out of space
  • Integration with monitoring tools

Execution:

# Deploy monitoring to all Proxmox hosts
ansible-playbook playbooks/configure-storage-monitoring.yml -l proxmox

# View cluster status
/usr/local/bin/storage-monitoring/cluster-status.sh

# View alerts
tail -f /var/log/storage-monitor.log

Time estimate: 5 minutes


Execution Plan

Phase 1: Preparation (Before running playbooks)

1. Verify backups exist

# Check backup location
ls -lh /var/backups/

2. Review current state

# Check filesystem usage
df -h /
df -h /mnt/pve/*

# Check Docker usage (proxmox-01 only)
docker system df

# List containers
pct list | head -20
qm list | head -20

3. Document baseline

# Capture baseline metrics
ansible proxmox -m shell -a "df -h /" -u dlxadmin > baseline-storage.txt

Phase 2: Execute Remediation

Step 1: Dry-run all playbooks

# Test critical issues fix
ansible-playbook playbooks/remediate-storage-critical-issues.yml \
  --check -l proxmox-00

# Test Docker cleanup
ansible-playbook playbooks/remediate-docker-storage.yml \
  --check -l proxmox-01

# Test container removal
ansible-playbook playbooks/remediate-stopped-containers.yml \
  --check

Review output before proceeding to Step 2.

Step 2: Execute on proxmox-00 (Critical)

# Clean up root filesystem and logs
ansible-playbook playbooks/remediate-storage-critical-issues.yml \
  -l proxmox-00 -v

Verification:

# SSH to proxmox-00
ssh dlxadmin@192.168.200.10
df -h /
# Should show: from 84.5% → 70-75%

du -sh /var/log
# Should show: smaller size after cleanup
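The manual df check can also be turned into a pass/fail test against the target. This is a sketch under stated assumptions: the helper names usage_pct and meets_target are hypothetical, and the parsing relies on the POSIX df -P layout, where the fifth column of the second line holds the usage percentage.

```shell
# usage_pct: reads `df -P <path>` output on stdin and prints the
# usage percentage as a bare integer (fifth column, second line).
usage_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# meets_target PCT TARGET: succeeds when PCT is strictly below TARGET.
meets_target() {
  [ "$1" -lt "$2" ]
}

# Example (run on proxmox-00):
#   pct_used=$(df -P / | usage_pct)
#   meets_target "$pct_used" 70 && echo "root FS below target" || echo "still above target"
```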

Step 3: Execute on proxmox-01 (High Priority)

# Clean Docker storage
ansible-playbook playbooks/remediate-docker-storage.yml \
  -l proxmox-01 -v

Verification:

# SSH to proxmox-01
ssh dlxadmin@192.168.200.11
df -h /mnt/pve/dlx-docker
# Should show: from 81% → 60-70%

docker system df
# Should show: reduced image/volume sizes

Step 4: Remove Stopped Containers (Optional)

# First, verify which containers will be removed
ansible-playbook playbooks/remediate-stopped-containers.yml \
  --check

# Review output, then execute
ansible-playbook playbooks/remediate-stopped-containers.yml \
  -e dry_run=false -v

Verification:

# Check backup location
ls -lh /tmp/pve-container-backups/

# Verify stopped containers are gone (expect no output if all were removed)
pct list | grep stopped
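A plain grep also matches any container whose name happens to contain "stopped". A slightly stricter sketch compares the Status column instead; the function name count_stopped is hypothetical, and it assumes the usual pct list layout with a header line and the status in the second column.

```shell
# count_stopped: reads `pct list` output on stdin and prints how many
# containers have the status "stopped" (header line is skipped).
count_stopped() {
  awk 'NR > 1 && $2 == "stopped" { n++ } END { print n + 0 }'
}

# Example (run on a Proxmox host):
#   pct list | count_stopped
```

A result of 0 means no stopped containers remain on that host.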

Step 5: Enable Monitoring

# Configure monitoring on all hosts
ansible-playbook playbooks/configure-storage-monitoring.yml \
  -l proxmox

Verification:

# Check monitoring scripts installed
ls -la /usr/local/bin/storage-monitoring/

# Check cron jobs
crontab -l | grep storage

# View monitoring logs
tail -f /var/log/storage-monitor.log

Timeline

Immediate (Today)

  1. Review remediation playbooks
  2. Run dry-run tests
  3. Execute proxmox-00 cleanup
  4. Execute proxmox-01 cleanup

Expected duration: 30 minutes

Short-term (This week)

  1. Remove stopped containers
  2. Enable monitoring
  3. Verify stability (48 hours)
  4. Document changes

Expected duration: 2-4 hours over 48 hours

Ongoing (Monthly)

  1. Review monitoring logs
  2. Execute cleanup playbooks
  3. Audit new containers
  4. Update storage audit

Rollback Plan

If something goes wrong, you can roll back:

Restore Filesystem from Snapshot

# If you have LVM snapshots
lvconvert --merge /dev/mapper/pve-root_snapshot

# Or restore from backup
proxmox-backup-client restore /mnt/backups/...

Recover Deleted Containers

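# pct restore takes the VMID first, then a vzdump backup archive.
# The saved .conf file preserves settings only and cannot recreate the container's data.
pct restore 108 <vzdump-archive>

# Compare the restored settings with the saved config
diff /etc/pve/lxc/108.conf /tmp/pve-container-backups/container-108-dlx-mysql-02.conf

# Start container
pct start 108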

Restore Docker Images

# Pull images from registry
docker pull image:tag

# Or restore from backup
docker load < image-backup.tar

Monitoring & Validation

Daily Checks

# Monitor storage trends
tail -f /var/log/storage-monitor.log

# Check cluster status
/usr/local/bin/storage-monitoring/cluster-status.sh

# Alert check
grep ALERT /var/log/storage-monitor.log

Weekly Verification

# Run storage audit
ansible-playbook playbooks/remediate-storage-critical-issues.yml --check

# Review Docker disk usage
docker system df

# List containers by size (skips the header; the name is the last column of pct list)
pct list | tail -n +2 | while read -r vmid status rest; do
  name=${rest##* }
  size=$(du -sh "/var/lib/lxc/$vmid" 2>/dev/null | awk '{print $1}')
  echo "$vmid $name $size"
done | sort -k3 -hr

Monthly Audit

# Update storage audit report
ansible-playbook playbooks/remediate-storage-critical-issues.yml --check -v

# Generate updated metrics
pvesh get /nodes/proxmox-00/storage | grep capacity

# Compare to baseline
diff baseline-storage.txt <(ansible proxmox -m shell -a "df -h /" -u dlxadmin)

Troubleshooting

Issue: Root filesystem still full after cleanup

Symptoms: df -h / still shows >80%

Solutions:

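  1. Check for large files on the root filesystem: find / -xdev -type f -size +1G 2>/dev/null
  2. Check Docker usage: docker system df (reclaim with docker system prune -a, which removes all unused images)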
  3. Check logs: du -sh /var/log/* | sort -hr | head
  4. Expand partition (if necessary)
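When the root filesystem stays full, ranking directories by size usually narrows the search faster than hunting for single files. The following sketch assumes a du that supports -d (GNU and BusyBox do); the function name largest_dirs is hypothetical.

```shell
# largest_dirs DIR [N]
# Prints the N largest immediate subdirectories of DIR (sizes in KiB),
# without crossing into other mounted filesystems. N defaults to 10.
largest_dirs() {
  dir=$1
  n=${2:-10}
  du -x -k -d 1 "$dir" 2>/dev/null \
    | sort -rn \
    | awk -F'\t' -v top="$dir" '$2 != top' \
    | head -n "$n"
}

# Example (run as root on the affected host):
#   largest_dirs /var 5
```

Repeating the call on the largest result walks down the tree toward whatever is consuming the space.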

Issue: Docker cleanup removed needed image

Symptoms: Container fails to start after cleanup

Solution: Rebuild or pull image

docker pull image:tag
docker-compose up -d

Issue: Removed container was still in use

Recovery: Restore from backup

# List available backups
ls -la /tmp/pve-container-backups/

# Restore to a new VMID (pct restore takes the VMID first, then a vzdump archive;
# the saved .conf file alone cannot recreate the container's data)
pct restore 200 <vzdump-archive>
pct start 200

Appendix: Commands Reference

Quick capacity check

# All hosts
ansible proxmox -m shell -a "df -h / | tail -1" -u dlxadmin

# Specific host
ssh dlxadmin@proxmox-00 "df -h /"

Container info

# All containers
pct list

# Container details
pct config <vmid>
pct status <vmid>

# Container logs
pct exec <vmid> -- tail -f /var/log/syslog

Docker management

# Storage usage
docker system df

# Cleanup
docker system prune -af
docker image prune -f
docker volume prune -f

# Container logs
docker logs <container>
docker logs -f <container>

Monitoring

# View alerts
tail -f /var/log/storage-monitor.log
tail -f /var/log/docker-monitor.log

# System logs
journalctl -t storage-monitor -f
journalctl -t docker-monitor -f

Support

If you encounter issues:

  1. Check /var/log/storage-monitor.log for alerts
  2. Review playbook output for specific errors
  3. Verify backups exist before removing containers
  4. Test with --check flag before executing

Next scheduled audit: 2026-03-08