Storage Remediation Guide

Generated: 2026-02-08
Status: Critical issues identified - Remediation playbooks created
Priority: 🔴 HIGH - Immediate action recommended

Overview

Four storage issues have been identified in the Proxmox cluster, ranging from critical to medium severity:

Issue	Severity	Current	Target	Playbook
proxmox-00 root FS	🔴 CRITICAL	84.5%	<70%	remediate-storage-critical-issues.yml
proxmox-01 dlx-docker	🟠 HIGH	81.1%	<75%	remediate-docker-storage.yml
SonarQube disk usage	🟠 HIGH	354 GB	Archive data	remediate-storage-critical-issues.yml
Unused containers	⚠️ MEDIUM	1.2 TB allocated	Cleanup	remediate-stopped-containers.yml

Corresponding remediation playbooks have been created to automate fixes.
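
As a rough way to turn the percentage targets in the table into an amount of space to free, the arithmetic can be sketched as below. The function name `space_to_free` and the 100 GB filesystem size are illustrative assumptions; this guide does not state the actual size of the proxmox-00 root filesystem.

```shell
# Estimate how many GB must be freed to bring a filesystem from its
# current usage percentage down to a target percentage.
# Usage: space_to_free <total_gb> <current_pct> <target_pct>
space_to_free() {
  awk -v total="$1" -v cur="$2" -v tgt="$3" 'BEGIN {
    need = (cur - tgt) * total / 100
    if (need < 0) need = 0        # already at or below target
    printf "%.1f\n", need
  }'
}

# Hypothetical 100 GB root filesystem at 84.5% with a 70% target
space_to_free 100 84.5 70
```

Substitute the real size reported by `df -h /` on the host before relying on the result.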

Remediation Playbooks

1. `remediate-storage-critical-issues.yml`

Purpose: Address immediate critical issues on proxmox-00 and proxmox-01

What it does:

Compresses old journal logs (>30 days)
Removes old syslog files (>90 days)
Cleans apt cache and temp files
Prunes Docker images, volumes, and build cache
Audits SonarQube usage
Lists stopped containers for manual review
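
For orientation, the log and cache steps above can be approximated by hand as in the following sketch. It is not the playbook's actual task list: the `run` helper, the `DRY_RUN` variable, and the `syslog.*` file pattern are assumptions, and `journalctl --vacuum-time` deletes archived journal files rather than compressing them.

```shell
# Manual approximation of the log and cache cleanup; defaults to dry-run.
DRY_RUN="${DRY_RUN:-true}"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "true" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

cleanup_root_fs() {
  # Drop archived journal files older than 30 days
  run journalctl --vacuum-time=30d
  # Delete rotated syslog files older than 90 days
  run find /var/log -name 'syslog.*' -mtime +90 -delete
  # Clear the apt package cache
  run apt-get clean
}

cleanup_root_fs
```

With the default `DRY_RUN=true` the script only prints the commands; setting `DRY_RUN=false` would execute them and needs root privileges.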

Expected results:

proxmox-00 root: Frees ~10-15 GB
proxmox-01 dlx-docker: Frees ~20-50 GB

Execution:

# Dry-run (safe, shows what would be done)
ansible-playbook playbooks/remediate-storage-critical-issues.yml --check

# Execute on specific host
ansible-playbook playbooks/remediate-storage-critical-issues.yml -l proxmox-00

Time estimate: 5-10 minutes per host

2. `remediate-docker-storage.yml`

Purpose: Deep cleanup of Docker storage on proxmox-01

What it does:

Analyzes Docker container sizes
Lists Docker images by size
Finds dangling images and volumes
Removes unused Docker resources
Configures automated weekly cleanup
Sets up hourly monitoring
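
The weekly cleanup and hourly monitoring could look roughly like the cron schedule printed by this sketch. The run times, the `until=168h` filter, and the choice of `/var/log/docker-monitor.log` as the output file are assumptions; the entries the playbook actually installs may differ.

```shell
# Print a cron.d-style schedule: weekly prune plus hourly usage snapshot.
docker_cleanup_cron() {
  cat <<'EOF'
# m h dom mon dow user command
0 3 * * 0 root docker system prune -af --filter "until=168h" >> /var/log/docker-monitor.log 2>&1
0 * * * * root docker system df >> /var/log/docker-monitor.log 2>&1
EOF
}

docker_cleanup_cron
```

Redirecting the output into a file under `/etc/cron.d/` would activate the schedule; review it first, since `prune -af` removes every unused image older than the filter window.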

Expected results:

Removes unused images/layers
Frees 50-150 GB depending on usage
Prevents regrowth with automation

Execution:

# Dry-run first
ansible-playbook playbooks/remediate-docker-storage.yml -l proxmox-01 --check

# Execute
ansible-playbook playbooks/remediate-docker-storage.yml -l proxmox-01

Time estimate: 10-15 minutes

3. `remediate-stopped-containers.yml`

Purpose: Safely remove unused LXC containers

What it does:

Lists all stopped containers
Calculates disk allocation per container
Creates configuration backups before removal
Safely removes containers (with dry-run mode)
Provides recovery instructions
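
The "list stopped containers" step can be reproduced by filtering `pct list` output, as in this minimal sketch. The function name `stopped_containers` and the sample container `dlx-web-01` are illustrative; on a host, pipe the real `pct list` output into the function instead of the sample.

```shell
# Read `pct list` output on stdin; print "<vmid> <name>" for stopped containers.
# NR > 1 skips the header row; the name is the last column because the
# Lock column is usually empty.
stopped_containers() {
  awk 'NR > 1 && $2 == "stopped" { print $1, $NF }'
}

# Sample input in the layout `pct list` prints
stopped_containers <<'EOF'
VMID       Status     Lock         Name
101        running                 dlx-web-01
108        stopped                 dlx-mysql-02
EOF
```

On a Proxmox node the equivalent call is `pct list | stopped_containers`.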

Expected results:

Removes 1-2 TB of unused container allocations
Allows recovery via backed-up configs

Execution:

# DRY RUN (no deletion, default)
ansible-playbook playbooks/remediate-stopped-containers.yml --check

# To actually remove (set dry_run=false)
ansible-playbook playbooks/remediate-stopped-containers.yml \
  -e dry_run=false

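# Remove specific containers only (JSON extra vars keep the list and the boolean typed;
# key=value extra vars are always passed as strings)
ansible-playbook playbooks/remediate-stopped-containers.yml \
  -e '{"containers_to_remove": [{"vmid": 108, "name": "dlx-mysql-02"}], "dry_run": false}'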

Safety features:

Backups created before removal: /tmp/pve-container-backups/
Dry-run mode by default (set dry_run=false to execute)
Manual approval on each container
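
The backup-then-remove ordering behind these safety features can be sketched as a plan generator that prints commands instead of running them. The function name `removal_plan` is hypothetical and the playbook's real tasks may differ; `/etc/pve/lxc/<vmid>.conf` is where Proxmox keeps LXC container configs, and the backup filename follows the pattern used elsewhere in this guide.

```shell
BACKUP_DIR="/tmp/pve-container-backups"

# Print the commands that would back up a container's config and then
# remove the container. Nothing is executed.
# Usage: removal_plan <vmid> <name>
removal_plan() {
  vmid="$1"
  name="$2"
  echo "mkdir -p $BACKUP_DIR"
  echo "cp /etc/pve/lxc/${vmid}.conf $BACKUP_DIR/container-${vmid}-${name}.conf"
  echo "pct destroy ${vmid}"
}

removal_plan 108 dlx-mysql-02
```

Note that a saved config preserves settings only, not the container's disk contents.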

Time estimate: 2-5 minutes

4. `configure-storage-monitoring.yml`

Purpose: Set up continuous monitoring and alerting

What it does:

Creates monitoring scripts for filesystem, Docker, containers
Installs cron jobs for continuous monitoring
Configures syslog integration
Sets alert thresholds (75%, 85%, 95%)
Provides Prometheus metrics export
Creates cluster status dashboard command
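
To illustrate how the three thresholds might map to alert levels, here is a minimal sketch. The level names (`WARNING`, `ALERT`, `CRITICAL`) and the function name `alert_level` are assumptions; the deployed monitoring scripts may label levels differently. It accepts whole-number percentages only.

```shell
# Map an integer usage percentage to an alert level using the
# 75% / 85% / 95% thresholds.
alert_level() {
  pct="$1"
  if [ "$pct" -ge 95 ]; then
    echo "CRITICAL"
  elif [ "$pct" -ge 85 ]; then
    echo "ALERT"
  elif [ "$pct" -ge 75 ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

# Example: a filesystem at 84% falls in the lowest alert band
alert_level 84
```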

Expected results:

Real-time capacity monitoring
Alerts before running out of space
Integration with monitoring tools

Execution:

# Deploy monitoring to all Proxmox hosts
ansible-playbook playbooks/configure-storage-monitoring.yml -l proxmox

# View cluster status
/usr/local/bin/storage-monitoring/cluster-status.sh

# View alerts
tail -f /var/log/storage-monitor.log

Time estimate: 5 minutes

Execution Plan

Phase 1: Preparation (Before running playbooks)

1. Verify backups exist

# Check backup location
ls -lh /var/backups/

2. Review current state

# Check filesystem usage
df -h /
df -h /mnt/pve/*

# Check Docker usage (proxmox-01 only)
docker system df

# List containers
pct list | head -20
qm list | head -20

3. Document baseline

# Capture baseline metrics
ansible proxmox -m shell -a "df -h /" -u dlxadmin > baseline-storage.txt
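
If a machine-readable baseline is preferred over raw `df -h` text, the usage column can be extracted as in this sketch. The function name `parse_df` is hypothetical; it relies on the POSIX `df -P` layout, where column 5 is the capacity percentage and column 6 is the mount point, and it assumes mount points without spaces.

```shell
# Read `df -P` output on stdin; print "<mount> <use%>" per filesystem.
parse_df() {
  awk 'NR > 1 { gsub("%", "", $5); print $6, $5 }'
}

# Record the current usage of the root filesystem
df -P / | parse_df
```

Appending this output to a dated file gives a compact record to compare against after remediation.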

Phase 2: Execute Remediation

Step 1: Test with dry-run (RECOMMENDED)

# Test critical issues fix
ansible-playbook playbooks/remediate-storage-critical-issues.yml \
  --check -l proxmox-00

# Test Docker cleanup
ansible-playbook playbooks/remediate-docker-storage.yml \
  --check -l proxmox-01

# Test container removal
ansible-playbook playbooks/remediate-stopped-containers.yml \
  --check

Review output before proceeding to Step 2.

Step 2: Execute on proxmox-00 (Critical)

# Clean up root filesystem and logs
ansible-playbook playbooks/remediate-storage-critical-issues.yml \
  -l proxmox-00 -v

Verification:

# SSH to proxmox-00
ssh dlxadmin@192.168.200.10
df -h /
# Should show: from 84.5% → 70-75%

du -sh /var/log
# Should show: smaller size after cleanup

Step 3: Execute on proxmox-01 (High Priority)

# Clean Docker storage
ansible-playbook playbooks/remediate-docker-storage.yml \
  -l proxmox-01 -v

Verification:

# SSH to proxmox-01
ssh dlxadmin@192.168.200.11
df -h /mnt/pve/dlx-docker
# Should show: from 81% → 60-70%

docker system df
# Should show: reduced image/volume sizes

Step 4: Remove Stopped Containers (Optional)

# First, verify which containers will be removed
ansible-playbook playbooks/remediate-stopped-containers.yml \
  --check

# Review output, then execute
ansible-playbook playbooks/remediate-stopped-containers.yml \
  -e dry_run=false -v

Verification:

# Check backup location
ls -lh /tmp/pve-container-backups/

# Verify stopped containers are gone
pct list | grep stopped

Step 5: Enable Monitoring

# Configure monitoring on all hosts
ansible-playbook playbooks/configure-storage-monitoring.yml \
  -l proxmox

Verification:

# Check monitoring scripts installed
ls -la /usr/local/bin/storage-monitoring/

# Check cron jobs
crontab -l | grep storage

# View monitoring logs
tail -f /var/log/storage-monitor.log

Timeline

Immediate (Today)

✅ Review remediation playbooks
✅ Run dry-run tests
✅ Execute proxmox-00 cleanup
✅ Execute proxmox-01 cleanup

Expected duration: 30 minutes

Short-term (This week)

✅ Remove stopped containers
✅ Enable monitoring
✅ Verify stability (48 hours)
✅ Document changes

Expected duration: 2-4 hours over 48 hours

Ongoing (Monthly)

Review monitoring logs
Execute cleanup playbooks
Audit new containers
Update storage audit

Rollback Plan

If something goes wrong, you can roll back:

Restore Filesystem from Snapshot

# If you have LVM snapshots
lvconvert --merge /dev/mapper/pve-root_snapshot

# Or restore from backup
proxmox-backup-client restore /mnt/backups/...

Recover Deleted Containers

# Restore from a vzdump backup archive. pct restore takes the VMID first and needs
# a full backup archive; the saved .conf only records the container's settings.
pct restore 108 <vzdump-archive>

# Start container
pct start 108

Restore Docker Images

# Pull images from registry
docker pull image:tag

# Or restore from backup
docker load < image-backup.tar

Monitoring & Validation

Daily Checks

# Monitor storage trends
tail -f /var/log/storage-monitor.log

# Check cluster status
/usr/local/bin/storage-monitoring/cluster-status.sh

# Alert check
grep ALERT /var/log/storage-monitor.log

Weekly Verification

# Run storage audit
ansible-playbook playbooks/remediate-storage-critical-issues.yml --check

# Review Docker disk usage
docker system df

# List containers by size (skip the header row; the name is the last column of pct list)
pct list | tail -n +2 | while read -r line; do
  vmid=$(echo "$line" | awk '{print $1}')
  name=$(echo "$line" | awk '{print $NF}')
  size=$(du -sh "/var/lib/lxc/$vmid" 2>/dev/null | awk '{print $1}')
  echo "$vmid $name $size"
done | sort -k3 -hr

Monthly Audit

# Update storage audit report
ansible-playbook playbooks/remediate-storage-critical-issues.yml --check -v

# Generate updated metrics (per-storage total, used, and available space)
pvesh get /nodes/proxmox-00/storage

# Compare to baseline
diff baseline-storage.txt <(ansible proxmox -m shell -a "df -h /" -u dlxadmin)
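
Because a textual diff only shows that lines changed, a small helper can express the change in percentage points. This is a sketch under the assumption that usage values have already been reduced to whole numbers; the function name `usage_delta` is hypothetical.

```shell
# Describe the change between a baseline and a current usage percentage.
# Usage: usage_delta <baseline_pct> <current_pct>
usage_delta() {
  delta=$(( $2 - $1 ))
  if [ "$delta" -gt 0 ]; then
    echo "grew by ${delta} points"
  elif [ "$delta" -lt 0 ]; then
    echo "shrank by $(( 0 - delta )) points"
  else
    echo "unchanged"
  fi
}

# Example: baseline 85%, current 72%
usage_delta 85 72
```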

Troubleshooting

Issue: Root filesystem still full after cleanup

Symptoms: df -h / still shows >80%

Solutions:

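Check for large files on the root filesystem only: find / -xdev -type f -size +1G 2>/dev/null
Prune unused Docker data: docker system prune -a (removes all unused images, not only dangling ones)
Check logs: du -sh /var/log/* | sort -hr | head
Expand the partition if cleanup alone is not enough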

Issue: Docker cleanup removed needed image

Symptoms: Container fails to start after cleanup

Solution: Rebuild or pull image

docker pull image:tag
docker-compose up -d

Issue: Removed container was still in use

Recovery: Restore from backup

# List available backups
ls -la /tmp/pve-container-backups/

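# Restore to a new VMID from a vzdump backup archive (pct restore takes the VMID first;
# the saved .conf only records settings and cannot restore the container's data)
pct restore 200 <vzdump-archive>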
pct start 200

References

Storage Audit: docs/STORAGE-AUDIT.md
Proxmox Docs: https://pve.proxmox.com/wiki/Storage
Docker Cleanup: https://docs.docker.com/config/pruning/
LXC Management: man pct

Appendix: Commands Reference

Quick capacity check

# All hosts
ansible proxmox -m shell -a "df -h / | tail -1" -u dlxadmin

# Specific host
ssh dlxadmin@proxmox-00 "df -h /"

Container info

# All containers
pct list

# Container details
pct config <vmid>
pct status <vmid>

# Container logs (the -- separates pct options from the command run inside the container)
pct exec <vmid> -- tail -f /var/log/syslog

Docker management

# Storage usage
docker system df

# Cleanup
docker system prune -af
docker image prune -f
docker volume prune -f

# Container logs
docker logs <container>
docker logs -f <container>

Monitoring

# View alerts
tail -f /var/log/storage-monitor.log
tail -f /var/log/docker-monitor.log

# System logs
journalctl -t storage-monitor -f
journalctl -t docker-monitor -f

Support

If you encounter issues:

Check /var/log/storage-monitor.log for alerts
Review playbook output for specific errors
Verify backups exist before removing containers
Test with --check flag before executing

Next scheduled audit: 2026-03-08
