Storage Remediation Guide
Generated: 2026-02-08
Status: Critical issues identified; remediation playbooks created
Priority: 🔴 HIGH (immediate action recommended)
Overview
Four storage issues, ranging from medium to critical severity, have been identified in the Proxmox cluster:
| Issue | Severity | Current | Target | Playbook |
|---|---|---|---|---|
| proxmox-00 root FS | 🔴 CRITICAL | 84.5% | <70% | remediate-storage-critical-issues.yml |
| proxmox-01 dlx-docker | 🟠 HIGH | 81.1% | <75% | remediate-docker-storage.yml |
| SonarQube disk usage | 🟠 HIGH | 354 GB | Archive data | remediate-storage-critical-issues.yml |
| Unused containers | ⚠️ MEDIUM | 1.2 TB allocated | Cleanup | remediate-stopped-containers.yml |
Corresponding remediation playbooks have been created to automate fixes.
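The Current and Target columns for the two filesystem rows reduce to a simple comparison. The following is a minimal sketch, assuming percentages are passed as plain numbers; the function name over_target is hypothetical and not part of the playbooks.

```shell
# Hypothetical helper: succeeds (exit 0) when current usage exceeds the target.
# awk does the comparison because shell arithmetic is integer-only.
over_target() {
    local current="$1" target="$2"
    awk -v c="$current" -v t="$target" 'BEGIN { exit !(c > t) }'
}

# Values taken from the table above.
if over_target 84.5 70; then
    echo "proxmox-00 root FS: above target, remediation needed"
fi
if over_target 81.1 75; then
    echo "proxmox-01 dlx-docker: above target, remediation needed"
fi
```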
Remediation Playbooks
1. remediate-storage-critical-issues.yml
Purpose: Address immediate critical issues on proxmox-00 and proxmox-01
What it does:
- Compresses old journal logs (>30 days)
- Removes old syslog files (>90 days)
- Cleans apt cache and temp files
- Prunes Docker images, volumes, and build cache
- Audits SonarQube usage
- Lists stopped containers for manual review
Expected results:
- proxmox-00 root: Frees ~10-15 GB
- proxmox-01 dlx-docker: Frees ~20-50 GB
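To preview which files the log-age rules would touch before running the playbook, something like the following can help. It is a sketch rather than the playbook's actual logic: the function name list_old_files is hypothetical, and the syslog* pattern is an assumption about how rotated logs are named on the hosts.

```shell
# Hypothetical helper: print files under a directory older than N days.
# Usage: list_old_files <dir> <days> [name-pattern]
list_old_files() {
    local dir="$1" days="$2" pattern="${3:-*}"
    find "$dir" -type f -name "$pattern" -mtime +"$days" -print 2>/dev/null
}

# Example (read-only): syslog files older than 90 days
# list_old_files /var/log 90 'syslog*'
# The journal is managed separately; its current size can be checked with:
# journalctl --disk-usage
```

The helper only lists files, so it is safe to run on a production host.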
Execution:
# Dry-run (safe, shows what would be done)
ansible-playbook playbooks/remediate-storage-critical-issues.yml --check
# Execute on specific host
ansible-playbook playbooks/remediate-storage-critical-issues.yml -l proxmox-00
Time estimate: 5-10 minutes per host
2. remediate-docker-storage.yml
Purpose: Deep cleanup of Docker storage on proxmox-01
What it does:
- Analyzes Docker container sizes
- Lists Docker images by size
- Finds dangling images and volumes
- Removes unused Docker resources
- Configures automated weekly cleanup
- Sets up hourly monitoring
Expected results:
- Removes unused images/layers
- Frees 50-150 GB depending on usage
- Prevents regrowth with automation
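The automated weekly cleanup could take the form of a cron entry along these lines. This is a sketch only: the schedule, the seven-day until filter, and the function name render_cleanup_cron are assumptions, not values read from the playbook.

```shell
# Hypothetical helper: print a cron.d entry for a weekly Docker image prune.
# Schedule (Sunday 03:00) and retention (168h = 7 days) are assumed values.
render_cleanup_cron() {
    local retention="${1:-168h}"
    printf '%s\n' \
        "0 3 * * 0 root docker image prune -af --filter until=${retention} >> /var/log/docker-monitor.log 2>&1"
}

# Preview the entry; installing it under /etc/cron.d/ is left to the operator.
render_cleanup_cron
```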
Execution:
# Dry-run first
ansible-playbook playbooks/remediate-docker-storage.yml -l proxmox-01 --check
# Execute
ansible-playbook playbooks/remediate-docker-storage.yml -l proxmox-01
Time estimate: 10-15 minutes
3. remediate-stopped-containers.yml
Purpose: Safely remove unused LXC containers
What it does:
- Lists all stopped containers
- Calculates disk allocation per container
- Creates configuration backups before removal
- Safely removes containers (with dry-run mode)
- Provides recovery instructions
Expected results:
- Removes 1-2 TB of unused container allocations
- Allows recovery via backed-up configs
Execution:
# DRY RUN (no deletion, default)
ansible-playbook playbooks/remediate-stopped-containers.yml --check
# To actually remove (set dry_run=false)
ansible-playbook playbooks/remediate-stopped-containers.yml \
-e dry_run=false
# Remove specific containers only
ansible-playbook playbooks/remediate-stopped-containers.yml \
-e 'containers_to_remove=[{vmid: 108, name: dlx-mysql-02}]' \
-e dry_run=false
Safety features:
- Backups created before removal: /tmp/pve-container-backups/
- Dry-run mode by default (set dry_run=false to execute)
- Manual approval on each container
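The dry-run default can be pictured as a small gate around each destructive command. The sketch below is an illustration of that pattern, not the playbook's implementation: the names DRY_RUN, run_or_echo and remove_container are hypothetical, and the backup file naming is simplified.

```shell
# Hypothetical wrapper: print the command in dry-run mode, run it otherwise.
DRY_RUN="${DRY_RUN:-true}"   # mirrors the playbook's dry_run default

run_or_echo() {
    if [ "$DRY_RUN" = "false" ]; then
        "$@"
    else
        echo "DRY RUN: $*"
    fi
}

remove_container() {
    local vmid="$1" backup_dir="/tmp/pve-container-backups"
    # Save the container config first, then destroy the container.
    run_or_echo mkdir -p "$backup_dir"
    run_or_echo cp "/etc/pve/lxc/${vmid}.conf" "$backup_dir/"
    run_or_echo pct destroy "$vmid"
}

# Prints three DRY RUN lines unless DRY_RUN=false is set in the environment.
remove_container 108
```

Keeping the gate in one function means a single variable decides whether anything is deleted.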
Time estimate: 2-5 minutes
4. configure-storage-monitoring.yml
Purpose: Set up continuous monitoring and alerting
What it does:
- Creates monitoring scripts for filesystem, Docker, containers
- Installs cron jobs for continuous monitoring
- Configures syslog integration
- Sets alert thresholds (75%, 85%, 95%)
- Provides Prometheus metrics export
- Creates cluster status dashboard command
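The three alert thresholds map naturally onto a small classification function. This is a sketch under assumptions: the level names and the function name classify_usage are invented here, and only the 75/85/95 values come from the list above.

```shell
# Hypothetical helper: map an integer usage percentage to an alert level.
# Thresholds are from this guide; the level names are assumed.
classify_usage() {
    local pct="$1"
    if   [ "$pct" -ge 95 ]; then echo "CRITICAL"
    elif [ "$pct" -ge 85 ]; then echo "HIGH"
    elif [ "$pct" -ge 75 ]; then echo "WARNING"
    else                         echo "OK"
    fi
}

# 84 is at or above 75 but below 85, so this prints WARNING.
classify_usage 84
```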
Expected results:
- Real-time capacity monitoring
- Alerts before running out of space
- Integration with monitoring tools
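For the Prometheus export, one plausible shape is a gauge written in the text exposition format, which tools such as the node_exporter textfile collector can read. The metric name storage_fs_used_percent and the function name emit_fs_metric are invented for this sketch; the playbook may expose different names.

```shell
# Hypothetical helper: print one gauge in Prometheus text exposition format.
# Usage: emit_fs_metric <mountpoint> <used-percent>
emit_fs_metric() {
    local mount="$1" pct="$2"
    echo "# TYPE storage_fs_used_percent gauge"
    echo "storage_fs_used_percent{mountpoint=\"${mount}\"} ${pct}"
}

# Example with an illustrative value for the root filesystem.
emit_fs_metric / 84
```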
Execution:
# Deploy monitoring to all Proxmox hosts
ansible-playbook playbooks/configure-storage-monitoring.yml -l proxmox
# View cluster status
/usr/local/bin/storage-monitoring/cluster-status.sh
# View alerts
tail -f /var/log/storage-monitor.log
Time estimate: 5 minutes
Execution Plan
Phase 1: Preparation (Before running playbooks)
1. Verify backups exist
# Check backup location
ls -lh /var/backups/
2. Review current state
# Check filesystem usage
df -h /
df -h /mnt/pve/*
# Check Docker usage (proxmox-01 only)
docker system df
# List containers
pct list | head -20
qm list | head -20
3. Document baseline
# Capture baseline metrics
ansible proxmox -m shell -a "df -h /" -u dlxadmin > baseline-storage.txt
Phase 2: Execute Remediation
Step 1: Test with dry-run (RECOMMENDED)
# Test critical issues fix
ansible-playbook playbooks/remediate-storage-critical-issues.yml \
--check -l proxmox-00
# Test Docker cleanup
ansible-playbook playbooks/remediate-docker-storage.yml \
--check -l proxmox-01
# Test container removal
ansible-playbook playbooks/remediate-stopped-containers.yml \
--check
Review output before proceeding to Step 2.
Step 2: Execute on proxmox-00 (Critical)
# Clean up root filesystem and logs
ansible-playbook playbooks/remediate-storage-critical-issues.yml \
-l proxmox-00 -v
Verification:
# SSH to proxmox-00
ssh dlxadmin@192.168.200.10
df -h /
# Should show: from 84.5% → 70-75%
du -sh /var/log
# Should show: smaller size after cleanup
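The manual check above can also be scripted so that it returns a pass/fail result. A minimal sketch, assuming POSIX df output; the function names fs_used_percent and verify_below are hypothetical.

```shell
# Hypothetical helper: print the used percentage of a mount as an integer.
fs_used_percent() {
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical check: exit 0 when usage is at or below the given ceiling.
verify_below() {
    local mount="$1" ceiling="$2" used
    used=$(fs_used_percent "$mount")
    echo "${mount}: ${used}% used (ceiling ${ceiling}%)"
    [ "$used" -le "$ceiling" ]
}

# 75 is the upper end of the expected post-cleanup range for proxmox-00.
verify_below / 75 || echo "cleanup target not reached yet"
```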
Step 3: Execute on proxmox-01 (High Priority)
# Clean Docker storage
ansible-playbook playbooks/remediate-docker-storage.yml \
-l proxmox-01 -v
Verification:
# SSH to proxmox-01
ssh dlxadmin@192.168.200.11
df -h /mnt/pve/dlx-docker
# Should show: from 81% → 60-70%
docker system df
# Should show: reduced image/volume sizes
Step 4: Remove Stopped Containers (Optional)
# First, verify which containers will be removed
ansible-playbook playbooks/remediate-stopped-containers.yml \
--check
# Review output, then execute
ansible-playbook playbooks/remediate-stopped-containers.yml \
-e dry_run=false -v
Verification:
# Check backup location
ls -lh /tmp/pve-container-backups/
# Verify stopped containers are gone
pct list | grep stopped
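Matching on the Status column is slightly stricter than grep, which would also match a container whose name contains "stopped". The sketch below assumes the usual pct list column order (VMID, Status, Lock, Name); the function name stopped_vmids is hypothetical.

```shell
# Hypothetical filter: read `pct list` output on stdin, print stopped VMIDs.
stopped_vmids() {
    awk 'NR > 1 && $2 == "stopped" { print $1 }'
}

# Only runs where pct is installed (i.e. on a Proxmox host).
if command -v pct >/dev/null 2>&1; then
    remaining=$(pct list | stopped_vmids)
    if [ -z "$remaining" ]; then
        echo "no stopped containers remain"
    else
        echo "still stopped: $remaining"
    fi
fi
```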
Step 5: Enable Monitoring
# Configure monitoring on all hosts
ansible-playbook playbooks/configure-storage-monitoring.yml \
-l proxmox
Verification:
# Check monitoring scripts installed
ls -la /usr/local/bin/storage-monitoring/
# Check cron jobs
crontab -l | grep storage
# View monitoring logs
tail -f /var/log/storage-monitor.log
Timeline
Immediate (Today)
- ✅ Review remediation playbooks
- ✅ Run dry-run tests
- ✅ Execute proxmox-00 cleanup
- ✅ Execute proxmox-01 cleanup
Expected duration: 30 minutes
Short-term (This week)
- ✅ Remove stopped containers
- ✅ Enable monitoring
- ✅ Verify stability (48 hours)
- ✅ Document changes
Expected duration: 2-4 hours over 48 hours
Ongoing (Monthly)
- Review monitoring logs
- Execute cleanup playbooks
- Audit new containers
- Update storage audit
Rollback Plan
If something goes wrong, you can roll back:
Restore Filesystem from Snapshot
# If you have LVM snapshots
lvconvert --merge /dev/mapper/pve-root_snapshot
# Or restore from backup
proxmox-backup-client restore /mnt/backups/...
Recover Deleted Containers
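# Restore from a vzdump archive (pct restore takes the VMID first;
# a .conf file alone is not a restorable archive)
pct restore 108 <vzdump-archive>
# Or, if the container's volumes still exist, reinstate the saved config
cp /tmp/pve-container-backups/container-108-dlx-mysql-02.conf /etc/pve/lxc/108.conf
# Start container
pct start 108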
Restore Docker Images
# Pull images from registry
docker pull image:tag
# Or restore from backup
docker load < image-backup.tar
Monitoring & Validation
Daily Checks
# Monitor storage trends
tail -f /var/log/storage-monitor.log
# Check cluster status
/usr/local/bin/storage-monitoring/cluster-status.sh
# Alert check
grep ALERT /var/log/storage-monitor.log
Weekly Verification
# Run storage audit
ansible-playbook playbooks/remediate-storage-critical-issues.yml --check
# Review Docker logs
docker system df
# List containers by size
# pct list columns are VMID, Status, Lock, Name; skip the header row
pct list | tail -n +2 | while read -r line; do
    vmid=$(echo "$line" | awk '{print $1}')
    name=$(echo "$line" | awk '{print $NF}')
    size=$(du -sh "/var/lib/lxc/$vmid" 2>/dev/null | awk '{print $1}')
    echo "$vmid $name $size"
done | sort -k3 -hr
Monthly Audit
# Update storage audit report
ansible-playbook playbooks/remediate-storage-critical-issues.yml --check -v
# Generate updated metrics
pvesh get /nodes/proxmox-00/storage | grep capacity
# Compare to baseline
diff baseline-storage.txt <(ansible proxmox -m shell -a "df -h /" -u dlxadmin)
Troubleshooting
Issue: Root filesystem still full after cleanup
Symptoms: df -h / still shows >80%
Solutions:
- Check for large files: find / -size +1G 2>/dev/null
- Check Docker: docker system prune -a
- Check logs: du -sh /var/log/* | sort -hr | head
- Expand partition (if necessary)
Issue: Docker cleanup removed needed image
Symptoms: Container fails to start after cleanup
Solution: Rebuild or pull image
docker pull image:tag
docker-compose up -d
Issue: Removed container was still in use
Recovery: Restore from backup
# List available backups
ls -la /tmp/pve-container-backups/
# Restore to new VMID (the VMID comes first; the source must be a vzdump
# archive, since the saved .conf only preserves configuration)
pct restore 200 <vzdump-archive>
pct start 200
References
- Storage Audit: docs/STORAGE-AUDIT.md
- Proxmox Docs: https://pve.proxmox.com/wiki/Storage
- Docker Cleanup: https://docs.docker.com/config/pruning/
- LXC Management: man pct
Appendix: Commands Reference
Quick capacity check
# All hosts
ansible proxmox -m shell -a "df -h / | tail -1" -u dlxadmin
# Specific host
ssh dlxadmin@proxmox-00 "df -h /"
Container info
# All containers
pct list
# Container details
pct config <vmid>
pct status <vmid>
# Container logs
pct exec <vmid> -- tail -f /var/log/syslog
Docker management
# Storage usage
docker system df
# Cleanup
docker system prune -af
docker image prune -f
docker volume prune -f
# Container logs
docker logs <container>
docker logs -f <container>
Monitoring
# View alerts
tail -f /var/log/storage-monitor.log
tail -f /var/log/docker-monitor.log
# System logs
journalctl -t storage-monitor -f
journalctl -t docker-monitor -f
Support
If you encounter issues:
- Check /var/log/storage-monitor.log for alerts
- Review playbook output for specific errors
- Verify backups exist before removing containers
- Test with the --check flag before executing
Next scheduled audit: 2026-03-08