# Storage Remediation Guide

**Generated**: 2026-02-08
**Status**: Critical issues identified - Remediation playbooks created
**Priority**: 🔴 HIGH - Immediate action recommended

---
## Overview

Four storage issues, ranging from medium to critical severity, have been identified in the Proxmox cluster:

| Issue | Severity | Current | Target | Playbook |
|-------|----------|---------|--------|----------|
| proxmox-00 root FS | 🔴 CRITICAL | 84.5% | <70% | remediate-storage-critical-issues.yml |
| proxmox-01 dlx-docker | 🟠 HIGH | 81.1% | <75% | remediate-docker-storage.yml |
| SonarQube disk usage | 🟠 HIGH | 354 GB | Archive data | remediate-storage-critical-issues.yml |
| Unused containers | ⚠️ MEDIUM | 1.2 TB allocated | Cleanup | remediate-stopped-containers.yml |

Corresponding **remediation playbooks** have been created to automate the fixes.

---
## Remediation Playbooks

### 1. `remediate-storage-critical-issues.yml`

**Purpose**: Address immediate critical issues on proxmox-00 and proxmox-01

**What it does**:
- Compresses old journal logs (>30 days)
- Removes old syslog files (>90 days)
- Cleans apt cache and temp files
- Prunes Docker images, volumes, and build cache
- Audits SonarQube usage
- Lists stopped containers for manual review

**Expected results**:
- proxmox-00 root: Frees ~10-15 GB
- proxmox-01 dlx-docker: Frees ~20-50 GB

**Execution**:
```bash
# Dry-run (safe, shows what would be done)
ansible-playbook playbooks/remediate-storage-critical-issues.yml --check

# Execute on specific host
ansible-playbook playbooks/remediate-storage-critical-issues.yml -l proxmox-00
```
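For orientation, the cleanup steps listed under **What it does** could be approximated by hand roughly as follows. This is a sketch under assumptions, not the playbook's actual tasks: the `run` helper, the `DRY_RUN` variable, and the function names are hypothetical, and the exact commands the playbook uses may differ. Note that `journalctl --vacuum-time` deletes archived journal files rather than compressing them. Set `DRY_RUN=false` only after reviewing the printed commands.

```bash
#!/usr/bin/env bash
# Hypothetical manual equivalent of the cleanup steps; not the playbook itself.
DRY_RUN="${DRY_RUN:-true}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "true" ]; then
    echo "[dry-run] $*"
  else
    "$@"
  fi
}

# Root filesystem cleanup (journals, rotated syslog files, apt cache).
cleanup_root_fs() {
  run journalctl --vacuum-time=30d                       # drop archived journals older than 30 days
  run find /var/log -name 'syslog.*' -mtime +90 -delete  # rotated syslog files older than 90 days
  run apt-get clean                                      # clear the apt package cache
}

# Docker cleanup (unused images, unused volumes, build cache).
cleanup_docker() {
  run docker image prune -af
  run docker volume prune -f
  run docker builder prune -f
}
```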
**Time estimate**: 5-10 minutes per host

---
### 2. `remediate-docker-storage.yml`

**Purpose**: Deep cleanup of Docker storage on proxmox-01

**What it does**:
- Analyzes Docker container sizes
- Lists Docker images by size
- Finds dangling images and volumes
- Removes unused Docker resources
- Configures automated weekly cleanup
- Sets up hourly monitoring

**Expected results**:
- Removes unused images/layers
- Frees 50-150 GB depending on usage
- Prevents regrowth with automation

**Execution**:
```bash
# Dry-run first
ansible-playbook playbooks/remediate-docker-storage.yml -l proxmox-01 --check

# Execute
ansible-playbook playbooks/remediate-docker-storage.yml -l proxmox-01
```
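The analysis steps above map onto standard Docker CLI queries. The sketch below shows one way such a report could be gathered by hand before running the playbook. It is an assumption about the approach, not a copy of the playbook's tasks, and the function name `docker_storage_report` is hypothetical. Nothing in it deletes data.

```bash
#!/usr/bin/env bash
# Hypothetical read-only report of Docker disk usage.
docker_storage_report() {
  echo "== Container sizes =="
  docker ps -a --size --format '{{.Names}}\t{{.Size}}'

  echo "== Images by size =="
  # sort -h gives an approximate human-readable ordering of the size column
  docker images --format '{{.Size}}\t{{.Repository}}:{{.Tag}}' | sort -hr

  echo "== Dangling images =="
  docker images --filter dangling=true

  echo "== Dangling volumes =="
  docker volume ls --filter dangling=true
}
```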
**Time estimate**: 10-15 minutes

---
### 3. `remediate-stopped-containers.yml`

**Purpose**: Safely remove unused LXC containers

**What it does**:
- Lists all stopped containers
- Calculates disk allocation per container
- Creates configuration backups before removal
- Safely removes containers (with dry-run mode)
- Provides recovery instructions

**Expected results**:
- Removes 1-2 TB of unused container allocations
- Allows recovery via backed-up configs

**Execution**:
```bash
# DRY RUN (no deletion, default)
ansible-playbook playbooks/remediate-stopped-containers.yml --check

# To actually remove (set dry_run=false)
ansible-playbook playbooks/remediate-stopped-containers.yml \
  -e dry_run=false

# Remove specific containers only
# (JSON extra vars keep the list structured; key=value pairs are passed as strings)
ansible-playbook playbooks/remediate-stopped-containers.yml \
  -e '{"containers_to_remove": [{"vmid": 108, "name": "dlx-mysql-02"}]}' \
  -e dry_run=false
```
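The backup-then-remove flow described above could look roughly like the following when done by hand. This is a hedged sketch, not the playbook's implementation: `stopped_vmids`, `remove_container`, `BACKUP_DIR`, and `DRY_RUN` are hypothetical names, the backup file name is simplified, and the config path `/etc/pve/lxc/<vmid>.conf` assumes a standard Proxmox VE layout. Copying the config preserves settings only, not the container's root filesystem.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the backup-then-remove flow; not the playbook itself.
BACKUP_DIR="${BACKUP_DIR:-/tmp/pve-container-backups}"
DRY_RUN="${DRY_RUN:-true}"

# Read `pct list` output on stdin and print the VMIDs whose status is "stopped".
stopped_vmids() {
  awk 'NR > 1 && $2 == "stopped" { print $1 }'
}

# Back up the container config, then destroy the container unless dry-run is set.
remove_container() {
  local vmid="$1"
  if [ "$DRY_RUN" = "true" ]; then
    echo "[dry-run] would back up and destroy container $vmid"
    return 0
  fi
  mkdir -p "$BACKUP_DIR"
  cp "/etc/pve/lxc/${vmid}.conf" "$BACKUP_DIR/container-${vmid}.conf" || return 1
  pct destroy "$vmid"
}

# Usage: pct list | stopped_vmids | while read -r vmid; do remove_container "$vmid"; done
```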
**Safety features**:
- Backups created before removal: `/tmp/pve-container-backups/` (configuration only; `/tmp` is typically cleared on reboot, so copy these elsewhere if they need to persist)
- Dry-run mode by default (set `dry_run=false` to execute)
- Manual approval on each container

**Time estimate**: 2-5 minutes

---
### 4. `configure-storage-monitoring.yml`

**Purpose**: Set up continuous monitoring and alerting

**What it does**:
- Creates monitoring scripts for filesystem, Docker, containers
- Installs cron jobs for continuous monitoring
- Configures syslog integration
- Sets alert thresholds (75%, 85%, 95%)
- Provides Prometheus metrics export
- Creates cluster status dashboard command

**Expected results**:
- Real-time capacity monitoring
- Alerts before running out of space
- Integration with monitoring tools

**Execution**:
```bash
# Deploy monitoring to all Proxmox hosts
ansible-playbook playbooks/configure-storage-monitoring.yml -l proxmox

# View cluster status
/usr/local/bin/storage-monitoring/cluster-status.sh

# View alerts
tail -f /var/log/storage-monitor.log
```
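To illustrate how the three alert thresholds might be applied, here is a minimal sketch of a threshold check. The level names (`WARNING`, `CRITICAL`, `EMERGENCY`), the function names, and the log message format are assumptions; the deployed monitoring scripts may classify and report differently. The `storage-monitor` tag matches the one used with `journalctl -t` elsewhere in this guide.

```bash
#!/usr/bin/env bash
# Hypothetical threshold check; the level names are assumptions.
# Expects an integer percentage, as reported by df.
classify_usage() {
  local pct="$1"
  if   [ "$pct" -ge 95 ]; then echo "EMERGENCY"
  elif [ "$pct" -ge 85 ]; then echo "CRITICAL"
  elif [ "$pct" -ge 75 ]; then echo "WARNING"
  else echo "OK"
  fi
}

# Send an alert to syslog if a mount point crosses a threshold.
check_mount() {
  local mount="$1" pct level
  pct="$(df --output=pcent "$mount" | tail -1 | tr -dc '0-9')"
  level="$(classify_usage "$pct")"
  if [ "$level" != "OK" ]; then
    logger -t storage-monitor "ALERT $level: $mount at ${pct}%"
  fi
}

# Usage: check_mount /
```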
**Time estimate**: 5 minutes

---
## Execution Plan

### Phase 1: Preparation (Before running playbooks)

#### 1. Verify backups exist
```bash
# Check backup location
ls -lh /var/backups/
```

#### 2. Review current state
```bash
# Check filesystem usage
df -h /
df -h /mnt/pve/*

# Check Docker usage (proxmox-01 only)
docker system df

# List containers and VMs
pct list | head -20
qm list | head -20
```

#### 3. Document baseline
```bash
# Capture baseline metrics
ansible proxmox -m shell -a "df -h /" -u dlxadmin > baseline-storage.txt
```
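A single `baseline-storage.txt` is overwritten each time the command is run. One option, sketched here as a suggestion rather than part of the existing workflow, is to date-stamp the file so that successive baselines can be kept side by side. The `baseline_file` and `capture_baseline` helpers are hypothetical, and later commands that read `baseline-storage.txt` would need the dated name instead.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build a date-stamped baseline file name.
baseline_file() {
  local day="${1:-$(date +%F)}"   # YYYY-MM-DD, defaults to today
  echo "baseline-storage-${day}.txt"
}

# Capture root filesystem usage from every host in the "proxmox" group.
capture_baseline() {
  ansible proxmox -m shell -a "df -h /" -u dlxadmin > "$(baseline_file)"
}
```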
---

### Phase 2: Execute Remediation

#### Step 1: Test with dry-run (RECOMMENDED)

```bash
# Test critical issues fix
ansible-playbook playbooks/remediate-storage-critical-issues.yml \
  --check -l proxmox-00

# Test Docker cleanup
ansible-playbook playbooks/remediate-docker-storage.yml \
  --check -l proxmox-01

# Test container removal
ansible-playbook playbooks/remediate-stopped-containers.yml \
  --check
```

Review output before proceeding to Step 2.
#### Step 2: Execute on proxmox-00 (Critical)

```bash
# Clean up root filesystem and logs
ansible-playbook playbooks/remediate-storage-critical-issues.yml \
  -l proxmox-00 -v
```

**Verification**:
```bash
# SSH to proxmox-00
ssh dlxadmin@192.168.200.10
df -h /
# Should show: from 84.5% → 70-75%

du -sh /var/log
# Should show: smaller size after cleanup
```
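As a sanity check on the expected figures, usage after cleanup can be estimated as the old percentage minus the freed space divided by the filesystem size. The sketch below works this through. The 100 GB root filesystem size is an inference from the numbers in this guide (freeing 10-15 GB to move from 84.5% to 70-75%), not a documented value, and `projected_usage` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Estimate usage (%) after freeing space: new% = old% - (freed / size) * 100.
projected_usage() {
  local size_gb="$1" used_pct="$2" freed_gb="$3"
  awk -v s="$size_gb" -v u="$used_pct" -v f="$freed_gb" \
    'BEGIN { printf "%.1f\n", u - (f / s) * 100 }'
}

# Assumed 100 GB root filesystem, currently at 84.5%
projected_usage 100 84.5 10   # prints 74.5
projected_usage 100 84.5 15   # prints 69.5
```

Under that assumption, only the upper end of the 10-15 GB estimate reaches the <70% target listed in the overview table.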
#### Step 3: Execute on proxmox-01 (High Priority)

```bash
# Clean Docker storage
ansible-playbook playbooks/remediate-docker-storage.yml \
  -l proxmox-01 -v
```

**Verification**:
```bash
# SSH to proxmox-01
ssh dlxadmin@192.168.200.11
df -h /mnt/pve/dlx-docker
# Should show: from 81% → 60-70%

docker system df
# Should show: reduced image/volume sizes
```
#### Step 4: Remove Stopped Containers (Optional)

```bash
# First, verify which containers will be removed
ansible-playbook playbooks/remediate-stopped-containers.yml \
  --check

# Review output, then execute
ansible-playbook playbooks/remediate-stopped-containers.yml \
  -e dry_run=false -v
```

**Verification**:
```bash
# Check backup location
ls -lh /tmp/pve-container-backups/

# Verify stopped containers are gone
pct list | grep stopped
```
#### Step 5: Enable Monitoring

```bash
# Configure monitoring on all hosts
ansible-playbook playbooks/configure-storage-monitoring.yml \
  -l proxmox
```

**Verification**:
```bash
# Check monitoring scripts installed
ls -la /usr/local/bin/storage-monitoring/

# Check cron jobs
crontab -l | grep storage

# View monitoring logs
tail -f /var/log/storage-monitor.log
```
---

## Timeline

### Immediate (Today)
1. ✅ Review remediation playbooks
2. ✅ Run dry-run tests
3. ✅ Execute proxmox-00 cleanup
4. ✅ Execute proxmox-01 cleanup

**Expected duration**: 30 minutes

### Short-term (This week)
1. ✅ Remove stopped containers
2. ✅ Enable monitoring
3. ✅ Verify stability (48 hours)
4. ✅ Document changes

**Expected duration**: 2-4 hours over 48 hours

### Ongoing (Monthly)
1. Review monitoring logs
2. Execute cleanup playbooks
3. Audit new containers
4. Update storage audit

---
## Rollback Plan

If something goes wrong, you can roll back:

### Restore Filesystem from Snapshot
```bash
# If you have LVM snapshots
# (merging into a mounted root volume is deferred until the next reboot)
lvconvert --merge /dev/mapper/pve-root_snapshot

# Or restore from backup
proxmox-backup-client restore /mnt/backups/...
```
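The merge command above only helps if a snapshot was taken before the cleanup. A minimal sketch of creating one follows, assuming the default Proxmox LVM layout with a volume group named `pve` and a logical volume named `root`, and enough free extents in the volume group to hold the snapshot. The `run` helper, `DRY_RUN` variable, `snapshot_root` function, and the 5G default size are all hypothetical choices for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical pre-cleanup snapshot helper; adjust names and size to your layout.
DRY_RUN="${DRY_RUN:-true}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "true" ]; then
    echo "[dry-run] $*"
  else
    "$@"
  fi
}

# Create a snapshot of the root LV named root_snapshot (matches the merge path above).
snapshot_root() {
  local size="${1:-5G}"
  run lvcreate --snapshot --size "$size" --name root_snapshot /dev/pve/root
}
```

Remove the snapshot with `lvremove` once the cleanup is verified, since a snapshot that fills up becomes invalid.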
### Recover Deleted Containers
```bash
# pct restore takes the VMID first, then a vzdump backup archive.
# A saved .conf file holds configuration only, so recovering the
# container's data requires a full backup archive.
pct restore 108 <path-to-vzdump-archive>

# Start container
pct start 108
```
### Restore Docker Images
```bash
# Pull images from registry
docker pull image:tag

# Or restore from backup
docker load < image-backup.tar
```
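Restoring with `docker load` presupposes that an archive was saved before pruning. A hedged sketch of doing so with `docker save` follows. The `backup_image` function, the `run` helper, the `DRY_RUN` variable, and the file-naming scheme are hypothetical, and `nginx:1.25` in the usage comment is only an example image name.

```bash
#!/usr/bin/env bash
# Hypothetical helper to archive an image before pruning.
DRY_RUN="${DRY_RUN:-true}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "true" ]; then
    echo "[dry-run] $*"
  else
    "$@"
  fi
}

# Save an image to <out_dir>/<image>.tar, replacing "/" and ":" with "_".
backup_image() {
  local image="$1" out_dir="${2:-.}" file
  file="$out_dir/$(echo "$image" | tr '/:' '__').tar"
  run docker save --output "$file" "$image"
}

# Usage: DRY_RUN=false backup_image nginx:1.25 /var/backups
```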
---

## Monitoring & Validation

### Daily Checks
```bash
# Monitor storage trends
tail -f /var/log/storage-monitor.log

# Check cluster status
/usr/local/bin/storage-monitoring/cluster-status.sh

# Alert check
grep ALERT /var/log/storage-monitor.log
```
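For a quick daily summary, the alert check can be reduced to a count. The sketch below assumes, as the `grep` above implies, that alert lines in the log contain the string `ALERT`. The `count_alerts` helper is hypothetical and the real log format may differ.

```bash
#!/usr/bin/env bash
# Count lines containing ALERT in a monitoring log; prints 0 if the file is unreadable.
count_alerts() {
  local log="${1:-/var/log/storage-monitor.log}"
  [ -r "$log" ] || { echo 0; return 0; }
  # grep -c exits non-zero when there are no matches, so ignore its status.
  grep -c 'ALERT' "$log" || true
}
```

Calling `count_alerts` with no argument reads `/var/log/storage-monitor.log`; pass another path, such as `/var/log/docker-monitor.log`, to check a different log.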
### Weekly Verification
```bash
# Run storage audit
ansible-playbook playbooks/remediate-storage-critical-issues.yml --check

# Review Docker disk usage
docker system df

# List containers by size (skip the header row of `pct list`)
pct list | tail -n +2 | while read -r line; do
  vmid=$(echo "$line" | awk '{print $1}')
  name=$(echo "$line" | awk '{print $NF}')  # Name is the last column; $2 is Status
  size=$(du -sh "/var/lib/lxc/$vmid" 2>/dev/null | awk '{print $1}')
  echo "$vmid $name $size"
done | sort -k3 -hr
```
### Monthly Audit
```bash
# Update storage audit report
ansible-playbook playbooks/remediate-storage-critical-issues.yml --check -v

# Generate updated metrics
pvesh get /nodes/proxmox-00/storage | grep capacity

# Compare to baseline
diff baseline-storage.txt <(ansible proxmox -m shell -a "df -h /" -u dlxadmin)
```
---

## Troubleshooting

### Issue: Root filesystem still full after cleanup

**Symptoms**: `df -h /` still shows >80%

**Solutions**:
1. Check for large files on the root filesystem: `find / -xdev -type f -size +1G 2>/dev/null`
2. Prune unused Docker data: `docker system prune -a` (this removes all unused images, not only dangling ones)
3. Check logs: `du -sh /var/log/* | sort -hr | head`
4. Expand partition (if necessary)
### Issue: Docker cleanup removed needed image

**Symptoms**: Container fails to start after cleanup

**Solution**: Rebuild or pull image
```bash
docker pull image:tag
docker-compose up -d
```
### Issue: Removed container was still in use

**Recovery**: Restore from backup
```bash
# List available config backups
ls -la /tmp/pve-container-backups/

# Restore to a new VMID (VMID first, then a vzdump backup archive;
# the saved .conf alone does not contain the container's data)
pct restore 200 <path-to-vzdump-archive>
pct start 200
```
---

## References

- **Storage Audit**: `docs/STORAGE-AUDIT.md`
- **Proxmox Docs**: https://pve.proxmox.com/wiki/Storage
- **Docker Cleanup**: https://docs.docker.com/config/pruning/
- **LXC Management**: `man pct`

---
## Appendix: Commands Reference

### Quick capacity check
```bash
# All hosts
ansible proxmox -m shell -a "df -h / | tail -1" -u dlxadmin

# Specific host
ssh dlxadmin@proxmox-00 "df -h /"
```
### Container info
```bash
# All containers
pct list

# Container details
pct config <vmid>
pct status <vmid>

# Container logs
pct exec <vmid> tail -f /var/log/syslog
```
### Docker management
```bash
# Storage usage
docker system df

# Cleanup
docker system prune -af
docker image prune -f
docker volume prune -f

# Container logs
docker logs <container>
docker logs -f <container>
```
### Monitoring
```bash
# View alerts
tail -f /var/log/storage-monitor.log
tail -f /var/log/docker-monitor.log

# System logs
journalctl -t storage-monitor -f
journalctl -t docker-monitor -f
```
---

## Support

If you encounter issues:
1. Check `/var/log/storage-monitor.log` for alerts
2. Review playbook output for specific errors
3. Verify backups exist before removing containers
4. Test with `--check` flag before executing

**Next scheduled audit**: 2026-03-08