# Storage Remediation Guide

**Generated**: 2026-02-08
**Status**: Critical issues identified - Remediation playbooks created
**Priority**: 🔴 HIGH - Immediate action recommended

---

## Overview

Four storage issues, ranging from medium to critical severity, have been identified in the Proxmox cluster:

| Issue | Severity | Current | Target | Playbook |
|-------|----------|---------|--------|----------|
| proxmox-00 root FS | 🔴 CRITICAL | 84.5% | <70% | remediate-storage-critical-issues.yml |
| proxmox-01 dlx-docker | 🟠 HIGH | 81.1% | <75% | remediate-docker-storage.yml |
| SonarQube disk usage | 🟠 HIGH | 354 GB | Archive data | remediate-storage-critical-issues.yml |
| Unused containers | ⚠️ MEDIUM | 1.2 TB allocated | Cleanup | remediate-stopped-containers.yml |

Corresponding **remediation playbooks** have been created to automate fixes.

---

## Remediation Playbooks

### 1. `remediate-storage-critical-issues.yml`

**Purpose**: Address immediate critical issues on proxmox-00 and proxmox-01

**What it does**:
- Compresses old journal logs (>30 days)
- Removes old syslog files (>90 days)
- Cleans apt cache and temp files
- Prunes Docker images, volumes, and build cache
- Audits SonarQube usage
- Lists stopped containers for manual review

**Expected results**:
- proxmox-00 root: Frees ~10-15 GB
- proxmox-01 dlx-docker: Frees ~20-50 GB

**Execution**:
```bash
# Dry-run (safe, shows what would be done)
ansible-playbook playbooks/remediate-storage-critical-issues.yml --check

# Execute on specific host
ansible-playbook playbooks/remediate-storage-critical-issues.yml -l proxmox-00
```

**Time estimate**: 5-10 minutes per host
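
The log-retention steps listed above can be previewed by hand before the playbook runs. The following is a minimal sketch, not the playbook's own implementation: `list_old_logs` is a hypothetical helper name, and it assumes the logs of interest are regular files under a single directory such as `/var/log`. It only prints candidates and deletes nothing.

```bash
# Hypothetical helper (not part of the playbook): print regular files under
# a directory whose modification time is more than N days old.
# It only lists candidates; nothing is deleted.
list_old_logs() {
  local dir="$1" days="$2"
  # -xdev keeps find on one filesystem; -mtime +N matches files older than N days
  find "$dir" -xdev -type f -mtime +"$days" -print 2>/dev/null | sort
}
```

For example, `list_old_logs /var/log 90` prints files that fall outside the 90-day syslog window, and `journalctl --disk-usage` reports how much space the journal currently occupies.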

---

### 2. `remediate-docker-storage.yml`

**Purpose**: Deep cleanup of Docker storage on proxmox-01

**What it does**:
- Analyzes Docker container sizes
- Lists Docker images by size
- Finds dangling images and volumes
- Removes unused Docker resources
- Configures automated weekly cleanup
- Sets up hourly monitoring

**Expected results**:
- Removes unused images/layers
- Frees 50-150 GB depending on usage
- Prevents regrowth with automation

**Execution**:
```bash
# Dry-run first
ansible-playbook playbooks/remediate-docker-storage.yml -l proxmox-01 --check

# Execute
ansible-playbook playbooks/remediate-docker-storage.yml -l proxmox-01
```

**Time estimate**: 10-15 minutes
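
The analysis steps above map onto standard Docker CLI queries, so they can be run read-only before any cleanup. The sketch below is an assumption about how such a report might look, not the playbook's actual tasks. `docker_storage_report` and the `DOCKER_BIN` override are hypothetical names introduced here for illustration.

```bash
# Hypothetical read-only report (not the playbook's implementation).
# DOCKER_BIN is an assumed override so the function can be pointed at a
# stub for testing; it defaults to the real docker CLI.
docker_storage_report() {
  local docker_bin="${DOCKER_BIN:-docker}"
  echo "== Disk usage summary =="
  "$docker_bin" system df
  echo "== Dangling images =="
  "$docker_bin" images --filter dangling=true
  echo "== Dangling volumes =="
  "$docker_bin" volume ls --filter dangling=true
}
```

None of these commands remove anything, so the report is safe to run on proxmox-01 at any time.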

---

### 3. `remediate-stopped-containers.yml`

**Purpose**: Safely remove unused LXC containers

**What it does**:
- Lists all stopped containers
- Calculates disk allocation per container
- Creates configuration backups before removal
- Safely removes containers (with dry-run mode)
- Provides recovery instructions

**Expected results**:
- Removes 1-2 TB of unused container allocations
- Allows recovery via backed-up configs

**Execution**:
```bash
# DRY RUN (no deletion, default)
ansible-playbook playbooks/remediate-stopped-containers.yml --check

# To actually remove (set dry_run=false)
ansible-playbook playbooks/remediate-stopped-containers.yml \
  -e dry_run=false

# Remove specific containers only
# (JSON form, because key=value extra vars are always parsed as strings)
ansible-playbook playbooks/remediate-stopped-containers.yml \
  -e '{"containers_to_remove": [{"vmid": 108, "name": "dlx-mysql-02"}], "dry_run": false}'
```

**Safety features**:
- Backups created before removal: `/tmp/pve-container-backups/` (copy them elsewhere if they must survive a reboot, since `/tmp` may be cleared)
- Dry-run mode by default (set `dry_run=false` to execute)
- Manual approval on each container

**Time estimate**: 2-5 minutes
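
To review removal candidates by hand, the stopped containers can be filtered out of `pct list`. This is a small sketch rather than the playbook's logic. `stopped_containers` is a hypothetical helper, and it assumes the usual `pct list` layout of a header row followed by `VMID Status [Lock] Name` columns.

```bash
# Hypothetical helper: read `pct list` output on stdin and print
# "VMID NAME" for every container whose status is "stopped".
stopped_containers() {
  # NR > 1 skips the header row; the name is always the last column,
  # because the Lock column is empty for unlocked containers.
  awk 'NR > 1 && $2 == "stopped" { print $1, $NF }'
}
```

Typical use on a Proxmox host would be `pct list | stopped_containers`.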

---

### 4. `configure-storage-monitoring.yml`

**Purpose**: Set up continuous monitoring and alerting

**What it does**:
- Creates monitoring scripts for filesystem, Docker, containers
- Installs cron jobs for continuous monitoring
- Configures syslog integration
- Sets alert thresholds (75%, 85%, 95%)
- Provides Prometheus metrics export
- Creates cluster status dashboard command

**Expected results**:
- Real-time capacity monitoring
- Alerts before running out of space
- Integration with monitoring tools

**Execution**:
```bash
# Deploy monitoring to all Proxmox hosts
ansible-playbook playbooks/configure-storage-monitoring.yml -l proxmox

# View cluster status
/usr/local/bin/storage-monitoring/cluster-status.sh

# View alerts
tail -f /var/log/storage-monitor.log
```

**Time estimate**: 5 minutes
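
The three alert thresholds can be illustrated with a short classifier. This is a sketch under assumptions, not the deployed monitoring script: the function name `classify_usage`, the level names, and the choice to treat each threshold as inclusive are all hypothetical.

```bash
# Hypothetical helper: map an integer usage percentage to an alert level
# using the 75/85/95 thresholds. Level names are assumptions.
classify_usage() {
  local pct="$1"
  if   [ "$pct" -ge 95 ]; then echo "CRITICAL"
  elif [ "$pct" -ge 85 ]; then echo "HIGH"
  elif [ "$pct" -ge 75 ]; then echo "WARNING"
  else                         echo "OK"
  fi
}
```

The comparison is integer-only, which matches the whole-number percentages that `df` reports.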

---

## Execution Plan

### Phase 1: Preparation (Before running playbooks)

#### 1. Verify backups exist
```bash
# Check backup location
ls -lh /var/backups/
```

#### 2. Review current state
```bash
# Check filesystem usage
df -h /
df -h /mnt/pve/*

# Check Docker usage (proxmox-01 only)
docker system df

# List containers
pct list | head -20
qm list | head -20
```

#### 3. Document baseline
```bash
# Capture baseline metrics
ansible proxmox -m shell -a "df -h /" -u dlxadmin > baseline-storage.txt
```
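
For a baseline that is easier to compare later, the usage percentage can be extracted as a bare number. The sketch below assumes POSIX `df -P` output, where the fifth column holds the percentage. `fs_usage_pct` is a hypothetical helper name.

```bash
# Hypothetical helper: print the usage percentage of the filesystem that
# holds the given path, as a bare integer (for example 84).
fs_usage_pct() {
  # df -P prints one line per filesystem; column 5 is the capacity, e.g. "84%"
  df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

A line such as `echo "$(hostname) / $(fs_usage_pct /)"` then yields one compact record per host.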

---

### Phase 2: Execute Remediation

#### Step 1: Test with dry-run (RECOMMENDED)

```bash
# Test critical issues fix
ansible-playbook playbooks/remediate-storage-critical-issues.yml \
  --check -l proxmox-00

# Test Docker cleanup
ansible-playbook playbooks/remediate-docker-storage.yml \
  --check -l proxmox-01

# Test container removal
ansible-playbook playbooks/remediate-stopped-containers.yml \
  --check
```

Review output before proceeding to Step 2.

#### Step 2: Execute on proxmox-00 (Critical)

```bash
# Clean up root filesystem and logs
ansible-playbook playbooks/remediate-storage-critical-issues.yml \
  -l proxmox-00 -v
```

**Verification**:
```bash
# SSH to proxmox-00
ssh dlxadmin@192.168.200.10
df -h /
# Should show: down from 84.5% to roughly 70-75% (target: <70%)

du -sh /var/log
# Should show: smaller size after cleanup
```
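
Whether the cleanup is enough to reach the target depends on the size of the filesystem, which this guide does not state. The arithmetic is sketched below with a hypothetical helper, `gb_to_free`, which computes `total * (current - target) / 100`.

```bash
# Hypothetical helper: how many GB must be freed to move a filesystem of
# TOTAL_GB from CURRENT_PCT down to TARGET_PCT.
gb_to_free() {
  local total_gb="$1" current_pct="$2" target_pct="$3"
  awk -v t="$total_gb" -v c="$current_pct" -v g="$target_pct" \
    'BEGIN { d = t * (c - g) / 100; if (d < 0) d = 0; printf "%.1f\n", d }'
}
```

As an illustration only, a 100 GB root filesystem at 84.5% would need `gb_to_free 100 84.5 70`, that is 14.5 GB, to reach 70%. Substitute the real size from `df -h /`.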

#### Step 3: Execute on proxmox-01 (High Priority)

```bash
# Clean Docker storage
ansible-playbook playbooks/remediate-docker-storage.yml \
  -l proxmox-01 -v
```

**Verification**:
```bash
# SSH to proxmox-01
ssh dlxadmin@192.168.200.11
df -h /mnt/pve/dlx-docker
# Should show: from 81% → 60-70%

docker system df
# Should show: reduced image/volume sizes
```

#### Step 4: Remove Stopped Containers (Optional)

```bash
# First, verify which containers will be removed
ansible-playbook playbooks/remediate-stopped-containers.yml \
  --check

# Review output, then execute
ansible-playbook playbooks/remediate-stopped-containers.yml \
  -e dry_run=false -v
```

**Verification**:
```bash
# Check backup location
ls -lh /tmp/pve-container-backups/

# Verify stopped containers are gone
pct list | grep stopped
```

#### Step 5: Enable Monitoring

```bash
# Configure monitoring on all hosts
ansible-playbook playbooks/configure-storage-monitoring.yml \
  -l proxmox
```

**Verification**:
```bash
# Check monitoring scripts installed
ls -la /usr/local/bin/storage-monitoring/

# Check cron jobs
crontab -l | grep storage

# View monitoring logs
tail -f /var/log/storage-monitor.log
```

---

## Timeline

### Immediate (Today)
1. ✅ Review remediation playbooks
2. ✅ Run dry-run tests
3. ✅ Execute proxmox-00 cleanup
4. ✅ Execute proxmox-01 cleanup

**Expected duration**: 30 minutes

### Short-term (This week)
1. ✅ Remove stopped containers
2. ✅ Enable monitoring
3. ✅ Verify stability (48 hours)
4. ✅ Document changes

**Expected duration**: 2-4 hours over 48 hours

### Ongoing (Monthly)
1. Review monitoring logs
2. Execute cleanup playbooks
3. Audit new containers
4. Update storage audit

---

## Rollback Plan

If something goes wrong, you can roll back:

### Restore Filesystem from Snapshot
```bash
# If you have LVM snapshots
lvconvert --merge /dev/mapper/pve-root_snapshot

# Or restore from backup
proxmox-backup-client restore /mnt/backups/...
```

### Recover Deleted Containers
```bash
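# pct restore takes the VMID first, then a vzdump backup archive.
# The saved .conf holds settings only, so container data must come from a vzdump backup;
# use /tmp/pve-container-backups/container-108-dlx-mysql-02.conf as a reference for settings.
pct restore 108 <vzdump-archive>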

# Start container
pct start 108
```

### Restore Docker Images
```bash
# Pull images from registry
docker pull image:tag

# Or restore from backup
docker load < image-backup.tar
```

---

## Monitoring & Validation

### Daily Checks
```bash
# Monitor storage trends
tail -f /var/log/storage-monitor.log

# Check cluster status
/usr/local/bin/storage-monitoring/cluster-status.sh

# Alert check
grep ALERT /var/log/storage-monitor.log
```
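
The alert check can be turned into a pass/fail step for a daily script. This sketch assumes, as the `grep` above does, that alert lines contain the literal string `ALERT`; the rest of the log format is not documented here. `alerts_in_log` is a hypothetical helper name.

```bash
# Hypothetical helper: print the number of ALERT lines in a log file.
# Returns 0 when there are none, 1 when alerts exist, 2 if unreadable.
alerts_in_log() {
  local log="${1:-/var/log/storage-monitor.log}"
  [ -r "$log" ] || { echo "cannot read $log" >&2; return 2; }
  local n
  # grep -c prints the count but exits 1 when it is zero, so ignore its status
  n=$(grep -c 'ALERT' "$log") || true
  echo "$n"
  [ "$n" -eq 0 ]
}
```

A non-zero return makes it easy to chain, for example `alerts_in_log || echo "review storage alerts"`.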

### Weekly Verification
```bash
# Run storage audit
ansible-playbook playbooks/remediate-storage-critical-issues.yml --check

# Review Docker disk usage
docker system df

# List containers by size
# (skip the header row; the name is the last column of `pct list`)
pct list | awk 'NR > 1 {print $1, $NF}' | while read -r vmid name; do
  size=$(du -sh "/var/lib/lxc/$vmid" 2>/dev/null | awk '{print $1}')
  echo "$vmid $name $size"
done | sort -k3 -hr
```

### Monthly Audit
```bash
# Update storage audit report
ansible-playbook playbooks/remediate-storage-critical-issues.yml --check -v

# Generate updated metrics
pvesh get /nodes/proxmox-00/storage

# Compare to baseline
diff baseline-storage.txt <(ansible proxmox -m shell -a "df -h /" -u dlxadmin)
```

---

## Troubleshooting

### Issue: Root filesystem still full after cleanup

**Symptoms**: `df -h /` still shows >80%

**Solutions**:
1. Check for large files on the root filesystem: `find / -xdev -type f -size +1G 2>/dev/null`
2. Prune Docker (removes all unused images, not only dangling ones): `docker system prune -a`
3. Check logs: `du -sh /var/log/* | sort -hr | head`
4. Expand partition (if necessary)
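
When the culprit is not obvious, ranking the immediate subdirectories by size narrows the search. The sketch below is one possible approach, with `largest_dirs` as a hypothetical helper name. It stays on a single filesystem so that mounted storage does not distort the result.

```bash
# Hypothetical helper: print the N largest immediate subdirectories of a
# directory as "size-in-KiB<TAB>path", largest first.
largest_dirs() {
  local dir="${1:-/}" n="${2:-10}"
  # -x: stay on one filesystem; -k: sizes in KiB; -d 1: one level deep.
  # The awk filter drops the grand-total line for the directory itself.
  du -xk -d 1 "$dir" 2>/dev/null \
    | sort -rn \
    | awk -F'\t' -v top="$dir" '$2 != top' \
    | head -n "$n"
}
```

Running `largest_dirs / 5` and then repeating on the biggest entry walks down to the directory that holds the space.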

### Issue: Docker cleanup removed needed image

**Symptoms**: Container fails to start after cleanup

**Solution**: Rebuild or pull image
```bash
docker pull image:tag
docker-compose up -d
```

### Issue: Removed container was still in use

**Recovery**: Restore from backup
```bash
# List available backups
ls -la /tmp/pve-container-backups/

# Restore to a new VMID from a vzdump backup archive
# (pct restore takes the VMID first; the saved .conf is not a restorable archive)
pct restore 200 <vzdump-archive>
pct start 200
```

---

## References

- **Storage Audit**: `docs/STORAGE-AUDIT.md`
- **Proxmox Docs**: https://pve.proxmox.com/wiki/Storage
- **Docker Cleanup**: https://docs.docker.com/config/pruning/
- **LXC Management**: `man pct`

---

## Appendix: Commands Reference

### Quick capacity check
```bash
# All hosts
ansible proxmox -m shell -a "df -h / | tail -1" -u dlxadmin

# Specific host
ssh dlxadmin@proxmox-00 "df -h /"
```

### Container info
```bash
# All containers
pct list

# Container details
pct config <vmid>
pct status <vmid>

# Container logs
pct exec <vmid> -- tail -f /var/log/syslog
```

### Docker management
```bash
# Storage usage
docker system df

# Cleanup
docker system prune -af
docker image prune -f
docker volume prune -f

# Container logs
docker logs <container>
docker logs -f <container>
```

### Monitoring
```bash
# View alerts
tail -f /var/log/storage-monitor.log
tail -f /var/log/docker-monitor.log

# System logs
journalctl -t storage-monitor -f
journalctl -t docker-monitor -f
```

---

## Support

If you encounter issues:
1. Check `/var/log/storage-monitor.log` for alerts
2. Review playbook output for specific errors
3. Verify backups exist before removing containers
4. Test with `--check` flag before executing

**Next scheduled audit**: 2026-03-08