# Storage Remediation Guide
**Generated**: 2026-02-08
**Status**: Critical issues identified - Remediation playbooks created
**Priority**: 🔴 HIGH - Immediate action recommended
---
## Overview
Four storage issues (one critical, two high, one medium) have been identified in the Proxmox cluster:
| Issue | Severity | Current | Target | Playbook |
|-------|----------|---------|--------|----------|
| proxmox-00 root FS | 🔴 CRITICAL | 84.5% | <70% | remediate-storage-critical-issues.yml |
| proxmox-01 dlx-docker | 🟠 HIGH | 81.1% | <75% | remediate-docker-storage.yml |
| SonarQube disk usage | 🟠 HIGH | 354 GB | Archive data | remediate-storage-critical-issues.yml |
| Unused containers | 🟡 MEDIUM | 1.2 TB allocated | Cleanup | remediate-stopped-containers.yml |
Corresponding **remediation playbooks** have been created to automate fixes.
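
To see how far each volume is from its target, the percentages in the table can be compared in the shell. This is a minimal sketch and not part of the playbooks: `over_target` is a hypothetical helper name, and it assumes `awk` is available for the decimal comparison.

```bash
#!/usr/bin/env bash
# over_target CURRENT TARGET
# Prints how many percentage points CURRENT exceeds TARGET, or "ok".
# Hypothetical helper for illustration; the sample values come from the table above.
over_target() {
  awk -v cur="$1" -v tgt="$2" 'BEGIN {
    if (cur > tgt) printf "over by %.1f points\n", cur - tgt
    else print "ok"
  }'
}

over_target 84.5 70   # proxmox-00 root FS
over_target 81.1 75   # proxmox-01 dlx-docker
```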
---
## Remediation Playbooks
### 1. `remediate-storage-critical-issues.yml`
**Purpose**: Address immediate critical issues on proxmox-00 and proxmox-01
**What it does**:
- Compresses old journal logs (>30 days)
- Removes old syslog files (>90 days)
- Cleans apt cache and temp files
- Prunes Docker images, volumes, and build cache
- Audits SonarQube usage
- Lists stopped containers for manual review
**Expected results**:
- proxmox-00 root: Frees ~10-15 GB
- proxmox-01 dlx-docker: Frees ~20-50 GB
**Execution**:
```bash
# Dry-run (no changes made; command/shell tasks are skipped in check mode, so output may be incomplete)
ansible-playbook playbooks/remediate-storage-critical-issues.yml --check
# Execute on specific host
ansible-playbook playbooks/remediate-storage-critical-issues.yml -l proxmox-00
```
**Time estimate**: 5-10 minutes per host
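
Before running the playbook, it can help to preview which log files fall outside the retention windows listed above. The sketch below is read-only and is an assumption about how such a preview might look, not the playbook's actual task: `list_old_files` is a hypothetical helper built on `find -mtime`.

```bash
#!/usr/bin/env bash
# list_old_files DIR DAYS
# Lists regular files under DIR last modified more than DAYS days ago.
# Read-only preview: nothing is compressed or deleted.
list_old_files() {
  find "$1" -xdev -type f -mtime +"$2" -print
}

# Examples on a Proxmox host (paths assume a standard Debian log layout):
# list_old_files /var/log 90 | grep syslog   # rotated syslog files past 90 days
# journalctl --disk-usage                    # current size of the systemd journal
```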
---
### 2. `remediate-docker-storage.yml`
**Purpose**: Deep cleanup of Docker storage on proxmox-01
**What it does**:
- Analyzes Docker container sizes
- Lists Docker images by size
- Finds dangling images and volumes
- Removes unused Docker resources
- Configures automated weekly cleanup
- Sets up hourly monitoring
**Expected results**:
- Removes unused images/layers
- Frees 50-150 GB depending on usage
- Prevents regrowth with automation
**Execution**:
```bash
# Dry-run first
ansible-playbook playbooks/remediate-docker-storage.yml -l proxmox-01 --check
# Execute
ansible-playbook playbooks/remediate-docker-storage.yml -l proxmox-01
```
**Time estimate**: 10-15 minutes
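
The "lists Docker images by size" step can be approximated by hand when reviewing what the playbook is likely to remove. This is a sketch under assumptions: `largest_first` is a hypothetical helper, and it relies on GNU `sort -h` to order sizes such as `1.2GB` and `500MB`.

```bash
#!/usr/bin/env bash
# largest_first: read "SIZE NAME" lines on stdin and sort by
# human-readable size, largest first (GNU sort -h).
largest_first() {
  LC_ALL=C sort -k1,1 -hr
}

# On the Docker host:
# docker images --format '{{.Size}} {{.Repository}}:{{.Tag}}' | largest_first | head -20
```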
---
### 3. `remediate-stopped-containers.yml`
**Purpose**: Safely remove unused LXC containers
**What it does**:
- Lists all stopped containers
- Calculates disk allocation per container
- Creates configuration backups before removal
- Safely removes containers (with dry-run mode)
- Provides recovery instructions
**Expected results**:
- Removes 1-2 TB of unused container allocations
- Allows recovery via backed-up configs
**Execution**:
```bash
# DRY RUN (no deletion, default)
ansible-playbook playbooks/remediate-stopped-containers.yml --check
# To actually remove (set dry_run=false)
ansible-playbook playbooks/remediate-stopped-containers.yml \
-e dry_run=false
# Remove specific containers only (JSON extra vars keep the list and boolean types intact)
ansible-playbook playbooks/remediate-stopped-containers.yml \
-e '{"containers_to_remove": [{"vmid": 108, "name": "dlx-mysql-02"}], "dry_run": false}'
```
**Safety features**:
- Configuration backups created before removal in `/tmp/pve-container-backups/` (config only; container data is not included)
- Dry-run mode by default (set `dry_run=false` to execute)
- Manual approval on each container
**Time estimate**: 2-5 minutes
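
The listing and config-backup steps can also be reproduced manually to cross-check the playbook's dry-run output. The sketch below assumes the standard `pct list` column layout (VMID, Status, Lock, Name). Both helper names are hypothetical, and the backup file naming is simplified compared with the playbook's.

```bash
#!/usr/bin/env bash
# stopped_vmids: read `pct list` output on stdin and print the VMIDs
# whose Status column is "stopped" (the header line is skipped).
stopped_vmids() {
  awk 'NR > 1 && $2 == "stopped" { print $1 }'
}

# backup_configs DIR: for each VMID read from stdin, save `pct config`
# output to DIR/container-<vmid>.conf (config only, no container data).
backup_configs() {
  local dir=$1 vmid
  mkdir -p "$dir"
  while read -r vmid; do
    pct config "$vmid" > "$dir/container-$vmid.conf"
  done
}

# On a Proxmox host:
# pct list | stopped_vmids | backup_configs /tmp/pve-container-backups
```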
---
### 4. `configure-storage-monitoring.yml`
**Purpose**: Set up continuous monitoring and alerting
**What it does**:
- Creates monitoring scripts for filesystem, Docker, containers
- Installs cron jobs for continuous monitoring
- Configures syslog integration
- Sets alert thresholds (75%, 85%, 95%)
- Provides Prometheus metrics export
- Creates cluster status dashboard command
**Expected results**:
- Real-time capacity monitoring
- Alerts before running out of space
- Integration with monitoring tools
**Execution**:
```bash
# Deploy monitoring to all Proxmox hosts
ansible-playbook playbooks/configure-storage-monitoring.yml -l proxmox
# View cluster status
/usr/local/bin/storage-monitoring/cluster-status.sh
# View alerts
tail -f /var/log/storage-monitor.log
```
**Time estimate**: 5 minutes
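
The alert thresholds above map naturally onto a small classification function. This is an illustrative sketch, not the deployed monitoring script: the thresholds (75, 85, 95) come from this guide, while the level names and the `alert_level` helper are assumptions.

```bash
#!/usr/bin/env bash
# alert_level PERCENT
# Maps an integer usage percentage (with or without a trailing %) to a level.
# Level names are assumed for illustration.
alert_level() {
  local pct=${1%\%}
  if   [ "$pct" -ge 95 ]; then echo CRITICAL
  elif [ "$pct" -ge 85 ]; then echo HIGH
  elif [ "$pct" -ge 75 ]; then echo WARNING
  else echo OK
  fi
}

alert_level 84   # falls between the 75% and 85% thresholds

# With GNU df, the root filesystem could be classified like this:
# usage=$(df --output=pcent / | tail -n 1 | tr -d ' ')
# echo "$(date -Is) / $usage $(alert_level "$usage")"
```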
---
## Execution Plan
### Phase 1: Preparation (Before running playbooks)
#### 1. Verify backups exist
```bash
# Check backup location
ls -lh /var/backups/
```
#### 2. Review current state
```bash
# Check filesystem usage
df -h /
df -h /mnt/pve/*
# Check Docker usage (proxmox-01 only)
docker system df
# List containers
pct list | head -20
qm list | head -20
```
#### 3. Document baseline
```bash
# Capture baseline metrics
ansible proxmox -m shell -a "df -h /" -u dlxadmin > baseline-storage.txt
```
---
### Phase 2: Execute Remediation
#### Step 1: Test with dry-run (RECOMMENDED)
```bash
# Test critical issues fix
ansible-playbook playbooks/remediate-storage-critical-issues.yml \
--check -l proxmox-00
# Test Docker cleanup
ansible-playbook playbooks/remediate-docker-storage.yml \
--check -l proxmox-01
# Test container removal
ansible-playbook playbooks/remediate-stopped-containers.yml \
--check
```
Review output before proceeding to Step 2.
#### Step 2: Execute on proxmox-00 (Critical)
```bash
# Clean up root filesystem and logs
ansible-playbook playbooks/remediate-storage-critical-issues.yml \
-l proxmox-00 -v
```
**Verification**:
```bash
# SSH to proxmox-00
ssh dlxadmin@192.168.200.10
df -h /
# Should show: from 84.5% → 70-75%
du -sh /var/log
# Should show: smaller size after cleanup
```
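
For a scripted version of this check, usage can be compared against a target and turned into an exit status. This is a minimal sketch assuming GNU `df` (for `--output=pcent`); `usage_pct` and `below_target` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# usage_pct MOUNT: print the used percentage of MOUNT as a bare integer.
usage_pct() {
  df --output=pcent "$1" | tail -n 1 | tr -dc '0-9'
}

# below_target MOUNT TARGET: succeed if usage of MOUNT is below TARGET percent.
below_target() {
  [ "$(usage_pct "$1")" -lt "$2" ]
}

if below_target / 70; then
  echo "root filesystem is below the 70% target"
else
  echo "root filesystem is at $(usage_pct /)%, still at or above target"
fi
```

The exit status of `below_target` makes it usable in an `if` or with `&&` inside a larger verification script.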
#### Step 3: Execute on proxmox-01 (High Priority)
```bash
# Clean Docker storage
ansible-playbook playbooks/remediate-docker-storage.yml \
-l proxmox-01 -v
```
**Verification**:
```bash
# SSH to proxmox-01
ssh dlxadmin@192.168.200.11
df -h /mnt/pve/dlx-docker
# Should show: from 81% → 60-70%
docker system df
# Should show: reduced image/volume sizes
```
#### Step 4: Remove Stopped Containers (Optional)
```bash
# First, verify which containers will be removed
ansible-playbook playbooks/remediate-stopped-containers.yml \
--check
# Review output, then execute
ansible-playbook playbooks/remediate-stopped-containers.yml \
-e dry_run=false -v
```
**Verification**:
```bash
# Check backup location
ls -lh /tmp/pve-container-backups/
# Verify stopped containers are gone (no output means none remain)
pct list | grep stopped
```
#### Step 5: Enable Monitoring
```bash
# Configure monitoring on all hosts
ansible-playbook playbooks/configure-storage-monitoring.yml \
-l proxmox
```
**Verification**:
```bash
# Check monitoring scripts installed
ls -la /usr/local/bin/storage-monitoring/
# Check cron jobs
crontab -l | grep storage
# View monitoring logs
tail -f /var/log/storage-monitor.log
```
---
## Timeline
### Immediate (Today)
1. ✅ Review remediation playbooks
2. ✅ Run dry-run tests
3. ✅ Execute proxmox-00 cleanup
4. ✅ Execute proxmox-01 cleanup
**Expected duration**: 30 minutes
### Short-term (This week)
1. ✅ Remove stopped containers
2. ✅ Enable monitoring
3. ✅ Verify stability (48 hours)
4. ✅ Document changes
**Expected duration**: 2-4 hours over 48 hours
### Ongoing (Monthly)
1. Review monitoring logs
2. Execute cleanup playbooks
3. Audit new containers
4. Update storage audit
---
## Rollback Plan
If something goes wrong, you can roll back:
### Restore Filesystem from Snapshot
```bash
# If you have LVM snapshots
lvconvert --merge /dev/mapper/pve-root_snapshot
# Or restore from backup
proxmox-backup-client restore /mnt/backups/...
```
### Recover Deleted Containers
```bash
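# pct restore takes the VMID first, then a vzdump archive (placeholder below);
# the saved .conf is not a restorable archive, but it records the original settings
pct restore 108 <vzdump-archive>
cat /tmp/pve-container-backups/container-108-dlx-mysql-02.conf
# Start container
pct start 108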
```
### Restore Docker Images
```bash
# Pull images from registry
docker pull image:tag
# Or restore from backup
docker load < image-backup.tar
```
---
## Monitoring & Validation
### Daily Checks
```bash
# Monitor storage trends
tail -f /var/log/storage-monitor.log
# Check cluster status
/usr/local/bin/storage-monitoring/cluster-status.sh
# Alert check
grep ALERT /var/log/storage-monitor.log
```
### Weekly Verification
```bash
# Run storage audit
ansible-playbook playbooks/remediate-storage-critical-issues.yml --check
# Review Docker disk usage
docker system df
# List containers by allocated rootfs size (read from each container's config)
pct list | tail -n +2 | while read -r vmid status rest; do
  name=${rest##* }  # Name is the last column; the Lock column may be empty
  size=$(pct config "$vmid" | sed -n 's/^rootfs:.*size=\([^,]*\).*/\1/p')
  echo "$vmid $name $size"
done | sort -k3 -hr
```
### Monthly Audit
```bash
# Update storage audit report
ansible-playbook playbooks/remediate-storage-critical-issues.yml --check -v
# Generate updated metrics (the table lists total, used, and avail per storage)
pvesh get /nodes/proxmox-00/storage
# Compare to baseline
diff baseline-storage.txt <(ansible proxmox -m shell -a "df -h /" -u dlxadmin)
```
---
## Troubleshooting
### Issue: Root filesystem still full after cleanup
**Symptoms**: `df -h /` still shows >80%
**Solutions**:
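1. Check for large files on the root filesystem: `find / -xdev -type f -size +1G 2>/dev/null`
2. Check Docker usage with `docker system df`; if images are the cause, `docker system prune -a` removes all images not used by a container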
3. Check logs: `du -sh /var/log/* | sort -hr | head`
4. Expand partition (if necessary)
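
When the steps above do not point to an obvious culprit, ranking directories by size usually narrows it down. The following is a sketch assuming GNU `du` and `sort`; `largest_dirs` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# largest_dirs PATH [N]
# Shows the N largest entries one level below PATH (default 10),
# staying on the same filesystem (-x). The first line is usually
# the total for PATH itself.
largest_dirs() {
  du -x -h --max-depth=1 "$1" 2>/dev/null | sort -hr | head -n "${2:-10}"
}

# Example on a Proxmox host:
# largest_dirs /var 10
```

Running it first on `/` and then on the largest subdirectory it reports is a quick way to drill down.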
### Issue: Docker cleanup removed needed image
**Symptoms**: Container fails to start after cleanup
**Solution**: Rebuild or pull image
```bash
docker pull image:tag
docker-compose up -d
```
### Issue: Removed container was still in use
**Recovery**: Restore from backup
```bash
# List available config backups
ls -la /tmp/pve-container-backups/
# Restore to a new VMID from a vzdump archive (placeholder); the VMID comes first
pct restore 200 <vzdump-archive>
pct start 200
```
---
## References
- **Storage Audit**: `docs/STORAGE-AUDIT.md`
- **Proxmox Docs**: https://pve.proxmox.com/wiki/Storage
- **Docker Cleanup**: https://docs.docker.com/config/pruning/
- **LXC Management**: `man pct`
---
## Appendix: Commands Reference
### Quick capacity check
```bash
# All hosts
ansible proxmox -m shell -a "df -h / | tail -1" -u dlxadmin
# Specific host
ssh dlxadmin@proxmox-00 "df -h /"
```
### Container info
```bash
# All containers
pct list
# Container details
pct config <vmid>
pct status <vmid>
# Container logs
pct exec <vmid> -- tail -f /var/log/syslog
```
### Docker management
```bash
# Storage usage
docker system df
# Cleanup
docker system prune -af
docker image prune -f
docker volume prune -f
# Container logs
docker logs <container>
docker logs -f <container>
```
### Monitoring
```bash
# View alerts
tail -f /var/log/storage-monitor.log
tail -f /var/log/docker-monitor.log
# System logs
journalctl -t storage-monitor -f
journalctl -t docker-monitor -f
```
---
## Support
If you encounter issues:
1. Check `/var/log/storage-monitor.log` for alerts
2. Review playbook output for specific errors
3. Verify backups exist before removing containers
4. Test with `--check` flag before executing
**Next scheduled audit**: 2026-03-08