# Storage Remediation Playbooks Summary
**Created**: 2026-02-08

**Status**: Ready for deployment

---

## Overview

Four Ansible playbooks have been created to remediate critical storage issues identified in the Proxmox cluster storage audit.

---
## Playbooks Created
### 1. `remediate-storage-critical-issues.yml`
**Location**: `playbooks/remediate-storage-critical-issues.yml`
**Purpose**: Address immediate critical and high-priority issues
**Targets**:
- proxmox-00 (root filesystem at 84.5%)
- proxmox-01 (dlx-docker at 81.1%)
- All nodes (SonarQube, stopped containers audit)
**Actions**:
- Compress journal logs (>30 days)
- Remove old syslog files (>90 days)
- Clean apt cache and temp files
- Prune Docker images, volumes, and build cache
- Audit SonarQube disk usage
- Report on stopped containers
**Expected space freed**:
- proxmox-00: 10-15 GB
- proxmox-01: 20-50 GB
- Total: 30-65 GB
**Execution time**: 5-10 minutes
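
The age-based cleanup actions above can be previewed before anything is deleted. The following is a minimal sketch, not the playbook's actual task list: `list_old_logs` is a hypothetical helper, the `syslog*` pattern and 90-day cutoff come from the action list, and the directory argument is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not taken from the playbook): list syslog files older
# than a given number of days so they can be reviewed before removal.
# Usage: list_old_logs <directory> <days>
list_old_logs() {
  local dir="$1" days="$2"
  # -mtime +N matches files whose last modification is more than N days old.
  find "$dir" -maxdepth 1 -type f -name 'syslog*' -mtime "+${days}" -print | sort
}
```

For example, `list_old_logs /var/log 90` would print the candidates for the ">90 days" rule without touching them.
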
---
### 2. `remediate-docker-storage.yml`
**Location**: `playbooks/remediate-docker-storage.yml`
**Purpose**: Detailed Docker storage cleanup for proxmox-01
**Targets**:
- proxmox-01 (Docker host)
- dlx-docker LXC container
**Actions**:
- Analyze container and image sizes
- Identify dangling resources
- Remove unused images, volumes, and build cache
- Run aggressive system prune (`docker system prune -a -f --volumes`)
- Configure automated weekly cleanup
- Setup hourly monitoring with alerting
- Create log rotation policies
**Expected space freed**:
- 50-150 GB depending on usage patterns
**Automated maintenance**:
- Weekly: `docker system prune -af --volumes`
- Hourly: Capacity monitoring and alerting
- Daily: Log rotation with 7-day retention
**Execution time**: 10-15 minutes
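
As an illustration of the weekly maintenance job, the sketch below prints a cron.d-style entry for the prune command listed above. The function name, the Sunday 03:00 schedule, the `root` user, and the log path are all assumptions; the playbook may configure this differently.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: print a cron.d-style entry for the weekly Docker prune.
# Schedule (Sunday 03:00), user, and default log path are assumptions.
# Usage: render_docker_cleanup_cron [log_file]
render_docker_cleanup_cron() {
  local log_file="${1:-/var/log/docker-cleanup.log}"
  printf '%s\n' \
    "# Weekly Docker cleanup (illustrative schedule)" \
    "0 3 * * 0 root docker system prune -af --volumes >> ${log_file} 2>&1"
}
```

Because `--volumes` also deletes unused volumes, it is worth confirming that no needed data lives in them before scheduling a job like this.
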
---
### 3. `remediate-stopped-containers.yml`
**Location**: `playbooks/remediate-stopped-containers.yml`
**Purpose**: Safely remove unused LXC containers
**Targets**:
- All Proxmox hosts
- 15 stopped containers (1.2 TB allocated)
**Actions**:
- Audit all containers and identify stopped ones
- Generate size/allocation report
- Create configuration backups before removal
- Safely remove containers (dry-run by default)
- Provide recovery guide and instructions
- Verify space freed
**Containers targeted for removal** (recommendations):
- dlx-mysql-02 (108): 200 GB
- dlx-mysql-03 (109): 200 GB
- dlx-mattermost (107): 32 GB
- dlx-nocodb (116): 100 GB
- dlx-swarm-01/02/03: 195 GB combined
- dlx-kube-01/02/03: 150 GB combined
**Total recoverable**: 877+ GB
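
The 877 GB figure is the sum of the allocations listed above. The snippet below only reproduces that arithmetic; the variable names are illustrative.

```bash
#!/usr/bin/env bash
# Allocations in GB, in list order: mysql-02, mysql-03, mattermost, nocodb,
# swarm-01/02/03 (combined), kube-01/02/03 (combined).
allocations_gb=(200 200 32 100 195 150)

total_gb=0
for gb in "${allocations_gb[@]}"; do
  total_gb=$((total_gb + gb))
done

echo "Total recoverable: ${total_gb} GB"  # prints "Total recoverable: 877 GB"
```
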
**Safety features**:
- Dry-run mode by default (`dry_run: true`)
- Config backups created before deletion
- Recovery instructions provided
- Containers listed for manual approval
**Execution time**: 2-5 minutes
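
The dry-run and backup safeguards can be pictured with a small guard like the one below. This is a sketch under stated assumptions, not the playbook's implementation: `remove_container`, `DRY_RUN`, and `BACKUP_DIR` are hypothetical names, and the backup location is assumed. `/etc/pve/lxc/<vmid>.conf` is where Proxmox keeps LXC configuration, and `pct destroy` is the removal command.

```bash
#!/usr/bin/env bash
# Hypothetical dry-run guard; not the playbook's implementation.
# DRY_RUN defaults to true, mirroring the `dry_run: true` default above.
DRY_RUN="${DRY_RUN:-true}"
BACKUP_DIR="${BACKUP_DIR:-/root/lxc-config-backups}"  # assumed location

remove_container() {
  local vmid="$1"
  local conf="/etc/pve/lxc/${vmid}.conf"
  if [ "$DRY_RUN" = "true" ]; then
    # Report what would happen without changing anything.
    echo "[dry-run] would back up ${conf} to ${BACKUP_DIR}/"
    echo "[dry-run] would run: pct destroy ${vmid}"
    return 0
  fi
  # Back up the config first; destroy only if the backup succeeded.
  mkdir -p "$BACKUP_DIR" &&
    cp "$conf" "${BACKUP_DIR}/${vmid}.conf" &&
    pct destroy "$vmid"
}
```

Chaining the steps with `&&` means a failed backup stops the removal, which matches the "backups before deletion" ordering described above.
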
---
### 4. `configure-storage-monitoring.yml`
**Location**: `playbooks/configure-storage-monitoring.yml`
**Purpose**: Set up proactive storage monitoring and alerting
**Targets**:
- All Proxmox hosts (proxmox-00, 01, 02)
**Actions**:
- Create monitoring scripts:
  - `/usr/local/bin/storage-monitoring/check-capacity.sh` - Filesystem monitoring
  - `/usr/local/bin/storage-monitoring/check-docker.sh` - Docker storage
  - `/usr/local/bin/storage-monitoring/check-containers.sh` - Container allocation
  - `/usr/local/bin/storage-monitoring/cluster-status.sh` - Dashboard view
  - `/usr/local/bin/storage-monitoring/prometheus-metrics.sh` - Metrics export
- Configure cron jobs:
  - Every 5 min: Filesystem capacity checks
  - Every 10 min: Docker storage checks
  - Every 4 hours: Container allocation audit
- Set alert thresholds:
  - 75%: ALERT (notice level)
  - 85%: WARNING (warning level)
  - 95%: CRITICAL (critical level)
- Integrate with syslog:
  - Logs to `/var/log/storage-monitor.log`
  - Syslog integration for alerting
  - Log rotation configured (14-day retention)
- Optional Prometheus integration:
  - Metrics export script for Grafana/Prometheus
  - Standard format for monitoring tools
**Execution time**: 5 minutes
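
The threshold logic and the metrics export can be sketched as follows. This is an illustration only and the deployed scripts may differ: the function names and the `storage_usage_percent` metric name are assumptions, and each threshold is treated as an inclusive lower bound.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the alert thresholds; the deployed check-capacity.sh
# may differ. Each threshold is treated as an inclusive lower bound (assumption).
classify_usage() {
  local pct="$1"  # integer percentage, e.g. 84
  if   [ "$pct" -ge 95 ]; then echo "CRITICAL"
  elif [ "$pct" -ge 85 ]; then echo "WARNING"
  elif [ "$pct" -ge 75 ]; then echo "ALERT"
  else                         echo "OK"
  fi
}

# Integer usage percentage of a mount point, read from POSIX-format df output.
usage_pct() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# One sample in Prometheus text format; the metric name is an assumption.
metric_line() {
  printf 'storage_usage_percent{mountpoint="%s"} %s\n' "$1" "$2"
}
```

A check for the root filesystem could then combine them as `classify_usage "$(usage_pct /)"`, and `metric_line / "$(usage_pct /)"` would emit one sample for a scraper to pick up.
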
---
## Execution Guide
### Quick Start
```bash
# Test all playbooks (safe, shows what would be done)
ansible-playbook playbooks/remediate-storage-critical-issues.yml --check
ansible-playbook playbooks/remediate-docker-storage.yml --check
ansible-playbook playbooks/remediate-stopped-containers.yml --check
ansible-playbook playbooks/configure-storage-monitoring.yml --check
```
### Recommended Execution Order
#### Day 1: Critical Fixes
```bash
# 1. Deploy monitoring first (non-destructive)
ansible-playbook playbooks/configure-storage-monitoring.yml -l proxmox
# 2. Fix proxmox-00 root filesystem (CRITICAL)
ansible-playbook playbooks/remediate-storage-critical-issues.yml -l proxmox-00
# 3. Fix proxmox-01 Docker storage (HIGH)
ansible-playbook playbooks/remediate-docker-storage.yml -l proxmox-01
# Expected time: 30 minutes
# Expected space freed: 60-165 GB (10-15 GB on proxmox-00 plus 50-150 GB on proxmox-01)
```
#### Day 2-3: Verify & Monitor
```bash
# Verify fixes are working
/usr/local/bin/storage-monitoring/cluster-status.sh
# Monitor alerts
tail -f /var/log/storage-monitor.log
# Check for issues (48 hours)
ansible proxmox -m shell -a "df -h /" -u dlxadmin
```
#### Day 4+: Container Cleanup (Optional)
```bash
# After confirming stability, remove unused containers
# Verify first
ansible-playbook playbooks/remediate-stopped-containers.yml \
  --check
# Execute removal; JSON extra-vars pass dry_run as a real boolean, not the string "false"
ansible-playbook playbooks/remediate-stopped-containers.yml \
  -e '{"dry_run": false}'
# Expected space freed: 877+ GB
# Execution time: 2-5 minutes
```
---
## Documentation
Three supporting documents have been created:
1. **STORAGE-AUDIT.md**
   - Comprehensive storage analysis
   - Hardware inventory
   - Capacity utilization breakdown
   - Issues and recommendations
2. **STORAGE-REMEDIATION-GUIDE.md**
   - Step-by-step execution guide
   - Timeline and milestones
   - Rollback procedures
   - Monitoring and validation
   - Troubleshooting guide
3. **REMEDIATION-SUMMARY.md** (this file)
   - Quick reference overview
   - Playbook descriptions
   - Expected results
---
## Expected Results
### Capacity Goals
| Host | Issue | Current | Target | Playbook | Expected Result |
|------|-------|---------|--------|----------|-----------------|
| proxmox-00 | Root FS | 84.5% | <70% | remediate-storage-critical-issues.yml | Frees 10-15 GB |
| proxmox-01 | dlx-docker | 81.1% | <75% | remediate-docker-storage.yml | Frees 50-150 GB |
| proxmox-01 | SonarQube | 354 GB | Archive | remediate-storage-critical-issues.yml | Audit only |
| All | Unused containers | 1.2 TB | Remove | remediate-stopped-containers.yml | Frees 877+ GB |

**Total Space Freed**: approximately 0.9-1.0 TB (937-1,042 GB across the rows above)
### Automation Setup
- Automatic Docker cleanup: Weekly
- Continuous monitoring: Every 5-10 minutes
- Alert integration: Syslog, systemd journal
- Metrics export: Prometheus compatible
- Log rotation: 14-day retention
### Long-term Benefits
1. **Prevents future issues**: Automated cleanup prevents regrowth
2. **Early detection**: Monitoring alerts at 75%, 85%, 95% thresholds
3. **Operational insights**: Container allocation tracking
4. **Integration ready**: Prometheus/Grafana compatible
5. **Maintenance automation**: Weekly scheduled cleanups
---
## Key Features
### Safety First
- Dry-run mode for all destructive operations
- Configuration backups before removal
- Rollback procedures documented
- Multi-phase execution with verification
### Automation
- Cron-based scheduling
- Monitoring and alerting
- Log rotation and archival
- Prometheus metrics export
### Operability
- Clear execution steps
- Expected results documented
- Troubleshooting guide
- Dashboard commands for status
---
## Files Summary
```
playbooks/
├── remediate-storage-critical-issues.yml (205 lines)
├── remediate-docker-storage.yml (310 lines)
├── remediate-stopped-containers.yml (380 lines)
└── configure-storage-monitoring.yml (330 lines)
docs/
├── STORAGE-AUDIT.md (550 lines)
├── STORAGE-REMEDIATION-GUIDE.md (480 lines)
└── REMEDIATION-SUMMARY.md (this file)
```
Total: **2,255 lines** of playbooks and documentation (excluding this file)

---
## Next Steps
1. **Review** the playbooks and documentation
2. **Test** with `--check` flag on a non-critical host
3. **Execute** in recommended order (Day 1, Days 2-3, Day 4+)
4. **Monitor** using provided tools and scripts
5. **Schedule** for monthly execution
---
## Support & Maintenance
### Monitoring Commands
```bash
# Quick status
/usr/local/bin/storage-monitoring/cluster-status.sh
# View alerts
tail -f /var/log/storage-monitor.log
# Docker status
docker system df
# Container status
pct list
```
### Regular Maintenance
- **Daily**: Review monitoring logs
- **Weekly**: Execute playbooks in check mode
- **Monthly**: Run full storage audit
- **Quarterly**: Archive monitoring data
### Scheduled Audits
- Next scheduled audit: 2026-03-08
- Quarterly reviews recommended
- Document changes in git
---
## Issues Addressed
**proxmox-00 root filesystem** (84.5%)
- Compressed journal logs
- Cleaned syslog files
- Cleared apt cache
**proxmox-01 dlx-docker** (81.1%)
- Removed dangling images
- Purged unused volumes
- Cleared build cache
- Automated weekly cleanup
**Unused containers** (1.2 TB)
- Safe removal with backups
- Recovery procedures documented
- 877+ GB recoverable
**Monitoring gaps**
- Continuous capacity tracking
- Alert thresholds configured
- Integration with syslog/Prometheus
---
## Conclusion
Comprehensive remediation playbooks have been created to address all identified storage issues. The playbooks are:
- **Safe**: Dry-run modes, backups, and rollback procedures
- **Automated**: Scheduling and monitoring included
- **Documented**: Complete guides and references provided
- **Operational**: Dashboard commands and status checks included

They are ready for deployment and are expected to improve cluster capacity immediately and operational stability over the long term.