dlx-claude/memory/dlx-ansible/MEMORY.md

# Project Memory: dlx-ansible
## Infrastructure Overview
- **NPM Server**: nginx (192.168.200.71) - Nginx Proxy Manager for SSL termination
- **Application Servers**: hiveops (192.168.200.112), smartjournal (192.168.200.114)
- **CI/CD Server**: jenkins (192.168.200.91) - Jenkins + SonarQube
- All servers use `dlxadmin` user with passwordless sudo
## Critical Learnings
### SSL Certificate Offloading with Nginx Proxy Manager
**Problem**: Spring Boot applications behind NPM experience redirect loops when accessed via HTTPS.
**Root Cause**: Spring Boot doesn't trust `X-Forwarded-*` headers by default. When NPM terminates SSL and forwards plain HTTP to the backend, Spring sees an HTTP request and redirects it to HTTPS, creating an infinite loop.
**Solution**: Configure Spring Boot to trust forwarded headers:
```yaml
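environment:
  # Current property (Spring Boot 2.2+)
  SERVER_FORWARD_HEADERS_STRATEGY: native
  # Legacy property, deprecated since Spring Boot 2.2; quoted so YAML keeps it a string
  SERVER_USE_FORWARD_HEADERS: "true"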
```
**Key Points**:
- Containers must be **recreated** (not restarted) for env vars to take effect
- Verify with: `curl -I -H 'X-Forwarded-Proto: https' http://localhost:8080/`
- Success indicator: `Strict-Transport-Security` header in response
- Documentation: `docs/SSL-OFFLOADING-FIX.md`
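
The verification step above can be scripted. The following is a minimal sketch, assuming the response headers are piped in from `curl -sI`; `has_hsts` is a hypothetical helper name, not something defined in the repository.

```bash
# has_hsts (hypothetical helper): reads HTTP response headers on stdin and
# succeeds only if a Strict-Transport-Security header is present.
has_hsts() {
  # Header names are case-insensitive, so match without regard to case.
  grep -qi '^strict-transport-security:'
}
```

Possible usage on the application server: `curl -sI -H 'X-Forwarded-Proto: https' http://localhost:8080/ | has_hsts && echo "forwarded headers are trusted"`.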
### Docker Compose Best Practices
**Environment Variable Loading**:
- Use the `--env-file` flag when `.env` is not in the same directory as the compose file
- Example: `docker compose -f docker/docker-compose.yml --env-file .env up -d`
**Container Updates**:
- Restart: Keeps existing container, doesn't apply env changes
- Recreate: Removes old container, creates new one with latest env/config
- Always recreate when changing environment variables
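
As a rough sketch of the recreate workflow, the helper below wraps `docker compose up -d --force-recreate` together with `--env-file`. The function name `recreate_service`, the `DRY_RUN` switch, and the example arguments are illustrative assumptions, not part of the repository.

```bash
# recreate_service (hypothetical helper): recreate one compose service so that
# changed environment variables are applied.
# Arguments: <compose file> <env file> <service name>
# Set DRY_RUN=1 to print the command instead of running it.
recreate_service() {
  compose_file=$1 env_file=$2 service=$3
  # Rebuild the positional parameters as the full docker command line.
  set -- docker compose -f "$compose_file" --env-file "$env_file" \
    up -d --force-recreate "$service"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

For example, `DRY_RUN=1 recreate_service docker/docker-compose.yml .env mgmt` prints the command for review before it is run for real.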
### HiveOps Application Structure
**Main Deployment** (`/opt/hiveops-deploy/`):
- Full microservices stack
- Services: incident-backend, incident-frontend, mgmt, remote
- Managed via docker-compose
**Standalone Deployment** (`/home/hiveops/`):
- Simplified incident management system
- Separate from main deployment
- Used for direct hiveops.directlx.dev access
### Jenkins Firewall Blocking (2026-02-09)
**Problem**: Jenkins and SonarQube were unreachable from network.
**Root Cause**: The server had no host_vars file, so it inherited the default firewall config (SSH only).
**Solution**: Created `host_vars/jenkins.yml` with ports 22, 8080 (Jenkins), 9000 (SonarQube).
**Quick Fix**:
```bash
ansible jenkins -m community.general.ufw -a "rule=allow port=8080 proto=tcp" -b
ansible jenkins -m community.general.ufw -a "rule=allow port=9000 proto=tcp" -b
ansible jenkins -m shell -a "docker start postgresql sonarqube" -b
```
**Key Points**:
- Jenkins runs as a Java system service (not Docker) on port 8080
- SonarQube runs in Docker with PostgreSQL backend
- Always create host_vars file for servers with specific firewall needs
- Documentation: `docs/JENKINS-CONNECTIVITY-FIX.md`
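
To confirm that the expected ports are open after a firewall change, the output of `ufw status` can be checked. This is a sketch only: `missing_ufw_ports` is a hypothetical helper, and it assumes the plain `ufw status` table layout (port, action, source).

```bash
# missing_ufw_ports (hypothetical helper): reads `ufw status` output on stdin
# and prints every requested port that has no ALLOW rule.
# Arguments: one or more port numbers.
missing_ufw_ports() {
  status=$(cat)
  for port in "$@"; do
    # Match lines such as "8080/tcp    ALLOW    Anywhere" or "8080    ALLOW ...".
    printf '%s\n' "$status" |
      grep -Eq "^${port}(/tcp)?[[:space:]]+ALLOW" || echo "$port"
  done
}
```

Possible usage from the control node: `ansible jenkins -m ansible.builtin.command -a "ufw status" -b | missing_ufw_ports 22 8080 9000`. Empty output means all listed ports have an allow rule.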
## File Locations
### Host Variables
- `/source/dlx-src/dlx-ansible/host_vars/npm.yml` - NPM firewall config
- `/source/dlx-src/dlx-ansible/host_vars/hiveops.yml` - HiveOps settings
- `/source/dlx-src/dlx-ansible/host_vars/smartjournal.yml` - SmartJournal settings
- `/source/dlx-src/dlx-ansible/host_vars/jenkins.yml` - Jenkins/SonarQube firewall config
### Playbooks Created
- `playbooks/fix-hiveops-ssl-offload.yml` - SSL offload fix automation
- `playbooks/fix-hiveops-compose-indentation.yml` - Compose file corrections
- `playbooks/fix-hiveops-mgmt-ssl.yml` - Management service SSL fix
### Templates
- `templates/hiveops-docker-compose.prod.yml.j2` - Corrected compose template
## Storage Remediation (2026-02-08)
**Critical Issues Identified**:
1. proxmox-00 root FS: 84.5% full (CRITICAL)
2. proxmox-01 dlx-docker: 81.1% full (HIGH)
3. Unused containers: 1.2 TB allocated
4. SonarQube: 354 GB (82% of allocation)
**Remediation Playbooks Created**:
- `remediate-storage-critical-issues.yml`: Log cleanup, Docker prune, audits
- `remediate-docker-storage.yml`: Deep Docker cleanup + automation
- `remediate-stopped-containers.yml`: Safe container removal with backups
- `configure-storage-monitoring.yml`: Proactive monitoring (5/10 min checks)
**Documentation**:
- `STORAGE-AUDIT.md`: Full hardware/storage analysis (550 lines)
- `STORAGE-REMEDIATION-GUIDE.md`: Step-by-step execution (480 lines)
- `REMEDIATION-SUMMARY.md`: Quick reference (300 lines)
**Expected Results**:
- Total space freed: 1-2 TB
- proxmox-00: 84.5% → 70% (10-15 GB freed)
- proxmox-01: 81.1% → 70% (50-150 GB freed)
- Automation prevents regrowth (weekly prune + hourly monitoring)
**Commit**: 90ed5c1
## Common Tasks
### Fix SSL Offloading for Spring Boot Service
1. Add env vars to .env: `SERVER_FORWARD_HEADERS_STRATEGY=native`, `SERVER_USE_FORWARD_HEADERS=true`
2. Add to docker-compose environment section
3. Recreate container: `docker stop <name> && docker rm <name> && docker compose up -d <service>`
4. Verify: Check for `Strict-Transport-Security` header
### Apply Firewall Configuration
- Firewall is managed by common role (roles/common/tasks/security.yml)
- Controlled per-host via `common_firewall_enabled` and `common_firewall_allowed_ports`
- Some hosts (docker, hiveops, smartjournal) have firewall disabled for Docker networking
### Run Storage Remediation
1. Test with `--check`: `ansible-playbook playbooks/remediate-storage-critical-issues.yml --check`
2. Deploy monitoring: `ansible-playbook playbooks/configure-storage-monitoring.yml -l proxmox`
3. Fix proxmox-00: `ansible-playbook playbooks/remediate-storage-critical-issues.yml -l proxmox-00`
4. Fix proxmox-01: `ansible-playbook playbooks/remediate-docker-storage.yml -l proxmox-01`
5. Monitor: `tail -f /var/log/storage-monitor.log`
6. Remove containers (optional): `ansible-playbook playbooks/remediate-stopped-containers.yml -e dry_run=false`
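
Before and after running the playbooks, a quick manual usage check can show whether a host still exceeds a threshold. The sketch below is not taken from the monitoring playbook; `over_threshold` is a hypothetical helper and the 80% limit in the usage line is an assumed value.

```bash
# over_threshold (hypothetical helper): reads `df -P` output on stdin and prints
# "<mount point> <use%>" for each filesystem at or above the given percentage.
# Argument: threshold as an integer percentage.
over_threshold() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)              # strip the trailing percent sign
    if (use + 0 >= limit + 0) print $6, use "%"
  }'
}
```

Possible usage on a Proxmox host: `df -P | over_threshold 80`. Mount points containing spaces are not handled by this sketch.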
## Security Notes
- Only trust forwarded headers when backend is not internet-accessible
- The NPM server (192.168.200.71) should be the only server that can reach backend ports
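- Backend ports should not be published on all interfaces. Where the proxy runs on the same host as the backend, bind to localhost only: `127.0.0.1:8080:8080`. Where NPM runs on a separate host, a localhost-only binding is unreachable from NPM, so bind to the backend's LAN address and restrict access to 192.168.200.71 instead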
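
To audit which containers publish ports on every interface, the port column of `docker ps` can be filtered. This is a minimal sketch; `public_bindings` is a hypothetical helper, and it assumes input produced by `docker ps --format '{{.Names}} {{.Ports}}'`.

```bash
# public_bindings (hypothetical helper): reads "<name> <ports>" lines on stdin
# and prints those that publish a port on all interfaces
# (0.0.0.0 for IPv4, ::: or [::] for IPv6).
public_bindings() {
  # `|| true` keeps the exit status at 0 when nothing matches.
  grep -E '(0\.0\.0\.0:|:::|\[::\]:)[0-9-]+->' || true
}
```

Possible usage on an application server: `docker ps --format '{{.Names}} {{.Ports}}' | public_bindings`. Empty output means no container is published on all interfaces.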