# Project Memory: dlx-ansible

## Infrastructure Overview

- **NPM Server**: nginx (192.168.200.71) - Nginx Proxy Manager for SSL termination
- **Application Servers**: hiveops (192.168.200.112), smartjournal (192.168.200.114)
- **CI/CD Server**: jenkins (192.168.200.91) - Jenkins + SonarQube
- All servers use `dlxadmin` user with passwordless sudo

## Critical Learnings

### SSL Certificate Offloading with Nginx Proxy Manager

**Problem**: Spring Boot applications behind NPM experience redirect loops when accessed via HTTPS.

**Root Cause**: Spring Boot doesn't trust `X-Forwarded-*` headers by default. When NPM terminates SSL and forwards plain HTTP to the backend, Spring sees an HTTP request and redirects to HTTPS, creating an infinite loop.

**Solution**: Configure Spring Boot to trust forwarded headers:
```yaml
environment:
  # Spring Boot 2.2+: maps to server.forward-headers-strategy
  SERVER_FORWARD_HEADERS_STRATEGY: native
  # Spring Boot before 2.2: maps to server.use-forward-headers
  SERVER_USE_FORWARD_HEADERS: "true"
```

**Key Points**:
- Containers must be **recreated** (not restarted) for env vars to take effect
- Verify with: `curl -I -H 'X-Forwarded-Proto: https' http://localhost:8080/`
- Success indicator: `Strict-Transport-Security` header in response
- Documentation: `docs/SSL-OFFLOADING-FIX.md`

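The verification step above can be scripted. The following is a minimal sketch, not part of the repository: `has_hsts` is a hypothetical helper name, and it only inspects whatever response headers are piped into it.

```bash
# has_hsts: read HTTP response headers on stdin and succeed only if a
# Strict-Transport-Security header is present.
# -i because header names are case-insensitive, -q because only the exit
# status matters.
has_hsts() {
  grep -qi '^strict-transport-security:'
}
```

Usage, assuming the service listens on port 8080 as in the note above: `curl -sI -H 'X-Forwarded-Proto: https' http://localhost:8080/ | has_hsts && echo "forwarded headers are trusted"`.
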
### Docker Compose Best Practices

**Environment Variable Loading**:
- Use the `--env-file` flag when `.env` is not in the same directory as the compose file
- Example: `docker compose -f docker/docker-compose.yml --env-file .env up -d`

**Container Updates**:
- Restart: Keeps the existing container and does not apply env changes
- Recreate: Removes the old container and creates a new one with the latest env/config
- Always recreate when changing environment variables

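As a sketch of the recreate workflow, the wrapper below builds a `docker compose up -d --force-recreate` command for one service. The function name `recreate_service` and the variables `COMPOSE_FILE_PATH`, `ENV_FILE_PATH`, and `DRY_RUN` are hypothetical names chosen for this example. The default paths are the ones from the example above.

```bash
# recreate_service SERVICE
# Recreates a single compose service so that changed env vars take effect.
# COMPOSE_FILE_PATH, ENV_FILE_PATH and DRY_RUN are hypothetical knobs for
# this sketch; with DRY_RUN=1 the command is printed instead of executed.
recreate_service() {
  service="$1"
  set -- docker compose \
    -f "${COMPOSE_FILE_PATH:-docker/docker-compose.yml}" \
    --env-file "${ENV_FILE_PATH:-.env}" \
    up -d --force-recreate "$service"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}
```

Running `DRY_RUN=1 recreate_service api` first shows the exact command before anything is touched. `api` is only a placeholder service name.
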
### HiveOps Application Structure

**Main Deployment** (`/opt/hiveops-deploy/`):
- Full microservices stack
- Services: incident-backend, incident-frontend, mgmt, remote
- Managed via docker-compose

**Standalone Deployment** (`/home/hiveops/`):
- Simplified incident management system
- Separate from main deployment
- Used for direct hiveops.directlx.dev access

### Jenkins Firewall Blocking (2026-02-09)

**Problem**: Jenkins and SonarQube were unreachable from the network.

**Root Cause**: The server had no host_vars file, so it inherited the default firewall config (SSH only).

**Solution**: Created `host_vars/jenkins.yml` with ports 22, 8080 (Jenkins), 9000 (SonarQube).

**Quick Fix**:
```bash
ansible jenkins -m community.general.ufw -a "rule=allow port=8080 proto=tcp" -b
ansible jenkins -m community.general.ufw -a "rule=allow port=9000 proto=tcp" -b
ansible jenkins -m shell -a "docker start postgresql sonarqube" -b
```

**Key Points**:
- Jenkins runs as Java system service (not Docker) on port 8080
- SonarQube runs in Docker with PostgreSQL backend
- Always create host_vars file for servers with specific firewall needs
- Documentation: `docs/JENKINS-CONNECTIVITY-FIX.md`

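To confirm that the rules from the quick fix are in place, the output of `ufw status` can be checked for each port. This is a minimal sketch: `port_allowed` is a hypothetical helper, and it assumes the usual `PORT/tcp  ALLOW  ...` row layout of `ufw status`.

```bash
# port_allowed PORT
# Reads `ufw status` output on stdin and succeeds if PORT/tcp has an ALLOW rule.
port_allowed() {
  awk -v want="$1/tcp" '
    $1 == want && $2 == "ALLOW" { found = 1 }
    END { exit !found }
  '
}
```

Usage: `ansible jenkins -b -m command -a "ufw status" | port_allowed 8080 && echo "8080 open"`, and the same for port 9000.
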
## File Locations

### Host Variables
- `/source/dlx-src/dlx-ansible/host_vars/npm.yml` - NPM firewall config
- `/source/dlx-src/dlx-ansible/host_vars/smartjournal.yml` - SmartJournal settings
- `/source/dlx-src/dlx-ansible/host_vars/jenkins.yml` - Jenkins/SonarQube firewall config

## Storage Remediation (2026-02-08)

**Critical Issues Identified**:
1. proxmox-00 root FS: 84.5% full (CRITICAL)
2. proxmox-01 dlx-docker: 81.1% full (HIGH)
3. Unused containers: 1.2 TB allocated
4. SonarQube: 354 GB (82% of allocation)

**Remediation Playbooks Created**:
- `remediate-storage-critical-issues.yml`: Log cleanup, Docker prune, audits
- `remediate-docker-storage.yml`: Deep Docker cleanup + automation
- `remediate-stopped-containers.yml`: Safe container removal with backups
- `configure-storage-monitoring.yml`: Proactive monitoring (5/10 min checks)

**Documentation**:
- `STORAGE-AUDIT.md`: Full hardware/storage analysis (550 lines)
- `STORAGE-REMEDIATION-GUIDE.md`: Step-by-step execution (480 lines)
- `REMEDIATION-SUMMARY.md`: Quick reference (300 lines)

**Expected Results**:
- Total space freed: 1-2 TB
- proxmox-00: 84.5% → 70% (10-15 GB freed)
- proxmox-01: 81.1% → 70% (50-150 GB freed)
- Automation prevents regrowth (weekly prune + hourly monitoring)

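For a quick manual check of the usage figures above, `df -P` output can be filtered against a threshold. This is only a sketch and is not taken from the monitoring playbook: `over_threshold` is a hypothetical helper, the 80% limit in the usage line is an assumed value, and mount points containing spaces are not handled.

```bash
# over_threshold LIMIT
# Reads `df -P` output on stdin and prints "<mount point> <use%>" for every
# filesystem whose usage is at or above LIMIT percent.
over_threshold() {
  awk -v limit="$1" '
    NR > 1 {
      pct = $5
      sub(/%/, "", pct)          # strip the percent sign before comparing
      if (pct + 0 >= limit) print $6, $5
    }
  '
}
```

Usage: `df -P | over_threshold 80`.
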
**Commit**: 90ed5c1

## Common Tasks

### Fix SSL Offloading for Spring Boot Service
1. Add env vars to `.env`: `SERVER_FORWARD_HEADERS_STRATEGY=native`, `SERVER_USE_FORWARD_HEADERS=true`
2. Add them to the docker-compose environment section
3. Recreate container: `docker stop <name> && docker rm <name> && docker compose up -d <service>`
4. Verify: Check for `Strict-Transport-Security` header

### Apply Firewall Configuration
- Firewall is managed by the common role (`roles/common/tasks/security.yml`)
- Controlled per-host via `common_firewall_enabled` and `common_firewall_allowed_ports`
- Some hosts (docker, hiveops, smartjournal) have firewall disabled for Docker networking

### Run Storage Remediation
1. Test with `--check`: `ansible-playbook playbooks/remediate-storage-critical-issues.yml --check`
2. Deploy monitoring: `ansible-playbook playbooks/configure-storage-monitoring.yml -l proxmox`
3. Fix proxmox-00: `ansible-playbook playbooks/remediate-storage-critical-issues.yml -l proxmox-00`
4. Fix proxmox-01: `ansible-playbook playbooks/remediate-docker-storage.yml -l proxmox-01`
5. Monitor: `tail -f /var/log/storage-monitor.log`
6. Remove containers (optional): `ansible-playbook playbooks/remediate-stopped-containers.yml -e dry_run=false`

## Kubernetes Cluster Setup (2026-02-09)

**Problem**: Attempted to install K3s on LXC containers; the install failed due to kernel module limitations.

**Root Cause**: LXC containers share the host kernel and cannot load required modules (br_netfilter, overlay).

**Solution**: Delete LXC containers, create proper QEMU/KVM VMs with Ubuntu 24.04 LTS.

**Cluster Design**:
- 3-node HA cluster with embedded etcd
- All nodes as control plane servers
- K3s v1.31.4+k3s1
- IPs: 192.168.200.215/216/217
- 4GB RAM, 4 CPU cores, 50GB disk per node

**Key Learnings**:
- LXC containers NOT suitable for Kubernetes
- Always verify: `systemd-detect-virt` should return "kvm" not "lxc"
- Use Ubuntu LTS releases (24.04) not interim releases (24.10)
- Interim releases have only 9 months support
- Ubuntu 24.10 is EOL (July 2025), repositories archived

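The virtualization check from the learnings above can be wrapped in a small guard to run on a node before installing K3s. This is a minimal sketch: `require_kvm` is a hypothetical helper, and the optional argument exists only so the logic can be exercised without calling `systemd-detect-virt`.

```bash
# require_kvm [VIRT]
# Succeeds only when the detected virtualization type is "kvm".
# VIRT defaults to the output of `systemd-detect-virt`.
require_kvm() {
  virt="${1:-$(systemd-detect-virt 2>/dev/null)}"
  if [ "$virt" = "kvm" ]; then
    echo "ok: running under kvm"
  else
    echo "expected kvm, detected '${virt:-unknown}'" >&2
    return 1
  fi
}
```

Usage on a node: `require_kvm || exit 1`.
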
**Files Created**:
- `playbooks/install-k3s-cluster.yml` - HA K3s installation
- `host_vars/dlx-kube-{01,02,03}.yml` - Firewall configs
- `docs/K3S-INSTALLATION-GUIDE.md` - Complete guide
- `docs/PROXMOX-VM-SETUP-FOR-K3S.md` - VM creation guide
- `docs/SESSION-PLAN-K3S-DEPLOYMENT.md` - Next session plan
- `scripts/create-k3s-vms.sh` - VM creation automation

**Next Steps**: User creates VMs, then run K3s installation playbook.

## SmartJournal Kafka Fix (2026-02-20)

**Problem**: `sj_api` logs `localhost/127.0.0.1:9092` warnings on startup and takes ~60s to start.

**Root Causes**:
1. `kafkaservice=kafka:9092` used the external listener: Kafka advertises `192.168.200.114:9092` back to containers, resolving to localhost
2. Spring Boot `dev` profile hardcodes `localhost:9092` for the admin client: the `KAFKASERVICE` env var only overrides producer/consumer, not the admin client

**Fix**:
- `.env`: `kafkaservice=kafka:29092` (use internal PLAINTEXT listener)
- `docker-compose-prod.yaml` api service: add `SPRING_KAFKA_BOOTSTRAP_SERVERS=${kafkaservice}`

**Result**: No warnings, startup ~20s instead of ~60s

**Also fixed**:
- Typo `mfa_enabled=fasle` → `false` in `.env` (caused boolean parse crash)
- Duplicate hyphenated env vars `${saml-mapper-graph-proxy-port}`: the shell treats `${name-default}` as default-value syntax, so the text after the first hyphen is passed as a literal string instead of the variable's value

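The hyphen behaviour is easy to reproduce in a plain shell. In this sketch the underscore variable name and the value `8443` are hypothetical, used only to contrast the two forms.

```bash
# ${saml-mapper-graph-proxy-port} parses as ${NAME-default}: NAME is `saml`,
# and everything after the first hyphen is the fallback used when `saml` is unset.
unset saml
hyphenated="${saml-mapper-graph-proxy-port}"
echo "$hyphenated"    # prints: mapper-graph-proxy-port

# An underscore name is a valid identifier and expands to its value.
saml_mapper_graph_proxy_port=8443    # hypothetical value
echo "${saml_mapper_graph_proxy_port}"    # prints: 8443
```
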
**Documentation**: `docs/KAFKA-LOCALHOST-FIX.md`

## Security Notes
- Only trust forwarded headers when the backend is not internet-accessible
- NPM server (192.168.200.71) should be the only server that can reach backend ports
- Backend ports should bind to localhost only: `127.0.0.1:8080:8080`
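
One way to spot a backend port that is bound more widely than intended is to filter the listening sockets reported by `ss -ltn`. This is a minimal sketch: `exposed_listeners` is a hypothetical helper, and it assumes the usual `ss` column layout with the local address in the fourth column.

```bash
# exposed_listeners PORT
# Reads `ss -ltn` output on stdin and prints every listener on PORT that is
# not bound to a loopback address (127.0.0.1 or [::1]).
exposed_listeners() {
  awk -v port="$1" '
    $1 == "LISTEN" && $4 ~ (":" port "$") && $4 !~ /^(127\.0\.0\.1|\[::1\]):/ { print $4 }
  '
}
```

Usage: `ss -ltn | exposed_listeners 8080`. Empty output means the port is only listening on loopback.
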