# Project Memory: dlx-ansible

## Infrastructure Overview

- **NPM Server**: nginx (192.168.200.71) - Nginx Proxy Manager for SSL termination
- **Application Servers**: hiveops (192.168.200.112), smartjournal (192.168.200.114)
- **CI/CD Server**: jenkins (192.168.200.91) - Jenkins + SonarQube
- All servers use `dlxadmin` user with passwordless sudo

## Critical Learnings

### SSL Certificate Offloading with Nginx Proxy Manager

**Problem**: Spring Boot applications behind NPM experience redirect loops when accessed via HTTPS.

**Root Cause**: Spring Boot doesn't trust `X-Forwarded-*` headers by default. When NPM terminates SSL and forwards HTTP to backend, Spring sees HTTP and redirects to HTTPS, creating infinite loop.

**Solution**: Configure Spring Boot to trust forwarded headers:

```yaml
environment:
  SERVER_FORWARD_HEADERS_STRATEGY: native
  SERVER_USE_FORWARD_HEADERS: true
```

**Key Points**:
- Containers must be **recreated** (not restarted) for env vars to take effect
- Verify with: `curl -I -H 'X-Forwarded-Proto: https' http://localhost:8080/`
- Success indicator: `Strict-Transport-Security` header in response
- Documentation: `docs/SSL-OFFLOADING-FIX.md`

### Docker Compose Best Practices

**Environment Variable Loading**:
- Use `--env-file` flag when .env is not in same directory as compose file
- Example: `docker compose -f docker/docker-compose.yml --env-file .env up -d`

**Container Updates**:
- Restart: Keeps existing container, doesn't apply env changes
- Recreate: Removes old container, creates new one with latest env/config
- Always recreate when changing environment variables

### HiveOps Application Structure

**Main Deployment** (`/opt/hiveops-deploy/`):
- Full microservices stack
- Services: incident-backend, incident-frontend, mgmt, remote
- Managed via docker-compose

**Standalone Deployment** (`/home/hiveops/`):
- Simplified incident management system
- Separate from main deployment
- Used for direct hiveops.directlx.dev access

### Jenkins Firewall Blocking (2026-02-09)

**Problem**: Jenkins and SonarQube were unreachable from network.

**Root Cause**: Server had no host_vars file, inherited default firewall config (SSH only).

**Solution**: Created `host_vars/jenkins.yml` with ports 22, 8080 (Jenkins), 9000 (SonarQube).

**Quick Fix**:

```bash
ansible jenkins -m community.general.ufw -a "rule=allow port=8080 proto=tcp" -b
ansible jenkins -m community.general.ufw -a "rule=allow port=9000 proto=tcp" -b
ansible jenkins -m shell -a "docker start postgresql sonarqube" -b
```

**Key Points**:
- Jenkins runs as Java system service (not Docker) on port 8080
- SonarQube runs in Docker with PostgreSQL backend
- Always create host_vars file for servers with specific firewall needs
- Documentation: `docs/JENKINS-CONNECTIVITY-FIX.md`

## File Locations

### Host Variables

- `/source/dlx-src/dlx-ansible/host_vars/npm.yml` - NPM firewall config
- `/source/dlx-src/dlx-ansible/host_vars/smartjournal.yml` - SmartJournal settings
- `/source/dlx-src/dlx-ansible/host_vars/jenkins.yml` - Jenkins/SonarQube firewall config

## Storage Remediation (2026-02-08)

**Critical Issues Identified**:
1. proxmox-00 root FS: 84.5% full (CRITICAL)
2. proxmox-01 dlx-docker: 81.1% full (HIGH)
3. Unused containers: 1.2 TB allocated
4. SonarQube: 354 GB (82% of allocation)

**Remediation Playbooks Created**:
- `remediate-storage-critical-issues.yml`: Log cleanup, Docker prune, audits
- `remediate-docker-storage.yml`: Deep Docker cleanup + automation
- `remediate-stopped-containers.yml`: Safe container removal with backups
- `configure-storage-monitoring.yml`: Proactive monitoring (5/10 min checks)

**Documentation**:
- `STORAGE-AUDIT.md`: Full hardware/storage analysis (550 lines)
- `STORAGE-REMEDIATION-GUIDE.md`: Step-by-step execution (480 lines)
- `REMEDIATION-SUMMARY.md`: Quick reference (300 lines)

**Expected Results**:
- Total space freed: 1-2 TB
- proxmox-00: 84.5% → 70% (10-15 GB freed)
- proxmox-01: 81.1% → 70% (50-150 GB freed)
- Automation prevents regrowth (weekly prune + hourly monitoring)

**Commit**: 90ed5c1

## Common Tasks

### Fix SSL Offloading for Spring Boot Service

1. Add env vars to .env: `SERVER_FORWARD_HEADERS_STRATEGY=native`, `SERVER_USE_FORWARD_HEADERS=true`
2. Add to docker-compose environment section
3. Recreate container: `docker stop && docker rm && docker compose up -d`
4. Verify: Check for `Strict-Transport-Security` header

### Apply Firewall Configuration

- Firewall is managed by common role (roles/common/tasks/security.yml)
- Controlled per-host via `common_firewall_enabled` and `common_firewall_allowed_ports`
- Some hosts (docker, hiveops, smartjournal) have firewall disabled for Docker networking

### Run Storage Remediation

1. Test with `--check`: `ansible-playbook playbooks/remediate-storage-critical-issues.yml --check`
2. Deploy monitoring: `ansible-playbook playbooks/configure-storage-monitoring.yml -l proxmox`
3. Fix proxmox-00: `ansible-playbook playbooks/remediate-storage-critical-issues.yml -l proxmox-00`
4. Fix proxmox-01: `ansible-playbook playbooks/remediate-docker-storage.yml -l proxmox-01`
5. Monitor: `tail -f /var/log/storage-monitor.log`
6. Remove containers (optional): `ansible-playbook playbooks/remediate-stopped-containers.yml -e dry_run=false`

## Kubernetes Cluster Setup (2026-02-09)

**Problem**: Attempted to install K3s on LXC containers - failed due to kernel module limitations.

**Root Cause**: LXC containers share host kernel and cannot load required modules (br_netfilter, overlay).

**Solution**: Delete LXC containers, create proper QEMU/KVM VMs with Ubuntu 24.04 LTS.

**Cluster Design**:
- 3-node HA cluster with embedded etcd
- All nodes as control plane servers
- K3s v1.31.4+k3s1
- IPs: 192.168.200.215/216/217
- 4GB RAM, 4 CPU cores, 50GB disk per node

**Key Learnings**:
- LXC containers NOT suitable for Kubernetes
- Always verify: `systemd-detect-virt` should return "kvm" not "lxc"
- Use Ubuntu LTS releases (24.04) not interim releases (24.10)
- Interim releases have only 9 months support
- Ubuntu 24.10 is EOL (July 2025), repositories archived

**Files Created**:
- `playbooks/install-k3s-cluster.yml` - HA K3s installation
- `host_vars/dlx-kube-{01,02,03}.yml` - Firewall configs
- `docs/K3S-INSTALLATION-GUIDE.md` - Complete guide
- `docs/PROXMOX-VM-SETUP-FOR-K3S.md` - VM creation guide
- `docs/SESSION-PLAN-K3S-DEPLOYMENT.md` - Next session plan
- `scripts/create-k3s-vms.sh` - VM creation automation

**Next Steps**: User creates VMs, then run K3s installation playbook.

## SmartJournal Kafka Fix (2026-02-20)

**Problem**: `sj_api` logs `localhost/127.0.0.1:9092` warnings on startup, takes ~60s to start.

**Root Causes**:
1. `kafkaservice=kafka:9092` used the external listener; Kafka advertises `192.168.200.114:9092` back to containers, resolving to localhost
2. Spring Boot `dev` profile hardcodes `localhost:9092` for admin client; `KAFKASERVICE` env var only overrides producer/consumer, not admin client

**Fix**:
- `.env`: `kafkaservice=kafka:29092` (use internal PLAINTEXT listener)
- `docker-compose-prod.yaml` api service: add `SPRING_KAFKA_BOOTSTRAP_SERVERS=${kafkaservice}`

**Result**: No warnings, startup ~20s instead of ~60s

**Also fixed**:
- Typo `mfa_enabled=fasle` → `false` in .env (caused boolean parse crash)
- Duplicate hyphenated env vars `${saml-mapper-graph-proxy-port}`: shell treats hyphens as default syntax, passes literal string instead of value

**Documentation**: `docs/KAFKA-LOCALHOST-FIX.md`

## Security Notes

- Only trust forwarded headers when backend is not internet-accessible
- NPM server (192.168.200.71) should be only server that can reach backend ports
- Backend ports should bind to localhost only: `127.0.0.1:8080:8080`
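
The loopback binding in the last point could look like this in a compose file. This is a minimal sketch: the service name `app` is hypothetical and not taken from any deployment described here; only the `127.0.0.1:8080:8080` mapping comes from the note above.

```yaml
services:
  app:  # hypothetical service name, for illustration only
    ports:
      # Format is host_ip:host_port:container_port.
      # Publishing on 127.0.0.1 means only the Docker host itself can connect.
      - "127.0.0.1:8080:8080"
```

A mapping in this form is reachable only from the Docker host, so it assumes whatever proxies to the backend can connect locally. One way to check the result is `docker compose port app 8080`, which prints the host address and port the container port is published on.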