Project Memory: dlx-ansible

Infrastructure Overview

  • NPM Server: nginx (192.168.200.71) - Nginx Proxy Manager for SSL termination
  • Application Servers: hiveops (192.168.200.112), smartjournal (192.168.200.114)
  • CI/CD Server: jenkins (192.168.200.91) - Jenkins + SonarQube
  • All servers use dlxadmin user with passwordless sudo

Critical Learnings

SSL Certificate Offloading with Nginx Proxy Manager

Problem: Spring Boot applications behind NPM experience redirect loops when accessed via HTTPS.

Root Cause: Spring Boot doesn't trust X-Forwarded-* headers by default. When NPM terminates SSL and forwards plain HTTP to the backend, Spring sees an HTTP request and redirects to HTTPS. NPM forwards the redirected request as HTTP again, creating an infinite loop.

Solution: Configure Spring Boot to trust forwarded headers:

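environment:
  # Spring Boot 2.2+: let the servlet container honor X-Forwarded-* headers
  SERVER_FORWARD_HEADERS_STRATEGY: native
  # Pre-2.2 property (deprecated since 2.2); quoted so Compose reads a string, not a YAML boolean
  SERVER_USE_FORWARD_HEADERS: "true"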

Key Points:

  • Containers must be recreated (not restarted) for env vars to take effect
  • Verify with: curl -I -H 'X-Forwarded-Proto: https' http://localhost:8080/
  • Success indicator: Strict-Transport-Security header in response
  • Documentation: docs/SSL-OFFLOADING-FIX.md
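
The verification step above can be scripted. What follows is a minimal sketch, assuming the backend listens on localhost:8080 as in the curl example; the helper name has_hsts is hypothetical and not part of the repository. It reads response headers on stdin and succeeds only when a Strict-Transport-Security header is present.

```shell
# has_hsts (hypothetical helper): reads HTTP response headers on stdin and
# exits 0 if a Strict-Transport-Security header is present, 1 otherwise.
# The match is case-insensitive because header names are case-insensitive.
has_hsts() {
  grep -qi '^strict-transport-security:'
}

# Usage against a running backend (not executed here):
#   curl -sI -H 'X-Forwarded-Proto: https' http://localhost:8080/ | has_hsts \
#     && echo "forwarded headers trusted" || echo "HSTS header missing"
```

Keeping the check as a filter on stdin means it can be exercised with canned headers before pointing it at a live service.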

Docker Compose Best Practices

Environment Variable Loading:

  • Use --env-file flag when .env is not in same directory as compose file
  • Example: docker compose -f docker/docker-compose.yml --env-file .env up -d

Container Updates:

  • Restart: Keeps existing container, doesn't apply env changes
  • Recreate: Removes old container, creates new one with latest env/config
  • Always recreate when changing environment variables
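
The recreate rule can be wrapped in a small helper. This is a sketch only: the file paths are taken from the example above, and recreate_service and DRY_RUN are hypothetical names. It relies on the --force-recreate flag of docker compose up, which rebuilds the container from the current configuration even when Compose detects no change.

```shell
# Paths from the example above; adjust per project (assumption).
compose_file="docker/docker-compose.yml"
env_file=".env"

# recreate_service (hypothetical helper): recreate one service so that
# changed environment variables are applied. With DRY_RUN=1 the command is
# printed instead of executed.
recreate_service() {
  set -- docker compose -f "$compose_file" --env-file "$env_file" \
    up -d --force-recreate "$1"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage (not executed here):
#   DRY_RUN=1 recreate_service api   # inspect the command first
#   recreate_service api             # then run it
```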

HiveOps Application Structure

Main Deployment (/opt/hiveops-deploy/):

  • Full microservices stack
  • Services: incident-backend, incident-frontend, mgmt, remote
  • Managed via docker-compose

Standalone Deployment (/home/hiveops/):

  • Simplified incident management system
  • Separate from main deployment
  • Used for direct hiveops.directlx.dev access

Jenkins Firewall Blocking (2026-02-09)

Problem: Jenkins and SonarQube were unreachable from the network.

Root Cause: The server had no host_vars file, so it inherited the default firewall config (SSH only).

Solution: Created host_vars/jenkins.yml with ports 22, 8080 (Jenkins), 9000 (SonarQube).

Quick Fix:

ansible jenkins -m community.general.ufw -a "rule=allow port=8080 proto=tcp" -b
ansible jenkins -m community.general.ufw -a "rule=allow port=9000 proto=tcp" -b
ansible jenkins -m shell -a "docker start postgresql sonarqube" -b

Key Points:

  • Jenkins runs as Java system service (not Docker) on port 8080
  • SonarQube runs in Docker with PostgreSQL backend
  • Always create host_vars file for servers with specific firewall needs
  • Documentation: docs/JENKINS-CONNECTIVITY-FIX.md

File Locations

Host Variables

  • /source/dlx-src/dlx-ansible/host_vars/npm.yml - NPM firewall config
  • /source/dlx-src/dlx-ansible/host_vars/smartjournal.yml - SmartJournal settings
  • /source/dlx-src/dlx-ansible/host_vars/jenkins.yml - Jenkins/SonarQube firewall config

Storage Remediation (2026-02-08)

Critical Issues Identified:

  1. proxmox-00 root FS: 84.5% full (CRITICAL)
  2. proxmox-01 dlx-docker: 81.1% full (HIGH)
  3. Unused containers: 1.2 TB allocated
  4. SonarQube: 354 GB (82% of allocation)

Remediation Playbooks Created:

  • remediate-storage-critical-issues.yml: Log cleanup, Docker prune, audits
  • remediate-docker-storage.yml: Deep Docker cleanup + automation
  • remediate-stopped-containers.yml: Safe container removal with backups
  • configure-storage-monitoring.yml: Proactive monitoring (5/10 min checks)

Documentation:

  • STORAGE-AUDIT.md: Full hardware/storage analysis (550 lines)
  • STORAGE-REMEDIATION-GUIDE.md: Step-by-step execution (480 lines)
  • REMEDIATION-SUMMARY.md: Quick reference (300 lines)

Expected Results:

  • Total space freed: 1-2 TB
  • proxmox-00: 84.5% → 70% (10-15 GB freed)
  • proxmox-01: 81.1% → 70% (50-150 GB freed)
  • Automation prevents regrowth (weekly prune + hourly monitoring)

Commit: 90ed5c1

Common Tasks

Fix SSL Offloading for Spring Boot Service

  1. Add env vars to .env: SERVER_FORWARD_HEADERS_STRATEGY=native, SERVER_USE_FORWARD_HEADERS=true
  2. Add to docker-compose environment section
  3. Recreate container: docker stop <name> && docker rm <name> && docker compose up -d <service>
  4. Verify: Check for Strict-Transport-Security header

Apply Firewall Configuration

  • Firewall is managed by common role (roles/common/tasks/security.yml)
  • Controlled per-host via common_firewall_enabled and common_firewall_allowed_ports
  • Some hosts (docker, hiveops, smartjournal) have firewall disabled for Docker networking

Run Storage Remediation

  1. Test with --check: ansible-playbook playbooks/remediate-storage-critical-issues.yml --check
  2. Deploy monitoring: ansible-playbook playbooks/configure-storage-monitoring.yml -l proxmox
  3. Fix proxmox-00: ansible-playbook playbooks/remediate-storage-critical-issues.yml -l proxmox-00
  4. Fix proxmox-01: ansible-playbook playbooks/remediate-docker-storage.yml -l proxmox-01
  5. Monitor: tail -f /var/log/storage-monitor.log
  6. Remove containers (optional): ansible-playbook playbooks/remediate-stopped-containers.yml -e dry_run=false

Kubernetes Cluster Setup (2026-02-09)

Problem: Attempted to install K3s on LXC containers - failed due to kernel module limitations.

Root Cause: LXC containers share the host kernel and cannot load the required modules (br_netfilter, overlay) themselves.

Solution: Delete LXC containers, create proper QEMU/KVM VMs with Ubuntu 24.04 LTS.

Cluster Design:

  • 3-node HA cluster with embedded etcd
  • All nodes as control plane servers
  • K3s v1.31.4+k3s1
  • IPs: 192.168.200.215/216/217
  • 4GB RAM, 4 CPU cores, 50GB disk per node

Key Learnings:

  • LXC containers NOT suitable for Kubernetes
  • Always verify: systemd-detect-virt should return "kvm" not "lxc"
  • Use Ubuntu LTS releases (24.04) not interim releases (24.10)
  • Interim releases have only 9 months of support
  • Ubuntu 24.10 is EOL (July 2025), repositories archived
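
The virtualization check above can be turned into a pre-flight gate. This is a minimal sketch; k3s_host_ok is a hypothetical helper name. It takes the string printed by systemd-detect-virt and applies the rule from these notes strictly, accepting only "kvm".

```shell
# k3s_host_ok (hypothetical helper): takes the output of systemd-detect-virt
# as its argument and succeeds only for a KVM virtual machine.
k3s_host_ok() {
  case "$1" in
    kvm)
      return 0 ;;
    lxc|lxc-libvirt)
      echo "LXC container: cannot load br_netfilter/overlay" >&2
      return 1 ;;
    *)
      echo "unexpected virtualization type: $1" >&2
      return 1 ;;
  esac
}

# Usage on a target node (not executed here):
#   k3s_host_ok "$(systemd-detect-virt)" && echo "ok to install K3s"
```

Passing the detected type as an argument, rather than calling systemd-detect-virt inside the function, keeps the check testable on machines without systemd.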

Files Created:

  • playbooks/install-k3s-cluster.yml - HA K3s installation
  • host_vars/dlx-kube-{01,02,03}.yml - Firewall configs
  • docs/K3S-INSTALLATION-GUIDE.md - Complete guide
  • docs/PROXMOX-VM-SETUP-FOR-K3S.md - VM creation guide
  • docs/SESSION-PLAN-K3S-DEPLOYMENT.md - Next session plan
  • scripts/create-k3s-vms.sh - VM creation automation

Next Steps: User creates VMs, then run K3s installation playbook.

SmartJournal Kafka Fix (2026-02-20)

Problem: sj_api logs localhost/127.0.0.1:9092 warnings on startup, takes ~60s to start.

Root Causes:

  1. kafkaservice=kafka:9092 used the external listener — Kafka advertises 192.168.200.114:9092 back to containers, resolving to localhost
  2. Spring Boot dev profile hardcodes localhost:9092 for admin client — KAFKASERVICE env var only overrides producer/consumer, not admin client

Fix:

  • .env: kafkaservice=kafka:29092 (use internal PLAINTEXT listener)
  • docker-compose-prod.yaml api service: add SPRING_KAFKA_BOOTSTRAP_SERVERS=${kafkaservice}

Result: No warnings, startup ~20s instead of ~60s

Also fixed:

  • Typo mfa_enabled=faslefalse in .env (caused boolean parse crash)
  • Duplicate hyphenated env vars such as ${saml-mapper-graph-proxy-port}: shell-style interpolation reads ${name-word} as "use word if name is unset", so this expands to the literal string mapper-graph-proxy-port instead of the intended value
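
The hyphen problem can be reproduced in any POSIX shell. The sketch below is illustrative only: the underscore variable name and the value 8443 are assumptions, not taken from the project's .env.

```shell
# In POSIX parameter expansion, ${name-word} means "the value of name, or
# word if name is unset". The hyphenated name is therefore split at the
# first hyphen: variable "saml", default "mapper-graph-proxy-port".
unset saml
broken="${saml-mapper-graph-proxy-port}"
echo "$broken"    # prints: mapper-graph-proxy-port

# An underscore name is a single valid identifier (hypothetical rename).
saml_mapper_graph_proxy_port=8443   # example value, not from the source
fixed="${saml_mapper_graph_proxy_port}"
echo "$fixed"     # prints: 8443
```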

Documentation: docs/KAFKA-LOCALHOST-FIX.md

Security Notes

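  • Only trust forwarded headers when the backend is not reachable from the internet
  • The NPM server (192.168.200.71) should be the only server that can reach backend ports
  • Bind backend ports to localhost only (127.0.0.1:8080:8080) when the proxy runs on the same host; a backend on a different host from NPM must instead listen on its internal address, with access to the port restricted to 192.168.200.71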
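
Port bindings can be audited from the output of docker ps. The following is a rough sketch, not an existing project script; find_public_ports is a hypothetical helper. It reads "name<TAB>ports" lines and prints those with a port published on all interfaces (0.0.0.0 or the IPv6 wildcard).

```shell
# find_public_ports (hypothetical helper): reads lines of "<name><TAB><ports>"
# on stdin and prints those where a port is published on 0.0.0.0, [::] or ::.
# Loopback bindings such as 127.0.0.1:8080->8080/tcp are not reported.
find_public_ports() {
  grep -E '(^|[[:space:],])(0\.0\.0\.0|\[::\]|::):[0-9]+' || true
}

# Usage on a Docker host (not executed here):
#   docker ps --format '{{.Names}}\t{{.Ports}}' | find_public_ports
```

An empty result means no running container publishes a port on every interface; any line printed is a candidate for a narrower binding or a firewall rule.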