Project Memory: dlx-ansible
Infrastructure Overview
- NPM Server: nginx (192.168.200.71) - Nginx Proxy Manager for SSL termination
- Application Servers: hiveops (192.168.200.112), smartjournal (192.168.200.114)
- CI/CD Server: jenkins (192.168.200.91) - Jenkins + SonarQube
- All servers use `dlxadminuser` with passwordless sudo
Critical Learnings
SSL Certificate Offloading with Nginx Proxy Manager
Problem: Spring Boot applications behind NPM experience redirect loops when accessed via HTTPS.
Root Cause: Spring Boot doesn't trust X-Forwarded-* headers by default. When NPM terminates SSL and forwards HTTP to backend, Spring sees HTTP and redirects to HTTPS, creating infinite loop.
Solution: Configure Spring Boot to trust forwarded headers (docker-compose environment section):

```yaml
environment:
  SERVER_FORWARD_HEADERS_STRATEGY: native
  SERVER_USE_FORWARD_HEADERS: true
```
Key Points:
- Containers must be recreated (not restarted) for env vars to take effect
- Verify with: `curl -I -H 'X-Forwarded-Proto: https' http://localhost:8080/`
- Success indicator: `Strict-Transport-Security` header in response
- Documentation: docs/SSL-OFFLOADING-FIX.md
Docker Compose Best Practices
Environment Variable Loading:
- Use the `--env-file` flag when .env is not in the same directory as the compose file
- Example: `docker compose -f docker/docker-compose.yml --env-file .env up -d`
Container Updates:
- Restart: Keeps existing container, doesn't apply env changes
- Recreate: Removes old container, creates new one with latest env/config
- Always recreate when changing environment variables
HiveOps Application Structure
Main Deployment (/opt/hiveops-deploy/):
- Full microservices stack
- Services: incident-backend, incident-frontend, mgmt, remote
- Managed via docker-compose
Standalone Deployment (/home/hiveops/):
- Simplified incident management system
- Separate from main deployment
- Used for direct hiveops.directlx.dev access
Jenkins Firewall Blocking (2026-02-09)
Problem: Jenkins and SonarQube were unreachable from network.
Root Cause: Server had no host_vars file, inherited default firewall config (SSH only).
Solution: Created host_vars/jenkins.yml with ports 22, 8080 (Jenkins), 9000 (SonarQube).
Quick Fix:

```shell
ansible jenkins -m community.general.ufw -a "rule=allow port=8080 proto=tcp" -b
ansible jenkins -m community.general.ufw -a "rule=allow port=9000 proto=tcp" -b
ansible jenkins -m shell -a "docker start postgresql sonarqube" -b
```
Key Points:
- Jenkins runs as Java system service (not Docker) on port 8080
- SonarQube runs in Docker with PostgreSQL backend
- Always create host_vars file for servers with specific firewall needs
- Documentation: docs/JENKINS-CONNECTIVITY-FIX.md
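The permanent fix is the host_vars file itself. A minimal sketch of what it might contain, using the `common_firewall_enabled`/`common_firewall_allowed_ports` variable names from the common role (the real role's schema may differ):

```shell
#!/bin/sh
# Sketch of the host_vars file that opens the Jenkins/SonarQube ports.
# Variable names follow the common role's firewall settings used in this
# repo; the real role's schema may differ.
mkdir -p host_vars
cat > host_vars/jenkins.yml <<'EOF'
common_firewall_enabled: true
common_firewall_allowed_ports:
  - 22    # SSH
  - 8080  # Jenkins
  - 9000  # SonarQube
EOF
grep -c '^  - ' host_vars/jenkins.yml   # prints: 3
```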
File Locations
Host Variables
- `/source/dlx-src/dlx-ansible/host_vars/npm.yml` - NPM firewall config
- `/source/dlx-src/dlx-ansible/host_vars/smartjournal.yml` - SmartJournal settings
- `/source/dlx-src/dlx-ansible/host_vars/jenkins.yml` - Jenkins/SonarQube firewall config
Storage Remediation (2026-02-08)
Critical Issues Identified:
- proxmox-00 root FS: 84.5% full (CRITICAL)
- proxmox-01 dlx-docker: 81.1% full (HIGH)
- Unused containers: 1.2 TB allocated
- SonarQube: 354 GB (82% of allocation)
Remediation Playbooks Created:
- `remediate-storage-critical-issues.yml`: Log cleanup, Docker prune, audits
- `remediate-docker-storage.yml`: Deep Docker cleanup + automation
- `remediate-stopped-containers.yml`: Safe container removal with backups
- `configure-storage-monitoring.yml`: Proactive monitoring (5/10 min checks)
Documentation:
- STORAGE-AUDIT.md: Full hardware/storage analysis (550 lines)
- STORAGE-REMEDIATION-GUIDE.md: Step-by-step execution (480 lines)
- REMEDIATION-SUMMARY.md: Quick reference (300 lines)
Expected Results:
- Total space freed: 1-2 TB
- proxmox-00: 84.5% → 70% (10-15 GB freed)
- proxmox-01: 81.1% → 70% (50-150 GB freed)
- Automation prevents regrowth (weekly prune + hourly monitoring)
Commit: 90ed5c1
Common Tasks
Fix SSL Offloading for Spring Boot Service
- Add env vars to .env: `SERVER_FORWARD_HEADERS_STRATEGY=native`, `SERVER_USE_FORWARD_HEADERS=true`
- Add to docker-compose environment section
- Recreate container: `docker stop <name> && docker rm <name> && docker compose up -d <service>`
- Verify: check for the `Strict-Transport-Security` header
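The verification step can be wrapped in a tiny helper. A sketch, where `has_hsts` is a hypothetical function (not a project script) and the canned response stands in for a live `curl -sI` call:

```shell
#!/bin/sh
# has_hsts reads HTTP response headers on stdin and succeeds when the
# Strict-Transport-Security header is present (the success indicator
# used throughout this file). Hypothetical helper, not a project script.
has_hsts() {
  grep -qi '^Strict-Transport-Security:'
}

# A live check would be:
#   curl -sI -H 'X-Forwarded-Proto: https' http://localhost:8080/ | has_hsts
# A canned response keeps the sketch self-contained:
printf 'HTTP/1.1 200 OK\r\nStrict-Transport-Security: max-age=31536000\r\n' |
  has_hsts && echo "forwarded headers OK" || echo "HSTS missing - recheck env vars"
```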
Apply Firewall Configuration
- Firewall is managed by common role (roles/common/tasks/security.yml)
- Controlled per-host via
common_firewall_enabledandcommon_firewall_allowed_ports - Some hosts (docker, hiveops, smartjournal) have firewall disabled for Docker networking
Run Storage Remediation
- Test with `--check`: `ansible-playbook playbooks/remediate-storage-critical-issues.yml --check`
- Deploy monitoring: `ansible-playbook playbooks/configure-storage-monitoring.yml -l proxmox`
- Fix proxmox-00: `ansible-playbook playbooks/remediate-storage-critical-issues.yml -l proxmox-00`
- Fix proxmox-01: `ansible-playbook playbooks/remediate-docker-storage.yml -l proxmox-01`
- Monitor: `tail -f /var/log/storage-monitor.log`
- Remove containers (optional): `ansible-playbook playbooks/remediate-stopped-containers.yml -e dry_run=false`
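The monitoring the playbook deploys presumably boils down to a df threshold check along these lines (the 80% threshold and the `check_usage` helper are assumptions, not the playbook's actual contents):

```shell
#!/bin/sh
# Sketch of the kind of threshold check the monitoring playbook
# presumably installs; the 80% threshold and check_usage helper are
# assumptions, not the playbook's actual contents.
THRESHOLD=80

check_usage() {
  # $1 = mount point; succeeds when usage is at or below $THRESHOLD
  used=$(df -P "$1" | awk 'NR==2 { gsub(/%/, "", $5); print $5 }')
  [ "$used" -le "$THRESHOLD" ]
}

if check_usage /; then
  echo "OK: / is at or below ${THRESHOLD}%"
else
  echo "ALERT: / is above ${THRESHOLD}%"
fi
```

A cron or systemd timer running such a check hourly matches the "hourly monitoring" goal noted above.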
Kubernetes Cluster Setup (2026-02-09)
Problem: Attempted to install K3s on LXC containers - failed due to kernel module limitations.
Root Cause: LXC containers share host kernel and cannot load required modules (br_netfilter, overlay).
Solution: Delete LXC containers, create proper QEMU/KVM VMs with Ubuntu 24.04 LTS.
Cluster Design:
- 3-node HA cluster with embedded etcd
- All nodes as control plane servers
- K3s v1.31.4+k3s1
- IPs: 192.168.200.215/216/217
- 4GB RAM, 4 CPU cores, 50GB disk per node
Key Learnings:
- LXC containers NOT suitable for Kubernetes
- Always verify: `systemd-detect-virt` should return "kvm", not "lxc"
- Use Ubuntu LTS releases (24.04), not interim releases (24.10)
- Interim releases have only 9 months support
- Ubuntu 24.10 is EOL (July 2025), repositories archived
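That verification step can be turned into a preflight guard. A sketch where `virt_ok` is a hypothetical helper and the accepted-virtualization list is an assumption (kvm/none covers the QEMU VMs and bare metal relevant here):

```shell
#!/bin/sh
# Preflight sketch: refuse a K3s install when the node is an LXC
# container. virt_ok is a hypothetical helper; the accepted list is an
# assumption.
virt_ok() {
  case "$1" in
    lxc|lxc-libvirt) return 1 ;;  # shared kernel, cannot load br_netfilter/overlay
    *) return 0 ;;                # kvm, none (bare metal), etc.
  esac
}

v=$(systemd-detect-virt 2>/dev/null)
v=${v:-unknown}
if virt_ok "$v"; then
  echo "virtualization '$v' looks suitable for K3s"
else
  echo "refusing: '$v' cannot load the required kernel modules"
fi
```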
Files Created:
- `playbooks/install-k3s-cluster.yml` - HA K3s installation
- `host_vars/dlx-kube-{01,02,03}.yml` - Firewall configs
- `docs/K3S-INSTALLATION-GUIDE.md` - Complete guide
- `docs/PROXMOX-VM-SETUP-FOR-K3S.md` - VM creation guide
- `docs/SESSION-PLAN-K3S-DEPLOYMENT.md` - Next session plan
- `scripts/create-k3s-vms.sh` - VM creation automation
Next Steps: User creates VMs, then run K3s installation playbook.
SmartJournal Kafka Fix (2026-02-20)
Problem: sj_api logs localhost/127.0.0.1:9092 warnings on startup, takes ~60s to start.
Root Causes:
- `kafkaservice=kafka:9092` used the external listener; Kafka advertises `192.168.200.114:9092` back to containers, resolving to localhost
- Spring Boot `dev` profile hardcodes `localhost:9092` for the admin client; the `KAFKASERVICE` env var only overrides the producer/consumer, not the admin client
Fix:
- .env: `kafkaservice=kafka:29092` (use internal PLAINTEXT listener)
- `docker-compose-prod.yaml` api service: add `SPRING_KAFKA_BOOTSTRAP_SERVERS=${kafkaservice}`
Result: No warnings, startup ~20s instead of ~60s
Also fixed:
- Typo `mfa_enabled=fasle` → `false` in .env (caused a boolean parse crash)
- Duplicate hyphenated env vars like `${saml-mapper-graph-proxy-port}`: the shell treats the hyphen as default-value syntax and passes a literal string instead of the value
Documentation: docs/KAFKA-LOCALHOST-FIX.md
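The hyphen pitfall is ordinary POSIX parameter expansion: in `${saml-mapper-graph-proxy-port}` the shell sees a variable named `saml`, with everything after the first hyphen taken as a default value. A self-contained demo:

```shell
#!/bin/sh
# In POSIX shells, ${var-default} substitutes "default" when var is
# unset, and the variable name ends at the first hyphen. So the
# hyphenated name below is really the variable "saml" plus a default.
unset saml
echo "${saml-mapper-graph-proxy-port}"   # prints: mapper-graph-proxy-port

saml=oops
echo "${saml-mapper-graph-proxy-port}"   # prints: oops
```

This is why hyphenated env var names should be avoided entirely; underscores are safe in every shell.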
Security Notes
- Only trust forwarded headers when backend is not internet-accessible
- NPM server (192.168.200.71) should be only server that can reach backend ports
- Backend ports should bind to localhost only: `127.0.0.1:8080:8080`
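In compose terms, loopback-only publishing looks like this sketch (service name and file path are illustrative; writing the snippet to /tmp just makes it self-checkable):

```shell
#!/bin/sh
# Sketch of a compose port mapping that publishes only on the host's
# loopback interface (service name and file path are illustrative).
cat > /tmp/compose-ports.yml <<'EOF'
services:
  backend:
    ports:
      - "127.0.0.1:8080:8080"   # loopback only, not 0.0.0.0
EOF
grep -o '127\.0\.0\.1:8080:8080' /tmp/compose-ports.yml   # prints the binding once
```

Without the `127.0.0.1:` prefix, compose publishes on all interfaces, bypassing typical host firewall rules.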