Project Memory: dlx-ansible

Infrastructure Overview

  • NPM Server: nginx (192.168.200.71) - Nginx Proxy Manager for SSL termination
  • Application Servers: hiveops (192.168.200.112), smartjournal (192.168.200.114)
  • CI/CD Server: jenkins (192.168.200.91) - Jenkins + SonarQube
  • All servers use dlxadmin user with passwordless sudo

Critical Learnings

SSL Certificate Offloading with Nginx Proxy Manager

Problem: Spring Boot applications behind NPM experience redirect loops when accessed via HTTPS.

Root Cause: Spring Boot doesn't trust X-Forwarded-* headers by default. When NPM terminates SSL and forwards plain HTTP to the backend, Spring sees an HTTP request and redirects to HTTPS, creating an infinite loop.

Solution: Configure Spring Boot to trust forwarded headers:

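environment:
  SERVER_FORWARD_HEADERS_STRATEGY: native
  # Legacy property, superseded by the strategy setting in Spring Boot 2.2+.
  # Boolean values should be quoted in compose files so YAML keeps them as strings.
  SERVER_USE_FORWARD_HEADERS: "true"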

Key Points:

  • Containers must be recreated (not restarted) for env vars to take effect
  • Verify with: curl -I -H 'X-Forwarded-Proto: https' http://localhost:8080/
  • Success indicator: Strict-Transport-Security header in response
  • Documentation: docs/SSL-OFFLOADING-FIX.md
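
The curl check above can also be expressed as Ansible tasks, which may be convenient inside the existing playbooks. This is a minimal sketch, not the contents of any playbook in the repository: the registered variable name ssl_offload_check and the list of accepted status codes are assumptions, and it relies on ansible.builtin.uri exposing response headers as lowercase keys with hyphens replaced by underscores.

```yaml
# Sketch only: run on a host where the backend listens on localhost:8080.
- name: Request the backend as if NPM had terminated SSL
  ansible.builtin.uri:
    url: http://localhost:8080/
    method: HEAD
    headers:
      X-Forwarded-Proto: https
    follow_redirects: none                   # inspect the first response only
    status_code: [200, 301, 302, 401, 403]   # assumed acceptable codes
  register: ssl_offload_check                # hypothetical variable name

- name: Confirm the Strict-Transport-Security header is present
  ansible.builtin.assert:
    that:
      - ssl_offload_check.strict_transport_security is defined
    fail_msg: Forwarded headers do not appear to be trusted
```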

Docker Compose Best Practices

Environment Variable Loading:

  • Use the --env-file flag when .env is not in the same directory as the compose file
  • Example: docker compose -f docker/docker-compose.yml --env-file .env up -d

Container Updates:

  • Restart: Keeps existing container, doesn't apply env changes
  • Recreate: Removes old container, creates new one with latest env/config
  • Always recreate when changing environment variables
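
A recreate can also be driven from Ansible. The sketch below assumes the community.docker collection is installed and that its docker_compose_v2 module supports the env_files option in the installed version. The project path comes from the main deployment described in these notes; the relative file locations mirror the CLI example above and may not match the real layout.

```yaml
- name: Recreate services so new environment variables take effect
  community.docker.docker_compose_v2:
    project_src: /opt/hiveops-deploy     # assumed project root
    files:
      - docker/docker-compose.yml        # relative to project_src (assumed layout)
    env_files:
      - /opt/hiveops-deploy/.env         # assumed location of the .env file
    state: present
    recreate: always                     # force new containers even if config looks unchanged
```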

HiveOps Application Structure

Main Deployment (/opt/hiveops-deploy/):

  • Full microservices stack
  • Services: incident-backend, incident-frontend, mgmt, remote
  • Managed via docker-compose

Standalone Deployment (/home/hiveops/):

  • Simplified incident management system
  • Separate from main deployment
  • Used for direct hiveops.directlx.dev access

Jenkins Firewall Blocking (2026-02-09)

Problem: Jenkins and SonarQube were unreachable from the network.

Root Cause: The server had no host_vars file, so it inherited the default firewall config (SSH only).

Solution: Created host_vars/jenkins.yml with ports 22, 8080 (Jenkins), 9000 (SonarQube).

Quick Fix:

ansible jenkins -m community.general.ufw -a "rule=allow port=8080 proto=tcp" -b
ansible jenkins -m community.general.ufw -a "rule=allow port=9000 proto=tcp" -b
ansible jenkins -m shell -a "docker start postgresql sonarqube" -b
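
The same quick fix could be captured as a small playbook so it is repeatable. This is a sketch under the assumption that the community.general collection is installed; the play is illustrative and is not one of the playbooks listed in these notes.

```yaml
- name: Open Jenkins and SonarQube ports and start SonarQube
  hosts: jenkins
  become: true
  tasks:
    - name: Allow Jenkins (8080) and SonarQube (9000) through UFW
      community.general.ufw:
        rule: allow
        port: "{{ item }}"
        proto: tcp
      loop:
        - "8080"
        - "9000"

    - name: Start the PostgreSQL and SonarQube containers
      ansible.builtin.command: docker start postgresql sonarqube
      changed_when: true   # docker start gives no reliable change signal here
```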

Key Points:

  • Jenkins runs as Java system service (not Docker) on port 8080
  • SonarQube runs in Docker with PostgreSQL backend
  • Always create host_vars file for servers with specific firewall needs
  • Documentation: docs/JENKINS-CONNECTIVITY-FIX.md

File Locations

Host Variables

  • /source/dlx-src/dlx-ansible/host_vars/npm.yml - NPM firewall config
  • /source/dlx-src/dlx-ansible/host_vars/hiveops.yml - HiveOps settings
  • /source/dlx-src/dlx-ansible/host_vars/smartjournal.yml - SmartJournal settings
  • /source/dlx-src/dlx-ansible/host_vars/jenkins.yml - Jenkins/SonarQube firewall config

Playbooks Created

  • playbooks/fix-hiveops-ssl-offload.yml - SSL offload fix automation
  • playbooks/fix-hiveops-compose-indentation.yml - Compose file corrections
  • playbooks/fix-hiveops-mgmt-ssl.yml - Management service SSL fix

Templates

  • templates/hiveops-docker-compose.prod.yml.j2 - Corrected compose template

Storage Remediation (2026-02-08)

Critical Issues Identified:

  1. proxmox-00 root FS: 84.5% full (CRITICAL)
  2. proxmox-01 dlx-docker: 81.1% full (HIGH)
  3. Unused containers: 1.2 TB allocated
  4. SonarQube: 354 GB (82% of allocation)

Remediation Playbooks Created:

  • remediate-storage-critical-issues.yml: Log cleanup, Docker prune, audits
  • remediate-docker-storage.yml: Deep Docker cleanup + automation
  • remediate-stopped-containers.yml: Safe container removal with backups
  • configure-storage-monitoring.yml: Proactive monitoring (5/10 min checks)

Documentation:

  • STORAGE-AUDIT.md: Full hardware/storage analysis (550 lines)
  • STORAGE-REMEDIATION-GUIDE.md: Step-by-step execution (480 lines)
  • REMEDIATION-SUMMARY.md: Quick reference (300 lines)

Expected Results:

  • Total space freed: 1-2 TB
  • proxmox-00: 84.5% → 70% (10-15 GB freed)
  • proxmox-01: 81.1% → 70% (50-150 GB freed)
  • Automation prevents regrowth (weekly prune + hourly monitoring)
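
The weekly prune and hourly monitoring mentioned above could be scheduled with tasks along these lines. This is an illustrative sketch, not the content of the remediation playbooks: the cron entry names and the exact commands are assumptions, and the log path is reused from the monitoring step in these notes.

```yaml
- name: Schedule a weekly Docker prune
  ansible.builtin.cron:
    name: weekly-docker-prune        # hypothetical entry name
    special_time: weekly
    user: root
    job: "docker system prune -f >> /var/log/storage-monitor.log 2>&1"

- name: Log root filesystem usage every hour
  ansible.builtin.cron:
    name: hourly-storage-check       # hypothetical entry name
    special_time: hourly
    user: root
    job: "df -h / >> /var/log/storage-monitor.log 2>&1"
```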

Commit: 90ed5c1

Common Tasks

Fix SSL Offloading for Spring Boot Service

  1. Add env vars to .env: SERVER_FORWARD_HEADERS_STRATEGY=native, SERVER_USE_FORWARD_HEADERS=true
  2. Add to docker-compose environment section
  3. Recreate container: docker stop <name> && docker rm <name> && docker compose up -d <service> (or, in one step, docker compose up -d --force-recreate <service>)
  4. Verify: Check for Strict-Transport-Security header
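
Steps 1 and 2 might look like the following compose fragment, which passes the values from .env through to the container. The service name incident-backend is borrowed from the main deployment purely as an example; whether that service is the Spring Boot application in question is an assumption.

```yaml
services:
  incident-backend:   # example service; substitute the Spring Boot service being fixed
    environment:
      # Values come from .env; the default after ":-" applies when a variable is unset or empty.
      SERVER_FORWARD_HEADERS_STRATEGY: "${SERVER_FORWARD_HEADERS_STRATEGY:-native}"
      SERVER_USE_FORWARD_HEADERS: "${SERVER_USE_FORWARD_HEADERS:-true}"
```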

Apply Firewall Configuration

  • Firewall is managed by common role (roles/common/tasks/security.yml)
  • Controlled per-host via common_firewall_enabled and common_firewall_allowed_ports
  • Some hosts (docker, hiveops, smartjournal) have firewall disabled for Docker networking
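
As a rough illustration, host_vars files using these two variables might look like the sketch below. Only the variable names come from these notes; the list-of-mappings shape of common_firewall_allowed_ports is an assumption and should be checked against roles/common/tasks/security.yml.

```yaml
# host_vars/jenkins.yml (sketch; the real schema is defined by the common role)
common_firewall_enabled: true
common_firewall_allowed_ports:   # structure is assumed, not confirmed
  - port: "22"
    proto: tcp
  - port: "8080"                 # Jenkins
    proto: tcp
  - port: "9000"                 # SonarQube
    proto: tcp
---
# host_vars/hiveops.yml (sketch; firewall disabled for Docker networking)
common_firewall_enabled: false
```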

Run Storage Remediation

  1. Test with --check: ansible-playbook playbooks/remediate-storage-critical-issues.yml --check
  2. Deploy monitoring: ansible-playbook playbooks/configure-storage-monitoring.yml -l proxmox
  3. Fix proxmox-00: ansible-playbook playbooks/remediate-storage-critical-issues.yml -l proxmox-00
  4. Fix proxmox-01: ansible-playbook playbooks/remediate-docker-storage.yml -l proxmox-01
  5. Monitor: tail -f /var/log/storage-monitor.log
  6. Remove containers (optional): ansible-playbook playbooks/remediate-stopped-containers.yml -e dry_run=false

Security Notes

  • Only trust forwarded headers when backend is not internet-accessible
  • NPM server (192.168.200.71) should be the only server that can reach backend ports
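  • Backend ports should bind to localhost only (127.0.0.1:8080:8080) where the proxy runs on the same host; a localhost-only port is unreachable from a proxy on another host, so there access must instead be restricted to the proxy's address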
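
In a compose file, the localhost binding is written as a ports entry. A minimal sketch, reusing incident-backend as an example service name; whether that service publishes port 8080 is an assumption.

```yaml
services:
  incident-backend:              # example service name
    ports:
      # host_ip:host_port:container_port; reachable only from the Docker host itself
      - "127.0.0.1:8080:8080"
```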