HomeNet Docker Swarm Infrastructure

[!abstract] Overview A production-grade, multi-node Docker Swarm infrastructure managing 73 services (60 currently running) across 5 nodes, with comprehensive monitoring, logging, and automation. This vault documents the architecture, operations, troubleshooting, and service management.

Quick Stats

| Metric | Value |
| --- | --- |
| Nodes | 5 (1 manager, 4 workers) |
| Stacks | 15 deployed |
| Services | 73 total (60 running) |
| Storage | ~3 TB NFS (92% used) |
| Docker Version | 29.1.3 |
| Host OS | Ubuntu 24.04.3 LTS |

Core Infrastructure

  • [[01-Infrastructure/Cluster-Overview|Cluster Overview]] - Node topology and architecture
  • [[01-Infrastructure/Node-201-Manager|Node 201 (Manager)]] - Critical infrastructure node
  • [[01-Infrastructure/Node-202-Worker|Node 202 (Worker)]] - Media powerhouse
  • [[01-Infrastructure/Node-203-Worker|Node 203 (Worker)]] - Surveillance
  • [[01-Infrastructure/Node-204-Worker|Node 204 (Worker)]] - Dashboards & automation
  • [[01-Infrastructure/Node-205-Worker|Node 205 (Worker)]] - General workloads
  • [[01-Infrastructure/Network-Architecture|Network Architecture]] - Overlay networks, DNS, routing

Services & Stacks

  • [[02-Services/Service-Catalog|Service Catalog]] - Complete service inventory
  • [[02-Services/Stack-Homenet1|Stack Homenet1]] - Data layer (databases, ELK)
  • [[02-Services/Stack-Homenet4|Stack Homenet4]] - Media & applications
  • [[02-Services/ARR-Stack|ARR Stack]] - Media automation (Docker Compose)
  • [[02-Services/Monitoring-Stack|Monitoring Stack]] - Prometheus & Grafana
  • [[02-Services/Critical-Services-Offline|Critical Services Offline]] ⚠️

Operations

  • [[03-Operations/Daily-Operations|Daily Operations]] - Routine maintenance
  • [[03-Operations/Stack-Deployment|Stack Deployment]] - Deploy and manage stacks
  • [[03-Operations/Script-Reference|Script Reference]] - 72+ operational scripts
  • [[03-Operations/Cron-Jobs|Cron Jobs]] - Automated tasks
  • [[03-Operations/Backup-Procedures|Backup Procedures]] - Database and service backups

Monitoring & Observability

  • [[04-Monitoring/Prometheus-Setup|Prometheus Setup]] - Metrics collection
  • [[04-Monitoring/Grafana-Dashboards|Grafana Dashboards]] - Visualization
  • [[04-Monitoring/Service-Health|Service Health]] - Uptime monitoring
  • [[04-Monitoring/Metrics-Exporters|Metrics Exporters]] - Node, cAdvisor, custom exporters

Storage & Networking

  • [[05-Storage/NFS-Architecture|NFS Architecture]] - Storage mounts and capacity
  • [[05-Storage/Storage-Critical-Warning|Storage Critical Warning]] ⚠️ 92% capacity
  • [[05-Storage/Volume-Management|Volume Management]] - Docker volumes
  • [[01-Infrastructure/Network-Architecture|Network Architecture]] - Overlay networks

Troubleshooting

  • [[06-Troubleshooting/Known-Issues|Known Issues]] - Current problems and gaps
  • [[06-Troubleshooting/ELK-Stack-Offline|ELK Stack Offline]] ⚠️ Critical issue
  • [[06-Troubleshooting/Service-Restart-Runbook|Service Restart Runbook]]
  • [[06-Troubleshooting/Database-Recovery|Database Recovery]]
  • [[06-Troubleshooting/NFS-Mount-Issues|NFS Mount Issues]]

Documentation & Templates

  • [[07-Documentation/Existing-Docs-Index|Existing Documentation Index]] - 50+ markdown files
  • [[08-Templates/Service-Addition-Template|Service Addition Template]]
  • [[08-Templates/Troubleshooting-Template|Troubleshooting Template]]
  • [[08-Templates/Quick-Reference-Cards|Quick Reference Cards]]

Critical Alerts

[!danger] ELK Stack Offline Elasticsearch, Logstash, and Kibana are all scaled to 0/0 replicas. No centralized logging is operational, so the cluster is running without log visibility.

📍 [[06-Troubleshooting/ELK-Stack-Offline|Investigation Guide]]
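
A quick way to confirm which services are scaled to zero is to parse the output of `docker service ls`. The following is a minimal sketch, not the vault's actual tooling: it assumes it runs on a manager node with the Docker CLI available, and the service names in the sample are hypothetical placeholders.

```python
import subprocess


def parse_replicas(listing: str) -> dict[str, tuple[int, int]]:
    """Map service name -> (running, desired) from 'name running/desired' lines."""
    result = {}
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) < 2 or "/" not in parts[1]:
            continue  # skip blank or unexpected lines
        running, desired = parts[1].split("/", 1)
        try:
            result[parts[0]] = (int(running), int(desired))
        except ValueError:
            continue  # replica field was not numeric
    return result


def scaled_to_zero(replicas: dict[str, tuple[int, int]]) -> list[str]:
    """Return services whose desired replica count is 0 (shown as 0/0)."""
    return sorted(name for name, (_, desired) in replicas.items() if desired == 0)


def list_services() -> str:
    """Call the Docker CLI; must run on a Swarm manager node."""
    return subprocess.run(
        ["docker", "service", "ls", "--format", "{{.Name}} {{.Replicas}}"],
        check=True, capture_output=True, text=True,
    ).stdout


if __name__ == "__main__":
    # Hypothetical sample; on a manager, use list_services() instead.
    sample = "homenet1_elasticsearch 0/0\nhomenet1_kibana 0/0\nmonitoring_grafana 1/1\n"
    print(scaled_to_zero(parse_replicas(sample)))
```

Keeping the parsing separate from the CLI call means the logic can be checked against saved output without access to the cluster.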

[!warning] Storage Near Capacity Multiple NFS mounts are at 92% capacity (254 GB remaining of ~3 TB). Services risk failing if the mounts fill up.

📍 [[05-Storage/Storage-Critical-Warning|Capacity Planning]]
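
The capacity figure above can be reproduced, and watched over time, with a small check like the one below. This is a sketch under stated assumptions: the 90% warning threshold is illustrative, the path passed to `check_mount` is a placeholder for a real NFS mount point (this page does not list mount paths), and the 3 TB total is treated as 3000 GB.

```python
import shutil


def percent_used(total: int, used: int) -> float:
    """Percentage of capacity consumed; 0.0 if total is zero."""
    return 100.0 * used / total if total else 0.0


def check_mount(path: str, warn_at: float = 90.0) -> tuple[float, bool]:
    """Return (percent used, True if at or above the warning threshold)."""
    usage = shutil.disk_usage(path)
    pct = percent_used(usage.total, usage.used)
    return pct, pct >= warn_at


if __name__ == "__main__":
    # Figures from the alert: 254 GB free of roughly 3000 GB.
    print(round(percent_used(3000, 3000 - 254)))
    # Placeholder path; substitute an actual NFS mount point.
    print(check_mount("/"))
```

Note that `shutil.disk_usage` reports raw used/total bytes, so the percentage can differ slightly from what `df` shows on filesystems that reserve blocks.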

[!warning] Missing Swarmpit Agent The Swarmpit agent is running 4/5 global instances; one node is not reporting.

📍 [[06-Troubleshooting/Known-Issues#swarmpit-agent-missing|Troubleshooting]]
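
To find which node is missing its agent task, one option is to compare the node list against the nodes that currently run a task for the global service. The sketch below assumes it runs on a manager node; the service name `swarmpit_agent` and the hostnames in the sample are hypothetical, since this page does not give them.

```python
import subprocess


def nodes_missing_task(all_nodes: list[str], task_nodes: list[str]) -> list[str]:
    """Nodes in the cluster that have no running task for the service."""
    return sorted(set(all_nodes) - set(task_nodes))


def docker_lines(*args: str) -> list[str]:
    """Run a Docker CLI command and return its non-empty output lines."""
    out = subprocess.run(
        ["docker", *args], check=True, capture_output=True, text=True
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def find_missing(service: str) -> list[str]:
    """Compare all node hostnames with nodes running a task of `service`."""
    nodes = docker_lines("node", "ls", "--format", "{{.Hostname}}")
    running = docker_lines(
        "service", "ps", service,
        "--filter", "desired-state=running",
        "--format", "{{.Node}}",
    )
    return nodes_missing_task(nodes, running)


if __name__ == "__main__":
    # Hypothetical hostnames; on a manager, call find_missing("swarmpit_agent").
    cluster = ["node201", "node202", "node203", "node204", "node205"]
    reporting = ["node201", "node202", "node204", "node205"]
    print(nodes_missing_task(cluster, reporting))
```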

Architecture Diagrams

graph TB
    subgraph "Proxmox Hypervisors"
        PVE1[Proxmox-1<br/>100.1.100.10]
        PVE2[Proxmox-2<br/>100.1.100.15]
    end

    subgraph "Docker Swarm Cluster"
        MGR[Node 201 Manager<br/>8 CPU, 10GB RAM<br/>Databases & Logging]
        W1[Node 202 Worker<br/>12 CPU, 16GB RAM<br/>Media & Photos]
        W2[Node 203 Worker<br/>4 CPU, 3GB RAM<br/>Surveillance]
        W3[Node 204 Worker<br/>4 CPU, 4GB RAM<br/>Dashboards]
        W4[Node 205 Worker<br/>8 CPU, 8GB RAM<br/>General]
    end

    subgraph "Infrastructure Services"
        DNS[Pi DNS<br/>100.1.100.11<br/>AdGuard/Pi-hole]
        NFS[OMV NFS Server<br/>100.1.100.199<br/>3TB Storage]
    end

    PVE1 --> MGR
    PVE1 --> W1
    PVE2 --> W2
    PVE2 --> W3
    PVE2 --> W4

    MGR -.-> DNS
    W1 -.-> DNS
    W2 -.-> DNS
    W3 -.-> DNS
    W4 -.-> DNS

    MGR --> NFS
    W1 --> NFS
    W2 --> NFS
    W3 --> NFS
    W4 --> NFS
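
The node specifications in the diagram can also be kept as plain data, which makes it easy to total cluster resources or spot the most constrained node. This is only an illustrative sketch: the dictionary layout and key names are assumptions, while the CPU and RAM figures are copied from the diagram.

```python
# Node specs transcribed from the architecture diagram.
NODES = {
    "201": {"role": "manager", "cpu": 8, "ram_gb": 10, "purpose": "Databases & Logging"},
    "202": {"role": "worker", "cpu": 12, "ram_gb": 16, "purpose": "Media & Photos"},
    "203": {"role": "worker", "cpu": 4, "ram_gb": 3, "purpose": "Surveillance"},
    "204": {"role": "worker", "cpu": 4, "ram_gb": 4, "purpose": "Dashboards"},
    "205": {"role": "worker", "cpu": 8, "ram_gb": 8, "purpose": "General"},
}


def totals(nodes: dict) -> tuple[int, int]:
    """Sum CPU cores and RAM (GB) across all nodes."""
    cpu = sum(n["cpu"] for n in nodes.values())
    ram = sum(n["ram_gb"] for n in nodes.values())
    return cpu, ram


def smallest_by_ram(nodes: dict) -> str:
    """Name of the node with the least RAM."""
    return min(nodes, key=lambda name: nodes[name]["ram_gb"])


if __name__ == "__main__":
    print(totals(NODES))           # (36, 41)
    print(smallest_by_ram(NODES))  # 203
```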

Recently Updated

  • [[06-Troubleshooting/ELK-Stack-Offline|ELK Stack Offline]] - 2026-01-11
  • [[05-Storage/Storage-Critical-Warning|Storage Critical Warning]] - 2026-01-11
  • [[02-Services/Critical-Services-Offline|Critical Services Offline]] - 2026-01-11

External Resources

  • Repository: /home/cjustin/homenet-docker-services/
  • Primary Documentation: CLAUDE.md, README.md
  • Grafana: http://100.1.100.201:3010
  • Prometheus: http://100.1.100.201:9090
  • Swarmpit: (Cluster management UI)
  • Traefik Dashboard: Port 8080
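
The Grafana and Prometheus endpoints listed above can be probed with a short script. This sketch uses Prometheus's `/-/healthy` and Grafana's `/api/health` endpoints with the base URLs from this page; the timeout value and function names are illustrative assumptions, and it must run from a host that can reach the cluster network.

```python
import urllib.request

# Base URLs from this page, paired with each tool's health endpoint.
HEALTH_PATHS = {
    "Grafana": ("http://100.1.100.201:3010", "/api/health"),
    "Prometheus": ("http://100.1.100.201:9090", "/-/healthy"),
}


def health_url(base: str, path: str) -> str:
    """Join a base URL and a health path without doubling slashes."""
    return base.rstrip("/") + path


def probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # Covers connection errors, timeouts, HTTP errors, and malformed URLs.
        return False


if __name__ == "__main__":
    for name, (base, path) in HEALTH_PATHS.items():
        url = health_url(base, path)
        print(f"{name}: {'up' if probe(url) else 'unreachable'} ({url})")
```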

  • Last Research Date: 2026-01-11
  • Documentation Version: 1.0
  • Vault Created: 2026-01-11