HomeNet Docker Swarm Infrastructure
> [!abstract] Overview
> Production-grade, multi-node Docker Swarm infrastructure managing 73+ services across 5 nodes, with comprehensive monitoring, logging, and automation. This vault contains complete documentation for architecture, operations, troubleshooting, and service management.
Quick Stats
| Metric | Value |
|---|---|
| Nodes | 5 (1 manager, 4 workers) |
| Stacks | 15 deployed |
| Services | 73 total (60 running) |
| Storage | ~3TB NFS (92% used) |
| Docker Version | 29.1.3 |
| Host OS | Ubuntu 24.04.3 LTS |
Navigation
Core Infrastructure
- [[01-Infrastructure/Cluster-Overview|Cluster Overview]] - Node topology and architecture
- [[01-Infrastructure/Node-201-Manager|Node 201 (Manager)]] - Critical infrastructure node
- [[01-Infrastructure/Node-202-Worker|Node 202 (Worker)]] - Media powerhouse
- [[01-Infrastructure/Node-203-Worker|Node 203 (Worker)]] - Surveillance
- [[01-Infrastructure/Node-204-Worker|Node 204 (Worker)]] - Dashboards & automation
- [[01-Infrastructure/Node-205-Worker|Node 205 (Worker)]] - General workloads
- [[01-Infrastructure/Network-Architecture|Network Architecture]] - Overlay networks, DNS, routing
Services & Stacks
- [[02-Services/Service-Catalog|Service Catalog]] - Complete service inventory
- [[02-Services/Stack-Homenet1|Stack Homenet1]] - Data layer (databases, ELK)
- [[02-Services/Stack-Homenet4|Stack Homenet4]] - Media & applications
- [[02-Services/ARR-Stack|ARR Stack]] - Media automation (Docker Compose)
- [[02-Services/Monitoring-Stack|Monitoring Stack]] - Prometheus & Grafana
- [[02-Services/Critical-Services-Offline|Critical Services Offline]] ⚠️
Operations
- [[03-Operations/Daily-Operations|Daily Operations]] - Routine maintenance
- [[03-Operations/Stack-Deployment|Stack Deployment]] - Deploy and manage stacks
- [[03-Operations/Script-Reference|Script Reference]] - 72+ operational scripts
- [[03-Operations/Cron-Jobs|Cron Jobs]] - Automated tasks
- [[03-Operations/Backup-Procedures|Backup Procedures]] - Database and service backups
Monitoring & Observability
- [[04-Monitoring/Prometheus-Setup|Prometheus Setup]] - Metrics collection
- [[04-Monitoring/Grafana-Dashboards|Grafana Dashboards]] - Visualization
- [[04-Monitoring/Service-Health|Service Health]] - Uptime monitoring
- [[04-Monitoring/Metrics-Exporters|Metrics Exporters]] - Node, cAdvisor, custom exporters
Storage & Networking
- [[05-Storage/NFS-Architecture|NFS Architecture]] - Storage mounts and capacity
- [[05-Storage/Storage-Critical-Warning|Storage Critical Warning]] ⚠️ 92% capacity
- [[05-Storage/Volume-Management|Volume Management]] - Docker volumes
- [[01-Infrastructure/Network-Architecture|Network Architecture]] - Overlay networks
Troubleshooting
- [[06-Troubleshooting/Known-Issues|Known Issues]] - Current problems and gaps
- [[06-Troubleshooting/ELK-Stack-Offline|ELK Stack Offline]] ⚠️ Critical issue
- [[06-Troubleshooting/Service-Restart-Runbook|Service Restart Runbook]]
- [[06-Troubleshooting/Database-Recovery|Database Recovery]]
- [[06-Troubleshooting/NFS-Mount-Issues|NFS Mount Issues]]
Documentation & Templates
- [[07-Documentation/Existing-Docs-Index|Existing Documentation Index]] - 50+ markdown files
- [[08-Templates/Service-Addition-Template|Service Addition Template]]
- [[08-Templates/Troubleshooting-Template|Troubleshooting Template]]
- [[08-Templates/Quick-Reference-Cards|Quick Reference Cards]]
Critical Alerts
> [!danger] ELK Stack Offline
> Elasticsearch, Logstash, and Kibana are all scaled to 0/0 replicas. No centralized logging is operational, so the cluster is running blind.
>
> 📍 [[06-Troubleshooting/ELK-Stack-Offline|Investigation Guide]]
> [!warning] Storage Near Capacity
> Multiple NFS mounts are at 92% capacity (254GB remaining of 3TB), which puts services at risk of failure.
>
> 📍 [[05-Storage/Storage-Critical-Warning|Capacity Planning]]
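A capacity check along the following lines could back the storage warning. This is only a minimal sketch: the helper names (`usage_percent`, `is_near_capacity`, `check_mount`) and the 90% threshold are assumptions made for illustration, not taken from the vault's scripts, and the mount path passed to `check_mount` would need to be the actual NFS mount point.

```python
import shutil


def usage_percent(total_bytes: int, used_bytes: int) -> float:
    """Return used space as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def is_near_capacity(total_bytes: int, used_bytes: int,
                     threshold: float = 90.0) -> bool:
    """True when usage meets or exceeds the (assumed) alert threshold."""
    return usage_percent(total_bytes, used_bytes) >= threshold


def check_mount(path: str, threshold: float = 90.0) -> bool:
    """Check a mounted filesystem; `path` is a placeholder for an NFS mount."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return is_near_capacity(usage.total, usage.used, threshold)
```

With the figures quoted in the warning, a mount at 92% usage would trip the assumed 90% threshold, while one at 50% would not.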
> [!warning] Missing Swarmpit Agent
> The Swarmpit agent is running 4/5 global instances, so one node is not reporting.
>
> 📍 [[06-Troubleshooting/Known-Issues#swarmpit-agent-missing|Troubleshooting]]
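Replica conditions such as `0/0` (ELK) and `4/5` (Swarmpit agent) can be spotted by parsing the replica column of `docker service ls`. The sketch below assumes output produced with `--format '{{.Name}} {{.Replicas}}'`; the function names and the service names in the usage example are hypothetical and not taken from the actual stacks.

```python
import subprocess


def parse_replicas(line: str) -> tuple[str, int, int]:
    """Split 'name running/desired [extra]' into (name, running, desired)."""
    name, rest = line.split(maxsplit=1)
    # Newer Docker versions may append text such as "(max 1 per node)".
    running, desired = rest.split()[0].split("/")
    return name, int(running), int(desired)


def classify(running: int, desired: int) -> str:
    """Label a service by comparing running tasks with desired tasks."""
    if desired == 0:
        return "scaled-to-zero"
    if running < desired:
        return "degraded"
    return "ok"


def find_problems(lines: list[str]) -> dict[str, str]:
    """Return {service_name: status} for every service that is not 'ok'."""
    problems = {}
    for line in lines:
        if not line.strip():
            continue
        name, running, desired = parse_replicas(line)
        status = classify(running, desired)
        if status != "ok":
            problems[name] = status
    return problems


def list_service_lines() -> list[str]:
    """Query the Swarm manager; must run on a node with manager access."""
    result = subprocess.run(
        ["docker", "service", "ls", "--format", "{{.Name}} {{.Replicas}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```

For example, given the hypothetical lines `homenet1_elasticsearch 0/0`, `swarmpit_agent 4/5`, and `monitoring_grafana 1/1`, `find_problems` would flag the first as `scaled-to-zero` and the second as `degraded`.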
Architecture Diagrams
```mermaid
graph TB
    subgraph "Proxmox Hypervisors"
        PVE1[Proxmox-1<br/>100.1.100.10]
        PVE2[Proxmox-2<br/>100.1.100.15]
    end
    subgraph "Docker Swarm Cluster"
        MGR[Node 201 Manager<br/>8 CPU, 10GB RAM<br/>Databases & Logging]
        W1[Node 202 Worker<br/>12 CPU, 16GB RAM<br/>Media & Photos]
        W2[Node 203 Worker<br/>4 CPU, 3GB RAM<br/>Surveillance]
        W3[Node 204 Worker<br/>4 CPU, 4GB RAM<br/>Dashboards]
        W4[Node 205 Worker<br/>8 CPU, 8GB RAM<br/>General]
    end
    subgraph "Infrastructure Services"
        DNS[Pi DNS<br/>100.1.100.11<br/>AdGuard/Pi-hole]
        NFS[OMV NFS Server<br/>100.1.100.199<br/>3TB Storage]
    end
    PVE1 --> MGR
    PVE1 --> W1
    PVE2 --> W2
    PVE2 --> W3
    PVE2 --> W4
    MGR -.-> DNS
    W1 -.-> DNS
    W2 -.-> DNS
    W3 -.-> DNS
    W4 -.-> DNS
    MGR --> NFS
    W1 --> NFS
    W2 --> NFS
    W3 --> NFS
    W4 --> NFS
```
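The node sizes and hypervisor placement shown in the diagram can be tallied with a few lines of Python. This is an illustrative sketch only: the `NODES` structure and function names are hypothetical, and the figures are copied from the diagram labels rather than queried from the cluster.

```python
# Node specs and hypervisor placement as labelled in the diagram.
NODES = {
    "Node 201": {"role": "manager", "host": "Proxmox-1", "cpus": 8, "ram_gb": 10},
    "Node 202": {"role": "worker", "host": "Proxmox-1", "cpus": 12, "ram_gb": 16},
    "Node 203": {"role": "worker", "host": "Proxmox-2", "cpus": 4, "ram_gb": 3},
    "Node 204": {"role": "worker", "host": "Proxmox-2", "cpus": 4, "ram_gb": 4},
    "Node 205": {"role": "worker", "host": "Proxmox-2", "cpus": 8, "ram_gb": 8},
}


def cluster_totals(nodes: dict) -> tuple[int, int]:
    """Return (total CPUs, total RAM in GB) across all nodes."""
    cpus = sum(n["cpus"] for n in nodes.values())
    ram = sum(n["ram_gb"] for n in nodes.values())
    return cpus, ram


def totals_by_host(nodes: dict) -> dict[str, tuple[int, int]]:
    """Return {hypervisor: (CPUs, RAM in GB)} allocated to its Swarm nodes."""
    totals: dict[str, tuple[int, int]] = {}
    for spec in nodes.values():
        cpus, ram = totals.get(spec["host"], (0, 0))
        totals[spec["host"]] = (cpus + spec["cpus"], ram + spec["ram_gb"])
    return totals


print(cluster_totals(NODES))   # (36, 41)
print(totals_by_host(NODES))   # {'Proxmox-1': (20, 26), 'Proxmox-2': (16, 15)}
```

By these labels the cluster has 36 vCPUs and 41GB RAM in total, with Proxmox-1 carrying the two largest nodes.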
Recently Updated
- [[06-Troubleshooting/ELK-Stack-Offline|ELK Stack Offline]] - 2026-01-11
- [[05-Storage/Storage-Critical-Warning|Storage Critical Warning]] - 2026-01-11
- [[02-Services/Critical-Services-Offline|Critical Services Offline]] - 2026-01-11
External Resources
- Repository: `/home/cjustin/homenet-docker-services/`
- Primary Documentation: `CLAUDE.md`, `README.md`
- Grafana: http://100.1.100.201:3010
- Prometheus: http://100.1.100.201:9090
- Swarmpit: (Cluster management UI)
- Traefik Dashboard: Port 8080
Last Research Date: 2026-01-11
Documentation Version: 1.0
Vault Created: 2026-01-11