# Known Issues & Active Problems
> [!warning] Infrastructure Health Summary
> **Critical Issues:** 2
> **Warnings:** 5
> **Healthy:** Cluster operational, most services running
## Critical Issues

### 1. ELK Stack Completely Offline 🔴
**Severity:** Critical
**Impact:** No centralized logging infrastructure
**Services Affected:** Elasticsearch, Logstash, Logstash-cacher, Kibana
> [!danger] Blind Operations Mode
> Without the ELK stack, there is no centralized log aggregation or analysis. Troubleshooting requires direct container/service log access.
**Current Status:**
- Elasticsearch: 0/0 replicas (scaled down)
- Logstash: 0/0 replicas (scaled down)
- Logstash-cacher: 0/0 replicas (GELF endpoint offline)
- Kibana: 0/0 replicas (no UI access)
**Investigation Points:**
- Why were these services scaled to zero?
- Disk space issues on node 201?
- Performance problems leading to manual shutdown?
- Resource constraints?
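The disk-space hypothesis is quick to rule in or out; for example, on node 201 (`ssh 100.1.100.201` first):

```bash
# Space held by Docker images, containers, build cache, and volumes
docker system df

# Overall filesystem pressure on the node
df -h
```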
**Workarounds:**

```bash
# View individual service logs
docker service logs -f <service-name>

# Check container logs on specific node
ssh 100.1.100.202
docker logs <container-id>

# Use Grafana for metrics (partial visibility)
http://100.1.100.201:3010
```
Full Investigation: [[ELK-Stack-Offline|ELK Stack Offline Troubleshooting]]
### 2. Storage Near Critical Capacity 🔴

**Severity:** Critical
**Impact:** Risk of service failures, database corruption, failed writes
**Affected Mounts:** Multiple NFS mounts at 92% capacity
**Current Utilization:**

| Mount | Total | Used | Available | Use% |
|-------|-------|------|-----------|------|
| /nfs_data | 3.0T | 2.8T | 254G | 92% |
| /nfs_media | 3.0T | 2.8T | 254G | 92% |
| /nfs_media_lib | 3.0T | 2.8T | 254G | 92% |
| /nfs_service | 3.0T | 2.8T | 254G | 92% |
| /nfs_personal | 503G | 379G | 124G | 76% |
> [!caution] Shared Storage
> Multiple mount points share the same underlying 3TB volume on the OMV server (100.1.100.199). Effective available space is ~254GB total.
**Immediate Risks:**
- Database write failures
- Plex metadata corruption
- Failed Docker volume creation
- Service crash loops
- Photo upload failures
**Action Items:**

1. Identify largest consumers: `du -h --max-depth=1 /nfs_data | sort -hr`
2. Clean up old logs: `./sh-delete-temp.sh`
3. Archive old media/photos
4. Expand OMV storage pool
5. Implement cleanup automation
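Item 5 could start as a small cron-able helper; a minimal sketch, assuming rotated logs live under a path such as `/nfs_data/logs` (both the path and the 14-day window are assumptions — point it at whatever `du` flags as the largest consumers):

```bash
# Hypothetical cleanup helper: print, then delete, *.log* files older
# than DAYS days under DIR. Defaults are assumptions for this cluster.
cleanup_old_logs() {
    dir="$1"
    days="${2:-14}"
    # -print before -delete leaves an audit trail in cron mail/logs
    find "$dir" -type f -name '*.log*' -mtime +"$days" -print -delete
}

# Example invocation (path is an assumption):
# cleanup_old_logs /nfs_data/logs 14
```

Wiring this into cron (or a systemd timer) would close the loop so the 92% mounts stop creeping upward between manual cleanups.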
Full Analysis: [[05-Storage/Storage-Critical-Warning|Storage Capacity Planning]]
## Warnings

### 3. Swarmpit Agent Missing 🟡

**Severity:** Warning
**Impact:** Incomplete cluster visibility in Swarmpit UI
**Services Affected:** Swarmpit agent (4/5 global instances)
**Symptoms:**
- One node not reporting to Swarmpit
- Partial cluster metrics
- Possible node communication issues
**Troubleshooting:**

```bash
# Check which nodes have agent
docker service ps swarmpit_agent

# Check agent logs
docker service logs swarmpit_agent

# Verify network connectivity
docker network inspect swarmpit_net
```
**Investigation:**
- Which node is missing the agent?
- Network connectivity to swarmpit_net?
- Resource constraints preventing agent start?
### 4. Node Exporter Instances Down 🟡

**Severity:** Warning
**Impact:** Incomplete host metrics in Prometheus
**Services Affected:** node-exporter (mixed health)
**Current State:**
- Expected: 5/5 global instances (one per node)
- Actual: Some instances reporting DOWN in Prometheus
**Check Targets:**

1. Visit Prometheus: http://100.1.100.201:9090/targets
2. Identify which node-exporter instances are down
3. Check node connectivity
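Step 2 can also be done from the CLI via Prometheus' HTTP targets API; a sketch assuming `jq` is installed on the box you run it from:

```bash
# Print job, instance, and health for every target that is not "up",
# as tab-separated columns. Reads the /api/v1/targets JSON on stdin.
list_down_targets() {
    jq -r '.data.activeTargets[]
           | select(.health != "up")
           | [.labels.job, .labels.instance, .health] | @tsv'
}

# Usage:
# curl -s http://100.1.100.201:9090/api/v1/targets | list_down_targets
```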
**Troubleshooting:**

```bash
# Check global service status
docker service ps monitoring_node-exporter

# Check specific node
ssh <node-ip>
docker ps | grep node-exporter

# Restart service
docker service update --force monitoring_node-exporter
```
### 5. Exportarr Metrics Offline 🟡

**Severity:** Warning
**Impact:** No visibility into ARR application health (Radarr, Sonarr, etc.)
**Services Affected:** Exportarr instances in monitoring stack
**Current Status:**
- Swarm stack exportarr: 0/0 (all 4 instances scaled down)
- Docker Compose exportarr: ✅ Running on node 202
> [!note] Dual Deployment
> Exportarr runs via Docker Compose on node 202 for the ARR stack, but Swarm monitoring stack instances are offline.
**Investigation:**
- Why were Swarm exportarr instances scaled down?
- Are Compose exportarr metrics being scraped?
- Consolidate to single deployment method?
**Verify Compose Exportarr:**
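The verification commands did not survive on this page; a plausible reconstruction, run on node 202 (`ssh 100.1.100.202` first). The metrics port below is a placeholder assumption — check the compose file for the actual mapping:

```bash
# Are the Compose-managed exportarr containers up?
docker compose ps | grep -i exportarr || echo "no exportarr containers found"

# Spot-check one metrics endpoint (9707 is an assumed port)
curl -s http://localhost:9707/metrics | head
```

If the containers are healthy, also confirm Prometheus is actually scraping them on the targets page (http://100.1.100.201:9090/targets) — that answers the second investigation question above.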
### 6. Tdarr Exporter Completely Down 🟡

**Severity:** Warning
**Impact:** No transcode monitoring metrics
**Services Affected:** Tdarr exporter (0/0)
**Current Status:**
- Service scaled to 0/0
- Prometheus target DOWN
**Investigation:**
- Is Tdarr service itself running?
- Was exporter intentionally disabled?
- Performance issues with Tdarr?
### 7. Systemd Service Disabled 🟡

**Severity:** Warning (Operational)
**Impact:** Services won't auto-start on boot
**Service Affected:** homenet-stack.service

**Current Status:**
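The captured status output is missing here; the unit's state can be re-read at any time (unit name as above):

```bash
# is-enabled prints "disabled" and exits non-zero for a unit that will
# not start at boot, hence the guards to keep this snippet script-safe;
# is-active should still report "active" while the stack runs.
systemctl is-enabled homenet-stack.service || true
systemctl is-active homenet-stack.service || true
```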
**Behavior:**
- Stack is currently running (started manually)
- Will NOT auto-start on system boot
- Requires manual intervention after reboot
**Decision Required:**
- Enable systemd service for auto-start?
- Keep manual control for safety?
**To Enable:**
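The enable commands were lost from this page; if auto-start wins the decision above, a minimal sequence (root required):

```bash
# Register the unit for boot-time start; this does not restart the
# already-running stack.
sudo systemctl enable homenet-stack.service

# Verify -- should now print "enabled"
systemctl is-enabled homenet-stack.service || true
```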
## Resolved Issues

> [!success] Recently Fixed
> Track resolved issues here for historical reference.
None currently documented.
## Monitoring Dashboard

### Quick Health Check
```bash
# Overall service health
docker service ls | grep -v "0/0"

# Check critical services
docker service ps homenet1_mariadb
docker service ps homenet1_influxdb
docker service ps traefik_traefik

# Storage capacity
df -h | grep nfs

# Swarm cluster health
docker node ls
```
### Prometheus Alerts
Visit: http://100.1.100.201:9090/alerts
Expected alerts for current issues:
- Node exporter instances down
- High disk usage warnings
- Service unavailability (ELK stack)
### Grafana Dashboards
Visit: http://100.1.100.201:3010
Relevant dashboards:
- Node exporter metrics (host health)
- Docker swarm overview
- Storage utilization
- Service resource usage
## Issue Triage Process

### Priority Levels
**🔴 Critical (P0):**
- Service completely offline affecting core functionality
- Data loss risk
- Security vulnerability
- Storage >95% full

**🟡 Warning (P1):**
- Partial service degradation
- Missing metrics/monitoring
- Non-critical configuration issues
- Storage 85-95% full

**🟢 Info (P2):**
- Nice-to-have improvements
- Documentation gaps
- Performance optimization opportunities
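The storage thresholds above can be checked mechanically; a sketch that maps `df` output for the NFS mounts onto the P0/P1 bands (the mount filter `nfs` matches the naming used on this cluster):

```bash
# Tag each NFS mount with a priority from its Use%:
#   >95%  -> P0-critical, 85-95% -> P1-warning, otherwise OK
df -h | awk '/nfs/ {
    use = $5 + 0                      # "92%" + 0 strips the trailing %
    p = (use > 95) ? "P0-critical" : (use >= 85) ? "P1-warning" : "OK"
    printf "%-20s %3d%% %s\n", $6, use, p
}'
```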
### Escalation Path

1. **Detection:** Monitoring alerts, manual discovery, user reports
2. **Triage:** Assess severity and impact
3. **Investigation:** Use relevant troubleshooting runbook
4. **Resolution:** Implement fix and document
5. **Verification:** Confirm issue resolved
6. **Documentation:** Update this page and runbooks
## Common Troubleshooting Commands
```bash
# Service not starting
docker service ps <service> --no-trunc
docker service logs <service> -f

# Network issues
docker network inspect <network>
docker network ls

# Storage issues
df -h
du -sh /nfs_data/*
./sh-correct-mounts.sh

# Node issues
docker node ls
docker node inspect <node>
ssh <node-ip>

# Restart service
docker service update --force <service>

# Scale service
docker service scale <service>=1

# Stack redeployment
docker stack deploy -c <stack-file>.yml <stack-name> --with-registry-auth
```
## Related Documentation
- [[ELK-Stack-Offline|ELK Stack Investigation]]
- [[05-Storage/Storage-Critical-Warning|Storage Capacity Warning]]
- [[Service-Restart-Runbook|Service Restart Procedures]]
- [[Database-Recovery|Database Recovery Guide]]
- [[NFS-Mount-Issues|NFS Troubleshooting]]
- [[04-Monitoring/Service-Health|Service Health Monitoring]]
**Last Updated:** 2026-01-11
**Active Critical Issues:** 2 (ELK Stack, Storage)
**Active Warnings:** 5
**Next Review:** Daily until critical issues resolved