
# Known Issues & Active Problems

> [!warning] Infrastructure Health Summary
> Critical Issues: 2 | Warnings: 5 | Healthy: Cluster operational, most services running

## Critical Issues

### 1. ELK Stack Completely Offline 🔴

Severity: Critical
Impact: No centralized logging infrastructure
Services Affected: Elasticsearch, Logstash, Logstash-cacher, Kibana

> [!danger] Blind Operations Mode
> Without the ELK stack, there is no centralized log aggregation or analysis. Troubleshooting requires direct container/service log access.

Current Status:

- Elasticsearch: 0/0 replicas (scaled down)
- Logstash: 0/0 replicas (scaled down)
- Logstash-cacher: 0/0 replicas (GELF endpoint offline)
- Kibana: 0/0 replicas (no UI access)

Investigation Points:

- Why were these services scaled to zero?
- Disk space issues on node 201?
- Performance problems leading to manual shutdown?
- Resource constraints?
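The first investigation point above can be checked in one pass: list every Swarm service currently sitting at zero replicas. This is a minimal sketch; `svc_list` holds sample output standing in for a live `docker service ls` (which requires a manager node), so swap in the commented-out real call when running on the cluster.

```shell
# Sketch: find all services scaled to 0/0. The service names below are
# sample data based on the status list above, not a live query.
# svc_list=$(docker service ls --format '{{.Name}} {{.Replicas}}')
svc_list='elk_elasticsearch 0/0
elk_logstash 0/0
elk_kibana 0/0
monitoring_grafana 1/1'

# Keep only the names whose replica column reads exactly "0/0"
scaled_down=$(echo "$svc_list" | awk '$2 == "0/0" { print $1 }')
echo "$scaled_down"
```

Anything this prints that you did not intentionally scale down is a candidate for the "why zero?" question.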

Workarounds:

```shell
# View individual service logs
docker service logs -f <service-name>

# Check container logs on specific node
ssh 100.1.100.202
docker logs <container-id>

# Use Grafana for metrics (partial visibility)
# http://100.1.100.201:3010
```

Full Investigation: [[ELK-Stack-Offline|ELK Stack Offline Troubleshooting]]


### 2. Storage Near Critical Capacity 🔴

Severity: Critical
Impact: Risk of service failures, database corruption, failed writes
Affected Mounts: Multiple NFS mounts at 92% capacity

Current Utilization:

| Mount | Total | Used | Available | Use% |
|-------|-------|------|-----------|------|
| /nfs_data | 3.0T | 2.8T | 254G | 92% |
| /nfs_media | 3.0T | 2.8T | 254G | 92% |
| /nfs_media_lib | 3.0T | 2.8T | 254G | 92% |
| /nfs_service | 3.0T | 2.8T | 254G | 92% |
| /nfs_personal | 503G | 379G | 124G | 76% |

> [!caution] Shared Storage
> Multiple mount points share the same underlying 3TB volume on OMV server (100.1.100.199). Effective available space is ~254GB total.

Immediate Risks:

- Database write failures
- Plex metadata corruption
- Failed Docker volume creation
- Service crash loops
- Photo upload failures

Action Items:

1. Identify largest consumers: `du -h --max-depth=1 /nfs_data | sort -hr`
2. Clean up old logs: `./sh-delete-temp.sh`
3. Archive old media/photos
4. Expand OMV storage pool
5. Implement cleanup automation
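Action item 1 can be rehearsed safely before pointing it at the shared volume. The sketch below runs the same `du | sort -hr` pipeline against a throwaway directory with one large and one small subtree, so the ordering behavior is visible without touching /nfs_data.

```shell
# Sketch of the "identify largest consumers" step against a throwaway
# directory (safe to run anywhere); substitute /nfs_data on the real host.
tmp=$(mktemp -d)
mkdir -p "$tmp/big" "$tmp/small"
head -c 1048576 /dev/zero > "$tmp/big/blob"   # ~1 MiB
head -c 1024    /dev/zero > "$tmp/small/blob" # ~1 KiB

# Largest directories first, human-readable sizes
report=$(du -h --max-depth=1 "$tmp" | sort -hr)
printf '%s\n' "$report"
rm -rf "$tmp"
```

The first non-total line is the subtree to investigate (or archive) first; rerun with a deeper `--max-depth` to drill in. Note `--max-depth` is the GNU `du` spelling; BSD `du` uses `-d`.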

Full Analysis: [[05-Storage/Storage-Critical-Warning|Storage Capacity Planning]]


## Warnings

### 3. Swarmpit Agent Missing 🟡

Severity: Warning
Impact: Incomplete cluster visibility in Swarmpit UI
Services Affected: Swarmpit agent (4/5 global instances)

Symptoms:

- One node not reporting to Swarmpit
- Partial cluster metrics
- Possible node communication issues

Troubleshooting:

```shell
# Check which nodes have agent
docker service ps swarmpit_agent

# Check agent logs
docker service logs swarmpit_agent

# Verify network connectivity
docker network inspect swarmpit_net
```

Investigation:

- Which node is missing the agent?
- Network connectivity to swarmpit_net?
- Resource constraints preventing agent start?
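The "which node is missing the agent?" question is a set difference: all cluster nodes minus the nodes carrying an agent task. A sketch under sample data (the node names are placeholders, standing in for `docker node ls` and `docker service ps swarmpit_agent` output on the real cluster):

```shell
# Sketch: node names below are hypothetical sample data, not live output.
all_nodes='node201
node202
node203
node204
node205'
agent_nodes='node201
node202
node203
node205'

# Classic POSIX set-difference: agent nodes appear 3x and are dropped by
# uniq -u; a node only in all_nodes appears once and survives.
missing=$(printf '%s\n%s\n%s\n' "$agent_nodes" "$agent_nodes" "$all_nodes" \
  | sort | uniq -u)
echo "$missing"
```

On the live cluster, feed the two variables from `docker node ls --format '{{.Hostname}}'` and the Node column of `docker service ps swarmpit_agent`.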


### 4. Node Exporter Instances Down 🟡

Severity: Warning
Impact: Incomplete host metrics in Prometheus
Services Affected: node-exporter (mixed health)

Current State:

- Expected: 5/5 global instances (one per node)
- Actual: Some instances reporting DOWN in Prometheus

Check Targets:

1. Visit Prometheus: http://100.1.100.201:9090/targets
2. Identify which node-exporter instances are down
3. Check node connectivity
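Step 2 can also be done headlessly via the Prometheus targets API instead of the UI. This sketch parses a canned, simplified response so it runs without reaching 100.1.100.201; the commented-out `curl` is the real endpoint. (For real JSON, `jq` is the sturdier tool; the grep/cut pipeline here only illustrates the idea.)

```shell
# Real call, commented out so the sketch is self-contained:
# targets_json=$(curl -s http://100.1.100.201:9090/api/v1/targets)

# Simplified sample of the activeTargets entries (one object per line):
targets_json='{"labels":{"instance":"100.1.100.201:9100"},"health":"up"}
{"labels":{"instance":"100.1.100.203:9100"},"health":"down"}'

# Keep lines whose health is "down", then pull out the instance address
down=$(echo "$targets_json" | grep '"health":"down"' \
  | grep -o '"instance":"[^"]*"' | cut -d'"' -f4)
echo "$down"
```

Each address printed is a node whose node-exporter needs the connectivity check in step 3.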

Troubleshooting:

```shell
# Check global service status
docker service ps monitoring_node-exporter

# Check specific node
ssh <node-ip>
docker ps | grep node-exporter

# Restart service
docker service update --force monitoring_node-exporter
```


### 5. Exportarr Metrics Offline 🟡

Severity: Warning
Impact: No visibility into ARR application health (Radarr, Sonarr, etc.)
Services Affected: Exportarr instances in monitoring stack

Current Status:

- Swarm stack exportarr: 0/0 (all 4 instances scaled down)
- Docker Compose exportarr: ✅ Running on node 202

> [!note] Dual Deployment
> Exportarr runs via Docker Compose on node 202 for the ARR stack, but Swarm monitoring stack instances are offline.

Investigation:

- Why were Swarm exportarr instances scaled down?
- Are Compose exportarr metrics being scraped?
- Consolidate to single deployment method?

Verify Compose Exportarr:

```shell
ssh 100.1.100.202
docker ps | grep exportarr
curl localhost:9707/metrics  # Radarr exporter
```


### 6. Tdarr Exporter Completely Down 🟡

Severity: Warning
Impact: No transcode monitoring metrics
Services Affected: Tdarr exporter (0/0)

Current Status:

- Service scaled to 0/0
- Prometheus target DOWN

Investigation:

- Is Tdarr service itself running?
- Was exporter intentionally disabled?
- Performance issues with Tdarr?


### 7. Systemd Service Disabled 🟡

Severity: Warning (Operational)
Impact: Services won't auto-start on boot
Service Affected: homenet-stack.service

Current Status:

```text
Loaded: loaded
Active: inactive (dead)
Enabled: disabled
```

Behavior:

- Stack is currently running (started manually)
- Will NOT auto-start on system boot
- Requires manual intervention after reboot

Decision Required:

- Enable systemd service for auto-start?
- Keep manual control for safety?

To Enable:

```shell
sudo systemctl enable homenet-stack.service
sudo systemctl start homenet-stack.service
```
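Whichever way the decision goes, a quick boot-survival check makes the current state explicit. A sketch (the wording of the messages is ours; `is-enabled` exits non-zero for a disabled or unknown unit, which drives the fallback branch):

```shell
# Sketch: report whether the stack will come back by itself after a reboot.
unit=homenet-stack.service
if systemctl is-enabled --quiet "$unit" 2>/dev/null; then
  status_msg="$unit will auto-start on boot"
else
  # Disabled unit, unknown unit, or no systemctl on this host
  status_msg="$unit requires a manual start after reboot"
fi
echo "$status_msg"
```

Dropping this into a daily health script keeps the "manual start after reboot" caveat from being forgotten.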


## Resolved Issues

> [!success] Recently Fixed
> Track resolved issues here for historical reference.

None currently documented.


## Monitoring Dashboard

### Quick Health Check

```shell
# Overall service health (hide services scaled to zero)
docker service ls | grep -v "0/0"

# Check critical services
docker service ps homenet1_mariadb
docker service ps homenet1_influxdb
docker service ps traefik_traefik

# Storage capacity
df -h | grep nfs

# Swarm cluster health
docker node ls
```

### Prometheus Alerts

Visit: http://100.1.100.201:9090/alerts

Expected alerts for current issues:

- Node exporter instances down
- High disk usage warnings
- Service unavailability (ELK stack)

### Grafana Dashboards

Visit: http://100.1.100.201:3010

Relevant dashboards:

- Node exporter metrics (host health)
- Docker swarm overview
- Storage utilization
- Service resource usage


## Issue Triage Process

### Priority Levels

🔴 Critical (P0):

- Service completely offline affecting core functionality
- Data loss risk
- Security vulnerability
- Storage >95% full

🟡 Warning (P1):

- Partial service degradation
- Missing metrics/monitoring
- Non-critical configuration issues
- Storage 85-95% full

🟢 Info (P2):

- Nice-to-have improvements
- Documentation gaps
- Performance optimization opportunities
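The storage thresholds above (>95% critical, 85-95% warning) are mechanical enough to script. A sketch that classifies each NFS mount; the sample lines stand in for real `df` output with the percent sign already stripped (e.g. from GNU `df --output=target,pcent`):

```shell
# Sketch: triage mounts by use%. Sample data mirrors the utilization table
# in the storage issue above; swap in live df output on the real host.
df_sample='/nfs_data 92
/nfs_media 96
/nfs_personal 76'

triage=$(echo "$df_sample" | while read -r mount pct; do
  if   [ "$pct" -gt 95 ]; then echo "P0 $mount"   # critical: >95% full
  elif [ "$pct" -ge 85 ]; then echo "P1 $mount"   # warning: 85-95% full
  else                         echo "P2 $mount"   # info: below thresholds
  fi
done)
echo "$triage"
```

Wired into cron, anything printed as P0 becomes an immediate action item rather than a dashboard discovery.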

### Escalation Path

  1. Detection: Monitoring alerts, manual discovery, user reports
  2. Triage: Assess severity and impact
  3. Investigation: Use relevant troubleshooting runbook
  4. Resolution: Implement fix and document
  5. Verification: Confirm issue resolved
  6. Documentation: Update this page and runbooks

### Common Troubleshooting Commands

```shell
# Service not starting
docker service ps <service> --no-trunc
docker service logs <service> -f

# Network issues
docker network inspect <network>
docker network ls

# Storage issues
df -h
du -sh /nfs_data/*
./sh-correct-mounts.sh

# Node issues
docker node ls
docker node inspect <node>
ssh <node-ip>

# Restart service
docker service update --force <service>

# Scale service
docker service scale <service>=1

# Stack redeployment
docker stack deploy -c <stack-file>.yml <stack-name> --with-registry-auth
```

- [[ELK-Stack-Offline|ELK Stack Investigation]]
- [[05-Storage/Storage-Critical-Warning|Storage Capacity Warning]]
- [[Service-Restart-Runbook|Service Restart Procedures]]
- [[Database-Recovery|Database Recovery Guide]]
- [[NFS-Mount-Issues|NFS Troubleshooting]]
- [[04-Monitoring/Service-Health|Service Health Monitoring]]

Last Updated: 2026-01-11
Active Critical Issues: 2 (ELK Stack, Storage)
Active Warnings: 5
Next Review: Daily until critical issues resolved