# ELK Stack Offline Investigation

> [!danger] Critical Infrastructure Offline
> **Status:** 🔴 CRITICAL
> **Impact:** No centralized logging - operating blind
> **Services Affected:** Elasticsearch, Logstash, Logstash-cacher, Kibana
> **Duration:** Unknown (currently 0/0 replicas)

## Current Status

- **Elasticsearch:** ❌ 0/0 replicas (scaled down)
- **Logstash:** ❌ 0/0 replicas (scaled down)
- **Logstash-cacher:** ❌ 0/0 replicas (scaled down)
- **Kibana:** ❌ 0/0 replicas (scaled down)

All ELK stack services are intentionally scaled to zero, suggesting a deliberate shutdown rather than a crash.
## Impact Analysis

### What We've Lost

**No Centralized Log Aggregation:**
- ❌ Cannot view logs from all services in one place
- ❌ Cannot search logs across infrastructure
- ❌ Cannot correlate events between services
- ❌ No log-based alerting
- ❌ No log retention/archival

**No GELF Endpoint:**
- ❌ Logstash-cacher (UDP 12201) offline
- ❌ Services configured with GELF logging lose logs
- ❌ Container logs not being processed

**No Log Visualization:**
- ❌ Kibana dashboard unavailable
- ❌ Cannot create log queries/visualizations
- ❌ No historical log analysis
## Current Workarounds

**Viewing Service Logs:**

```bash
# Direct service log access
docker service logs -f <service-name>

# Specific container logs on a node
ssh <node-ip>
docker logs <container-id>

# Follow logs from multiple services
docker service logs -f homenet1_mariadb &
docker service logs -f homenet4_plex &
```

**Use Grafana for Metrics:**
- Grafana still operational: http://100.1.100.201:3010
- Can monitor service health via metrics
- Limited to numerical data (not log content)
## Investigation Steps

### 1. Determine Why Services Were Scaled Down

Check service history:

```bash
# View service update history
docker service inspect homenet1_elasticsearch --format '{{json .UpdatedAt}}'
docker service inspect homenet1_logstash --format '{{json .UpdatedAt}}'

# Check for recent stack deployments
docker stack ps homenet1 --format "{{.Name}} {{.CurrentState}} {{.Error}}"
```

Review git history:

```bash
cd /home/cjustin/homenet-docker-services
git log --oneline --grep="elastic\|logstash" --since="2 weeks ago"
git show <commit-hash>
```

Possible reasons for shutdown:
1. **Storage capacity** - Elasticsearch indices filling the disk
2. **Performance issues** - High resource usage
3. **Corruption** - Index or lock-file corruption
4. **Intentional maintenance** - Planned shutdown
5. **Resource constraints** - Node 201 overloaded
### 2. Check Elasticsearch Data

Inspect data directory:

```bash
ssh 100.1.100.201

# Check Elasticsearch data size
du -sh /nfs_data/elasticsearch/

# Look for corruption indicators
ls -la /nfs_data/elasticsearch/nodes/

# Check for disk space issues
df -h /nfs_data
```

Expected issues:
- Large index files (>50GB)
- Corrupted indices
- Failed snapshots
- Lock files preventing startup
### 3. Check Logstash Configuration

Verify pipeline configurations:

```bash
# Check Logstash config
ls -la /homenet_config/logstash/config/
cat /homenet_config/logstash/config/pipelines.yml

# Verify pipeline files
ls -la /homenet_config/logstash/pipeline/
```

Common issues:
- Syntax errors in pipeline files
- Elasticsearch output configuration
- GELF input configuration
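For comparison while auditing those files, a minimal GELF-to-Elasticsearch pipeline looks roughly like this. This is a hypothetical reference sketch, not the actual config: the `elasticsearch` hostname and index pattern are assumptions, and the real files in `/homenet_config/logstash/pipeline/` may differ.

```ruby
# Hypothetical minimal pipeline for reference - compare the real
# pipeline files against these input/output blocks
input {
  gelf {
    port => 12201    # must match the UDP port services send GELF to
  }
}
output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]   # assumed service hostname
    index => "logstash-%{+YYYY.MM.dd}"       # daily indices
  }
}
```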
### 4. Review Resource Usage

Check whether ELK was consuming too many resources:

```bash
# Node 201 current resource usage
ssh 100.1.100.201
free -h
df -h

# Historical metrics from Prometheus
# Visit: http://100.1.100.201:9090
# Query: container_memory_usage_bytes{name=~".*elastic.*"}
```
## Restoration Plan

### Phase 1: Pre-Flight Checks

Before restarting, verify:

- [ ] **Storage capacity:** at least ~20 GB free on `/nfs_data` (`df -h /nfs_data`)
- [ ] **Memory available on node 201:** at least ~4 GB free (`free -h`)
- [ ] **Configuration validity:** Logstash pipeline files present and syntactically valid
- [ ] **Network connectivity:** node 201 reachable and the overlay network healthy
### Phase 2: Start Elasticsearch First

Elasticsearch must start before Logstash:

```bash
# Scale up Elasticsearch
docker service scale homenet1_elasticsearch=1

# Monitor startup (takes 1-2 minutes)
docker service logs -f homenet1_elasticsearch
# Wait for the "started" message in the log output
```
Verify Elasticsearch health:

```bash
# Check cluster health
curl http://100.1.100.201:9200/_cluster/health?pretty
# Expected response:
# {
#   "status": "green" or "yellow",
#   "number_of_nodes": 1
# }

# List indices
curl http://100.1.100.201:9200/_cat/indices?v
```
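The same health check can be scripted. A minimal Python sketch (standard library only) that separates the pass/fail decision from the HTTP fetch, so the decision logic is testable offline; the `cluster_ok` helper name is our own:

```python
# Decide whether the cluster is usable from the /_cluster/health response.
import json
from urllib.request import urlopen

def cluster_ok(health_json: str) -> bool:
    """Return True if cluster status is green or yellow (usable)."""
    status = json.loads(health_json).get("status")
    return status in ("green", "yellow")

# Fetch is kept separate (and commented out) so the logic runs offline:
# body = urlopen("http://100.1.100.201:9200/_cluster/health").read()
# print(cluster_ok(body))
```

On a single-node cluster, `yellow` is expected and acceptable, since replica shards can never be assigned.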
If Elasticsearch fails to start:

```bash
# Check logs for errors
docker service logs homenet1_elasticsearch --tail 100

# Common errors:
# - "failed to obtain node locks" → stale lock file (see Recovery from Corruption)
# - "CorruptIndexException" → corrupted index (see Recovery from Corruption)
# - "OutOfMemoryError" → increase heap size or free RAM
# - "flood stage disk watermark exceeded" → disk full, clean up storage (see Storage Warning)
```
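The triage table above can be sketched as a small matcher for scripting repeated checks. The error substrings are examples, not an exhaustive list, and the `diagnose` helper is our own:

```python
# Map known Elasticsearch startup-failure signatures to a likely cause.
# Substrings are illustrative; extend as new failure modes appear.
CAUSES = [
    ("failed to obtain node locks", "stale lock file - see Recovery from Corruption"),
    ("CorruptIndexException", "index corruption - see Recovery from Corruption"),
    ("OutOfMemoryError", "increase heap size or free RAM"),
    ("flood stage disk watermark", "disk full - clean up storage"),
]

def diagnose(log_text: str) -> str:
    """Return the first matching likely cause, or a fallback message."""
    for needle, cause in CAUSES:
        if needle in log_text:
            return cause
    return "unknown - inspect full logs"
```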
### Phase 3: Start Logstash

After Elasticsearch is healthy:

```bash
# Scale up Logstash
docker service scale homenet1_logstash=1
docker service scale homenet1_logstash-cacher=1

# Monitor startup
docker service logs -f homenet1_logstash

# Verify GELF input
netstat -uln | grep 12201  # Should show a UDP listener
```
Test GELF endpoint:

```bash
# Send a test message (-w1 closes nc after one second)
echo '{"version":"1.1","host":"test","short_message":"test"}' | \
  nc -u -w1 100.1.100.201 12201
```
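The same test message can be sent from Python, which makes it easy to reuse in scripts. A minimal sketch using only the standard library; the `send_gelf` helper name is our own:

```python
# Build a minimal GELF 1.1 payload and send it as one UDP datagram.
import json
import socket

def send_gelf(host: str, port: int, short_message: str) -> bytes:
    """Send a GELF 1.1 message over UDP; returns the raw payload sent."""
    payload = json.dumps({
        "version": "1.1",
        "host": socket.gethostname(),
        "short_message": short_message,
    }).encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload

# Usage against the logstash-cacher endpoint:
# send_gelf("100.1.100.201", 12201, "test")
```

Note that uncompressed GELF over UDP is limited to one datagram per message; large messages need GELF chunking, which this sketch does not implement.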
### Phase 4: Start Kibana

After Logstash is running:

```bash
# Scale up Kibana
docker service scale homenet4_kibana=1

# Monitor startup
docker service logs -f homenet4_kibana
# Wait for "Server running at http://0.0.0.0:5601"
```
Access Kibana at http://100.1.100.201:5601.
Create index pattern:
1. Navigate to Stack Management → Index Patterns
2. Create pattern: logstash-*
3. Select timestamp field: @timestamp
4. Navigate to Discover to view logs
### Phase 5: Verification

Confirm full functionality:

- **Elasticsearch:**
  - [ ] Cluster health GREEN or YELLOW
  - [ ] Indices visible and searchable
  - [ ] New indices being created
- **Logstash:**
  - [ ] No errors in service logs
  - [ ] GELF port listening
  - [ ] Processing logs (check Elasticsearch indices)
- **Kibana:**
  - [ ] UI accessible
  - [ ] Index pattern created
  - [ ] Logs visible in Discover
- **End-to-end test:**
  - [ ] Send a test GELF message and confirm it appears in Kibana Discover
## Recovery from Corruption

### Elasticsearch Data Corruption

If Elasticsearch fails to start with a lock or corruption error:

```bash
# Backup current state
ssh 100.1.100.201
tar -czf /tmp/elasticsearch-backup-$(date +%Y%m%d).tar.gz /nfs_data/elasticsearch/

# Remove the stale node lock file
rm /nfs_data/elasticsearch/nodes/0/node.lock
# Note: Path may vary, check the error message

# Restart Elasticsearch
docker service update --force homenet1_elasticsearch
```
If indices are corrupted:

```bash
# Delete corrupted indices
curl -X DELETE http://100.1.100.201:9200/<index-name>

# Or delete all old indices (>30 days) with Curator
# (Curator v3-style CLI; Curator 4+ uses YAML action files instead)
curator delete indices --older-than 30 --time-unit days --timestring '%Y.%m.%d'

# Elasticsearch will create new indices automatically as logs arrive
```
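If Curator isn't available, the same "older than N days" selection can be computed directly. A sketch assuming daily index names matching the `--timestring '%Y.%m.%d'` pattern above; the `indices_older_than` helper name is our own:

```python
# Pick logstash-YYYY.MM.dd index names older than a retention window.
from datetime import date, timedelta

def indices_older_than(names, days, today=None):
    """Return index names whose date suffix is more than `days` old."""
    today = today or date.today()
    cutoff = today - timedelta(days=days)
    old = []
    for name in names:
        try:
            suffix = name.split("logstash-", 1)[1]
            y, m, d = (int(p) for p in suffix.split("."))
        except (IndexError, ValueError):
            continue  # skip names that don't match the daily pattern
        if date(y, m, d) < cutoff:
            old.append(name)
    return old

# The resulting names could then be fed to the DELETE curl command above.
```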
### Complete Data Reset

If restoration fails, nuclear option:

> [!warning] Data Loss
> This will delete ALL historical logs. Only use as a last resort.

```bash
# Stop Elasticsearch
docker service scale homenet1_elasticsearch=0

# Backup and remove data
ssh 100.1.100.201
mv /nfs_data/elasticsearch /nfs_data/elasticsearch.old.$(date +%Y%m%d)
mkdir /nfs_data/elasticsearch
chown -R 1000:1000 /nfs_data/elasticsearch

# Restart Elasticsearch (will initialize fresh)
docker service scale homenet1_elasticsearch=1
```
## Performance Optimization

### Elasticsearch Tuning

Reduce memory usage by editing the docker-compose or stack file:
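The stack file itself isn't reproduced here; a typical compose-style fragment capping the JVM heap might look like the following. The service name, heap size, and memory limit are assumptions to adjust for node 201's actual capacity:

```yaml
services:
  elasticsearch:
    environment:
      # Cap the JVM heap; keep Xms equal to Xmx.
      # Leave headroom beyond the heap for the OS filesystem cache.
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    deploy:
      resources:
        limits:
          memory: 1g   # container limit above the heap to avoid OOM kills
```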
Index lifecycle management:

```bash
# Delete indices for specific dates beyond the retention window
# (the delete-index API has no max_age parameter, and deleting
#  'logstash-*' outright would remove ALL logstash indices)
curl -X DELETE 'http://100.1.100.201:9200/logstash-<YYYY.MM.dd>'

# Or automate retention with Curator or an ILM policy
```
Reduce replica count:

```bash
# Set replicas to 0 on all indices (single-node cluster)
curl -X PUT http://100.1.100.201:9200/_settings -H 'Content-Type: application/json' -d'
{
  "index": {
    "number_of_replicas": 0
  }
}'
```
### Logstash Tuning

Pipeline optimization - edit `/homenet_config/logstash/pipeline/*.conf`:

```ruby
# Add a conditional to drop unnecessary logs early
filter {
  if [level] == "debug" {
    drop { }
  }
}
```

Batch size is a pipeline setting, not a `gelf` input option. If memory is constrained, lower it in `/homenet_config/logstash/config/pipelines.yml`:

```yaml
pipeline.batch.size: 64  # default is 125; lower values reduce memory pressure
```
## Alternative Solutions

### If ELK is too resource-intensive

**Option 1: Lighter alternatives**
- **Loki** - Grafana's log aggregation (lower resource usage)
- **Graylog** - Similar features, less memory
- **Fluentd** - Log collector with various outputs

**Option 2: Cloud-based logging**
- **Datadog** - Commercial SaaS
- **Papertrail** - Simple cloud logging
- **CloudWatch** - If using AWS

**Option 3: File-based logging**
- Keep Docker logs on disk
- Use `docker service logs` when needed
- Rotate logs with logrotate
- Accept limited searchability

**Option 4: Hybrid approach**
- Use Loki for recent logs (7 days)
- Archive old logs to S3/cold storage
- Elasticsearch only for specific high-value logs
## Monitoring After Restoration
### Prometheus Alerts

```yaml
groups:
  - name: elk_alerts
    rules:
      - alert: ElasticsearchDown
        expr: up{job="elasticsearch"} == 0
        for: 5m
        labels:
          severity: critical
      - alert: LogstashDown
        expr: up{job="logstash"} == 0
        for: 5m
        labels:
          severity: critical
      - alert: ElasticsearchDiskSpace
        expr: elasticsearch_filesystem_data_available_bytes < 10737418240  # 10GB
        for: 15m
        labels:
          severity: warning
```
### Grafana Dashboard

Create an ELK health dashboard with:
- Elasticsearch cluster status
- Index count and size
- Logstash throughput
- GELF message rate
- Disk space usage
### Health Check Script

```bash
#!/bin/bash
# /usr/local/bin/elk-health-check.sh
echo "=== ELK Health Check $(date) ==="

# Elasticsearch
ES_STATUS=$(curl -s http://100.1.100.201:9200/_cluster/health | jq -r '.status')
echo "Elasticsearch: $ES_STATUS"

# Logstash (check GELF port; note UDP checks with nc are best-effort -
# "success" only means no ICMP port-unreachable error came back)
GELF_STATUS=$(nc -zvu 100.1.100.201 12201 2>&1 | grep -q "succeeded" && echo "UP" || echo "DOWN")
echo "Logstash GELF: $GELF_STATUS"

# Kibana
KIBANA_STATUS=$(curl -s http://100.1.100.201:5601/api/status | jq -r '.status.overall.state')
echo "Kibana: $KIBANA_STATUS"
```
## Decision Tree

```mermaid
graph TD
    A[ELK Stack Offline] --> B{Storage Available?}
    B -->|<20GB| C[Clean up storage first]
    B -->|>20GB| D{Memory Available?}
    D -->|<4GB free| E[Free up RAM or reduce ES heap]
    D -->|>4GB| F[Start Elasticsearch]
    C --> G[See Storage Warning]
    E --> H[Stop non-critical services]
    F --> I{ES Started OK?}
    I -->|Yes| J[Start Logstash]
    I -->|No| K{Corruption Error?}
    K -->|Yes| L[Remove corrupted files]
    K -->|No| M[Check ES logs]
    J --> N{Logstash OK?}
    N -->|Yes| O[Start Kibana]
    N -->|No| P[Check Logstash config]
    O --> Q[Verify end-to-end]
    L --> F
    M --> R[Troubleshoot specific error]
```
## Related Documentation
- [[Known-Issues|Known Issues]]
- [[05-Storage/Storage-Critical-Warning|Storage Critical Warning]]
- [[02-Services/Stack-Homenet1|Homenet1 Stack]]
- [[Service-Restart-Runbook|Service Restart Runbook]]
- [[01-Infrastructure/Node-201-Manager|Node 201 Manager]]
## Useful Commands

```bash
# Quick status check
docker service ls | grep -E "elastic|logstash|kibana"

# Scale services
docker service scale homenet1_elasticsearch=1
docker service scale homenet1_logstash=1
docker service scale homenet1_logstash-cacher=1
docker service scale homenet4_kibana=1

# Check Elasticsearch
curl http://100.1.100.201:9200/_cluster/health?pretty
curl http://100.1.100.201:9200/_cat/indices?v

# Test GELF (-w1 closes nc after one second)
echo '{"short_message":"test"}' | nc -u -w1 100.1.100.201 12201

# View logs
docker service logs -f homenet1_elasticsearch
docker service logs -f homenet1_logstash
docker service logs -f homenet4_kibana
```
**Last Updated:** 2026-01-11
**Status:** 🔴 CRITICAL - All ELK services offline
**Next Action:** Investigate root cause, then attempt phased restoration
**Priority:** High (but check storage capacity first)