
ELK Stack Offline Investigation

[!danger] Critical Infrastructure Offline

Status: 🔴 CRITICAL
Impact: No centralized logging - operating blind
Services Affected: Elasticsearch, Logstash, Logstash-cacher, Kibana
Duration: Unknown (currently 0/0 replicas)

Current Status

- Elasticsearch: ❌ 0/0 replicas (scaled down)
- Logstash: ❌ 0/0 replicas (scaled down)
- Logstash-cacher: ❌ 0/0 replicas (scaled down)
- Kibana: ❌ 0/0 replicas (scaled down)

All ELK stack services are scaled to zero replicas, which suggests a deliberate shutdown rather than a crash.

Impact Analysis

What We've Lost

No Centralized Log Aggregation:

- ❌ Cannot view logs from all services in one place
- ❌ Cannot search logs across infrastructure
- ❌ Cannot correlate events between services
- ❌ No log-based alerting
- ❌ No log retention/archival

No GELF Endpoint:

- ❌ Logstash-cacher (UDP 12201) offline
- ❌ Services configured with GELF logging lose logs
- ❌ Container logs not being processed

No Log Visualization:

- ❌ Kibana dashboard unavailable
- ❌ Cannot create log queries/visualizations
- ❌ No historical log analysis

Current Workarounds

Viewing Service Logs:

# Direct service log access
docker service logs -f <service-name>

# Specific container logs on a node
ssh <node-ip>
docker logs <container-id>

# Follow logs from multiple services
docker service logs -f homenet1_mariadb &
docker service logs -f homenet4_plex &

Use Grafana for Metrics:

- Grafana still operational: http://100.1.100.201:3010
- Can monitor service health via metrics
- Limited to numerical data (not log content)

Investigation Steps

1. Determine Why Services Were Scaled Down

Check service history:

# View service update history
docker service inspect homenet1_elasticsearch --format '{{json .UpdatedAt}}'
docker service inspect homenet1_logstash --format '{{json .UpdatedAt}}'

# Check for recent stack deployments
docker stack ps homenet1 --format "{{.Name}} {{.CurrentState}} {{.Error}}"

Review git history:

cd /home/cjustin/homenet-docker-services
git log --oneline --grep="elastic\|logstash" --since="2 weeks ago"
git show <commit-hash>

Possible reasons for shutdown:

1. Storage capacity - Elasticsearch indices filling disk
2. Performance issues - High resource usage
3. Corruption - Lucene index or translog corruption
4. Intentional maintenance - Planned shutdown
5. Resource constraints - Node 201 overloaded
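The first and fifth causes can be screened mechanically. A minimal sketch, assuming a 90% disk-usage and 4GB free-memory threshold (both thresholds are illustrative, not values from this runbook):

```shell
# Hypothetical triage helper: flags the two resource-related causes above
# from disk usage (%) and free memory (MB). Thresholds are illustrative.
triage_node() {
  local disk_used_pct=$1 free_mem_mb=$2
  if [ "$disk_used_pct" -ge 90 ]; then
    echo "likely: storage capacity (disk ${disk_used_pct}% used)"
  fi
  if [ "$free_mem_mb" -lt 4096 ]; then
    echo "likely: resource constraints (only ${free_mem_mb}MB free)"
  fi
  if [ "$disk_used_pct" -lt 90 ] && [ "$free_mem_mb" -ge 4096 ]; then
    echo "resources look OK - check git history and service logs instead"
  fi
}

# With live numbers from node 201 (after ssh 100.1.100.201):
# triage_node "$(df --output=pcent /nfs_data | tail -1 | tr -dc '0-9')" \
#             "$(free -m | awk '/^Mem:/{print $7}')"
triage_node 95 2048
```

If neither resource check fires, the git history and service logs below are the next stop.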

2. Check Elasticsearch Data

Inspect data directory:

ssh 100.1.100.201

# Check Elasticsearch data size
du -sh /nfs_data/elasticsearch/

# Look for corruption indicators
ls -la /nfs_data/elasticsearch/nodes/

# Check for disk space issues
df -h /nfs_data

Expected issues:

- Large index files (>50GB)
- Corrupted indices
- Failed snapshots
- Lock files preventing startup
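The lock-file and oversized-file checks can be rolled into one helper. A sketch; the 1GB size threshold and the `ES_DATA` default path are assumptions:

```shell
# Sketch: scan the Elasticsearch data directory for the issues listed above.
# ES_DATA default matches the node 201 layout; override it when testing.
ES_DATA="${ES_DATA:-/nfs_data/elasticsearch}"

scan_es_data() {
  local dir=$1
  # Stale lock files from an unclean shutdown can block startup
  find "$dir" \( -name node.lock -o -name write.lock \) 2>/dev/null || true
  # Flag unusually large files (1GB threshold is an assumption)
  find "$dir" -type f -size +1G 2>/dev/null || true
}

scan_es_data "$ES_DATA"
```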

3. Check Logstash Configuration

Verify pipeline configurations:

# Check Logstash config
ls -la /homenet_config/logstash/config/
cat /homenet_config/logstash/config/pipelines.yml

# Verify pipeline files
ls -la /homenet_config/logstash/pipeline/

Common issues:

- Syntax errors in pipeline files
- Elasticsearch output configuration
- GELF input configuration
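A crude smoke test for the first item: Logstash pipeline files are brace-delimited, so unbalanced `{ }` is a common syntax error. This is a sketch, not a replacement for Logstash's own `--config.test_and_exit` check:

```shell
# Count { and } in a pipeline file; a mismatch usually means a syntax error.
check_braces() {
  awk '{
         n = length($0)
         for (i = 1; i <= n; i++) {
           ch = substr($0, i, 1)
           if (ch == "{") o++
           else if (ch == "}") cl++
         }
       }
       END {
         if (o == cl) print "balanced"
         else { printf "unbalanced: %d open, %d close\n", o, cl; exit 1 }
       }' "$1"
}

# check_braces /homenet_config/logstash/pipeline/<pipeline>.conf
```

It ignores braces inside quoted strings, so treat an "unbalanced" result as a hint, not proof.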

4. Review Resource Usage

Check if ELK was consuming too much:

# Node 201 current resource usage
ssh 100.1.100.201
free -h
df -h

# Historical metrics from Prometheus
# Visit: http://100.1.100.201:9090
# Query: container_memory_usage_bytes{name=~".*elastic.*"}

Restoration Plan

Phase 1: Pre-Flight Checks

Before restarting, verify:

  1. Storage capacity:

    df -h /nfs_data
    # Need at least 50GB free for Elasticsearch
    

  2. Memory available on node 201:

    free -h
    # Elasticsearch needs ~2-4GB RAM
    

  3. Configuration validity:

    # Check Elasticsearch config
    cat /homenet_config/elasticsearch/elasticsearch.yml
    
    # Check Logstash configs
    ls /homenet_config/logstash/pipeline/
    

  4. Network connectivity:

    # Verify elastic network exists
    docker network inspect elastic
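The step 1 storage check can be made mechanical. A sketch assuming GNU df (as on the swarm nodes) and the 50GB minimum from step 1:

```shell
# Report free space on a mount in whole GB and compare to a minimum.
free_gb() {
  df -BG --output=avail "$1" | tail -1 | tr -dc '0-9'
}

check_space() {
  local mount=$1 need=$2
  local avail; avail=$(free_gb "$mount")
  if [ "$avail" -ge "$need" ]; then
    echo "OK: ${avail}GB free on $mount (need ${need}GB)"
  else
    echo "FAIL: only ${avail}GB free on $mount (need ${need}GB)"
  fi
}

check_space / 50   # point at /nfs_data when run on node 201
```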
    

Phase 2: Start Elasticsearch First

Elasticsearch must start before Logstash:

# Scale up Elasticsearch
docker service scale homenet1_elasticsearch=1

# Monitor startup (takes 1-2 minutes)
docker service logs -f homenet1_elasticsearch

# Wait for the "started" message in the logs
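Rather than watching the logs by hand, the wait can be scripted. A sketch: `wait_for` polls any command until it succeeds or a timeout expires (the 120-second timeout below is an assumption):

```shell
# Poll any command until it exits 0, or give up after a timeout (seconds).
wait_for() {
  local timeout=$1; shift
  local waited=0
  until "$@"; do
    waited=$((waited + 5))
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 5
  done
}

# Once the service is scaling up:
# wait_for 120 curl -sf http://100.1.100.201:9200/_cluster/health \
#   && echo "Elasticsearch is answering"
```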

Verify Elasticsearch health:

# Check cluster health
curl http://100.1.100.201:9200/_cluster/health?pretty

# Expected response:
# {
#   "status": "green" or "yellow",
#   "number_of_nodes": 1
# }

# List indices
curl http://100.1.100.201:9200/_cat/indices?v
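Scripting the health check means pulling the status field out of the JSON. A sketch using only grep/cut so it works without jq; the sample response below is illustrative, not captured from this cluster:

```shell
# Pull the "status" field out of a cluster-health response using only
# grep/cut (no jq dependency).
parse_es_status() {
  grep -o '"status" *: *"[a-z]*"' | cut -d'"' -f4
}

# Illustrative response shape (not captured from this cluster):
sample='{"cluster_name":"elasticsearch","status":"yellow","number_of_nodes":1}'
echo "$sample" | parse_es_status   # prints: yellow

# Against the live node:
# curl -s http://100.1.100.201:9200/_cluster/health | parse_es_status
```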

If Elasticsearch fails to start:

# Check logs for errors
docker service logs homenet1_elasticsearch --tail 100

# Common errors:
# - "CorruptIndexException" / "TranslogCorruptedException" → Corrupted index (see Recovery from Corruption)
# - "OutOfMemoryError" → Increase heap size or free RAM
# - "disk watermark exceeded" / "Disk full" → Clean up storage (see Storage Warning)

Phase 3: Start Logstash

After Elasticsearch is healthy:

# Scale up Logstash
docker service scale homenet1_logstash=1
docker service scale homenet1_logstash-cacher=1

# Monitor startup
docker service logs -f homenet1_logstash

# Verify GELF input
netstat -uln | grep 12201  # Should show UDP listener

Test GELF endpoint:

# Send test message
echo '{"version":"1.1","host":"test","short_message":"test"}' | \
  nc -u 100.1.100.201 12201
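Hand-written GELF JSON is easy to get wrong; a tiny builder keeps the required fields (version, host, short_message) in one place. A sketch; the timestamp field is an addition beyond the required three:

```shell
# Build a minimal GELF 1.1 payload (version, host, short_message are the
# required fields). Naive quoting: arguments must not contain double quotes.
gelf_msg() {
  local host=$1 msg=$2
  printf '{"version":"1.1","host":"%s","short_message":"%s","timestamp":%s}' \
    "$host" "$msg" "$(date +%s)"
}

gelf_msg "test-host" "hello from runbook"; echo

# Send to the GELF input once it is listening:
# gelf_msg "test-host" "hello" | nc -u -w1 100.1.100.201 12201
```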

Phase 4: Start Kibana

After Logstash is running:

# Scale up Kibana
docker service scale homenet4_kibana=1

# Monitor startup
docker service logs -f homenet4_kibana

# Wait for "Server running at http://0.0.0.0:5601"

Access Kibana:

URL: http://100.1.100.201:5601

Create index pattern:

1. Navigate to Stack Management → Index Patterns
2. Create pattern: logstash-*
3. Select timestamp field: @timestamp
4. Navigate to Discover to view logs
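The index pattern can also be created from the command line via Kibana's saved-objects API. A sketch; the API shape is as in Kibana 7.x, so verify against the deployed version before relying on it:

```shell
# Sketch: create the logstash-* index pattern via the saved-objects API.
# Kibana requires the kbn-xsrf header on write requests.
KIBANA_URL="http://100.1.100.201:5601"
PAYLOAD='{"attributes":{"title":"logstash-*","timeFieldName":"@timestamp"}}'

# Uncomment once Kibana is up:
# curl -X POST "$KIBANA_URL/api/saved_objects/index-pattern" \
#   -H 'kbn-xsrf: true' -H 'Content-Type: application/json' \
#   -d "$PAYLOAD"
echo "$PAYLOAD"
```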

Phase 5: Verification

Confirm full functionality:

  1. Elasticsearch:
     - Cluster health: GREEN or YELLOW
     - Indices visible and searchable
     - New indices being created

  2. Logstash:
     - No errors in service logs
     - GELF port listening
     - Processing logs (check Elasticsearch indices)

  3. Kibana:
     - UI accessible
     - Index pattern created
     - Logs visible in Discover

  4. End-to-end test:

    # Restart a service with GELF logging
    docker service update --force homenet1_redis

    # Check if logs appear in Kibana
    # Discover → filter by service name
    

Recovery from Corruption

Elasticsearch Index Corruption

If Elasticsearch fails with corruption errors (e.g. "CorruptIndexException" or "TranslogCorruptedException"):

# Backup current state
ssh 100.1.100.201
tar -czf /tmp/elasticsearch-backup-$(date +%Y%m%d).tar.gz /nfs_data/elasticsearch/

# Identify the affected index from the error message
# (Note: Elasticsearch has no BBolt database - influxd.bolt belongs to
# InfluxDB. Look instead for stale lock files from an unclean shutdown:)
find /nfs_data/elasticsearch -name node.lock -o -name write.lock

# Restart Elasticsearch
docker service update --force homenet1_elasticsearch

If indices are corrupted:

# Delete corrupted indices
curl -X DELETE http://100.1.100.201:9200/<index-name>

# Or delete all old indices (>30 days) with Curator
# (v3-style CLI shown; Curator v4+ uses YAML action files)
curator delete indices --older-than 30 --time-unit days --timestring '%Y.%m.%d'

# Elasticsearch will create new indices automatically
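If curator is not available, the same 30-day cutoff can be computed with GNU date. A sketch; the daily `logstash-YYYY.MM.DD` naming is assumed from the index pattern used elsewhere in this runbook:

```shell
# Sketch: list logstash-YYYY.MM.DD indices older than N days using GNU date,
# for setups without curator. Index names are read from stdin.
old_indices() {
  local days=$1
  local cutoff; cutoff=$(date -d "$days days ago" +%Y%m%d)
  while read -r idx; do
    # logstash-2020.01.01 -> 20200101 for numeric comparison
    d=$(echo "${idx#logstash-}" | tr -d '.')
    if [ "$d" -lt "$cutoff" ]; then
      echo "$idx"
    fi
  done
}

# Against a live cluster, feed it the real index list:
# curl -s 'http://100.1.100.201:9200/_cat/indices/logstash-*?h=index' | old_indices 30
printf 'logstash-2020.01.01\nlogstash-2099.12.31\n' | old_indices 30   # prints logstash-2020.01.01
```

Each name it prints can then be passed to the `curl -X DELETE` form above.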

Complete Data Reset

If restoration fails, nuclear option:

[!warning] Data Loss
This will delete ALL historical logs. Only use as a last resort.

# Stop Elasticsearch
docker service scale homenet1_elasticsearch=0

# Backup and remove data
ssh 100.1.100.201
mv /nfs_data/elasticsearch /nfs_data/elasticsearch.old.$(date +%Y%m%d)
mkdir /nfs_data/elasticsearch
chown -R 1000:1000 /nfs_data/elasticsearch

# Restart Elasticsearch (will initialize fresh)
docker service scale homenet1_elasticsearch=1

Performance Optimization

Elasticsearch Tuning

Reduce memory usage:

Edit docker-compose or stack file:

environment:
  - "ES_JAVA_OPTS=-Xms1g -Xmx1g"  # Reduce from default 2g

Index lifecycle management:

# Elasticsearch has no "max_age" delete parameter - old daily indices are
# removed by name (daily naming: logstash-<yyyy.MM.dd>):
curl -X DELETE 'http://100.1.100.201:9200/logstash-<yyyy.MM.dd>'

# Or automate with curator or an ILM policy

Reduce replica count:

# Set replicas to 0 (single node cluster)
curl -X PUT http://100.1.100.201:9200/_settings -H 'Content-Type: application/json' -d'
{
  "index": {
    "number_of_replicas": 0
  }
}'

Logstash Tuning

Pipeline optimization:

Edit /homenet_config/logstash/pipeline/*.conf:

# Add if/else to filter unnecessary logs
filter {
  if [level] == "debug" {
    drop { }
  }
}

# Reduce the pipeline batch size if memory constrained.
# Batch size is a Logstash pipeline setting (logstash.yml or pipelines.yml),
# not an option of the gelf input plugin:
# pipeline.batch.size: 125

input {
  gelf {
    port => 12201
  }
}

Alternative Solutions

If ELK is too resource-intensive:

Option 1: Lighter alternatives

- Loki - Grafana's log aggregation (lower resource usage)
- Graylog - Similar features, less memory
- Fluentd - Log collector with various outputs

Option 2: Cloud-based logging

- Datadog - Commercial SaaS
- Papertrail - Simple cloud logging
- CloudWatch - If using AWS

Option 3: File-based logging

- Keep Docker logs on disk
- Use docker service logs when needed
- Rotate logs with logrotate
- Accept limited searchability

Option 4: Hybrid approach

- Use Loki for recent logs (7 days)
- Archive old logs to S3/cold storage
- Elasticsearch only for specific high-value logs

Monitoring After Restoration

Prometheus Alerts

groups:
  - name: elk_alerts
    rules:
      - alert: ElasticsearchDown
        expr: up{job="elasticsearch"} == 0
        for: 5m
        labels:
          severity: critical

      - alert: LogstashDown
        expr: up{job="logstash"} == 0
        for: 5m
        labels:
          severity: critical

      - alert: ElasticsearchDiskSpace
        expr: elasticsearch_filesystem_data_available_bytes < 10737418240  # 10GB
        for: 15m
        labels:
          severity: warning

Grafana Dashboard

Create ELK health dashboard with:

- Elasticsearch cluster status
- Index count and size
- Logstash throughput
- GELF message rate
- Disk space usage

Health Check Script

#!/bin/bash
# /usr/local/bin/elk-health-check.sh

echo "=== ELK Health Check $(date) ==="

# Elasticsearch
ES_STATUS=$(curl -s http://100.1.100.201:9200/_cluster/health | jq -r '.status')
echo "Elasticsearch: $ES_STATUS"

# Logstash (check GELF port; UDP probes are best-effort - nc may report
# "succeeded" even when nothing is listening)
GELF_STATUS=$(nc -zvu 100.1.100.201 12201 2>&1 | grep -q "succeeded" && echo "UP" || echo "DOWN")
echo "Logstash GELF: $GELF_STATUS"

# Kibana
KIBANA_STATUS=$(curl -s http://100.1.100.201:5601/api/status | jq -r '.status.overall.state')
echo "Kibana: $KIBANA_STATUS"
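The script prints three independent states; for cron or alerting it helps to reduce them to one verdict and exit code. A sketch; the mapping (GELF "UP", Kibana "green") mirrors the values the script produces:

```shell
# Reduce component states to a single OK/DEGRADED verdict and exit code.
elk_overall() {
  local es=$1 gelf=$2 kibana=$3
  case "$es" in
    green|yellow) ;;
    *) echo "DEGRADED"; return 1 ;;
  esac
  [ "$gelf" = "UP" ] || { echo "DEGRADED"; return 1; }
  [ "$kibana" = "green" ] || { echo "DEGRADED"; return 1; }
  echo "OK"
}

elk_overall yellow UP green   # prints OK
# In the script: elk_overall "$ES_STATUS" "$GELF_STATUS" "$KIBANA_STATUS"
```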

Decision Tree

graph TD
    A[ELK Stack Offline] --> B{Storage Available?}
    B -->|<20GB| C[Clean up storage first]
    B -->|>20GB| D{Memory Available?}
    D -->|<4GB free| E[Free up RAM or reduce ES heap]
    D -->|>4GB| F[Start Elasticsearch]

    C --> G[See Storage Warning]
    E --> H[Stop non-critical services]

    F --> I{ES Started OK?}
    I -->|Yes| J[Start Logstash]
    I -->|No| K{Corruption Error?}

    K -->|Yes| L[Remove corrupted files]
    K -->|No| M[Check ES logs]

    J --> N{Logstash OK?}
    N -->|Yes| O[Start Kibana]
    N -->|No| P[Check Logstash config]

    O --> Q[Verify end-to-end]

    L --> F
    M --> R[Troubleshoot specific error]

Related

- [[Known-Issues|Known Issues]]
- [[05-Storage/Storage-Critical-Warning|Storage Critical Warning]]
- [[02-Services/Stack-Homenet1|Homenet1 Stack]]
- [[Service-Restart-Runbook|Service Restart Runbook]]
- [[01-Infrastructure/Node-201-Manager|Node 201 Manager]]

Useful Commands

# Quick status check
docker service ls | grep -E "elastic|logstash|kibana"

# Scale services
docker service scale homenet1_elasticsearch=1
docker service scale homenet1_logstash=1
docker service scale homenet1_logstash-cacher=1
docker service scale homenet4_kibana=1

# Check Elasticsearch
curl http://100.1.100.201:9200/_cluster/health?pretty
curl http://100.1.100.201:9200/_cat/indices?v

# Test GELF
echo '{"version":"1.1","host":"test","short_message":"test"}' | nc -u 100.1.100.201 12201

# View logs
docker service logs -f homenet1_elasticsearch
docker service logs -f homenet1_logstash
docker service logs -f homenet4_kibana

Last Updated: 2026-01-11
Status: 🔴 CRITICAL - All ELK services offline
Next Action: Investigate root cause, then attempt phased restoration
Priority: High (but check storage capacity first)