📊 Bài 19: Kafka Monitoring & Production

⏱️ Thời gian đọc: 22 phút | 📚 Độ khó: Nâng cao

🎯 Sau bài học này, bạn sẽ:

Monitor Kafka với Prometheus + Grafana
Biết key metrics cần theo dõi
Tuning Kafka performance
Production deployment checklist

1. Key Metrics

Category	Metric	Alert khi
Broker	Under-replicated partitions	> 0
Broker	Active controller count	≠ 1
Broker	Request latency (p99)	> 100ms
Topic	Messages in/sec	Đột biến tăng/giảm
Consumer	Consumer lag	> 10,000
Consumer	Consumer lag increasing	Liên tục tăng > 5 phút
System	Disk usage	> 80%
System	Network I/O	Gần saturation

2. Monitoring Stack

# docker-compose-monitoring.yml
version: '3.8'
services:
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    environment:
      KAFKA_JMX_PORT: 9999
      KAFKA_JMX_HOSTNAME: kafka
      # ... other configs

  jmx-exporter:
    image: bitnami/jmx-exporter:latest
    ports:
      - "5556:5556"
    volumes:
      - ./jmx-config.yml:/etc/jmx-exporter/config.yml
    command: ["5556", "/etc/jmx-exporter/config.yml"]

  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin

3. Consumer Lag Monitoring

// lag-monitor.js - Custom lag monitoring
const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'lag-monitor', brokers: ['localhost:9092'] });
const admin = kafka.admin();

async function checkLag(groupId) {
    await admin.connect();

    const offsets = await admin.fetchOffsets({ groupId, topics: ['orders'] });
    const topicOffsets = await admin.fetchTopicOffsets('orders');

    let totalLag = 0;
    for (const partition of offsets) {
        const endOffset = topicOffsets.find(
            t => t.partition === partition.partition
        );
        const lag = parseInt(endOffset.offset) - parseInt(partition.offset);
        totalLag += lag;
        console.log(`P${partition.partition}: offset=${partition.offset}, end=${endOffset.offset}, lag=${lag}`);
    }

    console.log(`Total lag: ${totalLag}`);

    if (totalLag > 10000) {
        console.log('🚨 ALERT: Consumer lag is too high!');
        // Send alert to Slack/PagerDuty
    }

    await admin.disconnect();
}

// Check lag mỗi 30 giây
setInterval(() => checkLag('order-processors'), 30000);

4. Performance Tuning

# Broker tuning (server.properties)
num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000

# Producer tuning
batch.size=32768
linger.ms=5
compression.type=lz4
buffer.memory=33554432

# Consumer tuning
fetch.min.bytes=1
fetch.max.wait.ms=500
max.poll.records=500
session.timeout.ms=30000

💡 Performance tips:
• Producer: Tăng batch.size + linger.ms cho throughput
• Consumer: Tăng max.poll.records cho batch processing
• Broker: SSD disk, đủ RAM cho page cache
• Compression: lz4 cho throughput, zstd cho ratio tốt hơn

5. Production Checklist

📌 Deployment Checklist:
✅ Cluster tối thiểu 3 brokers
✅ Replication factor ≥ 3, min.insync.replicas = 2
✅ SSD storage cho Kafka logs
✅ Dedicated disks cho Kafka (không share với OS)
✅ JVM heap: 6-8GB (không quá 50% RAM)
✅ Network: ≥ 1Gbps between brokers
✅ Monitoring: Prometheus + Grafana + alerting
✅ Security: SSL/TLS, SASL authentication, ACLs
✅ Backup: MirrorMaker 2 cho cross-DC replication
✅ Auto-scaling consumers based on lag

📝 Tóm Tắt

Monitoring: JMX → Prometheus → Grafana
Key metrics: Consumer lag, under-replicated partitions, request latency
Tuning: batch.size, linger.ms, compression, max.poll.records
Production: 3+ brokers, RF=3, SSD, monitoring, security

← Bài 18: Kafka Connect Bài 20: Dự Án Kafka →