← Về danh sách bài họcBài 19/20
📊 Bài 19: Kafka Monitoring & Production
🎯 Sau bài học này, bạn sẽ:
- Monitor Kafka với Prometheus + Grafana
- Biết key metrics cần theo dõi
- Tuning Kafka performance
- Production deployment checklist
1. Key Metrics
| Category | Metric | Alert khi |
|---|---|---|
| Broker | Under-replicated partitions | > 0 |
| Broker | Active controller count | ≠ 1 |
| Broker | Request latency (p99) | > 100ms |
| Topic | Messages in/sec | Đột biến tăng/giảm |
| Consumer | Consumer lag | > 10,000 |
| Consumer | Consumer lag increasing | Liên tục tăng > 5 phút |
| System | Disk usage | > 80% |
| System | Network I/O | Gần saturation |
2. Monitoring Stack
# docker-compose-monitoring.yml
version: '3.8'
services:
kafka:
image: confluentinc/cp-kafka:7.5.0
environment:
KAFKA_JMX_PORT: 9999
KAFKA_JMX_HOSTNAME: kafka
# ... other configs
jmx-exporter:
image: bitnami/jmx-exporter:latest
ports:
- "5556:5556"
volumes:
- ./jmx-config.yml:/etc/jmx-exporter/config.yml
command: ["5556", "/etc/jmx-exporter/config.yml"]
prometheus:
image: prom/prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
grafana:
image: grafana/grafana
ports:
- "3000:3000"
environment:
GF_SECURITY_ADMIN_PASSWORD: admin
3. Consumer Lag Monitoring
// lag-monitor.js - Custom lag monitoring
const { Kafka } = require('kafkajs');
const kafka = new Kafka({ clientId: 'lag-monitor', brokers: ['localhost:9092'] });
const admin = kafka.admin();
async function checkLag(groupId) {
await admin.connect();
const offsets = await admin.fetchOffsets({ groupId, topics: ['orders'] });
const topicOffsets = await admin.fetchTopicOffsets('orders');
let totalLag = 0;
for (const partition of offsets) {
const endOffset = topicOffsets.find(
t => t.partition === partition.partition
);
const lag = parseInt(endOffset.offset) - parseInt(partition.offset);
totalLag += lag;
console.log(`P${partition.partition}: offset=${partition.offset}, end=${endOffset.offset}, lag=${lag}`);
}
console.log(`Total lag: ${totalLag}`);
if (totalLag > 10000) {
console.log('🚨 ALERT: Consumer lag is too high!');
// Send alert to Slack/PagerDuty
}
await admin.disconnect();
}
// Check lag mỗi 30 giây
setInterval(() => checkLag('order-processors'), 30000);
4. Performance Tuning
# Broker tuning (server.properties)
num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
# Producer tuning
batch.size=32768
linger.ms=5
compression.type=lz4
buffer.memory=33554432
# Consumer tuning
fetch.min.bytes=1
fetch.max.wait.ms=500
max.poll.records=500
session.timeout.ms=30000
💡 Performance tips:
• Producer: Tăng batch.size + linger.ms cho throughput
• Consumer: Tăng max.poll.records cho batch processing
• Broker: SSD disk, đủ RAM cho page cache
• Compression: lz4 cho throughput, zstd cho ratio tốt hơn
• Producer: Tăng batch.size + linger.ms cho throughput
• Consumer: Tăng max.poll.records cho batch processing
• Broker: SSD disk, đủ RAM cho page cache
• Compression: lz4 cho throughput, zstd cho ratio tốt hơn
5. Production Checklist
📌 Deployment Checklist:
✅ Cluster tối thiểu 3 brokers
✅ Replication factor ≥ 3, min.insync.replicas = 2
✅ SSD storage cho Kafka logs
✅ Dedicated disks cho Kafka (không share với OS)
✅ JVM heap: 6-8GB (không quá 50% RAM)
✅ Network: ≥ 1Gbps between brokers
✅ Monitoring: Prometheus + Grafana + alerting
✅ Security: SSL/TLS, SASL authentication, ACLs
✅ Backup: MirrorMaker 2 cho cross-DC replication
✅ Auto-scaling consumers based on lag
✅ Cluster tối thiểu 3 brokers
✅ Replication factor ≥ 3, min.insync.replicas = 2
✅ SSD storage cho Kafka logs
✅ Dedicated disks cho Kafka (không share với OS)
✅ JVM heap: 6-8GB (không quá 50% RAM)
✅ Network: ≥ 1Gbps between brokers
✅ Monitoring: Prometheus + Grafana + alerting
✅ Security: SSL/TLS, SASL authentication, ACLs
✅ Backup: MirrorMaker 2 cho cross-DC replication
✅ Auto-scaling consumers based on lag
📝 Tóm Tắt
- Monitoring: JMX → Prometheus → Grafana
- Key metrics: Consumer lag, under-replicated partitions, request latency
- Tuning: batch.size, linger.ms, compression, max.poll.records
- Production: 3+ brokers, RF=3, SSD, monitoring, security