Kafka Monitoring
It all comes down to:
- Do I know which metrics to monitor?
- Do I know which knobs to turn if I need to tune things relative to each of these performance metrics?
To get these answers, here is what you do:
- Monitor and observe Kafka performance, throughput and latency
- Automatically detect and alert on Kafka issues to maintain data integrity
- Automatically detect threshold breaches, component and hardware degradation, and failures
- Forecast Kafka performance and capacity trends and needs over time.
Key Metrics
The Big 4
- Number of active controllers should always be 1
- Number of under-replicated partitions should always be 0
- Number of offline partitions should always be 0
- Consumer lag should be under control (varies by use-case)
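The four checks above can be sketched as a small rule evaluation. This is a minimal sketch, assuming the values have already been collected (for example over JMX) into a plain dict; the dict keys, the `check_big_four` name and the `max_consumer_lag` threshold are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def check_big_four(metrics: Dict[str, int], max_consumer_lag: int) -> List[str]:
    """Return a list of alert messages; an empty list means all four checks pass.

    The dict keys are hypothetical names for values gathered elsewhere.
    "active_controllers" is assumed to be the sum across all brokers.
    """
    alerts = []
    if metrics["active_controllers"] != 1:
        alerts.append(f"active controllers = {metrics['active_controllers']}, expected 1")
    if metrics["under_replicated_partitions"] > 0:
        alerts.append(f"{metrics['under_replicated_partitions']} under-replicated partitions")
    if metrics["offline_partitions"] > 0:
        alerts.append(f"{metrics['offline_partitions']} offline partitions")
    # The acceptable lag varies by use-case, so it is passed in by the caller.
    if metrics["consumer_lag"] > max_consumer_lag:
        alerts.append(f"consumer lag {metrics['consumer_lag']} exceeds {max_consumer_lag}")
    return alerts
```

The consumer lag threshold is a parameter rather than a constant because, as noted above, what counts as "under control" depends on the workload.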
Producer
Production rate
- When a message leaves a producer, it is typically not on its own; it has been batched with other messages. Production rate is about:
- how big is that batch size?
- how long is it buffered on the producer before being sent?
- what's the network latency between the producer and the broker?
- what's the throughput from producer to the broker?
- were there any failures?
- how often are you acknowledging those packets that were sent?
Each of these is a potential gating factor in getting messages from the producer to the broker.
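Several of these questions map onto producer settings and client metrics. The sketch below is an illustration under assumptions: `batch.size` and `linger.ms` are standard producer configuration keys, and `batch-size-avg`, `record-queue-time-avg` and `record-error-rate` are producer metric names, but the two input dicts, the `review_producer` function and the 0.5 fill-ratio cut-off are hypothetical.

```python
from typing import Dict, List


def review_producer(config: Dict[str, float], snapshot: Dict[str, float]) -> List[str]:
    """Compare a producer metrics snapshot against its configuration.

    Both dicts are hypothetical inputs collected elsewhere; the thresholds
    are illustrative heuristics, not recommendations.
    """
    findings = []

    # How big is the batch? Compare the average batch to the configured maximum.
    fill_ratio = snapshot["batch-size-avg"] / config["batch.size"]
    if fill_ratio < 0.5:
        findings.append(f"batches are only {fill_ratio:.0%} full on average")

    # How long is it buffered? Queue time above linger.ms hints at a backed-up sender.
    if snapshot["record-queue-time-avg"] > config["linger.ms"]:
        findings.append("records wait in the buffer longer than linger.ms")

    # Were there any failures?
    if snapshot["record-error-rate"] > 0:
        findings.append("some record sends are failing")

    return findings
```

With `linger.ms` set to 0 the queue-time check fires on any buffering at all, so a real check would likely need a tolerance on top of it.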
Broker
- Component health (topics and hardware)
- Load skew
- Capacity
- how many leaders per broker am I actually running?
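Leader distribution is one simple way to put a number on load skew. The following is a minimal sketch, assuming the leader count per broker has already been gathered (the broker exposes it as the `LeaderCount` metric); the `leader_skew` function and its input shape are hypothetical.

```python
from typing import Dict


def leader_skew(leaders_per_broker: Dict[int, int]) -> Dict[int, float]:
    """Relative deviation of each broker's leader count from the cluster mean.

    0.0 means the broker carries exactly the average number of leaders;
    0.5 means 50% more than average, -0.5 means 50% fewer.
    """
    mean = sum(leaders_per_broker.values()) / len(leaders_per_broker)
    if mean == 0:
        # No leaders anywhere: report no skew rather than dividing by zero.
        return {broker: 0.0 for broker in leaders_per_broker}
    return {broker: (count - mean) / mean for broker, count in leaders_per_broker.items()}
```

For example, `leader_skew({1: 30, 2: 20, 3: 10})` reports broker 1 at 0.5 and broker 3 at -0.5, which would suggest rebalancing leaders.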
Topic
- is each partition healthy?
- are we fully replicated?
- are we evenly distributed among the hardware we have? (load skew)
- are topic priorities set so that the most important topics come first (if applicable)?
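The first two questions can be answered from partition metadata alone. Here is a minimal sketch, assuming each partition's leader, replica list and in-sync replica (ISR) list are already available; the `PartitionState` type and `classify` function are hypothetical, and a missing leader is represented as `None` purely for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PartitionState:
    """Hypothetical snapshot of one partition's metadata."""
    topic: str
    partition: int
    leader: Optional[int]  # None when the partition has no leader
    replicas: List[int]    # broker ids assigned to the partition
    isr: List[int]         # broker ids currently in sync with the leader


def classify(state: PartitionState) -> str:
    """Label a partition as offline, under-replicated or healthy."""
    if state.leader is None:
        return "offline"
    if len(state.isr) < len(state.replicas):
        return "under-replicated"
    return "healthy"
```

Counting the partitions in each class per topic gives the same picture as the under-replicated and offline partition counts, but broken down by topic.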
Consumer
- are the consumers online?
- is there consumer lag?
- consumption rate & trends
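Consumer lag per partition is the gap between the log end offset and the offset the consumer group has committed. The sketch below assumes both sets of offsets have already been fetched; the `consumer_lag` function and its dict inputs (partition number to offset) are hypothetical.

```python
from typing import Dict


def consumer_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag = log end offset minus committed offset.

    Assumption: a partition with no committed offset is treated as if the
    group had committed offset 0, so its lag equals the end offset.
    """
    lag = {}
    for partition, end in end_offsets.items():
        # Clamp at zero in case the two offsets were sampled at different times.
        lag[partition] = max(end - committed.get(partition, 0), 0)
    return lag
```

Summing the values gives the total lag for the group; sampling it over time shows whether consumers are catching up or falling behind.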
Beyond the obvious
- Log flush latency
- Messages per second / bytes per second thresholds
- Available network processor / request handler bandwidth
- Topic status, network throughput
- Open file handles
- Memory, Load
- Disk usage
- GC pauses
- Heap usage
- Swapping
- Dropped packets
Trends to watch
- Rate of topic growth
- Is TTL on data (retention) long enough for data safety margins, but not too long?
- Is the hardware keeping up with Kafka (CPU, memory, network and total I/O capacity)?
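A growth trend can be turned into a rough capacity forecast. The following is a minimal sketch, assuming one disk-usage sample per day and roughly linear growth; it fits a least-squares line and extrapolates to the disk capacity. The `days_until_full` function and its inputs are hypothetical, and real growth is often not linear.

```python
from typing import Optional, Sequence


def days_until_full(daily_usage_gb: Sequence[float], capacity_gb: float) -> Optional[float]:
    """Estimate days until the disk fills, from daily usage samples.

    Fits a least-squares line through the samples (one per day) and
    extrapolates from the latest sample. Returns None if usage is flat
    or shrinking.
    """
    n = len(daily_usage_gb)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage_gb) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage_gb))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance  # growth in GB per day
    if slope <= 0:
        return None
    remaining = max(capacity_gb - daily_usage_gb[-1], 0.0)
    return remaining / slope
```

If the estimate is shorter than the time needed to add capacity, that is a signal to revisit retention settings or provision more storage.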
References
- Kafka uses Yammer Metrics for metrics reporting in the server. For more on monitoring, refer to the official Kafka documentation.
- AWS MSK metrics details