Kafka Monitoring

It all comes down to:

Do I know which metrics to monitor?
Do I know which knobs to turn when I need to tune performance against each of those metrics?

To get these answers, here is what you do:

Monitor and observe Kafka performance, throughput, and latency
Automatically detect and alert on Kafka issues to maintain data integrity
Automatically detect threshold breaches, component and hardware degradation, and failures
Forecast Kafka performance and capacity trends and needs over time

Key Metrics

The Big 4

Number of active controllers should always be 1
Number of under replicated partitions should always be 0
Number of offline partitions should always be 0
Consumer lag should be under control (varies by use-case)
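
As a minimal sketch, the four rules above can be written as one check. The first three values correspond to the broker JMX metrics ActiveControllerCount, UnderReplicatedPartitions and OfflinePartitionsCount. The ClusterSnapshot container, the big4_alerts function and the lag limit are hypothetical names for illustration, and the values are assumed to be already collected and summed across brokers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterSnapshot:
    """Cluster-wide values, assumed to be already summed across brokers."""
    active_controllers: int           # sum of ActiveControllerCount
    under_replicated_partitions: int  # sum of UnderReplicatedPartitions
    offline_partitions: int           # OfflinePartitionsCount
    max_consumer_lag: int             # worst lag (messages) over monitored groups


def big4_alerts(snapshot: ClusterSnapshot, max_lag_allowed: int) -> List[str]:
    """Return one alert string per violated Big 4 rule (empty list = healthy)."""
    alerts = []
    if snapshot.active_controllers != 1:
        alerts.append(f"active controllers = {snapshot.active_controllers}, expected 1")
    if snapshot.under_replicated_partitions != 0:
        alerts.append(f"under replicated partitions = {snapshot.under_replicated_partitions}")
    if snapshot.offline_partitions != 0:
        alerts.append(f"offline partitions = {snapshot.offline_partitions}")
    # The lag limit is use-case specific, so the caller supplies it.
    if snapshot.max_consumer_lag > max_lag_allowed:
        alerts.append(f"consumer lag = {snapshot.max_consumer_lag}, limit {max_lag_allowed}")
    return alerts
```

An empty result means all four conditions hold for that snapshot.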

Producer

Production rate - when a message leaves a producer, it is typically not on its own. It's been batched with other messages. Production rate is about:

how big is that batch size?
how long is it buffered on the producer before being sent?
what's the network latency between the producer and the broker?
what's the throughput from producer to the broker?
were there any failures?
how often are the messages that were sent being acknowledged?

Each of these is a potential gating factor in getting messages from the producer to the broker.
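
One way to make the questions above actionable is to pair each with a producer metric to watch and a config knob to turn. The sketch below uses metric names from the Java producer's producer-metrics group and standard producer config names. The pairing itself is an illustrative assumption rather than an official mapping, and PRODUCER_CHECKLIST and knob_for are hypothetical names.

```python
# question -> (producer metric to watch, config knob to review or None)
# The pairing is an illustrative assumption, not an official mapping.
PRODUCER_CHECKLIST = {
    "batch size": ("batch-size-avg", "batch.size"),
    "time buffered before send": ("record-queue-time-avg", "linger.ms"),
    "producer-to-broker latency": ("request-latency-avg", None),  # network, no single knob
    "throughput to the broker": ("outgoing-byte-rate", "compression.type"),
    "failures": ("record-error-rate", "retries"),
    "acknowledgements": ("response-rate", "acks"),
}


def knob_for(metric_name):
    """Return the config knob paired with a metric, or None if there is none."""
    for metric, knob in PRODUCER_CHECKLIST.values():
        if metric == metric_name:
            return knob
    raise KeyError(metric_name)
```

For example, knob_for("record-queue-time-avg") points at linger.ms, the setting that controls how long the producer waits to fill a batch.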

Broker

Component health (topics and hardware)
Load skew
Capacity
how many leaders per broker am I actually running?
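
Load skew can be quantified by comparing each broker's leader count with the cluster average. The sketch below assumes the per-broker counts have already been collected (for example from the LeaderCount broker metric). leader_skew is a hypothetical helper name.

```python
def leader_skew(leaders_per_broker):
    """Map broker id -> leader count divided by the cluster mean.

    1.0 means the broker carries exactly its fair share of leaders;
    values well above 1.0 indicate a broker that leads too many partitions.
    """
    if not leaders_per_broker:
        raise ValueError("need at least one broker")
    mean = sum(leaders_per_broker.values()) / len(leaders_per_broker)
    if mean == 0:
        # No leaders anywhere: report zero skew rather than dividing by zero.
        return {broker: 0.0 for broker in leaders_per_broker}
    return {broker: count / mean for broker, count in leaders_per_broker.items()}
```

With counts of 20, 10 and 0 leaders on three brokers, the first broker gets a ratio of 2.0, meaning it leads twice its fair share.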

Topic

is each partition healthy?
are we fully replicated?
are we evenly distributed among the hardware we have? (load skew)
are topic priorities set according to which topics matter most (if applicable)?

Consumer

are the consumers online?
is there consumer lag?
consumption rate & trends
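
Consumer lag for a partition is the gap between the log end offset and the group's committed offset. A minimal sketch, assuming both offset maps have already been fetched. partition_lag is a hypothetical name, and treating a missing commit as offset 0 is a simplification, since the real starting point depends on the consumer's offset reset policy.

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Return partition -> lag in messages (log end offset minus committed offset)."""
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        # Simplifying assumption: a partition with no commit counts from offset 0.
        committed = committed_offsets.get(partition, 0)
        # Clamp at 0 so a stale end-offset reading never yields negative lag.
        lag[partition] = max(end_offset - committed, 0)
    return lag
```

Summing the values gives total lag for the group. Sampling that total over time shows whether consumers are catching up or falling behind.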

Beyond the obvious

Log flush latency
Messages per second / bytes per second thresholds
Available network processor / request handler bandwidth
Topic status, network throughput
Open file handles
Memory, Load
Disk usage
GC pauses
Heap usage
Swapping
Dropped packets
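
Several of the items above correspond to broker JMX MBeans. The sketch below lists the ones for flush latency, message and byte rates, and thread pool idle time, then flags pools with little idle headroom. The dictionary keys, the saturated_pools helper and the 0.3 floor are assumptions to tune per cluster.

```python
# Broker JMX MBean names for some of the items above (keys are informal labels).
BROKER_MBEANS = {
    "log flush latency": "kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs",
    "messages in per second": "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec",
    "bytes in per second": "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec",
    "bytes out per second": "kafka.server:type=BrokerTopicMetrics,name=BytesOutPerSec",
    "network processor idle": "kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent",
    "request handler idle": "kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent",
}


def saturated_pools(idle_fractions, min_idle=0.3):
    """Return sorted labels of thread pools whose idle fraction (0..1) is below min_idle.

    min_idle=0.3 is an assumed starting threshold, not a fixed rule.
    """
    return sorted(name for name, idle in idle_fractions.items() if idle < min_idle)
```

A pool that appears in the result has little spare bandwidth left, which usually shows up as rising request latency.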

Trends to watch

Rate of topic growth
Is TTL on data (retention) long enough for data safety margins, but not too long?
Is the hardware keeping up with Kafka? (CPU, memory, network, and total I/O capacity)
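
Growth trends can be turned into a rough capacity forecast. The sketch below fits a straight line (ordinary least squares) to daily disk usage samples and estimates how many days remain until the disk is full. days_until_full is a hypothetical helper, and linear growth is an assumption that will not hold for bursty workloads.

```python
def days_until_full(daily_used_gb, capacity_gb):
    """Estimate days from the last sample until usage reaches capacity.

    Fits usage = intercept + slope * day by least squares.
    Returns None when usage is flat or shrinking.
    """
    n = len(daily_used_gb)
    if n < 2:
        raise ValueError("need at least two daily samples")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_used_gb) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_used_gb))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator  # GB per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (capacity_gb - intercept) / slope  # day index where the fit hits capacity
    return max(day_full - (n - 1), 0.0)
```

With samples of 100, 110, 120 and 130 GB on a 200 GB disk, the fitted growth is 10 GB per day, so about 7 days remain after the last sample.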

References

Kafka uses Yammer Metrics for metrics reporting in the server. For more on monitoring, refer to the official Kafka documentation.
AWS MSK metrics details