Monday, May 1, 2023

Guided Project: Monitoring Your Personal Kafka Cluster


How often do you find your Kafka cluster unreachable or down? There can be quite a few reasons why a cluster goes down, but when we manage and monitor the cluster with due diligence, we can prevent downtime in our application.

Kafka is used by numerous companies to process billions of events and messages every day. Despite its popularity and robustness, monitoring Kafka is critical to ensure its smooth functioning and to prevent potential problems. Monitoring Kafka helps in detecting and fixing issues before they become critical and affect the performance of the system.

In this guided project, we will discuss the importance of monitoring Kafka, the different types of monitoring, and the various tools and techniques available for monitoring a Kafka cluster.

Why Monitor Kafka?

  • Maintain system health: Monitoring Kafka helps in ensuring the health of the system by detecting potential issues and fixing them before they become critical. This includes monitoring the Kafka broker, ZooKeeper, and the consumers.
    • Monitoring the broker ensures that it is functioning correctly and can handle incoming and outgoing messages.
    • Monitoring ZooKeeper ensures that configuration changes are being applied and that the cluster is functioning as expected.
    • Monitoring the consumers ensures that they are consuming messages as expected and that there are no issues with the processing of the messages.
  • Improves performance: Monitoring Kafka helps in improving the performance of the system by detecting and fixing performance issues. This includes monitoring the number of messages being processed, the rate at which they are processed, and the latency of the system.
    • By monitoring the number of messages being processed, it is possible to determine whether the system can handle the volume of messages.
    • Monitoring the rate of messages being processed helps in determining whether the system can process messages fast enough.
    • Monitoring the latency of the system helps in determining whether the system can process messages in real time.
  • Detects potential issues: Monitoring Kafka helps in detecting potential issues before they become critical. This includes monitoring the broker, ZooKeeper, and the consumers.
  • Prevents data loss: Monitoring Kafka helps in preventing data loss by detecting issues with the storage of messages. This includes monitoring the broker, ZooKeeper, and the consumers.

Key Performance Metrics of Kafka

Kafka has been designed to handle large amounts of data, process data in real time, and provide high levels of scalability and reliability. As a result, it is important to monitor Kafka performance metrics to ensure that it is functioning as expected and to quickly identify potential issues. The Kafka platform provides a range of performance metrics to help users monitor the health of the system and diagnose performance problems. These metrics are critical for ensuring the stability and reliability of a Kafka cluster. Let's understand each of these metrics in detail.

Throughput

Throughput measures the rate at which data is being processed by the Apache Kafka system. It is the number of records processed per second and is an indicator of the system's performance and scalability. The higher the throughput, the better the performance of the Kafka system. Throughput can be measured at different levels of the Kafka architecture, such as the broker, topic, partition, and consumer levels.

  • Broker level: Broker-level throughput measures the total number of records processed by a broker. It is calculated as the sum of the input and output rates of all the topics on the broker. Broker-level throughput is an important metric because it provides an overall view of the broker's performance.
  • Topic level: Topic-level throughput measures the number of records processed by a specific topic. It is calculated as the sum of the input and output rates of the topic. Topic-level throughput provides insight into the performance of specific topics and helps to identify bottlenecks or issues with particular topics.
  • Partition level: Partition-level throughput measures the number of records processed by a specific partition. It is calculated as the sum of the input and output rates of the partition. Partition-level throughput provides insight into the performance of specific partitions and helps to identify issues with particular partitions.
  • Consumer level: Consumer-level throughput measures the number of records processed by a specific consumer. It is calculated as the sum of the input and output rates of the consumer. Consumer-level throughput provides insight into the performance of specific consumers and helps to identify issues with particular consumers.

The following steps can be used to calculate the throughput of a Kafka cluster (a command-line sketch follows the list):

  • Determine the number of messages produced and consumed per second: To calculate the throughput of a Kafka cluster, you first need to determine the number of messages produced and consumed per second. This can be done using tools such as the Kafka CLI, JMX, or third-party monitoring tools.
  • Determine the message size: Once you have the number of messages produced and consumed per second, you need to determine the size of the messages. This information is essential for calculating throughput because it determines the amount of data processed per second.
  • Calculate the data processed per second: The next step is to calculate the data processed per second. This is done by multiplying the number of messages produced and consumed per second by the message size.
  • Convert the data processed per second to bytes per second: The data processed per second should be expressed in bytes per second, as throughput is typically measured in bytes per second.
  • Determine the number of nodes in the cluster: It is also important to determine the number of nodes in the cluster, since the number of nodes affects the throughput of the cluster.
  • Calculate the average throughput per node: Finally, the average throughput per node can be calculated by dividing the total throughput by the number of nodes in the cluster.
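
As a rough sketch of the first step, the perf-test scripts that ship with Kafka can report produced and consumed records per second. The topic name, record count, and record size below are placeholders, and depending on your installation the scripts may carry a .sh suffix:

# Produce 100,000 records of 1 KB each at an unthrottled rate and report records/sec and MB/sec
kafka-producer-perf-test --topic test-topic \
  --num-records 100000 --record-size 1024 --throughput -1 \
  --producer-props bootstrap.servers=localhost:9092

# Consume the same number of records and report the consumption rate
kafka-consumer-perf-test --topic test-topic \
  --messages 100000 --bootstrap-server localhost:9092

Multiplying the reported records per second by the record size gives the bytes-per-second figure described above, and dividing the cluster-wide total by the number of brokers gives the average throughput per node.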

The following are the various factors that affect the throughput of a Kafka cluster:

  • Message size: The size of the messages produced and consumed in a Kafka cluster directly impacts its throughput. Larger messages take longer to transmit and process, which can reduce the overall message rate.

  • Number of nodes in the cluster: The number of nodes in a Kafka cluster also affects its throughput. More nodes generally mean more processing power and therefore higher potential throughput.

  • Network bandwidth: The network bandwidth between nodes in a Kafka cluster is another critical factor in determining throughput. If the network bandwidth is low, producers and consumers may not be able to keep up with the data being produced and consumed, leading to a reduction in overall throughput.

  • Disk I/O: The disk I/O of the nodes in a Kafka cluster also affects throughput. If the disk I/O rate is low, the brokers may not be able to keep up with the data being produced and consumed, leading to a reduction in overall throughput.

  • Number of partitions: The number of partitions in a Kafka cluster also affects throughput. More partitions allow more parallelism across producers and consumers, but very high partition counts add coordination and replication overhead on the brokers.

  • CPU utilization: CPU utilization refers to the amount of computing power being used by the Kafka cluster. If CPU utilization is high, the brokers and producers may not be able to keep up with the data being produced and consumed, leading to a reduction in overall throughput.

  • Replication factor: The replication factor in Kafka is the number of copies of the data stored in the system. It is used to ensure data durability and reliability in the event of node failures. The replication factor also affects throughput, since more replicas require more resources and bandwidth. A high replication factor can result in increased overhead and reduced system efficiency, leading to lower throughput, while a low replication factor reduces data durability and reliability. It is recommended to choose the replication factor based on the desired level of durability and reliability and the available resources.

  • Broker configuration: The configuration of the Kafka brokers also affects the throughput of the system. Brokers are the nodes in a Kafka system that manage the storage and processing of data. Their configuration, including the amount of memory, CPU, and disk resources, impacts the processing speed and efficiency of the system. Inadequate resources can result in performance degradation and low throughput. It is recommended to monitor the resource utilization of the brokers and adjust the configuration accordingly to ensure optimal performance.

  • Data compression: Data compression reduces the size of the data being transmitted, which can help improve the throughput of a Kafka system. Compression reduces the amount of data on the wire, lowering latency and making better use of the available bandwidth. However, it is important to balance the benefits of compression against its overhead, because compressing and decompressing data requires additional CPU resources.

Latency

Latency in Kafka is the time taken between the creation of a message by the producer and its delivery to the consumer. It is one of the most important metrics for measuring the performance of a Kafka cluster. High latency can significantly affect the efficiency of a Kafka-based system, leading to bottlenecks, reduced productivity, and reduced customer satisfaction.

In a Kafka-based system, messages are sent by producers, stored in topics, and then consumed by consumers. Latency occurs at different stages of this process, including the time taken for a producer to send a message, the time taken for the message to be stored in a topic, and the time taken for a consumer to receive the message.

Several factors contribute to latency in a Kafka-based system. Some of these include the size of the messages being sent, the number of consumers reading from the same topic, and the size of the Kafka cluster itself. The processing capacity of the Kafka brokers, network connectivity, and hardware resources also contribute to latency.

To minimize latency in a Kafka-based system, it is important to optimize the Kafka cluster's configuration. This may involve increasing the number of partitions in a topic, increasing the number of brokers in a cluster, or adding more hardware resources such as CPU and memory. Another approach to reducing latency is batching, where multiple messages are sent and processed together, reducing the per-message overhead of sending and receiving. This can be achieved using Kafka's built-in batching, where messages are held in a buffer before being sent on to the brokers and consumers.
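
As a small, hypothetical illustration of the batching approach, the producer's standard linger.ms and batch.size settings can be exercised through the same perf-test script; the values shown are illustrative starting points, not tuned recommendations:

# Batch up to 64 KB of records, wait at most 10 ms before sending a batch,
# and report average and maximum latency at the end of the run
kafka-producer-perf-test --topic test-topic \
  --num-records 50000 --record-size 512 --throughput -1 \
  --producer-props bootstrap.servers=localhost:9092 \
      linger.ms=10 batch.size=65536 compression.type=lz4

Larger batches and compression generally trade a little per-message delay for better overall throughput, so latency and throughput targets need to be balanced against each other.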

Consumer Lag

Consumer lag is the difference between the highest offset of a partition in a Kafka topic and the latest offset processed by a consumer in that partition. The lag represents the number of messages that the consumer has yet to process. It is important to monitor consumer lag because it provides a metric for measuring the health of the Kafka ecosystem. A large consumer lag can result in the loss of messages, slow processing times, and reduced system efficiency.

Various causes of consumer lag in Kafka are:

  • Slow processing times: If the consumer is taking too long to process messages, the lag will increase. This can be due to slow processing caused by resource constraints, system downtime, or a high volume of messages.

  • System downtime: System downtime can cause consumer lag, as the consumer will not be able to process messages until the system is back online. This can result in a backlog of unprocessed messages and a large consumer lag.

  • Network latency: Network latency can also cause consumer lag, as messages may be delayed in transit between the broker and the consumer.

  • High message volume: A high message volume can cause consumer lag if the consumer is unable to process the messages quickly enough.

We can prevent and mitigate consumer lag using the following steps (a quick CLI check follows the list):

  • Increase consumer processing speed: Improving consumer processing speed helps to reduce consumer lag. This can be achieved by optimizing the consumer code, increasing the number of consumers, or increasing system resources.

  • Use a load balancer: Distributing messages evenly among consumers reduces the likelihood of consumer lag.

  • Monitor consumer lag: Monitoring consumer lag is important in order to identify issues and take appropriate action. A monitoring tool like Kafka Manager can be used to monitor consumer lag in real time.

  • Increase the partition count: Increasing the number of partitions helps to reduce consumer lag, as messages can then be processed in parallel.
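
A quick way to inspect consumer lag, assuming a consumer group named my-group exists, is the kafka-consumer-groups tool; the LAG column is the difference between the log-end offset and the committed offset for each partition:

kafka-consumer-groups --bootstrap-server localhost:9092 \
  --describe --group my-group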

Broker Memory Usage

Several factors can impact the memory usage of brokers in a Kafka cluster. The most common factors include the size of the data being stored, the number of partitions, the number of consumers, and the use of compression. When brokers store large amounts of data, they use a significant amount of memory to hold the data and maintain the state of the cluster. The more partitions that are created, the more memory the broker needs to store the metadata for each partition. Similarly, the more consumers that are connected to the broker, the more memory the broker requires to manage their connections and track their status.

To minimize the memory usage of brokers in a Kafka cluster, it is important to consider the following best practices (a hedged example follows the list):

  • Use appropriate compression techniques: Compression can significantly reduce the size of the data being stored, which in turn reduces the memory usage of brokers. Several compression codecs can be used, such as Gzip and Snappy.

  • Monitor and tune the JVM heap size: The JVM heap size is the amount of memory allocated to the broker for running the Java Virtual Machine. By monitoring memory usage and tuning the JVM heap size, you can ensure that the broker has enough memory to run efficiently.

  • Set an appropriate replication factor: The replication factor is the number of copies of each partition stored across the brokers in a cluster. By setting an appropriate replication factor, you can minimize the memory usage of brokers while ensuring that data is stored redundantly for reliability.

  • Regularly prune the data: Over time, the data stored on a broker may become stale and no longer needed. By regularly pruning the data, you can reduce the memory usage of brokers and improve the performance of the cluster.
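
As a hedged example of the heap-tuning and data-pruning points above, topic-level retention can be shortened with kafka-configs so that old segments are deleted sooner, and the broker heap can be bounded through the KAFKA_HEAP_OPTS environment variable that also appears in the compose file later in this project. The topic name and values below are placeholders:

# Keep data on the hypothetical topic "events" for one hour only
kafka-configs --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name events \
  --add-config retention.ms=3600000

# Example heap bound for a small, single-broker test cluster
export KAFKA_HEAP_OPTS="-Xms512m -Xmx1g"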

Network Bandwidth Usage

One of the critical components of a Kafka deployment is network bandwidth, which is the amount of data that can be transmitted over a network connection in a given time. Network bandwidth is a finite resource, and managing it effectively is essential for ensuring that the Kafka cluster operates smoothly and efficiently.

The various factors that affect network bandwidth usage are (a compression example follows the list):

  • Cluster size: The number of brokers in a Kafka cluster directly impacts network bandwidth usage. The more brokers there are in a cluster, the more network traffic is generated between them as they communicate and coordinate with one another.

  • Replication factor: The replication factor of a Kafka cluster is the number of replicas of a partition kept in the cluster. A higher replication factor results in higher network bandwidth usage, as more data is replicated between brokers.

  • Partition count: The number of partitions in a topic also affects network bandwidth usage. More partitions result in higher network bandwidth usage as data is distributed among the brokers in the cluster.

  • Message size: The size of the messages being produced and consumed in a Kafka cluster affects network bandwidth usage. Larger messages result in higher network bandwidth usage, as more data is transmitted over the network.

  • Compression: Compression can significantly reduce network bandwidth usage in a Kafka cluster. Compressing messages before they are transmitted substantially reduces the amount of data sent over the network.
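
As a sketch of the compression point, compression can be enabled on the producer (the compression.type producer property) or at the topic level with kafka-configs; the topic name below is hypothetical:

# Store the hypothetical topic "events" compressed with lz4 on the brokers
kafka-configs --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name events \
  --add-config compression.type=lz4

# Confirm the change
kafka-configs --bootstrap-server localhost:9092 --describe \
  --entity-type topics --entity-name events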

Disk I/O Usage

Disk I/O usage is crucial in Kafka because the platform needs to read and write data to disk at high speed to keep up with incoming data streams. As the data streams grow larger and more complex, disk I/O usage increases, which can lead to a decrease in performance. Disk I/O affects the entire data streaming process, from reading and writing to disk to data replication and processing.

To monitor disk I/O usage in Kafka, administrators can use a variety of tools and metrics. Some of the key metrics to monitor include disk I/O operations per second, disk I/O response time, disk space utilization, and disk read and write latency. These metrics provide insight into the performance of the disks and can be used to identify bottlenecks and optimize disk I/O usage.

To optimize disk I/O usage in Kafka, administrators can take a variety of steps. Some of the key optimization strategies include (a monitoring sketch follows the list):

  • Disk configuration: Proper disk configuration is key to maximizing disk I/O performance. Administrators should configure the disks for high performance, for example by using solid-state drives (SSDs) or RAID arrays.

  • Disk cache: The disk cache can improve disk I/O performance by reducing the number of disk operations required. Administrators can configure the disk cache size to match the needs of the data streaming workload.

  • Data compression: Data compression reduces the amount of disk space required to store data, resulting in improved disk I/O usage. Administrators can enable data compression in the Kafka configuration.

  • Disk replication: Replication can improve disk I/O usage by distributing the load of disk operations across multiple brokers. Administrators can configure the number of replicas and their placement to optimize disk I/O usage.
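
On a Linux broker host, a simple starting point for the disk metrics described above is iostat from the sysstat package, together with a check of the log directory size; device names and paths will differ per machine (the path shown is the default in the Confluent images used later in this project):

# Extended per-device statistics (r/s, w/s, await, %util), refreshed every 5 seconds
iostat -x 5

# Disk space consumed by Kafka's log directories
du -sh /var/lib/kafka/data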

Produce Request Rate

The produce request rate is the rate at which messages are produced to a topic in Kafka. It is an important metric that indicates the rate at which data is being written to the Kafka cluster, and it is measured in requests per second (RPS). The higher the produce request rate, the more data is being written to the Kafka cluster.

The produce request rate matters for the performance of the Kafka cluster. A high produce request rate can lead to increased write latency and cause the cluster to become congested. On the other hand, a low produce request rate can result in a low processing rate, reducing the efficiency of the data processing system.

Several factors can affect the produce request rate in Kafka. Some of the most common are the size of the messages, the number of partitions in a topic, and the number of producers writing to the topic. The configuration of the Kafka cluster, including the number of replicas and the number of brokers, can also impact the produce request rate.

To ensure optimal performance of the Kafka cluster, it is important to monitor the produce request rate and adjust the configuration as necessary. There are several tools available for monitoring the produce request rate in Kafka.
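
One such tool ships with Kafka itself: JmxTool can poll a single MBean over the JMX port exposed by the broker (port 9991 in the compose file used later in this project). The exact invocation varies between Kafka versions, so treat this as a sketch:

# Poll the produce request-rate MBean every 5 seconds over JMX
kafka-run-class kafka.tools.JmxTool \
  --object-name 'kafka.server:type=BrokerTopicMetrics,name=TotalProduceRequestsPerSec' \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9991/jmxrmi \
  --reporting-interval 5000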

Monitoring Kafka with Prometheus and Grafana

We have looked at the key performance metrics and the factors that affect them. A number of tools and techniques can be used for monitoring Kafka, including Prometheus and Grafana. First, we will discuss how to set up monitoring of a Kafka cluster using these two tools.

Prometheus is a popular open-source monitoring and alerting system that can collect and store metrics from a variety of sources, including Kafka. Prometheus is designed to be highly scalable and easily configurable, making it a good choice for monitoring a large and complex Kafka deployment.

To set up Prometheus monitoring for Kafka, we first need to install and configure Prometheus in our cluster. Once Prometheus is running, we need to install and configure a Kafka exporter, a tool that exposes Kafka metrics in a format that Prometheus can scrape and store. The most commonly used Kafka exporter is the JMX exporter, which collects metrics via the Java Management Extensions (JMX) protocol.

Once the exporter is running, we can configure Prometheus to scrape the metrics from it. This is done by adding the exporter as a target in the Prometheus configuration file. We can also configure Prometheus to send alerts based on specific metric thresholds, so we are notified if something goes wrong with our Kafka cluster.

Grafana is a popular open-source dashboard and visualization tool that can be used to display the metrics collected by Prometheus. Grafana allows us to create interactive and customizable dashboards that visualize key metrics and trends in our Kafka deployment.

To set up Grafana, we first need to install and configure it in our cluster. Once Grafana is running, we add the Prometheus data source so that it can access the metrics collected by Prometheus. We can then create dashboards and add panels to display the metrics that we are interested in. Some commonly monitored metrics for a Kafka cluster include broker health, consumer lag, and topic and partition sizes.

Prometheus and Grafana are powerful tools for collecting and displaying metrics from a Kafka deployment. By setting up monitoring with them, we can quickly identify and resolve any issues that arise in our cluster, giving us confidence in our data streaming platform.

To begin with our configuration, we will perform the following steps:

  • Set up a Kafka cluster with a ZooKeeper instance.
  • Configure the JMX exporter to extract Kafka metrics.
  • Set up a Prometheus instance and scrape the metrics from the exporter.
  • Set up a Grafana instance and add Prometheus as a data source to create dashboards and visualize the metrics.

Kafka Cluster

We can quickly create a Kafka instance by running it in Docker. For our use case throughout this guided project, we will use docker-compose.

First, we will have a ZooKeeper instance running on port 2181. We will use the official Confluent Docker images to create our instance:

version: '3'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.1.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
      ZOOKEEPER_INIT_LIMIT: 5
      ZOOKEEPER_SYNC_LIMIT: 2
    ports:
     - "2181:2181"
    container_name: zookeeper

Next, we will bring up the Kafka instance using the same Confluent Docker images:

version: '3'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.1.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
      ZOOKEEPER_INIT_LIMIT: 5
      ZOOKEEPER_SYNC_LIMIT: 2
    ports:
     - "2181:2181"
    container_name: zookeeper

  kafka:
    image: confluentinc/cp-kafka:7.1.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
      - "9991:9991"
    container_name: kafka
    environment:
      KAFKA_BROKER_ID: 101
      KAFKA_JMX_PORT: 9991
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_HOST_NAME: localhost
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      KAFKA_LOG_RETENTION_MS: 60000
      CONFLUENT_METRICS_REPORTER_BOOTSTRAP_SERVERS: kafka:29092
      CONFLUENT_METRICS_REPORTER_ZOOKEEPER_CONNECT: zookeeper:2181
      CONFLUENT_METRICS_REPORTER_TOPIC_REPLICAS: 1
      CONFLUENT_METRICS_ENABLE: 'false'
      KAFKA_HEAP_OPTS: '-XX:MaxRAMPercentage=70.0'

Finally, we can run this using the command:

docker-compose up --build
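
Once the containers are up, a quick sanity check confirms the broker is reachable. The CLI tools ship inside the Confluent image, so we run them with docker exec; the topic name is arbitrary:

# List the running containers
docker-compose ps

# Create a test topic and then list all topics against the broker
docker exec kafka kafka-topics --bootstrap-server localhost:9092 \
  --create --topic test-topic --partitions 3 --replication-factor 1
docker exec kafka kafka-topics --bootstrap-server localhost:9092 --list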

Prometheus JMX Exporter

The Prometheus JMX exporter for Kafka is a Java-based tool that enables the collection of Kafka metrics in a Prometheus-compatible format. The tool acts as a bridge between Kafka and Prometheus, providing a convenient way to monitor the performance and health of our Kafka cluster.

The tool works by scraping the JMX (Java Management Extensions) MBeans of our Kafka brokers and converting the data into Prometheus-compatible metrics. These metrics can then be visualized and analyzed using Prometheus. One of the key benefits of this exporter is that it provides a unified way to collect and analyze metrics from multiple Kafka brokers. This makes it easy to get a complete view of the performance and health of our Kafka cluster and to identify potential issues before they become major problems.

Another advantage of this exporter is that it exposes a wide range of Kafka-relevant metrics. These include information about the number of messages produced and consumed, the rate of messages, the size of the Kafka logs, and much more. This makes it easy to monitor the performance of your Kafka cluster and to identify areas that may need tuning or improvement. The tool is highly configurable and can be adapted to the needs of our specific Kafka cluster.

Since we now have our Kafka instance up and running, let's configure the JMX exporter. First, we need to download the JMX exporter jar and then run it with certain JVM options. We also need to define a config file:

hostPort: kafka:9991
lowercaseOutputName: true
lowercaseOutputLabelNames: true

#Whitelist is used to reduce scraping time
whitelistObjectNames:
  - java.lang:*
  - kafka.cluster:*
  - kafka.controller:*
  - kafka.log:*
  - kafka.server:type=KafkaServer,name=BrokerState
  - kafka.server:type=KafkaRequestHandlerPool,*
  - kafka.server:type=BrokerTopicMetrics,*
  - kafka.server:type=FetcherLagMetrics,*
  - kafka.server:type=FetcherStats,*
  - kafka.server:type=Request,*
  - kafka.server:type=Fetch,*
  - kafka.server:type=Produce,*
  - kafka.server:type=ReplicaManager,*
  - kafka.server:type=ReplicaFetcherManager,*
  - kafka.server:type=SessionExpireListener,*
  - kafka.server:type=controller-channel-metrics,*
  - kafka.server:type=socket-server-metrics,*
  - kafka.network:type=RequestChannel,*
  - kafka.network:type=Processor,*
  - kafka.network:type=SocketServer,*
  - kafka.network:type=RequestMetrics,*
  - kafka.coordinator.group:*
blacklistObjectNames:
  - java.lang:type=ClassLoading,*
  - java.lang:type=Compilation,*
  - java.lang:type=MemoryManager,*
  - kafka.utils:*
  - kafka.controller:type=ControllerChannelManager,name=QueueSize,*
  - kafka.log:type=Log,name=LogEndOffset,*
  - kafka.log:type=Log,name=LogStartOffset,*
  - kafka.cluster:type=Partition,name=InSyncReplicasCount,*
  - kafka.cluster:type=Partition,name=LastStableOffsetLag,*
  - kafka.cluster:type=Partition,name=ReplicasCount,*
  - kafka.cluster:type=Partition,name=UnderReplicated,*
  - kafka.server:type=BrokerTopicMetrics,name=TotalFetchRequestsPerSec,*
  - kafka.server:type=BrokerTopicMetrics,name=TotalProduceRequestsPerSec,*
  - kafka.server:type=BrokerTopicMetrics,name=FailedProduceRequestsPerSec,*
  - kafka.server:type=BrokerTopicMetrics,name=FailedFetchRequestsPerSec,*
  - kafka.server:type=BrokerTopicMetrics,name=BytesRejectedPerSec,*

rules:
  #------------------------------------------------------------------------------------------------------
  # KafkaServers : State of the broker server
  #
  #  - BrokerState
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.server<type=KafkaServer, name=BrokerState><>Value
    name: kafka_server_brokerstate
    labels:
      service: kafka-broker
      env: kafka-cluster

  #------------------------------------------------------------------------------------------------------
  # Partition : Number of partitions for each broker
  # - InSyncReplicasCount
  # - LastStableOffsetLag
  # - ReplicasCount
  # - UnderReplicated
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.cluster<type=Partition, name=([^,]+), topic=([^,]+), partition=([^,]+)><>Value
    name: kafka_cluster_partition_$1
    labels:
      topic: $2
      partition: $3
      service: kafka-broker
      env: kafka-cluster
  #------------------------------------------------------------------------------------------------------
  # KafkaController :
  #
  #  - ActiveControllerCount, OfflinePartitionsCount, PreferredReplicaImbalanceCount
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.controller<type=KafkaController, name=([^,]+)><>Value
    name: kafka_controller_$1
    labels:
      service: kafka-broker
      env: kafka-cluster

  #------------------------------------------------------------------------------------------------------
  # ControllerStats : The event that is currently being processed by the elected broker controller.
  #
  #  - LeaderElectionRateAndTimeMs, UncleanLeaderElectionsPerSec, AutoLeaderBalanceRateAndTimeMs, ManualLeaderBalanceRateAndTimeMs
  #  - ControllerChangeRateAndTimeMs,
  #  - TopicChangeRateAndTimeMs, TopicDeletionRateAndTimeMs, PartitionReassignmentRateAndTimeMs
  #  - IsrChangeRateAndTimeMs
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.controller<type=ControllerStats, name=([^,]+)><>(OneMinuteRate|Mean|75thPercentile|99thPercentile)
    name: kafka_controller_stats_$1
    labels:
      aggregate: $2
      service: kafka-broker
      env: kafka-cluster
  #------------------------------------------------------------------------------------------------------
  # Coordinator : GroupMetadataManager
  #
  #  - NumGroups, NumOffsets
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.coordinator.group<type=GroupMetadataManager, name=([^,]+)><>(Value)
    name: kafka_coordinator_group_metadata_manager_$1
    labels:
      service: kafka-broker
      env: kafka-cluster
  #------------------------------------------------------------------------------------------------------
  # Logs :
  #
  #  - LogEndOffset, LogStartOffset, NumLogSegments, Size
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.log<type=Log, name=([^,]+), topic=([^,]+), partition=([^,]+)><>Value
    name: kafka_log_$1
    labels:
      topic: $2
      partition: $3
      service: kafka-broker
      env: kafka-cluster

  #------------------------------------------------------------------------------------------------------
  # LogCleaner :
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.log<type=LogCleaner, name=cleaner-recopy-percent><>(Value)
    name: kafka_log_cleaner_recopy_percent
    labels:
      service: kafka-broker
      env: kafka-cluster

  - pattern: kafka.log<type=LogCleaner, name=max-clean-time-secs><>(Value)
    name: kafka_log_cleaner_max_clean_time_secs
    labels:
      service: kafka-broker
      env: kafka-cluster

  - pattern: kafka.log<type=LogCleaner, name=max-buffer-utilization-percent><>(Value)
    name: kafka_log_cleaner_max_buffer_utilization_percent
    labels:
      service: kafka-broker
      env: kafka-cluster

  #------------------------------------------------------------------------------------------------------
  # LogCleanerManager :
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.log<type=LogCleanerManager, name=max-dirty-percent><>(Value)
    name: kafka_log_cleaner_manager_max_dirty_percent
    labels:
      service: kafka-broker
      env: kafka-cluster

  #------------------------------------------------------------------------------------------------------
  # LogFlushStats :
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.log<type=LogFlushStats, name=LogFlushRateAndTimeMs><>(\w+)
    name: kafka_log_flush_stats_rate_and_time_ms
    labels:
      aggregate: $1
      service: kafka-broker
      env: kafka-cluster
  #------------------------------------------------------------------------------------------------------
  # KafkaRequestHandlerPool : Latency
  #
  # - KafkaRequestHandlerPool
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.server<type=KafkaRequestHandlerPool, name=RequestHandlerAvgIdlePercent><>(\w+)
    name: kafka_server_request_handler_avg_idle_percent
    labels:
      aggregate: $1
      service: kafka-broker
      env: kafka-cluster

  #------------------------------------------------------------------------------------------------------
  # Network Socket Server : Latency
  #
  # - NetworkProcessorAvgIdlePercent
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.network<type=SocketServer, name=NetworkProcessorAvgIdlePercent><>(Value)
    name: kafka_network_socket_server_processor_avg_idle_percent
    labels:
      service: kafka-broker
      env: kafka-cluster

  #------------------------------------------------------------------------------------------------------
  # Network Processor : Latency
  #
  # - IdlePercent
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.network<type=Processor, name=IdlePercent, networkProcessor=([0-9]+)><>(Value)
    name: kafka_network_processor_idle_percent
    labels:
      processor: $1
      service: kafka-broker
      env: kafka-cluster

  #------------------------------------------------------------------------------------------------------
  # Network KafkaRequestChannel :
  #
  # - RequestQueueSize, ResponseQueueSize
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.network<type=RequestChannel, name=(\w+)QueueSize><>Value
    name: kafka_network_request_channel_queue_size
    labels:
      queue: $1
      service: kafka-broker
      env: kafka-cluster

  #------------------------------------------------------------------------------------------------------
  # Network KafkaRequest :
  #
  # - RequestPerSec,
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.network<type=RequestMetrics, name=RequestsPerSec, request=(\w+), version=([0-9]+)><>(OneMinuteRate|Mean)
    name: kafka_network_request_per_sec
    labels:
      request: $1
      version: $2
      aggregate: $3
      service: kafka-broker
      env: kafka-cluster

  #------------------------------------------------------------------------------------------------------
  # Network KafkaRequestMetrics :
  #
  # - LocalTimeMs, RemoteTimeMs,
  # - RequestQueueTimeMs,
  # - ResponseQueueTimeMs, ResponseSendTimeMs
  # - ThrottleTimeMs
  # - TotalTimeMs
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.network<type=RequestMetrics, name=(\w+)TimeMs, request=(Produce|Fetch|FetchConsumer|FetchFollower)><>(OneMinuteRate|Mean|75thPercentile|99thPercentile)
    name: kafka_network_request_metrics_time_ms
    labels:
      scope: $1
      request: $2
      aggregate: $3
      service: kafka-broker
      env: kafka-cluster

  #------------------------------------------------------------------------------------------------------
  # KafkaServer / BrokerTopicMetrics - I/O metrics :
  #
  # - BytesInPerSec, BytesOutPerSec, BytesRejectedPerSec,
  # - FailedFetchRequestsPerSec, FailedProduceRequestsPerSec, MessagesInPerSec,
  # - TotalFetchRequestPerSec, TotalProduceRequestPerSec, ReplicationBytesInPerSec, ReplicationBytesOutPerSec
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.server<type=BrokerTopicMetrics, name=(BytesInPerSec|BytesOutPerSec|MessagesInPerSec|ReplicationBytesOutPerSec|ReplicationBytesInPerSec)><>(OneMinute)Rate
    name: kafka_server_broker_topic_metrics_$1_rate
    labels:
      aggregate: $2
      service: kafka-broker
      env: kafka-cluster

  - pattern: kafka.server<type=BrokerTopicMetrics, name=(BytesInPerSec|BytesOutPerSec|MessagesInPerSec|ReplicationBytesOutPerSec|ReplicationBytesInPerSec), topic=([^,]+)><>(OneMinute)Rate
    name: kafka_server_broker_topic_metrics_$1_rate
    labels:
      topic: $2
      aggregate: $3
      service: kafka-broker
      env: kafka-cluster
  #------------------------------------------------------------------------------------------------------
  # KafkaServer / DelayedFetchMetrics :
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.server<type=DelayedFetchMetrics, name=ExpiresPerSec, fetcherType=([^,]+)><>([^,]+)Rate
    name: kafka_server_delayed_fetch_expires_per_sec
    labels:
      fetcher_type: $1
      aggregate: $2
      service: kafka-broker
      env: kafka-cluster
  #------------------------------------------------------------------------------------------------------
  # KafkaServer / DelayedOperationPurgatory :
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.server<type=DelayedOperationPurgatory, name=([^,]+), delayedOperation=([^,]+)><>Value
    name: kafka_server_delayed_operation_purgatory_$1
    labels:
      operation: $2
      service: kafka-broker
      env: kafka-cluster
  #------------------------------------------------------------------------------------------------------
  # FetcherLagMetrics : Lag in number of messages per follower replica
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.server<type=FetcherLagMetrics, name=([^,]+), clientId=([^,]+), topic=([^,]+), partition=([^,]+)><>Value
    name: kafka_server_fetcher_lag_$1
    labels:
      client_id: $2
      topic: $3
      partition: $4
      service: kafka-broker
      env: kafka-cluster
  #------------------------------------------------------------------------------------------------------
  # FetcherStats : Replica Fetcher Thread stats
  # - BytesPerSec / RequestsPerSec
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.server<type=FetcherStats, name=([^,]+), clientId=([^,]+), brokerHost=([^,]+), brokerPort=([^,]+)><>([^,]+)Rate
    name: kafka_server_fetcher_stats_$1
    labels:
      client_id: $2
      broker_host: $3
      broker_port: $4
      aggregate: $5
      service: kafka-broker
      env: kafka-cluster
  #------------------------------------------------------------------------------------------------------
  # KafkaServer Request :
  # - request-time - Tracking request-time per user/client-id
  # - throttle-time - Tracking average throttle-time per user/client-id
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.server<type=Request, client-id=([^,]+)><>(request-time|throttle-time)
    name: kafka_server_request_$2
    labels:
      client_id: $1
      service: kafka-broker
      env: kafka-cluster
  #------------------------------------------------------------------------------------------------------
  # KafkaServer  Fetcher/Producer :
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.server<type=Fetch, client-id=([^,]+)><>(byte-rate|throttle-time)
    name: kafka_server_fetch_client_$2
    labels:
      client_id: $1
      service: kafka-broker
      env: kafka-cluster

  - pattern: kafka.server<type=Produce, client-id=([^,]+)><>(byte-rate|throttle-time)
    name: kafka_server_produce_client_$2
    labels:
      client_id: $1
      service: kafka-broker
      env: kafka-cluster
  #------------------------------------------------------------------------------------------------------
  # ReplicaManager :
  #     - IsrExpandsPerSec, IsrShrinksPerSec, FailedIsrUpdatesPerSec
  #     - LeaderCount, PartitionCount, UnderReplicatedPartitions
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.server<type=ReplicaManager, name=([^,]+)><>([^,]+)Rate
    name: kafka_server_replica_manager_$1
    labels:
      aggregate: $2
      service: kafka-broker
      env: kafka-cluster

  - pattern: kafka.server<type=ReplicaManager, name=([^,]+)><>(Value)
    name: kafka_server_replica_manager_$1
    labels:
      service: kafka-broker
      env: kafka-cluster

  #------------------------------------------------------------------------------------------------------
  # ReplicaFetcherManager :
  #     - MaxLag, MinFetchRate
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.server<type=ReplicaFetcherManager, name=([^,]+), clientId=([^,]+)><>(Value)
    name: kafka_server_replica_fetcher_manager_$1_value
    labels:
      client_id: $2
      service: kafka-broker
      env: kafka-cluster

  #------------------------------------------------------------------------------------------------------
  # Zookeeper / SessionExpireListener :
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.server<type=SessionExpireListener, name=([^,]+)><>([^,]+)Rate
    name: kafka_zookeeper_session_expire_listener_$1
    labels:
      aggregate: $2
      service: kafka-broker
      env: kafka-cluster

  #------------------------------------------------------------------------------------------------------
  # ControllerChannelMetrics:
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.server<type=controller-channel-metrics, broker-id=([^,]+)><>(\w*)
    name: kafka_server_controller_channel_metrics_$2
    labels:
      broker_id: $1
      service: kafka-broker
      env: kafka-cluster

  #------------------------------------------------------------------------------------------------------
  # KafkaServer / Socket Server Metrics
  #------------------------------------------------------------------------------------------------------
  - pattern: kafka.server<type=socket-server-metrics, networkProcessor=([^,]+)><>(\w*)
    name: kafka_server_socket_server_metrics_$2
    labels:
      network_processor: $1
      service: kafka-broker
      env: kafka-cluster
  #------------------------------------------------------------------------------------------------------
  #
  # Broker / JVM
  #------------------------------------------------------------------------------------------------------
  # JVM GarbageCollector
  #
  - pattern: 'java.lang<type=GarbageCollector, name=(.*)><>CollectionCount'
    name: kafka_jvm_gc_collection_count
    labels:
      name: $1
      service: kafka-broker
      env: kafka-cluster

  - pattern: 'java.lang<type=GarbageCollector, name=(.*)><>CollectionTime'
    name: kafka_jvm_gc_collection_time
    labels:
      name: $1
      service: kafka-broker
      env: kafka-cluster

  - pattern: java.lang<type=GarbageCollector,name=(.*)><LastGcInfo, duration>
    name: kafka_jvm_last_gc_duration
    labels:
      name: $1
      service: kafka-broker
      env: kafka-cluster
    attrNameSnakeCase: true

  - pattern: 'java.lang<type=GarbageCollector, name=([^,]+), key=(.*)><LastGcInfo, memoryUsage(\w+)Gc>(\w+)'
    name: kafka_jvm_last_gc_memory_usage_$4
    labels:
      name: $1
      space: $2
      type: $4
      service: kafka-broker
      env: kafka-cluster
    attrNameSnakeCase: true

  # JVM Memory
  - pattern: java.lang<type=Memory><HeapMemoryUsage>(\w*)
    name: kafka_jvm_heap_usage
    labels:
      type: $1
      service: kafka-broker
      env: kafka-cluster
    attrNameSnakeCase: true

  - pattern: java.lang<type=Memory><NonHeapMemoryUsage>(\w*)
    name: kafka_jvm_non_heap_usage
    labels:
      type: $1
      service: kafka-broker
      env: kafka-cluster
    attrNameSnakeCase: true

  - pattern: 'java.lang<type=MemoryPool, name=(.*)><CollectionUsage>(\w*)'
    name: kafka_jvm_memory_pool_collection_usage
    labels:
      name: $1
      type: $2
      service: kafka-broker
      env: kafka-cluster
  - pattern: 'java.lang<type=MemoryPool, name=(.*)><Usage>(\w*)'
    name: kafka_jvm_memory_pool_usage
    labels:
      name: $1
      type: $2
      service: kafka-broker
      env: kafka-cluster
  - pattern: 'java.lang<type=MemoryPool, name=(.*)><PeakUsage>(\w*)'
    name: kafka_jvm_memory_pool_peak_usage
    labels:
      name: $1
      type: $2
      service: kafka-broker
      env: kafka-cluster

  # JVM Thread
  - pattern: java.lang<type=Threading><>(\w*thread_count)
    name: kafka_jvm_$1
    labels:
      service: kafka-broker
      env: kafka-cluster
    attrNameSnakeCase: true

  # Operating System
  - pattern: java.lang<type=OperatingSystem><>(\w*)
    name: kafka_jvm_os_$1
    labels:
      service: kafka-broker
      env: kafka-cluster

Next, we need to make sure the JMX exporter jar uses this configuration. For this, we can create a Dockerfile and build an image:

FROM openjdk:8-jre

ARG buildversion=0.15.0
ARG buildjar=jmx_prometheus_httpserver-$buildversion-jar-with-dependencies.jar

ENV version $buildversion
ENV jar $buildjar
ENV SERVICE_PORT 5556
ENV CONFIG_YML /opt/jmx_exporter/config.yml
ENV JVM_OPTS -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=5555

RUN mkdir -p /opt/jmx_exporter
RUN useradd -ms /bin/bash prom_exporter
RUN chown prom_exporter:prom_exporter /opt/jmx_exporter

USER prom_exporter

RUN curl -L https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_httpserver/$version/$jar -o /opt/jmx_exporter/$jar
COPY config.yml /opt/jmx_exporter/

CMD ["sh", "-c", "java $JVM_OPTS -jar /opt/jmx_exporter/$jar $SERVICE_PORT $CONFIG_YML" ]

Now, we can finally run this image as part of the existing docker-compose setup:

version: '3'
services:
  jmx-exporter-kafka:
    build:
      dockerfile: Dockerfile
      context: .
    ports:
     - "5556:5556"
    container_name: jmx-exporter-kafka
    depends_on:
     - kafka

With this, our exporter is ready to expose the metrics; a quick check is shown below. After that, we will move on to the Prometheus part.
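
With the port mapping above, the following should print Prometheus-formatted kafka_* metrics once the exporter container is running:

curl -s http://localhost:5556/metrics | grep '^kafka_' | head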

Prometheus

Next, we need to host a Prometheus instance and scrape the metrics exposed by the JMX exporter. We also need to define a configuration for this purpose:

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it is the Kafka JMX exporter.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'kafka'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['jmx-exporter-kafka:5556']

Then we can define a docker-compose config to run a Prometheus instance and mount the config file:

version: '3'
services:
  prometheus:
    image: "prom/prometheus:v2.34.0"
    ports:
     - "9090:9090"
    volumes:
     - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command: "--config.file=/etc/prometheus/prometheus.yml"
    container_name: prometheus

With this, we can run the Prometheus instance with the given configuration and scrape the metrics from the JMX exporter.
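
With Prometheus listening on port 9090, its HTTP API can be used to confirm that the kafka job is being scraped and to run an ad-hoc query against one of the exported metrics (the metric name matches the kafka_server_brokerstate rule defined earlier):

# Check the health of all scrape targets
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[^"]*"'

# Query a single exported metric
curl -s 'http://localhost:9090/api/v1/query?query=kafka_server_brokerstate'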

Grafana

Now, we will take a look at the final piece of the puzzle. We have to host a Grafana instance. To do that, first we need to configure a datasource:

# config file version
apiVersion: 1

# list of datasources that should be deleted from the database
deleteDatasources:
  - name: Prometheus
    orgId: 1

# list of datasources to insert/update depending
# on what's available in the database
datasources:
  # <string, required> name of the datasource. Required
- name: Prometheus
  # <string, required> datasource type. Required
  type: prometheus
  # <string, required> access mode. proxy or direct (Server or Browser in the UI). Required
  access: direct
  # <int> org id. will default to orgId 1 if not specified
  orgId: 1
  # <string> url
  url: http://localhost:9090
  # <string> database password, if used
  password:
  # <string> database user, if used
  user:
  # <string> database name, if used
  database:
  # <bool> enable/disable basic auth
  basicAuth:
  # <string> basic auth username
  basicAuthUser:
  # <string> basic auth password
  basicAuthPassword:
  # <bool> enable/disable with credentials headers
  withCredentials:
  # <bool> mark as default datasource. Max one per org
  isDefault: true
  # <map> fields that will be converted to json and stored in json_data
  jsonData:
     graphiteVersion: "1.1"
     tlsAuth: true
     tlsAuthWithCACert: true
  # <string> json object of data that will be encrypted.
  secureJsonData:
    tlsCACert: "..."
    tlsClientCert: "..."
    tlsClientKey: "..."
  version: 1
  # <bool> allow users to edit datasources from the UI.
  editable: true

Next, we need to create dashboards. We can build each of them in Grafana and export them in JSON format. For this guided project, we have created a few pre-built dashboards, which we will show later. So, let's define the docker-compose config to host a Grafana instance in our existing cluster:

version: '3'
services:
  grafana:
    image: "grafana/grafana:8.4.5"
    ports:
     - "3000:3000"
    environment:
      GF_PATHS_DATA : /var/lib/grafana
      GF_SECURITY_ADMIN_PASSWORD : kafka
    volumes:
     - ./grafana/provisioning:/etc/grafana/provisioning
     - ./grafana/dashboards:/var/lib/grafana/dashboards
    container_name: grafana
    depends_on:
     - prometheus

With this, we have completed our stack and can now visualize the overall Kafka metrics in the form of various widgets on Grafana dashboards.
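
Once the Grafana container is up, its health endpoint and the provisioned datasource can be verified over the API; the admin password kafka comes from the compose file above:

# Basic health check
curl -s http://localhost:3000/api/health

# List the provisioned datasources using the admin credentials from docker-compose
curl -s -u admin:kafka http://localhost:3000/api/datasources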

Based on the key metrics needed to monitor the performance of Kafka, we have created the following dashboards:

  • Kafka Broker / Hard Disk Usage: This dashboard covers the disk usage of the various topics created in the cluster (with or without replication) and the size of the logs per broker.

Grafana Broker Hard Disk Usage

  • Kafka Broker / JVM & OS Usage: This dashboard covers the JVM, CPU, and physical memory usage of the Kafka broker.

Grafana Broker JVM & OS Usage

  • Kafka Broker / Performance & Latency: This dashboard monitors the overall performance and network latency of the broker. We can select between the different request types (Produce, Fetch, FetchConsumer, and FetchFollower) from the dropdown and check the parameters.

Grafana Broker Performance & Latency

  • Kafka Broker / ZooKeeper Connection: This dashboard monitors the connectivity between the Kafka broker and the ZooKeeper instance. We can configure an alert on top of this dashboard to notify us if the connectivity between the two breaks or the session expires.

Grafana Broker Zookeeper Connection

  • Kafka Cluster / Global Healthcheck: This dashboard monitors global metrics such as the health check of the cluster, throughput in/out of the cluster, and the status of in-sync replicas and leader election.

Grafana Cluster Global Healthcheck

  • Kafka Cluster / Replication: This dashboard monitors the replication metrics throughout the cluster.

Grafana Cluster Replication

  • Kafka Topics / Logs: This dashboard monitors the metrics of the various topics and their logs. It mostly covers Messages In and Bytes In/Out per second.

Grafana Topics & Logs

This covers many of the key metrics required to observe a full-fledged Kafka cluster. Subsequent, we would wish to configure the looks of the panel, corresponding to the colours, axis labels, and font measurement. We are able to additionally add annotations to the panel, corresponding to notes or markers, to assist us interpret the information.

Occasion-driven microservices are a strong solution to course of and reply to occasions in real-time. By pushing occasions to Prometheus and visualizing the information in Grafana, we are able to achieve beneficial insights into the efficiency and conduct of our purposes. With the fitting configuration, we are able to simply create dashboards that present significant and actionable details about our methods.

You can find the Docker configuration for this section on GitHub.

Monitoring Kafka using the ELK stack and Metricbeat

The ELK stack is a combination of three popular open-source technologies: Elasticsearch, Logstash, and Kibana. These tools work together to form a powerful system for centralized log management, monitoring, and analysis. The ELK stack has gained immense popularity in recent years due to its ability to handle large amounts of data and provide real-time insights into a system's performance.

Elasticsearch is a distributed search and analytics engine based on Lucene. It is designed to handle large amounts of data and can be used to search, analyze, and visualize data in real-time. Built on top of Lucene, a high-performance full-text search engine, it is distributed by nature, which means it can handle large amounts of data and be scaled horizontally. Elasticsearch is used in various industries such as e-commerce, finance, healthcare, and more.

Some of the key features of Elasticsearch include:

  • Full-text search: Elasticsearch provides powerful search capabilities for full-text queries.
  • Real-time analytics: Elasticsearch can be used to analyze data in real-time and provide insights into system performance.
  • Distributed architecture: Elasticsearch is designed to handle large amounts of data and can be scaled horizontally.
  • RESTful API: Elasticsearch provides a RESTful API that can be used to interact with the system.

Logstash is a data processing pipeline that collects, parses, and enriches data from different sources. It is used for data collection and processing before indexing into Elasticsearch. Logstash supports various input plugins, which can be used to collect data from different sources such as files, syslog, TCP/UDP, and more. Logstash also supports various output plugins, which can be used to send data to different destinations such as Elasticsearch, Amazon S3, and more.

Some of the key features of Logstash include:

  • Data parsing: Logstash can be used to parse different types of data such as JSON, CSV, and more.
  • Data enrichment: Logstash can be used to enrich data by adding or modifying fields.
  • Scalability: Logstash can be scaled horizontally to handle large amounts of data.
  • Plugin support: Logstash supports various input and output plugins that can be used to collect and send data to different destinations.

Kibana is a data visualization tool that can be used to visualize data stored in Elasticsearch. It provides various visualization options such as line charts, bar charts, heat maps, and more. Kibana can be used to create dashboards, which provide a summary view of system performance. Kibana also provides a search interface, which can be used to search data stored in Elasticsearch.

Some of the key features of Kibana include:

  • Data visualization: Kibana provides various visualization options that can be used to visualize data stored in Elasticsearch.
  • Dashboard creation: Kibana can be used to create dashboards that provide a summary view of system performance.
  • Search interface: Kibana provides a search interface that can be used to search data stored in Elasticsearch.
  • Plugin support: Kibana supports various plugins that can be used to extend its functionality.

The architecture of the ELK stack is made up of three main components: data collection, data processing, and data visualization. Let's take a closer look at each of these components.

  • Data Collection: The first component of the ELK stack is data collection. This is where data is gathered from various sources and sent to the ELK stack for processing and analysis. There are several ways to collect data in the ELK stack, including:

    • Logstash is a data collection pipeline that can collect data from various sources, including logs, metrics, and events. It can be used to collect data from a wide range of sources, including web servers, databases, and cloud services.

    • Beats is used to ship data from a variety of sources, including logs, metrics, and network traffic. It can be used to collect data from servers, containers, and cloud services.

    • Elasticsearch provides a range of APIs that can be used to send data to Elasticsearch directly. These APIs can be used to send data from a wide range of sources, including databases, web servers, and cloud services (see the sketch after this list).

      Once data has been collected, it is sent to the next component of the ELK stack for processing.

  • Data Processing: The second component of the ELK stack is data processing. This is where data is transformed into a format that can be easily analyzed. There are several ways to process data in the ELK stack, including:

    • Logstash, which can be used to transform data into a format that can be easily analyzed. It can be used to parse logs, filter data, and enrich data with additional information.

    • Elasticsearch provides a range of data processing capabilities, including full-text search, analytics, and aggregations. It can be used to analyze data in real-time and provide valuable insights into customer behavior.

      Once data has been processed, it is sent on to be visualized and explored.

  • Data Visualization: The last component of the ELK stack is data visualization, which involves the use of Kibana. Kibana provides a range of visualization types that can be used to present data in an understandable and meaningful way. Some of the visualization types provided by Kibana are:

    • Line Charts: These are used to visualize data over time. Line charts are useful for analyzing trends and patterns in data such as website traffic, stock prices, and weather data.

    • Bar Charts: These are used to compare data such as sales figures, customer ratings, and survey results. They can also be used to display data in a vertical or horizontal format.

    • Pie Charts: These are used to visualize proportions, such as market share, share of customers by age group, and share of products sold by category.

    • Heat Maps: These are used to visualize data on a geographic map. Heat maps are useful for analyzing data such as website traffic by geographic location, population density, and crime rate.

    • Dashboards: These are used to display multiple visualizations on a single page. Dashboards are useful for presenting data in a summarized and interactive way, such as sales figures, website traffic, and customer ratings.
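
As an illustration of the direct Elasticsearch APIs mentioned in the data-collection item above, a document can be indexed and searched back over the REST interface with nothing more than curl. This is only a sketch; the index name kafka-metrics-demo and the document fields are made up for this example:

# Index a sample document into a hypothetical index
$ curl -s -X POST "http://localhost:9200/kafka-metrics-demo/_doc" \
    -H 'Content-Type: application/json' \
    -d '{"broker": "kafka-1", "bytes_in_per_sec": 1024, "timestamp": "2023-05-01T10:00:00Z"}'

# Search the document back
$ curl -s "http://localhost:9200/kafka-metrics-demo/_search?q=broker:kafka-1&pretty"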

In this section, we will use Elasticsearch and Kibana from the ELK stack for our use case. We will use Metricbeat to collect and ship metric data from Kafka to Elasticsearch and Kibana for analysis and visualization.

Metricbeat is a lightweight shipper that is used to collect and send metric data from various sources to the ELK stack for analysis and visualization. It is designed to be highly efficient and can collect and ship data in real-time. It can be used to collect various kinds of metric data, such as system, network, and application metrics, and to gather data from various sources, such as operating systems, applications, and cloud services.

Metricbeat comes with pre-built modules for various services, such as Apache, MySQL, and Docker. These modules can be easily configured to collect and ship data from specific services. It also supports custom modules, which can be used to collect data from custom services. It sends data to the ELK stack in a structured format, such as JSON or YAML, which makes it easy to analyze and visualize the data in Kibana. Metricbeat also provides various dashboards and visualizations for different types of data, such as CPU usage, memory usage, and disk usage.

To begin with our configuration, we will perform the following steps:

  • Configure the Jolokia JMX agent inside the Kafka image.
  • Set up a Kafka cluster with a ZooKeeper instance.
  • Configure Metricbeat to extract Kafka metrics.
  • Set up an Elasticsearch instance.
  • Set up a Kibana instance, enable the default Metricbeat dashboards and visualize them.

Kafka ELK Metricbeat Architecture

Jolokia JMX Agent

Jolokia is an open-source project that provides an agent-based approach for accessing Java Management Extensions (JMX) remotely. It is a Java agent that allows access to JMX MBeans (Managed Beans) through various protocols like HTTP, JMX, and JSON.

The Jolokia agent is designed to overcome the limitations of the standard JMX protocol, which is primarily focused on local access to Java applications. With Jolokia, developers can access JMX MBeans from remote locations over standard protocols like HTTP, making it easy to monitor and manage Java applications in production environments.

The architecture of the Jolokia JMX agent consists of three components: the Jolokia agent, the Jolokia client, and the JMX MBeans. Let's discuss each of these components in detail:

  1. Jolokia agent – This is a Java agent that runs inside the JVM and provides a bridge between the JMX MBeans and the Jolokia client. The Jolokia agent listens on a specific port for incoming requests from the Jolokia client and forwards them to the JMX MBeans.
  2. Jolokia client – This is a client library that allows developers to access JMX MBeans through various protocols like HTTP, JMX, and JSON. The Jolokia client communicates with the Jolokia agent over the chosen protocol and fetches the required data from the JMX MBeans.
  3. JMX MBeans – These are managed beans that expose the state and behavior of a Java application. These MBeans can be accessed and managed using the Jolokia agent and client.

To configure this agent in Kafka, we need to download the JAR and run the agent. Later, we need to enable the JMX port and other configurations that allow JMX monitoring in the Kafka broker.

We will create a simple Dockerfile first:

FROM wurstmeister/kafka

ADD https://repo1.maven.org/maven2/org/jolokia/jolokia-jvm/1.7.2/jolokia-jvm-1.7.2.jar /usr/app/jolokia-jvm.jar

Kafka Cluster

Next, we will quickly create a Kafka instance by running it in Docker. Unlike the previous use case, we will create the Kafka instance using the above Dockerfile so that we can bundle the Jolokia agent inside our broker.

First, we will have a ZooKeeper instance running on port 2181. We will use the publicly available wurstmeister Docker images to create our instance:

version: '3'
services:
  zookeeper:
    image: wurstmeister/zookeeper
    ports:
      - 2181:2181
    volumes:
      - ./volume/zookeeper/:/var/run/docker.sock

Next, we will bring up the Kafka instance in the same docker-compose file, this time building the kafka-jolokia image from our Dockerfile:

version: '3'
services:
  zookeeper:
    image: wurstmeister/zookeeper
    ports:
      - 2181:2181
    volumes:
      - ./volume/zookeeper/:/var/run/docker.sock

  kafka:
    build: .
    image: kafka-jolokia
    links:
      - zookeeper
    ports:
      - 9092:9092
      - 8778:8778
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_LISTENERS: OUTSIDE://:9092,INSIDE://:9192
      KAFKA_ADVERTISED_LISTENERS: OUTSIDE://localhost:9092,INSIDE://kafka:9192
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: OUTSIDE:PLAINTEXT,INSIDE:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: INSIDE
      KAFKA_DELETE_TOPIC_ENABLE: "true"
      KAFKA_NUM_PARTITIONS: 3
      KAFKA_DEFAULT_REPLICATION_FACTOR: 1
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_OPTS: -javaagent:/usr/app/jolokia-jvm.jar=port=8778,host=0.0.0.0
    volumes:
      - ./volume/kafka/:/var/run/docker.sock

If we look closely, we can see that we are passing extra Java options via KAFKA_OPTS to run the Jolokia agent. Finally, we can run this using the command:

$ docker-compose up --build
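
Once the containers are up, we can do a quick sanity check that the Jolokia agent is answering on port 8778. This is only an optional verification step; adjust the host and port if you changed the agent options:

# Check the agent itself
$ curl -s http://localhost:8778/jolokia/version

# Read a sample MBean attribute (JVM heap usage of the broker)
$ curl -s http://localhost:8778/jolokia/read/java.lang:type=Memory/HeapMemoryUsage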

Elasticsearch

Next, we can create an Elasticsearch instance simply by defining the configuration below:

version: '3'
services:
  elasticsearch:
    container_name: elasticsearch
    image: docker.elastic.co/elasticsearch/elasticsearch:${ELK_VERSION}
    environment:
      - node.name=elasticsearch
      - cluster.name=${CLUSTER_NAME}
      - "discovery.type=single-node"
      - xpack.security.enabled=true
      - ELASTIC_USERNAME=${ELASTIC_USERNAME}
      - ELASTIC_PASSWORD=${ELASTIC_PASSWORD}
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data
    networks:
      - elk-network
    ports:
      - 9200:9200

We can define an .env file that will hold our environment variables:

# stack versions
ELK_VERSION=7.17.9

# cluster details
CLUSTER_NAME=kafka-cluster

# credentials
ELASTIC_USERNAME=elastic
ELASTIC_PASSWORD=password
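
Once the Elasticsearch container is running, we can confirm it is reachable with the credentials from the .env file before wiring up the rest of the stack. This is just a quick, optional sanity check:

$ curl -s -u elastic:password "http://localhost:9200/_cluster/health?pretty"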

Kibana

Similarly, we can also create a Kibana instance using the following configuration:

version: '3'
services:
  kibana:
    container_name: kibana
    image: docker.elastic.co/kibana/kibana:${ELK_VERSION}
    restart: always
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
      - xpack.security.enabled=true
      - ELASTICSEARCH_USERNAME=${ELASTIC_USERNAME}
      - ELASTICSEARCH_PASSWORD=${ELASTIC_PASSWORD}
    networks:
      - elk-network
    depends_on:
      - elasticsearch
    ports:
      - 5601:5601
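
Similarly, once Kibana has finished starting, its status endpoint should respond. Again, this is only a hypothetical sanity check using the same credentials:

$ curl -s -u elastic:password http://localhost:5601/api/status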

Metricbeat

In order to collect the metrics from Kafka, we need to define a configuration as part of metricbeat.yml:

# METRICBEAT CONFIG YAML REFERENCE
# https://github.com/elastic/beats/blob/master/metricbeat/metricbeat.reference.yml

# ========== AUTODISCOVERY ==========
# Autodiscover allows you to detect changes in the system and spawn new modules as they happen.
# https://www.elastic.co/guide/en/beats/metricbeat/current/configuration-autodiscover-hints.html
metricbeat.autodiscover:
  providers:
    - type: docker
      hints.enabled: true

# ========== MODULES ==========
metricbeat.modules:
  # SYSTEM MODULE
  - module: system
    metricsets:
      - cpu             # CPU usage
      - load            # CPU load averages
      - memory          # Memory usage
      - network         # Network IO
      - process         # Per process metrics
      - process_summary # Process summary
      - uptime          # System uptime
      - socket_summary  # Socket summary
      #- core           # Per CPU core usage
      - diskio          # Disk IO
      - filesystem      # File system usage for each mountpoint
      #- fsstat         # File system summary metrics
      #- raid           # Raid
      #- socket         # Sockets and connection info (linux only)
      #- service        # systemd service information
    enabled: true
    interval: 10s
    processes: ['.*']
    # Configure the metric types that are included by these metricsets.
    cpu.metrics:  ["percentages","normalized_percentages"]  # The other available option is ticks.
    core.metrics: ["percentages"]  # The other available option is ticks.

  # DOCKER MODULE
  - module: docker
    metricsets:
      - "container"
      - "cpu"
      - "diskio"
      - "occasion"
      - "healthcheck"
      - "information"
      - "reminiscence"
      - "community"
    hosts: ["unix:///var/run/docker.sock"]
    interval: 10s
    enabled: true

  # KAFKA MODULE
  - module: kafka
    metricsets:
      - broker
      - consumer
      - consumergroup
      - partition
      - producer
    interval: 10s
    hosts: ["localhost:9092"]
    #------------------------------- Jolokia Module -------------------------------
  - module: jolokia
    metricsets: ["jmx"]
    interval: 10s
    hosts: ["localhost:8778"]
    namespace: "kafka-metrics"
    #path: "/jolokia/?ignoreErrors=true&canonicalNaming=false"
    #username: "consumer"
    #password: "secret"
    jmx.mappings:
      - mbean: 'kafka.server:sort=BrokerTopicMetrics,identify=BytesInPerSec,matter=*'
        attributes:
          - attr: Rely
            subject: BytesInPersec.rely
      - mbean: 'java.lang:sort=Runtime'
        attributes:
        - attr: Uptime
          subject: uptime
      - mbean: 'java.lang:sort=Reminiscence'
        attributes:
        - attr: HeapMemoryUsage
          subject: reminiscence.heap_usage
        - attr: NonHeapMemoryUsage
          subject: reminiscence.non_heap_usage
      # GC Metrics - this depends upon what is obtainable in your JVM
      - mbean: 'java.lang:sort=GarbageCollector,identify=ConcurrentMarkSweep'
        attributes:
        - attr: CollectionTime
          subject: gc.cms_collection_time
        - attr: CollectionCount
          subject: gc.cms_collection_count

    jmx.utility:
    jmx.occasion:


# ========== PROCESSORS ==========
processors:
  - add_cloud_metadata: ~
  - add_docker_metadata: ~
  - add_locale:
      format: offset
  - add_host_metadata:
      netinfo.enabled: true

# ========== ELASTICSEARCH OUTPUT ==========
output.elasticsearch:
  hosts: ["${ELASTICSEARCH_HOSTS}"]
  username: ${ELASTICSEARCH_USERNAME}
  password: ${ELASTICSEARCH_PASSWORD}

# ========== DASHBOARDS SETUP ==========
setup.dashboards:
  enabled: true

# ========== KIBANA SETUP ==========
setup.kibana:
  host: "${KIBANA_HOST}"
  username: ${ELASTICSEARCH_USERNAME}
  password: ${ELASTICSEARCH_PASSWORD}

# ========== XPACK SECURITY ==========
xpack.monitoring:
  enabled: true
  elasticsearch:

Metricbeat supports many modules that can be integrated and used to collect metrics; for example, we can collect metrics from Docker, Kubernetes, many databases, and message queues like RabbitMQ and Kafka, among many other applications. Here, we will extract metrics for Kafka, Docker, and the host OS where we are running Metricbeat. It is very easy to configure these modules and collect metrics from various systems. For example, to fetch metrics from Kafka, we have defined the following configuration:

# ========== MODULES ==========
metricbeat.modules:
  # KAFKA MODULE
  - module: kafka
    metricsets:
      - broker
      - consumer
      - consumergroup
      - partition
      - producer
    interval: 10s
    hosts: ["localhost:9092"]

Metricbeat loads ready-made dashboards into Kibana. In order to enable that, we have passed the following configuration:

# ========== DASHBOARDS SETUP ==========
setup.dashboards:
  enabled: true
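
Alternatively, the bundled dashboards can be loaded on demand with the Metricbeat CLI once Kibana is up, for example by running the following from inside the Metricbeat container we define below:

$ docker exec -it metricbeat metricbeat setup --dashboards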

Finally, the outputs have been configured as follows:

# ========== ELASTICSEARCH OUTPUT ==========
output.elasticsearch:
  hosts: ["${ELASTICSEARCH_HOSTS}"]
  username: ${ELASTICSEARCH_USERNAME}
  password: ${ELASTICSEARCH_PASSWORD}

# ========== KIBANA SETUP ==========
setup.kibana:
  host: "${KIBANA_HOST}"
  username: ${ELASTICSEARCH_USERNAME}
  password: ${ELASTICSEARCH_PASSWORD}

Finally, we can define our Docker configuration to host an instance of Metricbeat:

version: '3'
services:
  metricbeat:
    container_name: metricbeat
    image: docker.elastic.co/beats/metricbeat:${ELK_VERSION}
    restart: always
    user: root
    environment:
      - ELASTICSEARCH_HOSTS=http://localhost:9200
      - ELASTICSEARCH_USERNAME=${ELASTIC_USERNAME}
      - ELASTICSEARCH_PASSWORD=${ELASTIC_PASSWORD}
      - KIBANA_HOST=http://localhost:5601
    volumes:
      - metricbeat-data:/usr/share/metricbeat/data
      - ./metricbeat.yml:/usr/share/metricbeat/metricbeat.yml:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /var/run/docker.sock:/var/run/docker.sock
      - /proc:/hostfs/proc:ro
      - /sys/fs/cgroup:/hostfs/sys/fs/cgroup:ro
      - /:/hostfs:ro
    command: ["--strict.perms=false", "-system.hostfs=/hostfs"]
    networks:
      - elk-network
    depends_on:
      - elasticsearch
      - kibana
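
The services above reference a shared network (elk-network) and named volumes (elasticsearch-data, metricbeat-data). If they all live in a single compose file, a minimal sketch of the top-level definitions would look like the following; this is an assumption here, and the linked GitHub configuration may organize it differently:

networks:
  elk-network:
    driver: bridge

volumes:
  elasticsearch-data:
  metricbeat-data: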

With this, we have configured all our components. Finally, we will take a look at the dashboards that we have enabled and visualize the metrics in them.

Based on the various key metrics for monitoring the performance of a system, we have enabled the following dashboards:

  • [Metricbeat Kafka] Overview ECS: This dashboard shows the basic metrics for Kafka, which include consumer metrics, consumer group lag, Kafka topic and consumer offsets, consumer partition reassignment, and so on.

    ELK Metricbeat Kafka Metrics

  • [Metricbeat Docker] Overview ECS: This dashboard provides the key metrics of the Docker containers that we are running for our cluster. This includes CPU usage, network I/O, memory usage, and so on.

    ELK Metricbeat Docker

  • [Metricbeat System] Overview ECS: This dashboard provides the base system metrics of the OS.

    ELK Metricbeat System Overview

  • [Metricbeat System] Host Overview ECS: This dashboard shows the key performance metrics of the host running our containers.

    ELK Metricbeat Host Overview

  • [Metricbeat System] Containers Overview ECS: This dashboard shows the metrics of the containers running on our OS.

    ELK Metricbeat Containers Overview ECS

You can find the overall Docker configuration for this section on GitHub.

Conclusion

In this guided project, we looked into various approaches to expose and visualize the key metrics of Kafka cluster performance. Monitoring Kafka is critical to ensure its smooth functioning and to prevent any potential problems. It helps in detecting and fixing issues before they become critical and affect the performance of the system.
