Stream Service -
The Stream service lets you publish and subscribe to streams of data records, store those streams, and process them in real time.
With the Stream service, you can build real-time streaming data pipelines that move data between systems and applications, and you can develop applications that process and react to streams of data as they arrive.
The Stream service handles the ingestion, routing, and delivery of high volumes of low-latency data, such as web clicks, transactions, sensor readings, work items, data pages, and customer interaction history.
For operational resilience, the Stream service replicates the same data across other nodes in the cluster. This distribution and replication of stream data records is what makes the Stream service scalable and fault tolerant.
The Stream service is available starting with Pega Platform™ 7.4 and is built on the Apache Kafka platform.
Introduction to Kafka -
Apache Kafka serves as an event streaming platform designed for the collection, processing, storage, and integration of data on a large scale.
It is a resilient and scalable platform that can be used as a data source for real-time analysis of customer records as they are generated.
Create Kafka data sets to read data from and write data to Kafka topics. This data serves as a source of events, such as customer calls or messages, that your application can use as input to rules that process the data in real time and trigger actions.
Event-driven architecture -
Event - An event is any action, incident, or change that is detected or recorded by software or an application, together with a description of what occurred. Examples include a payment transaction, a click on a website, or a recorded temperature reading.
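For illustration only, here is a minimal Java sketch of what a payment event might look like as a plain data record; the class and field names are hypothetical and are not part of Kafka or Pega:

// A hypothetical payment event: what happened, who triggered it, and when.
public record PaymentEvent(
        String eventId,               // unique identifier for this occurrence
        String customerId,            // who triggered the event
        double amount,                // payment amount
        java.time.Instant occurredAt  // when the event happened
) {}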
Event-driven architecture (EDA) is a software design pattern that empowers organizations to identify "events" or significant business moments (such as transactions, website visits, shopping cart abandonment, etc.) and take action on them in real-time or near real-time.
Traditional systems are typically built around a data-centric model, in which the data is the primary source of truth.
Moving to event-driven architecture means shifting from a data-centric model to an event-centric model. Data still matters in the event-driven approach, but events become the most important component.
In the service-oriented model, the paramount concern is preserving all data without loss.
In event-driven architecture, the primary focus is on responding to events promptly as they occur, because events are subject to a law of diminishing returns: the older an event gets, the less valuable it becomes.
Topics & Partitions -
Topics - In Kafka, the fundamental unit of organization is the topic, which can be likened to a table in a relational database. As a developer working with Kafka, the topic is likely the abstraction that you primarily focus on and think about.
You create separate topics for different kinds of events, and separate topics for filtered and transformed versions of the same kind of event.
Given that Kafka topics essentially function as logs, the data within them isn't inherently temporary. Each topic can be configured to expire data after a specified age or once the topic has reached a certain size threshold. This expiration period can range from mere seconds to extended durations spanning years, or messages can be retained indefinitely.
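As a hedged illustration of how such retention limits can be set programmatically, the Java sketch below uses the Kafka AdminClient to create a topic whose data expires after seven days or once a partition log exceeds roughly 1 GB. The topic name, partition count, replication factor, and broker address are placeholder assumptions.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateTopicWithRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication factor 3; expire data after 7 days or ~1 GB per partition.
            NewTopic topic = new NewTopic("customer-interactions", 6, (short) 3)
                    .configs(Map.of(
                            "retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000),
                            "retention.bytes", String.valueOf(1_073_741_824L)));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}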
Partitions - Partitioning involves dividing the single-topic log into multiple logs, each of which can reside on a distinct node within the Kafka cluster. This partitioning mechanism enables the distribution of tasks such as storing messages, writing new messages, and processing existing messages across numerous nodes within the cluster.
When a message has no key, messages are typically distributed round-robin across all partitions of the topic. Each partition receives a roughly equal share of the data, but no particular ordering of the input messages is preserved.
When a message contains a key, Kafka determines the destination partition based on a hash of that key. This approach ensures that messages with the same key consistently land in the same partition, thereby maintaining their order.
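The sketch below only illustrates the idea of key-based partition selection; Kafka's actual default partitioner hashes the serialized key with murmur2, so this simplified hash is not the real implementation.

// Conceptual illustration: map a record key to a partition so that the same
// key always lands in the same partition (Kafka really uses a murmur2 hash).
public class KeyPartitioningSketch {
    static int partitionFor(String key, int numPartitions) {
        if (key == null) {
            // No key: the producer spreads records across partitions instead.
            return -1;
        }
        // Same key -> same hash -> same partition, which preserves per-key ordering.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("customer-42", 6)); // always the same partition
        System.out.println(partitionFor("customer-42", 6)); // ... for the same key
    }
}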
The partitions of the log are distributed over the servers in the Kafka cluster, with each server handling data and requests for a share of the partitions. Each partition is also replicated across a configurable number of servers for fault tolerance.
Each partition designates one server as the "leader" and zero or more servers as "followers." The leader server manages all read and write requests for the partition, while the followers passively replicate the leader. If the leader server fails, one of the followers automatically assumes the role of the new leader. Additionally, each server serves as a leader for some partitions and a follower for others, ensuring a balanced load distribution within the cluster.
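One way to observe this leader/follower layout is with the AdminClient, as in the hedged sketch below; the topic name and broker address are placeholders, and the method names assume a recent kafka-clients version.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;
import java.util.Collections;
import java.util.Properties;

public class DescribePartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singleton("customer-interactions"))
                    .allTopicNames().get()
                    .get("customer-interactions");
            for (TopicPartitionInfo p : desc.partitions()) {
                // Each partition reports its leader and its follower replicas (and in-sync replicas).
                System.out.printf("partition %d: leader=%s replicas=%s isr=%s%n",
                        p.partition(), p.leader(), p.replicas(), p.isr());
            }
        }
    }
}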
Producers & Consumers -
Producers - Producers are responsible for publishing data to the topics they select. It's the producer's responsibility to decide which record to allocate to which partition within the topic. This allocation can be performed in a round-robin manner to evenly distribute the load, or it can be executed based on a semantic partition function, such as one that utilizes a key in the record.
The Producer API enables an application to publish a continuous stream of records to one or more Kafka topics.
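A minimal producer sketch, assuming the Java kafka-clients library; the topic name, key, value, and broker address are placeholders:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");               // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("customer-42") determines the partition, so this customer's
            // events stay in order; a null key would be spread across partitions.
            producer.send(new ProducerRecord<>("customer-interactions", "customer-42",
                    "{\"type\":\"click\",\"page\":\"checkout\"}"));
            producer.flush();
        }
    }
}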
Consumers - Consumers identify themselves with a consumer group name, and every record published to a topic is sent to one consumer instance within each subscribed consumer group. These consumer instances can operate in separate processes or on distinct machines.
The Consumer API enables an application to subscribe to one or more topics and process the stream of records produced to them.
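A matching minimal consumer sketch, again assuming the Java kafka-clients library; the group id, topic name, and broker address are placeholders:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                 // placeholder broker
        props.put("group.id", "interaction-processors");                  // consumer group name
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singleton("customer-interactions"));
            while (true) {
                // Each record on the topic is delivered to exactly one member of this group.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d key=%s value=%s%n",
                            record.partition(), record.key(), record.value());
                }
            }
        }
    }
}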
Brokers -
In terms of physical infrastructure, Kafka is a network of machines called brokers.
In a modern deployment, these brokers may not be individual physical servers, but containers running on pods, which in turn run on virtualized servers backed by physical processors in a physical datacenter.
Regardless of their deployment method, each broker operates as an independent machine running the Kafka broker process. Every broker hosts a specific set of partitions and manages incoming requests to write new events to these partitions or read events from them. Additionally, brokers are responsible for handling the replication of partitions between each other.
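To list the brokers that make up a cluster, the AdminClient can be used as in the sketch below; the broker address is a placeholder:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import org.apache.kafka.common.Node;
import java.util.Properties;

public class ListBrokers {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker

        try (AdminClient admin = AdminClient.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();
            // Print each broker node in the cluster, then the current controller.
            for (Node broker : cluster.nodes().get()) {
                System.out.printf("broker id=%d host=%s:%d%n", broker.id(), broker.host(), broker.port());
            }
            System.out.println("controller: " + cluster.controller().get());
        }
    }
}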
-Team Enigma Metaverse