Deploying Kafka in Production: From Zero to Hero (Part 3)

Kafka in Production - We are Building a Scalable, Secure, and Enterprise Ready Message Bus on Google Cloud Platform
opensight.ch - roman hüsler

TL;DR

Although I recommend reading the blog posts, I know some folks skip straight to the “source of truth”—the code. It’s like deploying on a Friday without testing.
So, here you go:

Introduction

Apache Kafka is a powerful event streaming platform, but running it in production is a different challenge altogether. We will start with the basics, but this is not a playground—we work our way towards building an enterprise-grade message bus that delivers high availability, security, and scalability for real-world workloads.
I'm not a Kafka specialist myself—I learn on the go. As is often the case with opensight blogs, you can join me on my journey from zero to hero.

In this blog series, we will take Kafka beyond the basics. We will design a production-ready Kafka cluster deployment that is:

  • Scalable – Able to handle increasing load without performance bottlenecks.
  • Secure – Enforcing authentication (Plain, SCRAM, OAUTH), authorization (ACL, OAUTH), and encryption (TLS).
  • Well-structured – Including Schema Registry for proper data governance.
  • Enterprise-ready – Implementing features critical for an event-driven architecture in a large-scale environment.

Kafka Cluster with TLS Encryption, Authentication and ACLs

In this part of the blog series, we will build a Kafka cluster in a lab environment. For simplicity, the entire cluster will be deployed on a single Docker host. However, in a production setup, you should distribute each component across multiple virtual machines (VMs) to ensure scalability, redundancy, and fault tolerance. In an upcoming post, I will set up an actual Kubernetes cluster on GKE.

The broker ports might seem unusual: 19093, 29093, and 39093. But this setup is intentional: it allows all brokers to run on a single test host without port conflicts. By using distinct port ranges, I can ensure each broker operates independently while avoiding collisions with default ports or other services running on the system. Obviously, this setup is for testing only; in production, use multiple VMs that each run the Kafka broker on the same (default) port.
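For illustration only (the exact compose file is in the repository, and service names here are assumptions), the per-broker port layout corresponds to a docker-compose fragment along these lines:

```yaml
# hypothetical docker-compose excerpt - each broker gets its own host port
services:
  kafka-1:
    ports:
      - "19093:19093"   # SSL listener for broker 1
  kafka-2:
    ports:
      - "29093:29093"   # SSL listener for broker 2
  kafka-3:
    ports:
      - "39093:39093"   # SSL listener for broker 3
```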

image - kafka cluster setup for blog part 3

Authentication with Client Certificates (mTLS)

For authentication, we are using mutual TLS (mTLS), where client certificates issued by our Certificate Authority (CA) are required to establish a connection with Kafka brokers. Unlike Kafka’s SASL authentication layer, this approach relies purely on SSL (TLS) certificate validation. If a client's certificate is not issued by our CA, the TLS handshake will fail, and the client will be unable to communicate with the brokers. SASL is not involved in this setup.

Our Kafka broker configuration includes:

  • SSL (TLS) encryption for secure communication.
  • Mutual TLS (mTLS) authentication, enforced via ssl.client.auth: required.
  • Inter-broker communication over SSL, ensuring all traffic between Kafka nodes remains encrypted.
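Taken together, these broker-side settings correspond roughly to the following server.properties fragment for kafka-1 (paths and passwords are placeholder assumptions, not the exact repository values):

```properties
listeners=SSL://0.0.0.0:19093
advertised.listeners=SSL://kafka-1:19093
security.inter.broker.protocol=SSL
# force clients to present a certificate signed by our CA (mTLS)
ssl.client.auth=required
ssl.keystore.location=/etc/kafka/secrets/server.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit
ssl.truststore.location=/etc/kafka/secrets/server.truststore.jks
ssl.truststore.password=changeit
```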

Future Upgrade: SASL_SSL with OAuth
In a later deployment, we will transition to SASL_SSL with OAuth, which will introduce token-based authentication while still maintaining SSL encryption. This will allow integration with an OAuth identity provider (e.g., Keycloak or Okta) for authentication and authorization, making the system more scalable and manageable in a production environment.

Certificate Generation

image - kafka certificates

First, we generate a CA certificate (ca.crt) with a private key, which we then use to issue all other certificates:

  • kafka.crt
    Certificate used by the Kafka brokers to encrypt communication and authenticate with each other.
  • kafka-ui.crt
    Certificate for KafkaUI to connect to a Kafka broker.
  • client-1.crt
    Certificate for an arbitrary "client-1" to connect to Kafka.
  • client-2.crt
    Certificate for an arbitrary "client-2" to connect to Kafka.
# generate all required certificates
./generate_certs.sh changeit  

command - generate all required certificates by using script

All keys are generated automatically by this script from the Git repository. Java is required, as we use keytool to create a Java keystore. After generating all certificates, it looks like this:
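To make the flow concrete, here is a minimal openssl-only sketch of the same idea — issue a client certificate from our own CA. Filenames mirror the ones above, but the actual generate_certs.sh differs (it also builds the Java keystores with keytool):

```shell
# 1) Create the CA: a private key plus a self-signed ca.crt
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout ca.key -out ca.crt -subj "/CN=kafka-ca"

# 2) Create a client key and a certificate signing request (CSR)
openssl req -newkey rsa:2048 -nodes \
  -keyout client-1.key -out client-1.csr -subj "/CN=client-1"

# 3) Sign the CSR with our CA to issue client-1.crt
openssl x509 -req -in client-1.csr -CA ca.crt -CAkey ca.key \
  -CAcreateserial -days 365 -out client-1.crt

# 4) Verify that client-1.crt chains up to our CA
openssl verify -CAfile ca.crt client-1.crt
```

A client presenting a certificate that does not verify against ca.crt will fail the TLS handshake in step 4's sense — exactly what `ssl.client.auth: required` enforces on the brokers.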

image - generated tls certificates for kafka

Deployment

To start the whole setup, run:

# start the kafka cluster on docker (and generate certs)
make run

image - running kafka containers for blog part 3 setup

These are the most important config parameters for this setup:

  • advertised.listeners: SSL://kafka-1:19093, SSL://kafka-2:29093, SSL://kafka-3:39093
    (link docs) Each broker now advertises an SSL (TLS) encrypted listener on its own port (19093, 29093, and 39093 respectively).
  • security.inter.broker.protocol:SSL
    (link docs) All communication between the Kafka brokers themselves now uses SSL (TLS) encryption.
  • ssl.keystore.filename:server.keystore.jks
    (link docs) Alternatively, ssl.keystore.location can be used, which expects the full file path (ssl.keystore.filename is resolved relative to the mounted secrets directory by the Confluent Docker images). As keystore, we use the previously generated server keystore, which contains the ca.crt as well as the private key and the certificate (kafka.crt) used by the Kafka brokers. This keystore is essential for securing communication between Kafka brokers via SSL/TLS. By the way, ssl.keystore.credentials points to the credentials file for the keystore, which is stored in certs/credentials and mounted into the container.
  • ssl.client.auth:required
    (link docs) Clients are required to authenticate with SSL (TLS) by presenting a client certificate
  • authorizer.class.name:kafka.security.authorizer.AclAuthorizer
    (link docs) Enables Access Control Lists (ACLs), which define which clients have access to specific topics. By configuring ACLs, you can enforce granular security policies, ensuring that only authorized users or applications can read from or write to Kafka topics.
  • principal.builder.class:org.apache.kafka.common.security.authenticator.DefaultKafkaPrincipalBuilder
    (link docs) I tried this principal builder first, but it did not extract the plain username from the certificate: it produced CN=kafka instead of kafka. So I ended up adding ssl.principal.mapping.rules : RULE:^CN=(.*?)$$/$$1/L, which extracts the username correctly (the doubled $$ is how docker-compose escapes a literal $).
  • super.users:your_super_users
    (link docs) You can also define superusers in the setup; they bypass ACL checks entirely.
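The effect of the principal mapping rule can be illustrated with plain shell tools. This is only an analogue for intuition, not how Kafka implements it: the regex strips the CN= prefix and the trailing /L flag lowercases the result.

```shell
# Rough shell analogue of ssl.principal.mapping.rules RULE:^CN=(.*?)$/$1/L:
# strip the "CN=" prefix, then lowercase (the /L flag).
echo "CN=Kafka-UI" | sed -E 's/^CN=(.*)$/\1/' | tr '[:upper:]' '[:lower:]'
# prints: kafka-ui
```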

Access Control

A look at KafkaUI reveals that it cannot create a new topic on the cluster. This behavior is expected, as KafkaUI connects using the generated kafka-ui.crt, but the user kafka-ui (CN=kafka-ui) does not yet have the necessary permissions in the Access Control Lists (ACL).

image - kafka ui no permissions to create a topic

We are now updating the Access Control List (ACL) to grant Kafka-UI the necessary permissions to manage the Kafka cluster.

# Give Kafka-UI full access to all topics
docker exec -it kafka-1 kafka-acls --authorizer-properties zookeeper.connect=zookeeper-1:2181 \
  --add --allow-principal "User:kafka-ui" --operation All --topic '*'

# Allow Kafka-UI full access to the Kafka cluster
docker exec -it kafka-1 kafka-acls --authorizer-properties zookeeper.connect=zookeeper-1:2181 \
  --add --allow-principal "User:kafka-ui" --operation All --cluster

# restart kafka ui
docker restart kafka-ui

command - grant kafka-ui user necessary permissions

Next, let us create the necessary permissions for the test app, which will use the generated client-1.crt with CommonName CN=client-1.

# grant client-1 permissions to create test-topic
docker exec -it kafka-1 kafka-acls --authorizer-properties zookeeper.connect=zookeeper-1:2181 \
  --add --allow-principal "User:client-1" --operation Create --topic test-topic

# grant client-1 permissions to read/write test-topic
docker exec -it kafka-1 kafka-acls --authorizer-properties zookeeper.connect=zookeeper-1:2181 \
  --add --allow-principal "User:client-1" --operation Read --operation Write --topic test-topic

# grant client-1 permissions to read the test group
docker exec -it kafka-1 kafka-acls --authorizer-properties zookeeper.connect=zookeeper-1:2181 \
  --add --allow-principal "User:client-1" --operation Read --group test-group

# grant client-1 permissions to describe the test group
docker exec -it kafka-1 kafka-acls --authorizer-properties zookeeper.connect=zookeeper-1:2181 \
  --add --allow-principal "User:client-1" --operation Describe --group test-group

command - grant client-1 necessary permissions

Testing

The deployment is complete, so it's time to run some tests using the test application. You can find the test app in the GitHub repository.
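For reference, a client such as the test app that authenticates via client-1.crt needs SSL settings along these lines (keystore paths and passwords are placeholder assumptions, not the exact repository values):

```properties
bootstrap.servers=kafka-1:19093,kafka-2:29093,kafka-3:39093
security.protocol=SSL
# truststore with our ca.crt, so the client trusts the brokers
ssl.truststore.location=/certs/client.truststore.jks
ssl.truststore.password=changeit
# keystore with client-1.crt and its key, presented for mTLS
ssl.keystore.location=/certs/client-1.keystore.jks
ssl.keystore.password=changeit
ssl.key.password=changeit
```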

Everything appears to be functioning correctly. At this stage, you can experiment with the Access Control Lists (ACLs) to verify that users without the necessary permissions can no longer access the Kafka cluster. This will help confirm that the access restrictions are properly enforced.

image - testapp connected to kafka. send and receive messages

Conclusion

We're making great progress! At this stage, we have a Kafka cluster with encryption, authentication, access control, and scalability in place.

In the upcoming Part 4 of this series, I'll focus on implementing enterprise-grade features, specifically OAuth authentication and OAuth-based permissions. These enhancements will further strengthen security and simplify user and application access management. Stay tuned!