Apache Kafka is arguably the most popular distributed streaming platform in the world today. Kafka has been battle tested under a wide range of conditions, processing petabytes of data at speed. It lets you scale your application by streaming data to consumers for processing. It is horizontally scalable and fault tolerant, which means you can keep adding Kafka nodes as your load grows. In this article we will explore how to install Kafka on AWS EC2.
In this article we will do a standalone deployment of Kafka, just to keep things simple. We need an EC2 instance running some form of Linux. I am using Ubuntu, but you could use any flavor. Since Ubuntu is Debian based, you can run the commands from this article as they are on any Debian-based system. For other distributions you will need the equivalent package manager commands, e.g. yum on CentOS.
We will start with a blank operating system and install everything we need. So go ahead and fire up your EC2 instance and connect to it via SSH.
Try not to use small instances like t2.micro or t2.nano, as Kafka needs a considerable amount of memory to run. I would recommend at least a t2.large or t2.xlarge if you want to run both ZooKeeper and Kafka on the same machine.
In a production setting, you should run ZooKeeper independently, which can be done on a t2.micro, and Kafka should run on the largest server you can afford. It is recommended that a Kafka broker have at least 64 GB of RAM.
We will start by installing Java 11.
sudo apt-get update
sudo apt install openjdk-11-jre-headless
This will download and install Java 11. You can check the installation by running the following command.
java -version
To install Kafka, we need to download the official Kafka release archive (a .tgz file) from the Apache site. You can get it by running the following command.
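The download command itself appears to be missing from the text at this point. Here is a sketch of what it likely looked like, assuming the 2.6.0 release built for Scala 2.13 (the version the later commands in this article reference) and the Apache archive URL layout. The variable names and the download_kafka helper are only illustrative.

```shell
# Assumed versions, inferred from the kafka_2.13-2.6.0 paths used later on.
SCALA_VERSION="2.13"
KAFKA_VERSION="2.6.0"
KAFKA_TGZ="kafka_${SCALA_VERSION}-${KAFKA_VERSION}.tgz"
KAFKA_URL="https://archive.apache.org/dist/kafka/${KAFKA_VERSION}/${KAFKA_TGZ}"

# Download the archive into the home directory (hypothetical helper).
download_kafka() {
  wget -P "$HOME" "$KAFKA_URL"
}
```

Calling download_kafka should leave kafka_2.13-2.6.0.tgz in your home directory. If you want another release, change the two version variables after checking that the combination exists on the Apache download site.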
The above command downloads Kafka built for Scala 2.13. If you need a different version, please check the Apache download site.
Now extract the tar archive by running the following command.
tar -zxvf kafka-*.tgz
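This should extract Kafka into the current directory, which is your home directory if you have not changed it. You can extract it to a different location if you want, but keep in mind that the commands in the rest of this article assume ~/kafka_2.13-2.6.0.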
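If you do prefer another location, a minimal sketch could look like the following. The extract_kafka helper and its arguments are hypothetical names for illustration; it only wraps mkdir and the -C option of tar.

```shell
# Extract a Kafka archive into a directory of your choice (hypothetical helper).
# $1: path to the .tgz archive, $2: destination directory (created if missing).
extract_kafka() {
  mkdir -p "$2" && tar -xzf "$1" -C "$2"
}
```

For example, extract_kafka ~/kafka_2.13-2.6.0.tgz ~/apps would place Kafka under ~/apps/kafka_2.13-2.6.0, and you would then use that path in place of ~/kafka_2.13-2.6.0 in the later commands.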
We can’t talk about Kafka without talking about ZooKeeper. ZooKeeper is an independent Apache project that is used by various distributed applications. Kafka uses ZooKeeper for:
- State management
- Leader election
- Cluster membership
- Topic configuration
- Consumer offsets (in releases before 0.9)
First, we need to run an instance of ZooKeeper. Ideally, in a production setting, ZooKeeper should run on its own dedicated machines, with at least three instances for redundancy (an odd number, so that the ensemble can still form a majority when one instance fails). For this article, however, we will run ZooKeeper locally along with Kafka on the same node.
~/kafka_2.13-2.6.0/bin/zookeeper-server-start.sh ~/kafka_2.13-2.6.0/config/zookeeper.properties &
Once ZooKeeper is up, we can start Kafka by running the following command.
~/kafka_2.13-2.6.0/bin/kafka-server-start.sh ~/kafka_2.13-2.6.0/config/server.properties &
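Because both processes are started in the background, the prompt comes back before they are actually ready. Here is a small sketch for waiting until they accept connections, assuming the default ports (2181 for ZooKeeper, 9092 for Kafka) and that nc is installed. The wait_for_port helper is a hypothetical name, not part of Kafka.

```shell
# Poll a TCP port until it accepts connections (hypothetical helper).
# $1: host, $2: port, $3: maximum number of attempts (default 30).
wait_for_port() {
  tries="${3:-30}"
  while [ "$tries" -gt 0 ]; do
    if nc -z "$1" "$2" >/dev/null 2>&1; then
      return 0
    fi
    tries=$((tries - 1))
    # Pause between attempts, but not after the last one.
    if [ "$tries" -gt 0 ]; then sleep 1; fi
  done
  return 1
}
```

You could then run wait_for_port localhost 2181 && wait_for_port localhost 9092 && echo "ZooKeeper and Kafka are up" before moving on to the next step.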
Now that we have Kafka running, let's test the installation by creating a topic and then sending and receiving a few messages.
~/kafka_2.13-2.6.0/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 10 --topic test-topic
Let's quickly run through the options we used in this command.
The zookeeper option specifies where ZooKeeper is running. If you have an external ZooKeeper, you need to specify the address of the host machine where it is running. Note that Kafka 3.0 and later removed this option from kafka-topics.sh in favor of --bootstrap-server, which takes the broker address (localhost:9092 here) instead.
replication-factor determines how many nodes the data of this topic is replicated on. Since we only have one node in this case, replication-factor cannot be more than 1. In a production setting, the replication factor should ideally be at least 2, so that the data is preserved if one of the nodes fails.
Partitions are the segments of a topic where messages are stored. Within a consumer group, each partition is read by exactly one consumer, so the number of partitions caps how many consumers can read in parallel. In effect, partitions are the unit of parallelism for reading messages from Kafka.
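To see how these options ended up on the broker, you can ask Kafka to describe the topic. The sketch below assumes Kafka lives in ~/kafka_2.13-2.6.0 unless KAFKA_HOME is set, and it uses --bootstrap-server rather than --zookeeper, which kafka-topics.sh has accepted since Kafka 2.2. The describe_topic helper is a hypothetical wrapper.

```shell
# Location of the Kafka install; defaults to the path used in this article.
KAFKA_HOME="${KAFKA_HOME:-$HOME/kafka_2.13-2.6.0}"

# Show partition count, replication factor and per-partition leader for a topic.
# $1: topic name (hypothetical helper around the bundled kafka-topics.sh).
describe_topic() {
  "$KAFKA_HOME/bin/kafka-topics.sh" --describe --bootstrap-server localhost:9092 --topic "$1"
}
```

Running describe_topic test-topic should list the 10 partitions we asked for, each with a replication factor of 1 and our single broker as the leader.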
Let's run the console producer that comes bundled with Kafka to test the behaviour of a typical producer sending messages to Kafka.
~/kafka_2.13-2.6.0/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test-topic
This drops you to a prompt where every line you type is sent to Kafka as a message. So let's type a few messages, and press Ctrl+C when you are done.
Now let's explore the consumer behaviour by running the console consumer.
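If you would rather not type at the prompt, the console producer also reads messages from standard input, one per line. Here is a sketch under the same assumptions as the producer command above (broker on localhost:9092, topic test-topic, Kafka in ~/kafka_2.13-2.6.0 unless KAFKA_HOME is set). The send_messages helper is a hypothetical name.

```shell
# Location of the Kafka install; defaults to the path used in this article.
KAFKA_HOME="${KAFKA_HOME:-$HOME/kafka_2.13-2.6.0}"

# Send each argument as one message to test-topic, without the interactive prompt.
send_messages() {
  printf '%s\n' "$@" |
    "$KAFKA_HOME/bin/kafka-console-producer.sh" --broker-list localhost:9092 --topic test-topic
}
```

For example, send_messages "hello kafka" "second message" would publish two messages and return once the input is exhausted.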
~/kafka_2.13-2.6.0/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-topic --from-beginning
In this command we are reading the messages from test-topic. The interesting bit is the --from-beginning flag. Kafka keeps messages after they have been read and maintains a consumer offset, which is just a record of how far each consumer group has read in each partition. This lets a consumer go back in time and consume old messages again, which makes Kafka a very powerful tool.
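You can inspect these offsets with the kafka-consumer-groups.sh script that ships with Kafka. The sketch below assumes you started the console consumer with an explicit group, for example by adding --group demo-group to the command above (demo-group is a made-up name), and that Kafka lives in ~/kafka_2.13-2.6.0 unless KAFKA_HOME is set. The show_group_offsets helper is a hypothetical wrapper.

```shell
# Location of the Kafka install; defaults to the path used in this article.
KAFKA_HOME="${KAFKA_HOME:-$HOME/kafka_2.13-2.6.0}"

# Show the committed offset, log end offset and lag per partition for a group.
# $1: consumer group name (hypothetical helper around kafka-consumer-groups.sh).
show_group_offsets() {
  "$KAFKA_HOME/bin/kafka-consumer-groups.sh" --bootstrap-server localhost:9092 --describe --group "$1"
}
```

Running show_group_offsets demo-group prints one row per partition of test-topic. The gap between the current offset and the log end offset is how far the group is behind.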
While it's common to think of Kafka as a messaging system, if you want to understand how Kafka works, I like to think of it as a distributed log system rather than a messaging system. Let me know your thoughts: do you think the messaging paradigm fits Kafka better, or the distributed log paradigm?