Why was Apache Kafka created?

(bigdata.2minutestreaming.com)

84 points | by enether a day ago ago

72 comments

dktalks 3 hours ago

I worked on this while I was at LI and I think the major selling point back then was Replayability of messages but it was something similar to what you would get with Pub/Sub. We could also have multiple clients listening and processing same messages for their own purposes so you could use the same queue and have different clients process them as they wanted.

[-]

pojzon 3 hours ago

Its the ability to replay messages at later notice when needed.

At least this was the reason we decided to use Kafka instead of simple queues.

It was useful when we built new consumer types for the same data we already processed or we knew we gonna have later but cant build now due to prorities.

kachapopopow 4 hours ago

As someone who has made the mistake of using kafka in a non enterprise space - it really seems like the etcd problem where you need more time to run etcd than to run whatever service you're providing.

[-]

mrweasel 3 hours ago

I previously helped clients setup and run Kafka clusters. Why they'd need Kafka was always our first question, never got a good answer from a single one of them. That's not to say that Kafka isn't useful, it is, in the right setting, but that settings is never "I need a queue". If you need a queue, great, go get RabbitMQ, ZMQ, Redis, SQS, named pipes, pretty anything but Kafka. It's not that Kafka can't do it, but you are making things harder than they needed to be.

[-]

Joeri 3 hours ago

Kafka isn’t a queue, it’s a distributed log. A partitioned topic can take very large volumes of message writes, persist them indefinitely, deliver them to any subscriber in-order and at-least-once (even for subscribers added after the message was published), and do all of that distributed and HA.

If you need all those things, there just are not a lot of options.

adev_ 3 hours ago

> Why they'd need Kafka was always our first question, never got a good answer from a single one of them

"To follow the hype train, Bro" is often the real answer.

> If you need a queue, great, go get RabbitMQ, ZMQ, Redis, SQS, named pipes, pretty anything but Kafka.

Or just freaking MQTT.

MQTT has been battle-proven for 25 years, is simple and does perfectly the job if you do not ship GBs of blobs through your messaging system (which you should not do anyway).

[-]

atomicnumber3 2 hours ago

It's resume-driven development. It honestly can make sense for both company and employee.

Companies get standard tech stacks people are happy to work with, because working with them gets people experience with tech stacks that are standard at many companies. It's a virtuous cycle.

And sure even if you need just a specific thing, it's often better to go slightly overkill for something that's got millions of stack overflow solutions for common issues figured out. Vs picking some niche thing that you are now 1 of like six total people in the entire world using in prod.

Obviously the dose makes the poison and don't use kafka for your small internal app thing and don't use k8s where docker will do, but also, probably use k8s if you need more than docker instead of using some weird other thing nobody will know about.

slau 3 hours ago

ZMQ is not a managed queue. It’s networking library.

shikhar an hour ago

You might like what we are building with https://s2.dev :)

[-]

zbentley an hour ago

Looks interesting; does it take a different architectural approach than WarpStream did?

That’s not coded “you’re reinventing the wheel”; WarpStream had some significant drawbacks, so I’m truly curious about different approaches in the message-log-backed-by-blob-store space.

[-]

shikhar an hour ago

Architecturally, there are a lot of the same considerations. A key difference is that we offer streams as a cloud API, and it does not have WarpStream's BYOC split – where some stuff runs in your environment and metadata lives in their cloud – so we can offer lower latencies. We are also not trying to be Kafka API compatible, S2 has its own REST API.

The dimensions we focus on are number of streams (unlimited, so you can do granular streams like per user or session), internet accessibility (you can generate finely-scoped access tokens that can be safely used from clients like CLIs or browsers), and soon also massive read fanout for feed-like use cases.

AtlasBarfed 2 hours ago

If you're running a distributed system...

You're running a distributed system. They aren't simple.

Especially on AWS. AWS is really a double-bladed sword. Yeah, you'll get tutorials to set up whatever distributed system pretty quickly, but your nodes aren't nearly as reliable. Your networking isn't nearly as reliable. Your costs aren't nearly as reliable and administration. Headaches go up in the long run

varbhat 3 hours ago

Does anyone use https://nats.io here? I have heard good things about it. I would love to hear about the comparisons between nats.io and kafka

[-]

nchmy 3 hours ago

I dont have kafka experience, but nats is absolutely amazing. Just a complete pleasure to use, in every way.

https://www.synadia.com/blog/nats-and-kafka-compared

dijit 3 hours ago

I got really pissed off with their field CTO for essentially trying to pull the wool over my eyes regarding performance and reliability.

Essentially their base product (NATs) has a lot of performance but trades it off for reliability. So they add Jetstream to NATs to get reliability, but use the performance numbers of pure NATs.

I got burned by MongoDB for doing this to me, I won’t work with any technology that is marketed in such a disingenuous way again.

[-]

nchmy 2 hours ago

You mean Jetstream?

Can you point to where they are using core NATS numbers to describe Jetstream?

[-]

dijit 2 hours ago

Yes, I meant Jetstream (I even typed it but second guessed myself, my mistake) I’m typing these when I get a moment as I’m at a wedding- so I apologise.

The issue in the docs was that there are no available Jetstream numbers, so I talked over a video call to the field CTO, who cited the base NATs numbers to me, and when I pressed him on if it was with Jetstream he said that it was without: so I asked for them with Jetstream enabled and he cited the same numbers back to me. Even when I pressed him again that “you just said those numbers are without Jetstream” he said that it was not an issue.

So, I got a bit miffed after the call ended, we spent about 45 minutes on the call and this was the main reason to have the call in the first place so I am a bit bent about it. Maybe its better now, this was a year ago.

[-]

jeremyjh an hour ago

This doesn’t really support your position as far as most readers are concerned - it sounds like a disconnect. If they didn’t do this in any ad copy or public docs it’s not really in Mongo territory.

[-]

dijit an hour ago

I don’t really care.

I’m telling you why I am skeptical of any tech that intentionally obfuscates trade-offs, I’m not making a comparison on which of these is worse; and I don’t really care if people take my anecdote seriously either: because they should make their own conclusions.

However it might help people go in to a topic about performance and reliability from a more informed position.

zaphirplane an hour ago

It’s deceptive if true, why are you trying to spin it as it’s ok cause the deception were not published

AtlasBarfed 2 hours ago

Don't implement any distributive technology until aphyr has put it through the paces, and even then... Pilot

sea-gold 3 hours ago

There is a good comparison between NATS, Kakfa, and others here: https://docs.nats.io/nats-concepts/overview/compare-nats

atombender an hour ago

NATS is very good. It's important to distinguish between core NATS and Jetstream, however.

Core NATS is an ephemeral message broker. Clients tell the server what subjects they want messages about, producers publish. NATS handles the routing. If nobody is listening, messages go nowhere. It's very nice for situations where lots of clients come and go. It's not reliable; it sheds messages when consumers get slow. No durability, so when a consumer disconnects, it will miss messages sent in its absence. But this means it's very lightweight. Subjects are just wildcard paths, so you can have billions of them, which means RPC is trivial: Send out a message and tell the receiver to post a reply to a randomly generated subject, then listen to that subject for the answer.

NATS organizes brokers into clusters, and clusters can form hub/spoke topologies where messages are routed between clusters by interest, so it's very scalable; if your cluster doesn't scale to the number of consumers, you can add another cluster that consumes the first cluster, and now you have two hubs/spokes. In short, NATS is a great "message router". You can build all sorts of semantics on top of it: RPC, cache invalidation channels, "actor" style processes, traditional pub/sub, logging, the sky is the limit.

Jetstream is a different technology that is built on NATS. With Jetstream, you can create streams, which are ordered sequences of messages. A stream is durable and can have settings like maximum retention by age and size. Streams are replicated, with each stream being a Raft group. Consumers follow from a position. In many ways it's like Kafka and Redpanda, but "on steroids", superficially similar but just a lot richer.

For example, Kafka is very strict about the topic being a sequence of messages that must be consumed exactly sequentially. If the client wants to subscribe to a subset of events, it must either filter client-side, or you have some intermediary that filters and writes to a topic that the consumer then consumes. With NATS, you can ask the server to filter.

Unlike Kafka, you can also nack messages; the server keeps track of what consumers have seen. Nacking means you lose ordering, as the nacked messages come back later. Jetstream also supports a Kafka-like strictly ordered mode. Unlike Kafka, clients can choose the routing behaviour, including worker style routing and deterministic partitioning.

Unlike Kafka's rigid networking model (consumers are assigned partitions and they consume the topic and that's it), as with NATS, you can set up complex topologies where streams get gatewayed and replicated. For example, you can streams in multiple regions, with replication, so that consumers only need to connect to the local region's hub.

While NATS/Jetstream has a lot of flexibility, I feel like they've compromised a bit on performance and scalability. Jetstream clusters don't scale to many servers (they recommend max 3, I think) and large numbers of consumers can make the server run really hot. I would also say that they made a mistake adopting nacking into the consuming model. The big simplification Kafka makes is that topics are strictly sequential, both for producing and consuming. This keeps the server simpler and forces the client to deal with unprocessable messages. Jetstream doesn't allow durable consumers to be strictly ordered; what the SDK calls an "ordered consumer" is just an ephemeral consumer. Furthermore, ephemeral consumers don't really exist. Every consumer will create server-side state. In our testing, we found that having more than a few thousand consumers is a really bad idea. (The newest SDK now offers a "direct fetch" API where you can consume a stream by position without registering a server-side consumer, but I've not yet tried it.)

Lastly, the mechanics of the server replication and connectivity is rather mysterious, and it's hard to understand when something goes wrong. And with all the different concepts — leaf nodes, leaf clusters, replicas, mirrors, clusters, gateways, accounts, domains, and so on — it's not easy to understand the best way to design a topology. The Kafka network model, by comparison, is very simple and straightforward, even if it's a lot less flexible. With Kafka, you can still build hub/spoke topologies yourself by reading from topics and writing to other topics, and while it's something you need to set up yourself, it's less magical, and easier to control and understand.

Where I work, we have used NATS extensively with great success. We also adopted Jetstream for some applications, but we've soured on it a bit, for the above reasons, and now use Redpanda (which is Kafka-compatible) instead. I still think JS is a great fit for certain types of apps, but I would definitely evaluate the requirements carefully first. Jetstream is different enough that it's definitely not just a "better Kafka".

[-]

shikhar an hour ago

> Jetstream clusters don't scale to many servers (they recommend max 3, I think)

Jetstream is even more limited than most Kafkas on number of streams https://github.com/nats-io/nats-server/discussions/5128#disc...

erulabs an hour ago

Kafka's ability to ingest the firehose and present it as a throttle-able consumable to many different applications is great. If you're thinking "just use a database", it's worth noting that SQL databases are _not well suited_ to drinking from a firehose of writes, and that distributed SQL in 2012 was not a thing. Kafka was one of the first systems that fully embraced the dropping of the C from CAP theorem, which was a big step forward for web applications at scale. If you bristle at that, know that using read-replicas of your postgres database present the same correctness problems.

These days though, unless I was at Fortune 100 scale, I'd absolutely turn to Redis Cluster Streams instead. So much simpler to manage and so much cheaper to run.

Also I like Kafka because I met two pretty Russian girls in San Francisco a decade back and the group we were in played a game where we described what the company we worked for did in the abstract, and then tried to guess the startup. They said "we write distributed streaming software", I guessed "confluent" immediately. At the time confluent was quite new and small. Fun night. Fun era.

BiraIgnacio 3 hours ago

It was created to teach me the concept of love-hate relationships

clippy99 4 hours ago

Startup founder here -- we tried it, and it feels bloated (Java!), bureaucratic and overcomplicated for what it is. Something like Redis queues or even ZMQ probably suffices for 90% of use cases. Maybe in hyper-scaled applications that need to be ultraperformant (e.g., realtime trading, massive streaming platforms) is where Kafka comes into play.

[-]

majormajor 3 hours ago

If you are using this sort of redis queue (https://redis.io/glossary/redis-queue/) with PUSH/POP vs fan-out you're working on a very different sort of problem than what Kafka is built for.

Like the article says, fan-out is a key design characteristic. There are "redis streams" now but they didn't exist back then. The durability story and cluster stories aren't as good either, I believe, so they can probably take you so far but won't be as generally suitable depending on where your system goes in the future. There are also things like RedPanda that speak Kafka w/o the Java.

However, if you CAN run on a single node w/o worrying about partitioning, you should do that as long as you can get away with it. Once you add multiple partitions ordering becomes hard to reason about and while there are things like message keys to address that, they have limitations and can lead to hotspotting and scaling bottlenecks.

But the push/pop based systems also aren't going to give you at-least-once guarantees (looks like Redis at least has a "pop+push" thing to move to a DIFFERENT list that a single consumer would manage but that seems like it gets hairy for scaling out even a little bit...).

njitbew 3 hours ago

> and it feels bloated (Java!)

I'm curious, what exactly feels bloated about Java? I don't feel like the Java language or runtime are particularly bloated, so I'm guessing you're referring to some practices/principles that you often see around Java software?

[-]

cameronh90 2 hours ago

> what exactly feels bloated about Java?

https://docs.spring.io/spring-framework/docs/2.5.x/javadoc-a...

[-]

dcminter 2 hours ago

Kafka does not use Spring.

buckle8017 3 hours ago

Java the language and Java the runtime are fine.

The way most Java code is written is terrible Enterprise factory factory factory.

[-]

vlovich123 2 hours ago

But the perf is not reliable. If you want latency and throughput, idiomatic Rust will give you better properties. Interestingly even will Go for some reason has better latency guarantees I believe even though it’s GC is worse than Java.

[-]

jghn 2 hours ago

This presupposes the use case is such that this even matters. Obviously that is the case sometimes, but in the vast majority of cases it is not.

throwaway8242 26 minutes ago

That doesn't match my experience in the last 15 years working for 3 companies (one was a big enterprise, one medium sized and one startup)

Maybe I have been lucky, or that the practice is more common in certain countries or eco systems? Java has been a very productive language for me, and the code has been far from the forced pattern usage that I have read horror stories about.

von_lohengramm 2 hours ago

The problem is that writing genuinely performant Java code requires that you drop most if not all of the niceties of writing Java. At that point, why write Java at all? Just find some other language that targets the JVM. But then you're already treading such DIY and frictionful waters that just adopting some other cross-platform language/runtime isn't the worst idea.

poemxo 3 hours ago

Starting up a Java program takes much longer than it should and that affects perception.

[-]

halfmatthalfcat 3 hours ago

With AOT tho that should be somewhat moot.

[-]

poemxo 2 hours ago

I am just explaining why it has that reputation.

rvz 3 hours ago

> I'm curious, what exactly feels bloated about Java?

Everything.

Why do you think Kubernetes is NOT written in Java?

[-]

AtlasBarfed 2 hours ago

... Because it came from Google?

Golang has little to distinguish itself technically. It has a more modern std lib (for now) and isn't Oracle.

Which aren't trivial, but they aren't Trump cards.

slipperydippery 3 hours ago

Whatever efficiency may hypothetically be possible with Java, you can in-fact spot a real world Java program in the wild by looking for the thing taking up 10x the memory it seems like it should need… when idle.

Yes yes I’m sure there are exceptions somewhere but I’ve been reading Java fans using benchmarks to try to convince me that I can’t tell which programs on my computer are Java just by looking for the weirdly slow ones, when I in fact very much could, for 25ish years.

Java programs have a feel and it’s “stuttery resource hog”. Whatever may be possible with the platform, that’s the real-world experience.

[-]

MathMonkeyMan 3 hours ago

The JVM eats a chunk of memory in order to make its garbage collector more efficient. Think of it like Linux's page cache.

I haven't worked with too much Java, but I suspect that the distaste many have for it is due to its wide adoption by large organizations and the obfuscating "dressed up" tendency of the coding idioms used in large organizations.

The runtime isn't inherently slow, but maybe it's easier to write slow programs in Java.

jghn 3 hours ago

> taking up 10x the memory it seems like it should need… when idle.

The JVM tends to hold onto memory in order to make things faster when it does wind up needing that memory for actual stuff. However, how much it holds on to, how the GC is setup, etc are all tunable parameters. Further, if it's holding onto memory that's not being used, these are prime candidates to be stored in virtual memory which is effectively free.

bdangubic 3 hours ago

you know why you don’t see many non-Java programs on your computer taking up 10x memory? because no one uses them to write anything :)

jokes aside, we got a shift in the industry where many java programs were replaced by electron-like programs which now take 20x memory

[-]

vlovich123 2 hours ago

Technically kind of true but at the same time Android apps are predominantly Java/Kotlin. It speaks more to Java just having a bad desktop story. But it’s also why Android devices need 2x the ram

js4ever 3 hours ago

This seems like a very bad faith argument! Java programs are an order of magnitude slower to start and indeed use much more memory than C++/Go/Rust (even NodeJS) equivalent.

I'll probably burn some karma on this but let's go with more details!

1. Startup time and latency

Cold start penalties: Java apps typically take longer to start than Go, Rust, or even Python scripts. For short-lived CLI tools or serverless workloads, this is a real drawback.

JIT warmup: Performance often relies on the JIT compiler optimizing hot paths, which means peak performance comes only after warmup, not instantly.

---

2. Memory consumption

High baseline footprint: Even trivial Java apps often take hundreds of MB of RAM just for the JVM, class libraries, and runtime structures. Compare that to Go or Rust binaries which can run in a fraction of the memory.

GC overhead: Modern garbage collectors are good, but they still consume memory headroom and CPU cycles. Low-latency GC tuning is non-trivial and often leads to further bloat.

---

3. Distribution and deployment

Huge runtime baggage: You can’t ship a minimal standalone binary without pulling in a JVM. A simple “Hello World” isn’t self-contained like in Go or C.

Container inefficiency: In cloud-native/containerized environments, Java apps are notorious for high memory requests and tuning requirements compared to leaner languages.

---

4. Complexity of tooling

Build tools are heavy: Maven and Gradle are infamous for their slowness and complexity, often requiring downloading gigabytes of dependencies.

Jar hell: Dependency management and classpath conflicts historically plagued Java, and while better today, it still contributes to the “bloated” reputation.

---

5. Historical baggage

Enterprise legacy: “Enterprise Java” (J2EE, app servers, Spring monoliths) left a legacy of verbose, heavyweight systems that contrasted with lighter ecosystems (Node.js, Go microservices, etc.).

Language verbosity: Although modern Java has improved (records, lambdas, etc.), the ecosystem still carries a reputation for wordiness and boilerplate.

---

6. Comparative benchmarks

In microservices and high-performance environments, Go and Rust often outperform Java both in latency and resource efficiency.

For small tools or CLI utilities, Java is rarely chosen because of its startup time and memory overhead.

---

7. Culture and perception

“Write once, run anywhere” costs: The abstraction layers that make Java portable also add runtime weight and overhead.

DevOps annoyance: Running Java apps at scale often requires constant GC tuning, heap sizing, and JVM parameter tweaking, which makes it feel heavyweight compared to languages with leaner defaults.

[-]

p2detar 2 hours ago

Is this an AI-generated answer? Most of these are not even true, although I still would prefer Go for micro-services. I'll address just a bunch and to be clear - I'm not even a big Java fan.

- Quarkus with GraalVM compiles your Java app to native code. There is no JIT or warm up, memory footprint is also low. By the way, the JVM Hotspot JIT can actually make your Java app faster than your Go or Rust app in many cases [citation needed] exactly due to the hot path optimizations it does.

- GC tuning - I don't even know who does this. Maybe Netflix or some trading shops? Almost no one does this nowadays and with the new JVM ZGC [0] coming up, nobody would need to.

> You can’t ship a minimal standalone binary without pulling in a JVM.

- You'd need JRE actually, e.g., 27 MB .MSI for Windows. That's probably the easiest thing to install today and if you do this via your package manager, you also get regular security fixes. Build tools like Gradle generate a fully ready-to-execute directory structure for your app. If you got the JRE on your system, it will run.

> Dependency management and classpath conflicts historically plagued Java

The keyword here is "historically". Please try Maven or Gradle today and enjoy the modern dependency management. It just works. I won't delve into Java 9 modules, but it's been ages since I last saw a class path issue.

> J2EE

Is someone still using this? It is super easy writing a web app with Java+Javalin for example. The Java library and frameworks ecosystem is super rich.

> “Write once, run anywhere” costs: The abstraction layers that make Java portable also add runtime weight and overhead.

Like I wrote above, the HotSpot JIT is actually doing the heavy lifting for your in real time. These claims are baseless without pointing to what "overhead" is meant in practice.

---

0 - https://inside.java/2023/11/28/gen-zgc-explainer/ or https://www.youtube.com/watch?v=dSLe6G3_JmE

dionian 2 hours ago

conflicts are a necessary evil with a massive dependency ecosystem

jayd16 2 hours ago

"Java is bloated because I only look at the bloated examples."

Is C++ bloated because of the memory Chrome uses?

closeparen 2 hours ago

Publishing an event to Kafka puts it “out there” in a way that guarantees it won’t be lost and allows any number of interested consumers, including the data warehouse, to deal with it at their leisure (subject to retention period which is typically like 72h). For us, your Kafka topics and their schemas are as much a part of your API as your gRPC IDLs. Something like Redis or 0MQ feels more appropriate for internal coordination between instances of the same service, or at least a producer that has a specific consumer in mind.

cortesoft 3 hours ago

It doesn't have to be 'hyper-scaled' to be needed, unless we have widely different definitions of hyper scale. Access logs from a few thousand servers with medium traffic will push you past any single instance service, and Kafka works great for that workload.

oulipo2 3 hours ago

Have you tried Redpanda?

haddr 2 hours ago

Couldn’t disagree more… if you go the ZMQ you are left alone handling many things you get in Kafka for free. If you have any sort of big data problems then good luck. You are going to reinvent the wheel.

rvz 3 hours ago

> Maybe in hyper-scaled applications that need to be ultraperformant (e.g., realtime trading, massive streaming platforms) is where Kafka comes into play.

Kafka is used because the Java folks don't want to learn something new due to job security, even though there are faster and compatible alternatives that exist today.

Rather use Redpanda, than continue to use Kafka and then complain about how resource intensive it is alongside zookeeper and all the circus the comes with it and make AWS smile as you're losing hundreds of thousands a month.

[-]

AtlasBarfed 2 hours ago

I thought Kafka ditched the zookeeper

blast 4 hours ago

(I was wondering if this was some sort of generated ripoff but the author worked on Kafka for 6 years: https://x.com/BdKozlovski.)

bjourne 2 hours ago

I read it and looked at the block diagrams. I still don't get it. You have "data integration problems". Many software share data. Use database. Problems solved.

zug_zug 3 hours ago

My only complaint with this article is that it seems to be implying kafka that linkedIn's problem couldn't have been solved with a bunch of off-the-shelf tools.

[-]

majormajor 3 hours ago

What off the shelf tools in 2012 would you propose, exactly?

[-]

zug_zug 3 hours ago

Make it less event-orchestrated and use a db. It’s just a social network for recruiters it’s not as complicated as they like to pretend.

You don’t need push, it’s just a performance optimization that almost never justifies using a whole new tool.

[-]

majormajor an hour ago

So solve "ETLs into a data warehouse are hard to make low-latency and hard to manage in a large org" by... just hypothetical better "off the shelf tools". Or "don't want low latency because you're 'just' a recruiting tool, so who cares how quickly you can get insights into your business."

Go back to the article, it wasn't about event-sourcing or replacing a DB for application code.

sebastialonso an hour ago

The only correct answer to the question asked is "I don't know the context, I need more information". Anything else is being a bad engineer.

AtlasBarfed 2 hours ago

Your solution to a queue and publish subscribe problem is to use a database?

tomrod 3 hours ago

Sounds like MQTT?

[-]

majormajor an hour ago

MQTT wouldn't give you the persistence or the decoupling of fast and slow consumers.

jyounker 3 hours ago

And what would that off-the-shelf software have been?

polynomial 4 hours ago

Why was it named that is also a question.

[-]

isaacremuant 4 hours ago

> Jay Kreps chose to name the software after the author Franz Kafka because it is "a system optimized for writing", and he liked Kafka's work.

From Wikipedia.

[-]

clippy99 3 hours ago

Kafkaesque to configure and get running for simple tasks.

atombender 4 hours ago

A system optimized for writing could also describe the machine in Kafka's "In the Penal Colony".

theyinwhy 2 hours ago

It is called Kafka because it can write.