Catallaxy Services | @feaselkl |
|
Curated SQL | ||
We Speak Linux |
Apache Kafka is a message broker on the Hadoop stack. It receives messages from producers and sends messages to consumers. Everything in Kafka is distributed.
Suppose we have two applications which want to communicate. We connect them directly.
Works great at low scale--it's easy to understand, easy to work with, and fewer working parts to break. But it hits scale limitations.
We then expand out.
Easy to expand this way as long as you don't overwhelm the DB. Eventually you will.
We then expand out. Again.
Takes some effort here; need to manage connection strings and write to correct DB. But it's doable and expands indefinitely.
But what happens when a consumer (database) goes down?
Producers (app servers) hold messages or they fail. Neither option is great.
Enter brokers. Brokers take messages from producers and feed messages to consumers.
Consumer down? Broker holds messages & producers don't care. Producer down? Consumers don't care. Brokers deal with the jumble of connections and help with scale-out.
Today's talk will focus on using Kafka to ingest, enrich, and consume data. We will build .NET applications in Windows to talk to a Kafka cluster on Linux.
Our data source is flight data. I’d like to ask a few questions, with answers split out by destination state:
Our first application reads data from a CSV and pushes messages onto a topic.
This application will not try to understand the messages; it simply takes data and pushes it to a topic.
I chose Confluent's Kafka .NET library (nee RDKafka-dotnet) as my library of choice.
There are several libraries available, each with their own benefits and drawbacks. This library serves up messages in an event-based model and has official support from Confluent, so use this one.
Our second application reads data from one topic and pushes messages onto a different topic.
This application provides structure to our data and will be the largest application.
Enrichment opportunities:
Our third application reads data from the enriched topic, aggregates, and periodically writes results to SQL Server.
Basic tips:
Collections.Concurrent.BlockingCollection
Minimize latency when you want the most responsive consumers but don't need to maximize the number of messages flowing.
Maximize throughput when you want to push as many messages as possible. This is better for bulk loading operations.
Consumer config: fetch.wait.max.ms
, fetch.min.bytes
Producer config: batch.num.messages
, queue.buffering.max.ms
Kafka is a horizontally distributed system, so when in doubt, add more:
Apache Kafka is a powerful message broker. There is a small learning curve associated with Kafka, but this is a technology well worth learning.
To learn more, go here: http://CSmore.info/on/kafka
And for help, contact me: feasel@catallaxyservices.com | @feaselkl