A fully-reproducible, Dockerized, step-by-step tutorial for dipping your toes into the data stream.
With today's terabyte- and petabyte-scale datasets and an uptick in demand for real-time analytics, traditional batch-oriented data processing doesn't suffice. To keep up with the flow of incoming data, organizations are increasingly moving towards stream processing.
In batch processing, data are collected over a given period of time. These data are processed non-sequentially as a bounded unit, or batch, and pushed into an analytics system that periodically executes.
In stream processing, data are continuously processed as new data become available for analyzing. These data are processed sequentially as an unbounded stream and may be pulled in by a “listening” analytics system.
Though stream processing does not need to be real-time, it enables the kind of data processing and analysis used by many real-time systems that require fast reactions to incoming data: log or clickstream monitoring, financial analysis on a trading floor, or sensor data from an Internet of Things (IoT) device, for example.
Stream Processing Tools
Stream processing requires different tools from those used in traditional batch processing architecture. With large datasets, the canonical example of batch processing is Hadoop's MapReduce over data in HDFS. A number of new tools have popped up for working with data streams: a bunch of Apache tools like Storm (and Twitter's successor, Heron), Flink, Samza, and Kafka, plus Amazon's Kinesis Streams and Google's Dataflow. Some tools are available for both batch and stream processing, e.g., Apache Beam and Spark. (Spark only sort of counts, but I guess it's good enough: it is dropping support for its original streaming API in favor of "structured streaming," which is microbatch processing and non-ideal in some cases, so I am pretend mad at it right now.)
A Simple Recipe for Data Processing with Kafka and KSQL
My favorite new stream processing tool is Apache Kafka, originally a pub/sub messaging queue thought up by folks at LinkedIn and since rebranded as a more general distributed data stream processing platform. Kafka takes data published by 'producers' (which may be, e.g., apps, files / file systems, or databases) and makes it available for 'consumers' subscribed to streams of different 'topics.' In my previous life as an astronomer, I did a lot of playing with Kafka for real-time distribution of alert data on detections of new and changing astronomical objects from some cool new telescopes [link for the curious].
The Kafka ecosystem has growing support and has been supplemented with Kafka Streams, a library for building streaming apps, and KSQL, a SQL-like streaming interface. I especially like Kafka for its user-friendly Python API and its easy integration with many other tools via Kafka Connect.
Here I’ll outline a fully reproducible step-by-step tutorial on how to stream tables from Postgres to Kafka, perform calculations with KSQL, and sync results back to Postgres using Connect.
All materials are available on my GitHub at https://github.com/mtpatter/postgres-kafka-demo. To follow along, clone the repo to your local environment. You should need only Docker and docker-compose on your system.
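Cloning and entering the repo looks like this:

```bash
git clone https://github.com/mtpatter/postgres-kafka-demo.git
cd postgres-kafka-demo
```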
Ingredients
We will be using the following technologies through Docker containers:
- Kafka, the main data streaming platform
- ZooKeeper, Kafka's sidekick, used for coordinating the Kafka cluster
- KSQL server, which we will use to create live updating tables
- Kafka's Schema Registry, needed to use the Avro data format, a compact binary format with JSON-defined schemas that enforces a schema on our data
- Kafka Connect (pulled from Debezium), which will source data from Postgres into Kafka and sink results from Kafka back to Postgres
- PostgreSQL (also pulled from Debezium and tailored for use with Connect)
Directions
The data used here were originally taken from the Graduate Admissions open dataset available on Kaggle. The admit csv files are records of students and test scores with their chances of college admission. The research csv files contain a flag per student for whether or not they have research experience.
1. Bring Up The Compute Environment
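Assuming the repo ships a docker-compose.yml at its root (which the Docker and docker-compose requirement above suggests), the whole stack comes up with one command:

```bash
# bring up Kafka, ZooKeeper, the schema registry, KSQL server, Connect, and Postgres
docker-compose up -d
```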
2. Load Data into Postgres
We will bring up a container with a psql command line, mount our local data files inside, create a database called students, and load the data on students’ chance of admission into the admission table.
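A sketch of that command; the network name (here assumed to be the Compose default, postgres-kafka-demo_default) and the mount path are assumptions, so adjust them to match your checkout:

```bash
# assumes the Compose default network name and a service called "postgres";
# check `docker network ls` and docker-compose.yml and adjust as needed
docker run -it --rm \
  --network postgres-kafka-demo_default \
  -v "$PWD":/home/data \
  postgres psql -h postgres -U postgres
```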
Password = postgres
At the psql command line, create a database and connect:
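For example:

```sql
CREATE DATABASE students;
\connect students
```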
Load the admission data table with:
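A sketch of what that looks like; the column names and csv path here are my guesses, so match them to the header and location of the repo's admit csv files:

```sql
-- column names are assumptions; check the csv header and adjust
CREATE TABLE admission (
  student_id   INTEGER PRIMARY KEY,
  gre          INTEGER,
  toefl        INTEGER,
  cgpa         DOUBLE PRECISION,
  admit_chance DOUBLE PRECISION
);

\copy admission FROM '/home/data/admit.csv' DELIMITER ',' CSV HEADER
```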
Load the research data table with:
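And similarly for research (again, assumed names and path):

```sql
CREATE TABLE research (
  student_id INTEGER PRIMARY KEY,
  research   INTEGER  -- 1 if the student has research experience, 0 otherwise
);

\copy research FROM '/home/data/research.csv' DELIMITER ',' CSV HEADER
```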
3. Connect the Postgres Database as a Source to Kafka
The postgres-source.json file contains the configuration settings needed to source the entire students database into Kafka:
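Its contents will look roughly like the Debezium Postgres source configuration below; the hostnames and credentials are my assumptions based on the setup described above, not a verbatim copy of the repo's file:

```json
{
  "name": "postgres-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "postgres",
    "database.password": "postgres",
    "database.dbname": "students",
    "database.server.name": "dbserver1"
  }
}
```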
Submit the source file to Connect via curl:
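Assuming Connect's REST API is published on the default port 8083:

```bash
curl -X POST -H "Content-Type: application/json" \
  --data @postgres-source.json \
  http://localhost:8083/connectors
```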
The connector postgres-source should show up when curling for the list of existing connectors:
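For example:

```bash
curl -H "Accept: application/json" http://localhost:8083/connectors
```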
The two tables in the students database will now show up as topics in Kafka. You can check this by entering the Kafka container with the container ID (which you can get via docker ps);
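For example (substitute your own container ID):

```bash
docker ps                              # grab the Kafka container ID from this output
docker exec -it <kafka-container-id> /bin/bash
```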
and listing available topics:
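From inside the container, something like the following should work, depending on the Kafka version in the image (the zookeeper hostname is the assumed Compose service name):

```bash
bin/kafka-topics.sh --zookeeper zookeeper:2181 --list
# or, on newer Kafka versions:
bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
```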
4. Start KSQL
Bring up a KSQL server command line client as a container:
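A sketch, assuming the KSQL server service is named ksql-server and listens on the default port 8088 (adjust the network and hostname to match the compose file):

```bash
docker run -it --rm \
  --network postgres-kafka-demo_default \
  confluentinc/cp-ksql-cli \
  http://ksql-server:8088
```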
To see your updates, a few settings need to be configured by first running:
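At minimum, I would expect this to include reading each topic from the beginning:

```sql
SET 'auto.offset.reset' = 'earliest';
```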
5. Mirror Postgres Tables in KSQL
The Postgres table topics will be visible as dbserver1.public.admission and dbserver1.public.research in KSQL:
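You can confirm this from the KSQL prompt:

```sql
SHOW TOPICS;
```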
We will create KSQL streams (a source stream subscribed to the corresponding Kafka topic and a rekeyed stream needed to populate a table) to back auto-updating KSQL tables that mirror the Postgres tables. Since the data are sourced from Postgres via Connect, the data format will be set to Avro.
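Here is a sketch of that pattern for the admission topic; the column and key name (student_id) are assumptions, and the same three statements get repeated for research:

```sql
-- source stream over the Debezium topic; with Avro, KSQL pulls the schema from the registry
CREATE STREAM admission_src
  WITH (KAFKA_TOPIC='dbserver1.public.admission', VALUE_FORMAT='AVRO');

-- rekey on the student id so the table below has the right key
CREATE STREAM admission_src_rekey AS
  SELECT * FROM admission_src PARTITION BY student_id;

-- table backed by the rekeyed topic (note that KSQL writes the topic name out uppercased)
CREATE TABLE admission
  WITH (KAFKA_TOPIC='ADMISSION_SRC_REKEY', VALUE_FORMAT='AVRO', KEY='student_id');
```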
The table and stream names I've used above are lowercase, but currently KSQL uppercases stream, table, and field names no matter how you write them.
6. Create Downstream Tables
We will create a new KSQL streaming table to join students’ chance of admission with research experience.
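Something along these lines, using the assumed table and column names from above (the output table name here is my own placeholder):

```sql
CREATE TABLE admission_research AS
  SELECT a.student_id, a.admit_chance, r.research
  FROM admission a
  LEFT JOIN research r ON a.student_id = r.student_id;
```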
and another table calculating the average chance of admission for students with and without research experience:
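For example (the averaged column name is an assumption, and I compute the mean as SUM/COUNT since older KSQL versions lack an AVG function):

```sql
CREATE TABLE research_ave_boost AS
  SELECT research, SUM(admit_chance) / COUNT(admit_chance) AS avg_chance
  FROM admission_research
  GROUP BY research;
```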
7. Add a Connector to Sink a KSQL Table Back to Postgres
The postgres-sink.json configuration file will create a RESEARCH_AVE_BOOST table and send the data back to Postgres:
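Roughly, with connection details and primary-key settings that are my assumptions rather than the repo's exact file:

```json
{
  "name": "postgres-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "connection.url": "jdbc:postgresql://postgres:5432/students?user=postgres&password=postgres",
    "topics": "RESEARCH_AVE_BOOST",
    "auto.create": "true",
    "insert.mode": "upsert",
    "pk.mode": "record_value",
    "pk.fields": "RESEARCH"
  }
}
```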
Submit the sink file to Connect:
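Same pattern as the source:

```bash
curl -X POST -H "Content-Type: application/json" \
  --data @postgres-sink.json \
  http://localhost:8083/connectors
```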
8. Update the Source Postgres Tables and Watch the Postgres Sink Table Update
The RESEARCH_AVE_BOOST table should now be available in Postgres to query:
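For example, from a psql session connected to students (the quoting and the cast reflect the uppercase, text-typed fields noted at the end of this post):

```sql
SELECT * FROM "RESEARCH_AVE_BOOST"
  WHERE CAST("RESEARCH" AS INT) = 0;
```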
With these data, the average chance of admission for students without research experience is 65.19%.
Add some new data to the admission and research tables in Postgres:
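The rows below are purely hypothetical placeholders using my assumed column names; the repo includes its own update data, which is what produces the exact averages quoted here:

```sql
INSERT INTO admission (student_id, gre, toefl, cgpa, admit_chance)
  VALUES (501, 320, 110, 9.1, 0.88);  -- hypothetical new student

INSERT INTO research (student_id, research)
  VALUES (501, 0);                    -- hypothetical: no research experience
```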
With the same query above on the RESEARCH_AVE_BOOST table, the average chance of admission for students without research experience has been updated to 63.49%.
A few things to note:
- The tables are forced by KSQL to uppercase and are case sensitive, which is annoying and also buggy. You can watch this KSQL GitHub issue for updates. The research field needs to be cast because it has been typed as text instead of integer (though it is integer type in KSQL), which may also be a bug in KSQL or Connect.
- I use a Debezium PostgresConnector for my connector.class in my Connect source file and a Confluent JdbcSinkConnector for my connector.class in my Connect sink file. I tried both for source and sink, but this was the only configuration I could get to work correctly.
Wrapping Up
Now you should be in good shape to start trying out KSQL with continuously running queries on your own database tables. Here we've walked through new data arriving via database table updates. Because we are already using Kafka, we could just as easily replace the Postgres table source with Kafka producers publishing data directly to a Kafka topic, or with a continuously updating file (a log, for example). Those data would appear as a stream available to KSQL just as above. There are a number of different Kafka connectors available on the Confluent Hub for sourcing/sinking to/from databases, file systems, and even Twitter.
Stream processing allows data analysis pipeline results to be continuously updated with the arrival of new data, which enables automation and scalability. This architecture is one of the many cool things we work with to build and scale analysis pipelines on the Data Science team at High Alpha.
If you have any comments/questions about data stream pipelines, feel free to drop me a line here or hit me up via Twitter @OpenSciPinay.