From a young age, I have been fascinated by systems. I was captivated by the elegance of one system in particular. The flawless execution. The predictability. The sheer genius of the school bus system.The school bus system had me so enthralled that on a daily basis I replicated the process on the floor of my childhood home. The canonical transportation systems of students to that eight-year-old was awesome and I quickly learned that there are all sorts of systems in the world. Some designed, some not. Some better than others. But most importantly, systems exist to solve problems.
This problem to be solved is: How to get 700 students to and from their homes and to the school building 182 times per year?
Every school day, three rows of school buses would park with their yellow lights flashing and engines still running. The first bell would ring and students streamed out the building scurrying to board their buses which were in the same spot every afternoon.
Once all of the buses were full and the doors pulled shut, a teacher would signal the first driver to depart and a parade of yellow steel filled with pupils would roll up the big hill to the main road where the principal would stop normal traffic so every bus could swiftly exit and deliver younglings to their homes.
I’ll apply this metaphor to the remaining content of this article about Event-Driven Architecture. So take note that the various components of this metaphor are:
- The principal directing traffic is a mediator.
- The school buses and routes the channels.
- The parking lot and stop signs are queues.
- The school children are the events.
Several trends in computing have surfaced in recent years: big data, containers, serverless application, microservices, and event-driven architecture (EDA). The popularity has grown because companies know they can move faster to deliver a scalable solution that are much more manageable than monolithic applications.
What organizations are learning is that with the exponential increase in data their systems have to handle, their traditional RDBMS cannot handle the volume. Processing of information requires long running batch-based ETL to extract business insights. There is nothing wrong with batch processing of jobs for data warehousing but many companies are finding the need for real-time insights critical to achieving their desired outcomes.
In the mid-2000s, N-tier architecture was at its height of popularity. Organizations would build monolithic code bases backed by an RDBMS against which all CRUD was performed. Organizations are finding that the complexity of their applications has grown and their dependency graphs are virtually impossible to understand. Or, that they’ve piecemealed so many point-to-point integrations that they’ve built a fragile solution. No one can blame us engineers because, after all, Object-Oriented Programming (OOP) encourages code reuse and, well, at the time I really just needed service X to talk to service Y. Oops.
A Better Approach
Organizations are starting to migrate from their monolithic to microservice architectures. Decoupling business logic and services from the event processor decrease the complexity of the architecture. Services can be developed and deployed in isolation and do not have to be aware of other services. The initial investment, in my opinion, will be paid back in spades. A microservice architecture requires a system to facilitate communication from service to service.
By nature, an EDA facilitates a completely decentralized platform. Services don’t even have to live in the same system or data center and don’t have to be owned by the same organization. If a school bus system has a need for additional buses, let’s say because of an after-school event in a different school district, the bus system could request additional buses from one or more other systems to handle the additional load. This flexibility allows the system to expand and contract on demand.
Great. You’ve decided that an EDA is worth the investment. You’re ready to start building your new platform. Now you need to decide on the general topology of your architecture. There are two popular topologies.
As illustrated in Figure 1, the mediator topology consists of:
- an event queue through which all events are processed
- a mediator which manages the order and channels in which an event is to be processed
- services which perform business logic
This is an abstract pattern so remember that the type of queueing technology you use, and how you implement the mediator (also known as a controller or orchestrator) is up to you. We use GCP Pub/Sub for our queueing technology and a node.js microservice running on Kubernetes Engine and Google Compute Engine for our controller. We then have N-number of services that process data and leverage the single-responsibility principle.
Applied to the bus system metaphor, imagine that a bus needs to have its brakes replaced and oil changed, an individual would be responsible for dropping the bus off to the mechanic that will replace the brakes. Once the mechanic was done, she would inform the mediator who would then deliver the bus to the mechanic charged with replacing the oil.
In the broker topology, as illustrated in Figure 2, rather than leverage a centralized mediator for orchestrating event and services, services subscribe to channels, execute their business logic, and then publish a new message to which other services subscribe. An advantage of this approach is that by removing the need for a mediator you’ve reduced complexity. The disadvantage is that coordination and enforcement of execution order are not handled. Again, the pattern is agnostic to technology.
Applied to the same school bus system metaphor, if a bus has to have it’s brakes replaced and then the oil changed, in a broker topology, the bus would be dropped off to the mechanic replacing the brakes. Once the job was done, he’d put it in the parking lot where the oil-changing technician would know to look and pull into the garage to change the oil. I realize that example is a stretch, but you get the point.
In an EDA, there are two important factors for performance, throughput, and latency. The greater the latency, the lower the throughput. You have two options to continue to improve the performance of your system: decrease latency by optimizing code or configurations, or increase throughput by adding additional resources.
When it comes to measuring performance, I recommend that your team define Service Level Objects (SLO) and Service Level Indicators (SLI) for each service in your platform. We’ve adopted this approach and it allows us not only to monitor and gauge our success in production, but it gives us a guidepost for analyzing benchmark and performance test results prior to launching new features.
If you’ve not read it, I highly recommend the book published by Google called Site Reliability Engineering.
If there is one thing I’ve learned from building scalable platforms, it is that every millisecond matters. 1ms doesn’t matter at low volume but add 1ms to the processing of 1 million messages and you’ve added 15 minutes to processing time. Adding 1ms for one billion messages adds 277 hours.
Every millisecond matters. Adding 1 millisecond of processing to one billion messages adds 277 hours to processing time.
Here are some recommendations based on my experience:
- Be smart about what data you include on your payloads.
- Add constraints to the amount of data you include on your payload to control performance and expense.
- Never ever use a transactional database as a queue, especially when you expect high volumes. Databases like SQL have to write every transaction to a transaction log. Waits on writes to the transaction log will kill your performance. Also, reading and writing from the same table will 100% result in locks on the table and IO waits that drastically increase IO wait times.
- There are two dates that matter in an event-driven system: the actual event date and the processed date. The actual event date is the time at which a user or system action occurred. The processed date is usually the time the event was ingested by the system. The distinction is important because your architecture should manage late arrivals if you are performing any sort of logic within a window.
Given the last bullet point, imagine that the school principal is required to count all of the students that arrive to school for that day. She spends 20 minutes as all of the busses arrive and ends up with 620 students. Then she stops counting. The window closed. But let’s say that one bus was ten minutes late because a traffic light was out of service. The principal would have to go back and add the additional 80 students to her original tally. This same sort of reconciliation would have to occur in your system.
Finally, event-driven architecture is not the right pattern for all applications. Quite the contrary, such a pattern introduces the complexity of its own.
Don’t fall victim to Sledgehammer Syndrome.
EDAs could very likely be a sledgehammer for your problem when you instead need a screwdriver. Take into account the cost of development and maintenance when deciding if this is the right solution for you.
When considering the complexity that the solution will require and whether it’s worth the investment, determine how you will tackle the following:
- What is the right level of fidelity of service abstraction?
- Are your event messages schema or schema-less?
- How does your system handle failures caused by bad or corrupt data, downed services or queues?
- How do you handle a noisy neighbor problem?
- How will you debug and understand the flow of events through the system?
- How do you handle events that are not idempotent?
- How will your system handle cycle prevention and detection?
- How will your system handle rollback of a distributed transaction?
To sum it up, EDAs are:
- Highly scalable and decentralized,
- Can process events and perform functions like aggregates at time of ingestion, and
- Eliminate point-to-point integration.
Consider employing an EDA when:
- you require low latency and high volume processing,
- you require aggregating or processing data real-time within a window,
- more than one service needs to processes the same event,
- you need to horizontally scale a distributed system, and
- you want to implement a microservices architecture.