Part 1 of a series on the challenges of writing microservices
The inspiration and content for this blog series comes from our team’s (Project Charlotte at High Alpha) experience developing a system of microservices and the various challenges, complexities, and solutions we encountered along the way. Hopefully, this will give anyone else attempting to develop microservices some insight into what challenges they can expect, as well as some ways we have overcome them.
What’s the problem?
For a number of years now, the trend toward microservices has surged. It’s safe to say that this architecture is here to stay, and for good reason. Microservices provide many benefits over their monolithic counterparts, such as easy horizontal scaling of crucial components, isolation of work, and the freedom to use virtually any language without affecting the rest of the app (which is great for hiring engineers of any background). That’s not to say the monoliths of old are bad by any means; in fact, they have their own set of advantages. However, in the world of technology nothing stands still, including architectural decisions. So, naturally, more and more companies find themselves refactoring their systems.
Unfortunately, there are no silver bullets and microservices aren’t perfect. Microservices present their own unique set of challenges which are likely unfamiliar when coming from the world of monoliths. For the purpose of this article, I assert that communication between your system’s apps/services presents one of the most important challenges you will encounter during the practical development of microservices.
In a monolithic system, if you wanted data or needed some process to run, you would likely just call a function somewhere in your codebase. Simple enough. In a microservice system, however, you will likely have to make a call to some other service over the network to get that data or perform that function. That’s an unfortunate addition of complexity, but you might be thinking it’s not such a big deal; and in really simple cases, it isn’t. Let’s say, for example, you had a simple web application which talked to [insert-favorite-database-here]. It might look something like this:
All you need to do is have your API service request data specifically from your data service. No big deal. However, let’s ramp up the complexity a bit. Let’s say you have an app with accounts for users (probably stored in some SQL database) and some form of events coming in that are being processed and then stored in ElasticSearch for advanced text searching. Now your architecture might look like this:
You can probably see how, even in this realistically small example, the communication might get complicated. You now have events coming in, being processed by an arbitrary number and variety of services, and being stored in a separate data store. Say you want to swap out any of those services: now you need to update the code in every service that uses it to point to the new service, and make sure the new service points to the correct services in turn. Even if you have those addresses configured, you still have to change the configs every time an address changes or needs to point somewhere else. Not only that, but now you need to balance requests over each available replica of the corresponding service. Finally, your API is no longer getting data from just one data service, but two.
In summary: we have to know what services exist and how many instances of each exist (service discovery), we need to distribute messages across the available replicas, we need to know how to serialize our messages for each service, we have to know the protocol each service speaks, and we need to write clients for each service. Hopefully, you now appreciate the importance of this problem.
How can you solve this?
If you saw the previous example and thought those were some nontrivial problems, you’re right. Fortunately, there are a lot of tools, patterns, and protocols out there to help mitigate these challenges. Let’s take a look at some of the tools available for each of the challenges described (plus some others not yet mentioned).
The first and probably most visible challenge is knowing what services exist in your system and where (aka service discovery). This isn’t a new problem, and there are a number of solutions, which can be as simple as adding entries to a DNS server. However, these days you might be working with containers or other forms of machine/network abstraction where a typical DNS server might not be available, or might not be the best solution. Here, a service orchestrator or similar tool (such as Kubernetes) might provide a mechanism for internal networking. If not, then a central configuration store (such as Consul) would allow each service to describe itself and read about other services in the system.
Another challenge we encountered was the need to distribute requests or messages across all available instances of the desired service. Again, this is not a new problem, just one exacerbated by the increased number of services. The solution can be as simple as a hardware load balancer, but those depend heavily on your environment, and you may not have the knowledge or desire to set them up anyway. You could also write your own software load balancer for each service, though that can be a pain. Alternatively, if you don’t care about the response, you can use queues to distribute the messages across all consumers (a very common approach).
Some other challenges not explicitly shown earlier are message serialization, transport, logging, and tracing. For message serialization, there are plenty of excellent options, the most popular being JSON. The key is just picking one that works and makes the most sense for your stack. Similarly, the important thing for transport is picking one or two protocols that make sense for your stack. Some options include plain HTTP, REST, and gRPC, as well as queue-based protocols such as AMQP.
The next two challenges, logging and tracing, are similar and arguably the most important of these extra challenges. Most developers don’t consider logging (and tracing) until it’s too late and they already have a need for it. The best thing you can do is bake this into your system as soon as possible. For logging, fluentd is a very widely-used log consumer and has a large number of services you can connect to its log streams. For tracing, simply start adding unique transaction identifiers to all your initial requests/messages and propagate them to all downstream requests/messages. This will allow you to look through your logs for any given transaction so you can trace that request/message.
As you can see, each of these challenges with communication has its own set of tools or solutions, most of which are pretty common and/or easy to start using. However, despite the fact that we now have tools and solutions for each of these problems, we still end up with a lot of code and complexity throughout our services and infrastructure. This presents more work and headaches that we’d prefer to avoid. So here on the Project Charlotte team at High Alpha, we decided to centralize as much of this logic into one tool as we could. We call this tool our ESB (enterprise service bus); a code utility that handles pretty much all the issues described so far.
The Project Charlotte ESB
Before I start explaining our ESB and its benefits, I’ll briefly list our stack of technologies to give some context to our solution. We are currently set up using Kubernetes on Google’s GKE service. Our stack is currently all written in golang and communication is handled through gRPC and RabbitMQ. Obviously, there is a lot of other stuff in there as well (databases, dashboards, frameworks, etc…), but in terms of the technology affecting or solving communication between services, that’s it.
So, what is our ESB and what does it do? Our ESB handles virtually all aspects of communication between services; it centralizes all the code that solves our problems and provides a simple layer for sending and receiving requests/messages. Our ESB utilizes Kubernetes (and RabbitMQ to a degree) for service discovery and load balancing. We use gRPC for synchronous or request-response transport and RabbitMQ for asynchronous or fire-and-forget transport. All messages are defined and serialized with protobuf. Sitting on top of GKE, all of our services have very convenient access to Stackdriver logging out of the box. Last, but not least, all messages are wrapped up in a transport shell we call an EsbMessage, which contains a number of properties including tracing identifiers.
Now, setting up our server (for receiving requests/messages) is as simple as:
//Create new instance of the EsbServer, with given name and options
server := NewEsbServer("my.service", myServiceOptions...)

//Register the route handlers on the server along with the handler type
//The type can be either default (gRPC) or queue (RabbitMQ)
server.RegisterRoutes(routes, routesType)
What’s going on here? We are creating a new instance of our EsbServer and registering our service under the given name. Then we register some routes (a map[string]HandlerFunc) on the server with a given type (default or queue). Now all incoming messages will be handled, logged, and have contextual information extracted (headers, tracing identifiers, etc…) before being sent on to the appropriate route, regardless of transport type.
Similarly, the way we send requests/messages is quite simple:
//Create new EsbClient with given options
client, err := NewEsbClient(myClientOptions...)
...
//Send a message to a queue (fire and forget semantics)
_, err := client.SendMessage(
    "other.service.hello",
    myMessage,
    NewFlags(FIRE_AND_FORGET),
)

//Send a message over gRPC for request/response semantics
response, err := client.SendMessage(
    "other.service.hello",
    myMessage,
)
What’s going on here? We set up an instance of our EsbClient, which connects to our service discovery tool (Kubernetes) and any other external service dependencies (RabbitMQ for queue transport). Then we send a message by passing in a fully name-spaced message name (e.g. my.service.dosomething), the message itself, and any options (such as our FIRE_AND_FORGET flag, telling the client to send it over the queue). The client then takes care of serializing the message, wrapping it in an EsbMessage, logging, providing tracing identifiers, and sending it on to where it needs to go.
Finally, some other features not yet discussed are TTL options (message expiration and timeouts), message failure strategies (such as back-off-retry), and message encryption (encrypting the message payload over the wire). What’s great is that all these features are easily built into this one tool and we (virtually) never have to worry about them while developing the actual services for our system. This abstraction of logic is key for efficient development in an ever-growing system.
If you’ve made it this far, congratulations and many thanks for sticking with me! Having gone through the painful process of experiencing these challenges first-hand, I hope this article was helpful, or at least informative. Should you decide to move toward a microservice architecture or just want a better inter-service communication strategy, I highly recommend checking out these pre-built tools: go-micro, linkerd, and Istio. Also, if you have any questions/comments leave them below, I would love to hear your thoughts!