One of the easily confused concepts of messaging systems is the distinction between a bus and a broker. Some of the explanations you find on the web only confuse the matter, and the current trend of vendors marketing broker products as ESBs only muddies the waters even more.
A message broker is a centralized message router. This allows the origin of a message to be decoupled from its destination: a message is published to the broker, which then routes it to the appropriate destination(s). The publisher of an event has no knowledge of the subscribers of the event (not always a good thing; we’ll cover that later). As you may expect, the usual concerns associated with centralized architectures apply (single point of failure, difficulty scaling out, etc.). These concerns can be addressed to some degree using standard scaling techniques for centralized architectures (e.g. clustering, sharding).
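The routing role described above can be sketched in a few lines of Python. This is a toy in-process model, not RabbitMQ’s actual API – every name here is invented for illustration:

```python
# A toy "broker": publishers and subscribers know only the broker,
# never each other. The broker owns the routing table.

class Broker:
    def __init__(self):
        self._subscribers = {}  # topic -> list of handler callables

    def subscribe(self, topic, handler):
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic, message):
        # The broker routes to all destinations; the publisher has no
        # knowledge of who (if anyone) receives the message.
        for handler in self._subscribers.get(topic, []):
            handler(message)

broker = Broker()
received = []
broker.subscribe("order.placed", received.append)
broker.publish("order.placed", {"id": 1})
```

Note the publisher names only a topic and the broker; adding or removing subscribers never touches publishing code – which is exactly the decoupling (and the central point of failure) a broker gives you.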
A message bus is a decentralized architecture. As such, it does not suffer the scalability and point-of-failure problems you face with a broker. In a well designed system, this allows services to operate in a truly autonomous manner (a goal anyone building reliable distributed systems should strive for). Since there is no message router, publishers require prior knowledge of the destination(s) of their messages. This also allows finer control over who can subscribe to each message, and which publisher(s) are the source of truth – not that big a deal in 90% of situations.

The biggest advantage I see in using a bus over a broker is the decoupling of applications: it is less work to scale out with a bus. In a bus environment you should be able to ‘spin up’ services with little to no effort while maintaining the tenets you strive for in a SOA. This means being able to scale out without administrative hassles such as (re)configuring clusters and, importantly, without unnecessary resource overhead that impacts performance. Remember the fallacies of distributed computing? Clustering introduces overheads (communication, I/O, CPU…), and if you’re in a position where scaling out is a concern, you’re also in a position where minimizing resource utilization is a priority. In my experience, a bus also allows you to build a solution without concerns of your infrastructure leaking into your code. This allows for a more supple design, some of which we will cover in my next post.
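The contrast with the broker model can be sketched the same way. Again a toy in-process model with invented names – the point is that the publisher, not a central router, holds the routing knowledge:

```python
# A toy "bus"-style endpoint: the publisher is configured up front with
# the destinations of each message type -- there is no central router.

class BusEndpoint:
    def __init__(self, routes):
        # routes: message type -> list of destination queues
        # (plain lists stand in for durable local queues)
        self._routes = routes

    def publish(self, message_type, message):
        # No broker in the middle: the endpoint writes directly to the
        # queues it was configured with at deployment time.
        for queue in self._routes.get(message_type, []):
            queue.append(message)

billing_queue, shipping_queue = [], []
endpoint = BusEndpoint({"order.placed": [billing_queue, shipping_queue]})
endpoint.publish("order.placed", {"id": 1})
```

There is no single routing node to fail or cluster; the trade-off is that the routing table lives in each publisher’s configuration, which is the “prior knowledge of the destination(s)” mentioned above.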
As with all things, the end result is only as good as the effort put into it. It’s very easy to design a bus architecture that won’t scale or operate autonomously if you don’t set these goals at the beginning. Likewise, you can build a brokered architecture that scales to your needs. A bus just makes it easier to scale beyond that.
Disclaimer: This post is based on my understanding of RabbitMQ as part of my current discovery process, so take it with a grain of salt; any mistruths or noteworthy additions will be fixed or appended as they come to light.
Out of the box, RabbitMQ offers three types of multi-node configuration: clustering (with or without high availability), federation and shovelling.
Clustering (active/active queues):
RabbitMQ documentation explains it better than I ever could:
All data/state required for the operation of a RabbitMQ broker is replicated across all nodes, for reliability and scaling, with full ACID properties. An exception to this are message queues, which by default reside on the node that created them, though they are visible and reachable from all nodes. 1
High Availability Clustering (active/passive queues):
High availability (HA) clustering is the same as normal clustering, except that queues are mirrored across all the nodes in the cluster. One node is elected as the master; all other nodes are slaves. When the master fails, the next oldest slave is elected as master. Any messages not synchronized between the master and slaves are lost. 2
One of the pitfalls of clustering is that when a node disconnects abruptly from the cluster (i.e. loss of network, server crash, etc.) it will not gracefully reconnect; a manual process is required to rejoin the node to the cluster. Clustering also requires a highly available network (i.e. a LAN).
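For reference, the manual (re)join process is done with rabbitmqctl, along these lines. This is a config fragment, not a script to run blindly – the node names and the `ha-all` policy name/pattern are examples, and the exact commands should be checked against your RabbitMQ version:

```shell
# (Re)joining a node to an existing cluster -- run on the joining node.
rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@node1
rabbitmqctl start_app

# HA (mirrored) queues are enabled via a policy; here every queue whose
# name starts with "ha." is mirrored across all nodes in the cluster.
rabbitmqctl set_policy ha-all "^ha\." '{"ha-mode":"all"}'
```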
Federation allows you to run disparate brokers in a non-clustered fashion, but still propagate messages between exchanges. Federation will copy messages from one exchange to one or many others (entirely configurable). I can see uses for federation in sharded architectures. 3
Shovelling, while similar in concept to federation, differs in that rather than copying a message from one exchange to another, it acts like a pipe and moves the messages instead.
A great comparison of the different configurations can be found here. Since RabbitMQ is a broker and not a bus, it comes with the inherent architectural attributes of a centralized message router.
One of the primary differences between an NServiceBus/MassTransit implementation using MSMQ and one using RabbitMQ is that with MSMQ in a bus architecture, clustering is only required for the distributors in a high availability scenario. It also means subscribers can operate in a truly autonomous fashion, without the message loss during failure you might experience with a RabbitMQ cluster (unless, of course, you have an unrecoverable HDD failure – in which case messages will be lost). I’ll address the second point first.
In a RabbitMQ HA clustered environment, if the master drops from the cluster, we know that any unsynchronized messages will be lost (my understanding is that this holds regardless of where the messages existed, master or slave: unsynchronized == gone). In my testing (granted, not extensive) during a simulated network failure, I saw messages being “double processed” after the network split and both nodes assumed the role of master, with both clients on those machines inserting their copy of the same messages into the shared database when the network was reconnected. This resulted in 1009 messages inserted, when only 1000 were sent. I would like to retest this to confirm the behavior. It means it is critical that message handling is idempotent when using high availability clustering – an ideal we should strive for anyway, but not always an easy thing to achieve…
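One common way to get idempotent handling is to track processed message ids alongside the business data itself. A minimal sketch, with the message shape and in-memory store invented for illustration:

```python
# A minimal idempotent consumer: a redelivered (or double-delivered)
# message is detected by its id and skipped, so "process twice" has the
# same effect as "process once". The in-memory set stands in for a
# dedup table kept in the same database as the business data.

processed_ids = set()
rows = []  # stands in for the shared database table

def handle(message):
    if message["id"] in processed_ids:
        return  # duplicate delivery after a network split -- ignore
    # In a real system the insert and the dedup record should be
    # written in the same transaction.
    rows.append(message["body"])
    processed_ids.add(message["id"])

handle({"id": 7, "body": "payment"})
handle({"id": 7, "body": "payment"})  # redelivered after partition heals
```

With dedup like this in place, the “1009 inserts for 1000 messages” scenario above collapses back to 1000 – assuming every message carries a stable, unique id, which is the hard part in practice.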
In the same scenario, with a bus (à la NServiceBus/MSMQ) implementation using a clustered distributor – thanks to local queues, transactional (durable) messaging, and store-and-forward support – the message will only exist in either the distributor’s queue or the consumer’s local queue (provided you didn’t do something stupid like make the consumer read from a remote queue). The same goes for RabbitMQ (again, don’t read messages from a remote machine). In the event of a network failure, a consumer can operate without connectivity to the distributor; all work can be processed locally. Both MSMQ and RabbitMQ will allow consumers to operate autonomously in this configuration.
Where the concepts begin to diverge, however, is in the recovery stage. With MSMQ, the message will be placed in the local outbound queue while waiting for the connection to be remade (in my experience this can take some time – restarting the MSMQ instance can expedite the process). When MSMQ is able to reconnect to any of the recipients of the messages in the outbound queues, the messages will be dispatched automatically*. This means that in the event of a catastrophic failure (e.g. complete network or power failure) there is no issue of ordering when machines rediscover one another. RabbitMQ, on the other hand, requires a more manual, thought-out process when rejoining any broker to the cluster.4
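The store-and-forward behaviour described above can be sketched as a local outbound queue that is drained whenever connectivity returns. A toy model, not MSMQ’s actual machinery – all names are invented:

```python
# Toy store-and-forward: sends are written to a local outbound queue
# first; a dispatcher drains it when the remote side is reachable, so a
# network failure never loses a message -- it only delays it.
from collections import deque

class StoreAndForwardSender:
    def __init__(self, transmit):
        self._outbound = deque()   # stands in for the durable local queue
        self._transmit = transmit  # callable that raises on network failure

    def send(self, message):
        self._outbound.append(message)  # always succeeds locally
        self.dispatch()

    def dispatch(self):
        # Called on send, and again when connectivity is restored.
        while self._outbound:
            try:
                self._transmit(self._outbound[0])
            except ConnectionError:
                return  # leave the message queued; retry later
            self._outbound.popleft()

delivered = []
link_up = [False]

def transmit(msg):
    if not link_up[0]:
        raise ConnectionError
    delivered.append(msg)

sender = StoreAndForwardSender(transmit)
sender.send("invoice.created")  # network down: message stays queued
link_up[0] = True
sender.dispatch()               # connection remade: queue drains
```

Because the queue is drained in order and nothing is removed until transmission succeeds, recovery after an outage needs no manual intervention – which is the contrast with rejoining a RabbitMQ node to its cluster.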
Because the distributor is not distributed across the consumers’ machines (like you would want in an HA RabbitMQ broker architecture), when part of the distributor cluster goes down, all consumers still pull from the true master – unlike RabbitMQ, where the perceived ‘master’ depends on where you are positioned. As an added advantage, there is no additional configuration required to join a cluster as you scale out consumers (admittedly a trivial task with Rabbit, but a task nonetheless – script it!), as the only part of your architecture that needs to be clustered in a high availability scenario is the distributors. So, in the imaginary super-geeky scenario where you get hammered with load all of a sudden and need to re-purpose every machine in the building to cope with it, you can do so with relative ease… so long as your network policies are already pre-configured for MSDTC.
* If an exception occurs – for example, due to connectivity problems – the message will be “retried” n times and then placed in the error queue. However, messages will not be lost or duplicated.