RabbitMQ Highly Available Queues within NServiceBus

To minimize the possibility of losing NServiceBus messages when a RabbitMQ transport node becomes unavailable, one solution is to use RabbitMQ’s Highly Available (HA) queues.  Support for RabbitMQ HA queues became available starting with NServiceBus 4.

The reason for redundant RabbitMQ nodes is stated in the RabbitMQ documentation:

If a RabbitMQ broker consists of a single node, then a failure of that node will cause downtime, temporary unavailability of service, and potentially loss of messages (especially non-persistent messages held by non-durable queues).

There are several approaches to address redundancy and high availability of RabbitMQ.  These include the use of durable queues, a cluster, and an active/passive pair of nodes.  Each approach has its benefits, but each also has trade-offs.  Again from the RabbitMQ documentation:

You could publish all messages persistent, to durable queues, but even then, due to buffering there is an amount of time between the message being sent and the message being written to disk and fsync’d. Using publisher confirms is one means to ensure the client understands which messages have been written to disk, but even so, you may not wish to suffer the downtime and inconvenience of the unavailability of service caused by a node failure, or the performance degradation of having to write every message to disk.

You could use a cluster of RabbitMQ nodes to construct your RabbitMQ broker. This will be resilient to the loss of individual nodes in terms of the overall availability of service, but some important caveats apply: whilst exchanges and bindings survive the loss of individual nodes, queues and their messages do not. This is because a queue and its contents reside on exactly one node, thus the loss of a node will render its queues unavailable.

You could use an active/passive pair of nodes such that should one node fail, the passive node will be able to come up and take over from the failed node. This can even be combined with clustering. Whilst this approach ensures that failures are quickly detected and recovered from, there can be reasons why the passive node can take a long time to start up, or potentially even fail to start. This can cause at best, temporary unavailability of queues which were located on the failed node.

To solve these various problems, RabbitMQ introduced active/active high availability for queues. This works by allowing queues to be mirrored on other nodes within a RabbitMQ cluster. The result is that should one node of a cluster fail, the queue can automatically switch to one of the mirrors and continue to operate, with no unavailability of service. This solution still requires a RabbitMQ cluster, which means that it will not cope seamlessly with network partitions within the cluster and, for that reason, is not recommended for use across a WAN (though of course, clients can still connect from as near and as far as needed).

Standing up a Cluster

The first step to setting up HA is to create a RabbitMQ cluster.  The cluster should have several independent nodes.  RabbitMQ has thoroughly detailed the step-by-step process of setting up cluster nodes in their documentation at: http://www.rabbitmq.com/clustering.html.

Configuring Cluster for High Availability

Once the RabbitMQ cluster is configured, the next step is to set up HA via a mirrored queue.  This allows one node in the cluster to be the master.  All actions other than publishes go only to the master, and the master then broadcasts the effect of the actions to the slaves. Thus clients consuming from a mirrored queue are in fact consuming from the master.  If the master fails, then one of the slaves is promoted.

HA can be configured from the command line with rabbitmqctl (for example from a PowerShell prompt), through the RabbitMQ HTTP API, or directly in the RabbitMQ management UI.  Step-by-step instructions for each can be found in the RabbitMQ documentation located at: http://www.rabbitmq.com/ha.html

Configuring NServiceBus RabbitMQ Transport for HA Cluster

NServiceBus 4 added support for RabbitMQ HA queues with this pull request: https://github.com/Particular/NServiceBus/pull/1118.  To make HA work, change your application’s NServiceBus/Transport connection string to use a comma-delimited list of the hosts in your HA cluster, similar to this:

<add name="NServiceBus/Transport" connectionString="host=rabbitNode1,rabbitNode2,rabbitNode3;username=myuser;password=password" />
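To make the role of the host list concrete, here is a minimal sketch of how a client could split the comma-delimited hosts out of such a connection string and fall through to the next host when one is unreachable.  This is an illustration only and is not the NServiceBus transport's actual implementation; HostList, Parse, FirstReachable, and the tryConnect callback are hypothetical names introduced for this example.

```csharp
using System;
using System.Collections.Generic;

static class HostList
{
    // Extracts the comma-delimited host names from a connection string such as
    // "host=a,b,c;username=u;password=p". Returns an empty list if there is no host key.
    public static List<string> Parse(string connectionString)
    {
        var hosts = new List<string>();
        foreach (string part in connectionString.Split(';'))
        {
            int eq = part.IndexOf('=');
            if (eq < 0) continue;

            string key = part.Substring(0, eq).Trim();
            if (!key.Equals("host", StringComparison.OrdinalIgnoreCase)) continue;

            foreach (string host in part.Substring(eq + 1).Split(','))
            {
                string trimmed = host.Trim();
                if (trimmed.Length > 0) hosts.Add(trimmed);
            }
        }
        return hosts;
    }

    // Returns the first host for which tryConnect succeeds, or null if none does.
    public static string FirstReachable(IEnumerable<string> hosts, Func<string, bool> tryConnect)
    {
        foreach (string host in hosts)
        {
            if (tryConnect(host)) return host;
        }
        return null;
    }
}

static class HostListDemo
{
    static void Main()
    {
        var hosts = HostList.Parse(
            "host=rabbitNode1,rabbitNode2,rabbitNode3;username=myuser;password=password");

        // Pretend rabbitNode1 is down; a real check would try to open a broker connection.
        string chosen = HostList.FirstReachable(hosts, h => h != "rabbitNode1");
        Console.WriteLine("Connecting to " + chosen);
    }
}
```

The point of the sketch is only that every node in the list is a candidate, so listing all cluster members gives the client somewhere to go when one of them is down.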

Handling Cluster Failover in Code

When a node fails, the publisher may encounter an OperationInterruptedException.  During this time, the cluster is promoting a slave node to master.  The transition may not be immediate.  The logic that publishes to RabbitMQ can catch the OperationInterruptedException and then retry the publish operation.

Here is a sample demonstrating a loop that makes a call to publish 1000 messages on the Bus as a test:

static void Main(string[] args)
{
	IBus bus = CreateBus();
	int i = 0;
	while (i < 1000)
	{
		if (PublishMyMessage(bus, i))
		{
			Console.WriteLine("Published message " + i);
			i++;  // advance only after a successful publish
		}
		else
			Console.WriteLine("Publish for message '" + i + "' previously failed; retrying...");
	}
}

Inside the loop, the if statement calls PublishMyMessage, a function that makes the actual bus Publish call:

static bool PublishMyMessage(IBus bus, int messageId)
{
	try
	{
		bus.Publish<ITestEvent>(testEvent =>
		{
			testEvent.messageId = messageId.ToString();
		});
		return true;
	}
	catch (RabbitMQ.Client.Exceptions.OperationInterruptedException)
	{
		// wait for cluster to failover to mirrored node
		Console.WriteLine("RabbitMQ node is down; waiting for failover node...");
		System.Threading.Thread.Sleep(5000);  // 5 seconds is more than enough time; adjust as necessary
		return false;
	}
}

The PublishMyMessage function attempts to publish the message on to the Bus.  If the master RabbitMQ node goes down, the publish will run into an OperationInterruptedException.  RabbitMQ will then promote one of the mirrored slave nodes to master.  NServiceBus will then use the comma-delimited hosts in your connection string to establish a connection to the new master node.  To test this out, you can manually bring down the master node in your cluster by running this rabbitmqctl command on that node:

rabbit1$ rabbitmqctl stop_app

PublishMyMessage in the sample above returns a Boolean.  If the publish succeeds, it returns true, and the loop increments its counter and continues.  If an OperationInterruptedException is caught, it returns false, the loop counter isn’t incremented, and the publish is tried again.
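The loop above retries indefinitely, which is fine for a test but may not be what you want if the cluster never recovers.  As a minimal sketch, the retry could be bounded with a small helper like the one below.  RetryHelper and TryWithRetry are hypothetical names introduced for this example, not part of NServiceBus or the RabbitMQ client, and the flaky delegate in Main merely stands in for PublishMyMessage.

```csharp
using System;
using System.Threading;

static class RetryHelper
{
    // Calls attempt() until it returns true or maxAttempts is reached.
    // Returns true if an attempt succeeded, false if every attempt failed.
    public static bool TryWithRetry(Func<bool> attempt, int maxAttempts, TimeSpan delay)
    {
        if (attempt == null) throw new ArgumentNullException("attempt");
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException("maxAttempts");

        for (int tries = 1; tries <= maxAttempts; tries++)
        {
            if (attempt())
                return true;

            // Pause before the next try, but not after the final failure.
            if (tries < maxAttempts)
                Thread.Sleep(delay);
        }
        return false;
    }
}

static class RetryDemo
{
    static void Main()
    {
        // Stand-in for PublishMyMessage: fails twice, then succeeds.
        int calls = 0;
        Func<bool> flakyPublish = () => ++calls >= 3;

        bool published = RetryHelper.TryWithRetry(flakyPublish, 5, TimeSpan.FromMilliseconds(10));
        Console.WriteLine(published
            ? "Published after " + calls + " attempts"
            : "Gave up after " + calls + " attempts");
    }
}
```

In the sample above, the call could be wrapped as RetryHelper.TryWithRetry(() => PublishMyMessage(bus, i), 10, TimeSpan.Zero), since PublishMyMessage already sleeps after a failure.  The limit of 10 attempts is an arbitrary value chosen for illustration; what to do when the helper returns false (log, alert, or abort) depends on your application.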
