Dynamic RabbitMQ cluster in Docker Swarm Mode

Posted in category docker on 2017-05-22

You can find a concrete implementation in the GitHub repository - rabbitmq (autocluster branch).

Build docker image

There is an excellent article on the RabbitMQ web site that describes different clustering options for different scenarios. This post goes through setting up the autocluster plugin, which can automatically discover new nodes as well as clean up dead nodes of a RabbitMQ cluster.

For those who are interested in what a Dockerfile for building a RabbitMQ image with the autocluster plugin enabled can look like, here it is (make sure to install the plugin files from the official repository):
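A minimal sketch of such a Dockerfile, assuming the plugin's `.ez` files have already been downloaded from the plugin's GitHub releases page and placed next to the Dockerfile (exact file names and the plugins directory path depend on the plugin and base image versions):

```dockerfile
# Base image version is an assumption; pick the one matching your plugin build
FROM rabbitmq:3.6-management

# Copy the downloaded plugin archives into the server's plugins directory
# (the path below may differ between base image versions)
COPY autocluster-*.ez rabbitmq_aws-*.ez /usr/lib/rabbitmq/plugins/

# Enable the plugin at build time, without a running node
RUN rabbitmq-plugins enable --offline autocluster
```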

Build it like so: docker build -t rabbitmq:autocluster .

Create service

Our services will be operating in testnet overlay network, so let’s create one:
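A sketch of the network creation command (the network name `testnet` comes from the text above):

```shell
# Create an overlay network that swarm services can attach to
docker network create --driver overlay testnet
```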

In order for the autocluster plugin to work we need to choose one of the supported discovery backends. Consul seems to be a good choice. So, here is how to start a consul service in Docker:
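One way to sketch this, assuming the official `consul` image and a single-node dev-mode agent (fine for a demo, not for production):

```shell
# Run a single Consul agent in dev mode on the same overlay network;
# "consul" becomes the DNS name the RabbitMQ nodes will use to reach it
docker service create --name consul --network testnet \
  consul agent -dev -client 0.0.0.0
```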

Let's now create a new rabbit service that uses the auto-clustering feature with Consul as its discovery mechanism (which is reachable under the consul service name, remember?):
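A sketch of the service definition, using the image built above; the environment variables are the autocluster plugin's Consul settings, and the Erlang cookie value here is a placeholder you should replace (all nodes must share the same cookie to cluster):

```shell
docker service create --name rabbit --network testnet \
  -e AUTOCLUSTER_TYPE=consul \
  -e CONSUL_HOST=consul \
  -e CONSUL_PORT=8500 \
  -e AUTOCLUSTER_CLEANUP=true \
  -e CLEANUP_WARN_ONLY=false \
  -e RABBITMQ_ERLANG_COOKIE=changeme \
  rabbitmq:autocluster
```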

Note that the AUTOCLUSTER_CLEANUP flag only takes effect when paired with the CLEANUP_WARN_ONLY flag set to false. This safeguard exists to make sure you understand what you are doing, as removing nodes from a cluster is considered a dangerous operation.

It is important to start a single instance of RabbitMQ first and let it talk to Consul and get itself properly registered.

After the first instance is up and running you can scale the service to as many replicas as you wish:
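For example, assuming the service is named `rabbit` as above:

```shell
# Scale the rabbit service to three replicas; each new container
# discovers the cluster through Consul and joins it automatically
docker service scale rabbit=3
```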

Now, if you navigate to the management console (on port 15672 by default), you should be able to see a similar picture:

You can scale it up and down as you wish - new nodes will show up, and old nodes will be removed (with a slight delay).


Despite the fact that this approach is great (and the RabbitMQ team is officially adopting this plugin and will make it part of the product starting from v3.7.0), there are some downsides (though this is open for discussion):

  1. It is not going to be possible (or at least not easy) to have persistent data volumes for such a dynamic setup. This makes it fragile in cases when messages have to be recoverable after disaster scenarios like network partitions, etc.
  2. When scaling down, Consul will be left with a bunch of failed records that will never go away without manual intervention.