I’ve set up a RabbitMQ cluster (well, just one node for now) and put an AWS Elastic Load Balancer in front of it to handle traffic in case I need to add more nodes and deal with increased traffic. Unfortunately, ELB can’t really handle serving RabbitMQ traffic without causing all sorts of strange issues.
My specific issue is that I was using the ELB endpoint as the connection, but I would often get errors such as “Could not establish TCP connection to any of the configured hosts” and “Empty response received from the server”. I tried a number of different things such as changing the timeout settings on the load balancer, messing with the connection strings and various configurations, but nothing worked. Enabling logging for the load balancer also proved fruitless.
Digging around, I found a number of different posts about people having issues with this. It seems to have to do with how ELB deals with checking if connections are alive and heartbeats and the like. So after struggling with this for some time, I went a totally different direction: to use HAProxy instead of the elastic load balancer.
So I spun up an instance of HAProxy, configured it a bit, and everything works. Really well, actually; I’ve not had any issues (knocks on wood) since I made the change. The biggest thing to note was in the haproxy config:
listen rabbitmq 0.0.0.0:5672
mode tcp
log global
retries 3
timeout client 3h
timeout server 3h
option clitcpka
option tcplog
option persist
balance leastconn
server rabbitmq1 10.1.2.3:5672 check
That is, to have the timeout on HAProxy be longer than the default tcp_keepalive_time on the server.
if you have multi AZs, HAP solution won’t work. Cuz for each AZ you need an HAP and to balance load to them you again need an ELB.
It’s not the fastest thing in the world but you can create two HAProxy nodes in two AZ and use route53 health checks to do DNS failover. This is essentially what ELB does for their fail overs but a bit slower.