Ramblings on pretty much anything technical.: Starving JMS consumers when not setting destination limits in ActiveMQ

My last post is fairly old, limited time is the reason.

This blog post could have two titles: the one above, or "Stuck messages in ActiveMQ, not getting dispatched to active consumer". Both can be symptoms of the same problem, which I will try to discuss in this post.

In my day-to-day job I analyse many ActiveMQ-related problems and have seen a number of weird, unusual behaviours that are often caused by users not knowing the side effects of their ActiveMQ tuning activities. Still, the following discovery surprised me as well.

When you download and extract one of the later (including the latest) versions of Apache ActiveMQ or JBoss A-MQ and look at the out-of-the-box configuration, you will notice that the <destinationPolicy> section of the broker configuration does not configure any destination limits anymore. Older versions did configure these limits out of the box.

I always was a supporter of this configuration. Why would you restrict every queue cursor to a particular memory limit if most of the time your destinations have small backlogs? If a backlog accumulates on a particular queue, it's better to use the broker's full <memoryUsage> to cache messages in memory, irrespective of the destination, in order to dispatch them quickly when needed. This also makes better use of the broker's <memoryUsage>: queues on which a backlog builds up can use the broker's memory, while queues that have no backlog obviously don't need the memory at the moment. If the backlog grows too high, or if backlogs build up on too many queues, the broker will enforce the overall <memoryUsage> limit across all destinations. So from this point of view, setting no destination limits makes perfect sense.

However, we recently discovered a not-so-exotic use case where not setting a destination limit caused problems. Here are the details:

The Problem
We initially reproduced the problem in a test case that may not closely mirror a real production environment. However, this test case makes it easier to explain the situation. In that test we used only two JMS queues. The broker configuration did not set any destination limits, and it does not matter how high the <memoryUsage> limit is set. The higher the limit, the more messages are needed in the test, but the problem can be reproduced with any <memoryUsage> limit. We used KahaDB as the persistence store.

The broker was started with a few messages on the first destination, let's say queue A, stored in KahaDB. As this queue had no consumers attached upon broker start, the messages remained in the store only and did not get loaded into the cursor cache. Note that messages only get loaded from the store into memory when there is demand, i.e. when there is an active consumer.

Now a producer connected to the second queue B and pushed messages until 70% of the broker's <memoryUsage> limit was used by queue B. Remember, no destination limits were set, so each destination can use up to the broker's full <memoryUsage> limit. However, the StoreQueueCursor used for a JMS queue stops caching more messages in memory once it reaches 70% (the threshold is configurable via cursorMemoryHighWaterMark). Any additional messages received from a producer are written to the store only and are not accepted by the cursor. When it's time to dispatch these additional messages (i.e. once the cache runs empty again), they will be loaded from the KahaDB store.

So we had a few messages on queue A that were not loaded into memory but resided only in KahaDB, and we had a few tens of thousands of messages on queue B that were all loaded into memory and made the cursor for queue B use 70% of the configured <memoryUsage> limit. Since queue B did not configure any destination limit, it inherited the broker's <memoryUsage> limit and had therefore used 70% of that limit.

However the same applied to all other JMS queues. They also did not set any destination limits and hence also inherited the <memoryUsage> limit of the broker, which was utilized to 70% already (due to queue B).

Since there was no consumer connected to queue B, messages would not get removed from the queue and the broker's <memoryUsage> would not decline.

Next a consumer connected to queue A, ready to receive messages. The cursor for queue A would typically go to the store now and load maxPageSize (200 by default) messages into memory in one batch. But it could not do so this time, because 70% of the broker's <memoryUsage> limit was already in use. Again, remember that 70% is the tipping point at which a cursor stops accepting or loading more messages into its cache. The cursor's own MemoryPercentUsage JMX attribute for queue A was 0% (it had not loaded any messages into memory yet), but the broker's MemoryPercentUsage was already at 70%. The latter condition is enough to prevent the cursor for queue A from loading any more messages into memory. The broker needs to protect itself against running out of memory and needs to enforce its <memoryUsage> limit. That's why the cursor loads a full maxPageSize (again, 200 by default) batch of messages if the MemoryPercentUsage is below 70% but stops loading any messages into memory once the 70% limit is reached.

The result is an active consumer on queue A that does not receive any messages although there is a backlog of messages sitting in the persistence store. Unless a consumer drains off some messages on queue B and hence reduces the broker's MemoryPercentUsage below 70%, the cursor for queue A will not be able to load any messages from the store. The consumer for queue A gets starved as a result.
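To make the sequence of events easier to follow, here is a small toy model of it in plain Java. It is only a sketch of the behaviour as described in this post, not ActiveMQ's actual cursor code: the class names (CursorStarvationModel, BrokerMemory, ToyQueue), the 1,000,000 byte limit and the 1,000 byte message size are all made up for illustration. Only the 70% mark and the page size of 200 are taken from the text above.

```java
/** Toy model of the behaviour described in this post. NOT ActiveMQ's real cursor code. */
class CursorStarvationModel {
    // Defaults quoted in the post: cursorMemoryHighWaterMark and maxPageSize.
    static final int HIGH_WATER_MARK_PERCENT = 70;
    static final int MAX_PAGE_SIZE = 200;

    /** Stands in for the broker-wide <memoryUsage> accounting (hypothetical type). */
    static final class BrokerMemory {
        final long limitBytes;
        long usedBytes;

        BrokerMemory(long limitBytes) {
            this.limitBytes = limitBytes;
        }

        int percentUsage() {
            return (int) (usedBytes * 100 / limitBytes);
        }
    }

    /** A queue without a destination limit: only the broker-wide usage bounds it. */
    static final class ToyQueue {
        final BrokerMemory broker;
        final long messageBytes;
        int storeOnly; // persisted, but not in the cursor cache
        int cached;    // held in the cursor cache

        ToyQueue(BrokerMemory broker, long messageBytes, int storedAtStartup) {
            this.broker = broker;
            this.messageBytes = messageBytes;
            this.storeOnly = storedAtStartup;
        }

        /** Producer send: cached while below the high-water mark, otherwise store only. */
        void send() {
            if (broker.percentUsage() < HIGH_WATER_MARK_PERCENT) {
                cached++;
                broker.usedBytes += messageBytes;
            } else {
                storeOnly++;
            }
        }

        /** Consumer demand: load one page from the store, or nothing at or above 70%. */
        int pageIn() {
            if (broker.percentUsage() >= HIGH_WATER_MARK_PERCENT) {
                return 0;
            }
            int loaded = Math.min(MAX_PAGE_SIZE, storeOnly);
            storeOnly -= loaded;
            cached += loaded;
            broker.usedBytes += loaded * messageBytes;
            return loaded;
        }

        /** Consume up to n cached messages and release their memory. */
        int consume(int n) {
            int consumed = Math.min(n, cached);
            cached -= consumed;
            broker.usedBytes -= consumed * messageBytes;
            return consumed;
        }
    }

    public static void main(String[] args) {
        BrokerMemory broker = new BrokerMemory(1_000_000L); // arbitrary limit
        ToyQueue a = new ToyQueue(broker, 1_000L, 500);     // backlog in store, nothing cached
        ToyQueue b = new ToyQueue(broker, 1_000L, 0);
        for (int i = 0; i < 1_000; i++) {
            b.send();                                       // B fills the cache up to 70%
        }
        System.out.println("broker usage %: " + broker.percentUsage());
        System.out.println("A loaded while B holds the memory: " + a.pageIn());
        b.consume(100);                                     // drain some of B
        System.out.println("A loaded after draining B: " + a.pageIn());
    }
}
```

In this model queue A's first pageIn() loads nothing because the shared usage already sits at the high-water mark, and it only gets a page of messages once queue B has been drained a little.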

A few consequences:

If there are multiple consumers on queue A, they will all get starved.

If there are other destinations with no messages loaded into memory but messages in the store and active consumers connected, they get starved as well.

You don't necessarily need one destination with no consumers that uses 70% of the broker's <memoryUsage> limit. Multiple destinations that have no consumers but larger backlogs, which together add up to 70% of the broker's <memoryUsage> limit, reproduce the same behaviour.

How to identify this problem?
The following conditions should all be met when you run into this problem:

Do you detect consumer(s) that receive no messages despite a queue back log?

Does the destination to which the consumer(s) are connected show a MemoryPercentUsage of 0%?

Look at the broker's overall MemoryPercentUsage. Is it 70% or higher?

Then drill into the JMX MemoryPercentUsage value of the various destinations and check for destinations that use a substantial portion of this 70% and that have no consumers attached.

If you find all of these conditions then you may have hit this problem.
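The checklist above can be sketched as a small helper that works on a snapshot of the relevant numbers. This is only an illustration under a few assumptions: the StarvationCheck and QueueStats names are hypothetical (they are not ActiveMQ types), the values are expected to have been read from JMX by some other means, and the fixed 70% assumes the default cursorMemoryHighWaterMark.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the checklist above, applied to values obtained from JMX elsewhere. */
class StarvationCheck {
    // Assumes the default cursorMemoryHighWaterMark mentioned in the post.
    static final int HIGH_WATER_MARK_PERCENT = 70;

    /** Snapshot of one queue; an illustrative holder, not an ActiveMQ class. */
    static final class QueueStats {
        final String name;
        final long queueSize;
        final int consumerCount;
        final int memoryPercentUsage;

        QueueStats(String name, long queueSize, int consumerCount, int memoryPercentUsage) {
            this.name = name;
            this.queueSize = queueSize;
            this.consumerCount = consumerCount;
            this.memoryPercentUsage = memoryPercentUsage;
        }
    }

    /** Queues with consumers and a backlog but nothing cached, while the broker is at 70% or more. */
    static List<String> starvedQueues(int brokerMemoryPercentUsage, List<QueueStats> queues) {
        List<String> starved = new ArrayList<>();
        if (brokerMemoryPercentUsage < HIGH_WATER_MARK_PERCENT) {
            return starved; // the broker-wide condition is not met
        }
        for (QueueStats q : queues) {
            if (q.consumerCount > 0 && q.queueSize > 0 && q.memoryPercentUsage == 0) {
                starved.add(q.name);
            }
        }
        return starved;
    }

    /** Queues without consumers that hold at least minPercent of the memory. */
    static List<String> memoryHolders(List<QueueStats> queues, int minPercent) {
        List<String> holders = new ArrayList<>();
        for (QueueStats q : queues) {
            if (q.consumerCount == 0 && q.memoryPercentUsage >= minPercent) {
                holders.add(q.name);
            }
        }
        return holders;
    }

    public static void main(String[] args) {
        List<QueueStats> queues = new ArrayList<>();
        queues.add(new QueueStats("A", 500, 1, 0));     // consumer attached, nothing cached
        queues.add(new QueueStats("B", 30_000, 0, 70)); // no consumer, holds the memory
        System.out.println("starved: " + starvedQueues(70, queues));
        System.out.println("holding memory: " + memoryHolders(queues, 10));
    }
}
```

The minPercent argument is a judgment call about what counts as a "substantial portion"; there is no fixed value for it in this post.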

How to resolve this situation?
On a running broker you can either connect one or more consumers to queue B and start consuming messages or, if you can afford it from a business perspective, purge queue B. Either should bring the broker's <memoryUsage> below 70% and allow the cursors of other destinations to load messages from the store into their cursor cache.

Restarting the broker would also help, as after the restart messages only get loaded from the store if there are consumers connected. The many messages of queue B won't be loaded unless there is a consumer connected, and even then the cursor loads only maxPageSize messages in one go (200, as you surely learned by now). The broker's <memoryUsage> should remain well below 70% in this case.

Configuring destination limits would typically also work around the problem. If you know that certain destinations may not have any consumers for a while, then consider explicitly configuring sensible memory limits for these destinations so they cannot take the broker's entire <memoryUsage>.
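How large should such limits be? One conservative way to reason about it, and this is my own back-of-the-envelope reading of the behaviour described above rather than any official sizing formula, is to keep the sum of the limits for the backlog-prone destinations below the 70% tipping point of the broker's limit. The sketch below only does that arithmetic; the DestinationLimitBudget class, the queue names and the sizes are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Back-of-the-envelope budget check; a rule of thumb, not an ActiveMQ API. */
class DestinationLimitBudget {
    // Assumes the default cursorMemoryHighWaterMark mentioned in the post.
    static final int HIGH_WATER_MARK_PERCENT = 70;

    /** Broker memory that may be in use before cursors stop loading messages. */
    static long thresholdBytes(long brokerLimitBytes) {
        return brokerLimitBytes * HIGH_WATER_MARK_PERCENT / 100;
    }

    /** True if the planned limits, summed up, stay below that threshold. */
    static boolean leavesHeadroom(long brokerLimitBytes, Map<String, Long> plannedLimits) {
        long total = 0;
        for (long limit : plannedLimits.values()) {
            total += limit;
        }
        return total < thresholdBytes(brokerLimitBytes);
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024;
        long brokerLimit = 1024 * mib;          // example broker <memoryUsage> limit
        Map<String, Long> planned = new LinkedHashMap<>();
        planned.put("queue.B", 256 * mib);      // hypothetical backlog-prone queues
        planned.put("queue.C", 256 * mib);
        System.out.println("threshold bytes: " + thresholdBytes(brokerLimit));
        System.out.println("leaves headroom: " + leavesHeadroom(brokerLimit, planned));
    }
}
```

The resulting numbers would then go into the destination limits of the broker configuration for those specific queues, leaving all other destinations without a limit as before.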

I raised this problem in ENTMQ-1543. However no fix was made as fixing turned out to be very difficult.

Still more?
Yes. With that much background, we can now come to the second symptom of this problem. Above I talked about one or more destinations with large backlogs and no consumers starving the consumers of other destinations.

Taking this further, perhaps queue B does have a consumer connected, but the consumer is much slower than the rate at which messages get produced. Perhaps it takes a few seconds to process each message (not unheard of for certain use cases). Now imagine we have multiple destinations like queue B: slow consumers, large backlog of messages.

These destinations together could use 70% of the broker's <memoryUsage>. Now think for a second about what happens to other destinations that have fast consumers and (almost) no backlog. These destinations could see high throughput in general. But because the destinations with slow consumers and large backlogs together constantly reach 70% of the broker's <memoryUsage> limit, any new messages sent to other destinations with fast consumers and no backlog would not get loaded into the cursor cache of those destinations. It's the same problem as above. So these fast consumers don't receive messages until the slow consumers of other destinations have consumed some messages and reduced the broker's <memoryUsage> below 70%. In essence, these fast consumers do not get starved completely, but they get slowed down pretty much to the speed of the slow consumers on other destinations.

I produced a unit test that shows this problem in action. If you are interested, check out the code from demo StuckConsumerActiveMQ and follow the instructions in the test's README.md.

Again the solution to this problem is to set destination limits for at least the destinations that have slow consumers and a large message backlog.

Conclusion
And this would be my general advice and the biggest takeaway from this post: if you know you will have destinations with large message backlogs building up for more than just short peak times, then consider configuring destination limits for these queues in order to avoid the problems discussed here.

A very good background read on message cursors in ActiveMQ is this blog post from my colleague here at Red Hat, Christian Posta: ActiveMQ: Understanding Memory Cursors

13 Apr 2016
