Technical Overview
This technical overview gives a few hints on FFMQ behavior and the internal optimizations used to achieve its current performance.
Foreword
After spending quite some time writing (and rewriting) this JMS queuer, optimizing objects allocation, objects copy and lock contention,
I managed to create a queuer implementation that is almost purely I/O bound like other good implementations.
There are a few aspects that can be optimized to achieve maximum performance and horizontal
scalability (like fine-grained synchronization), but the bottom line is that speed differences between
queuers is directly linked to the way they handle disk and network I/O operations.
Message Serialization
Sending a message instance over a network or writing it to a persistent storage requires the instance to be 'serialized' to binary form.
Java serialization comes to mind immediately, but though it is versatile and automated (reflection based),
it is far from optimal performance wise.
Thus, using explicit serialization is a first step toward better performance :
java.io.DataInputStream/DataOutputStream implementations would be nice candidates
for this purpose but they are not fully optimal either because of several limitations :
- These implementations are supposed to handle streams and thus cannot use every possible optimization when dealing with fixed-length byte arrays. As messages exchanged through JMS queuers are typically small to medium and can fit into memory, using a stream based API is an unecessary overhead.
- JDK 'modified' UTF-8 implementation is not really optimal because it needs backward compatibility and/or compatibility with external binary streams. This also leaves room for optimization.
FFMQ uses a custom serialization mechanism handling serialization/unserialization in a more straightforward way.
Network I/O
Even when using (memory-stored) non-persistent messages, there is a non-negligible latency induced by the network layer.
Under typical load, the network traffic of a queuer is essentially composed of messages and acknowledgements of these messages.
To improve performance the objective is simple : reduce the number of round-trips between client and server.
Because in most critical environments JMS is used in a transacted way, client to server round-trips need to be synchronous (think of a commit operation for example).
That means the actual queuer throughput will be directly limited by its commit latency because a client won't send/receive more messages before the previous ones
are properly commited.
A somewhat basic remote consumer implementation would be an 'active' implementation or pull architecture.
In such an implementation, the consumer is asking the server if there is something for him and, if not,
sleeps until a server notification arrives then asks the server if there is something for him, and so on.
This kind of implementation is correct but is wasting a lot of network round-trips.
A better implementation, as used by all the fastest queuers, is a 'passive' or push architecture : the server
decides to push messages to consumers automatically (even before the consumers actually try to receive them !).
It may even push several messages ahead to maximize bandwidth (this is usually called prefetching).
This approach has two immediate benefits :
- Less round-trips because client is passively waiting for messages to arrive.
- Parallelism : more messages are actually received in the background while the client application is processing the current one.
Concerning producers, some buffering can help saving round-trips by batching messages together or batching messages with acknowledgement together. For example, imagine the following typical scenario : producer sends a message and commits in a loop. Instead of sending the message then the commit we may group the two operations as one, thus saving one round-trip per message.
FFMQ puts these concepts into application in its custom TCP-based protocol to reduce the number of packets sent back and forth between clients and server.
Disk I/O
For persistent messages, hard-disk I/O operations are a big bottleneck. Not because of the amount of data
moved around (nowadays you have read/write buffers in your application plus in the operating system plus in the hardware)
but because of the low-level hard-disk flush/synchronization required by transacted sessions.
There are various possibilities to handle persistent storage.
- The JDBC approach : some people already spent a lot of time solving these transaction problems for databases,
why not use their hard work directly ?
In practice, JDBC and SQL introduce some overhead. Added to the fact that a database is supposed to perform well in a large amount of different scenarios and not only for the brutal 'write then delete that blob over and over again' queuer scenario, a JDBC based storage only performs poorly. - The custom storage approach : if classic databases have too much overhead for what we want to do, let's write a custom one !
Most queuer implementations use a custom storage implementation tailored to there needs. They also usually use the well studied and proven transaction log approach (changes are written to a transaction log first, then merged with the actual database, allowing for a shorter write path and recovery in case of a failure).
FFMQ uses that transaction log paradigm because that's still the best algorithm around for slow syncing devices like harddisks. FFMQ implementation is fully asynchronous, with support for 'mergeable' commit barriers, allowing concurrent access to the store to be handled more efficiently.
When designing the FFMQ storage engine, I had the following constraints in mind :
- How do I get (near) constant read/write time for messages if the messages have different sizes and may be consumed out of order because of priority or message selectors ?
- How do i make sure the store could be recovered if some write operation is lost ?
To answer this, I merged two concepts : a doubly-linked list (provides order and constant removal time) and a journaling-filesystem-like store (block-based storage with an allocation table).
It was optimal for most operations I needed, but had the same usual limitations as a filesystem : it has a
fixed pre-allocated size (though FFMQ 3.x+ now support auto-extend at runtime), it tends to waste space (stored messages are padded to use an even number of blocks) and it may suffer from fragmentation.
The good news are that those problems only have a limited impact on real world scenarios, making the whole solution pretty efficient.
But what about disk sync and safety ?
- A quick benchmark shows that FileChannel.force(false); is usually the fastest syncing mechanism provided by the JVM. (Also shows that RandomAccessFile 'rwd' mode is buggy under windows !)
- Concerning the write failure safety of the store : this is fully managed by the transaction log approach, but using a doubly-linked structure has an additional benefit : you can safely modify one direction of the message chain because even if the operation is interrupted you know the other direction is still correct and can be used to recover a corrupted store file.