Life of a Message

To show how the MapR Streams concepts fit together, here is an example of the flow of one message from a producer to a consumer.

The Setup

Suppose that you are using MapR Streams as part of a system to monitor traffic in San Francisco. Your producers are sensors in streets, freeways, bridges, overpasses, and other infrastructure, as well as sensors reporting the weather in many different locations. Your consumers are various analytical and reporting tools.

In a volume in a MapR cluster, you create the stream /somepath/traffic_monitoring. In that stream, you create the topics traffic, infrastructure, and weather_conditions.
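
As a small illustration, client applications commonly refer to a topic by joining the stream path and the topic name with a colon. The helper below is hypothetical (it is not part of any MapR library), and the exact naming format is an assumption that you should verify against the documentation for your version.

```python
# Hypothetical helper, not a MapR API: builds "<stream path>:<topic>" names
# for the stream and topics used in this example.
STREAM_PATH = "/somepath/traffic_monitoring"
TOPICS = ["traffic", "infrastructure", "weather_conditions"]


def full_topic_name(stream_path: str, topic: str) -> str:
    """Join a stream path and a topic name with a colon (assumed format)."""
    if not stream_path.startswith("/"):
        raise ValueError("stream path must be absolute")
    return f"{stream_path}:{topic}"


full_names = [full_topic_name(STREAM_PATH, t) for t in TOPICS]
```

Under that assumption, the topic that we follow in this example would be addressed as /somepath/traffic_monitoring:traffic.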

Of all of the sensors (producers) that your system uses to monitor traffic, let's choose a sensor that is under the pavement of Market Street and follow a message that it generates and publishes in the traffic topic.

When you create a topic, you can create several partitions within it to help spread the load among the different nodes in your MapR cluster and to help improve the performance of your consumers. For simplicity, we'll assume that the traffic topic has only one partition.
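
To see how several partitions could spread the load, here is a toy sketch of key-based partition selection. It is not the MapR Streams implementation: the function name, the sensor IDs, and the choice of CRC32 as the hash are all assumptions made for illustration.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index in the range [0, num_partitions)."""
    if num_partitions < 1:
        raise ValueError("a topic needs at least one partition")
    # CRC32 is an assumed stand-in for whatever hash the server really uses.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical sensor IDs used as message keys.
sensor_ids = ["market-st-01", "bay-bridge-07", "van-ness-12"]
assignments = {s: partition_for(s, 4) for s in sensor_ids}
```

Because the result depends only on the key, every message that carries the same key lands in the same partition. With a single partition, as in this example, every key maps to partition 0.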

A Message Enters the System

Figure: A car runs over a sensor, triggering the sending of a message

  1. A car, one of hundreds on Market Street in morning rush-hour traffic, runs over the sensor. This action triggers the sensor to send a message to a MapR Streams producer client library.
    Note: This message might list geospatial coordinates, time, date, direction, weight, distance between front and rear wheels, and more. MapR Streams does not help you decide which data to collect.
  2. The client buffers the message.
  3. When the client has buffered enough messages (because other cars have subsequently triggered the sensor), or after an interval of time has expired, the client batches the buffered messages and sends them. The message that we are following is published in the partition along with the rest of the messages in the batch. When the message is published, the MapR Streams server assigns it the offset 001030 (this is only an example; real offsets are more sophisticated). Because these messages are the most recently published, they are written to the head of the partition.

    For a moment, suppose that this example used more than one partition. In that case, the sensor could influence how the MapR Streams server determines which messages go to which partition. For example, the sensor could include a key with each message, and the MapR Streams server would hash the key to determine the partition in which to place all messages from the sensor. How partitions are selected when a topic has more than one is explained later in this documentation.

  4. Each partition and all of its messages are replicated. The server that owns the primary partition for the traffic topic, having assigned the offset 001030 to the message that we're following, replicates the message to replica containers within the MapR cluster. Replication rules are controlled at the volume level.

    Figure: Replication of the partition in the topic traffic

  5. The server acknowledges receiving the batch of messages and sends the offsets that it assigned to them.

    Figure: The server acknowledges receiving the messages
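
The producer-side steps above can be sketched as a toy model: a client buffers messages, sends them as a batch when the buffer is large enough or old enough, and receives the assigned offsets as an acknowledgment. The class names, the thresholds, and the offsets that start at 0 are assumptions for illustration only; this is not the MapR Streams client library, and replication is left out.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy append-only log; a message's offset is its position in the log."""
    messages: list = field(default_factory=list)

    def append_batch(self, batch):
        first = len(self.messages)
        self.messages.extend(batch)
        # The acknowledgment carries the offsets assigned to the batch.
        return list(range(first, first + len(batch)))


class BufferingProducer:
    """Toy client that flushes on batch size or on elapsed time."""

    def __init__(self, partition, max_batch=3, max_wait_s=5.0,
                 clock=time.monotonic):
        self.partition = partition
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.clock = clock
        self.buffer = []
        self.oldest = None       # arrival time of the oldest buffered message
        self.acked_offsets = []  # offsets returned by the server so far

    def send(self, message):
        if not self.buffer:
            self.oldest = self.clock()
        self.buffer.append(message)
        self.maybe_flush()

    def maybe_flush(self):
        # Called on every send; a real client would also run this on a timer.
        full = len(self.buffer) >= self.max_batch
        waited = bool(self.buffer) and (
            self.clock() - self.oldest >= self.max_wait_s)
        if full or waited:
            self.flush()

    def flush(self):
        if self.buffer:
            self.acked_offsets += self.partition.append_batch(self.buffer)
            self.buffer = []


partition = Partition()
producer = BufferingProducer(partition, max_batch=3)
for car in ["car-1", "car-2", "car-3", "car-4"]:
    producer.send({"sensor": "market-st-01", "event": car})
```

In this run, the first three messages are sent as one batch and acknowledged together, while the fourth waits in the buffer until another flush is triggered.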

The Message is Read from the System

An analytics application (consumer) that correlates traffic volume with weather conditions is subscribed to the traffic topic. Many more consumers could subscribe to it, too.

Figure: How messages are read

  1. The application issues a request to the consumer client library to poll the topic for messages that the application has not yet read.
  2. The client requests messages that are more recent than any that the consumer has already read.
  3. The primary partition returns multiple messages to the client. The originals of the messages remain on the partition and are available to other consumers.
  4. The client passes the messages to the application, which extracts the data from them and does something with the data.
  5. If more unread messages remain in the partition, the process repeats from step 2.
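
The read loop above can be sketched with a toy consumer that tracks the offset of the next unread message. The Consumer class, the list that stands in for the partition, and the batch size are assumptions for illustration; this is not the MapR Streams consumer API.

```python
class Consumer:
    """Toy consumer that remembers the offset of the next message to read."""

    def __init__(self, log):
        self.log = log        # shared list standing in for the partition
        self.next_offset = 0  # everything before this offset has been read

    def poll(self, max_messages=2):
        # Reading copies messages out; nothing is removed from the log.
        batch = self.log[self.next_offset:self.next_offset + max_messages]
        self.next_offset += len(batch)
        return batch


partition_log = [f"msg-{i}" for i in range(5)]
analytics = Consumer(partition_log)
reporting = Consumer(partition_log)  # a second, independent consumer

processed = []
while True:
    batch = analytics.poll()
    if not batch:
        break  # no unread messages remain
    processed.extend(batch)
```

Each consumer keeps its own position, so the analytics consumer reading to the end of the partition does not affect what the reporting consumer sees on its first poll.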

The Original Message is Deleted

Back in the cluster in San Francisco, messages are being continuously published to the partition in the traffic topic. Message 001030 is much further back in the partition. More recent messages have filled the partition ahead of it.

When you created the stream, you set the time-to-live for messages to six months. Message 001030 and the messages around it have now been in the partition for that long, so they have expired. An automatic process eventually reclaims the disk space that message 001030 and the other expired messages are using.

Figure: Messages to be deleted automatically
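
Time-to-live expiry can be sketched as a toy purge over a log of (offset, publish time) pairs. The function name, the length chosen for six months, and the timestamps are assumptions for illustration; the actual reclamation process in MapR Streams is not described here and may work differently.

```python
from collections import deque

SIX_MONTHS_S = 182 * 24 * 60 * 60  # assumed length of six months, in seconds


def purge_expired(log, now, ttl_s=SIX_MONTHS_S):
    """Drop messages at least ttl_s old from the oldest end; return the count."""
    removed = 0
    while log and now - log[0][1] >= ttl_s:
        log.popleft()
        removed += 1
    return removed


# Each entry is (offset, publish time in seconds); oldest entries come first.
log = deque([(1030, 0), (1031, 60), (1032, SIX_MONTHS_S)])
removed = purge_expired(log, now=SIX_MONTHS_S + 60)
```

In this sketch, the surviving messages keep their original offsets after older messages are removed, so a consumer's saved position still refers to the same message.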