
Transactional Outbox

Overview

A minimal MassTransit + RabbitMQ implementation suffers from the following:

  • Publishing has only a limited retry; an extended broker outage results in lost messages.
  • The consumer has only a limited retry; a failed message is not returned to a retry queue.
  • Publish/consume happens immediately, which can cause deadlocks in some scenarios when concurrent operations touch the same rows.

The outbox pattern addresses these shortcomings by persisting the message in a local table within the same transaction as the emitting operation.
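
As a point of reference, a minimal sketch of wiring this up with MassTransit's Entity Framework outbox is shown below; AppDbContext, the extension method name, and the RabbitMQ endpoint configuration are placeholders rather than the actual service setup.

using MassTransit;
using Microsoft.Extensions.DependencyInjection;

// Sketch: service registration for the Entity Framework outbox (AppDbContext is a placeholder).
public static class MessagingSetup
{
    public static IServiceCollection AddOutboxMessaging(this IServiceCollection services)
    {
        services.AddMassTransit(x =>
        {
            x.AddEntityFrameworkOutbox<AppDbContext>(o =>
            {
                o.UseSqlServer();   // outbox/inbox tables live in the service's own database
                o.UseBusOutbox();   // Publish() stages messages in OutboxMessage inside the emitting transaction
            });

            x.UsingRabbitMq((context, cfg) => cfg.ConfigureEndpoints(context));
        });

        return services;
    }
}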

Publisher Scenario

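A rough sketch of the publisher flow, assuming the bus outbox is enabled as above; PaymentService, AppDbContext, Payment, and PaymentCreated are illustrative names, not the actual code.

using MassTransit;

// Sketch: publishing inside the emitting operation's transaction.
// With the bus outbox enabled, Publish() writes to OutboxMessage instead of sending
// to RabbitMQ directly; the delivery loop forwards the message after the commit,
// so an extended broker outage no longer loses it.
public record PaymentCreated(Guid PaymentId);

public class PaymentService
{
    private readonly AppDbContext _db;
    private readonly IPublishEndpoint _publishEndpoint;

    public PaymentService(AppDbContext db, IPublishEndpoint publishEndpoint)
    {
        _db = db;
        _publishEndpoint = publishEndpoint;
    }

    public async Task CreatePaymentAsync(Payment payment, CancellationToken ct)
    {
        _db.Payments.Add(payment);

        // Staged in the outbox table, not yet on the broker.
        await _publishEndpoint.Publish(new PaymentCreated(payment.Id), ct);

        // The payment row and the outbox row commit together, so the message
        // is emitted only if the business operation succeeds.
        await _db.SaveChangesAsync(ct);
    }
}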

Consumer Scenario

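A comparable sketch of the consumer flow, assuming the receive endpoint uses the Entity Framework outbox so that InboxState de-duplicates redelivered messages; the names are again illustrative.

using MassTransit;

// Sketch: a consumer whose receive endpoint uses the Entity Framework outbox.
// InboxState tracks consumed messages, so a redelivered message is skipped
// instead of being processed twice.
public class PaymentCreatedConsumer : IConsumer<PaymentCreated>
{
    private readonly AppDbContext _db;

    public PaymentCreatedConsumer(AppDbContext db) => _db = db;

    public async Task Consume(ConsumeContext<PaymentCreated> context)
    {
        // Changes made through the same DbContext share a transaction with the
        // InboxState bookkeeping.
        // ... apply business changes for context.Message ...
        await _db.SaveChangesAsync(context.CancellationToken);
    }
}

// Endpoint configuration (inside UsingRabbitMq):
// cfg.ReceiveEndpoint("payment-created", e =>
// {
//     e.UseEntityFrameworkOutbox<AppDbContext>(context);   // enables the inbox for this endpoint
//     e.ConfigureConsumer<PaymentCreatedConsumer>(context);
// });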

Outbox retention worker service

Due to the high volume of outbox message processing, a background step is required to purge published and/or processed messages. To fulfil that need, a background worker service called OutboxRetentionService has been implemented in BuildingBlocks and is shared by all three listener services (Mono, PaymentRepo, Zelle). Each listener hosts this BackgroundService to delete older messages.
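
A minimal sketch of what such a polling BackgroundService can look like is shown below; IOutboxPurger, the method name, and the hard-coded interval are assumptions for illustration only (the real interval is configurable, as described later).

using Microsoft.Extensions.Hosting;

// Sketch: a polling BackgroundService that purges eligible outbox/inbox rows.
// IOutboxPurger and the hard-coded 15-second interval are placeholders.
public class OutboxRetentionService : BackgroundService
{
    private static readonly TimeSpan PollingInterval = TimeSpan.FromSeconds(15);
    private readonly IOutboxPurger _purger;

    public OutboxRetentionService(IOutboxPurger purger) => _purger = purger;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            // Remove published and/or processed messages older than the retention window.
            await _purger.PurgeCompletedAsync(stoppingToken);
            await Task.Delay(PollingInterval, stoppingToken);
        }
    }
}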

OutboxMessage purge eligibility is based on these scenarios:

  • Scenario 1 - completed publisher only
    • OutboxState exists
    • OutboxState.PublishedAt is not null
    • InboxState does not exist
  • Scenario 2 - completed publisher and consumer
    • OutboxState exists
    • OutboxState.PublishedAt is not null
    • InboxState exists
    • InboxState.ProcessedAt is not null
  • Scenario 3 - completed consumer only
    • OutboxState does not exist
    • InboxState exists
    • InboxState.ProcessedAt is not null

Messages are deleted based on two categories, Completed and Incomplete; the Incomplete path has not been implemented yet. Messages older than 30 days, measured from the current UTC datetime, are deleted, and the retention worker checks the outbox table for eligible messages every 15 seconds. Both the retention period and the polling interval are configurable in appsettings.json, and two separate feature flags enable or disable the purge process for the Completed and Incomplete categories.
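
An illustrative appsettings.json shape for these settings is shown below; the section and key names are hypothetical, and only the values (30 days, 15 seconds) come from the description above.

"OutboxRetentionSettings": {               // hypothetical section and key names
    "RetentionDays": 30,
    "PollingIntervalSeconds": 15,
    "PurgeCompletedEnabled": true,
    "PurgeIncompleteEnabled": false
},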

The purge is a hard delete that relies on cascade delete to remove all eligible records from the OutboxMessage, OutboxState, and InboxState tables. The deletion stored procedure uses the READPAST and UPDLOCK table hints. Records are deleted in configurable batches, with a default batch size of 1000, and each batch is deleted within a single transaction.
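
A rough sketch of a single purge batch, expressed here as raw SQL from C# for illustration: the actual stored procedure covers all three scenarios and relies on cascade delete for the related tables, and the column names are taken from the scenario descriptions above. AppDbContext and the method name are placeholders.

using Microsoft.EntityFrameworkCore;

// Sketch: one purge batch, expressed as raw SQL; the real stored procedure also
// checks InboxState per the scenarios above and cascades to OutboxMessage.
public static Task<int> PurgeCompletedBatchAsync(
    AppDbContext db, DateTime cutoffUtc, int batchSize, CancellationToken ct)
{
    // READPAST skips rows still locked by live outbox writers;
    // UPDLOCK holds the selected rows until the delete commits.
    return db.Database.ExecuteSqlRawAsync(@"
        DELETE TOP ({0}) os
        FROM OutboxState os WITH (READPAST, UPDLOCK)
        WHERE os.PublishedAt IS NOT NULL
          AND os.PublishedAt < {1};",
        new object[] { batchSize, cutoffUtc }, ct);
}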

Node configuration for retention worker service

The set of active workers is configurable in case the active count needs to be reduced to lower contention. An array value in appsettings.json holds a list of machine names, and a feature flag can completely disable the retention worker on a node (see the sketch after the following list).

The maintenance background worker is only registered in these scenarios:

  • The configured setting is an empty array. All workers would run.

  • The configured setting is not empty and the current worker's machine name is included.
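
A sketch of that registration check, assuming a hypothetical RetentionWorkerMachineNames key and the standard Environment.MachineName; the disable feature flag is not shown.

using System.Linq;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

// Sketch: register the retention worker only when this node is allowed to run it.
// "RetentionWorkerMachineNames" is a hypothetical key for the machine-name array;
// configuration and services come from the host builder (e.g., builder.Configuration / builder.Services).
var allowedMachines = configuration.GetSection("RetentionWorkerMachineNames")
                                   .Get<string[]>() ?? Array.Empty<string>();

bool registerWorker =
    allowedMachines.Length == 0 ||                                    // empty array: all workers run
    allowedMachines.Contains(Environment.MachineName,                 // listed: this machine runs it
                             StringComparer.OrdinalIgnoreCase);

if (registerWorker)
{
    services.AddHostedService<OutboxRetentionService>();
}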

Backoff Retry Mechanism for Outbox

The Backoff Retry Mechanism is a strategy used to handle transient errors when processing messages in an outbox. This mechanism allows the system to wait for a certain period before attempting to reprocess a failed message. With each subsequent retry, the waiting period increases, reducing the likelihood of overwhelming the system or external services.

How It Works

When a message fails to process, the Backoff Retry Mechanism is triggered. The time interval between retries increases exponentially based on the following parameters:

  • interval: The initial amount of time (in seconds) to wait before the first retry.
  • exponentialRate: A value greater than 1 that determines the rate at which the backoff time increases with each retry. Higher values result in more significant increases between retries.
  • retryNumber: The number of the current retry attempt, starting from 0 for the initial attempt.

Backoff Formula

The time to wait before the next retry is calculated using the following formula:

waitTime = interval * exponentialRate ^ retryNumber

  • waitTime: The amount of time (in seconds) to wait before the next retry.
  • interval: The base time interval for the first retry.
  • exponentialRate: The factor by which the interval increases for each subsequent retry.
  • retryNumber: The number of retries attempted so far.

Example

For an interval of 5 seconds, an exponentialRate of 2, and a retryNumber of 3, the wait time before the next attempt would be:

waitTime = 5 * (2 ^ 3) = 5 * 8 = 40 seconds
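
Expressed as code, the formula is a one-liner; this is a sketch of the formula above, not the actual implementation.

// Sketch of the backoff formula: waitTime = interval * exponentialRate ^ retryNumber
static TimeSpan GetBackoffDelay(double intervalSeconds, double exponentialRate, int retryNumber)
    => TimeSpan.FromSeconds(intervalSeconds * Math.Pow(exponentialRate, retryNumber));

// GetBackoffDelay(5, 2, 3) => 40 seconds, matching the example above.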

Use Case

This retry mechanism is particularly useful in scenarios where temporary failures (e.g., network issues, service outages) are expected. By progressively increasing the wait time between retries, the system avoids overwhelming external services while still attempting to process messages.

Configuration

In practice, you can adjust the interval and exponentialRate to suit the needs of your application. For instance, a lower exponentialRate would result in smaller increments between retries, which might be suitable for less critical tasks.

Example Configuration

"QueueOutboxSettings": {
    "FailMessageProcessMaxRetryAttempts": 5,
    "FailMesagePublishMaxRetryAttempts": 5,
    "InitialBackOffRetryDelayMilliseconds": 10000000,
    "InitialBackOffFactor": 2,
    "BackoffFailureThreshold": 3
},
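
These settings would typically be bound to an options class; a sketch that mirrors the keys above (property names follow the keys exactly as shown, including the FailMesagePublishMaxRetryAttempts spelling) could look like this.

// Sketch: an options class mirroring the QueueOutboxSettings section above.
public class QueueOutboxSettings
{
    public int FailMessageProcessMaxRetryAttempts { get; set; }
    public int FailMesagePublishMaxRetryAttempts { get; set; }      // name mirrors the key as shown
    public int InitialBackOffRetryDelayMilliseconds { get; set; }
    public double InitialBackOffFactor { get; set; }
    public int BackoffFailureThreshold { get; set; }
}

// Registration, e.g. in Program.cs:
// services.Configure<QueueOutboxSettings>(configuration.GetSection("QueueOutboxSettings"));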