At Ninja Van, processing millions of Kafka messages daily across six Southeast Asian countries meant transient failures were inevitable: database timeouts, external API errors, downstream service outages. Naively retrying in place would block entire partitions and cascade into wider service degradation.
So we built a centralized retry system with a dedicated service that supports retries for all microservices through client libraries. Messages flow through four escalating delay tiers (e.g., 1m → 5m → 30m → 60m) backed by MySQL persistence and Quartz scheduling, making retries durable across deployments and horizontally scalable.
Watch messages flow through the retry system in real time. Failed messages get wrapped in protobuf and retry through escalating delay tiers before routing to the Dead Letter Queue. Adjust the failure rate to see how the system responds.
Microservices publish domain events to Kafka topics. When a downstream consumer encounters a transient failure, the message needs to be retried without blocking other messages in the partition.
The client library catches the processing exception and wraps the original message in a RetryWrapper protobuf envelope. The envelope preserves the original payload together with a list of remaining retry options (e.g., 1m, 5m, 30m, 60m).
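As a rough illustration of the wrapping step, here is a minimal Python sketch. It uses a plain dataclass as a stand-in for the RetryWrapper protobuf message; the field names (`target_topic`, `payload`, `remaining_delays_s`) and the `consume` helper are assumptions made for this example, not the actual contract.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Delay tiers from the text, in seconds: 1m, 5m, 30m, 60m.
DEFAULT_DELAYS_S: Tuple[int, ...] = (60, 300, 1800, 3600)


@dataclass(frozen=True)
class RetryWrapper:
    """Stand-in for the protobuf envelope (field names are hypothetical)."""
    target_topic: str                    # topic to republish to later
    payload: bytes                       # original message bytes, unchanged
    remaining_delays_s: Tuple[int, ...]  # retry options not yet used


def consume(
    topic: str,
    payload: bytes,
    process: Callable[[bytes], None],
    remaining_delays_s: Tuple[int, ...] = DEFAULT_DELAYS_S,
) -> Optional[RetryWrapper]:
    """Run the handler; on failure return an envelope instead of blocking."""
    try:
        process(payload)
        return None  # processed successfully, nothing to retry
    except Exception:
        # Wrap the untouched payload so it can be retried out of band.
        return RetryWrapper(topic, payload, remaining_delays_s)
```

Returning the envelope rather than retrying inside the handler is what lets the consumer move on to the next message in the partition.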
The first retry option is popped from the list, and the wrapped message is published to the corresponding delay-specific topic (e.g., retry-act-00h01m for a 1-minute delay). Four dedicated topics provide escalating delay tiers.
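The routing step could be sketched as below. Only the name `retry-act-00h01m` appears in the text; the general `retry-act-HHhMMm` pattern used for the other tiers is extrapolated from that single example, and `delay_topic` and `next_hop` are hypothetical helper names.

```python
from typing import Optional, Sequence, Tuple


def delay_topic(delay_s: int, prefix: str = "retry-act") -> str:
    """Format a delay in seconds as a delay-specific topic name."""
    hours, rem = divmod(delay_s, 3600)
    minutes = rem // 60
    return f"{prefix}-{hours:02d}h{minutes:02d}m"


def next_hop(
    remaining_delays_s: Sequence[int],
) -> Optional[Tuple[str, int, Tuple[int, ...]]]:
    """Pop the first retry option.

    Returns (topic, delay_seconds, options_left), or None when no
    retry options remain.
    """
    if not remaining_delays_s:
        return None
    delay, *rest = remaining_delays_s
    return delay_topic(delay), delay, tuple(rest)
```

For a fresh failure, `next_hop((60, 300, 1800, 3600))` selects the 1-minute topic and leaves three options in the envelope for later attempts.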
The central Retry Service consumes from all four retry topics and persists each message to a MySQL table with a calculated retry_at timestamp. This makes retries durable — they survive service restarts and deployments.
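To make the persistence step concrete, the sketch below stores a job with a computed `retry_at`. It uses Python's built-in `sqlite3` as a stand-in for MySQL so the example runs without a server; the table name, column names, and the use of epoch seconds are all assumptions for illustration.

```python
import sqlite3

# Hypothetical schema; the real MySQL table is not described in the text.
SCHEMA = """
CREATE TABLE retry_job (
    id           INTEGER PRIMARY KEY,
    target_topic TEXT    NOT NULL,
    envelope     BLOB    NOT NULL,  -- serialized envelope bytes
    retry_at     INTEGER NOT NULL   -- epoch seconds
)
"""


def persist(
    conn: sqlite3.Connection,
    target_topic: str,
    envelope: bytes,
    delay_s: int,
    now_s: int,
) -> int:
    """Store one retry job and return its computed retry_at."""
    retry_at = now_s + delay_s  # consume time plus the tier's delay
    conn.execute(
        "INSERT INTO retry_job (target_topic, envelope, retry_at) "
        "VALUES (?, ?, ?)",
        (target_topic, envelope, retry_at),
    )
    conn.commit()  # once committed, the job survives a process restart
    return retry_at
```

Because the due time lives in the row rather than in a timer held in memory, a restarted instance can pick up where the previous one left off.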
A Quartz scheduler polls the database every 10 seconds for jobs where retry_at < NOW(). Ready messages are reconstructed from stored protobuf and published back to their original target topic for re-processing.
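One polling tick might look like the following sketch. It again uses `sqlite3` in place of MySQL and a plain function in place of a Quartz job; the schema, `open_store`, `poll_due`, and the `publish` callback are hypothetical. In the real system Quartz would invoke the equivalent of `poll_due` every 10 seconds.

```python
import sqlite3
from typing import Callable


def open_store() -> sqlite3.Connection:
    """Create an in-memory table standing in for the MySQL retry table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE retry_job ("
        " id INTEGER PRIMARY KEY,"
        " target_topic TEXT NOT NULL,"
        " envelope BLOB NOT NULL,"
        " retry_at INTEGER NOT NULL)"  # epoch seconds
    )
    return conn


def poll_due(
    conn: sqlite3.Connection,
    now_s: int,
    publish: Callable[[str, bytes], None],
) -> int:
    """Republish every job with retry_at < now_s; return how many were sent."""
    rows = conn.execute(
        "SELECT id, target_topic, envelope FROM retry_job "
        "WHERE retry_at < ? ORDER BY retry_at",
        (now_s,),
    ).fetchall()
    for job_id, topic, envelope in rows:
        publish(topic, envelope)  # back to the original target topic
        # Delete only after publishing: a crash between the two steps
        # re-sends the message rather than losing it (at-least-once).
        conn.execute("DELETE FROM retry_job WHERE id = ?", (job_id,))
    conn.commit()
    return len(rows)
```

With a 10-second polling interval, a message can be republished up to roughly 10 seconds after its `retry_at`, which is negligible next to delay tiers measured in minutes.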
If re-processing fails again, the next retry option is selected (5m → 30m → 60m). After all four retry attempts are exhausted, the message is routed to a Dead Letter Queue for manual investigation and resolution.
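The escalation path can be traced with a small simulation. This is a sketch under simplifying assumptions: it counts only the configured delays (ignoring polling latency and processing time), and `simulate` and its outcome labels are names invented for this example.

```python
from typing import Optional, Sequence, Tuple


def simulate(
    delays_s: Sequence[int] = (60, 300, 1800, 3600),
    succeeds_on_attempt: Optional[int] = None,
) -> Tuple[str, int, int]:
    """Walk a message through the delay tiers.

    Returns (outcome, total_delay_seconds, retry_attempts), where outcome
    is "processed" or "dead-letter". The initial delivery is assumed to
    have failed already, so attempt 1 is the first retry.
    """
    remaining = list(delays_s)
    elapsed = 0
    attempt = 0
    while remaining:
        elapsed += remaining.pop(0)  # wait out the next tier
        attempt += 1
        if succeeds_on_attempt == attempt:
            return ("processed", elapsed, attempt)
    # Every retry option was used and the message still fails.
    return ("dead-letter", elapsed, attempt)
```

With the example tiers, a message that never succeeds spends 60 + 300 + 1800 + 3600 = 5760 seconds (96 minutes) in delays before reaching the Dead Letter Queue, while one that recovers on its second retry is processed after 360 seconds.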
Retry delays are implemented via MySQL timestamps and Quartz polling rather than in-memory timers or consumer pauses. The multi-tier delay strategy prevents a thundering herd on transient failures, while retries remain durable across service restarts and scale horizontally via Quartz clustering.
The system ships as a shared client library with identical protobuf contracts, enabling adoption across the entire engineering organization. Centralized Prometheus metrics and structured logging with request ID propagation provide unified observability across all services using this retry infrastructure.