Async Batch Processing

Async batch processing is the execution backbone for high-volume, latency-tolerant reconciliation workloads in vendor rebate and trade promotion ecosystems. Within the broader data ingestion and normalization pipelines discipline, this page owns one specific sub-problem: how raw payloads are decoupled from compute so that millions of POS transactions, vendor invoices, and promotional claims can be ingested without blocking the reconciliation engine or stalling financial close. Unlike synchronous API calls that hold a connection open until a downstream response returns, asynchronous batch processing separates payload submission from compute execution. That separation is what lets retail and CPG operations absorb a 500,000-line quarterly claim and a continuous trickle of daily store feeds on the same infrastructure, scaling worker pools independently of upstream ingestion rate.

This is the layer where ingestion bursts stop being a reliability risk. Get it wrong and the failures are financial — a replayed batch double-counts an accrual, a worker restart mid-commit leaves a half-posted ledger entry, a flooded ERP endpoint locks the general ledger during month-end close. Get it right and reconciliation throughput becomes a function of how many workers you provision, not of how fast vendors happen to submit. For Python ETL developers this means stateless, idempotent consumers that carry deterministic keys; for trade finance analysts and vendor managers it means accrual integrity that survives infrastructure failover. The deeper queue-client and consumer-scaling mechanics live in implementing async batch queues for sales data; this page frames the architecture, the schema contract, and the financial guarantees the queue must preserve.

Positioning Within the Reconciliation Architecture

The async layer sits between ingestion and settlement. It reads payloads it does not produce and emits accruals it does not adjudicate. Upstream, the CSV & EDI parsing workflows flatten and schema-validate raw vendor files before anything is enqueued, and field mapping strategies resolve SKU hierarchies and deduction reason codes against master data so workers consume canonical records rather than raw vendor terminology. Downstream, the modeled accrual is later challenged by the claim validation rule engine when a vendor disputes the figure, and validated journal entries are synchronized through POS & ERP sync patterns on the retailer’s fiscal cutoff.

Trade promotion agreements generate highly irregular data volumes, and synchronous models fail under them because they couple ingestion throughput to reconciliation compute capacity. When a batch job blocks waiting for ERP validation, vendor master lookups, or tiered accrual math, connection pools exhaust, memory spikes, and partial failures cascade into unreconciled accruals. The async layer resolves this by placing a message broker or distributed queue between ingestion and the reconciliation engine: payloads are serialized, assigned a deterministic correlation ID, and enqueued immediately, while workers consume at a controlled, configurable rate. Two design commitments keep that safe. First, the modeled accrual is the source of truth, not the vendor claim — when they disagree, the delta is routed as a dispute rather than auto-paid. Second, determinism end to end — the same payload, agreement version, and worker configuration must always produce the same accrual, so ingestion latency never propagates to financial reporting SLAs.
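The decoupling described above can be sketched with an in-process asyncio.Queue standing in for the broker. This is a minimal illustration under stated assumptions, not the production topology: the function names, the line_no field, and the uuid5-based correlation ID are all hypothetical choices made for the example.

```python
import asyncio
import uuid


async def enqueue_payloads(queue: asyncio.Queue, payloads: list[dict]) -> None:
    """Producer: tag each payload with a correlation ID and enqueue it."""
    for payload in payloads:
        # uuid5 over a stable business string is deterministic:
        # the same vendor and line always yield the same ID.
        identity = f"{payload['vendor_id']}:{payload['line_no']}"
        correlation_id = str(uuid.uuid5(uuid.NAMESPACE_URL, identity))
        # put() waits while the queue is full, which is the backpressure.
        await queue.put({"correlation_id": correlation_id, "payload": payload})


async def worker(queue: asyncio.Queue, results: list[str]) -> None:
    """Consumer: drain the queue at its own pace, independent of the producer."""
    while True:
        message = await queue.get()
        try:
            await asyncio.sleep(0)  # stand-in for ERP validation or accrual math
            results.append(message["correlation_id"])
        finally:
            queue.task_done()


async def run_batch(
    payloads: list[dict], worker_count: int = 4, max_in_flight: int = 8
) -> list[str]:
    """Run producers and consumers until every enqueued message is processed."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_in_flight)  # bounded queue
    results: list[str] = []
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(worker_count)]
    await enqueue_payloads(queue, payloads)
    await queue.join()  # returns once task_done() was called for every message
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

Calling asyncio.run(run_batch(payloads)) returns the correlation IDs in completion order. Throughput is governed by worker_count and max_in_flight rather than by how fast the producer submits, which is the property the paragraph above relies on.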

Entity Topology and Schema Specification

A production-grade queue topology separates the immutable message envelope (what was submitted) from the dynamic processing state (what a worker did with it). The envelope is written once at enqueue and never mutated; the processing record is appended as workers advance the message through its lifecycle. The fields below are the minimum contract the reconciliation engine and the observability layer depend on. Treat them as a versioned interface — adding a nullable field is backward-compatible, while renaming or retyping one is a breaking change that bumps the schema version.

Field	Type	Entity	Constraint
`message_id`	str (ULID)	Envelope	immutable, write-once, time-ordered
`reconciliation_key`	str (sha256)	Envelope	deterministic dedup key; unique business identity
`correlation_id`	str	Envelope	propagated from ingestion to ERP sync for tracing
`vendor_id`	str	Envelope	FK to vendor master; validated pre-enqueue
`claim_period`	str (ISO)	Envelope	fiscal window; pins the agreement version
`agreement_version`	int	Envelope	monotonic; pinned per message, never “latest”
`payload`	dict (canonical)	Envelope	schema-validated, UTC-anchored, normalized
`queue_tier`	enum	Envelope	`fast_path`, `compute_heavy`, `dead_letter`
`attempt`	int	ProcessingState	retry counter; drives backoff and DLQ routing
`status`	enum	ProcessingState	`queued`, `processing`, `committed`, `skipped`, `dead`
`accrual_amount`	Decimal (str at rest)	ProcessingState	4 dp, parsed via `decimal.Decimal`, never float
`committed_at`	datetime (UTC)	ProcessingState	write-once on atomic ledger commit

Storing accrual_amount as a string at rest — parsed into Decimal only at evaluation — is what keeps the posted figure reproducible to the cent across workers and environments. The reconciliation_key is the load-bearing field: it is a deterministic hash over the canonicalized business identity of the record, so two physically distinct broker deliveries of the same logical claim line collapse to one key and one accrual.
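As a rough sketch of both points, the snippet below derives a key from a canonicalized identity and round-trips an accrual through its string form. The choice of identity fields (KEY_FIELDS) and the normalization rules are assumptions for illustration; the real business identity is whatever the agreement and master data define.

```python
import hashlib
import json
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical identity fields; transport fields such as message_id are excluded.
KEY_FIELDS = ("vendor_id", "claim_period", "sku", "quantity")


def canonical_identity(record: dict) -> str:
    """Serialize only the identity fields, normalized and in a fixed order."""
    identity = {field: str(record[field]).strip().upper() for field in KEY_FIELDS}
    return json.dumps(identity, sort_keys=True, separators=(",", ":"))


def reconciliation_key(record: dict) -> str:
    """SHA-256 over the canonical identity: same claim line, same key."""
    return hashlib.sha256(canonical_identity(record).encode("utf-8")).hexdigest()


def accrual_to_storage(amount: Decimal) -> str:
    """Quantize to 4 dp and store as text so no float ever touches the value."""
    return str(amount.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP))


def accrual_from_storage(stored: str) -> Decimal:
    """Parse the stored text back into Decimal at evaluation time."""
    return Decimal(stored)
```

Two deliveries that differ only in message_id, field order, or stray whitespace hash to the same key, while a change to any identity field produces a different one.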

Conditional Logic and Rule Integration

Routing a message to the right queue is itself conditional logic. The most effective topology partitions work by processing complexity and downstream dependency, so a slow compute-heavy accrual never sits behind — or starves — a cheap validation:

Fast-path queue handles lightweight checks (SKU existence, promotion-window membership, vendor ID mapping). Workers here are stateless, highly parallel, and optimized for throughput.
Compute-heavy queue manages tiered accrual math, volume-threshold evaluation, and cross-reference matching against historical promotion windows. These workers perform database joins and rate-limited vendor API calls.
Dead-letter queue (DLQ) captures malformed payloads, schema violations, and unrecoverable mismatches, triggering alerts for manual exception handling.

Crucially, the async layer does not re-derive promotional qualification — that gate is owned upstream by the eligibility rule framework, and workers consume only records that already cleared channel, hierarchy, and temporal-window scope. What the queue layer does encode conditionally is partitioning and backpressure: messages are keyed so that a single retailer’s stalled feed cannot block others (no head-of-line blocking), and consumers apply exponential backoff, jitter, and circuit breakers to avoid thundering-herd failures when a downstream system degrades. Rate limiting at the consumer level ensures ERP endpoints are never flooded during bulk claim submissions, preserving ledger stability during month-end close.

json

{
  "routing": {
    "fast_path": ["sku_exists", "promo_window_member", "vendor_id_mapped"],
    "compute_heavy": ["tiered_accrual", "volume_threshold", "historical_match"],
    "dead_letter": ["schema_violation", "unrecoverable_mismatch"]
  },
  "backpressure": {"max_in_flight": 256, "erp_rate_limit_per_sec": 40},
  "retry": {"max_attempts": 3, "base_delay_s": 30, "jitter": true}
}
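
One way a consumer might read this configuration is to invert the routing map into a lookup from work type to queue tier. The sketch below assumes the JSON above is available as a string; sending unknown work types to dead_letter is an assumption made here, not something the configuration itself specifies.

```python
import json

CONFIG_JSON = """
{
  "routing": {
    "fast_path": ["sku_exists", "promo_window_member", "vendor_id_mapped"],
    "compute_heavy": ["tiered_accrual", "volume_threshold", "historical_match"],
    "dead_letter": ["schema_violation", "unrecoverable_mismatch"]
  }
}
"""


def build_route_table(routing: dict) -> dict:
    """Invert {tier: [work types]} into {work type: tier}, rejecting duplicates."""
    table: dict = {}
    for tier, work_types in routing.items():
        for work_type in work_types:
            if work_type in table:
                raise ValueError(f"{work_type} is routed to more than one tier")
            table[work_type] = tier
    return table


ROUTES = build_route_table(json.loads(CONFIG_JSON)["routing"])


def route(work_type: str) -> str:
    """Return the queue tier for a work type.

    Assumption: anything not listed is treated as unroutable and dead-lettered.
    """
    return ROUTES.get(work_type, "dead_letter")
```

Building the table once at startup keeps the per-message routing decision a constant-time lookup and surfaces a misconfigured duplicate before any message is consumed.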

Financial Settlement Layer

Once a message reaches a compute-heavy worker, earned volume becomes monetary value, and this is where correctness is non-negotiable. The async layer does not invent the tier math; it invokes the deterministic accrual logic owned by payout structure modeling, passing the pinned agreement_version so a mid-cycle amendment never re-rates a closed period. All monetary arithmetic uses decimal.Decimal with an explicit context and rounding mode. Binary float cannot represent most decimal fractions exactly, so it introduces rounding drift that is invisible on one line, material across a quarter-end run, and liable to vary with summation order or platform. Because accruals recognized under IFRS 15 and ASC 606 have to be reproducible to the cent under audit, float is not permitted anywhere in the settlement path.

The settlement step must be atomic with the dedup write: the worker computes the accrual, posts the ledger entry, and commits the reconciliation_key in a single transaction, so a crash can never leave a posted accrual without its dedup marker (or vice versa). This is the exactly-once business semantic that distinguishes a reconciliation queue from a generic task queue.

python

from decimal import Decimal, ROUND_HALF_UP, getcontext

getcontext().prec = 28

def settle_line(volume: Decimal, rate: Decimal, multiplier: Decimal) -> Decimal:
    """Rate qualifying volume; channel multiplier applies as a final factor.
    Caps/floors are enforced on the aggregate accrual, never per line."""
    rated = (volume * rate) * multiplier
    return rated.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)

Caps and floors are enforced on the aggregate accrual after per-line rating, not inside the worker loop, so a spend ceiling cannot be circumvented by batching the claim across messages. Currency is carried explicitly on the envelope and never converted inside a worker — any conversion happens upstream so the settlement path reasons in a single currency.
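A minimal sketch of that ordering, assuming a single floor and a single cap per agreement (real agreements may carry several): each line is rated first, the rated lines are summed, and only the total is bounded. settle_aggregate is a hypothetical helper; settle_line repeats the function shown earlier so the example is self-contained.

```python
from decimal import Decimal, ROUND_HALF_UP

FOUR_DP = Decimal("0.0001")


def settle_line(volume: Decimal, rate: Decimal, multiplier: Decimal) -> Decimal:
    """Rate one line; no cap or floor is applied at this level."""
    return ((volume * rate) * multiplier).quantize(FOUR_DP, rounding=ROUND_HALF_UP)


def settle_aggregate(lines: list, floor: Decimal, cap: Decimal) -> Decimal:
    """Sum the per-line accruals, then clamp the total to [floor, cap]."""
    if floor > cap:
        raise ValueError("floor must not exceed cap")
    total = sum((settle_line(*line) for line in lines), Decimal("0"))
    bounded = min(max(total, floor), cap)
    return bounded.quantize(FOUR_DP, rounding=ROUND_HALF_UP)
```

With two lines rating to 55.0000 and 100.0000 and a cap of 120, the aggregate settles at 120.0000. A cap checked per line would have let both through untouched, because neither line exceeds 120 on its own, and the claim would have accrued 155.0000.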

ETL Implementation Patterns

Enforcement begins at the boundary. Model the message envelope with Pydantic v2 so a float accrual, a missing version pin, or an unmapped vendor fails fast at enqueue instead of surfacing as a bad posting weeks later. The standard approach combines asyncio for I/O-bound ingestion with a distributed task queue such as Celery for compute-heavy reconciliation steps.

python

from decimal import Decimal
from pydantic import BaseModel, field_validator

class RebateMessage(BaseModel):
    message_id: str
    reconciliation_key: str
    vendor_id: str
    claim_period: str
    agreement_version: int
    accrual_amount: Decimal | None = None

    @field_validator("accrual_amount", mode="before")
    @classmethod
    def _no_float(cls, v):
        if isinstance(v, float):
            raise ValueError("accrual_amount must be Decimal/str, never float")
        return v

The worker derives the dedup key, checks a distributed state store before any compute, and commits the key atomically with the ledger write. Replaying an unchanged batch is then a no-op rather than a source of phantom liabilities.

python

import hashlib
from celery import Celery
from redis import Redis

app = Celery("rebate_reconciliation", broker="redis://localhost:6379/0")
state_store = Redis(host="localhost", port=6379, db=1)

# compute_promotional_accrual and post_to_erp_ledger are provided elsewhere
# (payout modeling and the ERP sync layer); they are not defined in this snippet.

def reconciliation_key(record: dict) -> str:
    """Deterministic dedup key over the record's business identity."""
    payload = f"{record['vendor_id']}:{record['claim_period']}:{record['line_hash']}"
    return hashlib.sha256(payload.encode()).hexdigest()

@app.task(bind=True, max_retries=3, default_retry_delay=30)
def process_rebate_line(self, record: dict):
    key = reconciliation_key(record)
    # SET key value NX EX 86400: atomic claim of the work item.
    if not state_store.set(key, "processing", nx=True, ex=86400):
        return {"status": "skipped", "key": key}
    try:
        accrual = compute_promotional_accrual(record)   # owned by payout modeling
        post_to_erp_ledger(accrual)                      # ledger write
    except Exception as exc:
        state_store.delete(key)                          # nothing posted: release for a clean retry
        raise self.retry(exc=exc)
    # Ledger posted: mark the key committed. The key is never released past this
    # point, otherwise a retry would post the same accrual twice. The TTL bounds
    # how long this store protects against a replay.
    state_store.set(key, "committed", ex=86400)
    return {"status": "committed", "key": key, "accrual": str(accrual)}

Key implementation considerations: reuse database and broker connection pools across invocations to avoid TCP-handshake overhead; set worker_max_tasks_per_child so worker processes are recycled before leaked memory accumulates, and keep worker_prefetch_multiplier low so a worker that is draining for a deployment is not holding a backlog of reserved messages; and use Redis SET ... NX EX (the atomic replacement for the two-step SETNX + EXPIRE pattern) or PostgreSQL advisory locks for distributed locking when multiple workers compete for the same vendor claim window. Schema evolution follows semantic versioning, with backward-compatible additions shipping behind a dual-read window, and the layer relies on the upstream normalization stage to deliver typed, UTC-anchored records so no worker performs conversion inside a rule.
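When the dedup marker and the ledger live in the same relational store, the two writes can share one database transaction. The sketch below uses SQLite purely for illustration; the table names, the post_once helper, and the assumption that both tables sit in one database are hypothetical and do not describe the ERP integration itself.

```python
import sqlite3
from decimal import Decimal


def init_store(conn: sqlite3.Connection) -> None:
    """Create the illustrative dedup and ledger tables."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS dedup_keys (
            reconciliation_key TEXT PRIMARY KEY,
            committed_at TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS ledger (
            id INTEGER PRIMARY KEY,
            reconciliation_key TEXT NOT NULL,
            accrual_amount TEXT NOT NULL
        );
        """
    )


def post_once(conn: sqlite3.Connection, key: str, accrual: Decimal) -> str:
    """Write the dedup key and the ledger row in one transaction."""
    with conn:  # commits on normal exit, rolls back if an exception escapes
        try:
            conn.execute(
                "INSERT INTO dedup_keys VALUES (?, datetime('now'))", (key,)
            )
        except sqlite3.IntegrityError:
            return "skipped"  # the primary key already exists: a replay
        conn.execute(
            "INSERT INTO ledger (reconciliation_key, accrual_amount) VALUES (?, ?)",
            (key, str(accrual)),
        )
    return "committed"
```

Because the primary key rejects a second insert of the same key, a redelivery returns "skipped" without touching the ledger, and a failure on the ledger insert rolls the key back with it so a retry starts clean.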

Drift Detection and Validation

A queue is only trustworthy if its async-posted accruals keep matching the ledger. Async batch processing introduces distributed state, so observability is a first-class control, not an afterthought. Effective monitoring tracks the five drift signals below, and anything that fails validation is quarantined (written to a holding table with its mismatch reason intact and an exception ticket raised) rather than silently dropped.

Drift signal	Detection rule	Action
Queue lag surge	pending ÷ processed over threshold	scale workers, raise `BACKLOG_GROWTH`
Latency regression	P95 compute time > tier SLA	page on-call, throttle producers
DLQ volume spike	DLQ rate > baseline	quarantine, flag `FORMAT_DRIFT`
Replayed duplicate	key seen with new `message_id`	skip + log `DEDUP_HIT`
Async-vs-ledger delta	abs(async accrual − ERP posted) > tolerance	open dispute, route to finance

P50/P95/P99 execution times are tracked per queue tier so compute-heavy accruals are never averaged together with fast-path validations. A high DLQ rate is an early signal of vendor file-format drift or a master-data mapping failure rather than a transient blip. Distributed tracing (OpenTelemetry, Jaeger) propagates the correlation_id from ingestion through queue consumption to ERP sync, so a single rebate claim can be traced across its entire lifecycle — catching a DEDUP_HIT or BACKLOG_GROWTH before close is a remediation task, while the same drift discovered after close is a restated accrual and an audit finding.
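Three of the detection rules in the table can be expressed as a small check that runs per queue tier. This is a sketch under assumptions: the thresholds are caller-supplied, and the signal names LATENCY_REGRESSION and LEDGER_DELTA are hypothetical labels (only BACKLOG_GROWTH appears in the table above).

```python
from decimal import Decimal
from statistics import quantiles


def p95(samples: list[float]) -> float:
    """95th percentile of latency samples (requires at least two samples)."""
    return quantiles(samples, n=100)[94]


def drift_signals(
    pending: int,
    processed: int,
    latencies_s: list[float],
    tier_sla_s: float,
    async_accrual: Decimal,
    erp_posted: Decimal,
    lag_threshold: float,
    tolerance: Decimal,
) -> list[str]:
    """Return the drift signals raised for one queue tier."""
    signals = []
    # Queue lag: pending work relative to completed work.
    if pending > 0 and (processed == 0 or pending / processed > lag_threshold):
        signals.append("BACKLOG_GROWTH")
    # Latency regression: P95 compute time against the tier SLA.
    if p95(latencies_s) > tier_sla_s:
        signals.append("LATENCY_REGRESSION")
    # Async-vs-ledger delta, compared in Decimal so the tolerance is exact.
    if abs(async_accrual - erp_posted) > tolerance:
        signals.append("LEDGER_DELTA")
    return signals
```

Running the check separately for each tier, with that tier's own SLA, is what keeps compute-heavy latencies from being averaged into the fast-path numbers.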

Fallback and Dispute Routing

Data gaps and transient failures are inevitable in distributed retail ecosystems, so the queue degrades rather than halts. When a worker exhausts its retry budget, a payload is malformed, or a downstream system stays degraded past the circuit-breaker threshold, the message is routed to the dead-letter queue with its full failure context — attempt count, last exception, source system, and vendor SLA impact — instead of being dropped. The escalation, adjudication, and audit-log mechanics of those holding queues are owned by fallback routing logic; the async layer’s job is to tag each failure with severity and route it deterministically so ops teams can prioritize high-value exceptions. A claim-versus-model delta is never auto-resolved: the modeled accrual is posted, the difference is opened as a dispute ticket, and both figures plus the correlation_id trace travel with it so an analyst can reconstruct the calculation during adjudication. This keeps reconciliation continuous while maintaining a strict trail for every non-standard outcome.
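The retry-or-dead-letter decision can be sketched as a pure function that either schedules another attempt or emits the failure context. The names FailureContext, backoff_delay, and next_step are hypothetical, the 900-second delay ceiling is an assumption, and the "full jitter" variant of exponential backoff is one common choice rather than the documented one.

```python
import random
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FailureContext:
    """Failure details that travel with a dead-lettered message."""
    attempt: int
    last_exception: str
    source_system: str
    vendor_sla_impact: str


def backoff_delay(attempt: int, base_delay_s: float = 30.0, cap_s: float = 900.0) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2 ** (attempt - 1))]."""
    ceiling = min(cap_s, base_delay_s * 2 ** (attempt - 1))
    return random.uniform(0, ceiling)


def next_step(
    attempt: int,
    max_attempts: int,
    exc: Exception,
    source_system: str,
    vendor_sla_impact: str,
) -> dict:
    """Decide what happens after the attempt numbered `attempt` has failed."""
    if attempt < max_attempts:
        return {"action": "retry", "delay_s": backoff_delay(attempt)}
    context = FailureContext(attempt, repr(exc), source_system, vendor_sla_impact)
    return {"action": "dead_letter", "queue_tier": "dead_letter", "context": asdict(context)}
```

Keeping the decision free of side effects makes it easy to test: the caller performs the actual re-enqueue or DLQ write based on the returned action.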

Security and Access Boundaries

Queued payloads carry sensitive negotiated pricing and vendor liability balances, so segregation of duties is enforced at both the broker and warehouse layers. Field-level encryption protects payload and accrual_amount at rest, and role-based access control (RBAC) tags travel with every envelope so authorization is enforced per field rather than per endpoint: vendor managers can resubmit and triage held claims but cannot mutate posted accruals; ETL developers can deploy worker and routing configuration but cannot edit committed ledger entries; trade finance analysts can adjudicate DLQ exceptions and export audit trails but cannot alter a stored accrual amount. Broker credentials, state-store passwords, and signing keys rotate on a fixed schedule and are read from a secrets manager, never baked into worker images. Immutable, hash-chained audit logs capture every enqueue, status transition, manual replay, and override, aligning with SOX and internal control frameworks. Because each committed accrual record is signed, any retroactive edit surfaces as a chain break in reconciliation rather than passing unnoticed. Treating async batch processing as a versioned, access-governed financial control, not a fire-and-forget task runner, is what turns trade-promotion reconciliation from a fragile synchronous bottleneck into a scalable, auditable pipeline: faster vendor settlements, fewer deduction disputes, and a single source of truth for promotional spend that scales predictably alongside business growth.
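The hash-chained audit log mentioned above can be illustrated in a few lines. This sketch covers only the chaining, not signing, encryption, or durable storage, and the entry layout and function names are hypothetical.

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(prev_hash: str, event: dict) -> str:
    """Hash the previous entry's hash together with this event's canonical JSON."""
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{prev_hash}|{body}".encode("utf-8")).hexdigest()


def append_event(log: list, event: dict) -> None:
    """Append an event, linking it to the hash of the entry before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev_hash": prev_hash, "hash": _digest(prev_hash, event)})


def first_break(log: list) -> Optional[int]:
    """Return the index of the first entry that fails verification, or None."""
    prev_hash = GENESIS
    for index, entry in enumerate(log):
        expected = _digest(prev_hash, entry["event"])
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return index
        prev_hash = entry["hash"]
    return None
```

Editing the accrual amount inside an earlier entry changes the digest that entry should have, so first_break reports its index and every later entry is no longer anchored to a verified predecessor.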

Frequently Asked Questions

How is exactly-once accrual achieved when the broker only guarantees at-least-once delivery?

At-least-once is the realistic guarantee for distributed brokers, so exactly-once is enforced at the business level, not the transport level. Each message carries a deterministic reconciliation_key, and the worker claims the key with an atomic SET ... NX EX before any compute. The ledger post and the key commit happen in a single transaction, so a redelivery finds the key already committed and silently skips, and no duplicate accrual is ever written.

What happens to a message that keeps failing?

After the configured retry budget is exhausted under exponential backoff with jitter, the message is moved to the dead-letter queue with its attempt count, last exception, source system, and vendor SLA impact attached. It is never discarded. Fallback routing logic then escalates it to a manual adjudication queue, and the dedup key is released so a corrected resubmission processes cleanly.

Why partition into fast-path and compute-heavy queues instead of one queue?

A single queue lets a slow tiered-accrual calculation sit in front of, and starve, cheap validation work, and it makes latency SLAs impossible to track per workload. Separate tiers let fast-path workers stay stateless and highly parallel while compute-heavy workers are rate-limited against the ERP, and P95 latency is monitored independently so a regression in one tier does not hide behind the other’s average.

Does the async layer recompute promotional eligibility or rebate tiers?

No. Qualification is owned upstream by the eligibility rule framework, and the tier math is owned by payout structure modeling. The async layer consumes already-qualified records, pins the agreement version on the envelope, and invokes the deterministic settlement function, keeping the queue layer free of business-rule duplication and version drift.

Data Ingestion & Normalization Pipelines — the parent discipline that delivers typed, canonical records to the queue.
CSV & EDI Parsing Workflows — the parsing stage whose spikes the async layer absorbs.
Field Mapping Strategies — master-data alignment that runs before accrual computation.
POS & ERP Sync Patterns — the downstream ledger synchronization the queue coordinates with.
Implementing Async Batch Queues for Sales Data — queue-client configuration, payload routing, and consumer scaling in depth.
