Data Ingestion & Normalization Pipelines

In vendor rebate and trade promotion reconciliation, financial accuracy is never a downstream miracle; it is an upstream engineering discipline. The moment raw transactional, accrual, and claim data enters the ecosystem, the reconciliation engine’s outcome is already largely determined by the quality, consistency, and structural alignment of that data. A promotion is planned in one system, executed at thousands of stores, invoiced over EDI, and settled in a ledger that knows nothing about the original deal — and every format mismatch, encoding artefact, or unaligned date between those layers becomes a deduction dispute weeks later. When the ingestion layer is treated as plumbing rather than as a financial control, the predictable failures follow: duplicate accruals on re-sent files, claims matched against sales dated outside the promotion window, currency and unit-of-measure drift that quietly inflates payouts, and an audit trail that cannot explain why a record looks the way it does.

This page is the architectural reference for that upstream layer. It describes the canonical data model the pipeline emits, the guarantees that make ingestion safe to re-run, the rule-driven cleansing that turns heterogeneous feeds into typed records, and the financial, exception-handling, and governance controls that let finance defend every number. It is the parent topic for four implementation areas: sync patterns, parsing workflows, field mapping, and asynchronous execution. It sits directly upstream of the core architecture and promotion mapping layer, which maps clean records onto agreements and in turn feeds the claim validation rule engine that adjudicates vendor claims. The audience is concrete: Python ETL developers who own the pipeline, trade finance analysts who defend the accrual, and vendor managers who answer for the dispute.

Canonical Data Model Overview

A pipeline is only as deterministic as the records it emits. Every reconciliation decision made downstream traverses a small set of entity types, so those entities must resolve to stable, hashable identifiers before anything financial happens. The ingestion and normalization layer is responsible for producing four canonical record families and the keys that bind them together:

Source envelope — the immutable record of what arrived: the raw payload, its source system, transport (SFTP, API, EDI VAN), receipt timestamp, byte-level content hash, and the parser version that read it. Nothing is mutated in place; the envelope is the forensic anchor for every later transformation.
Transaction fact — a single normalized movement: a POS scan line, a distributor sell-through row, or an EDI 852 product-activity segment. Each fact carries a canonical SKU, a quantity in a single unit of measure, an amount in a single currency, and a UTC-anchored business date.
Claim line — a vendor’s asserted entitlement: the deduction or rebate the vendor believes it is owed, mapped to a promotion reference and an agreement so the downstream engine can confirm or contest it.
Reference resolution — the bridge records that tie incoming SKUs, vendor IDs, store numbers, and reason codes to the authoritative product, vendor, and contract registries. An unresolved reference is a first-class exception, never a silent null.
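As a rough illustration of the source envelope described above, the following sketch wraps raw bytes in an immutable record with a byte-level content hash. The class name, field names, and the choice to derive the envelope id from the content hash with `uuid.uuid5` are assumptions made for this example, not part of the documented model; deriving the id from content is one way to make a re-sent file resolve to the same envelope.

```python
import hashlib
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: the envelope is never mutated in place
class SourceEnvelope:
    envelope_id: str        # deterministic UUID (assumption: derived from content)
    source_system: str
    transport: str          # e.g. "SFTP", "API", "EDI_VAN"
    received_at: datetime   # receipt timestamp in UTC
    content_hash: str       # SHA-256 over the raw bytes
    parser_version: str
    payload: bytes


def seal_envelope(payload: bytes, source_system: str, transport: str,
                  parser_version: str) -> SourceEnvelope:
    """Wrap raw bytes in an immutable envelope keyed by their content hash."""
    content_hash = hashlib.sha256(payload).hexdigest()
    # Same source system + same bytes -> same envelope id on every replay.
    envelope_id = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source_system}:{content_hash}"))
    return SourceEnvelope(
        envelope_id=envelope_id,
        source_system=source_system,
        transport=transport,
        received_at=datetime.now(timezone.utc),
        content_hash=content_hash,
        parser_version=parser_version,
        payload=payload,
    )


first = seal_envelope(b"sku,qty\nV-001,3\n", "retailer-a", "SFTP", "1.2.0")
resent = seal_envelope(b"sku,qty\nV-001,3\n", "retailer-a", "SFTP", "1.2.0")
```

Here `first` and `resent` differ only in `received_at`; their ids and content hashes are equal, which is what lets a duplicate file be recognized before any row is staged.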

The relationships between these families are what the rest of the system reasons over, so they are made explicit and version-anchored rather than inferred at query time. A transaction fact references the version of the master-data mapping that resolved it; a claim line references exactly one promotion version. The table below specifies the minimum canonical schema the normalization engine emits for a transaction fact, which the core architecture and promotion mapping layer consumes without further conversion.

| Field | Type | Constraint | Notes |
| --- | --- | --- | --- |
| `record_key` | `str` (SHA-256 hex) | primary, deterministic | hash of source envelope id + line ordinal + parser version |
| `source_envelope_id` | `str` (UUID) | required, FK | links every fact back to the raw payload |
| `canonical_sku` | `str` | required, resolved | output of master-data resolution, never the raw vendor SKU |
| `quantity` | `Decimal` | `>= 0`, scale 4 | always in canonical UoM (eaches) |
| `uom_source` | `str` | required | original unit before harmonization, retained for audit |
| `amount` | `Decimal` | scale 2 | always in settlement currency |
| `currency` | `str` (ISO 4217) | required | post-conversion currency code |
| `business_date` | `datetime` (UTC) | required | anchored at ingestion, never local |
| `promotion_ref` | `str` \| `null` | nullable, resolved | resolved promotion; null routes to exception lane |
| `schema_version` | `str` (semver) | required | the canonical contract version that produced this row |

Because the entire dataset is keyed on a deterministic record_key, re-ingesting the same source file produces byte-identical keys, which is what makes the upstream layer safe to replay — the foundation for every idempotency guarantee that follows.

Pipeline Design Principles

Four properties separate a reconciliation-grade pipeline from a fragile script: idempotency, deterministic hashing, schema versioning, and temporal anchoring. They are not optional hardening — they are the difference between a pipeline you can re-run during a month-end close and one whose retry doubles a vendor’s deduction.

Idempotency. Network retries, scheduler restarts, and re-sent vendor files are normal operating conditions, not edge cases. Every ingestion step is modeled as an upsert keyed on the deterministic record_key, so processing the same payload twice converges to one row rather than two. Writes use INSERT ... ON CONFLICT DO UPDATE semantics, and side effects (queue publishes, ledger postings) are guarded by the same key so a replay never emits a duplicate accrual.
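A minimal sketch of that upsert behaviour, using the standard-library `sqlite3` module as a stand-in for the real staging store. The table and column names are hypothetical, and the `ON CONFLICT ... DO UPDATE` clause assumes SQLite 3.24 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transaction_fact (
        record_key    TEXT PRIMARY KEY,
        canonical_sku TEXT NOT NULL,
        quantity      TEXT NOT NULL,  -- Decimal kept as text to avoid float coercion
        amount        TEXT NOT NULL
    )
""")

UPSERT = """
    INSERT INTO transaction_fact (record_key, canonical_sku, quantity, amount)
    VALUES (:record_key, :canonical_sku, :quantity, :amount)
    ON CONFLICT (record_key) DO UPDATE SET
        canonical_sku = excluded.canonical_sku,
        quantity      = excluded.quantity,
        amount        = excluded.amount
"""


def stage(rows):
    """Upsert a batch inside one transaction; replays converge on the same rows."""
    with conn:
        conn.executemany(UPSERT, rows)


def staged_count():
    return conn.execute("SELECT COUNT(*) FROM transaction_fact").fetchone()[0]


batch = [
    {"record_key": "k1", "canonical_sku": "SKU-1", "quantity": "12.0000", "amount": "30.00"},
    {"record_key": "k2", "canonical_sku": "SKU-2", "quantity": "24.0000", "amount": "55.50"},
]
stage(batch)
stage(batch)  # simulated retry or re-sent file
```

After both calls the table still holds two rows. Guarding side effects such as queue publishes would need a similar keyed check, which this sketch does not cover.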

Deterministic hashing. Every field that can influence a downstream financial outcome is folded into a content hash. Two runs over the same bytes must produce the same keys and the same canonical rows — otherwise reconciliation is not reproducible and an audit cannot be defended. Hashing happens at the byte level on the source envelope and again on the normalized row, so a change anywhere in the chain is detectable.

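```python
import hashlib
from decimal import Decimal
from datetime import datetime, timezone
from pydantic import BaseModel, field_validator

class TransactionFact(BaseModel):
    source_envelope_id: str
    line_ordinal: int       # position of this line within the envelope
    parser_version: str     # version of the parser that read the envelope
    canonical_sku: str
    quantity: Decimal
    amount: Decimal
    currency: str
    business_date: datetime
    schema_version: str

    @field_validator("business_date")
    @classmethod
    def must_be_utc(cls, v: datetime) -> datetime:
        # Temporal anchoring: reject naive datetimes, normalize to UTC.
        if v.tzinfo is None:
            raise ValueError("business_date must be timezone-aware")
        return v.astimezone(timezone.utc)

    @property
    def record_key(self) -> str:
        # Identity key, as in the schema table: envelope id + line ordinal +
        # parser version. Two identical lines in one file keep distinct keys.
        payload = "|".join([
            self.source_envelope_id,
            str(self.line_ordinal),
            self.parser_version,
        ])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    @property
    def content_hash(self) -> str:
        # Hash of the normalized row: a change to any financially relevant
        # field changes this value even when record_key stays the same.
        payload = "|".join([
            self.canonical_sku,
            f"{self.quantity:.4f}",
            f"{self.amount:.2f}",
            self.currency,
            self.business_date.isoformat(),
            self.schema_version,
        ])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```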

Schema versioning. The canonical contract is versioned in Git with semantic versioning, and every emitted row stamps the schema_version that produced it. A breaking change to the canonical model bumps the major version and runs a backfill migration rather than silently re-interpreting historical rows under new semantics. Pydantic v2 models are the executable form of that contract — a parser that produces a row failing validation is rejected at the boundary, not absorbed.
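One way to decide whether a historical row needs the backfill migration is to compare major versions. The helper names below are hypothetical, and the sketch assumes plain `MAJOR.MINOR.PATCH` strings with no pre-release or build suffixes.

```python
def parse_semver(version: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into integers; raises ValueError if malformed."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def needs_backfill(row_version: str, current_version: str) -> bool:
    """A row written under an older major version must be migrated, not re-read."""
    return parse_semver(row_version)[0] < parse_semver(current_version)[0]
```

Under this rule a row stamped `1.4.2` requires migration once the contract reaches `2.0.0`, while a row stamped `2.0.0` is still readable under `2.3.1`.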

Temporal anchoring. Trade promotions live and die on dates: a sale one day outside a promotion window is not eligible, no matter how close. Every timestamp is anchored to UTC at the moment of ingestion, the original local time and offset are retained on the envelope for audit, and all downstream window comparisons operate exclusively on the anchored value. This single discipline eliminates the most common source of false exceptions — promotions that look expired because a feed arrived in a different timezone.
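The anchoring step can be sketched with the standard library alone. The function names and the half-open window convention `[start, end)` are assumptions for illustration; the fixed UTC-5 offset stands in for whatever offset a real feed carries.

```python
from datetime import datetime, timedelta, timezone


def anchor_to_utc(local_ts: datetime) -> tuple[datetime, str]:
    """Return (UTC-anchored timestamp, original ISO string retained for audit)."""
    if local_ts.tzinfo is None or local_ts.utcoffset() is None:
        raise ValueError("timestamp must carry an explicit offset")
    return local_ts.astimezone(timezone.utc), local_ts.isoformat()


def in_window(anchored: datetime, start: datetime, end: datetime) -> bool:
    """Half-open check [start, end), performed only on UTC-anchored values."""
    return start <= anchored < end


# A sale at 23:30 on 31 March in a UTC-5 store is 04:30 UTC on 1 April.
store_tz = timezone(timedelta(hours=-5))
sale = datetime(2024, 3, 31, 23, 30, tzinfo=store_tz)
anchored, original = anchor_to_utc(sale)

# The March promotion window, expressed in UTC from the same store offset.
window_start = datetime(2024, 3, 1, tzinfo=store_tz).astimezone(timezone.utc)
window_end = datetime(2024, 4, 1, tzinfo=store_tz).astimezone(timezone.utc)
eligible = in_window(anchored, window_start, window_end)
```

Because both the sale and the window bounds are converted before comparison, the late-evening sale stays eligible even though its UTC calendar date is already 1 April.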

These principles are implemented across the four child topics. Disciplined POS and ERP sync patterns govern when data is pulled and how conflicts between systems are resolved; robust CSV and EDI parsing workflows turn raw transmissions into validated rows; deterministic field mapping strategies perform the canonical resolution; and async batch processing makes all of it scale without sacrificing ordering or idempotency.

Cleansing and Resolution Architecture

Inside the normalization engine, transformation is organized as a directed acyclic graph (DAG) of deterministic operators rather than a linear script. Each node consumes typed input and emits typed output, declares its dependencies explicitly, and is pure with respect to its inputs — given the same row and the same reference snapshot, it always produces the same result. The DAG topology lets the engine parallelize independent operators, short-circuit a row to the exception lane the moment any node fails, and reconstruct the exact path a record took.
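A small sketch of that idea using the standard-library `graphlib` module (Python 3.9+). The three operators, the dict-shaped row, and the `SKU_MAP` snapshot are hypothetical placeholders; a real engine would use typed models and many more nodes.

```python
from decimal import Decimal, InvalidOperation
from graphlib import TopologicalSorter

SKU_MAP = {"V-001": "SKU-1"}  # hypothetical master-data snapshot


def decode(row: dict) -> dict:
    return {**row, "sku": row["sku"].strip()}


def coerce(row: dict) -> dict:
    return {**row, "qty": Decimal(row["qty"])}


def resolve(row: dict) -> dict:
    if row["sku"] not in SKU_MAP:
        raise LookupError(f"unresolved SKU: {row['sku']!r}")
    return {**row, "canonical_sku": SKU_MAP[row["sku"]]}


OPERATORS = {"decode": decode, "coerce": coerce, "resolve": resolve}
# Each node maps to the set of nodes it depends on.
DEPENDS_ON = {"decode": set(), "coerce": {"decode"}, "resolve": {"coerce"}}


def run_dag(row: dict) -> dict:
    """Run operators in dependency order; stop at the first failing node."""
    path = []
    for name in TopologicalSorter(DEPENDS_ON).static_order():
        try:
            row = OPERATORS[name](row)
        except (InvalidOperation, LookupError) as exc:
            return {"status": "exception", "failed_at": name,
                    "reason": str(exc), "path": path}
        path.append(name)
    return {"status": "staged", "row": row, "path": path}
```

The returned `path` records which operators completed, so a failed row carries both the node that rejected it and the steps it passed first.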

Evaluation order follows a strict precedence so cheaper, more fundamental checks fail fast before expensive resolution runs:

Structural phase — encoding normalization, delimiter and segment integrity, required-header presence. A file that fails here never reaches semantic logic. (X12 EDI is the classic trap: segments end at the terminator declared in the ISA header, conventionally ~, not at newlines, so a parser that splits on line breaks silently corrupts any interchange whose segments do not line up with its lines.)
Semantic phase — type coercion into the Pydantic contract, range checks (non-negative quantities, plausible amounts), and date validity. Rows that coerce cleanly proceed; rows that do not are quarantined with the failing field named.
Resolution phase — master-data lookups that translate raw vendor SKUs, store numbers, and deduction reason codes into canonical identifiers. This phase is where heterogeneity is finally collapsed: retailer terminology, vendor contract language, and internal accounting codes all map to one model.
Routing phase — resolved rows are staged; unresolved or ambiguous rows are emitted to the exception lane with a structured reason.
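To make the structural-phase X12 point concrete, the sketch below reads the element separator and segment terminator from the fixed-width ISA header (positions 4 and 106) instead of assuming line breaks. The function name is hypothetical and the sample interchange is invented for illustration; it is not a complete or valid 852 document.

```python
def split_x12_segments(interchange: str) -> list[list[str]]:
    """Split an X12 interchange into segments using the delimiters in its ISA header."""
    if not interchange.startswith("ISA") or len(interchange) < 106:
        raise ValueError("not an X12 interchange: missing or truncated ISA header")
    element_sep = interchange[3]      # 4th character of the ISA segment
    segment_term = interchange[105]   # 106th character of the ISA segment
    segments = []
    for raw in interchange.split(segment_term):
        raw = raw.strip("\r\n")       # line breaks after a terminator are cosmetic
        if raw:
            segments.append(raw.split(element_sep))
    return segments


# Build a 106-character ISA header from its 16 fixed-width elements.
isa_fields = ["00", " " * 10, "00", " " * 10, "ZZ", "SENDER".ljust(15),
              "ZZ", "RECEIVER".ljust(15), "240101", "1200", "U", "00401",
              "000000001", "0", "P", ">"]
isa = "*".join(["ISA", *isa_fields]) + "~"

# Two segments share each physical line, so line-based splitting would merge them.
sample = isa + "\n" + "ST*852*0001~LIN**UP*012345678905~\n" + "ZA*QS*24*EA~SE*4*0001~\n"
segments = split_x12_segments(sample)
segment_ids = [seg[0] for seg in segments]
```

Splitting `sample` on newlines yields three lines, while the terminator-aware split recovers all five segments.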

This rule-evaluation discipline mirrors how the downstream eligibility rule framework orders its predicates — structural before dimensional before quantitative — so an analyst tracing a record sees the same precedence model on both sides of the handoff. The normalization DAG never makes an eligibility decision; it guarantees that when the eligibility engine runs, every field it reads is canonical, typed, and resolved.

A representative resolution operator, expressed as a pure function over Decimal values:

```python
from decimal import Decimal, ROUND_HALF_UP

# Illustrative factors only: real case and pallet pack sizes vary by SKU and
# would come from the master-data registry rather than a constant.
UOM_TO_EACHES = {"CASE": Decimal("12"), "PALLET": Decimal("960"), "EACH": Decimal("1")}

def harmonize_quantity(raw_qty: Decimal, uom: str) -> Decimal:
    """Collapse any source UoM to canonical eaches; unknown UoM is an exception."""
    factor = UOM_TO_EACHES.get(uom.upper())
    if factor is None:
        raise LookupError(f"unresolved UoM: {uom!r}")
    # Quantize to the canonical scale of 4 used by the quantity field.
    return (raw_qty * factor).quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)
```

Financial Modeling and Accrual Readiness

The ingestion layer does not compute final rebate liability — that is the job of payout structure modeling — but it determines whether that computation can ever be correct. Two responsibilities sit squarely here: producing decimal-safe monetary values and aligning the measurement basis so tier math is meaningful.

All monetary and quantity arithmetic uses decimal.Decimal with an explicit context and ROUND_HALF_UP quantization. Float is never used for money: accumulating thousands of line-item amounts in binary floating point introduces rounding drift that fails audit reconciliation to the cent. Currency conversion is applied once, at normalization, against a versioned rate table whose effective date matches the transaction’s business date — converting later, inside a rule, would make the same record reconcile differently depending on when the rule ran.
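The sketch below illustrates both points: float accumulation drifting where `Decimal` does not, and a one-time conversion against a dated rate table. The `RATES` values, the USD settlement currency, and the "latest rate effective on or before the business date" lookup rule are assumptions chosen for the example.

```python
from datetime import date
from decimal import Decimal, ROUND_HALF_UP

SETTLEMENT_CURRENCY = "USD"  # assumption for this example

# Hypothetical versioned rate table: (currency, effective_from) -> rate to USD.
RATES = {
    ("EUR", date(2024, 1, 1)): Decimal("1.10"),
    ("EUR", date(2024, 2, 1)): Decimal("1.08"),
}


def rate_for(currency: str, business_date: date) -> Decimal:
    """Pick the latest rate whose effective date is on or before the business date."""
    candidates = [(eff, rate) for (cur, eff), rate in RATES.items()
                  if cur == currency and eff <= business_date]
    if not candidates:
        raise LookupError(f"no rate for {currency} on {business_date}")
    return max(candidates, key=lambda pair: pair[0])[1]


def convert_to_settlement(amount: Decimal, currency: str, business_date: date) -> Decimal:
    """Convert once, at normalization, and quantize to two decimal places."""
    if currency != SETTLEMENT_CURRENCY:
        amount = amount * rate_for(currency, business_date)
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Ten line items of 0.10 each: binary floats drift, Decimal does not.
float_total = sum([0.1] * 10)
decimal_total = sum(Decimal("0.10") for _ in range(10))
```

Because the rate is selected by the transaction's business date rather than by the time of processing, re-running the conversion later gives the same staged amount.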

Tier-aware accrual downstream depends on the pipeline emitting a clean cumulative quantity per agreement scope. For an incremental tier structure, the rebate over a period is the sum of each band’s slice rated at its own rate:

$$ R = \sum_{i=1}^{n} r_i \cdot \big(\min(Q, u_i) - \ell_i\big)^{+} $$

where $Q$ is canonical cumulative quantity, $\ell_i$ and $u_i$ are the lower and upper breakpoints of band $i$, $r_i$ is that band’s rate, and $(x)^{+}$ denotes $\max(x, 0)$. The pipeline’s contribution is guaranteeing that $Q$ is exact: every contributing fact resolved to the same canonical SKU, the same UoM, and the same agreement scope, with no double-counted re-sent line. A retroactive structure re-rates all qualifying units at the highest attained band — a step change in liability — which makes the integrity of $Q$ even more consequential, because a single duplicated or mis-anchored fact can tip cumulative volume across a breakpoint and move the entire period’s payout.
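A worked sketch of the two structures, with hypothetical breakpoints and rates. The incremental function follows the formula above term by term; the retroactive function assumes a band is attained once $Q$ reaches its lower breakpoint, which is a convention chosen here rather than one stated in the text.

```python
from decimal import Decimal

# Hypothetical bands: (lower, upper, rate per unit); upper=None means unbounded.
BANDS = [
    (Decimal("0"), Decimal("1000"), Decimal("0.10")),
    (Decimal("1000"), Decimal("5000"), Decimal("0.15")),
    (Decimal("5000"), None, Decimal("0.20")),
]


def incremental_rebate(q: Decimal, bands) -> Decimal:
    """Sum over bands of rate * max(min(Q, upper) - lower, 0)."""
    total = Decimal("0")
    for lower, upper, rate in bands:
        capped = q if upper is None else min(q, upper)
        total += rate * max(capped - lower, Decimal("0"))
    return total


def retroactive_rebate(q: Decimal, bands) -> Decimal:
    """Rate every unit at the highest band whose lower breakpoint Q has reached."""
    attained = [rate for lower, _, rate in bands if q >= lower]
    return q * attained[-1] if attained else Decimal("0")


below = Decimal("4999")
at_break = Decimal("5000")
incremental = (incremental_rebate(below, BANDS), incremental_rebate(at_break, BANDS))
retroactive = (retroactive_rebate(below, BANDS), retroactive_rebate(at_break, BANDS))
```

With these numbers one additional unit moves the incremental rebate from 699.85 to 700.00, but moves the retroactive rebate from 749.85 to 1000.00, which is why a single duplicated fact near a breakpoint matters so much.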

Because accruals feed the general ledger, the staged dataset is shaped to support revenue-recognition treatment under IFRS 15 / ASC 606: trade spend is modeled as variable consideration, so each staged fact retains the linkage and timestamps needed to recognize, true-up, or reverse an accrual in the correct period. The write-once nature of the source envelope means a prior-period restatement can always be reconstructed from original inputs rather than reverse-engineered from a mutated table.

Exception Taxonomy and Routing

No pipeline operates on perfect data; the measure of a reconciliation-grade architecture is how legibly it handles the imperfect. Failures are classified into a fixed taxonomy, each tier carrying a routing priority and an expected service level so analysts triage by impact rather than by arrival order.

| Category | Example | Routing | SLA target |
| --- | --- | --- | --- |
| Structural | malformed EDI segment, bad encoding, missing header | auto-retry once, then dead-letter | resolve same business day |
| Semantic | negative quantity, out-of-range date, type coercion failure | quarantine, named-field ticket | analyst review < 24h |
| Resolution | unresolved SKU, unknown vendor ID, no promotion match | hold in exception lane, mapping ticket | mapping owner < 48h |
| Business-rule | deduction outside promotion window, unauthorized reason code | escalate to vendor manager | adjudication per contract |
| Duplicate | re-sent file, replayed envelope | dedupe by content hash, no ticket | automatic |

Recoverable formatting issues are auto-corrected where a deterministic rule exists; ambiguous or financially material cases are quarantined for human judgment; suspected compliance breaches escalate immediately. Crucially, an exception is never a dropped row — it is a routed row with a structured reason code, so the dead-letter queue is a worklist, not a graveyard. Records that resolve cleanly but carry low-confidence mappings are flagged rather than rejected, and the downstream claim validation rule engine weighs that confidence signal when it adjudicates the corresponding claim. When reference data is genuinely missing, the pipeline degrades through fallback routing logic — applying a default policy and a dispute marker — rather than failing the whole batch.

```python
from enum import Enum

class ExceptionClass(str, Enum):
    STRUCTURAL = "structural"
    SEMANTIC = "semantic"
    RESOLUTION = "resolution"
    BUSINESS_RULE = "business_rule"
    DUPLICATE = "duplicate"

ROUTING_PRIORITY = {
    ExceptionClass.BUSINESS_RULE: 1,   # financial / compliance exposure first
    ExceptionClass.RESOLUTION: 2,
    ExceptionClass.SEMANTIC: 3,
    ExceptionClass.STRUCTURAL: 4,
    ExceptionClass.DUPLICATE: 5,       # automatic, lowest human priority
}
```
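A priority map like this could drive an impact-ordered worklist. The sketch below is standalone and hypothetical: it keys priorities by plain strings so it runs on its own, and the `ExceptionLane` and `RoutedException` names are invented for the example. Arrival order is used only to break ties within a class.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class RoutedException:
    priority: int                              # lower number = triaged first
    sequence: int                              # arrival order, breaks ties
    record_key: str = field(compare=False)
    exception_class: str = field(compare=False)
    reason: str = field(compare=False)         # structured reason code


class ExceptionLane:
    """Worklist that hands out routed rows by impact rather than arrival order."""

    def __init__(self, priorities: dict):
        self._priorities = priorities
        self._heap = []
        self._sequence = 0

    def route(self, record_key: str, exception_class: str, reason: str) -> None:
        item = RoutedException(self._priorities[exception_class], self._sequence,
                               record_key, exception_class, reason)
        self._sequence += 1
        heapq.heappush(self._heap, item)

    def next_item(self) -> RoutedException:
        return heapq.heappop(self._heap)


lane = ExceptionLane({"business_rule": 1, "resolution": 2, "semantic": 3,
                      "structural": 4, "duplicate": 5})
lane.route("k1", "structural", "MISSING_HEADER")
lane.route("k2", "business_rule", "OUTSIDE_PROMO_WINDOW")
lane.route("k3", "resolution", "UNRESOLVED_SKU")
triage_order = [lane.next_item().record_key for _ in range(3)]
```

The structural failure arrived first but is handed out last, after the business-rule and resolution items.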

Governance, Audit Trail, and Access Control

Everything the pipeline does is written to an append-only, hash-chained audit log: each entry records the input content hash, the operator and its version, the output hash, and the actor or service identity, with each entry referencing the hash of the previous one so any tampering breaks the chain. This is what supports SOX audit expectations: an external auditor can take any staged accrual, follow its record_key back through every transformation to the original source envelope, and verify the chain has not been altered. Selected control points (period close, manual overrides) are cryptographically signed so the provenance of a financially material change is non-repudiable.
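The chaining mechanism can be sketched in a few lines. The entry layout, the all-zero genesis value, and the use of canonical JSON as the hashed representation are assumptions for illustration; signing of control points is not shown.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def _entry_hash(body: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so the hash is reproducible.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_entry(log: list, *, input_hash: str, operator: str,
                 operator_version: str, output_hash: str, actor: str) -> None:
    """Append an entry that commits to the hash of the entry before it."""
    body = {
        "prev_hash": log[-1]["entry_hash"] if log else GENESIS,
        "input_hash": input_hash,
        "operator": operator,
        "operator_version": operator_version,
        "output_hash": output_hash,
        "actor": actor,
    }
    log.append({**body, "entry_hash": _entry_hash(body)})


def verify_chain(log: list) -> bool:
    """Recompute every hash and link; any edited entry breaks verification."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if entry["prev_hash"] != prev or entry["entry_hash"] != _entry_hash(body):
            return False
        prev = entry["entry_hash"]
    return True


audit_log = []
append_entry(audit_log, input_hash="a1", operator="harmonize_quantity",
             operator_version="1.0.0", output_hash="b2", actor="svc-normalizer")
append_entry(audit_log, input_hash="b2", operator="convert_currency",
             operator_version="1.3.0", output_hash="c3", actor="svc-normalizer")
intact = verify_chain(audit_log)
```

Changing any field of an earlier entry after the fact causes `verify_chain` to return `False`, because the stored hash no longer matches the recomputed one.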

Access is governed by role-based access control with field-level granularity. Raw vendor pricing and negotiated rebate rates are restricted; an ETL operator can see that a row failed resolution without seeing the contracted rate behind it. Secrets for source connectors — SFTP keys, EDI VAN credentials, API tokens — live in a managed secret store with scheduled rotation, never in code or config. Changes to the canonical schema, the master-data mappings, or the normalization operators are gated through pull-request review and validated against a synthetic test corpus before they can affect production accruals, so a mapping change that would shift payouts is caught before it touches a vendor’s settlement.

Continuous monitoring closes the loop: throughput, exception rates by category, resolution latency, and reconciliation match percentage are tracked as first-class health metrics. A sudden rise in resolution-class exceptions, for instance, is an early signal of upstream schema drift — a vendor silently changed an export format — and surfaces as an alert long before it manifests as a disputed deduction.

Frequently Asked Questions

What stops a re-sent vendor file from doubling an accrual?

Every row carries a deterministic record_key derived from the source envelope id, line ordinal, and parser version, and all writes are idempotent upserts keyed on it. Re-ingesting the same bytes converges to the same rows rather than appending new ones, and the duplicate is logged under the DUPLICATE exception class with no analyst ticket created. Side effects like queue publishes and ledger postings are guarded by the same key, so a replay during a month-end close never emits a second accrual.

Why anchor every timestamp to UTC at ingestion instead of at query time?

Trade promotion eligibility is decided by whether a transaction’s business date falls inside a window, and a feed arriving in a different timezone can make an eligible sale look expired by a few hours. Anchoring to UTC the instant a record is received — while retaining the original local time and offset on the source envelope for audit — means every downstream window comparison operates on one consistent value, which removes the single most common cause of false exceptions.

Why use decimal.Decimal across the pipeline rather than float?

Accumulating thousands of line-item amounts in binary floating point introduces rounding drift that fails audit reconciliation to the cent. The pipeline uses decimal.Decimal with an explicit context and ROUND_HALF_UP quantization for every monetary and quantity value, so two runs over the same inputs produce byte-identical staged figures — a prerequisite for both reproducibility and a defensible audit.

What happens to a record whose SKU or vendor can't be resolved?

It is never dropped. The resolution phase emits it to the exception lane under the RESOLUTION class with a structured reason and a mapping ticket routed to the mapping owner. The row waits there as an actionable worklist item; once the master-data mapping is added, replaying the source envelope resolves the row deterministically. Where a default policy is contractually allowed, fallback routing applies it with a dispute marker rather than blocking the whole batch.

POS & ERP sync patterns — pull cadence, conflict resolution, and window triggering across systems
CSV & EDI parsing workflows — delimiter, encoding, and X12 segment handling that produces validated rows
Field mapping strategies — canonical SKU, UoM, currency, and reason-code resolution
Async batch processing — queue topology and idempotent workers for high-throughput ingestion
Core architecture & promotion mapping — the sibling layer that maps these clean records onto agreements
Claim validation rule engine configuration — the sibling layer that adjudicates vendor claims against the staged dataset
