SKU Mapping & Deduplication

In vendor rebate and trade promotion reconciliation, no rule can fire correctly until the product it reasons about has a single, stable identity. A promotional item routinely surfaces as SKU-8842 in a vendor EDI 850, as ITM-991044 in a retailer POS extract, and as GTIN-00012345678905 in the manufacturer catalog — three labels for one sellable unit. SKU mapping and deduplication is the sub-problem of collapsing those fragmented identifiers onto one canonical reconciliation key and guaranteeing that the same physical transaction is counted exactly once. It is the identity-resolution layer of the broader claim validation rule engine: the parent framework decides whether a claim is owed, but it cannot do so until this layer establishes what was sold. When identity resolution is loose, the failures are deterministic and expensive — duplicate payouts slip through on remapped codes, promotional volumes inflate against phantom records, and the accrual cannot survive an audit because there is no defensible record of which external code mapped to which canonical SKU.

This page specifies the crosswalk schema that anchors product identity, the conditional logic that scopes a match to the right unit-of-measure and contract, the settlement implications of mis-mapped pack sizes, the ETL patterns that make deduplication idempotent, and the drift-detection and dispute-routing controls that keep the registry auditable under SOX.

Three external codes resolve through effective-dated AliasMap edges onto one CanonicalSku keyed by gtin14 + uom + pack_size; its SalesFact rows are then collapsed by dedup_key so a replayed batch counts each physical sale exactly once.

Positioning within the rule engine

Identity resolution runs near the front of the evaluation order because every quantitative rule downstream assumes the product key is already canonical. The raw, typed records arrive from the data ingestion normalization pipelines; this layer is the first transform that gives those records a shared product vocabulary. When it is skipped or implemented as ad hoc spreadsheet lookups, errors cascade: an unmapped retailer item is dropped from qualifying volume, or two encodings of the same unit are both counted, and the tier attainment computed downstream is wrong before any money is touched.

Concretely, this layer feeds three sibling subsystems. Volume Threshold Validation must execute strictly after canonicalization, because tiered rebates aggregate qualifying units by product — duplicate or mis-mapped rows systematically over- or under-state attainment against the breakpoints. Date Window Alignment Checks and this layer are co-dependent: a verified promotion window prevents pre-promo inventory from collapsing into promo-period sell-through during canonicalization, while canonical keys prevent the same physical sale from appearing under two codes inside one window. And Scoring & Confidence Models consume the match-confidence scores this layer emits, adjudicating fuzzy candidates rather than auto-accepting or auto-rejecting them. The contractual product hierarchies that define what a canonical SKU means originate upstream in agreement schema design and are scoped by the eligibility rule framework.

Entity topology and schema specification

The reconciliation pipeline hinges on a canonical SKU registry that functions as the authoritative crosswalk, translating every external identifier into one internal reconciliation key. The registry is a versioned entity, not a flat lookup file: alias edges are effective-dated, and every mapping decision carries a hash so two runs over the same inputs produce byte-identical canonicalization. The table below specifies the minimum contract the engine depends on.

| Field | Type | Entity | Constraint |
| --- | --- | --- | --- |
| `canonical_sku` | `str` (ULID) | CanonicalSku | immutable, primary key |
| `gtin14` | `str` | CanonicalSku | 14-digit, mod-10 check valid |
| `uom` | `enum` | CanonicalSku | `each` \| `case` \| `pallet`, required |
| `pack_size` | `int` | CanonicalSku | `> 0`, units per `uom` |
| `product_hierarchy_id` | `str` | CanonicalSku | FK to agreement scope |
| `external_id` | `str` | AliasMap | source-system item code |
| `source_system` | `enum` | AliasMap | `vendor_edi` \| `retailer_pos` \| `mfr_catalog` |
| `effective_start` / `effective_end` | `date` | AliasMap | half-open interval, UTC-anchored |
| `match_type` | `enum` | AliasMap | `deterministic` \| `fuzzy` \| `manual` |
| `match_confidence` | `Decimal` | AliasMap | `0.00`–`1.00`, 2 dp |
| `mapping_hash` | `str` (sha256) | AliasMap | deterministic over the alias edge |
| `dedup_key` | `str` (sha256) | SalesFact | over (canonical_sku, uom, txn_date, partner, source_doc) |
`dedup_key`	`str` (sha256)	SalesFact	over (canonical_sku, uom, txn_date, partner, source_doc)

Two design choices carry the layer. First, the canonical key is composite: a 12/12oz case cannot reconcile against a 24/12oz pallet claim even when the base GTIN matches, so identity is keyed on gtin14 + uom + pack_size, never on the bare GTIN. Collapsing pack configurations into one key is the most common source of cross-unit contamination in rebate volume. Second, the alias edges are effective-dated and versioned: catalog rotations, seasonal SKU swaps, and discontinued items mean a single external code can legitimately point at different canonical SKUs over time, and change-data-capture logging on the AliasMap preserves that history for replay. Adherence to the GS1 General Specifications for GTIN-14 normalization eliminates cross-system parsing ambiguity at the boundary.
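
One way a mapping_hash could be made deterministic over an alias edge is to serialize the edge in a canonical form before hashing. The sketch below is an assumption, not the registry's documented method: the choice of fields, the JSON serialization, and the placeholder canonical_sku value are all hypothetical.

```python
import json
from hashlib import sha256


def mapping_hash(edge: dict) -> str:
    """Hash an alias edge so that key order never changes the digest.

    Sorting keys and fixing separators means the same logical edge
    always serializes to the same bytes.
    """
    canonical = json.dumps(edge, sort_keys=True, separators=(",", ":"), default=str)
    return sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical alias edge; the field selection is an assumption.
edge = {
    "external_id": "SKU-8842",
    "source_system": "vendor_edi",
    "canonical_sku": "CSKU-EXAMPLE-0001",  # placeholder, not a real ULID
    "effective_start": "2024-01-01",
    "effective_end": "2024-07-01",
    "match_type": "deterministic",
    "match_confidence": "1.00",
}

# The same edge with its keys in reverse order hashes identically.
reordered = dict(reversed(list(edge.items())))
same_digest = mapping_hash(edge) == mapping_hash(reordered)
```

Any change to a hashed field, such as a new effective_end, yields a different digest, which is what lets an override record carry both the prior and the new hash.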

Conditional logic and rule integration

Standardization happens at ingestion so that downstream joins never fail on formatting. Raw feeds carry inconsistent casing, embedded separators, legacy UPC-A codes, and dropped leading zeros; the normalizer strips non-alphanumeric characters, repads to fixed width, validates the mod-10 check digit, and promotes UPC-A to GTIN-14 before any matching predicate runs.

python

import re

_NON_ALNUM = re.compile(r"[^0-9A-Za-z]")

def to_gtin14(raw: str) -> str:
    """Normalize an arbitrary product code to a zero-padded GTIN-14.

    Strips separators, upper-cases, and left-pads numeric codes to 14
    digits. Non-numeric (proprietary) codes are returned cleaned for
    alias-table resolution rather than coerced into a GTIN.
    """
    cleaned = _NON_ALNUM.sub("", raw).upper()
    return cleaned.zfill(14) if cleaned.isdigit() else cleaned
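
The normalizer above pads and cleans but does not itself check the mod-10 digit. A minimal sketch of that validation step follows, using the standard GS1 weighting (alternating 3 and 1 across the 13 data digits, starting with 3 on the leftmost). The function names are hypothetical.

```python
def gtin14_check_digit(body13: str) -> int:
    """Compute the GS1 mod-10 check digit for the 13 data digits of a GTIN-14."""
    if len(body13) != 13 or not (body13.isascii() and body13.isdigit()):
        raise ValueError("expected exactly 13 ASCII digits")
    # Leftmost data digit carries weight 3, then weights alternate 1, 3, ...
    total = sum(int(d) * (3 if i % 2 == 0 else 1) for i, d in enumerate(body13))
    return (10 - total % 10) % 10


def is_valid_gtin14(code: str) -> bool:
    """True when code is 14 ASCII digits and its last digit matches the check digit."""
    if len(code) != 14 or not (code.isascii() and code.isdigit()):
        return False
    return gtin14_check_digit(code[:13]) == int(code[13])


def promote_upca(upc: str) -> str:
    """Left-pad a 12-digit UPC-A to GTIN-14; leading zeros leave the check digit unchanged."""
    if len(upc) != 12 or not (upc.isascii() and upc.isdigit()):
        raise ValueError("expected a 12-digit UPC-A")
    return upc.zfill(14)
```

For instance, promote_upca("012345678905") yields "00012345678905", which passes is_valid_gtin14, while the same code with its final digit altered does not.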

A canonical match is only valid inside the scope the contract defines. The same external code can map to different canonical SKUs across retail banners, so resolution is keyed on (external_id, source_system, txn_date) against the effective-dated alias table, not on the code alone. Product scope, channel, and the promotion’s product hierarchy layer on top: a match that resolves to a SKU outside the agreement’s product_hierarchy_id is rejected before it reaches volume logic, exactly as the temporal predicate short-circuits the chain in date-window evaluation. This ordering makes canonical resolution a gate in front of the quantitative rules: an unmatched or out-of-scope row never reaches threshold arithmetic.
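
The scoped lookup described above can be sketched as a filter over effective-dated edges followed by a hierarchy check. This is an illustration under assumptions: the AliasEdge dataclass, the hierarchy_of lookup, and the OUT_OF_SCOPE status are hypothetical names, and the in-memory list stands in for whatever store holds the alias table.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class AliasEdge:
    external_id: str
    source_system: str
    canonical_sku: str
    effective_start: date  # inclusive
    effective_end: date    # exclusive (half-open interval)


def resolve(edges: List[AliasEdge], hierarchy_of: Dict[str, str],
            external_id: str, source_system: str, txn_date: date,
            agreement_hierarchy_id: str) -> Tuple[str, Optional[str]]:
    """Return (status, canonical_sku) for one inbound line."""
    live = [e for e in edges
            if e.external_id == external_id
            and e.source_system == source_system
            and e.effective_start <= txn_date < e.effective_end]
    if not live:
        return ("UNMAPPED_SKU", None)
    if len({e.canonical_sku for e in live}) > 1:
        return ("AMBIGUOUS_ALIAS", None)
    sku = live[0].canonical_sku
    # Reject matches outside the agreement's product hierarchy before volume logic.
    if hierarchy_of.get(sku) != agreement_hierarchy_id:
        return ("OUT_OF_SCOPE", None)  # hypothetical status code
    return ("RESOLVED", sku)
```

Because the interval is half-open, a transaction dated exactly on an edge's effective_end resolves to the successor edge rather than matching both.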

Financial settlement layer

Identity errors translate directly into money, and the translation is rarely linear. A pack-size mis-map does not just miscount one unit; it multiplies. If an each-level line is resolved to a case-of-12 key, a single unit accrues twelve times the eligible quantity at the per-each rate, and the reverse mis-map understates the accrual by the same factor. Because of this leverage, all quantity-to-currency arithmetic in the settlement handoff uses decimal.Decimal; float accumulation across thousands of resolved lines drifts and fails cent-level audit reconciliation.

python

from decimal import Decimal, ROUND_HALF_UP

def eligible_amount(claim_units: int, pack_size: int,
                    per_each_rate: Decimal) -> Decimal:
    """Convert a claim expressed in the SKU's UOM to a per-each accrual.

    pack_size is the canonical units-per-UOM resolved from the registry,
    NOT a value taken from the inbound claim, so a mis-stated case factor
    in the vendor feed cannot inflate the payout.
    """
    each_units = Decimal(claim_units) * Decimal(pack_size)
    gross = each_units * per_each_rate
    return gross.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
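
A short worked example may make the leverage and the drift concrete. It is a sketch under a hypothetical per-each rate of 0.10; the accrual helper repeats the same units × pack_size × rate arithmetic so that the block is self-contained.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")
rate = Decimal("0.10")  # hypothetical per-each rebate rate


def accrual(units: int, pack_size: int) -> Decimal:
    """units x pack_size x rate, rounded to the cent."""
    gross = Decimal(units) * Decimal(pack_size) * rate
    return gross.quantize(CENT, rounding=ROUND_HALF_UP)


correct = accrual(units=1, pack_size=1)      # one each, resolved to an each key
mis_mapped = accrual(units=1, pack_size=12)  # same unit, resolved to a case-of-12 key
leverage = mis_mapped / correct              # the pack factor, not a one-unit error

# Float accumulation drifts; Decimal accumulation does not.
float_total = 0.0
decimal_total = Decimal("0.00")
for _ in range(10):
    float_total += 0.10
    decimal_total += rate
```

Here the mis-mapped line accrues 1.20 instead of 0.10, and ten float additions of 0.10 do not sum to exactly 1.0, whereas the Decimal total is exactly 1.00.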

The settlement responsibility of this layer is narrow but absolute: deliver the correct canonical quantity and the correct sellable unit for every resolved line, in the currency the agreement settles in, normalized upstream so no rule performs conversion inline. Tier boundary math itself lives in Volume Threshold Validation, and the rate structures it consumes come from payout structure modeling; this layer simply guarantees those rules are summing real, distinct, correctly-scaled units.

ETL implementation patterns

Deduplication operates on two axes. Record-level dedup prevents the same invoice or POS transaction from being ingested twice; volume-level dedup prevents overlapping promotional windows or multi-tier claims from counting one physical sale more than once. Deterministic logic runs first because it carries zero reconciliation risk and maximizes straight-through processing.

python

from datetime import date
from decimal import Decimal
from enum import Enum
from hashlib import sha256
from pydantic import BaseModel, Field, computed_field, field_validator


class Uom(str, Enum):
    each = "each"
    case = "case"
    pallet = "pallet"


class SalesFact(BaseModel):
    canonical_sku: str
    uom: Uom
    pack_size: int = Field(gt=0)
    txn_date: date
    partner_id: str
    source_doc: str
    units: int = Field(ge=0)
    claimed_amount: Decimal

    @field_validator("canonical_sku")
    @classmethod
    def _resolved(cls, v: str) -> str:
        if not v:
            raise ValueError("canonical_sku must be resolved before dedup")
        return v

    @computed_field
    @property
    def dedup_key(self) -> str:
        payload = "|".join(str(x) for x in (
            self.canonical_sku, self.uom.value, self.txn_date.isoformat(),
            self.partner_id, self.source_doc,
        ))
        return sha256(payload.encode()).hexdigest()

The upsert key is the dedup_key. Because the hash folds in the canonical SKU, the sellable unit, the transaction date, the trading partner, and the source document, a re-ingested row maps to the same target row and the write is a no-op, so the pipeline is idempotent under replay.

Deterministic resolution is implemented as an exact join of normalized external codes against the alias table; in pandas this is a merge(..., indicator=True) to flag unmatched rows, followed by drop_duplicates(subset=["dedup_key"]) to guarantee single-counting.

When retailer data lacks strict formatting or carries OCR noise from scanned invoices, deterministic matching fails and a fuzzy pass takes over: token-set and edit-distance scoring (for example via the maintained thefuzz library) produces candidate matches with a match_confidence. Candidates at or above the auto-accept threshold are written as fuzzy alias edges; mid-band candidates are held for adjudication, and anything below the low threshold is quarantined rather than guessed.

Schema evolution follows the same discipline as agreement schema design: new optional fields default safely, and any field that affects the canonical key or the dedup key is versioned rather than mutated in place.
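
The replay-safe write keyed on dedup_key can be sketched with the standard-library sqlite3 module. The target store is not specified in this page, so SQLite, the sales_fact table layout, and the helper names are stand-ins chosen only to keep the example runnable.

```python
import sqlite3
from typing import Iterable, Tuple

Row = Tuple[str, str, int]  # (dedup_key, canonical_sku, units)


def open_store() -> sqlite3.Connection:
    """Create an in-memory target whose primary key is the dedup_key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales_fact ("
        "dedup_key TEXT PRIMARY KEY, canonical_sku TEXT NOT NULL, units INTEGER NOT NULL)"
    )
    return conn


def load(conn: sqlite3.Connection, rows: Iterable[Row]) -> int:
    """Insert rows; a row whose dedup_key already exists is skipped.

    Returns the number of rows actually written.
    """
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO sales_fact (dedup_key, canonical_sku, units) VALUES (?, ?, ?)",
        list(rows),
    )
    conn.commit()
    return conn.total_changes - before


conn = open_store()
batch = [("k1", "SKU_A", 12), ("k2", "SKU_B", 24)]  # hypothetical keys
first_run = load(conn, batch)   # both rows written
replay = load(conn, batch)      # same batch again: nothing written
```

A skip-on-conflict insert fits facts that are immutable once ingested. If corrected source documents can legitimately revise a row, a versioned update would be needed instead.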

Deterministic joins clear straight through to dedup and accrual; misses fall to fuzzy scoring, and the resulting match_confidence is split across a band — quarantined below low_thr, sent to the review queue and scoring models in the mid range, and auto-accepted at or above high_thr.
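
The three-way split on match_confidence can be sketched as below. The threshold values are hypothetical, and difflib.SequenceMatcher from the standard library is used here only as a stand-in scorer; it is not the token-set scoring that thefuzz provides.

```python
from decimal import Decimal
from difflib import SequenceMatcher

LOW_THR = Decimal("0.60")   # hypothetical low threshold
HIGH_THR = Decimal("0.92")  # hypothetical auto-accept threshold


def score(candidate: str, reference: str) -> Decimal:
    """Case-insensitive similarity in [0, 1], stored at 2 dp like match_confidence."""
    ratio = SequenceMatcher(None, candidate.upper(), reference.upper()).ratio()
    return Decimal(str(ratio)).quantize(Decimal("0.01"))


def route(match_confidence: Decimal) -> str:
    """Map a confidence to its band: auto_accept, review, or quarantine."""
    if match_confidence >= HIGH_THR:
        return "auto_accept"
    if match_confidence >= LOW_THR:
        return "review"
    return "quarantine"
```

Both boundaries are inclusive on the upper band, so a score equal to HIGH_THR auto-accepts and a score equal to LOW_THR goes to review, matching the "at or above" wording.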

Drift detection and validation

A correct registry still degrades when upstream catalogs shift. Drift detection monitors the rate of unmatched codes, the share of records resolved by fuzzy rather than deterministic logic, and the distribution of match_confidence per source system, comparing each against a rolling baseline. A sudden rise in fuzzy resolution, or a spike of brand-new external codes from one banner, signals a catalog rotation or a feed format change that must be investigated before it pollutes accruals.

Concretely, each batch emits an unmatched-rate, a fuzzy-share, and a duplicate-collapse count. A breach routes the affected records to a quarantine table with an explicit mismatch code (UNMAPPED_SKU, AMBIGUOUS_ALIAS, PACK_SIZE_CONFLICT) rather than failing the whole pipeline, so clean records still settle on schedule. Records in the mid-confidence band are not auto-rejected — they are handed to Scoring & Confidence Models, which weigh vendor history and source reliability to decide whether a fuzzy candidate is likely genuine. Every quarantine decision writes a hash-based audit record of the input and output state so the resolution is reproducible.
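
A minimal sketch of the per-batch metrics and the baseline comparison follows. The BatchMetrics shape, the simple mean over prior batches, and the fixed tolerance are assumptions; the page does not specify the baseline window or the breach rule.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass(frozen=True)
class BatchMetrics:
    total: int
    unmatched: int
    fuzzy: int
    duplicates_collapsed: int

    @property
    def unmatched_rate(self) -> float:
        return self.unmatched / self.total if self.total else 0.0

    @property
    def fuzzy_share(self) -> float:
        # Share of resolved records that needed the fuzzy pass.
        resolved = self.total - self.unmatched
        return self.fuzzy / resolved if resolved else 0.0


def breaches(current: BatchMetrics, baseline: List[BatchMetrics],
             tolerance: float = 0.05) -> List[str]:
    """Names of metrics exceeding the baseline mean by more than tolerance.

    baseline must be non-empty; tolerance is a hypothetical absolute margin.
    """
    flagged = []
    for name in ("unmatched_rate", "fuzzy_share"):
        expected = mean(getattr(b, name) for b in baseline)
        if getattr(current, name) > expected + tolerance:
            flagged.append(name)
    return flagged
```

A breach list like ["unmatched_rate"] would then drive routing of the affected records to quarantine while the rest of the batch settles.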

Fallback and dispute routing

Not every product resolves programmatically. When an external code maps to two live canonical SKUs, or a feed presents a pack configuration that contradicts the registry, the pipeline triggers fallback routing instead of accruing optimistically. The default policy is conservative: an unresolved or conflicting line is held at zero accrual in a quarantine table, never settled on a guess. Fallback chains prioritize the agreement’s documented product hierarchy, vendor-approved cross-reference exhibits, and historical precedent, and they inherit the same escalation contract as the platform-wide fallback routing logic.

Unresolved records route to a structured exception queue where vendor managers and trade finance analysts adjudicate, triaged by dollar impact rather than record count so the largest leakage risks clear first. Every override — an approved new alias, a corrected pack factor, a manual canonical assignment — is written to an immutable audit log with operator identity, the prior and new mapping_hash, and a reason code. Once a mapping is approved it is backfilled into the registry and the reconciliation job re-executes via delta processing, recomputing only the affected partition rather than the full table. This decoupling of hard failures from soft exceptions keeps the accrual cycle moving while preserving a defensible audit trail.
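
Triage by dollar impact rather than record count can be sketched by grouping quarantined lines on the unresolved external code and ranking the groups by total exposure. The tuple shape and the grouping key are assumptions made for illustration.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Dict, Iterable, List, Tuple


def triage(exceptions: Iterable[Tuple[str, Decimal]]) -> List[Tuple[str, Decimal]]:
    """Rank unresolved external codes by total absolute claimed amount, largest first.

    Each input item is (external_id, claimed_amount).
    """
    exposure: Dict[str, Decimal] = defaultdict(Decimal)
    for external_id, claimed_amount in exceptions:
        exposure[external_id] += abs(claimed_amount)
    return sorted(exposure.items(), key=lambda kv: kv[1], reverse=True)


queue = [
    ("ITM-1", Decimal("5.00")),
    ("ITM-1", Decimal("5.00")),
    ("ITM-1", Decimal("5.00")),
    ("ITM-2", Decimal("4000.00")),
]
ranked = triage(queue)
```

In this hypothetical queue the single ITM-2 line outranks the three ITM-1 lines, because ordering follows exposure, not line count.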

Security and access boundaries

The canonical registry is financial control data and is protected accordingly. Alias edges and pack factors are tagged for role-based access: trade finance analysts and vendor managers may propose mappings, but only a controller role may approve a manual alias or override a pack configuration, enforced as a four-eyes check in the configuration store. Mapping changes flow through a configuration-as-code path — versioned in Git with pull-request review — so the registry is never altered by direct database mutation.

Field-level controls protect the most sensitive attributes. Vendor-specific cross-reference tables and contractual product terms are encrypted at rest, and the secrets used to sign the immutable audit log are rotated on a fixed schedule. Every read and write of an alias edge is logged with actor, timestamp, and mapping_hash, giving auditors a complete lineage from a posted accrual back to the exact, signed mapping decision that produced the canonical quantity it was based on.
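
One way an audit entry could be signed so that tampering is detectable is an HMAC over a canonical serialization of the entry. The page does not name the signing scheme, so HMAC-SHA256, the entry fields, and the function names below are assumptions.

```python
import hashlib
import hmac
import json


def sign_entry(entry: dict, key: bytes) -> str:
    """Return an HMAC-SHA256 signature over the entry's canonical JSON form."""
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_entry(entry: dict, signature: str, key: bytes) -> bool:
    """Constant-time check that the signature matches the entry."""
    return hmac.compare_digest(sign_entry(entry, key), signature)


# Hypothetical override record; a real key would come from a secrets manager.
key = b"example-signing-key"
entry = {
    "actor": "controller_01",
    "timestamp": "2024-07-01T00:00:00Z",
    "prior_mapping_hash": "aaa",
    "new_mapping_hash": "bbb",
    "reason_code": "PACK_SIZE_CONFLICT",
}
signature = sign_entry(entry, key)
```

When signing keys rotate, each entry would also need to record which key version signed it so that older entries remain verifiable; that detail is omitted here.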

Frequently asked questions

Why key the canonical SKU on UOM and pack size instead of just the GTIN?

Because rebate volume is paid per sellable unit. A 12/12oz case and a 24/12oz pallet can share a base GTIN yet represent different contractual units, so collapsing them onto a bare GTIN cross-contaminates qualifying volume. The composite gtin14 + uom + pack_size key keeps each sellable unit distinct and prevents a case claim from accruing at per-each scale.

What happens when deterministic matching fails on a noisy retailer feed?

The record drops to a fuzzy pass that scores candidates with token-set and edit-distance metrics. Candidates above the auto-accept threshold are written as fuzzy alias edges; mid-band candidates go to scoring and confidence models for adjudication; anything below the low threshold is quarantined at zero accrual. Nothing is guessed into settlement.

How does deduplication stay idempotent across pipeline replays?

Every sales fact carries a dedup_key hashed over the canonical SKU, UOM, transaction date, partner, and source document. The upsert target is keyed on that hash, so a re-ingested row maps to the same target row and the write is a no-op. Re-running the job over identical inputs produces identical output.

How are catalog rotations and discontinued SKUs handled without breaking history?

Alias edges are effective-dated with a half-open interval and CDC-logged. A single external code can point at different canonical SKUs over time; resolution always uses the transaction date to select the edge that was in force, so historical claims re-validate against the mapping that actually applied when the sale occurred.

Related pages

Claim Validation & Rule Engine Configuration: the parent framework this identity layer feeds.
Volume Threshold Validation: aggregates the canonical, deduplicated units against tier breakpoints.
Date Window Alignment Checks: confines canonicalization to the verified promotion window.
Scoring & Confidence Models: adjudicates mid-confidence fuzzy matches.
Agreement Schema Design: defines the product hierarchies a canonical SKU resolves into.