Scoring & Confidence Models

Deterministic validation answers a binary question — does a claim satisfy every contractual predicate? — but vendor rebate and trade promotion reconciliation lives in the space where that question has no clean answer. Late distributor feeds, partial shipments, ambiguous identifier mappings, and grace-band timestamps produce records that are neither cleanly valid nor cleanly rejectable. Scoring and confidence models are the sub-problem of attaching a calibrated probability of legitimate settlement to each of those records so the platform can route by risk instead of stalling on uncertainty. This layer sits inside the broader claim validation rule engine: the parent framework establishes whether the hard rules passed, and this layer adjudicates everything the hard rules left ambiguous, converting raw validation residue into auto-post, review, or dispute directives with an auditable confidence score behind each decision.

This page specifies the feature record schema that scoring consumes, the conditional logic that maps validation signals into model inputs, the decimal-safe settlement implications of each confidence band, the Pydantic-validated ETL patterns that keep scoring idempotent, and the drift, dispute, and access controls that keep a probabilistic system defensible under SOX.

Positioning within the rule engine

Confidence scoring runs last in the evaluation order, as a post-validation adjudication layer rather than a gate. By the time a record reaches it, the timeline has been fixed, identities resolved, and quantitative checks executed; scoring never re-derives those facts, it consumes their outcomes. This ordering matters because a score is only meaningful relative to a stable feature vector: if upstream layers can still mutate a record's window or SKU mapping, the score is computed against a moving target and its calibration no longer applies.

The layer draws its inputs from three sibling subsystems. Volume Threshold Validation contributes the magnitude and direction of any tier-attainment deviation — a claim that overshoots its contracted band is a stronger over-accrual signal than one that lands a single unit short. Date Window Alignment Checks hand off boundary-zone flags: transactions inside the configured tolerance band arrive here for probabilistic adjudication rather than auto-rejection. SKU Mapping & Deduplication supplies mapping-confidence and duplicate-overlap signals that flag phantom or double-counted line items. The contractual context that weights these signals originates upstream in agreement schema design and is scoped by the eligibility rule framework; the rate context that translates a score into financial exposure comes from payout structure modeling. Raw source-reliability metadata — feed latency, retransmission counts — flows in from the data ingestion normalization pipelines.

Entity topology and schema specification

Scoring operates on a ClaimFeatureRecord, a versioned, immutable snapshot of every signal available at scoring time. It is not a loose feature dictionary: every field has a fixed type and a deterministic fallback, and the whole record carries a feature_hash so two reconciliation runs over identical inputs produce byte-identical scores. The table below specifies the canonical fields.

| Field | Type | Constraint | Notes |
| --- | --- | --- | --- |
| `claim_id` | `str` | required, immutable | Stable claim-line identifier |
| `model_version` | `str` | required | Pins the scorer + calibration artifact |
| `volume_deviation` | `Decimal` | signed, default `0` | Fractional over/under vs contracted tier |
| `fallback_depth` | `int` | `>= 0`, default `0` | Count of secondary rules triggered upstream |
| `window_offset_hours` | `Decimal` | signed, default `0` | Distance from nearest window boundary |
| `in_tolerance_band` | `bool` | default `False` | True if window check deferred to scoring |
| `mapping_confidence` | `Decimal` | `0`–`1`, default `1` | Identifier resolution certainty |
| `dup_overlap_ratio` | `Decimal` | `0`–`1`, default `0` | Share of line items seen on prior claims |
| `vendor_dispute_rate` | `Decimal` | `0`–`1`, default `0` | Rolling historical rejection ratio |
| `source_reliability` | `Decimal` | `0`–`1`, default `0` | Feed latency / retransmission score; a missing value falls back to the worst case |
| `feature_hash` | `str` (sha256) | computed | Deterministic hash of all inputs |
| `confidence_score` | `Decimal` | `0`–`1`, computed | Calibrated legitimacy probability |
| `risk_band` | `enum` | `high` \| `medium` \| `low` | Mapped from score thresholds |

Every monetary or ratio field is typed as decimal.Decimal, never float: a score that drives accrual must reproduce exactly across runs, and binary floating-point cannot store most decimal fractions exactly, so sums and comparisons at a band edge such as 0.90 can land on the wrong side of the threshold. Null inputs are forbidden at scoring time: each feature carries a deterministic, conservative fallback (a missing source_reliability defaults to the worst case, not the mean) so that absent data depresses confidence rather than silently inflating it. The feature_hash and model_version together are the linchpin of reproducibility: every posted score is stored alongside both, so an auditor can re-instantiate the exact scorer and exact inputs that produced a given routing decision.
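The float concern is easy to reproduce. The sketch below uses made-up numbers (the partial values and the 0.30 boundary are illustrative only, not platform thresholds) to show a sum that should land exactly on a boundary missing it in binary floating point, while the decimal.Decimal version hits it.

```python
from decimal import Decimal

# Hypothetical numbers chosen only to expose the rounding behaviour.
THRESHOLD_FLOAT = 0.3
THRESHOLD_DEC = Decimal("0.30")

# Binary floats: 0.1 + 0.2 is stored as 0.30000000000000004, so a test
# against 0.3 does not behave the way the source code reads.
float_total = 0.1 + 0.2
float_on_boundary = float_total == THRESHOLD_FLOAT

# Decimals built from strings keep the exact base-10 value.
dec_total = Decimal("0.10") + Decimal("0.20")
dec_on_boundary = dec_total == THRESHOLD_DEC

# Building a Decimal directly from a float imports the binary error,
# which is why going through str() first is the safer conversion.
from_float = Decimal(0.1)
from_string = Decimal(str(0.1))
```

Here float_on_boundary is False and dec_on_boundary is True, and only from_string equals Decimal("0.1").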

Conditional logic and rule integration

The first transformation is signal assembly: the heterogeneous outputs of the hard-rule layers are projected into the fixed feature schema with explicit, schema-driven encoding rather than ad-hoc column math. Boundary-zone records carry in_tolerance_band=True; their window_offset_hours becomes a continuous penalty whose weight is configured per promotion type, not hardcoded. Channel and product scope do not re-enter as predicates here — they have already passed — but the count of fallback rules they triggered survives as fallback_depth, because a claim that needed many secondary rules to resolve is structurally noisier than one that matched cleanly.

```python
from decimal import Decimal

def assemble_features(validation: dict) -> dict:
    """Project hard-rule validation output into the fixed scoring schema.

    Every field has a conservative deterministic fallback so that missing
    upstream signals depress confidence rather than inflate it.
    """
    return {
        "claim_id": validation["claim_id"],
        "volume_deviation": Decimal(str(validation.get("tier_deviation", "0"))),
        "fallback_depth": int(validation.get("fallback_rules_fired", 0)),
        "window_offset_hours": Decimal(str(validation.get("window_offset_h", "0"))),
        "in_tolerance_band": bool(validation.get("tolerance_band", False)),
        "mapping_confidence": Decimal(str(validation.get("map_conf", "1"))),
        "dup_overlap_ratio": Decimal(str(validation.get("dup_overlap", "0"))),
        "vendor_dispute_rate": Decimal(str(validation.get("vendor_dispute", "0"))),
        "source_reliability": Decimal(str(validation.get("src_reliability", "0"))),
    }
```

The scorer itself is a pure function of this record. Whether it is a transparent rule-weighted additive model, a calibrated logistic regression, or a gradient-boosted ensemble, the contract is identical: it consumes a ClaimFeatureRecord and emits a Decimal in [0, 1]. This uniform interface is what lets the engine cache by feature_hash and swap models behind a model_version bump without disturbing the routing layer downstream.

| Methodology | Use case | Trade-off |
| --- | --- | --- |
| Rule-weighted additive | High-compliance environments needing full auditability | Transparent, but weak on non-linear feature interactions |
| Calibrated logistic | Baseline probability with confidence intervals | Needs feature selection and multicollinearity checks |
| Gradient-boosted ensemble | Complex promo structures, high vendor variability | Higher compute; requires rigorous drift monitoring |
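As an illustration of the rule-weighted additive row, here is a minimal sketch of a pure scoring function over the feature record. The weights, the WEIGHTS name, and the choice to start from 1 and subtract penalties are all assumptions made for the example; real weights would live in versioned configuration, and the raw output would still pass through calibration before it drives a band.

```python
from decimal import Decimal

# Hypothetical penalty weights, invented for this sketch.
WEIGHTS = {
    "volume_deviation": Decimal("0.30"),
    "fallback_depth": Decimal("0.05"),
    "window_offset_hours": Decimal("0.002"),
    "mapping_gap": Decimal("0.40"),
    "dup_overlap_ratio": Decimal("0.50"),
    "vendor_dispute_rate": Decimal("0.30"),
    "source_gap": Decimal("0.20"),
}

def additive_score(f: dict) -> Decimal:
    """Start from 1, subtract weighted penalties, clamp to [0, 1]."""
    one, zero = Decimal("1"), Decimal("0")
    # The window offset only counts for records deferred from the window check.
    window = abs(f["window_offset_hours"]) if f["in_tolerance_band"] else zero
    penalty = (
        WEIGHTS["volume_deviation"] * abs(f["volume_deviation"])  # sign ignored for brevity
        + WEIGHTS["fallback_depth"] * f["fallback_depth"]
        + WEIGHTS["window_offset_hours"] * window
        + WEIGHTS["mapping_gap"] * (one - f["mapping_confidence"])
        + WEIGHTS["dup_overlap_ratio"] * f["dup_overlap_ratio"]
        + WEIGHTS["vendor_dispute_rate"] * f["vendor_dispute_rate"]
        + WEIGHTS["source_gap"] * (one - f["source_reliability"])
    )
    return min(one, max(zero, one - penalty))

clean = {
    "volume_deviation": Decimal("0"), "fallback_depth": 0,
    "window_offset_hours": Decimal("0"), "in_tolerance_band": False,
    "mapping_confidence": Decimal("1"), "dup_overlap_ratio": Decimal("0"),
    "vendor_dispute_rate": Decimal("0"), "source_reliability": Decimal("1"),
}
noisy = dict(clean, mapping_confidence=Decimal("0.5"),
             dup_overlap_ratio=Decimal("0.2"), source_reliability=Decimal("0"))
```

Because the function reads nothing but its argument, the same record always yields the same Decimal, which is the property the feature_hash cache relies on.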

Financial settlement layer

A confidence score only earns its keep when it maps to money, and that mapping must be exact. Raw model outputs are first calibrated so that a 0.85 genuinely reflects an ~85% likelihood of legitimate settlement; uncalibrated scores silently distort every threshold and misalign reserves. Practitioners apply Platt scaling or isotonic regression against observed claim outcomes — the scikit-learn Calibration of Classifiers documentation covers both and shows how CalibratedClassifierCV wraps an estimator. Calibration is itself versioned: the fitted calibrator is part of the model_version artifact, so a recalibration is a deliberate, auditable event, not a silent drift.
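In practice the fitting would normally be delegated to scikit-learn. To show what isotonic calibration does without that dependency, the sketch below implements the pool-adjacent-violators idea in plain Python. The function names fit_isotonic and calibrate, and the tiny training sample, are hypothetical and exist only for illustration.

```python
from decimal import Decimal

def fit_isotonic(raw_scores, outcomes):
    """Pool-adjacent-violators fit.

    Returns (upper_raw_score, calibrated_probability) steps whose
    probabilities never decrease as the raw score rises.
    """
    pairs = sorted(zip(raw_scores, outcomes))
    blocks = []  # each block: [sum_of_outcomes, count, highest_raw_score]
    for score, settled in pairs:
        blocks.append([Decimal(settled), 1, score])
        # Merge while the previous block's mean exceeds the latest block's
        # mean; cross-multiplying compares the means without dividing.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(top, total / count) for total, count, top in blocks]

def calibrate(steps, raw_score):
    """Look up the calibrated probability for one raw score."""
    for top, prob in steps:
        if raw_score <= top:
            return prob
    return steps[-1][1]  # above every training score: reuse the last step

raw = [Decimal(s) for s in ("0.1", "0.2", "0.3", "0.4", "0.5", "0.6")]
settled = [0, 1, 0, 1, 1, 1]  # 1 = claim settled as legitimate
steps = fit_isotonic(raw, settled)
```

The out-of-order outcomes at raw scores 0.2 and 0.3 are pooled into a single step with probability 0.5, so the fitted mapping stays monotone. The steps list is the kind of fitted artifact that would be stored under model_version.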

Calibrated scores map to operational bands, and each band carries a distinct accrual posture. The arithmetic that decides how much liability to recognize is done in decimal.Decimal to avoid threshold drift at band edges:

```python
from decimal import Decimal

HIGH = Decimal("0.90")
MED = Decimal("0.70")

def risk_band(score: Decimal) -> str:
    if score >= HIGH:
        return "high"     # auto-post full accrual, trigger payment
    if score >= MED:
        return "medium"   # accrue at a haircut, route to lightweight review
    return "low"          # hold at zero accrual, open dispute workflow

def reserve_factor(score: Decimal) -> Decimal:
    """Fraction of gross rebate to recognize given confidence."""
    band = risk_band(score)
    return {"high": Decimal("1.00"),
            "medium": Decimal("0.50"),
            "low": Decimal("0.00")}[band]
```

The actual rebate amount a factor scales against is owned by payout structure modeling and the tier math in Volume Threshold Validation; this layer’s settlement responsibility is solely to decide what fraction of that gross is safe to recognize now versus reserve pending review. Currency is already normalized to the settlement currency before scoring, so the factor multiplies a single-currency Decimal gross and quantizes to the cent. A medium-confidence claim accrues at a haircut rather than the full amount, which keeps the close moving while bounding over-accrual exposure to the genuinely uncertain slice of the book.
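The multiply-and-quantize step can be sketched as follows. The split_accrual helper and the ROUND_HALF_UP rounding mode are assumptions for the example; the source does not state which rounding rule the ledger uses, and the example amount is chosen because it lands exactly on a half cent, where the rule matters.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

def split_accrual(gross: Decimal, factor: Decimal) -> tuple[Decimal, Decimal]:
    """Return (recognized, reserved) so the two parts always sum to gross."""
    # Rounding mode is an assumption; use whatever the ledger policy mandates.
    recognized = (gross * factor).quantize(CENT, rounding=ROUND_HALF_UP)
    # Deriving the reserve by subtraction avoids rounding both halves
    # independently and losing or gaining a cent.
    reserved = gross - recognized
    return recognized, reserved

# A medium-band claim: 1234.57 * 0.50 = 617.285, exactly half a cent.
recognized, reserved = split_accrual(Decimal("1234.57"), Decimal("0.50"))
```

With half-up rounding this recognizes 617.29 and reserves 617.28; banker's rounding would recognize 617.28 instead, which is why the mode should be pinned explicitly rather than left to a default.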

ETL implementation patterns

Scoring is implemented as an idempotent transform: re-running it over the same feature record under the same model_version must upsert an identical score row. Pydantic v2 validates the feature record on assembly, enforces the ratio bounds declaratively, and computes the deterministic hash that keys the upsert.

```python
from decimal import Decimal
from enum import Enum
from hashlib import sha256
from pydantic import BaseModel, ConfigDict, Field, computed_field


class RiskBand(str, Enum):
    high = "high"
    medium = "medium"
    low = "low"


class ClaimFeatureRecord(BaseModel):
    # Frozen so the record behaves as the immutable snapshot described above.
    model_config = ConfigDict(frozen=True)

    claim_id: str
    model_version: str
    volume_deviation: Decimal = Decimal("0")
    fallback_depth: int = Field(default=0, ge=0)
    window_offset_hours: Decimal = Decimal("0")
    in_tolerance_band: bool = False
    # Ratio fields are bounded declaratively to the closed interval [0, 1].
    mapping_confidence: Decimal = Field(default=Decimal("1"), ge=0, le=1)
    dup_overlap_ratio: Decimal = Field(default=Decimal("0"), ge=0, le=1)
    vendor_dispute_rate: Decimal = Field(default=Decimal("0"), ge=0, le=1)
    source_reliability: Decimal = Field(default=Decimal("0"), ge=0, le=1)

    @computed_field
    @property
    def feature_hash(self) -> str:
        # str() keeps trailing zeros, so Decimal("0.5") and Decimal("0.50")
        # hash differently; quantize upstream if feed precision can vary.
        payload = "|".join(str(v) for v in (
            self.claim_id, self.model_version,
            self.volume_deviation, self.fallback_depth,
            self.window_offset_hours, self.in_tolerance_band,
            self.mapping_confidence, self.dup_overlap_ratio,
            self.vendor_dispute_rate, self.source_reliability,
        ))
        return sha256(payload.encode()).hexdigest()
```

The upsert key is (claim_id, model_version, feature_hash). Because the hash folds in every scoring-affecting field plus the model version, re-scoring an unchanged claim under an unchanged model is a no-op, while a feature change or a model bump produces a new immutable row that preserves the prior score for replay. The scoring stage must remain stateless and vectorized: Polars lazy evaluation, with its predicate and projection pushdown, lets the model and the band thresholds be applied as column expressions over batches of millions of records without materializing intermediate frames, and the Polars User Guide covers the optimized execution patterns. Actual throughput depends on model complexity and hardware, so it should be benchmarked rather than assumed. Schema evolution follows the same discipline as agreement schema design: a new feature defaults to its conservative fallback so historical records remain scoreable, and any field that changes a score is versioned rather than mutated in place.
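The idempotency contract on that key can be sketched with the standard-library sqlite3 module. The score_rows table, its column layout, and the helper names are hypothetical; a production store would differ, but the composite primary key plays the same role.

```python
import sqlite3

def open_store() -> sqlite3.Connection:
    """Create an in-memory store keyed on the three-part upsert key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE score_rows (
               claim_id TEXT NOT NULL,
               model_version TEXT NOT NULL,
               feature_hash TEXT NOT NULL,
               confidence_score TEXT NOT NULL,  -- Decimal stored as text
               PRIMARY KEY (claim_id, model_version, feature_hash)
           )"""
    )
    return conn

def upsert_score(conn, claim_id, model_version, feature_hash, score) -> bool:
    """Write the row if its key is new; return True when a row was written."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO score_rows VALUES (?, ?, ?, ?)",
        (claim_id, model_version, feature_hash, str(score)),
    )
    return cur.rowcount == 1

conn = open_store()
first = upsert_score(conn, "C-1", "v3", "abc", "0.92")
replay = upsert_score(conn, "C-1", "v3", "abc", "0.92")     # same key: no-op
changed = upsert_score(conn, "C-1", "v3", "def", "0.71")    # new hash: new row
row_count = conn.execute("SELECT COUNT(*) FROM score_rows").fetchone()[0]
```

Replaying the same key leaves the table untouched, while a changed feature_hash appends a second row and the earlier score stays available for replay.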

Drift detection and validation

A calibrated model is only calibrated against the world it was trained on. Vendor behaviour shifts, promotion structures change, and upstream feeds drift, so the scoring layer monitors itself continuously rather than trusting a one-time calibration. Two distributions are watched per batch: the input feature distributions (via Population Stability Index and Kolmogorov–Smirnov tests) and the output score distribution. A material shift in either — a sudden mass of claims piling into the medium band, or a feature whose PSI breaches tolerance — quarantines the affected batch and raises a retraining ticket before mis-scored claims pollute accruals.
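For reference, a Population Stability Index check over one feature might look like the sketch below. It assumes values already scaled to [0, 1], ten equal-width bins, and a 0.25 tolerance, which is a commonly quoted rule of thumb rather than a value taken from this platform. Floats are used deliberately here because PSI is a monitoring statistic, not a posted amount.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of values in [0, 1]."""
    def shares(values):
        counts = [0] * n_bins
        for v in values:
            counts[min(int(v * n_bins), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

PSI_TOLERANCE = 0.25  # assumed threshold; tune per feature

baseline = [i / 100 for i in range(100)]       # training-time distribution
shifted = [0.5 + x / 2 for x in baseline]      # live batch squeezed into the top half

stable = psi(baseline, baseline)
drifted = psi(baseline, shifted)
breach = "PSI_BREACH" if drifted > PSI_TOLERANCE else None
```

An identical batch scores exactly zero, while the squeezed batch breaches the tolerance and would carry the PSI_BREACH code into quarantine.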

Beyond distribution drift, the layer tracks model quality directly against settled outcomes, recomputing three metrics on each closed cycle: Brier score for calibration, AUC-ROC for ranking quality, and precision-recall at the live thresholds. A widening gap between predicted confidence and realized settlement rate while ranking metrics hold steady is the signal that calibration has decayed, distinct from raw feature drift, and it routes to a recalibration rather than a full retrain. Each drift breach carries an explicit code (PSI_BREACH, CALIBRATION_DECAY, SCORE_SHIFT) so the quarantine is actionable rather than a generic failure, mirroring the quarantine discipline used in Date Window Alignment Checks.
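The two calibration measurements can be sketched in a few lines. The function names and the 0.05 gap tolerance are assumptions for illustration; only the CALIBRATION_DECAY code comes from the text above.

```python
from decimal import Decimal

def brier_score(predicted, settled):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    n = len(predicted)
    return sum((p - Decimal(y)) ** 2 for p, y in zip(predicted, settled)) / n

def calibration_gap(predicted, settled):
    """Mean predicted confidence minus the realized settlement rate."""
    n = len(predicted)
    return sum(predicted) / n - Decimal(sum(settled)) / n

def drift_code(gap, tolerance=Decimal("0.05")):
    """Flag decay when the gap exceeds a hypothetical tolerance."""
    return "CALIBRATION_DECAY" if abs(gap) > tolerance else None

# A closed cycle where the model claimed 90% confidence but half settled.
predicted = [Decimal("0.9")] * 4
settled = [1, 1, 0, 0]

score = brier_score(predicted, settled)
gap = calibration_gap(predicted, settled)
code = drift_code(gap)
```

In this sample the Brier score is 0.41 and the gap is 0.4, so the cycle is flagged. A gap near zero with a stable Brier score would leave the calibrator alone.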

Fallback and dispute routing

Scoring augments the deterministic rules; it never overrides them. When a model output and a hard-rule outcome disagree — a high score on a claim the rule engine flagged, or vice versa — the conflict is not silently resolved in the model’s favour. It is written to a reconciliation audit trail and routed to the exception queue, where a vendor manager or trade finance analyst adjudicates. The default policy for any low-confidence or conflicted record is conservative: hold at zero accrual pending resolution, never accrue optimistically on an uncertain score.

Low-band claims open a structured dispute workflow; medium-band claims route to a lightweight review queue or an automated vendor clarification request; only high-band, non-conflicted claims auto-post. This escalation inherits the same contract as the platform-wide fallback routing logic: every override — a manual approval of a low-confidence claim, a forced re-score, a threshold exception for an SLA-critical payment — is logged immutably with the operator identity, the prior and new feature_hash, the model_version, and a reason code. Critically, the resolved outcome of every manually adjudicated claim feeds back into the labelled training set, so the model learns from exactly the records it was least certain about, tightening future calibration on the hard cases.
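The routing policy described in this section can be condensed into one decision function. This is a sketch under stated assumptions: the HardRule enum, the directive strings, and the exact definition of a conflict (a flagged claim scoring medium or high, or a passed claim scoring low) are interpretations for the example, and the thresholds simply reuse the 0.90 and 0.70 band edges from the settlement layer.

```python
from decimal import Decimal
from enum import Enum

class HardRule(str, Enum):  # hypothetical outcome labels
    passed = "passed"
    flagged = "flagged"
    ambiguous = "ambiguous"

HIGH, MED = Decimal("0.90"), Decimal("0.70")

def route(hard_rule: HardRule, score: Decimal) -> tuple[str, Decimal]:
    """Return (directive, reserve_factor) for one scored claim."""
    band = "high" if score >= HIGH else "medium" if score >= MED else "low"
    # A score never approves what the rule engine flagged, and a low score
    # on a passed claim is treated as a disagreement too.
    conflicted = (
        (hard_rule is HardRule.flagged and band != "low")
        or (hard_rule is HardRule.passed and band == "low")
    )
    if conflicted:
        return "exception_queue", Decimal("0.00")  # hold at zero accrual
    if band == "high":
        return "auto_post", Decimal("1.00")
    if band == "medium":
        return "review_queue", Decimal("0.50")
    return "dispute_workflow", Decimal("0.00")
```

Checking the conflict before the band keeps the conservative default in one place: whatever the score says, a disagreement with the rule engine lands in the exception queue at zero accrual.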

Security and access boundaries

Confidence scores and the features behind them are financial control data. Vendor dispute histories, source-reliability metrics, and the score itself are tagged for role-based access: analysts and vendor managers may view scores and propose overrides, but only a controller role may promote a new model_version or change a band threshold, enforced as a four-eyes check in the configuration store. Models and calibration artifacts ship through a configuration-as-code path — versioned in Git with pull-request review — so no scorer or threshold reaches production by direct database mutation.

Field-level controls protect the most sensitive attributes. Vendor-specific dispute rates and any contract-derived risk weights are encrypted at rest, and the secrets used to sign the immutable score-and-override log are rotated on a fixed schedule. Every score write and every override read is logged with actor, timestamp, feature_hash, and model_version, giving auditors complete lineage from a posted accrual back to the exact model and exact inputs that produced its confidence band.
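One way to make such a log tamper-evident is to chain keyed signatures, so each entry's signature covers both its own content and the previous signature. The sketch below uses the standard-library hmac module. The chaining scheme, the record layout, and the function names are assumptions for illustration, since the source only says the log is signed with rotated secrets.

```python
import hashlib
import hmac
import json

def append_entry(log: list, key: bytes, entry: dict) -> dict:
    """Append an entry signed over its content plus the prior signature."""
    prev_sig = log[-1]["sig"] if log else ""
    body = json.dumps(entry, sort_keys=True) + prev_sig
    sig = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
    record = {"entry": entry, "prev_sig": prev_sig, "sig": sig}
    log.append(record)
    return record

def verify_chain(log: list, key: bytes) -> bool:
    """Recompute every signature; any edit or reorder breaks the chain."""
    prev_sig = ""
    for record in log:
        body = json.dumps(record["entry"], sort_keys=True) + prev_sig
        expected = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
        if record["prev_sig"] != prev_sig:
            return False
        if not hmac.compare_digest(record["sig"], expected):
            return False
        prev_sig = record["sig"]
    return True

key = b"example-signing-key"  # placeholder; a real key comes from a secrets manager
log = []
append_entry(log, key, {"actor": "analyst-1", "action": "score_write",
                        "feature_hash": "abc", "model_version": "v3"})
append_entry(log, key, {"actor": "controller-1", "action": "override",
                        "feature_hash": "abc", "model_version": "v3",
                        "reason_code": "SLA_CRITICAL"})
intact = verify_chain(log, key)
```

Key rotation would require recording which key signed each segment of the chain; that bookkeeping is omitted here.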

Frequently asked questions

Can a confidence score override a hard validation rule?

No. Scoring is an adjudication layer for records the deterministic rules left ambiguous, not a way to reverse a hard failure. When a score and a rule outcome disagree, the conflict is logged to the audit trail and routed to the exception queue for human adjudication; the score never silently approves a claim the rule engine rejected.

Why calibrate scores instead of using raw model output?

Raw classifier outputs often behave as rankings rather than probabilities: a raw 0.85 rarely means an 85% chance of legitimate settlement. Because scores drive accrual fractions and routing thresholds, they must be calibrated (Platt scaling or isotonic regression) against observed outcomes so the bands map to real-world likelihoods. The fitted calibrator is versioned as part of model_version.

How does a medium-confidence claim affect the accrual?

It accrues at a haircut rather than the full rebate. The reserve_factor recognizes a configurable fraction (commonly 50%) of the gross while the claim sits in a lightweight review queue. This keeps month-end close moving without booking the full liability on a claim that is statistically uncertain, bounding over-accrual to the genuinely ambiguous slice of the book.

What triggers a model retrain versus a recalibration?

Input feature drift (a PSI/KS breach) and a widening gap between predicted confidence and realized settlement point to different fixes. A pure calibration decay, where rankings are still good but probabilities are off, routes to a recalibration. Structural feature drift, where the relationships themselves have shifted, routes to a full retrain. Both are deliberate, versioned events, never silent in-place edits.

Claim Validation & Rule Engine Configuration — the parent framework whose hard-rule outputs this layer adjudicates.
Volume Threshold Validation — supplies tier-deviation magnitude as a scoring signal.
Date Window Alignment Checks — hands off tolerance-band transactions for probabilistic adjudication.
SKU Mapping & Deduplication — provides mapping-confidence and duplicate-overlap features.
Payout Structure Modeling — owns the gross rebate that confidence bands scale into accrual.
