Predicting Same-Day Checkout Failures With Feature-Arrival Time Windows
Written by
Sage Stream
I ran into a weird failure mode while building a predictive model for “will this checkout fail today?” The data looked fine in batch, but during real-time training, the model started thrashing—sometimes it “knew” a failure too early, sometimes it missed it entirely.
What I discovered (after a weekend of staring at logs) was that the real culprit wasn’t the algorithm—it was feature arrival timing. In other words: the same row of user/order data could arrive at different moments, and my model was accidentally learning from information that would normally show up only after the prediction cutoff.
This post walks through a concrete, niche approach I built: predicting same-day checkout failures using feature-arrival time windows. The goal is simple: train on what would have been known at prediction time, and make the data observability around timing explicit.
The specific problem: “Same-day” failures are time-sensitive
Imagine you need to predict, for each order, whether the checkout will fail within the remainder of the day (same calendar day as the order start).
There are at least two timestamps in play:
- order_created_at: when the order starts
- checkout_event_at: when failure/success is recorded
And there’s a third, often overlooked timestamp:
feature_ingested_at: when each feature (risk score, payment metadata, fraud signals) was ingested into your warehouse
Why this breaks models
If you train using features that were ingested after your prediction cutoff, the model “cheats.” In offline evaluation, it may look great, but in production it collapses because those features won’t be available yet.
So I added a rule: only use feature values whose ingestion time is before the prediction cutoff.
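The rule is easy to state in code. Here's a minimal sketch (pure Python, a hypothetical helper rather than my production code) of "last known value strictly before the cutoff":

```python
from datetime import datetime

def last_value_before(events, cutoff):
    """Return the most recent feature value ingested strictly before cutoff.

    events: list of (ingested_at, value) tuples.
    Returns None if nothing was known in time -- the model must treat
    that as 'missing' rather than peek at later arrivals.
    """
    known = [(t, v) for t, v in events if t < cutoff]
    if not known:
        return None
    # Latest ingestion wins.
    return max(known, key=lambda tv: tv[0])[1]

# A risk score updated at 09:00 and again at 23:50, with a 23:00 cutoff:
events = [
    (datetime(2024, 5, 1, 9, 0), 0.12),
    (datetime(2024, 5, 1, 23, 50), 0.87),  # arrives after cutoff -> excluded
]
cutoff = datetime(2024, 5, 1, 23, 0)
print(last_value_before(events, cutoff))  # -> 0.12
```

The 23:50 update is the "better" value, but the model could never have seen it at prediction time, so it must be excluded.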
The niche pattern: Feature-arrival time windows
I implemented a “windowed” feature computation:
For each order, and for each predictive signal (e.g., risk_score, card_bin_risk), create features like:
- risk_score_last_available_before_cutoff
- risk_score_lookback_6h_mean (but only if those source events were ingested before cutoff)
- failed_checkouts_count_last_24h (computed from events that were ingested before cutoff)
This turns timing into a first-class feature engineering concern.
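In pandas, this "last available before cutoff" pattern maps neatly onto merge_asof with a backward search. A small sketch, assuming toy orders/events DataFrames shaped like the tables described in the next section:

```python
import pandas as pd

# Hypothetical inputs: one row per order (with its cutoff), plus a log
# of time-stamped feature updates.
orders = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "cutoff_at": pd.to_datetime(["2024-05-02 00:00", "2024-05-02 00:00"]),
})
events = pd.DataFrame({
    "order_id": ["o1", "o2", "o1"],
    "feature_ingested_at": pd.to_datetime(
        ["2024-05-01 09:00", "2024-05-01 12:00", "2024-05-02 01:00"]),
    "risk_score": [0.12, 0.40, 0.87],
})

# For each order, pick the last event ingested strictly before the cutoff.
# allow_exact_matches=False makes the comparison strict (<, not <=).
features = pd.merge_asof(
    orders.sort_values("cutoff_at"),
    events.sort_values("feature_ingested_at"),
    left_on="cutoff_at",
    right_on="feature_ingested_at",
    by="order_id",
    direction="backward",
    allow_exact_matches=False,
)
print(features[["order_id", "risk_score"]])
# o1 keeps 0.12 (its 01:00 next-day update arrived after cutoff); o2 keeps 0.40
```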
Example schema (what I used)
I used three tables:
- orders
- feature_events (time-stamped feature updates)
- checkout_outcomes (labels)
Tables
```sql
-- Orders: one row per order
-- order_id is the join key for everything.
CREATE TABLE orders (
    order_id         TEXT PRIMARY KEY,
    user_id          TEXT NOT NULL,
    order_created_at TIMESTAMP NOT NULL
);

-- Feature events: each row is an update to some feature for an order.
-- feature_name/value can be multiple per order.
CREATE TABLE feature_events (
    order_id            TEXT NOT NULL,
    feature_name        TEXT NOT NULL,
    feature_value       DOUBLE PRECISION,
    feature_ingested_at TIMESTAMP NOT NULL,
    PRIMARY KEY (order_id, feature_name, feature_ingested_at)
);

-- Checkout outcomes: label is whether failure occurs same calendar day.
CREATE TABLE checkout_outcomes (
    order_id          TEXT PRIMARY KEY,
    checkout_event_at TIMESTAMP NOT NULL,
    checkout_status   TEXT NOT NULL  -- 'success' or 'failure'
);
```
Step 1: Define the prediction cutoff
For same-day prediction, the cutoff can be “end of day at a fixed time” or “now” depending on your pipeline.
In my case, the prediction ran as orders were ingested, but the feature-availability cutoff was anchored to the order's local day end.
In SQL terms: for each order, the cutoff is date_trunc('day', order_created_at) + interval '1 day'.
To keep the example concrete, I’ll predict at day-end and label “failure if checkout happens before day-end”.
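The SQL expression above has a direct Python equivalent. A minimal sketch (note that a real deployment would need to handle the order's local time zone, which I'm ignoring here):

```python
from datetime import datetime, timedelta

def day_end_cutoff(order_created_at: datetime) -> datetime:
    """Python equivalent of date_trunc('day', ts) + interval '1 day'."""
    day_start = order_created_at.replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return day_start + timedelta(days=1)

print(day_end_cutoff(datetime(2024, 5, 1, 14, 30)))  # 2024-05-02 00:00:00
```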
Step 2: Create labels that respect the same-day boundary
Here’s the label logic I used:
- label = 1 if there exists a checkout_event_at on the same calendar day and status is 'failure'
- otherwise label = 0
```sql
-- Build training labels aligned to the same-day definition.
WITH labeled AS (
    SELECT
        o.order_id,
        o.user_id,
        o.order_created_at,
        co.checkout_event_at,
        co.checkout_status,
        -- end of the calendar day for the order
        (date_trunc('day', o.order_created_at) + interval '1 day') AS cutoff_at,
        CASE
            WHEN co.checkout_status = 'failure'
             AND co.checkout_event_at <
                 (date_trunc('day', o.order_created_at) + interval '1 day')
            THEN 1 ELSE 0
        END AS label
    FROM orders o
    JOIN checkout_outcomes co ON co.order_id = o.order_id
)
SELECT * FROM labeled;
```
Step 3: Build “feature availability” windows (no cheating)
Now the key: when I compute a feature, I filter to only feature events ingested before the cutoff.
Single-value example: last known risk score before cutoff
```sql
WITH labeled AS (
    SELECT
        o.order_id,
        o.user_id,
        o.order_created_at,
        co.checkout_event_at,
        co.checkout_status,
        (date_trunc('day', o.order_created_at) + interval '1 day') AS cutoff_at,
        CASE
            WHEN co.checkout_status = 'failure'
             AND co.checkout_event_at <
                 (date_trunc('day', o.order_created_at) + interval '1 day')
            THEN 1 ELSE 0
        END AS label
    FROM orders o
    JOIN checkout_outcomes co ON co.order_id = o.order_id
),
risk_last AS (
    SELECT
        l.order_id,
        -- Last feature ingestion before cutoff
        (
            SELECT fe.feature_value
            FROM feature_events fe
            WHERE fe.order_id = l.order_id
              AND fe.feature_name = 'risk_score'
              AND fe.feature_ingested_at < l.cutoff_at
            ORDER BY fe.feature_ingested_at DESC
            LIMIT 1
        ) AS risk_score_last_available
    FROM labeled l
)
SELECT
    l.order_id,
    l.user_id,
    l.order_created_at,
    l.cutoff_at,
    r.risk_score_last_available,
    l.label
FROM labeled l
LEFT JOIN risk_last r ON l.order_id = r.order_id;
```
Why this matters: if a risk score update arrives after the cutoff, it’s excluded. That keeps training aligned with production reality.
Step 4: Add lookback windows (still respecting cutoff)
I also added rolling aggregates for features like “how many recent payment failures were known before cutoff.”
To do that, I used event tables to compute counts, but the same principle holds: only include rows ingested before cutoff_at.
Here’s a simplified example using a hypothetical feature event called payment_failure_signal (1 or 0).
```sql
WITH labeled AS (
    SELECT
        o.order_id,
        o.user_id,
        o.order_created_at,
        co.checkout_event_at,
        co.checkout_status,
        (date_trunc('day', o.order_created_at) + interval '1 day') AS cutoff_at,
        CASE
            WHEN co.checkout_status = 'failure'
             AND co.checkout_event_at <
                 (date_trunc('day', o.order_created_at) + interval '1 day')
            THEN 1 ELSE 0
        END AS label
    FROM orders o
    JOIN checkout_outcomes co ON co.order_id = o.order_id
),
features AS (
    SELECT
        l.order_id,
        -- Lookback window anchored to cutoff:
        -- include only feature events ingested before cutoff_at,
        -- and within the last 24h of cutoff time.
        SUM(
            CASE
                WHEN fe.feature_ingested_at < l.cutoff_at
                 AND fe.feature_ingested_at >= (l.cutoff_at - interval '24 hours')
                THEN fe.feature_value
                ELSE 0
            END
        ) AS payment_failure_signal_count_24h
    FROM labeled l
    LEFT JOIN feature_events fe
        ON fe.order_id = l.order_id
       AND fe.feature_name = 'payment_failure_signal'
    GROUP BY l.order_id
)
SELECT
    l.order_id,
    l.user_id,
    l.order_created_at,
    l.cutoff_at,
    f.payment_failure_signal_count_24h,
    l.label
FROM labeled l
LEFT JOIN features f ON l.order_id = f.order_id;
```
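The windowing condition is just two inequalities on the ingestion timestamp. A small pandas sketch of the same logic for a single order, with hypothetical toy data:

```python
import pandas as pd
from datetime import timedelta

# Hypothetical feature events for one order.
events = pd.DataFrame({
    "order_id": ["o1", "o1", "o1"],
    "feature_name": ["payment_failure_signal"] * 3,
    "feature_value": [1.0, 1.0, 1.0],
    "feature_ingested_at": pd.to_datetime([
        "2024-04-30 20:00",  # more than 24h before cutoff -> excluded
        "2024-05-01 18:00",  # inside the window -> counted
        "2024-05-02 01:00",  # after cutoff -> excluded
    ]),
})
cutoff_at = pd.Timestamp("2024-05-02 00:00")

# Same two inequalities as the SQL CASE expression.
mask = (
    (events["feature_name"] == "payment_failure_signal")
    & (events["feature_ingested_at"] < cutoff_at)
    & (events["feature_ingested_at"] >= cutoff_at - timedelta(hours=24))
)
count_24h = events.loc[mask, "feature_value"].sum()
print(count_24h)  # 1.0 -- only the 18:00 event falls inside the window
```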
Step 5: Add data observability for “feature latency”
After implementing cutoff-aligned features, my model improved—but only after I added one more guardrail: a metric for how “fresh” features were.
For each order and feature, I computed:
- risk_score_age_seconds = cutoff_at - last_ingested_at
- if no feature exists, I used a sentinel and flagged it
This became my early warning system for broken pipelines.
```sql
WITH labeled AS (
    SELECT
        o.order_id,
        o.user_id,
        o.order_created_at,
        co.checkout_event_at,
        co.checkout_status,
        (date_trunc('day', o.order_created_at) + interval '1 day') AS cutoff_at,
        CASE
            WHEN co.checkout_status = 'failure'
             AND co.checkout_event_at <
                 (date_trunc('day', o.order_created_at) + interval '1 day')
            THEN 1 ELSE 0
        END AS label
    FROM orders o
    JOIN checkout_outcomes co ON co.order_id = o.order_id
),
risk_last AS (
    SELECT
        l.order_id,
        fe_last.feature_value       AS risk_score_last_available,
        fe_last.feature_ingested_at AS risk_last_ingested_at
    FROM labeled l
    LEFT JOIN LATERAL (
        SELECT fe.feature_value, fe.feature_ingested_at
        FROM feature_events fe
        WHERE fe.order_id = l.order_id
          AND fe.feature_name = 'risk_score'
          AND fe.feature_ingested_at < l.cutoff_at
        ORDER BY fe.feature_ingested_at DESC
        LIMIT 1
    ) fe_last ON TRUE
)
SELECT
    l.*,
    r.risk_score_last_available,
    CASE
        WHEN r.risk_last_ingested_at IS NULL THEN -1
        ELSE EXTRACT(EPOCH FROM (l.cutoff_at - r.risk_last_ingested_at))
    END AS risk_score_age_seconds,
    CASE WHEN r.risk_last_ingested_at IS NULL THEN 1 ELSE 0 END AS risk_score_missing
FROM labeled l
LEFT JOIN risk_last r ON l.order_id = r.order_id;
```
This is the part I wish I had earlier: even a correct feature cutoff can silently degrade if upstream ingestion lags.
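If you'd rather compute the freshness columns downstream of the query, the same sentinel-plus-flag logic is a few lines of pandas. A sketch, assuming a DataFrame with cutoff_at and a (possibly null) risk_last_ingested_at:

```python
import pandas as pd

# Hypothetical query result: o1 has a fresh score, o2 has none in time.
df = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "cutoff_at": pd.to_datetime(["2024-05-02 00:00", "2024-05-02 00:00"]),
    "risk_last_ingested_at": [pd.Timestamp("2024-05-01 22:00"), pd.NaT],
})

# Age in seconds relative to the cutoff; NaT propagates to NaN.
age = (df["cutoff_at"] - df["risk_last_ingested_at"]).dt.total_seconds()
df["risk_score_age_seconds"] = age.fillna(-1)  # same -1 sentinel as the SQL
df["risk_score_missing"] = df["risk_last_ingested_at"].isna().astype(int)

print(df[["order_id", "risk_score_age_seconds", "risk_score_missing"]])
# o1 is 7200 seconds (2h) old; o2 gets the -1 sentinel and the missing flag
```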
Step 6: Train a simple model (with working Python)
Below is a minimal working Python flow using pandas + scikit-learn. It assumes you’ve already pulled the training dataset into a DataFrame called df with columns:
- risk_score_last_available
- risk_score_age_seconds
- risk_score_missing
- label
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression

# df should come from your SQL query result.
# Columns expected:
# - risk_score_last_available (float or NaN)
# - risk_score_age_seconds (float, can be -1 for missing)
# - risk_score_missing (0/1)
# - label (0/1)

feature_cols = ["risk_score_last_available", "risk_score_age_seconds", "risk_score_missing"]
target_col = "label"

X = df[feature_cols]
y = df[target_col]

# Split by time in real systems; random split here for simplicity.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

num_features = ["risk_score_last_available", "risk_score_age_seconds"]
cat_features = ["risk_score_missing"]  # treat as categorical for a simple encoder

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median"))
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_features),
        ("cat", categorical_transformer, cat_features),
    ]
)

model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=2000))
])

model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
print("ROC AUC:", auc)
print("Example coefficients (post-transform):", model.named_steps["clf"].coef_)
```
What I observed when I ran this
- With cutoff-respecting features, ROC AUC stabilized across runs.
- When I intentionally removed the feature_ingested_at < cutoff_at filter, the offline score jumped—but production-style backtests cratered. The cheating was measurable, and the timing features (like risk_score_age_seconds) showed huge shifts.
Step 7: Data quality checks that caught the real bugs
After adding risk_score_missing and risk_score_age_seconds, I put two checks into my pipeline:
- Missing-rate check per feature per hour/day
  - If risk_score_missing spikes, upstream ingestion is broken or delayed.
- Latency distribution drift
  - If median risk_score_age_seconds rises, the model is increasingly relying on stale signals.
These checks are observability. They’re not “nice to have”—they directly explain model drift.
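Both checks reduce to a group-by over the scored dataset. A hedged sketch with hypothetical data and made-up alert thresholds (0.5 missing rate, one-hour median age) that you'd tune to your own pipeline's baseline:

```python
import pandas as pd

# Hypothetical scored rows carrying the two observability columns.
df = pd.DataFrame({
    "cutoff_at": pd.to_datetime(["2024-05-01"] * 4 + ["2024-05-02"] * 4),
    "risk_score_missing": [0, 0, 0, 1, 1, 1, 1, 0],
    "risk_score_age_seconds": [600, 800, 700, -1, -1, -1, -1, 5000],
})

daily = df.groupby(df["cutoff_at"].dt.date).agg(
    missing_rate=("risk_score_missing", "mean"),
    # Drop the -1 sentinel so it doesn't distort the latency median.
    median_age=("risk_score_age_seconds", lambda s: s[s >= 0].median()),
)
print(daily)

# Flag any day that trips either check.
alerts = daily[(daily["missing_rate"] > 0.5) | (daily["median_age"] > 3600)]
print(alerts.index.tolist())  # only the second day trips the checks
```

In a real pipeline these would feed an alerting system rather than a print, but the aggregation is the whole idea.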
Conclusion
I built a same-day checkout failure predictor where the key technique wasn’t the model—it was feature-arrival time windows. By filtering feature events to only those ingested before the prediction cutoff, I prevented training-time leakage and made “what the model could know” explicit. Adding simple observability signals like risk_score_age_seconds and risk_score_missing turned silent ingestion issues into measurable, debuggable events, and that made the predictive analytics pipeline reliable in real time.