Data Science · April 22, 2026

Building an SCD Type 2 “Event-Time Merge” for Late-Arriving Clickstream in dbt and Snowflake


Written by

Sage Stream


The problem that pulled me into this niche is simple: clickstream data arrives late and in chunks, but my “user journey” tables need to reflect the truth as of the event time, not the load time.

I built a small pipeline for a client where a user’s “attribution campaign” could change based on an event that showed up hours after the fact. The classic Slowly Changing Dimension (SCD Type 2) pattern can track changes over time, but it usually keys off the load/processing time. That breaks when events are delayed.

So I implemented an event-time merge for SCD Type 2 in dbt on Snowflake, designed specifically for late-arriving clickstream and backfilled attribution fields.

Below is what I learned by actually running it.


What I was trying to guarantee

I maintain a dimension table that stores one row per user per attribution state:

  • When a user’s campaign_id (derived from clickstream) changes, we should:
    • close the previous “version” row with an effective_end_ts
    • open a new version row with effective_start_ts
  • Crucially, effective_start_ts and effective_end_ts must reflect event time, not when the data arrived.

Definitions (so the code doesn’t feel like magic)

  • Event time: when the clickstream event happened (event_ts)
  • Processing time: when I loaded it into the warehouse (loaded_at)
  • SCD Type 2: a table strategy that keeps history by creating a new row for each change, rather than overwriting.

The schema I used

I had two tables:

1) Staging event table (incremental loads)

This table contains raw derived attribution events.

-- staging.attribution_events
-- one row per user event that can affect attribution
user_id        STRING
event_ts       TIMESTAMP_NTZ
campaign_id    STRING
source_system  STRING
loaded_at      TIMESTAMP_NTZ

2) Dimension history table

-- analytics.dim_user_campaign_scd2
user_id             STRING
campaign_id         STRING
effective_start_ts  TIMESTAMP_NTZ
effective_end_ts    TIMESTAMP_NTZ
is_current          BOOLEAN
updated_at          TIMESTAMP_NTZ

The tricky part: merging based on event time for late arrivals

A normal incremental SCD Type 2 merge would do something like:

  • “Find current row (where is_current = true) and close it”
  • “Insert a new row for the new campaign”

But late events mean: the “current row” might already have been closed using a later event. For event-time correctness, I needed to determine the correct historical window the late event belongs to.

The strategy I implemented

For each user_id, for each incoming event:

  1. Compute the event’s effective timestamp (effective_from = event_ts)
  2. Find the row in the SCD2 table whose time window contains effective_from
    • effective_start_ts <= effective_from < effective_end_ts
  3. If that row’s campaign_id already matches, do nothing
  4. Otherwise:
    • close that row at effective_from
    • insert a new row starting at effective_from
  5. Ensure is_current is correct for everything affected.

This works even when the data arrives late because it rebuilds the boundary at the correct event time.
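The steps above can be sketched in plain Python. This is a toy in-memory model (integer-hour timestamps, 9999 as the far-future sentinel), not the dbt implementation itself, but it shows the containment-then-split logic:

```python
# Toy sketch of the event-time split (steps 1-5 above).
# A window is {"campaign_id", "start", "end"}; end == FAR_FUTURE marks the
# current row. Timestamps are plain integers (hours) for brevity.

FAR_FUTURE = 9999

def apply_event(windows, event_ts, campaign_id):
    """Find the window containing event_ts and split it at that boundary."""
    for w in windows:
        if w["start"] <= event_ts < w["end"]:  # step 2: containment check
            if w["campaign_id"] == campaign_id:
                return windows  # step 3: same campaign, do nothing
            w["end"] = event_ts  # step 4a: close the old window at event time
            windows.append({"campaign_id": campaign_id,
                            "start": event_ts,
                            "end": FAR_FUTURE})  # step 4b: open a new window
            break
    else:
        # No containing window: the event opens fresh history.
        windows.append({"campaign_id": campaign_id,
                        "start": event_ts,
                        "end": FAR_FUTURE})
    return sorted(windows, key=lambda w: w["start"])
```

Note that, like the SQL model, the sketch opens the new window with the sentinel end; a later normalization pass is what closes it against any pre-existing boundary.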


Working dbt model (Snowflake SQL)

I used a dbt model configured as incremental with a merge strategy. The key is the SQL inside the model.

models/analytics/dim_user_campaign_scd2.sql

{{ config(
    materialized='incremental',
    unique_key=['user_id', 'effective_start_ts'],
    incremental_strategy='merge'
) }}

/*
  Event-time SCD Type 2 merge for late arriving attribution events.
  Assumptions:
  - analytics.dim_user_campaign_scd2 is already created (empty initially is fine).
  - effective_end_ts uses a far-future sentinel for current rows.
*/

with source_events as (
    select
        user_id,
        event_ts as effective_from_ts,
        campaign_id,
        source_system,
        loaded_at
    from staging.attribution_events
    {% if is_incremental() %}
    -- Only process events that arrived since the last run.
    -- Late events still show up here because they physically arrive later.
    where loaded_at >= (
        select coalesce(max(updated_at), to_timestamp_ntz('1970-01-01'))
        from analytics.dim_user_campaign_scd2
    )
    {% endif %}
),

-- Sort and dedupe events per user at the same event_ts + campaign.
-- This prevents multiple identical updates causing duplicate windows.
deduped_events as (
    select
        user_id,
        effective_from_ts,
        campaign_id,
        row_number() over (
            partition by user_id, effective_from_ts, campaign_id
            order by source_system
        ) as rn
    from source_events
    qualify rn = 1
),

-- For each incoming event, find the SCD row that covers the event time.
-- The predicate uses event-time boundaries, not is_current.
matched_windows as (
    select
        e.user_id,
        e.effective_from_ts,
        e.campaign_id as new_campaign_id,
        d.campaign_id as existing_campaign_id,
        d.effective_start_ts as existing_effective_start_ts,
        d.effective_end_ts as existing_effective_end_ts,
        d.is_current as existing_is_current
    from deduped_events e
    left join analytics.dim_user_campaign_scd2 d
        on d.user_id = e.user_id
        and d.effective_start_ts <= e.effective_from_ts
        and e.effective_from_ts < d.effective_end_ts
),

-- Decide which events cause an actual change.
changes as (
    select
        user_id,
        effective_from_ts,
        new_campaign_id,
        existing_campaign_id,
        existing_effective_start_ts,
        existing_effective_end_ts,
        existing_is_current
    from matched_windows
    where existing_campaign_id is null
       or existing_campaign_id <> new_campaign_id
),

-- 1) Close existing rows that overlap with change points.
--    We generate "update" rows by selecting the current window and setting
--    its end to effective_from_ts.
to_close as (
    select
        user_id,
        existing_campaign_id as campaign_id,
        existing_effective_start_ts as effective_start_ts,
        effective_from_ts as effective_end_ts,
        false as is_current,
        current_timestamp() as updated_at
    from changes
    where existing_campaign_id is not null
),

-- 2) Insert new version rows for each change point.
to_insert as (
    select
        user_id,
        new_campaign_id as campaign_id,
        effective_from_ts as effective_start_ts,
        to_timestamp_ntz('9999-12-31 00:00:00') as effective_end_ts,
        true as is_current,
        current_timestamp() as updated_at
    from changes
),

-- 3) Recompute which rows are current.
--    Because inserting a new window in the middle of history should push
--    "current" forward, we normalize by setting is_current based on the
--    effective_end_ts sentinel.
combined as (
    select
        user_id,
        campaign_id,
        effective_start_ts,
        effective_end_ts,
        effective_end_ts = to_timestamp_ntz('9999-12-31 00:00:00') as is_current,
        updated_at
    from to_close

    union all

    select
        user_id,
        campaign_id,
        effective_start_ts,
        effective_end_ts,
        effective_end_ts = to_timestamp_ntz('9999-12-31 00:00:00') as is_current,
        updated_at
    from to_insert
)

-- Final SELECT is the merge source.
-- The merge matches on unique_key (user_id, effective_start_ts).
select * from combined

Why this merge logic works

  • Instead of relying on is_current, it uses event-time containment to find the row that should be split.
  • Every change produces:
    • a “close” record for the old window ending at the event timestamp
    • an “open” record for the new campaign starting at that timestamp
  • The unique key (user_id, effective_start_ts) prevents duplicate insertions for the same boundary.

Creating the target table (first run)

dbt can usually create the table itself on the first run (via create table as select, depending on your setup), but I like making the sentinel explicit.

Here’s a simple Snowflake DDL you can run once:

create table if not exists analytics.dim_user_campaign_scd2 (
    user_id            STRING        not null,
    campaign_id        STRING,
    effective_start_ts TIMESTAMP_NTZ not null,
    effective_end_ts   TIMESTAMP_NTZ not null,
    is_current         BOOLEAN       not null,
    updated_at         TIMESTAMP_NTZ not null
);

What happens when you run it (with a concrete timeline)

Assume this is already in your SCD2 table:

  • user_id = 'u1'
  • Row A: campaign_id = 'cA' effective from 10:00 to 12:00
  • Row B: campaign_id = 'cB' effective from 12:00 to 9999-12-31 (current)

Now late-arriving events show up:

  1. Late event arrives at loaded_at = 13:30, but event_ts = 11:00
    • campaign_id = 'cX'

Containment check:

  • Does 10:00 <= 11:00 < 12:00? Yes → matches Row A
  • cA != cX → change detected

Result:

  • Row A becomes 10:00..11:00 (closed)
  • New Row C inserted: 11:00..9999-12-31 (opened)

Now you might ask: what about the old Row B that used to start at 12:00?

  • With this model, Row B is still in the table as 12:00..9999-12-31
  • In a “strictly correct” implementation, you also want to ensure overlapping windows don’t remain open incorrectly.
  • In my setup, I handled this in a downstream normalization step that enforces non-overlapping intervals per user (shown next).

Because the late event “splits history,” the merge model creates the correct boundary at the change point, but it can leave two open-ended rows (Row B and the new Row C both end at the far-future sentinel). Interval normalization then guarantees that no two windows for the same user overlap and that only one row per user is current.


Interval normalization (prevent overlapping “current” windows)

I added a second dbt model that compacts windows per user by rebuilding from the event boundaries already stored. This keeps the table clean.

models/analytics/dim_user_campaign_scd2_normalized.sql

{{ config(materialized='table') }}

/*
  Normalize windows to ensure:
  - non-overlapping intervals per user
  - only one current row per user
*/

with ordered as (
    select
        user_id,
        campaign_id,
        effective_start_ts,
        effective_end_ts,
        updated_at
    from analytics.dim_user_campaign_scd2
),

dedup_boundaries as (
    select
        user_id,
        campaign_id,
        effective_start_ts,
        -- We recompute ends later using lead().
        row_number() over (
            partition by user_id, effective_start_ts
            order by updated_at desc
        ) as rn
    from ordered
    qualify rn = 1
),

sequenced as (
    select
        user_id,
        campaign_id,
        effective_start_ts,
        lead(effective_start_ts) over (
            partition by user_id
            order by effective_start_ts
        ) as next_effective_start_ts
    from dedup_boundaries
)

select
    user_id,
    campaign_id,
    effective_start_ts,
    coalesce(
        next_effective_start_ts,
        to_timestamp_ntz('9999-12-31 00:00:00')
    ) as effective_end_ts,
    -- Snowflake allows reusing the effective_end_ts alias defined just above.
    effective_end_ts = to_timestamp_ntz('9999-12-31 00:00:00') as is_current,
    current_timestamp() as updated_at
from sequenced

This rebuilds interval ends using lead(effective_start_ts) (the next change boundary).

That’s what made my history robust, even when late events inserted a new boundary “in the middle” of already-loaded intervals.
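The lead()-based rebuild is easy to see in a small Python sketch (a hypothetical in-memory equivalent, using integer timestamps and 9999 as the sentinel):

```python
# Sketch of the normalization step: rebuild every interval's end from the
# next start boundary, the way lead(effective_start_ts) does per user.

FAR_FUTURE = 9999

def normalize(boundaries):
    """boundaries: list of (effective_start_ts, campaign_id) for one user.
    Returns non-overlapping windows with exactly one current row."""
    boundaries = sorted(boundaries)  # order by effective_start_ts
    out = []
    for i, (start, campaign_id) in enumerate(boundaries):
        # lead(): the next boundary's start, else the far-future sentinel
        end = boundaries[i + 1][0] if i + 1 < len(boundaries) else FAR_FUTURE
        out.append({"campaign_id": campaign_id, "start": start,
                    "end": end, "is_current": end == FAR_FUTURE})
    return out
```

Because every end is derived from the following start, overlapping or dangling sentinel ends cannot survive this pass.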


Observability: how I validated correctness

Because this was about data observability and quality, I didn’t trust the table blindly. I added checks:

1) Overlap check (no two intervals should overlap)

with intervals as (
    select user_id, effective_start_ts, effective_end_ts
    from analytics.dim_user_campaign_scd2_normalized
),

pairs as (
    select
        a.user_id,
        a.effective_start_ts as a_start,
        a.effective_end_ts as a_end,
        b.effective_start_ts as b_start,
        b.effective_end_ts as b_end
    from intervals a
    join intervals b
        on a.user_id = b.user_id
        and a.effective_start_ts < b.effective_start_ts
        and a.effective_end_ts > b.effective_start_ts
)

select * from pairs limit 10;

If this returns rows, history windows overlap.
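The same half-open-interval overlap test can be sketched in Python (a hypothetical helper using the same [start, end) convention as the SQL):

```python
# Pairwise overlap check for half-open intervals [start, end).
# Two intervals overlap exactly when each one starts before the other ends.

def overlapping_pairs(intervals):
    """intervals: list of (start, end) windows for a single user.
    Returns every offending pair; an empty list means clean history."""
    bad = []
    for i, (a_start, a_end) in enumerate(intervals):
        for b_start, b_end in intervals[i + 1:]:
            if a_start < b_end and b_start < a_end:
                bad.append(((a_start, a_end), (b_start, b_end)))
    return bad
```

Adjacent windows that share a boundary (one ends at 11, the next starts at 11) do not count as overlapping, which is exactly why the half-open convention matters.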

2) Single current row check

select user_id, count(*) as current_rows
from analytics.dim_user_campaign_scd2_normalized
where is_current = true
group by user_id
having count(*) <> 1;

Data quality tradeoffs I ran into

The biggest practical constraint was event time granularity:

  • If multiple events land with the same event_ts but different campaign derivations, SCD2 can’t decide ordering without a tie-breaker.
  • In my pipeline, I used source_system and deterministic deduplication keys, then normalized afterward.
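One way to make that tie-break deterministic can be sketched in Python. This is a hypothetical, stricter variant that keeps a single winner per (user_id, event_ts), in the spirit of the row_number()/qualify dedupe in the model:

```python
# Deterministic tie-break when several events share the same event_ts:
# sort on a stable key (here source_system) and keep the first per timestamp.

def dedupe_events(events):
    """events: list of (user_id, event_ts, campaign_id, source_system).
    Keeps one event per (user_id, event_ts), lowest source_system wins."""
    events = sorted(events, key=lambda e: (e[0], e[1], e[3]))
    kept, seen = [], set()
    for user_id, event_ts, campaign_id, source_system in events:
        if (user_id, event_ts) not in seen:
            seen.add((user_id, event_ts))
            kept.append((user_id, event_ts, campaign_id, source_system))
    return kept
```

Any stable key works; the point is that reruns over the same input always pick the same winner, so the SCD boundaries never flap.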

The other lesson: the incremental filter should be based on arrival time (loaded_at), so late data is still picked up, while the SCD boundary logic should be based on event time (event_ts), so the history stays correct.


Conclusion

I built an event-time SCD Type 2 “merge” for late-arriving clickstream attribution in dbt on Snowflake by splitting windows at the event timestamp that actually occurred, not the time it arrived. Then I added interval normalization to guarantee non-overlapping effective ranges and a single current row per user. The result was a history table that stays correct even when backfills insert new boundaries into already-loaded timelines—exactly the kind of data engineering detail that prevents subtle downstream bugs.