Data Science · April 22, 2026

Building an SCD Type 2 “Event-Time Merge” for Late-Arriving Clickstream in dbt and Snowflake


Written by

Sage Stream


The problem that pulled me into this niche is simple: clickstream data arrives late and in chunks, but my “user journey” tables need to reflect the truth as of the event time, not the load time.

I built a small pipeline for a client where a user’s “attribution campaign” could change based on an event that showed up hours after the fact. The classic Slowly Changing Dimension (SCD Type 2) pattern can track changes over time, but it usually keys off the load/processing time. That breaks when events are delayed.

So I implemented an event-time merge for SCD Type 2 in dbt on Snowflake, designed specifically for late-arriving clickstream and backfilled attribution fields.

Below is what I learned by actually running it.


What I was trying to guarantee

I maintain a dimension table that stores one row per user per attribution state:

  • When a user’s campaign_id (derived from clickstream) changes, we should:
    • close the previous “version” row with an effective_end_ts
    • open a new version row with effective_start_ts
  • Crucially, effective_start_ts and effective_end_ts must reflect event time, not when the data arrived.

Definitions (so the code doesn’t feel like magic)

  • Event time: when the clickstream event happened (event_ts)
  • Processing time: when I loaded it into the warehouse (loaded_at)
  • SCD Type 2: a table strategy that keeps history by creating a new row for each change, rather than overwriting.

The schema I used

I had two tables:

1) Staging event table (incremental loads)

This table contains raw derived attribution events.

-- staging.attribution_events
-- one row per user event that can affect attribution
user_id        STRING
event_ts       TIMESTAMP_NTZ
campaign_id    STRING
source_system  STRING
loaded_at      TIMESTAMP_NTZ

2) Dimension history table

-- analytics.dim_user_campaign_scd2
user_id             STRING
campaign_id         STRING
effective_start_ts  TIMESTAMP_NTZ
effective_end_ts    TIMESTAMP_NTZ
is_current          BOOLEAN
updated_at          TIMESTAMP_NTZ

The tricky part: merging based on event time for late arrivals

A normal incremental SCD Type 2 merge would do something like:

  • “Find current row (where is_current = true) and close it”
  • “Insert a new row for the new campaign”

But late events mean: the “current row” might already have been closed using a later event. For event-time correctness, I needed to determine the correct historical window the late event belongs to.

The strategy I implemented

For each user_id, for each incoming event:

  1. Compute the event’s effective timestamp (effective_from = event_ts)
  2. Find the row in the SCD2 table whose time window contains effective_from
    • effective_start_ts <= effective_from < effective_end_ts
  3. If that row’s campaign_id already matches, do nothing
  4. Otherwise:
    • close that row at effective_from
    • insert a new row starting at effective_from
  5. Ensure is_current is correct for everything affected.

This works even when the data arrives late because it rebuilds the boundary at the correct event time.
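The steps above can be sketched in plain Python. This is a toy in-memory model (integer-hour timestamps, 9999 as the far-future sentinel), not the dbt implementation itself, but it shows the containment-then-split logic:

```python
# Toy sketch of the event-time split (steps 1-5 above).
# A window is {"campaign_id", "start", "end"}; end == FAR_FUTURE marks the
# current row. Timestamps are plain integers (hours) for brevity.

FAR_FUTURE = 9999

def apply_event(windows, event_ts, campaign_id):
    """Find the window containing event_ts and split it at that boundary."""
    for w in windows:
        if w["start"] <= event_ts < w["end"]:  # step 2: containment check
            if w["campaign_id"] == campaign_id:
                return windows  # step 3: same campaign, do nothing
            w["end"] = event_ts  # step 4a: close the old window at event time
            windows.append({"campaign_id": campaign_id,
                            "start": event_ts,
                            "end": FAR_FUTURE})  # step 4b: open a new window
            break
    else:
        # No containing window: the event opens fresh history.
        windows.append({"campaign_id": campaign_id,
                        "start": event_ts,
                        "end": FAR_FUTURE})
    return sorted(windows, key=lambda w: w["start"])
```

Note that, like the SQL model, the sketch opens the new window with the sentinel end; a later normalization pass is what closes it against any pre-existing boundary.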


Working dbt model (Snowflake SQL)

I used a dbt model configured as incremental with a merge strategy. The key is the SQL inside the model.

models/analytics/dim_user_campaign_scd2.sql

{{ config(
    materialized='incremental',
    unique_key=['user_id', 'effective_start_ts'],
    incremental_strategy='merge'
) }}

/*
  Event-time SCD Type 2 merge for late arriving attribution events.
  Assumptions:
  - analytics.dim_user_campaign_scd2 is already created (empty initially is fine).
  - effective_end_ts uses a far-future sentinel for current rows.
*/

with source_events as (
    select
        user_id,
        event_ts as effective_from_ts,
        campaign_id,
        source_system,
        loaded_at
    from staging.attribution_events
    {% if is_incremental() %}
    -- Only process events that arrived since the last run.
    -- Late events still show up here because they physically arrive later.
    where loaded_at >= (
        select coalesce(max(updated_at), to_timestamp_ntz('1970-01-01'))
        from analytics.dim_user_campaign_scd2
    )
    {% endif %}
),

-- Sort and dedupe events per user at the same event_ts + campaign.
-- This prevents multiple identical updates causing duplicate windows.
deduped_events as (
    select
        user_id,
        effective_from_ts,
        campaign_id,
        row_number() over (
            partition by user_id, effective_from_ts, campaign_id
            order by source_system
        ) as rn
    from source_events
    qualify rn = 1
),

-- For each incoming event, find the SCD row that covers the event time.
-- The predicate uses event-time boundaries, not is_current.
matched_windows as (
    select
        e.user_id,
        e.effective_from_ts,
        e.campaign_id as new_campaign_id,
        d.campaign_id as existing_campaign_id,
        d.effective_start_ts as existing_effective_start_ts,
        d.effective_end_ts as existing_effective_end_ts,
        d.is_current as existing_is_current
    from deduped_events e
    left join analytics.dim_user_campaign_scd2 d
        on d.user_id = e.user_id
        and d.effective_start_ts <= e.effective_from_ts
        and e.effective_from_ts < d.effective_end_ts
),

-- Decide which events cause an actual change.
changes as (
    select
        user_id,
        effective_from_ts,
        new_campaign_id,
        existing_campaign_id,
        existing_effective_start_ts,
        existing_effective_end_ts,
        existing_is_current
    from matched_windows
    where existing_campaign_id is null
       or existing_campaign_id <> new_campaign_id
),

-- 1) Close existing rows that overlap with change points.
--    We generate "update" rows by selecting the current window and setting
--    its end to effective_from_ts.
to_close as (
    select
        user_id,
        existing_campaign_id as campaign_id,
        existing_effective_start_ts as effective_start_ts,
        effective_from_ts as effective_end_ts,
        false as is_current,
        current_timestamp() as updated_at
    from changes
    where existing_campaign_id is not null
),

-- 2) Insert new version rows for each change point.
to_insert as (
    select
        user_id,
        new_campaign_id as campaign_id,
        effective_from_ts as effective_start_ts,
        to_timestamp_ntz('9999-12-31 00:00:00') as effective_end_ts,
        true as is_current,
        current_timestamp() as updated_at
    from changes
),

-- 3) Recompute which rows are current.
--    Because inserting a new window in the middle of history should push
--    "current" forward, we normalize by setting is_current based on the
--    effective_end_ts sentinel.
combined as (
    select
        user_id,
        campaign_id,
        effective_start_ts,
        effective_end_ts,
        effective_end_ts = to_timestamp_ntz('9999-12-31 00:00:00') as is_current,
        updated_at
    from to_close

    union all

    select
        user_id,
        campaign_id,
        effective_start_ts,
        effective_end_ts,
        effective_end_ts = to_timestamp_ntz('9999-12-31 00:00:00') as is_current,
        updated_at
    from to_insert
)

-- Final SELECT is the merge source.
-- The merge matches on unique_key (user_id, effective_start_ts).
select * from combined

Why this merge logic works

  • Instead of relying on is_current, it uses event-time containment to find the row that should be split.
  • Every change produces:
    • a “close” record for the old window ending at the event timestamp
    • an “open” record for the new campaign starting at that timestamp
  • The unique key (user_id, effective_start_ts) prevents duplicate insertions for the same boundary.

Creating the target table (first run)

dbt can usually create the table itself on the first run (via create table as select, depending on your setup), but I like making the sentinel explicit.

Here’s a simple Snowflake DDL you can run once:

create table if not exists analytics.dim_user_campaign_scd2 (
    user_id            STRING        not null,
    campaign_id        STRING,
    effective_start_ts TIMESTAMP_NTZ not null,
    effective_end_ts   TIMESTAMP_NTZ not null,
    is_current         BOOLEAN       not null,
    updated_at         TIMESTAMP_NTZ not null
);

What happens when you run it (with a concrete timeline)

Assume this is already in your SCD2 table:

  • user_id = 'u1'
  • Row A: campaign_id = 'cA' effective from 10:00 to 12:00
  • Row B: campaign_id = 'cB' effective from 12:00 to 9999-12-31 (current)

Now late-arriving events show up:

  1. Late event arrives at loaded_at = 13:30, but event_ts = 11:00
    • campaign_id = 'cX'

Containment check:

  • Does 10:00 <= 11:00 < 12:00? Yes → matches Row A
  • cA != cX → change detected

Result:

  • Row A becomes 10:00..11:00 (closed)
  • New Row C inserted: 11:00..9999-12-31 (opened)

Now you might ask: what about the old Row B that used to start at 12:00?

  • With this model, Row B is still in the table as 12:00..9999-12-31
  • In a “strictly correct” implementation, you also want to ensure overlapping windows don’t remain open incorrectly.
  • In my setup, I handled this in a downstream normalization step that enforces non-overlapping intervals per user (shown next).

Because the late event “splits history,” the merge model creates the correct boundary at the change point, but it can leave two open-ended rows (Row B and the new Row C both end at the far-future sentinel). Interval normalization then guarantees that no two windows for the same user overlap and that only one row per user is current.


Interval normalization (prevent overlapping “current” windows)

I added a second dbt model that compacts windows per user by rebuilding from the event boundaries already stored. This keeps the table clean.

models/analytics/dim_user_campaign_scd2_normalized.sql

{{ config(materialized='table') }}

/*
  Normalize windows to ensure:
  - non-overlapping intervals per user
  - only one current row per user
*/

with ordered as (
    select
        user_id,
        campaign_id,
        effective_start_ts,
        effective_end_ts,
        updated_at
    from analytics.dim_user_campaign_scd2
),

dedup_boundaries as (
    select
        user_id,
        campaign_id,
        effective_start_ts,
        -- We recompute ends later using lead().
        row_number() over (
            partition by user_id, effective_start_ts
            order by updated_at desc
        ) as rn
    from ordered
    qualify rn = 1
),

sequenced as (
    select
        user_id,
        campaign_id,
        effective_start_ts,
        lead(effective_start_ts) over (
            partition by user_id
            order by effective_start_ts
        ) as next_effective_start_ts
    from dedup_boundaries
)

select
    user_id,
    campaign_id,
    effective_start_ts,
    coalesce(
        next_effective_start_ts,
        to_timestamp_ntz('9999-12-31 00:00:00')
    ) as effective_end_ts,
    -- Snowflake allows reusing the effective_end_ts alias defined just above.
    effective_end_ts = to_timestamp_ntz('9999-12-31 00:00:00') as is_current,
    current_timestamp() as updated_at
from sequenced

This rebuilds interval ends using lead(effective_start_ts) (the next change boundary).

That’s what made my history robust, even when late events inserted a new boundary “in the middle” of already-loaded intervals.
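The lead()-based rebuild is easy to see in a small Python sketch (a hypothetical in-memory equivalent, using integer timestamps and 9999 as the sentinel):

```python
# Sketch of the normalization step: rebuild every interval's end from the
# next start boundary, the way lead(effective_start_ts) does per user.

FAR_FUTURE = 9999

def normalize(boundaries):
    """boundaries: list of (effective_start_ts, campaign_id) for one user.
    Returns non-overlapping windows with exactly one current row."""
    boundaries = sorted(boundaries)  # order by effective_start_ts
    out = []
    for i, (start, campaign_id) in enumerate(boundaries):
        # lead(): the next boundary's start, else the far-future sentinel
        end = boundaries[i + 1][0] if i + 1 < len(boundaries) else FAR_FUTURE
        out.append({"campaign_id": campaign_id, "start": start,
                    "end": end, "is_current": end == FAR_FUTURE})
    return out
```

Because every end is derived from the following start, overlapping or dangling sentinel ends cannot survive this pass.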


Observability: how I validated correctness

Because this was about data observability and quality, I didn’t trust the table blindly. I added checks:

1) Overlap check (no two intervals should overlap)

with intervals as (
    select user_id, effective_start_ts, effective_end_ts
    from analytics.dim_user_campaign_scd2_normalized
),

pairs as (
    select
        a.user_id,
        a.effective_start_ts as a_start,
        a.effective_end_ts as a_end,
        b.effective_start_ts as b_start,
        b.effective_end_ts as b_end
    from intervals a
    join intervals b
        on a.user_id = b.user_id
        and a.effective_start_ts < b.effective_start_ts
        and a.effective_end_ts > b.effective_start_ts
)

select * from pairs limit 10;

If this returns rows, history windows overlap.
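The same half-open-interval overlap test can be sketched in Python (a hypothetical helper using the same [start, end) convention as the SQL):

```python
# Pairwise overlap check for half-open intervals [start, end).
# Two intervals overlap exactly when each one starts before the other ends.

def overlapping_pairs(intervals):
    """intervals: list of (start, end) windows for a single user.
    Returns every offending pair; an empty list means clean history."""
    bad = []
    for i, (a_start, a_end) in enumerate(intervals):
        for b_start, b_end in intervals[i + 1:]:
            if a_start < b_end and b_start < a_end:
                bad.append(((a_start, a_end), (b_start, b_end)))
    return bad
```

Adjacent windows that share a boundary (one ends at 11, the next starts at 11) do not count as overlapping, which is exactly why the half-open convention matters.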

2) Single current row check

select user_id, count(*) as current_rows
from analytics.dim_user_campaign_scd2_normalized
where is_current = true
group by user_id
having count(*) <> 1;

Data quality tradeoffs I ran into

The biggest practical constraint was event time granularity:

  • If multiple events land with the same event_ts but different campaign derivations, SCD2 can’t decide ordering without a tie-breaker.
  • In my pipeline, I used source_system and deterministic deduplication keys, then normalized afterward.
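One way to make that tie-break deterministic can be sketched in Python. This is a hypothetical, stricter variant that keeps a single winner per (user_id, event_ts), in the spirit of the row_number()/qualify dedupe in the model:

```python
# Deterministic tie-break when several events share the same event_ts:
# sort on a stable key (here source_system) and keep the first per timestamp.

def dedupe_events(events):
    """events: list of (user_id, event_ts, campaign_id, source_system).
    Keeps one event per (user_id, event_ts), lowest source_system wins."""
    events = sorted(events, key=lambda e: (e[0], e[1], e[3]))
    kept, seen = [], set()
    for user_id, event_ts, campaign_id, source_system in events:
        if (user_id, event_ts) not in seen:
            seen.add((user_id, event_ts))
            kept.append((user_id, event_ts, campaign_id, source_system))
    return kept
```

Any stable key works; the point is that reruns over the same input always pick the same winner, so the SCD boundaries never flap.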

The other lesson: the incremental filter should be based on arrival time (loaded_at), so late data is still picked up, while the SCD boundary logic should be based on event time (event_ts), so the history stays correct.


Conclusion

I built an event-time SCD Type 2 “merge” for late-arriving clickstream attribution in dbt on Snowflake by splitting windows at the event timestamp that actually occurred, not the time it arrived. Then I added interval normalization to guarantee non-overlapping effective ranges and a single current row per user. The result was a history table that stays correct even when backfills insert new boundaries into already-loaded timelines—exactly the kind of data engineering detail that prevents subtle downstream bugs.