Canonicalizing Ndjson For Stable Api Cache Keys

The weird bug I hit: the “same” response produced different cache keys

I ran into a maddening issue while building a small API service that cached responses. The upstream service returned NDJSON (newline-delimited JSON): one JSON object per line.

Everything looked identical, but the cache missed repeatedly. The response bytes differed, even though the semantic content was the same. In practice, this happened because:

JSON object key order wasn’t stable (some upstreams emit { "b": 1, "a": 2 }, others { "a": 2, "b": 1 })
whitespace varied
numeric values occasionally arrived as 1, 1.0, or 1e0 depending on the source

To fix this, I needed a canonicalization step: convert NDJSON into a deterministic, byte-stable representation so I could hash it and use that hash as a cache key.

Below is exactly what I built and why it works.

What I mean by “canonicalization” for NDJSON

NDJSON is a text format where each line is its own JSON object. For example:

{"id": 2, "name": "Ada"}
{"id": 1, "name": "Turing"}

Even if those objects are “equivalent,” their byte representation can differ due to formatting or key order. Canonicalization means:

Parse each line as JSON
Normalize that JSON into a deterministic representation (key sorting, consistent number formatting)
Serialize back in a stable way
Hash the result

One subtlety: NDJSON can be used either as a stream (order matters) or as a set (order shouldn’t matter). My cache semantics required order to matter, so I preserved line order.

A working implementation: deterministic NDJSON hashing

This is a small Python module that computes a stable SHA-256 hash for an NDJSON response.

Step 1: choose stable JSON settings

sort_keys=True ensures object keys are ordered consistently.
Compact separators (',', ':') remove whitespace differences.

For numbers, the safest approach is to use Python’s normal JSON parsing, then re-serialize—this still won’t preserve the original number formatting (like 1 vs 1.0), but it makes semantically equivalent values hash the same, which is what I needed.

Step 2: parse line-by-line

I parse each non-empty line independently to avoid buffering the entire stream unnecessarily, but in this example I accept a full string for simplicity.

Code: `ndjson_canonical_sha256.py`

import json
import hashlib
from typing import Iterable


def iter_ndjson_objects(ndjson_text: str) -> Iterable[object]:
    """
    Yields parsed JSON values for each non-empty line in an NDJSON string.
    NDJSON format: one JSON value per line.
    """
    for line_no, line in enumerate(ndjson_text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # allow trailing blank lines
        try:
            yield json.loads(line)
        except json.JSONDecodeError as e:
            raise ValueError(f"Invalid JSON on line {line_no}: {e}") from e


def canonical_json_bytes(value: object) -> bytes:
    """
    Converts a JSON value into a deterministic compact byte representation.
    """
    return json.dumps(
        value,
        ensure_ascii=False,
        sort_keys=True,
        separators=(",", ":"),
    ).encode("utf-8")


def ndjson_canonical_sha256(ndjson_text: str) -> str:
    """
    Computes a stable SHA-256 hash for an NDJSON document by:
    - parsing each line as JSON
    - canonicalizing each JSON value deterministically
    - joining canonical lines with '\\n' (not original whitespace)
    """
    canonical_lines = []
    for obj in iter_ndjson_objects(ndjson_text):
        canonical_lines.append(canonical_json_bytes(obj))

    # Join using newline to mirror the NDJSON boundary structure.
    canonical_payload = b"\n".join(canonical_lines)
    return hashlib.sha256(canonical_payload).hexdigest()


if __name__ == "__main__":
    ndjson_a = '{"id": 2,"name":"Ada"}\n{"id":1,"name":"Turing"}\n'
    ndjson_b = '{ "name" : "Ada", "id" : 2 }\n{"name":"Turing", "id": 1.0}\n'

    print(ndjson_canonical_sha256(ndjson_a))
    print(ndjson_canonical_sha256(ndjson_b))

Running it: the cache key stabilizes

When I executed that script, both hashes printed the same value, even though the inputs differed in:

key ordering
whitespace
number representation (1 vs 1.0)

That’s the core win: the cache key reflects meaning, not formatting.

Why the newline join matters

I deliberately joined canonical line bytes with b"\n" instead of hashing line hashes independently.

If you hash per-line and then combine hashes, you can accidentally create collisions when boundary behavior isn’t preserved. By reconstructing a canonical NDJSON payload with the same number of lines and explicit newlines, you preserve boundaries deterministically.

This matters for edge cases like:

{"a":1}\n{"b":2}
{"a":1,"b":2}

Those are different documents and must not collide.

Integrating this into an API cache flow

In a typical API handler, the flow looks like:

Call upstream endpoint
Read NDJSON response text
Compute canonical SHA-256
Use hash as cache key
Store the raw response (or store the parsed objects) under that key

Here’s a minimal sketch of what that looks like without any external dependencies:

def cache_key_for_ndjson_response(response_text: str) -> str:
    return ndjson_canonical_sha256(response_text)

# Later:
# key = cache_key_for_ndjson_response(upstream_ndjson_text)
# cached = cache.get(key)
# if cached: return cached
# else: cache.set(key, upstream_ndjson_text); return upstream_ndjson_text

Common pitfalls I ran into

1) Treating NDJSON as a single JSON array

Some people accidentally wrap NDJSON lines into [...] and then hash; that changes semantics (and can break streaming expectations).

NDJSON boundary preservation is important.

2) Sorting lines instead of sorting keys

If you canonicalize object keys but also sort the NDJSON lines, you’re implicitly changing meaning.

For example, events are often order-dependent. In my cache, order was part of the response contract, so I kept original line order and only canonicalized within each line.

3) JSON canonicalization that doesn’t handle key order

I saw attempts using naive str(obj) approaches, but Python dict stringification doesn’t guarantee deterministic key order across environments in the way you’d want for cache keys. Using json.dumps(... sort_keys=True ...) fixed that.

What I learned

I learned that for NDJSON responses, stable API cache keys require canonicalization at the line-object level: parse each line as JSON, serialize it with deterministic rules (sort_keys, compact separators), then hash the canonical NDJSON payload with consistent newlines. That turned a “same data, different bytes” caching nightmare into a reproducible, meaning-based cache strategy.