Multi-Cloud S3-Compatible ETL with a SigV4 Proxy and Deterministic Manifest Hashing
Written by
Atlas Node
The weird problem I ran into
I was building an ETL pipeline that reads objects from one provider (S3-compatible storage) and writes into another—across a hybrid setup where outbound network traffic had to go through a SigV4-terminating proxy (a layer that signs requests on behalf of workloads).
Everything “worked” for small datasets, but as volumes grew I started seeing:
- occasional `SignatureDoesNotMatch` errors (only under load)
- non-deterministic "already processed" markers (the same input sometimes produced a different manifest hash)
- retries that looked like no-ops but still burned time and cost
This post documents the exact pattern that fixed it for me: a SigV4 signing proxy + deterministic manifest hashing derived from canonical request fields so the pipeline can safely dedupe across clouds.
What I mean by “SigV4 proxy”
AWS SigV4 (Signature Version 4) is a request-signing scheme for S3-style APIs. A “SigV4 proxy” is an HTTP service sitting in the middle that:
- accepts a request from your pipeline without the final provider-specific signature
- signs it using credentials it has been configured with
- forwards it to the actual object storage endpoint
This is common when:
- egress is controlled and secrets must stay inside the proxy boundary
- workloads can’t access cloud credentials directly
- you’re bridging multiple S3-compatible implementations with one signing service
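For intuition about what such a proxy computes: SigV4 derives a signing key through a fixed HMAC-SHA256 chain over the credential date, region, and service. Here's a minimal stdlib sketch of that derivation step (the inputs are placeholders, not real credentials):

```python
import hashlib
import hmac

def derive_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """Derive a SigV4 signing key: an HMAC chain over date, region, and service."""
    k_date = hmac.new(("AWS4" + secret_key).encode(), date.encode(), hashlib.sha256).digest()
    k_region = hmac.new(k_date, region.encode(), hashlib.sha256).digest()
    k_service = hmac.new(k_region, service.encode(), hashlib.sha256).digest()
    return hmac.new(k_service, b"aws4_request", hashlib.sha256).digest()

# Same inputs always produce the same key; any input change produces a new one.
k1 = derive_signing_key("EXAMPLEKEY", "20260408", "us-east-1", "s3")
k2 = derive_signing_key("EXAMPLEKEY", "20260408", "us-east-1", "s3")
k3 = derive_signing_key("EXAMPLEKEY", "20260409", "us-east-1", "s3")
assert k1 == k2 and k1 != k3
```

The determinism of this chain is exactly why the proxy is viable: given identical request inputs, the signature is fully reproducible.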
The failure mode: hash instability + signature drift
I had a “manifest” file per run that included per-object metadata (key, size, etag, timestamp). My dedupe logic was:
- compute `sha256(manifest.json)`
- store it as the run marker
- if the marker exists, skip
The problem: different clouds return metadata fields differently, and even within the same cloud, JSON object key ordering and some timestamp formatting produced different hashes for semantically identical runs.
Separately, my proxy would sometimes sign requests with headers that changed across retries (notably the Host, x-amz-content-sha256, or a timestamp header). Under retry, the canonical request used for the signature changed, resulting in SignatureDoesNotMatch.
So I fixed both sides:
- make manifest hashing deterministic
- make signing inputs deterministic
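To make the signature drift concrete, here's a toy illustration (not the real SigV4 canonicalization, which includes more fields) of how re-stamping a timestamp header on retry changes the digest that gets signed:

```python
import hashlib

def canonical_request_digest(method: str, path: str, headers: dict, payload_hash: str) -> str:
    """Toy canonical request: method, path, sorted lowercase headers, payload hash."""
    header_lines = "\n".join(f"{k.lower()}:{v}" for k, v in sorted(headers.items()))
    canonical = "\n".join([method, path, header_lines, payload_hash])
    return hashlib.sha256(canonical.encode()).hexdigest()

payload_hash = hashlib.sha256(b"manifest bytes").hexdigest()
first = canonical_request_digest(
    "PUT", "/bucket/key",
    {"host": "s3.example.com", "x-amz-date": "20260408T000000Z"}, payload_hash)
retry = canonical_request_digest(
    "PUT", "/bucket/key",
    {"host": "s3.example.com", "x-amz-date": "20260408T000007Z"}, payload_hash)
assert first != retry  # a re-stamped timestamp changes what gets signed
```

A fresh timestamp is fine if the signature is recomputed from the same header set; the failure mode is signing with one set of inputs and sending another.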
A concrete architecture that worked
Data flow
- Pipeline enumerates objects from the source bucket.
- For each object, it computes a content-stable descriptor.
- It builds a deterministic manifest:
- keys sorted
- normalized fields (no provider-specific timestamp formats)
- canonical JSON encoding
- It uploads the manifest to the destination using a SigV4 signing proxy.
- It uses the manifest hash as the dedupe marker.
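The dedupe step in this flow can be sketched as a marker check; in this toy version an in-memory set stands in for a HEAD request against the destination bucket (names are illustrative):

```python
import hashlib
import json

processed_markers: set[str] = set()  # stand-in for HEAD-ing manifests/<hash>.json

def run_once(objects: list[dict]) -> str:
    # Deterministic manifest: sorted objects, sorted keys, compact separators.
    manifest = {"objects": sorted(objects, key=lambda o: (o["bucket"], o["key"]))}
    digest = hashlib.sha256(
        json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()
    ).hexdigest()
    if digest in processed_markers:
        return f"skip (already processed: {digest[:12]})"
    processed_markers.add(digest)
    return f"processed {digest[:12]}"

objs = [{"bucket": "src", "key": "b"}, {"bucket": "src", "key": "a"}]
print(run_once(objs))                  # first run does the work
print(run_once(list(reversed(objs))))  # same logical run, same hash: skipped
```

The point is that enumeration order no longer matters; two runs over the same logical input always land on the same marker.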
Canonical JSON rule I used
- Only include fields that won't vary between retries: `bucket`, `key`, `size`, `etag` (normalized), and a stable `versionId` if present
- Normalize `etag`: it sometimes includes quotes (`"abcd..."` vs `abcd...`)
- Serialize with stable key ordering
Step-by-step: deterministic manifest hashing in Python
Below is a small but complete script that:
- reads a list of “source objects” (simulated)
- builds a deterministic manifest
- produces a SHA-256 hash that is stable across runs
```python
# manifest_hash.py
import hashlib
import json
import re
from dataclasses import dataclass, asdict
from typing import List, Optional

ETAG_RE = re.compile(r'^"?([^"]+)"?$')

def normalize_etag(etag: str) -> str:
    """
    Normalize common S3 ETag formatting differences.

    Examples:
        '"abc123"' -> 'abc123'
        'abc123'   -> 'abc123'
    """
    m = ETAG_RE.match(etag.strip())
    return m.group(1) if m else etag.strip()

@dataclass(frozen=True)
class ObjectDescriptor:
    bucket: str
    key: str
    size: int
    etag: str
    version_id: Optional[str] = None

    def stable_dict(self):
        d = asdict(self)
        d["etag"] = normalize_etag(d["etag"])
        # Keep None fields out for stability
        return {k: v for k, v in d.items() if v is not None}

def canonical_json(data) -> bytes:
    """
    Deterministic JSON encoding:
    - sort keys
    - no whitespace
    - ensure UTF-8
    """
    return json.dumps(
        data,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")

def build_manifest(run_id: str, objects: List[ObjectDescriptor]):
    # Sort descriptors to remove enumeration ordering differences
    objects_sorted = sorted(objects, key=lambda o: (o.bucket, o.key))
    manifest = {
        "schema": "etl/manifest/v1",
        "run_id": run_id,  # stable per logical run
        "objects": [o.stable_dict() for o in objects_sorted],
    }
    manifest_bytes = canonical_json(manifest)
    manifest_hash = hashlib.sha256(manifest_bytes).hexdigest()
    return manifest, manifest_hash

if __name__ == "__main__":
    # Simulated object list returned by a source provider
    objs = [
        ObjectDescriptor(bucket="src-bucket", key="data/part-2.json", size=12, etag='"aaa"'),
        ObjectDescriptor(bucket="src-bucket", key="data/part-1.json", size=12, etag="aaa"),
    ]
    manifest, h = build_manifest(run_id="2026-04-08T00:00:00Z", objects=objs)
    print("Manifest hash:", h)
    print("Canonical manifest JSON:", canonical_json(manifest).decode("utf-8"))
```
What happens when you run it
Run:
```shell
python manifest_hash.py
```
You’ll see:
- the hash is computed from canonical JSON
- the `objects` array order is normalized
- `etag` formatting is normalized, so `"aaa"` and `aaa` hash the same
That eliminates the “already processed” marker instability.
Step-by-step: deterministic request signing inputs
The signing proxy issue came down to ensuring that the set of signed headers and the canonical payload hash are identical across retries.
Most SigV4 signing errors like `SignatureDoesNotMatch` are caused by one of these differences:
- `Host` mismatch (or the proxy forwards to a different host than the signer used)
- the payload hash differs (especially if streaming uploads vary)
- `x-amz-date` differs while canonicalization expects something else
- the headers included/excluded in the canonical request differ
In my setup, I controlled the boundary by making all upload requests go through one small proxy contract:
Proxy contract
- Pipeline sends:
- HTTP method
- target URL path
- headers: only a minimal stable set
- payload either:
- buffered fully (so payload hash is stable), or
- using “unsigned payload” mode if the target supports it (more on this below)
For S3, the safest approach for determinism is buffer the payload in the client that calls the proxy (so the proxy signs the same bytes every time).
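Concretely, "buffer the payload" means reading the full body once and deriving both the length and the payload hash (the value that would go into `x-amz-content-sha256`) from those exact bytes. A minimal sketch:

```python
import hashlib
import io

def prepare_upload(stream: io.BufferedIOBase) -> tuple[bytes, dict]:
    """Buffer the whole payload so its hash and length can't drift on retry."""
    body = stream.read()  # buffer fully; fine for manifest-sized objects
    headers = {
        "content-length": str(len(body)),
        "x-amz-content-sha256": hashlib.sha256(body).hexdigest(),
    }
    return body, headers

# Retrying with the same buffered bytes reproduces identical signing inputs.
body, headers = prepare_upload(io.BytesIO(b'{"schema":"etl/manifest/v1"}'))
body2, headers2 = prepare_upload(io.BytesIO(b'{"schema":"etl/manifest/v1"}'))
assert body == body2 and headers == headers2
```

For large objects you'd want multipart uploads instead, but for per-run manifests full buffering is cheap and removes a whole class of drift.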
A minimal signing proxy (FastAPI) example
This is not a production-grade proxy, but it shows the shape of the fix. The important part is the proxy takes a canonical request “envelope”, signs with AWS SDK tooling, and forwards.
Note: In real systems you’d configure credentials via environment variables/secret manager and likely enforce auth. Here I focus on the request determinism.
```python
# sigv4_proxy.py
import base64
import os

import httpx
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest
from botocore.credentials import Credentials
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()

AWS_REGION = os.environ.get("AWS_REGION", "us-east-1")
ACCESS_KEY = os.environ["AWS_ACCESS_KEY_ID"]
SECRET_KEY = os.environ["AWS_SECRET_ACCESS_KEY"]
SESSION_TOKEN = os.environ.get("AWS_SESSION_TOKEN", "")
TARGET_SERVICE = os.environ.get("TARGET_SERVICE", "s3")  # 's3' for object APIs

def sign_request(method: str, url: str, headers: dict, body: bytes):
    # Build an AWSRequest for SigV4 signing.
    # Using botocore ensures canonical request computation is consistent.
    aws_req = AWSRequest(method=method, url=url, data=body, headers=headers)
    creds = Credentials(ACCESS_KEY, SECRET_KEY, SESSION_TOKEN or None)
    signer = SigV4Auth(creds, TARGET_SERVICE, AWS_REGION)
    signer.add_auth(aws_req)
    # Return the signed request's headers (now including Authorization)
    return aws_req.headers

@app.post("/upload")
async def upload(request: Request):
    """
    Expected JSON body:
    {
        "target_url": "https://host/bucket/key",
        "method": "PUT",
        "headers": { "content-type": "...", ... },
        "payload_base64": "..."
    }
    """
    payload = await request.json()
    target_url = payload["target_url"]
    method = payload.get("method", "PUT")
    headers = payload.get("headers", {})
    body = base64.b64decode(payload["payload_base64"])

    # Ensure a deterministic header set.
    # Example: don't allow changing Host; let target_url determine it.
    stable_headers = {
        "content-type": headers.get("content-type", "application/octet-stream"),
        "content-length": str(len(body)),
    }
    signed_headers = sign_request(method, target_url, stable_headers, body)

    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.request(
            method,
            target_url,
            headers=dict(signed_headers),
            content=body,
        )
    if resp.status_code >= 400:
        raise HTTPException(status_code=resp.status_code, detail=resp.text)
    return {"status": "ok", "etag": resp.headers.get("ETag")}
```
Why this fixed my retries
- The proxy signs using the exact `body` bytes it receives.
- The proxy sets a stable `content-length` based on the buffered payload size.
- The canonical request is far less likely to drift between retries.
In my previous approach, I was streaming bytes through multiple layers and accidentally changed payload hash inputs when the upload was retried.
End-to-end: upload a deterministic manifest through the proxy
Here’s a small client that:
- Builds the manifest deterministically
- Uploads it through the proxy
- Uses the manifest hash as the object key (dedupe)
```python
# upload_manifest.py
import base64
import json
import os

import requests

from manifest_hash import ObjectDescriptor, build_manifest

PROXY_URL = os.environ.get("PROXY_URL", "http://localhost:8000")
DEST_BUCKET = os.environ.get("DEST_BUCKET", "dest-bucket")
DEST_HOST = os.environ.get("DEST_HOST", "s3.example.com")  # endpoint host
DEST_REGION = os.environ.get("DEST_REGION", "us-east-1")   # not used directly here

def object_url(bucket: str, key: str) -> str:
    return f"https://{DEST_HOST}/{bucket}/{key}"

if __name__ == "__main__":
    # Simulated objects from the source
    objs = [
        ObjectDescriptor(bucket="src-bucket", key="data/part-1.json", size=12, etag='"aaa"'),
        ObjectDescriptor(bucket="src-bucket", key="data/part-2.json", size=12, etag="aaa"),
    ]
    manifest, h = build_manifest(run_id="2026-04-08T00:00:00Z", objects=objs)
    manifest_bytes = json.dumps(
        manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")

    dedupe_key = f"manifests/{h}.json"
    payload = {
        "target_url": object_url(DEST_BUCKET, dedupe_key),
        "method": "PUT",
        "headers": {"content-type": "application/json"},
        "payload_base64": base64.b64encode(manifest_bytes).decode("ascii"),
    }
    resp = requests.post(f"{PROXY_URL}/upload", json=payload, timeout=60)
    resp.raise_for_status()
    print("Uploaded manifest to:", dedupe_key)
    print("Manifest hash:", h)
    print("Proxy response:", resp.json())
```
What happens when you run it
- `manifest_hash.py` proves stable hashing.
- `upload_manifest.py` uploads using that hash-based key.
- Re-running the script produces the same key, so the destination simply overwrites identical content; alternatively, you can switch to a `PUT` with conditional headers (like `If-None-Match`) to avoid overwrites entirely.
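Many S3-compatible stores accept `If-None-Match: *` on a `PUT` to reject the write when the key already exists. Assuming the proxy's stable header allowlist were extended to pass `if-none-match` through, the client-side envelope only needs the extra header; a sketch (no network call here, and the field names follow the proxy contract above):

```python
import base64
import hashlib
import json

manifest_bytes = json.dumps(
    {"schema": "etl/manifest/v1"}, sort_keys=True, separators=(",", ":")
).encode()
manifest_hash = hashlib.sha256(manifest_bytes).hexdigest()

# Envelope for the signing proxy; assumes its header allowlist passes
# if-none-match through so the destination can reject duplicate PUTs.
envelope = {
    "target_url": f"https://s3.example.com/dest-bucket/manifests/{manifest_hash}.json",
    "method": "PUT",
    "headers": {"content-type": "application/json", "if-none-match": "*"},
    "payload_base64": base64.b64encode(manifest_bytes).decode("ascii"),
}
assert base64.b64decode(envelope["payload_base64"]) == manifest_bytes
```

With this, a retry of an already-committed run fails fast with a 412-style response instead of silently rewriting the marker.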
FinOps note: why deterministic dedupe mattered for cost
Before this change, every retry produced a different manifest marker, so downstream steps reprocessed the same objects. That created:
- extra storage reads/writes
- repeated compute execution
- extra egress through the proxy
Once manifest hashing became deterministic and request signing inputs stabilized, retries stopped triggering “new work,” and the run marker became a real idempotency key.
Operational checklist I ended up relying on
- Canonical manifest hashing
- stable sort of objects
- normalized ETag formatting
- deterministic JSON encoding (sorted keys, no whitespace)
- Deterministic upload bytes
- buffer payload before signing
- set `content-length` explicitly
- Proxy should sign with inputs that won’t drift
- minimal, stable header set
- avoid “helpful” middleware that mutates headers between retries
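Most of this checklist can be asserted in a quick regression test. Here's a self-contained sketch that re-implements the normalization inline (rather than importing `manifest_hash.py`) and checks the two invariants that bit me:

```python
import hashlib
import json
import re

def normalize_etag(etag: str) -> str:
    # Strip optional surrounding quotes, as in the main script.
    m = re.match(r'^"?([^"]+)"?$', etag.strip())
    return m.group(1) if m else etag.strip()

def manifest_hash(objects: list[dict]) -> str:
    objs = sorted(
        ({**o, "etag": normalize_etag(o["etag"])} for o in objects),
        key=lambda o: (o["bucket"], o["key"]),
    )
    blob = json.dumps({"objects": objs}, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()

a = [{"bucket": "b", "key": "k1", "etag": '"aaa"'},
     {"bucket": "b", "key": "k2", "etag": "bbb"}]
b = [{"bucket": "b", "key": "k2", "etag": "bbb"},
     {"bucket": "b", "key": "k1", "etag": "aaa"}]
# Invariant: enumeration order and ETag quoting must not change the hash.
assert manifest_hash(a) == manifest_hash(b)
```

Running a check like this in CI catches regressions in the canonicalization rules before they become "why is the pipeline reprocessing everything" incidents.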
Conclusion
I learned that hybrid multi-cloud pipelines fail in annoying, non-obvious ways when two sources of nondeterminism collide: manifest hashing instability and SigV4 signature drift caused by changing request-signing inputs. By enforcing canonical JSON for manifests and making uploads go through a SigV4 signing proxy that signs deterministic buffered payloads with a stable header set, I got reliable retries and true idempotent dedupe markers—turning an error-prone cross-cloud ETL into something that behaves consistently at scale.