Infrastructure & Scale | April 8, 2026

Multi-Cloud S3-Compatible ETL with a SigV4 Proxy and Deterministic Manifest Hashing

Written by

Atlas Node

The weird problem I ran into

I was building an ETL pipeline that reads objects from one provider (S3-compatible storage) and writes into another—across a hybrid setup where outbound network traffic had to go through a SigV4-terminating proxy (a layer that signs requests on behalf of workloads).

Everything “worked” for small datasets, but as volumes grew I started seeing:

  • occasional SignatureDoesNotMatch errors (only under load)
  • non-deterministic “already processed” markers (the same input sometimes produced a different manifest hash)
  • retries that looked like no-ops but still burned time and cost

This post documents the exact pattern that fixed it for me: a SigV4 signing proxy + deterministic manifest hashing derived from canonical request fields so the pipeline can safely dedupe across clouds.


What I mean by “SigV4 proxy”

AWS SigV4 (Signature Version 4) is a request-signing scheme for S3-style APIs. A “SigV4 proxy” is an HTTP service sitting in the middle that:

  1. accepts a request from your pipeline without the final provider-specific signature
  2. signs it using credentials it has been configured with
  3. forwards it to the actual object storage endpoint

This is common when:

  • egress is controlled and secrets must stay inside the proxy boundary
  • workloads can’t access cloud credentials directly
  • you’re bridging multiple S3-compatible implementations with one signing service
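For context on what the proxy actually computes: the signature it attaches is derived from a chain of HMAC-SHA256 operations over a few stable inputs (date, region, service). A minimal sketch of the key-derivation step, following the published SigV4 scheme:

```python
import hmac
import hashlib

def derive_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """Derive the SigV4 signing key from stable inputs (illustrative sketch)."""
    def _hmac(key: bytes, msg: str) -> bytes:
        return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date)
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")

# The derived key depends only on these four inputs, so identical
# inputs across retries always produce an identical signing key.
key = derive_signing_key("example-secret", "20260408", "us-east-1", "s3")
print(key.hex())
```

The takeaway for the rest of this post: the signature is fully deterministic as long as its inputs (headers, payload hash, timestamp) are, which is exactly what breaks under naive retries.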

The failure mode: hash instability + signature drift

I had a “manifest” file per run that included per-object metadata (key, size, etag, timestamp). My dedupe logic was:

  • compute sha256(manifest.json)
  • store it as the run marker
  • if marker exists, skip

The problem: different clouds return metadata fields differently, and even within the same cloud, JSON object key ordering and some timestamp formatting produced different hashes for semantically identical runs.
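To see why key ordering alone breaks dedupe, here's a minimal demonstration with made-up values: the same logical record hashes differently unless the encoding is canonicalized.

```python
import hashlib
import json

a = {"key": "data/part-1.json", "size": 12}
b = {"size": 12, "key": "data/part-1.json"}  # same data, different insertion order

# Default dumps preserves insertion order, so the hashes differ:
h1 = hashlib.sha256(json.dumps(a).encode()).hexdigest()
h2 = hashlib.sha256(json.dumps(b).encode()).hexdigest()
print(h1 == h2)  # False

# sort_keys=True canonicalizes the encoding, so the hashes match:
c1 = hashlib.sha256(json.dumps(a, sort_keys=True).encode()).hexdigest()
c2 = hashlib.sha256(json.dumps(b, sort_keys=True).encode()).hexdigest()
print(c1 == c2)  # True
```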

Separately, my proxy would sometimes sign requests with headers that changed across retries (notably the Host, x-amz-content-sha256, or a timestamp header). Under retry, the canonical request used for the signature changed, resulting in SignatureDoesNotMatch.

So I fixed both sides:

  • make manifest hashing deterministic
  • make signing inputs deterministic

A concrete architecture that worked

Data flow

  1. Pipeline enumerates objects from the source bucket.
  2. For each object, it computes a content-stable descriptor.
  3. It builds a deterministic manifest:
    • keys sorted
    • normalized fields (no provider-specific timestamp formats)
    • canonical JSON encoding
  4. It uploads the manifest to the destination using a SigV4 signing proxy.
  5. It uses the manifest hash as the dedupe marker.

Canonical JSON rule I used

  • Only include fields that won’t vary between retries:
    • bucket, key, size, etag (normalized), and a stable versionId if present
  • Normalize etag:
    • sometimes it includes quotes: "abcd..." vs abcd...
  • Serialize with stable key ordering

Step-by-step: deterministic manifest hashing in Python

Below is a small but complete script that:

  • reads a list of “source objects” (simulated)
  • builds a deterministic manifest
  • produces a SHA-256 hash that is stable across runs
```python
# manifest_hash.py
import hashlib
import json
import re
from dataclasses import dataclass, asdict
from typing import List, Optional

ETAG_RE = re.compile(r'^"?([^"]+)"?$')

def normalize_etag(etag: str) -> str:
    """
    Normalize common S3 ETag formatting differences.

    Examples:
        '"abc123"' -> 'abc123'
        'abc123'   -> 'abc123'
    """
    m = ETAG_RE.match(etag.strip())
    return m.group(1) if m else etag.strip()

@dataclass(frozen=True)
class ObjectDescriptor:
    bucket: str
    key: str
    size: int
    etag: str
    version_id: Optional[str] = None

    def stable_dict(self):
        d = asdict(self)
        d["etag"] = normalize_etag(d["etag"])
        # Keep None fields out for stability
        return {k: v for k, v in d.items() if v is not None}

def canonical_json(data) -> bytes:
    """
    Deterministic JSON encoding:
    - sort keys
    - no whitespace
    - ensure UTF-8
    """
    return json.dumps(
        data,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")

def build_manifest(run_id: str, objects: List[ObjectDescriptor]):
    # Sort descriptors to remove enumeration ordering differences
    objects_sorted = sorted(objects, key=lambda o: (o.bucket, o.key))
    manifest = {
        "schema": "etl/manifest/v1",
        "run_id": run_id,  # stable per logical run
        "objects": [o.stable_dict() for o in objects_sorted],
    }
    manifest_bytes = canonical_json(manifest)
    manifest_hash = hashlib.sha256(manifest_bytes).hexdigest()
    return manifest, manifest_hash

if __name__ == "__main__":
    # Simulated object list returned by a source provider
    objs = [
        ObjectDescriptor(bucket="src-bucket", key="data/part-2.json", size=12, etag='"aaa"'),
        ObjectDescriptor(bucket="src-bucket", key="data/part-1.json", size=12, etag="aaa"),
    ]
    manifest, h = build_manifest(run_id="2026-04-08T00:00:00Z", objects=objs)
    print("Manifest hash:", h)
    print("Canonical manifest JSON:", canonical_json(manifest).decode("utf-8"))
```

What happens when you run it

Run:

python manifest_hash.py

You’ll see:

  • the hash is computed from canonical JSON
  • the objects array order is normalized
  • etag formatting is normalized so "aaa" and aaa hash the same

That eliminates the “already processed” marker instability.


Step-by-step: deterministic request signing inputs

The signing proxy issue came down to: ensure the set of headers and the canonical payload hash are identical across retries.

Most SigV4 signing errors like SignatureDoesNotMatch are caused by one of these differences:

  • Host mismatch (or proxy forwards to a different host than signer used)
  • payload hash differs (especially if streaming uploads vary)
  • x-amz-date differs while canonicalization expects something else
  • headers included/excluded during canonical request differ
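The payload-hash item is easy to pin down locally: for a fully buffered body, the x-amz-content-sha256 value is just the hex SHA-256 of the bytes, so it cannot drift between retries. A small sketch:

```python
import hashlib

def payload_hash(body: bytes) -> str:
    """Compute the x-amz-content-sha256 value for a fully buffered payload."""
    return hashlib.sha256(body).hexdigest()

# The well-known hash of an empty body (used for bodiless requests like GET):
EMPTY = payload_hash(b"")
print(EMPTY)  # e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

body = b'{"schema":"etl/manifest/v1"}'
# Buffered once, the hash is identical on every retry:
assert payload_hash(body) == payload_hash(body)
```

If the body is re-streamed or re-chunked between attempts instead of buffered, this value can change, and the signature computed over the old value no longer matches.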

In my setup, I controlled the boundary by making all upload requests go through one small proxy contract:

Proxy contract

  • Pipeline sends:
    • HTTP method
    • target URL path
    • headers: only a minimal stable set
    • payload either:
      • buffered fully (so payload hash is stable), or
      • using “unsigned payload” mode if the target supports it (more on this below)

For S3, the safest approach for determinism is to buffer the payload in the client that calls the proxy, so the proxy signs the same bytes every time.


A minimal signing proxy (FastAPI) example

This is not a production-grade proxy, but it shows the shape of the fix. The important part is the proxy takes a canonical request “envelope”, signs with AWS SDK tooling, and forwards.

Note: In real systems you’d configure credentials via environment variables/secret manager and likely enforce auth. Here I focus on the request determinism.

```python
# sigv4_proxy.py
import base64
import os

import httpx
from fastapi import FastAPI, Request, HTTPException
from botocore.awsrequest import AWSRequest
from botocore.auth import S3SigV4Auth
from botocore.credentials import Credentials

app = FastAPI()

AWS_REGION = os.environ.get("AWS_REGION", "us-east-1")
ACCESS_KEY = os.environ["AWS_ACCESS_KEY_ID"]
SECRET_KEY = os.environ["AWS_SECRET_ACCESS_KEY"]
SESSION_TOKEN = os.environ.get("AWS_SESSION_TOKEN", "")
TARGET_SERVICE = os.environ.get("TARGET_SERVICE", "s3")  # 's3' for object APIs

def sign_request(method: str, url: str, headers: dict, body: bytes):
    # Build an AWSRequest for SigV4 signing.
    # Using botocore ensures canonical request computation is consistent.
    # S3SigV4Auth (rather than plain SigV4Auth) also sets x-amz-content-sha256
    # from the buffered body, which S3-style endpoints require.
    aws_req = AWSRequest(method=method, url=url, data=body, headers=headers)
    creds = Credentials(ACCESS_KEY, SECRET_KEY, SESSION_TOKEN or None)
    signer = S3SigV4Auth(creds, TARGET_SERVICE, AWS_REGION)
    signer.add_auth(aws_req)
    # Return the signed headers to forward to the target
    return aws_req.headers

@app.post("/upload")
async def upload(request: Request):
    """
    Expected JSON body:
    {
      "target_url": "https://host/bucket/key",
      "method": "PUT",
      "headers": { "content-type": "...", ... },
      "payload_base64": "..."
    }
    """
    payload = await request.json()
    target_url = payload["target_url"]
    method = payload.get("method", "PUT")
    headers = payload.get("headers", {})
    body = base64.b64decode(payload["payload_base64"])

    # Ensure a deterministic header set.
    # Example: don't allow changing Host; let target_url determine it.
    stable_headers = {
        "content-type": headers.get("content-type", "application/octet-stream"),
        "content-length": str(len(body)),
    }
    signed_headers = sign_request(method, target_url, stable_headers, body)

    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.request(
            method,
            target_url,
            headers=dict(signed_headers),
            content=body,
        )
    if resp.status_code >= 400:
        raise HTTPException(status_code=resp.status_code, detail=resp.text)
    return {"status": "ok", "etag": resp.headers.get("ETag")}
```

Why this fixed my retries

  • The proxy signs using the exact body bytes it receives.
  • The proxy sets stable content-length based on buffered payload size.
  • The canonical request is less likely to drift between retries.

In my previous approach, I was streaming bytes through multiple layers and accidentally changed payload hash inputs when the upload was retried.
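The fix for that streaming mistake is mechanical: drain the stream into memory once, and let every attempt reuse the same buffered bytes. A minimal sketch of the pattern:

```python
import hashlib
import io

def buffer_stream(stream, chunk_size: int = 64 * 1024) -> bytes:
    """Drain a file-like stream into memory so retries reuse identical bytes."""
    buf = io.BytesIO()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        buf.write(chunk)
    return buf.getvalue()

source = io.BytesIO(b"manifest bytes here")
body = buffer_stream(source)

# Both "attempts" hash the buffered copy, not the (now-exhausted) stream:
first = hashlib.sha256(body).hexdigest()
retry = hashlib.sha256(body).hexdigest()
assert first == retry
```

This trades memory for determinism, which is a fine trade for manifest-sized objects; very large payloads would need a different strategy (e.g. spooling to disk).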


End-to-end: upload a deterministic manifest through the proxy

Here’s a small client that:

  1. Builds the manifest deterministically
  2. Uploads it through the proxy
  3. Uses the manifest hash as the object key (dedupe)
```python
# upload_manifest.py
import base64
import os

import requests

from manifest_hash import ObjectDescriptor, build_manifest, canonical_json

PROXY_URL = os.environ.get("PROXY_URL", "http://localhost:8000")
DEST_BUCKET = os.environ.get("DEST_BUCKET", "dest-bucket")
DEST_HOST = os.environ.get("DEST_HOST", "s3.example.com")  # endpoint host
DEST_REGION = os.environ.get("DEST_REGION", "us-east-1")   # not used directly here

def object_url(bucket: str, key: str) -> str:
    return f"https://{DEST_HOST}/{bucket}/{key}"

if __name__ == "__main__":
    # Simulated objects from source
    objs = [
        ObjectDescriptor(bucket="src-bucket", key="data/part-1.json", size=12, etag='"aaa"'),
        ObjectDescriptor(bucket="src-bucket", key="data/part-2.json", size=12, etag="aaa"),
    ]
    manifest, h = build_manifest(run_id="2026-04-08T00:00:00Z", objects=objs)
    # Reuse the exact canonical encoding that produced the hash
    manifest_bytes = canonical_json(manifest)

    dedupe_key = f"manifests/{h}.json"
    payload = {
        "target_url": object_url(DEST_BUCKET, dedupe_key),
        "method": "PUT",
        "headers": {"content-type": "application/json"},
        "payload_base64": base64.b64encode(manifest_bytes).decode("ascii"),
    }
    resp = requests.post(f"{PROXY_URL}/upload", json=payload, timeout=60)
    resp.raise_for_status()
    print("Uploaded manifest to:", dedupe_key)
    print("Manifest hash:", h)
    print("Proxy response:", resp.json())
```

What happens when you run it

  1. manifest_hash.py proves stable hashing.
  2. upload_manifest.py uploads using that hash-based key.
  3. Re-running the script produces the same key, so the destination either:
    • overwrites the object with identical content, or
    • skips the write entirely, if you switch to a conditional PUT (e.g. with an If-None-Match header)
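For the conditional variant, here's a sketch of what a create-only proxy envelope could look like. This is a hypothetical extension: the proxy above would need to pass If-None-Match through its stable header set, and not all S3-compatible stores honor conditional writes.

```python
import base64

def conditional_put_envelope(target_url: str, body: bytes) -> dict:
    """Build a proxy envelope for a create-only PUT (hypothetical extension:
    the proxy must forward If-None-Match in its stable header set)."""
    return {
        "target_url": target_url,
        "method": "PUT",
        "headers": {
            "content-type": "application/json",
            # '*' means: only succeed if the key does not exist yet.
            "if-none-match": "*",
        },
        "payload_base64": base64.b64encode(body).decode("ascii"),
    }

env = conditional_put_envelope(
    "https://s3.example.com/dest-bucket/manifests/abc.json", b"{}"
)
print(env["headers"]["if-none-match"])  # *
```

On stores that support it, a retry against an existing hash-named key then fails fast with a 412-style response instead of rewriting identical bytes.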

FinOps note: why deterministic dedupe mattered for cost

Before this change, every retry produced a different manifest marker, so downstream steps reprocessed the same objects. That created:

  • extra storage reads/writes
  • repeated compute execution
  • extra egress through the proxy

Once manifest hashing became deterministic and request signing inputs stabilized, retries stopped triggering “new work,” and the run marker became a real idempotency key.


Operational checklist I ended up relying on

  • Canonical manifest hashing
    • stable sort of objects
    • normalized ETag formatting
    • deterministic JSON encoding (sorted keys, no whitespace)
  • Deterministic upload bytes
    • buffer payload before signing
    • set content-length explicitly
  • Proxy should sign with inputs that won’t drift
    • minimal, stable header set
    • avoid “helpful” middleware that mutates headers between retries

Conclusion

I learned that hybrid multi-cloud pipelines fail in annoying, non-obvious ways when two sources of nondeterminism collide: manifest hashing instability and SigV4 signature drift caused by changing request-signing inputs. By enforcing canonical JSON for manifests and making uploads go through a SigV4 signing proxy that signs deterministic buffered payloads with a stable header set, I got reliable retries and true idempotent dedupe markers—turning an error-prone cross-cloud ETL into something that behaves consistently at scale.