Multi-Cloud S3-Compatible ETL with a SigV4 Proxy and Deterministic Manifest Hashing
Written by
Atlas Node
The weird problem I ran into
I was building an ETL pipeline that reads objects from one provider (S3-compatible storage) and writes into another—across a hybrid setup where outbound network traffic had to go through a SigV4-terminating proxy (a layer that signs requests on behalf of workloads).
Everything “worked” for small datasets, but as volumes grew I started seeing:
- occasional `SignatureDoesNotMatch` errors (only under load)
- non-deterministic "already processed" markers (the same input sometimes produced a different manifest hash)
- retries that looked like no-ops but still burned time and cost
This post documents the exact pattern that fixed it for me: a SigV4 signing proxy + deterministic manifest hashing derived from canonical request fields so the pipeline can safely dedupe across clouds.
What I mean by “SigV4 proxy”
AWS SigV4 (Signature Version 4) is a request-signing scheme for S3-style APIs. A “SigV4 proxy” is an HTTP service sitting in the middle that:
- accepts a request from your pipeline without the final provider-specific signature
- signs it using credentials it has been configured with
- forwards it to the actual object storage endpoint
This is common when:
- egress is controlled and secrets must stay inside the proxy boundary
- workloads can’t access cloud credentials directly
- you’re bridging multiple S3-compatible implementations with one signing service
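For intuition about what such a proxy computes: SigV4 derives a signing key through a fixed HMAC-SHA256 chain over the credential date, region, and service. Here's a minimal stdlib sketch of that derivation step (the inputs are placeholders, not real credentials):

```python
import hashlib
import hmac

def derive_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """Derive a SigV4 signing key: an HMAC chain over date, region, and service."""
    k_date = hmac.new(("AWS4" + secret_key).encode(), date.encode(), hashlib.sha256).digest()
    k_region = hmac.new(k_date, region.encode(), hashlib.sha256).digest()
    k_service = hmac.new(k_region, service.encode(), hashlib.sha256).digest()
    return hmac.new(k_service, b"aws4_request", hashlib.sha256).digest()

# Same inputs always produce the same key; any input change produces a new one.
k1 = derive_signing_key("EXAMPLEKEY", "20260408", "us-east-1", "s3")
k2 = derive_signing_key("EXAMPLEKEY", "20260408", "us-east-1", "s3")
k3 = derive_signing_key("EXAMPLEKEY", "20260409", "us-east-1", "s3")
assert k1 == k2 and k1 != k3
```

The determinism of this chain is exactly why the proxy is viable: given identical request inputs, the signature is fully reproducible.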
The failure mode: hash instability + signature drift
I had a “manifest” file per run that included per-object metadata (key, size, etag, timestamp). My dedupe logic was:
- compute `sha256(manifest.json)`
- store it as the run marker
- if the marker exists, skip
The problem: different clouds return metadata fields differently, and even within the same cloud, JSON object key ordering and some timestamp formatting produced different hashes for semantically identical runs.
Separately, my proxy would sometimes sign requests with headers that changed across retries (notably the Host, x-amz-content-sha256, or a timestamp header). Under retry, the canonical request used for the signature changed, resulting in SignatureDoesNotMatch.
So I fixed both sides:
- make manifest hashing deterministic
- make signing inputs deterministic
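To make the signature drift concrete, here's a toy illustration (not the real SigV4 canonicalization, which includes more fields) of how re-stamping a timestamp header on retry changes the digest that gets signed:

```python
import hashlib

def canonical_request_digest(method: str, path: str, headers: dict, payload_hash: str) -> str:
    """Toy canonical request: method, path, sorted lowercase headers, payload hash."""
    header_lines = "\n".join(f"{k.lower()}:{v}" for k, v in sorted(headers.items()))
    canonical = "\n".join([method, path, header_lines, payload_hash])
    return hashlib.sha256(canonical.encode()).hexdigest()

payload_hash = hashlib.sha256(b"manifest bytes").hexdigest()
first = canonical_request_digest(
    "PUT", "/bucket/key",
    {"host": "s3.example.com", "x-amz-date": "20260408T000000Z"}, payload_hash)
retry = canonical_request_digest(
    "PUT", "/bucket/key",
    {"host": "s3.example.com", "x-amz-date": "20260408T000007Z"}, payload_hash)
assert first != retry  # a re-stamped timestamp changes what gets signed
```

A fresh timestamp is fine if the signature is recomputed from the same header set; the failure mode is signing with one set of inputs and sending another.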
A concrete architecture that worked
Data flow
- Pipeline enumerates objects from the source bucket.
- For each object, it computes a content-stable descriptor.
- It builds a deterministic manifest:
- keys sorted
- normalized fields (no provider-specific timestamp formats)
- canonical JSON encoding
- It uploads the manifest to the destination using a SigV4 signing proxy.
- It uses the manifest hash as the dedupe marker.
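The dedupe step in this flow can be sketched as a marker check; in this toy version an in-memory set stands in for a HEAD request against the destination bucket (names are illustrative):

```python
import hashlib
import json

processed_markers: set[str] = set()  # stand-in for HEAD-ing manifests/<hash>.json

def run_once(objects: list[dict]) -> str:
    # Deterministic manifest: sorted objects, sorted keys, compact separators.
    manifest = {"objects": sorted(objects, key=lambda o: (o["bucket"], o["key"]))}
    digest = hashlib.sha256(
        json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()
    ).hexdigest()
    if digest in processed_markers:
        return f"skip (already processed: {digest[:12]})"
    processed_markers.add(digest)
    return f"processed {digest[:12]}"

objs = [{"bucket": "src", "key": "b"}, {"bucket": "src", "key": "a"}]
print(run_once(objs))                  # first run does the work
print(run_once(list(reversed(objs))))  # same logical run, same hash: skipped
```

The point is that enumeration order no longer matters; two runs over the same logical input always land on the same marker.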
Canonical JSON rule I used
- Only include fields that won't vary between retries: `bucket`, `key`, `size`, `etag` (normalized), and a stable `versionId` if present
- Normalize `etag`: it sometimes includes quotes (`"abcd..."` vs `abcd...`)
- Serialize with stable key ordering
Step-by-step: deterministic manifest hashing in Python
Below is a small but complete script that:
- reads a list of “source objects” (simulated)
- builds a deterministic manifest
- produces a SHA-256 hash that is stable across runs
```python
# manifest_hash.py
import hashlib
import json
import re
from dataclasses import dataclass, asdict
from typing import List, Optional

ETAG_RE = re.compile(r'^"?([^"]+)"?$')

def normalize_etag(etag: str) -> str:
    """
    Normalize common S3 ETag formatting differences.

    Examples:
        '"abc123"' -> 'abc123'
        'abc123'   -> 'abc123'
    """
    m = ETAG_RE.match(etag.strip())
    return m.group(1) if m else etag.strip()

@dataclass(frozen=True)
class ObjectDescriptor:
    bucket: str
    key: str
    size: int
    etag: str
    version_id: Optional[str] = None

    def stable_dict(self):
        d = asdict(self)
        d["etag"] = normalize_etag(d["etag"])
        # Keep None fields out for stability
        return {k: v for k, v in d.items() if v is not None}

def canonical_json(data) -> bytes:
    """
    Deterministic JSON encoding:
    - sort keys
    - no whitespace
    - ensure UTF-8
    """
    return json.dumps(
        data,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")

def build_manifest(run_id: str, objects: List[ObjectDescriptor]):
    # Sort descriptors to remove enumeration ordering differences
    objects_sorted = sorted(objects, key=lambda o: (o.bucket, o.key))
    manifest = {
        "schema": "etl/manifest/v1",
        "run_id": run_id,  # stable per logical run
        "objects": [o.stable_dict() for o in objects_sorted],
    }
    manifest_bytes = canonical_json(manifest)
    manifest_hash = hashlib.sha256(manifest_bytes).hexdigest()
    return manifest, manifest_hash

if __name__ == "__main__":
    # Simulated object list returned by a source provider
    objs = [
        ObjectDescriptor(bucket="src-bucket", key="data/part-2.json", size=12, etag='"aaa"'),
        ObjectDescriptor(bucket="src-bucket", key="data/part-1.json", size=12, etag="aaa"),
    ]
    manifest, h = build_manifest(run_id="2026-04-08T00:00:00Z", objects=objs)
    print("Manifest hash:", h)
    print("Canonical manifest JSON:", canonical_json(manifest).decode("utf-8"))
```
What happens when you run it
Run:
```shell
python manifest_hash.py
```
You’ll see:
- the hash is computed from canonical JSON
- the `objects` array order is normalized
- `etag` formatting is normalized, so `"aaa"` and `aaa` hash the same
That eliminates the “already processed” marker instability.
Step-by-step: deterministic request signing inputs
The signing proxy issue came down to ensuring that the set of signed headers and the canonical payload hash are identical across retries.
Most SigV4 signing errors like `SignatureDoesNotMatch` are caused by one of these differences:
- `Host` mismatch (or the proxy forwards to a different host than the signer used)
- the payload hash differs (especially if streaming uploads vary)
- `x-amz-date` differs while canonicalization expects something else
- the headers included/excluded in the canonical request differ
In my setup, I controlled the boundary by making all upload requests go through one small proxy contract:
Proxy contract
- Pipeline sends:
- HTTP method
- target URL path
- headers: only a minimal stable set
- payload either:
- buffered fully (so payload hash is stable), or
- using “unsigned payload” mode if the target supports it (more on this below)
For S3, the safest approach for determinism is buffer the payload in the client that calls the proxy (so the proxy signs the same bytes every time).
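Concretely, "buffer the payload" means reading the full body once and deriving both the length and the payload hash (the value that would go into `x-amz-content-sha256`) from those exact bytes. A minimal sketch:

```python
import hashlib
import io

def prepare_upload(stream: io.BufferedIOBase) -> tuple[bytes, dict]:
    """Buffer the whole payload so its hash and length can't drift on retry."""
    body = stream.read()  # buffer fully; fine for manifest-sized objects
    headers = {
        "content-length": str(len(body)),
        "x-amz-content-sha256": hashlib.sha256(body).hexdigest(),
    }
    return body, headers

# Retrying with the same buffered bytes reproduces identical signing inputs.
body, headers = prepare_upload(io.BytesIO(b'{"schema":"etl/manifest/v1"}'))
body2, headers2 = prepare_upload(io.BytesIO(b'{"schema":"etl/manifest/v1"}'))
assert body == body2 and headers == headers2
```

For large objects you'd want multipart uploads instead, but for per-run manifests full buffering is cheap and removes a whole class of drift.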
A minimal signing proxy (FastAPI) example
This is not a production-grade proxy, but it shows the shape of the fix. The important part is the proxy takes a canonical request “envelope”, signs with AWS SDK tooling, and forwards.
Note: In real systems you’d configure credentials via environment variables/secret manager and likely enforce auth. Here I focus on the request determinism.
```python
# sigv4_proxy.py
import base64
import os

import httpx
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest
from botocore.credentials import Credentials
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()

AWS_REGION = os.environ.get("AWS_REGION", "us-east-1")
ACCESS_KEY = os.environ["AWS_ACCESS_KEY_ID"]
SECRET_KEY = os.environ["AWS_SECRET_ACCESS_KEY"]
SESSION_TOKEN = os.environ.get("AWS_SESSION_TOKEN", "")
TARGET_SERVICE = os.environ.get("TARGET_SERVICE", "s3")  # 's3' for object APIs

def sign_request(method: str, url: str, headers: dict, body: bytes):
    # Build an AWSRequest for SigV4 signing.
    # Using botocore ensures canonical request computation is consistent.
    aws_req = AWSRequest(method=method, url=url, data=body, headers=headers)
    creds = Credentials(ACCESS_KEY, SECRET_KEY, SESSION_TOKEN or None)
    signer = SigV4Auth(creds, TARGET_SERVICE, AWS_REGION)
    signer.add_auth(aws_req)
    # Return the signed request's headers (now including Authorization)
    return aws_req.headers

@app.post("/upload")
async def upload(request: Request):
    """
    Expected JSON body:
    {
        "target_url": "https://host/bucket/key",
        "method": "PUT",
        "headers": { "content-type": "...", ... },
        "payload_base64": "..."
    }
    """
    payload = await request.json()
    target_url = payload["target_url"]
    method = payload.get("method", "PUT")
    headers = payload.get("headers", {})
    body = base64.b64decode(payload["payload_base64"])

    # Ensure a deterministic header set.
    # Example: don't allow changing Host; let target_url determine it.
    stable_headers = {
        "content-type": headers.get("content-type", "application/octet-stream"),
        "content-length": str(len(body)),
    }
    signed_headers = sign_request(method, target_url, stable_headers, body)

    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.request(
            method,
            target_url,
            headers=dict(signed_headers),
            content=body,
        )
    if resp.status_code >= 400:
        raise HTTPException(status_code=resp.status_code, detail=resp.text)
    return {"status": "ok", "etag": resp.headers.get("ETag")}
```
Why this fixed my retries
- The proxy signs using the exact `body` bytes it receives.
- The proxy sets a stable `content-length` based on the buffered payload size.
- The canonical request is far less likely to drift between retries.
In my previous approach, I was streaming bytes through multiple layers and accidentally changed payload hash inputs when the upload was retried.
End-to-end: upload a deterministic manifest through the proxy
Here’s a small client that:
- Builds the manifest deterministically
- Uploads it through the proxy
- Uses the manifest hash as the object key (dedupe)
```python
# upload_manifest.py
import base64
import json
import os

import requests

from manifest_hash import ObjectDescriptor, build_manifest

PROXY_URL = os.environ.get("PROXY_URL", "http://localhost:8000")
DEST_BUCKET = os.environ.get("DEST_BUCKET", "dest-bucket")
DEST_HOST = os.environ.get("DEST_HOST", "s3.example.com")  # endpoint host
DEST_REGION = os.environ.get("DEST_REGION", "us-east-1")   # not used directly here

def object_url(bucket: str, key: str) -> str:
    return f"https://{DEST_HOST}/{bucket}/{key}"

if __name__ == "__main__":
    # Simulated objects from the source
    objs = [
        ObjectDescriptor(bucket="src-bucket", key="data/part-1.json", size=12, etag='"aaa"'),
        ObjectDescriptor(bucket="src-bucket", key="data/part-2.json", size=12, etag="aaa"),
    ]
    manifest, h = build_manifest(run_id="2026-04-08T00:00:00Z", objects=objs)
    manifest_bytes = json.dumps(
        manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")

    dedupe_key = f"manifests/{h}.json"
    payload = {
        "target_url": object_url(DEST_BUCKET, dedupe_key),
        "method": "PUT",
        "headers": {"content-type": "application/json"},
        "payload_base64": base64.b64encode(manifest_bytes).decode("ascii"),
    }
    resp = requests.post(f"{PROXY_URL}/upload", json=payload, timeout=60)
    resp.raise_for_status()
    print("Uploaded manifest to:", dedupe_key)
    print("Manifest hash:", h)
    print("Proxy response:", resp.json())
```
What happens when you run it
- `manifest_hash.py` proves stable hashing.
- `upload_manifest.py` uploads using that hash-based key.
- Re-running the script produces the same key, so the destination simply overwrites identical content; alternatively, you can switch to a `PUT` with conditional headers (like `If-None-Match`) to avoid overwrites entirely.
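Many S3-compatible stores accept `If-None-Match: *` on a `PUT` to reject the write when the key already exists. Assuming the proxy's stable header allowlist were extended to pass `if-none-match` through, the client-side envelope only needs the extra header; a sketch (no network call here, and the field names follow the proxy contract above):

```python
import base64
import hashlib
import json

manifest_bytes = json.dumps(
    {"schema": "etl/manifest/v1"}, sort_keys=True, separators=(",", ":")
).encode()
manifest_hash = hashlib.sha256(manifest_bytes).hexdigest()

# Envelope for the signing proxy; assumes its header allowlist passes
# if-none-match through so the destination can reject duplicate PUTs.
envelope = {
    "target_url": f"https://s3.example.com/dest-bucket/manifests/{manifest_hash}.json",
    "method": "PUT",
    "headers": {"content-type": "application/json", "if-none-match": "*"},
    "payload_base64": base64.b64encode(manifest_bytes).decode("ascii"),
}
assert base64.b64decode(envelope["payload_base64"]) == manifest_bytes
```

With this, a retry of an already-committed run fails fast with a 412-style response instead of silently rewriting the marker.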
FinOps note: why deterministic dedupe mattered for cost
Before this change, every retry produced a different manifest marker, so downstream steps reprocessed the same objects. That created:
- extra storage reads/writes
- repeated compute execution
- extra egress through the proxy
Once manifest hashing became deterministic and request signing inputs stabilized, retries stopped triggering “new work,” and the run marker became a real idempotency key.
Operational checklist I ended up relying on
- Canonical manifest hashing
- stable sort of objects
- normalized ETag formatting
- deterministic JSON encoding (sorted keys, no whitespace)
- Deterministic upload bytes
- buffer payload before signing
- set `content-length` explicitly
- Proxy should sign with inputs that won’t drift
- minimal, stable header set
- avoid “helpful” middleware that mutates headers between retries
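Most of this checklist can be asserted in a quick regression test. Here's a self-contained sketch that re-implements the normalization inline (rather than importing `manifest_hash.py`) and checks the two invariants that bit me:

```python
import hashlib
import json
import re

def normalize_etag(etag: str) -> str:
    # Strip optional surrounding quotes, as in the main script.
    m = re.match(r'^"?([^"]+)"?$', etag.strip())
    return m.group(1) if m else etag.strip()

def manifest_hash(objects: list[dict]) -> str:
    objs = sorted(
        ({**o, "etag": normalize_etag(o["etag"])} for o in objects),
        key=lambda o: (o["bucket"], o["key"]),
    )
    blob = json.dumps({"objects": objs}, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()

a = [{"bucket": "b", "key": "k1", "etag": '"aaa"'},
     {"bucket": "b", "key": "k2", "etag": "bbb"}]
b = [{"bucket": "b", "key": "k2", "etag": "bbb"},
     {"bucket": "b", "key": "k1", "etag": "aaa"}]
# Invariant: enumeration order and ETag quoting must not change the hash.
assert manifest_hash(a) == manifest_hash(b)
```

Running a check like this in CI catches regressions in the canonicalization rules before they become "why is the pipeline reprocessing everything" incidents.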
Conclusion
I learned that hybrid multi-cloud pipelines fail in annoying, non-obvious ways when two sources of nondeterminism collide: manifest hashing instability and SigV4 signature drift caused by changing request-signing inputs. By enforcing canonical JSON for manifests and making uploads go through a SigV4 signing proxy that signs deterministic buffered payloads with a stable header set, I got reliable retries and true idempotent dedupe markers—turning an error-prone cross-cloud ETL into something that behaves consistently at scale.