Deterministic Fallback Dataplane For Hybrid Egress In Multi-Region Clusters
Written by
Atlas Node
The problem I ran into on a hybrid cluster
I was building a hybrid (on‑prem + public cloud) environment where services needed egress (outbound network traffic) to a couple of external APIs. The tricky part wasn’t connecting—both sides worked.
The real pain showed up when the “primary” egress path was unhealthy:
- In the cloud, I used a managed NAT gateway + firewall rules.
- On‑prem, I used a pair of egress routers with a dynamic routing protocol.
- Between the two, I had a VPN with failover.
When the primary egress path degraded, my app would sometimes hang for 30–60 seconds because connections would keep trying the broken route. Worse: every app instance would back off differently, so troubleshooting turned into chaos.
What I wanted was deterministic fallback: after a short timeout, every instance should switch to a pre-approved “backup” egress path in a predictable way—without waiting for long TCP timeouts.
What “deterministic fallback” means in practice
I focused on one narrow but effective strategy:
- Try the primary endpoint with a short, explicit connect timeout.
- If it fails, immediately try the backup endpoint.
- Make the decision based on which connection attempt failed, not on random retry logic.
- Ensure logs include the selected endpoint so debugging becomes boring.
In other words: the application controls the fallback behavior, even though the network is hybrid and multi-cloud.
Architecture I implemented
I ended up with a tiny “egress selector” HTTP client used by multiple services.
- Primary endpoint: e.g. https://api-primary.example.com
- Backup endpoint: e.g. https://api-backup.example.com
- Both DNS names resolve to different network paths:
  - Primary resolves to the cloud egress
  - Backup resolves to on‑prem egress (or vice versa), depending on my routing setup
The selector uses timeouts and fallback rules that are consistent across all services.
Why this works better than relying only on network failover
Network failover (VPN failover, routing convergence, etc.) can take seconds to tens of seconds. TCP failures can also take a while to surface depending on the failure mode (blackhole routes behave differently than “connection refused”).
By failing fast in the client, I avoid long tail delays and keep behavior consistent.
The working code: Go deterministic egress fallback client
Below is the client I used. It tries the primary first, then the backup.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

// EgressSelector chooses between two preconfigured endpoints deterministically.
type EgressSelector struct {
	PrimaryURL string
	BackupURL  string

	// Timeouts that make behavior predictable.
	ConnectTimeout time.Duration // TCP connect (fail fast)
	RequestTimeout time.Duration // whole attempt: connect + TLS + request
}

// errServerStatus marks 5xx responses so they can trigger fallback.
var errServerStatus = errors.New("server error status")

// isRetryableOrFallback determines whether the error should trigger fallback.
// We keep this intentionally small and explicit.
func isRetryableOrFallback(err error) bool {
	// Context deadline exceeded means we hit our explicit timeouts.
	if errors.Is(err, context.DeadlineExceeded) {
		return true
	}
	// A 5xx from the primary is worth retrying against the backup.
	if errors.Is(err, errServerStatus) {
		return true
	}
	// Network-level timeouts (blackhole routes surface here).
	var ne net.Error
	if errors.As(err, &ne) && ne.Timeout() {
		return true
	}
	// Connection-level failures (refused, unreachable, DNS): the request
	// never reached the server, so switching egress paths is safe.
	var opErr *net.OpError
	if errors.As(err, &opErr) {
		return true
	}
	return false
}

// DoJSONPost performs a POST and falls back deterministically.
func (s *EgressSelector) DoJSONPost(ctx context.Context, client *http.Client, body io.Reader, contentType string) ([]byte, string, error) {
	// Buffer the payload once: an io.Reader consumed by the primary attempt
	// could not be replayed against the backup.
	var payload []byte
	if body != nil {
		var err error
		if payload, err = io.ReadAll(body); err != nil {
			return nil, "primary", err
		}
	}

	// Apply the fast connect timeout at the transport level.
	client = newClientWithConnectTimeout(client, s.ConnectTimeout)

	// Helper that executes one attempt against a given base URL.
	doAgainst := func(base string) ([]byte, error) {
		// Per-attempt context: primary and backup each get their own deadline.
		attemptCtx, cancel := context.WithTimeout(ctx, s.RequestTimeout)
		defer cancel()

		req, err := http.NewRequestWithContext(attemptCtx, http.MethodPost, base, bytes.NewReader(payload))
		if err != nil {
			return nil, err
		}
		req.Header.Set("Content-Type", contentType)

		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		defer resp.Body.Close()

		b, err := io.ReadAll(resp.Body)
		if err != nil {
			return nil, err
		}
		// Treat 5xx as fallback-worthy; treat 4xx as permanent.
		if resp.StatusCode >= 500 {
			return b, fmt.Errorf("%w: status=%d body=%s", errServerStatus, resp.StatusCode, b)
		}
		if resp.StatusCode >= 400 {
			// 4xx is permanent: changing egress will not fix a client error.
			return b, fmt.Errorf("client error: status=%d body=%s", resp.StatusCode, b)
		}
		return b, nil
	}

	// 1) Primary attempt.
	primaryBody, primaryErr := doAgainst(s.PrimaryURL)
	if primaryErr == nil {
		return primaryBody, "primary", nil
	}
	if !isRetryableOrFallback(primaryErr) {
		return nil, "primary", primaryErr
	}

	// 2) Backup attempt using the same timeout settings.
	backupBody, backupErr := doAgainst(s.BackupURL)
	if backupErr == nil {
		return backupBody, "backup", nil
	}

	// If both fail, surface both errors.
	return nil, "primary", fmt.Errorf("primary failed (%v); backup failed (%v)", primaryErr, backupErr)
}

// newClientWithConnectTimeout clones the client with a transport that applies
// the connect timeout.
func newClientWithConnectTimeout(existing *http.Client, connectTimeout time.Duration) *http.Client {
	var tr *http.Transport
	// If the caller already provided an *http.Transport, clone it.
	if existingTr, ok := existing.Transport.(*http.Transport); ok {
		tr = existingTr.Clone()
	}
	// Otherwise start from the default transport.
	if tr == nil {
		tr = http.DefaultTransport.(*http.Transport).Clone()
	}
	// The key: the dial timeout controls how long we wait to establish TCP.
	dialer := &net.Dialer{Timeout: connectTimeout, KeepAlive: 30 * time.Second}
	tr.DialContext = dialer.DialContext

	clone := *existing
	clone.Transport = tr
	return &clone
}

func main() {
	// Example usage; the endpoint hosts are placeholders.
	selector := &EgressSelector{
		PrimaryURL:     "https://api-primary.example.com/v1/submit",
		BackupURL:      "https://api-backup.example.com/v1/submit",
		ConnectTimeout: 500 * time.Millisecond,
		RequestTimeout: 2 * time.Second,
	}
	// No client-level timeout: each attempt carries its own RequestTimeout.
	httpClient := &http.Client{}

	payload := bytes.NewReader([]byte(`{"example":true}`))
	respBody, selected, err := selector.DoJSONPost(context.Background(), httpClient, payload, "application/json")
	fmt.Printf("selected=%s err=%v bytes=%d\n", selected, err, len(respBody))
}
```
Step-by-step: what each important block does
EgressSelector config
I configured two different timeouts:
- ConnectTimeout: how long we wait to establish TCP (fast fail)
- RequestTimeout: overall per-attempt time limit (connect + TLS + request)
This separation is important: I wanted failures to surface quickly, but I didn’t want to kill slow-but-valid requests too aggressively.
isRetryableOrFallback
This function decides whether the primary error should trigger fallback.
I kept it strict and practical:
- If we hit our context deadline, fall back
- If the failure is a network timeout or a connection-level error (refused, unreachable, DNS), fall back
- Otherwise, treat it as permanent and don't waste time trying the backup
That makes the behavior predictable under both “blackhole route” and “connection refused” failure modes.
doAgainst(base string)
This performs one attempt:
- Creates an attempt context with RequestTimeout
- Builds a POST request to a fully qualified URL
- If the response is 5xx, it returns a fallback-worthy failure
I treated 4xx as permanent errors because they won’t improve by changing egress.
newClientWithConnectTimeout
This is where the “deterministic” part gets real.
Even if your app has a request timeout, TCP connect timing can still be messy unless the transport is configured.
By setting the transport's DialContext to a Dialer with an explicit timeout, I ensured the client fails quickly and consistently.
Example run: primary blackholed, backup works
In my environment, the primary egress path sometimes behaved like a blackhole route: packets got dropped without immediate refusal.
That’s the worst case for long TCP timeouts. With the client above:
- Primary connect failed within ConnectTimeout (500ms)
- Fallback immediately tried the backup within its own deadline
- Total added latency stayed bounded (roughly connect timeout + backup request timeout)
What happened in practice:
- Service logs showed selected endpoint=backup
- Latency stayed under a couple of seconds instead of tens of seconds
- Retries became unnecessary, reducing load spikes on the broken egress path
Observability: how I made it debuggable
I didn’t rely only on the client returning success/failure. I logged the endpoint selection as a first-class field.
Here’s a minimal pattern I used:
```go
respBody, selected, err := selector.DoJSONPost(ctx, httpClient, payloadReader, "application/json")
if err != nil {
	log.Printf("egress=failure selected=%s err=%v", selected, err)
	return
}
log.Printf("egress=success selected=%s bytes=%d", selected, len(respBody))
```
This “selected=primary|backup” signal was the difference between guessing and knowing what the system did during an incident.
Security and safety notes I learned the hard way
- Fallback must not change semantics: both endpoints must behave equivalently (auth, headers, request shape), otherwise you’ll create “success on backup but wrong behavior” incidents.
- Don’t fallback on 4xx: deterministic fallback shouldn’t mask client-side bugs.
- Keep timeouts short and bounded: too-short timeouts create cascading failures during transient latency spikes.
FinOps angle: why this reduced cost
When egress is failing, retries amplify network usage and can trigger autoscaling (CPU, queue depth, more instances), which can increase spend fast—especially across hybrid links.
Bounding request time and controlling retry/fallback logic reduced:
- wasted outbound traffic
- tail latency-induced concurrency
- “panic scale-out” triggered by hung connections
In other words, the deterministic client helped both reliability and cloud spend.
Conclusion
I implemented deterministic fallback egress for a hybrid, multi-region setup by pushing fail-fast connect timeouts and explicit primary→backup switching into the application layer. By controlling attempt deadlines in the HTTP transport and making fallback decisions based on concrete network errors, I replaced unpredictable network failover behavior with consistent, debuggable client behavior—keeping latency bounded and reducing avoidable retry-driven load.