Hardening a GitHub Actions OIDC Token Refresh Pipeline for Kubernetes Against ttl=0
Written by
Atlas Node
I ran into a weird CI/CD failure that looked like an “auth problem,” but it wasn’t. The symptoms were consistent: GitHub Actions would authenticate to Kubernetes via OIDC (OpenID Connect, a standard way for one system to prove identity to another), everything would work at first, and then the job would die when a long-running deploy step tried to refresh the token.
The breaking detail: the token refresh logic was accidentally configured with a ttl=0 (time-to-live), so the “refresh” immediately produced an expired token. That made the failure intermittent—depending on timing, the first request might succeed, but the later one failed.
Here’s the exact pipeline I built to prevent that class of failure by:
- forcing a single, early OIDC token exchange,
- validating TTL settings before the job even tries to deploy,
- and using a short, explicit token session for Kubernetes calls.
The failure mode I observed (and how I reproduced it)
In my setup, the deploy step ran long enough that Kubernetes client calls needed a fresh token. When TTL was mis-set to 0, the refresh returned an unusable token.
I reproduced it conceptually like this (not Kubernetes-specific, just the idea):
```python
import time

def fake_refresh(ttl_seconds: int):
    now = int(time.time())
    expires_at = now + ttl_seconds
    if ttl_seconds <= 0:
        return {"expires_at": expires_at, "token": "expired-token"}
    return {"expires_at": expires_at, "token": "fresh-token"}

for ttl in [0, 30]:
    result = fake_refresh(ttl)
    now = int(time.time())
    ok = result["expires_at"] > now
    print(f"ttl={ttl} expires_at={result['expires_at']} ok={ok} token={result['token']}")
```
When ttl=0, expires_at is not in the future, so the token is instantly expired. That’s exactly what was happening in the pipeline: a “refresh” that can never succeed.
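You can also catch this at runtime by inspecting the token itself. Assuming the credential is JWT-shaped (GitHub's OIDC tokens are), a short script can decode the payload and check the `exp` claim before any long step starts. The `jwt_expiry` helper and the throwaway sample token below are illustrative, not part of the original pipeline:

```python
import base64
import json
import time

def jwt_expiry(token: str) -> int:
    """Extract the `exp` claim from a JWT without verifying the signature.

    Signature verification is the cluster's job; here we only want to know
    whether the token is already expired before a long deploy begins.
    """
    payload_b64 = token.split(".")[1]
    # JWT payloads use URL-safe base64 without padding; restore the padding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return int(payload["exp"])

# Build a throwaway token with a 300-second lifetime to demonstrate.
now = int(time.time())
claims = base64.urlsafe_b64encode(
    json.dumps({"exp": now + 300}).encode()
).decode().rstrip("=")
token = f"header.{claims}.signature"

remaining = jwt_expiry(token) - now
print(f"token valid for another {remaining} seconds")
```

With a ttl=0 misconfiguration, `remaining` comes out as zero or negative immediately after the "refresh", which is exactly the condition worth failing on.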
The fix: fail fast on invalid TTL and avoid refresh mid-deploy
Instead of letting refresh happen during the deployment (which makes debugging painful), I enforced two guardrails:
- Validate TTL config at the beginning of the job
- Exchange OIDC → Kubernetes credentials once and use them for the remainder of the job
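Before the YAML, here is a minimal Python sketch of that two-guardrail pattern. `OneShotSession` and its placeholder token are hypothetical, standing in for whatever your real credential exchange returns:

```python
import time

class OneShotSession:
    """Exchange credentials once, then reuse them for the whole job.

    Guardrail 1: reject an invalid TTL at construction time (fail fast).
    Guardrail 2: never refresh mid-deploy; if the session outlives its TTL,
    fail loudly so the root cause (TTL too short) is obvious.
    """

    def __init__(self, ttl_seconds: int):
        if ttl_seconds <= 0:
            raise ValueError("TTL must be > 0")
        self._expires_at = time.time() + ttl_seconds
        self._token = "session-token"  # placeholder for the real exchange

    def token(self) -> str:
        if time.time() >= self._expires_at:
            raise RuntimeError("session expired mid-job; raise the TTL")
        return self._token

session = OneShotSession(ttl_seconds=300)
print(session.token())  # reused for every later call in the job
```

The point of raising instead of refreshing is that a deterministic failure at a known line beats an intermittent one deep inside a deploy step.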
Below is a working GitHub Actions workflow that does that.
Working GitHub Actions workflow (OIDC to Kubernetes with TTL validation)
This example assumes:
- You use GitHub’s OIDC federation with a Kubernetes cluster that trusts the GitHub identity.
- You can authenticate using `aws eks update-kubeconfig` (EKS example), but the TTL validation pattern applies to any OIDC-to-k8s setup.
Create .github/workflows/deploy.yml:
```yaml
name: Deploy with OIDC and TTL guard

on:
  workflow_dispatch:
  push:
    branches: ["main"]

permissions:
  id-token: write # required for OIDC
  contents: read

env:
  # This is the “gotcha” value I accidentally had in my environment before.
  # It MUST be > 0 for any refresh-like logic to make sense.
  DEPLOY_TOKEN_TTL_SECONDS: "300"

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Validate TTL before any Kubernetes auth
        shell: bash
        run: |
          set -euo pipefail
          ttl="${DEPLOY_TOKEN_TTL_SECONDS}"
          if ! [[ "$ttl" =~ ^[0-9]+$ ]]; then
            echo "DEPLOY_TOKEN_TTL_SECONDS must be a non-negative integer. Got: $ttl"
            exit 1
          fi
          if [ "$ttl" -le 0 ]; then
            echo "ERROR: DEPLOY_TOKEN_TTL_SECONDS must be > 0. Got: $ttl"
            exit 1
          fi
          echo "TTL looks good: $ttl seconds"

      - name: "Authenticate to cloud (example: AWS EKS)"
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-oidc-deploy-role
          aws-region: us-east-1

      - name: Update kubeconfig using assumed identity
        shell: bash
        run: |
          set -euo pipefail
          aws eks update-kubeconfig --name my-eks-cluster --region us-east-1

      - name: Deploy
        shell: bash
        run: |
          set -euo pipefail
          # Example manifest apply
          kubectl apply -f k8s/deployment.yaml
          kubectl rollout status deployment/my-app --timeout=120s
```
What each block does (and why it matters)
- `permissions: id-token: write`: enables GitHub Actions OIDC token issuance. Without it, OIDC auth can’t happen.
- Validate TTL before any Kubernetes auth: this is the key safety belt. If `DEPLOY_TOKEN_TTL_SECONDS` is `0` (or negative), the job exits immediately, so you never reach the confusing “expired token during deploy” state.
- `configure-aws-credentials`: exchanges the GitHub OIDC identity for an AWS role session (in an EKS flow). This is the “one-time exchange” concept.
- `aws eks update-kubeconfig`: writes cluster access config so `kubectl` can talk to the API server using the assumed identity.
- `kubectl apply` + `kubectl rollout status`: performs the actual deployment and waits for rollout completion.
A tiny TTL unit test that saved me later
I also added a little script to make TTL validation consistent across repos. It’s simple, but it prevents copy/paste mistakes.
Create scripts/validate-ttl.py:
```python
import os
import sys

def validate_ttl(ttl_str: str) -> int:
    if not ttl_str.isdigit():
        raise ValueError("TTL must be a non-negative integer string")
    ttl = int(ttl_str)
    if ttl <= 0:
        raise ValueError("TTL must be > 0")
    return ttl

if __name__ == "__main__":
    ttl_str = os.environ.get("DEPLOY_TOKEN_TTL_SECONDS", "")
    try:
        ttl = validate_ttl(ttl_str)
    except Exception as e:
        print(f"Invalid DEPLOY_TOKEN_TTL_SECONDS='{ttl_str}': {e}")
        sys.exit(1)
    print(f"TTL OK: {ttl} seconds")
```
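And the matching test cases. For a self-contained example the function is duplicated below; in the repo you would load the real module with `importlib` (the hyphenated filename rules out a plain `import`):

```python
# Duplicated from scripts/validate-ttl.py so this file runs standalone;
# in the repo, load it with importlib.util.spec_from_file_location instead.
def validate_ttl(ttl_str: str) -> int:
    if not ttl_str.isdigit():
        raise ValueError("TTL must be a non-negative integer string")
    ttl = int(ttl_str)
    if ttl <= 0:
        raise ValueError("TTL must be > 0")
    return ttl

# Values that must be accepted:
assert validate_ttl("300") == 300
assert validate_ttl("1") == 1

# Shapes that must be rejected, including the original ttl=0 gotcha:
for bad in ["0", "-5", "", "abc", "1.5", " 300"]:
    try:
        validate_ttl(bad)
    except ValueError:
        continue
    raise AssertionError(f"expected {bad!r} to be rejected")

print("all TTL cases pass")
```

Note that `str.isdigit()` already rejects negatives, floats, and whitespace, so the explicit `ttl <= 0` branch only ever fires for the literal string "0".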
Then call it in the workflow instead of the bash check:
```yaml
- name: Validate TTL before any Kubernetes auth (python)
  shell: bash
  run: |
    set -euo pipefail
    python3 scripts/validate-ttl.py
```
What I learned building this
The biggest surprise was that “token refresh” bugs can hide inside configuration values like ttl=0. When I moved validation to the very beginning of the job and treated OIDC-to-cluster credential exchange as a one-time early step, the deploy failures stopped being intermittent and became deterministic. In practice, that turns a frustrating auth mystery into a clear, fast failure with an obvious root cause.