⚙ Infrastructure

The Hidden Trap in Google Cloud That Nearly Derailed Our Launch

Three stacked errors, three tempting shortcuts, and one lesson about doing things properly the first time. When your Dockerfile works locally but Cloud Run returns a 503 — and every fix creates a new error — this is what we learned from that spiral.

March 2026 · 8 min read · GCP · Docker · Cloud Run

The setup: a FastAPI backend on Cloud Run

Columnly's backend is a FastAPI application deployed on Google Cloud Run in asia-south1. It handles LLM routing, data-analysis pipelines, billing webhooks, and a SQLite-backed session store. Nothing exotic — the sort of architecture you've seen a hundred times.

What we hadn't seen was how a single misconfigured service account could cascade into three completely different error messages, each pointing us in the wrong direction.

The 503 that isn't a 503

The first sign something was wrong: Cloud Run returned HTTP 503 on every request, immediately after deploy. The container was starting — health checks passed — but the moment a real request came in, 503.

Service Unavailable
upstream connect error or disconnect/reset before headers. reset reason: connection failure

The instinctive fix is to look at container logs. But the logs showed nothing. The container started, uvicorn bound to port 8080, FastAPI initialised — all clean. The 503 was happening before the request ever reached the application.

What we tried first — all wrong

Increasing the Cloud Run concurrency limit. Increasing the request timeout. Changing the port. Redeploying with a fresh image. All reasonable guesses for a 503 — and every one of them was wrong.

The actual cause: the service account attached to the Cloud Run service did not have permission to pull the container image from Artifact Registry. Cloud Run was starting the container from a cached previous image, passing health checks, then failing to serve traffic because the running image was one deploy behind. The new code was never running.
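One way to spot a deploy that never took effect is to compare the newest revision with the one Cloud Run reports as ready. The sketch below is illustrative only. It assumes the JSON printed by `gcloud run services describe columnly-backend --region=asia-south1 --format=json` carries `status.latestCreatedRevisionName` and `status.latestReadyRevisionName`; the revision names in the sample are made up, and the field names should be checked against real output before relying on them.

```python
import json
from typing import Optional


def stale_revision(service: dict) -> Optional[str]:
    """Return a warning if the newest revision is not the one marked ready."""
    status = service.get("status", {})
    created = status.get("latestCreatedRevisionName")
    ready = status.get("latestReadyRevisionName")
    if created != ready:
        return f"newest revision {created} is not ready; last ready revision is {ready}"
    return None


# Hypothetical, trimmed-down stand-in for the real gcloud JSON output.
sample = json.loads("""
{"status": {"latestCreatedRevisionName": "columnly-backend-00042-abc",
            "latestReadyRevisionName": "columnly-backend-00041-xyz"}}
""")
print(stale_revision(sample))
```

A non-empty result says the code you just shipped is not what is serving traffic, which is a useful thing to rule out before touching concurrency or timeouts.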

The fix

Grant the Cloud Run service account the roles/artifactregistry.reader role on the specific Artifact Registry repository:

Shell

gcloud artifacts repositories add-iam-policy-binding columnly-backend \
  --location=asia-south1 \
  --member="serviceAccount:[email protected]" \
  --role="roles/artifactregistry.reader"

Simple — one command. But you'd never guess it from a 503, because nothing in the error chain mentions permissions or image pulling.
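The same binding can be checked before a deploy rather than discovered after one. The following is a minimal sketch, not the script we run: it shells out to `gcloud artifacts repositories get-iam-policy` and looks for the role in the returned policy. The function names and the service account in the usage note are hypothetical.

```python
import json
import subprocess
import sys


def fetch_repo_policy(repository: str, location: str) -> dict:
    """Read the repository-level IAM policy through the gcloud CLI."""
    result = subprocess.run(
        ["gcloud", "artifacts", "repositories", "get-iam-policy", repository,
         f"--location={location}", "--format=json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def has_binding(policy: dict, member: str, role: str) -> bool:
    """True if the policy grants exactly this role to exactly this member."""
    return any(
        binding.get("role") == role and member in binding.get("members", [])
        for binding in policy.get("bindings", [])
    )


def audit(repository: str, location: str, member: str, role: str) -> int:
    """Return a process exit code: 0 if the binding exists, 1 otherwise."""
    if has_binding(fetch_repo_policy(repository, location), member, role):
        return 0
    print(f"missing {role} for {member} on {repository}", file=sys.stderr)
    return 1
```

In a build step this could end with `sys.exit(audit("columnly-backend", "asia-south1", "serviceAccount:runtime-sa@example-project.iam.gserviceaccount.com", "roles/artifactregistry.reader"))`. The check is deliberately narrow: it only reads the repository's own policy, so a role inherited from the project, or a broader role that also allows reads, would be reported as missing.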

The SQLite path that works locally and fails in prod

With the image pull fixed, the service came up properly. New error: the application was crashing on startup with a sqlite3.OperationalError: unable to open database file.

sqlite3.OperationalError: unable to open database file
  File "/app/src/columnly/runtime/db/connection.py", line 34, in get_connection
    conn = sqlite3.connect(db_path)

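Locally, the database path was ./data/memo.db, relative to the working directory. In the container, the working directory is /app, and /app/data/ didn't exist, so SQLite had no directory in which to create the file. Creating the directory would not have been enough on its own: Cloud Run's container filesystem is in-memory and ephemeral, so anything written to it is lost when the instance shuts down.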

We fixed this in two steps. First, mount a Cloud Storage bucket as a volume so the database file outlives any single instance:

cloudbuild.yaml — Cloud Run deploy step

- name: 'gcr.io/cloud-builders/gcloud'
  args:
    - run
    - deploy
    - columnly-backend
    - --add-volume=name=memo-db,type=cloud-storage,bucket=columnly-memo-db
    - --add-volume-mount=volume=memo-db,mount-path=/data
    - --set-env-vars=COLUMNLY_MEMO_DB=/data/memo.db

Second, update the connection code to use an absolute path from the environment variable:

connection.py

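import os
import sqlite3


def get_db_path() -> str:
    # Absolute path from the environment; /tmp/memo.db is the fallback.
    return os.environ.get("COLUMNLY_MEMO_DB", "/tmp/memo.db")


def get_connection() -> sqlite3.Connection:
    db_path = get_db_path()
    # The directory may not exist on first boot, so create it before connecting.
    os.makedirs(os.path.dirname(db_path), exist_ok=True)
    # check_same_thread=False allows the connection to be used from other threads.
    return sqlite3.connect(db_path, check_same_thread=False)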

The lesson

Never use relative paths for anything that persists state in a containerised environment. If it touches disk, it must be an absolute path from an environment variable — always. The os.makedirs(..., exist_ok=True) guard is equally non-negotiable: the directory may not exist on first boot.
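A startup probe can turn a bad path into a failed boot instead of a failed request. This is a minimal sketch under stated assumptions: `assert_writable` and `check_storage_paths` are hypothetical helpers, and the probe only proves that a file can be created and removed in each directory, not that the data will persist.

```python
import os
import tempfile
from typing import Iterable


def assert_writable(directory: str) -> None:
    """Create the directory if needed and prove a file can be written there."""
    os.makedirs(directory, exist_ok=True)
    # NamedTemporaryFile creates a real file and deletes it on close, so a
    # missing or read-only location raises OSError right here.
    with tempfile.NamedTemporaryFile(dir=directory, prefix=".write-probe-") as probe:
        probe.write(b"ok")
        probe.flush()


def check_storage_paths(paths: Iterable[str]) -> None:
    """Probe the parent directory of every file path the application writes to."""
    for path in paths:
        assert_writable(os.path.dirname(os.path.abspath(path)))
```

Calling `check_storage_paths([os.environ.get("COLUMNLY_MEMO_DB", "/tmp/memo.db")])` before the server starts accepting traffic would let an unwritable path stop the container during startup rather than on the first database call.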

The Stripe webhook that silently ate every event

With the service running, billing stopped working. Stripe webhooks were being delivered — the Stripe dashboard showed 200 OK responses — but no subscription updates were being written to the database. Users who paid weren't being upgraded.

The root cause took an embarrassingly long time to find: the webhook signature verification was passing, but the handler compared the event type against a string written with underscores, and Stripe's event types use dots.

The bug

# Wrong — this never matches
if event["type"] == "checkout_session_completed":
    ...

# Stripe actually sends
# "checkout.session.completed"

The handler was accepting every event, returning 200 (so Stripe stopped retrying), and doing nothing with any of them. The fix was a two-character change, underscores to dots, but finding it required re-reading the Stripe docs for the third time and finally noticing the difference.

The most expensive bugs are the ones that return 200 OK and do nothing. A hard crash is easy to find. Silent success is not.
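One way to make this kind of mismatch visible is to route events through a registry and log every type that has no handler. The sketch below is an illustration, not Columnly's handler: `handles`, `dispatch`, and the `UPGRADED` list are hypothetical names, and the list stands in for a real database write.

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger("billing.webhooks")

Event = Dict[str, Any]
Handler = Callable[[Event], None]

HANDLERS: Dict[str, Handler] = {}
UPGRADED: List[str] = []  # stand-in for the subscription table


def handles(event_type: str) -> Callable[[Handler], Handler]:
    """Register a function as the handler for one Stripe event type."""
    def register(func: Handler) -> Handler:
        HANDLERS[event_type] = func
        return func
    return register


@handles("checkout.session.completed")
def on_checkout_completed(event: Event) -> None:
    # Stripe nests the affected object under data.object.
    UPGRADED.append(event["data"]["object"]["id"])


def dispatch(event: Event) -> bool:
    """Run the matching handler; log and return False when there is none."""
    handler = HANDLERS.get(event["type"])
    if handler is None:
        logger.warning("unhandled Stripe event type: %s", event["type"])
        return False
    handler(event)
    return True
```

The endpoint can still answer 200 for types it does not care about, but a paid checkout that produces only "unhandled" warnings becomes obvious in the logs. A misspelled key in the registry would still never match, so this complements a test rather than replacing one.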

What we'd do differently

Three stacked errors, three different root causes, each requiring a different debugging approach. In hindsight, all three were preventable:

Permissions: run a pre-deploy IAM audit script. If the service account is missing any role the service needs, fail the build — not the deployment.
Filesystem: add a container startup test that writes to every path the application uses. If it fails, the deployment fails — not the first live request.
Webhooks: write a test that replays a real Stripe event payload against your handler before any deploy that touches billing code. Stripe provides test payloads for every event type.
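The webhook replay idea can be sketched with the standard library alone. Stripe signs the string `{timestamp}.{payload}` with HMAC-SHA256 and sends the result in the Stripe-Signature header as `t=...,v1=...`, so a test can build a correctly signed request itself. Everything else here is an assumption for illustration: `handle_webhook` is a simplified stand-in for a real handler (which would normally verify through the Stripe SDK and also check the timestamp's age), and the secret and session ID are placeholders.

```python
import hashlib
import hmac
import json
from typing import List

TEST_SECRET = "whsec_test_placeholder"  # hypothetical test-only signing secret
UPGRADED_SESSIONS: List[str] = []       # stand-in for the subscription table


def sign_payload(payload: str, secret: str, timestamp: int) -> str:
    """Build a Stripe-Signature header value for a test payload."""
    signed = f"{timestamp}.{payload}".encode()
    digest = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    return f"t={timestamp},v1={digest}"


def handle_webhook(payload: str, signature_header: str) -> int:
    """Simplified stand-in handler: verify, then act on one event type."""
    parts = dict(item.split("=", 1) for item in signature_header.split(","))
    expected = sign_payload(payload, TEST_SECRET, int(parts["t"]))
    if not hmac.compare_digest(expected, signature_header):
        return 400
    event = json.loads(payload)
    if event["type"] == "checkout.session.completed":
        UPGRADED_SESSIONS.append(event["data"]["object"]["id"])
    return 200


def test_checkout_completed_upgrades_user() -> None:
    payload = json.dumps({
        "id": "evt_test_1",
        "type": "checkout.session.completed",
        "data": {"object": {"id": "cs_test_123"}},
    })
    header = sign_payload(payload, TEST_SECRET, timestamp=1700000000)
    assert handle_webhook(payload, header) == 200
    # The status code alone proves nothing; assert on the side effect.
    assert UPGRADED_SESSIONS == ["cs_test_123"]
```

The second assertion is the one that matters. Our broken handler would have passed a status-code check and failed only on the side effect.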

The pattern across all three is the same: the error surfaced as late as possible, in the most obscure way possible. The fix for each was to move the failure earlier — to build time, to startup, to test time — so it could never reach production silently.

The one thing worth remembering

Google Cloud's error messages are written for someone who already knows what's wrong. A 503 from a missing IAM role, a startup crash from a missing directory, a silent no-op from a wrong string constant — none of these messages tell you what actually caused them. You have to already know where to look.

The fastest debugging tool we've found for Cloud Run isn't the logs explorer. It's deploying a minimal version of the service — just startup code, no business logic — and adding one piece at a time until the error appears. Tedious, but it works every time.
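As a starting point for that bisection, a service with no dependencies at all is enough to prove that image pull, startup, and routing work. The sketch below swaps FastAPI for Python's built-in `http.server` purely to keep the example self-contained; `Ping` and `make_server` are hypothetical names. It relies on Cloud Run passing the listening port in the `PORT` environment variable, with 8080 used as the fallback.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional


class Ping(BaseHTTPRequestHandler):
    """Answer every GET with a fixed body; there is no business logic to fail."""

    def do_GET(self) -> None:
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: Optional[int] = None) -> HTTPServer:
    """Bind to the given port, or to $PORT (default 8080) when none is given."""
    if port is None:
        port = int(os.environ.get("PORT", "8080"))
    return HTTPServer(("0.0.0.0", port), Ping)
```

The container entry point would call `make_server().serve_forever()`. If this version serves traffic, the platform side is healthy, and each piece added back afterwards (database connection, webhook routes, LLM routing) narrows down where the failure lives.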

March 2026 · Columnly