Last updated 2026-05-21

When Cloud SQL is configured, batch status, folder filenames, statement rows, document extractions, and validated metrics live in PostgreSQL. PDF bytes always stay under batches/{batchId}/raw/{relativePath} in GCS (or the local mirror). Without a database, metadata falls back to JSON in GCS or .data/batches-root/.

SQL over NoSQL

We opt for SQL (PostgreSQL on Cloud SQL) over NoSQL for workflow metadata. In general, SQL is structured — tables, columns, foreign keys, and constraints — and structured data means more context. That context helps everyone perform better: AI agents can reason over explicit relationships and field types; developers can trace batch → statement → document joins; operators can trust validated metrics tied to canonical company and period rows.

Explicit schema — companies, periods, statements, and documents are first-class rows, not opaque document blobs.
Joins and integrity — foreign keys encode portfolio semantics that agents and SQL queries can reuse.
Better context for AI — parsers and review tools read typed columns and JSONB fields with known meaning.
Legacy JSON-in-GCS — still supported without a database, but treated as a dev fallback, not the target architecture.

Entity relationship diagram

Core tables

Table	Purpose
companies	Portfolio issuer: id (slug), display_name, sector, filename_prefix
company_reference_urls	Supporting URLs per company (1-to-many): url, label, source ai_discovered \| operator
reporting_periods	Canonical quarter keys and fiscal year/quarter columns
company_periods	Company × period join with optional context JSON
batches	Upload run: id, name, notes, tags, statement_intake_by_path, workflow status
statements	One PDF package: company, period, relative_path, intake_answers, document link
documents	Parse/review workflow: status, extraction, validated_extraction JSON

How folder upload maps into columns

Upload input	Target table.column	Rule
Folder root name	batches.name / intake modal	Human label for the upload run; batch id defaults to UTC date-time.
PDF relative path	statements.relative_path	Path from selected folder root, e.g. NovaCloud/NovaCloud_Q2_2025.pdf.
PDF basename	statements.source_filename	Trailing segment of relative_path.
Filename pattern Company_Q#_YYYY.pdf	companies.id, statements.period_label, statements.fiscal_quarter, statements.fiscal_year	Parsed by statement-filename.ts; FleetLink maps to apexfreight.
Portfolio_Snapshot_Q#_YYYY.pdf	statements.statement_kind	portfolio_snapshot with company_id = NULL.
Per-PDF intake modal answers	statements.intake_answers	Optional JSON keyed by question id; stored per statement at registration.
After upload + POST complete	statements.* + documents.statement_id	registerBatchStatements inserts rows; workflow links each document to its statement.

Statement filename examples

NovaCloud_Q2_2025.pdf → company novacloud, period Q2 2025, fiscal_quarter 2, fiscal_year 2025.
MediSight_Q2_2025.pdf → company medisight, same period columns.
Portfolio_Snapshot_Q2_2025.pdf → statement_kind portfolio_snapshot, no company row.
FleetLink_Q4_2024.pdf → legacy prefix mapped to company apexfreight.

Store selection

DATABASE_URL or CLOUD_SQL_* → PostgresBatchStore + GcsBlobStore (if GCS_BUCKET) or LocalBlobStore
GCS_BUCKET only (no DB) → GcsBatchStore (legacy JSON-in-GCS for metadata)
Neither → LocalBatchStore under .data/batches-root/

Provision Cloud SQL (one-time)

# Example: Postgres 15 in us-central1
gcloud sql instances create zoethales-metadata \
  --database-version=POSTGRES_15 \
  --tier=db-f1-micro \
  --region=us-central1

gcloud sql databases create zoethales --instance=zoethales-metadata
gcloud sql users create zoethales-app --instance=zoethales-metadata --password=CHOOSE_A_STRONG_PASSWORD

# Cloud Run: attach Cloud SQL connection + set env vars (see below)

Environment variables

Variable	Purpose
DATABASE_URL	Direct Postgres URL (local dev, or Secret Manager)
CLOUD_SQL_INSTANCE	project:region:instance — use with connector on Cloud Run
CLOUD_SQL_DATABASE	Database name
CLOUD_SQL_USER	DB user
CLOUD_SQL_PASSWORD	DB password (prefer Secret Manager)
GCS_BUCKET	Still required for PDF blobs when using SQL metadata

Schema DDL ships in next/lib/document-workflow/sql/schema.sql and is applied automatically on first getBatchStore() in database mode.