Technical documentation

Data model & Cloud SQL

PostgreSQL schema for companies, batches, statements, and documents — plus Cloud SQL setup and env vars.

Last updated 2026-05-21

When Cloud SQL is configured, batch status, folder filenames, statement rows, document extractions, and validated metrics live in PostgreSQL. PDF bytes always stay under batches/{batchId}/raw/{relativePath} in GCS (or the local mirror). Without a database, metadata falls back to JSON in GCS or .data/batches-root/.

SQL over NoSQL

We opt for SQL (PostgreSQL on Cloud SQL) over NoSQL for workflow metadata. In general, SQL is structured — tables, columns, foreign keys, and constraints — and structured data means more context. That context helps everyone perform better: AI agents can reason over explicit relationships and field types; developers can trace batch → statement → document joins; operators can trust validated metrics tied to canonical company and period rows.

  • Explicit schema — companies, periods, statements, and documents are first-class rows, not opaque document blobs.
  • Joins and integrity — foreign keys encode portfolio semantics that agents and SQL queries can reuse.
  • Better context for AI — parsers and review tools read typed columns and JSONB fields with known meaning.
  • Legacy JSON-in-GCS — still supported without a database, but treated as a dev fallback, not the target architecture.

Entity relationship diagram

Core tables

TablePurpose
companiesPortfolio issuer: id (slug), display_name, sector, filename_prefix
company_reference_urlsSupporting URLs per company (1-to-many): url, label, source ai_discovered | operator
reporting_periodsCanonical quarter keys and fiscal year/quarter columns
company_periodsCompany × period join with optional context JSON
batchesUpload run: id, name, notes, tags, statement_intake_by_path, workflow status
statementsOne PDF package: company, period, relative_path, intake_answers, document link
documentsParse/review workflow: status, extraction, validated_extraction JSON

How folder upload maps into columns

Upload inputTarget table.columnRule
Folder root namebatches.name / intake modalHuman label for the upload run; batch id defaults to UTC date-time.
PDF relative pathstatements.relative_pathPath from selected folder root, e.g. NovaCloud/NovaCloud_Q2_2025.pdf.
PDF basenamestatements.source_filenameTrailing segment of relative_path.
Filename pattern Company_Q#_YYYY.pdfcompanies.id, statements.period_label, statements.fiscal_quarter, statements.fiscal_yearParsed by statement-filename.ts; FleetLink maps to apexfreight.
Portfolio_Snapshot_Q#_YYYY.pdfstatements.statement_kindportfolio_snapshot with company_id = NULL.
Per-PDF intake modal answersstatements.intake_answersOptional JSON keyed by question id; stored per statement at registration.
After upload + POST completestatements.* + documents.statement_idregisterBatchStatements inserts rows; workflow links each document to its statement.

Statement filename examples

  • NovaCloud_Q2_2025.pdf → company novacloud, period Q2 2025, fiscal_quarter 2, fiscal_year 2025.
  • MediSight_Q2_2025.pdf → company medisight, same period columns.
  • Portfolio_Snapshot_Q2_2025.pdf → statement_kind portfolio_snapshot, no company row.
  • FleetLink_Q4_2024.pdf → legacy prefix mapped to company apexfreight.

Store selection

  • DATABASE_URL or CLOUD_SQL_* → PostgresBatchStore + GcsBlobStore (if GCS_BUCKET) or LocalBlobStore
  • GCS_BUCKET only (no DB) → GcsBatchStore (legacy JSON-in-GCS for metadata)
  • Neither → LocalBatchStore under .data/batches-root/

Provision Cloud SQL (one-time)

# Example: Postgres 15 in us-central1
gcloud sql instances create zoethales-metadata \
  --database-version=POSTGRES_15 \
  --tier=db-f1-micro \
  --region=us-central1

gcloud sql databases create zoethales --instance=zoethales-metadata
gcloud sql users create zoethales-app --instance=zoethales-metadata --password=CHOOSE_A_STRONG_PASSWORD

# Cloud Run: attach Cloud SQL connection + set env vars (see below)

Environment variables

VariablePurpose
DATABASE_URLDirect Postgres URL (local dev, or Secret Manager)
CLOUD_SQL_INSTANCEproject:region:instance — use with connector on Cloud Run
CLOUD_SQL_DATABASEDatabase name
CLOUD_SQL_USERDB user
CLOUD_SQL_PASSWORDDB password (prefer Secret Manager)
GCS_BUCKETStill required for PDF blobs when using SQL metadata

Schema DDL ships in next/lib/document-workflow/sql/schema.sql and is applied automatically on first getBatchStore() in database mode.