Last updated 2026-05-21
When Cloud SQL is configured, batch status, folder filenames, statement rows, document extractions, and validated metrics live in PostgreSQL. PDF bytes always stay under batches/{batchId}/raw/{relativePath} in GCS (or the local mirror). Without a database, metadata falls back to JSON in GCS or .data/batches-root/.
SQL over NoSQL
We opt for SQL (PostgreSQL on Cloud SQL) over NoSQL for workflow metadata. In general, SQL is structured — tables, columns, foreign keys, and constraints — and structured data means more context. That context helps everyone perform better: AI agents can reason over explicit relationships and field types; developers can trace batch → statement → document joins; operators can trust validated metrics tied to canonical company and period rows.
- Explicit schema — companies, periods, statements, and documents are first-class rows, not opaque document blobs.
- Joins and integrity — foreign keys encode portfolio semantics that agents and SQL queries can reuse.
- Better context for AI — parsers and review tools read typed columns and JSONB fields with known meaning.
- Legacy JSON-in-GCS — still supported without a database, but treated as a dev fallback, not the target architecture.
Entity relationship diagram
Core tables
| Table | Purpose |
|---|---|
| companies | Portfolio issuer: id (slug), display_name, sector, filename_prefix |
| company_reference_urls | Supporting URLs per company (1-to-many): url, label, source ai_discovered | operator |
| reporting_periods | Canonical quarter keys and fiscal year/quarter columns |
| company_periods | Company × period join with optional context JSON |
| batches | Upload run: id, name, notes, tags, statement_intake_by_path, workflow status |
| statements | One PDF package: company, period, relative_path, intake_answers, document link |
| documents | Parse/review workflow: status, extraction, validated_extraction JSON |
How folder upload maps into columns
| Upload input | Target table.column | Rule |
|---|---|---|
| Folder root name | batches.name / intake modal | Human label for the upload run; batch id defaults to UTC date-time. |
| PDF relative path | statements.relative_path | Path from selected folder root, e.g. NovaCloud/NovaCloud_Q2_2025.pdf. |
| PDF basename | statements.source_filename | Trailing segment of relative_path. |
| Filename pattern Company_Q#_YYYY.pdf | companies.id, statements.period_label, statements.fiscal_quarter, statements.fiscal_year | Parsed by statement-filename.ts; FleetLink maps to apexfreight. |
| Portfolio_Snapshot_Q#_YYYY.pdf | statements.statement_kind | portfolio_snapshot with company_id = NULL. |
| Per-PDF intake modal answers | statements.intake_answers | Optional JSON keyed by question id; stored per statement at registration. |
| After upload + POST complete | statements.* + documents.statement_id | registerBatchStatements inserts rows; workflow links each document to its statement. |
Statement filename examples
- NovaCloud_Q2_2025.pdf → company novacloud, period Q2 2025, fiscal_quarter 2, fiscal_year 2025.
- MediSight_Q2_2025.pdf → company medisight, same period columns.
- Portfolio_Snapshot_Q2_2025.pdf → statement_kind portfolio_snapshot, no company row.
- FleetLink_Q4_2024.pdf → legacy prefix mapped to company apexfreight.
Store selection
- DATABASE_URL or CLOUD_SQL_* → PostgresBatchStore + GcsBlobStore (if GCS_BUCKET) or LocalBlobStore
- GCS_BUCKET only (no DB) → GcsBatchStore (legacy JSON-in-GCS for metadata)
- Neither → LocalBatchStore under .data/batches-root/
Provision Cloud SQL (one-time)
# Example: Postgres 15 in us-central1
gcloud sql instances create zoethales-metadata \
--database-version=POSTGRES_15 \
--tier=db-f1-micro \
--region=us-central1
gcloud sql databases create zoethales --instance=zoethales-metadata
gcloud sql users create zoethales-app --instance=zoethales-metadata --password=CHOOSE_A_STRONG_PASSWORD
# Cloud Run: attach Cloud SQL connection + set env vars (see below)Environment variables
| Variable | Purpose |
|---|---|
| DATABASE_URL | Direct Postgres URL (local dev, or Secret Manager) |
| CLOUD_SQL_INSTANCE | project:region:instance — use with connector on Cloud Run |
| CLOUD_SQL_DATABASE | Database name |
| CLOUD_SQL_USER | DB user |
| CLOUD_SQL_PASSWORD | DB password (prefer Secret Manager) |
| GCS_BUCKET | Still required for PDF blobs when using SQL metadata |
Schema DDL ships in next/lib/document-workflow/sql/schema.sql and is applied automatically on first getBatchStore() in database mode.
