Technical documentation

GCS storage system

GCS holds uploaded PDFs, workflow JSON, and extractions. The browser uploads via presigned URLs.

Last updated 2026-05-21

This app stores uploaded PDFs in a GCS bucket (or a local mirror). When Cloud SQL is configured (see /docs/data-model), batch and document metadata live in PostgreSQL instead of batch.json and documents/*.json. Without a database, GCS JSON objects remain the metadata store.

Presigned upload URLs

When a batch is created, the API mints a short-lived presigned URL (a V4 signed URL) for each PDF. The browser PUTs files directly to GCS, so large reporting packages never pass through Cloud Run.

  • The upload UI accepts a dragged or browsed folder, traverses its subfolders, and uploads the PDFs via POST /api/batches plus presigned PUTs. It then routes to /batches/{id}/prepare for metadata and per-PDF intake; parsing starts only when POST /api/batches/{id}/complete is called.
  • Each URL targets batches/{batchId}/raw/{filename} and expires after one hour.
  • The service account behind the API needs permission to sign writes on the bucket.
  • Review uses presigned read URLs so the PDF iframe loads from GCS without proxying bytes through the app.
  • Local dev without GCS_BUCKET uses /api/batches/{id}/upload as a stand-in for the same PUT flow.
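
As a rough sketch of the bullets above, the helper below plans one upload target per PDF: the object name under batches/{batchId}/raw/ and a one-hour expiry. The names planUploadTargets and UploadTarget are hypothetical, and filtering by the .pdf extension is an assumption. The signing step itself is left out; with the @google-cloud/storage client it would typically be a file.getSignedUrl({ version: "v4", action: "write", expires }) call, but how this app wires that up is not shown here.

```typescript
// Hypothetical helper: plans where each PDF will be PUT and when its URL expires.
interface UploadTarget {
  filename: string;
  objectName: string; // batches/{batchId}/raw/{filename}
  expiresAt: Date; // one hour after `now`
}

const ONE_HOUR_MS = 60 * 60 * 1000;

function planUploadTargets(
  batchId: string,
  filenames: string[],
  now: Date = new Date(),
): UploadTarget[] {
  return filenames
    // Assumption: anything without a .pdf extension is skipped.
    .filter((name) => /\.pdf$/i.test(name))
    .map((filename) => ({
      filename,
      objectName: `batches/${batchId}/raw/${filename}`,
      expiresAt: new Date(now.getTime() + ONE_HOUR_MS),
    }));
}
```

Each planned target would then be handed to whatever signer the API uses, and the resulting URL returned to the browser for its PUT.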

Object layout

gs://{GCS_BUCKET}/batches/{batchId}/
  batch.json                      # batch status, file counts, errors
  raw/{filename}.pdf              # uploaded reporting packages
  documents/{documentId}.json     # AI extraction + review status
  _upload-complete.json           # optional Eventarc trigger marker
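
To make the layout concrete, here is a minimal sketch that classifies an object name against the prefixes listed above. The function name classifyObject and the ObjectKind type are hypothetical; a handler for bucket events might use something like this to decide what a changed object represents.

```typescript
// Hypothetical classifier for object names under batches/{batchId}/.
type ObjectKind =
  | { kind: "batch"; batchId: string }
  | { kind: "raw"; batchId: string; filename: string }
  | { kind: "document"; batchId: string; documentId: string }
  | { kind: "upload-complete"; batchId: string }
  | { kind: "unknown" };

function classifyObject(objectName: string): ObjectKind {
  const match = /^batches\/([^/]+)\/(.+)$/.exec(objectName);
  if (!match) return { kind: "unknown" };
  const batchId = match[1];
  const rest = match[2];

  if (rest === "batch.json") return { kind: "batch", batchId };
  if (rest === "_upload-complete.json") return { kind: "upload-complete", batchId };

  // raw/ may contain nested paths if subfolders are preserved (an assumption).
  const raw = /^raw\/(.+)$/.exec(rest);
  if (raw) return { kind: "raw", batchId, filename: raw[1] };

  const doc = /^documents\/([^/]+)\.json$/.exec(rest);
  if (doc) return { kind: "document", batchId, documentId: doc[1] };

  return { kind: "unknown" };
}
```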

batch.json

Tracks batch-level workflow status (uploading → awaiting_metadata → awaiting_pdf_intake → validating → parsing → ready_for_review → completed), documentIds, uploadedFileCount, tags, name, notes, sourceFilenames, statementIntakeByPath (per-PDF answers saved before complete), and id (UTC timestamp). Statement rows are created when complete runs, not at initial upload.
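
The fields above could be modeled roughly as follows. This is a sketch only: the field names and status values come from the description, but every field type is a guess, and canAdvance assumes the workflow moves strictly one step at a time, which the real store may not enforce.

```typescript
// Status values in the order given in the description above.
const BATCH_STATUSES = [
  "uploading",
  "awaiting_metadata",
  "awaiting_pdf_intake",
  "validating",
  "parsing",
  "ready_for_review",
  "completed",
] as const;

type BatchStatus = (typeof BATCH_STATUSES)[number];

// Hypothetical shape; field types are assumptions.
interface BatchJson {
  id: string; // UTC timestamp
  status: BatchStatus;
  name: string;
  notes: string;
  tags: string[];
  documentIds: string[];
  uploadedFileCount: number;
  sourceFilenames: string[];
  statementIntakeByPath: Record<string, unknown>; // per-PDF answers, shape unknown
}

// Assumption: a batch only ever moves to the next status in the list.
function canAdvance(from: BatchStatus, to: BatchStatus): boolean {
  return BATCH_STATUSES.indexOf(to) === BATCH_STATUSES.indexOf(from) + 1;
}
```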

documents/{documentId}.json

Stores per-PDF status (pending, parsing, onboarding, validated, failed), AI extraction output, human-validated corrections, and error messages.
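
A small sketch of how the per-PDF status values might be validated and tallied when reading document JSON. The helper names isDocumentStatus and countByStatus are hypothetical and not taken from the codebase.

```typescript
// Status values as listed in the description above.
const DOCUMENT_STATUSES = ["pending", "parsing", "onboarding", "validated", "failed"] as const;

type DocumentStatus = (typeof DOCUMENT_STATUSES)[number];

// Type guard for values read from untyped JSON.
function isDocumentStatus(value: unknown): value is DocumentStatus {
  return (
    typeof value === "string" &&
    (DOCUMENT_STATUSES as readonly string[]).indexOf(value) !== -1
  );
}

// Tallies how many documents are in each status.
function countByStatus(statuses: DocumentStatus[]): Record<DocumentStatus, number> {
  const counts: Record<DocumentStatus, number> = {
    pending: 0,
    parsing: 0,
    onboarding: 0,
    validated: 0,
    failed: 0,
  };
  for (const status of statuses) counts[status] += 1;
  return counts;
}
```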

Runtime adapters

Adapter          When               Implementation
GcsBatchStore    GCS_BUCKET is set  next/lib/document-workflow/gcs-batch-store.ts
LocalBatchStore  Local dev / tests  .data/batches-root/ (or LOCAL_BATCH_STORE_ROOT)
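
The selection rule in the table could be expressed along these lines. This is a sketch under assumptions: chooseBatchStore and StoreChoice are hypothetical names, the environment is passed in as a plain object so the function stays testable, and treating a blank GCS_BUCKET as unset is a guess.

```typescript
// Hypothetical result type: which adapter to build and with what setting.
type StoreChoice =
  | { adapter: "GcsBatchStore"; bucket: string }
  | { adapter: "LocalBatchStore"; root: string };

// Default local mirror path, as read from the adapters table.
const DEFAULT_LOCAL_ROOT = ".data/batches-root/";

function chooseBatchStore(env: Record<string, string | undefined>): StoreChoice {
  // Assumption: a blank GCS_BUCKET counts as "not set".
  const bucket = (env["GCS_BUCKET"] ?? "").trim();
  if (bucket) return { adapter: "GcsBatchStore", bucket };

  const root = (env["LOCAL_BATCH_STORE_ROOT"] ?? "").trim() || DEFAULT_LOCAL_ROOT;
  return { adapter: "LocalBatchStore", root };
}
```

In a Node process this would be called as chooseBatchStore(process.env).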

Upload and workflow flow

  • POST /api/batches creates batch.json and returns signed upload URLs.
  • Browser uploads each PDF to batches/{batchId}/raw/{filename}.
  • POST /api/batches/{batchId}/complete marks the batch uploaded and runs folder validation + AI parsing.
  • Optional: write _upload-complete.json and let Eventarc call POST /api/events/gcs.
  • Review UI reads document JSON and signed PDF URLs from the same bucket prefix.
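
The first three steps above can be sketched from the browser's point of view as follows. The request and response shapes (filenames in the POST body, batchId and uploads in the reply) are assumptions, as are the names uploadBatch and HttpCall. The HTTP function is injected so the sketch can run against a stub; the prepare step for metadata and per-PDF intake is omitted.

```typescript
// Minimal HTTP shape so a stub can stand in for the browser's fetch.
type HttpCall = (
  url: string,
  init: { method: string; body?: unknown },
) => Promise<{ ok: boolean; json(): Promise<unknown> }>;

// Hypothetical response shape for POST /api/batches.
interface CreateBatchResponse {
  batchId: string;
  uploads: { filename: string; url: string }[];
}

async function uploadBatch(
  http: HttpCall,
  files: { name: string; data: unknown }[],
): Promise<string> {
  // Step 1: create the batch and receive one signed URL per file.
  const created = await http("/api/batches", {
    method: "POST",
    body: JSON.stringify({ filenames: files.map((f) => f.name) }),
  });
  if (!created.ok) throw new Error("batch creation failed");
  const { batchId, uploads } = (await created.json()) as CreateBatchResponse;

  // Step 2: PUT each file directly to its signed URL.
  for (const upload of uploads) {
    const file = files.filter((f) => f.name === upload.filename)[0];
    if (!file) throw new Error(`no local file for ${upload.filename}`);
    const put = await http(upload.url, { method: "PUT", body: file.data });
    if (!put.ok) throw new Error(`upload failed for ${upload.filename}`);
  }

  // Step 3: tell the API the uploads are done so validation and parsing can start.
  const done = await http(`/api/batches/${batchId}/complete`, { method: "POST" });
  if (!done.ok) throw new Error("complete failed");
  return batchId;
}
```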

Environment variables

Variable                        Purpose
GCS_BUCKET                      Target bucket for uploads and JSON state
LOCAL_BATCH_STORE_ROOT          Override local mirror path for dev/tests
GOOGLE_APPLICATION_CREDENTIALS  Service account for signed URLs and reads

Validated metrics are available per document in the batch review flow after human validation.