Last updated 2026-05-21
This app stores uploaded PDFs in a GCS bucket (or a local mirror). When Cloud SQL is configured (see /docs/data-model), batch and document metadata live in PostgreSQL instead of batch.json and documents/*.json. Without a database, GCS JSON objects remain the metadata store.
Presigned upload URLs
When a batch is created, the API mints a short-lived presigned URL (a V4 signed URL) for each PDF. The browser PUTs files directly to GCS, so large reporting packages never pass through Cloud Run.
- The upload UI accepts a dragged or browsed folder, traverses subfolders, uploads PDFs via POST /api/batches + presigned PUTs, then routes to /batches/{id}/prepare for metadata and per-PDF intake before POST /api/batches/{id}/complete starts parsing.
- Each URL targets batches/{batchId}/raw/{filename} and expires after one hour.
- The service account behind the API needs permission to sign writes on the bucket.
- Review uses presigned read URLs so the PDF iframe loads from GCS without proxying bytes through the app.
- Local dev without GCS_BUCKET uses /api/batches/{id}/upload as a stand-in for the same PUT flow.
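The URL-minting step described above could look roughly like the following sketch. It assumes a signer injected as a function, shaped after `File.getSignedUrl` in `@google-cloud/storage` (V4, `action: "write"`), so the example runs without credentials. `mintUploadUrls`, `SignFn`, `UploadTarget` and the `application/pdf` content type are illustrative choices, not names taken from the app's code.

```typescript
// Hypothetical signer. In production this might wrap
// bucket.file(objectPath).getSignedUrl(...) from @google-cloud/storage.
type SignFn = (
  objectPath: string,
  opts: { version: "v4"; action: "write"; expires: number; contentType: string }
) => Promise<string>;

const ONE_HOUR_MS = 60 * 60 * 1000;

interface UploadTarget {
  filename: string;
  objectPath: string;
  url: string;
  expiresAt: number; // epoch milliseconds
}

// Mint one write URL per PDF, each targeting batches/{batchId}/raw/{filename}
// and expiring one hour after `now`.
async function mintUploadUrls(
  batchId: string,
  filenames: string[],
  sign: SignFn,
  now: number = Date.now()
): Promise<UploadTarget[]> {
  const expiresAt = now + ONE_HOUR_MS;
  return Promise.all(
    filenames.map(async (filename) => {
      const objectPath = `batches/${batchId}/raw/${filename}`;
      const url = await sign(objectPath, {
        version: "v4",
        action: "write",
        expires: expiresAt,
        contentType: "application/pdf", // assumption: uploads are always PDFs
      });
      return { filename, objectPath, url, expiresAt };
    })
  );
}
```

Injecting the signer keeps the path and expiry logic testable in local dev, where no bucket or service account is available.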
Object layout
gs://{GCS_BUCKET}/batches/{batchId}/
  batch.json                     # batch status, file counts, errors
  raw/{filename}.pdf             # uploaded reporting packages
  documents/{documentId}.json    # AI extraction + review status
  _upload-complete.json          # optional Eventarc trigger marker
batch.json
Tracks the batch-level workflow status (uploading → awaiting_metadata → awaiting_pdf_intake → validating → parsing → ready_for_review → completed) along with documentIds, uploadedFileCount, tags, name, notes, sourceFilenames, statementIntakeByPath (the per-PDF answers saved before complete), and id (a UTC timestamp). Statement rows are created when POST /api/batches/{id}/complete runs, not at initial upload.
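The status progression listed above can be modelled as an ordered list. The sketch below assumes the workflow only ever moves one step forward, which the text implies but does not state outright; the real workflow may allow other transitions. `nextBatchStatus` and `canTransition` are hypothetical helper names.

```typescript
// Batch statuses in the order given in the text.
const BATCH_STATUSES = [
  "uploading",
  "awaiting_metadata",
  "awaiting_pdf_intake",
  "validating",
  "parsing",
  "ready_for_review",
  "completed",
] as const;

type BatchStatus = (typeof BATCH_STATUSES)[number];

// Returns the following status, or null once the batch is completed.
// Assumption: strictly linear progression, no skips and no rollbacks.
function nextBatchStatus(current: BatchStatus): BatchStatus | null {
  const i = BATCH_STATUSES.indexOf(current);
  return BATCH_STATUSES[i + 1] ?? null;
}

// True only when `to` is the immediate successor of `from`.
function canTransition(from: BatchStatus, to: BatchStatus): boolean {
  return nextBatchStatus(from) === to;
}
```

Deriving the `BatchStatus` type from the array keeps the ordering and the type in one place, so adding a status means one edit rather than two.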
documents/{documentId}.json
Stores per-PDF status (pending, parsing, onboarding, validated, failed), AI extraction output, human-validated corrections, and error messages.
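A reader of these files may want to reject unknown statuses early. The sketch below validates the status values named above; the field names `status` and `error` and the `DocumentRecord` type are assumptions for illustration, since the text names the kinds of data stored but not the JSON keys.

```typescript
// Per-PDF statuses as listed in the text.
const DOCUMENT_STATUSES = ["pending", "parsing", "onboarding", "validated", "failed"] as const;

type DocumentStatus = (typeof DOCUMENT_STATUSES)[number];

// Hypothetical shape: only `status` is checked strictly here.
interface DocumentRecord {
  status: DocumentStatus;
  error?: string;
}

function isDocumentStatus(value: unknown): value is DocumentStatus {
  return (
    typeof value === "string" &&
    (DOCUMENT_STATUSES as readonly string[]).indexOf(value) !== -1
  );
}

// Parse a documents/{documentId}.json payload and validate its status.
function parseDocumentRecord(json: string): DocumentRecord {
  const raw: unknown = JSON.parse(json);
  if (typeof raw !== "object" || raw === null) {
    throw new Error("document JSON must be an object");
  }
  const obj = raw as Record<string, unknown>;
  const status = obj["status"];
  if (!isDocumentStatus(status)) {
    throw new Error(`unknown document status: ${String(status)}`);
  }
  const record: DocumentRecord = { status };
  const error = obj["error"];
  if (typeof error === "string") {
    record.error = error;
  }
  return record;
}
```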
Runtime adapters
| Adapter | When | Implementation |
|---|---|---|
| GcsBatchStore | GCS_BUCKET is set | next/lib/document-workflow/gcs-batch-store.ts |
| LocalBatchStore | Local dev / tests | .data/batches-root/ (or LOCAL_BATCH_STORE_ROOT) |
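The selection rule in the table above can be sketched as a small function over the environment. `selectBatchStore` and `BatchStoreConfig` are hypothetical names; how the app actually constructs GcsBatchStore and LocalBatchStore is not shown in this page.

```typescript
type BatchStoreConfig =
  | { kind: "gcs"; bucket: string }
  | { kind: "local"; root: string };

// Default local mirror path, as given in the table.
const DEFAULT_LOCAL_ROOT = ".data/batches-root";

// GCS wins whenever GCS_BUCKET is set to a non-empty value; otherwise fall
// back to the local store, honouring LOCAL_BATCH_STORE_ROOT if present.
function selectBatchStore(env: Record<string, string | undefined>): BatchStoreConfig {
  const bucket = env["GCS_BUCKET"]?.trim();
  if (bucket) {
    return { kind: "gcs", bucket };
  }
  const root = env["LOCAL_BATCH_STORE_ROOT"]?.trim();
  return { kind: "local", root: root || DEFAULT_LOCAL_ROOT };
}
```

In a Node.js process this would typically be called as `selectBatchStore(process.env)`. Treating a blank `GCS_BUCKET` as unset is a design choice of this sketch, not documented behavior.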
Upload and workflow flow
- POST /api/batches creates batch.json and returns signed upload URLs.
- Browser uploads each PDF to batches/{batchId}/raw/{filename}.
- POST /api/batches/{batchId}/complete marks the batch uploaded and runs folder validation + AI parsing.
- Optional: write _upload-complete.json and let Eventarc call POST /api/events/gcs.
- Review UI reads document JSON and signed PDF URLs from the same bucket prefix.
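From the browser's side, the first three steps above might look like the sketch below. The request and response shapes (`filenames`, `batchId`, `uploads`) are assumptions, as the page does not document the API payloads, and the metadata and per-PDF intake step at /batches/{id}/prepare is left out for brevity. A fetch-like function is injected so the example does not depend on DOM typings.

```typescript
// Minimal fetch-like signature (hypothetical); window.fetch is compatible in spirit.
type HttpFn = (
  url: string,
  init: { method: string; headers?: Record<string, string>; body?: unknown }
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

interface PdfFile {
  name: string;
  data: Uint8Array;
}

// Assumed response shape for POST /api/batches.
interface CreateBatchResponse {
  batchId: string;
  uploads: { filename: string; url: string }[];
}

async function uploadBatch(http: HttpFn, files: PdfFile[]): Promise<string> {
  // 1. Create the batch and receive one signed upload URL per file.
  const created = await http("/api/batches", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ filenames: files.map((f) => f.name) }),
  });
  if (!created.ok) throw new Error(`create batch failed: ${created.status}`);
  const { batchId, uploads } = (await created.json()) as CreateBatchResponse;

  // 2. PUT each PDF straight to its signed URL.
  for (const upload of uploads) {
    const file = files.find((f) => f.name === upload.filename);
    if (!file) throw new Error(`no local file for ${upload.filename}`);
    const put = await http(upload.url, {
      method: "PUT",
      headers: { "Content-Type": "application/pdf" },
      body: file.data,
    });
    if (!put.ok) throw new Error(`upload failed for ${upload.filename}: ${put.status}`);
  }

  // 3. Mark the batch complete, which starts validation and parsing.
  const done = await http(`/api/batches/${batchId}/complete`, { method: "POST" });
  if (!done.ok) throw new Error(`complete failed: ${done.status}`);
  return batchId;
}
```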
Environment variables
| Variable | Purpose |
|---|---|
| GCS_BUCKET | Target bucket for uploads and JSON state |
| LOCAL_BATCH_STORE_ROOT | Override local mirror path for dev/tests |
| GOOGLE_APPLICATION_CREDENTIALS | Service account for signed URLs and reads |
Validated metrics are available per document in the batch review flow after human validation.
