Why three phases

Separated concerns · Upload · Metadata · PDF context · High-touch

Batch upload, entity metadata prep, and per-PDF management are intentionally separate. Operators can upload a large folder to storage, leave, and return later to work through dozens of company×quarter steps without re-selecting files. Parsing and folder validation only run after both metadata and PDF context are saved.

End-to-end pipeline

Phase 1 — Upload (files only)

Upload tab · Folder drop · awaiting_metadata · POST /api/batches · uploaded

From Upload (/): drag or browse a reporting folder. The app traverses subfolders and lists every PDF with its path from the folder root. Confirm optional batch name, tags, and notes, then upload. The API creates a batch (status uploading), returns presigned PUT URLs, the browser uploads each PDF, and POST /api/batches/{id}/uploaded moves the batch to awaiting_metadata. No company or PDF intake happens in this phase.
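The folder traversal described above can be sketched as follows. This is a minimal illustration, not the app's actual code: the function name list_pdfs is hypothetical, and it assumes a local directory in place of the browser's folder drop.

```python
from pathlib import Path


def list_pdfs(root: Path) -> list[str]:
    """Return every PDF under root as a path relative to the folder root.

    Subfolders are traversed recursively; non-PDF files are skipped.
    The extension check is case-insensitive (an assumption).
    """
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

The relative paths returned here are the kind of per-file keys that a later per-PDF step could use to refer back to each document.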

Phase 2 — Metadata preparation

Prepare page · Entity discovery · Company · Period · company_period · company_reference_urls

Open /batches/{batchId}/prepare (linked from Recent batches as Continue prep while status is awaiting_metadata or awaiting_pdf_intake). A chevron stepper shows Upload → Metadata → PDF context → Parse & review. The Metadata stage runs entity discovery from filenames, then a wizard: three steps per company (legal entity name, reference URLs, then a company summary enriched from those links), followed by one step per reporting period and one step per company×quarter pair. Finishing the wizard POSTs the entity intake to /api/batches/{id}/metadata and sets the status to awaiting_pdf_intake.

Company context — new vs existing

New company discovered · Existing company found · Portfolio catalog · isNew

Each company uses three wizard steps. Step 1 (legal entity name): the In database / Not in database badge reflects only whether a companies row exists. The UI lists the provenance of each candidate name: quarterly PDF text, the portfolio database/catalog, and live checks against the company website or LinkedIn (title, Open Graph, JSON-LD) via POST /api/upload/resolve-company-legal-name, so that punctuation such as commas and "Inc." matches public records. Step 2 (reference links): human-in-the-loop URL confirmation. Step 3 (company summary): POST /api/upload/enrich-company-summary, called after the URLs are confirmed. The saved legal name becomes companies.display_name.
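As a rough sketch of how the metadata wizard's step order could be assembled (three steps per company, then periods, then company×quarter pairs): the step labels, the function name, and the tuple shape below are illustrative assumptions, not the app's data model.

```python
# Hypothetical labels for the three per-company steps described above.
COMPANY_STEPS = ("legal_name", "reference_urls", "summary")


def build_metadata_steps(
    companies: list[str],
    periods: list[str],
    pairs: list[tuple[str, str]],
) -> list[tuple[str, str]]:
    """Flatten discovered entities into an ordered list of (kind, subject) steps."""
    steps: list[tuple[str, str]] = []
    for company in companies:
        for step in COMPANY_STEPS:
            steps.append((f"company:{step}", company))
    for period in periods:
        steps.append(("period", period))
    # Pairs are passed in explicitly; the sketch does not assume every
    # company reports in every period.
    for company, period in pairs:
        steps.append(("company_period", f"{company} x {period}"))
    return steps
```

With one company, two periods, and one discovered pair, this yields six steps, which illustrates why a large batch can involve dozens of them.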

Company supporting URLs (1-to-many)

company_reference_urls · AI discovered · Operator · Website · LinkedIn

Each company can have many supporting URLs (company_reference_urls table). During metadata prep, AI pre-seeds links (inferred website, LinkedIn, web search) as ai_discovered rows in the working list. Operators remove wrong links, edit labels, click a URL to open a small preview window, or add operator URLs manually. Confirmed URLs are saved with metadata and used as company context for parsing and review. URLs are normalized (https only, deduped per company).
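One plausible reading of "https only, deduped per company" is sketched below. It assumes that http links are upgraded to https and that other schemes are dropped; the real normalizer may instead reject http outright. The function names are hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def normalize_url(raw: str) -> Optional[str]:
    """Return a canonical https URL, or None if the input is unusable."""
    raw = raw.strip()
    if not raw:
        return None
    if "://" not in raw:
        raw = "https://" + raw  # bare host typed by an operator
    parts = urlsplit(raw)
    if parts.scheme.lower() not in ("http", "https") or not parts.netloc:
        return None
    # Assumption: lowercase the host, drop trailing slash and fragment.
    path = parts.path.rstrip("/")
    return urlunsplit(("https", parts.netloc.lower(), path, parts.query, ""))


def dedupe_urls(urls: list[str]) -> list[str]:
    """Normalize one company's URLs and keep the first of each duplicate."""
    seen: set[str] = set()
    result: list[str] = []
    for url in urls:
        normalized = normalize_url(url)
        if normalized is not None and normalized not in seen:
            seen.add(normalized)
            result.append(normalized)
    return result
```

Calling dedupe_urls once per company keeps the deduplication scoped to that company, so two companies may still share a link.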

Phase 3 — PDF management

Per-PDF intake · statementIntakeByPath · Start parsing · complete API

After metadata, the prepare page shows one wizard step per PDF (reporting period, review focus, known issues, comparison notes — all optional). Saving the last step PATCHes statementIntakeByPath via /api/batches/{id}/pdf-intake, then POST /api/batches/{id}/complete registers statement rows, runs folder validation, and starts AI parsing. The batch detail page tracks progress through validating → parsing → ready_for_review.
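The PATCH body might look roughly like the structure built below. Only the statementIntakeByPath key comes from the text above; the per-PDF field names (reportingPeriod, reviewFocus, knownIssues, comparisonNotes) and the function name are assumptions chosen to mirror the four optional wizard fields.

```python
# Hypothetical field names for the four optional per-PDF answers.
INTAKE_FIELDS = ("reportingPeriod", "reviewFocus", "knownIssues", "comparisonNotes")


def build_pdf_intake_body(answers: dict[str, dict[str, str]]) -> dict:
    """Build a request body keyed by each PDF's path from the folder root.

    Unknown keys and blank values are dropped, since every field is optional.
    """
    by_path: dict[str, dict[str, str]] = {}
    for path, fields in answers.items():
        by_path[path] = {
            key: value.strip()
            for key, value in fields.items()
            if key in INTAKE_FIELDS and value and value.strip()
        }
    return {"statementIntakeByPath": by_path}
```

A PDF with no answers still gets an empty entry, so the server can tell "reviewed with nothing to add" apart from "not yet visited" (a design choice of this sketch, not a documented behavior).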

Resume and saved progress

localStorage · batch draft · Continue prep · Save progress

Metadata and PDF wizards auto-save drafts in the browser keyed by batch id (zoethales:batch-metadata:{id} and zoethales:batch-pdf-intake:{id}). Operators can close the prepare page and use Continue prep on Upload or Recent batches. Start over on a company step clears the metadata draft and re-runs discovery.
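The draft keys above can be modeled with a small key-value store. This sketch uses a Python dict as a stand-in for the browser's localStorage; the class and function names are hypothetical, and only the two key formats are taken from the text.

```python
import json
from typing import Any, Optional


def metadata_key(batch_id: str) -> str:
    return f"zoethales:batch-metadata:{batch_id}"


def pdf_intake_key(batch_id: str) -> str:
    return f"zoethales:batch-pdf-intake:{batch_id}"


class DraftStore:
    """Stand-in for localStorage: string keys mapped to JSON strings."""

    def __init__(self) -> None:
        self._storage: dict[str, str] = {}

    def save(self, key: str, draft: Any) -> None:
        self._storage[key] = json.dumps(draft)

    def load(self, key: str) -> Optional[Any]:
        raw = self._storage.get(key)
        return None if raw is None else json.loads(raw)

    def clear(self, key: str) -> None:
        # Mirrors "Start over": the draft is removed; missing keys are ignored.
        self._storage.pop(key, None)
```

Because each draft is keyed by batch id, clearing the metadata draft for one batch leaves its PDF intake draft and every other batch's drafts untouched.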

Batch statuses (prepare vs processing)

Status	Meaning	Operator action
uploading	Batch created; browser uploading PDFs	Wait for upload to finish
awaiting_metadata	Files in storage; entity context not saved	Open /batches/{id}/prepare — complete metadata wizard
awaiting_pdf_intake	Companies/periods saved; per-PDF context pending	Finish PDF wizard and Start parsing
validating	Folder rules checked after complete	Monitor batch detail page
parsing	AI extraction running per document	Monitor batch detail page
ready_for_review	All PDFs parsed; human review can start	Open Review links per document
completed	Every document validated or failed	View portfolio metrics
failed	Validation or workflow error on batch	Read error on batch detail page
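The status table reads as a linear pipeline with one error exit, which can be sketched as a small transition check. The rule that failed is reachable from any non-terminal status is an assumption; the table only says it marks a validation or workflow error.

```python
PIPELINE = [
    "uploading",
    "awaiting_metadata",
    "awaiting_pdf_intake",
    "validating",
    "parsing",
    "ready_for_review",
    "completed",
]

# Statuses during which the prepare page is still the operator's next stop.
PREPARE_STATUSES = {"awaiting_metadata", "awaiting_pdf_intake"}


def can_transition(current: str, target: str) -> bool:
    """Allow only the next pipeline status, or failed from a non-terminal one."""
    if current not in PIPELINE or current == "completed":
        return False  # failed, completed, and unknown statuses are terminal here
    if target == "failed":
        return True
    return PIPELINE[PIPELINE.index(current) + 1] == target


def shows_continue_prep(status: str) -> bool:
    return status in PREPARE_STATUSES
```

For example, can_transition("uploading", "parsing") is False because the batch cannot skip the metadata and PDF context stages.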

API routes (prepare flow)

Method	Route	When
POST	/api/batches	Create batch + signed upload URLs (filenames only)
PUT	/api/batches/{id}/upload?filename=…	Upload one PDF (or GCS presigned PUT)
POST	/api/batches/{id}/uploaded	Finalize upload → awaiting_metadata
POST	/api/upload/discover-entities	AI entity discovery from filenames (metadata wizard)
POST	/api/batches/{id}/metadata	Save entity intake + company URLs → awaiting_pdf_intake
PATCH	/api/batches/{id}/pdf-intake	Save per-PDF statementIntakeByPath
POST	/api/batches/{id}/complete	Register statements + start validation/parsing (only from awaiting_pdf_intake)
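To show how the routes line up with batch status, the sketch below maps a status to the calls that would advance the batch. The function name and return shape are hypothetical; the routes and the awaiting_pdf_intake guard on complete are taken from the table.

```python
def next_calls(status: str, batch_id: str) -> list[tuple[str, str]]:
    """Return the (method, route) calls that move a batch past its status."""
    base = f"/api/batches/{batch_id}"
    if status == "uploading":
        return [("POST", f"{base}/uploaded")]
    if status == "awaiting_metadata":
        return [("POST", f"{base}/metadata")]
    if status == "awaiting_pdf_intake":
        # complete is only valid from this status, after intake is saved.
        return [("PATCH", f"{base}/pdf-intake"), ("POST", f"{base}/complete")]
    # validating and later statuses are driven by the server, not the operator.
    return []
```

A client built this way never issues complete early, because the call simply is not offered before awaiting_pdf_intake.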

UI routes

/ — Upload tab: folder drop + Upload folder dialog (batch label only).
/batches/{batchId}/prepare — Metadata and PDF wizards (chevron pipeline).
/batches/{batchId} — Batch progress, document table, link back to prepare if incomplete.
/batches/{batchId}/documents/{documentId}/review — Human validation after parsing.

Folder validation

PDFs only · Unique filenames · CompanyName_Q#_YYYY · Pre-parse gate

Non-PDF files are ignored at listing time. Duplicate filenames in one batch are rejected. Preferred naming: CompanyName_Q2_2025.pdf. Validation runs when POST /api/batches/{id}/complete is called, after metadata and PDF intake are saved, not during the initial file upload.
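The three folder rules can be sketched as a single check. This is an illustration under stated assumptions: duplicates are compared by exact base filename, and a filename that misses the preferred pattern only produces a warning, since the text calls the pattern preferred rather than required.

```python
import re
from collections import Counter
from pathlib import PurePosixPath

# CompanyName_Q#_YYYY.pdf, e.g. CompanyName_Q2_2025.pdf
PREFERRED_NAME = re.compile(
    r"^(?P<company>.+)_Q(?P<quarter>[1-4])_(?P<year>\d{4})\.pdf$", re.IGNORECASE
)


def validate_folder(paths: list[str]) -> dict:
    """Apply the folder rules to paths given relative to the folder root."""
    pdfs = [p for p in paths if p.lower().endswith(".pdf")]  # non-PDFs ignored
    names = [PurePosixPath(p).name for p in pdfs]
    counts = Counter(names)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    nonstandard = sorted({name for name in names if not PREFERRED_NAME.match(name)})
    return {
        "pdfs": pdfs,
        "duplicates": duplicates,    # blocking: the batch is rejected
        "nonstandard": nonstandard,  # warning only (assumption)
        "ok": not duplicates,
    }
```

Two files with the same name in different subfolders count as duplicates in this sketch, which matches a strict reading of "unique filenames" within one batch.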

Quality over speed

High-touch · Context-first · No fire-and-forget

Folder drop is step one, not the finish line. Separating upload from metadata and PDF context keeps large batches manageable and ensures company reference URLs and company×quarter mappings exist before any parser runs.