Upload files first; add context before parsing.
Metadata and PDF intake are separate, resumable steps.
Last updated 2026-05-21
Why three phases
Batch upload, entity metadata prep, and per-PDF management are intentionally separate. Operators can upload a large folder to storage, leave, and return later to work through dozens of company×quarter steps without re-selecting files. Parsing and folder validation only run after both metadata and PDF context are saved.
End-to-end pipeline
Phase 1 — Upload (files only)
From Upload (/): drag or browse a reporting folder. The app traverses subfolders and lists every PDF with its path from the folder root. Confirm the optional batch name, tags, and notes, then upload. The API creates a batch (status uploading) and returns presigned PUT URLs; the browser uploads each PDF; then POST /api/batches/{id}/uploaded moves the batch to awaiting_metadata. No company or PDF intake happens in this phase.
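The folder traversal described above can be sketched as follows. This is only an illustration under stated assumptions: the real listing runs in the browser, and the function name `list_pdfs` is hypothetical. It walks a folder recursively and returns each PDF's path relative to the folder root.

```python
from pathlib import Path
from typing import List


def list_pdfs(root: Path) -> List[str]:
    """Return every PDF under root as a POSIX-style path relative to root.

    Hypothetical helper: the extension check is case-insensitive and
    non-PDF files are skipped, which is an assumption about the listing.
    """
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Returning relative paths keeps the listing independent of where the folder lives on the operator's machine.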
Phase 2 — Metadata preparation
Open /batches/{batchId}/prepare (linked from Recent batches as Continue prep while status is awaiting_metadata or awaiting_pdf_intake). A chevron stepper shows Upload → Metadata → PDF context → Parse & review. Metadata runs entity discovery from filenames, then a wizard: each company uses three steps (confirm legal entity name, confirm reference URLs, then a company summary enriched from those links) → each reporting period → each company×quarter pair. Finishing the wizard POSTs entity intake to /api/batches/{id}/metadata and sets the status to awaiting_pdf_intake.
Company context — new vs existing
Each company uses three wizard steps. Step 1 (legal entity name): the In database / Not in database badge reflects only whether a row exists in the companies table. The UI lists provenance for each candidate name: quarterly PDF text, the portfolio database/catalog, and live checks against the company website or LinkedIn (title, Open Graph, JSON-LD) via POST /api/upload/resolve-company-legal-name, so that punctuation such as commas and Inc. matches public records. Step 2 (reference links): human-in-the-loop URL confirmation. Step 3 (company summary): POST /api/upload/enrich-company-summary runs after URLs are confirmed. The saved legal name becomes companies.display_name.
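One way to picture the metadata wizard is as a flat, ordered list of steps: three per company, then one per reporting period, then one per company×quarter pair. The sketch below is illustrative only; the `WizardStep` type, the step identifiers, and `build_metadata_steps` are hypothetical names, not the app's actual data model.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class WizardStep:
    kind: str     # hypothetical step identifier
    subject: str  # company name, period label, or "company × period"


# The three per-company steps, in the order the wizard presents them.
COMPANY_STEPS = ("legal_name", "reference_links", "company_summary")


def build_metadata_steps(
    companies: Iterable[str],
    periods: Iterable[str],
    pairs: Iterable[Tuple[str, str]],
) -> List[WizardStep]:
    """Expand discovered entities into an ordered list of wizard steps."""
    steps: List[WizardStep] = []
    for company in companies:
        for kind in COMPANY_STEPS:
            steps.append(WizardStep(kind, company))
    steps.extend(WizardStep("reporting_period", p) for p in periods)
    steps.extend(WizardStep("company_quarter", f"{c} × {p}") for c, p in pairs)
    return steps
```

With two companies, one period, and two pairs, this yields nine steps, which is why a large batch can run to dozens of steps.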
Company supporting URLs (1-to-many)
Each company can have many supporting URLs (company_reference_urls table). During metadata prep, AI pre-seeds links (inferred website, LinkedIn, web search) as ai_discovered rows in the working list. Operators remove wrong links, edit labels, click a URL to open a small preview window, or add operator URLs manually. Confirmed URLs are saved with metadata and used as company context for parsing and review. URLs are normalized (https only, deduped per company).
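The normalization rule (https only, deduped per company) might look roughly like the sketch below. The specifics are assumptions rather than the app's documented behavior: `http` and scheme-less URLs are upgraded to `https`, hosts are lowercased, trailing slashes and fragments are dropped, and other schemes are rejected. The function names are hypothetical.

```python
from typing import Iterable, List, Optional
from urllib.parse import urlsplit, urlunsplit


def normalize_url(raw: str) -> Optional[str]:
    """Return an https URL in canonical form, or None if unusable."""
    text = raw.strip()
    if not text:
        return None
    if "://" not in text:
        # Assumption: a bare host such as "example.com" means https.
        text = "https://" + text
    parts = urlsplit(text)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    path = parts.path.rstrip("/")
    # Force https, lowercase the host, drop the fragment.
    return urlunsplit(("https", parts.netloc.lower(), path, parts.query, ""))


def dedupe_company_urls(raw_urls: Iterable[str]) -> List[str]:
    """Normalize one company's URLs and drop duplicates, keeping first-seen order."""
    seen = set()
    result: List[str] = []
    for raw in raw_urls:
        url = normalize_url(raw)
        if url is not None and url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

Deduplicating after normalization means `http://Example.com/` and `https://example.com` collapse into a single entry.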
Phase 3 — PDF management
After metadata, the prepare page shows one wizard step per PDF (reporting period, review focus, known issues, comparison notes — all optional). Saving the last step PATCHes statementIntakeByPath via /api/batches/{id}/pdf-intake, then POST /api/batches/{id}/complete registers statement rows, runs folder validation, and starts AI parsing. The batch detail page tracks progress through validating → parsing → ready_for_review.
Resume and saved progress
Metadata and PDF wizards auto-save drafts in the browser, keyed by batch id (zoethales:batch-metadata:{id} and zoethales:batch-pdf-intake:{id}). Operators can close the prepare page and resume through Continue prep on Upload or Recent batches. Choosing Start over on a company step clears the metadata draft and re-runs discovery.
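The draft keys above follow a simple pattern, sketched here with a dictionary standing in for browser storage. The key formats come from the text; the `DraftStore` class and the JSON serialization are assumptions for illustration.

```python
import json
from typing import Any, Dict, Optional


def metadata_draft_key(batch_id: str) -> str:
    return f"zoethales:batch-metadata:{batch_id}"


def pdf_intake_draft_key(batch_id: str) -> str:
    return f"zoethales:batch-pdf-intake:{batch_id}"


class DraftStore:
    """Hypothetical in-memory stand-in for the browser's key/value storage."""

    def __init__(self) -> None:
        self._items: Dict[str, str] = {}

    def save(self, key: str, draft: Any) -> None:
        # Assumption: drafts are stored as JSON strings.
        self._items[key] = json.dumps(draft)

    def load(self, key: str) -> Optional[Any]:
        raw = self._items.get(key)
        return json.loads(raw) if raw is not None else None

    def clear(self, key: str) -> None:
        # Clearing a missing key is a no-op, as with Start over on a fresh batch.
        self._items.pop(key, None)
```

Because each key embeds the batch id, drafts for different batches cannot overwrite one another.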
Batch statuses (prepare vs processing)
| Status | Meaning | Operator action |
|---|---|---|
| uploading | Batch created; browser uploading PDFs | Wait for upload to finish |
| awaiting_metadata | Files in storage; entity context not saved | Open /batches/{id}/prepare — complete metadata wizard |
| awaiting_pdf_intake | Companies/periods saved; per-PDF context pending | Finish PDF wizard and Start parsing |
| validating | Folder rules checked after complete | Monitor batch detail page |
| parsing | AI extraction running per document | Monitor batch detail page |
| ready_for_review | All PDFs parsed; human review can start | Open Review links per document |
| completed | Every document validated or failed | View portfolio metrics |
| failed | Validation or workflow error on batch | Read error on batch detail page |
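The statuses in the table form a mostly linear progression, which can be sketched as a transition map. The forward order follows the table; which statuses can lead to failed is an assumption here (only validating and parsing), and the names `TRANSITIONS`, `can_transition`, and `shows_continue_prep` are hypothetical.

```python
from typing import Dict, Set

# Allowed next statuses. The "failed" edges are an assumption.
TRANSITIONS: Dict[str, Set[str]] = {
    "uploading": {"awaiting_metadata"},
    "awaiting_metadata": {"awaiting_pdf_intake"},
    "awaiting_pdf_intake": {"validating"},
    "validating": {"parsing", "failed"},
    "parsing": {"ready_for_review", "failed"},
    "ready_for_review": {"completed"},
    "completed": set(),
    "failed": set(),
}

# Statuses during which the Continue prep link is shown.
PREPARE_STATUSES = {"awaiting_metadata", "awaiting_pdf_intake"}


def can_transition(current: str, target: str) -> bool:
    """Return True if target is an allowed next status after current."""
    return target in TRANSITIONS.get(current, set())


def shows_continue_prep(status: str) -> bool:
    """Return True while the batch is still in the prepare phases."""
    return status in PREPARE_STATUSES
```

A map like this makes it explicit that parsing cannot start straight from awaiting_metadata: the batch must pass through awaiting_pdf_intake first.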
API routes (prepare flow)
| Method | Route | When |
|---|---|---|
| POST | /api/batches | Create batch + signed upload URLs (filenames only) |
| PUT | /api/batches/{id}/upload?filename=… | Upload one PDF (or GCS presigned PUT) |
| POST | /api/batches/{id}/uploaded | Finalize upload → awaiting_metadata |
| POST | /api/upload/discover-entities | AI entity discovery from filenames (metadata wizard) |
| POST | /api/batches/{id}/metadata | Save entity intake + company URLs → awaiting_pdf_intake |
| PATCH | /api/batches/{id}/pdf-intake | Save per-PDF statementIntakeByPath |
| POST | /api/batches/{id}/complete | Register statements + start validation/parsing (only from awaiting_pdf_intake) |
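Read top to bottom, the routes in the table describe the order of calls for one batch. The sketch below only builds that sequence as (method, path) pairs and sends no requests. `prepare_flow_calls` is a hypothetical helper, percent-encoding the filename in the query string is an assumption, and in practice the batch id is only known after the first POST returns.

```python
from typing import List, Tuple
from urllib.parse import quote


def prepare_flow_calls(batch_id: str, filenames: List[str]) -> List[Tuple[str, str]]:
    """List the prepare-flow requests for one batch, in call order."""
    base = f"/api/batches/{batch_id}"
    calls: List[Tuple[str, str]] = [("POST", "/api/batches")]
    # One upload per PDF (or a presigned PUT straight to storage).
    calls += [("PUT", f"{base}/upload?filename={quote(name)}") for name in filenames]
    calls += [
        ("POST", f"{base}/uploaded"),
        ("POST", "/api/upload/discover-entities"),
        ("POST", f"{base}/metadata"),
        ("PATCH", f"{base}/pdf-intake"),
        ("POST", f"{base}/complete"),
    ]
    return calls
```

The final call appears last because it is only accepted from awaiting_pdf_intake, after both metadata and per-PDF context are saved.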
UI routes
- / — Upload tab: folder drop + Upload folder dialog (batch label only).
- /batches/{batchId}/prepare — Metadata and PDF wizards (chevron pipeline).
- /batches/{batchId} — Batch progress, document table, link back to prepare if incomplete.
- /batches/{batchId}/documents/{documentId}/review — Human validation after parsing.
Folder validation
Non-PDF files are ignored at listing time. Duplicate filenames in one batch are rejected. Preferred naming: CompanyName_Q2_2025.pdf. Validation runs when POST /api/batches/{id}/complete is called, after metadata and PDF intake, not during the initial file upload.
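Two of these rules, duplicate detection and the preferred naming pattern, lend themselves to a short sketch. The details are assumptions rather than the validator's actual behavior: duplicates are compared by base filename, case-insensitively, across subfolders, and the regular expression is one plausible reading of CompanyName_Q2_2025.pdf. Both function names are hypothetical.

```python
import re
from collections import Counter
from pathlib import PurePosixPath
from typing import Iterable, List, Optional, Tuple

# Assumed pattern: <company>_Q<1-4>_<four-digit year>.pdf
PREFERRED_NAME = re.compile(
    r"^(?P<company>.+)_Q(?P<quarter>[1-4])_(?P<year>\d{4})\.pdf$",
    re.IGNORECASE,
)


def duplicate_filenames(paths: Iterable[str]) -> List[str]:
    """Return lowercased base filenames that occur more than once in the batch."""
    counts = Counter(PurePosixPath(p).name.lower() for p in paths)
    return sorted(name for name, n in counts.items() if n > 1)


def parse_preferred_name(filename: str) -> Optional[Tuple[str, int, int]]:
    """Return (company, quarter, year) if the filename follows the preferred pattern."""
    match = PREFERRED_NAME.match(filename)
    if match is None:
        return None
    return match.group("company"), int(match.group("quarter")), int(match.group("year"))
```

A filename that does not match the pattern is not necessarily invalid; the pattern is described as preferred, so a `None` result would at most prompt the operator to supply the period by hand.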
Quality over speed
Folder drop is step one, not the finish line. Separating upload from metadata and PDF context keeps large batches manageable and ensures company reference URLs and company×quarter mappings exist before any parser runs.
