Business documentation

Upload & batch prepare workflow

Three separate phases: file upload, metadata prep (companies, periods, URLs), then per-PDF context before parsing starts.

Upload files first; add context before parsing.

Metadata and PDF intake are separate, resumable steps.

Last updated 2026-05-21

Why three phases

Separated concerns · Upload · Metadata · PDF context · High-touch

Batch upload, entity metadata prep, and per-PDF management are intentionally separate. Operators can upload a large folder to storage, leave, and return later to work through dozens of company×quarter steps without re-selecting files. Parsing and folder validation only run after both metadata and PDF context are saved.

End-to-end pipeline

Phase 1 — Upload (files only)

Upload tab · Folder drop · awaiting_metadata · POST /api/batches · uploaded

From Upload (/): drag or browse a reporting folder. The app traverses subfolders and lists every PDF with its path from the folder root. Confirm the optional batch name, tags, and notes, then upload. The API creates a batch (status uploading) and returns presigned PUT URLs; the browser uploads each PDF; then POST /api/batches/{id}/uploaded moves the batch to awaiting_metadata. No company or PDF intake happens in this phase.
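The folder listing step can be illustrated with a minimal Python sketch. The real listing happens in the browser; the function name list_pdfs and the case-insensitive extension check are assumptions made for illustration, not the app's actual code.

```python
from pathlib import Path


def list_pdfs(root: Path) -> list[str]:
    """Walk root recursively and return each PDF's path relative to root.

    Assumption: a file counts as a PDF when its extension is .pdf,
    compared case-insensitively. Non-PDF files are skipped.
    """
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Returning paths relative to the folder root mirrors what the upload list shows, so the same relative path can later identify a PDF during per-PDF intake.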

Phase 2 — Metadata preparation

Prepare page · Entity discovery · Company · Period · company_period · company_reference_urls

Open /batches/{batchId}/prepare (linked from Recent batches as Continue prep while status is awaiting_metadata or awaiting_pdf_intake). A chevron stepper shows Upload → Metadata → PDF context → Parse & review. The Metadata stage runs entity discovery from filenames, then a wizard that covers each company in three steps (confirm the legal entity name, confirm reference URLs, then review a company summary enriched from those links), then each reporting period, then each company×quarter pair. Finishing POSTs the entity intake to /api/batches/{id}/metadata and sets the status to awaiting_pdf_intake.

Company context — new vs existing

New company discovered · Existing company found · Portfolio catalog · isNew

Each company uses three wizard steps. Step 1 (legal entity name): a badge reading In database or Not in database reflects only whether a companies row exists. The UI lists the provenance of each candidate name (quarterly PDF text, the portfolio database/catalog, and live checks against the company website or LinkedIn: title, Open Graph, JSON-LD) via POST /api/upload/resolve-company-legal-name, so that punctuation such as commas and Inc. matches public records. Step 2 (reference links): human-in-the-loop URL confirmation. Step 3 (company summary): POST /api/upload/enrich-company-summary runs after the URLs are confirmed. The saved legal name becomes companies.display_name.
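As a rough illustration, the three per-company steps can be combined with the period and company×quarter steps of the metadata wizard into one ordered step list. The step identifiers, the function name, and the assumption that all company steps come before all period steps are hypothetical; the actual wizard may order or name its steps differently.

```python
# Hypothetical step identifiers for the three per-company wizard steps.
COMPANY_STEPS = ("legal_name", "reference_urls", "company_summary")


def metadata_wizard_steps(companies, periods, pairs):
    """Build an ordered list of wizard step ids.

    companies: company names found by entity discovery
    periods:   reporting periods, e.g. "Q2 2025"
    pairs:     (company, period) tuples that have at least one PDF
    """
    steps = []
    for company in companies:
        for step in COMPANY_STEPS:
            steps.append(f"company/{company}/{step}")
    for period in periods:
        steps.append(f"period/{period}")
    for company, period in pairs:
        steps.append(f"company_period/{company}/{period}")
    return steps
```

For one company with one reporting period this yields five steps, which shows why a batch covering many companies and quarters quickly grows to dozens of steps.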

Company supporting URLs (1-to-many)

company_reference_urls · AI discovered · Operator · Website · LinkedIn

Each company can have many supporting URLs (company_reference_urls table). During metadata prep, AI pre-seeds links (inferred website, LinkedIn, web search) as ai_discovered rows in the working list. Operators remove wrong links, edit labels, click a URL to open a small preview window, or add operator URLs manually. Confirmed URLs are saved with metadata and used as company context for parsing and review. URLs are normalized (https only, deduped per company).
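The normalization rule (https only, deduped per company) could look roughly like the sketch below. The exact rules are not documented here, so several choices are assumptions: http is upgraded to https, other schemes are dropped, the host is lowercased, and a trailing slash is ignored when comparing. The function names are hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def normalize_url(raw: str) -> Optional[str]:
    """Return an https URL, or None if the input is not a web URL."""
    candidate = raw.strip()
    if "://" not in candidate:
        # Assumption: bare hosts such as "example.com/about" are accepted.
        candidate = "https://" + candidate
    parts = urlsplit(candidate)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        return None
    path = parts.path.rstrip("/")
    # Force https and drop the fragment; keep the query string.
    return urlunsplit(("https", parts.netloc.lower(), path, parts.query, ""))


def dedupe_urls(urls: list[str]) -> list[str]:
    """Normalize one company's URLs and drop duplicates, keeping order."""
    seen = set()
    result = []
    for raw in urls:
        url = normalize_url(raw)
        if url is not None and url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

Deduplication runs on one company's list at a time, so the same URL may still appear under two different companies.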

Phase 3 — PDF management

Per-PDF intake · statementIntakeByPath · Start parsing · complete API

After metadata, the prepare page shows one wizard step per PDF (reporting period, review focus, known issues, comparison notes — all optional). Saving the last step PATCHes statementIntakeByPath via /api/batches/{id}/pdf-intake, then POST /api/batches/{id}/complete registers statement rows, runs folder validation, and starts AI parsing. The batch detail page tracks progress through validating → parsing → ready_for_review.

Resume and saved progress

localStorage · batch draft · Continue prep · Save progress

Metadata and PDF wizards auto-save drafts in the browser keyed by batch id (zoethales:batch-metadata:{id} and zoethales:batch-pdf-intake:{id}). Operators can close the prepare page and use Continue prep on Upload or Recent batches. Start over on a company step clears the metadata draft and re-runs discovery.
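The draft keys follow a fixed pattern, which the sketch below reproduces. The key formats come from the text above; the DraftStore class is a hypothetical in-memory stand-in for browser localStorage, and storing drafts as JSON strings is an assumption.

```python
import json


def metadata_draft_key(batch_id: str) -> str:
    return f"zoethales:batch-metadata:{batch_id}"


def pdf_intake_draft_key(batch_id: str) -> str:
    return f"zoethales:batch-pdf-intake:{batch_id}"


class DraftStore:
    """In-memory stand-in for localStorage (string keys, string values)."""

    def __init__(self):
        self._items = {}

    def save(self, key: str, draft: dict) -> None:
        self._items[key] = json.dumps(draft)

    def load(self, key: str):
        raw = self._items.get(key)
        return json.loads(raw) if raw is not None else None

    def clear(self, key: str) -> None:
        # Mirrors "Start over": discard the saved draft if present.
        self._items.pop(key, None)
```

Because each key includes the batch id, drafts for different batches never collide, and clearing the metadata draft leaves the PDF intake draft untouched.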

Batch statuses (prepare vs processing)

Status | Meaning | Operator action
uploading | Batch created; browser uploading PDFs | Wait for upload to finish
awaiting_metadata | Files in storage; entity context not saved | Open /batches/{id}/prepare and complete the metadata wizard
awaiting_pdf_intake | Companies/periods saved; per-PDF context pending | Finish PDF wizard and Start parsing
validating | Folder rules checked after complete | Monitor batch detail page
parsing | AI extraction running per document | Monitor batch detail page
ready_for_review | All PDFs parsed; human review can start | Open Review links per document
completed | Every document validated or failed | View portfolio metrics
failed | Validation or workflow error on batch | Read error on batch detail page
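The statuses above form a simple forward-only progression, sketched below as a transition map. The forward order is taken from this document; allowing failed from every non-terminal status and treating completed and failed as terminal are assumptions, as are the function names.

```python
# Forward transitions come from the status table; the "failed" edges
# are an assumption about where a batch-level error can occur.
ALLOWED_TRANSITIONS = {
    "uploading": {"awaiting_metadata", "failed"},
    "awaiting_metadata": {"awaiting_pdf_intake", "failed"},
    "awaiting_pdf_intake": {"validating", "failed"},
    "validating": {"parsing", "failed"},
    "parsing": {"ready_for_review", "failed"},
    "ready_for_review": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}


def can_transition(current: str, target: str) -> bool:
    """True if a batch may move directly from current to target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def needs_prepare(status: str) -> bool:
    """True while the Continue prep link applies to the batch."""
    return status in {"awaiting_metadata", "awaiting_pdf_intake"}
```

For example, can_transition("awaiting_metadata", "parsing") is False, which reflects the rule that parsing cannot start until per-PDF context is saved.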

API routes (prepare flow)

Method | Route | When
POST | /api/batches | Create batch + signed upload URLs (filenames only)
PUT | /api/batches/{id}/upload?filename=… | Upload one PDF (or GCS presigned PUT)
POST | /api/batches/{id}/uploaded | Finalize upload → awaiting_metadata
POST | /api/upload/discover-entities | AI entity discovery from filenames (metadata wizard)
POST | /api/batches/{id}/metadata | Save entity intake + company URLs → awaiting_pdf_intake
PATCH | /api/batches/{id}/pdf-intake | Save per-PDF statementIntakeByPath
POST | /api/batches/{id}/complete | Register statements + start validation/parsing (only from awaiting_pdf_intake)
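The order in which a client calls these routes can be sketched as a plain request plan. The routes and methods are taken from the table; the function names are hypothetical, request bodies are omitted because their shapes are not documented here, and the per-file PUT uploads are left out because they depend on the URLs returned by the first call.

```python
def prepare_flow_requests(batch_id: str) -> list[tuple[str, str]]:
    """Return (method, path) pairs in the order the prepare flow calls them.

    batch_id is only known after POST /api/batches has returned.
    """
    base = f"/api/batches/{batch_id}"
    return [
        ("POST", "/api/batches"),
        ("POST", f"{base}/uploaded"),
        ("POST", "/api/upload/discover-entities"),
        ("POST", f"{base}/metadata"),
        ("PATCH", f"{base}/pdf-intake"),
        ("POST", f"{base}/complete"),
    ]


def complete_request(batch_id: str, status: str) -> tuple[str, str]:
    """Build the complete call, refusing any status other than awaiting_pdf_intake."""
    if status != "awaiting_pdf_intake":
        raise ValueError(f"cannot complete batch in status {status!r}")
    return ("POST", f"/api/batches/{batch_id}/complete")
```

Checking the status on the client before calling complete is only a convenience; the table indicates the route itself accepts the call only from awaiting_pdf_intake.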

UI routes

  • / — Upload tab: folder drop + Upload folder dialog (batch label only).
  • /batches/{batchId}/prepare — Metadata and PDF wizards (chevron pipeline).
  • /batches/{batchId} — Batch progress, document table, link back to prepare if incomplete.
  • /batches/{batchId}/documents/{documentId}/review — Human validation after parsing.

Folder validation

PDFs only · Unique filenames · CompanyName_Q#_YYYY · Pre-parse gate

Non-PDF files are ignored at listing time. Duplicate filenames in one batch are rejected. Preferred naming: CompanyName_Q2_2025.pdf. Validation runs when POST /api/batches/{id}/complete is called, after metadata and PDF intake, not during the initial file upload.
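These folder rules can be expressed as a small check, shown here as a sketch rather than the service's actual validator. Assumptions: duplicates are detected on the bare filename (case-sensitive) regardless of subfolder, quarters run Q1 to Q4, and an off-pattern name is reported but does not fail the batch because the naming is described only as preferred. The function name and result keys are hypothetical.

```python
import re
from collections import Counter
from pathlib import PurePosixPath

# Preferred pattern from the text: CompanyName_Q#_YYYY.pdf
PREFERRED_NAME = re.compile(r"^.+_Q[1-4]_\d{4}\.pdf$", re.IGNORECASE)


def validate_folder(paths: list[str]) -> dict:
    """Apply the folder rules to a list of paths relative to the folder root."""
    # Non-PDF files are ignored rather than rejected.
    pdfs = [p for p in paths if p.lower().endswith(".pdf")]
    name_counts = Counter(PurePosixPath(p).name for p in pdfs)
    duplicates = sorted(n for n, c in name_counts.items() if c > 1)
    off_pattern = sorted(
        p for p in pdfs if not PREFERRED_NAME.match(PurePosixPath(p).name)
    )
    return {
        "pdfs": pdfs,
        "duplicates": duplicates,    # rejects the batch
        "off_pattern": off_pattern,  # warning only (assumption)
        "ok": not duplicates,
    }
```

Running the check on a list containing a/Acme_Q2_2025.pdf and b/Acme_Q2_2025.pdf reports the shared filename as a duplicate even though the two files sit in different subfolders.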

Quality over speed

High-touch · Context-first · No fire-and-forget

Folder drop is step one, not the finish line. Separating upload from metadata and PDF context keeps large batches manageable and ensures company reference URLs and company×quarter mappings exist before any parser runs.