Last updated 2026-05-21

Portfolio PDF parsing uses a fixed system prompt (next/lib/document-workflow/extraction-system-prompt.ts) that makes the model document how each metric maps from board-deck language into the documents.extraction JSONB. This is the same shape that is stored in PostgreSQL and shown in review as Tables & justifications.

Two-phase output

Phase 1: Per-field justifications, each with sourceQuote, synonym rationale, pageHint, and confidence. A justification is required for every metric, whether it is populated or explicitly absent.
Phase 2: insights[], emitted only after the justifications. The narrative must not introduce any number that is not justified in Phase 1.
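
The Phase 2 rule can be checked mechanically after the model responds. The following is a minimal sketch, not the project's actual validator: the interface names, the field, value, and synonymRationale keys, and the numeric type of confidence are all assumptions made for illustration. Only sourceQuote, pageHint, confidence, and insights[] come from the description above.

```typescript
// Hypothetical shapes. Only sourceQuote, pageHint, confidence, and insights
// are named in the prompt description; every other name is an assumption.
interface Justification {
  field: string;            // assumed key, e.g. "revenue"
  value: number | null;     // assumed: null marks an explicitly absent metric
  sourceQuote: string;
  synonymRationale: string; // assumed name for the synonym rationale
  pageHint: number | null;
  confidence: number;       // assumed to be a 0..1 score
}

interface TwoPhaseOutput {
  justifications: Justification[];
  insights: string[];
}

// Pull numeric literals out of narrative text, dropping thousands separators.
function numbersIn(text: string): number[] {
  const matches = text.match(/\d[\d,]*(?:\.\d+)?/g) ?? [];
  return matches.map((m) => Number(m.replace(/,/g, "")));
}

// Return every number mentioned in insights[] that has no Phase 1 value.
function unjustifiedNumbers(output: TwoPhaseOutput): number[] {
  const justified = new Set<number>();
  for (const j of output.justifications) {
    if (j.value !== null) justified.add(j.value);
  }
  const offenders: number[] = [];
  for (const insight of output.insights) {
    for (const n of numbersIn(insight)) {
      if (!justified.has(n)) offenders.push(n);
    }
  }
  return offenders;
}
```

An exact-match comparison like this is deliberately strict. A real check would probably need to tolerate rounding and unit suffixes such as "1.2m", which this sketch does not attempt.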

Schema mapping (high level)

Extraction field	Primary storage
companyName	documents.extraction + companies.display_name
period	documents.extraction + statements.period_label / fiscal columns
currency	documents.extraction
revenue, arr, grossMargin, retention, churn, headcount	documents.extraction (validated → validated_extraction)
tables[], justifications[]	documents.extraction
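
The table above can be kept as a lookup in code, which makes it easy to see which fields are written somewhere other than documents.extraction. This is an illustrative sketch only: PRIMARY_STORAGE and extraTargets are hypothetical names, the unnamed "fiscal columns" are left out because the table does not identify them, and the validated_extraction step is noted in a comment rather than modeled.

```typescript
const EXTRACTION = "documents.extraction";

// Transcribed from the schema mapping table. The unnamed "fiscal columns"
// for period are omitted because the source does not name them.
const PRIMARY_STORAGE = new Map<string, readonly string[]>([
  ["companyName", [EXTRACTION, "companies.display_name"]],
  ["period", [EXTRACTION, "statements.period_label"]],
  ["currency", [EXTRACTION]],
  ["tables", [EXTRACTION]],
  ["justifications", [EXTRACTION]],
]);

// Metric fields live in documents.extraction; per the table they move to
// validated_extraction once validated, which this sketch does not model.
for (const metric of ["revenue", "arr", "grossMargin", "retention", "churn", "headcount"]) {
  PRIMARY_STORAGE.set(metric, [EXTRACTION]);
}

// Storage targets for a field other than documents.extraction itself.
function extraTargets(field: string): string[] {
  const targets = PRIMARY_STORAGE.get(field);
  if (targets === undefined) {
    throw new Error(`Unknown extraction field: ${field}`);
  }
  return targets.filter((t) => t !== EXTRACTION);
}
```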

Synonym rules in the prompt normalize Net Revenue / Recognized Revenue → revenue, NRR / NDR / NPR → retention, and GBP symbols → currency GBP. Prompts are assembled with buildExtractionPromptParts() (separate system and user parts, used for Vertex) or with buildExtractionPrompt() (one combined prompt, used for single-message models).
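
In the pipeline these synonym rules are applied by the model through the prompt. A deterministic mirror of the same rules can still be useful, for example to spot-check a model response in tests. The sketch below is one way to write that mirror; METRIC_SYNONYMS, normalizeMetricLabel, and detectCurrency are hypothetical names, and treating "GBP symbols" as the £ character is an assumption.

```typescript
// Synonyms listed in the prompt description, keyed by lower-cased label.
const METRIC_SYNONYMS = new Map<string, string>([
  ["net revenue", "revenue"],
  ["recognized revenue", "revenue"],
  ["nrr", "retention"],
  ["ndr", "retention"],
  ["npr", "retention"],
]);

// Map a board-deck label to its canonical extraction field, or null when
// the label is not one of the listed synonyms.
function normalizeMetricLabel(label: string): string | null {
  const key = label.trim().toLowerCase();
  return METRIC_SYNONYMS.get(key) ?? null;
}

// Assumption: "GBP symbols" means the pound sign. Returns null otherwise.
function detectCurrency(text: string): string | null {
  return text.includes("£") ? "GBP" : null;
}
```

A Map is used rather than a plain object so that labels such as "constructor" cannot accidentally match inherited properties.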