Last updated 2026-05-21
Portfolio PDF parsing uses a fixed system prompt (next/lib/document-workflow/extraction-system-prompt.ts) so that the model documents how each metric maps from board-deck language into the documents.extraction JSONB. This is the same shape that is stored in PostgreSQL and shown in review as Tables & justifications.
Two-phase output
- Phase 1: per-field justifications, each with sourceQuote, synonym rationale, pageHint, and confidence. A justification is required for every populated metric and for every metric reported as explicitly absent.
- Phase 2: insights[], written only after the justifications. The narrative must not introduce any number that is not justified in Phase 1.
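The Phase 2 rule lends itself to a mechanical check on model output. The following is a minimal sketch, not the workflow's own validator: the `Justification` interface, the `field`, `value`, and `synonymRationale` key names, the 0 to 1 confidence range, and both helper functions are assumptions made for illustration. Only sourceQuote, pageHint, confidence, and insights[] are named in this document.

```typescript
// Hypothetical shape of one Phase 1 entry. Key names other than
// sourceQuote, pageHint and confidence are assumptions.
interface Justification {
  field: string;               // e.g. "revenue" (assumed key name)
  value: number | null;        // null marks an explicitly absent metric
  sourceQuote: string | null;  // verbatim text from the deck, if any
  synonymRationale: string;    // why the deck label maps to this field
  pageHint: number | null;
  confidence: number;          // assumed to lie in 0..1
}

// Pull numeric tokens out of free text, ignoring thousands separators.
function extractNumbers(text: string): number[] {
  const matches = text.match(/\d[\d,]*(?:\.\d+)?/g) ?? [];
  return matches.map((m) => Number(m.replace(/,/g, "")));
}

// Return every number that appears in an insight but is backed by
// no Phase 1 value and no Phase 1 source quote.
function findUnjustifiedNumbers(
  justifications: Justification[],
  insights: string[],
): number[] {
  const justified = new Set<number>();
  for (const j of justifications) {
    if (j.value !== null) justified.add(j.value);
    if (j.sourceQuote !== null) {
      for (const n of extractNumbers(j.sourceQuote)) justified.add(n);
    }
  }
  return insights.flatMap(extractNumbers).filter((n) => !justified.has(n));
}
```

Under these assumptions, an insight that repeats a justified revenue figure passes, while one that adds a growth percentage never quoted in Phase 1 is returned as unjustified. The number matching is deliberately naive (it would also pick up the digit in a label such as "Q3"), so a real check would likely need unit and scale handling.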
Schema mapping (high level)
| Extraction field | Primary storage |
|---|---|
| companyName | documents.extraction + companies.display_name |
| period | documents.extraction + statements.period_label / fiscal columns |
| currency | documents.extraction |
| revenue, arr, grossMargin, retention, churn, headcount | documents.extraction (validated → validated_extraction) |
| tables[], justifications[] | documents.extraction |
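The table above can also be read as a lookup from extraction field to storage target. The sketch below mirrors it row by row; the `ExtractionField` type, the `StorageTarget` interface, and the `STORAGE`, `storageFor`, and `requiresValidation` names are hypothetical and are not taken from the codebase. The storage strings are copied from the table as written.

```typescript
// Field names as listed in the table; the type itself is an assumption.
type ExtractionField =
  | "companyName" | "period" | "currency"
  | "revenue" | "arr" | "grossMargin" | "retention" | "churn" | "headcount"
  | "tables" | "justifications";

interface StorageTarget {
  primary: string[];       // where the raw extraction lands
  validatedInto?: string;  // set only for fields that pass through validation
}

const EXTRACTION_ONLY: StorageTarget = { primary: ["documents.extraction"] };
const METRIC: StorageTarget = {
  primary: ["documents.extraction"],
  validatedInto: "validated_extraction",
};

// One entry per table row (metric fields share the same target).
const STORAGE: Record<ExtractionField, StorageTarget> = {
  companyName: { primary: ["documents.extraction", "companies.display_name"] },
  period: {
    primary: ["documents.extraction", "statements.period_label / fiscal columns"],
  },
  currency: EXTRACTION_ONLY,
  revenue: METRIC,
  arr: METRIC,
  grossMargin: METRIC,
  retention: METRIC,
  churn: METRIC,
  headcount: METRIC,
  tables: EXTRACTION_ONLY,
  justifications: EXTRACTION_ONLY,
};

function storageFor(field: ExtractionField): StorageTarget {
  return STORAGE[field];
}

// True for the metric fields that the table marks as validated.
function requiresValidation(field: ExtractionField): boolean {
  return STORAGE[field].validatedInto !== undefined;
}
```

Keeping the mapping in a typed record means that adding a field to `ExtractionField` without a storage entry fails at compile time, which may be useful if the schema grows.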
Synonym rules in the prompt normalize Net Revenue / Recognized Revenue → revenue, NRR / NDR / NPR → retention, and GBP symbols (such as £) → currency GBP. Vertex prompts are built with buildExtractionPromptParts(), which keeps the system and user parts separate; combined prompts for single-message models are built with buildExtractionPrompt().
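In the workflow the synonym rules named above live in the prompt text and are applied by the model. The same mapping can be written as a plain lookup, for example to spot-check model output in tests. This is a minimal sketch under that assumption: `FIELD_SYNONYMS`, `normalizeMetricLabel`, and `detectCurrency` are hypothetical names, and only the synonyms listed in this document are included.

```typescript
// Deck label (lowercased) -> extraction field, limited to the
// synonyms this document lists. Hypothetical helper, not the prompt itself.
const FIELD_SYNONYMS = new Map<string, string>([
  ["net revenue", "revenue"],
  ["recognized revenue", "revenue"],
  ["nrr", "retention"],
  ["ndr", "retention"],
  ["npr", "retention"],
]);

// Normalize a board-deck label to an extraction field name,
// or return null when no listed synonym matches.
function normalizeMetricLabel(label: string): string | null {
  const key = label.trim().toLowerCase().replace(/\s+/g, " ");
  return FIELD_SYNONYMS.get(key) ?? null;
}

// Map a pound sign in the text to the currency code GBP.
// Other currencies are out of scope for this sketch.
function detectCurrency(text: string): string | null {
  return text.includes("£") ? "GBP" : null;
}
```

A label such as "Net Revenue" or "NDR" resolves to its extraction field regardless of case or extra spacing, and an unlisted label returns null rather than a guess.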
