Technical documentation

Extraction system prompt

Justification-first LLM prompt: map PDF labels to documents.extraction JSON and SQL before narrative insights.

Last updated 2026-05-21

Portfolio PDF parsing uses a fixed system prompt (next/lib/document-workflow/extraction-system-prompt.ts) that requires the model to document how each metric maps from board-deck language into documents.extraction JSONB. This is the same shape that is stored in PostgreSQL and shown in review as Tables & justifications.

Two-phase output

  • Phase 1: Per-field justifications with sourceQuote, synonym rationale, pageHint, and confidence. Required for every metric, whether it is populated or explicitly absent.
  • Phase 2: insights[] is written only after the justifications. The narrative must not introduce any number that is not justified in Phase 1.
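
The two-phase contract can be illustrated with a type and a post-hoc check. This is a minimal sketch, not the code in extraction-system-prompt.ts: sourceQuote, pageHint, confidence, and insights are named above, while field, value, synonymRationale, ExtractionOutput, and findUnjustifiedNumbers are hypothetical names chosen for illustration.

```typescript
// Hypothetical shapes: only sourceQuote, pageHint, confidence, and insights
// are named in the documentation; the other names are assumptions.
interface Justification {
  field: string;                  // e.g. "revenue" (assumed key name)
  value: number | string | null;  // null marks an explicitly absent metric (assumption)
  sourceQuote: string;
  synonymRationale: string;       // assumed name for the synonym rationale
  pageHint: number | null;
  confidence: number;             // assumed to be in the range 0..1
}

interface ExtractionOutput {
  justifications: Justification[]; // Phase 1
  insights: string[];              // Phase 2
}

// Pull numeric tokens such as "12.5" or "1,200" out of free text and
// normalize them (thousands separators removed, "12.50" becomes "12.5").
function numbersIn(text: string): string[] {
  const matches: string[] = text.match(/\d[\d,]*(?:\.\d+)?/g) ?? [];
  return matches.map((m) => String(Number(m.replace(/,/g, ""))));
}

// Return every number used in an insight that no Phase 1 justification backs,
// either through its value or through its sourceQuote.
function findUnjustifiedNumbers(output: ExtractionOutput): string[] {
  const justified = new Set<string>();
  for (const j of output.justifications) {
    if (typeof j.value === "number") justified.add(String(j.value));
    for (const n of numbersIn(j.sourceQuote)) justified.add(n);
  }
  return output.insights
    .flatMap((insight) => numbersIn(insight))
    .filter((n) => !justified.has(n));
}
```

A check like this is deliberately crude: it treats any digit run as a number, so labels such as "Q3" would also be flagged. It is meant only to show how the "no new numbers in Phase 2" rule could be verified after the model responds.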

Schema mapping (high level)

Extraction field                                         Primary storage
companyName                                              documents.extraction + companies.display_name
period                                                   documents.extraction + statements.period_label / fiscal columns
currency                                                 documents.extraction
revenue, arr, grossMargin, retention, churn, headcount   documents.extraction (validated → validated_extraction)
tables[], justifications[]                               documents.extraction
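
For readers who prefer code to a table, the same mapping can be written as a lookup. This is only a sketch: the storage targets are copied from the table above, while PRIMARY_STORAGE, METRIC_FIELDS, and storageFor are hypothetical names that are not taken from the codebase.

```typescript
// Metric fields that share one storage rule in the table above.
const METRIC_FIELDS = [
  "revenue",
  "arr",
  "grossMargin",
  "retention",
  "churn",
  "headcount",
] as const;

// Field name -> storage targets, as listed in the table (names hypothetical).
const PRIMARY_STORAGE = new Map<string, string[]>([
  ["companyName", ["documents.extraction", "companies.display_name"]],
  ["period", ["documents.extraction", "statements.period_label / fiscal columns"]],
  ["currency", ["documents.extraction"]],
  ["tables", ["documents.extraction"]],
  ["justifications", ["documents.extraction"]],
]);

// Metrics land in documents.extraction and move to validated_extraction
// once validated, per the table.
for (const metric of METRIC_FIELDS) {
  PRIMARY_STORAGE.set(metric, [
    "documents.extraction",
    "validated_extraction (after validation)",
  ]);
}

// Look up where a field is stored; unknown fields return an empty list.
function storageFor(field: string): string[] {
  return PRIMARY_STORAGE.get(field) ?? [];
}
```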

Synonym rules in the prompt normalize Net Revenue / Recognized Revenue → revenue, NRR / NDR / NPR → retention, and GBP symbols → currency GBP. Vertex prompts use buildExtractionPromptParts() (separate system and user parts); combined prompts use buildExtractionPrompt() for single-message models.
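
The synonym rules are applied by the model through the prompt, but the same mapping can be mirrored in code, for example to spot-check an extraction. The sketch below assumes case-insensitive matching and takes "GBP symbols" to mean the "£" sign. LABEL_SYNONYMS, normalizeLabel, and detectCurrency are hypothetical names and are not part of the prompt module.

```typescript
// Label synonyms listed in the documentation, keyed in lower case.
// Case-insensitive matching is an assumption of this sketch.
const LABEL_SYNONYMS = new Map<string, string>([
  ["net revenue", "revenue"],
  ["recognized revenue", "revenue"],
  ["nrr", "retention"],
  ["ndr", "retention"],
  ["npr", "retention"],
]);

// Map a board-deck label to its extraction field, or null when no
// synonym rule applies.
function normalizeLabel(label: string): string | null {
  return LABEL_SYNONYMS.get(label.trim().toLowerCase()) ?? null;
}

// Detect GBP from a pound sign in the source text (assumed reading of
// "GBP symbols"); returns null when no pound sign is present.
function detectCurrency(text: string): string | null {
  return text.includes("£") ? "GBP" : null;
}
```

For example, normalizeLabel("Net Revenue") yields "revenue", and detectCurrency("£1.2m") yields "GBP".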