Abstract
We extract structured data from unstructured documents at production scale — insurance policies, regulatory filings, pharmacy claim files, real estate comps, contracts, invoices. "95% accuracy" here means field-level accuracy on a labeled holdout set of 4,200 documents spanning 11 document types. This paper details the exact techniques that moved our accuracy from a 71% baseline to 95.3%.
Production Performance
Figure 1: Accuracy Improvement Journey
1. Extract with Citations
The single highest-leverage change. Instead of asking the model to return a JSON object of fields, ask it to return each field alongside the exact verbatim span from the source document that justifies that value. A follow-up deterministic check verifies that the span appears in the source and that the extracted value can be derived from the span.
```json
{
  "contract_value": {
    "value": 250000,
    "citation": "Total Contract Amount: $250,000.00 USD",
    "page": 3,
    "confidence": 0.94
  }
}
```

This one change moved us from 71% to 86% field-level accuracy by catching hallucinated values.
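The deterministic check can be sketched as follows. This is an illustrative implementation, not our production code: it verifies that the cited span appears verbatim in the source and that the extracted value is derivable from the span.

```python
import re

def verify_citation(field: dict, source_text: str) -> bool:
    """Deterministic post-check for citation-backed extraction.

    Illustrative sketch: the span must appear verbatim in the source,
    and the value must be recoverable from that span.
    """
    citation = field["citation"]
    if citation not in source_text:
        return False  # hallucinated or paraphrased citation
    value = field["value"]
    if isinstance(value, (int, float)):
        # Numeric field: the value must match a number present in the span.
        numbers = re.findall(r"[\d,]+(?:\.\d+)?", citation)
        return float(value) in {float(n.replace(",", "")) for n in numbers}
    # String field: require a case-insensitive substring match.
    return str(value).lower() in citation.lower()

field = {"value": 250000, "citation": "Total Contract Amount: $250,000.00 USD"}
source = "... Total Contract Amount: $250,000.00 USD ..."
verify_citation(field, source)  # True
```

Fields that fail this check are the ones most likely to be hallucinated, so they can be re-extracted or routed to review.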
2. Two-Pass Pipeline
Pass one asks the model to identify the page or section where each field of interest lives. Pass two asks it to extract the field given only that localized context. Two-pass is measurably more accurate than one-pass because the model makes fewer attention errors in shorter contexts.
Pass 1: "Find the page/section containing the contract value"
Pass 2: "Given page 3, extract the contract value with citation"

Cost is roughly 1.4x that of a single call; the accuracy gain averaged 4.2 points in our eval.
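A minimal sketch of the two-pass flow, with the model call abstracted behind an `ask` callable (in practice this would be the Anthropic messages API). Both prompt strings are illustrative:

```python
from typing import Callable

def two_pass_extract(pages: list[str], field: str,
                     ask: Callable[[str], str]) -> dict:
    """Two-pass extraction sketch; `ask` stands in for a model call."""
    # Pass 1: locate. The model sees only a short preview of each page.
    previews = "\n".join(f"[page {i + 1}] {p[:200]}"
                         for i, p in enumerate(pages))
    page_num = int(ask(f"Which page contains the {field}? "
                       f"Reply with the page number only.\n{previews}"))
    # Pass 2: extract, given only the localized context.
    answer = ask(f"Given this page, extract the {field} "
                 f"with a verbatim citation.\n{pages[page_num - 1]}")
    return {"field": field, "page": page_num, "raw": answer}
```

The second prompt sees one page instead of the whole document, which is where the shorter-context accuracy gain comes from.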
3. Schema as Prompt
The JSON schema you ask the model to produce is itself a prompt, and the most token-efficient one you have. Fields should have: a clear semantic name, a type, a description including edge cases, a list of acceptable formats for strings, explicit null-vs-missing semantics, and where relevant a small enum of allowed values.
A well-documented schema outperforms several paragraphs of prose instructions. Our extraction accuracy rose 2.8 points when we replaced free-form instructions with schema-as-prompt using Anthropic's tool use for structured output.
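A hypothetical tool-use schema in this style; the field names, descriptions, and enum values are illustrative, but each element carries instruction weight (edge cases in descriptions, explicit null semantics, a format pattern, a closed enum):

```python
# Hypothetical tool definition for contract extraction.
# The descriptions double as the prompt.
CONTRACT_SCHEMA = {
    "name": "record_extraction",
    "description": "Record fields extracted from a contract.",
    "input_schema": {
        "type": "object",
        "properties": {
            "contract_value": {
                "type": ["number", "null"],
                "description": ("Total contract amount in the contract's "
                                "currency. null if the document states no "
                                "amount; never guess."),
            },
            "effective_date": {
                "type": ["string", "null"],
                "description": "Effective date, ISO 8601 (YYYY-MM-DD).",
                "pattern": r"^\d{4}-\d{2}-\d{2}$",
            },
            "contract_type": {
                "type": "string",
                "enum": ["MSA", "SOW", "NDA", "amendment", "other"],
                "description": "Closest matching contract type.",
            },
        },
        "required": ["contract_value", "effective_date", "contract_type"],
    },
}
```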
4. Multi-Sample Agreement
For high-stakes fields (contract value, effective date, counterparty legal name) we run the extraction three times at temperature 0.3 and score the agreement across samples. Fields that agree across all three samples go through without review; fields with disagreement get flagged.
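The agreement gate reduces to a vote over samples. A minimal sketch, assuming the three runs each return a flat field-to-value dict (the string normalization here is illustrative, not our production logic):

```python
from collections import Counter

def agreement_gate(samples: list[dict]) -> dict:
    """Sketch of multi-sample agreement scoring: the same extraction run
    N times (three times at temperature 0.3 in the text). Fields where
    all samples agree pass automatically; any disagreement is flagged.
    """
    decisions = {}
    for field in samples[0]:
        values = [str(s.get(field)).strip().lower() for s in samples]
        _, votes = Counter(values).most_common(1)[0]
        decisions[field] = "auto" if votes == len(samples) else "review"
    return decisions

runs = [
    {"contract_value": 250000, "effective_date": "2024-01-01"},
    {"contract_value": 250000, "effective_date": "2024-01-02"},
    {"contract_value": 250000, "effective_date": "2024-01-01"},
]
agreement_gate(runs)
# {'contract_value': 'auto', 'effective_date': 'review'}
```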
Agreement Scoring Results
5. Production Pipeline
Figure 2: Accuracy by Document Type
Evaluation Methodology
- Dataset: 4,200 documents with field-level gold standards across 11 document types
- Model: Claude 3.5 Sonnet via the Anthropic API with tool use
- Metrics: field-level accuracy, latency, cost per document, malformed output rate
- Labeling: double-blind annotation with adjudication for disagreements
The methods matter less than the eval-driven discipline behind them. Build the eval first. Everything else follows. Contact engineering@avanon.com for implementation details.