Abstract
We extract structured data from unstructured documents at production scale — insurance policies, regulatory filings, pharmacy claim files, real estate comps, contracts, invoices. "95% accuracy" here means field-level accuracy on a labeled holdout set of 4,200 documents spanning 11 document types. This paper details the exact techniques that moved our accuracy from a 71% baseline to 95.3%.
Production Performance
Figure 1: Accuracy Improvement Journey
1. Extract with Citations
The single highest-leverage change. Instead of asking the model to return a JSON object of fields, ask it to return each field alongside the exact verbatim span from the source document that justifies that value. A follow-up deterministic check verifies that the span appears in the source and that the extracted value can be derived from the span.
```json
{
  "contract_value": {
    "value": 250000,
    "citation": "Total Contract Amount: $250,000.00 USD",
    "page": 3,
    "confidence": 0.94
  }
}
```

This one change moved us from 71% to 86% field-level accuracy by catching hallucinated values.
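The deterministic check can be sketched as follows. This is an illustrative implementation, not our production code: it verifies that the cited span appears verbatim in the source and that the extracted value is derivable from the span.

```python
import re

def verify_citation(field: dict, source_text: str) -> bool:
    """Deterministic post-check for citation-backed extraction.

    Illustrative sketch: the span must appear verbatim in the source,
    and the value must be recoverable from that span.
    """
    citation = field["citation"]
    if citation not in source_text:
        return False  # hallucinated or paraphrased citation
    value = field["value"]
    if isinstance(value, (int, float)):
        # Numeric field: the value must match a number present in the span.
        numbers = re.findall(r"[\d,]+(?:\.\d+)?", citation)
        return float(value) in {float(n.replace(",", "")) for n in numbers}
    # String field: require a case-insensitive substring match.
    return str(value).lower() in citation.lower()

field = {"value": 250000, "citation": "Total Contract Amount: $250,000.00 USD"}
source = "... Total Contract Amount: $250,000.00 USD ..."
verify_citation(field, source)  # True
```

Fields that fail this check are the ones most likely to be hallucinated, so they can be re-extracted or routed to review.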
2. Two-Pass Pipeline
Pass one asks the model to identify the page or section where each field of interest lives. Pass two asks it to extract the field given only that localized context. Two-pass is measurably more accurate than one-pass because the model makes fewer attention errors in shorter contexts.
Pass 1: "Find the page/section containing the contract value"
Pass 2: "Given page 3, extract the contract value with citation"

Cost is roughly 1.4x that of a single call; the accuracy gain averaged 4.2 points in our eval.
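A minimal sketch of the two-pass flow, with the model call abstracted behind an `ask` callable (in practice this would be the Anthropic messages API). Both prompt strings are illustrative:

```python
from typing import Callable

def two_pass_extract(pages: list[str], field: str,
                     ask: Callable[[str], str]) -> dict:
    """Two-pass extraction sketch; `ask` stands in for a model call."""
    # Pass 1: locate. The model sees only a short preview of each page.
    previews = "\n".join(f"[page {i + 1}] {p[:200]}"
                         for i, p in enumerate(pages))
    page_num = int(ask(f"Which page contains the {field}? "
                       f"Reply with the page number only.\n{previews}"))
    # Pass 2: extract, given only the localized context.
    answer = ask(f"Given this page, extract the {field} "
                 f"with a verbatim citation.\n{pages[page_num - 1]}")
    return {"field": field, "page": page_num, "raw": answer}
```

The second prompt sees one page instead of the whole document, which is where the shorter-context accuracy gain comes from.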
3. Schema as Prompt
The JSON schema you ask the model to produce is itself a prompt, and the most token-efficient one you have. Fields should have: a clear semantic name, a type, a description including edge cases, a list of acceptable formats for strings, explicit null-vs-missing semantics, and where relevant a small enum of allowed values.
A well-documented schema outperforms several paragraphs of prose instructions. Our extraction accuracy rose 2.8 points when we replaced free-form instructions with schema-as-prompt using Anthropic's tool use for structured output.
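A hypothetical tool-use schema in this style; the field names, descriptions, and enum values are illustrative, but each element carries instruction weight (edge cases in descriptions, explicit null semantics, a format pattern, a closed enum):

```python
# Hypothetical tool definition for contract extraction.
# The descriptions double as the prompt.
CONTRACT_SCHEMA = {
    "name": "record_extraction",
    "description": "Record fields extracted from a contract.",
    "input_schema": {
        "type": "object",
        "properties": {
            "contract_value": {
                "type": ["number", "null"],
                "description": ("Total contract amount in the contract's "
                                "currency. null if the document states no "
                                "amount; never guess."),
            },
            "effective_date": {
                "type": ["string", "null"],
                "description": "Effective date, ISO 8601 (YYYY-MM-DD).",
                "pattern": r"^\d{4}-\d{2}-\d{2}$",
            },
            "contract_type": {
                "type": "string",
                "enum": ["MSA", "SOW", "NDA", "amendment", "other"],
                "description": "Closest matching contract type.",
            },
        },
        "required": ["contract_value", "effective_date", "contract_type"],
    },
}
```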
4. Multi-Sample Agreement
For high-stakes fields (contract value, effective date, counterparty legal name) we run the extraction three times at temperature 0.3 and score the agreement across samples. Fields that agree across all three samples go through without review; fields with disagreement get flagged.
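The agreement gate reduces to a vote over samples. A minimal sketch, assuming the three runs each return a flat field-to-value dict (the string normalization here is illustrative, not our production logic):

```python
from collections import Counter

def agreement_gate(samples: list[dict]) -> dict:
    """Sketch of multi-sample agreement scoring: the same extraction run
    N times (three times at temperature 0.3 in the text). Fields where
    all samples agree pass automatically; any disagreement is flagged.
    """
    decisions = {}
    for field in samples[0]:
        values = [str(s.get(field)).strip().lower() for s in samples]
        _, votes = Counter(values).most_common(1)[0]
        decisions[field] = "auto" if votes == len(samples) else "review"
    return decisions

runs = [
    {"contract_value": 250000, "effective_date": "2024-01-01"},
    {"contract_value": 250000, "effective_date": "2024-01-02"},
    {"contract_value": 250000, "effective_date": "2024-01-01"},
]
agreement_gate(runs)
# {'contract_value': 'auto', 'effective_date': 'review'}
```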
Agreement Scoring Results
5. Production Pipeline
Figure 2: Accuracy by Document Type
Evaluation Methodology
- Dataset: 4,200 documents with field-level gold standards across 11 document types
- Model: Claude 3.5 Sonnet via the Anthropic API with tool use
- Metrics: field-level accuracy, latency, cost per document, malformed output rate
- Labeling: double-blind annotation with adjudication for disagreements
The methods matter less than the eval-driven discipline behind them. Build the eval first. Everything else follows. Contact engineering@avanon.com for implementation details.