Receipt Parsing Strategies

Comparison of approaches for converting unstructured OCR text from receipts into structured data (vendor, date, line items, tax, totals). Ordered from simplest to most sophisticated.

1. Regex / Template-Based

Pattern matching for known receipt layouts.

  • Match “Total: $XX.XX”, “Tax”, “Date” patterns
  • Fast (microseconds), offline, predictable
  • Fragile: breaks on layout changes, handwriting, new vendors
  • Accuracy: 55-65% on general receipts
  • Best: first-pass filter for known vendors (see Expensify Layer 3)
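
The pattern matching described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a single known layout with "Date:", "Tax:" and "Total:" labels; the pattern names, the `parse_receipt` helper and the sample receipt are hypothetical and not taken from any particular product.

```python
import re
from decimal import Decimal

# Hypothetical patterns for ONE known layout; real receipts vary widely.
# Anchoring at line start keeps "Subtotal" from matching the total pattern.
FLAGS = re.IGNORECASE | re.MULTILINE
AMOUNT = r"[ \t]*:?[ \t]*\$?[ \t]*(\d+\.\d{2})[ \t]*$"
TOTAL_RE = re.compile(r"^[ \t]*total" + AMOUNT, FLAGS)
TAX_RE = re.compile(r"^[ \t]*tax" + AMOUNT, FLAGS)
DATE_RE = re.compile(r"\bdate[ \t]*:?[ \t]*(\d{1,2}/\d{1,2}/\d{4})\b", FLAGS)


def _first(pattern, text):
    """Return the first captured group, or None when the pattern is absent."""
    match = pattern.search(text)
    return match.group(1) if match else None


def parse_receipt(text):
    """Extract date, tax and total from OCR text; missing fields are None."""
    tax = _first(TAX_RE, text)
    total = _first(TOTAL_RE, text)
    return {
        "date": _first(DATE_RE, text),
        "tax": Decimal(tax) if tax else None,
        "total": Decimal(total) if total else None,
    }


SAMPLE = """ACME MARKET
Date: 03/14/2024
Subtotal: $18.50
Tax: $1.48
Total: $19.98
"""

print(parse_receipt(SAMPLE))
# A layout that labels the total differently yields no total at all:
print(parse_receipt("AMOUNT DUE 19.98"))
```

The second call shows the fragility noted above: a receipt that prints "AMOUNT DUE" instead of "Total:" returns None for every field until a new pattern is written for it.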

2. NLP-Based

Post-OCR natural language processing pipeline.

  • Tokenization, POS tagging, named entity recognition (NER)
  • Libraries: NLTK, spaCy
  • Limitation: receipt text is semi-structured, not natural prose

3. ML/DL Layout Analysis

Models that understand document layout.

  • LayoutLM, PaddleOCR layout analysis, custom models
  • Learns column relationships and spatial layout
  • Training data and GPU requirements are significant
  • Best: server-side processing pipelines

4. LLM-Based Extraction

Feed OCR text to a language model with a structured prompt.

See well-ai-invoice-extractor for an open-source implementation.

  • How: Structured prompt with JSON schema → LLM returns parsed data
  • Strengths: handles varied formats, no templates, can handle handwriting
  • Weaknesses: network latency, per-request cost, hallucination risk
  • Model-agnostic: OpenAI, Mistral, local models via LM Studio/vLLM
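
A minimal sketch of the structured-prompt approach follows. It assumes the model call is supplied by the caller as a plain function from prompt string to reply string (`complete` is a hypothetical stand-in, not a real client API), and the schema fields are illustrative. The final check is one cheap guard against hallucinated totals, not a complete validation strategy.

```python
import json
from typing import Callable

# Illustrative schema; the field names are assumptions, not a standard.
RECEIPT_SCHEMA = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "date": {"type": "string", "description": "YYYY-MM-DD"},
        "tax": {"type": "number"},
        "total": {"type": "number"},
    },
    "required": ["vendor", "date", "total"],
}


def build_prompt(ocr_text: str) -> str:
    """Combine instructions, the JSON schema and the OCR text into one prompt."""
    return (
        "Extract the receipt fields from the OCR text below.\n"
        "Reply with one JSON object matching this JSON Schema and nothing else.\n\n"
        f"Schema:\n{json.dumps(RECEIPT_SCHEMA, indent=2)}\n\n"
        f"OCR text:\n{ocr_text}\n"
    )


def extract_receipt(ocr_text: str, complete: Callable[[str], str]) -> dict:
    """Call the model via `complete` (hypothetical) and sanity-check the reply."""
    data = json.loads(complete(build_prompt(ocr_text)))  # ValueError if not JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in RECEIPT_SCHEMA["required"] if key not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    if not isinstance(data["total"], (int, float)):
        raise ValueError("total must be a number")
    # Cheap hallucination guard: the total must literally appear in the OCR text.
    if f"{data['total']:.2f}" not in ocr_text:
        raise ValueError("total not found in OCR text")
    return data


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return '{"vendor": "ACME MARKET", "date": "2024-03-14", "total": 19.98}'


OCR_TEXT = "ACME MARKET\nDate: 03/14/2024\nTotal: $19.98\n"
print(extract_receipt(OCR_TEXT, fake_llm))
```

Passing the model call in as a function keeps the sketch model-agnostic: the same `extract_receipt` works whether `complete` wraps a hosted API or a local server.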

5. Hybrid (Expensify Model)

Multiple layers with human verification fallback.

See expensify-receipt-pipeline for the full architecture.

  • OCR → template parsers → AI/ML → human review → bank matching
  • Achieves ~99% accuracy at scale
  • Human verification network is the “secret sauce”
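
The layered fallback can be sketched as a chain of extractors that escalates until one is confident enough. This is an illustration of the general pattern only, not Expensify's actual implementation: the `Result` type, the confidence scores, the 0.9 threshold and both stub layers are assumptions, and the bank-matching step is left out.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Result:
    fields: dict
    confidence: float  # 0.0-1.0; assumes each layer can report a score
    source: str
    needs_review: bool = False


# A layer returns a Result, or None when it cannot handle the receipt.
Layer = Callable[[str], Optional[Result]]


def run_pipeline(ocr_text: str, layers: Sequence[Layer],
                 threshold: float = 0.9) -> Result:
    """Try layers in order; flag the best draft for human review if none is confident."""
    best: Optional[Result] = None
    for layer in layers:
        result = layer(ocr_text)
        if result is None:
            continue
        if result.confidence >= threshold:
            return result
        if best is None or result.confidence > best.confidence:
            best = result
    if best is None:
        best = Result(fields={}, confidence=0.0, source="none")
    return replace(best, needs_review=True)


def template_layer(text: str) -> Optional[Result]:
    """Stub template parser: only recognizes one hypothetical vendor."""
    if "ACME MARKET" not in text:
        return None
    return Result({"vendor": "ACME MARKET"}, 0.99, "template")


def model_layer(text: str) -> Optional[Result]:
    """Stub for an ML or LLM extractor with a made-up confidence score."""
    confidence = 0.95 if "Total" in text else 0.4
    vendor = text.splitlines()[0] if text else ""
    return Result({"vendor": vendor}, confidence, "model")


LAYERS = [template_layer, model_layer]
print(run_pipeline("ACME MARKET\nTotal: 19.98", LAYERS))
print(run_pipeline("CORNER CAFE\nTotal: 4.50", LAYERS))
print(run_pipeline("smudged text", LAYERS))
```

Ordering the layers from cheapest to most expensive means most receipts never reach the costly steps, and only the low-confidence remainder is queued for a person.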

Comparison Table

Strategy            Accuracy  Speed     Cost              Maintenance
Regex/Template      55-65%    <1ms      Free              High (per vendor)
NLP                 60-70%    ~10ms     Free              Medium
ML LayoutLM         80-90%    ~100ms    GPU cost          Medium
LLM extraction      85-95%    1-5s      API cost/request  Low
Hybrid (Expensify)  ~99%      Variable  Human + API       High (operations)