Receipt Parsing Strategies

Comparison of approaches for converting unstructured OCR text from receipts into structured data (vendor, date, line items, tax, totals). Ordered from simplest to most sophisticated.

1. Regex / Template-Based

Pattern matching for known receipt layouts.

Match “Total: $XX.XX”, “Tax”, “Date” patterns
Fast (microseconds), offline, predictable
Fragile: breaks on layout changes, handwriting, new vendors
Accuracy: 55-65% on general receipts
Best: first-pass filter for known vendors (see Expensify Layer 3)
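
A minimal sketch of the template approach, assuming a single known layout with "Total:", "Tax:", and MM/DD/YYYY dates. The patterns, the `parse_receipt` name, and the sample receipt are illustrative assumptions, not taken from any production parser.

```python
import re
from datetime import datetime
from decimal import Decimal

# Hypothetical patterns for one known layout; real receipts vary widely.
TOTAL_RE = re.compile(
    r"^\s*total\s*:?\s*\$?\s*(\d+\.\d{2})\s*$", re.IGNORECASE | re.MULTILINE
)
TAX_RE = re.compile(
    r"^\s*(?:sales\s+)?tax\s*:?\s*\$?\s*(\d+\.\d{2})\s*$", re.IGNORECASE | re.MULTILINE
)
DATE_RE = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")


def parse_receipt(text):
    """Return total, tax, and date; a field is None when its pattern does not match."""
    total = TOTAL_RE.search(text)
    tax = TAX_RE.search(text)
    date = DATE_RE.search(text)
    return {
        "total": Decimal(total.group(1)) if total else None,
        "tax": Decimal(tax.group(1)) if tax else None,
        "date": datetime.strptime(date.group(0), "%m/%d/%Y").date() if date else None,
    }


SAMPLE = """ACME MARKET
Date: 03/14/2024
Milk 3.49
Bread 2.50
Subtotal: $5.99
Tax: $0.48
Total: $6.47
"""

print(parse_receipt(SAMPLE))
# A small wording change defeats the template: no total is found here.
print(parse_receipt("TOTAL DUE 6.47"))
```

The second call shows the fragility noted above: a vendor that prints "TOTAL DUE" instead of "Total:" needs its own pattern.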

2. NLP-Based

Post-OCR natural language processing pipeline.

Tokenization, POS tagging, named entity recognition (NER)
Libraries: NLTK, spaCy
Limitation: receipt text is semi-structured, not natural prose
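
The limitation can be illustrated with a small sketch. It uses a hand-written tokenizer and token labeler in plain Python as a stand-in for the output of a spaCy or NLTK pipeline; the label names and the `tag_line` helper are hypothetical and do not correspond to either library's tag set.

```python
import re

# Numbers (optionally prefixed with "$"), alphabetic words, or any other single symbol.
TOKEN_RE = re.compile(r"\$?\d+(?:\.\d+)?|[A-Za-z]+|\S")


def label(token):
    """Assign a coarse, hypothetical label to one token."""
    if re.fullmatch(r"\$?\d+\.\d{2}", token):
        return "MONEY"
    if re.fullmatch(r"\d+", token):
        return "NUMBER"
    if token.isalpha():
        return "WORD"
    return "SYMBOL"


def tag_line(line):
    """Tokenize one receipt line and label each token independently."""
    return [(tok, label(tok)) for tok in TOKEN_RE.findall(line)]


for token, tag in tag_line("2 x Oat Milk 1L @ 3.49 6.98"):
    print(f"{token:>6}  {tag}")
```

Both 3.49 and 6.98 come out as MONEY, and nothing in the token sequence says which is the unit price and which is the line total. That distinction depends on column position, which sentence-oriented tagging does not model.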

3. ML/DL Layout Analysis

Models that understand document layout.

LayoutLM, PaddleOCR layout analysis, custom models
Learns column relationships and spatial layout
Requires significant training data and GPU resources
Best: server-side processing pipelines
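
To make the spatial signal concrete, here is a sketch of the input a layout model works from: each OCR word paired with a bounding box. LayoutLM-style models expect boxes scaled to a 0-1000 grid; the row grouping below is a simple rule-based illustration of the relationships such a model learns from data, not a substitute for one. The `Word` type, the helper names, and the sample coordinates are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x0: int  # left, in pixels
    y0: int  # top
    x1: int  # right
    y1: int  # bottom


def normalize_box(word, page_width, page_height):
    """Scale a pixel box to the 0-1000 grid used by LayoutLM-style inputs."""
    return [
        1000 * word.x0 // page_width,
        1000 * word.y0 // page_height,
        1000 * word.x1 // page_width,
        1000 * word.y1 // page_height,
    ]


def group_rows(words, tolerance=10):
    """Group words whose vertical centres are close, then order each row left to right."""
    rows = []
    for word in sorted(words, key=lambda w: (w.y0 + w.y1) / 2):
        centre = (word.y0 + word.y1) / 2
        if rows and abs(centre - rows[-1]["centre"]) <= tolerance:
            rows[-1]["words"].append(word)
        else:
            rows.append({"centre": centre, "words": [word]})
    return [[w.text for w in sorted(r["words"], key=lambda w: w.x0)] for r in rows]


WORDS = [
    Word("3.49", 480, 102, 540, 122),
    Word("Bread", 40, 140, 130, 160),
    Word("Milk", 40, 100, 120, 120),
    Word("2.50", 480, 141, 540, 161),
]

print(group_rows(WORDS))
print(normalize_box(WORDS[2], page_width=600, page_height=800))
```

Even with the words supplied out of reading order, position alone is enough to pair each item with its price.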

4. LLM-Based Extraction

Feed OCR text to a language model with a structured prompt.

See well-ai-invoice-extractor for an open-source implementation.

How: Structured prompt with JSON schema → LLM returns parsed data
Strengths: handles varied formats, needs no templates, tolerates noisy OCR output (including text recognized from handwriting)
Weaknesses: network latency, per-request cost, hallucination risk
Model-agnostic: OpenAI, Mistral, local models via LM Studio/vLLM
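
A sketch of the prompt-and-validate flow, kept model-agnostic by passing the model call in as a plain function. The schema, the prompt wording, and the `extract` and `totals_agree` helpers are assumptions for illustration; they are not taken from well-ai-invoice-extractor or from any provider's SDK.

```python
import json

# Hypothetical target schema, shown to the model as part of the prompt.
SCHEMA = {
    "vendor": "string",
    "date": "YYYY-MM-DD",
    "line_items": [{"description": "string", "amount": "number"}],
    "tax": "number",
    "total": "number",
}


def build_prompt(ocr_text):
    """Combine instructions, the schema, and the OCR text into one prompt string."""
    return (
        "Extract the receipt fields from the OCR text below.\n"
        "Respond with JSON only, matching this schema:\n"
        f"{json.dumps(SCHEMA, indent=2)}\n"
        "Use null for any field that is not present.\n\n"
        f"OCR text:\n{ocr_text}"
    )


def totals_agree(data, tolerance=0.005):
    """Cheap hallucination check: line items plus tax should add up to the total."""
    if data["total"] is None or data["line_items"] is None:
        return False
    expected = sum(item["amount"] for item in data["line_items"]) + (data["tax"] or 0)
    return abs(expected - data["total"]) <= tolerance


def extract(ocr_text, complete):
    """`complete` is any callable mapping a prompt string to the model's reply string."""
    reply = complete(build_prompt(ocr_text))
    data = json.loads(reply)  # raises ValueError if the reply is not valid JSON
    missing = set(SCHEMA) - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    data["consistent"] = totals_agree(data)
    return data
```

In use, `complete` would wrap whichever client is available (a hosted API or a local server); a result with `consistent` set to False is a candidate for a retry or for manual review rather than something to trust as-is.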

5. Hybrid (Expensify Model)

Multiple layers with human verification fallback.

See expensify-receipt-pipeline for full architecture.

OCR → template parsers → AI/ML → human review → bank matching
Reported to reach about 99% accuracy at scale
Human verification network is the “secret sauce”
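
The layering can be sketched as a chain of parsers with a confidence threshold and a human-review fallback. This is an illustration of the general pattern only, not Expensify's implementation: the `Result` type, the threshold value, and the two example layers are assumptions, and the bank-matching step is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    fields: dict
    confidence: float  # 0.0 to 1.0, as reported by the layer that produced it
    source: str


# A layer takes OCR text and returns a Result, or None if it does not apply.
Layer = Callable[[str], Optional[Result]]


def run_pipeline(ocr_text, layers, threshold=0.9):
    """Try each automated layer in order; queue for human review if none is confident."""
    best = None
    for layer in layers:
        result = layer(ocr_text)
        if result is None:
            continue
        if result.confidence >= threshold:
            return result
        if best is None or result.confidence > best.confidence:
            best = result
    # No layer was confident enough: hand the best guess to a person as a draft.
    draft = best.fields if best else {}
    return Result(fields=draft, confidence=0.0, source="human_review")


def template_layer(text):
    """Hypothetical template parser that only recognizes one vendor."""
    if "ACME" not in text:
        return None
    return Result({"vendor": "ACME"}, 0.95, "template")


def model_layer(text):
    """Hypothetical ML/LLM layer that guesses the vendor from the first line."""
    return Result({"vendor": text.split("\n")[0].title()}, 0.6, "model")


LAYERS = [template_layer, model_layer]
print(run_pipeline("ACME MARKET\nTotal 6.47", LAYERS).source)
print(run_pipeline("corner shop\nTotal 6.47", LAYERS).source)
```

Cheap, predictable layers run first, so the expensive ones (model inference, then people) only see the receipts the earlier layers could not settle.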

Comparison Table

Strategy	Accuracy (approx.)	Latency per receipt	Cost	Maintenance
Regex/Template	55-65%	<1ms	Free	High (per vendor)
NLP	60-70%	~10ms	Free	Medium
ML/DL layout (LayoutLM)	80-90%	~100ms	GPU cost	Medium
LLM extraction	85-95%	1-5s	API cost/request	Low
Hybrid (Expensify)	~99%	Variable	Human + API	High (operations)

Related Pages

ios-receipt-scanning: Full receipt scanning landscape
well-ai-invoice-extractor: LLM-based implementation
expensify-receipt-pipeline: Production hybrid architecture

type	concept
tags	receipt-scanning, ocr, document-parsing, llm-parsing, comparison
confidence	high
