Receipt Parsing Strategies
Comparison of approaches for converting unstructured OCR text from receipts into structured data (vendor, date, line items, tax, totals). Ordered from simplest to most sophisticated.
1. Regex / Template-Based
Pattern matching for known receipt layouts.
- Match “Total: $XX.XX”, “Tax”, “Date” patterns
- Fast (microseconds), offline, predictable
- Fragile: breaks on layout changes, handwriting, new vendors
- Accuracy: 55-65% on general receipts
- Best for: first-pass filter for known vendors (see Expensify Layer 3)
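
A minimal sketch of the pattern-matching idea above, assuming one simple layout with "Total:" and "Tax:" lines and MM/DD/YYYY dates. The patterns, the `parse_receipt` name, and the sample text are all illustrative assumptions, not taken from any production parser.

```python
import re
from datetime import date
from decimal import Decimal

# Hypothetical patterns for one assumed layout; real receipts vary widely.
# Anchoring at line start keeps "Subtotal:" from matching the total pattern.
TOTAL_RE = re.compile(
    r"^[ \t]*total[ \t]*:?[ \t]*\$?[ \t]*(\d+\.\d{2})[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)
TAX_RE = re.compile(
    r"^[ \t]*(?:sales[ \t]+)?tax[ \t]*:?[ \t]*\$?[ \t]*(\d+\.\d{2})[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)
# Assumes US-style MM/DD/YYYY dates.
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def parse_receipt(text):
    """Return total, tax, and date; a field is None when no pattern matches."""
    result = {"total": None, "tax": None, "date": None}

    total = TOTAL_RE.search(text)
    if total:
        result["total"] = Decimal(total.group(1))

    tax = TAX_RE.search(text)
    if tax:
        result["tax"] = Decimal(tax.group(1))

    found = DATE_RE.search(text)
    if found:
        month, day, year = (int(part) for part in found.groups())
        try:
            result["date"] = date(year, month, day)
        except ValueError:
            pass  # e.g. 13/45/2024 looks like a date but is not one

    return result


SAMPLE = """ACME MARKET
Date: 03/14/2024
Coffee 4.50
Bagel 3.25
Subtotal: $7.75
Tax: $0.62
Total: $8.37
"""

parsed = parse_receipt(SAMPLE)
```

The fragility is easy to reproduce: a receipt that prints "TOTAL DUE 8.37" instead of "Total: $8.37" yields no total at all, so every new wording needs its own pattern.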
2. NLP-Based
Post-OCR natural language processing pipeline.
- Tokenization, POS tagging, named entity recognition (NER)
- Libraries: NLTK, spaCy
- Limitation: receipt text is semi-structured rather than natural prose, so POS taggers and NER models trained on prose transfer poorly
3. ML/DL Layout Analysis
Models that understand document layout.
- LayoutLM, PaddleOCR layout analysis, custom models
- Learns column relationships and spatial layout
- Requires significant training data and GPU resources
- Best for: server-side processing pipelines
4. LLM-Based Extraction
Feed OCR text to a language model with a structured prompt.
See well-ai-invoice-extractor for open-source implementation.
- How: Structured prompt with JSON schema → LLM returns parsed data
- Strengths: handles varied formats, no templates, tolerates noisy OCR output (e.g. from handwriting)
- Weaknesses: network latency, per-request cost, hallucination risk
- Model-agnostic: OpenAI, Mistral, local models via LM Studio/vLLM
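
The prompt-plus-schema flow above can be sketched roughly as follows. This is a sketch under stated assumptions, not the well-ai-invoice-extractor implementation: `RECEIPT_SCHEMA`, `build_prompt`, and `extract_receipt` are hypothetical names, and the model call is passed in as a plain callable (`complete`) so that no particular provider API is assumed.

```python
import json
from typing import Callable

# Hypothetical schema; the field names are illustrative only.
RECEIPT_SCHEMA = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "date": {"type": "string", "description": "ISO 8601 date, YYYY-MM-DD"},
        "total": {"type": "number"},
        "tax": {"type": "number"},
    },
    "required": ["vendor", "date", "total"],
}


def build_prompt(ocr_text: str) -> str:
    """Combine instructions, the JSON schema, and the OCR text into one prompt."""
    return (
        "Extract the receipt fields from the OCR text below.\n"
        "Respond with a single JSON object matching this JSON Schema. "
        "Do not invent values; omit optional fields you cannot find.\n\n"
        f"Schema:\n{json.dumps(RECEIPT_SCHEMA, indent=2)}\n\n"
        f"OCR text:\n{ocr_text}\n"
    )


def extract_receipt(ocr_text: str, complete: Callable[[str], str]) -> dict:
    """Call the model via `complete` (prompt in, text out) and validate the reply."""
    raw = complete(build_prompt(ocr_text))
    data = json.loads(raw)  # raises a ValueError subclass on non-JSON output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    missing = [key for key in RECEIPT_SCHEMA["required"] if key not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")

    # Cheap hallucination guard (an assumption of this sketch): the total
    # must appear verbatim in the OCR text it was supposedly read from.
    try:
        total_text = f"{float(data['total']):.2f}"
    except (TypeError, ValueError) as exc:
        raise ValueError("total is not numeric") from exc
    if total_text not in ocr_text:
        raise ValueError("total not found in OCR text")

    return data
```

Injecting `complete` keeps the sketch model-agnostic: the same function could wrap a hosted API or a local server, and a stub can stand in for it during testing. The verbatim-total check is deliberately crude; it rejects some correct answers (for example, a total printed as "8,37") in exchange for catching invented amounts.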
5. Hybrid (Expensify Model)
Multiple layers with human verification fallback.
See expensify-receipt-pipeline for full architecture.
- OCR → template parsers → AI/ML → human review → bank matching
- Achieves ~99% accuracy at scale
- Human verification network is the “secret sauce”
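
The layered fallback can be sketched as a cascade in which each stage either returns a sufficiently confident result or hands the receipt to the next stage, with a human review queue at the end. This is an illustrative sketch only, not Expensify's actual implementation: the `Extraction` and `Pipeline` types, the 0.9 threshold, and both stub stages are assumptions, and the bank-matching step is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Extraction:
    fields: dict
    confidence: float  # 0.0 to 1.0, as reported by the stage (assumed)
    source: str        # which stage produced the result


# A stage takes OCR text and returns an Extraction, or None if it cannot parse.
Stage = Callable[[str], Optional[Extraction]]


@dataclass
class Pipeline:
    stages: List[Stage]
    threshold: float = 0.9  # assumed cutoff; would be tuned per deployment
    review_queue: List[str] = field(default_factory=list)

    def run(self, ocr_text: str) -> Optional[Extraction]:
        """Return the first confident result, else queue for human review."""
        for stage in self.stages:
            result = stage(ocr_text)
            if result is not None and result.confidence >= self.threshold:
                return result
        self.review_queue.append(ocr_text)
        return None


def template_stage(text: str) -> Optional[Extraction]:
    # Stub template parser: only recognizes one hypothetical vendor.
    if text.startswith("ACME MARKET"):
        return Extraction({"vendor": "ACME MARKET"}, 0.99, "template")
    return None


def ml_stage(text: str) -> Optional[Extraction]:
    # Stub standing in for an ML/LLM extractor that is unsure of its guess.
    lines = text.splitlines()
    vendor = lines[0] if lines else ""
    return Extraction({"vendor": vendor}, 0.6, "ml")


pipeline = Pipeline([template_stage, ml_stage])
known = pipeline.run("ACME MARKET\nTotal: $8.37")
unknown = pipeline.run("Unknown Cafe\nTotal: $3.00")
```

Here the known vendor is resolved by the cheap template stage, while the unknown one falls through both stages and lands in `review_queue`. Ordering stages from cheapest to most expensive means most receipts never reach the costly layers.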
Comparison Table
| Strategy | Accuracy | Speed | Cost | Maintenance |
|---|---|---|---|---|
| Regex/Template | 55-65% | <1ms | Free | High (per vendor) |
| NLP | 60-70% | ~10ms | Free | Medium |
| ML/DL layout (e.g. LayoutLM) | 80-90% | ~100ms | GPU cost | Medium |
| LLM extraction | 85-95% | 1-5s | API cost/request | Low |
| Hybrid (Expensify) | ~99% | Variable | Human + API | High (operations) |
Related Pages
- ios-receipt-scanning — Full receipt scanning landscape
- well-ai-invoice-extractor — LLM-based implementation
- expensify-receipt-pipeline — Production hybrid architecture