Well AI Invoice Extractor
Open-source (MIT license) tool that uses LLMs to parse invoices and receipts into structured JSON. Part of the Well project — an open financial operations platform.
Why It Matters for Project Aries
This is the most directly applicable open-source library for the “extract structured data from OCR text” problem. It demonstrates the LLM-based parsing approach with a production-quality implementation, including schema validation and confidence scoring. Could serve as a starting point or reference architecture for Project Aries’ receipt parsing pipeline.
How It Works
Pipeline
- OCR (optional): Tesseract, PaddleOCR, or any OCR engine produces raw text. Plain
.txtfrom emails/parsers also accepted. - LLM Extraction: Structured prompt with target JSON schema sent to LLM (OpenAI, Mistral, or custom endpoints)
- Validation: Extracted JSON validated against schema
- Confidence Scoring: Per-field confidence for downstream systems
Extracted Fields
Vendor name, invoice date, line items, tax, subtotal, total, payment method, currency, and more.
Model Support
- OpenAI (GPT-4, GPT-3.5)
- Mistral models
- Custom endpoints (LM Studio, vLLM)
- Model-agnostic by design
Key Facts
- License: MIT
- Repository: github.com/WellApp-ai/Well (ai-invoice-extractor directory)
- Author: Maxime Champoux, Co-founder & CEO @Well
- Created: 2025
Philosophy
“Open, hackable, model-agnostic, developer-first.”
Designed as an alternative to proprietary, expensive, or rigid extraction tools. Template-based regex breaks when layouts change — LLMs generalise across formats.
Related Pages
- ios-receipt-scanning — Comprehensive receipt scanning guide
- receipt-parsing-strategies — Comparison of regex, ML, and LLM approaches