Open-Source Invoice & Receipt Extraction with LLMs
Author: Maxime Champoux, Co-founder & CEO @Well
Date: June 1, 2025
The Problem
Traditional invoice/receipt extraction fails because:
- OCR alone produces unstructured text
- Regex/template rules break when layouts change
- Handwritten/multilingual values are missed
- Enterprise APIs are expensive and closed-source
The Solution: AI Invoice Extractor by Well
An open-source (MIT-licensed) tool that feeds OCR output or plain text into an LLM and returns structured data.
How It Works
- OCR (optional): Tesseract, PaddleOCR, or any other OCR engine produces the raw text
- Prompt-Based Extraction: A structured prompt containing the target JSON schema is sent to the LLM
- Schema Validation & Confidence: The extracted JSON is validated against the schema, and each field receives a confidence score
- Model Routing: Requests go to OpenAI, Mistral, or custom endpoints (LM Studio, vLLM)
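The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual code: the schema, the function names (build_prompt, validate, extract), and the idea of asking the model for a self-reported confidence per field are all hypothetical. The model call is passed in as a plain callable, so any backend (OpenAI, Mistral, or a local endpoint) could be plugged in behind it.

```python
import json
from typing import Callable

# Hypothetical target schema (an assumption for this sketch): field -> expected type.
SCHEMA = {"vendor_name": str, "invoice_date": str, "total": float, "currency": str}


def build_prompt(text: str) -> str:
    """Build a prompt that names the target fields and the JSON shape to return."""
    fields = ", ".join(SCHEMA)
    return (
        "Extract these fields from the invoice text below: " + fields + ".\n"
        'Reply with JSON only, shaped as {"<field>": {"value": ..., "confidence": 0.0-1.0}}.\n'
        "Use null for any value you cannot find.\n\n"
        "Invoice text:\n" + text
    )


def validate(raw: str, min_confidence: float = 0.8) -> dict:
    """Type-check each field and flag missing, mistyped, or low-confidence values."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        data = {}
    result = {}
    for name, expected in SCHEMA.items():
        entry = data.get(name)
        if not isinstance(entry, dict):
            entry = {}
        value = entry.get("value")
        conf = entry.get("confidence")
        confidence = float(conf) if isinstance(conf, (int, float)) else 0.0
        # JSON has a single number type, so accept integers where a float is expected.
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        valid = isinstance(value, expected)
        result[name] = {
            "value": value if valid else None,
            "confidence": confidence if valid else 0.0,
            "needs_review": (not valid) or confidence < min_confidence,
        }
    return result


def extract(text: str, call_llm: Callable[[str], str]) -> dict:
    """Run prompt -> model -> validation; call_llm is whichever backend was chosen."""
    return validate(call_llm(build_prompt(text)))
```

For a quick try without network access, call_llm can be a stub that returns a fixed JSON string. Fields that come back with the wrong type or a low score are marked needs_review rather than silently dropped, which keeps a human in the loop for doubtful values.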
Extracted Fields
Vendor name, invoice date, line items, tax/subtotal/total, payment method, currency, and more.
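As an illustration, the fields listed above could be modeled as typed records with a simple arithmetic cross-check. The class names, attribute names, and the inconsistencies helper below are hypothetical and not taken from the project; amounts use Decimal to avoid floating-point rounding on money.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List, Optional


@dataclass
class LineItem:
    description: str
    quantity: Decimal
    unit_price: Decimal

    @property
    def amount(self) -> Decimal:
        # Line amount before tax.
        return self.quantity * self.unit_price


@dataclass
class Invoice:
    vendor_name: str
    invoice_date: str            # ISO 8601 date string, e.g. "2025-06-01"
    currency: str                # ISO 4217 code, e.g. "EUR"
    subtotal: Decimal
    tax: Decimal
    total: Decimal
    payment_method: Optional[str] = None
    line_items: List[LineItem] = field(default_factory=list)

    def inconsistencies(self, tolerance: Decimal = Decimal("0.01")) -> List[str]:
        """Return human-readable problems found in the extracted amounts."""
        problems = []
        if abs(self.subtotal + self.tax - self.total) > tolerance:
            problems.append("subtotal + tax != total")
        if self.line_items:
            items_sum = sum((item.amount for item in self.line_items), Decimal("0"))
            if abs(items_sum - self.subtotal) > tolerance:
                problems.append("line items do not sum to subtotal")
        return problems
```

A check like this catches a common class of extraction errors, such as a misread digit in the total, without needing a second model call.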
Philosophy
Open, hackable, model-agnostic, developer-first.
GitHub: https://github.com/WellApp-ai/Well/tree/main/ai-invoice-extractor