Open-Source Invoice & Receipt Extraction with LLMs

Author: Maxime Champoux, Co-founder & CEO @Well
Date: June 1, 2025

The Problem

Traditional invoice/receipt extraction fails because:

  • OCR alone produces unstructured text
  • Regex/template rules break when layouts change
  • Handwritten/multilingual values are missed
  • Enterprise APIs are expensive and closed-source

The Solution: AI Invoice Extractor by Well

An open-source (MIT-licensed) tool that feeds OCR output or plain text into an LLM and returns structured data.

How It Works

  1. OCR (optional): Tesseract, PaddleOCR, or any other OCR engine extracts the raw text
  2. Prompt-Based Extraction: A structured prompt containing the target JSON schema is sent to the LLM
  3. Schema Validation & Confidence: The extracted JSON is validated against the schema, and each field is given a confidence score
  4. Model Routing: Requests go to OpenAI, Mistral, or a custom endpoint (LM Studio, vLLM)
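
As a rough illustration of steps 2 to 4, the sketch below builds an extraction prompt, checks the model's reply against a small schema, and lists base URLs for OpenAI-compatible endpoints. It is not the project's actual code: SCHEMA, ENDPOINTS, build_prompt, and validate are hypothetical names chosen for this example, the prompt wording is invented, and the local URLs assume LM Studio and vLLM are running on their default ports. The LLM call itself is left out so the example runs offline.

```python
import json

# Hypothetical target schema: field name -> type expected after JSON parsing.
SCHEMA = {
    "vendor_name": str,
    "invoice_date": str,
    "total": (int, float),
    "currency": str,
}

# Base URLs for OpenAI-compatible chat APIs (assumption: local servers
# run on their default ports).
ENDPOINTS = {
    "openai": "https://api.openai.com/v1",
    "mistral": "https://api.mistral.ai/v1",
    "lmstudio": "http://localhost:1234/v1",
    "vllm": "http://localhost:8000/v1",
}


def build_prompt(text: str) -> str:
    """Return a prompt asking for one JSON object with the schema's keys."""
    fields = ", ".join(SCHEMA)
    return (
        "Extract the following fields from the invoice text and reply with "
        f"a single JSON object with exactly these keys: {fields}.\n"
        "Use null for any field that is not present.\n\n"
        f"Invoice text:\n{text}"
    )


def validate(raw: str) -> tuple[dict, list[str]]:
    """Parse the model reply and return (data, validation errors)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {}, [f"invalid JSON: {exc}"]
    if not isinstance(data, dict):
        return {}, ["expected a JSON object"]
    errors = []
    for name, expected in SCHEMA.items():
        value = data.get(name)
        if value is None:
            errors.append(f"missing field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"wrong type for {name}: {type(value).__name__}")
    return data, errors
```

In use, the prompt from build_prompt(ocr_text) would be posted to the chat completions route under one of the ENDPOINTS entries, and the reply passed to validate. An empty error list means the reply matches the schema; a non-empty one can trigger a retry or a manual review.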

Extracted Fields

Vendor name, invoice date, line items, tax/subtotal/total, payment method, currency, and more.
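
One way to picture these fields is as a typed record. The sketch below is an assumption made for illustration, not the tool's real schema: the Invoice and LineItem classes, their field names, and the totals_consistent check are all hypothetical. The arithmetic check shows one simple signal that could feed a per-field confidence score, namely whether the line items, subtotal, tax, and total agree with each other.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List, Optional


@dataclass
class LineItem:
    description: str
    quantity: Decimal
    unit_price: Decimal

    @property
    def amount(self) -> Decimal:
        return self.quantity * self.unit_price


@dataclass
class Invoice:
    vendor_name: str
    invoice_date: str  # ISO 8601 date string, e.g. "2025-06-01"
    currency: str      # ISO 4217 code, e.g. "EUR"
    subtotal: Decimal
    tax: Decimal
    total: Decimal
    payment_method: Optional[str] = None
    line_items: List[LineItem] = field(default_factory=list)

    def totals_consistent(self, tolerance: Decimal = Decimal("0.01")) -> bool:
        """True when line items sum to subtotal and subtotal + tax equals total."""
        if self.line_items:
            items_sum = sum((i.amount for i in self.line_items), Decimal("0"))
            if abs(items_sum - self.subtotal) > tolerance:
                return False
        return abs(self.subtotal + self.tax - self.total) <= tolerance


invoice = Invoice(
    vendor_name="Acme",
    invoice_date="2025-06-01",
    currency="EUR",
    subtotal=Decimal("100.00"),
    tax=Decimal("20.00"),
    total=Decimal("120.00"),
    line_items=[LineItem("Widget", Decimal("2"), Decimal("50.00"))],
)
```

Decimal is used instead of float so that currency amounts compare exactly; an invoice whose totals do not add up is a natural candidate for a lower confidence score or a second extraction pass.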

Philosophy

Open, hackable, model-agnostic, developer-first.

GitHub: https://github.com/WellApp-ai/Well/tree/main/ai-invoice-extractor