Well AI Invoice Extractor

Open-source (MIT license) tool that uses LLMs to parse invoices and receipts into structured JSON. Part of the Well project — an open financial operations platform.

Why It Matters for Project Aries

This is the most directly applicable open-source library for the “extract structured data from OCR text” problem. It demonstrates the LLM-based parsing approach with a production-quality implementation, including schema validation and confidence scoring. Could serve as a starting point or reference architecture for Project Aries’ receipt parsing pipeline.

How It Works

Pipeline

  1. OCR (optional): Tesseract, PaddleOCR, or any OCR engine produces raw text. Plain .txt from emails/parsers also accepted.
  2. LLM Extraction: Structured prompt with target JSON schema sent to LLM (OpenAI, Mistral, or custom endpoints)
  3. Validation: Extracted JSON validated against schema
  4. Confidence Scoring: Per-field confidence for downstream systems

Extracted Fields

Vendor name, invoice date, line items, tax, subtotal, total, payment method, currency, and more.

Model Support

  • OpenAI (GPT-4, GPT-3.5)
  • Mistral models
  • Custom endpoints (LM Studio, vLLM)
  • Model-agnostic by design

Key Facts

  • License: MIT
  • Repository: github.com/WellApp-ai/Well (ai-invoice-extractor directory)
  • Author: Maxime Champoux, Co-founder & CEO @Well
  • Created: 2025

Philosophy

“Open, hackable, model-agnostic, developer-first.”

Designed as an alternative to proprietary, expensive, or rigid extraction tools. Template-based regex breaks when layouts change — LLMs generalise across formats.