Recognizing Text in Images — Apple Vision Framework

Overview

The Vision framework detects and recognizes multilanguage text in images. All processing happens on-device.

Two processing paths:

  • Fast: character-detection + small ML model; similar to traditional OCR
  • Accurate: neural network finds strings/lines, then words/sentences; more human-like

Optional language-correction phase (NLP-based) reduces misreadings.
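A minimal sketch of choosing between the two paths. It assumes the path is selected through the request's recognitionLevel property (.fast or .accurate), which these notes do not name explicitly; the helper name makeTextRequest is hypothetical.

```swift
import Vision

// Hypothetical helper: builds a text-recognition request for one of the two paths.
func makeTextRequest(accurate: Bool) -> VNRecognizeTextRequest {
    let request = VNRecognizeTextRequest()
    // .fast     -> character-detection path, closer to traditional OCR
    // .accurate -> neural-network path that works on strings and lines
    request.recognitionLevel = accurate ? .accurate : .fast
    return request
}
```

The fast path is the likelier fit for live camera input where latency matters; the accurate path suits still images.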

API

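Create a VNRecognizeTextRequest and perform it with a VNImageRequestHandler. The request's results are an array of VNRecognizedTextObservation, one per detected text region. Use topCandidates(1).first?.string to get the best candidate string for each observation.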
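The flow above can be sketched as follows. This is an illustration, not a definitive implementation: the function name recognizeText(in:) is hypothetical, the input is assumed to be a CGImage, and the code assumes a recent SDK (Xcode 13 or later) where the request's results property is already typed as an array of VNRecognizedTextObservation.

```swift
import Vision
import CoreGraphics

// Hypothetical helper: returns the best candidate string for each text observation.
func recognizeText(in image: CGImage) throws -> [String] {
    let request = VNRecognizeTextRequest()

    // The handler performs the request synchronously on the given image.
    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])

    // results is nil if nothing was produced; treat that as "no text found".
    let observations = request.results ?? []

    // topCandidates(1) yields at most one candidate; skip observations with none.
    return observations.compactMap { $0.topCandidates(1).first?.string }
}
```

Because perform(_:) blocks until recognition finishes, callers would normally invoke this helper off the main thread.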

Language Settings

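By default, recognition is biased toward English. Override this with the recognitionLanguages array. Set usesLanguageCorrection = true to enable NLP correction. Use customWords to supply domain-specific jargon; these words take precedence during correction and are ignored when usesLanguageCorrection is false.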
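A short sketch of these settings applied to one request. The helper name and the example language identifiers and words in the usage note are placeholders chosen for illustration, not values from the original notes.

```swift
import Vision

// Hypothetical helper: configures language handling on a text-recognition request.
func makeConfiguredRequest(languages: [String], customWords: [String]) -> VNRecognizeTextRequest {
    let request = VNRecognizeTextRequest()
    // Replace the default English bias with the caller's language list.
    request.recognitionLanguages = languages
    // Enable the NLP-based correction phase.
    request.usesLanguageCorrection = true
    // Domain-specific terms given precedence during correction.
    request.customWords = customWords
    return request
}
```

A call might look like makeConfiguredRequest(languages: ["fr-FR", "en-US"], customWords: ["CoreML"]). Which identifiers are accepted depends on the recognition level and OS version, so it is worth checking the supported languages at runtime before relying on a specific one.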

Bounding Boxes

Each observation provides a normalized bounding rectangle (coordinates from 0 to 1, origin at the lower-left corner of the image). Convert to image coordinates via VNImageRectForNormalizedRect. Fast path: character-based boxes. Accurate path: whitespace-tokenized boxes (Chinese may give line fragments).
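The conversion can be sketched as below. The helper name imageRect(forNormalized:...) and the flipToTopLeft option are hypothetical additions for illustration; the flip is included only because UIKit-style drawing uses a top-left origin while Vision rectangles use a lower-left one.

```swift
import Vision
import CoreGraphics

// Hypothetical helper: maps a normalized Vision rectangle to pixel coordinates.
func imageRect(forNormalized box: CGRect,
               imageWidth: Int,
               imageHeight: Int,
               flipToTopLeft: Bool = false) -> CGRect {
    // Scales the 0...1 rectangle by the image dimensions (origin stays lower-left).
    var rect = VNImageRectForNormalizedRect(box, imageWidth, imageHeight)
    if flipToTopLeft {
        // Mirror vertically so the origin is measured from the top edge.
        rect.origin.y = CGFloat(imageHeight) - rect.maxY
    }
    return rect
}
```

For a given observation, the input would be observation.boundingBox together with the width and height of the source image.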

See Also

  • Structuring recognized text on a document: business card/receipt text structuring
  • Extracting phone numbers from text in images