OCR

OCR (Optical Character Recognition) is an important building block for Open Food Facts. As most users contribute images (convenient from a mobile device), we have to extract information from those images, and OCR is one of the best ways to do this.

Slack channel

#ocr

Current state

On-demand OCR extraction of ingredients uses Google Cloud Vision, as Google kindly provides us with free credits.
Each photo, when uploaded, is sent to the OCR service [Code 1].
You can retrieve the OCR file associated with a photo by replacing the .jpg extension of the original image with .json.
Some old photos might have Tesseract OCR output.
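
The extension swap described above can be sketched as a small helper. This is only an illustration, not code from the Open Food Facts codebase: the function name `ocr_json_url` and the example URL are hypothetical, and it assumes the image is addressed by a URL whose path ends in .jpg.

```python
from urllib.parse import urlsplit, urlunsplit


def ocr_json_url(image_url: str) -> str:
    """Return the URL of the OCR file for a given .jpg image URL.

    Hypothetical helper: it only rewrites the extension, as described
    above, and does not check that the .json file actually exists.
    """
    parts = urlsplit(image_url)
    path = parts.path
    if not path.lower().endswith(".jpg"):
        raise ValueError(f"expected a .jpg image URL, got: {image_url}")
    # Swap the extension on the path only, leaving any query string untouched.
    new_path = path[: -len(".jpg")] + ".json"
    return urlunsplit(parts._replace(path=new_path))


if __name__ == "__main__":
    # Placeholder URL for illustration only, not a real product image.
    print(ocr_json_url("https://example.org/images/products/1.jpg"))
```

The helper works on the URL path rather than the whole string so that a query string, if present, does not prevent the extension from being found.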

Archive

Tesseract

Previously we were using Tesseract.

It used the French dictionary for all languages:

-- /home/off-fr/cgi# grep get_ocr *
Ingredients.pm:use Image::OCR::Tesseract 'get_ocr';
Ingredients.pm: $text =  decode utf8=>get_ocr($image,undef,'fra');

It had a small custom dictionary for French (/usr/share/tesseract-ocr/tessdata/fra.user-words):
- https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_provide_my_own_dictionary

Old Roadmap

OCR/Roadmap

Old OCR results

OCR/Results

[Code 1]