OCR: Difference between revisions
Revision as of 15:45, 21 March 2022
OCR (Optical Character Recognition) is an important building block for Open Food Facts. As most users feed in images (convenient from a mobile device), we have to extract information from those images, and OCR is one of the best ways to do this.
Current state
- On-demand OCR extraction of ingredients uses Google Cloud Vision, as Google kindly provides us with free credits.
- Each photo is sent to the OCR when it is uploaded.[Code 1]
- You can retrieve the OCR file associated with a photo by replacing the .jpg extension of the original image with .json
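The extension swap described in the last item can be sketched in Python as follows. This is a minimal illustration, not the project's own code: the image URL in the usage note is a made-up placeholder, and `extract_text` assumes the stored file is a raw Google Cloud Vision annotate response (a `responses` list whose first entry carries `fullTextAnnotation.text`), which this page does not confirm.

```python
import json
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def ocr_url_for(image_url: str) -> str:
    """Derive the OCR JSON URL by swapping the .jpg extension for .json."""
    parts = urlsplit(image_url)
    if not parts.path.lower().endswith(".jpg"):
        raise ValueError(f"not a .jpg image URL: {image_url}")
    new_path = parts.path[: -len(".jpg")] + ".json"
    # Only the path changes; scheme, host and query string are preserved.
    return urlunsplit(parts._replace(path=new_path))


def extract_text(ocr_payload: dict) -> str:
    """Return the full recognized text, or '' if none is present.

    ASSUMPTION: the payload follows the Cloud Vision annotate layout,
    {"responses": [{"fullTextAnnotation": {"text": "..."}}]}.
    """
    responses = ocr_payload.get("responses") or [{}]
    return responses[0].get("fullTextAnnotation", {}).get("text", "")


def fetch_ocr_text(image_url: str, timeout: float = 10.0) -> str:
    """Download the OCR file for an image and return its text (network call)."""
    with urlopen(ocr_url_for(image_url), timeout=timeout) as response:
        return extract_text(json.load(response))
```

For example, `ocr_url_for("https://images.example.org/products/123/1.jpg")` gives `https://images.example.org/products/123/1.json`. Replacing only the trailing extension in the URL path, rather than every `.jpg` substring, avoids corrupting a URL whose directory or host name happens to contain that string.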
Archive
Tesseract
Previously we were using Tesseract on demand (but not storing output)
- Uses the French dictionary for all languages
    /home/off-fr/cgi# grep get_ocr *
    Ingredients.pm:use Image::OCR::Tesseract 'get_ocr';
    Ingredients.pm: $text = decode utf8=>get_ocr($image,undef,'fra');
- Has a small custom dictionary for French (/usr/share/tesseract-ocr/tessdata/fra.user-words)
Old Roadmap
Old OCR results
OCR/Results
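[Code 1] See process_new_image_off.sh (https://github.com/openfoodfacts/openfoodfacts-server/blob/main/scripts/process_new_image_off.sh) called through incron (https://github.com/openfoodfacts/openfoodfacts-server/blob/main/conf/incron.conf)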