OCR: Difference between revisions

Revision as of 08:22, 28 October 2022

OCR (Optical Character Recognition) is an important building block for Open Food Facts. As most users contribute images (convenient from a mobile device), we have to extract information from those images, and OCR is one of the best ways to do this.

Slack channel

#ocr

Current state

On-demand OCR extraction of ingredients uses Google Cloud Vision, as Google kindly provides us with free credits.
Each photo is sent to the OCR when it is uploaded.^{[Code 1]}
You can retrieve the OCR file associated with a photo by replacing the .jpg extension in the original image name with .json.
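
The extension swap described above can be sketched in a few lines of Python. The function name ocr_json_path is hypothetical and the sample path only mirrors the /50414727/1.json example used on this page. Nothing here is taken from the Open Food Facts code base.

```python
def ocr_json_path(image_path: str) -> str:
    """Return the OCR result path for an image path ending in .jpg.

    Hypothetical helper: it only swaps the extension, as described above.
    """
    suffix = ".jpg"
    if not image_path.endswith(suffix):
        raise ValueError(f"expected a .jpg image path, got {image_path!r}")
    # Drop the ".jpg" suffix and append ".json" in its place.
    return image_path[: -len(suffix)] + ".json"
```

For example, ocr_json_path("/50414727/1.jpg") gives "/50414727/1.json". The same swap applies to a full image URL, since only the end of the string is changed.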

A single JSONL archive containing all OCR results is available here: https://static.openfoodfacts.org/data/ocr-latest.jsonl.gz.

This archive is refreshed manually from time to time.

Beware that this file is huge (56 GB compressed), as it contains all information returned by Google OCR. The following fields are available:

source: the path of the JSON file, such as /50414727/1.json. The product barcode and image ID can be extracted from this field.
content: the OCR response returned by Google Cloud API.
created_at: timestamp of last modification.
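
As a minimal sketch of how the barcode and image ID might be pulled out of the source field, the hypothetical helper below treats the file name (without .json) as the image ID and joins the directory components to form the barcode. Joining the components is an assumption: it covers the single-directory example above and would also cover a barcode split across several directories, but the exact path layout should be checked against real records.

```python
from pathlib import PurePosixPath


def parse_source(source: str) -> tuple[str, str]:
    """Split a `source` value such as "/50414727/1.json" into (barcode, image_id).

    Hypothetical helper; assumes the barcode is the concatenation of the
    directory components and the image ID is the file name without extension.
    """
    path = PurePosixPath(source)
    image_id = path.stem  # "1" for ".../1.json"
    # Skip the leading "/" root component and join the remaining directories.
    barcode = "".join(part for part in path.parent.parts if part != "/")
    return barcode, image_id
```

With the example from the field list, parse_source("/50414727/1.json") returns ("50414727", "1").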

The full dump is used by Robotoff to generate predictions manually for all products, after creating new prediction rules.

If you don't need all Google Cloud Vision analysis results but only the text present on the image, you can download a minified version of the dump with text only: https://static.openfoodfacts.org/data/ocr-text-latest.jsonl.gz.

The fields are almost the same as in the full dump: the content field is replaced by a text field.
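
Because the dumps are gzip-compressed JSONL (one JSON object per line), they can be read as a stream rather than loaded into memory. The sketch below assumes a local copy of the text-only dump and that every record carries the source and text fields described above. The function name iter_text_records is hypothetical.

```python
import gzip
import json


def iter_text_records(path):
    """Yield (source, text) pairs from a gzip-compressed JSONL text dump.

    Hypothetical helper; assumes each non-empty line is a JSON object
    with "source" and "text" keys.
    """
    # "rt" decompresses on the fly and yields decoded text lines.
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():  # skip blank lines, if any
                record = json.loads(line)
                yield record["source"], record["text"]
```

A caller would loop over iter_text_records("ocr-text-latest.jsonl.gz") after downloading the file. The same streaming approach should work for the full dump by reading the content field instead of text, which matters given its size.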

Old Roadmap

OCR/Roadmap

Old OCR results

OCR/Results

References

[Code 1] See process_new_image_off.sh, called through incron.

@@ Line 8: / Line 8: @@
 * you can retrieve the OCR file associated to a photo replacing the <code>.jpg</code> extension in original image by <code>.json</code>
-== Archive ==
-=== Tesseract ===
+A single JSONL archive containing all OCR results is available here: https://static.openfoodfacts.org/data/ocr-latest.jsonl.gz.
-Previously we were using Tesseract on demand (but not storing output)
-* Uses the French dictionary for all languages
+This archive is refreshed from time to time manually.
-<pre>
--- /home/off-fr/cgi# grep get_ocr *
+Beware that this file is huge (56 GB compressed), as it contains all information returned by Google OCR. The following fields are available:
-Ingredients.pm:use Image::OCR::Tesseract 'get_ocr';
-Ingredients.pm: $text =  decode utf8=>get_ocr($image,undef,'fra');
+* <code>source</code>: the path of the JSON file, such as <code>/50414727/1.json</code>. The product barcode and image ID can be extracted from this field.
-</pre>
+* <code>content</code>: the OCR response returned by Google Cloud API.
-* Has a small custom dictionary for French ( /usr/share/tesseract-ocr/tessdata/fra.user-words)
+* <code>created_at</code>: timestamp of last modification.
-**https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_provide_my_own_dictionary
+The full dump is used by [https://openfoodfacts.github.io/robotoff/ Robotoff] to generate predictions manually for all products, after creating new prediction rules.
+If you don't need all Google Cloud Vision analysis results but only the text present on the image, you can download a minified version of the dump with text only: https://static.openfoodfacts.org/data/ocr-text-latest.jsonl.gz.
+The fields are almost the same as for the full dump: <code>content</code> field is replaced by <code>text</code> field.
 === Old Roadmap ===