OCR: Difference between revisions

Latest revision as of 09:28, 22 August 2024

OCR (Optical Character Recognition) is an important building block for Open Food Facts. As most users contribute images (which is convenient from a mobile device), we have to extract information from those images, and OCR is one of the best ways to do this.

Current state

On-demand OCR extraction of ingredients is done with Google Cloud Vision, as Google kindly gives us free credits.
Each photo is sent to the OCR service when it is uploaded.^{[Code 1]}
You can retrieve the OCR file associated with a photo by replacing the .jpg extension of the original image with .json.
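
The extension swap described above can be sketched in Python as follows. This is a minimal sketch, not the project's own code: the function name `ocr_url_for_image` and the example host are hypothetical, and it assumes the OCR result is served at the same location as the image, differing only in the extension.

```python
from urllib.parse import urlsplit, urlunsplit


def ocr_url_for_image(image_url: str) -> str:
    """Return the URL of the OCR JSON file that sits next to an image.

    Assumption: the OCR file has the same path as the image, with the
    trailing .jpg replaced by .json.
    """
    parts = urlsplit(image_url)
    if not parts.path.lower().endswith(".jpg"):
        raise ValueError(f"expected a .jpg image URL, got: {image_url!r}")
    # Swap only the trailing extension; host and query string stay untouched.
    json_path = parts.path[: -len(".jpg")] + ".json"
    return urlunsplit(parts._replace(path=json_path))


if __name__ == "__main__":
    # Hypothetical image URL, used for illustration only.
    print(ocr_url_for_image("https://images.example.org/products/50414727/1.jpg"))
```

Parsing the URL first, rather than calling `str.replace`, avoids rewriting a `.jpg` that happens to appear elsewhere in the URL.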

A single JSONL archive containing all OCR results is available here: https://static.openfoodfacts.org/data/ocr-latest.jsonl.gz.

This archive is refreshed manually from time to time.

Beware that this file is huge (56 GB compressed), as it contains all information returned by Google OCR. The following fields are available:

source: the path of the JSON file, such as /50414727/1.json. The product barcode and image ID can be extracted from this field.
content: the OCR response returned by the Google Cloud Vision API.
created_at: timestamp of last modification.
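
Extracting the barcode and image ID from the source field could look like the sketch below. The helper name `parse_source` is hypothetical. It also makes an assumption the text above does not state: if a barcode is spread over several directory levels, concatenating those levels yields the barcode. For the documented example /50414727/1.json there is only one directory level, so that assumption does not come into play.

```python
from pathlib import PurePosixPath


def parse_source(source: str) -> tuple[str, str]:
    """Split a `source` path such as '/50414727/1.json' into (barcode, image_id)."""
    path = PurePosixPath(source)
    if path.suffix != ".json":
        raise ValueError(f"expected a path ending in .json, got: {source!r}")
    # Directory components of the path, without the leading '/'.
    directories = [part for part in path.parent.parts if part != "/"]
    if not directories:
        raise ValueError(f"no barcode directory in: {source!r}")
    # Assumption: joining all directory levels gives the full barcode.
    barcode = "".join(directories)
    # The file name without its extension is the image ID.
    image_id = path.stem
    return barcode, image_id


if __name__ == "__main__":
    print(parse_source("/50414727/1.json"))
```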

The full dump is used by Robotoff to generate predictions for all products, a step that is run manually after new prediction rules are created.

If you don't need all Google Cloud Vision analysis results but only the text present on the image, you can download a minified version of the dump with text only: https://static.openfoodfacts.org/data/ocr-text-latest.jsonl.gz.

The fields are almost the same as for the full dump: the content field is replaced by a text field.
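
Because the dumps are large, it is usually preferable to stream them instead of loading them into memory. The sketch below reads a locally downloaded copy of the text-only dump one record at a time. The function name `iter_ocr_records` and the local file name are hypothetical, and the code assumes one JSON object per line, as the .jsonl.gz extension suggests.

```python
import gzip
import json
from typing import Iterator


def iter_ocr_records(path: str) -> Iterator[dict]:
    """Yield one decoded JSON object per non-empty line of a gzip-compressed JSONL file."""
    # gzip.open in text mode decompresses lazily, so memory use stays low.
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical local file name; download the dump first.
    for record in iter_ocr_records("ocr-text-latest.jsonl.gz"):
        print(record["source"], len(record["text"]))
        break  # show only the first record
```

The same function should work for the full dump, where each record carries a content field instead of text.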

References

[Code 1] See process_new_image_off.sh called through incron

@@ Line 1: / Line 1: @@
-OCR (aka Optical Character Recognition) is an important building block for Open Food Facts. As most users feeds in images (convenient from mobile device), we have to extract information from those image, and OCR is one of the best way to do this.{{Box
+OCR (aka Optical Character Recognition) is an important building block for Open Food Facts. As most users feeds in images (convenient from mobile device), we have to extract information from those image, and OCR is one of the best way to do this.
- | 1     =  Slack channel
- | 2     =  [https://openfoodfacts.slack.com/messages/ocr/ #ocr]
-}}
 === Current state ===
-* On-demand OCR extraction of ingredients [https://cloud.google.com/vision/overview/docs/ Google Cloud Vision], as google gently give us with free credits.
+* On-demand OCR extraction of ingredients [https://cloud.google.com/vision/overview/docs/ Google Cloud Vision], as Google gently give us with free credits.
 * Each photo, when uploaded is sent to the OCR<ref group="Code">See [https://github.com/openfoodfacts/openfoodfacts-server/blob/main/scripts/process_new_image_off.sh process_new_image_off.sh]  called through [https://github.com/openfoodfacts/openfoodfacts-server/blob/main/conf/incron.conf incron]</ref>
-* you can retrieve the OCR file associated to a photo replacing the <code>.jpg</code> extension in original image by <code>.json</code>
+* you can retrieve the OCR file associated to a photo replacing the <code>.jpg</code> extension in original image by <code>.json</code><br />
-* some old photos might have [[OCR#Tesseract|Tesseract OCR]] output
+A single JSONL archive containing all OCR results is available here: https://static.openfoodfacts.org/data/ocr-latest.jsonl.gz.
-== Archive ==
+This archive is refreshed from time to time manually.
-=== Tesseract ===
+Beware that this file is huge (56 GB compressed), as it contains all information returned by Google OCR. The following fields are available:
-Previously we where using tesseract
-* Uses the French dictionary for all languages
-<pre>
--- /home/off-fr/cgi# grep get_ocr *
-Ingredients.pm:use Image::OCR::Tesseract 'get_ocr';
-Ingredients.pm: $text =  decode utf8=>get_ocr($image,undef,'fra');
-</pre>
-* Has a small custom dictionary for French ( /usr/share/tesseract-ocr/tessdata/fra.user-words)
-**https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_provide_my_own_dictionary
-=== Old Roadmap ===
+* <code>source</code>: the path of the JSON file, such as <code>/50414727/1.json</code>. The product barcode and image ID can be extracted from this field.
-[[OCR/Roadmap]]
+* <code>content</code>: the OCR response returned by Google Cloud API.
-=== Old OCR results ===
+* <code>created_at</code>: timestamp of last modification.
-[[OCR/Results]]
+The full dump is used by [https://openfoodfacts.github.io/robotoff/ Robotoff] to generate predictions manually for all products, after creating new prediction rules.
+If you don't need all Google Cloud Vision analysis results but only the text present on the image, you can download a minified version of the dump with text only: https://static.openfoodfacts.org/data/ocr-text-latest.jsonl.gz.
+The fields are almost the same as for the full dump: <code>content</code> field is replaced by <code>text</code> field.
+== References ==
+<references group="Code" />
 [[Category:OCR]]
 [[Category:Artificial Intelligence]]