OCR: Difference between revisions

Latest revision as of 09:28, 22 August 2024

OCR (Optical Character Recognition) is an important building block for Open Food Facts. Because most users contribute data as images (which is convenient from a mobile device), we have to extract information from those images, and OCR is one of the best ways to do this.

Current state

On-demand OCR extraction of ingredients uses Google Cloud Vision, as Google kindly provides us with free credits.
Each photo, when uploaded, is sent to the OCR [Code 1].
You can retrieve the OCR file associated with a photo by replacing the .jpg extension of the original image with .json.
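
The extension swap described above can be sketched in Python as follows. This is a minimal illustration, not the project's own code: the function name `ocr_url_for_image` and the example host are assumptions made for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def ocr_url_for_image(image_url: str) -> str:
    """Return the OCR JSON URL for an image URL by swapping .jpg for .json.

    Hypothetical helper: the name and the error handling are assumptions.
    """
    parts = urlsplit(image_url)
    if not parts.path.lower().endswith(".jpg"):
        raise ValueError(f"expected a .jpg image URL, got: {image_url}")
    # Replace only the trailing extension; the rest of the path is untouched.
    json_path = parts.path[: -len(".jpg")] + ".json"
    return urlunsplit(parts._replace(path=json_path))
```

For example, `ocr_url_for_image("https://images.example.org/50414727/1.jpg")` yields the same URL ending in `/50414727/1.json` (the host here is a placeholder).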

A single JSONL archive containing all OCR results is available here: https://static.openfoodfacts.org/data/ocr-latest.jsonl.gz.

This archive is refreshed manually from time to time.

Beware that this file is huge (56 GB compressed), as it contains all information returned by Google OCR. The following fields are available:

source: the path of the JSON file, such as /50414727/1.json. The product barcode and image ID can be extracted from this field.
content: the OCR response returned by Google Cloud API.
created_at: timestamp of last modification.
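
As a rough sketch of how the barcode and image ID might be extracted from the source field, the helper below splits the path into its directory part and file name. The function name `parse_source` is hypothetical, and joining several directory components into one barcode is an assumption (it covers paths where a long barcode is split across folders); only the single-folder form /50414727/1.json is shown above.

```python
from pathlib import PurePosixPath


def parse_source(source: str) -> tuple[str, str]:
    """Split a `source` path such as "/50414727/1.json" into (barcode, image_id).

    Hypothetical helper; joining multiple folders into one barcode is an assumption.
    """
    path = PurePosixPath(source)
    if path.suffix != ".json":
        raise ValueError(f"expected a .json path, got: {source}")
    # Directory components of the path, without the leading "/".
    folders = [part for part in path.parent.parts if part != "/"]
    if not folders:
        raise ValueError(f"no barcode folder in: {source}")
    barcode = "".join(folders)
    image_id = path.stem  # file name without the .json extension
    return barcode, image_id
```

With the example from the field description, `parse_source("/50414727/1.json")` returns `("50414727", "1")`.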

The full dump is used by Robotoff to generate predictions manually for all products, after creating new prediction rules.

If you don't need all Google Cloud Vision analysis results but only the text present on the image, you can download a minified version of the dump with text only: https://static.openfoodfacts.org/data/ocr-text-latest.jsonl.gz.

The fields are almost the same as for the full dump: the content field is replaced by a text field.
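
Because the dumps are large, it is usually safer to read them line by line rather than load them into memory. The following is a minimal sketch of that approach, assuming the archive has already been downloaded to a local file; the function name `iter_ocr_records` is hypothetical.

```python
import gzip
import json
from typing import Iterator


def iter_ocr_records(path: str) -> Iterator[dict]:
    """Yield one decoded JSON record per non-empty line of a gzipped JSONL file."""
    # Text mode lets gzip decompress lazily, one line at a time.
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Each yielded record is a dictionary, so for the text-only dump the fields described above would be read as `record["source"]`, `record["text"]` and `record["created_at"]`.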

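References

[Code 1] See process_new_image_off.sh (https://github.com/openfoodfacts/openfoodfacts-server/blob/main/scripts/process_new_image_off.sh), called through incron (https://github.com/openfoodfacts/openfoodfacts-server/blob/main/conf/incron.conf).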
