OCR

From Open Food Facts wiki

OCR (Optical Character Recognition) is an important building block for Open Food Facts. Because most users contribute images (which is convenient from a mobile device), we have to extract information from those images, and OCR is one of the best ways to do this.

Current state

  • On-demand OCR extraction of ingredients uses Google Cloud Vision, as Google kindly provides us with free credits.
  • Each photo is sent to the OCR service when it is uploaded.[Code 1]
  • You can retrieve the OCR file associated with a photo by replacing the .jpg extension of the original image with .json.
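
The extension substitution described in the last point can be sketched in Python as follows. This is a minimal illustration, not code from the Open Food Facts servers: the function name ocr_url_for_image and the example URL are hypothetical, and the sketch only assumes that the original image URL ends in .jpg.

```python
from urllib.parse import urlsplit, urlunsplit


def ocr_url_for_image(image_url: str) -> str:
    """Return the URL of the OCR JSON file for an original .jpg image URL."""
    parts = urlsplit(image_url)
    if not parts.path.lower().endswith(".jpg"):
        raise ValueError(f"expected a .jpg image URL, got: {image_url!r}")
    # Swap only the trailing extension; the rest of the path is unchanged.
    json_path = parts.path[: -len(".jpg")] + ".json"
    return urlunsplit(parts._replace(path=json_path))


# Hypothetical image URL, used only to illustrate the substitution.
print(ocr_url_for_image("https://images.example.org/products/50414727/1.jpg"))
```

Working on the URL path rather than the whole string keeps any query string intact and avoids touching a ".jpg" that might appear elsewhere in the URL.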

A single JSONL archive containing all OCR results is available here: https://static.openfoodfacts.org/data/ocr-latest.jsonl.gz.

This archive is refreshed manually from time to time, so it may lag behind the latest uploads.

Beware that this file is huge (56 GB compressed), as it contains all information returned by Google OCR. The following fields are available:

  • source: the path of the JSON file, such as /50414727/1.json. The product barcode and image ID can be extracted from this field.
  • content: the OCR response returned by the Google Cloud Vision API.
  • created_at: timestamp of last modification.
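
As an illustration of how the barcode and image ID might be recovered from the source field, here is a minimal Python sketch. The names OcrSource and parse_source are hypothetical. The sketch assumes that the last path component is the image ID followed by .json and that everything before it forms the barcode; joining several directory components into a single barcode is an assumption that goes beyond the single-directory example given above.

```python
from typing import NamedTuple


class OcrSource(NamedTuple):
    barcode: str
    image_id: str


def parse_source(source: str) -> OcrSource:
    """Split a `source` path such as '/50414727/1.json' into its parts."""
    parts = source.strip("/").split("/")
    if len(parts) < 2 or not parts[-1].endswith(".json"):
        raise ValueError(f"unexpected source path: {source!r}")
    # The file name without its extension is taken as the image ID.
    image_id = parts[-1][: -len(".json")]
    # Assumption: if the barcode is spread over several directories,
    # concatenating them gives the full barcode.
    barcode = "".join(parts[:-1])
    return OcrSource(barcode=barcode, image_id=image_id)


print(parse_source("/50414727/1.json"))
```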

The full dump is used by Robotoff to generate predictions for all products in a manually triggered run after new prediction rules have been created.

If you don't need all Google Cloud Vision analysis results but only the text present on the image, you can download a minified version of the dump with text only: https://static.openfoodfacts.org/data/ocr-text-latest.jsonl.gz.

The fields are almost the same as for the full dump: the content field is replaced by a text field.
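
Because both dumps are gzip-compressed JSONL files, they can be read one line at a time without decompressing the whole archive to disk. The following Python sketch shows one possible way to scan a locally downloaded copy of the text-only dump for a keyword. The function names are hypothetical, and the sketch assumes that each line is a single JSON object and that the text field holds a plain string.

```python
import gzip
import json
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one decoded JSON object per non-empty line of a .jsonl.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def find_sources_with_text(path: str, keyword: str) -> Iterator[str]:
    """Yield the `source` of every record whose text contains the keyword."""
    needle = keyword.lower()
    for record in iter_records(path):
        # Assumption: `text` is a string; missing or null values are skipped.
        text = record.get("text") or ""
        if needle in text.lower():
            yield record["source"]
```

For example, calling find_sources_with_text("ocr-text-latest.jsonl.gz", "palm oil") on a downloaded copy would lazily yield the matching source paths, so memory use stays small even though the archive is large.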

References

  1. See process_new_image_off.sh, which is called through incron.