Raphael0202 (Describe OCR dumps)
* You can retrieve the OCR file associated with a photo by replacing the <code>.jpg</code> extension of the original image with <code>.json</code>.
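
The extension swap described above can be sketched in Python as follows. This is a minimal illustration: the function name <code>ocr_url_for_image</code> and the example URL are hypothetical and not part of any Open Food Facts API.

```python
def ocr_url_for_image(image_url: str) -> str:
    """Return the OCR JSON URL for an original image URL ending in .jpg."""
    if not image_url.endswith(".jpg"):
        raise ValueError(f"expected a .jpg image URL, got: {image_url}")
    # Drop the ".jpg" suffix and append ".json" instead.
    return image_url[: -len(".jpg")] + ".json"


# Hypothetical image URL, used for illustration only.
print(ocr_url_for_image("https://example.org/images/products/50414727/1.jpg"))
```
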
A single JSONL archive containing all OCR results is available here: https://static.openfoodfacts.org/data/ocr-latest.jsonl.gz.

This archive is refreshed manually from time to time.

Beware that this file is huge (56 GB compressed), as it contains all the information returned by Google OCR. The following fields are available:
* <code>source</code>: the path of the JSON file, such as <code>/50414727/1.json</code>. The product barcode and image ID can be extracted from this field.
* <code>content</code>: the OCR response returned by the Google Cloud API.
* <code>created_at</code>: timestamp of the last modification.
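
Because of its size, the archive is best read as a stream, one line at a time, rather than loaded into memory. The sketch below shows one way to do this in Python and to split <code>source</code> into a barcode and an image ID. The helper names <code>parse_source</code> and <code>iter_ocr_dump</code> are hypothetical. It also assumes that the last path component is the image ID and that all preceding components, joined together, form the barcode; only the single-directory case shown above is confirmed by this page.

```python
import gzip
import json
from pathlib import PurePosixPath
from typing import Iterator, Tuple


def parse_source(source: str) -> Tuple[str, str]:
    """Split a `source` path such as "/50414727/1.json" into (barcode, image_id).

    Assumption: the file name without extension is the image ID, and the
    directory components joined together form the barcode.
    """
    path = PurePosixPath(source)
    image_id = path.stem
    barcode = "".join(part for part in path.parent.parts if part != "/")
    return barcode, image_id


def iter_ocr_dump(dump_path: str) -> Iterator[dict]:
    """Yield one decoded JSON record per non-empty line of a gzipped JSONL file."""
    with gzip.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

With a locally downloaded copy of the archive, a loop such as <code>for record in iter_ocr_dump("ocr-latest.jsonl.gz"): barcode, image_id = parse_source(record["source"])</code> processes records one by one without decompressing the whole file to disk.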
The full dump is used by [https://openfoodfacts.github.io/robotoff/ Robotoff] to manually generate predictions for all products after new prediction rules are created.

If you don't need all the Google Cloud Vision analysis results but only the text present on the image, you can download a minified, text-only version of the dump: https://static.openfoodfacts.org/data/ocr-text-latest.jsonl.gz.

The fields are almost the same as in the full dump: the <code>content</code> field is replaced by a <code>text</code> field.
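
As an illustration of working with the text-only dump, the following sketch scans a locally downloaded copy and yields the <code>source</code> of every record whose <code>text</code> contains a given string. The function name <code>find_text_matches</code> is hypothetical, and the code assumes that <code>text</code> is a plain string that may be missing or null for some records.

```python
import gzip
import json
from typing import Iterator


def find_text_matches(dump_path: str, needle: str) -> Iterator[str]:
    """Yield the `source` of each record whose `text` contains `needle`.

    The comparison is case-insensitive. Records with a missing or null
    `text` field are skipped (assumption about the dump contents).
    """
    needle = needle.lower()
    with gzip.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            text = record.get("text") or ""
            if needle in text.lower():
                yield record["source"]
```
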
=== Old Roadmap ===