OCR: Difference between revisions

Revision as of 08:22, 28 October 2022

632 bytes added, 28 October 2022

Describe OCR dumps

Raphael0202


@@ Line 8: / Line 8: @@
 * You can retrieve the OCR file associated with a photo by replacing the <code>.jpg</code> extension of the original image with <code>.json</code>.
-== Archive ==
-=== Tesseract ===
+A single JSONL archive containing all OCR results is available here: https://static.openfoodfacts.org/data/ocr-latest.jsonl.gz.
-Previously we were using Tesseract on demand (but not storing output)
-* Uses the French dictionary for all languages
+This archive is refreshed manually from time to time.
-<pre>
--- /home/off-fr/cgi# grep get_ocr *
+Beware that this file is huge (56 GB compressed), as it contains all information returned by Google OCR. The following fields are available:
-Ingredients.pm:use Image::OCR::Tesseract 'get_ocr';
-Ingredients.pm: $text =  decode utf8=>get_ocr($image,undef,'fra');
+* <code>source</code>: the path of the JSON file, such as <code>/50414727/1.json</code>. The product barcode and image ID can be extracted from this field.
-</pre>
+* <code>content</code>: the OCR response returned by Google Cloud API.
-* Has a small custom dictionary for French ( /usr/share/tesseract-ocr/tessdata/fra.user-words)
+* <code>created_at</code>: timestamp of last modification.
-**https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_provide_my_own_dictionary
+The full dump is used by [https://openfoodfacts.github.io/robotoff/ Robotoff] to generate predictions manually for all products, after creating new prediction rules.
+If you don't need all Google Cloud Vision analysis results but only the text present on the image, you can download a minified version of the dump with text only: https://static.openfoodfacts.org/data/ocr-text-latest.jsonl.gz.
+The fields are the same as in the full dump, except that the <code>content</code> field is replaced by a <code>text</code> field.
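As an illustration of the dump format described above, here is a minimal Python sketch that streams a locally downloaded copy of the text-only dump line by line (so the whole file never has to fit in memory) and extracts the barcode and image ID from the <code>source</code> field. The function names <code>parse_source</code> and <code>iter_ocr_text</code> are hypothetical, and the sketch assumes that the last path component of <code>source</code> is the image ID and that the preceding components, joined together, form the barcode; the page only documents the single-directory form <code>/50414727/1.json</code>.

```python
import gzip
import json
from typing import Iterator, Tuple


def parse_source(source: str) -> Tuple[str, str]:
    """Split a `source` path such as '/50414727/1.json' into (barcode, image_id).

    Assumption: the last component is '<image_id>.json' and all preceding
    components, concatenated, form the barcode.
    """
    parts = source.strip("/").split("/")
    image_id = parts[-1].rsplit(".", 1)[0]  # drop the '.json' extension
    barcode = "".join(parts[:-1])
    return barcode, image_id


def iter_ocr_text(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (barcode, image_id, text) for each record of the text-only dump.

    `path` points to a local copy of the gzip-compressed JSONL file.
    The file is read one line at a time.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines, if any
            item = json.loads(line)
            barcode, image_id = parse_source(item["source"])
            yield barcode, image_id, item["text"]
```

For example, <code>for barcode, image_id, text in iter_ocr_text("ocr-text-latest.jsonl.gz"): ...</code> would process one record at a time. The same loop should work for the full dump if <code>item["text"]</code> is replaced by <code>item["content"]</code>, at the cost of decompressing a much larger file.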
 === Old Roadmap ===