
OCR: Difference between revisions

632 bytes added, 28 October 2022
Describe OCR dumps
No edit summary
(Describe OCR dumps)
Line 8:
* you can retrieve the OCR file associated with a photo by replacing the <code>.jpg</code> extension of the original image with <code>.json</code>
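
The extension swap described above can be sketched in a few lines of Python. This is a minimal illustration, not an official client: the helper name <code>ocr_url_for_image</code> is hypothetical, and it assumes the image location ends in a lowercase <code>.jpg</code>.

```python
def ocr_url_for_image(image_url: str) -> str:
    """Return the OCR JSON location for an original image ending in .jpg.

    Hypothetical helper: it only swaps the extension, as described above,
    and does not check that the OCR file actually exists.
    """
    suffix = ".jpg"
    if not image_url.endswith(suffix):
        raise ValueError(f"expected a path ending in .jpg, got: {image_url}")
    # Drop the trailing ".jpg" and append ".json" instead.
    return image_url[: -len(suffix)] + ".json"
```

Only the final extension is replaced, so a <code>.jpg</code> substring appearing earlier in the path is left untouched.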


=== Tesseract ===
Previously we were using Tesseract on demand (but not storing the output).
* Uses the French dictionary for all languages
<pre>
-- /home/off-fr/cgi# grep get_ocr *
Ingredients.pm:use Image::OCR::Tesseract 'get_ocr';
Ingredients.pm: $text =  decode utf8=>get_ocr($image,undef,'fra');
</pre>
* Has a small custom dictionary for French (/usr/share/tesseract-ocr/tessdata/fra.user-words)
** https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_provide_my_own_dictionary

== Archive ==

A single JSONL archive containing all OCR results is available here: https://static.openfoodfacts.org/data/ocr-latest.jsonl.gz.

This archive is refreshed manually from time to time.

Beware that this file is huge (56 GB compressed), as it contains all information returned by Google OCR. The following fields are available:

* <code>source</code>: the path of the JSON file, such as <code>/50414727/1.json</code>. The product barcode and image ID can be extracted from this field.
* <code>content</code>: the OCR response returned by the Google Cloud Vision API.
* <code>created_at</code>: timestamp of last modification.

The full dump is used by [https://openfoodfacts.github.io/robotoff/ Robotoff] to generate predictions manually for all products, after creating new prediction rules.

If you don't need all Google Cloud Vision analysis results but only the text present on the image, you can download a minified version of the dump with text only: https://static.openfoodfacts.org/data/ocr-text-latest.jsonl.gz.

The fields are almost the same as for the full dump: the <code>content</code> field is replaced by a <code>text</code> field.
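
As an illustration of how such a dump might be consumed, here is a minimal Python sketch that streams a locally downloaded archive line by line (so the file never has to fit in memory) and extracts the barcode and image ID from <code>source</code>. The function names and the local file name are hypothetical. The sketch also assumes that when the barcode is split across several directory levels, joining the path components rebuilds it; only the single-directory form <code>/50414727/1.json</code> is documented above.

```python
import gzip
import json
from typing import Iterator, Tuple


def parse_source(source: str) -> Tuple[str, str]:
    """Split a source path such as '/50414727/1.json' into (barcode, image_id)."""
    if source.endswith(".json"):
        source = source[: -len(".json")]
    # Everything before the last component is treated as the barcode.
    # Assumption: a barcode split over several directories is rebuilt by joining them.
    *barcode_parts, image_id = source.strip("/").split("/")
    return "".join(barcode_parts), image_id


def iter_ocr_records(path: str) -> Iterator[dict]:
    """Yield one decoded JSON object per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Example usage, assuming the text-only dump was saved locally under this name:
#
# for record in iter_ocr_records("ocr-text-latest.jsonl.gz"):
#     barcode, image_id = parse_source(record["source"])
#     print(barcode, image_id, record["text"][:80])
```

The same loop works for the full dump by reading <code>content</code> instead of <code>text</code>, at the cost of decoding much larger records.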


=== Old Roadmap ===