OCR/Roadmap: Difference between revisions

Revision as of 17:22, 2 August 2015

Currently, all products are edited manually. This project aims to detect a number of things automatically or semi-automatically, using OCR and computer vision.

Current state

OCR extraction of Ingredients using Tesseract 2 (production) and 3 (.net)
Uses the French dictionary for all languages

-- /home/off-fr/cgi# grep get_ocr *
Ingredients.pm:use Image::OCR::Tesseract 'get_ocr';
Ingredients.pm: $text =  decode utf8=>get_ocr($image,undef,'fra');

Has a small custom dictionary for French ( /usr/share/tesseract-ocr/tessdata/fra.user-words)
- https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_provide_my_own_dictionary
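
The hard-coded 'fra' in the get_ocr call is what sends every language through the French dictionary. A minimal sketch of picking the Tesseract language per product language, written in Python rather than the Perl of Ingredients.pm. The mapping table, the function name and the fallback to 'fra' are assumptions made for illustration; the three-letter values are the names of Tesseract's standard language data files.

```python
# Hypothetical lookup table: two-letter site language -> Tesseract language name.
TESSERACT_LANG = {
    "fr": "fra",
    "en": "eng",
    "de": "deu",
    "es": "spa",
    "it": "ita",
    "nl": "nld",
}


def tesseract_lang(site_lang: str, default: str = "fra") -> str:
    """Return the Tesseract language for a site language code.

    Only the first two letters are used, so "en-GB" is treated as "en".
    Unknown languages fall back to `default` (the current behaviour, 'fra').
    """
    return TESSERACT_LANG.get(site_lang.lower()[:2], default)
```

The returned value would take the place of the literal 'fra' passed as the language argument.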

Short term goals

Use the right standard dict for each language
Integrate custom lists from Global Ingredients Taxonomy
- Create a golden set
  - e.g. someproduct.jpg -> ingredients image
  - someproduct.golden -> ingredients text
  - then we create a script that runs the OCR on the images, compares the output with the golden text, and reports some accuracy measures
  - Draft Script: https://lite6.framapad.org/p/OFF_OCR_Script
Integrate custom lists from the live instances, language by language.
- http://de.openfoodfacts.org/zutaten
- http://uk.openfoodfacts.org/ingredients + http://us.openfoodfacts.org/ingredients
- http://fr.openfoodfacts.org/ingredients
- USDA UNII list of ingredients (will also work for Open Beauty Facts)
Process all images and make products searchable, even if not filled yet
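
The golden set check described above could be sketched as follows. This is not the draft script linked above, only an illustration under assumptions: the accuracy measure chosen here is a character error rate (Levenshtein distance divided by the length of the golden text), whitespace is normalized before comparing, and `run_ocr` is a hypothetical callable standing in for whatever OCR engine is being tested.

```python
from pathlib import Path
from typing import Callable, Dict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = cur
    return prev[-1]


def char_error_rate(ocr_text: str, golden_text: str) -> float:
    """Edit distance over golden length; 0.0 means a perfect match."""
    ocr = " ".join(ocr_text.split())        # collapse runs of whitespace
    golden = " ".join(golden_text.split())
    if not golden:
        return 0.0 if not ocr else 1.0
    return edit_distance(ocr, golden) / len(golden)


def evaluate(golden_dir: Path, run_ocr: Callable[[Path], str]) -> Dict[str, float]:
    """Score every someproduct.jpg that has a matching someproduct.golden."""
    scores = {}
    for image in sorted(golden_dir.glob("*.jpg")):
        golden = image.with_suffix(".golden")
        if golden.exists():
            text = golden.read_text(encoding="utf-8")
            scores[image.name] = char_error_rate(run_ocr(image), text)
    return scores
```

Running `evaluate` once with the French dictionary and once with the per-language dictionary would give two sets of scores to compare on the same images.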

Long-term goals

Get dictionary translations from Wikidata
Investigate Ocropus for complex layout extractions
Investigate OpenCV for detection of patterns, logos…

Targets

Logos of brands (getting them from POD?)
Logos of Labels (standardized)
Text (distorted, as on a bottle; diagonal; in low light or bright light)
Standardized layouts (US Nutrition labels)
Standardized text (quantities, EU Packaging codes)
Barcodes (extraction in uploaded images)
- Store in separate image for further reference
Image orientation: use the orientation of the text to guess whether the image is properly oriented.
Deep Learning
- Product photo on packaging - guess category based on product picture
- Container: guess whether it's a bottle, cardboard…

Extracting areas is already a big help: if we can extract logos or patterns, it will be faster for humans to double-check them and turn them into text.
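
For the image orientation target, one possible heuristic (an assumption, not something this page specifies) is to OCR the image at each quarter turn and keep the rotation whose output contains the highest share of dictionary words. In this sketch `ocr_at` is a hypothetical callable that returns the OCR text for the image rotated by a given angle, and `dictionary` is any set of lowercase known words, for instance an ingredients list.

```python
import re
from typing import Callable, Sequence, Set


def known_word_ratio(text: str, dictionary: Set[str]) -> float:
    """Fraction of alphabetic words in `text` that appear in `dictionary`."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    if not words:
        return 0.0
    return sum(word in dictionary for word in words) / len(words)


def guess_rotation(
    ocr_at: Callable[[int], str],
    dictionary: Set[str],
    angles: Sequence[int] = (0, 90, 180, 270),
) -> int:
    """Return the angle whose OCR output looks most like real text.

    On a tie the earliest angle in `angles` wins, so an image that is
    unreadable in every direction is left at 0.
    """
    return max(angles, key=lambda angle: known_word_ratio(ocr_at(angle), dictionary))
```

This costs one OCR pass per angle, so it is only a baseline to compare other approaches against.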

@@ Line 1: / Line 1: @@
 Currently, all products are edited manually. This project is about automatic or semi-automatic detection of a number of things using OCR and Computer vision.
-Tools:
+== Current state ==
-* Google Drive OCR or Google Goggles
+* OCR extraction of Ingredients using Tesseract 2 (production) and 3 (.net)
-* Ocropus
+* Uses the French dictionary for all languages
-* OpenCV
+<pre>
-* Moodstocks
+-- /home/off-fr/cgi# grep get_ocr *
+Ingredients.pm:use Image::OCR::Tesseract 'get_ocr';
+Ingredients.pm: $text =  decode utf8=>get_ocr($image,undef,'fra');
+</pre>
+* Has a small custom dictionary for French ( /usr/share/tesseract-ocr/tessdata/fra.user-words)
+**https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_provide_my_own_dictionary
+== Short term goals ==
+* Use the right standard dict for each language
+* Integrate custom lists from Global Ingredients Taxonomy
+**Create a golden set
+*** e.g.  someproduct.jpg -> ingredients image
+*** someproduct.golden -> ingredients text
+*** then we create a script that runs the OCR through the images, compare with the golden text, and report some accuracy measures
+*** Draft Script: https://lite6.framapad.org/p/OFF_OCR_Script
+* Integrate custom lists from the live instances; language per language.
+** http://de.openfoodfacts.org/zutaten
+** http://uk.openfoodfacts.org/ingredients + http://us.openfoodfacts.org/ingredients
+**http:// fr.openfoodfacts.org/ingredients
+** USDA UNII list of ingredients (will also work for Open Beauty Facts)
+* Process all images and make products searchable, even if not filled yet
+==  Long-term goals ==
+ * Get dictionaries translations from Wikidata
+ * Investigate Ocropus for complex layout extractions
+ * Investigate Open CV for detection of patterns, logos…
-Targets:
+== Targets ==
-* Logos (standardized)
+* Logos of brands (Getting them from POD ?)
-* Text
+* Logos of Labels (standardized)
+* Text (distorted - bottle case, diagonally - with low light, bright light)
 * Standardized layouts (US Nutrition labels)
 * Standardized text (quantities, EU Packaging codes)
 * Barcodes (extraction in uploaded images)
+** Store in separate image for further reference
 * Image orientation: check that the text is properly oriented to guess if the image is properly oriented.
+* Deep Learning
+** Product photo on packaging - guess category based on product picture
+** Container: guess whether it's a bottle, cardboard…
 Extracting areas is already great work: if we can extract logos or patterns, it will be faster for humans to double check and turn that into text.
 [[Category:Project]]
+[[Category:Product Opener]]