OCR/Roadmap: Difference between revisions

From Open Food Facts wiki
Currently, all products are edited manually. This project is about automatic or semi-automatic detection of product data (ingredients, logos, barcodes and more) using OCR and computer vision.


Tools:
* Google Drive OCR or Google Goggles
* Ocropus
* OpenCV
* Moodstocks
== Current state ==
* OCR extraction of Ingredients using Tesseract 2 (production) and 3 (.net)
* Uses the French dictionary for all languages
<pre>
-- /home/off-fr/cgi# grep get_ocr *
Ingredients.pm:use Image::OCR::Tesseract 'get_ocr';
Ingredients.pm: $text =  decode utf8=>get_ocr($image,undef,'fra');
</pre>
* Has a small custom dictionary for French (/usr/share/tesseract-ocr/tessdata/fra.user-words)
** https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_provide_my_own_dictionary
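Since get_ocr is currently always called with 'fra', a first step towards per-language dictionaries could be a lookup from the site language to a Tesseract language name. The sketch below is written in Python for illustration only (production uses Perl). The helper name tesseract_lang and the fallback to 'fra' are assumptions; fra, deu, eng, spa, ita and nld are the names of Tesseract's standard language data files.

```python
# Hypothetical helper: map a two-letter site language code to the
# Tesseract language name that would be passed to the OCR call.
TESSERACT_LANGS = {
    "fr": "fra",
    "de": "deu",
    "en": "eng",
    "es": "spa",
    "it": "ita",
    "nl": "nld",
}


def tesseract_lang(site_lang, default="fra"):
    """Return the Tesseract language name for a site language code.

    Unknown codes fall back to `default`; 'fra' mirrors the current
    behaviour of using the French dictionary everywhere.
    """
    return TESSERACT_LANGS.get(site_lang.lower()[:2], default)
```

The returned value would take the place of the hard-coded 'fra' argument; languages without an entry keep today's behaviour.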
== Short term goals ==
* Use the right standard dict for each language
* Integrate custom lists from Global Ingredients Taxonomy
** Create a golden set
*** e.g. someproduct.jpg -> image of the ingredients list
*** someproduct.golden -> expected ingredients text
*** then create a script that runs the OCR on the images, compares the output with the golden text, and reports accuracy measures
*** Draft script: https://lite6.framapad.org/p/OFF_OCR_Script
* Integrate custom lists from the live instances, language by language.
** http://de.openfoodfacts.org/zutaten
** http://uk.openfoodfacts.org/ingredients + http://us.openfoodfacts.org/ingredients
** http://fr.openfoodfacts.org/ingredients
** USDA UNII list of ingredients (will also work for Open Beauty Facts)
* Process all images and make products searchable, even if not filled yet
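The golden set check described above could be sketched as follows. This is not the draft script linked in the list, only a minimal illustration in Python: it assumes each someproduct.jpg sits next to its someproduct.golden file, uses character-level edit distance as the accuracy measure (one possible choice among several), and takes the OCR engine as a caller-supplied function run_ocr, which is a hypothetical name.

```python
from pathlib import Path


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text, golden_text):
    """1 - (edit distance / golden length), clamped to the range [0, 1]."""
    if not golden_text:
        return 1.0 if not ocr_text else 0.0
    errors = levenshtein(ocr_text, golden_text)
    return max(0.0, 1.0 - errors / len(golden_text))


def evaluate(golden_dir, run_ocr):
    """Score every .jpg image that has a matching .golden text file.

    `run_ocr` is supplied by the caller: it takes an image path and
    returns the recognised text (for example a Tesseract wrapper).
    """
    scores = {}
    for golden in sorted(Path(golden_dir).glob("*.golden")):
        image = golden.with_suffix(".jpg")
        if not image.exists():
            continue
        expected = golden.read_text(encoding="utf-8").strip()
        scores[image.name] = character_accuracy(run_ocr(image).strip(), expected)
    return scores
```

Running evaluate once per dictionary configuration would make it possible to compare, for instance, the French dictionary against a per-language one on the same images.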
== Long-term goals ==
* Get dictionary translations from Wikidata
* Investigate Ocropus for complex layout extractions
* Investigate OpenCV for detection of patterns, logos…


== Targets ==
* Logos of brands (Getting them from POD ?)
* Logos of Labels (standardized)
* Text (distorted - bottle case, diagonally - with low light, bright light)
* Standardized layouts (US Nutrition labels)
* Standardized text (quantities, EU Packaging codes)
* Barcodes (extraction in uploaded images)
** Store in separate image for further reference
* Image orientation: check that the text is properly oriented to guess if the image is properly oriented.
 
* Deep Learning
** Product photo on packaging - guess category based on product picture
** Container: guess whether it's a bottle, cardboard…
Extracting areas is already great work: if we can extract logos or patterns, it will be faster for humans to double check and turn that into text.
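For the image orientation target, one possible approach is to OCR the image at each of the four right-angle rotations and keep the angle that yields the most dictionary words. The Python sketch below illustrates that idea only; best_rotation, ocr and rotate are hypothetical names, and the OCR and rotation functions are passed in by the caller so that any engine or imaging library can be used.

```python
def best_rotation(image, ocr, rotate, known_words, angles=(0, 90, 180, 270)):
    """Return the rotation angle whose OCR output contains the most known words.

    `ocr(image) -> str` and `rotate(image, angle) -> image` are supplied by
    the caller (both hypothetical here); `known_words` is a set of
    lower-case dictionary words for the product language.
    """
    def score(angle):
        words = ocr(rotate(image, angle)).lower().split()
        return sum(1 for word in words if word.strip(".,;:()") in known_words)

    # max() keeps the first angle on ties, so 0 wins when nothing is readable.
    return max(angles, key=score)
```

An angle other than 0 suggests that the uploaded image should be rotated before a human reviews it.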


[[Category:Project]]
[[Category:Product Opener]]

Revision as of 17:22, 2 August 2015

Currently, all products are edited manually. This project is about automatic or semi-automatic detection of product data (ingredients, logos, barcodes and more) using OCR and computer vision.

Current state

  • OCR extraction of Ingredients using Tesseract 2 (production) and 3 (.net)
  • Uses the French dictionary for all languages
-- /home/off-fr/cgi# grep get_ocr *
Ingredients.pm:use Image::OCR::Tesseract 'get_ocr';
Ingredients.pm: $text =  decode utf8=>get_ocr($image,undef,'fra');
  • Has a small custom dictionary for French (/usr/share/tesseract-ocr/tessdata/fra.user-words)
    • https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_provide_my_own_dictionary

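Short term goals

  • Use the right standard dict for each language
  • Integrate custom lists from Global Ingredients Taxonomy
    • Create a golden set
      • e.g. someproduct.jpg -> image of the ingredients list
      • someproduct.golden -> expected ingredients text
      • then create a script that runs the OCR on the images, compares the output with the golden text, and reports accuracy measures
      • Draft script: https://lite6.framapad.org/p/OFF_OCR_Script
  • Integrate custom lists from the live instances, language by language.
    • http://de.openfoodfacts.org/zutaten
    • http://uk.openfoodfacts.org/ingredients + http://us.openfoodfacts.org/ingredients
    • http://fr.openfoodfacts.org/ingredients
    • USDA UNII list of ingredients (will also work for Open Beauty Facts)
  • Process all images and make products searchable, even if not filled yet

Long-term goals

  • Get dictionary translations from Wikidata
  • Investigate Ocropus for complex layout extractions
  • Investigate OpenCV for detection of patterns, logos…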

Targets

  • Logos of brands (Getting them from POD ?)
  • Logos of Labels (standardized)
  • Text (distorted - bottle case, diagonally - with low light, bright light)
  • Standardized layouts (US Nutrition labels)
  • Standardized text (quantities, EU Packaging codes)
  • Barcodes (extraction in uploaded images)
    • Store in separate image for further reference
  • Image orientation: check that the text is properly oriented to guess if the image is properly oriented.
  • Deep Learning
    • Product photo on packaging - guess category based on product picture
    • Container: guess whether it's a bottle, cardboard…

Extracting areas is already great work: if we can extract logos or patterns, it will be faster for humans to double check and turn that into text.
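As an illustration of the "standardized text" target, quantities can often be pulled out of OCR output with a regular expression. The Python sketch below is a minimal example under stated assumptions: the unit list is a guess at common metric units, find_quantities is a hypothetical name, and EU packaging codes are not handled.

```python
import re

# Assumed unit list; real packaging uses more units and spellings.
QUANTITY_RE = re.compile(
    r"(?P<value>\d+(?:[.,]\d+)?)\s*(?P<unit>kg|mg|g|ml|cl|dl|l)\b",
    re.IGNORECASE,
)


def find_quantities(text):
    """Return (value, unit) pairs such as (500.0, 'g') found in OCR text."""
    results = []
    for match in QUANTITY_RE.finditer(text):
        # Accept both decimal comma and decimal point.
        value = float(match.group("value").replace(",", "."))
        results.append((value, match.group("unit").lower()))
    return results
```

Candidates found this way would still need a human check, since OCR errors in digits are hard to detect automatically.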