OCR/Roadmap: Difference between revisions

From Open Food Facts wiki
** USDA UNII list of ingredients (will also work for Open Beauty Facts)
* Process all images and make products searchable, even if not filled yet
===Dictionaries ===
* https://openfoodfacts.slack.com/files/teolemon/F08FC3T6V/deu.user-words
* https://openfoodfacts.slack.com/files/teolemon/F08FBT3BM/eng.user-words
* https://openfoodfacts.slack.com/files/teolemon/F08FBNBLG/fra.user-words
* https://openfoodfacts.slack.com/files/teolemon/F08FBQ45D/nld.user-words
* https://openfoodfacts.slack.com/files/teolemon/F08FBQ45V/spa.user-words
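
The files listed above follow the Tesseract `user-words` naming (`fra.user-words`, `eng.user-words`, and so on). As a rough illustration only, and assuming such a file is plain text with one word per line, a word list could be derived from existing ingredient names along these lines. The function name `build_user_words` and the filtering rules are hypothetical, not part of the current code.

```python
import unicodedata


def build_user_words(ingredient_names):
    """Return user-words file content: unique words, one per line, sorted."""
    words = set()
    for name in ingredient_names:
        # Normalise to NFC so that accented letters compare equal.
        name = unicodedata.normalize("NFC", name).lower()
        for token in name.replace(",", " ").split():
            token = token.strip("()[].;:%*")
            # Keep alphabetic tokens only (hyphens and apostrophes allowed);
            # percentages and other numbers are dropped.
            bare = token.replace("-", "").replace("'", "")
            if len(token) > 1 and bare.isalpha():
                words.add(token)
    return "\n".join(sorted(words)) + "\n"


if __name__ == "__main__":
    print(build_user_words(["Sucre de canne", "sucre, Lait (12%)"]), end="")
```

Deduplicating and sorting keeps the file stable between runs, which makes changes easier to review.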
==  Long-term goals ==
* Get dictionary translations from Wikidata

Revision as of 09:34, 18 August 2015

Currently, all products are edited manually. This project aims to detect product information (ingredients, logos, barcodes, standardized layouts) automatically or semi-automatically, using OCR and computer vision.

Current state

  • OCR extraction of Ingredients using Tesseract 2 (production) and 3 (.net)
  • Uses the French dictionary for all languages
-- /home/off-fr/cgi# grep get_ocr *
Ingredients.pm:use Image::OCR::Tesseract 'get_ocr';
Ingredients.pm: $text =  decode utf8=>get_ocr($image,undef,'fra');
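
The call above passes 'fra' for every product. One possible first step would be to choose the language per product. The sketch below is in Python for illustration only (the site code is Perl), and everything in it is an assumption: the function name `tesseract_lang`, the idea that a product carries a two-letter language code, and the mapping table, which covers only the five languages that have a user-words file (deu, eng, fra, nld, spa).

```python
# Hypothetical mapping from a product's two-letter language code to the
# three-letter code used in the dictionary file names (fra, eng, deu, ...).
TESSERACT_LANGS = {
    "fr": "fra",
    "en": "eng",
    "de": "deu",
    "nl": "nld",
    "es": "spa",
}

DEFAULT_LANG = "fra"  # current behaviour: French for everything


def tesseract_lang(product_lang):
    """Return the Tesseract language code for a product language.

    Falls back to French, which is what the current code uses for all
    products, when the language is missing or not in the table.
    """
    if not product_lang:
        return DEFAULT_LANG
    # Accept values such as "FR" or "fr-BE" by keeping the primary subtag.
    primary = product_lang.strip().lower().replace("_", "-").split("-")[0]
    return TESSERACT_LANGS.get(primary, DEFAULT_LANG)


if __name__ == "__main__":
    for code in ("en", "fr-BE", "it", None):
        print(code, "->", tesseract_lang(code))
```

Keeping French as the fallback means products in unlisted languages behave exactly as they do today.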

Short term goals

Dictionaries

Long-term goals

  • Get dictionary translations from Wikidata
  • Investigate OCRopus for complex layout extraction
  • Investigate OpenCV for detection of patterns, logos…

Targets

  • Logos of brands (get them from POD?)
  • Logos of labels (standardized)
  • Text (distorted, e.g. on a bottle; diagonal; in low or bright light)
  • Standardized layouts (US Nutrition labels)
    • Store in separate image for further reference
  • Standardized text (quantities, EU Packaging codes)
    • Store in separate image for further reference
  • Barcodes (extraction in uploaded images)
    • Store in separate image for further reference
  • Image orientation: use the orientation of the detected text to infer whether the image itself is correctly oriented.
  • Deep Learning
    • Product photo on packaging - guess category based on product picture
    • Container: guess whether it's a bottle, cardboard…
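
The "Image orientation" target could be approached by running OCR at each right-angle rotation and keeping the rotation that yields the most dictionary words. This is only a sketch of that idea, not an existing implementation. The `ocr` and `rotate` callables are hypothetical and must be supplied by the caller (for example, wrappers around Tesseract and an imaging library), and `vocabulary` is assumed to be a set of lowercase words such as the contents of a user-words file.

```python
def guess_rotation(image, ocr, rotate, vocabulary, angles=(0, 90, 180, 270)):
    """Return the angle whose OCR output contains the most known words.

    ocr(image) -> str and rotate(image, angle) -> image are hypothetical
    callables provided by the caller. A non-zero result suggests that the
    uploaded image is not correctly oriented.
    """
    best_angle, best_score = angles[0], -1
    for angle in angles:
        text = ocr(rotate(image, angle))
        tokens = (t.strip(".,;:()%").lower() for t in text.split())
        score = sum(1 for t in tokens if t in vocabulary)
        # Strict ">" keeps the earliest angle on ties, so 0 wins by default.
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


if __name__ == "__main__":
    # Fake OCR for demonstration: only the 90-degree rotation is readable.
    vocab = {"sucre", "lait", "sel"}
    fake_rotate = lambda img, angle: (img, angle)
    fake_ocr = lambda rotated: "Sucre, lait, sel." if rotated[1] == 90 else "5ucr3 l4it"
    print(guess_rotation("photo.jpg", fake_ocr, fake_rotate, vocab))
```

Running OCR four times per image is expensive, so in practice this would probably be limited to images where the first pass finds few known words.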

Extracting areas is already valuable in itself: if we can crop out logos or patterns, humans can double-check them and turn them into text faster.