OCR/Roadmap: Difference between revisions

From Open Food Facts wiki
* Integrate custom lists from Global Ingredients Taxonomy  
* USDA UNII list of ingredients (will also work for Open Beauty Facts)
* Integrate custom lists from the live instances; language per language.
** http://de.openfoodfacts.org/zutaten
** http://uk.openfoodfacts.org/ingredients + http://us.openfoodfacts.org/ingredients
** http://fr.openfoodfacts.org/ingredients
** http://es.openfoodfacts.org/ingredientes
** http://pt.openfoodfacts.org/ingredientes
** http://it.openfoodfacts.org/ingredientes
** http://ru.openfoodfacts.org/ingredients


=== Dictionaries ===
* https://openfoodfacts.slack.com/files/teolemon/F08FC3T6V/deu.user-words
* https://openfoodfacts.slack.com/files/teolemon/F08FBT3BM/eng.user-words
* https://openfoodfacts.slack.com/files/teolemon/F08FBNBLG/fra.user-words
* https://openfoodfacts.slack.com/files/teolemon/F08FBQ45D/nld.user-words
* https://openfoodfacts.slack.com/files/teolemon/F08FBQ45V/spa.user-words


=== Testing ===

Revision as of 15:35, 5 November 2016

Currently, all products are edited manually. This project aims to detect product information (text, logos, barcodes, layouts) automatically or semi-automatically using OCR and computer vision.

Product Opener improvements

  • Process all uploaded images using Tesseract and/or the new cloud-based engine
  • Return JSON to the mobile and/or web client so it can show suggestions to the user
  • Add support for searching OCR results
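The "return JSON" step above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the field names (`code`, `field`, `engine`, `suggestion`, `status`), the `process_image` helper and the `ocr_fn` callable are hypothetical and are not part of Product Opener. The OCR engine is passed in as a function so that Tesseract or a cloud engine can be swapped without changing the payload.

```python
import json


def build_suggestion(barcode, field, ocr_text, engine):
    """Build a suggestion payload (hypothetical schema, not Product Opener's)."""
    return {
        "code": barcode,
        "field": field,
        "engine": engine,
        # Collapse line breaks and repeated spaces left by the OCR engine.
        "suggestion": " ".join(ocr_text.split()),
        # Suggestions are never applied automatically; a human confirms them.
        "status": "needs_review",
    }


def process_image(barcode, field, image_bytes, ocr_fn, engine="tesseract"):
    """Run the injected OCR callable and return the payload as a JSON string."""
    text = ocr_fn(image_bytes)
    payload = build_suggestion(barcode, field, text, engine)
    return json.dumps(payload, ensure_ascii=False)


def fake_ocr(image_bytes):
    """Stand-in for a real OCR call, used only to exercise the sketch."""
    return "Sucre,  huile de palme\nnoisettes 13%"


payload = process_image("0000000000000", "ingredients_text", b"", fake_ocr)
```

Keeping the status at "needs review" matches the semi-automatic goal: the client shows the text as a suggestion and the contributor decides whether to keep it.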

TODO

  • Process Open Beauty Facts images
  • Process the Belgian Food Photographs

Short term goals

  • Use the appropriate standard dictionary for each language
  • Integrate custom lists from Global Ingredients Taxonomy
  • USDA UNII list of ingredients (will also work for Open Beauty Facts)
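One possible way to integrate such lists is to flatten ingredient names into a per-language word list in the plain-text, one-word-per-line form used by Tesseract user-words files. The sketch below assumes that approach; the function names, the minimum word length and the sample entries are illustrative, and how the resulting file is loaded depends on the Tesseract version and configuration.

```python
import re


def build_user_words(ingredient_names, min_length=3):
    """Return sorted, unique, lowercase words drawn from ingredient names.

    Multi-word names are split because the file holds one word per line.
    """
    words = set()
    for name in ingredient_names:
        # [^\W\d_]+ matches runs of letters only (accented letters included).
        for token in re.findall(r"[^\W\d_]+", name.lower()):
            if len(token) >= min_length:
                words.add(token)
    return sorted(words)


def write_user_words(words, path):
    """Write the word list as UTF-8 text, one word per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(words) + "\n")


# Illustrative sample; real input would come from the taxonomy or UNII exports.
sample = ["Sucre de canne", "Huile de palme", "Émulsifiant", "sucre"]
user_words = build_user_words(sample)
```

Dropping very short tokens such as "de" is a design choice in this sketch: they are already covered by the standard dictionary and add little to a custom list.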


Testing

Create a golden set of products that are complete

  • Product
    • Category: "Ingredients complete" "Ingredient images selected"
    • Get the ingredients image
    • Get the canonical (typed by contributors) ingredient list
    • Get the ingredients list generated with the current OCR system
    • Generate the ingredient list locally from the image and the custom dictionaries above
    • Compare the result with the canonical/golden test and report some accuracy measures
  • Draft Script: https://lite6.framapad.org/p/OFF_OCR_Script
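The comparison step in the list above could report, for example, a character error rate and word-level precision and recall. The sketch below is one minimal way to do that and is not the draft script linked above; the choice of metrics, the normalization (lowercasing, collapsing whitespace) and all function names are assumptions.

```python
import re
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(text):
    """Lowercase and collapse whitespace so layout differences are ignored."""
    return " ".join(text.lower().split())


def tokens(text):
    """Split text into lowercase runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def accuracy_report(golden, ocr):
    """Compare OCR output with the contributor-typed (golden) list."""
    gold, hyp = normalize(golden), normalize(ocr)
    cer = levenshtein(gold, hyp) / max(len(gold), 1)
    gold_counts, hyp_counts = Counter(tokens(gold)), Counter(tokens(hyp))
    # Multiset intersection: a word counts as matched once per occurrence.
    matched = sum((gold_counts & hyp_counts).values())
    precision = matched / max(sum(hyp_counts.values()), 1)
    recall = matched / max(sum(gold_counts.values()), 1)
    return {"cer": cer, "precision": precision, "recall": recall}


report = accuracy_report("Sugar, palm oil, cocoa", "Sugar, pa1m oil, cocoa")
```

Running the same report once for the current OCR system and once for the locally generated list gives two sets of numbers per product, which makes it possible to tell whether a custom dictionary actually helps.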

Easy wins

  • Process all images and make products searchable, even if not filled yet

Long-term goals

  • Get dictionary translations from Wikidata
  • Investigate OCRopus for complex layout extraction
  • Investigate OpenCV for detection of patterns, logos…

Targets

  • Logos of brands (get them from POD?)
  • Logos of labels (standardized)
  • Text (distorted, e.g. on bottles; diagonal; in low or bright light)
  • Standardized layouts (US Nutrition labels)
    • Store in separate image for further reference
  • Standardized text (quantities, EU Packaging codes)
    • Store in separate image for further reference
  • Barcodes (extraction in uploaded images)
    • Store in separate image for further reference
  • Image orientation: check whether the detected text is upright to infer whether the image itself is correctly oriented.
  • Deep Learning
    • Product photo on packaging - guess category based on product picture
    • Container: guess whether it's a bottle, cardboard…
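For the standardized-text target, quantities are a simple case that can be pulled out of raw OCR text with a regular expression. The sketch below is only an illustration: the unit list is a small assumed subset, the function name is hypothetical, and real labels would need per-language units and more tolerant matching of OCR errors.

```python
import re

# Assumed, deliberately small unit list; a real one would be larger.
QUANTITY_RE = re.compile(
    r"(?<![\w.,])(\d+(?:[.,]\d+)?)\s*(kg|mg|g|ml|cl|dl|l)\b",
    re.IGNORECASE,
)


def find_quantities(ocr_text):
    """Return (value, unit) pairs found in OCR text, in order of appearance."""
    results = []
    for match in QUANTITY_RE.finditer(ocr_text):
        # Accept both decimal comma and decimal point.
        value = float(match.group(1).replace(",", "."))
        results.append((value, match.group(2).lower()))
    return results


found = find_quantities("Poids net: 500 g\nContenance 1,5 L e\n33cl")
```

The lookbehind keeps the pattern from matching digits embedded in other tokens, such as additive codes, and the trailing word boundary avoids treating the first letter of an unrelated word as a unit.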

Extracting areas is already valuable on its own: if we can extract logos or patterns, humans can double-check them and turn them into text faster.