OCR/Roadmap
Currently, all products are edited manually. This project aims at automatic or semi-automatic detection of product information (ingredient lists, logos, barcodes and other standardized elements) using OCR and computer vision.
Product Opener improvements
- Process all uploaded images using Tesseract and/or the new cloud-based engine
- Return JSON to the mobile and/or web client so it can show suggestions to the user
- Add support for searching OCR results
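As a rough illustration of the JSON step above, the sketch below wraps raw OCR text in a payload a client could display as a suggestion. The function name `build_suggestion` and every key (`code`, `field`, `suggestion`, `engine`) are hypothetical; Product Opener's actual response format may look quite different.

```python
import json


def build_suggestion(barcode, field, ocr_text, engine="tesseract"):
    """Wrap OCR output in a JSON string a client could show as a suggestion.

    All keys are hypothetical placeholders, not Product Opener's real API.
    """
    # Collapse whitespace so line breaks from the OCR engine do not leak
    # into the suggested field value.
    cleaned = " ".join(ocr_text.split())
    payload = {
        "code": barcode,        # product barcode
        "field": field,         # field the suggestion applies to
        "suggestion": cleaned,  # OCR text offered to the user
        "engine": engine,       # which OCR engine produced the text
    }
    return json.dumps(payload, ensure_ascii=False, sort_keys=True)
```

The user stays in control: the client would only pre-fill the field and let the contributor confirm or correct it.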
TODO
- Process Open Beauty Facts images
- Process the Belgian Food Photographs
Short term goals
- Use the right standard dictionary for each language
- Integrate custom lists from Global Ingredients Taxonomy
- FDA UNII list of ingredients (will also work for Open Beauty Facts)
- Integrate custom lists from the live instances, language by language:
- http://de.openfoodfacts.org/zutaten
- http://uk.openfoodfacts.org/ingredients + http://us.openfoodfacts.org/ingredients
- http://fr.openfoodfacts.org/ingredients
Dictionaries
- https://openfoodfacts.slack.com/files/teolemon/F08FC3T6V/deu.user-words
- https://openfoodfacts.slack.com/files/teolemon/F08FBT3BM/eng.user-words
- https://openfoodfacts.slack.com/files/teolemon/F08FBNBLG/fra.user-words
- https://openfoodfacts.slack.com/files/teolemon/F08FBQ45D/nld.user-words
- https://openfoodfacts.slack.com/files/teolemon/F08FBQ45V/spa.user-words
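A file like the ones listed above could be generated from an ingredient list roughly as follows. This is a minimal sketch that assumes the user-words format is plain text with one word per line; `build_user_words` is a hypothetical helper, not an existing script.

```python
def build_user_words(ingredient_names):
    """Return the contents of a Tesseract-style user-words file.

    Assumes one word per line. Multi-word ingredient names are split into
    individual words. Case is preserved because some languages (German,
    for example) capitalize nouns.
    """
    words = set()
    for name in ingredient_names:
        for word in name.split():
            # Strip surrounding punctuation such as commas or parentheses.
            word = word.strip(".,;:()[]*%")
            # Skip single characters and pure numbers (e.g. percentages).
            if len(word) > 1 and not word.isdigit():
                words.add(word)
    return "\n".join(sorted(words)) + "\n"
```

The returned string can be written to a file such as `eng.user-words`, with one file per language.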
Testing
Create a golden set of products that are complete
- Product
- Category: "Ingredients complete" "Ingredient images selected"
- Get the ingredients image
- Get the canonical (typed by contributors) ingredient list
- Get the ingredients list generated with the current OCR system
- Generate the ingredient list locally from the image, using the custom dictionaries above
- Compare the result with the canonical/golden test and report some accuracy measures
- Draft Script: https://lite6.framapad.org/p/OFF_OCR_Script
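The comparison step could be sketched as below. The choice of measures is an assumption, not what the draft script does: a character-level similarity from Python's `difflib`, plus precision and recall over comma-separated ingredients. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so formatting does not affect scores."""
    return " ".join(text.lower().split())


def split_ingredients(text):
    """Split a comma-separated ingredient list into a set of cleaned items."""
    parts = (part.strip(" .") for part in normalize(text).split(","))
    return {part for part in parts if part}


def accuracy_report(canonical, ocr_output):
    """Compare OCR output against the contributor-typed ingredient list."""
    ref, hyp = normalize(canonical), normalize(ocr_output)
    # Ratio in [0, 1]; 1.0 means the normalized strings are identical.
    char_similarity = SequenceMatcher(None, ref, hyp).ratio()

    ref_items = split_ingredients(canonical)
    hyp_items = split_ingredients(ocr_output)
    matched = ref_items & hyp_items
    precision = len(matched) / len(hyp_items) if hyp_items else 0.0
    recall = len(matched) / len(ref_items) if ref_items else 0.0
    return {
        "char_similarity": char_similarity,
        "precision": precision,
        "recall": recall,
    }
```

Averaging these numbers over the golden set gives a baseline to compare the current OCR system against a run with the custom dictionaries.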
Easy wins
- Process all images and make products searchable, even if not filled yet
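One simple way to make unfilled products searchable is an inverted index over the OCR text. The sketch below is an in-memory illustration only; the data layout (a dict mapping barcode to OCR text) and the function names are assumptions, and a production setup would more likely rely on the existing database or a search engine.

```python
import re
from collections import defaultdict


def build_index(ocr_by_barcode):
    """Map each lowercase word found by OCR to the barcodes containing it."""
    index = defaultdict(set)
    for barcode, text in ocr_by_barcode.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(barcode)
    return index


def search(index, query):
    """Return barcodes whose OCR text contains every word of the query."""
    tokens = re.findall(r"\w+", query.lower())
    if not tokens:
        return set()
    # Intersect the barcode sets of all query words (AND semantics).
    return set.intersection(*(index.get(token, set()) for token in tokens))
```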
Long-term goals
- Get dictionary translations from Wikidata
- Investigate OCRopus for complex layout extraction
- Investigate OpenCV for detection of patterns, logos…
Targets
- Logos of brands (getting them from POD?)
- Logos of labels (standardized)
- Text (distorted on curved surfaces such as bottles, printed diagonally, or photographed in low or bright light)
- Standardized layouts (US Nutrition labels)
- Store in separate image for further reference
- Standardized text (quantities, EU Packaging codes)
- Store in separate image for further reference
- Barcodes (extraction in uploaded images)
- Store in separate image for further reference
- Image orientation: check whether the text is upright to infer whether the image itself is properly oriented.
- Deep Learning
- Product photo on packaging - guess category based on product picture
- Container: guess whether it's a bottle, cardboard…
Extracting areas is already valuable on its own: if we can extract logos or patterns, humans can double-check them and turn them into text faster.
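For the standardized-text target listed above, quantities are a good first candidate because they follow a narrow pattern. The sketch below pulls them out of OCR text with a regular expression. The unit list and the pattern are assumptions chosen for illustration and would need tuning against real labels.

```python
import re

# A number (with "." or "," as decimal separator) followed by a unit.
# The lookbehind avoids starting a match in the middle of a word or number.
QUANTITY_RE = re.compile(
    r"(?<![\w.,])(\d+(?:[.,]\d+)?)\s*(kg|g|mg|cl|ml|l)\b",
    re.IGNORECASE,
)


def find_quantities(ocr_text):
    """Return (value, unit) pairs found in OCR text, in order of appearance."""
    results = []
    for match in QUANTITY_RE.finditer(ocr_text):
        value = float(match.group(1).replace(",", "."))
        results.append((value, match.group(2).lower()))
    return results
```

Candidates found this way would still be shown to a contributor for confirmation rather than saved directly.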