OCR/Roadmap
Revision as of 17:22, 2 August 2015
Currently, all products are edited manually. This project aims to detect a number of things automatically or semi-automatically, using OCR and computer vision.
Current state
- OCR extraction of Ingredients using Tesseract 2 (production) and 3 (.net)
- Uses the French dictionary for all languages
    /home/off-fr/cgi# grep get_ocr *
    Ingredients.pm:use Image::OCR::Tesseract 'get_ocr';
    Ingredients.pm: $text = decode utf8=>get_ocr($image,undef,'fra');
- Has a small custom dictionary for French (/usr/share/tesseract-ocr/tessdata/fra.user-words)
  - https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_provide_my_own_dictionary
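The hard-coded 'fra' argument in the get_ocr call is what sends every language through the French dictionary. Below is a minimal sketch of a per-language lookup, written in Python for illustration rather than in the Perl of Ingredients.pm. The names TESSERACT_LANG and tesseract_lang are hypothetical, the two-letter input codes are an assumption about how a product's language is stored, and the fallback to 'fra' only mirrors the current behaviour. The three-letter values are the codes Tesseract uses for its language data files.

```python
# Hypothetical lookup table: two-letter language code -> Tesseract language code.
# Only a few languages are listed; extend as trained data gets installed.
TESSERACT_LANG = {
    "fr": "fra",
    "de": "deu",
    "en": "eng",
    "es": "spa",
    "it": "ita",
    "nl": "nld",
}


def tesseract_lang(lc, default="fra"):
    """Return the Tesseract language code for a two-letter language code.

    Unknown codes fall back to `default`, which mirrors the current
    French-only behaviour (an assumption, not a recommendation).
    """
    return TESSERACT_LANG.get(lc.lower(), default)


if __name__ == "__main__":
    for code in ("fr", "de", "EN", "xx"):
        print(code, "->", tesseract_lang(code))
```

The result would then be passed as the language argument of the OCR call in place of the literal 'fra'.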
Short-term goals
- Use the matching standard dictionary for each language
- Integrate custom lists from Global Ingredients Taxonomy
  - Create a golden set
    - e.g. someproduct.jpg -> ingredients image
    - someproduct.golden -> ingredients text
    - Then create a script that runs the OCR over the images, compares the output with the golden text, and reports some accuracy measures
    - Draft script: https://lite6.framapad.org/p/OFF_OCR_Script
- Integrate custom lists from the live instances, language by language.
  - http://de.openfoodfacts.org/zutaten
  - http://uk.openfoodfacts.org/ingredients + http://us.openfoodfacts.org/ingredients
  - http://fr.openfoodfacts.org/ingredients
  - FDA UNII list of ingredients (will also work for Open Beauty Facts)
- Process all images and make products searchable, even if their details have not been filled in yet
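The golden-set check described above (someproduct.jpg paired with someproduct.golden) could be sketched as follows. This is not the draft script linked above, only an illustration under stated assumptions: the accuracy measure chosen here is the character error rate (edit distance divided by the length of the golden text), the function names are hypothetical, and the OCR step is passed in as a callable so the sketch runs without Tesseract installed. In practice that callable might wrap something like pytesseract's image_to_string, which is likewise an assumption about tooling.

```python
from pathlib import Path


def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def char_error_rate(ocr_text, golden_text):
    """Edit distance normalised by the length of the golden text."""
    if not golden_text:
        return 0.0 if not ocr_text else 1.0
    return levenshtein(ocr_text, golden_text) / len(golden_text)


def evaluate(golden_dir, run_ocr):
    """Score every <name>.jpg that has a matching <name>.golden file.

    `run_ocr` is any callable taking an image path and returning text.
    Returns a dict mapping product name to character error rate.
    """
    results = {}
    for golden_path in sorted(Path(golden_dir).glob("*.golden")):
        image_path = golden_path.with_suffix(".jpg")
        if not image_path.exists():
            continue  # golden text without an image: skip it
        golden = golden_path.read_text(encoding="utf-8").strip()
        ocr = run_ocr(image_path).strip()
        results[golden_path.stem] = char_error_rate(ocr, golden)
    return results
```

Running the same golden set before and after a dictionary change gives a per-product number to compare, which is the point of the exercise.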
Long-term goals
- Get dictionary translations from Wikidata
- Investigate Ocropus for complex layout extractions
- Investigate OpenCV for detection of patterns, logos…
Targets
- Logos of brands (getting them from POD?)
- Logos of labels (standardized)
- Text (distorted, e.g. on a bottle; diagonal; in low or bright light)
- Standardized layouts (US Nutrition labels)
- Standardized text (quantities, EU Packaging codes)
- Barcodes (extraction in uploaded images)
  - Store in separate image for further reference
- Image orientation: use the orientation of the text to infer whether the image itself is correctly oriented.
- Deep Learning
  - Product photo on packaging - guess category based on product picture
  - Container: guess whether it's a bottle, cardboard…
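For the barcode target, one cheap sanity check on whatever digits get extracted from an uploaded image is the EAN-13 check digit. The sketch below assumes the extracted codes are 13-digit EAN-13 strings (other barcode formats would need their own rule), and the function names are hypothetical. It only validates a candidate string; it does not locate or decode the barcode in the image.

```python
def ean13_check_digit(first12):
    """Compute the EAN-13 check digit for the first 12 digits.

    Digits are weighted 1, 3, 1, 3, ... from the left; the check digit
    brings the weighted sum up to a multiple of 10.
    """
    if len(first12) != 12 or not (first12.isascii() and first12.isdigit()):
        raise ValueError("expected exactly 12 digits")
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def is_valid_ean13(code):
    """Return True if `code` is 13 digits with a correct check digit."""
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    return ean13_check_digit(code[:12]) == int(code[12])


if __name__ == "__main__":
    for candidate in ("3017620422003", "3017620422004", "12345"):
        print(candidate, is_valid_ean13(candidate))
```

A candidate that fails the check is most likely a misread and can be dropped or sent for human review.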
Extracting areas alone is already valuable: if we can extract logos or patterns, it will be faster for humans to double-check them and turn them into text.