OCR/Roadmap: Difference between revisions

From Open Food Facts wiki
(9 intermediate revisions by 2 users not shown)
Line 1: Line 1:
Currently, all products are edited manually. This project is about automatic or semi-automatic detection of a number of things using OCR and Computer vision.


== Current state ==
* OCR extraction of Ingredients using Tesseract 2 (production) and 3 (.net)
* Uses the French dictionary for all languages
<pre>
-- /home/off-fr/cgi# grep get_ocr *
Ingredients.pm:use Image::OCR::Tesseract 'get_ocr';
Ingredients.pm: $text = decode utf8=>get_ocr($image,undef,'fra');
</pre>
* Has a small custom dictionary for French (/usr/share/tesseract-ocr/tessdata/fra.user-words)
** https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_provide_my_own_dictionary

== Product Opener improvements ==
* Process all uploaded images using Tesseract and/or the New Cloud based engine
* Return JSON to mobile client and/or web client for suggestions to the user
* Add support to search into OCR results

== ✅ TODO ==
* Process Open Beauty Facts, Open Pet Food Facts, Open Products Facts images
* Process the Belgian Food Photographs

== Short term goals ==
* Use the right standard dictionary for each language
* Integrate custom lists from Global Ingredients Taxonomy
* FDA UNII list of ingredients (will also work for Open Beauty Facts)
* Integrate custom lists from the live instances; language per language.
** http://de.openfoodfacts.org/zutaten
** http://uk.openfoodfacts.org/ingredients + http://us.openfoodfacts.org/ingredients
** http://fr.openfoodfacts.org/ingredients
=== Dictionaries ===
* https://openfoodfacts.slack.com/files/teolemon/F08FC3T6V/deu.user-words
* https://openfoodfacts.slack.com/files/teolemon/F08FBT3BM/eng.user-words
* https://openfoodfacts.slack.com/files/teolemon/F08FBNBLG/fra.user-words
* https://openfoodfacts.slack.com/files/teolemon/F08FBQ45D/nld.user-words
* https://openfoodfacts.slack.com/files/teolemon/F08FBQ45V/spa.user-words


=== Testing ===
Line 49: Line 38:
* Text (distorted - bottle case, diagonally - with low light, bright light)
* Standardized layouts (US Nutrition labels)
** Store in separate image for further reference
== Store in separate image for further reference ==
* Standardized text (quantities, EU Packaging codes)
** Store in separate image for further reference
* Barcodes (extraction in uploaded images)
** Store in separate image for further reference
* Image orientation: check that the text is properly oriented to guess if the image is properly oriented.
* Deep Learning
** Product photo on packaging - guess category based on product picture
Line 62: Line 50:
[[Category:Roadmap]]
[[Category:Project]]
[[Category:Product Opener]]
[[Category:ProductOpener]]
[[Category:OCR]]
[[Category:Artificial Intelligence]]

Latest revision as of 08:42, 28 August 2024

Currently, all products are edited manually. This project aims to detect product data (ingredients, logos, labels, barcodes and more) automatically or semi-automatically using OCR and computer vision.

Product Opener improvements

  • Process all uploaded images using Tesseract and/or the new cloud-based engine
  • Return JSON to the mobile client and/or web client as suggestions for the user
  • Add support for searching OCR results
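
As a rough illustration of the JSON item above, a suggestion payload for a client might be built as in the sketch below. The function `build_suggestion` and the field names (`code`, `field`, `lang`, `suggestion`, `source`) are hypothetical and are not an existing Product Opener API; the barcode is a placeholder.

```python
import json


def build_suggestion(barcode, lang, ocr_text):
    """Wrap raw OCR text in a JSON-serialisable suggestion (hypothetical schema)."""
    # Collapse line breaks and repeated spaces so the client gets one clean line.
    cleaned = " ".join(ocr_text.split())
    return {
        "code": barcode,              # product barcode (placeholder value below)
        "field": "ingredients_text",  # assumed name of the field being suggested
        "lang": lang,
        "suggestion": cleaned,
        "source": "ocr",
    }


payload = build_suggestion("0000000000000", "fr", "Sucre,  huile de palme,\nnoisettes")
print(json.dumps(payload, ensure_ascii=False))
```

The client stays in control: the payload is only a suggestion that a contributor can accept or correct.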

✅ TODO

  • Process Open Beauty Facts, Open Pet Food Facts, Open Products Facts images
  • Process the Belgian Food Photographs

Short term goals

  • Use the right standard dictionary for each language
  • Integrate custom lists from Global Ingredients Taxonomy
  • FDA UNII list of ingredients (will also work for Open Beauty Facts)
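
A minimal sketch of the first two goals, assuming product languages arrive as two-letter codes and that a custom list is written as a Tesseract user-words file (one word per line). The helper names and the fallback to `eng` are assumptions for illustration only.

```python
# Two-letter language codes mapped to Tesseract language data codes.
TESSERACT_LANG = {"de": "deu", "en": "eng", "fr": "fra", "nl": "nld", "es": "spa"}


def tesseract_lang(lang, default="eng"):
    """Pick the Tesseract language for a product language (fallback is an assumption)."""
    return TESSERACT_LANG.get(lang.lower(), default)


def user_words(ingredients):
    """Build user-words file content: unique lowercase words, sorted, one per line."""
    words = set()
    for name in ingredients:
        for word in name.lower().split():
            word = word.strip(",.;:()")
            if word:
                words.add(word)
    return "\n".join(sorted(words)) + "\n"


print(tesseract_lang("fr"))
print(user_words(["Huile de palme", "Sucre"]), end="")
```

The ingredient names could come from the taxonomy or from any other custom list; the sketch only covers the formatting step.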

Testing

Create a golden set of products that are complete

  • Product
    • Category: "Ingredients complete" "Ingredient images selected"
    • Get the ingredients image
    • Get the canonical (typed by contributors) ingredient list
    • Get the ingredients list generated with the current OCR system
    • Generate the ingredient list on your laptop from the image and the custom dictionary above
    • Compare the result with the canonical/golden test and report some accuracy measures
  • Draft Script: https://lite6.framapad.org/p/OFF_OCR_Script
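
The comparison step could be sketched as follows. This is one possible set of accuracy measures (character error rate based on edit distance, plus recall over comma-separated ingredients), not the measures used by the draft script; all function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute cost 1)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalize(text):
    """Lowercase and collapse whitespace so layout differences are not counted."""
    return " ".join(text.lower().split())


def split_ingredients(text):
    """Split a comma-separated ingredient list into a set of normalized names."""
    return {part.strip() for part in normalize(text).split(",") if part.strip()}


def character_error_rate(ocr_text, golden_text):
    """Edit distance divided by the length of the golden text."""
    golden = normalize(golden_text)
    if not golden:
        raise ValueError("golden text must not be empty")
    return levenshtein(normalize(ocr_text), golden) / len(golden)


def ingredient_recall(ocr_text, golden_text):
    """Share of golden ingredients that appear exactly in the OCR output."""
    golden = split_ingredients(golden_text)
    if not golden:
        raise ValueError("golden text must contain at least one ingredient")
    return len(golden & split_ingredients(ocr_text)) / len(golden)
```

Character error rate tolerates small OCR slips, while exact-match recall is stricter and shows how many ingredients would be usable without correction.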

Easy wins

  • Process all images and make products searchable, even if not filled yet
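
To make the idea concrete, here is a small in-memory inverted index over OCR text. It is only a sketch: a real deployment would more likely rely on the existing search backend, and `build_index` and `search` are hypothetical names.

```python
import re
from collections import defaultdict


def build_index(ocr_by_barcode):
    """Map each lowercase token to the set of barcodes whose OCR text contains it."""
    index = defaultdict(set)
    for barcode, text in ocr_by_barcode.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(barcode)
    return index


def search(index, query):
    """Return barcodes whose OCR text contains every token of the query."""
    tokens = re.findall(r"\w+", query.lower())
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


index = build_index({"A": "Sucre, huile de palme", "B": "Farine de blé, sucre"})
print(sorted(search(index, "sucre")))
```

Products with no typed ingredient list would still be found this way, as long as their images have been processed.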

Long-term goals

  • Get dictionaries translations from Wikidata
  • Investigate OCRopus for complex layout extractions
  • Investigate OpenCV for detection of patterns, logos…

Targets

  • Logos of brands (getting them from POD?)
  • Logos of labels (standardized)
  • Text (distorted - bottle case, diagonally - with low light, bright light)
  • Standardized layouts (US Nutrition labels)
    • Store in separate image for further reference
  • Standardized text (quantities, EU Packaging codes)
  • Barcodes (extraction in uploaded images)
  • Image orientation: check that the text is properly oriented to guess if the image is properly oriented.
  • Deep Learning
    • Product photo on packaging - guess category based on product picture
    • Container: guess whether it's a bottle, cardboard…
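
The image orientation check can be sketched as below: run OCR at each of the four right-angle rotations and keep the one that yields the most dictionary words. The OCR step is passed in as a callable (`ocr_at_rotation`, a hypothetical name) so that the sketch does not assume any particular OCR library.

```python
def guess_rotation(ocr_at_rotation, dictionary):
    """Return (best_angle, scores) for rotations of 0, 90, 180 and 270 degrees.

    ocr_at_rotation: callable taking an angle in degrees and returning OCR text.
    dictionary: set of known lowercase words, e.g. from a user-words list.
    """
    def score(text):
        # Count tokens that are known words once punctuation is stripped.
        tokens = [t.strip(",.;:()").lower() for t in text.split()]
        return sum(1 for t in tokens if t in dictionary)

    scores = {angle: score(ocr_at_rotation(angle)) for angle in (0, 90, 180, 270)}
    # On a tie, the smallest angle wins because dicts keep insertion order.
    best = max(scores, key=scores.get)
    return best, scores


# Stand-in OCR results for illustration only.
fake_ocr = {0: "ercus eliuh", 90: "xx yy", 180: "Sucre, huile de palme", 270: ""}
best, scores = guess_rotation(lambda angle: fake_ocr[angle], {"sucre", "huile", "de", "palme"})
print(best, scores)
```

When all rotations score zero, the result is not informative and the image is probably better left for a human to check.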

Extracting areas is already great work: if we can extract logos or patterns, it will be faster for humans to double-check them and turn them into text.