OCR/Roadmap

From Open Food Facts wiki
Latest revision as of 08:42, 28 August 2024

Currently, all products are edited manually. This project aims to detect product information automatically or semi-automatically using OCR and computer vision.

Product Opener improvements

  • Process all uploaded images using Tesseract and/or the new cloud-based engine
  • Return JSON to the mobile client and/or web client so that suggestions can be shown to the user
  • Add support for searching OCR results
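
As an illustration of the JSON suggestion idea above, here is a minimal sketch of a payload a client could display for user confirmation. The function name, the field names and the "needs_review" status are assumptions made for this example, not the actual Product Opener API.

```python
import json


def build_suggestion(barcode, field, ocr_text, confidence):
    """Package one OCR result as a suggestion for a client to confirm.

    All keys below are hypothetical; they are not Product Opener fields.
    """
    return {
        "code": barcode,
        "field": field,
        "suggested_value": ocr_text.strip(),
        "confidence": round(confidence, 2),
        "status": "needs_review",  # a human still has to accept or reject it
    }


payload = json.dumps(
    build_suggestion(
        "3017620422003",
        "ingredients_text",
        " sugar, palm oil, hazelnuts ",
        0.874,
    )
)
print(payload)
```

Keeping the suggestion separate from the stored product data would let the client show it as a proposal rather than as a confirmed value.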

✅ TODO

  • Process Open Beauty Facts, Open Pet Food Facts, Open Products Facts images
  • Process the Belgian Food Photographs

Short term goals

  • Use the appropriate standard dictionary for each language
  • Integrate custom lists from Global Ingredients Taxonomy
  • Integrate the FDA UNII list of ingredients (will also work for Open Beauty Facts)
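
One way the per-language dictionaries could be used is to snap OCR tokens to the closest known ingredient. The sketch below uses Python's standard `difflib`; the tiny word lists, the `DICTIONARIES` name and the 0.75 cutoff are placeholders chosen for illustration, not the real taxonomy or a tuned threshold.

```python
import difflib

# Placeholder word lists; a real run would load the taxonomy per language.
DICTIONARIES = {
    "en": ["sugar", "salt", "water", "wheat flour", "palm oil"],
    "fr": ["sucre", "sel", "eau", "farine de blé", "huile de palme"],
}


def correct_ingredient(token, lang, cutoff=0.75):
    """Return the closest dictionary entry, or the token unchanged if none is close."""
    words = DICTIONARIES.get(lang, [])
    matches = difflib.get_close_matches(
        token.lower().strip(), words, n=1, cutoff=cutoff
    )
    return matches[0] if matches else token


print(correct_ingredient("su9ar", "en"))
print(correct_ingredient("suere", "fr"))
```

Choosing the dictionary by language matters here: matching a French label against the English list would either miss or produce wrong corrections.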

Testing

Create a golden set of products that are complete

  • Product
    • Category: "Ingredients complete" "Ingredient images selected"
    • Get the ingredients image
    • Get the canonical (typed by contributors) ingredient list
    • Get the ingredients list generated with the current OCR system
    • Generate the ingredient list locally from the image and the custom dictionaries above
    • Compare the result with the canonical (golden) list and report accuracy measures
  • Draft Script: https://lite6.framapad.org/p/OFF_OCR_Script
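
The comparison step above could be sketched as follows. This is not the draft script linked above; the choice of measures (ingredient-level precision and recall, plus a character error rate based on Levenshtein distance) and the comma-based splitting are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def split_ingredients(text):
    """Naive split on commas; real ingredient lists need a proper parser."""
    return {part.strip().lower() for part in text.split(",") if part.strip()}


def score(golden, ocr):
    """Compare an OCR ingredient list against the contributor-typed one."""
    g, o = split_ingredients(golden), split_ingredients(ocr)
    true_positives = len(g & o)
    return {
        "precision": true_positives / len(o) if o else 0.0,
        "recall": true_positives / len(g) if g else 0.0,
        "char_error_rate": levenshtein(golden, ocr) / max(len(golden), 1),
    }


print(score("sugar, salt, water", "sugar, sa1t, water"))
```

Reporting both kinds of measure is useful: the character error rate shows raw OCR quality, while precision and recall show how many whole ingredients survive intact.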

Easy wins

  • Process all images and make products searchable, even if not filled yet
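
To show how raw OCR output could make unfilled products searchable, here is a minimal in-memory inverted index. It is only a sketch under simple assumptions (whitespace and punctuation tokenization, all query words must match); a production setup would more likely rely on an existing search engine, and the function names are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(ocr_by_barcode):
    """Map each token to the set of barcodes whose OCR text contains it."""
    index = defaultdict(set)
    for barcode, text in ocr_by_barcode.items():
        for token in tokenize(text):
            index[token].add(barcode)
    return index


def search(index, query):
    """Return barcodes whose OCR text contains every word of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


index = build_index({
    "0001": "Ingredients: sugar, palm oil, hazelnuts",
    "0002": "Ingredients: water, sugar, lemon juice",
})
print(sorted(search(index, "sugar")))
print(sorted(search(index, "palm oil")))
```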

Long-term goals

  • Get dictionary translations from Wikidata
  • Investigate Ocropus for complex layout extractions
  • Investigate OpenCV for detecting patterns, logos…

Targets

  • Brand logos (get them from POD?)
  • Label logos (standardized)
  • Text (distorted, as on a bottle; diagonal; in low or bright light)
  • Standardized layouts (US Nutrition labels)

Store in a separate image for further reference

  • Standardized text (quantities, EU Packaging codes)
  • Barcodes (extraction in uploaded images)
  • Image orientation: use the orientation of the detected text to infer whether the image itself is correctly oriented.
  • Deep Learning
    • Product photo on packaging - guess category based on product picture
    • Container: guess whether it's a bottle, cardboard…
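
The image orientation item above could be approached by running OCR on the image at each of the four rotations and keeping the rotation whose output looks most like real words. The sketch below covers only that scoring step: it assumes the OCR text for each rotation has already been produced by some engine, and `KNOWN_WORDS` stands in for the per-language dictionaries.

```python
import re

# Stand-in vocabulary; a real run would use the per-language dictionaries.
KNOWN_WORDS = {"ingredients", "sugar", "salt", "water", "oil"}


def dictionary_hit_rate(text):
    """Fraction of alphabetic tokens that are known words (0.0 if no tokens)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in KNOWN_WORDS for token in tokens) / len(tokens)


def guess_rotation(ocr_by_rotation):
    """Pick the rotation angle whose OCR output has the highest hit rate."""
    return max(
        ocr_by_rotation,
        key=lambda angle: dictionary_hit_rate(ocr_by_rotation[angle]),
    )


# Hypothetical OCR output for one image rotated by 0, 90, 180 and 270 degrees.
sample = {
    0: "lJ l| rn",
    90: "Ingredients: sugar, salt",
    180: "xq zz",
    270: "",
}
print(guess_rotation(sample))
```

Text read at the wrong rotation tends to come out as short garbage tokens, so even a small vocabulary can separate the correct orientation from the others.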

Extracting areas is already valuable in itself: if we can extract logos or patterns, humans can double-check them and turn them into text more quickly.