Ingredients Extraction and Analysis

** Perfect quality
** Needs cropping and/or rotation to select the ingredients list
=== Steps for ingredients lists ===
==== Picture taking ====
* Taken with mobile app, uploaded to OFF server
==== Ingredients list cropping ====
* Done on mobile app just after picture taking
** Cropping may be very inaccurate
* Or done on web site at a later time, possibly by another user
** Cropping slightly easier than on mobile
==== OCR ====
* Launched after cropping; run by the server, which calls Google Cloud Vision
* Cloud Vision returns a JSON object which is stored on the server
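As an illustration of reading the stored OCR result, the sketch below pulls the full text out of a Cloud Vision JSON response. It assumes the server stores the response in the REST <code>images:annotate</code> shape (a top-level <code>responses</code> array with <code>fullTextAnnotation</code> and <code>textAnnotations</code>); how OFF actually stores the object is not described here, and the function name <code>extract_full_text</code> is hypothetical.

```python
import json


def extract_full_text(cloud_vision_json: str) -> str:
    """Return the full OCR text from a stored Cloud Vision response, or ''.

    Assumes the REST images:annotate response shape; this is an
    assumption about the stored JSON, not a description of OFF's code.
    """
    data = json.loads(cloud_vision_json)
    responses = data.get("responses", [])
    if not responses:
        return ""
    first = responses[0]
    # Prefer the document-level text when it is present.
    full_text = first.get("fullTextAnnotation", {}).get("text")
    if full_text:
        return full_text
    # Otherwise fall back to the first text annotation, which holds the whole text.
    annotations = first.get("textAnnotations", [])
    return annotations[0].get("description", "") if annotations else ""


sample = json.dumps(
    {"responses": [{"fullTextAnnotation": {"text": "Ingredients: sugar, cocoa butter"}}]}
)
print(extract_full_text(sample))
```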
==== Ingredients list cutting ====
* The image sent to the OCR can also contain text other than the ingredients list
** Things that are not ingredients
** Ingredients in other languages
** The word "Ingredients:"
* Current solution
** Hardcoded regular expressions
* Other possible solutions
** Language identification to remove other languages
* Metrics
** False negatives (words before or after the ingredients list that should have been removed)
** False positives (words that were removed but are part of the ingredients and should have been kept)
*** It is very important to have as few false positives as possible, because they destroy data
* Test and training sets
** Only a few ad hoc tests are run during builds
** Test sets need to be created
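The cutting step and its two metrics can be sketched as follows. This is a minimal illustration, not the hardcoded expressions used in production: the patterns <code>HEADER</code> and <code>TRAILER</code>, the English-only keywords, and the function names are all assumptions made for the example. False positives and false negatives are counted per word, following the definitions above.

```python
import re
from collections import Counter

# Hypothetical patterns: everything up to and including "Ingredients:" is
# dropped, and so is everything from a known trailing phrase onwards.
HEADER = re.compile(r"^.*?\bingredients?\s*:\s*", re.IGNORECASE | re.DOTALL)
TRAILER = re.compile(
    r"\b(nutrition(al)? (facts|information)|best before|store in)\b.*$",
    re.IGNORECASE | re.DOTALL,
)


def cut_ingredients(ocr_text: str) -> str:
    """Strip text before and after the ingredients list."""
    text = HEADER.sub("", ocr_text, count=1)
    text = TRAILER.sub("", text, count=1)
    return text.strip(" \n.")


def tokens(text: str) -> list:
    return re.findall(r"\w+", text.lower())


def cutting_errors(predicted: str, expected: str) -> tuple:
    """Return (false_positives, false_negatives) counted in words.

    False positive: a word that was removed but belongs to the list.
    False negative: a word that was kept but should have been removed.
    """
    pred, exp = Counter(tokens(predicted)), Counter(tokens(expected))
    false_positives = sum((exp - pred).values())
    false_negatives = sum((pred - exp).values())
    return false_positives, false_negatives


ocr = "CHOCO BAR 100g\nIngredients: sugar, cocoa butter, milk powder.\nBest before: see lid"
cut = cut_ingredients(ocr)
print(cut)
print(cutting_errors(cut, "sugar, cocoa butter, milk powder"))
```

Counting the two error types separately makes it possible to tune the patterns towards fewer false positives, even at the cost of more false negatives.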
==== Validation and/or correction by users ====
* Current solution:
** Users on the app or the web site are shown the OCR result
** OCR result is not applied if not validated by the user
** However, users tend to validate lists without changes even if there are errors, especially on mobile
* Other possible solutions
** Use the result of ingredient analysis to show users ingredients that were not recognized
** Show spell suggestions
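One way to surface unrecognized ingredients to the user is sketched below. The <code>KNOWN_INGREDIENTS</code> set stands in for the result of ingredient analysis and is purely hypothetical, as is the function name; a real implementation would rely on the actual analysis output rather than a plain set lookup.

```python
import re

# Hypothetical stand-in for the ingredients known to the analysis step.
KNOWN_INGREDIENTS = {"sugar", "cocoa butter", "milk powder", "salt"}


def unrecognized_ingredients(ingredients_text: str, known=KNOWN_INGREDIENTS) -> list:
    """Return the ingredients that are not in the known set, in order."""
    parts = [p.strip(" .\n").lower() for p in re.split(r"[,;]", ingredients_text)]
    return [p for p in parts if p and p not in known]


# The items returned here are the ones a validation screen could highlight.
print(unrecognized_ingredients("Sugar, cocoa buter, salt."))
```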
==== Spell correction ====
* Current solution:
** Currently only done during ingredients analysis, not during ingredients extraction
** Very simple (and slow) implementation of Peter Norvig's spelling correction algorithm
* Other possible solutions
** Spell checkers trained on ingredients
*** Elasticsearch spellchecker
*** SymSpell
* Metrics
** Recall and precision
* Test and training sets
** Language models can be built with lists of ingredients from OFF
*** e.g. including only ingredients lists from producers, or lists for which we have a very high ingredients recognition rate
** Test sets need to be created
*** Run spellcheckers on actual ingredients lists from OFF, review corrections
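The sketch below shows a simplified corrector in the style of Peter Norvig's algorithm, with word counts built from ingredients lists, together with a precision and recall computation over reviewed corrections. It is not the existing implementation: it only considers candidates one edit away, handles only a to z, and the <code>TRAINING_LISTS</code> and <code>REVIEWED</code> data are made up for the example.

```python
import re
from collections import Counter

# Hypothetical training data; in practice these would be trusted lists from OFF.
TRAINING_LISTS = [
    "sugar, cocoa butter, whole milk powder, cocoa mass, emulsifier, vanilla extract",
    "wheat flour, sugar, palm oil, salt, milk powder",
    "water, sugar, salt, wheat flour",
]
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def words(text: str) -> list:
    return re.findall(r"[a-z]+", text.lower())


# Word frequencies act as a very small unigram language model.
COUNTS = Counter(w for line in TRAINING_LISTS for w in words(line))


def edits1(word: str) -> set:
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word: str) -> str:
    """Return the most frequent known word within one edit, else the word itself."""
    if word in COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in COUNTS]
    return max(candidates, key=COUNTS.get) if candidates else word


def precision_recall(pairs) -> tuple:
    """pairs: (observed word, true word). Returns (precision, recall)."""
    attempted = fixed = needed = 0
    for observed, truth in pairs:
        guess = correct(observed)
        if observed != truth:
            needed += 1
        if guess != observed:
            attempted += 1
            if guess == truth:
                fixed += 1
    precision = fixed / attempted if attempted else 1.0
    recall = fixed / needed if needed else 1.0
    return precision, recall


# Hypothetical reviewed test set. "malt" is a valid word missing from the
# training lists, so the corrector wrongly changes it.
REVIEWED = [("suger", "sugar"), ("mlik", "milk"), ("flour", "flour"),
            ("malt", "malt"), ("vanila", "vanilla")]
print(precision_recall(REVIEWED))
```

The "malt" case illustrates why precision matters here: a corrector trained on too few lists rewrites valid ingredients it has never seen, which destroys data in the same way as an over-eager cutting step.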