** Perfect quality
** Needs cropping and/or rotation to select the ingredients list
=== Steps for ingredients lists ===
==== Picture taking ====
* Taken with mobile app, uploaded to OFF server
==== Ingredients list cropping ====
* Done on mobile app just after picture taking
** Cropping may be very inaccurate
* Or done on web site at a later time, possibly by another user
** Cropping slightly easier than on mobile
==== OCR ====
* Launched after cropping, done by the server which calls Google Cloud Vision
* Cloud Vision returns a JSON object which is stored on the server
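As an illustration of reading the stored OCR result, here is a minimal sketch that pulls the detected text out of a Cloud Vision JSON response. It assumes the stored object keeps the standard REST response layout (<code>responses</code>, <code>fullTextAnnotation</code>, <code>textAnnotations</code>); the function name <code>extract_ocr_text</code> and the sample payload are hypothetical, and the way the OFF server actually stores the object may differ.

```python
import json


def extract_ocr_text(response_json: str) -> str:
    """Return the full text found in a Cloud Vision response, or "" if none."""
    data = json.loads(response_json)
    responses = data.get("responses", [])
    if not responses:
        return ""
    first = responses[0]
    # fullTextAnnotation.text holds the whole detected text in one string.
    full_text = first.get("fullTextAnnotation", {}).get("text")
    if full_text:
        return full_text
    # Fallback: the first textAnnotations entry describes the whole text block.
    annotations = first.get("textAnnotations", [])
    return annotations[0].get("description", "") if annotations else ""


# Hypothetical, heavily shortened response used only for illustration.
sample = json.dumps({
    "responses": [{
        "textAnnotations": [
            {"locale": "en", "description": "Ingredients: sugar, cocoa butter\n"},
            {"description": "Ingredients:"},
        ],
        "fullTextAnnotation": {"text": "Ingredients: sugar, cocoa butter\n"},
    }]
})

ocr_text = extract_ocr_text(sample)
```

The fallback to <code>textAnnotations</code> is a design choice for responses that lack <code>fullTextAnnotation</code>; an empty string is returned rather than raising, so a later step can simply skip images without text.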
==== Ingredients list cutting ====
* The image sent to the OCR can also contain other text content
** Things that are not ingredients
** Ingredients in other languages
** The word "Ingredients:"
* Current solution
** Hardcoded regular expressions
* Other possible solutions
** Language identification to remove other languages
* Metrics
** False negatives (words before or after the ingredients list that should have been removed but were kept)
** False positives (words that were removed but are part of the ingredients and should have been kept)
*** It is very important to have as few false positives as possible, because they destroy data
* Test and training sets
** Only a few ad hoc tests are run during builds
** Test sets need to be created
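The regular-expression approach and the two metrics above can be sketched as follows. This is only an illustration under stated assumptions: the patterns <code>START</code> and <code>END</code> are hypothetical (the expressions actually hardcoded in OFF are not shown here), and counting errors per word is one possible way to measure false positives and false negatives as defined above.

```python
import re
from collections import Counter

# Hypothetical patterns, for illustration only.
START = re.compile(r"ingr[ée]dients?\s*:?\s*", re.IGNORECASE)
END = re.compile(
    r"\b(?:nutrition(?:al)? (?:facts|information)|best before|store in)\b",
    re.IGNORECASE,
)


def cut_ingredients(text: str) -> str:
    """Drop text before the 'Ingredients:' marker and after a known end marker."""
    start = START.search(text)
    if start:
        text = text[start.end():]
    end = END.search(text)
    if end:
        text = text[:end.start()]
    return text.strip(" \n.")


def cutting_errors(predicted: str, expected: str) -> tuple:
    """Return (false_positives, false_negatives) counted in words.

    False positive: a word of the expected list that was removed.
    False negative: a word that should have been removed but was kept.
    """
    def words(s):
        return Counter(re.findall(r"\w+", s.lower()))

    pred, gold = words(predicted), words(expected)
    false_positives = sum((gold - pred).values())
    false_negatives = sum((pred - gold).values())
    return false_positives, false_negatives


ocr = "Chocolate bar 100g Ingredients: sugar, cocoa butter, milk. Best before: see back"
cut = cut_ingredients(ocr)
errors = cutting_errors(cut, "sugar, cocoa butter, milk")
```

Because false positives destroy data, a test set built this way could weight them more heavily than false negatives when comparing alternative cutting methods.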
==== Validation and/or correction by users ====
* Current solution:
** Users on the app or the web site are shown the OCR result
** The OCR result is not applied unless the user validates it
** But users tend to validate lists without changes even if there are errors, especially on mobile
* Other possible solutions
** Use the result of ingredient analysis to show users ingredients that were not recognized
** Show spelling suggestions
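The idea of pointing users at unrecognized ingredients could look roughly like the sketch below. It assumes a simple split on commas and semicolons and a toy set of known ingredients; <code>KNOWN</code> and <code>unrecognized_ingredients</code> are hypothetical names, and the real ingredient analysis in OFF is more involved than a set lookup.

```python
import re

# Hypothetical toy list of known ingredients, for illustration only.
KNOWN = {"sugar", "cocoa butter", "milk powder", "salt"}


def unrecognized_ingredients(ingredients_text: str, known=KNOWN) -> list:
    """Return the ingredients (lowercased, in order) that are not in `known`."""
    parts = re.split(r"[,;]", ingredients_text)
    cleaned = [part.strip(" .\n").lower() for part in parts]
    return [part for part in cleaned if part and part not in known]


to_review = unrecognized_ingredients("Sugar, cocoa buter, milk powder, salt.")
```

Here <code>to_review</code> contains only the misspelled <code>cocoa buter</code>, which is the kind of item an interface could highlight before the user validates the list.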
==== Spell correction ====
* Current solution:
** Currently only done during ingredients analysis, not during ingredients extraction
** Very simple (and slow) implementation of Peter Norvig's spelling correction algorithm
* Other possible solutions
** Spell checkers trained on ingredients
*** Elasticsearch spellchecker
*** SymSpell
* Metrics
** Recall and precision
* Test and training sets
** Language models can be built with lists of ingredients from OFF
*** e.g. including only ingredients lists from producers, or lists for which we have a very high ingredients recognition rate
** Test sets need to be created
*** Run spellcheckers on actual ingredients lists from OFF, review corrections
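To make the approach concrete, here is a minimal sketch in the spirit of Peter Norvig's corrector, trained on ingredients lists and scored with precision and recall. It is not the OFF implementation: it is simplified to edit distance 1 and lowercase ASCII letters, and the names <code>train</code>, <code>correct</code>, <code>precision_recall</code> and the sample data are all hypothetical.

```python
import re
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def train(ingredient_lists):
    """Count word frequencies over a collection of ingredients lists."""
    counts = Counter()
    for text in ingredient_lists:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def edits1(word):
    """All strings one delete, transpose, replace or insert away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Return the most frequent known word within one edit, else `word`."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.get) if candidates else word


def precision_recall(cases, counts):
    """cases: (ocr_word, expected_word) pairs reviewed by hand."""
    changed = needed = fixed = 0
    for ocr_word, expected in cases:
        output = correct(ocr_word, counts)
        changed += output != ocr_word
        if ocr_word != expected:
            needed += 1
            fixed += output == expected
    precision = fixed / changed if changed else 1.0
    recall = fixed / needed if needed else 1.0
    return precision, recall


counts = train(["sugar, cocoa butter, milk", "sugar, salt, milk powder"])
cases = [("suger", "sugar"), ("milk", "milk"), ("buttr", "butter"), ("qwrt", "salt")]
precision, recall = precision_recall(cases, counts)
```

Precision here is the share of applied corrections that were right, and recall is the share of misspelled words that were fixed. Generating candidates for every unknown word is what makes this style of corrector slow on large vocabularies, which is one motivation for the alternatives listed above.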