Ingredients Extraction and Analysis
Revision as of 15:06, 10 September 2019
This page describes how ingredients list extraction and ingredients analysis are done on Open Food Facts, and points to resources that could be used to improve them.
Objectives
Ingredients list extraction
The goal of ingredients list extraction is to get the text of the ingredients list of each product in exactly the same form as it appears on the product package and label.
Ingredients analysis
Once the ingredients list is available, we need to analyze it to recognize the actual ingredients and indications of quantity, labels, processing etc. There is a lot of variety in how ingredients are listed on products, with many different synonyms, ways to indicate sub-ingredients etc.
The analysis needs to work for ingredients lists written in many different languages.
The output is structured data that links to our multilingual ingredients taxonomy.
Why it's important
Ingredients list extraction and analysis is necessary for many tasks:
- Detecting food additives and allergens
- Determining the degree of processing of food products (NOVA classification)
- Identifying food products that can or cannot be eaten by people following specific diets:
- Vegetarian, vegan
- Kosher, Halal
- Palm oil free
- Estimating the carbon impact of ingredients
- Translating ingredients lists
Ingredients list extraction
Data sources for ingredients lists
The possible input sources for the ingredients lists are:
- Ingredients lists typed in by users
- Time-consuming and unpleasant task, especially on mobile
- Can contain typos, but usually typed ingredients lists are very close to what is written on the product
- Ingredients lists given by manufacturers in data files
- Usually of very good quality, but depending on manufacturers, can contain typos and sometimes formatting errors
- Photos of product labels
- Photo quality varies a lot
- Some products are hard to photograph (round cans and bottles, foil bags etc.)
- Sometimes very poor lighting, orientation, camera, focus etc.
- High resolution images or PDFs of the printable package
- Available for a few producers
- Perfect quality
- Needs cropping and/or rotation to select the ingredients list
Steps for ingredients lists
Picture taking
- Taken with mobile app, uploaded to OFF server
Result:
- https://fr.openfoodfacts.org/images/products/500/011/255/8265/21.jpg
Ingredients list cropping
- Done on mobile app just after picture taking
- Cropping may be very inaccurate
- Or done on web site at a later time, possibly by another user
- Cropping slightly easier than on mobile
OCR
- Launched after cropping, done by the server which calls Google Cloud Vision
- Cloud Vision returns a JSON object which is stored on the server
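As a minimal sketch of how the OCR text could be read back from the stored JSON: the code below assumes the stored object has the layout of a Cloud Vision `images:annotate` REST response (a `responses` array whose entries carry `fullTextAnnotation.text` and `textAnnotations[0].description`). How the OFF server actually stores and reads this object is not described on this page, and the function name is hypothetical.

```python
import json


def ocr_text_from_response(raw_json: str) -> str:
    """Return the full OCR text from a Cloud Vision style response, or "" if none."""
    data = json.loads(raw_json)
    responses = data.get("responses", [])
    if not responses:
        return ""
    first = responses[0]
    # Preferred field: the full text of the image as a single string.
    full_text = first.get("fullTextAnnotation", {}).get("text")
    if full_text:
        return full_text
    # Fallback: the first text annotation also holds the whole detected text.
    annotations = first.get("textAnnotations", [])
    return annotations[0].get("description", "") if annotations else ""


# Hand-written sample, not a real OCR result.
sample = json.dumps(
    {"responses": [{"fullTextAnnotation": {"text": "Ingrédients: sucre, cacao"}}]}
)
print(ocr_text_from_response(sample))
```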
Ingredients list cutting
- The text returned by the OCR can also contain content other than the ingredients list
- Things that are not ingredients
- Ingredients in other languages
- The word "Ingredients:"
- Current solution
- Hardcoded regular expressions
- Other possible solutions
- Language identification to remove other languages
- Metrics
- False negatives (words before or after the ingredients list that should have been removed but were kept)
- False positives (words that were removed but are part of the ingredients list and should have been kept)
- It is very important to have as few false positives as possible, because they destroy data
- Test and training sets
- Only a few ad hoc tests run during builds
- Test sets need to be created
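The cutting step and its two metrics can be sketched as follows. This is an illustration only: the header pattern, the "stop at the first blank line" rule and both function names are assumptions, since the hardcoded regular expressions used in production are not shown on this page. The metric follows the definitions above: a false positive is an expected word that was removed, a false negative is an extra word that was kept.

```python
import re
from collections import Counter
from typing import Tuple

# Hypothetical header pattern: "Ingredients:" or "Ingrédients:", any case.
HEADER = re.compile(r"\bingr[ée]dients?\s*:\s*", re.IGNORECASE)


def cut_ingredients(ocr_text: str) -> str:
    """Keep the text after the first header, up to the first blank line."""
    match = HEADER.search(ocr_text)
    body = ocr_text[match.end():] if match else ocr_text
    body = re.split(r"\n\s*\n", body, maxsplit=1)[0]
    return " ".join(body.split())  # collapse line breaks and repeated spaces


def cutting_errors(predicted: str, expected: str) -> Tuple[int, int]:
    """Return (false_positives, false_negatives), counted in words."""
    pred, exp = Counter(predicted.split()), Counter(expected.split())
    false_positives = sum((exp - pred).values())  # removed but should be kept
    false_negatives = sum((pred - exp).values())  # kept but should be removed
    return false_positives, false_negatives


ocr = "Nutrition facts\nIngredients: sugar, cocoa butter,\nmilk powder\n\nBest before: see lid"
print(cut_ingredients(ocr))
```

A test set for this step could then be a list of (OCR text, expected ingredients list) pairs, with the two counts summed over all pairs.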
Validation and/or correction by users
- Current solution:
- Users on the app or the web site are shown the OCR result
- OCR result is not applied if not validated by the user
- but users tend to validate lists without changes even if there are errors, especially on mobile
- Other possible solutions
- Use the result of ingredient analysis to show users ingredients that were not recognized
- Show spelling suggestions
Spell correction
- Current solution:
- Currently only done during ingredients analysis, not during ingredients extraction
- Very simple (and slow) implementation of Peter Norvig's spelling corrector algorithm
- Other possible solutions
- Spell checkers trained on ingredients
- Elasticsearch spellchecker
- SymSpell
- Metrics
- Recall and precision
- Test and training sets
- Language models can be built with lists of ingredients from OFF
- e.g. including only ingredients lists from producers, or lists for which we have a very high ingredients recognition rate
- Test sets need to be created
- Run spellcheckers on actual ingredients lists from OFF, review corrections
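To make the idea of a spell checker trained on ingredients concrete, here is a minimal sketch in the style of Peter Norvig's corrector, restricted to a single edit for brevity (Norvig's original also tries two edits). It is not the Perl implementation used on the server: the function names, the alphabet and the tiny training lists are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed alphabet: ASCII letters plus common French accented letters.
ALPHABET = "abcdefghijklmnopqrstuvwxyzàâçéèêëîïôùûü"


def build_model(ingredient_lists):
    """Count word frequencies in trusted ingredients lists."""
    counts = Counter()
    for text in ingredient_lists:
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return counts


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in ALPHABET]
    inserts = [l + c + r for l, r in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct(word, model):
    """Return the most frequent known word within one edit, else the word itself."""
    if word in model:
        return word
    candidates = [w for w in edits1(word) if w in model]
    # Ties on frequency are broken alphabetically to keep results deterministic.
    return max(candidates, key=lambda w: (model[w], w)) if candidates else word


# Tiny hand-written training set, standing in for high-quality OFF lists.
model = build_model(["sucre, farine de blé, sel", "sucre, beurre de cacao", "sel, sucre"])
print(correct("sucrre", model))
```

Precision and recall can then be measured by running `correct` over real ingredients lists and reviewing which changes were right, wrong, or missed.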
Ingredients analysis
Steps for ingredients analysis
Ingredients pre-parsing
- The ingredients list is transformed to make parsing easier
- Remove / normalize strange characters
- Expand abbreviations
- Split enumerations
- e.g. "Vitamines A, B et C" -> "Vitamine A, Vitamine B, Vitamine C"
- Additives E-numbers normalization (E330, e330, e-330, INS 330, SIN330 etc.)
- Additives classes + additive splits
- e.g. "Colour caramel" -> Colour: Caramel
- Split some "A of B, C and D" (but not all...)
- e.g. "Huile de palme, colza et tournesol" -> Huile de palme, huile de colza, huile de tournesol
- Handle * and other signs that indicate some ingredients are organic, fair trade etc.
- e.g. "Pomme*, ..., *: ingrédient issu de l'agriculture biologique" -> "Pomme bio"
- Current solution
- Perl code and regular expressions
- lib/ProductOpener/Ingredients.pm - preparse_ingredients_text()
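Two of the pre-parsing transformations listed above, E-number normalization and enumeration splitting, can be sketched with regular expressions. This is written in Python for illustration and is not a port of `preparse_ingredients_text()`: the patterns and function names are assumptions, and they only cover the example forms given in this section.

```python
import re

# Matches E330, e330, e-330, E 330, INS 330, SIN330, with an optional
# trailing letter as in E150a.
E_NUMBER = re.compile(r"\b(?:e|ins|sin)[\s-]?(\d{3,4}[a-z]?)\b", re.IGNORECASE)

# Matches French vitamin enumerations such as "Vitamines A, B et C".
VITAMINS = re.compile(
    r"\b(?i:vitamines)\s+((?:[A-Z]\d{0,2}\s*,\s*)*[A-Z]\d{0,2}\s+et\s+[A-Z]\d{0,2})\b"
)


def normalize_e_numbers(text):
    """Rewrite every recognized additive number to the canonical form E330."""
    return E_NUMBER.sub(lambda m: "E" + m.group(1).lower(), text)


def split_vitamins(text):
    """Expand "Vitamines A, B et C" into one entry per vitamin."""
    def expand(match):
        letters = re.split(r"\s*,\s*|\s+et\s+", match.group(1))
        return ", ".join("Vitamine " + letter for letter in letters)
    return VITAMINS.sub(expand, text)


print(normalize_e_numbers("E330, e330, e-330, INS 330, SIN330"))
print(split_vitamins("Vitamines A, B et C"))
```

Each rule is a separate function so that it can be tested on its own; a real pre-parser would need one such rule set per language.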
Ingredients parsing
- Separate individual ingredients and match them to the ingredients taxonomy
- Extract properties of ingredients
- Labels like organic, fair trade etc.
- quantity (%)
- processing (e.g. "cooked")
- origin (e.g. "France")
- Multi-level ingredients / sub-ingredients
- e.g. "Fromage (Lait, présure, sel)"
- Recognize when "A and B" is a single ingredient, or two ingredients
- Uses the taxonomy to make the determination
- Current solution
- Perl code and regular expressions + multilingual ingredients taxonomy
- lib/ProductOpener/Ingredients.pm - extract_ingredients_from_text()
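The multi-level structure described above can be illustrated with a small recursive parser. This is a simplified sketch, not the logic of `extract_ingredients_from_text()`: it only splits on commas, follows parentheses for sub-ingredients and extracts a percentage, and it does no taxonomy matching. The function names and the output keys (`text`, `percent`, `ingredients`) are assumptions.

```python
import re

PERCENT = re.compile(r"\s*(\d+(?:[.,]\d+)?)\s*%")


def split_top_level(text):
    """Split on commas that are not inside parentheses."""
    parts, depth, current = [], 0, []
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth = max(depth - 1, 0)
        if char == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    parts.append("".join(current).strip())
    return [part for part in parts if part]


def parse_ingredients(text):
    """Return a list of dicts, with sub-ingredients nested under "ingredients"."""
    result = []
    for part in split_top_level(text):
        item = {}
        open_paren = part.find("(")
        if open_paren != -1 and part.endswith(")"):
            # Text inside the outer parentheses is parsed as sub-ingredients.
            item["ingredients"] = parse_ingredients(part[open_paren + 1:-1])
            part = part[:open_paren]
        percent = PERCENT.search(part)
        if percent:
            item["percent"] = float(percent.group(1).replace(",", "."))
            part = PERCENT.sub("", part)
        result.append({"text": part.strip(), **item})
    return result


print(parse_ingredients("Fromage 40% (Lait, présure, sel), farine"))
```

Cases such as a percentage written inside the parentheses, or deciding whether "A and B" is one ingredient or two, are deliberately left out; the latter needs the taxonomy.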
End to end metrics
Known and unknown ingredients
- For each product, we have the number of known and unknown ingredients
- https://fr.openfoodfacts.org/ingredients?stats=1 (takes a while to load and to render in a browser)
- Results on Sept 10th 2019:
- 467564 ingredients:
- Type | Unique tags | Occurrences
- known | 3647 (0.78%) | 3458004 (81.99%)
- unknown | 463917 (99.22%) | 759598 (18.01%)
- all | 467565 (100.00%) | 4217602 (100.00%)
- Can be given for a subset of products
- https://fr.openfoodfacts.org/editor/scamark/ingredients?stats=1 (results from product data imported from Scamark / Leclerc)
- Results on Sept 10th 2019:
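The percentages in the Sept 10th 2019 results can be recomputed from the raw counts, which is a quick way to track the recognition rate over time. The sketch below only uses the known and unknown figures reported above (they sum to the 467564 ingredients of the headline figure); the function name is hypothetical.

```python
def percentage(part: int, total: int) -> float:
    """Share of part in total, as a percentage rounded to two decimals."""
    return round(100 * part / total, 2)


# Counts reported above for Sept 10th 2019.
known_tags, unknown_tags = 3647, 463917
known_occurrences, unknown_occurrences = 3458004, 759598

total_tags = known_tags + unknown_tags
total_occurrences = known_occurrences + unknown_occurrences

print("known unique tags:", percentage(known_tags, total_tags), "%")
print("known occurrences:", percentage(known_occurrences, total_occurrences), "%")
```

The gap between the two rates shows that a small number of known ingredients account for most occurrences, while most unique unknown tags are rare.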
Results of further ingredient analysis
- Number of products for which we are able to make a vegan / non-vegan or vegetarian / non-vegetarian determination
Resources
Data
- Ingredients text and result of ingredient parsing
- MongoDB JSON / JSONL exports: https://world.openfoodfacts.org/data
- API
- product URL on OFF: https://fr.openfoodfacts.org/produit/3560070223145/pistaches-grillees-carrefour
- add "/api/v0" to get JSON result: https://fr.openfoodfacts.org/api/v0/produit/3560070223145/pistaches-grillees-carrefour
- Sorted lists with counts of individual ingredients for a subset of products
- https://fr.openfoodfacts.org/ingredients?stats=1
- Ingredients that are known (they exist in our taxonomy): https://fr.openfoodfacts.org/editor/scamark/ingredients?status=known
- Unknown ingredients: https://fr.openfoodfacts.org/editor/scamark/ingredients?status=unknown
- OCR and spelling errors
- Parsing errors
- Things that are not ingredients and that should not be in the ingredient list
- Ingredients or synonyms that should be added to the taxonomy
- Can be given for a subset of products
- https://fr.openfoodfacts.org/editor/scamark/ingredients?status=unknown (for imported Scamark / Leclerc products)
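The "/api/v0" rewrite described under API above can be sketched as follows. The URL transformation follows the example given in this section; the function names are hypothetical, and `fetch_product` is shown only to illustrate the download step (it needs network access and is not called here).

```python
import json
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def api_url(product_url: str) -> str:
    """Insert "/api/v0" in front of the path of a product page URL."""
    parts = urlsplit(product_url)
    return urlunsplit(parts._replace(path="/api/v0" + parts.path))


def fetch_product(product_url: str) -> dict:
    """Download and decode the JSON result for a product page URL."""
    with urlopen(api_url(product_url)) as response:
        return json.load(response)


page = "https://fr.openfoodfacts.org/produit/3560070223145/pistaches-grillees-carrefour"
print(api_url(page))
```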