Ingredients Extraction and Analysis

Latest revision as of 12:59, 27 August 2024

This page describes how ingredients list extraction and ingredients analysis are done on Open Food Facts, and points to resources that could be used to improve them.

TL;DR

If you want to improve ingredients analysis, this presentation is meant to be easy to read and immediately actionable: https://docs.google.com/presentation/d/1oAyHDrPtNEtbnrn6oy4g67vOfVCHwVR3kR_falQEKyY/edit#slide=id.ge3a97f2683_0_104
You should also read the step-by-step tutorial on editing the taxonomy: Taxonomy Maintenance

Objectives

Ingredients list extraction

The goal of ingredients list extraction is to get the text of the ingredients list of each product in exactly the same form as it appears on the product package and label.

Ingredients analysis

Once the ingredients list is available, we need to analyze it to recognize the actual ingredients and indications of quantity, labels, processing etc. There is a lot of variety in how ingredients are listed on products, with many different synonyms, ways to indicate sub-ingredients etc.

The analysis needs to work for ingredients lists written in many different languages.

The output is structured data that links to our multilingual ingredients taxonomy.

Why it's important

Ingredients list extraction and analysis is necessary for many tasks:

Detecting food additives and allergens
Determining the degree of processing of food products (NOVA classification)
Identifying food products that can be or cannot be eaten by people following specific diets:
- Vegetarian, vegan
- Kosher, Halal
- No palm oil
Estimating the carbon impact of ingredients
Translating ingredients lists

High level view

Ingredients list extraction

Data sources for ingredients lists

The possible input sources for the ingredients lists are:

Ingredients lists typed in by users
- Time-consuming and unpleasant task, especially on mobile
- Can contain typos, but usually typed ingredients lists are very close to what is written on the product
Ingredients lists given by manufacturers in data files
- Usually of very good quality, but depending on manufacturers, can contain typos and sometimes formatting errors
Photos of product labels
- Photo quality varies a lot
  - Some products are hard to photograph (round cans and bottles, foil bags etc.)
  - Sometimes very poor lighting, orientation, camera, focus etc.
High resolution images or PDFs of the printable package
- Available for a few producers
- Perfect quality
- Needs cropping and/or rotation to select the ingredients list

Steps for ingredients lists

Sample product for examples: https://fr.openfoodfacts.org/produit/5000112558265/coca-cola-zero

Picture taking

Taken with mobile app, uploaded to OFF server

Ingredients list cropping

Done on mobile app just after picture taking
- Cropping may be very inaccurate
Or done on web site at a later time, possibly by another user
- Cropping slightly easier than on mobile

OCR

Launched after cropping, done through the server
Current solution:
- Google Cloud Vision
- Cloud Vision returns a JSON object which is stored on the server

Result:

https://static.openfoodfacts.org/images/products/500/011/255/8265/ingredients.22.full.json
"rédients:eaugazeitiee colorant:caramelE15M difiants: acide phosphorique et Citrate de sodium: édulcorants: aspartame etacésulfame-K;extraitsvegétaux Contientunesourcedephénylalanine."
Your mileage will vary a lot
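
As a rough illustration of how the recognized text could be read back from the stored JSON, here is a minimal Python sketch. It assumes the file follows the standard Cloud Vision annotate response layout (a "responses" array whose first entry holds "fullTextAnnotation" and "textAnnotations"); the function name is hypothetical and this is not the server's actual code.

```python
import json
from typing import Any, Dict


def ocr_text_from_cloud_vision(result: Dict[str, Any]) -> str:
    """Return the full recognized text from a Cloud Vision annotate response.

    Assumes the usual layout: {"responses": [{"fullTextAnnotation": {"text": ...},
    "textAnnotations": [{"description": ...}, ...]}]}.
    """
    responses = result.get("responses") or [{}]
    first = responses[0]
    full = first.get("fullTextAnnotation") or {}
    if full.get("text"):
        return full["text"]
    # Fall back to the first text annotation, which holds the whole text block
    annotations = first.get("textAnnotations") or []
    return annotations[0].get("description", "") if annotations else ""


sample = json.loads(
    '{"responses": [{"fullTextAnnotation": {"text": "eau, sucre"}}]}'
)
print(ocr_text_from_cloud_vision(sample))
```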

Ingredients list cutting

The text returned by the OCR can also contain content other than the ingredients list
- Things that are not ingredients
- Ingredients in other languages
- The word "Ingredients:"
Current solution
- Hardcoded regular expressions
Other possible solutions
- Language identification to remove other languages
Metrics
- False negatives (words before or after the ingredients list that should have been removed)
- False positives (words that were removed but are part of the ingredients and should have been kept)
  - It is very important to have as few false positives as possible, because they destroy data
Test and training sets
- Only a few ad hoc tests are run during builds
- Test sets need to be created

More details can be found on the page Ingredients List Cutting.
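
The regular-expression approach can be sketched as follows. This is only an illustration in Python: the two patterns are hypothetical stand-ins for the hardcoded expressions, which are language-specific and more numerous in practice. Note that a naive start marker like this one would also match the word "ingrédient" inside a footnote, which is exactly the kind of destructive false positive described above.

```python
import re

# Hypothetical patterns, for illustration only
START = re.compile(r"ingr[ée]dients?\s*:?\s*", re.IGNORECASE)
END = re.compile(
    r"\b(valeurs nutritionnelles|nutrition facts|à conserver|best before)\b",
    re.IGNORECASE,
)


def cut_ingredients_list(ocr_text: str) -> str:
    """Keep only the text between the 'Ingredients:' marker and the first end marker."""
    text = " ".join(ocr_text.split())  # collapse line breaks and repeated spaces
    start = START.search(text)
    if start:
        text = text[start.end():]
    end = END.search(text)
    if end:
        text = text[:end.start()]
    return text.strip(" .;:-")


print(cut_ingredients_list(
    "Coca-Cola Zero. Ingrédients: eau gazéifiée, colorant: E150d. "
    "Valeurs nutritionnelles pour 100 ml"
))
```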

Validation and/or correction by users

Current solution:
- Users on the app or the web site are shown the OCR result
- OCR result is not applied if not validated by the user
- but users tend to validate lists without changes even if there are errors, especially on mobile
Other possible solutions
- Use the result of ingredient analysis to show users ingredients that were not recognized
- Show spell suggestions

Spell correction

Current solution:
- Currently only done during ingredients analysis, not during ingredients extraction
- Very simple (and slow) implementation of Peter Norvig's spelling correction algorithm
Other possible solutions
- Spell checkers trained on ingredients
  - Elasticsearch spellchecker
  - SymSpell
Metrics
- Recall and precision
Test and training sets
- Language models can be built with lists of ingredients from OFF
  - e.g. including only ingredients lists from producers, or lists for which we have a very high ingredients recognition rate
- Test sets need to be created
  - Run spellcheckers on actual ingredients lists from OFF, review corrections
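
For readers unfamiliar with the Norvig approach, here is a minimal Python sketch of the idea: build word counts from trusted ingredients lists, then replace an unknown word with the most frequent known word one edit away. It is a simplification (Norvig's original also tries two edits), the alphabet and function names are assumptions, and it is not the Perl code used on the server.

```python
import re
from collections import Counter
from typing import Iterable, Set

# Assumed alphabet: lowercase letters plus common French accented letters
LETTERS = "abcdefghijklmnopqrstuvwxyzàâçéèêëîïôùûü-"


def build_model(ingredient_lists: Iterable[str]) -> Counter:
    """Count word occurrences in a set of trusted ingredients lists."""
    counts: Counter = Counter()
    for text in ingredient_lists:
        counts.update(re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text.lower()))
    return counts


def edits1(word: str) -> Set[str]:
    """All strings one deletion, transposition, replacement or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word: str, model: Counter) -> str:
    """Return the word itself if known, else the most frequent known neighbour."""
    if word in model:
        return word
    candidates = [w for w in edits1(word) if w in model]
    return max(candidates, key=model.__getitem__) if candidates else word


model = build_model(["eau gazéifiée, colorant, acide phosphorique", "eau, sucre, colorant"])
print(correct("colorent", model))
```

Restricting the model to lists from producers, as suggested above, keeps OCR errors out of the vocabulary so that they are not treated as valid words.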

Desired result:

"Eau gazéifiée ; colorant : E150d ; acidifiants : acide phosphorique, citrate de sodium ; édulcorants : aspartame, acésulfame-K ; arômes naturels (extraits végétaux), dont caféine."

Ingredients analysis

Ingredients taxonomy

Ingredients analysis is about matching the ingredient list to known ingredients in our multilingual ingredients taxonomy.

So a pre-requisite for good ingredients analysis is to have a comprehensive ingredients taxonomy that includes translations and synonyms in the target language.

The taxonomy is on GitHub (it is a very big file): https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/ingredients.txt

Ingredients processing taxonomy

Many ingredients lists also contain information on how the ingredients have been processed (e.g. "cooked pork meat", "sliced tomatoes", "powdered garlic").

Instead of listing all possible combinations of processing for each ingredient in the ingredients taxonomy, we have created a taxonomy of processing methods that we use during ingredients parsing:

https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/ingredients_processing.txt

Steps for ingredients analysis

Ingredients pre-parsing

The ingredients list is transformed to make parsing easier
- Remove / normalize strange characters
- De-abbreviate abbreviations
- Split enumerations
  - e.g. "Vitamines A, B et C" -> Vitamine A, Vitamine B, Vitamine C
- Additives E-numbers normalization (E330, e330, e-330, INS 330, SIN330 etc.)
- Additives classes + additive splits
  - e.g. "Colour caramel" -> Colour: Caramel
- Split some "A of B, C and D" (but not all...)
  - e.g. "Huile de palme, colza et tournesol" -> Huile de palme, huile de colza, huile de tournesol
- Handle * and other signs that indicate some ingredients are organic, fair trade etc.
  - e.g. "Pomme*, ..., *: ingrédient issu de l'agriculture biologique" -> "Pomme bio"
Current solution
- Perl code and regular expressions
  - lib/ProductOpener/Ingredients.pm - preparse_ingredients_text()
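
Two of the transformations above (E-number normalization and splitting a vitamin enumeration) can be sketched in Python as follows. The patterns and function names are assumptions made for illustration. They are far simpler than the rules in preparse_ingredients_text() and only cover the French "et" conjunction.

```python
import re

# e330, e-330, INS 330, SIN330 ... -> E330 (hypothetical simplified pattern)
E_NUMBER = re.compile(r"\b(?:e|ins|sin)[\s\-]?(\d{3,4}[a-z]?)\b", re.IGNORECASE)

# "Vitamines A, B et C": the vitamin letters stay case-sensitive so that
# "vitamines et minéraux" is left untouched
VITAMINS = re.compile(
    r"\b(?i:vitamines?)\s+"
    r"([A-Z]\d{0,2}\b(?:\s*,\s*[A-Z]\d{0,2}\b)*(?:\s+et\s+[A-Z]\d{0,2}\b)?)"
)


def normalize_e_numbers(text: str) -> str:
    return E_NUMBER.sub(lambda m: "E" + m.group(1).lower(), text)


def split_vitamins(text: str) -> str:
    def expand(match: "re.Match[str]") -> str:
        names = re.split(r"\s*,\s*|\s+et\s+", match.group(1))
        return ", ".join("Vitamine " + name for name in names)

    return VITAMINS.sub(expand, text)


def preparse(text: str) -> str:
    return split_vitamins(normalize_e_numbers(text))


print(preparse("Vitamines A, B et C, acidifiant: e-330"))
```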

Ingredients parsing

Separate individual ingredients and match them to the ingredients taxonomy
- Extract properties of ingredients
  - Labels like organic, fair trade etc.
  - quantity (%)
  - processing (e.g. "cooked")
  - origin (e.g. "France")
- Multi-level ingredients / sub-ingredients
  - e.g. "Fromage (Lait, présure, sel)"
- Recognize when "A and B" is a single ingredient, or 2 ingredients
  - Uses the taxonomy to make the determination
Current solution
- Perl code and regular expressions + multilingual ingredients taxonomy
  - lib/ProductOpener/Ingredients.pm - extract_ingredients_from_text()
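
The separation step, including sub-ingredients and percentages, can be sketched with a small recursive parser. This Python example is only an illustration under simplifying assumptions: it splits on top-level commas and semicolons, reads a "%" quantity written with a decimal point (decimal commas are not handled), and does no taxonomy matching, label, processing or origin extraction. All names are hypothetical.

```python
import re
from typing import Any, Dict, List

PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s*%")


def split_top_level(text: str) -> List[str]:
    """Split on commas/semicolons that are not inside parentheses or brackets."""
    parts: List[str] = []
    current: List[str] = []
    depth = 0
    for ch in text:
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth = max(depth - 1, 0)
        if ch in ",;" and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return [p.strip() for p in parts if p.strip()]


def parse_ingredients(text: str) -> List[Dict[str, Any]]:
    ingredients: List[Dict[str, Any]] = []
    for part in split_top_level(text):
        sub_text = ""
        open_pos, close_pos = part.find("("), part.rfind(")")
        if 0 <= open_pos < close_pos:
            sub_text = part[open_pos + 1:close_pos].strip()
            part = part[:open_pos] + " " + part[close_pos + 1:]
        if PERCENT.fullmatch(sub_text):
            # "Tomates (33%)": the parentheses hold a quantity, not sub-ingredients
            part, sub_text = part + " " + sub_text, ""
        ingredient: Dict[str, Any] = {}
        match = PERCENT.search(part)
        if match:
            ingredient["percent"] = float(match.group(1))
            part = part[:match.start()] + " " + part[match.end():]
        ingredient["text"] = " ".join(part.split()).strip(" .:")
        if sub_text:
            ingredient["ingredients"] = parse_ingredients(sub_text)
        ingredients.append(ingredient)
    return ingredients


print(parse_ingredients("Fromage 20% (Lait, présure, sel), eau"))
```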

Ingredient percent analysis

Goal: for each ingredient and sub-ingredient, we compute the minimum and maximum absolute percent
Constraints:
- Some ingredients have a % specified
- Ingredients are listed in descending order of quantity
Current solution
- Perl code in lib/ProductOpener/Ingredients.pm - compute_ingredients_percent_values()
- We use a recursive function to go through ingredients, sub-ingredients, sub-sub-ingredients etc.
- For each list of ingredients (or sub-ingredients), we assign starting min, max or specific percent values that comply with the constraints
  - If a % is specified, we use it
  - otherwise the min is set to 0 and the max to the max of the parent ingredient (or 100% if there is no parent)
- We then apply logic rules based on the constraints:
  - The max of an ingredient must be lower than or equal to the max of the ingredient that appears before
  - The max of an ingredient must be lower than or equal to the total max minus the sum of the minimums of all ingredients that appear before
  - The max of the 3rd ingredient must be lower than or equal to the max of the 1st ingredient divided by 2 (and similarly, the max of the 4th ingredient must be lower than or equal to the max of the 1st ingredient divided by 3, etc.)
  - The min of an ingredient must be greater than or equal to the total min minus the sum of the maximums of all ingredients that appear before, divided by the number of ingredients that appear after + the current ingredient
  - The min of an ingredient must be greater than or equal to the min of an ingredient that appears after
  - The max of an ingredient must be lower than or equal to the total max minus the sum of the minimums of all the ingredients after, divided by the number of ingredients that appear before + the current ingredient
  - The min of the first ingredient in the list must be greater than or equal to the total min minus the sum of the maximums of all the ingredients after
- We then reapply all those rules as long as we can apply a new one, based on new min and max values of the ingredients
Issues:
- We may end up with impossible values if the ingredients list was not analyzed correctly (e.g. wrong nesting, bad % etc.)
  - in that case we delete the min and max values
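
To make the rule-propagation idea concrete, here is a simplified Python sketch for a single flat list (no sub-ingredients). It assumes the list sums to a fixed total of 100% and applies a subset of the rules above repeatedly until nothing changes. The function name and structure are hypothetical. It is not a port of compute_ingredients_percent_values().

```python
from typing import List, Optional, Tuple


def percent_bounds(
    specified: List[Optional[float]], total: float = 100.0
) -> List[Tuple[float, float]]:
    """Return (min, max) percent bounds for ingredients listed in descending order.

    `specified` holds the % printed on the label, or None when absent.
    """
    n = len(specified)
    mins = [p if p is not None else 0.0 for p in specified]
    maxs = [p if p is not None else total for p in specified]
    for _ in range(100):  # safety cap on the number of passes
        before = (list(mins), list(maxs))
        for i in range(n):
            if i > 0:
                # Not more than the previous ingredient
                maxs[i] = min(maxs[i], maxs[i - 1])
                # Not more than what the previous minimums leave available
                maxs[i] = min(maxs[i], total - sum(mins[:i]))
            # The i ingredients before are at least as large as this one
            maxs[i] = min(maxs[i], (total - sum(mins[i + 1:])) / (i + 1))
            # What the previous maximums leave must fit in this and later ones
            mins[i] = max(mins[i], (total - sum(maxs[:i])) / (n - i))
            if i < n - 1:
                # Not less than the next ingredient
                mins[i] = max(mins[i], mins[i + 1])
        if n > 0:
            mins[0] = max(mins[0], total - sum(maxs[1:]))
        if (mins, maxs) == before:
            break
    return list(zip(mins, maxs))


# Eight ingredients without any % on the label, as in the Coca-Cola Zero example
for low, high in percent_bounds([None] * 8):
    print(round(low, 2), round(high, 2))
```

For eight ingredients with no specified percentage, this yields a minimum of 12.5 for the first ingredient and maximums of 100, 50, 33.33, 25, 20, 16.67, 14.29 and 12.5, which agrees with the percent_min and percent_max values in the sample result on this page. A fuller version would also detect impossible bounds (min greater than max) and discard them, as described under Issues.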

Ingredients analysis result:

Add /api/v2 to the product URL to get JSON results through the API: https://fr.openfoodfacts.org/api/v2/produit/5000112558265/coca-cola-zero

"ingredients": [
  {
    "id": "en:carbonated-water",
    "percent_estimate": 56.25,
    "percent_max": 100,
    "percent_min": 12.5,
    "text": "Eau gazéifiée",
    "vegan": "yes",
    "vegetarian": "yes"
  },
  {
    "id": "en:colour",
    "ingredients": [
      {
        "id": "en:e150d",
        "percent_estimate": 21.875,
        "percent_max": 50,
        "percent_min": 0,
        "text": "e150d",
        "vegan": "yes",
        "vegetarian": "yes"
      }
    ],
    "percent_estimate": 21.875,
    "percent_max": 50,
    "percent_min": 0,
    "text": "colorant"
  },
  {
    "id": "en:acid",
    "ingredients": [
      {
        "id": "en:e338",
        "percent_estimate": 10.9375,
        "percent_max": 33.3333333333333,
        "percent_min": 0,
        "text": "acide phosphorique",
        "vegan": "yes",
        "vegetarian": "yes"
      }
    ],
    "percent_estimate": 10.9375,
    "percent_max": 33.3333333333333,
    "percent_min": 0,
    "text": "acidifiants"
  },
  {
    "id": "en:sodium-citrate",
    "percent_estimate": 5.46875,
    "percent_max": 25,
    "percent_min": 0,
    "text": "citrate de sodium"
  },
  {
    "id": "en:sweetener",
    "ingredients": [
      {
        "id": "en:e951",
        "percent_estimate": 2.734375,
        "percent_max": 20,
        "percent_min": 0,
        "text": "aspartame",
        "vegan": "yes",
        "vegetarian": "yes"
      }
    ],
    "percent_estimate": 2.734375,
    "percent_max": 20,
    "percent_min": 0,
    "text": "édulcorants"
  },
  {
    "id": "en:e950",
    "percent_estimate": 1.3671875,
    "percent_max": 16.6666666666667,
    "percent_min": 0,
    "text": "acésulfame-K",
    "vegan": "yes",
    "vegetarian": "yes"
  },
  {
    "id": "en:natural-flavouring",
    "ingredients": [
      {
        "id": "en:extract",
        "labels": "en:vegan",
        "percent_estimate": 0.68359375,
        "percent_max": 14.2857142857143,
        "percent_min": 0,
        "text": "extraits",
        "vegan": "en:yes",
        "vegetarian": "en:yes"
      }
    ],
    "percent_estimate": 0.68359375,
    "percent_max": 14.2857142857143,
    "percent_min": 0,
    "text": "arômes naturels",
    "vegan": "maybe",
    "vegetarian": "maybe"
  },
  {
    "id": "en:caffeine",
    "percent_estimate": 0.68359375,
    "percent_max": 12.5,
    "percent_min": 0,
    "text": "dont caféine",
    "vegan": "yes",
    "vegetarian": "yes"
  }
],
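
Because sub-ingredients are nested under an "ingredients" key, consumers of this result usually need to walk the tree. A minimal Python sketch of such a traversal is shown below. The function name is hypothetical and the sample data is a trimmed copy of the result above.

```python
from typing import Any, Dict, Iterator, List, Tuple


def walk_ingredients(
    ingredients: List[Dict[str, Any]], depth: int = 0
) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (depth, ingredient) for every ingredient and sub-ingredient, in order."""
    for ingredient in ingredients:
        yield depth, ingredient
        yield from walk_ingredients(ingredient.get("ingredients", []), depth + 1)


sample = [
    {"id": "en:carbonated-water", "text": "Eau gazéifiée"},
    {"id": "en:colour", "text": "colorant",
     "ingredients": [{"id": "en:e150d", "text": "e150d"}]},
]
for depth, ingredient in walk_ingredients(sample):
    print("  " * depth + ingredient["id"])
```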

End to end metrics

Known and unknown ingredients

For each product, we have the number of known and unknown ingredients
https://fr.openfoodfacts.org/ingredients?stats=1 (takes a while to load and to render in a browser)
- Results on Sept 10th 2019:
  - 467564 ingredients:
  - Type | Unique tags | Occurrences
  - known | 3647 (0.78%) | 3458004 (81.99%)
  - unknown | 463917 (99.22%) | 759598 (18.01%)
  - all | 467565 (100.00%) | 4217602 (100.00%)
- Can be given for a subset of products
  - https://fr.openfoodfacts.org/editor/scamark/ingredients?stats=1 (results from product data imported from Scamark / Leclerc)
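
The percentages in these statistics are simple shares of the totals. The short Python check below recomputes them from the counts reported above; the helper name is hypothetical.

```python
def share(count: int, total: int) -> float:
    """Percentage of `count` in `total`, rounded to two decimals."""
    return round(100 * count / total, 2)


# Counts from the Sept 10th 2019 results
print(share(3458004, 4217602))  # share of occurrences that are known ingredients
print(share(759598, 4217602))   # share of occurrences that are unknown ingredients
print(share(3647, 467565))      # share of unique tags that are known
```

The gap between the two views is the useful signal: known ingredients are a tiny share of unique tags but the large majority of occurrences, so most unknown tags are rare strings such as OCR or spelling errors.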

Steps to edit and correct errors in a product

Open the product
- If there is no main front picture then select one from the pictures
- If there are no pictures of anything then delete the product
- If it is not a food product then move it to the relevant database (Beauty, Pet, etc.; see 'Non-Food Products' in How to move these products)
- Determine if the language of the ingredients matches the main language
- Check whether the ingredients look right or are totally wrong (for example, they contain the nutrition information or total junk). If they are totally wrong then delete the ingredients.
- If an ingredients label isn't selected then try to find one in the main language and cut it to size. If there is no label in any language then the ingredients aren't reliable so delete them.
- Check the ingredients match the ingredients in the label. If not then extract them. If the ingredients are not extractable or readable then delete them.
- If ingredients match then look for the reason why the ingredient is unknown (mis-spelt etc.)
- If it's an unknown ingredient then add it to the ingredients.txt taxonomy file. If the ingredient exists in English then just add the ingredient for the language you are working in.
- If the ingredient doesn't exist in any language then try to find a parent of the ingredient and add it in that section.
- Check the nutritional information if it exists and correct where necessary
- If the ingredients list doesn't match the main language then change the main language and select the tick box (if you are an administrator) to move all data and images to the main language.

Depending on the level of editing/cleaning, optionally run the list of ingredients through the Ingredients Analysis Test for the language you are working with.

Results of further ingredient analysis

Number of products for which we are able to make a vegan / non-vegan or vegetarian / non-vegetarian determination
- A non vegetarian/vegan ingredient triggers a non vegetarian/vegan result for the product
- But to mark a product as vegetarian/vegan, we must have recognized all the ingredients of the product
- https://fr.openfoodfacts.org/ingredients-analysis
  - France - Feb 8th 2020
    - Non-vegan: 104789
    - Vegan status unknown: 81966
    - Vegan: 27034
    - Maybe vegan: 6029
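
The two rules above can be expressed as a small decision function. The Python sketch below takes one value per recognized or unrecognized ingredient ("yes", "no", "maybe", or None when the ingredient is unknown). The precedence of "unknown" over "maybe" is an assumption made for this example, and the function name is hypothetical.

```python
from typing import Iterable, Optional


def product_vegan_status(ingredient_values: Iterable[Optional[str]]) -> str:
    """Combine per-ingredient vegan values into a product-level status."""
    values = list(ingredient_values)
    if "no" in values:
        # A single non-vegan ingredient is enough to mark the product non-vegan
        return "no"
    if not values or any(value is None for value in values):
        # Not all ingredients were recognized, so nothing can be concluded
        return "unknown"
    if "maybe" in values:
        return "maybe"
    return "yes"


print(product_vegan_status(["yes", "yes", "maybe"]))
print(product_vegan_status(["yes", None, "no"]))
```

The same function applies to the vegetarian status by passing the per-ingredient vegetarian values instead.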

Number of products for which we are able to make a NOVA determination
- An ingredient marked as a NOVA 3 or NOVA 4 marker automatically makes the product NOVA 3 or 4
- NOVA classification not made if too many ingredients are unknown (unless we have a NOVA 4 marker)
- As a result, this number is probably not a good indication of the quality of ingredients recognition

Ingredients analysis quality

Ingredients_Analysis_Quality explains how we measure and monitor the quality of the ingredients analysis.

Resources

Data

Ingredients text and result of ingredient parsing
- MongoDB JSON / JSONL exports: https://world.openfoodfacts.org/data
- API
  - product URL on OFF: https://fr.openfoodfacts.org/produit/3560070223145/pistaches-grillees-carrefour
  - add "/api/v2" to get JSON result: https://fr.openfoodfacts.org/api/v2/produit/3560070223145/pistaches-grillees-carrefour
Sorted lists with counts of individual ingredients for a subset of products
- https://fr.openfoodfacts.org/ingredients?stats=1
  - Ingredients that are known (they exist in our taxonomy): https://fr.openfoodfacts.org/editor/scamark/ingredients?status=known
  - Unknown ingredients: https://fr.openfoodfacts.org/editor/scamark/ingredients?status=unknown
    - OCR and spelling errors
    - Parsing errors
    - Things that are not ingredients and that should not be in the ingredient list
    - Ingredients or synonyms that should be added to the taxonomy
- Can be given for a subset of products
  - https://fr.openfoodfacts.org/editor/scamark/ingredients?status=unknown (for imported Scamark / Leclerc products)

Ingredients taxonomy

How to improve ingredients analysis

The page How to improve ingredients analysis describes what can be done after the text of the ingredients list has been extracted.
