Ingredients List Cutting: Difference between revisions

From Open Food Facts wiki

Revision as of 07:38, 14 October 2023

This page details the Ingredients List Cutting procedure and implementation.

Procedure

We discussed ingredients list extraction at the last Data Quality meeting (by the way, DQ meetings take place every first Tuesday of the month at 18:00 CEST, and everyone can join). We would like to improve the extraction, and everybody can help.

When you edit a product and extract text from the picture, you often see parts of the text that you have to remove by hand, for example "store in a cold place" or "after opening blablabla". Open Food Facts can remove these parts automatically when you click on the button to extract. We just need to list all possible occurrences. We can start with the following 4 lists:

  • %phrases_before_ingredients_list: all text that comes before the ingredients list and needs to be removed.
  • %phrases_after_ingredients_list: all text that comes after the ingredients list and needs to be removed ("store in a cold place", for example).
  • %may_contain_regexps: the traces should be extracted, but when you click on "Details of the analysis of the ingredients" they should not appear as ingredients (except if one trace ingredient is not recognized, see the previous discussion with Moon Rabbit https://openfoodfacts.slack.com/archives/C06A7LENM/p1690126832563859).
  • %ignore_regexps: text that comes after the ingredients list and should be ignored, but that you keep in the extracted text because an allergens list follows it.

Currently, not all possible occurrences are referenced for all languages. Hence, if you see some text that you have to remove when you extract the text, just write in this thread:

  • the text
  • whether it is %phrases_after_ingredients_list or %may_contain_regexps or something else
  • the link to the product
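To make the idea of cutting concrete, here is a minimal Python sketch of how phrases before and after the ingredients list could be used to trim extracted text. It is not the Open Food Facts implementation: the dictionary names, the function name cut_ingredients_text and the sample phrases are assumptions made up for illustration.

```python
import re

# Hypothetical per-language phrase lists, named after the lists described above.
# The entries are illustrative only, not the real Open Food Facts data.
PHRASES_BEFORE = {"en": ["ingredients:"]}
PHRASES_AFTER = {"en": ["store in a cold place", "after opening"]}


def cut_ingredients_text(text: str, lang: str) -> str:
    """Keep only the part of `text` between the before- and after-phrases."""
    # Drop everything up to and including the first "before" phrase found.
    for phrase in PHRASES_BEFORE.get(lang, []):
        match = re.search(re.escape(phrase), text, flags=re.IGNORECASE)
        if match:
            text = text[match.end():]
            break
    # Drop the earliest "after" phrase and everything that follows it.
    cut_at = len(text)
    for phrase in PHRASES_AFTER.get(lang, []):
        match = re.search(re.escape(phrase), text, flags=re.IGNORECASE)
        if match:
            cut_at = min(cut_at, match.start())
    # Remove leftover punctuation and whitespace at both ends.
    return text[:cut_at].strip(" .,;\n")


sample = "Ingredients: sugar, cocoa butter. Store in a cold place. Best before: see lid"
print(cut_ingredients_text(sample, "en"))
```

A language without any listed phrase is returned almost unchanged, which is why listing occurrences for every language matters.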

Implementation

Examples

Some examples to illustrate which words belong to which part of the raw ingredients extraction.

Thyme cashews

Taking this product as an example, go to edit, ingredients, extract text:

  • The text starts right after "Ingredients:", so there is nothing to add to %phrases_before_ingredients_list.
  • The text does not stop at the end of the ingredients list! It continues with the French ingredients list. We would like to add "Packed in a modified atmosphere" to %phrases_after_ingredients_list so that "Packed in a modified atmosphere" and everything after it are ignored.
  • %may_contain_regexps looks good. "May also contain" is recognized: when you click on "Details of the analysis of the ingredients" on the product page https://world.openfoodfacts.org/product/5281026016014, the text "May also contain soy, peanuts, sesame seeds, milk" is not there [One allergen (milk products) is unknown in the taxonomy, so it appears as an unknown ingredient. But this is another topic].
  • %ignore_regexps is fine as well. "For allergens see ingredients in bold." in the ingredients list does not appear in "Details of the analysis of the ingredients". This means that "For allergens see ingredients in bold." is already known in %ignore_regexps.

So, for this product, the following can be written in this thread:

  • Packed in a modified atmosphere.
  • %phrases_after_ingredients_list
  • https://world.openfoodfacts.org/product/5281026016014/thyme-cashews-alrifai
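The walkthrough for this product can be sketched in Python as follows. This is only an illustration: the OCR text is invented to match the phrases quoted in this example (the ingredient names in particular are made up), and the regular expressions and the analyse function are hypothetical, not the ones Open Food Facts uses.

```python
import re

# Invented OCR text shaped like the example; the real label may differ.
OCR_TEXT = (
    "Ingredients: cashews, thyme, salt. "
    "May also contain soy, peanuts, sesame seeds, milk. "
    "For allergens see ingredients in bold. "
    "Packed in a modified atmosphere. Ingrédients: noix de cajou, thym, sel."
)

PHRASE_AFTER = "Packed in a modified atmosphere"  # the proposed addition
IGNORE_REGEXP = r"for allergens see ingredients in bold\.?"  # hypothetical
MAY_CONTAIN_REGEXP = r"may also contain ([^.]*)\.?"  # hypothetical


def analyse(text):
    """Return (ingredients_text, traces) for the sample text."""
    # 1. Cut at the after-phrase: it and everything after it is dropped.
    after = re.search(re.escape(PHRASE_AFTER), text, flags=re.IGNORECASE)
    if after:
        text = text[:after.start()]
    # 2. Remove the sentence that should be ignored.
    text = re.sub(IGNORE_REGEXP, "", text, flags=re.IGNORECASE)
    # 3. Pull the traces out so they are not parsed as ingredients.
    traces = []
    match = re.search(MAY_CONTAIN_REGEXP, text, flags=re.IGNORECASE)
    if match:
        traces = [t.strip() for t in match.group(1).split(",")]
        text = text[:match.start()] + text[match.end():]
    # 4. Strip the "Ingredients:" label and tidy whitespace and punctuation.
    text = re.sub(r"^\s*ingredients\s*:\s*", "", text, flags=re.IGNORECASE)
    return " ".join(text.split()).strip(" ."), traces


ingredients, traces = analyse(OCR_TEXT)
print(ingredients)
print(traces)
```

Without step 1, the French list would stay in the text, which is the problem this example reports.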

pear cider

  • Serbian (sr)
  • Proizvodi i puni
  • %phrases_after_ingredients_list
  • https://hr.openfoodfacts.org/product/8606106174564

Boom Box Chocolate Granola

  • Croatian (hr)
  • Priprema obroka
  • %phrases_after_ingredients_list
  • https://hr.openfoodfacts.org/product/3858890478358

  • French (fr)
  • ISSUS DE L'AGRICULTURE BIOLOGIQUE
  • %phrases_before_ingredients_list
  • https://fr.openfoodfacts.org/produit/3330720236012/pate-a-tartiner-lucien-georgelin

  • French (fr)
  • ORIGINE DES VIANDES: FRANCE
  • %phrases_after_ingredients_list
  • https://fr.openfoodfacts.org/produit/3272932000336/terrine-de-campagne

Reply from Benoit (benbenben): "Origine des viandes: France" can be kept. It appears under "specific_ingredients" in the API: https://world.openfoodfacts.org/api/v2/product/3272932000336
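To check this kind of claim for a product, one could look at the "specific_ingredients" field of the API response. The sketch below assumes that the response wraps the product data in a "product" object; the helper names fetch_product and specific_ingredients are hypothetical, and the shape of each entry is not specified here.

```python
import json
import urllib.request


def specific_ingredients(product_json: dict) -> list:
    """Return the "specific_ingredients" entries of a parsed API response."""
    # Assumption: the product data sits under the top-level "product" key.
    return product_json.get("product", {}).get("specific_ingredients", [])


def fetch_product(barcode: str) -> dict:
    """Download and parse the API v2 response for one barcode."""
    url = f"https://world.openfoodfacts.org/api/v2/product/{barcode}"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


# Usage (needs network access):
# for entry in specific_ingredients(fetch_product("3272932000336")):
#     print(entry)
```

An empty list means the field is absent or empty for that product.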

Context

[Category:Ingredients]