Ingredients List Cutting
This page details the Ingredients List Cutting procedure and its implementation.
Procedure
When you edit a product and extract the ingredients text from the picture, you may see parts of the text that you have to remove by hand, for example "store in a cold place" or "after opening...". Open Food Facts can remove these parts automatically when you click on the button to extract. We just need to list all possible occurrences. We can start with the following 4 regex types:
%phrases_before_ingredients_list
All text that is before the ingredients list and needs to be removed. Usually this is a word like "ingredients".
%phrases_after_ingredients_list
All text that is after the ingredients list and needs to be removed ("store in a cold place", for example).
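The two lists above can be illustrated with a minimal sketch. It is written in Python for readability, whereas the actual implementation lives in ingredients.pm. The dictionaries `PHRASES_BEFORE` and `PHRASES_AFTER`, their contents, and the function `cut_ingredients_text` are hypothetical names chosen for this illustration and do not reproduce the real matching logic.

```python
import re

# Hypothetical patterns keyed by language code (illustration only).
PHRASES_BEFORE = {"en": [r"ingredients\s*:?"]}
PHRASES_AFTER = {"en": [r"store in a cold place"]}


def cut_ingredients_text(text, lang):
    """Drop text up to a 'before' phrase and from an 'after' phrase onwards."""
    for pattern in PHRASES_BEFORE.get(lang, []):
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            # Keep only what follows the phrase that introduces the list.
            text = text[match.end():]
            break
    for pattern in PHRASES_AFTER.get(lang, []):
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            # Keep only what precedes the phrase that ends the list.
            text = text[:match.start()]
    # Trim leftover spaces and punctuation at both ends.
    return text.strip(" .,;:-")


print(cut_ingredients_text(
    "Ingredients: sugar, cocoa butter. Store in a cold place after opening.", "en"
))  # sugar, cocoa butter
```

Note that an "after" phrase removes the phrase itself and everything that follows it, so it should only be used for text that never appears inside an ingredients list.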
%may_contain_regexps
The traces should be extracted, but when you click on "Details of the analysis of the ingredients" they should not appear as ingredients (except if a trace ingredient is not recognized).
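As a rough sketch of this behaviour, the Python snippet below separates a "may contain" sentence from the ingredients text and returns the traces as their own list. The pattern in `MAY_CONTAIN` and the function `split_traces` are assumptions made for this illustration; the real patterns are the ones registered in %may_contain_regexps.

```python
import re

# Hypothetical pattern keyed by language code (illustration only).
MAY_CONTAIN = {"en": [r"may (?:also )?contain(?: traces of)?"]}


def split_traces(text, lang):
    """Return (ingredients_text, traces) with the traces sentence removed."""
    traces = []
    for pattern in MAY_CONTAIN.get(lang, []):
        # Capture the list that follows the phrase, up to the next full stop.
        regex = re.compile(r"\b" + pattern + r"\s*:?\s*([^.]*)\.?", re.IGNORECASE)
        for match in regex.finditer(text):
            items = re.split(r",|\band\b", match.group(1))
            traces += [item.strip() for item in items if item.strip()]
        text = regex.sub("", text)
    return text.strip(), traces


ingredients, traces = split_traces(
    "Cashews, thyme, salt. May also contain soy, peanuts, sesame seeds, milk.", "en"
)
print(ingredients)  # Cashews, thyme, salt.
print(traces)       # ['soy', 'peanuts', 'sesame seeds', 'milk']
```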
%ignore_regexps
Text that is after the ingredients list and should be ignored, but where the rest of the text must be kept because an allergens list follows it.
WARNING! This will remove the whole phrase! That is, if you have "fruits in different amounts" and you add "in different amounts" to ignore_regexps, "fruits" will be removed as well! (In that case, use stopwords in the taxonomy instead.)
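The pitfall described in the warning can be reproduced with a toy model. The Python sketch below assumes that a phrase is the text between two commas or full stops and that a phrase is dropped entirely as soon as it contains an ignore pattern. `IGNORE` and `drop_ignored_phrases` are hypothetical names, and the real matching in ingredients.pm may work differently.

```python
import re

# Hypothetical ignore pattern keyed by language code (illustration only).
IGNORE = {"en": [r"in different amounts"]}


def drop_ignored_phrases(text, lang):
    """Drop every comma- or period-delimited phrase that contains an ignore pattern."""
    phrases = [p.strip() for p in re.split(r"[,.]", text) if p.strip()]
    patterns = IGNORE.get(lang, [])
    kept = [
        phrase
        for phrase in phrases
        if not any(re.search(pat, phrase, re.IGNORECASE) for pat in patterns)
    ]
    return ", ".join(kept)


# "fruits" disappears together with "in different amounts".
print(drop_ignored_phrases("sugar, fruits in different amounts, water", "en"))
# sugar, water
```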
Currently, not all possible occurrences are referenced for all languages. Hence, if you see some text that you have to remove after extracting the text, mention it on Slack in the #ingredients channel.
Implementation
All the phrases can be found in the file ingredients.pm.
The registered phrases on 14 October 2023:
Example
An example to illustrate which words belong to which part of the raw ingredients extraction.
Thyme cashews
So if we take this product as an example, go to edit, ingredients, extract text:
- The text starts after "Ingredients:" -> nothing to add to %phrases_before_ingredients_list.
- The text does not stop! It continues with the French ingredients list. We would like to add "Packed in a modified atmosphere" to %phrases_after_ingredients_list so that "Packed in a modified atmosphere" and everything after it is ignored.
- %may_contain_regexps: it looks good. "May also contain" is recognized, and when you click on "Details of the analysis of the ingredients" on the product page https://world.openfoodfacts.org/product/5281026016014, the text "May also contain soy, peanuts, sesame seeds, milk" is not there. [One allergen (milk products) is unknown in the taxonomy, so it appears as an unknown ingredient. But this is another topic.]
- %ignore_regexps: "For allergens see ingredients in bold." in the ingredients list does not appear in "Details of the analysis of the ingredients". This means that "For allergens see ingredients in bold." is already known in %ignore_regexps. All good there as well.
- Phrase to add: "Packed in a modified atmosphere." in %phrases_after_ingredients_list (https://world.openfoodfacts.org/product/5281026016014/thyme-cashews-alrifai)
Regex types to be added
If you notice a missing regex while editing a product, feel free to add it to the following table.
The first row corresponds to the example in the previous section.
language | product barcode or url | text | regex type | added |
---|---|---|---|---|
EN | https://world.openfoodfacts.org/product/5281026016014/thyme-cashews-alrifai | Packed in a modified atmosphere. | %phrases_after_ingredients_list | N |
SR | https://hr.openfoodfacts.org/product/8606106174564/pear-cider-carlsberg | Proizvodi i puni | %phrases_after_ingredients_list | N |
HR | https://hr.openfoodfacts.org/product/3858890478358/boom-box-chocolate-granola-atlantic-grupa | Priprema obroka | %phrases_after_ingredients_list | N |
HR | https://hr.openfoodfacts.org/product/3856015303240/krastavci-zvijezda | Neotvoreno čuvati na sobnoj temperaturi, zaštićeno od sunčeve svjetlosti. | %phrases_after_ingredients_list | N |
HR | https://hr.openfoodfacts.org/product/3850334010728 | Čuvati na temperatu | %phrases_after_ingredients_list | N |
HR | https://hr.openfoodfacts.org/product/3856020263416/pasta-povrtna-vegeta-natur | Upotreba u jelima | %phrases_after_ingredients_list | N |
HR | https://hr.openfoodfacts.org/product/8014190017627/dimmidis%C3%AC | Pakiranje sadrži 2 obroka | %phrases_after_ingredients_list | N |
HR | https://hr.openfoodfacts.org/product/8008698005347/biscotti-con-cioccolato-sch%C3%A4r | [Bez pšenice.->label?] Čuvati na hladnom i suhom mjestu. | %phrases_after_ingredients_list | N |
EN | https://world-hr.openfoodfacts.org/product/3850108079555/ | Can be stored unopened at room temperature. Shake well before use. | %phrases_after_ingredients_list | N |
HR | https://hr.openfoodfacts.org/product/8008698005347/ | Čuvati na sobnoj temperaturi. | %phrases_after_ingredients_list | N |
FR | https://fr.openfoodfacts.org/produit/3330720236012/pate-a-tartiner-lucien-georgelin | ISSUS DE L’AGRICULTURE BIOLOGIQUE | %phrases_before_ingredients_list | N |
HR | https://hr.openfoodfacts.org/product/3856015303240/krastavci-zvijezda | HR/BiH and sastojci are stopwords; the text starts after HR/BiH but the list starts after sastojci | %phrases_before_ingredients_list | N |
HR | https://hr.openfoodfacts.org/product/8600043030549/%C4%8Dokolada-koncern-bambi-a-d-po%C5%BEarevac | Cokoladne mrvice: kakaovi dijelovi 32 % min. Prirodno bogat vlaknima. Bogat tiaminom, niacinom i vitaminom B6. Tiamin, niacin i vitamin B6 doprinose normalnom metabolizmu stvaranja energije. Obrok od 42 g sadržava 34% od PU*** tiamina, niacina i vitamina B6. Proizvod konzumirati kao dio raznovrsne i uravnotežene prehrane i zdravog načina života. | %ignore_regexps | N |
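To show how candidate rows from this table could be tried out before they are added, here is a small Python sketch that groups a few %phrases_after_ingredients_list candidates by language and compiles them into a single pattern per language. The lowercase language keys, the name `CANDIDATES_AFTER`, and both helper functions are assumptions for this illustration; the phrases themselves are copied from the table.

```python
import re

# A few candidate phrases from the table, keyed by language code.
CANDIDATES_AFTER = {
    "en": ["Packed in a modified atmosphere"],
    "hr": ["Priprema obroka", "Upotreba u jelima", "Čuvati na sobnoj temperaturi"],
}


def build_after_regex(lang):
    """Compile one regex that matches any candidate phrase and everything after it."""
    phrases = CANDIDATES_AFTER.get(lang, [])
    if not phrases:
        return None
    # Longest phrases first, so a longer phrase wins over a shorter prefix.
    ordered = sorted(phrases, key=len, reverse=True)
    alternation = "|".join(re.escape(phrase) for phrase in ordered)
    return re.compile(r"\s*(?:" + alternation + r").*", re.IGNORECASE | re.DOTALL)


def cut_after(text, lang):
    """Remove the first candidate phrase found and all text that follows it."""
    regex = build_after_regex(lang)
    return regex.sub("", text, count=1) if regex else text


print(cut_after(
    "Cashews, thyme, salt. Packed in a modified atmosphere. Ingrédients: noix de cajou",
    "en",
))  # Cashews, thyme, salt.
```

The phrases are escaped with `re.escape` here, so they are matched literally. Entries that are meant to be real regular expressions would need to skip that step.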
Context
- Ingredients Extraction and Analysis
- Forum discussion