Ingredients List Cutting: Difference between revisions

From Open Food Facts wiki
(21 intermediate revisions by 2 users not shown)
Line 1: Line 1:
The page details the Ingredients List Cutting procedure and implementation.
The page details the [[Ingredients List Cutting]] procedure and implementation.


== Procedure ==
We talked about ingredients list extraction at the last Data Quality meeting (by the way, DQ meetings take place every first Tuesday of the month at 18:00 CEST, and everyone can join).
We would like to improve the extraction.
Everybody can help with this.
When you edit a product and extract the text from the picture, you may see parts of the text that you have to remove by hand, for example "store in a cold place" or "after opening blablabla".
Open Food Facts can remove these parts automatically when you click the extract button.
We just need to list all possible occurrences.
So, we can start with the following 4 '''regex types''':
=== '''%phrases_before_ingredients_list''' ===
All text before the ingredients list that needs to be removed. Usually this is a word like "ingredients".
 
=== '''%phrases_after_ingredients_list''' ===
All text after the ingredients list that needs to be removed ("store in a cold place", for example).
=== '''%may_contain_regexps''' ===
The traces should be extracted, but when you click on "Details of the analysis of the ingredients" they should not appear as ingredients (except if a trace ingredient is not recognized; see the earlier discussion with Moon Rabbit: https://openfoodfacts.slack.com/archives/C06A7LENM/p1690126832563859).
 
=== '''%ignore_regexps''' ===
Text after the ingredients list that should be ignored, but that you cannot cut with %phrases_after_ingredients_list because an allergens list follows it.
WARNING! This removes the whole phrase! That is, if you have "fruits in different amounts" and you add "in different amounts" to %ignore_regexps, "fruits" is removed as well! (In that case, use stopwords in the taxonomy instead.)

Not all possible occurrences are referenced yet for all languages. Hence, if you see some text that you have to remove when you extract the text, mention it on Slack in the #ingredients channel, giving:
* the text
* whether it is %phrases_after_ingredients_list, %may_contain_regexps, or something else
* the link to the product


== Implementation ==
All the phrases can be found in the file ingredients.pm.
 
The registered phrases on 14 October 2023:
 
*[[Ingredients list cutting: before phrases|Before phrases]]
*[[Ingredients list cutting: ignore phrases|Ignore phrases]]
* [[Ingredients list cutting: may contain|May contain phrases]]
* [[Ingredients list cutting: after phrases|After phrases]]
 
== Example ==
Some examples to illustrate which words belong to which part of the raw ingredients extraction.
=== [https://world.openfoodfacts.org/product/5281026016014 Thyme cashews] ===
Line 35: Line 48:
https://world.openfoodfacts.org/product/5281026016014/thyme-cashews-alrifai


=== [https://hr.openfoodfacts.org/product/8606106174564 pear cider] ===
* Serbian (sr)
* Proizvodi i puni
* %phrases_after_ingredients_list

=== [https://hr.openfoodfacts.org/product/3858890478358 Boom Box Chocolate Granola] ===
* Croatian (hr)
* Priprema obroka
* %phrases_after_ingredients_list

Pâte à tartiner
* French (fr)
* ISSUS DE L'AGRICULTURE BIOLOGIQUE
* %phrases_before_ingredients_list
* https://fr.openfoodfacts.org/produit/3330720236012/pate-a-tartiner-lucien-georgelin

* French (fr)
* ORIGINE DES VIANDES: FRANCE
* %phrases_after_ingredients_list
* https://fr.openfoodfacts.org/produit/3272932000336/terrine-de-campagne

Benoit (benbenben), 6 days ago: "Origine des viandes: France" can be kept. It appears under "specific_ingredients" in the API: https://world.openfoodfacts.org/api/v2/product/3272932000336

== Regex types to be added ==
If you notice a missing regex while editing a product, feel free to add it to the following table.

The thyme cashews row (EN, in bold) corresponds to the example in the previous section.
{| class="wikitable"
|+
!language
!product barcode or url
!text
!regex type
!added
|-
|MK
|8601900501257
|Состојки
|%phrases_before_ingredients_list
|N
|-
|
|
|
|
|
|-
|'''EN'''
|'''<nowiki>https://world.openfoodfacts.org/product/5281026016014/thyme-cashews-alrifai</nowiki>'''
|'''Packed in a modified atmosphere.'''
|'''%phrases_after_ingredients_list'''
|'''N'''
|-
|SR
|https://hr.openfoodfacts.org/product/8606106174564/pear-cider-carlsberg
|Proizvodi i puni
|%phrases_after_ingredients_list
|N
|-
|HR
|https://hr.openfoodfacts.org/product/3858890478358/boom-box-chocolate-granola-atlantic-grupa
|Priprema obroka
|%phrases_after_ingredients_list
|N
|-
|HR
|https://hr.openfoodfacts.org/product/3856015303240/krastavci-zvijezda
|Neotvoreno čuvati na sobnoj temperaturi, zaštićeno od sunčeve svjetlosti.
|%phrases_after_ingredients_list
|N
|-
|HR
|https://hr.openfoodfacts.org/product/3850334010728
|Čuvati na temperatu
|%phrases_after_ingredients_list
|N
|-
|HR
|https://hr.openfoodfacts.org/product/3856020263416/pasta-povrtna-vegeta-natur
|Upotreba u jelima
|%phrases_after_ingredients_list
|N
|-
|HR
|https://hr.openfoodfacts.org/product/8014190017627/dimmidis%C3%AC
|Pakiranje sadrži 2 obroka
|%phrases_after_ingredients_list
|N
|-
|HR
|https://hr.openfoodfacts.org/product/8008698005347/biscotti-con-cioccolato-sch%C3%A4r
|[Bez pšenice.->label?] Čuvati na hladnom i suhom mjestu.
|%phrases_after_ingredients_list
|N
|-
|EN
|https://world-hr.openfoodfacts.org/product/3850108079555/
|Can be stored unopened at room temperature. Shake well before use.
|%phrases_after_ingredients_list
|N
|-
|HR
|https://hr.openfoodfacts.org/product/8008698005347/
|Čuvati na sobnoj temperaturi.
|%phrases_after_ingredients_list
|N
|-
|HR
|20449049
|
|%phrases_after_ingredients_list
|N
|-
|RS
|20236946
|Čuvati na hladnom i suvom mestu. Najbolje upotrebiti do:/ Lot broj: označeno na poleđini ambalaže. Zemlja porekla: Belgija. Stavlja u promet u RS: Lidl Srbija KD, Prva južna radna 3, 22330 Nova Pazova, Republika Srbija.
|%phrases_after_ingredients_list
|N
|-
|RO
|20236946
|A se păstra la loc uscat şi răcoros, atât înainte, cât și după deschidere. A se păstra în ambalajul bine închis și a se consuma în decurs de 3 zile după deschidere. Produs în U.E. pentru S.C. Lidl Discount SRL, Sat Nedelea, Comuna Aricestii Rahtivani, DN 72, Crângul lui Bot, KM 73-810, judetul Prahova, România. BG Laper h REVA WA
|%phrases_after_ingredients_list
|N
|-
|RS
|20593735
|Napomena za potrošače: Pojava ulja na površini proizvoda je posledica nagle promene temperature usled čuvanja. Normalna konzistencija se može ponovo postići kratki raturi
|%phrases_after_ingredients_list
|N
|-
|RS
|https://rs.openfoodfacts.org/product/20965198/lidl
|Čuvati na temperaturi od +4 °C do +8 °C. Upotrebljivo do: označeno na zatvaraču. Zemlja porekla: Republika Srbija. Stavlja u promet u RS: Lidl Srbija KD, Prva južna radna 3, 22330 Nova Pazova, Republika Srbija. Proizvođač: SOMBOLED d.o.o., Gakovački put b.b., Sombor, Republika Srbija. 1,5 kg Neto količina: POCOPHONE SHOT ON POCOPHONE F1 RS 579 86159602
|%phrases_after_ingredients_list
|N
|-
|HR
|https://ba.openfoodfacts.org/product/3872084007773/smokva-suha-konzum-plus
|Proizvod sadrži sumporni dioksid.
|%phrases_after_ingredients_list
|N
|-
|RO
|https://ba.openfoodfacts.org/product/8601900501257/euro-bars-80g
|A sè păstra la temperaturi până la 22°C. Cel mai bun utilizat de data imprimată pe partea din spate a produsului.
|%phrases_after_ingredients_list
|N
|-
|
|
|
|
|
|-
|FR
|https://fr.openfoodfacts.org/produit/3330720236012/pate-a-tartiner-lucien-georgelin
|ISSUS DE L’AGRICULTURE BIOLOGIQUE
|%phrases_before_ingredients_list
|N
|-
|HR
|https://hr.openfoodfacts.org/product/3856015303240/krastavci-zvijezda
|HR/BiH and sastojci are stopwords, text starts after HR/BiH but list start after sastojci
|%phrases_before_ingredients_list
|N
|-
|RO
|https://ba.openfoodfacts.org/product/8601900501257/euro-bars-80g
|Produsul poate conţine urme de migdală.
|%phrases_after_ingredients_list
|N
|-
|MK
|8601900501257
|Да се чува на темно место и на температура до 22°С. Употребливо до датумот испечатен на задниот дел на производот.
|%phrases_after_ingredients_list
|N
|-
|HR
|8601900501257
|Čuvati na tamnom mestu i temperaturi do 22°C.
|%phrases_after_ingredients_list
|N
|-
|HR
|3838945509169
|Transportirati, skladištiti i cuvati na sobnoj...
|%phrases_after_ingredients_list
|N
|-
|HR
|3858890477405
|Prijedlog za serviranje
|%phrases_after_ingredients_list
|N
|-
|HR
|3850354002239
|Neotvoreno se može čuvati na sobnoj temperaturi.
|%phrases_after_ingredients_list
|N
|-
|HR
|8606004250308
|Čuvati na čistom i suhom mjestu.
|%phrases_after_ingredients_list
|N
|-
|HR
|3858891978024
|Upute za upotrebu: Dodajte pola čajne žličice matcha, praha u 80 ml prokuhane vode (80°C). Zemlja podrijetla: Japan.
|%phrases_after_ingredients_list
|N
|-
|SI
|3858890477405
|Predlog za serviranje
|%phrases_after_ingredients_list
|N
|-
|SI
|3850104228902
|Prosječne hranjive vrijednosti 100 g proizvoda
|%phrases_after_ingredients_list
|N
|-
|EN
|3858891974309
|Country of origin: Netherlands. Best before: See top of pack. Storage conditions: Store in dry and cool place, protected from direct sunlight.
|%phrases_after_ingredients_list
|N
|-
|
|
|
|
|
|-
|RS
|20593735
|Može sadržati tragove
|%may_contain_regexps
|N
|-
|MK
|8601900501257
|Производот може да содржи
|%may_contain_regexps
|N
|-
|
|
|
|
|
|-
|HR
|https://hr.openfoodfacts.org/product/8600043030549/%C4%8Dokolada-koncern-bambi-a-d-po%C5%BEarevac
|Cokoladne mrvice: kakaovi dijelovi 32 % min. Prirodno bogat vlaknima. Bogat tiaminom, niacinom i vitaminom B6. Tiamin, niacin i vitamin B6 doprinose normalnom metabolizmu stvaranja energije. Obrok od 42 g sadržava 34% od PU*** tiamina, niacina i vitamina B6. Proizvod konzumirati kao dio raznovrsne i uravnotežene prehrane i zdravog načina života.
|%ignore_regexps
|N
|-
|HR
|3838945509169
|Piće je pasterizirano, bez konzervansa.
|%ignore_regexps
|N
|-
|RS
|20593735
|*Rainforest Alliance sertifikovano. Za više informacija posetite ra.org.
|%ignore_regexps
|N
|}


== Context ==
* [[Ingredients Extraction and Analysis]]
* [https://forum.openfoodfacts.org/t/ingredients-errors/395 Forum] discussion
[[Category:Ingredients]]
[[Category:Data quality]]

Latest revision as of 15:14, 28 October 2023

The page details the Ingredients List Cutting procedure and implementation.

Procedure

When you edit a product and extract the text from the picture, you may see parts of the text that you have to remove by hand, for example "store in a cold place" or "after opening blablabla". Open Food Facts can remove these parts automatically when you click the extract button. We just need to list all possible occurrences. So, we can start with the following 4 regex types:

%phrases_before_ingredients_list

All text before the ingredients list that needs to be removed. Usually this is a word like "ingredients".

%phrases_after_ingredients_list

All text after the ingredients list that needs to be removed ("store in a cold place", for example).
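
To make the two cutting steps concrete, here is a minimal sketch in Python. It is only an illustration: the real phrases live in ingredients.pm (Perl), and the names `PHRASES_BEFORE`, `PHRASES_AFTER` and `cut_ingredients_text`, as well as the sample phrases and sample text, are assumptions made up for this example.

```python
import re

# Hypothetical phrase lists, keyed by language code (illustration only).
PHRASES_BEFORE = {"en": [r"ingredients\s*:?"]}
PHRASES_AFTER = {"en": [r"store in a cold place", r"after opening"]}


def cut_ingredients_text(text, lang):
    """Drop everything up to a 'before' phrase and from an 'after' phrase onward."""
    for phrase in PHRASES_BEFORE.get(lang, []):
        match = re.search(phrase, text, flags=re.IGNORECASE)
        if match:
            # Keep only what follows the first matching 'before' phrase.
            text = text[match.end():]
            break
    for phrase in PHRASES_AFTER.get(lang, []):
        match = re.search(phrase, text, flags=re.IGNORECASE)
        if match:
            # Cut the 'after' phrase and everything that follows it.
            text = text[:match.start()]
    return text.strip(" .,;:\n")


raw = ("Ingredients: cashews, thyme, salt. Store in a cold place. "
       "After opening consume within 3 days.")
print(cut_ingredients_text(raw, "en"))  # prints: cashews, thyme, salt
```

Note that an "after" phrase removes everything that follows it, which is why it has to be a phrase that never occurs inside a real ingredients list.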

%may_contain_regexps

The traces should be extracted, but when you click on "Details of the analysis of the ingredients" they should not appear as ingredients (except if a trace ingredient is not recognized).
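
The idea of separating traces from ingredients can be sketched as follows. This is a Python illustration under assumptions, not the code from ingredients.pm: the pattern in `MAY_CONTAIN` and the function `split_traces` are hypothetical, and the sketch assumes the traces sentence ends at the next full stop.

```python
import re
from typing import List, Tuple

# Hypothetical pattern, keyed by language code (illustration only).
MAY_CONTAIN = {"en": [r"may (?:also )?contain(?: traces of)?"]}


def split_traces(text: str, lang: str) -> Tuple[str, List[str]]:
    """Return (text without the traces sentence, list of traces)."""
    for pattern in MAY_CONTAIN.get(lang, []):
        # Capture what follows the phrase, up to the end of the sentence.
        match = re.search(rf"{pattern}\s*:?\s*([^.]*)\.?", text, flags=re.IGNORECASE)
        if match:
            parts = re.split(r",|\band\b", match.group(1))
            traces = [part.strip() for part in parts if part.strip()]
            remaining = (text[:match.start()] + text[match.end():]).strip()
            return remaining, traces
    return text, []


sample = "Cashews, thyme, salt. May also contain soy, peanuts, sesame seeds, milk."
ingredients, traces = split_traces(sample, "en")
print(ingredients)  # prints: Cashews, thyme, salt.
print(traces)       # prints: ['soy', 'peanuts', 'sesame seeds', 'milk']
```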

%ignore_regexps

Text after the ingredients list that should be ignored, but that you cannot cut with %phrases_after_ingredients_list because an allergens list follows it.

WARNING! This removes the whole phrase! That is, if you have "fruits in different amounts" and you add "in different amounts" to %ignore_regexps, "fruits" is removed as well! (In that case, use stopwords in the taxonomy instead.)
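
The warning is easier to see with a small Python sketch. It assumes that the list is split into comma-separated items and that an item is dropped entirely as soon as it matches an ignore pattern; `IGNORE` and `drop_ignored_items` are hypothetical names, not the ones used in ingredients.pm.

```python
import re

# Hypothetical ignore entry, keyed by language code (illustration only).
IGNORE = {"en": [r"in different amounts"]}


def drop_ignored_items(text, lang):
    """Split on commas and drop every item that matches an ignore pattern."""
    patterns = IGNORE.get(lang, [])
    kept = []
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue
        # The whole item is dropped, not only the matching words.
        if any(re.search(p, item, flags=re.IGNORECASE) for p in patterns):
            continue
        kept.append(item)
    return kept


print(drop_ignored_items("sugar, fruits in different amounts, salt", "en"))
# prints: ['sugar', 'salt']
```

"fruits" disappears together with "in different amounts", which is exactly the case where a stopword in the taxonomy is the better tool.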


Not all possible occurrences are referenced yet for all languages. Hence, if you see some text that you have to remove when you extract the text, mention it on Slack in the #ingredients channel.

Implementation

All the phrases can be found in the file ingredients.pm.

The registered phrases on 14 October 2023:

  • Before phrases
  • Ignore phrases
  • May contain phrases
  • After phrases

Example

Some examples to illustrate which words belong to which part of the raw ingredients extraction.

Thyme cashews

Taking this product as an example, go to edit, then ingredients, then extract text:

  • The text starts after "Ingredients:", so there is nothing to add to %phrases_before_ingredients_list.
  • The text does not stop! It continues with the French ingredients list. We would like to add "Packed in a modified atmosphere" to %phrases_after_ingredients_list to ignore "Packed in a modified atmosphere" and everything after it.
  • %may_contain_regexps looks good. "May also contain" is recognized, and when you click on "Details of the analysis of the ingredients" on the product page https://world.openfoodfacts.org/product/5281026016014, the text "May also contain soy, peanuts, sesame seeds, milk" is not there. [One allergen (milk products) is unknown in the taxonomy, so it appears as an unknown ingredient. But this is another topic.]
  • %ignore_regexps: "For allergens see ingredients in bold." in the ingredients list does not appear in "Details of the analysis of the ingredients". This means that "For allergens see ingredients in bold." is already known in %ignore_regexps. All good there as well.

Phrase to add: "Packed in a modified atmosphere." as %phrases_after_ingredients_list (https://world.openfoodfacts.org/product/5281026016014/thyme-cashews-alrifai).
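
The walk-through above can be summarised with a small Python sketch that labels each sentence of a label text with the regex type that applies to it. Everything here is an assumption for illustration: the sample text is made up around the phrases quoted above (the ingredients "cashews, thyme, salt" and the French sentence are invented), and `PATTERNS` and `label_sentences` are hypothetical names.

```python
import re

# Hypothetical patterns, one per regex type (illustration only).
PATTERNS = {
    "%may_contain_regexps": r"^may also contain",
    "%ignore_regexps": r"^for allergens see ingredients in bold",
    "%phrases_after_ingredients_list": r"^packed in a modified atmosphere",
}


def label_sentences(text):
    """Return (sentence, label) pairs for a raw ingredients text."""
    # %phrases_before_ingredients_list: drop the leading "Ingredients:".
    text = re.sub(r"^\s*ingredients\s*:\s*", "", text, flags=re.IGNORECASE)
    labels = []
    cut = False
    for sentence in re.split(r"(?<=\.)\s+", text.strip()):
        if cut:
            # Everything after an 'after' phrase is dropped.
            labels.append((sentence, "dropped"))
            continue
        label = "ingredients"
        for regex_type, pattern in PATTERNS.items():
            if re.search(pattern, sentence, flags=re.IGNORECASE):
                label = regex_type
                break
        if label == "%phrases_after_ingredients_list":
            cut = True
        labels.append((sentence, label))
    return labels


sample = ("Ingredients: cashews, thyme, salt. "
          "May also contain soy, peanuts, sesame seeds, milk. "
          "For allergens see ingredients in bold. "
          "Packed in a modified atmosphere. "
          "Ingrédients: noix de cajou, thym, sel.")
for sentence, label in label_sentences(sample):
    print(label, "|", sentence)
```

With "Packed in a modified atmosphere" registered as an after phrase, the French list that follows it is no longer part of the English ingredients text.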

Regex types to be added

If you notice a missing regex while editing a product, feel free to add it to the following table.

The thyme cashews row (EN) corresponds to the example in the previous section.

{| class="wikitable"
!language!!product barcode or url!!text!!regex type!!added
|-
|MK||8601900501257||Состојки||%phrases_before_ingredients_list||N
|-
|EN||https://world.openfoodfacts.org/product/5281026016014/thyme-cashews-alrifai||Packed in a modified atmosphere.||%phrases_after_ingredients_list||N
|-
|SR||https://hr.openfoodfacts.org/product/8606106174564/pear-cider-carlsberg||Proizvodi i puni||%phrases_after_ingredients_list||N
|-
|HR||https://hr.openfoodfacts.org/product/3858890478358/boom-box-chocolate-granola-atlantic-grupa||Priprema obroka||%phrases_after_ingredients_list||N
|-
|HR||https://hr.openfoodfacts.org/product/3856015303240/krastavci-zvijezda||Neotvoreno čuvati na sobnoj temperaturi, zaštićeno od sunčeve svjetlosti.||%phrases_after_ingredients_list||N
|-
|HR||https://hr.openfoodfacts.org/product/3850334010728||Čuvati na temperatu||%phrases_after_ingredients_list||N
|-
|HR||https://hr.openfoodfacts.org/product/3856020263416/pasta-povrtna-vegeta-natur||Upotreba u jelima||%phrases_after_ingredients_list||N
|-
|HR||https://hr.openfoodfacts.org/product/8014190017627/dimmidis%C3%AC||Pakiranje sadrži 2 obroka||%phrases_after_ingredients_list||N
|-
|HR||https://hr.openfoodfacts.org/product/8008698005347/biscotti-con-cioccolato-sch%C3%A4r||[Bez pšenice.->label?] Čuvati na hladnom i suhom mjestu.||%phrases_after_ingredients_list||N
|-
|EN||https://world-hr.openfoodfacts.org/product/3850108079555/||Can be stored unopened at room temperature. Shake well before use.||%phrases_after_ingredients_list||N
|-
|HR||https://hr.openfoodfacts.org/product/8008698005347/||Čuvati na sobnoj temperaturi.||%phrases_after_ingredients_list||N
|-
|HR||20449049|| ||%phrases_after_ingredients_list||N
|-
|RS||20236946||Čuvati na hladnom i suvom mestu. Najbolje upotrebiti do:/ Lot broj: označeno na poleđini ambalaže. Zemlja porekla: Belgija. Stavlja u promet u RS: Lidl Srbija KD, Prva južna radna 3, 22330 Nova Pazova, Republika Srbija.||%phrases_after_ingredients_list||N
|-
|RO||20236946||A se păstra la loc uscat şi răcoros, atât înainte, cât și după deschidere. A se păstra în ambalajul bine închis și a se consuma în decurs de 3 zile după deschidere. Produs în U.E. pentru S.C. Lidl Discount SRL, Sat Nedelea, Comuna Aricestii Rahtivani, DN 72, Crângul lui Bot, KM 73-810, judetul Prahova, România. BG Laper h REVA WA||%phrases_after_ingredients_list||N
|-
|RS||20593735||Napomena za potrošače: Pojava ulja na površini proizvoda je posledica nagle promene temperature usled čuvanja. Normalna konzistencija se može ponovo postići kratki raturi||%phrases_after_ingredients_list||N
|-
|RS||https://rs.openfoodfacts.org/product/20965198/lidl||Čuvati na temperaturi od +4 °C do +8 °C. Upotrebljivo do: označeno na zatvaraču. Zemlja porekla: Republika Srbija. Stavlja u promet u RS: Lidl Srbija KD, Prva južna radna 3, 22330 Nova Pazova, Republika Srbija. Proizvođač: SOMBOLED d.o.o., Gakovački put b.b., Sombor, Republika Srbija. 1,5 kg Neto količina: POCOPHONE SHOT ON POCOPHONE F1 RS 579 86159602||%phrases_after_ingredients_list||N
|-
|HR||https://ba.openfoodfacts.org/product/3872084007773/smokva-suha-konzum-plus||Proizvod sadrži sumporni dioksid.||%phrases_after_ingredients_list||N
|-
|RO||https://ba.openfoodfacts.org/product/8601900501257/euro-bars-80g||A sè păstra la temperaturi până la 22°C. Cel mai bun utilizat de data imprimată pe partea din spate a produsului.||%phrases_after_ingredients_list||N
|-
|FR||https://fr.openfoodfacts.org/produit/3330720236012/pate-a-tartiner-lucien-georgelin||ISSUS DE L’AGRICULTURE BIOLOGIQUE||%phrases_before_ingredients_list||N
|-
|HR||https://hr.openfoodfacts.org/product/3856015303240/krastavci-zvijezda||HR/BiH and sastojci are stopwords, text starts after HR/BiH but list start after sastojci||%phrases_before_ingredients_list||N
|-
|RO||https://ba.openfoodfacts.org/product/8601900501257/euro-bars-80g||Produsul poate conţine urme de migdală.||%phrases_after_ingredients_list||N
|-
|MK||8601900501257||Да се чува на темно место и на температура до 22°С. Употребливо до датумот испечатен на задниот дел на производот.||%phrases_after_ingredients_list||N
|-
|HR||8601900501257||Čuvati na tamnom mestu i temperaturi do 22°C.||%phrases_after_ingredients_list||N
|-
|HR||3838945509169||Transportirati, skladištiti i cuvati na sobnoj...||%phrases_after_ingredients_list||N
|-
|HR||3858890477405||Prijedlog za serviranje||%phrases_after_ingredients_list||N
|-
|HR||3850354002239||Neotvoreno se može čuvati na sobnoj temperaturi.||%phrases_after_ingredients_list||N
|-
|HR||8606004250308||Čuvati na čistom i suhom mjestu.||%phrases_after_ingredients_list||N
|-
|HR||3858891978024||Upute za upotrebu: Dodajte pola čajne žličice matcha, praha u 80 ml prokuhane vode (80°C). Zemlja podrijetla: Japan.||%phrases_after_ingredients_list||N
|-
|SI||3858890477405||Predlog za serviranje||%phrases_after_ingredients_list||N
|-
|SI||3850104228902||Prosječne hranjive vrijednosti 100 g proizvoda||%phrases_after_ingredients_list||N
|-
|EN||3858891974309||Country of origin: Netherlands. Best before: See top of pack. Storage conditions: Store in dry and cool place, protected from direct sunlight.||%phrases_after_ingredients_list||N
|-
|RS||20593735||Može sadržati tragove||%may_contain_regexps||N
|-
|MK||8601900501257||Производот може да содржи||%may_contain_regexps||N
|-
|HR||https://hr.openfoodfacts.org/product/8600043030549/%C4%8Dokolada-koncern-bambi-a-d-po%C5%BEarevac||Cokoladne mrvice: kakaovi dijelovi 32 % min. Prirodno bogat vlaknima. Bogat tiaminom, niacinom i vitaminom B6. Tiamin, niacin i vitamin B6 doprinose normalnom metabolizmu stvaranja energije. Obrok od 42 g sadržava 34% od PU*** tiamina, niacina i vitamina B6. Proizvod konzumirati kao dio raznovrsne i uravnotežene prehrane i zdravog načina života.||%ignore_regexps||N
|-
|HR||3838945509169||Piće je pasterizirano, bez konzervansa.||%ignore_regexps||N
|-
|RS||20593735||*Rainforest Alliance sertifikovano. Za više informacija posetite ra.org.||%ignore_regexps||N
|}

Context