Ingredients taxonomy: Difference between revisions

From Open Food Facts wiki
(Hungarian translation)
 
Draft - See [[Global taxonomies]] for instructions.
[[Category:Taxonomies]]
== Potential sources ==
== Introduction ==
* http://world.openfoodfacts.org/ingredients
== Why? ==  
* https://github.com/thinkle/gourmet/tree/d19490704a23e3e5734b040f517d3a56e12fdacd/gourmet/defaults
Why do we need an ingredients ontology? The ontology describes how ingredients are derived from each other and how ingredients can be combined into new ingredients. The basic thesaurus can help to:
* https://tools.wmflabs.org/wikidata-todo/translate_items_with_property.php?prop=652
* '''Normalise ingredients''' - Producers take a lot of freedom in describing the ingredients they use. They use different words (synonyms) to designate the same ingredient. An ontology helps to standardise the ingredients;
* '''Exclusive search''' - the taxonomy can be used to support search for ingredients in any language and multiple synonyms;
* '''Translation''' - as the taxonomy contains translations of each ingredient, it can be used as means to translate ingredient lists;
* '''Ingredients analysis''' - the taxonomy can be extended with properties to indicate if ingredients are suitable for vegans, vegetarians etc. and to estimate how processed the food product is (e.g. NOVA groups)
* '''Ingredient language inconsistencies''' - the ingredient lists of a single product sometimes differ between languages, for example because ingredients are left out or simplified. The thesaurus might help to reveal these differences.


== May be generated partly from ==
If the thesaurus is extended by relations between ingredients, other benefits arise, depending on what is defined in the taxonomy. The basic relation is the isa-relation or is-a-kind-of relation. The isa-relation defines two ingredients that are related. For instance the ingredient ''strawberry'' can also be found as ''strawberry puree'' and ''strawberry juice''. ''Strawberry'' is more generic than ''strawberry puree'', which is more specific. The rule here is: would it make sense to replace the children by the parent in the ingredients list? For instance ''strawberry puree'' and ''strawberry juice'' are still ''strawberry'', but in a different form. So if you replace ''strawberry puree'' with ''strawberry'' in the ingredient list, it is still a valid ingredient list. Usually ''strawberry'' is the parent and ''strawberry puree'' a child.
* https://translations.launchpad.net/openfoodfacts/trunk/+pots/openfoodfacts-ingredients
* Wikidata: List of ingredients in food: http://tinyurl.com/z9j2enm


== Taxonomy ==
This isa-relation makes the following functionalities possible:  
<pre>
* '''Inclusive source search''' - The '''inclusive search''' means that a search for the parent ''strawberry'' will also show the children ''strawberry puree'', etc. The '''source search''' refers to the origin of the ingredient, i.e. strawberry, apple, cinnamon, soy, etc. (I want another word for source).
en:salt
* '''Inclusive condition search''' - in a similar way we could search for the condition of a source ingredient, for instance pureed, juiced, dried, reconditioned, etc. So if we would like to search for all pureed fruits, the relation between [https://world.openfoodfacts.org/ingredient/fruit-puree fruit puree] and [https://world.openfoodfacts.org/ingredient/strawberry-puree strawberry puree] should be defined in the taxonomy.
de:Salz
eo:salo
fr:sel
nl:zout
es:sal
hu:Etkezesi-so, , Étkezési-


en:sugar
One could also define other relations in order to support:
de:Zucker
* '''Hidden ingredients''' - an ingredient might contain hidden ingredients, the ontology might reveal these. For example [https://world.openfoodfacts.org/ingredient/butter butter] contains [https://world.openfoodfacts.org/ingredient/butterfat butterfat].
eo:sukero
* '''Combined ingredients''' - an ingredient might appear as a single ingredient. In reality however
fr:sucre
* '''Processed ingredients''' - often an ingredient is derived from another ingredient through some process. We can make explicit what these processes are. For example, [https://world.openfoodfacts.org/ingredient/clarified-butter clarified butter] is created from [https://world.openfoodfacts.org/ingredient/butter butter] by separating the [https://world.openfoodfacts.org/ingredient/milk-solids milk solids] and [https://world.openfoodfacts.org/ingredient/water water] from the [https://world.openfoodfacts.org/ingredient/butterfat butterfat].
nl:suiker
* '''Ingredient incompleteness''' - often an ingredient is incompletely defined in an ingredient list. For instance if an ingredient list specifies [https://world.openfoodfacts.org/ingredient/milk milk], it should be defined which mammal the milk comes from, for instance [https://world.openfoodfacts.org/ingredient/cow-s-milk cow's milk].
es:azúcar


en:water
== Taxonomy ==
de:Wasser
The current OFF ingredients taxonomy ([https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/ingredients.txt github]) can be seen as a thesaurus, although more information is slowly being added.
eo:akvo
fr:eau
nl:water
es:agua
hu:Ivoviz, víz
 
en:emulsifier
de:Emulgator
fr:emulsifiant
nl:emulgator
es:emulsificador
 
en:citric acid
de:Citronensäure, Zitronensäure
eo:citrata acido
fr:acide citrique
nl:citroenzuur
es:ácido cítrico
hu:Citromsav
 
en:milk
de:Milch
eo:lakto
nl:melk
es:leche
 
en:preservative
de:Konservierungsmittel
nl:conserveermiddel
es:conservador
 
en:palm-oil
de:Palmöl
eo:palma oleo, palmoleo
nl:palmolie
es:aceite de palma
hu:pálmaolaj, pálma, pálmamag
 
en:iron
de:Eisen
eo:fero
nl:ijzer
es:hierro
 
en:colour
nl:kleurstof
es:colorante
 
en:yeast
de:Hefe
eo:gisto
nl:gist
es:levadura
 
en:glucose-syrup
de:Glukosesirup, Glucosesirup
nl:glucosesiroop
es:jarabe de glucosa
hu:Glukozszirup
 
en:high-fructose corn syrup
nl:glucose-fructosestroop, glucose-fructosesiroop
es:jarabe de maíz alto en fructosa
 
en:flavouring
nl:smaakstof
es:saborizante
 
en:sunflower-oil
de:Sonnenblumenöl
nl:zonnebloemolie
es:aceite de girasol
 
en:sunflower
de:Sonnenblume
eo:sunfloro
nl:zonnebloem
es:girasol
 
en:dextrose
de:Dextrose
eo:dekstrozo
nl:dextrose
es:dextrosa
 
en:cocoa-butter
de:Kakaobutter
eo:kakabutero
nl:cacaoboter
es:mantequilla de cacao, manteca de cacao
hu:Kakaovaj
 
en:cocoa-mass
nl:cacao massa
es:masa del cacao, masa de cacao
hu:Kakaomassza
 
en:soya-lecithin
nl:sojalecithine
es:lecitina de soya, lecitina de soja
hu:Szojalecitin
 
en:lecithin
nl:lecithine
es:lecitina
hu:Lecitinek, Lecitin
 
en:nutmeg
de:Muskat, Muskatnuss
nl:nootmuskaat
es:nuez moscada
 
en:ginger
de:Ingwer
eo:zingibro
nl:gember
es:jengibre
 
en:raisins
de:Rosinen
eo:sekvinberoj
nl:rozijnen
es:pasas
 
en:coriander
de:Koriander
eo:koriandro
nl:koriander
es:cilantro
 
en:molasses
de:Melasse
eo:melaso
nl:melasse
es:melaza
 
en:vegetable-oil
nl:plantaardige olie
es:aceite vegetal
 
en:black-pepper
de:Schwarzer Pfeffer
eo:nigra pipro
nl:zwarte peper
es:pimienta negra
 
en:stabiliser
de:Stabilisator
nl:stabilisator
es:estabilizador
 
en:butter
de:Butter
eo:butero
nl:boter
es:mantequilla
 
en:cinnamon
de:Zimt
eo:cinamo
nl:kaneel
es:canela
 
en:garlic
de:Knoblauch
eo:ajlo
nl:knoflook
es:ajo
 
en:antioxidant
nl:antioxidant
es:antioxidante
 
en:sea-salt
de:Meersalz
eo:marsalo
nl:zeezout
es:sal de mar
 
en:tomatoes
de:Tomaten
eo:tomatoj
nl:tomaten
es:tomates, jitomates
 
en:currants
nl:krenten
es:grosellas
 
en:honey
de:Honig
eo:mielo
nl:honing
es:miel
 
en:yeast-extract
nl:gistextract
es:extracto de levadura
 
en:almonds
de:Mandeln
eo:migdaloj
nl:amandelen
es:almendras
 
en:onion-powder
de:Zwiebelpulver
nl:uienpoeder
es:cebolla en polvo
 
en:walnuts
de:Walnüsse
eo:juglandoj
nl:walnoten
es:nueces
 
en:kitchen-salt
nl:keukenzout
es:sal de cocina
 
en:hazelnut
de:Haselnuss
eo:avelo
nl:hazelnoot
es:avellana
 
en:sugar-cane
nl:rietsuiker
es:azúcar de caña
 
en:Of biological origin.
nl:Van biologische oorsprong.
es:de orígen biológico
 
en:zucchini
de:Zucchini
eo:kukurbeto, itala kukurbo, italia kukurbo
nl:courgette
es:calabacín
 
en:tomato
de:Tomate
eo:tomato
nl:tomaat
es:tomate, jitomate
 
en:carrot
de:Karotte
eo:karoto
nl:wortel
es:zanahoria
 
en:cucumber
de:Gurke
eo:kukumo
nl:komkommer
es:pepino
 
en:pointed pepper
nl:puntpaprika
 
en:red onion
de:Rote Zwiebel
eo:ruĝa cepo
nl:rode ui
es:cebolla roja
 
en:red pepper
de:Roter Pfeffer
eo:ruĝa pipro
nl:rode peper
es:pimiento rojo, chile morrón rojo
 
en:coconut milk
de:Kokosmilch, Kokosnussmilch
eo:kokoslakto
nl:kokosmelk
es:leche de coco
 
en:coconut extract
nl:kokosextract
es:extracto de coco
 
en:green pepper
de:Grüner Pfeffer
eo:verda pipro
nl:groene peper
es:pimiento verde, chile morrón verde
 
nl:koriander
de:Koriander
eo:koriandro
en:coriander
es:cilantro
 
en:cumin
de:Kreuzkümmel, Kumin, Cumin
eo:oficina kumino
nl:komijn
es:comino
 
en:black mustard seed
nl:zwarte mosterdzaad
es:semilla de mostaza negra
 
en:pepper
de:Paprika
eo:kapsiko
nl:paprika
es:pimienta
 
en:pepper corns
nl:peperkorrels
es:granos de pimienta
 
en:white pepper
de:Weißer Pfeffer
eo:blanka pipro
nl:witte peper
es:pimienta blanca
 
en:bay leaf
nl:laurierblad
es:laurel
 
en:acidity-regulators
nl:zuurteregelaar
es:regulador de acidez
 
en:tomato juice
de:Tomatensaft
eo:tomatsuko, tomatosuko
nl:tomatensap
es:jugo de tomate, jugo de jitomate
 
en:curry
de:Curry
eo:kareo
nl:kerrie
es:curry
 
en:curry powder
de:Currypulver
eo:karepulvero, karea pulvero
nl:kerriepoeder
es:polvo de curry
 
en:chilli pepper
de:Chilischote
eo:kapsiketo
nl:peper
en-us:chili pepper
es:chile, ají
 
en:modified corn starch
nl:gemodificeerd maïszetmeel
es:almidón de maíz modificado
hu:Kukoricakemenyitő
 
en:acidifier
nl:voedingszuur
es:acidificador
 
en:linseed
de:Leinsamen
nl:lijnzaad
es:linaza
 
en:linseed oil
de:Leinöl, Leinsamenöl
eo:linoleo
nl:lijnzaad olie
es:aceite de linaza


en:vegetable oil
''In the context of information retrieval, a thesaurus (plural: "thesauri") is a form of controlled vocabulary that seeks to dictate semantic manifestations of metadata in the indexing of content objects. A thesaurus serves to minimise semantic ambiguity by ensuring uniformity and consistency in the storage and retrieval of the manifestations of content objects. ANSI/NISO Z39.19-2005 defines a content object as "any item that is to be described for inclusion in an information retrieval system, website, or other source of information". The thesaurus aids the assignment of preferred terms to convey semantic metadata associated with the content object''. ([https://en.wikipedia.org/wiki/Thesaurus_(information_retrieval) wikipedia])
de:Pflanzenöl
eo:legoma oleo
nl:plantaardige olie
es:aceite vegetal


en:vegetable oils
The purpose of the thesaurus is to have a list of unique, well-defined ingredients that occur in ingredient lists. The OFF ingredients taxonomy is mainly such a list of elements that occur in ingredient lists. However, the OFF ingredients taxonomy is more than a simple thesaurus.
de:Pflanzenöle
eo:legomaj oleoj
nl:plantaardige oliën, plantaardige olien
es:aceites vegetales


en:rapeseed
=== Example ===
nl:koolzaad
Below is the ingredient class entry of '''soybean''' in the OFF ingredients taxonomy:
es:canola, colza
<nowiki>#</nowiki>description:en: SOYBEAN (Glycine max), or soya bean is a species of legume native to East Asia.
<nowiki>#</nowiki>comment:en: There is a bit of confusion whether "bean" should be part of the name, so "soy" or "soybean"?
<br /><en:soya
<br />en:soya bean, soy beans, soya beans, soybean
<br />da:Sojabønne
<br />de:Sojabohne, Sojabohnen, Sojakerne
<br />es:granos de soja, habas de soja
<br />et:Sojauba
<br />fi:Soijapapu
<br />fr:fève de soja, fèves de soja, soja entier, graine de soja, graines de soja, graines soja


en:palm
=== Explanation ===
nl:palm
* '''Comment-line''' - the ''#''  describes a comment line and can be used to add other information to an ingredient. This information is not used by OFF.
es:palma
* '''Description''' - the ''# description:en:'' line provides a definition of the ingredient in English. In this case the definition is taken from Wikipedia.
* '''Comment''' - the ''# comment:en:'' line provides a comment in English about the entry.
* '''Super-ingredient''' - the ''<en:soya'' line describes the super-ingredient of this (sub-)ingredient. The relationship between the super-ingredient and the sub-ingredient is an ''is-a-kind-of-relationship''. This means that the sub-ingredient is a kind of super-ingredient. In this example the soybean ingredient is a kind of soy ingredient. The sub-ingredient provides more details of the ingredient. In this case the super-ingredient is very broad (something with soy), whereas the sub-ingredient adds the detail of the bean. This line is optional and is left out if no super-ingredient can be specified.
* '''Key''' - The ''en:soya bean'' line is the main name (key) of the ingredient. This doubles as name for the ingredient (in this case in English (''en:'')).
* '''Translations''' - The next lines provide translations of the main ingredient in other languages. One language per line. Each line starts with a language prefix. Thus ''de:'' means German.
* '''Synonyms''' - an ingredient in a language can have multiple synonyms. These synonyms appear as secondary entries on a language line. The first entry is however the main name in that language.
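
The entry format explained above can be illustrated with a small parser. This is only a sketch in Python and not the code OFF uses (the OFF server is written in Perl); the function name ''parse_entry'' and the layout of the returned dictionary are assumptions made for this example.

```python
def parse_entry(block: str) -> dict:
    """Parse one taxonomy entry (a block of lines) into parents, names and properties."""
    entry = {"key": None, "parents": [], "names": {}, "properties": {}}
    for raw in block.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # "#description:en: text" becomes property "description:en";
            # free-form comment lines are skipped.
            key, sep, value = line[1:].strip().partition(": ")
            if sep and " " not in key:
                entry["properties"][key] = value.strip()
            continue
        if line.startswith("<"):
            # "<en:soya" names a super-ingredient (parent).
            entry["parents"].append(line[1:].strip())
            continue
        # "de:Sojabohne, Sojabohnen": language prefix, main name, then synonyms.
        lang, _, names = line.partition(":")
        synonyms = [n.strip() for n in names.split(",") if n.strip()]
        entry["names"][lang.strip()] = synonyms
        if entry["key"] is None and synonyms:
            # The first name line is the key of the ingredient.
            entry["key"] = f"{lang.strip()}:{synonyms[0]}"
    return entry


SOYBEAN = """\
#description:en: SOYBEAN (Glycine max), or soya bean is a species of legume native to East Asia.
<en:soya
en:soya bean, soy beans, soya beans, soybean
de:Sojabohne, Sojabohnen, Sojakerne
"""

entry = parse_entry(SOYBEAN)
print(entry["key"], entry["parents"], entry["names"]["de"])
```

The first name on each language line is kept at index 0, so the main name in a language can always be told apart from its synonyms.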


en:fully hardened palm
=== Multiple super ingredients ===
nl:volledig geharde palm, volledig gehard palm
An ingredient can have multiple super ingredients. This can be used to express the relation between ingredients. These super ingredients reflect the description of an ingredient, such as ''grapefruit juice''. The ingredient ''grapefruit juice'' has two components: the noun ''juice'' and the adjective ''grapefruit''. The noun can be seen as the processing state (purée, juice, flour, etc.) and the adjective as subject of that processing. For a user both might be of interest: what kind of juice is it, and how was the grapefruit processed?
es:palma completamente endurecida


en:yogurt
As the same super-ingredient will also appear in combination with other ingredients, this might help the user understand what other fruits there are and how they are processed and what other juices from fruits exist.
de:Joghurt, Jogurt
eo:jogurto, jahurto
nl:yoghurt
es:yogurt, yoghurt


en:modified starch
At the moment (2018-10-05) the taxonomy is not consistently structured. It sometimes uses the noun and sometimes the adjective as structuring principle.
nl:gemodificeerd zetmeel
es:almidón modificado


en:calcium salts
These super ingredients and ingredients can also be found in ingredient lists, where they are indicated by parentheses or colons. For example: ''Tomates (Dés, Jus, Purée)'', ''Yaourt (lait, crème, sucre, ferments lactiques)'', ''crème fraîche (dont lait)'' or ''Jus de : pomme, orange, passion, ananas, citron''. These are four different approaches to super ingredients, and they define different relations between ingredients.
nl:calciumzouten
es:sales de calcio


en:emulsifiers
From a formal semantic viewpoint: juice is the superclass of grapefruit juice (''grapefruit juice'' ''is-a-kind-of'' ''juice''). In an ''is-a-kind-of'' relation one could replace the specific formulation ''grapefruit juice'' with the generic formulation ''juice'', and still have a valid sentence.
nl:emulgatoren
es:emulsificantes
hu:Emulgealoszerek, Emulgáloszerek


en:Mono- and diglycerides of fatty acids
In the taxonomy multiple super ingredients can be defined as:
nl:Mono- en diglyceriden van vetzuren
<br />'''<en:soya'''
es:Mono- y diglicéridos de ácidos grasos
<br />'''<en:sauce'''
<br />'''en:soy sauce'''
<br />'''de:Sojasauce'''
<br />


en:sunflower lecithin
Note that ''en:soya'' and ''en:sauce'' should exist as an ingredient in the taxonomy. (is there a consistency check done somewhere?)
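
The requirement that every super ingredient exists as an entry could be checked along the following lines. This is a minimal sketch in Python; it assumes the taxonomy has already been loaded into a dictionary that maps each key to its list of parents, and the name ''missing_parents'' is made up for this example. It does not describe a check that OFF is known to run.

```python
def missing_parents(entries: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each entry key to the parents it names that are not defined as entries."""
    problems = {}
    for key, parents in entries.items():
        absent = [parent for parent in parents if parent not in entries]
        if absent:
            problems[key] = absent
    return problems


# Hypothetical mini taxonomy: "en:sauce" is used as a parent but never defined.
taxonomy = {
    "en:soya": [],
    "en:soy sauce": ["en:soya", "en:sauce"],
}

print(missing_parents(taxonomy))  # {'en:soy sauce': ['en:sauce']}
```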
nl:zonnebloemlecithine
es:lecitina de girasol


en:natural flavourings
==== • Adjectives ====
nl:natuurlijke aroma's
The ingredient can be made more specific through adjectives. As these adjectives occur for many ingredients and are essentially the same, they have been separated into their own taxonomies. Once they are in such a taxonomy, they no longer have to be listed in the ingredients taxonomy. The adjectives can be categorised as:
es:saborizantes naturales
* '''Processing''' adjectives describe what has been done with an ingredient, i.e. ''ground'', ''cooked'', etc. Either the processing itself is used as adjective or the result, for instance ''ground'' versus ''powdered''. The identified processing adjectives can be found in the  [[Processing Taxonomy]].
* '''Labels''' adjectives that follow the labels taxonomy, such as ''organic'', etc.


en:natural flavouring
==== • Origin list specification ====
nl:natuurlijk aroma
Producers sometimes try to shorten the length of ingredients lists by suppressing repetition. For instance the entry ''vegetable fats (palm, sunflower)'' should actually be read as two elements: ''palm vegetable fat'' and ''sunflower vegetable fat''. In this case the parentheses act as method to avoid repetition.
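
How such a shortened entry can be read as separate elements is sketched below in Python. The function name ''expand_origin_list'' is hypothetical, and the word order (origin first, then the shared head) only fits English. The same parentheses are also used for processing lists and compound ingredients, so a real parser cannot tell these cases apart from the syntax alone; the sketch simply assumes it is given an origin list.

```python
import re


def expand_origin_list(item: str) -> list[str]:
    """Expand 'head (a, b)' into ['a head', 'b head']; return other items unchanged."""
    match = re.fullmatch(r"\s*([^()]+?)\s*\(([^()]*)\)\s*", item)
    if not match:
        return [item.strip()]
    head, inner = match.groups()
    origins = [origin.strip() for origin in inner.split(",") if origin.strip()]
    # An empty pair of parentheses adds nothing, so fall back to the head alone.
    return [f"{origin} {head}" for origin in origins] or [head]


print(expand_origin_list("vegetable fats (palm, sunflower)"))
# ['palm vegetable fats', 'sunflower vegetable fats']
```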
es:saborizante natural


en:vitamines
==== • Processed list specification ====
de:Vitamine
The example Tomates (Dés, Jus, Purée) is also used to shorten an ingredient list. In this case the processing used is in the parentheses.
eo:vitaminoj
nl:vitamines
es:vitaminas


en:thiamin
==== • Compound ingredients ====
de:Thiamin, Aneurin
It is possible to have an ingredient that in reality consists of multiple other ingredients. For instance ''fr:jus de soja'' can also be specified as ''en:water'' plus ''en:soybean''. On ingredient lists this usually appears as ''jus de soja (eau, fèves de soja)'', i.e. the elements between parentheses define the real ingredients. The compound ingredients can also be seen as a (sub-)ingredient list. This is currently NOT encoded as super ingredient.
eo:tiamino
nl:thiamine
es:tiamina


en:riboflavin
==== • Made of specification ====
de:Riboflavin, Lactoflavin
The example ''crème fraîche (dont lait)'' designates that ''crème fraîche'' is produced from ''lait''. This is usually used to indicate if an ingredient contains an allergenic substance. This is currently NOT encoded as super ingredient.
eo:riboflavino
nl:riboflavine
es:riboflavina


en:carotene
==== Exceptions / Reality ====
de:Carotin
As the taxonomy is strongly tied to the url search function some exceptions have been implemented:
eo:karoteno
* '''en:xxxxx flavour''' - has only one parent: ''<en:flavour''. The parent ''<en:xxxxx'' is not set up, as it would mix the real ingredient ''en:xxxxx'' with artificial flavour ingredient.
nl:caroteen
es:caroteno


en:beta-carotene
=== Language prefix ===
de:Betacarotin
The language prefixes are based on the approach taken by [https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Languages Wikipedia]. In practice this implies that the language prefixes as used by Wikidata are used; this includes both the language and the associated script (if applicable). Sometimes the language prefix from the Wikipedia page is used. For two-letter prefixes the same prefix is used in Wikipedia and Wikidata. For three (and more) letter prefixes this is no longer true ([[language acronyms]]).
eo:beta-karoteno
nl:beta-caroteen
es:beta-caroteno


en:peanut
=== Application ===
de:Erdnuss
==== Inclusive search ====
eo:ternukso, arakido
The only application for the taxonomy at the moment is search by url. You could enter the url [https://world.openfoodfacts.org/ingredient/strawberry/ strawberry] to find all products that contain ''strawberry''. Or [https://world.openfoodfacts.org/ingredient/strawberry/language/finnish strawberry] to find all products that have a Finnish ingredient list with ''strawberry'' (puutarhamansikka). This is an inclusive search, which implies that all ingredients that are more specific are included as well in the results.
nl:pinda
es:cacahuate, cacahuete


en:peanuts
What is supported as search is determined by parents that are defined in the taxonomy. So ''strawberry flavour'' is not found when searching for ''strawberry''.
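
The effect of the parent links on inclusive search can be sketched as follows. The Python function ''descendants'' and the small parent table are assumptions made for this example; they are not taken from the OFF code or from the actual taxonomy file.

```python
def descendants(parents_of: dict[str, list[str]], root: str) -> set[str]:
    """Return root plus every ingredient that reaches root through parent links."""
    # Invert the child -> parents table into parent -> children.
    children_of: dict[str, list[str]] = {}
    for child, parents in parents_of.items():
        for parent in parents:
            children_of.setdefault(parent, []).append(child)
    found, stack = {root}, [root]
    while stack:
        for child in children_of.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


# Hypothetical excerpt: the flavour entry has only "en:flavour" as parent.
parents_of = {
    "en:strawberry": [],
    "en:strawberry puree": ["en:strawberry"],
    "en:strawberry juice": ["en:strawberry"],
    "en:flavour": [],
    "en:strawberry flavour": ["en:flavour"],
}

print(sorted(descendants(parents_of, "en:strawberry")))
```

Because ''en:strawberry flavour'' has no link to ''en:strawberry'' in this table, it is not part of the result.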
de:Erdnüsse
eo:ternuksoj, arakidoj
nl:pinda's
es:cacahuates, cacahuetes


en:soy
=== Practice ===
de:Soja
Some experiences have been gathered in building the OFF taxonomy. These can be divided into guiding principles and steps.
eo:sojo
nl:soja
es:soya, soja


en:rapeseed
==== Guiding principles ====
nl:raapzaad
* Ingredient list guides - the entries as found on ingredient lists are the basis and guide the taxonomy. So no entries from other places. (an exception is translation bootstrapping)
es:canola, colza
* '''Primary language''' - the first language of an ingredient should be English, if it is available in that language. Otherwise use the language where the ingredient occurs the most (at the moment that will probably be French);
* '''Singular''' - ingredients should be written in singular, i.e. ''apricot'' and not ''apricots''.
* Lower case
* '''No assumption''' - if in doubt about a translation or an assignment or whatever, keep the ingredient separate. Make no assumption about an ingredient. If it is specified vaguely, keep it vague;
* '''Or's''' - sometimes an ingredient is listed as ''X or Y''. This will be entered as <en:X and <en:Y, so there are two superclasses.


en:codeseed
==== Steps ====
nl:koolzaad
To build the taxonomy from raw data, the following steps need to be taken:
* '''Gather''' - the first step is gathering the ingredients from the ingredient lists. This results in a list of  ''raw ingredients''. It is possible to create a list with ingredients that are not yet in the taxonomy. (This process needs to be described @stephane does this);
* '''Assign''' - the raw ingredients in a language should be mapped onto the existing taxonomy. For each raw ingredient one has to check whether it is already in the taxonomy. If the raw ingredient is not yet in the taxonomy one has to check whether it can be assigned to an existing ingredient, either as synonym, or as new translation. The raw ingredient might be totally new or more specific than an existing ingredient (the ''new raw ingredients'');
* '''Merge''' - for each new raw ingredient it must be decided how it will fit in the existing taxonomy. Either it is totally new or it is a child of an existing ingredient;
* '''Key''' - if a new raw ingredient can be mapped to the taxonomy, it will probably be a new synonym of an existing entry. Ideally one should check which of the synonyms occurs the most and set that one as main ingredient name;
* '''Create''' - if the ingredient is totally new, a new entry can be created. It should be decided what the super ingredient is (if any). It can then be entered in that part of the taxonomy.
* '''Translate''' - try to find the translations of the ingredient, for example by searching for translations in OFF. For instance [https://world.openfoodfacts.org/ingredient/fr:polydextrose/languages fr:polydextrose] shows the languages available, and the translations offered there can be used;
* '''Define''' - add a definition from [http://wikipedia.org wikipedia] and [http://wikidata.org wikidata] if available.
* '''OFF''' - add a link to the corresponding ingredients page. For instance for [https://world.openfoodfacts.org/ingredient/fr:polydextrose/ fr:polydextrose].
* '''Occurrences''' - note how often the ingredient occurs, in how many languages and at what date that was determined. This might require an advanced search ([https://world.openfoodfacts.org/cgi/search.pl?action=process&tagtype_0=ingredients&tag_contains_0=contains&tag_0=polydextrose&sort_by=unique_scans_n&page_size=20&axis_x=energy&axis_y=products_n&action=display polydextrose]). This might allow us to track if changes are necessary;


en:wheat
==== Watch out ====
de:Weizen
It is possible to make mistakes in these steps.
eo:tritiko
* '''Wrong translation''' - the ingredient lists available in different languages on a product are not always translations of each other.
nl:tarwe
* '''Translation bootstrapping''' - in order to get up to speed with the translations, we use Wikipedia and Wikidata to find translations. This can result in wrong translations, as they can be something different from what is found on the package. So watch out for Wikipedia disambiguation strings, Latin species names, etc. In the long run these should be superseded by what is found on the products.
es:trigo
* '''Contaminated ingredients''' - a lot of products contain text that is not an actual ingredient. This is caused either by text recognition, which recognizes other elements as well, or by people who add text that is not an ingredient;


en:rye
== Setup ==
de:Roggen
Adding a new language for ingredients to OFF starts with the basics: adding a lot of translations. Without this, your language will not be recognised, handled and analysed. There are two ways to add these translations.
eo:sekalo
=== Adding translations ===
nl:rogge
The basic translation work can be done by anyone with a web browser, using the available web interface. For example, to translate the Dutch ingredients, use this link: [https://world-nl.openfoodfacts.org/ingredients?translate=add]. In this url ''world'' stands for all products in the database and ''nl'' stands for those with Dutch as a main language. Then go through the provided list of ingredients. Be careful, as you might easily introduce wrong translations, which are not actually used on ingredient lists.
es:centeno


en:barley
=== Editing translations ===
de:Gerste
If you have experience with Github, you can edit the necessary files as well on your desktop. The ingredients are split over multiple files, which allows for specific handling:
eo:hordeo
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/ingredients.txt Ingredients taxonomy] - this file contains all the normal ingredients;
nl:gerst
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/additives.txt Additives] - the file with all the additives;
es:cebada
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/allergens.txt Allergens] - the allergens required by legislation;
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/amino_acids.txt Amino-acids] - the amino acids, required for infant products;
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/minerals.txt Minerals] - the minerals required by legislation;
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/nucleotides.txt Nucleotides] - the nucleotides required by legislation;
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/other_nutritional_substances.txt Other nutritional substances] - also legislation;
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/vitamins.txt Vitamins] - legislation issue;
You will notice that many ingredients are combined with an adjective either in front of or after the ingredient. As many adjectives will occur often, it is not necessary to add all variants to the ingredient taxonomy. This will make the taxonomy smaller and makes processing faster. Several different types of adjectives are identified:
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/ingredients_processing.txt Processes] - describe what has been done to an ingredient (roasted, rehydrated, etc.);
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/labels.txt Labels] - describe the nature (?) of the ingredient, i.e. organic, fair trade, etc;
* [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/taxonomies/additives_classes.txt Additive classes] - each additive can be assigned to one or more additive classes, which describe the role of the additive in the product;


en:oats
== Maintenance ==
de:Hafer
Once the bulk of the ingredients have been added and translated, the maintenance phase begins. You will quickly see that the system identifies words and phrases as ingredients, when it should not do that. Time for finetuning and maintenance.
eo:aveno
nl:haver
es:avena


en:cocos
=== Detection ===
nl:kokos
Finetuning and maintenance start with detection: noticing that an ingredient is not detected or is falsely detected.
es:coco
* Unknowns - start by creating a list of unknown ingredients with the url: https://nl.openfoodfacts.org/ingredients?status=unknown . This will show all products sold in the Netherlands (nl), with unknown ingredients.


en:carmine
You might find unknown ingredients (prefixed with a language code) in other languages if the main language is not the requested language.
nl:karmijnzuur, natuurlijk karmijnzuur
es:carmín


en:aroma
=== Finetuning ===
nl:aroma
Finetuning allows you to tell the system where the ingredients start and end, which words should not be taken into account, etc. You might regularly return to this task, as the system analyses more and more ingredient lists.
es:aroma
* Adapt file [https://github.com/openfoodfacts/openfoodfacts-server/blob/master/lib/ProductOpener/Ingredients.pm ingredients.pm] - this file contains various elements, which are used to analyse the ingredients of a product (in the order they appear in the file):
hu:Aroma, Aromák, Aromak
** Traces (''%may_contain_regexps'') - is the start of a phrase that introduces traces;
** Allergens (''%contains_regexps'') - appears after an ingredient, to indicate an allergen;
** Abbreviations (''%abbreviations'') - commonly used abbreviations in ingredient lists;
** Or (''%of'') - an optional choice between ingredients, eg '''colza or sunflower oil''';
** And (''%and'') - two ingredients, not separated by a delimiter, eg '''herbs and spices''';
** And of (''%and_of'') -
** And/or (''%and_or'') -
** The (''%the'') - the [https://en.wikipedia.org/wiki/The grammatical article] of the language.
** Ignore after percent (''%ignore_strings_after_percent'') - a phrase that can be discarded when it appears after a %-sign;
** Ignore phrases (''%ignore_regexps'') - phrases that can be ignored (black list);
** Ingredients start (''%phrases_before_ingredients_list'') - this word is used as the beginning of the ingredient list, and helps OFF determine where the recognition should start.
** Ingredients start all caps (''%phrases_before_ingredients_list_uppercase'') - For example in [https://static.openfoodfacts.org/images/products/002/222/420/0820/ingredients_en.4.full.jpg this ingredient list], the list starts with '''INGREDIENTS''';
** After ingredients (''%phrases_after_ingredients_list'') - an ingredient list can end in many different ways. These can be listed here;
** Wrong dashes (''%prefixes_before_dash'') - this allows words separated by a dash to be combined, e.g. '''demi - écrémé''' will be changed to '''demi-écrémé'''.
You might require special handling, which in turn requires specific code. Please contact the OFF'ers on Slack for that (#ingredients).
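
As an illustration of the kind of rule listed above, the ''Wrong dashes'' clean-up could look roughly like this. The sketch is written in Python, whereas the real rules live in the Perl file ingredients.pm; the table ''PREFIXES_BEFORE_DASH'', its contents and the function name are assumptions made for this example and do not reproduce the actual implementation.

```python
import re

# Hypothetical per-language table, loosely modelled on the idea of %prefixes_before_dash.
PREFIXES_BEFORE_DASH = {"fr": ["demi", "saint"]}


def join_dashed_prefixes(text: str, lang: str) -> str:
    """Remove the spaces around a dash that follows a known prefix."""
    prefixes = PREFIXES_BEFORE_DASH.get(lang, [])
    if not prefixes:
        return text
    alternatives = "|".join(re.escape(prefix) for prefix in prefixes)
    # Known prefix, optional spaces, dash, optional spaces, first letter of the next word.
    pattern = rf"\b({alternatives})\s*-\s*(\w)"
    return re.sub(pattern, r"\1-\2", text, flags=re.IGNORECASE)


print(join_dashed_prefixes("lait demi - écrémé", "fr"))  # lait demi-écrémé
```

Dashes that do not follow a listed prefix are left alone, as they may be separating two ingredients.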


en:gelatin
=== Adding ===
de:Gelatine
- Finding unknown ingredients
eo:gelateno
- Adding processes
nl:gelatine
- Adding labels
es:gelatina
- Adding black listed ingredients
- Language transposes


en:wheat starch
== Related pages==
nl:tarwezetmeel
There is more documentation related to ingredients and the corresponding taxonomy:
es:almidón de trigo
* [[Ingredients Events list]]
[[Category:Taxonomies]]
[[Category:Ingredients]]

Latest revision as of 07:17, 23 October 2023

Introduction

Why?

Why do we need an ingredients ontology? The ontology describes how ingredients are derived from each other and how ingredients can be combined into new ingredients. The basic thesaurus can help to:

  • Normalise ingredients - Producers take a lot of freedom in describing the ingredients they use. They use different words (synonyms) to designate the same ingredient. An ontology helps to standardise the ingredients;
  • Exclusive search - the taxonomy can be used to support search for ingredients in any language and multiple synonyms;
  • Translation - as the taxonomy contains translations of each ingredient, it can be used as a means to translate ingredient lists;
  • Ingredients analysis - the taxonomy can be extended with properties to indicate if ingredients are suitable for vegans, vegetarians etc. and to estimate how processed the food product is (e.g. NOVA groups);
  • Ingredient language inconsistencies - the ingredient lists of a single product sometimes differ between languages, for instance because ingredients are left out or simplified. The thesaurus might help reveal these.

If the thesaurus is extended by relations between ingredients, other benefits arise, depending on what is defined in the taxonomy. The basic relation is the isa-relation or is-a-kind-of relation. The isa-relation defines two ingredients that are related. For instance the ingredient strawberry can also be found as strawberry puree and strawberry juice. Strawberry is more generic than strawberry puree, which is more specific. The rule here is: would it make sense to replace the children by the parent in the ingredients list? For instance strawberry puree and strawberry juice are still strawberry, but under a different shape. So if you replace strawberry puree with strawberry in the ingredient list, it is still a valid ingredient list. Usually strawberry is the parent and strawberry puree a child.

This isa-relation makes the following functionalities possible:

  • Inclusive source search - The inclusive search means that a search for the parent strawberry will also show the children strawberry puree, etc. The source search refers to the origin of the ingredient, i.e. strawberry, apple, cinnamon, soy, etc. (I want another word for source).
  • Inclusive condition search - in a similar way we could search for the condition of a source ingredient, for instance pureed, juiced, dried, reconditioned, etc. So if we would like to search for all pureed fruits, the relation between fruit puree and strawberry puree should be defined in the taxonomy.

One could also define other relations in order to support:

  • Hidden ingredients - an ingredient might contain hidden ingredients, the ontology might reveal these. For example butter contains butterfat.
  • Combined ingredients - an ingredient might appear as a single ingredient. In reality however
  • Processed ingredients - often an ingredient is derived from another ingredient through some process. We can make explicit what these processes are. For example, clarified butter is created from butter by separating the milk solids and water from the butterfat.
  • Ingredient incompleteness - often an ingredient is incompletely defined in an ingredient list. For instance if an ingredient list specifies milk, it should state which mammal the milk comes from, for instance cow's milk.

Taxonomy

The current OFF ingredients taxonomy (github) can be seen as a thesaurus (although more info is slowly being added).

In the context of information retrieval, a thesaurus (plural: "thesauri") is a form of controlled vocabulary that seeks to dictate semantic manifestations of metadata in the indexing of content objects. A thesaurus serves to minimise semantic ambiguity by ensuring uniformity and consistency in the storage and retrieval of the manifestations of content objects. ANSI/NISO Z39.19-2005 defines a content object as "any item that is to be described for inclusion in an information retrieval system, website, or other source of information". The thesaurus aids the assignment of preferred terms to convey semantic metadata associated with the content object. (wikipedia)

The purpose of the thesaurus is to have a list of ingredients, that occur in ingredient lists, that are unique and are well defined. The OFF ingredients taxonomy is mainly a list of elements that occur in ingredient lists. However the OFF Ingredients taxonomy is more than a simple thesaurus.

Example

Below is the ingredient class entry of soybean in the OFF ingredients taxonomy:

#description:en: SOYBEAN (Glycine max), or soya bean is a species of legume native to East Asia.
#comment:en: There is a bit of confusion whether "bean" should be part of the name, so "soy" or "soybeen"?.
<en:soya
en:soya bean, soy beans, soya beans, soybean
da:Sojabønne
de:Sojabohne, Sojabohnen, Sojakerne
es:granos de soja, habas de soja
et:Sojauba
fi:Soijapapu
fr:fève de soja, fèves de soja, soja entier, graine de soja, graines de soja, graines soja

Explanation

  • Comment-line - the # describes a comment line and can be used to add other information to an ingredient. This information is not used by OFF.
  • Description - the #description:en: line provides a definition of the ingredient in English. In this case the definition is taken from Wikipedia.
  • Comment - the #comment:en: line provides a comment in English about the entry.
  • Super-ingredient - the <en:soya line gives the super-ingredient of this (sub-)ingredient. The relationship between the super-ingredient and the sub-ingredient is an is-a-kind-of relationship: the sub-ingredient is a kind of super-ingredient. In this example the soybean ingredient is a kind of soy ingredient. The sub-ingredient provides more detail: the super-ingredient is very broad (something with soy), whereas the sub-ingredient adds the detail of the bean. This line is optional and is omitted if no super-ingredient can be specified.
  • Key - The en:soya bean line is the main name (key) of the ingredient. This doubles as the name of the ingredient (in this case in English (en:)).
  • Translations - The next lines provide translations of the main ingredient in other languages. One language per line. Each line starts with a language prefix. Thus de: means German.
  • Synonyms - an ingredient in a language can have multiple synonyms. These synonyms appear as secondary entries on a language line. The first entry is however the main name in that language.
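The line types described above can be illustrated with a small parser. This is only a minimal sketch in Python, not the code OFF itself uses; the class name TaxonomyEntry, its field names and the function parse_entry are assumptions made for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TaxonomyEntry:
    """One ingredient entry (hypothetical structure, for illustration only)."""
    comments: list = field(default_factory=list)  # text of lines starting with "#"
    parents: list = field(default_factory=list)   # super-ingredients ("<" lines)
    key: str = ""                                 # first "lang:name" line
    names: dict = field(default_factory=dict)     # lang -> [main name, synonyms...]


def parse_entry(text: str) -> TaxonomyEntry:
    entry = TaxonomyEntry()
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Comment, description and other "#" lines are only stored here.
            entry.comments.append(line[1:].strip())
        elif line.startswith("<"):
            entry.parents.append(line[1:].strip())
        else:
            lang, sep, value = line.partition(":")
            names = [name.strip() for name in value.split(",") if name.strip()]
            if not sep or not names:
                continue  # not a "lang:name, synonym, ..." line
            entry.names[lang] = names
            if not entry.key:
                # The first language line doubles as the key of the entry.
                entry.key = f"{lang}:{names[0]}"
    return entry


SOYBEAN = """\
#description:en: SOYBEAN (Glycine max), or soya bean is a species of legume native to East Asia.
<en:soya
en:soya bean, soy beans, soya beans, soybean
da:Sojabønne
de:Sojabohne, Sojabohnen, Sojakerne
"""

entry = parse_entry(SOYBEAN)
print(entry.key)          # en:soya bean
print(entry.parents)      # ['en:soya']
print(entry.names["de"])  # ['Sojabohne', 'Sojabohnen', 'Sojakerne']
```

The first name on each language line is kept at index 0, so the main name and the synonyms of a language can be told apart.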

Multiple super ingredients

An ingredient can have multiple super ingredients. This can be used to express the relation between ingredients. These super ingredients reflect the description of an ingredient, such as grapefruit juice. The ingredient grapefruit juice has two components: the noun juice and the adjective grapefruit. The noun can be seen as the processing state (purée, juice, flour, etc.) and the adjective as the subject of that processing. For a user both might be of interest: what kind of juice is it, and how was the grapefruit processed?

As the same super-ingredient will also appear in combination with other ingredients, this might help the user understand what other fruits there are and how they are processed and what other juices from fruits exist.

At the moment (2018-10-05) the taxonomy is not consistently structured. It sometimes uses the noun and sometimes the adjective as structuring principle.

These super ingredients and ingredients can also be found in ingredient lists, but there they are indicated by parentheses or colons. For example: Tomates (Dés, Jus, Purée), Yaourt (lait, crème, sucre, ferments lactiques), crème fraîche (dont lait) or Jus de : pomme, orange, passion, ananas, citron. These are four different approaches to super ingredients and define different relations between ingredients.

From a formal semantic viewpoint: juice is the superclass of grapefruit juice (grapefruit juice is-a-kind-of juice). In an "is-a-kind-of" relation one could replace the specific formulation grapefruit juice with the generic formulation juice, and still have a valid sentence.

In the taxonomy multiple super ingredients can be defined as:
<en:soya
<en:sauce
en:soy sauce
de:Sojasauce

Note that en:soya and en:sauce should exist as an ingredient in the taxonomy. (is there a consistency check done somewhere?)
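Such a consistency check is easy to express. The following is a minimal sketch in Python, under the assumption that the taxonomy has already been loaded into a dictionary that maps each entry key to its list of super ingredients; the function name find_missing_parents and the sample data are hypothetical, and this says nothing about whether OFF already performs such a check.

```python
def find_missing_parents(parents_of):
    """Return {entry key: [super ingredients that are not entries themselves]}.

    parents_of maps an entry key such as "en:soy sauce" to the list of its
    super ingredients (the "<" lines of the entry).
    """
    missing = {}
    for key, parents in parents_of.items():
        unknown = [parent for parent in parents if parent not in parents_of]
        if unknown:
            missing[key] = unknown
    return missing


# Hypothetical sample: en:sauce has not been defined as an entry.
taxonomy = {
    "en:soya": [],
    "en:soy sauce": ["en:soya", "en:sauce"],
}
print(find_missing_parents(taxonomy))  # {'en:soy sauce': ['en:sauce']}
```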

• Adjectives

The ingredient can be made more specific through adjectives. As these adjectives occur for many ingredients and are essentially the same, they have been separated into separate taxonomies. Once they are in such a taxonomy, they no longer have to be listed in the ingredients taxonomy. The adjectives can be categorised as:

  • Processing adjectives describe what has been done with an ingredient, i.e. ground, cooked, etc. Either the processing itself is used as adjective or the result, for instance ground versus powdered. The identified processing adjectives can be found in the Processing Taxonomy.
  • Labels adjectives that follow the labels taxonomy, such as organic, etc.

• Origin list specification

Producers sometimes try to shorten the length of ingredients lists by suppressing repetition. For instance the entry vegetable fats (palm, sunflower) should actually be read as two elements: palm vegetable fat and sunflower vegetable fat. In this case the parentheses act as method to avoid repetition.
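The expansion of such an origin list can be sketched as follows. This is an illustrative Python example only; the function name expand_origin_list is an assumption, the names are not singularised, and the sketch does not decide whether the parentheses hold origins, processes or sub-ingredients (that decision would need the taxonomy itself).

```python
import re

# "base (item, item, ...)" with a single pair of parentheses.
_GROUP = re.compile(r"^\s*(?P<base>[^()]+?)\s*\((?P<items>[^()]*)\)\s*$")


def expand_origin_list(text):
    """Expand "vegetable fats (palm, sunflower)" into one entry per origin."""
    match = _GROUP.match(text)
    if not match:
        return [text.strip()]  # nothing to expand
    base = match.group("base")
    items = [item.strip() for item in match.group("items").split(",") if item.strip()]
    # Prefix every origin to the shared base; keep the base if the list is empty.
    return [f"{item} {base}" for item in items] or [base]


print(expand_origin_list("vegetable fats (palm, sunflower)"))
# ['palm vegetable fats', 'sunflower vegetable fats']
```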

• Processed list specification

The example Tomates (Dés, Jus, Purée) is also used to shorten an ingredient list. In this case the processing used is in the parentheses.

• Compound ingredients

It is possible to have an ingredient that in reality consists of multiple other ingredients. For instance fr:jus de soja can also be specified as en:water plus en:soybean. On ingredient lists this usually appears as jus de soja (eau, fèves de soja), i.e. the elements between parentheses define the real ingredients. The compound ingredients can also be seen as a (sub-)ingredient list. This is currently NOT encoded as super ingredient.

• Made of specification

The example crème fraîche (dont lait) designates that crème fraîche is produced from lait. This is usually used to indicate if an ingredient contains an allergenic substance. This is currently NOT encoded as super ingredient.

Exceptions / Reality

As the taxonomy is strongly tied to the url search function some exceptions have been implemented:

  • en:xxxxx flavour - has only one parent: <en:flavour. The parent <en:xxxxx is not set up, as it would mix the real ingredient en:xxxxx with artificial flavour ingredient.

Language prefix

The language prefixes are based on the approach taken by Wikipedia. In practice this implies that the language prefixes as used by Wikidata are used; this includes both the language and the associated script (if applicable). Sometimes the language prefix from the Wikipedia page is used. For two-letter prefixes the same prefix is used in Wikipedia and Wikidata. For three (and more) letter prefixes this is no longer true (language acronyms).

Application

Inclusive search

The only application for the taxonomy at the moment is search by url. You could enter the url strawberry to find all products that contain strawberry, or strawberry to find all products that have a Finnish ingredient list with strawberry (puutarhamansikka). This is an inclusive search, which implies that all ingredients that are more specific are included in the results as well.

What is supported as search is determined by parents that are defined in the taxonomy. So strawberry flavour is not found when searching for strawberry.
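How the defined parents determine the search result can be sketched in Python. This is a simplified illustration, not the actual OFF search code; the function names descendants and inclusive_search and the sample data are assumptions.

```python
def descendants(parents_of, root):
    """Return root plus every entry that has root as a direct or indirect parent."""
    children_of = {}
    for key, parents in parents_of.items():
        for parent in parents:
            children_of.setdefault(parent, []).append(key)
    found, stack = {root}, [root]
    while stack:
        for child in children_of.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def inclusive_search(products, parents_of, ingredient):
    """Names of products containing the ingredient or any more specific ingredient."""
    wanted = descendants(parents_of, ingredient)
    return sorted(name for name, ingredients in products.items()
                  if wanted & set(ingredients))


# Hypothetical sample data: strawberry flavour only has flavour as parent.
parents_of = {
    "en:strawberry": [],
    "en:strawberry puree": ["en:strawberry"],
    "en:flavour": [],
    "en:strawberry flavour": ["en:flavour"],
}
products = {
    "jam": ["en:strawberry puree", "en:sugar"],
    "candy": ["en:strawberry flavour"],
    "fruit": ["en:strawberry"],
}
print(inclusive_search(products, parents_of, "en:strawberry"))  # ['fruit', 'jam']
```

The candy is not found, because en:strawberry flavour is not defined as a child of en:strawberry.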

Practice

Some experiences have been gathered in building the OFF taxonomy. These can be divided into guiding principles and steps.

Guiding principles

  • Ingredient list guides - the entries as found on ingredient lists are the basis and guide the taxonomy. So no entries from other places. (an exception is translation bootstrapping)
  • Primary language - the first language of an ingredient should be English, if the ingredient is available in that language. Otherwise use the language where the ingredient occurs the most (at the moment that will probably be French);
  • Singular - ingredients should be written in singular, i.e. apricot and not apricots.
  • Lower case
  • No assumption - if in doubt about a translation or an assignment or whatever, keep the ingredient separate. Make no assumption about an ingredient. If it is specified vaguely, keep it vague;
  • Or's - sometimes an ingredient is listed as X or Y. This will be entered as <en:X and <en:Y, so there are two superclasses.

Steps

To build the taxonomy from raw data, the following steps need to be taken:

  • Gather - the first step is gathering the ingredients from the ingredient lists. This results in a list of raw ingredients. It is possible to create a list with ingredients that are not yet in the taxonomy. (This process needs to be described @stephane does this);
  • Assign - the raw ingredients in a language should be mapped onto the existing taxonomy. For each raw ingredient one has to check whether it is already in the taxonomy. If the raw ingredient is not yet in the taxonomy, one has to check whether it can be assigned to an existing ingredient, either as a synonym or as a new translation. The raw ingredient might also be totally new or more specific than an existing ingredient (the new raw ingredients);
  • Merge - for each new raw ingredient it must be decided how it will fit in the existing taxonomy. Either it is totally new or it is a child of an existing ingredient;
  • Key - if a new raw ingredient can be mapped to the taxonomy, it will probably be a new synonym of an existing entry. Ideally one should check which of the synonyms occurs the most and set that one as main ingredient name;
  • Create - if the ingredient is totally new, a new entry can be created. It should be decided what the super ingredient is (if any). It can then be entered in that part of the taxonomy.
  • Translate - try to find the translations of the ingredient, for example by searching for translations in OFF. For instance fr:polydextrose shows the languages available, and you can try to use the translations offered;
  • Define - add a definition from wikipedia and wikidata if available.
  • OFF - add a link to the corresponding ingredients page. For instance for fr:polydextrose.
  • Occurrences - note how often the ingredient occurs, in how many languages and at what date that was determined. This might require an advanced search (polydextrose). This might allow us to track if changes are necessary;
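The Assign step above can be sketched in a few lines of Python. This is only an illustration under the assumption that the known names per language are available as a dictionary; the function names build_lookup and assign and the sample data are hypothetical.

```python
def build_lookup(names_by_key):
    """Map (language, lower-cased name) to the key of the taxonomy entry."""
    lookup = {}
    for key, names_by_lang in names_by_key.items():
        for lang, names in names_by_lang.items():
            for name in names:
                lookup[(lang, name.lower())] = key
    return lookup


def assign(raw_ingredients, lang, lookup):
    """Split raw ingredients into known ones (with their key) and unknown ones."""
    known, unknown = {}, []
    for raw in raw_ingredients:
        normalised = raw.strip().lower()
        key = lookup.get((lang, normalised))
        if key is None:
            unknown.append(normalised)  # candidate for the Merge or Create step
        else:
            known[normalised] = key
    return known, unknown


# Hypothetical sample taxonomy with a single entry.
names_by_key = {
    "en:soya bean": {
        "en": ["soya bean", "soybean"],
        "de": ["Sojabohne", "Sojabohnen"],
    },
}
lookup = build_lookup(names_by_key)
known, unknown = assign(["Sojabohnen", "Polydextrose "], "de", lookup)
print(known)    # {'sojabohnen': 'en:soya bean'}
print(unknown)  # ['polydextrose']
```

The unknown ingredients still have to be judged by hand: each one may be a new synonym, a new translation, a child of an existing ingredient, or not an ingredient at all.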

Watch out

It is possible to make mistakes in these steps.

  • Wrong translation - the ingredient lists in different languages available on products do not always offer a translation.
  • Translation bootstrapping - in order to get up to speed with the translations, we use Wikipedia and Wikidata to find translations. This can result in wrong translations, as they can be something different than what is found on the package. So watch out for Wikipedia disambiguation strings, Latin species names, etc. In the long run these should be superseded by what is found on the products.
  • Contaminated ingredients - a lot of products contain text that is not an actual ingredient. Either this is caused by text recognition, which recognizes other elements as well, or people add text that is not an ingredient;

Setup

Adding a new language for ingredients to OFF starts with the basics: adding a lot of translations. Without this, your language will not be recognised, handled and analysed. There are two ways to add these translations.

Adding translations

The basic translation work can be done by anyone with a web browser. You can use the available web interface for this. For example to translate the Dutch ingredients, use this link: [1]. In this url world stands for all products in the database and nl stands for those with Dutch as a main language. Then go through the provided list of ingredients. Be careful, as you might easily introduce wrong translations, which are not actually used on ingredient lists.

Editing translations

If you have experience with Github, you can edit the necessary files as well on your desktop. The ingredients are split over multiple files, which allows for specific handling:

You will notice that many ingredients are combined with an adjective either in front of or after the ingredient. As many adjectives occur often, it is not necessary to add all variants to the ingredient taxonomy. This keeps the taxonomy smaller and makes processing faster. Several different types of adjectives are identified:

  • Processes - describe what has been done to an ingredient (roasted, rehydrated, etc.);
  • Labels - describe the nature (?) of the ingredient, i.e. organic, fair trade, etc;
  • Additive classes - each additive can be assigned to one or more additive class, which describes the role of the additive in the product;

Maintenance

Once the bulk of the ingredients have been added and translated, the maintenance phase begins. You will quickly see that the system identifies words and phrases as ingredients, when it should not do that. Time for finetuning and maintenance.

Detection

Finetuning and maintenance start with detection: noticing that an ingredient is not detected or is falsely detected.

You might find unknown ingredients (prefixed with a language code) in other languages if the main language is not the requested language.

Finetuning

Finetuning allows you to tell the system where the ingredients start and end, which words should not be taken into account, etc. You might regularly return to this task, as the system analyses more and more ingredient lists.

  • Adapt file ingredients.pm - this file contains various elements, which are used to analyse the ingredients of a product (in the order they appear in the file):
    • Traces (%may_contain_regexps) - is the start of a phrase that introduces traces;
    • Allergens (%contains_regexps) - appears after an ingredient, to indicate an allergen;
    • Abbreviations (%abbreviations) - commonly used abbreviations in ingredient lists;
    • Or (%of) - an optional choice between ingredients, eg colza or sunflower oil;
    • And (%and) - two ingredients, not separated by a delimiter, eg herbs and spices;
    • And of (%and_of) -
    • And/or (%and_or) -
    • The (%the) - the grammatical article of the language.
    • Ignore after percent (%ignore_strings_after_percent) - a phrase that can be discarded when it appears after a %-sign;
    • Ignore phrases (%ignore_regexps) - phrases that can be ignored (black list);
    • Ingredients start (%phrases_before_ingredients_list) - this word marks the beginning of the ingredient list, and helps OFF determine where the recognition should start.
    • Ingredients start all caps (%phrases_before_ingredients_list_uppercase) - For example in this ingredient list, the list starts with INGREDIENTS;
    • After ingredients (%phrases_after_ingredients_list) - an ingredient list can end in many different ways. These can be listed here;
    • Wrong dashes (%prefixes_before_dash) - this allows words separated by a dash to be combined, eg demi - écrémé will be changed to demi-écrémé.
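To give an idea of what two of these elements do, here is a rough sketch in Python. It is not the code from ingredients.pm (which is written in Perl); the two lists only mirror the idea behind %prefixes_before_dash and %phrases_before_ingredients_list, and their contents as well as the function names fix_dashes and strip_list_intro are assumptions.

```python
import re

# Hypothetical sample values, not copied from ingredients.pm.
PREFIXES_BEFORE_DASH = ["demi", "semi"]
PHRASES_BEFORE_LIST = ["ingrédients", "ingredients"]


def fix_dashes(text):
    """Join a known prefix and the next word when the dash has spaces around it."""
    prefixes = "|".join(re.escape(prefix) for prefix in PREFIXES_BEFORE_DASH)
    return re.sub(rf"\b({prefixes})\s+-\s+(\w)", r"\1-\2", text,
                  flags=re.IGNORECASE)


def strip_list_intro(text):
    """Drop a leading phrase such as "INGREDIENTS:" so parsing starts at the list."""
    phrases = "|".join(re.escape(phrase) for phrase in PHRASES_BEFORE_LIST)
    return re.sub(rf"^\s*(?:{phrases})\s*:\s*", "", text, flags=re.IGNORECASE)


print(fix_dashes("lait demi - écrémé, sucre"))       # lait demi-écrémé, sucre
print(strip_list_intro("INGREDIENTS: water, sugar"))  # water, sugar
```

Only prefixes in the list are joined, so a dash that is used as a separator between two ingredients is left alone.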

You might require special handling, which in turn requires specific code. Please contact the OFF'ers on Slack for that (#ingredients).

Adding

- Finding unknown ingredients
- Adding processes
- Adding labels
- Adding black listed ingredients
- Language transposes

Related pages

There is more documentation related to ingredients and the corresponding taxonomy: